This page will give some more information on my entry for the GitHub Data Challenge 2014, which can be found here.
To achieve this, I created three visualizations, which give the user an overall view, a per-user view, and a per-language-combination view, respectively. I will briefly discuss the data acquisition process first, and then give a short discussion of each visualization.
The data acquisition process was vital for this project. The goal was to be able to indicate, per user, whether or not he/she knew a particular programming language. To keep the number of languages somewhat limited, I decided to look at only the top 20 languages (based on this list). GitHub suggested several ways to acquire the data that could be used to create the entries for this competition. I very quickly found, however, that with approximately 3 million users, the only feasible solution was Google BigQuery. The GitHub API is (sensibly) limited to 5000 requests per hour, so even at a single request per user, fetching everything would mean approximately 600 hours of full-speed querying. Google BigQuery, by contrast, offered me the capability to do near real-time querying on the GitHub data.
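The 600-hour figure follows directly from the two numbers above. A minimal sketch of the arithmetic, assuming exactly one API request per user (the function name and the `requests_per_user` parameter are illustrative, not part of the original project):

```python
def hours_to_crawl(num_users, requests_per_hour, requests_per_user=1):
    """Hours needed to query every user at a fixed hourly request limit."""
    total_requests = num_users * requests_per_user
    return total_requests / requests_per_hour


# Roughly 3 million users at 5000 requests per hour.
hours = hours_to_crawl(3_000_000, 5_000)
days = hours / 24
print(hours, days)
```

Any analysis that needs more than one request per user (for example, listing repositories and then inspecting each one) multiplies this estimate accordingly.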
The data I wanted needed to indicate, per user, whether or not he/she can 'speak' a given programming language. An optimal solution would have been to analyze every commit made by a user. However, this would have led to a huge computational load, as the commit object in the GitHub API does not currently contain the language breakdown when a list of commits is fetched (https://developer.github.com/v3/repos/commits/#list-commits-on-a-repository). Because of this, I very quickly realized it would not be possible to analyze each user's full language skills. However, for each repository attributed to a user, it was easy to obtain the language of the repository. I used the following BigQuery query to fetch the data (to give some idea of the amount of data processing needed, even Google BigQuery took over a minute to execute it!):
SELECT repository_owner, repository_language USERNAME, LANGUAGE
GROUP BY repository_owner, repository_language
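The result of such a grouped query is a list of (owner, language) rows. A minimal sketch of turning those rows into a per-user set of languages, restricted to a tracked list, might look like this. The function name, the sample rows, and the three-language list are hypothetical; the real project tracked 20 languages.

```python
from collections import defaultdict


def languages_per_user(rows, tracked_languages):
    """Group (owner, language) rows into {owner: set of tracked languages}."""
    tracked = set(tracked_languages)
    result = defaultdict(set)
    for owner, language in rows:
        # Rows with a missing (None) or untracked language are skipped.
        if language in tracked:
            result[owner].add(language)
    return dict(result)


# Hypothetical sample rows, including a duplicate and a missing language.
rows = [
    ("alice", "Python"),
    ("alice", "Go"),
    ("bob", "Python"),
    ("bob", None),
    ("alice", "Python"),
]
skills = languages_per_user(rows, ["Python", "Go", "Java"])
print(skills)
```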
Visualization 1 is a chord diagram, which shows the relationship between every combination of two programming languages. The data was computed by forming all possible pairs from the list of 20 languages I analyzed (190 pairs) and counting, for each pair, the number of users that speak both languages. This gives a good idea of which languages are spoken most, but also which languages are 'spoken' quite a lot, but not in combination. It gives a different perspective on the user-language landscape on GitHub.
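The pair counting described above can be sketched as follows. Chord layouts typically take a square matrix as input, so the sketch builds a symmetric matrix where entry (i, j) is the number of users who speak both language i and language j. The function names and the sample data are hypothetical, and this is not necessarily how the original entry computed its data.

```python
from collections import Counter
from itertools import combinations


def pair_counts(user_languages):
    """Count, for each unordered language pair, the users who speak both."""
    counts = Counter()
    for languages in user_languages.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(languages), 2):
            counts[(a, b)] += 1
    return counts


def chord_matrix(user_languages, languages):
    """Build a symmetric co-occurrence matrix ordered like `languages`."""
    index = {lang: i for i, lang in enumerate(languages)}
    size = len(languages)
    matrix = [[0] * size for _ in range(size)]
    for (a, b), count in pair_counts(user_languages).items():
        if a in index and b in index:
            i, j = index[a], index[b]
            matrix[i][j] = count
            matrix[j][i] = count
    return matrix


# Hypothetical sample data.
user_languages = {
    "alice": {"Python", "Go"},
    "bob": {"Python", "Go", "Java"},
    "carol": {"Java"},
}
matrix = chord_matrix(user_languages, ["Go", "Java", "Python"])
print(matrix)
```

The diagonal stays at zero here; a language that is popular but rarely combined with others shows up as a row with small values.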
Visualization 2 makes direct use of the structure of the MySQL database I described in the section above. It allows you to search for a particular username and find out which languages this user speaks. While not very revolutionary, it is a very natural and logical way to query the data I obtained.
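A minimal sketch of such a per-user lookup, assuming a table with one row per (username, language) pair. The table name `user_languages`, the column names, and the sample rows are assumptions, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory table with one row per (username, language)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_languages ("
        "username TEXT, language TEXT, "
        "PRIMARY KEY (username, language))"
    )
    # INSERT OR IGNORE (SQLite syntax) drops duplicate pairs.
    conn.executemany("INSERT OR IGNORE INTO user_languages VALUES (?, ?)", rows)
    return conn


def languages_of(conn, username):
    """Return the languages recorded for one user, alphabetically."""
    cursor = conn.execute(
        "SELECT language FROM user_languages "
        "WHERE username = ? ORDER BY language",
        (username,),
    )
    return [row[0] for row in cursor]


# Hypothetical sample rows.
conn = build_db([("alice", "Python"), ("alice", "Go"), ("bob", "Python")])
print(languages_of(conn, "alice"))
```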
Visualization 3 is the exact inverse of the second visualization. It lets you find users that speak a given combination of languages. This may be useful if you need a specific skillset for a project and are looking for someone to help you out.
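One common way to express "users who speak all of these languages" in SQL is to filter on the wanted languages, group by user, and keep only the groups that contain every wanted language. The sketch below assumes a hypothetical `user_languages` table with `username` and `language` columns and uses Python's built-in `sqlite3` as a stand-in for MySQL; it is an illustration, not the original implementation.

```python
import sqlite3


def users_speaking_all(conn, languages):
    """Return users who have a row for every language in `languages`."""
    wanted = sorted(set(languages))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    query = (
        "SELECT username FROM user_languages "
        f"WHERE language IN ({placeholders}) "
        "GROUP BY username "
        # A user matches only if all wanted languages are present.
        "HAVING COUNT(DISTINCT language) = ? "
        "ORDER BY username"
    )
    return [row[0] for row in conn.execute(query, (*wanted, len(wanted)))]


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_languages (username TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO user_languages VALUES (?, ?)",
    [
        ("alice", "Go"),
        ("alice", "Python"),
        ("bob", "Python"),
        ("carol", "Go"),
        ("carol", "Java"),
        ("carol", "Python"),
    ],
)
print(users_speaking_all(conn, ["Python", "Go"]))
```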
Like any project, this one has a number of flaws. The main one is naturally the fact that all statistics are based on the dominant language of every repository. This means that if a user has contributed actively to a project in language A, but has no repos of their own in this language, we will think that this user does not speak language A. This is a serious flaw, which sadly I was unable to solve in time for the entry deadline of the GitHub Data Challenge. It should be noted, however, that if a user is passionate about or proficient in a particular language, one would expect this user to have repositories where this language is the dominant language.
Apart from missing the languages of the commits, it should also be noted that if a repository contains multiple languages, this system only takes the dominant one into account. Similarly to the interface GitHub itself provides, a repository is marked as a 'Language A' repository based on the dominant language. While this gives a good high-level overview, it is somewhat detrimental to overall accuracy.
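To illustrate what is lost, here is a minimal sketch that assumes per-language byte counts for a repository, of the kind GitHub's repository languages endpoint reports. The function names and the sample numbers are hypothetical; the project itself only had the single dominant language per repository.

```python
def dominant_language(byte_counts):
    """Return the language with the most bytes, or None for an empty repo."""
    if not byte_counts:
        return None
    return max(byte_counts, key=byte_counts.get)


def language_shares(byte_counts):
    """Return each language's fraction of the repository's total bytes."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in byte_counts.items()}


# Hypothetical repository that is almost evenly split between two languages.
repo = {"JavaScript": 52_000, "Python": 48_000}
print(dominant_language(repo))
print(language_shares(repo))
```

In this example the repository counts only as a JavaScript repository, even though nearly half of its code is Python.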
Another thing to keep in mind is that this overview is a snapshot view, which very much depends on the exact time at which this snapshot is taken. In the future perhaps it would be possible to use real-time data, but for now this was outside the scope of this project.
Any feedback on my work is more than welcome. I really enjoy discussing development work with others, and I would love to hear your ideas about my work, or suggestions/criticism :). Hit me up on Twitter or through my website.