I built a little tool I am calling Repo Explorer. This post describes why I built it, and what I learnt using it.
(The code is on GitHub; to say that it is at the alpha stage would be charitable.)
Git repositories have always fascinated me:
- the underlying architecture is fascinating. There are products that use git as their object model for out-of-the-box collaboration and non-destructive editing.
- it is how developers (and teams, organizations, and communities) work together. Much of our present-day technological advancement relies on git at a very fundamental level.
Also, git is a “repository” of information that goes beyond the code it hosts. If you are looking for a rich source of information to mine, a git repository has a lot to offer. It is essentially temporal information wrapped around a social graph, with objects (the written code in the files) that contain strict semantics at the core.
With that in mind, I began exploring using a little tool that I built: The Repo Explorer.
It’s currently at a nascent stage of development, but still a lot of fun to use.
- I chose Python mostly for the pygit2 binding for libgit2, which appears to be mature and well-documented.
- I ended up shelling out quite a bit, since libgit2 did not seem to do what I wanted, especially for tracking deleted files properly.
- Flask for routing and templates.
- D3. It is the most obvious choice, but I often wonder if one really needs it for most visualizations. For simple visualizations, d3-scale is probably what’s most useful. For more complex applications, I end up fighting the library for control, and wishing that I had written custom code.
- Performance with DOM is terrible for heavy repositories with more than 20K commits, like swift’s. Rendering with Canvas is my next task.
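To make the "shelling out" point concrete, here is a minimal sketch of listing deleted files per commit by calling the git command line from Python. The exact commands Repo Explorer runs are not described in this post, so the function names and the choice of `--diff-filter=D` with a NUL-prefixed hash format are assumptions made for illustration.

```python
import subprocess
from typing import Dict, List


def parse_deleted_files(log_output: str) -> Dict[str, List[str]]:
    """Parse output of the git log call below into {commit_hash: [deleted paths]}.

    Each commit record starts with a NUL byte followed by the hash on its own
    line; the remaining non-empty lines are the paths deleted in that commit.
    """
    deletions: Dict[str, List[str]] = {}
    for chunk in log_output.split("\x00"):
        lines = [line for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue
        commit_hash, *paths = lines
        deletions[commit_hash] = paths
    return deletions


def deleted_files(repo_path: str) -> Dict[str, List[str]]:
    """Hypothetical helper: shell out to git and collect deletions per commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--diff-filter=D",
         "--name-only", "--pretty=format:%x00%H"],
        capture_output=True, text=True, check=True,
    )
    return parse_deleted_files(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it easy to test against canned output without needing a repository on disk.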
The Set Up
You can select one of the imported repositories (I imported a couple dozen interesting ones, and will keep adding more). Then you can view the repository's commits as a timeline, by the time of day at which the authors work, or by the days of the week on which they work.
You can also view the repository as commits for the most active files. This is not yet very fleshed out, but probably has the most potential for meaningful exploration.
This is the simplest view, and also the most interesting. Each row represents an author’s commits. To conserve vertical space, only the top committers are shown individually, and the tail end is grouped as a single “author”’s timeline at the bottom.
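The row grouping described above could be sketched roughly as follows. The cutoff of ten authors, the "Others" label, and the function name are all assumptions; the post does not say how many committers the tool shows individually.

```python
from collections import Counter
from typing import Dict, Iterable


def assign_rows(commit_authors: Iterable[str], top_n: int = 10,
                other_label: str = "Others") -> Dict[str, str]:
    """Map each author to a timeline row.

    The top_n authors by commit count each get their own row; everyone else
    shares a single row named other_label.
    """
    counts = Counter(commit_authors)
    top = {name for name, _ in counts.most_common(top_n)}
    return {name: (name if name in top else other_label) for name in counts}
```

For example, with `top_n=2`, two prolific authors keep their own rows and every occasional contributor lands on the shared row at the bottom.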
A commit is shown with insertions at the top in blue bars and deletions at the bottom in red:
Magnitude is represented by the height of the bar, and also with saturation: after the maximum height is reached, the saturation increases.
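One way to express this encoding is a function that grows the bar linearly up to a cap and then raises saturation for anything larger. This is only a sketch: the linear scale and every threshold below (500 lines, 40 pixels, 0.4 base saturation, 5000 lines for full saturation) are made-up values, not the ones the tool uses.

```python
from typing import Tuple


def bar_style(lines_changed: int,
              lines_at_max_height: int = 500,
              max_height: float = 40.0,
              base_saturation: float = 0.4,
              lines_at_full_saturation: int = 5000) -> Tuple[float, float]:
    """Return (height, saturation) for a commit's insertions or deletions."""
    if lines_changed <= lines_at_max_height:
        # Below the cap: height carries the magnitude, saturation stays flat.
        height = max_height * lines_changed / lines_at_max_height
        return height, base_saturation
    # Above the cap: height is pinned, saturation carries the extra magnitude.
    overflow = min(lines_changed, lines_at_full_saturation) - lines_at_max_height
    span = lines_at_full_saturation - lines_at_max_height
    saturation = base_saturation + (1.0 - base_saturation) * overflow / span
    return max_height, saturation
```

The same function can be applied separately to insertions (drawn upward in blue) and deletions (drawn downward in red).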
(Magnitude is not always representative of “busyness”, but may be a decent measure of impact.)
Initial work is usually intense:
Commits are substantive in size and number. After stabilization, smaller and fewer commits would be expected; even if a lot of features are being added, churn will be smaller. This is not always the case, especially with language repositories like Swift and Kotlin that are seeing very active development.
How active are the founders?
More often than not, the founders leave the project at some point, or at least become very “minimally engaged”.
Quite often they leave early:
The exceptions are D3 (this is Mike Bostock’s PhD work after all), Clojure (this is Rich Hickey’s life’s work, but he is not so regular anymore), Flask (Armin keeps checking in), openFrameworks (Arturo seems committed to oF’s community and not just the code), and React (Facebook-subsidized).
The story is surprisingly similar in startups: the technical co-founder works incredibly hard to make things work. Here, for one of the startup repositories that I imported, there is an on-off activity stream in the beginning. Then comes some intense work, probably when they went full-time on the project. No one comes close to matching the intensity later (although some of this could be attributed to the churn of adding dependencies etc., the hourly and weekly rhythm visualizations shown later support the intensity angle).
In the startup timeline above, you can see that multiple team members contribute at any given time, almost equally.
Swift, shown above, and Spring Framework, below, show multiple strong developers.
Vagrant and NPM have had at least one strong contributor after the founder stepped out.
nginx sees a very clear handoff from Igor to multiple developers.
Other Timeline Observations
- Flask and Sinatra have required sparse development over the years; microframeworks indeed!
- Express is similar but only after TJ Holowaychuk’s intense work.
- Clojure and nginx are fairly active, but also see stabilization.
Time of Day
This gives more of a micro view into how authors work. It’s quite likely that you would get a random commit at 2AM once in a couple of years of working in an open source library. Maybe you couldn’t sleep. Or maybe your computer clock is off and it is actually 2PM where you are?
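The clock question matters for how the hours get bucketed: a commit records a timestamp plus the author's UTC offset, so the hour should be computed in the author's recorded zone rather than the viewer's. A minimal sketch, assuming each commit is available as an (epoch seconds, offset in minutes) pair such as pygit2's `commit_time` and `commit_time_offset`; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def local_hour(epoch_seconds: int, utc_offset_minutes: int) -> int:
    """Hour of day (0-23) in the zone recorded on the commit."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    return datetime.fromtimestamp(epoch_seconds, tz).hour


def hour_histogram(commits: Iterable[Tuple[int, int]]) -> List[int]:
    """Count commits per local hour; index 0 is midnight to 1AM."""
    counts = Counter(local_hour(t, offset) for t, offset in commits)
    return [counts.get(hour, 0) for hour in range(24)]
```

A wrong system clock or a misconfigured timezone on the author's machine will still skew the result, which is one reason an occasional 2AM commit should not be read too literally.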
Some people have hard cut off times though, like Robert in Carthage:
while some people work around the clock, like Rebecca in npm:
You can tell a startup’s cofounder by how much they end up working around the clock, while the people who come later fall into a 9-5ish pattern.
React is the most 9-5ish open source repository, probably because the Facebook employees treat it as a job?
I would have expected Swift’s commits to fall under a similar pattern, but they are intensely around-the-clock (a passion project for all involved?):
Although most founders work around the clock even in open source repositories, Mike Bostock is an exception:
Day of the Week
Another micro view of how authors work, arranged radially like this:
Open source libraries often see around-the-week commits:
while in a startup, the technical cofounder works around the week while others usually skip the weekend.
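The weekend pattern can also be put into a number: the share of each author's commits that land on a Saturday or Sunday in their own recorded timezone. This is a sketch under the same assumption as before, that each commit carries an epoch timestamp and a UTC offset in minutes; the tuple layout and function names are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Tuple

WEEKEND = {5, 6}  # datetime.weekday(): Monday is 0, so Saturday is 5, Sunday is 6


def local_weekday(epoch_seconds: int, utc_offset_minutes: int) -> int:
    """Day of week (Monday=0) in the zone recorded on the commit."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    return datetime.fromtimestamp(epoch_seconds, tz).weekday()


def weekend_share(commits: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """commits: (author, epoch_seconds, utc_offset_minutes) triples.

    Returns the fraction of each author's commits made on a weekend.
    """
    total: Counter = Counter()
    weekend: Counter = Counter()
    for author, epoch, offset in commits:
        total[author] += 1
        if local_weekday(epoch, offset) in WEEKEND:
            weekend[author] += 1
    return {author: weekend[author] / total[author] for author in total}
```

An author who commits evenly across all seven days would score about 2/7, so values well above that suggest weekend-heavy work.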
Chris Lattner may have worked more on the weekends:
The files view shows the top files that are included in commits. There is a lot more that the tool can provide for exploration, but here are some things I observed.
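A ranking like this can be approximated by counting how often each path appears across commits. The sketch below assumes the input is the text produced by `git log --name-only --pretty=format:` (file names only, with blank lines between commits); whether the tool derives its ranking this way is not stated in this post, and `top_files` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def top_files(name_only_output: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequently committed paths with their commit counts.

    Expects one path per line; blank lines (commit separators) are ignored.
    """
    counts = Counter(
        line.strip() for line in name_only_output.splitlines() if line.strip()
    )
    return counts.most_common(n)
```

Note that this counts a renamed file under each of its names separately, so long-lived files that moved around will be under-represented.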
clojure/core.clj appears consistently, as it provides the public interface.
clojure/spec.clj has begun showing up since the last beta, and underscores the importance of clojure.spec to the language.
package.json, with their tendencies to rely on dependencies.
Flask’s top committed files include helpers.py, which might mean a less organized codebase.
Swift sees a lot of commits for its type-checker-related files, perhaps pointing to the authors’ penchant for type safety and also to how hard a problem type inference etc. is.
CSDiag.cpp, which is billed as implementing the “diagnostics for the type checker”, sees heavy development in summer 2015 with Swift 2.0’s public release. Maybe supporting protocol extensions was the culprit:
More to do
There is a lot to be done; next up are performance improvements, more sophisticated file visualizations, and keeping the information up to date with the repository rather than having it come from a one-time import.
Thanks to Jim Vallandingham for feedback and guidance.