Backbeat


Repo Explorer

by Raheel Ahmad

I built a little tool I am calling Repo Explorer. This post describes why I built it, and what I learnt using it.

(The code is on GitHub; to say that it is at the alpha stage would be charitable.)

Motivation

Git repositories have always fascinated me:

Also, git is a “repository” of information that goes beyond the code it hosts. If you are looking for a rich source of information to mine, a git repository has a lot to offer. It is essentially temporal information wrapped around a social graph, with objects (the written code in the files) that contain strict semantics at the core.
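As a rough illustration of that shape (who, when, and what changed), here is a minimal Python sketch that parses the text produced by `git log --numstat --format='@%H|%an|%aI'`. The `Commit` class, the `parse_log` name, and the `@`/`|` delimiters are assumptions made for this example; this is not how Repo Explorer itself imports data.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Commit:
    sha: str
    author: str          # the "social graph" part
    when: datetime       # the temporal part
    # One (insertions, deletions, path) tuple per touched file.
    files: list = field(default_factory=list)


def parse_log(text: str) -> list[Commit]:
    """Parse output of: git log --numstat --format='@%H|%an|%aI'."""
    commits = []
    for line in text.splitlines():
        if line.startswith("@"):
            # Header line: sha first, ISO date last, author name in between.
            sha, rest = line[1:].split("|", 1)
            author, iso = rest.rsplit("|", 1)
            commits.append(Commit(sha, author, datetime.fromisoformat(iso)))
        elif line.strip() and commits:
            added, deleted, path = line.split("\t", 2)
            # git prints "-" instead of a count for binary files.
            ins = int(added) if added.isdigit() else 0
            dels = int(deleted) if deleted.isdigit() else 0
            commits[-1].files.append((ins, dels, path))
    return commits
```

Everything else in this post (timelines, hourly and weekly rhythms, top files) can be derived from records of roughly this shape.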

With that in mind, I began exploring using a little tool that I built: The Repo Explorer.

It’s currently at a nascent stage of development, but still a lot of fun to use.

Built with…

The Set Up

You can select one of the imported repositories (I imported a couple of dozen interesting ones, and will keep adding more). Then you can view the repository's commits in one of three ways: as a timeline, by the time of day the authors work, or by the day of the week they work.

Options

You can also view the repository as commits for the most active files. This view is not yet very fleshed out, but it probably has the most potential for meaningful exploration.

Authors Timeline

This is the simplest view, and also the most interesting. Each row represents an author’s commits. To conserve vertical space, only the top committers are shown individually, and the tail end is grouped as a single “author”’s timeline at the bottom.
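The grouping described above can be sketched in a few lines of Python. The function name `timeline_rows`, the default of ten rows, and the label for the grouped tail are assumptions for illustration; the tool's actual cutoff is not stated here.

```python
from collections import Counter


def timeline_rows(authors, top_n=10, rest_label="(everyone else)"):
    """Map each timeline row to the indices of the commits it contains.

    `authors` holds one author name per commit, in commit order.
    The most prolific `top_n` authors get a row each; all remaining
    commits share one row, which is placed last (at the bottom).
    """
    counts = Counter(authors)
    top = [name for name, _ in counts.most_common(top_n)]
    top_set = set(top)
    rows = {name: [] for name in top}   # dicts keep insertion order
    tail = []
    for index, author in enumerate(authors):
        (rows[author] if author in top_set else tail).append(index)
    if tail:
        rows[rest_label] = tail
    return rows
```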

A commit is shown with insertions at the top in blue bars and deletions at the bottom in red:

Commits up-close

Magnitude is represented by the height of the bar, and also with saturation: after the maximum height is reached, the saturation increases.

(Magnitude is not always representative of “busyness”, but may be a decent measure of impact.)
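One way to express the height-then-saturation encoding is sketched below in Python. All of the constants (lines per pixel, maximum height, base saturation, the overflow at which saturation tops out) are made-up values for illustration, not the ones the tool uses.

```python
def bar_style(lines_changed, lines_per_px=2.0, max_height=40.0,
              base_saturation=0.4, overflow_for_full=2000):
    """Return (height_px, saturation) for one insertion or deletion bar.

    Height grows linearly until it reaches `max_height`. Beyond that the
    height stays fixed and the extra magnitude is shown by raising the
    saturation from `base_saturation` towards 1.0.
    """
    raw_height = lines_changed / lines_per_px
    if raw_height <= max_height:
        return raw_height, base_saturation
    # Lines that did not "fit" into the capped bar height.
    overflow = lines_changed - max_height * lines_per_px
    extra = min(overflow / overflow_for_full, 1.0)
    return max_height, base_saturation + (1.0 - base_saturation) * extra
```

With these assumed defaults, a commit touching 40 lines gets a half-height bar at base saturation, while a very large commit gets a full-height, fully saturated bar.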

Initial work is usually intense:

Early vs. later development in Express

Commits are substantial in size and number. After stabilization, smaller and fewer commits would be expected; even if a lot of features are being added, churn will be smaller. This is not always the case, especially with language repositories like Swift and Kotlin that are seeing very active development.

Swift • Swift's strong development

How active are the founders?

More often than not, the founders leave the project at some point, or at least become very “minimally engaged”.

Quite often they leave early:

Alamofire • Matt leaves early (last row)

Carthage • Justin and Alan leave early

Node • Ryan Dahl leaves after the furious inception

libgit2 • Shawn Pearce starts the work and then leaves (for Google?), but Vincent quickly takes over

The exceptions are D3 (this is Mike Bostock’s PhD work after all), Clojure (this is Rich Hickey’s life’s work, but he is not so regular anymore), Flask (Armin keeps checking in), openFrameworks (Arturo seems committed to oF’s community and not just the code), and React (Facebook-subsidized).

The story is surprisingly similar in startups: the technical co-founder works incredibly hard to make things work. Here, for one of the startup repositories that I imported, there is an on-off activity stream in the beginning, then some intense work, probably when they went full-time on the project. No one comes close to matching that intensity later. (Some of this could be attributed to the churn of adding dependencies and the like, but the hourly and weekly rhythm visualizations shown later support the intensity angle.)

Technical co-founder works hard to get things off the ground.

Contributors

In the startup timeline above, you can see that multiple team members contribute at any given time, almost equally.

Swift, shown above, and Spring Framework, shown below, both have multiple strong developers.

Spring

Vagrant and NPM have had at least one strong contributor after the founder stepped out.

vagrant

npm

nginx sees a very clear handoff from Igor to multiple developers.

nginx

Other Timeline Observations

flask's sparse development.

Sinatra's mostly sparse development.

Express settles in after TJ's initial fierce work.

Clojure's stabilization.

nginx's stabilization.

Time of Day

This gives more of a micro view into how authors work. It’s quite likely that you would get a random commit at 2AM once in a couple of years of working in an open source library. Maybe you couldn’t sleep. Or maybe your computer clock is off and it is actually 2PM where you are?
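A minimal Python sketch of the bucketing behind such a view follows. It assumes each commit carries an ISO 8601 author timestamp with its original UTC offset (as `git log --format=%aI` prints), so the hour is the one on the author's own clock; a wrong clock on the author's machine will of course still show up as a wrong hour. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def commits_by_hour(timestamps):
    """Count commits per local hour (0-23) from ISO 8601 strings."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def share_outside_office_hours(histogram, start=9, end=17):
    """Fraction of commits made outside [start, end), e.g. outside 9-5."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return 1.0 - sum(histogram[start:end]) / total
```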

Some people have hard cut off times though, like Robert in Carthage:

while some people work around the clock, like Rebecca in npm:

You can tell a startup’s cofounder by how much they end up working around the clock, while the people who come later fall into a 9-5ish pattern.

A startup's time-of-day rhythm.

React is the most 9-5ish open source repository, probably because the Facebook employees treat it as a job?

React's Time of Day commits

I would have expected Swift’s commits to fall under a similar pattern, but they are intensely around-the-clock (a passion project for all involved?):

Swift's Time of Day commits

Although most founders work around the clock even in open source repositories, Mike Bostock is an exception:

D3's Time of Day commits

Day of the Week

Another micro view of how authors work, arranged radially like this:

Day of the week commits, arranged clockwise.
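For the curious, a clockwise radial layout like this boils down to a little trigonometry. The Python sketch below assumes Monday sits at 12 o'clock and that screen coordinates have y growing downwards; which day the tool actually places at the top is not something this post specifies.

```python
import math


def weekday_angle(weekday, first_day=0):
    """Clockwise angle in degrees from 12 o'clock.

    `weekday` follows Python's convention (0 = Monday ... 6 = Sunday);
    `first_day` is the weekday drawn at the top (an assumption here).
    """
    return ((weekday - first_day) % 7) * 360 / 7


def to_screen_xy(angle_deg, radius):
    """Convert a clockwise-from-top angle to screen x/y (y points down)."""
    theta = math.radians(angle_deg)
    return radius * math.sin(theta), -radius * math.cos(theta)
```

A commit's distance from the center can then encode something else, such as the date or the commit's size.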

Open source libraries often see commits all week long:

openFrameworks's Day of the Week commits

ember's Day of the Week commits

while in a startup, the technical cofounder works around the week while others usually skip the weekend.

A startup's Day of the Week commits

Chris Lattner may have worked more on the weekends:

Swift's Day of the Week commits

Files

The files view shows the top files that are included in commits. There is a lot more that the tool can provide for exploration, but here are some things I observed.
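Ranking files by how often they appear in commits is a simple count. The Python sketch below assumes each commit has already been reduced to a list of the paths it touched; `top_files` and the default of ten files are illustrative choices, not the tool's.

```python
from collections import Counter


def top_files(commits, n=10):
    """Return the `n` most frequently committed paths as (path, count).

    `commits` is an iterable of path lists, one list per commit. A path
    is counted once per commit, however often it is listed in it.
    """
    counts = Counter()
    for paths in commits:
        counts.update(set(paths))
    return counts.most_common(n)
```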

clojure/core.clj appears consistently, as it provides the public interface. clojure/spec.clj has begun showing up since the last beta, and underscores the importance of clojure.spec to the language.

Clojure's top files

JavaScript libraries see a lot of changes in their package.json, given their tendency to rely on dependencies.

Express's top files

npm's top files

Flask’s top committed files include flask.py and helpers.py, which might mean a less organized codebase.

Flask's top files

Swift sees a lot of commits to its type-checker-related files, perhaps pointing to the authors’ penchant for type safety and also to how hard a problem type inference is.

Swift's top files

In particular, CSDiag.cpp, which is billed as implementing the “diagnostics for the type checker”, sees heavy development in summer 2015 with Swift 2.0’s public release. Maybe supporting protocol extensions was the culprit.

More to do

There is a lot to be done; next up are performance improvements, more sophisticated file visualizations, and keeping the information up to date with the repository rather than relying on a one-time import.


Thanks to Jim Vallandingham for feedback and guidance.