Hi there,
I recently learned of fornalder, a tool that creates nice visualizations of contributions to open-source projects by analyzing commits to their git repositories (the author used it to analyze GNOME contributions). I decided to use it to study contributions to the Coq implementation and packages in the Coq opam archive, results below.
The Coq proof assistant
This graph shows the “contributor cohorts” for the Coq proof assistant over time. For example, the big dark-green bar that shows up in 2017 represents the “2017 cohort”, the number of long-term contributors to the Coq implementation that did their first contribution in 2017. The dark-green bar in similar position in each following year represents the contributors from the 2017 cohort that are still active on that year. The bar is smaller in 2018, and again smaller in 2019, as some members of this cohort stop contributors. Short-term contributors (all their contributions fall within a 90-days period) are shown as the “Brief” bars in dark blue at the top.
The main thing we see on this graph is that Coq development activity remained fairly stable until 2013, and started growing steadily during the 2013-2018 period, and is again stable at this level. This suggests a succesful process of opening the development to the wider community, with previously only a small “core group” contributing, and now a wider “expert cool” reaching wider in the Coq community. (Since 2017, we also see a large fraction of occasional contributors each year.)
This graph shows the number of commits from the contributors of each cohort. While the size of cohorts (in number of active contributors) naturally decreases over time, the commit volume of each cohort tend to increase in the first few years, as contributors gain proficiency and become more active.
Coq remains a relatively small project, where contribution volume is skewed by a handful of high-activity contributors. 2011 is the year of Pierre-Marie Pédrot and Enrico Tassi (and no one else), and 2015 saw 16 new long-term contributors, but their surprising volume in current years is dominantly explained by the presence of Théo Zimmermann and Emilio Jesús Gallego Arias.
All Coq packages on opam
I then ran the same visualization tool on all Coq packages listed in the public opam-coq-archive. This is a good subset of all open source Coq developments, but it is biased towards recent project (Coq sources have been shared as tarballs on webpages for a while), and biased towards libraries meant to be reused rather than individual, self-contained proofs.
This graph shows the number of contributors, in yearly cohorts. We can see that the number of contributors has been growing each year, and seems to be reaching a plateau around the 2019 level. (There were more “brief” contributors in 2019, but again slightly more long-term contributors in 2020.) The growth dynamic starting around 2013 is again very clearly visible.
This graph shows the volume of commits. The dynamic is different from the number of contributors – the plateau seems to start in 2016. One thing that I find interesting is that each cohort keeps a good contribution volume for many years (even though the cohort size decreases); clearly the projects in the opam-coq-archive are, in large part, not one-shot project of people working for the duration of their PhD thesis and giving up on it afterwards.
Disclaimer
Some authors commit in different projects using different names, so they may be counted as separate contributors instead. (Inside a project, one may use a .mailmap file to merge contributor identities, but afaik there is no support in git or fornalder for overlaying an extra .mailmap file that would work across repositories.)
If you wish to study the dataset to see if the overall conclusions are endangered by such anomalies, please feel free to replay the data-collection steps. You can either manually inspect git repositories, or play with the SQLite database generated by fornalder.
My take away
These graphs gave me a quantitative validation of the feeling I already had that the Coq ecosystem did a superb job at opening up its development during the 2010s, with remarkable growth from 2013 to 2018. Since 2018 we seem to have reached a plateau, that can be explained by the fact that it remains a niche ecosystem; but the overall community seems healthy, with contributors remaining active for many years, and a sizeable new cohort each year.
(I feel limited in this analysis by the sentiment that a lot of interesting Coq software is not included in the opam-coq-archive. If you know of a way to collect more git repositories, please use it and re-run the analysis!)
Reproduction information
You can find a curated log of my analysis process in logs.md; this should contain enough information for you to reproduce the result, and it could easily be adapted to other software communities.
I uploaded all the small-enough data of my run in this repository, in particular the list of URLs I cloned – some of them failed. Not included: the git repositories, and the database in which fornalder stores its analysis result.