Visualizing contributions to the Coq implementation and all Coq opam packages

Hi there,

I recently learned of fornalder, a tool that creates nice visualizations of contributions to open-source projects by analyzing commits to their git repositories (the author used it to analyze GNOME contributions). I decided to use it to study contributions to the Coq implementation and packages in the Coq opam archive, results below.

The Coq proof assistant

This graph shows the “contributor cohorts” for the Coq proof assistant over time. For example, the big dark-green bar that appears in 2017 represents the “2017 cohort”: the number of long-term contributors to the Coq implementation who made their first contribution in 2017. The dark-green bar in the same position in each following year represents the contributors from the 2017 cohort who are still active in that year. The bar is smaller in 2018, and smaller again in 2019, as some members of this cohort stop contributing. Short-term contributors (all of whose contributions fall within a 90-day period) are shown as the “Brief” bars in dark blue at the top.
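To make the cohort definition concrete, here is a minimal sketch of how one could compute yearly cohorts from a list of (author, date) commit pairs. This is my own illustration, not fornalder's actual code; the 90-day “Brief” cutoff mirrors the description above.

```python
from datetime import date, timedelta
from collections import defaultdict

def cohorts(commits, brief_days=90):
    """Group authors into yearly cohorts by first-commit year.

    `commits` is an iterable of (author, date) pairs. Authors whose
    entire activity span fits within `brief_days` are labeled "Brief",
    mirroring the short-term contributor bucket described above.
    """
    first, last = {}, {}
    for author, d in commits:
        first[author] = min(first.get(author, d), d)
        last[author] = max(last.get(author, d), d)
    result = defaultdict(set)
    for author in first:
        if last[author] - first[author] <= timedelta(days=brief_days):
            result["Brief"].add(author)
        else:
            result[first[author].year].add(author)
    return result

commits = [
    ("alice", date(2017, 3, 1)), ("alice", date(2019, 6, 1)),
    ("bob",   date(2017, 5, 1)), ("bob",   date(2017, 5, 20)),
]
# alice's activity spans more than 90 days, so she joins the 2017
# cohort; bob's 19-day span makes him a "Brief" contributor.
print(cohorts(commits))
```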

The main thing we see on this graph is that Coq development activity remained fairly stable until 2013, grew steadily during the 2013-2018 period, and has been stable at this higher level since. This suggests a successful process of opening the development to the wider community: where previously only a small “core group” contributed, there is now a broader “expert pool” reaching further into the Coq community. (Since 2017, we also see a large fraction of occasional contributors each year.)

This graph shows the number of commits from the contributors of each cohort. While the size of cohorts (in number of active contributors) naturally decreases over time, the commit volume of each cohort tends to increase in the first few years, as contributors gain proficiency and become more active.

Coq remains a relatively small project, where contribution volume is skewed by a handful of high-activity contributors. 2011 is the year of Pierre-Marie Pédrot and Enrico Tassi (and no one else), and 2015 saw 16 new long-term contributors, but that cohort's surprisingly high volume in recent years is largely explained by the presence of Théo Zimmermann and Emilio Jesús Gallego Arias.

All Coq packages on opam

I then ran the same visualization tool on all Coq packages listed in the public opam-coq-archive. This is a good subset of all open source Coq developments, but it is biased towards recent projects (Coq sources were shared as tarballs on webpages for a long time), and biased towards libraries meant to be reused rather than individual, self-contained proofs.

This graph shows the number of contributors, in yearly cohorts. We can see that the number of contributors has been growing each year, and seems to be reaching a plateau around the 2019 level. (There were more “brief” contributors in 2019, but again slightly more long-term contributors in 2020.) The growth dynamic starting around 2013 is again very clearly visible.

This graph shows the volume of commits. The dynamic is different from the number of contributors – the plateau seems to start in 2016. One thing that I find interesting is that each cohort keeps a good contribution volume for many years (even though the cohort size decreases); clearly the projects in the opam-coq-archive are, in large part, not one-shot projects of people working on them for the duration of their PhD thesis and abandoning them afterwards.

Disclaimer

Some authors commit to different projects using different names, so they may be counted as several separate contributors. (Within a project, one can use a .mailmap file to merge contributor identities, but afaik there is no support in git or fornalder for overlaying an extra .mailmap file that would work across repositories.)
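For readers unfamiliar with it, a .mailmap file is just a list of canonical identities at the root of a repository. A minimal sketch, with made-up names and addresses:

```
# Map two commit identities to one canonical "Jane Doe <jane@example.org>".
# Form 1: match on the old email only.
Jane Doe <jane@example.org> <jdoe@old-laptop.local>
# Form 2: match on both the old name and old email.
Jane Doe <jane@example.org> Jane D. <jane.doe@example.com>
```

Tools that honor it (e.g. `git shortlog`, or `git log --use-mailmap`) then report all three identities as a single contributor.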

If you wish to study the dataset to see if the overall conclusions are endangered by such anomalies, please feel free to replay the data-collection steps. You can either manually inspect git repositories, or play with the SQLite database generated by fornalder.
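If you go the SQLite route, you don't need to know fornalder's schema in advance: you can discover it from the database itself. A small sketch (the table names in your database will be whatever fornalder uses; nothing is hard-coded here):

```python
import sqlite3

def list_tables(db_path):
    """Return {table_name: row_count} for every table in an SQLite file.

    Reading table names from sqlite_master means we make no assumption
    about fornalder's actual schema.
    """
    con = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        # Quote the table name since it comes from the database itself.
        return {n: con.execute(f'SELECT COUNT(*) FROM "{n}"').fetchone()[0]
                for n in names}
    finally:
        con.close()
```

From there, `SELECT * FROM <table> LIMIT 5` on each table is usually enough to find the author/commit data and check for duplicate identities.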

My takeaway

These graphs gave me a quantitative validation of the feeling I already had that the Coq ecosystem did a superb job of opening up its development during the 2010s, with remarkable growth from 2013 to 2018. Since 2018 we seem to have reached a plateau, which can be explained by the fact that Coq remains a niche ecosystem; but the overall community seems healthy, with contributors remaining active for many years and a sizeable new cohort each year.

(I feel limited in this analysis by the sentiment that a lot of interesting Coq software is not included in the opam-coq-archive. If you know of a way to collect more git repositories, please use it and re-run the analysis!)

Reproduction information

You can find a curated log of my analysis process in logs.md; this should contain enough information for you to reproduce the result, and it could easily be adapted to other software communities.

I uploaded all the data from my run that was small enough to this repository, in particular the list of URLs I cloned – some of them failed to clone. Not included: the git repositories, and the database in which fornalder stores its analysis results.

Hi,

Thanks for the analysis and the nice graphs! I too have felt the level of contributions for the opam packages plateauing for the last year or two. At first I thought this was due to the pandemic, but it may be a more global trend. I don't think this is normal, though: we should see continuous growth for the packages, even if the core team stabilizes. Probably we could do more advertising.

This suggests a question for the upcoming Coq User Survey: ask respondents whether the code they write in Coq ends up on the opam archive.

If many people write Coq code for research papers and don’t include it in the opam archive, then I guess you have to figure out whether you want them in the archive (or a different way to store known-good Coq proofs produced by the community). One way to improve coverage would be to get in touch with the “Artifact Evaluation” committees of SIGPLAN research conferences, which get to see a lot of Coq code written for research purposes, and see if they can encourage researchers to follow certain packaging practices (but what should they be? what is the advantage for them?).

(For large projects there is a clear advantage in being on opam-archive: Coq developers use it for benchmarking and/or backward-compatibility testing. This is important for multi-person projects that want to remain usable in the long run. But I would hazard the guess that a lot of Coq code authors do not self-identify as needing this, and are not necessarily fully aware of the reuse opportunities of their work. I say this, ironically, as an irregular author of Coq formalizations who never bothered submitting to the opam archive.)

Indeed, contacting SIGPLAN people is a good suggestion; I will do that. I think the interest for research papers in being on opam is getting a reproducible build. It does not cost us much to check that existing packages are still compatible with new Coq versions, and to update the opam dependencies if needed. It can also serve as advertisement, and be a way to run analyses on many developments like you did.

What Maria and I did for CAV's artifact evaluation this year was to condition the “reusable” badge on having clear installation instructions. We didn't recommend using specific package archives like OPAM or PyPI or Cabal, but in practice several of the Coq submissions used OPAM (though not all). I'm not sure it would make things better to require OPAM for Coq packages.

I don’t think it would be in the spirit of artifact evaluation to require one specific technical solution, but this is probably a good time and place to have a discussion with authors about good packaging practices: “The Coq community recommends packaging proofs on opam-archive; here is a link to how to do it.” This may help spread the word among authors with vestigial code-distribution practices.

OK yes indeed, suggesting / recommending rather than requiring.

I think it would also help to explain to the authors what they gain from distributing through OPAM: as an artifact reviewer, having a proof distributed through OPAM rather than as a plain git repo would make my life marginally harder, not easier (I’d have to figure out where OPAM installed the source files containing the proof to step through them interactively, read the theorem statements, and run coqchk).

Things are different if the artifact is a tool or a library implemented in Coq, since in that case I may just want to use the tool or import the library without looking at the implementation, and OPAM is good for that.