Initial release of machine learning dataset based on Mathematical Components 1.9.0

Dear Coq users and developers,

We are happy to announce the initial release of a dataset for use in machine learning, based on Coq 8.10.2 and the Mathematical Components library version 1.9.0.

The dataset is based on 449 source files from 21 Coq projects, which are in total over 297k lines of code and contains over 23k lemmas, with projects divided into tiers based on adherence to the MathComp coding conventions.

Each source file has two machine-readable files which are lists of S-expressions: one file with Coq tokens and one file with Coq abstract syntax trees. Each token/AST contains detailed information on its source file provenance. These representations were obtained directly from Coq’s implementation via (OCaml PPX) metaprogramming using the SerAPI library. We also provide raw and processed (“chopped”) Coq kernel terms as S-expressions for each lemma statement.

All files not directly derived from Coq code, such as metadata, are released under the MIT license. All direct derivatives of Coq code are released under the same (open source) license as the code itself.

So far, we have used the dataset for deep generation of Coq lemma names (see the separate announcement of the Roosterize tool release) and for formatting Coq files. We look forward to seeing the Coq community leveraging the dataset to further improve productivity for Coq users.

Best regards,
Karl and all contributors to the project