Ah, luckily, the choice of benchmark is orthogonal to the choice of instrumenting vs parsing serapi output.
CoqGym as a benchmark set is great IMO. While Proverbot9001 primarily evaluated on CompCert as one large project, the evaluation on the CoqGym benchmarks really helped with comparing against other work; I'm actually primarily using CoqGym in my new work. I'm not really sure what the downsides are. I think it's mostly been used as a benchmark within the ASTactic framework, which has its own pros and cons, but purely as a benchmark set I don't see anything else that compares.
As for the version thing, I guess I don't know coqtop that well; it could be that it's totally stable across versions. I've primarily been working with coq-serapi, where the same information is returned but wrapped in a bunch of metadata about the state machine, and that wrapping appears to have changed between 8.8 and 8.12. I guess I just assumed small formatting things would differ in coqtop too, but maybe not.
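To make the "wrapped in metadata" part concrete, here's a minimal sketch of what the scraping side looks like against serapi. The response shapes (`Answer`/`Ack`/`Completed`/`ObjList`) are from memory and purely illustrative, and they're exactly the kind of thing that has shifted between serapi versions; the `extract_payloads` helper and the sample goal string are hypothetical:

```python
# Minimal sketch: peel the serapi state-machine wrapper off query responses.
# Response shapes here are illustrative / from memory; they are NOT stable
# across serapi versions (8.8 vs 8.12), which is the instability in question.
import sexpdata  # third-party s-expression parser (pip install sexpdata)

def extract_payloads(raw_lines):
    """Yield the payload of each (Answer <tag> <payload>) message,
    skipping Ack/Completed protocol chatter and Feedback messages."""
    for line in raw_lines:
        msg = sexpdata.loads(line)
        # Expect messages like (Answer 2 Ack), (Answer 2 (ObjList (...))),
        # (Answer 2 Completed), or (Feedback (...)).
        if not (isinstance(msg, list) and msg and str(msg[0]) == "Answer"):
            continue  # e.g. Feedback messages
        payload = msg[2]
        if isinstance(payload, sexpdata.Symbol) and str(payload) in ("Ack", "Completed"):
            continue  # state-machine bookkeeping, not goal data
        yield payload  # e.g. (ObjList (...)) holding the actual goal text

# Hypothetical wire traffic for one goals query.
responses = [
    '(Answer 2 Ack)',
    '(Answer 2 (ObjList ((CoqString "forall n : nat, n + 0 = n"))))',
    '(Answer 2 Completed)',
]
for p in extract_payloads(responses):
    print(p)
```

With coqtop you'd instead be parsing the pretty-printed goal output directly, so the question is really whether that layer moves around less between versions than the serapi wrapper does.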