LL(n) notations in Coq

Does anybody know if there is a trick to parse LL(n) notations in Coq? Since Coq uses Camlp5, the standard way of parsing can only handle LL(1) languages. However, at the OCaml level one can cheat by looking ahead in the token stream. I am wondering whether such a thing is also possible with Coq-level notations.

I would already be quite happy with a way to parse LL(2). As a concrete example, is it possible to write notations for the following LL(2) grammar?

S -> ∊
   | a S A
A -> a b S
   | c

I don’t know much about how parsing in Coq works, but why would you want to parse something like this in Coq? This looks like a grammar that can be implemented using mutually inductive types:

From Coq Require Import List.
Import ListNotations.

(* Abstract syntax tree (ignoring terminal symbols) *)
Inductive ProdS : Type :=
| ProdSϵ : ProdS
| ProdSa : ProdS -> ProdA -> ProdS
with ProdA : Type :=
| ProdAab : ProdS -> ProdA
| ProdAc : ProdA
.

(* The alphabet consists of the symbols [A; B; C] *)
Inductive sigma := A | B | C.
Definition string := list sigma.

(* You can convert elements of type [ProdS] and [ProdA] to strings *)
Fixpoint S_to_string (s : ProdS) : string :=
  match s with
  | ProdSϵ => []
  | ProdSa s' a => A :: S_to_string s' ++ A_to_string a
  end
with A_to_string (a : ProdA) : string :=
  match a with
  | ProdAab s => A :: B :: S_to_string s
  | ProdAc => [C]
  end.

You could even try to implement a parser for ProdS and ProdA. The two parsers would be mutually recursive. In the parser for ProdA, if you encounter an a, you know that you need to read a b afterwards, so no further lookahead is required there.
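For concreteness, here is a rough, untested sketch of such a parser (my own code, reusing the definitions above). The fuel argument is only there to keep Coq's termination checker happy, and parse_S peeks two tokens ahead to decide whether an a starts S -> a S A or belongs to the caller's A -> a b S production:

Fixpoint parse_S (fuel : nat) (w : string) {struct fuel}
  : option (ProdS * string) :=
  match fuel with
  | 0 => None
  | S fuel' =>
    match w with
    | A :: B :: _ => Some (ProdSϵ, w)  (* "a b" belongs to the caller's A-production *)
    | A :: w' =>                       (* S -> a S A *)
      match parse_S fuel' w' with
      | Some (s, w'') =>
        match parse_A fuel' w'' with
        | Some (a, w''') => Some (ProdSa s a, w''')
        | None => None
        end
      | None => None
      end
    | _ => Some (ProdSϵ, w)            (* S -> ε *)
    end
  end
with parse_A (fuel : nat) (w : string) {struct fuel}
  : option (ProdA * string) :=
  match fuel with
  | 0 => None
  | S fuel' =>
    match w with
    | A :: B :: w' =>                  (* A -> a b S: after [a], a [b] must follow *)
      match parse_S fuel' w' with
      | Some (s, w'') => Some (ProdAab s, w'')
      | None => None
      end
    | C :: w' => Some (ProdAc, w')     (* A -> c *)
    | _ => None
    end
  end.

Compute parse_S 10 [A; C].     (* Some (ProdSa ProdSϵ ProdAc, []) *)
Compute parse_S 10 [A; A; B].  (* Some (ProdSa ProdSϵ (ProdAab ProdSϵ), []) *)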

Notations are implemented by adding rules to the grammar. Since they are processed by the same parser, they would still be limited to LL(1).
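You can actually inspect the rules that notations add; for example, with a throwaway notation of my own:

Notation "x +++ y" := (Nat.add x y) (at level 50, left associativity).
Print Grammar constr.  (* the new "+++" rule now appears at level 50 of constr *)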



Hi everyone. Thanks for all your comments. Just to clarify some things:

  • I’m trying to deeply embed a domain specific language with a non-LL(1) grammar. So I indeed have an AST, but I also want convenient notation. I cannot use strings, because I want to use meta-variables and meta-terms in my notations.
  • This grammar would indeed need to use custom entries.
  • The grammar above is a simplification of my real grammar. Let me modify this example a little bit to make my meta-variable point clear:
S -> ∊
   | ident S A
A -> ident & S
   | #

(Here I replaced a with ident). One possible way you might write this grammar in Coq is as follows:

Declare Custom Entry S.
Declare Custom Entry A.
Notation "[ x ]" := x (x custom S).
Notation "x s a" := (x, s, a) (in custom S, x global, s custom S, a custom A).
Notation "" := tt (in custom S).
Notation "x & s" := (x, s) (in custom A, x global, s custom S).
Notation "#" := tt.

However, this does not work, because the grammar is not LL(1) (and because Coq has trouble dealing with the empty notation).

Hi Lasse,

Did you try to generate a few random productions of the above language and construct the parse trees manually? I would say this is likely not very intuitive, and things don't get easier given that this is only a small piece of a larger grammar. A DSL that is difficult for a machine to parse may also be difficult for humans to read and write, which is not desirable for a DSL.

My advice: before you spend a lot of time trying to find out how to implement such a grammar, you might want to do some pencil-and-paper experiments with humans to see whether using the grammar - especially reading it - is reasonably smooth. Some grammars look natural when you write them but are a nightmare when you actually use them.

Best regards,

Michael

Hi Michael,

Yes, you are right that LL(n) grammars with n > 1 can be problematic and should generally be avoided. Let me explain why I want to do this anyway (sorry for only doing that now, but my reasons are a bit convoluted, so I thought it would be less distracting to leave them out; that clearly didn't work).

In this case, I’m building a research language that is not really meant for real-world programming at this stage. So my main concern is that the grammar is as similar as possible to its LaTeX counterpart, which has Unicode characters in it. For example, it contains vector variables like x⃗ (I hope you can see the arrow on top).

I discovered that there is a cool way to parse such variables in Coq. In Unicode, there are combining characters, which attach to the preceding regular character and put the arrow on top of it. Since Coq notations ignore whitespace, it is possible to define the following notation:
Notation "x ⃗" := x (x global).
This then allows you to parse notations like x⃗ y⃗ and z⃗ as Coq variables x, y and z. This actually works surprisingly well. The only problem is that combining characters come after the main character, which makes it really easy to end up with an LL(2) grammar. For example, if you want to parse either a normal variable or a vector variable, the grammar is already LL(2), because you have to look ahead for the arrow symbol. If the combining character came before the main character, everything would be fine.
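To make that concrete, here is a minimal, untested sketch of how this looks inside a custom entry (the entry name vec is made up for this example; my real setup is larger):

Declare Custom Entry vec.
Notation "[ e ]" := e (e custom vec at level 0).
Notation "x ⃗" := x (in custom vec at level 0, x global).

Parameter x : nat.
Check [ x⃗ ].  (* hopefully parsed as the plain variable x *)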

So these are my reasons, which I admit are a bit silly, but still it would be cool if this could work.

I heard from issue #13078 that you made it work. I could also get it working (if I’m not mistaken) with the following factorization, though with a different associativity of pairs:

Declare Custom Entry S.
Declare Custom Entry T.
Declare Custom Entry A.

Notation "[ x ]" := x (x custom S at level 0).
Notation "" := tt (in custom S at level 0).
Notation "x t" := (x, t) (in custom S, x global, t custom T).

Notation "#" := (tt, tt) (in custom T).
Notation "x & s" := (tt, (x, s)) (in custom T, x global, s custom S).
Notation "x t a" := ((x, t), a) (in custom T at level 0, x global, t custom T, a custom A).

Notation "#" := tt (in custom A at level 0).
Notation "x & s" := (x, s) (in custom A at level 0, x global, s custom S).

Axiom x X : nat.
Check [ x # ].
Check [ x X & x # ].
Check [ x X & x # ].
Check [ x x # # ].
Check [ x x X & x # # ].
Check [ x x X & x X & x # # ].

It relies on a specific Camlp5 look-ahead on tokens to backtrack from the rule "x & s" whenever it sees that the chain of tokens x & does not fully match.

Yes, I made it work by realizing (somewhat embarrassingly) that with some effort I could in fact rewrite my grammar to be LL(1). But you seem to have managed to get Coq to actually parse an LL(2) language. Impressive! To be honest, I have no idea how this magic works. Does the difference in associativity of the pairs somehow help the parser decide to look ahead?

To be honest, I don’t really understand how the notation system in Coq works. In principle, it seems to be an LL(1) parser that can automatically left-factor rules. And maybe it internally does some other transformations on the grammar? But what you do here seems to go beyond that :slight_smile:
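For instance, a toy example of my own: Coq happily accepts two notations sharing a common prefix, which a naive LL(1) grammar could only handle after left-factoring:

Notation "<< x >>" := (x, 0) (at level 0).
Notation "<< x | y >>" := (x, y) (at level 0).

Check << 1 >>.      (* parses as (1, 0) *)
Check << 1 | 2 >>.  (* parses as (1, 2) *)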

Actually, the change of associativity of pairs is only minor (see variant below using @mwuttke97’s inductive representation of the grammar).

I’m actually not fully sure why this works. On the one hand, Camlp5 has a token-list look-ahead, that is, a sequence of tokens can be treated as a single token for the purpose of looking one symbol ahead.

On the other hand, some left-factorization is also performed automatically.

I suspect that, with the order of rules I gave, either of these would be enough to let SA (below) discriminate between & and sa. If the order of the SA rules were swapped, only left-factorization would work, though, as far as I can reconstruct what happens.

To be honest, I also don’t know very precisely all the kinds of transformations made by the Camlp5 parsing engine.

The notation system itself is another level on top of Camlp5, and it also does a few transformations (e.g. for recursive notations).

Declare Custom Entry S.
Declare Custom Entry SA.
Declare Custom Entry A.

Inductive ProdS : Type :=
| ProdSϵ : ProdS
| ProdSa : nat -> ProdS * ProdA -> ProdS
with ProdA : Type :=
| ProdAab : nat -> ProdS -> ProdA
| ProdAc : ProdA.

Notation "[ x ]" := x (x custom S at level 0).
Notation "" := ProdSϵ (in custom S at level 0).
Notation "x sa" := (ProdSa x sa) (in custom S at level 0, x global, sa custom SA at level 0).

Notation "#" := (ProdSϵ, ProdAc) (in custom SA at level 0).
Notation "x & s" := (ProdSϵ, ProdAab x s) (in custom SA at level 0, x global, s custom S at level 0).
Notation "x sa a" := (ProdSa x sa, a) (in custom SA at level 0, x global, sa custom SA at level 0, a custom A at level 0).

Notation "#" := ProdAc (in custom A at level 0).
Notation "x & s" := (ProdAab x s) (in custom A at level 0, x global, s custom S at level 0).

Section S.
Variables x X : nat.
Check [ ].
Check [ x # ].
Check [ x X & x # ].
Check [ x X & x # ].
Check [ x x # # ].
Check [ x x X & x # # ].
Check [ x x X & x X & x # # ].
End S.

Ah, I hadn’t realized there was a token-list look-ahead. That’s useful to know. It seems to me that this must be the reason your code works, because I don’t think it can be due to left-factorization of x & s and x sa a: if you do this, you still end up with a first/follow conflict (see this link for an analysis).

I wonder if with this token-list look-ahead, the parser can actually parse all LL(*) languages, or if it ends up somewhere in between LL(1) and LL(*).

Anyway, thanks for your investigation. Always interesting to explore the dark secrets Coq harbors :slight_smile:

If your use of “combining” is accurate, this sounds (a) fragile on its own and (b) likely a Coq bug that will have to be fixed, making this even more fragile. The question is: does your notation still work once you convert the source code to the Unicode NFC normalization form (which many editors will happily do, maybe unconditionally), or does it break because the characters get combined into a precomposed character? Is there any reason the missing precomposed character cannot be added by Unicode?
Somebody should test whether notations can match on usual accents; that’d be bad, since for those, precomposed characters do exist.

That is an interesting point, and indeed it works:

Require Import Ascii String.

Definition accent x :=
  String x (String "204" (String "129" EmptyString)).

Eval cbv in ("  " ++ accent "e" ++ "   ")%string.

Notation "[ x ́  ]" := x (at level 0).

Definition e := 10.
Definition é := 5.

Eval cbv in  [  é  ]. (* 10 *)
Eval cbv in é. (* 5 *)

The two e's here are equal modulo (Unicode) normalization. I don’t think there is any risk for vectors because they don’t seem to normalize, but indeed this is worrying.

But it should be noted that the snippet above has survived being uploaded to the internet, so no normalization has happened anywhere between my editor and the text rendered in HTML…

Not sure how to check without a hex editor, but what does Eval cbv in [ é ]. give for a precomposed é? If the notation still worked, everything would be fine. Generally, Unicode normalization induces an equivalence on Unicode strings, and functions on Unicode strings (like Coq’s parser) should respect this equivalence.
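(For what it’s worth, one way to look at the bytes without a hex editor is to dump the character codes from within Coq; a quick sketch:)

Require Import Ascii String List.
Import ListNotations.

Eval cbv in (List.map nat_of_ascii (list_ascii_of_string "é")).
(* decomposed  e + U+0301 gives [101; 204; 129] *)
(* precomposed U+00E9     gives [195; 169] *)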

Vice versa, maybe you can get something like this to work:

Notation "[ x ́  ]" := x (at level 0).
Notation "[ x   ]" := x (at level 0).
Eval cbv in [ é ]. (* decomposed, gives 10 *)
Eval cbv in [ é ] . (* precomposed, gives 5 *)

and that would be pretty unfortunate.

Yes, that definitely works for me. So indeed, you can have identical-looking expressions evaluate differently.

Notation "[ x ́  ]" := x (at level 0).
Notation "[ x   ]" := x (at level 0).

Definition e := 10.
Definition é := 5.

Eval cbv in [ é ]. (*  10 *)
Eval cbv in [ é ] . (*  5 *)