First release of Kôika 🦑, a core language for rule-based hardware design

Hi all,

In preparation for PLDI, Thomas Bourgeat and I recently made our first public release of Kôika, a Coq EDSL for rule-based hardware design. Here is a brief description, taken from the README:

Kôika is an experimental hardware design language inspired by BlueSpec SystemVerilog. Kôika programs are built from rules, small bits of hardware that operate concurrently to compute state updates but provide the illusion of serializable (atomic) updates. Kôika has simple, precise semantics that give you strong guarantees about the behavior of your designs.

Our distribution includes an executable reference implementation of the language written using the Coq proof assistant, machine-checked proofs ensuring that Kôika’s semantics are compatible with one-rule-at-a-time execution, and a formally-verified compiler that generates circuits guaranteed to correctly implement your designs.

You can read more in our PLDI paper (https://pit-claudel.fr/clement/papers/koika-PLDI20.pdf), or by visiting our GitHub repository (mit-plv/koika). This is joint work with Adam Chlipala and Arvind, with many contributions from other members of the PLV team at MIT since the paper was written.

Thomas and I will be giving a talk about it at PLDI on Thursday, June 18.

Cheers,
Clément.


Thanks for the invitation to this interesting work! After reading
Sections 1 and 2, as someone who knows essentially nothing about
hardware description languages (rule-based or not), I am slightly
confused / I have some questions. I tried to write them down below.

The so-called one-rule-at-a-time (ORAAT) semantics […]: the
process is repeated endlessly, as if exactly one rule executed in
each clock cycle. […]
In designing efficient hardware, however, we strive to execute as
many compatible rules as possible in parallel in each clock cycle,
without violating the illusion of running rules one-at-a-time.

I found these two sentences confusing when put together. If the ORAAT
semantics specifies that one rule executes in each clock cycle, then
parallel execution does not preserve the semantics – so the illusion
is violated. Maybe you could just say “one rule at a time” for the
ORAAT semantics (without mentioning clock cycles).

(In retrospect, my understanding is that ORAAT promises the same
behavior no matter how the rules of a trace are clocked (grouped into
clock cycles), so I suppose this encourages people to blur the
distinction between “one rule” and “one cycle”, giving up performance
reasoning to think only about correctness. But you could still be more
fine-grained than that with ORAAT.)

The way I understood the introduction on first read was that
schedulers were justified as being important for performance
reasoning – to reason about clock cycles. However, thinking about
this more, I have the impression that there is a natural way to
assign clock cycles to a given trace of ORAAT execution: batching
rules that are compatible together into the same clock cycle, and
starting a new cycle at the first incompatible rule. The problem then is not so
much to count clock cycles, but that programs are non-deterministic:
there are many different ORAAT traces for a given program, so it’s
unclear which behavior it specifies (and its
performance characteristics). From what I know from the introduction,
I think that this is what you gain by specifying an intra-cycle
schedule: the rules combined with the schedule specify
a deterministic system.

However, pure ORAAT semantics are unable to capture the performance
implications of EHRs because in ORAAT semantics, EHRs are
indistinguishable from ordinary registers.

Unless I misunderstand something, this “batch rules as long as they
are compatible” way to interpret a trace of ORAAT rules (and map it to
clock cycles) does allow one to distinguish EHRs from ordinary registers
(the rules that just do “r.wr0(r.rd0+1)” and “r.wr1(r.rd1+1)” can run
together in the same clock cycle, incrementing “r” twice per cycle,
while they would be incompatible or have a different semantics with
ordinary registers).
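To make this concrete, here is a small Python sketch (my own toy encoding, not Kôika's implementation) of the EHR point above: with an EHR, a port-1 read observes a port-0 write made earlier in the same cycle, so the two increment rules can both fire in one cycle and bump the register twice.

```python
class EHR:
    """Toy ephemeral history register: rd1 sees wr0 from the same cycle."""
    def __init__(self, value):
        self.value = value   # committed value from the previous cycle
        self._wr0 = None     # pending port-0 write this cycle
        self._wr1 = None     # pending port-1 write this cycle

    def rd0(self):
        return self.value

    def wr0(self, v):
        self._wr0 = v

    def rd1(self):
        # Port 1 observes the port-0 write made earlier in this cycle.
        return self._wr0 if self._wr0 is not None else self.value

    def wr1(self, v):
        self._wr1 = v

    def commit(self):
        # Within a cycle, the port-1 write takes precedence over port 0.
        if self._wr1 is not None:
            self.value = self._wr1
        elif self._wr0 is not None:
            self.value = self._wr0
        self._wr0 = None
        self._wr1 = None

r = EHR(0)
r.wr0(r.rd0() + 1)   # first rule: reads 0, writes 1 on port 0
r.wr1(r.rd1() + 1)   # second rule: reads the forwarded 1, writes 2
r.commit()
print(r.value)       # 2: incremented twice in a single cycle
```

With ordinary registers (no forwarding), the second rule would read the stale value 0, so the register would only reach 1 per cycle.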

I read the introduction a second time now, as I wrote this comment,
and it does talk about determinism, and highlights the fact that
standard BSV specifies a specific deterministic strategy through an
implementation-defined static analysis instead of explicit
schedules. I still wonder if it would be possible to not suggest the
other interpretation (that this is mostly about
performance reasoning).

Here would be an attempt:

  • Giving just rules gives a non-deterministic system that may behave
    in many different ways.

  • BSV uses an implementation-defined static analysis to schedule
    rules, which users need to understand well to reason about their
    programs.

  • We integrate schedule descriptions as part of the program, giving
    a deterministic semantics. In particular, one can now reason about
    cycles/performance/parallelism without having to reason on separate
    analyses in addition to the program source.

  • In particular, some HDL features are about intra-cycle computation
    (they are designed with parallelism in mind), it’s easier to think
    about them and their performance impact when scheduling is explicit.

Note: this raises the question of whether your schedule-description
mechanism is general/expressive enough to express schedules people
want to have in practice. (If I understand correctly from the
introduction, it does not let one express scheduling choices that
would depend on the system state / data. Is this a limitation in
practice?)

Reading this introduction left me with another, more general
question. For me naively (from your description of rule-based HDLs)
the natural semantics of a rule-based system would not be ORAAT, but
something like One Set Of Rules Each Cycle (OSOREC ?): a program run
is a list/stream of sets of rules that are compatible in the current
program state, each set giving the behavior during one clock
cycle. (Naively, I would expect that two rules are compatible as long
as they don’t write to the same register.)

ORAAT is a different semantics where sets of rules are taken as lists
(in a linear order), with the added constraint that the system state
should be equivalent to the state of a purely sequential execution
(one clock cycle for each rule in the list). This seems less natural
to me (it adds new incompatibilities between rules in more
complex ways). Provocatively: why do we bother with ORAAT in the first
place? I suppose that the ORAAT semantics has advantages that are
well-known to people in the field, but they are not so clear from the
introduction. (The introduction suggests that in an ORAAT semantics,
each rule can be considered in isolation, as if it executed
atomically; so maybe ease of reasoning/programming? But unless
I misunderstand this is also the case in the OSOREC semantics, where
all the rules in a cycle are independent from each other.)

One thing that is not clear to me is: are the two forms of semantics
equivalent? For any “set of compatible rules” in an OSOREC semantics,
is there a linearization of the rules that works for an ORAAT
semantics? That would be nice, but I have the impression that this is
not the case. For example, consider two registers e, o, with
o starting at 1 and e at 0, with the rules:

rule even:
  e.wr(o.r)
rule odd:
  o.wr(e.r)

then with the OSOREC semantics we can run both rules at once in each
clock cycle (so ‘e’ and ‘o’ remain the negation of each other,
exchanging their values at each clock cycle), but the ORAAT semantics
considers them incompatible (so an ORAAT execution would have both
registers converge to one value or the other). One can merge the two
writes into the same rule, and it is accepted again, but that looks
less modular.
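A minimal Python sketch of this example (my own encoding of the two semantics, not anything from the paper): the "OSOREC" cycle runs both rules against the pre-state, while the ORAAT cycle commits one rule before running the other.

```python
# Registers e, o with the rules: even: e.wr(o.r); odd: o.wr(e.r)

def osorec_cycle(e, o):
    # Both rules read the pre-state; both writes then commit: a swap.
    return o, e

def oraat_cycle(e, o):
    # Run `even` and commit, then run `odd` against the new state.
    e = o   # rule even: e.wr(o.r)
    o = e   # rule odd:  o.wr(e.r) now sees the updated e
    return e, o

e, o = 0, 1
for _ in range(3):
    e, o = osorec_cycle(e, o)
print(e, o)  # 1 0: the values keep swapping, always negations of each other

e, o = oraat_cycle(0, 1)
print(e, o)  # 1 1: both registers converge to the same value
```

This matches the observation above: no sequential (ORAAT) execution of these two rules reproduces the per-cycle swap.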


Hi Gabriel,

Thanks a lot for your message! I hope the remarks below clarify things a bit.

I think the key insight here is that ORAAT semantics are purely trace-based — they don’t capture any information about time.

In other words, if you run one cycle of a Kôika design (this is a well-defined operation, because Kôika’s semantics do have a notion of cycles), then there is a sequence of rules such that running each rule one by one, committing the resulting logs after each step, will produce the same final output (in terms of register values). Concretely, if commit_logs (interp_scheduler scheduler initial_register_values) produces new_register_values, then there is a sequence of rules r1 .. rn such that commit_logs (interp_rule rn (commit_logs (interp_rule … (commit_logs (interp_rule r1 initial_register_values))))) also equals new_register_values.
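Here is a toy Python illustration of that claim (the names interp_rule and commit_logs are borrowed from the prose above, but this encoding is mine, not Kôika's Coq definitions): with two rules touching disjoint registers, the one-cycle result has an ORAAT explanation.

```python
def interp_rule(rule, regs):
    # A rule, run over committed register values, produces a log of writes.
    return rule(regs)

def commit_logs(log, regs):
    # Committing a log applies its writes on top of the register file.
    new = dict(regs)
    new.update(log)
    return new

rule1 = lambda regs: {"a": regs["a"] + 1}
rule2 = lambda regs: {"b": regs["b"] * 2}

regs = {"a": 1, "b": 3}

# One "cycle" running both rules against the pre-state (they are
# compatible here: they touch disjoint registers).
cycle = commit_logs({**interp_rule(rule1, regs),
                     **interp_rule(rule2, regs)}, regs)

# ORAAT explanation: run and commit rule1, then run and commit rule2.
after1 = commit_logs(interp_rule(rule1, regs), regs)
oraat = commit_logs(interp_rule(rule2, after1), after1)

print(cycle == oraat)  # True: the cycle's output is ORAAT-explainable
```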

It’s in that sense that Kôika’s semantics are compatible with ORAAT: a Kôika design always produces results that can be explained in terms of one-rule-at-a-time execution.

Let me go through your post for details:

I think the key words here are “as if exactly one rule executed in each clock cycle”. ORAAT is a model in which rules never run concurrently. The illusion that we mention is that, if you forget about timing, you cannot tell from looking at the results of a computation whether it was run in parallel or sequentially (the dynamic checks introduced by Kôika make sure that serializability violations are ruled out).

The way I understood the introduction on first read was that
schedulers were justified as being important for performance
reasoning – to reason about clock cycles.

Yes, that’s the right understanding. In general, we don’t want a design’s functional correctness to hinge on how fast or slow it runs (it may be needed for some tricky cases, but generally we like designs to be latency-insensitive). This is usually achieved using queues and back pressure.

The problem then is not so
much to count clock cycles, but that programs are non-deterministic:
there are many different ORAAT traces for a given program, so it’s
unclear which behavior it specifies (and its
performance characteristics).

Expanding on this: in general, in a well-designed BlueSpec program, the order in which the rules run will not matter for functional correctness. This is critical because it allows you to reason about rules in isolation. Your typical proof will have a system-level invariant, and if you can prove that it’s never broken by any of the rules, then you can conclude that it holds for any corresponding circuit, regardless of what concurrency decisions are made (these are important for performance, of course, so it’s valuable to have Kôika-style semantics with precise performance characteristics). If the functional correctness of a design depends on its timing characteristics, you end up having to reason directly about how sequences of rules may break an invariant and re-establish it, and the reasoning becomes much more complex.

Here’s a concrete example: A CPU design might fetch instructions from memory and place them in a queue for the instruction decoder. If you know that your memory always responds within one cycle (e.g. with block RAM on an FPGA), you could just put a wire between the output of the RAM and the input of the decoder. The problem now is that to reason about the correctness of the design, you need to prove that the BRAM indeed responds on time, otherwise the decode rule will read garbage values. In BlueSpec style, you would put a queue in there, and possibly then optimize the queue away based on a performance analysis.

I think that this is what you gain by specifying an intra-cycle
schedule: the rules combined with the schedule specify
a deterministic system.

Definitely, though note that it’s primarily an advantage for reasoning about performance. It’s a bit of an anti-pattern when your functional correctness starts relying on the schedule.

However, thinking about
this more, I have the impression that there is a natural way to
assign clock cycles to a given trace of ORAAT execution: batching
rules that are compatible together into the same clock cycle, and
starting a new cycle at the first incompatible rule.

When you start thinking of how to map a program with ORAAT semantics to cycles, what you’re doing is designing a compilation strategy. The one you describe isn’t very far from what we do, though it exposes less concurrency: you put a priority order on the rules, and then you stop at the first (dynamic, I presume?) incompatibility.

ORAAT semantics don’t tell you where cycle boundaries are, so I don’t think this example lets you distinguish EHRs from registers. With ORAAT, what you see is r=0, r=1, r=2, while your concurrent implementation gives you r=0, r=2 — the end states are not distinguishable (and, since ORAAT doesn’t map to concrete cycles, it doesn’t make much sense to think of distinguishing “intermediate” stages).

What the dynamic checks in Kôika rule out is things like running r.wr0(s.rd0) and s.wr0(r.rd0) in parallel. If you did that, starting from r=0, s=1, you would observe behaviors such as r=1, s=0 (a swap), which is not compatible with a one-rule-at-a-time model.
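A quick Python sketch of why that swap is not serializable (my own encoding of the two rules as register updates): both sequential orders make the registers equal, while the parallel execution swaps them, matching neither.

```python
# Rules: rule1 is r.wr0(s.rd0); rule2 is s.wr0(r.rd0). Start from r=0, s=1.

def seq_rule1_first(r, s):
    r = s   # rule1 runs and commits
    s = r   # rule2 then sees the new r
    return r, s

def seq_rule2_first(r, s):
    s = r   # rule2 runs and commits
    r = s   # rule1 then sees the new s
    return r, s

def parallel(r, s):
    # Both rules read the pre-state, then both writes commit: a swap.
    return s, r

print(seq_rule1_first(0, 1))  # (1, 1)
print(seq_rule2_first(0, 1))  # (0, 0)
print(parallel(0, 1))         # (1, 0): matches neither sequential order
```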

  • Giving just rules gives a non-deterministic system that may behave
    in many different ways.

Correct, though the system is designed so that all these executions are acceptable.
(In the extreme, you can use a global register that tells you which state (in the finite automaton sense) your state machine is in, and disable all rules that are not in that state.)

  • BSV uses an implementation-defined static analysis to schedule
    rules,

Correct, and as a result the performance can be inscrutable.

which users need to understand well to reason about their
programs.

No, definitely not — users instead gate their rules so that correctness isn’t dependent on execution order.

  • We integrate schedule descriptions as part of the program, giving
    a deterministic semantics. In particular, one can now reason about
    cycles/performance/parallelism without having to reason on separate
    analyses in addition to the program source.

Absolutely right.

Note: this raises the question of whether your schedule-description
mechanism is general/expressive enough to express schedules people
want to have in practice. (If I understand correctly from the
introduction, it does not let one express scheduling choices that
would depend on the system state / data. Is this a limitation in
practice?)

You can express such choices, but it’s not convenient. The basic way to do this is to duplicate rules and gate them based on a read1 of a register written by an earlier rule.

It’s a great question though. Our original design had “tree” schedulers, with a different branch being taken depending on whether a rule succeeds or not, and that’s still what we have in the implementation. We took it out when the reviewers pointed out that we didn’t have examples of its use (we don’t really have any examples where it’s useful, at least not yet).

Reading this introduction left me with another, more general
question. For me naively (from your description of rule-based HDLs)
the natural semantics of a rule-based system would not be ORAAT, but
something like One Set Of Rules Each Cycle (OSOREC ?): a program run
is a list/stream of sets of rules that are compatible in the current
program state, each set giving the behavior during one clock
cycle.
(Naively, I would expect that two rules are compatible as long
as they don’t write to the same register.)

This is an interesting description, and you could probably build a language like this. You could also likely strengthen its semantics in the style of what we did with Kôika (in your case, the scheduling decisions would be how to choose the partition of rules to run in each cycle, since compatibility isn’t a transitive relation).

That being said, isn’t ORAAT a special case of OSOREC in which the sets have just one element? Or do you intend for the sets of rules to be maximal in a certain sense? (It’s not clear that this is well defined, since compatibility isn’t an equivalence relation that you can use to quotient the rule set.)

One problem with this description might be that the unit of a “rule” isn’t exactly meaningful anymore, since you can’t really talk about a single rule in isolation (the semantic units are the batches of “compatible” rules). I think you can implement OSOREC in BlueSpec by merging “compatible” rules (committing to one particular partition), and then reasoning on those merged rules. If I understand the scheme correctly, it sounds like it’ll add more complexity on top of ORAAT, but I’m not sure what it will buy you.

It would be interesting to think through how your OSOREC system composes. If I have two sets of rules and a description of each as nondeterministic transition functions (the description given to me by OSOREC), then what happens when I merge the two systems by pooling their rules together? In ORAAT, the description is a nondeterministic composition of the rules of both systems; in OSOREC, can you say something about the merged system in terms of a composition of the behaviors of the two original systems?

So overall, I think your system will make sense, but reasoning is going to be trickier, precisely because you’ll break ORAAT by forcing rules to be batched together (see below as well).

Finally, note that compatibility is a dynamic property, too (writes may be conditional), and if you try to statically partition rules into sets that do not conflict, you find yourself back where BlueSpec is, with complex heuristics over-approximating conflicts to compute what may or may not run in the same cycle.

ORAAT is a different semantics where sets of rules are taken as lists
(in a linear order), with the added constraint that the system state
should be equivalent to the state of a purely sequential execution
(one clock cycle for each rule in the list)

Not really: ORAAT permits any order. Think of it as a term rewriting system where each rule is a rewrite that operates on the system’s state, and the order of the rules isn’t pre-determined. In that sense, with a good dose of hand-waving, a well-designed BlueSpec program is a confluent one, as long as you quotient the state of the system to only preserve the parts that you care about (different execution orders will cause different things to be written in caches, branch predictors, etc., but the observable parts of the design, like writes to external memory, should be the same regardless of the order in which the rules are applied).

With this perspective, what Kôika is doing is letting you tweak how your rewriting engine applies rules, to control concurrency (and also plain efficiency: for example, you typically want to run pipelines backwards, not forwards).

Provocatively: why do we bother with ORAAT in the first
place? I suppose that the ORAAT semantics has advantages that are
well-known to people in the field, but they are not so clear from the
introduction. (The introduction suggests that in an ORAAT semantics,
each rule can be considered in isolation, as if it executed
atomically; so maybe ease of reasoning/programming? But unless
I misunderstand this is also the case in the OSOREC semantics, where
all the rules in a cycle are independent from each other.)

Exactly right: the key is modular reasoning. What you do in OSOREC is essentially take small rules, batch them together into larger rules, and then apply ORAAT semantics to the result. Now you have fewer but larger rules to reason about (and there are questions about dynamic conflicts and partitioning choices).

One thing that is not clear to me is: are the two forms of semantics
equivalent? For any “set of compatible rules” in an OSOREC semantics,
is there a linearization of the rules that works for an ORAAT
semantics?

No, definitely not, because OSOREC introduces sequencing violations; your example illustrates this perfectly. BlueSpec will let you write such a program of course, but you won’t be able to prove that e and o are always the negation of each other.

One can merge the two writes in the same rule, and it is
accepted again, but that looks less modular.

That’s a feature: rules are intended to be the smallest unit that can be reasoned on in isolation, with no concerns about timing — as written, these two rules would be a code smell.

OSOREC suffers from the same problem, because of the partitioning decisions: depending on how you choose to batch the four rules below into compatible sets, won’t you get different results?

rule even:
  e.wr(1 - o.r)
rule odd:
  o.wr(1 - e.r)
rule same_e:
  e.wr(o.r)
rule same_o:
  o.wr(e.r)
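To illustrate the question, here is a small Python sketch (my own encoding, not anything from Kôika) of those four rules: each batch reads the pre-state and then commits its writes, and different partition choices indeed give different per-cycle results starting from e=0, o=1.

```python
def run_batch(batch, e, o):
    """Run one OSOREC cycle: every rule in `batch` reads the pre-state
    (e, o); the resulting writes then commit together."""
    rules = {
        "even":   lambda e, o: ("e", 1 - o),  # e.wr(1 - o.r)
        "odd":    lambda e, o: ("o", 1 - e),  # o.wr(1 - e.r)
        "same_e": lambda e, o: ("e", o),      # e.wr(o.r)
        "same_o": lambda e, o: ("o", e),      # o.wr(e.r)
    }
    # Each batch writes distinct registers, so the writes don't clash.
    writes = dict(rules[name](e, o) for name in batch)
    return writes.get("e", e), writes.get("o", o)

print(run_batch({"even", "odd"}, 0, 1))       # (0, 1): a fixed point
print(run_batch({"same_e", "same_o"}, 0, 1))  # (1, 0): a swap
print(run_batch({"even", "same_o"}, 0, 1))    # (0, 0): both collapse
```

Three different (legal-looking) batches, three different outcomes from the same state, which is exactly the partitioning ambiguity described above.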