Schema DSL formal grammar
This is the formal grammar for the Schema DSL, the small text language
parsed by parse_schema() and produced by to_dsl(), written in ABNF
(RFC 5234, plus the case-sensitive %s"..." string notation from
RFC 7405). It is the normative
companion to the Schema model & DSL page; read that first
for context and examples, and the glossary for the terms
used here (record, field, cardinality, Scalar, Ref).
Every production below has been exercised against the real implementation
in omnist/canonical/dsl.py (the
tokenizer regex and _Parser class); see Worked
examples and the conformance tests in
tests/test_grammar_docs.py.
1. Lexical grammar (tokens)
The tokenizer is a single regex alternation
(_TOKEN) tried left to right at each
position; the first alternative that matches wins. Python's re tries
alternatives in order and takes the first match, not the longest, but
in practice each alternative here starts with a disjoint leading character
class, so the result never differs from what longest-match would produce.
Whitespace and comments are discarded, not emitted as tokens.
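The first-match rule is standard behavior of Python's re module and can be observed in isolation. The patterns below are purely illustrative and are not the alternatives used in _TOKEN:

```python
import re

# Python's re takes the first alternative that matches, not the longest.
first_match = re.match(r"a|ab", "ab").group()    # stops at the first alternative
longest_first = re.match(r"ab|a", "ab").group()  # longer alternative listed first

# When the alternatives start with disjoint characters, order cannot matter.
digits_then_word = re.match(r"[0-9]+|[a-z]+", "abc").group()
word_then_digits = re.match(r"[a-z]+|[0-9]+", "abc").group()
```

With overlapping alternatives the order changes the result; with disjoint leading characters it does not, which is the property the paragraph above relies on.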
ws = 1*( %x20 / %x09 / %x0D / %x0A ) ; space, tab, CR, LF
comment = "#" *(%x00-09 / %x0B-10FFFF) ; '#' to end of line
string = DQUOTE *( %x5C %x00-10FFFF / sans-dquote-backslash ) DQUOTE
; in the tokenizer's own terms: DQUOTE, then any run of
; "\<any char>" (a backslash escape -- the *next* character,
; whatever it is, is consumed verbatim) or any non-DQUOTE
; non-backslash character, then a closing DQUOTE.
sans-dquote-backslash = %x00-21 / %x23-5B / %x5D-10FFFF
; any character except DQUOTE (%x22) and backslash (%x5C)
DQUOTE = %x22 ; '"'
number = decimal-num / integer-num
decimal-num = ["-"] 1*DIGIT "." 1*DIGIT
integer-num = ["-"] 1*DIGIT
name = (ALPHA / "_") *(ALPHA / DIGIT / "_")
; note: NO hyphen here, unlike OML's IDENT -- a DSL `name`
; allows only [A-Za-z0-9_], never '-'.
ALPHA = %x41-5A / %x61-7A
DIGIT = %x30-39
punct = "{" / "}" / "[" / "]" / ":" / "," / "?"
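The token classes above can be transcribed into a single regex alternation. The following is a minimal sketch under stated assumptions, not the real _TOKEN: the names TOKEN_SKETCH and tokenize_sketch are hypothetical, the group order is a guess, ValueError stands in for whatever error the real tokenizer raises, and the keywords record and root are assumed to tokenize as ordinary name tokens.

```python
import re

# Hypothetical stand-in for _TOKEN, transcribed from the ABNF above.
# DOTALL lets the "\X" escape consume any next character, newline included.
TOKEN_SKETCH = re.compile(
    r"""
    (?P<ws>[ \t\r\n]+)
  | (?P<comment>\#[^\n]*)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<number>-?[0-9]+\.[0-9]+|-?[0-9]+)
  | (?P<name>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punct>[{}\[\]:,?])
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize_sketch(text):
    """Return (kind, lexeme) pairs, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_SKETCH.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup not in ("ws", "comment"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

In this sketch the decimal form is listed before the integer form inside the number group, so `1.5` is taken whole rather than as `1` followed by an error. A hyphen is only ever consumed as the sign of a number, so `a-b` fails to tokenize, consistent with the no-hyphen rule for name.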
1.1 String unescaping
A string token's value (used for a field label) is computed by
_unquote: it strips the surrounding quotes, then replaces every
backslash-escape pair \X with the single literal character X — there
is no named-escape table (no \n, \t, \uXXXX, etc., unlike OML).
\X always becomes exactly X, whatever X is, including \\ → \ and
\" → ". A label written "a\nb" is therefore the three-character string
anb: the backslash is dropped and the n is kept as a literal letter, so
no newline is produced. See Worked examples #1.
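The unescaping rule is small enough to sketch in a few lines. The helper below is a hypothetical reimplementation of the behavior described above, not the source of _unquote; the name unquote_sketch is an assumption, and the input is assumed to be a well-formed string token.

```python
import re


def unquote_sketch(token):
    """Strip the surrounding quotes, then turn every \\X pair into X."""
    body = token[1:-1]  # assumes token starts and ends with a double quote
    # DOTALL so that a backslash followed by a newline is also a valid pair.
    return re.sub(r"\\(.)", r"\1", body, flags=re.DOTALL)


label = unquote_sketch(r'"a\nb"')        # the letters a, n, b; no newline
backslash = unquote_sketch(r'"\\"')      # a single backslash
quote = unquote_sketch(r'"\""')          # a single double quote
```

Because the substitution scans left to right without overlapping, an escaped backslash followed by n yields a backslash and a literal n, never a newline.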
2. Syntactic grammar
schema = *( record-def / root-def )
; declarations may appear in any order/interleaving; there is
; no requirement that `root` come last, though by convention
; (and `to_dsl`'s own output) it does.
record-def = %s"record" name "{" [field *( "," field ) [","]] "}"
; fields are comma-separated; a single trailing comma after the
; last field is allowed (and is what `to_dsl` always emits). A
; leading comma, or two commas in a row, is not permitted.
root-def = %s"root" name
; a well-formed schema declares exactly one root. The parser
; enforces only the lower bound: parse_schema raises SchemaError
; "a schema must declare a root" if none is found, but a second
; root-def silently overwrites the first -- there is no
; duplicate-root check.
field = string [cardinality] ":" type
; the label MUST be a quoted `string` token -- an unquoted
; `name` in label position is a SchemaError ("expected a
; quoted field name"). This is the DSL's quoting rule in
; full: quoted = data string (always a label, the only use
; for a string literal in this grammar); unquoted = schema
; name (a scalar keyword or a Ref).
cardinality = "[" [int] ["," [int]] "]"
; five shapes, all accepted:
; "[" n "]" -> min = max = n
; "[" m "," n "]" -> min = m, max = n
; "[" m "," "]" -> min = m, max = unbounded (None)
; "[" "," n "]" -> min = 0, max = n
; "[" "," "]" -> min = 0, max = unbounded (None)
; "[" "]" (no digits and no comma) is a SchemaError ("empty
; cardinality"). Omitting `cardinality` entirely defaults to
; [1,1] (exactly-one, the same as "[1]").
int = 1*DIGIT
; in a well-formed schema a cardinality bound is a bare
; non-negative integer literal (no leading '-', no "."). The
; tokenizer, however, emits any `number` token in this position,
; so the checks are layered: the parser rejects a `number`
; containing "." ("cardinality must be a whole number"), while a
; *negative* bound or an inverted (max < min) range passes the
; DSL parser and is rejected one layer up, by Field's own
; constructor (SchemaError "... has an invalid cardinality
; [...]") -- see Worked examples #7-#8.
type = scalar-type / ref-type
scalar-type = scalar-name ["?"]
scalar-name = %s"string" / %s"integer" / %s"number" / %s"boolean"
/ %s"date" / %s"time" / %s"datetime"
; the seven fixed scalar kinds -- SCALAR_NAMES in
; omnist/canonical/schema.py. "?" makes the scalar nullable;
; omitting it means non-nullable.
ref-type = name
; any `name` token that is NOT one of the seven scalar-name
; keywords is parsed as a Ref to a (possibly not-yet-defined)
; record name; resolution is by name lookup in the schema's
; env, so forward references and mutual recursion both work.
; "?" CANNOT follow a ref-type: `Ref?` is a SchemaError
; ("'?' cannot apply to the reference ...; use cardinality
; [0,1] for an optional field") -- nullability is a scalar-
; only concept; optionality on a Ref is expressed via
; cardinality [0,1] instead.
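The five cardinality shapes and the default can be summarized as a mapping from text to a (min, max) pair. This is a minimal sketch assuming well-formed, non-negative bounds; parse_cardinality_sketch is a hypothetical helper that works on raw text rather than on tokens, and ValueError stands in for SchemaError.

```python
import re

# "[" [int] ["," [int]] "]" with optional whitespace between the parts.
_CARD_SKETCH = re.compile(r"\[\s*([0-9]+)?\s*(?:(,)\s*([0-9]+)?\s*)?\]")


def parse_cardinality_sketch(text=None):
    """Map a cardinality string to (min, max); max=None means unbounded."""
    if text is None:
        return (1, 1)  # omitted cardinality defaults to exactly-one
    m = _CARD_SKETCH.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a cardinality: {text!r}")
    lo, comma, hi = m.groups()
    if comma is None:
        if lo is None:
            raise ValueError("empty cardinality")
        n = int(lo)
        return (n, n)  # "[n]" means min = max = n
    # With a comma, a missing min is 0 and a missing max is unbounded.
    return (int(lo) if lo is not None else 0,
            int(hi) if hi is not None else None)
```

The sketch deliberately performs no range validation: `[1,0]` comes back as `(1, 0)`, mirroring the layering described above, where an inverted range is rejected later by the Field constructor rather than by the DSL parser.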
2.1 Reserved names
A record-def whose name is one of the seven scalar-name keywords is a
SchemaError at definition time ("... is a reserved scalar name; a record
cannot be defined with this name, or it could never be referenced..."),
because a bare name in type position is always resolved as the builtin
scalar first — defining a same-named record would make it permanently
unreachable. Defining the same record name twice is also a SchemaError
("duplicate definition ...").
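The scalar-first resolution rule, and the two definition-time checks that follow from it, can be sketched as below. All names here (SCALAR_NAMES_SKETCH, classify_type_sketch, check_record_name_sketch) are hypothetical, the tuple return values are an illustration rather than the real Scalar and Ref objects, and ValueError stands in for SchemaError.

```python
# The seven scalar keywords listed under scalar-name.
SCALAR_NAMES_SKETCH = frozenset(
    {"string", "integer", "number", "boolean", "date", "time", "datetime"}
)


def classify_type_sketch(name, nullable=False):
    """Resolve a bare name in type position: builtin scalar first, else Ref."""
    if name in SCALAR_NAMES_SKETCH:
        return ("scalar", name, nullable)
    if nullable:
        raise ValueError(
            f"'?' cannot apply to the reference {name!r}; "
            "use cardinality [0,1] for an optional field"
        )
    return ("ref", name)


def check_record_name_sketch(name, defined):
    """Reject reserved and duplicate record names, then record the name."""
    if name in SCALAR_NAMES_SKETCH:
        raise ValueError(f"{name!r} is a reserved scalar name")
    if name in defined:
        raise ValueError(f"duplicate definition {name!r}")
    defined.add(name)
```

Since classify_type_sketch checks the scalar set before anything else, a record named string could never be reached through a type position, which is why rejecting it at definition time is the safer choice.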
2.2 No value-domain composition
There is deliberately no | (union), no enum, no literal-valued field, and
no separate union/domain declaration anywhere in this grammar — a
field's type is always exactly one scalar-type or one ref-type, never
a composition of either. See the model spec for the rationale
(a composable value-domain would make schema-directed deserialization
ambiguous).
3. Comments
# starts a comment that runs to the end of the line; comments (like
whitespace) are discarded by the tokenizer before the parser ever sees a
token, so a comment may appear anywhere whitespace is allowed — between
declarations, inside a record { ... } body, after a field, etc.
4. Quoting rule (label vs. schema name): summary
This is the single most important disambiguation in the grammar, repeated
here for emphasis: a "quoted" token is always a data string — in this
grammar that only ever means a field's label. An unquoted name is always
a schema name — either one of the seven scalar keywords, or a Ref to
a record defined (or to be defined) elsewhere in the same schema. The two
spellings are never interchangeable: a bare name cannot supply a label,
and a quoted string cannot supply a type.
5. Worked examples
Each row was run against parse_schema/to_dsl to confirm the claimed
behavior (see tests/test_grammar_docs.py for the executable form).
| # | Input | Result |
|---|---|---|
| 1 | `record R { "a\nb": string }` | label is the literal 3-character string anb (the \n escape pair becomes just n, since there is no named-escape table), not a + newline + b |
| 2 | `"a" [1,5]: string` | field cardinality (min=1, max=5) |
| 3 | `"a" [5,]: string` | field cardinality (min=5, max=None), i.e. unbounded |
| 4 | `"a" [,5]: string` | field cardinality (min=0, max=5) |
| 5 | `"a" [,]: string` | field cardinality (min=0, max=None) |
| 6 | `"a" []: string` | SchemaError: "empty cardinality" |
| 7 | `"a" [-1]: string` | tokenizes fine, but Field.__init__ rejects it: SchemaError: "field 'a' has an invalid cardinality [-1,-1]" |
| 8 | `"a" [1,0]: string` | tokenizes fine (max < min), rejected the same way: SchemaError: "field 'a' has an invalid cardinality [1,0]" |
| 9 | `"a" [1.5]: string` | SchemaError: "cardinality must be a whole number, got '1.5'" |
| 10 | `"a": string?` | nullable scalar field, Scalar("string", nullable=True) |
| 11 | `"a": Other?` (Ref with ?) | SchemaError: "'?' cannot apply to the reference 'Other'; use cardinality [0,1] for an optional field" |
| 12 | `record string { "a": string }` | SchemaError: "'string' is a reserved scalar name; a record cannot be defined with this name..." |
| 13 | `record R{"a":string}\nrecord R{"b":string}` | SchemaError: "duplicate definition 'R'" |
| 14 | `record R{"a":string}` (no root) | SchemaError: "a schema must declare a root" |
| 15 | `record R{a:string}` (unquoted label) | SchemaError: "expected a quoted field name ..., got 'a'" |
| 16 | `record R { "a": string, }` (trailing comma) | OK: trailing comma after the last field is accepted |
| 17 | `# comment\nrecord R { "a": string } # trailing\nroot R` | comments anywhere whitespace is valid are discarded; schema parses normally |
| 18 | `to_dsl(parse_schema('record R { "a" [0,3]: string? }\nroot R'))` | round-trips to `'record R {\n "a" [0,3]: string?,\n}\nroot R\n'` |