OML-Core formal grammar¶
This is the formal grammar for OML-Core — the subset of OML the canonical writer emits and every conformant reader must accept — written in ABNF (RFC 5234). It is the normative companion to the user-facing tour in the OML format page; read that first for context and examples, and the glossary for the terms used here (node, edge, label, scalar value).
It also documents the OML-Extended read-only spellings the tokenizer
accepts — the raw string ('...') and triple-quoted multiline string
("""...""") forms — which the canonical writer never produces but
read_oml understands on input.
Every production below has been exercised against the real implementation
in omnist/canonical/oml.py (the _Scanner
tokenizer and _Parser parser); see Worked examples and
the conformance tests in
tests/test_grammar_docs.py.
1. Lexical grammar (tokens)¶
The tokenizer (_Scanner) scans maximal munch with a fixed priority
order at every position: it tries STRING-family and punctuation first
(each pinned to its own leading character, so these never compete with
anything else), then DATETIME, DATE, TIME, NUMBER, the three reserved float
spellings (nan, inf, -inf), INTEGER, then IDENT. The first matching
rule wins and consumes the longest match for its own pattern — there is no
later backtracking between rules. See _Scanner._next for the literal order
this grammar mirrors:
flowchart TD
start(["next token, at the current position"]) --> q{"leading\nchar?"}
q -->|"" or '"| str["STRING-family\n(dquote / raw / multiline)"]
q -->|"{ } :"| punct["punctuation"]
q -->|"other"| dt{"matches\nDATETIME?"}
dt -->|yes| DT["DATETIME"]
dt -->|no| da{"matches DATE\n(and not DATE+'T'+TIME)?"}
da -->|yes| DA["DATE"]
da -->|no| ti{"matches\nTIME?"}
ti -->|yes| TI["TIME"]
ti -->|no| nu{"matches\nNUMBER\n(decimal/exponent)?"}
nu -->|yes| NU["NUMBER"]
nu -->|no| rf{"'-inf' / 'nan' /\n'inf' spelling?"}
rf -->|yes| RF["NUMBER\n(reserved float)"]
rf -->|no| ig{"matches\nINTEGER?"}
ig -->|yes| IG["INTEGER"]
ig -->|no| id{"matches\nIDENT?"}
id -->|yes| ID["IDENT"]
id -->|no| err(["ParseError:\nstray character"])
This is exactly why nan: 1 and inf: 1 are ParseErrors (Worked example
9 below) — nan/inf/-inf are claimed by the reserved-float branch¶
before IDENT is ever tried, so they can never become a bare label; only
the quoted spelling ("nan", "inf") reaches the STRING branch instead.
; -- whitespace, comments, separators -----------------------------------
hspace = %x20 / %x09 ; space, tab (intra-line only)
comment = "#" *(%x00-09 / %x0B-10FFFF) ; '#' to end of line, NL excluded
newline = %x0D.0A / %x0A ; CRLF or LF
SEP = 1*( hspace / comment / newline / ";" )
; one or more of these collapse into a single SEP token, as
; long as at least one newline/";" occurs; pure hspace/comment
; with no newline or ";" is just skipped (no token emitted)
; -- punctuation ----------------------------------------------------------
LBRACE = "{"
RBRACE = "}"
COLON = ":"
; -- identifiers and reserved words --------------------------------------
IDENT = (ALPHA / "_") *(ALPHA / DIGIT / "_" / "-")
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
DIGIT = %x30-39 ; 0-9
reserved-word = %s"null" / %s"true" / %s"false"
reserved-number = %s"nan" / %s"inf" / "-" %s"inf"
; "nan"/"inf"/"-inf" are matched as NUMBER, never IDENT, by
; priority order (§1 above) -- they are not in the IDENT
; grammar's reserved set, they are pre-empted before IDENT is
; even tried. reserved-word ("null"/"true"/"false") *is*
; IDENT-shaped; it is excluded only at the *parser* level
; (a bare label can't be one of these three words -- see §4).
; -- numeric and temporal literals ---------------------------------------
INTEGER = ["-"] 1*DIGIT
; digit count (sign excluded) is capped at 4300 -- see §6.
NUMBER = decimal-num / exponent-num / reserved-number
decimal-num = ["-"] 1*DIGIT "." 1*DIGIT [exponent]
exponent-num = ["-"] 1*DIGIT exponent
exponent = ("e" / "E") ["+" / "-"] 1*DIGIT
DATE = 4DIGIT "-" 2DIGIT "-" 2DIGIT
; matched only when NOT immediately followed by "T" + a
; TIME-shaped lookahead -- see §1.1.
TIME = 2DIGIT ":" 2DIGIT [":" 2DIGIT ["." 1*6DIGIT]] [tz-offset]
tz-offset = ("+" / "-") 2DIGIT ":" 2DIGIT
DATETIME = DATE "T" 2DIGIT ":" 2DIGIT
[":" 2DIGIT ["." 1*6DIGIT]] [tz-offset]
; -- strings ---------------------------------------------------------------
STRING = dquote-string / raw-string / multiline-string
dquote-string = DQUOTE *( unescaped / escape ) DQUOTE
DQUOTE = %x22 ; '"'
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
; any character >= U+0020 except '"' (x22) and '\' (x5C);
; control characters (< U+0020) are rejected with a ParseError
escape = "\" ( %x22 / "\" / "/" / "b" / "f" / "n" / "r" / "t"
/ "u" 4HEXDIG [ "\" "u" 4HEXDIG ] )
; \", \\, \/, \b, \f, \n, \r, \t, or \uXXXX -- a \uXXXX in the
; high-surrogate range (D800-DBFF) MUST be immediately
; followed by a second \uXXXX low-surrogate (DC00-DFFF);
; an unpaired high or low surrogate escape is a ParseError.
HEXDIG = DIGIT / %x41-46 / %x61-66 ; 0-9 / A-F / a-f
; -- OML-Extended (read-only; the writer never emits these) ---------------
raw-string = "'" *(%x00-26 / %x28-10FFFF) "'"
; everything between single quotes is literal -- no escape
; processing at all, including backslash and newline.
multiline-string = DQUOTE DQUOTE DQUOTE [newline]
*( ( DQUOTE DQUOTE not-3-dquote ) / mlchar )
DQUOTE DQUOTE DQUOTE
; opens with """, optionally followed immediately by one
; newline (consumed, not part of the value); closes at the
; *first* run of 3 or more '"' characters encountered -- only
; the first 3 quotes of that run are consumed as the
; terminator (see §1.2); a run of exactly 1-2 quotes is
; literal content, not a terminator.
mlchar = %x09 / %x0A / %x20-10FFFF ; tab, LF, or >= U+0020
; (control chars other than tab/LF are rejected)
1.1 DATE vs. DATETIME disambiguation¶
At a position where DATE matches, the scanner does not emit a DATE
token if the character immediately after the date digits is T and the
text following that T itself matches TIME. In that case it instead
tries DATETIME (which, since DATETIME = DATE "T" TIME-body, will
succeed). If the text after T does not look like a time (e.g.
2024-01-01T99, where 99 does not match TIME's 2DIGIT ":" 2DIGIT...
shape), the scanner emits a plain DATE token for the date part, and the
T99 that follows becomes a separate IDENT token (T is alphabetic, so
T99 matches the IDENT pattern) — this is exactly maximal-munch-per-rule,
not a single combined lookahead across the whole token stream. See
Worked examples #1-#2.
1.2 Multiline string termination (OML-Extended)¶
A """...""" string closes at the first point where three or more
consecutive " characters appear. Only the first three of that run are
consumed as the closing delimiter; any further " characters in the run
are left in the source and re-tokenized by the main scanner immediately
after (so e.g. a run of 5 quotes closes the string with the first 3, and
leaves a 2-quote run — which is not by itself a valid STRING start, and
is a ParseError at the parser level, since two bare " would attempt to
open an empty-then-unterminated string). See Worked examples
7-#8.¶
1.3 Reserved spellings are tokenizer-level, not parser-level, for numbers¶
nan, inf, and -inf are recognized as NUMBER tokens by the scanner
itself (priority order in §1), so they can never be mistaken for an
IDENT — nan: 1 always means the label nan is rejected as not a
label (it's a NUMBER token, and _looks_like_edge only treats STRING
and non-reserved IDENT followed by COLON as edges), forcing a quoted
spelling "nan": 1 if a literal label nan is wanted. By contrast null,
true, false are tokenized as ordinary IDENT tokens and excluded only
by the parser (§4) — a meaningful difference if you're hand-rolling a
tokenizer-only consumer. See Worked examples #9-#10.
2. Syntactic grammar (document, edges, values)¶
document = [SEP] ( value / node-edges ) [SEP]
; a document is exactly one node: either a single scalar
; value, or a (possibly empty) list of top-level edges.
; Leading/trailing SEP is allowed; anything else left over
; after the one node is a ParseError ("unexpected trailing
; content").
node-edges = edge *( SEP edge ) / ""
; zero or more `label: value` edges, SEP-separated; the same
; label may repeat (an edge list, not a map) and edges may
; interleave freely -- there is no uniqueness or ordering
; constraint on labels.
edge = label COLON value
label = STRING / bare-label
bare-label = IDENT ; rejected at parse time if IDENT's text
; is "null", "true", or "false" (§4)
value = scalar / "{" [SEP] node-edges [SEP] "}"
; "{}" (with only SEP, if any, between the braces) is a valid
; value: the empty edge list.
scalar = STRING / INTEGER / NUMBER / DATE / TIME / DATETIME
/ %s"null" / %s"true" / %s"false"
; a bare IDENT that is none of null/true/false is a
; ParseError ("bare word ... is not a valid value here;
; strings must be quoted") -- there is no implicit
; string-from-identifier coercion anywhere in OML.
Nesting depth — the number of { braces a value may be wrapped in — is
capped at 200 (_MAX_DEPTH); see §6.
2.1 Top-level node-kind disambiguation¶
document's grammar is ambiguous on paper (a bare IDENT/STRING could
start either a scalar or an edge), but the parser resolves it with one
token of lookahead before committing (_looks_like_edge): if the current
token is a STRING or a non-reserved IDENT and the token immediately
after it is COLON, the document is parsed as node-edges; otherwise it is
parsed as a single scalar. This lookahead only fires at the very start of
the document (and recursively, at the start of each value inside { }) —
it is not a general operator-precedence ambiguity.
One sharp edge from this: null: 1 at the top level is not the
"reserved word as label" error from §4 — null fails the _looks_like_edge
check (reserved words are excluded from the edge lookahead), so the parser
takes the scalar branch, consumes null as the value None, and then
fails afterward on the unconsumed : as ordinary trailing content. The same
null: 1 written inside a node (e.g. a: { null: 1 }) does hit the
explicit reserved-word check in parse_label, with a different, more
specific error message. See Worked examples #11-#12.
3. String escaping rules¶
Inside a dquote-string or multiline-string, the recognized escapes are
exactly: \", \\, \/, \b, \f, \n, \r, \t, and \uXXXX (4 hex
digits). A \uXXXX in the UTF-16 high-surrogate range D800-DBFF must be
immediately followed by a second \uXXXX escape in the low-surrogate range
DC00-DFFF; the pair is combined into one astral code point. Any other
backslash escape, or an unpaired surrogate escape, is a ParseError. Any
literal control character (code point < U+0020, except tab/LF inside a
multiline string) is rejected unescaped.
A raw-string ('...', OML-Extended, read-only) performs no escape
processing at all — a backslash inside single quotes is a literal
backslash, and there is no way to include a ' character in a raw string
(the first ' always closes it).
The canonical writer (_write_string)
only ever emits \", \\, \n, \r, \t, and \u00XX (for other
control characters) — it never emits \/, \b, \f, surrogate pairs (it
emits literal non-ASCII characters unescaped), raw strings, or multiline
strings; those are read-only conveniences.
4. Labels: bare vs. quoted, and reserved words¶
A bare-label is an IDENT whose text is not null, true, or
false — using one of those three words unquoted as a label inside { }
is a ParseError with a message naming the specific reserved word and
suggesting the quoted spelling. Any other IDENT-shaped text, and any
STRING, is a valid label. The canonical writer
(_write_label) emits a bare label only
when the label text matches the IDENT shape and is not one of null,
true, false, nan, inf; otherwise it quotes the label.
5. Top-level document shapes¶
Unlike most formats, an OML document is not required to have a single root
object — document (§2) accepts:
- a single top-level scalar (
"hello",42,null, ...) — the document is that one leaf value, with no wrapping edge; - zero or more top-level edges, of which there can be more than one (including repeats of the same label) — there is no implicit top-level object/array wrapper.
A document containing more than one top-level item that doesn't parse as a
single edge list (e.g. two bare scalars back to back) is a ParseError.
6. Documented limits¶
| Limit | Value | Constant | Enforced where |
|---|---|---|---|
| Integer literal digit count (sign excluded) | 4300 | _MAX_INT_DIGITS in omnist/canonical/oml.py |
_Scanner._emit_integer, at tokenize time |
Nesting depth ({ levels) |
200 | _MAX_DEPTH in omnist/canonical/oml.py |
_Parser.parse_value, at parse time |
The integer-digit limit matches CPython's own default
sys.get_int_max_str_digits() — int-to-str/str-to-int conversion on
arbitrarily large digit strings is superlinear, so this is a
denial-of-service guard, not a data-modeling constraint. The nesting-depth
limit matches the Document model's own bound (document.py). Both raise
ParseError rather than letting a pathological input hang or exhaust
memory/stack. Verified exactly at the boundary: a 4300-digit integer and
200 levels of { both parse; a 4301-digit integer and 201 levels both raise
ParseError (see Worked examples #13-#14, and
tests/test_grammar_docs.py).
7. Worked examples¶
Each row was run against read_oml/write_oml to confirm the claimed
behavior (see tests/test_grammar_docs.py for the executable form).
| # | Input | Result |
|---|---|---|
| 1 | 2024-01-01T10:30 |
DATETIME token → datetime.datetime(2024, 1, 1, 10, 30) |
| 2 | 2024-01-01T99 |
DATE token (2024-01-01) followed by IDENT token (T99) → trailing-content ParseError |
| 3 | a: 'C:\no\escapes' |
raw string, value is the literal text C:\no\escapes (backslashes not processed) |
| 4 | a: """␊hello␊world""" |
multiline string, value "hello\nworld" (leading newline after """ is stripped) |
| 5 | a: """␊says ""hi"" there""" |
embedded 2-quote runs are literal; value 'says ""hi"" there' |
| 6 | null: 1 (inside { }) |
ParseError: 'null' is a reserved word and cannot be a bare label; quote it: "null" |
| 7 | a: """␊x"""" (4 trailing quotes) |
first 3 quotes close the string; the 4th quote starts an unterminated dquote-string → ParseError |
| 8 | a: """␊x""""" (5 trailing quotes) |
first 3 close the string; remaining 2 quotes are a 2-character leftover, rejected as an unexpected token after the value |
| 9 | nan: 1 |
ParseError (nan is tokenized as NUMBER, never reaches the edge/label grammar at all) |
| 10 | "nan": 1 |
OK — quoting forces the STRING token, which is a valid label |
| 11 | null: 1 (top level) |
OK as far as null (parsed as the scalar None), then ParseError on the leftover : as trailing content — a different failure mode than #6 |
| 12 | a: { null: 1 } |
the explicit reserved-word ParseError from #6 (label position is unambiguous inside { }) |
| 13 | integer with 4300 digits / 200 levels of { nesting |
both parse successfully |
| 14 | integer with 4301 digits / 201 levels of { nesting |
both raise ParseError naming the limit |
| 15 | tag: "x"␊tag: "y" |
repeated label → two edges, [('tag', 'x'), ('tag', 'y')] (an array is just a repeated label, never a list value) |
| 16 | a: {} |
empty node value → [('a', [])] |
| 17 | "hello" (top level, no label) |
the whole document is the single scalar 'hello' |