OML¶
OML (Omnist Markup Language) is Omnist's own format, and the only one
designed to hold a Document exactly, with no adjustments. Where JSON, YAML,
TOML, and XML each give up something (TOML has no null, JSON has no native
dates, XML forces a single root), OML maps onto the Document model 1:1: the
seven scalar types, null, repeated labels, interleaving, and multiple
top-level edges are all native, so check_oml always returns an empty report.
from omnist import read_oml, write_oml, Doc
d = Doc.from_oml('name: "Ann"\ntags: "x"\ntags: "y"\n')
d.to_grouped() # {'name': 'Ann', 'tags': ['x', 'y']}
d.to_oml()
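The grouped view in the snippet above can be pictured with a small stdlib-only sketch. It assumes a Document is an ordered list of (label, value) edges, as described later on this page; group_edges is a hypothetical helper written for illustration, not the library's to_grouped implementation.

```python
def group_edges(edges):
    """Fold an ordered (label, value) edge list into a dict.

    A label seen once maps to its value; a label seen more than once maps
    to a list of its values in document order. Nested edge lists recurse.
    """
    grouped = {}
    counts = {}
    for label, value in edges:
        if isinstance(value, list):          # nested node: a list of edges
            value = group_edges(value)
        counts[label] = counts.get(label, 0) + 1
        if counts[label] == 1:
            grouped[label] = value           # first occurrence: plain value
        elif counts[label] == 2:
            grouped[label] = [grouped[label], value]  # second: promote to list
        else:
            grouped[label].append(value)     # third and later: extend the list
    return grouped


edges = [("name", "Ann"), ("tags", "x"), ("tags", "y")]
print(group_edges(edges))  # {'name': 'Ann', 'tags': ['x', 'y']}
```

Grouping like this is convenient for lookups, but it drops the interleaving between different labels, which is why the edge list rather than the grouped dict is the canonical form.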
Shape¶
A document is zero or more label: value edges, one per line (or
;-separated on one line). A repeated label is how an array appears — same
as every other format. Values nest with { }:
venue: {
  name: "Strange Loop"
  building: {
    address: { street: "123 Main St"; city: "St. Louis"; country: "US" }
    room: "Ballroom A"
  }
}
session: {
  title: "Schema Compatibility, Revisited"
  speaker: {
    name: "Ada Lovelace"
    bio: """
Works on data models and provenance.
Quote: "Hopper said it best".
Path: C:\\talks\\ada\\slides.key
"""
  }
  note: "Recording starts five minutes late."
  note: 'Slides posted after the talk -- path on the laptop: C:\talks\ada\slides.key'
  start: 2024-09-18T14:00:00
  duration: 50
  tags: "schemas"
  tags: "compatibility"
}
attendee_count: 312
virtual: false
This reads straight into the Document [(venue, [...]), (session, [...]),
(attendee_count, 312), (virtual, False)] — session.note is a repeated
label (two notes, in order), session.tags likewise, and session.start
comes back as a real datetime.datetime, not a string.
Edge order here is data, not metadata: the order in which edges are written in OML is preserved in the resulting Document, exactly like any other value. Order is not a schema constraint, though: a schema validating that Document never looks at the order its edges came in. Two OML documents with the same edges in a different order build two different Documents (they compare unequal), yet both validate identically against the same schema:
from omnist import Doc, parse_schema
doc1 = Doc.from_oml('a: 1\nb: 2')
doc2 = Doc.from_oml('b: 2\na: 1')
doc1 == doc2 # False -- different Documents, order is data
s = parse_schema('record R { "a": integer, "b": integer }\nroot R')
s.validate(doc1).ok # True
s.validate(doc2).ok # True -- same result; validation ignores order
See Validation for the schema side of this.
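The two notions of sameness described above (ordered equality versus "same edges in any order") can be sketched with plain Python lists standing in for Documents. This is an illustration only: it handles flat edge lists with hashable scalar values, and same_edges_any_order is a hypothetical helper, not part of omnist.

```python
from collections import Counter

# Plain edge lists standing in for the two Documents above.
doc1 = [("a", 1), ("b", 2)]
doc2 = [("b", 2), ("a", 1)]


def same_edges_any_order(left, right):
    """True when both flat edge lists hold the same edges, ignoring order.

    Counter keeps multiplicity, so a repeated label still has to repeat
    the same number of times on both sides.
    """
    return Counter(left) == Counter(right)


print(doc1 == doc2)                      # False: list equality is ordered
print(same_edges_any_order(doc1, doc2))  # True: same multiset of edges
```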
Scalars are typed by their spelling, not a tag¶
There's no type annotation — the literal's shape says what it is:
| Spelling | Scalar |
|---|---|
| "text" | string |
| 42 / -42 | integer |
| 3.14 / 1e10 / nan / inf / -inf | number |
| true / false | boolean |
| 2024-01-01 | date |
| 12:30:00 | time |
| 2024-01-01T12:30:00 | datetime |
| null | null |
Bare words are never strings — name: Ann is a syntax error; quote it:
name: "Ann".
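To show what "typed by spelling" can look like in practice, here is a minimal classifier for the spellings listed in this section. It is a sketch under stated assumptions: the regular expressions cover only the forms shown in the table, string escapes are not processed, and classify is a hypothetical name. The real grammar is the OML-Core grammar, not this code.

```python
import datetime
import re

# Patterns for the spellings in the table only; the real grammar may accept more.
_INT = re.compile(r"-?\d+\Z")
_NUM = re.compile(r"-?(?:\d+\.\d+(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)\Z")
_DATE = re.compile(r"\d{4}-\d{2}-\d{2}\Z")
_TIME = re.compile(r"\d{2}:\d{2}:\d{2}\Z")
_DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\Z")
_KEYWORDS = {
    "true": True, "false": False, "null": None,
    "nan": float("nan"), "inf": float("inf"), "-inf": float("-inf"),
}


def classify(literal):
    """Return the Python value a scalar literal denotes, judged by shape alone."""
    if literal in _KEYWORDS:
        return _KEYWORDS[literal]
    if len(literal) >= 2 and literal[0] == '"' and literal[-1] == '"':
        return literal[1:-1]  # sketch: escape sequences are left undecoded
    if _INT.match(literal):
        return int(literal)
    if _NUM.match(literal):
        return float(literal)
    if _DATETIME.match(literal):
        return datetime.datetime.fromisoformat(literal)
    if _DATE.match(literal):
        return datetime.date.fromisoformat(literal)
    if _TIME.match(literal):
        return datetime.time.fromisoformat(literal)
    # Bare words such as Ann match nothing above, mirroring the rule in the text.
    raise ValueError(f"not a scalar literal: {literal!r}")


print(type(classify("42")).__name__)          # int
print(type(classify("2024-01-01")).__name__)  # date
```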
Mapping to the Python Document¶
read_oml doesn't build a special OML object — it builds exactly the same
canonical node every other reader builds: a scalar, or a list of
(label, value) edges (see the model spec). Each OML
scalar spelling becomes one specific Python type, with no ambiguity:
| OML spelling | Python type |
|---|---|
| "text" | str |
| 42 | int |
| 3.14 / nan / inf | float |
| true / false | bool |
| 2024-01-01 | datetime.date |
| 12:30:00 | datetime.time |
| 2024-01-01T12:30:00 | datetime.datetime |
| null | None |
A { } node becomes a nested edge list; a repeated label becomes the same
label appearing more than once, in order — not a list value. That means the
Python builder (doc(...), see the guide) can
construct the exact same Document a piece of OML parses to, field for field:
import datetime
from omnist import read_oml, doc
node = read_oml('''
name: "Ann"
role: "dev"
joined: 2024-01-01
tag: "x"
tag: "y"
manager: null
''')
built = doc({
    "name": "Ann",
    "role": "dev",
    "joined": datetime.date(2024, 1, 1),
    "tag": ["x", "y"],  # a repeated key -- becomes the label 'tag' twice
    "manager": None,
})
node == built.to_data() # True -- identical Document, two different sources
This is what "lossless" means concretely: there is no OML feature that
needs a special case in the builder, and no Document shape the builder can
make that OML can't spell out (every scalar type, null, repeats,
interleaving, arbitrary nesting).
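The way a list value in the builder turns into a repeated label can be sketched in a few lines. This is an assumption-laden illustration: to_edges is a hypothetical helper that mimics the behaviour described above (a list spreads into one edge per item, a dict becomes a nested edge list), and it is not the library's doc() builder.

```python
def to_edges(mapping):
    """Expand a plain dict into an ordered list of (label, value) edges.

    A list value is spread into one edge per item under the same label;
    a dict value becomes a nested edge list.
    """
    edges = []
    for label, value in mapping.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                item = to_edges(item)  # nested node
            edges.append((label, item))
    return edges


print(to_edges({"name": "Ann", "tag": ["x", "y"], "manager": None}))
# [('name', 'Ann'), ('tag', 'x'), ('tag', 'y'), ('manager', None)]
```

A plain dict has unique keys, so this sketch cannot express interleaving such as a, b, a; how the real builder spells that out is covered in the guide, not here.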
Reading¶
Without a schema¶
Every OML scalar is already typed by its own literal spelling (the two tables above), so reading OML without a schema hands back the right Python type for every leaf. There is no separate coercion step as there is for JSON or XML: an unquoted literal arrives already carrying the type its spelling names:
from omnist import read_oml
read_oml('d: 2024-01-01\nn: 3')
# [('d', datetime.date(2024, 1, 1)), ('n', 3)]
type(dict(read_oml('d: 2024-01-01\nn: 3'))['d'])
# <class 'datetime.date'>
The one case where the type may not match the intent is a value written as
a quoted string: "2024-01-01" is unambiguously a str by its own spelling
(quoting a date does not produce a date literal), so it stays a str unless
a schema says otherwise:
read_oml('s: "2024-01-01"')
# [('s', '2024-01-01')]
type(dict(read_oml('s: "2024-01-01"'))['s'])
# <class 'str'>
With a schema: validation, not type-upgrading¶
Like every other format, read_oml(text, schema=...) runs the leaves
through the same schema-directed conversion described in
schema-directed deserialization. For OML this
matters less for type-upgrading than it does for the other formats,
because OML's literal syntax already produces the right type for an
unquoted scalar: schema= is a no-op for d: 2024-01-01 above. Where it
still converts is the quoted-string case, whose type is unambiguous (str)
but whose intent is not: it may be "just a string" or "a date written
defensively in quotes". The example below also shows an integer literal
(n: 3) read as 3.0 when the schema declares the field a number:
from omnist import parse_schema, Doc, read_oml
s = parse_schema('record R { "d": date, "n": number }\nroot R')
read_oml('d: "2024-01-01"\nn: 3', schema=s)
# [('d', datetime.date(2024, 1, 1)), ('n', 3.0)]
Doc.from_oml(text, schema=s) is the same conversion through the Doc
wrapper — it just calls read_oml underneath:
Doc.from_oml('d: "2024-01-01"\nn: 3', schema=s).to_data()
# [('d', datetime.date(2024, 1, 1)), ('n', 3.0)]
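The two conversions visible in that output (a quoted string becoming a date, and an integer becoming a float under number) can be mimicked with a small stdlib sketch. Everything here is hypothetical: convert_leaf and the string type names are invented for illustration, and the library's actual schema-directed conversion is described in schema-directed deserialization, not reproduced here.

```python
import datetime


def convert_leaf(value, target):
    """Coerce one already-parsed leaf toward a schema's scalar type.

    `target` is a hypothetical type name; only two cases are sketched.
    """
    if target == "date" and isinstance(value, str):
        return datetime.date.fromisoformat(value)  # "2024-01-01" -> date
    if target == "number" and isinstance(value, int) and not isinstance(value, bool):
        return float(value)                        # 3 -> 3.0
    return value                                   # already the right type: no-op


fields = {"d": "date", "n": "number"}      # stand-in for the schema's record R
raw = [("d", "2024-01-01"), ("n", 3)]      # what a schema-less read would give
print([(label, convert_leaf(value, fields[label])) for label, value in raw])
# [('d', datetime.date(2024, 1, 1)), ('n', 3.0)]
```

A leaf that already has the target type, such as a date parsed from an unquoted 2024-01-01, passes through unchanged, which matches the no-op case described above.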
Once read, the result validates the same way regardless of format (see
the Schema model & DSL for the schema side of this) — for
OML, Schema.validate is the main reason to pass a schema at all, since
shape and field-presence problems (a missing field, the wrong cardinality)
are exactly what validation — not deserialization — catches.
Writing¶
from omnist import write_oml, Doc
write_oml([("name", "Ada")]) # 'name: "Ada"'
Doc.of({"name": "Ada"}).to_oml() # 'name: "Ada"'
OML is the one format with no WriteReport adjustments: check_oml always returns an empty report, because every Document shape (all seven scalars, null, repeats, interleaving, multiple top-level edges) maps onto OML without loss. There is no strict= / report= machinery on write_oml for the same reason.
The canonical writer always emits LF newlines, 2-space indentation, and the minimal string-escape form (\" and \\ plus literal Unicode), so the same input always produces the same output (useful for diffing/snapshotting).
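As a rough picture of what a deterministic writer with these properties involves, here is a stdlib-only sketch that serializes an edge list with LF newlines, 2-space indentation, and only \" and \\ escaped. It is not write_oml: the layout chosen for nested nodes is an assumption, labels that need quoting (true, inf, and so on) are not handled, and write_edges and format_scalar are hypothetical names.

```python
import datetime
import math


def format_scalar(value):
    """Spell one scalar using the literal forms from the tables above."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        if math.isnan(value):
            return "nan"
        if math.isinf(value):
            return "inf" if value > 0 else "-inf"
        return repr(value)
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, str):
        # Minimal escaping: backslash first, then the double quote.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    raise TypeError(f"unsupported scalar: {value!r}")


def write_edges(edges, indent=0):
    """Serialize an edge list, one edge per line, 2 spaces per nesting level."""
    pad = "  " * indent
    lines = []
    for label, value in edges:
        if isinstance(value, list):          # nested node
            lines.append(f"{pad}{label}: {{")
            if value:
                lines.append(write_edges(value, indent + 1))
            lines.append(f"{pad}}}")
        else:
            lines.append(f"{pad}{label}: {format_scalar(value)}")
    return "\n".join(lines)                  # LF only, no trailing newline


print(write_edges([("name", "Ada")]))  # name: "Ada"
```

Because nothing in the function depends on state outside its input, the same edge list always yields the same text, which is the property that makes snapshot diffs stable.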
Strings: escaping, raw, and multiline¶
A normal string supports the usual escapes: \", \\, \n, \t, \r, \b,
\f, and \uXXXX (a surrogate pair of two \uXXXX escapes denotes one
astral code point, e.g. U+1F600). The canonical writer only ever emits the
minimal form: \" and \\ plus literal Unicode, nothing more.
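The surrogate-pair rule is the least obvious of these escapes, so here is a minimal decoder for the escape set listed above. It is a sketch, not the library's tokenizer: unescape and _hex4 are hypothetical names, and rejecting unknown escapes and lone surrogates is an assumption about how strict a reader would be.

```python
_SIMPLE = {'"': '"', "\\": "\\", "n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f"}
_HEX = "0123456789abcdefABCDEF"


def _hex4(body, start):
    """Read exactly four hex digits starting at `start`."""
    digits = body[start:start + 4]
    if len(digits) != 4 or any(c not in _HEX for c in digits):
        raise ValueError("\\u needs four hex digits")
    return int(digits, 16)


def unescape(body):
    """Decode the escapes in the text between a normal string's quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
        elif kind == "u":
            unit = _hex4(body, i + 2)
            i += 6
            if 0xD800 <= unit <= 0xDBFF:          # high surrogate: needs a partner
                if body[i:i + 2] != "\\u":
                    raise ValueError("lone high surrogate")
                low = _hex4(body, i + 2)
                if not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError("invalid low surrogate")
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 6
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("lone low surrogate")
            out.append(chr(unit))
        else:
            raise ValueError("unknown escape: " + kind)
    return "".join(out)


print(hex(ord(unescape(r"\uD83D\uDE00"))))  # 0x1f600
```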
Two extra spellings (read-only — write_oml never produces them, so reading
one and writing it back changes layout, never meaning):
- Raw '…': no escape processing at all; ideal for paths/regexes: 'C:\talks\ada\slides.key'. The one limitation: it can't contain ' (there's no escape for it inside raw strings); use the quoted or multiline form instead.
- Multiline """…""": may contain literal newlines (a newline right after the opening """ is stripped, so the first content line doesn't have to share the delimiter's line); ordinary escapes still work inside. Because the terminator is three quotes, a lone or double " is just content; only a run of three needs escaping, as \""".
Newlines inside a multiline string are never confused with the structural
line separator: the tokenizer reads """…""" as one token from open to
close (the same way a string can contain # without it becoming a
comment), so only a newline outside any token separates one edge from the
next.
Separators¶
Edges are separated by one or more newlines and/or ;. There's no comma —
OML has no array literal, so nothing invites one. ; is for one-line
("inline") style: { a: 1; b: 2 }.
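The rule that a newline or ; separates edges only outside any token can be illustrated with a small scanner that skips over string tokens and tracks brace depth. This is a simplified sketch, not the real tokenizer: split_edges and _string_end are hypothetical names, comments starting with # are not handled, and nothing is validated beyond string termination.

```python
def _string_end(text, i):
    """Return the index just past the string token that starts at text[i]."""
    if text.startswith('"""', i):
        close, j, escapes = '"""', i + 3, True   # multiline string
    elif text[i] == '"':
        close, j, escapes = '"', i + 1, True     # normal string
    else:
        close, j, escapes = "'", i + 1, False    # raw string: no escapes
    while j < len(text):
        if escapes and text[j] == "\\":
            j += 2                               # skip the escaped character
            continue
        if text.startswith(close, j):
            return j + len(close)
        j += 1
    raise ValueError("unterminated string")


def split_edges(text):
    """Split text into top-level edge chunks at newlines or ';'.

    Separators count only outside strings and outside { } nesting.
    """
    chunks, start, depth, i = [], 0, 0, 0
    while i < len(text):
        ch = text[i]
        if ch in "\"'":
            i = _string_end(text, i)             # jump over the whole token
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch in "\n;" and depth == 0:
            chunks.append(text[start:i])
            start = i + 1
        i += 1
    chunks.append(text[start:])
    return [c.strip() for c in chunks if c.strip()]


print(split_edges('a: 1; b: "x;y"\nc: { d: 1; e: 2 }'))
# ['a: 1', 'b: "x;y"', 'c: { d: 1; e: 2 }']
```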
Errors, not silent surprises¶
A few things that look almost-valid are deliberate hard errors, not lenient parses:
read_oml("a: 1 b: 2") # ParseError -- no separator between edges
read_oml('{ a: 1 }\nb: 2') # ParseError -- braces must wrap the WHOLE document
read_oml("true: 1") # ParseError -- true/false/null can't be bare labels
read_oml('"true": 1') # fine -- quoting always works
read_oml("inf: 1") # ParseError -- inf/nan are reserved NUMBER spellings too
read_oml('"inf": 1') # fine -- write_oml always quotes these labels for you
There's also a digit-count limit on bare integers (4300 digits, matching
CPython's own default sys.get_int_max_str_digits()) and a nesting-depth
limit of 200, matching the Document model's own bound — both raise
ParseError rather than letting a pathological input hang or crash the
process.
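A sketch of how limits like these can be enforced before any expensive work happens. The names are hypothetical (this ParseError is a local stand-in, not the library's class), and how the real reader counts depth (whether the top level is level 0 or 1) is an assumption.

```python
class ParseError(ValueError):
    """Local stand-in for the library's error type (hypothetical here)."""


MAX_INT_DIGITS = 4300   # the digit limit quoted above
MAX_DEPTH = 200         # the nesting limit quoted above


def parse_int_literal(text):
    """Convert a bare integer literal, checking its length first."""
    digits = text[1:] if text.startswith("-") else text
    if len(digits) > MAX_INT_DIGITS:
        # Reject by length so the conversion itself is never attempted.
        raise ParseError(f"integer literal has {len(digits)} digits")
    return int(text)


def check_depth(node, limit=MAX_DEPTH):
    """Walk an edge list iteratively; raise if nesting exceeds `limit`.

    An explicit stack avoids hitting Python's own recursion limit on
    pathological input. The top-level edge list counts as depth 1.
    """
    stack = [(node, 1)]
    while stack:
        current, depth = stack.pop()
        if depth > limit:
            raise ParseError(f"nesting deeper than {limit}")
        for _label, value in current:
            if isinstance(value, list):
                stack.append((value, depth + 1))


print(parse_int_literal("-42"))  # -42
```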
Notes¶
- Not yet implemented: a few further OML-Extended conveniences from the design draft, namely digit separators (1_000_000), non-decimal integer literals (0x1F, 0o17, 0b1010), the pair-free astral escape \u{1F600}, a lenient date time (space) datetime separator, and a trailing Z UTC-zone marker. None of these affect what OML can represent (every Document already round-trips); they're optional input sugar for later.
- For the full formal grammar, see the OML-Core grammar.
- See the comparison table for how OML maps every other format's special-cased features, feature by feature.