Schema-directed deserialization
Every reader (read_json / read_yaml / read_toml / read_xml / read_oml,
and the matching Doc.from_*) produces a node from raw text.
Without a schema=, every leaf is exactly whatever the format's own
native parser produced — nothing is upgraded. With schema=, the reader
additionally converts each leaf to match the schema's declared
Scalar kind, wherever the conversion is value-exact.
This page covers that conversion: what changes, what doesn't, and why it's
safe to do without guessing.
Schema awareness is one-directional, read-only: a reader optionally
takes schema= to upgrade leaves on the way in, but no writer
(write_json/write_yaml/write_toml/write_xml/write_oml, or the
matching Doc.to_* methods) accepts a schema at all — a writer serializes
the Document exactly as it is, never consulting a schema for how to shape
the output.
flowchart LR
schema["Schema"] -. "schema=" .-> reader["reader (read_*)"]
text["format text"] --> reader --> doc1["Document"]
doc1 --> writer["writer (write_*)"] --> text2["format text"]
The core distinction, demonstrated
The same JSON text, read with and without a schema, can hand back a Document where the same field holds a different Python type. That's the whole point of the feature:
from omnist import parse_schema, read_json
text = '{"d": "2024-01-01", "n": 3}'
# No schema: leaves are exactly what JSON's own parser produces.
no_schema = read_json(text)
print(no_schema) # [('d', '2024-01-01'), ('n', 3)]
print(type(dict(no_schema)["d"])) # <class 'str'>
# With schema: leaves are additionally upgraded to match the declared Scalar.
s = parse_schema('record R { "d": date, "n": number }\nroot R')
with_schema = read_json(text, schema=s)
print(with_schema) # [('d', datetime.date(2024, 1, 1)), ('n', 3.0)]
print(type(dict(with_schema)["d"])) # <class 'datetime.date'>
Without schema=, the JSON string "2024-01-01" is a plain str — JSON has
no date type, so its parser can't produce anything else. With schema=,
the same string is upgraded to a real datetime.date because the schema
says the field d is a date and the string is a value-exact ISO-8601 date.
Likewise the JSON integer 3 becomes the Python float 3.0, because the
schema says n is a number.
What "no schema" already looks like, per format
The JSON "before" picture above — a leaf is just whatever the format's native parser hands back — is not the same starting point for every format. Some formats' own parsers already produce native Python temporal types for some scalars, with no schema involved at all:
| Format | A date leaf with no schema= |
|---|---|
| JSON | str (e.g. "2024-01-01") — JSON has no date type |
| YAML | datetime.date already — PyYAML's own loader recognizes unquoted ISO dates |
| TOML | datetime.date already — tomllib/TOML's grammar has a native date literal |
| XML | str (e.g. "2024-01-01") — XML has no date type |
| OML | str if written as a quoted string; OML has no separate date literal either, so a date leaf only becomes a real datetime.date once schema= upgrades it |
This means that for YAML and TOML, reading a date field without a schema
can already give you a datetime.date, so passing schema= is a no-op for
that field: the value already matches the declared scalar exactly. For
JSON, XML, and OML, the upgrade from str to datetime.date happens only
once a schema is supplied. Verified directly:
from omnist import parse_schema, read_json, read_yaml, read_toml, read_xml
s = parse_schema('record D { "d": date }\nroot D')
type(dict(read_json('{"d": "2024-01-01"}'))["d"]) # str
type(dict(read_json('{"d": "2024-01-01"}', schema=s))["d"]) # datetime.date
type(dict(read_yaml('d: 2024-01-01'))["d"]) # datetime.date (already!)
type(dict(read_yaml('d: 2024-01-01', schema=s))["d"]) # datetime.date
type(dict(read_toml('d = 2024-01-01'))["d"]) # datetime.date (already!)
type(dict(read_toml('d = 2024-01-01', schema=s))["d"]) # datetime.date
type(dict(read_xml('<d>2024-01-01</d>'))["d"]) # str
type(dict(read_xml('<d>2024-01-01</d>', schema=s))["d"]) # datetime.date
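The JSON, XML, and TOML rows of the table can also be cross-checked against Python's standard-library parsers, independently of omnist. The sketch below assumes those parsers are representative of the "native" behaviour described above; whether omnist delegates to exactly these modules is not stated here. YAML is omitted because PyYAML is a third-party package, and tomllib is guarded because it only exists from Python 3.11 onward.

```python
import datetime
import json
import xml.etree.ElementTree as ET

try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    tomllib = None

# JSON and XML have no date type, so the leaf comes back as plain text.
json_leaf = json.loads('{"d": "2024-01-01"}')["d"]
xml_leaf = ET.fromstring("<d>2024-01-01</d>").text
print(type(json_leaf).__name__)  # str
print(type(xml_leaf).__name__)   # str

# TOML has a native date literal, so the parser already returns a date.
toml_leaf = None
if tomllib is not None:
    toml_leaf = tomllib.loads("d = 2024-01-01")["d"]
    print(type(toml_leaf).__name__)  # date
```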
Why the conversion is unambiguous by construction
A schema's field declares exactly one Scalar
(or one Ref), never a union and never an enum of candidate types. So when
deserialization looks at a raw leaf value and a field's declared scalar,
there is never a choice among candidate representations to resolve,
only one question: does this value exactly fit the one declared scalar,
or not? That's why the conversion can run automatically with no
configuration and no heuristics.
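Because each field names exactly one scalar, choosing a converter reduces to a single lookup. The sketch below illustrates that idea in plain Python. `CONVERTERS`, `to_date`, `to_number`, and `convert_leaf` are hypothetical names made up for this illustration, not part of omnist, and the library's internals may well differ.

```python
import datetime


def to_date(value):
    """Return value as a datetime.date, or raise ValueError."""
    # datetime is a subclass of date in Python; keep the two kinds distinct.
    if isinstance(value, datetime.datetime):
        raise ValueError(f"{value!r} is a datetime, not a date")
    if isinstance(value, datetime.date):
        return value
    if isinstance(value, str):
        return datetime.date.fromisoformat(value)  # ValueError on bad text
    raise ValueError(f"{value!r} cannot be read as date")


def to_number(value):
    """Return value as a float, or raise ValueError."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{value!r} cannot be read as number")
    return float(value)


# One declared scalar kind maps to one converter: no candidates to try in turn.
CONVERTERS = {"date": to_date, "number": to_number}


def convert_leaf(kind, value):
    return CONVERTERS[kind](value)


print(convert_leaf("date", "2024-01-01"))  # 2024-01-01
print(convert_leaf("number", 3))           # 3.0
```

A design with unions would instead have to try several converters and pick a winner, which is where guessing would creep in.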
Shape problems (a missing or unexpected field, the wrong cardinality, a
record where a scalar is expected) are left to Schema.validate, not
raised by deserialization. Both materialize and schema= convert only a
leaf that can already be identified as belonging to a known field's
scalar; mismatched shapes pass through unchanged for validation to flag.
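The split between converting leaves and leaving shapes alone can be sketched as follows. This is a minimal illustration over plain dicts, assuming a hypothetical `FIELD_KINDS` mapping in place of a real schema and a hypothetical `upgrade_record` helper; it handles only string leaves for date fields and is not omnist's implementation.

```python
import datetime

# Hypothetical stand-in for a schema: field name -> declared scalar kind.
FIELD_KINDS = {"d": "date"}


def upgrade_record(record, field_kinds):
    """Convert leaves of known fields; pass shape mismatches through."""
    upgraded = {}
    for name, value in record.items():
        kind = field_kinds.get(name)
        if kind is None or isinstance(value, (dict, list)):
            # Unexpected field, or a container where a scalar was declared:
            # leave it unchanged so a later validation step can report it.
            upgraded[name] = value
        elif kind == "date" and isinstance(value, str):
            upgraded[name] = datetime.date.fromisoformat(value)
        else:
            raise ValueError(f"{name}: {value!r} cannot be read as {kind}")
    return upgraded


extra = upgrade_record({"d": "2024-01-01", "extra": 1}, FIELD_KINDS)
print(extra)   # {'d': datetime.date(2024, 1, 1), 'extra': 1}

nested = upgrade_record({"d": {"unexpected": "record"}}, FIELD_KINDS)
print(nested)  # {'d': {'unexpected': 'record'}}
```

Neither the stray `extra` field nor the nested record raises here; in the design described above, reporting them is the job of validation.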
When a conversion isn't value-exact: ParseError
If a leaf's raw value doesn't exactly fit the declared scalar, deserialization
raises ParseError rather than guessing or silently leaving the value
unconverted:
from omnist import parse_schema, read_json, ParseError
s = parse_schema('record R { "n": integer }\nroot R')
read_json('{"n": "abc"}', schema=s)
# ParseError: $.n: 'abc' cannot be read as integer (not a value-exact conversion)
1.5 into integer fails the same way (1.5 has a fractional part, so it's
not a value-exact int), while 4.0 into integer succeeds (4.0 is
value-exact as 4).
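The value-exact test for integer can be sketched in a few lines. `to_integer_exact` is a hypothetical helper written for this illustration and raises ValueError rather than the library's ParseError. It rejects every string, which covers the 'abc' case above; whether omnist accepts numeric strings such as "3" is not stated on this page, so that choice is an assumption.

```python
def to_integer_exact(value):
    """Return value as an int only if no information would be lost."""
    # bool is a subclass of int in Python; treat it as a different kind.
    if isinstance(value, bool):
        raise ValueError(f"{value!r} cannot be read as integer")
    if isinstance(value, int):
        return value
    # A float qualifies only when it has no fractional part (4.0, not 1.5).
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError(f"{value!r} cannot be read as integer")


print(to_integer_exact(4.0))  # 4

rejected = []
for raw in (1.5, "abc", True):
    try:
        to_integer_exact(raw)
    except ValueError:
        rejected.append(raw)
print(rejected)  # [1.5, 'abc', True]
```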
Conversion rules
The full per-kind mapping lives in one place: model spec §10, the formal
definition this page's examples are derived from. For each Scalar kind it
sets out what validation accepts (validation checks a value already in
the document and never converts) against what deserialization
additionally converts and rejects. It also carries the accompanying
notes: "bool never satisfies integer/number," "number always
deserializes to float," "date/datetime stay mutually exclusive," and
"shape mismatches are validation's job."
materialize: upgrading an already-parsed node
schema= on a reader is sugar for parsing, then calling materialize
directly. Use materialize when you already have a node — from a reader
called without schema=, from doc(), or built by hand — and want the same
upgrade applied after the fact:
| Function | Purpose |
|---|---|
| materialize(node, schema) -> node | Apply the schema-directed upgrade to an already-parsed node |
from omnist import materialize, parse_schema, read_json
s = parse_schema('record R { "d": date }\nroot R')
node = read_json('{"d": "2024-01-01"}') # no schema yet: 'd' is a str
materialize(node, s) # [('d', datetime.date(2024, 1, 1))]
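The "sugar" relationship between the two entry points can be sketched with the standard library alone. `read_json_sketch` and `materialize_sketch` are hypothetical stand-ins, a set of date field names plays the role of the schema, and plain dicts play the role of nodes; none of this is omnist's actual code.

```python
import datetime
import json


def materialize_sketch(node, date_fields):
    """Upgrade the named string leaves of an already-parsed dict to dates."""
    return {
        key: datetime.date.fromisoformat(value) if key in date_fields else value
        for key, value in node.items()
    }


def read_json_sketch(text, date_fields=None):
    """Parse first; apply the upgrade only when a 'schema' is supplied."""
    node = json.loads(text)
    if date_fields is None:
        return node
    return materialize_sketch(node, date_fields)


text = '{"d": "2024-01-01"}'
one_step = read_json_sketch(text, date_fields={"d"})
two_step = materialize_sketch(read_json_sketch(text), {"d"})
print(one_step == two_step)  # True
```

Structuring the reader this way keeps a single conversion path, so a node upgraded after the fact matches one upgraded at read time.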
See the API reference for the bare function signatures.