Schema-directed deserialization

Every reader (read_json / read_yaml / read_toml / read_xml / read_oml, and the matching Doc.from_*) produces a node from raw text. Without a schema=, every leaf is exactly whatever the format's own native parser produced — nothing is upgraded. With schema=, the reader additionally converts each leaf to match the schema's declared Scalar kind, wherever the conversion is value-exact. This page covers that conversion: what changes, what doesn't, and why it's safe to do without guessing.

Schema awareness is one-directional, read-only: a reader optionally takes schema= to upgrade leaves on the way in, but no writer (write_json/write_yaml/write_toml/write_xml/write_oml, or the matching Doc.to_* methods) accepts a schema at all — a writer serializes the Document exactly as it is, never consulting a schema for how to shape the output.

flowchart LR
    schema["Schema"] -. "schema=" .-> reader["reader (read_*)"]
    text["format text"] --> reader --> doc1["Document"]
    doc1 --> writer["writer (write_*)"] --> text2["format text"]

The core distinction, demonstrated

The same JSON text, read with and without a schema, can hand back a Document where the same field holds a different Python type. That's the whole point of the feature:

from omnist import parse_schema, read_json

text = '{"d": "2024-01-01", "n": 3}'

# No schema: leaves are exactly what JSON's own parser produces.
no_schema = read_json(text)
print(no_schema)                  # [('d', '2024-01-01'), ('n', 3)]
print(type(dict(no_schema)["d"]))  # <class 'str'>

# With schema: leaves are additionally upgraded to match the declared Scalar.
s = parse_schema('record R { "d": date, "n": number }\nroot R')
with_schema = read_json(text, schema=s)
print(with_schema)                  # [('d', datetime.date(2024, 1, 1)), ('n', 3.0)]
print(type(dict(with_schema)["d"]))  # <class 'datetime.date'>

Without schema=, the JSON string "2024-01-01" is a plain str — JSON has no date type, so its parser can't produce anything else. With schema=, the same string is upgraded to a real datetime.date because the schema says the field d is a date and the string is a value-exact ISO-8601 date. Likewise the JSON integer 3 becomes the Python float 3.0, because the schema says n is a number.

What "no schema" already looks like, per format

The JSON "before" picture above — a leaf is just whatever the format's native parser hands back — is not the same starting point for every format. Some formats' own parsers already produce native Python temporal types for some scalars, with no schema involved at all:

| Format | A date leaf with no schema= |
| --- | --- |
| JSON | str (e.g. "2024-01-01"); JSON has no date type |
| YAML | datetime.date already; PyYAML's own loader recognizes unquoted ISO dates |
| TOML | datetime.date already; tomllib/TOML's grammar has a native date literal |
| XML | str (e.g. "2024-01-01"); XML has no date type |
| OML | str if written as a quoted string; OML has no separate date literal either, so a date leaf only becomes a real datetime.date once schema= upgrades it |

This means that for YAML and TOML, reading a date field without a schema can already give you a datetime.date, so passing schema= is a no-op for that field: the value already matches the declared scalar exactly. For JSON, XML, and OML, the upgrade from str to datetime.date only happens once a schema is supplied. Verified directly for JSON, YAML, TOML, and XML:

from omnist import parse_schema, read_json, read_yaml, read_toml, read_xml

s = parse_schema('record D { "d": date }\nroot D')

type(dict(read_json('{"d": "2024-01-01"}'))["d"])                  # str
type(dict(read_json('{"d": "2024-01-01"}', schema=s))["d"])        # datetime.date

type(dict(read_yaml('d: 2024-01-01'))["d"])                        # datetime.date  (already!)
type(dict(read_yaml('d: 2024-01-01', schema=s))["d"])               # datetime.date

type(dict(read_toml('d = 2024-01-01'))["d"])                       # datetime.date  (already!)
type(dict(read_toml('d = 2024-01-01', schema=s))["d"])              # datetime.date

type(dict(read_xml('<d>2024-01-01</d>'))["d"])                     # str
type(dict(read_xml('<d>2024-01-01</d>', schema=s))["d"])            # datetime.date
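
The same starting points can be seen without omnist at all, using only standard-library parsers. This is a sketch for illustration: whether omnist's readers wrap exactly these parsers is an assumption, and the variable names are hypothetical. YAML is left out because PyYAML is not part of the standard library, and the TOML part runs only on Python 3.11 or later, where tomllib exists.

```python
import json
import sys
import xml.etree.ElementTree as ET

# JSON and XML have no date type, so their parsers hand back plain strings.
json_leaf = json.loads('{"d": "2024-01-01"}')["d"]
xml_leaf = ET.fromstring("<d>2024-01-01</d>").text
print(type(json_leaf).__name__)
print(type(xml_leaf).__name__)

# TOML has a native date literal, so its parser returns a datetime.date.
toml_leaf = None
if sys.version_info >= (3, 11):
    import tomllib

    toml_leaf = tomllib.loads("d = 2024-01-01")["d"]
    print(type(toml_leaf).__name__)
```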

Why the conversion is unambiguous by construction

A schema's field declares exactly one Scalar (or one Ref), never a union and never an enum of candidate types. So when deserialization looks at a raw leaf value and a field's declared scalar, there is never a choice among candidate representations, only one question: does this value exactly fit the one declared scalar, or not? That's why the conversion can run automatically with no configuration and no heuristics.
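
One way to picture this: because each field names a single scalar kind, conversion reduces to one table lookup followed by one exactness check. The sketch below is a hypothetical illustration in plain Python, not omnist's implementation. CONVERTERS, convert_leaf, to_date, and to_number are invented names, and only two kinds are modeled.

```python
import datetime


def to_date(value):
    """Return a datetime.date, or raise ValueError if the value does not fit."""
    # datetime.datetime subclasses datetime.date, so rule it out first
    # (date and datetime are treated as mutually exclusive).
    if isinstance(value, datetime.datetime):
        raise ValueError(f"{value!r} is a datetime, not a date")
    if isinstance(value, datetime.date):
        return value  # already the declared type: nothing to do
    if isinstance(value, str):
        return datetime.date.fromisoformat(value)  # ValueError on bad text
    raise ValueError(f"{value!r} cannot be read as date")


def to_number(value):
    """Return a float; bool is rejected even though it subclasses int."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{value!r} cannot be read as number")
    return float(value)


# One declared kind per field means one converter per lookup:
# there are no candidate types to rank or guess between.
CONVERTERS = {"date": to_date, "number": to_number}


def convert_leaf(kind, value):
    return CONVERTERS[kind](value)


print(convert_leaf("date", "2024-01-01"))  # 2024-01-01
print(convert_leaf("number", 3))           # 3.0
```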

Shape problems — a missing or unexpected field, the wrong cardinality, a record where a scalar is expected — are left to Schema.validate, not raised by deserialization. materialize/schema= only ever converts a leaf it can already identify as belonging to a known field's scalar; it passes mismatched shapes through unchanged for validation to flag.
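
A minimal sketch of that division of labour, assuming a flat record modeled as a list of (key, value) pairs and a schema modeled as a plain dict from field name to scalar kind. Both models, and the names upgrade_leaves and FIELD_KINDS, are hypothetical simplifications rather than omnist's real data structures. Fields the schema does not know, and values that are not leaves at all, are returned untouched so that a later validation step can report them.

```python
FIELD_KINDS = {"n": "number"}  # hypothetical stand-in for a parsed schema


def upgrade_leaves(pairs, field_kinds):
    """Convert only leaves of known fields; pass shape mismatches through."""
    out = []
    for key, value in pairs:
        kind = field_kinds.get(key)
        is_leaf = not isinstance(value, (list, dict))
        if kind == "number" and is_leaf:
            # bool is excluded on purpose: it never counts as a number here.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"$.{key}: {value!r} cannot be read as number")
            value = float(value)
        # Unknown field, or a container where a scalar was declared:
        # leave it alone and let validation flag it.
        out.append((key, value))
    return out


print(upgrade_leaves([("n", 3), ("extra", "kept as-is")], FIELD_KINDS))
print(upgrade_leaves([("n", {"nested": 1})], FIELD_KINDS))
```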

When a conversion isn't value-exact: ParseError

If a leaf's raw value doesn't exactly fit the declared scalar, deserialization raises ParseError rather than guessing or silently leaving the value unconverted:

from omnist import parse_schema, read_json, ParseError

s = parse_schema('record R { "n": integer }\nroot R')
read_json('{"n": "abc"}', schema=s)
# ParseError: $.n: 'abc' cannot be read as integer (not a value-exact conversion)

Reading 1.5 into an integer field fails the same way (1.5 has a fractional part, so no int equals it), while reading 4.0 into an integer field succeeds (4.0 is exactly 4).
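
The integer cases above can be reproduced with a small value-exactness check. This is a sketch under stated assumptions: to_integer_exact is a hypothetical helper, it raises ValueError where omnist raises ParseError, and it rejects every string, so it does not settle how omnist treats numeric text such as "3".

```python
def to_integer_exact(value):
    """Return an int equal to value, or raise ValueError if none exists."""
    if isinstance(value, bool):
        # bool subclasses int in Python, but it never satisfies integer.
        raise ValueError(f"{value!r} cannot be read as integer")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)  # 4.0 -> 4 loses nothing
    raise ValueError(f"{value!r} cannot be read as integer")


for raw in (4.0, 1.5, "abc"):
    try:
        print(repr(raw), "->", to_integer_exact(raw))
    except ValueError as exc:
        print(repr(raw), "-> error:", exc)
```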

Conversion rules

The full per-kind mapping lives in one place: model spec §10, the formal definition this page's examples are derived from. For each Scalar kind it lists what validation accepts (validation checks a value already in the document and never converts) versus what deserialization additionally converts and rejects. It also carries the accompanying notes: bool never satisfies integer or number, number always deserializes to float, date and datetime stay mutually exclusive, and shape mismatches are validation's job.

materialize: upgrading an already-parsed node

schema= on a reader is sugar for parsing, then calling materialize directly. Use materialize when you already have a node — from a reader called without schema=, from doc(), or built by hand — and want the same upgrade applied after the fact:

materialize(node, schema) -> node: apply the schema-directed upgrade to an already-parsed node

from omnist import materialize, parse_schema, read_json

s = parse_schema('record R { "d": date }\nroot R')
node = read_json('{"d": "2024-01-01"}')          # no schema yet: 'd' is a str
materialize(node, s)                              # [('d', datetime.date(2024, 1, 1))]
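
The "sugar" relationship can be sketched with the standard json module standing in for a reader. Everything here is hypothetical: read_json_sketch and materialize_sketch are invented names, the schema is modeled as a dict from field name to scalar kind, and only date is handled. Returning a new node rather than mutating the input is also an assumption that this page does not confirm.

```python
import datetime
import json


def materialize_sketch(pairs, field_kinds):
    """Return a new list of pairs with date fields upgraded from ISO strings."""
    out = []
    for key, value in pairs:
        if field_kinds.get(key) == "date" and isinstance(value, str):
            value = datetime.date.fromisoformat(value)
        out.append((key, value))
    return out


def read_json_sketch(text, schema=None):
    """Parse first; apply the upgrade only when a schema is supplied."""
    pairs = list(json.loads(text).items())
    return pairs if schema is None else materialize_sketch(pairs, schema)


schema = {"d": "date"}
text = '{"d": "2024-01-01"}'

one_step = read_json_sketch(text, schema=schema)
two_step = materialize_sketch(read_json_sketch(text), schema)
print(one_step == two_step)  # True
```

Under these assumptions the one-step and two-step routes produce the same node, which is the property the schema= shortcut relies on.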

See the API reference for the bare function signatures.