XML¶
A deliberately narrow data-XML profile: elements only, used to carry the
same Documents as the other formats. Install pip install defusedxml so
parsing is hardened against entity-expansion / XXE attacks (without it, the
standard-library parser is used and read_xml emits an UnsafeXMLWarning).
from omnist import read_xml, Doc
d = Doc(read_xml("<order><id>A1</id>"
"<item><sku>W</sku></item><item><sku>G</sku></item></order>"))
d.to_json() # '{"order": {"id": "A1", "item": [{"sku": "W"}, {"sku": "G"}]}}'
How it maps¶
- An element with child elements becomes a node; each child tag is an edge label.
- Repeated elements are a repeated label —
<item/><item/>is the labelitemtwice, i.e. an array, exactly like JSON"item": [{…}, {…}]. - A leaf element is a scalar — its text content.
Read raw, repeated <item> elements come back as the repeated-label edge list
directly, not the regrouped JSON-shaped array:
from omnist import read_xml
read_xml('<items><item>x</item><item>y</item></items>')
# [('items', [('item', 'x'), ('item', 'y')])]
Single document element¶
XML has exactly one top-level element, so an XML Document always has a single top-level edge (the document element's tag). That's why a Document meant to round-trip through XML is single-rooted — wrap your data under one top-level key:
read_xml("<order>…</order>") # -> [("order", […])]
To share a Document with JSON/YAML/TOML, give them a matching single top-level
key ({"order": {…}}). Writing requires a single top-level edge — a Document
with several top-level edges has no single XML document form and raises.
Interleaving is preserved¶
Because the Document is an ordered edge list, XML's interleaved repeats
survive on read — <m/><x/><m/> reads as [(m,…), (x,…), (m,…)], the one thing
a dict-with-arrays can't represent. (Projecting to JSON groups the ms, since
JSON can't interleave.)
Reading¶
Without a schema¶
Element text is untyped, but read_xml's coercion heuristic does try
int/float/bool before falling back to str — it does not attempt
any date/time coercion, since a date string is indistinguishable from any
other string by spelling alone without a declared scalar to check it against:
from omnist import read_xml
read_xml('<r><n>30</n><f>3.5</f><ok>true</ok><d>2024-01-01</d></r>')
# [('r', [('n', 30), ('f', 3.5), ('ok', True), ('d', '2024-01-01')])]
<n>30</n> reads as the integer 30 and <ok>true</ok> as True, but
<d>2024-01-01</d> stays the plain str '2024-01-01' — XML has no date
type and read_xml doesn't guess one from the shape of the text.
With a schema¶
schema= upgrades a leaf to match the schema's declared scalar wherever the
conversion is value-exact — this is what turns the date string above into a
real datetime.date:
from omnist import parse_schema, read_xml
s = parse_schema('record Inner { "d": date, "n": number }\n'
'record R { "r": Inner }\nroot R')
read_xml('<r><d>2024-01-01</d><n>3</n></r>', schema=s)
# [('r', [('d', datetime.date(2024, 1, 1)), ('n', 3.0)])]
(The schema's shape has to mirror the document's — XML always wraps its
content in a single document element, here <r>, so the schema needs a
record for that wrapper too.) See
schema-directed deserialization for the full
conversion rules. Doc.from_xml(text, schema=s) is the same conversion
through the Doc wrapper — it just calls read_xml underneath:
from omnist import Doc
Doc.from_xml('<r><d>2024-01-01</d><n>3</n></r>', schema=s).to_data()
# [('r', [('d', datetime.date(2024, 1, 1)), ('n', 3.0)])]
Writing¶
from omnist import write_xml, Doc
write_xml([("order", [("id", "A1")])])
# '<order>\n <id>A1</id>\n</order>\n'
Doc.of({"order": {"id": "A1"}}).to_xml()
# '<order>\n <id>A1</id>\n</order>\n'
A key that isn't a legal XML element name is sanitized on write (e.g.
"a b"→<a_b>, reported askey.sanitized), and a date/time value is written as text (temporal.stringified).A string that looks like another type reads back as that type. If you write the string
"30"as a leaf, it round-trips as the integer30, not the string"30"—write_xmlreports this ahead of time asstring.ambiguousso you know it won't come back the way it went in.An empty internal node (zero edges,
[]) is indistinguishable from an empty-string leaf ("") once written: both serialize to<tag />, andread_xmlalways reconstructs the empty-string leaf. Writing[]is reported asshape.empty_ambiguousso you know ahead of time that it won't round-trip; writing""round-trips fine and is not flagged.A string containing a character XML 1.0 cannot represent (most C0 control characters -- everything below U+0020 except tab/LF/CR -- or a UTF-16 surrogate) would otherwise produce text that isn't well-formed XML, so
write_xmlreplaces each such character with U+FFFD (the standard replacement character) and reportsstring.illegal_xml_charwith"error"severity --strict=Trueraises instead of silently substituting.A string containing
\ris legal XML, but XML mandates line-ending normalization on parse (\rand\r\nboth become\n), so it doesn't round-trip byte-for-byte.write_xmlleaves\ras-is (no substitution needed) and reports it asstring.cr_normalizedso you know ahead of time the read-back value will differ.See adjustment reports to inspect any of these, or
strict=Trueto raise instead of adjusting.
Notes¶
- Not supported (outside the data-XML profile): attributes, mixed text and
elements, and CDATA. A namespace prefix is stripped (
<n:a>reads asa). - See the comparison table for how XML's attribute- and namespace-dropping stack up against the other formats.