roxmltree parsing strategy
XML parsing is hard. Everyone knows that. But the other problem is that an XML document can be represented in very different ways:
- You can preserve comments or ignore them completely or partially.
- You can represent text data as a separate node or embed it into the element node.
- You can keep CDATA as a separate node or merge it into the text node.
- You can preserve the XML declaration or ignore it completely.
- ... and many more.
This document explains how roxmltree parses and represents an XML document.
XML declaration
The XML declaration is completely ignored, mostly because it doesn't contain any valuable information for us.
- version is expected to be 1.*. Otherwise an error will occur.
- encoding is irrelevant since we are parsing only valid UTF-8 strings.
- And no one really follows the standalone constraints.
DTD
Only ENTITY objects will be resolved. Everything else will be ignored
at the moment.
<!DOCTYPE test [
<!ENTITY a 'text<p/>text'>
]>
<e>&a;</e>
will be parsed into:
<e>text<p/>text</e>
Where p is an element, not text.
Comments
All comments will be preserved.
Processing instructions
All processing instructions will be preserved.
Whitespaces
All whitespace inside the root element will be preserved.
<p>
    text
</p>
it will be parsed as \n␣␣␣␣text\n.
The same goes for escaped whitespace:
<p>&#x20;&#x20;text&#x20;&#x20;</p>
it will be parsed as ␣␣text␣␣.
CDATA
CDATA will be merged into the surrounding text node:
<p>t<![CDATA[e ]]> x<![CDATA[t]]></p>
it will be parsed as te  xt.
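The merging rule above can be illustrated with a small standalone sketch. It is not roxmltree's implementation: the `merge_cdata` helper is hypothetical, uses only the standard library, and assumes the element content holds nothing but text and CDATA sections.

```rust
/// Flattens element content into a single text value by dropping the
/// `<![CDATA[` and `]]>` markers and keeping everything else verbatim.
/// Illustration only: assumes the content holds just text and CDATA
/// sections (no child elements, comments, or entity references).
fn merge_cdata(content: &str) -> String {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";

    let mut out = String::new();
    let mut rest = content;
    while let Some(start) = rest.find(OPEN) {
        // Plain text before the CDATA section.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                // The CDATA payload is copied as-is.
                out.push_str(&after_open[..end]);
                rest = &after_open[end + CLOSE.len()..];
            }
            None => {
                // Unterminated section: keep the remainder untouched.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    // Trailing plain text after the last CDATA section.
    out.push_str(rest);
    out
}

fn main() {
    let text = merge_cdata("t<![CDATA[e ]]> x<![CDATA[t]]>");
    println!("{:?}", text);
}
```

Both spaces survive because one comes from inside the first CDATA section and the other from the plain text that follows it.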
Text
Text will be unescaped. All entity references will be resolved.
<!DOCTYPE test [
<!ENTITY b 'Some text'>
]>
<p>&b;</p>
it will be parsed as Some text.
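As a rough sketch of what unescaping involves, the function below resolves the five predefined XML entities, numeric character references, and entities taken from a lookup table. The `unescape` name and the `entities` map (standing in for the ENTITY declarations collected from the DTD) are assumptions made for this illustration, not part of roxmltree's API.

```rust
use std::collections::HashMap;

/// Resolves `&name;`, `&#N;` and `&#xN;` references in a text value.
/// `entities` stands in for the ENTITY declarations collected from the DTD.
/// Returns `None` on an unknown or malformed reference.
/// Sketch only: replacement text is inserted literally, so entities that
/// contain markup or further references are not handled.
fn unescape(text: &str, entities: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        // Copy the plain text before the reference.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after.find(';')?;
        let name = &after[..semi];
        if let Some(hex) = name.strip_prefix("#x") {
            // Hexadecimal character reference, e.g. `&#x20;`.
            out.push(char::from_u32(u32::from_str_radix(hex, 16).ok()?)?);
        } else if let Some(dec) = name.strip_prefix('#') {
            // Decimal character reference, e.g. `&#65;`.
            out.push(char::from_u32(dec.parse::<u32>().ok()?)?);
        } else {
            match name {
                "lt" => out.push('<'),
                "gt" => out.push('>'),
                "amp" => out.push('&'),
                "apos" => out.push('\''),
                "quot" => out.push('"'),
                // Anything else must have been declared in the DTD.
                _ => out.push_str(entities.get(name)?),
            }
        }
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Some(out)
}

fn main() {
    let mut entities = HashMap::new();
    entities.insert("b", "Some text");
    println!("{:?}", unescape("&b;", &entities));
}
```

A real parser also has to handle entities whose replacement text contains elements, as in the DTD example earlier, which this sketch deliberately leaves out.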
Attribute-Value Normalization
Attribute-Value Normalization works as explained in the XML specification (section 3.3.3).
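For readers who don't have the spec at hand, here is a minimal sketch of the whitespace part of that rule. The function names are hypothetical and the code is not taken from roxmltree. It assumes character and entity references are resolved elsewhere (the spec keeps whitespace produced by character references such as `&#xA;` unchanged).

```rust
/// Whitespace step of attribute-value normalization for a CDATA-typed
/// attribute: each literal tab, line feed, or carriage return becomes a
/// single space (a `\r\n` pair counts as one line break).
/// Sketch only: character and entity references are assumed to be
/// resolved elsewhere.
fn normalize_cdata_attr(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Line-end normalization turns "\r\n" into one "\n" first.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push(' ');
            }
            '\t' | '\n' => out.push(' '),
            _ => out.push(c),
        }
    }
    out
}

/// Extra step the spec applies to non-CDATA attribute types (for example
/// ID or NMTOKENS): trim both ends and collapse runs of spaces.
fn normalize_tokenized_attr(raw: &str) -> String {
    normalize_cdata_attr(raw)
        .split(' ')
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{:?}", normalize_cdata_attr("a\tb\r\nc\nd"));
    println!("{:?}", normalize_tokenized_attr("  x \t y\n"));
}
```

The second function only matters when a DTD declares a non-CDATA attribute type. Attributes without such a declaration are treated as CDATA by the spec.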
Namespaces resolving
roxmltree has complete support for XML namespaces.