# roxmltree parsing strategy

XML parsing is hard, and everyone knows that. The other problem is that
a parsed document can be represented in very different ways:

- You can preserve comments or ignore them completely or partially.
- You can represent text data as a separate node or embed it into the element node.
- You can keep CDATA as a separate node or merge it into the text node.
- You can preserve the XML declaration or ignore it completely.
- ... and many more.

This document explains how *roxmltree* parses and represents an XML document.

## XML declaration

The [XML declaration](https://www.w3.org/TR/xml/#NT-XMLDecl) is completely ignored,
mostly because it doesn't contain any information that is valuable to us.

- `version` is expected to be `1.*`. Otherwise, an error will occur.
- `encoding` is irrelevant, since we parse only valid UTF-8 strings.
- And no one really follows the `standalone` constraints.

## DTD

Only `ENTITY` declarations will be resolved. Everything else is ignored
at the moment.

```xml
<!DOCTYPE test [
    <!ENTITY a 'text<p/>text'>
]>
<e>&a;</e>
```

will be parsed into:

```xml
<e>text<p/>text</e>
```

Here, `p` is an element, not text.

## Comments

All comments will be preserved.

## Processing instructions

All processing instructions will be preserved.

## Whitespaces

All whitespace inside the root element will be preserved.

```xml
<p>
    text
</p>
```

It will be parsed as `\n␣␣␣␣text\n`.

The same goes for escaped whitespace:

```xml
<p>&#x20;&#x20;text&#x20;&#x20;</p>
```

It will be parsed as `␣␣text␣␣`.

## CDATA

CDATA will be embedded into a text node:

```xml
<p>t<![CDATA[e ]]> x<![CDATA[t]]></p>
```

It will be parsed as `te  xt`.

## Text

Text will be unescaped. All entity references will be resolved.

```xml
<!DOCTYPE test [
    <!ENTITY b 'Some text'>
]>
<p>&b;</p>
```

It will be parsed as `Some text`.

## Attribute-Value Normalization

[Attribute-Value Normalization](https://www.w3.org/TR/xml/#AVNormalize) works
as explained in the spec.

## Namespaces resolving

*roxmltree* has complete support for XML namespaces.
