# roxmltree

[crates.io](https://crates.io/crates/roxmltree)
[docs.rs](https://docs.rs/roxmltree)
[Rust](https://www.rust-lang.org)

Represents an [XML](https://www.w3.org/TR/xml/) document as a read-only tree.

```rust
// Find element by id.
let doc = roxmltree::Document::parse("<rect id='rect1'/>")?;
// `find` returns an `Option`, so it cannot share `?` with the `Result` above.
let elem = doc.descendants().find(|n| n.attribute("id") == Some("rect1")).unwrap();
assert!(elem.has_tag_name("rect"));
```

## Why read-only?

Because in some cases all you need is to retrieve some data from an XML document.
And for such cases, we can make a lot of optimizations.

## Parsing behavior

Sadly, XML can be parsed in many different ways. *roxmltree* tries to mimic the
behavior of Python's [lxml](https://lxml.de/).
For more details see [docs/parsing.md](https://github.com/RazrFalcon/roxmltree/blob/master/docs/parsing.md).

## Alternatives

| Feature/Crate                   | roxmltree        | [libxml2]           | [xmltree]        | [sxd-document]   |
| ------------------------------- | :--------------: | :-----------------: | :--------------: | :--------------: |
| Element namespace resolving     | ✓                | ✓                   | ✓                | ~<sup>1</sup>    |
| Attribute namespace resolving   | ✓                | ✓                   |                  | ✓                |
| [Entity references]             | ✓                | ✓                   | ×                | ×                |
| [Character references]          | ✓                | ✓                   | ✓                | ✓                |
| [Attribute-Value normalization] | ✓                | ✓                   |                  |                  |
| Comments                        | ✓                | ✓                   |                  | ✓                |
| Processing instructions         | ✓                | ✓                   | ✓                | ✓                |
| UTF-8 BOM                       | ✓                | ✓                   | ×                | ×                |
| Non UTF-8 input                 |                  | ✓                   |                  |                  |
| Complete DTD support            |                  | ✓                   |                  |                  |
| Position preserving<sup>2</sup> | ✓                | ✓                   |                  |                  |
| HTML support                    |                  | ✓                   |                  |                  |
| Tree modification               |                  | ✓                   | ✓                | ✓                |
| Writing                         |                  | ✓                   | ✓                | ✓                |
| No **unsafe**                   | ✓                |                     | ✓                |                  |
| Language                        | Rust             | C                   | Rust             | Rust             |
| Dependencies                    | **0**            | -                   | 2                | 2                |
| Tested version                  | 0.20.0           | Apple-provided      | 0.10.3           | 0.3.2            |
| License                         | MIT / Apache-2.0 | MIT                 | MIT              | MIT              |

Legend:

- ✓ - supported
- × - parsing error
- ~ - partial
- *nothing* - not supported

Notes:

1. No default namespace propagation.
2. *roxmltree* keeps all node and attribute positions in the original document,
   so you can easily retrieve them if you need them.
   See [examples/print_pos.rs](examples/print_pos.rs) for details.

There are also the `elementtree` and `treexml` crates, but they have been abandoned for a long time.

[Entity references]: https://www.w3.org/TR/REC-xml/#dt-entref
[Character references]: https://www.w3.org/TR/REC-xml/#NT-CharRef
[Attribute-Value Normalization]: https://www.w3.org/TR/REC-xml/#AVNormalize

[libxml2]: http://xmlsoft.org/
[xmltree]: https://crates.io/crates/xmltree
[sxd-document]: https://crates.io/crates/sxd-document

## Performance

Here are some benchmarks comparing `roxmltree` to other XML tree libraries.

```text
test huge_roxmltree      ... bench:   2,997,887 ns/iter (+/- 48,976)
test huge_libxml2        ... bench:   6,850,666 ns/iter (+/- 306,180)
test huge_sdx_document   ... bench:   9,440,412 ns/iter (+/- 117,106)
test huge_xmltree        ... bench:  41,662,316 ns/iter (+/- 850,360)

test large_roxmltree     ... bench:   1,494,886 ns/iter (+/- 30,384)
test large_libxml2       ... bench:   3,250,606 ns/iter (+/- 140,201)
test large_sdx_document  ... bench:   4,242,162 ns/iter (+/- 99,740)
test large_xmltree       ... bench:  13,980,228 ns/iter (+/- 229,363)

test medium_roxmltree    ... bench:     421,137 ns/iter (+/- 13,855)
test medium_libxml2      ... bench:     950,984 ns/iter (+/- 34,099)
test medium_sdx_document ... bench:   1,618,270 ns/iter (+/- 23,466)
test medium_xmltree      ... bench:   4,315,974 ns/iter (+/- 31,849)

test tiny_roxmltree      ... bench:       2,522 ns/iter (+/- 31)
test tiny_libxml2        ... bench:       8,931 ns/iter (+/- 235)
test tiny_sdx_document   ... bench:      11,658 ns/iter (+/- 82)
test tiny_xmltree        ... bench:      20,215 ns/iter (+/- 303)
```

When compared to streaming XML parsers, `roxmltree` is slightly slower than [quick-xml],
but still much faster than [xml-rs].
Note that streaming parsers usually do not provide proper string unescaping,
DTD resolving or namespace support.

```text
test huge_quick_xml      ... bench:   2,997,887 ns/iter (+/- 48,976)
test huge_roxmltree      ... bench:   3,147,424 ns/iter (+/- 49,153)
test huge_xmlrs          ... bench:  36,258,312 ns/iter (+/- 180,438)

test large_quick_xml     ... bench:   1,250,053 ns/iter (+/- 21,943)
test large_roxmltree     ... bench:   1,494,886 ns/iter (+/- 30,384)
test large_xmlrs         ... bench:  11,239,516 ns/iter (+/- 76,937)

test medium_quick_xml    ... bench:     206,232 ns/iter (+/- 2,157)
test medium_roxmltree    ... bench:     421,137 ns/iter (+/- 13,855)
test medium_xmlrs        ... bench:   3,975,916 ns/iter (+/- 44,967)

test tiny_quick_xml      ... bench:       2,233 ns/iter (+/- 70)
test tiny_roxmltree      ... bench:       2,522 ns/iter (+/- 31)
test tiny_xmlrs          ... bench:      17,155 ns/iter (+/- 429)
```

### Notes

The benchmarks were taken on an Apple M1 Pro.
You can run the benchmarks yourself with `cargo bench` in the `benches` dir.

- Since the libraries differ in how much of XML they support, benchmarking is a bit pointless.
- We bench *libxml2* using the *[rust-libxml]* wrapper crate.

[xml-rs]: https://crates.io/crates/xml-rs
[quick-xml]: https://crates.io/crates/quick-xml
[rust-libxml]: https://github.com/KWARC/rust-libxml

## Memory overhead

`roxmltree` tries to use as little memory as possible to allow parsing
very large (multi-GB) XML files.

The peak memory usage doesn't directly correlate with the file size
but rather with the number of nodes and attributes a file has,
how many attributes had to be normalized (i.e. allocated),
and how many text nodes had to be preprocessed (i.e. allocated).

`roxmltree` never allocates element and attribute names, processing instructions
and comments.

By disabling the `positions` feature, you can shave 8 bytes from each node and attribute.

On average, the overhead is around 6-8x the file size.
For example, our 1.1GB sample XML will peak at 7.6GB RAM with default features enabled
and at 6.8GB RAM when `positions` is disabled.

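
The figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: the function names are hypothetical, 1 GB is assumed to be 10^9 bytes, and the whole difference between the two peaks is assumed to come from the 8 bytes saved per node and attribute.

```rust
/// Peak memory divided by input size (both in GB).
fn overhead_ratio(peak_gb: f64, file_gb: f64) -> f64 {
    peak_gb / file_gb
}

/// Rough number of nodes plus attributes implied by the memory saved when
/// `positions` is disabled, assuming 8 bytes saved per item and 1 GB = 10^9 bytes.
fn implied_item_count(with_positions_gb: f64, without_positions_gb: f64) -> f64 {
    (with_positions_gb - without_positions_gb) * 1e9 / 8.0
}
```

For the sample file, `overhead_ratio(7.6, 1.1)` is about 6.9 and `overhead_ratio(6.8, 1.1)` is about 6.2, both within the stated 6-8x range. Under the assumptions above, `implied_item_count(7.6, 6.8)` suggests roughly 100 million nodes and attributes.
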
## Safety

- This library must not panic. Any panic should be considered a critical bug and reported.
- This library forbids `unsafe` code.

## API

This library provides an idiomatic Rust API based on iterators.
If you are more familiar with browser/JS DOM APIs, you can check out
[tests/dom-api.rs](tests/dom-api.rs) to see how they can be mapped onto the Rust one.

## License

Licensed under either of

- [Apache License v2.0](LICENSE-APACHE)
- [MIT license](LICENSE-MIT)

at your option.

## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.