GSoC 2009 for NetBSD: xmltools
  1. In search of a good pull-style XML parser
  2. XML as a tree representation
  3. xmlgrep toy benchmarks
  4. Improving performances of xmlgrep
  5. xmlgrep by examples: playing with Atom
  6. More benchmarks: the bitvopt branch
  7. xmlsed preview: writing XML
  8. xmltools update: new I/O layer and further plans
  9. xmltools implementation: automata and backtracking
  10. xmltools: what's done, what's left
  11. xmltools: after the GSoC

xmlgrep toy benchmarks

(Nhat Minh Lê (rz0) @ 2009-06-12 01:06:18)
As xmlgrep is approaching a pre-pre-alpha release (something that is still very experimental, but nonetheless working, somewhat :), I’ve been doing some basic tests on it, including one simple benchmark. The results are, IMHO, encouraging. The benchmark pits xmlgrep against equivalent existing software:

Since each tool accepts its own patterns and treats them differently, I’ve tried my best to get meaningful results with each, but obviously, the behaviors obtained differ somewhat.

The tests/memplot script was used to make the sampling. The data were plotted using Gnuplot. The following commands were run:

$ xmlgrep 'hw/?="Ab\"a*cus"' d.xml
$ xmlgrep 'hw[?="Ab\"a*cus"]' d.xml
$ xml sel -t -c "//hw[text()='Ab"a*cus']" d.xml
$ xml_grep "hw[string()='Ab\"a*cus']" d.xml
$ grep 'Ab"a\*cus' d.xml

The goal was to retrieve the <hw> element which contains the Ab"a*cus string from a 53M dictionary, retrieved from http://www.ibiblio.org/webster/ and merged into a single file.

The first two lines correspond to two different invocations of xmlgrep, which do not do exactly the same thing, and with the first being the good one, the other inefficiently abusing the look-ahead mechanism. Yet, it was included for comparison purposes and maybe to be fair to the XPath-based alternatives, which have to rely on a predicate.

Results follow; the second and third pictures are just zooms which omit one or another of the candidates.

I know these results don’t mean much, so don’t take them too seriously. I’ve only tested a single simple pattern. Actually, I didn’t really intend to make a comparison, at first, I just wanted to make sure xmlgrep doesn’t leak or otherwise misuse memory. But since the sampling code was there, I thought it’d be fun to do some additional measurements.

Overall benchmark
Low-memory candidates benchmark
Fast candidates benchmark

As expected, xmlstarlet, which uses a full DOM model requires a lot of memory to do its work (more than 400M!). But it is the fastest.

My xmlgrep, when used right, is still nearly 42% slower than xmlstarlet (from above 4s to below 6s), but I think the times remain reasonable; memory-wise, it is also the lightest, with a constant 2.9M in use throughout the whole run. Even GNU grep requires about 2.8M.

Surprisingly, the Twig xmlgrep is not too big a memory killer, though its memory usage increases over time, by steps (though it looks linear on the long run). This appears somewhat strange to me; why some nodes should be pruned from its internal DOM-like tree while others seem to remain for the duration of the program. However, on the speed side, Twig is predictably very slow; it takes nearly two minutes to produce the results comparable to those of its C counterparts.

That’s it for tonight; as one friend of mine told me, more interesting benchmarks will probably come from actual users.