xmlsed prototype

(Nhat Minh Lê (rz0) @ 2009-10-12 22:30:46)

Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, under the master branch (everything’s been merged back into master some time ago).

It’s mostly untested, and the exact interface/behavior is likely to change, but if you’re interested in helping out, please read the man pages included in the distribution (xmlsed(1) and xml_pattern(7) should get you started) and play a bit with some XML data (you can find sample files in tests/data/).

So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.

It is based on the new event-handling mini internal framework and uses the same patterns as xmlgrep (and the same matching code), augmented with the register-binding operator >= which lets you put captured nodes into a named variable, called a register. Then you can put nodes of the form <$REGNAME/> in your replacement templates, and voilà! The register’s contents gets inserted in place of the <$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>' should swap around two consecutive nodes a and b… well, I very much hope it does.

Well, that’s for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can even do a lot more weird stuffs in selections like selecting and grouping together non directly connected parts of the tree. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).

I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted, unnecessarily complex and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility and probably bring some over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have that big patch of NetBSD yacc doing just that, waiting in my Git repository, but it probably needs more polishing before I can hope to submit it to the mailing lists to get myself flamed for doing things on my own, but oh well, we’ll see.

Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.

Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:

Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)