Tuesday, September 29, 2009

On joys and sorrows of library development – part 1

This may come as a surprise, but I am not dead. In fact, what you see is a new post! As usual I have a lot of interesting themes to cover, and barely enough time to spare. While I'm at it, let me tell you about NDAs. I hate NDAs with a passion – I've got some things to blog about that are partially covered by NDA (of course, the interesting parts are NOT); also I've been thinking that this is a non-issue and basically that I can blog about things that are not quite critical, but half a year ago or so I was forced to remove a blog post; the reasons are not exactly clear but it seems that it was because of a single sentence that mentioned something that's NOT secret in my point of view and was NOT relevant to post contents. For this reason I'm hesitant to write about some topics so I'll either skip them altogether (which is a shame) or find a way to omit all details that might seem sensitive to people. Also I'm not sure if blogging about post removal due to NDA is an NDA violation?..

Anyway, the topic for today is something different – I'll write a bit about library development. In past few years I've developed and maintained a C++ XML parser PugiXML. This is a tiny library which focuses on performance and ease of use. We've had tremendous speedups of export process after converting from TinyXML, and I know lots of other success stories. PugiXML is portable (lots of platforms and compilers are supported, I've gone through special efforts to support MSVC6 and old CodeWarriors), console-aware (i.e. you can toggle off STL/exception support, override memory management, etc.), small, robust, etc. It even features an XPath evaluator!

PugiXML was born as a project to clean up pugxml – initial idea was to strip pugxml header from sources (thus reducing compilation/linking times), slightly cleanup interface and use it. What followed was an almost complete rewrite of the code, bringing the parser closer to standard compliance, adding useful features for DOM inspection, and greatly improving speed. There are bits of code left from pugxml, and interface is very similar, but it's quite a different project now. As far as I know, the only parser in use that beats PugiXML at parsing speed is RapidXML, and the only major problem with PugiXML is that it's Unicode support is pretty much limited by UTF8. Though both of those may change at some point in the future :)

I'm going to write some stuff here that may be of interest to other people.

1. Interface


The initial API was taken as-is from pugxml; in the hindsight, this was both a good (since it offered a very simple transition for pugxml users) and bad thing. It's a bad thing because the interface is seriously cluttered.

For example, there are at least four methods of traversing children nodes: you can use the next_sibling() function (the DOM is structured as a graph of nodes, with nodes connected via pointers; each node contains a pointer to both right and left siblings, the function gets the right one), you can use the node iterator, you can use xml_tree_walker (which is a Visitor-like interface), and finally you can grab all child elements via an insert iterator with all_elements_by_name(). Oh, and you can use XPath, which makes five methods.

As another example, every method for string-based queries (i.e. xml_node::attribute(const char*), which means “give me the first attribute with the following name) has a corresponding method which uses wildcard matching instead of string comparison (i.e. node.attribute_w(“foo*ba?”) will match foobar or fooatbaz).

Overall, it's not that much (I have a friend who's been working with a codebase that has an interface with 760+ virtual functions, so I'm not easily scared) and it does not stand in the way while you're using the library, but it certainly does not help maintaining and developing it.

But the worst part is that I can't remove any of those functions. For example, I consider tree walker to be a bad abstraction; it's rarely usable, and if it is, it's easy to write it outside the library. If I had a full API usage statistics, I could've made a conscious decision – either nobody uses it and I remove it, or there are very few who do and I extract it into an external helper class in an external header (possibly changing the interface slightly), or it's a feature that is used in every second application that uses my library and I can't do anything. The problem is I have no statistics, so I can't do anything.

Other than that, I feel the interface to be good (I use it relatively often both in my pet projects and at work, so if there was something that annoyed me I would've fixed that); the best decision for me is pointer abstraction – in pugixml you don't work with pointers to node (as with TinyXML), you work with tiny pointer wrapper class (the size is equal to that of a pointer) that's passed by value; the point is that there is no null pointer exception, all operations on “null” nodes/attributes are perfectly defined. Of course, the same could be done with a pointer API by using a dummy object instead of null pointer, what matters is the decision to protect the user. Also I find that this makes parsing code much more concise – you don't have to do error handling for every API call!

2. Performance


The parsing performance is very good, on COLLADA files it's hundreds of megabytes per second (probably closer to gigabyte); the bottleneck is always HDD read speed unless the file is cached. Of course, it's still slightly slower than it could be; also the performance comes for a price of not being fully standard compliant – it manifests in allowing certain XML standard violations, such as disallowed Unicode symbols in attribute/node names, multiple node attributes with the same name, etc. This means that while any correct XML file will be parsed, some malformed ones will not be rejected. Up to some point there even were flags to make parser pass certain standard violations (i.e. there was a mode that could handle HTML-style unclosed tags by scanning for matching open tag and automatically closing all descendants), but I removed them to reduce clutter (that was at the point when parser was used by me and a couple of friends so no harm done).

The memory consumption is also good enough (when we switched from TinyXML at work, we got ~2x improvement in terms of memory required to parse a DOM tree), although it could be better. Surprisingly this was achieved without any tricks that I love (take the pointer, take lower N bits, stuff something useful in there, pretend that everything was that way) and almost without any bit-packing.

All good things come at a price – the parser currently requires that the whole XML file is a large contiguous chunk of memory (i.e. if you have a 200 Mb file to parse, you have to have a 200 Mb chunk of address space); also, this chunk dies with the document so in the worst case PugiXML can lose in peak memory consumption if you modify your tree too much (i.e. load a 200 Mb document from file, remove all nodes, add an equivalent amount of contents by hand – the memory overhead of PugiXML will be i.e. 400 Mb (larger than that because nodes take some space too), the memory overhead of a typical parser will be 200 Mb). Of course this is almost never a problem in practice.

Next time: performance highlights (tricks to make parsing fast, saving performance), user requests, documentation, portability concerns

9 comments:

  1. As a coincidence, I may have to rewrite my XML systems in the near future so this is very appreciated. Thank you for sharing.

    ReplyDelete
  2. Have you consider vtd-xml for improved perfomrance and reducing memory usage?

    http://vtd-xml.sf.net

    ReplyDelete
  3. No, I haven't; I seriously doubt the claims - the benchmark is only against Xerces; well, guess what - pugixml is much faster and better in terms of memory than Xerces; for some XML files 90-120 Mb/sec is just not good enough. I haven't tested VTD yet though, so I'm not exactly sure. Also it's huge, meaning it's in the same league as Xerces.

    All other claims are, uh, weird too - basically, it's stupid to claim you're the worlds *est something unless you deliver concrete proof in terms of benchmarks against latest version of a wide variety of product. Are you VTD-XML developer? If you are, I'd suggest rewriting both claims and benchmarks.

    ReplyDelete
  4. Profiled latest vtd-xml against latest (repo) pugixml, pugixml is 3x faster on a synthetic test (1 Mb file with lots of markup), at the expense of 25% more memory; on a moderate (several Mb) COLLADA file it's 6x faster at the expense of 2% more memory. Memory management is not overrideable btw, had to hack it.

    ReplyDelete
  5. Saw you tweets comparing vtd-xml with pugi...
    pugi is not a conformant XML parser at all, it does little well-formedness checking
    so sigh...

    ReplyDelete
  6. pugixml does not claim it's a conformant XML parser, it's simply an XML parser suitable for 95% of XML parsing. VTD-XML should not claim it's the fastest parser in the world either, since that's a lie.

    ReplyDelete
  7. This comment has been removed by a blog administrator.

    ReplyDelete
  8. it is not a lie, pugixml is not even conformant to be suited for 15% of the task, think this way, if it doesn't do any checking, it certainly would be infinitely fast.. but what use does it have?

    ReplyDelete
  9. Oh dear, here we go again. I could explain in detail why there's no need for a conformant parser for a lot of tasks - pugixml is used by a number of people, and they're loving it - but I won't bother talking to an anonymous person who seems to just want to tease me.

    If you want a meaningful discussion, please state who you are so that I know that you're honest about your claims.

    ReplyDelete