a tour of python’s ElementTree path language

A few weeks back I started looking at ElementPath package of the ElementTree module to explore a new syntax for finding things in xml documents, but this was a bust. The existing path syntax (a sort of limited and significantly less confusing subset of xpath) works quite nicely and I can’t see any reason to fiddle with it.  And while the implementation of ElementPath is simple and readable, I liked exploring it so much I thought I’d share some thoughts since it serves as a nice example of the power of generators and first class functions.

First consider this simple xml document as an example:

<root>
  <a>
    <b value="2"/>
  </a>
  <a>
    <b>
      <c value="101"/>
    </b>
  </a>
  <a value="will not be found"/>
</root>

The ElementPath expression for finding all “b” elements that are children of “a” elements is simply “a/b” (quite familiar if you’ve ever worked with xpath).  The code to build up the ElementTree and actually evaluate this expression is also fairly straightforward.  Assuming we’ve defined the string TEST_DOC to be the xml document in the code section above, to do this we simply build the ElementTree and call findall():

root = xml.etree.ElementTree.fromstring(TEST_DOC)
root.findall(path)

First it’s useful to think how we might write some code to find all “b” elements that are children of an “a” element without the benefit of a path language.  Writing this by hand, we simply walk the tree, finding all “a” elements and checking to see if they have a “b” element as a child. Here is one possible way to write this code:

def brute_find(root):
    def find_a(root):
        for node in root:
            if node.tag == "a":
                yield node
 
    def find_b(root):
        for node in root:
            if node.tag == "b":
                yield node
 
    for a_node in find_a(root):
        for b_node in find_b(a_node):
            yield b_node

In this particular implementation, we chain together two generators. The first yields all “a” elements, and the second yields all “b” elements that are children of those elements yielded by the first function (“a” elements). When evaluating the path expression “a/b” the actual ElementPath implementation ends up doing something quite similar. Let’s take a look at what happens step by step.

First the expression is tokenized using a regular expression.  The stream of tokens for “a/b” looks like:

[('', 'a'), ('/', ''), ('', 'b')]

The meat of the work in the path language happens when it uses the first first element of the first tuple in the stream of token-tuples to dispatch to the functions responsible for creating the code that will be used to find elements.

To build up code that will walk the tree like our hand-coded example above, the ElementPath implementation looks at the first item in the tuples of the stream and dispatches to helper functions. The first item in the first tuple of the stream of tokens above, the empty string, dispatches to the prepare_child() function, which looks this:

def prepare_child(next, token):
    tag = token[1]
    def select(context, result):
        for elem in result:
            for e in elem:
                if e.tag == tag:
                    yield e
    return select

Generally speaking, all tokens in the stream cause these helper functions to be dispatched (if you are looking at the ElementPath source code, each helper function starts with “prepare_”).  Each helper function returns a function that contains the code for walking the tree and finding nodes of interest (and optionally the helper function gobbles up more tokens in the stream using the next() iterator).  All the functions returned contain a common calling convention, taking a context and a result as parameters. I’ll ignore the context parameter except to say that it is used for path expressions that look backwards up the tree instead of just at attributes or child elements. The result parameter is the result of the previously filtered node (if any) or the root of the tree. In this particular instance, prepare_child() returns a function select() that ignores the context (it does no backtracking) and iterates over the result element, yielding children of the result element if their tag matches the second item in the provided token-tuple.

All of the functions returned by the helper functions are appended to a list, selector, and ultimately chained together:

result = [elem]
...
for select in selector:
    result = select(context, result)
return result

So, in our example path, “a/b”, the select() function looking for “a” elements is provided as input to the select() function looking for “b” elements.  Conceptually we can think of this as:

function_looking_for_b_elements(
   context={},
   result=function_looking for_a_elements())

Therefore when the first select function (looking for “b” elements) is evaluated, it forces the evaluation of the second function looking for “a” elements. Remember that these are all generators and inherently lazy, and this means that the entire process is lazy. If you’re only looking for one match, you’ll get the first one in the xml tree (by first I’m mean using a DFS) and no further evaluation is required.  The find(), findall(), and iterfind() methods on each ElementTree node let you find one, all, or as many nodes as you need respectively.

To summarize this process, the original path expression is tokenized.  Then functions are dispatched using these tokens.  The functions may consume more tokens in the stream, but ultimately return functions themselves.  The functions returned are chained together and the root of the tree is passed in.  All of the functions in the chain are generators, and asking for the first matched element will force evaluation all the way up the chain.

As a final thought, remember that all of this finding of elements happens against a tree that is completely loaded into memory. Obviously this isn’t always plausible.  Some subsets of this path language and others like xpath, can, however, be used in a streaming setting. Adding something like this to the standard library could be useful (especially if it was general enough to be applied to streams of JSON).

All of the code I wrote for this post can be found in this gist. It has been tested in Python 3.2 only.

zope at chipy

A few days back I have a little presentation about zope interfaces at ChiPy.  It was the first time I’ve presented at ChiPy except for two lightning talks (one on turtle and another on ittywiki).  Presenting was a lot of fun!

If you’re clamoring for zope interface knowledge, more details about the talk can be found on its github page.

happy software development in business land

Several years ago, nearing the end of my time at my first “real” job, an “agile consultancy” was brought in to evaluate the firms readiness to use agile techniques.  Things were bad at the time. Morale was low, many projects were perceived as failing, and nearly every meeting ended in frustration.  So when the agile consultants conducted their interviews and group sessions, these tended to devolve into an airing of grievances.

One of these sessions still sticks in my mind.  The consultants asked everyone in the group of about ten people to write one word on an index card describing what the firm could be doing better. We then posted the cards on the wall.  I knew exactly what to write, and within a few minutes my card, emblazoned with the world “TOOLS,” stood out in a field of other cards reading “process.”

At the time it was so painfully obvious to me that tools were the primary source of frustration, and I was disheartened to see my colleagues miss the mark so completely.  In my mind, process meant more meetings, more garbage, and more waste.  We had a tool problem! Most developers were using a shared development server which yielded combinations of code never represented in version control.  While we had individual development virtual machines,there was nothing close to a one click deploy, nothing to bootstrap a project out of version control.  Those of us actually running the projects on our development machines did it from scratch and administered our own boxes. Many of the developers were new to the LAMP stack, and didn’t have the skillset to accomplish these administrative tasks. Likewise, while some projects had functional and integration tests, there was little buy in and the continuous integration server sat dormant. So after the agile consultants asked me about my tool-related frustrations, the topic was gradually redirected to “process” and I zoned out. Process meant bullshit, and these guys were outsiders charging us an arm and a leg to feed us more of it.  And what did it matter? My foot was already halfway out the door, and I went home and sent off a few more resumes.

Unfortunately I was wrong.  Tools weren’t the issue.  We had plenty of tools, and while most were underutilized, this merely required an investment of time.  The problem was community.  Our community was dysfunctional, and fixing it would’ve subsumed all problems.

moving on

There are many ways to build quality software in a reasonable amount of time without wanting to scream, but no one solution is a fit for all situations.  In my current gig, I’m living in a bit of a utopia. I’ve been a lead on a project for an internal system replacing an older system (it sends email, but that’s not terribly important) for about nine months.  I have one extremely well-informed and expert internal client in the marketing department that is the key stakeholder in the project, and because this system is used by many other developers, my colleagues get to try out my APIs and offer feedback.  All told, we have the nicest piece of software I’ve ever written.  It’s internally successful, delivering results early and being improved incrementally, and a joy to work on.  We have tools: the system will bootstrap itself and start running locally with one call to a shell script, the unit tests run all the time, the build server builds the docs (we have docs!), and we’re able to contribute liberally to our shared internal library.  And we have just a bit of process, too: releases are often available in a staging environment, we deploy very rapidly,  feedback from the internal client is nearly instantaneous, progress is tracked in our ticket system, and we have zero standing meetings (instead preferring to do things with short, ad-hoc, high value interactions at the coffee machine).  But even in this same organization I’ve worked on projects that have left me profoundly unhappy, and obviously the approach for my current project won’t always work or scale to other systems.  So I’m left trying to identify the things that make development fun and make this project such a success.

This current project reminds me a lot of working in an academic setting.  In my pre-”real”-job days, the stakeholders were highly involved, highly informed, and wanted to be working on the same stuff that I wanted to be working on (because we were broke and doing it all for free).  But what’s going on now and what was going on then is that we’ve fostered a real sense of community.  Sure, we have some constituent elements that are required to build software in a professional setting (tools, process), and we’ve assembled these nuts and bolts in an orderly fashion. But much more importantly, we’ve fostered a sense of mutual respect.

My stakeholder in the marketing department functions as the closest thing to a project manager I have.  We plan things out and we repeatedly revise our estimates and expectations.  Together he knows that context switching between feature requests slows me down, but I’ve also come to realize that when he asks for something quickly, he’s already thought through the ramifications, and the overhead of the context switch is worthwhile for the business.  The key element here is that our relationship is not adversarial.  Quite the opposite.  I respect his insight into his area of expertise, and he respects mine.  I’ve seen so many project manager/developer relationships go south because both parties perceive the other as the adversary.  The PM thinks the developer is dragging his or her feet, and the developer is irrationally convinced the that fucking over developers was in the PM’s job description. This makes work suck. This makes software suck. Warring factions can’t collaborate to build something great; warring factions can only destroy.

I’m not sure how mutual levels of respect can be earned or fostered.  Of all things, the possibility of this might be the most innate thing in the organization’s human constituents (i.e. we have to presuppose that we’ve hired good people on all sides if this is ever to happen).  But, assuming that respect and trust can be built between the nerds and the stake holders over time, there are still some other factors that have to exist to make a happy project.  One important thing is continued learning on all sides.  Everyone has to be willing to learn new things.  Sometimes this means that business types have to learn new vocabulary.  The might have to know how a cookie works or what it means for the web to be stateless.  And the same holds for developers.  We may have to hold our nose at the imprecise language used in marketing meetings and learn some acronyms and business strategies.  If an unwillingness to learn exists, the project will fail.

Transparency is also key.  Developers need to know how things work.  They need to know the background story around the business decisions that went into a project.  Likewise, the stakeholders need to see that progress is being made.  If we’re building a web app, then the latest stable(ish) release should be available for the business types to tinker with.  The transparency helps both sides.  Sometimes the analytical nerd brain can raise legitimate business questions.  Other times, the stakeholders can stem a failure or misread specification much earlier if they’re interacting with the project on a continual basis.

Finally, fault tolerance and resilience is key to a pleasant project.  Both sides, programmers and stakeholders, have to understand that the other will occasionally make mistakes, misread a spec, or be unclear.  Again, this can’t be adversarial.   For a happy project, we programmers must realize there is no suit and tie cabal maliciously typing away unclear requirements documents.  No, just like we forget to add files to version control and fat finger Unix commands, things get misinterpreted and misread.  So everyone should be aware that mistakes will happen.  Some ideas might have to be delayed, and some 500 pages might be raised.  But, if we respond to these quickly, refuse to panic, and fix both types of problems, the project can will continue to be fun and happy.  And it will succeed.

sum(it_up)

For me, something like a Project Euler problem is the most satisfying computing exercise.  It’s well defined, there is a solution, and I know when I’ve found it.  But this is a mirage.  Very few computing problems are solved with such simple and clear algorithms. Everything is fuzzy and we rarely go it alone.  But if we respect each other along the way, keep and eye on how things are going, and respond to problems as they come we can get closer and closer to a clear solution and enjoy the journey along the way.

 

exec() yourself silly

Now we all know exec() is cruise-control for awesome. Despising readability, sanity, and performance, I sprinkle my code liberally with this MSG and joyfully await the ensuing migraine. But sometimes I ask myself, is it enough?
Read more »

default function parameter values in python

Sure, we all know that “default parameter values are evaluated when the function definition is executed” in python. But what may not be clear is a fun way to introduce time-dependent bugs. Read more »

y-combinator in javascript

On Monday I was discussing the y combinator at Rossi’s.  The next morning I wrote an example in scheme to send to my friends, but they don’t speak scheme.  Sadly, this is tricky to write in python because of the lack of multi-line lambdas, but it’s damned easy in javascript:

Read more »

2011: Best Year Yet

Useless tram for moving drunks between casinos.

In the spirit of ChiPy meetings, I’m declaring 2011 to be the best year yet. I can only assume the best-ness of each passing year will increase monotonically until the year of my death, at which point, all bets are off. Read more »

stupid closure tricks in py3k

This is kind of silly, but I like the nonlocal keyword in Python 3 (well, ok, I like let-bindings in Lisps much much better, but this will do) and how you can get a closure and then poke the insides of the closure to see what it’s up to. Read more »

relating to nosql

About the time the nosql movement started to get legs, I started to get interested in relational databases.  They’re awesome beasts, and horizontal scalability issues aside, they handle a lot of the very hard problems in the web programming space (i.e. durability and concurrency).  But when data gets large and you can’t find the 2 million quarters in your sofa for that oracle license (or the 2 million braincells for that oh-so-perfect sharding scheme),  the nosql databases start to show their appeal. Read more »

simple s-expression parser in python

A few weeks back, I boasted in a bar that anyone could write a simple s-expression parser from scratch (no flex, no yacc, no bison, no antler, etc).  Well, I could not boast without doing it myself, so I finally did.  My s-expression lexer/parser weighs in at about 170 lines without tests (450 lines with them), and parsers integers and symbols.  I even created a little repl which takes this:

(+ 1 2)

And converts it into this:

[Symbol(+), Number(1), Number(2)]

A pretty-printed python list consisting of objects representing symbols and numbers.

I think the design is fairly readable if a bit unusual. There are classes for each token (instances of which are used for fully and partially formed tokens) and we “add” these tokens together to form a larger token and ultimately invoke a next_token_exception allowing the lexer to yield the largest, greediest, most fully-formed token possible.

Here’s an example of the lexer in action:

>>> symbol_token("h")
<parse_sexp.symbol_token object at 0x7fb9e2d253d0>
>>> _ + symbol_token("i")
<parse_sexp.symbol_token object at 0x7fb9e2d25710>
>>> formed = _
>>> formed + " "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parse_sexp.py", line 33, in __add__
    raise next_token_exception()
parse_sexp.next_token_exception
>>> formed.value
'hi'

After lexing, we’re left with a stream of fully-formed tokens; it’s a simple matter to filter out the whitespace tokens and start yielding instances of Symbol() and Number() classes with proper, pythonic types for the data.

Although I used regexes (quite unnecessarily) and generators, this little thing could probably be ported to C. It’d be the fastest, most useless parser/lexer around :)

Happy Hacking!