Latest posts.

palindromes on twitter – rettiwt no semordnilap

Lately I’ve been thinking about processing streams of data on the fly. Weirdly enough, I have a fascination with palindromes, so I wrote a very bad hack with Twisted to hookup to the streaming twitter feed and search for these things. Over the course of about a day, I found 259 distinct palindromes where palindromes are (very narrowly) defined as:

  1. Words without punctuation, numbers or accents (i.e. only a-z in regex).
  2. Longer than four characters.
  3. Consisting of three or more distinct characters.
  4. Not in a set of stop words I decided I was not interested in.

The astute reader will notice that these rules actual mean all valid palindromes will be 5 or more characters. Now for the results:

Twitter Palindromes

“Level” is the clear winner with 662 occurrences. The venerable palindrome “racecar” sped into the #30 spot with a mere 8 occurrences.

Unfortunately I did not keep track of how many tweets I scanned, but it appears a very small percentage of tweets contain this type of palindrome.  I would like to convert this to a full-on palindrome tracking twitter bot that will annoyingly tweet at you if you tweet something palindrome-ish, but I am slightly afraid someone would take that code and make streaming-twitter-pesterbot-v1™.  Hunting for palindromic sentences would be fun, too (a palindromic 144 character tweet would be pure genius).  Most importantly, I’d like to search for sarahpalindromes which I’ve defined as words that become palindromes with one transposition (sarahpalindrome detection algorithms are much appreciated…you could probably get a paper out of something like that).

Binding Classes to Tags in Python’s ElementTree

Most of my work on BetterXML was in the area of data binding. In both of our frameworks, we created mappings between classes and markup elements, and then bound each occurrence of a tag in a document to an instance of the mapped class. Our work there was implemented from scratch (i.e. on top of a SAX parser), but it turns out that some of our ideas are quite simple to implement in Python’s ElementTree. Let’s say we are parsing the xml version of someone’s twitter stream.  It looks a little something like:

<statuses type="array">
  <status>
    <created_at>Fri Dec 18 18:31:14 +0000 2009</created_at>
    <id>6804282413</id>
    <text>my frozen soup will never thaw </text>
    ...
  </status>
  ...
</statuses>

We may like to bind each instance of the <text> tag to an instance of a Tweet class:

class Tweet(_ElementInterface):
    def word_count(self):
        return len(self.text.split())

    def at_someone(self):
        for word in self.text.split(): #this is in the xml, but here we do it from scratch for fun
            if(word.startswith('@')):
                return True
        return False

The actual binding occurs by creating our own factory method, and supplying this to the ElementTree TreeBuilder and parser:

def bound_element_factory(tag, attrs):
    if(tag == 'text'):
        return Tweet(tag, attrs)
    elif(tag == 'statuses'):
        return Twitter(tag, attrs)
    else:
        return _ElementInterface(tag, attrs)

parser = XMLTreeBuilder(target=TreeBuilder(element_factory=bound_element_factory))
tree = ElementTree.parse('mytweets.xml', parser=parser)

So now as the document, is parsed, the bound_element_factory() method is called each time a tag is encountered. Instead of always instantiating an _ElementInterface class for each tag (what the default factory method does), this method instantiates specialized children classes when a “text” or “statuses” tag is encountered. These classes can obviously define or override methods willy-nilly.

While I think this is a pretty neat xml party trick, it is fair to question the utility of this approach in particular and data binding in general.   After all, ElementTree is pretty useful on its own. But data binding is appealing because it allows us to create the object and class hierarchy first, and worry about going to and from the markup representation later.

This version of data binding is a bit backwards, though. Because our objects extend the _ElementInterface object, we still must call the ElementTree’s insert() method build up a tree of objects.  Thus, when we write our classes and build up trees of objects, we’re aware that they’re going to or coming from xml.

However, all is not lost. This approach can be useful when going from existing markup (particularly if it is changing or poorly defined) to an object hierarchy.  Again, this is what ElementTree excels at, but here we have the opportunity to create some utility methods in the objects and hide the markup from the users of these objects.  So in this case, if twitter changes the markup from <text> to <tweet>, we need only change the factory and the call to find().  Users of the Tweet class will see no changes.

A complete working example can be found in my bitbucket (pardon the non-working thing in that repo, too). Also, a major hat tip to the flexibility of the ElementTree implementation. I high recommend reading that very pretty code, too.

Binary tree traversal in Python with generators

One of the many things I like about Python is generators. They allow for the creation of iterators without the boilerplate imposed by Java or PHP. Furthermore, iterators can be thought of as streams, and many times we don’t really want to default to eager behavior (maybe eager behavior should be demanded ). For example, consider a class implementing a very simple binary tree:

class BinaryTree:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

    def __unicode__(self):
        return '%s' % self.data

More of a struct, this simple class has slots (instance variables) for the left sub-node, the right sub-node, and the data. When we first encounter a structure like this in, say, a data structure course, our inclination is to traverse the thing. Remember, we can do this any number of ways: depth-first, breadth-first, pre-order, post-order (for the traversals in this article, I will only be concerned with node data, but all the algorithms can easily be modified to yield the nodes themselves). If we want to write a very simple, eager, depth-first (and also pre-order) traversal of a tree like this, we can do something as follows:

def recursive_dfs(tree):
    nodes = []
    if(tree != None):
        nodes.append(tree.data)
        nodes.extend(recursive_dfs(tree.left))
        nodes.extend(recursive_dfs(tree.right))
    return nodes

This is a great first step, but, as already mentioned, it is eager. By calling this function, we get a complete list of all the nodes in the tree whether we need them or not. With a few simple modifications, however, we can pull nodes out of a tree on demand in the same pre-order fashion by using Python generators. We simply start the traversal, yield the node data, yield all nodes in the left subtree, and then yield all nodes in the right subtree:

def basic_dfs(tree):
    if(tree!=None):
        yield tree.data
        for node_data in basic_dfs(tree.left):
            yield node_data
        for node_data in basic_dfs(tree.right):
            yield node_data

If we wanted a (not-quite)-post-order traversal, we would yield the nodes in the right subtree first. We could do this by simple rewriting the function above. However, we can take this a step further, and leave the the decision of what nodes to yield first to another function entirely (I will call this the ‘chooser’ function):

def left_then_right(tree):
    if(tree!=None):
        yield tree.left
        yield tree.right

def dfs(tree, chooser=left_then_right):
    if(tree!=None):
        yield tree.data
        for immediate_child in chooser(tree):
            for node_data in dfs(immediate_child, chooser):
                yield node_data

Thus dfs(sometree) will call the left_then_right() function by default, and perform a pre-order traversal. For our (not-quite)-post-order traversal, we define the right_then_left() function:

def right_then_left(tree):
    if(tree!=None):
        yield tree.right
        yield tree.left

And passing this function with dfs(sometree, right_then_left), we have our new traversal. To really see the benefit of our lazy traversals, though, we can go one step further, and implement a binary-search on top of our dfs() function as a chooser function. Instead of yielding the left subtree and the right subtree, the chooser function will yield the nodes in the left subtree if the value we’re searching for is less than the node data, the nodes in the right if the value is greater than the node data, or the node data itself if it is equal to the value:

def binary_search_chooser(value):
    def binary_search_chooser_inner(tree):
        if(tree!=None and tree.data!=None):
            if(value<=tree.data):
                yield tree.left
            else:
                yield tree.right

    return binary_search_chooser_inner

Don’t let the closure fool you, this is still simple stuff. A call to binary_search_chooer(5), returns a chooser function that will decide whether to go left or right down the tree based on a node’s value. So to search a BST for 5, we can just call bfs(tree, binary_search_chooser(5)). This will give us a list of nodes (the path to a leaf), the last of which will be 5 if that value is found.

For sure there are more efficient ways to do these kinds of traversal with pointer manipulation, etc, but this serves as fun exercise for fans of generators. The astute reader will also note that we’ve implemented the strategy pattern in a functional-programming type of way with our use of first class functions.

the stack!

img_0386

When I was teaching data structures, one of the things that commonly came up was the whole issue of “pass by value” versus “pass by reference.” This was a very difficult thing to teach;  somehow I’ve internalized the common C/Java style memory model, and these things are fairly second nature.  While lecturing, I was definitely guilty of resorting to mind-numbing aphorisms like “integers are passed by value while objects are passed by reference”.  These rules of thumb just don’t work (and make it incredibly difficult to get your head around the idea that object _references_ are passed by value). This (fabulous) article over at MSDN has a really nice description of the stack as an implementation detail.  Perhaps the key to solving this problem is really to just make the first course in computer science programs a course in programing language implementation.  Then you can move on to harder things like control flow and datastructures (I’m kind of kidding here).

But in reality, the vagaries of “value types” versus heap-allocated objects are too pointy for a data structures course.  Their guts are easily exposed in something like C while being appropriately hidden in an everything-is-an-object language like Python.  The Java/C# mix of these memory allocation strategies is confusing for the beginner.  Expose the details or don’t.  Exceptions to the rules breed bugs and confusion.

An American Childhood

About a month ago, I finished my second Annie Dillard book, An American Childhood.  Dillard has had a special place in my heart ever since I read Pilgrim at Tinker Creek in high school.  Tinker Creek is a Thoreau-like book of insanely detailed observation so powerful I find myself rereading every two or three years.  The language in Tinker Creek is so beautifully crafted that it reads, at times, like a prose poem, the art of which can, if care is not taken, distract from the actual content of the text.  So, it is with this high opinion of DIllard that I began to read An American Childhood.  I was not dissappointed.

On the surface, the book is a memoir of Dillard’s early years in Pittsburgh, from birth through adolescence, in the 1950s.  Dillard tells the story with the same skill as Pilgrim at Tinker Creek, and with the same attention to detail.  She bounces between topics, describing her father’s wide-eyed trip down the Mississippi one summer after quitting his job, and her mother’s wit and command of the spoken word.  The common thread in all of this is her childhood exuberance.  The young Dillard is not only intelligent but hungering for knowledge of any sort, reading field guides, linguistic texts, and drawing how-tos.  Reading up on her large rock collection, Dillard is surprised to discover adults with her own sense of wonder:

One book cautioned me against refining any gold I found…if I refined it in any way I was “obliged by law” to sell it to a licensed gold dealer…these, then, were books which advised, in detail, how to avoid making money, right here in America.

The descriptions of regular events like the family’s improving social status and country club sponsored cotillions, contrasts sharply with Dillard’s inner dialogue, the fervor of which grows and grows throughout the book, causing her to wonder if it will last:

Must I then lose the world forever, that I had so loved?  Was it all, the whole bright and various planet, where I had been so ardent about finding myself alive, only a passion peculiar to children, that I would outgrow even against my will?

An American Childhood has the marks of a typical memoir, but is made all the more enjoyable by Dillard’s mastery of language.  Furthermore, the young Dillard’s refusual to be blasé about anything large or small in the universe, is echoed by her modern author’s voice and the depth of self-revelation.  This is a book enjoyable from many angles, and though I’ve read only a small fraction of Dillard’s work, she is undoubtedly one of my favorite authors.

burn the rope

The first game I’ve ever beaten.  Ever.

http://www.mazapan.se/games/BurnTheRope.php

(you have to burn the rope)

xtra! xtra!

Yesterday I received my xtracycle, and today I finished the assembly.  Sort of.  I’m still lacking proper shift cable and housing so I’m rocking a single speed xtracycle right now, but I took it on its maiden voyage in the middle of our giant snow.  First off, the xtracycle handles beautifully in the snow.  The bike is so long you can actually get the thing sideways and control it quite nicely.  More importantly, though, this beast can haul anything.  In the photo above, I’m hauling punch ingredients for our work Christmas party.  I can’t wait to haul crap.  If you need a large object transported by bicycle, just let me know.

And now gratuitous roller shots:

hold on!

hold on!

dan is teh roller rockstar

dan is teh roller rockstar

it’s cold.

I’d just like to point out that when I went to bed it was 51 degrees, and when I woke up, it was 5 degrees.  So, whoever is in charge of these things, please stop with the order of magnitude temperature changes in 8 hours.

That is all.

the affero public license

After Chris Grubbs and I finished up our work on phlombay.com for a class project about a year ago, we released what little code there was under the Affero Public License.  My reasons were strictly idealogical, and I assume Chris wasn’t going to be selling proprietary versions of the site, so he went along just fine.  However, since that time I’ve really wondered about the utility of license (as an aside, I’ve started work on phlombay again, but that’s another post entirely).

Unlike the GPL, the Affero stipulates that licensed software running on a network and providing a network or web service, must also provide the source.  So while Google is not required to release any modifications to (the GPL licensed) MySQL backing up their search, they would be required to do so were the software licensed under Affero (I think…but what if the DB server is merely accessible via a webapp…hmmm).

Regardless, I had wondered about the utility of this license.  However, that changed today after discovering some of the technorati services such as events and contacts.  These services parse a page looking for microformats and then generate the appropriate file for that microformat (i.e. a vCard or iCal file).  This is a perfect example of when code should be released under the Affero.  By providing the code backing up these services, Technorati would enable users to setup their own servers (to relieve load, enforce uptime requirements, etc) while also ensuring that any modifications would be available to the public.

Though many object to the viral nature of licenses like the GPL, this objection is somewhat neutered in terms of webservices.  Generally the service is available for  (as in beer) already.  Modifications to the code of this service probably serve no real monetary purposes; instead they improve the quality of the service.  Therefore, there is less of a reason to shy away from a viral license when releasing code that backs webservices.   Sure, in a perfect world we wouldn’t need viral licenses (or any licenses at all!), but that world is far off.  Until then, the GPL and the Affero GPL are handy intellecutal property hacks.

private methods are stupid

Today at work in a unit-testing-brown-bag-lunch-session, I spouted off about private methods, claiming that they are stupid.  I stand by my statement; private methods are definitely stupid.  However, my true feelings are a bit more nuanced.

If we assume we are associating a specific chunk of functionality with every method we write, then there is an overarching hubris associated with declaring a method to be private.  When we declare such a method to be private, we  predict the future, saying “only I, the privileged creator of this spectacular class will ever need this functionality.”

But, wait, aren’t there legitimate reasons to declare a private method?  What if a particular method leaves an object in an inconsistent state, debilitating all subsequent calls to “public” methods?  Should this dangerous method be marked private?  Perhaps.  Leaving aside arguments about the necessity to create such a method, there are other ways around the problem.  Convention, not language level constructs solve the issue nicely.  Python, for example, uses a leading underscore for “private” methods, and Smalltalk programmers sometimes place these methods in the “private” protocol.  And while I am not interested in metaphysical debates about what makes a system “object oriented” (I mean really, I think CLOS is perhaps the best OO system out there), with the examples of Smalltalk and Python, it is clear that private methods are not necessary for OO systems.

The argument that private methods are necessary to “protect” code is even sillier.  If we’re afraid of how other programmers might use our classes, then we shouldn’t bother programming.  Not only do these fears foster a sense of mistrust among programmers, but also, they are not easily eliminated.  In the most straightforward cases, users of our classes with access to the code may go through and mark useful methods public (note that loosening protection introduces no bugs in non-reflective code), or worse, copy and paste this functionality into another method entirely.   In more severe cases, the “nefarious” programmer will resort to introspection or disassemblers.

We programmers are tinkerers, and we should be willing to let ourselves tinker.  Sleep easy, and declare your methods to be public!