A few weeks back, I boasted in a bar that anyone could write a simple s-expression parser from scratch (no flex, no yacc, no bison, no antler, etc). Well, I could not boast without doing it myself, so I finally did. My s-expression lexer/parser weighs in at about 170 lines without tests (450 lines with them), and parsers integers and symbols. I even created a little repl which takes this:
And converts it into this:
[Symbol(+), Number(1), Number(2)]
A pretty-printed python list consisting of objects representing symbols and numbers.
I think the design is fairly readable if a bit unusual. There are classes for each token (instances of which are used for fully and partially formed tokens) and we “add” these tokens together to form a larger token and ultimately invoke a next_token_exception allowing the lexer to yield the largest, greediest, most fully-formed token possible.
Here’s an example of the lexer in action:
>>> symbol_token("h")
<parse_sexp.symbol_token object at 0x7fb9e2d253d0>
>>> _ + symbol_token("i")
<parse_sexp.symbol_token object at 0x7fb9e2d25710>
>>> formed = _
>>> formed + " "
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "parse_sexp.py", line 33, in __add__
raise next_token_exception()
parse_sexp.next_token_exception
>>> formed.value
'hi'
After lexing, we’re left with a stream of fully-formed tokens; it’s a simple matter to filter out the whitespace tokens and start yielding instances of Symbol() and Number() classes with proper, pythonic types for the data.
Although I used regexes (quite unnecessarily) and generators, this little thing could probably be ported to C. It’d be the fastest, most useless parser/lexer around
Happy Hacking!
Posted by Matt Bone at 7:43 pm on June 28th, 2010.
Tags: lisp, python, software.
- close your files, flush your files, fsync() your files, sync() if you have to, just don’t be stupid.
- don’t use globals for mutable state. especially when you’re in a multi-threaded environment. or you’re interested in correct answers.
- there’s a good chance the ORM is writing shitty sql.
- coffee != sleep
- more tests!
Posted by Matt Bone at 8:43 pm on June 22nd, 2010.
Tags: software.

I created this image using matplotlib and the data from my CPS Racial Data Warehouse project. Circle size represents school population. The axes are latitude and longitude (as a reference, the loop is the empty rectangle just below the left corner of the legend). Also I should note the Native American population was excluded from this image because of a technical issue.
Full size image here, more on this project soon (in the meantime, the code is here).
Posted by Matt Bone at 5:24 am on April 13th, 2010.
Tags: data, python, software, transparency.
I’ve been thinking a lot about software quality and testing. This is what I’ve come up with:

Posted by Matt Bone at 7:49 am on March 7th, 2010.
Tags: software.

When I was teaching data structures, one of the things that commonly came up was the whole issue of “pass by value” versus “pass by reference.” This was a very difficult thing to teach; somehow I’ve internalized the common C/Java style memory model, and these things are fairly second nature. While lecturing, I was definitely guilty of resorting to mind-numbing aphorisms like “integers are passed by value while objects are passed by reference”. These rules of thumb just don’t work (and make it incredibly difficult to get your head around the idea that object _references_ are passed by value). This (fabulous) article over at MSDN has a really nice description of the stack as an implementation detail. Perhaps the key to solving this problem is really to just make the first course in computer science programs a course in programing language implementation. Then you can move on to harder things like control flow and datastructures (I’m kind of kidding here).
But in reality, the vagaries of “value types” versus heap-allocated objects are too pointy for a data structures course. Their guts are easily exposed in something like C while being appropriately hidden in an everything-is-an-object language like Python. The Java/C# mix of these memory allocation strategies is confusing for the beginner. Expose the details or don’t. Exceptions to the rules breed bugs and confusion.
Posted by Matt Bone at 6:11 pm on April 30th, 2009.
Tags: software.
I spent a significant portion of my undergraduate career fooling around with the outdated seismology lab in the Loyola Physics department. Though I did eventually connect the seismometers to a computer and was able to record activity, they usually are not running as there is currently no student interested in checking in daily. However, we do turn them on every now and then for a lab that the physics students do. So, we hooked everything up yesterday, and BAM, the biggest earthquake in many years hit Chicago this morning. We captured the quake, and I am positively thrilled. Two traces are below. The times are way off (clock drift), but you can clearly see the quake on the 14:00 line. Also, apparently the after shock was centered in “Bone Gap, IL”! Ha! Amazing.


Posted by Matt Bone at 1:48 pm on April 18th, 2008.
Tags: software.
We’ve created an ETL Yahoo pipe that aggregates the content of members’ blogs tagged with ‘etl.luc.edu’. What I’m wondering is how quickly the pipe gets updated, and where things get cached along the line…
Posted by Matt Bone at 3:07 pm on January 23rd, 2008.
Tags: etl.luc.edu, software.
After some discussions with Dr. George recently, I was interested in exploring the relationship between xml and s-expressions. This has certainly been done before, but I thought I would try my hand at it. In the end result, I’m able to convert an s-expression such as:
(a :a1 "test" :a2 "test2" "mycdata" (b :a3 "c"))
to xml:
<a a1="test" a2="test2">mycdata<b a3="c"></b></a>
I do so with a somewhat strange but more explicit intermediate format:
(:element-name "a" :attributes (("a1" . "test") ("a2" . "test2")) :cdata nil :children (:element-name "b" :attributes (("a3" . "c")) :cdata "My CDATA" :children nil))
This shows pretty clearly how a single namespace xml document can be represented with an s-expression. What I’m wondering now is if I can go from the intermediate representation to something like YAML or JSON. George and I have been calling these ‘tree languages’; I wonder if some sort of overarching equivalence for these ‘tree languages’ can be shown?
I don’t think this has much practical use, but the code is here.
Posted by Matt Bone at 9:29 am on September 5th, 2007.
Tags: software.
(I’m guessing that a lot of what I say here applies to all distributed version control systems, but since I’ve only used darcs, that is what I’ll focus on.)
One of the neatest things about darcs is the fact that every checked out copy is itself a repository. This is perfect for university courses! Students can check out the professor’s code, make their own changes to the code (i.e. if the code is the starting point for some assignment…quite common), and commit their changes to their own repository without ‘disconnecting’ it from the professor’s version. All the student needs to do is pull changes from the professors server periodically and deal with any conflicts that may arise. The student can even push their copy to a university server for backup. It’s perfect!
If only Google code had darcs support, then I could force my 271 students to do this. As it stands, I’m not so sure a data structures course is the ideal place to introduce version control, but I may end up doing this anyway.
Posted by Matt Bone at 4:48 am on August 14th, 2007.
Tags: software.
I’m reading the GNU manifesto, and Stallman talks about adding:
“We plan to have…perhaps eventually a Lisp-based window system through which several Lisp programs and ordinary Unix programs can share a screen. Both C and Lisp will be available as system programming languages.” –RMS
What happened? Where did the Lisp programmers go?
Posted by Matt Bone at 10:54 am on August 4th, 2007.
Tags: software.