My Secret Life as a Spaghetti Coder
It's no big secret that I'm not a huge fan of XML. But when I posted about Bob Martin's revolt against XML, I was only half joking. I use XML when I find it useful, and I certainly wouldn't go so far as to say (quoting Bob Martin):
"What is the matter with these people? How, after all the experience we've had with XSLT, Ant, WSDL, etc., etc., could they create YET ANOTHER XML language. Are they dolts? Are they idiots?"
But when Peter posted a comment asking "how quickly can you write a parser...," I revisited the post from Bob, and dug into it a little.

I'm going to go out on a limb here and use something I learned in school (this doesn't happen often, at least not with the "theoretical" stuff). There are concise ways to describe languages and grammars, so one would think there exists a tool that can take that description and automatically parse some text for you. It sounds reasonable, anyway. In fact, checking up on it, that's what tools like ANTLR and YACC seem to do.

As far as rolling your own parser: if your language is very simple, you can easily write a parser using string.split(pattern) that would do the job. It's only when the language gets more complex that the parsing becomes difficult. In this case, Robert Martin mentioned that you should "write a little YACC grammar that is nice, and small, and translates into that hideous XML." Since I couldn't find a download for YACC, I decided to get ANTLR and give it a whirl.

I'll show a very simple dependency injection DSL that follows this basic rule: make bean: id, class, constructor-arg {name=value, name2=value2,...}. Obviously, when writing a real one you'd want to take some time to make it simpler for the user, which would lead to a more complex grammar than this. In any case, code written in this DSL might look like:

make bean: samuel_adams, beer.SamAdams, constructor-arg {rating=6.1}
make bean: jack_daniels, liquor.whiskey.jd, constructor-arg {rating=8.1}
make bean: dp, champagne.domPerignon, constructor-arg {rating=9.5, year=1996}
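
For comparison, a language this small can still be handled by the split-based approach mentioned above. Here's a rough sketch in Python (my choice for illustration, not anything ANTLR produces); the function name parse_bean and the shape of the returned dictionary are made up for this example, and values are left as plain strings.

```python
def parse_bean(line):
    """Parse one 'make bean: id, class, constructor-arg {k=v, ...}' line."""
    head, _, rest = line.partition(":")
    if head.strip() != "make bean":
        raise ValueError(f"expected 'make bean:' at the start of {line!r}")

    # Everything before '{' is the comma-separated header fields.
    fields, brace, arg_text = rest.partition("{")
    if not brace:
        raise ValueError("missing '{' before the argument list")
    bean_id, class_name, injection = [f.strip() for f in fields.split(",")]

    arg_text = arg_text.strip()
    if not arg_text.endswith("}"):
        raise ValueError("missing closing '}'")

    # Each argument is a name=value pair; values stay as strings here.
    args = {}
    for pair in arg_text[:-1].split(","):
        if pair.strip():
            name, _, value = pair.partition("=")
            args[name.strip()] = value.strip()

    return {"id": bean_id, "class": class_name,
            "injection": injection, "args": args}
```

This holds up only while the language stays this flat. Once a value can itself contain a comma or a brace, the split calls start cutting in the wrong places, which is where a real grammar starts to pay off.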

First, let's define the tokens for the lexer. In ANTLR, these start with a capital letter, so we have: MakeBean, BeanID, Class, TypeOfInjection, ArgName, and ArgValue. (I'll put it all together in legal ANTLR statements below.)
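
To make the idea of tokens concrete outside of ANTLR, here's a rough regex-based tokenizer sketch in Python. The token names are borrowed from the list above where possible, but a few things are assumptions of this sketch: BeanID, ArgName, and the pieces of Class all come out as a single Identifier kind (their text looks identical, so only their position can tell them apart), and Punct is a made-up catch-all for the single-character symbols.

```python
import re

# Order matters: earlier patterns win when several could match at one position.
TOKEN_SPEC = [
    ("MakeBean", r"make[ \t]+bean\b"),
    ("TypeOfInjection", r"constructor-arg\b"),
    ("Float", r"\d+\.\d+"),
    ("Int", r"\d+"),
    ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("Punct", r"[:,{}=.]"),
    ("CRLF", r"\r?\n"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, text) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":  # skip blanks, like skip() on a WS rule
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize on one of the sample lines gives a flat list of pairs such as ("MakeBean", "make bean") and ("Float", "6.1"), which is the kind of stream a parser would then consume.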

Then we'll want to define the rules for our parser. These start with lowercase letters. For this, we have statements, expression, args, and prog (our program). A statement is an expression followed by a CRLF, or just a CRLF on its own. I split args out into its own rule, though it could just as easily have gone right into the expression if I had wanted.
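
Here's roughly how those four rules might look as a hand-written recursive-descent parser in Python, with one method per rule. This is only a sketch of the idea, not what ANTLR generates. The class name Parser, the helper take, and the dictionary output are all invented for the example. It is also a bit more lenient than the rules described above: it accepts empty input and a missing final newline.

```python
import re

# Token pattern: keywords first, then numbers, identifiers, newline, symbols.
TOKEN = re.compile(
    r"[ \t]*(make bean\b|constructor-arg\b|\d+\.\d+|\d+"
    r"|[A-Za-z_][A-Za-z0-9_]*|\r?\n|[:,{}=.])[ \t]*"
)
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
VALUE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+(\.\d+)?")
NEWLINE = re.compile(r"\r?\n")


def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = lex(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None, pattern=None):
        """Consume one token, checking it against a literal or a pattern."""
        tok = self.peek()
        ok = tok is not None
        if ok and expected is not None:
            ok = tok == expected
        if ok and pattern is not None:
            ok = pattern.fullmatch(tok) is not None
        if not ok:
            raise ValueError(f"unexpected token {tok!r} at position {self.pos}")
        self.pos += 1
        return tok

    def prog(self):  # prog: statements+
        beans = []
        while self.peek() is not None:
            bean = self.statements()
            if bean is not None:
                beans.append(bean)
        return beans

    def statements(self):  # statements: expression CRLF | CRLF
        if self.peek() in ("\n", "\r\n"):
            self.take()
            return None
        bean = self.expression()
        if self.peek() is not None:  # leniency: final newline is optional
            self.take(pattern=NEWLINE)
        return bean

    def expression(self):  # make bean: id, class, constructor-arg { args }
        self.take("make bean")
        self.take(":")
        bean_id = self.take(pattern=IDENT)
        self.take(",")
        class_name = self.take(pattern=IDENT)
        while self.peek() == ".":
            self.take(".")
            class_name += "." + self.take(pattern=IDENT)
        self.take(",")
        self.take("constructor-arg")
        self.take("{")
        args = self.args()
        self.take("}")
        return {"id": bean_id, "class": class_name, "args": args}

    def args(self):  # args: name '=' value (',' name '=' value)*
        result = {}
        while True:
            name = self.take(pattern=IDENT)
            self.take("=")
            result[name] = self.take(pattern=VALUE)
            if self.peek() != ",":
                return result
            self.take(",")
```

Running Parser(text).prog() over the three sample lines should give back one dictionary per bean. Writing this by hand is manageable at this size, but every new construct means another method and more token bookkeeping, which is the work a parser generator is supposed to take off your hands.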

Here's the code you'd use in ANTLR. So far, I see that it draws state machines for me, but I don't yet know how to feed it input and get output (however, I imagine that wouldn't be too difficult). I've tried to add comments to explain what I understand to be going on.

grammar expr;
options { // not sure what all can go here
    output=AST;
    ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}

// entry point: one or more statements, printing each statement's tree as it is parsed
prog: ( statements {System.out.println($statements.tree.toStringTree());} )+ ;


statements: expression CRLF -> expression // an expression followed by CRLF; the rewrite (->) keeps only the expression in the tree
    | CRLF -> // a bare newline; the empty rewrite keeps nothing
    ;


expression:
    MakeBean WS* ':' WS* BeanID WS* ',' WS* Class WS* ',' WS* TypeOfInjection WS* '{' WS* args WS* '}' WS*
    ;


args:
    ArgName WS* '=' WS* ArgValue (',' ArgName WS* '=' WS* ArgValue)*
    ;


// stuff I put in for reusable components
Identifier : (Char|'_') + (Int|Char|'_')*; // composition of characters, _'s, and digits
Int : '0'..'9'+ ; // any digit, one or more times
CRLF:'\r'? '\n' ; // carriage return / line feed
WS : (' '|'\t')+ {skip();} ; // whitespace
Char : ('a'..'z'|'A'..'Z'); // any ASCII letter
Float : Int+ '.' Int+; // one or more digits followed by a period and some more digits

// Tokens we described earlier
MakeBean: 'make bean';
BeanID : Identifier;
Class : Identifier ('.' Identifier)*;
TypeOfInjection: 'constructor-arg';
ArgName : Identifier;
ArgValue: Identifier|Int|Float;

I don't claim that this design is the optimal (or even close to optimal) one. This is the first time I've done something like this outside of an academic setting, where the goal was to explain the kinds of strings a grammar like this might generate (or, given some strings, to construct a grammar that can generate them). In fact, if you've got a better design (with reasons, or some heuristics we can follow), I'd especially love to hear from you in the comments. Also, feel free to ask questions and I'll answer them to the best of my ability.

In all, it is hard to measure how long this took me. I had tons of different distractions going on while doing this, so discounting those I'd estimate about an hour or two to get this VERY minor grasp of ANTLR. But I also have the benefit of already being exposed to the grammar description language, so your mileage may vary. In case you're interested, here are some ANTLR tutorials.

The main drawback to ANTLR is that it has only a few target languages (at the moment): Java, C#, Objective C, C, Python and Ruby. On the other hand, Perl, C++, and Oberon are being worked on, and you are able to add support for others. (Does anyone want to make one of these for ColdFusion?)

But, even with that limitation, is it starting to look like all those benefits to using XML as a DSL container aren't exclusive to XML? I can imagine how easy this would be once I learn what I'm doing. Guess it's time to buy The Definitive ANTLR Reference: Building Domain-Specific Languages (the Elk?).



Comments

Well, as per IM conversation, not sure I'm going to use ANTLR as my primary tool, but I've bought the reference guide so I'll take it down to CF United for some light reading next week!

Posted by Peter Bell on Jun 19, 2007 at 06:40 PM UTC - 6 hrs

Then I hope to see some tutorial posts sometime soon after that! =)

Posted by Sam on Jun 20, 2007 at 06:15 AM UTC - 6 hrs

ANTLR is crazy cool, I was toying with writing an actionscript parser once...but then I realized, this was a bit over my head...as for a CF parser, you should talk to Mark Mandel, I believe he's been writing one for the CFEclipse project.

Posted by Derek P. on Jun 26, 2007 at 12:16 PM UTC - 6 hrs

I did notice his name as one of the testimonials. That would certainly be cool to build external DSLs on top of CF with less effort than you'd normally need to put in.

Posted by Sam on Jun 26, 2007 at 12:26 PM UTC - 6 hrs

I've been using ANTLR v3 since it started in Beta, and I love it!

It's how I built TQL for Transfer (tho if I built it now I would have built it differently, but that's the way of all things), and integrated the Java code with Transfer using JavaLoader.

The main power I find with ANTLR is that it is just SO extensible, not only within the grammar, but also in terms of the code you can add to things. You don't like the CommonToken... make it use a different one.. you don't like the Tree implementation, use a different one there too, if you want to do something tricky, you can write inline code into your grammar to do fancy rewrites with island parsers, catch exceptions for better error handling.. the list is endless.

ANTLR is an amazing tool, and the LL(*) parsing technology is crazy smart.

Sam: if you ever want to talk ANTLR or CF/ANTLR integration, drop me a line.

Posted by Mark Mandel on Jun 26, 2007 at 05:50 PM UTC - 6 hrs

Mark- When I get to using it on a more regular basis, I most certainly will take you up on that offer. Thanks!

Posted by Sam on Jun 27, 2007 at 05:58 AM UTC - 6 hrs

Hey Mark, is that an open offer?! I've just got the ANTLR book for my "light summer reading" :->

Posted by Peter Bell on Jun 27, 2007 at 09:46 AM UTC - 6 hrs

The ANTLR mailing list is an *awesome* resource, I usually have replies back to my questions within the hour, and at the maximum 24 hours (Terrence is very active on the mailing list), but I'm always ready to chat ANTLR via IM or otherwise any time ;)

So yeah, I don't mind.

@Peter: I finally got the book too... after using it for over 6 months, I figured it was about time... ;)

Posted by Mark Mandel on Jun 27, 2007 at 06:05 PM UTC - 6 hrs
