It's no big secret that I'm not a huge fan of XML. But when I posted about Bob Martin's revolt against XML, I was only half joking. I use XML when I find it useful, and I certainly wouldn't go so far as to say (quoting Bob Martin):
"What is the matter with these people? How, after all the experience we've had with XSLT, Ant, WSDL, etc., etc., could they create YET ANOTHER XML language. Are they dolts? Are they idiots?"
But when Peter posted a comment asking "how quickly can you write a parser...," I revisited the post from Bob and dug into it a little.
I'm going to go out on a limb here and use something I learned in school (this doesn't happen often, at least not with the "theoretical" stuff). There are concise ways to describe languages and grammars, so one would think there exists a tool that can take such a description and automatically parse some text for you. It sounds reasonable, anyway. In fact, checking up on it, that's what parser generators like ANTLR and YACC do: you give them a grammar, and they generate the parser code for you.
As far as rolling your own parser goes: if your language is very simple, you can easily write a parser using string.split(pattern) that would do the job. It's only when the language gets more complex that the parsing becomes difficult. In this case, Robert Martin mentioned that you should "write a little YACC grammar that is nice, and small, and translates into that hideous XML." Since I couldn't find a download for YACC, I decided to get ANTLR and give it a whirl.
I'll show a very simple dependency injection DSL that follows this basic rule:
make bean: id, class, constructor-arg {name=value, name2=value2,...}
Obviously, when writing a real one you'd want to take some time to make it simpler for the user, which would lead to a more complex grammar than this. In any case, code written in it might look like:
make bean: samuel_adams, beer.SamAdams, constructor-arg {rating=6.1}
make bean: jack_daniels, liquor.whiskey.jd, constructor-arg {rating=8.1}
make bean: dp, champagne.domPerignon, constructor-arg {rating=9.5, year=1996}
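For a language as small as the three lines above, the string.split approach mentioned earlier is enough. Here is a minimal Java sketch, assuming every line follows the rule exactly. The class name SplitBeanParser and the Bean holder are made up for illustration; they are not part of ANTLR or any other library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitBeanParser {

    /** Hypothetical holder for one parsed "make bean" line. */
    public static final class Bean {
        public final String id;
        public final String className;
        public final Map<String, String> args;

        Bean(String id, String className, Map<String, String> args) {
            this.id = id;
            this.className = className;
            this.args = args;
        }
    }

    public static Bean parseLine(String line) {
        // Separate the "make bean" keyword from everything after the first colon.
        String[] keywordAndBody = line.split(":", 2);
        if (keywordAndBody.length != 2 || !keywordAndBody[0].trim().equals("make bean")) {
            throw new IllegalArgumentException("Expected 'make bean: ...' but got: " + line);
        }
        String body = keywordAndBody[1].trim();

        // Locate the {...} block that holds the name=value pairs.
        int open = body.indexOf('{');
        int close = body.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Missing {...} argument list: " + line);
        }

        // Everything before '{' should be: id, class, constructor-arg
        String[] head = body.substring(0, open).split(",");
        if (head.length != 3 || !head[2].trim().equals("constructor-arg")) {
            throw new IllegalArgumentException("Expected 'id, class, constructor-arg': " + line);
        }

        // Split the argument list on commas, then each pair on the first '='.
        Map<String, String> args = new LinkedHashMap<>();
        String inside = body.substring(open + 1, close).trim();
        if (!inside.isEmpty()) {
            for (String pair : inside.split(",")) {
                String[] nameAndValue = pair.split("=", 2);
                if (nameAndValue.length != 2) {
                    throw new IllegalArgumentException("Expected name=value but got: " + pair);
                }
                args.put(nameAndValue[0].trim(), nameAndValue[1].trim());
            }
        }
        return new Bean(head[0].trim(), head[1].trim(), args);
    }

    public static void main(String[] unused) {
        Bean b = parseLine(
            "make bean: dp, champagne.domPerignon, constructor-arg {rating=9.5, year=1996}");
        // prints: dp champagne.domPerignon {rating=9.5, year=1996}
        System.out.println(b.id + " " + b.className + " " + b.args);
    }
}
```

This holds up only as long as values never contain commas, braces, or equals signs. Once they can, splitting on fixed characters stops working, which is the point where a real grammar starts to pay for itself.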
First, let's define the tokens for the lexer. In ANTLR, these start with a capital letter, so we have: MakeBean, BeanID, Class, TypeOfInjection, ArgName, and ArgValue. (I'll put it all together in legal ANTLR statements below.)
Then, we'll want to define the rules for our parser. These start with lowercase letters. For this, we have statements, expression, args, and prog, our program.
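A statements rule matches either an expression followed by a CRLF, or a bare CRLF (a blank line), after which the parser moves on to the next statement or reaches the end of the input. I factored args out into its own rule, though it could just as easily have been written inline in expression if I had wanted.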
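To make the lexer/parser split concrete, here is a rough hand-written equivalent in Java: a regex-based lexer that chops the input into tokens, and one method per parser rule (prog, statements, expression, args). It is only a sketch of the idea, not what ANTLR generates. The class name HandRolledBeanParser and the Bean holder are hypothetical. It folds BeanID, Class, ArgName, and ArgValue into generic name and number tokens without checking which kind appears where, and it assumes the last line may omit its trailing newline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HandRolledBeanParser {

    /** Hypothetical holder for one parsed bean definition. */
    public static final class Bean {
        public String id;
        public String className;
        public final Map<String, String> args = new LinkedHashMap<>();
    }

    // Lexer: keywords, newline, punctuation, numbers, then (dotted) names.
    // Spaces and tabs around a token are consumed and thrown away.
    private static final Pattern TOKEN = Pattern.compile(
        "[ \\t]*(make bean\\b|constructor-arg\\b|\\r?\\n|[:,{}=]"
        + "|[0-9]+(?:\\.[0-9]+)?"
        + "|[A-Za-z_][A-Za-z0-9_]*(?:\\.[A-Za-z_][A-Za-z0-9_]*)*)[ \\t]*");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    public HandRolledBeanParser(String input) {
        Matcher m = TOKEN.matcher(input);
        int at = 0;
        while (at < input.length()) {
            m.region(at, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + at);
            }
            String text = m.group(1);
            tokens.add(text.endsWith("\n") ? "\n" : text); // normalise "\r\n" to "\n"
            at = m.end();
        }
    }

    // prog: statements+
    public List<Bean> prog() {
        List<Bean> beans = new ArrayList<>();
        while (pos < tokens.size()) {
            Bean bean = statements();
            if (bean != null) beans.add(bean); // blank lines yield no bean
        }
        return beans;
    }

    // statements: expression CRLF | CRLF
    private Bean statements() {
        Bean bean = "\n".equals(peek()) ? null : expression();
        if (peek() != null) expect("\n"); // assumption: last line may omit its newline
        return bean;
    }

    // expression: MakeBean ':' BeanID ',' Class ',' TypeOfInjection '{' args '}'
    private Bean expression() {
        Bean bean = new Bean();
        expect("make bean");
        expect(":");
        bean.id = next();
        expect(",");
        bean.className = next();
        expect(",");
        expect("constructor-arg");
        expect("{");
        args(bean);
        expect("}");
        return bean;
    }

    // args: ArgName '=' ArgValue (',' ArgName '=' ArgValue)*
    private void args(Bean bean) {
        while (true) {
            String name = next();
            expect("=");
            bean.args.put(name, next());
            if (!",".equals(peek())) return;
            pos++; // consume the comma and loop for the next pair
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalStateException("Unexpected end of input");
        return tokens.get(pos++);
    }

    private void expect(String expected) {
        String actual = next();
        if (!expected.equals(actual)) {
            throw new IllegalStateException("Expected '" + expected + "' but found '" + actual + "'");
        }
    }
}
```

Calling new HandRolledBeanParser(text).prog() on the sample lines returns one Bean per line. The shape is the part worth noticing: each grammar rule becomes one method, and each token reference becomes a call that consumes one token. Writing that out by hand is the tedious part a parser generator is meant to take off your plate.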
Here's the code you'd use in ANTLR. So far, I see that it draws state machines for me, but I don't yet know how to feed it input and get output (however, I imagine that wouldn't be too difficult). I've tried to add comments to explain what I understand to be going on.
grammar expr;

options { // not sure what all can go here
    output=AST;
    ASTLabelType=CommonTree; // type of $statements.tree ref etc...
}

// one or more statements; print the tree built for each one as it is parsed
prog: ( statements {System.out.println($statements.tree.toStringTree());} )+ ;

statements
    : expression CRLF -> expression // expression followed by CRLF; "->" rewrites the tree to keep only the expression
    | CRLF ->                       // or a bare newline, which rewrites to nothing (an empty tree)
    ;

// NOTE: the lexer skips WS (see below), so the WS* references here never match anything and are redundant
expression
    : MakeBean WS* ':' WS* BeanID WS* ',' WS* Class WS* ',' WS* TypeOfInjection WS* '{' WS* args WS* '}' WS*
    ;

args
    : ArgName WS* '=' WS* ArgValue (',' ArgName WS* '=' WS* ArgValue)*
    ;

// stuff I put in for reusable components
Identifier : (Char|'_')+ (Int|Char|'_')* ; // composition of letters, _'s, and digits
Int        : '0'..'9'+ ;                   // any digit, one or more times
CRLF       : '\r'? '\n' ;                  // carriage return / line feed
WS         : (' '|'\t')+ {skip();} ;       // whitespace, thrown away by the lexer
Char       : ('a'..'z'|'A'..'Z') ;         // any letter
Float      : Int+ '.' Int+ ;               // one or more digits followed by a period and some more digits

// Tokens we described earlier
// NOTE: Identifier, BeanID and ArgName match exactly the same text (and Class and ArgValue overlap with them),
// so the lexer cannot tell them apart; expect ANTLR to complain that some of these can never be matched
MakeBean        : 'make bean' ;
BeanID          : Identifier ;
Class           : Identifier ('.' Identifier)* ;
TypeOfInjection : 'constructor-arg' ;
ArgName         : Identifier ;
ArgValue        : Identifier|Int|Float ;
I don't claim that this design is the optimal (or even close to optimal) one. This is the first time I've done something like this outside of an academic setting, where the goal was to explain the kinds of strings a grammar like this might generate (or, given some strings, to construct a grammar that can generate them). In fact, if you've got a better design (with reasons, or some heuristics we can follow), I'd especially love to hear from you in the comments. Also, feel free to ask questions and I'll answer them to the best of my ability.
In all, it is hard to measure how long this took me. I had tons of different distractions going on while doing this, so discounting those I'd estimate about an hour or two to get this VERY minor grasp of ANTLR, but I also have the benefit of already being exposed to the grammar description language, so your mileage may vary. In case you're interested, here are some ANTLR tutorials.
The main drawback to ANTLR is that it has only a few target languages (at the moment): Java, C#, Objective C, C, Python and Ruby. On the other hand, Perl, C++, and Oberon are being worked on, and you are able to add support for others. (Does anyone want to make one of these for ColdFusion?)
But even with that limitation, is it starting to look like all those benefits to using XML as a DSL container aren't exclusive to XML? I can imagine how easy this would be once I learn what I'm doing. Guess it's time to buy The Definitive ANTLR Reference: Building Domain-Specific Languages (the Elk?).
Leave a comment
Well, as per IM conversation, not sure I'm going to use ANTLR as my primary tool, but I've bought the reference guide so I'll take it down to CF United for some light reading next week!
Posted by Peter Bell on Jun 19, 2007 at 06:40 PM UTC - 6 hrs
Then I hope to see some tutorial posts sometime soon after that! =)
Posted by Sam on Jun 20, 2007 at 06:15 AM UTC - 6 hrs
ANTLR is crazy cool, I was toying with writing an actionscript parser once...but then I realized, this was a bit over my head...as for a CF parser, you should talk to Mark Mandel, I believe he's been writing one for the CFEclipse project.
Posted by Derek P. on Jun 26, 2007 at 12:16 PM UTC - 6 hrs
I did notice his name as one of the testimonials. That would certainly be cool to build external DSLs on top of CF with less effort than you'd normally need to put in.
Posted by Sam on Jun 26, 2007 at 12:26 PM UTC - 6 hrs
I've been using ANTLR v3 since it started in Beta, and I love it!
It's how I built TQL for Transfer (tho if I built it now I would have built it differently, but that's the way of all things), and integrated the Java code with Transfer using JavaLoader.
The main power I find with ANTLR is that it is just SO extensible, not only within the grammar, but also in terms of the code you can add to things. You don't like the CommonToken... make it use a different one.. you don't like the Tree implementation, use a different one there too, if you want to do something tricky, you can write inline code into your grammar to do fancy rewrites with island parsers, catch exceptions for better error handling.. the list is endless.
ANTLR is an amazing tool, and the LL(*) parsing technology is crazy smart.
Sam: if you ever want to talk ANTLR or CF/ANTLR integration, drop me a line.
Posted by Mark Mandel on Jun 26, 2007 at 05:50 PM UTC - 6 hrs
Mark, when I get to using it on a more regular basis, I'll most certainly take you up on that offer. Thanks!
Posted by Sam on Jun 27, 2007 at 05:58 AM UTC - 6 hrs
Hey Mark, is that an open offer?! I've just got the ANTLR book for my "light summer reading" :->
Posted by Peter Bell on Jun 27, 2007 at 09:46 AM UTC - 6 hrs
The ANTLR mailing list is an *awesome* resource. I usually have replies back to my questions within the hour, and at the maximum 24 hours (Terence is very active on the mailing list), but I'm always ready to chat ANTLR via IM or otherwise any time ;)
So yeah, I don't mind.
@Peter: I finally got the book too... after using it for over 6 months, I figured it was about time... ;)
Posted by Mark Mandel on Jun 27, 2007 at 06:05 PM UTC - 6 hrs