What kinds of languages can be generated by regex in Ruby and Perl

home | about | contact | privacy statement

What kinds of languages can be generated by regex in Ruby and Perl?

Posted by Sam on Feb 27, 2008 at 12:00 AM UTC - 5 hrs

(or your favorite regex engine)

I have a feeling this post is going to go over like a ton of bricks.

The subject of regular languages, context free languages, and just formal language theory in general caught my eye today with a question about prime numbers. This was one of my favorite classes as an undergraduate, so I thought I'd join in the discussion.

Raganwald quotes Sam, who said

If you can provide a regular expression such that it actually matches the string representation of only prime (or only non-prime) integers, that would be pretty sweet. A proof that such a thing could not be created would be equally impressive.

(Sam also linked to a blog post that linked to another that constructed a regular expression to decide if the number of 1s in a string of 1s is not prime, or /^1?$|^(11+?)\1+$/.)

A couple of comments from people at Raganwald mentioned using the pumping lemma for regular languages to prove that such a thing was not possible.

Indeed, even determining if 1* is not prime (the regular expression above) can be shown not to be a regular language. Using the pumping lemma for regular languages:

Let the language L = { 1ⁱ, i is a number greater than 0 and is not prime }
Assume L is a regular language. Then, by the pumping lemma, there exists a number p >= 1 such that every string w in L of length p or greater can be decomposed into substrings w=xyz, where |y| (length of y) > 0, |xy| <= p, and for all i >= 0, xyⁱz is also in L.
Choose a string, w from L whose length is greater than p.
Since there are infinitely many prime numbers, we can find one greater than any p.
Therefore, we can choose an i in w=xyⁱz such that repeating it some number of times will be a prime. We arrive at a contradiction, and note that because w cannot be pumped, L is not a regular language.

Another commenter mentioned that "Regular expressions these days can match any context-free language through the use of sub-expressions." Clearly, since our language L is not regular, but it is matched by a regex, we can see that today's regexes are more powerful than regular languages would allow them to be. But, is our regex even restricted enough to be a CFL?

A similar proof using the pumping lemma for CFLs would show that our language L is more powerful than even a CFL. (Don't let me slide here if that's not the case.)

Still, that doesn't tell us anything useful for the problem at hand - only that the regexen of (at least) Perl and Ruby are more powerful than CFLs. But how much more? If we want to prove that a regular expression (in the practical sense of the word) cannot take a string representation of a number, (e.g., "13" or "256") and determine if it is not prime (or prime), then we need to know how powerful regex are.

But I don't know where to start on that front. Any ideas?

Alternatively, if we want to prove that it can be done, we need only demonstrate so by coming up with the regex to do it. I'm not convinced it's possible, but I'm not convinced it's not possible either. Ideally, I'd like to find the formal definition of how powerful regex are, and prove at least that we don't know if the language is in that class or not. (The pumping lemmas, for example, are necessary but not sufficient to prove membership of L amongst the respective class of languages.)

Comments are appreciated. I'm sort of stuck at the moment, so I wanted to bounce these ideas out there to see if someone might bounce one back.

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!

Comments

Leave a comment

There are no comments for this entry yet.

Leave a comment

Topics
.NET (19)
AI/Machine Learning (14)
Answers To 100 Interview Questions (10)
Bioinformatics (2)
Business (1)
C and Cplusplus (6)
cfrails (22)
ColdFusion (78)
Customer Relations (15)
Databases (3)
DRY (18)
DSLs (11)
Future Tech (5)
Games (5)
Groovy/Grails (8)
Hardware (1)
IDEs (9)
Java (38)
JavaScript (4)
Linux (2)
Lisp (1)
Mac OS (4)
Management (15)
MediaServerX (1)
Miscellany (76)
OOAD (37)
Productivity (11)
Programming (168)
Programming Quotables (9)
Rails (31)
Ruby (67)
Save Your Job (58)
scriptaGulous (4)
Software Development Process (23)
TDD (41)
TDDing xorblog (6)
Tools (5)
Web Development (8)
Windows (1)
With (1)
YAGNI (10)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit

RSS 2.0: Full Post | Short Blurb
Subscribe by email: