(or your favorite regex engine)

I have a feeling this post is going to go over like a ton of bricks.

The subject of regular languages, context free languages, and just formal language theory in general
caught

my eye today with a question about prime numbers. This was one of

my favorite classes as
an undergraduate,
so I thought I'd join in the discussion.

Raganwald
quotes Sam, who said
If you can provide a regular expression such that it actually matches the string
representation of only prime (or only non-prime) integers, that would be pretty sweet.
A proof that such a thing could not be created would be equally impressive.

(Sam also

linked to a blog post
that

linked to another
that constructed a regular expression to decide if the number of 1s in a string of 1s is
not prime, or

`/^1?$|^(11+?)\1+$/`

.)

A

couple of comments from people at Raganwald
mentioned using the

pumping lemma for regular languages to
prove that such a thing was not possible.

Indeed, even determining if 1* is not prime (the regular expression above) can be shown

*not* to be a regular language. Using the pumping
lemma for regular languages:

- Let the language L = { 1
^{i}, i is a number greater than 0 and is not prime }
- Assume L is a regular language. Then, by the pumping lemma, there exists
a number p >= 1 such that every string w in L of length p or greater can
be decomposed into substrings w=xyz, where |y| (length of y) > 0, |xy| <= p,
and for all i >= 0, xy
^{i}z is also in L.
- Choose a string, w from L whose length is greater than p.
- Since there are infinitely many prime numbers, we can find one greater than any p.
- Therefore, we can choose an i in w=xy
^{i}z such that repeating it some number
of times will be a prime. We arrive at a contradiction, and
note that because w cannot be pumped, L is not a regular language.

Another commenter mentioned that "Regular expressions these days can match any
context-free language through the use of sub-expressions." Clearly, since our language L
is not regular, but it is matched by a regex, we can see that today's regexes are more
powerful than regular languages would allow them to be. But, is our regex even restricted
enough to be a

CFL?

A similar proof using the

pumping lemma for CFLs
would show that our language L is more powerful than even a CFL. (Don't let me slide here if
that's not the case.)

Still, that doesn't tell us anything useful for the problem at hand - only that the regexen of (at least) Perl and Ruby
are more powerful than CFLs. But how much more? If we want to prove that a regular expression
(in the practical sense of the word)
cannot take a string representation of a number, (e.g., "13" or "256") and determine if
it is not prime (or prime), then we need to know how powerful regex are.

But I don't know where to start on that front. Any ideas?

Alternatively, if we want to prove that it can be done, we need only demonstrate so by
coming up with the regex to do it. I'm not convinced it's possible, but I'm not convinced it's
not possible either. Ideally, I'd like to find the formal definition of how powerful regex are,
and prove at least that we don't know if the language is in that class or not. (The
pumping lemmas, for example, are necessary but not sufficient to prove membership of L
amongst the respective class of languages.)

Comments are appreciated. I'm sort of stuck at the moment, so I wanted to bounce these
ideas out there to see if someone might bounce one back.

Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll *love* my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
There are no comments for this entry yet.

**Leave a comment**