regular expression basics

anyone want to supply some?

jatar_k

6:56 am on Aug 29, 2002 (gmt 0)

Regular expressions are the bane of my existence. I always get someone else to write them and I am more than happy to cash in favors to make sure it happens.

What I am looking for is the basics. If I was going to start using regular expressions and needed to understand the fundamentals involved how would you explain it to me?

I know what they are and I am very familiar with all of the functions that use them, I am just talking straight up regex. What do all the complicated strings of chars in those functions mean?

Damian

8:27 am on Aug 29, 2002 (gmt 0)

Maybe this helps:
This page [artswebsite.com] has an overview with the basics, taken from the Macromedia Dreamweaver help files and from O'Reilly.

jatar_k

4:35 pm on Aug 29, 2002 (gmt 0)

Thanks Damian, looks like a good place to start.

Robber

8:11 pm on Aug 29, 2002 (gmt 0)

I reckon the best advice I could give regarding regex is to look at the expression one character at a time. If you try to figure out the whole thing all at once you're heading for trouble, especially when you start capturing groups with parentheses!

Another thing I find (which is probably obvious!): take the time to figure out which parts of the regex are special characters, e.g. \s matches any whitespace character and \w matches any word character, but at a glance it's easy to miss that.

Oh yeah, and one other thing that's useful to remember: ^ means match at the start of a string, except in a character class, where it means negate the character class. That catches me out quite a bit.
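
A quick sketch of the two meanings of ^, written in Python rather than PHP purely for illustration (the syntax is the same in both; the sample strings are made up):

```python
import re

# Outside brackets, ^ anchors the match to the start of the string.
starts_with_cat = re.search(r"^cat", "catalog") is not None
not_at_start = re.search(r"^cat", "bobcat") is not None

# Inside brackets, a leading ^ negates the class: [^0-9] is "any character that is not a digit".
non_digits = re.findall(r"[^0-9]", "a1b2")

print(starts_with_cat, not_at_start, non_digits)
```

The first search succeeds because "cat" sits at the very start of "catalog"; the second fails because in "bobcat" it does not.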

Well that's my 2p worth!

ergophobe

10:08 pm on Aug 29, 2002 (gmt 0)

jatar,

why don't you try this article

So What's A $#!%% Regular Expression, Anyway?! [devshed.com] from devshed.

Are you working under *nix? If so, just play around with grep and such.

If under Windows, there are lots of regex utilities. BKReplacem is a good one and there are various ports of grep to Win.

Tom

transistor

11:44 pm on Aug 29, 2002 (gmt 0)

A simple one:


<?php
// eregi() was removed in PHP 7; preg_match() with the i flag does the same job
if (preg_match('/^[a-zA-Z0-9\ ]+$/i', $var)) { // a space after the backslash
 echo "Passed!";
} else {
 echo "Failed";
}
?>

Like Robber said: ^ starts with...
between brackets are the characters allowed, in this case:
a-zA-Z (any lowercase or uppercase letter)
0-9 (any number)
\ (this is a space escaped, allows a space, doh!)
the brackets end and then
+$ (which I understand as "ends with")
So, this code will return:
Passed! for $var="My name is Transistor"
Failed for $var="No time, to lose!" // the comma and the exclamation mark are not allowed
Passed! for $var="12345678"
Passed! for $var="Regex 101"
Failed for $var="$100.50" // Dollar sign and period not allowed
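
For anyone who wants to try those five strings without a PHP setup, here is a rough Python equivalent. It is only a sketch: the function name check and the samples list are made up for this example, and the space in the character class is left unescaped, which should behave the same way.

```python
import re

# Same idea as the PHP example: letters, digits and spaces only, from start to end.
ALLOWED = re.compile(r"^[a-zA-Z0-9 ]+$")

def check(value):
    """Return "Passed!" or "Failed", mirroring the PHP snippet."""
    return "Passed!" if ALLOWED.search(value) else "Failed"

samples = [
    "My name is Transistor",
    "No time, to lose!",
    "12345678",
    "Regex 101",
    "$100.50",
]

results = {s: check(s) for s in samples}
for text, verdict in results.items():
    print(f"{verdict:8} {text}")
```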

jdMorgan

11:44 pm on Aug 29, 2002 (gmt 0)

jatar,

Another comment I read somewhere that I've found to be absolutely true is that regex are easier to write than they are to read! A good comparison is that writing your own scripts is much easier than reading someone else's. So do give it a try.

Jim

jatar_k

7:22 am on Aug 30, 2002 (gmt 0)

Ok so I have written a few good ones previously and debugged a ton but oooh do I hate them.

These all look like good resources and tips. Who knows I might even get good at them.

I will add my own personal one. I have always used Perl in a Nutshell from O'Reilly.

The times I have been forced to do them it has gotten me through quite well but I am going to add this thread to my resource list.

Added: transistor, sweet little example, very intuitive.

jdMorgan

8:19 pm on Aug 30, 2002 (gmt 0)

Just a minor correction to this "resource" :)

^[a-zA-Z0-9\ ]+$
...the brackets end and then +$ (which I understand as "ends with")

The "+" means "require one or more of the preceding character or group", in this case the contents of the square brackets.

The "$" means, "and this must match at the end of the string being tested."
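
One way to see what each piece contributes is to take it away and watch what changes. This is a small Python sketch, not PHP; the helper name matches and the test strings are invented for the example.

```python
import re

full = r"^[a-zA-Z0-9 ]+$"      # the pattern from the thread (space left unescaped)
no_plus = r"^[a-zA-Z0-9 ]$"    # without "+": exactly one allowed character
no_dollar = r"^[a-zA-Z0-9 ]+"  # without "$": only the beginning has to be valid

def matches(pattern, text):
    """True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

whole_ok = matches(full, "Regex 101")        # every character allowed, checked to the end
one_char_only = matches(no_plus, "Regex 101")  # more than one character, so no match
single = matches(no_plus, "R")               # a single allowed character is fine
prefix_only = matches(no_dollar, "Regex!")   # "Regex" matches; the "!" is never checked
anchored = matches(full, "Regex!")           # "$" forces the check to reach the end

print(whole_ok, one_char_only, single, prefix_only, anchored)
```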

Jim

tonic

8:56 pm on Aug 30, 2002 (gmt 0)

and for more help this tool is very kewl :
[gotdotnet.com...]

you enter the regexp, some text, it displays the output

mdharrold

9:29 pm on Aug 30, 2002 (gmt 0)

The regex page I use to sort it all out. [troubleshooters.com]

"Another comment I read somewhere that I've found to be absolutely true is that regex are easier to write than they are to read! A good comparison is that writing your own scripts is much easier than reading someone else's. So do give it a try."

I agree completely. I have to take several minutes of complete silence to understand someone else's regular expression.

gsx

9:54 pm on Aug 30, 2002 (gmt 0)

They are easy to write.

If you want to match a letter, type the letter. If you want to match a symbol, always type a backslash then the symbol.

Then there are the special codes, the opposite of the above: a letter preceded by a backslash or a symbol without a backslash.

You will most likely use:
[...-...] : Range of chars : e.g. [A-Z]
^ : Match start of string
$ : Match end of string
. : Match any character
\b : Match any word boundary : e.g. \bit\b will match 'a it b' but not 'a bite b'
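
The codes in this list can be tried out one at a time. The following is a sketch in Python (chosen only for convenience; the sample strings are made up), and the same patterns should work in any Perl-style engine:

```python
import re

# [A-Z] is a range: it matches any single uppercase letter.
capitals = re.findall(r"[A-Z]", "Regex Basics 101")

# . matches any single character (by default, anything except a newline).
dot_hits = re.findall(r"b.t", "bit bat b t bt")

# \b is a word boundary, so \bit\b only matches "it" as a whole word.
whole_word = re.search(r"\bit\b", "a it b") is not None
inside_word = re.search(r"\bit\b", "a bite b") is not None

# ^ and $ pin the match to the start and end of the string.
anchored = re.search(r"^[A-Z].*\.$", "Ends with a period.") is not None

print(capitals, dot_hits, whole_word, inside_word, anchored)
```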

Then there are quantifiers:
* : Match zero or more times : e.g. X* will match '', 'X', 'XX', 'XXX' etc...
*?: Match zero or more times (same as above but will take as few characters as possible, above will take as many characters as possible)
+ : Match one or more times : e.g. X+ will match 'X', 'XX' etc.. but not ''
+?: Match one or more times : (same as above but will take as few characters as possible, above will take as many characters as possible)
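
Here is a small sketch of those four forms using the X examples from the list, in Python for convenience. re.fullmatch is used so that the whole test string has to match; the variable names are just for illustration.

```python
import re

tests = ["", "X", "XXX"]

# X* accepts zero or more X; X+ needs at least one.
star = [re.fullmatch(r"X*", s) is not None for s in tests]
plus = [re.fullmatch(r"X+", s) is not None for s in tests]

# With the same input, the plain forms take as much as they can
# and the ? forms take as little as they can.
greedy_plus = re.match(r"X+", "XXX").group()
lazy_plus = re.match(r"X+?", "XXX").group()
lazy_star = re.match(r"X*?", "XXX").group()

print(star, plus)
print(repr(greedy_plus), repr(lazy_plus), repr(lazy_star))
```

Note that X*? on its own happily matches nothing at all, which is why the lazy forms are normally followed by something else in the pattern.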

Minimal and maximal matching (also called lazy and greedy) work as follows:
If you have the string <b><a href=x>ThisLink</a></b>
and match it with \<.*\>, the whole string matches, because .* takes as many characters as possible.
But if you match with \<.*?\>, only <b> is returned, because that is the shortest possible match (from the left).

You will find .*? invaluable: e.g. \<span.*?\/span\> will get any string <span....>....</span> matched.

(Technically, you do not need to backslash the < and > in Perl-style regex, but it makes it easier to see that they are literal chars when you read it in years to come. Be careful in tools such as GNU grep, where \< and \> mean start and end of word.)
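
The greedy and minimal behaviour described above can be checked with the same test string. This is a sketch in Python, where the backslashed < and > are treated as literal characters; the second sample string with the two spans is made up for the example.

```python
import re

html = "<b><a href=x>ThisLink</a></b>"

# Greedy: .* runs to the last > it can find, so the whole string matches.
greedy = re.search(r"\<.*\>", html).group()

# Minimal: .*? stops at the first > it meets.
lazy = re.search(r"\<.*?\>", html).group()

# Applied repeatedly, the minimal form picks out each tag in turn.
all_tags = re.findall(r"\<.*?\>", html)

# The span pattern from the post, on a string with two spans.
spans = re.findall(r"\<span.*?\/span\>", "<span class=a>one</span><span>two</span>")

print(greedy)
print(lazy)
print(all_tags)
print(spans)
```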

I recommend O'Reilly books for further information, very in depth but brilliant for quick reference.

Robber

9:15 am on Aug 31, 2002 (gmt 0)

Nice one gsx, can't forget the ?. The first time I saw it was in something simple like .*?; at the time I hadn't come across the concept of greediness and wondered what the hell was going on. I assumed the ? meant zero or one, which in that context was rubbish. So watch out folks: straight after a quantifier like * or + it doesn't mean that at all.
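
To make the two roles of ? concrete, here is a short Python sketch (the sample strings are invented). After an ordinary character or group, ? really is "zero or one"; it only switches to "match as little as possible" when it directly follows another quantifier.

```python
import re

# After a plain character, ? means "zero or one": the u is optional.
optional = re.findall(r"colou?r", "color colour")

# After a quantifier, ? changes how much that quantifier takes.
text = 'say "a" and "b"'
greedy = re.search(r'".*"', text).group()
lazy = re.search(r'".*?"', text).group()

print(optional)
print(greedy)
print(lazy)
```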

lorax

12:25 pm on Aug 31, 2002 (gmt 0)

Having spent my share of time tripping over regular expressions, I'll add that, just like everything else, syntax is everything. The difference is that when regular expressions don't work it can be awfully hard to find that little typo.

"A good comparison is that writing your own scripts is much easier than reading someone else's." I almost agree. ;) Reading isn't so much the problem for me (and this may be what you were really getting at) but rather wrapping my mind around where the programmer was headed with the code and building a mental picture of all the pieces. When you write it yourself that develops naturally. Much the same for regular expressions, but on a smaller scale.

I personally found it easier to work my way through regular expressions by taking someone's example code and playing with it. I used one that checked email addresses for the "@" and looked for the "." as well. The problem I noted is that it didn't account for the fact that some email addresses use a . before @ like "john.smith@roger.com". Playing with that code taught me a lot. I spun my wheels for a time over a syntax problem. So after a cup of tea and a bit of lunch I came back to it and spotted the bugger right off. That's how it usually goes.
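
The expression from that exercise isn't shown here, so the following Python sketch is only a guess at the kind of bug described. NAIVE is a hypothetical pattern that allows only word characters before the @, and LOOSER is one possible fix. Neither is a real email validator; they only check for a rough something@something.something shape.

```python
import re

# Hypothetical "before" pattern: \w does not include ".", so dotted names fail.
NAIVE = re.compile(r"^\w+@\w+\.\w+$")

# Hypothetical "after" pattern: anything except @ and whitespace on each side,
# with at least one dot somewhere after the @.
LOOSER = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

address = "john.smith@roger.com"

naive_ok = NAIVE.search(address) is not None
looser_ok = LOOSER.search(address) is not None
no_dot_after_at = LOOSER.search("john.smith@roger") is not None
no_at_sign = LOOSER.search("no-at-sign.com") is not None

print(naive_ok, looser_ok, no_dot_after_at, no_at_sign)
```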

Just my 2 cents.
GB