
PHP Server Side Scripting Forum

    
strange reg ex result
sssweb




msg:4528417
 2:07 pm on Dec 15, 2012 (gmt 0)

Can someone explain why this gives a match:

preg_match('/#[a-z]+#/', 'test#test#test');

but this doesn't:

preg_match('/#[.]+#/', 'test#test#test');

 

g1smd




msg:4528418
 2:12 pm on Dec 15, 2012 (gmt 0)

The [.]+ doesn't match as there are no literal periods in the input. It would match test#.#test or test#.....#test with literal periods.

The [ ] denotes a character class (often loosely called a group); in this case [.] matches a single literal period. You normally use class syntax when there are several candidate characters: [a-z] matches a single lowercase letter, [0-9] matches a single digit, [aeiou] matches a single vowel. The + in [x]+ means "one or more" characters, each drawn from the class.
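
A minimal sketch of the point above, using the two patterns from the question plus one input that really does contain periods. The variable names are only for illustration.

```php
<?php
// [a-z]+ finds letters between the two # characters.
$letters = preg_match('/#[a-z]+#/', 'test#test#test');

// [.]+ wants one or more literal periods between the # characters; there are none.
$periods = preg_match('/#[.]+#/', 'test#test#test');

// The same pattern succeeds once the input contains literal periods.
$dots = preg_match('/#[.]+#/', 'test#.....#test');

var_dump($letters, $periods, $dots); // int(1) int(0) int(1)
```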

You might decide to use .+ instead, but that is usually the wrong thing to use as well.

Using .+ would first "eat" everything to the end, including the following # and everything after it: test(#)(test#test), where the brackets show what has been consumed. The engine then has to backtrack, giving characters back one at a time until the final # of the pattern can match. On longer input such as test1#test2#test3#test4, that means the match stretches to the last #, not the next one.

Avoid .* or .+ at the beginning or in the middle of a pattern. It is nearly always ambiguous. Keep .* or .+ for the trailing end of a pattern, where it means "capture the remainder of the input".
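
To see how far a greedy .+ reaches between delimiters, here is a small comparison on a four-field input (the input string and variable names are assumptions for illustration). The lazy form .+? is included only for contrast.

```php
<?php
$input = 'test1#test2#test3#test4';

// Greedy: .+ runs to the end, then backtracks to the LAST # it can use.
preg_match('/#.+#/', $input, $greedy);

// Negated class: [^#]+ cannot cross a #, so the match stops at the NEXT #.
preg_match('/#[^#]+#/', $input, $negated);

// Lazy quantifier: .+? expands only as far as needed to reach a #.
preg_match('/#.+?#/', $input, $lazy);

echo $greedy[0], "\n";  // #test2#test3#
echo $negated[0], "\n"; // #test2#
echo $lazy[0], "\n";    // #test2#
```

The negated class states the intent ("anything but the delimiter") directly, which is why it tends to be the safer choice in the middle of a pattern.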

sssweb




msg:4528422
 3:01 pm on Dec 15, 2012 (gmt 0)

Right -- I see that now. I changed it to:

preg_match('/#[^#]+#/', 'test#test#test');

which seems to work. Any reason I shouldn't use that? (I want to test for any char except '#')

g1smd




msg:4528438
 4:21 pm on Dec 15, 2012 (gmt 0)

That's the right pattern, but it could be ambiguous when the input is test1#test2#test3#test4.

Would you want test2 or test3 to match? preg_match stops at the leftmost match, so it would pick #test2#.

I rarely write a pattern without a start anchor and a leading group that walks the match along to the place where I want to start looking for real. Here, that leading part would be another [^#]+.
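
A rough sketch of what an anchored version could look like. The capture groups and the {2} repeat count are assumptions added for illustration; they are one way of saying which field you mean.

```php
<?php
$input = 'test1#test2#test3#test4';

// Unanchored: the first place the pattern fits is around test2.
preg_match('/#([^#]+)#/', $input, $unanchored);

// Anchored: ^[^#]+# walks past the first field, so group 1 is the second field.
preg_match('/^[^#]+#([^#]+)#/', $input, $second);

// Skipping two leading fields instead makes group 1 the third field.
preg_match('/^(?:[^#]+#){2}([^#]+)#/', $input, $third);

echo $unanchored[1], "\n"; // test2
echo $second[1], "\n";     // test2
echo $third[1], "\n";      // test3
```

With the anchor in place, the pattern says explicitly how many fields to skip, so there is no doubt about which one is captured.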

sssweb




msg:4528440
 4:22 pm on Dec 15, 2012 (gmt 0)

[edit: just saw your reply -- yes, I want all '#test' to match as long as there's 2 or more '#']

Here's another one -- why does this give a match:

preg_match('/(?<=test)[\d]+/', 'test556252345', $matches);

but this doesn't:

preg_match('/(?<=test)[.]+/', 'test556252345', $matches);

[edited by: sssweb at 4:29 pm (utc) on Dec 15, 2012]

g1smd




msg:4528442
 4:29 pm on Dec 15, 2012 (gmt 0)

[group] lists literal characters that should match.

. has a different meaning inside [ ] compared to outside it.

[.]+ is looking for one or more literal periods in the input where none exists.
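
The lookbehind case behaves the same way. A short sketch with the input from the question; the third pattern, with a bare .+ after the lookbehind, is an addition for comparison.

```php
<?php
$input = 'test556252345';

// [\d]+ matches the digits that follow "test".
$a = preg_match('/(?<=test)[\d]+/', $input, $digits);

// [.]+ needs literal periods right after "test"; the input has none.
$b = preg_match('/(?<=test)[.]+/', $input, $periods);

// A bare .+ (outside brackets) means "any characters", so it matches.
$c = preg_match('/(?<=test).+/', $input, $any);

var_dump($a, $b, $c);     // int(1) int(0) int(1)
echo $digits[0], "\n";    // 556252345
echo $any[0], "\n";       // 556252345
```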



What do you mean you want to match "all" tests?

You want to extract test2#test3 ?

You'll need a set of ( ) and another + if you do.

'/^[^#]+#(([^#]+#)+)/' will extract test2#test3# from test1#test2#test3#test4 and you'll need to clean the final # from the match.
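
As a sketch of the extraction and clean-up step, assuming rtrim() is an acceptable way to drop the trailing # (the explode() call is an optional extra for splitting the fields):

```php
<?php
$input  = 'test1#test2#test3#test4';
$middle = null;
$fields = [];

if (preg_match('/^[^#]+#(([^#]+#)+)/', $input, $m)) {
    // $m[1] holds every repeated "field#" chunk: test2#test3#
    // $m[2] holds only the last repetition: test3#
    $middle = rtrim($m[1], '#');      // strip the final #
    $fields = explode('#', $middle);  // split into individual fields
}

echo $middle, "\n";  // test2#test3
print_r($fields);    // test2, test3
```

Note that test4 is left out because no # follows it, so the repeated group cannot include it.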

[edited by: g1smd at 4:38 pm (utc) on Dec 15, 2012]

sssweb




msg:4528446
 4:30 pm on Dec 15, 2012 (gmt 0)

[edit - just re-reading your first post -- are you saying '/[aeiou.]/' matches vowels plus a literal period? I've always used [.] to mean any char and never had a problem with it.]


Our sync-posting is confusing -- give me a second to digest & reply

g1smd




msg:4528451
 4:39 pm on Dec 15, 2012 (gmt 0)

. means "any character" when used on its own or in (.)

In [.] it means a literal period, an actual full stop.

sssweb




msg:4528454
 4:46 pm on Dec 15, 2012 (gmt 0)

Yikes -- that's news to me; I always thought it meant any char unless you escape it with a backslash.

Re matching test1#test2#test3#test4 -- sorry, I wasn't clear; I meant that I don't care if there's a long string matching the pattern. I'm using it in a conditional, and want it to accept anything with 2 or more '#'; I only use the actual $matches array in the conditional, with more specific code.
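
For a conditional like that, something along these lines might do. The function name hasDelimitedField is made up for the example. One thing to watch: the pattern needs at least one character between two # marks, which is not quite the same as "2 or more #" (substr_count() is shown for comparison).

```php
<?php
// Hypothetical helper: true when the string contains #...# with a non-empty middle.
function hasDelimitedField(string $s): bool
{
    return preg_match('/#[^#]+#/', $s) === 1;
}

var_dump(hasDelimitedField('test#test#test')); // bool(true)
var_dump(hasDelimitedField('test#test'));      // bool(false): only one #
var_dump(hasDelimitedField('a##b'));           // bool(false): nothing between the #s

// Plain counting accepts 'a##b', because it ignores what sits between the #s.
var_dump(substr_count('a##b', '#') >= 2);      // bool(true)
```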

g1smd




msg:4528456
 4:49 pm on Dec 15, 2012 (gmt 0)

Yes, abc matches "abc" and [abc] matches a single "a" or "b" or "c".

Beware: . matches "any character" (except a newline, by default), but \. and [.] match a literal "." here. There are several other symbols that take on a different meaning inside [ ].

Grab the RegEx manual and read the details for "character groups" very thoroughly. :)

sssweb




msg:4528458
 5:22 pm on Dec 15, 2012 (gmt 0)

Thanks -- just did it.

Also searched all my old files; fortunately I never used it that way -- guess that's why I never had probs w/it :)

g1smd




msg:4528460
 5:28 pm on Dec 15, 2012 (gmt 0)

^(([^/]+/)+) turns up in rules that need to match multiple folder levels and discard the filename and extension, so get familiar with how it all works.
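
A small PHP illustration of that pattern, assuming a path with no leading slash. The sample path is invented, and # is used as the regex delimiter so the / characters need no escaping.

```php
<?php
$path = 'images/2012/december/photo.jpg';

// Each repetition of ([^/]+/) consumes one folder level, slash included.
preg_match('#^(([^/]+/)+)#', $path, $m);

$folders  = $m[1];                         // every folder level
$filename = substr($path, strlen($m[1]));  // whatever is left over

echo $folders, "\n";  // images/2012/december/
echo $filename, "\n"; // photo.jpg
```

The filename drops out for the same reason test4 did in the earlier example: no / follows it, so the repeated group cannot swallow it.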

lucy24




msg:4528523
 10:02 pm on Dec 15, 2012 (gmt 0)

g1smd wrote: "There are several other symbols that take on a different meaning inside [ ]."

Heh. I was just explaining this to someone on an unrelated forum, so I've got most of it at the front of my head.

Inside grouping brackets, everything means the literal character EXCEPT:

\ always has to be escaped everywhere, no exceptions.

] of course means "end of grouping" so literal ] has to be escaped \]. Technically you can use un-escaped open-bracket [ but if it makes you uneasy you can escape it.

^ at the very beginning means "NOT these characters"; elsewhere means a literal ^ but it doesn't hurt to escape it in case you later rearrange the group.

- between any two characters means the whole character range, like the common [0-9]. It will either yield an error or will simply not work if the characters aren't in order, like [c-a]. In final position the meaning of - depends on dialect, but as with ^ it doesn't hurt to escape it.

Shorthand forms like \d work exactly the same in or out of brackets. So do expressions like \p{Punct} (exact form depends on dialect).

Most RegEx dialects don't mind if you escape things that don't formally need escaping, but only do it when it makes it easier to keep track of the regex. And, of course, when the \ doesn't change the meaning: [w.!?] and [\w.!?] are very different things.
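
A few of these rules can be checked quickly in PHP. This is only a sketch against PCRE as used by preg_match; other dialects may differ, especially for the trailing hyphen. The labels and test strings are made up for the example.

```php
<?php
// Each entry: label => [pattern, subject].
$checks = [
    'leading ^ negates the class'    => ['/[^abc]/', 'a'],
    '^ elsewhere is a literal'       => ['/[a^]/', '^'],
    '- between two chars is a range' => ['/[a-c]/', 'b'],
    'trailing - is literal in PCRE'  => ['/[ac-]/', '-'],
    'escaped ] is a literal'         => ['/[\]]/', ']'],
    'escaped backslash is a literal' => ['/[\\\\]/', '\\'],
    '[w.!?] holds only a literal w'  => ['/^[w.!?]+$/', 'wow!'],
    '[\w.!?] holds any word char'    => ['/^[\w.!?]+$/', 'wow!'],
];

$results = [];
foreach ($checks as $label => [$pattern, $subject]) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$label] = preg_match($pattern, $subject);
    echo $label, ': ', $results[$label], "\n";
}
```

Only the first and seventh checks report 0: "a" is excluded by the negated class, and the "o" in "wow!" is not in [w.!?].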

Also be careful about flagging case-insensitivity when you've got brackets, because you may get unintended results. I have come to grief several times over the UTF-8 long "s" (ſ) and even the Latin-1 eszett (ß), which capitalize to ASCII S and SS respectively.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved