Help with regular expression splitting string

beany

6:19 pm on Mar 26, 2008 (gmt 0)

10+ Year Member



Hi all,

After more than four hours of googling and trying and failing I gave up. Now I am seeking your help with a (at least I think) fairly simple problem:

On my website users can enter tags like "James", "James Bond", or "Agent 007". Each tag consists of one or more words. These tags are submitted as one single string that I need to split at each whitespace or - if multiple words are enclosed by double quotes - at the double quotes.

Here is what I came up with:

Code:


<?php
/*string*/ $keywords = "\"James! (007) Bond\", \"Aston Martin\" 007 Q";
preg_match_all('/"(.[^"]+)"|([\w]+)/i', $keywords, $arr);
print "<pre>"; print_r($arr); print("</pre>");
?>

Output:


Array
(
[0] => Array
(
[0] => "James! (007) Bond"
[1] => "Aston Martin"
[2] => 007
[3] => Q
)

[1] => Array
(
[0] => James! (007) Bond
[1] => Aston Martin
[2] =>
[3] =>
)

[2] => Array
(
[0] =>
[1] =>
[2] => 007
[3] => Q
)

)

My solution works, but I think it can still be improved. Could you have a look at it and give me some advice on how to improve the pattern? For example, do I really need an alternative pattern?

Thanks in advance!

Regards

jo

eelixduppy

8:57 pm on Mar 27, 2008 (gmt 0)



Hello, and welcome to WebmasterWorld! :)

Your pattern, at a quick glance, looks fine to me. What don't you like about it that you want to improve?

PHP_Chimp

9:47 pm on Mar 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your two patterns do different things, so whether you need both depends on exactly what you want.

"(.[^"]+)"

This looks for a " followed by any single character, then one or more characters that are not a ", then a closing ".
So the match must be at least 4 characters long (two quotes plus two other characters). The first captured character can be anything, even a "; every character after it must not be a ".

([\w]+)

Will look for any a-zA-Z0-9 or _ at least once. So this string can be a single character long. There is no requirement for the enclosing quotes.

You don't need the i modifier: \w already includes both upper and lower case, the . matches anything, and [^"] matches everything other than a ".

Your two alternatives are quite different, and you may well need both in your specific case. As we don't know exactly what you want, we can't tell you whether you could get away with only one of them. You could test whether a single or modified pattern does what you want.
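To see what each half contributes, the two alternatives can be run on their own. This is only a sketch using the sample string from the first post; the variable names $quoted and $words are made up for the illustration.

```php
<?php
// Sample input taken from the first post.
$keywords = '"James! (007) Bond", "Aston Martin" 007 Q';

// Quoted alternative only: picks up the two quoted phrases.
preg_match_all('/"(.[^"]+)"/', $keywords, $quoted);

// Word alternative only: picks up every run of word characters,
// including the words that sit inside the quotes.
preg_match_all('/\w+/', $keywords, $words);

print_r($quoted[1]); // James! (007) Bond, Aston Martin
print_r($words[0]);  // James, 007, Bond, Aston, Martin, 007, Q
```

On its own, neither half gives the wanted result: the first misses the bare words, and the second breaks the quoted phrases apart.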

[edited by: PHP_Chimp at 9:47 pm (utc) on Mar. 27, 2008]

beany

7:20 am on Mar 28, 2008 (gmt 0)

10+ Year Member



Hi eelixduppy, hi PHP_Chimp,

Many thanks for your replies, much appreciated. What I want is to split the following string (or a similar one) at each whitespace, but not at whitespace enclosed in double quotes:

$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

From the above string I am expecting an array with the following items:

0: Action
1: Stunt
2: Bruce Willis
3: Movie
4: Die Hard
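For reference, here is a minimal sketch of one way to get exactly this flat list. It assumes a slightly simplified pattern (no leading dot, no i modifier) and uses PREG_SET_ORDER so that each match can be inspected on its own; the $tags name is just for illustration.

```php
<?php
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

// Group 1 captures a quoted phrase without its quotes,
// group 2 captures a bare word.
preg_match_all('/"([^"]+)"|(\w+)/', $str, $matches, PREG_SET_ORDER);

$tags = [];
foreach ($matches as $m) {
    // If the quoted alternative matched, group 1 is non-empty;
    // otherwise the bare word is in group 2.
    $tags[] = $m[1] !== '' ? $m[1] : $m[2];
}

print_r($tags); // Action, Stunt, Bruce Willis, Movie, Die Hard
```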

My regexp does pretty much exactly what I want. But as I am not a regexp expert, I am wondering if it could be improved in terms of design, performance, or reliability. Thanks for the hint about the "i"; I will leave it out.

I worked some more on the regexp and googled this forum and found something similar that looks better to me and works fine also:

preg_match_all('/".*"|\b\S+\b/U', $inputString, $resultTags);

By now I have two solutions that work for me but I am still wondering if they could be improved. I don't think that I will ever understand the regular expressions completely but I am working on it. :-)

Again many thanks for your help.

Beany

PHP_Chimp

11:28 am on Mar 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The thing that I like about PHP (and most other languages) is that there is almost always more than one way to do things. So if your expression works, then it is a correct expression for the job. You could write an expression that uses lookahead and lookbehind assertions to check for the presence of the " and to split or not split around whitespace. However, what you have is a lot easier to understand, so why bother with a more complicated expression?

If you are working in an area where time is very critical, you may need to play with the expressions to shave off a few extra microseconds. But when the slowest link in an internet chain is likely to be the client's download speed, cutting a few microseconds off code execution is unlikely to matter much.
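To illustrate the assertion-based approach mentioned above, here is a rough sketch using preg_split with a lookahead only. It assumes the quotes in the input are always balanced; with an odd number of quotes it will not split where you expect.

```php
<?php
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

// Split on whitespace only when an even number of double quotes
// follows it, i.e. when the whitespace is outside any quoted phrase.
$parts = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $str);

// The quotes are still attached to the phrases, so strip them.
$tags = array_map(function ($p) {
    return trim($p, '"');
}, $parts);

print_r($tags); // Action, Stunt, Bruce Willis, Movie, Die Hard
```

It gives the same result as the alternation, but the pattern is noticeably harder to read, which is the trade-off described above.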

What you do need to be careful about is testing the expression, to make sure that it does exactly what you want and doesn't produce any false positives.
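As an example of the kind of test that is worth running, here is a small sketch with a made-up edge case: an empty pair of quotes directly in front of a phrase. The second pattern, without the leading dot, is an assumed variant for comparison and not something posted in this thread.

```php
<?php
// Hypothetical edge case: two quotes, then a phrase, then one quote.
$input = '""x y"';

// Pattern from the thread: the leading dot may consume a quote.
preg_match_all('/"(.[^"]+)"|(\w+)/', $input, $a);

// Assumed tightened variant: no leading dot.
preg_match_all('/"([^"]+)"|(\w+)/', $input, $b);

print_r($a[1]); // the captured phrase starts with a stray quote: "x y
print_r($b[1]); // x y
```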

As you say that you are not so confident with regular expressions, there are a few things I would suggest. Sorry if you already know this. It is especially important if you are using other people's regular expressions, as they may well have been written to perform a different function.

Be very careful with the use of . and *.

The * matches 0 or more. So generally + (matching 1 or more) would be a better choice, as it is unlikely you want to match nothing. The larger problem is the "greediness" of these quantifiers. That is why there is a U [uk.php.net] modifier, which makes every quantifier in the expression ungreedy by default. You can also use .*? or .+? (the extra ? after the * or +) to make a single quantifier ungreedy.

Example -
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';
With no U modifier, ".*" makes a single match that runs from the first " to the last ":
"Bruce Willis" Movie "Die Hard"
With the U modifier, it makes two separate matches:
"Bruce Willis"
"Die Hard"

The pattern would also match "", as the * allows nothing at all between the two ".
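The difference is easy to check for yourself. A short sketch, with variable names chosen only for the illustration:

```php
<?php
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

preg_match_all('/".*"/', $str, $greedy);    // default: greedy
preg_match_all('/".*"/U', $str, $ungreedy); // U flips the default
preg_match_all('/".*?"/', $str, $lazy);     // ? makes this one * ungreedy

print_r($greedy[0]);   // one match: "Bruce Willis" Movie "Die Hard"
print_r($ungreedy[0]); // two matches: "Bruce Willis" and "Die Hard"
print_r($lazy[0]);     // the same two matches

// Because * allows zero characters, an empty pair of quotes matches too.
$emptyMatches = preg_match('/".*"/U', 'tag ""');
var_dump($emptyMatches); // int(1)
```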

So if you want to match a 'word' character [a-zA-Z0-9_], then use \w [uk.php.net], or make your own character class (the things in the []'s), because the period would allow "%^$&$%" to match the pattern. I assume that this is unlikely to be what you want.
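A quick sketch of the difference, using the symbol string from above. The custom class [\w ] (word characters plus a space) is only an assumed example of what a tag might be allowed to contain.

```php
<?php
$input = '"%^$&$%"';

// The dot accepts any character except a newline, so symbols pass.
$dot = preg_match('/^".+"$/', $input);

// \w only accepts letters, digits and the underscore.
$word = preg_match('/^"\w+"$/', $input);

// Assumed custom class for tags: word characters plus spaces.
$classOk  = preg_match('/^"[\w ]+"$/', '"James Bond"');
$classBad = preg_match('/^"[\w ]+"$/', $input);

var_dump($dot);      // int(1)
var_dump($word);     // int(0)
var_dump($classOk);  // int(1)
var_dump($classBad); // int(0)
```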

So try to make your expressions specific, as that avoids unwanted matches that can easily occur when using .* or .+. It may also stop you having to use an or pattern, although in your case an or is probably the best/easiest way to accomplish what you want.