After more than four hours of googling, trying, and failing, I gave up. Now I am seeking your help with what is (at least I think) a fairly simple problem:
On my website users can enter tags like "James", "James Bond", or "Agent 007". Each tag consists of one or more words. These tags are submitted as one single string that I need to split at each whitespace or - if multiple words are enclosed by double quotes - at the double quotes.
Here is what I came up with:
Code:
<?php
/*string*/ $keywords = "\"James! (007) Bond\", \"Aston Martin\" 007 Q";
preg_match_all('/"(.[^"]+)"|([\w]+)/i', $keywords, $arr);
print "<pre>"; print_r($arr); print("</pre>");
?>
Output:
Array
(
[0] => Array
(
[0] => "James! (007) Bond"
[1] => "Aston Martin"
[2] => 007
[3] => Q
)
[1] => Array
(
[0] => James! (007) Bond
[1] => Aston Martin
[2] =>
[3] =>
)
[2] => Array
(
[0] =>
[1] =>
[2] => 007
[3] => Q
)
)
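The two capture groups in the output above can be folded into one flat list of tags. This is only a minimal sketch using the posted pattern with | as the alternation and without the i modifier; the $tags variable is a name introduced here for illustration:

```php
<?php
$keywords = "\"James! (007) Bond\", \"Aston Martin\" 007 Q";
preg_match_all('/"(.[^"]+)"|([\w]+)/', $keywords, $arr);

// In the default PREG_PATTERN_ORDER layout, $arr[1] holds the quoted tags
// and $arr[2] the bare words; the group that did not take part is ''.
$tags = array_map(
    function ($quoted, $bare) {
        return $quoted !== '' ? $quoted : $bare;
    },
    $arr[1],
    $arr[2]
);

print_r($tags);
```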
My solution works, but I think it can still be improved. Could you please have a look at it and give me some advice on how to improve the pattern? For example, do I really need an alternative pattern?
Thanks in advance!
Regards
jo
Your pattern, at a quick glance, looks fine to me. What don't you like about it that you want to improve?
"(.[^"]+)"
([\w]+)
You don't need the i modifier, as \w already includes both upper and lower case. The . matches any character, and [^"] matches everything other than ".
Your two alternatives are quite different; however, you may need both in your specific case. As we don't know exactly what you want, we can't tell you whether you could get away with only one of them. You could test whether a single or modified pattern would do what you want.
[edited by: PHP_Chimp at 9:47 pm (utc) on Mar. 27, 2008]
Many thanks for your replies, much appreciated. What I want is to split the following (or a similar) string on each whitespace, but not on whitespace enclosed by double quotes:
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';
From the above string I am expecting an array with the following items:
0: Action
1: Stunt
2: Bruce Willis
3: Movie
4: Die Hard
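One possible way to get exactly this array is sketched below. It is not the pattern from the thread: it assumes a PCRE build that supports the branch reset group (?|...), and it uses [^\s"]+ for bare words, which is a choice made here for illustration. The $matches and $tags names are also just placeholders.

```php
<?php
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

// (?| ... ) is a PCRE "branch reset" group: both alternatives number their
// capture group as 1, so the tag always lands in $matches[1].
// [^\s"]+ is used for bare words so that a stray quote is not kept as a tag.
preg_match_all('/(?|"([^"]+)"|([^\s"]+))/', $str, $matches);

$tags = $matches[1];
print_r($tags);
```

The alternation is still there, but the result no longer needs to be merged from two groups.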
My regexp does pretty much exactly what I want. But as I am not a regexp expert, I am wondering if it could be improved in terms of design, performance, or reliability. Thanks for the hint about the "i"; I will leave it out.
I worked some more on the regexp and googled this forum and found something similar that looks better to me and works fine also:
preg_match_all('/".*"|\b\S+\b/U', $inputString, $resultTags);
By now I have two solutions that work for me but I am still wondering if they could be improved. I don't think that I will ever understand the regular expressions completely but I am working on it. :-)
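With this second pattern the full matches keep their surrounding double quotes, so a clean-up step is needed afterwards. A minimal sketch, assuming the sample string from earlier in the post; the $tags name is introduced here for illustration:

```php
<?php
$inputString = 'Action Stunt "Bruce Willis" Movie "Die Hard"';
preg_match_all('/".*"|\b\S+\b/U', $inputString, $resultTags);

// The full matches still include the surrounding quotes, e.g. "Bruce Willis",
// so strip them from both ends of every match.
$tags = array_map(
    function ($tag) {
        return trim($tag, '"');
    },
    $resultTags[0]
);

print_r($tags);
```

Because of the \b anchors, unquoted words that contain punctuation such as a hyphen may be split at that punctuation, so it is worth testing with real input.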
Again many thanks for your help.
Beany
What you do need to be careful with is testing the expression, to make sure that it does exactly what you want and doesn't produce any false positives.
As you say that you are not so confident with regular expressions, there are a few things I would suggest. Sorry if you already know this. It is especially important if you are using other people's regular expressions, as they may well have been written to perform another function.
Be very careful with the use of . and *.
The * matches 0 or more. So generally + (matching 1 or more) would be a better choice, as it is unlikely you want to match nothing. The larger problem is the "greediness" of these quantifiers. That is why there is a U [uk.php.net] modifier to make the whole expression ungreedy. You can also use .*? or .+? (the extra ? after the * or +) to make a single quantifier ungreedy.
Example -
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';
With no U modifier, ".*" is greedy and would match from the first " to the last ":
"Bruce Willis" Movie "Die Hard"
With the U modifier, it stops at the nearest closing " and gives two separate matches:
"Bruce Willis"
"Die Hard"
The pattern would also match "", because * accepts zero characters between the quotes.
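The difference is easy to check for yourself. A small sketch, using the sample string from this thread; the variable names are only for illustration:

```php
<?php
$str = 'Action Stunt "Bruce Willis" Movie "Die Hard"';

preg_match_all('/".*"/', $str, $greedy);    // default: greedy
preg_match_all('/".*"/U', $str, $ungreedy); // U modifier: ungreedy
preg_match_all('/".*?"/', $str, $lazy);     // lazy quantifier, no modifier

print_r($greedy[0]);   // one match spanning first to last quote
print_r($ungreedy[0]); // one match per quoted phrase
print_r($lazy[0]);     // same matches as with the U modifier

// The * quantifier also accepts an empty pair of quotes.
$found = preg_match('/".*"/U', 'Action "" Movie', $empty);
var_dump($found, $empty[0]);
```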
So if you want to match only 'word' characters [a-zA-Z0-9_], use \w [uk.php.net] or make your own character class (the things inside the []). The period, by contrast, allows "%^$&$%" to match the pattern, and I assume that this is unlikely to be what you want.
So try to make your expressions specific, as that avoids the unwanted matches that can easily occur when using .* or .+. It may also save you from having to use an or pattern, although in your case an or is probably the best/easiest way to accomplish what you want.
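To see the effect of a specific character class, here is a rough sketch that checks a single quoted tag. The class [\w ] (word characters plus spaces) and the ^...$ anchors are assumptions made for this example, not something taken from the original pattern:

```php
<?php
$symbols = '"%^$&$%"';
$title   = '"Die Hard"';

$loose  = '/^".+"$/';     // the period accepts any characters between the quotes
$strict = '/^"[\w ]+"$/'; // only word characters and spaces between the quotes

$looseOnSymbols  = preg_match($loose, $symbols);  // symbols get through
$strictOnSymbols = preg_match($strict, $symbols); // symbols are rejected
$strictOnTitle   = preg_match($strict, $title);   // a normal tag still matches

var_dump($looseOnSymbols, $strictOnSymbols, $strictOnTitle);
```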