Forum Moderators: coopster
My question is similar to the one in the thread linked above.
The question: how do I find a particular URL in a web page and check the hyperlink?
Say my site is www.rankingoogle.com. I want to check whether every site that links back to me is linking to me properly. I have the full list of back-linking sites in a database, and I want to fetch each one and check it against every link pattern.
Say a common link is <a href="http://www.rankingoogle.com" title="SEO" class=''>SEO Ranking</a>
I want to check this with a regular expression that handles any general form of hyperlink. I have written the following regex:
$regex = "<[a][[:space:]]+([a-zA-Z]*[[:space:]]*=?[[:space:]]*(\"[^\"]*\"|'[^']*')?[[:space:]]+)*href[[:space:]]*=[[:space:]]*((\"http://(www\\.)?".$murl."/?[[:space:]]*\")|('http://(www\\.)?".$murl."/?')|(http://(www\\.)?".$murl."/?))[[:space:]]*([a-zA-Z]*[[:space:]]*=?[[:space:]]*(\"[^\"]*\"|'[^']*')?[[:space:]]*)*>.+";
But it fails in some cases.
Any help?
// assuming preg_match
// your pattern made smaller, so it fits on my screen ;)
$pattern = "%<a\s+([a-z]*\s*=?\s*(\"[^\"]*\"|'[^']*')?\s+)*href\s*=\s*((\"http://(www\\.)?".$murl."/?\s*\")|"
    ."('http://(www\\.)?".$murl."/?')|(http://(www\\.)?".$murl."/?))\\s*([a-z]*\s*=?\s*(\"[^\"]*\"|'[^']*')?\s*)*>.+%i";
// wow this is still a long expression...
Could you not just simplify your expression to look for:
$pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';
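For what it's worth, here is a minimal sketch of how the simplified pattern could be dropped into preg_match. The function name hasBacklink and the sample HTML are made up for illustration, and preg_quote is used on the assumption that the host name arrives as a plain string (as $murl appears to in the original post).

```php
<?php
// Hypothetical helper, not from the thread: builds the simplified pattern
// for a given host and tests a chunk of HTML against it.
function hasBacklink(string $html, string $host): bool
{
    // preg_quote escapes regex metacharacters in the host (the dots in
    // particular); the second argument also escapes the % delimiter.
    $quoted  = preg_quote($host, '%');
    $pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?' . $quoted
             . '/?["\'][^/>]*/?>%i';

    return preg_match($pattern, $html) === 1;
}

// Sample markup taken from the original question.
$html = '<p><a href="http://www.rankingoogle.com" title="SEO" class=\'\'>SEO Ranking</a></p>';

var_dump(hasBacklink($html, 'rankingoogle.com')); // bool(true)
var_dump(hasBacklink($html, 'example.org'));      // bool(false)
```

Note that this pattern insists on a quote around the href value, so an unquoted href=http://www.rankingoogle.com would not be found.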
<edit>
Broke up the regex to make it fit on the screen.
[edited by: PHP_Chimp at 10:30 pm (utc) on Feb. 1, 2008]
This is your regex:
$pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';
I don't know why the % signs are there. Can you explain?
I think you wanted to match like this:
1) Match <a, then any character except / or > zero or more times. I think a space after <a is required here; that is why I used \s+ in my regex, which means one or more whitespace characters.
After that comes href=, which is fine, but then you wrote ["\']. An opening " must be closed by " and an opening ' by ', just as in <a target="_blank" style=''. The " or ' is also optional, because <a title=hello is valid and so is <a title = hello.
I wanted to write a regex that matches the following:
<a(ANY No. of space)(This whole set optional : attribute(space optional)=(space optional)(" or ' optional)(attribute_value optional)(" or ' optional, but single quote or double quote must end with proper match)(ANY No. of space)) href=(ANY No. of space)(" or ' optional)http://(www. optional)URL(trailing slash / optional)(" or ' optional but match with starting " or ')(ANY No. of space optional)(any no. characters spaces except >)
That is the main part; after it, match </a>.
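The requirement that a quote must be closed by the same kind of quote, or left out on both sides, is the sort of thing a backreference can express. What follows is only a sketch under assumptions: the function name linksWithMatchingQuotes is hypothetical, the attribute handling is deliberately simpler than the full specification above, and like any regex over HTML it will trip on unusual markup (for example a > inside an attribute value).

```php
<?php
// Hypothetical helper: true if $html links to the home page of $host with an
// href whose opening and closing quotes agree, or with no quotes at all.
function linksWithMatchingQuotes(string $html, string $host): bool
{
    $pattern = '%<a\s+(?:[^>]*?\s)?href\s*=\s*' // <a, optional attributes, href =
             . '(["\']?)'                        // group 1: opening quote, or empty
             . 'http://(?:www\.)?' . preg_quote($host, '%') . '/?\s*'
             . '\1'                              // the same quote again, or nothing
             . '(?=[\s>])[^>]*>%i';              // value ends here; rest of the tag

    return preg_match($pattern, $html) === 1;
}

$host = 'rankingoogle.com';
var_dump(linksWithMatchingQuotes('<a href="http://www.rankingoogle.com" title="SEO">x</a>', $host)); // bool(true)
var_dump(linksWithMatchingQuotes("<a title = hello href='http://rankingoogle.com/' >x</a>", $host)); // bool(true)
var_dump(linksWithMatchingQuotes('<a href=http://www.rankingoogle.com>x</a>', $host));               // bool(true)
var_dump(linksWithMatchingQuotes('<a href="http://www.rankingoogle.com\'>x</a>', $host));            // bool(false)
```

The lookahead (?=[\s>]) is there so that an unquoted value cannot run on into something longer, such as http://www.rankingoogle.com.evil.net.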
The simplified pattern won't check that your title is present, but it will check for the presence of your address within an <a tag.
The only reason for the ["\'] is so that people can quote their attributes with either a " or a '.
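On the earlier question about the % signs: with the preg_* functions, the first character of the pattern string is the delimiter, and the same character closes the pattern before any flags such as i. Picking % instead of the more usual / means the slashes in http:// do not need escaping. A small sketch, with made-up sample HTML:

```php
<?php
$html = '<a href="http://example.com/">x</a>';

// With / as the delimiter, every literal / in the pattern must be escaped.
$withSlash = '/href="http:\/\/example\.com\/"/i';

// With % as the delimiter, the slashes can be written as they are.
$withPercent = '%href="http://example\.com/"%i';

var_dump(preg_match($withSlash, $html));   // int(1)
var_dump(preg_match($withPercent, $html)); // int(1)
```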
So what you want can be simplified.
<a - this can't get any simpler.
(ANY No. of space)(This whole set optional : attribute(space optional)=(space optional)(" or ' optional)(attribute_value optional)(" or ' optional, but single quote or double quote must end with proper match)(ANY No. of space)) - so all of this is matched by [^/>]+, since that class matches anything other than the end of the tag.
I'm not saying that your regex is wrong, just that you can make it a lot easier to read by using a negated character class to cover all of the parts you list above. However, I agree that I originally put * after the class, and that should have been a + to force at least one character. If you want to force the space, put \s or [ ] after the <a (you don't need the brackets, but it is a little difficult to show a space without them ;).
As with so many things, there are a lot of different regexes that would do the job for you. Your original one is a lot more specific than mine; however, the more specific you make it, the more problems you will find with it, as it will match a more limited set of inputs. My regex is a lot looser, and therefore will match more often. My regex will get you the answer that you asked for... this may not, however, be the answer that you wanted.
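To make the looser-versus-stricter trade-off a bit more concrete, here is a small comparison of the loose pattern (with + after the class) against the same pattern with \s forced after <a. The helper name linkMatches and the three sample strings are invented for the demo; treat it as a sketch, not a test suite.

```php
<?php
// Hypothetical helper for this demo only.
function linkMatches(string $pattern, string $html): bool
{
    return preg_match($pattern, $html) === 1;
}

// The loose pattern, with + after the class.
$loose  = '%<a[^/>]+href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';
// The same pattern, but forcing whitespace straight after <a.
$spaced = '%<a\s[^/>]*href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';

$samples = [
    'plain link'        => '<a href="http://www.example.com/">x</a>',
    'different tag'     => '<abbr href="http://www.example.com/">x</abbr>',
    'slash before href' => '<a title="a/b" href="http://www.example.com/">x</a>',
];

foreach ($samples as $label => $html) {
    printf(
        "%-18s loose=%s spaced=%s\n",
        $label,
        var_export(linkMatches($loose, $html), true),
        var_export(linkMatches($spaced, $html), true)
    );
}
```

Both patterns accept the plain link. Only the loose one accepts the abbr tag, because [^/>]+ is happy to swallow the rest of the tag name. Neither accepts the third sample, since the class cannot step over the / inside the title value, which is one of the limits of matching attributes with [^/>].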