How to find a particular URL in a Webpage & checking Hyperlink

php freelancer

7:18 pm on Feb 1, 2008 (gmt 0)

[webmasterworld.com...]

My post is similar to the one linked above.

My question: how do I find a particular URL in a web page and check that it is a proper hyperlink?

Say my site is www.rankingoogle.com, and I want to check whether every site that is supposed to link back to me actually does so correctly. I have the list of back-linking sites in a database. I want to fetch each one and check it against every link pattern I can think of.

Say a common link is <a href="http://www.rankingoogle.com" title="SEO" class=''>SEO Ranking</a>

I want to check this with a regular expression that handles any general form of the hyperlink. I have written the following regex:

$regex = "<[a][[:space:]]+([a-zA-Z]*[[:space:]]*=?[[:space:]]*(\"[^\"]*\"|'[^']*')?[[:space:]]+)*href[[:space:]]*=[[:space:]]*((\"http://(www\\.)?".$murl."/?[[:space:]]*\")|('http://(www\\.)?".$murl."/?')|(http://(www\\.)?".$murl."/?))[[:space:]]*([a-zA-Z]*[[:space:]]*=?[[:space:]]*(\"[^\"]*\"|'[^']*')?[[:space:]]*)*>.+";

But it fails in some cases.

Any help?

PHP_Chimp

10:29 pm on Feb 1, 2008 (gmt 0)


// assuming preg_match
// your pattern made smaller, so it fits on my screen ;)
$pattern = "%<a\s+([a-z]*\s*=?\s*(\"[^\"]*\"|'[^']*')?\s+)*href\s*=\s*((\"http://(www\\.)?".$murl."/?\s*\")|"
    . "('http://(www\\.)?".$murl."/?')|(http://(www\\.)?".$murl."/?))\\s*([a-z]*\s*=?\s*(\"[^\"]*\"|'[^']*')?\s*)*>.+%i";
// wow this is still a long expression...

Could you not just simplify your expression to look for -


$pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';

This won't check that your title attribute is present, but it will check for the presence of your address within an <a tag.
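A quick way to see what the simplified pattern accepts is to run it against a few sample tags. This is only a minimal sketch, assuming preg_match: the helper name hasBacklink and the sample HTML strings are made up for illustration, and preg_quote is added so that the dots in the domain are matched literally.

```php
<?php
// Hypothetical helper (not from the thread): returns true if $html contains
// an <a> tag whose href points at $domain, with optional http:// and www.
function hasBacklink(string $html, string $domain): bool
{
    // preg_quote escapes regex metacharacters in the domain; '%' is passed
    // as the delimiter so it would be escaped too if it ever appeared.
    $d = preg_quote($domain, '%');
    $pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?' . $d . '/?["\'][^/>]*/?>%i';
    return preg_match($pattern, $html) === 1;
}

// Made-up sample markup.
var_dump(hasBacklink('<a href="http://www.example.com/" title="SEO">SEO</a>', 'example.com')); // bool(true)
var_dump(hasBacklink("<a class='x' href='example.com'>SEO</a>", 'example.com'));               // bool(true)
var_dump(hasBacklink('<a href="http://www.other.com/">Other</a>', 'example.com'));             // bool(false)
```

Note that this pattern accepts either quote on either side of the URL, so it does not check that the opening and closing quotes are the same character.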

<edit>
Broke up the regex to make it fit on the screen.

[edited by: PHP_Chimp at 10:30 pm (utc) on Feb. 1, 2008]

php freelancer

5:34 am on Feb 2, 2008 (gmt 0)

Thanks mate. Can you explain the lines you wrote a little?

This is your regex =>
$pattern = '%<a[^/>]*href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';

I don't know why the % signs are there. Can you explain?
I think you wanted to match like this:
1) Match <a, then any character except / or > zero or more times. I think a space after <a is a must here; that is why I used \s+ in my regex, which means one or more whitespace characters.

After that comes href=, which is fine, but then you wrote ["\']. Here an opening " must be closed by " and an opening ' by ', just as in <a target="_blank" style=''. The quotes are also optional, because <a title=hello is valid and so is <a title = hello.

I wanted to write a regex that handles the following:
<a(ANY No. of space)(This whole set optional : attribute(space optional)=(space optional)(" or ' optional)(attribute_value optional)(" or ' optional, but single quote or double quote must end with proper match)(ANY No. of space)) href=(ANY No. of space)(" or ' optional)http://(www. optional)URL(trailing slash / optional)(" or ' optional but match with starting " or ')(ANY No. of space optional)(any no. characters spaces except >)

That is the main part; after it, match the closing </a>.
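The requirement that a quote "must end with proper match" is the kind of thing a backreference can express. The following is a rough sketch of the specification above, not a tested production pattern: the function name buildLinkPattern is hypothetical, the attribute-name class [a-z-]+ is an assumption, and only the exact home-page URL (with optional www. and trailing slash) is accepted.

```php
<?php
// Hypothetical builder for a pattern following the specification above.
function buildLinkPattern(string $domain): string
{
    $d = preg_quote($domain, '%');
    return '%<a\s+'                              // tag name, then at least one space
         // zero or more attributes before href: name = "v" | 'v' | unquoted value
         . '(?:[a-z-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s"\'>]+)\s+)*'
         . 'href\s*=\s*'
         . '(["\']?)'                            // group 1: opening quote, may be empty
         . 'http://(?:www\.)?' . $d . '/?'
         . '\1'                                  // same quote (or nothing) must close
         . '(?=[\s>])'                           // URL must end at a space or at >
         . '[^>]*>'                              // rest of the opening tag
         . '.*?</a>%is';                         // link text up to the closing tag
}

$p = buildLinkPattern('rankingoogle.com');

// Sample tags; the last one has mismatched quotes on purpose.
var_dump(preg_match($p, '<a href="http://www.rankingoogle.com" title="SEO" class=\'\'>SEO Ranking</a>') === 1); // bool(true)
var_dump(preg_match($p, '<a title = hello href=http://rankingoogle.com/>x</a>') === 1);                          // bool(true)
var_dump(preg_match($p, '<a href="http://www.rankingoogle.com\'>x</a>') === 1);                                  // bool(false)
```

Capturing the quote as (["\']?) rather than (["\'])? is deliberate: the group always takes part in the match, even when it is empty, so \1 can still succeed for an unquoted href.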

PHP_Chimp

9:02 pm on Feb 2, 2008 (gmt 0)

The % just marks the boundary of the pattern, as I'm assuming that you are using preg_match or something like it.
As the % character doesn't find its way into many regexes, I find it easier to use than /, which leaves you with hundreds of \/ all over the place.
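To make the delimiter point concrete, here is a small sketch comparing the two styles on a URL. The sample markup and variable names are made up for illustration; both patterns are meant to match the same thing.

```php
<?php
// Made-up sample markup.
$html = '<a href="http://www.example.com/">SEO</a>';

// With "/" as the delimiter, every literal slash in the pattern needs a backslash.
$withSlash = '/href="http:\/\/www\.example\.com\/"/i';

// With "%" as the delimiter, the slashes can be written as they are.
$withPercent = '%href="http://www\.example\.com/"%i';

var_dump(preg_match($withSlash, $html));   // int(1)
var_dump(preg_match($withPercent, $html)); // int(1)
```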

The only reason for the ["\'] is so that people can quote their attributes with either a " or a '.

So what you want can be simplified.
<a - this can't get any simpler.
(ANY No. of space)(This whole set optional : attribute(space optional)=(space optional)(" or ' optional)(attribute_value optional)(" or ' optional, but single quote or double quote must end with proper match)(ANY No. of space)) - so this whole lot is matched by [^/>]+, which accepts anything other than a / or the > that ends the tag.

I'm not saying that your regex is wrong, just that you can make it a lot easier to read by using a negated character class to cover all of the bits you listed above. However, I agree that I originally put * after the class, and it should have been + to force at least one character. If you want to force the space, put \s or [ ] after the <a (you don't need the brackets, but it is a little difficult to show a space without them ;).
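The difference between *, + and an explicit \s after <a can be checked on a tag that merely starts with "a". This is a sketch with made-up sample markup and variable names; it suggests that + on its own still lets a tag such as <area> through, while \s does not.

```php
<?php
// Made-up sample tags: an <area> tag and a real link to the same address.
$area = '<area shape="rect" href="http://www.example.com/">';
$link = '<a href="http://www.example.com/">SEO</a>';

// Shared tail of the simplified pattern, from href= to the end of the tag.
$tail = 'href=["\'](?:http://)?(?:www\.)?example\.com/?["\'][^/>]*/?>%i';

$loose  = '%<a[^/>]*'   . $tail; // zero or more characters after <a
$plus   = '%<a[^/>]+'   . $tail; // at least one character, of any kind
$spaced = '%<a\s[^/>]*' . $tail; // a whitespace character must follow <a

var_dump(preg_match($loose, $area));  // int(1): "rea shape=..." is accepted
var_dump(preg_match($plus, $area));   // int(1): still accepted
var_dump(preg_match($spaced, $area)); // int(0): "r" is not whitespace
var_dump(preg_match($spaced, $link)); // int(1): the real link still matches
```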

As with so many things, there are a lot of different regexes that would do the job for you. Your original one is a lot more specific than mine; however, the more specific you make it, the more problems you are going to find with it, as it will match a more limited set of inputs. My regex is a lot looser and will therefore match more often. My regex will get you the answer that you asked for... this may not, however, be the answer that you wanted.