Regex to extract urls

Sudarsan1984

1:12 pm on Feb 18, 2010 (gmt 0)

Hello all!
I would like to know which pattern to use with preg_match_all to get the URLs from a web page.
The page contains many URLs, and I need only the ones that look like this: "<h2><a id='....' href='http://example.com'" /></a>...............</h2>"
Can anyone help me find the pattern I need for this?

[edited by: eelixduppy at 2:23 pm (utc) on Feb 18, 2010]

jatar_k

1:33 pm on Feb 18, 2010 (gmt 0)

do you want the whole h2?
if not what part do you want specifically?

Sudarsan1984

4:45 am on Feb 19, 2010 (gmt 0)

I need the URLs that are inside the <h2> tag, i.e. the URL that looks like "http://example.com". I hope that is clear!

jatar_k

1:49 pm on Feb 19, 2010 (gmt 0)

you could search for "php regex for extracting urls" and you should find a bunch of examples

Sudarsan1984

4:47 am on Feb 22, 2010 (gmt 0)

I have searched for "php regex for extracting urls", but I was not able to find anything that fits my case. Please give me the code for this!

Readie

5:09 am on Feb 22, 2010 (gmt 0)

From jdMorgan's third post in [webmasterworld.com...]

/(([a-z0-9][a-z0-9\-]*[a-z0-9]\.)+([a-z]{2,6}|co\.[a-z]{2}))/

although by the looks of it, it needs modifying to account for sub-directories
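As a rough sketch of that modification (my own assumption, not something from the linked thread): keep the host part as quoted and append an optional path that runs until whitespace, a quote, or an angle bracket. The names $host, $path and $pattern are made up for illustration, and the scheme ("http://") is still not captured.

```php
<?php
// Host part: the pattern quoted above, unchanged.
$host = '([a-z0-9][a-z0-9\-]*[a-z0-9]\.)+([a-z]{2,6}|co\.[a-z]{2})';

// Assumed extension: an optional path that stops at whitespace,
// a quote, or an angle bracket.
$path = '(\/[^\s"\'<>]*)?';

// "#" is used as the delimiter so slashes need no special handling.
$pattern = '#(' . $host . $path . ')#i';

preg_match($pattern, "href='http://example.com/dir/page.html'", $m);
echo $m[1], "\n"; // prints example.com/dir/page.html
```

This only widens what gets matched after the host name; it does nothing to restrict matches to a particular tag.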

Sudarsan1984

5:32 am on Feb 22, 2010 (gmt 0)

Readie, I think you didn't understand what I want, so I will explain again. The page source contains "<h2> <a id='....' href='http://example.com'" /></a>...............</h2>", exactly like that. What I need is a pattern I can use in preg_match_all to extract only the URLs from that source, i.e. only those inside the h2 tags.
Please give me the pattern code!

rocknbil

6:40 pm on Feb 22, 2010 (gmt 0)

The title of the topic makes some a little reluctant to assist. But anyway . . . . you want

<h#>

<a....

"URL"

</a>

</h#>

So something like this should work (untested, you may have to play with it)

$line = preg_replace('/<h\d[^>]*>[^<]*<a\b[^>]*?href\s*=\s*["\']([^"\'>\s]+)["\'][^>]*>[^<]*<\/a>[^<]*<\/h\d>/i', '$1', $line);

It's a little complicated because you have to allow for spaces, classes, targets, or other attributes in the anchor and heading. A simpler one might be to just match on the h# and href:

$line = preg_replace('/<h\d.*?href\s*=\s*["\']([^"\'\s>]+).*/i', '$1', $line);

Since the match starts at an h#, you can assume both the heading and the anchor have closing tags, so you don't have to match them.

Note that I use the variable "$line" because you will likely have to parse the scraped page line by line.
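
Since the original question asked for preg_match_all, here is a minimal sketch of the same idea applied to a whole page at once instead of line by line. The sample markup is an assumption: the first line is the snippet quoted in the question, the other two are made up to show what gets skipped and what gets picked up. $html, $pattern and $urls are illustrative names only.

```php
<?php
// Sample page source (assumed). Only the first line comes from the thread.
$html = <<<'HTML'
<h2><a id='....' href='http://example.com'" /></a>...............</h2>
<p><a href="http://example.org/ignored">not inside a heading</a></p>
<h2> <a class="title" href="http://example.net/page">Second</a> heading</h2>
HTML;

// <h\d[^>]*>       opening heading tag with any attributes
// \s*<a\b[^>]*?    an anchor directly inside it, attributes before href
// href\s*=\s*["']  href and its opening quote
// ([^"'\s>]+)      group 1: the URL itself
$pattern = '/<h\d[^>]*>\s*<a\b[^>]*?href\s*=\s*["\']([^"\'\s>]+)/i';

preg_match_all($pattern, $html, $matches);
$urls = $matches[1]; // group 1 of every match

print_r($urls);
```

With this sample, $urls should hold http://example.com and http://example.net/page; the link in the paragraph is skipped because it does not follow a heading tag. To limit it to h2 only, replace \d with 2.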