Forum Moderators: coopster

Regular expression and collecting urls

Reg exp, urls, file ext. filter


HeadBut

8:10 pm on Mar 26, 2005 (gmt 0)

10+ Year Member



I am currently using this to capture all the urls in my site scan:
preg_match_all("<a href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>", $Capturedhtml, $PageURLs, PREG_SET_ORDER);

but sometimes a link doesn't end with a quote and I get a bunch of other chars in my urls. I'd also like to filter out some file extensions like "pdf"....
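For the extension filter I was planning something along these lines once the matches are collected (untested; assumes $PageURLs is filled with PREG_SET_ORDER as above, and the sample data and $skip list are just made up for illustration):

```php
<?php
// Hypothetical matches as produced by the preg_match_all above
// (PREG_SET_ORDER: each entry is array(full match, captured url)).
$PageURLs = array(
    array('', 'http://example.com/page.html'),
    array('', 'http://example.com/manual.pdf'),
    array('', 'http://example.com/dir/'),
);

$skip = array('pdf', 'zip', 'exe'); // extensions to drop
$kept = array();

foreach ($PageURLs as $match) {
    $url = $match[1];
    // Take the extension from the path part only, so a query string
    // or fragment doesn't confuse the check.
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '';
    $info  = pathinfo($path);
    $ext   = isset($info['extension']) ? strtolower($info['extension']) : '';
    if (!in_array($ext, $skip)) {
        $kept[] = $url;
    }
}

print_r($kept);
```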

Any help with this regular expression would be appreciated!

thanks!

wrightee

8:47 am on Mar 28, 2005 (gmt 0)

10+ Year Member



It's a bit early on a bank holiday to figure that regex out, but here are a couple of thoughts for a kludge.

- Get all <a href ..> into an array with a simple regex
- Loop through and use str functions to grab from http:// up to the first ", ' or whitespace character, since theoretically any of those three should spell the end of a link

Trouble is... not all links start with http://, and some links might actually contain one of those three characters (not really allowed, but it happens)..
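Something like this is what I mean (completely untested bank-holiday code, so treat it as a sketch; the sample HTML is just to show the idea):

```php
<?php
// Step 1: grab every <a href ...> tag with a simple, loose regex.
$Capturedhtml = '<p><a href=http://example.com/one.html>one</a>'
              . ' <a href="http://example.com/two.html">two</a></p>';

preg_match_all('/<a\s+href[^>]*>/i', $Capturedhtml, $tags);

$urls = array();
foreach ($tags[0] as $tag) {
    // Step 2: str functions -- grab from http:// up to the first
    // quote, apostrophe or whitespace (or the closing >).
    $start = strpos($tag, 'http://');
    if ($start === false) {
        continue; // not an absolute http link, skip it
    }
    $rest = substr($tag, $start);
    // strcspn() gives the length up to the earliest terminator.
    $len = strcspn($rest, "\"' \t\r\n>");
    $urls[] = substr($rest, 0, $len);
}

print_r($urls);
```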

From memory, I'm pretty sure extracting urls is a fairly well discussed example on the php.net site under the regex functions; maybe there's something there that can help?

Sorry for the average answer..

gettopreacherman

3:28 am on Mar 30, 2005 (gmt 0)

10+ Year Member



/<a href=([\s\S]*?)>/i

Then preg_replace any " and you are done :-)
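Spelled out, that two-step approach might look like this (untested sketch; the non-greedy *? stops each match at the first >, and the quoted and unquoted sample links are made up for illustration):

```php
<?php
$Capturedhtml = '<a href="http://example.com/a.html">a</a>'
              . ' <a href=http://example.com/b.html>b</a>';

// Capture everything between "<a href=" and the closing ">" ...
preg_match_all('/<a href=([\s\S]*?)>/i', $Capturedhtml, $m);

// ... then strip any quotes from the captures.
$urls = array();
foreach ($m[1] as $raw) {
    $urls[] = preg_replace('/["\']/', '', $raw);
}

print_r($urls);
```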