regex fails to fetch out URLs from html output

failing to fetch special pattern URLs from html

         

phparion

12:31 pm on Aug 27, 2007 (gmt 0)




Hi,

I have a long HTML page, and the chunk of code I am interested in looks like this:

Code:

<ul>
<li><strong><a href="http://www.example.com/string1/string2/string3/string4/#*$!-#*$!-#*$!x-xx-#*$!x.html" title="link title here">link name here</a></strong>
<br /><address>
address here<br />
blah blah blah<br />
blah blah
</address>
</li>
</ul>
</div>

All the target URLs are inside <li></li> tags.

I want to extract just the URL part from this HTML, i.e.

Code:

http://www.example.com/string1/string2/string3/string4/#*$!-#*$!-#*$!x-xx-#*$!x.html

For this I am using the following code:

PHP Code:
preg_match_all("/(http|https)?:\/\/?([a-zA-Z0-9\-\.]*\.[a-zA-Z]{2,5})(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9.-_]*\/)?([a-zA-Z0-9.-_?&=%+$]+)?/", $url , $arr );

This regex returns URLs of every pattern. However, I am only interested in URLs that follow the pattern above. I have tried many regexes for that, but they all give empty results.

Something like this

Code:

preg_match_all("http\:\/\/www\.[a-zA-Z0-9-_.]\.com\/[a-zA-Z0-9-_.]\/[a-zA-Z0-9-_.]\/[a-zA-Z0-9-_.]\/[a-zA-Z0-9-_.]\/[a-zA-Z0-9-_.]\.html",$url,$arr);

gives me the following error:

Code:

Delimiter must not be alphanumeric or backslash in

I wonder if someone could help me write a regex that matches only URLs following this scheme and no others, since the same long HTML output contains URLs in many other schemes too.

Thank you very much.

[edited by: eelixduppy at 12:40 pm (utc) on Aug. 27, 2007]
[edit reason] use example.com, thanks [/edit]

d40sithui

3:53 pm on Aug 27, 2007 (gmt 0)




Hi,
Your second pattern gave you an error because it doesn't start and end with a delimiter such as "/".

Try this pattern:

$pattern10 ="/^http:\/\/www+\.[a-zA-Z0-9-_.]+\.com+\/[a-zA-Z0-9-_.]+\/[a-zA-Z0-9-_.]+\/[a-zA-Z0-9-_.]+\/[a-zA-Z0-9-_.]+\/[a-zA-Z0-9-_!#*$.]+\.html+$/";
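To illustrate the delimiter point, here is a minimal sketch. The sample HTML string and the variable names are made up for the example; only example.com comes from the thread. It uses "~" as the delimiter instead of "/", which is one possible choice that avoids escaping every slash in the URL.

```php
<?php
// Hypothetical sample input, not the real page from the question.
$html = '<li><a href="http://www.example.com/a/b/c/d/page.html">link</a></li>';

// No delimiters: the leading "h" is read as the delimiter, which is not allowed.
// PHP raises a warning (suppressed here with @) and the call returns false.
$withoutDelimiters = @preg_match_all('http://www\.example\.com/', $html, $m1);

// The same expression wrapped in "~" delimiters compiles and matches once.
$withDelimiters = preg_match_all('~http://www\.example\.com/~', $html, $m2);

var_dump($withoutDelimiters); // bool(false)
var_dump($withDelimiters);    // int(1)
```

Any non-alphanumeric, non-backslash character can serve as the delimiter, as long as the same one closes the pattern.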

phparion

4:47 am on Aug 28, 2007 (gmt 0)




This pattern also returns an empty array :(

phparion

5:02 am on Aug 28, 2007 (gmt 0)




It is working now. I just removed ^ and $ from the pattern, kept / at the start and end of the pattern, and it started to work.

Thank you very much.
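For anyone reading later, the likely reason the anchors broke it: ^ and $ tie the match to the start and end of the whole subject string, so a URL sitting in the middle of an HTML page can never match. A rough sketch of the difference follows. The sample HTML, the path names, and the simplified character class are assumptions for illustration, not the exact pattern from the thread.

```php
<?php
// Hypothetical input: one URL with four directories plus an .html file,
// and one shorter URL that should be ignored.
$html = '<ul><li><strong><a href="http://www.example.com/s1/s2/s3/s4/some-page.html">name</a></strong></li></ul>'
      . '<a href="http://www.example.com/other.html">other</a>';

// One path segment; the hyphen is placed last so it is a literal, not a range.
$seg  = '[A-Za-z0-9_.-]+';
$core = 'http://www\.' . $seg . '\.com/' . str_repeat($seg . '/', 4) . $seg . '\.html';

// Anchored: the subject starts with "<ul>", so nothing can match.
$anchoredCount = preg_match_all('~^' . $core . '$~', $html, $anchored);

// Unanchored: the URL is found wherever it appears in the HTML.
$unanchoredCount = preg_match_all('~' . $core . '~', $html, $found);

var_dump($anchoredCount);   // int(0)
var_dump($unanchoredCount); // int(1)
var_dump($found[0][0]);     // string(50) "http://www.example.com/s1/s2/s3/s4/some-page.html"
```

The shorter URL is skipped because the pattern insists on four directory segments before the file name.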