Multiple Page Parser

         

Chillax

3:16 pm on Apr 19, 2012 (gmt 0)

10+ Year Member



Hey guys, I'm a newbie here and I know very little about PHP. I'd like to ask for your help, please.
I'm trying to build a PHP script that fetches www.google.com and www.yahoo.com, checks each page's source for <a href="http://www.google.com/ncr and then notifies me where my code is missing.
This is all I have for now and I don't know what to do next :), and I'm also not sure if my regex is correct.

<?php
echo nl2br ("a href checker\n\n");

$pages = array(
"http://www.google.com",
"http://www.yahoo.com");

foreach($pages as $page)
file_get_contents('$page');

$regex = '!a href=\"http://www.google.com/ncr\"!smiU';

if (preg_match("$regex", "$pagina")) {
echo "Found a href.";
} else {
echo nl2br ("a href not found on %site%");
}
?>


I really need some hints, since I'm stuck.
Thank you very much,

rocknbil

4:48 pm on Apr 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome aboard Chillax . . . look at this . . .


$regex = '!a href=\"http://www.google.com/ncr\"!smiU';

Why do you escape characters? To avoid errors on quoting, like

"she said \"hey\""
'I think that\'s Robert\'s time machine'

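You've single-quoted your regex, so the double quotes inside it don't need escaping. PHP leaves each \" as a literal backslash plus quote, and the regex engine then reads \" as a plain ", so the backslashes are harmless but unnecessary. That's a start, but the bigger problem is that the pattern only matches one exact way of writing the link. For example,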

a href='http://www.google.com/ncr'
a href=http://www.google.com/ncr
a href="http://google.com/ncr"
a class="inpage" href="http://google.com/ncr"

. . . and 20 others or so, you get the drift.
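
To see the point concretely, here is a minimal sketch that runs the pattern exactly as posted against a few ways of writing the link. The sample strings are made up for illustration and are not taken from any real page.

```php
<?php
// The pattern exactly as posted in the question (single-quoted PHP string).
$regex = '!a href=\"http://www.google.com/ncr\"!smiU';

// Hypothetical snippets of page source, for illustration only.
$samples = array(
    '<a href="http://www.google.com/ncr">',            // the exact form
    "<a href='http://www.google.com/ncr'>",            // single quotes
    '<a href=http://www.google.com/ncr>',              // no quotes
    '<a href="http://google.com/ncr">',                // no www
    '<a class="inpage" href="http://google.com/ncr">', // extra attribute
);

$results = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 on no match.
    $found = preg_match($regex, $sample);
    $results[] = $found;
    echo $sample, ' => ', ($found ? 'match' : 'no match'), "\n";
}
```

Only the first sample matches, which also shows that \" inside the pattern matches a plain double quote.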

Given that, you likely want a regex where the link
- may or may not have something between "a" and "href" (we can just eliminate "a")
- may or may not have quotes
- may or may not have www

A start might be

$regex = '!href=["\']?http(s)?://(www\.)?google\.com/ncr["\']?!smiU';

- starts with href=
- followed by one or zero of ' or "
- followed by http, with or without s
- followed by ://
- may or may not have www.
- google.com/ncr
- followed by one or zero of these: " or '
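
Putting the pieces together, a rough sketch of the whole check might look like the following. It is not a finished script: the helper name pagesMissingLink and the example.com sources are made up so the code runs without network access. The pattern here escapes the dots (an unescaped . matches any character) and keeps only the i flag, since s, m and U don't change whether this particular pattern matches.

```php
<?php
// Dots are escaped so they match only a literal "."; "i" = case-insensitive.
$regex = '!href=["\']?https?://(www\.)?google\.com/ncr["\']?!i';

/**
 * Hypothetical helper: takes an array of url => page source and
 * returns the URLs whose source does not contain the link.
 */
function pagesMissingLink(array $sources, $regex)
{
    $missing = array();
    foreach ($sources as $url => $html) {
        // file_get_contents() returns false on failure; treat that as missing.
        if ($html === false || !preg_match($regex, $html)) {
            $missing[] = $url;
        }
    }
    return $missing;
}

// Stand-in page sources (assumptions, not real pages).
$sources = array(
    'http://example.com/a' => '<a class="inpage" href="http://google.com/ncr">x</a>',
    'http://example.com/b' => "<a href='https://www.google.com/ncr'>x</a>",
    'http://example.com/c' => '<a href="http://www.yahoo.com/">x</a>',
);

foreach (pagesMissingLink($sources, $regex) as $url) {
    echo "a href not found on $url\n";
}

// With real pages, the sources could be built like this instead:
// foreach ($pages as $page) { $sources[$page] = file_get_contents($page); }
```

Note that the fetch uses $page without quotes and stores the result: '$page' in single quotes is the literal text $page, and the return value of file_get_contents() has to be assigned to something before it can be tested.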

Chillax

7:23 pm on Apr 20, 2012 (gmt 0)

10+ Year Member



Thank you very much, rocknbil, much appreciated.