Screen Scraping Problem

         

rjbearcan

10:06 pm on Aug 18, 2007 (gmt 0)

Hello everyone,
I am trying to scrape a site to get a list of theaters. Since the address is always changing, all I can count on is that it will contain either 'Chicago, IL' or 'New York, NY'. However, when I run my script, which is pretty basic, I get a lot of returns for 'Chicago, IL' but not the address. Here is what I am running:

$URL = "http://www.site.com/venue/id";
$Contents = file_get_contents($URL);
$Lines = preg_split("/\n/",$Contents);

foreach ($Lines as $Line)
{
    if (preg_match("/([^`]*?), Chicago, IL /", $Line, $Match))
    {
        print("Found match (" . $Match[1] . ")<br />\n");
    }
}

I thought that would return every line that has 'Chicago, IL' in it, along with whatever text precedes it, but instead it returns nothing.

Thanks for any help you can give me.

SteveLetwin

12:10 am on Aug 19, 2007 (gmt 0)

To quote [us.php.net...]:
"$matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on."
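
To make the quoted behaviour concrete, here is a minimal sketch that runs the pattern from the original post against one made-up line. The sample address is purely an assumption for illustration, since the real page markup is not shown in this thread.

```php
<?php
// Hypothetical sample line; the real page content is not known.
$line = '123 Main St, Chicago, IL 60601';

$whole = null;
$street = null;

if (preg_match('/([^`]*?), Chicago, IL /', $line, $match)) {
    // $match[0] is the text matched by the full pattern.
    $whole = $match[0];
    // $match[1] is the text matched by the first parenthesized group.
    $street = $match[1];
    print("Full match: [" . $whole . "]\n");
    print("Group 1:    [" . $street . "]\n");
}
```

Note that the match only succeeds here because the sample line has a plain space after "IL" and the street and city sit on the same line.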

So your $Match[1] will contain whatever text precedes ', Chicago, IL ' on that line, as long as that text has no back-quote (`) in it. Given that you're dealing with HTML, I wouldn't rely on bare spaces like that; replace them with \s* or \s+ instead. Also, are you sure that the address and city are actually on the same line?
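
Both suggestions can be sketched together: use \s* in place of the bare spaces, and search the whole page at once instead of line by line, so a line break between the street and the city does not matter. The helper name find_street_addresses() and the sample markup are hypothetical, and excluding < and > from the captured text is my own assumption to keep the capture from running across tags.

```php
<?php
// Hypothetical helper, not part of the original script.
function find_street_addresses($contents, $city, $state)
{
    // preg_quote() makes the city and state match as literal text.
    $pattern = '/([^<>`]+?),\s*' . preg_quote($city, '/')
             . ',\s*' . preg_quote($state, '/') . '\b/';

    // Collect every match in the page, not just the first one.
    preg_match_all($pattern, $contents, $matches);

    // $matches[1] holds the first capture group of each match.
    return array_map('trim', $matches[1]);
}

// Made-up markup with a line break between the street and the city.
$sample = "<p>123 Main St,\n   Chicago,   IL 60601</p>\n"
        . "<p>45 Broadway, New York, NY 10006</p>";

$found = find_street_addresses($sample, 'Chicago', 'IL');
print_r($found);
```

Because \s also matches newlines, the same pattern copes with the city being on the line after the street. If the real page puts a tag such as <br /> between them, the character class would need adjusting.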

SteveLetwin

3:55 am on Aug 19, 2007 (gmt 0)

Or it could be that the spaces are actually non-breaking space entities (&nbsp;). You might want to convert HTML entities to normal characters before you run the regex. You could do that with a regex replace, but PHP's html_entity_decode() function will do it for you automatically for most entities.
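
A rough sketch of that clean-up step, assuming the page is UTF-8. The helper name normalize_spaces() and the sample string are made up for illustration. One thing to watch: a decoded &nbsp; becomes the non-breaking space character (U+00A0), not an ordinary space, so the sketch swaps it for a plain space before matching.

```php
<?php
// Hypothetical helper, not part of the original script.
function normalize_spaces($html)
{
    // Turn entities such as &nbsp; and &amp; into real characters.
    $text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // In UTF-8 a non-breaking space is the byte pair C2 A0;
    // replace it with an ordinary space.
    return str_replace("\xC2\xA0", ' ', $text);
}

// Made-up line where every space is really an &nbsp; entity.
$raw = '123 Main St,&nbsp;Chicago,&nbsp;IL&nbsp;60601';
$clean = normalize_spaces($raw);

$street = null;
if (preg_match('/(.*?),\s+Chicago,\s+IL\s/', $clean, $m)) {
    $street = $m[1];
    print("Found match (" . $street . ")\n");
}
```

Running the clean-up on the whole page right after file_get_contents() would let the rest of the script stay as it is.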