
Forum Moderators: coopster & jatar k


Web scraping

Need some help with web scraping

     

bayridge

1:44 pm on Mar 10, 2014 (gmt 0)

5+ Year Member



I'm trying to get baseball scores each day and use them in a script to show on my site. Is anyone familiar with web scraping who can point me to some sample PHP scripts for scraping? I don't want RSS. Thanks.

coopster

2:26 am on Mar 26, 2014 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Welcome to WebmasterWorld, bayridge.

You can build your own spider/bot using PHP and the cURL API. The PHP manual pages have some examples:

[php.net...]
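
To give a rough idea of what the cURL route can look like, here is a minimal sketch. It assumes the cURL extension is enabled and PHP 7.1 or later (for the nullable return type). fetch_page() is a made-up helper name, and the user agent string and the example.com URL are placeholders, so adjust all of them for your own site.

```php
<?php
// Fetch a page with cURL; returns the body as a string, or null on failure.
// fetch_page() is a hypothetical helper name, not part of PHP or cURL.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,  // give up if no connection is made in time
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // cap the whole request
        CURLOPT_USERAGENT      => 'MyScoreBot/0.1', // placeholder identifier
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors (DNS, timeout) and non-2xx responses as failures.
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

$html = fetch_page('http://www.example.com/');
echo $html === null ? "fetch failed\n" : strlen($html) . " bytes fetched\n";
```

Returning null on any failure keeps the calling code simple: check for null once before you try to parse anything.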

incrediBILL

3:52 am on Apr 5, 2014 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



You can even do it using $data=file_get_contents("http://example.com");

Then you can process the content of $data which is a big string containing your web page.
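
As a rough sketch of what "processing the string" can mean, the snippet below pulls two pieces out of $data with preg_match and strpos/substr. The HTML here is an invented stand-in for a fetched page, so the pattern and the "Final:" marker are assumptions you would replace with whatever the real page contains.

```php
<?php
// Stand-in for what file_get_contents() would return; the markup is invented.
$data = '<html><head><title>Scores for Apr 5</title></head>'
      . '<body><p>Final: Mets 4, Nationals 2</p></body></html>';

// Pull out the <title> text with a regular expression.
$title = null;
if (preg_match('/<title>(.*?)<\/title>/is', $data, $m)) {
    $title = trim($m[1]);
}

// Find a known marker and grab everything up to the end of that paragraph.
$final = null;
$start = strpos($data, 'Final:');
if ($start !== false) {
    $end = strpos($data, '</p>', $start);
    if ($end !== false) {
        $final = substr($data, $start, $end - $start);
    }
}

echo $title, "\n"; // Scores for Apr 5
echo $final, "\n"; // Final: Mets 4, Nationals 2
```

String matching like this is quick to write but breaks easily when the page layout changes, which is one reason to look at the DOM approach.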

...or something more sophisticated:

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents("http://www.example.com");
$doc->loadHTML($html);

Now you have the page loaded in a DOMDocument object as $doc and you can extract anything you need using getElementsByTagName.
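
For instance, a small sketch of getElementsByTagName that collects every link and its text. The HTML is an invented sample loaded from a string rather than fetched, so the hrefs and team names are only placeholders.

```php
<?php
// Invented sample page; a real scoreboard will have different markup.
$html = '<html><body><h1>Scores</h1>'
      . '<a href="/game/1">Mets 4, Nationals 2</a>'
      . '<a href="/game/2">Cubs 1, Phillies 3</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

// Map each link's href attribute to its visible text.
$games = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $games[$link->getAttribute('href')] = trim($link->textContent);
}
print_r($games);
```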

Or use DOMXPath and its query functions.
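
A minimal DOMXPath sketch, assuming a scores table with an id of "scores" and cells marked with "team" and "runs" classes. Those names are made up for the example, so look at the real page source to find what to query for.

```php
<?php
// Invented sample markup; the id and class names are assumptions.
$html = '<table id="scores">'
      . '<tr><td class="team">Mets</td><td class="runs">4</td></tr>'
      . '<tr><td class="team">Nationals</td><td class="runs">2</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$scores = [];
// Select every row inside the scores table, then query each row's cells
// relative to that row by passing it as the context node.
foreach ($xpath->query('//table[@id="scores"]//tr') as $row) {
    $team = $xpath->query('td[@class="team"]', $row)->item(0);
    $runs = $xpath->query('td[@class="runs"]', $row)->item(0);
    if ($team !== null && $runs !== null) {
        $scores[trim($team->textContent)] = (int) $runs->textContent;
    }
}
print_r($scores);
```

XPath lets you target elements by id or class in one expression, which tends to hold up better than walking every tag by name.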

Get the idea?

Several ways to do it.

Here are some examples using the methods I described:
[anchetawern.github.io...]

There's a really nice step-by-step tutorial on DIY scraping here:
[oooff.com...]

bayridge

3:01 pm on Apr 5, 2014 (gmt 0)

5+ Year Member



Thanks for your help. I will give it a try.

bayridge

4:37 pm on Apr 5, 2014 (gmt 0)

5+ Year Member



Not working for me. Tried it and get errors.

Warning: file_get_contents() [function.file-get-contents]: php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/content/m/i/k/mikey/html/mysite/scrape01.php on line 4

Warning: file_get_contents(http://www.example.com) [function.file-get-contents]: failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/content/m/i/k/mikey/html/mysite/scrape01.php on line 4

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Empty string supplied as input in /home/content/m/i/k/mikey/html/mysite/scrape01.php on line 5

Used your script
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html=file_get_contents("http://www.example.com");
$doc->loadHTML( $html);
?>

bayridge

4:40 pm on Apr 5, 2014 (gmt 0)

5+ Year Member



First example worked ok from pokemon site.

Second one didn't work

Parse error: syntax error, unexpected T_VARIABLE in /home/content/m/i/k/mikey/html/mysite/scrape02.php on line 2

script
<?php
2. $url = 'http://www.oooff.com';
3. $output = file_get_contents($url);
4. echo $output;
5. ?>

bayridge

4:50 pm on Apr 5, 2014 (gmt 0)

5+ Year Member



Oops. I are a idiot.

Forgot to remove line numbers 1-5 in last example.

Thanks.
 
