Forum Moderators: coopster & jatar k

Reading a Google web page in PHP

They seem to be blocking fopen()

   
2:55 am on Nov 10, 2007 (gmt 0)

5+ Year Member



When I try to read a page from Google with fopen() I get an error like this:

Warning: fopen(http://www.google.com/search?source=ig&hl=en&rlz=&q=whatever&btnG=Google+Search) [function.fopen]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in /home/foo/bar/functions.php on line 25

I'm working on a project that fetches pages and parses a few elements from them. It's not critically important that I be able to fetch pages from Google specifically; I was just doing some testing and noticed this error from their pages. That got me thinking that I may end up seeing this problem with other websites in the future.

Here's my code:


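$fp = @fopen($url, 'r');
if ($fp === false) {
    exit("Could not open $url\n"); // e.g. the 403 shown above
}
$contents = '';
// fread() returns at most 8192 bytes per call, so read until end of stream
while (!feof($fp)) {
    $contents .= fread($fp, 8192);
}
// some parsing happens here...
fclose($fp);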

Is there another method I could use to read pages from Google or am I just doing something wrong?

3:38 am on Nov 10, 2007 (gmt 0)

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member



try using cURL, make it look like a user

[php.net...]

depending on what you are doing, watch the number of requests over time
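
A minimal sketch of the cURL suggestion above, assuming the curl extension is installed. The names fetch_page() and fetch_all(), the 10-second timeout, and the 2-second pause are illustrative assumptions, not anything from this thread:

```php
<?php
// Hypothetical helper: fetch one URL with cURL and an explicit User-Agent header.
// Returns the response body, or false on a transport error or HTTP status >= 400.
function fetch_page($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // sets the User-Agent request header
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumed limit, in seconds

    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return false;
    }
    return $body;
}

// Hypothetical helper: fetch several URLs in turn, sleeping between requests
// to keep the request rate down.
function fetch_all(array $urls, $userAgent, $delaySeconds = 2)
{
    $pages = array();
    foreach (array_values($urls) as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // pause before every request after the first
        }
        $pages[$url] = fetch_page($url, $userAgent);
    }
    return $pages;
}
```

Calling fetch_all(array('http://www.example.com/'), 'Custom Script for example.com') would return an array keyed by URL, with false for any page that could not be fetched.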

3:46 pm on Nov 10, 2007 (gmt 0)

5+ Year Member



Try this:

<?php
// An example, get a web page into a string. See also file_get_contents().
$html = implode('', file('http://www.example.com/'));
?>

Also, RTFM! :)

4:00 pm on Nov 10, 2007 (gmt 0)

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hello Srirangan, welcome to WebmasterWorld! [webmasterworld.com] Your method works the same way as the fopen() method, so it should run into the same 403. The problem identified originally is correct: Google do not encourage automated querying of their search engine.

Just changing the user agent used by PHP will be enough, I believe, but do be aware that repeated or frequent automated access to Google will get your IP blocked.

ini_set('user_agent','Custom Script for example.com');

As I recall, it is only the default PHP user agent which is blocked; you don't need to pretend to be a browser, nor should you.
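
ini_set() changes the user agent for every HTTP stream the script opens afterwards. If only some requests should carry it, a stream context can set it per call. A sketch, where make_http_context() and the 10-second default timeout are hypothetical choices for illustration:

```php
<?php
// Hypothetical helper: build a stream context that sends a custom User-Agent
// header with requests made through PHP's http:// stream wrapper.
function make_http_context($userAgent, $timeoutSeconds = 10)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,      // used instead of the user_agent ini setting
            'timeout'    => $timeoutSeconds, // read timeout, in seconds
        ),
    ));
}

// Usage (not executed here because it needs network access):
// $context = make_http_context('Custom Script for example.com');
// $html    = file_get_contents('http://www.example.com/', false, $context);
// $fp      = fopen('http://www.example.com/', 'r', false, $context);
```

The same context works with fopen(), file() and file_get_contents(), so the original fopen() code only needs the two extra arguments.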

Do remember to pay attention to Google's robots.txt file, which explicitly disallows your URL because it starts with /search:
[google.com...]
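
A script can make the same check itself before fetching. This is a rough sketch only: is_disallowed() is a hypothetical helper, and it is deliberately simplified. It reads only "User-agent: *" groups and treats each Disallow value as a plain path prefix, ignoring Allow lines, wildcards, and agent-specific groups:

```php
<?php
// Hypothetical helper: report whether $path is disallowed for all crawlers
// according to the robots.txt text in $robotsTxt (simplified prefix matching).
function is_disallowed($robotsTxt, $path)
{
    $appliesToAll = false;
    $lastField    = '';
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = explode(':', $line, 2);
        $field = strtolower(trim($field));
        $value = trim($value);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            $appliesToAll = ($lastField === 'user-agent' && $appliesToAll) || $value === '*';
        } elseif ($field === 'disallow' && $appliesToAll && $value !== '') {
            if (strpos($path, $value) === 0) {
                return true; // $path starts with a disallowed prefix
            }
        }
        $lastField = $field;
    }
    return false;
}
```

The path passed in should start with a slash, for example the result of parse_url($url, PHP_URL_PATH).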

Google do offer a wide range of APIs:
[code.google.com...]
This is the recommended method for accessing the content and the one which Google allows.

4:11 pm on Nov 10, 2007 (gmt 0)

5+ Year Member



As I recall, it is only the default PHP user agent which is blocked...

This does indeed appear to be the case. I changed the user agent and my script is now working as expected.

4:19 pm on Nov 10, 2007 (gmt 0)

5+ Year Member



Woops.. sorry missed that vince.. :o)