Forum Moderators: open


Why can't Google results be parsed?

I'm trying to parse it to make a Google-update-Start-Detect-Machine

         

dirson

8:47 am on Oct 26, 2002 (gmt 0)

10+ Year Member



Hello.

Using Perl on Linux+Apache, I'm trying to parse this webpage:

[www2.google.com...]

in order to obtain the number of backlinks.

But I get an error message. On the other hand, I'm able to parse pages like:
[google.com...] or
[google.com...]

I'm using 'LWP::Simple' Perl module.

I suppose that Google does not allow parsing their results. Is this right?

Does anybody have any experience?

Thank you very much.

warumauchnicht

9:17 am on Oct 26, 2002 (gmt 0)

10+ Year Member



Hello dirson,

I think you have to change your USER AGENT. I don't know how to do that in Perl, but maybe Google checks it.

If it's something like "Perl 4.01", Google shows an error. If it's something like "Mozilla/4.0" and so on, no error appears.

Try and error,
Tino

Nick_W

9:24 am on Oct 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld warumauchnicht and dirson ;)

I think you mean Trial and error!

Good luck with the script, do you plan to make a 'mailing list' for it?

Nick

dirson

9:32 am on Oct 26, 2002 (gmt 0)

10+ Year Member



About mailing list... of course I'm planning it! :)

Nick_W

9:36 am on Oct 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You may want to think about using the Google API instead of just querying, as Google dislikes automated queries and your domain/IP could suffer...

Nick

dirson

9:54 am on Oct 26, 2002 (gmt 0)

10+ Year Member



I thought about using the Google API, but is it possible to obtain results from www2 and www3?

Nick_W

10:03 am on Oct 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good point! I expect not...

Nick

andreasfriedrich

3:53 pm on Oct 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I suppose that Google does not allow parse their results. Is this right?

Don't [s]end automated queries to Google in an attempt to monitor your site's ranking. [google.com]

That's the legal side of things[1]. However, technically there is nothing they can do to prevent you from parsing their results once you have obtained them. It's just some stupid HTML document, after all.

Running your queries from a fixed IP address might be a bad idea. Running thousands of automated queries from a dynamic IP address might also be a bad idea, since Google might block access to their site for the whole IP block.

Using automated queries in a sensible manner (don't hammer Google's server with requests - one query a day, say) will probably work OK.
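If one query a day is enough, cron can drive the whole thing; a hypothetical crontab line (the script path and name are made up):

```crontab
# run once a day at 06:00 (path is hypothetical)
0 6 * * * /home/dirson/bin/check-backlinks.pl
```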

Using LWP::Simple [search.cpan.org] you cannot change the UA string, but something like the following code will work:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    timeout => 30,
    agent   => 'some real browser UA string',
);
my $response =
    $ua->get('http://www2.google.com/search?q=link:http://www.yahoo.com');
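Once the page is fetched, the backlink count could be pulled out with a regex; a minimal sketch, assuming the results page contains a phrase like "of about 1,234" (that exact wording is an assumption on my part, and Google may change it at any time):

```perl
use strict;
use warnings;

# Pull the first "of about N" number out of a results page.
# The phrase itself is an assumption; returns undef if not found.
sub backlink_count {
    my ($html) = @_;
    if ($html =~ /of\s+about\s+([\d,]+)/i) {
        (my $n = $1) =~ tr/,//d;   # strip thousands separators
        return $n;
    }
    return undef;                  # phrase not present
}

# Offline demonstration with a canned snippet:
my $sample = 'Results 1 - 10 of about 1,234.';
print backlink_count($sample), "\n";   # prints 1234
```

In the real script you would pass `$response->content` to the sub instead of the canned string.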

Your aim of building a "Google-update-Start-Detect-Machine" suggests, however, that you won't use this in a sensible manner, since you will probably want to run it every minute or so to detect the start of an update. There is no doubt that such a high frequency of requests will get you into trouble.

Andreas

--------------------------
[1] One might argue that trying to figure out when a new update starts is no "attempt to monitor your site's ranking". But...

cminblues

4:40 pm on Oct 28, 2002 (gmt 0)

10+ Year Member



>>There is no doubt that such a high frequency of requests will get you into trouble.<<

Not if you use a large number of proxies, and change them every day. :)

<added>Ouch! Just read that you want to do this using Perl/Linux/Apache... so I think you mean from a server, maybe your own server... be very, very careful hehe ;)</added>

cminblues