parsing google "about X results" statistic

parsing google results

mistert_1

3:38 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Could anybody point me in the right direction on this? I have searched previous posts and haven't found exactly what I'm after.

I'd like to be able to throw a list of around 5,000 URLs at Google and obtain the statistic of how many results each throws up. Specifically, I'm interested in the "link:<URL>" results. I am using this as some measure of reputation/standing for a list of companies.
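The count-extraction step described above can be sketched as follows. Mike mentions PHP, but here is a Python illustration of the same idea; the "of about" wording and the surrounding markup are assumptions about how Google's result pages looked at the time, so the pattern would need checking against real output.

```python
import re

def parse_result_count(html):
    """Pull the 'of about X' total out of a results-page snippet.

    Returns the count as an int, or None if no count is found.
    The pattern is an assumption and may need adjusting.
    """
    # Strip tags first so bold markup around the number does not matter.
    text = re.sub(r"<[^>]+>", "", html)
    match = re.search(r"of about ([\d,]+)", text)
    if match:
        return int(match.group(1).replace(",", ""))
    return None

# Example with a made-up snippet of result-page text:
print(parse_result_count("Results 1 - 10 of about <b>60,100</b>."))  # prints 60100
```

The extracted integer could then be stored per URL in a MySQL table alongside the company record.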

I have a Linux Apache server, and basic PHP/MySQL experience.

Any/all comments/suggestions gratefully received.
TIA
Mike

Net_Wizard

6:23 pm on Jun 23, 2003 (gmt 0)



vincex3,

The difference is the user agent that is sent to Google. A legitimate browser query would send a proper user agent identifying it as a 'known' browser.

As Speedmax has already explained, his app does not send a user agent, which could flag his domain/ISP IP as doing automated queries.
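The user-agent point can be illustrated with a minimal Python sketch. The browser string below is only an example signature, not a recommendation to disguise automated traffic (the thread itself points at the Google Web API as the legitimate route):

```python
import urllib.request

# A request without a browser-like User-Agent (or with a library default
# such as "Python-urllib") is easy for a server to flag as automated.
# The header value here is only an example browser signature.
req = urllib.request.Request(
    "http://www.google.com/search?q=link:example.com",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
print(req.get_header("User-agent"))  # the header the server would see
```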

mistert_1

Sorry, I'm not following you; I thought we were talking about a ranking/position tool in this thread.

Cheers

mistert_1

6:33 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



Net_wizard,

Very good point re: browser ID.

I did actually start this thread in order to parse the results of a Google "link:companyurl" query so that I could capture the "number of matches" statistic as an indicator of reputation (i.e. a company with 60,000 links to it is more reputable than one with none). I realise this is imperfect, but it would be a nice additional variable in a pretty large academic study I am doing. There have been loads of suggestions, some more related than others, but it seems automating Google queries attracts an awful lot of interest.

Keep it coming.
Mike

philipp

7:01 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



Mistert, I am also not talking about reading out the content of the site, just the page count. If you're interested, you should get started with the Google Web API, which is the legal way to implement automated querying (GoogleGuy, I sort of miss this addition to your statement). Yes, with the default limit of 1,000 requests you need to query Google for 5 days, 1,000 times per day, but since it will be automated you won't be doing anything other than clicking somewhere, drinking a coffee, and returning when everything is saved and finished.

The Google page count can be read independently of the returned pages (which are 10 at most per single query). For example, if you move your mouse over a bar on my Centuryshare graph, you will see something like "1,400 of 302,230", which means there were e.g. 1,400 pages containing <<"beatles" + 1963>> versus the 302,230 containing just <<1963>>. These are the same numbers as what Google outputs as "... of about [x] pages ...".
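Philipp's arithmetic (5,000 queries under a 1,000-per-day limit) works out to five days of automated batches. A minimal scheduling sketch in Python, with hypothetical batch bounds standing in for real API calls:

```python
import math

def schedule(total_queries, daily_limit=1000):
    """Split a job into day-sized (start, end) index ranges under a
    per-day query quota."""
    days = math.ceil(total_queries / daily_limit)
    batches = []
    for day in range(days):
        start = day * daily_limit
        end = min(start + daily_limit, total_queries)
        batches.append((start, end))
    return batches

print(len(schedule(5000)))  # prints 5
```

Each day's batch would then be fed to the API, with the day's results saved before sleeping until the quota resets.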

g1smd

7:17 pm on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> the 9 datacenter is <<

>> "www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu" <<

Where does that leave: -in, -dc, and -va then?

speedmax

10:12 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



I mean you can add more datacenters if you like.

Any number will do.
