parsing google "about X results" statistic


mistert_1

3:38 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Could anybody point me in the right direction on this? I have done a search of previous posts and haven't come up with exactly what I'm after.

I'd like to be able to throw a list of around 5,000 URLs at Google and obtain the statistic of how many results each throws up. Specifically, I'm interested in the "link:<URL>" results. I am using this as some measure of reputation/standing for a list of companies.

I have a Linux Apache server, and basic PHP/MySQL experience.

Any/all comments/suggestions gratefully received.
TIA
Mike

edit_g

4:02 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld, mistert_1!

This is a link to Google's terms of service page: [google.com...] . Take a look at the section headed "No Automated Querying".

If you run afoul of this statement (which you would, were you to carry out your idea), then your IP could be banned, along with your site if they can work this out from your IP and queries.

Not a good idea at all really...

<edited> period killing url...

minivip

4:13 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



What about using the Google Web API? It allows 1,000 requests per day, so after 5 days the work is done. Wouldn't that be legitimate?

mistert_1

5:22 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Thanks for the info.

I suppose I figured Google wouldn't be *too* keen, but wasn't too sure that they'd notice. 5,000 queries isn't a massive amount.

The API is interesting, but I haven't been able to make much sense of it, and it appears to only support Perl or .NET?

Surely somebody must've used it for a similar purpose?

vincevincevince

5:48 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Make the Google URL by running the search query through urlencode() and then concatenating it with the rest of the query string:

$url="http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=".urlencode($query);

With PHP, open the URL as a file in the normal way and read it (only enough to get the result-count bit) into a variable $text.

In this variable ($text), the raw part you want to parse is:

<font size=-1 color=#ffffff>Results <b>1</b> - <b>10</b> of about <b>63,200,000</b>.

Write a regular expression for this. Use preg_match; I suggest:

/ of about \<b\>([0-9,]*)/

ie:

preg_match("/ of about \<b\>([0-9,]*)/",$text,$array);

If you capture the match array (3rd argument of preg_match), you should find that $array[1] is the number.

Run it through str_replace to replace each , with a blank.
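Putting those steps together, a minimal sketch (the "of about" markup is taken from the sample snippet above and could change at any time; the helper name is my own):

```php
<?php
// Pull the estimated result count out of a Google results page.
// The "of about <b>N</b>" pattern is an assumption based on the
// sample markup quoted above; Google can change its HTML whenever.
function parse_result_count($text) {
    if (preg_match("/ of about \<b\>([0-9,]*)/", $text, $array)) {
        // strip the thousands separators: "63,200,000" -> 63200000
        return (int) str_replace(",", "", $array[1]);
    }
    return null; // pattern not found
}

$sample = 'Results <b>1</b> - <b>10</b> of about <b>63,200,000</b>.';
echo parse_result_count($sample); // prints 63200000
```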

mistert_1

6:07 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Wow, Thanks Vincex3

I will get the books out [:(], and give that a try.

I notice in the FAQs at: [google.com...]
that I can only receive 10 results per query. Does that mean I must fire the script 500 times, or is it possible to loop it?

(this is all hypothetical, as i have read the terms and conditions so diligently pointed out above. ;-) )

Krapulator

6:40 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The API will return the number of results found along with the results, so if it is only the total number of results returned that you are looking for, you would only need to query once for each company.

vincevincevince

7:20 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My solution isn't using the API, and obviously you'd not use it, as it would be against Google's TOS.

To hypothetically loop would be easy to do with PHP.

Look in your book for "for" loops:


$queries = [the array of search terms];
for ($count = 0; $count < sizeof($queries); $count++)
{
    /*
    make the query url
    read it as a file
    parse for the number
    output your result
    */
}

It's not hard PHP to do; it's all easy to find in most books.
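A hypothetical runnable version of that loop (the query list is a placeholder, and actually fetching the pages would hit the TOS issue discussed above, so the fetch stays commented out):

```php
<?php
// Hypothetical loop over a list of search terms, per the sketch above.
// The queries here are made-up placeholders; automatically fetching
// Google result pages is against Google's TOS, hence no live fetch.
$queries = array("widget company", "another company");
$urls = array();

foreach ($queries as $query) {
    // make the query url
    $url = "http://www.google.com/search?q=" . urlencode($query);
    // $text = implode("", file($url)); // read it as a file (TOS issue!)
    // ...parse $text for the number, output your result...
    $urls[] = $url;
}

print_r($urls);
```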

mistert_1

10:46 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Cheers Vince.

Indeed. Fully hypothetically, I reckon I could get away with it once... ;-)

Krapulator, I was after feeding an entire text file of company URLs in - a single click operation. (Once coded)

I will have a bit more of a poke at the API before getting "the manual" out, but thanks to everyone for their help.

--Mike

RBuzz

10:51 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



You would have to use the API 5,000 times. Each API key use covers one search, so if you were searching for:

link:http://www.example.com
link:http://www.example.net
link:http://www.example.org

each one of those searches would be one key use. So you would have to spread it out over five days.

One of the elements of the Google results array that you get back is the result count. So you can just pull that from the array and not have to parse the search results. That's how the Google Hack "kin count" works -- you can't get actual phone numbers back when doing an API search using the phonebook: syntax, but you do get the number of results.
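For illustration, here is how the count might be read from an API response. The field name estimatedTotalResultsCount comes from the Google Web API's result structure, but the $result object below is a hand-built stub, not a live SOAP call:

```php
<?php
// The Google Web API's doGoogleSearch returns a structure whose
// estimatedTotalResultsCount field holds the total-hits figure, so no
// HTML parsing is needed.  $result here is a stub, not a real response.
function api_result_count($result) {
    return (int) $result->estimatedTotalResultsCount;
}

$result = new stdClass();                    // stand-in for a SOAP response
$result->estimatedTotalResultsCount = 154;   // made-up count

echo api_result_count($result); // prints 154
```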

HTH,

RBuzz

RBuzz

10:53 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



Sorry, I didn't really answer your question:

yes, you can loop using the API. You'd have to split your file into 1,000-URL chunks, then make a loop that reads each line and runs the query.
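That 1,000-a-day chunking can be sketched with array_chunk (the URL list here is a placeholder):

```php
<?php
// Split a URL list into batches that fit the API's 1,000-queries/day
// limit: 5,000 URLs -> 5 batches, one per day.  The list is made up.
function daily_batches($urls, $per_day = 1000) {
    return array_chunk($urls, $per_day);
}

$urls = array_fill(0, 5000, "link:http://www.example.com"); // placeholder
$batches = daily_batches($urls);

echo count($batches);    // prints 5
echo "\n";
echo count($batches[0]); // prints 1000
```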

... then set it off and go get a sandwich. I've written some API-usin' programs that burn 200-300 key uses at a time, and it's a bit boring to sit there and stare at the screen. :->

RBuzz

speedmax

8:09 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



Thanks for the idea. I got inspired and quickly wrote a PHP script (tested on PHP 4.3.1). I can say this is very handy for webmasters; it saves going through loads of pages typing in all those "allinxxx" queries.

I call it Advanced Google Checker. It reports:

ranking position in the top 100 results
number of pages indexed
number of inbound links

from every Google datacenter (9 of them).

Keep in mind that this script is pretty slow, as it parses about 30 pages per check; performance also depends on your connection speed.

This code is for educational use only, use at your own risk.

Enjoy it!

<?php
/*
File Name: google.php
Advanced Google Checker
Author: SPEEDMAX

HOW TO USE
[yourdomain.com...]
*/
$query = $_GET['key'];
$domain = $_GET['url'];
$numResult = 100;
// if you want to add/delete datacenters, edit this line
$datacenter = array("www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu");

echo "Google Checker : $domain <br>";
echo '<table border=1><tr><td>Data Center</td><td>Ranking</td><td>No of pages indexed</td><td>Inbound Links</td></tr>';
foreach ($datacenter as $v) {
    // ranking: fetch the top 100 results and look for the domain
    $url = "http://$v.google.com/ie?q=".urlencode($query)."&hl=en&lr=&ie=ISO-8859-1&num=100&start=0&sa=N";
    $file = implode("", file($url));
    $file = explode("<NOBR>", $file);
    $result = 0; // reset each pass so a miss doesn't reuse the previous datacenter's rank
    foreach ($file as $key => $value) {
        if (eregi($domain, $value)) {
            $result = $key;
        }
    }
    echo "<tr><td>".$v."</td><td>".$result."</td>";

    // pages indexed: site: query, count parsed from the results line
    $url2 = "http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3A$domain+-wx34A&num=1";
    $file = implode("", file($url2));
    if (preg_match("/ of about \<b\>([0-9,]*)/", $file, $text)) {
        echo "<td>".$text[1]."</td>";
    }

    // inbound links: link: query, same parsing
    $url3 = "http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3A$domain&num=1";
    $file = implode("", file($url3));
    if (preg_match("/ of about \<b\>([0-9,]*)/", $file, $text)) {
        echo "<td>".$text[1]."</td>";
    }
    echo "</tr>";
}
echo "</table>";
?>

mistert_1

11:33 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



Hi Speedmax.

Thanks for the code, but I am not exactly sure what function it performs. Sorry to sound ignorant, but it appears to be a parsing script to check your Google ranking?

I am pretty new to this side of Google, and would love to hear a little more.

Am I right in saying that the script is run by following the URL given (http://yourdomain...) and that it therefore checks your ranking for a given keyword?

If so, it is a very good place for me to start. Thanks again.

Mike

mistert_1

11:43 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



Hmm...

There seem to be a couple of problems with the code?

I have replaced my domain with "yourdomain" and my username with "username", but this is roughly the result...
<edit>
$file=implode("",file($url));
</edit>

[edited by: Brett_Tabke at 2:52 pm (utc) on June 23, 2003]

mistert_1

11:46 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



Sorry for the excessive pasting.

The [apparently] bad line is:

$file=implode("",file($url));

which is repeated several times with a different URL variable ($url, $url2, $url3).

--Any ideas?

Mike

speedmax

3:05 am on Jun 23, 2003 (gmt 0)

10+ Year Member



Make sure that your hosting provider supports file($url).

Try this code to test whether your host supports it:

<?php
$url = "http://www.google.com";
$html = implode("",file($url));
echo $html;
?>

You should see a copy of the Google homepage.

:P

speedmax

4:52 am on Jun 23, 2003 (gmt 0)

10+ Year Member



For those who are interested, this is the output of the script:
=========================================
Google Checker
URL : www.example.com
Keyword : lyrics
Data Center | Ranking | Pages indexed | Inbound Links
www         |      97 |        19,500 |           154
www2        |      90 |        17,000 |           154
www3        |      90 |        17,000 |           154
www-ex      |      98 |        19,100 |           154
www-fi      |      90 |        17,000 |           154
www-cw      |      97 |        19,500 |           154
www-sj      |      97 |        19,100 |           154
www-ab      |      97 |        17,200 |           154
www-zu      |      97 |        17,200 |           154

vincevincevince

7:24 am on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



speedmax, that's a beautiful script, nice one :-)

mistert_1

10:49 am on Jun 23, 2003 (gmt 0)

10+ Year Member



Speedmax,

I do indeed see the cached version of Google: regular Google, but missing the image file.

The original script still isn't working for me, though. I wonder if the input URL is right?

I am using:
[mydomain.com...]

Where "mydomain" has been replaced with my real domain.

I am still at a loss over exactly what the 9 different datacentres are, and what their importance is. Is there some FAQ I should have read?

Thanks,
Mike

speedmax

10:59 am on Jun 23, 2003 (gmt 0)

10+ Year Member



Check your sticky mail.

The 9 datacenters are:

"www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu"

You may need PHP 4.3.0 or above; here is the latest version, a bit faster. If you are running an old version of PHP, use the old script.

---------------------------------------------------
<?php
/*
File Name: google.php
Advanced Google Checker
Author: SPEEDMAX

HOW TO USE
[yourdomain.com...]
*/

$query = $_GET['key'];
$domain = $_GET['url'];
$mode = $_GET['mode'];
$numResult = 100;
$datacenter = array("www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu");
echo "Google Checker <br>URL : $domain<br>Keyword : $query <br>";
echo '<table border=1><tr><td>Data Center</td><td>Ranking</td><td>No of pages indexed</td><td>Inbound Links</td></tr>';

foreach ($datacenter as $v) {
    // ranking: fetch the top 100 results and look for the domain
    $url = "http://$v.google.com/ie?q=".urlencode($query)."&hl=en&lr=&ie=ISO-8859-1&num=100&start=0&sa=N";
    $file = file_get_contents($url);
    $file = explode("<NOBR>", $file);
    $result = 0; // reset each pass so a miss doesn't reuse the previous datacenter's rank
    foreach ($file as $key => $value) {
        // preg_quote stops the dots in the domain acting as regex wildcards
        if (preg_match("/".preg_quote($domain, "/")."/i", $value)) {
            $result = $key;
        }
    }
    unset($file);
    echo "<tr><td>".$v."</td><td>".$result."</td>";

    // pages indexed: site: query
    $url2 = "http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3A$domain+-wx34A&num=1";
    $file = file_get_contents($url2);
    if (preg_match("/ of about \<b\>([0-9,]*)/", $file, $text)) {
        echo "<td>".$text[1]."</td>";
    }
    unset($file);

    // inbound links: link: query
    $url3 = "http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3A$domain&num=1";
    $file = file_get_contents($url3);
    if (preg_match("/ of about \<b\>([0-9,]*)/", $file, $text)) {
        echo "<td>".$text[1]."</td>";
    }
    unset($file);
    echo "</tr>";
}
echo "</table>";
?>

georgeek

11:14 am on Jun 23, 2003 (gmt 0)

10+ Year Member



edit_g

If you run afoul of this statement (which you would, were you to carry out your idea), then your IP could be banned, along with your site if they can work this out from your IP and queries.

If this were true then you could of course get your competitor banned by spoofing their IP address.

vincevincevince

11:43 am on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ever tried spoofing an IP address? It's not easy.
But yes, I can see that it could theoretically be possible, if somewhat against the spirit of SEO, to do this and get a competitor banned.
I know people have had IPs banned from accessing Google for similar things, but I'm not sure about sites residing at those domains. If GoogleGuy is not aware of the potential here, I would be surprised.

I was also meaning to add: run it from a dialup connection (if you must run it) so you have a new IP every time; don't run it from your webhost. If you've got a Windows OS at home, then FoxServ is an easy way to get a working PHP setup :-)

edit_g

11:47 am on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If this were true then you could of course get your competitor banned by spoofing their IP address.

And indeed I'm sure you can - but it is easier said than done. Go ahead and run the script - I really don't care. I'm not getting into an argument when the facts are so blatantly obvious.

speedmax

2:06 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



I think it is rather interesting that people keep thinking that if I run this script, I will get banned from Google.

As a sysadmin of 3 years, I can tell you it is harder than you think to track.

I believe Google's policy is about ensuring fair use of their resources.

I don't see any problem if I check this once a day.
It only sends 3 requests. This is nothing.

It works the same way as Googlebot: download some pages and analyze the information.

It doesn't send a USER-AGENT header. From my experience, many requests to my own site carry no USER-AGENT information. Unless there is suspicious activity, such as 1,000 requests every hour, what I do is ban that IP or send some rubbish back to it.

Of course, if you don't want to risk anything, just set up a home server.

philipp

3:12 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



I did something similar with the Centuryshare calculator (please use Google to find it; I can't post URLs here). It needs 40 requests per calculation, because it takes a certain word (say, "Beatles") and then shows how "popular" this thing was through the years. (I'm using a mindshare with every year of the 20th century, and the program already stores the "default" page count for just a year on its own, to show relevant peaks.) Examples can be seen at Google Blogoscoped; see the archive.

If you want to know how to query the Google Web API (which is what I do for the Centuryshare calculator, as Google doesn't want automatic fetching of its pages), see the recent Google Web API tutorial.

If you want to calculate the trustworthiness/ reputation of the 5,000 companies, maybe you could also do a mindshare.

E.g. how many pages for "Example Inc + good + trustworthy + reliable" vs how many pages for "Example Inc + bad + liars + unreliable".
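A toy sketch of that comparison (the counts are invented; in practice they would be the result counts returned by the two queries):

```php
<?php
// Mindshare: the share of "good" pages out of good + bad, per the
// idea above.  The counts are invented, not real query results.
function mindshare($good, $bad) {
    $total = $good + $bad;
    return $total > 0 ? $good / $total : 0.5; // 0.5 = no signal either way
}

echo mindshare(300, 100); // prints 0.75
```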

GoogleGuy

4:43 pm on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



BTW, I would *not* recommend doing automated queries against Google. If your IP address gets shut off, that can be a big pain for you (and your ISP).

minivip

4:56 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



Does this mean automated usage of the Google API is frowned upon as well? I hope not.

Net_Wizard

5:14 pm on Jun 23, 2003 (gmt 0)



Well, one way to solve this problem is to do your normal 'widget' query with the results preference set to 100, then 'save' the page temporarily to your hard drive (I wouldn't go looking beyond the first 100).

Then, using your parser app, you can easily look for your domain and its current ranking.

This method is a little slower than a fully automated app, but it won't get your/your ISP's IP banned from Google.

Cheers

vincevincevince

5:49 pm on Jun 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



'save' the page temporarily to your hard drive(I won't go looking beyond the 1st 100).
Then, using your parser app, you can easily look for your domain and its current ranking.

I run PHP on my PC here. When I tested this script, you know what it did? It downloaded the page, 'save'd it temporarily to memory, then, using the parser app, it easily looked for the info I wanted... what's the difference?

mistert_1

6:04 pm on Jun 23, 2003 (gmt 0)

10+ Year Member



I appreciate the lateral thinking, but I'm not sure if I have fully explained the problem.

@philipp
Good idea re: mindshare, except some of these companies may be very small and will rely on Google to do business. Thus I feel checking the syntax "link:companyurl.com" in Google will give a better estimate?

@GoogleGuy
Thanks for the advice.

@Net_Wizard/Vincex3:
Regarding the 100-result saved page: I'm not sure I fully understand? I am after just the number of results returned, not the actual content of them. Plus, I am looking to cut down on throwing "widgetcompany#1" into Google 5,000 (max) times...

If I have gotten the wrong end of the stick, do shout out

Thanks again,

Mike
