Forum Moderators: open
I'd like to be able to throw a list of around 5,000 URLs at Google and obtain how many results each one returns. Specifically, I'm interested in the "link:<URL>" results. I am using this as a rough measure of reputation/standing for a list of companies.
I have a Linux Apache server, and basic PHP/MySQL experience.
Any/all comments/suggestions gratefully received.
TIA
Mike
This is a link to Google's terms of service page: [google.com...] . Take a look at the section headed "No Automated Querying".
If you run afoul of this statement (which you would, were you to carry out your idea), then your IP could be banned, along with your site if they can garner this from your IP and queries.
Not a good idea at all really...
<edited> period killing url...
I suppose I figured Google wouldn't be *too* keen, but wasn't too sure that they'd notice. 5,000 queries isn't a massive amount.
The API is interesting, but I haven't been able to make much sense of it, and it appears to only support Perl or .NET?
Surely somebody must've used it for a similar purpose?
$url="http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=".urlencode($query);
With PHP, open the URL as a file in the normal way and read the file (only enough to get the "number of pages" bit) into a variable $text.
In this variable ($text), the raw part you want to parse is:
<font size=-1 color=#ffffff>Results <b>1</b> - <b>10</b> of about <b>63,200,000</b>.
Write a regular expression for this:
Use preg_match; I suggest the regex:
/ of about \<b\>([0-9,]*)/ i.e.:
preg_match("/ of about \<b\>([0-9,]*)/",$text,$array); If you catch the matched array (3rd argument of preg_match), you should find that $array[1] is the number.
Then run it through str_replace to replace each comma with nothing.
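Putting those pieces together, a minimal sketch (the sample string here is just the raw HTML fragment quoted above, so it runs without touching Google):

```php
<?php
// Sample fragment of a Google results page, as quoted above.
$text = 'Results <b>1</b> - <b>10</b> of about <b>63,200,000</b>.';

// Capture the digits (and commas) that follow " of about <b>".
if (preg_match("/ of about \<b\>([0-9,]*)/", $text, $array)) {
    // Strip the thousands separators to get a plain number string.
    $count = str_replace(",", "", $array[1]);
    echo $count; // 63200000
}
?>
```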
I will get the books out [:(], and give that a try.
I notice in the FAQs at: [google.com...]
that I can only receive 10 results per query. Does that mean I must fire the script 500 times, or is it possible to loop it?
(this is all hypothetical, as i have read the terms and conditions so diligently pointed out above. ;-) )
To hypothetically loop would be easy to do with PHP;
look in your book for "for" loops:
$queries= [the array of search terms]
for ($count=0;$count<sizeof($queries);$count++)
{
/*
make the query url
read it as a file
parse for the number
output your result
*/
}
It's not hard PHP to do; it's all easy to find in most books.
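A sketch of that hypothetical loop with the URL-building step filled in (the query terms are placeholders, and the fetch-and-parse steps are left as comments so nothing actually hits Google):

```php
<?php
// Hypothetical list of search terms -- in practice, read from a
// text file with file() and trim each line.
$queries = array("link:www.example.com", "link:www.example.net");

for ($count = 0; $count < sizeof($queries); $count++) {
    // Make the query URL.
    $url = "http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q="
         . urlencode($queries[$count]);
    echo $url . "\n";
    // Read it as a file, parse for the number, output the result:
    //   $text = implode("", file($url));
    //   preg_match("/ of about \<b\>([0-9,]*)/", $text, $array);
    //   echo $queries[$count] . ": " . $array[1] . "\n";
}
?>
```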
Indeed. Fully hypothetically, I reckon I could get away with it once... ;-)
Krapulator, I was after feeding an entire text file of company URLs in - a single click operation. (Once coded)
I will have a bit more of a poke at the API before getting "the manual" out, but thanks to everyone for their help.
--Mike
link:http://www.example.com
link:http://www.example.net
link:http://www.example.org
Each one of those searches would be one key use, so you would have to parcel it out over five days.
One of the elements of the Google results array that you get back is the result count. So you can just pull that from the array and not have to parse the search results. That's how the Google Hack "kin count" works -- you can't get actual phone numbers back when doing an API search using the phonebook: syntax, but you do get the number of results.
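Assuming the API response has already been decoded into a PHP array (the real call would be doGoogleSearch() through a SOAP client; the $resultArray below is a mocked stand-in, and estimatedTotalResultsCount is the field name from the Google Web APIs documentation), pulling the count looks roughly like this:

```php
<?php
// Mocked stand-in for a decoded Google Web APIs (SOAP) response.
$resultArray = array(
    "estimatedTotalResultsCount" => 63200000,
    "resultElements"             => array(), // individual hits, not needed here
);

// Pull the count straight from the array -- no HTML parsing required.
$count = $resultArray["estimatedTotalResultsCount"];
echo $count; // 63200000
?>
```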
HTH,
RBuzz
Yes, you can loop using the API. You'd have to split your file into 1,000-URL chunks, then make a loop that opens each line and runs the query.
... then set it off and go get a sandwich. I've written some API-usin' programs that burn 200-300 key uses at a time, and it's a bit boring to sit there and stare at the screen. :->
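A sketch of the chunking, assuming the URLs have been read in one per line (array_chunk needs PHP 4.2+; the generated list just stands in for file("urls.txt")):

```php
<?php
// Hypothetical list standing in for file("urls.txt") -- 5,000 lines in practice.
$urls = array();
for ($i = 1; $i <= 2500; $i++) {
    $urls[] = "www.example$i.com";
}

// Split into chunks of 1,000 -- one chunk per day of API key allowance.
$chunks = array_chunk($urls, 1000);
echo count($chunks) . " chunks\n";           // 3 chunks
echo count($chunks[0]) . " in the first\n";  // 1000 in the first
?>
```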
RBuzz
I call it Advanced Google Checker. It reports:
ranking position in the top 100 results
number of pages indexed
number of links
from every Google datacenter (9 of them).
Keep in mind that this PHP script is pretty slow, because it parses about 30 pages per check; performance also depends on your connection speed.
This code is for educational use only, use at your own risk.
Enjoy it!
<?php
/*
File Name google.php
Advanced Google checker
Author : SPEEDMAX
HOW TO USE
[yourdomain.com...]
*/
$query = $_GET['key'];
$domain = $_GET['url'];
$numResult = 100;
// if you want to add/delete datacenter edit this line
$datacenter = array("www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu");
echo "Google Checker : $domain <br>";
echo '<table border=1 ><tr><td>Data Center</td><td>Ranking</td><td>No of page indexed</td><td>Inbound Link</td></tr>';
foreach($datacenter as $v){
// reset per datacenter, otherwise a miss would reuse the previous ranking
$result = "-";
$url="http://$v.google.com/ie?q=".urlencode($query)."&hl=en&lr=&ie=ISO-8859-1&num=100&start=0&sa=N";
$file=implode("",file($url));
$file = explode("<NOBR>",$file);
foreach($file as $key => $value){
// quotemeta stops the dots in the domain acting as regex wildcards
if(eregi(quotemeta($domain), $value)) {
$result = $key;
}
}
echo "<tr><td>".$v."</td><td>".$result."</td>";
$url2="http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3A$domain+-wx34A&num=1";
$file=implode("",file($url2));
if(preg_match("/ of about \<b\>([0-9,]*)/",$file,$text)){
echo "<td>".$text[1]."</td>";
} else {
echo "<td>-</td>"; // keep the table row aligned when no count is found
}
$url3="http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3A$domain&num=1";
$file=implode("",file($url3));
if(preg_match("/ of about \<b\>([0-9,]*)/",$file,$text)){
echo "<td>".$text[1]."</td>";
} else {
echo "<td>-</td>"; // keep the table row aligned when no count is found
}
echo "</tr>";
}
echo "</table>";
?>
Thanks for the code, but I am not exactly sure what function it performs. Sorry to sound ignorant, but it appears to be a parsing script to check your Google ranking?
I am pretty new to this side of google, and would love to hear a little more?
Am I right in saying that the script is run by following the URL given (http://yourdomain...) and it therefore checks your ranking for a given keyword?
If so, it is a very good place for me to start. Thanks again.
Mike
I do indeed see the cached version of Google: regular Google, but missing the image file.
The original script still isn't working for me, though. I wonder if the input URL is right?
I am using:
[mydomain.com...]
Where "mydomain" has been replaced with my real domain.
I am still at a loss over exactly what the 9 different datacentres are, and what their importance is. Is there some FAQ I should have read?
Thanks,
Mike
The 9 datacenters are:
"www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu"
You may need PHP 4.3.0 or above; here is the latest version, a bit faster. If you are running an old version of PHP, use the old script.
---------------------------------------------------
<?php
/*
File Name google.php
Advanced Google checker
Author : SPEEDMAX
HOW TO USE
[yourdomain.com...]
*/
$query = $_GET['key'];
$domain = $_GET['url'];
$mode=$_GET['mode'];
$numResult = 100;
$datacenter = array("www","www2","www3","www-ex","www-fi","www-cw","www-sj","www-ab","www-zu");
echo "Google Checker <br>URL : $domain<br>Keyword : $query <br>";
echo '<table border=1><tr><td>Data Center</td><td>Ranking</td><td>No of page indexed</td><td>Inbound Link</td></tr>';
foreach($datacenter as $v){
// reset per datacenter, otherwise a miss would reuse the previous ranking
$result = "-";
$url="http://$v.google.com/ie?q=".urlencode($query)."&hl=en&lr=&ie=ISO-8859-1&num=100&start=0&sa=N";
$file=file_get_contents($url);
$file = explode("<NOBR>",$file);
foreach($file as $key => $value){
// preg_quote stops the dots in the domain acting as regex wildcards
if (preg_match("/".preg_quote($domain, "/")."/i", $value)){
$result = $key;
}
}
unset($file);
echo "<tr><td>".$v."</td><td>".$result."</td>";
$url2="http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3A$domain+-wx34A&num=1";
$file=file_get_contents($url2);
if(preg_match("/ of about \<b\>([0-9,]*)/",$file,$text)){
echo "<td>".$text[1]."</td>";
} else {
echo "<td>-</td>"; // keep the table row aligned when no count is found
}
unset($file);
$url3="http://$v.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=link%3A$domain&num=1";
$file=file_get_contents($url3);
if(preg_match("/ of about \<b\>([0-9,]*)/",$file,$text)){
echo "<td>".$text[1]."</td>";
} else {
echo "<td>-</td>"; // keep the table row aligned when no count is found
}
unset($file);
echo "</tr>";
}
echo "</table>";
?>
I was meaning to add today: run it from a dialup connection (if you must run it), so you have a new IP every time; don't run it from your webhost. If you've got a Windows OS at home, then FoxServ is an easy way to get a working PHP setup :-)
Having been a sysadmin for 3 years, I can tell you it is harder than you think to track it.
I believe Google's policy is about ensuring fair use of their resources.
I don't see any problem if I check this once a day.
It only sends 3 requests per datacenter. This is nothing.
It works the same way as Googlebot: download some pages and analyze the information.
It doesn't send a USER-AGENT, but from my experience many requests to my site carry no USER-AGENT information either. Unless there is an identifiable pattern, such as 1,000 requests every hour, all I would do is ban that IP or send some rubbish back to it.
Of course, if you don't want to risk anything, just set up a home server.
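For what it's worth, if you did want the script to identify itself, file_get_contents() can send a USER-AGENT via a stream context (PHP 4.3+; the agent string below is a made-up example):

```php
<?php
// Build a stream context that adds a User-Agent header to HTTP requests.
// "MikeChecker/0.1" is just an illustrative name.
$context = stream_context_create(array(
    "http" => array(
        "header" => "User-Agent: MikeChecker/0.1 (link-count survey)\r\n",
    ),
));

// Pass the context as the third argument of file_get_contents:
//   $file = file_get_contents($url, false, $context);
var_dump(is_resource($context)); // the context is a stream-context resource
?>
```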
If you want to gauge the trustworthiness/reputation of the 5,000 companies, maybe you could also do a mindshare comparison.
E.g. how many pages for "Example Inc + good + trustworthy + reliable" vs how many pages for "Example Inc + bad + liars + unreliable".
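The two sides of that comparison as query URLs might look like this ("Example Inc" is a placeholder; the counts would then be parsed with the same "of about" regex as before):

```php
<?php
// Hypothetical company name.
$company = "Example Inc";

// Positive and negative "mindshare" queries.
$good = urlencode("\"$company\" good trustworthy reliable");
$bad  = urlencode("\"$company\" bad liars unreliable");

echo "http://www.google.com/search?q=$good\n";
echo "http://www.google.com/search?q=$bad\n";
// Fetch each, parse the "of about" count as before, and compare the two numbers.
?>
```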
Then, using your parser app, you can easily look for your domain and its current ranking.
This method is a little bit slower than a fully automated app, but it won't get your (or your ISP's) IP banned from Google.
Cheers
'save' the page temporarily to your hard drive (I won't go looking beyond the 1st 100).
Then, using your parser app, you can easily look for your domain and its current ranking.
I run PHP on my PC here. When I tested this script, you know what it did? It downloaded the page, 'save'd it temporarily to memory, then, using the parser app, it easily looked for the info I wanted... what's the difference?
@Net_Wizard
Good idea re: mindshare, except some of these companies may be very small and will rely on Google to do business. Thus I feel checking "link:companyurl.com" in Google will give a better estimate?
@GoogleGuy
Thanks for the advice.
@ Phillipp/Vincex3:
Regarding the 100-search saved page: I'm not sure I fully understand? I am after just the number of results returned, not their actual content. Plus, I am looking to cut down on throwing "widgetcompany#1" into Google 5,000 (max) times...
If I have gotten the wrong end of the stick, do shout out
Thanks again,
Mike