

Search Engine Spider and User Agent Identification Forum

    
Is this googlebot or not?
Seems this is in GBot's range but doesn't validate
webcentric




msg:4659560
 5:25 pm on Apr 2, 2014 (gmt 0)

While culling Googlebot records from my raw log table, I'm encountering some questionable results in the 66.249.70.0/24 range. Here are a couple of examples.

66.249.70.238 - Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

66.249.70.18 - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

There are other IPs involved as well (all in the above range).

Anyway, these don't validate using the standard method (failing on the forward lookup). Here are the results.

Reverse DNS for 66.249.70.238 = crawl-66-249-70-238.googlebot.com

Domain checks out.

Forward DNS throws a socket exception: No such host is known.

I'm asking because I thought this range belonged to G, so these requests should validate. As I said before, I'm seeing this with everything across 66.249.70.0/24 that claims to be Googlebot.
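
For readers unfamiliar with the "standard method" mentioned above, here is a minimal PHP sketch of the three steps: reverse lookup, domain check, forward lookup. The function name `is_verified_googlebot` and the injectable `$reverse`/`$forward` parameters are hypothetical and exist only so the logic can be tried without a network. Accepting `google.com` alongside `googlebot.com` is an assumption, not something stated in this thread.

```php
<?php
/**
 * Round-trip DNS check for an IPv4 address that claims to be Googlebot.
 * $reverse and $forward are hypothetical hooks; by default the PHP
 * built-ins gethostbyaddr() and gethostbynamel() are used.
 */
function is_verified_googlebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged
    // when no host name is found, and false on malformed input.
    $host = @$reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Step 2: the host must sit under googlebot.com (google.com is an
    // assumed extra; adjust to whatever domains you trust).
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: forward lookup. gethostbynamel() returns false on failure,
    // which is the case described in this post.
    $addresses = @$forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Calling `is_verified_googlebot('66.249.70.238')` performs live lookups. While the forward record is missing, step 3 fails and the function returns false even though steps 1 and 2 pass.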

 

incrediBILL




msg:4659590
 6:47 pm on Apr 2, 2014 (gmt 0)

Googlebot has been using a wider variation of user agents to get content cloaked to mobile devices.

As long as it says Googlebot somewhere and the round trip DNS validation works, it's the real deal.

However, in this case it sounds like someone at Google made an error in the DNS for crawl-66-249-70-238.googlebot.com and it might be worthwhile sending someone there an email to let them know.

dstiles




msg:4659600
 7:11 pm on Apr 2, 2014 (gmt 0)

I had to modify my acceptance algorithm for stupid googlebot for this reason. I got so many "bad UA" responses in my log it was ridiculous!

webcentric




msg:4659621
 7:34 pm on Apr 2, 2014 (gmt 0)

I had a feeling this was a DNS error on G's part. Makes me wonder how many unwarranted 404s and 403s they're eating these days because of this. Right now I'm just cleaning up log files, but I can imagine what a mess this would create in a blocking algo (if you actually care about having your stuff indexed).

Contact Google? Not even sure where to start on that one where this matter is concerned.

lucy24




msg:4659753
 10:22 pm on Apr 2, 2014 (gmt 0)

But isn't it worth it for the visceral satisfaction of telling a Major Internet Company that they goofed?

Just hope they never start calling themselves "GoogleBot", as that's an attested spoofer.

webcentric




msg:4659762
 10:59 pm on Apr 2, 2014 (gmt 0)

Telling them they goofed does seem more satisfying "viscerally" than telling them they goofed up my logs.

Seems like almost everything this afternoon is coming from that .238 address, which is filling up my Fake Googlebot log quite rapidly. Funny thing is, most of this is coming from IPs that end in 8, with hits from the following over the past few days:

66.249.70.18
66.249.70.28
66.249.70.138
66.249.70.148
66.249.70.158
66.249.70.168
66.249.70.228
66.249.70.238

With one oddball for good measure:
66.249.70.72
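
When sorting log entries like these, it can help to test range membership separately from DNS validation. Below is a minimal sketch, assuming 64-bit PHP and IPv4 only. The helper name `ipv4_in_cidr` is hypothetical. Being inside the range is only a triage hint for log cleanup, not proof that a hit is the real Googlebot.

```php
<?php
/**
 * Returns true when $ip falls inside the IPv4 block given in CIDR
 * notation (for example "66.249.70.0/24"). Assumes 64-bit PHP, where
 * ip2long() returns a non-negative integer.
 */
function ipv4_in_cidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false; // not in "network/prefix" form
    }

    $ipLong  = ip2long($ip);
    $netLong = ip2long($parts[0]);
    $bits    = (int) $parts[1];
    if ($ipLong === false || $netLong === false || $bits > 32) {
        return false;
    }

    // Build the netmask; a /0 prefix matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($netLong & $mask);
}

// Usage with addresses taken from the list above.
foreach (['66.249.70.18', '66.249.70.238', '66.249.70.72'] as $addr) {
    printf("%s %s\n", $addr, ipv4_in_cidr($addr, '66.249.70.0/24') ? 'in range' : 'outside');
}
```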

jojy




msg:4659765
 11:12 pm on Apr 2, 2014 (gmt 0)

I have written a script which checks reverse/forward DNS. When I do the forward DNS lookup, it returns the host name instead of the IP address. Here is my script:


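$ip = '66.249.70.122';
// reverse lookup; returns the IP unchanged if no host name is found
$host = gethostbyaddr($ip);

// check if host exists
if ($host != $ip) {
    // forward lookup; returns $host unchanged on failure
    $real_ip = @gethostbyname($host);
    if ($real_ip == $ip) {
        echo "It's Google";
    } else {
        echo 'Forward dns lookup failed';
    }
}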

JD_Toims




msg:4659853
 8:58 am on Apr 3, 2014 (gmt 0)

$real_ip = @gethostbyname($host.'.'); // -- ;) -- //
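
Some context on the one-liner above: the trailing dot marks the host name as fully qualified, so the resolver does not append any local search domains. Separately, `gethostbyname()` returns its argument unchanged when the lookup fails, which matches the "returns the host name instead of the IP" symptom. The sketch below combines both points. The name `forward_lookup` and the injectable `$resolver` parameter are hypothetical, added so the behaviour can be checked without a network.

```php
<?php
/**
 * Forward lookup that reports failure as null instead of echoing
 * the host name back. $resolver is a hypothetical hook for testing;
 * by default the PHP built-in gethostbyname() is used.
 */
function forward_lookup(string $host, ?callable $resolver = null): ?string
{
    $resolver = $resolver ?? 'gethostbyname';

    // Ensure exactly one trailing dot so the name is treated as absolute.
    $fqdn = rtrim($host, '.') . '.';

    $result = @$resolver($fqdn);

    // gethostbyname() hands back its argument on failure, so anything
    // that is not an IPv4 literal counts as "no such host".
    if (!is_string($result)
        || filter_var($result, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }

    return $result;
}
```

With this helper, a check like `forward_lookup($host) === $ip` fails cleanly when the record is missing, rather than comparing the IP against a host name.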

webcentric




msg:4659955
 2:49 pm on Apr 3, 2014 (gmt 0)

In thinking a bit more about the title of this thread (and the caption it's posted under on the home page of WebmasterWorld) some other ideas came to mind...

"Googlebot Now Takes Steps to Block Itself"
"Googlebot: the First Self-blocking Robot"
"Google Launches Google404.com -- Revolutionizes Internet Search"

Regarding the last one, just type in "404". 2.3 Billion results returned in .083 seconds. The real trick is going to be how to get a good ranking in this new engine. I'm stuffing my 404 page with keywords as we speak. Also, will be adding some structured data and a large, high quality image suitable for scraping.

incrediBILL




msg:4660183
 9:57 pm on Apr 3, 2014 (gmt 0)

Barry Schwartz ran with our story and got the attention of John Mueller from Google, who said "Oops, we'll get that set up before we continue using those IP ranges!"

See the great technical reply on Google+
https://plus.google.com/u/0/+BarrySchwartz/posts/8JZ5azQfCvk

webcentric




msg:4660410
 2:40 pm on Apr 4, 2014 (gmt 0)

Still getting hit as of 10 minutes ago. They're just making the coffee in Mountain View I'm thinking.

Dim coffeeGrounds as Integer = numberOfPeopleAtGoogle x 20000

etc.

webcentric




msg:4660481
 6:53 pm on Apr 4, 2014 (gmt 0)

I've only seen one hit since about 10am Eastern or so (a little after noon from a smart phone). Hard to say if that means it's really stopped because they've come in batches in the past. Forward lookup still failing on at least the one IP I actually checked today (66.249.70.148).

bumpski




msg:4660658
 11:02 am on Apr 5, 2014 (gmt 0)

Barry Schwartz ran with our story and got the attention of John Mueller from Google, who said "Oops, we'll get that set up before we continue using those IP ranges!"

Yes they'll fix it, right after they get their list of potential cloakers...

webcentric




msg:4661138
 2:36 pm on Apr 7, 2014 (gmt 0)

Well the hits have stopped coming for now from those IPs. Call it a fix if you like. Moving on... ;)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved