
Search Engine Spider and User Agent Identification Forum

    
Google IPs Cannot Be Trusted
Only Googlebot With Round-Trip DNS Validation
incrediBILL
msg:4631780
10:58 am on Dec 18, 2013 (gmt 0)

I know many of you blindly accept Google's IP ranges as trustworthy, and that is hardly the case. If you're allowing anything from a Google IP range global access, then all you're doing is giving a certain subset of scrapers carte blanche access because of their IP.

The Google IP range hosts all sorts of tools that can be used for nefarious purposes, including:

  • Google Wireless Transcoder
  • Google Translator
  • Google App Engine

Luckily, Google App Engine forces all requests to carry an "AppEngine-Google" marker in the User-Agent that can be easily filtered.
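
For illustration, a minimal sketch of that filter in Python; it assumes the "AppEngine-Google" marker shows up in the User-Agent header, and the helper name is mine, not anything official:

# Minimal sketch: treat a request as a Google App Engine fetch when the
# "AppEngine-Google" marker appears anywhere in its User-Agent header.
# Hypothetical helper; wire it into whatever actually serves your pages.
def is_appengine_fetch(user_agent: str) -> bool:
    return "AppEngine-Google" in (user_agent or "")

# Usage (pseudo): if is_appengine_fetch(ua): deny the request / serve a 403.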

Plus, I've seen the old proxy hijacking, which I thought, much like polio, had been eliminated, rear its ugly head once again. The only way to stop this is to verify that Googlebot is only crawling from its valid IP addresses.

Full round-trip Googlebot validation is a must-have front-line defense; use it!
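
For anyone who wants a concrete starting point, here is a minimal sketch of the round-trip check in Python. It follows the approach Google has documented (reverse lookup, check the hostname, forward lookup, compare the IP); the function name and the lack of caching are my own simplifications, so treat it as a sketch rather than a drop-in module:

import socket

def is_real_googlebot(ip: str) -> bool:
    # Round-trip (forward-confirmed reverse) DNS validation:
    # 1. reverse-resolve the IP to a hostname,
    # 2. require a googlebot.com / google.com hostname,
    # 3. forward-resolve that hostname and confirm the original IP
    #    appears in the answers.
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward lookup
    except socket.gaierror:
        return False
    return ip in addrs                               # round trip closes

In production you would cache the verdict per IP so you aren't doing two DNS lookups on every request.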

[edited by: incrediBILL at 6:31 pm (utc) on Dec 18, 2013]

 

Angonasec
msg:4631781
11:12 am on Dec 18, 2013 (gmt 0)

We've blocked many Google "features" for years:
translate,
prefetch,
preview,
feed bots, etc.

It appears to be only a matter of time before we block all G bots as they sink in value and trustworthiness.

Appreciate you waking us up Bill.

incrediBILL
msg:4632068
8:51 am on Dec 19, 2013 (gmt 0)

I've actually used Google Translate to infiltrate sites' so-called protection that was designed to keep prying eyes from seeing their cloaking activities. People think they're being clever and secure, but nothing short of a full round-trip DNS check can be perfect.

See the original blog post where Matt Cutts explained how to do it:
[googlewebmastercentral.blogspot.com...]

Also, witness these recent threads about Google IPs:

Google? Is that you?
[webmasterworld.com...]

Google Test-Bot: Google-Test2
[webmasterworld.com...]

Google Translate
[webmasterworld.com...]

Bing, Ask, Yandex and all the rest you might allow need to be validated as tightly as possible too.

keyplyr
msg:4632074
9:50 am on Dec 19, 2013 (gmt 0)

Bill I'm blocking GoogleImageProxy, are you?

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (via ggpht.com GoogleImageProxy)

I've been seeing this UA scrape images from robots.txt-disallowed directories. The same images are also served with "X-Robots-Tag: noindex" in the response header, yet they are showing up in Google's Image Search. So far my demands to have them removed have been ignored.
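
For reference, a minimal sketch of spotting that UA by the "(via ggpht.com GoogleImageProxy)" tail it carries; the regex and function name are my own, and whether you block or just log the hits is your call:

import re

# Flag fetches made through Google's image proxy by the marker it leaves
# in the User-Agent string. Hypothetical helper, not an official API.
GOOGLE_IMAGE_PROXY = re.compile(r"ggpht\.com GoogleImageProxy", re.IGNORECASE)

def is_google_image_proxy(user_agent: str) -> bool:
    return bool(GOOGLE_IMAGE_PROXY.search(user_agent or ""))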

Angonasec
msg:4632096
10:57 am on Dec 19, 2013 (gmt 0)

Interesting find, GoogleImageProxy, from those awfully helpful people at Mountain View:

"If you can not directly access image links or the loading is slowly, this script will rewrite the image links to googleusercontent.com proxy address. With Google server you can display image normally and load faster."

KeyP: Better late than never: What are your belt and braces blocks for this nasty?

keyplyr
msg:4632245
7:11 pm on Dec 19, 2013 (gmt 0)

What are your belt and braces blocks for this nasty?


I'm not blocking any G ranges, only whitelisting header & UA.

lucy24
msg:4632256
8:41 pm on Dec 19, 2013 (gmt 0)

If you can not directly access image links or the loading is slow, this script will rewrite the image links to googleusercontent.com proxy address. With Google server you can display image normally and load faster.

I'm trying to read that as something other than "Look! A useful new alternative to hotlinking!"

keyplyr
msg:4632259
9:02 pm on Dec 19, 2013 (gmt 0)

Here's another one I'm watching:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/537.36 (KHTML, like Gecko, Google Publisher Plugin; Googlebot/2.1) Chrome/27.0.1453 Safari/537.36

incrediBILL
msg:4632366
5:51 am on Dec 20, 2013 (gmt 0)

Bill I'm blocking GoogleImageProxy, are you?


Since I whitelist, the answer to any question of that form is always: YES!

Angonasec
msg:4633009
12:11 pm on Dec 23, 2013 (gmt 0)

KeyP: "...only whitelisting header & UA."

Presumably they have to match the desirable G traffic?

iBill: Difficult to share your method in a public forum without wrecking it, but at least give us a lead-in.

It is Christmas :)

wilderness
msg:4633040
2:53 pm on Dec 23, 2013 (gmt 0)

whitelist Bill Mar 2006 [webmasterworld.com]

Whitelist Jim Nov 2006 [webmasterworld.com]

iomfan
msg:4641510
10:55 am on Jan 31, 2014 (gmt 0)

Only allow access from Google-IP addresses where the accessing host also presents a UA that identifies it as Googlebot or Googlebot Mobile. :)
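
Put in code, that rule is just the UA test combined with the round-trip check sketched earlier in the thread; a minimal sketch, with all names my own and the is_real_googlebot() helper from the earlier sketch assumed to be in scope:

def allow_as_googlebot(ip: str, user_agent: str) -> bool:
    # "Googlebot" and "Googlebot-Mobile" UA strings both contain the
    # substring "Googlebot", so one test covers both here; the IP must
    # also survive the round-trip DNS validation (is_real_googlebot,
    # sketched in the first reply above).
    claims_googlebot = "Googlebot" in (user_agent or "")
    return claims_googlebot and is_real_googlebot(ip)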
