
Definitive way to validate GoogleBot authenticity

official word from Google

     
10:45 pm on Sep 21, 2006 (gmt 0)

incredibill, System Operator from US


Today Matt Cutts announced that the definitive way to validate Googlebot's authenticity is to use a combination of reverse and forward DNS lookups.

Many of us here have been commenting on the fact that reverse DNS for Googlebot has been resolving to googlebot.com. Now their DNS upgrade project appears to be complete, and this is the official way to verify that Google is actually crawling your site.

From their blog post:

...do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

[googlewebmastercentral.blogspot.com...]

The trick is to do this lookup only once per IP, on a daily or weekly basis, and cache the results so you don't introduce too much overhead with the validation process.
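For anyone who wants to wire this up, here's a minimal sketch in Python of the two-step check with a simple once-a-day cache. The function name and cache layout are my own illustrative assumptions, not anything Google specifies:

import socket
import time

# ip -> (verdict, timestamp); a real deployment would persist this
# across worker processes instead of keeping it in memory
_cache = {}
CACHE_TTL = 24 * 60 * 60  # re-check each IP once per day

def is_real_googlebot(ip):
    """Reverse DNS must land in googlebot.com, and the forward lookup
    on that name must return the same IP."""
    now = time.time()
    if ip in _cache and now - _cache[ip][1] < CACHE_TTL:
        return _cache[ip][0]
    verdict = False
    try:
        host = socket.gethostbyaddr(ip)[0]  # step 1: reverse lookup
        if host.endswith('.googlebot.com'):
            # step 2: forward lookup must round-trip to the same address
            verdict = ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        pass  # no PTR record or bogus name -> treat as spoofed
    _cache[ip] = (verdict, now)
    return verdict

print(is_real_googlebot('66.249.66.1'))  # True, if DNS still resolves as above

Note that the suffix check on '.googlebot.com' also guards against names like evilgooglebot.com sneaking past a naive substring match.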

3:08 am on Sep 23, 2006 (gmt 0)

Senior Member


I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

Exactly once, I've seen a spoofer with a setup where their IP address reverse-resolved to *.googlebot.com. I caught them because I did a whois lookup on the IP address.

Note that doing a forward lookup won't always point to the exact IP address (at least not in the past), but it will at least point you to within a close range.
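If you wanted to tolerate that older behavior, one option would be to accept a forward result that lands in the same /24 as the connecting address. This is purely a sketch of that fallback idea, not something Google recommends; with the DNS project complete, the exact-match check above should normally suffice:

import socket
from ipaddress import ip_address, ip_network

def forward_matches_close_range(ip, host, prefix=24):
    """Accept the forward lookup if any returned address shares
    the same /24 (by default) as the connecting IP."""
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False  # the name doesn't resolve at all
    net = ip_network(f'{ip}/{prefix}', strict=False)
    return any(ip_address(f) in net for f in forward_ips)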

4:50 am on Sep 23, 2006 (gmt 0)

incredibill, System Operator from US


Note that doing a forward lookup won't always point to the exact IP address (at least not in the past), but it will at least point you to within a close range.

It should point to the right place now, as that was the whole purpose of Matt pushing this DNS project through to completion. Previously it was a work in progress, so there may have been errors, but now it's complete, which is why it was announced.

5:12 am on Sep 23, 2006 (gmt 0)

wilderness, Senior Member


Perhaps you gentlemen could add some simplicity to this gibberish?

It seems to me that you're attempting to make a little-understood and little-accepted method of identification even more complex than it already is.

This forum (until a few months ago) always used IP ranges to identify providers: take the numbers, go to the reputable registries (ARIN, RIPE, APNIC, etc.), and ID the name.

Then participants began presenting names like "crawl-a-b-c-d.googlebot.com" rather than numbers, requiring participants to use non-reputable websites with DNS tools.
That seems quite an absurd requirement to me.
I rarely use DNS for anything.

1) Are you suggesting that a spider may change the IP range presented in logs and still retrieve the data through a false, redirected DNS registration?
OR
Are you just suggesting that the IP in visitor logs would be a non-Google IP range while the UA presents itself as Google?
(If the latter, it's really a non-issue for anybody except a beginner.)

2) Frankly, I don't see Google as being much of a problem (with the exception of their image bot, which has been randomly crawling some sites in direct violation of robots.txt).

7:44 am on Sep 23, 2006 (gmt 0)

incredibill, System Operator from US


Frankly, I don't see Google as being much of a problem

Just because you're not aware of the problem doesn't mean it doesn't exist.

I had been studying these problems and raising the issue through the proper channels for months before speaking about it openly in August, in front of hundreds of people, with three search engines present and paying attention. If it wasn't a problem, with an audience chiming in that it WAS a problem, I don't think even this level of change would have happened.

It's a start and I can live with this so far.

I'll quote myself from a few threads:

a) Scrapers spoof Google to rip off naive people who depend on shoddy .htaccess files to block bad user agents.

b) Google is fed lists of cloaked links by proxy sites, then crawls through those proxy sites, and your listings get hijacked via the proxy. In some cases Google gives the proxy site ownership of your page.

c) Scrapers even scrape via Google proxy services such as translate.google.com and the Web Accelerator, so just limiting Googlebot to a Google IP range is insufficient to stop abuse.

That's why Google did this: some of us made a big enough stink about it that they gave us the tools to properly restrict Googlebot and accurately block some of this nonsense.

[webmasterworld.com...]

Regarding people spoofing Google, it's often actually GOOGLE. I posted this on Matt's blog:


...sadly, many times it's ACTUALLY GOOGLEBOT crawling through a proxy server, or worse.

Here's an EXTREME example of this that happened to me recently:

209.73.170.36 - - [05/Sep/2006:14:11:59 -0500] "HEAD / HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.73.170.36 - - [05/Sep/2006:14:11:59 -0500] "GET / HTTP/1.1" 200 1257 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) (via babelfish.yahoo.com)"

The reverse DNS is w21.search.scd.yahoo.com, so it's really Babelfish, but is it Google?

So I looked at the extra detail I track, and it claimed to be Googlebot via a Yahoo proxy ("Proxy FORWARD=66.249.65.18"), which was actually crawl-66-249-65-18.googlebot.com going through Babelfish.

Googlebot doesn't ask for a HEAD, but Babelfish asks for the HEAD first [it checks for 404s or redirects, which Babelfish won't follow] before performing a GET for the page.

So the whole thing checked out front to back: it was in fact Google translating my web page via Babelfish, and I would sure as heck like to know where that link came from!

So, now you know: many times it's really Googlebot doing something bizarre.
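If your server records a forwarded-for header, the same DNS test can be applied to the forwarded address to tell "Googlebot through a proxy" apart from a plain spoofer. A rough sketch, reusing the is_real_googlebot() helper from the earlier post; the header parsing follows the common first-address convention, which not every proxy honors:

def classify_request(remote_ip, user_agent, forwarded_for=None):
    """Rough triage of a request claiming to be Googlebot.
    forwarded_for is the raw X-Forwarded-For value, if any."""
    if 'Googlebot' not in user_agent:
        return 'not claiming to be Googlebot'
    if is_real_googlebot(remote_ip):
        return 'genuine Googlebot, crawling directly'
    if forwarded_for:
        client_ip = forwarded_for.split(',')[0].strip()  # original client first
        if is_real_googlebot(client_ip):
            return 'genuine Googlebot, but coming through a proxy'
    return 'spoofed Googlebot'

In the Babelfish log above, 209.73.170.36 fails the check while the forwarded 66.249.65.18 passes, which flags the request as the real crawler coming through someone else's proxy.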

Or, last but not least, here's a sample of how the PHP and CGI proxy servers feed Google lists of links through their proxy sites with rot13 or tiny URLs, and you get hijacked like this:

The following is paraphrased to avoid specifics ;)

MySite's Crawl Patrol System Caught Your Dumb Proxy
MySite doesn't like proxy servers and why is Google crawling thru your server to my server?
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) ...
www.someproxysite.com/cgi-bin/garbage.cgi/somejunk/gibberish.url - 3k -
- Cached - Similar pages

This has all been going on for quite some time. Now we at least have this step from Google, with many thanks to Matt Cutts for pushing it, so we can easily validate that Googlebot is crawling from Google domains and not coming through proxies and whatnot, causing all sorts of trouble.

I think the appropriate phrase is "THANK YOU!" to Google for giving us a definitive tool to help thwart some of the bad people doing bad things.

Besides, Google isn't the only SE to fall for some of these games, just the largest and most popular. ONE down, more to go.

7:50 am on Sep 23, 2006 (gmt 0)

incredibill, System Operator from US


Oh yeah, forgot to address this...

are you suggesting that a spider may change the IP range

No, we aren't, but both Yahoo and Google suggested they can change IP addresses at a moment's notice when a) testing new code or b) deploying new software. That was said in front of hundreds of people, and this announcement was Google's response, giving us a way to know it's the real deal.

'Nuff said.

1:19 pm on Sep 23, 2006 (gmt 0)

wilderness, Senior Member


Bill,
Have you ever tried to use one of the online translators?

I'm here to tell you that it's a big waste of time. Rarely do you end up with much more of an understanding than when you began.

There are portions of Europe that are very interested in the materials on my sites.
Their interest presents a two-fold problem for me.
1) The bandwidth is a one-way stream with no benefit to me.
2) When I did allow the Euro traffic, I cannot tell you how many bad days began when I awoke to some new and unidentified piece of software that had crawled my sites while I was sleeping.

Just yesterday is an example.
Somebody on a Finnish blog linked to one of my pages. When the visitors got 403s, another link was provided. 403 again.
I don't read or speak Finnish.
The person providing the links was somebody in the US I communicate with.
The first thing I did was email her.
Then I created a folder with Rewrite off, copied the pages there, and provided those links.

I also explained that neither the images nor the links to other pages on my sites would work (because I hadn't edited the links to reflect the new folder).

Within two hours I had two crawl attempts, one of them cycling through four UAs in eleven minutes.
Then there were people playing around with folder structures and attempting to get an index file from a newly created folder which did not have one.

Was it beneficial to me to make the extra effort for a few people from Finland?
Hardly.

Should I spend any more time attempting these types of solutions for Euro visitors in the future?
Hardly, given that Europeans are far more interested in what I have to offer (without fees) than in what they are capable of reciprocating with (data or fees).
[Same goes for the Orient and Latin America and/or South America.]

I've had the AltaVista Translator denied for some time and will add the others as well.
That's my choice and it's NOT a move for every webmaster.

If and when I find a solution to the European translation problem, and a possible solution for dealing with those copying my data, perhaps then I'll provide access to Euro visitors; however, I don't expect that to happen in this lifetime.

Spiders crawling my pages (at least in any mass or volume) have been literally at a standstill for some time.
Very occasionally one will appear; however, I seem to have learned (by snooping) how to limit their access before they decide on a major crawl.

Proxies and co-locators are the same type of low-lifes, IMO. It's simply a misrepresentation of who you are.
We're not supposed to present our web pages to SEs in such a manner (cloaking), however it's permissible for them to do so.
What a crock!

4:53 pm on Sep 23, 2006 (gmt 0)

incredibill, System Operator from US



We're not supposed to present our web pages to SEs in such a manner (cloaking), however it's permissible for them to do so.
What a crock!

No, they aren't supposed to cloak either, but apparently whatever they're doing initially fools the SEs until someone rats them out. Simply detecting this condition and returning a 403 Forbidden stops the proxy sites from succeeding in their efforts.
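For completeness, here's what that 403 rule might look like as WSGI middleware, again reusing the is_real_googlebot() sketch from earlier in the thread; the same idea can be implemented in whatever server hook you have available:

def googlebot_gate(app):
    """Deny requests that claim a Googlebot UA but fail DNS validation."""
    def guarded(environ, start_response):
        ua = environ.get('HTTP_USER_AGENT', '')
        ip = environ.get('REMOTE_ADDR', '')
        if 'Googlebot' in ua and not is_real_googlebot(ip):
            start_response('403 Forbidden', [('Content-Type', 'text/plain')])
            return [b'403 Forbidden']
        return app(environ, start_response)
    return guarded

Legitimate Googlebot traffic passes straight through, while spoofers and proxy crawls get the 403 that makes their hijack attempts come up empty.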