Forum Moderators: Robert Charlton & goodroi
The solution is simple and effective for Googlebot, and most likely also for Yahoo's Slurp and MSNbot. It only relies on G, Y, or M having properly set up DNS entries for their crawling IPs. It's a two-step process: a reverse DNS lookup, then a forward DNS lookup.
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
This is clean, simple, and brilliant. I have implemented the reverse IP lookup, but never followed up with the forward lookup - which is key. By doing both you avoid being fooled by someone setting up a bogus reverse DNS entry, which is very easy to do.
Finally, my quality content is one step closer to staying mine.
< The full technique is outlined here:
[googlewebmastercentral.blogspot.com...] >
[edited by: tedster at 7:25 pm (utc) on July 5, 2007]
Here is some php code:
<?php
// Reverse lookup: IP -> hostname, then forward lookup: hostname -> IP
$botip = $_SERVER['REMOTE_ADDR'];
$bothost = gethostbyaddr( $botip );
$verifiedbotip = gethostbyname( $bothost );
if ( $botip == $verifiedbotip ) {   // == (comparison), not = (assignment)
    if ( substr( $bothost, -14 ) == '.googlebot.com' ) {
        print '<b><font color=green>This really is Googlebot</font></b><br>';
    } elseif ( substr( $bothost, -18 ) == '.inktomisearch.com' ) {
        print '<b><font color=green>This really is Slurp</font></b><br>';
    } else {
        print '<b>This is not Slurp or Googlebot</b><br>';
    }
} else {
    print '<b>Host does not match reverse lookup</b><br>';
}
?>
It's probably best to do some string matching first so you only do the much slower reverse and forward lookups when you need to, but you get the idea.
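To sketch that idea (the helper names here are mine, not from the thread): check the claimed user agent with a cheap string match first, and only fall through to the DNS lookups when the request actually claims to be a crawler. A strict suffix match also avoids being fooled by hostnames that merely contain "googlebot.com" somewhere in the middle:

```php
<?php
// Cheap prefilter: only requests whose user agent claims to be a bot are
// worth the expensive reverse/forward DNS round trip.
function claims_to_be_bot( $useragent ) {
    return stripos( $useragent, 'Googlebot' ) !== false
        || stripos( $useragent, 'Slurp' ) !== false;
}

// True when $host ends with $suffix (e.g. '.googlebot.com'), so
// 'crawl-66-249-66-1.googlebot.com' passes but 'googlebot.com.evil.net' fails.
function host_has_suffix( $host, $suffix ) {
    return substr( $host, -strlen( $suffix ) ) === $suffix;
}

if ( isset( $_SERVER['HTTP_USER_AGENT'] )
     && claims_to_be_bot( $_SERVER['HTTP_USER_AGENT'] ) ) {
    $bothost = gethostbyaddr( $_SERVER['REMOTE_ADDR'] );           // reverse lookup
    if ( host_has_suffix( $bothost, '.googlebot.com' )
         && gethostbyname( $bothost ) === $_SERVER['REMOTE_ADDR'] ) { // forward lookup
        // verified Googlebot - safe to treat as the real crawler
    }
}
```

Ordinary browser traffic never triggers a DNS lookup at all with this layout; only the tiny fraction of requests claiming to be a bot pays the cost.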
This is an expensive lookup; I would not recommend it on a high-traffic site.
I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours. That means Google doesn't keep invoking DNS lookups except one time per IP per 24 hour period which is reasonable.
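A minimal sketch of that caching idea, assuming a writable cache directory; the function name and file layout are illustrative, not WebmasterWorld's actual implementation. One verdict is stored per IP, and the expensive verification only re-runs after the 24-hour TTL expires:

```php
<?php
// Cache a per-IP bot verdict on disk for $ttl seconds (default 24 hours).
// $verify is a callable wrapping the reverse/forward DNS check; it only
// runs on a cache miss or after the TTL has expired.
function cached_bot_check( $ip, $verify, $cachedir = '/tmp/botcache', $ttl = 86400 ) {
    if ( !is_dir( $cachedir ) ) {
        mkdir( $cachedir, 0700, true );
    }
    $file = $cachedir . '/' . md5( $ip );
    // Reuse a cached verdict if it is younger than $ttl.
    if ( is_file( $file ) && ( time() - filemtime( $file ) ) < $ttl ) {
        return file_get_contents( $file ) === '1';
    }
    // Cache miss: run the expensive DNS verification once and store the result.
    $isbot = $verify( $ip );
    file_put_contents( $file, $isbot ? '1' : '0' );
    return $isbot;
}
```

With this in place, Googlebot hitting thousands of pages from one IP costs a single pair of DNS lookups per day, which is why the overhead is no big deal even on a busy site.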
As a matter of fact, I'm pretty sure WebmasterWorld does a DNS lookup as well and we know THIS is a very high traffic site.
Brett could chime in with details or you could hear him elaborate on the topic at PubCon.
[edited by: incrediBILL at 5:07 am (utc) on Sep. 22, 2006]
I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours.
How necessary is all this? How big of a problem is it? And is the concern that Google might credit your content to someone else? Or is it that you don't want to be the source for someone else mashing up your content and creating their own SE-bait content?
OK, let me 'splain it to you...
a) Scrapers spoof as Google to rip off naive people who depend on shoddy .htaccess files blocking bad user agents
b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.
c) Scrapers even scrape via Google proxy services such as translate.google.com and the Web Accelerator so just limiting Googlebot to a Google IP range is insufficient to stop abuse.
That's why Google did this: some of us made a big enough stink about it that they gave us the tools to keep Googlebot properly restricted and accurately block some of this nonsense.
Matt Cutts deserves more than a few kudos for following thru on this - THANK YOU MATT!
[edited by: incrediBILL at 7:13 am (utc) on Sep. 22, 2006]
...start lawsuits against people who are using spiders bearing their name. It's a trademark issue.
I don't think this problem is going to be solved by the trademark lawyers.
For example, IE uses a user-agent something like this:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Yet according to the USPTO [uspto.gov], the Mozilla Foundation has a trademark on 'Mozilla', for
"Computer programs for accessing and displaying files on both the internet and the intranet; network access server operating software for connecting computers to the internet and the intranet."
Does this mean MSFT should be sued for "impersonating" Mozilla? <boggle>
Where's Webwork when we need him? ;-)
What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.
For starters, showing visitors geo-targeted information as technically that is cloaking since the search engines are geo-neutral, you show them generic things yet show visitors custom tailored information such as local content or ads.
Another scenario I think falls under the term of cloaking is custom-tailoring part of the page content based on the keywords in the query from the search engine. The page the SE indexed isn't exactly the same as what the visitor sees, but it's a perfectly valid page response to a visitor based on the specifics of their search.
One more perfectly valid cloaking method is to give the search engines just the content and not all the HTML layout, as it does the SE no good to get all the superfluous noise. As long as the content is exactly the same, there's nothing technically wrong with this IMO.
How about all the little mobile devices? They can't display a full web page, so is that reformatting or cloaking? Depends on how you look at it.
I don't cloak, too much work.
[edited by: incrediBILL at 4:59 pm (utc) on Sep. 22, 2006]
The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content
custom tailor part of the page content based on the keywords in the query from the search engine
I don't really consider that cloaking, because it's referrer-based, rather than agent-based. But I do it.
Not true - the page content that was INDEXED was agent-based, but what the visitor sees is referrer-based, and the definition of cloaking is showing the SE different content than you show the visitor. Therefore... cloaked.
But I don't see anything wrong with it when at least most of the content shown to the search engine is still on the page.
Don't deceive your users or present different content to search engines
than you display to users, which is commonly referred to as "cloaking."
A lot hinges on how you understand that word "content".
The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content
Um, I would consider this very illegitimate. You can't have your cake and eat it too - either your information is public, or it's not. If the landing page doesn't correspond to the Google excerpt, I would report it as spam; users should see exactly what Google is seeing. Now, if I search for something and find it on the first page, yet I would have to register to see the second page, that would be fair. But Google's results must correspond to what is published.
Generally speaking, in every case that I've seen, the landing page corresponds exactly to the excerpt. You do get to read at least that much before registration is required. So it seems that the excerpts are carefully crafted and spoon-fed - cloaked. But this is straying off what is a really good topic.
Googlebot is tricked into crawling thru proxy sites with a cloaked list of links and the only way I know this for certain is that these hijacked page listings show up in the SERPs all the time.
Therefore, when Google was crawling your page via the proxy server, it would appear to the casual observer to be someone spoofing Google.
[edited by: incrediBILL at 5:10 pm (utc) on Sep. 23, 2006]
That is why we first check the user agent, then whether the IP is in the bot's range.
The IP range alone is insufficient as you can get spoofed via the web accelerator and possibly translate.google.com, as I've been scraped via both, or any other Google proxy service. The forward/reverse DNS narrows it down to googlebot.com so you know it's really their crawler and nothing else.
What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.
For highly competitive SEO markets, and if I am not feeling lazy, I sometimes strip out almost every unnecessary tag, image, and so on; I just leave the basic tags needed and the content. If someone reports the site to Google, I do not know how Google will react. I am cloaking, but the content is the same. So basically I am serving people and Google the same content - they just are not in the same layout.
b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.
I'm slow, I don't get this? But is this the reason why I see some sites with different URLs sometimes in the SERPs?
In an SEO Yahoogroup I am a member of, one member was showing screenshots of a search result, but in Yahoo, and the URL displayed was www.google.com and the page had nothing to do with Google. Is this the link cloaking by proxy?
Scanned it. But I read all of your posts, Bill. Might have missed something.
> This is a tool for the rest of us to block Google spoofing scrapers and stop cloaked proxy hijackers from stealing our SERPs.
First question: OK, you can detect and block scrapers who pretend to be Googlebot. What do you do with scrapers that don't? In other words: why would you care whether a scraper pretended to be Googlebot unless you were cloaking?
Second question: Now what exactly is a cloaked proxy hijacker?