How to Verify Googlebot and Avoid Rogue Spiders

     
1:57 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 4, 2002
posts:1958
votes: 0


Matt [mattcutts.com] recently made a good recommendation over on his blog about how to "authenticate" Googlebot - that is, see if a given spider really is Googlebot, or if it is someone like me pretending to be Googlebot to find your cloaked pages.

The solution is simple and effective for Googlebot, and most likely for Yahoo's Slurp and MSNbot as well. It relies only on G, Y, or M having properly set up DNS entries for their crawling IPs. It's a two-step process: do a reverse DNS lookup, then a forward DNS lookup on the result.


> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

This is clean, simple, and brilliant. I have implemented the reverse IP lookup, but never followed up with the forward - which is key. By doing both, you avoid being fooled by someone who sets up a bogus reverse DNS entry, which is very simple to do.
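
To see why the forward half matters: a spoofer who controls reverse DNS for his own IP block can make the PTR record claim anything, but he cannot make Google's forward zone answer for it. A hypothetical example, with the spoofer's IP drawn from the documentation range:

> host 203.0.113.50
50.113.0.203.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

The forward lookup comes back 66.249.66.1, not 203.0.113.50, so the check fails.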

Finally, my quality content is one step closer to staying mine.

< The full technique is outlined here:
[googlewebmastercentral.blogspot.com...] >

[edited by: tedster at 7:25 pm (utc) on July 5, 2007]

2:55 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 4, 2002
posts:1958
votes: 0


It seems this will work for both Googlebot and Slurp. MSN doesn't have correct reverse lookups set up for their crawling IPs.

Here is some PHP code:


<?php

// Reverse lookup on the visitor's IP, then a forward lookup
// on the resulting hostname; both must agree.
$botip = $_SERVER['REMOTE_ADDR'];

$bothost = gethostbyaddr( $botip );          // reverse DNS
$verifiedbotip = gethostbyname( $bothost );  // forward DNS

if ( $botip == $verifiedbotip ) {
    if ( substr( $bothost, -14 ) == '.googlebot.com' ) {
        print '<b><font color=green>This really is Googlebot</font></b><br>';
    } elseif ( substr( $bothost, -18 ) == '.inktomisearch.com' ) {
        print '<b><font color=green>This really is Slurp</font></b><br>';
    } else {
        print '<b>This is not Slurp or Googlebot</b><br>';
    }
} else {
    print '<b>Host does not match reverse lookup</b><br>';
}

?>

It's probably best to do some string matching first so you only do the much slower reverse and forward lookups when you need to, but you get the idea.
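
For example, a minimal sketch of that string-matching pre-screen (the user-agent substrings are an assumption; tune them to whatever bots you care about):

<?php

// Only pay for the DNS round trip when the visitor actually
// claims to be a bot we care about.
$ua = $_SERVER['HTTP_USER_AGENT'];

if ( strpos( $ua, 'Googlebot' ) !== false ||
     strpos( $ua, 'Slurp' ) !== false ) {
    // ...now do the reverse + forward lookups from above...
}

?>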

4:24 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 13, 2005
posts:1077
votes: 0


Ya, this is an expensive lookup; I would not recommend it on a high-traffic site, and I would definitely do some prescreening before initiating the lookup. I do a reverse DNS lookup when someone registers, but for other reasons.

Chip-

4:57 am on Sept 22, 2006 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2005
posts:1693
votes: 4


or if it is someone like me pretending to be Googlebot to find your cloaked pages.

I'm not sure I understand. Isn't it a good thing that people check for cloaking spammers to report them? Why help cloakers?

5:05 am on Sept 22, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Go figure, I posted this in the 'Search Engine Spider Identification' forum and it's still pending moderation, oh well, ya lose some, lose some, and lose some more ;)

this is an expensive lookup; I would not recommend it on a high-traffic site

I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours. That means Google doesn't keep invoking DNS lookups except once per IP per 24-hour period, which is reasonable.
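
As a rough sketch of that caching idea (file-based, one entry per IP; the cache path and format here are made up for illustration):

<?php

// Cache verification results so each IP costs at most one DNS
// round trip per 24 hours. Path and format are arbitrary.
function is_verified_googlebot( $ip ) {
    $cachefile = '/tmp/botcache-' . md5( $ip );

    // Reuse a cached answer if it is less than 24 hours old.
    if ( file_exists( $cachefile ) &&
         time() - filemtime( $cachefile ) < 86400 ) {
        return trim( file_get_contents( $cachefile ) ) == '1';
    }

    $host = gethostbyaddr( $ip );            // reverse DNS
    $ok = ( substr( $host, -14 ) == '.googlebot.com' &&
            gethostbyname( $host ) == $ip ); // forward DNS

    file_put_contents( $cachefile, $ok ? '1' : '0' );
    return $ok;
}

?>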

As a matter of fact, I'm pretty sure WebmasterWorld does a DNS lookup as well and we know THIS is a very high traffic site.

Brett could chime in with details or you could hear him elaborate on the topic at PubCon.

[edited by: incrediBILL at 5:07 am (utc) on Sep. 22, 2006]

5:08 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 28, 2003
posts:1977
votes: 0


check for cloaking spammers

Uh, Google has other ways of checking for sites that are cloaking, like using different IPs and ISPs to crawl/check pages.

By the way, not all cloakers are spammers - just as not all spammers are cloakers. There are actually good reasons for cloaking. Go figure.

5:37 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 30, 2002
posts: 1186
votes: 0


How necessary is all this? How big of a problem is it? And is the concern that Google might give credit for the content to someone else? Or is it that you don't want to be the source for someone else mashing up and creating their own SE-bait content?

5:37 am on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 31, 2003
posts:1316
votes: 0


I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours.

Agreed. And if your Web server is already doing a reverse lookup for logging, you can trust what it says the host name is. That way, you then only have to do the forward lookup.
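
As a sketch of that shortcut, assuming Apache's HostnameLookups is on so PHP sees the already-resolved name in REMOTE_HOST:

<?php

// The server already did the reverse lookup for logging, so
// only the forward lookup remains to be done here.
$bothost = $_SERVER['REMOTE_HOST'];

if ( substr( $bothost, -14 ) == '.googlebot.com' &&
     gethostbyname( $bothost ) == $_SERVER['REMOTE_ADDR'] ) {
    // verified Googlebot
}

?>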

How necessary is all this?

I think it's only necessary for sites that are cloaking. And there are, IMO, some legitimate uses for cloaking.

7:05 am on Sept 22, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


How necessary is all this? How big of a problem is it? And is the concern that Google might give credit for the content to someone else? Or is it that you don't want to be the source for someone else mashing up and creating their own SE-bait content?

OK, let me 'splain it to you...

a) Scrapers spoof as Google to rip off naive people who depend on shoddy .htaccess files to block bad user agents

b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.

c) Scrapers even scrape via Google proxy services such as translate.google.com and the Web Accelerator, so just limiting Googlebot to a Google IP range is insufficient to stop abuse.

That's why Google did this: some of us made a big enough stink about it that they gave us the tools to keep Googlebot properly restricted and accurately block some of this nonsense.

Matt Cutts deserves more than a few kudos for following thru on this. THANK YOU MATT!

[edited by: incrediBILL at 7:13 am (utc) on Sep. 22, 2006]

1:34 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


I wish Google would start lawsuits against people who are running spiders bearing their name. It's a trademark issue.

1:42 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 16, 2002
posts:2133
votes: 1


I know jcoronella gave us a start, and I'd help on this topic if I only knew how. Perhaps someone can give us examples for the common server OSes and/or static and dynamic pages... in short, a bit more about implementation, please.

1:44 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 23, 2003
posts:915
votes: 0


...start lawsuits against people who are using spiders bearing their name. Its a trademark issue.

I don't think this problem is going to be solved by the trademark lawyers.

For example, IE uses a user-agent something like this:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Yet according to the USPTO [uspto.gov], the Mozilla Foundation has a trademark on 'Mozilla', for

"Computer programs for accessing and displaying files on both the internet and the intranet; network access server operating software for connecting computers to the internet and the intranet."

Does this mean MSFT should be sued for "impersonating" Mozilla? <boggle>

Where's Webwork when we need him? ;-)

4:34 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 27, 2003
posts:732
votes: 0


What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.

4:48 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 28, 2002
posts:1324
votes: 0


What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.

The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content

4:57 pm on Sept 22, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.

For starters, showing visitors geo-targeted information. Technically that is cloaking, since the search engines are geo-neutral: you show them generic things, yet show visitors custom-tailored information such as local content or ads.

Another scenario I think falls under the term cloaking is to custom tailor part of the page content based on the keywords in the query from the search engine. The information the SE indexed isn't exactly the same as what the visitor sees, but it's a perfectly valid page response to a visitor based on search specifics.

One more perfectly valid cloaking method is to give the search engines just the content and not all the HTML layout, as it does the SE no good to get all the superfluous noise. As long as the content is exactly the same, there's nothing technically wrong with this IMO.

How about all the little mobile devices? They can't display a full web page, so is that reformatting or cloaking? Depends on how you look at it.

I don't cloak, too much work.

[edited by: incrediBILL at 4:59 pm (utc) on Sep. 22, 2006]

6:35 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 31, 2003
posts:1316
votes: 0


The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content

That's exactly what I was thinking of. Some people might not consider that legitimate, but it gets them traffic without having to give away entire articles.

custom tailor part of the page content based on the keywords in the query from the search engine

I don't really consider that cloaking, because it's referrer-based, rather than agent-based. But I do it.

10:20 pm on Sept 22, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


I don't really consider that cloaking, because it's referrer-based, rather than agent-based. But I do it.

Not true. The page content that gets INDEXED was served agent-based, but what the visitor sees is referrer-based, and the definition of cloaking is showing the SE different content than you show the visitor. Therefore... cloaked.

But I don't see anything wrong with it when at least most of the content shown to the search engine is still on the page.

10:43 pm on Sept 22, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


There's all kinds of "cloaking" that is pretty innocent. For instance, some people check user agent or IP so as not to assign a session id to a spider. The visible content that a visitor sees remains the same, but the urls in the anchor tags are cleaner. Google's guidelines say:

Don't deceive your users or present different content to search engines
than you display to users, which is commonly referred to as "cloaking."

A lot hinges on how you understand that word "content".
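
As a minimal sketch of that session-id case (the bot list here is illustrative, not exhaustive):

<?php

// Skip sessions for known spiders so the URLs they crawl stay
// clean; ordinary visitors get a session as usual.
$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ( !preg_match( '/Googlebot|Slurp|msnbot/i', $ua ) ) {
    session_start();
}

?>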

12:34 am on Sept 23, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2005
posts:614
votes: 0


Can I just say "Gosh what a great topic!" This is why I love WebmasterWorld, you learn something every day :)

Cheers Guys :)

5:55 am on Sept 23, 2006 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2005
posts:1693
votes: 4


The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content

Um, I would consider this very illegitimate. You can't have your cake and eat it too: either your information is public, or it's not. If the landing page doesn't correspond to the Google excerpt, I would report it as spam; users should see exactly what Google is seeing. Now, if I search for something and find it on the first page, yet would have to register to see the second page, that would be fair. But Google's results must correspond to what is published.

6:40 am on Sept 23, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Aug 1, 2006
posts:112
votes: 0


If the landing page doesn't correspond to the Google excerpt

Generally speaking, in every case that I've seen, the landing page corresponds exactly to the excerpt. You do get to read at least that much before registration is required. So it seems that the excerpts are carefully crafted and spoon fed - cloaked. But this is straying off topic for a really good topic.

4:00 pm on Sept 23, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


Microsoft has been suing people who mimic their products on the web, though; Google should do the same.

Sometimes if you make an example of a few, the amount will generally decrease.

5:10 pm on Sept 23, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Not all of what looks like spoofing is actually spoofing.

Googlebot is tricked into crawling thru proxy sites with a cloaked list of links and the only way I know this for certain is that these hijacked page listings show up in the SERPs all the time.

Therefore, when Google was crawling your page via the proxy server, it would appear to the casual observer to be someone spoofing Google.

[edited by: incrediBILL at 5:10 pm (utc) on Sep. 23, 2006]

5:25 pm on Sept 23, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1666
votes: 35


That is why we first check the User Agent, then if the IP is in the bot's range, and if the first is true and the second is not, serve some other content.

Also try translating WW thru translate.google.com.
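
For what it's worth, a rough sketch of that two-part check in PHP (the 66.249.64.0/19 block is an assumption for illustration only; the forward/reverse DNS test discussed above is more robust than a hard-coded range):

<?php

// Sketch: user agent first, then IP range. The 66.249.64.0/19
// block is assumed for illustration.
$ip = $_SERVER['REMOTE_ADDR'];
$ua = $_SERVER['HTTP_USER_AGENT'];

$claims_googlebot = ( strpos( $ua, 'Googlebot' ) !== false );

// True when the top 19 bits of the IP match 66.249.64.0.
$in_range = ( ( ip2long( $ip ) & ~0x1FFF ) == ip2long( '66.249.64.0' ) );

if ( $claims_googlebot && !$in_range ) {
    // Claims to be Googlebot but is outside the range:
    // serve some other content here.
}

?>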

6:03 pm on Sept 23, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 16, 2002
posts:2010
votes: 0


Cache the reverse lookups - you might even be able to use the AWStats DNS cache.

6:35 pm on Sept 23, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


That is why we first check the User Agent, then if the IP is in the bot's range

The IP range alone is insufficient, as you can get spoofed via the Web Accelerator, translate.google.com (I've been scraped via both), or any other Google proxy service. The forward/reverse DNS check narrows it down to googlebot.com, so you know it's really their crawler and nothing else.

7:13 am on Sept 24, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


What is this other than a tool for cloakers to protect their content? Why would a Google employee give out this kind of information? I mean, not that it takes a rocket scientist to figure this out, but why would G help the cloakers? I thought cloaking was evil - and that evil was something you don't do.

7:37 am on Sept 24, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


What is this other than a tool for cloakers to protect their content?

Did you read the whole thread?

This is a tool for the rest of us to block Google spoofing scrapers and stop cloaked proxy hijackers from stealing our SERPs.

[edited by: incrediBILL at 7:38 am (utc) on Sep. 24, 2006]

1:51 pm on Sept 24, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 10, 2006
posts:47
votes: 0


What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.

For highly competitive SEO markets, and depending on whether I'm feeling lazy, I sometimes strip out almost every unnecessary tag and image; I just leave the basic tags needed and the content. If someone reports the sites to Google, I do not know how Google will react. I am cloaking, but the content is the same. So basically I am serving people and Google the same content. It just isn't in the same layout.

b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.

I'm slow - I don't get this. But is this the reason why I sometimes see sites with different URLs in the SERPs?

In an SEO Yahoo group I am a member of, one member was showing screenshots of a search result - in Yahoo - where the URL displayed was www.google.com and the page had nothing to do with Google. Is this the link cloaking by proxy?

1:53 am on Sept 25, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


> Did you read the whole thread?

Scanned it. But I read all of your posts, Bill. Might have missed something.

> This is a tool for the rest of us to block Google spoofing scrapers and stop cloaked proxy hijackers from stealing our SERPs.

First question: OK, you can detect and block scrapers who pretend to be Googlebot. What do you do with scrapers that don't? Or, in other words: why would you care whether a scraper pretended to be Googlebot unless you were cloaking?

Second question: Now what exactly is a cloaked proxy hijacker?
