Forum Moderators: Robert Charlton & goodroi
The solution is simple and effective for Googlebot, and most likely also for Yahoo's Slurp and MSNbot. It only relies on G, Y, or M having properly set up DNS entries for their crawling IPs. It's a two-step process: a reverse DNS lookup, then a forward DNS lookup.
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
This is clean, simple, and brilliant. I have implemented the reverse IP lookup, but never followed up with the forward lookup - which is key. By doing both you avoid being fooled by someone setting up a bogus reverse DNS entry, which is very easy to do.
Finally, my quality content is one step closer to staying mine.
< The full technique is outlined here:
[googlewebmastercentral.blogspot.com...] >
[edited by: tedster at 7:25 pm (utc) on July 5, 2007]
Here is some php code:
<?php
// Reverse lookup: IP -> hostname, then forward lookup: hostname -> IP
$botip = $_SERVER['REMOTE_ADDR'];
$bothost = gethostbyaddr( $botip );
$verifiedbotip = gethostbyname( $bothost );
if ( $botip == $verifiedbotip ) {   // == (comparison), not = (assignment)
    if ( substr( $bothost, -14 ) == '.googlebot.com' ) {
        print '<b><font color=green>This really is Googlebot</font></b><br>';
    } elseif ( substr( $bothost, -18 ) == '.inktomisearch.com' ) {
        print '<b><font color=green>This really is Slurp</font></b><br>';
    } else {
        print '<b>This is not Slurp or Googlebot</b><br>';
    }
} else {
    print '<b>Host does not match reverse lookup</b><br>';
}
?>
It's probably best to do some string matching first so you only do the much slower reverse and forward lookups when you need to, but you get the idea.
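To sketch that idea (the helper names here are mine, not from the thread): check the claimed user agent with a cheap string match first, and only fall through to the DNS lookups when the request actually claims to be a crawler. A strict suffix match also avoids being fooled by hostnames that merely contain "googlebot.com" somewhere in the middle:

```php
<?php
// Cheap prefilter: only requests whose user agent claims to be a bot are
// worth the expensive reverse/forward DNS round trip.
function claims_to_be_bot( $useragent ) {
    return stripos( $useragent, 'Googlebot' ) !== false
        || stripos( $useragent, 'Slurp' ) !== false;
}

// True when $host ends with $suffix (e.g. '.googlebot.com'), so
// 'crawl-66-249-66-1.googlebot.com' passes but 'googlebot.com.evil.net' fails.
function host_has_suffix( $host, $suffix ) {
    return substr( $host, -strlen( $suffix ) ) === $suffix;
}

if ( isset( $_SERVER['HTTP_USER_AGENT'] )
     && claims_to_be_bot( $_SERVER['HTTP_USER_AGENT'] ) ) {
    $bothost = gethostbyaddr( $_SERVER['REMOTE_ADDR'] );           // reverse lookup
    if ( host_has_suffix( $bothost, '.googlebot.com' )
         && gethostbyname( $bothost ) === $_SERVER['REMOTE_ADDR'] ) { // forward lookup
        // verified Googlebot - safe to treat as the real crawler
    }
}
```

Ordinary browser traffic never triggers a DNS lookup at all with this layout; only the tiny fraction of requests claiming to be a bot pays the cost.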
This is an expensive lookup; I would not recommend it on a high-traffic site.
I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours. That means Google doesn't keep invoking DNS lookups except one time per IP per 24 hour period which is reasonable.
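A minimal sketch of that caching idea, assuming a writable cache directory; the function name and file layout are illustrative, not WebmasterWorld's actual implementation. One verdict is stored per IP, and the expensive verification only re-runs after the 24-hour TTL expires:

```php
<?php
// Cache a per-IP bot verdict on disk for $ttl seconds (default 24 hours).
// $verify is a callable wrapping the reverse/forward DNS check; it only
// runs on a cache miss or after the TTL has expired.
function cached_bot_check( $ip, $verify, $cachedir = '/tmp/botcache', $ttl = 86400 ) {
    if ( !is_dir( $cachedir ) ) {
        mkdir( $cachedir, 0700, true );
    }
    $file = $cachedir . '/' . md5( $ip );
    // Reuse a cached verdict if it is younger than $ttl.
    if ( is_file( $file ) && ( time() - filemtime( $file ) ) < $ttl ) {
        return file_get_contents( $file ) === '1';
    }
    // Cache miss: run the expensive DNS verification once and store the result.
    $isbot = $verify( $ip );
    file_put_contents( $file, $isbot ? '1' : '0' );
    return $isbot;
}
```

With this in place, Googlebot hitting thousands of pages from one IP costs a single pair of DNS lookups per day, which is why the overhead is no big deal even on a busy site.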
As a matter of fact, I'm pretty sure WebmasterWorld does a DNS lookup as well and we know THIS is a very high traffic site.
Brett could chime in with details or you could hear him elaborate on the topic at PubCon.
[edited by: incrediBILL at 5:07 am (utc) on Sep. 22, 2006]
I do this on every initial access from a new IP on a high traffic site and it's no big deal if you cache the results for 24 hours.
How necessary is all this? How big of a problem is it? And is the concern that Google might credit your content to someone else? Or is it that you don't want to be the source for someone else mashing up your content and creating their own SE-bait content?
OK, let me 'splain it to you...
a) Scrapers spoof as Google to rip off naive people who depend on shoddy .htaccess files blocking bad user agents
b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.
c) Scrapers even scrape via Google proxy services such as translate.google.com and the Web Accelerator so just limiting Googlebot to a Google IP range is insufficient to stop abuse.
That's why Google did this: some of us made a big enough stink about it that they gave us the tools to keep Googlebot properly restricted and accurately block some of this nonsense.
Matt Cutts deserves more than a few kudos for following thru on this - THANK YOU MATT!
[edited by: incrediBILL at 7:13 am (utc) on Sep. 22, 2006]
...start lawsuits against people who are using spiders bearing their name. It's a trademark issue.
I don't think this problem is going to be solved by the trademark lawyers.
For example, IE uses a user-agent something like this:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Yet according to the USPTO [uspto.gov], the Mozilla Foundation has a trademark on 'Mozilla', for
"Computer programs for accessing and displaying files on both the internet and the intranet; network access server operating software for connecting computers to the internet and the intranet."
Does this mean MSFT should be sued for "impersonating" Mozilla? <boggle>
Where's Webwork when we need him? ;-)
What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.
For starters, showing visitors geo-targeted information as technically that is cloaking since the search engines are geo-neutral, you show them generic things yet show visitors custom tailored information such as local content or ads.
Another scenario I think falls under the term of cloaking is custom-tailoring part of the page content based on the keywords in the query from the search engine. The page the SE indexed isn't exactly the same as what the visitor sees, but it's a perfectly valid page response to a visitor based on the specifics of their search.
One more perfectly valid cloaking method is to give the search engines just the content and not all the HTML layout, as it does the SE no good to get all the superfluous noise. As long as the content is exactly the same, there's nothing technically wrong with this IMO.
How about all the little mobile devices? They can't display a full web page, so is that reformatting or cloaking? Depends on how you look at it.
I don't cloak, too much work.
[edited by: incrediBILL at 4:59 pm (utc) on Sep. 22, 2006]
The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content
custom tailor part of the page content based on the keywords in the query from the search engine
I don't really consider that cloaking, because it's referrer-based, rather than agent-based. But I do it.
Not true - the page content that was INDEXED was agent-based, but what the visitor sees is referrer-based, and the definition of cloaking is showing the SE different content than you show the visitor. Therefore... cloaked.
But I don't see anything wrong with it when at least most of the content shown to the search engine is still on the page.
Don't deceive your users or present different content to search engines
than you display to users, which is commonly referred to as "cloaking."
A lot hinges on how you understand that word "content".
The New York Times showing everything to Googlebot but requiring a paid registration for people to see the content
Um, I would consider this very illegitimate. You can't have your cake and eat it too - either your information is public, or it's not. If the landing page doesn't correspond to the Google excerpt, I would report it as spam; users should see exactly what Google is seeing. Now, if I search for something and find it on the first page, yet I would have to register to see the second page, that would be fair. But Google's results must correspond to what is published.
Generally speaking, in every case that I've seen, the landing page corresponds exactly to the excerpt. You do get to read at least that much before registration is required. So it seems that the excerpts are carefully crafted and spoon-fed - cloaked. But this is straying off what is a really good topic.
Googlebot is tricked into crawling thru proxy sites with a cloaked list of links and the only way I know this for certain is that these hijacked page listings show up in the SERPs all the time.
Therefore, when Google was crawling your page via the proxy server, it would appear to the casual observer to be someone spoofing Google.
[edited by: incrediBILL at 5:10 pm (utc) on Sep. 23, 2006]
That is why we first check the user agent, then whether the IP is in the bot's range.
The IP range alone is insufficient as you can get spoofed via the web accelerator and possibly translate.google.com, as I've been scraped via both, or any other Google proxy service. The forward/reverse DNS narrows it down to googlebot.com so you know it's really their crawler and nothing else.
What might be a legitimate use of cloaking? I'm sure I'm not the only one wondering.
For highly competitive SEO markets, and if I am not feeling lazy, I sometimes strip out almost every unnecessary tag, image, and so on; I just leave the basic tags needed and the content. If someone reports the site to Google, I do not know how Google will react. I am cloaking, but the content is the same. So basically I am serving people and Google the same content - they just are not in the same layout.
b) Google is fed lists of cloaked links by proxy sites and they crawl thru the proxy sites and hijack your listings via the proxy. In some cases they give the proxy site ownership of your page.
I'm slow, I don't get this? But is this the reason why I see some sites with different URLs sometimes in the SERPs?
In an SEO Yahoogroup I am a member of, one member was showing screenshots of a search result, but in Yahoo, and the URL displayed was www.google.com and the page had nothing to do with Google. Is this the link cloaking by proxy?
Scanned it. But I read all of your posts, Bill. Might have missed something.
> This is a tool for the rest of us to block Google spoofing scrapers and stop cloaked proxy hijackers from stealing our SERPs.
First question: OK, you can detect and block scrapers who pretend to be Googlebot. What do you do with scrapers that don't? In other words: why would you care whether a scraper pretended to be Googlebot unless you were cloaking?
Second question: Now what exactly is a cloaked proxy hijacker?