.htaccess doesn't work with googlebot any more


albertb

5:55 pm on Jun 13, 2006 (gmt 0)

Hi,

I have some Flash websites and, because they are Flash movies, spiders can't read their content well or follow their links.
I used .htaccess to redirect Googlebot (and also text browsers such as Links
and Lynx) to an alternative home page (index_text.php), which contained
the same text and links as the Flash animation but was written in plain
HTML and was easily indexed by search engines.

This method worked on all my Flash sites (hosted with different
providers) until February-March 2006, when Googlebot stopped indexing this
text page and started indexing the regular home page, just as a normal user would see it.

Any idea what happened and how to solve the problem?

This is my .htaccess

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Links [OR]
RewriteCond %{HTTP_USER_AGENT} ^Lynx
RewriteRule ^$ index_text.php

Thanks in advance.

volatilegx

5:57 pm on Jun 13, 2006 (gmt 0)

Sounds like they are spidering your site with a "stealthed" spider that doesn't identify itself as Googlebot. If you know its IP addresses, you could use .htaccess to perform the redirect; otherwise, you're S.O.L.
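
For illustration only, an IP-based condition of the kind described above might be sketched as follows. The address range 192.0.2.0/24 is a reserved documentation range used purely as a placeholder, not a real crawler address, and the [L] flag is an addition that does not appear in the original ruleset.

```apache
RewriteEngine On
RewriteBase /
# Placeholder range (192.0.2.0/24 is reserved for documentation);
# replace with addresses you have verified yourself.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
# Internally rewrite requests for the site root to the plain-HTML page
RewriteRule ^$ index_text.php [L]
```

A list like this has to be maintained by hand, which is the main cost of matching on IP address instead of user-agent.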

the_nerd

7:55 pm on Jun 29, 2006 (gmt 0)

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]

I wouldn't touch cloaking with a 10-foot pole, but common sense tells me nobody would juggle lists with tens of thousands of spider IPs and keep them up to the second if you could fool 4000 PhDs simply by using "... HTTP_USER_AGENT} ^Googlebot"

brizad

10:44 pm on Jun 29, 2006 (gmt 0)

IP-based cloaking is the only way to go, in my opinion. It's too easy for the SEs to NOT label themselves as who they truly are, and it's too easy for real humans to disguise themselves as bots and see your cloaked pages.

I'd say you might need some better cloaking software that keeps up with the SE IPs automatically. PM me if you want my recommendations.

jdMorgan

1:17 am on Jun 30, 2006 (gmt 0)

The reason that your code 'quit working' is that Googlebot changed its user-agent string some time ago to a "Mozilla compatible" format of "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Therefore, your start-anchored "^Googlebot" pattern no longer matches their requests, because the string now begins with "Mozilla" rather than "Googlebot".

You could remove the start-anchor and use:


RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]

-or the more specific-


RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/ [OR]
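
Putting this together with the ruleset from the first post, a corrected version might look like the sketch below. It uses the unanchored variant, leaves the Links and Lynx conditions as originally posted, and adds an [L] flag that was not in the original; treat it as a starting point to test on your own server, not a verified configuration.

```apache
RewriteEngine On
RewriteBase /
# Unanchored: matches "Googlebot" anywhere in the user-agent string
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
# Text browsers, unchanged from the original post
RewriteCond %{HTTP_USER_AGENT} ^Links [OR]
# The last condition in an [OR] chain carries no [OR] flag
RewriteCond %{HTTP_USER_AGENT} ^Lynx
# Internally rewrite requests for the site root to the plain-HTML page
# ([L] is an addition here; it stops further rule processing)
RewriteRule ^$ index_text.php [L]
```

The unanchored pattern also matches any other client whose user-agent string contains "Googlebot", which is the trade-off against the more specific pattern.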

As others have stated, this won't fool a hand-check. But if index_text.php is indeed a plain-text equivalent of your Flash page, no more, no less, then I wouldn't worry about it; Google is against cloaking with intent to deceive the user, not against user-agent-dependent content negotiation per se.

You might also want to make sure you send a 'Vary' header to warn network caches that you are serving user-agent-dependent content:


# Tell caches that page content changes depending on client user-agent
<FilesMatch "\.(html|php)$">
Header set Vary: "User-Agent"
</FilesMatch>

Make sure the character between "html" and "php" is a solid pipe ("|") before use; posting on this board can alter that character.
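
As a possible variation, the same header can be added with "append" rather than "set", so that any Vary value already present on the response is kept instead of overwritten. The IfModule wrapper is an extra precaution, not part of the snippet above; it simply skips the block on servers where mod_headers is not loaded.

```apache
# Tell caches that page content changes depending on client user-agent
<IfModule mod_headers.c>
    <FilesMatch "\.(html|php)$">
        # append keeps any existing Vary value and adds User-Agent to it
        Header append Vary User-Agent
    </FilesMatch>
</IfModule>
```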

Jim

volatilegx

2:20 pm on Jun 30, 2006 (gmt 0)

Good catch, Jim... I missed the caret :o