revrob - 7:26 pm on Sep 28, 2012 (gmt 0)
I've virtually given up on bingbot, having tried a whole variety of methods via robots.txt and .htaccess. Even when I had all the bingbot IP ranges supposedly banned, I found that bingbot was occasionally accessing bulky media files in disallowed folders, even from an IP address that should have been totally banned and was getting a 403 response everywhere else on my site. It seemed able to evade the Rewrite [F] rules when requesting a minority of PDF and JPG files (which were also restricted in robots.txt, but Bing didn't care about that either).
My current experiment is a rewrite rule that sends every MS IP range I can identify to robots.txt, whatever they are asking for, where they can chew on the Disallow directive for bingbot that they are so keen to ignore.
User-agent: bingbot
Disallow: /
Here is what I put up this afternoon in .htaccess:
RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]
RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]
RewriteCond %{REMOTE_ADDR} ^65\.52\.([0-9]|[1-4][0-9]|5[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]
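For anyone who wants something shorter, the three blocks above could probably be folded into one by chaining the IP conditions with the [OR] flag. This is only a sketch, not what I am running: it assumes RewriteEngine On is already set earlier in .htaccess, it assumes REMOTE_ADDR is always a well-formed IPv4 address (so only the leading octets are tested), and it uses a local /robots.txt target in place of the full URL.

```apache
# Sketch only: same three MS ranges as above, chained with [OR].
# Assumes "RewriteEngine On" appears earlier in .htaccess.
# Assumes REMOTE_ADDR is always a well-formed IPv4 address, so only the
# leading octets are tested here.
RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\. [OR]
RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.52\.([0-9]|[1-4][0-9]|5[0-5])\.
# Applies to all three ranges: leave requests for robots.txt itself alone.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Local path target: handled as an internal rewrite, not an external redirect.
RewriteRule .* /robots.txt [L]
```

The [OR] conditions are grouped together before the final REQUEST_URI test is applied, so the robots.txt exclusion covers every range. With a local path as the target, the visitor is served the robots.txt content directly rather than being sent a redirect; add the [R] flag if an external redirect is what you want.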
I have robots.txt listed in the "don't rewrite" section near the beginning of .htaccess.
RewriteCond %{REQUEST_URI} !/robots\.txt$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
I'm now waiting to see whether that works, or whether some of the bingbot visits will somehow continue to evade it.
I have had a couple of bingbot visits since putting that code up, and they were sent nicely to robots.txt.
If MS are not prepared to observe robots.txt, then I am not prepared to let them read anything EXCEPT robots.txt.
The only other legit bot I have trouble with is Yahoo! Slurp, which also has a habit of ignoring robots.txt directives, but I have managed to tame that one via .htaccess.