I recently launched a subdomain on my site, mobile.mysite.com. It's basically brand new. Already, a rogue spider is lowercasing letters in URLs and wreaking havoc. So I made changes to my .htaccess file and noticed some peculiar things. (Note: I changed the .htaccess in the mobile. subdomain directory. That file is being read - I've tested it with some 301 redirects.) (Note: this is the Sosospider out of Japan, and its webspider.htm file says to block it in the robots.txt file. I added that line. Let's see if that helps, but I'm not going to wait to find out.)
This is my new .htaccess code:
# Sosospider+(+http://help.soso.com/webspider.htm)
# <Files *>
order allow,deny
deny from 117.135.129.*
deny from 117.135.129
deny from Sosospider
allow from all
# </Files>
RewriteEngine on
RewriteCond %{HTTP_REFERER} help\.soso\.com\/webspider [NC,OR]
RewriteCond %{HTTP_REFERER} anotherbadsite\.com [OR]
RewriteCond %{HTTP_REFERER} Sosospider
RewriteRule .* - [F,L]
There are a few more things in my .htaccess file, but that is the only section with deny and allow. No more above or below it. Neither the code above nor the other combinations I tried blocked our Sosospider friend as I would have expected. What am I doing wrong?
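One variant that might be worth trying for the deny/allow part: as far as I can tell from the mod_authz_host documentation, "deny from" expects an IP address, a host name, or an env= variable rather than a user-agent string, so the user agent would have to be matched separately. A minimal sketch, assuming Apache 2.2-style access control, that mod_setenvif is loaded, and that AllowOverride permits these directives in .htaccess (bad_bot is a variable name made up for illustration):

```apache
# Sketch only, not yet tested against the live spider.
# Flag any request whose User-Agent header contains "Sosospider"
# (case-insensitive). "bad_bot" is an arbitrary variable name.
SetEnvIfNoCase User-Agent "Sosospider" bad_bot

# With "allow,deny" ordering, a matching deny wins over "allow from all".
order allow,deny
allow from all
# Deny flagged user agents.
deny from env=bad_bot
# Deny the 117.135.129.x range (partial IP, no wildcard character).
deny from 117.135.129
```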
My 2 questions:
1. Whether I have the <Files> part or not, this code does not block Sosospider. Any ideas why? (I have the anotherbadsite in there as a placeholder. I think the first two deny lines are duplicates, but I put both in just to see if one or the other would work. No luck.)
2. When I leave in the <Files> tag, my access log format changes! The usual IP format x.x.x.x changes to a hostname. I see it when I click on my site, and I see it when Googlebot hits my site, yet Sosospider still shows its IP (may be related to what I say below). Example:
With <Files *> left in:
crawl-66-249-71-1.googlebot.com - - [07/Feb/2012:17:16:11 -0500] "GET /list.cgi?st=ergo%20part/ HTTP/1.1" 200 2531 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
And without: (this looks usual to me)
66.249.71.1 - - [07/Feb/2012:17:22:24 -0500] "GET /list.cgi?st=east+west HTTP/1.1" 200 1080 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
Why does putting the <Files *> tag in change the Apache access log format?
Those are my 2 questions.
Here is a Sosospider entry, with all the .htaccess code above in place:
117.135.129.75 - - [07/Feb/2012:17:34:33 -0500] "GET /subdir/thing/item/ HTTP/1.1" 404 - "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
I changed the names for this post, but item should be Item (the spider lowercased it), so it's a 404. Here is a list.cgi entry showing status 200:
117.135.129.73 - - [07/Feb/2012:17:30:50 -0500] "GET /list.cgi?group=things&max=100&min=25&page=4 HTTP/1.1" 200 5437 "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
where things should be Things. Not to mention, all these links are marked rel="nofollow" in the HTML!
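Since I could not block that darn spider as I intended (for my whole mobile. subdomain), I at least blocked it in my list.cgi script as follows:
# Refuse requests from the Sosospider IP range (117.135.129.x).
my $ip = $ENV{"REMOTE_ADDR"};
if ($ip =~ /^117\.135\.129\./) {
    # The CGI Status header sets the HTTP response code.
    print "Status: 403 Forbidden\nContent-type: text/html\n\n";
    print "Bad SosoSpider";
    exit 0;
}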
This works. I see the 403 status scroll by.
But still, I'd like to block this (dumb) spider in my .htaccess file.
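Looking at the log entries above, the Referer field is "-" and "Sosospider" appears in the User-Agent field, so a rewrite condition on HTTP_USER_AGENT may be closer to what I intended than HTTP_REFERER. A sketch, assuming mod_rewrite is enabled for this directory and not yet tested against the live spider:

```apache
# Sketch only: match the User-Agent header or the client IP range,
# instead of the Referer header used in my original rules.
RewriteEngine on
# Case-insensitive match on the spider's user-agent string.
RewriteCond %{HTTP_USER_AGENT} Sosospider [NC,OR]
# Or any client address in 117.135.129.x.
RewriteCond %{REMOTE_ADDR} ^117\.135\.129\.
# Answer with 403 Forbidden; the F flag stops further rewriting.
RewriteRule .* - [F]
```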