Baiduspider came back early today. It did fetch robots.txt, but apparently did not parse it correctly: it ended up fetching a disallowed file, which happens to redirect to a spider trap, and so banned itself by IP address.
We have discussed Baidu at length here, and I've tried to give them the benefit of the doubt. However, they don't seem to be capable of coding their spider to fetch and parse robots.txt correctly. I'm leaving the ban in place until they get it right.
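For reference, a compliant crawler consults robots.txt before requesting anything else. A minimal sketch using Python's standard-library robots.txt parser, with a hypothetical `Disallow: /trap/` rule standing in for the trap directory described above:

```python
from urllib import robotparser

# Hypothetical robots.txt rules similar to the situation described:
# the trap directory is disallowed, so a correct parser never fetches it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved spider checks every URL against the rules first.
print(rp.can_fetch("Baiduspider", "http://example.com/index.html"))    # True
print(rp.can_fetch("Baiduspider", "http://example.com/trap/bait.html"))  # False
```

Had Baiduspider run its URLs through a check like this, it would have skipped the disallowed file and never hit the trap.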
Msg#: 1848 posted 11:42 am on Apr 23, 2003 (gmt 0)
I've found it has tended to misbehave also, and am dithering over whether to ban it.
In recent weeks it has requested the file /SIteTECh/global.Css several times.
My server is Unix-based, so requests are case-sensitive. A file /sitetech/global.css does exist, but only in lower case, so the mixed-case request generates a 404 because /SIteTECh/global.Css doesn't actually exist.
There are no links on any of my pages to the above, and I really doubt there are external links to a non-existent file, so why is it making a request for this file which has never existed?
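The mismatch is easy to see with a plain string comparison: a case-sensitive filesystem matches paths exactly, while a case-insensitive one effectively compares them case-folded. An illustrative sketch, not an actual filesystem test:

```python
real_path = "/sitetech/global.css"  # the file that actually exists
requested = "/SIteTECh/global.Css"  # what Baiduspider asked for

# Case-sensitive server (typical Unix): exact match required, so a 404 results.
print(real_path == requested)  # False

# Case-insensitive server: the two spellings name the same resource.
print(real_path.lower() == requested.lower())  # True
```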
Msg#: 1848 posted 10:50 am on Apr 29, 2003 (gmt 0)
I got a reply from them. They said:
"We are testing if your site is case sensitive or not. So we change some character in the filename to uppercase th get it. If we can get it, your site is not case sensitive, and we will change all characters in the urls of your site to avoid get duplicate page in your site. I am very sorry to trouble you."
-- Personally, I think it is unacceptable to deliberately cause errors on other people's sites, time after time. If they want to check for duplicate pages, they should just compare the two files...