Forum Moderators: open
robots.txt? Yes, BUT it asked for the root twice in the same second despite a Disallow rule:
07:52:22 /
07:52:22 /robots.txt
07:52:22 /
Host harbors multiple bots, bad and otherwise. (See prior threads [google.com].)
[edited by: incrediBILL at 5:18 pm (utc) on Oct. 5, 2009]
[edit reason] removed specifics [/edit]
BTW, I've had numerous problems with this server farm in the past. Not requesting robots.txt (at least not in the last 24 hrs) before taking HTML files has just put Linguee Bot on probation.
[linguee.com...]
We take this issue seriously, and we understand that we need to be more open and transparent. We want our bot to behave nicely and earn a good reputation. Your feedback on the info page is highly appreciated.
Regards
Linguee Bot Team
Linguee Bot (http://www.linguee.com/bot)
212.227.136.nnn
bot3.linguee.com
-----
inetnum: 212.227.134.0 - 212.227.143.255
netname: SCHLUND-CUSTOMERS
descr: 1&1 Internet
country: DE
-----
READ ROBOTS.TXT? Yes
OBEYED ROBOTS.TXT? No
-----
Took a bazillion files, most of which were in disallowed folders.
When the bot accesses disallowed folders, it is usually because it had trouble parsing the robots.txt syntax. While it could very well be a bug on our part, we have also seen the weirdest "standard extensions" in the wild. Thanks for your help.
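For reference, here is what compliant parsing looks like with Python's stdlib robots.txt parser. The rules and URLs below are illustrative, not this site's actual robots.txt; the point is that a well-behaved crawler checks every path against the parsed rules before fetching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules resembling the Disallow entries discussed above
rules = """User-agent: *
Disallow: /error/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler calls can_fetch() before every request
print(rp.can_fetch("LingueeBot", "http://example.com/error/oops.asp"))  # False
print(rp.can_fetch("LingueeBot", "http://example.com/default.asp"))     # True
```

If the bot's own parser disagrees with this on a given rule set, that is the kind of "syntax trouble" bug described above.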
From 04:47:48 till 04:47:54 it tried the following
HEAD / - 302
HEAD / - 302
HEAD /error/oops.asp - 200
GET /robots.txt - 200
GET /error/oops.asp - 200
GET /es/error/oops.asp - 404
GET /error - 301
GET /error/ - 403
GET / - 302
GET /error/oops.asp - 200
GET /default.asp - 302
GET /error/oops.asp - 200
fabricating URL strings (such as the /es/error/oops.asp request, which 404'd) as it went along.
For quite some time I have set up all my sites so that anyone/anything arriving without a referrer has to "open the door" (unless they have a permanent key, e.g. known SEs), which no known or unknown bot/scraper has been able to do.
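The poster doesn't share the actual mechanism, but the idea can be sketched as a simple request gate: referrer-less requests get a challenge page unless the user agent is on a trusted whitelist. Everything here (the whitelist entries, function name, return values) is assumed for illustration.

```python
# Assumed whitelist standing in for the poster's "permanent key" list
KNOWN_SE_AGENTS = ("Googlebot", "bingbot")

def gate(headers):
    """Decide whether a request is allowed straight in or must be challenged."""
    ua = headers.get("User-Agent", "")
    if any(se in ua for se in KNOWN_SE_AGENTS):
        return "allow"        # permanent key: trusted search engine
    if not headers.get("Referer"):
        return "challenge"    # no referrer: visitor must "open the door"
    return "allow"

print(gate({"User-Agent": "Mozilla/5.0", "Referer": "http://example.com/"}))  # allow
print(gate({"User-Agent": "SomeScraper/1.0"}))                                # challenge
```

A scripted bot that never sends a referrer and never completes the challenge stays locked out, which matches the behavior described above.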
If I'm not mistaken, this is an older visit from a previous version of the bot. You see an access to the root dir before reading robots.txt; we solved that some time ago. And we have recently changed some other things to be less obnoxious, based on the feedback we got, so a new visit should look different.
Apart from generating the non-200s, did it actually access disallowed directories?
Thanks.