Linguee Bot

Pfui

3:05 pm on Oct 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



oftzzz.serverloft.de
Linguee Bot (bot@linguee.com)

Requested robots.txt? Yes, BUT it also asked for the root twice in the same second despite it being Disallowed:

07:52:22 /
07:52:22 /robots.txt
07:52:22 /

Host harbors multiple bots, bad and otherwise. (See prior threads [google.com].)

[edited by: incrediBILL at 5:18 pm (utc) on Oct. 5, 2009]
[edit reason] removed specifics [/edit]
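For contrast with the double root hit logged above, a polite crawler parses robots.txt before requesting anything else and gates every fetch on it. A minimal stdlib sketch (the rules are inlined here purely for illustration):

```python
# Parse robots.txt first, then consult it before every request.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

ua = "Linguee Bot"
# A compliant bot checks the parser before each URL it wants:
print(rp.can_fetch(ua, "http://example.com/private/page.html"))  # False
print(rp.can_fetch(ua, "http://example.com/index.html"))         # True
```

In a live crawler, `rp.set_url(...)` plus `rp.read()` would fetch the rules over HTTP instead of inlining them.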

Pfui

6:25 pm on Oct 6, 2009 (gmt 0)


P.S. Lopped off a letter in the Host. It's loftzzz.serverloft.de (with zzz = numbers... Thanks iB:)

GaryK

4:41 pm on Oct 21, 2009 (gmt 0)


It visited me last week too, but only requested robots.txt once per session per site. Well behaved in terms of honoring robots.txt and crawl speed. I wonder if they noticed the excessive requests for robots.txt and corrected the problem promptly?
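The once-per-session pattern GaryK describes is usually implemented as a per-host cache with a time-to-live. A rough sketch (the names and the one-hour TTL are illustrative, and the fetch callable is injected so the sketch stays offline):

```python
import time
from urllib import robotparser

CACHE_TTL = 3600   # seconds; illustrative value
_cache = {}        # host -> (fetched_at, parser)

def robots_for(host, fetch=None):
    """Return a cached RobotFileParser for host, refetching after CACHE_TTL.

    `fetch` is a callable returning the robots.txt body as a list of lines;
    a real crawler would fetch http://<host>/robots.txt here instead.
    """
    now = time.time()
    cached = _cache.get(host)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]            # still fresh: no new request
    rp = robotparser.RobotFileParser()
    rp.parse(fetch(host) if fetch else [])
    _cache[host] = (now, rp)
    return rp
```

With this shape, each host's rules are requested at most once per TTL window, however many pages get crawled in between.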

keyplyr

7:54 am on Nov 29, 2009 (gmt 0)


IP address: 85.25.124.*** (www.serverloft.de)
UA: Linguee Bot (bot@linguee.com)
rDNS: bot.linguee.com
robots.txt: no

BTW, I've had numerous problems with this server farm in the past. Not requesting robots.txt (at least not in the last 24 hrs) before taking HTML files has just put Linguee Bot on probation.

enigma1

11:59 am on Nov 30, 2009 (gmt 0)


My logs also show some IPs in the 85.25.124 range attempting RFIs (remote file inclusion attacks). This particular bot has been reported in the past as a site scraper.

keyplyr

8:38 am on Dec 2, 2009 (gmt 0)


Also from...

IP address: 212.227.136.***
rDNS: bot3.linguee.com
robots.txt: yes

No bad behavior.

linguee

8:34 pm on Dec 7, 2009 (gmt 0)


Thank you for this thread. We have created a new info page for our bot.

[linguee.com...]

We take this issue seriously, and we understand that we need to be more open and transparent. We want our bot to behave nicely and earn a good reputation. Your feedback on the info page is highly appreciated.

Regards
Linguee Bot Team

Pfui

7:01 am on Dec 9, 2009 (gmt 0)


@linguee / Linguee Bot Team: Thank you for stopping by, giving consideration to our comments, and penning a helpful info page, too! As indicated in the OP, my main concern was your bot ignoring robots.txt. I look forward to seeing it on its best behavior in the future:) Again, thanks.

GaryK

8:50 pm on Dec 13, 2009 (gmt 0)


They appear to have updated their UA, and while it read robots.txt quite often, perhaps too often, it still seems to have problems respecting it.

Linguee Bot (http://www.linguee.com/bot)
212.227.136.nnn
bot3.linguee.com
-----
inetnum: 212.227.134.0 - 212.227.143.255
netname: SCHLUND-CUSTOMERS
descr: 1&1 Internet
country: DE
-----
READ ROBOTS.TXT? Yes
OBEYED ROBOTS.TXT? No
-----
Took a bazillion files, most of which were in disallowed folders.

linguee

9:38 pm on Dec 13, 2009 (gmt 0)


@GaryK: Is there any way we could have a look at the robots.txt in question?

When the bot accesses disallowed folders, it is usually because it has trouble parsing the file's syntax. While it could very well be a bug on our part, we have also seen the weirdest "standard extensions" in the wild. Thanks for your help.

GaryK

9:41 pm on Dec 13, 2009 (gmt 0)


I'll send it via Sticky Mail later this evening. I'm on the go right now.

Staffa

12:00 pm on Dec 29, 2009 (gmt 0)


Updated UA again, seen today from 85.25.124.n :

Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)

Pfui

3:31 am on Dec 30, 2009 (gmt 0)


@Staffa: Did it read and heed robots.txt?

Staffa

10:17 am on Dec 30, 2009 (gmt 0)


Pfui, it read it all right, but I doubt it would have heeded it:

From 04:47:48 till 04:47:54 it tried the following

HEAD / - 302
HEAD / - 302
HEAD /error/oops.asp - 200
GET /robots.txt - 200
GET /error/oops.asp - 200
GET /es/error/oops.asp - 404
GET /error - 301
GET /error/ - 403
GET / - 302
GET /error/oops.asp - 200
GET /default.asp - 302
GET /error/oops.asp - 200

fabricating URL strings (the /es/ prefix) as it went along.

For quite some time I have set up all my sites so that anyone/anything that arrives without a referrer has to "open the door" (unless they have a permanent key, e.g. known SEs), which no known or unknown bot/scraper has been able to do.
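Staffa's "no referrer, no entry" gate can be sketched as WSGI middleware. The allowlist tokens below are hypothetical stand-ins for the "permanent keys", and a real deployment would likely present a challenge rather than a flat 403:

```python
# Requests without a Referer header are refused unless the UA carries a key.
KNOWN_SE_TOKENS = ("Googlebot", "bingbot")  # hypothetical allowlist

def referrer_gate(app):
    def gated(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        referer = environ.get("HTTP_REFERER", "")
        has_key = any(tok in ua for tok in KNOWN_SE_TOKENS)
        if referer or has_key:
            return app(environ, start_response)  # door opens
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    return gated
```

Most scrapers send no referrer on their first request, which is why this simple check stops so many of them cold.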

linguee

12:04 pm on Dec 30, 2009 (gmt 0)


Hi. Some comments I have on this:

If I'm not mistaken, this is an older visit from a previous version of the bot. You see an access to the root dir before reading robots.txt – we solved that some time ago. And we have recently changed some other things to be less obnoxious, based on the feedback we got, so a new visit should look different.

Apart from generating the non-200s, did it actually access any disallowed directories?

Thanks.
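One way to answer that question from the logs is to replay each logged path against the disallow rules and flag any hits that should have been blocked. A rough sketch (the log entries and rules here are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /error/",
])

# (method, path, status) tuples as they might come out of an access log
log = [
    ("GET", "/robots.txt", 200),
    ("GET", "/error/oops.asp", 200),
    ("GET", "/default.asp", 302),
]

# robots.txt itself is always fetchable; everything else must pass the rules
violations = [(m, p) for m, p, s in log
              if p != "/robots.txt" and not rp.can_fetch("Linguee Bot", p)]
print(violations)  # [('GET', '/error/oops.asp')]
```

Status codes don't matter for this check: even a request that 404s still counts as a violation if the path was disallowed.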

Staffa

5:10 pm on Dec 30, 2009 (gmt 0)


@linguee: please define "old".

keyplyr

9:58 pm on Dec 30, 2009 (gmt 0)


Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)

Crawled 1/2 of my main site last night, hits 1 second apart, requested robots.txt

linguee

3:08 pm on Dec 31, 2009 (gmt 0)


@ Staffa: We deployed a new version about 3 weeks ago.

Staffa

6:58 pm on Dec 31, 2009 (gmt 0)


OK, then that must have been just after my log lines from December 07.

Apart from the log lines I posted earlier, the bot did not access anything else, because it never got access to the site.