NG/1.0 update

Spider from France reads, ignores robots.txt

jdMorgan

1:30 pm on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This User-agent has been previously reported here as a search engine spider from France. However, it is now misbehaving from this particular IP address, which may or may not belong to the search engine. I chased it only as far as the RIPE database, which shows this IP in a range assigned to a French ISP, ISDnet.

195.154.174.164 - - [13/Sep/2002:15:52:49 -0400] "GET /robots.txt HTTP/1.0" 200 857 "-" "NG/1.0"
195.154.174.164 - - [13/Sep/2002:17:52:40 -0400] "GET / HTTP/1.0" 200 44134 "-" "NG/1.0"
195.154.174.164 - - [13/Sep/2002:15:52:50 -0400] "GET / HTTP/1.0" 200 44134 "-" "NG/1.0"
195.154.174.164 - - [14/Sep/2002:08:14:19 -0400] "GET /news.html HTTP/1.0" 200 34007 "-" "NG/1.0"
.
. (Fetched many allowed files)
.
195.154.174.164 - - [14/Sep/2002:08:14:05 -0400] "GET /common/officers.shtml HTTP/1.0" 403 838 "-" "NG/1.0"

403-Goodbye!

Jim
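
For anyone who wants to check their own logs for the same behaviour, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses combined-format log lines like the ones above and lists the requests a given User-agent made to disallowed paths. The `ROBOTS_TXT` rule is an assumption for illustration (the actual robots.txt is not shown in this thread), and `find_violations` is a hypothetical helper name.

```python
import re
from urllib.robotparser import RobotFileParser

# Apache "combined" log format, matching the excerpt in the post above.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def find_violations(log_lines, robots_txt, user_agent):
    """Return (time, path) for each request by user_agent that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None or match["agent"] != user_agent:
            continue  # unparseable line, or a different client
        if not parser.can_fetch(user_agent, match["path"]):
            violations.append((match["time"], match["path"]))
    return violations


# ASSUMPTION: a stand-in rule; the real robots.txt is not quoted in the thread.
ROBOTS_TXT = "User-agent: *\nDisallow: /common/\n"

# Log lines copied from the post above.
LOG = [
    '195.154.174.164 - - [13/Sep/2002:15:52:49 -0400] "GET /robots.txt HTTP/1.0" 200 857 "-" "NG/1.0"',
    '195.154.174.164 - - [13/Sep/2002:15:52:50 -0400] "GET / HTTP/1.0" 200 44134 "-" "NG/1.0"',
    '195.154.174.164 - - [14/Sep/2002:08:14:19 -0400] "GET /news.html HTTP/1.0" 200 34007 "-" "NG/1.0"',
    '195.154.174.164 - - [14/Sep/2002:08:14:05 -0400] "GET /common/officers.shtml HTTP/1.0" 403 838 "-" "NG/1.0"',
]

if __name__ == "__main__":
    for when, path in find_violations(LOG, ROBOTS_TXT, "NG/1.0"):
        print(when, path)
```

Note that this only reports what a compliant parser would have refused to fetch; it says nothing about why the bot ignored the rule.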

heini

1:48 pm on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WWW.EXALEAD.COM 195.154.174.161
WWW1.EXALEAD.COM 195.154.174.162
WWW2.EXALEAD.COM 195.154.174.163

Couldn't verify the .164 though.

Exalead powers AOL France.

jdMorgan

2:02 pm on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



heini,

Yes, that's what I saw here before... And was therefore surprised at the behavior shown above.
(Brett's validator says my robots.txt is correct, and it has been interpreted correctly by many other 'bots.)

AOL 403'ed? Hah! Maybe I like it! ... Maybe not.

Jim

martin

10:57 pm on Sep 14, 2002 (gmt 0)

10+ Year Member



195.154.174.164 ng1.exabot.com

Did it do anything wrong?

jdMorgan

12:31 am on Sep 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



martin,

Sorry if I wasn't clear... Yes, see the last line of the log in the first post. It fetched an off-limits file, disallowed in robots.txt, and got itself banned.

Jim

martin

12:22 pm on Sep 16, 2002 (gmt 0)

10+ Year Member



Sorry, I didn't notice.

weesnich

2:35 pm on Sep 23, 2002 (gmt 0)

10+ Year Member



Same here: it reads robots.txt and then requests everything, ignoring forbidden paths.

martin

3:09 pm on Sep 23, 2002 (gmt 0)

10+ Year Member



It behaves well on my site. I haven't seen anything wrong, but it only started on 13 Sep. At its current rate I don't think it will reach the banned area soon (I hope it never will).

heini

3:19 pm on Sep 23, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I haven't seen the bot disrespecting robots.txt so far. They should attach a contact/info page, though.

This is an English page with contact info [exalead.com].

It's a pretty impressive job Exalead is doing at AOL.fr.
If they are now starting to spider the whole web, regardless of language, I wonder what they are up to...

weesnich

8:43 pm on Sep 23, 2002 (gmt 0)

10+ Year Member



OK. I wrote them a friendly email including the relevant logs and part of my robots.txt.

If I get a useful answer, I'll post a summary here.

weesnich

1:20 am on Sep 25, 2002 (gmt 0)

10+ Year Member



I hope they don't mind if I cite their fast and friendly answer here:

<answer>
After verification, it appears that our robot indeed rejected your robots.txt file as malformed because of a missing terminal newline. Though this is technically incorrect with respect to the specification (http://www.robotstxt.org/wc/norobots-rfc.html), I reckon such a minor and unambiguous deviation should have been accepted. I will see that our robots behaviour is changed regarding this point.
</answer>

I think they are some of the good guys out there.

jdMorgan

2:22 am on Sep 25, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"I reckon such a minor and unambiguous deviation should have been accepted."

Wow! I didn't know they spoke "Cowboy" in France. Tres bien!

Hmm... Now I'll have to go check for a terminal LF in my robots.txt and un-ban them if there isn't one. Sounds like a minor upgrade to the Search Engine World robots.txt checker is needed, too.

Thanks for checking on this, weesnich.

Jim
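
Checking for the missing terminal newline that Exalead describes takes only a few lines. This is a minimal sketch, not a full robots.txt validator: both function names are hypothetical, and treating either LF or CR as an acceptable final terminator is an assumption on my part.

```python
def has_terminal_newline(data: bytes) -> bool:
    """True if the raw file content ends with a line terminator (LF or CR)."""
    return data.endswith((b"\n", b"\r"))


def with_terminal_newline(data: bytes) -> bytes:
    """Return data unchanged if it already ends in a newline, else append an LF."""
    return data if has_terminal_newline(data) else data + b"\n"


# Usage sketch: read the file in binary mode so the final byte is seen as-is.
#   with open("robots.txt", "rb") as fh:
#       print(has_terminal_newline(fh.read()))
```

Reading in binary mode matters here, because text mode may translate line endings before you get to inspect them.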

volatilegx

5:08 pm on Oct 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



also from 195.154.174.164