Forum Moderators: goodroi


boitho.com bot violating robots.txt

Specifically requested only forbidden files

         

jazzguy

8:08 pm on May 5, 2005 (gmt 0)

10+ Year Member



"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targeted the files disallowed in robots.txt and ignored all other pages.

The info page says it's a distributed crawler, so just like my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.
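A ban like the one described above, on both user agent and IP block, could be sketched roughly as follows. This is only an illustration: the post does not say which block was banned, so the /24 network below is an assumption, and `is_banned` is a hypothetical helper, not part of any server software.

```python
import ipaddress

# Assumption: the post does not name the banned block; this /24 around the
# reported address 129.241.104.168 is purely illustrative.
BANNED_NETWORKS = [ipaddress.ip_network("129.241.104.0/24")]

# Substrings of user-agent strings to refuse (case-insensitive match).
BANNED_AGENT_SUBSTRINGS = ["boitho.com-dc"]


def is_banned(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request comes from a banned network or agent."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in net for net in BANNED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BANNED_AGENT_SUBSTRINGS)
```

Checking both conditions matters for a distributed crawler: the same user agent may show up from other addresses, and the same address may come back under a different user agent.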

larryhatch

7:31 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Motorhaven: " Oh... and before you deny it:

User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/

See those last two entries? Spider traps. Your crawler hit both of them. "

- - - -

I love it! Thanks for the trick.

NOW I understand some strange entries in my logs. I had the following in my robots.txt:

User-agent: *
Disallow: /airheads/ (I have a UFO site. Lots of airheads in UFO-land.)

I found a few user agents trying to access the non-existent subdirectory /airheads,
and didn't know what to make of it! Now I do. Thanks again, and back to Boitho. -Larry
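The spider-trap idea above can be checked against a log with a short script. The following is a minimal sketch, assuming Apache combined-format log lines; the function names are hypothetical, and for simplicity it treats every Disallow line as applying to all agents rather than honoring per-agent groups.

```python
import re
from collections import defaultdict

# Pulls the request path and the user agent out of a combined-format log line.
LOG_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def disallowed_prefixes(robots_lines):
    """Collect every non-empty Disallow path, ignoring user-agent groups."""
    prefixes = []
    for line in robots_lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def trap_hits(log_lines, prefixes):
    """Map each user agent to the disallowed paths it requested."""
    hits = defaultdict(list)
    for line in log_lines:
        match = LOG_RE.search(line)
        if match and any(match.group("path").startswith(p) for p in prefixes):
            hits[match.group("agent")].append(match.group("path"))
    return dict(hits)
```

Since a trap directory is linked from nowhere, any agent that appears in the result most likely learned the path from robots.txt itself.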

Rogi

7:57 am on Jun 13, 2005 (gmt 0)

10+ Year Member



Ok, so has it been established that boitho did/will violate robots.txt?

It's a little hard to follow this thread with all the flaming going on. I don't think Boitho is malicious (that's my opinion from the posts by someone who works on the bot), but if it's established that it disobeys robots.txt I'll ban it.

What about the other one? The one the whole flame war was all about... did Jazzguy end up proving the violation of robots.txt or was it all just a waste of content-building time reading it all?

rj87uk

8:12 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lord Majestic 1 - Jazzguy 0

I think that is very inappropriate behavior. I count at least Lord Majestic 6 - Jazzguy 0.

Jazzguy - Why not post your robots.txt?

Romeo

7:22 pm on Jun 13, 2005 (gmt 0)

10+ Year Member



Hi Runarb,

thank you for your reply.

I searched thru my server's logs, but could not find any reference to "boitho" trying to look at my /robots.txt for a period of more than 4 months before 2005-05-06:

grep -i boitho access_log.2005* | grep robot
--> no results

So the content of my /robots.txt does not seem to matter much.
However, here it is. It is very short, as it contains basically just the trap disallow (trap directory renamed) and one other statement:
----------------------------
User-agent: *
Disallow: /bottrap/
User-agent: ia_archiver
Disallow: /
----------------------------
This file has been in place unchanged since 2002-11-02. Other bots like googlebot, msnbot, inktomi yahoo slurp, etc, as well as the ia_archiver, don't have problems with it.

If your bot uses another ident than "boitho" when fetching the /robots.txt, pls let me know, and I will grep thru my logs again.

Regards,
R.
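For reference, here is how a compliant parser reads the robots.txt quoted above. This is a minimal sketch using Python's standard `urllib.robotparser`; the bot name is taken from the user-agent string reported at the top of the thread, and the page paths are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted in the post above, with the renamed trap directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap/
User-agent: ia_archiver
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BOT = "boitho.com-dc/0.75"  # user agent reported earlier in the thread

# Example paths are hypothetical; only /bottrap/ comes from the post.
for path in ("/index.html", "/bottrap/index.html"):
    print(BOT, path, parser.can_fetch(BOT, path))
print("ia_archiver", "/index.html", parser.can_fetch("ia_archiver", "/index.html"))
```

Under these rules a generic bot may fetch everything except /bottrap/, while ia_archiver may fetch nothing. A bot that never requests /robots.txt at all, as the log search above suggests, has no way of knowing either rule.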

Lord Majestic

7:38 pm on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!

I can't post the URL here, but if you check our site now you will see a link there to the alpha version of the search engine: it was (alpha) released on 3/06/06.

It is still work in progress and the plan is to release weekly (or so) updates. I expect to see all crawled data indexed and available for searching by the end of summer.

I think that is very inappropriate behavior. I count at least Lord Majestic 6 - Jazzguy 0.

<snip></snip>

[edited by: ThomasB at 8:43 pm (utc) on June 14, 2005]
[edit reason] removed noise [/edit]

Lord Majestic

6:13 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...the bot owner once again confirmed that his bot has violated robots.txt before, so there you go.

So far 2 bug reports were correct, and about 8 more were not.

Thus there is about 80% chance that a bug report is not correct. Which does not mean that all bug reports should be ignored, however if a bug report has no sufficient information to replicate alleged bug then there is nothing I can do.

bcolflesh

6:19 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...the bot owner rejected my offer to supply further information

So you finally supplied him with the only info that would matter, the robots.txt file? And he rejected it?

fischermx

6:22 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't find any reasonable motivation to hit so hard when pointing out a bot's bug. Especially if the bot operator is showing up, explaining, apologizing, offering support and asking for proof so he can fix his bot.

fischermx

6:23 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bcolflesh, I've been following this thread. The guy never supplied the robots.txt; instead he offered him his advice as a consultant :P

jazzguy

6:25 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



...if a bug report has no sufficient information to replicate alleged bug then there is nothing I can do.

Rejecting offers to supply more information is another easy way to dismiss reported bugs. Again, more repetition of ground that has already been covered multiple times in this thread.

This 111-message thread spans 12 pages.