boitho.com bot violating robots.txt

Forum Moderators: goodroi

Specifically requested only forbidden files

jazzguy

8:08 pm on May 5, 2005 (gmt 0)

"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targeted files disallowed in robots.txt, ignoring all other pages.

The info page says it's a distributed crawler, so, just like my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.
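
For anyone wanting to apply the same kind of ban in application code, here is a minimal Python sketch. The /24 range is an assumption made purely for illustration (the post does not say which block was banned), and `is_banned` is a hypothetical helper name.

```python
import ipaddress

# Assumption: the post does not state the banned range; a /24 around the
# reported address is used here only as an example.
BLOCKED_NETWORKS = [ipaddress.ip_network("129.241.104.0/24")]

# Substring of the user agent string quoted in the post, lowercased.
BLOCKED_AGENT_MARKERS = ["boitho.com-dc"]


def is_banned(remote_ip, user_agent):
    """Return True if the request comes from a banned network or user agent."""
    address = ipaddress.ip_address(remote_ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(marker in agent for marker in BLOCKED_AGENT_MARKERS)
```

Checking both the address and the user agent matters for a distributed crawler, since the same agent string can show up from addresses outside any single block.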

jazzguy

8:25 pm on May 10, 2005 (gmt 0)

I asked for robots.txt in the 2nd post in this thread, I fail to see any sarcasm in it

The sarcasm and insults came in later posts.

You told me off

Told you off? In post #3? I declined to post the robots.txt and instead offered to answer any questions you might have about it. I also clarified webmasterworld's "new user" status of my username in case that would help in any way.

clearly my bot does not just ignore robots.txt

I never said that it "just" ignored robots.txt. I said that on one of my sites it violated robots.txt.

and since I know for fact that it works in principle I can't just simulate a case that I have no idea about.

I'm not disputing that. The dispute here seems to be about what information was offered.

don't expect anything to happen with your "report" as it lacks credibility

I don't expect you to do anything with my report since you declined any further information that I offered.

due to lack of data minimally necessary to verify it.

That's debatable. You rejected my offer to supply more information, so there's no way to know if it would have been enough to verify my claim.

Anyway, I think I am going to publish code that does the job (checks if URL should not be retrieved for a given robots.txt) so that anybody who questions MJ12bot's support for robots.txt can see for themselves.

Good luck. Earlier in this thread, I even said it's possible that you've corrected whatever bug caused the violation in my logs. Unfortunately, we'll never know where the problem was because you rejected my offer of help. And you said you already corrected a bug that caused a violation of someone else's robots.txt. Maybe that was actually the same bug that affected my site. At this point it's just too late for your bot on sites that I administer.

Lord Majestic

8:37 pm on May 10, 2005 (gmt 0)

At this point it's just too late for your bot on sites that I administer.

But it's not too late for other people's sites. And if the purpose of your report is to help others then the first port of call is the person who created the bot, i.e. me, but since you refused to provide minimum information necessary to trace the problem (and you clearly never contacted me by email shown on page in the bot's user agent, as all such emails were resolved to mutual satisfaction of all sides) I have no choice but to rest my case: anyone who cares about bug being fixed would provide information necessary to do so, certainly if that information is publicly available.

My robots.txt function requires two pieces of data: robots.txt and URLs to check. If either is not present then I can't check it, end of story.
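
A check of that shape (a robots.txt plus the URLs to test) can be sketched with Python's standard `urllib.robotparser`. This is not MJ12bot's code, which is not shown in this thread; the function name `disallowed_urls`, the sample rules, and the sample URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def disallowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that robots_txt forbids user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical input: sample rules and URLs, not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
SAMPLE_URLS = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]
blocked = disallowed_urls(SAMPLE_ROBOTS_TXT, "ExampleBot/1.0", SAMPLE_URLS)
```

With both inputs in hand, a site owner and a bot author can run the same check and compare results instead of arguing from logs alone.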

jazzguy

8:53 pm on May 10, 2005 (gmt 0)

but since you refused to provide minimum information necessary to trace the problem

That point has already been raised and countered multiple times in this thread.

and you clearly never contacted me by email

Just another variation of a point that has already been raised and countered multiple times in this thread.

anyone who cares about bug being fixed would...

That point has already been raised and countered multiple times in this thread. You rejected my offer to provide you with more information.

My robots.txt function requires...

That point has already been raised and countered multiple times in this thread.

end of story.

That would be nice as it seems the same things keep getting repeated over and over.

bcolflesh

8:55 pm on May 10, 2005 (gmt 0)

Can I have the last word?

motorhaven

1:31 pm on Jun 9, 2005 (gmt 0)

I know for a fact this crawler doesn't correctly obey robots.txt. It fell into two spider traps on my server last night.

Boitho.... meet mod_rewrite!

motorhaven

1:47 pm on Jun 9, 2005 (gmt 0)

Oh... and before you deny it:

User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/

See those last two entries? Spider traps. Your crawler hit both of them. I believe an apology is due to at least one other member in this thread.

Additionally, I suggest taking this crawler off-line until it actually does comply -- otherwise you're going to find it in many mod_rewrite rulesets.
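
The idea behind a spider trap is that the trap paths are disallowed in robots.txt, so any client requesting them has ignored the file. A minimal Python sketch of flagging such clients follows. It assumes the requests have already been pulled out of the server log as (client IP, path) pairs; `trap_offenders` is a hypothetical helper, and only the two trap prefixes come from the post above.

```python
# The two trap entries named in the robots.txt above.
TRAP_PREFIXES = ("/thespot/", "/mailyou/")


def trap_offenders(requests):
    """Return the set of client IPs that requested any trap path.

    `requests` is an iterable of (client_ip, path) pairs. How the pairs are
    extracted from a server log is assumed and not shown here.
    """
    return {ip for ip, path in requests if path.startswith(TRAP_PREFIXES)}
```

The resulting set of addresses is what would then be fed into a ban list or a mod_rewrite ruleset.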

Romeo

2:59 pm on Jun 9, 2005 (gmt 0)

While skipping thru a pile of bot-trap backlogs here yesterday, I also noticed the boitho bot, as described by Jazzguy in post #1.
Came from 129.241.104.173 on 2005-05-06, did not bother to fetch a /robots.txt and unfortunately uncovered the trap ... sorry.

Regards,
R.

runarb

9:53 pm on Jun 12, 2005 (gmt 0)

The Boitho robot should not access your website without checking the robots.txt file.

To save bandwidth, the Boitho robot caches the robots.txt file locally. So it is probably not showing in your log near the other entries because we checked it some days ago.

Also, one crawler may fetch the robots.txt file and another the HTML page.

Can you post or PM me the URL so I can investigate?

jdMorgan

6:28 am on Jun 13, 2005 (gmt 0)

runarb,

I'd suggest you also take note of the "Expires" header of robots.txt if it is provided by the server, and mark that domain's cached robots.txt entry to expire if you would normally cache it beyond the expiry time.

My robots.txt is set to expire four hours after it is fetched. If boitho caches it for 24 hours, there is a chance that boitho will attempt to fetch a disallowed resource, and that would lead to a 403-Forbidden from my server.

During updates, I upload the new robots.txt in the morning, and change the files and access restriction checking in the late afternoon. Therefore, a four-hour expiry should prevent any robot that comprehends the expiry time from using a stale robots.txt. Due to production schedules, I cannot wait 24 hours after I receive notice of changes, so four hours is a practical limit.
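
As a rough illustration of the suggestion above, here is a Python sketch of how a crawler might pick the refetch time for a cached robots.txt. It is not boitho's code; `cache_expiry` is a hypothetical helper, and the 24-hour default is only the figure used in the example above, not a documented boitho setting.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed fallback lifetime when the server sends no usable Expires header.
DEFAULT_TTL = timedelta(hours=24)


def cache_expiry(fetched_at, expires_header=None, default_ttl=DEFAULT_TTL):
    """Return the time at which a cached robots.txt should be refetched.

    `fetched_at` must be a timezone-aware datetime. The result is the earlier
    of the server's Expires time and fetched_at + default_ttl.
    """
    fallback = fetched_at + default_ttl
    if not expires_header:
        return fallback
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the crawler's own lifetime.
        return fallback
    if expires.tzinfo is None:
        # Treat a date without zone information as UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    return min(expires, fallback)
```

Taking the earlier of the two times means the Expires header can only shorten the cache lifetime, never extend it past the crawler's own limit.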

Jim

larryhatch

7:14 am on Jun 13, 2005 (gmt 0)

Good detective work all!

Not to change the subject (from boitho) but I at least found some sort of search engine there.

Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!

What's the point of all that crawling if there is no search engine page? -Larry
