Forum Moderators: goodroi


boitho.com bot violating robots.txt

Specifically requested only forbidden files

         

jazzguy

8:08 pm on May 5, 2005 (gmt 0)

10+ Year Member



"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targetted disallowed files from robots.txt, ignoring all other pages.

The info page says it's a distributed crawler, so just as with the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.

jazzguy

8:25 pm on May 10, 2005 (gmt 0)

10+ Year Member



I asked for robots.txt in the 2nd post in this thread; I fail to see any sarcasm in it.

The sarcasm and insults came in later posts.

You told me off

Told you off? In post #3? I declined to post the robots.txt and instead offered to answer any questions you might have about it. I also clarified webmasterworld's "new user" status of my username in case that would help in any way.

clearly my bot does not just ignore robots.txt

I never said that it "just" ignored robots.txt. I said that on one of my sites it violated robots.txt.

and since I know for a fact that it works in principle, I can't just simulate a case that I have no idea about.

I'm not disputing that. The dispute here seems to be about what information was offered.

don't expect anything to happen with your "report" as it lacks credibility

I don't expect you to do anything with my report since you declined any further information that I offered.

due to lack of data minimally necessary to verify it.

That's debatable. You rejected my offer to supply more information, so there's no way to know if it would have been enough to verify my claim.

Anyway, I think I am going to publish code that does the job (checks if URL should not be retrieved for a given robots.txt) so that anybody who questions MJ12bot's support for robots.txt can see for themselves.

Good luck. Earlier in this thread, I even said it's possible that you've corrected whatever bug caused the violation in my logs. Unfortunately, we'll never know where the problem was because you rejected my offer of help. And you said you already corrected a bug that caused a violation of someone else's robots.txt. Maybe that was actually the same bug that affected my site. At this point it's just too late for your bot on sites that I administer.

Lord Majestic

8:37 pm on May 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At this point it's just too late for your bot on sites that I administer.

But it's not too late for other people's sites. And if the purpose of your report is to help others, then the first port of call is the person who created the bot, i.e. me. But since you refused to provide the minimum information necessary to trace the problem (and you clearly never contacted me at the email shown on the page linked in the bot's user agent, as all such emails were resolved to the mutual satisfaction of all sides), I have no choice but to rest my case: anyone who cares about a bug being fixed would provide the information necessary to do so, certainly if that information is publicly available.

My robots.txt function requires two pieces of data: the robots.txt file and the URLs to check. If these are not present then I can't check it, end of story.
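For anyone who wants to run this kind of check themselves, Python's standard library ships a robots.txt parser that takes exactly those two inputs. This is a minimal sketch, not MJ12bot's actual code; the robots.txt content and URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as text rather than fetched from a server.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) returns False for paths the rules disallow.
print(parser.can_fetch("MJ12bot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("MJ12bot", "http://example.com/public/page.html"))   # True
```

Given a site's real robots.txt and the URLs from its access log, the same two-input check reproduces exactly what a compliant crawler should have decided.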

jazzguy

8:53 pm on May 10, 2005 (gmt 0)

10+ Year Member



but since you refused to provide minimum information necessary to trace the problem

That point has already been raised and countered multiple times in this thread.

and you clearly never contacted me by email

Just another variation of a point that has already been raised and countered multiple times in this thread.

anyone who cares about bug being fixed would...

That point has already been raised and countered multiple times in this thread. You rejected my offer to provide you with more information.

My robots.txt function requires...

That point has already been raised and countered multiple times in this thread.

end of story.

That would be nice as it seems the same things keep getting repeated over and over.

bcolflesh

8:55 pm on May 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can I have the last word?

motorhaven

1:31 pm on Jun 9, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



I know for a fact this crawler doesn't correctly obey robots.txt. It fell into two spider traps on my server last night.

Boitho.... meet mod_rewrite!

motorhaven

1:47 pm on Jun 9, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



Oh... and before you deny it:

User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/

See those last two entries? Spider traps. Your crawler hit both of them. I believe an apology is due to at least one other member in this thread.

Additionally, I suggest taking this crawler off-line until it actually does comply -- otherwise you're going to find it in many mod_rewrite rulesets.
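A trap-plus-ban ruleset of the sort alluded to above can be sketched in .htaccess roughly like this. The bot name and trap path are examples only, not anyone's actual configuration:

```apache
RewriteEngine On

# Deny the offending user agent outright.
RewriteCond %{HTTP_USER_AGENT} boitho [NC]
RewriteRule .* - [F]

# Anything requesting the trap directory (disallowed in robots.txt,
# linked nowhere visible) gets a 403 as well.
RewriteRule ^mailyou/ - [F]
```

A compliant crawler never sees the trap because robots.txt forbids it; only a bot that ignores robots.txt trips the rule.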

Romeo

2:59 pm on Jun 9, 2005 (gmt 0)

10+ Year Member



While skipping thru a pile of bot-trap backlogs here yesterday, I also noticed the boitho bot, as described by Jazzguy in post #1.
Came from 129.241.104.173 on 2005-05-06, did not bother to fetch a /robots.txt and unfortunately uncovered the trap ... sorry.

Regards,
R.

runarb

9:53 pm on Jun 12, 2005 (gmt 0)

10+ Year Member



The Boitho robot should not access your website without checking the robots.txt file.

To save bandwidth the Boitho robot caches the robots.txt file locally. So it is probably not showing in your log near the other entries, because we checked it some days ago.

Also, one crawler may fetch the robots.txt file and another the HTML page.

Can you post or PM me the URL so I can investigate?

jdMorgan

6:28 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



runarb,

I'd suggest you also take note of the "Expires" header of robots.txt if it is provided by the server, and mark that domain's cached robots.txt entry to expire if you would normally cache it beyond the expiry time.

My robots.txt is set to expire four hours after it is fetched. If boitho caches it for 24 hours, there is a chance that boitho will attempt to fetch a disallowed resource, and that would lead to a 403-Forbidden from my server.

During updates, I upload the new robots.txt in the morning, and change the files and access restriction checking in the late afternoon. Therefore, a four-hour expiry should prevent any robot that comprehends the expiry time from using a stale robots.txt. Due to production schedules, I cannot wait 24 hours after I receive notice of changes, so four hours is a practical limit.
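A cache that honors the server's Expires header in this way could be sketched like so. This is illustrative only, not any crawler's actual code, and the function name and the 24-hour default TTL are assumptions:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_TTL = timedelta(hours=24)  # fallback when the server sends no Expires header

def cache_expiry(fetched_at, expires_header):
    """Return when a cached robots.txt should be refetched.

    Honors an Expires header if present and sooner than the default TTL.
    """
    default = fetched_at + DEFAULT_TTL
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            return default
        return min(expires, default)
    return default

now = datetime(2005, 6, 13, 12, 0, tzinfo=timezone.utc)
# Server says the file expires in four hours: honor that, not the 24-hour default.
print(cache_expiry(now, "Mon, 13 Jun 2005 16:00:00 GMT"))
```

With this check in place, a four-hour Expires on robots.txt would keep the crawler from acting on stale rules during the update window described above.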

Jim

larryhatch

7:14 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good detective work all!

Not to change the subject (from boitho) but I at least found some sort of search engine there.

Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!

What's the point of all that crawling if there is no search engine page? -Larry

larryhatch

7:31 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Motorhaven: " Oh... and before you deny it:

User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/

See those last two entries? Spider traps. Your crawler hit both of them. "

- - - -

I love it! Thanks for the trick.

NOW I understand some strange entries in my logs. I had the following in my robots.txt:

User-agent: *
Disallow: /airheads/ (I have a UFO site. Lots of airheads in UFO-land.)

I found a few user agents trying to access the non-existent subdirectory /airheads,
and didn't know what to make of it! Now I do. Thanks again, and back to Boitho. -Larry

Rogi

7:57 am on Jun 13, 2005 (gmt 0)

10+ Year Member



Ok, so has it been established that boitho did/will violate robots.txt?

It's a little hard to follow this thread with all the flaming going on. I don't think Boitho is malicious (my opinion based on the posts from someone who works on the bot), but if it's established that it disobeys robots.txt I'll ban it.

What about the other one? The one the whole flame war was all about... did Jazzguy end up proving the violation of robots.txt or was it all just a waste of content-building time reading it all?

rj87uk

8:12 am on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lord Majestic 1 - Jazzguy 0

I think that is very inappropriate behavior. I count at least Lord Majestic 6 - Jazzguy 0.

Jazzguy - Why not post your robots.txt?

Romeo

7:22 pm on Jun 13, 2005 (gmt 0)

10+ Year Member



Hi Runarb,

thank you for your reply.

I searched thru my server's logs, but could not find any reference to "boitho" trying to look at my /robots.txt for a period of more than 4 months before 2005-05-06:

grep -i boitho access_log.2005* | grep robot
--> no results

So the content of my /robots.txt does not seem to matter much.
However, here it is. It is very short, as it contains basically just the trap disallow (trap directory renamed) and one other statement:
----------------------------
User-agent: *
Disallow: /bottrap/
User-agent: ia_archiver
Disallow: /
----------------------------
This file has been in place unchanged since 2002-11-02. Other bots like googlebot, msnbot, Inktomi's Yahoo Slurp, etc., as well as the ia_archiver, don't have problems with it.

If your bot uses an ident other than "boitho" when fetching the /robots.txt, please let me know, and I will grep thru my logs again.

Regards,
R.

Lord Majestic

7:38 pm on Jun 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!

I can't post the URL here, but if you check our site now you will see a link to the Alpha version of the search engine: it was (alpha) released on 3/06/06.

It is still work in progress and the plan is to release weekly (or so) updates. I expect to see all crawled data indexed and available for searching by the end of summer.

I think that is very inappropriate behavior. I count at least Lord Majestic 6 - Jazzguy 0.

<snip></snip>

[edited by: ThomasB at 8:43 pm (utc) on June 14, 2005]
[edit reason] removed noise [/edit]

Lord Majestic

6:13 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...the bot owner once again confirmed that his bot has violated robots.txt before, so there you go.

So far 2 bug reports were correct, and about 8 more were not.

Thus there is about an 80% chance that any given bug report is not correct. That does not mean all bug reports should be ignored, but if a bug report has insufficient information to replicate the alleged bug, then there is nothing I can do.

bcolflesh

6:19 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...the bot owner rejected my offer to supply further information

So you finally supplied him with the only info that would matter, the robots.txt file? And he rejected it?

fischermx

6:22 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't find any reasonable motivation to hit so hard when pointing out a bot's bug, especially if the bot operator is showing up, explaining, apologizing, offering support, and asking for proof to fix his bot.

fischermx

6:23 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bcolflesh, I've been following this thread; the guy never supplied the robots.txt, instead he offered him his advice as a consultant :P

jazzguy

6:25 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



...if a bug report has no sufficient information to replicate alleged bug then there is nothing I can do.

Rejecting offers to supply more information is another easy way to dismiss reported bugs. Again, more repetition of ground that has already been covered multiple times in this thread.

bcolflesh

6:33 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Rejecting offers to supply more information...

So he rejected the robots.txt file you sent him?

Lord Majestic

6:33 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Rejecting offers to supply more information is another easy way to dismiss reported bugs.

The information that matters is the robots.txt file and URLs to check that you refused to provide.

It's like saying that a browser is not rendering a page correctly and then refusing to provide the HTML to verify the bug report.

jazzguy

6:38 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



So you finally supplied him with the only info that would matter

<Sigh> As has been covered already in this thread, there's no way to know if the information I offered would have mattered because the offer itself was rejected -- no information was ever examined.

the guy never supplied the robots.txt

Already covered multiple times in this thread.

instead he offered him his advise as consultant :P

Your sarcastic remark highlights a big part of what caused this thread to degenerate so quickly, i.e., assuming that a report is inaccurate or that a reporter is a rookie.

Lord Majestic

6:41 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Already covered multiple times in this thread.

It's a rather safe guess that it would have taken you less time to re-type the whole of your robots.txt manually, using the small finger on your left hand, than to post as much as you have in this thread.

jazzguy

6:43 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



So he rejected the robots.txt file you sent him?

Already covered multiple times in this thread. Since this is an old thread, I think some people might benefit from re-reading it before they post as most of the recent questions have already been covered.

fischermx

6:43 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From my 20 years of programming experience, I can tell you there's nothing more annoying than a user complaining about bugs without showing proof.

bcolflesh

6:47 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...or that a reporter is a rookie

A typical rookie mistake is not providing the necessary data to fix a problem - ex: telling someone their spider is broken, then not providing their robots.txt and a sample URL.

jazzguy

6:49 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



It's a rather safe guess

More assumptions.

that it would have taken you less time to re-type the whole of your robots.txt manually, using the small finger on your left hand, than to post as much as you have in this thread.

Okay, now you're just trolling. Supplying the robots.txt has already been covered multiple times in this thread. Do you really want to spend another three pages saying the same things over and over again?

Lord Majestic

6:51 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A typical rookie mistake is not providing the necessary data to fix a problem

I always ask for the specific information that helps track a bug, and asking for robots.txt is pretty much a given, so much so that I have never come across someone actually refusing to show it -- jazzman is the first, and hopefully the last.

It's as if the guy does not want his problem fixed, or the problem did not exist in the first place. Note that he started this thread accusing the boitho spider of violating robots.txt, but then quickly switched to bashing my bot, in both cases not providing any details that could reasonably help investigate the problem.

jazzguy

6:51 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



A typical rookie mistake is not providing the necessary data to fix a problem - ex: telling someone their spider is broken, then not providing their robots.txt and a sample URL.

You're way behind -- try to keep up. That was covered way back at the beginning of the thread.
