Forum Moderators: goodroi


boitho.com bot violating robots.txt

Specifically requested only forbidden files

         

jazzguy

8:08 pm on May 5, 2005 (gmt 0)

10+ Year Member



"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targeted files disallowed in robots.txt and ignored all other pages.

The info page says it's a distributed crawler, so, in line with my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.
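A ban of this kind can be sketched as a check on the user agent and the source address. This is only an illustration: the post does not say how large the banned block was or how the ban was implemented, so the /24 network and the `is_banned` helper below are assumptions.

```python
import ipaddress

# Assumption: the post does not give the block size. A /24 around the
# offending address is used here purely for illustration.
BANNED_NETWORKS = [ipaddress.ip_network("129.241.104.0/24")]
BANNED_AGENT_SUBSTRINGS = ["boitho.com-dc"]


def is_banned(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a banned agent or IP block."""
    agent = user_agent.lower()
    if any(token in agent for token in BANNED_AGENT_SUBSTRINGS):
        return True
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BANNED_NETWORKS)
```

Checking both the agent string and the address matters for a distributed crawler, since either one alone can change between requests.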

Lord Majestic

8:24 pm on May 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just out of interest, can you post your robots.txt?

jazzguy

8:44 pm on May 5, 2005 (gmt 0)

10+ Year Member



Just out of interest, can you post your robots.txt?

My robots.txt validates and has been in use for a while if that's what you're wondering about. If you have another question about it, just let me know.

WebmasterWorld categorizes my username as a new user, but I'm not actually a new webmaster.

Lord Majestic

9:40 pm on May 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Many robots.txt validators check syntax rather than substance, and so give an OK to robots.txt files that won't always do the job as intended. There was a discussion on here that addressed a number of these issues, which may or may not apply in your case, but without the robots.txt it's hard to say.
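As an illustration of syntax versus substance, the sketch below uses Python's standard `urllib.robotparser` to check what a file actually blocks for a given bot. The two robots.txt files, the `/trap/` path, and the `blocks` helper are hypothetical and are not taken from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: well-formed, but the rule only covers Googlebot.
NARROW_RULES = """\
User-agent: Googlebot
Disallow: /trap/
"""

# Hypothetical file: the same rule applied to every robot.
BROAD_RULES = """\
User-agent: *
Disallow: /trap/
"""


def blocks(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    agent = "boitho.com-dc/0.75"
    print(blocks(NARROW_RULES, agent, "/trap/page.html"))
    print(blocks(BROAD_RULES, agent, "/trap/page.html"))
```

Both files would pass a pure syntax check, yet only the second one tells every robot to stay out of `/trap/`.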

jazzguy

10:18 pm on May 5, 2005 (gmt 0)

10+ Year Member



Like I said, if you have a specific question about my robots.txt or its syntax, let me know. Robots.txt is not exactly rocket science -- its syntax is well-documented and is about as simple as you can get. I've been maintaining robots.txt files for years. They validate and all the major legitimate search engines don't have any problems obeying them.

This thread seems to be straying from its purpose, which was to give a heads-up about a rogue bot. In this case, the boitho.com bot specifically fetched spider trap URLs that were (and have always been) disallowed in robots.txt. The bot did not request any other files on the site.

runarb

2:01 pm on May 9, 2005 (gmt 0)

10+ Year Member



Hi

I am one of the people behind Boitho. The Boitho robot does follow the robot exclusion protocol, and should not crawl pages that are excluded.

Can you please send the URLs that were crawled, and tell me how old the robots.txt file is, to the mail address “runarb ( at ) boitho.com”? Then I can look in my logs to see what went wrong.

Information about the boitho robot is available here: [boitho.com...]

Regards
Runar Buvik

jazzguy

9:06 pm on May 9, 2005 (gmt 0)

10+ Year Member



Sorry, I don't provide URLs for software testing or log analysis, but as far as robots.txt age, your rogue bot specifically targeted some URLs that had been disallowed for years and others that had been disallowed for at least three months, if not more. In this case, it didn't seem to matter because the bot did not fetch robots.txt first and did not even request / or any other main URLs. It specifically targeted forbidden files and only forbidden files, which gives the appearance of malicious use.
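The pattern described here, a client that requests only disallowed paths and never fetches robots.txt, can be looked for in access logs. The sketch below is an illustration under stated assumptions: the `Hit` record shape, the `suspicious_clients` helper, and the idea of matching on path prefixes are hypothetical and do not come from the logs discussed in this thread.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One already-parsed access log entry (hypothetical shape)."""
    ip: str
    path: str


def suspicious_clients(hits: Iterable[Hit],
                       disallowed_prefixes: Tuple[str, ...]) -> Set[str]:
    """Return IPs that requested only disallowed paths and never robots.txt."""
    paths_by_ip = defaultdict(list)
    for hit in hits:
        paths_by_ip[hit.ip].append(hit.path)

    flagged = set()
    for ip, paths in paths_by_ip.items():
        fetched_robots = "/robots.txt" in paths
        # str.startswith accepts a tuple of prefixes.
        only_forbidden = all(p.startswith(disallowed_prefixes) for p in paths)
        if not fetched_robots and only_forbidden:
            flagged.add(ip)
    return flagged
```

A client that reads robots.txt, or that also requests ordinary pages, is not flagged by this check; it only isolates the "forbidden files and only forbidden files" case.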

You say that your bot is supposed to obey robots.txt. Is that hardcoded or a user option? If it's only an option, then the first malicious user that ignores robots.txt ends up getting your bot banned, as in this case. If it's hardcoded but buggy, then of course the result is the same.

Both Grub and MJ12bot have been banned from all sites that I administer because of robots.txt violations. Now your bot has been added to the list. Maybe not all webmasters will be as quick to ban misbehaving bots as I am, but I've seen too much abuse on my sites to grant leniency and I certainly don't have time to hand-hold every bot writer who thinks they might have the next big search engine.

While it's too late for Grub, boitho, and MJ12bot as far as my servers are concerned, the best suggestion I have for anyone else attempting to write a legitimate bot is to hardcode it to respect robots.txt and test it thoroughly against the spec before you release it. If you allow a user of your bot to override that or if you release your bot before you've corrected any robots.txt-related bugs, then you run the risk of having your bot summarily banned by a large number of webmasters and as a result, rendered useless.
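One way to read "hardcode it to respect robots.txt" is that the fetch path has no switch for skipping the check. The sketch below is a minimal illustration using Python's standard `urllib.robotparser`; the `PoliteFetcher` class and its injected `download` callable are hypothetical and do not describe how any bot in this thread is implemented.

```python
from typing import Callable, List, Optional
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Fetcher whose robots.txt check cannot be disabled by the caller."""

    def __init__(self, user_agent: str, robots_lines: List[str],
                 download: Callable[[str], str]) -> None:
        self._user_agent = user_agent
        self._download = download  # hypothetical transport function
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)

    def fetch(self, url: str) -> Optional[str]:
        """Download url, or return None if robots.txt disallows it."""
        # No flag or parameter bypasses this check.
        if not self._robots.can_fetch(self._user_agent, url):
            return None
        return self._download(url)
```

Passing the transport in as a callable also makes it easy to test the robots.txt logic thoroughly before release, without touching a live site.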

GaryK

9:23 pm on May 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is either funny or pathetic. This link caught my eye, so I tried doing a search on the above-referenced website and got a blank page with "XML error: syntax error at line 1" on it. If you want the keywords I used, sticky me.

EDIT: I forgot to mention the search was done from the bot page, not the main page.

Lord Majestic

9:31 pm on May 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Both Grub and MJ12bot have been banned from all sites that I administer because of robots.txt violations.

Now that you have mentioned my bot, I have to respond and ask you to provide the robots.txt that was supposedly violated by the bot.

the best suggestion I have for anyone else attempting to write a legitimate bot is to hardcode it to respect robots.txt and test it thoroughly against the spec before you release it.

It's hard-coded and it's not optional: users can't turn it off. The implementation is very robust and I have had only half a dozen reports (after ~500 million URLs crawled) of supposed robots.txt violations, of which only one was correct (that bug was fixed the same day).

If you can't back your words with your robots.txt (post it here, since it's the relevant forum), then it would be good manners not to accuse others of breaking the robots.txt spec.

I certainly don't have time to hand-hold every bot writer who thinks they might have the next big search engine.

You sure have time to post on the subject -- all you need to do is provide your current robots.txt or just sticky me the URL. If you refuse to do so little to set the record straight, then I have no choice but to consider your allegations false and kindly ask you to stop spreading incorrect information that you can't back up.

jazzguy

10:33 pm on May 9, 2005 (gmt 0)

10+ Year Member



Now that you have mentioned my bot, I have to respond and ask you to provide the robots.txt that was supposedly violated by the bot.

I've already responded to that inquiry above.

It's hard-coded and it's not optional: users can't turn it off. The implementation is very robust and I have had only half a dozen reports (after ~500 million URLs crawled) of supposed robots.txt violations, of which only one was correct (that bug was fixed the same day).

That sounds like you may have good intentions, but my logs show a violation and that's what I go by. And you just admitted that you have violated robots.txt on at least one occasion that was reported to you. I wonder how many webmasters just banned you outright like I did without filing a bug report or commenting.

If you can't back your words with your robots.txt (post it here, since it's the relevant forum), then it would be good manners not to accuse others of breaking the robots.txt spec.

What you regard as good or bad manners is not my concern. I've already responded to your robots.txt inquiry above and offered to answer any syntax questions.

You sure have time to post on the subject

Posting on the subject is to benefit others. I have no interest in helping you debug your bot even though I have offered to answer syntax questions.

all you need to do is provide your current robots.txt or just sticky me the URL.

Think about what you're asking. You're asking me to supply personally-identifiable information to an entity that has left evidence of malicious behavior on a site I administer. No thank you.

If you refuse to do so little to set the record straight, then I have no choice but to consider your allegations false

That's your prerogative. Personally I would not be so quick to dismiss a report of an error with my software, but everyone has their own policies. Of course, it's certainly possible that you may have corrected whatever bug caused your bot to violate my robots.txt, but so far I haven't seen any reason to lift the ban and your demeanor certainly does not help.

and kindly ask you to stop spreading incorrect information that you can't back up.

The information is correct. I choose not to provide personally-identifiable information, and I will post as I see fit.

This 111 message thread spans 12 pages.