boitho.com bot violating robots.txt

Specifically requested only forbidden files

jazzguy

8:08 pm on May 5, 2005 (gmt 0)

"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targeted files disallowed in robots.txt, ignoring all other pages.

The info page says it's a distributed crawler, so, just as with my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.

Lord Majestic

10:57 pm on May 9, 2005 (gmt 0)

I've already responded to your robots.txt inquiry above and offered to answer any syntax questions.

You have not shown your robots.txt. What do you want me to do, guess it or something? Do you know how many possible robots.txt files are out there? My bot certainly had bugs, as every piece of software has; however, those that were reported were all fixed. Since you are refusing to show any evidence whatsoever to back your claims, there is not really much to talk about: if your course of action is to just ban bots, then do it silently, just don't spread information that you can't prove.

Posting on the subject is to benefit others.

If you wanted to benefit others then you would have posted your robots.txt and either helped us fix the supposed bugs, or fixed your own bugs. You refuse to do that, and this clearly shows you have no intention to help anybody.

Think about what you're asking. You're asking me to supply personally-identifiable information to an entity that has left evidence of malicious behavior on a site I administer.

robots.txt is not personal, and you don't have to give full URLs either, just the full paths that would allow them to be validated. Your excuses are becoming more and more ridiculous.

Personally I would not be so quick to dismiss a report of an error with my software, but everyone has their own policies.

I never dismissed any bug report in my software; however, if people refuse to say how they came across those bugs then I can't reproduce them, and therefore I can't help them. It's as ridiculous as expecting anything to be done after reporting that Windows crashes without giving any information about the circumstances.

I haven't seen any reason to lift the ban and your demeanor certainly does not help.

What are you, Amazon or something? I am afraid I would not have even noticed your site not being in the index, since there are plenty of sites out there, many billions of pages in fact. The reason I responded is to make sure that I fix any bugs in my code to avoid causing trouble to other people; however, your refusal to help speaks for itself.

The information is correct, I choose not to provide personally-identifiable information and I will post as I see fit.

Your claim has no substance, and your refusal to provide a simple robots.txt and a few URLs to quickly verify whether there is a bug gives low credibility to your report in my view.

jmccormac

11:00 pm on May 9, 2005 (gmt 0)

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

I think that webmasters of large directory sites are now looking at a bandwidth/results model when it comes to banning spiders. It is simply a question of whether the search engine in question delivers enough users for the amount of bandwidth it uses. If it does not, or is not a high enough profile search engine, then webmasters will ban it.

Regards...jmcc

Lord Majestic

11:04 pm on May 9, 2005 (gmt 0)

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

Can you define "hammered"? There is only one connection to server with compulsory delay between requests (1 second currently). I find it hard to believe it would have amounted to a DoS attack.

Also guys, with all due respect, if you don't tell anyone about problems then nobody would know about them. If you want to help yourself and others (as the original poster claims), then why not use the link from the referer to submit a bug report?

If it does not or is not a high enough profile search engine, then webmasters will ban it.

Fair enough, it's your choice, and if the main intention of your directory is to be crawled by G/M/Y then it's up to you to decide whether others can crawl it. That's why there is support for robots.txt, which makes it easy to tell a bot to avoid crawling URLs from your site.

It would also be fair to separate your desire of being crawled by major engines only from unsubstantiated accusations, and here I refer to the original poster who refuses to quickly set the record straight.

[edited by: Lord_Majestic at 11:11 pm (utc) on May 9, 2005]

jmccormac

11:10 pm on May 9, 2005 (gmt 0)

Can you define "hammered"?

Tried to download about 80K pages sequentially.

I find it hard to believe it would have amounted to a DoS attack.

It isn't up to you to decide. :) There was no randomisation or sporadic spidering - your bot just hammered away at the webserver like a braindead scraper program.

One possible mod would be to synch your bot with the timezone of the websites so that you could target the site in slack/off peak time. Busy sites would be better spidered at a slower speed. If it is a distributed spidering op, then a distributed URL/target list would be a far more webserver friendly solution.

Regards...jmcc

[edited by: jmccormac at 11:15 pm (utc) on May 9, 2005]

Lord Majestic

11:12 pm on May 9, 2005 (gmt 0)

Tried to download about 80K pages sequentially.

Over what period of time? And when did it happen? Was it trying to get the same URLs or different ones? Any details at all?

There was no randomisation or sporadic spidering - your bot just hammered away at the webserver like a braindead scraper program.

New URLs are limited to 5k per server, so the only explanation I have is that somehow bot was getting same URL(s) from your server. That piece of code had been used for over 6 months, and crawled almost 500 mln URLs (different ones, not from same site). Had there been a persistent error then it would have become self-evident, but I never heard any reports like that, and if someone crawled 80k URLs from my site then I would sure as hell contact them :)

Perhaps you have a number of redirects per URL?

If it is a distributed spidering op, then a distributed URL/target list would be a far more webserver friendly solution.

Yes I agree, it won't be easy to do right now though. Currently there is a limit of URLs per server (5k per big load), and 200 per work unit, so generally web server should not be hit for many URLs in a short time frame.

The exception could be if you employ lots of unique domains, but even so, there is a limit on the number of requests to the same IP (1).
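
For readers following the numbers: a minimal Python sketch of the two limits described in this post (5k URLs per server per load, 200 URLs per work unit). This is not MJ12bot's actual code; the function name `build_work_units`, grouping by hostname, and the order in which URLs are kept are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_URLS_PER_SERVER = 5000  # per "big load", figure quoted in the post above
WORK_UNIT_SIZE = 200        # URLs per work unit, figure quoted in the post above


def build_work_units(urls, per_server=MAX_URLS_PER_SERVER, unit_size=WORK_UNIT_SIZE):
    """Cap the number of URLs per host, then split the rest into work units.

    Hypothetical helper: grouping by hostname (rather than by IP) is an
    assumption; the thread does not say how "server" is identified.
    """
    taken = defaultdict(int)  # host -> URLs accepted so far
    accepted = []
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs without a usable host
        if taken[host] >= per_server:
            continue  # this host already reached its cap for this load
        taken[host] += 1
        accepted.append(url)
    # Slice the accepted URLs into fixed-size units for the crawl clients.
    return [accepted[i:i + unit_size] for i in range(0, len(accepted), unit_size)]
```

Note that a cap like this only bounds new URLs per load; it says nothing about repeated fetches of the same URL or about redirects, which are the cases raised above.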

jmccormac

11:33 pm on May 9, 2005 (gmt 0)

New URLs are limited to 5k per server, so the only explanation I have is that somehow bot was getting same URL(s) from your server.

The problem is that for a large directory, a program that is not in the main Tier 1 spider group (G/Y/M) requesting that many URLs is a thing that would worry webmasters.

if someone crawled 80k URLs from my site then I would sure as hell contact them :)

Yep but webmasters will shoot first and ask questions later. :) Basically the site contains details of nearly 100K Irish domains and websites. I could easily include the UK and most of the RIPE countries because the main business here is hoster/domain statistical reporting on all 650K+ hosters in com/net/org/biz/info/ie.

Yes I agree, it won't be easy to do right now though. Currently there is a limit of URLs per server (5k per big load), and 200 per work unit, so generally web server should not be hit for many URLs in a short time frame.

The thing is that in acting like a scraper, the spider will trigger any protective software. A human user will not request pages sequentially at the rate of 1 page per second.

Regards...jmcc

Lord Majestic

11:40 pm on May 9, 2005 (gmt 0)

The thing is that in acting like a scraper, the spider will trigger any protective software.

It's a bot and it does not hide it. Bots crawl URLs, you know; it's hard not to act like a bot when you are a bot. My bot supports the Crawl-Delay parameter, which it will pick up even if it was specified for some other bot (MSNbot), to deal with exactly this sort of issue :)

My personal take is that the search engine I am building has no interest whatsoever in directories: in a few months' time I hope to build a graph structure of the web to calculate ranks of pages for the search engine, as well as to identify huge directory sites that we have no interest in crawling. I can't avoid doing a zero day crawl though :(

Yep but webmasters will shoot first and ask questions later.

Hey guys, it's fine by me -- all I want is to try to make sure bugs, if any, are fixed, but to do that it really helps to get reasonably detailed information :)
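
The Crawl-Delay behaviour described in this post (honouring a delay even when it was written for another bot such as MSNbot) could look roughly like the Python sketch below. It is an illustration only, not MJ12bot's code: the function name `effective_crawl_delay` is hypothetical, and taking the largest delay found for any bot is an assumption. The 1 second default is the compulsory delay mentioned earlier in the thread.

```python
from urllib.robotparser import RobotFileParser


def effective_crawl_delay(robots_txt, agent, default=1.0):
    """Return the delay for `agent`, else the largest delay given to any bot.

    Hypothetical helper. Falling back to the maximum Crawl-delay found
    anywhere in the file is an assumed policy, chosen as the most cautious.
    """
    lines = robots_txt.splitlines()
    parser = RobotFileParser()
    parser.parse(lines)
    own = parser.crawl_delay(agent)  # None if nothing applies to this agent
    if own is not None:
        return float(own)
    delays = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "crawl-delay":
            try:
                delays.append(float(value.strip()))
            except ValueError:
                pass  # ignore malformed values
    return max(delays) if delays else default
```

A site owner who wants slower crawling from such a bot would therefore only need a `Crawl-delay` line under any `User-agent` group, under the assumptions stated above.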

jazzguy

11:42 pm on May 9, 2005 (gmt 0)

...and this clearly shows you have no intention to help anybody.

You seem to be confusing "have no intention to help anybody" with "have no intention to help you." The latter is the case. Beyond that, I think your rant above just repeats what's already been covered so I don't see a need to repeat myself again by responding to each point. You can just re-read my previous posts for responses to your most recent.

Your claim has no substance, and your refusal to provide a simple robots.txt and a few URLs to quickly verify whether there is a bug gives low credibility to your report in my view.

Every webmaster can make their own determination about the credibility of reports they read and compare it with their own logs. I happen to think that your ranting and insults weaken your case rather than help it, but to each his own.

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

I witnessed the same behavior which is what led to the initial ban. The permanent ban only came after the robots.txt violation. The MJ12bot owner's demeanor here just seals the deal for me.

Oh, and in case anybody reading is wondering, this thread was about the boitho.com bot.

Lord Majestic

11:48 pm on May 9, 2005 (gmt 0)

I happen to think that your ranting and insults weaken your case rather than help it, but to each his own.

Show your robots.txt + URLs (domain name is not necessary), and I will test it with my code. Heck, I am happy to release C# robots.txt module source code so that anybody can test it themselves. See how far I am more than prepared to go, and you call that rant?

The MJ12bot owner's demeanor here just seals the deal for me.

I am afraid I can't fix an alleged bug in code that is known to work well without a reproducible test case. Your refusal to provide the test case seals the deal for me: I can't help to fix a problem that I don't think exists. Historically the burden of proof was on the accuser, and it is even more so in software since there are just too many possibilities: tell the best programmer in the world that his software crashes and then refuse to tell the details and see what he tells you.

This thread was about boitho, and while it was not my bot I merely asked for the robots.txt, which you refused (and I shut up since it's not my bot), but then you refused one of the boitho guys too, after which you proceeded to accuse my bot of violations, yet again refusing to back your words with any evidence.

I act in good faith but I can't do anything with generic unfounded allegations like that. I think we've said enough for the readers to make up their own minds on this subject :)
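
The kind of test case being asked for here (a robots.txt plus a few paths, no domain name) can be checked by anyone with a standard parser. The sketch below uses Python's `urllib.robotparser` rather than the C# module mentioned in the post, so it shows how one reference parser reads a file, not how any particular bot does. The sample rules, the agent name `examplebot`, and the helper `check_paths` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up sample rules; a real report would paste the site's own file here.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def check_paths(robots_txt, agent, paths):
    """Map each path to True (may be fetched) or False (disallowed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


if __name__ == "__main__":
    for path, allowed in check_paths(
        SAMPLE_ROBOTS, "examplebot", ["/index.html", "/private/data.html"]
    ).items():
        print(path, "allowed" if allowed else "disallowed")
```

Comparing this result with the paths a bot actually requested in the server log is enough to show whether a disallowed path was fetched, without revealing the domain.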

runarb

11:58 pm on May 9, 2005 (gmt 0)

You say that your bot is supposed to obey robots.txt. Is that hardcoded or a user option?

Obeying robots.txt is not a user option.

Only the downloading of URLs is distributed. Which URLs to download is managed by a central server. This server manages a robots.txt cache and tests every URL that it sends out against it. If it does not have the robots.txt file, it asks the client to download it.

The central server also keeps track of all the servers that are being visited, to prevent any one server from being visited too often.
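
The design described in this post, where a central server holds a robots.txt cache and tests each URL before handing it to a client, might be sketched in Python as follows. This is not boitho's implementation: the class `Dispatcher`, its method names, and the task tuples are hypothetical, and the per-server visit tracking mentioned above is left out.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class Dispatcher:
    """Hypothetical central server: decides what a crawl client does next."""

    def __init__(self, agent):
        self.agent = agent
        self.robots_cache = {}  # "scheme://host" -> parsed robots.txt

    @staticmethod
    def _site(url):
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}"

    def store_robots(self, site, robots_txt):
        """Cache a robots.txt body that a client downloaded for `site`."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots_cache[site] = parser

    def next_task(self, url):
        """Return ('fetch_robots' | 'fetch_url' | 'skip', target)."""
        site = self._site(url)
        if site not in self.robots_cache:
            # No cached rules yet: ask the client for robots.txt first.
            return ("fetch_robots", site + "/robots.txt")
        if self.robots_cache[site].can_fetch(self.agent, url):
            return ("fetch_url", url)
        return ("skip", url)
```

Keeping the check on the central server means a client cannot be configured to ignore robots.txt, which matches the statement that obeying it is not a user option.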
