
Sitemaps, Meta Data, and robots.txt Forum

boitho.com bot violating robots.txt
Specifically requested only forbidden files
jazzguy
msg:1527363 - 8:08 pm on May 5, 2005 (gmt 0)

"boitho.com-dc/0.75 ( http*//www.boitho.com/dcbot.html )" came from 129.241.104.168. It specifically targetted disallowed files from robots.txt, ignoring all other pages.

The info page says it's a distributed crawler, so just like my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.

Lord Majestic
msg:1527364 - 8:24 pm on May 5, 2005 (gmt 0)

Just out of interest, can you post your robots.txt?

jazzguy
msg:1527365 - 8:44 pm on May 5, 2005 (gmt 0)

Just out of interest, can you post your robots.txt?

My robots.txt validates and has been in use for a while if that's what you're wondering about. If you have another question about it, just let me know.

WebmasterWorld categorizes my username as a new user, but I'm not actually a new webmaster.

Lord Majestic
msg:1527366 - 9:40 pm on May 5, 2005 (gmt 0)

Many robots.txt validators check syntax rather than substance, and thus give an OK to robots.txt files that won't always do the job as intended. There was a discussion on here that addressed a number of these issues, which may or may not have applied in your case, but without the robots.txt it's hard to say.
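For example, both Disallow lines in this illustrative file (not the poster's actual robots.txt) pass a typical syntax check, yet neither blocks /private/page.html, because matching is a case-sensitive prefix comparison against the URL path:

User-agent: *
Disallow: private/ # no leading slash, so it never matches a URL path
Disallow: /Private/ # case differs, so /private/ is still crawlable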

jazzguy
msg:1527367 - 10:18 pm on May 5, 2005 (gmt 0)

Like I said, if you have a specific question about my robots.txt or its syntax, let me know. Robots.txt is not exactly rocket science -- its syntax is well-documented and is about as simple as you can get. I've been maintaining robots.txt files for years. They validate, and all the major legitimate search engines have no problems obeying them.

This thread seems to be straying from its purpose, which was to give a heads-up about a rogue bot. In this case, the boitho.com bot specifically fetched spider trap URLs that were (and have always been) disallowed in robots.txt. The bot did not request any other files on the site.

runarb
msg:1527368 - 2:01 pm on May 9, 2005 (gmt 0)

Hi

I am one of the people behind Boitho. The Boitho robot does follow the robots exclusion protocol, and should not crawl pages that are excluded.

Can you please send the URLs that were crawled, and tell me how old the robots.txt file is, to the mail address “runarb ( at ) boitho.com”? Then I can look in my logs to see what went wrong.

Information about the boitho robot is available here: [boitho.com...]

Regards
Runar Buvik

jazzguy
msg:1527369 - 9:06 pm on May 9, 2005 (gmt 0)

Sorry, I don't provide URLs for software testing or log analysis, but as for the age of the robots.txt: your rogue bot specifically targeted some URLs that had been disallowed for years and others that had been disallowed for at least three months, if not more. In this case, it didn't seem to matter, because the bot did not fetch robots.txt first and did not even request / or any other main URLs. It specifically targeted forbidden files and only forbidden files, which gives the appearance of malicious use.

You say that your bot is supposed to obey robots.txt. Is that hardcoded or a user option? If it's only an option, then the first malicious user that ignores robots.txt ends up getting your bot banned as in this case. If it's hardcoded but buggy, then of course the same result.

Both Grub and MJ12bot have been banned from all sites that I administer because of robots.txt violations. Now your bot has been added to the list. Maybe not all webmasters will be as quick to ban misbehaving bots as I am, but I've seen too much abuse on my sites to grant leniency and I certainly don't have time to hand-hold every bot writer who thinks they might have the next big search engine.

While it's too late for Grub, Boitho, and MJ12bot as far as my servers are concerned, the best suggestion I have for anyone else attempting to write a legitimate bot is to hardcode it to respect robots.txt and test it thoroughly against the spec before you release it. If you allow a user of your bot to override that, or if you release your bot before you've corrected any robots.txt-related bugs, then you run the risk of having your bot summarily banned by a large number of webmasters and, as a result, rendered useless.

GaryK
msg:1527370 - 9:23 pm on May 9, 2005 (gmt 0)

This is either funny or pathetic. This link caught my eye, so I tried doing a search on the above-referenced website and got a blank page with "XML error: syntax error at line 1" on it. If you want the keywords I used, sticky me.

EDIT: I forgot to mention the search was done from the bot page, not the main page.

Lord Majestic
msg:1527371 - 9:31 pm on May 9, 2005 (gmt 0)

Both Grub and MJ12bot have been banned from all sites that I administer because of robots.txt violations.

Now that you have mentioned my bot, I have to respond and ask you to provide the robots.txt that was supposedly violated by the bot.

the best suggestion I have for anyone else attempting to write a legitimate bot is to hardcode it to respect robots.txt and test it thoroughly against the spec before you release it.

It's hard-coded and it's not optional: users can't turn it off. The implementation is very robust, and I have had only half a dozen reports (after ~500 mln URLs crawled) of supposed robots.txt violations, of which only one was correct (that bug was fixed the same day).

If you can't back your words with the robots.txt (post it here, since this is the relevant forum), then it would be good manners not to accuse others of breaking the robots.txt spec.

I certainly don't have time to hand-hold every bot writer who thinks they might have the next big search engine.

You sure have time to post on the subject -- all you need to do is provide your current robots.txt, or just sticky me the URL. If you refuse to do so little to set the record straight, then I have no choice but to consider your allegations false and kindly ask you to stop spreading incorrect information that you can't back up.

jazzguy
msg:1527372 - 10:33 pm on May 9, 2005 (gmt 0)

Now that you have mentioned my bot, I have to respond and ask you to provide the robots.txt that was supposedly violated by the bot.

I've already responded to that inquiry above.

It's hard-coded and it's not optional: users can't turn it off. The implementation is very robust, and I have had only half a dozen reports (after ~500 mln URLs crawled) of supposed robots.txt violations, of which only one was correct (that bug was fixed the same day).

That sounds like you may have good intentions, but my logs show a violation and that's what I go by. And you just admitted that you have violated robots.txt on at least one occasion that was reported to you. I wonder how many webmasters just banned you outright like I did without filing a bug report or commenting.

If you can't back your words with the robots.txt (post it here, since this is the relevant forum), then it would be good manners not to accuse others of breaking the robots.txt spec.

What you regard as good or bad manner is not my concern. I've already responded to your robots.txt inquiry above and offered to answer any syntax questions.

You sure have time to post on the subject

Posting on the subject is to benefit others. I have no interest in helping you debug your bot even though I have offered to answer syntax questions.

all you need to do is provide your current robots.txt, or just sticky me the URL.

Think about what you're asking. You're asking me to supply personally-identifiable information to an entity that has left evidence of malicious behavior on a site I administer. No thank you.

If you refuse to do so little to set the record straight, then I have no choice but to consider your allegations false

That's your prerogative. Personally I would not be so quick to dismiss a report of an error with my software, but everyone has their own policies. Of course, it's certainly possible that you may have corrected whatever bug caused your bot to violate my robots.txt, but so far I haven't seen any reason to lift the ban and your demeanor certainly does not help.

and kindly ask you to stop spreading incorrect information that you can't back up.

The information is correct; I choose not to provide personally-identifiable information, and I will post as I see fit.

Lord Majestic
msg:1527373 - 10:57 pm on May 9, 2005 (gmt 0)

I've already responded to your robots.txt inquiry above and offered to answer any syntax questions.

You have not shown your robots.txt -- what do you want, me to guess it or something? Do you know how many possible robots.txt files are out there? My bot has certainly had bugs, as every piece of software has; however, those that were reported were all fixed, and since you are refusing to show any evidence whatsoever to back your claims, there is not really much to talk about: if your course of action is to just ban bots, then do it silently; just don't spread information that you can't prove.

Posting on the subject is to benefit others.

If you wanted to benefit others then you would have posted your robots.txt and either helped us fix supposed bugs, or fixed your own. You refuse to do that, and this clearly shows you have no intention to help anybody.

Think about what you're asking. You're asking me to supply personally-identifiable information to an entity that has left evidence of malicious behavior on a site I administer.

robots.txt is not personal, and you don't have to give full URLs either, just the full paths, which would allow them to be validated. Your excuses are becoming more and more ridiculous.

Personally I would not be so quick to dismiss a report of an error with my software, but everyone has their own policies.

I have never dismissed any bug report for my software; however, if people refuse to say how they came across those bugs then I can't reproduce them, and therefore I can't help them. It's as ridiculous as expecting anything to be done after reporting that Windows crashes without giving any information about the circumstances.

I haven't seen any reason to lift the ban and your demeanor certainly does not help.

What are you, Amazon or something? I am afraid I would not even have noticed your site not being in the index, since there are plenty of sites out there, many billions of pages in fact. The reason I responded is to make sure that I fix any bugs in my code, to avoid causing trouble for other people; however, your refusal to help speaks for itself.

The information is correct; I choose not to provide personally-identifiable information, and I will post as I see fit.

Your claim has no substance, and your refusal to provide a simple robots.txt and a few URLs to quickly verify whether there is a bug gives your report low credibility in my view.

jmccormac
msg:1527374 - 11:00 pm on May 9, 2005 (gmt 0)

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

I think that webmasters of large directory sites are now looking at a bandwidth/results model when it comes to banning spiders. It is simply whether the search engine in question delivers users for the amount of bandwidth it uses. If it does not, or is not a high enough profile search engine, then webmasters will ban it.

Regards...jmcc

Lord Majestic
msg:1527375 - 11:04 pm on May 9, 2005 (gmt 0)

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

Can you define "hammered"? There is only one connection to the server, with a compulsory delay between requests (currently 1 second). I find it hard to believe it would have amounted to a DoS attack.

Also, guys, with all due respect: if you don't tell us about problems then nobody will know about them. If you want to help yourself and others (as the original poster claims), then why not use the link in the referrer to submit a bug report?

If it does not, or is not a high enough profile search engine, then webmasters will ban it.

Fair enough; it's your choice, and if the main intention of your directory is to be crawled by G/M/Y, then it's up to you to decide whether others can crawl it. That's why there is support for robots.txt, which allows you to easily tell a bot to avoid crawling URLs from your site.

It would also be fair to separate your desire to be crawled only by the major engines from unsubstantiated accusations, and here I refer to the original poster, who refuses to quickly set the record straight.

[edited by: Lord_Majestic at 11:11 pm (utc) on May 9, 2005]

jmccormac
msg:1527376 - 11:10 pm on May 9, 2005 (gmt 0)

Can you define "hammered"?

Tried to download about 80K pages sequentially.

I find it hard to believe it would have amounted to a DoS attack.

It isn't up to you to decide. :) There was no randomisation or sporadic spidering - your bot just hammered away at the webserver like a braindead scraper program.

One possible mod would be to sync your bot with the timezone of the websites so that you could target a site in its slack/off-peak time. Busy sites would be better spidered at a slower speed. If it is a distributed spidering op, then a distributed URL/target list would be a far more webserver-friendly solution.

Regards...jmcc

[edited by: jmccormac at 11:15 pm (utc) on May 9, 2005]

Lord Majestic
msg:1527377 - 11:12 pm on May 9, 2005 (gmt 0)

Tried to download about 80K pages sequentially.

Over what period of time? And when did it happen? Was it trying to get the same URLs or different ones? Any details at all?

There was no randomisation or sporadic spidering - your bot just hammered away at the webserver like a braindead scraper program.

New URLs are limited to 5k per server, so the only explanation I have is that somehow the bot was getting the same URL(s) from your server. That piece of code has been used for over 6 months and has crawled almost 500 mln URLs (different ones, not from the same site). Had there been a persistent error it would have become self-evident, but I never heard any reports like that, and if someone crawled 80k URLs from my site then I would sure as hell contact them :)

Perhaps you have a number of redirects per URL?

If it is a distributed spidering op, then a distributed URL/target list would be a far more webserver-friendly solution.

Yes I agree, it won't be easy to do right now though. Currently there is a limit of URLs per server (5k per big load), and 200 per work unit, so generally a web server should not be hit with many URLs in a short time frame.

An exception could be if you employ lots of unique domains, but even so, there is a limit on the number of requests to the same IP (1).
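To illustrate the throttling being described here, a rough sketch in Python (the names and structure are invented for illustration and the numbers are taken from this post; this is not MJ12bot's actual code):

import time
import urllib.request

MAX_URLS_PER_SERVER = 5000  # "5k per big load"
REQUEST_DELAY = 1.0         # compulsory delay between requests, in seconds

def polite_crawl(urls_by_host):
    # Hosts are crawled sequentially, one connection at a time, so a
    # server never sees parallel requests from this crawler.
    for host, urls in urls_by_host.items():
        for url in urls[:MAX_URLS_PER_SERVER]:  # per-server budget
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    page = resp.read()  # hand the page off for processing
            except OSError:
                pass  # log the failure and move on
            time.sleep(REQUEST_DELAY)  # enforce the inter-request delay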

jmccormac
msg:1527378 - 11:33 pm on May 9, 2005 (gmt 0)

New URLs are limited to 5k per server, so the only explanation I have is that somehow the bot was getting the same URL(s) from your server.

The problem is that for a large directory, a program that is not in the main Tier 1 spider group (G/Y/M) requesting that many URLs is something that will worry webmasters.

if someone crawled 80k URLs from my site then I would sure as hell contact them :)

Yep but webmasters will shoot first and ask questions later. :) Basically the site contains details of nearly 100K Irish domains and websites. I could easily include the UK and most of the RIPE countries, because the main business here is hoster/domain statistical reporting on all 650K+ hosters in com/net/org/biz/info/ie.

Yes I agree, it won't be easy to do right now though. Currently there is a limit of URLs per server (5k per big load), and 200 per work unit, so generally a web server should not be hit with many URLs in a short time frame.

The thing is that in acting like a scraper, the spider will trigger any protective software. A human user will not request pages sequentially at the rate of 1 page per second.

Regards...jmcc

Lord Majestic
msg:1527379 - 11:40 pm on May 9, 2005 (gmt 0)

The thing is that in acting like a scraper, the spider will trigger any protective software.

It's a bot and it does not hide it; bots crawl URLs, you know -- it's hard not to act like a bot when you are a bot. My bot supports the Crawl-Delay parameter, which it will pick up even if it was specified for some other bot (MSNbot), to deal with exactly this sort of issue :)
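A sketch of that fallback logic, in Python for illustration (the real bot is C#, and this parser is deliberately simplified):

def crawl_delay(robots_txt, own_agent="mj12bot", borrow_from="msnbot"):
    # Collect Crawl-delay values per user-agent group, then prefer a value
    # addressed to us, borrowing another bot's (e.g. msnbot's) otherwise.
    delays, agents, in_body = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_body:                      # a new record starts here
                agents, in_body = [], False
            agents.append(value.lower())
        else:
            in_body = True
            if field == "crawl-delay":
                try:
                    for agent in agents:
                        delays[agent] = float(value)
                except ValueError:
                    pass                     # ignore malformed values
    return delays.get(own_agent, delays.get(borrow_from))

print(crawl_delay("User-agent: msnbot\nCrawl-delay: 10"))  # prints 10.0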

My personal take is that the search engine I am building has no interest whatsoever in directories: in a few months' time I hope to build a graph structure of the web to calculate page ranks for the search engine, as well as to identify huge directory sites that we have no interest in crawling. I can't avoid doing a zero-day crawl though :(

Yep but webmasters will shoot first and ask questions later.

Hey guys, it's fine by me -- all I want is to make sure that bugs, if any, are fixed, but to do that it really helps to get reasonably detailed information :)

jazzguy
msg:1527380 - 11:42 pm on May 9, 2005 (gmt 0)

...and this clearly shows you have no intention to help anybody.

You seem to be confusing "have no intention to help anybody" with "have no intention to help you." The latter is the case. Beyond that, I think your rant above just repeats what's already been covered, so I don't see a need to respond to each point again. You can just re-read my previous posts for responses to your most recent.

Your claim has no substance, and your refusal to provide a simple robots.txt and a few URLs to quickly verify whether there is a bug gives your report low credibility in my view.

Every webmaster can make their own determination about the credibility of reports they read and compare it with their own logs. I happen to think that your ranting and insults weaken your case rather than help it, but to each his own.

I had to ban the MJ12bot as well. Not for robots.txt violations but because one of its distributed clients hammered the webserver here and was in effect causing a denial of service attack.

I witnessed the same behavior, which is what led to the initial ban. The permanent ban only came after the robots.txt violation. The MJ12bot owner's demeanor here just seals the deal for me.

Oh, and in case anybody reading is wondering, this thread was about the boitho.com bot.

Lord Majestic
msg:1527381 - 11:48 pm on May 9, 2005 (gmt 0)

I happen to think that your ranting and insults weaken your case rather than help it, but to each his own.

Show me your robots.txt + URLs (the domain name is not necessary), and I will test it with my code. Heck, I am happy to release the C# robots.txt module source code so that anybody can test it themselves. See how far I am prepared to go -- and you call that a rant?

The MJ12bot owner's demeanor here just seals the deal for me.

I am afraid I can't fix an alleged bug in code that is known to work well without a reproducible test case. Your refusal to provide the test case seals the deal for me: I can't help fix a problem that I don't think exists. Historically the burden of proof has been on the accuser, and it is even more so in software, since there are just too many possibilities: tell the best programmer in the world that his software crashes, then refuse to give the details, and see what he tells you.

This thread was about boitho, and while it was not my bot I merely asked for the robots.txt, which you refused (and I shut up, since it's not my bot), but then you refused one of the boitho guys too, after which you proceeded to accuse my bot of violations, yet again refusing to back your words with any evidence.

I act in good faith but I can't do anything with generic unfounded allegations like that. I think we've said enough for the readers to make up their own minds on this subject :)

runarb
msg:1527382 - 11:58 pm on May 9, 2005 (gmt 0)

You say that your bot is supposed to obey robots.txt. Is that hardcoded or a user option?

Obeying robots.txt is not a user option.

Only the downloading of URLs is distributed. Which URLs to download is managed by a central server. This server manages a robots.txt cache and tests every URL that it sends out against it. If it does not have the robots.txt file, it asks the client to download it.

The central server also keeps track of all the servers being visited, to prevent any one server from being visited too often.
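As a rough Python sketch of that flow (the class and method names are invented for illustration; Boitho's real implementation is its own):

from urllib import robotparser

class CentralScheduler:
    def __init__(self):
        self.robots_cache = {}  # host -> parsed robots.txt

    def next_action(self, host, url, agent="boitho.com-dc"):
        # Every URL is tested against the cached robots.txt before it is
        # handed to a client for downloading.
        parser = self.robots_cache.get(host)
        if parser is None:
            # Cache miss: ask the client to fetch robots.txt first.
            return ("FETCH_ROBOTS", "http://%s/robots.txt" % host)
        if parser.can_fetch(agent, url):
            return ("FETCH", url)
        return ("SKIP", url)  # excluded URL: never dispatched

    def store_robots(self, host, robots_txt):
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots_cache[host] = parser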

jazzguy
msg:1527383 - 12:06 am on May 10, 2005 (gmt 0)

MJ12bot owner,

The problem seems to be that you are making inaccurate assumptions about the reasons for my post. I didn't post here to help you debug your bot (although I offered multiple times to answer specific syntax-related questions), I posted here to document a robots.txt violation. I think it is very telling that rather than accept whatever help is offered, you go on the defensive and attack anyone that doesn't submit to your testing procedures.

As a result, my offer to answer syntax-related questions is now withdrawn.

Lord Majestic
msg:1527384 - 12:16 am on May 10, 2005 (gmt 0)

I posted here to document a robots.txt violation

And you have not provided the least bit of information about what your robots.txt looks like; just how that qualifies as "documenting" is beyond me :(

you go on the defensive and attack anyone that doesn't submit to your testing procedures.

It's as if I asked you for something weird, like a sample of your DNA: robots.txt compliance can be tested with the following (see the sketch below):
1) the robots.txt in question
2) a list of URLs to be tested

If you know some better way of tracking bugs then be my guest - offer your solution. :)
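For the record, those two inputs are exactly what a test harness consumes. A sketch using Python's standard-library parser as the reference implementation (the file names are placeholders, and this illustrates the test, not MJ12bot's own code):

from urllib import robotparser

def check(robots_path, urls_path, agent="mj12bot"):
    rp = robotparser.RobotFileParser()
    with open(robots_path) as f:
        rp.parse(f.read().splitlines())  # 1) the robots.txt in question
    with open(urls_path) as f:
        for url in f.read().split():     # 2) the URLs to be tested
            print("allowed" if rp.can_fetch(agent, url) else "DISALLOWED", url)

check("robots.txt", "urls.txt")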

As a result, my offer to answer syntax-related questions is now withdrawn.

Oh dear, but what exactly did you expect me to ask? It's like you have a million uniquely numbered balls mixed together, you pull a few of them to hide in your hands, and then you ask me to guess the numbers: just how unreasonable is that? :(

Just to show some good will I will try:

1) Do you have trailing /'s in the Disallow statements in your robots.txt, e.g.:

User-agent: *
Disallow: /somedir/

2) What is the HTTP response code on a HEAD request for your robots.txt, e.g. (using Cygwin's HEAD utility):

HEAD [example.com...]

3) Do you use Unix-style line endings in your robots.txt?

4) Do you have a robots.txt on all subdomains of your site?

Let it be clear that I certainly put some effort in it :)
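Checks 2 and 3 above can even be scripted. A Python sketch (the URL is a placeholder; this merely automates those two questions):

import urllib.request

ROBOTS_URL = "http://www.example.com/robots.txt"  # placeholder host

# Question 2: response code on a HEAD request for robots.txt.
req = urllib.request.Request(ROBOTS_URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("HEAD status:", resp.status)

# Question 3: Unix (\n) or Windows (\r\n) line endings?
with urllib.request.urlopen(ROBOTS_URL) as resp:
    body = resp.read()
print("line endings:", "windows" if b"\r\n" in body else "unix")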

jazzguy
msg:1527385 - 5:26 pm on May 10, 2005 (gmt 0)

I figured you were the "has to have the last word" type, so I thought I would just let your last post stand, but then you went and edited more spin into it.

I posted here to document a robots.txt violation

And you have not provided the least bit of information about what your robots.txt looks like; just how that qualifies as "documenting" is beyond me

I offered to supply more information multiple times; you threw a hissy fit and wouldn't accept that offer.

It's as if I asked you for something weird, like a sample of your DNA

A perfect example of why I'm not inclined to help you debug your software and why I rescinded my offer to help you. Your posts are filled with insults and sarcasm. Your behavior reminds me of that of spammers when they get reported -- shift the blame, insist on personal details.

If you know some better way of tracking bugs then be my guest - offer your solution.

Your lack of reading comprehension is disturbing. Or maybe it's just denial.

Just to show some good will I will try:

Let it be clear that I certainly put some effort in it

You've got to be kidding. Your "effort" and "good will" only came on page 2 of this thread and only after I had already withdrawn my offer because of your behavior. And your "effort" and "good will" post was still littered with insulting sarcasm.

Although I thought we were done before you went and edited more spin into your last post, I guess you'll still feel a need to have the last word, so go right ahead. If it's just more repetition of what's already been covered, my participation in this thread is done.

Lord Majestic
msg:1527386 - 5:42 pm on May 10, 2005 (gmt 0)

Your behavior reminds me of that of spammers when they get reported -- shift the blame, insist on personal details.

I hope you are not reporting spammers the same way you report bugs here by just stating that server X is spamming without offering any proof of your words whatsoever.

I can't validate a bug report without details, and your refusal to provide the robots.txt + URLs to validate your claim leaves me no other choice but to ignore your bug report. I wash my hands now: I did all I could to sort this problem, including offering to publish the robots.txt checking code (for peer review), and if that's not enough then tough luck: I have better things to do than to argue with someone who refuses to substantiate their claim.

I am forced to post all this to ensure that whoever comes across a report of MJ12bot not obeying robots.txt will see who here was cooperative and who was not. No sarcasm intended :)

jazzguy
msg:1527387 - 6:14 pm on May 10, 2005 (gmt 0)

I am forced to post all this to ensure that whoever comes across a report of MJ12bot not obeying robots.txt will see who here was cooperative and who was not.

Oh, what the heck -- if you're going to keep spinning my actions over and over again, I guess I'll have to keep setting the record straight. You suggest that I wasn't being cooperative by not complying with your debugging procedures. I say you were being uncooperative by not accepting the help that I offered you. Is there any real need to keep hashing that out, or do you think readers of the thread can decide for themselves?

I hope you are not reporting spammers the same way you report bugs here by just stating that server X is spamming without offering any proof of your words whatsoever.

More repetition. I did offer more proof. You chose not to accept that offer.

I can't validate a bug report without details, and your refusal to provide the robots.txt + URLs to validate your claim leaves me no other choice but to ignore your bug report.

That's more repetition. I already responded to that previously.

I wash my hands now: I did all I could to sort this problem

Not exactly. You refused and belittled my offer to supply you with more information.

if that's not enough then tough luck

Tough luck for whom? The website administrators who haven't yet banned your bot? I couldn't care less whether you fix the bugs in your software.

I have better things to do than to argue with someone who refuses to substantiate their claim.

But apparently not better things to do than keep putting your spin on offers of substantiation that were rejected.

Lord Majestic
msg:1527388 - 6:28 pm on May 10, 2005 (gmt 0)

You suggest that I wasn't being cooperative by not complying with your debugging procedures.

Please post here what information in your view is necessary to ascertain whether there is or there is not a bug in a robots.txt code used by my bot. You clearly know more about software debugging than me, so please share your knowledge for the benefit of all.

My view is probably old-fashioned, but the following data is minimally needed to achieve the required result:

1) robots.txt in question
2) URLs to check

You see, I am not trying to be funny here, but if the code takes as input a robots.txt + URLs to validate, then that's what is minimally required to verify whether it works or not.

Just who is spinning here: me, when I ask for the minimum data required to verify the code, or you, who refuses to provide this publicly available data? It's not like I am asking for your credit card details to validate a credit card payment module! Just who is being unreasonable here?

I think it's my last post on this matter unless you are going to provide what is minimally required to check whether the code is faulty or not.

jazzguy
msg:1527389 - 6:59 pm on May 10, 2005 (gmt 0)

You clearly know more about software debugging than me, so please share your knowledge for the benefit of all.

I find it odd that you seem to think sarcasm would encourage someone to help you.

My view is probably old-fashioned, but the following data is minimally needed to achieve the required result:

Again, more repetition. Your debugging demands have already been covered in this thread.

Just who is spinning here: me, when I ask for the minimum data required to verify the code, or you, who refuses to provide this publicly available data?

You, when you claim that what might be the easiest data for you to verify code with is the minimally required data. I offered data -- you rejected it. And you are also spinning when you choose to ignore or belittle the privacy concerns raised earlier in the thread.

Just who is being unreasonable here?

In my opinion, you are for the reasons stated above. I don't consider it unreasonable to withhold personally-identifiable information from an entity responsible for a bot that left evidence of malicious behavior on a site I administer. And I certainly don't think it unreasonable to withhold such information from someone who chooses sarcasm and insults as their primary style of communication.

I think it's my last post on this matter unless you are going to provide what is minimally required to check whether the code is faulty or not.

Good, I thought this thread was over many posts back. I think both sides of the argument have been thoroughly covered. We disagree on what is minimally required to check your code.

fischermx
msg:1527390 - 7:05 pm on May 10, 2005 (gmt 0)

Lord Majestic 1 - Jazzguy 0 :)

Jazzguy, you are not a programmer, are you? Actually, are you a "guy"? This whole thread read like a discussion with my wife, lol. :)
See, I understand your point: you're not MJ12bot's debugger and are not willing to help. That's fine, it's your choice. But also understand that nobody will stand quiet while their software is being accused of being buggy without proof ;)

Lord Majestic, your project seems very interesting; it caught my attention that you're using MS technology on the distributed client.
Wow, congratulations.

jazzguy
msg:1527391 - 7:19 pm on May 10, 2005 (gmt 0)

you are not a programmer, are you? Actually, are you a "guy"?

Yes and yes (insult noted).

See, I understand your point: you're not MJ12bot's debugger and are not willing to help. That's fine, it's your choice. But also understand that nobody will stand quiet while their software is being accused of being buggy without proof

That is certainly understandable. And he could have made his debugging requirements known without all of the insults and sarcasm. He also could have accepted the information that I offered him, and waited until after he had evaluated it before making a determination on whether or not it was sufficient.

Lord Majestic
msg:1527392 - 7:39 pm on May 10, 2005 (gmt 0)

And he could have made his debugging requirements known without all of the insults and sarcasm.

I asked for the robots.txt in the 2nd post in this thread; I fail to see any sarcasm in it. Here it is:

Just out of interest, can you post your robots.txt?

You told me off, and since the original post was not about my bot I went away, as I generally don't like to get into flame wars; however, you proceeded to mention my bot being banned, which prompted me to ask for the robots.txt again, since it concerned my code that I could check.

I am looking at my crawler stats right now and I see that out of 860k URLs checked in this session, just over 33k were disallowed by robots.txt, so clearly my bot does not just ignore robots.txt, and since I know for a fact that it works in principle, I can't just simulate a case that I have no idea about.

My bot is legit, and every HTTP request contains a link to a page where more information is given, so anyone has a chance to contact me with any problems (like other people did). If you choose to remain anonymous and refuse to share publicly available information then that's your choice - just don't expect anything to happen with your "report", as it lacks credibility, not least due to your insistence on hiding behind anonymity, but mainly due to the lack of data minimally necessary to verify it.

Anyway, I think I am going to publish the code that does the job (checks whether a URL should be retrieved or not for a given robots.txt) so that anybody who questions MJ12bot's support for robots.txt can see for themselves. We come in peace :)
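In the meantime, here is a minimal sketch of the check in question (illustrative Python, reduced to the original spec's prefix matching with no Allow support -- not the actual MJ12bot module):

def disallowed(robots_txt, agent, path):
    # Parse Disallow prefixes per user-agent group, then test whether the
    # URL path starts with any prefix in our group (or the * group).
    groups, agents, in_body = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_body:
                agents, in_body = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_body = True
            if value:  # an empty Disallow allows everything
                for a in agents:
                    groups[a].append(value)
    rules = groups.get(agent.lower(), groups.get("*", []))
    return any(path.startswith(prefix) for prefix in rules)

print(disallowed("User-agent: *\nDisallow: /trap/", "mj12bot", "/trap/page.html"))  # True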
