If you hit a site too hard or too often or go where you aren't wanted, the owner may well ban your IP address.
A robot is defined as "a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced."
That makes your program/service a robot. Your choice is whether you want to be known as a good one or a bad one. There are lists of bad ones, and my sites ban them all, as does WebmasterWorld.
"Why should I obey robots.txt"
Because it is not your site.
Every time a spider crawls a site it uses the owner's bandwidth. The site is primarily aimed at site users.
Sure, if you offer something in return, like visitors, then I don't think many will be too worried. But if it's all take and no give, then most will have issues with this.
I don't think many site admins will have a problem with a bot that obeys robots.txt and doesn't harm bandwidth.
Welcome to WebmasterWorld PWalker,
I think you have it a little backwards when you say
|it seems that doing either a or b above is an open invitation to have my spider disallowed from a target site |
Not doing both a and b is definitely an open invitation to have my spider disallowed from a target site.
The moment you start to misbehave you might find yourself added to a banned list. Then you get nowhere.
Thanks for your replies.
I can see this as a sort of 'give and take' between web administrators and software developers.
If I were, for instance, to release my software publicly, then, especially with the increase in public ADSL use, you are going to have (multiple) IPs sucking your bandwidth, with no way of differentiating between browser and software use (with the exception of some sort of clever access history analyser).
If I have an ADSL customer using my software, with a 'permanent' IP, then he/she will eventually have their IP banned, which is no good for me or the customer.
Dynamically allocated IPs could be problematic for you.
So if I then identify my spider when requesting information, then suddenly I'm added to a global banned list and my software is rendered useless (I can see from example robots.txt files, that many common 'off line' browsing programs are listed). This would (at present) be an unacceptable business risk.
This is a catch-22: with an increase in the use of 'personal' bots, web administrators are eventually going to have to 'disallow' anything that looks spider-like.
With this as the case, my software is effectively identifying itself as a spider by requesting robots.txt, which sort of defeats the object.
Following global disallows within the robots.txt I see as courteous, as is ignoring directories such as /cgi-bin/.
Following product-specific disallows is another matter, and identifying my software as anything other than a browser could be risky if I were to look at this from a business perspective.
|that is currently indistinguishable from a browser. |
I DARE you to crawl my sites under that premise and find out how misguided your impressions are and how fast both your IP range and UA will be added to this forum. (That is, after you have been denied future access to my sites.)
Additionally I'm just curious as to how you found this forum?
Were you doing a search?
The implication with that is that should an IP or UA get listed in this forum that the denial tends to gather some sort of influence with other webmasters as well (even though we are not all in agreement on severity.)
Is it logical that you would take a chance on limiting the success and commercial viability of your tool for lack of compliance?
If you are going to write a robot, make it a courteous one, and follow all the rules. A good starting point for the rules is:
You raise a genuine issue. Most websites welcome search engine spiders and human visitors. Most of us are wary of non-human visitors that have other, or unknown, purposes.
With the increase of "personal agents" or useful tools for disabled people (such as XENU Link Sleuth which has a nice side-effect of generating a site map) there will need to be some more thought about permissions to permit friendly robots to rummage in a polite way while keeping spammy and discourteous robots at bay.
Maybe there needs to be a code of conduct for agents, and a whitelist of the good bots who have signed up to it.
It all comes down to web manners. When a spider comes to visit, the first thing it does is "ring the bell" by looking through my robots.txt. If it's not on the bad list then the door will be opened, but like all good house guests it should know not to outstay its welcome and not to visit too often.
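For the "not visit too often" part, here is a rough sketch of a per-host throttle in Python. The class name, the 30 second default and the injectable clock are my own assumptions for illustration, not a standard; pick whatever gap the site owner would consider polite.

```python
import time


class PoliteThrottle:
    """Tracks the last fetch time per host and enforces a minimum gap.

    Hypothetical helper: the name and the default delay are assumptions.
    """

    def __init__(self, min_delay_seconds=30.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_fetch = {}        # host -> time of the most recent request

    def seconds_to_wait(self, host):
        """Return how long to wait before the next request to this host."""
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0              # never visited, so no wait is needed
        elapsed = self.clock() - last
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, host):
        """Call this right after each request to the host."""
        self.last_fetch[host] = self.clock()
```

A crawler loop would sleep for `seconds_to_wait(host)` before each request and call `record_fetch(host)` afterwards.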
"indistinguishable from a browser" ya think?
Unless your bot is for sinister purposes why would you want to spoof the UA?
If you're going to do any robot work, make it well behaved, use a proper UA and have contact info within the UA.
For a while I was running a bot, and most emails that I received about it were not criticism; they were genuine enquiries just to check up on what I was doing. I dare say if I had not included contact details or replied to emails I would have been added to a lot of ban lists.
|Why should I obey robots.txt |
Apart from all the other reasons, here's one: why not? It isn't hard and you know that if you do you won't peeve anyone off.
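To show that it really isn't hard, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name "ExampleBot", the sample robots.txt and the example.com URLs are made up for illustration; a real crawler would fetch the site's own robots.txt rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Ask before every fetch; skip the URL when the answer is False.
print(parser.can_fetch("ExampleBot/1.0", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot/1.0", "http://example.com/index.html"))         # True
```

Against a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.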
Thank you Wilderness and Victor
In answer to each in turn:
Wilderness, I cannot resist a challenge. As it's very unlikely that my software will ever be public, mail me your website address to firstname.lastname@example.org and I'll have a go sometime this evening. (In the unlikely event you don't pick it up, I can provide you with my access logs, time/page/browser identifier etc., post event.)
It'll be by 56k however..
I found this forum through Google using pretty simple search terminology (I think it was 'Spider algorithm' etc.); it may have been a cached return, however.
Victor thanks for the link, if I ever go commercial, I'll use the information out of professional courtesy.
|Wilderness I cannot resist a challenge. As it's very unlikely that my software will ever be public, mail me your website address to email@example.com and I'll have a go sometime this evening. (In the unlikely event you don't pick it up i can provide you with my access logs time/ page/ browser identifier etc post event.) |
Many folks here can tell you that I have no interest in either APNIC or RIPE traffic.
You're likely already denied.
BTW, isn't your request to sticky you a URL quite unusual?
You actually think I or any other webmaster would request a crawl with what you have presented here?
A few folks in these forums have found my sites in jest. It's not that difficult and utilizes a minimum of search terms.
One even added a UA which related to a sticky mail he knew I couldn't resist replying to ;)
Your automated crawler that tries to look like a human and sends a fake "Internet Explorer" user-agent string will quite soon fall into a robot trap and get banned.
You have nothing to lose and everything to gain by adhering to robots.txt; just do it.
Publish a comment in your user agent string following the semi-colon that provides a URL to information about what you are doing.
If you're sensible and provide a convincing reason then no sane webmaster is going to ban you; if they do it's their problem.
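As a rough illustration of such a user agent string, here is a Python sketch. The bot name, version and info URL are placeholders, and the `(compatible; Name/Version; +URL)` layout is only a common convention, not a requirement.

```python
from urllib.request import Request

# Placeholder values: substitute your own bot name and a page describing it.
BOT_NAME = "ExampleBot"
BOT_VERSION = "1.0"
BOT_INFO_URL = "http://www.example.com/bot.html"

# The info URL goes in the comment part, after the semi-colon.
USER_AGENT = f"Mozilla/5.0 (compatible; {BOT_NAME}/{BOT_VERSION}; +{BOT_INFO_URL})"


def build_request(url):
    """Create a request that honestly identifies the bot in its User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("http://www.example.com/")
print(req.get_header("User-agent"))
```

The page behind the info URL is where you explain what the bot does and how to reach you.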
It is because of people like P.Walker that many people on a dial-up connection get banned from the Internet.
I am sure you are a good software programmer, and when you make your Burglar-Tool you will have burglars who will buy it from you (and some will even steal it from you).
Then someone will use it from a dial-up connection to steal data and e-mail addresses from many small webmasters.
Not only that, but while at it they are also stealing bandwidth for which we have to pay.
And lastly, the IP of this dial-up connection will be banned on many servers, so that you will be the cause of denying simple people, who do not have 56K or ADSL, the chance to get on the internet. Here in Asia we already have many problems because of software developers like you.
Do you need something from my website, then come and read it, do you need something more from my website then e-mail me and ask me, surely as a code under webmasters, one helps the other.
But please don't come in the night covered with your black cape to steal what does not belong to you.
Whether a robot identifies itself or not with a User-Agent string is beside the point. Many webmasters use "spider traps", which basically consist of an invisible link on the index page of a website, pointing to a file which is disallowed in robots.txt. If any requests are made for that file, the IP address of the requestor is automatically banned from any further requests. Thus, robots not following the robots.txt convention could quickly find themselves banned from many websites. If you plan to distribute your spidering software, I recommend building in compliance with robots.txt, or it could soon become useless. Doing so is more than just professional courtesy and netiquette; it also makes sense from a perspective of self-interest.
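A bare-bones sketch of the trap logic, in Python, might look like the following. The trap path, class name and in-memory ban set are all assumptions for illustration; real setups usually do this in the web server configuration or a script and keep the ban list on disk.

```python
# Hypothetical trap file: linked invisibly from the index page and listed
# under Disallow in robots.txt, so compliant robots never request it.
TRAP_PATH = "/trap/do-not-follow.html"


class SpiderTrap:
    """Bans any IP address that requests the trap file."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned_ips = set()

    def handle(self, ip, path):
        """Return the HTTP status code to send for this request."""
        if ip in self.banned_ips:
            return 403              # already banned: refuse everything
        if path == self.trap_path:
            self.banned_ips.add(ip) # the robot ignored robots.txt
            return 403
        return 200                  # normal request, serve as usual
```

A robot that checks robots.txt before each fetch never requests the trap file, so it never ends up in `banned_ips`.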
|I am a developer creating an http text only data mining tool, that is currently indistinguishable from a browser. |
Please clarify. Are you simply referring to the User-agent and referrer string?
Wilderness, it is easy enough to make a spider look like a browser. You have to pay attention to the timing, secondary requests, cookie acceptance, and header information a bit beyond UA.
I faked out Direct Hit for years using a browser masquerading bot that rotated IPs. Using such a bot is a lot of work, but it was well worth it at the time.
Just walked in the door and caught this thread. Good question.
|I cannot see the incentive... |
When incentive is lacking, go for honesty and integrity. Or, making your work count for something decent. Or, letting yourself be known as someone that deals with others honorably.
|Whether a robot identifies itself or not with a User-Agent string is besides the point. Many webmasters use "spider traps" |
Here's an example of what volatilegx expressed.
traversed my dime store trap.
188.8.131.52 - - [12/Jun/2003:16:42:33 -0700] "GET / HTTP/1.1" 200 9409 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
|Wilderness, it is easy enough to make a spider look like a browser. |
In order for that to work on me and my sites, considering their content and how the person who created the pages is aware of precisely what traffic was intended?
The hiding bot would need to be aware of some IP ranges which have previously visited my sites as well as ranges that are either denied or allowed.
I'm not saying that what you say is not possible in most instances, just that it wouldn't work on my sites considering their limited audience.
|Wilderness I cannot resist a challenge. |
Here's a challenge: Write a correctly-implemented robots.txt parser module, and have your 'bot obey the directives. Otherwise, your creation will suffer the same fate as Indy Library and other previously-useful 'bots whose authors neglected to enforce robots.txt compliance in their licensing agreements... Ignominy.
Let us not reveal the inner workings of our various traps lest we be speaking with unfriendly agents here.
And on Jim's suggestion, I bow out of this thread.