User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
Disallow: /Admin
What's "Disallow: //" supposed to mean? Our crawler, SiteTruth, treated it as if it were "Disallow /", and refused to crawl the site at all. Yet that's clearly not the intent, or there would be no other entries.
Did their webmaster screw up, or is something special going on?
No special handling should be required if your prefix-matching routine is correct -- i.e., neither "/" nor "//" is a special case in any way. If the candidate Request-URI matches the prefix given in robots.txt over the full length of that prefix, then do not fetch.
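A minimal sketch of that test in Python (the function name and test paths are just for illustration):

def is_disallowed(request_uri, disallow_prefixes):
    # Plain prefix matching, no special-casing of "/" or "//":
    # "Disallow: //" blocks only paths beginning with two slashes,
    # while "Disallow: /" blocks every path.
    for prefix in disallow_prefixes:
        if prefix and request_uri.startswith(prefix):
            return True
    return False

assert is_disallowed("//foo", ["//"])        # double slash: blocked
assert not is_disallowed("/foo", ["//"])     # single slash: allowed
assert is_disallowed("/account/mypro", ["/account/mypro"])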
A common error -- as well as a common exploit -- is to link to or request example.com//foo or example.com/foo//bar.
Because of a bug/feature/anomaly in the way Apache translates the Request-URI to a filename, there are security implications to URL-paths containing multiple consecutive slashes. This being a public forum, I won't detail them. But multiple slashes *will* occur in links and HTTP requests, and to his/her credit, HAL's Webmaster was just being very thorough. Hopefully, they took 'stronger' measures in addition to the robots.txt directive, since robots.txt compliance is 'voluntary.'
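For what it's worth, a 'stronger' measure might look something like this on the server side -- a hypothetical WSGI middleware sketched in Python, not anything from the actual site -- rejecting any request whose path contains consecutive slashes before it ever reaches the filename translation:

def reject_double_slashes(app):
    # Refuse paths containing "//" outright, rather than relying on
    # robots.txt compliance, which is voluntary.
    def wrapper(environ, start_response):
        if "//" in environ.get("PATH_INFO", ""):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found"]
        return app(environ, start_response)
    return wrapper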
Jim
It's quite common to redirect "foo.com" to "www.foo.com" with a 301 redirect. That's just good practice. But if you do that, also redirect "foo.com/robots.txt" to "www.foo.com/robots.txt".
The site that's been giving us trouble (which is "ibm.com") has, as its base robots.txt file, "User-agent: *, Disallow: /", which is a simple "all robots go away" command. But actual pages redirect to "www.ibm.com", and "www.ibm.com/robots.txt" has a different, and far less restrictive, robots.txt file. Presumably they were trying to prevent search engine aliasing, but the end result is confusion.
It's quite possible to redirect "robots.txt" files. "microsoft.com", for example, does it. Try "microsoft.com/robots.txt" in a browser and watch it redirect. So that works.
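If you want to see that redirect from a script rather than a browser, here's a rough sketch using Python's standard library (it just surfaces the 3xx instead of following it):

import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Decline to follow redirects so the 3xx surfaces as an HTTPError.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
try:
    resp = opener.open("http://microsoft.com/robots.txt")
    print(resp.status, resp.url)                 # served directly
except urllib.error.HTTPError as e:
    print(e.code, e.headers.get("Location"))     # redirected elsewhere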
For "www.ibm.com", there are enough incoming links that they get listed in search engines, but if you're a small site and your incoming links are to "foo.com", this kind of error could hide you from search engines.
(We run a crawler that rates sites, so we're looking at this from the robot's side, not the webmaster's side. We get to see all the dirty laundry of web site design.)
Serving a catch-all robots.txt like that on the bare domain is done to prevent requests for non-existent robots.txt files from filling the logs with 404 errors, and it is also done by some hosts who provide "dummy-proof" shopping carts to their customers -- They want to be sure that the shopping carts don't get spidered, resulting in millions of bogus "purchases" by spiders.
On such hosts, it is impossible to redirect robots.txt.
Besides, "example.com" and "www.example.com" are two entirely different domains, and can and should have separate robots.txt files if they do not resolve to the same "sites."
If example.com redirects to www.example.com, and only www.example.com has a robots.txt file, then that robots.txt should be applied to all requests within the www.example.com domain. If both example.com and www.example.com have robots.txt files, then regardless of any redirect from one domain to the other, only the robots.txt in the same domain as the target URL-path should be applied.
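In crawler terms, that means the robots.txt URL is always built from the target URL's own scheme and host; here's a sketch using Python's standard library (the cache is illustrative -- note that urllib's RobotFileParser will itself follow a redirect on the robots.txt fetch, which matches the first case above):

from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}

def robots_for(target_url):
    # Key the cache on the target's own (scheme, host) pair -- never on
    # whatever host a page-level redirect happened to land on.
    parts = urlsplit(target_url)
    key = (parts.scheme, parts.netloc)
    if key not in _robots_cache:
        rp = RobotFileParser()
        rp.set_url(urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", "")))
        rp.read()
        _robots_cache[key] = rp
    return _robots_cache[key]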
You might also want to be on the lookout for sites which have separate robots.txt files for http and https -- It is certainly possible, and should be handled.
In short, http://example.com, http://www.example.com, https://example.com, and https://www.example.com should be considered four separate namespaces, and each may have its own separate and unique robots.txt which applies to all robot fetches within that namespace.
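Following the sketch above, those four namespaces fall out naturally as four distinct cache keys:

from urllib.parse import urlsplit

for u in ("http://example.com/", "http://www.example.com/",
          "https://example.com/", "https://www.example.com/"):
    parts = urlsplit(u)
    print((parts.scheme, parts.netloc))   # four distinct (scheme, host) keys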
Jim