Forum Moderators: goodroi

"Disallow: //" - error or meaningful?

robots.txt crawler error


sitetruth

4:40 pm on Jul 11, 2007 (gmt 0)



A major site (the corporate site of a very large computer manufacturer with a three-letter name) has a robots.txt file that includes:

User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
Disallow: /Admin

What's "Disallow: //" supposed to mean? Our crawler, SiteTruth, treated it as if it were "Disallow: /", and refused to crawl the site at all. Yet that's clearly not the intent, or there would be no other entries.

Did their webmaster screw up, or is something special going on?

jdMorgan

5:17 pm on Jul 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interpret it literally: Do not fetch URL-paths which begin with "//".

No special handling should be required if your prefix-matching routine is correct -- i.e. neither "/" nor "//" is a special case in any way. If, over the length of the prefix given in robots.txt, the candidate Request_URI matches that prefix, then do not fetch.
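That literal prefix matching can be sketched in a few lines (a minimal sketch assuming the original, no-wildcard robots.txt semantics; the function name and rule list are illustrative, not from the thread):

```python
# Minimal sketch of literal prefix matching for robots.txt Disallow
# rules, with no wildcard support and no special-casing of "/" or "//".
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if `path` begins with any Disallow prefix.

    An empty Disallow value means "allow everything" and is skipped.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)

rules = ["//", "/account/registration", "/Admin"]

# "//" only matches URL-paths that literally begin with two slashes,
# so the rest of the site remains crawlable:
print(is_disallowed("//foo", rules))         # True
print(is_disallowed("/foo", rules))          # False
print(is_disallowed("/Admin/users", rules))  # True
```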

A common error -- as well as a common exploit -- is to link to or request example.com//foo or example.com/foo//bar.

Because of a bug/feature/anomaly in the Apache Request_URI-to-filename translator, there are security implications to URL-paths containing multiple consecutive slashes. This being a public forum, I won't detail this. But multiple slashes *will* occur in links and HTTP requests, and to his/her credit, HAL's Webmaster was just being very thorough. Hopefully, they took 'stronger' measures in addition to the robots.txt directive, since robots.txt compliance is 'voluntary.'

Jim

sitetruth

10:42 pm on Jul 11, 2007 (gmt 0)



Ah. It turns out there's a separate problem. It's subtle, and webmasters should know about it.

It's quite common to redirect "foo.com" to "www.foo.com" with a 301 redirect. That's just good practice. But if you do that, also redirect "foo.com/robots.txt" to "www.foo.com/robots.txt".

The site that's been giving us trouble (which is "ibm.com") has, as its base robots.txt file, "User-agent: *, Disallow: /", which is a simple "all robots go away" command. But actual pages redirect to "www.ibm.com", and "www.ibm.com/robots.txt" has a different, and far less restrictive, robots.txt file. Presumably they were trying to prevent search engine aliasing, but the end result is confusion.
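The mismatch can be reproduced with Python's standard-library robots.txt parser (a sketch; the two rule sets below merely mirror the thread's description of the two hosts, not IBM's actual files):

```python
# Sketch of why a compliant crawler gives up on the bare domain:
# the bare host serves a blanket Disallow, while the www host that
# pages actually redirect to is far more permissive.
from urllib.robotparser import RobotFileParser

bare = RobotFileParser()
bare.parse(["User-agent: *", "Disallow: /"])  # what the bare domain serves

# Every path on the bare domain is off limits to every robot...
print(bare.can_fetch("SiteTruth", "http://ibm.com/products"))  # False

# ...even though the same page under the redirect target's own
# (illustrative) robots.txt would be crawlable.
www = RobotFileParser()
www.parse(["User-agent: *", "Disallow: /Admin"])
print(www.can_fetch("SiteTruth", "http://www.ibm.com/products"))  # True
```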

It's quite possible to redirect "robots.txt" files. "microsoft.com", for example, does it. Try "microsoft.com/robots.txt" in a browser and watch it redirect. So that works.

"www.ibm.com" has enough incoming links that it gets listed in search engines anyway, but if you're a small site and your incoming links point to "foo.com", this kind of error could hide you from search engines.

(We run a crawler that rates sites, so we're looking at this from the robot's side, not the webmaster's side. We get to see all the dirty laundry of web site design.)

Lord Majestic

11:10 pm on Jul 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is best not to redirect robots.txt. Keep it simple and it will be more reliable - do not assume that if it works for Google then it will work for everyone else.

jdMorgan

11:58 pm on Jul 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is not always possible to redirect robots.txt. Some shared virtual hosting providers "wrap" robots.txt files in a script which echoes the user's robots.txt if present, or serves a default robots.txt if no user-provided robots.txt file is present. As this is done using a server-configuration-level Alias directive, it is invoked before any "user-level" redirects can be applied.

This is done to keep requests for non-existent robots.txt files from filling the logs with 404 errors, and it is also done by some hosts who provide "dummy-proof" shopping carts to their customers -- they want to be sure that the shopping carts don't get spidered, resulting in millions of bogus "purchases" by spiders.

On such hosts, it is impossible to redirect robots.txt.

Besides, "example.com" and "www.example.com" are two entirely different domains, and can and should have separate robots.txt files if they do not resolve to the same "sites."

If example.com redirects to www.example.com, and only www.example.com has a robots.txt file, then that robots.txt should be applied to all requests within the www.example.com domain. If both example.com and www.example.com have robots.txt files, then regardless of any redirect from one domain to the other, only the robots.txt in the same domain as the target URL-path should be applied.

You might also want to be on the lookout for sites which have separate robots.txt files for http and https -- it is certainly possible, and should be handled.

In short, http://example.com, http://www.example.com, https://example.com, and https://www.example.com should be considered four separate name-spaces, and each may have its own separate and unique robots.txt which applies to all robot fetches within that namespace.
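One parser per (scheme, host) pair implements that rule; here is a hedged sketch using Python's standard library (the `RobotsCache` class and the fake fetcher are illustrative; real code would do HTTP fetches with error handling and cache expiry):

```python
# One robots.txt parser per (scheme, host) namespace, so the rules
# applied to a URL always come from that URL's own namespace.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    def __init__(self, fetch):
        # fetch(robots_url) -> list of robots.txt lines for that namespace
        self._fetch = fetch
        self._parsers = {}

    def allowed(self, agent, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        if key not in self._parsers:
            rp = RobotFileParser()
            rp.parse(self._fetch(f"{parts.scheme}://{parts.netloc}/robots.txt"))
            self._parsers[key] = rp
        return self._parsers[key].can_fetch(agent, url)

# Fake fetcher standing in for real HTTP: the bare host blocks all
# robots, the www host only blocks /Admin (mirrors the thread).
def fake_fetch(robots_url):
    if robots_url.startswith("http://example.com"):
        return ["User-agent: *", "Disallow: /"]
    return ["User-agent: *", "Disallow: /Admin"]

cache = RobotsCache(fake_fetch)
print(cache.allowed("SiteTruth", "http://example.com/page"))      # False
print(cache.allowed("SiteTruth", "http://www.example.com/page"))  # True
```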

Jim