question about robots.txt, am I banning a directory by accident

Forum Moderators: goodroi


arbitrary

5:41 am on Jan 12, 2006 (gmt 0)

I am currently using the following robots.txt.

User-agent: *
Disallow: /widgets/

I am obviously trying to block all bots from the widgets directory.

I have created a new directory that I want search engines to visit. That directory is:

/widgets-for-sale/

Can someone tell me if my robots.txt will prevent search engines from visiting /widgets-for-sale/?

I would have thought not, but I put up the new directory (/widgets-for-sale/) and linked to it, and no search engine has visited the new pages in about 10 days.

Thanks.

jdMorgan

1:09 pm on Jan 12, 2006 (gmt 0)

Robots.txt uses prefix-matching.

Therefore,


User-agent: *
Disallow: /widgets/

disallows robots only from URL-paths that *start with* "/widgets/"

It will have no effect on URL-paths that *start with* "/widgets-for-sale/"

See A Standard for Robots Exclusion [robotstxt.org].
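As a quick check of the prefix-matching rule, here is a minimal sketch using Python's standard-library urllib.robotparser. The user-agent name "ExampleBot" and the file name blue.html are made up for illustration, and this only shows how one parser that follows the original standard behaves, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the original post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /widgets/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if the rules above let user_agent fetch path.

    "ExampleBot" is a hypothetical crawler name; it falls under
    the "User-agent: *" record.
    """
    return parser.can_fetch(user_agent, path)


for path in ["/widgets/", "/widgets/blue.html", "/widgets-for-sale/", "/widgets"]:
    print(path, "allowed" if is_allowed(path) else "disallowed")
```

With this parser, only the two paths that start with "/widgets/" are disallowed. "/widgets-for-sale/" and the slashless "/widgets" are both allowed.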

Jim

Lord Majestic

2:17 pm on Jan 12, 2006 (gmt 0)

User-agent: *
Disallow: /widgets/

The only possible problem with this arises if a bot requests the directory without the trailing slash (it's not required). If your server doesn't issue a redirect, the bot will access a page that is meant to be disallowed; strictly speaking, though, that would not be the bot's fault.

arbitrary

4:27 pm on Jan 12, 2006 (gmt 0)

Thanks Jim, Lord Majestic.

Lord Majestic, after your post, I checked and made sure that I linked to the directory with a trailing '/'.

I think I just need to be more patient with this.

Lord Majestic

8:42 pm on Jan 12, 2006 (gmt 0)

I checked and made sure that I linked to the directory with a trailing '/'.

If you have a trailing '/' in robots.txt, then it's possible that some bots will request that directory without the slash, which is technically 100% legal as far as robots.txt is concerned. It is therefore a good idea NOT to have a trailing slash in Disallow directives - this covers all eventualities.

In your case, however, that would also exclude the other directory, since it starts with the same prefix. Tough dilemma; the best solution would probably be to rename it to avoid the clash.

arbitrary

10:54 pm on Jan 12, 2006 (gmt 0)

Thanks, I think this may be the problem. I have not had a single request to these pages from any bot and all the major bots visit my site each day.

Time to move those pages. Easy enough as they are not in the search engines anyway.

Thanks to all.

pixel_juice

11:31 pm on Jan 12, 2006 (gmt 0)

it is therefore good idea NOT to have trailing slash in Disallow directives - this covers for all eventualities

I'm not sure this is accurate. If I want to disallow the directory /widgets.htm/ but NOT the page /widgets.htm I need the trailing slash. Surely omitting it just makes the whole thing less accurate?

The only page a robot could mistakenly access would be the index page for the directory /widgets.htm/, if it asked for just /widgets.htm. But surely that would be incorrect behaviour on the part of the bot, since a request for /widgets asks for a file called widgets, whereas a request for /widgets/ asks for the root of the directory?

Lord Majestic

11:47 pm on Jan 12, 2006 (gmt 0)

I'm not sure this is accurate.

It depends on your point of view. Consider the following example:

-----------------------
User-agent: *
Disallow: /dir/
-----------------------

The intention here is to disallow crawling of anything inside /dir/, including the root of the directory. Looks fine, but don't forget that bots check whether the URL they are about to retrieve starts with the Disallow value. Still sounds good? Not really - consider that it is perfectly valid to request the root of a directory without specifying / at the end, i.e. http://www.example.com/dir - this URL won't be matched by the robots.txt above, and the request is perfectly valid.

Some or even most webservers will issue a redirect to the proper URL - http://www.example.com/dir/ - which gives the bot a chance to match the URL, as it should. However, re-evaluation on redirect is not supported by all bots.

So the conclusion is that specifying / at the end of Disallow directives may lead to some bots supposedly violating robots.txt, when in reality they followed it strictly and it's the webmaster who is at fault.
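The scenario above can be sketched as a small simulation. Everything here is hypothetical: the function names, the redirect table standing in for the web server, and the recheck_redirects switch are all made up to illustrate the argument, not taken from any real crawler.

```python
def is_disallowed(path, disallow_rules):
    """Plain prefix matching: a path is disallowed if it starts with any rule."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


def crawl(path, disallow_rules, redirects, recheck_redirects):
    """Describe what a hypothetical bot does when asked to fetch path.

    redirects maps a requested path to the path the server redirects to
    (an empty dict means the server serves the content directly).
    recheck_redirects says whether the bot re-checks robots.txt
    against the redirect target.
    """
    if is_disallowed(path, disallow_rules):
        return "skipped"
    target = redirects.get(path)
    if target is None:
        return "fetched " + path
    if recheck_redirects and is_disallowed(target, disallow_rules):
        return "skipped after redirect"
    return "fetched " + target


with_slash = ["/dir/"]          # Disallow: /dir/
without_slash = ["/dir"]        # Disallow: /dir
server_redirects = {"/dir": "/dir/"}

print(crawl("/dir", with_slash, server_redirects, recheck_redirects=True))
print(crawl("/dir", with_slash, server_redirects, recheck_redirects=False))
print(crawl("/dir", with_slash, {}, recheck_redirects=True))
print(crawl("/dir", without_slash, server_redirects, recheck_redirects=False))
```

In this sketch, the trailing-slash rule only keeps the bot out when the server redirects and the bot re-checks the target. The slashless rule blocks the request outright, at the cost of also matching anything else that starts with "/dir".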

pixel_juice

1:25 pm on Jan 13, 2006 (gmt 0)

it is perfectly valid to request root of directory without specifying / at the end, ie: http://www.example.com/dir

I think this is the statement I disagree with - http://www.example.com/dir is a request for a file called 'dir' in the root of www.example.com. Most web servers (if a file called 'dir' does not exist) perform a 'courtesy' redirect to the actual directory URL - /dir/ - but robots should not request directories without the slash, since this is a request for a file, not for a directory index.

I don't know if the index file in the root directory is an http spec thing, I suppose more properly the links would be to /dir/index.htm.

"Directories require a trailing slash"
[httpd.apache.org...]

"IIS first treats [dir] as a file that it should give back to the browser. If this file cannot be found, IIS checks to see if there is a directory with this name"
[support.microsoft.com...]

Lord Majestic

1:36 pm on Jan 13, 2006 (gmt 0)

I think this is the statement I disagree with - http://www.example.com/dir is a request for a file called 'dir' in the root of www.example.com.

This is not correct - http://www.example.com/dir is a URI. The client that requests it has no clue whether it is a directory, a file or whatever; how it is classified is entirely up to the server. As far as the client is concerned, it requested http://www.example.com/dir, and that is a perfectly valid request.

Indeed some or even many webservers will issue a redirect - but not all bots re-check robots.txt upon redirect, and some servers don't issue a redirect at all - they just detect that it's a directory and avoid the cost of the redirect.

It is therefore better to avoid a trailing slash in Disallow statements. This is not theory - I have personally come across a few webmasters who claimed my bot violated their robots.txt, but on checking it transpired that the situation described above had happened and it was technically their fault.

pixel_juice

1:43 pm on Jan 13, 2006 (gmt 0)

OK agreed, but the other problem with 'Disallow: /widget' is that it would also disallow /widget.htm and so IMO is more likely to cause confusion.

Lord Majestic

3:16 pm on Jan 13, 2006 (gmt 0)

This is true, which is why it's best to have uniquely named directories for different stuff, so that there are no name clashes.

arbitrary

6:55 pm on Jan 14, 2006 (gmt 0)

Here is an update to the situation.

Recapping the situation. My robots.txt:

User-agent: *
Disallow: /widgets/

My directory was called:
/widgets-for-sale/

but it was not being indexed (and I wanted it to be). So I changed the directory to:

/SynonymForWidgets-for-sale/

The directory was crawled the next day by both googlebot and msn which had not crawled the /widgets-for-sale/ directory in the 10 previous days.

This leads me to believe that my initial robots.txt was somehow telling bots not to visit that directory. I don't understand how these things work; I'm just passing on what I saw, and hopefully that can help someone else as well.

Lord Majestic

8:19 pm on Jan 14, 2006 (gmt 0)

I think I can explain it. My bot had the problem I described: people specified / at the end of directories, and their server did not redirect, so my bot wrongly appeared to be breaking robots.txt. Even though I was right, I realised that the impression counts more than the substance, so I modified my robots.txt algorithm to automatically ignore any /'s at the end of a Disallow path.

So my bot would skip both directories, and it seems logical to me that other search engines may take the same approach.
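A rough sketch of what such a modification might look like is below. This is a guess for illustration only: the thread does not show the bot's actual code, the function names are invented, and whether Googlebot or MSN did anything similar is speculation.

```python
def normalize_rule(rule):
    """Drop trailing slashes from a Disallow path (hypothetical bot behaviour).

    A bare "/" is kept as-is so that "Disallow: /" still blocks everything.
    """
    stripped = rule.rstrip("/")
    return stripped or "/"


def is_blocked(path, rule, strip_trailing_slash):
    """Prefix-match path against rule, optionally normalizing the rule first."""
    prefix = normalize_rule(rule) if strip_trailing_slash else rule
    return path.startswith(prefix)


rule = "/widgets/"  # from "Disallow: /widgets/"
for path in ["/widgets/", "/widgets-for-sale/", "/SynonymForWidgets-for-sale/"]:
    strict = is_blocked(path, rule, strip_trailing_slash=False)
    lenient = is_blocked(path, rule, strip_trailing_slash=True)
    print(path, "strict:", strict, "slash-stripping:", lenient)
```

Under strict prefix matching only "/widgets/" is blocked. Once the rule is reduced to "/widgets", "/widgets-for-sale/" is blocked too, while the renamed "/SynonymForWidgets-for-sale/" is not, which would be consistent with what was observed.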

arbitrary

8:46 pm on Jan 14, 2006 (gmt 0)

Thanks for your help and explanation. I think that is exactly what is going on, the bots were ignoring the trailing slash when making the requests.

py9jmas

9:43 pm on Jan 14, 2006 (gmt 0)

Indeed some or even many webservers will issue redirect

Have you got any examples of webservers that don't issue a redirect that you're happy to post? I wouldn't have thought it would be very many - not redirecting to /dir/ tends to break relative links.

Lord Majestic

3:32 am on Jan 15, 2006 (gmt 0)

I can't recall the server types, but I have had first-hand experience with this problem. Some bots don't re-check the redirected URL against robots.txt anyway.