Disallow: /shop/
Remove the last slash, i.e.:
Disallow: /shop
The reason is that the URL [example.com...] is just as legitimate as [example.com...], but if it is used in the form without a slash at the end, your robots.txt will not disallow it. There is an exception: it should still work when the webserver issues a redirect to the proper directory, but not all webservers do that.
To disallow a directory the following is correct:
Disallow: /directory/
Without a slash at the end it might work, but it is not correct.
Actually, it's with the slash at the end that it might work, and having the slash specified means you will allow access to that directory when the slash was not used. There is no obligation for a bot to always use / at the end of a URL; some people could link to you without it, and since the bot has no idea whether it's a directory or not, it will use the URL as-is, i.e. without the slash. Here is what will then happen in your case:
The standard states the following:
"Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved."
It means that if a bot tries to access your directory via a URL without a slash at the end, the statement above won't disallow it. Many webservers will issue a redirect to the proper directory with a slash at the end, and a good bot will pick that up; however, some servers don't issue that redirect, and some bots don't check the redirected URL against robots.txt.
Either way, it's far more reliable to remove the slash at the end of all URLs in Disallow statements.
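To make that concrete, here is a minimal Python sketch of the matching the standard describes -- plain prefix matching, nothing cleverer (illustrative only, not any real bot's code):

def is_disallowed(path: str, rule: str) -> bool:
    # Per the original standard: a URL is blocked if it starts with the Disallow value.
    return path.startswith(rule)

print(is_disallowed("/shop", "/shop/"))   # False -- slips through when the rule has the trailing slash
print(is_disallowed("/shop/", "/shop/"))  # True
print(is_disallowed("/shop", "/shop"))    # True  -- without the slash, both forms are blocked
print(is_disallowed("/shop/", "/shop"))   # True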
every single search result shows the trailing slash
[google.co.uk...]
Are you saying Brett is wrong? Nay, surely not!
[searchengineworld.com...]
[webmasterworld.com...]
1) Someone links to that directory without a slash at the end (it's still legit).
2) Search engine normalisation strips away the trailing / (it is not compulsory, you know).
Why won't the others know that? Because you need to actually try to implement robots.txt processing from a search engine's point of view and then crawl a fair few URLs to come across this situation. I had to modify my robots.txt handling to take this error into account because it's so widespread. That does not make it right, however :)
Still don't believe me? Read the standard: it clearly states that only URLs that START WITH the disallowed path will be disallowed. Now tell me whether [example.com...] starts with /cgi-bin/ -- of course it doesn't. While this is unlikely to happen with cgi-bin, since URLs pointing into it usually include an actual filename, it can happen with things like /shop/.
The advice to use the trailing / is probably 99% correct, but if you want 100% correctness then don't use / at the end of paths. WebmasterWorld's validator should really catch this case, and also the case where people don't use / to start the disallowed path.
I mean, a good validator should actually implement the robots.txt algorithm as if it were a search engine, and take a list of URLs as input rather than just the robots.txt file, because current validators validate syntax rather than actual logic.
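Something like this is what I mean -- a sketch that takes a robots.txt plus a list of candidate URLs and reports what would actually get blocked (the rules and paths here are hypothetical):

def parse_disallows(robots_txt: str) -> list[str]:
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:  # an empty Disallow value blocks nothing
                rules.append(value)
    return rules

def check(robots_txt: str, paths: list[str]) -> None:
    rules = parse_disallows(robots_txt)
    for path in paths:
        blocked = any(path.startswith(rule) for rule in rules)
        print(f"{path:20} {'BLOCKED' if blocked else 'allowed'}")

check("User-agent: *\nDisallow: /shop/\n",
      ["/shop", "/shop/", "/shop/item.html"])
# /shop comes back "allowed" -- exactly the trap discussed above.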
Disallow: /directory/
is 100% the appropriate way of disallowing a directory; the alternate method of
Disallow: /directory
will disallow the directory as well as any pages of the same name.
A request for example.com/directory will be forwarded to example.com/directory/index.ext, which is the true URL. Regardless of whether the request was made to /directory, if we have disallowed /directory/ then the robot should not index this link, per the standard, if the bot follows robots.txt.
I understand that some robots still do, but that is the fault of the bot in question. The trailing slash is the appropriate way to disallow a directory in its entirety and only that directory.
The end of your quote of the standard is as follows:
... For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
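For what it's worth, Python's standard-library robotparser applies the same prefix rule, so the standard's own example can be checked directly (example.com is just a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "http://example.com/help.html"))        # False -- blocked
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False -- blocked

rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "http://example.com/help.html"))       # True  -- allowed
print(rp2.can_fetch("*", "http://example.com/help/index.html")) # False -- blocked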
A request for example.com/directory will be forwarded to example.com/directory/index.ext, which is the true URL.
Sadly, not always: some webservers do not perform the forward and, as a result, return a page for the URL example.com/directory.
Regardless of whether the request was made to /directory, if we have disallowed /directory/ then the robot should not index this link, per the standard, if the bot follows robots.txt.
Only if the webserver actually redirects to the proper directory, thus giving the bot a chance to disallow it.
The trailing slash is the appropriate way to disallow a directory in its entirety and only that directory.
You see, bots don't know whether the URL they request is a directory or not. They just get data, and if the redirect is not forced then the bot will retrieve /directory perfectly legitimately.
Disallow: /directory
will disallow the directory as well as any pages of the same name.
Indeed, but it will also disallow /directory itself if the webserver does not perform the redirect.
Disallow: /help/ would disallow /help/index.html but allow /help.html.
Note the index.html: yes, /help/ will disallow /help/index.html, but in most cases webservers serve the contents of index.html without telling the client that it was actually index.html.
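That detail matters for matching: the bot only ever requests /help/, so under prefix matching a rule that names the index file explicitly never fires. A quick illustration (hypothetical rule):

rule = "/help/index.html"
print("/help/".startswith(rule))            # False -- the URL the bot actually requests is crawled
print("/help/index.html".startswith(rule))  # True  -- but this URL is never requested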
If you want to disallow a directory only, and not a page of the same name, the proper syntax is
Disallow: /directory/
If you find that your server is confused, or that bots are, then it may need to be changed, but that doesn't change the fact that this is the appropriate syntax.
If you really don't want something indexed, robots.txt isn't going to help anyway; you need a little htaccess magic. But, once again, this doesn't change the interpretation of the standard.
But interpreting the standard based on a bot's misconception, or a sysadmin's confusion at server setup, doesn't cause the standard to be altered.
I am not changing the interpretation of the standard; I believe I interpret it as-is, specifically the part that states exactly how matching should be done (i.e. URLs starting with the Disallow value), and I note that using a slash at the end will result in a conflict in a number of cases.
For my part, I modified my robots.txt processing to remove slashes at the end of Disallow statements, to ensure that the webmaster's intention to keep the directory's root itself from being crawled is honoured.
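The normalisation itself is trivial; a sketch of the idea (my actual handling differs in the details):

def normalize_disallow(value: str) -> str:
    # Strip a trailing slash so "Disallow: /shop/" also blocks "/shop".
    # "/" alone must survive: "Disallow: /" means block everything.
    if len(value) > 1 and value.endswith("/"):
        return value[:-1]
    return value

print(normalize_disallow("/shop/"))  # /shop
print(normalize_disallow("/"))       # /

The trade-off is the one noted above: /shop then also matches pages like /shopping. For a crawler, blocking slightly too much seemed safer than crawling what the webmaster clearly meant to hide.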
The info you got in the first few posts will work just fine for your robots.txt file. If you're in any doubt, check your robots.txt file using the Robots.txt Validator.
If you have any other questions, don't hesitate to ask.
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
...
1: The following lines should be obeyed by all (*) robots
2: Allow spidering of the /searchhistory/-directory
3: Do not spider /search-file
4: Do not spider /groups-file
5: Do not spider /images-file
6: Do not spider /catalogs-file
7: Do not spider /catalog_list-file
8: Do not spider /news-file
9: Do not spider /nwshp-file
10: Do not spider /?-file
11: Do not spider /addurl/image?-file
12: Do not spider /pagead/-directory
13: Do not spider /relpage/-directory
14: Do not spider /sorry/-directory
This assumes that the webserver is standards compliant.
Based on the standard:
1: The following lines should be obeyed by all (*) robots
2: Allow is not supported
3: Do not spider /search-file or dir
4: Do not spider /groups-file or dir
5: Do not spider /images-file or dir
6: Do not spider /catalogs-file or dir
7: Do not spider /catalog_list-file or dir
8: Do not spider /news-file or dir
9: Do not spider /nwshp-file or dir
10: Do not spider /?-file or dir
11: Do not spider /addurl/image?-file
12: Do not spider /pagead/-directory though this allows files of that name to be spidered
13: Do not spider /relpage/-directory though this allows files of that name to be spidered
14: Do not spider /sorry/-directory though this allows files of that name to be spidered
next :) (the robots.txt I get from them is a lot longer too)
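The gap between those two readings comes down to extensions Google documents for Googlebot: Allow lines, * as a wildcard, and $ anchoring the end of the URL, none of which are in the original standard. Here is a rough sketch of how such a pattern can be evaluated -- my own translation into a regex, not Google's actual code:

import re

def google_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # "*" matches any run of characters, "$" anchors the end of the URL;
    # everything else (including "?") is literal, so escape it.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

rule = google_pattern_to_regex("/*.html$")
print(bool(rule.match("/help.html")))      # True
print(bool(rule.match("/help.html?x=1")))  # False -- "$" anchors the end
print(bool(rule.match("/search")))         # False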
Please note also that each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, if you wanted to allow all filetypes to be served via http but only .html pages to be served via https, the robots.txt file for the http protocol (http://yourserver.com/robots.txt) would be:
User-Agent: *
Allow: /
The robots.txt file for the https protocol (https://yourserver.com/robots.txt) would be:
User-Agent: *
Disallow: /
Allow: /*.html$
[google.com...]
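In other words, the robots.txt a bot consults is keyed on scheme, host, and port together. A small sketch of how a crawler might derive it from a page URL (yourserver.com is the placeholder from the quote above):

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # Each scheme/host/port combination gets its own robots.txt.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://yourserver.com/page.pdf"))     # http://yourserver.com/robots.txt
print(robots_url("https://yourserver.com/page.html"))   # https://yourserver.com/robots.txt
print(robots_url("http://yourserver.com:8080/page"))    # http://yourserver.com:8080/robots.txt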