Disallow: /shop/
Remove the last slash, i.e.:
Disallow: /shop
The reason is that the URL [example.com...] is just as legitimate as [example.com...], but if it is used in the form without a slash at the end, your robots.txt will not disallow it. There is an exception: it should still work when the webserver issues a redirect to the proper directory, but not all webservers do that.
To disallow a directory the following is correct:
Disallow: /directory/
Without a slash at the end it might work, but it is not correct.
Actually, it's with the slash at the end that it might work, and having the slash specified means you will allow access to that directory when the slash was not used. There is no obligation for a bot to always use / at the end of a URL; some people could link to you without it, and since the bot has no idea whether it's a directory or not, it will use the URL as-is, i.e. without the slash. Here is what will then happen in your case:
The standard states the following:
"Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved."
It means that if a bot tries to access your directory via a URL without a slash at the end, the statement above won't disallow it. Many webservers will issue a redirect to the proper directory with a slash at the end, and a good bot will pick that up; however, some servers don't issue that redirect, and some bots don't check the redirected URL against robots.txt.
Either way, it's far more reliable to remove the slash at the end of all URLs in Disallow statements.
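To make that concrete, here is a minimal Python sketch of the matching the standard describes -- plain prefix matching, nothing cleverer (illustrative only, not any real bot's code):

def is_disallowed(path: str, rule: str) -> bool:
    # Per the original standard: a URL is blocked if it starts with the Disallow value.
    return path.startswith(rule)

print(is_disallowed("/shop", "/shop/"))   # False -- slips through when the rule has the trailing slash
print(is_disallowed("/shop/", "/shop/"))  # True
print(is_disallowed("/shop", "/shop"))    # True  -- without the slash, both forms are blocked
print(is_disallowed("/shop/", "/shop"))   # True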
every single search result shows the trailing slash
[google.co.uk...]
Are you saying Brett is wrong? Nay, surely not!
[searchengineworld.com...]
[webmasterworld.com...]
1) Someone links to that directory without a slash at the end (it's still legit).
2) Search engine normalisation strips away the trailing / (it is not compulsory, you know).
Why won't the others know that? Because you need to actually try to implement robots.txt processing from a search engine's point of view and then crawl a fair few URLs to come across this situation. I had to modify my robots.txt handling to take this error into account because it's so widespread. That does not make it right, however :)
Still don't believe me? Read the standard: it clearly states that only URLs that START WITH the disallowed path will be disallowed. Now tell me whether [example.com...] starts with /cgi-bin/ -- of course it doesn't. While this is unlikely to happen with cgi-bin, since URLs pointing into it usually include an actual filename, it can happen with things like /shop/.
The advice to use the trailing / is probably 99% correct, but if you want 100% correctness then don't use / at the end of paths. WebmasterWorld's validator should really catch this case, and also the case where people don't use / to start the disallowed path.
I mean, a good validator should actually implement the robots.txt algorithm as if it were a search engine, and take a list of URLs as input rather than just the robots.txt file, because current validators validate syntax rather than actual logic.
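Something like this is what I mean -- a sketch that takes a robots.txt plus a list of candidate URLs and reports what would actually get blocked (the rules and paths here are hypothetical):

def parse_disallows(robots_txt: str) -> list[str]:
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:  # an empty Disallow value blocks nothing
                rules.append(value)
    return rules

def check(robots_txt: str, paths: list[str]) -> None:
    rules = parse_disallows(robots_txt)
    for path in paths:
        blocked = any(path.startswith(rule) for rule in rules)
        print(f"{path:20} {'BLOCKED' if blocked else 'allowed'}")

check("User-agent: *\nDisallow: /shop/\n",
      ["/shop", "/shop/", "/shop/item.html"])
# /shop comes back "allowed" -- exactly the trap discussed above.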
Disallow: /directory/
is 100% the appropriate way of disallowing a directory; the alternate method of
Disallow: /directory
will disallow the directory as well as any pages of the same name.
A request for example.com/directory will be forwarded to example.com/directory/index.ext, which is the true URL. Regardless of whether the request was made to /directory, if we have disallowed /directory/ then the robot should not index this link, per the standard, if the bot follows robots.txt.
I understand that some robots still do, but that is the fault of the bot in question. The trailing slash is the appropriate way to disallow a directory in its entirety and only that directory.
The end of your quote of the standard is as follows:
... For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
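For what it's worth, Python's standard-library robotparser applies the same prefix rule, so the standard's own example can be checked directly (example.com is just a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "http://example.com/help.html"))        # False -- blocked
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False -- blocked

rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "http://example.com/help.html"))       # True  -- allowed
print(rp2.can_fetch("*", "http://example.com/help/index.html")) # False -- blocked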
A request for example.com/directory will be forwarded to example.com/directory/index.ext, which is the true URL.
Sadly, not always: some webservers do not perform the forward and, as a result, return a page for the URL example.com/directory.
Regardless of whether the request was made to /directory, if we have disallowed /directory/ then the robot should not index this link, per the standard, if the bot follows robots.txt.
Only if the webserver actually redirects to the proper directory, thus giving the bot a chance to disallow it.
The trailing slash is the appropriate way to disallow a directory in its entirety and only that directory.
You see, bots don't know whether the URL they request is a directory or not. They just get data, and if the redirect is not forced then the bot will retrieve /directory perfectly legitimately.
Disallow: /directory
will disallow the directory as well as any pages of the same name.
Indeed, but it will also disallow /directory itself if the webserver does not perform the redirect.
Disallow: /help/ would disallow /help/index.html but allow /help.html.
Note the index.html: yes, /help/ will disallow /help/index.html, but in most cases webservers serve the contents of index.html without telling the client that it was actually index.html.
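That detail matters for matching: the bot only ever requests /help/, so under prefix matching a rule that names the index file explicitly never fires. A quick illustration (hypothetical rule):

rule = "/help/index.html"
print("/help/".startswith(rule))            # False -- the URL the bot actually requests is crawled
print("/help/index.html".startswith(rule))  # True  -- but this URL is never requested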
If you want to disallow a directory only, and not a page of the same name, the proper syntax is
Disallow: /directory/
If you find that your server is confused, or that bots are, then it may need to be changed, but that doesn't change the fact that this is the appropriate syntax.
If you really don't want something indexed, robots.txt isn't going to help anyway; you need a little htaccess magic. But, once again, this doesn't change the interpretation of the standard.
But interpreting the standard based on a bot's misconception, or a sysadmin's confusion at server setup, doesn't cause the standard to be altered.
I am not changing the interpretation of the standard; I believe I interpret it as-is, specifically the part that states exactly how matching should be done (i.e. URLs starting with the Disallow value), and I note that using a slash at the end will result in a conflict in a number of cases.
For my part, I modified my robots.txt processing to remove slashes at the end of Disallow statements, to ensure that the webmaster's intention to keep the directory's root itself from being crawled is honoured.
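The normalisation itself is trivial; a sketch of the idea (my actual handling differs in the details):

def normalize_disallow(value: str) -> str:
    # Strip a trailing slash so "Disallow: /shop/" also blocks "/shop".
    # "/" alone must survive: "Disallow: /" means block everything.
    if len(value) > 1 and value.endswith("/"):
        return value[:-1]
    return value

print(normalize_disallow("/shop/"))  # /shop
print(normalize_disallow("/"))       # /

The trade-off is the one noted above: /shop then also matches pages like /shopping. For a crawler, blocking slightly too much seemed safer than crawling what the webmaster clearly meant to hide.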
The info you got in the first few posts will work just fine for your robots.txt file. If you're in any doubt, check your robots.txt file using the Robots.txt Validator.
If you have any other questions, don't hesitate to ask.
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
...
1: The following lines should be obeyed by all (*) robots
2: Allow spidering of the /searchhistory/-directory
3: Do not spider /search-file
4: Do not spider /groups-file
5: Do not spider /images-file
6: Do not spider /catalogs-file
7: Do not spider /catalog_list-file
8: Do not spider /news-file
9: Do not spider /nwshp-file
10: Do not spider /?-file
11: Do not spider /addurl/image?-file
12: Do not spider /pagead/-directory
13: Do not spider /relpage/-directory
14: Do not spider /sorry/-directory
This assumes that the webserver is standards compliant.
Based on the standard:
1: The following lines should be obeyed by all (*) robots
2: Allow is not supported
3: Do not spider /search-file or dir
4: Do not spider /groups-file or dir
5: Do not spider /images-file or dir
6: Do not spider /catalogs-file or dir
7: Do not spider /catalog_list-file or dir
8: Do not spider /news-file or dir
9: Do not spider /nwshp-file or dir
10: Do not spider /?-file or dir
11: Do not spider /addurl/image?-file
12: Do not spider /pagead/-directory though this allows files of that name to be spidered
13: Do not spider /relpage/-directory though this allows files of that name to be spidered
14: Do not spider /sorry/-directory though this allows files of that name to be spidered
next :) (the robots.txt I get from them is a lot longer too)
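The gap between those two readings comes down to extensions Google documents for Googlebot: Allow lines, * as a wildcard, and $ anchoring the end of the URL, none of which are in the original standard. Here is a rough sketch of how such a pattern can be evaluated -- my own translation into a regex, not Google's actual code:

import re

def google_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # "*" matches any run of characters, "$" anchors the end of the URL;
    # everything else (including "?") is literal, so escape it.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

rule = google_pattern_to_regex("/*.html$")
print(bool(rule.match("/help.html")))      # True
print(bool(rule.match("/help.html?x=1")))  # False -- "$" anchors the end
print(bool(rule.match("/search")))         # False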
Please note also that each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, if you wanted to allow all filetypes to be served via http but only .html pages to be served via https, the robots.txt file for the http protocol (http://yourserver.com/robots.txt) would be:
User-Agent: *
Allow: /
The robots.txt file for the https protocol (https://yourserver.com/robots.txt) would be:
User-Agent: *
Disallow: /
Allow: /*.html$
[google.com...]
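In other words, the robots.txt a bot consults is keyed on scheme, host, and port together. A small sketch of how a crawler might derive it from a page URL (yourserver.com is the placeholder from the quote above):

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # Each scheme/host/port combination gets its own robots.txt.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://yourserver.com/page.pdf"))     # http://yourserver.com/robots.txt
print(robots_url("https://yourserver.com/page.html"))   # https://yourserver.com/robots.txt
print(robots_url("http://yourserver.com:8080/page"))    # http://yourserver.com:8080/robots.txt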