Google indexed some content blocked by robots.txt

Their fault or mine?

         

salmo

6:54 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



I have noticed that Google has been following links from a page that should have been excluded by a robots.txt file. How is it possible to prevent Googlebot from indexing pages that are not intended to be included? The format used is:

User-Agent: *
Disallow: /page.html/

Is it possible that this is not the correct format?

Brett_Tabke

6:56 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is the page visible in the google cache right now?

[searchengineworld.com...]

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
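To make the "starts with" rule above concrete, here is a minimal Python sketch. The helper name is_disallowed is made up for illustration, and it only models the plain prefix comparison described in the quoted text; it is not how any particular crawler is implemented.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if `path` starts with any non-empty Disallow value.

    Hypothetical helper: models only the prefix rule quoted above.
    An empty Disallow value is skipped, since it blocks nothing.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


# Compare the two rules from the quoted example against both paths.
for rule in ("/help", "/help/"):
    for path in ("/help.html", "/help/index.html"):
        blocked = is_disallowed(path, [rule])
        print(f"Disallow: {rule:<7} path: {path:<17} blocked: {blocked}")
```

The same comparison explains why a rule ending in a slash cannot match a file path that has no slash after the file name.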

jatar_k

7:02 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Specifically, take a look at the examples.

Does the difference between "User-Agent" and "User-agent" matter? I wouldn't think so, but the proper one is User-agent.

pageoneresults

7:13 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, the robots.txt file is case sensitive. The upper case A in Agent is a problem. Good call jatar_k.

salmo

7:16 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



The page is in the cache right now.

The only reason I discovered this is that I have a new site that was recently indexed with a PR5. The only links from external sites were a graphic link from the start page of the site, and a second link from a page, www.mydomain.com/links.html, which should have been excluded by the robots file.

Yidaki

7:16 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... and remove the slash after ".htm". A slash is a directory separator. There's no such thing after a file name like "file.htm".

Brett_Tabke

7:19 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'd think it would be the trailing slash too. Although robots.txt field names are supposedly case sensitive, no bot that I know of follows that.
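One way to check both suspicions locally is with Python's standard urllib.robotparser module. This is only a sketch: the allowed helper and the three sample files are assumptions for illustration, and Python's parser is not Googlebot, so it shows how one parser reads the file rather than what Google does.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Hypothetical helper: parse robots.txt text and ask whether
    `agent` may fetch `url` according to Python's stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The file as originally posted: capital A and a trailing slash.
ORIGINAL = "User-Agent: *\nDisallow: /page.html/\n"
# Trailing slash removed, field name in the conventional case.
FIXED = "User-agent: *\nDisallow: /page.html\n"
# Trailing slash removed, but the capital A kept.
MIXED_CASE = "User-Agent: *\nDisallow: /page.html\n"

for label, text in (("original", ORIGINAL), ("fixed", FIXED), ("mixed case", MIXED_CASE)):
    print(f"{label:<10} may fetch /page.html: {allowed(text, '/page.html')}")
```

With this parser the trailing slash is what leaves /page.html fetchable; the capitalisation of the field name makes no difference to the result, which fits the observation that bots do not treat field names as case sensitive.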

pageoneresults

7:21 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You may want to validate your robots.txt file [searchengineworld.com]. Brett, is the validator set up to catch case sensitivity? I'm in agreement on the slash.

salmo

7:23 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



Thanks, I have fixed the case problem in User-agent and removed the trailing slash from the last item in the list of disallowed pages. Hopefully that will sort out the problem.

Thanks again for your help.

Slade

8:01 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



If you need to have the link removed before the next update, take a look at: [google.com...]