User-agent: *
Disallow: forum/
Disallow: forum
Disallow: /forum/
Disallow: /forum
Now, syntax-wise it is indeed a valid robots.txt. But syntax-wise 1/0 is a correct operation too, and it still fails at run-time. From a run-time point of view the file contains 3 incorrect Disallow statements that should have been picked up by the validator -- they are all mistakes. Here are the details:
> Disallow: forum/
> Disallow: forum
These are both incorrect because they have no / before the actual path. The robots.txt standard is pretty clear about Disallow statements: "This can be a full path, or a partial path; any URL that starts with this value will not be retrieved."
The mistake is to think that the path part of a URL does not start with /. Take [example.com...]: the part of the URL that is checked against Disallow statements is /forum, and /forum clearly does not start with either "forum" or "forum/" -- the real path has a / before the value.
Don't believe me? Check the robotstxt.org site for yourself -- all of their examples have a / before the actual path, and for good reason: the / is an integral component of the path, so a Disallow statement without a leading / is meaningless.
> Disallow: /forum/
You think this is correct? Think again: a URL can be [example.com...] without a slash at the end, and it is perfectly valid -- and because /forum does not start with /forum/, it would not be blocked.
I had to modify my own robots.txt handling code to take this error into account -- people write /'s at the end of their Disallow values without realising that the original URL may lack the trailing slash and still be valid.
> Disallow: /forum
Now this is THE way to do it -- all the other errors should have been highlighted by the validator, and I reckon the only reason they were not is that the validator is not a crawler; had it been a crawler, it would have received enough hate mail to analyse the common mistakes people make and explain them here.
IMO the validator should be changed to be more than just a syntax checker. If WebmasterWorld agrees with my argument, I will send the invoice via post ;)
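To make the prefix test concrete, here is a minimal Python sketch of the standard's rule (the is_blocked helper is my own name, purely for illustration), applied to the four Disallow values above:

# A minimal sketch of the matching rule from the robots.txt standard:
# a URL-path is blocked if it STARTS WITH the Disallow value.
def is_blocked(url_path, disallow_value):
    return url_path.startswith(disallow_value)

for path in ("/forum", "/forum/", "/forum/index.html"):
    for rule in ("forum/", "forum", "/forum/", "/forum"):
        print(path, repr(rule), is_blocked(path, rule))

Run it and only "/forum" blocks all three paths: "forum" and "forum/" block nothing, because every URL-path starts with /, and "/forum/" misses the slashless /forum.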
> That should block indexing on every page named testpage.html in all directories throughout the site.
Which would have been fine. But still G has indexed the page - and still has it in the index, although robots.txt has been requested a few times since then and I changed the robots.txt to include all possible variations:
User-agent: *
Disallow: testpage.html
Disallow: /testpage.html
Disallow: html/testpage.html
Disallow: /html/testpage.html
(The last two being an older name for the same page)
That should block indexing of every page named testpage.html in all directories throughout the site. Using a leading slash would only specify the file in the root.
It may work in Google's implementation (they appear to use pattern matching), but it may not for other robots that follow the robots.txt standard to the letter. Here is the quote:
"Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html."
"Starts" is the keyword -- if you pick a filename from the middle of a URL, or from its end, then don't expect it to be matched by robots that follow the standard to the letter.
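For what it's worth, Python's standard-library parser is one implementation that follows the standard this way. A short sketch (example.com is a placeholder) using the variations from the earlier post:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: testpage.html
Disallow: /testpage.html
Disallow: html/testpage.html
Disallow: /html/testpage.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Prefix matching: only the rules with a leading / ever match.
print(rp.can_fetch("*", "http://example.com/testpage.html"))        # False, "/testpage.html" matches
print(rp.can_fetch("*", "http://example.com/html/testpage.html"))   # False, "/html/testpage.html" matches
print(rp.can_fetch("*", "http://example.com/other/testpage.html"))  # True, no rule is a prefix of this path

The slashless rules are dead weight here; a pattern-matching crawler might honour them, but a standard-following one will not.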
I'll have to concur with Brett on what happens when the path is not preceded by a forward slash. For example...
User-agent: *
Disallow: testpage.html
Blocks all pages named testpage.html
Disallow: /testpage.html
Blocks only the page at the root level.
Disallow: html/testpage.html
Blocks all pages with html/testpage.html in path.
Disallow: /html/testpage.html
Blocks only the testpage.html in the /html/ directory.
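Assuming the behaviour described above is accurate (I have not verified it against Googlebot), it amounts to treating a slashless value as match-anywhere and a slash-prefixed value as anchored at the root. A Python sketch of the two models side by side (blocks_described and blocks_standard are my own names):

# Model of the behaviour DESCRIBED above -- an assumption, not verified:
# a slashless value may match anywhere in the URL-path, while a value
# with a leading / is anchored at the start of the path.
def blocks_described(disallow_value, url_path):
    if disallow_value.startswith("/"):
        return url_path.startswith(disallow_value)
    return disallow_value in url_path

# Strict prefix matching per robotstxt.org: a slashless value can never
# match, because every URL-path starts with "/".
def blocks_standard(disallow_value, url_path):
    return url_path.startswith(disallow_value)

for path in ("/testpage.html", "/html/testpage.html"):
    for rule in ("testpage.html", "/testpage.html", "html/testpage.html", "/html/testpage.html"):
        print(path, repr(rule), blocks_described(rule, path), blocks_standard(rule, path))

Under the strict model only the exact-prefix rules ever fire; the four behaviours listed above only fall out of the first model.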
There have been many topics concerning Googlebot indexing disallowed content. I believe it was jdMorgan who suggested that if you have content and don't want it indexed, then don't use a Disallow: in the robots.txt file. What he suggested was to use a Robots META Tag on the page not to be indexed...
<meta name="robots" content="none">

I've experimented with this and it does work. I removed a line from my robots.txt file disallowing a specific page from my site. It was showing with a URI-only listing. I dropped the Robots META Tag in that page and within 60 days the URI-only listing was gone. Just checked again and it is definitely not indexed. Pages that I have disallowed in the robots.txt file are indexed with URI only.
> Pages that I have disallowed in the robots.txt file are indexed with URI only.
Well, the existence of a URI does not mean the page is indexed -- some OTHER content (anchor text) pointing to your page was indexed, not the page itself. IMO that's fair play even if the site's robots.txt disallows it; and if the META tags say not to index it, the page is not indexed, as requested.
Therefore, a leading slash is required on all URL-paths to be disallowed, unless the record is specific to a particular robot that has been demonstrated to use pattern-matching as opposed to prefix-matching.
IOW, I would treat "slashless URL-path" usage in the same way as other non-standard robots protocol extensions: use them only in records directed to a specific robot that is known to support the extension.
There are some subtleties of terminology that may catch the unwary here: Robots.txt tells compliant robots not to fetch (read, request, or GET) a resource (page, file, etc.), while the on-page robots meta-tag tells compliant robots not to index the resource (include it in their index or list it as a search result).
In order for an on-page robots meta-tag to have any effect, that page must *not* be Disallowed in robots.txt, otherwise the page and the tag will never be fetched by a compliant robot.
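To spell out that interaction as a decision rule (a sketch of my own, not any engine's actual logic):

# Sketch of fetch vs. index for a compliant robot:
# disallowed = the page matches a Disallow rule in robots.txt;
# meta = the content of the page's robots META tag (seen only if fetched).
def compliant_robot(disallowed, meta):
    if disallowed:
        # Never fetched, so the META tag is never read; the URL can
        # still surface via anchor text found on other pages.
        return "not fetched; URL-only listing possible"
    if "noindex" in meta or "none" in meta:
        return "fetched, but not indexed"
    return "fetched and indexed"

print(compliant_robot(True, "noindex"))   # Disallow defeats the META tag
print(compliant_robot(False, "noindex"))  # the combination that removes the page
print(compliant_robot(False, ""))         # normal fetching and indexing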
If Googlebot finds a link to a page and that page is Disallowed in robots.txt, then Google will show that page in search results as a URL-only listing -- No title and no description. It will show up in searches based on the link-text used in the link(s) that Googlebot found on other pages. MSN and Ask Jeeves appear to behave in a similar manner.
Yahoo does almost the same thing, except that the listing in search results is not just the page's URL -- Yahoo uses the link text it found as the title of the page. (This is a relatively new behaviour, and may change).
Jim
Still a few mysteries remain.
#1: For one query, the page appears with its META description as SERP result #1; for a second query, it appears with on-page text excerpts as SERP result #30something. In no case did I find only a URL listing.
#2: Why has it been indexed in the first place? It is linked from nowhere - you need to KNOW the URL in order to get to the page.
While setting up the page, I had some technical problems which I discussed in a public forum. Somebody asked if he could take a look, and I dropped the URL. The forum is nicely indexed in Google, but the backlink could only be found with the link: query. OK, that solved mystery #2.
P.S. Darn... I missed my 1111th post...
The key is that a page may be listed as a URL-only listing if it is disallowed in robots.txt. You can remove it completely (at the cost of it being spidered repeatedly) by "allowing" it in robots.txt, and placing the <meta name="robots" content="noindex"> tag on the page itself.
Jim