User-agent: Googlebot
Disallow: /folder2
Many pages inside that folder, like the example below, were found in the Google index:
/folder2/filename.cfm?pull-this-type-of-page
Ironically, we had actually wanted the pages indexed, and they were. Later, when the pages began to fall out of Google, we noticed that had Google been following the robots.txt, they would never have gotten in there in the first place.
There are other files in the folder that we definitely do not want indexed, so this is a concern.
So, we are going to use the following robots.txt:
User-agent: Googlebot
Disallow: /folder2
Allow: /folder2/filename.cfm
Any thoughts on why Google took the pages in the blocked folder? Can we expect that adding the Allow: to robots.txt will improve the chances that Google gets the pages we want it to get and leaves the others alone?
Thanks for any opinions.
If your Disallowed pages appeared with a title taken from the page itself, and with a description taken from the page's description meta tag or a snippet derived from text on the page, then Google actually fetched those pages. In that case it is more likely that you posted the pages before (or at the same time as) the new robots.txt, or that a technical problem with your site or server caused trouble when G tried to fetch robots.txt.
If you don't want the URL mentioned at all in search results, then don't Disallow it in robots.txt. Instead, allow it to be fetched, but place a <meta name="robots" content="noindex"> tag in the <head> of that page.
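For example (only the meta tag matters; the title line is just a placeholder):

<head>
<title>Example page</title>
<meta name="robots" content="noindex">
</head>

Googlebot only sees that tag if it is allowed to fetch the page, which is why the page can't also be Disallowed in robots.txt.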
Your new Disallow-Allow construct should work as you want. But consider altering your directory structure to simplify your robots.txt in the future. For example, move the Disallowed resources into a subdirectory below the "filename.cfm" file's directory, and then Disallow that entire subdirectory.
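To illustrate both points (the second URL and the "unwanted" subdirectory name below are just stand-ins, not anything from your site): under Google's documented handling of Allow, the most specific (longest) matching path wins, so with your new file:

/folder2/filename.cfm?pull-this-type-of-page -> longest match is Allow: /folder2/filename.cfm, so it can be crawled
/folder2/some-other-file.cfm -> only match is Disallow: /folder2, so it stays blocked

And after moving the unwanted files into their own subdirectory, the robots.txt no longer needs an Allow: line at all:

User-agent: Googlebot
Disallow: /folder2/unwanted/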
Jim
We only recently discovered that files in the Disallowed /folder2 are in the index, and the new robots.txt with the Allow: hasn't been implemented yet, so it still appears as though Google disregarded the Disallow.
The *fetched* pages appeared in the search results with their page titles and with text snippets from the pages as descriptions, not as naked URL-only listings.
These pages have been updated, and have grown in number, over the last six months. Even if there were momentary technical problems, that doesn't seem to account for why the pages stayed in the index for such an extended period, with multiple spider/cache dates along the way. The robots.txt covering this folder hasn't changed in months and has remained accessible.
Do you think this must be something on our end that we just haven't figured out yet? Or is it possible that Google would index the *fetched* pages that were Disallowed, and show titles and snippet descriptions as described, for some other reason?
We are considering moving the content of the folder that we don't want spidered into another folder, but we wanted to take the easy route with the Allow: if it works.
It is possible that Googlebot violated robots.txt, just not very likely. The last credible report I find here on WebmasterWorld was from 2005.
It's more likely that robots.txt was edited but not uploaded to the server, or that some other human or procedural failing occurred on your end, rather than Google's. (Sorry to be blunt, but candy-coating it doesn't help find or resolve the problem.)
I'd look at the archived files and logs to find out.
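For example, if you have an Apache-style access log (the file name access.log below is just an example), something like

grep "GET /robots.txt" access.log

would show Googlebot's robots.txt fetches; a stretch of 404s or 5xx status codes around the spider dates would explain why the Disallow wasn't being honored.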
Jim
I assume the cache dates for the pages are after the robots directive was implemented?