Forum Moderators: Robert Charlton & goodroi


Google disregarding robots.txt disallow?


latimer

5:39 pm on Dec 10, 2008 (gmt 0)

10+ Year Member



robots.txt was:

User-agent: Googlebot
Disallow: /folder2

Many pages inside that folder, like the example below, were found in the Google index.

/folder2/filename.cfm?pull-this-type-of-page

Ironically, we had actually wanted the pages indexed, and they were. Later, when the pages began to fall out of Google, we noticed that had Google been following the robots.txt, they would never have gotten in there in the first place.

There are other files in the folder that we definitely do not want indexed, so this is a concern.

So, we are going to use the following robots.txt:

User-agent: Googlebot
Disallow: /folder2
Allow: /folder2/filename.cfm

Any thoughts on why Google took the pages in the blocked folder? Can we expect that adding the Allow: to robots.txt will improve the chance that Google gets the pages we want indexed and leaves the others alone?

thanks for any opinions

jdMorgan

6:24 pm on Dec 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google will list any URL it finds a link to, anywhere on the Web. However, if that URL is Disallowed in robots.txt, Google will not *fetch* that URL. Therefore, the listing in their search results will have no title and no description; it will appear either as a naked link or may be titled with the link-text taken from one of the links Googlebot found to that URL.

If your Disallowed pages appeared with a title taken from the page itself and with a description taken from the page's description meta-tag or a snippet derived from text on the page itself, then it is more likely that you posted the page before or at the same time as the new robots.txt, or that you had a technical problem with your site or server that caused trouble when G tried to fetch robots.txt.

If you don't want the URL mentioned at all in search results, then don't Disallow it in robots.txt. Instead, allow it to be fetched, but place a <meta name="robots" content="noindex"> tag in the <head> of that page.
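A minimal sketch of that approach (the title is just a placeholder): the page stays fetchable in robots.txt, and the meta tag in its head keeps it out of the index.

```html
<!-- In the <head> of the page you want crawled but NOT listed in results -->
<head>
  <title>Example page</title>
  <meta name="robots" content="noindex">
</head>
```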

Your new Disallow-Allow construct should work as you want. But consider altering your directory structure to simplify your robots.txt in the future -- for example, move the Disallowed resources into a subdirectory below the "filename.cfm" file's directory, and then Disallow that entire subdirectory.
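Roughly like this -- the "private" subdirectory name is hypothetical, just to illustrate the restructure:

```
# Before: an Allow exception carved out of a Disallowed folder
User-agent: Googlebot
Disallow: /folder2
Allow: /folder2/filename.cfm

# After: move the resources you never want crawled into a
# subdirectory and Disallow only that
User-agent: Googlebot
Disallow: /folder2/private/
```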

Jim

latimer

7:23 pm on Dec 10, 2008 (gmt 0)

10+ Year Member



thanks for the reply jd

Since we only recently discovered that files in the Disallowed /folder2 are in the index, and the change to robots.txt hasn't been implemented yet, it still appears as though Google disregarded the Disallow.

The *fetched* pages appeared in the search results with their page titles and text snippets for descriptions -- not naked links.

These pages have been updated and have grown in number over the last 6 months, and if there were momentary technical problems, that doesn't seem to account for why they were in the index for such an extended period, with multiple spider/cache dates. The robots.txt covering this folder hasn't changed in months and has been accessible.

Do you think this must be something on our end that we just haven't figured out yet? Or is it possible that Google would index the *fetched* pages that were Disallowed, and show titles and snippet descriptions as described, for some other reason?

We are considering moving the content of the folder that we don't want spidered into another folder, but wanted to take the easy way with the Allow: if it works.

jdMorgan

8:02 pm on Dec 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's really not a question to be answered by opinion. I'd pull the archived server logs for the time period in question and look at the Googlebot fetches of robots.txt and the Disallowed pages. Also look at the site's backups to see what was in robots.txt at the time.

It is possible that Googlebot violated robots.txt, just not very likely. The last credible report I find here on WebmasterWorld was from 2005.

It's more likely that robots.txt was edited but not uploaded to the server, or that some other human or procedural failing occurred on your end rather than Google's. (Sorry to be blunt, but candy-coating it doesn't help find or resolve the problem.)

I'd look at the archived files and logs to find out.
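A rough sketch of what that log check might look like, assuming Apache-style combined logs; the log path and sample lines below are made up -- substitute your own archived logs:

```shell
# Hypothetical sample log for illustration -- use your real archived logs.
LOG=access.log
cat > "$LOG" <<'EOF'
66.249.66.1 - - [10/Dec/2008:05:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 58 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Dec/2008:05:01:12 +0000] "GET /folder2/filename.cfm?pull-this HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
10.0.0.5 - - [10/Dec/2008:05:02:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/4.0"
EOF

# When did Googlebot fetch robots.txt, and what status did it get?
grep 'Googlebot' "$LOG" | grep '/robots.txt'

# Did Googlebot actually fetch anything under the Disallowed folder?
grep 'Googlebot' "$LOG" | grep 'GET /folder2/'
```

If the second grep turns up fetches dated after a correct robots.txt was in place, that's real evidence; if not, the problem was on the site's end.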

Jim

jimbeetle

8:27 pm on Dec 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'd also check all the simple-stupid stuff: the robots.txt file is in the proper location for your server setup; the robots.txt filename is actually all lower case; file paths in records are in the correct case; there's a blank line between each record and a blank line at the end of the file. And, as always, check for typos and misspellings within the file.
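For instance, a file with two records would look like this (user-agent names here are just examples) -- note the blank line separating the records:

```
User-agent: Googlebot
Disallow: /folder2

User-agent: *
Disallow: /cgi-bin/
```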

Receptional Andy

8:31 pm on Dec 10, 2008 (gmt 0)



You can also use Google's own "analyse robots.txt" tool within their Webmaster Tools console to see what robots.txt file content they have and whether the indexed URLs are excluded as far as they're concerned.

I assume the cache dates for the pages are after the robots directive was implemented?