Does Google Disregard (or Mistreat) robots.txt?

Or robots META tags.

leanweb

3:13 am on Nov 17, 2003 (gmt 0)

This is curious. I have some pages on my site that should not be indexed, so I added robots META tags (the in-page counterpart of robots.txt) to those pages. Apparently Google indexes these pages anyway, and only after a delay of a few days or weeks does it drop them from the index.

Is this normal?

markis00

9:16 am on Nov 17, 2003 (gmt 0)

I think what may have happened is that the googlebot can't access the pages anymore, but an old version of the pages is still in the index.

However, eventually the files will get dropped from the index, and since the googlebot can't crawl them anymore, they won't reappear.

When I deleted pages from my server, it took a couple of weeks before they were finally gone from the SERPs.

ciml

12:58 pm on Nov 17, 2003 (gmt 0)

I think this is slightly different as leanweb's using robots META tags.

I'd guess you're just having to wait for the fetched page to go through the indexing process. Googlebot crawls very widely, and we've had very few genuine REP (Robots Exclusion Protocol) violations that I can remember. In fact, I'm not sure I do remember one.

HitProf

1:24 pm on Nov 17, 2003 (gmt 0)

I've seen Googlebot walk through a meta robots exclusion on a friend's site a couple of months ago. The only remedy we could think of was to put it behind a password.

On another occasion Googlebot ignored my robots.txt file. I contacted the Googlebot team and they disposed of the pages. Three months later the pages were back in the index, displaying titles only. Now with this Florida update they finally seem to be disappearing, at least from -ex. They are still sitting on most other DCs.

takagi

1:30 pm on Nov 17, 2003 (gmt 0)

Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code.

Source: Remove Content from Google's Index [google.com]

HitProf

2:16 pm on Nov 17, 2003 (gmt 0)

Hi takagi,

This option isn't as automatic as it sounds and will work only temporarily. The results are reincluded in the index after 3 months - without recrawling and without testing for robots.txt. Robots meta tags can't have any influence this way.

From the same message:

Doing this and submitting via the automatic URL removal system will cause a temporary, 90 day removal of your site from the Google index. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 90 days to reissue the removal.)

The robots.txt bit is garbage.

jdMorgan

2:25 pm on Nov 17, 2003 (gmt 0)

> displaying titles only

If you meant "displaying URLs only," then this is standard behaviour for Google and Ask Jeeves/Teoma for pages Disallowed by robots.txt. They do not fetch the page, and they therefore do not list the page title or description, but if they find a link to the page, they will list the URL found in that link.

The fix is to *allow* Google and Ask to fetch the page, and then use the on-page meta robots tag to prevent indexing.

It all comes down to varying interpretations of the word "indexing" in the Standard for Robots Exclusion. Some search engines take a robots.txt Disallow to mean "don't mention it" and others take it to mean, "don't fetch this page." The varying results of a robots.txt Disallow are a result of these interpretations. Note also that compliance with the Standard is purely voluntary; it carries no authority.

Jim
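A minimal sketch of the distinction Jim describes, using Python's standard `urllib.robotparser`. The robots.txt content, the `may_fetch` helper, and the example.com URLs are made up for illustration; this shows how one compliant parser reads a Disallow rule, not how Googlebot itself is implemented. The point is that a crawler honouring the Disallow never fetches the page, so it can never read a meta robots tag placed on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A disallowed URL is never fetched, so any on-page NOINDEX tag goes unseen.
blocked = may_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/private/page.html")
# A URL outside the disallowed path may be fetched, and its meta tags read.
allowed = may_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/public/page.html")

print(blocked, allowed)  # prints: False True
```

This is why the fix is to lift the Disallow for pages that should vanish from the listings and rely on the on-page tag instead.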

takagi

2:37 pm on Nov 17, 2003 (gmt 0)

Hi HitProf,

My understanding of the remark on the 90 days is different. It only applies to the situation where the robots.txt file is in a subdirectory. The normal place for this file is the root, and then it should work OK. What leanweb described was using a meta tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

and not the robots.txt file. But maybe your experience is not the same as the explanation on the Google site.
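For anyone who wants to check that such a tag is actually present in the served HTML, here is a rough sketch using Python's standard `html.parser`. The `RobotsMetaParser` class, the `robots_directives` helper, and the sample page are hypothetical names invented for this example; it only reports which directives a page declares and says nothing about how any search engine acts on them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of lowercased robots directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Sample page using the tag quoted above.
PAGE = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><body></body></html>'

directives = robots_directives(PAGE)
may_index = "noindex" not in directives

print(sorted(directives), may_index)  # prints: ['nofollow', 'noindex'] False
```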

HitProf

2:45 pm on Nov 17, 2003 (gmt 0)

Jim,

Slip of the keyboard; of course I meant URL.
The funniest thing is that the pages don't even exist: they were made up by Googlebot following relative URLs.

Takagi,

It's indeed my experience and not just interpretation.

btw: I'm not concerned about these pages anymore, it's just an observation of how things work.

ciml

3:37 pm on Nov 17, 2003 (gmt 0)

HitProf, if Google isn't allowed to fetch the URLs then it can't tell that they return 404 Not Found. In that scenario, the fix is just to remove the /robots.txt and let Googlebot find that there's nothing there.

HitProf

4:09 pm on Nov 17, 2003 (gmt 0)

ciml

brilliant! as usual :)

Newman

4:56 pm on Nov 17, 2003 (gmt 0)

My robots.txt is an empty file. Is that OK for Googlebot?

HitProf

7:23 pm on Nov 19, 2003 (gmt 0)

Shouldn't be a problem.
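As a quick sanity check, Python's standard `urllib.robotparser` treats an empty robots.txt the same way: no rules means nothing is disallowed. The `allows` helper and the example.com URL are made up for this sketch, and it only shows how that one parser behaves, not how Googlebot is implemented.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty file contains no Disallow rules, so every URL is permitted.
empty_ok = allows("", "Googlebot", "http://example.com/any/page.html")

print(empty_ok)  # prints: True
```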