Forum Moderators: Robert Charlton & goodroi


Why does Google ignore my robots.txt page?

         

Rob_Cook

7:20 am on Oct 10, 2005 (gmt 0)

10+ Year Member



Google has for some time now completely disregarded my robots.txt page, continuing to index pages from my site when I clearly and correctly instruct them not to do so.

I also add a robots 'noindex' command on the individual pages. No matter - they ignore that too.

Is Google doing this on purpose? Are they no longer bothering to adhere to robots instructions?

larryhatch

8:35 am on Oct 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Two things you might do:

1) Whois the IP addresses of the Google spiders in your access log files. Make sure they are indeed Google.
2) Try to validate your robots.txt file. One tiny mistake and it might not work.

IF it is indeed Google, AND your robots validates, then there is an issue. -Larry

taps

9:52 am on Oct 10, 2005 (gmt 0)

10+ Year Member



One other problem could be:

You have a section for all user-agents that disallows a lot of pages, and maybe an additional entry especially for Googlebot disallowing only one specific page.

If Googlebot finds an entry "User-agent: Googlebot" in robots.txt, it will ignore every other entry. This can cause a lot of trouble.

If you have a separate section for Googlebot, you'll have to repeat there all the entries made for "User-agent: *" and then add the specific pages for Googlebot.
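This group-selection rule can be sanity-checked with Python's standard urllib.robotparser, which follows the same convention. The robots.txt below is a made-up example, not anyone's actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt: a Googlebot-specific group plus a catch-all group.
rules = """\
User-agent: Googlebot
Disallow: /ads/

User-agent: *
Disallow: /badbot/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot obeys ONLY its own group, so /badbot/ is fetchable for it...
print(rp.can_fetch("Googlebot", "http://example.com/badbot/banme.cgi"))
# ...while a bot with no specific group falls back to the * group.
print(rp.can_fetch("SomeOtherBot", "http://example.com/badbot/banme.cgi"))
```

The first call returns True and the second False, which is exactly the trap described above: the Googlebot group silently replaces, rather than extends, the catch-all rules.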

Concerning the noindex: Googlebot ignored it completely on my pages. My theory: as soon as Googlebot finds a robots.txt, it will ignore the meta tag. Too bad for me, since Google found a lot of duplicate content despite these tags :-(

Rob_Cook

11:38 am on Oct 10, 2005 (gmt 0)

10+ Year Member



Well, this is a direct copy of my robots.txt:

User-agent: Googlebot
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /redirect/

Google is indexing pages in the /redirect/ folder.

Every page in that folder also contains:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Anyone see anything wrong with that?

Umbra

12:01 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



*Bump* Same problem here. Googlebot occasionally ignores our robots.txt (which has been in place for a few years) and then falls into a bad-bot trap.

Wibfision

12:30 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



Google will list anything that has a link pointing to it, regardless of whether it is in your robots.txt. Note that I said list, not index. I would imagine that any pages in your /redirect/ folder are listed URL-only.

If you don't want them listed at all, don't exclude the folder with robots.txt; just use the "NOINDEX,NOFOLLOW" meta tag. At the moment, Google doesn't see the NOINDEX because it cannot fetch the page at all - the page is excluded in robots.txt.

I hope that makes sense :-)

charlier

12:52 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



I think Wibfision has hit the nail on the head, but I wonder if there is any way to keep Google from putting the link URLs in the index other than noindex,nofollow. As I understand it, putting that on a page would cause all the links on the page to be ignored. What if you just have a couple of links you don't want followed? For example, we have a 'Send this to a friend' link on each of our stories; this link has the current page name in the URL so the target form can figure out which story it should send. I have thought of using the referrer info to get the story ID, or some sort of IFRAME arrangement so the link only appears to be on the page.

Any other ideas greatly appreciated.

jdMorgan

12:56 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> continuing to index pages

Be aware that robots.txt instructs a robot not to *fetch* a page. It does not instruct a search engine not to *list* a page. If Google finds a link to your page on another site, it may indeed list your page by URL without ever having fetched it. Thus, there is no violation of robots.txt. Yahoo does the same thing, but includes the link-text it finds on the link as the title in search results.

So, this may have to do with your definition of "continuing to index pages," which is not clear from the description in this thread.

If you are basing your statements on URL-only links appearing in Google's SERPs, then the solution is to remove the Disallow from robots.txt, and use the on-page <meta name="robots" content="noindex"> tag to prevent the page from being listed. The function of this robots meta tag and the robots.txt file are not the same, since robots says "don't fetch this page," whereas the meta tag says, "Don't index this page." So in order for a search engine robot to "see" your on-page robots meta tag, you must allow the page to be fetched by not Disallowing it in robots.txt.

If none of the above is applicable, then you might want to look at your site using a Server Headers checker [webmasterworld.com], and verify that the correct robots.txt file and server response code are returned when requested from any domain or subdomain that your server will respond to. In some cases, servers are configured so that a separate robots.txt is associated with each subdomain, so requesting example.com and www.example.com can and will return two different robots.txt files, even if these sites' "content pages" are the same. If one of these robots.txt files is missing, a 404-Not Found will be returned, and the usual interpretation of a missing robots.txt is that the entire site may be spidered.

So a checklist might be:

  • Validate robots.txt [searchengineworld.com] files.
  • Check server response [webmasterworld.com] for robots.txt request in each active domain and subdomain.
  • If using robots meta tags on pages, do not Disallow robot from those pages using robots.txt.
  • Verify that the requesting IP address and User-agent is a legitimate robot, and not a spoof, WAP proxy, or "Web accelerator" service.
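The last item above - confirming that a "Googlebot" visitor really is Google - is commonly done with a reverse-DNS lookup followed by a forward-confirm. A minimal Python sketch; the `reverse`/`forward` parameters are only there so the logic can be exercised without network access:

```python
import socket

def is_legit_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname):
    """Reverse-resolve the IP, require a googlebot.com/google.com hostname,
    then forward-resolve that hostname and confirm it maps back to the IP."""
    try:
        host = reverse(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

A spoofer can fake the User-agent header, but it cannot make Google's reverse and forward DNS agree on its IP address, which is why both lookups are needed.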

If all this checks out, then stronger measures such as a bad-bot trap can be used. However, if I saw a legitimate Googlebot fall into my bot-trap, I'd report it to them immediately.

Jim

Umbra

2:05 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



If there is a bad bot trap like:

http://www.example.com/badbot/banme.cgi

and robots.txt has included for a long time:

User-agent: *
Disallow: /badbot/

then Googlebot should never visit this page, right? Even if an external link points to that URL, robots.txt would instruct Googlebot to avoid it.

In our case, Googlebot triggered the bad bot trap a few times this year. The page is not listed in Google's index, the Googlebot user agent is legit, and the IP address points to Google Inc. All the pages validate and return the proper server response.

So more recently, we added this meta tag as an extra precaution:

<META NAME="robots" CONTENT="NOINDEX">

Despite that, Googlebot recently triggered the bad bot trap yet again. I don't know why Googlebot previously ignored robots.txt. The robots.txt AND/OR the new meta tag should instruct Googlebot to leave this page alone for the foreseeable future. If not, then Googlebot is definitely misbehaving.

Lord Majestic

2:10 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



User-agent: *
Disallow: /badbot/

Then Googlebot should never visit this page, right?

Not right _IF_ you have a section of robots.txt that applies SPECIFICALLY to Googlebot and have not disallowed /badbot/ there as well. When a specific bot is named in a User-agent line, ONLY THAT SECTION applies to it -- "User-agent: *" is disregarded.

This is why Google gets into your bad bot area -- you should have disallowed it for all specifically-named bots too.

Which is really a question for the robots.txt forum, not Google News.

Umbra

2:13 pm on Oct 12, 2005 (gmt 0)

10+ Year Member



Lord Majestic,

Our robots.txt does not mention Googlebot specifically anywhere,

i.e.:

User-Agent: Unwantedbot1
Disallow: /

User-Agent: Unwantedbot2
Disallow: /

User-agent: *
Disallow: /badbot/

Since Googlebot is not "Unwantedbot1" or "Unwantedbot2", doesn't it have to pay attention to what's listed under "User-agent: *"?
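That reading can be confirmed mechanically with Python's urllib.robotparser applied to the file quoted above: with no group naming Googlebot, it falls through to the "*" group.

```python
from urllib import robotparser

# The robots.txt quoted in the post above.
rules = """\
User-Agent: Unwantedbot1
Disallow: /

User-Agent: Unwantedbot2
Disallow: /

User-agent: *
Disallow: /badbot/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# No group names Googlebot, so the * group governs it:
print(rp.can_fetch("Googlebot", "http://example.com/badbot/banme.cgi"))  # blocked
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))        # allowed
print(rp.can_fetch("Unwantedbot1", "http://example.com/index.html"))     # blocked everywhere
```

So by the standard, /badbot/ is indeed off-limits to a compliant Googlebot under this file.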

jdMorgan

2:23 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Umbra,

> Then Googlebot should never visit this page, right?

That is correct. It should not fetch that page. I've never personally seen Googlebot violate robots.txt, though, at least not recently -- I don't remember such an incident, that is. In the cases where some other *Google user-agent* fell into a trap, it was always their WAP proxy or their "Accelerator," neither of which is Googlebot.

In cases like these, it's best to use an internal URL rewrite to give the user-agent a 403 on the requested page. This isn't for the faint of heart, though, because it *is* cloaking -- but with no intention to deceive (since the page is Disallowed anyway).
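An internal rewrite along those lines might look roughly like this in .htaccess. This is only a sketch: the user-agent patterns are placeholders, to be replaced with whatever proxy/accelerator strings actually show up in your own access log, and the path matches the /redirect/ folder discussed earlier:

```apache
RewriteEngine On
# Placeholder patterns -- substitute the exact user-agent strings
# seen in your logs for the unwanted fetchers.
RewriteCond %{HTTP_USER_AGENT} "(Transcoder|Web Accelerator)" [NC]
# Return 403 Forbidden for the Disallow'd folder.
RewriteRule ^redirect/ - [F]
```

The [F] flag short-circuits the request with a 403 before any page is served, which is what makes this a "rewrite" rather than a redirect.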

The main point of my post is that there are subtle differences between the functions and meanings of robots.txt and the on-page robots meta tag, and that all issues of canonical domain resolution, server responses, robots.txt for different subdomains, and robots.txt location and syntax must be correct before robots.txt exclusion can be relied upon to function properly.

In your case, you'll need to be sure that Gbot can fetch the page(s) with the meta tags on them *before* being trapped. Depending on your implementation of the poison links and/or bait-URL rewrites, it's quite possible that your trap will be invoked before Gbot could ever read that meta tag.

Jim

Lord Majestic

2:31 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Our robots.txt does not mention Googlebot specifically anywhere in robots.txt

In which case my explanation does not apply. :(

Chris_D

2:44 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anyone see anything wrong with that?

As several people have pointed out:
- you can't use both meta robots and robots.txt to address the same page
- robots.txt WON'T prevent Google from listing a URL in its index.

Pretend Googlebot is a book reviewer.

Robots.txt is like a sign on the book cover which basically says 'you can't open this book and read it, but you can still list that this book is here'.

The meta robots 'noindex' is on the first page inside the book, and says 'you can't keep any record that this book is here, or any record of what's in it'.

So what happens if you use both robots.txt and meta robots noindex? Obeying robots.txt means the book will never get opened, so the meta robots noindex instruction never gets read -- it's 'inside the book'.

So what do you do?

Use one or the other -- not both.
If you don't want the URL in the index, use meta robots noindex, which means putting it in the head of the document on every page you don't want indexed.
If that's too hard (e.g. your template gets used in a CMS for bazillions of pages, most of which you want indexed) and you don't care whether the URL itself is in the index, use robots.txt.
There is no other 'easy' way to tell a robot 'don't even think about listing this URL in your index or reading it'.

Alternative workarounds can also be used -- like converting this content into JavaScript popup pages....
:)