I also add a robots 'noindex' command on the individual pages. No matter - they ignore that too.
Is Google doing this on purpose? Are they no longer bothering to adhere to robots instructions?
Perhaps you have a section for all user-agents that disallows a lot of pages, and an additional section especially for Googlebot disallowing only one specific page.
If Googlebot finds an entry "User-agent: Googlebot" in robots.txt, it will ignore every other section. This can cause a lot of trouble.
If you have a separate section for Googlebot, you'll have to repeat in it all the entries made under "User-agent: *" and then add the pages specific to Googlebot.
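For example (the paths here are hypothetical), a correct layout would look like:

User-agent: *
Disallow: /private/
Disallow: /temp/

User-agent: Googlebot
Disallow: /private/
Disallow: /temp/
Disallow: /google-only-page.html

Googlebot reads only the second section, so the first two Disallow lines have to appear there again.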
Concerning the noindex: Googlebot ignored those tags completely on my pages. My theory is: as soon as Googlebot finds a page excluded in robots.txt, it never gets to see the meta noindex tag. Too bad for me, since Google found a lot of duplicate content while ignoring these tags :-(
User-agent: Googlebot
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /redirect/
Google is indexing pages in the /redirect/ folder.
Every page in that folder also contains:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Anyone see anything wrong with that?
If you don't want them to be listed, don't exclude the folder with robots.txt; just use the "NOINDEX,NOFOLLOW" meta tag. At the moment, Google doesn't see the NOINDEX because it cannot fetch the page at all - it is excluded in robots.txt.
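For example, each page you want kept out would carry something like this in its head (a minimal sketch):

<head>
<title>Example page</title>
<meta name="robots" content="noindex,nofollow">
</head>

Google fetches the page, sees the tag, and drops it from the listings.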
I hope that makes sense :-)
Any other ideas greatly appreciated.
Be aware that robots.txt instructs a robot not to *fetch* a page. It does not instruct a search engine not to *list* a page. If Google finds a link to your page on another site, it may indeed list your page by URL without ever having fetched it - thus, there is no violation of robots.txt. Yahoo does the same thing, but uses the anchor text of the link as the title in its search results.
So, this may have to do with your definition of "continuing to index pages," which is not clear from the description in this thread.
If you are basing your statements on URL-only links appearing in Google's SERPs, then the solution is to remove the Disallow from robots.txt, and use the on-page <meta name="robots" content="noindex"> tag to prevent the page from being listed. The function of this robots meta tag and the robots.txt file are not the same, since robots says "don't fetch this page," whereas the meta tag says, "Don't index this page." So in order for a search engine robot to "see" your on-page robots meta tag, you must allow the page to be fetched by not Disallowing it in robots.txt.
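Applied to the robots.txt quoted earlier in this thread, the change would be something like:

User-agent: Googlebot
Disallow: /ads/
Disallow: /cgi-bin/

...i.e. dropping the "Disallow: /redirect/" line and relying on the NOINDEX,NOFOLLOW meta tag those pages already carry. Googlebot can then fetch the pages, see the tag, and remove the URL-only listings.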
If none of the above is applicable, then you might want to look at your site using a Server Headers checker [webmasterworld.com], and verify that the correct robots.txt file and server response code are returned when requested from any domain or subdomain that your server will respond to. In some cases, servers are configured so that a separate robots.txt is associated with each subdomain, so that requesting example.com and www.example.com can and will return two different robots.txt files, even if the sites' "content pages" are the same. If one of these robots.txt files is missing, then a 404-Not Found would be returned, and the usual interpretation of a missing robots.txt is to spider the entire site.
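A quick way to check is to request robots.txt from each hostname yourself and compare the responses. A minimal sketch in Python (example.com stands in for your own domain):

import urllib.request
import urllib.error

# Hostnames to compare -- substitute your own domain and subdomains.
hosts = ["http://example.com", "http://www.example.com"]

for host in hosts:
    url = host + "/robots.txt"
    try:
        resp = urllib.request.urlopen(url)
        # 200 means a robots.txt was served; print the start of it for comparison.
        print(url, "->", resp.getcode())
        print(resp.read().decode("utf-8", "replace")[:300])
    except urllib.error.HTTPError as e:
        # 404 means no robots.txt here -- robots will usually spider the whole site.
        print(url, "->", e.code)

If the two hostnames return different files, or one returns a 404, that mismatch is a likely culprit.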
So a checklist might be:
- robots.txt location and syntax are correct
- canonical domain resolution is handled (example.com vs. www.example.com)
- every subdomain returns the robots.txt file and server response code you expect
- any on-page robots meta tags sit on pages that robots are actually allowed to fetch
If all this checks out, then stronger measures such as a bad-bot trap can be used. However, if I saw a legitimate Googlebot fall into my bot-trap, I'd report it to them immediately.
Jim
Our bad-bot trap page is:
http://www.example.com/badbot/banme.cgi
And robots.txt has included, for a long time:
User-agent: *
Disallow: /badbot/
Then Googlebot should never visit this page, right? Even if an external link points to that URL, robots.txt would still instruct Googlebot to avoid it.
In our case, Googlebot triggered the bad bot trap a few times this year. The page is not listed in Google's index, the Googlebot user agent is legit, and the IP address points to Google Inc. All the pages validate and have proper server response.
So more recently, we added this meta tag as an extra precaution:
<META NAME="robots" CONTENT="NOINDEX">
Despite that, Googlebot recently triggered the bad-bot trap yet again. I don't know why Googlebot previously ignored robots.txt, but the robots.txt AND/OR the new meta tag should instruct Googlebot to leave this page alone for the foreseeable future. If not, then Googlebot is definitely misbehaving.
User-agent: *
Disallow: /badbot/

> Then Googlebot should never visit this page, right?
Not right _IF_ there is a section of robots.txt related SPECIFICALLY to Googlebot (if you have such a section) that does not disallow /badbot/. When a specific bot is named in a User-agent line, ONLY THAT SECTION applies to it - the "User-agent: *" section is disregarded.
This would be why Google gets into your bad-bot area - you would have to disallow it in every bot-specific section too, as in the example below.
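For instance, a robots.txt like this (hypothetical) would send Googlebot straight into the trap, because the Googlebot section is the only one it obeys, and that section lacks the /badbot/ rule:

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /badbot/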
Which is really the question for robots.txt forum, not Google News.
Our robots.txt does not mention Googlebot specifically anywhere, i.e.:
User-Agent: Unwantedbot1
Disallow: /
User-Agent: Unwantedbot2
Disallow: /
User-agent: *
Disallow: /badbot/
Since Googlebot is not "Unwantedbot1" or "Unwantedbot2", doesn't it have to pay attention to what's listed under User-agent: *?
> Then Googlebot should never visit this page, right?
That is correct. It should not fetch that page. I've never personally seen Googlebot violate robots.txt, though, at least not recently -- I don't remember such an incident, that is. In the cases where some other *Google user-agent* fell into a trap, it was always their WAP proxy or their "Accelerator," neither of which are Googlebot.
In cases like these, it's best to use an internal URL rewrite to give the user-agent a 403 on the requested page. This isn't for the faint of heart though, because it *is* cloaking, but with no intention to deceive (since the page is Disallowed anyway).
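A minimal sketch of such a rewrite, assuming Apache with mod_rewrite and a placeholder "BadBot" user-agent string (substitute the agent you actually see in your logs):

# .htaccess -- serve 403 Forbidden to the offending user-agent for the trapped area
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule ^badbot/ - [F]

The [F] flag returns the 403 without ever serving the page.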
The main point to my post is that there are subtle differences between the functions and meanings of robots.txt and the on-page robots meta-tag, and that all issues of canonical domain resolution, server responses, robots.txt for different subdomains, robots.txt location and syntax -- all of these must be correct before robots.txt exclusion can be relied upon to function properly.
In your case, you'll need to be sure that Gbot can fetch the page(s) with the meta-tags on them *before* being trapped. Depending on your implementation of the poison links and/or bait-URL rewrites, it's quite possible that your trap will be invoked before Gbot could ever read that meta-tag.
Jim
> Anyone see anything wrong with that?
As several people have pointed out:
- you can't use both meta robots and robots.txt to address the same page
- robots.txt WON'T prevent Google from listing a URL in its index.
Pretend Googlebot is a book reviewer.
Robots.txt is like a sign on the book cover which basically says 'you can't open this book and read it - but you can still list that this book is here'.
The meta robots 'noindex' is on the first page inside the book, and says 'you can't keep any record that this book is here, or any record of what's in it'.
So what happens if you use both robots.txt and meta robots noindex? Well - obeying robots.txt means the book will never get opened, so the meta robots noindex instruction won't get read - it's 'inside the book'.
So what do you do?
Use one or the other - not both.
If you don't want the URL in the index - use meta robots noindex - which means putting it in the head of the document on every page you don't want indexed....
If that's too hard (e.g. your template gets used in a CMS for bazillions of pages, most of which you want indexed) and you don't care if the URL itself is in the index or not - use robots.txt.
There is no other 'easy' way to tell a robot 'don't even think about listing this URL in your index or reading it'.
Alternative workarounds can also be used - like converting the content into JavaScript popup pages....
:)