Forum Moderators: goodroi
(I hope this is the right forum)
When you say these disallowed pages are indexed, do you mean that the page is shown in G search results with a title, snippet, description, etc., or is it just listed as a URL?
In the latter case, G is simply listing the information it can find from links to the page (and the associated link text), so no title, snippet, or description appears in the results. If you want to remove this type of listing, the solution seems counter-intuitive: remove the disallow in robots.txt, and add an on-page meta robots noindex tag. This applies to AJ/Teoma as well.
If the former case - if you're seeing a "full listing" - then there is indeed some problem with robots.txt. Make sure that your records are in the right order (Spiders accept the first User-agent directive that matches their name or "*", whichever comes first, and won't look further).
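If you want to sanity-check record ordering yourself, Python's built-in robots.txt parser is a rough stand-in for how spiders pick a record (the rules below are made up for illustration, not from anyone's actual file):

```python
# Demonstrates that a named record takes precedence over the "*" record.
import urllib.robotparser

rules = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own record, so the "*" record's rules don't apply:
print(rp.can_fetch("Googlebot", "/private/page.html"))      # True
print(rp.can_fetch("SomeOtherBot", "/private/page.html"))   # False
print(rp.can_fetch("Googlebot", "/cgi-bin/script.cgi"))     # False
```

Real crawlers implement their own matching, so treat this only as a quick way to spot ordering mistakes, not as a guarantee of how any particular bot behaves.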
Jim
User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
Disallow: /bla/stuff.html
Disallow: /links/
Disallow: /googlereplace.html
The problem comes in with /wiget1.html and /wiget2.html showing up in the SERPs.
FYI the entry /robot.html is a honeypot to catch bots that use the robots.txt file to "find" files they aren't supposed to know about. It helps in targeting the bad guys.
But my intention in disallowing it was to keep the page from being indexed. Hence my second question:
Do I have to use the robots-metatag to prevent these pages from being indexed?
@KenB: I do use this as my bad bots trap, I have one directory mentioned in robots.txt with a default page that will add the user agent string and IP to a database as a source for banning visitors.
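For anyone curious, the logging step behind such a trap page can be sketched in a few lines. This is only an illustration, assuming a Python CGI-style handler; the database and table names are made up, not taken from this thread:

```python
# Hypothetical sketch of the logging step behind a robots.txt bot trap.
import sqlite3
import time

def log_trap_hit(db_path, ip, user_agent):
    """Record one hit on the trap page so the IP can be banned later."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bad_bots"
        " (ts REAL, ip TEXT, user_agent TEXT)"
    )
    conn.execute(
        "INSERT INTO bad_bots VALUES (?, ?, ?)",
        (time.time(), ip, user_agent),
    )
    conn.commit()
    conn.close()

# In a CGI script serving the trapped page, the values would come
# from the request environment, e.g.:
#   log_trap_hit("bad_bots.db",
#                os.environ.get("REMOTE_ADDR", ""),
#                os.environ.get("HTTP_USER_AGENT", ""))
```

A ban script or .htaccess generator can then read from that table; how you act on the data is a separate decision.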
The problem comes in with Google's definition of "indexed" versus our expectations. In cases where Google shows only a URL, they do not consider themselves to have "indexed" the page, because indeed, they have not fetched the page. They are showing a listing based solely upon a link to the page that they found. Because they did not crawl the page, they cannot show a title or description, and the page's placement in the SERPs is completely dependent upon the link text of the link(s) they found to the page.
Ask Jeeves/Teoma's behaviour is similar enough to just call it identical.
How to stop it: As counterintuitive as it is, the best way to stop it is to set up a special section of your robots.txt for AJ and Google. In that section, *do not* disallow AJ or Google from crawling that page. Instead, add the <meta name="robots" content="noindex"> tag to your pages that you don't want them to show in the SERPs.
The advantage of this approach is that other 'bots will go to the 'general' section of your robots.txt and see that they're not supposed to fetch those pages. This saves you some bandwidth. AJ and Google will go to their own special section, see that they can fetch the page, GET the page, find the meta robots tag, and drop the page from their listings. However, they will continue to crawl that page periodically, and this costs you some bandwidth.
In order to keep your robots.txt small, you can use the same 'record' for both AJ and Google - like this:
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/
Disallow: /robot.html
User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
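As a quick check of the combined record above, Python's stdlib parser can be fed the same rules (note this is only a rough proxy; real crawlers do their own matching, and the stdlib parser's user-agent matching is simplistic, so the Ask Jeeves record name may not match here the way it would for the real bot):

```python
# Checks that the combined robots.txt lets Googlebot fetch the widget
# pages (so it can see the meta noindex tag) while other bots stay out.
import urllib.robotparser

ROBOTS = """\
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/
Disallow: /robot.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "/wiget1.html"))      # True
print(rp.can_fetch("Googlebot", "/robot.html"))       # False
print(rp.can_fetch("SomeOtherBot", "/wiget1.html"))   # False
```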
Thank you very much for that information. So GoogleBot does work as I suspected:
So technically, it does follow the robots.txt and does not visit the page. But the URL still shows up, since it's linked to from a page that's allowed by robots.txt.
I will change my robots.txt as you proposed and add the meta tags to the files.
FYI the entry /robot.html is a honeypot to catch bots that use the robots.txt file to "find" files they aren't supposed to know about. It helps in targeting the bad guys.
KenB, I heard of this before and looked into finding a way to see which robots visit the robots.txt file, but haven't been successful.
Are you able to offer advice on how to set this up?
many thx