Forum Moderators: goodroi
(I hope this is the right forum)
When you say these disallowed pages are indexed, do you mean that the page is shown in G search results with a title, snippet, description, etc., or is it just listed as a URL?
In the latter case, G is simply listing the information it can find from links to the page (and the associated link text), so no title, snippet, or description appears in the results. If you want to remove this type of listing, the solution seems counter-intuitive: remove the disallow in robots.txt, and add an on-page meta robots noindex tag. This applies to AJ/Teoma as well.
If the former case - if you're seeing a "full listing" - then there is indeed some problem with robots.txt. Make sure that your records are in the right order (Spiders accept the first User-agent directive that matches their name or "*", whichever comes first, and won't look further).
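If you want to sanity-check record ordering yourself, Python's built-in robots.txt parser is a rough stand-in for how spiders pick a record (the rules below are made up for illustration, not from anyone's actual file):

```python
# Demonstrates that a named record takes precedence over the "*" record.
import urllib.robotparser

rules = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own record, so the "*" record's rules don't apply:
print(rp.can_fetch("Googlebot", "/private/page.html"))      # True
print(rp.can_fetch("SomeOtherBot", "/private/page.html"))   # False
print(rp.can_fetch("Googlebot", "/cgi-bin/script.cgi"))     # False
```

Real crawlers implement their own matching, so treat this only as a quick way to spot ordering mistakes, not as a guarantee of how any particular bot behaves.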
Jim
User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
Disallow: /bla/stuff.html
Disallow: /links/
Disallow: /googlereplace.html
The problem comes in with /wiget1.html and /wiget2.html showing up in the SERPs.
FYI the entry /robot.html is a honeypot to catch bots that use the robots.txt file to "find" files they aren't supposed to know about. It helps in targeting the bad guys.
But my intention in disallowing it was to keep the page from being indexed. Hence my second question:
Do I have to use the robots-metatag to prevent these pages from being indexed?
@KenB: I do use this as my bad bots trap, I have one directory mentioned in robots.txt with a default page that will add the user agent string and IP to a database as a source for banning visitors.
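For anyone curious, the logging step behind such a trap page can be sketched in a few lines. This is only an illustration, assuming a Python CGI-style handler; the database and table names are made up, not taken from this thread:

```python
# Hypothetical sketch of the logging step behind a robots.txt bot trap.
import sqlite3
import time

def log_trap_hit(db_path, ip, user_agent):
    """Record one hit on the trap page so the IP can be banned later."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bad_bots"
        " (ts REAL, ip TEXT, user_agent TEXT)"
    )
    conn.execute(
        "INSERT INTO bad_bots VALUES (?, ?, ?)",
        (time.time(), ip, user_agent),
    )
    conn.commit()
    conn.close()

# In a CGI script serving the trapped page, the values would come
# from the request environment, e.g.:
#   log_trap_hit("bad_bots.db",
#                os.environ.get("REMOTE_ADDR", ""),
#                os.environ.get("HTTP_USER_AGENT", ""))
```

A ban script or .htaccess generator can then read from that table; how you act on the data is a separate decision.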
The problem comes in with Google's definition of "indexed" versus our expectations. In cases where Google shows only a URL, they do not consider themselves to have "indexed" the page, because indeed, they have not fetched the page. They are showing a listing based solely upon a link to the page that they found. Because they did not crawl the page, they cannot show a title or description, and the page's placement in the SERPs is completely dependent upon the link text of the link(s) they found to the page.
Ask Jeeves/Teoma's behaviour is similar enough to just call it identical.
How to stop it: As counterintuitive as it is, the best way to stop it is to set up a special section of your robots.txt for AJ and Google. In that section, *do not* disallow AJ or Google from crawling that page. Instead, add the <meta name="robots" content="noindex"> tag to your pages that you don't want them to show in the SERPs.
The advantage of this approach is that other 'bots will go to the 'general' section of your robots.txt and see that they're not supposed to fetch those pages. This saves you some bandwidth. AJ and Google will go to their own special section, see that they can fetch the page, GET the page, find the meta robots tag, and drop the page from their listings. However, they will continue to crawl that page periodically, and this costs you some bandwidth.
In order to keep your robots.txt small, you can use the same 'record' for both AJ and Google - like this:
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/
Disallow: /robot.html
User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
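As a quick check of the combined record above, Python's stdlib parser can be fed the same rules (note this is only a rough proxy; real crawlers do their own matching, and the stdlib parser's user-agent matching is simplistic, so the Ask Jeeves record name may not match here the way it would for the real bot):

```python
# Checks that the combined robots.txt lets Googlebot fetch the widget
# pages (so it can see the meta noindex tag) while other bots stay out.
import urllib.robotparser

ROBOTS = """\
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/
Disallow: /robot.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "/wiget1.html"))      # True
print(rp.can_fetch("Googlebot", "/robot.html"))       # False
print(rp.can_fetch("SomeOtherBot", "/wiget1.html"))   # False
```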
Thank you very much for that information. So GoogleBot does work as I suspected:
So technically, it does follow the robots.txt and does not visit the page. But the URL still shows up, since it's linked to from a page that's allowed by robots.txt.
I will change my robots.txt as you proposed and add the meta tags to the files.
FYI the entry /robot.html is a honeypot to catch bots that use the robots.txt file to "find" files they aren't supposed to know about. It helps in targeting the bad guys.
KenB, I heard of this before and looked into finding a way to see which robots visit the robots.txt file, but haven't been successful.
Are you able to offer advice on how to set this up?
many thx