
GoogleBot is too dumb to understand robots.txt and meta robots


AthlonInside

10:25 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Huh, I am tired of GoogleBot, which keeps on ignoring my robots.txt file and my meta robots tags.

I have blocked it with:

User-agent: *
Disallow: /something.php

User-agent: others
Disallow: /somethingelse.php

Why does GoogleBot still think he has some kind of veto power to do anything he/she wants?
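(For anyone who wants to sanity-check rules like the ones above, here is a minimal sketch using Python's standard urllib.robotparser; the hostname and paths are placeholders only, not anything from the original post.)

# Minimal sketch: check what the rules above mean for Googlebot.
# Assumes Python 3's standard library; hostname and paths are placeholders.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /something.php

User-agent: others
Disallow: /somethingelse.php
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot has no group of its own here, so it falls under "User-agent: *".
print(rp.can_fetch("Googlebot", "http://www.example.com/something.php"))      # expect False - blocked
print(rp.can_fetch("Googlebot", "http://www.example.com/somethingelse.php"))  # expect True - not blocked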

jpjones

10:39 am on May 6, 2003 (gmt 0)

10+ Year Member



Googlebot (in general) reads and follows correctly formatted robots.txt files to the letter.

Questions: When did you update your robots.txt file?
Has Googlebot retrieved it since you made the changes?
After it retrieved it, did it then go and get your disallowed page?

Have you checked out the googlebot robots.txt faq [google.com] page for information?

HTH,
JP

AthlonInside

10:54 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What is funny is that they do not come back to my disallowed files anymore (as shown in the logs) once I have my robots.txt in place, but the pages remain in the Google index for 3 months! Why didn't they remove them?

jpjones

11:09 am on May 6, 2003 (gmt 0)

10+ Year Member



I'm not sure about the length of retention in Google's index, but you might like to try the following to hasten the pages' removal:

1) Add:

<META NAME="googlebot" CONTENT="noarchive,nofollow,noindex">

to the disallowed pages.

2) Remove the disallow from robots.txt, then watch your log files for Googlebot retrieving the robots.txt file, followed by the page you do not want in the index (a rough log-watching sketch follows below).

3) Once this page has been retrieved, add the disallow back into robots.txt.

I've not tested this personally, but it could be worth trying. If logic follows, then the disallowed page would be removed from the index following the next freshbot update. Failing that, make sure the page can be retrieved by deepbot, and the page will then be dropped on the next update.
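(Here's a rough sketch of the log watching in step 2, assuming an Apache/NCSA-style combined access log; the log path and page name are made up, so adjust them for your own server.)

# Rough sketch of step 2's log watching: report Googlebot's requests for
# robots.txt and for the page you want dropped, in the order they arrive.
# Assumptions: combined log format, placeholder log path and page name.
LOG_FILE = "/var/log/apache/access.log"   # placeholder path
WATCH_PAGE = "/something.php"             # placeholder page

with open(LOG_FILE) as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        if "robots.txt" in line or WATCH_PAGE in line:
            print(line.rstrip())   # shows when robots.txt and the page were fetched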

JP

JonB

11:42 am on May 6, 2003 (gmt 0)

10+ Year Member



Huh, I am tired of GoogleBot, which keeps on ignoring my robots.txt file and my meta robots tags. Why does GoogleBot still think he has some kind of veto power to do anything he/she wants?
------------------------

AthlonInside, this will explain it:

[ftrain.com...]

and it is SHE :)

Dreamquick

11:48 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have a look here, specifically the 40th post by Everyman, essentially saying that Google will link to excluded items even though it can't crawl them:

[webmasterworld.com...]

...which sounds a lot like what you are seeing with Google.

BTW there's a much better post by Brett on the subject but I can't seem to find it :(

- Tony

shaadi

12:09 pm on May 6, 2003 (gmt 0)

10+ Year Member



I have a hard time explaining this to our member base, which is now 50,000+.

Every day I have to handle around 5-6 queries from CRM regarding this issue :(

AthlonInside

12:48 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GoogleBot is greedy. I just want to block it from a redirect script, and the redirect has no content - no title, no text, no links, just a meta refresh - and they still want it so badly, fetching all of the redirects... :(

Does anyone know how the 'remove page' facility in Google works? You need to register with a valid email and add something to the page to be removed. Does anyone know what that 'something' is?

ciml

7:11 pm on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



AthlonInside, JP and Tony have the right answers.

This is the normal situation:

1. Google sees a link to your URL.
2. Google checks /robots.txt and is allowed in.
3. Google fetches the page.
4. Google includes the URL in the next index, with the listing from the page content.

This is the /robots.txt banned situation:

1. Google sees a link to your URL.
2. Google checks /robots.txt and is not allowed in.
3. Google does not fetch the page.
4. Google includes the URL in the next index, but just lists the URL without the listing from the page content.

Keep in mind that "disallow" in /robots.txt doesn't stop your URL from existing, it just disallows a robot from crawling it. /robots.txt therefore saves you bandwidth, but it doesn't stop people from knowing that the URL exists.

It seems odd, but as JP suggests you can remove a URL by allowing it in your /robots.txt and using the META robots tag with "noindex". In your case, you probably wouldn't want to put the disallow back into your /robots.txt, as you presumably want the URL to remain out of Google in future updates.
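(To make that concrete, here is a minimal sketch of the conflict to look for: a page carrying a noindex META tag does you no good if /robots.txt keeps Googlebot from ever seeing it. The site, URL list, and the crude HTML check are all placeholders, not anything Google-specific.)

# Minimal sketch: flag URLs whose noindex META tag Googlebot can never see,
# because /robots.txt disallows crawling them. Placeholder site and URL list.
from urllib import request, robotparser

SITE = "http://www.example.com"                      # placeholder
PAGES = ["/something.php", "/somethingelse.php"]     # placeholder URLs to check

rp = robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

for path in PAGES:
    url = SITE + path
    blocked = not rp.can_fetch("Googlebot", url)
    html = request.urlopen(url).read().decode("utf-8", "replace").lower()
    has_noindex = "noindex" in html                  # crude check for the META tag
    if blocked and has_noindex:
        print(url, "- noindex is invisible to Googlebot while robots.txt disallows it")
    elif not blocked and has_noindex:
        print(url, "- crawlable with noindex: should drop out of the index")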

I don't think this last point affects you, AthlonInside, but it's important to point out that there are other ways of finding URLs (e.g. 'referer' logs or bad robots that don't obey the REP). Because of this, secret or sensitive data should be held behind something like basic authentication combined with SSL, and only then if it's not too risky to have on an Internet-accessible machine.

AthlonInside

10:38 am on May 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I had the pages removed by using their remove-page facility. I asked them to read my robots.txt. It took 1 day to remove all the pages from the index. Not bad.