Questions: When did you update your robots.txt file?
Has googlebot retrieved it since you made the changes?
After it retrieved it, did it then go and get your disallowed page?
Have you checked out the googlebot robots.txt FAQ [google.com] page for information?
HTH,
JP
1) Add:
<META NAME="googlebot" CONTENT="noarchive,nofollow,noindex">
2) Remove the disallow from robots.txt, then watch your log files for googlebot retrieving the robots.txt file, followed by the page you do not want in the index (see the sketch below).
3) Once this page has been retrieved, add the disallow back into robots.txt.
I've not tested this personally, but it could be worth trying. If the logic holds, the disallowed page should be removed from the index following the next freshbot update. Failing that, make sure the page can be retrieved by deepbot, and the page will then be dropped on the next update.
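For the log watching in step 2, something like this rough Python sketch might do; the log path and page URL are placeholders for your own setup:

# Scan an Apache-style access log for googlebot fetching /robots.txt
# and then the disallowed page (step 2 above).
LOG_FILE = "/var/log/apache/access.log"   # placeholder path
TARGET_PAGE = "/private/page.html"        # placeholder disallowed page

got_robots = False
with open(LOG_FILE) as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        if "/robots.txt" in line:
            got_robots = True
            print("Googlebot fetched robots.txt:", line.strip())
        elif TARGET_PAGE in line and got_robots:
            print("Googlebot then fetched the page:", line.strip())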
JP
AthlonInside, that will explain it:
[ftrain.com...]
and it is SHE :)
[webmasterworld.com...]
...which sounds a lot like what you are seeing with Google.
BTW there's a much better post by Brett on the subject but I can't seem to find it :(
- Tony
Does anyone know how the 'remove page' feature in Google works? You need to register with a valid email address and add something to the page to be removed. Does anyone know what that 'something' is?
This is the normal situation:
1. Google sees a link to your URL.
2. Google checks /robots.txt and is allowed in.
3. Google fetches the page.
4. Google includes the URL in the next index, with the listing from the page content.
This is the /robots.txt banned situation:
1. Google sees a link to your URL.
2. Google checks /robots.txt and is not allowed in.
3. Google does not fetch the page.
4. Google includes the URL in the next index, but just lists the URL without the listing from the page content.
Keep in mind that "disallow" in /robots.txt doesn't stop your URL from existing, it just disallows a robot from crawling it. /robots.txt therefore saves you bandwidth, but it doesn't stop people from knowing that the URL exists.
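To make the two situations concrete, here is a minimal Python sketch using the standard library's robots.txt parser to stand in for a well-behaved crawler; the domain and URL are hypothetical:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # step 2: the robot fetches /robots.txt

url = "http://www.example.com/private/page.html"
if rp.can_fetch("Googlebot", url):
    # normal situation: fetch the page, index the URL with its content
    print("Allowed: fetch the page and list it with its content.")
else:
    # banned situation: no fetch, but the bare URL can still be listed
    print("Disallowed: skip the fetch; the URL itself may still appear.")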
It seems odd, but as JP suggests you can remove a URL by allowing it in your /robots.txt and using the META robots tag with "noindex". In your case, you probably wouldn't want to put the disallow back into your /robots.txt, as you presumably want the URL to remain out of Google in future updates.
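To illustrate (this is just a sketch of a well-behaved robot, not Google's actual code), here is how a crawler honouring the META robots tag might spot the "noindex" directive:

from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots"/"googlebot"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() in ("robots", "googlebot"):
            for token in (attrs.get("content") or "").lower().split(","):
                self.directives.add(token.strip())

page = '<html><head><META NAME="googlebot" CONTENT="noarchive,nofollow,noindex"></head></html>'
parser = MetaRobotsParser()
parser.feed(page)
if "noindex" in parser.directives:
    print("Page was fetched, but will be dropped from the index.")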
I don't think this last point affects you AthlonInside, but it's important to point out that there are other ways of finding URLs (e.g. 'referer' logs or bad robots that don't obey the Robots Exclusion Protocol). Because of this, secret or sensitive data should be held behind something like basic authentication combined with SSL, and only then if it's not too risky to have on an Internet-accessible machine.
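For completeness, a bare-bones sketch of HTTP basic authentication using Python's standard library; the credentials and port are placeholders, and in real use this only makes sense behind SSL, since basic auth sends credentials essentially in the clear:

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder credentials; "user:secret" is purely illustrative.
EXPECTED = "Basic " + base64.b64encode(b"user:secret").decode()

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # Challenge the client for credentials.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Sensitive page\n")

HTTPServer(("localhost", 8000), AuthHandler).serve_forever()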