Forum Moderators: Robert Charlton & goodroi
I had an accident, please help me!
I had an HDD crash on my server 2 months ago and had to reinstall the entire site.
But I forgot to put back my old robots.txt file ...
and I only noticed it yesterday, when I browsed through my old logs ... :(((
The result is the following:
Googlebot crawled my cgi-bin directory and indexed more than 55,000 (!) queries from my Amazon product feed script, and it also crawled the entire shtml directory, which contains printable versions of all my pages. Both directories were disallowed in my old robots.txt file.
What do I have to do now? Any suggestions?
No SERP changes yet ... but I'm really afraid Google will penalize me for spam and duplicate content soon.
I have put back my old robots.txt file, which disallows crawling of these directories, and I placed
META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" into my cgi and shtml pages.
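To be concrete, what I put back looks roughly like this (the shtml path is just an example standing in for my real directory name):

User-agent: *
Disallow: /cgi-bin/
Disallow: /shtml/

and in the head of every cgi and shtml page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">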
Can I do anything else? What do you think, should I contact Google?
Should I use the URL removal tool?
In that case, will my whole site drop out of Google for the next 6 months, or just those pages?
I really appreciate any kind of answers.
Thank you.
It's obvious to me that Google "partially indexes" virtually all pages out there, even those blocked by robots.txt. All the html pages I've blocked in robots.txt are in the Google index, indexed as URL only "partially indexed" pages.
It sounds like you have a real problem - if those pages have any sort of confidential info I really sympathize with you.
1. Now that you have replaced the robots.txt file, those pages will stay in the index but never be re-spidered, so they will likely turn supplemental, yet still be publicly accessible.
2. You could use the URL removal tool, which will remove them from the index temporarily (3 to 6 months), but it is likely that those pages will then return as "Supplemental" and stay in the index in perpetuity.
3. The other alternative is ugly and could cost you many hours of work. In a nutshell, move all of those files to a directory with a new name, then change all of your code and links to use the new directory name.
a. How much work that causes depends on what your code looks like.
b. The way your hosting service is set up may require you to have all of your executables in cgi-bin. To get around that, place the files in a sub-directory of cgi-bin (and make sure the sub-directory has 755 permissions)
so the new path to your files would be:
www.example.com/cgi-bin/zothfiles/
If your code is fairly simple, it may just be a config file setting or a simple find and replace to update to a new path.
After you are sure that everything works that way, you can delete the files in the original path, let googlebot in, and serve a 410 for those files. I think that is your best bet if you're hoping those files will drop out of the index, although it might take some time (maybe months?).
IMHO it is better that Google CAN re-spider them, as either 410 Gone or blank pages, to wipe out the cache of what it spidered before.
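If the server is Apache, once the old files are gone the 410s can be done with a couple of lines in .htaccess, something along these lines (the script name and directory here are only placeholders for your real paths, and robots.txt must not block them, or googlebot will never get to see the 410):

# mod_alias: answer "410 Gone" for the old feed script and the old printable pages
Redirect gone /cgi-bin/feed.cgi
RedirectMatch gone ^/shtml/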
Good luck,
I haven't tried the meta noindex/nofollow before, so I cannot comment on that.
Related topic: [webmasterworld.com...]
>I've always excluded cgi-bin, yet for some reason Googlebot's been there. I discovered this via sitemaps.
You are lucky. My sitemap is lying to me. I put up 400+ pages for testing, and as soon as I did, googlebot was at them! I saw this in my logs. I went to my sitemap and there was no mention of it.
Google is lying to me ... but it is not as if that hasn't happened before. They are a brand name, not a couple of guys at a university trying to make the world a better place.
Vote with your fingers. Use MSN and Wikipedia :)
I used to have the same problem of Googlebot indexing banned directories, so I checked my robots.txt file for errors on the following page:
[searchenginepromotionhelp.com...]
It showed me an error: there was no extra line at the end of the robots.txt file. I rectified the problem, uploaded the new file, and it's done! All the restricted URLs now show up in my error statistics in Sitemaps.
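If you want to double-check that your rules really block what you think they block, here is a quick sketch using Python's standard robotparser module (the directories and URLs below are just examples, substitute your own):

from urllib.robotparser import RobotFileParser

# Paste your own robots.txt rules here; these directories are only examples.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /shtml/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Example URLs; swap in real ones from your logs.
for url in ("http://www.example.com/cgi-bin/feed.cgi?item=1",
            "http://www.example.com/shtml/page1.shtml",
            "http://www.example.com/index.html"):
    status = "blocked" if not rp.can_fetch("Googlebot", url) else "allowed"
    print(status, url)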