Forum Moderators: Robert Charlton & goodroi


Googlebot crawled unwanted directories

What now?

         

zoth

9:25 am on Dec 31, 2005 (gmt 0)

10+ Year Member



Hello webmasters,

I had an accident, please help me!

I had a hard-drive crash on my server two months ago. I had to reinstall the entire site,
but I forgot to put back my old robots.txt file ...
and I only noticed yesterday, while browsing through my old logs ... :(((

The result is the following:
Googlebot crawled my cgi-bin directory and indexed more than 55,000(!) queries from my Amazon product feed script, and it also crawled the entire shtml directory, which contains printable versions of all my pages. Both directories were disallowed in my old robots.txt file.

What should I do now? Any suggestions?

No SERP changes yet ... but I'm really afraid Google will soon penalize me for spam and duplicate content.

I put back my old robots.txt file, which disallows crawling of these directories, and I placed
META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" into my cgi and shtml pages.
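For reference, a robots.txt along the lines described here (directory names taken from the post) would look like this; note that it only asks compliant crawlers not to fetch these paths, and does not by itself remove URLs that are already indexed:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /shtml/
```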

Can I do anything else? What do you think, should I contact Google?

Should I use the URL removal tool?
If I do, will my whole site drop out of Google for the next 6 months, or just those pages?

I really appreciate any kind of answers.
Thank you.

zoth

8:33 am on Jan 1, 2006 (gmt 0)

10+ Year Member



No suggestions?

soapystar

12:08 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot seems to have gone mad lately ... it follows every page and indexes it even with the noindex and nofollow tags, and it follows every link it can find, whether in JavaScript, comment tags, or forms. Wherever it finds any source code that looks remotely like a link, it blindly spiders it.

bumpski

12:13 am on Jan 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you use a sitemap.xml file and the associated Google stats, you'll get a list of pages Google thinks are blocked by robots.txt. This is a nice check.

It's obvious to me that Google "partially indexes" virtually every page out there, even those blocked by robots.txt. All the HTML pages I've blocked in robots.txt are in the Google index as URL-only, "partially indexed" entries.

cws3di

2:34 am on Jan 2, 2006 (gmt 0)

10+ Year Member



zoth

It sounds like you have a real problem - if those pages have any sort of confidential info I really sympathize with you.

1. Now that you have replaced the robots.txt file, those pages will stay in the index but never be re-spidered, so they will likely turn supplemental, yet still be publicly accessible.

2. You could use the URL removal tool, which will remove them from the index temporarily (3 to 6 months), but it is likely that those pages will then return as "Supplemental" and stay in the index in perpetuity.

3. The other alternative is ugly, and could cause you many hours of work. In a nutshell, move all of those files to another directory name, then change all of your code and links for the new directory name.

a. How much work that causes depends on what your code looks like.

b. The way your hosting service is set up may require you to have all of your executables in cgi-bin. To get around that, place the files in a sub-directory of cgi-bin (and make sure the sub-directory has 755 permissions)

so, the new path to your files would be:
www.example.com/cgi-bin/zothfiles/

If your code is fairly simple, it may just be a config file setting or a simple find and replace to update to a new path.

After you are sure that everything works that way, you can delete the files in the original path, let Googlebot in, and serve a 410 for those files. I think that is your best bet for getting those files to drop out of the index, although it might take some time (maybe months?).

IMHO it is better that Google CAN re-spider them, as either 410 Gone or blank pages, to wipe out the cache of what it spidered before.
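A sketch of the 410 part, assuming an Apache host with mod_alias available (the old script name below is hypothetical): a couple of lines in .htaccess make the server answer "410 Gone" for the old locations once the files have moved.

```apache
# Serve 410 Gone for the old paths (mod_alias "Redirect gone").
# The script name is made up - substitute your real old paths.
Redirect gone /cgi-bin/oldscript.cgi
Redirect gone /shtml
```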

Good luck,

moftary

12:31 pm on Jan 2, 2006 (gmt 0)

10+ Year Member



What I can confirm is that Googlebot doesn't follow robots.txt.
On every site that I have, I've placed a robots.txt file asking all bots to stay away from certain files/directories. Each time, I find Googlebot disobeying my robots.txt rules and those files/directories showing up in the SERPs days later.

I haven't tried the meta noindex/nofollow before, so I cannot comment on that.

BillyS

12:59 pm on Jan 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with the statement that Googlebot seems to spider what it wants to.

I've always excluded cgi-bin, yet for some reason Googlebot's been there. I discovered this via sitemaps.

g1smd

1:26 am on Jan 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google will not see your meta noindex because you asked them in robots.txt not to crawl your pages. The Google Removal Tool does not remove pages from the index; it merely hides them from the public results for 90 or 180 days.

Put the <meta name="robots" content="noindex"> tag on all pages that you do not want to be indexed. Take the exclusion out of the robots.txt file. Let Google spider all of those pages and see the noindex instruction. That will be the only way to get them removed.
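A small sketch with Python's standard-library robots.txt parser illustrates the ordering described above (the URL, directory, and rules here are assumptions for illustration, not the poster's real site): while a Disallow is in place, a compliant bot may not fetch the page at all, so it can never see the meta noindex inside it.

```python
import urllib.robotparser

def googlebot_can_fetch(rules, url):
    """Parse a list of robots.txt lines and ask whether Googlebot may fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)
    return rp.can_fetch("Googlebot", url)

url = "http://www.example.com/shtml/page.shtml"  # hypothetical blocked page

# With the old exclusion in place, the page is never requested,
# so its meta noindex tag is never seen:
blocked = googlebot_can_fetch(["User-agent: *", "Disallow: /shtml/"], url)

# With the exclusion removed, the page can be fetched and the
# noindex instruction can take effect:
allowed = googlebot_can_fetch(["User-agent: *", "Disallow:"], url)

print(blocked, allowed)  # False True
```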

stinkfoot

10:50 am on Jan 4, 2006 (gmt 0)

10+ Year Member



>I agree with the statement that Googlebot seems to spider what it wants too.

Related topic: [webmasterworld.com...]

>I've always excluded cgi-bin, yet for some reason Googlebot's been there. I discovered this via sitemaps.

You are lucky. My sitemap is lying to me. I put up 400+ pages for testing, and as soon as I did, Googlebot was at them! I saw this in my logs. I went to my sitemap stats and there was no mention of it.

Google is lying to me ... but it is not as if that hasn't happened before. They are a brand name, not a couple of guys at university trying to make the world a better place.

Vote with your fingers. Use MSN and Wikipedia :)

milanmk

11:42 pm on Jan 5, 2006 (gmt 0)

10+ Year Member



I think I have a point about robots.txt.

I used to get the same problem of Googlebot indexing banned directories. Then I checked my robots.txt file for errors at the following page:

[searchenginepromotionhelp.com...]

It flagged an error: my robots.txt file was missing a newline at the end. I fixed the problem, uploaded the new file, and that was it! All the restricted URLs now show up in my error statistics in Sitemaps.
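For anyone who wants to check for the same issue, here is a tiny sketch (pure standard library; the rule content below is hypothetical) that tests whether a robots.txt body ends with the final newline that validator asked for:

```python
def ends_with_newline(data: bytes) -> bool:
    """True if the file content ends with a line terminator."""
    return data.endswith((b"\n", b"\r\n"))

# Hypothetical robots.txt content, first without and then with a final newline:
missing = ends_with_newline(b"User-agent: *\nDisallow: /cgi-bin/")
fixed = ends_with_newline(b"User-agent: *\nDisallow: /cgi-bin/\n")

print(missing, fixed)  # False True
```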

zoth

9:23 am on Jan 6, 2006 (gmt 0)

10+ Year Member



Thanks for everybody's answers.

Serving a 410 and moving the cgi script to another
directory were really good ideas, so I did both of them ;-)

Thank you so much cws3di!

stinkfoot

1:01 pm on Jan 6, 2006 (gmt 0)

10+ Year Member



milanmk

Nice tip. I have added a blank line to robots.txt and will hope for the best.

g1smd

7:59 pm on Jan 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Remember, too, that the paths you wish to block should be given in full, and that each path MUST start with a / right at the beginning.
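As an illustration of that rule (the directory name is just an example), a Disallow path must be written from the site root:

```
User-agent: *
# Wrong - no leading slash; this will not match /cgi-bin/... URLs:
Disallow: cgi-bin/
# Right - the full path from the root of the site:
Disallow: /cgi-bin/
```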