Forum Moderators: Robert Charlton & goodroi


How long does it take for Googlebot to understand it is not wanted?

User-agent: Googlebot
Disallow: /widgets/


mattg3

2:34 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think this is correct as far as I can see.

User-agent: Googlebot
Disallow: /widgets/

[google.com...]

states

We generally download robots.txt files about once a day. You can see the last time we downloaded your file using the robots.txt analysis tool in Google Sitemaps and checking the Last downloaded date and time.

Google might have everflux installed, but they still don't seem to be able to work through their index very fast. :\

It's not really urgent. I'm just interested in how long it takes.

jdMorgan

4:26 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Generally, it only takes a few hours, assuming three things: first, that your Googlebot is a real Googlebot and not a spoofer; second, that there are no other records in robots.txt that override the one you posted; and lastly, that your robots.txt is valid overall and resides in the proper location, the top Web-accessible directory of the (sub)domain in question.
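On that first point -- telling a real Googlebot from a spoofer -- the usual check is a reverse DNS lookup on the visiting IP followed by a forward confirmation. A minimal sketch in Python (the function names are my own, not any official API):

```python
import socket

def is_google_hostname(hostname: str) -> bool:
    # Genuine Googlebot crawl hosts reverse-resolve into these domains,
    # e.g. crawl-66-249-66-1.googlebot.com
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm the name."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)      # reverse (PTR) lookup
        if not is_google_hostname(hostname):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record could be spoofed.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

The double lookup matters: a spoofer can fake its User-agent header and even its PTR record, but cannot make `googlebot.com` resolve forward to its own IP.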

As with the search servers, there are many machines running the Googlebot application. Each one of them (or perhaps a 'representative' member of their clusters) will have to fetch and process your new robots.txt before it takes effect. I always allow a minimum of 24 hours when making any changes to robots.txt or to UA-based access controls for spiders to re-fetch and re-analyze robots.txt.

After this time, the Googlebots should all stop fetching resources in your /widgets/ subdirectory. However, if you are asking how long it will take for those listings to be removed from search results, the answer may be anywhere from 90 days for main results, to a year for Supplemental results with full title/description, to never -- any pages with links to them from any page anywhere on the Web may live on in Supplemental results forever as URL-only listings.

If this latter issue is your concern, then consider *allowing* Googlebot to spider your /widgets/ subdirectory, but redirecting all /widgets/ pages to replacement pages or returning 410-Gone responses at the server level for all /widgets/ pages.
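For that server-level 410 approach, one minimal sketch in Apache (assuming mod_alias is available; /widgets/ is just the path from this thread):

```apache
# .htaccess in the document root -- answer 410 Gone
# for every request under /widgets/
RedirectMatch gone ^/widgets/
```

Unlike a robots.txt block, a 410 lets Googlebot fetch the URL, see that it is permanently gone, and drop it from the index rather than leaving a URL-only listing behind.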

Jim

trinorthlighting

4:33 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We just blocked about 100 URLs from one of our sites last week via robots.txt. Google has removed 20 from its index so far. The pages were Supplemental, though, and do not get crawled often.

Google does drop them though.

g1smd

5:09 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some of them will still show as URL-only entries for many months after you put the block in place.

Even after a year, they will not all be gone.

Terabytes

5:29 pm on Dec 29, 2006 (gmt 0)

10+ Year Member



just wanted to pop a thought into this:

If a link exists somewhere on another site pointing to your directory, G is going to try to follow that link to your disallowed directory... over and over and over...

Unless you can get all the pointers to your directory removed (probably impossible), you'll have to leave that Disallow in place, probably forever.

thanks!
Tera

AndyA

6:38 pm on Dec 29, 2006 (gmt 0)

10+ Year Member



You might try this:

Disallow: /widgets

(note there's no trailing slash)

I did this with my forum directory, and then pages started showing up in Webmaster Tools as forbidden by robots.txt; prior to that, they weren't. Adding the trailing slash seems to tell Google it's OK to spider example.com/widgets/, and if there are links to other pages in the directory from there, they get spidered anyway.

I don't know if this applies in your case or not, but it might not hurt to drop that last slash.
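For what it's worth, Disallow values under the original robots.txt convention are plain prefix matches against the URL path. A small sketch makes the difference between the two forms concrete (the function is illustrative, not any official parser):

```python
def is_blocked(path: str, disallow: str) -> bool:
    # A URL path is blocked when it begins with the Disallow value;
    # an empty Disallow blocks nothing.
    return bool(disallow) and path.startswith(disallow)

# With the trailing slash: blocks the directory's contents,
# but not the bare /widgets path itself.
is_blocked("/widgets/page.html", "/widgets/")   # blocked
is_blocked("/widgets", "/widgets/")             # not blocked

# Without the trailing slash: broader -- also catches the bare path
# and anything else that merely starts with "/widgets".
is_blocked("/widgets", "/widgets")              # blocked
is_blocked("/widgets-sale.html", "/widgets")    # blocked too
```

Note the trade-off: dropping the slash also blocks unrelated paths like /widgets-sale.html, which may be more than intended.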

[edited by: AndyA at 6:40 pm (utc) on Dec. 29, 2006]

mattg3

8:33 pm on Dec 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The rule posted is all that's in the robots.txt, it's in the root directory, and the path is correct.

I can try dropping the trailing slash and a nirvana redirect. I don't use Webmaster Tools on this server, as I want to experiment with a "Google-free zone" server.

Thanks for all the suggestions!