
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
How long until robots.txt works?
slobizman
msg:1529390, 3:51 am on Nov 13, 2003 (gmt 0)

Thirty minutes ago I added a robots.txt file to my home directory with the following contents:

User-agent: *
Disallow: /

But, I see that it is still going through my pages. Googlebot is on my server 24 hours a day. Does it need to leave and come back before it starts checking the robots.txt file, or should it check it with each query?

If it has to leave and come back, what if it never does?

 

jdMorgan
msg:1529391, 4:26 am on Nov 13, 2003 (gmt 0)

It depends on the robot. Googlebot will likely figure it out by tomorrow. Some others may take up to a month. Just like getting indexed, the 'bots do things on their schedule, not ours.

Jim

operafan
msg:1529392, 4:31 am on Nov 13, 2003 (gmt 0)

Because you only recently added your robots.txt file, any bots that started spidering before it was in place will keep crawling your site without restrictions until the next time they come around, request the robots.txt file, note the rules within, and follow them.

Looks like Jim beat me to it; I want to collect more points :)

slobizman
msg:1529393, 5:29 am on Nov 13, 2003 (gmt 0)

If Googlebot is on my site 24/7 (seriously), will it ever step back, start a new session, and re-read the robots.txt file?

operafan
msg:1529394, 6:07 am on Nov 13, 2003 (gmt 0)

Not to worry; from time to time Googlebot will request your robots.txt file. Just be patient.

Mohamed_E
msg:1529395, 7:04 pm on Nov 13, 2003 (gmt 0)

But, I see that it is still going through my pages. Googlebot is on my server 24 hours a day. Does it need to leave and come back before it starts checking the robots.txt file, or should it check it with each query?

Like many questions about Google, the answer to this one is found on (surprise!) the Google Information for Webmasters section of the Google site right on the FAQ page where it belongs :)

In answer to the question Why isn't Googlebot obeying my robots.txt file? [google.com] you will find:

To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we have fetched many pages from the server. So, it may take a while for Googlebot to learn of any changes that might have been made to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file. Also, check that your syntax is correct against the standard at: [robotstxt.org...] If there still seems to be a problem, please let us know and we'll correct it.
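Once a crawler does re-fetch the file, those two lines shut everything off. As a quick local sanity check, you can parse the same rules with Python's standard-library robotparser (purely illustrative; this is not what Googlebot runs, and example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The blanket-disallow rules from the start of this thread.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# With "Disallow: /" under "User-agent: *", every path is
# off-limits to every compliant robot.
print(parser.can_fetch("Googlebot", "http://example.com/forum/showtopic.php"))
```

So the rules themselves are fine; the delay is purely the bots' caching schedule described above.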

slobizman
msg:1529396, 7:51 pm on Nov 13, 2003 (gmt 0)

Thanks. Duh! I'll bookmark that page.

slobizman
msg:1529397, 9:49 pm on Nov 13, 2003 (gmt 0)

Okay, it worked. Both Googlebot and HotBot are off. I'm going to run it like this for a couple of days and see if my bandwidth goes down (that's why I was doing it: the bots were making 400 queries an hour in my forum).

When I change it back, I want to add the following:

User-agent: *
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.bmp$

Am I right in assuming that this will effectively allow the bots to go through the forum, yet not download graphic files, saving me lots of bandwidth?

Mohamed_E
msg:1529398, 11:01 pm on Nov 13, 2003 (gmt 0)

slobizman,

Thanks for making me do some reading and learning a lot more than I knew when I started out on this thread!

The syntax you suggest, taken from Google's webmaster info, is not part of the general standard; it appears to be a variant used only by Google. Check out Robots.txt - Am I Missing Somthing? [webmasterworld.com].

In message #2, DaveAtIFG points out that:

Wild cards are only acceptable in the User-Agent field.

while in message #9 jdMorgan adds:

AFAIK, the only "big" search engine that supports extensions to the Standard for Robots Exclusion is Google, as documented in their Webmaster Help section.

So your code should keep Googlebot out of your images etc., but will be ignored by all other robots.

I suspect that the correct way would be to make separate directories for these files; the robots exclusion protocol deals specifically with keeping robots out of directories.

Take what I have written with a grain of salt until confirmed by someone more knowledgeable; I only learned this a few minutes ago :)
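Both points above can be checked locally. Python's urllib.robotparser implements only the original exclusion standard (prefix matching, no wildcard extensions), so it behaves like the non-Google bots discussed here. This sketch is illustrative only; example.com and the bot name are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Google-style wildcard rule, as suggested earlier in the thread.
wildcard = RobotFileParser()
wildcard.parse("""\
User-agent: *
Disallow: /*.gif$
""".splitlines())

# Plain directory rule, per the original exclusion standard.
directory = RobotFileParser()
directory.parse("""\
User-agent: *
Disallow: /images/
""".splitlines())

# A standard-only parser treats "/*.gif$" as a literal path prefix,
# so an ordinary image URL is NOT blocked by it...
print(wildcard.can_fetch("SomeBot", "http://example.com/pics/photo.gif"))
# ...while a directory rule is honored by every compliant robot.
print(directory.can_fetch("SomeBot", "http://example.com/images/photo.gif"))
```

That matches the advice above: a directory-based layout (`Disallow: /images/`) blocks images for all standards-following bots, while the wildcard form only works for bots that implement Google's extensions.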

slobizman
msg:1529399, 3:06 pm on Nov 17, 2003 (gmt 0)

I'm still confused about something. Can I use robots.txt not only to keep them out of a directory, but also to keep them from requesting a URL with a certain string in it?

For example, I want to stop the [good] bots from requesting the URLs:

[my-domain.com...]

[my-domain.com...]

Will the following work?

User-agent: *
Disallow: /*showtopic.php
Disallow: /*search

slobizman
msg:1529400, 3:23 pm on Nov 18, 2003 (gmt 0)

Oops, I meant:

User-agent: *
Disallow: /*showtopic
Disallow: /*search
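As noted above, only Google honored these wildcard rules at the time; standards-only bots will ignore them. Under Google's documented semantics, `*` matches any run of characters and `$` anchors the end of the URL. A tiny illustrative matcher along those lines (not any bot's actual code; the helper name is made up for this sketch):

```python
import re

def google_match(pattern: str, path: str) -> bool:
    """Approximate Google's robots.txt wildcard matching:
    '*' matches any sequence of characters, a trailing '$'
    anchors the pattern to the end of the URL path."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

# The rules proposed above would match these forum URLs for Google:
print(google_match("/*showtopic", "/forum/showtopic.php"))  # blocked
print(google_match("/*search", "/index.php?act=search"))    # blocked
print(google_match("/*showtopic", "/index.html"))           # allowed
```

So for Googlebot the corrected rules should do what you want; for everything else, the directory-based approach discussed earlier remains the portable option.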

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved