
Sitemaps, Meta Data, and robots.txt Forum

    
faulty robots.txt?
tenaka




msg:1528974
 9:48 am on Dec 27, 2003 (gmt 0)

hi guys,

my robots.txt looks like:

quote:
--------------------------------------------------------------------------------

User-agent: *
Disallow: /cgi-bin/
Disallow: /gallery/
Disallow: /images/
Disallow: /stat_www/
Disallow: /stat_www_old/
Disallow: /survey/
Disallow: /templates/

--------------------------------------------------------------------------------

when I look at the logfile I see strange things: the google bot is regularly visiting and indexing sites but
quote:
--------------------------------------------------------------------------------
this one: 66.196.65.36 - Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
--------------------------------------------------------------------------------

only comes to my site, reads the robots.txt all the time, and then leaves again. Today it read it 10 times and nothing else.

Btw, which chmod permissions do I have to set on the robots.txt?
And how do some people manage to get a 404 error when retrieving my robots.txt?

Am I doing something wrong?

 

ncw164x




msg:1528975
 9:03 pm on Jan 2, 2004 (gmt 0)

Your robots.txt file should be

[nameofyoursite...]

that is, in the root of your site. You should not be getting a 404 error via the browser if the file is in this directory.
Are you seeing a 404 error when Googlebot and Slurp request the file?

The file must be FTP'd in ASCII mode, not binary.
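
As a small illustration of "the root of your site": crawlers request robots.txt at the top level of the host, whatever page they started from. The sketch below only builds that URL; the helper name `robots_url` and the example.com address are made up for illustration and are not from this thread.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    # The leading slash makes urljoin discard the page's own path,
    # so the result always points at the top level of the host.
    return urljoin(page_url, "/robots.txt")


# Hypothetical address, for illustration only.
print(robots_url("http://www.example.com/gallery/index.html"))
```

If the file lives anywhere other than the URL this prints for your own site, robots will not find it.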

hope this helps

ncw164x

jdMorgan




msg:1528976
 9:16 pm on Jan 2, 2004 (gmt 0)

tenaka,

Welcome to WebmasterWorld [webmasterworld.com]!

To second ncw164x, your robots.txt looks fine. For added reassurance, run it through this robots.txt validator [searchengineworld.com]

Note that robots.txt Disallow patterns are prefix-matched. That is, the robot will not fetch anything that begins with the string you specify after Disallow. Therefore, you can disallow both "/stat_www/" and "/stat_www_old/" using the single directive:

Disallow: /stat_www

The only side effect will be if you have other files or subdirectories whose names also start with "stat_www". For example, "/stat_www_public.html" would also be disallowed.
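
A minimal sketch of that prefix matching, assuming plain string prefixes with no wildcards. The function name `is_disallowed` is hypothetical, and real robots may add their own extensions on top of this basic rule.

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if path begins with any of the Disallow strings."""
    # An empty Disallow value blocks nothing, so empty strings are skipped.
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)


single_rule = ["/stat_www"]
for path in ("/stat_www/index.html",
             "/stat_www_old/index.html",
             "/stat_www_public.html",
             "/survey/form.html"):
    print(path, is_disallowed(path, single_rule))
```

The first three paths come out as disallowed under the single rule, including the "/stat_www_public.html" side effect; with the original "/stat_www/" and "/stat_www_old/" rules that file would still be fetchable.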

Inktomi's Slurp is notoriously slow about digging deeply into sites. You may just have to wait a while. If your site is commercial and you want it spidered soon and frequently, consider the paid inclusion option.

chmod 644 should be fine - robots.txt is fetched just like any other text file or html page.
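
If you would rather set the mode from a script than from your FTP client, something like the following should work on a Unix-like host. The helper name `make_world_readable` and the path in the usage comment are assumptions for illustration.

```python
import os
import stat


def make_world_readable(path):
    """Set path to mode 644 (owner read/write, everyone else read-only)."""
    os.chmod(path, 0o644)
    # Return the permission bits actually in effect, for checking.
    return stat.S_IMODE(os.stat(path).st_mode)


# Usage (the path is hypothetical; substitute your own document root):
# mode = make_world_readable("/path/to/docroot/robots.txt")
# print(oct(mode))
```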

Jim

tenaka




msg:1528977
 9:27 pm on Jan 2, 2004 (gmt 0)

thx guys,

I just got an email from Inktomi's tech support telling me that they retrieve the robots.txt from time to time to check that the pages are still up. They also told me that they have thousands of indexed pages from my cgi-bin, which I banned them from only recently because allowing it was a mistake on my side. Well, actually they said they have a few pages from my site in their index, and if I have more I should link to them.

I checked again and they only have 3 pages and 1000 of old ones from my cgi-bin...

I will just wait a little longer...

tenaka




msg:1528978
 2:13 pm on Jan 3, 2004 (gmt 0)

talking about 404 errors:

66.77.73.162 - FAST-WebCrawler/3.8 (crawler at trd dot overture dot com; [alltheweb.com...]

Date Page Status Referer
01/03 15:01 /robots.txt 404 -

This is the first time FAST has visited me in a long time, and it did not find my robots.txt?

Of course I have one and it is ok..
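
To see how often this happens, a rough sketch like the one below can pull the robots.txt 404s out of a log summary. It assumes whitespace-separated lines laid out as in the excerpt above (date, time, page, status, referer); a raw server log will look different, and the function name `robots_404s` is made up for illustration.

```python
def robots_404s(log_lines):
    """Return (date, time) pairs for /robots.txt requests that got a 404."""
    hits = []
    for line in log_lines:
        fields = line.split()
        # Assumed layout: date, time, page, status, referer.
        if len(fields) < 4:
            continue
        date, time, page, status = fields[:4]
        if page == "/robots.txt" and status == "404":
            hits.append((date, time))
    return hits


sample = [
    "Date Page Status Referer",
    "01/03 15:01 /robots.txt 404 -",
]
print(robots_404s(sample))
```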


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved