Forum Moderators: goodroi


Is this Robots.txt OK?

Please check this robot.txt file


mzconstruction

3:52 pm on Feb 2, 2006 (gmt 0)

10+ Year Member



I'm specifically concerned because one of our staff has suggested that the following file could exclude Googlebot from the site entirely. But I can't see a problem with our file. It validates fine.

# Disallow Google ad bots from the entire site
User-agent: Mediapartners-Google*
Disallow: /

# Disallow Google Adsense from the entire site
User-agent: Google AdSense
Disallow: /

# Disallow Googlebot, MSN and Yahoo from these directories and all files within them
User-agent: Googlebot
User-agent: MSNBot
User-agent: Slurp
Disallow: /css
Disallow: /cgi-bin
Disallow: /error_docs
Disallow: /images
Disallow: /js
Disallow: /somedir
Disallow: /somedir

# Disallow ALL bots from these directories and all their child objects
User-agent: *

Disallow: /css
Disallow: /cgi-bin
Disallow: /error_docs
Disallow: /images
Disallow: /js
Disallow: /somedir
Disallow: /somedir

# Disallow specific bots from indexing or crawling the site at all
# Most recent additions first.

User-agent: GigaBot
Disallow: /

User-agent: Voyager
Disallow: /

User-agent: BaiDuSpider
Disallow: /

User-agent: BackRub/*.*
Disallow: /

User-agent: Grub.org
Disallow: /

User-agent: BotRightHere
Disallow: /

User-agent: larbin
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Walhello appie
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: Yandex bot
Disallow: /

# END OF FILE

TIA

Dijkgraaf

10:52 pm on Feb 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why is there a * after Mediapartners-Google?

Also for BackRub/*.*, are you trying to use wildcards in the user-agent name? If so, I don't think that is allowed.

It might pay you to put the "User-agent: *" section at the very end. Some bots might not look any further than that to see if there is something that matches their specific UA.

mzconstruction

11:57 pm on Feb 2, 2006 (gmt 0)

10+ Year Member



Why is there a * after Mediapartners-Google?

Good question. I'm not entirely sure. Our code monkey originally wrote the file and she no longer works for us. Nothing nasty! She's having a baby. ;-)

Also for BackRub/*.*, are you trying to use wildcards for the UA? If so I don't think that is allowed.

OK, I can remove those.

It might pay you to put the "User-agent: *" section at the very end.

Good point, thanks. I'll do that.

Is the file otherwise OK? Funnily enough, Googlebot seems to be ignoring its exclusions, as it was actively spidering two folders that we disallowed. What's that about -- any ideas?

mzconstruction

12:04 am on Feb 3, 2006 (gmt 0)

10+ Year Member



Why is there a * after Mediapartners-Google?

I've found out now from her notes. It's a wildcard for the version number -- 2.1 at the moment, I think. It's the AdSense bot. We used to run AdSense but not any longer, so allowing it would just gobble up bandwidth.

I take it the file is OK though? Nothing in there to prevent Google or any other bot spidering the site?

Dijkgraaf

12:12 am on Feb 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, there are a few things that could cause that.
1) The file has to be called robots.txt (all lower case and that exact name). In the title I noticed you had it starting with an upper case, and in the description it was without an s.
2) robots.txt has to be in the root directory of the web site. You should be able to see it if you enter http://www.example.com/robots.txt (replacing example.com with your domain).
3) The exclusions have to be in the same case as the actual file and directory names.
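Point 3 can be checked offline with Python's standard-library `urllib.robotparser`, which matches Disallow paths case-sensitively, the same way compliant crawlers do. A minimal sketch -- the `/Images` rule and the file names here are just illustrative:

```python
import urllib.robotparser

# Build a parser from literal rule lines rather than fetching a live
# robots.txt, so the behaviour can be checked without a web server.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /Images
""".splitlines())

# Path case must match the rule exactly: "/Images" is blocked,
# but the differently-cased "/images" is not.
print(rp.can_fetch("Googlebot", "/Images/logo.gif"))  # False
print(rp.can_fetch("Googlebot", "/images/logo.gif"))  # True
```

So if the directory on disk is `/images` but the rule says `/Images` (or vice versa), the exclusion silently does nothing.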

encyclo

12:18 am on Feb 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A couple of points:

# Disallow Google Adsense from the entire site

This is covered by the Mediapartners ban, so this record is not required.

# Disallow Googlebot, MSN and Yahoo from these directories and all files within them
(..)
# Disallow ALL bots from these directories and all their child objects

The list of files and directories is the same for both, so the first is redundant.

Finally, there shouldn't be an extra line feed in the "Disallow ALL bots" part.

# Disallow Google ad bots from the entire site
User-agent: Mediapartners-Google*
Disallow: /

# Disallow ALL bots from these directories and all their child objects
User-agent: *
Disallow: /css
Disallow: /cgi-bin
Disallow: /error_docs
Disallow: /images
Disallow: /js
Disallow: /somedir
Disallow: /somedir

# Disallow specific bots from indexing or crawling the site at all
# Most recent additions first.

(followed by your list of specific bots)
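The consolidated file can be sanity-checked with `urllib.robotparser` before going live. A sketch assuming the trailing "*" is dropped from Mediapartners-Google, as suggested above (only a subset of the directory rules is shown):

```python
import urllib.robotparser

# Simplified version of the consolidated rules from this thread,
# with the trailing "*" removed from Mediapartners-Google.
rules = """\
User-agent: Mediapartners-Google
Disallow: /

User-agent: *
Disallow: /css
Disallow: /cgi-bin
Disallow: /images
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The AdSense bot is banned from the whole site...
print(rp.can_fetch("Mediapartners-Google", "/index.html"))  # False
# ...while everything else falls under the "*" record.
print(rp.can_fetch("Googlebot", "/css/main.css"))           # False
print(rp.can_fetch("Googlebot", "/index.html"))             # True
```

Note this only shows how one compliant parser reads the file; individual crawlers may differ in how they pick a matching record.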

mzconstruction

2:35 am on Feb 3, 2006 (gmt 0)

10+ Year Member



the file has to be called robots.txt

It is. My bad typing I'm afraid.
robots.txt has to be in the root directory of the web site.

It is.
The exclusions have to be in the same case..

They are. (AFAIK).


# Disallow Google Adsense from the entire site
This is covered by the mediapartners ban so this run is not required.

OK. We can remove that.
The list of files and directories is the same for both, so the first is redundant.

Yes, I can see that now. Repetition is not required then? We thought it might be more effective this way.
As it happens, Google appears to be ignoring it anyway. Our bulletin board is specifically disallowed and they are spidering it regardless. ;-(

Finally, there shouldn't be an extra line feed in the "Disallow ALL bots" part.

OK. We will remove that.

Thanks for everyone's help. Much appreciated.

Dijkgraaf

2:51 am on Feb 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is Google actually requesting the pages, or only listing the URLs without titles and cache?
If it is only listing the URLs, then that is standard behaviour for several bots.
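One way to answer that from the server side is to scan the access log for Googlebot requests hitting disallowed paths. A sketch assuming Combined Log Format -- the log lines, IP addresses and paths below are invented for illustration:

```python
import re

# Hypothetical access-log lines in Combined Log Format.
log_lines = [
    '66.249.66.1 - - [03/Feb/2006:01:02:03 +0000] "GET /somedir/page.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [03/Feb/2006:01:02:04 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

# Capture the request path and the trailing quoted user-agent string.
pattern = re.compile(r'"GET (?P<path>\S+) [^"]*".*"(?P<ua>[^"]*)"$')

# Prefix matching mirrors how Disallow rules work.
disallowed = ("/css", "/cgi-bin", "/somedir")

hits = []
for line in log_lines:
    m = pattern.search(line)
    if m and "Googlebot" in m.group("ua") and m.group("path").startswith(disallowed):
        hits.append(m.group("path"))

print(hits)
```

Any paths collected in `hits` are genuine fetches of disallowed URLs, as opposed to Google merely listing a URL it has never crawled.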

mzconstruction

11:06 pm on Feb 3, 2006 (gmt 0)

10+ Year Member



Google is requesting pages.