Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Webmaster World's Robots.txt
Why don't they allow msnbot?
Rollo · msg:1526050 · 5:33 pm on Nov 28, 2004 (gmt 0)

Webmaster World has a lengthy robots.txt

[webmasterworld.com ]

I can see why they don't allow a lot of it, but I was wondering why they don't allow msnbot?

 

paybacksa · msg:1526051 · 5:40 pm on Nov 28, 2004 (gmt 0)

Don't assume that the robots.txt you see is the same one served to spiders... that is not always the case.

As for blocking msn, it has been known to spider excessively for no obvious benefit to the webmaster (while consuming the webmaster's bandwidth allocation), so some sites block it.

AlucardSpyderWriter · msg:1526052 · 6:06 pm on Nov 28, 2004 (gmt 0)

<<don't assume that the robots.txt you see is the same one served to spiders... that is not always the case.>>

By "the one that you see" did you mean "the one that your browser might get"? Because browsers don't request robots.txt at all.

What type of server gives up different versions of the file for different requests/user-agents/spiders?
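One way to see what a spider would actually be served is to request robots.txt yourself with that spider's User-Agent string. A minimal Python sketch (the site URL and UA string below are just placeholders; `urlopen()` would perform the real fetch):

```python
# Build a robots.txt request under a chosen User-Agent, so you can
# compare what different "spiders" are served.  Site and UA below are
# placeholders for illustration.
import urllib.request

def robots_request(site: str, user_agent: str) -> urllib.request.Request:
    """Build a request for a site's robots.txt under a chosen UA."""
    return urllib.request.Request(
        f"{site}/robots.txt",
        headers={"User-Agent": user_agent},
    )

req = robots_request("http://example.com",
                     "msnbot/1.0 (+http://search.msn.com/msnbot.htm)")
# urllib.request.urlopen(req).read() would return whatever body the
# server chooses to serve to that User-Agent.
print(req.get_header("User-agent"))
```

Comparing the body fetched under a bot UA with the one your browser shows would reveal whether the server is doing UA-dependent serving.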

jim_w · msg:1526053 · 7:28 pm on Nov 28, 2004 (gmt 0)

What I want to know is how much good, if any, there is in blocking 'bad' spiders, like some of the ones listed in the robots.txt mentioned above, when the scummy people using such bots can just change the user agent.

Should all of them be listed in the robots.txt file, or is it a moot point?

jdMorgan · msg:1526054 · 9:25 pm on Nov 28, 2004 (gmt 0)

ASW,

> What type of server gives up different versions of the file for different requests/user-agents/spiders?

Mine do. It's one way to cut bandwidth consumed by robots that don't understand multiple-user-agent records: detect those UAs and serve them a simplified robots.txt with their UA string inserted. A combination of mod_rewrite and some simple CGI scripting on Apache can do this easily.
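A minimal sketch of the selection logic jdMorgan describes, written as a plain Python function rather than mod_rewrite plus CGI (the bot names and Disallow rules here are made up for illustration):

```python
# UA-dependent robots.txt serving: bots known (hypothetically) to
# mishandle multiple-user-agent records get a simplified one-record
# file with their own UA name inserted; everyone else gets the full file.

FULL_ROBOTS = """\
User-agent: googlebot
User-agent: slurp
Disallow: /private/

User-agent: *
Disallow: /
"""

# Hypothetical bots that can't parse multiple-user-agent records.
SIMPLE_MINDED_BOTS = {"dumbbot", "oldcrawler"}

def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for this User-Agent string."""
    ua = user_agent.lower()
    for bot in SIMPLE_MINDED_BOTS:
        if bot in ua:
            # One record, with the bot's own name, so it can't misparse
            # the multi-UA records in the full file.
            return f"User-agent: {bot}\nDisallow: /private/\n"
    return FULL_ROBOTS

print(robots_txt_for("Mozilla/5.0 (compatible; DumbBot/1.0)"))
```

In a real deployment the dispatch would happen server-side (e.g. a mod_rewrite rule routing robots.txt requests to a script like this based on %{HTTP_USER_AGENT}).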

jim_w,

Some "bad" robots are in fact spoofs of legitimate user-agents. In cases where the legitimate robot visits but is considered to be of no practical use to the site owner, it may be Disallowed in robots.txt. It is in fact necessary to take stronger measures against the spoofers, but having the robots.txt disallow helps identify them (because they either don't fetch robots.txt at all, or they fetch it and ignore its contents). So no, it's not entirely a waste of time.
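One way to put that identification trick to work is to scan your access log for clients that present a legitimate bot's User-Agent but never requested /robots.txt. A rough sketch, assuming log entries have already been parsed into (ip, user_agent, path) tuples (the log data and bot token below are invented):

```python
# Flag IPs that claim a legitimate bot's User-Agent but never fetched
# /robots.txt -- a hint that the UA is spoofed.  Real log parsing and
# reverse-DNS verification are left out of this sketch.

def find_spoof_suspects(log_entries, bot_token="googlebot"):
    """log_entries: iterable of (ip, user_agent, path) tuples."""
    claimed = set()         # IPs presenting the bot's UA
    fetched_robots = set()  # of those, IPs that requested /robots.txt
    for ip, ua, path in log_entries:
        if bot_token in ua.lower():
            claimed.add(ip)
            if path == "/robots.txt":
                fetched_robots.add(ip)
    return claimed - fetched_robots

log = [
    ("66.249.66.1", "Googlebot/2.1", "/robots.txt"),
    ("66.249.66.1", "Googlebot/2.1", "/index.html"),
    ("10.0.0.99",   "Googlebot/2.1", "/index.html"),  # never fetched robots.txt
]
print(find_spoof_suspects(log))  # → {'10.0.0.99'}
```

A suspect flagged this way could then be checked more rigorously (e.g. reverse DNS) before being blocked outright.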

Jim

AjiNIMC · msg:1526055 · 6:11 pm on Nov 30, 2004 (gmt 0)

Check this out:

[66.102.7.104...]

No cache; I wonder how Google is getting to all the pages of WebmasterWorld.

[216.239.57.104...]

No page of WebmasterWorld has a cache, but it's getting indexed, maybe via links from outside.

AjiNIMC

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved