MSN Live not respecting robots.txt rules

What's their problem?

     
9:02 am on Sep 26, 2007 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2005
posts:1692
votes: 3


MSN Live Search is one of the only mainstream search engines that keeps getting caught in my bot trap, which is, obviously, forbidden in my robots.txt file. What's their problem? Why do they visit files they shouldn't? Shouldn't they be focusing on evaluating real pages instead? It's not like they're sending any real traffic anyway... that's just another strike against tolerating their bot, and my patience has limits.
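
(For context: a bot trap is typically a directory that is disallowed for every user-agent in robots.txt, so only crawlers that ignore the file ever request it. A hypothetical example of the relevant rule:

    User-agent: *
    Disallow: /bot-trap/

Any spider that then fetches anything under /bot-trap/ is either ignoring robots.txt or never read it.)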
9:06 am on Sept 26, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 11, 2004
posts:1014
votes: 0


Robots.txt makes suggestions and requests. There is no obligation for any spider or bot to make use of the robots.txt file or its contents; it's just good manners if they do.

Matt

5:37 pm on Sept 26, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:1825
votes: 21


Their search engine is junk; they are not able to respect robots.txt or a reasonable crawl rate. It's 2007, not 1998; it seems like they have been hiding under a bridge.

I think they should give up on building their search engine; it is too late.

[edited by: SEOPTI at 5:38 pm (utc) on Sep. 26, 2007]

8:17 am on Oct 2, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 19, 2003
posts:70
votes: 0


Yup - just checked logs this morning. MSNbot appears to randomly select items from the robots file to ignore and subsequently index.

That said, it does such a poor job of crawling the site (despite a few sitemap files) that it is hard to say whether it would ignore all of the robots exclusions if it ever worked properly.

Also discovered that it does not appear to understand a sitemap index file. Only when you explicitly put all the sitemaps in the robots file does the silly bot retrieve them.
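
For reference, the workaround described above amounts to listing each sitemap on its own Sitemap: line in robots.txt rather than pointing the bot at a single sitemap index file. The URLs below are hypothetical:

    Sitemap: http://www.example.com/sitemap-pages.xml
    Sitemap: http://www.example.com/sitemap-articles.xml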

1:47 pm on Oct 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 6, 2002
posts:1825
votes: 21


MSN has failed big time at building a search engine; they are stuck in 1998. They had almost 10 years to write a function that respects robots.txt.
What a failure.
3:20 pm on Oct 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


A useful technique for this situation is to detect known-good 'bot requests for your 'trap' URLs, and internally rewrite them to a minimal page containing a link to your home page and a <meta name="robots" content="noindex"> tag.

Yes, it's cloaking, but with no intent to deceive anyone.

Jim
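
A minimal sketch of the approach Jim describes, written here as an application-level handler in Python rather than a server-level rewrite; the trap path and bot-name substrings are hypothetical examples, not anyone's actual configuration:

    # Sketch only: when a known, well-behaved bot requests a trap URL, serve a
    # harmless minimal page (home link + noindex) instead of the real trap.
    TRAP_PREFIXES = ("/bot-trap/",)                 # hypothetical trap path
    KNOWN_BOTS = ("msnbot", "googlebot", "slurp")   # hypothetical UA substrings

    NOINDEX_PAGE = (b'<html><head><meta name="robots" content="noindex"></head>'
                    b'<body><a href="/">Home</a></body></html>')

    def application(environ, start_response):
        """WSGI entry point."""
        path = environ.get("PATH_INFO", "")
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        is_trap = any(path.startswith(p) for p in TRAP_PREFIXES)
        is_known_bot = any(bot in ua for bot in KNOWN_BOTS)
        if is_trap and is_known_bot:
            # Known bot on a trap URL: hand it the minimal noindex page.
            start_response("200 OK", [("Content-Type", "text/html")])
            return [NOINDEX_PAGE]
        # Everything else falls through to the normal site / real trap logic.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("", 8000, application).serve_forever()

In practice the "known-good" check is usually stronger than a user-agent substring (for example, a reverse-DNS check on the requesting IP), since user-agent strings are trivially forged.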