


MSN Live not respecting robots.txt rules

What's their problem?

9:02 am on Sep 26, 2007 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

MSN Live search is one of the only mainstream search engines that keeps getting caught in my bot trap, which is, obviously, forbidden in my robots.txt file. What's their problem? Why do they visit files they shouldn't? Shouldn't they be focusing on evaluating real pages instead? It's not like they're sending any real traffic anyway... that's just one more strike against tolerating their bot, and my patience has limits.
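
For anyone curious, the trap itself is nothing exotic: a directory disallowed in robots.txt and linked only in a way real visitors never follow, so any hit on it can only come from a bot that read the rules and ignored them (or never read them at all). The path here is a placeholder, not my real one:

User-agent: *
Disallow: /bot-trap/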
9:06 am on Sep 26, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Robots.txt makes suggestions and requests. There is no obligation for any spider or bot to fetch the robots.txt file or follow its contents; it's just good manners if they do.
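
To underline that: compliance is entirely voluntary on the crawler's side. As a quick illustration, Python's standard urllib.robotparser module is what a well-behaved crawler might use to ask permission before fetching - and a rude bot simply skips this step (the URLs are hypothetical):

from urllib.robotparser import RobotFileParser

# A polite crawler fetches and parses robots.txt first...
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# ...and checks each URL before requesting it. Nothing enforces this.
if rp.can_fetch("msnbot", "http://www.example.com/bot-trap/"):
    print("allowed to crawl")
else:
    print("disallowed - a polite bot stops here")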

Matt

5:37 pm on Sep 26, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Their search engine is junk; they can't manage to respect robots.txt or a reasonable crawl rate. It's 2007, not 1998. It seems like they have been hiding under a bridge.

I think they should give up on building their search engine; it's too late.

[edited by: SEOPTI at 5:38 pm (utc) on Sep. 26, 2007]

8:17 am on Oct 2, 2007 (gmt 0)

10+ Year Member

Yup - just checked the logs this morning. MSNbot appears to randomly select items from the robots file to ignore and subsequently index.

That said, it does such a poor job of crawling the site (despite a few sitemap files) that it is hard to say whether it would ignore all of the robots exclusions if it ever worked properly.

Also discovered that it does not appear to understand a sitemap index file. Only when you explicitly list all the sitemaps in the robots file does the silly bot retrieve them.
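
For anyone hitting the same wall, the workaround that worked here was to list every child sitemap on its own Sitemap: line in robots.txt rather than relying on the index file (the filenames are just examples):

Sitemap: http://www.example.com/sitemap-pages.xml
Sitemap: http://www.example.com/sitemap-products.xml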

1:47 pm on Oct 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

MSN has failed big time at building a search engine; they are stuck in 1998. They had almost 10 years to write a function that respects robots.txt.
What a failure.
3:20 pm on Oct 6, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

A useful technique for this situation is to detect known-good 'bot requests for your 'trap' URLs, and internally rewrite them to a minimal page containing a link to your home page and a <meta name="robots" content="noindex"> tag.

Yes, it's cloaking, but with no intent to deceive anyone.
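
For the Apache crowd, here is a rough sketch of the idea in .htaccess with mod_rewrite. The trap path, stub filename, and user-agent pattern are examples only, and a production version should verify the 'bot via reverse DNS rather than trusting the User-Agent header:

RewriteEngine On
# Known-good 'bot asking for the trap? Internally rewrite to the stub.
RewriteCond %{HTTP_USER_AGENT} (msnbot|Googlebot|Slurp) [NC]
RewriteRule ^bot-trap/ /bot-stub.html [L]

And /bot-stub.html is just the minimal page described above:

<html><head><meta name="robots" content="noindex"></head>
<body><a href="/">Home</a></body></html>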

Jim