| 9:06 am on Sep 26, 2007 (gmt 0)|
Robots.txt makes suggestions and requests. There is no obligation for any spider or bot to make use of the robots.txt file or its contents; it's just good manners if they do.
| 5:37 pm on Sep 26, 2007 (gmt 0)|
Their search engine is junk; they are not able to respect robots.txt or the crawl rate. It's 2007, not 1998; it seems like they have been hiding under a bridge.
I think they should give up on building their search engine; it is too late.
[edited by: SEOPTI at 5:38 pm (utc) on Sep. 26, 2007]
| 8:17 am on Oct 2, 2007 (gmt 0)|
Yup - just checked logs this morning. MSNbot appears to randomly select items from the robots file to ignore and subsequently index.
That said, it does such a poor job of crawling the site (despite a few sitemap files) that it is hard to say whether it would ignore all of the robots exclusions if it ever worked properly.
Also discovered that it does not appear to understand a sitemap index file. Only when you explicitly put all the sitemaps in the robots file does the silly bot retrieve them.
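For reference, the workaround described above amounts to listing every child sitemap explicitly in robots.txt instead of pointing at a single sitemap index file. A sketch of what that looks like (the domain and filenames are illustrative):

```
User-agent: *
Disallow: /private/

# MSNbot reportedly ignores a sitemap index file, so instead of one
# "Sitemap: http://example.com/sitemap_index.xml" line, list each
# child sitemap individually:
Sitemap: http://example.com/sitemap-pages.xml
Sitemap: http://example.com/sitemap-articles.xml
Sitemap: http://example.com/sitemap-images.xml
```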
| 1:47 pm on Oct 6, 2007 (gmt 0)|
MSN has failed big time at building a search engine; they are stuck in 1998. They had almost ten years to write a function that respects robots.txt.
What a failure.
| 3:20 pm on Oct 6, 2007 (gmt 0)|
A useful technique for this situation is to detect known-good 'bot requests for your 'trap' URLs, and internally rewrite them to a minimal page containing a link to your home page and a <meta name="robots" content="noindex"> tag.
Yes, it's cloaking, but with no intent to deceive anyone.
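A minimal sketch of that rewrite, assuming a Python request handler; the names (`TRAP_PATHS`, `is_search_bot`, `serve_trap_content`) are illustrative, and a real deployment would verify bots via reverse DNS rather than trusting the user-agent string:

```python
# Paths that are disallowed in robots.txt and act as bot traps.
TRAP_PATHS = ("/trap/", "/private/")

# Minimal page served to known-good bots: a noindex tag plus a link home.
NOINDEX_PAGE = (
    '<html><head><meta name="robots" content="noindex"></head>'
    '<body><a href="/">Home</a></body></html>'
)

def is_search_bot(user_agent):
    # Crude substring check for illustration; verify with reverse DNS
    # in production, since user-agent strings are trivially forged.
    return any(name in user_agent.lower()
               for name in ("msnbot", "googlebot", "slurp"))

def serve_trap_content(path):
    # Stand-in for whatever the trap URL actually serves to everyone else.
    return "<html><body>trap content for %s</body></html>" % path

def handle_request(path, user_agent):
    # Known-good bot requesting a trap URL: internally rewrite to the
    # harmless noindex page instead of the trap content.
    if is_search_bot(user_agent) and any(path.startswith(p) for p in TRAP_PATHS):
        return NOINDEX_PAGE
    return serve_trap_content(path)
```

Because the bot receives a page explicitly marked `noindex`, the trap URL drops out of (or never enters) the index even if the crawler ignored the robots.txt exclusion, which is the whole point of the technique.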