Forum Moderators: mack
User-agent: *
Disallow: /badbots.html
When searching for our main domain, www.widgets.com, the page [widgets.com...] shows up first in the SERPs!
Very disturbing that MSNBot is apparently not following robots.txt.
Anyone else experience anything similar?
It definitely is *not* a local problem.
Can anyone else verify this for any specific page that they have disallowed in robots.txt?
Perhaps the recent algo shake-up has thrown MSNBot off a bit? I know that about 10-12 months ago Slurp had a similar problem, not obeying robots.txt, but at least they didn't include those pages in the SERPs.
Dijkgraaf: it is a URL-only listing, but this page has no metadata on it (since it has no place in the SERPs), so I don't think that really helps.
Interestingly enough, although the link (invisible, in an obscure corner location on every page it appears on) is absolute, [domain.com...], I do see msnbot trying to hit this page at relative locations such as /dir1/dir2/badbots.html in the logs. I don't have any occurrences of msnbot hitting the trap and getting banned, so it *seems* like it is including the link in the index purely from its appearing on other indexed pages, without actually following it..?
I will try milanmk's suggestion, but this indicates that msnbot is not actually following the link (and is perhaps adhering to robots.txt after all..?) and is just including the URL (and giving it top placement when our domain.com is searched for) based on its prevalence across pages..?
Seems very odd to me.. hard to exclude a page if the link is not followed but still included?
If this is part of the new algo, then it seems like it would introduce more "junk" into the index.
What happens is that the bot finds a link and it puts that URL into the index to request later.
When it comes to request the URL, it finds out that the URL is disallowed in robots.txt, so doesn't request it.
Usually these URL-only listings will not appear unless you do a search for all pages in a domain.
By the sounds of it, you might have some links that are confusing the bot, such as a relative link (e.g. "../badbots.html"), or a link to badbots.html without a leading / on a page in a subdirectory.
Because /dir1/dir2/badbots.html doesn't match
Disallow: /badbots.html
it is requesting those.
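A minimal sketch of the behaviour described above, using Python's standard library (the example.com host and /dir1/dir2/ paths are placeholders, not the poster's actual site): a relative href resolves against the linking page's directory, and robots.txt Disallow rules are prefix matches against the path, so a root-level rule does not cover the deeper copy.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# A relative href like "badbots.html" on a page deep in the site
# resolves against that page's directory, not the site root:
resolved = urljoin("http://www.example.com/dir1/dir2/page.html", "badbots.html")
print(resolved)  # http://www.example.com/dir1/dir2/badbots.html

# Disallow rules are path-prefix matches, so "Disallow: /badbots.html"
# only blocks the copy at the root:
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /badbots.html"])
print(rp.can_fetch("msnbot", "http://www.example.com/badbots.html"))            # False
print(rp.can_fetch("msnbot", "http://www.example.com/dir1/dir2/badbots.html"))  # True
```

This is why a mis-resolved relative link can produce requests for URLs the site owner believed were banned.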
Putting those meta tags on a correctly banned page is redundant, as to read those meta tags, the bot has to actually request the page. It could be a good backup practice though.
Another thing you can do is add rel="nofollow" to the links to badbots.html. I'm going to be experimenting with this to see if it will keep the page out of search indexes.
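A rough sketch of how one might audit a site's pages for trap links that are missing rel="nofollow", using Python's built-in HTML parser (the NofollowAudit class and the sample markup are hypothetical, not tied to any real crawler or site):

```python
from html.parser import HTMLParser

class NofollowAudit(HTMLParser):
    """Collect hrefs pointing at the trap page that lack rel="nofollow"."""

    def __init__(self, trap="badbots.html"):
        super().__init__()
        self.trap = trap
        self.missing = []  # trap links without nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href") or ""
        rel = a.get("rel") or ""
        if self.trap in href and "nofollow" not in rel:
            self.missing.append(href)

# Example: one bare trap link, one properly marked one.
page = '<a href="/badbots.html">trap</a> <a rel="nofollow" href="/badbots.html">ok</a>'
audit = NofollowAudit()
audit.feed(page)
print(audit.missing)  # ['/badbots.html']
```

Running something like this over every template would catch a stray link that was never given the nofollow attribute.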
Regarding the relative links: NO. I specifically made this link absolute, as [domain.com...], on *every* page it appears on, precisely to prevent such problems.
Seems to me that it is a problem from this update, because it only recently started happening.
Does MSN have a URL or e-mail address for page exclusion? It is causing a pretty bad headache as it is right now.
I don't get why that would be, except that it adds to the speculation that something went wrong recently, aside from the obvious observations about subdomains etc...
[webmasterworld.com...]