MSN Search is not following robots.txt

Forum Moderators: mack

jexx

3:47 pm on Feb 16, 2006 (gmt 0)

I just noticed that MSN Search is listing our spider trap (a page meant to catch bad bots) in the SERPs. This page is specifically excluded from the index with the following directive in robots.txt:

User-agent: *
Disallow: /badbots.html

When searching for our main domain, www.widgets.com, the page [widgets.com...] shows up first in the SERPs!

Very disturbing that MSNBot is apparently not following robots.txt.
Anyone else experience anything similar?

milanmk

6:36 pm on Feb 16, 2006 (gmt 0)

You can try checking your robots.txt file for any errors at this website.

[searchenginepromotionhelp.com...]

jexx

1:07 am on Feb 17, 2006 (gmt 0)

i know that. i always check my robots.txt files... and just to make sure there weren't any PIBKAC errors, i checked it again.

it definitely is *not* a local problem.
can anyone else verify this for any specific page that they have disallowed in robots.txt?

perhaps the recent algo shake-up has thrown MSNBot off a bit? i know that about 10-12 months ago slurp had a similar problem, not obeying robots.txt, but at least they didn't include those pages in SERPs.

milanmk

4:49 am on Feb 17, 2006 (gmt 0)

Did you try blocking that specific page with a robots META tag?

Dijkgraaf

9:55 am on Feb 17, 2006 (gmt 0)

Is msnbot actually requesting your badbots.html (can you see it in your log files)?
Is it a URL only listing, or does it have the title of the page and cached contents?

jexx

6:33 pm on Feb 17, 2006 (gmt 0)

milanmk: no, i haven't but i suppose i should try..

Dijkgraaf: it is a URL only listing, but this page does not have meta data on it (since it has no place in SERPs) so don't think that really helps.

interestingly enough, although the link (invisible, and in an obscure corner of every page) is absolute, [domain.com...], i do see msnbot try to hit this page for relative locations such as /dir1/dir2/badbots.html in the logs. i don't have any occurrences of msnbot hitting the trap and getting banned, so it *seems* like it is including the link in the index purely from it being on other indexed pages, without following it..?

i will try milanmk's suggestion, but this indicates that msnbot is not actually following the link (and perhaps adhering to robots.txt after all..?..) and just including the link (and giving it top placement if our domain.com is searched for) based on prevalence on pages..?

seems very odd to me.. hard to exclude if the link is not followed but still included?

if this is part of the new algo, then that seems like it would introduce more "junk" into the index.

milanmk

6:48 pm on Feb 17, 2006 (gmt 0)

i do see msnbot try to hit this page for relative locations such as /dir1/dir2/badbots.html in the logs

Maybe this is the reason why msnbot got confused and ultimately indexed your page.

I think you can try renaming your file in addition to my suggestion of adding META tags to it.

Dijkgraaf

8:44 pm on Feb 17, 2006 (gmt 0)

It is quite common for search engines to show a URL-only listing for a page that is banned in robots.txt.

What happens is that the bot finds a link and it puts that URL into the index to request later.
When it comes to request the URL, it finds out that the URL is disallowed in robots.txt, so doesn't request it.
Usually these URL only listings will not appear, unless you do a search for all pages in a domain.
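A toy model of the sequence described above, as a minimal sketch only: it assumes a crawler that records every URL it discovers and consults robots.txt just before fetching. The `crawl` function and the listing labels are hypothetical illustrations, not how MSN Search is actually implemented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /badbots.html
"""

def crawl(discovered_urls, robots_txt, agent="msnbot"):
    """Toy model: record every discovered URL, fetch only the allowed ones."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    index = {}
    for url in discovered_urls:
        if rules.can_fetch(agent, url):
            # Allowed: a real crawler would download the page at this point.
            index[url] = "full listing (title, snippet, cache)"
        else:
            # Disallowed: never requested, but the URL itself is still known.
            index[url] = "URL-only listing"
    return index

# Links found on other, already indexed pages (hypothetical example data).
links = [
    "http://www.widgets.com/index.html",
    "http://www.widgets.com/badbots.html",
]
listing = crawl(links, ROBOTS_TXT)
for url, kind in listing.items():
    print(url, "->", kind)
```

In this model the disallowed page is never requested, yet it still ends up with an entry, which matches the URL-only listing jexx is seeing.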

By the sounds of it you might have some links that are confusing the bot, such as a relative link e.g. "../badbots.html" or a link to badbots.html without a leading / in a page in a subdirectory.
Because /dir1/dir2/badbots.html doesn't match
disallow: /badbots.html
it is requesting those.
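To see the mismatch concretely, here is a small sketch using Python's standard library. The page URL `http://www.widgets.com/dir1/dir2/page.html` is a made-up example of a page in a subdirectory, and the standard-library parser only approximates how any given bot matches rules.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /badbots.html"])

# Hypothetical page in a subdirectory that links to the trap.
page = "http://www.widgets.com/dir1/dir2/page.html"

# The same target written three ways: root-relative and two relative forms.
hrefs = ["/badbots.html", "badbots.html", "../badbots.html"]
resolved = {href: urljoin(page, href) for href in hrefs}

for href, url in resolved.items():
    # Disallow rules are prefix matches on the path, so only the
    # root-level /badbots.html is actually blocked.
    print(f"{href!r:20} -> {url}  allowed={rules.can_fetch('msnbot', url)}")
```

Only the link with the leading / resolves to the disallowed path; the other two resolve to different paths that the rule does not cover.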

Putting those meta tags on a correctly banned page is redundant, as to read those meta tags, the bot has to actually request the page. It could be a good backup practice though.
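As a rough illustration of why the page has to be requested first: a robots META tag lives in the page body, so it can only be read after a download. The sketch below assumes a hypothetical trap page carrying `noindex, nofollow`; the `RobotsMetaParser` class is made up for this example.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

# Hypothetical page source; a bot sees this only if it downloads the page.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body></body></html>"
)

parser = RobotsMetaParser()
parser.feed(PAGE)
print(sorted(parser.directives))
```

A bot that obeys the robots.txt ban never gets as far as `parser.feed`, which is why the tag only helps as a backup for URLs the Disallow rule fails to cover.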

Another thing you can do is add rel="nofollow" to the links to badbots.html. I'm going to be experimenting with this to see if it will keep the page out of search indexes.

caveman

10:32 pm on Feb 17, 2006 (gmt 0)

Since this update started, they're showing a lot of our pages that are banned from bots, also.

jexx

1:25 am on Feb 18, 2006 (gmt 0)

thanks caveman.. verification!

regarding the relative links, NO. i specifically made this link absolute as [domain.com...] on *every* page that it is on, as to prevent such problems..

seems to me that it is a problem from this update, b/c it just recently started happening..

does MSN have a URL or e-mail for page exclusion, since it is causing a pretty bad headache as it is right now..?

caveman

1:37 am on Feb 18, 2006 (gmt 0)

Weird though. It's not consistent, meaning we have parallel sites and one shows the issue and the other one doesn't. Nothing is ever exactly equal between two different sites, but when both rank well, have the same sorts of domain names and structures, and then one goes funky in the SERPs and the other one doesn't, you gotta wonder.

Don't get why that would be, except that it adds to speculation that something went wrong recently, aside from the obvious observations about subdomains etc...

jexx

8:10 pm on Feb 22, 2006 (gmt 0)

MSN Search *seems* to have removed this page from the index, which suggests that it was indeed an algo problem..

Jordo needs a drink

3:31 pm on Feb 23, 2006 (gmt 0)

Might be this also...

[webmasterworld.com...]