MSN indexing urls of pages blocked by robots meta tag

indexing & ranking landing page urls, affiliate links, etc

         

Robert Charlton

11:20 pm on Mar 10, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In the past week, we've been hit with numerous problems with MSN indexing urls of pages "blocked" by the meta robots tag. Unfortunately, MSN is ranking these well.

In all cases, I've been able to locate backlinks for these pages... In one case, it was an old affiliate link to what a client had turned into a test page (something he shouldn't have done, but he figured it was a blocked page, so 'why not?'). On another "blocked" page, there was a link to one of our PPC landing pages.

Because Google will index urls to pages blocked by robots.txt if the links to these pages are exposed, I've been using the meta robots tag instead of robots.txt to block such pages...

<meta name="robots" content="noindex, nofollow">

It works on Google. It seems, though, that MSN might be doing things the opposite way, even though they state that they do observe "noindex":

MSN Live Search - Site Owner Help [search.live.com]

Use metadata tags to control page indexing and link crawling

You can allow MSNBot to crawl your website and still restrict access to specific web pages and documents by using the noindex and nofollow meta tags within the page code. The noindex tag allows the web page to be retrieved by MSNBot, but blocks indexing of its content.

What I'm seeing suggests not only that MSN is not currently observing the robots noindex meta tag, at least not in a way that's consistent with Google's observance... but also that MSN is continuing to have huge problems making quality discriminations among pages and among inbound links. These pages with these links should never be appearing in the index, let alone ranking.

Has anyone else seen this?

Beyond that... and maybe MSN Dude will step in... how can we get our landing pages out of MSN search?

jdMorgan

11:38 pm on Mar 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They've got some core problems with their robots. Even several months after they were notified, their various 'bots still can't properly parse (prefix-match) the User-agent: line in robots.txt, so it's not really surprising that they'll list pages with <meta name="robots" content="noindex"> tags on them.

One thing that promotes clarity in these discussions is to distinguish between robots fetching a page (controlled by robots.txt) and search engines listing (indexing) a page (controlled by the on-page meta-robots tag). A common problem is that if the page is Disallowed in robots.txt, then a robots.txt-compliant robot can't fetch it to see the meta-robots tag on that page. In that case, the result is that the page may be listed in the SE index as URL-only or URL-with-link-text if a link to the page is found.

Jim
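
The fetch-versus-list distinction in the post above can be sketched in Python with the standard library only. This is a minimal illustration, not how any search engine actually behaves: the robots.txt rules, the example.com URL, the page HTML, and the names `RobotsMetaParser` and `listing_outcome` are all assumptions made up for the sketch.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def listing_outcome(robots_txt, user_agent, url, page_html):
    """Describe what a compliant robot could do with this URL (hypothetical helper)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never fetched, so a meta robots tag on it goes unseen.
        return "not fetched: may still be listed URL-only if a link is found"
    meta = RobotsMetaParser()
    meta.feed(page_html)
    if "noindex" in meta.directives:
        return "fetched: noindex seen, should not be listed"
    return "fetched: may be indexed"


# Illustrative inputs (assumed, not taken from any real site).
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
URL = "http://www.example.com/landing/page.html"

blocked = listing_outcome("User-agent: *\nDisallow: /landing/", "msnbot", URL, PAGE)
allowed = listing_outcome("User-agent: *\nDisallow: /cgi-bin/", "msnbot", URL, PAGE)
print(blocked)
print(allowed)
```

In the first call the Disallow rule hides the noindex tag from the robot, which is the "obscured meta tag" case; in the second the robot can fetch the page and see the tag.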

Robert Charlton

12:51 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...A common problem is that if the page is Disallowed in robots.txt, then a robots.txt-compliant robot can't fetch it to see the meta-robots tag on that page. In that case, the result is that the page may be listed in the SE index as URL-only or URL-with-link-text if a link to the page is found.

Jim - Thanks....

Yes, your latter point is something that needs to be emphasized to webmasters. Using the meta robots tag and then obscuring it with robots.txt is a common problem. In my experience, the subject has led to several debates with clients' webmasters, where I've wanted to use the meta robots tag and have asked the webmaster to drop the robots.txt. It scares them.

Unfortunately, the situation with MSN right now is not making matters easier. In the case I cite in my post above, I've specifically "been using the meta robots tag instead of robots.txt to block such pages..."

And what to do if Google does it one way and MSN chooses to do it another? The engines do need to get on the same page about this (no pun intended). I know they all talk to each other. This needs to go at the top of their list, and MSN needs to fix its problems immediately.

And again, how in the world did they ever decide that these pages were worthy of ranking?

jdMorgan

1:12 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> It scares them.
If bandwidth consumption by 'bots is a concern, then it might legitimately do so...

> how in the world did they ever decide that these pages were worthy of ranking?
Unique content? :)

As to your question of what to do if they do things differently or are broken, either live with it or maybe cloak the pages with a password required for robot user-agents. No intent to deceive -- just keep out, thanks.

Jim
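
The "password required for robot user-agents" idea might look something like the WSGI sketch below. It is only an assumption about what such a setup could be: the `ROBOT_TOKENS` list, the realm name, and the page body are invented for illustration, user-agent strings are easily spoofed, and a real deployment would also have to validate any credentials that are supplied.

```python
# Substrings assumed to identify crawler user-agents; illustrative, not exhaustive.
ROBOT_TOKENS = ("msnbot", "googlebot", "slurp")


def is_robot(user_agent):
    """Return True if the user-agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ROBOT_TOKENS)


def app(environ, start_response):
    """WSGI app: answer robot user-agents with 401, serve the page to everyone else."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if is_robot(user_agent):
        # Robots get an authentication challenge instead of the page content.
        # Checking supplied credentials is deliberately left out of this sketch.
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", 'Basic realm="landing pages"'),
            ("Content-Type", "text/plain"),
        ])
        return [b"Authorization required.\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Landing page</body></html>"]
```

To try it locally, the app could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.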

Marcia

1:26 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> how in the world did they ever decide that these pages were worthy of ranking?
Unique content?

One page in particular is *not* a page with anything at all, it's a redirect URL for an affiliate link through a popular aggregating service that's blocked by both robots.txt and meta=noindex,nofollow.

Robert Charlton

7:54 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



One page in particular is *not* a page with anything at all, it's a redirect URL for an affiliate link through a popular aggregating service...

Marcia - Not sure exactly what you're referring to here. I should have said "url," not "page," with regard to ranking... but are you suggesting that this is akin to the old 302 "hijacking" problem, and that it's a click-tracking page that's ranking?

If I remember correctly how those looked, it's not the same... since the result is clustered with another of our pages, not one of the linking site's pages.

Marcia

9:00 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The title reads:

xyz.example.com

The URL under that reads:

xyz.example.com/robots_txt_excluded_subdirectory/jump.php?sid=12345678abcdefg...

It's a long tail search that's got only a few hundred pages returned, but it's so way off it's as though you were searching for imported canned kumquats and when you clicked on that link you arrived at a page selling snowplows.

are you suggesting that this is akin to the old 302 "hijacking" problem, and that it's a click-tracking page that's ranking?

No one is hijacking anything, although some of Google's Froogle pages were hijacking sites at MSN a while back using 302's, with Google ranking for the strangest things.

But yes, this is a link that's going through a tracking page on a third party site, and the link that MSN's crawler grabbed and listed is JAVASCRIPT.

[edited by: Marcia at 9:33 am (utc) on Mar. 11, 2007]

Robert Charlton

9:40 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



xyz.example.com/robots_txt_excluded_subdirectory/jump.php?sid=12345678abcdefg...

It's a long tail search that's got only a few hundred pages returned, but it's so way off it's as though you were searching for imported canned kumquats and when you clicked on that link you arrived at a page selling snowplows.

Marcia - Yes, that "jump.php..." does remind me of the good old days of 302 "page-jacking."

That's not what I think I'm seeing at all, except that MSN may be badly behaved.

The results I'm talking about are for searches that return about 2 million and 25 million pages respectively in Google, and about 100,000 and 1.5 million in MSN (interesting difference). Nothing long tail about these. It took a year or two or three to achieve the results we have on Google, with a fair number of genuinely good links. It's unbelievable that on the basis of one tracking string link, or a spidered major search engine ad, MSN would put these in the top 10.

The results are spot on for the site, albeit they're not the pages (or urls) you'd expect to rank, particularly since we've got a meta robots noindex tag on the pages.

The serp listing is our domain as the title line, and then below it simply the urls that a few linkers happened to use... in this case either an old affiliate link to us...

domain.com/pagename?XYID=F3456q789 on one...

...or the url of one of our pay per click landing pages (with no tracking string), in the other, that someone who found us by an ad used in an article.

So, I'm not seeing it as the same situation, except as an example that MSN has got some cleaning up to do.

> It scares them.
If bandwidth consumption by 'bots is a concern, then it might legitimately do so...

Jim - Talking about the habits of clients wanting to keep their robots.txt. I think it's more the departure from the norm that frightens them. Many are willing to add the robots meta but argue a lot about dropping the robots.txt. It can't be bandwidth... we're generally not talking about that many pages.

As I said, examples like this don't help. It's hard to explain to a marketing manager, particularly if you'd argued with his IT guy to get him to drop the robots.txt.

Marcia

10:17 am on Mar 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



either an old affiliate link to us...

domain.com/pagename?XYID=F3456q789 on one...


That isn't altogether out of the realm of possibility, since MSN sometimes (but not often) does index affiliate URLs. Even Google has on occasion - in fact they did recently, but had the situation remedied within days.

I always have seen occasional affiliate URLs crop up at MSN, and they'll rank for the search term too, with whoever the link originates from ending up getting paid commissions for the sales.

All around, MSN really needs to work very HARD on how they handle redirects, both 301 and 302.