
ISAPI rewrite - a specific question

about IIS and how pages are rendered after a rewrite


Fiver

6:30 pm on Oct 31, 2006 (gmt 0)

10+ Year Member



A site has duplicate content penalties in Google because it uses URLs that pass

?partner=partnersite

Placing a noindex,nofollow on the page will stop a spider from following its internal links any further, but it won't stop a spider from following a link in from an external site to page.aspx?partner=partnersite (thousands of such external links exist).

ISAPI-rewriting the URL to page.aspx/partner/partnersite won't eliminate duplicates, as this URL will be identical in content to page.aspx/partner/partnersite2.
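For the record, a rule like this in ISAPI_Rewrite's httpd.ini is roughly what I mean - I'm not the webmaster, so take the exact syntax as a sketch rather than what's actually deployed:

[ISAPI_Rewrite]
# Internally map the path form back onto the real query-string page
RewriteRule /page\.aspx/partner/([^/]+) /page.aspx?partner=$1 [L]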

What would work in theory is if I could include a meta no-index on this virtual page as it is rendered, while avoiding placing that meta no-index on the original version of the page.

Could this be accomplished in robots.txt (or by any other method) for URLs _after_ they have been rewritten with ISAPI Rewrite? The question is: can I render any page served by IIS that has ?partner= in the requested URL with a meta noindex?
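To make the idea concrete, here's roughly what I'm picturing in the page's code-behind (an ASP.NET/C# sketch, assuming the page's head tag is marked runat="server" - the names are mine, not the site's actual code):

protected void Page_Load(object sender, EventArgs e)
{
    // Only requests carrying ?partner= get the noindex meta;
    // the clean URL is served unchanged.
    if (!String.IsNullOrEmpty(Request.QueryString["partner"]))
    {
        HtmlMeta robotsMeta = new HtmlMeta(); // System.Web.UI.HtmlControls
        robotsMeta.Name = "robots";
        robotsMeta.Content = "noindex,nofollow";
        Page.Header.Controls.Add(robotsMeta);
    }
}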

/not used to IIS
/not the webmaster but will have to explain it to them clearly - so I need to get my head around it.

please feel free to tell me if I'm barking up the wrong tree.

Thanks in advance,
Fiver

tedster

8:40 pm on Oct 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since Googlebot supports pattern matching, I think this should work in your robots.txt:

User-agent: Googlebot
Disallow: /*?partner=*

Then any ISAPI Rewrite involvement evaporates, no?

Fiver

7:51 pm on Nov 1, 2006 (gmt 0)

10+ Year Member



You're right, tedster - I also thought of that solution after posting, but it brings another issue to light.

There are a thousand partner pages indexed for this site in Google, all identical. The issue is that these session-ID-laden versions are linked to from external sites (the partners link to the site with their ID in the URL, naturally - it's not an affiliate program, it just looks that way).

We can't change the fact that these links from other sites exist; all we can do is serve the pages with noindex,nofollow when they are rendered with ?partner= in the URL. We did this a couple of months ago, but Google hasn't crawled these session-ID-laden links since April (cache date)... and hence hasn't removed them from the index.

I can update my robots.txt with the disallow you suggest, and then ask the Google URL removal tool to revisit that robots.txt, BUT I get the feeling the removal bot will simply spider the site as-is, removing any pages it happens to come across with ?partner= in them. This will do little good, because it won't find the partner-ID'd versions of each page. There aren't any links on the site to them... just from external sites.

Short of writing a huge and rather dumb sitemap linking to all of these ID'd pages, for the purpose of Google discovering they are now meta'd with noindex, what should I do? (I could take the nofollow part out of the metas and put a link to the index with each partner ID - that would save some work, but it's still a thousand partners :/ )
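If it does come to the dumb sitemap, generating it is at least trivial - a throwaway C# sketch, with the domain and file names made up:

using System;
using System.IO;

class PartnerLinkPage
{
    static void Main()
    {
        // partners.txt: one partner ID per line (hypothetical list)
        string[] partners = File.ReadAllLines("partners.txt");
        using (StreamWriter w = new StreamWriter("partner-links.html"))
        {
            w.WriteLine("<html><body>");
            foreach (string p in partners)
                w.WriteLine("<a href=\"http://www.example.com/page.aspx?partner="
                            + p + "\">" + p + "</a><br>");
            w.WriteLine("</body></html>");
        }
    }
}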

Is there any chance the Google removal bot will go ahead and recheck every page that shows up for a site:domain.com query?

THAT would solve the problem... but my instinct is that the bot will simply crawl the site again, not revisit all the indexed files.

Our patience is wearing thin waiting for Google to revisit the URLs it cached back in April...