Duplicate pages and noindex

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Duplicate pages and noindex

johnn

12:28 am on Jun 26, 2005 (gmt 0)

In a website, if I have 2 links and when I click each link, the same page is displayed. If I put noindex in one click, does Google still index this page?

johnn

10:53 pm on Jun 26, 2005 (gmt 0)

Hmm, no answer? I'm sure some people have this problem before. Here's an example:
An article is written by 2 authors, and it may have two urls (therefore 2 links) like:

abc.com?id=123&author=au1
abc.com?id=123&author=au2

It will display same page but different urls. That's the problem!

This can also happen in case when a product belongs to 2 different categories! I'm sure some people here had this before.

sailorjwd

1:26 am on Jun 27, 2005 (gmt 0)

What... no helpful people here? They are all hungover from N.O.

I don't know about putting noindex on a 'click' but if you put the noindex meta tag on one of the dup pages that would do it IMHO.

Reid

1:30 am on Jun 27, 2005 (gmt 0)

<a href=oneofthelinks rel="nofollow">adopted by G,Y,M</a>

nofollow seems more for external links though
an alternative would be
robots.txt
user-agent: *
disallow: one of the URL's

Clint

8:03 am on Jun 27, 2005 (gmt 0)

Don't forget:

User-agent: *
Disallow: /one of the URL's

....The slash before the path.

I don't know if that is case-sensitive or not. If it is, what's above is correct.

Reid, do you know if this definitely works?
<a href=oneofthelinks rel="nofollow">adopted by G,Y,M</a>

Joe, I don't think the noindex tag works for G. I was about to post something on this. I have the 'noindex, nofollow' tag on one of my links pages for the G-bot, and it still visits the page! Apparently it doesn't obey it.

SebastianX

10:28 am on Jun 27, 2005 (gmt 0)

The NOINDEX value of the robots META tag will not prevent from crawling. A crawler must fetch the page to parse META tags. NOINDEX means 'do not deliver this page on the SERPs'.

As for the initial question: I'd go for NOINDEX,FOLLOW with one page and INDEX,FOLLOW (=ALL = default value =obsolete) with the other.

sailorjwd

12:03 pm on Jun 27, 2005 (gmt 0)

Noindex meta seems to work for me. I just checked and there are no pages with Noindex and in my site: command.

But what do I know - nill.

Clint

1:50 pm on Jun 27, 2005 (gmt 0)

>>>>Joe, I don't think the noindex tag works for G. I was about to post something on this. I have the 'noindex, nofollow' tag on one of my links pages for the G-bot, and it still visits the page! Apparently it doesn't obey it. <<<<<<<

I found out (DUHHHHH), that bot has to first visit the page in order to see the meta tag! So then I assume after it crawls the <head>, it sees the "noindex" or whatever and obeys it.

Reid

3:53 pm on Jun 27, 2005 (gmt 0)

Reid, do you know if this definitely works?
<a href=oneofthelinks rel="nofollow">adopted by G,Y,M</a>

yes - the bot will disregard a link with the nofollow attribute but it is designed for external links - not sure how it would affect ranking if you start using it on internal links though.

we are talking about 2 links to the same page from one page right?
I would just use robots.txt

user-agent: *
disallow: /?id=123&author=au2

BTW this is a perfect example where google sitemap could prove very useful
The problem here is that one page has 2 URL's which google could index the same page under both URL's and cause dupe-content. Google sitemap could straighten this mess by only submitting one of the 2 URL's in the sitemap.xml file. Doesn't do much for Yahoo and MSN though.

johnn

11:28 pm on Jun 27, 2005 (gmt 0)

>>we are talking about 2 links to the same page from one page right?

Yes. And therefore 2 different links to TWO different urls of the same page! The main problem is if Google still index such page after 'noindex' is placed properly - that's the question.

Reid

5:28 am on Jun 28, 2005 (gmt 0)

if you put a noindex Meta tag on the page then neither URL would get indexed. That's why I was saying to disallow one of the URL's in robots.txt and the page would be indexed under the other URL.

This is how google indexes basically

1. googlebot finds the URL from a link on one of your pages (or from another website) and adds a URL-only listing to the index
2. another googlebot crawls the URL-only listing and adds the Title and description, cache ect.

If you do nothing, it will list both URL's as 'pages' and crawl each URL seperatly causing a duplicate listing.

If you put a META noindex on the page it will attempt to crawl both URL's but leave them both as URL-only. it will treat the 2 URL's as 2 seperate 'pages' with a noindex meta tag on each of them.

Block one URL in robots.txt and it will list the page under the other URL and it will remove the disallowed URL from the index. If someone links to the disallowed URL then it will appear in the index as URL-only but not get crawled because of robots.txt even get removed because it is disallowed in robots.txt. It will keep cycling through getting listed (from the other site) and getting removed (from robots.txt)

Reid

5:51 am on Jun 28, 2005 (gmt 0)

the difference between robots.txt and Meta robots directions

robots.txt disallows the URL before it is even requested (do not request this URL)

robots META tag is in the header, after the URL has been requested. It tells robots NOINDEX (do not add this page to the index - leave it URL-only in google)
NOFOLLOW (do not follow the links on this page)

Google is funny that way 'knowing about' a URL does not make it 'indexed'
I'm not sure wether they 'know about' 8 million URL's or they actually have 8 million URL's 'indexed'

SebastianX

6:51 am on Jun 28, 2005 (gmt 0)

Since the page is dynamic, I'd put in a NOINDEX meta tag for one version, and no robots tag for the other version, depending on the input parameters. This works with every crawler.

If you search for "steering and supporting search engine crawling" you'll find more info on robots.txt, robots meta tags, rel=nofollow ... in my tutorial.

johnn

2:30 am on Jun 29, 2005 (gmt 0)

Thank you for information. So to sum it up, here's what I'll will do:

- I will Not use noindex on any dynamic page.
- Because I have many different pages but they all
are this type of page, I'll create a folder called xyz that will contain all URLs that I don't want search engines to index; so in robots.txt, I will add this line:

disallow: /xyz/

I'm safe now?
Thank you,
Best regards,
John

Clint

1:43 pm on Jul 1, 2005 (gmt 0)

Reid, do you know if this definitely works?
<a href=oneofthelinks rel="nofollow">adopted by G,Y,M</a>
yes - the bot will disregard a link with the nofollow attribute but it is designed for external links - not sure how it would affect ranking if you start using it on internal links though.

Thanks Reid. Reason I was asking is to maybe do this with some of the sites to which I link that G may not like.

I'm confused by this:

>>we are talking about 2 links to the same page from one page right? <<<
Yes. And therefore 2 different links to TWO different urls of the same page! The main problem is if Google still index such page after 'noindex' is placed properly - that's the question.

What's wrong with one page having more than one link to the same page? This is with INTERNAL links. Many if not most may have a few links on one page that point to the same other page at their site. For example, a product page and at the top you may have hyperlinked "back to whatever.com home page" and you may also have that at the bottom of the product's page.

Clint

1:45 pm on Jul 1, 2005 (gmt 0)

if you put a noindex Meta tag on the page then neither URL would get indexed. That's why I was saying to disallow one of the URL's in robots.txt and the page would be indexed under the other URL.

How can putting a robots meta tag on one page affect some other page? Please explain.
Thanks.

Reid

11:04 am on Jul 2, 2005 (gmt 0)

This is the topic right?

abc.com?id=123&author=au1
abc.com?id=123&author=au2
It will display same page but different urls. That's the problem!

so lets call the page article123.
either URL above will serve up this page because article 123 was written by author 1 and 2 both.

There is a concern that google will list the same article (123) under both URL's - that's 2 pages with the exact same content.
Now if that content has a robots META tag (noindex) in the header, then neither URL would get indexed because they both point to the same content containing the robots tag.
But if there were no robots tag, then both URL's would be indexed.
We only want one of the URL's indexed (because there's only one page) so if we disallow one of the 2 URL's in robots.txt and have no robot's tag on the page, then it will be indexed one page one URL.
The URL that's not disallowed will be indexed but the URL that is disallowed will not be indexed.
So One page 2 URL's
one indexed and one disallowed

Reid

11:32 am on Jul 2, 2005 (gmt 0)

the other solution would be to write a robots meta (noindex) into the script.
when id=123au=1 is called then it gets a noindex written into the header
when id=123au=2 is called it doesn't get written in.

this would be a better solution because if any bot does find the (noindex) URL then it won't index it.

disallowing in robots.txt doesn't keep the bots from finding it through an inbound link, but should it get indexed, the robots.txt should cause it to be removed again (get rid of the IBL of course).

John's solution also looks good, folder /xyz/

zafile

3:16 pm on Jul 2, 2005 (gmt 0)

...