Forum Moderators: Robert Charlton & goodroi
abc.com?id=123&author=au1
abc.com?id=123&author=au2
It will display the same page but different URLs. That's the problem!
This can also happen when a product belongs to 2 different categories! I'm sure some people here have had this before.
User-agent: *
Disallow: /one of the URL's
Note the slash before the path.
I don't know if robots.txt paths are case-sensitive or not. If they are, what's above is correct.
Reid, do you know if this definitely works?
<a href="oneofthelinks" rel="nofollow">adopted by G,Y,M</a>
Joe, I don't think the noindex tag works for G. I was about to post something about this. I have the 'noindex, nofollow' tag on one of my links pages, and the G-bot still visits the page! Apparently it doesn't obey it.
As for the initial question: I'd go for NOINDEX,FOLLOW on one page and INDEX,FOLLOW (= ALL = the default value, so effectively redundant) on the other.
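For reference, a rough sketch of what that suggestion looks like in the <head> of each version (the URLs are just the example ones from this thread, and this only works if the server can emit different <head> content depending on the query string):

```html
<!-- Version to keep OUT of the index, e.g. abc.com?id=123&author=au2 -->
<head>
  <meta name="robots" content="noindex,follow">
</head>

<!-- Version to keep IN the index, e.g. abc.com?id=123&author=au1 -->
<!-- INDEX,FOLLOW is the default, so the tag can simply be omitted -->
```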
I found out (DUHHHHH) that the bot has to visit the page first in order to see the meta tag! So I assume that after it crawls the <head>, it sees the "noindex" (or whatever) and obeys it.
yes - the bot will disregard a link with the nofollow attribute, but it is designed for external links - not sure how it would affect ranking if you start using it on internal links though.
we are talking about 2 links to the same page from one page right?
I would just use robots.txt
user-agent: *
disallow: /?id=123&author=au2
BTW, this is a perfect example of where a Google sitemap could prove very useful.
The problem here is that one page has 2 URLs, so Google could index the same page under both URLs and cause duplicate content. A Google sitemap could straighten this mess out by submitting only one of the 2 URLs in the sitemap.xml file. It doesn't do much for Yahoo and MSN, though.
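A minimal sitemap.xml along those lines, listing only the URL you want indexed (the URL is just the example value from this thread, using the sitemaps.org schema; note that & must be escaped as &amp; in XML):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only the preferred URL for article 123 is listed;
       the author=au2 duplicate is simply left out -->
  <url>
    <loc>http://abc.com/?id=123&amp;author=au1</loc>
  </url>
</urlset>
```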
This is how google indexes, basically:
1. googlebot finds the URL from a link on one of your pages (or from another website) and adds a URL-only listing to the index
2. another googlebot crawls the URL-only listing and adds the title, description, cache, etc.
If you do nothing, it will list both URLs as 'pages' and crawl each URL separately, causing a duplicate listing.
If you put a META noindex on the page, it will attempt to crawl both URLs but leave them both as URL-only. It will treat the 2 URLs as 2 separate 'pages' with a noindex meta tag on each of them.
Block one URL in robots.txt and it will list the page under the other URL and remove the disallowed URL from the index. If someone links to the disallowed URL, it will appear in the index as URL-only, but it won't get crawled (because of robots.txt) and will eventually get removed again (because it is disallowed in robots.txt). It will keep cycling through getting listed (from the other site's link) and getting removed (from robots.txt).
robots.txt disallows the URL before it is even requested (do not request this URL)
robots META tag is in the header, after the URL has been requested. It tells robots NOINDEX (do not add this page to the index - leave it URL-only in google)
NOFOLLOW (do not follow the links on this page)
Google is funny that way - 'knowing about' a URL does not make it 'indexed'.
I'm not sure whether they 'know about' 8 million URLs or actually have 8 million URLs 'indexed'.
If you search for "steering and supporting search engine crawling" you'll find more info on robots.txt, robots meta tags, rel=nofollow ... in my tutorial.
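If you want to sanity-check that a Disallow line like the one above blocks only the intended URL, Python's standard-library robot-rules parser can simulate the match. This is just a rough local check under simple prefix-matching rules - real crawlers have their own matching behavior:

```python
import urllib.robotparser

# Simulate the robots.txt suggested earlier in the thread
rules = [
    "User-agent: *",
    "Disallow: /?id=123&author=au2",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# The disallowed duplicate URL should be blocked...
print(rp.can_fetch("*", "http://abc.com/?id=123&author=au2"))  # expected: False
# ...while the preferred URL stays crawlable.
print(rp.can_fetch("*", "http://abc.com/?id=123&author=au1"))  # expected: True
```

Because the rule is a prefix match on the path-plus-query, only the author=au2 variant is refused; every other URL on the site is still fetchable.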
- I will not use noindex on any dynamic page.
- Because I have many different pages that are all of this type, I'll create a folder called xyz that will contain all the URLs that I don't want search engines to index; so in robots.txt, I will add this line:
disallow: /xyz/
Am I safe now?
Thank you,
Best regards,
John
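John's folder approach as a complete robots.txt file might look like this (it needs a User-agent line to be valid; /xyz/ is just his example folder name):

```
User-agent: *
Disallow: /xyz/
```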
>>Reid, do you know if this definitely works?
<a href="oneofthelinks" rel="nofollow">adopted by G,Y,M</a><<
yes - the bot will disregard a link with the nofollow attribute, but it is designed for external links - not sure how it would affect ranking if you start using it on internal links though.
Thanks Reid. Reason I was asking is to maybe do this with some of the sites to which I link that G may not like.
I'm confused by this:
>>we are talking about 2 links to the same page from one page right?<<
Yes. And therefore 2 different links to TWO different URLs of the same page! The main problem is whether Google will still index such a page after 'noindex' is placed properly - that's the question.
What's wrong with one page having more than one link to the same page? This is with INTERNAL links. Many, if not most, sites have a few links on one page that point to the same other page on the site. For example, a product page may have a hyperlinked "back to whatever.com home page" at the top and the same link again at the bottom.
If you put a noindex META tag on the page, then neither URL would get indexed. That's why I was saying to disallow one of the URLs in robots.txt, so the page would be indexed under the other URL.
How can putting a robots meta tag on one page affect some other page? Please explain.
Thanks.
>>abc.com?id=123&author=au1
abc.com?id=123&author=au2
It will display the same page but different URLs. That's the problem!<<
So let's call the page article123.
Either URL above will serve up this page, because article 123 was written by both author 1 and author 2.
There is a concern that Google will list the same article (123) under both URLs - that's 2 pages with the exact same content.
Now, if that content has a robots META tag (noindex) in the header, then neither URL would get indexed, because they both point to the same content containing the robots tag.
But if there were no robots tag, then both URLs would be indexed.
We only want one of the URLs indexed (because there's only one page), so if we disallow one of the 2 URLs in robots.txt and have no robots tag on the page, then it will be indexed: one page, one URL.
The URL that's not disallowed will be indexed but the URL that is disallowed will not be indexed.
So: one page, 2 URLs -
one indexed and one disallowed.
This would be the better solution, because if any bot does find the disallowed URL, it won't index it.
Disallowing in robots.txt doesn't keep bots from finding the URL through an inbound link, but should it get indexed, robots.txt should cause it to be removed again (get rid of the inbound link too, of course).
John's solution also looks good, folder /xyz/