Forum Moderators: Robert Charlton & goodroi
I wanted to put a situation forward and get a feel for what people think should be done to promote best possible rankings in a particular situation.
We have a site advertising hundreds of different item listings.
Our URLs are structured as follows using URL rewriting:
abc.com/categoryx/blue (this will default to page 1)
abc.com/categoryx/blue/2 (page 2)
abc.com/categoryx/blue/3 (page 3)
We have also introduced sorting on the above:
abc.com/categoryx/blue/1/pricing-high-low
abc.com/categoryx/blue/2/pricing-high-low (page 2)
Each of the above produces a listing of widgets with the first 100 or so words of description/content on each widget.
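For context, a URL scheme like the above is typically produced with rewrite rules along these lines (a hypothetical Apache sketch - the listing.php script name and parameter names are my assumptions, not from the site):

```apache
# Hypothetical mod_rewrite sketch for the URL scheme above
RewriteEngine On

# abc.com/categoryx/blue -> page 1, default sort
RewriteRule ^([a-z0-9-]+)/([a-z0-9-]+)/?$ /listing.php?cat=$1&type=$2&pg=1 [L,QSA]

# abc.com/categoryx/blue/2 -> page 2
RewriteRule ^([a-z0-9-]+)/([a-z0-9-]+)/([0-9]+)/?$ /listing.php?cat=$1&type=$2&pg=$3 [L,QSA]

# abc.com/categoryx/blue/2/pricing-high-low -> page 2, sorted
RewriteRule ^([a-z0-9-]+)/([a-z0-9-]+)/([0-9]+)/([a-z-]+)/?$ /listing.php?cat=$1&type=$2&pg=$3&sort=$4 [L,QSA]
```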
At any rate, the URLs above will have very similar content - not quite duplicated but not distinct enough to constitute unique content. The same widget would be displayed on many of the different URLS above. There are lots of competitor websites advertising the same products with similar text as well.
We rank top 5 for all major terms that we compete for, i.e. "abc", in a highly competitive field. But for secondary terms such as "categoryx blue" we do not rank very well at all. The pages we would like to rank well, along with their target terms, are:
abc.com/categoryx/blue (for searches on "categoryx blue")
page 2 and onwards will never be required to be shown as a search result.
My question is: is it a mistake to have sorted pages of similar (but not identical) things on different URLs?
Should it rather be structured as:
(i) abc.com/categoryx?pg=2&type=blue&sort=high-low
or:
(ii) abc.com/categoryx-blue-2-high-low
With explicit GET parameters I thought Google might give more "power" to that page, and also see us as having fewer "thin" or non-unique content pages. This goes against having "clean" URLs, but would perhaps give you fewer, more concentrated pages of content in Google's eyes?
The second alternative introduces fewer slashes, which might also be a factor? I have seen huge sites like TripAdvisor minimising their use of slashes.
Any thoughts or suggestions would be greatly appreciated. By the way, I did link internally to the paged lists with nofollows but found this to give bad results.
Both (all three) approaches still create different URLs - even the query string counts as part of the URL. And using fewer slashes is not a ranking factor.
I would just make sure that sorted pages were not indexed at all - robots.txt works, or meta robots noindex on the sorted content pages. Then write the URLs for sorts whatever way is easiest for you to maintain.
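As a sketch of the meta robots option (the exact head markup is an assumption - adapt it to your templates), the sorted pages only would carry:

```html
<!-- Emitted in the <head> of sorted-view pages such as
     /categoryx/blue/2/pricing-high-low; the unsorted pages
     omit this tag so they remain indexable -->
<meta name="robots" content="noindex, follow">
```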
Thanks for the swift reply! With 24k+ posts your opinion most certainly carries some serious weight.
In your opinion, do you think using meta robots noindex will improve the ranking of the desired (indexed) pages? Is it possible that Google already disregards them and there would be no benefit of using them?
On a side note, I often wonder whether something like this could fall under the category of "over tweaking"?
Even on high PR sites (8+) I have seen the introduction of sort pages into Google's index cause all kinds of trouble. So my point of view is that I know my site best and I will choose what Google does and does not have the option to include.
[edited by: tedster at 8:13 pm (utc) on June 9, 2009]
I have seen numerous reports of Google ignoring these directives. Perhaps they still cache the pages but do not count them when calculating rankings? What are your thoughts on this?
Also, would you include a "nofollow" directive as well as the "noindex"?
Is there any advantage between robots.txt and the meta robots noindex directive?
Is there any advantage between robots.txt and the meta robots noindex directive?
You don't want to use robots.txt and the meta robots noindex directive simultaneously - if robots.txt blocks the URL, Googlebot never fetches the page and so never sees the noindex tag. See this current discussion for some distinctions between them and why it may appear that Google is ignoring "noindex"...
Robots.txt disallowed file shows up in SERPs & Google traffic drops
[webmasterworld.com...]
Perhaps they may still cache pages but will not count these pages when they calculate rankings. What are your thoughts on this?
We do need to be precise with word choices here: having no cached page visible doesn't mean the page isn't indexed. Having a "noindex" robots tag means that the page is spidered (it must be in order to have the meta tag read) but its content is not included in the searchable index.
However, in a situation such as a noindex,follow meta tag - it is clear that Google must use the links on the page in their calculations, isn't it?
having no cached page visible doesn't mean the page isn't indexed
I thought if a page was indexed that it was stored and indexed in the Google cache? Thus by noindex I assumed a page would be read by Google but not stored/cached. Am I misunderstanding something? Are you saying one needs to discriminate between Google's cache and their searchable index? I'm a little confused...
Robots.txt disallowed file shows up in SERPs & Google traffic drops
[webmasterworld.com...]
Thanks Robert, very interesting article - I had not made the connection regarding the subtle distinction between robots.txt and meta directives. Can you think of a situation where this would be of practical concern though? If Google were to still spider a page (due to no block in robots.txt) that carries a noindex directive, yet not index it, is it not as if the page had never been seen at all? Thanks again for your time on this :)
This does, however, mean that the user will not be able to bookmark a sorted page (i.e. accessing the bookmark again will give unsorted results), so you have to weigh up what you want to achieve.
NOINDEX means it will not show in SERPs. Google still has a FULL COPY, and will treat it exactly like any other page for upstream and downstream calculations.
You can allow the page to show in SERPs, but stop users accessing Google's cached version using NOARCHIVE
You can stop Google visiting a page using robots.txt. They will not see any robots metatags in this case. The page MAY STILL SHOW IN SERPS as a URL, depending on inbound links. You will not get any ranking credit for anything on that page.
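To make those distinctions concrete, here is a hedged sketch of what each directive looks like (the robots.txt pattern is an assumption about how the sort URLs are named):

```html
<!-- noindex,follow: dropped from the searchable index, but Google keeps
     a full copy and still crawls/credits the links on the page -->
<meta name="robots" content="noindex, follow">

<!-- noarchive: the page can still rank, but users get no "Cached" link -->
<meta name="robots" content="noarchive">
```

```
# robots.txt: Googlebot never fetches the URL at all, so any meta tags
# on the page are never seen; the bare URL may still appear in SERPs
User-agent: *
Disallow: /*/pricing-high-low
```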
For sorted results, you are (IMHO) best using NOINDEX, and may find a use for the CANONICAL TAG (referencing the default sort URL)
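A sketch of that canonical tag, assuming the URL pattern from earlier in the thread - each sorted URL points back to its default-sort equivalent:

```html
<!-- On abc.com/categoryx/blue/2/pricing-high-low, reference the
     default (unsorted) URL for the same page of results -->
<link rel="canonical" href="http://abc.com/categoryx/blue/2">
```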
So far we have 3 separate methods:
(i) sort via POST
(ii) robots.txt (prevent spidering altogether)
(iii) meta robots noindex (spider, but not include in index)
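A sketch of option (i), sorting via POST (markup and field names are hypothetical): because the sort choice travels in the request body rather than the URL, no new URL is ever created for spiders to find:

```html
<!-- The page reloads at the same URL; only the POST body changes -->
<form method="post" action="/categoryx/blue/2">
  <select name="sort" onchange="this.form.submit()">
    <option value="default">Default</option>
    <option value="pricing-high-low">Price: high to low</option>
    <option value="pricing-low-high">Price: low to high</option>
  </select>
</form>
```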
If we assume that we want just page 1 of the unsorted list to compete in Google, which of the above will give the best results?
My reasons were mainly that I don't think the sort pages should be included in Google's index, cache or any ranking calculations whatsoever. After all, their only utility is to help the user browse conveniently - they add no more value to the original page from a search engine's point of view.
I'm hoping this will effectively reduce the amount of duplicate/similar content in Google's store and allow the authoritative (or canonical) pages to compete without any penalty being applied (either on a sitewide-level or page-level).
This way you are not losing visitors to the site (in case someone bookmarked previously sorted results), and on seeing the 301 Google will slowly drop the pages with sort parameters in the URL from its index. This also means that if there is any external link to a previous URL with sort parameters, you will still get that link juice.
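A hedged .htaccess sketch of that 301 approach, assuming the /category/type/page/sort URL pattern discussed earlier in the thread:

```apache
# 301 any old sorted URL to its unsorted equivalent, so bookmarks
# keep working and link juice is consolidated on the canonical page
RewriteEngine On
RewriteRule ^([a-z0-9-]+)/([a-z0-9-]+)/([0-9]+)/[a-z-]+/?$ /$1/$2/$3 [R=301,L]
```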