This same plant is green, and my company wants to rank highly for "Green plants" so it has a different category called "Green plants" which lists all of the plants that we sell that are Green. The URL for this same plant changes slightly to include the filter www.example.com/plantid-1293022?ColourIdentifier=Green
This same plant could also be described as an "Evergreen", and my company wants to rank highly for "Evergreen Plants" too, so the URL becomes www.example.com/plantid-1293022?TypeIdentifier=Evergreen. The same goes for every other category.
Each of these URLs serves exactly the same content. We already have <link rel="canonical" href="www.example.com/plantid-1293022" /> in the <head> so that Google will show the pretty URL in the rankings. My problem is this...
I want Google to rank our website highly for "Plants", "Green Plants" and "Evergreen" - but also for the individual plants themselves - so it is important that these category pages are crawled. However, once on these category pages, Googlebot has to visit individual plants, e.g. www.example.com/plantid-1293022, 6 or 7 times, because each plant appears under a different URL in each category. This means that Googlebot is crawling the same page 6 times when only once would do.
We are getting an error in Webmaster Tools saying "Googlebot encountered an extremely high number of URLs on your site", and this is because it is crawling so many pages with the same content but different URLs.
If this can be stopped then Googlebot will crawl more pages, so we should then have a larger number indexed in Google.
I'm looking for a solution that would solve the problem of recrawling the same pages, so that more unique pages can be crawled - any ideas would be welcome.
Firstly I am thinking of making each URL the same and passing info via a cookie rather than the URL. I'm SEO, not Dev, so I could only manage to get this implemented if this had solid SEO reasons - would this have benefits for SEO? I'm thinking against this because surely, even if each URL was identical, it would still get crawled 6 or 7 times because it's listed on 6 or 7 different category pages? Or does Googlebot recognise that it has already crawled that URL and so doesn't do it again?
Is there a way that I can keep the category page, but not have to use nofollows to stop Googlebot crawling the plant pages which would waste all of the link juice?
Is there a method/way of using robots.txt to follow some links but not others?
Any help with this would be greatly appreciated.
Thanks all
ChainsawDR
So first, try to fix the problem at the source: to the maximum extent possible, modify the script(s) so that they produce pages with links only to the canonical page URLs. This should extend all the way from http/https, through the www-versus-non-www domain name, the optional FQDN indicator and port number (a trailing period and/or port number appended to the hostname), the URL-path, the fragment identifier (called a 'named anchor' on an HTML page), and the query string parameters and their order.
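As an illustration of what "one canonical form" could mean in practice, here is a minimal Python sketch that normalizes each of the URL parts listed above. The https scheme, the www hostname and the set of filter-only parameters are assumptions made for the example (the parameter names are taken from the URLs in the question), and the function expects an absolute URL that includes a scheme.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumptions for this sketch: the preferred scheme and hostname, and the
# query parameters that only record a filter and never change the content.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"
NON_CANONICAL_PARAMS = {"colouridentifier", "typeidentifier"}

def canonical_url(url: str) -> str:
    """Return the single canonical form of an absolute URL on this site."""
    parts = urlsplit(url)
    # Keep only parameters that select content, in a stable sorted order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CANONICAL_PARAMS
    )
    path = parts.path or "/"
    # Forcing scheme and host discards any trailing period or port number;
    # the empty last element discards the fragment identifier.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, urlencode(query), ""))
```

The page-generating scripts could pass every internal link through a function like this before writing it out, so that only one spelling of each URL is ever published.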
You can't really pass URLs through cookies, since the server both sets and reads the cookie and the client only stores it, unless you rely on client-side scripting -- which search engines don't support.
Then, stop the 'bots from crawling the sources of any remaining non-canonical URLs -- that is, set up robots.txt so that they cannot crawl the 'search' facility. You may have to re-arrange your URLs and/or your URL-to-filename mapping to do this. Apache mod_rewrite and ISAPI Rewrite on IIS are the usual tools for the latter of these two jobs.
Finally, for the 'bots that support it, block the URLs+querystrings using wild-card matching in robots.txt.
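To make the wild-card idea concrete, here is a small Python sketch. The robots.txt rules inside it are hypothetical and simply reuse the parameter names from the question. The matcher is a simplified model of the commonly documented wild-card semantics ("*" matches any run of characters, a trailing "$" anchors the end of the URL); it ignores user-agent groups and Allow rules, so treat it as a way to sanity-check patterns, not as a full robots.txt parser.

```python
import re

# Hypothetical rules; the parameter names come from the URLs in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*?*ColourIdentifier=
Disallow: /*?*TypeIdentifier=
"""

def disallow_patterns(robots_txt: str) -> list[str]:
    """Collect the values of all Disallow lines (user-agent groups ignored)."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("disallow:")
    ]

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wild-card pattern into a regex anchored at the URL start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_blocked(path_and_query: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """True if any non-empty Disallow pattern matches the path plus query."""
    return any(
        pattern_to_regex(pattern).match(path_and_query)
        for pattern in disallow_patterns(robots_txt)
        if pattern
    )
```

With these rules the filtered URLs are blocked while the clean plant URL stays crawlable, for example `is_blocked("/plantid-1293022?ColourIdentifier=Green")` versus `is_blocked("/plantid-1293022")`.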
If you've got pages that link to 'dirty' URLs, and you cannot block crawling of those (category) pages in robots.txt without destroying your intra-site linking, then look at putting the "meta robots nofollow" tag on those pages. And if that won't work for your needs, then you may have to go with "User-agent-dependent content generation," i.e. "soft cloaking" -- by not including those 'dirty links' sections of the page if the visitor is a search engine robot.
While doing all of this, address the problems from within your own server. Don't rely on Google's proprietary 'band-aid' fixes -- or those of any other search engine. Fix the root cause of the problem instead of trying to fix all the damage caused by that root problem.
The single most important thing to keep in mind is that search engines crawl and work with only URLs -- not domains, not sites, not "pages" -- just URLs. So if you publish a 'bad' URL, then it exists -- even if it does not resolve to an existing resource on any server. And if there is the slightest difference in the characters of two URLs, then they are different URLs, whether or not they resolve to the same content; If Google finds a link to each of them, it will crawl both. So when thinking about the problem, keep your thoughts URL-centered, because those are the only 'objects' that have meaning to search engines.
That is, instead of saying "because of it crawling so many pages with same content but different URL's," things will be much clearer if you say, "We're publishing (and giving Google) many URLs that all resolve to the same (page) content, and Google is crawling all of them."
This is a duplicate-content problem writ large. (Duplicate-content is a popular and much-discussed topic in our Google Search forum, and my all-time-favorite thread title is "Duplicate content -- Get it right or perish.") The fact that you're getting a warning in GWMT means you've got a serious problem and that you *do* need to fix it, because it is likely limiting the depth to which Google is willing to crawl your site, and very likely *is* hurting your URLs' rankings by diluting them.
Jim
> The URL for this same plant changes slightly to include
> the filter www.example.com/plantid-1293022?ColourIdentifier=Green
One page (one particular display of content) should never be accessible by more than one URL. You should never have identical pages on multiple URLs. Don't change the URL if the page doesn't change significantly. Meaning just that, significantly (one keyword extra doesn't make a significant difference).
You can easily have several different category pages pointing to different subsets of sub pages. This would be the proper way to do it.
---
> Firstly I am thinking of making each URL the same
Do that
> would this have benefits for SEO?
What you're doing now is ranking suicide, so yes.
> Or does Googlebot recognise that it has already
> crawled that URL and so doesn't do it again?
Gbot is smart. So, yes. Eventually it would.
> Is there a way that I can keep the category page,
> but not have to use nofollows to stop Googlebot
> crawling the plant pages which would waste all of the link juice?
Yes, build your site properly. See answer one :)
> Is there a method/way of using robots.txt to
> follow some links but not others?
No, you can't use "robots.txt" for links, only pages. But... don't go there. That would be the wrong tool for the job. Fix those pages and URLs - everything else is a waste of your time.
---
Added: Hi Jim :)
> You can't really pass URLs through cookies, since the server
> both sets and reads the cookie and the client only stores it,
> unless you rely on client-side scripting -- which search engines don't support.
Yes I agree, I had meant that the information that we are currently passing in URLs (such as filters that the user has applied) could be passed in a cookie instead - so the URLs remain clean.
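A minimal sketch of that idea in Python, using only the standard library. The cookie name colour_filter is hypothetical, and the sketch assumes that a visitor who sends no cookie (which is how crawlers typically behave) simply gets the unfiltered page at the clean URL.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name used to remember the visitor's chosen filter.
FILTER_COOKIE = "colour_filter"

def remember_filter(colour: str) -> str:
    """Build the Set-Cookie header line that stores the chosen filter."""
    cookie = SimpleCookie()
    cookie[FILTER_COOKIE] = colour
    cookie[FILTER_COOKIE]["path"] = "/"
    return cookie.output(header="Set-Cookie:")

def read_filter(cookie_header: str):
    """Return the stored filter from a request's Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(FILTER_COOKIE)
    return morsel.value if morsel else None
```

The plant page itself would always be linked and served as www.example.com/plantid-1293022; the filter only affects things like breadcrumbs or "back to Green plants" navigation for visitors who carry the cookie.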
I have to try and justify this change, so would I be correct in saying that by making all of our URLs canonical, Googlebot will eventually realise it doesn't need to keep recrawling the same links (like it must do with links in Global Footers) and so more of our pages will get crawled and indexed?
Plus, even though we have the canonical element within the dirty URLs, which specifies the pretty URL, this will not work as well as having all 6 links use the same pretty URL.
When trying to request developer time for this change I am certain that I will come up against two objections, and I am unsure of how to get past them:
1. "Say we made these URLs canonical - Googlebot would probably have to crawl 10,000 pages before it came to the same URL again - are you telling me that before it crawls each URL it checks that it hasn't done so already? Wouldn't this mean Googlebot is checking through lists of millions of URLs that it has already crawled before crawling each of the next URLs in front of it? I don't believe this, it won't crawl any more pages than it currently is."
2. (Before I started work here) "We've just had developers add the <link rel="canonical"> element to each of the URLs affected - this means we don't have 6 different URLs with a small amount of PageRank, the <link rel="canonical"> is passing the PageRank to the pretty/correct URL - so why should we spend the developer time when we've already fixed the problem? It might not be perfect but I don't think the cost of developers is worth making it perfect"
Do you have any thoughts on these objections?
Thanks again guys, you're being a great help.
ChainsawDR
1) Yes, Google does indeed work from a queue of URLs to crawl. And to avoid getting stuck in "infinite URL-spaces," they are unwilling to give too many entries in this queue to any one domain. So if you produce huge numbers of different URLs leading to the same content, these are displacing some of your other 'unique' URLs which would have been queued for crawling, but were left out because your domain had already consumed its quota.
In short, if each 'page' has seven URLs, and you take steps to reduce that to one canonical URL per page, you may expect (at least) that you will get seven times more 'real' pages crawled as a result. I say 'at least' because the warning in Google Webmaster Tools implies that they have placed an arbitrary limit on the "crawl-depth" for your domain, and might be willing to explore deeper and crawl a few more URLs if they were more confident that they would not later have to de-duplicate and discard about 86% of them (six out of seven).
Google has been and is quite proud of their search technology developments, and has published and continues to publish papers and file patents on their search methods. Many of these are available and easy to find with searches of 'public' pages and with Google's Patent search. It's likely you can find a detailed description of how Google queues URLs for crawling -- perhaps even including some diagrams to make it clear to your co-workers...
2) <link rel="canonical" ... > is OK for search engines that support it. Not all do, because it is an unofficial 'extension' to the Standard for Robot Exclusion. The fact that you need to use it, however, flags your site as being poorly-planned and/or poorly-implemented. Suppose that this were to be considered as a ranking factor... It is one of the 'band-aids' that I referred to in my previous posts.
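The arithmetic can be shown with a toy model in Python. This is only an illustration and not a description of how Google's crawler is actually built: the per-domain budget of 700 fetches, the 1,000 plants and the extra filter names (FilterA to FilterD) are all made up for the example. It also speaks to the first objection: checking whether a URL has been seen before is a hash-set lookup, not a scan through a list of millions of URLs.

```python
from collections import deque

def crawl(discovered_urls, budget, resolve):
    """Toy crawler: fetch each distinct URL once, up to `budget` fetches.

    `resolve` stands in for fetching and returns the page a URL serves.
    Returns (number of fetches made, number of distinct pages obtained).
    """
    seen = set()
    queue = deque()
    for url in discovered_urls:
        if url not in seen:  # hash lookup, roughly constant time
            seen.add(url)
            queue.append(url)
    pages = set()
    fetches = 0
    while queue and fetches < budget:
        pages.add(resolve(queue.popleft()))
        fetches += 1
    return fetches, len(pages)

def page_of(url):
    """In this model the query string never changes the content served."""
    return url.split("?")[0]

# Seven ways of linking to each plant: the clean URL plus six filtered ones.
VARIANTS = ["", "?ColourIdentifier=Green", "?TypeIdentifier=Evergreen",
            "?FilterA=1", "?FilterB=1", "?FilterC=1", "?FilterD=1"]

# Current site: every category link uses a different URL for the same plant.
dirty = [f"www.example.com/plantid-{p}{v}" for p in range(1000) for v in VARIANTS]
# Fixed site: all seven links to a plant use the same canonical URL.
clean = [f"www.example.com/plantid-{p}" for p in range(1000) for _ in VARIANTS]

dirty_result = crawl(dirty, 700, page_of)
clean_result = crawl(clean, 700, page_of)
```

Under these assumptions the same 700 fetches reach 100 distinct plants with the dirty links and 700 with the clean ones, which is the seven-fold difference described above.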
<soapbox>(I vehemently oppose the canonical tag and the 'declare canonical domain' setting in Google Webmaster Tools because they encourage people to take the easy way out -- to tick a checkbox instead of actually fixing the underlying problem. And because these fixes only apply to one (declare canonical domain) or a few (link rel="canonical") search engines, they leave the site crippled in the view of all of the other search engines. This to me is like a doctor who gives you aspirin for your headache when he knows you've got a history of strokes). Some say they provide these proprietary solutions as much to 'raise the bar' for future search engine competitors as to help Webmasters, but being a pragmatist, I don't really care; Either way, it's a bad approach to fixing serious problems.</soapbox>
If you're fighting against a company that is not willing to invest the time to "do it right the first time" in order to avoid creating a lot of 'repair' work in the future, and in order to maximize profitability (or attain their goal, whatever it is), then maybe it's time to leave and find some place that cares about success. This site sounds like an absolute mess to me, and if you have to 'fight the establishment' to fix it, that's really sad; Some companies simply refuse to succeed. This problem sounds like it *needs* a chainsaw taken to it! :)
Jim
First of all, thank you guys very much for your advice. The company is going forward with changing the system, rather than continuing with the band-aid (they're a really good company, it can just be hard to get developer time sometimes - but this is being addressed too).
Just wanted to say thanks for the help, your advice certainly helped me get my approach right.
Best regards
ChainsawDR