I presume that the above email was sent to one who plays with Google, i.e., webmaster stuff and other things where G has an email address... Can't see Google sending that same comment out to all the OTHER websites for which they do not have contact information or any dog in the fight.
This reminds me of pissing in your water supply or pooping around the campfire. Makes for more BAD PRESS than solving problems. The way to deal with duplicate content is to ignore it... oh, wait, can't do that! There would be hordes of webmasters crying foul!
Best I can say is that if one plays in their sandbox, their rules ("sandbox" means playpen/play area, not isolation) will apply. Live with it and don't complain.
Over the last three years I've migrated 75% of "ad income" away from Google and saw a 22% increase in income. The Google monster still lives, as a monster, but there are increasingly more predators/competitors---even mosquitoes---nipping at the heels of the monster, where webmasters can enjoy income without krappola dictates from the creepy line.
|Live with it and don't complain. |
That's what I am trying to do.
|Over the last three years I've migrated 75% of "ad income" away from Google and saw a 22% increase in income. |
Unless you live in the USA, the Yahoo and Microsoft ad networks won't work. I don't think managing your own ad network is a good idea for selling ads, especially if you are running a small business. What other choice have you got?
I've seen this message before when there was a problem with a rewrite. The bot was getting caught in a loop which in turn was generating a high volume of pages returning a 200 OK. I'd double check and make sure you don't have malformed syntax somewhere which is causing the bot to get hung up in some sort of black hole.
Did you make any changes since the time that message appeared? What does crawl activity look like? Do you see an initial spike and then lots of spikiness afterward? That could be a sign of a problem somewhere in the machine.
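One rough way to look for the kind of rewrite loop described above is to scan the URLs the bot requested (from your access log) for paths that keep repeating the same segment. This is only a sketch: the function names, the repeat threshold, and the example.com URLs are made up for illustration, and a repeated segment is a hint, not proof, of a loop.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_rewrite_loop(url, max_repeats=2):
    """Return True if any single path segment occurs more than max_repeats times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    # Count of the most frequently repeated segment in the path.
    most_common_count = Counter(segments).most_common(1)[0][1]
    return most_common_count > max_repeats


def suspicious_urls(urls, max_repeats=2):
    """Filter a list of crawled URLs down to the ones worth a manual look."""
    return [u for u in urls if looks_like_rewrite_loop(u, max_repeats)]


if __name__ == "__main__":
    crawled = [
        "http://example.com/widgets/page/2",
        "http://example.com/widgets/page/page/page/2",
    ]
    for url in suspicious_urls(crawled):
        print(url)
```

If this turns up nothing, the loop (if there is one) may live in the query string rather than the path, so the same idea would need to be applied to the parameters.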
|I'd double check and make sure you don't have malformed syntax somewhere which is causing the bot to get hung up in some sort of black hole |
What are you referring to here? A problem in URL rewriting? The URLs that Googlebot attached to this message are working fine; almost all are paginated URLs and they have the right canonical link.
|Did you make any changes since the time that message appeared? |
No I didn't
|Do you see an initial spike and then lots of spikiness afterward? |
I see normal crawl activity from August to November. A little spike in mid-October (time spent, kilobytes downloaded, and number of pages crawled).
|What are you referring to here? A problem in URL rewriting? The URLs that Googlebot attached to this message are working fine; almost all are paginated URLs and they have the right canonical link. |
I do believe the URIs are just a sampling of what it found and not the entire set. If you browse to a paginated URI that is not valid, are the proper server headers being returned?
|A little spike in mid-October (time spent, kilobytes downloaded, and number of pages crawled). |
And this message from Google came after that "little spike" in the middle of October?
I'd be looking at technical glitches at this point in time to make sure all is okay. If you didn't make any changes, that makes it a bit more challenging to determine what might be happening. Could be a glitch in GWT, but for as long as I've used it, the information reported has been accurate. When you get a notification from Google like this, it is cause for concern. First thing I'd be checking is the server headers, to make sure the bots are getting proper directives based on their requests: 200s where appropriate, 301s, 404s, 410s, etc.
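The header check can be scripted. A minimal Python sketch, assuming you keep a small table of URLs and the status each one should return (for instance, a paginated URI far past the last real page should give a 404 or 410, not a 200). The names `fetch_status` and `audit`, and the example.com URLs, are hypothetical. It sends HEAD requests and does not follow redirects, so a 301 is reported as a 301; some servers answer HEAD differently from GET, so treat the result as a first pass.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses so their status is reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def fetch_status(url, timeout=10):
    """Return the HTTP status code the server sends for a HEAD request."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def audit(expected, fetch=fetch_status):
    """expected maps URL -> status it should return.

    Returns {url: (expected_status, actual_status)} for every mismatch.
    """
    mismatches = {}
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            mismatches[url] = (want, got)
    return mismatches


if __name__ == "__main__":
    checks = {
        "http://example.com/widgets/?page=2": 200,
        "http://example.com/widgets/?page=999999": 404,
    }
    for url, (want, got) in audit(checks).items():
        print(url, "expected", want, "got", got)
```

The `fetch` argument is there so the comparison logic can be exercised without touching the network.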
From Google's John Mu...
|We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-). |
Googlebot encountered an extremely high number of URLs on your site
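If you do rely on robots.txt for this, it is easy to confirm which of the URLs listed in the message are actually covered by your rules. A small Python sketch using the standard library's `urllib.robotparser`; the rules and example.com URLs are invented for illustration. As far as I know this parser only does simple prefix matching and does not implement the wildcard extensions Google supports, so it is a sanity check rather than a replica of Googlebot's behaviour.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /search/\n"
    reported = [
        "http://example.com/search/?q=a&page=2",
        "http://example.com/widgets/",
    ]
    print(blocked_urls(rules, reported))
```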
|And this message from Google came after that "little spike" in the middle of October? |
No it came yesterday.
There is probably something wrong at my end, but so far I have looked at all the (paginated) URLs given in the message and couldn't find any problem.
How would you deal with paginated urls on your site?
I frequently get the message from Google about one of my sites. It has the hosting history for every domain name in com/net/org/biz/info/mobi/asia back to 2000 and the stats for nameservers over the same period. So with approximately 300 million or so pages, John Mu's advice seems to be the best. Naturally I haven't put all of these pages in the site maps but there is still a relatively large number. From search engine development work, broken rewriters can often cause recursive page structures (where the same page content is served with a load of different URLs) and this is one of the things that Google may be trying to avoid. The first thing to check is that any rewriter is working properly and also check the numbers of pages in your sitemap files. Prioritise the important ones and freeze the ones that never change at a lower priority/importance.
[edited by: jmccormac at 4:52 pm (utc) on Nov 7, 2010]
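To illustrate the recursive-structure problem mentioned above (the same page content served under a load of different URLs), here is a rough Python sketch that groups URLs by a hash of the response body. It assumes you have already fetched the pages into a dict; the function name and sample data are hypothetical, and exact-match hashing will miss pages that differ only by a timestamp or similar noise.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(pages):
    """pages maps URL -> response body (str).

    Returns a list of URL groups (each sorted) whose bodies are byte-identical.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


if __name__ == "__main__":
    fetched = {
        "/hosting/example.com": "<html>history</html>",
        "/hosting/hosting/example.com": "<html>history</html>",
        "/nameservers/ns1": "<html>stats</html>",
    }
    for group in duplicate_groups(fetched):
        print(group)
```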
|How would you deal with paginated urls on your site? |
These days I do most everything via the meta robots element. In the case of pagination, we noindex the paginated pages and allow the bot to follow links. Unless of course that is the end of the click path at which time those pages would be available for public indexing.
I prefer to keep all non-essential pages out of the index. Those pages that are paginated are usually just gateways to the final click path. It's those listings on the paginated pages that take you to the final records, that's what you want indexed and showing up for search queries, not the paginated pages or the click paths in between start and finish. ;)
Keep in mind that this method does not apply to all taxonomies. There are some sites where the paginated content is of importance from an indexing perspective and those would be allowed to get indexed. I don't see many like that, but they exist.
I like to conserve site equity and only allow the most important pages to get indexed. Everything else is noindex, e.g.
<meta name="robots" content="noindex">
This allows the bots to crawl those pages and follow links to the final destination documents while keeping those in-between docs out of the index. They are intermediary in the click path and they suck equity from the site.
On a side note, I noarchive everything these days - everything.
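For anyone templating this, the rule above boils down to something like the following Python sketch. The function name and the `is_final_record` flag are hypothetical; how you decide which pages are "final" depends on your own taxonomy. Link following is the default behaviour, so only noindex needs to be stated for the intermediary pages.

```python
def robots_meta(is_final_record, noarchive=True):
    """Build the meta robots element for a page.

    Intermediary pages (pagination, click-path steps) get noindex;
    final records stay indexable. noarchive is applied to everything
    by default, per the side note above.
    """
    directives = []
    if not is_final_record:
        directives.append("noindex")
    if noarchive:
        directives.append("noarchive")
    if not directives:
        return ""  # nothing to say: defaults (index, follow) apply
    return '<meta name="robots" content="{}">'.format(", ".join(directives))


if __name__ == "__main__":
    print(robots_meta(is_final_record=False))
    print(robots_meta(is_final_record=True))
```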
|On a side note, I noarchive everything these days - everything |
Is there any specific reason why you do that?
What are the potential risks of Google Cache?
I don't think a canonical link tag, on its own, prevents future crawling of that URL - especially so in the case of paginated URLs, where the content is significantly different from the canonical URL you are declaring.
@tedster as suggested by pageoneresults should I just use <meta name="robots" content="noindex"> tag on paginated urls?
I'm not sure if noindex is enough to turn off the warning message for you or not. It's a pretty recent development and I have not experienced it.
With a noindex meta tag, those URLs will still be discovered and crawled - they must be in order for the meta tag to be read. The content of those URLs will be kept out of the search results, however. Still, depending on how deep your pagination goes, all that crawling might be a drain on your total crawl budget - at least for a while until Google gets it sorted.
|On a side note, I noarchive everything these days - everything. |
Yeah, me too, so I trust you have banned Scoutjet, the bot of the currently hyped new SE blekko.
|...should I just use <meta name="robots" content="noindex"> tag on paginated urls? |
Can I interrupt for a second here to make sure we are talking about the same thing?
When you say "paginated," you mean something like:
Or are you saying:
Thanks in advance for the clarification.
this is correct
JohnMu has advised against using canonicals for pagination. The reason being it acts as a kind of "bot redirect", e.g. the Googlebot hits the page, sees the canonical, and then doesn't bother with the rest of the page. This causes problems with crawling and PR flow, as Google doesn't reach items / pages down the list of pagination.
The approach I've been using, and which seems to work well, is meta noindexing everything after page one. The pages are canonical to themselves (in case someone externally links to them), but not to page 1.
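A minimal Python sketch of that approach, assuming a `?page=N` URL scheme (that parameter name, the function name, and the example.com URL are illustrative, not anything from the site in question): every page points its canonical at itself, and only pages after the first carry noindex.

```python
from urllib.parse import urlencode


def pagination_head_tags(base_url, page, page_param="page"):
    """Return the <head> tags for page number `page` of a paginated listing."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        canonical = base_url
    else:
        # Self-referencing canonical: page N points at page N, never at page 1.
        canonical = "{}?{}".format(base_url, urlencode({page_param: page}))
    tags = ['<link rel="canonical" href="{}">'.format(canonical)]
    if page > 1:
        tags.append('<meta name="robots" content="noindex">')
    return tags


if __name__ == "__main__":
    for n in (1, 3):
        print("\n".join(pagination_head_tags("http://example.com/widgets/", n)))
```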