
Google SEO News and Discussion Forum

    
Googlebot found an extremely high number of URLs on your site
jojy

Msg#: 4227606 posted 9:16 pm on Nov 6, 2010 (gmt 0)

I received an email from Google saying one of my sites may have similar or identical pages.

Here is the message

Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site.


The URLs that Googlebot found are:

http://www.example.com/videos/p/2
http://www.example.com/news/p/237
.....
......

On paginated pages I am using <link rel="canonical" href="http://www.example.com/news" /> and the Firefox extension detects it fine.
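
For reference, this means the head of a paginated page like http://www.example.com/news/p/2 looks roughly like this (a sketch of the setup described above; the <title> is illustrative):

<head>
  <title>News - Page 2</title>
  <link rel="canonical" href="http://www.example.com/news" />
</head>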

Wondering if I am doing anything wrong?

 

tangor

Msg#: 4227606 posted 5:56 am on Nov 7, 2010 (gmt 0)

I presume that the above email was sent to one who plays with Google, i.e., webmaster stuff and other things where G has an email address... Can't see Google sending that same comment out to all the OTHER websites for which they do not have contact information, etc., or any dog in the fight.

This reminds me of pissing in your water supply or pooping around the campfire. Makes for more BAD PRESS than solving problems. The way to deal with duplicate content is to ignore it... oh, wait, can't do that! There would be hordes of webmasters crying foul!

Best I can say is that if one plays in their sandbox, their rules ("sandbox" means playpen/play area, not isolation) will apply. Live with it and don't complain.

Over the last three years I've migrated 75% of "ad income" away from Google and saw a 22% increase in income. The Google monster still lives, as a monster, but there are increasingly more predators/competitors---even mosquitoes---nipping at the heels of the monster, wherein webmasters can enjoy income without krappola dictates from the creepy line.

jojy

Msg#: 4227606 posted 11:37 am on Nov 7, 2010 (gmt 0)

Live with it and don't complain.


That's what I am trying to do.

Over the last three years I've migrated 75% of "ad income" away from Google and saw a 22% increase for income.


Unless you live in the USA, the Yahoo and Microsoft ad networks won't work. I don't think managing your own ad network is a good idea for selling ads, especially if you are running a small business. What other choice have you got?

pageoneresults

Msg#: 4227606 posted 3:20 pm on Nov 7, 2010 (gmt 0)

I've seen this message before when there was a problem with a rewrite. The bot was getting caught in a loop which in turn was generating a high volume of pages returning a 200 OK. I'd double-check and make sure you don't have malformed syntax somewhere that's causing the bot to get hung up in some sort of black hole.
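
As a hypothetical example (made-up URLs, not from your site), a bad rewrite combined with relative links can generate an endless chain of distinct URLs that all return 200 OK with the same content:

http://www.example.com/news/p/2
http://www.example.com/news/p/2/p/2
http://www.example.com/news/p/2/p/2/p/2
...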

Did you make any changes since the time that message appeared? What does crawl activity look like? Do you see an initial spike and then lots of spikiness afterward? That could be a sign of a problem somewhere in the machine.

jojy

Msg#: 4227606 posted 3:31 pm on Nov 7, 2010 (gmt 0)

I'd double-check and make sure you don't have malformed syntax somewhere that's causing the bot to get hung up in some sort of black hole


What are you referring to here? A problem in URL rewriting? The URLs that Googlebot attached to this message are working fine; almost all are paginated URLs and they have the right canonical link.

Did you make any changes since the time that message appeared?

No, I didn't.

Do you see an initial spike and then lots of spikiness afterward?


I see normal crawl activity from August to November, with a little spike in mid-October (time spent, kilobytes downloaded and number of pages crawled).

pageoneresults

Msg#: 4227606 posted 4:08 pm on Nov 7, 2010 (gmt 0)

What are you referring to here? A problem in URL rewriting? The URLs that Googlebot attached to this message are working fine; almost all are paginated URLs and they have the right canonical link.


I do believe the URIs are just a sampling of what it found and not the entire set. If you browse to a paginated URI that is not valid, are the proper server headers being returned?
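
As a sketch (using a made-up out-of-range page number), a request like this should get a 404 rather than a 200 with an empty or duplicate listing:

GET /news/p/99999 HTTP/1.1
Host: www.example.com

HTTP/1.1 404 Not Found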

A little spike in mid-October (time spent, kilobytes downloaded and number of pages crawled).


And this message from Google came after that "little spike" in the middle of October?

I'd be looking at technical glitches at this point to make sure all is okay. If you didn't make any changes, that makes it a bit more challenging to determine what might be happening. It could be a glitch in GWT, but for as long as I've used it, the information reported has been accurate. When you get a notification from Google like this, it is cause for concern. The first thing I'd check is server headers, to make sure the bots are getting proper directives based on their requests: 200s where appropriate, 301s, 404s, 410s, etc.

From Google's John Mu...

We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-).


Googlebot encountered an extremely high number of URLs on your site
[Google.com...]

jojy

Msg#: 4227606 posted 4:26 pm on Nov 7, 2010 (gmt 0)

And this message from Google came after that "little spike" in the middle of October?


No, it came yesterday.

Probably there is something wrong at my end, but so far I have looked at all the (paginated) URLs given in the message and couldn't find any problem.

How would you deal with paginated URLs on your site?

jmccormac

Msg#: 4227606 posted 4:48 pm on Nov 7, 2010 (gmt 0)

I frequently get this message from Google about one of my sites. It has the hosting history for every domain name in com/net/org/biz/info/mobi/asia back to 2000, and the stats for nameservers over the same period. So with approximately 300 million pages, John Mu's advice seems to be the best. Naturally I haven't put all of these pages in the sitemaps, but there is still a relatively large number. From search engine development work, broken rewriters can often cause recursive page structures (where the same page content is served under a load of different URLs), and this is one of the things that Google may be trying to avoid. The first thing to check is that any rewriter is working properly; also check the number of pages in your sitemap files. Prioritise the important ones and freeze the ones that never change at a lower priority/importance.
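
For example, in a sitemap file that prioritisation might look something like this (values illustrative only):

<url>
  <loc>http://www.example.com/news</loc>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>
<url>
  <loc>http://www.example.com/news/p/237</loc>
  <changefreq>never</changefreq>
  <priority>0.1</priority>
</url>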

Regards...jmcc

[edited by: jmccormac at 4:52 pm (utc) on Nov 7, 2010]

pageoneresults

Msg#: 4227606 posted 4:48 pm on Nov 7, 2010 (gmt 0)

How would you deal with paginated URLs on your site?


These days I do most everything via the meta robots element. In the case of pagination, we noindex the paginated pages and allow the bot to follow links. Unless, of course, that is the end of the click path, at which point those pages would be available for public indexing.

I prefer to keep all non-essential pages out of the index. The pages that are paginated are usually just gateways to the final click path. It's the listings on the paginated pages that take you to the final records; those are what you want indexed and showing up for search queries, not the paginated pages or the click paths in between start and finish. ;)

Keep in mind that this method does not apply to all taxonomies. There are some sites where the paginated content is of importance from an indexing perspective and those would be allowed to get indexed. I don't see many like that, but they exist.

I like to conserve site equity and only allow the most important pages to get indexed. Everything else gets noindex, e.g.

<meta name="robots" content="noindex">

This allows the bots to crawl those pages and follow links to the final destination documents while keeping those inbetween docs out of the index. They are intermediary in the click path and they suck equity from the site.

On a side note, I noarchive everything these days - everything.
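
In practice that combination looks something like this on an intermediary paginated page (a sketch):

<meta name="robots" content="noindex, noarchive">

and on a final destination page that should be indexed:

<meta name="robots" content="noarchive">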

indyank

Msg#: 4227606 posted 5:13 pm on Nov 7, 2010 (gmt 0)

On a side note, I noarchive everything these days - everything


Is there any specific reason why you do that?

pageoneresults

Msg#: 4227606 posted 5:38 pm on Nov 7, 2010 (gmt 0)

Many reasons...

What are the potential risks of Google Cache?
[WebmasterWorld.com...]

tedster

Msg#: 4227606 posted 5:45 pm on Nov 7, 2010 (gmt 0)

I don't think a canonical link tag, on its own, prevents future crawling of that URL - especially so in the case of paginated URLs, where the content is significantly different from the canonical URL you are declaring.

jojy

Msg#: 4227606 posted 6:22 pm on Nov 7, 2010 (gmt 0)

@tedster, as suggested by pageoneresults, should I just use the <meta name="robots" content="noindex"> tag on paginated URLs?

tedster

Msg#: 4227606 posted 8:51 pm on Nov 7, 2010 (gmt 0)

I'm not sure if noindex is enough to turn off the warning message for you or not. It's a pretty recent development and I have not experienced it.

With a noindex meta tag, those URLs will still be discovered and crawled - they must be in order for the meta tag to be read. The content of those URLs will be kept out of the search results, however. Still, depending how deep your pagination goes, all that crawling might be a drain on your total crawl budget - at least for a while until Google gets it sorted.

topr8

Msg#: 4227606 posted 9:19 pm on Nov 7, 2010 (gmt 0)

hey pageoneresults

On a side note, I noarchive everything these days - everything.


Yeah, me too. So I trust you have banned Scoutjet, the bot of the currently hyped new SE Blekko.

see [webmasterworld.com...]

Planet13

Msg#: 4227606 posted 5:52 am on Nov 8, 2010 (gmt 0)

...should I just use the <meta name="robots" content="noindex"> tag on paginated URLs?


Can I interrupt for a second here to make sure we are talking about the same thing?

When you say "paginated," you mean something like:

widgets-A-through-F.html
widgets-G-through-L.html
widgets-M-through-S.html
widgets-T-through-Z.html

Right?

Or are you saying:

all-widgets-sorted-by-price.html
all-widgets-sorted-by-most-popular.html
all-widgets-sorted-by-manufacturer.html

Thanks in advance for the clarification.

jojy

Msg#: 4227606 posted 9:45 am on Nov 8, 2010 (gmt 0)


widgets-A-through-F.html
widgets-G-through-L.html
widgets-M-through-S.html
widgets-T-through-Z.html

This is correct.

danimalSK



 
Msg#: 4227606 posted 10:01 am on Nov 8, 2010 (gmt 0)


JohnMu has advised against using canonicals for pagination. The reason is that the canonical acts as a kind of "bot redirect": Googlebot hits the page, sees the canonical, and then doesn't bother with the rest of the page. This causes problems with crawling and PR flow, as Google doesn't reach items/pages further down the list of pagination.

The approach I've been using, and which seems to work well, is meta-noindexing everything after page one. The pages canonicalize to themselves (in case someone links to them externally), but not to page 1.
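
Using the OP's URLs, the head of page 2 would carry something like this (a sketch of that approach):

<link rel="canonical" href="http://www.example.com/news/p/2" />
<meta name="robots" content="noindex">

Page 1 keeps its own canonical and stays indexable.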
