g1smd

msg:4493195 | 10:47 am on Sep 10, 2012 (gmt 0) |
Google has seen links pointing to 10 000 URLs, but has only crawled 5600 of them. If a significant number of the crawled URLs are 404s, soft 404s, redirects (maybe), or other "non-pages", then crawling is throttled back so as not to waste crawl budget.
|
phranque

msg:4493202 | 11:32 am on Sep 10, 2012 (gmt 0) |
i'm going with g1smd's answer. i would start analyzing the urls crawled by googlebot and look for the responses given to non-canonical urls. if the status code is a 302 or a 200 that's your problem.
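
One way to do that analysis is to tally the status codes Googlebot received, straight from the server access log. The sketch below assumes the log is in Apache/nginx "combined" format and that matching the string "Googlebot" in the user agent is good enough for a first pass (it does not verify that the request really came from Google). The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
#   ip - - [time] "METHOD url PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_status_counts(lines):
    """Count response status codes for requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")] += 1
    return counts

# Hypothetical sample lines, not taken from any real log.
sample = [
    '66.249.66.1 - - [10/Sep/2012:10:47:00 +0000] "GET /post-1?month=9&yr=2012 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Sep/2012:10:47:05 +0000] "GET /old-page HTTP/1.1" '
    '302 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/Sep/2012:10:47:09 +0000] "GET /post-1 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
]

print(googlebot_status_counts(sample))
```

A large share of 200 or 302 responses on URLs that are not the canonical form is the pattern to look for.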
|
MinosTheNinth

msg:4493252 | 1:46 pm on Sep 10, 2012 (gmt 0) |
Thanks a lot, guys. I hope I have found the source of the problem. It seems that a calendar plugin messed with the URLs by adding ?month=xxx&yr=xxxx to almost every URL. When I went to GWT to mark this parameter as one that does not affect the displayed data, I found that it was already listed there with the option "Let Googlebot decide" and 10,270 monitored URLs. So I changed it to the option "No: Doesn't affect page content" (just to be sure). Thank you both for your very quick answers and help. I'll try to contact the developer of this plugin and report the issue.
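
To get a feel for how much of the URL count is inflated by these parameters, a list of crawled URLs can be collapsed by stripping them. This is only a rough sketch: the parameter names `month` and `yr` come from the URLs above, while the function name, the example.com URLs, and the assumption that no other parameter matters are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters the calendar plugin appends, per the URLs described above.
CALENDAR_PARAMS = {"month", "yr"}

def strip_calendar_params(url):
    """Return the URL with the calendar parameters removed, other parameters kept."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CALENDAR_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical crawled URLs for illustration.
crawled = [
    "http://example.com/post-1",
    "http://example.com/post-1?month=9&yr=2012",
    "http://example.com/post-1?month=10&yr=2012",
    "http://example.com/post-2?p=5&month=9&yr=2012",
]

unique = {strip_calendar_params(url) for url in crawled}
print(len(crawled), "crawled URLs collapse to", len(unique), "distinct pages")
```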
|
phranque

msg:4493269 | 2:17 pm on Sep 10, 2012 (gmt 0) |
a calendar plugin is a typical source of infinite url space.
|
g1smd

msg:4493294 | 2:40 pm on Sep 10, 2012 (gmt 0) |
I have no idea why calendaring systems don't limit the date range that is accessible and don't return 404 for empty dates. Almost all of them seem to suffer from this flaw. [edited by: g1smd at 2:50 pm (utc) on Sep 10, 2012]
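
A minimal sketch of what such a check could look like, assuming the calendar knows which months actually have events and has a fixed earliest and latest month it is willing to serve. The function name, arguments, and sample data are hypothetical and not taken from any particular plugin.

```python
def calendar_status(year, month, event_months, earliest, latest):
    """Pick an HTTP status for a requested calendar month.

    event_months: set of (year, month) tuples that have at least one event.
    earliest, latest: (year, month) tuples bounding the accessible range.
    """
    if not 1 <= month <= 12:
        return 404  # not a real month
    if not earliest <= (year, month) <= latest:
        return 404  # outside the range the site is willing to serve
    if (year, month) not in event_months:
        return 404  # valid date, but nothing to show
    return 200

# Hypothetical data for illustration.
events = {(2012, 9), (2012, 10)}
print(calendar_status(2012, 9, events, (2010, 1), (2013, 12)))
print(calendar_status(2999, 1, events, (2010, 1), (2013, 12)))
```

The same check can also decide whether "next month" and "previous month" links are rendered at all, so crawlers are never handed links into empty months.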
|
MinosTheNinth

msg:4493300 | 2:48 pm on Sep 10, 2012 (gmt 0) |
The problem is that it appends the month and year selection parameters even to posts completely unrelated to the calendar. I have no idea how the crawler finds these URLs, but I discovered them using a Unix command-line utility called webcheck (btw, a very nice utility).
|
g1smd

msg:4493302 | 2:50 pm on Sep 10, 2012 (gmt 0) |
That's even worse.
|
phranque

msg:4493324 | 3:26 pm on Sep 10, 2012 (gmt 0) |
i've seen one site where any page could have any date, past or future, appended to the already non-canonical query string. every page started out with links to all of "this month's" dates, whatever month that happened to be at the time the page was requested, and links to the next and previous months.
|
lucy24

msg:4493382 | 4:44 pm on Sep 10, 2012 (gmt 0) |
It's easier* to handle invalid queries after the fact than to prevent them from being added in the first place. Especially when people or googlebots can add anything they like to their address bar. So no matter what, you always need a bad-query-handling routine. * Not "better" or "more desirable". Just easier.
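
As a rough illustration of such a routine: the sketch below assumes the site keeps an allowlist of query parameters it actually understands, and answers anything else with a 301 to the cleaned-up URL. The allowlist, the function name, and the choice of 301 (rather than, say, 404) are assumptions for the example, not something stated in this thread.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical allowlist: the only query parameters this site really uses.
ALLOWED_PARAMS = {"page"}

def handle_query(url):
    """Return (status, location) for a requested URL.

    (200, None) means serve the page as requested;
    (301, target) means redirect to the URL with unknown parameters dropped.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in ALLOWED_PARAMS]
    if len(kept) == len(pairs):
        return 200, None
    return 301, urlunsplit(parts._replace(query=urlencode(kept)))

print(handle_query("/post-1?month=9&yr=2012"))
print(handle_query("/archive?page=2"))
```

Because the check runs on every request, it catches junk parameters no matter where they came from: a plugin, a mistyped link, or something added by hand.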
|
|