
Forum Moderators: DixonJones & mademetop


WT "not selected" URLs number higher than "ever crawled"

     
9:27 am on Sep 10, 2012 (gmt 0)

New User

joined:Aug 21, 2012
posts: 18
votes: 0


Hi, I recently noticed strange behaviour on one of my sites.

The site runs a WordPress installation, and on 8/26/12 it was moved to different hosting. Recently the numbers of "ever crawled" and "not selected" URLs in Google Webmaster Tools started to grow very fast.

On 7/22/12 there were:
- 753 ever crawled
- 606 not selected

On 8/26/12 there were:
- 5.686 ever crawled
- 5.499 not selected

On 9/9/12 there were:
- 5.686 ever crawled
- 10.404 not selected

So the questions are:

How can "not selected" be higher than "ever crawled"?

Does it indicate a problem with my site? How can I find its source and make corrections?

Any help or suggestions are very much appreciated.
10:47 am on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google has seen links pointing to 10 000 URLs, but has only crawled 5600 of them.

If a significant number of crawled URLs are 404, soft 404, redirects (maybe), or other "non-pages", then Google throttles crawling back so as not to waste its crawl budget.
11:32 am on Sept 10, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 12


i'm going with g1smd's answer.

i would start analyzing the urls crawled by googlebot and look for the responses given to non-canonical urls.
if the status code is a 302 or a 200 that's your problem.
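
To make that check concrete, here is a minimal Python sketch. The function name `audit_noncanonical` and its verdict strings are made up for illustration, and it assumes the response you want for a non-canonical URL is either a 301 to the canonical URL or a 404/410. The status codes and Location headers would come from your own server logs or a crawl.

```python
from typing import Optional


def audit_noncanonical(status: int, location: Optional[str], canonical_url: str) -> str:
    """Judge the response a server gave to a NON-canonical URL.

    Assumption: a 301 to the canonical URL, or a 404/410, is acceptable;
    a 200 or a 302 is treated as the problem case described above.
    """
    if status == 301 and location == canonical_url:
        return "ok: permanent redirect to canonical"
    if status in (404, 410):
        return "ok: URL reported as not found or gone"
    if status == 200:
        return "problem: non-canonical URL served as a normal page"
    if status == 302:
        return "problem: temporary redirect instead of a permanent one"
    return "check manually"


# Hypothetical example: a calendar-style URL that answers 200.
print(audit_noncanonical(200, None, "http://example.com/post/"))
```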
1:46 pm on Sept 10, 2012 (gmt 0)

New User

joined:Aug 21, 2012
posts: 18
votes: 0


Thanks a lot, guys. I hope I have found the source of the problem.

It seems that a calendar plugin messed with the URLs by adding ?month=xxx&yr=xxxx to almost every URL. When I went to GWT to register this parameter as one that does not affect the displayed content, I found that it was already listed there, with the option "Let Googlebot decide" and 10.270 monitored URLs. So I changed it to the option "No: Doesn't affect page content" (just to be sure).
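
One rough way to spot a parameter like this without relying on GWT is to count how often each query parameter name shows up in a list of crawled URLs (from a log file or a crawler's output). The sketch below is only an illustration: `count_query_params` and the sample URLs are made up, not taken from the actual site.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def count_query_params(urls):
    """Count, per parameter name, how many URLs carry that parameter."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # A set, so a name repeated inside one URL is counted once.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample; in practice, read the URLs from a log or crawl export.
sample_urls = [
    "http://example.com/post-a/?month=9&yr=2012",
    "http://example.com/post-b/?month=10&yr=2012",
    "http://example.com/post-c/",
]
print(count_query_params(sample_urls).most_common())
```

A parameter that appears on thousands of URLs but never changes the page content is a likely suspect.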

Thank you both for your very quick answers and help. I'll try to contact the developer of this plugin and report this issue.
2:17 pm on Sept 10, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 12


a calendar plugin is a typical source of infinite url space.
2:40 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I have no idea why calendaring systems don't limit the date range that is accessible and don't return 404 for empty dates. Almost all of them seem to suffer from this flaw.
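
As a sketch of what such a limit could look like (Python for illustration only; `calendar_status` and its arguments are hypothetical and not taken from any real plugin), a calendar page could pick its status code like this:

```python
from datetime import date


def calendar_status(year, month, event_dates, earliest, latest):
    """Return the HTTP status a calendar page for (year, month) should send.

    Assumption: months outside [earliest, latest], and months without any
    events, answer 404 rather than rendering an empty calendar.
    """
    if not 1 <= month <= 12:
        return 404
    try:
        first_of_month = date(year, month, 1)
    except ValueError:
        # Year outside the range that datetime.date supports.
        return 404
    lower = date(earliest.year, earliest.month, 1)
    upper = date(latest.year, latest.month, 1)
    if not lower <= first_of_month <= upper:
        return 404
    has_events = any(d.year == year and d.month == month for d in event_dates)
    return 200 if has_events else 404
```

With that in place, the "next month" links stop leading anywhere once the range or the events run out, so the URL space is finite.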

[edited by: g1smd at 2:50 pm (utc) on Sep 10, 2012]

2:48 pm on Sept 10, 2012 (gmt 0)

New User

joined:Aug 21, 2012
posts: 18
votes: 0


The problem is that it appends the month and year selection parameters even to posts completely unrelated to the calendar.

I have no idea how the crawler finds these URLs, but I discovered them using a Unix command-line utility called webcheck (btw, a very nice utility).
2:50 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


That's even worse.
3:26 pm on Sept 10, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 12


i've seen one site where any page could have any date, past or future, appended to the already non-canonical query string.
every page started out with links to all of "this month's" dates, whatever month that happened to be at the time the page was requested, and links to the next and previous months.
4:44 pm on Sept 10, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


It's easier* to handle invalid queries after the fact than to prevent them from being added in the first place. Especially when people or googlebots can add anything they like to their address bar. So no matter what, you always need a bad-query-handling routine.


* Not "better" or "more desirable". Just easier.
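
A bad-query-handling routine along those lines might look like the following Python sketch. Everything here is an assumption for illustration: the allowlist `ALLOWED_PARAMS`, the function name `handle_query`, and the choice to answer unknown parameters with a 301 to the cleaned URL (a 404 would be another defensible choice).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: the only parameters this site really uses.
ALLOWED_PARAMS = {"p", "s"}


def handle_query(url, allowed=ALLOWED_PARAMS):
    """Return (status, location) for a requested URL.

    If every query parameter is on the allowlist, serve the page (200).
    Otherwise redirect (301) to the same URL with the unknown ones removed.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if name in allowed]
    if len(kept) == len(pairs):
        return 200, None
    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    return 301, clean_url


# Hypothetical request carrying the calendar parameters.
print(handle_query("http://example.com/post/?month=9&yr=2012"))
```

An allowlist is used here instead of a blocklist because it also covers parameters that nobody has thought of yet, which is the point about people typing anything into the address bar.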