Correct me if I'm wrong, but the second-to-last crawl (the one that generated this latest update) and the current crawl are absolutely incomplete!
I manage *quite* a few sites, and I've been in touch with several webmasters. It seems to me Google has realized they can't take full copies of the web every month, so they're now limiting their crawl.
This leads to a few conclusions:
1 - If this is true, then Google is "shaping" the web the way they want to. Crawling this and not crawling that by their own choice, instead of going for a full crawl, is simply deciding that this link will be valid and that one won't (obviously, since pages that aren't crawled won't count!).
2 - The freshbot and deepbot are a mess: they're pulling the same pages as each other, and the deepbot no longer crawls many sites fully.
Something else comes to mind.
Microsoft is almost always forced to release products earlier than it would like, due to market constraints and competition.
I have a strange feeling the same is happening with Google. They've somehow messed up their schedule while messing with the Googlebot code (GoogleGuy acknowledged they changed it to crawl dynamic pages better), and now, to catch up, they're having to do incomplete crawls to get back to 30-day updates.
What is obvious is this: if it's not crawling faster and it's crawling for shorter periods, then it has to be crawling less!
So what, we rewrite everything into HTML? I respect the task Google has, but at the end of the day they're number one because of the quality of their searches. It was evident from the beginning; I remember the day I switched from HotBot to this newbie Google, and I've been impressed ever since.
Now? It's odd: I feel like I'm rewriting my sites just so Googlebot can read them. As much use as this forum has been, it's a "personal" struggle to develop the latest sites AND be indexed. I can't experiment, AT ALL!?
No, Googlebot definitely does NOT index sites completely. I can't believe that no webmasters besides J.Smith have noticed this. I respect GG for being here (1000+ posts) and still around, but this is so worrying that I'm almost at the point where I EXPECT it. Why some sites, and why not others?
It's hard to tell why. Is it market forces, or just code the spiders have issues with? If so, HOW? WHY? We're IN THE DARK!?
However, I split the site into language domains, making the lcid parameter redundant and avoiding an excessive querystring.
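To make that concrete, here's a minimal sketch of the kind of redirect involved (the domain names and lcid values below are hypothetical, not my actual setup): old ?lcid= URLs get a 301 to the matching language domain, so the spider only ever sees clean, querystring-free URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Web;

public class Global : HttpApplication
{
    // Hypothetical lcid -> language-domain map; real values would differ.
    private static readonly Dictionary<string, string> LcidDomains =
        new Dictionary<string, string>
        {
            { "1033", "http://www.example.com" },  // English
            { "1036", "http://www.example.fr" },   // French
            { "1031", "http://www.example.de" }    // German
        };

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string lcid = Request.QueryString["lcid"];
        string domain;
        if (lcid != null && LcidDomains.TryGetValue(lcid, out domain))
        {
            // Permanent redirect, so the engines treat the language domain
            // as the one true address for the page.
            Response.StatusCode = 301;
            Response.AddHeader("Location", domain + Request.Path);
            Response.End();
        }
    }
}
```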
No duplicate data, though the pages use a template (perhaps 20% of the resulting 'html' comes from that).
I made sitemap.aspx into a static .htm file - this might help? (But I missed the deep crawl.)
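Roughly what I mean, as a sketch (the page list below is made up; in reality it would come from the same data sitemap.aspx used): a tiny one-off program that writes a plain .htm file of static links, so nothing dynamic has to execute for the spider to get the full page list.

```csharp
using System.IO;

class BuildSitemap
{
    static void Main()
    {
        // Hypothetical page list; in practice, the same data sitemap.aspx used.
        string[] pages =
        {
            "/products/widgets.htm",
            "/products/gadgets.htm",
            "/support/faq.htm"
        };

        using (StreamWriter w = new StreamWriter("sitemap.htm"))
        {
            w.WriteLine("<html><head><title>Site map</title></head><body><ul>");
            foreach (string page in pages)
            {
                // Plain static anchors -- nothing dynamic for the spider to choke on.
                w.WriteLine("<li><a href=\"" + page + "\">" + page + "</a></li>");
            }
            w.WriteLine("</ul></body></html>");
        }
    }
}
```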
I crosslink to the other domains on my homepage only.
The menu system uses no JavaScript (nor does the rest of my site) or any other uplevel browser technology. It's a tree menu along the side: approx 7 top-level menus, 3-4 levels deep in places. It folds and expands (identical to Explorer).
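For anyone wondering how a tree folds and expands with no JavaScript at all: the open branch travels in the querystring, so expanding a node is just following an ordinary link and re-rendering on the server. A rough sketch of the idea (the names and structure are hypothetical, not my actual code):

```csharp
using System.Text;

class MenuNode
{
    public string Title;
    public string Path;                       // e.g. "products/widgets"
    public MenuNode[] Children = new MenuNode[0];
}

class TreeMenu
{
    // Render the tree as nested lists; a node is "expanded" only when its
    // path is a prefix of the open branch carried in the querystring.
    public static string Render(MenuNode node, string openPath)
    {
        StringBuilder html = new StringBuilder();
        html.Append("<li><a href=\"?open=" + node.Path + "\">"
                    + node.Title + "</a>");

        if (node.Children.Length > 0 && openPath.StartsWith(node.Path))
        {
            html.Append("<ul>");
            foreach (MenuNode child in node.Children)
                html.Append(Render(child, openPath)); // recurse down the open branch
            html.Append("</ul>");
        }

        html.Append("</li>");
        return html.ToString();
    }
}
```

Because every fold/expand is a plain href, a spider can reach every node in the tree without any uplevel browser features.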
All this made for a site that was quick and easy to navigate, and it exists in 6 languages. But it doesn't get spidered.
Thanks for your attention, GG. I appreciate that developing such a massive index, with trickery to avoid and clarity to maintain, will take time.
I could easily create a single site with more "pages" than Google has in its entire index, but I wouldn't expect Google to index them all.
As GG suggested, they are trying to handle all these dynamic page types, but it'll take time.
Rather than avoid the issue, tackle it as best you can from the beginning; otherwise, how would G stay competitive?
<added>Would limiting PHP, ASP, ASPX etc. in the SERPs be sensible!?</added>