

Botched Crawl

Invalid Snapshot of the Web


johnsmith2003

12:33 am on Mar 16, 2003 (gmt 0)



The way Google works, I think they need an accurate "snapshot" of the web every month to be able to calculate PR and the link relationships that reflect the "shape" of the web. One missed link and... you've got distortion already. Of course some errors are acceptable; the world won't miss a few links.

Correct me if I'm wrong, but the second-to-last crawl (the one that generated this latest update) and the current crawl are absolutely incomplete!

I manage *quite* a few sites, and I've been in touch with several webmasters, and it seems to me Google has realized they can't take a full copy of the web every month, so they're now limiting their crawl.

This leads to a few conclusions:

1 - If this is true, then Google is "shaping" the web the way they want to. Crawling this and not crawling that, by their own choice instead of doing a full crawl, simply determines that this link will be valid and that other one won't (obviously, since links that aren't crawled won't count!).

2 - The freshbot and deepbot are a mess: they're pulling the same pages as each other, and the deepbot no longer crawls many sites fully.

Something else comes to mind.

Microsoft is almost always forced to release products earlier than it would like, due to market constraints and competition.

I have a strange feeling the same is happening with Google. They've somehow messed up their schedule while messing with the Googlebot code (GoogleGuy acknowledged they changed it to crawl dynamic pages better), and now, to catch up, they're having to do incomplete crawls to get back to 30-day updates.

What is obvious is this: if it's not crawling faster and it's crawling for shorter periods, then it has to be crawling less!

hitchhiker

7:28 am on Mar 16, 2003 (gmt 0)

10+ Year Member



[Google tries not to hit dynamic servers too hard.]

So what, we rewrite into HTML? I respect the task Google has, but at the end of the day they're number 1 because of the quality of their searches. It was evident from the beginning; I remember the day I switched from HotBot to this newbie Google, and I've remained impressed since.
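
(For readers wondering what "rewrite into HTML" means in practice, here is a minimal sketch of the idea, in Python for brevity: publish static-looking URLs to crawlers and map them back to the dynamic handler on the server. The URL shapes and parameter names are hypothetical, not hitchhiker's actual site.)

```python
# Minimal sketch of the "rewrite into HTML" idea: advertise static-looking
# URLs to crawlers while the server maps them back to the dynamic handler.
# All URL shapes and names here are hypothetical.
import re

# Crawler-friendly form:  /page-<nd>-<lc>.html
# Dynamic form:           /page.aspx?nd=<nd>&lc=<lc>
STATIC_PATTERN = re.compile(r"^/page-(\d+)-(\d+)\.html$")

def to_static(nd: int, lc: int) -> str:
    """Build the static-looking URL the site would publish in its links."""
    return f"/page-{nd}-{lc}.html"

def to_dynamic(path: str):
    """Map a static-looking URL back to the real dynamic handler."""
    m = STATIC_PATTERN.match(path)
    if m is None:
        return None  # not a rewritten URL; serve as-is
    nd, lc = m.groups()
    return f"/page.aspx?nd={nd}&lc={lc}"

print(to_static(3, 12))               # -> /page-3-12.html
print(to_dynamic("/page-3-12.html"))  # -> /page.aspx?nd=3&lc=12
```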

Now? It's odd; I feel I'm rewriting my sites so that Googlebot can read them. As much use as this forum has been, it's a "personal" struggle to develop the latest sites AND be indexed. I can't be adventurous AT ALL!?

No, Googlebot definitely does NOT index sites completely. I can't believe no webmasters but J. Smith have noticed this. I respect GG for being here (1000+ posts) and still responding, but this is so worrying; I'm almost at the point where I EXPECT this. Why some, why not others?

It's hard to tell why. Is it due to market forces, or just code that spiders have issues with? If so, HOW? WHY? We're IN THE DARK!?

GoogleGuy

7:40 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hitchhiker, Google's goal is to be able to crawl well enough that no webmaster ever has to change their site for us. We're not there yet, but we're trying. What sort of pages have you seen that we didn't index? Dynamic or static?

hitchhiker

7:51 am on Mar 16, 2003 (gmt 0)

10+ Year Member



GG,
aspx (ASP.NET).
I'm using a querystring to navigate (no cookies):
?nd = node id (position in the menu)
?lc = link id (the actual page data)
?lcid = language id
(I'm considering hashing the querystring now.)

However, I split the site into language domains, making lcid redundant and avoiding an excessive querystring.

No duplicate data, though pages use a template (perhaps 20% of the resulting 'HTML' could be that).

I turned sitemap.aspx into an .htm file; this might help? (But I missed the deep crawl.)

I cross-link to the other domains on my homepage only.

The menu system uses no JavaScript (nor does my site) or any other up-level browser technology. It's a tree menu along the side, approximately 7 top menus, 3-4 levels deep in places. It folds and expands (identical to Explorer).

All this made for a site that is quick and easy to navigate, and it exists in 6 languages. But it doesn't get spidered.
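
(A minimal sketch of one alternative to the "hashing the querystring" idea mentioned above: canonicalize the querystring instead, so every equivalent request collapses to a single URL and a crawler sees fewer duplicates. The parameter names nd, lc, and lcid follow the post; everything else is hypothetical, in Python for brevity.)

```python
# Sketch of querystring canonicalization: drop redundant parameters and
# fix the parameter order, so every equivalent request collapses to a
# single URL for the crawler. Parameter names (nd, lc, lcid) follow the
# post; the rest is hypothetical.
from urllib.parse import parse_qsl, urlencode, urlsplit

REDUNDANT = {"lcid"}  # language is already encoded in the domain

def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in REDUNDANT]
    params.sort()  # fixed order, regardless of how the link was written
    query = urlencode(params)
    return f"{parts.path}?{query}" if query else parts.path

# Both orderings (and the redundant lcid) collapse to one URL:
assert canonical_url("/page.aspx?nd=3&lc=12&lcid=1033") == \
       canonical_url("/page.aspx?lc=12&nd=3") == "/page.aspx?lc=12&nd=3"
```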

Thanks for your attention, GG. I appreciate that developing such a massive index, with trickery to avoid and clarity to maintain, will take time.

jomaxx

5:33 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Although Google says it doesn't want to hit servers too hard, IMO an equally important reason is that with dynamic pages you can create a virtually infinite number of valid URLs.

I could easily create a single site with more "pages" than Google has in its entire index, but I wouldn't expect Google to index them all.
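
(jomaxx's point is simple arithmetic: independent querystring parameters multiply. A sketch with made-up parameter counts, in Python:)

```python
# The arithmetic behind "virtually infinite valid URLs": independent
# querystring parameters multiply. With k parameters taking n values
# each, one script exposes n**k distinct URLs. Counts are made up.
from math import prod

# e.g. a calendar script: year x month x day x sort order x page size
values_per_param = [3000, 12, 31, 4, 5]
print(prod(values_per_param))  # 22,320,000 URLs from a single script

# Add one more 150-valued parameter (say, a category filter) and the
# count passes 3 billion -- more pages than Google's whole 2003 index.
print(prod(values_per_param) * 150)  # 3,348,000,000
```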

hitchhiker

12:51 am on Mar 17, 2003 (gmt 0)

10+ Year Member



That's no reason to limit dynamic pages; it's just another reason to adapt to an unstoppable market trend.

jomaxx

3:06 am on Mar 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



?

So you're saying that Google ought to index all 3 billion unique pages on my hypothetical site?

hitchhiker

9:44 am on Mar 17, 2003 (gmt 0)

10+ Year Member



No. Clarity (accuracy) in the SERPs is not something you can achieve by ignoring pages just because they 'can' be false.

As GG suggested, they are trying to handle all these dynamic types, but it'll take time.

Rather than avoiding the issue, tackle it as best you can from the beginning; otherwise, how would G stay competitive?

<added>Would excluding PHP, ASP, ASPX, etc. from the SERPs be sensible!?</added>

Powdork

9:31 pm on Mar 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Would excluding PHP, ASP, ASPX, etc. from the SERPs be sensible!?

It would be ludicrous.

But it would move me up considerably, so I say let's go for it. :)

Bladzee

8:01 am on Mar 18, 2003 (gmt 0)

10+ Year Member



Hi guys, I'm new here, but I joined for the specific purpose of trying to find out what happened to Google's index this past month. It blew my mind. Almost all of our clients either didn't get indexed or were missing an unbelievable number of backward links. This past index killed our PageRank and caused havoc among clients. Is there anything that can be done?

Powdork

8:11 am on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did almost all of your clients reside on recently expired domains, or did the lost backlinks come from recently expired (and resuscitated) domains?

born2drv

8:17 am on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wasn't the last update based on a crawl that was partially incomplete due to the SQL Slammer worm in late January?

Maybe I've got the date wrong, but that could have accounted for problems for some people; even if their sites were fine, their datacenters may have been down.

Powdork

8:21 am on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I thought of server problems, but that wouldn't explain the loss of backlinks unless there is a lot of cross-linking among sites on the same server or with the same ISP.

Powdork

8:22 am on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oops!
Welcome to WebmasterWorld, Bladzee. Good luck!

Bladzee

5:29 pm on Mar 18, 2003 (gmt 0)

10+ Year Member



Thanks guys! I thought it was the SQL worm too, but wasn't sure.