Correct me if I'm wrong, but the second-to-last crawl (the one that generated this latest update) and the current crawl are both absolutely incomplete!
I manage *quite* a few sites, and after talking with several other webmasters it seems to me Google has realized they can't take a full copy of the web every month, so they're now limiting their crawl.
This leads to a few conclusions:
1 - If this is true, then Google is "shaping" the web the way they want to. Crawling this and not crawling that by their own choice, instead of going for a full crawl, simply determines which links will count and which won't (obvious, since pages that aren't crawled don't count!).
2 - The freshbot and deepbot are a mess: they're pulling the same pages as each other, and the deepbot no longer crawls many sites fully.
Something else comes to mind.
Microsoft is almost always forced to release products earlier than it would like, due to market pressure and competition.
I have a strange feeling the same is happening with Google. They've somehow messed up their schedule while messing with the Googlebot code (GoogleGuy acknowledged they changed it to crawl dynamic pages better), and now, to catch up, they're having to do incomplete crawls to get back to 30-day updates.
What is obvious is this: if it's not crawling faster and it's crawling for shorter periods, then it's got to be crawling less!
On the other hand, it is definitely true that Google has way too much power (sorry, GG, but it's true), and this is another reason why. Maybe competitors will step up and claim to have a more complete crawl. But nobody would ever claim to have a complete index of the whole web; that's ridiculous.
I don't think your latest crawl is even close to complete, but as for it being the most complete one to date - well, I can't really discuss that, since that's information you have and I don't.
Googlebot is missing far too many pages on many, many sites I've been watching (on more than one server).
Some of these sites are .com, others are .net, and some samples include .de and .it, which have also suffered from this. Some are hosted on shared accounts, others on dedicated servers all by themselves. Most don't belong to me; I'm bringing you a group complaint from myself and many webmasters I work with.
In all cases the crawl has not been complete - and this comes precisely after you said you had improved your ability to crawl dynamic pages.
Yes, the crawl was complete before the last one; something you did recently is causing this... I don't know what.
The bot seems to be trying to identify dynamic parameters by matching common file extensions... For example:
page.php/32 is not crawled right, while page.xtx/32 is crawled perfectly - which means the bot is taking the .php, concluding the page is PHP, and treating 32 as a parameter. The .xtx (fictitious) extension works fine because Googlebot can't tell what .xtx is... so it skips (!) the known extension and, curiously, crawls the unknown one!
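Just to make the theory concrete, here's a rough sketch (Python, purely hypothetical - I obviously have no idea what Googlebot's real code looks like) of the kind of filter that would produce exactly this pattern:

    from urllib.parse import urlparse
    import posixpath

    # Purely hypothetical - a naive rule that treats "known script extension
    # followed by more path" as a dynamic parameter. My guess at the behaviour,
    # not Googlebot's actual code.
    KNOWN_SCRIPT_EXTENSIONS = {".php", ".asp", ".cgi", ".pl"}

    def treats_as_dynamic(url: str) -> bool:
        """True if any path segment before the last has a known script extension."""
        segments = [s for s in urlparse(url).path.lower().split("/") if s]
        return any(posixpath.splitext(s)[1] in KNOWN_SCRIPT_EXTENSIONS
                   for s in segments[:-1])

    print(treats_as_dynamic("http://example.com/page.php/32"))  # True  - treated as dynamic (maybe skipped)
    print(treats_as_dynamic("http://example.com/page.xtx/32"))  # False - crawled like a plain file

If something like that is in there, any known script extension followed by path info gets treated as a parameter, which matches exactly what I'm seeing.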
Before posting here I was about to add some bogus file extension to Apache's configuration and change my pages' PHP extension to .bogus so I'd get crawled by Googlebot again... I figured posting would be easier, and more appropriate, since it might help you figure out what's up too.
On other sites, .html pages are being missed. The examples above are PHP-only as a single case study; on the others, Google is missing internal pages, .html pages, .php pages, and lots of useful content from very old sites that have always been crawled OK.
FAST and Inktomi are crawling heavily right now and getting all these pages perfectly at very high speeds, so this is definitely not a DNS issue or a local server issue - it is something Google changed in Googlebot before the February crawl.
Thanks, and please see what you can do. It's such an eclectic mix of sites going wrong that I really think you should have a look at the pages Googlebot is pulling in, to see whether you're getting a distorted index (I think you are, judging by some results in common searches recently; in one example I searched for a certain chemical compound and got a George Bush page back - I don't know what chemicals he was fond of, but surely this one would have killed him).
Lastly (sorry for the long post), I really think you're releasing early indexes just to stay on market schedule, maybe to stay within contract terms with the third parties you provide search to, etc. I don't know, but judging from the sample of 100+ sites I've been watching, I'm sure Google is rushing to market with an incomplete index. Remember: if you let the market drive you, you'll end up with a technically inferior product!
Regards, john.
There isn't any way to take a complete snapshot of the web. All you can do is try and develop a system that takes regular snapshots of the portions of the web that your users think are the most important. Obviously that's a subjective task, but I'd say based on the number of people using Google each day that they've done a pretty good job developing their crawling strategy.
You should know there's something weird with what you've done to Googlebot in February, and I think it's wrong; anything beyond that is beyond my obligation.
Thanks again, and keep an eye on that bot - other engines are doing a better job than Google, at least on the crawl side.
As if this makes a difference? Does Google have some perverse bias against foreign TLDs? My main site is out in the .ws namespace because slimy, filthy domain name speculators are foolishly hoarding the .com, .net and .org versions without using them. That .ws is ODP listed, but I'd think Google would know better than to discriminate based on TLD.
Most of the site is set up so that to view a page on, say, widgets, you would be linked to index.php?page=widgets, and for sprockets, index.php?page=sprockets, with every page linking to the others (for now - I'll be moving to that nice lil pyramid scheme soon, but I'm only using it on another site for now, on the same domain. It helps me organize fairly well, but it hasn't been around long enough to see how well Google reacts :) )
But I've heard the same complaints from many other people, with Googlebot not crawling dynamic URLs.
But I do have a question: with as many changes as Google makes, is the Googlebot/2.1 UA ever going to change to 2.2, or 3.0? I suppose doing so would break quite a few custom-made scripts that watch out for Google, but that's not necessarily a bad thing :)
I know that .php is in our list of "good" extensions and we crawl quite a lot of PHP pages every crawl. Have other people seen any problems with Googlebot crawling .php pages, or any other types of pages?
You never answered whether your sites are in the ODP. What is the PR of each site's home page, and how many mod_rewrite pages are on the site? What percentage of the pages are being missed?
I don't have any problems with my php, but I'm on a PR7 ODP page and have less than 3000 pages total, so it might not be a great comparison.
Google tries not to hit dynamic servers too hard. I suppose it's possible that they recognize your rewritten URLs as being dynamic and are slowing down their crawl.
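If that's what's happening, the effect might look roughly like this - a made-up politeness rule that backs off on anything that looks script-generated (nothing Google has confirmed, just an illustration, and the delay numbers are invented):

    from urllib.parse import urlparse

    BASE_DELAY = 1.0      # seconds between ordinary pages on one host (made-up number)
    DYNAMIC_DELAY = 10.0  # much longer gap for script-generated pages (made-up number)

    def crawl_delay(url: str) -> float:
        """Pick a politeness delay based on whether the URL looks dynamic."""
        parsed = urlparse(url)
        looks_dynamic = bool(parsed.query) or "cgi-bin" in parsed.path or ".php" in parsed.path
        return DYNAMIC_DELAY if looks_dynamic else BASE_DELAY

    print(crawl_delay("http://example.com/about.html"))              # 1.0
    print(crawl_delay("http://example.com/index.php?page=widgets"))  # 10.0
    print(crawl_delay("http://example.com/page.php/32"))             # 10.0 - rewritten but still flagged

A rule like that would slow the crawl of a big dynamic site right down without actually skipping anything on purpose.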
You should know there's something weird with what you've done to Googlebot in February, and I think it's wrong; anything beyond that is beyond my obligation.
You must have some incredibly unique content for anyone to ever miss it.
If Google is trying to rationalise its bandwidth/spidering usage, then the new index will probably be fresher in terms of content. It is no use having the highest number of webpages crawled if that data is stale. On the PR side of things, a fresher index would be a good thing and should increase the accuracy. Again it gets down to this whole PR argument and how it is calculated. I don't think that the actual Google algorithm was ever published, so, to use a terrible pun, everyone is just dancing in the dark. :)
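The closest thing anyone has is the simplified formula from the original Stanford paper - PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T linking to A - and whatever Google runs today has surely moved on from that. A toy version, just to show the mechanics:

    # Simplified PageRank from the original Stanford paper, damping factor d ~ 0.85.
    # A toy illustration only; whatever Google actually computes now is not public.
    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pr = {page: 1.0 for page in links}
        for _ in range(iterations):
            new_pr = {}
            for page in links:
                incoming = sum(pr[src] / len(links[src])
                               for src in links if page in links[src])
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))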
Regards...jmcc
Does Googlebot penalize pages that don't return a file size (Content-Length) in the HTTP headers? Does it even use that as a factor? If so, the fact that many dynamic pages won't return a document size may be what's screwing you guys over :p
Except... my PHP pages don't return a file size, but I get crawled just fine; then again, I'm operating on a much smaller scale, it seems ^_^
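For anyone curious about their own pages, a quick diagnostic sketch to see whether a Content-Length header comes back at all (whether Googlebot actually cares is pure speculation, and the URLs are just placeholders):

    from urllib.request import Request, urlopen

    def has_content_length(url: str) -> bool:
        """HEAD the URL and report whether a Content-Length header was sent."""
        request = Request(url, method="HEAD")
        with urlopen(request) as response:
            return response.headers.get("Content-Length") is not None

    # Static files usually report a size; script output often doesn't.
    for url in ("http://example.com/index.html", "http://example.com/index.php"):
        print(url, "->", has_content_length(url))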
If you have a site with a million pages, and you are many links removed from the ODP or Yahoo, you might not get all your pages crawled.
A smaller site would probably be okay, as would that large site if it had a PR7 ODP link, lots of other incoming links, and some deep links.
There is probably more to calculating where they will crawl next than just a simple queue, but being in one of the seed directories should definitely increase your odds of being fully crawled.
There is probably more to calculating where they will crawl next than just a simple queue, but being in one of the seed directories should definitely increase your odds of being fully crawled.
That would have Google applying an element of pre-loading.
Regards...jmcc
Google has to start the crawl somewhere. Why not start it from known sources of a large variety of links that have a history of having sufficient PR to spread through their tree?
It makes a lot more sense for them to stick www.yahoo.com and www.dmoz.org into the queue than it does for them to start with my website.
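In crawler terms that's just seeding the frontier with the big directories and working outwards - something like this toy breadth-first sketch (pure guesswork about Google's real scheduler, which certainly weighs PR and plenty more; fetch_links is left to the caller):

    from collections import deque

    SEEDS = ["http://www.yahoo.com/", "http://www.dmoz.org/"]

    def crawl(fetch_links, max_pages=1000):
        """fetch_links(url) -> list of URLs found on that page (not implemented here)."""
        queue = deque(SEEDS)
        seen = set(SEEDS)
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen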
Google's site does say
2. What else can I do to get listed in Google?
Google partners on the Web include Yahoo! and Netscape. If you are having difficulty getting listed in the Google index, you may want to consider submitting your site to either or both of these directories.
...
Once your site is included in either of these directories, Google will often index your site within six to eight weeks.
It's not proof that they do this, but it really makes sense to do it this way.
I have little doubt of this. In fact, I pray 5 times a day facing the Googleplex. ;)
Powdork, rfgdxm1, relax.
We are relaxed. As ODP editors, all we have to do is sit back, relax, and collect paychecks. Rfgdxm1, did you get yours? Mine is late. ;)
I am always impressed with how GG handles things. I just don't understand how Google can place such importance on something that is hurting so badly.
GoogleGuy: Normally the tagline is "Are you listed in directories like Y! or the ODP?" Is there any significance to your not mentioning Yahoo!?
Google has to start the crawl somewhere. Why not start it from known sources of a large variety of links that have a history of having sufficient PR to spread through their tree?
Almost every search engine I've seen in the past few years at a local level tends to rely on Dmoz/ODP for input, and Google does not seem to be different there. What I do sometimes wonder about is how Google derives its initial index. From what I can see, it is crawler-based and follows links. This could mean that a site with no inbound links does not get included, and thus directories like Dmoz/ODP and Yahoo, along with the smaller local and niche directories, are critical to Google's existence.
Regards...jmcc
Some sectors would be almost impossible to crawl without the directories, since they are stingy with their outgoing links, but you could probably find much of the web by seeding the crawl from Stanford University's servers.
I don't think that the directories are critical to Google, but they make the job of covering the web much easier.
On the ccTLD scene, the local directories would be a lot more important to Google than on the com/net/org scene. Many ccTLDs do not release details of the domains registered under them, and consequently ccTLD websites can be difficult to find, even with crawling.
Some sectors would be almost impossible to crawl without the directories, since they are stingy with their outgoing links, but you could probably find much of the web by seeding the crawl from Stanford University's servers.
The top-level-domain websites are the easiest to find, and generating a complete list of all com/net/org websites is trivial. The big problem - and this is where crawlers tend to be very useful - is finding the /~ personal subdirectories.
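Those personal pages only turn up when something links to them, though they're easy enough to recognise once a crawler sees the link - roughly like this (a hypothetical helper, not any engine's actual code):

    import re

    # Pick out /~username/ style paths from links found while crawling.
    TILDE_DIR = re.compile(r"^/~[^/]+/")

    def personal_pages(links):
        return [link for link in links if TILDE_DIR.match(link)]

    print(personal_pages(["/about.html", "/~jsmith/papers.html", "/dept/index.html"]))
    # -> ['/~jsmith/papers.html']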
What it all comes down to is that Google's PR gives a qualitative assessment of the webpages it indexes.
Regards...jmcc