Correct me if I'm wrong, but the second-to-last crawl (the one that generated this latest update) and the current crawl are both absolutely incomplete!
I manage *quite* a few sites, and after talking with several other webmasters it seems to me Google has realized they can't take a full copy of the web every month, so they're now limiting their crawl.
This leads to a few conclusions:
1 - If this is true, then Google is "shaping" the web the way they want to. Crawling this and not crawling that by their own choice, instead of going for a full crawl, simply determines which links will count and which won't (obvious, since pages that aren't crawled don't count!).
2 - The freshbot and deepbot are a mess: they're pulling the same pages as each other, and the deepbot no longer crawls many sites fully.
Something else comes to mind.
Microsoft is almost always forced to release products earlier than it would like, due to market pressure and competition.
I have a strange feeling the same is happening with Google. They've somehow messed up their schedule while messing with the Googlebot code (GoogleGuy acknowledged they changed it to crawl dynamic pages better), and now, to catch up, they're having to do incomplete crawls to get back to 30-day updates.
What is obvious is this: if it's not crawling faster and it's crawling for shorter periods, then it's got to be crawling less!
On the other hand, it is definitely true that Google has way too much power (sorry, GG, but it's true), and this is another reason why. Maybe competitors will step up and claim to have a more complete crawl. But nobody would ever claim to have a complete index of the whole web; that's ridiculous.
I don't think your latest crawl is even close to complete, but as for it being the most complete one to date - well, I can't really discuss that, since that's information you have and I don't.
Googlebot is missing far too many pages on many, many sites I've been watching (on more than one server).
Some of these sites are .com, others are .net, and some samples include .de and .it, which have also suffered from this. Some are hosted on shared accounts, others on dedicated servers all by themselves. Most don't belong to me; I'm bringing you a group complaint from myself and many webmasters I work with.
In all cases the crawl has not been complete - and this comes precisely after you said you had improved your ability to crawl dynamic pages.
Yes, the crawl was complete before the last one; something you did recently is causing this... I don't know what.
The bot seems to be trying to identify dynamic parameters by matching common file extensions... For example:
page.php/32 is not crawled right, while page.xtx/32 is crawled perfectly - which means the bot is taking the .php, concluding the page is PHP, and treating 32 as a parameter. The .xtx (fictitious) extension works fine because Googlebot can't tell what .xtx is... so it skips (!) the known extension and, curiously, crawls the unknown one!
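Just to make the theory concrete, here's a rough sketch (Python, purely hypothetical - I obviously have no idea what Googlebot's real code looks like) of the kind of filter that would produce exactly this pattern:

    from urllib.parse import urlparse
    import posixpath

    # Purely hypothetical - a naive rule that treats "known script extension
    # followed by more path" as a dynamic parameter. My guess at the behaviour,
    # not Googlebot's actual code.
    KNOWN_SCRIPT_EXTENSIONS = {".php", ".asp", ".cgi", ".pl"}

    def treats_as_dynamic(url: str) -> bool:
        """True if any path segment before the last has a known script extension."""
        segments = [s for s in urlparse(url).path.lower().split("/") if s]
        return any(posixpath.splitext(s)[1] in KNOWN_SCRIPT_EXTENSIONS
                   for s in segments[:-1])

    print(treats_as_dynamic("http://example.com/page.php/32"))  # True  - treated as dynamic (maybe skipped)
    print(treats_as_dynamic("http://example.com/page.xtx/32"))  # False - crawled like a plain file

If something like that is in there, any known script extension followed by path info gets treated as a parameter, which matches exactly what I'm seeing.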
Before posting here I was about to add some bogus file extension to Apache's configuration and change my pages' PHP extension to .bogus so I'd get crawled by Googlebot again... I figured posting would be easier, and more appropriate, since it might help you figure out what's up too.
On other sites, .html pages are being missed. The examples above are PHP-only as a single case study; on the others, Google is missing internal pages, .html pages, .php pages, and lots of useful content from very old sites that have always been crawled OK.
FAST and Inktomi are crawling heavily right now and getting all these pages perfectly at very high speeds, so this is definitely not a DNS issue or a local server issue - it is something Google changed in Googlebot before the February crawl.
Thanks, and please see what you can do. It's such an eclectic mix of sites going wrong that I really think you should have a look at the pages Googlebot is pulling in, to see whether you're getting a distorted index (I think you are, judging by some results in common searches recently; in one example I searched for a certain chemical compound and got a George Bush page back - I don't know what chemicals he was fond of, but surely this one would have killed him).
Lastly (sorry for the long post), I really think you're releasing early indexes just to stay on market schedule, maybe to stay within contract terms with the third parties you provide search to, etc. I don't know, but judging from the sample of 100+ sites I've been watching, I'm sure Google is rushing to market with an incomplete index. Remember: if you let the market drive you, you'll end up with a technically inferior product!
Regards, john.
There isn't any way to take a complete snapshot of the web. All you can do is try and develop a system that takes regular snapshots of the portions of the web that your users think are the most important. Obviously that's a subjective task, but I'd say based on the number of people using Google each day that they've done a pretty good job developing their crawling strategy.
You should know there's something weird with what you've done to Googlebot in February, and I think it's wrong; anything beyond that is beyond my obligation.
Thanks again, and keep an eye on that bot - other engines are doing a better job than Google, at least on the crawl side.
As if this makes a difference? Does Google have some perverse bias against foreign TLDs? My main site is out in the .ws namespace because slimy, filthy domain name speculators are foolishly hoarding the .com, .net and .org versions without using them. That .ws is ODP listed, but I'd think Google would know better than to discriminate based on TLD.
Most of the site is set up so that to view a page on, say, widgets, you would be linked to index.php?page=widgets, and for sprockets, index.php?page=sprockets, with every page linking to the others (for now - I'll be moving to that nice lil pyramid scheme soon, but I'm only using it on another site for now, on the same domain. It helps me organize fairly well, but it hasn't been around long enough to see how well Google reacts :) )
But I've heard the same complaints from many other people, with Googlebot not crawling dynamic URLs.
But I do have a question: with as many changes as Google makes, is the Googlebot/2.1 UA ever going to change to 2.2, or 3.0? I suppose doing so would break quite a few custom-made scripts that watch out for Google, but that's not necessarily a bad thing :)
I know that .php is in our list of "good" extensions and we crawl quite a lot of PHP pages every crawl. Have other people seen any problems with Googlebot crawling .php pages, or any other types of pages?
You never answered whether your sites are in the ODP. What is the PR of each site's home page, and how many mod_rewrite pages are on the site? What percentage of the pages are being missed?
I don't have any problems with my php, but I'm on a PR7 ODP page and have less than 3000 pages total, so it might not be a great comparison.
Google tries not to hit dynamic servers too hard. I suppose it's possible that they recognize your rewritten URLs as being dynamic and are slowing down their crawl.
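If that's what's happening, the effect might look roughly like this - a made-up politeness rule that backs off on anything that looks script-generated (nothing Google has confirmed, just an illustration, and the delay numbers are invented):

    from urllib.parse import urlparse

    BASE_DELAY = 1.0      # seconds between ordinary pages on one host (made-up number)
    DYNAMIC_DELAY = 10.0  # much longer gap for script-generated pages (made-up number)

    def crawl_delay(url: str) -> float:
        """Pick a politeness delay based on whether the URL looks dynamic."""
        parsed = urlparse(url)
        looks_dynamic = bool(parsed.query) or "cgi-bin" in parsed.path or ".php" in parsed.path
        return DYNAMIC_DELAY if looks_dynamic else BASE_DELAY

    print(crawl_delay("http://example.com/about.html"))              # 1.0
    print(crawl_delay("http://example.com/index.php?page=widgets"))  # 10.0
    print(crawl_delay("http://example.com/page.php/32"))             # 10.0 - rewritten but still flagged

A rule like that would slow the crawl of a big dynamic site right down without actually skipping anything on purpose.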
You should know there's something weird with what you've done to Googlebot in February, and I think it's wrong; anything beyond that is beyond my obligation.
You must have some incredibly unique content for anyone to ever miss it.
If Google is trying to rationalise its bandwidth/spidering usage, then the new index will probably be fresher in terms of content. It is no use having the highest number of webpages crawled if that data is stale. On the PR side of things, a fresher index would be a good thing and should increase the accuracy. Again it gets down to this whole PR argument and how it is calculated. I don't think that the actual Google algorithm was ever published, so, to use a terrible pun, everyone is just dancing in the dark. :)
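The closest thing anyone has is the simplified formula from the original Stanford paper - PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T linking to A - and whatever Google runs today has surely moved on from that. A toy version, just to show the mechanics:

    # Simplified PageRank from the original Stanford paper, damping factor d ~ 0.85.
    # A toy illustration only; whatever Google actually computes now is not public.
    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pr = {page: 1.0 for page in links}
        for _ in range(iterations):
            new_pr = {}
            for page in links:
                incoming = sum(pr[src] / len(links[src])
                               for src in links if page in links[src])
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))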
Regards...jmcc
Does Googlebot penalize pages that don't return a file size (Content-Length) in the HTTP headers? Does it even use that as a factor? If so, the fact that many dynamic pages won't return a document size may be what's screwing you guys over :p
Except... my PHP pages don't return a file size, but I get crawled just fine; then again, I'm operating on a much smaller scale, it seems ^_^
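For anyone curious about their own pages, a quick diagnostic sketch to see whether a Content-Length header comes back at all (whether Googlebot actually cares is pure speculation, and the URLs are just placeholders):

    from urllib.request import Request, urlopen

    def has_content_length(url: str) -> bool:
        """HEAD the URL and report whether a Content-Length header was sent."""
        request = Request(url, method="HEAD")
        with urlopen(request) as response:
            return response.headers.get("Content-Length") is not None

    # Static files usually report a size; script output often doesn't.
    for url in ("http://example.com/index.html", "http://example.com/index.php"):
        print(url, "->", has_content_length(url))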
If you have a site with a million pages, and you are many links removed from the ODP or Yahoo, you might not get all your pages crawled.
A smaller site would probably be okay, as would that large site if it had a PR7 ODP link, lots of other incoming links, and some deep links.
There is probably more to calculating where they will crawl next than just a simple queue, but being in one of the seed directories should definitely increase your odds of being fully crawled.
There is probably more to calculating where they will crawl next than just a simple queue, but being in one of the seed directories should definitely increase your odds of being fully crawled.
That would have Google applying an element of pre-loading.
Regards...jmcc
Google has to start the crawl somewhere. Why not start it from known sources of a large variety of links that have a history of having sufficient PR to spread through their tree?
It makes a lot more sense for them to stick www.yahoo.com and www.dmoz.org into the queue than it does for them to start with my website.
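In crawler terms that's just seeding the frontier with the big directories and working outwards - something like this toy breadth-first sketch (pure guesswork about Google's real scheduler, which certainly weighs PR and plenty more; fetch_links is left to the caller):

    from collections import deque

    SEEDS = ["http://www.yahoo.com/", "http://www.dmoz.org/"]

    def crawl(fetch_links, max_pages=1000):
        """fetch_links(url) -> list of URLs found on that page (not implemented here)."""
        queue = deque(SEEDS)
        seen = set(SEEDS)
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen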
Google's site does say
2. What else can I do to get listed in Google?
Google partners on the Web include Yahoo! and Netscape. If you are having difficulty getting listed in the Google index, you may want to consider submitting your site to either or both of these directories.
...
Once your site is included in either of these directories, Google will often index your site within six to eight weeks.
It's not proof that they do this, but it really makes sense to do it this way.
I have little doubt of this. In fact, I pray 5 times a day facing the Googleplex. ;)
Powdork, rfgdxm1, relax.
We are relaxed. As ODP editors, all we have to do is sit back, relax, and collect paychecks. Rfgdxm1, did you get yours? Mine is late. ;)
I am always impressed with how GG handles things. I just don't understand how Google can place such importance on something that is hurting so badly.
GoogleGuy: Normally the tagline is "Are you listed in directories like Y! or the ODP?" Is there any significance to your not mentioning Yahoo!?
Google has to start the crawl somewhere. Why not start it from known sources of a large variety of links that have a history of having sufficient PR to spread through their tree?
Almost every search engine I've seen in the past few years at a local level tends to rely on Dmoz/ODP for input, and Google does not seem to be different there. What I do sometimes wonder about is how Google derives its initial index. From what I can see, it is crawler-based and follows links. This could mean that a site with no inbound links does not get included, and thus directories like Dmoz/ODP and Yahoo, along with the smaller local and niche directories, are critical to Google's existence.
Regards...jmcc
Some sectors would be almost impossible to crawl without the directories, since they are stingy with their outgoing links, but you could probably find much of the web by seeding the crawl from Stanford University's servers.
I don't think that the directories are critical to Google, but they make the job of covering the web much easier.
On the ccTLD scene, the local directories would be a lot more important to Google than on the com/net/org scene. Many ccTLDs do not release details of the domains registered under them, and consequently ccTLD websites can be difficult to find, even with crawling.
Some sectors would be almost impossible to crawl without the directories, since they are stingy with their outgoing links, but you could probably find much of the web by seeding the crawl from Stanford University's servers.
The top-level-domain websites are the easiest to find, and generating a complete list of all com/net/org websites is trivial. The big problem - and this is where crawlers tend to be very useful - is finding the /~ personal subdirectories.
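Those personal pages only turn up when something links to them, though they're easy enough to recognise once a crawler sees the link - roughly like this (a hypothetical helper, not any engine's actual code):

    import re

    # Pick out /~username/ style paths from links found while crawling.
    TILDE_DIR = re.compile(r"^/~[^/]+/")

    def personal_pages(links):
        return [link for link in links if TILDE_DIR.match(link)]

    print(personal_pages(["/about.html", "/~jsmith/papers.html", "/dept/index.html"]))
    # -> ['/~jsmith/papers.html']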
What it all comes down to is that Google's PR gives a qualitative assessment of the webpages it indexes.
Regards...jmcc