|Googlebot fetching GZIP compressed pages, concurrent test?|
Googlebot is clearly doing two concurrent crawls: one with an enhanced bot using HTTP/1.1 and requesting GZIP-compressed pages. The compressed request is optional; your server may serve up compressed or uncompressed pages. Take a close look at your log files. If your web server is Apache 2.0 and has the compression module (mod_deflate) loaded, you'll see the byte count of your web page fetches has dropped by a factor of 4 to 5. Depending on how many servers support GZIP compression, Google could crawl 4 to 5 times faster. They've probably set up a complete test set of servers and are reviewing the results of grabbing compressed files, seeing how much garbage they get.
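The 4-to-5x figure is easy to sanity-check with Python's standard gzip module. A quick sketch (the sample markup is made up, and artificially repetitive text like this compresses even better than a typical real page):

```python
import gzip

# Stand-in for a page of repetitive HTML; real pages usually see 4-5x,
# repeated text like this compresses far better still.
html = ("<div class='item'><a href='/page'>Example link</a></div>\n" * 200).encode()

compressed = gzip.compress(html)
ratio = len(html) / len(compressed)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.1f}x smaller)")
```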
By using different user agent strings, which they appear to be doing, they can pick up cloaked pages. By fetching what appear to be non-existent pages they can detect page and site hijacking. It looks like they're really experimenting. Here's a utility [leknor.com] that can check your site for compression.
For more info and other utilities:
For the Apache server, look at version 2.0. Version 1.3 can compress too, but it's tough; see the documentation:
Ya, noted in a few other threads. Thanks.
This is the 4th(?) time that Google has tested Gzip?
could explain many things ..very neat find
I have seen a lot of different Google User Agents lately, but none different enough to fool cloaking software. Any examples of the ones you've seen bumpski?
perhaps I am being a muppet, but all of my pages are gzipped, they are all in and all rank?
When was it ever in question in recent times?
Gzip has to be accepted. You may have it turned on, but your users' browsers have to let your server know that they support it. This means Googlebot is now negotiating for it.
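That negotiation is nothing more than a header check. A minimal sketch of the server's side of the bargain (the respond() helper is my own name, purely illustrative):

```python
import gzip

def respond(body: bytes, accept_encoding: str) -> tuple:
    """Serve gzip only when the client advertises support for it."""
    if "gzip" in accept_encoding.lower():
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}  # no "Accept-Encoding: gzip", so send it plain

page = b"<html><body>Hello, Googlebot</body></html>"

# A client that sends no Accept-Encoding header gets the plain page:
body, headers = respond(page, "")
print(headers)  # -> {}

# A client that negotiates, as the new Googlebot does, gets gzip:
body, headers = respond(page, "gzip,deflate")
print(headers)  # -> {'Content-Encoding': 'gzip'}
```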
Googlebot has not been requesting compressed pages for a long time. Now, in the last few days, Googlebot is requesting compressed pages. I'm sure many servers are serving compressed pages, but until now not to Googlebot.
Internet Explorer and Netscape do request compressed pages, unless you're like me and have N....n Internet Security 2003, which just blatantly turns this capability off. If you temporarily disable Internet security, you can see that your browser will typically request compressed pages. The page at www.webcompression.org will tell you, at the bottom, whether your browser is requesting compressed pages or not.
Compression is yet another way to spam Google, because you can dynamically serve different content based upon the content of the request. But all Google has to do is randomly produce requests for uncompressed pages and compare the results to the compressed page to detect spam, so perhaps they're getting around to it. By quadrupling their available network and server bandwidth they can spend a lot more time detecting spam without loading our servers, their servers, or the Internet itself. This also leaves more CPU time for PageRank calculations.
Perhaps this is their approach to correcting PageRank: get rid of the abusers and maybe it will start to work again. They do seem to be doing a lot of banning recently; this may be an enhancement that allows for a totally automated approach.
Certainly with their influx of funds it makes sense to have a full duplicate set of bot servers. They can hide the "dance" completely simply by toggling between server sets, which can be done very quickly: build a new index and "switch".
I've seen requests from Google with no agent string at all, though I think those were from the image bots; others have reported Google IPs with no user agent. With only a few unknown or secret IPs and random sampling, Google could have detected cloaking quite a while ago. It may just have been a matter of funding or bandwidth, who knows? They seem to be tackling bandwidth.
Apache 2.0 dynamically compresses pages on the fly, which is very convenient: no extra work, except for the CPU, and the CPU is offloaded by the reduction in bytes to move. Unix is renowned for this inefficiency: moving bytes.
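For anyone who wants to try it: the on-the-fly compression in Apache 2.0 comes from mod_deflate. A typical httpd.conf fragment looks something like this (the MIME-type list is a minimal example, adjust to taste):

```apache
# Load the compression filter (ships with Apache 2.0)
LoadModule deflate_module modules/mod_deflate.so

# Compress text responses on the fly; leave images etc. alone
AddOutputFilterByType DEFLATE text/html text/plain text/css
```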
I think the next few weeks (or months) are going to be interesting!
Yes, this is quite interesting; new ideas from newly hired minds. I wonder if this will cause any issues of sites being dropped, or the not-so-friendly Florida update syndrome known as FARS, Florida Automated Removal System. Sounds like SARS. (Smiling)
We will see MANY new things from Google, or so the wall in the secret bathroom has been telling me.
While in Europe I heard from the largest internet marketing firm there that there are at least 160 more things being worked on than the public version of Google Labs tells us.
FYI - Keep an eye on Google...
I think one of the reasons that gzip support is taking so long from the server side is this:
I run a web hosting company. If I enable gzip support, here is what happens:
1) The load on my server goes up due to the increased CPU time necessary for page compression.
2) The bandwidth my clients use goes down.
Both of these things are negatives... where are the benefits?
|no extra work, except for the CPU, but the CPU is off loaded by the reduction in bytes to move. Unix is renowned for this inefficiency; moving bytes. |
Really? I'd have thought (for example) Suns were renowned for shifting data, with historically SCSI hard drives, SBUS (DMA on everything), etc., compared to PCs with IDE, PCI, etc. About the only thing I'd put above a proper UNIX box for shifting large amounts of network traffic would be dedicated routers (Cisco etc.). Don't forget NetBSD currently holds the Internet2 land speed world record.
GZIP has a sliding scale of compression from 1 (least compression) to 9 (most). You can pick the balance between reducing bandwidth usage and minimising CPU usage; once you get past 3 or 4 you tend to suffer from the law of diminishing returns. If you use content negotiation with static files, you can compress the files once and serve them many times, getting the best of both worlds.
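The diminishing returns are easy to see for yourself. A quick sketch comparing levels on some made-up page-like text (exact sizes will vary with the data):

```python
import gzip
import random

# Moderately repetitive text standing in for a real page.
random.seed(0)
words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
data = " ".join(random.choice(words) for _ in range(20000)).encode()

# Sizes shrink quickly at first, then barely move past level 3 or 4.
for level in (1, 3, 6, 9):
    size = len(gzip.compress(data, level))
    print(f"level {level}: {size} bytes")
```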
Interesting questions! The IEEE Communications Society and others have studied UNIX's communications stack (TCP/IP, UDP, etc.) and operating systems extensively, and found UNIX (and most operating systems) very inefficient in the communications layers of software. Most hardware is very efficient at moving bytes for communications, but the software architecture prevents taking full advantage of those efficiencies.
To maintain software protection boundaries between drivers, communications software, and application software, messages that arrive or are sent are copied numerous times by the operating system. The TCP protocol does a checksum on every message even though the communications hardware is already doing a CRC. By reducing the number of messages that must be sent, you reduce the number of memory copies the operating system must do, the number of checksums calculated, and the number of interrupts and context switches needed to move messages. All this can easily compensate for the one new algorithm that actually does the compression. One must look at operating system, communications system, and application CPU usage to get the whole story.
I'm considering statically compressed pages (content negotiation) because one of my webhosts does not support compression, not even mod_gzip in Apache 1.3. But then I lose server-side includes, etc., which are very convenient. It adds quite a maintenance burden for the website owner, and finally, Google may still look upon it as a potential source of spam. Dynamic, on-demand compression could make Google feel safer; I don't know how they would tell the difference, though.
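For what it's worth, the static route can be scripted so the maintenance burden is one command rather than hand work. A sketch (the precompress() helper and paths are mine, purely illustrative) that writes a .gz twin beside every page, which Apache's content negotiation can then serve:

```python
import gzip
from pathlib import Path

def precompress(root: str) -> None:
    """Write a compressed .gz twin beside every .html file under root."""
    for page in Path(root).rglob("*.html"):
        gz = page.parent / (page.name + ".gz")
        gz.write_bytes(gzip.compress(page.read_bytes(), compresslevel=9))
```

Run it after each site update; with MultiViews or explicit AddEncoding rules, clients that accept gzip get the .gz file and everyone else gets the original.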
So, to be brief, total CPU usage shouldn't go up. I can see a small memory usage increase.
Regarding bandwidth per client going down: doesn't that mean more clients per server, less communications hardware, etc.? I can see the revenue effect, but that can be fixed.
My customer on a 56K modem sees my webpages in one fourth the time (hopefully); that means more revenue for me, and then maybe I'll buy more webhost space! Of course, now ISPs are doing the compression (accelerators), but that doesn't help the Google crawl.
Even better Google can crawl my site 4 times as often, getting my new information and pages indexed much sooner (time to market!).
|I'm sure many servers are serving compressed pages, but until now not to Googlebot. |
Why on Earth would a webserver check the user agent before deciding if it is going to serve compressed content?
It's the client's responsibility to declare what encoding it supports, by sending an Accept-Encoding header with a value such as "gzip,deflate", while it's up to the server to ultimately decide if it wants to serve compressed content or not. There is no reason why it should even bother checking the user agent.
The more bots and sites support compressed transfers, the better for everyone - apart from hosting companies that charge for bandwidth. If it were all up to these companies we'd still be in the age of pay-per-minute connectivity; thankfully that's not the case in most parts of the developed world.
|Why on Earth would a webserver check the user agent before deciding if it is going to serve compressed content? |
Braindead user-agents claiming to support things they don't.
There are some things you just have to work around.
|Braindead user-agents claiming to support things they don't. |
ah fair play then...
|I'm sure many servers are serving compressed pages, but until now not to Googlebot. |
Why on Earth would a webserver check the user agent before deciding if it is going to serve compressed content?
I guess I didn't explain it very well. Googlebot, as our web server's client, has been using HTTP 1.0 in its requests, and those requests did not ask for a compressed page.
On Sept 28th or so, Googlebot's request used HTTP 1.1 and it was (finally) requesting GZIP compressed content. Google appears to be using a new set of servers (Googlebots) to make this happen.
If nothing else, Google can use this fast compressed crawl to check for spamming in many ways and still retain their old HTTP 1.0 crawl. The user agent on this crawl was "Mozilla 5.0" as well, not Googlebot, but Google still included the link to the Googlebot info page. I've read many articles wondering why Googlebot isn't requesting compressed content, and they do seem to be testing it again. Google will always have to request some pages compressed and some uncompressed to check for cloaking that takes advantage of compression. This may be why it has taken them so long to start using compression.
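The cross-check itself is trivial once you have both responses in hand: decode the compressed body and compare it to the plain one. A sketch (same_content() is my own name for it; the page bodies are invented):

```python
import gzip

def same_content(plain_body: bytes, gzipped_body: bytes) -> bool:
    """True if the compressed response decodes to the same page."""
    return gzip.decompress(gzipped_body) == plain_body

honest = b"<html>the page everyone sees</html>"
cloaked = b"<html>keyword-stuffed page served only to the bot</html>"

print(same_content(honest, gzip.compress(honest)))   # -> True
print(same_content(honest, gzip.compress(cloaked)))  # -> False: cloaking
```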
One of my webhosts doesn't provide GZIP compression at all; I hope my PageRank doesn't go down in the future!
1) Investigate "zero copy". It addresses many of your concerns.
2) The IP header checksum was dropped in IPv6; everyone realized it was a waste of CPU time. (TCP and UDP still carry their own checksums, though.)
>On Sept 28th or so
I checked my logs. On one server, I see the new HTTP/1.1 Mozilla/5.0 bot appearing on 10 August. So it's been doing it for a while.
Currently my webhost does not have mod_gzip.
I put together a small PHP script which gzips the data before sending it to the client, if the client supports it, along with the right headers.
This PHP script runs fine.
What would happen if the webhost suddenly enables mod_gzip some day?
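I can't say exactly what mod_gzip would do with already-gzipped output, but the risk is double compression, which a client that sent a single Accept-Encoding header can't decode. A defensive sketch of the idea (in Python rather than PHP; the helper name is mine): check for the gzip magic bytes before compressing again.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def maybe_compress(body: bytes) -> bytes:
    """Compress the payload unless it is already gzip data."""
    if body.startswith(GZIP_MAGIC):
        return body  # already compressed; a second pass would garble it
    return gzip.compress(body)

page = b"<html>hello</html>"
once = maybe_compress(page)
twice = maybe_compress(once)  # second call is a no-op
print(once == twice)  # -> True
```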