Deep Crawl

Forum Moderators: Robert Charlton & goodroi

Googlebot on the prowl!

hoteimports

12:26 am on Jan 23, 2005 (gmt 0)

Googlebot is on the prowl! Just had a very, very deep crawl. Got all 18,000 pages!

encyclo

1:18 am on Jan 26, 2005 (gmt 0)

Google is going after loads of Javascript linked pages on my site.

What does the link markup look like? Does it include a recognizable URL within the Javascript? Is Googlebot fetching external .js files (or .css files, for that matter)?

If the irrelevant pages are from your own site, a robots.txt file or a few judicious robots meta tags with noindex should help out.

Marshall Clark

1:29 am on Jan 26, 2005 (gmt 0)

Kukenan & Robster124

Are the JavaScript links complete URLs embedded in the script, or are they partial URLs (with the remainder of the URL retrieved from an external .js file)?

Either one would be interesting, but the second one would be mighty impressive as well. :)

taxpod

2:18 am on Jan 26, 2005 (gmt 0)

I can't believe how many pages Gbot is pulling down, or the rate. It grabbed 5K pages per hour for four straight hours. I haven't seen this kind of speed in a long time. It used to be a steady stream of 1K or 2K pages. This rate is causing me to worry about server overload.

Critter

2:50 am on Jan 26, 2005 (gmt 0)

5K pages/hour is less than 2 pages per second.

Last big crawl I peaked at 170 pages per second.

Sweet dreams. :)
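
For anyone who wants to check the arithmetic in this exchange, here is a minimal sketch of the unit conversion. The function names are made up for illustration; the only inputs are the 5K pages/hour and 170 pages/second figures quoted in the posts above.

```python
SECONDS_PER_HOUR = 3600


def pages_per_second(pages_per_hour):
    """Convert an hourly crawl rate to pages per second."""
    return pages_per_hour / SECONDS_PER_HOUR


def pages_per_hour(pages_per_second_rate):
    """Convert a per-second crawl rate to pages per hour."""
    return pages_per_second_rate * SECONDS_PER_HOUR


# 5K pages/hour, the rate reported above
print(f"{pages_per_second(5000):.2f} pages/second")  # 1.39 pages/second

# 170 pages/second, the peak reported above
print(f"{pages_per_hour(170):,} pages/hour")  # 612,000 pages/hour
```

So the two reported rates differ by a factor of roughly 120.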

juvenall

6:39 am on Jan 26, 2005 (gmt 0)

Wow! I was just looking at my logs the other day and saw Google last hit me around the 18th. After reading this, I processed my logs and saw Google's been hitting me on and off on the 24th and 25th. I hope it keeps up the pace today... hehe

Keep on truckin' Googlebot!

balam

6:58 am on Jan 26, 2005 (gmt 0)

> Where is the significance in this Mozilla-bot-thing?

> Maybe the moz-5.0 version accepts newer standards?

See this thread [webmasterworld.com] where GG offers some info.

pmkpmk

10:19 am on Jan 26, 2005 (gmt 0)

I had a closer look at my logs spanning all of my sites right now, and I am a bit surprised. While it is true that on Jan 23 the GoogleBot activity was remarkably high, on an overall scale for January Yahoo's Slurp outpaced every other spider!

Slurp accounted for 73% of spider traffic (more than 8000 requests), whereas GoogleBot accounted for only 20%. The Mozilla version of GoogleBot made up only 0.15%.

The funny thing is though, that I'm not doing particularly well on Yahoo.

phantombookman

11:11 am on Jan 26, 2005 (gmt 0)

Same here. I can't recall Gbot being so greedy. I wonder why?
msnbot is going crackers as well.

Critter

1:40 pm on Jan 26, 2005 (gmt 0)

Hah! Yahoo bot? Puh-lese.

Ask Jeeves' bot *regularly* crawls 10 times as many pages on my site as any other bot, including Googlebot.

gmiller

9:23 am on Jan 27, 2005 (gmt 0)

I was testing some updates to one of my sites a little while ago, and noticed that PR (according to one of my Firefox extensions) had shot up on a number of pages. Then it occurred to me that one of those pages had been at its current URL less than 24 hours. I restructured some things and set up 301 redirects, and the new URL is already PR7 (as I recall, the old one was PR2 yesterday).

So I checked a few searches, and noticed one of my pages had a 1/25/2005 date in the SERPS while showing a title that was changed on 1/26/2005. Go figure. Looks like a lot of things are updating.

johnnie

11:38 am on Jan 27, 2005 (gmt 0)

GB is going into a total frenzy on my site. Very, very deep crawl here. Two days ago my site was in their index for the first time; now it's been reduced to URL-only again. What's going on here?

bumpski

12:13 pm on Jan 27, 2005 (gmt 0)

At least on the 23rd Googlebot was using HTTP 1.1. This protocol requests dynamic GZIP compressed web page content. If your webhost supports GZIP (all can) you will see in your log files that the byte count for each file read is typically reduced by a factor of 4. Only 6% of webhosts support this free dynamic GZIP compression functionality, which cuts Internet bandwidth usage by a factor of 4.

Googlebot could crawl your site 4X faster with 4X less bandwidth usage if your site supports GZIP. So few sites support GZIP that Google has no real incentive to switch over to a faster crawler. When properly set up GZIP would speed up many, many websites.

More info:
[webmasterworld.com...]
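
A rough way to see what GZIP does to a page's byte count is to compress some markup locally. This is only a sketch using Python's standard gzip module; the sample page is invented and highly repetitive, so it compresses far better than a typical real page would, and the result says nothing about the 4X figure quoted above.

```python
import gzip


def compression_ratio(payload: bytes) -> float:
    """Return raw size divided by gzip-compressed size."""
    compressed = gzip.compress(payload)
    return len(payload) / len(compressed)


# Hypothetical page: one table row repeated many times (illustration only).
row = b'<tr><td class="item">Widget</td><td class="price">9.99</td></tr>\n'
sample_page = b"<html><body><table>\n" + row * 200 + b"</table></body></html>\n"

ratio = compression_ratio(sample_page)
print(f"{len(sample_page)} bytes raw, compressed about {ratio:.1f}x smaller")
```

Running the same function over a saved copy of one of your own pages gives a more realistic number for your site.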

Critter

1:58 pm on Jan 27, 2005 (gmt 0)

HTTP 1.1 does *not* request gzip encoding by default or as part of the protocol behavior. Googlebot is, however, sending an Accept-Encoding header with its request that specifies gzip encoding is acceptable.
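
In HTTP, a client lists the encodings it will accept in the Accept-Encoding request header, and a server labels a compressed response with Content-Encoding. The sketch below shows how a server-side check of that request header might look. The function name is hypothetical and the parsing is deliberately simplified (it handles q-values and the `*` wildcard but is not a complete implementation of the HTTP rules).

```python
def accepts_gzip(accept_encoding: str) -> bool:
    """Return True if an Accept-Encoding header value permits gzip."""
    explicit_q = None  # q-value given for gzip itself, if any
    wildcard_q = None  # q-value given for "*", if any
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        coding = parts[0].lower()
        q = 1.0  # a coding listed without a q-value defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if coding in ("gzip", "x-gzip"):
            explicit_q = q
        elif coding == "*":
            wildcard_q = q
    # An explicit gzip entry takes precedence over the wildcard.
    if explicit_q is not None:
        return explicit_q > 0
    return wildcard_q is not None and wildcard_q > 0


print(accepts_gzip("gzip"))          # True
print(accepts_gzip("gzip;q=0, *"))   # False
print(accepts_gzip(""))              # False
```

A request with no Accept-Encoding header, or one that does not list gzip, should simply get the uncompressed page.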

Vetteman

2:06 pm on Jan 27, 2005 (gmt 0)

Googlebot is also devouring my site.

Could this portend a serp update or a PR update?

bumpski

2:53 pm on Jan 27, 2005 (gmt 0)

Thanks, Critter, for clarifying the specifics.
There have been several questions regarding the differences between the two bots, and GZIP compression is a big one that few take advantage of.

To date the optional GZIP request correlates with Googlebot's indication of the HTTP 1.1 protocol.
When Googlebot uses HTTP 1.0 protocol it is definitely not requesting GZIP compressed content.
The protocol indicator 1.0/1.1 is almost adjacent to the page size in bytes so it's very convenient to use as a "GZIP" flag when reviewing your logs.

Even though this capability is available in virtually all web server software, only about 6% of all web hosts and therefore webmasters support this virtually free 4X performance improving and 4X bandwidth reducing technology.

As a 56K modem user, I'd sure like to see dynamic GZIP compression fully supported. Webmaster World unfortunately does not GZIP; Google does for SERPS.
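
The log check described above (using the HTTP/1.0 or HTTP/1.1 field next to the byte count) could be sketched like this. It assumes Apache's common/combined log layout, where the quoted request line is followed by the status code and the response size. The sample lines, addresses, and byte counts are invented for illustration and are not real log data.

```python
import re
from collections import defaultdict

# Matches: "GET /path HTTP/1.1" 200 12345   (size may be "-" when nothing was sent)
REQUEST_RE = re.compile(r'"[A-Z]+ \S+ (HTTP/\d\.\d)" (\d{3}) (\d+|-)')


def mean_bytes_by_protocol(lines, agent="googlebot"):
    """Average response size per HTTP version for lines mentioning `agent`."""
    totals = defaultdict(lambda: [0, 0])  # protocol -> [total bytes, request count]
    for line in lines:
        if agent not in line.lower():
            continue
        match = REQUEST_RE.search(line)
        if match is None or match.group(3) == "-":
            continue
        entry = totals[match.group(1)]
        entry[0] += int(match.group(3))
        entry[1] += 1
    return {proto: total / count for proto, (total, count) in totals.items()}


# Invented sample lines (documentation-range IP addresses, made-up sizes).
SAMPLE_LINES = [
    '192.0.2.10 - - [23/Jan/2005:00:26:01 +0000] "GET /a.html HTTP/1.1" 200 2000 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [23/Jan/2005:00:26:02 +0000] "GET /b.html HTTP/1.1" 200 3000 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.11 - - [23/Jan/2005:00:26:03 +0000] "GET /a.html HTTP/1.0" 200 8000 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.12 - - [23/Jan/2005:00:26:04 +0000] "GET /a.html HTTP/1.0" 200 8000 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

print(mean_bytes_by_protocol(SAMPLE_LINES))
```

To try it on a real log, pass an open file object instead of `SAMPLE_LINES`; comparing the same URL across the two protocol versions is what indicates whether compression is actually being served.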

Blackguy

7:58 pm on Jan 27, 2005 (gmt 0)

Would the pages crawled by the Mozilla Googlebot be indexed if they weren't crawled by the other Googlebot?

xcomm

9:19 pm on Jan 27, 2005 (gmt 0)

Ok - something is going on with GoogleBot. For my site it's normal that they try to catch up on crawling at the end of the month, as they always begin very slowly. But this time they seem to have some more punch, as reported here. Let's see what it means... As others have assumed, they seem to have been short on computing power last year. Maybe they put up some more clusters...

But to put all our hopes here into perspective a little bit, Google is really sick in many ways this time (SandBox, overrating links - link farm impact, hilltop oligarchy, big sites oligarchy, 2x32...). And if they are not able to sort this out in some way, they will drive it into the wall.

It would be time to put in some cure now.

Or they should at least abandon their ugly sandbox.
Simply search for your keyword again with 13x -adfs and see how good SERPS could be...

doclove

3:44 pm on Jan 28, 2005 (gmt 0)

So why are the results different when adding 13x -adfs than when searching normally? Is there a thread that I can read that explains this theory? I just tried it, and for my keyword term I am ranked 25th without the 13x -adfs and 1st with them, which is where I was prior to doing a 301 redirect.

Kukenan

3:58 pm on Jan 28, 2005 (gmt 0)

SandBox, overrating links - link farm impact, hilltop oligarchy, big sites oligarchy, 2x32

also: 301 redirects not working, 302 page hijacking...

It would be nice to make some sort of wishlist.
Who knows? maybe Google would listen.

bull

8:23 pm on Jan 28, 2005 (gmt 0)

would the pages crawled by the Mozilla googlebot be indexed if they were not crawled by the other googlebot?

No. Not in the "public" index.

pmkpmk

5:38 pm on Jan 31, 2005 (gmt 0)

I made it!

As of January 29, I am now listed as #1 for my most important keyword. Before the recent deep crawl, I was the runner-up for almost a year, with an on-topic, non-commercial site being #1.

I need to check other SERPS, but it seems the recent deep crawl is finding its way into the results.

nburne

2:05 pm on Feb 1, 2005 (gmt 0)

Is there any software I can use to see Googlebot crawling on my site live?

pmkpmk

2:34 pm on Feb 1, 2005 (gmt 0)

Assuming that you are running Apache on Linux, the easiest solution is to open a terminal window and then issue the command

tail -f access.log | grep -i googlebot
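
If tail and grep are not available, roughly the same thing can be sketched in Python. This is only an approximation of `tail -f` under simple assumptions: it polls the file once a second, does not handle log rotation, and may emit a partial line if the server is mid-write. The function names and the `max_idle_polls` parameter are made up for this example.

```python
import time


def is_googlebot(line):
    """Case-insensitive match, like `grep -i googlebot`."""
    return "googlebot" in line.lower()


def follow(handle, poll_seconds=1.0, max_idle_polls=None):
    """Yield lines as they are appended to an open file, like `tail -f`.

    With max_idle_polls=None this runs until interrupted; a number makes it
    stop after that many consecutive empty reads (handy for trying it out).
    """
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = handle.readline()
        if line:
            idle = 0
            yield line
        else:
            idle += 1
            time.sleep(poll_seconds)


def watch(path):
    """Print new Googlebot lines from the log at `path` until Ctrl+C."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # jump to the end so only new requests are shown
        for line in follow(handle):
            if is_googlebot(line):
                print(line, end="")
```

Calling `watch("access.log")` from the directory that holds the log then behaves much like the command above.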

There are more sophisticated solutions though. Some log analysis tools can do "live stats", for example. There are some CRM packages which offer website visitor chat functionality and give you a live view of your site's visitors. But these tools usually exclude spiders.

I personally use a tool called "What's on?", which monitors all of my sites' current visitors and which I have constantly open. It's the only one I found to do this stuff, and it has a few bugs and glitches, especially when it comes to DNS grouping and geotargeting. There are probably other tools as well.

pmkpmk

8:08 pm on Feb 1, 2005 (gmt 0)

The tool's name is actually "Who's on", and not "What's on".

RichTC

1:22 am on Feb 10, 2005 (gmt 0)

Can anyone tell me in layman's terms what they think 1,000 hits from Google relates to in terms of the number of pages it will cache?

How many Google hits are there to the average page of text?

This 55 message thread spans 2 pages: 55
