Are we sure that the Deep Crawl is limited to the 216s?

Something to think about

         

uber_boy

5:03 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



I've been lurking around this forum for close to a year now and, like everyone else here, have learned a great deal. Among the many things learned is that the IP addresses of the Fresh Bots start with 64, while the IP addresses of the Deep Crawl crawlers start with 216. This has been gospel here and I've witnessed many a newbie being dressed down for crying out that the Deep Crawl had started when, in fact, it was the Fresh Bots. In every case, the initial claim was met with a chorus of, "Are you SURE it's the 216s?"

Until today, I lived by the gospel of the 216s. Now, however, I'd like to formally call it into question based on three things I have observed at my site this morning. First, though, a bit of background.

My site is of the bibliographic variety and has millions of dynamically generated pages. It has a PR of 7 and, on average, gets about 100,000 pages read during the Deep Crawl and 50,000 pages read each month by the Fresh Bots. There was a time when I'd check the IP addresses to establish the difference between the two types of reads but, over time, it became clear that the domain names for the 216s took the form of crawl##.googlebot.com, while the domain names of the 64s took the form of crawler##.googlebot.com. Thus, I stopped paying attention to the IP addresses and took the domain names as a reliable indicator of what was happening.

With that said, let me now share with you some interesting observations from this morning. It begins with me noticing my site being hammered early this morning. A quick investigation revealed that it was googlebots of the crawl## variety, thus leading me to conclude that the Deep Crawl had begun. After reporting this here and having my claim questioned, I looked deeper and discovered that the IP addresses were of the 64.* variety, which could only mean two things: Google had changed its naming convention, or Google was now using the 64s to assist with the Deep Crawl.

I am inclined to choose the latter of these two possibilities for three reasons. The first is that the timing is right for a Deep Crawl. The second has to do with the intensity of the crawl: whereas Fresh Bots have traditionally maxed out at 1000 pages/hour at my site, the current crawl is 3000+ pages/hour. The third reason has to do with the length of crawl: the Fresh Bots have almost always left after an hour, whereas the current crawl of my site has been going on for several hours now.
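The pages/hour comparison above can be checked straight from an access log. What follows is only a rough sketch in Python: it assumes Apache-style log lines with a bracketed timestamp and the user agent in the line, and the function name `googlebot_hits_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Assumed timestamp layout: [10/Apr/2003:09:15:02 +0000] (Apache-style logs).
_TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):(\d{2}):\d{2}:\d{2}")


def googlebot_hits_per_hour(lines):
    """Count requests whose line mentions Googlebot, grouped by date and hour."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue  # skip hits from other visitors
        match = _TIMESTAMP.search(line)
        if match:
            day, hour = match.group(1), match.group(2)
            counts[f"{day} {hour}:00"] += 1
    return counts
```

Feeding it the lines of a log file (for example `googlebot_hits_per_hour(open("access.log"))`, where the file name is hypothetical) gives one count per hour, which makes a jump from roughly 1000 to 3000+ pages/hour easy to spot.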

In light of the foregoing, I am willing to risk being a heretic by saying that the gospel of the 216s may be false. That said, I will now sit back and wait for all of you to provide me with 101 reasons why I am a fool to say this. Can't wait!

bether2

5:18 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



uber_boy,

I wondered the same thing a couple of months ago. My site was then being "deepcrawled" by the freshbot (64.*) - as is happening again as I speak. Which made me wonder if the freshbot was taking over some of the tasks of the deepbot.

However, soon after (week or so?), my site was deepcrawled by the 216.* bot.

Beth

yankee

5:20 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



It's simple:

Freshbot only crawls pages linked from high PR pages.
Deepbot crawls all pages.

Just look at which pages are being crawled. That will tell you which bot it is. Two different behaviors.

I've only seen deep crawl behavior with 216. Deep crawl usually starts 4 or 5 days after the update begins.

uber_boy

5:25 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



With all due respect, Yankee, I've observed no difference in the behavior of the respective bots. This could simply be a function of my site's design but, in any case, I'm unable to differentiate between them on that basis. I suppose you could say that my site has "breadth" as opposed to "depth". Thus, the main differentiating factor in the past has been how many pages as opposed to the depth of the pages.

mbennie

5:26 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



I tend to suspect that the lateness of the recent updates has something to do with Google working toward combining deepcrawler and freshbot in an effort to be able to update continually rather than once/month.

It seems to me that this would be the next logical step for Google - real time updates - and would take quite a lot of testing and tweaking. It would also seem to require that freshbot take a more aggressive role in crawling and indexing.

Simply adding the ability to compute page rank on the fly would make freshbot the primary crawler for Google and could make the deepcrawl obsolete.

Just my suspicions...

affiliateguy

5:27 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



I may be wrong but I always go by the name of the bot, not the number.

crawler11.googlebot.com = fresh
crawl11.googlebot.com = deep
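For anyone who wants to apply this name rule to a whole log, here is a minimal Python sketch. The two patterns are simply the convention described in this thread, not anything Google has documented, and the function name `classify_by_hostname` is made up for illustration.

```python
import re

# Naming convention as reported in this thread (an assumption, not an official scheme):
#   crawler##.googlebot.com -> fresh bot, crawl##.googlebot.com -> deep crawl
_FRESH = re.compile(r"^crawler\d+\.googlebot\.com$", re.IGNORECASE)
_DEEP = re.compile(r"^crawl\d+\.googlebot\.com$", re.IGNORECASE)


def classify_by_hostname(hostname: str) -> str:
    """Return 'fresh', 'deep', or 'unknown' based only on the hostname."""
    host = hostname.strip().rstrip(".")  # tolerate a trailing dot from DNS output
    if _FRESH.match(host):
        return "fresh"
    if _DEEP.match(host):
        return "deep"
    return "unknown"
```

Note that this only reads the name. It says nothing about which IP range the hit came from, which is exactly the point being argued here.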

bether2

5:29 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



yankee,

Right now I'm seeing freshbot (64.*) crawling my highest PR pages and my lowest PR pages.

Beth

ciml

5:32 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I know where you're coming from uber_boy and mbennie, but I really think that it's just a case of Freshbot getting better and better. If Google didn't update for a few months then Google searchers wouldn't notice (just the small proportion of Google aware webmasters who mostly come here).

Of course Google could use their 64.68.* datacentre for the deep crawl. If they ever do, I guess that Google engineers will take bets on how long it takes us all to figure it out.

pendanticist

5:36 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Deleted by Pendanticist.

Pendanticist.

[edited by: pendanticist at 5:38 pm (utc) on April 10, 2003]

BigDave

5:36 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen this behavior before, where the "freshbot" hit a few of my PR1 pages during the deep crawl (I think it was October).

They are all generic machines that just boot off different images on the network. Just because we label them as being for one purpose or another doesn't mean that Google is unable to use them for whatever they want.

BigDave

5:38 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The name that a system uses to identify itself has nothing to do with the IP address you get when you do a lookup on the domain name.

HitProf

5:41 pm on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The 64. seems to be a hybrid :)

See also kendos' post in
[webmasterworld.com...]

or do a search for '64'.

uber_boy

11:52 pm on Apr 10, 2003 (gmt 0)

10+ Year Member



It's been said by others that there are only two kinds of people: those that divide them into two kinds, and those that do not. If that's the case, then I guess I'm amongst the former as I'm starting to think there are only two kinds of people at this forum: those who identify Google's bots by name, and those who identify them by IP. Once again, I'd count myself among the former as my site's had non-stop attention today (30000+ pages) from crawl## bots with an IP address in the 64 range. Their constant presence is very different from the crawlER bots I've had in the past from the 64 range who visited sporadically for an hour at most. So that said, can someone please tell me why, when faced with such evidence, I shouldn't assume this is the Deep Crawl?

Jesse_Smith

12:41 am on Apr 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have a high PR site, then you can say the Freshbot is exactly like the deepcrawl. After the last update there were about 1,700 links to my message boards. For the last week or so there have been around 3,400 links, so 1,700 links were added because of the freshbot. So yes, if you're lucky, you can call the freshbot a deepcrawl and update that goes on all month!

bether2

12:53 am on Apr 11, 2003 (gmt 0)

10+ Year Member



Earlier today, when I was seeing the "deep crawl by freshbot," I was seeing the IP 64.68.82.* in my logs, which, according to kendos' post in the above-mentioned link, would be the freshbot.

Where do you guys see "crawler99.googlebot.com" vs. "crawl99.googlebot.com"? Is it in your log files?

What I see in my log is "Googlebot/2.1+(+http://www.googlebot.com/bot.html)"

Am I looking in the wrong place?

Beth

ciml

10:35 am on Apr 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Beth, if your logs show the domain for each hit then you can use that (it can be spoofed, by the way). If your logs show the IP addresses, then you need a reverse DNS look up on the IP.
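A reverse lookup like that can be scripted. The Python sketch below uses the standard library's `socket.gethostbyaddr` for the reverse lookup and `socket.gethostbyname_ex` for a forward check of the returned name, since a reverse record on its own can be set to anything by whoever controls the IP block. The function names and the injectable `lookup`/`resolve` parameters are my own additions for illustration.

```python
import socket
from typing import Optional


def reverse_dns(ip: str, lookup=socket.gethostbyaddr) -> Optional[str]:
    """Return the hostname the reverse (PTR) lookup gives for an IP, or None."""
    try:
        hostname, _aliases, _addresses = lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return hostname


def forward_confirms(hostname: str, ip: str,
                     resolve=socket.gethostbyname_ex) -> bool:
    """Check that the hostname resolves back to the same IP address."""
    try:
        _name, _aliases, addresses = resolve(hostname)
    except OSError:
        return False
    return ip in addresses
```

In use, take an IP from the log, call `reverse_dns(ip)`, and trust the resulting name only if `forward_confirms(name, ip)` is also true.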

Brett_Tabke

10:42 am on Apr 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I think they are mixed between the two now. You can't tell the difference at this point in time. Someone pointed this all out in the forums last month, but I can not find which thread it was buried in.

bether2

1:20 pm on Apr 11, 2003 (gmt 0)

10+ Year Member



Thanks, ciml. My logs show the IPs. I've never seen logs except my own, so I didn't know that some show the domain.

Oh, just checked the stats that come with my hosting (for the first time in ages) and see the domains listed there. It's the crawlER that's on my site now.

Thanks,
Beth