Evolution of the new Mozilla Googlebot

         

catch2948

4:53 am on Apr 15, 2006 (gmt 0)

10+ Year Member



Last month, I started an experiment. The goal was to learn the crawling patterns of the new Mozilla Googlebot (herein referred to as "MoziG"). The site I chose to monitor (herein referred to as "target") is the same one I have been mentioning in my most recent posts (concerning indexed page counts). Target has approx. 3000 pages, which are a mix of static & dynamic urls.

Today, just over four weeks later, here is what I have found along the way:

03/17

Started experiment early today. MoziG (IP 66.249.72.201) visited target; looking for old pages (gone for almost a year). Late afternoon, starts crawling new category index page links.

03/18

No activity

03/19

Huge activity today. MoziG crawling more new category index page links. Noticing a new trend. MoziG crawling blocks of static url pages, then dynamic url pages (random # of pages in each block). Pages are being crawled by url length (shortest url to longest url).

03/20

Huge activity today. MoziG crawled only dynamic urls until 11:00 PM. Activity stopped for exactly 45 minutes. MoziG then started crawling only static urls (again, based on url length).

03/21

Same activity pattern as yesterday (including an exactly 45 minute break). But then back to dynamic urls for 5 minutes, then instantly switching to static urls (with no delay).

03/22

Total of 10 pages crawled, ending at 1:30 pm. Exactly 1 hour later, MoziG (IP 66.249.66.168) crawls 1 page (robots.txt). 3 hours later, MoziG (IP 66.249.72.200) tries to crawl 1 old page, then no more activity.

03/23 thru 03/26

Minimal activity. 10 pages or fewer crawled daily.

03/27

MoziG (IP 66.249.72.200) appears, grabs robots.txt. Exactly 12 hours later, MoziG (IP 66.249.71.40) appears, crawls robots.txt and main index. Exactly 1 hour later, MoziG (IP 66.249.72.200) returns, grabs robots.txt. No more activity.

03/28 thru 04/04

Non stop crawling. MoziG (IP 66.249.72.200) crawling both dynamic & static urls, based on url length (shortest to longest).

04/05 thru 04/07

MoziG (IP 66.249.72.200) crawling dynamic urls only.

04/08 thru 04/11

MoziG (IP 66.249.72.200) crawling only static urls until mid day 04/08. After that, very slow crawl of random pages.

04/12

MoziG (IP 66.249.72.200) crawling only pages with the very shortest & the very longest urls (nothing "in the middle").

04/13 thru 04/14

MoziG (IP 66.249.72.200) crawling random pages VERY slowly, until late 04/14. MoziG (IP 66.249.66.229) starting to appear.
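For anyone who wants to look for the same static/dynamic "blocks" in their own logs, here is a minimal sketch. It assumes a url counts as dynamic when it carries a query string, which may not match how every site is built. The function names `is_dynamic` and `crawl_blocks` are made up for illustration.

```python
from itertools import groupby
from urllib.parse import urlsplit

def is_dynamic(url):
    """Assumption: a url is 'dynamic' if it has a query string (e.g. /page.php?id=1)."""
    return bool(urlsplit(url).query)

def crawl_blocks(urls):
    """Collapse a chronological list of crawled urls into runs of (kind, page_count)."""
    return [
        ("dynamic" if dynamic else "static", sum(1 for _ in group))
        for dynamic, group in groupby(urls, key=is_dynamic)
    ]
```

Feeding it the urls from one day's Googlebot requests, in timestamp order, gives the size of each block and shows where the bot switched between the two kinds.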

Conclusions

1. MoziG has appeared from 5 different IPs in the last month.
2. Each IP shift (less the really quick visits by 66.249.66.186 & 66.249.71.40) brings an "upgraded" mode of crawling (e.g., the ability to crawl static & dynamic urls with no delay between them).
3. As opposed to the old Googlebots, there is no real "freshbot" & "deepbot" with MoziG. Any one of them can be either.
4. The exact time differences mentioned for 03/22 and 03/27 I believe are important. When I say "exact", I mean exact +/- 5 seconds at the end (logs show this). Ideas anyone? Lab testing maybe?
5. The single thing that ALL IP versions of MoziG had in common is that they all CRAWLED PAGES IN ORDER FROM SHORTEST URL TO LONGEST (in character length; doesn't matter if static or dynamic).
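
The "exact" gaps in point 4 can be checked mechanically. This is a rough sketch, not a definitive test: it takes a chronological list of (timestamp, IP) pairs and flags gaps that fall within 5 seconds of a multiple of 15 minutes. The 15 minute step is my own assumption (it covers the 45 minute, 1 hour, 3 hour and 12 hour gaps mentioned above), and the names `visit_gaps` and `is_round_gap` are hypothetical.

```python
from datetime import timedelta

def visit_gaps(hits):
    """hits: chronological list of (datetime, ip).
    Returns (ip_before, ip_after, gap) for each consecutive pair."""
    return [
        (ip_a, ip_b, ts_b - ts_a)
        for (ts_a, ip_a), (ts_b, ip_b) in zip(hits, hits[1:])
    ]

def is_round_gap(gap, step=timedelta(minutes=15), tolerance=timedelta(seconds=5)):
    """True if gap is within tolerance of a nonzero multiple of step."""
    if gap < step - tolerance:
        return False  # too short to count as even one step
    remainder = gap % step
    return min(remainder, step - remainder) <= tolerance
```

On a busy crawl day most consecutive requests are only seconds apart, so this is only interesting on the quiet days where a handful of visits are separated by long pauses.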

Please comment.

tedster

3:53 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pages being crawled in order by length of the url -- I'll watch out for that. It is a most unexpected observation, so thanks for the alert. I assume you mean a simple character count? Or does googlebot go through root first by length and then start in on a subdirectory?

catch2948

4:46 pm on Apr 15, 2006 (gmt 0)

10+ Year Member



Yes, I mean by simple character count. When I first started seeing the trend, I thought I may have accidentally performed a sort on the data from that log. But after closer scrutiny, the timestamps were completely chronological. The crawling pattern was so obvious, I could load the logfile into my text editor, scroll down rather quickly, and watch the "widening" effect on the right of the page. That is, as the urls got longer, more space was taken up on the right of the page.
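
Rather than eyeballing the "widening" effect, the same check can be scripted. The sketch below assumes an Apache/Nginx style "combined" access log and identifies the bot only by the word "Googlebot" in the user agent string. Both are assumptions, so adjust the pattern for your own log format. The names `googlebot_hits` and `length_order_ratio` are made up for this example.

```python
import re
from datetime import datetime

# Assumed "combined" log format; other formats need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?:GET|HEAD) (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_hits(lines):
    """Return (timestamp, ip, url) for Googlebot requests, in chronological order."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("ua"):
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((ts, m.group("ip"), m.group("url")))
    hits.sort(key=lambda hit: hit[0])  # order by timestamp only
    return hits

def length_order_ratio(urls):
    """Fraction of consecutive url pairs where the later url is at least as long."""
    pairs = list(zip(urls, urls[1:]))
    if not pairs:
        return 1.0
    return sum(len(a) <= len(b) for a, b in pairs) / len(pairs)
```

A ratio near 1.0 for a crawl session means the urls were requested shortest to longest. A ratio near 0.5 is roughly what a random order would give.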

ronburk

4:49 pm on Apr 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pages being crawled in order by length of the url

No such thing on my site.

In general, I think you picked exactly the wrong time to analyze Googlebot patterns.