New Mozilla Googlebot Crawling Same-Length URLs

Interesting to be sure...

Critter

11:56 pm on Nov 3, 2004 (gmt 0)

I was looking at my logs recently and found that the URLs the new Googlebot is crawling all have the same total length (full path plus file name), regardless of their place in my directory structure. This is quite consistent across *thousands* of URLs crawled at this time.

e.g.

/dir1/dir2/something.html
/dir1/something_else.html

Very curious indeed.

Critter

6:33 pm on Nov 4, 2004 (gmt 0)

Oops, I guess I should have given the example in fixed-width format:


/dir1/dir2/something.html 
/dir1/something_else.html 
/even_more_something.html

Know what I mean? :)
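
One way to check the same-length observation against your own logs is to tally Googlebot requests by the length of the requested path. The following is a minimal sketch, assuming the server writes Combined Log Format; the sample lines, IP addresses, and the function name `googlebot_path_lengths` are made up for illustration and are not from anyone's real logs.

```python
import re
from collections import Counter

# Pulls the request path and the user agent out of a Combined Log Format line.
# Assumption: the log uses the common "request" status bytes "referer" "agent" layout.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_path_lengths(lines):
    """Count Googlebot requests by the character length of the requested path."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[len(match.group("path"))] += 1
    return counts


# Hypothetical sample lines, invented for this example.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.1 - - [03/Nov/2004:23:40:01 +0000] "GET /dir1/dir2/something.html HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'192.0.2.1 - - [03/Nov/2004:23:40:03 +0000] "GET /dir1/something_else.html HTTP/1.1" 200 4096 "-" "{BOT}"',
    f'192.0.2.1 - - [03/Nov/2004:23:40:05 +0000] "GET /even_more_something.html HTTP/1.1" 200 2048 "-" "{BOT}"',
    '198.51.100.7 - - [03/Nov/2004:23:40:06 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for length, hits in sorted(googlebot_path_lengths(SAMPLE_LOG).items()):
        print(f"{length:3d} characters: {hits} requests")
```

If the bot really is working through one length at a time, a tally taken over a short time window should be dominated by a single length, as it is for the three sample paths here.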

digital

6:34 pm on Nov 4, 2004 (gmt 0)

Indeed..
I noticed the same thing! The bot works from the shortest URLs to the longest.

sandor

7:50 pm on Nov 4, 2004 (gmt 0)

yeah, noticed this about a month ago ... i see another engine doing it too .. i forget who but it's not a big name

hunderdown

8:55 pm on Nov 4, 2004 (gmt 0)

Hmm. Maybe short file names will replace keyword-stuffed file names....

internetheaven

9:57 pm on Nov 4, 2004 (gmt 0)

>Hmm. Maybe short file names will replace keyword-stuffed file names....

Uh-Oh, I use the file-name to give information to the php scripting. This could be a huge dent if it is true, guess I better go analyse my logs early this week ....

Rick_M

1:19 am on Nov 5, 2004 (gmt 0)

Someone mentioned last month in the Googlebot thread that the new bot spidered the site from shortest URL to longest (or vice-versa, I don't remember which) all in 1 session.

rfgdxm1

1:23 am on Nov 5, 2004 (gmt 0)

>Uh-Oh, I use the file-name to give information to the php scripting. This could be a huge dent if it is true, guess I better go analyse my logs early this week ....

So long as Googlebot still crawls all URLs, which it does first is likely not important.

Rick_M

1:58 am on Nov 5, 2004 (gmt 0)

The interesting thing (to me at least) about this is that it means that Google compiled a list of all pages from a certain site, sorted them, and then spidered an entire site at once. I'm not sure what the significance of it all is, though, aside from Google testing out new techniques for spidering sites and compiling the data. I did notice this sort of spidering last month from the new bots, but this month it seems a little heavier than last - and there appear to be more IP addresses being used for the new bots (more of them?). Just my observation, from a small number of sites.

robho

9:34 pm on Nov 5, 2004 (gmt 0)

For one of my sites, where the new Googlebot grabbed over 50,000 pages on Nov 1st, I did see it grabbing pages in URL-length order, but not just from shortest to longest.

It started the day with 11-character page names, worked its way up to 17 characters, and mixed together 17 and 18 characters for a while. Next came page names of 20 characters or more, not in length order (i.e. all lengths over 19 characters were jumbled up).

Then, after pausing for a fresh robots.txt, it started with 6-11 character urls, mixed up, with a few longer ones thrown in (picking up ones it missed earlier?), then again slowly increased the lengths, all 12 character, all 13 etc.

The actual sequence within url length was mostly a batch of pages in one directory in alpha order, then a page or more from elsewhere, then a batch from the previous directory in a fresh alpha sequence (not continuing from before), etc.

For the urls with over 20 characters, the sequence was alpha regardless of length, so it looks like there's a 19 or 20 character index being used somewhere (or a longer index that includes the http://domain bit).
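
To make the ordering described above concrete, here is a rough sketch of one sort key that would produce it: length order for short paths, plain alphabetical order once a path reaches the cutoff. This only illustrates the observed pattern and says nothing about how Googlebot actually schedules its crawl. The cutoff of 20, the names `LENGTH_CUTOFF`, `crawl_order_key`, and `sketch_crawl_order`, and the sample paths are all assumptions made for the example.

```python
# Assumption: taken from the "19 or 20 character" guess in the post above.
LENGTH_CUTOFF = 20


def crawl_order_key(path):
    """Sort key: length first (capped at the cutoff), then alphabetical.

    Paths at or beyond the cutoff all share the same first component,
    so among themselves they sort alphabetically regardless of length.
    """
    return (min(len(path), LENGTH_CUTOFF), path)


def sketch_crawl_order(paths):
    """Return the paths in the order the observed pattern would suggest."""
    return sorted(paths, key=crawl_order_key)


# Hypothetical paths, invented for this example.
SAMPLE_PATHS = [
    "/dir1/something_else.html",
    "/a/index.html",
    "/zz.html",
    "/a/much/longer/path/page.html",
    "/b.html",
    "/dir1/dir2/something.html",
]

if __name__ == "__main__":
    for path in sketch_crawl_order(SAMPLE_PATHS):
        print(len(path), path)
```

It does not model the interleaving of directories or the restart after the fresh robots.txt fetch, only the split between length-ordered short paths and alphabetical long ones.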

I don't think short urls would help in any way, it got to the longer ones as well as the short ones. And quickly!