Forum Moderators: Robert Charlton & goodroi

Googlebot crawling by ascending URL length

enigma1

3:21 pm on Jan 10, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not sure if this was mentioned elsewhere, but here goes. I have a site that exposes friendly links. Checking my server logs, in a couple of cases I see Googlebot accessing pretty much all the links on my site in a short period of time, in a peculiar order:

example.com/word
example.com/2-words
example.com/three-words
example.com/longer-words
example.com/some-more-words
etc..

It can't be random because I am talking about roughly 150 links. Link length varies between 5 and 50 characters (the query part), and some links have the same length. The pattern of the bot accessing progressively longer URLs continues until all site links have been accessed. The crawl comes from the same IP range (66.249.72.*) and lasts about 15 minutes. After that, it goes back to regular accesses.

I suspect it's checking how the server responds, but I don't understand the ascending link-length pattern. Maybe it's just how they retrieve the URLs from storage, sorted by length. Has anyone else seen it, or can anyone better explain the reason behind it?
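
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch. It assumes Common/Combined Log Format with the client IP at the start of each line; the function names and the sample log lines are invented for illustration, and only the 66.249.72. prefix and the example paths come from the post above.

```python
import re

# Matches the request line inside a log entry, e.g. "GET /some-path HTTP/1.1".
# The log layout is an assumption; adjust the pattern for other formats.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def requested_paths(log_lines, ip_prefix="66.249.72."):
    """Return the request paths, in log order, for clients matching ip_prefix."""
    paths = []
    for line in log_lines:
        if not line.startswith(ip_prefix):
            continue  # ignore every other visitor
        match = REQUEST_RE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def is_ascending_by_length(paths):
    """True if each path is at least as long as the one requested before it."""
    lengths = [len(p) for p in paths]
    return all(a <= b for a, b in zip(lengths, lengths[1:]))


# Invented sample lines, only to show the expected input shape.
sample_log = [
    '66.249.72.10 - - [10/Jan/2012:15:00:01 +0000] "GET /word HTTP/1.1" 200 512',
    '66.249.72.10 - - [10/Jan/2012:15:00:07 +0000] "GET /2-words HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Jan/2012:15:00:09 +0000] "GET /some-other-visitor-page HTTP/1.1" 200 512',
    '66.249.72.10 - - [10/Jan/2012:15:00:13 +0000] "GET /three-words HTTP/1.1" 200 512',
    '66.249.72.10 - - [10/Jan/2012:15:00:19 +0000] "GET /longer-words HTTP/1.1" 200 512',
    '66.249.72.10 - - [10/Jan/2012:15:00:25 +0000] "GET /some-more-words HTTP/1.1" 200 512',
]

print(is_ascending_by_length(requested_paths(sample_log)))
```

Ties (equal lengths) count as ascending here, since some links have the same length.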

g1smd

8:08 pm on Jan 10, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see this from time to time and it's been that way for a long time.

Certainly for URLs that are, or look like, folders, asking for
www.example.com/folder1/
before asking for
www.example.com/folder1/folder2/folder3/
is a good idea. If the first request returns 404, there may be no need to make the second request.
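
The idea can be sketched in Python as below. This is only an illustration of the reasoning, not a description of what Googlebot actually does: the helper names and the `fetch_status` callback are hypothetical, and the pruning is only safe if a 404 on a folder really means nothing valid lives beneath it, which a server does not have to guarantee.

```python
def ancestors(path):
    """Yield every parent folder of a path, shortest first.

    "/folder1/folder2/folder3/" yields "/folder1/" then "/folder1/folder2/".
    """
    parts = [p for p in path.split("/") if p]
    for i in range(1, len(parts)):
        yield "/" + "/".join(parts[:i]) + "/"


def crawl_shortest_first(paths, fetch_status):
    """Request paths in ascending length, skipping children of 404 folders.

    fetch_status is a hypothetical callback: path -> HTTP status code.
    """
    dead_folders = set()
    requested = []
    for path in sorted(paths, key=len):
        if any(parent in dead_folders for parent in ancestors(path)):
            continue  # a parent folder already returned 404, so skip this one
        requested.append(path)
        if path.endswith("/") and fetch_status(path) == 404:
            dead_folders.add(path)
    return requested


# Stand-in for real HTTP requests: only /folder1/ is missing.
statuses = {"/folder1/": 404}
order = crawl_shortest_first(
    ["/folder1/folder2/folder3/", "/folder1/", "/other/"],
    lambda path: statuses.get(path, 200),
)
print(order)
```

Sorting by length guarantees a parent folder is always tried before anything beneath it, because a parent path is always shorter than its children.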

deadsea

12:01 am on Jan 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with g1smd; I've seen this behavior for years as well. It appears to me that Google has several crawl modes:

1) Discovery: Crawling new urls. Crawls each url once. Will go very deep. Crawls in order of discovery.

2) Return crawling for freshness based on pagerank. Recrawls urls periodically. More often the more pagerank the page has.

3) Batch crawling of urls that haven't been crawled recently. These are very often urls that have no current links (and no pagerank). They typically have not been crawled in months or years. I typically see Googlebot crawling things like: malformed urls that Googlebot once upon a time saw a link to, old format urls that were redirected en masse to new urls years ago, and orphan pages. Googlebot will typically crawl around 1000 per batch and crawl them in order of ascending url length.
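
A guess at how mode 3 could produce this ordering, as a small Python sketch. Everything here is an assumption for illustration (the `next_batch` name, the "days since last crawl" input, the sample urls); only the batch size of around 1000 and the ascending-length order come from the observations above.

```python
def next_batch(days_since_crawl, batch_size=1000):
    """Pick the stalest urls, then order that batch by ascending url length.

    days_since_crawl is a hypothetical mapping: url -> days since last crawl.
    """
    stalest = sorted(days_since_crawl, key=days_since_crawl.get, reverse=True)
    batch = stalest[:batch_size]
    # sorted() is stable, so urls of equal length keep their staleness order
    # instead of falling into alphabetical order.
    return sorted(batch, key=len)


# Invented sample data.
history = {
    "/old-format-page": 400,
    "/a": 90,
    "/orphan": 700,
    "/zebra": 800,
    "/alpha": 500,
}
print(next_batch(history, batch_size=4))
```

With a plain sort on length, urls of the same length come out in whatever order they went in, so equal-length urls would not necessarily appear alphabetically.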

enigma1

10:37 am on Jan 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The requests do not resemble folders, just words with hyphens.


asking for www.example.com/folder1/ before asking for www.example.com/folder1/folder2/folder3/ is a good idea.

Your example with slashes won't work. You might block default access to a plain images folder:
www.example.com/images/
but you could very well have a friendly link that also includes the word images, like:
www.example.com/images/blue-widgets.html
which returns a valid page.

These URLs are also requested individually at other times, in no particular order and not at the same rate. I can't remember whether I gave Google a sitemap with the wrong links ages ago, but Google has a very long memory for a couple of incorrect words I once had, even though the misspelled links were short-lived.

I also see that, although they were crawled by length, they were not crawled alphabetically when the words had the same length.

I've seen this behavior for years as well.

Which behavior have you seen exactly? So you have these links:
www.example.com/red-widgets.html
www.example.com/blue-accessories.html
www.example.com/free-stuff.html

And Googlebot accesses
www.example.com/red
www.example.com/free
www.example.com/widgets
www.example.com/red-widgets

Have you seen that?

lucy24

4:25 am on Jan 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just noticed this phenomenon in a blocked robot and remembered the thread. Over and over and over again:

/
/fun/
/rats/
/games/
/fonts/
/ebooks/
/paintings/
/hovercraft/
/fun/judy.html

in that order. It's most striking in a fixed-pitch font ;) Matter of fact, I thought they were going for the other effect (the multi-directory version) in the seemingly random image hits:

/directory/lion.gif
/directory/directory/quluaqtutit.jpg *
/directory/directory/directory/cover.jpg

but apparently they couldn't sustain it.


* "Your stomach is rumbling".