Forum Moderators: Robert Charlton & goodroi


Does googlebot anticipate new URLs before they are published?


helpnow

3:00 pm on Apr 16, 2010 (gmt 0)

10+ Year Member



Let me tell you a little story...

One of my sites has hundreds of thousands of pages. We sell everything. We represent lots of manufacturers.

Now, when we add a new product, we assign it an ID number through our database, and this ID number is used at our site to access that product's web page, like www.mysite/109765, where 109765 is the product ID#.
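In other words, the bare numeric path maps straight to a product record. A minimal sketch of that kind of lookup (the names and data here are hypothetical, not our actual code):

```python
# Toy product table; in reality this would be a database lookup.
PRODUCTS = {
    109765: {"name": "Example Widget", "discontinued": False},
}

def page_for_path(path: str) -> str:
    """Map a path like '/109765' to a product page title, or a 404 message."""
    try:
        product_id = int(path.lstrip("/"))
    except ValueError:
        return "404 Not Found"
    product = PRODUCTS.get(product_id)
    if product is None:
        return "404 Not Found"
    if product["discontinued"]:
        return f"{product['name']} (discontinued)"
    return product["name"]
```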

So... Every time we added a new product from a supplier, we would just add a new ID#. When a supplier discontinued a product, we'd mark the product as discontinued, but let the ID# live. Over time, the ID numbers just kept growing and growing. It occurred to me that ID# 000001 was the oldest URL on our site, and possibly commanded the most respect with google. Why keep adding more IDs? Why not backfill and reuse the old IDs of products no longer available? No reason to keep access to those old products - redo them with new products!
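The backfilling scheme we switched to can be sketched in a few lines (this is my illustration of the idea, not our actual code):

```python
class IdAllocator:
    """Hand out product IDs, reusing discontinued IDs before minting new ones."""

    def __init__(self, highest_used=0):
        self.highest_used = highest_used
        self.free_ids = set()  # IDs freed up by discontinued products

    def release(self, product_id):
        """Mark an ID as reusable once its product is discontinued."""
        self.free_ids.add(product_id)

    def allocate(self):
        """Backfill from the free pool if possible; otherwise extend the range."""
        if self.free_ids:
            product_id = min(self.free_ids)  # reuse the oldest (lowest) ID first
            self.free_ids.remove(product_id)
            return product_id
        self.highest_used += 1
        return self.highest_used
```

The side effect described below is that while backfilling, the top of the range stops growing, which is exactly what made googlebot's probes beyond it visible.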

So, we changed our database to work in this fashion, and that is how our database/web site works now.

OK, so that's the setup, and here comes the scary part. The last time we added an ID#, let's say the ID# was 517999. So, ID# 518000 does not and never did exist. Not in our database, and certainly not at our site.

The hair on the back of my neck is starting to rise now. ; )

Curiously, a few months ago, googlebot started asking for ID #518000. In fact it asked for about 50 IDs, in the range of 518000 - 518050. Weird. So, I watched this go on for a few weeks, wondering if I should 301 them, why is googlebot doing this, etc. etc. It was very curious.

Eventually it occurred to me: well, if googlebot is desperate for that damn ID, why don't I feed it something? Alright. So the next time we had some new products to add, instead of backfilling, we added them to that new range. We had about 80 products to add, so we added them into the batch 518000-518080. OK.

googlebot crawled those IDs within hours, and they were indexed in a couple days. Awesome!

Guess what happened next? Do you have goosebumps yet?

Yup! Within about a week, googlebot started asking for 518080-518130. About 50 more of the next IDs in the expected range. It's been doing this for a while now, pretty much every day. These URLs don't resolve, so they show up in my WMT ("Network unreachable") and in my proprietary site error reports. I'm not sure how to proceed, because if I resolve them one way or another, I suspect they will, I don't know, move on to the next batch? It's insane, but, from a computer's perspective, quite logical.

You can draw your own conclusions.

[edited by: tedster at 3:33 pm (utc) on Apr 17, 2010]
[edit reason] moved from another location [/edit]

Reno

3:56 pm on Apr 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Forgive me helpnow for being dense (readily admitted!), but if I understand your interesting story correctly, the gbot began anticipating what your next pages should be, and if you feed it what it wants, it will want more? And more, and more, ad infinitum? I'm also wondering if this would be true if you were not using sequential numeric page IDs alone. For example, if you had had widgets-1 through widgets-700, would it next want everything through widgets-750?

..................

helpnow

4:48 pm on Apr 16, 2010 (gmt 0)

10+ Year Member



I do not know WHY it did what it did, I can only relate what I OBSERVED. -shrugging- Yes, it anticipated... Makes sense. After literally, no exaggeration, about 500,000 pages on my site in that sequence, the pattern is well established, and there is no other explanation: it began anticipating new URLs that would exist. It feels like it thinks it knows those URLs will come, so it keeps checking until they finally exist. Why wait and discover them in a fake surprise via a link? If you know they're coming, go get them now, right? I mean, "we want to index the whole internet", so if you can make a really good educated guess at where the next URL in a clear pattern will be, go get it now - no need to wait for a link to confirm what you already know. ; )

As far as how many more it wants, well, it has no idea what I am doing... 50 is arbitrary. Not sure why it settled on 50-ish... Maybe a batch of 50 is also part of the average pattern of my 500,000. In fact, as I think about it, yeah, that is probably close to right. It has been as high as 2000 or so in one shot, and of course, as small as 1 at a time. Maybe 50 IS the average - thanks, google, for running the numbers for me. It looks like it is simply trying to stay one step ahead of me. It has no way of knowing my delay in adding new IDs was because I changed my strategy and started backfilling. In fact, as I navel-gaze, I wonder if my stalling of the addition of new IDs makes google wonder if my site has gone stale. That would suck, and be an unfortunate consequence. As far as what other patterns it can discern, no clue... But if you take my URLs, put them in a sequential list, and add the date of discovery, it reads like a book, so it really is not farfetched. My kindergarten daughter is studying patterns and I am sure she could figure out the next number in the series, so it makes sense googlebot could figure out what the next URL would be. ; )
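To make the guessing game concrete, here's a toy sketch (purely my illustration - nobody outside Google knows what googlebot actually does) of how a crawler could extrapolate the next batch from IDs it has already seen:

```python
def predict_next_batch(seen_ids, batch_size=50):
    """Guess the next run of sequential IDs after the highest one seen.

    Toy illustration only: assumes a simple +1 sequence, which is all
    the observed behaviour above would require.
    """
    start = max(seen_ids) + 1
    return list(range(start, start + batch_size))
```

Feed it 517997-517999 and it asks for 518000 onward - the same "one step ahead" behaviour described above.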

Of course, this is MY site. No clue what googlebot is doing elsewhere. My tin-foil hat is on, this makes my head spin, but... It happened to me. Like I said, dunno why or how, but it happened, draw your own conclusions. ; )

tedster

3:44 pm on Apr 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's been reported for several years that googlebot will sometimes "fill in" forms using keywords from the site. It doesn't exactly type, of course, but if an obvious URL pattern exists, then googlebot guesses what the resulting URL would be and checks whether that URL also resolves.

So if googlebot anticipates URLs based on keywords, then the idea of anticipating based on a numerical pattern doesn't sound all that far-fetched to me.

jdMorgan

4:29 pm on Apr 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> These URLs don't resolve, so they show up in my WMT ("Network unreachable")

If that is the error message, it bears looking into, as it may indicate that your custom 404 error handling is not returning a proper 404 response code, or that it is doing a redirect before returning that 404 response code. You might want to check it using a server-headers-checker add-on.
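For what it's worth, you can do the same check without a browser add-on: fetch one of those phantom URLs without following redirects and look at the raw status code. A sketch (the URL is just a placeholder):

```python
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so we see the server's *first* response."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def raw_status(url):
    """Return the HTTP status code of the first response for `url`."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses (and unfollowed 3xx redirects) arrive here
        return e.code

# e.g. raw_status("http://www.example.com/518000")
# A correctly configured custom error page should yield 404 here,
# not 200 (soft 404) and not 301/302 (redirect before the error page).
```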

I'm not at all surprised that Google is "guessing" sequential URLs when the pattern is easily-discerned and seems predictable -- Their appetite is voracious, and they claim to want to index *all* the world's data. They also undoubtedly want to have the 'freshest' results, and to return your new pages faster than any other search service.

The good news here is that I doubt that they bother using predictive fetches unless they "like" a site... :)

Jim

g1smd

7:14 pm on Apr 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I never re-use an ID, never re-allocate it to a new product. I'd always worry that someone would place a repeat order for something and not get what they wanted as the ID now applied to a different product.

I also agree completely with jdMorgan's analysis of the error message. That doesn't sound quite right.

I've not noticed this 'pre-fetch' behaviour, but it should be easy to spot. Maybe it's not widespread, is a new feature, or applies only to sites over a certain size. Perhaps your incorrect server response 'exposed' this in your WMT data; perhaps it isn't normally listed there if the URLs simply return "404 Not Found" before any incoming links to them have been found elsewhere within the site.