Make sure your site doesn't have a "page infinity"

True Story.

Just moments ago, our database team noticed some very odd requests coming in. Someone using our interface was requesting "page 20,000" of a query that returns, at most, a few hundred results.

The problem? A little glitch in our pagination code, put there by an (a-hem) ex-employee. His happy legacy: you can request page one of the search results, then page two, then page three. But when you get to the end of the results, it doesn't stop. The "Next" link keeps letting you request page three thousand twelve. Three thousand thirteen. Three thousand fourteen.

The link was basically built like this:
<a href="/search/?p=[currentpage + 1]">Next</a>

This was built into our "search results" page. So, besides there being a potentially infinite number of queries to search for, there are also an infinite number of pages for each query. The number of valid URLs in our imaginary sitemap just became "infinity squared".

And who is the visitor who keeps following links, going next, next, next, next, next, next, next? Why, it's Googlebot, of course.

Let's not discuss the idiocy of letting Googlebot crawl and index SERPs, or indeed *any* paginated result scripts. We will not go there. Instead, let this be a somewhat obvious addendum to my recent Pagination for the Pro [webmasterworld.com] article: Stop when you get to the end.
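To make "stop when you get to the end" concrete, here is a minimal sketch in Python. It is not our actual code. The function names, the `q` parameter and the ten-results-per-page default are assumptions made up for illustration; only the `/search/?p=` part comes from the link above. The idea is simply to work out the last page from the result count, print "Next" only when a next page exists, and treat out-of-range page numbers as invalid.

```python
import math
from html import escape
from urllib.parse import urlencode


def total_pages(total_results: int, per_page: int) -> int:
    """Number of pages needed for the results (always at least one)."""
    return max(1, math.ceil(total_results / per_page))


def next_link(query: str, current_page: int, total_results: int,
              per_page: int = 10) -> str:
    """Return the "Next" link, or an empty string on the last page.

    The "q" parameter name is hypothetical; adapt it to your own URLs.
    """
    if current_page >= total_pages(total_results, per_page):
        return ""  # end of the results: no link, no trap
    href = "/search/?" + urlencode({"q": query, "p": current_page + 1})
    return f'<a href="{escape(href)}">Next</a>'


def is_valid_page(requested_page: int, total_results: int,
                  per_page: int = 10) -> bool:
    """True only for pages that exist; anything else could get a 404."""
    return 1 <= requested_page <= total_pages(total_results, per_page)
```

With 25 results at 10 per page, `next_link("golf", 2, 25)` still produces a link to page 3, `next_link("golf", 3, 25)` produces nothing, and `is_valid_page(20000, 300)` is false, so a request for "page 20,000" can be answered with a 404 instead of a database query.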

This kind of never-ending navigation loop is affectionately called a "spider trap". Once a crawler gets inside, it may never escape. It'll just keep crawling and crawling and crawling... eating your bandwidth like a dog on a never-ending trail of Scooby Snacks.

A simple link checker tool would have found this.

A less obvious example of this error happens on calendars and other date-choosing or time-choosing scripts. There's always a "next month". Your site can be booking golf course tee-off times for eons from now, when the sun has gone supernova and the solar system collapses in upon itself. When setting a limit for looped navigation, you need to ask yourself: how far ahead is far enough?
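The same cap works for calendars. Here's a rough sketch, again in Python and again hypothetical: the `/calendar/?month=` URL, the function names and the twelve-month horizon are all assumptions, so pick whatever "far enough" means for your own bookings.

```python
from datetime import date

# Assumption: nobody needs to book more than a year ahead.
MAX_MONTHS_AHEAD = 12


def add_months(d: date, months: int) -> date:
    """First day of the month that is `months` months after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def next_month_link(shown: date, today: date) -> str:
    """Return a "Next month" link, or an empty string past the horizon."""
    limit = add_months(today, MAX_MONTHS_AHEAD)
    following = add_months(shown, 1)
    if following > limit:
        return ""  # far enough: stop linking forward
    return f'<a href="/calendar/?month={following:%Y-%m}">Next month</a>'
```

The crawler can still walk forward month by month, but only until it hits the horizon, and then the trail of links simply ends.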

Don't build spider traps!

How to detect spider traps

Get yourself a link-checker tool. There are many available, free and otherwise. You want something you can use as your very own polite and innocent crawler. Let it loose on your own site and see what it finds. It'll be easy to see you have a "spider trap" when it... doesn't stop. As an added bonus, a personal crawling tool can expose all your 404 errors and timeouts (scripts that load slowly under stress) before Google notices them. Plus, if you turn up the "threads", you've got a poor man's stress-testing tool! Let your crawler request 50 pages per second, and you'll see how your server will respond on the fateful day you get Slashdotted or TechCrunched.
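If you'd rather see the principle than install a tool, the "it doesn't stop" test boils down to a crawl with a page budget. The Python sketch below is only an illustration, not a real link checker: `crawl` and `get_links` are made-up names, and `get_links` stands in for whatever fetches a page and returns the URLs it links to. No real HTTP requests are made here.

```python
from collections import deque
from typing import Callable, Iterable, Tuple


def crawl(start: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> Tuple[int, bool]:
    """Breadth-first crawl from `start`.

    Returns (pages_visited, budget_exhausted). If the budget runs out
    while unvisited URLs are still queued, the site is either bigger
    than the budget or it contains a spider trap.
    """
    seen = {start}
    queue = deque([start])
    visited = 0
    while queue:
        if visited >= max_pages:
            return visited, True  # still finding new URLs: suspicious
        url = queue.popleft()
        visited += 1
        for link in get_links(url):
            if link not in seen:  # each distinct URL is queued only once
                seen.add(link)
                queue.append(link)
    return visited, False
```

A site with a finite set of URLs finishes on its own and reports `False`. A "Next" link that always points at `p + 1` keeps minting new URLs, so the crawl burns through its whole budget and reports `True`.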

We've yet to deal with the SEO repercussions of this blunder.


httpwebwitch
