Make sure your site doesn't have a "page infinity"

         

httpwebwitch

7:49 pm on May 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



True Story.

Just moments ago, our database team noticed some very ODD requests coming in. Someone via our interface was requesting "page 20,000" of a query that returns - at most - maybe a few hundred results.

The problem? A little glitch in our pagination code, put there by an (a-hem) ex-employee. His happy legacy: you can request search results page one, and two, and three. But when you get to the end of the results, it doesn't stop. The "Next" link keeps letting you request page three thousand twelve, three thousand thirteen, three thousand fourteen.

The link was basically built like this:
<a href="/search/?p=[currentpage + 1]">Next</a>

This was built into our "search results" page. So, besides there being a potentially infinite number of queries to search for, there is also an infinite number of pages for each query. The number of valid URLs in our imaginary sitemap just became "infinity squared".

And who is the visitor who keeps following links, going next, next, next, next, next, next, next? Why, it's Googlebot, of course.

Let's not discuss the idiocy of letting Googlebot crawl and index SERPs, or indeed *any* paginated result scripts. We will not go there. Instead, let this be a somewhat obvious addendum to my recent Pagination for the Pro [webmasterworld.com] article: Stop when you get to the end.
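For what it's worth, "stop when you get to the end" can be sketched in a few lines of Python. This is not our actual code: the function names, the 10-results-per-page default, and the choice to answer out-of-range pages with a 404 are all assumptions made for illustration. Only the /search/?p= URL shape comes from the link above.

```python
import math

def page_count(total_results: int, per_page: int) -> int:
    """Pages needed for the result set; an empty set still gets one (empty) page."""
    return max(1, math.ceil(total_results / per_page))

def next_link(current_page: int, total_results: int, per_page: int = 10) -> str:
    """Return the 'Next' anchor, or an empty string once the last page is reached."""
    last = page_count(total_results, per_page)
    if current_page >= last:
        return ""  # end of the results: no link for a crawler to follow
    return f'<a href="/search/?p={current_page + 1}">Next</a>'

def status_for(requested_page: int, total_results: int, per_page: int = 10) -> int:
    """HTTP status for a requested page: 200 inside 1..last, otherwise 404."""
    last = page_count(total_results, per_page)
    return 200 if 1 <= requested_page <= last else 404
```

Leaving the link off the last page stops polite crawlers; the status check covers anyone who types ?p=20000 by hand or already has the URL.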

This kind of never-ending navigation loop is affectionately called a "spider trap". Once a crawler gets inside, it will never escape; it'll just keep crawling and crawling and crawling... eating your bandwidth like a dog on a never-ending trail of Scooby Snacks.

A simple link checker tool would have found this.

A less obvious example of this error happens on calendars and other date-choosing or time-choosing scripts. There's always a "next month". Your site can be booking golf course tee-off times for eons from now, when the sun has gone supernova and the solar system collapses in upon itself. When setting a limit for looped navigation, you need to ask yourself: how far ahead is far enough?
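A rough Python sketch of a calendar that knows when to quit. The 12-month horizon, the /calendar/YYYY/MM/ URL pattern, and the function names are all made up for the example; pick whatever limit answers "how far ahead is far enough" for your own site.

```python
from datetime import date
from typing import Optional

MAX_MONTHS_AHEAD = 12  # hypothetical limit, tune to your booking window

def add_months(d: date, n: int) -> date:
    """First day of the month that is n months after d's month."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)

def next_month_link(shown: date, today: date) -> Optional[str]:
    """URL for the month after `shown`, or None once the horizon is reached."""
    candidate = add_months(shown, 1)
    horizon = add_months(today, MAX_MONTHS_AHEAD)
    if candidate > horizon:
        return None  # no "next month" link beyond the horizon
    return f"/calendar/{candidate.year}/{candidate.month:02d}/"
```

The same test belongs on the request side too, so that a hand-typed URL for some month centuries away gets a 404 instead of a freshly generated empty calendar.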

Don't build spider traps!

How to detect spider traps

Get yourself a link-checker tool. There are many available, free and otherwise. You want something you can use as your very own polite and innocent crawler. Let it loose on your own site and see what it finds. It'll be easy to see you have a "spider trap" when it... doesn't stop. As an added bonus, a personal crawling tool can expose all your 404 errors and timeouts (scripts that load slowly under stress) before Google notices them. Plus, if you turn up the "threads", you've got a poor man's stress-testing tool! Let your crawler request 50 pages per second, and you'll see how your server will respond on the fateful day you get Slashdotted or TechCrunched.
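If you'd rather roll your own, the core of such a checker is just a crawl with a hard page cap. Here's a minimal Python sketch, not any particular tool: the `crawl` function and its `get_links` callback are invented for the example, and the callback stands in for whatever you use to fetch a page and pull out its same-site links.

```python
from collections import deque
from typing import Callable, Iterable, Set, Tuple

def crawl(start: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> Tuple[Set[str], bool]:
    """Breadth-first crawl from `start`.

    Returns (visited URLs, hit_cap). hit_cap is True when the cap was
    reached with URLs still waiting in the queue, which is the hint
    that the site may contain a spider trap.
    """
    seen = {start}
    queue = deque([start])
    visited: Set[str] = set()
    while queue:
        if len(visited) >= max_pages:
            return visited, True  # more URLs left than we are willing to fetch
        url = queue.popleft()
        visited.add(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, False
```

Set the cap comfortably above the number of pages you think you have. If the crawl still hits it, look at the last URLs visited; a trap usually shows up as one parameter counting upward.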

We've yet to deal with the SEO repercussions of this blunder.

rocknbil

3:23 pm on May 7, 2009 (gmt 0)




<snicker> Well, we've all done something dumb at some point (delete from table where id - 3456 instead of id=3456 . . . )

Here's another one many overlook: when you build your pagination for, say, 2 million records, and your results display, say, 100 records per page, what do you get for "page links"?

<a href="...">1-100</a> . . . . X 20,000.

20K links per page will choke any browser, especially if your links contain GET parameters to "keep your place" in the pagination. What I do here is set a "links range" variable to limit the output. So if "links range" is set to 5 and you are midway through some results, you get:

<< previous 5 ¦ 501-600 ¦ 601-700 ¦ 701-800 ¦ 801-900 ¦ 901-1000 ¦ next 5 >>
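One way to produce that strip, sketched in Python. The function name and parameters are invented for the example (this is not the production code); it builds labels only, and the hrefs would be wrapped around each label in the real thing.

```python
def range_links(current_page: int, total_records: int,
                per_page: int = 100, links_range: int = 5) -> str:
    """Record-range labels for the block of pages containing current_page."""
    last_page = max(1, -(-total_records // per_page))  # ceiling division
    current_page = min(max(current_page, 1), last_page)  # clamp bad input

    # Pages are grouped into fixed blocks of `links_range` pages each.
    block_start = ((current_page - 1) // links_range) * links_range + 1
    block_end = min(block_start + links_range - 1, last_page)

    parts = []
    if block_start > 1:
        parts.append(f"<< previous {links_range}")
    for page in range(block_start, block_end + 1):
        first = (page - 1) * per_page + 1
        last = min(page * per_page, total_records)
        parts.append(f"{first}-{last}")
    if block_end < last_page:
        parts.append(f"next {links_range} >>")
    return " ¦ ".join(parts)
```

However many records come back, the page never carries more than links_range + 2 links.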

There are many places one can accidentally step into infinity . . . and feel infinitely stupid. :-)

httpwebwitch

4:04 pm on May 7, 2009 (gmt 0)




Good point, rocknbil.

An increasingly common (and good) pagination style is

<prev [1]...[15][16] 17 [18][19]...[20399] next>

You see your current page, a few pages before and after, and an easy way to skip to the beginning and end of the whole set. Then you are guaranteed that you won't get a bazillion links on the page, and the user gets what they need to find their way around.
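A quick Python sketch of that windowed style, for anyone who wants a starting point. The names and the radius of 2 are assumptions made for the example, and `render` always prints the prev/next labels as plain text; in real markup you'd leave them unlinked on the first and last pages.

```python
from typing import List, Union

def page_window(current: int, last: int, radius: int = 2) -> List[Union[int, str]]:
    """First page, a window around the current page, and the last page,
    with "..." marking any gap between them."""
    current = min(max(current, 1), last)  # clamp out-of-range requests
    lo = max(1, current - radius)
    hi = min(last, current + radius)

    items: List[Union[int, str]] = []
    if lo > 1:
        items.append(1)
        if lo > 2:
            items.append("...")
    items.extend(range(lo, hi + 1))
    if hi < last:
        if hi < last - 1:
            items.append("...")
        items.append(last)
    return items

def render(current: int, last: int, radius: int = 2) -> str:
    """Text version: linked pages in brackets, the current page bare."""
    current = min(max(current, 1), last)
    parts = []
    for item in page_window(current, last, radius):
        if item == "...":
            parts.append("...")
        elif item == current:
            parts.append(f" {item} ")
        else:
            parts.append(f"[{item}]")
    return "<prev " + "".join(parts).strip() + " next>"
```

With a radius of 2 the strip never holds more than seven page links plus prev and next, whether the set has 20 pages or 20,399.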

But IMHO if you have 2 million records returned, "page 15648" is pretty meaningless. If you're looking for an item, and won't find it unless you navigate to page 15648... that's awful. I'd offer the user a better way of filtering or sorting the list, so the item they want *might* appear in the first few pages, and there are no more than 50 pages total.

If that's even possible. It depends on the situation.

g1smd

4:59 pm on May 7, 2009 (gmt 0)




There's a whole Duplicate Content issue with paginated content, where the stuff on page 1 today is on page 2 tomorrow and page 3 the day after. Ever changing content gives me nightmares.

That aside, the spider trap problem also appears on sites with any sort of calendaring system, or that have any sort of date-related URLs for categories or for pages.

rocknbil

7:16 pm on May 7, 2009 (gmt 0)




I'd offer the user a better way of filtering or sorting the list,

Well of course, anything like this has a robust front-end search on any field or combination of fields, with sorting options, but for lazy users who just hit search without any terms . . . you have to have a safety net.

I'm "on the fence" about

<prev [15][16] 17 [18][19] next>

as opposed to

<< previous 5 ¦ 501-600 ¦ 601-700 ¦ 701-800 ¦ 801-900 ¦ 901-1000 ¦ next 5 >>

So I usually build these to optionally display actual record ranges (100-101) or page numbers of results (15). People seem to use the former more frequently (shrug).