
Google SEO News and Discussion Forum

    
SEO "Loose Ends" Regarding Pagination
FaceOnMars




msg:4537748
 8:02 pm on Jan 20, 2013 (gmt 0)

I'm working on a new pagination structure and have been struggling to find the best way to handle a couple of loose ends without creating duplicate content issues or other unintended SEO consequences:

1.) How do I handle requests for pages which no longer exist? For example, there are a total of 401 items for a particular category, which translates into 21 pages (with 20 results max per page). Let's say one item is removed from the database for whatever reason, thus decreasing the total page count back down to 20.
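The page arithmetic in this example can be checked with a short sketch. It is written in Python rather than the Perl used on the site, and the function name and the default of 20 results per page are assumptions made for illustration only.

```python
def page_count(total_items: int, per_page: int = 20) -> int:
    """Number of pages needed to list total_items, per_page at a time."""
    if total_items <= 0:
        return 0
    # Integer ceiling division: a partly filled last page still counts.
    return (total_items + per_page - 1) // per_page


print(page_count(401))  # 21
print(page_count(400))  # 20
```

So a single removed item is enough to make /category/21/ stop being a real page.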

Currently, a mod_rewrite rule will pass along any one- or two-digit page variable, such as:

http://www.example.com/category/4/
http://www.example.com/category/20/
http://www.example.com/category/21/
http://www.example.com/category/99/ (more relevant for #2 below)

My script (written in Perl) checks whether the page variable in the URL exceeds the actual page count, and if so currently just prints "sorry, no results" (for the time being while in development).

2.) Similarly, I imagine there will be some "random" requests for pages which are beyond the actual page count of a particular category. For example, there are not more than 20 actual pages, but somebody links to my site with the following:

http://www.example.com/category/99

Currently, my mod_rewrite rules ignore requests beyond the scope I've defined in my .htaccess file, triggering a 404. For example, all of the following result in a page not found:

http://www.example.com/category/4a
http://www.example.com/category/444
http://www.example.com/category/aaa
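To make the split between "accepted" and "rejected" URLs concrete, here is a rough Python sketch of the matching described above. The pattern is an assumption, not the actual .htaccess rule (which is not shown in this thread): one or two digits, with an optional trailing slash.

```python
import re

# Hypothetical stand-in for the rewrite condition: /category/ followed by
# one or two ASCII digits and an optional trailing slash.
PAGE_PATH = re.compile(r"/category/([0-9]{1,2})/?")


def page_from_path(path: str):
    """Return the page number if the path would be passed to the script, else None."""
    match = PAGE_PATH.fullmatch(path)
    return int(match.group(1)) if match else None


for path in ("/category/4/", "/category/99", "/category/4a", "/category/444", "/category/aaa"):
    print(path, page_from_path(path))
```

Anything that returns a number here reaches the script, whether or not that page really exists; everything else never gets past the rewrite rule and ends in the server's own 404.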

This isn't necessarily a mod_rewrite question, so I don't want to get bogged down in the nuances of my mod_rewrite code, but rather simply explore the best general strategy for dealing with such "extraneous" requests. So far, I've come up with the following ideas (I'll refer to the above pages which are accepted by the mod_rewrite conditions, but offer no content, as "ghost pages" - for lack of a better term):


A.) Have my script place a <link rel="canonical" href="http://www.example.com/category/" /> on all ghost pages, but I'm not sure how this may play out in terms of SEO?

B.) Have my script place a noindex tag on all ghost pages. This seems like a good idea, but what happens when a category grows and starts a new page? Of course it won't be a ghost page at such a point, but I have heard that it can be difficult to get Google to reverse a noindex?

C.) My script is written in Perl, so I'm not sure if it's possible (I believe it is in PHP) to send a 301 header to point back to the first page of the category (http://www.example.com/category/)?

D.) This might be the cleanest solution, but it makes me wary for a few reasons: since mod_rewrite doesn't "know" the current state of my database (and page counts) and will pass along variables to my script regardless of whether they reflect actual pages or not, perhaps I could write another script which would dynamically update the .htaccess file and respective code to constrain what mod_rewrite accepts as page number variables?

Any thoughts / ideas on this issue would be appreciated, thank you.

 

g1smd




msg:4537777
 9:12 pm on Jan 20, 2013 (gmt 0)

You raise a very important point.

For page numbers that do not exist, return the HTTP 404 Not Found header from your script.


Once a request is rewritten and pointed to an internal script, you're beyond the point that htaccess is handling the request and so the PHP or Perl script must return the right headers and the correct content or error message.

Failure to do so leads to the site being flagged for "soft 404 errors", and you don't want that.


For other content pages, such as products or posts, you'll want to return 404 Not Found if they don't yet exist. They'll return 200 OK when they do exist. When they no longer exist, they should return 410 Gone, with that status sent by your PHP or Perl script.

404 - The server can't find it, doesn't know why it can't find it, doesn't know if it ever existed, and doesn't know if it ever will exist. Google will check again from time to time to see if the status changes.

410 - The content is Gone and is probably never coming back (Google will still check occasionally just in case it does come back).
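The 404 / 200 / 410 lifecycle for individual content pages can be sketched as a small decision function. This is only an illustration in Python: the two sets are hypothetical stand-ins for whatever the database records about live and deliberately removed items.

```python
def item_status(item_id: int, live_ids: set, removed_ids: set) -> int:
    """Pick the HTTP status for a product or post page.

    live_ids and removed_ids are hypothetical: ids currently published,
    and ids that existed once and were deliberately taken down.
    """
    if item_id in live_ids:
        return 200  # exists now
    if item_id in removed_ids:
        return 410  # existed, removed, not expected back
    return 404      # never seen; may or may not exist one day


live, removed = {1, 2, 3}, {7}
print(item_status(2, live, removed))   # 200
print(item_status(7, live, removed))   # 410
print(item_status(42, live, removed))  # 404
```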

FaceOnMars




msg:4537813
 11:06 pm on Jan 20, 2013 (gmt 0)

Thank you g1, 404 (or 410) was my ideal solution, but for some reason I had thought it wasn't possible with Perl; however, I was wrong. Came across some code and implemented as follows:

# Requested page is beyond the last real page: send a real 404 status.
if ($CurrentPage > $CategoryPageCount) {
    print "Status: 404 Not Found\r\n";
    print "Content-Type: text/html\r\n\r\n";
    print "<h1>404 File not found!</h1>";
    exit;
}

Which generates the following entry in Apache's access log (which I believe is what I'm after):

10.0.0.10 - - [20/Jan/2013:15:51:37 -0700] "GET /category/25/ HTTP/1.1" 404 28 "-" "Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0"

However, I was originally hoping to trigger my global custom 404 page, but no luck ... so opted to feed the "404 File not found!" message which appears in the block of code - which I believe is acceptable and better than the alternative.

To really fine-tune things, I suppose it might be beneficial to work in a 410 for those pages which existed, but were removed. This might prove to be a bit more of a challenge. I suppose I could create another field in the database which tracks a historical maximum number of pages per category, then compare $CurrentPage to both $CategoryPageCount and the new $HistoricalMaxCategoryPageCount and issue a 410 where appropriate.
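The comparison described above could look roughly like the following. It is a Python sketch rather than the site's Perl, the parameter names simply mirror the variables mentioned in the post, and whether 410 is the right signal for paginated listings at all is a separate question.

```python
def page_status(current_page: int, page_count: int, historical_max: int) -> int:
    """Status for a paginated category URL, using a stored historical maximum.

    historical_max is the hypothetical extra database field: the largest
    page count the category has ever had.
    """
    if current_page < 1:
        return 404  # e.g. /category/0/ was never a real page
    if current_page <= page_count:
        return 200  # a real page right now
    if current_page <= historical_max:
        return 410  # existed once, has since disappeared
    return 404      # never existed


print(page_status(21, 20, 21))  # 410
print(page_status(99, 20, 21))  # 404
```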

deadsea




msg:4537816
 12:05 am on Jan 21, 2013 (gmt 0)

I wouldn't use 410 for these pages. What if you get more products again and the pages come back? Seems like a very likely scenario.

I've used 302 (temporary) redirects to redirect back to page 1 in this type of situation. You certainly don't want to use 301 redirects, because like 410, that implies permanence.
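For reference, a temporary redirect from a CGI-style script comes down to a Status line plus a Location header. The Python sketch below only builds the header text; the function name is made up for illustration and the URL is the example category address used earlier in the thread.

```python
def redirect_to_first_page(category_url: str) -> str:
    """Build CGI response headers for a temporary (302) redirect."""
    return (
        "Status: 302 Found\r\n"
        f"Location: {category_url}\r\n"
        "\r\n"  # blank line ends the headers; no body is needed
    )


print(redirect_to_first_page("http://www.example.com/category/"), end="")
```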

g1smd




msg:4537819
 12:16 am on Jan 21, 2013 (gmt 0)

For products that go away and are never coming back, returning 410 Gone is appropriate.

For paginated listings, e.g. categories, return 404 Not Found when higher-numbered pages are gone because they may well come back.

I believe that all scripting languages allow you to override the HTTP headers and send something other than 200 OK.

Returning 404 Not Found from within your PHP or Perl script won't invoke your global error page. Once your PHP or Perl script is dealing with the request, you're way past the parts of Apache that check whether the request will resolve to a file and invoke error messages if not.

The usual method in PHP is to send the 404 HTTP header and then "INCLUDE" the file that contains the human readable 404 error message. I assume that Perl has some equivalent.

FaceOnMars




msg:4537828
 1:03 am on Jan 21, 2013 (gmt 0)

Thank you for the clarification on 410. I thought it meant gone, but could come back ... with a higher emphasis (vs. 404) on could come back. Regardless, it's easier to keep all pages I'm referring to as 404.

Yeah, I kind of wondered if it might have been an issue re: where along the assembly line Apache might still allow for the custom error page to be called. In any case, thank you g1 for pointing out the PHP method. I'm not sure of the Perl equivalent, but it quickly got me thinking that it would be easy enough to simply open up the actual custom error message file and print it. I'm not sure if it's the wisest way to go about it, though - not sure about unintended consequences regarding resource utilization if the page gets hit a lot for whatever reason?

if ($CurrentPage > $RoundedCategoryPageCount) {
    print "Status: 404 Not Found\r\n";
    print "Content-Type: text/html\r\n\r\n";

    # Send the custom error page as the body of the 404 response.
    open(my $page_not_found, '<', "/absolute/path/to/custom/error/message/file.html")
        or die "couldn't open the file: $!";
    while (my $line = <$page_not_found>) {
        print $line;
    }
    close($page_not_found);

    exit;
}

g1smd




msg:4537830
 1:15 am on Jan 21, 2013 (gmt 0)

In theory "410 Gone" is supposed to be "Gone for good, Never coming back".

In practice, many sites do have pages that come back having been 410 status at some time in the past, so Google does still look a couple of times per year to make sure the status is still 410.

Imagine you bought a domain name from someone, and root example.com/ had been returning 410 Gone for the last three years. If 410 Gone literally meant Gone Forever, you could never get your new home page indexed.

In the real world there are many things that "reset" or "override" the previously recorded Gone status, but I would never want to rely on that behaviour. So, 404 for some things and 410 for others it is.

P.S. A Google search for "perl equivalent of php include" may be useful.

smithaa02




msg:4537979
 2:55 pm on Jan 21, 2013 (gmt 0)

I would assume 404s would be best.

Now naturally Google wants you to do rel="next" and rel="prev" for all paginated pages, else they will force the pages to compete with each other, which will dilute your content or create duplicate content. There was a webmaster video on this somewhere...

I would assume a page that was part of a prev/next chain would be grouped into a super page entity, and if removed Google would understand this and not care too much. In essence I suspect Google would think you "just made your page shorter". But it is a good question though...

Robert Charlton




msg:4538067
 8:22 pm on Jan 21, 2013 (gmt 0)

Google actually suggests three options for pagination... one of which is rel="next" and rel="prev". See some further discussion in this thread....

Changing count of posts per page on forum & its effect on rankings
http://www.webmasterworld.com/google/4535210.htm

In the thread I link to an excellent Google support page describing all three options, and also link to the video....

Pagination - Google Support
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1663744

Which option you use, IMO, depends on the situation and how you implement it. There's a View All option, e.g., where all paginated pages are regarded by Google as one. View All is probably the best option for paginated forum threads, where, as I note in the thread, "the unpredictable shape of forum discussions" makes this a wise choice.

For product pages, if you prioritize the order of products (most important products first), then rel="next" and rel="prev" appear to be the best approach. As Google describes on the support page...

My emphasis added at the end...
"Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page."

And if you either do nothing or do something wrong, Google says...

"Paginated content is very common, and Google does a good job returning the most relevant results to users, regardless of whether content is divided into multiple pages."

And later...
"...if an expected rel="prev" or rel="next" designation is missing... we'll continue to index the page(s), and rely on our own heuristics to understand your content."
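As a rough illustration of the rel="next" / rel="prev" markup discussed above, the Python sketch below generates the link tags for one page in a sequence. The URL scheme is an assumption based on the example URLs in this thread: page 1 lives at the category root, and page n at the root followed by "n/".

```python
from html import escape


def pagination_links(base_url: str, page: int, page_count: int) -> list:
    """Return the <link rel="prev"/"next"> tags for one page of a sequence."""

    def url(n: int) -> str:
        # Assumed scheme: page 1 is the category root itself.
        return base_url if n == 1 else f"{base_url}{n}/"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{escape(url(page - 1))}" />')
    if page < page_count:
        tags.append(f'<link rel="next" href="{escape(url(page + 1))}" />')
    return tags


for tag in pagination_links("http://www.example.com/category/", 2, 20):
    print(tag)
```

The first page gets only a "next" link and the last page only a "prev" link, so the chain has clear ends.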

FaceOnMars




msg:4538077
 8:50 pm on Jan 21, 2013 (gmt 0)

I added pagination along with rel="next" and rel="prev" hoping it will allow Google to view and associate the full depth of any given category. This was the primary goal of restructuring this component. Up until now, I've used a "global script" to allow visitors to see the next page of results for any given category; however, there were at least a couple/few problems:

1.) Very cumbersome for the visitor to have to click "See next 20 listings" at the bottom of every page

2.) It's probably sending mixed/bad signals to Google insofar as a total of 120 categories all funnel into the same script (URL) for subsequent results.

I've seen a consistent, but somewhat "muted/dampened" (never too severe) decline from most Panda refreshes over the past year & I'm crossing my fingers (and toes) that pagination might help on this front by providing Google with easy access to the entire depth of directory structure - which may offer and tie in greater semantic continuity of content.

One issue which I'm a bit concerned about is the fact that I randomize results. Unfortunately, I can't escape this fact ... as it's an integral component of my business model. Essentially, all paid listings (as a group) appear above free listings & within each group all listings are randomized. Randomization used to occur on the fly; however, with the addition of pagination, I've created a script to essentially perform a randomization via a cron job once per day (overnight) and load the order of listings for any given category into a database. I then use this ordering for 24 hours until the deck gets shuffled again. I suppose I could increase the interval (e.g. a weekly cron job), but not sure if that would help matters?
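The "shuffle once, keep it for 24 hours" behaviour can be illustrated without a cron job by seeding the shuffle with the date. This is a swapped-in technique for illustration, not the cron-plus-database setup described above, and the listing structure (dicts with a "paid" flag) is hypothetical.

```python
import random
from datetime import date


def daily_order(listings: list, day: date) -> list:
    """Paid listings first, free listings after, each group shuffled.

    Seeding with the date keeps the order stable for every request made
    on the same day and changes it on the next one.
    """
    rng = random.Random(day.toordinal())
    paid = [item for item in listings if item["paid"]]
    free = [item for item in listings if not item["paid"]]
    rng.shuffle(paid)
    rng.shuffle(free)
    return paid + free


listings = [{"id": n, "paid": n % 3 == 0} for n in range(1, 13)]
today = date(2013, 1, 21)
print([item["id"] for item in daily_order(listings, today)])
```

A longer interval, such as a week, would only require a seed that changes weekly instead of daily.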

I'm considering adding a "view all" in addition to pagination; however, I'm very concerned about resource utilization, but if it's necessary to mitigate the fact that I'm randomizing results, then it'd be worth it. If I do add a "view all" link, I'm not sure what the best way is to nudge Google into indexing the front page of the category? Would placing a noindex and also canonical (-> front page of category) on the view all page work to this end (or is it a no-no akin to placing it on paginated pages)?

lucy24




msg:4538147
 3:43 am on Jan 22, 2013 (gmt 0)

"I'm considering adding a "view all" in addition to pagination; however, I'm very concerned about resource utilization"

I don't think you need to be. The users are thinking the same thing-- and it's a lot more noticeable from their end. How often do you select Show All on shopping sites? How many results-per-page does your search engine show? (G### goes up to 100, but I don't know anyone who uses this number by default.) Just make sure your default page says what the total is. "1-20 of 48" and the user might well go for Show All; "1-20 of 480" and they probably won't.
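The "1-20 of 48" style label is simple to compute. A minimal Python sketch, with a made-up function name:

```python
def range_label(page: int, per_page: int, total: int) -> str:
    """Label such as "1-20 of 48" for the given page of results."""
    first = (page - 1) * per_page + 1
    last = min(page * per_page, total)  # the last page may be short
    return f"{first}-{last} of {total}"


print(range_label(1, 20, 48))   # 1-20 of 48
print(range_label(3, 20, 48))   # 41-48 of 48
print(range_label(1, 20, 480))  # 1-20 of 480
```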

If you're already at the edge of your bandwidth or RAM limits, it's probably time for an upgrade anyway. "It's a great site but it takes forever to load" has got to be on the list of Top Ten Things you never want your users to say.

