Forum Moderators: Robert Charlton & goodroi


Strange cgi parameter issue

         

woodpecker

4:56 pm on Jan 9, 2014 (gmt 0)

10+ Year Member



Hi folks,

Something strange is going on with Google and cgi parameters on our site. We have listings with 10 products on each page, paginated in this format:

/product/red?page=1
/product/red?page=2

Google sent a message about many duplicate pages. When I looked into it, I found it has been indexing pages with page numbers way above any that should exist. It's as if it has been trying different page numbers itself, indexing them, then complaining that the content looks duplicated. For example, I found cases where only page=1 and page=2 are linked, but it has indexed loads of almost blank pages up to page=290.

There are no links on the site that spiders can follow with these high page numbers. Any idea what's going on?

rainborick

5:53 pm on Jan 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



While Google does sometimes attempt to create test URLs in situations like yours, the odds are that Google discovered these URLs on your site somehow, whether you intended it or not. It happened to me when I installed a new paging system on a site a while back without adequately testing it.

The first thing you need to do is to make sure that your site returns a 404 response when an invalid or illegal parameter is encountered so that Google never finds any content when they do try to crawl with such URLs.

The second thing you need to do is to do some extensive testing to make absolutely sure that your site isn't creating these invalid page parameters. Link checking tools like Xenu Link Sleuth can test your site for invalid links like these. Another simple method is to actually test one of these bad URLs with a mid-range page number in your browser to see if the resulting page includes additional links with invalid page parameters. That can help you track down the source of any problem. Good luck!
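The manual check described above (load one of the bad URLs and look for further bad links in it) can be roughed out in a few lines. This is only a sketch under assumptions: the parameter name `page` comes from the URLs in this thread, while `suspicious_links`, `page_is_valid` and `max_page` are made-up names, and you would have to supply the saved HTML and the real last page number yourself.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def page_is_valid(value, max_page):
    """True if value is an integer between 1 and max_page."""
    try:
        return 1 <= int(value) <= max_page
    except ValueError:
        return False


def suspicious_links(html_text, max_page):
    """Return hrefs whose 'page' parameter is missing from the valid range."""
    collector = LinkCollector()
    collector.feed(html_text)
    bad = []
    for href in collector.hrefs:
        values = parse_qs(urlsplit(href).query).get("page", [])
        if any(not page_is_valid(v, max_page) for v in values):
            bad.append(href)
    return bad
```

Feeding it the HTML of a page such as /product/red?page=150 with `max_page=2` would list any links that point beyond the real last page. Links without a `page` parameter are ignored.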

phranque

6:19 am on Jan 10, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There are no links on the site that spiders can follow...

what does that qualification mean?
are you assuming the spider renders javascript, for example?

woodpecker

9:09 am on Jan 10, 2014 (gmt 0)

10+ Year Member



Thanks for the replies. The site does generate a page whatever the cgi page parameter is set to, so if there are only 2 pages and page=3 is set in the URL, a dynamic page is still generated, but it only has a menu. Because it's dynamic, though, there could be times when there are 100 valid pages and others when there's only 1 page. I'm not sure how I can return a 404 dynamically?

I have tried testing with invalid page numbers. Although a page is generated, it contains no links to invalid pages.

In response to phranque, what I mean is there are no hyperlinks generated with these urls where the page is invalid, so I'm not sure why Google is indexing them.

Any ideas?

aakk9999

1:44 pm on Jan 10, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure how I can return a 404 dynamically?

You should read the database first (before sending headers or rendering any page content), to check if any rows are returned for your query where page number = x. If the query returns zero rows, then send 404 headers and do not render the page.
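A minimal sketch of that order of operations, written in Python with SQLite purely for illustration, since the thread does not say what language or database the site uses. The table `products`, its columns, and the function names are all assumptions. The 10 items per page figure comes from the original post, and the `Status:` header is the standard CGI way to set the response code.

```python
import html
import sqlite3

PER_PAGE = 10  # the original post mentions 10 products per page


def fetch_page(conn, colour, page):
    """Return the product names for one listing page (empty list if none)."""
    if page < 1:
        return []
    offset = (page - 1) * PER_PAGE
    cur = conn.execute(
        "SELECT name FROM products WHERE colour = ? ORDER BY id LIMIT ? OFFSET ?",
        (colour, PER_PAGE, offset),
    )
    return [row[0] for row in cur.fetchall()]


def build_response(conn, colour, raw_page):
    """Query first, then choose the status; returns the whole CGI response."""
    try:
        page = int(raw_page)
    except (TypeError, ValueError):
        page = 0  # non-numeric or missing value is treated as invalid
    rows = fetch_page(conn, colour, page)
    headers = "Content-Type: text/html\r\n\r\n"
    if not rows:
        # Nothing has been output yet, so the 404 status can still be sent.
        return "Status: 404 Not Found\r\n" + headers + "<h1>Page not found</h1>\n"
    items = "".join(f"<li>{html.escape(name)}</li>" for name in rows)
    return "Status: 200 OK\r\n" + headers + "<ul>" + items + "</ul>\n"
```

With 15 red products in the table, page=1 and page=2 get a 200 and page=3 upward gets a 404, however many products there happen to be on a given day.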

woodpecker

3:41 pm on Jan 10, 2014 (gmt 0)

10+ Year Member



Yes, you're right. I think I am rendering the page template before the query, which is not good. As a temporary fix I have done a 301 bounce to the index page if the page number is invalid, but the bounce is after the template is rendered. Is that a good idea while I sort it properly?

aakk9999

1:27 am on Jan 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not a good idea - Google will treat this as a "soft 404". I would take it out as soon as possible.

It is always better to do one proper fix than intermediate half-fixes, which can introduce their own problems.

lucy24

2:08 am on Jan 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the bounce is after the template is rendered

Ugh, that makes me think of another recent thread in which the 301 response wasn't getting read as intended. One of the very few things I know about server-side dynamic pages is that the numerical response (301, 404, whatever) has to be sent out before any content at all has been sent. Not so much as a stray line break in the text editor, or you're sunk. There's a cgi equivalent to output buffering, isn't there?
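One way to get the effect of output buffering in a CGI script is to collect the page in memory and print nothing until the status is known. The sketch below assumes a Python script. `run_buffered` and `make_listing_handler` are hypothetical names, and the handler deliberately writes its template before checking the data, as described earlier in the thread.

```python
import io
import sys


def run_buffered(handler, out=None):
    """Run handler against an in-memory buffer, then emit status, headers, body."""
    out = out if out is not None else sys.stdout
    body = io.StringIO()
    status = handler(body)  # handler writes to body and returns e.g. "404 Not Found"
    out.write(f"Status: {status}\r\n")
    out.write("Content-Type: text/html\r\n\r\n")
    out.write(body.getvalue())


def make_listing_handler(products):
    """Build a handler that renders the template first and checks data second."""

    def handler(body):
        body.write("<ul class='menu'></ul>")  # template output comes first
        if not products:
            # Discard what was buffered; nothing has reached the client yet.
            body.seek(0)
            body.truncate()
            body.write("<h1>Page not found</h1>")
            return "404 Not Found"
        body.write("".join(f"<p>{p}</p>" for p in products))
        return "200 OK"

    return handler
```

Because the buffer is only flushed at the end, the status line always goes out before any content, even when the decision is made late. Doing the query before rendering is still the cleaner fix.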