Forum Moderators: Robert Charlton & goodroi


WMT shows duplicate pages with question marks in urls

         

virtualreality

2:30 am on Nov 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



My WMT report shows duplicate pages such as:

/category/page.php
/category/page.php?ref=binfind.com/web

I don't know where this "?ref=binfind.com/web" is coming from. Is there a way to block all pages that have something after the ? sign? I see other examples in my reports, such as:

/category/page.php
/category/page.php?r=1&l=ri&fst=0

/category/page.php
/category/page.php?src=AnyURL.com

Thanks!

FranticFish

7:31 am on Nov 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Someone is linking to your page using that url, and your page.php script doesn't have code in it to either (a) trap these and redirect (and a canonical tag would do just as well here as htaccess/PHP) or (b) serve a 410 or similar for unexpected urls.
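Option (a) can be as simple as a canonical tag in the page's head. A sketch, assuming the parameterless /category/page.php from the examples above is the preferred URL (the domain is a placeholder):

```html
<!-- Sketch: tells search engines the parameterless URL is the canonical
     version. Domain and path are placeholders from the examples above. -->
<link rel="canonical" href="https://www.example.com/category/page.php">
```

With this in place, any ?ref=/?src= variant that gets crawled should be consolidated onto the canonical URL rather than indexed separately.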

I personally think it would be more risky to trap and redirect (because by doing this you are taking ownership of that url rather than saying 'nothing to do with me'), but others might disagree.

If you choose to reject/block, you may not want to block ALL such urls - for instance, Google AdWords uses a tracking string beginning with ?gclid=

We wrote a function for our sites that kicks in at the page request (or 'GET') stage. It allows us to specify accepted character strings (and load a page) and reject everything else (serve a 4xx or 5xx response as you see fit). I'm sure there are scripts available that do this for popular CMS systems, or tutorials online on writing your own in PHP.
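A minimal sketch of that kind of check in PHP (this is not our actual function - the whitelist entries and the 410 response are assumptions; substitute the parameters your own pages accept):

```php
<?php
// Sketch: allow only whitelisted query parameters; reject everything else.
// The entries below are examples - an empty list would reject any query
// string at all. Run this before any page output is sent.
$allowed = ['gclid', 'page'];

foreach (array_keys($_GET) as $param) {
    if (!in_array($param, $allowed, true)) {
        // Serve whatever 4xx/5xx you prefer; 410 ("Gone") shown here.
        http_response_code(410);
        exit;
    }
}
// ...continue building and serving the page normally...
```

Including it from a common header file means every page gets the same protection without per-page changes.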

lucy24

7:56 am on Nov 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



by doing this you are taking ownership of that url rather than saying 'nothing to do with me

If all those bogus URLs get 404 responses without you having to do anything, then you can ignore the whole thing unless it really bugs you. (For example, if google becomes obsessed with some nonexistent URL you might cave in and start serving up explicit 410s to make them go away faster.) But if an URL with bogus parameters currently ends up with a 200, you have to do something. Otherwise you already are "taking ownership" by the simple act of serving up content.
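If you do decide to serve explicit 410s, one place to do it is the server config. A hedged sketch for Apache with mod_rewrite, targeting just the ?ref= parameter from the examples above (parameter name is an assumption; adjust to what you're actually seeing):

```apache
# .htaccess sketch: answer 410 Gone for any request carrying ref= in
# its query string, whether it appears first or after other parameters.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)ref= [NC]
RewriteRule ^ - [G]
```

The [G] flag is shorthand for a 410 response; everything without that parameter is untouched.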

If those bogus parameters are showing up in WMT, you can certainly tell them to ignore the parameter. (There's a range of options; choose carefully.) Unlike most things involving google, a truly nonexistent parameter will eventually go away-- that is, it will no longer appear in WMT-- and they'll stop requesting it.

Wilburforce

9:08 am on Nov 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a way to block all pages that have something after ? sign.


You can't block them generally, but you can stop specific parameters from being treated as separate URLs in WMT (or GSC as we probably should call it now) on the URL Parameters page (a subsection of Crawl).

I'm not sure it is worth the effort if you are just getting a handful of links - my own site gets them periodically - as it is unlikely to affect your site in any other way.

virtualreality

6:14 pm on Nov 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks all for your reply. I checked one bogus url and it returns code 200. So I need to do something. What would be the best option to fix this issue?

lucy24

7:37 pm on Nov 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You said at the outset that you learned about the issue via WMT. So if you go to the "parameters" area (currently the last item under the "Crawl" tab), the offending parameters should already be listed. If not, say "add parameter" and take it from there.

If your site returns the identical page with or without this parameter, and if your own links don't use the parameter, you might want to pick the "don't crawl URLs that contain this parameter at all" option.

virtualreality

5:09 pm on Nov 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks Lucy24. I don't want to crawl any URLs with parameters since my site is static and has simple .php and .html pages. Is there a way to block all URLs with parameters in the robots.txt file?

lucy24

6:37 pm on Nov 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there a way to block all URLs with parameters in robots.txt file?

According to the horse's mouth [support.google.com] (under the "Pattern-Matching" dropdown) it's
Disallow: /*?
where * is a wild card meaning "anything here" and ? is a literal question mark. It's one of those functions that not all robots recognize, but the Googlebot does. If your robots.txt has a "Googlebot" section, the rule goes there.
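For instance, if you wanted the rule to apply only to Google, a minimal Googlebot section would read (placing it site-wide under "User-agent: *" is the alternative, but then robots that don't support wildcards may ignore or misread it):

```
User-agent: Googlebot
Disallow: /*?
```
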

Option B, if you don't use parameters at all, is a global redirect of any request with a query string. Exact syntax will of course depend on your server type.
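For Apache/mod_rewrite, a sketch of that global redirect (an assumption that the rules live in an .htaccess file at the document root; other servers will differ):

```apache
# Sketch: 301 any request that carries a query string to the same
# path without it. The trailing ? on the target strips the query.
RewriteEngine On
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.*)$ /$1? [R=301,L]
```

Only do this if, as said above, you don't use parameters at all - otherwise your own parameterized URLs would be redirected too.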

:: noting irritably that robotstxt.org of all places seems to be missing a without-www DNS entry ::

virtualreality

7:02 pm on Nov 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks again Lucy24. I added Disallow: /*? to my robots.txt, so unless I find a better solution I will not change it.
Regarding option B, if I use a global redirect (for example redirect /category/page.php?r=1&l=ri&fst=0 to /category/page.php) does that give the impression my site is affiliated with these bogus URLs?

lucy24

8:54 pm on Nov 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



does that give the impression my site is affiliated with these bogus URLs?

I dunno. But it's a valid solution if you're getting a whole lot of requests with parameters-- especially if the requests are actually coming in from humans.

The problem with bogus parameters -- as opposed to other kinds of obvious typo URLs like "example.com/pagename.html.This" or "example.com/pagenam" -- is that your site generally won't return a 404, so the search engine can't know that the URL doesn't represent a real page. Worst case, you end up with unwanted Duplicate Content.

virtualreality

11:51 pm on Nov 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thank you.