Forum Moderators: Robert Charlton & goodroi


Disallow crawling/indexing query strings?


JamesSC

11:31 pm on Dec 3, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm having no luck preventing Google from crawling pages that somehow have come to have query strings attached in the form of

/page/###/?page=stats&view=post&post=###&blog=#######

and similar.

Robots.txt directives such as

Disallow: /?*
Disallow: /page/*/?*
Disallow: /?pg=*&*

and others have had no noticeable effect yet.
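For comparison, here is a minimal robots.txt sketch for blocking parameterized URLs (the patterns are illustrative; note that per Google's robots.txt spec a trailing * is redundant, since every rule already matches any suffix):

```
User-agent: Googlebot
# Block any URL containing a query string ("?" anywhere in the URL)
Disallow: /*?
# Or, more narrowly, only paginated archive URLs that carry parameters
Disallow: /page/*?
```

Keep in mind these are crawl directives only; URLs Google has already indexed can remain in the index even after crawling stops.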

Because I'm running a WordPress site using plugins, I really don't want to try to address this through any sort of block in my .htaccess file.

Am I even using the correct directive syntax to prevent Google from crawling these query strings? Does Google even honor such directives these days?

Any help in preventing Google from wasting my crawl budget in this way would be appreciated.

lucy24

1:34 am on Dec 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



pages that somehow have come to have query strings attached
Google crawling-and-indexing is not the problem. Spurious query strings are the problem. You need to make them go away, and prevent new ones from being created.

Because I'm running a WordPress site using plugins, I really don't want to try to address this through any sort of block in my .htaccess file.
Even on a WP site, it is trivial to block and/or redirect as needed. You just need to put the rules in the right place, in the right order. (not2easy? You out there?)

In the meantime, why don't you just go into the Parameters section of wmt/gsc and tell them to disregard the parameters?

not2easy

3:27 am on Dec 4, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm out here, but without knowing what plugins might be used for various functions I have no fix.

The URL (permalink) structure comes from the Settings section of WP admin; whatever is set there will be the URL structure. WP is known to have multiple ways to reach the same content, so users depend on plugins to manage the sitemap, canonicalization and indexing of those various internal parameters. If nothing is set - no sitemap, and no canonicalization or index/noindex control set up by the user - then Google and other bots will attempt to guess which version should be indexed.
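As a concrete illustration (hypothetical URL; most SEO plugins emit this automatically), a canonical link in the page's head tells Google which version of the content to index, regardless of what parameters get appended:

```
<link rel="canonical" href="https://example.com/page/5/" />
```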

As for using robots.txt to control what you don't want crawled, it pays to consult Google's robots.txt specifications [developers.google.com] to learn the proper syntax.

JamesSC

4:48 am on Dec 4, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I've used pretty permalinks forever. These new, odd query-string requests/crawls came to my attention while reviewing my logs. I'm also at a loss to figure out which plugin might have injected, or be injecting, some of the more random, lewd search formulations, beyond the vanilla stats&view string I initially posted. The stats&view, to my mind, points to Jetpack, from which I load and use only the Stats and Subscriptions modules, and have been doing so seemingly without extraneous query strings for years.

I've already consulted the Google developers page you linked, not2easy, as well as this one [sanzon.wordpress.com], and according to both, my wildcard syntax should have worked. I successfully submitted the revised robots.txt versions and tested them against both crawlers, plain Googlebot and mobile, yet both kept right on spidering, still asking for more query-string pages. Perhaps I haven't waited a sufficient lag time for what may have been correct directive syntax to route its way to my particular spiders' brains.

I did have an odd episode a week or so ago where Google warned me with great agitation that it couldn't find my robots.txt file, which had not been touched or altered for 18 months or so. Then Google got well again and could find and read it, all of its own accord with no intervention by me.

Lucy24, I haven't yet tried the parameter route simply because in one of the links I consulted there had been some caution against addressing the problem there rather than more directly, but if I don't find a better solution I will. This has only recently come to my attention; I'm only aware of it when I catch it in my logs; and initially I wasn't convinced it wasn't some idiosyncratic eruction from Google itself. In fact, the first time I noticed the issue there were systematic 301 redirects - not written by me - in between each Google crawl request for a query string-appended page.

Thinking...the phenomenon presents itself only when Google is crawling explicit pages in the form of GET /page/pagenumber/ rather than pretty permalink-slugged pages. Does that narrow things down any?

Anyway, thanks for the feedback. From what I gather from not2easy my robots.txt syntax should be correct and up to date, so perhaps I need to give it more time, at least until it's clear that Google is disregarding it for some reason.

JamesSC

5:44 am on Dec 4, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Something I chanced across - apparently query strings can be introduced into Google spider brains externally:

[perishablepress.com]

This guy provides a remedy, but a preventative would certainly be better.

lucy24

6:06 am on Dec 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It worries me a little that he never got around to correcting a significant error in one of the proposed RewriteRules ($1 in target when nothing has been captured in pattern), even though the post dates from 2011. But other than that, yeah, makes sense.

All sorts of weird things can happen when something like a message board auto-converts plain text into links. From this very forum you'll hit the occasional 404 just because of a sentence-ending period being unintentionally included in a link.

JamesSC

1:20 pm on Dec 4, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



It worries me a little that he never got around to correcting a significant error in one of the proposed RewriteRules ($1 in target when nothing has been captured in pattern), even though the post dates from 2011.


Lucy24, are you referring to the fact that the .* in the RewriteRule is not contained within parentheses - should it not be? - or to something else?

not2easy

1:36 pm on Dec 4, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The '$1' at the end of the rule's target URL is supposed to add back captured data but the '.*' is not capturing anything - it would need to be in (parentheses). I've found other bad ideas on that site. It is a site that has been around for years and has brought a number of visitors to this forum when they ran into such anomalies from using snippets they found online. Just sayin'.
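For anyone following along, here is a sketch of the broken versus corrected form (the rule itself is paraphrased and illustrative; the point is the capture group):

```
# Broken: the target uses $1, but .* is not in parentheses,
# so the backreference expands to nothing
RewriteRule .* /$1? [R=301,L]

# Fixed: (.*) captures the requested path, and the trailing "?"
# in the target strips the query string from the redirect
RewriteRule (.*) /$1? [R=301,L]
```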

JamesSC

1:44 pm on Dec 4, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks, not2easy, I was thinking it needed parentheses.