Forum Moderators: Robert Charlton & goodroi


How can I stop Google indexing this type of string?

         

Arturo99

12:17 pm on Aug 2, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



https://www.example.com/?i=111111

Hi
Can you tell me how to stop google indexing this type of string?
Site is on a windows server.

In the list of 404s for the site in Search Console, this type of URL structure appears 25% of the time
and always gets a 404.

thanks
Arturo

not2easy

1:26 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you have no way to include a "no index" header with the URL, you could prevent crawling with a robots.txt disallow:
Disallow: /?i=


That will not be fast, but if it had been in place you would not have the problem.
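For completeness, a minimal robots.txt along those lines (the User-agent line is required; Google also supports * wildcards in robots.txt patterns, so the second rule catches the parameter on deeper paths too):

```
User-agent: *
Disallow: /?i=
Disallow: /*?i=
```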

Arturo99

2:03 pm on Aug 2, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Would it be an idea to have a list of such urls deliver a 410? (rather than a 404?)
Then google would not keep coming back 6 months later, testing for old urls, as they tend to do.

Arturo99

2:04 pm on Aug 2, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



I don't know what the ?i means at the start of the string.

Brett_Tabke

2:08 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



> https://www.example.com/?i=111111
>I don't know what the ?i means at the start of the string.

What's the content system? Custom, or something like wordpress?

Arturo99

2:56 pm on Aug 2, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



It's a customized system that does not seem to handle old product deletions very well.
We've got a redirect file of 2k redirects to try to compensate.
Getting rid of those 404s to 410s would help clean things up.

Brett_Tabke

7:22 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>I don't know what the ?i means at the start of the string.


The ?i marks the parameters that the user/link is passing to your script, so your script will see "i" = "111111" and can reference that in whatever language it is written in. Presumably, the "i" in this case is a product ID. If that system is still in place and referencing good products, then it is going to be hard to address via htaccess except by brute force (thousands of redirects that slow down your file serving).

>410

I would pass on the 410 and instead continue with a custom 404. Don't let that traffic go somewhere else. Give them a 404 page that has resources on it. Especially if you know the old product code, redirect them to new related products and services instead. Let Google take care of Google, and you take care of the visitor.
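Since the site is on a Windows server, a custom 404 page like that can be wired up in web.config; a minimal sketch (the /errors/404.html path is a made-up placeholder, adjust for your site):

```xml
<!-- Sketch: serve a custom error page for 404s on IIS, keeping the 404 status. -->
<configuration>
  <system.webServer>
    <httpErrors errorMode="Custom" existingResponse="Replace">
      <remove statusCode="404" />
      <error statusCode="404" path="/errors/404.html" responseMode="ExecuteURL" />
    </httpErrors>
  </system.webServer>
</configuration>
```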

robzilla

7:31 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it's a proper 404, I'd say it's not your problem. Unless you are linking or redirecting to those URLs somewhere or somehow, or other domains are.

phranque

10:10 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Can you tell me how to stop google indexing this type of string?
...

In the list of 404s for the site in Search Console, this type of URL structure appears 25% of the time
and always gets a 404.

if a googlebot-requested url gets a 404 response, it will not be indexed.
it may get reported as such in GSC, which is normal.

If you have no way to include a "no index" header with the URL, you could prevent crawling with a robots.txt disallow:
Disallow: /?i=

this will prevent googlebot from seeing the 404 response, which may actually cause these urls to be indexed with the following description:
A description for this result is not available because of this site's robots.txt


Would it be an idea to have a list of such urls deliver a 410? (rather than a 404?)

it's been reported here often that googlebot will typically lose interest sooner if it sees a 410 instead of a 404.

I don't know what the ?i means at the start of the string.

everything following the ? in a url (and preceding the #) is the query string.
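to make that concrete, here's how the url breaks apart in python's standard library (the url is the one from the original post):

```python
from urllib.parse import urlsplit, parse_qs

# Everything after the ? (and before any #) is the query string;
# i=111111 is a single key/value parameter within it.
url = "https://www.example.com/?i=111111"
query = urlsplit(url).query
print(query)            # i=111111
print(parse_qs(query))  # {'i': ['111111']}
```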

We've got redirect file of 2k redirects to try and compensate.
Getting rid of those 404s to 410s would help clean things up.

are these redirects or 404s?

I would pass on the 410 and instead continue with a custom 404. Don't let that traffic go some where else. Give them a 404 page that has resources on it. Especially if you know the old product code, then redirect them to new related products and services instead. Let Google take care of Google and you take care of the visitor.

i would still suggest a 410.
you can specify a custom 410 error document that also directs visitors to alternative or relevant resources.
googlebot will ignore the content of the custom error document, which is purely for human visitors.
googlebot only cares about the status code of the response.

If it's a proper 404, I'd say it's not your problem. Unless you are linking or redirecting to those URLs somewhere or somehow, or other domains are.

indeed.
it may be worth analyzing your web server access log files to see who else is requesting these urls.
a backlink analysis tool may also give you information and suggest actions to take regarding links pointing to these urls.
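a quick-and-dirty way to do that log analysis - this sketch assumes a combined-format access log (the field positions and sample line are assumptions, adjust for your log format):

```python
import re
from collections import Counter

# Capture the request path and the referer field from a combined-format log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d{3} \S+ "([^"]*)"')

def referers_for(lines, marker="?i="):
    """Tally referers for requests whose path contains the given marker."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and marker in m.group(1):
            hits[m.group(2)] += 1
    return hits

sample = [
    '1.2.3.4 - - [02/Aug/2022:12:00:00 +0000] '
    '"GET /?i=111111 HTTP/1.1" 404 123 "http://other.example/page" "SomeBot"'
]
print(referers_for(sample))  # Counter({'http://other.example/page': 1})
```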

lucy24

10:14 pm on Aug 2, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Edit: phranque types faster than me, so we overlapped.
I would pass on the 410 and instead continue with a custom 404. Don't let that traffic go some where else.
Seems to me this is just the place for a 410 with custom 410 page. (Don’t know about IIS, but Apache's default 410 is scary.) If humans are looking for a discontinued product, that's the time to tell them “We used to carry that, but now you might like X, Y or Z instead”.

With 2k no-longer-existing pages, of course you wouldn't do this manually with a list of 2k hand-coded redirects. Set up the 410 page in whatever language is appropriate, and let it match up Old Product to New Product, with a final fallback if you really no longer have anything relevant.

tangor

4:18 am on Aug 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If I have done responsible redirects to current pages and STILL get 404s I consider that a problem on the other guy's end.

Are we trying to clean up the logs to make it disappear, or is there another reason that 404 is not desired?

My experience is that the 404s eventually diminish and disappear if you give it enough time.

The robots.txt fix above is good advice, but that only applies to compliant robots. The non-compliant ones will still hit, since they do NOT honor robots.txt.

phranque

7:03 am on Aug 3, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The robots.txt fix above is good advice, but that only applies to compliant robots.

based on the forum in which this was posted we should assume that the (google)bot is compliant.
using robots.txt to exclude googlebot will not prevent indexing of the url for the reason i described above.
excluding googlebot will prevent GSC from reporting 404s for these urls since googlebot will never see the response.

Arturo99

9:45 am on Aug 3, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Is there a way to stop google indexing the query string?
https://www.example.com/?i=111111
I don't understand why they would index it.

phranque

10:08 am on Aug 3, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should ask yourself why they would crawl these urls - which is usually because they discovered a link to these urls somewhere.
if these are "valuable" links, you may also ask yourself whether you want google to ignore these urls (by making them worthless with a 404/410) or to follow a redirect to potential or existing relevant content.
just a thought...

Is there a way to stop google indexing the query string?

you can prevent google from indexing a url containing a query string by providing a 404 or 410 status code in the response to googlebot for requests of urls containing query strings.
eventually, google may stop requesting these urls, especially if you use a 410 instead of a 404 status code, but if it continues to discover links to those urls, it may continue to request them.

i know how to create a ruleset using apache mod_rewrite directives that would respond to any request containing any query string with a 410.
in order to figure out if or how to do so on a windows server, i'd have to crack open the documentation...
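for reference, the apache ruleset i have in mind is roughly this (an untested sketch - the [G] flag sends 410 Gone):

```apache
# respond 410 Gone to any request whose query string starts with i=
# (drop the ^i= anchor to catch every query string instead)
RewriteEngine On
RewriteCond %{QUERY_STRING} ^i=
RewriteRule ^ - [G]
```

on IIS, the URL Rewrite module appears to offer the same idea via a rule with a CustomResponse action and a {QUERY_STRING} condition, but check the documentation before relying on that.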

tangor

11:50 am on Aug 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you say indexing ... are they indexing it on your site, or REQUESTING it in a crawl?

If the latter, ignore it UNLESS it is a link you think valuable (how you figure out where g got it is beyond me), otherwise let the 404 ride and move on to more important work.

tangor

12:02 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Side note: 404 is actually useful to find out what kind of seeks and odd stuff are hitting your site. BTW killing query strings will also kill FB external hits and other things as well. Think hard on whether you want to do that.

tangor

12:04 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As I was about to exit this thread I saw:

Then google would not keep coming back 6 months later, testing for old urls, as they tend to do.


I have all kinds of search engines asking for stuff that's been gone up to 15 years ago. That's their problem, not mine!

lucy24

3:19 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



killing query strings will also kill FB external hits and other things as well.
You can poke holes for those. I block query strings on non-php URLs, excepting fbclid, which is stripped.
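One possible mod_rewrite shape for that pattern (a sketch only; the real rules will depend on the site):

```apache
# Redirect fbclid-tagged URLs to the same path, query string stripped
RewriteCond %{QUERY_STRING} ^fbclid=
RewriteRule (.*) /$1? [R=301,L]

# Block remaining query strings on non-php URLs
RewriteCond %{QUERY_STRING} .
RewriteCond %{REQUEST_URI} !\.php$
RewriteRule ^ - [F]
```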

martinibuster

6:49 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should ask yourself why they would crawl these urls


I agree with that and everything else Phranque posted. Below is one way to follow through with Phranque's advice.

One way to go about it is to give the site a crawl with Screaming Frog to see if SF picks up those same query-string URLs. If it does pick those URLs up then something is misconfigured.

If possible, filter the SF results for HTML files, then sort them by URL using a fragment that will match the problematic URLs, like /? (or any other string common to those rogue URLs). You don't need to stress about this point. You can just identify five or ten of them, and the next step will show you the problem.

Now check the referring URL of each rogue page. This will give you a clue of what is generating the problem. It could be some legacy link in a forgotten sitemap or a search function that is generating that. The key is finding the problem so that you can go to the next step and create a solution.

Good luck!

:)

Roger Montti aka martinibuster

Arturo99

9:20 pm on Aug 3, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Hi Martin, I am trying Screaming Frog, but where are you seeing the referral URL?
Do you mean inlinks?

On a query URL, Screaming Frog shows at the foot both in- and out-links as the same:
https://www.example.com/product?qs=1&productenquiry=1
https://www.example.com/product?qs=1&productenquiry=1

and also shows:

anchor               | path type     | link path                                                                       | link position
Skip to Main Content | Path-Relative | //body/header/a                                                                 | Header
main menu            | Path-Relative | //body/header/div[@class='tertiary-header']/div/nav/span/a                      | Navigation
Get a Sample         | Root-Relative | //body/div[@id='content']/div[2]/div[@class='wrapper'][1]/div/div[2]/div/p[1]/a | Content

Any clue here?

phranque

11:19 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



One way to go about it is to give the site a crawl with Screaming Frog to see if SF picks up those same query-string URLs. If it does pick those URLs up then something is misconfigured.

i was assuming these weren't internal links, but this was an excellent suggestion.
robzilla had already suggested this possibility:
Unless you are linking or redirecting to those URLs somewhere or somehow, ...


Any clue here?

internally linking to a url in header navigation is a strong signal to google and googlebot will almost certainly follow these urls when discovered in that context.
that should answer the question in your title - stop linking to this type of string.

it looks like the page containing links to these urls is some sort of search page?
https://www.example.com/product?qs=1&productenquiry=1

perhaps there is a misconfiguration in the Header/Navigation/Content navigation elements on the search results page?

you should ask yourself why they would crawl these urls

and now you may ask yourself why your search results (?) pages' navigation elements are linking to urls with this type of query string.

- are these legacy urls?
- are these internal (CMS) urls?
- are these "placeholder" urls in a template?
- ?

phranque

11:25 pm on Aug 3, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



We've got redirect file of 2k redirects to try and compensate.
Getting rid of those 404s to 410s would help clean things up.

are these redirects or 404s?

there's been a lot of discussion about "redirects" but the OP hasn't clarified their statement nor answered my question quoted above.
improper "redirects" (e.g., using a 302 instead of a 301 redirect; or redirecting to an error response) can cause "indexing" issues...

No5needinput

4:01 pm on Aug 4, 2022 (gmt 0)

10+ Year Member Top Contributors Of The Month



I use this to strip and redirect query strings, including F/book's. Poke holes as needed, as shown below with the two example conditions.

# Strip query strings
RewriteCond %{QUERY_STRING} .
# Poke holes: skip these paths so their query strings survive
RewriteCond %{REQUEST_URI} !^/blah/admin
RewriteCond %{REQUEST_URI} !^/blah/foo
# Redirect to the same path with the query string removed (trailing ?)
RewriteRule (.*) /$1? [R=301,L]

martinibuster

10:52 pm on Aug 4, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, inlinks. List by HTML or however you want, just so you get a list of all of your URLs. Then do a search on the top right for whatever string is in common for the URLs you're looking for. It can be whatever comes after the ?, or it can be ?xxx or /category/?, whatever gets you to it.

Then you can select them all, right click the selection and export the inlinks, which should get you the internal inlinks in a spreadsheet.

Or you can just select one, then look at the bottom panel and select the inlinks tab. Inspect several, and if it becomes clear what's happening, like an internal search or something code-related, then fix it.

phranque

11:39 pm on Aug 4, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I use this to strip and redirect query strings ...

(followed by apache mod_rewrite directives)

from the OP:
Site is on a windows server.

martinibuster

1:29 am on Aug 5, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



And anyway... I don't think it solves anything to suppress the symptoms of whatever is happening by building in redirects without actually finding the problem and then solving it.

I think it's better to identify what's causing the rogue URLs and then fix it there.

Otherwise it's like your engine is knocking so you insulate the car cabin from the engine noise. The noise is gone but the problem is still there.

Sgt_Kickaxe

3:16 pm on Aug 6, 2022 (gmt 0)



Just some added opinion...

#1 - Canonicals are your friend, use them.

Look at the WebmasterWorld page we're on right now. Add ?i=111111 to the end of the URL after the .htm and it will resolve with that string. Because the content of this page is the same WITH and WITHOUT the string, the canonical is telling Google which version to index. The string is thus effectively ignored by Google.
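The canonical is just a link element in the page head, e.g. (using this thread's example.com convention):

```html
<link rel="canonical" href="https://www.example.com/page.htm">
```

With that in place, a request for /page.htm?i=111111 consolidates to the clean URL in Google's index.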

#2 - 404s are natural on the internet, but if you can properly redirect the user, it's good for user experience to do so. So long as Google indexes the right content, it's better to use canonicals and not worry about strings than to serve up 404s on strings... IMO. Htaccess may be useful to serve content without users seeing strings, but I would hold off on using htaccess to force 404s on strings.

#3 - Don't assume the string is generated by your site. Anyone can add strings to your urls and link them to your site from elsewhere.

Good luck