Forum Moderators: Robert Charlton & goodroi


How Do I Get Google to Remove a Doubly Indexed Page?

With a ? in the URL.


MikeNoLastName

9:06 pm on Jul 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I recently discovered that Google has our home page indexed as both:

www.example.com/
and
www.example.com/?abcde.htm

If I submit the latter to the URL console to remove it, it will simply resolve (properly) to our home page and remove IT instead. Any suggestions on how to get it out of the index without killing the home page?

I don't understand why G is indexing both anyway.
Is this a bug, or is this how things are supposed to work? If so, couldn't ANYONE knock a site off the SERPs with a duplicate penalty by submitting enough of these types of links?

[edited by: ciml at 10:58 am (utc) on July 4, 2005]
[edit reason] Examplified [/edit]

ciml

11:01 am on Jul 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



example.com/?foo and example.com/?bar are different URLs. Plenty of webmasters use URLs like these to serve different content.

> If I submit the later to the URL console to remove it, it will simply resolve (properly) to our home page

That is a feature of your Web server, or other software running on it.

toughturkey

5:39 pm on Jul 4, 2005 (gmt 0)

10+ Year Member



This came up for me this week too, and I was pointed here...

[webmasterworld.com...]

MikeNoLastName

8:25 pm on Jul 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry for the needed edit, I thought I always used "domain" dot com which is an accepted generalization/exemplification. Must've been late and tired.

Anyway, are you SURE example.com/?abcde is a valid URL to serve different content? I thought you had to set a variable like ?src=abcde in order for anything useful to be passed on, and in those situations Google apparently recognizes it as a variation of the base page example.com and doesn't index it twice like in this case. In the example I'm seeing, the ? comes immediately after the "/".

The thread above I think will cover it nicely, thanks for the pointer, I'll be trying it out immediately.

One thing, though: the example appears to apply to ANY query string on ANY file call, but we have a few places where a query string IS appropriate. After doing some additional research, I believe the correct code for the example I gave above, to JUST ELIMINATE THE ONE GOOGLE-INDEXED call, would be:

---------------------------
RewriteCond %{QUERY_STRING} abcde\.htm
RewriteRule .* http://www.example.com/ [R=301,L]
---------------------------

If "abcde" IS a valid query string somewhere on your site, I guess you're just out of luck.

MikeNoLastName

8:59 pm on Jul 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm,
It appears with my example (or for that matter the 301 example given in the other thread) that if the query string is on the HOME PAGE, you get into an infinite 301 redirect loop, because the query string is still passed through to the new page. So in my case I'll just have to redirect to the 404 page.

MikeNoLastName

9:22 pm on Jul 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, unfortunately even that doesn't work, because the .htaccess intercepts it each time, sees the still-tacked-on query string, and loops again. Sooo, unless someone can tell me how to strip off the QUERY_STRING before the next iteration, I just set it to 301 redirect back to Google for them to deal with :-). Or should I use a 302 ;)? At least they finally return a 404 error. I'm tired of dealing with their problems.

MikeNoLastName

12:48 am on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a final follow-up. Ciml mentioned that this is a factor of our server. However, I've tested it on about 50 of the top websites out there (including paypal, M-soft and the major internet orgs.) and EVERY single one (except coincidentally Google) handles it the same way as our server does, by returning a 200 code and retaining the original URL with the ?, so theoretically just about any site on the web is vulnerable to this potential duplication penalty problem...

Still don't know where this link came from in the index. We've just added it to our home page to try to get Gbot back to re-spider it soon. We're not going to risk a URL console removal on something which resolves to the home page.

eyezshine

4:40 am on Jul 5, 2005 (gmt 0)

10+ Year Member



How do we get rid of URLs like www.example.com%22%3E%3Cimg/ in Google?

Is there a rewrite for that?

Here are other examples of my same url in google.

example.com%22%3Edirectory/
.com%5C%22/
.com%5C/
.com%22/

There are 5 different URLs like that indexed in Google, and my site is banned or sandboxed or something. All of those URLs are 404s, or cannot be found.

Reid

7:11 am on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How do we get rid of URLs like www.example.com%22%3E%3Cimg/ in Google?
Is there a rewrite for that?

Here are other examples of my same url in google.

example.com%22%3Edirectory/
.com%5C%22/
.com%5C/
.com%22/


First you need to translate these URL-encoded (percent-escaped) characters into something that might make sense:
%22 = "
%5C = \
%3E = >
%3C = <

Problem 1:
example.com%22%3Edirectory/ translates to

example.com">directory/

That looks like you may have a faulty <a> tag somewhere, e.g.:

<a href=http://www.example.com">directory/a>

Notice the missing " at the beginning of the URL and the missing < in the closing tag; either could produce the faulty URL.

Problem 2:
.com%5C%22/ translates to
.com\"/
same thing - a missing or extra " (plus a stray backslash)

looks like you should run your pages through a validator

MikeNoLastName

7:54 am on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good reply Reid, I never have the patience to look those all up. :)
A better question is, once you find your problems, how do you get G to stop indexing them? With situations like this, from what I've been reading, it's uncertain how the URL console will treat these when you submit them. Since the base URL may be a valid home page, G could think you mean to remove the home page... and 6 months can be a LONG time to be down.
In his case, since they resolve to a 404, you would think it would be ok to submit them to the URL console to have them removed, but with anything to do with a home page, _I_ would never take the chance. On the other hand, Eyez, the latest consensus seems to be that unless you are seeing a full description in the index for the alternate URLs (like we ARE), the duplicate entries are not SUPPOSED to cause a duplication penalty.

ciml

12:01 pm on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anyway, are you SURE example.com/?abcde is a valid URL to serve different content? I thought you had to set a variable like ?src=abcde in order for anything useful to be passed on,

First, we have the URI specification [ietf.org], which defines "<scheme>://<authority><path>?<query>"

3.4. Query Component

The query component is a string of information to be interpreted by
the resource.

query = *uric

Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved.

Next, we have HTTP [w3.org]. From 3.2.2:

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

Next in the chain, we have CGI [hoohoo.ncsa.uiuc.edu], which defines QUERY_STRING:

The information which follows the ? in the URL which referenced this script. This is the query information. It should not be decoded in any fashion. This variable should always be set when there is query information, regardless of command line decoding.

Lastly, we follow the chain to HTML [w3.org]. In HTML 4 forms [w3.org] we find that for method=GET (i.e. application/x-www-form-urlencoded), the form data must be encoded as follows:

The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'

So, the ?a=foo&b=bar convention belongs to HTML, and is interpreted by HTTP servers.

Ciml mentioned that this is a factor of our server. However, I've tested it on about 50 of the top websites out there (including paypal, M-soft and the major internet orgs.) and EVERY single one (except coincidentally Google) handles it the same way as our server does,

Note that the well-known Web sites you mention also deliver their home pages for URLs like example.com/?abc=123

The default behavior in Apache and IIS is to deliver a static page unaltered, when the URL has a query appended. This is down to the internal workings of the Web server, and there is no reason not to have, for example, example.com/?english, example.com/?german, example.com/?french, etc.
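For illustration, a minimal .htaccess sketch of that sort of setup (assuming Apache with mod_rewrite enabled; the page names index-english.html and index-german.html are hypothetical) might look like:

---------------------------
RewriteEngine On
# Serve a different static page depending on the raw query string.
# A request for / with no query string falls through untouched.
RewriteCond %{QUERY_STRING} ^english$
RewriteRule ^$ /index-english.html [L]
RewriteCond %{QUERY_STRING} ^german$
RewriteRule ^$ /index-german.html [L]
---------------------------

Each RewriteCond applies only to the RewriteRule immediately following it, so the two language variants are handled independently.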


eyezshine, those URLs shouldn't matter. The problem only comes if they resolve, and Google is able to fetch content on them.

eyezshine

4:13 pm on Jul 5, 2005 (gmt 0)

10+ Year Member



CIML,

Then what could my problem be, if that isn't it? That site is as clean as possible. I have checked it over and over and there is nothing shady about it. It used to be #1 for its keyword for 3-4 years, and then last year around August it dropped into nothingness.

The site did have affiliate product links with fastclick, but I took the affiliate stuff off in hopes that would fix the problem. It didn't help.

The site has many thousands of links to it from mostly related sites and some not, with different anchor text.

It's like the site has been marked for spam or something and there is no way out of the penalty. I have emailed Google through their re-inclusion form many times but it doesn't help.

zgb999

5:53 pm on Jul 5, 2005 (gmt 0)

10+ Year Member



I too would like to do a 301-redirect from [site.com...] to [site.com...]

The site has only static URLs, and any URL with a ? is only there for tracking reasons. I tried the solution from Claus's post mentioned above, but it did not work. anypage.htm and anypage.htm?anything are both giving status 200, instead of anypage.htm?anything giving a 301 redirect to anypage.htm.

Is there a rewrite for all pages with a question mark?

ciml

6:03 pm on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



eyezshine, I can't say what's wrong in your case, but if you see listings for non-URLs that Google can't fetch, then I don't see what harm they can do to your Web site.

Caveat: Some time ago, a number of people had problems with, for example subdomain%20.example.com (or www%20.example.com) being linked to.

Google would try to fetch them, and although those URLs are not legal, with wildcard DNS and a Web server using IP-based hosting and no Host check, content would be returned. If this URL was chosen during canonicalisation, people with other user agents would get an error when they clicked the link.

MikeNoLastName

6:10 pm on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, once again, the $million question is:
WHY is Google indexing these multiple versions, causing duplicate penalties, when they are legitimate (as ciml points out) and resolve to the same page, as in our case? And more importantly, how do we get rid of them without removing the home page? Yes, in our case the ? version is indexed AND has a full description, cache, etc., and is on the same domain being dup penalized.

claus

8:34 pm on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> so theoretically just about any site on the web is vulnerable to
>> this potential duplication penalty problem...

No. The word "theoretically" is wrong. They are.

To Google (and the other SEs) a URI is a unique string of characters, and two strings of characters that aren't 100% identical are not considered to be the same URI.

>> anypage.htm and anypage.htm?anything are both giving status 200

Toughturkey had the same problem in this thread [webmasterworld.com]. I really don't know what the cause is. It could e.g. be a conflict of rules, if you have other rules that are valid for the same set of URLs and are placed before these in the ".htaccess" file.

joeduck

9:11 pm on Jul 6, 2005 (gmt 0)

10+ Year Member



Google would try to fetch them and although those URLs are not legal with wildcard DNS and a Web server using IP based hosting and no Host check, content would be returned. If this URL was chosen during canonicalisation, people with other user agents would get an error when they clicked the link

ciml - do you think that this process has broader effects? Does this result in G calling the wrong URL as canonical for many pages, or just a select few?

AlexK

9:32 pm on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



zgb999:
Is there a rewrite for all pages with a question mark?

(It took me ages to find this!)

Have a look at the Apache docs [httpd.apache.org] (look for "Query String"):

When you want to erase an existing query string, end the substitution string with just the question mark.

Easy when you know how.
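
Putting that together, a sketch for zgb999's case (assuming Apache mod_rewrite in the site root's .htaccess) could be:

---------------------------
RewriteEngine On
# 301-redirect any request carrying a query string to the same path
# with the query erased. The bare "?" ending the substitution is what
# drops the query string and prevents the redirect loop described
# earlier in this thread.
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.*)$ /$1? [R=301,L]
---------------------------

Note this strips EVERY query string; if some query strings on the site are legitimate, the RewriteCond pattern would need to be narrowed to match only the unwanted ones.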

Reid

5:30 am on Jul 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



eyezshine, I can't say what's wrong in your case, but if you see listings for non-URLs that Google can't fetch, then I don't see what harm they can do to your Web site.

It is unrelated to the canonical issue, but permanent 404s in Google can harm a site.

It 'knows' the URI exists and 'knows' it is returning 404, but somehow refuses to forget about it.
404 is an error, but not specifically a permanent error - it could be temporary.
A 404 URI seems to drag other URIs into the supplemental index. I sure wouldn't want a 404 associated in any way, shape or form with my homepage for very long.

If you are worried about removing your homepage, you can force a code 410 GONE on a specific URI and it will be removed.
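
For example, a mod_rewrite sketch (reusing the abcde.htm query string from earlier in the thread); the [G] flag forces a 410 Gone response:

---------------------------
RewriteEngine On
# Return 410 Gone only for the query-string variant of the home page;
# a plain request for / is untouched.
RewriteCond %{QUERY_STRING} ^abcde\.htm$
RewriteRule ^$ - [G]
---------------------------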

ciml

9:57 am on Jul 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



joeduck, the process I described affecting sites with wildcard DNS, IP based hosting and links to illegal subdomains is very rare.

I can see a listing of www%20.example.com in the SERPs, with title and description.

Google Translate and the Google cache don't work for it, but I do notice that the browser I'm using today (IE6) can access it. The browser I was using when this problem first came to light (January 2002 I think) didn't.