Forum Moderators: Robert Charlton & goodroi
www.example.com/
and
www.example.com/?abcde.htm
If I submit the latter to the URL console to remove it, it will simply resolve (properly) to our home page and remove IT instead. Any suggestions how to get it out of the index without killing the home page?
I don't understand why G is indexing both anyway?
Is this a bug or is this how things are supposed to work? If so, couldn't ANYONE knock a site off the SERPs with a duplicate penalty by submitting enough links of this type?
[edited by: ciml at 10:58 am (utc) on July 4, 2005]
[edit reason] Examplified [/edit]
> If I submit the later to the URL console to remove it, it will simply resolve (properly) to our home page
That is a feature of your Web server, or other software running on it.
[webmasterworld.com...]
Anyway, are you SURE example.com/?abcde are valid URLs to serve different content? I thought you had to set a variable like ?src=abcde in order for anything useful to be passed on, and in those situations Google apparently recognizes it as a variation of the base page example.com and doesn't index it twice like in this case. In the example I'm seeing the ? comes immediately after the "/".
The thread above I think will cover it nicely, thanks for the pointer, I'll be trying it out immediately.
One thing, though: the example appears to apply to ANY query string on ANY file call; however, we have a few places where a query string IS appropriate. After doing some additional research, I believe the correct code to eliminate JUST the one call Google indexed in my example above would be:
---------------------------
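# Note: the RewriteCond pattern is a regular expression, so the dot in
# "abcde.htm" is escaped below, and the bare "?" at the end of the
# RewriteRule substitution keeps mod_rewrite from re-appending the
# original query string to the redirect target.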
RewriteCond %{QUERY_STRING} abcde\.htm
RewriteRule .* http://www.example.com/? [R=301,L]
---------------------------
if "abcde" IS a valid query string somewhere on your site, I guess you're just out of luck.
Still don't know where this link came from in the index. We've just added it to our home page to try to get Gbot back to re-spider it soon. We're not going to risk a URL console remove on something which resolves to the home page.
Is there a rewrite for that?
Here are other examples of my same url in google.
example.com%22%3Edirectory/
.com%5C%22/
.com%5C/
.com%22/
There are 5 different urls like that indexed in google and my site is banned or sandboxed or something? All of those urls are 404's or cannot be found.
How do we get rid of URLs like www.example.com%22%3E%3Cimg/ in Google?
Is there a rewrite for that? Here are other examples of my same url in google.
example.com%22%3Edirectory/
.com%5C%22/
.com%5C/
.com%22/
Problem 1:
example.com%22%3Edirectory/ translated into
example.com">directory/
that looks like you may have a faulty <a> tag somewhere
<a href=http://www.example.com">directory/a>
Notice the missing " at the beginning of the URL and the missing < at the closing tag; that could produce the faulty URL.
Problem 2
.com%5C%22/ translates to
.com\"/
same thing - missing or extra "
looks like you should run your pages through a validator
Anyway, are you SURE example.com/?abcde are valid URLs to serve different content? I thought you had to set a variable like ?src=abcde in order for anything useful to be passed on,
First, we have the URI spec [ietf.org], which defines "<scheme>://<authority><path>?<query>"
3.4. Query Component
The query component is a string of information to be interpreted by the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved.
Next, we have HTTP [w3.org]. From 3.2.2:
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
Next in the chain, we have CGI [hoohoo.ncsa.uiuc.edu], which defines QUERY_STRING:
The information which follows the ? in the URL which referenced this script. This is the query information. It should not be decoded in any fashion. This variable should always be set when there is query information, regardless of command line decoding.
Lastly, we follow the chain to HTML [w3.org]. In HTML 4 forms [w3.org] we find that for method=GET (i.e. application/x-www-form-urlencoded) the form data must be encoded as follows:
The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'
So, the ?a=foo&b=bar convention belongs to HTML, and is interpreted by HTTP servers.
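To connect that chain to the server side of this thread: in Apache's mod_rewrite the path and the query arrive in separate variables, and a RewriteRule pattern only ever sees the path, so a query can only be tested with a RewriteCond. A rough sketch, using a hypothetical page.htm together with the ?src=abcde example mentioned earlier (not anyone's real URLs):
---------------------------
# Request: http://www.example.com/page.htm?src=abcde
#   %{REQUEST_URI}   ->  /page.htm   (the <path> part)
#   %{QUERY_STRING}  ->  src=abcde   (the <query> part, left undecoded)
# The query has to be examined via a RewriteCond; for instance, to send
# the tracking variant back to the bare page and drop the query:
RewriteCond %{QUERY_STRING} ^src=abcde$
RewriteRule ^page\.htm$ /page.htm? [R=301,L]
---------------------------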
Ciml mentioned that this is a feature of our server. However, I've tested it on about 50 of the top websites out there (including paypal, M-soft and the major internet orgs.) and EVERY single one (except coincidentally Google) handles it the same way as our server does.
Note that the well known Web sites you mention also deliver their home pages with example.com/?abc=123
The default behavior in Apache and IIS is to deliver a static page unaltered, when the URL has a query appended. This is down to the internal workings of the Web server, and there is no reason not to have, for example, example.com/?english, example.com/?german, example.com/?french, etc.
eyezshine, those URLs shouldn't matter. The problem only comes if they resolve, and Google is able to fetch content on them.
Then what could my problem be if that isn't it? That site is clean as possible. I have checked it over and over and there is nothing shady about it. It used to be #1 for its keyword for 3-4 years and then last year around August it dropped into nothingness.
The site did have affiliate product links with fastclick but I took the affiliate stuff off in hopes that would fix the problem? But it didn't help?
The site has many thousands of links to it from mostly related sites and some not, with different anchor text.
It's like the site has been marked for spam or something and there is no way out of the penalty. I have emailed Google through their re-inclusion form many times but it doesn't help.
The site has only static urls and any url with a ? is only there for tracking reasons. I tried the solution from Claus' post mentioned above but it did not work: anypage.htm and anypage.htm?anything are both giving status 200 instead of anypage.htm?anything giving a 301 redirect to anypage.htm.
Is there a rewrite for all pages with a question mark?
Caveat: Some time ago, a number of people had problems with, for example subdomain%20.example.com (or www%20.example.com) being linked to.
Google would try to fetch them, and although those URLs are not legal, with wildcard DNS and a Web server using IP-based hosting and no Host check, content would be returned. If this URL was chosen during canonicalisation, people with other user agents would get an error when they clicked the link.
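For anyone wanting to close that hole at the server level, a sketch (assuming Apache with mod_rewrite, and that www.example.com is the one canonical hostname for the site) would be to bounce any request that arrives with an unexpected Host header:
---------------------------
# Redirect anything not addressed to the canonical hostname back to it;
# this also catches oddities like "www%20.example.com" that wildcard DNS
# and IP-based hosting would otherwise answer for.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule .* http://www.example.com%{REQUEST_URI} [R=301,L]
---------------------------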
No. The word "theoretically" is wrong. They are.
To Google (and the other SEs) a URI is a unique string of characters, and two strings of characters that aren't 100% identical are not considered to be the same URI.
>> anypage.htm and anypage.htm?anything are both giving status 200
Toughturkey had the same problem in this thread [webmasterworld.com]. I really don't know what the cause is. It could, e.g., be a conflict of rules if you have other rules that are valid for the same set of URLs and are placed before these in the ".htaccess" file.
Google would try to fetch them, and although those URLs are not legal, with wildcard DNS and a Web server using IP-based hosting and no Host check, content would be returned. If this URL was chosen during canonicalisation, people with other user agents would get an error when they clicked the link
ciml - do you think that this process has broader effects - does this result in G calling the wrong URL as canonical for many pages or just a select few?
Is there a rewrite for all pages with a question mark?
Have a look at the Apache docs [httpd.apache.org] (look for "Query String"):
When you want to erase an existing query string, end the substitution string with just the question mark.
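So a sketch of an answer to the question above, assuming Apache mod_rewrite in the site's .htaccess and, importantly, that no .htm page on the site legitimately needs its query string:
---------------------------
RewriteEngine On
# Only act when a query string is actually present...
RewriteCond %{QUERY_STRING} .
# ...then 301 /anypage.htm?anything back to /anypage.htm; as the docs say,
# the bare "?" at the end of the substitution erases the query string.
RewriteRule ^(.+\.htm)$ /$1? [R=301,L]
---------------------------
If both URLs still answer 200 after adding this, the usual suspect (as Claus notes above) is another rule earlier in the same .htaccess matching first, so these lines generally need to sit above the other rewrite rules.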
eyezshine, I can't say what's wrong in your case but if you see listings for non-URLs that Google can't fetch, then I don't see what harm they can do to your Web site.
It is unrelated to the canonical issue, but permanent 404s in Google can harm a site.
It 'knows' the URI exists and 'knows' it is returning 404 but somehow refuses to forget about it.
404 is an error but not specifically a permanent error - it could be temporary.
A 404 URI seems to drag other URI's into the supplemental index. I sure wouldn't want a 404 associated in any way shape or form with my homepage for very long.
If you are worried about removing your homepage, you can force a code 410 GONE on a specific URI and it will be removed.
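In Apache that can be done with mod_rewrite's [G] flag. A minimal sketch for the /?abcde.htm example from the top of the thread (substitute whatever phantom URL Google actually has indexed); the idea is that the phantom URL then returns 410 for the URL console while a plain request for the home page keeps returning 200:
---------------------------
# Answer "410 Gone" only when the home page is requested with the phantom
# query string; requests for "/" without it are untouched.
RewriteEngine On
RewriteCond %{QUERY_STRING} ^abcde\.htm$
RewriteRule ^$ - [G]
---------------------------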
I can see a listing of www%20.example.com in the SERPs, with title and description.
Google Translate and the Google cache don't work for it, but I do notice that the browser I'm using today (IE6) can access it. The browser I was using when this problem first came to light (January 2002 I think) didn't.