Forum Moderators: open

Message Too Old, No Replies

Incorrect URLs and Mirror URLs

Causing duplication penalties.

         

crobb305

12:39 am on Nov 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google has indexed numerous incorrect URLs and mirror URLs all pointing to my index page. Subsequently, the original URL (www.mydomain.com) has been suppressed to the bottom of the results for any search (presumably a duplication penalty). This problem was also mentioned in message 11 of the following thread:

[webmasterworld.com...]

The URLs pertaining to my website that all point to my index page take the following form.

www.mydomain.com/?S=AC3%26Document=document
www.mydomain.com/?SID=xRSUNVW8R9P44HSYQ6UWED&
www.mydomain.com/default.asp?S=AC3&am
www.some-other-URL.com/go.php?id=aHR0cDovL3d3dy5jcmVkaXRjaGFtcGlvbi5jb20v
www.some-other-URL-2.com/go.php?id=aHR0cDovL3d3dy5jcmVkaXRjaGFtcGlvbi5jb20v
www.some-other-URL-3.com/file/callink.php?linkid=3
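
Incidentally, the long id= values on those go.php links are just base64-encoded target URLs. You can decode one from the command line to confirm which site a tracking link points at. The value below is a stand-in that encodes the placeholder http://www.example.com/, not one of the real values above:

```shell
# Decode a go.php?id= parameter to reveal the URL it redirects to.
# This sample value encodes the placeholder http://www.example.com/
printf '%s' 'aHR0cDovL3d3dy5leGFtcGxlLmNvbS8=' | base64 -d
# prints: http://www.example.com/
```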

I have emailed Google, but have received no reply. I am unsure what I can do to A) eliminate the incorrect URLs that appear to originate from my site and B) eliminate the mirror URLs that originate from unrelated websites.

Any help would be greatly appreciated.

crobb305

11:50 pm on Nov 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



fusioneer,

Thanks for the info. I have to say that I am not having these problems with Yahoo, the new MSN search, or any other search engine for that matter. Google needs to move past its current problems. It's sad that great websites are being beaten out of the SERPs and that sabotage of this form is still possible in Google. Google, please take notice. Please do something. Webmasters should *NOT* have to manually submit incorrect/mirror URLs for removal.

Google should be very embarrassed.

my3cents

2:10 pm on Nov 27, 2004 (gmt 0)

10+ Year Member



Crobb,

I agree that this is a major problem; by the time the mirror/incorrect URLs show up, the damage is already done. Every month there is a slew of new URLs making the problem worse.

In Vegas, when the Google rep was asked why they had www.mysite.com/?tracking=ppc indexed instead of our regular URL, the rep said to 301 it and that would solve the problem. I promptly tried to do a 301 and, as most of you know, this is not supported and does nothing:

redirect 301 /?tracking=ppc [mysite.com...]

99% of the reason for the trip to vegas was to find a solution to this problem. I obviously did not get what I was looking for there.

These problems seem like they would be so simple for Google to fix if they would only acknowledge that there IS a problem.

walkman

5:45 pm on Nov 27, 2004 (gmt 0)



"redirect 301 /?tracking=ppc [mysite.com...]" is not supported.

Can we ban /?blah in robots.txt so Google doesn't show the dupes? Will this work?
The next step would be to beg Google to delete them, but any competitor can submit your
site.com?bye_bye_rankings to a blog or directory...and there you go again.

walkman

1:07 am on Nov 28, 2004 (gmt 0)



replying to myself...:)

I added this to my robots.txt:

User-agent: Googlebot
Disallow: /?tag
Disallow: /?tag1
Disallow: /?tag2
Disallow: /?

Time will tell... I'm thinking of trying it on a domain that is not important: link to it as ?test and block that in robots.txt. A rewrite would be great... any domain.com/?* would be 301'd to the index.
Anyone good enough at that? It could help tons of us.
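
What walkman is asking for can be sketched with mod_rewrite in .htaccess (this assumes Apache with mod_rewrite enabled; example.com is a placeholder):

```apache
RewriteEngine On
# Match the home page only when some query string is present
RewriteCond %{QUERY_STRING} .
# The trailing "?" on the target strips the query string from the redirect
RewriteRule ^$ http://www.example.com/? [R=301,L]
```

A real deployment would want to exempt any query strings the site actually uses before blanket-redirecting like this.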

crobb305

5:42 am on Nov 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a meeting/conf call scheduled with my host on Monday to see about special server configs, etc. I think it is absolutely absurd that I have to worry myself with this crap, considering that Google should be advanced/intelligent enough to handle this itself. However, I can NOT continue to let 4 years of VERY hard work go down the drain.

C

walkman

7:10 am on Nov 28, 2004 (gmt 0)



"I have to worry myself with this crap considering that Google should be advanced/intelligent enough to handle this itself."

Three things come to mind, given the silence and the fact that I think GG, or Google in general, has heard of this (plus page jacking, tracker2.php, ?tagged URLs, meta refresh links to you, etc.):

1. They don't care because only x% of innocent sites are getting caught in the middle. It sucks to be them (the penalized sites), but spammers used this method and the end justifies the means. Plus, most normal users (not SEOs) will never know about the inner workings, so it doesn't matter.

2. They're working on a solution and it will be implemented on the next update, or it's harder to figure out because it may affect other things that we don't know about.

3. This is NOT what is causing our sites to drop. Something else is, and since we don't know, we keep wondering and grasping at straws. Maybe they're laughing their asses off at our theories (if they're reading, of course).

Honestly, I don't know which it is... but not having to worry about little things and what competitors can do to you would be great. Does guestbook/blog bombing still knock pages off?

Wail

9:18 am on Nov 29, 2004 (gmt 0)

10+ Year Member



my3cents:
In vegas, the google rep suggested that I 301, but now I find that this cannot be done because you cannot 301 a query string.

Sorry. Weekend delay effect.
You can 301 a query string. You just need to use a server-side script for it (which you're likely to be running if you've got query strings). You won't be able to use the ticky boxes on IIS for query-string URLs, though.

my3cents

4:27 pm on Nov 29, 2004 (gmt 0)

10+ Year Member



walkman,

Your point #3 - this may not be what is causing it, but the fact that most of my main pages have no title and description for the actual URL has to have something to do with it. Also, the incorrect URLs have the full title and description, but they have few or no backlinks and little PR.

I actually do not use query strings on my site; it's plain vanilla HTML. These tracking links are generated by search engines and directories, and I have no use for them. I contacted several of the search engines and directories and asked them to remove the tracking portion of their links to me. Some of them removed it 3-4 months ago, but the old URLs are still in the index... with full titles and descriptions.

Also, I agree that we should not have to search for months and take drastic measures to find a workaround for this. I see that Google has over a quarter million 404 pages indexed with titles and descriptions drawn from the content of the 404 page.

Why would google want to index "this page cannot be found"?

Wail

4:53 pm on Nov 29, 2004 (gmt 0)

10+ Year Member



Why would google want to index "this page cannot be found"?

They don't. It's likely that these error pages don't return the proper 404 header codes. At Las Vegas this year Yahoo went on and on about the importance of 404 error pages which actually returned the 404 error code. It's clearly bugging Yahoo too.
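
A quick way to check this on your own site is to request a made-up path and look at the status code that comes back (curl assumed; the URL in the example is a placeholder):

```shell
# Print the HTTP status code a URL answers with.
http_status() {
  curl -s -o /dev/null -w '%{http_code}\n' "$1"
}

# Example (placeholder URL):
#   http_status http://www.example.com/this-page-should-not-exist
# A correctly configured server answers 404 for a page that does not exist;
# a broken custom error page setup answers 200 (or 302 if it redirects first).
```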

my3cents

5:19 pm on Nov 29, 2004 (gmt 0)

10+ Year Member



I am removing my custom 404 page for now; Google has indexed several URLs that do not exist on my site. I see this problem with several other sites too. 404>302>200 = fully indexed with the content of the 404 page.

zeus

5:37 pm on Nov 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone talked to the newspapers about this? That would wake Google up, I think.

walkman

11:17 pm on Nov 29, 2004 (gmt 0)



"You can 301 a query string. You just need to use server side script for it (which you're likely to be running if you've got query strings). You'll not be able to ticky boxes on IIS for query string URLs though."

Wail, do you know how to? I have tried, with my limited skills, to do it via an Apache rewrite but with no luck. I don't use a server-side script or anything, though; I just linked to dom.com?track to see how many people came from there. Those links have been removed for at least 6 months.

walkman

11:20 pm on Nov 29, 2004 (gmt 0)



"I am removing my custom 404 page for now, google has indexed several urls that do not exist on my site. I see this problem with several other sites too. 404>302>200 = fully indexed with content of 404 page. "

I have a "custom 404 page" too, but I just checked and it returns the right 404 code. As long as it does, I don't think you need to remove it.

Wail

9:16 am on Nov 30, 2004 (gmt 0)

10+ Year Member



Walkman,

PHP is my strongest server-side language. To 301 a query string I'd use something like this (where we say your URL is www.example.com but someone links to it as www.example.com/?track=1234):


<?php
// Redirect any request carrying the ?track parameter to the clean URL
if (isset($_GET['track'])) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://www.example.com/");
    exit;
}
?>

Spine

5:32 pm on Nov 30, 2004 (gmt 0)

10+ Year Member



My custom 404 page that redirected surfers to my index page has certainly come back to bite me on the ass. Previously Google handled this, but now it causes false duplicate content.

It's an expensive mistake so far.

zeus

11:14 pm on Nov 30, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, Google.com is not as advanced as other search engines, so they will not fix this problem, but I have tried other options, which I will list here:

Contact the hijacker or redirecter - mostly you don't get any answer.

Then try to contact the host, which can be a little complicated even with whois - this may also fail, or they may have their own server.

Then contact their advertisers. There has to be money in this, like Google AdSense or other affiliates/advertisers on the site that is violating your site; that will hurt their business.

zeus

zeus

3:21 pm on Dec 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Once again I find new sites included when I do a site:www.mydomain.com search - tracker2.php and redirects. In one of them I saw this in the HTML:

<p>Redirecting, please wait...</p>
<meta name="robots" content="noindex">
<meta http-equiv="refresh" content="0; URL=http://www.example.com/">

It looks like they are doing things the right way, with a noindex.

What is going on? What's happening? Why can't they fix this hijacking and redirecting stuff?
I even sent an email to AdSense to tell them one of their affiliates is hijacking my site; they could not see anything wrong in redirecting to my site. They just don't know.

[edited by: ciml at 2:00 pm (utc) on Dec. 2, 2004]
[edit reason] Examplified [/edit]

Variable

5:28 pm on Dec 1, 2004 (gmt 0)

10+ Year Member



Hi Fusioneer,

I am 90% sure the same thing happened to my site. Thank you very much for your post. I am going to implement what worked for you.

I have a large site, though; hopefully I'll have as much success as you did. I'll let you know how it works out.

zeus

9:50 pm on Dec 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I tried a lot to conquer this hijacking/redirect stuff. I contacted AdSense the first time to tell them that one of their affiliates is hijacking my site; they did not really know what I was talking about, which is OK - they are in a different field. I explained what it was in the next email, and then I got this reply:

< email thanking zeus for the extra information, indicating that it would be passed on >

Hopefully this will help clean up the search on Google, once they can see that many of the hijackers are using AdSense.

[edited by: ciml at 10:23 pm (utc) on Dec. 1, 2004]
[edit reason] Email Summarised [/edit]

Lorel

4:47 pm on Dec 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Yes... many of the mirrors of my site have "tracker2" in the URL. This is blatant sabotage, as my website is being penalized for apparent duplication.

I had two of my sites with a tracker2 URL involved. I wrote the owner of the site and, after getting no response, I wrote the hosting company; now both links are gone, one within 24 hours.

fusioneer

6:35 am on Dec 6, 2004 (gmt 0)

10+ Year Member



Results are in...

The site in question returned to the top 5 positions 1 week after the duplicate URLs were removed using the methods I described earlier in this thread.

Previously the site was nowhere to be found in top 1000 results.

For people PM'ing me on various how-tos: I have described the method I used, in detail, earlier in the thread. I would recommend it for getting rid of URLs you don't want indexed in Google for any reason.

Re 301 redirects: this method DID NOT work.
However, if you want to do a 301 redirect for any other reason, you can use ColdFusion, ASP, or PHP, and it is extremely simple; google "301 redirect +php" etc. and you will find your answer. I don't want to post misleading code snippets in this thread, as this was not the solution to this particular problem.

jdMorgan

7:31 am on Dec 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



my3cents,

In vegas, the google rep suggested that I 301, but now I find that this cannot be done because you cannot 301 a query string.

301-Redirect "www.mysite.com/?why.is=google_doing_this" to "www.mysite.com/why.is" using mod_rewrite in .htaccess on Apache:


Options +FollowSymLinks
RewriteEngine on
# The query string of "/?why.is=google_doing_this" is "why.is=google_doing_this"
RewriteCond %{QUERY_STRING} ^why\.is=google_doing_this$
# The URL-path is the bare root ("^$"); the trailing "?" on the target
# strips the query string from the redirect
RewriteRule ^$ http://www.mysite.com/why.is? [R=301,L]

That will neatly redirect to the "/why.is" URL, stripped of the query string (and without the "?").

If you also have a rewrite from the static URL to the dynamic one, the above method may not work (you may get a rewriting loop of "dueling rewrites"), but there's a way around that, too. The solution will be site-dependent, though, and unsuitable for a general thread like this.

Googlebot accepts some wildcards in robots.txt. Most other search engine 'bots don't. See Google's robots.txt FAQ [google.com].
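
For example, a Googlebot-only rule to keep every query-string URL out of the index might look like this (a sketch; the wildcard syntax is Googlebot-specific, and other crawlers may ignore or misinterpret it):

```
User-agent: Googlebot
Disallow: /*?
```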

The most common cause of incorrect server status codes returned by custom error pages is that the Apache ErrorDocument directive requires a local URL-path; If you specify a full URL, you will get a 302 redirect status. This is well-documented in the Apache ErrorDocument description [httpd.apache.org].
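
A minimal illustration of the difference, with hypothetical paths:

```apache
# Correct: a local URL-path; the error page is served with the 404 status
ErrorDocument 404 /errors/not-found.html

# Wrong: a full URL makes Apache answer with a 302 redirect instead of a 404
# ErrorDocument 404 http://www.example.com/errors/not-found.html
```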

For IIS without ISAPI Rewrite, a PHP redirect like the one Wail posted above is the way to go; just make sure the Location header is capitalized and that the script exits after sending it.

For anyone who missed it, Robber's post (msg #12 in this thread) is worth reading.

I'd also like to mention that WebmasterWorld has a forum for Apache discussions...

Jim

Variable

7:03 pm on Dec 14, 2004 (gmt 0)

10+ Year Member



Fusioneer,

Thank you very much for your valuable post. I have definitely been hit hard by this issue.

My site has thousands of subdomains, each tailored to a particular category. Previously, I had my site set up to check the URL to see which category to bring up and, in the case of an invalid URL, to bounce to the homepage.

My site has approximately 5000 valid subdomains. Someone linked to my site pointing to 25000 or so invalid subdomains. All of those invalid subdomains got indexed in G as my homepage and caused my site to pretty much drop out of the SERPs.

G confused my homepage www.example.com with one of the invalid subdomains bogus.example.com. So when I type in www.example.com into the G search box, it brings up bogus.example.com.

So I changed my site so that when a bogus subdomain comes up, it 404s instead of redirecting. Now my entire site is showing up with just the URL but no description (as if it 404'd when Googlebot hit it). So now I have 30000 URLs in G (site:www.example.com), none of them have descriptions, and 25000 of them are invalid. Of course, none of them are in the SERPs anymore.

So I tried your removal tool method. Unfortunately, due to the limitations of robots.txt, you can't specify domains or subdomains (please correct me if I'm wrong), only the root itself and the files and dirs below it. So I was unable to use robots.txt to remove the invalid URLs.

I was able to use the meta tag single page removal tool though. However, many of the invalid domains have a "%20" or a "+" in them, which the single page meta tag removal tool won't take.

So...it pretty much looks like I'll either have to start over completely or wait and hope that G addresses this issue.

Any advice?

Spine

7:24 pm on Dec 14, 2004 (gmt 0)

10+ Year Member



One of my sites had some content that was mistaken as duplicate. I've done my best to fix the issue, and hope that it will work itself out.

Thanks for the URL removal suggestions.

I've noticed one odd thing. Since the URLs have been removed, they no longer show up with a site:domain.com command, but Google still says 'pages 1-10 out of 380' (75% more pages than I actually have). As I go through the pages of results it will say 'pages 21-30 out of 225' and 'pages 51-60 out of 170' etc., until, on the last page, it has roughly the right number of pages that I actually have.

Hoping that when it starts to report the correct number for pages 1-10, 11-20 etc that my results for that site will pick up again. The site has seen some deep freshbot crawling, but I'm hoping that another deep crawl will do it.

fusioneer

11:19 pm on Dec 14, 2004 (gmt 0)

10+ Year Member



Variable - sounds like a unique twist on the same problem.

OK to remove subdomains you would have to create a robots.txt in the root of each subdomain with:

User-agent: *
Disallow: /

However, you are saying some subdomains are bogus. In that case you may need to "create" the bogus subdomains in order to put a robots.txt in each one.

You may have it set up so that all subdomains are virtual, i.e. there are no separate subdirectories to put a robots.txt in. In this case you need to set up a dynamic script that responds with the correct robots.txt depending on the domain.

So we have two robots.txt files, [A] and [D]. A allows everything, D disallows everything.

good.example.com ---> robots.txt [A]
category.example.com ---> robots.txt [A]

bad.example.com ---> robots.txt [D]
bogus.example.com ---> robots.txt [D]

Basically Google will be checking for the presence of robots.txt on your subdomains - you have to make sure your site responds with the [D] robots.txt when Google asks for bogus.example.com/robots.txt.
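
One way to sketch that, assuming Apache with mod_rewrite (the hostnames and file names here are placeholders, not from fusioneer's post): keep two files, robots-allow.txt and robots-disallow.txt, and rewrite /robots.txt to one or the other based on the Host header.

```apache
RewriteEngine On
# Known-good subdomains get the permissive file...
RewriteCond %{HTTP_HOST} ^(www|good|category)\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-allow.txt [L]
# ...every other host gets the file containing "User-agent: *" / "Disallow: /"
RewriteRule ^robots\.txt$ /robots-disallow.txt [L]
```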

Additionally you can use the URL removal tool as specified in my earlier post once you have setup the robots.txt file correctly.

I had another look through the spec here:
[robotstxt.org...]

---snip---
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
---snip---

It does say you can use a full path; however, I don't think putting subdomains in robots.txt works, though I have never tried it. E.g.

Disallow: [bogus.example.com...]

You could try testing this on a single domain (anyone tried this before?).

walkman

1:50 am on Dec 15, 2004 (gmt 0)



Spine,
I have the same problem. I just hope that Google didn't keep the pages on their end and merely hide them from the public. Otherwise, we'd be screwed for good.

This update should be able to clear that out; maybe the recount each month or something.

walkman

5:18 am on Dec 15, 2004 (gmt 0)



fusioneer,
I just tried to remove my ?tags and will check back in a day or two. The 301 either doesn't work or it takes months, and I can't wait, for $$ reasons. This has already cost me a fortune (relatively speaking) and untold months of worrying and blaming everything. I even asked a nice site to remove my linked site name, spread over 90+ domains with great PR (a donation page was shared).

Let's hope this (?track) is it.

fusioneer

5:46 am on Dec 15, 2004 (gmt 0)

10+ Year Member




Walkman,

The steps described in my earlier post worked perfectly for the site in question. The site returned to the top 5 rankings after being outside the top 1000; this happened within 7 days of using the robots.txt method as described.

Caveat here: if your site is being penalized for duplicate content and you can get rid of the duplicate/bad URLs using this method, it should work and your site should return to "normal" ranking.

If your site is being penalized for some other reason... this may only fix part of the problem ;)
If you want to PM me your URL I would be happy to take a look and give you my 2c.

Otherwise, good luck...!

walkman

1:33 pm on Dec 15, 2004 (gmt 0)



fusioneer,
I think this is it. I had 4 extra front pages indexed (as supplemental) with ?Tag1 etc. The surprising part is that the links to these ?tracking pages were removed at least 6 months ago.

My front page with filter=0 is #1 for a few very competitive terms; without it, it is in the 50s now. My inside pages are nowhere to be found either (50th-100th+ place), but I think it's because the front page has been demoted so much that essentially no value is passed.

Once upon a time, Walkman used to be in the top 3-5 places for the inside pages (where my money is)... where I deserve to be, of course ;)

Spine

1:52 am on Dec 16, 2004 (gmt 0)

10+ Year Member



Well, my index page is back at #1 where it belongs. I'm hoping that the rest of my site (inner pages still not doing as well) will fall in line soon.

Whether it was the false duplicates issue or not, I can't be 100% sure, but I'm sure removing them helped.

Thanks

This 172 message thread spans 6 pages.