Incorrect URLs and Mirror URLs

Causing duplication penalties.

crobb305

12:39 am on Nov 25, 2004 (gmt 0)

Google has indexed numerous incorrect URLs and mirror URLs all pointing to my index page. Subsequently, the original URL (www.mydomain.com) has been suppressed to the bottom of the results for any search (presumably a duplication penalty). This problem was also mentioned in message 11 of the following thread:

[webmasterworld.com...]

The URLs pertaining to my website that all point to my index page take the following form.

www.mydomain.com/?S=AC3%26Document=document
www.mydomain.com/?S=AC3%26Document=document
www.mydomain.com/?SID=xRSUNVW8R9P44HSYQ6UWED&
www.mydomain.com/?S=AC3%26Document=document
www.mydomain.com/default.asp?S=AC3&am
www.some-other-URL.com/go.php?id=aHR0cDovL3d3dy5jcmVkaXRjaGFtcGlvbi5jb20v
www.some-other-URL-2.com/go.php?id=aHR0cDovL3d3dy5jcmVkaXRjaGFtcGlvbi5jb20v
www.some-other-URL-3.com/file/callink.php?linkid=3

I have emailed Google, but have received no reply. I am unsure what I can do to A) eliminate the incorrect URLs that appear to originate from my site and B) eliminate the mirror URLs that originate from unrelated websites.

Any help would be greatly appreciated.

crobb305

9:33 pm on Nov 25, 2004 (gmt 0)

Also noticing that the site:mydomain.com search command is returning urls from ebay, and other directories, completely unrelated to my site. Google has such a horrible mess on their hands. What is going on?

Regarding my post above, I recall Yahoo having similar issues with mirror URLs causing duplication penalties last spring. This seems to have been solved (at least with my site). I just discovered another 50 or so mirror URLs of my site. Why can't Google get this right?

zeus

9:55 pm on Nov 25, 2004 (gmt 0)

I think I have the same problem. My site dropped from 38,000 uniques a day to 4,000 uniques a day.

The site is a totally clean, HTML-only site. It has ranked well all along; the Florida update was not even noticed.

I have also found sites of the form mydomain.com.anotherdomain.com, in all kinds of variations, as well as tracker2.php URLs. When I do a site:mydomain search, some of the other domains also show up. The site count has gone from 3400 to 296, and the site now gets 0 hits from Google. It can't be that Mr. Internet can't see the difference between a real www.domainname and a mydomainname.otherdomain.com, or can't handle redirects or tracker.php. Are we really back in the stone age, or is it really time to wait for MSN to take over in a year?

You can also see the trouble in that the sites that are included have a lot of URLs listed without a description, so something is totally out of control.

Some will say you should contact the hosting companies of the domains that mess up your site, but you end up at .pl and .ru sites, and what then? I haven't even had any luck yet with American hosts.

fusioneer

10:34 pm on Nov 25, 2004 (gmt 0)

FYI we're having the same problem with one of our sites and I have just posted details here along with attempted solutions:

[webmasterworld.com...]

One thing I have seen in common with these issues is the form of the URL, which is:

[somedomain.com...]

I.e. the "?" immediately following the slash after the domain. You will notice that URLs in this format "appear" to have the same PR as the root of the site, so it is a cheap way to create additional dynamic pages with possibly the best PR your site has to offer.

Eg:
[yahoo.com...]
has PR 9.

Strangely enough,

[google.com...]
has PR 0, Google are onto it.

[edited by: fusioneer at 10:45 pm (utc) on Nov. 25, 2004]

crobb305

10:40 pm on Nov 25, 2004 (gmt 0)

Yes...many of the mirrors of my site have the "tracker2" in the URL. This is blatant sabotage, as my website is being penalized for apparent duplication. Google, are you aware of this problem? Are you doing anything about it? I see backlinks are updating, and the "number of pages indexed" keeps increasing...but are REAL algorithmic problems being addressed?

zeus

10:53 pm on Nov 25, 2004 (gmt 0)

I bet you have also lost a lot of traffic. I looked at one of my other sites, which is still doing great: it has fully described URLs listed when I do a site:mydomain search, and it has no funky spam URLs redirecting to it or hijack links. So this MUST be the problem we have, and PLEASE Google, wake up.

I will just say one thing: I really feel for you. It is pure hell when you don't know what to do, and like me, I think many of you are making a living off this. I'm no SEO, but I own a few sites, so this is hard: building a site for years and then it's gone. If this hadn't happened I would have been at the Las Vegas conference.

crobb305

12:14 am on Nov 26, 2004 (gmt 0)

Yes...My traffic from Google has gone from 3000 visits a day to zero since May 15. I hope that Google pays attention to this problem, and to the emails I/we have sent. I have a feeling, however, that I will simply get a standard/form reply. They need to turn the knob down on the dup penalty until they get a handle on this spam/redirect/mirror-url problem.

Robber

12:22 am on Nov 26, 2004 (gmt 0)

We've seen a couple of examples where Google starts indexing URLs that don't exist and then seems to drop the proper pages way down the listings. The examples we've seen have taken the following forms:

Proper URL: http://www.abc.co.uk/123-wer.php
Requested URL: http://www.abc.co.uk/123-wer.php/567-uio.php

In the example above the server didn't respond with 404 so loads of duplication - now fixed so 404s are returned.

In the next example we have mod_rewrite code to redirect requests for non-www pages to the www site. They are 301 redirects. However, things seem to have gone totally wrong, with Google adding all the non-www pages to its index and, what's more, putting them in the supplemental index. These now appear in place of the correct pages when doing searches.

fusioneer

12:37 am on Nov 26, 2004 (gmt 0)

Robber - have seen similar results - have posted in another thread but will repeat here as this is where the main discussion is being directed.

We created a "kill list" of the URLs we wanted out of the index, and then attempted the following solutions:

1. Script detected "bad" URLs and inserted META tags in page:

This had no effect whatsoever.

2. Script detected bad URLs and returned HTTP header of 404.

This didn't seem to have an effect either.

3. Script detected bad URLs and returned HTTP header 301 redirecting to non-existent page (404.cfm), thus creating a "404 page not found".

Limited success with some pages being removed from index.

4. Emailed Google for clarification. Is a penalty being applied, why, what can we do about it?

Google tells us no penalty is being applied.
I laugh as the site falls out of the top 1000 results.

5. Submitted bad URLs one by one to the Google URL removal tool at:
[google.com...]

Request denied. No explanation given.

Current status: Help and suggestions welcome ;)

crobb305

2:43 am on Nov 26, 2004 (gmt 0)

I tried the URL removal tool. However, a URL can only be removed if it no longer exists on the web, or if it is blocked by robots.txt or carries a "noindex" meta tag. Clearly, redirect URLs "still exist" on the web if they have been recently indexed as a mirror copy of your page in question. Thus, the URL removal tool gives you an error message stating that the page appears to still exist on the web and can't be removed. Therefore, the tool is useless in this case.

Furthermore, my searches reveal 100+ redirect URLs to my site (many with the "tracker2" included). So, asking the appropriate webmasters to remove the redirect or attempting to remove the URLs using the removal tool would be a hopeless task for me.

walkman

3:00 am on Nov 26, 2004 (gmt 0)

There are 4 mysite.com?strings listed as supplemental, so I don't know if it affects the SERPs or not.

I had links with the ? tag to track referrals about 2 years ago. The links have been removed for over a year, but all of them were indexed as late as this month.

What happened to Google? They used to be known for being smart and careful not to screw innocent sites...have they gotten too big, or do they just not care as much for innocent sites that get caught in the middle?

Robber

9:47 am on Nov 26, 2004 (gmt 0)

I think this highlights the issue that modern day SEO is also about quality webmastering and not just page optimisation and link building. You need to write watertight scripts, have correct server configurations, accurate usage of mod_rewrite etc. to make sure that a site cannot be manipulated to make it appear like there is a lot of duplication.

It's not just a question of using absolute links inside your site to make sure Googlebot doesn't get confused; you need to make sure your competitors can't link to a dodgy URL that will still output some content, leading to duplication.

As I see it:
* You must configure your server to prevent URL mapping.

* You need to ensure your scripts reject query strings if they shouldn't have any.

* You need to validate a query string before outputting anything. An invalid query string should result in a redirect, perhaps a 301 to the non-query-string version. But even here, Google seems to have totally confused all of our 301 redirects, so who knows?!
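
The last two points could be sketched roughly as follows. This is only an illustration in Python, not the code running on anyone's site in this thread: the ALLOWED_PARAMS table, the /product.php script and the digits-only rule are assumptions invented for the example, and wiring the result into a real server is left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical whitelist: the query parameters each path legitimately accepts.
ALLOWED_PARAMS = {
    "/": set(),               # the home page takes no query string at all
    "/product.php": {"id"},   # made-up script that accepts only ?id=<digits>
}

def canonical_response(url):
    """Return (status, location) for a requested URL.

    200 -> serve the page as requested
    301 -> redirect permanently to the cleaned-up URL in `location`
    404 -> the path is not one the site knows about
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path not in ALLOWED_PARAMS:
        return 404, None

    allowed = ALLOWED_PARAMS[path]
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Keep only whitelisted parameters whose values look valid (digits only here).
    kept = [(k, v) for k, v in params if k in allowed and v.isdigit()]
    clean = urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))
    if clean != url:
        return 301, clean
    return 200, None
```

With these assumptions, a request for http://www.example.com/?SID=xRSUNVW8R9P44HSYQ6UWED& is answered with a 301 to http://www.example.com/, and a request such as /123-wer.php/567-uio.php gets a 404 instead of a duplicate page.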

Wail

9:47 am on Nov 26, 2004 (gmt 0)

Yeah. This has been going on for a wee while now. (I've tried to post about it before but got eaten by the editors each time).

In some instances, though not often, this appears to be a caching problem. If you sit and refresh the site: page then you'll get the "right" results from Google.

In some instances it is because a new page with duplicate text has popped up. It occurred when a client bought a domain name with "offers" tagged onto the end of their brand. It was worse for them because by the time Google had split their listings 50/50 they'd stopped using the offers domain.

In some instances Google is using very old domains which have been sitting on a 301 for months. This is the most odd occurrence. I think it's related to the PageJacking quirk noted elsewhere.

I wrote this up and mentioned a few other points and sent it off to Google. I shouldn't have put so many "quirks" in one email though (I don't often write to Google), as the email's gone off and into the care of toolbar support now!

zeus

11:58 am on Nov 26, 2004 (gmt 0)

My site is purely static, so I don't have trouble with databases and mod_rewrite, but it's still out of the rankings, and I do think it has something to do with all the newly added sites which are spam duplicates.

If they don't fix this soon, I think I will contact some brokerage firms, and my own, to tell them about this problem. What have I to lose when the site is out? So why not earn some cash on pure knowledge of search engines and short the stock. Maybe this will wake them up. I knew there would be trouble as soon as they went on NASDAQ.

[edited by: zeus at 12:14 pm (utc) on Nov. 26, 2004]

Wail

12:03 pm on Nov 26, 2004 (gmt 0)

Zeus,

It's easy to test. If your site is www.example.com then visit Google and search for info:www.example.com

If there's a result then check the URL. If it's not your URL then Google's in a muddle.

zeus

12:16 pm on Nov 26, 2004 (gmt 0)

Wail,

Then I get mydomain/ as the text, and it directs to //mydomain.com

Wail

12:32 pm on Nov 26, 2004 (gmt 0)

Sounds like you might want to check that you're not your own competitor with www.mydomain.com Vs mydomain.com.

Otherwise, I suspect your current ranking problems aren't associated with this current quirk.

zeus

12:40 pm on Nov 26, 2004 (gmt 0)

Hmm, I don't think they would call it a duplicate, but anyway, how is it possible to fix that?

Seo1

1:02 pm on Nov 26, 2004 (gmt 0)

Some of this may have to do with your server set up and directory structure.

If you do a search for http://www.yoursite.com and www.yoursite.com and [yoursite.com...] and find results in two or three of the above for your site, it is due to how people link to you.

If there are different ways you are being linked to, your site's PageRank will be lower, as the PR is not spread throughout all of your pages.

To remedy this you will need an Apache mod_rewrite rule.
The rewrite will force all requests for the other forms (such as yoursite.com) to http://www.yoursite.com and remedy any link bleed.

For Zeus
Mod rewrites are for both static and dynamic urls.
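
The decision such a rewrite rule makes can be sketched in a few lines of Python. This is not mod_rewrite syntax and not a drop-in solution; CANONICAL_HOST is an assumed value, and the function only works out where a 301 should point, leaving the actual redirect to the server.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumed preferred hostname

def redirect_target(url):
    """Return the URL to 301 to, or None if the host is already canonical."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname drops any port number
    if host == CANONICAL_HOST:
        return None
    # Keep the scheme, path and query string; swap only the host.
    return urlunsplit((parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, ""))
```

Because only the host changes, every non-canonical URL maps to exactly one canonical URL, so static and dynamic pages are handled the same way.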

For londoh
Your host won't turn off hard links. They can't!
They can turn off sym links.

For DerekH

Spider bots crawl your server. Many people think they crawl their webpages, which is wrong. Google reads your webpage. The links that come from your website are actually coming from your server and not your page.

If your server does not have a robots.txt file, some bots won't bother to index you.

Google also typically crawls links only one level from your home page. To rectify this they asked people to build site maps for their sites. If the site map is over 100 links then you need more than 1 site map.

For Fusioneer

You cannot write scripts to effect any change in Google's index. Why would you even try? The index is Google's database, which you do not have access to, so neither would your script. Writing those scripts may earn you a Google ban.

As for why Google will not delete these URLs: with an index of 8,000,000,000 pages, who do you think will have the time to go search for the URLs you do not like? Google's not going to pay someone to do that.

I hate to say this, but you guys spend so much time worrying over links and PageRank, which have very little to do with how a site ranks on any search engine.

Content is King

Spend that time adding content and changing your content as much as possible, the more relevant and fresh your content the higher up you will move.

I hope this helps

Clint

SteveJohnston

1:09 pm on Nov 26, 2004 (gmt 0)

The few times I have tried it, I have managed to get old URLs expunged from Google, with some careful 301 redirecting on the home site and a forced re-spidering of the offending (the original and wrong) URL by way of a blog or some other frequently indexed page.

The additional problem I have encountered is when the URL for the site owner's content appears indexed under another site's domain. When investigated, a 302 on the 'another' site was contributing to the problem. Without this then changing to a 301 - fortunately they were co-operative - we were unable to break the association.

The crawling and indexing floodgates then opened, which was gratifying. Which also put to bed the notion of it being some sort of penalty.

Hope this helps.

Steve

papamaku

3:20 pm on Nov 26, 2004 (gmt 0)

I need to remove stacks of affiliate urls (www.site.com/?aff=1234) causing a duplicate penalty.

So I've put this in my robots.txt:

User-agent: Googlebot
Disallow: /?aff=*$

Googlebot has now stopped crawling those pages, which is great; however, they are all still in the index.

So I tried using the url removal tool and get this error message:

URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW /?aff=*$

How can one part of their system support RegEx wildcards but not the other? And there's just way too many urls to do them individually.

Wail

4:03 pm on Nov 26, 2004 (gmt 0)

User-agent: Googlebot
Disallow: /?aff=*$

That's Google being kind. That's not actually a valid robots.txt file. There's no wildcard matching in the Disallow field.

It just takes a change in the weather and that robots.txt will parse as:

User-agent: Googlebot
Disallow: /

That'll be a nightmare.
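
To see why the wildcard line is fragile, here is a minimal Python sketch of how a parser that follows only the original prefix-matching convention might read a Disallow value. It is an assumption-laden illustration (is_disallowed is a made-up helper), not a description of how Googlebot or any other real crawler parses the file.

```python
def is_disallowed(path, disallow_rules):
    """Strict prefix matching, as in the original robots.txt convention.

    `path` is the URL path plus any query string, e.g. "/?aff=1234".
    An empty rule disallows nothing; every character is taken literally,
    so "*" and "$" have no special meaning here.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# The wildcard line is read literally, so it does not match a real affiliate URL.
print(is_disallowed("/?aff=1234", ["/?aff=*$"]))  # False
# A plain prefix does the job for such a parser.
print(is_disallowed("/?aff=1234", ["/?aff="]))    # True
# And a rule cut down to "/" blocks the whole site.
print(is_disallowed("/index.html", ["/"]))        # True
```

Under that reading, a crawler without wildcard support would keep fetching the affiliate URLs, and a parser that truncated the value at the "?" would be left with "Disallow: /".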

walkman

4:08 pm on Nov 26, 2004 (gmt 0)

"That's not actually a valid robots.txt file. There's no wildcard matching in the Disallow field."

how about this:

User-agent: Googlebot
Disallow: /?aff_one
Disallow: /?aff_two
Disallow: /?aff_three

is it valid?

Wail

4:25 pm on Nov 26, 2004 (gmt 0)

It seems like a crime to say this - but I'm not sure. I've never seen the question mark cause a problem but then I don't recommend anyone test the waters to find out.

The "snake oil" charm in this thread is mod-rewrite. Let's use it again. I'd mod-rewrite affiliate query strings to look like static URLs (www.example.com/affiliate/1 or www.example.com/affiliate/2 etc) and then disallow /affiliate/

walkman

4:55 pm on Nov 26, 2004 (gmt 0)

Somewhat unrelated, but I just noticed that Google has cached excite.co.jp doing a translation of my site into Japanese. Everything is the same obviously, except the text added in Japanese.

This is my second site that has shown that. If you search for WebmasterWorld's "News and Discussion for the Independent Web Professional" the excite translation is around # 80 or so. Will this cause any problems...for the smaller fish I mean, not Bret who has 451,245,4213 backlinks?

brand_guy

7:42 pm on Nov 26, 2004 (gmt 0)

You could consider creating a separate directory for the affiliate (duplicate) pages such as /yourdomain/aff/

On the server side this can be implemented using a virtual directory.

Place a robots.txt file on the server that excludes the new directory.

Apply 301 redirects to all "old" affiliate pages out there and point to their new locations in the new directory.

my3cents

9:12 pm on Nov 26, 2004 (gmt 0)

I've been saying this for about 5-6 months now:

My index page, which is updated regularly with fresh content, is static vanilla HTML and is 3 years old, has been listed as URL only, while I see tracking URLs for the same page with full title and description, some of them several months old.

This has also happened to most of my main category pages too. On my site, content is king as I regularly add new products, information and even new categories. I probably average 15-25 new pages a month and I employ several people that work to expand this site every day, graphic designers, researchers, content writers, etc.

Every month I see new dynamic strings at the end of my URLs, and as they are added, they get a full title and description while the real URL loses its title and description and also its ranking.

In Vegas, the Google rep suggested that I 301, but now I find that this cannot be done because you cannot 301 a query string.

www.mysite.com?why_is=google_doing_this

I really don't know what to do. It would be great if there were a way to stop Google from indexing these strings, but I don't think it would fix the problem, as every month new strings get into the index and I have no idea where they originated from, e.g.
www.mysite.com?afbfbwethgwh54w45h454hhwq5hq345hq345hq35hq35h

Also, I now see Google has another entry for my index page as www.mysite.com.asp, which not only doesn't make sense and could never be an actual URL, but I don't even use ASP anywhere on my site!

To make things worse, when Google finds a link to a page on my site that does not exist, or never existed, my server returns a 404 header and Google indexes the content of the 404 page with a full title and description:

title: 404 Not Found
description: The requested URL was not found on this server.

[google.com...]

When I first started seeing this several months ago I thought I should not worry, that these are things that Google has obviously got wrong, but here we are nearly 6 months later and it is only getting worse. The advice given in Vegas does not help.

I would like to repeat that these problems ONLY exist on Google.

These same topics have been brought up several times in the last 6 months, and yes, in the past, moderators have shut the threads down before any discussion could start. Google must know there is a problem and they must not care.

Where do we go from here?

zeus

10:34 pm on Nov 26, 2004 (gmt 0)

my3cents - great, I think you have just shown me the future I can expect. My site is also out of the rankings; it still has PR and some pages indexed on a site:mydomain search, but all this happened on Nov. 3, and as you say you have had this for 6 months, well, happy new year.

Then I think I will talk with my broker and some other investment companies to tell them about this problem at Google, because it hurts a lot of good sites, and if you can't find them in Google, well, then you go to Yahoo or MSN, which will take over all this in a year's time, so maybe it's time to go short here.

I really liked this search engine, but they just can't keep their fingers off the algo/results. We have seen many times that something works pretty OK, then people make some changes, and then everything goes down.

my3cents, please sticky me your URL, so I can compare things

fusioneer

11:20 pm on Nov 26, 2004 (gmt 0)

We have had some partial success removing the bad URLs listed in Google's index for one site. For reference I will list what didn't work and also the method that did.

What didn't work
----------------
- <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
- HTTP 301 redirect
- HTTP 301 redirect to non-existent page
- HTTP 404 header return
- Google URL removal tool manual submission

What did work
-------------
- robots.txt with individually named URLs, manually submitted through Google's URL removal tool

Format of robots.txt
--------------------

User-agent: *
Disallow: /?tracker=1239988495865464
Disallow: /?keyword+keyword
Disallow: /?string
Disallow: /?sid0-9845588858545345345
Disallow: /foo.asp
Disallow: /foo.cfm?foo=1&foo=2&foo=3

Google URL removal tool
-----------------------
[services.google.com:8882...]

You need a Google account to use this.
1. Click on "Remove pages, subdirectories, or images using a robots.txt file".
2. Enter the URL of your robots.txt file (Note it does NOT have to be in the root directory to work!)
3. Submit and wait 24 hours

The result
----------
site:www.example.com now returns results for the site with the offending URLs removed. This was done by Google within 24 hours.

The downside
------------
You have to manually enter each page you want removed in robots.txt. For large sites this may be problematic - the site in question was < 100 pages, and we had about 50 URLs to kill.

Large & dynamic sites
---------------------
For bigger sites I would suggest writing a script that would use the Google API to query your site:www.example.com results. Pass in a list of your "good" URLs (the ones you want indexed) and subtract these from the site: results.

Anything left is a bad URL, and can then be output to a robots.txt file as above and submitted via the Google URL removal tool. I would still hand-check the results to ensure no good URLs "slip through". This can be done as often as needed.
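
The subtraction step might look something like this in Python. It is a hedged sketch only: it assumes the site: results have already been collected into a list by whatever means, it does not call any Google API, and build_kill_list, to_rule and the host default are hypothetical names made up for the example.

```python
from urllib.parse import urlsplit

def to_rule(url):
    """Turn a full URL into the path-plus-query form used on a Disallow line."""
    parts = urlsplit(url)
    rule = parts.path or "/"
    if parts.query:
        rule += "?" + parts.query
    return rule

def build_kill_list(indexed_urls, good_urls, host="www.example.com"):
    """Return (robots_txt, risky).

    robots_txt lists every indexed URL on `host` that is not a good URL.
    risky holds rules left out because, by prefix matching, they would
    also block a good URL; those need checking by hand.
    """
    good_rules = {to_rule(u) for u in good_urls}
    # robots.txt only governs our own host, so ignore results from other domains.
    own = [u for u in indexed_urls if urlsplit(u).hostname == host]
    bad_rules = sorted({to_rule(u) for u in own} - good_rules)

    safe, risky = [], []
    for rule in bad_rules:
        if any(g.startswith(rule) for g in good_rules):
            risky.append(rule)
        else:
            safe.append(rule)

    robots_txt = "User-agent: *\n" + "".join(f"Disallow: {r}\n" for r in safe)
    return robots_txt, risky
```

The prefix check is there because a Disallow line matches everything that starts with it, so a bad URL such as /prod would otherwise take a good /products.html out with it.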

Note that you could also set up a broader prefix rule in robots.txt for a permanent effect, though this is a bit more risky, e.g.:

Disallow: /?

Which according to the standard would exclude all of the following URLs:

www.example.com/?foo
www.example.com/?sid=098340958094583
www.example.com/?keyword+keyword

However I would experiment with this first...perhaps in a subdirectory or test domain. Caveat - haven't tried this (haven't needed to) so use at your own risk.

Wildcarding in robots.txt
-------------------------
Personally we were a bit loath to do this, as a single character error could result in large parts of your site being removed from Google's index. However, it seems as if the standard does allow for this.

Have a look at Google's own robots.txt:
[google.com...]

First line is:
Disallow: /search

This disallows any URL whose path begins with "/search". No "*" asterisk is used. This prevents other agents or search engines indexing Google's results (assuming they play nice with Google's robots.txt...). Note the absence of a trailing slash - if a trailing slash were used this would indicate a directory match for the /search/ directory.

Other robots.txt references
---------------------------
Search Engine World [good quick ref;)]
[searchengineworld.com...]

Robotstxt.org
[robotstxt.org...]
[robotstxt.org...]

Searchtools.com [good all round reference]
[searchtools.com...]

Feedback and comments welcome.

zeus

11:30 pm on Nov 26, 2004 (gmt 0)

hello fusioneer

My site is a 2500-page static site. I think my problem is that someone is copying or hijacking my site; I do not have duplicated sites in mydomain.

So do you suggest I try this out? And thanks for trying to fix this mess Google is making.

This 172 message thread spans 6 pages.