The URLs pertaining to my website that all point to my index page take the following form.
I have emailed Google, but have received no reply. I am unsure what I can do to A) eliminate the incorrect URLs that appear to originate from my site and B) eliminate the mirror URLs that originate from unrelated websites.
Any help would be greatly appreciated.
Regarding my post above, I recall Yahoo having similar issues with mirror URLs causing duplication penalties last spring. This seems to have been solved (at least with my site). I just discovered another 50 or so mirror URLs of my site. Why can't Google get this right?
The site is a totally clean, HTML-only site. It has been ranking well all along; the Florida update was not even noticed.
I have also found sites of the form mydomain.com.anotherdomain.com, and all kinds of other forms, including tracker2.php. When I run site:mydomain, some of the other domains also show up. The site count has gone from 3400 to 296, and the site now gets 0 hits from Google. It can't be that Mr. Internet can't see the difference between a real www.domainname and a mydomainname.otherdomain.com, or can't handle redirects or tracker.php. Are we really back in the Stone Age, or is it really time to wait for MSN to take over in a year?
You can also see the trouble in that sites that are included have a lot of URLs without descriptions, so something is totally out of control.
Some will say you should contact the hosting companies of the domains that mess up your site, but then you get to .pl and .ru sites, and what then? I haven't even had any luck yet with American hosts.
One thing I have seen in common with these issues is the form of the URL, which is:
E.g. the "?" immediately following the slash after the domain. You will notice that URLs in this format "appear" to have the same PR as the root of the site, so it is a cheap way to create additional dynamic pages with possibly the best PR your site has to offer.
has PR 9.
has PR 0, Google are onto it.
[edited by: fusioneer at 10:45 pm (utc) on Nov. 25, 2004]
I will just say one thing: I really feel for you. It is pure hell when you don't know what to do, and like me, I think many of you are making a living from this. I'm no SEO, but I own a few sites, so this is hard: building a site for years and then it is gone. If this hadn't happened, I would have been at the Las Vegas conference.
Proper URL: ht*p://www.abc.co.uk/123-wer.php
In the example above the server didn't respond with a 404, so there was loads of duplication. This is now fixed so that 404s are returned.
In the next example we have mod_rewrite code to redirect requests for non-www pages to the www site. They are 301 redirects. However, things seem to have gone totally wrong, with Google adding all the non-www pages to its index and, what's more, putting them in the supplemental index. These now appear in place of the correct pages when doing searches.
We created a "kill list" of the URLs we wanted out of the index, and then attempted the following solutions:
1. Script detected "bad" URLs and inserted META tags in page:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
This had no effect whatsoever.
2. Script detected bad URLs and returned HTTP header of 404.
This didn't seem to have an effect either.
3. Script detected bad URLs and returned HTTP header 301 redirecting to a non-existent page (404.cfm), thus creating a "404 page not found".
Limited success with some pages being removed from index.
4. Emailed Google for clarification. Is a penalty being applied, why, what can we do about it?
Google tells us no penalty is being applied.
I laugh as the site falls out of the top 1000 results.
5. Submitted bad URLs one by one to the Google URL removal tool at:
Request denied. No explanation given.
Current status: Help and suggestions welcome ;)
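For anyone wanting to try attempt 2 (answering kill-list URLs with a 404), the lookup could look roughly like the sketch below. It is written in Python purely as an illustration, not the script used on the site above; KILL_LIST, status_for and the example.com URLs are made-up names.

```python
from urllib.parse import urlsplit

# Hypothetical "kill list": path plus query string of every URL
# that should drop out of the index.
KILL_LIST = {
    "/?ref=123",
    "/index.html?tracker2=1",
}

def status_for(url):
    """Return the HTTP status a script could send for this request URL."""
    parts = urlsplit(url)
    key = parts.path or "/"          # treat an empty path as the root
    if parts.query:
        key += "?" + parts.query
    return 404 if key in KILL_LIST else 200
```

Matching on path plus query string, rather than on the path alone, is what lets the real index page keep answering 200 while its "?" variants return 404.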
Furthermore, my searches reveal 100+ redirect urls to my site (many with the "tracker2" included). So, asking the appropriate webmasters to remove the redirect or attempting to remove the urls using the removal tool would be a hopeless task for me.
I had links with the ? tag to track referrals about 2 years ago. The links have been removed for over a year, but all of them were still indexed as late as this month.
What happened to Google? They used to be known for being smart and careful not to screw innocent sites... have they gotten too big, or do they just not care as much about innocent sites that get caught in the middle?
It's not just a question of using absolute links inside your site to make sure Googlebot doesn't get confused; you also need to make sure your competitors can't link to a dodgy URL that will still output some content, leading to duplication.
As I see it:
* You must configure your server to prevent URL mapping.
* You need to ensure your scripts reject query strings if they shouldn't have any.
* You need to validate a query string before outputting anything. An invalid query string should result in a redirect, perhaps a 301 to the non-query-string version. But even here, Google seems to have totally confused all of our 301 redirects, so who knows?!
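The last two points could be sketched along these lines. This is only a minimal Python illustration under assumptions: the ALLOWED_PARAMS table, the script names and the "values must be digits" rule are hypothetical, and a real site would plug in its own validation.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

# Hypothetical whitelist: which query parameters each script accepts.
ALLOWED_PARAMS = {
    "/index.php": set(),        # takes no query string at all
    "/product.php": {"id"},     # only ?id=<digits> is valid
}

def check_request(url):
    """Return (status, location); location is None unless redirecting."""
    parts = urlsplit(url)
    allowed = ALLOWED_PARAMS.get(parts.path)
    if allowed is None:
        return 404, None                        # unknown script
    params = parse_qsl(parts.query, keep_blank_values=True)
    valid = all(k in allowed and v.isdigit() for k, v in params)
    if parts.query and not (params and valid):
        # Invalid query string: 301 to the version without it.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        return 301, clean
    return 200, None
```

With this table, /index.php?foo=bar is sent to /index.php with a 301, while /product.php?id=42 is served normally.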
In some instances, though not often, this appears to be a caching problem. If you sit and refresh the site: page then you'll get the "right" results from Google.
In some instances it is because a new page with duplicate text has popped up. It occurred when a client bought a domain name with "offers" tagged onto the end of their brand. It was worse for them because by the time Google had split their listings 50/50 they'd stopped using the offers domain.
In some instances Google is using very old domains which have been sitting on a 301 for months. This is the oddest occurrence. I think it's related to the PageJacking quirk noted elsewhere.
I wrote this up and mentioned a few other points and sent it off to Google. I shouldn't have put many "quirks" in one email though (don't often write to Google) as the email's gone off and into the care of toolbar support now!
If they don't fix this soon, I think I will contact some brokerage firms, including my own, to tell them about this problem. What have I to lose when the site is out? So why not earn some cash on pure knowledge of search engines and short the stock? Maybe this will wake them up. I knew that there would be trouble as soon as they went on NASDAQ.
[edited by: zeus at 12:14 pm (utc) on Nov. 26, 2004]
If you do a search for h ttp://www.yoursite .com and www.yoursite .com and [yoursite.com...] and find results in two or three of the above for your site, it is due to how people link to you.
If you are being linked to in different ways, your site's PageRank will be lower, as the PR is divided between the different versions of the URL.
To remedy this you will need an Apache mod_rewrite rule.
The rewrite will force all links coming to www.yoursite .com to ht tp://www.yoursite.com and remedy any link bleed.
Mod rewrites are for both static and dynamic urls.
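The rewrite itself lives in the Apache config, but the decision it makes is simple and can be sketched in Python for clarity. This is a rough illustration only; CANONICAL_HOST and canonical_redirect are made-up names, and www.example.com stands in for the hostname you actually prefer.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"   # assumed preferred hostname

def canonical_redirect(url):
    """Return the URL to 301 to, or None if the host is already canonical."""
    parts = urlsplit(url)
    if parts.netloc.lower() == CANONICAL_HOST:
        return None
    # Keep path and query string intact so static and dynamic URLs both survive.
    return urlunsplit((parts.scheme, CANONICAL_HOST,
                       parts.path or "/", parts.query, parts.fragment))
```

Every request for the bare domain is answered with a single 301 to the same path on the preferred host, so links to either form end up crediting one URL.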
For londoh: your host won't turn off hard links. They can't!
They can turn off sym links.
Spider bots crawl your server. Many people think they crawl their webpages, which is wrong. Google reads your webpage. The links that come from your website are actually coming from your server and not your page.
If your server does not have a robots.txt file, some bots won't bother to index you.
Google also typically crawls links only one level from your home page. To rectify this, they asked people to build site maps for their sites. If the site map has over 100 links, then you need more than one site map.
You cannot write scripts to effect any change in Google's index. Why would you even try? The index is Google's database, which you do not have access to, so neither would your script. Writing those scripts may earn you a Google ban.
As for why Google will not delete these URLs: with an index of 8,000,000,000 pages, who do you think will have the time to go search for the URLs you do not like? Google's not going to pay someone to do that.
I hate to say this, but you guys spend so much time worrying over links and PageRank, which have very little to do with how a site ranks on any search engine.
Content is King
Spend that time adding content and changing your content as much as possible, the more relevant and fresh your content the higher up you will move.
I hope this helps
The additional problem I have encountered is when the URL for the site owner's content appears indexed under another site's domain. When investigated, a 302 on the other site was contributing to the problem. Until this was changed to a 301 (fortunately they were co-operative), we were unable to break the association.
The crawling and indexing floodgates then opened, which was gratifying. It also put to bed the notion of it being some sort of penalty.
Hope this helps.
So I've put this in my robots.txt:
Googlebot has now stopped crawling those pages, which is great; however, they are all still in the index.
So I tried using the url removal tool and get this error message:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW /?aff=*$
How can one part of their system support RegEx wildcards but not the other? And there's just way too many urls to do them individually.
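For what it's worth, what Googlebot accepts in robots.txt is not full RegEx, only "*" (any run of characters) and a trailing "$" (end of URL), and that is an extension rather than part of the original robots.txt standard. A rough Python sketch of that matching, with made-up function names, shows what the pattern from the error message would catch:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern, path):
    """True if the path (including any query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

Under this reading, "/?aff=*$" blocks /?aff=123 but leaves /index.html alone. Because "*$" at the end matches anything, the plain prefix "/?aff=" would behave the same way without any wildcard.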
how about this:
is it valid?
The "snake oil" charm in this thread is mod-rewrite. Let's use it again. I'd mod-rewrite affiliate query strings to look like static URLs (www.example.com/affiliate/1 or www.example.com/affiliate/2 etc) and then disallow /affiliate/
This is my second site that has shown that. If you search for WebmasterWorld's "News and Discussion for the Independent Web Professional", the Excite translation is around #80 or so. Will this cause any problems... for the smaller fish I mean, not Bret who has 451,245,4213 backlinks?
On the server side this can be implemented using a virtual directory.
Place a robots.txt file on the server that excludes the new directory.
Apply 301 redirects to all "old" affiliate pages out there and point to their new locations in the new directory.
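The mapping from old affiliate URLs to the new directory could be sketched as below. This is a Python illustration only: the "aff" parameter name is taken from the robots.txt error message quoted earlier in the thread, while the /affiliate/ directory, the digits-only rule and the function name are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

# robots.txt that keeps crawlers out of the new affiliate directory.
ROBOTS_TXT = "User-agent: *\nDisallow: /affiliate/\n"

def affiliate_redirect(url):
    """Return the /affiliate/<id> path to 301 to, or None if not an affiliate URL."""
    parts = urlsplit(url)
    ids = parse_qs(parts.query).get("aff", [])
    # Only accept exactly one numeric affiliate id (assumed format).
    if len(ids) == 1 and ids[0].isdigit():
        return "/affiliate/" + ids[0]
    return None
```

An old link such as /?aff=2 is then redirected to /affiliate/2, which the robots.txt above excludes from crawling.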
My index page, which is updated regularly with fresh content (static vanilla HTML, 3 years old), has been listed as URL-only, while I see tracking URLs for the same page with full title and description, some of them several months old.
This has also happened to most of my main category pages too. On my site, content is king as I regularly add new products, information and even new categories. I probably average 15-25 new pages a month and I employ several people that work to expand this site every day, graphic designers, researchers, content writers, etc.
Every month I see new dynamic strings at the end of my URLs, and as they are added, they get a full title and description while the real URLs lose their titles and descriptions and also their ranking.
In Vegas, the Google rep suggested that I 301, but now I find that this cannot be done because you cannot 301 a query string.
I really don't know what to do. It would be great if there were a way to stop Google from indexing these strings, but I don't think it would fix the problem, as every month new strings that I have no idea where they originated from get into the index, e.g.
Also, I now see Google has another entry for my index page as www.mysite.com.asp, which not only doesn't make sense and could never be an actual URL, but I don't even use ASP anywhere on my site!
To make things worse, when Google finds a link to my site pointing to a page that does not exist, or never existed, my server returns a 404 header and Google indexes the content of the 404 page with a full title and description:
title: 404 Not Found
description: The requested URL was not found on this server.
When I first started seeing this several months ago I thought I should not worry, that these are things that Google has obviously got wrong, but here we are nearly 6 months later and it is only getting worse. The advice given in Vegas does not help.
I would like to repeat that these problems ONLY exist on Google.
These same topics have been brought up several times in the last 6 months, and yes in the past, moderators have shut the threads down before any discussion can ever start. Google must know there is a problem and they must not care.
Where do we go from here?
Then I think I will talk with my broker and some other investment companies to tell them about this problem at Google, because it hurts a lot of good sites. If you can't find them in Google, well, then you go to Yahoo or MSN, which will take over all this in a year's time, so maybe it's time to go short here.
I really liked this search engine, but they just can't keep their fingers off the algo/results. We have seen it many times: something works pretty OK, then people make some changes, and then everything goes down.
my3cents, please sticky me your URL, so I can compare things
What didn't work
- <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
- HTTP 301 redirect
- HTTP 301 redirect to non-existent page
- HTTP 404 header return
- Google URL removal tool manual submission
What did work
- robots.txt with individually named URLs, manually submitted through Google's URL removal tool
Format of robots.txt
Google URL removal tool
You need a Google account to use this.
1. Click on "Remove pages, subdirectories, or images using a robots.txt file".
2. Enter the URL of your robots.txt file (Note it does NOT have to be in the root directory to work!)
3. Submit and wait 24 hours
Site:www.example.com now returns results for the site with the offending URLs removed. This was done by Google within 24 hours.
You have to manually enter each page you want removed in robots.txt. For large sites this may be problematic - the site in question was < 100 pages, we had about 50 URLs to kill.
Large & dynamic sites
For bigger sites I would suggest writing a script that would use the Google API to query your site:www.example.com results. Pass in a list of your "good" URLs (the ones you want indexed) and subtract these from the site: results.
Anything left is a bad URL, and can then be written out to a robots.txt file as above and submitted via the Google URL removal tool. I would still hand-check the results to ensure no good URLs "slip through". This can be done as often as needed.
Note that you could also set up wildcards in robots.txt for a permanent effect, though this is a bit more risky, e.g.:
Which according to the standard would exclude all of the following URLs:
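The subtraction step might look something like the following Python sketch. It deliberately leaves out the Google API call: indexed_urls is assumed to be a list you have already collected from the site: results by whatever means, and the function names and the Googlebot user-agent line are assumptions, not a tested recipe.

```python
from urllib.parse import urlsplit

def path_of(url):
    """Reduce a full URL to the path (plus query string) used in robots.txt."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")

def build_robots_txt(indexed_urls, good_urls, user_agent="Googlebot"):
    """Disallow every indexed URL that is not on the list of good URLs."""
    bad = sorted(set(indexed_urls) - set(good_urls))
    lines = ["User-agent: " + user_agent]
    lines += ["Disallow: " + path_of(u) for u in bad]
    return "\n".join(lines) + "\n"
```

Note that a Disallow line is a prefix match, so a bad URL that happens to be a prefix of a good one would block the good one too. That is one more reason to hand-check the output before submitting it.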
However I would experiment with this first...perhaps in a subdirectory or test domain. Caveat - haven't tried this (haven't needed to) so use at your own risk.
Wildcarding in robots.txt
Personally we were a bit loath to do this, as a single character error could result in large parts of your site being removed from Google's index. However, it seems as if the standard does allow for this.
Have a look at Google's own robots.txt:
First line is:
This disallows any URL whose path begins with "/search". No "*" asterisk is used. This prevents other agents or search engines indexing Google's results (assuming they play nice with Google's robots.txt...). Note the absence of a trailing slash - if a trailing slash were used, this would only match URLs under the /search/ directory.
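The difference the trailing slash makes can be checked with Python's standard urllib.robotparser, which applies the same plain prefix matching. The sketch below assumes a rule of "Disallow: /search" under "User-agent: *" and uses www.example.com as a stand-in host; make_parser is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

def make_parser(rule):
    """Build a parser for a one-rule robots.txt that applies to all agents."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + rule])
    return parser

no_slash = make_parser("/search")      # prefix match on "/search"
with_slash = make_parser("/search/")   # only paths under the /search/ directory

base = "http://www.example.com"
blocked_query = not no_slash.can_fetch("*", base + "/search?q=test")
allowed_with_slash = with_slash.can_fetch("*", base + "/search?q=test")
```

Without the trailing slash, /search?q=test is blocked; with it, the same URL is allowed and only paths such as /search/foo are excluded.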
Other robots.txt references
Search Engine World [good quick ref;)]
Searchtools.com [good all round reference]
Feedback and comments welcome.