Forum Moderators: Robert Charlton & goodroi
Many unethical webmasters and site owners are already creating thousands of templated (ready-to-go) skyscraper sites fed by affiliate companies' immense databases. These companies, which hold your website's info in their databases, feed snippets of your pages, without your permission, to vast numbers of skyscraper sites. A carefully adjusted PHP-based redirection script then goes to work: it issues a 302 redirect to your site and includes an affiliate click checker. What is very sneaky is the randomly generated meta refresh page, which can only be detected with a good header interrogation tool.
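For anyone who wants to do that interrogation themselves, here is a minimal sketch of a header check in Python (the URLs and response values below are hypothetical; a real check would fetch the suspect page first and pass in what came back):

```python
import re

def inspect_response(status, headers, body):
    """Classify a fetched page the way a header-interrogation tool would:
    flag 302/301 redirects and meta-refresh pages that a browser hides."""
    findings = []
    if status in (301, 302):
        target = headers.get("Location", "")
        findings.append(f"{status} redirect -> {target or '(empty Location header!)'}")
    # A meta refresh never shows in the status line; it only appears in the body.
    m = re.search(
        r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>]+)',
        body, re.IGNORECASE)
    if m:
        findings.append(f"meta refresh -> {m.group(1)}")
    return findings

# Hypothetical responses from a goto.php-style redirect script:
print(inspect_response(302, {"Location": "http://www.example.com/"}, ""))
print(inspect_response(200, {},
      '<meta http-equiv="refresh" content="0; url=http://www.example.com/">'))
```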
Googlebot and MSNBot follow these PHP scripts, either to an internal sub-domain containing the 302 redirect or server-side, and BANG, down goes your site if its PageRank is below the offending site's. Your index page is crippled because Googlebot and MSNBot now consider your home page, at best, a supplemental page of the offending site. The offending site's URL that contains your URL is indexed as belonging to the offending site. The hijackers know that Google does not reveal all links pointing to your site and takes a couple of months to update, so an inurl:yoursite.com search will not be much help in tracing them for a long time. Note that these scripts mostly apply your URL stripped, or without the www, making detection harder. This also causes Googlebot to generate another URL listing for your site, which can be seen as duplicate content. A 301 redirect resolves at least the short-URL problem, sparing Google from deciding which of your site's two URLs to index higher (usually the one with the higher-linked PageRank).
Your only hope is that your PageRank is higher than the offending site's. Even this is no guarantee, because the offending site will have targeted many higher-PageRank sites within its system on the off chance that it strips at least one of them. This is reinforced by hundreds of other hidden 301 permanent redirects to PageRank 7-or-above sites, again in the hope of stripping a high-PageRank site, which would then empower their scripts to hijack more efficiently. Sadly, supposedly ethical big-name affiliates are involved in this scam; they know it is going on, and Google AdWords is probably the main target of revenue. Though I am sure Google does not approve of its AdSense program being used in such a manner.
Many such offending sites have no e-mail contact, a hidden WHOIS record, and no telephone number. Even if you were to contact them, you will find in most cases that the owner or webmaster cannot remove your links from their site, because the feeds come from affiliate databases.
There is no point in contacting Google or MSN, because this problem has been around for at least nine months; only now is it escalating at an alarming rate. All sites with PageRank 5 or below are susceptible; if your site is a 3 or 4, be very alarmed. A skyscraper site need only create child-page linking to get to PageRank 4 or 5, without needing to strip other sites.
Caution: trying to exclude them via robots.txt will not help, because these scripts can change almost daily.
Trying to remove through Google a link that looks like
new.searc**verywhere.co.uk/goto.php?path=yoursite.com%2F will result in your entire website being removed from Google's index for an indefinite period of time, at least 90 days, and you cannot get re-indexed within this timeline.
I am working on an automated 302 REBOUND SCRIPT to trace and counteract an offending site. This script will spider and detect all pages, including sub-domains, within an offending site and blast all of its pages, including dynamic pages, with a 302 or 301 redirect. Hopefully it will detect the feeding database and blast it with as many 302 redirects as it contains URLs: in essence, a program in perpetual motion, creating millions of 302 redirects for as long as it stays on. As every page is a unique URL, the script should continue to create and bombard a site whose dynamically generated pages use PHP, ASP, or CGI redirecting scripts. A skyscraper site that is fed this way can have its server totally occupied by a single efficient spider that requests pages in split seconds, continually, throughout the day and week.
If the repeatedly spidered site is depleted of its bandwidth, it may then be possible to remove it via Google's URL removal tool. You only need a few seconds of a 404 or 403 from the offending site for Google's URL console to detect what it needs: either the site or the damaging link.
I hope I have been informative and of help to anybody whose hijacked site's natural revenue has been unfairly treated. Also note that your site may never regain its rank, even after the removal of the offending links. Talking to offending site owners usually results in denial that they are causing problems; they say they are only counting outbound clicks, and they seem reluctant to remove your links... Yeah, pull the other one.
[edited by: Brett_Tabke at 9:49 pm (utc) on Mar. 16, 2005]
Here is a quote from a post last September in another WebmasterWorld thread:
They took over a week to answer my first email. I sent it to webmaster@google.com and the replies were coming from help@google.com. I tried a different address for them and the reply still came from help@google.com. I implored them to please refer my questions to somebody higher up and put ATTN:Googleguy in the message title. I started getting responses from googlebot@google.com.
I can tell you that the responses make me want to laugh, cry, and scream at the same time. In the meantime, my hijacked index page has moved up from number seven to number three in the SERPs, even though they did remove the redirect and the link now goes to a 404 error page.
-------------------------
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
------------------------- Make it eg. the first rule (or the last rule) in your .htaccess file. It does this:
If Googlebot requests a file (any file), redirect that request one time only to the exact same URL with a 301 status code, and do no more. What happens after this is that Googlebot will get the file with a code "200 OK" or whatever code your webserver would otherwise throw at it (eg. if it's a dead link it will of course get a 404). [NC] means that the spelling of "GooGLeBot" is not case sensitive.
It also makes sure that Googlebot will always just see the domain with "www." in front of it (if you don't want this, just remove "www." from the rule).
This way, each and every URL that Googlebot requests will get some sort of "extra verification stamp" saying "the right URL for the file you requested is the same URL as the one you used"
(actually it says: "the URL you requested has been moved permanently to the exact same place - ie. to the location you already requested once". So, if there were no hijackers this would be pure nonsense. The "www." part adds a small bit of real and useful functionality.)
It is a bit similar to the (second part of the) method posted by boredguru, but it does not change any URLs and it does not use 302 status codes, so it will not create extra duplicate content for you.
>> slashdot
yeah, i noticed that ;) Too bad the slashdot crowd only need to see the word "adult" one time in an article to be talking about pr0n for hours. However, the point was picked up after a few screens of off-topic posts.
[edited by: claus at 11:33 pm (utc) on Mar. 15, 2005]
GoogleGuy is here to help with the small website/Google-related things, and a big thanks for that.
Respect, Zeus, but I'm unsure of what that help could be. Other than advice on whether I should post a pic of my dog on the site, I don't know what might be forthcoming. It ain't like the good old days, when one could almost believe they meant that "do no evil" stuff (and GG would actually check on obvious injustices). IPO or not, there's little credibility left if they can't even comment on this.
My only worry now is one link that I asked a directory to remove. It was hijacking my homepage, and they removed it at my request, but when I went to the URL removal tool it responds: "www(dot)othersite.com/go.php?id=58585 returns 302 Found, but the HTTP response header is empty".
In other words, they removed my link, but the PHP redirect now sends to an empty URL; it resolves to a 404 on their server, but Google can't seem to figure that out.
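That "empty header" behaviour is easy to reproduce and see for yourself. This sketch spins up a tiny local server that mimics a broken goto.php-style redirect (the path and the id are invented for the demo) and interrogates it the way a header-checking tool would:

```python
import http.client
import http.server
import threading

class EmptyLocationRedirect(http.server.BaseHTTPRequestHandler):
    """Mimics a broken redirect script: answers 302 with an empty Location."""
    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", "")  # the redirect points nowhere
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo output quiet

# Bind to an ephemeral port and serve in the background.
server = http.server.HTTPServer(("127.0.0.1", 0), EmptyLocationRedirect)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/go.php?id=58585")
resp = conn.getresponse()
print(resp.status, repr(resp.getheader("Location")))  # 302 ''
server.shutdown()
```

A browser (and apparently Google's URL console) just sees "302 Found" and goes nowhere; only inspecting the Location header shows the redirect is dead.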
I e-mailed them back but they don't seem to care anymore.
How do you know whether, when Googlebot visits, it thinks it is fetching your domainname.com or hijacker.com/url.php?domainname.com?
Because when you redirect it, Gbot could really have come asking for yourdomain.com, but the next time (more like the next day) it could be asking for the hijacker's page, which it thinks has moved to your homepage.
And as your homepage will be visited more often than some page three levels deep on your hijacker's site, we would have to be pretty lucky to catch the bot at the right time to make it think the hijacker's page has permanently moved.
This is the only flaw, but it is countless times safer and cooler (no, ciml, I really do think cool URLs don't change :) ) than what I suggested. I think taking this idea a step further will bring us closer to realizing our goals.
How about doing it once every day, for Googlebot alone?
That is
Day1 : Gbot asks for yourdomain.com. You redirect it once that day to yourdomain.com. No harm done today and no gain also.
Day2 : Gbot asks for yourdomain.com. You redirect it once that day to yourdomain.com. No harm done today and no gain also.
Day3 : ditto
Day4 : ditto
Day5 : ditto
Day6 : Gbot asks for yourdomain.com thinking it is fetching hijacker.com/url.php?url=yourdomain.com. Today no harm done but lots of good done.
I need to refine this. So I am planning to look at my logs for the past year to see how Gbot has requested my pages, starting from the homepage, and how many times a day.
Will post if I think I see any pattern, and ask for your ideas.
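A rough sketch of that log check, assuming Apache combined-log format (the sample lines, IPs, and paths below are invented; note also that a User-Agent string alone can be spoofed, so a serious check should verify the crawler's IP by reverse DNS as well):

```python
import re
from collections import Counter

# Hypothetical sample lines; in practice, read these from your access log.
LOG_LINES = [
    '66.249.66.1 - - [15/Mar/2005:04:12:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [15/Mar/2005:09:30:44 +0000] "GET /page.html HTTP/1.1" 200 2048 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '10.0.0.5 - - [15/Mar/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    '66.249.66.1 - - [16/Mar/2005:04:15:22 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

# Pull out the day, the requested path, and the user agent.
line_re = re.compile(
    r'\[(\d+/\w+/\d+):.*?\] "GET (\S+) HTTP[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

fetches = Counter()
for line in LOG_LINES:
    m = line_re.search(line)
    if m and "Googlebot" in m.group(3):
        day, path = m.group(1), m.group(2)
        fetches[(day, path)] += 1

# How many times per day did Gbot ask for each page?
for (day, path), n in sorted(fetches.items()):
    print(day, path, n)
```

If the homepage is fetched twice on the day a hijacker's cache date updates, that would be exactly the kind of peculiarity worth posting here.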
claus - Don't use it. I'm sorry, it will loop of course. Back to the drawing board.
I made the same mistake in msg #:552
I wonder if jdmorgan (jim) from the apache forum could come up with a workable solution, he's good.
-------
Another idea,
For example, when people link to your site, ask them to use www, and then have this in your .htaccess (or vice versa):
RewriteCond %{HTTP_HOST} ^www\.example\.com
RewriteRule ^(.*)$ http://example.com/$1 [R=permanent,L]
That way even if they use a 302 to www.example.com it would be corrected automatically when it was 301'ed in your htaccess file. Although this could potentially mess with your backlinks. Thoughts?
1. random content shuffle
2. programmed 1 shot 301 redirects (kinda like the random content).
3. massive invasive insertion of code.
Now, I already do a lot of 1; 2 is on the boards; 3 I just did for another reason (related to this) and had a sore wrist for a week or so after (not going to do it again).
You stated, claus, that Gbot sees the 302 redirection, goes "yippee, one more URL", and indexes it again as the hijacker's URL. Are you certain? Because if this is the way it's done, then we can get over it.
But... and this is a big but... what if Gbot does not go "yippee, one more new URL"? It already knows that the redirected URL exists in its index. It just, by default, assigns that URL to the hijacker's URL without doing a fetch.
We can get to know this with the help of gregdi & idoc & other victims (no, too strong a word... more like casualties).
I suggest this: gregdi & idoc, check for the hijacker's page in the index and check the cache date. If you have more than one hijacker, then check all their cache dates, and then check your logs to see Gbot activity on those dates. Is there any difference, like your hijacked page being fetched twice, etc.? If there are any peculiarities that you see, please post them here concisely. Also, don't be afraid to use your gut instincts; after all, no one knows your site better than you. Because in those peculiarities lies our answer. Really.
<edit reason> Corrected typos</edit>
[edited by: boredguru at 1:36 am (utc) on Mar. 16, 2005]