Forum Moderators: Robert Charlton & goodroi
<base href="http://www.yoursite.com/" />
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]
order allow,deny
deny from 11.22.33.44
allow from all
First thing I want to clarify -- as I understand your post, you are not talking about someone directly hijacking your traffic through some kind of DNS exploit. You're also not talking about someone stealing your content, although that can play into this picture at times.
Instead, you are talking about a proxy domain that points to yours taking over your position in the SERPs, sometimes a position that your url has held for a long time.
We've had some threads in this forum about it in the past, but there is nothing like a definitive address to the issue at this time. Yes, Google "should" fix it, and ultimately it is their job -- but what do you do while you're waiting for Google to fix it? Might be a long wait, you know?
For reference, here are the two recent threads about proxy hijacks. Feel free to fold the issues raised there into this discussion. As I said before, I don't think anyone has the definitive solution so far, but there are a lot of good ideas.
Hijacking - Some Advice for Webmasters [webmasterworld.com]
Proxy Hijack - Now what should I do? [webmasterworld.com]
I'd also like to talk about the idea of blocking the IP address of the proxy server. In some cases the IP address may not be static, or it may be spoofed. If you do decide to block by IP, be careful that you've got all the technical information pinned down accurately -- you don't want to harm your legitimate traffic, whether bot or human.
[edited by: tedster at 11:49 pm (utc) on July 1, 2007]
The best thing I did, on advice from this forum, was to periodically search Google for content from my website. If a proxy was stealing content, I added its IP to my .htaccess, which effectively served a Forbidden error for that page, and soon enough the page was removed.
Sometimes when this happens I will also make a change to the page, or rewrite the index.
Hope this helps.
Here is what has worked in the past:
The first step is to attempt a dialog with the site owner. Frequently a hole in their setup is being used to do a bit of mayhem. In this case it appears that someone found that port 443 was open to indexing and fed the search engines some links.
Don't give them a lot of time to fix it on their end.
Second is to get to their hosting provider if the first step results in no change.
When you contact the ISP it is also time to file a DMCA request with Google.
But somewhere between steps one and two, file a very large spam report with Google noting the many sites copied, especially any site containing content that Google might not wish to show in its index.
Currently, I can't get any response from the site that has all of the boosted content on it. Maybe a change is in the works.
[edited by: tedster at 4:15 pm (utc) on June 29, 2007]
If the purported-Googlebot requests are not coming from Google IP addresses, then one of two things is likely happening:
1) It is a spoofed user-agent, and not really Googlebot.
2) It *is* Googlebot, but it is crawling your site through a proxy.
The latter is how sites get 'proxy hijacked' in the Google SERPs -- Googlebot will see your content on the proxy's domain.
The most fool-proof and low-maintenance method to validate Googlebot requests is to do a double reverse-DNS lookup on the IP address making the request as Googlebot: if the IP address points to a Google hostname, and looking up that hostname then returns the original IP address, then it is a legitimate Googlebot request.
This is the method recently recommended by Google in their Webmaster help -- doubtless due to this very problem.
However, some servers are configured such that the Webmaster cannot do rDNS lookups. In that case, just using a simple list of the IP addresses that Google usually crawls from is a viable solution -- IF you keep a sharp eye out for Google changing or adding to the list of IP addresses that Googlebot uses to crawl.
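As a sketch of that double lookup -- here in Python, with the googlebot.com / google.com hostname suffixes taken from Google's Webmaster help (treat the suffix list as an assumption and verify it against Google's current documentation):

```python
import socket

# Hostname suffixes Google documents for Googlebot -- an assumption
# here; check Google's Webmaster help for the current list.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip):
    """Double reverse-DNS check: IP -> hostname -> back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except OSError:
        return False
    return ip in forward_ips
```

A request claiming a Googlebot user-agent that fails this check is either a spoofer or Googlebot arriving through a proxy -- either way, not something to serve as if it were Googlebot at your own address.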
Stopping these proxy hijacks at the front door will eliminate the need to repeatedly chase your content in search, prevent the need to file potentially-false DMCA complaints, etc. An ounce of prevention is worth a pound of cure...
Jim
It just doesn't make sense to me to leave that entire domain of swiped pages in the index, so that tens of thousands of individual webmasters have to file a DMCA -- and most of them won't know what hit them, let alone what to look for or do in the first place.
The average webmaster or site owner just knows when all their traffic is gone; they usually don't do an inurl: search as a matter of course, and most have probably never heard of it.
Perhaps someone can explain how leaving those sites in the index, while tens of thousands of legitimate site pages that actually have value disappear altogether, is supposed to be any good for the quality of search for users.
[edited by: tedster at 2:48 pm (utc) on June 27, 2007]
Instead, you are talking about a proxy domain that points to yours taking over your position in the SERPs, sometimes a position that your url has held for a long time
Exactly. It seems a bit difficult for people to wrap their heads around... hijacked within Google. The symptoms: A search for "My Company Name" would bring up the hijacker's site and not mine, all rankings for all keywords gone, massive traffic and sales loss.
Good news: either Google has taken action via my spam report, or the coding mentioned in the first post has reversed the action of the hijacker. Either way, my index page is no longer hijacked and I am back in the rankings (although slightly lower) for my big money keyword.
I plan to implement the double reverse DNS check to prevent this in the future. I've also set up Google Alerts to monitor inurl:mysite.com on a daily basis.
Overall I estimate damages from lost sales at about $1,700.
How to verify Googlebot is Googlebot [webmasterworld.com]
Avoid rogue spiders from spoofing the Googlebot user agent
The hijacking site hosts an online proxy web browser through which you can anonymously browse the web, and I think that someone used it to browse my website recently. In order to serve the page privately, the proxy sends a spider out to scrape the page and "host" it locally via a 302 redirect. Google's spider comes across this new 302 redirect and thinks that the content has a new location... confusing the original for the duplicate. In order to tell Google that the scraped page is not the original, you have to ban the 302 connection via .htaccess by URL and IP address.
The problem lies with Google as they cannot tell the difference between a real 302 redirect and a scraped 302 redirect.
[edited by: synergy at 5:52 pm (utc) on June 27, 2007]
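As a sketch of the .htaccess side of that, assuming the proxy scrapes from a fixed address (11.22.33.44 is only a placeholder -- pin down the proxy's real IP first, and heed the earlier caution about blocking the wrong traffic):

```apache
# Deny the proxy's scraper by IP address (Apache 2.2 syntax).
# 11.22.33.44 is a placeholder for the proxy's verified address.
Order Allow,Deny
Allow from all
Deny from 11.22.33.44
```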
In this case, the proxy is configured such that it forwards requests from a client --in this case, Googlebot-- to your server, and forwards the responses from your server back to the client. Your server sees the proxy as the client, and Googlebot sees the proxy as your server.
The forwarding may be completely transparent, or may employ some insidious software -- Perhaps substituting a fake robots.txt file if one is requested by the client, doing a bit of cloaking for fun and profit, etc.
But leaving that detail for the moment, Google crawls your site through this proxy and lists your content at the proxy's address (domain).
Since your server sees the proxy as the client, it's easy enough to check that a client claiming to be Googlebot is coming from a Google address and not from/through some proxy's address, as previously described.
Proxies are not necessarily good or evil, they're just proxies. Some are used to hijack content, and some are used by information-starved people stuck behind firewalls to get to the real and unfiltered world-wide-web.
Jim
What was staggering is
(a) that all the pages were non-Supplemental when the actual parent site the script was installed on was dead and low PR, and
(b) even more remarkable than fully indexing all this newer, near-dupe content, Google was dumping the older real pages from SERPs because of it. Overnight we lost all rankings for all but 2 terms.
It wasn't malicious but that didn't help us. What we did was track the site owners down and by chance they lived locally (same city). After we spoke to them they took all the pages down. Luckily enough Google actually crawled them and us within a few days of this and all our rankings came back.
A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" brings up 55,000+ results, when Saturday it was 13,000 and Sunday it was 30,000. It's now up to 79,200 since this morning, and going strong.
That one may be an unintentional "mishap" on their part, but based on what's been uncovered, it doesn't appear that the "other one" doing the same thing is quite so innocent.
I asked someone well versed in such matters earlier in the day, and according to them dealing with IP numbers is futile, since IP numbers can easily be swapped around, and cloaked info served to Google.
Also, the one doing the hijacking with a 302 go.php has 12,800 pages indexed, with the ones not serving the redirected site running Adsense on their own pages with the scraped content.
The particular site (and likely not the only one running that script) that Synergy was talking about has over 50,000 of the pages and is no longer responding to any request that I make. I suspect they may have closed the hole or that part of the net is off line to me.
On the go.php, what can I say -- it isn't difficult to handle redirection in any scripting system.
The only thing is, where do you find an all-inclusive list of IP ranges per country? I DON'T MEAN doing the IP lookup thing, but rather banning entire countries using a reliable list.
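Once you do have such a list, the mechanics of using it are simple enough. A sketch in Python, with a hypothetical file of one CIDR range per line (the file name and the ranges are made up for illustration; finding a reliable, current per-country list is the real problem):

```python
import ipaddress

def load_ranges(path):
    """Load one CIDR range per line from a hypothetical country-ranges file."""
    with open(path) as f:
        return [ipaddress.ip_network(line.strip())
                for line in f if line.strip()]

def ip_in_ranges(ip, ranges):
    """True if the address falls inside any of the loaded ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```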
Would this sort of thing be less likely to happen on a blogger blog? In other words, would a blog site hosted on a google server be less likely to get jacked? I mean, the issue of where the content sits, where it should be indexed from, and the domain it should be attributed to should be fairly clear if the content sits on a server operated by google itself. What do you think, Tedster?
After performing an inurl:mydomain.com search, I saw the following listing in the SERPs:
mydomain.com/?ref=questionable-porn-site.com
When I click on the listing, it takes me to a page that looks like mine!
When I went to the domain of the offending site, it was a web cam site based in Turkey.
What's going on, and how do I address it?
That's not a hijack, that's just someone linking to your site with a bogus query string -- probably hoping to pick up a link from your 'stats' page if it is publicly-accessible, or from the page(s) where that link appears.
Servers themselves ignore query strings attached to URLs -- They have meaning only if the URL points to (or is rewritten to) a script and if that script ascribes some meaning to the query string; The server itself doesn't use the query string at all, which is why that link displays your page.
In order to keep that link from lending any appearance of legitimacy, you can detect such bogus query strings and redirect to remove them. How you do that depends on what kind of server you're hosted on, and whether any other pages/scripts on your site actually use query strings.
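On an Apache server, for example, a minimal mod_rewrite sketch -- assuming no page on your site legitimately uses query strings (if some scripts do, you'd need to add conditions excluding them from the rule):

```apache
RewriteEngine On
# If any query string is present...
RewriteCond %{QUERY_STRING} .
# ...301-redirect to the same URL; the trailing "?" drops the query string.
RewriteRule ^(.*)$ /$1? [R=301,L]
```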
Jim
2. Block the site via .htaccess:
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]
You're not blocking the site, you're blocking an actual visitor that may have landed on a hijacked page, not a wise idea to punish the wrong person IMO. Fact is, it's highly unlikely you'll ever see the proxy as an actual referrer as they mask their presence, so that step is a total waste of time.
The most fool-proof and low-maintenance method to validate Googlebot requests is to do a double reverse-DNS lookup on the IP address making the request as Googlebot: if the IP address points to a Google hostname, and looking up that hostname then returns the original IP address, then it is a legitimate Googlebot request.
BINGO!
If you validate Googlebot (or any other crawler) with reverse/forward DNS checking the proxy hijacking simply goes away.
Not only can you stop a page hijacking, but when you detect this condition you can feed the search engine special pages via the proxy server and turn a thwarted hijacking to your advantage!
I'll leave what you feed the proxy server up to your imagination ;)
Otherwise, if you just let the page get returned via the proxy, Google sees your content coming from a different address and may assign ownership of the content to that proxy address.
Why Google is incapable of and/or refuses to block crawling via these sites -- which, after all these years, appear to have obvious detectable fingerprints -- is beyond me.
You're not blocking the site, you're blocking an actual visitor that may have landed on a hijacked page, not a wise idea to punish the wrong person IMO. Fact is, it's highly unlikely you'll ever see the proxy as an actual referrer as they mask their presence, so that step is a total waste of time.
If you do the .htaccess blocking, granted, you are denying legit access, but aren't you also letting G know it's a bad link when they visit the proxy server site?
[edited by: Robert_Charlton at 1:15 am (utc) on June 29, 2007]
[edit reason] fixed slash on quote tag [/edit]
If you do the .htaccess blocking, granted, you are denying legit access, but aren't you also letting G know it's a bad link when they visit the proxy server site?
If you block the IP of the proxy, yes.
If you block the proxy referrer, no.
A good proxy never divulges its domain name, as every request for every link is filtered via the proxy, so your site should never see their referrer -- unless something else is going on, like they performed a redirect to your site instead of routing it through the proxy for some reason.
[edited by: incrediBILL at 12:21 am (utc) on June 29, 2007]
incrediBill hows about its own tail?
However, you will never see Googlebot doing its thing through those scripts unless the user of said thingies is totally out of it.
What I see after I have located one of the little babies is frequently an entry like this:
127.0.0.1 - - [28/Jun/2007:22:45:59 -0400] "GET /robots.txt HTTP/1.0" 200 27 "-" "-"
The only difference is that I am running a modified copy of one of the little goody scripts, and as you can see, the referer and agent are gone.
This is a little 6 line version of one that normally caches server side on the site or desktop using it.
127.0.0.1 - - [28/Jun/2007:22:45:59 -0400] "GET /#*$!#*$!.php?url=http://localhost/robots.txt HTTP/1.1" 200 27 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)"
was what the server running the script saw.
[edited by: theBear at 2:58 am (utc) on June 29, 2007]