Forum Moderators: Robert Charlton & goodroi
[secure.*******...]
When you click the link in G you go straight to the Legitimate Site. G would automatically impose a duplicate penalty.
Questions:
1) Why does the hacker site do this? What do they get from the exercise, apart from harming other sites?
2) What do I say to G? Is it valid to send them a DMCA? I'm cautious about contacting G in case they hamfistedly remove the Legitimate pages. Anybody else reported this to G with success?
3) Is it safe to block access to our site to people visiting via nph-proxy.cgi tools? The only people I'd welcome using it would be punters using nph to avoid a government ban, as in China. Any other valid reasons for using nph-proxy?
4) What is the best way to block access to our site to nph-proxy (and variants) users?
Sample mod-rewrite code please.
Ta!
I notified G days ago, via their online form, but have not had any reply.
I'd prefer to block on the basis of the visitor using nph-proxy rather than IP, so what would be the safest code to do that?
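For what it's worth, here is a minimal, untested .htaccess sketch of that referer-based approach. It assumes mod_rewrite is enabled and that the proxy script's name leaks into the Referer header, which not every CGI proxy does:

```apache
# Hedged sketch, not a vetted rule set: forbid requests whose
# referer mentions nph-proxy. Proxies that strip or forge the
# referer will slip straight past this.
RewriteEngine On
RewriteCond %{HTTP_REFERER} nph-proxy [NC]
RewriteRule .* - [F]
```

Test it on a throwaway directory first; a bad rule in .htaccess can take down the whole site.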
Ta!
It takes a bit of time for Google to act if they are going to.
If you have an AdSense account and an AdWords account, then you should report such things through those as well as through the normal forms to Google.
The more people that pile on the better.
You can harden your site to some degree. However, the more you lock down your site, the greater the chance that you will block valid users.
Not all proxies are bad; the problem is what Google and the other engines get to see.
But now we face it and need to protect against it. The first thing I'd check would be USER_AGENT and all the other possible variables, to detect when a proxy hijacker reads our site, and I would serve altered, non-duplicate content. It's risky, though, to deny access to requests with an empty USER_AGENT.
The major problem is that the best weapon against these hijackers is, in fact, cloaking, so it's important to do it safely to avoid being penalized by Google. It would be naive to assume that Google doesn't use anything more sophisticated than the Mozilla-flavoured Googlebot to detect cloaking - the simplest cloaking scripts just match the string 'Googlebot' in USER_AGENT, so it's not unlikely that Google uses undetectable bots to catch cloakers. We wouldn't want to serve those bots altered content while fighting the hijacker.
I wonder if the currently used nph proxies have a distinctive USER_AGENT at all - it's so easy to pretend to be Firefox or Internet Explorer. But if they are lame enough, we can exploit that for our protection. Remember, though, that some minor search engines use a USER_AGENT like 'libwww-perl', and it would not be advisable to deny our pages to them.
It would be a good way to track down their IP addresses, but they could use anonymous proxies to access our sites. Let's hope they don't intend to do that yet; in fact, it would slow their scripts down a lot.
Detecting them and cloaking the content is, though, very difficult. But there is another thing that protects us.
In the discussion about 302 hijacking, GoogleGuy explained which factors decide the selection of the canonical URL. If our pages have no penalties, high PageRank and quality on-topic links, proxy hijackers can hardly do any harm.
Unfortunately, they can boost the PR of hijacked URLs with spammy links on blogs, and if the hijacked site has built its PR the same way, it can lose the fight; but sites supported by quality links cannot be outmatched by spammers and hijackers. So, as always in the Google world, quality inbound links are the key.
Another thought: here we can see what's good about the existence of the sandbox. A hijacker's site, boosting PR with spammy techniques on a throw-away domain, is more likely to be sandboxed than an old established site, so it's unlikely to harm the established site even if it generates duplicates of it.
These are my first thoughts, just my bit for this discussion. I may be wrong about some of the things I wrote, but that's what occurs to me based on my own experience.
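To illustrate the "check all the other possible variables" idea above: relaying proxies often add a Via or X-Forwarded-For header, which mod_rewrite can test. This is only a sketch under that assumption, and it will also shut out visitors behind legitimate corporate or ISP proxies, so weigh the trade-off discussed in this thread:

```apache
# Sketch only: refuse any request carrying a proxy-added header.
# Expect collateral damage among legitimate proxy users.
RewriteEngine On
RewriteCond %{HTTP:Via} . [OR]
RewriteCond %{HTTP:X-Forwarded-For} .
RewriteRule .* - [F]
```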
I simply want to block access, and inform G so they can clean their index of such junk that vindictively damages legitimate sites.
Since they hit me months ago how do I trace their IP now?
I'll block the hacker site's IP from the whois info.
But surely it's possible to 403 block all visitors accessing via nph-proxy?
Like this:
RewriteCond %{REQUEST_URI} nph-proxy [NC,OR]
as a line in my nasty bot blocks...
RewriteRule .* - [F]
Feel free to correct me.
As I said earlier, I don't mind losing a few legitimate visitors to block this stuff, because their damage could be very expensive.
A week and still no reply from G. I'll keep after them.
I normally get sensible answers from G by email.
I believe you have to be around PR 7 before penalties are ignored. I'm about PR 4-5 (I don't have GToolbar so I don't know precisely.)
It'd be nice if GG had a word for us on this nasty development, especially considering the 'Official Line' is that it is not possible to nobble a competitor's listing in G. That's clearly untrue.
If you block just four of the biggest hosting companies' ranges you'll stop a whole pile.
The second method is to block all agents without a user agent id.
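The second method mentioned above might look like this in .htaccess (a sketch, assuming mod_rewrite; note that some legitimate tools and monitoring scripts also send no User-Agent at all):

```apache
# Sketch: forbid requests whose User-Agent header is empty.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
```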
If Google is so stupid as to consider sending a viewer out to a site to detect cloaking by identifying with an empty agent id, they should be put in the town square as rotten egg targets.
Yeah, that's funny :))
If I were to detect cloaking, I'd just put something like "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" in my USER_AGENT; if I were writing an nph-proxy script, I'd use it as well. And I'd visit the victim through an anonymous proxy, to hide the IP of my hosting company.
But maybe they aren't so cautious. Or maybe the solution is to block both IPs of hosting companies and known anonymous proxies? Sounds difficult anyway.
The best way to fight a large number of these *wipes is to block the IP address ranges of the largest hosting companies; the servers at those places have no reason to access your sites. If you block just four of these you'll stop a whole pile.
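A hedged sketch of that hosting-range block (the CIDR ranges below are documentation placeholders, not real datacenter allocations - look up the actual ranges in the whois/RIR records for the hosts you want to block):

```apache
# Placeholder ranges only - substitute the hosting company's real
# allocations before using anything like this.
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24
```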
Well let's have your IP list to block...
But I'd also like to block all users accessing via nph-proxy. Is there anything wrong with the htaccess modrewrite code I suggested earlier?
Ta!
I run my own proxy on my website, enabling my visitors to surf anonymously for free.
I did not put it up to create duplicate content. The only way I can see that Google would get to see these pages is if someone set up a link to a page that has been "proxified", and, as we all know Google, it will follow all the other links within the proxified pages - which could lead to the whole internet!
This is scary for me, as I hope it never happens to me at all!
Does anyone know any way I can block access from Google besides the robots.txt method? I don't want to be identified as one of these "hacking" sites!
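One possibility beyond robots.txt, sketched here on the assumptions that the proxy script is reachable as nph-proxy.cgi and that mod_rewrite is available: refuse to serve the proxy script to Googlebot at all. User-agent sniffing is imperfect, but it is cheap insurance on top of robots.txt:

```apache
# Sketch: return 403 when Googlebot asks for the proxy script.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^nph-proxy - [F]
```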
g1smd: I'd like to see that other thread you mention so that I can block that nph source too. Could you sticky me?
For novices like me coming to this thread in future, looking for info on how to block these hacker sites using nph-proxy, I think it'd be helpful to post sample modrewrite code to use to block access via nph-proxy and variants.
If the code I posted here previously is wrong, please say so, to stop others from using it.
Ta!
The reason I will not comment on your code is that I haven't tested it.
You can get far better guidance from the folks in the Apache forum. Your code looks reasonable, I just don't know if it works as intended. Also, in these forums the forum system can get in the way and make correctly functioning code fail. All code should be posted inside the code / end-of-code markup.
A single mistake in an .htaccess file can have a drastic effect on your site. One that might not show up right off.
Then there is the fact that I do not believe (in fact I know) that blocking nph-proxy by name or partial name will catch most of what is going on.
You see it isn't the name of the script that is the problem.
I'll ask about the blocking code in the other forum.
I think it is worthwhile blocking on the basis of the nph-proxy name, for many reasons, even if it only stops 50% of them.
For example: when G bot sees you are blocking nph-proxy visitors, and you subsequently report an nph-proxy spammer, G can check your innocent target site and see that you are blocking nph-proxy.
It adds credibility to your spam report, so they are more likely to act on it.
GOOD NEWS:
I reported this particular site to G, a week ago, and now the only pages showing in the serps are the Arabic copies of legitimate innocent Arabic sites. The 20,000 English copies I mentioned are gone.
So G is acting on this type of spam report.
So I urge everyone to check G for nph-proxy copies of their pages and report to G the spammers and hackers.
G still haven't replied to my emails though...
The boss usually hands them the info through a couple of paths.
Then you have to watch out for the old "gone" pages re-entering the index when Google does an update or attempts to clean house. In other words, once you've been bit, they can come back and take another bite.
But, what do us woodland critters know?
[edited by: theBear at 11:51 pm (utc) on Oct. 16, 2005]
I'll also report the offending domain to Y! and MSN and see if they respond as efficiently as Google has.
It's a quick way for all the SE's to improve their serps.
In one swipe they eliminate thousands of illegal copies of legitimate pages, AND should therefore lift duplicate penalties wrongly applied to the good guys. Thereby bringing them to the attention of searchers.
Just checked, and none of the 20,000 illegal pages are in Y! or MSN Search. I don't think they were ever listed in their indices, so they must be better than G at distinguishing this particular kind of illegal duplication.
Let's see if G's page total takes a fall, reflecting their clean out of nph-proxy duplicates. There must be literally millions of illegal copies currently in the index.
Yes, I have filed spamreports AGAIN!
C'mon GG; here's a clear example of malicious sites deliberately harming the rankings of innocent ones.