Forum Moderators: Robert Charlton & goodroi

nph-proxy pages in G

How to safely block access


Angonasec

10:21 pm on Oct 9, 2005 (gmt 0)



A hacker-type site has over 20,000 pages in G, most of them duplicates of legitimate pages on other sites, reached via the hacker site's nph-proxy.cgi tool.
So the URL listed in G is along these lines (I've added the * to replace the domain name):

[secure.*******...]

When you click the link in G you go straight to the Legitimate Site, so G would automatically impose a duplicate-content penalty.

Questions:

1) Why does the hacker site do this? What do they get from the exercise, apart from harming other sites?
2) What do I say to G? Is it valid to send them a DMCA notice? I'm cautious about contacting G in case they ham-fistedly remove the Legitimate pages. Has anybody else reported this to G with success?

3) Is it safe to block access to our site for people visiting via nph-proxy.cgi tools? The only people I'd welcome using it would be punters using nph to avoid a government ban, as in China. Are there any other valid reasons for using nph-proxy?

4) What is the best way to block nph-proxy (and variant) users from accessing our site?
Sample mod_rewrite code, please.

Ta!

Angonasec

10:58 pm on Oct 11, 2005 (gmt 0)



Bump...

And that should be (my mistake):

[secure.*******...]

theBear

12:35 am on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[webmasterworld.com...] Start at msg #14

Angonasec

10:19 pm on Oct 12, 2005 (gmt 0)



Thanks, theBear. I read that thread and started this one to see how G deals with the problem, and how we can better protect ourselves.

I notified G days ago, via their online form, but have not had any reply.

I'd prefer to block on the basis of the visitor using nph-proxy rather than IP, so what would be the safest code to do that?

Ta!

g1smd

10:35 pm on Oct 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google usually takes 2 or 3 days to respond to email or online forms. The first one or two replies will be "standard" letters, often with answers that are totally irrelevant to your question. Persevere, and after the third or fourth attempt you might make some progress, but there is a good chance that you'll beat your head against a wall trying to get someone to understand what you are talking about. Google's frontline communication is often abysmal to unusable.

theBear

1:49 am on Oct 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Over the last two sets of SERP swings we have reported a number of "proxy"-type sites through multiple Google channels.

It takes a bit of time for Google to act if they are going to.

If you have an AdSense account or an AdWords account, you should report such things through those channels as well as through the normal forms to Google.

The more people who pile on, the better.

You can harden your site to some degree. However, the more you lock down your site, the greater the chance that you will block valid users.

Not all proxies are bad; the problem is what Google and others get to see.

shri

3:51 am on Oct 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We try to track down the offender's IP address and either serve them different content or a 410.

Proxies are easy to deal with once they've been spotted.
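
A minimal .htaccess sketch of that approach (the IP below is a documentation placeholder, not a real offender; mod_rewrite's [G] flag answers with 410 Gone):

RewriteEngine On

# Hypothetical offender IP -- substitute the address you actually spotted
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$
RewriteRule .* - [G]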

theBear

4:05 am on Oct 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I like feeding them their own homepage, but a 410 works; it just isn't as much fun as watching Google chew on them.
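
A sketch of the feed-them-their-own-homepage trick (both the IP and the domain below are placeholders for the proxy operator's own):

RewriteEngine On

# Hypothetical proxy host -- replace with the operator's real IP and site
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$
RewriteRule .* http://proxy-operator.example/ [R=302,L]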

Wizard

4:23 pm on Oct 13, 2005 (gmt 0)

10+ Year Member



I'm afraid these proxies are intended to harm rankings by dynamically creating duplicate content. I meant to point the problem out months ago, when there was a major discussion about 302 hijacking on WebmasterWorld, but the topic in which I described the opportunity was not allowed by an administrator, and I eventually agreed with him that it would have done more harm than good.

But now we face it and need to protect against it. The first thing I'd check would be USER_AGENT and any other available variables to detect when a proxy hijacker reads our site, and then I would serve it altered, non-duplicate content. It's risky, though, to deny access to requests with an empty USER_AGENT.

The major problem is that the best weapon against these hijackers is, in fact, cloaking, so it's important to do it safely to avoid being penalized by Google. It would be naive to assume that Google doesn't use anything more sophisticated than the Mozilla Googlebot to detect cloaking: the simplest cloaking scripts just match the string 'Googlebot' in USER_AGENT, so it's quite possible that Google uses undetectable bots to catch cloakers. We wouldn't want to serve those bots altered content while fighting the hijacker.

I wonder whether the nph proxies currently in use have a distinctive USER_AGENT at all; it's so easy to pretend to be Firefox or Internet Explorer. But if they are lame enough, we can exploit that for our protection. Remember, though, that some minor search engines use a USER_AGENT like 'libwww-perl', and it would not be advisable to deny our pages to them.
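
If a proxy script did announce itself in USER_AGENT (an assumption; many simply forward the visitor's own browser string), a sketch like this would turn it away without touching 'libwww-perl':

RewriteEngine On

# Hypothetical signature -- only works if the proxy really sends it
RewriteCond %{HTTP_USER_AGENT} nph-proxy [NC]
RewriteRule .* - [F]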

It would also be a good idea to track down their IP addresses, but they could use anonymous proxies to access our sites. Let's hope they don't bother to do that yet; in fact, it would slow their scripts down a lot.

Detecting them and cloaking the content is, though, very difficult. But there is another thing that protects us.

In the discussion about 302 hijacking, GoogleGuy explained which factors decide the selection of the canonical URL. If our pages have no penalties, high PageRank, and quality on-topic links, proxy hijackers can hardly do any harm.

Unfortunately, they can boost the PR of the hijacking URLs with spammy links on blogs, and if the hijacked site has built its PR the same way, it can lose the fight; but sites supported by quality links cannot be outmatched by spammers and hijackers. So, as always in the Google world, quality inbound links are the key.

Another thought: here we can see an upside to the sandbox. A hijacker's site, boosting PR with spammy techniques on a throw-away domain, is more likely to be sandboxed than old, established sites, so it's unlikely to harm them even if it generates duplicates of their pages.

These are my first thoughts, just my bit for this discussion. I may be wrong about some of what I wrote, but this is what occurs to me based on my own experience.

Angonasec

9:11 pm on Oct 14, 2005 (gmt 0)



Thanks for all your comments; I'm well outa my depth.
To combat nph-proxy spammers/hijackers I don't want to use any form of cloaking or redirection. Far too risky long term.

I simply want to block access, and inform G so they can clean their index of this junk that vindictively damages legitimate sites.

Since they hit me months ago, how do I trace their IP now?

I'll block the hacker site's IP from the whois info.

But surely it's possible to 403-block all visitors accessing via nph-proxy?

Like this:

RewriteCond %{REQUEST_URI} ^/nph-proxy$ [NC,OR]

added as one condition among my existing nasty-bot blocks, which end with:

RewriteRule .* - [F]

Feel free to correct me.

As I said earlier, I don't mind losing a few legitimate visitors to block this stuff, because their damage could be very expensive.
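
Another variant I'm toying with checks the Referer instead, on the assumption (possibly wrong) that clicks inside a proxified page arrive with the proxy script's own URL as the referrer:

RewriteCond %{HTTP_REFERER} nph-proxy [NC]
RewriteRule .* - [F]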

A week and still no reply from G. I'll keep after them.
I normally get sensible answers from G by email.

I believe you have to be around PR 7 before penalties are ignored. I'm about PR 4-5 (I don't have GToolbar so I don't know precisely.)

It'd be nice if GG had a word for us on this nasty development, especially considering the 'Official Line' is that it is not possible to nobble a competitor's listing in G. That's clearly untrue.

theBear

9:23 pm on Oct 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The best way to fight a large number of these *wipes is to block the IP address ranges of the largest hosting companies; the servers at those places have no reason to access your sites.

If you block just four of these you'll stop a whole pile.
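
For example (the CIDR ranges below are documentation placeholders; substitute each host's published netblocks from their whois records):

Order Allow,Deny
Allow from all
# Hypothetical hosting-company ranges
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24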

The second method is to block all agents without a user agent id.
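
A minimal sketch for that (note it will also turn away the occasional legitimate client that sends no User-Agent header at all):

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]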

If Google is stupid enough to consider sending a viewer out to a site to detect cloaking while identifying itself with an empty agent ID, they should be put in the town square as rotten-egg targets.

Wizard

6:00 am on Oct 15, 2005 (gmt 0)

10+ Year Member



If Google is stupid enough to consider sending a viewer out to a site to detect cloaking while identifying itself with an empty agent ID, they should be put in the town square as rotten-egg targets.

Yeah, that's funny :))

If I were trying to detect cloaking, I'd just put something like "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" in my USER_AGENT, and if I were writing an nph-proxy script, I'd use it as well. And I'd visit the victim through an anonymous proxy, to hide the IP of my hosting company.

But maybe they aren't so cautious. Or maybe the solution is to block both the IPs of hosting companies and known anonymous proxies? It sounds difficult either way.

Angonasec

10:21 pm on Oct 15, 2005 (gmt 0)



TheBear suggested

The best way to fight a large number of these *wipes is to block the IP address ranges of the largest hosting companies; the servers at those places have no reason to access your sites.

If you block just four of these you'll stop a whole pile.

Well, let's have your IP list to block...

But I'd also like to block all users accessing via nph-proxy. Is there anything wrong with the .htaccess mod_rewrite code I suggested earlier?

Ta!

ezyid

7:18 pm on Oct 16, 2005 (gmt 0)

10+ Year Member



I'm worried!

I run my own proxy on my website, enabling my visitors to surf anonymously for free.

I have not put it up to create duplicate content. The only way I can see Google finding these pages is if someone set up a link to a page that has been "proxified", and, as we all know, Google will follow all the other links within the "proxified" pages, which could lead to the whole internet!

This is scary, and I hope it never happens to me at all!

Does anyone know any way I can block access from Google besides the robots.txt method? I don't want to be identified as one of these "hacking" sites!
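
One hedged possibility, assuming your proxy lives at a path like /nph-proxy.cgi (adjust to your actual script name; the bot names below are just the common crawlers, not a complete list), is to return a 403 to the major spiders for that script only:

RewriteEngine On

# Refuse the big crawlers access to the proxy script itself
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteRule ^nph-proxy - [F]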

g1smd

7:30 pm on Oct 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I see a huge stink brewing in another forum, where some site has already done exactly that: hundreds of thousands of pages, all proxified (and with AdSense added to those pages).

Angonasec

10:58 pm on Oct 16, 2005 (gmt 0)



TheBear: Thanks for the sticky with the IP list to block.

g1smd: I'd like to see that other thread you mention so that I can block that nph source too. Could you sticky me?

For novices like me coming to this thread in future, looking for info on how to block these hacker sites using nph-proxy, I think it'd be helpful to post sample mod_rewrite code for blocking access via nph-proxy and variants.

If the code I posted here previously is wrong, please say so, to stop others from using it.

Ta!

theBear

11:09 pm on Oct 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That was just for one hosting provider, and I'm not sure I got all of their IP address blocks.

The reason I will not comment on your code is that I haven't tested it.

You can get far better guidance from the folks in the Apache forum. Your code looks reasonable; I just don't know whether it works as intended. Also, in these forums the forum system can get in the way and make correctly functioning code fail. All code should be posted inside the code / end-of-code markup.

A single mistake in an .htaccess file can have a drastic effect on your site, one that might not show up right away.

Then there is the fact that I do not believe that blocking nph-proxy by name or partial name will catch most of what is going on; in fact, I know it won't.

You see, it isn't the name of the script that's the problem.

Angonasec

11:30 pm on Oct 16, 2005 (gmt 0)



Thanks TheBear!

I'll ask about the blocking code in the other forum.

I think it is worthwhile blocking on the basis of the nph-proxy name, for many reasons, even if it only stops 50% of them.

For example: when Googlebot sees you are blocking nph-proxy visitors, and you subsequently report an nph-proxy spammer, G can check your innocent target site and confirm that you are blocking nph-proxy.

It adds credibility to your spam report, so they are more likely to act on it.

GOOD NEWS:

I reported this particular site to G a week ago, and now the only pages showing in the SERPs are the Arabic copies of legitimate, innocent Arabic sites. The 20,000 English copies I mentioned are gone.

So G is acting on this type of spam report.

So I urge everyone to check G for nph-proxy copies of their pages and report the spammers and hackers to G.

G still haven't replied to my emails though...

theBear

11:50 pm on Oct 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, they are getting reports.

The boss usually hands them the info through a couple of paths.

Then you have to watch out for the old "gone" pages re-entering the index when Google does an update or attempts to clean house. In other words, once you've been bitten, they can come back and take another bite.

But, what do us woodland critters know?

[edited by: theBear at 11:51 pm (utc) on Oct. 16, 2005]

Angonasec

11:51 pm on Oct 16, 2005 (gmt 0)



Thanks for the sticky, g1smd. Greasy, eh?

I'll also report the offending domain to Y! and MSN and see if they respond as efficiently as Google has.

It's a quick way for all the SEs to improve their SERPs.

In one swipe they eliminate thousands of illegal copies of legitimate pages, AND should therefore lift duplicate penalties wrongly applied to the good guys, thereby bringing them back to the attention of searchers.

Angonasec

12:02 am on Oct 17, 2005 (gmt 0)



I'll also report the offending domain to Y! and MSN and see if they respond as efficiently as Google has.

Just checked, and none of the 20,000 illegal pages are in Y! or MSN Search. I don't think they were ever listed in their indices. So they must be better than G at distinguishing this particular kind of illegal duplication.

Let's see if G's page total takes a fall, reflecting their clean-out of nph-proxy duplicates. There must be literally millions of illegal copies currently in the index.

Angonasec

11:04 pm on Oct 30, 2005 (gmt 0)



Mmmm... Jagger2 brought the nph-proxy spam pages BACK into G.

Yes, I have filed spam reports AGAIN!

C'mon GG; here's a clear example of malicious sites deliberately harming the rankings of innocent ones.

theBear

1:32 am on Oct 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm just shaking my head.

GIGO