Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

         

synergy

1:59 pm on Jun 25, 2007 (gmt 0)

10+ Year Member



I posted about this in the back room, but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such:
https://www.scumbagproxy.com/cgi-bin/nph-ssl.cgi/000100A/http/www.mysite.com

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of sites affected is increasing exponentially, and your site could be next.

Take preventative action now by doing the following...

1. Add this to all of your headers:
<base href="http://www.yoursite.com/" />


and if you see an attempted hijack...

2. Block the site via .htaccess (note: a RewriteCond does nothing by itself - it needs a RewriteRule to act on):
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]


3. Block the IP address of the proxy
order allow,deny
deny from 11.22.33.44
allow from all


4. Do your research and file a spam report with Google.
[google.com...]

casua

11:01 am on Oct 14, 2007 (gmt 0)



Not every first time visitor - just the ones who claim to be a search engine spider.

Sorry, but that can only give you a false sense of security. The only thing they need to do to bypass your "protection" is to not pretend they are Google. Whack-a-mole style silliness. The other case (genuine Googlebot visiting your site via a proxy) -- that is pure scraping. And it is only one form of scraping. If you have a big site and want to battle scraping -- well, uh, good luck playing the whack-a-mole silliness.

casua

11:09 am on Oct 14, 2007 (gmt 0)



And again, scraping in any form is solely Google's problem. What the hell do some people think? That webmasters will clean the Google search index of some mess created by other people? How cheeky would that be.

I say, let them be lazy. AltaVista was lazy too. We all know what happened next...

tedster

4:29 pm on Oct 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Again, this is NOT scraping - this is the real googlebot indexing a website's content through a proxy server.

When that happens, there is no ACTUAL content at the proxy. Your server will see a googlebot user agent, because the request really IS made by googlebot. Proxies do pass on the user agent with the request - that's how they work. But because there's a proxy server in the middle of the request chain, the IP for the GET request belongs to the proxy server instead of belonging to googlebot.

casua, I understand your opinion about not wanting to fix what you see as Google's problem. So don't do it. I'm not hoping to change your mind, only to keep the advice clear that other members have given for people who are interested.

casua

7:47 pm on Oct 14, 2007 (gmt 0)



Again, this is NOT scraping

It is scraping (whether intentional or incidental). If a site is in Google's index and that site has your content verbatim, then your content has been, by definition, scraped.

tedster

10:50 pm on Oct 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Call it what you want, but there is a real and precise difference. When your pages are scraped, the other domain actually hosts your content - maybe exactly as it is, maybe changed slightly, and maybe cut up into bits and shuffled around in some autogenerated fashion.

A proxy server does NOT host your content. The two situations require different treatment and different understanding by those who care about resolving the issue for their own sites. If you do not care, that is your right.

I have been willing to post about this up to now, because I think our discussion can be clarifying for others. As synergy mentioned here early in the thread "It seems a bit difficult for people to wrap their heads around..."

However, this post must be the end of our "it is" - "it is not" argument. Casua, you are free not to agree and to do as you will. But let's not bore the audience any further. Someone reading may have other insights about this thread's topic, which is "how to defend" - and not "should I defend" or "is there really a problem".

Let's both give some space in the thread for other members now, if there's anyone left except you and me.

CainIV

11:15 pm on Oct 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is scraping (whether intentional or incidental)

Scraping is lifting content from pages, which this is not. Proxies are not always meant to be malicious.

Why should I spend my time fixing Google's problems?

Quite obviously, this is a thread about people who are experiencing significant problems due to proxy hijacking. If it doesn't involve you or your website, then naturally you might care very little about it.

kwasher

12:37 am on Oct 15, 2007 (gmt 0)

10+ Year Member




Only had to make 2 modifications:
- replaced the broken ¦ in line 5 of the PHP
- inserted <?php before the PHP code and ?> after it

...just wanted to clarify that there are two broken pipes.

bytb

5:09 am on Oct 15, 2007 (gmt 0)



I think it's great to spark up this debate, and although I agree with tedster that reverse DNS / forward DNS does stop proxy hijacks at present, you can bet there is a way round it. I can think of one straight away, but I don't want to give these guys any ideas.

I also agree about losing natural linking etc. through the method tedster mentioned.

However, let's set that aside, as I think it can be overcome and all your bookmarks kept, tedster. The main issue I would like to see discussed is the idea of serving the page in a frame to protect accessibility (against cloaking) and prevent all forms of automated attack online.

If I am correct, as I think I am, serving a framed page based on IP protects against any form of hijack (except the 302, but I am assured that one is cured), scrapers and email harvesters - potentially we have a solution that ends automated black hat SEO completely. Now, the scripts are not meant to be a final solution, merely to spark debate. Please, tedster, reconsider before you dismiss this idea, as I think your objections can be solved.

What I need to know is this: as far as Google is concerned, will they agree not to penalise a site that serves a page to a non-trusted IP using a frame displaying the same content Google is allowed to crawl?

The use of this would be purely for the protection of websites. If so, then we're in business, and cleverer people than I can help overcome some objections. (An idea already springs to mind: a PHP script could be used to serve a frame in the URL for non-trusted IPs, so no redirect. This would work well for any site using a mod_rewrite or, funnily enough, a canonical issue, as you could frame the non-mod-rewrite or canonical URL and block it in robots.txt - preserving tedster's bookmarks.)

I also feel Google and the search engines have their eye off the ball here. However, I do have some sympathy, as it's hard enough keeping the buggers off your site; it must be a nightmare for the search engines to sort out.

blend27

1:43 pm on Oct 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



--- whack-a-mole with all the thousands of scrapers ---

casua,

It is not as hard as you think - scrapers run in packs, they host in packs and they go diving deep in SERPS in packs as well.

Some things cannot be controlled by a website owner, but for the most part there is a solution to nonsense like proxy hijacking and scraping.

I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts, and I don't need to play whack-a-mole; it also gives me more time to take care of the business that we run.

We run e-commerce sites, blogs, hobby sites, and to some of us it is our life. Plus there is nothing better than having a PLEASANT DAY.

loudspeaker

5:56 pm on Oct 19, 2007 (gmt 0)

10+ Year Member



I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts, and I don't need to play whack-a-mole; it also gives me more time to take care of the business that we run.

I started a separate thread on that and have received only one comment so far - could you maybe share your techniques?

[webmasterworld.com...]

So, you're saying it's easy to block entire blocks of IPs? Based on what? Log analysis or some sort of intelligence?

What else in terms of methods?

[edited by: tedster at 9:03 pm (utc) on April 10, 2008]
[edit reason] fix character-set issue [/edit]

spacecadet2

3:59 am on Apr 10, 2008 (gmt 0)

10+ Year Member



Having read the many posts in this thread, I now believe our Google listing/ranking was hijacked. Our company name had been ranked number one by Google when entered as a search term, and this had been the case for many years. A year ago we found ourselves relegated to the second page; on the first page, what was once our listing and the other listings contained our name in the title and totally unrelated info in the description and link.

We left it like this for a while, thinking it was a Google mistake, but eventually informed our Google rep about it, and several weeks later the problem was solved and everything was back to normal.

My question is this - would Google be able to find out who/what company hijacked our listing? (This problem occurred over a year ago.)

tedster

4:07 am on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not likely if it was a proxy server URL that did it. The main purpose of many proxy set-ups is to offer anonymous browsing. If the problem was triggered by a maliciously placed link to the proxy URL on another website, that leaves a minimal trail - but someone who is up to no good is likely to cover their tracks pretty well.

The rankings hijack is not always done with bad intent, and is quite often an innocent side effect of Google's spidering and the way proxy server urls are set up.

I'm glad to hear you got out of trouble. Have you taken any steps since then to prevent future troubles?

spacecadet2

4:41 pm on Apr 10, 2008 (gmt 0)

10+ Year Member



Can't get into too much detail, but I suspect the opposing side of litigation we're involved in might be the culprit. As a small, relatively low-tech company, we did not recognize the problem as a potential hijacking - we just assumed Google had gone amok.

In terms of protecting ourselves from future attacks - do you recommend the code highlighted in the first post of this thread by Synergy?

Thanks for your response and help.

tedster

9:07 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Those steps can help, but the biggest single step you can take is the forward/reverse verification of googlebot. A rankings hijack happens if googlebot spiders your content through a proxy url. But when that happens the IP address will not be Google's.
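For anyone who wants to see the mechanics of that forward/reverse check, here is a minimal sketch in Python (illustrative only: the trusted hostname suffixes are an assumption based on Google's published crawler naming, and the live check obviously needs DNS access):

```python
import socket

# Assumed trusted reverse-DNS suffixes for Google's crawlers
# (per Google's own guidance, genuine crawl IPs resolve under these).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def is_trusted_hostname(hostname):
    """Pure check: does a reverse-DNS name end in a Google-owned domain?"""
    return hostname is not None and hostname.endswith(TRUSTED_SUFFIXES)

def verify_googlebot(ip):
    """Forward/reverse verification: reverse-resolve the IP, confirm the
    name is under a Google domain, then forward-resolve that name and
    confirm it maps back to the same IP. Any lookup failure = untrusted."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]   # reverse lookup
    except OSError:
        return False
    if not is_trusted_hostname(hostname):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except OSError:
        return False
    return ip in forward_ips
```

A request that claims a Googlebot user agent but fails this check is exactly the proxy (or fake-bot) case described in this thread: the user agent is real, the IP is not Google's.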

spacecadet2

9:39 pm on Apr 10, 2008 (gmt 0)

10+ Year Member



Is there a post you can direct me to with more details? Thanks so much for your help.

Erku

9:45 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi guys, this same thing is happening to us as well.

Is reverse verification something that the host does?

thank you
armen

tedster

9:51 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How to Verify Googlebot and Avoid Rogue Spiders [webmasterworld.com]

That thread is part of the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.

Erku

9:53 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are we talking about verifying only googlebot, or all the other good bots as well, such as MSN, Yahoo, Ask and AOL?

tedster

11:05 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since this is the Google Search forum we're talking about Google. The thread I linked to above mentions that you can use this approach for Yahoo's slurp, but MSNbot did not have the correct set-up for reverse DNS lookup at the time of that thread. It still doesn't [webmasterworld.com].

Hissingsid

7:42 am on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Earlier in this thread someone suggested using <base href="http://www.example.com" />. Before implementing this I thought I would do a bit of reading up on the subject, and spotted this:

"the base URL may be overridden by an HTTP header accompanying the document"

If I were writing a proxy CGI with bad intentions I think I would make sure that I controlled the HTTP header sent.

Cheers

Sid

Erku

12:30 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So what else can we do? My host says they can't support IP verification.

USCountytrader

5:01 pm on Apr 11, 2008 (gmt 0)

10+ Year Member



I've been trying to find a way to detect and prevent these proxy websites, but they seem to have thought of everything.

(1) <base href="http://www.yoursite.com/" />
does not work... they are removing it

(2) RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
does not work... they are forwarding all server variables

(3) Absolute links
does not work... they are rewriting all URLs

To boot, they are replacing all your ads with their own. So not only do you get to host and develop the content for them, but the revenue to help pay for it goes to the proxy website.

The one easy solution I can see for Google (seeing as we are pretty much powerless against this outright theft): because the proxies forward the server variables to prevent us from detecting them, Google could simply decline to index anything where the base URL it is crawling does not match the HTTP_HOST. The proxies would then have to turn off server-variable forwarding to get crawled, in which case we should be able to detect them with HTTP_REFERER and block them on our side.

Just an idea...I might be reaching here
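To make the proposed signal concrete, here is a purely illustrative Python sketch of the comparison being suggested (the function name and the regex-based base-tag extraction are my own for illustration - nothing here is anything Google has confirmed doing):

```python
import re
from urllib.parse import urlparse

def base_host_mismatch(crawl_url, html):
    """Flag a page whose <base href> host differs from the host it was
    crawled from - the mismatch a proxy-hijacked copy would show."""
    m = re.search(r'<base\s+href=["\']([^"\']+)["\']', html, re.IGNORECASE)
    if not m:
        return False  # no base tag, nothing to compare
    base_host = urlparse(m.group(1)).netloc.lower()
    crawl_host = urlparse(crawl_url).netloc.lower()
    return base_host != crawl_host
```

Of course, as noted just above, a proxy that strips or rewrites the base tag defeats this check entirely - which is exactly the cat-and-mouse dynamic this thread keeps running into.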

[edited by: USCountytrader at 5:02 pm (utc) on April 11, 2008]

theBear

1:30 am on Apr 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



335 lines of Perl torture.

Warnings:

No warranty: use at your own risk.

I will not help you figure out how to activate it.

This is a proof-of-concept Apache rewrite map program, and it can totally jam your server if you change it and make a mistake.

This should be considered totally untested despite the fact it has been running on my home server for 10 months and has processed several million real life log file entries.

It saves block and allow state over system restarts.

This was written to run on a *nix system.

It isn't very smart.

Enjoy.

#!/usr/bin/perl
use Socket;
$| = 1; # Turn off output buffering
my $ipstartb = '';
my $ipendb = '';
my $cidrranges = '';
my $ipblocklist = '';
my $ipallow = '';
my $real_ip;
my $hostname;
my $retval;
my $pass1 = "y";
my $logline;
my $host;
my $rest;
my $refer;
my $agent;
my $file;
my $sw;
#
# Open the bad guys list and setup the filter
#

open(FILE, "/etc/apache2/ipblock2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipblocklist{$ip} = "b";
}

#
# Open the good guy list and setup the pass through
#

open(FILE, "/etc/apache2/ipallow2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipallow{$ip} = "a";
}
#
# Open the cidr webhost and bogon list
#

open(FILE, "/etc/apache2/cidrlist");
@cidr_list = <FILE>;
close FILE;

#
# The following commented out code is for testing it allows existing
# log files to be used
#

#open(DB, "</etc/apache2/logyyy") or &cgierr("error in search. unable to open database: logyyy. Reason: $!");
#while (<DB>)
#{
# ($host,$user,$date,$rest)= $_=~m,^([^\s]+)\s+-\s+([^ ]+)\s+\[(.*?)\]\s+(.*),;
# if ($rest)
# {
# ($rtype,$file,$proto,$code,$bytes,$r2)=split(/\s/,$rest,6);
# if ($r2)
# {
# my @Split=split(/\"/,$r2);
# $agent=$Split[3];
# }
# }
#$logline="$agent||$host||$file";
#doit($logline);
#}
#close DB;

#sub doit
#{
while (<STDIN>)
{
chomp;
#my ($agent, $rhostaddr, $url) = split(/\|\|/, $_[0], 3);
my ($agent, $rhostaddr, $url) = split(/#######/, $_, 3);

# got a bad boy send him some special content

if ($ipblocklist{$rhostaddr} eq "b")
{
print "/403.shtml\n";
}
else
{

# got a known good guy send him what he asked for

if ($ipallow{$rhostaddr} eq "a")
{
print "$url\n";
}
else
{

#
# handle the cidr range lists
# note we even cache the range information for subsequent use
#

$sw = "n";
$ipint = unpack("N", pack("C4", split(/\./, $rhostaddr)));
foreach $crange(@cidr_list)
{
if ($sw ne "y")
{
$crange =~ s/\n//g;
if ($cidrranges{$crange} ne "y")
{
($x, $mask) = split( /\//, $crange );
($a,$b,$c,$d) = split( /\./, $x );
$ipstart = &ip2net( $crange );
$ipstartint = unpack("N", pack("C4", split(/\./, $ipstart)));
$size = 2 ** ( 32 - $mask );
$ipend = &int2ip( unpack("N", pack("C4", split(/\./, $ipstart)))+$size );
$ipendint = unpack("N", pack("C4", split(/\./, $ipend)));
$cidrranges{$crange} = "y";
$ipstartb{$crange} = $ipstartint;
$ipendb{$crange} = $ipendint;
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
else
{
$ipstartint = $ipstartb{$crange};
$ipendint = $ipendb{$crange};
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
}
}

#
# Handle noagent requests
#

if ($agent eq "" && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{

#
# Check for some known downloaders
#

if (((index($agent,"lwp-trivial") >= 0) || (index($agent,"Wget") >= 0)) && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{

#
# Handle Google/Media Partners
#
if (((index($agent,"Googlebot/") >= 0) || (index($agent,"Mediapartners-Google/") >= 0)) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if (index($hostname,"googlebot") >= 0)
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr)
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
#
# Handle Slurp
#
if ((index($agent,"Slurp") >= 0) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if ((index($hostname,"inktomisearch.com") >= 0) || (index($hostname,"yahoo.net") >= 0))
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr)
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock2");
print FD "$rhostaddr\n";
close FD;
$ipblocklist{$rhostaddr} = "b";
print "/403.shtml\n";
}
}
else
{
if ((substr($url,0,10) eq "/forbidden") && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/badbotblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{
print "$url\n";
}
}
}
}
}
}

}
}

sub hostname {

my (@bytes, @octets,
$packedaddr,
$raw_addr,
$host_name,
$ip
);

if($_[0] =~ /[a-zA-Z]/g) {
$raw_addr = (gethostbyname($_[0]))[4];
@octets = unpack("C4", $raw_addr);
$host_name = join(".", @octets);
} else {
@bytes = split(/\./, $_[0]);
$packedaddr = pack("C4",@bytes);
$host_name = (gethostbyaddr($packedaddr, 2))[0];
}

return($host_name);
}

sub int2ip {
local($ip) = @_;
return join(".", unpack("C4", pack("N", $ip)));
}

sub ip2net {
local($ip) = @_;
($ip2net, $ip2cidr) = split(/\//, $ip);
return &int2ip(unpack("N", pack("C4", split(/\./, $ip2net))) & ~ ( 2 ** (32 - $ip2cidr) - 1));
}

There done went all of the formatting, watch out for the forum possibly mangling the code.

Cheers,
theBear

chazeo

10:18 pm on Apr 14, 2008 (gmt 0)

10+ Year Member



I would pay for software, similar to anti-virus, that updates its database of known sites/IPs that should be blocked!

[edited by: tedster at 10:32 pm (utc) on April 14, 2008]

This 174 message thread spans 6 pages.