Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

     
1:59 pm on Jun 25, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts: 431
votes: 0


I posted about this in the back room but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as follows:
[scumbagproxy.com...]

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of sites affected is increasing exponentially, and your site could be next.

Take preventative action now by doing the following...

1. Add this to the <head> section of all of your pages:

<base href="http://www.yoursite.com/" />

and if you see an attempted hijack...

2. Block the site via .htaccess:

RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
RewriteRule .* - [F]

3. Block the IP address of the proxy

order allow,deny
deny from 11.22.33.44
allow from all

4. Do your research and file a spam report with Google.
[google.com...]

11:01 am on Oct 14, 2007 (gmt 0)

New User

joined:Oct 14, 2007
posts:27
votes: 0


Not every first time visitor - just the ones who claim to be a search engine spider.

Sorry, but that can only give you a false sense of security. The only thing they need to do to bypass your "protection" is to not pretend they are Google. Whack-a-mole style silliness. The other case (genuine Googlebot visiting your site via a proxy) -- that is pure scraping. And it is only one form of scraping. If you have a big site and want to battle scraping -- well, uh, good luck playing whack-a-mole.

11:09 am on Oct 14, 2007 (gmt 0)

New User

joined:Oct 14, 2007
posts:27
votes: 0


And again, scraping in any form is solely Google's problem. What the hell do some people think? That webmasters will clean up the Google search index after a mess created by other people? How cheeky would that be.

I say, let them be lazy. AltaVista was lazy too. We all know what happened next...

4:29 pm on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Again, this is NOT scraping - this is the real googlebot indexing a website's content through a proxy server.

When that happens, there is no ACTUAL content at the proxy. Your server will see a googlebot user agent, because the request really IS made by googlebot. Proxies do pass on the user agent with the request - that's how they work. But because there's a proxy server in the middle of the request chain, the IP for the GET request belongs to the proxy server instead of belonging to googlebot.
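
That mismatch is exactly what a forward/reverse (double) DNS check catches. As a rough sketch only - the sample IP and the hostname patterns below are illustrative assumptions, not a drop-in solution - the check might look something like this in Perl:

#!/usr/bin/perl
# Rough sketch of a forward/reverse DNS check for a visitor claiming
# to be Googlebot. Illustrative only: the sample IP is made up and the
# hostname patterns are assumptions to adapt to your own setup.
use strict;
use warnings;
use Socket;

sub is_real_googlebot {
    my ($ip) = @_;

    # Reverse lookup: IP -> hostname
    my $packed = inet_aton($ip) or return 0;
    my $name = gethostbyaddr($packed, AF_INET) or return 0;

    # A genuine crawler reverse-resolves into googlebot.com / google.com
    return 0 unless $name =~ /\.(googlebot|google)\.com$/i;

    # Forward lookup: hostname -> IP(s); one of them must match the
    # address that actually made the request
    my (undef, undef, undef, undef, @addrs) = gethostbyname($name);
    return (grep { inet_ntoa($_) eq $ip } @addrs) ? 1 : 0;
}

# A proxy's IP fails this check even though the user agent says "Googlebot"
my $remote = $ENV{'REMOTE_ADDR'} || '66.249.66.1';   # example address only
print is_real_googlebot($remote) ? "verified googlebot\n" : "not googlebot\n";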

casua, I understand your opinion about not wanting to fix what you see as Google's problem. So don't do it. I'm not hoping to change your mind, only to keep the advice clear that other members have given for people who are interested.

7:47 pm on Oct 14, 2007 (gmt 0)

New User

joined:Oct 14, 2007
posts:27
votes: 0


Again, this is NOT scraping

It is scraping (whether intentional or incidental). If a site is in Google's index and that site has your content verbatim, then your content has been, by definition, scraped.

10:50 pm on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Call it what you want, but there is a real and precise difference. When your pages are scraped, the other domain actually hosts your content - maybe exactly as it is, maybe changed slightly, and maybe cut up into bits and shuffled around in some autogenerated fashion.

A proxy server does NOT host your content. The two situations require different treatment and different understanding by those who care about resolving the issue for their own sites. If you do not care, that is your right.

I have been willing to post about this up to now, because I think our discussion can be clarifying for others. As synergy mentioned here early in the thread "It seems a bit difficult for people to wrap their heads around..."

However, this post must be the end of our "it is" - "it is not" argument. Casua, you are free not to agree and to do as you will. But let's not bore the audience any further. Someone reading may have other insights about this thread's topic, which is "how to defend" - and not "should I defend" or "is there really a problem".

Let's both give some space in the thread for other members now, if there's anyone left except you and me.

11:15 pm on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 19, 2004
posts:1939
votes: 0


It is scraping (whether intentional or incidental)

Scraping is lifting content from pages, which this is not. Proxies are not always meant to be malicious.

Why should I spend my time fixing Google's problems?

Quite obviously, this is a thread about people who are experiencing significant problems due to proxy hijacking. If it doesn't involve you or your website, then naturally you might care very little about it.

12:37 am on Oct 15, 2007 (gmt 0)

Full Member

10+ Year Member

joined:Sept 24, 2003
posts:318
votes: 0



Only had to make two modifications:
- replaced the broken ¶ (pipe) in line 5 of the PHP
- inserted <?php before the PHP code and ?> after it

...just wanted to clarify that there are two broken pipes.

5:09 am on Oct 15, 2007 (gmt 0)

Junior Member

joined:Oct 12, 2007
posts:44
votes: 0


I think it's great to spark up this debate, and although I agree with tedster that reverse DNS / forward DNS verification does stop proxy hijacks at present, you can bet there is a way round it. I can think of one straight away, but I don't want to give these guys any ideas.

I also agree about losing natural linking etc. through the method posted that tedster mentioned.

However, let's set that aside, as I think it can be overcome and all your bookmarks kept, tedster. The main issue I would like to see discussed is the idea of serving the page in a frame to protect accessibility (against cloaking) and prevent all forms of automated attack online.

If I'm correct, as I think I am, serving a framed page based on IP protects against any form of hijack (except 302, but I'm assured that one is cured), as well as scrapers and email harvesters; potentially we have a solution that ends automated black hat SEO completely. Now, the scripts are not meant to be a final solution, merely to spark debate. Please, tedster, reconsider before you dismiss this idea, as I think your objections can be solved.

What I need to know is, as far as Google is concerned, will they agree not to penalise a site that serves a page to a non-trusted IP using a frame displaying the same content that Google is allowed to crawl?

The use of this would be purely for the protection of websites. If so, then we're in business, and cleverer people than I can help overcome some objections. (An idea already springs to mind: a PHP script could be used to serve a frame in the URL for non-trusted IPs, so there is no redirect. This would work well for any site using mod_rewrite or, funnily enough, with a canonical issue, as you could frame the non-mod_rewrite or canonical URL and block it in robots.txt, preserving tedster's bookmarks.)

I also feel Google and the search engines have their eye off the ball here. However, I do have some sympathy, as it's hard enough keeping the buggers off your own site; it must be a nightmare for search engines to sort out.

1:43 pm on Oct 19, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1997
votes: 75


--- whack-a-mole with all the thousands of scrapers ---

casua,

It is not as hard as you think - scrapers run in packs, they host in packs, and they go diving deep into the SERPs in packs as well.

Some things cannot be controlled by a website owner, but for the most part there is a solution to nonsense like proxy hijacking and scraping.

I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts and I don't need to play whack-a-mole; it also gives me more time to take care of the businesses that we run.

We run e-commerce sites, blogs, hobby sites, and to some of us it is our life. Plus there is nothing better than having a PLEASANT DAY.
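
The core of checking a visitor's IP against a blocked CIDR range is only a few lines of Perl. A bare-bones sketch, with the ranges and the visitor IP made up purely for illustration:

#!/usr/bin/perl
# Bare-bones check of whether a visitor's IP falls inside a blocked
# CIDR range, using plain pack/unpack arithmetic. The ranges and the
# visitor address below are made-up examples, not a real blocklist.
use strict;
use warnings;

my @blocked_ranges = ('192.0.2.0/24', '198.51.100.0/22');   # example ranges

sub ip_to_int {
    my ($ip) = @_;
    return unpack('N', pack('C4', split /\./, $ip));
}

sub in_cidr {
    my ($ip, $cidr) = @_;
    my ($net, $bits) = split m{/}, $cidr;
    my $mask = $bits == 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return (ip_to_int($ip) & $mask) == (ip_to_int($net) & $mask);
}

my $visitor = '198.51.100.77';   # e.g. REMOTE_ADDR from the request
foreach my $range (@blocked_ranges) {
    print "$visitor is inside $range - block it\n" if in_cidr($visitor, $range);
}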

5:56 pm on Oct 19, 2007 (gmt 0)

Full Member

10+ Year Member

joined:May 31, 2006
posts:268
votes: 0


I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts and I don't need to play whack-a-mole; it also gives me more time to take care of the businesses that we run.

I started a separate thread on that and received only one comment so far - could you maybe share your techniques?

[webmasterworld.com...]

So, you're saying it's easy to block entire blocks of IPs? Based on what? Log analysis or some sort of intelligence?

What else in terms of methods?

[edited by: tedster at 9:03 pm (utc) on April 10, 2008]
[edit reason] fix character-set issue [/edit]

3:59 am on Apr 10, 2008 (gmt 0)

New User

10+ Year Member

joined:Oct 23, 2006
posts:14
votes: 0


Having read the many posts in this thread, I now believe our Google listing/ranking was hijacked. Our company name had ranked number one in Google when entered as a search term, and this had been the case for many years. A year ago we found ourselves relegated to the second page; on the first page, what had once been our listing, along with the other listings there, showed our name in the title but totally unrelated info in the description and link.

We left it like this for a while, thinking it was a Google mistake, but eventually informed our Google rep about it, and several weeks later the problem was solved and everything was back to normal.

My question is this - would Google be able to find out who/what company hijacked our listing? (This problem occurred over a year ago.)

4:07 am on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Not likely if it was a proxy server url that did it. The main purpose of many proxy set-ups is to offer anonymous browsing. If the problem was triggered by a maliciously placed link to the proxy url on another website, that leaves a minimal trail - but someone who is up to no good is likely to cover their tracks pretty well.

The rankings hijack is not always done with bad intent, and is quite often an innocent side effect of Google's spidering and the way proxy server urls are set up.

I'm glad to hear you got out of trouble. Have you taken any steps since then to prevent future troubles?

4:41 pm on Apr 10, 2008 (gmt 0)

New User

10+ Year Member

joined:Oct 23, 2006
posts:14
votes: 0


Can't get into too much detail, but I suspect the opposing side of litigation we're involved in might be the culprit. As a small, relatively low-tech company, we did not recognize the problem as a potential hijacking - we just assumed Google had gone amok.

In terms of protecting ourselves from future attacks - do you recommend the code highlighted in the first post of this thread by Synergy?

Thanks for your response and help.

9:07 pm on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Those steps can help, but the biggest single step you can take is the forward/reverse verification of googlebot. A rankings hijack happens if googlebot spiders your content through a proxy url. But when that happens, the IP address your server sees will not be Google's.

9:39 pm on Apr 10, 2008 (gmt 0)

New User

10+ Year Member

joined:Oct 23, 2006
posts:14
votes: 0


Is there a post you can direct me to with more details? Thanks so much for your help.

9:45 pm on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 26, 2004
posts:1052
votes: 1


Hi guys, this same thing is happening to us as well.

Is reverse verification something that the host does?

thank you
armen

9:51 pm on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


How to Verify Googlebot and Avoid Rogue Spiders [webmasterworld.com]

That thread is part of the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.

9:53 pm on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 26, 2004
posts:1052
votes: 1


Are we talking about verifying only googlebot, or all the other good bots as well, such as MSN, Yahoo, Ask and AOL?

11:05 pm on Apr 10, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Since this is the Google Search forum we're talking about Google. The thread I linked to above mentions that you can use this approach for Yahoo's slurp, but MSNbot did not have the correct set-up for reverse DNS lookup at the time of that thread. It still doesn't [webmasterworld.com].

7:42 am on Apr 11, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 31, 2001
posts: 1357
votes: 0


Earlier in this thread someone suggested using <base href="http://www.example.com" />. Before implementing this I thought I would do a bit of reading up on the subject and spotted this:

"the base URL may be overridden by an HTTP header accompanying the document"

If I were writing a proxy CGI with bad intentions I think I would make sure that I controlled the HTTP header sent.

Cheers

Sid

12:30 pm on Apr 11, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 26, 2004
posts:1052
votes: 1


So what else can we do? My host says they can't support IP verification.

5:01 pm on Apr 11, 2008 (gmt 0)

New User

10+ Year Member

joined:June 8, 2003
posts:24
votes: 0


I've been trying to find a way to detect and prevent these proxy websites, but they seem to have thought of everything.

(1) <base href="http://www.yoursite.com/" />
does not work... they are removing it

(2) RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
does not work... they are forwarding all server variables

(3) Absolute links
does not work... they are rewriting all URLs

To boot, they are replacing all your ads with their own. So not only do you get to host and develop the content for them, but the revenue that helps pay for it goes to the proxy website.

The one easy thing I see that Google could do (seeing as we are pretty much powerless against this outright theft): since they are forwarding the server variables to prevent us from detecting them, Google could simply not index anything where the base URL they are crawling does not match the HTTP_HOST. They would have to turn off server variable forwarding to get crawled, in which case we should then be able to detect them with HTTP_REFERER and block them on our side.

Just an idea...I might be reaching here

[edited by: USCountytrader at 5:02 pm (utc) on April 11, 2008]

1:30 am on Apr 12, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


335 lines of Perl torture.

Warnings:

No warranty: use at your own risk.

I will not help you figure out how to activate it.

This is a proof-of-concept Apache rewrite map program, and it can totally jam your server if you change it and make a mistake.

This should be considered totally untested despite the fact it has been running on my home server for 10 months and has processed several million real life log file entries.

It saves block and allow state over system restarts.

This was written to run on a *nix system.

It isn't very smart.

Enjoy.

#!/usr/bin/perl
use Socket;
$| = 1; # Turn off buffering
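#
# This is written to run as an Apache "RewriteMap prg:" helper: Apache
# writes one lookup key per request to the program's STDIN and reads
# exactly one line back from STDOUT (here either the requested URL, or
# /403.shtml for blocked visitors), which is why output buffering is
# turned off above. The key is expected to carry the user agent, the
# remote IP and the URL joined by a delimiter (see the split on STDIN
# further down).
#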
my %ipstartb;
my %ipendb;
my %cidrranges;
my %ipblocklist;
my %ipallow;
my $real_ip;
my $hostname;
my $retval;
my $pass1 = "y";
my $logline;
my $host;
my $rest;
my $refer;
my $agent;
my $file;
my $sw;
#
# Open the bad guys list and setup the filter
#

open(FILE, "/etc/apache2/ipblock2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipblocklist{$ip} = "b";
}

#
# Open the good guy list and setup the pass through
#

open(FILE, "/etc/apache2/ipallow2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipallow{$ip} = "a";
}
#
# Open the cidr webhost and bogon list
#

open(FILE, "/etc/apache2/cidrlist");
@cidr_list = <FILE>;
close FILE;

#
# The following commented out code is for testing it allows existing
# log files to be used
#

#open(DB, "</etc/apache2/logyyy") or &cgierr("error in search. unable to open database: logyyy. Reason: $!");
#while (<DB>)
#{
# ($host,$user,$date,$rest)= $_=~m,^([^\s]+)\s+-\s+([^ ]+)\s+\[(.*?)\]\s+(.*),;
# if ($rest)
# {
# ($rtype,$file,$proto,$code,$bytes,$r2)=split(/\s/,$rest,6);
# if ($r2)
# {
# my @Split=split(/\"/,$r2);
# $agent=$Split[3];
# }
# }
#$logline="$agent||$host||$file";
#doit($logline);
#}
#close DB;

#sub doit
#{
while (<STDIN>)
{
chomp;
#my ($agent, $rhostaddr, $url) = split(/\|\|/, $_[0], 3);
my ($agent, $rhostaddr, $url) = split(/#######/, $_, 3);

# got a bad boy send him some special content

if ($ipblocklist{$rhostaddr} eq "b")
{
print "/403.shtml\n";
}
else
{

# got a known good guy send him what he asked for

if ($ipallow{$rhostaddr} eq "a")
{
print "$url\n";
}
else
{

#
# handle the cidr rang lists
# note we even cache the range information for subsequent use
#

$sw = "n";
$ipint = unpack("N", pack("C4", split(/\./, $rhostaddr)));
foreach $crange(@cidr_list)
{
if ($sw ne "y")
{
$crange =~ s/\n//g;
if ($cidrranges{$crange} ne "y")
{
($x, $mask) = split( /\//, $crange );
($a,$b,$c,$d) = split( /\./, $x );
$ipstart = &ip2net( $crange );
$ipstartint = unpack("N", pack("C4", split(/\./, $ipstart)));
$size = 2 ** ( 32 - $mask );
$ipend = &int2ip( unpack("N", pack("C4", split(/\./, $ipstart)))+$size );
$ipendint = unpack("N", pack("C4", split(/\./, $ipend)));
$cidrranges{$crange} = "y";
$ipstartb{$crange} = $ipstartint;
$ipendb{$crange} = $ipendint;
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
else
{
$ipstartint = $ipstartb{$crange};
$ipendint = $ipendb{$crange};
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
}
}

#
# Handle noagent requests
#

if ($agent eq "" && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{

#
# Check for some known downloaders
#

if (((index($agent,"lwp-trivial") >= 0) || (index($agent,"Wget") >= 0)) && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{

#
# Handle Google/Media Partners
#
if (((index($agent,"Googlebot/") >= 0) || (index($agent,"Mediapartners-Google/") >= 0)) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if (index($hostname,"googlebot") >= 0)
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr)
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
#
# Handle Slurp
#
if ((index($agent,"Slurp") >= 0) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if ((index($hostname,"inktomisearch.com") >= 0) || (index($hostname,"yahoo.net") >= 0))
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr)
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock2");
print FD "$rhostaddr\n";
close FD;
$ipblocklist{$rhostaddr} = "b";
print "/403.shtml\n";
}
}
else
{
if ((substr($url,0,10) eq "/forbidden") && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/badbotblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{
print "$url\n";
}
}
}
}
}
}

}
}

sub hostname {

my (@bytes, @octets,
$packedaddr,
$raw_addr,
$host_name,
$ip
);

if($_[0] =~ /[a-zA-Z]/g) {
$raw_addr = (gethostbyname($_[0]))[4];
@octets = unpack("C4", $raw_addr);
$host_name = join(".", @octets);
} else {
@bytes = split(/\./, $_[0]);
$packedaddr = pack("C4",@bytes);
$host_name = (gethostbyaddr($packedaddr, 2))[0];
}

return($host_name);
}

sub int2ip {
local($ip) = @_;
return join(".", unpack("C4", pack("N", $ip)));
}

sub ip2net {
local($ip) = @_;
($ip2net, $ip2cidr) = split(/\//, $ip);
return &int2ip(unpack("N", pack("C4", split(/\./, $ip2net))) & ~ ( 2 ** (32 - $ip2cidr) - 1));
}

There done went all of the formatting, watch out for the forum possibly mangling the code.

Cheers,
theBear

10:18 pm on Apr 14, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2005
posts:63
votes: 0


I would pay for software similar to anti-virus that updates its database of known sites/IPs that should be blocked!

[edited by: tedster at 10:32 pm (utc) on April 14, 2008]
