Forum Moderators: Robert Charlton & goodroi
<base href="http://www.yoursite.com/" />
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
order allow,deny
deny from 11.22.33.44
allow from all
Not every first-time visitor - just the ones who claim to be a search engine spider.
Sorry, but that can only give you a false sense of security. The only thing they need to do to bypass your "protection" is to not pretend they are Google. Whack-a-mole style silliness. The other case (a genuine Googlebot visiting your site via a proxy) -- that is pure scraping. And it is only one form of scraping. If you have a big site and want to battle scraping -- well, uh, good luck playing whack-a-mole.
I say, let them be lazy. AltaVista was lazy too. We all know what happened next...
When that happens, there is no ACTUAL content at the proxy. Your server will see a googlebot user agent, because the request really IS made by googlebot. Proxies do pass on the user agent with the request - that's how they work. But because there's a proxy server in the middle of the request chain, the IP for the GET request belongs to the proxy server instead of belonging to googlebot.
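The check being described - the user agent says Googlebot, but the IP belongs to the proxy - can be automated with a forward-confirmed reverse DNS lookup: reverse-resolve the requesting IP, require the hostname to be under googlebot.com, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal illustrative sketch in Python (not anyone's posted code; the resolver functions are injectable parameters, an assumption made here so the logic can be exercised without network access):

```python
import socket

def verify_googlebot(ip,
                     reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                     forward=socket.gethostbyname):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end in .googlebot.com or .google.com.
    3. Forward-resolve that hostname and require it to map back to
       the same IP - a proxy's IP fails at step 2 or 3.
    """
    try:
        host = reverse(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return forward(host) == ip
    except OSError:
        return False
```

A proxy relaying a genuine Googlebot request fails step 2: its own IP reverse-resolves to the proxy's hostname, not to a googlebot.com one, no matter what user agent it forwards.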
casua, I understand your opinion about not wanting to fix what you see as Google's problem. So don't do it. I'm not hoping to change your mind, only to keep the advice clear that other members have given for people who are interested.
Again, this is NOT scraping
It is scraping (whether intentional or incidental). If a site is in Google's index and that site has your content verbatim, then your content has been, by definition, scraped.
A proxy server does NOT host your content. The two situations require different treatment and different understanding by those who care about resolving the issue for their own sites. If you do not care, that is your right.
I have been willing to post about this up to now, because I think our discussion can be clarifying for others. As synergy mentioned here early in the thread "It seems a bit difficult for people to wrap their heads around..."
However, this post must be the end of our "it is" - "it is not" argument. Casua, you are free not to agree and to do as you will. But let's not bore the audience any further. Someone reading may have other insights about this thread's topic, which is "how to defend" - and not "should I defend" or "is there really a problem".
Let's both give some space in the thread for other members now, if there's anyone left except you and me.
It is scraping (whether intentional or incidental)
Scraping is lifting content from pages, which this is not. Proxies are not always meant to be malicious.
Why should I spend my time fixing Google's problems?
Quite obviously, this is a thread about people who are experiencing significant problems due to proxy hijacking. If it doesn't involve you or your website, then naturally you might care very little about it.
I also agree about losing natural linking etc. through the posted method that tedster mentioned.
However, let's set that aside, as I think it can be overcome and all your bookmarks kept, Tedster. The main issue I would like to see discussed is the idea of serving the page in a frame to protect accessibility (against cloaking) and prevent all forms of automated attack online.
If I'm correct, as I think I am, serving a framed page based on IP protects against any form of hijack (except the 302, but I am assured that one is cured), scrapers and email harvesters; potentially we have a solution that ends automated black hat SEO completely. Now, the scripts are not meant to be a final solution, merely to spark debate. Please, Tedster, reconsider before you dismiss this idea, as I think your objections can be solved.
What I need to know is, as far as Google is concerned, will they agree not to penalise a site that serves a page to a non-trusted IP using a frame displaying the same content Google was allowed to crawl?
The use of this would be purely for the protection of websites. If so, then we're in business, and cleverer people than I can help overcome some objections. (An idea already springs to mind: a PHP script could be used to serve a frame in the URL for non-trusted IPs, so no redirect, and this would work well for any site using a mod_rewrite or, funnily enough, a canonical issue, as you could frame the non-mod_rewrite or canonical URL and block it in robots.txt - preserving Tedster's bookmarks.)
I also feel Google and the search engines have their eye off the ball here. However, I do have some sympathy, as it's hard enough keeping the buggers off your own site; it must be a nightmare for search engines to sort out.
casua,
It is not as hard as you think - scrapers run in packs, they host in packs and they go diving deep into the SERPs in packs as well.
Some things cannot be controlled by a website owner, but for the most part there is a solution to nonsense like proxy hijacking and scraping.
I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts, and I don't need to play whack-a-mole; it also gives me more time to take care of the businesses that we run.
We run e-commerce sites, blogs and hobby sites, and to some of us it is our life. Plus there is nothing better than having a PLEASANT DAY.
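The "packs" observation amounts to blocking whole hosting-company and bogon CIDR ranges rather than individual addresses - the same approach the Perl script later in this thread takes with its cidrlist file. A minimal sketch of the range test using Python's ipaddress module (illustrative only; the ranges shown are documentation placeholders, not a real blocklist):

```python
import ipaddress

# Placeholder ranges - a real list would hold known hosting / bogon blocks.
BLOCKED_RANGES = [ipaddress.ip_network(cidr)
                  for cidr in ("203.0.113.0/24", "198.51.100.0/25")]

def is_blocked(ip: str) -> bool:
    """True if the address falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

One entry per /24 or larger can retire hundreds of individual whack-a-mole IP bans at once.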
I am not saying I have a perfect script to stop it all, but it does prevent 99% of attempts, and I don't need to play whack-a-mole; it also gives me more time to take care of the businesses that we run.
I started a separate thread on that and received only one comment so far - could you maybe share your techniques?
[webmasterworld.com...]
So, you're saying it's easy to block entire ranges of IPs? Based on what? Log analysis or some sort of intelligence?
What else in terms of methods?
[edited by: tedster at 9:03 pm (utc) on April 10, 2008]
[edit reason] fix character-set issue [/edit]
We left it like this for a while, thinking it was a Google mistake, but eventually informed our Google rep about it, and several weeks later the problem was solved and everything was back to normal.
My question is this - would Google be able to find out who/what company hijacked our listing? (This problem occurred over a year ago.)
The rankings hijack is not always done with bad intent, and is quite often an innocent side effect of Google's spidering and the way proxy server urls are set up.
I'm glad to hear you got out of trouble. Have you taken any steps since then to prevent future troubles?
In terms of protecting ourselves from future attacks - do you recommend the code highlighted in the first post of this thread by Synergy?
Thanks for your response and help.
That thread is part of the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.
"the base URL may be overridden by an HTTP header accompanying the document"
If I were writing a proxy CGI with bad intentions I think I would make sure that I controlled the HTTP header sent.
Cheers
Sid
(1) <base href="http://www.yoursite.com/" />
does not work..they are removing it
(2) RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
does not work..they are forwarding all server variables
(3) Absolute links
does not work..they are rewriting all URLs
To boot, they are replacing all your ads with their own. So not only do you get to host & develop the content for them, but the revenue to help pay for it goes to the proxy website.
The one easy solution that I see Google could implement (seeing we are pretty much powerless against this outright theft): since the proxies forward the server variables to prevent us from detecting them, Google could simply refuse to index anything where the base URL being crawled does not match the HTTP_HOST. The proxies would then have to turn off server variable forwarding to get crawled, in which case we should be able to detect them with HTTP_REFERER and block them on our side.
Just an idea...I might be reaching here
[edited by: USCountytrader at 5:02 pm (utc) on April 11, 2008]
Warnings:
No warranty: use at your own risk.
I will not help you figure out how to activate it.
This is a proof-of-concept Apache rewrite map program, and it can totally jam your server if you change it and make a mistake.
This should be considered totally untested despite the fact it has been running on my home server for 10 months and has processed several million real life log file entries.
It saves block and allow state over system restarts.
This was written to run on a *nix system.
It isn't very smart.
Enjoy.
#!/usr/bin/perl
use Socket;
$| = 1; # Turn off output buffering
my %ipstartb;
my %ipendb;
my %cidrranges;
my %ipblocklist;
my %ipallow;
my $real_ip;
my $hostname;
my $retval;
my $pass1 = "y";
my $logline;
my $host;
my $rest;
my $refer;
my $agent;
my $file;
my $sw;
#
# Open the bad guys list and setup the filter
#
open(FILE, "/etc/apache2/ipblock2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipblocklist{$ip} = "b";
}
#
# Open the good guy list and setup the pass through
#
open(FILE, "/etc/apache2/ipallow2");
@raw_data = <FILE>;
close FILE;
foreach $ip(@raw_data)
{
chop($ip);
$ipallow{$ip} = "a";
}
#
# Open the cidr webhost and bogon list
#
open(FILE, "/etc/apache2/cidrlist");
@cidr_list = <FILE>;
close FILE;
#
# The following commented out code is for testing it allows existing
# log files to be used
#
#open(DB, "</etc/apache2/logyyy") or &cgierr("error in search. unable to open database: logyyy. Reason: $!");
#while (<DB>)
#{
# ($host,$user,$date,$rest)= $_=~m,^([^\s]+)\s+-\s+([^ ]+)\s+\[(.*?)\]\s+(.*),;
# if ($rest)
# {
# ($rtype,$file,$proto,$code,$bytes,$r2)=split(/\s/,$rest,6);
# if ($r2)
# {
# my @Split=split(/\"/,$r2);
# $agent=$Split[3];
# }
# }
#$logline="$agent||$host||$file";
#doit($logline);
#}
#close DB;
#sub doit
#{
while (<STDIN>)
{
chomp;
#my ($agent, $rhostaddr, $url) = split(/\|\|/, $_[0], 3);
my ($agent, $rhostaddr, $url) = split(/#######/, $_, 3);
# got a bad boy send him some special content
if ($ipblocklist{$rhostaddr} eq "b")
{
print "/403.shtml\n";
}
else
{
# got a known good guy send him what he asked for
if ($ipallow{$rhostaddr} eq "a")
{
print "$url\n";
}
else
{
#
# handle the CIDR range lists
# note we even cache the range information for subsequent use
#
$sw = "n";
$ipint = unpack("N", pack("C4", split(/\./, $rhostaddr)));
foreach $crange(@cidr_list)
{
if ($sw ne "y")
{
$crange =~ s/\n//g;
if ($cidrranges{$crange} ne "y")
{
($x, $mask) = split( /\//, $crange );
($a,$b,$c,$d) = split( /\./, $x );
$ipstart = &ip2net( $crange );
$ipstartint = unpack("N", pack("C4", split(/\./, $ipstart)));
$size = 2 ** ( 32 - $mask );
$ipend = &int2ip( unpack("N", pack("C4", split(/\./, $ipstart)))+$size );
$ipendint = unpack("N", pack("C4", split(/\./, $ipend)));
$cidrranges{$crange} = "y";
$ipstartb{$crange} = $ipstartint;
$ipendb{$crange} = $ipendint;
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
else
{
$ipstartint = $ipstartb{$crange};
$ipendint = $ipendb{$crange};
if( ($ipint >= $ipstartint) && ($ipint < $ipendint) )
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/scrappersblock2");
print FD "$rhostaddr\n";
close FD;
$sw = "y";
print "/403.shtml\n";
}
}
}
}
#
# Handle noagent requests
#
if ($agent eq "" && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{
#
# Check for some known downloaders
#
if (((index($agent,"lwp-trivial") >= 0) || (index($agent,"Wget") >= 0)) && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/agentblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{
#
# Handle Google/Media Partners
#
if (((index($agent,"Googlebot/") >= 0) || (index($agent,"Mediapartners-Google/") >= 0)) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if (index($hostname,"googlebot") >= 0)
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr) # 'eq', not '==': numeric compare would truncate the dotted quads
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakegoogleblock2");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
#
# Handle Slurp
#
if ((index($agent,"Slurp") >= 0) && $sw ne "y")
{
$hostname = hostname($rhostaddr);
if ((index($hostname,"inktomisearch.com") >= 0) || (index($hostname,"yahoo.net") >= 0))
{
$real_ip = inet_ntoa(inet_aton($hostname));
if($real_ip eq $rhostaddr) # 'eq', not '==': numeric compare would truncate the dotted quads
{
$ipallow{$rhostaddr} = "a";
open (FD, ">>/etc/apache2/ipallow2");
print FD "$rhostaddr\n";
close FD;
print "$url\n";
else
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
}
else
{
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/fakeyahooblock2");
print FD "$rhostaddr\n";
close FD;
$ipblocklist{$rhostaddr} = "b";
print "/403.shtml\n";
}
}
else
{
if ((substr($url,0,10) eq "/forbidden") && $sw ne "y")
{
$ipblocklist{$rhostaddr} = "b";
open (FD, ">>/etc/apache2/ipblock2");
print FD "$rhostaddr\n";
close FD;
open (FD, ">>/etc/apache2/badbotblock1");
print FD "$rhostaddr\n";
close FD;
print "/403.shtml\n";
}
else
{
print "$url\n";
}
}
}
}
}
}
}
}
sub hostname {
my (@bytes, @octets,
$packedaddr,
$raw_addr,
$host_name,
$ip
);
if($_[0] =~ /[a-zA-Z]/g) {
$raw_addr = (gethostbyname($_[0]))[4];
@octets = unpack("C4", $raw_addr);
$host_name = join(".", @octets);
} else {
@bytes = split(/\./, $_[0]);
$packedaddr = pack("C4",@bytes);
$host_name = (gethostbyaddr($packedaddr, 2))[0];
}
return($host_name);
}
sub int2ip {
local($ip) = @_;
return join(".", unpack("C4", pack("N", $ip)));
}
sub ip2net {
local($ip) = @_;
($ip2net, $ip2cidr) = split(/\//, $ip);
return &int2ip(unpack("N", pack("C4", split(/\./, $ip2net))) & ~ ( 2 ** (32 - $ip2cidr) - 1));
}
There done went all of the formatting, watch out for the forum possibly mangling the code.
Cheers,
theBear
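An aside for readers untangling the bit arithmetic in theBear's ip2net sub above: it converts the address to a 32-bit integer and clears the host bits, so every address in a range maps to the same network base. The same computation written out in Python (illustrative only, not part of the posted script):

```python
import struct
import socket

def ip2net(cidr: str) -> str:
    """Return the network base address of a CIDR range, mirroring the
    Perl ip2net sub: pack the IP into a 32-bit int, then mask off
    the (32 - prefix) host bits."""
    ip, bits = cidr.split("/")
    as_int = struct.unpack("!I", socket.inet_aton(ip))[0]
    mask = ~((1 << (32 - int(bits))) - 1) & 0xFFFFFFFF
    return socket.inet_ntoa(struct.pack("!I", as_int & mask))
```

For example, any address in 203.0.113.0/24 masks down to the base 203.0.113.0, which is what lets the script cache one start/end pair per range instead of recomputing it per visitor.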