Forum Moderators: Robert Charlton & goodroi
<base href="http://www.yoursite.com/" />
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
order allow,deny
deny from 11.22.33.44
allow from all
"In the last year while fighting all this nonsense I managed to move up the ranks from only 400K visitors a month to 900K+ (maybe 1M, we'll see how the month ends). This wouldn't have been possible to accomplish if the scrapers and hijacked pages had been left unchecked as I would still be competing against myself in Google, which I was before I went draconian on content access rules, and now it's not a problem."
If I add this to the .htaccess
AddType application/x-httpd-php .txt
And put the reverse DNS lookup at the top of robots.txt, so that it adds any blocked IPs to the .htaccess.
This will save the server from parsing every .htm file on the server as PHP.
Or am I missing something?
It is the real Googlebot we're talking about isn't it? It's being sent thru a proxy server. I suppose the proxy could not request the robots.txt when Googlebot requests it, so my solution wouldn't be 100% effective.
I don't like the idea of turning on php for all htm files. My site gets over 200,000 htm page views a day. I'm worried what will happen if the server suddenly has to parse all these file requests as php. Or is it not a problem?
Actually the block is to trap fake Googlebots as well.
Blocking just a request for robots.txt won't handle that aspect at all.
The fake ones are still going to do what they want even if told not to. Whether the script or proxy would even pass along the 403 (or whatever you return) is another consideration. The safe way is to nail them all.
200,000 page views a day could be a load, but that all depends on what you have for a server and how you have other things set up. I can't answer that question for you.
Because Googlebot always reads robots.txt, surely the PHP code only needs to be in the robots.txt file?
I can see how this can be confusing but Google will most likely NEVER read your robots.txt file via a CGI or PHP proxy server.
If Google is about to crawl:
exampleproxysite.com/nph-page.pl/000000A/http/www.yoursite.com
Google will read robots.txt from here:
exampleproxysite.com/robots.txt
Does that make sense?
Therefore, protecting robots.txt on your site for this scenario is probably a waste of time.
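To make the point above concrete, here is a minimal Python sketch (not anything Google publishes) of how a crawler derives the robots.txt location: it only looks at the host in the URL it is about to fetch, so a site name buried later in the path is never consulted. The function name `robots_url_for` is made up for illustration, and the default `http://` scheme is an assumption because the URLs in the post omit it.

```python
from urllib.parse import urlsplit

def robots_url_for(crawl_url: str) -> str:
    """Return the robots.txt URL a crawler would consult before fetching crawl_url."""
    # The example URLs in the thread omit the scheme; assume http:// so that
    # urlsplit can identify the host part.
    if "://" not in crawl_url:
        crawl_url = "http://" + crawl_url
    parts = urlsplit(crawl_url)
    # robots.txt lives at the root of the host named in the URL, regardless of
    # any other site name encoded further along in the path.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_url_for("exampleproxysite.com/nph-page.pl/000000A/http/www.yoursite.com"))
# prints http://exampleproxysite.com/robots.txt
```

The proxied URL resolves to the proxy's own robots.txt, which is why rules placed in www.yoursite.com/robots.txt are not seen for this crawl.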
Google does not have to read only the root folder for robots.txt files. They can also read inside folders.
So how does that equate to Google actually accessing the robots.txt file on the server being spoofed by the CGI proxy?
It doesn't, so let's not confuse the topic.
[edited by: incrediBILL at 8:39 pm (utc) on July 6, 2007]
Funny, a story recently made the front page of Digg about how using the user-agent switcher in Firefox will let you access "restricted" sites if you mask yourself as Googlebot. This forward/reverse thing will put the kibosh on that.
I'm currently testing a Perl-based RewriteMap version of this kind of block.
It isn't as hard to test as some other forms of blocking are.
I can run the script standalone and feed it information from existing log files.
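For anyone who wants to try the same kind of offline test, here is a rough Python sketch of pulling candidates out of old logs. It is not the Perl script described above. It assumes the Apache "combined" log format, and the names `LOG_LINE` and `claimed_spiders` are made up for illustration. It only finds requests whose user-agent claims to be a spider; deciding whether each claim is genuine is a separate step.

```python
import re

# Assumes the Apache "combined" log format; adjust if your LogFormat differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def claimed_spiders(lines, names=("googlebot", "slurp", "msnbot", "teoma")):
    """Yield (ip, spider_name) for each log line whose user-agent claims to be a spider."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for name in names:
            if name in agent:
                yield match.group("ip"), name
                break

# Two made-up log lines: one claims to be Googlebot, one is an ordinary browser.
sample = [
    '11.22.33.44 - - [06/Jul/2007:20:39:00 +0000] "GET /widgets.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '55.66.77.88 - - [06/Jul/2007:20:39:05 +0000] "GET / HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
]
for ip, name in claimed_spiders(sample):
    print(ip, name)
```

Each (IP, name) pair that comes out can then be fed to whatever validation you are testing, without touching the live server.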
I also downloaded and will be playing with mod_layout a bit just to see how that one could handle this type of thing and not interfere with things like the X bit hack.
It helps to have many ways to do something.
Anyway, removing caching doesn't affect ranking, does it?
A couple more questions about these bots. In your experience, are they mostly targeting specific sites? Or do they follow links at random or on a keyword basis?
I'm not planning to ban any bots. I just want to feed them junk data and junk links. Do they tend to follow the links they are fed?
After looking at well over 2 million requests from old logs I got 62 scrapers based at web hosting services, 21 automated downloaders, 5 downloaders with no user agent, 18 fake Yahoo/Inktomi Slurps, and 19 fake Googlebots.
The stuff covers plenty of ground but most of it is aimed at groups of pages that do well for related keywords.
I have a nice fast rewrite map script that handles requests from allowed known bots at a rate exceeding 180,000 requests a second. The processing rate for banned bots is just slightly slower, and the agent stops are slower still. When it has to classify hosting provider traffic, it can make a determination at a rate of over 7,000 requests a second.
He was saying that his classic "succeed on Google in 12 months" thread got ripped so many times that the rips have been ripped. Does this ring a bell? Could it have been in another thread?
I am not going to post .htaccess code as it's probably clumsy compared to what the experts here can do.
In .htaccess, validate the search bots by reverse DNS as described in previous posts. Remember MSN; this one is hugely important.
Now create a list of trusted IPs in .htaccess, including the user agents of the search bots (they are checked by reverse DNS so can't be spoofed).
For all other IPs, do a global redirect to a file called frame.php.
Remember not to apply the .htaccess rules to this file itself.
Because the redirect can't pass the referrer, this was the only way a friend thought of doing this, but I am sure that if you like the system people can vary it to achieve something better.
So the request to http://www.example.com/widgets.html is redirected if not a search bot or trusted ip to http://www.example.com/frame.php?http://www.example.com/widgets.html
Now the originally requested page is there in the query string for the PHP frame page to work with. On the PHP page called frame.php:
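<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<title>Example</title>
<meta name="author" content="Example">
<meta name="robots" content="noindex,follow">
<meta name="description" content="Example ">
<?php
// HTTP_HOST already carries the "www." prefix, so it is not added again here.
// The value is escaped because REQUEST_URI is supplied by the visitor.
echo '<base href="' . htmlspecialchars('http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], ENT_QUOTES) . '" />';
?>
</head>
<?php
// Rebuild the full URL of this request, then strip the frame.php prefix
// to recover the page that was originally requested.
$referer = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$sitename = 'http://www.example.com/frame.php?';
$target = str_replace($sitename, '', $referer);
// Escape before echoing into attributes so a crafted query string cannot inject markup.
$target = htmlspecialchars($target, ENT_QUOTES);
?>
<frameset>
<frame src="<?php echo $target; ?>" name="">
<noframes>
<body>
<p><a href="<?php echo $target; ?>">Example</a></p>
</body>
</noframes>
</frameset>
</html>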
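One thing worth checking in a setup like this is that frame.php only ever frames pages from your own site, otherwise anyone can make it frame an arbitrary URL. Below is a small Python sketch of the wrap/unwrap logic with that check added. It is an illustration of the idea, not a translation of the page above: the constants `SITE` and `FRAME` and the function names are hypothetical, and `evil.example.net` is a made-up host.

```python
SITE = "http://www.example.com/"   # assumed canonical site root
FRAME = SITE + "frame.php?"        # prefix the redirect puts in front of the page URL

def wrap(url: str) -> str:
    """Build the frame.php URL that an untrusted visitor is redirected to."""
    return FRAME + url

def unwrap(request_url: str):
    """Return the page to show in the frame, or None if it should be refused."""
    if not request_url.startswith(FRAME):
        return None
    target = request_url[len(FRAME):]
    # Only frame pages on our own site, and never frame.php itself.
    if not target.startswith(SITE) or target.startswith(FRAME):
        return None
    return target

framed = wrap("http://www.example.com/widgets.html")
print(framed)
print(unwrap(framed))
print(unwrap(FRAME + "http://evil.example.net/"))
```

The last call returns None because the requested target is not under the site root, so the frame page can send an error instead of framing someone else's content.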
The advantage of this method, I feel, is that there is no longer a need to ban IPs (I am always concerned about banning IPs that may cover a large number of users), only a need to allow them. Those not allowed see the same page in a frame, which preserves accessibility and avoids cloaking the page, as modern screen readers can read a single frame with no problem. This system also stops scrapers and email harvesters as well as proxy hijacks.
I am just worried how long it will take these #*$!s to get round it.
It would be great to get some authoritative comments on this.....
Ok I am off to spread the word........
After looking at well over 2 million requests from old logs I got 62 scrapers based at web hosting services, 21 automated downloaders, 5 downloaders with no user agent, 18 fake Yahoo/Inktomi Slurps, and 19 fake Googlebots.
Is there a database or blacklist of those (with specific IPs)? I am starting to go through my logs (after seeing how much content scraped from my sites is floating around the Internet) and I want to block as many of those as possible. It would be nice to establish a shared database...
If you do ONE THING, make it the reverse-forward DNS check, which stops this. Otherwise you'll be fighting this problem until the day you die: blocking them individually is a total waste of time and a completely false sense of security, as another proxy site will pop up to replace the blocked one the same day. The reverse-forward DNS spider validation is the only proxy blocker you need. Install it and then you can ignore the hijacking problem, as it WILL completely resolve itself in time as all the spiders crawl the proxy a second time and remove your previously hijacked listings or replace them with junk (my personal favorite).
Try it - you'll like it. A lot.
reverse-forward DNS
Why should I slow down every first-time visitor by up to a few seconds by doing double DNS lookups? I say again, it's only Google's problem. They have billions, they have all the resources. They shall fix it.
I say again, it's only Google's problem. They have billions, they have all the resources. They shall fix it.
We're not talking about mere indexing of proxy urls here, we're talking about the proxy urls taking over your rankings. That means YOU can lose income.
We're also not talking about defending against scrapers - just proxy server urls. Should Google fix this? Oh yes. Will I wait around for them to do it? Not on your life.
Why should I slow down every first-time visitor...
Not every first time visitor - just the ones who claim to be a search engine spider. This is from jdMorgan's post, #5 in this thread:
Given an understanding of what a proxy *is* and how it works, the only step really needed is to verify that user-agents claiming to be Googlebot are in fact coming from Google IP addresses, and to deny access to requests that fail this test.
If the purported-Googlebot requests are not coming from Google IP addresses, then one of two things is likely happening:
1) It is a spoofed user-agent, and not really Googlebot.
2) It *is* Googlebot, but it is crawling your site through a proxy.
The latter is how sites get 'proxy hijacked' in the Google SERPs -- Googlebot will see your content on the proxy's domain.
I also suggest checking IPs that claim to be Slurp, MSNbot and Teoma (Ask's crawler). Here's that reference thread again about how to do the checking:
How to verify Googlebot is Googlebot [webmasterworld.com]
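For readers who have not seen the reverse-forward check written out, here is a minimal Python sketch of the idea. It is not the code from the referenced thread. The function name `verify_spider` is made up, and the hostname suffixes `.googlebot.com` and `.google.com` are an assumption on my part that you should confirm against the search engine's own documentation. The two resolver parameters exist only so the logic can be exercised without touching real DNS.

```python
import socket

def verify_spider(ip, allowed_suffixes=(".googlebot.com", ".google.com"),
                  reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True only if ip reverse-resolves to an allowed host name
    AND that host name forward-resolves back to the same ip."""
    try:
        # Step 1: reverse lookup, IP -> host name.
        host = reverse(ip)[0].lower().rstrip(".")
    except OSError:
        return False  # no reverse DNS record: treat as not verified
    # Step 2: the host name must belong to the search engine's domain.
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        # Step 3: forward lookup, host name -> list of IPv4 addresses.
        addresses = forward(host)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among the forward results, otherwise
    # anyone controlling their own reverse DNS could fake the host name.
    return ip in addresses
```

As the quoted post says, this only needs to run for requests whose user-agent claims to be a spider, so ordinary visitors never pay for the two lookups. Caching the result per IP would cut the cost further. Note that `gethostbyname_ex` only returns IPv4 addresses.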