Forum Moderators: Robert Charlton & goodroi
<base href="http://www.yoursite.com/" />
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com
order allow,deny
deny from 11.22.33.44
allow from all
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"
Only had to make 2 modifications:
- replaced the broken ¦ characters in line 5 of the PHP with ||
- inserted <?php before the PHP code and ?> after it
2 observations:
visiting the test directory as Googlebot gives me an empty white page, not my expected 403 Forbidden
This PHP checks for only Googlebot and the MSN bot, in 2 lines:
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
Can we have full code that checks for Slurp as well?
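Not an official script, just a minimal sketch of the posted approach extended with a Slurp check and the forward-confirmation step mentioned later in the thread. The hostname suffixes (googlebot.com, search.live.com, crawl.yahoo.net) are assumptions based on what the engines published at the time; verify them against your own logs before relying on this.

```php
<?php
// Sketch: reverse-DNS check covering Googlebot, msnbot, and Slurp.
// Hostname suffixes below are assumptions; confirm against your logs.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'slurp')) {
    $hostname = gethostbyaddr($ip);       // reverse lookup on the client IP
    $forward  = gethostbyname($hostname); // forward-confirm the hostname
    $valid_host = preg_match("/\.googlebot\.com$/", $hostname)
               || preg_match("/search\.live\.com$/", $hostname)
               || preg_match("/\.crawl\.yahoo\.net$/", $hostname);
    // Reject anything claiming a bot UA whose rDNS doesn't check out
    if (!$valid_host || $forward !== $ip) {
        header('HTTP/1.1 403 Forbidden');
        exit('403 Forbidden');
    }
}
?>
```

The forward lookup matters: anyone can point reverse DNS for their own IP block at googlebot.com, but only Google can make the forward lookup of that hostname resolve back to the same IP.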
Many proxies will either change the UA or delete it entirely during the request. Wouldn't this defeat the use of rDNS, as the UA wouldn't be that of G?
incrediBill's full system is multilevel.
What you have in this thread works for any proxy that is a real proxy and for actual search engine spoofers; a lot of what is out there are not true proxies.
That is why I posted the access log entries for one form (there are several) of just one of the many things that are out there.
The last I knew, Bill maintains a history of accesses as well, blocks all agents that access robots.txt and aren't allowed search engine robots, and marks the IP address as not allowed.
He also does other nifty things as well.
Basically he has become very aggressive in handling access to his site. From his comments on visitor numbers, I'd say he is doing something right.
His is a defense in depth.
I just tested the posted reversedns.php and it works
using Bill's
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"
If not, what would be the best alternative?
Is my .htaccess supposed to look like:
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"
or
#AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"
or some other?
I have that php script above all the html tags in a file named 'reversedns.php'
"http://www.mywebsite.com/reversedns.php"
it should be of this form:
"/my_full_account_path/httpdocs/reversedns.php"
What goes before the reversedns.php will be something like:
/home/oldexpat/public_html/, which is an absolute reference within the file system on your server.
The actual name varies.
It should also go without saying that php must be available and active on the server which isn't always the case.
Put the following script in your root html folder on the server as phpinfoall.php and request http://www.example.com/phpinfoall.php in a browser of your choice.
<?php
// Show all information, defaults to INFO_ALL
phpinfo();
?>
Correct anything that prevents you from getting a nicely formatted but very long page back.
[edited by: theBear at 2:31 pm (utc) on July 3, 2007]
BTW this type of thread is why I love WebmasterWorld! Thanks for all the great info.
That is why I mentioned that Bill uses more than one method to detect any form of scraping including the cases like you are talking about.
Does what he uses get them all? I'd wager he gets a few oceans' worth of the SOBs.
Bill is one old hand at using the right bait.
[edited by: theBear at 6:03 pm (utc) on July 3, 2007]
Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.
With Webmaster Central, why doesn't Google communicate with the webmaster? Like give us a unique code to display in our footer or somewhere in the page. Then the scrapers will be popping up like an xmas tree because they will have those unique words all over the place.
And why can't they also use that same code in their useragent when sending googlebot over?
Example:
Webmasterworld gets a unique key: abc123.
Stick that in the page somewhere. If content that was detected @ webmasterworld is duplicated elsewhere including that unique keyword, boom, they are scrapers, kill them in the serps.
Then whenever Googlebot visits webmasterworld, they will also append abc123 to the user agent: googlebot-abc123. Now anytime a robot comes crawling and says it is googlebot but doesn't have the unique string, boom, you stop it in its tracks.
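This scheme is purely hypothetical (no search engine sends such a token), but the webmaster-side check it would require is trivial. A sketch, with the key abc123 and the googlebot-<key> UA format invented for illustration:

```php
<?php
// Hypothetical check for the per-site key scheme proposed above.
// 'abc123' and the 'googlebot-<key>' UA suffix are invented here;
// no search engine actually sends such a token.
$site_key = 'abc123';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Claims to be Googlebot but lacks our site key: treat as a spoofer
if (stristr($ua, 'googlebot') && !stristr($ua, 'googlebot-' . $site_key)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>
```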
But what's to stop a proxy just indexing your site as some other common user agent - Mozilla Firefox for example?
A proxy is a proxy, not a spider. That doesn't mean someone couldn't run a spider off the same server, but that would be a spider and not a proxy.
Of course someone other than Googlebot or Slurp could spider through a proxy and the reverse DNS code wouldn't catch them. Then again, they aren't causing your site to be hijacked in Google either which is all we were discussing trying to fix.
FYI, I log all accesses of Googlebot, etc. via proxy sites and then block others from doing the same. The fact that they entice Googlebot to crawl via their site is what gets them blocked from my server in the first place.
Gotta love it!
I like the reverse-forward dns validation solution. Email servers have been doing this for a while to identify spam SMTP servers.
The problem is this. A hijacker can modify the web proxy code to alter the user agent and return some other user agent when it should return Googlebot. That would bypass user-agent-based detection. It's not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.
IP address detection will need to work similar to current email spam detection. They use honeynets to trap the offending IPs and maintain a public database of IPs that is accessible via DNS. We will need volunteers to maintain DNS databases of hijacking proxies' IP addresses.
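The DNS-distributed blacklist idea works the same way the mail world's DNSBLs do: reverse the IP's octets and query them as a hostname under the list's zone. A sketch, with dnsbl.example.org standing in for a real blacklist zone (none existed for proxy hijacking at the time):

```php
<?php
// DNSBL-style lookup: reverse the IP's octets and query them under the
// blacklist zone, the way SMTP servers query spam blacklists.
// dnsbl.example.org is a placeholder zone, not a real service.
function ip_is_listed($ip, $zone = 'dnsbl.example.org') {
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    $query    = $reversed . '.' . $zone; // e.g. 44.33.22.11.dnsbl.example.org
    // checkdnsrr returns true only if an A record exists,
    // which by DNSBL convention means the IP is listed
    return checkdnsrr($query, 'A');
}
?>
```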
Ideally search engines will find a bulletproof solution.
The problem is this. A hijacker can modify the web proxy code to alter the user agent and return some other user agent, when it should return Googlebot. That would bypass user agent based detection. Is not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.
Some proxy sites always filter the user agent so it's already happening. The problem is that the solutions to stop that are way more complicated so you use the simplest band-aids you can to stop the most problems for the most people until a better solution becomes available that is easily accessible for the masses.
I block entire IP ranges of data centers, which is where many proxies and scrapers are hosted, but there are still exceptions.
My web site security has many layers, far too many to discuss in this thread, but I continue to use them all. Just because some traps may sometimes be defeated doesn't mean they are always defeated so you let the simpletons get trapped in the simplest security while you focus your ongoing efforts on the more advanced internet scoundrels.
[edited by: incrediBILL at 9:20 pm (utc) on July 3, 2007]
What I don't like about this is how hard it is to test your code. Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.
It's easy to test the code (using the PHP script method). You can first of all run the script by itself to check for errors. Then you can visit your page in Firefox surfing as Googlebot (using the User Agent Switcher extension) to check the 403. And then you can place a PHP email script in the footer of your pages so that you receive an email each time Googlebot visits a page. If Googlebot gets as far as the footer, then you're fine, and you can delete the email script.
If you do this, don't forget to switch back to Firefox default User Agent before you visit these forums, or you'll risk getting banned from WebmasterWorld (so I've heard).
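The footer email test described above can be a one-line include. A sketch, assuming your pages already run through PHP and with you@example.com as a placeholder address:

```php
<?php
// Temporary test snippet for the page footer: mails you whenever a
// request claiming to be Googlebot makes it past the reverse-DNS check
// far enough to render the footer. Remove once you've seen real
// Googlebot hits come through. you@example.com is a placeholder.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'googlebot')) {
    mail('you@example.com',
         'Googlebot reached footer',
         'URI: ' . $_SERVER['REQUEST_URI'] . ' IP: ' . $_SERVER['REMOTE_ADDR']);
}
?>
```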
I'm using HTML includes: <!--#include file="filename.inc" --> all files are set at chmod 755 and using XBitHack On - to eliminate the use of shtml so filenames remain .html. (Basically this allows me to run SSI without using shtml extensions).
I believe there was a reason I used SSI instead of PHP originally - not sure why as it was a while ago.
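For reference, the XBitHack setup described above amounts to something like this in .htaccess, assuming SSI (mod_include) is available and Options overrides are allowed:

```apache
# Parse .html files for SSI directives when their execute bit is set,
# so filenames can stay .html instead of .shtml
Options +Includes
XBitHack on
```

Per-file parsing is then switched on with the execute bit itself, e.g. chmod 755 page.html, which matches the 755 permissions mentioned above.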
Trying to find solutions for a problem that would be so easy to fix by Google.
Webmasters are talking about the negative effects for a long long time.
Proxies are sooo easy to recognize, doing things they shouldn't do.
Why would Google NOT do anything about the proxy phenomena? Why not? I don't understand.
You might want to look into mod_layout, it is a third party BSD licensed Apache add on.
I can't say for certain exactly how it interfaces with the regular handlers; it seems to have its own, so it may do what it does and not interfere with your use of the execute bit hack.
I am still of the opinion that this kind of thing is what the rewrite map facility is good for, despite all of its drawbacks, such as being invoked at start up, being available server wide (which, unless you were self hosting, would be an issue), requiring a fair amount of programming, and being a pain if you make a programming mistake.
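For completeness, the rewrite map approach would look roughly like this in the server config (RewriteMap is not allowed in .htaccess, which is the server-wide drawback mentioned above). The map name and script path are illustrative:

```apache
# httpd.conf only -- RewriteMap cannot go in .htaccess
RewriteEngine on
RewriteMap rdnscheck prg:/usr/local/bin/rdnscheck

# Ask the external program about anything claiming to be Googlebot
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteCond ${rdnscheck:%{REMOTE_ADDR}} =deny
RewriteRule .* - [F]
```

The prg: script runs for the life of the server: it must loop forever, reading one IP per line on stdin, doing the reverse/forward DNS check, and printing allow or deny (unbuffered) for each, which is where the "fair amount of programming" comes in.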
[edited by: theBear at 4:32 am (utc) on July 5, 2007]
Thanks for the info, I think at this point it would be easier to go the route of the .htaccess rDNS check if my host will turn it on.
Otherwise a find and replace of the includes may be a simpler option. I feel this is important enough, since Google will not support people hurt by proxies and fix this, that its worth the effort to implement.
Some issues:
1. You don't want to ban IPs forever, do you? So you need to build in some system to datestamp latest logins and purge older bans.
2. You only want to run the forward/reverse DNS on IPs once. So you need to look up each IP to make sure that you don't already have the answer. Well, you don't want to run that on a regular table, do you? You want a HEAP table for that. But there are going to be more IPs than you can handle in a HEAP table, so you'll need a caching mechanism: keep the last X days in the HEAP table, and move older IPs back into it when they appear again. You also need a purging mechanism for the HEAP table.
3. There is more than one search engine. While a lot of spoofing may happen with googlebot, how do you handle the other search engines?
4. How do you identify that you're dealing with a spider? You don't want to run this code on every user, do you?
5. If you want to run a captcha like Incredible was talking about, that adds a whole layer of complexity.
If anyone has insight into these and other issues I missed, I'd love to hear it.
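On point 2, a minimal version of the lookup cache might look like this: results keyed by IP in a serialized file and expired after a TTL, so each address is resolved at most once per window. A sketch only; the database HEAP table described above scales far better than a flat file.

```php
<?php
// Minimal cache for reverse-DNS results: one gethostbyaddr() per IP,
// verdicts kept in a serialized file and expired after $ttl seconds.
// A MySQL HEAP/MEMORY table, as discussed above, is the real answer;
// this just illustrates the caching step.
function cached_rdns($ip, $cache_file = '/tmp/rdns_cache.ser', $ttl = 86400) {
    $cache = is_file($cache_file)
           ? unserialize(file_get_contents($cache_file))
           : array();
    if (isset($cache[$ip]) && (time() - $cache[$ip]['t']) < $ttl) {
        return $cache[$ip]['host'];      // cache hit: no DNS traffic
    }
    $host = gethostbyaddr($ip);          // the expensive lookup
    $cache[$ip] = array('host' => $host, 't' => time());
    file_put_contents($cache_file, serialize($cache), LOCK_EX);
    return $host;
}
?>
```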
At the end of the day, will doing all this work really improve my site's SE visibility all that much? If so, has anyone run some kind of test to provide metrics?