Forum Moderators: Robert Charlton & goodroi


Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

         

synergy

1:59 pm on Jun 25, 2007 (gmt 0)

10+ Year Member



I posted about this in the back room but I think this needs to be brought into public view. This is happening right now, and it could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such:
https://www.scumbagproxy.com/cgi-bin/nph-ssl.cgi/000100A/http/www.mysite.com

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of affected sites is increasing exponentially, and your site could be next.

Take preventative action now by doing the following...

1. Add this to all of your headers:
<base href="http://www.yoursite.com/" />


and if you see an attempted hijack...

2. Block the site via .htaccess:
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]


3. Block the IP address of the proxy
order allow,deny
deny from 11.22.33.44
allow from all


4. Do your research and file a spam report with Google.
[google.com...]

Patrick Taylor

4:40 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For what it's worth, a PHP solution allows data to be easily collected and stored somewhere.

Hobbs

4:47 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just tested the posted reversedns.php and it works
using Bill's
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

in my .htaccess file

Only had to make 2 modifications:
- replaced the broken ¦ in line 5 of the php
- inserted <?php before the php code and ?> after it

2 observations:
visiting the test directory as Googlebot gives me an empty white page, not my expected 403 Forbidden

This php checks only for googlebot and the MSN bot, in these 2 lines:

if(stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {

if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {

can we have a full code that checks for slurp as well?
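One way to extend the posted check to Slurp: a sketch only, assuming the same structure as the posted reversedns.php, with Slurp expected to reverse-resolve into .crawl.yahoo.net per Yahoo's published guidance.

```php
<?php
// Sketch extending the posted reversedns.php to also cover Yahoo Slurp.
// Hostname patterns are assumptions based on each engine's published guidance.

function claims_to_be_crawler($ua) {
    return stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'slurp');
}

function hostname_is_crawler($hostname) {
    return preg_match("/\.googlebot\.com$/", $hostname)
        || preg_match("/\.crawl\.yahoo\.net$/", $hostname)
        || preg_match("/search\.live\.com$/", $hostname);
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (claims_to_be_crawler($ua)) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip);
    // Forward-confirm: the claimed crawler hostname must resolve back to the same IP.
    if (!hostname_is_crawler($hostname) || gethostbyname($hostname) != $ip) {
        header('HTTP/1.1 403 Forbidden');
        exit('403 Forbidden');
    }
}
?>
```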

Patrick Taylor

6:09 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Firefox and Opera show a blank page, although they do get the right 403 header: HTTP/1.x 403 Forbidden.

Hobbs

6:17 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



True, I just checked and found the 403 entry in the server logs.

g1smd

6:35 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



To be clear, it is the HTTP response code in the HTTP headers that is most important, not what the visible on-page error message might say. In fact, the visible page "content" is irrelevant.

Always use an HTTP/1.1 header checker to verify what the response code really is.

Philosopher

6:43 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe I'm missing something. If the proxy hijacking is, at least in some cases, being done on purpose, the rDNS solution doesn't seem as if it will work.

Many proxies will either change the UA or delete it entirely during the request. Wouldn't this defeat the use of rDNS, as the UA wouldn't be that of G?

theBear

8:25 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First things first.

incrediBill's full system is multilevel.

What you have in this thread works for any proxy that is a real proxy and for actual search engine spoofers; a lot of what is out there are not true proxies.

That is why I posted the access log entries for one form (there are several) of just one of the many things that are out there.

The last I knew, Bill also maintains a history of accesses, blocks any agent that accesses robots.txt and isn't an allowed search engine robot, and marks the IP address as not allowed.

He also does other nifty things as well.

Basically he has become very aggressive in handling access to his site. From his comments on visitor numbers, I'd say he is doing something right.

His is a defense in depth.

old_expat

1:57 am on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just tested posted reversedns.php and it works
using Bill's
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

Sorry for the dumb question, but will this work even though many of my pages are .htm rather than .php?

If not, what would be the best alternative?

incrediBILL

6:50 am on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry for the dumb question, but will this work even though many of my pages are .htm rather than .php?

Yes, it will work, that's why there's a ".htm" in the command "AddType application/x-httpd-php .html .htm .txt"

old_expat

10:39 am on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, incrediBILL.

old_expat

1:41 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Okay, another dumb question. When I tried to look at my site after adding the code to .htaccess, I got one of those "What application do you want to open .." dialogs.

Is my .htaccess supposed to look like:

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"

or

#AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"

or some other?

I have that php script above all the html tags in a file named 'reversedns.php'

theBear

2:23 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No '#': that marks the line as a comment. And the value is not a reference to a page on your site but to a file, so this is wrong:

"http://www.mywebsite.com/reversedns.php"

it should be of this form:

"/my_full_account_path/httpdocs/reversedns.php"

What goes before the reversedns.php will be something like:

/home/oldexpat/public_html/ which is an absolute reference within the file system on your server.

The actual name varies.

It should also go without saying that php must be available and active on the server which isn't always the case.

Put the following script in your root html folder on the server as phpinfoall.php and request http://www.example.com/phpinfoall.php in a browser of your choice.


<?php

// Show all information, defaults to INFO_ALL
phpinfo();

?>

Correct anything that prevents you from getting a nicely formatted but very long page back.

[edited by: theBear at 2:31 pm (utc) on July 3, 2007]

frakilk

5:42 pm on Jul 3, 2007 (gmt 0)

10+ Year Member



The reversedns.php solution should do the trick for proxies spoofing as GoogleBot and co. But what's to stop a proxy just indexing your site as some other common user agent, Mozilla Firefox for example? Am I missing something?

BTW this type of thread is why I love WebmasterWorld! Thanks for all the great info.

theBear

6:02 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think that case has been covered. It, sorta like, doesn't.

That is why I mentioned that Bill uses more than one method to detect any form of scraping including the cases like you are talking about.

Does what he uses get them all? I'd wager he gets a few oceans' worth of the sobs.

Bill is one old hand at using the right bait.

[edited by: theBear at 6:03 pm (utc) on July 3, 2007]

Clark

7:13 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I don't like about this is how hard it is to test your code.

Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.

With webmaster central, why doesn't Google communicate with the webmaster? Like give us a unique code to display in our footer or somehow in the page. Then the scrapers will be popping up like a xmas tree because they will have those unique words all over the place.

And why can't they also use that same code in their useragent when sending googlebot over?

Example:

Webmasterworld gets a unique key: abc123.
Stick that in the page somewhere. If content that was detected @ webmasterworld is duplicated elsewhere including that unique keyword, boom, they are scrapers, kill them in the serps.

Then whenever googlebot visits webmasterworld, they will also append abc123 to the user agent: googlebot-abc123. Now anytime a robot comes crawling and says it's googlebot but doesn't have the unique string: boom, you kill it in its tracks.

incrediBILL

7:27 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But what's to stop a proxy just indexing your site as some other common user agent - Mozilla Firefox for example?

A proxy is a proxy, not a spider. That doesn't mean someone couldn't run a spider off the same server, but that would be a spider and not a proxy.

Of course someone other than Googlebot or Slurp could spider through a proxy and the reverse DNS code wouldn't catch them. Then again, they aren't causing your site to be hijacked in Google either, which is all we were trying to fix here.

FYI, I log all accesses of Googlebot, etc. via proxy sites and then block others from doing the same. The fact that they entice Googlebot to crawl via their site is what gets them blocked from my server in the first place.

Gotta love it!

HamletBatista

8:27 pm on Jul 3, 2007 (gmt 0)

10+ Year Member



incrediBILL,

I like the reverse-forward dns validation solution. Email servers have been doing this for a while to identify spam SMTP servers.

The problem is this: a hijacker can modify the web proxy code to alter the user agent and return some other user agent when it should return Googlebot. That would bypass user agent based detection. It's not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.

IP address detection will need to work similarly to current email spam detection: they use honeynets to trap the offending IPs and maintain a public database of IPs that is accessible via DNS. We would need volunteers to maintain DNS databases of hijacking proxies' IP addresses.
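Querying such a DNS-based blocklist is mechanically identical to what mail servers already do. A minimal sketch; "proxies.dnsbl.example.org" is a hypothetical zone name, since no such public proxy-hijack blocklist exists:

```php
<?php
// Sketch of a DNSBL-style lookup, borrowed from e-mail spam filtering.
// The zone name is a placeholder for a volunteer-maintained blocklist.

// DNSBLs are queried with the IP's octets reversed:
// 11.22.33.44 becomes 44.33.22.11.proxies.dnsbl.example.org
function dnsbl_query_name($ip, $zone) {
    return implode('.', array_reverse(explode('.', $ip))) . '.' . $zone;
}

function ip_is_listed($ip, $zone = 'proxies.dnsbl.example.org') {
    // An A record under the zone means "listed"
    return checkdnsrr(dnsbl_query_name($ip, $zone), 'A');
}
?>
```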

Ideally search engines will find a bulletproof solution.

incrediBILL

9:18 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem is this: a hijacker can modify the web proxy code to alter the user agent and return some other user agent when it should return Googlebot. That would bypass user agent based detection. It's not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.

Some proxy sites always filter the user agent so it's already happening. The problem is that the solutions to stop that are way more complicated so you use the simplest band-aids you can to stop the most problems for the most people until a better solution becomes available that is easily accessible for the masses.

I block entire IP ranges of data centers, which is where many proxies and scrapers are hosted, but there are still exceptions.
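In .htaccess that kind of range block might look like the following. A sketch only: 192.0.2.0/24 is a documentation range standing in for whatever data-center netblock you've identified.

```apache
# Block an entire data-center range (example range, not a real proxy host).
order allow,deny
deny from 192.0.2.0/24
allow from all
```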

My web site security has many layers, far too many to discuss in this thread, but I continue to use them all. Just because some traps may sometimes be defeated doesn't mean they are always defeated so you let the simpletons get trapped in the simplest security while you focus your ongoing efforts on the more advanced internet scoundrels.

[edited by: incrediBILL at 9:20 pm (utc) on July 3, 2007]

Patrick Taylor

9:57 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I don't like about this is how hard it is to test your code.

Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.

It's easy to test the code (using the PHP script method). You can first of all run the script by itself to check for errors. Then you can visit your page in Firefox surfing as Googlebot (using the User Agent Switcher extension) to check the 403. And then you can place a PHP email script in the footer of your pages so that you receive an email each time Googlebot visits a page. If Googlebot gets as far as the footer, then you're fine, and you can delete the email script.
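The footer email script can be as small as this. A sketch of the test Patrick describes; the address and wording are placeholders:

```php
<?php
// Hypothetical footer snippet: mail yourself whenever a visitor claiming to
// be Googlebot actually reaches the footer (i.e. was not blocked higher up).
function should_notify($ua) {
    return stristr($ua, 'googlebot') !== false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (should_notify($ua)) {
    mail('you@example.com',
         'Googlebot reached ' . $_SERVER['REQUEST_URI'],
         'IP: ' . $_SERVER['REMOTE_ADDR'] . ' UA: ' . $ua);
}
?>
```

Once the emails confirm the real Googlebot is getting through, delete the snippet.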

If you do this, don't forget to switch back to Firefox default User Agent before you visit these forums, or you'll risk getting banned from WebmasterWorld (so I've heard).

optimist

10:41 pm on Jul 3, 2007 (gmt 0)

10+ Year Member



AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

Nice idea, but it does not work if you use XBitHack On.

Any other ideas on implementing this?

frakilk

11:17 pm on Jul 4, 2007 (gmt 0)

10+ Year Member



Just a quick note on the reversedns.php script, I've had a few hits from MSNBot that resolve back to a phx.gbl domain and thus will get served back a 403 by the script. It seems phx.gbl is a fake domain that Microsoft uses for god knows what reason. Has anyone noticed this before?

theBear

1:39 am on Jul 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



optimist,

What happens when you have the execute bit hack activated?

optimist

3:35 am on Jul 5, 2007 (gmt 0)

10+ Year Member



My includes no longer work when switching the file to make html process as php.

I'm using HTML includes: <!--#include file="filename.inc" --> all files are set at chmod 755 and using XBitHack On - to eliminate the use of shtml so filenames remain .html. (Basically this allows me to run SSI without using shtml extensions).

I believe there was a reason I used SSI instead of PHP originally - not sure why as it was a while ago.

Gede

4:17 am on Jul 5, 2007 (gmt 0)

10+ Year Member



And here we are.

Trying to find solutions for a problem that would be so easy to fix by Google.

Webmasters are talking about the negative effects for a long long time.

Proxies are sooo easy to recognize, doing things they shouldn't do.

Why would Google NOT do anything about the proxy phenomena? Why not? I don't understand.

theBear

4:31 am on Jul 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



optimist,

You might want to look into mod_layout, it is a third party BSD licensed Apache add on.

I can't say for certain exactly how it interfaces with the regular handlers; it seems to have its own, so it may do what it does without interfering with your use of the execute bit hack.

I am still of the opinion that this kind of thing is what the rewrite map facility is good for, despite all of its drawbacks: being invoked at start up, being available server wide (which, unless you were self hosting, would be an issue), requiring a fair amount of programming, and being a pain if you make a programming mistake.

[edited by: theBear at 4:32 am (utc) on July 5, 2007]

optimist

5:01 am on Jul 5, 2007 (gmt 0)

10+ Year Member



theBear

Thanks for the info, I think at this point it would be easier to go the route of the .htaccess rDNS check if my host will turn it on.

Otherwise a find and replace of the includes may be a simpler option. I feel this is important enough, since Google will not support people hurt by proxies and fix this, that it's worth the effort to implement.

Clark

6:29 pm on Jul 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does the forward backward lookup thingie work for yahoo too?

tedster

7:38 pm on Jul 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yahoo, yes, (but not MSN or Ask, as far as I know.) See Yahoo blog for details [ysearchblog.com].

Clark

8:53 pm on Jul 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Ted.
I've started playing around with some code to implement some of this and it's harder than it looks.

Some issues:
1. You don't want to ban IPs forever, do you? So you need to build in some system to datestamp latest logins and purge older bans.
2. You only want to run the forward/reverse DNS on IPs once. So you need to lookup each IP to make sure that you don't already have the answer. Well you don't want to run that on a regular table, do you? You want a heap table for that. But there are going to be more IPs than you can handle in a HEAP table. So you'll need a caching mechanism. And you'll need to keep X days in the HEAP table. And put older IPs into HEAP when they appear. You also need a purging mechanism for the HEAP table.
3. There is more than one search engine. While a lot of spoofing may happen with googlebot, how do you handle the others?
4. How do you identify that you're dealing with a spider? You don't want to run this code on every user, do you?
5. If you want to run a captcha like Incredible was talking about, that adds a whole layer of complexity.
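Points 1 and 2 can be handled without a database at all. A sketch only: a file-based cache keyed by IP, with datestamped entries purged on write; Clark's HEAP-table idea would replace the file with a MySQL MEMORY table.

```php
<?php
// Sketch: cache forward/reverse DNS verdicts so each IP is looked up once,
// with datestamped entries purged after $ttl seconds.
function bot_verdict($ip, $cachefile = '/tmp/rdns_cache.dat', $ttl = 604800) {
    $cache = array();
    if (is_file($cachefile)) {
        $raw = unserialize(file_get_contents($cachefile));
        if (is_array($raw)) $cache = $raw;
    }
    if (isset($cache[$ip]) && time() - $cache[$ip]['ts'] < $ttl) {
        return $cache[$ip]['ok'];          // cached verdict, still fresh
    }
    $host = gethostbyaddr($ip);
    $ok = preg_match('/\.googlebot\.com$/', $host)
        && gethostbyname($host) == $ip;    // forward-confirm the hostname
    $cache[$ip] = array('ok' => $ok, 'ts' => time());
    foreach ($cache as $k => $v) {         // purge stale entries (point 1)
        if (time() - $v['ts'] >= $ttl) unset($cache[$k]);
    }
    file_put_contents($cachefile, serialize($cache));
    return $ok;
}
?>
```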

If anyone has insight into these and other issues I missed, I'd love to hear it.

At the end of the day, will doing all this work really improve my site's SE visibility all that much? If so, has anyone run some kind of test to provide metrics?

theBear

1:11 am on Jul 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For metrics you should look at incrediBILL's posting that says from 500k to 900+k / month.