


Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

     
1:59 pm on Jun 25, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts: 431
votes: 0


I posted about this in the back room, but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such:
[scumbagproxy.com...]

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of sites affected is increasing exponentially and your site could be next.

Take preventative action now by doing the following...

1. Add this to the <head> section of all of your pages:

<base href="http://www.yoursite.com/" />

and if you see an attempted hijack...

2. Block the site via .htaccess (a combined sketch of steps 2 and 3 follows the list):

RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

3. Block the IP address of the proxy

order allow,deny
deny from 11.22.33.44
allow from all

4. Do your research and file a spam report with Google.
[google.com...]
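Steps 2 and 3 combined into a single .htaccess block would look something like this (Apache 2.2 syntax; the domain and IP are placeholders, not a real proxy):

RewriteEngine On
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

order allow,deny
deny from 11.22.33.44
allow from all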

4:40 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


For what it's worth, a PHP solution allows data to be easily collected and stored somewhere.
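For example, a minimal sketch of that idea (mine, untested) might append each blocked hit to a flat file before the 403 goes out; the path is a placeholder:

<?php
// Hypothetical logging snippet: record the details of a blocked request
// before sending the 403, so the data can be reviewed later.
$line = sprintf("%s %s \"%s\" \"%s\"\n",
    date('Y-m-d H:i:s'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
    isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-');
file_put_contents('/my_full_account_path/logs/blocked_hits.log', $line, FILE_APPEND);
?>
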
4:47 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member hobbs is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 19, 2004
posts:3056
votes: 5


I just tested the posted reversedns.php and it works, using Bill's

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

in my .htaccess file.

I only had to make two modifications:
- replaced the broken in line 5 of the php
- inserted <?php before the php code and ?> after it

Two observations:
- Visiting the test directory as Googlebot gives me an empty white page, not my expected 403 Forbidden.
- This PHP checks only for Googlebot and msnbot, in these two lines:

if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {

if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {

Can we have full code that checks for Slurp as well?
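Something along these lines, maybe (my own untested sketch, not Bill's actual script; it assumes Slurp hosts resolve back to *.crawl.yahoo.net and msnbot to *.search.live.com):

<?php
// Untested sketch: forward/reverse DNS check for Googlebot, msnbot and Slurp.
// Any request claiming to be one of those bots that fails the reverse lookup,
// the domain check, or the forward confirmation gets a 403.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

if (stristr($ua, 'googlebot') || stristr($ua, 'msnbot') || stristr($ua, 'slurp')) {
    $hostname = gethostbyaddr($ip); // reverse lookup

    $valid_domain = preg_match("/\.googlebot\.com$/i", $hostname)
                 || preg_match("/\.search\.live\.com$/i", $hostname)
                 || preg_match("/\.crawl\.yahoo\.net$/i", $hostname);

    // forward-confirm: the claimed hostname must resolve back to the same IP
    if (!$valid_domain || gethostbyname($hostname) != $ip) {
        header('HTTP/1.1 403 Forbidden');
        echo '403 Forbidden';
        exit;
    }
}
?>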

6:09 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Firefox and Opera show a blank page, although they do get the right 403 header: HTTP/1.x 403 Forbidden.
6:17 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member hobbs is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 19, 2004
posts:3056
votes: 5


True, I just checked and found the 403 entry in the server logs.
6:35 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


To be clear, it is the HTTP response code in the HTTP headers that is most important, not what the visible on-page error message might say. In fact, the visible page "content" is irrelevant.

Always use an HTTP/1.1 header checker to actually check what the response code really is.
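If you want to script it, something like this would do (a sketch, assuming PHP with allow_url_fopen enabled; the URL is a placeholder):

<?php
// Request a page while claiming to be Googlebot and dump the raw response
// headers; the first line is the status, e.g. HTTP/1.1 403 Forbidden.
ini_set('user_agent', 'Googlebot/2.1 (+http://www.google.com/bot.html)');
$headers = get_headers('http://www.example.com/some-page.html');
foreach ($headers as $header) {
    echo $header . "\n";
}
?>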

6:43 pm on July 2, 2007 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 28, 2002
posts:994
votes: 2


Maybe I'm missing something. If the proxy hijacking is, at least in some cases, being done on purpose, the rDNS solution doesn't seem as if it will work.

Many proxies will either change the UA or delete it entirely during the request. Wouldn't this defeat the use of rDNS, since the UA wouldn't be that of G?

8:25 pm on July 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


First things first.

incrediBill's full system is multilevel.

What you have in this thread works for any proxy that is a real proxy and for actual search engine spoofers; a lot of what is out there are not true proxies.

That is why I posted the access log entries for one form (there are several) of just one of the many things that are out there.

The last I knew, Bill maintains a history of accesses as well, blocks all agents that access robots.txt and aren't allowed search engine robots, and marks the IP address as not allowed.

He also does other nifty things as well.

Basically he has become very aggressive in handling access to his site. From his comments on visitor numbers, I'd say he is doing something right.

His is a defense in depth.

1:57 am on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2003
posts:705
votes: 0


I just tested the posted reversedns.php and it works, using Bill's
AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

Sorry for the dumb question, but will this work even though many of my pages are .htm rather than .php?

If not, what would be the best alternative?

6:50 am on July 3, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Sorry for the dumb question, but will this work even though many of my pages are .htm rather than .php?

Yes, it will work; that's why there's a ".htm" in the command "AddType application/x-httpd-php .html .htm .txt".

10:39 am on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2003
posts:705
votes: 0


Thanks, incrediBILL.
1:41 pm on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2003
posts:705
votes: 0


Okay, another dumb question. When I tried to look at my site after adding the code to .htaccess, I got one of those "What application do you want to open..." dialogs.

Is my .htaccess supposed to look like:

AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"

or

#AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "http://www.mywebsite.com/reversedns.php"

or some other?

I have that php script above all the html tags in a file named 'reversedns.php'

2:23 pm on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


No. The # means comment, and it is not a reference to a page on your site but to a file, so this is wrong:

"http://www.mywebsite.com/reversedns.php"

it should be of this form:

"/my_full_account_path/httpdocs/reversedns.php"

What goes before the reversedns.php will be something like:

/home/oldexpat/public_html/ which is an absolute reference within the file system on your server.

The actual name varies.

It should also go without saying that php must be available and active on the server which isn't always the case.

Put the following script in your root html folder on the server as phpinfoall.php and then request http://www.example.com/phpinfoall.php in a browser of your choice.


<?php

// Show all information, defaults to INFO_ALL
phpinfo();

?>

Correct anything that prevents you from getting a nicely formatted, but very long, page back.

[edited by: theBear at 2:31 pm (utc) on July 3, 2007]

5:42 pm on July 3, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Nov 2, 2005
posts:450
votes: 0


The reversedns.php solution should do the trick for proxies spoofing as GoogleBot and co. But what's to stop a proxy just indexing your site as some other common user agent, Mozilla Firefox for example? Am I missing something?

BTW this type of thread is why I love WebmasterWorld! Thanks for all the great info.

6:02 pm on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


I think that case has been covered. It, sorta like, doesn't.

That is why I mentioned that Bill uses more than one method to detect any form of scraping including the cases like you are talking about.

Does what he uses get them all? I'd wager he gets a few oceans' worth of the SOBs.

Bill is one old hand at using the right bait.

[edited by: theBear at 6:03 pm (utc) on July 3, 2007]

7:13 pm on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


What I don't like about this is how hard it is to test your code.

Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.

With Webmaster Central, why doesn't Google communicate with the webmaster? Like give us a unique code to display in our footer or somewhere in the page. Then the scrapers will light up like a Christmas tree because they will have those unique words all over the place.

And why can't they also use that same code in their useragent when sending googlebot over?

Example:

WebmasterWorld gets a unique key: abc123.
Stick that in the page somewhere. If content that was detected at WebmasterWorld turns up duplicated elsewhere, including that unique keyword, boom, they are scrapers; kill them in the SERPs.

Then whenever Googlebot visits WebmasterWorld, it would also append abc123 to the user agent: googlebot-abc123. Now any time a robot comes crawling and says it is Googlebot but doesn't have the unique string, boom, you stop it in its tracks.
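Purely hypothetical, of course, since Google sends no such token today, but the check on our end would be trivial (abc123 being the imaginary key from the example above):

<?php
// Hypothetical check: reject anything claiming to be Googlebot that lacks
// the imaginary site-specific token in its user agent.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$my_token = 'abc123'; // the imaginary per-site key

if (stristr($ua, 'googlebot') && !stristr($ua, 'googlebot-' . $my_token)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>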

7:27 pm on July 3, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


But what's to stop a proxy just indexing your site as some other common user agent - Mozilla Firefox for example?

A proxy is a proxy, not a spider. That doesn't mean someone couldn't run a spider off the same server, but that would be a spider and not a proxy.

Of course someone other than Googlebot or Slurp could spider through a proxy and the reverse DNS code wouldn't catch them. Then again, they aren't causing your site to be hijacked in Google either, which is all we were trying to fix here.

FYI, I log all accesses of Googlebot, etc. via proxy sites and then block others from doing the same. The fact that they entice Googlebot to crawl via their site is what gets them blocked from my server in the first place.

Gotta love it!

8:27 pm on July 3, 2007 (gmt 0)

New User

10+ Year Member

joined:May 29, 2007
posts:5
votes: 0


incrediBILL,

I like the reverse-forward dns validation solution. Email servers have been doing this for a while to identify spam SMTP servers.

The problem is this: a hijacker can modify the web proxy code to alter the user agent and return some other user agent when it should return Googlebot. That would bypass user-agent-based detection. It's not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.

IP address detection will need to work similarly to current email spam detection. Mail servers use honeynets to trap the offending IPs and maintain a public database of IPs that is accessible via DNS. We will need volunteers to maintain DNS databases of hijacking proxies' IP addresses.
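Querying such a database would be simple once a zone exists; here is a sketch, assuming a hypothetical zone called proxies.dnsbl.example.org:

<?php
// Hypothetical DNSBL check: reverse the octets of the visitor's IP and look
// it up in an (imaginary) blacklist zone, the way mail servers query RBLs.
// gethostbyname() returns the query unchanged when the name does not resolve,
// so a changed result means the IP is listed.
function ip_is_listed($ip, $zone = 'proxies.dnsbl.example.org') {
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    return gethostbyname($reversed . '.' . $zone) !== $reversed . '.' . $zone;
}

if (ip_is_listed($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>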

Ideally search engines will find a bulletproof solution.

9:18 pm on July 3, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


The problem is this: a hijacker can modify the web proxy code to alter the user agent and return some other user agent when it should return Googlebot. That would bypass user-agent-based detection. It's not happening now, but I don't think it will take long. User agent detection will need to be complemented with IP address detection.

Some proxy sites always filter the user agent, so it's already happening. The problem is that the solutions to stop that are way more complicated, so you use the simplest band-aids you can to stop the most problems for the most people until a better solution becomes available that is easily accessible for the masses.

I block entire IP ranges of data centers, which is where many proxies and scrapers are hosted, but there are still exceptions.

My web site security has many layers, far too many to discuss in this thread, but I continue to use them all. Just because some traps may sometimes be defeated doesn't mean they are always defeated, so you let the simpletons get trapped in the simplest security while you focus your ongoing efforts on the more advanced internet scoundrels.
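In .htaccess terms the range blocking is just a longer deny list; the ranges below are placeholders, not real data-center allocations:

order allow,deny
deny from 11.22.0.0/16
deny from 55.66.77.0/24
allow from all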

[edited by: incrediBILL at 9:20 pm (utc) on July 3, 2007]

9:57 pm on July 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


What I don't like about this is how hard it is to test your code.

Detect a spoofed googlebot and display "hijack attempt thwarted". Well what if there's a problem in your code and you blocked the real googlebot? You'll figure it out soon when you stop getting any traffic. That freaks me out.

It's easy to test the code (using the PHP script method). You can first of all run the script by itself to check for errors. Then you can visit your page in Firefox surfing as Googlebot (using the User Agent Switcher extension) to check the 403. And then you can place a PHP email script in the footer of your pages so that you receive an email each time Googlebot visits a page. If Googlebot gets as far as the footer, then you're fine, and you can delete the email script.

If you do this, don't forget to switch back to Firefox default User Agent before you visit these forums, or you'll risk getting banned from WebmasterWorld (so I've heard).
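The footer snippet itself can be as small as this (a sketch; the address is a placeholder, and remember to remove it once you've confirmed Googlebot gets through):

<?php
// Temporary test snippet for the page footer: mail yourself whenever a
// request claiming to be Googlebot reaches this point (i.e. was not 403'd).
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'googlebot')) {
    mail('you@example.com',
         'Googlebot reached ' . $_SERVER['REQUEST_URI'],
         'IP: ' . $_SERVER['REMOTE_ADDR'] . "\nUA: " . $ua);
}
?>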

10:41 pm on July 3, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:June 17, 2003
posts:96
votes: 0


AddType application/x-httpd-php .html .htm .txt
php_value auto_prepend_file "/my_full_account_path/httpdocs/reversedns.php"

Nice idea, but it does not work if you use XBitHack On.

Any other ideas on implementing this?

11:17 pm on July 4, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Nov 2, 2005
posts:450
votes: 0


Just a quick note on the reversedns.php script: I've had a few hits from MSNBot that resolve back to a phx.gbl domain and thus get served a 403 by the script. It seems phx.gbl is a fake domain that Microsoft uses for god knows what reason. Has anyone noticed this before?
1:39 am on July 5, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


optimist,

What happens when you have the execute bit hack activated?

3:35 am on July 5, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:June 17, 2003
posts:96
votes: 0


My includes no longer work when switching the file to make html process as php.

I'm using HTML includes: <!--#include file="filename.inc" -->; all files are set to chmod 755, with XBitHack On, to eliminate the use of .shtml so filenames remain .html. (Basically this allows me to run SSI without using .shtml extensions.)

I believe there was a reason I used SSI instead of PHP originally - not sure why as it was a while ago.

4:17 am on July 5, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts:46
votes: 0


And here we are, trying to find solutions for a problem that would be so easy for Google to fix.

Webmasters have been talking about the negative effects for a long, long time.

Proxies are sooo easy to recognize, doing things they shouldn't do.

Why would Google NOT do anything about the proxy phenomenon? Why not? I don't understand.

4:31 am on July 5, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


optimist,

You might want to look into mod_layout; it is a third-party, BSD-licensed Apache add-on.

I can't say for certain exactly how it interfaces with the regular handlers (it seems to have its own), so it may do what it does without interfering with your use of the execute bit hack.

I am still of the opinion that this kind of thing is what the rewrite map facility is good for, despite all of its drawbacks, such as being invoked at start up and being available server-wide (which, unless you were self-hosting, would be an issue), requiring a fair amount of programming, and being a pain if you make a programming mistake.

[edited by: theBear at 4:32 am (utc) on July 5, 2007]

5:01 am on July 5, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:June 17, 2003
posts:96
votes: 0


theBear

Thanks for the info. I think at this point it would be easier to go the route of the .htaccess rDNS check, if my host will turn it on.

Otherwise a find-and-replace of the includes may be a simpler option. I feel this is important enough, since Google will not support people hurt by proxies and fix this, that it's worth the effort to implement.
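For what it's worth, once the .html files are being parsed as PHP, the replacement for each SSI directive is a one-liner (a sketch; filename.inc is the placeholder from the earlier post):

<?php
// PHP equivalent of <!--#include file="filename.inc" -->
include 'filename.inc';
?>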

6:29 pm on July 5, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Does the forward backward lookup thingie work for yahoo too?
7:38 pm on July 5, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Yahoo, yes (but not MSN or Ask, as far as I know). See the Yahoo blog for details [ysearchblog.com].
8:53 pm on July 5, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Thanks Ted.
I've started playing around with some code to implement some of this and it's harder than it looks.

Some issues:
1. You don't want to ban IPs forever, do you? So you need to build in some system to datestamp the latest visits and purge older bans.
2. You only want to run the forward/reverse DNS on an IP once. So you need to look up each IP to make sure that you don't already have the answer. Well, you don't want to run that on a regular table, do you? You want a HEAP table for that. But there are going to be more IPs than you can handle in a HEAP table. So you'll need a caching mechanism. And you'll need to keep X days in the HEAP table. And put older IPs back into the HEAP table when they reappear. You also need a purging mechanism for the HEAP table. (A flat-file sketch of this kind of cache follows at the end of this post.)
3. There is more than one search engine. While a lot of spoofing may happen with Googlebot, how do you handle other search engines?
4. How do you identify that you're dealing with a spider? You don't want to run this code on every user, do you?
5. If you want to run a captcha like incrediBILL was talking about, that adds a whole layer of complexity.

If anyone has insight into these and other issues I missed, I'd love to hear it.

At the end of the day, will doing all this work really improve my site's SE visibility all that much? If so, has anyone run some kind of test to provide metrics?
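On point 2, before reaching for HEAP tables, a flat-file cache may be enough for a smaller site. A rough, untested sketch (the path is a placeholder, and it only handles Googlebot):

<?php
// Rough sketch of a flat-file cache for forward/reverse DNS results, so each
// bot IP is only verified once every $ttl seconds (which also handles point 1,
// since stale entries are simply re-checked rather than banned forever).
$cache_file = '/my_full_account_path/data/bot_ip_cache.dat';
$ttl = 7 * 24 * 3600; // re-verify after a week

$cache = file_exists($cache_file) ? unserialize(file_get_contents($cache_file)) : array();
$ip = $_SERVER['REMOTE_ADDR'];

if (!isset($cache[$ip]) || (time() - $cache[$ip]['time']) > $ttl) {
    $hostname = gethostbyaddr($ip);
    $verified = preg_match("/\.googlebot\.com$/i", $hostname)
             && gethostbyname($hostname) == $ip;
    $cache[$ip] = array('verified' => $verified, 'time' => time());
    file_put_contents($cache_file, serialize($cache));
}

// $cache[$ip]['verified'] can now be used by the blocking logic.
?>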

1:11 am on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


For metrics, you should look at incrediBill's posting that says from 500k to 900+k per month.