Proxy Hijack - Now what should I do?
followgreg




 
Msg#: 3306064 posted 7:13 am on Apr 9, 2007 (gmt 0)

Guys,

I just found out that some of the pages on one of our sites were hijacked by a proxy! :(

Our two-year-old blog's homepage disappeared from Google's index, and the Google cache for OUR site shows the proxy server's URL!

The HTTP response shows an x-pingback header using xmlrpc.php on the blog (WordPress), while the domain is the spammer's.

My question is, how do I fix it?

 

Keniki



 
Msg#: 3306064 posted 11:10 pm on Apr 9, 2007 (gmt 0)

Hi, I am sorry to hear you are having these troubles.

Here are a few things you can do to help.

1. Add a base element in the head section of your pages. (When your page is served through a proxy, the base element still makes relative links resolve to your own domain.)

This can be added as follows:

<base href="http://www.yoursite.com/" />

2. Block the site by adding this to .htaccess (assuming RewriteEngine On is already set):

RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

3. Track down the proxy server's IP address using, say, nslookup, and block that IP address in .htaccess:

order allow,deny
deny from 11.22.33.44
allow from all

(11.22.33.44 stands for the proxy's IP. Some proxies are sneaky and use a different IP to cache your page, so check after you have done this; you should see the proxy hijack break immediately once this is done, otherwise track it down by blocking IP ranges.)

4. Use robots.txt to ban versions of pages that you know don't exist on your server. I use Google's robots.txt wildcard rules (introduced for handling images) to do this. So, for instance, if your site only uses URLs ending in .php, I would add this to robots.txt to prevent hijacks:

User-agent: Googlebot
Disallow: /*.asp
Disallow: /*.cgi
Disallow: /*.htm
Disallow: /*.html
and so on.

Since all hijacks depend on stealing someone else's content, let's make them work off your robots.txt.

5. Finally, don't forget to file a spam report.

Hope this is of some help. (Edited for coding reasons.)

[edited by: Keniki at 11:20 pm (utc) on April 9, 2007]

trinorthlighting




 
Msg#: 3306064 posted 12:32 am on Apr 10, 2007 (gmt 0)

Block the IP of the site first. Then, as it gets crawled, Google will drop the forbidden pages out of the index.

Keniki



 
Msg#: 3306064 posted 12:54 am on Apr 10, 2007 (gmt 0)

trinorthlighting, good point. Blocking an IP or IP range is the fastest way to break a proxy hijack. Those who have known me a while will know how passionate I am about robots.txt; that is because I feel it can provide the ultimate protection to break all hijacks, by only allowing content you determine. What do you think of this idea, trinorthlighting?

bobothecat



 
Msg#: 3306064 posted 1:08 am on Apr 10, 2007 (gmt 0)

Those who have known me a while will know how passionate I am about robots.txt; that is because I feel it can provide the ultimate protection to break all hijacks, by only allowing content you determine.

I seriously doubt a hijacker is going to read, or respect, robots.txt.

.htaccess is a much better way to deal with such issues... though if you can block at the server or router level, it's even better.

robots.txt is just that... text.

Keniki



 
Msg#: 3306064 posted 1:25 am on Apr 10, 2007 (gmt 0)

bobothecat, the robots.txt isn't intended for the hijacker; it's intended for the SE. It determines the URLs that are allowed on the domain, and prevents proxy hijacks and pretty much any other kind by determining the content allowed to be searched.
By the way, I've been testing this for six months and it works!

bobothecat



 
Msg#: 3306064 posted 1:33 am on Apr 10, 2007 (gmt 0)

By the way, I've been testing this for six months and it works!

and prevents proxy hijacks and pretty much any other kind by determining the content allowed to be searched.

I've been using robots.txt since it was invented, and I can assure you it doesn't. :)

[edited by: bobothecat at 1:41 am (utc) on April 10, 2007]

Keniki



 
Msg#: 3306064 posted 1:39 am on Apr 10, 2007 (gmt 0)

bobothecat, that's funny, because Google only introduced the handling of image files in robots.txt (which this code is based on) two years ago, and Google itself is not yet 10 years old.
The previous post was edited, so I took out my comment about AltaVista.

[edited by: Keniki at 1:42 am (utc) on April 10, 2007]

Keniki



 
Msg#: 3306064 posted 2:02 am on Apr 10, 2007 (gmt 0)

The important thing is to refocus on robots.txt. This file can prevent hijacks, determines how your site is crawled, and is unaffected by spam.

trinorthlighting




 
Msg#: 3306064 posted 2:07 am on Apr 10, 2007 (gmt 0)

You have to be careful what you put in the robots.txt, since hackers like to read it.

Best way: sign up for Google Alerts to get an alert for your site.

Then read the alerts as they come daily. At the first sign of a proxy, Google will email it to you and you can stop it before it becomes too big.

Also report it via webmaster tools as spam. That way google can take action against the site.

[edited by: engine at 8:44 am (utc) on April 10, 2007]
[edit reason] delinked [/edit]

incrediBILL




 
Msg#: 3306064 posted 5:50 am on Apr 10, 2007 (gmt 0)

OK, if you've been proxy hijacked I might be able to help.

I discussed this issue when I spoke at PubCon last year, and I've been complaining about it to Google for quite some time. They can't seem to fix it on their end, yet Yahoo and MSN appear to have it somewhat under control.

What happens is the proxy site cloaks tons of URLs to the search engine and the search engine crawls through the proxy site and gives the proxy site credit for the page.

Blocking individual IPs is a complete waste of time as the proxy sites pop up in new locations almost daily and I'm tracking a few hundred of them at the moment.

How you stop this problem is to use REVERSE and FORWARD DNS to validate that Googlebot is coming from an actual Google IP: the reverse lookup of the IP should return a host name that contains ".googlebot.com", and the forward lookup of that host name should return the original IP. If the reverse DNS of the IP that claims to be Googlebot doesn't contain ".googlebot.com", you simply bounce them.
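A minimal sketch of that check, assuming Python's standard socket module (the function name is illustrative, not from the original post):

import socket

def is_real_googlebot(client_ip):
    # Reverse lookup: the PTR name must belong to googlebot.com
    try:
        host = socket.gethostbyaddr(client_ip)[0]
    except socket.herror:
        return False
    if not host.endswith(".googlebot.com"):
        return False
    # Forward lookup: the name must resolve back to the same IP,
    # otherwise anyone could fake a googlebot.com PTR record
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
    return client_ip in forward_ips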

I display a specific encoded error page so that I can track where the data shows up in Google and tie it back to the original proxy crawl, all without my pages being compromised and hijacked in Google.

There are other ways to detect proxies as well, and I block them using multiple methods, including known lists of proxy IPs. The only flaw with using a proxy IP list is that the inbound and outbound connections may use different IPs, which is why I deploy multiple detection methods. Unfortunately I can't share those, otherwise the proxy sites would wise up and correct their "tells".

incrediBILL




 
Msg#: 3306064 posted 5:56 am on Apr 10, 2007 (gmt 0)

Keniki said:
The important thing is to refocus on robots.txt. This file can prevent hijacks, determines how your site is crawled, and is unaffected by spam.

Bobo is right and robots.txt is meaningless for this problem.

What you're missing, Keniki, is that when a crawler passes through a proxy, it uses the robots.txt for the PROXY, not the one for your site.

Example:

Google crawls this page:
www.slimyproxysite.com/nph-page.pl/000000A/http/www.mydomain.com

Google will ask for robots.txt here:
www.slimyproxysite.com/robots.txt

Google will NOT ask for robots.txt here:
www.slimyproxysite.com/nph-page.pl/000000A/http/www.mydomain.com/robots.txt

Do you understand the problem now?

[edited by: incrediBILL at 5:57 am (utc) on April 10, 2007]

followgreg




 
Msg#: 3306064 posted 12:11 pm on Apr 10, 2007 (gmt 0)

Thanks for answering (and thanks to Richtc for his PM).

I had heard about these proxy hijacks, but to be honest I thought it wasn't that easy to fool Google.

For now I've blocked the proxy domain and denied their IPs; however, I am pretty sure that Google is crawling through other proxies, even though I can't find which ones.

Many pages of our site have no PageRank anymore and don't appear in Google Webmaster Central as part of the site.

What the h...? All the hijacked pages I can count are 2 to 3 years old, are 100% unique, are linked from almost all other pages of the site, AND have inbound links from other sites. So HOW can Google be so easily fooled into considering that they now belong to another domain, just like that?

Google seems to be removing all our pages from its index: no PageRank, supplemental listing at best. Just frustrating.
More frustrating is that I KNOW it was done on purpose; how disappointing to see that Google would just let websites die over a few technical tricks they are already aware of.

For now I filed a DMCA as well, but moreover I am p..sed to such a degree that I want to take the proxy owner to court.
Today is the day of my first appointment with our law firm; not sure if anything can be done, but I will fight this one to the death.


foxtunes




 
Msg#: 3306064 posted 12:31 pm on Apr 10, 2007 (gmt 0)

These proxy sites are popping up all the time with cached pages from my sites. I deal with them by blocking via .htaccess. I don't blame you for wanting to wheel in the lawyers, followgreg, but a lot of these proxy owners are based in the Middle East, Russia, India, etc. A DMCA will work, but as for getting any compensation out of them, I doubt it.

bwnbwn




 
Msg#: 3306064 posted 3:23 pm on Apr 10, 2007 (gmt 0)

trinorthlighting,
When setting up an alert on the site, what would you set it up for - the domain? I am a little unsure how to set this up.

What would you do if the site was on a Microsoft server, as .htaccess won't work there? I see setting up reverse and forward DNS as our only option.

trinorthlighting




 
Msg#: 3306064 posted 3:50 pm on Apr 10, 2007 (gmt 0)

The Google alerts are not on your server.

Go to [google.com...]

And type in the search term mysite.com and even some specific unique sentences from your site.
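For instance (a hypothetical example), one alert on mysite.com and another on a distinctive quoted phrase from your homepage, such as "hand-carved blue widgets from Northern Snowdonia", will each trigger whenever a new copy of your content gets indexed.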

You can set the email frequency. We did this a while ago, and it's amazing what Google will send. We see every link that gets set up to our site, and the page that links to us.

It makes it real easy to find scrapers, and when Google hits a proxy and the page gets cached, you will immediately know and will be able to block the IP and fill out a spam report.

It's easier to take care of scrapers and proxies when the first page is indexed, versus having to take care of hundreds of them.

incrediBILL




 
Msg#: 3306064 posted 5:17 pm on Apr 10, 2007 (gmt 0)

I am pretty sure that Google is crawling through other proxies, even though I can't find which ones

I explained how to find them, it's easy.

For now I filed a DMCA as well

Be very careful who you file a DMCA against in this instance, because you could get yourself in trouble for filing a false claim, and possibly be subject to a counter-suit.

If you're being proxy hijacked, the site hijacking you doesn't actually contain your content, so if they're shut down over a DMCA complaint, all hell could break loose.

The only place you should file the complaint, if at all, is Google itself, as that's the only place your content was reassigned. It's a bug in Google being exploited; there is no actual copy of your content anywhere except in Google's index.

It's easier to take care of scrapers and proxies when the first page is indexed, versus having to take care of hundreds of them.

It's easier still just to check who claims to be Googlebot and block them when they access the first page; that way ZERO pages are ever indexed and it's never an issue.

Shurik




 
Msg#: 3306064 posted 5:37 pm on Apr 10, 2007 (gmt 0)

For the past few weeks I have been developing a comprehensive solution to address proxy hijacking, as I suspect that one of my sites suffers from it.

IncrediBILL, your solution is bulletproof as long as the proxy site does not alter the user agent. What if it reports a popular browser as the user agent instead of Googlebot? Do you go as far as using cookies and/or JavaScript to identify scumbags? Your input is highly appreciated.

trinorthlighting




 
Msg#: 3306064 posted 5:41 pm on Apr 10, 2007 (gmt 0)

It's hard keeping up with them; Google Alerts is just another tool to help.

incrediBILL




 
Msg#: 3306064 posted 6:08 pm on Apr 10, 2007 (gmt 0)

What if it reports a popular browser as the user agent instead of Googlebot? Do you go as far as using cookies and/or JavaScript to identify scumbags? Your input is highly appreciated.

Most proxy servers disable JavaScript when the code passes through their servers, so JavaScript would never work on the page no matter who accessed it.

So far I've not encountered an attempted hijacking incident where the user agent wasn't an actual SE user agent.

I actually cloak a specific unique keyword (a "bug") hidden in the page when the user agent claims to be a browser, so I can see if that content shows up anywhere with a single search. None of my cloaked bugs have ever shown up associated with a proxy server yet, although my bugs do show up in scraper sites, which is why I'm pretty sure the proxies are passing the user agent through verbatim.
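A minimal sketch of that kind of tracer, assuming a Python page handler (the secret, the token scheme, and all names are illustrative, not the poster's actual method):

import hashlib

SECRET = "change-me"  # site-wide secret, so tokens can't be forged

def plant_bug(html, user_agent, client_ip):
    # Only clients claiming to be browsers get the hidden keyword;
    # searching for the token later ties republished copies back
    # to the IP that fetched the page.
    if "Mozilla" in user_agent or "Opera" in user_agent:
        token = hashlib.md5((SECRET + client_ip).encode()).hexdigest()[:10]
        bug = '<span style="display:none">zqx' + token + '</span>'
        html = html.replace("</body>", bug + "</body>")
    return html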

Shurik




 
Msg#: 3306064 posted 8:08 pm on Apr 10, 2007 (gmt 0)

Thank you, incrediBILL
You saved me a lot of time and effort.

followgreg




 
Msg#: 3306064 posted 10:06 pm on Apr 10, 2007 (gmt 0)


Incredibill >> Yes, that's correct, you gave A solution by explaining the reverse and forward DNS; actually, if I remember correctly, Matt explained this one about 6 months ago.
However, (I think that) Google also crawls sites from alternate IPs to detect cloaked pages, so I guess it is the best solution but has to be taken with a grain of salt, IMO.

I am not worried about being counter-sued, no problem at all.
I would not sue these guys for money; this would just be to let even more people know how many bad things people can do to your business using a few Google weaknesses.
As for filing a complaint with Google, well, I will let the lawyer decide.

It's not Google's fault, although given the circumstances I think they could certainly come up with solutions. And sometimes it is not even the proxy company's fault if Google picks up their URLs.

That Google decides to pick up a URL, and decides to give credit to the proxy despite all indications that the owner is not the proxy server (backlinks, age, ...), has to be fixed.

Does anyone know where a list of those bad proxies can be found?

incrediBILL




 
Msg#: 3306064 posted 10:53 pm on Apr 10, 2007 (gmt 0)

I am not worried about being counter-sued, no problem at all.
I would not sue these guys for money; this would just be to let even more people know how many bad things people can do to your business using a few Google weaknesses.

You really need to go read Chilling Effects before you wave a DMCA around in this situation, because counter-suits happen all the time.

Anyone knows where a list of those bad proxies can be found?

Yes, I have a large list of them, compiled by detecting "Google" crawling through non-Google IP addresses. Having that list is fairly meaningless, though, as many of them are dead or have already had their domain dropped by Google.

That's why I said the REVERSE->FORWARD DNS test is about the only sure-fire, Google-approved way to stop them, because new ones pop up constantly to fill in where the old ones dropped out.

Prior to Google completing their reverse DNS project, I simply did a WHOIS on the IP address to see if Google owned it, and authorized googlebot or mediapartners-google if it crawled from the following IP ranges:

64.233.160.0 - 64.233.191.255
66.249.64.0 - 66.249.95.255
72.14.192.0 - 72.14.239.255
216.239.32.0 - 216.239.63.255
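A quick sketch of that range test, assuming Python's standard ipaddress module (the ranges are the ones listed above; the function name is illustrative):

from ipaddress import ip_address

GOOGLE_RANGES = [
    ("64.233.160.0", "64.233.191.255"),
    ("66.249.64.0", "66.249.95.255"),
    ("72.14.192.0", "72.14.239.255"),
    ("216.239.32.0", "216.239.63.255"),
]

def in_google_range(ip):
    # IPv4Address objects compare by numeric value, so a simple
    # low <= ip <= high test covers each range
    addr = ip_address(ip)
    return any(ip_address(lo) <= addr <= ip_address(hi)
               for lo, hi in GOOGLE_RANGES)

print(in_google_range("66.249.66.1"))  # True: inside 66.249.64.0-66.249.95.255
print(in_google_range("203.0.113.5"))  # False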

I've been battling these things for almost 2 years now, and if I had anything better I would tell you.

FWIW, not all proxy sites are bad, and I've never been hijacked by the legit ones. However, when you find proxy sites that replace the ads in the web pages they serve, you'll usually find hijacking involved, as they want more money at the expense of other sites.

[edited by: incrediBILL at 10:54 pm (utc) on April 10, 2007]

followgreg




 
Msg#: 3306064 posted 6:11 am on Apr 11, 2007 (gmt 0)

Good stuff Incredibill, thanks.

About counter-suits, yeah, I am pretty sure there are many; now how will they justify serving our company's copyrighted content to Google?
If it was done on purpose, I am convinced that Google has historical data that can prove it, and hopefully they will drop them. If it was not intentional, then it will all settle down peacefully, and once again it will be time for all those PhDs at the 'plex to work things out sometime soon; it penalizes legit businesses and it is embarrassing for GG.

Just by going through our logs today, I've found 3 more proxies with user agent "googlebot" (see the sketch at the end of this post).

As someone else said before, Google saying that no one can hurt your rankings from another site can't be taken seriously at all, unfortunately.

Checking further into this matter, I found a couple of .gov sites whose content is now indexed solely through proxies. Wow, this is serious, I think.
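A sketch of that kind of log sweep, assuming a combined-format access log and the is_real_googlebot() helper sketched earlier in the thread (the file name and regex are illustrative):

import re

LOG_LINE = re.compile(r'^(\S+) .* "([^"]*)"$')  # client IP ... trailing user agent

with open("access.log") as log:
    for line in log:
        m = LOG_LINE.match(line.rstrip("\n"))
        if m and "Googlebot" in m.group(2) and not is_real_googlebot(m.group(1)):
            print("fake Googlebot:", m.group(1))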

incrediBILL




 
Msg#: 3306064 posted 7:15 am on Apr 11, 2007 (gmt 0)

Checking further into this matter, I found a couple of .gov sites whose content is now indexed solely through proxies. Wow, this is serious, I think.

It is serious, which is why I tell anyone willing to listen how to stop it.

now how will they justify serving our company's copyrighted content to Google?

They're a proxy; at the end of the day, it's Google that screws up and makes the mistake, not the proxy. That's why I cautioned you: I'm a big DMCA user, but this situation with proxy hijacking is dicey at best and could blow up in your face, which is why I don't use the DMCA against them.

followgreg




 
Msg#: 3306064 posted 8:20 am on Apr 11, 2007 (gmt 0)

Incredibill >> What else is there than the DMCA in such a critical situation? Do you simply send a certified letter, or a spam report?
Unfortunately, an online spam report may or may not be read, and you never know if Google will act on it.

Yes, the problem with proxies is that they may not even be doing it on purpose; I like to think that most people are honest. There are also apparently a lot of companies using proxies for hijacking the SERPs, so Google might want to also communicate with proxy website owners if possible.

From my perspective, both the spammers and Google are responsible, except if the proxy didn't do it on purpose.

The solution, however, as you said, is Google fixing their stuff.
Obviously it's not their own business that is under attack; otherwise I assume they would have fixed it a long time ago :)

morags




 
Msg#: 3306064 posted 8:59 am on Apr 16, 2007 (gmt 0)

I've just noticed that I too have this problem. IncrediBILL's reverse/forward DNS solution sounds about right. I just don't know how to go about it, so I guess it's learning time - any pointers gratefully received :-)

I can see an obvious solution for Google - but perhaps there is a reason it won't work:

Using IncrediBILL's example URL (which has the same format as the URL involved in my case):

www.slimyproxysite.com/nph-page.pl/000000A/http/www.mydomain.com

Why can't Google just drop any page where there is anything before "http/www"? OK, I realise that there will be sites that legitimately have this pattern in a URL somewhere, but thousands of times? So maybe check the number of URLs within a site that contain the pattern, and if it exceeds a set limit, the hijacking site is dropped from the index.

I'm pretty sure there is a good reason why this doesn't happen, though. More learning for me, I suppose.

The fact that the hijacking site is running AdSense just makes it worse. So not only are they exploiting a Google bug, they are being paid by Google to do so.

avalanche101




 
Msg#: 3306064 posted 11:49 pm on Apr 19, 2007 (gmt 0)

Hi,
We've got a couple of those (www.slimyproxysite.com/nph-page.pl/000000A/http/www.mydomain.com)
showing up when we do an inurl: search on our site name.
We've filed a spam report at: [google.com...]

Now we're looking at the reverse/forward DNS thingy.
How do you do that? Or is that a silly question?

jdMorgan




 
Msg#: 3306064 posted 2:11 am on Apr 20, 2007 (gmt 0)

> "reverse forward DNS thingy"

Check if your server can be configured to deny access based on a "double reverse DNS lookup failure". Apache servers can be configured [httpd.apache.org] to do this.
Be aware that this can affect your log file format, unless you have server configuration privileges and can define a custom log file format using %a instead of %h as the first entry in the format.

What this means is that if you enable double reverse DNS lookups using the standard access log format, your log files will no longer show the remote IP address, but rather, the remote hostnames (if available) from which your server received requests.

Just to de-mystify things, a double reverse DNS lookup does this:
1. Take the requesting IP address, and look up the host name(s) associated with that IP address in DNS.
2. Look up the IP address(es) of those host names; if none of them matches the IP address of the original request, the operation fails.
You can block the requests based on that.
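A minimal sketch of those steps, again assuming Python's standard socket module (this is the generic form of the Googlebot check sketched earlier in the thread; the function name is illustrative):

import socket

def double_reverse_fails(client_ip):
    try:
        host = socket.gethostbyaddr(client_ip)[0]       # step 1: IP -> host name
        forward_ips = socket.gethostbyname_ex(host)[2]  # step 2: host name -> IPs
    except (socket.herror, socket.gaierror):
        return True  # no PTR record, or the name doesn't resolve: treat as failure
    return client_ip not in forward_ips                 # step 3: must match original IP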

Jim

avalanche101




 
Msg#: 3306064 posted 3:38 pm on Apr 20, 2007 (gmt 0)

jdMorgan

Thank you for that; I'll find out if we can do that.
My boss went nuts when he found out about this and has emailed the owners, having found their info via whois.
So far, one has removed their proxy.
