
Forum Moderators: Robert Charlton & goodroi


Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

     
1:59 pm on Jun 25, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts: 431
votes: 0


I posted about this in the back room, but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such:
[scumbagproxy.com...]

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of affected sites is increasing exponentially, and your site could be next.

Take preventative action now by doing the following...

1. Add this to the <head> section of all of your pages:

<base href="http://www.yoursite.com/" />

and if you see an attempted hijack...

2. Block the proxy site via .htaccess (this assumes RewriteEngine On is already enabled; a PHP-level equivalent of steps 2 and 3 is sketched after this list):

RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

3. Block the IP address of the proxy

order allow,deny
deny from 11.22.33.44
allow from all

4. Do your research and file a spam report with Google.
[google.com...]
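For sites whose pages are already parsed by PHP, steps 2 and 3 can also be done at the application level. This is only a minimal sketch; the proxy hostname and IP below are hypothetical placeholders for whatever shows up in your own logs:

<?php
// Hypothetical values -- replace with the proxy you actually see in your logs.
$badReferer = 'yourproblemproxy.com';
$badIps     = array('11.22.33.44');

$referer  = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$remoteIp = $_SERVER['REMOTE_ADDR'];

// Step 2: refuse requests whose referer points at the proxy.
// Step 3: refuse requests coming directly from the proxy's IP.
if (stripos($referer, $badReferer) !== false || in_array($remoteIp, $badIps)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
?>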

1:17 am on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


I saw that one, but as I remember it, that was more about internal deep linking and localized content than a result of banning bad bots.
1:50 am on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


You might want to read it again:

"In the last year while fighting all this nonsense I managed to move up the ranks from only 400K visitors a month to 900K+ (maybe 1M, we'll see how the month ends). This wouldn't have been possible to accomplish if the scrapers and hijacked pages had been left unchecked as I would still be competing against myself in Google, which I was before I went draconian on content access rules, and now it's not a problem."

8:25 am on July 6, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 17, 2004
posts:138
votes: 0


Does Google have an option to report such sites?
Maybe they should create one.
11:39 am on July 6, 2007 (gmt 0)

New User

10+ Year Member

joined:June 24, 2005
posts:7
votes: 0


Because Googlebot always reads robots.txt, surely the PHP code only needs to be in the robots.txt file?

If I add this to the .htaccess:

AddType application/x-httpd-php .txt

and put the reverse DNS lookup at the top of robots.txt (which adds any blocked IPs to the .htaccess), that will save the server from parsing every .htm file on the server as PHP.

Or am I missing something?

1:26 pm on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


ozric,

You are missing something.

The fake Googlebots don't always read robots.txt.

2:53 pm on July 6, 2007 (gmt 0)

New User

10+ Year Member

joined:June 24, 2005
posts:7
votes: 0


theBear,

It is the real Googlebot we're talking about, isn't it? It's being sent through a proxy server. I suppose the proxy might not request robots.txt when Googlebot asks for it, so my solution wouldn't be 100% effective.

I don't like the idea of turning on PHP for all .htm files. My site gets over 200,000 .htm page views a day. I'm worried about what will happen if the server suddenly has to parse all of these requests as PHP. Or is it not a problem?

3:39 pm on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


ozric,

Actually the block is to trap fake Googlebots as well.

Blocking just a request for robots.txt won't handle that aspect at all.

The fake ones, even if told not to do something, are still going to do it. And whether the script/proxy would even return the 403 (or whatever) is another consideration. The safe way is to nail them all.

200,000 page views a day could be a load, but that all depends on what you have for a server and how you have other things set up. I can't answer that question for you.

4:28 pm on July 6, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Because Googlebot always reads robots.txt, surely the PHP code only needs to be in the robots.txt file?

I can see how this can be confusing but Google will most likely NEVER read your robots.txt file via a CGI or PHP proxy server.

If Google is about to crawl:

exampleproxysite.com/nph-page.pl/000000A/http/www.yoursite.com

Google will read robots.txt from here:

exampleproxysite.com/robots.txt

Does that make sense?

Therefore, protecting robots.txt on your site for this scenario is probably a waste of time.

5:10 pm on July 6, 2007 (gmt 0)

New User

10+ Year Member

joined:June 24, 2005
posts:7
votes: 0


I see, thanks Guys! Just ignore my posts, sorry if I confused anybody :)
5:33 pm on July 6, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 12, 2003
posts:167
votes: 0


You shouldn't have much trouble running the PHP script with 200K page views per day. PHP is quite fast; we run fairly large PHP scripts on all of our pages, loading ads (phpAdsNew) and logging various bits to MySQL, doing about 10 requests/sec on an old Dell PowerEdge 2850.
6:56 pm on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google does not have to read only the root folder for robots.txt files. They can also read inside folders.

On freehosts, you might have a website per user per folder, and then they have to read one level down to find it.

8:27 pm on July 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 12, 2006
posts:2710
votes: 116


[As a little side note, does anyone know if the well-known 'Bad Behaviour' script from 'Homeland Stupidity' protects against this kind of attack? From the looks of it, I think it does, but I don't know enough PHP to know for sure. It seems to block anything that fakes its user-agent.]
8:38 pm on July 6, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Google does not have to read only the root folder for robots.txt files. They can also read inside folders.

So how does that equate to Google actually accessing the robots.txt file on the server being spoofed by the CGI proxy?

It doesn't, so let's not confuse the topic.

[edited by: incrediBILL at 8:39 pm (utc) on July 6, 2007]

2:21 am on July 9, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Ever since this thread started I've been mulling over this whole thing and figured out a plan to make it work w/ my site. I wish I could use php to change the access list for my whole server, but I suppose starting w/ one domain is a good start.

Funny, on Digg a story recently made the front page about how using the user-agent switcher on Firefox will allow you to access "restricted" sites if you mask yourself as Googlebot. This forward/reverse thing will put the kibosh on that.

4:20 am on July 9, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Clark,

I'm currently testing a Perl-based rewrite map version of this kind of block.

It isn't as hard to test as some other forms of blocking are.

I can run the script standalone and feed it information from existing log files.

I also downloaded mod_layout and will be playing with it a bit, just to see how it could handle this type of thing without interfering with things like the XBitHack.

It helps to have many ways to do something.
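theBear's map program is Perl; for readers who would rather use PHP, here is a rough sketch of the same idea. It assumes (hypothetically) a RewriteMap named "botcheck" defined as a prg: map pointing at this script, plus a RewriteCond/RewriteRule pair that only routes requests whose user-agent claims to be a major crawler through the map, passing %{REMOTE_ADDR} as the key:

#!/usr/bin/php
<?php
// External RewriteMap program (prg: type). Apache writes one lookup key per
// line to stdin and expects exactly one line back on stdout for each key.
// Here the key is the client IP; the answer is "ok" or "deny".
$cache = array(); // skip repeat DNS work for IPs we have already classified

while (($ip = fgets(STDIN)) !== false) {
    $ip = trim($ip);
    if (!isset($cache[$ip])) {
        $host    = gethostbyaddr($ip);                   // reverse lookup
        $forward = $host ? gethostbyname($host) : false; // forward lookup
        // A real crawler must forward-confirm and resolve to a known
        // crawler domain (this list is illustrative, not exhaustive).
        $crawlerHost = preg_match('/\.(googlebot\.com|google\.com|crawl\.yahoo\.net|search\.msn\.com)$/i', $host);
        $cache[$ip] = ($forward === $ip && $crawlerHost) ? 'ok' : 'deny';
    }
    fwrite(STDOUT, $cache[$ip] . "\n");
    fflush(STDOUT); // Apache blocks until it sees the answer, so flush right away
}

Feeding the script a list of IPs pulled from old access logs, as theBear describes, is an easy way to test it standalone before wiring it into Apache.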

2:40 am on July 22, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


I haven't had time to actually implement anything yet, but this thread keeps percolating in my head, giving me more ideas... One thing I came away with is that it's best not to let Google cache your pages, because robots are indexing your content through the cache. Damn, Google should force you to prove you're a human with a captcha before allowing your content, which they cached on their site, to be read...

Anyway, removing caching doesn't affect ranking, does it?

A couple more questions about these bots. In your experience, are they mostly targeting specific sites? Or do they follow links at random or on a keyword basis?

I'm not planning to ban any bots. I just want to feed them junk data and junk links. Do they tend to follow the links they are fed?

10:48 pm on July 22, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Clark,

After looking at well over 2 million requests from old logs, I got 62 web-hosting-service-based scrapers, 21 automated downloaders, 5 no-agent downloaders, 18 fake Yahoo/Inktomi Slurps, and 19 fake Googlebots.

The stuff covers plenty of ground, but most of it is aimed at groups of pages that do well for related keywords.

I have a nice, fast rewrite map script that can handle allowed known-bot requests at a rate exceeding 180,000 requests a second. The processing rate for banned bots is just slightly slower, the agent stops are slower still, and the hosting-provider checks can make a determination at a rate of over 7,000 requests a second when they have to classify.

2:02 pm on July 24, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


theBear,

That's very impressive. Are you building something for sale? Or is this all just for yourself?

How do you determine that it's a bot? Are you also using robot traps and "human" tests?

2:08 pm on July 24, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


Is it just my imagination, or did Brett not have a post in this thread? It was a classic and I went through all the pages, even did a printable version of the page and can't find his post.

He was saying that his classic "succeed on Google in 12 months" thread got ripped so many times that the rips have been ripped. Does this ring a bell? Could it have been in another thread?

3:07 pm on Oct 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


With the recent attention to proxy hijacks here, I thought we should re-open this thread. Has anyone found something new in the way of defending against this problem?
7:51 pm on Oct 12, 2007 (gmt 0)

Junior Member

joined:Oct 12, 2007
posts:44
votes: 0


Serving pages in a frame for non-trusted IPs would, I think, prevent proxies and scrapers/harvesters. I have a PHP script a friend sent me that does this. I don't know how long it would take for them to get around it, though.
8:42 pm on Oct 12, 2007 (gmt 0)

Junior Member

joined:Oct 12, 2007
posts:44
votes: 0


Here's the system.

I am not going to post my .htaccess code, as it's probably clumsy compared to what the experts here can do.

In .htaccess, reverse-DNS the search bots as described in the previous posts (remember MSN; this part is hugely important).

Now create a list of trusted IPs in .htaccess, together with the user agents of the search bots (they are reverse-DNS verified, so they can't be spoofed).

All other IPs get a global redirect to a file called frame.php.

Remember, in .htaccess, not to apply the rules to this file.

Because the redirect can't pass the referrer, this was the only way a friend could think of to do it, but I am sure that if you like the system, people can vary it to achieve something better.

So a request to http://www.example.com/widgets.html, if it's not from a search bot or a trusted IP, is redirected to http://www.example.com/frame.php?http://www.example.com/widgets.html

As you can see, the requested page is passed along in the query string, so you can work with it in the PHP frame page. Now, in the PHP page called frame.php:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<title>Example</title>

<meta name="author" content="Example">
<meta name="robots" content="noindex,follow">
<meta name="description" content="Example">

<?php
// Point the base href at the URL that was requested (HTTP_HOST already
// includes the "www." if the visitor used it, so don't prepend another one).
echo '<base href="http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] . '" />';
?>

</head>

<?php
// Rebuild the full URL of this request, e.g.
// http://www.example.com/frame.php?http://www.example.com/widgets.html
$requested = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];

// Strip the frame.php? prefix to recover the page that was actually asked for.
$sitename = 'http://' . $_SERVER['HTTP_HOST'] . '/frame.php?';
$target   = str_replace($sitename, "", $requested);
?>

<frameset rows="100%">

<frame src="<?php echo $target; ?>" name="content">

<noframes>
<body>
<p><a href="<?php echo $target; ?>">Example</a></p>
</body>
</noframes>

</frameset>

</html>

The advantage of this method, I feel, is that there is no longer a need to ban IPs (I am always wary of banning IPs that may be shared by a large number of users) but rather a need to allow them. Those not allowed see the same page in a frame, which preserves accessibility better than cloaking the page would, since modern screen readers can read a single frame with no problem. This system also stops scrapers and email harvesters as well as proxy hijacks.

I am just worried about how long it will take these #*$!s to get around it.

10:28 pm on Oct 12, 2007 (gmt 0)

Junior Member

joined:Oct 12, 2007
posts:44
votes: 0


My friend has also suggested inserting a spider trap into frame.php that, if triggered, leads the IP to a page whose title says "THIS SITE IS A SCRAPER OR HIJACKER" (he says to vary the text). In this way it would not only stop scrapers, hijackers, and email harvesters, it would also highlight them throughout the net, as their automated systems would display this title. My friend is now working on a solution in case they manage to scrape through the frame.

It would be great to get some authoritative comments on this...

11:06 pm on Oct 12, 2007 (gmt 0)

Junior Member

joined:Oct 12, 2007
posts:44
votes: 0


Do you want me to paste the .htaccess code to go with this, or will you respond with it? I want this solution to reach all webmasters in an understandable way.

OK, I am off to spread the word...

6:21 am on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Maybe I'm missing something here - but why would I throw all my users into a framed situation where they can't get the real page's url from the address bar? That would kill natural internal backlinks, bookmarking and so on. Sounds a bit like throwing out the baby with the bathwater to me.
8:24 am on Oct 14, 2007 (gmt 0)

Full Member

10+ Year Member

joined:May 31, 2006
posts:268
votes: 0


theBear wrote:

After looking at well over 2 million requests from old logs, I got 62 web-hosting-service-based scrapers, 21 automated downloaders, 5 no-agent downloaders, 18 fake Yahoo/Inktomi Slurps, and 19 fake Googlebots.

Is there a database/blacklist of those (with specific IPs)? I am starting to go through my logs (after seeing how much content scraped from my sites is floating around the Internet) and I want to block as many of those as possible. It would be nice to establish a shared database...

8:28 am on Oct 14, 2007 (gmt 0)

New User

joined:Oct 14, 2007
posts:27
votes: 0


Why should I spend my time fixing Google's problems? If their index is infested with hijacked pages and things like that, I say: you have the billions, you fix it. Thanks.
8:39 am on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


One problem with any database/blacklist approach is that new proxies pop up literally daily, hourly. The damage is done before your list is updated, and your blacklist maintenance becomes a total PITA. You need to block certain patterns of requests. I rarely take sides in discussions like this, but I'd say most people should just listen to incrediBill's earlier posts in this very thread.

If you do ONE THING, the reverse-forward DNS stops this; otherwise you'll be fighting this problem until the day you die, as it's a total waste of time to block them individually and a completely false sense of security, since another proxy site will pop up to replace it the same day.

...The reverse-forward DNS spider validation is the only proxy blocker you need. Install it and then you can ignore the hijacking problem as it WILL completely resolve itself in time as all the spiders crawl the proxy a second time and remove your previously hijacked listings or replace them with junk (my personal favorite).

Try it - you'll like it. A lot.

10:11 am on Oct 14, 2007 (gmt 0)

New User

joined:Oct 14, 2007
posts:27
votes: 0


reverse-forward DNS

Why should I slow down every first-time visitor by up to a few seconds by doing double DNS lookups? I say again, it's only Google's problem. They have billions, they have all the resources. They shall fix it.

10:48 am on Oct 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


I say again, it's only Google's problem. They have billions, they have all the resources. They shall fix it.

We're not talking about mere indexing of proxy urls here, we're talking about the proxy urls taking over your rankings. That means YOU can lose income.

We're also not talking about defending against scrapers - just proxy server urls. Should Google fix this? Oh yes. Will I wait around for them to do it? Not on your life.

Why should I slow down every first-time visitor...

Not every first time visitor - just the ones who claim to be a search engine spider. This is from jdMorgan's post, #5 in this thread:

Given an understanding of what a proxy *is* and how it works, the only step really needed is to verify that user-agents claiming to be Googlebot are in fact coming from Google IP addresses, and to deny access to requests that fail this test.

If the purported-Googlebot requests are not coming from Google IP addresses, then one of two things is likely happening:

1) It is a spoofed user-agent, and not really Googlebot.
2) It *is* Googlebot, but it is crawling your site through a proxy.

The latter is how sites get 'proxy hijacked' in the Google SERPs -- Googlebot will see your content on the proxy's domain.

I also suggest checking IPs that claim to be slurp, MSNbot and Teoma (Ask's crawler.) Here's that reference thread again about how to do the checking:

How to verify Googlebot is Googlebot [webmasterworld.com]
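For anyone who wants to try this at the page level, here is a minimal PHP sketch of the reverse-forward DNS check described above. It assumes your pages are already parsed by PHP, it only challenges requests whose user-agent claims to be one of the major crawlers, and the list of crawler domains is illustrative rather than exhaustive:

<?php
// Reverse-forward DNS validation of claimed search engine spiders.
// Only requests that *claim* to be a crawler are checked, so normal
// visitors never pay for the DNS lookups.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

if (preg_match('/googlebot|slurp|msnbot|teoma/i', $ua)) {
    // Reverse DNS: what hostname does this IP map back to?
    $host = gethostbyaddr($ip);

    // The hostname must belong to the crawler's real domain.
    $validHost = preg_match(
        '/\.(googlebot\.com|google\.com|crawl\.yahoo\.net|inktomisearch\.com|search\.msn\.com|ask\.com)$/i',
        $host
    );

    // Forward DNS: the hostname must resolve back to the very same IP.
    $validIp = ($validHost && gethostbyname($host) === $ip);

    if (!$validIp) {
        // Spoofed spider, or a real crawler coming through a proxy: refuse it.
        header('HTTP/1.1 403 Forbidden');
        exit('Forbidden.');
    }
}
?>

In practice you would cache the verdict per IP (in a file or a database table) so that repeat requests from the same address don't trigger fresh DNS lookups.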
