
Proxy duplicate problems

8:19 pm on Aug 23, 2006 (gmt 0)

5+ Year Member

How can I block a proxy that not only duplicates my content but also rewrites the links so they all go through it, and adds advertising to every page?

It's big enough that when I search Google for My Site Name (with or without quotes), the proxy comes up now and my site is nowhere to be found. I contacted Google but have had no response yet; the proxy's host isn't responding either (I've waited 4 days so far).

So, I decided to try to at least protect my own sites, here's the info:

Their URL looks like http://www.example.com/o.php?logid=http%3A%2F%2Fmysite.com%2F and they do leave a traceable IP in my server logs.

So with that I tried blocking by IP:

deny from ###.###.###.##
allow from all

While that did show the 403 page on my site, header tests showed a 200 response (no 403), and the page was still inside their proxy (their URL in the address bar, ads and link changes still in effect).

The usual code I have for removing query strings has no effect at all:

RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)$ http://example.com/$1? [R=301,L]

Here's my very beginner attempt at combining rules; it gave a 500 error (ouch):

RewriteCond %{REMOTE_ADDR} ###\.###\.###\.## [?]
RewriteRule ^(.*)$ http://example.com/$1? [R=301,L]

And a simpler attempt to redirect them away:

RewriteCond %{REMOTE_ADDR} ###\.###\.###\.##
RewriteRule (.*) http://example.com/ [R=301,L]

That left my URL in the address bar and still inside the proxy (links altered, ads, etc.) but showed the contents of example.com. Headers were still returning a 200 response (not the 301 I'd expected).

I've been searching all day for a way to escape this thing, but I've gone as far as I can with almost no knowledge of .htaccess (I am trying) and without knowing what else to search for. So again I have to ask for help :S

9:25 pm on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

> header tests showed a 200 response (no 403)

I'd fix that problem first. Make sure your custom error document is defined with a local URL-path, and not as a canonical URL.

ErrorDocument 403 http://www.example.com

will result in a 200-OK response for all Forbidden pages.
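For reference, the local URL-path form that preserves the 403 status looks like this (the /403.html path is just a placeholder for wherever your custom error page lives):

```apache
# Local URL-path: Apache serves the error page itself and keeps the 403 status.
# Using a full http:// URL here makes Apache issue a redirect instead,
# and the final response to the client becomes 200-OK.
ErrorDocument 403 /403.html
```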

Also, defining Deny from x/Allow from y is dangerous if you don't specify an Order (see mod_access).
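A minimal sketch of the safer form, with the Order stated explicitly (the IP is a placeholder):

```apache
# mod_access: with Order Allow,Deny the Deny directives are evaluated last,
# so the listed IP is refused even though it also matches "Allow from all".
# (With Order Deny,Allow the Allow would win and the ban would not work.)
Order Allow,Deny
Allow from all
Deny from 203.0.113.42
```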


[edited by: jdMorgan at 9:26 pm (utc) on Aug. 23, 2006]

9:31 pm on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

After examining/fixing your error document directive, start with a simple ban:

RewriteCond %{REMOTE_ADDR} ###\.###\.###\.## [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?scraper-site\.com
RewriteRule !^path-to-403-error-document - [F]
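In .htaccess context that block would sit inside something like the following sketch (the IP, referer host, and error-document filename are placeholders):

```apache
RewriteEngine On
# Forbid requests coming from the proxy's IP address or carrying its referer,
# but never block the 403 error document itself (avoids an error loop).
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.42$ [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?scraper-site\.com [NC]
RewriteRule !^403\.html$ - [F]
```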


[edited by: jdMorgan at 9:31 pm (utc) on Aug. 23, 2006]

10:39 pm on Aug 23, 2006 (gmt 0)

5+ Year Member

The 403 page is written as ErrorDocument 403 /403.html and does return a 403 properly in other cases (opening an empty folder, for example). I just triple-checked again: it's working as expected except in this case. (I also tried without a custom 403 page; same thing.)

I tried the code you just wrote, and the same thing is happening:
- Their URL stays in the address bar
- The 403 page does show up, but a link I added as a test is altered to stay within their proxy
- Their ads are still there
- It still returns a 200 as well (all other 403s are returning fine)

(I've cleared all private data and restarted the browser between tests, so that isn't it either.)

It seems nothing I try can escape this thing... serving up error pages, feeding them their own url to chew on.. no luck.

I've tried these variations separately as well:

RewriteCond %{REMOTE_ADDR} ###\.###\.###\.##
RewriteRule (.*) http://example.com/ [R=301,L]
RewriteCond %{REMOTE_ADDR} ###\.###\.###\.##
RewriteRule (.*) http://example.com/$1? [R=301,L]
RewriteCond %{REQUEST_URI} ^logid$ [NC,OR]
RewriteRule (.*) /$1? [F,L]

Same results.

11:01 pm on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Is the 'test link' you added in canonical form, or server-relative?


11:14 pm on Aug 23, 2006 (gmt 0)

5+ Year Member

Canonical. I also tried a local link (just filename.html); the source gets rewritten as:
<a href="http://scraper.tld/o.php?logid=http%3A%2F%2Fexample.com%2F">Canonical</a>
<a href="http://scraper.tld/o.php?logid=http%3A%2F%2Fmysite.com%2Ffilename.html">Local</a>

Base hrefs do nothing either. When caught in the proxy, they are rewritten as:

<base href="http://scraper.tld/o.php?logid=http%3A%2F%2Fmysite.com%2F">

[edited by: LunaC at 11:39 pm (utc) on Aug. 23, 2006]

11:59 pm on Aug 23, 2006 (gmt 0)

5+ Year Member

Even forms are sent through them; my search form, which goes through my cgi-bin, gets rewritten as:

<form method="get" action="http://www.scraper.tld/o.php">
<input type="hidden" name="logid" value="http://mysite.com/cgi-bin/search.cgi">

the original is:

<form method="get" action="http://mysite.com/cgi-bin/search.cgi">

Contact forms are altered the same as well.

What I'm most worried about is this spreading and personal information being stolen. They are proxying Wikipedia and the ODP (and altering all links etc. to keep everything inside the proxy).

I'm far from the only one this could be affecting :S

Still no real response from their host other than that I'm in the queue to be looked after.

1:06 am on Aug 24, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

All I can recommend is that you find out whether this is a 'live' proxy that fetches your pages every time someone requests one through their site, or whether it grabs your pages, caches them, processes them to change the links, and then serves those processed pages for some period of time.

Either way, the answer is to serve 403s, and not worry about the address bar. Once you are successfully blocking them, then you can possibly cloak your site so that they get useless pages from your site. If you serve them pages with no links at all, they can't very well modify them or impersonate your real site.

If they are actually grabbing and keeping copies of your pages, DMCA them. Otherwise, things get a bit more tricky, but you can still report this to search engines through their webmaster contact addresses (if you can find them).


1:12 am on Aug 24, 2006 (gmt 0)

10+ Year Member

If they are refetching your pages occasionally, you could serve them a 200 response and a completely blank page. Serving them a 403 response may just result in them keeping the older cached copy.
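One way to sketch that with mod_rewrite is an internal rewrite to an empty file, which returns a normal 200 response rather than a 403 (the IP and the blank.html filename are placeholders):

```apache
RewriteEngine On
# Silently hand the proxy an empty page instead of real content.
# An internal rewrite (no [R] flag) keeps the 200 status code.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.42$
RewriteRule !^blank\.html$ /blank.html [L]
```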

That all sounds like they're trying to get hold of usernames and passwords so they can hack into other sites.

1:15 am on Aug 24, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

I was also thinking it might be useful to send them a graphic that says, "You are NOT on <correct domain>, please visit us by typing <correct domain> into your browser address bar."
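That idea could be sketched like this, assuming the warning graphic lives at /not-our-site.png (the IP and filenames are placeholders):

```apache
RewriteEngine On
# Replace every image the proxy fetches with a warning graphic,
# excluding the warning graphic itself to avoid a rewrite loop.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.42$
RewriteCond %{REQUEST_URI} !/not-our-site\.png$
RewriteRule \.(gif|jpe?g|png)$ /not-our-site.png [L]
```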

But that is step three, and we're still on step one.


1:19 am on Aug 24, 2006 (gmt 0)

5+ Year Member

It's a live proxy (updates in real time), and all attempts at 403s are still returning a 200 header response when trapped in the proxy, no matter what I've tried (checked with the Live HTTP Headers extension). I'll search the server logs tomorrow when they arrive to see if they give any more info.

As for cloaking, I'll look into that tomorrow as well. That's an area I've completely avoided... scares me almost as much as this :S

I've already contacted Google, but they are known for canned (if any) responses.

I'm out for tonight, thanks for your help and have a good night.

2:05 am on Aug 24, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

You won't be able to do anything about the 200-OK response from the proxy. It comes from the proxy, and is out of your control. So waste no time or mental effort on that.

All you need to do is serve them blank or alternate pages, so they will no longer rank for your keywords, and so that "your" visitors don't get scammed on their site.

That's your first priority, and it's really the only major concern here.

I wouldn't worry too much about the cloaking aspect of it. The cloaked pages are only being served to the proxy site, and you are making no attempt to fool search engines or your visitors.

If you think cloaking is always seen as bad, you should check out major sites like CNN, The Washington Post, etc. They all serve alternate content to search engines for different reasons, but make no attempt to deceive anyone. And that is OK. It is cloaking with intent to deceive search engines or visitors that is frowned upon by search engines.