
Proxy Server URLs Can Hijack Your Google Ranking - how to defend?

     
1:59 pm on Jun 25, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts: 431
votes: 0


I posted about this in the back room, but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well-ranked index page dropped from the results and now has no title, description, or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such:
[scumbagproxy.com...]

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday 30,000. The number of sites affected is increasing exponentially, and your site could be next.

Take preventative action now by doing the following...

1. Add this to the <head> of all of your pages:

<base href="http://www.yoursite.com/" />

and if you see an attempted hijack...

2. Block the site via .htaccess:

RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

3. Block the IP address of the proxy

order allow,deny
deny from 11.22.33.44
allow from all

4. Do your research and file a spam report with Google.
[google.com...]

3:50 am on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 31, 2004
posts:786
votes: 0


Hi all,

Had the exact same experience that Synergy described with one of our sites, and yes, file a DMCA complaint, which will get the offending site removed.

The problem is that a) it takes a while for Google to act and remove the site, and b) because of the mass duplication of your site's pages you can get hit with a 950+ penalty, which ensures you don't rank until the mess is cleared up and Google has updated its index. You don't instantly go back to where you left off!

The prevention methods work. The other thing to add, which is absolutely vital, is this:

You should make sure your URLs are absolute!

I can't stress this enough; it will limit the extent of any damage. One reason the one site of ours was hit so badly was that the URLs were not all absolute, so the proxy was able to get all of its pages indexed in Google as Synergy described (pages going up by the day), from the index page to every other page on the site, because Google was following /page2, /page3, etc. as if they belonged to the proxy. Had the links been absolute, Google would have seen them as "outbound links" on the proxy server pages it was caching, if you follow me.

Rich

4:18 am on June 29, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts:431
votes: 0


You don't instantly go back to where you left off!

After implementing the things I mentioned in the first post of this thread, it took three days for Google to pick up the changes and update its index with me back in the results. I first showed up two slots below where I was when I was dropped. By the next morning I had risen one slot. Still one more to go to be back to pre-hijack.

In total, I was off of Google for 4 days.

3:26 pm on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


RichTC,

Absolute URLs are not a defense in this case, as most of these proxy scripts rewrite whatever links are there.

Synergy,

Glad to hear you are getting back to normal.

4:10 pm on June 29, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts:431
votes: 0


To summarize all suggestions being made as well as my own pursuit, here's what I've gathered...

Preventative Measures

1. Add this to the <head> of all of your pages:

<base href="http://www.yoursite.com/" />

2. Set up Google Alerts [google.com] to notify you daily the results of the following queries: inurl:yoursite.com and yoursite.com

3. Implement a double reverse DNS check to help identify Googlebot. A search for "Double-Reverse DNS" in Google brings up a site that refutes D-RDNS as a strong all-around security measure, so please take that information into consideration.

If your site has been hijacked within Google

1. Use a site such as dnsstuff.com to attempt to get the IP, DNS, and hosting company of the hijacking site.

2. Ban the IP address and the URL at server level using htaccess (Apache)

order allow,deny
deny from 11.22.33.44
allow from all

RewriteCond %{HTTP_REFERER} hijackingwebsite\.com [NC]
RewriteRule .* - [F]

3. Search Google for "hijackingwebsite.com" to see what info you can find related to the culprit. I found a unique username and a forum post advertising the hijacking proxy, a personals page with all kinds of juicy information as well as another related business he owns. Use the information for what you will.

4. File a spam report with Google [webmasterworld.com]. It may take them awhile to respond but hopefully the first two steps will have you back in the game before they even read your submission.

5. Contact the hosting company (if available) of the hijacker's website to advise them of the problem and inquire about where to send a subpoena if the issue is not resolved.

6. Contact a legal professional.

8:31 pm on June 29, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


I've been fighting proxy hijacking for a long time and most of your list is a complete waste of time and redundant.

Not like I know anything about this topic, as I've only done speaking engagements on this very topic for the last two years at tech conferences around the country, so what do I know about it...

The reverse-forward DNS is the ONLY thing you need to do!

Once this is done you don't need to waste time blocking IPs as the reverse-forward DNS checks block the sites for you. Besides, there are hundreds if not thousands of these proxy sites and they pop up daily all over the place.

Forget the refuters as the PhDs of Google recommend the reverse-forward DNS and the only weak link in the chain is the REVERSE DNS, which can be spoofed, which is why you have to do the forward DNS to complete the loop.

IP -> REVERSE DNS -> FORWARD DNS = Original IP

Example, Googlebot comes crawling:

66.249.72.7 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.7 -> crawl-66-249-72-7.googlebot.com -> 66.249.72.7

If the reverse DNS was spoofed, the final IP address wouldn't belong to googlebot.com and the process would fail. If your entire DNS system is whacked and forward DNS is failing to properly resolve, you have much bigger problems than a proxy hijacking, as your DNS server is probably hacked.
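For anyone who wants to see that loop as code, here is a bare-bones sketch in PHP (untested, with no caching or error handling); gethostbyaddr() does the reverse lookup and gethostbyname() the forward one:

<?php
// Minimal sketch of the reverse-forward DNS check described above.
function verify_googlebot($ip) {
    $host = gethostbyaddr($ip); // e.g. 66.249.72.7 -> crawl-66-249-72-7.googlebot.com
    if (!preg_match('/\.googlebot\.com$/i', $host)) {
        return false; // reverse DNS doesn't end in googlebot.com
    }
    return gethostbyname($host) === $ip; // forward lookup must return the original IP
}
?>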

None of the rest is needed once reverse-forward DNS checking is in place for Googlebot, Slurp, MSN Live, etc.

1. Add this to the <head> of all of your pages:
<base href="http://www.yoursite.com/" />

Waste of time, as a good proxy site will convert that to:
<base href="exampleproxy.com/nph-proxy.pl/000000A/http/www.yoursite.com/" />

2. Set up Google Alerts to notify you daily the results of the following queries: inurl:yoursite.com and yoursite.com

Not needed with the proxy blocking as hijacking will never happen.

2. Ban the IP address and the URL at server level using htaccess (Apache)

order allow,deny
deny from 11.22.33.44
allow from all

RewriteCond %{HTTP_REFERER} hijackingwebsite\.com

Forget blocking the IPs as there are hundreds, if not thousands of these sites.

If you do ONE THING, do the reverse-forward DNS check; it stops this. Otherwise you'll be fighting this problem until the day you die, as blocking them individually is a total waste of time and a completely false sense of security: another proxy site will pop up to replace each one the same day.

Do it right, do it once, then enjoy your website and stop chasing online hijackers.

Not sure why you keep dragging that RewriteCond along as it has absolutely NOTHING to do with proxy hijacking.

Lawyers? WHOA! Are you KIDDING?

5. Contact the hosting company (if available) of the hijacker's website to advise them of the problem and inquire about where to send a subpoena if the issue is not resolved.

6. Contact a legal professional.

This is about the most dangerous advice I've ever heard and you better really be careful what you do here or you'll lose your house, car, wife, kids...

The proxy site, unless you can prove they cloaked a list of sites to Google, has actually done NOTHING wrong!

Even filing a DMCA complaint against the proxy server can blow up in your face because the proxy site DOES NOT contain any of your content, never has, never will. The ISP/Host won't find your content on their server and then the recipient of a fraudulent DMCA report can turn around and legally sue your socks off.

Remember, just because GOOGLE has a BUG doesn't mean you can run around frivolously filing complaints and taking legal action unless you want to find yourself in a serious amount of pain.

BTW, anyone can invoke Googlebot to crawl through your site via the proxy server, so you had better be VERY careful who you blame. There's no way to really know short of a subpoena to get a complete copy of the proxy web site, and even at that the list of domains they crawl could be hosted elsewhere, so you would be left holding a very empty and expensive bag and possibly in legal jeopardy.

I could explain further, but suffice it to say that just because your site is hijacked does NOT mean the proxy site is responsible. Remember, it's a bug in Google, so suing someone for something they have no control over... I wouldn't go there. BAD IDEA.

File that spam report with Google

That's about the safest thing to do, and/or file a DMCA complaint strictly with Google as THEY are the only ones duplicating your content and it's stored in the Google cache.

WHY WAIT ON GOOGLE? FIX YOUR OWN PROXY HIJACKING PROBLEM!

Here's the million-dollar answer to getting all your proxy-hijacked pages out of Google and recovering everything: FIX IT YOURSELF!

After installing the reverse-forward DNS checking code, make your code return a page of text such as "Nothing to see here Google, this is a thwarted proxy hijacking" just to give Google something to index, as they seem to take longer to process 403 ERRORS.
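Something along these lines would do it; a rough, untested PHP sketch with the reverse/forward check inlined (adapt the wording and the check to whatever validation you already run):

<?php
// If something calling itself Googlebot fails the reverse/forward DNS check
// (a proxy or other fake), answer with a 200 and throwaway text instead of a 403.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);
    $real = preg_match('/\.googlebot\.com$/i', $host) && gethostbyname($host) === $ip;
    if (!$real) {
        header('Content-Type: text/plain');
        echo "Nothing to see here Google, this is a thwarted proxy hijacking";
        exit; // stop before any real content is sent
    }
}
?>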

If you're impatient and don't want to wait on Google to reindex the hijacked page, try to direct them back through the proxy. You can attempt this by installing an actual link back to your site via the proxy on a page on your site somewhere you know Google will crawl often, and use the proxy against itself to replace your listing with garbage.

For example, this link on your page will direct a crawl via the proxy site:
[exampleproxy.com...]

Now, instead of your hijacked page showing in Google, you'll see the following indexed:

Nothing to see here Google, this is a thwarted proxy hijacking

You now should have control over your previous page content.

This is exactly what I did to get rid of THOUSANDS of hijacked pages in Google and Yahoo, without filing a single complaint.

Summary

It's all about working SMART and not working HARD or wasting your time on fruitless endeavors.

The reverse-forward DNS spider validation is the only proxy blocker you need. Install it and then you can ignore the hijacking problem as it WILL completely resolve itself in time as all the spiders crawl the proxy a second time and remove your previously hijacked listings or replace them with junk (my personal favorite).

[edited by: incrediBILL at 8:49 pm (utc) on June 29, 2007]

8:35 pm on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Synergy,

Another cog in the infrastructure is the hosting provider's pipe provider. A traceroute on the domain followed by a whois on the IP address will yield that.

Don't ever count on a base href for saving your bacon when one of these thingies is involved.

If someone is doing this intentionally, what is one more regular expression to nail a base href?

Tedster,

You might want to edit a bit more; you missed one, and I didn't do it ;-).

8:42 pm on June 29, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:July 8, 2003
posts:431
votes: 0


incrediBILL, thank you for sharing your expertise on the subject. Hopefully this thread will help others who are affected by this in the future. It's a problem that isn't going away anytime soon, and that Google can't seem to fix.

[edited by: synergy at 8:50 pm (utc) on June 29, 2007]

8:46 pm on June 29, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


TheBear said...
"GET /robots.txt HTTP/1.0" 200 27 "-" "-"

OK, obviously the reverse-forward DNS doesn't catch this as you don't know who the crawler is. However, I also block blank user agents and block anyone from crawling that isn't authorized to read robots.txt in the first place.

Remember, robots.txt is a SPIDER TRAP itself. If you don't ask for robots.txt you might stumble into another spider trap I set, and if you DO ask for robots.txt and aren't authorized you've already hit the spider trap and are blocked.

I only authorize Google, Yahoo, MSN, Ask and a couple of others to read my robots.txt file and anyone else requesting it has their IP instantly flagged and blocked from further access.

Therefore, Google wouldn't be able to crawl via a proxy that completely discarded the user agent.
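For anyone who wants to try the robots.txt trap, here is an untested sketch in PHP, assuming requests for robots.txt are routed to a script (for example with "RewriteRule ^robots\.txt$ /robots.php [L]"); the file names and reverse-DNS suffixes are illustrative only, so verify them against each engine's own documentation:

<?php
// Serve the real robots.txt only to crawlers whose reverse/forward DNS checks out;
// flag everyone else who asks for it.
$ip   = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip);

$authorized = preg_match('/(\.googlebot\.com|\.crawl\.yahoo\.net|\.search\.live\.com)$/i', $host)
    && gethostbyname($host) === $ip; // forward lookup must loop back to the same IP

if ($authorized) {
    header('Content-Type: text/plain');
    readfile('robots_real.txt'); // the rules you actually want the big engines to see
} else {
    // Anyone else requesting robots.txt has walked into the trap: flag the IP for
    // your blocking layer (a flat file is used here purely for illustration).
    file_put_contents('/tmp/flagged_ips.txt', $ip . "\n", FILE_APPEND);
    header('HTTP/1.0 403 Forbidden');
}
?>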

8:50 pm on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Hi Bill,

You know I suggested blocking empty user agents, however it turns out some folks didn't like their favorite checking tools getting blocked. So what can I say, horse ==> water ==> 6 feet or so down I guess.

9:08 pm on June 29, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


You know I suggested blocking empty user agents, however it turns out some folks didn't like their favorite checking tools getting blocked. So what can I say, horse ==> water ==> 6 feet or so down I guess.

Blocking bad user agents has stopped the scraped copies (another form of content hijacking) from showing up in Google, so I really don't care if there are some innocent casualties in the battle for my content.

In the last year, while fighting all this nonsense, I managed to move up the ranks from only 400K visitors a month to 900K+ (maybe 1M, we'll see how the month ends). This wouldn't have been possible if the scrapers and hijacked pages had been left unchecked, as I would still be competing against myself in Google, which I was before I went draconian on content access rules. Now it's not a problem.

Basically, if their checking tools can't be patched to render a proper user agent, then tough nuts, it's blocked. To be perfectly honest, if someone writing code to run on the internet can't set the user agent string, that code probably isn't good enough to be allowed on my site in the first place.

YMMV

[edited by: incrediBILL at 9:09 pm (utc) on June 29, 2007]

10:18 pm on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


>> Why Google is incapable of and/or refuses to block crawling via these sites, after all these years, which appear to have obvious detectable fingerprints is beyond me. <<

That, I think, is the real question here.

11:00 pm on June 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Not to rain on anybody's parade here, but some of the most aggressive site abuse comes from spam bots using Google Web Accelerator accounts, and it's been going on for a long time. Don't think for a second that allowing only legitimate Google IPs to visit your site is going to somehow prevent your site from being ripped. There is a reason why so many Google IPs are showing up in blacklists.

If proxy sites aren't plugged into Google Web Accelerator yet, it won't be long before they are. It's ripe for abuse.

11:20 pm on June 29, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


spam bots using Google Web Accelerator

The Google Web Accelerator's reverse DNS doesn't resolve to ".googlebot.com", so it's not an issue: even someone spoofing Googlebot via the Google Web Accelerator would get blocked.

Consider the parade not rained upon.

12:09 am on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Yeah, you can come up with a million reasons not to block this or that using this method or that, and that's fine -- It's your site, your mix of visitors, your content that appeals to your kind of scraper that abuses your site. Everybody's situation is a little different.

However, just because this or that method has a drawback, or has a small-to-medium-sized hole in it, doesn't mean you should just say "I give up, it's too hard, let's all just blame it on Google," and go on suffering abuse.

[Donning devil's hat] Actually, I guess it's OK. Go ahead and let them scrape your sites... Some of you are my competitors, and it's OK with me if you get buried in the SERPs... If you're not my competitors, then listen to what IncrediBill is saying, and implement those methods which are possible with your hosting setup, and which address the needs of your site and the abuse it is suffering or may suffer... [doff hat]

For the U.S. market, you can pretty much block all requests that have both a blank referrer and a blank user-agent, except for HEAD requests from AOL proxy IP address ranges.
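A minimal, untested PHP sketch of that rule; the AOL prefixes below are placeholders only, so substitute the ranges you actually trust:

<?php
// Deny requests that send BOTH a blank User-Agent and a blank Referer,
// except HEAD requests coming from AOL proxy ranges.
$aol_prefixes = array('64.12.', '152.163.', '205.188.'); // placeholders: verify before use

$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';
$referer = isset($_SERVER['HTTP_REFERER']) ? trim($_SERVER['HTTP_REFERER']) : '';
$ip      = $_SERVER['REMOTE_ADDR'];
$method  = $_SERVER['REQUEST_METHOD'];

if ($ua === '' && $referer === '') {
    $from_aol = false;
    foreach ($aol_prefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) { $from_aol = true; break; }
    }
    if (!($method === 'HEAD' && $from_aol)) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
}
?>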

Jim

12:38 am on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Try as you might, sometimes site owners aren't all that aware of what is out there.

But I don't have to worry about it at the moment.

My cure for such critters depends on the critter; some (most) of the critters are downright brain-damaged and contain within themselves the means to get banned. Funny thing about unbounded recursive systems ;-).

All I've said is don't expect certain things to protect you. However, it is still baffling that, given the wonderful ability to make order out of chaos, the footprints of these beasts haven't already been taken advantage of by the powers that be. Maybe it is a processing power issue at this time.

With that, I'll find something cold and frothy.

Continue on; I'm looking for new ideas. Maybe one will show up here.

2:23 am on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Googlebot and Mediapartner IPs don't always reverse resolve. It doesn't happen that often but often enough to make me think twice about depending on this kind of fix. You don't want to be feeding Googlebot [F], redirections, or other garbage intended for malicious bots or visitors.

Googlebot gives us more than a REMOTE_ADDR or a HTTP_USER_AGENT header when it makes a file request from a site. It always sends an HTTP_FROM header [googlebot(at)googlebot.com] and it doesn't leave an HTTP_ACCEPT_LANGUAGE or HTTP_REFERER header. If you have a high traffic site and you don't want to slow it down by having to reverse resolve every single hit, these headers can be used to make a very effective trap for fake Googlebot user-agents. Googlebot IP ranges are well known. You can always feed questionable user-agents a meta noindex,nofollow just in case it's a malformed, but legitimate hit from Googlebot.
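A rough PHP rendering of that header fingerprint (based on the observations above, so check it against your own logs before trusting it): anything calling itself Googlebot that arrives with a Referer or Accept-Language header, or without the From header, gets the meta noindex,nofollow instead of an outright ban. It would need to be echoed inside <head>:

<?php
// Fingerprint check for things claiming to be Googlebot, per the headers described above.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false) {
    $questionable = isset($_SERVER['HTTP_REFERER'])
        || isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        || !isset($_SERVER['HTTP_FROM']); // real Googlebot is said to always send From:
    if ($questionable) {
        echo '<meta name="robots" content="noindex,nofollow">' . "\n";
    }
}
?>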

A simple method I've found useful for detecting and blocking web proxies is a simple JavaScript/noscript solution that compares the domain name in the address bar URL with the server domain name. No match or noscript, the page is disabled and the spider trap bans the visitor. Any spider following up on the link would be prevented from accessing the page. Even if the proxy IP were changed, the bot would quickly fall into a spider trap hole and end up banning itself.

Which is an important note: spider traps are very powerful tools against these types of malicious attempts. Even if the web proxy didn't properly forward Googlebot's user-agent to be detected for incrediBILL's suggested fix, it still would be banned.

4:29 am on June 30, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Googlebot and Mediapartner IPs don't always reverse resolve.

Please give us one valid example as Matt Cutts and Google claimed the reverse DNS project was completed, and I've never seen a failure yet since they made the announcement.

Is it possible your DNS server isn't properly updated?

If you have a high traffic site and you don't want to slow it down by having to reverse resolve every single hit...

I cache the reverse DNS results for 24 hours, so the performance impact is negligible, basically non-existent for all server load purposes.
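One way to get that kind of caching in PHP is a tiny file-based cache keyed by IP, so each suspicious IP costs at most one reverse lookup per day. This is purely illustrative (a shared-memory cache such as APC or memcached would do the same job, and sys_get_temp_dir() needs PHP 5.2+):

<?php
// Cache gethostbyaddr() results for 24 hours, one small file per IP.
function cached_rdns($ip, $ttl = 86400) {
    $file = sys_get_temp_dir() . '/rdns_' . md5($ip);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return trim(file_get_contents($file)); // cache hit
    }
    $host = gethostbyaddr($ip); // cache miss: do the lookup once
    file_put_contents($file, $host);
    return $host;
}
?>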

A simple method I've found useful for detecting and blocking web proxies is a simple JavaScript/noscript solution that compares the domain name in the address bar URL with the server domain name. No match or noscript, the page is disabled and the spider trap bans the visitor.

This is a really BAD IDEA because quite a few visitors have Javascript disabled when visiting new sites because of the malicious IFRAME INJECTOR scripts that infect many sites all over the web.

Most likely you're bouncing a bunch of real live people just because they're trying to protect themselves from malicious internet code.

BTW, bots don't have address bars, so you're blocking Google, Yahoo and MSN?

Even if the web proxy didn't properly forward Googlebot's user-agent to be detected for incrediBILL's suggested fix, it still would be banned.

That's why I do both. There isn't really a one-size-fits-all solution, but what I recommend will block a lot of nonsense: I've blocked over 1,400 fake Googlebot hits this year, and most were from proxy sites.

FWIW, I know they were from proxy sites because the message my code sends when it detects a proxy server is indexed in Google and I can see a lot of the thwarted attempts still sitting in Google's index.

[edited by: incrediBILL at 4:31 am (utc) on June 30, 2007]

6:01 am on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Please give us one valid example as Matt Cutts and Google claimed the reverse DNS project was completed, and I've never seen a failure yet since they made the announcement.

I turned off the detector plug-in I was using to catch these types of infractions a while back, so I haven't been keeping up on the issue. But it happens. Things break.

I cache the reverse DNS results for 24 hours so it's a minimal impact

Still, it's a resource hog. It might speed it up somewhat, but unless you get the exact same IPs every day...

It just sounds like overkill to me. Sorta like killing a flea with a 50-pound sledgehammer. You are blocking an average of ~8 rogue Googlebot requests a day, but at what cost? If you only have a small number of hits a day I don't see a problem with it, but thousands a day, tens of thousands, or more? If this is the case, why not use Googlebot IP ranges and save your resources for your visitors? It's just a suggestion, a different point of view. Don't get your feathers all riled up. :)

This is a really BAD IDEA because quite a few visitors have Javascript disabled when visiting new sites because of the malicious IFRAME INJECTOR scripts that infect many sites all over the web.

Quite a few? Doubtful, lol! Where did you pull that one from? Maybe .001%. Most people don't know what an IFRAME INJECTOR script is, much less an iframe or even JavaScript itself. The truth is most people have JavaScript turned on because that's the way it came in the box, and they wouldn't know how to turn it off if they cared to. Besides, many sites are useless unless you have JavaScript enabled, and it wouldn't be worth the trouble to them to disable it. Some webmasters might not want JavaScript-disabled browsers on their site anyway, due to lost ad revenue or other problems.

But, if you're going to take that point of view, why are you using spider traps? Surely a savvy visitor would have his or her browser settings set in such a way to show your spider trap link. How many of your innocent visitors have been banned by accidentally clicking on that link? I'm guessing about .001% (probably less).

But that's beside the point anyhow. You don't have to ban the IP if you don't want to. A 403 is good enough security for most folks, with a good META NOINDEX,NOFOLLOW INJECTOR for peace of mind.

BTW, bots don't have address bars, so you're blocking Google, Yahoo and MSN?

Nope, and add Ask to that list. I have them all isolated by IP ranges. But if they come through a proxy they're banned (well, the proxy is). Not through JavaScript but through my own security software.

All visitors to my site that don't have an approved search engine IP get a meta noindex,nofollow tag.
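A sketch of that injector in PHP (untested); the IP prefixes here are placeholders only, so maintain your own list from the engines' published ranges and your logs, and echo the tag inside <head>:

<?php
// Any visitor whose IP doesn't fall in an approved search engine range
// gets a meta noindex,nofollow tag.
$approved_prefixes = array('66.249.', '72.30.', '65.55.'); // placeholders: replace with ranges you trust

$ip = $_SERVER['REMOTE_ADDR'];
$approved = false;
foreach ($approved_prefixes as $prefix) {
    if (strpos($ip, $prefix) === 0) { $approved = true; break; }
}

if (!$approved) {
    echo '<meta name="robots" content="noindex,nofollow">' . "\n";
}
?>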

10:01 am on June 30, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 19, 2003
posts: 46
votes: 0


G says (somewhere else) that proxies listing your site in the G cache should not influence SERPs, and asks for SERPs that show proxies ranking above the original page, for "ordinary" search phrases.

I have found a weird case where a PR5 page is pushed out of the SERPs in favor of a supplemental proxy version of the page, but I don't know if it's the right sample.

I have been searching for more samples, but I admit it's not easy. Do people have actual SERP samples where a proxy has taken over the original?

12:45 pm on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


An indexed proxy page is off-domain duplicate content, so what you remember Google saying has in the past been refuted by:

the Matt Cutts bacon fiasco, not to mention all of the lovely 302 issues.

Reduced to its essential components, a proxy issue is offsite duplicate content along with internal link structure, so while in theory it can happen to any page, it is more likely to happen on pages that rank because of consistent internal linking.

It is also the method behind presenting a copy of a high-ranking page to the search engine as your own content and doing a JavaScript redirect when the viewer isn't a search engine.

I've also seen other tactics used in conjunction with the copying.

[edited by: theBear at 12:46 pm (utc) on June 30, 2007]

6:24 pm on June 30, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Still, it's a resource hog. It might speed it up somewhat, but unless you get the exact same IPs every day...

I run a fairly high volume site and it's not even a blip on my radar.

You seem to be under the impression I run reverse DNS on 40K+ IPs a day and that would be silly. Exactly how much of a resource hog do you think processing reverse DNS can be for a couple of hundred IPs daily that claim to be Googlebot, Slurp or one of the other SEs?

Can you say INSIGNIFICANT?

Sheesh.

Quite a few? Doubtful, lol! Where did you pull that one from?

I look at my stats, as I track who's loading ads, using JavaScript, cookies, etc., and the percentage of non-JavaScript users is much higher than .001%, so don't kid yourself.

It also depends on your audience and whether they're more technical in nature, especially if you have a lot of Firefox visitors using things like the "NoScript" plug-in or other security tools.

Considering I block data centers, they're all home or office visitors, so either close to a full percentage point of my visitors have JavaScript disabled or I'm being scraped by a really big botnet a couple of pages at a time. Either way, I can't just dump that many IPs for not using JavaScript.

I have them all isolated by IP ranges

IP range alone is insufficient unless it's a very narrow range, as all the SEs have other services that allow non-bots access as well, and people scrape via the translators and various proxy servers available. So although it's better than nothing, I'll stick with the reverse DNS for its pinpoint accuracy, and it stays flexible as the SEs change IP utilization.

Surely a savvy visitor would have his or her browser settings set in such a way to show your spider trap link. How many of your innocent visitors have been banned by accidentally clicking on that link?

I'm not sure how a 'savvy' visitor would see a link in a hidden, 0px-sized iframe, as you can't tab to it, click it, nothing.

[edited by: incrediBILL at 6:38 pm (utc) on June 30, 2007]

6:57 pm on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Sorry I can't add more to this discussion, but I want to inject a note of thanks to incrediBill for sharing his experience and his clarity on this topic. You've really helped me a lot - both with proactive steps that you know work, as well as reasons we might want to avoid some other approaches.

7:56 pm on June 30, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


I agree with tedster, but in practical terms, how does a non-programming sort of person implement this? Digging around, I found some PHP along the following lines - assume it would go at the top of the page, before <html>?


<?php
// Get the user agent.
$ua = $_SERVER['HTTP_USER_AGENT'];
// Check the user agent to see if it's identifying itself as a search engine bot.
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
    // The user agent is purporting to be MSN's bot or Google's bot.
    // If the user agent string is spoofed, we won't find googlebot.com in the host name.
    // Get the IP address requesting the page.
    $ip = $_SERVER['REMOTE_ADDR'];
    // Reverse DNS lookup the IP address to get a hostname.
    $hostname = gethostbyaddr($ip);
    // Check for '.googlebot.com' and 'search.live.com' at the end of the hostname.
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
        // The host name does not belong to either live.com or googlebot.com,
        // yet the UA already said it is either MSNBot or Googlebot. Block it.
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Now we have a hit that half-passes the check. One last go:
        // forward DNS lookup the hostname to get an IP address.
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            // The forward lookup doesn't match, so the reverse DNS was spoofed.
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            // Real bot.
            $block = FALSE;
        }
    }
}
?>

4:49 am on July 1, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:July 19, 2004
posts:142
votes: 0


This may be a real dumb question, but isn't what is being done to SUSPECTED hijacking attempts a form of cloaking?

Are there any risks in Google "auditing" their bot results by using IPs that don't pass the reverse/forward DNS checking? (i.e. you pass the auditing IP a different page than the "acceptable" bot?) Or, maybe now, G does their cloak checking by coming in on IPs that pass the reverse/forward DNS check?

1:26 pm on July 1, 2007 (gmt 0)

System Operator from US 

incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14664
votes: 99


Are there any risks in Google "auditing" their bot results by using IPs that don't pass the reverse/forward DNS checking? (i.e. you pass the auditing IP a different page than the "acceptable" bot?) Or, maybe now, G does their cloak checking by coming in on IPs that pass the reverse/forward DNS check?

If you were cloaking to Googlebot and did so based on the user agent alone you would still be cloaking to Googlebot even if it came via proxy.

If Googlebot is doing any checking on cloakers, they certainly wouldn't use known Google IPs or blatantly call it Googlebot, otherwise most people would continue to show Google the cloaked content so it wouldn't work.

Google can't tell you're doing a reverse/forward DNS check either...

2:25 pm on July 1, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 1, 2004
posts:1258
votes: 0


Bill, regarding the "reverse/forward DNS check" - do you know of any place that has the "correct" instructions that show how to do this? I use cPanel if that helps. I'm at a total loss (clueless) as to how to accomplish this but I believe it would be well worth doing it.

[edited by: tedster at 4:42 pm (utc) on July 1, 2007]

4:32 pm on July 1, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:July 19, 2004
posts:142
votes: 0


This thread started out with synergy providing some techniques for avoiding site / content hijacking.

Others have brought forward lots of useful insight and comments as well, especially incrediBILL with what appears to be an almost bullet proof way to deal with the issue.

Would it be too much to ask for concrete code (e.g. .htaccess) to implement a viable solution to this nasty problem?

5:13 pm on July 1, 2007 (gmt 0)

Preferred Member from CH 

10+ Year Member

joined:Mar 10, 2004
posts:429
votes: 0


"......Bill, regarding the "reverse/forward DNS check" - do you know of any place that has the "correct" instructions that show how to do this? I use cPanel if that helps. I'm at a total loss (clueless) as to how to accomplish this but I believe it would be well worth doing it......"

I second that; a pointer in the right direction for us cPanel and Plesk users running simple HTML/CSS content sites would be much appreciated :)

5:53 pm on July 1, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 19, 2003
posts:804
votes: 0


Such a function can be written using any valid programming language on your server and using the
RewriteMap facility of the Apache rewrite system.

# External Rewriting Program
MapType: prg, MapSource: Unix filesystem path to valid regular file

Here the source is a program, not a map file. To create it you can use the language of your choice, but the result has to be an executable (i.e., either object-code or a script with the magic cookie trick '#!/path/to/interpreter' as the first line).

This program is started once at startup of the Apache servers and then communicates with the rewriting engine over its stdin and stdout file-handles. For each map-function lookup it will receive the key to lookup as a newline-terminated string on stdin. It then has to give back the looked-up value as a newline-terminated string on stdout or the four-character string ``NULL'' if it fails (i.e., there is no corresponding value for the given key). A trivial program which will implement a 1:1 map (i.e., key == value) could be:

#!/usr/bin/perl
$| = 1;
while (<STDIN>) {
# ...put here any transformations or lookups...
print $_;
}

But be very careful:

1. ``Keep it simple, stupid'' (KISS), because if this program hangs it will hang the Apache server when the rule occurs.
2. Avoid one common mistake: never do buffered I/O on stdout! This will cause a deadloop! Hence the ``$|=1'' in the above example...
3. Use the RewriteLock directive to define a lockfile mod_rewrite can use to synchronize the communication to the program. By default no such synchronization takes place.
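The same thing can be done without Perl. Here is a hypothetical sketch of such a RewriteMap program in PHP, with the server-config wiring (not .htaccess) shown in the comments; the paths, the map name, and the Googlebot-only check are illustrative:

#!/usr/bin/php
<?php
// RewriteMap "prg:" helper: reads one IP per line on stdin, answers OK or DENY.
// Example wiring in the main server or virtual host config:
//   RewriteLock /var/lock/botcheck.lock
//   RewriteMap  botcheck prg:/usr/local/bin/botcheck.php
//   RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
//   RewriteCond ${botcheck:%{REMOTE_ADDR}} =DENY
//   RewriteRule .* - [F]
while (($ip = fgets(STDIN)) !== false) {
    $ip   = trim($ip);
    $host = gethostbyaddr($ip); // reverse lookup
    $ok   = preg_match('/\.googlebot\.com$/i', $host)
        && gethostbyname($host) === $ip; // forward lookup closes the loop
    fwrite(STDOUT, ($ok ? 'OK' : 'DENY') . "\n"); // newline-terminated answer
    fflush(STDOUT); // keep stdout unbuffered, per the KISS warnings above
}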

6:20 pm on July 1, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


For the non-programmer, Apache-non-expert, scripting novice, it can be done with PHP. I posted a script above that seems to work, but it's hard to test properly.

Added: I've only tested it with the Firefox User Agent Switcher set to Googlebot.

[edited by: Patrick_Taylor at 6:22 pm (utc) on July 1, 2007]
