Forum Moderators: DixonJones

.htaccess - How to ban from a specific site?

         

kapow

12:43 pm on Oct 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My sites all have dozens of referrals from who*is.sc
I believe these are spam/harvester related (because there are so many).

What do I put into .htaccess to ban anything from one site?

Note: I haven't attempted the 'almost perfect ban list thing' (not sure of the WebmasterWorld url) as I'm new to banning. Will get round to studying the subject in future.

Robert Charlton

5:14 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>> I haven't attempted the 'almost perfect ban list thing' (not sure of the WebmasterWorld url)

kapow - I haven't tried the perfect ban list thing either, but I remembered this link to the Updated Robots list from the current WebmasterWorld home page:

[webmasterworld.com...]

There's a forum (forum 11) dedicated to Spider Identification that might give you a better answer than here.

fiestagirl

8:02 pm on Oct 12, 2003 (gmt 0)

10+ Year Member



Try here for info on what it is.

[webmasterworld.com...]
[webmasterworld.com...]

claus

9:49 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The ban's pretty trivial, should you desire it:

RewriteEngine on 
RewriteCond %{HTTP_REFERER} who\*is\.sc [NC]
RewriteRule .* - [F]

The asterisk is a special character and so is the dot, that's why both must be escaped using "\".

If you already have the RewriteEngine on, don't include it again.

bird

11:04 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I understand you correctly, then the site you're getting those referrals from is perfectly harmless. In fact, the owner of that site is a valued member here...

What probably happens is that people are interested in your domain names, or similar names. They can search for those on that site, and then click on a link to get to the respective site for each domain (if one exists).

I'm not quite sure why you'd want to block those visitors, just because they went there first. It's trivial for them to circumvent your block, and I don't see how their visit can do you any harm.

>> The asterisk is a special character and so is the dot, that's why both must be escaped using "\".

The asterisk certainly shouldn't be in that RewriteCond pattern, as it also isn't in the real domain.

claus

11:34 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> the site you're getting those referrals from is perfectly harmless

In addition to this, i'll have to emphasize that a ban is an individual decision, and you should always investigate, to make sure you have a valid reason before you decide to do so. I believe i've been trying to state the same a few times in the "close to perfect .htaccess" thread.

>> The asterisk certainly shouldn't be in that RewriteCond pattern

Sorry about that, i thought it was the verbatim referrer that was posted. I supposed it was a forged referrer string - the asterisk does not appear in real URLs, and the ".sc" is not a common TLD, so i didn't even consider it to be a URL, just a string.

Still, the decision might be valid for kapow even if it should be invalid for everyone else in the world. It was a perfectly unambiguous question ("how do i...") and i do not feel it was wrong to answer it; i would even tell how to ban the Googlebot if that was the issue.

/claus

bird

8:53 pm on Oct 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think now I see what's happening, at least in my logs.

It's not really referrals, but the system at that domain itself will fetch the root page of each domain, to see whether there's a web site online or not. You may or may not like that kind of statistics gathering, but personally I don't see any harm. After all, they *do* provide an extremely useful service to the general public.

Of course you're right, claus, the original question was of a purely technical nature. But I have seen enough people jump to conclusions and ban stuff just because someone else mentioned they'd ban it (often without giving any reasons), so I wanted to encourage a few second thoughts about the matter.

BlueSky

9:21 pm on Oct 13, 2003 (gmt 0)

10+ Year Member



I have no problem with this bot visiting my site. It's quite harmless. If you investigate, I bet you'll find a number of bots crawling around your site that use browser UA's to hide their identity. Those are far more irritating.

kapow

10:03 am on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you, everyone!

1.) Re. That site:
The WebmasterWorld system doesn't let me type the url (for obvious reasons). For the record I think that site is excellent! I use it sometimes myself. However, I think some people are using the facilities on that site for spam/harvesting reasons. I manage some domain names for company/internal use only, names that are not published. Why would there be 15+ referrals from that site for an 'internal-use' name? (The name is also very unusual - you would not mistake it or guess it.) Because this keeps happening with unpromoted domain names, I am suspicious of some users of that site. As you said - if someone wanted to visit the site and they already know the name then they can easily do so.

2.) Re. Banning a site that I choose:
Thanks Claus - I can't think of a reason for doing so but who knows, one day I might want to ban visitors from google. I haven't decided if I want to ban or not for that site, but I do want to know how.

Suppose it is widget.com - how do I ban visitors from it?

Monus

10:43 am on Oct 14, 2003 (gmt 0)

10+ Year Member



hmmm....

A whois database that gathers site information, does bulk whois checks, and gives you no option to block it. To me it looks like a company that wants to grab domains that are in pending delete and have some good traffic, or that provides that information to other people.

That sounds worse to me than a spambot. Or am I the only one here who thinks like that?

claus

12:39 pm on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> how to ban pageviews referred from widget.com or google.com

All you have to do is to change the second line of the example in post #4:

RewriteCond %{HTTP_REFERER} widget\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} google\.com/ [NC]

The last of these two will not catch all Google referrals, as there's also all the Google IP's, but i included it to illustrate the use of the [OR] operator - if i had not included this, an "AND" would be the default, and you can't really be referred from two places at the same time, so that wouldn't work (1).
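
Put together with the rest of the rule set, the result might look like the sketch below. It assumes mod_rewrite is available and that your host allows these directives in .htaccess; widget.com and google.com are only the example names used in this thread. Note that Apache spells the variable HTTP_REFERER, with a single "R" in the middle.

```apache
# Sketch only: assumes mod_rewrite is enabled and .htaccess overrides are allowed.
RewriteEngine on
# Match if the referrer contains widget.com/ ... (OR: the next condition is an alternative)
RewriteCond %{HTTP_REFERER} widget\.com/ [NC,OR]
# ... or if it contains google.com/
RewriteCond %{HTTP_REFERER} google\.com/ [NC]
# Answer any matching request with 403 Forbidden
RewriteRule .* - [F]
```

Only the last RewriteCond goes without [OR]; a trailing [OR] on the final condition would have nothing to combine with.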

Another way of banning visitors referred by these two sites would be this:

RewriteCond %{HTTP_REFERER} (widget¦google)\.com/ [NC]

Which is: "widget OR google" followed by ".com/" (2)

Sooner or later, however, somebody will tell you that you should use this line instead if you want to ban widget.com:

RewriteCond %{HTTP_REFERER} ^http://www\.widget\.com/ [NC]

The character before "http" is an anchor, it means that the string should start like this (3). So, that way of doing it is valid as well, but it will not catch the domain without "www." You'll have to make this subdomain optional by using the "optional"-operator; the question mark - in this example it makes the content of the parenthesis optional:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?widget\.com/ [NC]

As you see, it's already a bit harder to read. Plus, other subdomains, like, say "badbot.widget.com" will not be banned by this.
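
If an anchored pattern that also covers other subdomains is wanted, one possible variant is to allow any host prefix that ends in a dot. This is an untested sketch, not something taken from the ban list thread; the "https?" part is an added assumption so that both http and https referrers are covered.

```apache
# Sketch: intended to match widget.com, www.widget.com and badbot.widget.com.
# [^/]+ cannot cross a slash, so a "widget.com/" later in a path is not matched.
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?widget\.com/ [NC]
```

A host like notwidget.com should not match, because the optional group has to end in a literal dot directly before "widget".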

For this reason i always include only the most significant part(s) of the string(s) i want to match. Everything else is likely to cause some kind of unexpected error at some point. So, could you just write, say, "widget" and not ".com/"? Of course you could, if you are sure you don't mind that a referral from an url like this will be banned as well:

http://myownpage.com/greenwidgetstore/page.html

The pattern will try to match anywhere in the string if you do not specify an anchor. So, just stating the minimum can have side effects too. The ".com/" is normally not used in filenames, so it narrows down the possibilities of banning something that you didn't intend to.

---
I might as well answer that upcoming question straight away. To ban any referrer that is not your own domain, use the "not"-operator "!", like this:

RewriteCond %{HTTP_REFERER} !^http://(www\.)?mypage\.com/ [NC]

I used the "strict" way of declaring the URL here, as i'm confident that you know the URL combinations of your own domain.
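
One caveat with this condition: a request that carries no Referer header at all (a typed URL, a bookmark, many bots) does not match your own domain either, so it would be refused too. A common way around that is to add a condition that lets the empty referrer through. A minimal sketch, assuming mod_rewrite is enabled and using mypage.com as the placeholder from above:

```apache
RewriteEngine on
# Skip the ban when the Referer header is empty or missing
RewriteCond %{HTTP_REFERER} !^$
# Ban when the referrer is anything other than your own domain
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mypage\.com/ [NC]
# Answer matching requests with 403 Forbidden
RewriteRule .* - [F]
```

The two conditions are joined by the default "AND", so the rule only fires when a referrer is present and it is not yours.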

/claus


Notes:
(1) The other operator used, [NC], is a "no case" operator, to make the string case-insensitive.
(2) Note: The OR operator (the pipe) is converted to a broken pipe ("¦") when you post on WebmasterWorld, so before you use this rule in your .htaccess you must replace it with one you enter from your keyboard.
(3) Note: There's also an "end" anchor, it's the dollar sign ("$"). Using it like this matches URLs ending in "gif":

gif$


<edited:typo>

[edited by: claus at 1:31 pm (utc) on Oct. 14, 2003]

kapow

1:09 pm on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



AWESOME CLAUS!

Thank you :)

BlueSky

2:27 pm on Oct 14, 2003 (gmt 0)

10+ Year Member



That site puts up a graphical security code after so many queries to help reduce info collected by automatic scripts. There probably are people who use their info for malicious purposes, but the really serious ones don't have to waste their time bothering with them. It's very easy to write a script that automatically generates email addresses and tests for valid ones. There are also people who sell the entire whois database for any tld either as a whole or any viewpoints you want for relatively cheap prices. These other folks also send out bots, disguised ones, to our sites and collect info on us and our hosts to sell to others. I ran into one of them, and it was very educational learning what he did after tracking him down.

At least whois identifies their bot so those who don't want it on their site can easily ban it. The others don't because they want to hide what they're doing.