Can I do something like…
RewriteCond %{HTTP_REFERER} XXX:+++++++* [OR]
To get rid of those turkeys who use…
XXX:+++++++++++, etc.
And are now using
XXXX:+++++++++++, etc.
They are really screwing up my charts in my backup log.
I mean come on, XXXX:+++++++++++ is more than sufficient
I agree with Birdman that you should not block based solely on this referer, since it comes from a common proxy server in use at many corporations. The user often has no control over whether this "security" feature is turned on. But for an academic exercise:
RewriteCond %{HTTP_REFERER} ^X{3,4}:(\+){2,}$ [OR]
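As a minimal standalone sketch, it might look like this (the [F] forbidden response and the catch-all rule are illustrative, and the [OR] flag is dropped since there is only one condition here):

RewriteEngine On
RewriteCond %{HTTP_REFERER} ^X{3,4}:(\+){2,}$
RewriteRule .* - [F]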
Jim
Every one that I ever had, and I get about two a month, was on an ISP. I've had hits from every Fortune 1000 company and none have used that.
It goes into my AXS log and then the chart and percentages are out in right field. Then I have to edit the bloody thing every month. Whoever is doing it should realize that it is annoying to some, and using nothing or a '-' would still get the job done.
Muchas gracias
Well, that was my point. The proxy authors used X's and +'s to mask the real referer. The actual user may not even know it's happening. Most have never seen a log file and never will.
If you want to redirect them to a special page using that RewriteCond above, you can. But of course, at least the one original request (complete with annoying referer string) will still be in your logs. :(
But it's up to you, of course.
One other thing... If it doesn't work, it may be because those "+" characters are really spaces. Just as "-" is logged for a blank referer or user-agent, "+" is often used in logs for space characters.
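For example, a variant of the pattern above that covers both cases (the backslash-escaped space lets the condition match runs of literal "+" characters or real spaces):

RewriteCond %{HTTP_REFERER} ^X{3,4}:[+\ ]{2,}$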
Jim
You could easily write a script to erase the entries from the log, or any decent text editor should be able to handle a regular expression search and replace.
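For instance, a minimal sketch as a Perl one-liner (the log file name is assumed; -i.bak edits the file in place and keeps a backup copy):

perl -i.bak -ne 'print unless m/X{3,4}:[+ ]{2,}/' access.log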
I don't care that it is in my access_log; I have another log from which I filter out SE bots, etc., and I don't want it to show there because it makes a real mess out of it.
could easily write a script to erase the entries from the
Yea, in my spare time. It's more of a principle thing. It’s not required and to say it is overkill would be an understatement. They are making a statement, and so am I.
If they stop making theirs, mine automatically goes away.
Well not really.
However, be careful who you block - you might want them in the future. You never know when some person who is paranoid is paranoid for a reason - or when some company like Alexa (which is listed in some of the master block lists) might be incorporated in a metric you care about. This is not to say you shouldn't block the truly abusive, but otherwise - I just don't get it.
I was recently "hired" more or less to help increase the SE traffic to this company's site.
At first inspection, it looked to me like his site was blocked by Google. On further inspection, the geniuses he hired before me (for web design) had BOTH a robots.txt and meta tags on EVERY page blocking all search engines.
$10,000 and 30 days later - he is in Google.
Ok I am kidding about the 30 days....
and the $10K - I turned the job down.
Blocking referrals is fine. It is when users start falsifying them that action must be taken.
Unless someone can specifically name a product that uses this referer, I still maintain it is a server exploit bot that is being used. It isn't a proxy issue or a privacy issue; it is a site security issue.
I block any and all IPs coming in with the xxx:+++ referrer. Give these types of products no quarter.
So to avoid hassle it just overwrites it with spaces instead, which in turn usually get logged as +.
You can scratch ZoneAlarm.
Ah but do they have ZoneAlarm Pro and do they have the privacy options cranked up?
So why don't you use a regular expression to edit it out of your logs, rather than just blocking people who have decided that they don't particularly like telling you what website they have just been looking at?
I don't have a problem with ppl not telling me where they've been, I have a problem with the way they are doing it. I have a problem with it costing me more work to fix the problem. It's overkill and not required. It fills my logs and takes bandwidth when a simple [Generic Search Engine] or something like that would do. 80 +'s is unacceptable.
Yeah, if it just omitted the Referer header then it would have to recalculate the content length of the message.
Sounds like poor coding practices to me. But I'm not quite sure what you're talking about.
Ah but do they have ZoneAlarm Pro and do they have the privacy options cranked up?
I don't know, I'll have to check.
The reason you get 80 +'s is because the Referer URL was 80 bytes long and it has been overwritten with spaces.
That's not so unreasonable.
[webmasterworld.com...]
is 78 bytes long.
If they simply removed the header or they replaced it with [Generic Search Engine] or whatever, then they would also have to calculate a new value for the Content-Length header.
It fills my logs and takes bandwidth
It should take up exactly the same bandwidth and possibly the same amount of space in your logs (depending how they work).
<!--#exec cgi="cgi-bin/log.cgi" -->
I put the above in each page I want in it. This way I can quickly look and see 'evil doers' and whose eyeballs are actually on the page.
I have a bot trap that just says no entry because you look like a bot, smell like a bot and ergo must be a bot. It will now also say your HTTP_REFERER is unacceptable, get some software that isn’t so annoying, or change your HTTP_REFERER to something less than 15 characters.
<edit>log.cgi also has a filter so that I can filter SE IP's, etc.</edit>
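For what it's worth, a stripped-down sketch of what such an SSI-invoked logger might look like (the log path and filter prefixes are made up for illustration; this is not the actual log.cgi):

#!/usr/bin/perl
use strict;
use warnings;

my $logfile  = '/home/site/logs/page.log';   # assumed location
my @skip_ips = ('66.249.', '207.46.');       # example crawler IP prefixes to filter

my $ip   = $ENV{REMOTE_ADDR}  || '-';
my $ref  = $ENV{HTTP_REFERER} || '-';
my $page = $ENV{DOCUMENT_URI} || $ENV{SCRIPT_NAME} || '-';

print "Content-type: text/html\n\n";         # SSI exec still expects a header

unless (grep { index($ip, $_) == 0 } @skip_ips) {
    if (open my $fh, '>>', $logfile) {
        print $fh scalar(localtime), "\t$ip\t$page\t$ref\n";
        close $fh;
    }
}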
I don't like it. The developers of the software or proxy should just write 'private' or something like that. Seems suspicious to me.
Heck if visitors knew even a quarter of the stuff we know about them, we'd have far fewer customers.
We must be selling different things. None of these people ever purchased anything, but they have saved my site for offline viewing, etc. I don't recall them even going to the price page, let alone the purchase page. The last one just went straight to an article I had written.
I don't care where they came from. It's the statement they are making. What they are saying could just as well be said with N/A. And that is causing me extra work for nothing.
If they simply removed the header or they replaced it with [Generic Search Engine] or whatever, then they would also have to calculate a new value for the Content-Length header
If the string was 'N/A' it would only take milliseconds to recalculate it. It is either some of the poorest coding I have ever seen (and God knows I have written my fair share of poor code, yesteryear), or it is something else.
I think it is each individual webmaster's right to block anybody they want to block, which is why I posted the code. However, in this case, I personally rely on a variant of key_master's bad-bot trap script [webmasterworld.com] to catch the real troublemakers, and ignore the fact that this UA makes a messy log entry. That's what I'd recommend to stop site downloaders, but everyone's needs are different.
Jim
I have to look up the commands and syntax to see how to actually do it - I just wrote my first from-scratch Perl script last week (it does FP, so that was fun) - but something like…
if ($ENV{HTTP_REFERER} =~ /^X{3,4}:/) {   # matches both the XXX: and XXXX: referers
    $ENV{HTTP_REFERER} = 'The X ppl';
}
But can you change the value of HTTP_REFERER like that? It's an environment variable, right?
Then include it server-side (SSI) before my log entry:
<!--#exec cgi="cgi-bin/ref-fix.cgi" -->
<!--#exec cgi="cgi-bin/log.cgi" -->
Sure, you could do something like that with an environment variable, but then that user-agent or referer could come right back again with the very next request, and access your pages. Environment variables only have meaning for the current http request. http is a "stateless" protocol - each http request has no knowledge or memory of previous or concurrent (or subsequent) http requests.
As to being worried about changing .htaccess on the fly, note that the thread I cited contains a modified version of key_master's original script. It flocks (file-locks) .htaccess while it adds an entry so that no "collisions" can occur. Several of us here on WebmasterWorld are using this script and talking about it; I'd guess that many, many more are using it and not talking about it. I've had it up and running for six months with no problems whatsoever. And remember, it only gets invoked when a troublemaker hits your site, so performance impact is very, very low.
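As a rough sketch of the flock-append idea (this is not key_master's actual code - the deny format and the .htaccess path are just for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

my $ip = $ENV{REMOTE_ADDR} or exit;
open my $fh, '>>', '.htaccess' or die "open: $!";
flock($fh, LOCK_EX) or die "flock: $!";   # exclusive lock, so two writers can't collide
print $fh "deny from $ip\n";
flock($fh, LOCK_UN);
close $fh;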
The way I see it, either the script works or it doesn't. If it doesn't work, then it's just like any of the other 1000's of lines of code in your scripts and in Apache itself - it can screw things up.
But it works.
I encourage you to try it and watch it work. Once you're comfortable with it, you'll be able to walk away from your logs for days at a time, knowing that your guard dog is still watching over your site. :)
Jim
I'm sure it works. What I'm skittish about is having a file open for writing on a system component like that; if the server crashes for some reason, it could corrupt the file system.
I'm on a hosting system using a Sun Unix box and it is virtual. On my first hosting service, I had my log file corrupted several times, so I changed services. (This is why I run a second log.) The new service I got was great - never a problem for more than a year. Then they were purchased by a hosting service in Atlanta, GA, and I have had at least one problem a week, sometimes several a week, for the last 30 days. Toooooo risky for me right now. They have corrupted my mailbox files, and I have already lost about $500 in business. So at this stage, I really need to play it really, really safe.
After thinking about it for a while longer: since I only want to change the HTTP_REFERER for the pages I log in my second log, I could just put the logic in my log.cgi before the write.
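Something like this in log.cgi just before the write, reusing the pattern from earlier in the thread ($ref being the referer variable from the sketch above):

$ref = 'The X ppl' if $ref =~ /^X{3,4}:[+ ]{2,}$/;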
A simple HTTP GET request doesn't have a content length. And even with a POST request, it would only count the length of the data payload, not the headers. Replacing the referer URL with XXX:+++ is more work for the software and therefore slower than just omitting that header line.
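For reference, a bare-bones GET request is nothing but headers - no body, and no Content-Length anywhere (hypothetical host and path):

GET /article.html HTTP/1.1
Host: www.example.com
Referer: XXX:++++++++++
User-Agent: Mozilla/4.0 (compatible)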
Yes, it may indeed be some kind of "security product". But if so, then it's an incredibly stupid one. And anyone thinking that they will improve their own (or their customers) privacy by making their visits stick out of my logs like a sore thumb should seriously get their sanity checked.
A simple HTTP GET request doesn't have a content length
Hmm.. good point. Oh well, that was just the explanation I had heard.
Heck if visitors knew even a quarter of the stuff we know about them, we'd have far fewer customers.
Ahh.. I take it you're not a big fan of P3P [w3.org] then!
Any and all personally identifiable information that you record about visitors should be openly declared on your site via a privacy statement. Otherwise I suspect you risk being in breach of the Data Protection Act (or your country's equivalent law).
Jim: I still think you're shooting yourself in the foot just to make a statement. (Or "cutting off your nose to spite your face," as my mum used to say.)
They may have only looked at an article this time, but presumably they read it so their awareness of your site has been raised. Next time they return they may be a customer.