Block all but desired USER AGENTS?

Forum Moderators: phranque

Message Too Old, No Replies

Block all but desired USER AGENTS?

Instead of blocking bad bots that change, why not allow only good bots?

JAB Creations

6:05 am on Aug 26, 2004 (gmt 0)

Here is my idea...
The sharp ones out there like us are crawling the main search engines trying to figure out what the hell this new user agent is subtracting stats and reports and we're not getting anywhere fast. We block one spam bot and two more replace it!

I don't know how to, but I hope I can encourage the work on an HTACCESS based or other type of project were we can work on adding legit agents to access our websites through the main .htaccess file. We DO know what browsers are out there! Why not go off of this information instead?

All non-approved agents are redirected to a rejection page. Webmasters can then check their logs to see what agents they will add and what agents they will blacklist.

I would like to contribute my share of search bots that I find credable. As far as I'm concerned anything credable crawling the web should have a homepage that cna be found even if it's not in the agent string on our logs.

ncw164x

7:45 am on Aug 26, 2004 (gmt 0)

We DO know what browsers are out there! Why not go off of this information instead?

because this info can be spoofed, if you only allow good bots a bad bot can then spoof the good one which is what happened last year where a bot from Asia was spoofing googlebot.

and it just goes on and on and on, some site rippers can't be stopped because the user agent can be whatever you want it to be.

ncw164x

jdMorgan

3:20 pm on Aug 26, 2004 (gmt 0)

Just because the user-agent can be spoofed doesn't mean the idea doesn't have merit, though. In the given example, the Googlebot spoof only worked if the code did not test anything more than the pattern "^Googlebot". This spoofed UA was easily blocked if the pattern also required the URL of the contact page to be included in the UA string.

I've used just such an approach on a non-commercial site that did not stand to lose much if a "good" UA was temporarily blocked. The trick lies in using a multi-tiered approach, allowing all known 'bots that don't start with "Mozilla/" and then applying further tests to those that do, filtering out the Mozilla spoofers. Finally, scripts such as Key_Master's bad-bot script and one of the two "access-speed throttle" scripts (all posted here on WebmasterWorld) can be used to detect the very-clever UA spoofers who manage to get everything right, but insist on fetching too fast or violating robots.txt.

The downside to the "ban all except" method is that you have to keep up with all the new "good 'bot" user-agents, just as normal block lists require you to keep up with new "bad" user-agents. Otherwise, you risk blocking the newest version of Googlebot -- For example, when they recently added "Mozilla" to their user-agent and changed their UA info URL from "googlebot.com" to "google.com" (See this thread [webmasterworld.com], message #24).

On the other hand, the "good ua" list doesn't change as often as the "bad ua" list, so there is some merit in this approach for some sites. Every Webmaster must make their own choices, and this subject is illustrative.

Jim

walrus

8:45 pm on Aug 26, 2004 (gmt 0)

That sounds like an amazing idea!
Complicated but great!

Lord Majestic

8:55 pm on Aug 26, 2004 (gmt 0)

we can work on adding legit agents to access our websites through the main .htaccess file.

Good idea - as soon as this sort of behaviour becomes acceptable on any significant number of sites then bots will stop declaring themselves and you will have to switch to banning of IPs and IP blocks. Then particularly nasty bots will just use hijacked PCs on fast connections - and these guys don't care about robots.txt, delays between requests etc.

This bot/witch hunting will only hit legitimate bots that might and even will bring you benefits. Just because you don't know which of them will, when and how much does not mean you should ban them all. Just consider it a Venture Capitalists strategy - they don't know where the "pearls" exactly are, but going through lots of sand justified at the end.

JAB Creations

2:19 am on Aug 27, 2004 (gmt 0)

I figured as much that they would drop user agents...

Spoofing does not concern me. If someone refuses to identify themself when accessing my web-site I have no concern if they can't access it.

Detecting user agents would be effective for an undetermined ammount of time. I agree that eventually the pricks will resort to spoofing their agents...

That leads me to ask the question, if an agent is spoofed to look like an allowable agent and then attempts to access a file, what will happen?

I'm not so much of a developer as I am a designer. So all I can do at that point is suggest some ideas at least and hope someone with the know how or sight reach can say yes or no if this is possible.

Should an agent be spoofed and the agent is able to access any files for only approved agents then I would imagine that the ability to test with Javascript or another technology would be able to confirm at client-side the true agent and relay that information to the server which can then force the user to the error page. If a technology like javascript is turned off to avoid I'm sure doing something like making files accessable initially through javascript (document.write to write the html initially being ignored if the script is turned off would keep the browser from going further) would be an effective wall against getting around the initial barrier.

Until the jerks resort from user agents we then have the advantage of shooting the monkey in the barrel. Once it braeks out we might still have them on a leash.

I'm curious, these are real human beings who are technically the equiv to computer terrorists. A virus to a computer is like a bomb to a building. I am personally grown quickly sick of doing ip lookups and writing to abuse addys. How can we work together to put these people behind bars? I'm starting to recieve virus attachments at my email addy. Call me a control freak, but I'm all for a quality experience, for the visitors and the webmasters :-)

JAB Creations

2:33 am on Aug 27, 2004 (gmt 0)

Oh yes...complicated...

If there is one thing about complicated that I've learned is that the more work WE do the less work (and thus the more enjoyable) our sites our to our visitors.

My site is frame based and trust me frames are a female dog to deal with. Working on issues around walls like frames has led me to learn much more then if I didn't. Of course you can only hit as high as you aim. :-)

jdMorgan

3:08 am on Aug 27, 2004 (gmt 0)

Should an agent be spoofed and the agent is able to access any files for only approved agents then I would imagine that the ability to test with Javascript or another technology would be able to confirm at client-side the true agent and relay that information to the server which can then force the user to the error page. If a technology like javascript is turned off to avoid I'm sure doing something like making files accessable initially through javascript (document.write to write the html initially being ignored if the script is turned off would keep the browser from going further) would be an effective wall against getting around the initial barrier.

Yes, but this assumes that the spoofer is using a browser's user-agent. What if it uses a robot's user-agent? Since robots don't typically 'run' client-side code such as JS, you'd end up blocking Slurp, Googlebot, and many more if you required client-side code.

As I stated above, a combination of user-agent blocking, IP address blocking and a couple of scripts that block based on behaviour is quite effective, even if the user-agent blocking is fairly liberal. If your goal is to block the page-scrapers, e-mail harvesters, and site downloaders in order to cut your bandwidth and un-clutter your logs, this approach works well. If you're concerned with stopping every single transgression, then it's by no means perfect. It also seems that once your site throws a few 403's, the 'word gets around' and you'll see far fewer attempts. The exploiters don't want you making a list of all their user-agents and proxy IP addresses -- After all, you might report your list or publish it.

To address LordMajestic's concerns, I'll just say that my "allow list" is fairly long, and includes many second and third-tier search engines, both foreign and domestic. I'm not concerned whether they are "big and famous," only that they are legitimate. Hijacked fast PC's can be blocked using the throttle script I mentioned. For the sake of their hapless owners, their IP addresses fall off the blocked list eventually.

Jim

JAB Creations

7:51 am on Aug 27, 2004 (gmt 0)

Simple, all one needs to do is concentrate their attention to the 403 page. Accesses referred to only that file will allow one to more easily cipher through what they want to allow in.

Now think about what % of bandwidth is sucked up by spam. More then 50% of emails are spam now. Yahoo, Google, and Microsoft are all email providers who I'm sure would gladly welcome any hard opposition to spam.

We are talking about cutting down the spam army as it leaves it's own castle. We know this would be effective and the $ saved by the death of spam by it's root source would surely capture the attention of the big players like yahoo Google and ms. I am sure that they would each be most willing to coordinate efforts with such a project.

In addition from what I have seen (just my perspective) each of the major crawlers typically come calling to my site via the same IP address. I would not be worried even in the initial phase of blocking out the major search engines because of the solid proof between them and their IP's. In cases of all other legit spiders the owners are as I have experienced always willing to correspond via email to confirm important information as they are typically developing their spiders.

The devastating fact is that the spam community is enjoying the fruits of our misery and the ignorance of our visitors. A devastating blow to their abilities will leave them with only the CD's of last year...that as time goes on thin in their once great success.

What we are fighting now is a [guerrilla] war against spam. Unlike [some guerrilla wars], we have a clearer ability to weed out who is on our side and who is not. Spam continues to be a huge problem like spyware through which the shady merchants teamed with the bad part of the hacker community thrive on always being on the offense.

If your castle is attacked do you leave your walls open and blindly seek your enemy on a hunch one by one?

I do not feel content to cower under pressure of those who I would gladly punch in the face should I ever meet them in person. If you decide to shoot at a million man army of gophers with an 17th century musket so be it. But I would like to see more construction on the idea then criticism of it.

[edited by: jdMorgan at 1:35 pm (utc) on Aug. 27, 2004]
[edit reason] Snipped specific geopolitical references. [/edit]

Lord Majestic

7:59 am on Aug 27, 2004 (gmt 0)

If your castle is attacked do you leave your walls open and blindly seek your enemy on a hunch one by one?

If castle is using "open doors" policy for everyone, then no one should be suprised if all sort of people come in and out. I understand and support banning of abusive bots that overload site and/or don't follow robots.txt, however creating whitelist of allowed bots is in my view trying to use sledgehammer to crack a small nut.

This "open doors" policy is what made the Net (WWW) what it is now. If people were blocking googlebot en masse when it was not a well known bot then its possible there would have been no Google. Just because you don't understand the reasons why someone might crawl your site (in non-abusive fashion) should not mean you ought to ban them because you might just be banning next "Google".

Bottom line is - if you put content on open access server then you should not be suprised that it will be accessed by bots and/or people who might not be in the list of your most desireable clients. If you want to protect your content then password protect it (with payment).

walrus

4:42 pm on Aug 27, 2004 (gmt 0)

Thats a good point , and after reading JDs
commments, seems like its not such a workable idea after all.
So it might not be practical but its still a progressive idea.
The net needs more guys with your attitude Jab!

Lord Majestic

4:56 pm on Aug 27, 2004 (gmt 0)

So it might not be practical but its still a progressive idea.

How is it "progressive" if it advocates freezing situation as it is now - with dominant search engines and no chance for a new startup to actually move the Internet forward by developing new search engine?!?

Regressive more like, unless that was sarcasm.

JAB Creations

5:05 pm on Aug 27, 2004 (gmt 0)

___
This "open doors" policy is what made the Net (WWW) what it is now. If people were blocking googlebot en masse when it was not a well known bot then its possible there would have been no Google. Just because you don't understand the reasons why someone might crawl your site (in non-abusive fashion) should not mean you ought to ban them because you might just be banning next "Google".
___

Hardly...such a project should be done privately and tested privately before getting in to the hands of the inexperienced. Furthermore if actual software was developed I am sure an analysis report forcing a webmaster to choose to block or allow unknowns would be effective at ensuring lesser spiders would be included.

Your prime argument is that webmasters are shmucks who will end up keeping out search engines and killing the internet. Instead of continuing to do so why don't you make suggestions on top of your criticisms? We're (the good ones) webmasters because we like the challenge and engagement from the field. Giving up never made anyone progress or money. If I had given up because of the complexities of anything I have done I would have not come anywhere near as far as I have come so far.

Furthermore I would like not to think of a secure internet as a pay-per-site internet. Should I charge my visitors to ensure I don't get server abuse? Now I'm STILL making the compromise.

Help us by making positive suggestions on where this stance could be made successfully and the power needed to ensure all good spiders are also allowed in with visitors. Power requires responsibility...so if you have a cool suggestion please by all means do so. Just don't lecture me on what I already know to be the faults without any effort to work around those faults. Because ultimately I will end up assuming you're one of them and I don't want to think that I would want to punch you should I ever meet you.

jdMorgan

5:42 pm on Aug 27, 2004 (gmt 0)

Folks, this is a technical forum. If someone wants to write code to block all but three users on this planet from his site, it is within this forum's charter to help him or her to develop that code. Friendly advice on why it might not be such a good idea to do so is welcome if included in a post that helps answer the original question.

It is a fact that public Web sites are subjected to abuse. Some Webmasters choose to ignore it, and others choose to take action. This action may range from policies that almost everyone would agree are good ideas to those that will seem harsh and excessive to many. However, our job here is to discuss the technical aspects of those policies, without making judgements about the person posting the question -- or those posting answers or comments.

While it is laudable to want to keep ours sites completely "free and open" as they were in the mid-90's when things first got rolling and the Web was more academically-oriented (as opposed to its commercial nature now), sometimes this is no longer possible due to those who abuse the medium. It's been said many times, but it bears repeating: Every Webmaster has the right to choose how "open" his or her site will be. Our job here is to discuss technical ways of implementing such policies.

I would like to steer this discussion back to implementation and surrounding issues. Off-topic posts will be subject to editing and deletion. Ad-hominem attacks may result in stronger action. Please see TOS [webmasterworld.com] #4.

Jim

digitalv

5:56 pm on Aug 27, 2004 (gmt 0)

You know, there is a much better way to do this ... I forget who showed me this method, but you can use it to permanently block all bots that do not follow your robots.txt guidelines.

(1) Create a script that will alter your IP filters and ban the IP address of anyone who accesses it. In other words, if anyone visits "blockme.asp" (blockme.php, whatever) their IP address will be immediately added to your deny list and banned from the site.

(2) Specificy in robots.txt that robots should NOT FOLLOW THE LINK TO THAT PAGE! This is important!

(3) Somewhere on your home page, create a hidden link to this page that no user could ever see or a click but a robot could.

That's it ... if the robot is following robots.txt like it's supposed to, it will never try to access that page and will never be banned. But if it DOESN'T follow robots.txt, it will be banned as soon as it hits that page.

jdMorgan

6:01 pm on Aug 27, 2004 (gmt 0)

Yes, this is the purpose of Key_Master's bad-bot [google.com] script and its derivatives, cited above.

Jim

walrus

6:25 pm on Aug 27, 2004 (gmt 0)

Sorry , i could have been more careful how i worded that.
I didnt mean the idea so much as i meant his creative attitude.
Just that i suppose it was redundant of me to say it considering the forum we're in is full of pros talking about how to protect sites from abuse.

digitalv

7:09 pm on Aug 27, 2004 (gmt 0)

Also wanted to add that I don't really think an allow-only thing is a good idea anyway. On the drawing board it might sound good, but once you dive into it you would see that it's not really feasible.

For one, I have a database of about 465,000 distinct user agents, and this is by no means a complete list - are you going to go through each and every one of them to figure out which is a browser and which is a robot? Sounds like a big waste of time to me - and even if you do, that doesn't compensate for those robots that mask their identity by using someone else's UserAgent like Internet Explorer or Googlebot.

You would ultimately do yourself more harm than good by attempting this.

jdMorgan

9:28 pm on Aug 27, 2004 (gmt 0)

The trick is to carefully select your regular-expressions patterns so that most bad bots and spoofers are blocked, and no innocent user-agents are affected. As outlined above, it works well on a small not-for-profit site I administrate. As long as you log the 403-Forbidden responses separately, review that log frequently and respond to unintentional blocks in a timely manner, it works very well. Combined with a bad-bot script or two, it works very very well.

The advantage is that there are far fewer new "good" user-agents added per unit time. The disadvantage is that some planning and forethought must go into it to avoid blocking future browsers like Firebird 0.9.4 and Mozilla 1.7.3, etc. And if you're not comfortable blocking some legitimate-but-new spider variant for a couple of days, then this is not the approach for you.

For some sites, this approach can greatly reduce maintenance and increase peace of mind, allowing for peaceful holidays and vacations.

Jim

JAB Creations

10:00 pm on Aug 27, 2004 (gmt 0)

digitalv's really idea kicks donkey! To make a quick criticism it still does not completely stop a spider. Would there a way to allow a spider so many hits before banning it from not accessing robots.txt? Such as allowing it to only hit the index page (and if you setup a redirect as some sites do set allowed hits to 2 for such folks) before requiring a robots.txt hit or default to banning.

Also what about automatically banning and creating a list of ips/agents that hit certain pages on their first hit (or allow this to be done as before within a ranged number of initial hits). Simply fetching the guestbook or contact page off the bat is defintly suspicious. With CSS we can make the contact link appear where we want while putting it in the html (or other language) at the bottom. A bad bot would simply ignore the other links or pretend to visit some of the other pages at best before getting what it's really there for.

Also I may not have a 465,000 unique visitor rank (per month?)..mine is about 10,000 average. However I have been able to go through searching for bots spiders and crawlers for a nifty tool in the matter of an hour and come up with the following spider report for my site for this month (so far as Auguest 27th / noon). My access log is about 100 MBs...of course my site is multimedia based. For about a month's worth how large are you folks seeing your access logs grow? I simply have my program find and delete all instances of every bot I find and it takes about 30 seconds per bot. Since access logs are our best bet I figure from my perspective even larger sites would not need a large effort to control.

Spiders
This list contains all found spiders, crawlers, and bots and the total unique number of instances
of hit requests and every version of their agent that attempted to access my site from Augest 1st through
Augest 27th at 4pm EST.

------------------------------------------------------------------------------------------------------------
Spider............// #Hits // Access Log Signature // URL IF signature lacks one or email addy
------------------------------------------------------------------------------------------------------------
Asterias .........// 970 // Asterias Crawler v4; +http://www.singingfish.com/help/spider.html; webmaster@singingfish.com); SpiderThread Revision: 1.9"
Baidus ...........// 5 // "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
Boitho ...........// 1 // "boitho.com-dc/0.52 ( [boitho.com...] )"
ClariaBot (Gator).// 2 // "ClariaBot/1.0" // 64.152.73.15 refers to Gator.com
Cosmix Crawler ...// 1 // "CosmixCrawler/0.1" // [?...]
Grub .............// 1 // Crawl your own stuff with [grub.org)"...]
IBM .............// 102 // "http://www.almaden.ibm.com/cs/crawler [c01]"
IRL Crawler ......// 4 // "TAMU_CS_IRL_CRAWLER/1.0" // [irl-crawler.cs.tamu.edu...]
FAST .............// 550 // "FAST Enterprise Crawler 6 (Experimental)"
FAST .............// 15 // "FAST Enterprise Crawler 6 used by Lycos, Inc. (spider@lycos.com)"
FAST .............// 24 // "FAST Enterprise Crawler 6 used by sentius (bconklin@sentius.com)"
Gias .............// 164 // "Gaisbot/3.0+(robot@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)"
Girafabot ........// 85 // Girafabot; girafabot at girafa dot com; [girafa.com)"...]
Google ...........// 2164 // "Googlebot/2.1 (+http://www.google.com/bot.html)"
Google ...........// 64 // Googlebot/2.1; +http://www.google.com/bot.html)"
Google? .......// 103 // "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" // ****Looks like a mimmic bot, I have sent an email to Google to clearfiy***
Google Media .....// 78 // "Mediapartners-Google/2.1"
Iconsurf .........// 4 // "IconSurf/2.0 favicon monitor (see [iconsurf.com...]
Iltrovatore ......// 3 // "Iltrovatore-Setaccio/1.2 (It-bot; [iltrovatore.it...] info@iltrovatore.it)"
Jetbot ...........// 6 // "Jetbot/1.0" // [jeteye.com...]
lcabot ...........// 4 // "lcabotAccept: */*" // Unknown but points to Aliant..phone service? PDA perhaps?
Microsoft ........// 1 // MSIECrawler)"
Mozilla ..........// 2 // "mozDex/0.05-dev (mozDex; [mozdex.com...] spider@mozdex.com)"
MSN ..............// 2270 // "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
Naver ............// 441 // "NaverBot-1.0 (NHN Corp. / +82-2-3011-1954 / nhnbot@naver.com)"
Nutch ............// 23 // "NutchCVS/0.05-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)"
Objectsearch .....// 4 // "ObjectsSearch/0.05 (ObjectsSearch; [ObjectsSearch.com...] support@thesoftwareobjects.com)"
ODP (Mozilla......// 1 // ODP links test; [tuezilla.de...]
Peerbot ..........// 4 // "PEERbot www.peerbot.com"
Plantynet ........// 20 // ""PlantyNet_WebRobot_V1.9 dhkang@plantynet.com" // ***The lack of any contact info leads me to believe this is a spammer who targets webmaster addys specifically***
Scooter .........// 4 // "Scooter/3.3Y!CrawlX"
Turniton .........// 2 // "TurnitinBot/2.0 (http://www.turnitin.com/robot/crawlerinfo.html)"
Turniton .........// 501 // "TurnitinBot/2.0 [turnitin.com...]
Tutorial .........// 10 // "Tutorial Crawler 1.4 (http://www.tutorgig.com/crawler)"
Voila ............// 26 // VoilaBot BETA 1.2 (http://www.voila.com/)"
Webfilter ........// 1 // "WebFilter Robot 1.0" // 216.248.177.131 refers to itcdeltacom.com which is a telecom company AKA Block those mofos! :-D
Wisenut ..........// 213 // ZyBorg/1.0 (wn-2.zyborg@looksmart.net; [WISEnutbot.com)"...]
Wisenut ..........// 40 // ZyBorg/1.0 Dead Link Checker (wn.dlc@looksmart.net; [WISEnutbot.com)"...]
WHOIS (Expir inq).// 4 // "http://www.****/" "SurveyBot/2.3 (Whois Source)"
WHOIS (Expir inq).// 5 // "SurveyBot/2.3 (Whois Source)"
Yahoo ............// 10058 // Yahoo! Slurp; [help.yahoo.com...]
Yahoo ............// 5 // "Yahoo-MMAudVid/1.0 (mms dash mmaudvidcrawler dash support at yahoo dash inc dot com)"
Yahoo ............// 742 // "Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)"
ZipppBot..........// 2 // "ZipppBot/0.11 (ZipppBot; [zippp.net;...] webmaster@zippp.net)"
Unknown ..........// 13 // "Lycos_Spider_(modspider)" // [?...]
Unnamed (German)..// 3 // "Der Bot aus Poppelsdorf (http://www.avaris-webdesign.de, erstkontakt1@avaris-webdesign.de)"

In addition after a while of looking for spiders, crawlers, and bots, I ended up finding rebots.txt itself requested as I was looking for the next "bot". This was simply a win95 box running IE5. Obviously now I'm able to indentify things that DONT want to identify themselves and could be either good or bad. A simple IP whois [http://cqcounter.com/whois/] and I learn that 195.101.94.208 refers to some guy in france. Ok, not the best lead but as commonly agreed upon we all can decide for ourselves what we would do. I have a little free time on my hands (heh) and would contact who I could and do a small investigation. I wouldn't be too worried about someone who IS accessing my robots without an agent as I would assume this guy is testing out some bot that he built and just so happened to hit my site. I'd be more concerned with initial hits on hot pages like contact/guestbook, etc.

The reason I'm so confident is because when the bank has the money and you need money who has the power? We're the bank in this scenerio and we know that only the bad guys are going to do certain things our visitors wont.

JAB Creations

10:01 pm on Aug 27, 2004 (gmt 0)

My list spaces nicely in notepad if anyone wants a sane perpective.

jdMorgan

10:13 pm on Aug 27, 2004 (gmt 0)

You can research some of your "unknowns" using Bull's excellent WebmasterWorld forum11 Updated and Collated Bot List [webmasterworld.com].

Jim

isitreal

10:47 pm on Aug 27, 2004 (gmt 0)

Virtually all the site downloaders and bad spiders spoof their Navigator user agents, so basing any blocking on that pattern seems like pretty much an excercise in futility, that might have been a good technique a few years ago, but all the bad guys know all about it, they read WebmasterWorld too, especially threads like this, and the old 'ultimate htaccess ban list' thread, and they've adapted to that long ago, only totally rank amateur spider operators would bother trying to spider using navuseragent 'black widow' or whatever, those aren't the guys you need to worry about, it's the pros. Changing the user agent is just a radio button option in most decent spidering software I believe, you don't even have to type something in.

Blocking based on behavior is much more solid, as digitalv recalled. If a page runs a script like he says, and that page is linked to from every page, and your robots.txt bans that page, and something tries to access it, it means one of 3 things (I'd say 2, but some of the search engine spiders have trouble following robots.txt at times)

1. somebody tried to download your site using a downloader that doesn't obey robots.txt, unlike the open source wget for example.
2. a bad spider is spidering your site.
3. real spiders mess up and request the page anyway.

The threads mentioned on blocking these things work extremely well, and punish visitors only for their behavior, which is as far as I can tell the only way to punish them reliably.

Of course, it's probably best to make the blocked page named something like contact_us.html or something generic.

There are things one could do however, on the case of mozilla spoofing, htaccess could test for accept application xhtml+xml, and my guess is only real mozillas will pass that test, but then they can just switch to IE user agent, it's an endless game that gets you nowhere as far as I can tell.

JAB Creations

2:00 am on Aug 28, 2004 (gmt 0)

Then its seems there is a general consession for two coursesw of action.

If we are to persue the offenders one by one then our best course of action would be to study and adapt to their behaviors.

If we are to persue the course where we shut out the unknowns and filter them as they come we risk other things.

My question is how reliable can we detect via clientside whether a browser is spoofing or not?

In addition I think the best course of action would be a combination of both plans.

Simply put when an allowed agent passes through it will still persue certain behaviors. This reminds me of when I read on nominalism versus realism. Why are they opposing? They are not! One is simply in regards to condition versus the other on definition.

What others have suggested is a defensive stance. What I have suggested is an offensive stance. Why not carry both the sheild and the sword at the same time?

digitalv

3:24 pm on Aug 28, 2004 (gmt 0)

still does not completely stop a spider.

True, but I think it's about as close as you're going to get. Since user agents can be changed LEGITIMATELY as well as cloaked, blocking by the User Agent is going to block out more legit users than robots in the long run. Some software (like Alexa) actually modifies the User Agent in Internet Explorer, and you can do that yourself in the registry anyway. Software like Norton Internet Security (I think) gives you the option of hiding the User Agent all together and the webmaster will just see nothing or "-" in their logs.

465,000 unique visitor rank (per month?)

No, I was saying I have a database with 465,000 distinct user agents - they've been collected over the course of a year and a half. Lots and lots of user agents out there, and I know I certainly wouldn't want to go through them all.

My question is how reliable can we detect via clientside whether a browser is spoofing or not?

You can't. You have no way of knowing whether you're looking at a "real" user agent or a spoofed one.

If we are to persue the offenders one by one then our best course of action would be to study and adapt to their behaviors.

Yesterday I was viewing the website of a man who was running for a seat in the Senate. The site was maybe 30 pages, and I'm pretty sure I read all of them. From the home page, I opened all of the links in new tabs (in Firefox it's how I read stuff and click interesting links without losing my place on the current page). Given this situation, could any program possibly tell the difference between me and a bot? For small websites, it's not uncommon that a visitor will read every page before moving on.

Why not carry both the sheild and the sword at the same time?

Because you're going to trip and stab yourself with your sword :)

jdMorgan

5:33 pm on Aug 28, 2004 (gmt 0)

> blocking by the User Agent is going to block out more legit users than robots in the long run.

This presupposes that:

A) You don't do much analysis before setting up the list
B) That your "allow list" is so small and restrictive that it blocks almost everything -and-
C) That you never modify or update your list

It's not too hard to come up with a list of allowable user-agent patterns that blocks 80% of all bad-bots and allows 99.99% of all legitimate accesses. It does take several months of research and testing, though. And maybe a year before your unintentional blocks drop to practically zero percent.

Most objections to the "allow list" method make one or more of these invalid assumptions. It is every bit as viable as the "deny list" method, but gives increased effectiveness only in exchange for increased attentiveness. If you don't pay attention, you'll block a few new user-agents until you fix your patterns. That may be OK if they're blocked for two days, but if you block a new search engine spider for two weeks, your search rankings will very probably suffer. You also need to keep up with new browser user-agent strings -- maybe not the minor revision numbers, but certainly the name and major revision number substrings.

However, UA blocking is only one piece of the puzzle, and should be supplemented with IP blocks and behavioural filters in order to increase the catch rate from 80% to 95%. (Yes, maybe up to five percent will still get through, but being a pragmatist, that's good enough for me, and I'm not willing to devote my life to blocking every exploit attempt. My purpose is to improve my referrer tracking accuracy and to limit wasted bandwidth and legal fees associated with people "borrowing" my copyrighted materials.)

The number of attempts to exploit your site seems to go down as soon as your site establishes a reputation for blocking unwelcome visits. I don't know why, but it does.

For a given level of "security " (determined by you), blocking by UA reduces the number of IP addresses you need to block. Adding IP blocks of recognized UA spoofers improves that security further. Adding manual and automatic IP blocks based on user-agent bahaviour improves it yet again. If it becomes impossible to block by user-agent, then don't -- block by IP instead. Dial-up IPs can be blocked for short periods to avoid affecting innocent users that end up using that same IP address later.

Been there, done that, and it works well for small sites -- If you pay attention to your error logs or stats.

If it doesn't work for you, you can always change your code or delete it. This stuff ain't permanent.

Jim

JAB Creations

6:20 pm on Aug 28, 2004 (gmt 0)

Most objections to the "allow list" method make one or more of these invalid assumptions. It is every bit as viable as the "deny list" method, but gives increased effectiveness only in exchange for increased attentiveness.

That is what I'm talking about. If a webmaster wants to actively reduce abuse then they will gladly go through logs...and dont most of us (who are acknowledged in the area)? Only those who care to actively seek such a solution will have to do any work. I'm not entirely worried about accidentally blocking good agents/people as I would like to setup a contact form for legit users (not bots) to submit a compliant or whatever. Their information can be passed on.

Additionally it would be a matter of going through the new list (maybe once a day?) and manually setting your decision for each agent/bot.

What I am suggesting is that this should be treated like any other piece of software. Not available to the masses as no doubt when a game is released and you can't beat the 2nd level (*cough* heroes II *cough*) because of poor design or unfinished work send the wrong message. I think a group of high traffic sites with willing webmasters to put together a solid base list to start the project off would.

Additionally someone with the ability to create a program that can make a list of first hits combined with hotpages would give us the ability to track down ips and such. I do a lot of whois with first contact hotpages and find a lot of things related like gator under an unrelated name. Another one related to telemarketing and so on.

I'm sure someone could write an apache program to deny access from an ip that requests ahotpage initially, make a report and make links directly to a whois page making it easy to lookup and make a decision ip by ip on a timly manner.

I look at my stats and logs several times a day and I'm sure others do to.

In addition I'm sure it could also be developed a hotlist of words you're actively looking for when bots change. Yahoo, Google, MSN, Teoma, these words should be able to be added to a hotlist and whenever they appear on the denied list they are in red and at the top for your immediant treatment.

I should be a software director :-D