
Search Engine Spider and User Agent Identification Forum

This 86 message thread spans 3 pages; this is page 3.
The Whitelist
Key Elements
tangor

WebmasterWorld Senior Member tangor us a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4640881 posted 12:13 am on Jan 29, 2014 (gmt 0)

Whitelisting access to a web site is easier than blacklisting. One is who you let in, the other is endless whack-a-mole. To start:

.htaccess to allow all comers access to robots.txt

robots.txt allows a short list (b, g, y, examples); all others 403

UAs allowed, a slightly longer list, but still very limited. No match: 403

Getting granular: Referer keywords (this is a blacklist, but still pretty short): 403
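A minimal .htaccess rendering of the four steps above might look like this (Apache 2.4 syntax; the bot names, browser keywords, and referer keywords are illustrative placeholders, not anyone's actual rules):

```apache
# 1. robots.txt stays readable by all comers
<Files "robots.txt">
    Require all granted
</Files>

# 2-3. Whitelist a short list of robot and browser UAs
SetEnvIfNoCase User-Agent "(Googlebot|bingbot|Slurp)" allowed_ua
SetEnvIfNoCase User-Agent "(Firefox|Chrome|Safari|Opera)" allowed_ua

# 4. Referer keyword blacklist
SetEnvIfNoCase Referer "(poker|casino|viagra)" bad_referer

# No whitelisted UA, or a blacklisted referer: 403
<RequireAll>
    Require env allowed_ua
    Require not env bad_referer
</RequireAll>
```

The <Files> section overrides the site-wide requirement for robots.txt, so even non-whitelisted bots can read it before being 403'd everywhere else.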

These are my basic tools. Are there others, or perhaps refinements? And where does whitelisting fail? And how do you poke holes for desired exceptions? (I know HOW, but have yet to really find a need to do it.)

These are concepts I've been using for a few years, but I suspect there are other methods, so this is a request for discussion on whitelisting in general. I went down the blacklisting hole for a number of years--with the subsequent heartburn and agitation--until making that paradigm shift from taking out the bad to allowing the good. And, though I'm not unhappy with the results so far, I wonder if the whitelist might be losing some potential traffic.

Your thoughts?

Thanks.

 

tangor




 
Msg#: 4640881 posted 7:36 am on Feb 9, 2014 (gmt 0)

...so the remaining robots claiming to be Netscape 2.0 when all the world is Netscape 13.7 stick out like sore thumbs.


Shhhh! lucy24, don't give away the other side of the whitelist (which, as we all know, is who gets in).

UA blocks are not perfect, of course, but they are key to the total defense: if condition x is not met, then no entry. Works for me.

keyplyr

WebmasterWorld Senior Member keyplyr us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4640881 posted 7:52 am on Feb 9, 2014 (gmt 0)



As an aside...

(this is part of my whitelist)
Who is blocking HTTP/1.0?
Are you allowing anyone through? Who?
How about OPAL/TalkTalk?

Angonasec

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4640881 posted 1:54 pm on Feb 9, 2014 (gmt 0)

KeyP:
I block all HTTP/1.0 since being nudged by Mr. Stiles :)
I also block great chunks of TalkTalk traffic. TT seems to send a bot on the back of legitimate visits, so I allow the human and block the TT bot, which comes from a different IP around the same time.

If I lived in the UK I would not use TT at all: It may be cheap, but it appears to be slippery from this angle.

Tangor:
"Shhhh!" Indeed, since I'm finding this thread useful, it is doubtless grabbing the attention of The Enemy too. They must be transfixed and rapt.

Perhaps we should continue the discussion in private?

trintragula



 
Msg#: 4640881 posted 3:31 pm on Feb 9, 2014 (gmt 0)

I think it's important to keep things in perspective here.

Before I put in a bot blocker last year, 80% of the traffic on my site was bots with the word 'bot' (or similar) in their user agent string. That's how tough it is out there.
(I'm being ironic here...)
I block a few more than just the bots with an obvious UA, and my other tricks do a good job of separating cable/DSL traffic from server-farm traffic, pretty much by coincidence. Most of the time.

Pinterest still steals my images, in spite of my hot link blocker. I could probably stop them, but I'm choosing not to - I'd have to do it case by case with all similar 'services' and be on the lookout for them forever - I'm not going to do that.

Archive.is is stealing individual pages using what amounts to a botnet and a browser UA, right under my nose, and I know of no way to stop them. I can't even tell that it's happening (ideas, anyone?).

At the moment I'm content to reduce my bandwidth needs and reduce the abuse my site gets (albeit quite a lot). There's a cost/benefit trade-off. It's like any other security: you budget for it, and you try to work smarter rather than harder to get the most for your effort.

Because I know I can't stop them all, I'm not tempted to go after every bot I could stop.

Most of the bot runners apparently think they have a legitimate business, and are probably not interested in the behaviour of a few weirdos who choose to block them.
I don't think discussing tactics in private will make any difference, unless you really think you're discussing things that will actually stop the guys who are really trying to hide. At the moment, I don't think we're even close to that.

wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4640881 posted 4:08 pm on Feb 9, 2014 (gmt 0)

Who is blocking HTTP/1.0?
Are you allowing anyone through? Who?


No exceptions for HTTP/1.0.
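For reference, one mod_rewrite sketch of a blanket HTTP/1.0 block (my rendering, not anyone's actual rules from this thread):

```apache
RewriteEngine On
# THE_REQUEST is the raw request line, e.g. "GET /page.html HTTP/1.1",
# so anchoring on the protocol suffix catches 1.0 requests
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteRule ^ - [F]
```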

Archive.is is stealing individual pages using what amounts to a bot net and browser UA from right under my nose, and I know of no way to stop them. I can't even tell that its happening (ideas anyone?).


Past practices (at least for me) have shown that these tools are initiated by a solitary user, from an unrelated IP, on a pre-visit before the bot arrives.
Just go back through your earlier logs, search for the same page, then compare the date to the bot's appearance.
I've used this practice for many difficult bots, and for some years.
There may be an occasional casualty; however, an exception to your 403s should allow a contact for a genuine resolution.

Pinterest still steals my images, in spite of my hot link blocker. I could probably stop them, but I'm choosing not to - I'd have to do it case by case with all similar 'services' and be on the lookout for them forever - I'm not going to do that.


same method as archive.is

trintragula



 
Msg#: 4640881 posted 6:47 pm on Feb 9, 2014 (gmt 0)

I'm trying to argue that the reason not to blacklist is that some bots (including archive.is, I think) are immune to blacklisting, so you are guaranteed not to catch them all.
If you accept that, then blacklisting loses its appeal.
As long as you're willing to blacklist, other methods will always look marginal.

dstiles

WebmasterWorld Senior Member dstiles us a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 4640881 posted 8:53 pm on Feb 9, 2014 (gmt 0)

In the UK several smaller ISPs have been taken over during the past decade by larger companies, not always UK-owned ones. In each case the IP range has been taken over as well. Over time the "new", larger ISP finds limitations in one or other range and in general assigns any IP it owns to any customer within a certain type (dynamic, static, business etc). Some of the original ranges were /17 or less, some are /12 or more.

Add to this the fact that most people turn off their "modems" along with their computer, which results in them losing their IP and often picking up a new one next time they go online. Also, the ISPs sometimes disconnect an IP for some reason (maintenance, power-cycling etc); this happens even with my static IP.

For some of my customers, their IP can vary considerably across several A and A.B ranges week by week and sometimes day by day: I have had to whitelist large UK ranges for my mail server, just for this reason, and I still get surprise IP complaints on this from time to time. Whitelisting only a narrow range for even one person can easily mean that the person usually cannot get in and will eventually go away - I suppose that would solve the problem. :)

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4640881 posted 9:01 pm on Feb 9, 2014 (gmt 0)

Pinterest still steals my images, in spite of my hot link blocker.

Technically they're not hotlinking. They download the page the image "belongs" to and then give that page as referer. If you're blacklisting, you have to go by UA instead.

Law-abiding robots almost invariably have a parenthetical bit at the end of the UA: (+http et cetera). You could scoop up a lot with a simple

BrowserMatch http

and then apply whitelisting to your permitted robots. That's assuming you've got full use of an <If...> construction-- taking us back into technology. Even

BrowserMatch \) ?$

will be 90% robots. The other 10% includes me: "(like Firefox 3.6.28)"
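Spelled out with environment variables, that approach might look like the following (the variable names and the whitelisted bot names are mine; the regexes are lucy24's):

```apache
# Flag UAs that look like robots: "http" anywhere in the string,
# or a string ending in ")" (optionally followed by a space)
BrowserMatch "http" suspect_bot
BrowserMatch "\) ?$" suspect_bot

# Clear the flag again for the handful of robots actually wanted
# (names illustrative)
BrowserMatchNoCase "(Googlebot|bingbot)" !suspect_bot

<RequireAll>
    Require all granted
    Require not env suspect_bot
</RequireAll>
```

As a bonus, this sidesteps the <If> construction entirely, at the cost of a little env-variable bookkeeping.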

dstiles




 
Msg#: 4640881 posted 9:07 pm on Feb 9, 2014 (gmt 0)

keyplyr:

Who is blocking HTTP/1.0?

Me, in part.

Are you allowing anyone through? Who?

Some proxies, especially but not exclusively UK educational.

How about OPAL/TalkTalk?

In general this is just another dynamic ISP, but the range 78.151.160.0 - 78.151.165.255 is used by them for some infrastructural reason: I block that. I have about 80 Opal/TalkTalk IPs blocked - far fewer than many other ISPs in the wide world (BT, also in the UK, is around 250; US Comcast is over 1300 - OK, the US is bigger, but still a lot). Bear in mind I only host a couple of dozen sites.
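In CIDR terms the quoted range splits into a /22 plus a /23, so an Apache 2.4 block would run along these lines (a sketch, not dstiles' actual configuration):

```apache
<RequireAll>
    Require all granted
    # 78.151.160.0 - 78.151.163.255
    Require not ip 78.151.160.0/22
    # 78.151.164.0 - 78.151.165.255
    Require not ip 78.151.164.0/23
</RequireAll>
```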

trintragula - there is usually a way to block bots but you have to allow there may sometimes be a bit of legit traffic fall-out.

keyplyr




 
Msg#: 4640881 posted 10:38 pm on Feb 9, 2014 (gmt 0)

...but the range 78.151.160.0 - 78.151.165.255 is used by them for some infrastructural reason: I block that.

dstiles - I see 78.148.0.0 - 78.151.255.255. Has the range broadened?

trintragula



 
Msg#: 4640881 posted 10:47 pm on Feb 9, 2014 (gmt 0)

@lucy24

You could scoop up a lot with a simple

BrowserMatch http

and then apply whitelisting to your permitted robots


I do that. My list is 'bot', 'spider', 'crawl', 'http', '@'
I have a longer but still limited list of about a dozen keywords for browsers, but that's as far as I go with useragents.

My htaccess file is actually empty - my filtering is all php-based.
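His filtering is PHP-based, but the same two keyword lists are easy to sketch in .htaccess terms (the bot keywords are his list verbatim; the browser keyword list here is illustrative, since his actual dozen isn't given):

```apache
# Bot keywords: any match gets flagged
BrowserMatchNoCase "(bot|spider|crawl|http|@)" bot_ua

# Browser keywords: a short whitelist of UA substrings
BrowserMatchNoCase "(Firefox|Chrome|Safari|Opera|MSIE)" browser_ua

# Admit only UAs that look like a browser and not like a bot
<RequireAll>
    Require env browser_ua
    Require not env bot_ua
</RequireAll>
```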

@dstiles
Could you stop MJ12 if they didn't tell you who they were? As bot nets go, they're one of the nice guys.

lucy24




 
Msg#: 4640881 posted 10:48 pm on Feb 9, 2014 (gmt 0)

but you have to allow there may sometimes be a bit of legit traffic fall-out

That gets us back to the underlying philosophical question. If you blacklist, some robots will get in. If you whitelist, some humans will be locked out.

What type of site do you have? If humans are locked out, will they get in touch and ask to be let in, or will they simply wander off in disgust and/or bewilderment? Do they even know that "poke a hole for me" is an option?*

My hole-poking page says in part
You will need to include one of these pieces of information so I can pinpoint the problem: your current IP address, or your exact browser and operating system, or the link you clicked to get here.

This is, of course, a brazen lie. I watch requests for the stylesheet that belongs to the 403 page, so if a human has been locked out, I probably already know about it and can figure out why. Well, maybe not before I get the e-mail-- I don't check logs that often-- but eventually.

I don't personally see any reason to deny HTTP/1.0 as such. If it's a robot it's almost certainly been locked out on other grounds already. If it's a human they may not have any choice about the proxy.

:: detour to search for string 'HTTP/1.0" 200' in recent logs ::

Could you stop MJ12 if they didn't tell you who they were? As bot nets go, they're one of the nice guys.

I'm glad someone mentioned MJ12. I have no objection to them as such-- but I don't value them so highly that I'll poke a hole for them by name. So it all depends whether they crawl from a server farm that's currently blocked on strictly numeric grounds.


* Or, conversely, do they assume the site administrator is an idiot and will believe whatever story you make up? I'm on one forum where almost every ban is quickly followed by an e-mail to the forums administrator saying something like "I don't know what's wrong. I can't seem to log in." The lockout screen says exactly what's going on and why; the administrator probably banned them herself.

trintragula



 
Msg#: 4640881 posted 10:59 pm on Feb 9, 2014 (gmt 0)

Could you stop MJ12 if they didn't tell you who they were? As bot nets go, they're one of the nice guys.

I'm glad someone mentioned MJ12. I have no objection to them as such-- but I don't value them so highly that I'll poke a hole for them by name. So it all depends whether they crawl from a server farm that's currently blocked on strictly numeric grounds.

A distinguishing feature of MJ12 is that they don't. That's why I'm asking...

keyplyr




 
Msg#: 4640881 posted 11:11 pm on Feb 9, 2014 (gmt 0)

[rant]MJ12 was a PITA IMO. They crawled all the time, getting stats to build their product and in exchange offered me info about my site which I already knew. Despite disallowing them in robots.txt, it took weeks of emails to get them to stop.[/rant]

trintragula



 
Msg#: 4640881 posted 11:43 pm on Feb 9, 2014 (gmt 0)

MJ12 have much history here. I brought them up only because they run as a bot net, which makes it hard to spot them in a crowd unless they wear a badge - which they do.
But what about the guys that don't?
My point was simply that not all bots are blockable in any obvious way, some possibly not at all.

iomfan



 
Msg#: 4640881 posted 1:21 am on Feb 10, 2014 (gmt 0)

My robots.txt is open to anybody, and pretty well all hosts identifying themselves as MJ12 obey it. Those who don't get served a zero byte file if they look for anything else.

On most sites I serve a zero-byte file to HTTP/1.0 requests. There may be the odd proxy that uses HTTP/1.0, but almost all such requests look like they come from machines and not from people, and the sites I look after don't have the kind of content that would invite users to use proxies.

Angonasec




 
Msg#: 4640881 posted 4:50 am on Feb 10, 2014 (gmt 0)

Repressive authorities forced the oppressed to use proxies, but the "legitimate" use of proxies waned rapidly with the dawn of alternative technologies such as 3G WiFi.

Nowadays; proxy = please block me.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4640881 posted 10:26 am on Feb 10, 2014 (gmt 0)

@incrediBill, your .net site blocks me if I use my personal proxy server:

it appears you are trying to access this site from a hosting service.

Do you do anything to allow proxies?

This is the first time I have found the proxy blocked BUT I have found my ISP allocated dynamic IP blocked several times.

Angonasec




 
Msg#: 4640881 posted 2:47 pm on Feb 10, 2014 (gmt 0)

Tut-tut Grae, next you'll be using a Blackberry, or worse.... Opera.

Good on'ya Bill, wake them up.

dstiles




 
Msg#: 4640881 posted 9:23 pm on Feb 10, 2014 (gmt 0)

keyplyr
> I see 78.148.0.0 - 78.151.255.255. Has the range broadened?

That's one complete IP range for that ISP. What I quoted was the infrastructural IPs, which access sites for some unknown reason beyond simple DSL user access. Probably to do with anti-virus - I forget now: there is a thread on this somewhere in this forum. And the full range starts at 78.144, not 78.148.

trintragula - yes, I can stop MJ12, with or without the UA. The bot itself actually obeys robots.txt: the only ones I see (and block) are fakes.

Incidentally: it is unlikely you will get a real MJ12 from a server. The bot is "distributed" and run mostly (if not entirely) from DSL lines.

Lucy - I return a blank page for a known (95%-ish confidence) bot or server farm, but for lesser confidence I display an ID and a URL to report problems. Oddly (but understandably) most people so trapped complain to the site owner rather than me, often with no ID (which includes time/date), just to make my life more difficult. I have a different response for obsolete or suspect browsers, returning a 405 or 403 with a list of possible reasons.

iomfan - many proxy servers run from server farms. Many run HTTP/1.0 protocol. The trick is to determine how many of those are real people (I quoted education) and which are scrapers. I also see a lot of accesses proxying local IP ranges - 10.n.n.n, 192.168.n.n etc. Those, providing they have valid browsers driving them, are merely cautious / paranoid people trying to ensure they are not fed a virus. They are usually running linux.

Angonasec - that's the second time I've seen an attack here on blackberries. I've never had a problem with them. I currently have three blocks on their UAs, all for other reasons. As for Opera, a lot of hackers use the UA but not necessarily the tool. As they do with Firefox and other browsers. The trick here is recognising the difference by other means, which is not so difficult.

incrediBILL

WebmasterWorld Administrator incredibill us a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4640881 posted 9:33 pm on Feb 10, 2014 (gmt 0)

Do you do anything to allow proxies?


Yup, I have proxies blocked.

Plus, if your proxy is hosted somewhere, it won't work either, as most DCs are blocked.

If the proxy filters your browser headers, it's blocked.

I'm a bad boy.

lucy24




 
Msg#: 4640881 posted 1:37 am on Feb 11, 2014 (gmt 0)

As for Opera, a lot of hackers use the UA

Current Opera? With "OPR" in the UA string? I assume you're not talking about "Bork-edition", which appears to be Russian for "block me, I'm a robot".

Angonasec




 
Msg#: 4640881 posted 4:47 am on Feb 11, 2014 (gmt 0)

Mr. Stiles:
"that's the second time I've seen an attack here on blackberries. I've never had a problem with them. I currently have three blocks on their UAs, all for other reasons."

The devices themselves are, what's the word... klutzy enough, in terms of their use, but my objection to them wholesale arises from their +users' profile+.

Over the years that we've been whitelisting, blacklisting, and monitoring access logs, the activities of both RIM and their users brought the portcullis down rapidly.

My local Blackberry dealership has closed down, which is marginally gratifying.

Opera: The same goes in triplicate for those Icelandic Chappies, and their users. (No offence intended Brett :)

When the creators are evil, the products will inevitably attract their kind.

4serendipity

10+ Year Member



 
Msg#: 4640881 posted 12:32 pm on Feb 11, 2014 (gmt 0)

What type of site do you have?


I think this is the key question that you have to ask yourself. The answer to this question will likely be the deciding factor in which method to use.

graeme_p




 
Msg#: 4640881 posted 5:01 am on Feb 12, 2014 (gmt 0)

What is wrong with Opera? How is it evil? Opera has been one of the most innovative browser vendors: they were the first to introduce tabs; Opera has been highly configurable in the past (I don't know about the new version, though); and it is a fast, light browser that has always been very standards-compliant.

What is wrong with Blackberry users? Most I know use it because they need mobile access to access corporate email, and the user profiles should follow from that.

Angonasec




 
Msg#: 4640881 posted 11:37 am on Feb 12, 2014 (gmt 0)

Grae: Simply observe your logs, and notice the activity of live human users of those "things".

But let us not divert a useful thread eh :)
