Forum Moderators: open


Newbie: Need help with server farms and IP ranges

         

MrSavage

8:22 pm on Mar 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello all. I'm going to be very upfront here. I'm going to be one of the most aggressive webmasters in terms of dealing with my site traffic. However, I've been a pathetic webmaster up until this point, doing a half-assed job of being honest about my traffic and sites. Up until now, I've relied on Awstats and then cPanel IP banning. I just want to be transparent about my knowledge going into this.

So instead of burying this question in the server farms threads, perhaps asking it here will help out any other fellow newbies.

There is a lot to digest and I'm overwhelmed.

So in these server farms threads, people are reporting various IP ranges. So do I:

- open up my site's .htaccess file, add "deny from", and then paste in those ranges people are posting? Rinse, repeat?

From what I can see, this is almost like sports trading cards. Like we're collectors, sharing our info with others.

So should I just go at it, take the ranges that have been reported, and just start editing the heck out of my .htaccess file? I can imagine how massive it's going to become, but is that the technique I need to employ?

I'm very nervous, of course, because mistakes with banning IPs can easily happen and go unnoticed when a newbie like me is involved.

So ultimately I'm just wondering how to apply this information that kind folks have been posting. I'm all in. For me, this stage of being a webmaster is make or break. I do feel that things have festered, and the genuine traffic I see is probably going to be less than 50% of what I'm actually reading in Awstats. It's going to be grim, but up until now, I've been an idiot.

I would be grateful and appreciate any guidance on basic implementation. Thank you.

aristotle

1:21 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well you've come to the right place, in that there are some real experts here who have certainly helped me many times.

But before you start asking a lot of questions, I suggest that you spend some time reading and studying, to get to the point where you know what questions to ask. I've always found that self study is the most efficient way to learn the basics.

In any case, there's a lot more to it than just blocking IP ranges.

MrSavage

3:10 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the non helpful reply.

not2easy

3:16 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Ideally you won't need to block everything you find shared here; the ones that matter to you will be enough to occupy some time to get started. Once you have some control, it does become a little easier.

To figure out which ones matter for your traffic, download your raw access logs. Some hosts let you fetch each day's log at the end of the day, or a full month at a time. It might help to start by viewing a day at a time until you find your best method. If a range isn't visiting your site, there's no reason to add extra lines blocking it.

As you sift through things you'll learn more, it starts making sense, and soon you'll be here sharing your latest finds. It may seem slow, but you already have work habits that will help you figure out what works best for you. It can't hurt to read through the threads here, and it's a good thing you've started this one so you can ask when you hit a wall.

MrSavage

3:30 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can appreciate the logs. I'm there. I'm a fresh set of eyes, and I'm just saying that there is no clear starting place on this. Can I say that without sounding disrespectful? I've spent some time reading. The problem is reading and understanding. It's clear to those who know, but to someone like me, it's not clear. There are pages and pages of data center IPs, from months and months.

So my question was: for anyone willing to take one of those posts or IP ranges, what is the process for implementing that block? An example of .htaccess code to paste would be helpful.

I understand that people here know this and it's obvious to them, but anything is obvious once you know it. Explaining it is the challenge. Sorry, but although I'm considered a newbie, I'm also not an idiot. I hope that makes sense. Thanks.

not2easy

5:05 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Sorry, I see where you are. Assumptions don't help anyone. Yes, these ranges are generally added to htaccess in a deny list. There are various formats to use, and much depends on how the server is set up, so it makes a difference whether you control the server, it's shared, or you have some other arrangement.

order allow,deny
deny from 82.80.248.0/21
deny from 82.146.32.0/19
deny from 82.196.0.0/20
etc...
allow from all

is fairly standard. Some put several ranges on one line instead; note that the list is space-separated:
deny from 82.80.248.0/21 82.146.32.0/19


It might help you to copy CIDRs into your own spreadsheet and use that to check against as you examine logs. It also helps a lot to keep them in numerical order within htaccess; at a minimum that makes duplicates and overlaps much easier to spot.
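If you'd rather script it than eyeball a spreadsheet, a few lines of code can keep the list in order for you. A minimal sketch, assuming Python 3 and a plain text file with one range per line (the filename is made up):

import ipaddress

# sort_denies.py - print "deny from" lines in numerical order.
# ranges.txt (hypothetical) holds one CIDR or bare IP per line.
with open("ranges.txt") as f:
    nets = [ipaddress.ip_network(line.strip(), strict=False)
            for line in f if line.strip()]

# Sorting by network address keeps the htaccess section in numerical
# order, which makes duplicates and overlaps much easier to spot.
for net in sorted(nets, key=lambda n: (n.version, int(n.network_address))):
    print("deny from", net.with_prefixlen)

One caveat: Apache also accepts truncated forms like 64.74, which Python's ipaddress module will reject; expand those to full CIDRs (64.74.0.0/16) before feeding them to a script like this.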

lucy24

5:24 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



there is no clear starting place on this

Well, there really isn't. Suppose you like Ukrainian robots?

Let's pick the most recent post in the ongoing SSID thread. Happily it just names one range, so that's easy:
cccomm.com
64.113.160.0/20
64.113.160.0 - 64.113.175.255

What you do with this information is up to you. If you take an immediate, aggressive approach of "ban all server farms as soon as I learn of their existence" then you would put the line
Deny from 64.113.160.0/20

in your htaccess. Except that if you keep doing this, your htaccess will soon be several miles long, so you will probably choose to combine lines:

Deny from aa.bb.cc.0/20 ee.ff.0.0/15 11.22.33.44 55.66 77.88

You can add as many numbers as you like, so long as they are valid CIDR ranges. (Careful! A single blunder, such as a malformed range that Apache can only read as a hostname, will throw your whole site into DNS-lookups mode, which is not pretty and creates extra work that most people's servers don't need or want.) For the sake of your own sanity, keep them in order, both overall and within lines.
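If you want a safety net against that kind of blunder, ranges can be machine-checked before they ever touch htaccess. A minimal sketch, assuming Python 3; the middle entry is deliberately malformed:

import ipaddress

candidates = ["64.113.160.0/20", "64.113.160/20", "82.80.248.0/21"]
for s in candidates:
    try:
        # strict=True also rejects ranges whose host bits aren't zero
        ipaddress.ip_network(s, strict=True)
    except ValueError as err:
        print("do NOT paste into htaccess:", s, "-", err)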

My personal approach is to make a separate line for each top-level ("A") group, for example
Deny from 64.27.0.0/17 64.30.0.0/15 64.34 64.38.0.0/18 64.71.128.0/18 64.71.192.0/20 64.74 64.79.96.0/20 64.120 64.191.0.0/17 64.202.160.0/19 64.209.144.0/20 64.222.64.0/18 64.222.128.0/17 64.223.64.0/18 64.237.32.0/19
That was an unusually long block, because the 64's have been around for a long time and contain plenty of server farms.

Originally I had all my "Deny from..." lines in numerical order. Later I found it's more convenient to group RIPE and ARIN ranges separately. (Nothing significant from the other three. I don't bother blocking Brazilian or Vietnamese robots, because they never seem to come back, and most of them are just humans with infected browsers.) I've also got a separate section for China, because everyone draws the line somewhere, and to heck with collateral damage.

If you look closely you'll notice that the quoted "Deny from" line does not actually mention 64.113.160.0/20 because, again by personal preference, I don't block people unless they're proven to be offensive. (Or Chinese. So shoot me.) If a nice well-behaved robot comes by, reads and obeys robots.txt, and collects pages at a reasonable pace, heck, let it. What I do do, as often as I remember it, is to go into my personal database*, find the line for 64.113.160.0/20, and flag it as "robot". Then, if I ever do get an offensive visitor from that range, I already know it's a server farm and don't have to look up the IP.
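If your notes live in anything a script can read, that "already know it's a server farm" check is nearly a one-liner. A rough sketch, again assuming Python 3, with a made-up visitor IP:

import ipaddress

farm = ipaddress.ip_network("64.113.160.0/20")    # flagged as "robot"
visitor = ipaddress.ip_address("64.113.163.7")    # hypothetical log entry
if visitor in farm:
    print("known server farm - no WHOIS lookup needed")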

:: pause here to update records because I'd got all of 64.112.0.0/15 noted as a generic "US" with no further information ::

That being said... You will learn in time that some ranges are chronically dirtier than others. I don't suppose they actually post ads saying "Come one, come all, bad robots welcome, no questions asked"; it just turns out that way. You'll come to recognize the names.


* It's actually a group of html files, because reasons.

Pfui

5:59 am on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We were all new to SSID and server farm clobbering and such once. Welcome to the club!

The best way to begin, as was kindly suggested, is to read up: the Library docs and the Charter, the major threads, the moderators and top posters. Understanding can come all too slowly, alas.

The best way to converse is respectfully. Snarky "Thanks for the non helpful reply" remarks waste everyone's time and get yourself ignored.

The best way to proceed is with specifics. You can Google and practice "[t]he process for implementing [IP blocks]" on your platform on your own. Then test your code on your own site(s). For example, make a private directory and start an .htaccess file and a blank index. Try various Allow/Deny and other rules you've learned by trying to block yourself by IP, Host, UA. Fail? Rinse. Repeat. If/when code you re-re-repeatedly try doesn't work, show your work -- your code and results -- when you ask for help from wiser eyeballs.

The best way to succeed is by sweat equity, rather than expect folks to custom-craft "htaccess code to paste" for you. That's your job.

wilderness

3:19 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here are some old threads that should be useful in providing examples of syntax.

(Please keep in mind that one of the most important habits for a beginner with htaccess is keeping copies of revised files, so that if your syntax fails and creates a 500 error (taking the site down), you can immediately reactivate the old/backup file.)

seven magic words [webmasterworld.com]
Close to Perfect htaccess [webmasterworld.com], an old thread (both the IPs and the UAs are mostly outdated, however the methods are still functional)
Old links and/or methods [webmasterworld.com]
old thread on headers [webmasterworld.com]
Conditional denies [webmasterworld.com]

A crucial point that most beginners misunderstand is the use of anchors:
Begins with: ^
Ends with: $
Exact match (begins and ends with): ^...$
Contains: no anchor used
Exception (negation): !
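To make those anchors concrete, here is a minimal sketch of how they might look in htaccess, matching against the User-Agent header (the agent strings are only illustrative, not recommendations):

# Contains: "spider" anywhere in the UA (no anchors)
SetEnvIfNoCase User-Agent spider bad_bot
# Begins with: a UA starting with "Java"
SetEnvIfNoCase User-Agent ^Java bad_bot
# Exactly: a UA that is nothing but a single hyphen
SetEnvIfNoCase User-Agent ^-$ bad_bot
# Exception: in mod_rewrite, ! negates a match, e.g.
#   RewriteCond %{HTTP_USER_AGENT} !^Mozilla

Order Allow,Deny
Allow from all
Deny from env=bad_bot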

MrSavage

3:21 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Pfui, clearly you can save your time and not bother replying to anything I write. That's my way of being kind and saving you time. Your advice is about as helpful as "use Google search", which is the kind of attitude I would expect at the Google support forums, not here. Snarky responses only come when idiotic comments are made. If you notice a lack of newbies, then maybe your attitude or outlook is part of the issue. I have no problem elsewhere on WebmasterWorld, so perhaps the answer is looking at yourself in the mirror. People have, if you noticed, offered tangible guidance, whereas you just offered judgment, for what purpose I have no idea. Fair enough? If it's people like you, then I'll pass on your advice.

MrSavage

3:23 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@wilderness, I appreciate that. It's not so much how to edit an .htaccess file. It's more a question of actually implementing some of the IP addresses being listed in pages and pages and pages and pages of IP farms. I'm trying to ask what process you folks use when looking at that data. I hope that makes sense.

wilderness

3:27 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pfui is a long-standing participant of this forum and has made an extensive contribution here.

There's a very small group of regular participants in this forum, most of whom have been here more than a decade (through a lot of crap).
Alienating those same members will only decrease your chances of assistance.

Don

MrSavage

3:32 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy, thank you very much. That gives me a good start to tackle this. So I'm just wondering: are the ranges being posted in the content farms threads all evil? Or is it a mix of good and evil, with each webmaster having to decide which ones are relevant? Just wondering if I've got that right. Thank you.

MrSavage

3:34 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@wilderness, yeah, no kidding. Those types of people are usually the most unfriendly: the types who know everything and therefore really have no clue what it's like to come in with fresh eyes. I know those types well, and I would rather not hear from them. If they're long-term members with that mentality, then I think I'm being respectful of their time by telling them not to waste it on me. That's who I am.

wilderness

3:37 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's more a question of actually implementing some of the IP addresses being listed in pages and pages and pages and pages of IP farms. I'm trying to ask what process you folks use when looking at that data. I hope that makes sense.


I see multiple possible questions here that need expanding:

1) Are you asking how to determine the activity criteria for denying a UA and/or IP?
2) OR are you asking for the method (syntax) of applying those criteria?

You've already been advised to review your raw access logs (to which you also replied, asking "what should I look for").
Reviewing logs is a learned process, and different for each one of us.

1) You might begin by looking at IPs and UAs from regions whose traffic is not beneficial to your website.
2) Look for UAs (User Agents) that are not standard browsers.
3) Understanding the structure of your website(s) also helps in reviewing logs: how visitors navigate your site(s), and whether a visitor is actually navigating your site or is a bot grabbing pages outside of the structured links.

wilderness

3:43 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Those types of people are usually the most unfriendly: the types who know everything and therefore really have no clue what it's like to come in with fresh eyes


I've had my share of arguments in this forum (despite the charter), even had comments removed that I felt were unfair.

In most instances, it's simply more effective to take a deep breath and bite your tongue (i.e., keyboard), then consider that perhaps the longtime member is just having a bad day or even a DUH moment (it happens to the best of us, myself included).

MrSavage

4:02 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been analyzing my raw access logs (thanks to the advice of WebmasterWorld members). I appreciate that you're saying it's a learned process, and I'm fully engaged in it at this point. I think I first need to deal with the many, many, many wp-login.php hits, which will declutter some of what I have to go through.

I think the answer to 1) and 2) is: partially both. With that much information in those farm threads, it's obvious that not all of it is relevant to everyone and should be written into their htaccess file. Lucy has explained very well how this data can be placed into my htaccess (the syntax), so that's great.

It's hard to narrow down the question regarding processing the data in those threads: the activity criteria that people might use to decide which farms and ranges to ban. I can't quite understand how to get from looking at raw logs, for example, to determining which IPs from those threads I should be focused on. I guess it's a bit overwhelming to me, if that's not already obvious. I'm not sure a cleaner answer than "it depends" is possible when asking which IPs from those threads I should consider using in my htaccess.

It's obvious to me that this isn't a cut-and-dried, one-size-fits-all discussion. That makes it a bit difficult to formulate clear questions. As I've said, I greatly appreciate the assistance. I'll also write a guide somewhere once I understand the process. Traffic control seems so important now, and at the same time it's a very daunting task.

wilderness

4:17 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't quite understand how to get from looking at raw logs, for example, to determining which IPs from those threads I should be focused on.


You take an IP from your logs.
Then use one of the regional registries (ARIN, RIPE, APNIC, or another) to perform a WHOIS lookup and determine the broader range containing that IP.
For example:
Just 90 minutes ago, I had a request for both a page and a folder that do not exist, from the following IP:
130.0.232.49
The query at RIPE resulted in 130.0.232.0 - 130.0.233.255,
with the CIDR, located near the bottom of the page, of 130.0.232.0/23.
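If you end up doing many of these, the lookup can be scripted rather than done through the registries' web forms. A rough sketch, assuming Python 3 on a Unix-like machine with the standard whois command-line tool installed:

import subprocess

ip = "130.0.232.49"   # the IP from the example above
out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
# RIPE labels the range "inetnum:"; ARIN uses "NetRange:" and "CIDR:".
for line in out.splitlines():
    if line.lower().startswith(("inetnum:", "netrange:", "cidr:")):
        print(line.strip())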

many, many, many wp-login.php

There are so many WP requests these days looking for PHP vulnerabilities that, no matter what steps we have in place, the requests are going to continue.
There are some questions to ask yourself to narrow the reach of those requests:
1) Do I even have pages that are WP? (If not, simply deny all requests for WP; see the sketch below.)
2) Do my sites use PHP? (If not, deny requests for PHP beyond the names of your own pages.)
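For point 1, a minimal sketch of what that denial might look like on a site that doesn't run WordPress, in the same Apache 2.2 style as the earlier examples (xmlrpc.php, another commonly probed WordPress file, is included here as an assumption about what your logs will show):

# No WordPress here, so nobody has a legitimate reason to request these
<FilesMatch "^(wp-login|xmlrpc)\.php$">
Order Allow,Deny
Deny from all
</FilesMatch>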

MrSavage

5:06 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thank you. I know a while back I came across an htaccess entry that will block all IPs except for ones I manually enter. I don't know the ins and outs, but Limit Login Attempts, even with the most absurd settings possible, appears to have no appreciable positive results. But in terms of your suggestion of denying all requests, I think I'm going to be aggressive to that extent. I'm going to look at the .php side as you mention. I'm all in, so thank you for the advice on those two points.

wilderness

5:20 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know a while back I came across an htaccess entry that will block all IPs except for ones I manually enter. I don't know the ins and outs, but Limit Login Attempts, even with the most absurd settings possible, appears to have no appreciable positive results. But in terms of your suggestion of denying all requests,


I wasn't suggesting that you deny all requests, rather that you conditionally deny requests for WP & PHP beyond your active pages.

Applying a 'deny all with exceptions' approach should only be considered if your website(s) are of a private nature (i.e., an intranet or extranet) and you're NOT interested in public and/or commercial access.

It's entirely dependent upon the purpose/audience of your site(s), which only you can determine.
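For completeness, the 'deny all with exceptions' pattern looks roughly like this; the allowed address below is a documentation placeholder, not a real one:

# Private/intranet use only - refuse everyone except known addresses
Order Deny,Allow
Deny from all
Allow from 192.0.2.1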

not2easy

5:56 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you view the access logs in spreadsheet format in Excel or OpenOffice, you can sort by any field you like. If you sort by server response, you can weed out the requests that are already receiving a 403 (forbidden/blocked) response. Blocking access does not prevent attempts: your logs will remain full of 403 responses as long as attempts are being made. When you say
Limit Login Attempts, even with the most absurd settings possible, appears to have no appreciable positive results.
it is because you expect that there will not be any requests at all, but that's not how it works. Unless you see those requests getting a "200" (OK) server response, the tool is doing exactly what it is supposed to do: prevent logins.

If you sort by IP, you can get a look at which files are being requested and served. Those requests come from either visitors or bots. Looking at the files requested tells you whether those files would be needed by a human. When you see requests for html (or any "page" URL) but no .css, no .js, etc., you can be reasonably sure it is a robot. Keep in mind that the first request matters most for those supporting files, since caching provides them for subsequent pages. Then you can look at the UserAgent and decide whether it is a bot you want to have visiting, or a bot disguised as a human that is not requesting the files a human on IE8 (which is what the UA claims) would need in order to look at the page it asked for.

Sort by request to see which "humans" ask for robots.txt. As you spot strangeness, you pick up ways to find more. All I can say is that it takes time to understand your traffic. You already know that, so good luck!
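If the spreadsheet route feels clumsy, the same sorting can be done with a short script. A rough sketch, assuming Python 3 and a common/combined-format access log; the filename is made up, and the simple field-splitting will be confused by requests containing stray spaces:

from collections import Counter

status, ips = Counter(), Counter()
with open("access.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 9:
            continue
        ips[parts[0]] += 1      # client IP is the first field
        status[parts[8]] += 1   # status code follows the quoted request

print("responses:", status.most_common())
print("busiest IPs:", ips.most_common(10))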

lucy24

6:50 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are the ranges being posted in the content farms threads all evil?

No. Absolutely not. That's what I was trying to stress. In particular, the "server farms" thread (the one that goes on forever and gets rebooted when it gets too long) is intended to be a list of all server farms, good, bad, or mediocre. Everyone starts out small, and some robots serve useful purposes. In particular: if you use any tool that involves a robot crawling other people's sites, then that tool is only useful if the robot is in fact allowed to crawl freely.

:: happily picturing wilderness taking a deep bite out of his keyboard, yum yum chomp ::

More how-to:

My current log-wrangling involves feeding raw logs into a javascript package that I run locally. (I have a clutch of very, very, very small sites.) Most of this involves separating the robots from the humans so I can see what the humans are up to.
-- Pull out known quantities (Googlebot, bingbot, Page Speed Insights-- anything that has already earned a place in the Ignore list).
-- Pull out 403s, because those have already been blocked. Even then, I'll check periodically to make sure people who got blocked by behavior or UA are also getting blocked by IP. Belt and suspenders.
-- Pull out and ignore certain other requests, such as robots.txt from an otherwise blocked visitor.
-- Pull out known robots such as Facebook and anyone I'm currently evaluating for Ignore or Block status.
-- Pull out known botnet patterns, which are often easy to identify after the fact even if they weren't blocked upfront.
-- Pull out unambiguous humans, identifiable by behaviors such as requesting favicon.ico and/or /piwik/ or, sometimes, by a search-engine referer or by a certain pattern of .css requests. Even if an atypical robot requests all files linked from a particular page, the requests will typically not come in the same order as from a human.

Anything that's left gets a personal look. If a particular IP requested only a page, and nothing else, it's a robot. If they requested only certain files, but they were here yesterday, it's probably a human. If the visit began with a request for robots.txt, it's a robot. If a particular IP requested more than some-selected-number of pages, take a closer look and see whether it's a human who likes your site (hurrah!) or a robot doing a full sweep.

In general I ignore random robots that just grab the front page and then go on their way. It's just not worth the bother. The 403 candidates are the ones who request an interior page (even if it's just one), or who send a fake referer (assuming it isn't something like semalt that has been blocked on its own merits), or an obviously fake UA.

Open your log files in a text editor that does global Regular Expression searches and displays the results in a new window. Once you're looking at the list of only requests from 11.22.33 then it's generally very easy to see if it was a human or a robot.

I currently keep a separate log for image requests unconnected with a page. I may eventually decide it's not even worth the bother, especially since I'm never sure I've filtered out the ones that are just image-search SERPs, not actual requests.

Tangential: When I say "same IP" I've found by experience that it's best to check only the first three blocks, as in
^(\d+\.\d+\.\d+\.)\d+
Plenty of human ISPs don't use the identical IP number for all requests associated with a particular page. Some will have even further variation (notably AOL and satellite providers), but this is the most useful generic approach.
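A rough sketch of those last two ideas combined: group log lines by the first three blocks, then flag groups that requested pages but never a .css file. Assumptions: Python 3, combined log format, a made-up filename, and a crude page test you would adjust for your own URLs:

import re
from collections import defaultdict

prefix_re = re.compile(r"^(\d+\.\d+\.\d+\.)\d+")   # the pattern above
visits = defaultdict(list)

with open("access.log") as f:
    for line in f:
        m = prefix_re.match(line)
        if m:
            visits[m.group(1)].append(line)

for prefix, lines in visits.items():
    got_css = any(".css" in entry for entry in lines)
    pages = sum(1 for entry in lines if ".html" in entry or '"GET / ' in entry)
    if pages and not got_css:
        print(prefix + "x :", pages, "page(s), no .css - probably a robot")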

MrSavage

9:52 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Below, I'm hopeful the quote will add some additional value to this thread for any newcomers such as myself. In a way it's funny: in going through my logs today and searching Google, I ended up back here in the monster server farm thread. So I'll quote this terrific summary regarding the server farm threads, because it answers a portion of what I was trying to figure out.

Some of the legacy threads are gone now, but basically this particular thread, the Server Farms sub-category of the Search Engine Spider and User Agent Identification forum, does list all known hosting server farms, cloud servers, data centers & colocation company ranges. AWS is so large and prominent that it got its own thread.

All this has been many years in the making and quite a lengthy read but one that will answer most of your questions. These forums are an archive of information for today's webmaster. Most bots do get mentioned in one of the forum's threads, but as I noted, many older threads are now either gone, or unsearchable after the reorganization of WW.

If you see behavior in your server's access logs that is questionable, look up the IP address. Do the research. Find out what type of company the range is assigned to. Many agents disguise themselves as something they're not. In time you'll become skilled at profiling them.

Most of us agree that agents from any of the above have no reason to access our web sites, thus we list the company and their respective server ranges here. What you do with this information is up to you. What works for one webmaster may not for another. One site's bad agent may be thought of as benign or even beneficial to another. Your site, your choice.

-keyplyr, responding to questions about: "How do you decide what to list in these threads?", "Is it intended to collect all farms by some criterion? Or are you just using it to share occasional discoveries?", "How do you distinguish between a human-free farm and a provider who just happens to have a human in it with a virus-infected computer?", and "How do you decide what range to include? D'you just look up the CIDR in domaintools or some equivalent and block that?"


I'm hoping that I can help not only myself, but anyone else just seeing the light regarding cleaning up traffic. Thank you.

MrSavage

10:15 pm on Mar 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm very grateful to everyone, thank you very much. I hope that after I gain some real experience at this, I can help other webmasters tackle their traffic. No question, this has been very humbling so far. As I've tried to explain, my webmastering days are at a crossroads. If cleaning up my traffic can't turn my fortunes around, then I will need to dial the whole grandiose plan back more than a few notches. I don't like giving up. Tidying up the traffic side of things truly is everything to me right now. I feel a lot of pressure, to be honest, so I'm a bit tense. Thanks again; I'm reading and rereading everything here.

keyplyr

2:11 am on Mar 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the non helpful reply.

I thought aristotle gave an appropriate reply.