Forum Moderators: open
Came as: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0/1.0 (bot; [)...]
Came from: 216.158.1.nnn
Consult Dynamics, Inc
There must be some old threads on this?
Although I don't care for these types of 3rd party services accessing my pages, one at least needs to consider that their services closely parallel many of the K-12 networks (choke-choke, gag-gag).
Log samples from 2004, 2007, 2007, and 2009:
#On topic request
216.158.61.zz - - [22/Apr/2004:10:47:07 -0700] "GET /MyFolder/MyPage.html HTTP/1.1" 200 29784 "[google.com...]lr=&ie=ISO-8859-1&oe=ISO-8859-1" "Mozilla/4.0 (compatible; MSIE 5.16; Mac_PowerPC)"
#Duplicated requests from Consult and visitor IP
58.227.159.zzz - - [18/Jan/2007:10:07:33 -0800] "GET /MyFolder/MyPage.html HTTP/1.0" 403 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
207.245.84.zz - - [18/Jan/2007:10:07:48 -0800] "GET /SameFolder/SamePage.html HTTP/1.1" 200 39006 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
#On topic request, used dual IPs
199.95.171.z - - [26/Oct/2007:09:52:51 -0500] "GET /MyImage.gif HTTP/1.1" 200 1925 "RequestedPage.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
216.158.5.z - - [26/Oct/2007:09:52:51 -0500] "GET /RequestedPage.html HTTP/1.1" 200 49699 "http://www.google.com/search?q=On+topic+" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
#Requested multiple pages
216.158.1.zzz - - [05/Feb/2009:22:15:58 -0600] "GET /MyFolder/MyPage.html HTTP/1.1" 200 11217 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"
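The duplicated-request pattern flagged above (the same content fetched moments apart by a service IP and a visitor IP) can be spotted mechanically. A rough sketch in Python, assuming combined-format logs; the regex, field names, and 30-second window are my own assumptions, not anything from this thread:

```python
import re
from datetime import datetime

# Combined log format: IP, identd, user, [time], "request", status, ...
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def parse(line):
    """Pull (ip, path, timestamp) from one combined-log line, or None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), m.group("path"), when

def near_duplicates(lines, window=30):
    """Yield (path, ip1, ip2) when two different IPs request the same
    path within `window` seconds of each other."""
    seen = {}  # path -> (ip, time) of the most recent request
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ip, path, when = rec
        if path in seen:
            prev_ip, prev_when = seen[path]
            if prev_ip != ip and abs((when - prev_when).total_seconds()) <= window:
                yield path, prev_ip, ip
        seen[path] = (ip, when)
```

Run over a day's log, this surfaces candidate pairs for manual review rather than proving anything by itself; the 2007 sample above would only match if the two requests were for the same path.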
Perhaps I should expand on the "K-12" mention?
There are various 3rd party services which provide networks to K-12 locales.
My intention was to point out that many of the new types of limited networks we are seeing are a near parallel to the "K-12" networks.
Although I don't personally like the idea of throwing an "umbrella" over the terms "educational" or "3rd party", there are some legitimate instances of these services that provide benefit to both the user and the webmaster.
I simply feel that each instance must be reviewed.
I looked at one similar network service today that proclaimed its services as global. Thus the possibility exists that the service could (at least in effect) act as a proxy when sending its customers to websites, hiding their actual identity.
Don
I'm quicker to open that umbrella over those kinds of bots cause all my experience with educational bots has been negative. They're usually really badly behaved bots that are part of some student's class project. Proxy or whatever, they've all been tarnished in my mind.
Do you store all the IP Addresses these bots use for future reference? Cause I tried that for awhile and wound up with a database table that was in the tens of GB.
I'm quicker to open that umbrella over those kinds of bots cause all my experience with educational bots has been negative. They're usually really badly behaved bots that are part of some student's class project. Proxy or whatever, they've all been tarnished in my mind.
Gary,
"My widgets" and the pages/articles within my sites provide some historical references. (Somewhere there's an old thread where a General from the early revolution was an annual research topic for a specific school; my page provided a reference to a "widget" of the same name, which was not related to their quest, yet the hits continued. Eventually I purposely misspelled the name in the page contents to avoid the hits.)
Much of my content actually provides excellent source leads for these types of learning; however, I'm still required to make an individual determination on the activity of each network and whether the visitor is crawling, caching, or simply utilizing the references I have available.
Do you store all the IP Addresses these bots use for future reference? Cause I tried that for awhile and wound up with a database table that was in the tens of GB.
Yes and no.
Primarily for North America; for the other registries (RIPE and APNIC) I merely make notations of the ranges outside of the NA ranges, with no info on the bots themselves because they don't get into my sites.
For some time, I would make the notations and additions to myself in emails and then periodically export the emails to a local folder.
Simultaneously, I built a directory structure by category and name, in which the bot text files of logs and the references were contained.
For a few years now, I've been using the Copernic Desktop Tool for my widget data.
It works on my IP, registrar, and crawler data as well.
The tool builds a database index, though not in a format usable by actual database software.
My created folders (including the aforementioned exported emails) are just under 150 MB.
Perhaps the most frustrating recent addition to my IP probes is running "tracerts" for IPs that do not have subnets defined (it should be a crime).
Don
216.158.1.192/28 -> [webmasterworld.com...]
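For anyone wanting to check hits against a posted range like this one, Python's standard ipaddress module handles the CIDR arithmetic; the range below is just the one cited in this post:

```python
import ipaddress

# The /28 covers 216.158.1.192 through 216.158.1.207 (16 addresses).
CONSULT_RANGE = ipaddress.ip_network("216.158.1.192/28")

def in_range(ip: str) -> bool:
    """True if the visitor IP falls inside the posted range."""
    return ipaddress.ip_address(ip) in CONSULT_RANGE

print(in_range("216.158.1.200"))  # True
print(in_range("216.158.2.200"))  # False
```

Storing ranges as CIDR notations instead of individual addresses is also what keeps the notes small, compared with logging every IP seen.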
I think part of the problem with the IP Address data I was storing was that I stored it for all user agents, even known browsers. I need to work on that and try again.
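A minimal sketch of that filtering idea: skip rows whose user-agent looks like an ordinary browser before they ever reach the table. The substring list and function name here are my own invention, and (as the logs above show) many of these bots spoof browser strings, so UA filtering alone will miss some of them:

```python
# Substrings that mark common desktop browsers; anything else gets stored.
# Illustrative only -- real UA classification needs more care, since
# bots routinely present browser user-agent strings.
KNOWN_BROWSER_HINTS = ("Firefox/", "Chrome/", "Safari/", "MSIE ", "Opera/")

def should_store(user_agent: str) -> bool:
    """Store only UAs that don't look like an ordinary browser."""
    return not any(hint in user_agent for hint in KNOWN_BROWSER_HINTS)
```

Even a crude pre-filter like this cuts the bulk of browser traffic out of the table, which is most of the volume in a typical log.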
Many people with active pages are working in this regard (although they may be inaccurate); why replicate something so many others are doing? That was, and is, my logic.
My method of simply documenting offenders (for lack of a better term) requires much less data storage.
Don