This thing shows up every 30-40 minutes on two of my sites

         

aristotle

9:18 pm on Mar 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It fetches the home page again and again throughout the day, every 30-40 minutes, on two of my sites (same hosting company but different servers).

The Latest Visitor entry is as follows:
Host: 128.107.239.233
/
Http Code: 200 Date: Mar 01 14:23:10 Http Version: HTTP/1.1 Size in Bytes: 9304
Referer: -
Agent: http_load 12mar2006

The IP lookup result is:
IP: 128.107.239.233
Hostname: 128-107-239-233.cisco.com
ISP: Cisco Systems
Organization: Cisco Systems
Services: None detected
Type: Corporate
Assignment: Static IP
State/Region: California
City: San Jose

I just finished blocking this UA from all of my sites. But can someone please explain what purpose it serves to keep fetching the same file over and over again every 30-40 minutes? Why would someone do this?

keyplyr

12:43 am on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IMO it's likely an "uptime" or "update" type check and the responsible company leases server space at Cisco.

These services have been popular with newbies who want assurance their host is keeping their site available 24/7. Other similar services will let the subscriber know when your page gets updated. Some social media companies may also use a tool like this to know when to come back and scrape new content.

Some infamous abusers: UptimeBot, PingDom, FollowThatPage, Genieo

aristotle

1:33 am on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks keyplyr
I think you're probably basically right. Actually, after I made my post I started thinking the same thing. I remember using a free uptime checking service years ago (when I was a newbie), but I created a special tiny little file for it so that it wouldn't use up much bandwidth. So this thing could be checking for something, but who knows what.

lucy24

3:00 am on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Now, that would make plenty of sense if you were checking your own site. But do you have the kind of content where users go frantic with worry if you're down for half an hour, so they need to check continuously on their own initiative? I Am Envious.

keyplyr

3:10 am on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I get these hits from a dozen or so entities every single day. I guess I have that "kind of content where users go frantic with worry..." :)

blend27

12:10 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@aristotle
re: I just finished blocking this UA from all of my sites: http_load 12mar2006

One could write a rule that would automagically block all user agents that do not contain parentheses

or that have parentheses but unbalanced, e.g. Mozilla/4.0 (compatible; MSIE 7.0; (Windows NT 5.1; Trident/4.0), where the UA has two opening ones and one closing one, or vice versa.

Throw in the presence of a double quote " in the UA and you've got yourself a winner!

Works like a charm.
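
For readers who want to try the idea offline before touching .htaccess, here is a minimal Python sketch of the three checks described above (no parentheses, mismatched counts of opening and closing parentheses, or a double quote in the UA). The function name `is_suspect_user_agent` is made up for illustration, and this is only one reading of the rule, not blend27's actual implementation.

```python
def is_suspect_user_agent(ua: str) -> bool:
    """Return True if the UA trips any of the three heuristics from the post.

    Hypothetical helper: the name and exact ordering of checks are assumptions.
    """
    opens = ua.count("(")
    closes = ua.count(")")

    # 1. No parentheses at all (e.g. "http_load 12mar2006").
    if opens == 0 and closes == 0:
        return True

    # 2. Parentheses present but the counts of "(" and ")" differ.
    if opens != closes:
        return True

    # 3. A literal double quote anywhere in the UA string.
    if '"' in ua:
        return True

    return False


if __name__ == "__main__":
    samples = [
        "http_load 12mar2006",
        "Mozilla/4.0 (compatible; MSIE 7.0; (Windows NT 5.1; Trident/4.0)",
        "Mozilla/5.0 (BB10; Touch) AppleWebKit/537.35+ (KHTML, like Gecko) "
        "Version/10.3.1.2243 Mobile Safari/537.35+",
    ]
    for ua in samples:
        print(is_suspect_user_agent(ua), ua)
```

Running a function like this over a day's worth of logged UAs is a cheap way to see what a blanket rule would have caught before deploying it.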

aristotle

2:44 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



blend27
Do you mean that all legitimate UA's always contain parentheses somewhere?

Anyway, my .htaccess coding skill isn't good enough for me to risk that kind of a general block. I need to keep things simple. My approach is to only take action when something becomes obnoxious, like in this case. If something doesn't show up more than once a day, I don't worry about it.

aristotle

3:09 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy -- Apparently there is a bot especially devoted to change detection. Here is a visit it made yesterday:
Host: 63.249.66.212
/robots.txt
Http Code: 200 Date: Mar 01 13:09:52 Http Version: HTTP/1.1 Size in Bytes: 26
Referer: -
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; [changedetection.com...] )

/
Http Code: 200 Date: Mar 01 13:09:52 Http Version: HTTP/1.1 Size in Bytes: 153773
Referer: -
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; [changedetection.com...] )

It shows up about once a day on another one of my sites. I don't have any idea who is sending it.

blend27

3:10 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The only legitimate and reasonably modern UAs that don't have parentheses at this point would be older BlackBerry UAs (BB OS 5/6/7).

The latest one, BB10, is as straightforward as: Mozilla/5.0 (BB10; Touch) AppleWebKit/537.35+ (KHTML, like Gecko) Version/10.3.1.2243 Mobile Safari/537.35+

"BB10; Touch" is for anything that is BB10 and up.

Here is a list of the last 200 bad ones caught with the rule above:


Screaming Frog SEO Spider/3.1
pilican/Nutch-1.9
My Nutch Spider/Nutch-1.9
CRAZYWEBCRAWLER 0.9.2, http://www.crazywebcrawler.com
Firefox
python-requests/2.5.0 CPython/3.4.2 Windows/8
python-requests/2.5.1 CPython/2.7.8 Linux/3.14.26-24.46.amzn1.x86_64
PycURL/7.19.3 libcurl/7.35.0 GnuTLS/2.12.23 zlib/1.2.8 libidn/1.28 librtmp/2.3
Spiderbot/Nutch-1.7
A6-Indexer
Wegtam Crawler/Nutch-1.9
StatsInfo
QH/Nutch-1.5
Robocop
Screaming Frog SEO Spider/2.55
binlar_2.6.3 test@mgmt.mic
WhatWeb/0.4.8-dev
www.osaicbt.com/Nutch-2.2.1
GigablastOpenSource/1.0
Wegtam Crawler/Nutch-1.10-SNAPSHOT
Xenu Link Sleuth 1.2d
python/splinter
python-requests/2.3.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64
WinInet Test
ContextAd Bot 1.0
SMcrawler
test nutch/Nutch-1.8
MPDP-ALR-Search-Bot
curl/7.33.0
pilican/Nutch-1.9-SNAPSHOT
chroot-apach0day-HIDDEN BINDSHELL-ESTAB
chroot-apach0day
python-requests/2.0.0 CPython/2.7.3 Linux/3.2.0-40-virtual
wsr-agent/1.0
something
Python-urllib/3.4
IE/4.0
Wegtam Crawler/Nutch-1.9-SNAPSHOT
Lynx/2.8.8dev.5 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.8.6
mozilla-agent/5.0
Mozilla 28.0
Comodo-Webinspector-Crawler 2.1
python-requests/2.2.1 CPython/2.7.6 Linux/3.14.1-x86_64-linode39
NerdyBot
PycURL/7.29.0
Screaming Frog SEO Spider/2.30
Xiao/Nutch-1.8
My Nutch Spider/Nutch-1.5-SNAPSHOT
raynette_httprequest/1.0
libwww-perl/6.05
python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-358.14.1.el6.x86_64
Screaming Frog SEO Spider/2.20
hrbot
StructuredWeb Agent
Googlebot-2.2
testingforex.com
Mozilla/8.0
niki-bot
.NET Framework Test Client
pimonstar
curl/7.30.0
Testing/Googlebot
M
Lynx/2.8.7rel.1 libwww-FM/2.14FM
WWW-Mechanize/1.73
Comodo Spider 1.2
python-requests/2.1.0 CPython/2.7.3 Linux/2.6.32-042stab078.28
ZemlyaCrawl 1.0
http://blog.erratasec.com
athenalion Nutch Spider/Nutch-1.7
Nutch Spider/Nutch-1.5
ERACrawler/1.0
My Nutch crawler/Nutch-1.5
CCBot/2.0
www.socialayer.com Agent 0.1
Go 1.1 package http
python-requests/1.2.3 CPython/2.7.3 Linux/3.2.0-53-generic-pae
AJCrawler/Nutch-1.7
Lynx/2.8.8dev.16 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1
QACC browser
python-requests/0.14.1 CPython/2.7.2 Windows/7
YisouSpider
curl/7.29.0
My Nutch Spider/Nutch-2.2.1
blogtop.us crawler - http://blogtop.us/
xmlset_roodkcableoj28840ybtide
scrutiny/3
MozillaXYZ/1.0
http_requester/0.1
asynchttp
asynchttp
Nutch12/Nutch-1.2
CB/Nutch-1.7
test
ADmantX Platform Semantic Analyzer - ADmantX Inc. - www.admantx.com - support@admantx.com
rdream.com
Mysite/Nutch-2.2.1
feedfinder/1.36 Python-urllib/1.17 +http://www.aaronsw.com/2002/feedfinder/
Python-urllib/3.3
LWP::Simple/5.835 libwww-perl/5.836
guoming/Nutch-1.6
python-requests/1.1.0 CPython/2.7.4 Linux/3.8.0-19-generic
W3C_Validator/1.781
NIS Nutch Spider/Nutch-1.7
xrumerguestbook2.exe
xcvbs.exe
MyNutchSpider/Nutch-2.2.1
Chrome
Mozilla 5.2
MyNutchSpider/Nutch-2.2
MyNutchTest/Nutch-1.6
PHP/5.2.17p1
nrsbot/6
Mozilla/10.07
Firefox/19.01
visaduhoc.info Crawler
NETCRAFT
MyNutchSpider/Nutch-2.1
Googlebot
ip-web-crawler.com
W3C_Validator/1.3 http://validator.w3.org/services
Valuethesite.org
WhatWeb/0.4.8
DHBot
Samsung Galaxy Notebook II
checks.panopta.com
kraken/0.6.0
Content Crawler Spider
Nutch Spider/Nutch-1.4
Nutch Spider/Nutch-1.6
w3m/0.5.2
libwww-perl/5.837
Pinterest/0.1 +http://pinterest.com/
OpenWebIndex/Nutch-1.6
Lynx/2.8.6rel.4 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.8g
Microsoft Office Existence Discovery
W3C_Validator/1.3
Zookabot/2.4;++http://zookabot.com
Lynx/2.8.6rel.5 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.8h
Mozilla/5.0 whoiam [http://www.axxus.de/]
aboutthedomain
curl/7.28.1
My Nutch Spider/Nutch-1.6
vVwWgW4 r4W4QjbrwOb
Online Chat Crawler
Screaming Frog SEO Spider/2,03
WordPress.com mShots; http://support.wordpress.com/contact/
Mysite/Nutch-2.0
www.integromedb.org/Crawler
Mozila/5.0
Microblog-Explorer/0.3
Lynx/2.8.5rel.1 libwww-FM/2.15FC SSL-MM/1.4.1c OpenSSL/0.9.7e-dev
Zeus 27924 Webster Pro V2.9 Win32
My Nutch Spider/Nutch-1.5
EasouSpider
NexiSpider/Nutch-1.5.1
LSSRocketCrawler/1.0 LightspeedSystems
SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
WebCompanyCrawler
LWP::Simple/6.00 libwww-perl/6.04
wminer/Nutch-1.4
CC-rget/5.818 libwww-perl/5.837
dmoz_scraper/1.0
OpenWebIndex/Nutch-1.5
nb-bot
PycURL/7.24.0
PopScreenBot
LWP::Simple/5.835 libwww-perl/5.837
Clickthink/CT-3.1
Zeus 40614 Webster Pro V2.9 Win32
DomainTaggingbot; +http://www.opendns.com/community/domaintagging
nutch/Nutch-1.5
MobileSafari/7534.48.3 CFNetwork/548.0.4 Darwin/11.0.0
My Crawler/Nutch-1.4
IE 5.5 Compatible Browser
LinksCrawler 0.1beta
Pinky and Brain/Nutch-1.5.1
nutch-solr-integration/Nutch-1.4
'Mozilla/5.0
Feed::Find/0.07
Java/1.6.0_24
Java/1.6.0_21
python-requests/0.12.1
HTMLParser/2.0
AutoIt
sGroup crawler 1/Nutch-1.3
Screaming Frog SEO Spider/1,90
intelium_bot
LWP::Simple/5.79
libwww-perl/6.04
COMODOSpider/Nutch-1.2
coruscan/Nutch-1.4
nutch-1.4/Nutch-1.4
Explorer Bot
My Nutch Spider/Nutch-1.4
WordPress/2.9.2; http://alishiawebsterministry.co.cc
WordPress/2.9.2; http://luxrewards.yoursexualaids.net
Microsoft-WebDAV-MiniRedir/6.1.7601
SemrushBot/0.92
Screaming Frog SEO Spider/1.90


[edited by: blend27 at 3:14 pm (utc) on Mar 2, 2015]

[edited by: phranque at 8:02 pm (utc) on Mar 2, 2015]
[edit reason] unlinked URLs [/edit]

roshaoar

3:11 pm on Mar 2, 2015 (gmt 0)

10+ Year Member



I think it's something new; the exact same thing started recently on one of my sites too. Maybe it's something the hosting company activated to check hosted site uptime/downtime. I already use pingdom.

aristotle

4:12 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



roshaoar --
I think you're right about "http_load 12mar2006" just starting to show up very recently. And it also occurred to me that the hosting company could be involved. Although I'm not sure how one site on a shared server could be down unless the whole server is down.

blend27
That's an impressive list. But couldn't those W3C_Validators and pinterest be okay in some situations?

lucy24

5:19 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, and don't forget visionutils. It's part of the Facebook package. So you'd have to make an exemption on the no-parentheses list.

No parentheses at all:
^[^(]*$
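
A quick way to see how that pattern behaves together with an exemption is to test it in Python first. This is only a sketch: the names `NO_PARENS`, `EXEMPT_SUBSTRINGS` and `blocked_by_no_parens_rule` are invented for illustration, and matching the exemption on the bare substring "visionutils" is an assumption about how one might whitelist it.

```python
import re

# The pattern from the post: a UA with no opening parenthesis anywhere.
NO_PARENS = re.compile(r"^[^(]*$")

# Hypothetical exemption list; "visionutils" is the only entry named in the thread.
EXEMPT_SUBSTRINGS = ("visionutils",)


def blocked_by_no_parens_rule(ua: str) -> bool:
    """Return True if the UA has no "(" and is not on the exemption list."""
    lowered = ua.lower()
    if any(token in lowered for token in EXEMPT_SUBSTRINGS):
        return False
    return NO_PARENS.search(ua) is not None


if __name__ == "__main__":
    for ua in ("http_load 12mar2006", "visionutils", "Mozilla/5.0 (BB10; Touch)"):
        print(blocked_by_no_parens_rule(ua), ua)
```

Checking the exemption before the pattern keeps the whitelist obvious and easy to extend.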

roshaoar

6:27 pm on Mar 2, 2015 (gmt 0)

10+ Year Member



First mention in my logfiles is: 128.107.239.233 - - [25/Feb/2015:05:05:38 +0000] "GET / HTTP/1.0" 301 235 "-" "http_load 12mar2006" - since then every 30/40 mins. Are you on fasthosts as well?

aristotle

7:49 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



roshaoar
No, I'm on A Small Orange, but it was bought by a bigger company that might also own Fasthosts. My sites used to be in Atlanta, but were apparently moved to LA onto servers leased from Softlayer. It's hard to keep track.

lucy24

9:22 pm on Mar 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: returning after obligatory morning's absence ::

or have parenthesis but unbalanced

Those are tricky. I started by searching for anything with parentheses nested at least three deep:
\([^)\n]*\([^)\n]*\(

(the \n exclusion was just to make it work in the text editor; in .htaccess it doesn't matter). I eventually found this winner:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; Mozilla/4.0(Compatible Mozilla/4.0(Compatible-EmbeddedWB 14.57 http://example.com/ EmbeddedWB- 14.57 from: http://example.com/ )

where both occurrences of "example.com" are really referer spam. But that was from a blocked IP-- and the \w\( sequence might also merit a lockout-- so it should be sufficient to check for a mismatched nest:

"[^()"\n]*[()][^()"\n]*[()][^()"\n]*[()][^()"\n]*" *$


(translation: UA contains exactly three parentheses). But these, in turn, resolve to a clutch of recurring UAs:

Mozilla/4.0 (compatible; MSIE8.0; Windows NT 6.0) .NET CLR 2.0.50727)

(note the element "MSIE8")

Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))

(note "WIndows")

Opera/9.80 (Windows NT 5.1); U; en) Presto/2.2.15 Version/10.00

where all version numbers except the initial "Opera/9.80" are variable. Unfortunately Opera/9.80 is a legitimate human UA in current use. At least in RIPE territory (looking at you, wilderness).

The sequence "en)" alone is used in Opera Mini. But "))" seems to be used almost entirely by malign robots-- including quite a few infected Russian browsers-- except for a handful of Linux builds that could be exempted if you like. Also some "Alexa Toolbar"; don't know what that's about.

As a follow-up I checked for single parentheses, constraining the search to 200 requests (because if it's from a blocked IP we already know there's something hinky about it):

200 \d+ "[^"]+" "[^()"\n]*[()][^()"\n]*" *$


Fascinatingly, the first result I found was
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp

but this seems to have been a transmission error; the UA normally has a final ) parenthesis.
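
The same search can be run outside a text editor. The sketch below applies the single-parenthesis pattern, unchanged, to a few log lines in the combined format quoted earlier in the thread. The helper name `find_single_paren_hits` and the first two sample lines (including their 203.0.113.x addresses) are made up for illustration; the third line is the one roshaoar posted.

```python
import re

# Pattern from the post: status 200, byte count, quoted referer, then a quoted
# UA that contains exactly one parenthesis character.
SINGLE_PAREN_200 = re.compile(r'200 \d+ "[^"]+" "[^()"\n]*[()][^()"\n]*" *$')


def find_single_paren_hits(log_lines):
    """Return the log lines whose UA field contains exactly one parenthesis."""
    return [line for line in log_lines if SINGLE_PAREN_200.search(line.rstrip("\n"))]


SAMPLE_LOG = [
    # Hypothetical line: truncated Slurp UA, one "(" and no ")".
    '203.0.113.5 - - [01/Mar/2015:13:09:52 +0000] "GET / HTTP/1.1" 200 9304 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp"',
    # Hypothetical line: same UA with its closing parenthesis intact.
    '203.0.113.6 - - [01/Mar/2015:13:10:02 +0000] "GET / HTTP/1.1" 200 9304 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    # Line quoted in the thread: a 301 response and a UA without parentheses.
    '128.107.239.233 - - [25/Feb/2015:05:05:38 +0000] "GET / HTTP/1.0" 301 235 "-" '
    '"http_load 12mar2006"',
]

if __name__ == "__main__":
    for hit in find_single_paren_hits(SAMPLE_LOG):
        print(hit)
```

Only the first sample line should be reported, since it is the only 200 response whose UA has a lone parenthesis.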

Also some discobots-- remember them?-- but that was from years ago.

Conclusion: you could block UAs with an odd number of parentheses
^[^()]*[()][^()]*([()][^()]*[()][^()]*)*$

(combined rule) but the vast majority of them also commit one or more other offenses that would be easier and probably less server-intensive to check for.
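
To sanity-check the combined rule, here is a small Python sketch that runs the odd-number-of-parentheses pattern against UAs quoted in this thread and compares it with a plain character count. The function names are invented for illustration. Note that an odd total is not the same as "unbalanced" in general: a UA such as "((" has an even count and would slip past both checks.

```python
import re

# The combined rule from the post: one parenthesis, followed by any number of
# further pairs, i.e. an odd total of "(" and ")" characters.
ODD_PARENS = re.compile(r"^[^()]*[()][^()]*([()][^()]*[()][^()]*)*$")


def has_odd_parens_regex(ua: str) -> bool:
    """Apply the regex exactly as posted."""
    return ODD_PARENS.search(ua) is not None


def has_odd_parens_count(ua: str) -> bool:
    """Equivalent check by counting characters; shown only for comparison."""
    return (ua.count("(") + ua.count(")")) % 2 == 1


if __name__ == "__main__":
    samples = [
        "Mozilla/4.0 (compatible; MSIE8.0; Windows NT 6.0) .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))",
        "Opera/9.80 (Windows NT 5.1); U; en) Presto/2.2.15 Version/10.00",
        "Mozilla/5.0 (BB10; Touch) AppleWebKit/537.35+ (KHTML, like Gecko) "
        "Version/10.3.1.2243 Mobile Safari/537.35+",
        "curl/7.33.0",
    ]
    for ua in samples:
        print(has_odd_parens_regex(ua), has_odd_parens_count(ua), ua)
```

The two functions should agree on every input, which makes the counting version a handy cross-check when editing the regex.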

blend27

2:09 am on Mar 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



.. where both occurrences of "example.com" are really referer spam...

That one is easy too: a double forward slash (//) in the UA = ZAP, same as @ pretty much (in Russian we call it Sabaka, "dog"; in Polish, Malpa, "monkey").