| 5:26 pm on Jun 24, 2014 (gmt 0)|
I've yet to see it but apparently it's been around for almost a year. [projecthoneypot.org...]
FWIW, even obscure sites are not immune from bots that crawl by IP address (akin to auto-dialer spam phone callers). Our small, private server gets bots hitting all the active sites within seconds, presumably after having tried all 255 numbers in our CIDR.
And many bots start by crawling their own server farm mothership, which may include tens of thousands of private sites, obscure or otherwise.
Last but not least, all too often long-time bot-spotters like m'self have no clue what all too many bots are up to, or for whom. But their why is easy -- like Bill said the other day, there's money in it.
| 6:51 pm on Jun 24, 2014 (gmt 0)|
Thanks for the insight.
Based on what the company does, I suspected the bot was looking for malware-infected websites as a continuing test of their systems for securing networks. I also suspected that they are very interested in identifying malware-infected botnets that have yet to execute a zero-day attack.
Those are my best guesses.
| 7:54 pm on Jun 24, 2014 (gmt 0)|
Many of us block "spyder" "spider" "nutch" "crawler" and other categorical names found in the User Agent.
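For anyone who wants to try this kind of categorical block, here is a minimal sketch. It assumes Apache 2.2-style access control (the same `Allow from env=` syntax used elsewhere in this thread), and the variable name `bad_bot` is just a label chosen for the example.

```apache
# Flag any request whose User-Agent contains one of the category keywords.
# SetEnvIfNoCase makes the match case-insensitive.
SetEnvIfNoCase User-Agent "(spyder|spider|nutch|crawler)" bad_bot

# Allow everyone by default, then deny anything flagged above.
# With "Order Allow,Deny" a matching Deny overrides the Allow.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Be aware that a bare keyword like "spider" also matches legitimate crawlers that carry it in their name, so anything you want to let in needs its own exception.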
| 8:01 pm on Jun 24, 2014 (gmt 0)|
Palo Alto Techops is a server, listed as part of PNAP, all blocked. I first spotted them last June coming in on a malformed UA: "'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'" (note the extra single-quote) but the IP you listed is a smaller range inside PaloAlto simply registered as:
Private Customer INAP-SJE-PALOALTOTECHOPS-64-74-215-0 (NET-64-74-215-0-1)
188.8.131.52 - 184.108.40.206
| 9:14 pm on Jun 24, 2014 (gmt 0)|
Thanks for confirming the range. Before I could put in a block, the bot came back and grabbed 10 pages using a different UA.
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
It looks like my guess of what the bot was up to was wrong.
| 11:00 pm on Jun 24, 2014 (gmt 0)|
Yes, that's why sometimes a CIDR block is the best way to keep them out. They can switch UAs all day.
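A rough sketch of a CIDR block in .htaccess, again assuming Apache 2.2-style directives. The range shown is a placeholder from the reserved documentation block, not the range discussed in this thread; substitute the CIDR you actually want to keep out.

```apache
Order Allow,Deny
Allow from all

# Placeholder range (192.0.2.0/24 is reserved for documentation).
# Replace it with the real CIDR of the network you want to block.
Deny from 192.0.2.0/24
```

Because the match is on the source address, it holds no matter which UA string the bot sends.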
| 12:09 am on Jun 25, 2014 (gmt 0)|
220.127.116.11 - 18.104.22.168
| 11:15 am on Jun 25, 2014 (gmt 0)|
|"GET /robots.txt HTTP/1.0" 403 |
| 3:15 pm on Jun 25, 2014 (gmt 0)|
I refer you back to my original post.
Here are the opening lines of my .htaccess file.
# Allow all bots to fetch robots.txt
SetEnvIf Request_URI "^/(robots\.txt)$" allow_all
Allow from env=allow_all
The request for robots.txt is allowed at first, but rewrite rules further down the file that ban UAs then deny it, as I said in the OP. I presume that is the reason for the 403.
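If the goal is for robots.txt to stay reachable even for banned UAs, one possible approach is to exempt it before the rewrite bans run. This is only a sketch: it assumes the bans are mod_rewrite rules ending in `[F]` in the same .htaccess, and the UA pattern below is a hypothetical stand-in for whatever the real rules match.

```apache
RewriteEngine On

# Stop rewrite processing for robots.txt so the bans below never see it.
# In .htaccess context the pattern is matched without the leading slash.
RewriteRule ^robots\.txt$ - [L]

# Hypothetical UA ban standing in for the rules later in the file.
RewriteCond %{HTTP_USER_AGENT} (spider|crawler) [NC]
RewriteRule ^ - [F]
```

The `Allow from env=allow_all` line only affects the access-control stage; mod_rewrite decides on its own, which is why the exemption has to be repeated there.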