
Search Engine Spider and User Agent Identification Forum

Palo Alto Networks bot
slipkid
msg:4682308 - 10:13 am on Jun 24, 2014 (gmt 0)

A new bot for me, from a company whose stock Jim Cramer has been pumping on his cable show.

64.74.215.27 - - [23/Jun/2014:20:42:45 -0400] "GET /robots.txt HTTP/1.0" 403 237 "-" "spyder/Nutch-2.1 (just another internet crawler; http://www.paloaltonetworks.com/products/features/url-filtering.html; ghalevy@paloaltonetworks.com)"

I let all bots fetch robots.txt, but ban them from going further with various .htaccess tests. This one failed the tests for HTTP/1.0 and for the UA words spyder, crawler, and Nutch.

I have a suspicion of what the bot is up to on an obscure web site such as my own, but no doubt experienced webmasters know precisely its goal based on the company's services.

[edited by: incrediBILL at 4:31 pm (utc) on Jun 24, 2014]
[edit reason] formatting [/edit]

 

Pfui
msg:4682418 - 5:26 pm on Jun 24, 2014 (gmt 0)

I've yet to see it, but apparently it's been around for almost a year. [projecthoneypot.org...]

FWIW, even obscure sites are not immune from bots that crawl by IP address (akin to auto-dialer spam callers). Our small, private server gets bots hitting all the active sites within seconds, presumably after having tried all 256 addresses in our CIDR block.

And many bots start by crawling their own server-farm mothership, which may include tens of thousands of private sites, obscure or otherwise.

Last but not least, all too often long-time bot-spotters like m'self have no clue what all too many bots are up to, or for whom. But the why is easy -- like Bill said the other day, there's money in it.

slipkid
msg:4682433 - 6:51 pm on Jun 24, 2014 (gmt 0)

Thanks for the insight.

Based on what the company does, I suspected the bot was looking for malware-infected websites as a continuing test of their network-security systems. I also suspect they are very interested in identifying malware-infected botnets that have yet to execute a zero-day attack.

Those are my best guesses.

keyplyr
msg:4682440 - 7:54 pm on Jun 24, 2014 (gmt 0)

Many of us block "spyder", "spider", "nutch", "crawler" and other categorical names found in the User-Agent string.
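
A minimal sketch of that kind of keyword block, assuming Apache 2.2 with mod_setenvif (the bad_bot variable name is just illustrative):

# Flag any request whose User-Agent contains a categorical bot keyword
BrowserMatchNoCase "spyder|spider|nutch|crawler" bad_bot

# Refuse flagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Note these are substring matches, so they will also catch any legitimate UA containing those words; carve out exceptions before relying on it.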

not2easy
msg:4682442 - 8:01 pm on Jun 24, 2014 (gmt 0)

Palo Alto Techops is a server listed as part of PNAP, all blocked here. I first spotted them last June coming in on a malformed UA: "'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'" (note the extra single-quote). But the IP you listed is a smaller range inside Palo Alto, registered simply as:
Private Customer INAP-SJE-PALOALTOTECHOPS-64-74-215-0 (NET-64-74-215-0-1)
64.74.215.0 - 64.74.215.255
64.74.215.0/24

slipkid
msg:4682462 - 9:14 pm on Jun 24, 2014 (gmt 0)

Thanks for confirming the range. Before I could put in a block, the bot came back and grabbed 10 pages using a different UA.

"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"

It looks like my guess of what the bot was up to was wrong.

not2easy
msg:4682500 - 11:00 pm on Jun 24, 2014 (gmt 0)

Yes, and that's why a CIDR IP block is sometimes the best way to keep them out. They can switch UAs all day.

keyplyr
msg:4682524 - 12:09 am on Jun 25, 2014 (gmt 0)

Internap
64.74.0.0 - 64.74.255.255
64.74.0.0/16
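
For anyone blocking by range, either CIDR drops straight into .htaccess (a sketch in the same Apache 2.2 syntax used elsewhere in this thread; use the /24 for just the Techops range, or the /16 for all of Internap):

Order Allow,Deny
Allow from all
# Just the PALOALTOTECHOPS range
Deny from 64.74.215.0/24
# ...or the entire Internap allocation
# Deny from 64.74.0.0/16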

thetrasher
msg:4682674 - 11:15 am on Jun 25, 2014 (gmt 0)

"GET /robots.txt HTTP/1.0" 403
Why 403?

slipkid
msg:4682756 - 3:15 pm on Jun 25, 2014 (gmt 0)

I refer you back to my original post.

Here are the opening lines of my .htaccess file:

# Allow all bots to fetch robots.txt
SetEnvIf Request_URI "^/(robots\.txt)$" allow_all

# With Deny,Allow ordering, Allow directives are evaluated last and win
Order Deny,Allow

<Limit GET>
Allow from env=allow_all
</Limit>

The robot's robots.txt request gets through this section initially but is then denied by rewrites later in the file that ban its UA, as I said in the OP. I presume that's the reason for the 403.
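
For context, UA-banning rewrites of that sort look something like this (a sketch only, assuming mod_rewrite; not slipkid's exact rules, which weren't posted):

RewriteEngine On
# Forbid UAs containing the words named in the OP, case-insensitive
RewriteCond %{HTTP_USER_AGENT} (spyder|spider|nutch|crawler) [NC,OR]
# Also forbid plain HTTP/1.0 requests
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteRule .* - [F]

The [F] flag answers with 403 Forbidden, which would account for the OP's log line even though the access-control section above lets the robots.txt request through.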
