of pipl and robots
lucy24
msg:4367195 - 1:34 am on Sep 26, 2011 (gmt 0)

67.228.nnn.nn - - [24/Sep/2011:17:52:25 -0700] "GET /rats/images/Yummy.jpg HTTP/1.1" 200 22814 "-" "Mozilla/5.0+(compatible;+PiplBot;++http://www.pipl.com/bot/)"

They've got a good line:
PiplBot is Pipl's web-indexing robot. PiplBot crawler collects documents from the Web to build a searchable index for our People Search engine.

Unlike typical search-engine robots, PiplBot is designed to retrieve information from the deep web [pipl.com]; our robots are set to interact with searchable databases and not only follow links from other websites.

As part of the crawling, PiplBot takes robots.txt standards into account to ensure we do not crawl and index content from those pages whose content you do not want included in Pipl Search.

I found this paragraph a little obscure, since their bot did not even go through the motions of consulting robots.txt before heading straight for a roboted-out directory.

the term "deep web" refers to a vast repository of underlying content, such as documents in online databases that general-purpose web crawlers cannot reach. The deep web content is estimated at 500 times that of the surface web, yet has remained mostly untapped due to the limitations of traditional search engines.

That would be, like, the tedious formality of reading and obeying robot-exclusion rules?
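
Since the bot skipped robots.txt anyway, refusing it at the server by UA string seems the surer route. A minimal sketch, assuming Apache 2.2 with mod_setenvif (the env variable name is arbitrary):

# mark anything claiming to be PiplBot, then deny it
BrowserMatchNoCase piplbot bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot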

It's an awful shame you're not allowed to post personal links. Something tells me I'm going to lie awake nights wondering who out there in the Internet stopped short at the picture of Miranda, Malcolm and Nelly and cried "That looks just like cousin Maisie!" before running off to People Search with this promising lead.

Be kind to your four-footed friends
Any rat may be somebody's long-lost relative


Hm. Doesn't quite scan, does it?

 

wilderness
msg:4367217 - 2:36 am on Sep 26, 2011 (gmt 0)

Lucy,
Are you allowing activity from SoftLayer?
Denying the entire Class B is standard practice here.

lucy24
msg:4367225 - 3:27 am on Sep 26, 2011 (gmt 0)

Funny you should say that. I don't normally blot out whole ranges on the basis of one visit, but this time after a quick check to verify that I've never seen anyone else from this neighborhood, I proceeded directly to

Deny from 67.228.0.0/15

Thankfully my host has now fixed the glitch that made it impossible to add "Deny from..." directives unless you were willing to sacrifice normal-looking logs. Converting everything into RegExes for SetEnvIf was getting old.
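
For anyone comparing the two approaches, here is the same block both ways; a sketch in Apache 2.2 syntax, with the env variable name made up for illustration:

# plain CIDR form (mod_authz_host)
Order Allow,Deny
Allow from all
Deny from 67.228.0.0/15

# the SetEnvIf workaround: 67.228.0.0/15 covers 67.228.x.x and 67.229.x.x
# SetEnvIf Remote_Addr ^67\.22[89]\. block_this
# Deny from env=block_this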

Where else do softlayer live? I've had a solitary visit from 50.22 (Bender), but they behaved themselves. And a solitary annoyance from next door 50.23.

Bad robots always seem to go for my second-fattest file.* What do they expect to find there? Lists of plain-text passwords? How do they even know to look for it? g### used to say how big a file was, but somewhere when I wasn't paying attention they seem to have stopped.


* In html, that is. Many of the e-books weigh in at a lot more if you count the images-- but robots seem to be allergic to the "ebooks" directory name. That must be why they avoid the #1 fattest text.

Mokita
msg:4367232 - 4:32 am on Sep 26, 2011 (gmt 0)

Where else do softlayer live?

Lucy,

I have these ranges noted as SoftLayer, but there might be more:

50.22.0.0 - 50.23.255.255
64.125.118.64 - 64.125.118.95
67.228.0.0 - 67.228.255.255
174.36.0.0 - 174.37.255.255
184.172.0.0 - 184.173.255.255
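
In case it saves anyone the arithmetic, those ranges work out to the following CIDR blocks; a sketch in Apache 2.2 .htaccess form (double-check against current ARIN data before copying, since allocations change):

Order Allow,Deny
Allow from all
# 50.22.0.0 - 50.23.255.255
Deny from 50.22.0.0/15
# 64.125.118.64 - 64.125.118.95
Deny from 64.125.118.64/27
# 67.228.0.0 - 67.228.255.255
Deny from 67.228.0.0/16
# 174.36.0.0 - 174.37.255.255
Deny from 174.36.0.0/15
# 184.172.0.0 - 184.173.255.255
Deny from 184.172.0.0/15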

Pfui
msg:4367235 - 4:59 am on Sep 26, 2011 (gmt 0)

Some more, from...

[webmasterworld.com...]

74.86.0.0/16
173.192.0.0/15

[webmasterworld.com...]

75.126.0.0/16

And last but not least:

deny from softlayer
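
A caveat on the host-name form: Apache only applies it after a double reverse-DNS lookup on the client IP, it matches complete trailing name components only, and it silently misses any address whose rDNS doesn't resolve, so it costs a lookup per request. A hedged sketch, assuming SoftLayer's reverse DNS ends in softlayer.com:

Order Allow,Deny
Allow from all
# matches clients whose forward-confirmed rDNS ends in .softlayer.com
Deny from softlayer.com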

lucy24
msg:4367500 - 9:23 pm on Sep 26, 2011 (gmt 0)

Postscript: I gotta drift OT for a moment to share a couple more "Gee, could this possibly be a robot?" UAs I found while looking for something else.* They're from banned IPs so I never happened to notice the UAs before.

#1:
example/1.0

Bad robot! You leave our "example" string alone! And take your highly questionable g7.in/1CCC.html referer with you.

#2:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; snprtz|S04727582701828#1828|isdn; .NET CLR 2.0.50727)

Gesundheit.

I don't even remember why I originally blocked 208.80. but I'm sure it was a good and valid reason. Cursory searching [webmasterworld.com] tells me they've been around for a while. Maybe they should see a doctor about that persistent phonestheme.


* Nagvaalauqtunga, let's say.

dstiles
msg:4367503 - 9:30 pm on Sep 26, 2011 (gmt 0)

My IP ranges for Softlayer are...

50.22.0.0 - 50.23.255.255
50.97.0.0 - 50.97.255.255
66.228.112.0 - 66.228.127.255
67.228.0.0 - 67.228.255.255
74.86.0.0 - 74.86.255.255
75.126.0.0 - 75.126.255.255
173.192.0.0 - 173.193.255.255
174.36.0.0 - 174.37.255.255
208.43.0.0 - 208.43.255.255
208.101.0.0 - 208.101.63.255

Lucy - your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?

lucy24
msg:4367506 - 9:42 pm on Sep 26, 2011 (gmt 0)

your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?

No idea. The 67.229 sequence only appears in my raw logs as the second half of a googlebot IP-- that is, nnn.nnn.67.229. No help there. (If Spotlight does RegEx, they keep it a closely guarded secret.) Maybe just because I found them listed as a block, 67.228.0.0 to 67.229.255.255. If I make up random 67.229 numbers I land on random other hosts.
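
For the record, the mask arithmetic: a /15 simply spans two consecutive /16s, so the directive covers exactly the block I found listed. Equivalently:

# 67.228.0.0/15 = 67.228.0.0 - 67.229.255.255, i.e. the same as the pair:
Deny from 67.228.0.0/16
Deny from 67.229.0.0/16
# or the single wider line:
Deny from 67.228.0.0/15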

wilderness
msg:4367507 - 9:50 pm on Sep 26, 2011 (gmt 0)

Lucy,
ARIN WHOIS is not exactly user-friendly these days; however, if you go there and paste in "SoftLayer Technologies Inc." (minus the quotes), you'll be provided with a list of results for all their registered IP ranges (you're required to view each one individually).

aleksl
msg:4393083 - 7:30 pm on Dec 1, 2011 (gmt 0)

OK, can I bring an old topic back from the dead?

You guys are blocking entire SoftLayer servers?


What about people who host servers there, like us regular webmaster folks?

wilderness
msg:4393085 - 7:36 pm on Dec 1, 2011 (gmt 0)

What about people who host servers there, like us regular webmaster folks?


Other websites' servers are not generally considered beneficial traffic by most webmasters, at least the ones capable of differentiating between visitors and harvesters.

Please see the very long Amazon thread as well.

Mokita
msg:4393091 - 7:46 pm on Dec 1, 2011 (gmt 0)

@aleksl

Our blocking of Softlayer (or any server farm) would only affect you or other webmasters (regular or not) if you or they are attempting to crawl / scrape our sites.

We are blocking traffic coming FROM the servers, not going TO the servers.

Most "regular webmaster folks" don't have any need or desire to crawl our sites.

[edited by: Mokita at 7:48 pm (utc) on Dec 1, 2011]

aleksl
msg:4393092 - 7:46 pm on Dec 1, 2011 (gmt 0)

Ok.

Amazon thread, this one?
[webmasterworld.com...]

Pfui
msg:4393177 - 9:54 pm on Dec 1, 2011 (gmt 0)

That was just Part 1 (amazonaws.com plays host to a wide variety of bad bots).

Here's Part 2: Amazon AWS Hosts Bad Bots [webmasterworld.com...]

wilderness
msg:4427231 - 3:32 am on Mar 10, 2012 (gmt 0)

Had a request for robots.txt and two requests for the main page from:

159.253.143.zz - - [10/Mar/2012:02:50:00 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/5.0 (compatible; SWEBot/1.0; +http://swebot-crawler.net)"

The interesting part is that the IP is returned as

netname: NETBLK-SOFTLAYER-RIPE-CUST-MB27388-RIPE
descr: Hosting Services Inc. (Midphase)
country: US

Started doing lookups to see how far the Class C went while still belonging to SoftLayer.
After verifying four of the small sub-blocks (assigned to different orgs) up to .111, I stopped and just denied the entire Class B, which may not suit everyone.

lucy24
msg:4427248 - 4:41 am on Mar 10, 2012 (gmt 0)

Oh, gosh, I once met an absolutely horrendous robot from that neighborhood. 159.253.145.nn. It's in my notes as
no robots.txt, 400 requests in 50 sec, most links wrong

Notes also say (1) 159.253.128-159 softlayer/Netherlands and (2) 159.253.145.128-191 (159.253.145.128/26) which is the form I've blocked them in, though I could perfectly well have gone to the /19 form.
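
For anyone else who meets this one, the two forms from those notes as .htaccess lines (verify the wider allocation against current RIPE data before using it):

# the narrow block actually seen: 159.253.145.128 - 159.253.145.191
Deny from 159.253.145.128/26
# the wider SoftLayer (Netherlands) range: 159.253.128.0 - 159.253.159.255
Deny from 159.253.128.0/19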

January 1, so it probably rates a mention next door in At Home With the Robots [webmasterworld.com]
:: shuffling papers ::
Yup. It's the one I flagged as "stupid robot" because of its utter incompetence when it came to parsing <a ...> tags:
With 394 hits in 50 seconds I would put it at the top of the ### list ... if it weren't for its mind-boggling, over-the-top, jaw-dropping, have-to-see-it-to-believe-it stupidity.

wilderness
msg:4427257 - 4:58 am on Mar 10, 2012 (gmt 0)

lucy,
It's likely you could block the entire Class C.
I opted for the B because it fits my needs.

keyplyr
msg:4427288 - 7:52 am on Mar 10, 2012 (gmt 0)


Thanks dstiles, I didn't have one of those Softlayer ranges.
