
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


of pipl and robots

     
1:34 am on Sep 26, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


67.228.nnn.nn - - [24/Sep/2011:17:52:25 -0700] "GET /rats/images/Yummy.jpg HTTP/1.1" 200 22814 "-" "Mozilla/5.0+(compatible;+PiplBot;++http://www.pipl.com/bot/)" 


They've got a good line:
PiplBot is Pipl's web-indexing robot. PiplBot crawler collects documents from the Web to build a searchable index for our People Search engine.

Unlike a typical search-engine robots, PiplBot is designed to retrieve information from the deep web [pipl.com]; our robots are set to interact with searchable databases and not only follow links from other websites.

As part of the crawling, PiplBot takes robots.txt standards into account to ensure we do not crawl and index content from those pages whose content you do not want included in Pipl Search.

I found this paragraph a little obscure, since their bot did not even go through the motions of consulting robots.txt before heading straight for a roboted-out directory.
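One way to check that kind of claim against raw logs is to see whether an address asked for robots.txt before anything else. The following is only a rough sketch, assuming Apache combined log format; the helper name `asked_for_robots_first` and the full address 67.228.0.1 are invented for illustration (the real address is masked above).

```python
import re

# Combined Log Format: IP, identity, user, [timestamp], "request", status, size, ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

def asked_for_robots_first(lines, ip):
    """Return True if the first logged request from `ip` was for /robots.txt."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("ip") == ip:
            return m.group("path") == "/robots.txt"
    return False  # no requests from this address at all

# Hypothetical sample: 67.228.0.1 is a made-up stand-in for the masked address
sample = [
    '67.228.0.1 - - [24/Sep/2011:17:52:25 -0700] '
    '"GET /rats/images/Yummy.jpg HTTP/1.1" 200 22814 "-" '
    '"Mozilla/5.0+(compatible;+PiplBot;++http://www.pipl.com/bot/)"',
]
print(asked_for_robots_first(sample, "67.228.0.1"))  # prints False
```

This only looks at the first request per address in the lines it is given, so a robot that cached robots.txt from an earlier day's log would be flagged unfairly.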

the term "deep web" refers to a vast repository of underlying content, such as documents in online databases that general-purpose web crawlers cannot reach. The deep web content is estimated at 500 times that of the surface web, yet has remained mostly untapped due to the limitations of traditional search engines.

That would be, like, the tedious formality of reading and obeying robot-exclusion rules?

It's an awful shame you're not allowed to post personal links. Something tells me I'm going to lie awake nights wondering who out there in the Internet stopped short at the picture of Miranda, Malcolm and Nelly and cried "That looks just like cousin Maisie!" before running off to People Search with this promising lead.

Be kind to your four-footed friends
Any rat may be somebody's long-lost relative


Hm. Doesn't quite scan, does it?
2:36 am on Sept 26, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Lucy,
Are you allowing activity from Soft Layer?
Blocking the entire Class B is standard practice.
3:27 am on Sept 26, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


Funny you should say that. I don't normally blot out whole ranges on the basis of one visit, but this time after a quick check to verify that I've never seen anyone else from this neighborhood, I proceeded directly to

Deny from 67.228.0.0/15

Thankfully my host has now fixed the glitch that made it impossible to add "Deny from..." directives unless you were willing to sacrifice normal-looking logs. Converting everything into RegExes for SetEnvIf was getting old.

Where else do softlayer live? I've had a solitary visit from 50.22 (Bender), but they behaved themselves. And a solitary annoyance from next door 50.23.

Bad robots always seem to go for my second-fattest file.* What do they expect to find there? Lists of plain-text passwords? How do they even know to look for it? g### used to say how big a file was, but somewhere when I wasn't paying attention they seem to have stopped.


* In html, that is. Many of the e-books weigh in at a lot more if you count the images-- but robots seem to be allergic to the "ebooks" directory name. That must be why they avoid the #1 fattest text.
4:32 am on Sept 26, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts: 379
votes: 0


Where else do softlayer live?

Lucy,

I have these ranges noted as SoftLayer, but there might be more:

50.22.0.0 - 50.23.255.255
64.125.118.64 - 64.125.118.95
67.228.0.0 - 67.228.255.255
174.36.0.0 - 174.37.255.255
184.172.0.0 - 184.173.255.255
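For anyone who keeps notes as start-end ranges but needs CIDR for "Deny from" lines, Python's standard `ipaddress` module can do the conversion. A minimal sketch, using the five ranges listed above; the helper name `range_to_cidrs` is made up for this example.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Collapse an inclusive start-end IPv4 range into the CIDR blocks covering it exactly."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

ranges = [
    ("50.22.0.0", "50.23.255.255"),
    ("64.125.118.64", "64.125.118.95"),
    ("67.228.0.0", "67.228.255.255"),
    ("174.36.0.0", "174.37.255.255"),
    ("184.172.0.0", "184.173.255.255"),
]
for first, last in ranges:
    for cidr in range_to_cidrs(first, last):
        print("Deny from", cidr)
```

A range that does not fall on a power-of-two boundary comes back as several CIDR blocks, which is why the helper returns a list.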
4:59 am on Sept 26, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Some more, from...

[webmasterworld.com...]

74.86.0.0/16
173.192.0.0/15

[webmasterworld.com...]

75.126.0.0/16

And last but not least:

deny from softlayer
9:23 pm on Sept 26, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


Postscript: I gotta drift OT for a moment to share a couple more "Gee, could this possibly be a robot?" UAs I found while looking for something else.* They're from banned IPs so I never happened to notice the UAs before.

#1:
example/1.0

Bad robot! You leave our "example" string alone! And take your highly questionable g7.in/1CCC.html referer with you.

#2:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; 
snprtz|S04727582701828#1828|isdn; .NET CLR 2.0.50727)

Gesundheit.

I don't even remember why I originally blocked 208.80. but I'm sure it was a good and valid reason. Cursory searching [webmasterworld.com] tells me they've been around for a while. Maybe they should see a doctor about that persistent phonestheme.


* Nagvaalauqtunga, let's say.
9:30 pm on Sept 26, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3125
votes: 4


My IP ranges for Softlayer are...

50.22.0.0 - 50.23.255.255
50.97.0.0 - 50.97.255.255
66.228.112.0 - 66.228.127.255
67.228.0.0 - 67.228.255.255
74.86.0.0 - 74.86.255.255
75.126.0.0 - 75.126.255.255
173.192.0.0 - 173.193.255.255
174.36.0.0 - 174.37.255.255
208.43.0.0 - 208.43.255.255
208.101.0.0 - 208.101.63.255

Lucy - your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?
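The difference between the /15 and the /16 is easy to confirm with Python's standard `ipaddress` module. This is just a quick sketch of the arithmetic and says nothing about who actually owns 67.229.

```python
import ipaddress

# The block as written in the Deny line
block = ipaddress.ip_network("67.228.0.0/15")
print(block[0], "-", block[-1])  # prints 67.228.0.0 - 67.229.255.255

# The /15 contains both /16 halves, including the 67.229 one in question
softlayer_half = ipaddress.ip_network("67.228.0.0/16")
other_half = ipaddress.ip_network("67.229.0.0/16")
print(softlayer_half.subnet_of(block), other_half.subnet_of(block))  # prints True True
```

So a /15 starting at 67.228 blocks twice as many addresses as the 67.228.0.0 - 67.228.255.255 range in the list above; the /16 form matches that range exactly.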
9:42 pm on Sept 26, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?

No idea. The 67.229 sequence only appears in my raw logs as the second half of a googlebot IP-- that is, nnn.nnn.67.229. No help there. (If Spotlight does RegEx, they keep it a closely guarded secret.) Maybe just because I found them listed as a block, 67.228.0.0 to 67.229.255.255. If I make up random 67.229 numbers I land on random other hosts.
9:50 pm on Sept 26, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Lucy,
ARIN WHOIS is not exactly user-friendly these days; however, if you go there and paste in "SoftLayer Technologies Inc." (minus the quotes), you'll be provided with a list of results (you're required to view each individually) for all the available IPs.
7:30 pm on Dec 1, 2011 (gmt 0)

Senior Member

joined:Jan 3, 2003
posts:1023
votes: 0


OK, can I bring an old topic back from the dead?

You guys are blocking entire SoftLayer servers?


What about people who host servers there, like us regular webmaster folks?
7:36 pm on Dec 1, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


What about people who host servers there, like us regular webmaster folks?


Other websites' servers are not generally considered beneficial traffic by most webmasters, at least the ones who are capable of differentiating between visitors and harvesters.

Please see the very long Amazon thread as well.
7:46 pm on Dec 1, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts: 379
votes: 0


@aleksl

Our blocking of Softlayer (or any server farm) would only affect you or other webmasters (regular or not) if you or they are attempting to crawl / scrape our sites.

We are blocking traffic coming FROM the servers, not going TO the servers.

Most "regular webmaster folks" don't have any need or desire to crawl our sites.
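To make the direction of the block concrete: the test is applied to the address a request arrives from, never to where a site is hosted. A minimal sketch, assuming a handful of the ranges quoted in this thread (illustrative, not a complete list); the function name `is_blocked` and the sample addresses are invented.

```python
import ipaddress

# Ranges taken from this thread; treat the list as illustrative, not complete
BLOCKED = [ipaddress.ip_network(n) for n in (
    "50.22.0.0/15", "67.228.0.0/16", "74.86.0.0/16",
    "75.126.0.0/16", "173.192.0.0/15", "174.36.0.0/15",
)]

def is_blocked(client_ip):
    """Return True if an incoming request's source address is in a blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED)

# Hypothetical visitors: a script running on a hosted server vs. a home connection
print(is_blocked("67.228.10.20"))  # prints True
print(is_blocked("192.0.2.10"))    # prints False
```

A site hosted in one of these ranges is unaffected when its human visitors browse elsewhere from their own connections; only requests originating from the server itself match.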

[edited by: Mokita at 7:48 pm (utc) on Dec 1, 2011]

7:46 pm on Dec 1, 2011 (gmt 0)

Senior Member

joined:Jan 3, 2003
posts:1023
votes: 0


Ok.

Amazon thread, this one?
[webmasterworld.com...]
9:54 pm on Dec 1, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


That was just Part 1 (amazonaws.com plays host to a wide variety of bad bots).

Here's Part 2: Amazon AWS Hosts Bad Bots [webmasterworld.com...]
3:32 am on Mar 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Had a request for robots.txt and two requests for the main page from:

159.253.143.zz - - [10/Mar/2012:02:50:00 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/5.0 (compatible; SWEBot/1.0; +http://swebot-crawler.net)"

The interesting part is that the IP is returned as

netname: NETBLK-SOFTLAYER-RIPE-CUST-MB27388-RIPE
descr: Hosting Services Inc. (Midphase)
country: US

Started doing requests to see how far the Class C went and still belonged to Softlayer.
After four verifications of small Class D blocks (to different orgs) up to 111, I stopped and just denied that entire Class B, which may not be beneficial to everyone.
4:41 am on Mar 10, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


Oh, gosh, I once met an absolutely horrendous robot from that neighborhood. 159.253.145.nn. It's in my notes as
no robots.txt, 400 requests in 50 sec, most links wrong

Notes also say (1) 159.253.128-159 softlayer/Netherlands and (2) 253.145.128-191 (159.253.145.128/26) which is the form I've blocked them in, though I could perfectly well have gone to the /19 form.

January 1, so it probably rates a mention next door in At Home With the Robots [webmasterworld.com]
:: shuffling papers ::
Yup. It's the one I flagged as "stupid robot" because of its utter incompetence when it came to parsing <a ...> tags:
With 394 hits in 50 seconds I would put it at the top of the ### list ... if it weren't for its mind-boggling, over-the-top, jaw-dropping, have-to-see-it-to-believe-it stupidity.
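A figure like "394 hits in 50 seconds" can be pulled out of raw logs with a sliding window over the timestamps. What follows is only a sketch, assuming Apache common/combined log timestamps and an English locale for month names; the function name `hits_per_window`, the last octet of the sample address, and the sample requests are all invented.

```python
import re
from collections import defaultdict
from datetime import datetime

# Leading fields of a common/combined log line: IP, identity, user, [timestamp]
STAMP = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def hits_per_window(lines, window_seconds=50):
    """Return {ip: largest number of requests seen in any window of that length}."""
    times = defaultdict(list)
    for line in lines:
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
            times[m.group(1)].append(when.timestamp())
    worst = {}
    for ip, stamps in times.items():
        stamps.sort()
        best, left = 0, 0
        for right, t in enumerate(stamps):
            # Slide the left edge until the window spans at most window_seconds
            while t - stamps[left] > window_seconds:
                left += 1
            best = max(best, right - left + 1)
        worst[ip] = best
    return worst

# Hypothetical sample: four requests ten seconds apart from a made-up address
sample = [
    f'159.253.145.200 - - [01/Jan/2012:00:00:{s:02d} +0000] "GET /page{s} HTTP/1.1" 200 512 "-" "-"'
    for s in (0, 10, 20, 30)
]
print(hits_per_window(sample))  # prints {'159.253.145.200': 4}
```

Sorting the per-address counts then gives a quick shortlist of candidates for a closer look.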
4:58 am on Mar 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


lucy,
It's likely you could block the entire Class C.
I opted for the B because it fits my needs.
7:52 am on Mar 10, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6674
votes: 131



Thanks dstiles, I didn't have one of those Softlayer ranges.