of pipl and robots

   
1:34 am on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



67.228.nnn.nn - - [24/Sep/2011:17:52:25 -0700] "GET /rats/images/Yummy.jpg HTTP/1.1" 200 22814 "-" "Mozilla/5.0+(compatible;+PiplBot;++http://www.pipl.com/bot/)" 


They've got a good line:
PiplBot is Pipl's web-indexing robot. PiplBot crawler collects documents from the Web to build a searchable index for our People Search engine.

Unlike a typical search-engine robots, PiplBot is designed to retrieve information from the deep web [pipl.com]; our robots are set to interact with searchable databases and not only follow links from other websites.

As part of the crawling, PiplBot takes robots.txt standards into account to ensure we do not crawl and index content from those pages whose content you do not want included in Pipl Search.

I found this paragraph a little obscure, since their bot did not even go through the motions of consulting robots.txt before heading straight for a roboted-out directory.
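One rough way to check a claim like this against your own logs is to look at the first thing each visitor asked for. The sketch below assumes Combined Log Format lines like the one quoted above; the function names and the sample addresses (documentation ranges, not real visitors) are made up for illustration. A log that starts mid-visit will give false alarms, so treat the result as a hint only.

```python
import re

# Combined Log Format: ip ident user [time] "METHOD path PROTO" status size ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def first_requests(lines):
    """Map each client IP to the first path it requested, in log order."""
    first = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("ip") not in first:
            first[m.group("ip")] = m.group("path")
    return first

def skipped_robots_txt(lines):
    """Return the IPs whose first logged request was not /robots.txt."""
    return sorted(ip for ip, path in first_requests(lines).items()
                  if path != "/robots.txt")

# Hypothetical sample: one visitor goes straight for an image,
# the other asks for robots.txt first.
sample = [
    '203.0.113.7 - - [24/Sep/2011:17:52:25 -0700] "GET /rats/images/Yummy.jpg HTTP/1.1" 200 22814 "-" "PiplBot"',
    '198.51.100.9 - - [24/Sep/2011:17:53:00 -0700] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [24/Sep/2011:17:53:02 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
]
print(skipped_robots_txt(sample))  # prints ['203.0.113.7']
```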

the term "deep web" refers to a vast repository of underlying content, such as documents in online databases that general-purpose web crawlers cannot reach. The deep web content is estimated at 500 times that of the surface web, yet has remained mostly untapped due to the limitations of traditional search engines.

That would be, like, the tedious formality of reading and obeying robot-exclusion rules?

It's an awful shame you're not allowed to post personal links. Something tells me I'm going to lie awake nights wondering who out there in the Internet stopped short at the picture of Miranda, Malcolm and Nelly and cried "That looks just like cousin Maisie!" before running off to People Search with this promising lead.

Be kind to your four-footed friends
Any rat may be somebody's long-lost relative


Hm. Doesn't quite scan, does it?
2:36 am on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Lucy,
Are you allowing activity from Soft Layer?
The entire Class B is a standard.
3:27 am on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Funny you should say that. I don't normally blot out whole ranges on the basis of one visit, but this time after a quick check to verify that I've never seen anyone else from this neighborhood, I proceeded directly to

Deny from 67.228.0.0/15

Thankfully my host has now fixed the glitch that made it impossible to add "Deny from..." directives unless you were willing to sacrifice normal-looking logs. Converting everything into RegExes for SetEnvIf was getting old.

Where else do softlayer live? I've had a solitary visit from 50.22 (Bender), but they behaved themselves. And a solitary annoyance from next door 50.23.

Bad robots always seem to go for my second-fattest file.* What do they expect to find there? Lists of plain-text passwords? How do they even know to look for it? g### used to say how big a file was, but somewhere when I wasn't paying attention they seem to have stopped.


* In html, that is. Many of the e-books weigh in at a lot more if you count the images-- but robots seem to be allergic to the "ebooks" directory name. That must be why they avoid the #1 fattest text.
4:32 am on Sep 26, 2011 (gmt 0)

5+ Year Member



Where else do softlayer live?

Lucy,

I have these ranges noted as SoftLayer, but there might be more:

50.22.0.0 - 50.23.255.255
64.125.118.64 - 64.125.118.95
67.228.0.0 - 67.228.255.255
174.36.0.0 - 174.37.255.255
184.172.0.0 - 184.173.255.255
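
For anyone who wants these start-end ranges in the CIDR form that "Deny from" takes, here is a minimal sketch using Python's standard ipaddress module. The helper name range_to_cidrs is made up for illustration, and the script only converts notation; it says nothing about who actually owns the addresses.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the smallest list of CIDR blocks covering first..last inclusive."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

# The ranges as posted above.
RANGES = [
    ("50.22.0.0", "50.23.255.255"),
    ("64.125.118.64", "64.125.118.95"),
    ("67.228.0.0", "67.228.255.255"),
    ("174.36.0.0", "174.37.255.255"),
    ("184.172.0.0", "184.173.255.255"),
]

for first, last in RANGES:
    for cidr in range_to_cidrs(first, last):
        print(f"Deny from {cidr}")
```

A range that does not fall on a power-of-two boundary comes back as several blocks, which is why the helper returns a list.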
4:59 am on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Some more, from...

[webmasterworld.com...]

74.86.0.0/16
173.192.0.0/15

[webmasterworld.com...]

75.126.0.0/16

And last but not least:

deny from softlayer
9:23 pm on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Postscript: I gotta drift OT for a moment to share a couple more "Gee, could this possibly be a robot?" UAs I found while looking for something else.* They're from banned IPs so I never happened to notice the UAs before.

#1:
example/1.0

Bad robot! You leave our "example" string alone! And take your highly questionable g7.in/1CCC.html referer with you.

#2:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; snprtz|S04727582701828#1828|isdn; .NET CLR 2.0.50727)

Gesundheit.

I don't even remember why I originally blocked 208.80. but I'm sure it was a good and valid reason. Cursory searching [webmasterworld.com] tells me they've been around for a while. Maybe they should see a doctor about that persistent phonestheme.


* Nagvaalauqtunga, let's say.
9:30 pm on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



My IP ranges for Softlayer are...

50.22.0.0 - 50.23.255.255
50.97.0.0 - 50.97.255.255
66.228.112.0 - 66.228.127.255
67.228.0.0 - 67.228.255.255
74.86.0.0 - 74.86.255.255
75.126.0.0 - 75.126.255.255
173.192.0.0 - 173.193.255.255
174.36.0.0 - 174.37.255.255
208.43.0.0 - 208.43.255.255
208.101.0.0 - 208.101.63.255

Lucy - your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?
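
To see exactly what a prefix length takes in, a quick check with Python's standard ipaddress module (a sketch only; the helper name address_span is made up here):

```python
import ipaddress

def address_span(cidr):
    """Return the first and last address covered by a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1])

print(address_span("67.228.0.0/15"))  # prints ('67.228.0.0', '67.229.255.255')
print(address_span("67.228.0.0/16"))  # prints ('67.228.0.0', '67.228.255.255')

# The /15 swallows the neighbouring 67.229.0.0/16 as well.
block = ipaddress.ip_network("67.228.0.0/15")
neighbour = ipaddress.ip_network("67.229.0.0/16")
print(neighbour.subnet_of(block))  # prints True
```

So the /16 form is the one that matches the 67.228.0.0 - 67.228.255.255 range in the list above.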
9:42 pm on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



your line "Deny from 67.228.0.0/15" - I have no record of anything odd coming from 67.229.0.0/16, which does not appear to be softlayer anyway?

No idea. The 67.229 sequence only appears in my raw logs as the second half of a googlebot IP-- that is, nnn.nnn.67.229. No help there. (If Spotlight does RegEx, they keep it a closely guarded secret.) Maybe I used the /15 just because I found them listed as a single block, 67.228.0.0 to 67.229.255.255. If I make up random 67.229 numbers I land on random other hosts.
9:50 pm on Sep 26, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Lucy,
ARIN-WHOIS is not exactly user-friendly these days; however, if you go there and paste in "SoftLayer Technologies Inc." (minus the quotes), you'll be provided with a list of results (you're required to view each individually) for all the available IPs.
7:30 pm on Dec 1, 2011 (gmt 0)



OK, can I bring an old topic back from the dead?

You guys are blocking entire SoftLayer servers?


What about people who host servers there, like us regular webmaster folks?
7:36 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



What about people who host servers there, like us regular webmaster folks?


Other websites' servers are not generally considered beneficial traffic by most webmasters, at least the ones that are capable of differentiating between visitors and harvesters.

Please see the very long Amazon thread as well.
7:46 pm on Dec 1, 2011 (gmt 0)

5+ Year Member



@aleksl

Our blocking of Softlayer (or any server farm) would only affect you or other webmasters (regular or not) if you or they are attempting to crawl / scrape our sites.

We are blocking traffic coming FROM the servers, not going TO the servers.

Most "regular webmaster folks" don't have any need or desire to crawl our sites.
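
To illustrate the FROM/TO point: a deny rule is matched against the address a request arrives from, and nothing else. A minimal sketch in Python, using a few of the blocks posted earlier in this thread as the deny list; the function name is_blocked and the test addresses are made up for illustration.

```python
import ipaddress

# A few of the blocks posted earlier in this thread, used here as a sample deny list.
BLOCKED = [ipaddress.ip_network(cidr) for cidr in (
    "67.228.0.0/16",
    "74.86.0.0/16",
    "75.126.0.0/16",
    "173.192.0.0/15",
)]

def is_blocked(client_ip):
    """True if the requesting address falls inside any denied block."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED)

# A request sent from a machine inside a denied block is refused...
print(is_blocked("67.228.10.20"))  # prints True
# ...while a visitor on some other network is unaffected, wherever
# the site they usually run happens to be hosted.
print(is_blocked("203.0.113.5"))   # prints False
```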

[edited by: Mokita at 7:48 pm (utc) on Dec 1, 2011]

7:46 pm on Dec 1, 2011 (gmt 0)



Ok.

Amazon thread, this one?
[webmasterworld.com...]
9:54 pm on Dec 1, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



That was just Part 1 (amazonaws.com plays host to a wide variety of bad bots).

Here's Part 2: Amazon AWS Hosts Bad Bots [webmasterworld.com...]
3:32 am on Mar 10, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Had a request for robots.txt and two requests for the main page from:

159.253.143.zz - - [10/Mar/2012:02:50:00 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/5.0 (compatible; SWEBot/1.0; +http://swebot-crawler.net)"

The interesting part is that the IP is returned as

netname: NETBLK-SOFTLAYER-RIPE-CUST-MB27388-RIPE
descr: Hosting Services Inc. (Midphase)
country: US

Started doing lookups to see how far the Class C went while still belonging to Softlayer.
After four verifications of small Class D blocks (to different orgs) up to 111, I stopped and just denied the entire Class B, which may not be beneficial to everyone.
4:41 am on Mar 10, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Oh, gosh, I once met an absolutely horrendous robot from that neighborhood. 159.253.145.nn. It's in my notes as
no robots.txt, 400 requests in 50 sec, most links wrong

Notes also say (1) 159.253.128-159 is softlayer/Netherlands and (2) 159.253.145.128-191 (159.253.145.128/26), which is the form I've blocked them in, though I could perfectly well have gone to the /19 form.

January 1, so it probably rates a mention next door in At Home With the Robots [webmasterworld.com]
:: shuffling papers ::
Yup. It's the one I flagged as "stupid robot" because of its utter incompetence when it came to parsing <a ...> tags:
With 394 hits in 50 seconds I would put it at the top of the ### list ... if it weren't for its mind-boggling, over-the-top, jaw-dropping, have-to-see-it-to-believe-it stupidity.
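
A burst like that is easy to pick out of a log mechanically. Below is a rough sketch that flags any address exceeding a hit count inside a sliding time window. The thresholds (100 hits in 60 seconds) and the function names are assumptions chosen for illustration, and the input is expected to be (ip, timestamp) pairs already pulled out of the log, with timestamps in the usual Apache form.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def parse_time(stamp):
    """Parse an Apache-style timestamp such as 10/Mar/2012:02:50:00 +0000."""
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

def burst_clients(hits, max_hits=100, window_seconds=60):
    """Return the set of IPs with more than max_hits requests inside any window.

    hits: iterable of (ip, timestamp string), in log order for each IP.
    """
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for ip, stamp in hits:
        when = parse_time(stamp)
        queue = recent[ip]
        queue.append(when)
        # Drop hits that have aged out of the window.
        while queue and when - queue[0] > window:
            queue.popleft()
        if len(queue) > max_hits:
            flagged.add(ip)
    return flagged
```

Something making 394 requests in 50 seconds would trip this many times over, while a human reader clicking through pages would not come close.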
4:58 am on Mar 10, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



lucy,
It's likely you could block the entire Class C.
I opted for the B because it fits my needs.
7:52 am on Mar 10, 2012 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




Thanks dstiles, I didn't have one of those Softlayer ranges.