
Search Engine Spider and User Agent Identification Forum

    
anyone seen the kimono scraper tool yet?
londrum




msg:4652780
 6:35 pm on Mar 10, 2014 (gmt 0)

I just became aware of this today — it's a new scraper tool called kimono.

From what I can gather, people can select which bits of your page they need, collect it all up into an API, and then reproduce it wherever they want. And the site keeps scraping your stuff forever, sort of like an RSS feed.

I'm sure it has its uses, but I always block stuff like this. The problem is I can't find any info on their site, and I can't find anything on the web about it either (I think the site only went live in January).

The tool is already showing up in my stats, so it's definitely getting some use.

Anyone know anything about it? Or how to block it?

 

Angonasec




msg:4652961
 11:05 am on Mar 11, 2014 (gmt 0)

How about a glimpse of what you notice in your raw logs?

Angonasec




msg:4652964
 11:14 am on Mar 11, 2014 (gmt 0)

Bing turned up this alarming blurb about it:

Created by kimonolabs:

Q/
We host your APIs and data in the cloud and run them on the schedule that’s right for you. We serve up cached data even if the source URL is down or the extractor fails. App builder lets you create responsive web apps on top of your APIs without writing any code.
/Q

Definitely one for webmasters to Nippon in the bud.

motorhaven




msg:4653057
 4:21 pm on Mar 11, 2014 (gmt 0)

IP address/range?

thetrasher




msg:4653065
 4:42 pm on Mar 11, 2014 (gmt 0)

We host your APIs and data in the cloud
The cloud -> Amazon

kimonify.kimonolabs.com is an alias for chiba-2935.herokussl.com.
chiba-2935.herokussl.com is an alias for elb034166-1295166034.us-east-1.elb.amazonaws.com.


kimonify.kimonolabs.com/kimload?url=http%3A%2F%2Fwww.example.com%2F

54.204.162.231
54.234.118.208

54.197.89.239 - - [11/Mar/2014:10:13:02 -0700] "GET /?kimonify.kimonolabs.com HTTP/1.1" 200 779 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"

lucy24




msg:4653137
 9:36 pm on Mar 11, 2014 (gmt 0)

Oh. Well, that's easy then.

Deny from 54.192.0.0/10

GET /?kimonify.kimonolabs.com

Do you mean that they requested this from your site? So if you've got one of those dynamic lockout functions, you can flag any request with "kimono" in the name and block the IP for the next 24 hours.
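A minimal .htaccess sketch along those lines, assuming Apache 2.2 syntax (to match the Deny line above) and mod_rewrite; the 24-hour unban part would still have to live in your own lockout script:

# sketch only: block the AWS range the kimonify hits came from
Order Allow,Deny
Allow from all
Deny from 54.192.0.0/10

# sketch only: 403 anything carrying the kimonify footprint seen in the log line above
RewriteEngine On
RewriteCond %{QUERY_STRING} kimonify [NC]
RewriteRule .* - [F]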

thetrasher




msg:4653143
 10:05 pm on Mar 11, 2014 (gmt 0)

In order to identify the request in the public log file (not my site), I called
kimonify.kimonolabs.com/kimload?url=http%3A%2F%2Fwww.example.com%2F%3Fkimonify.kimonolabs.com

Angonasec




msg:4653303
 1:52 pm on Mar 12, 2014 (gmt 0)

:) Easier still... deny from 54.

Already blocked, but I'd still like to see what their delightful API footprint looks like in your logs Mr. Londrum.

keyplyr




msg:4653371
 5:37 pm on Mar 12, 2014 (gmt 0)

I just test drove this thing. If they get away with this, then international copyright agreements are out the window. This thing literally captures everything on your site, and turns it into an API for the user.

londrum




msg:4653400
 6:41 pm on Mar 12, 2014 (gmt 0)

I'd still like to see what their delightful API footprint looks like in your logs Mr. Londrum

I haven't actually looked through my logs. I was notified about it through AdSense — apparently someone is displaying my ads on kimonify using this tool. So I guess they just grabbed every single thing on the page.
...I will look through my logs.

Angonasec




msg:4653531
 5:47 am on Mar 13, 2014 (gmt 0)

KeyP: Good man; what did you find in your logs please...

Londrum: Thanks for alerting us. I'm sure those charming erks at AdSense won't penalise you once you've explained how this happened.

keyplyr




msg:4653550
 6:22 am on Mar 13, 2014 (gmt 0)

KeyP: Good man; what did you find in your logs please...

You don't think I would actually use it on my own site, do you?

(thetrasher posted the hit above in msg:4653065)

londrum




msg:4653616
 10:20 am on Mar 13, 2014 (gmt 0)

I've notified them about this thread, so let's see if they post a reply.

To be fair to them, they offered to remove the APIs if I provided them with the URLs, and they sounded like alright people in the email.

keyplyr




msg:4653779
 5:21 pm on Mar 13, 2014 (gmt 0)




To be fair to them, they offered to remove the APIs if I provided them with the URLs, and they sounded like alright people in the email.

Isn't that like email SPAMMERS who offer a link to be removed, but after you've already been SPAMMED?

Trip41




msg:4653813
 8:30 pm on Mar 13, 2014 (gmt 0)

Hi, this is Ryan from kimonolabs.
@londrum thanks for pointing us to this forum. I'm Pratap's co-founder.

I think you guys all raise valid points. You don't want to be scraped. You don't want your data stolen from you. It's also clear to me that by building kimono we're making ourselves an arms dealer, and by doing that we enable, in some cases, bad people to do bad things. Believe it or not we don't actually want to create a tool for data theft either. That's not what we're trying to do.

You'd be surprised that the vast majority of people that are using kimono are coming to us simply because they can't get data in the format that's right for them -- not because they want to 'steal' it through some backchannel or avoid some fee or something to that effect. 99 times out of 100, developers and other users would be perfectly happy getting it directly from the source, and even paying you for it -- if only it was just packaged up exactly how they want, machine readable and reliably updated.

You might say you have an API, or are happy to build one, but at the end of the day some developer is going to come along and want to work with some slice of your data that's not available. Again, surprisingly often this isn't because you are actually trying to protect the data, it's just that you haven't thought about providing it in that way. What does he do? He scrapes you, eventually you might find out and block him or whatever, and thus the cycle continues. This doesn't really make sense to me. You have the data, you're publishing it on the web for people to consume (I'm assuming if it was private you'd put it behind some sort of auth-wall). People who come to get it need it in all manner of bits and slices. All the pieces of the puzzle are there.

What we *really* want to do is build a framework for connecting the data consumer to you, the webmaster. To 'make an introduction' of sorts. In an ideal scenario we even turn that person into a customer of yours. If we can make it so easy for people to get it in the format they need -- they'd often be willing to pay you for that.

Some of the other concepts/tools for webmasters we're thinking about thus far are:
- responsible scraping (building kimono in such a way that it doesn't hit frequently or cause any scaling/ddos attack issues -- i.e. be very nice to your servers)
- the ability to turn data streams off/block kimono on parts of your site you don't want made available to developers via robots.txt or other means
- analytics (give you insight into exactly what data people want, by what organizing principles, to inform your own API building, etc.)
- the ability to 'monetize' your data stream through kimono
- what else?

We are a (very) young company. We launched just over 8 weeks ago and are two people. The advantage of this is that we're extremely nimble, can listen to the communities that matter, and can implement things very quickly. We are happy to take these sort of webmaster tools in any direction, we just need to hear what is the right direction.

We want to be a *good bot*. In an ideal world we'd sit up there with the likes of Google in the sense that there'd be so much benefit to letting us through your gates that it's a no-brainer. So, to ask very directly: what would make us a better bot? What are the clear-cut rules not to break with you, so that we don't overstep our bounds?

You know, this is one of the most important things for us to get right, if not the most important. It's totally reasonable to think that a primary cause for our business failing down the line is if we mishandle issues like terms of use, robots.txt, data protection, etc and get on the bad side of people like you all.

To be honest, I don't consider myself an expert in the space and honestly could use a little more insight into what works, what doesn't, what seems fair, best practices, and how to position things moving forward. If you're not 100% against us already, and are willing to give us some feedback, I would love to get your thoughts. I'm happy to talk on the phone (or Skype if you're outside the US) or whatever.


Thanks
-ryan 650-704-2755

keyplyr




msg:4653820
 8:53 pm on Mar 13, 2014 (gmt 0)

Thanks Ryan for the comprehensive reply/explanation. I'm sure that helps many of us understand what your company is doing.

That said, as webmasters we *do* want our content accessible to those who are interested; however, existing law clearly states that the interested party must seek this information directly from us, the owners, and *not* from a third party (kimonolabs, et al.) who does not have our permission to copy, reproduce or otherwise distribute our property in any manner.

This is US & International copyright law established by:
• Berne Convention
• Universal Copyright Convention (UCC)
• Digital Millennium Copyright Act (DMCA)
• European Union's Intellectual Property Rights Enforcement Directive (IPRED)

I can promise that if I find any of our copyrighted content in your API or anywhere else on servers owned/leased/managed by your company, you'll be contacted by our law firm and aggressively prosecuted to the full extent of the law.

keyplyr




msg:4653840
 10:35 pm on Mar 13, 2014 (gmt 0)

Just some more info in case anyone is interested.

The Kimono scraper so far seems to be coming from Amazon Tech:
54.192.0.0 - 54.255.255.255
54.192.0.0/10

Kimonolabs.com apparently is hosted on AmazonAWS:
23.20.0.0 - 23.23.255.255
23.20.0.0/14

Pratap.com apparently is hosted on skylock.net, resold by Net2EZ:
173.245.0.0 - 173.245.31.255
173.245.0.0/19

If anyone sees an error or has additional info, please post :)
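If you want to deny all three ranges in one place, here is a sketch in the same Apache 2.2 style used earlier in the thread (verify the ranges against current whois data first, since these allocations change):

Order Allow,Deny
Allow from all
# Amazon Tech: 54.192.0.0 - 54.255.255.255
Deny from 54.192.0.0/10
# Amazon AWS (kimonolabs.com): 23.20.0.0 - 23.23.255.255
Deny from 23.20.0.0/14
# skylock.net / Net2EZ (pratap.com): 173.245.0.0 - 173.245.31.255
Deny from 173.245.0.0/19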

Samizdata




msg:4653844
 10:53 pm on Mar 13, 2014 (gmt 0)

You have the data, you're publishing it on the web for people to consume (I'm assuming if it was private you'd put it behind some sort of auth-wall).

There is a difference between privacy and ownership.

The right to copy is at the discretion of the copyright holder.

To be honest, I don't consider myself an expert in the space

I'm sure you will benefit from advice in this thread.

...

Angonasec




msg:4654092
 4:26 pm on Mar 14, 2014 (gmt 0)

Ryan, you display stunning naivety at best.

Learn to respect copyright, and laws, if you wish to prosper.

Thanks KeyP; all nicely blocked now.

wilderness




msg:4654120
 5:50 pm on Mar 14, 2014 (gmt 0)

We are happy to take these sort of webmaster tools in any direction, we just need to hear what is the right direction.

We want to be a *good bot*. In an ideal world we'd sit up there with the likes of google in the sense that there'd be so much benefit to letting us through your gates that it's a no brainer. So. to ask very directly. What would make us a better bot? What are the clear-cut rules to not break with you so that we don't overstep our bounds?


Hello Ryan,
Many thanks for your effort and explanation.

The simplest solution is to use a valid "User Agent" when crawling (identifying your bot properly) and then to comply with robots.txt when crawling with that same agent.

Most every longtime participant in this forum makes it a point to explain to newcomers which bots are "robots.txt compliant" and why.

However, and in summary, using Amazon.com (or many other server farm hosts) will simply get you denied without even having a chance to comply with robots.txt.
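If kimono ever does adopt its own user-agent token (the hits logged above show only a generic Chrome UA, so the token below is purely hypothetical), honouring robots.txt would let webmasters opt out with a single entry:

# hypothetical token - kimono does not currently announce itself this way
User-agent: kimonobot
Disallow: /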

dstiles




msg:4654165
 8:28 pm on Mar 14, 2014 (gmt 0)

I have, as part of the T&C on several of my web sites, the warning that making money out of any content on the site is illegal and prohibited unless specifically approved. I wonder how many bot-owners take note of those pages? My guess is None. However, the warning is there and if anyone is caught making money out of my data there could be trouble for them. "Making money" can be interpreted here as a rather loose term.

I do not bother over-much with complex robots.txt files. The ever-growing complexity coupled with the fact that even google fails to respect it makes it impracticable. Instead I reject overtures with a 4xx error. Bots should treat this as a "go away" alternative to robots.txt but seem rarely to do so.

Wilderness, please do not discourage people from bot-driving via amazon. It makes life so much easier if they do! :)

tangor




msg:4654169
 8:42 pm on Mar 14, 2014 (gmt 0)

My biggest problem is the repackaging of a product I have made available IN ONE FORMAT on the web. If they can't get it, they aren't using the allowed devices/programs to access that data. Anything that converts my data to another format is theft, pure and simple.

I would suggest that your "tool" be made available to WEBMASTERS who might wish to provide their content in this new API format rather than scraping the web.

Angonasec




msg:4654245
 2:00 am on Mar 15, 2014 (gmt 0)

"please do not discourage people from bot-driving via amazon. It makes life so much easier if they do! :)"

Echoed. And after IPv6 catches on, we'll look back on this time as the good ol' days, when bot-wrangling was simple.

It's probably wise to also consider the possibility that Kimono's rep posting here so "clumsily" may be a deliberate wind-up, or worse.
