Forum Moderators: buckworks
What am I doing with this content? Actually, I'm throwing most of it out the window. The only thing I'm keeping is SKUs...and prices.
In the past I've mentioned to other webmasters that I'm a scraper, and I could tell that they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry, and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same, because I spend just as much time trying to block their bots from accessing my prices.
So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?
How is that different from entering your competitor's brick and mortar store, disguising yourself as a customer, and spying on the prices, service, etc.? That technique is endorsed by every business book and marketing guru, and many renowned entrepreneurs (e.g. the Sears founder) have admitted to using it.
I tend to agree. Bakedjake's analysis is on target as well.
How do you disguise yourself as a customer? Drool, shuffle your feet and roll your eyes a lot?
LIA, You seem to be arguing against the implementation and not the practice. Are you against the practice (gathering pricing intelligence)?
good point, which is why pricing intelligence might be a bad example -- you could query google base to get information (at least in the us, uk and germany, where I doubt there are too many shops who don't pocket that extra exposure) on your competitors. Still a bot, but they have an API and allow automated queries to that API.
I'm on both sides as well: I build bots and I'm the webmaster for a few medium-sized sites. We don't have any official bot policy, and since they're informational sites, bots are usually search engines or content harvesters. With the exception of Google (they may hit our server as hard as they please), we ban whoever gets abusive. That theoretically includes Yahoo and MSN, since they're irrelevant in our market, but we let them pass as well, since bandwidth is free. On sites where each additional request means server load, we block them too. Since we don't do automatic blocking, it's more a case of "wow, the Apache logs and the JS-based tracking are really different ... we must have a wild bot" (we log known bots into a special access log).
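For anyone wondering what "logging known bots into a special access log" might look like, here is a rough sketch in Python. It assumes Apache's combined log format (user agent as the last quoted field) and a made-up list of bot tokens; the names `KNOWN_BOTS`, `user_agent` and `split_log` are purely illustrative, not what the poster actually runs.

```python
import re

# Assumed list of user-agent tokens for bots we already know about.
KNOWN_BOTS = ("googlebot", "slurp", "msnbot")

# In the combined log format the user agent is the last quoted field.
_UA_RE = re.compile(r'"([^"]*)"\s*$')


def user_agent(line: str) -> str:
    """Return the user-agent field of a combined-format log line, or ''."""
    match = _UA_RE.search(line)
    return match.group(1) if match else ""


def split_log(lines):
    """Partition log lines into (known_bot_lines, other_lines)."""
    bots, others = [], []
    for line in lines:
        ua = user_agent(line).lower()
        if any(token in ua for token in KNOWN_BOTS):
            bots.append(line)
        else:
            others.append(line)
    return bots, others
```

Whatever ends up in the second list still includes stealth bots that pose as browsers, which is why the gap against JS-based tracking is the real giveaway.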
I'd love it if people would put a reasonable bot-policy on their sites
This is the content of the robots.txt file your bot should be served by my sites:
# Please note that robots may be denied access
# to content at any time for any reason unless
# they are demonstrably useful to this site or
# have made a prior arrangement with the owner
User-agent: *
Disallow: /
# Have a nice day!
I hope you consider that reasonable enough.
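For bot authors who do want to honor a file like this, a minimal sketch using Python's standard `urllib.robotparser` shows how a compliant client would read it. The user-agent string and URL in the usage note are made-up placeholders, and `may_fetch` is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the post above.
ROBOTS_TXT = """\
# Please note that robots may be denied access
# to content at any time for any reason unless
# they are demonstrably useful to this site or
# have made a prior arrangement with the owner
User-agent: *
Disallow: /
# Have a nice day!
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `may_fetch("PriceBot/1.0", "http://example.com/products/sku-123")` comes back `False` for any user agent and any path, since `Disallow: /` under `User-agent: *` covers the whole site.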
I'd recommend not to insult or threaten bot-authors in html-comments or stuff
I'd recommend not to insult site owners by sending them a robot that you have deliberately - and dishonestly - disguised as a normal browser with the specific purpose of trampling on the site owner's expressed wishes and violating the generally accepted (though admittedly not compulsory) Robots Exclusion Protocol that is designed for such cases and has been around for fifteen years.
I know one or two people that might get angry. And you don't want that.
Interesting words, unreasonable, insulting, threatening and unethical.
...
LIA, You seem to be arguing against the implementation and not the practice. Are you against the practice (gathering pricing intelligence)?
Which is why I have softened quite a bit from my initial response. This is not such a black/white issue.
Are you abusing a park because you walk through it 100 times more in a day than everyone else? Are you "stealing" because you cost them more in grass maintenance because of your increased presence?
I say no. You are doing nothing wrong or unethical. You are using the web exactly for what it was designed for, and that is.... retrieving information for your use. Whether it be for archiving data for future use, for expanding one's knowledge base, for killing time, or for better setting your prices.
Just because you have the good sense to automate the process doesn't make what you are doing wrong. It makes it efficient.
You could just as easily hire a small staff that could manually go to these sites you scrape, record the data, compile it, then report it to you, which you could then use to set your prices.... it would use the same bandwidth and would be the same thing, but wouldn't be scraping somehow.
I hope you consider that reasonable enough.
I don't, sorry. I don't get why you hate bots. I once built a tool that, given a few keywords, checked whether combinations of those were free. The German NIC has an ACL on is-this-domain-free checks. Their explanation was that bandwidth costs money. I wrote an email saying I'd happily pay more than generous compensation: say, I'd pay for a gigabyte for every megabyte I use. Of course, they didn't agree, which leads me to believe that bandwidth wasn't the real issue. I mean, I don't see a reason to say "I don't want any bots on my site". If it's bandwidth costs: ask them to pay for an API. Most will, because it's more fun to get clear documentation than to adapt your code with every change you make.
And you're even blocking search engines?
Interesting words, unreasonable, insulting, threatening and unethical.
Please, don't misunderstand me. All I tried to get at was "don't make them mad. Block them if you want to, but don't make them want to hurt you". You may not want a bot on your site. But you definitely do not want some crazy guy with a botnet going after your site. That's all I meant. By all means, tell them in HTML comments why you don't want their bots. I always enjoy a webmaster being aware of the situation. I once read an HTML comment on some page I wanted to scrape info from, where he had written something like "Attention Bot Authors: to save you the trouble, here's an API: http://example.com/bots/myapi". That, to me, is the best take on the situation. You cannot control bots. You don't want to waste bandwidth. You don't want to have your stats altered. Of course, that doesn't make sense for content pages, and you probably wouldn't do that for your prices, but if you see a lot of bots querying some data, give it a thought.
You could just as easily hire a small staff that could manually go to these sites you scrape, record the data, compile it, then report it to you, which you could then use to set your prices.... it would use the same bandwidth and would be the same thing, but wouldn't be scraping somehow.
and that other example someone gave about walking into a store and browsing the aisles writing down all the prices, suggesting that was the same thing... well that doesn't cost the shop owner any money either. but when you send a bot to scrape all their stuff it does cost them money in bandwidth.
i would ask this of the scrapers: if all these website owners started blocking your bot with robots.txt, what would you do? i'm guessing that you would ignore it and scrape away anyway. you'd have to, because to do otherwise would affect your business. and that suggests that it is unethical.
and that other example someone gave about walking into a store and browsing the aisles writing down all the prices, suggesting that was the same thing... well that doesn't cost the shop owner any money either.
unless, of course, you count personnel costs, cleaning, heating, etc., which you could easily divide by the number of people visiting the store. Sure, it's not much ... but is the bandwidth?
I don't know, maybe we're all talking about different amounts here. If a bot does a gigabyte of traffic on your site, I'd say, yes, that's quite a bit. But if it does 3 megabytes? I don't know about traffic costs around the world, but I'd have trouble even calculating that in euros.
A website is a publicly accessible place; its content is there for people to use.
Let's use the park analogy, though. What if you go to the park, pull out some of the grass, and take it back to plant in your park? Or you take one of the park benches back to your park? Although you might find a problem with that, another person might excuse the actions by saying that the grass was from an area way in the back of the park or that no one was using the bench at that time.
OK, to try to keep the analogy more apt, let's say you and lots of your friends overrun the park so that no one else can use it (analogous to rogue bots that bog down the site)?
OK, to try to keep the analogy more apt, let's say you and lots of your friends overrun the park so that no one else can use it (analogous to rogue bots that bog down the site)?
Have you ever had such a bot on your site? Was it a legitimate one? I've seldom discovered a private/stealth bot that misbehaved in such a way. I've banned a lot of small search engines and university projects because they hammered our server with 5 - 10 requests per second.
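If you write bots yourself, staying far below that 5 - 10 requests per second is not hard. A minimal sketch of a throttle in Python, assuming you call `wait()` before every request; the class name `Throttle` and the chosen rate are illustrative, not a standard of any kind.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Something like `throttle = Throttle(max_per_second=0.5)` followed by `throttle.wait()` ahead of each fetch keeps the bot to one request every two seconds, no matter how fast the rest of the code runs.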
Is scraping to improve a business model ethical? We seem to have no consensus at the moment.
Competitive Pricing (what is under discussion) is a business method all businesses recognize. It is how value is determined for any product. The OP wants to sleep at night because he's wearing a wig to get that info. Everything else which has been discussed does not address that specific question.
As a businessman (brick and mortar) I don't like it but have done it with my competition many times...as they have with me. Keeps both of us honest, or at least competitive. As a webmaster with no product for sale I have no opinion. Other than what I've just stated.
the shop is going to have to pay the lighting bills and heating bills regardless of whether the mystery shopper attends or not.
open the doors a few thousand times a day and you will see a difference in heating costs. that's cost-per-customer. if you have two customers at any given time, you don't need much personnel. if you have twenty customers? you'll need more. that's what I was aiming for. Sure, it's a few people in sales divided by a lot of customers (if your shop runs well), but it's still a cost factor.
Did anyone ever try to actually put a price tag on the traffic made by bots?
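A back-of-the-envelope version of that price tag, sketched in Python. The default rate of 1 per gigabyte is only a placeholder assumption (plug in whatever your host charges, in whatever currency), and it uses decimal gigabytes; `bandwidth_cost` is an illustrative name.

```python
BYTES_PER_GB = 1_000_000_000  # decimal gigabyte, as most hosts bill


def bandwidth_cost(bytes_transferred: int, price_per_gb: float = 1.0) -> float:
    """Estimate the transfer cost of bot traffic.

    price_per_gb is an assumed placeholder rate, not a quoted tariff.
    """
    return bytes_transferred / BYTES_PER_GB * price_per_gb
```

At that assumed rate, the 3 megabytes mentioned earlier in the thread work out to about 0.003, and a full gigabyte to 1.0. The number you can't compute this way is the admin time spent noticing and blocking the bot.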
Does it make business sense for the bot owner? Yeah!
Does it make business sense for the site owner? Nope.
Absolutely not. Most site owners would consider bot traffic a waste of their time.
To me, if someone is deliberately trying to waste my time, or the time of the IT person (not 1 USD per gig of bandwidth, more like 75/100 USD per hour here) that I pay to handle the technical aspects of the site's well-being, I would definitely consider this unethical.
What do you think?
What do you think?
I have lots of thoughts:
If you don't want people to know your prices, don't list them publicly for the world to see.
If you do list them publicly for the world to see, don't be surprised when folk use something other than your "preferred method of access" to find them.
Don't presume that all folk subscribe to your ethical model. In business, as in life, we are all influenced differently by our experiences.
Either business will continue to evolve and be more beneficial to the consumer (which is the end game, of course - as much as business owners don't like it, cut throat pricing does benefit the consumer), or we will all become goat farmers.
When you launch a site, be prepared for unexpected data transfer costs (due to bots), much like you'd be prepared for any loss of revenue activity in an offline store. Budget for it, expect it, take preventive steps to curb it.
Don't be a luddite.
Mr. Bot comes along every day eating up my bandwidth. In return I get nothing other than a higher bill due to his continued spidering of all my products.
I go into a store, walk around, leave, come back, leave, come back, and walk out with a dollar pack of gum. I didn't pay for this gum, so is this OK, or is it stealing from the owner of the store I visited many times, when all I did in return was cost him money? Is this OK?
The title reads
"The Ethics And Value In Scraping, or Data Mining" should be changed to "The Ethics And Value In Scraping, or Data Stealing".
Ethics are what a person considers right or wrong, and as we all know, everyone's ethics can be justified by their personal reasoning.
Can't justify stealing any other way than wrong. Scraping sites is stealing plain and simple.
[edited by: yobaby at 6:40 pm (utc) on April 29, 2009]
i suppose it's a bit like shoplifters in shops. yeah, of course it's unfair that it happens, but it does, so there's not much point moaning about it.
they don't steal your stuff (unless copying and _not_ republishing is stealing, which means I steal your website every time my browser caches it). If you need a real world comparison, I'd go with people who go into your shop, maybe get some info from a salesperson, but don't buy just now; instead, they go home and buy online. they cost you money, yeah, but not really that much, and it's not against the law. I guess most site owners lose less money on bots in a month than on one customer returning an item, or one customer not buying an item because the database refused connections for 10 seconds on a single day and the customer was scared away.
blend27:
To me, if someone is deliberately trying to waste my time, or the time of the IT person (not 1 USD per gig of bandwidth, more like 75/100 USD per hour here) that I pay to handle the technical aspects of the site's well-being, I would definitely consider this unethical.
unless you (or your IT guys) handle every request manually and quickly type up the responding html-page, how is any request wasting _your_ time? It's wasting server resources, yes, but your time?
yobaby:
By that standard, I'm "stealing" on a daily basis. Just today I was shopping and asked a guy at the shop whether they had xyz. He said they did and led me to the shelf. I thanked him and he walked away. I saw the price they asked for (which was double the price I was expecting from prior experience) and did not buy the product. The guy literally took me from one end of the store to the other, so let's say the whole thing took a minute of his time. The shop didn't make a dime, but they still have to pay for the guy's salary. Is that unethical?
[edited by: janharders at 6:47 pm (utc) on April 29, 2009]
I assume you ban search engine spiders
How do you figure that from my post? Search engine spiders send me traffic that purchases or has the possibility to purchase. What this post is about has nothing to do with search engine spiders, as there is a value there.
What he asked has no value at all for me and never will I ever get traffic from him,
so your post is out of line.
so your post is out of line.
Respectfully, no. You made a black and white statement that said:
Scraping sites is stealing plain and simple.
Then you said:
Search engine spiders send me traffic that purchase or the possibility to purchase.
Search engines scrape sites in the same way that other bots do. Technologically speaking, there is no difference.
My point is that the issue is not the implementation (robots scraping websites) that is grey, more of the practice (pricing intelligence vs. building a search engine vs. republishing content in violation of copyright vs. whatever).
It is very important to speak to the practice rather than the implementation, because robots spidering content is not always unethical, according to most opinions in the thread.
Most people like Google traffic. Therefore Google is not bad. Google scrapes your site, and by extension, Google scraping your site is not bad.
That is not true in all cases. So you can't say "scraping sites is stealing plain and simple". In some cases it is almost universally good and welcomed.
The shop didn't make a dime, but they still have to pay for the guy's salary. Is that unethical?
Arieng, consider whether your actions are parasitical or symbiotic. What are you giving back to the websites you scrape? Are you sending them back any traffic, or giving them anything else of value?
I'd question whether Arieng is a scraper in the usual meaning of the term. If he'd left the word "scraping" out of his original post's subject line or replaced it with "pricing intelligence" (to borrow a phrase from bakedjake), emotions probably wouldn't be running as high as they are.
but they'd have to pay his salary anyway.
A simple example, to re-use the brick and mortar case everyone loves: going to a store 100 times is OK; going to a store and taking pictures of their presentation and prices is OK only if they let you. Usually, letting you includes first inquiring why you are doing it (checking your user agent), which is then followed by a refusal or not (robots.txt).
You are not following this, so you are unethical in your actions, no doubt about it.
I once got reprimanded for taking pictures of the exterior of a brand-new store that had opened. The exterior, not even the interior.
Totally, if that same person comes in and wastes the salesman's time all day long, looking at different prices without any intention of buying, every day. The salesman does not sell because he can't, due to the same visitor wasting his time, whether the price is right or not. In your case, if the price is right the first time, you might buy it, right? So it's not the same.
btw, "typing every response manually" was funny.