| This 136 message thread spans 5 pages: 136 (  2 3 4 5 ) > > || |
|Ecommerce: The Ethics And Value In Scraping, or Data Mining|
| 8:09 pm on Apr 28, 2009 (gmt 0)|
I like to follow the threads about how to combat scrapers, but not because I am having my content stolen. The truth is, I'm a scraper. I scrape other sites almost every day. I am constantly evolving my strategies to extract content from sites that are not my own.
What am I doing with this content? Actually, I'm throwing most of it out the window. The only thing I'm keeping is SKUs...and prices.
In the past I've mentioned to other webmasters that I'm a scraper, and I can tell that they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same because I spend just as much time trying to block their bots from accessing my prices.
So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?
| 8:46 pm on Apr 28, 2009 (gmt 0)|
Perhaps they are feeding you false data?
[edited by: MrHard at 8:48 pm (utc) on April 28, 2009]
| 8:52 pm on Apr 28, 2009 (gmt 0)|
Doubt it. I check for that through various proxies and locations.
| 9:57 pm on Apr 28, 2009 (gmt 0)|
To me, scraping=stealing bandwidth. I consider stealing to be unethical. So, yes, I consider it unethical.
But almost by definition, doing something unethical means you don't care what other people think, so why does it matter if other people consider it unethical?
| 10:20 pm on Apr 28, 2009 (gmt 0)|
Honestly, I hadn't considered bandwidth to be the central issue. I have always felt like I'm an honorable business person, and from other perspectives (SEO/SEM/email) I think that I operate completely aboveboard.
LIA, I got into this because others were doing it to me and I felt I needed to level the playing field. Now, I do it better than they do. However, if I was to learn that this is clearly an unethical thing to be doing, I'd have to stop. I respect many of this board's members, and I was hoping that you might help me steer in the right direction moving forward.
Continue with something that works, or abandon it for ethical reasons? I think it's a legitimate question to ask here.
| 10:28 pm on Apr 28, 2009 (gmt 0)|
OK, that's a fair question. I admit I was probably a little harsh in my initial response. If I was in your position, I may feel the need to act similarly.
Here are some questions:
1) Does your scraper bot clearly identify itself in its user agent?
2) Does your scraper bot fully obey robots.txt?
3) Do you have a page on your site that describes what your bot does and describes how webmasters can block your bot?
If yes to all three, then I *might* concede you the dubious title of ethical scraper. :)
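For the first two questions, here is a minimal Python sketch of what "identifies itself and obeys robots.txt" could look like in practice. It is only an illustration: the bot name `ExamplePriceBot`, the info-page URL, the `shop.example` host and the sample robots.txt are all hypothetical and do not come from anyone in this thread. It uses the standard library's `urllib.robotparser` and parses an in-memory file so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent: a bot name plus a link to a page describing the bot
# (questions 1 and 3). Neither the name nor the URL comes from the thread.
USER_AGENT = "ExamplePriceBot/1.0 (+https://example.com/bot-info)"

# Hypothetical robots.txt a target site might publish.
SAMPLE_ROBOTS = """\
User-agent: ExamplePriceBot
Disallow: /checkout/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Question 2: only request a URL that robots.txt allows for our agent."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS)
allowed = may_fetch(parser, "https://shop.example/products/123")
blocked = may_fetch(parser, "https://shop.example/checkout/cart")
print(allowed, blocked)
```

Against a live site one would normally call `set_url()` and `read()` on the parser instead of passing text in, and send `USER_AGENT` as the User-Agent header on every request.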
| 11:11 pm on Apr 28, 2009 (gmt 0)|
Yes, yes, and...no.
We pass the second one pretty much just by happenstance, but I have to admit that there is an element of stealth to the whole deal.
I think that third item is a good idea. I'll discuss it with my team. I am hoping to find a solution here that still lets me sleep comfortably.
| 11:18 pm on Apr 28, 2009 (gmt 0)|
>>is scraping really unethical 100% of the time?
not 100% of the time
but my important sites all have terms and conditions which explicitly forbid scraping and robotic retrieval of pages, to scrape them would be unethical.
| 11:51 pm on Apr 28, 2009 (gmt 0)|
Sounds more like data mining to me, not scraping.
| 12:42 am on Apr 29, 2009 (gmt 0)|
|Sounds more like data mining to me, not scraping. |
And how many angels can dance on that pin with you?
Interesting discussion though.
| 2:59 am on Apr 29, 2009 (gmt 0)|
What you are doing is creating a vicious circle of price undercutting. Spending all your time on reducing your profit margins with your competitors and fine tuning the mechanism of your demise.
Being a follower rather than a leader.
Meanwhile others are spending that energy developing new positive business relationships, or on other constructive pursuits which will enable them to sell items even at a higher price or otherwise gain an advantage.
| 5:30 am on Apr 29, 2009 (gmt 0)|
I don't think that "scraping" - if that means the automated retrieval of files that are publicly available - is unethical in itself.
After all, there are many legitimate tools designed to do just that, and the data can be collected simply by visiting the relevant pages in a normal browser.
The distaste for "scrapers" so often seen on WebmasterWorld arises from the republishing of our content that often follows the scrape.
Some of us also block most automated tools where we can on the grounds that they use our bandwidth and give nothing in return, but as we are talking about documents that are publicly available anyway this is a matter of personal choice about how we run our sites.
If, as you say, your bot identifies itself and obeys robots.txt then I see no ethical problem with it.
But I will refuse it access.
| 8:14 am on Apr 29, 2009 (gmt 0)|
Arieng, consider whether your actions are parasitical or symbiotic. What are you giving back to the websites you scrape? Are you sending them back any traffic, or giving them anything else of value?
If it's all take and no give, then what you are doing is unethical. But if there's some chance that the sites you scrape will see a benefit, and there's an easy way to opt out if they wish, then carry on.
| 9:03 am on Apr 29, 2009 (gmt 0)|
I agree with Samizdata, and I actually don't even think you'd have to identify yourself openly.
Bots aren't bad in themselves, it's what you do with the results that makes it ethical or unethical. You're gathering prices. You could do the same thing by having a student stare at the screen for hours. As far as bandwidth is concerned, the bot is actually nicer, because it doesn't load images, css-files etc pp.
I understand why many people don't want bots on their page: they want human visitors. But in many cases, the guy who scrapes your content (but does not republish it!) wouldn't be a valued visitor if he did it manually. Think of the competitor checking your prices -- he won't buy from you, just because he saw your website.
I've been involved in scraping-projects myself, where we used to collect data for internal usage. We didn't identify ourselves, but we made sure we didn't hammer the servers, didn't just crawl without thought but only requested what we needed, at a very low speed. And when presenting the data to users, we always gave 'em a source-link so they could visit the original site.
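The "very low speed" part can be sketched as a per-host throttle. This is a hedged illustration in Python, not the poster's actual code: the class name `PoliteThrottle` and the 10-second default interval are assumptions made up for the example, and the clock and sleep functions are injectable only so the logic can be checked without really waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteThrottle:
    """Enforce a minimum gap between requests to the same host.

    Hypothetical helper for illustration; the default interval is an
    assumption, not a figure taken from the thread.
    """

    def __init__(self, min_interval_s=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting `url`; return the seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Time still owed before the minimum interval has elapsed.
            delay = max(0.0, self.min_interval_s - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

A crawler would call `throttle.wait(url)` immediately before each request, so a burst of URLs for one site gets spread out while different sites do not hold each other up.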
Also, I've had quite a few cases where I was asked to scrape specific pages on a site because users needed the information that was updated frequently, but the webmasters were incompetent and had faulty scripts (and didn't want to fix them after they had been told), so we built a small script that'd get the data and fix the rest.
It's like SEO, you can work in different shades of gray. I've always wanted to publish a guide "how to really make a scraper hate you" and put down some of the anti-scrape methods I've seen so far. A few of them really took some work to understand, but not one of them kept me from getting what I wanted. If a bot is in deep cover and the author cares for it not to be discovered, you won't see it, unless it's the only user that uses your page ;)
| 11:38 am on Apr 29, 2009 (gmt 0)|
|I actually don't even think you'd have to identify yourself openly |
As we are discussing ethics, this is where I would disagree with janharders.
While no rules or laws are being broken, deliberately disguising a bot by using (for example) a standard browser user-agent is fundamentally dishonest.
Ethical dishonesty is a concept for philosophers with time on their hands, not webmasters.
|If it's all take and no give, then what you are doing is unethical |
Rosalind, when you publish on the web you can have no expectations of anything in return.
You can have hopes, but not expectations.
The analogy might be a free newspaper supported by advertising - it is not unethical to take a copy and use it to line a birdcage or soak up a spillage, ignoring the content completely.
| 1:04 pm on Apr 29, 2009 (gmt 0)|
it would be unethical for people to read the news stories and use them in their own paper, though. that is more like what a scraper does. they are using other people's content to create their own.
question to arieng: if these other sites started blocking your bot in robots.txt, can you honestly say that you wouldn't go against their wishes and bypass it? your entire business would go down the pan otherwise.
that kind of implies that your website is reliant on other people creating the info and handing it over for nothing.
| 1:24 pm on Apr 29, 2009 (gmt 0)|
|Spending all your time on reducing your profit margins with your competitors and fine tuning the mechanism of your demise. |
That pretty much sums it up. I used to worry constantly if I was at-or-below the price of my competitors. Only to run the numbers and find out (after factoring in the cost of returns, credit card processing, boxes and tape, etc) that I was losing money on every order.
The old saying, "Lose money on every order, but make it up in volume" doesn't work. You're better off maximizing profits, even if it means lowering your revenue.
| 1:35 pm on Apr 29, 2009 (gmt 0)|
|While no rules or laws are being broken, deliberately disguising a bot by using (for example) a standard browser user-agent is fundamentally dishonest. |
you're absolutely right, it's dishonest. still, I do it. I understand some people don't want bots on their site, but unless you're copying their content, I don't see a reason why it should matter. For a bad comparison: if I want to order some food and they ask the color of my pants and reply "sorry, we don't serve people that wear blue jeans", next time, I'd simply tell them whatever they want to hear, if the delivery-guy doesn't double check on delivery.
It's a thin line, since site owners would argue that I use bandwidth but don't see their ads and don't click on them. OTOH: who doesn't use some sort of adblock?
| 1:45 pm on Apr 29, 2009 (gmt 0)|
|The analogy might be a free newspaper supported by advertising - it is not unethical to take a copy and use it to line a birdcage or soak up a spillage, ignoring the content completely. |
ROFL, my hubby gets a free paper every day, but we have never read one, they line the rat's cage instead.
| 2:38 pm on Apr 29, 2009 (gmt 0)|
|you're absolutely right, it's dishonest. still, I do it |
I do not mean to criticize you, but in the context of this thread that is surely unethical.
|site owners would argue that I use bandwidth but don't see their ads |
In my case it has nothing to do with advertising, which no site of mine will ever carry.
|There's no reason at all against having my bot fetch the results for me |
My reason would be that I dislike dishonesty.
It is my site and I am entitled to take that view and block your bot (if I can).
[edited by: Samizdata at 2:39 pm (utc) on April 29, 2009]
| 3:00 pm on Apr 29, 2009 (gmt 0)|
|My reason would be that I dislike dishonesty. |
so you'd allow reasonable bot-usage if it's identifiable? I'd love it if people would put a reasonable bot-policy on their sites. "don't. just don't" is not something I consider reasonable. (Again: I'm talking about bots who check / do stuff for me, not content-scraping & republishing.)
|It is my site and I am entitled to take that view and block your bot (if I can). |
which is hard. I've always seen it as a small competition, to be honest.
Also, personally, I'd recommend not insulting or threatening bot-authors in html-comments or stuff. They usually tend to ignore it anyway and I know one or two people that might get angry. And you don't want that.
| 3:37 pm on Apr 29, 2009 (gmt 0)|
|but my important sites all have terms and conditions which explicitly forbid scraping and robotic retrieval of page |
Do you specifically exempt SE crawlers from that policy in your TOS?
|I can programmatically set a price each day that is optimized against several competitors. |
I guess in my mind this is simply good business. Republishing content definitely crosses a line (legally), but your practice sounds no different than walking into a Home Depot store and noting their prices (they specifically forbid this practice, btw) or visiting a website every day and copying down prices onto a yellow legal pad.
So to answer the question, no, I wouldn't consider it unethical, and honestly expect my competitors to do this already. In fact, I know they do.
| 3:40 pm on Apr 29, 2009 (gmt 0)|
There is nothing unethical about this. Competition monitoring/analysis increases competitiveness in the industry, improves market efficiency and drives prices lower for me as a customer. Thank you for doing this, go on with it and I hope your competitors do it as well.
All who consider this kind of scraping unethical, I really don't understand what you are talking about.
How is that different from entering a brick and mortar store of your competitor, disguising yourself as a customer and spying on the prices and service, etc - the technique that is endorsed by every business book and marketing guru, and has admittedly been used by so many renowned entrepreneurs (e.g. the Sears founder)?
| 3:43 pm on Apr 29, 2009 (gmt 0)|
Ethics are relative. Culturally and personally. Are we collectivist or individualist? Society provides a framework, and it's usually geographic, although different ethical frameworks exist in pockets throughout the world, making global distinctions quite difficult and prone to racial stereotypes.
Here's one I do hold true though: The Internet is very very dodgy, generally.
It is unethical that:
- a website tells you that it has the best rate guaranteed, when so does its competitor. Who's lying?
- a website and search engine confirms to you indirectly that it is an authority but then allows anyone to add content in an unmoderated fashion.
- a website allows users to view copyright protected material and permits them to insert that material into other webpages for free
- a website tells you that there is no commission being charged to you, but in fact there is.
- a website tells you that it has found the best offer for you, when in fact it's just served you a page with a "searching for best offer" when the best offer is always the same.
- a website tells you that it takes the privacy concerns of its individuals very seriously, but then when asked by official government bodies to implement best practices, they decline to do so.
- you are aware that a system has its faults, but yet can technically justify why it is not at fault (the classic, the system works fine, when we know it doesn't)
How you group the good or bad of this is quite personal and culturally skewed. I think to some degree, Business acting as an "entity" and not a "person" is the first problem, this allows people to hide behind their actions which, were they responsible for at a personal level, they would behave in a different way.
Let me qualify that and say the phrase "it's in the interests of business" might mean that John, who is a friend of the company for over 30 years, needs to be told to hit the door.
In IT we have little in the way of ethical frameworks that can help IT managers or skilled IT professionals reason their way out of any ethical dilemmas, but it is precisely these frameworks that help us prepare for the unexpected. It's usually convenient for business to apply guiding principles or policies after something bad has happened rather than have an ethical code of conduct in place.
Ethics ain't easy.
| 3:44 pm on Apr 29, 2009 (gmt 0)|
|Spending all your time on reducing your profit margins with your competitors and fine tuning the mechanism of your demise |
Actually, quite the opposite. We're not trying to be the lowest price on the web. We could never live on the margins those sites are willing to work with. We strive to be the most knowledgeable, have the best selection, and offer the fastest shipping. If we can be the best provider in those areas, I sure don't want to lose a sale due to price. Our pricing algorithms are ensuring that we're in line with the other top tier providers, and if we can shave a point or two and still make decent margins so much the better.
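As a rough illustration of "in line with the other top tier providers, shave a point or two, keep decent margins", here is a hedged Python sketch. It is not the poster's algorithm: the function name, the use of the median as the "in line" reference, the 1% shave and the 15% minimum markup over cost are all assumptions chosen for the example. Prices are integer cents.

```python
from statistics import median


def suggest_price_cents(competitor_cents, cost_cents, shave_pct=1.0, min_markup_pct=15.0):
    """Suggest a price near the competitors' median without dropping below a floor.

    Hypothetical rule for illustration only:
      target = median competitor price, reduced by `shave_pct` percent
      floor  = our cost plus `min_markup_pct` percent
    The result is whichever of the two is higher.
    """
    if not competitor_cents:
        raise ValueError("need at least one competitor price")
    reference = median(competitor_cents)
    target = round(reference * (1 - shave_pct / 100))
    floor = round(cost_cents * (1 + min_markup_pct / 100))
    return max(target, floor)


# Competitors at $100.00, $105.00 and $110.00; our cost is $80.00.
normal_case = suggest_price_cents([10000, 10500, 11000], cost_cents=8000)
# Same competitors, but our cost is $100.00, so the floor takes over.
floor_case = suggest_price_cents([10000, 10500, 11000], cost_cents=10000)
print(normal_case, floor_case)
```

The floor is the part that matters for the earlier margin discussion: the rule follows competitors only while the resulting price still covers cost plus the chosen markup.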
|You could do the same thing by having a student stare at the screen for hours. |
This brings up a good point, pricing research is nothing new. In the old days, you had some poor employee hand-keying prices out of a print catalog. However, since we've started experimenting with automated pricing retrieval the data is exponentially more exhaustive and reliable.
|My reason would be that I dislike dishonesty. |
Me too. When I first started toying with this idea, I didn't really think of it as a dishonest practice. I saw it as a way to get an edge on competitors that I felt were getting an edge on me. Thanks for all the great feedback on this thread, it's given me a lot of food for thought.
| 4:05 pm on Apr 29, 2009 (gmt 0)|
^^ I agree with bakedjake... and Jake is an expert in competitive intelligence. You have an advantage over your competitors, because you know how to do something automatically which they are *probably doing* using Internet Explorer and a yellow legal pad. And you're not hacking into their back-end network to take confidential price lists. You're looking at and learning from information that is freely available to everyone else on the planet. I say, go ahead.
But watch out - what if they started doing the same thing? If you automatically undercut their price by $0.01, then they automatically undercut yours by $0.01, prices on both sites will start spiralling down into $0-ness.
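The spiral is easy to see in a toy simulation. The Python sketch below assumes two bots that each undercut the other by one cent per round; the function name and the optional `floor_cents` guard are additions made for illustration and were not proposed by anyone in the thread. Prices are integer cents to avoid floating-point drift.

```python
def simulate_undercutting(start_cents, step_cents=1, floor_cents=0, max_rounds=100_000):
    """Two sellers take turns undercutting each other until neither price moves.

    Returns (price_a, price_b, rounds). `floor_cents` is a hypothetical guard:
    neither bot will price below it.
    """
    price_a = price_b = start_cents
    rounds = 0
    while rounds < max_rounds:
        # Seller A undercuts B, then seller B undercuts A's new price.
        new_a = max(floor_cents, price_b - step_cents)
        new_b = max(floor_cents, new_a - step_cents)
        if (new_a, new_b) == (price_a, price_b):
            break  # both prices are stuck at the floor
        price_a, price_b = new_a, new_b
        rounds += 1
    return price_a, price_b, rounds


# Starting at $10.00 with no floor, both prices end at zero.
no_floor = simulate_undercutting(1000)
# With an $8.00 floor, both prices settle at the floor instead.
with_floor = simulate_undercutting(1000, floor_cents=800)
print(no_floor[:2], with_floor[:2])
```

Without a floor, both prices end at zero; with one, they settle at the floor, so the floor has to be a price the seller can actually live with.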
Another point: Don't crawl them every hour with an aggressive bot. Just as if you were wandering into Home Depot with a notebook, you should behave respectfully while you're in their store, don't harass the staff, don't make a scene, get a little info and leave, then come back later wearing a different wig.
I wouldn't call it unethical, but it is sneaky. Some people really hate sneaky because sneaky is dishonest. Others would applaud you for being smart. It really depends on your perspective.
If you were scraping me, you're darn right I'd be doing everything I can to identify and block your sorry little bot. LOL
| 4:12 pm on Apr 29, 2009 (gmt 0)|
|you're darn right I'd be doing everything I can to identify and block your sorry little bot |
Tip: don't block. It alerts the bot owner to the fact that they're dealing with someone at least moderately sophisticated.
If it's for e-commerce, and you're sure the bot is being used for pricing intelligence, then by all means give the bot a price. That price doesn't really have to be the same price that you actually show to consumers. ;-)
| 4:13 pm on Apr 29, 2009 (gmt 0)|
|If you were scraping me, you're darn right I'd be doing everything I can to identify and block your sorry little bot. LOL |
don't. identify who's behind it and serve him different prices. counter intelligence ftw ;)
| 4:18 pm on Apr 29, 2009 (gmt 0)|
|Some people really hate sneaky because sneaky is dishonest. Others would applaud you for being smart. It really depends on your perspective. |
This is exactly why I have been having problems with this thread. As a business owner, I definitely applaud him for his ingenuity and proactive business sense. But as a site owner, I also don't like bots scraping my sites when they don't offer anything in return (Google/Yahoo/other SEs- that's fine, as long as I get some exposure and links back to the site.), ESPECIALLY since the scraping (or data mining if you will) is basically being used against me. (Again, I applaud the ingenuity of it, but I also hate it! :) )
And of course there's the Catch-22: if you act ethically and allow people to block you easily, you are giving up an advantage to your competitors who do something similar but act unethically by making it difficult to block their bots.
| 4:22 pm on Apr 29, 2009 (gmt 0)|
|I also don't like bots scraping my sites when they don't offer anything in return |
LIA, You seem to be arguing against the implementation and not the practice. Are you against the practice (gathering pricing intelligence)?