Forum Moderators: buckworks


Ecommerce: The Ethics And Value In Scraping, or Data Mining

         

arieng

8:09 pm on Apr 28, 2009 (gmt 0)

10+ Year Member



I like to follow the threads about how to combat scrapers, but not because I am having my content stolen. The truth is, I'm a scraper. I scrape other sites almost every day. I am constantly evolving my strategies to extract content from sites that are not my own.

What am I doing with this content? Actually, I'm throwing most of it out the window. The only thing I'm keeping is SKUs...and prices.

In the past I've mentioned to other webmasters that I'm a scraper, and I can tell that they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same because I spend just as much time trying to block their bots from accessing my prices.

So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?

Seb7

7:18 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



I think it depends on what the scrapers intend to do with the data.

People go out and gather data all the time. Scraping is just an automated version.

Supermarkets monitor each other's prices, insurers monitor their competitors.

If you're scraping just to duplicate, then that's a definite no-no, but if you're scraping data so as to present a completely different perspective than what is already being offered, then you're presumably offering a new service to your users.

[edited by: Seb7 at 7:19 pm (utc) on April 29, 2009]

londrum

7:23 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but you are missing the important point that you're costing the other person money, in bandwidth. and giving him nothing in return. it's as if you are taking his data and then presenting him with a bill, which he has to pay himself. that is what makes it unethical.

yobaby

7:29 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



bakedjake, the thread is not about spiders but scrapers, and trying to lump them together is not close to the same.

Scrapers are there for one thing and one thing only: stealing information to benefit the site that is scraping the information and not the site owner.

Scrapers steal content for their use.
The scraper in this case is using it as, let's say, a spy against the site he is scraping, for the benefit of the scraper at the expense of the site owner.

Is our conversation gonna make one bit of difference either way? Nope, so each has his own opinion as well as ethics.

To me it is stealing; to others it is a business where the ethics in each person draw the line.

Ever heard the old saying from a thief: "He will never miss it, it was only $1.00"?

HRoth

7:34 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do not see scraping to automatically lower your prices as wrong, just a foolish business practice. You have justified it slightly, but if you want the kind of customer who will buy from you because your widget is one cent cheaper than the other guy's, take all you want.

However, when the threats started in--"All I tried to get at was "don't make them mad. block them if you want to, but don't make them want to hurt you". You may not want a bot on your site. But you definitely do not want some crazy guy with a botnet going after your site. That's all I meant."--that kind of changed my mind for me. If you are hanging around with the kind of zit-faced boy who gets his underpants in a bunch because someone wrote something snotty on their own website and who then sets out to deliberately harm that site as "revenge," then yes, what you are doing is unethical and your ethics will continue to erode as you spend time with these people. Perhaps you know this and that is precisely why you had to ask this question, because otherwise, there is no ethical problem here that I can see.

[edited by: HRoth at 7:35 pm (utc) on April 29, 2009]

blend27

7:40 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



where is Bill?

bakedjake

7:42 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



bakedjake, the thread is not about spiders but scrapers, and trying to lump them together is not close to the same.

The technology is exactly the same. It's automated page retrieval from a webserver.

Question: If I were to build a shopping comparison site that got its data exclusively by spidering websites, would you consider it unethical if I spidered your website to list your prices as well as five of your competitors together, sending you sales and traffic, and did this out of the goodness of my heart just to help consumers find the best prices?

What if I sold ads on top of that content, while still sending you sales and traffic?

What if I sold that information back to your competitors (all the while continuing to send you traffic and sales)?

Wouldn't it simply be an economic choice? Either my spidering of your prices makes economic sense for you to allow, or it doesn't. It's binary.

I accept the argument that spidering costs some number in bandwidth costs. If that spidering costs a number that is so big you need to take action, it sounds like a smart business decision to me.

But publishing content on a world-wide network for all to see is an implicit invitation for everyone to see it. Why should their means of viewing it make it "ethical" or "unethical"?

Either you agree with competitive pricing intelligence activities or you do not. It should not matter whether it is 10 humans in a room or a robot, because if it is done correctly, it will end up "costing" the same amount of money in bandwidth and computing resources to you, the target of those activities.

If any "user" of your website is using your resources (read: costing you money), and has absolutely no chance of purchasing a product, then you should stop that access. But it shouldn't matter what their motivation is - whether competitive pricing intelligence or whether they're a user from India and you don't ship there.

If you block bots from accessing your website, do you block users from geographies who you have no intention of serving? Their access costs money, too.

[edited by: bakedjake at 7:46 pm (utc) on April 29, 2009]

janharders

7:43 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HRoth, you mixed up the OP with me. I'm the one who warned not to offend programmers (and especially not those who build bots). I'm not saying that people are allowed or right to go after someone digitally if they feel offended, I'm just saying there are people that do. It's the web-equivalent of "if you stand next to that monstrous kickboxer over there, try not to tell him that steroids make some parts grow and some parts shrink". yeah, you're totally within your rights to tell him, and yes, he's not allowed to hit you, but it's usually the better idea to not challenge his ethics. that's all I tried to say. And I'm sure my ethics are right where they should be, thank you very much for your concern.

arieng

7:49 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



I'd question whether Arieng is a scraper in the usual meaning of the term.

Admittedly, I probably could have used another term, especially since we don't refer to it as scraping internally. However, in speaking with other webmasters the response is usually along the lines of, "So what you're doing is essentially scraping." Figured that would be the best context in which to discuss it.

Stealing information to benefit the site that is scraping the information and not the site owner.

Yobaby, it doesn't sound quite as nice when put that way but essentially it's accurate nonetheless. In a competitive business environment, is that unethical?

This thread has brought up some pretty interesting questions for me. We've tried to stay under the radar, not overdoing the quantity or frequency or violating robots.txt. So far I don't think anyone has picked up on it. However, if someone figured it out and blocked us, would I respect that? That is the one I'm wrestling with.
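For anyone wondering what "not violating robots.txt" and "not overdoing the frequency" might look like in code, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `PriceBot`, the sample rules, and the 5-second fallback delay are assumptions made up for illustration; nothing here is taken from arieng's actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 10
"""

USER_AGENT = "PriceBot"   # assumed bot name, for illustration only
DEFAULT_DELAY = 5.0       # assumed fallback pause between requests, in seconds


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url):
    """Return True if robots.txt permits our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser):
    """Seconds to wait between requests: the site's Crawl-delay if it sets one."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY


rules = load_rules(ROBOTS_TXT)
```

A bot built this way would call `allowed(rules, url)` before every request and sleep for `polite_delay(rules)` seconds between requests. Whether that settles the ethical question is, of course, what the thread is arguing about.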

bakedjake

8:06 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



However, if someone figured it out and blocked us, would I respect that?

You obviously think this is important enough to your business to do (and it seems you've spent some money developing the technology), so even if your robot gets "blocked" you would likely still use some mechanism (even if it's a yellow legal pad or Excel) to compare prices with this competitor, correct?

Demaestro

8:07 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



LIA, those are fair comparisons to my analogy. Which is why I hate real world to digital world examples. They just aren't the same thing.

I presume that the OP's bot is not using that much bandwidth and I hope it has never taken down a site while doing its thing.

Obviously your use of the content must be a legal use, and from what the OP has laid out for us in his post, his uses are legal.

I have seen many times over people going through a store with a clipboard recording prices and product placement on shelves. No intention of buying anything. Some stores allow it, some will throw you out, web operators have the same choice. Physics points out the drain on the B&M owners' resources above ^^

you are grabbing stuff from them which you are using to build your site, and giving them nothing in return.

And how does this change if a human does it?

arieng

8:10 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



correct?

100% correct. I'm going to use competitive pricing data regardless of the method of acquisition.

yobaby

8:14 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



arieng
We've tried to stay under the radar,
because?
not overdoing the quantity or frequency
but arieng, there are thousands upon thousands of bots hitting the sites; the combined money they cost the owners is staggering when put together
or violating robots.txt.
Most don't.

if someone figured it out and blocked us, would I respect that? That is the one I'm wrestling with.
So I guess then it goes a step further and now it's all about you and everyone else is fair game.

The things we do in life reflect the person we are. Knowing what is wrong and doing it is how we lose or adjust our Ethics to fit our need whatever it takes at whatever cost it may be to others and ourselves.

tangor

8:18 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



100% correct. I'm going to use competitive pricing data regardless of the method of acquisition.

Discussion over (as far as I'm concerned). Business ethic announced. This has been an interesting discussion, up to a point, and some thoughtful observations pro and con have been voiced.

Personally, I'm not concerned since I whitelist robots. As arieng indicates that he honors robots.txt, that's good enough for me. Come in any other way and get nuked, just like all the other bots that don't obey the site admission rules. :)
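As a rough illustration of what "whitelisting robots" can mean, here is a minimal sketch that sorts requests by user-agent string. The allowlist, the marker substrings, and the function name `classify` are all assumptions for illustration, not tangor's actual rules. User-agent strings are trivially spoofed, so a real setup would need a stronger check than this.

```python
# Hypothetical allowlist of bot tokens; adjust to the crawlers you welcome.
ALLOWED_BOTS = ("googlebot", "bingbot", "slurp")

# Substrings assumed (for this sketch) to mark a client as automated.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "curl", "python-requests")


def classify(user_agent):
    """Return 'allowed_bot', 'blocked_bot', or 'browser' for a user-agent string."""
    ua = (user_agent or "").lower()
    # Welcome bots are checked first, so they are never caught by the markers.
    if any(token in ua for token in ALLOWED_BOTS):
        return "allowed_bot"
    # An empty user agent, or any other self-declared bot, gets turned away.
    if not ua or any(marker in ua for marker in BOT_MARKERS):
        return "blocked_bot"
    return "browser"
```

The order of the checks is the whole idea: anything automated that is not explicitly on the list is refused, rather than trying to enumerate every bad bot.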

londrum

8:19 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



people window shop and browse all the time without ever spending any money, but at least there's the potential of making a sale somewhere down the line. after all, that is what advertising is all about. putting your product inside people's heads.
if shops started actively banning everyone who didn't hand over some pennies then i'm guessing their business would die pretty quick. in fact, shops actively encourage that kind of behaviour by sending people free brochures in the post and tarting up their window displays and opening coffee shops next to their clothes racks.

but how can that be compared to what a scraper/data miner does? a scraper has no intention of buying anything, ever. he is just costing you money whilst mining your stuff for his own benefit -- and to make it even worse, he is likely going to be in competition with you.

bakedjake

8:20 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm going to use competitive pricing data regardless of the method of acquisition.

:)

janharders

8:32 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but how can that be compared to what a scraper/data miner does? a scraper has no intention of buying anything, ever. he is just costing you money whilst mining your stuff for his own benefit -- and to make it even worse, he is likely going to be in competition with you.

in this (pricing intelligence) case, yes. in other cases: why? if it's a price comparison, it will not buy itself, but it'll send you traffic that may convert to sales. I happen to have written a bot for a friend of mine that spiders his distributor's prices for certain products because that distributor isn't able (no joke here, they stated "sorry, our software won't let us export to csv or any other format") to send them a price list. You'd have to go to their website, log in as a registered reseller and look up a price before you order. Of course, if you have multiple possible distributors, you won't buy at one where you have to spend hours just to research prices which change every few days ... so this bot actually brought them sales which wouldn't happen without this bot. Granted, it still uses their bandwidth, and also, they don't really know it (one guy at their IT does, but he said "don't ask. they'll probably deny it because they won't understand. just do it, we won't block you").

rachel123

8:35 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



As long as you respect my robots.txt and don't republish my stuff, I'm cool with it - take whatever you want. After all, I made it publicly available, and there's nothing stopping me from responding in kind.

Now, if my site is getting so plugged up by bots from all sorts of places, or if your gleaning of info is negatively affecting MY business model - well then, I'm going to either block you or cloak you, depending on what the situation calls for.

Since I'm a biologist in another life, I think of these things in terms of symbiosis.

Google and I are mutually symbiotic. I welcome them.

Other, friendly, respectful bots and I are commensally symbiotic. Couldn't care less about them.

Bots that use excessive bandwidth and/or republish my stuff and/or otherwise put the information they get to sinister usage are parasites. I fumigate. :)

Demaestro

9:59 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Scrapers are there for one thing and one thing only: stealing information to benefit the site that is scraping the information and not the site owner.

If only things were so black and white. I could list 100s of examples of scrapers that are not only a benefit to me but to society in general. Are you suggesting that search engines are of no benefit to site owners?

I don't know how to explain this to someone who thinks gathering publicly accessible data is the same as theft other than to say, they are not the same thing... not even close.

It is people like you yobaby that perpetuate misunderstandings with the Internet. Keep it up yobaby, you are doing good work, if you can change everyone's mind you will succeed in making the population as ignorant as you are.

I have a solution for you.... put your website up on a private server, on a private network, and have the people you want to see it come into your building and view it on an intranet.

This way no one can view your content unless you expressly give them your permission. But if you want the public to view your site you might want to PUBLISH your web page and make it available to the PUBLIC to use within the confines of the law. Maybe you should understand what publishing a website means before you keep talking on the subject.

If you want to redefine theft within the law books then go ahead. But until such time you do..... DO NOT come here, a place for professionals, calling me and other 100% legal webmasters thieves. That is libel and IS a crime, punishable by law...

You can keep repeating it but it doesn't make it true, you may think you are clever for taking such a hard stance but really you look a fool for stretching something so thin it is unrecognizable.

[edited by: Demaestro at 10:02 pm (utc) on April 29, 2009]

signor_john

11:00 pm on Apr 29, 2009 (gmt 0)



you are missing the important point that you're costing the other person money, in bandwidth. and giving him nothing in return.

Does Dean & Deluca subscribe to Williams-Sonoma's catalog?

Isn't the Chevy dealer who advertises in the WIDGETVILLE CHRONICLE paying for the Ford dealer to read his ads?

When Budweiser advertises on TV during the Super Bowl, do Bud's marketers get upset because they're paying for an audience that includes competitors who are taking notes?

If Pete's Pizza Parlor has a stack of take-out menus on the front counter, does Pete cry "Theft!" if an employee of Paul's Pies picks up a menu for his boss on the way to work?

Bandwidth is cheap, and any business that publishes a Web site for public consumption is making that site available to prospects, to the merely curious, and--yes--even to competitors.

incrediBILL

11:19 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've spent years fighting scrapers to get my STOLEN content out of the SERPs and regain 100% SE ranking for my own material and I'm very vocal about how I feel about scrapers.

How can I be respectful of people that have cost me many thousands (tens? hundreds?) of dollars using my own content against me?

Anyone I catch scraping can expect everything from a DMCA to AUP violation reports, but respect will not be on the list.

Like I said, nothing personal, but business-wise, I'm your worst enemy.

[edited by: bakedjake at 11:52 pm (utc) on April 29, 2009]

[edited by: incrediBILL at 12:05 am (utc) on April 30, 2009]

arieng

11:22 pm on Apr 29, 2009 (gmt 0)

10+ Year Member



You've read the first paragraph of my first post and taken it out of context. The discussion is about the ethics of using automated content extraction as a competitive pricing tool.

No content is ever being republished anywhere.

incrediBILL

11:26 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You've read the first paragraph of my first post and taken it out of context. The discussion is about the ethics of using automated content extraction as a competitive pricing tool

Even as a competitive pricing tool you'll still steal MY resources, impacting MY server's response time, and using MY own material against me.

It doesn't matter if you republish the content or not, you're still theoretically using it against me, which puts us at odds against each other.

On a more pragmatic note, unless you have a few thousand fast flux IPs you won't scrape my properties anyway, but in principle the attempt itself irks me to the nth degree.

[edited by: incrediBILL at 11:27 pm (utc) on April 29, 2009]

koan

11:48 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the problem, compared to real world examples, is that automatic methods of scraping don't scale well for the owner of the web site and become a financial and technical burden when you add them all up, all to the benefit of the scraper, who has to invest very little, without giving anything in return.

It's like spammers who are offended that you won't accept receiving just their tiny daily email that barely costs you a cent in bandwidth and a second of your time. The problem is there are thousands of other spammers who think the same way and suddenly, they're slowly destroying the medium by overwhelming the victims.

I'm sure a real world open store that would have to deal with 300 people with their yellow notepad taking notes everyday, taking space and time that should be spent on real customers, would have to figure out a way to expel them.

Why doesn't it happen? Because the 300 people would cost too much money to real world "scrapers". But with computerized, automatic techniques, like spam sending millions of emails a day with little expense, the burden is on the victim, not the scraper.

So this form of scraping is indeed unethical, unless you enter a deal with the other web site to allow such practices, maybe compensating them a little financially, or with some form of promotion. That's why webmasters welcome legit bots like Google, Yahoo or Live. There's a mutually beneficial relationship. Not with scrapers.

PS: That comment about the scrapers being willing to attack a web site that doesn't welcome them reminds me of the behavior of email spammers, who are also willing to sink to such lows. If a certain activity overwhelmingly attracts that kind of sociopath, there are questions to ponder for the more upstanding people doing it.

swa66

11:55 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not ready to consider determining the price your competitors sell stuff at to be scraping. But that's part of a definition.

Determining prices is what large physical stores do, by walking in the competitor's store. It's not so unethical, as it's accepted practice out here in the real world. I'm not sure how openly they do it, but they're quite obvious when in e.g. a supermarket.

The line where it gets bad is where copyright is on one end (but what you sell at how much isn't copyrightable AFAIK) and price fixing on the other end.
If you hurt the performance of the store that's unethical too. (Imagine a supermarket only visited by people noting the prices for the competition, with no room in the parking lot for regular customers anymore...)

Key_Master

11:59 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think the definition of what a 'scraper' is has become overly broad and has probably lost its original meaning over the years. Considering the hostility in this thread towards any automated Web application, one would have to assume a Web browser is a scraper and we're all guilty of site scraping.

My girlfriend is a scrapper. Not the dubious type in discussion in this thread- the hobby type. She takes content from a variety of sources (photos, articles and such) and compiles them together into a scrapbook she calls her own- most often for a specific topic. Not unlike what a true scraper does with our content.

A search engine bot is a Web harvester, although some search engines also mine data (Google Alerts) and sometimes they cross the line into scraping (Google Cache).

Data mining (web data mining, text data mining or whatever), is the extraction of specific information usually used for research or statistical purposes. A SKU or a price is not copyright protected information, so you could hardly call the poster a thief or accuse him of stealing. In any case, the poster doesn't republish this information, the bot identifies itself as a bot, and it even obeys robots.txt.

If you're against data mining, that's fine, but don't be hypocritical. We all use data mined from one resource or another- it's everywhere.

If the prevailing belief is that for a scraper site to be 'ok', it has to provide some benefit for the sites it scrapes, then that pretty much leaves out a lot of the hated scraper sites out there. A lot of them link back to the original source and even refer traffic.

arieng

12:32 am on Apr 30, 2009 (gmt 0)

10+ Year Member



There have been quite a few analogies put forward, I'll add one of my own.

Call centers. There is a tried and true strategy in the mail order industry from the pre-web days. Call your competitor's order line, inquire about a variety of products, special offers, member's discounts, etc.

One could argue that the financial harm of this is consistent with a spidering of a website. The web just makes it easier to accomplish, and easier to counteract.

In my opinion, this practice is so widespread that there isn't any dilemma on ethical grounds. Why is the web different?

rachel123

12:45 am on Apr 30, 2009 (gmt 0)

10+ Year Member



In my opinion, this practice is so widespread that there isn't any dilemma on ethical grounds.

LOL well this gets to it, doesn't it.

I put forth that the effectiveness or the usefulness or the commonness of the practice is beyond dispute.

But deception is never ethical...however necessary it may be in terms of $$. Sure, everyone does it, but I don't pretend that it's morally the right thing. It's a principle I'm willing to compromise for a buck. ;)

Edit to add - and I'll cloak your bot without losing sleep, too. Another principle I'm willing to compromise. LOL

[edited by: rachel123 at 12:48 am (utc) on April 30, 2009]

koan

12:48 am on Apr 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There have been quite a few analogies put forward, I'll add one of my own.

All the real world analogies have one thing to counteract the possible abuse of "scraping". It actually requires time and energy for the scrapers to do what they do, so these limits act as natural guards and help create a sustainable ratio between real customers and competitors.

In the computer world where large operations can be done at the click of a button, suddenly it becomes overwhelming for web sites who have to cover *all* the costs.

arieng

12:51 am on Apr 30, 2009 (gmt 0)

10+ Year Member



Well said Rachel. I guess that phoning competitor's call centers isn't exactly the model of ethical behavior, but in terms of business ethics I'd give it a pass.

incrediBILL

1:00 am on Apr 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll cloak your bot without losing sleep, too. Another principle I'm willing to compromise. LOL

Exactly why visitors to my sites get a captcha after X page views to verify they're human or bot.

If no human response, the IP and UA are quarantined from that point forward with logic I won't disclose that keeps this type of scraping to a bare minimum.
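The actual logic is undisclosed, so the following is only a guess at the general shape of a "captcha after X page views" gate, written in Python. The class name `VisitorGate`, the limit of 20, and the rule that a passed challenge is trusted from then on are assumptions for illustration, not incrediBILL's method.

```python
from collections import defaultdict

PAGE_VIEW_LIMIT = 20   # stands in for the unspecified "X"; an assumed value


class VisitorGate:
    """Counts page views per (IP, user agent) and decides how to respond."""

    def __init__(self, limit=PAGE_VIEW_LIMIT):
        self.limit = limit
        self.views = defaultdict(int)   # page views seen per visitor key
        self.verified = set()           # visitors who passed a challenge
        self.quarantined = set()        # visitors who failed a challenge

    def on_request(self, ip, user_agent):
        """Return 'serve', 'challenge', or 'block' for an incoming page view."""
        key = (ip, user_agent)
        if key in self.quarantined:
            return "block"
        if key in self.verified:
            # Simplification: a passed challenge is trusted indefinitely.
            return "serve"
        self.views[key] += 1
        if self.views[key] > self.limit:
            return "challenge"
        return "serve"

    def on_challenge_result(self, ip, user_agent, passed):
        """Record the outcome of a captcha shown to this visitor."""
        key = (ip, user_agent)
        if passed:
            self.verified.add(key)
            self.views[key] = 0
        else:
            self.quarantined.add(key)
```

With `VisitorGate(limit=3)`, the fourth page view from the same IP and user agent gets a challenge. If the challenge fails, every later request from that pair is blocked, while other visitors are unaffected.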

This 136-message thread spans 5 pages.