Forum Moderators: buckworks

Message Too Old, No Replies

Ecommerce: The Ethics And Value In Scraping, or Data Mining

         

arieng

8:09 pm on Apr 28, 2009 (gmt 0)

10+ Year Member



I like to follow the threads about how to combat scrapers, but not because I am having my content stolen. The truth is, I'm a scraper. I scrape other sites almost every day. I am constantly evolving my strategies to extract content from sites that are not my own.

What am I doing with this content? Actually, I'm throwing most of it out the window. The only things I'm keeping are SKUs... and prices.

In the past I've mentioned to other webmasters that I'm a scraper, and I could tell they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry, and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same, because I spend just as much time trying to block their bots from accessing my prices.
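The OP doesn't describe his implementation, but the rule he sketches (undercut the cheapest scraped competitor price while protecting a minimum margin) can be written in a few lines. This is a hypothetical sketch only; the function name, floor margin, and undercut amount are made up, not taken from the thread.

# Hypothetical daily repricing rule driven by scraped competitor prices.
# The policy, names, and numbers are illustrative only.

def reprice(my_cost, competitor_prices, min_margin=0.10, undercut=0.01):
    """Price just below the cheapest competitor, but never below cost plus margin."""
    floor = my_cost * (1 + min_margin)           # never sell below this
    if not competitor_prices:
        return round(floor, 2)                   # nothing scraped today
    target = min(competitor_prices) - undercut   # slip under the lowest price
    return round(max(target, floor), 2)

# Example: my cost is $8.00, competitors were scraped at $10.49, $9.99 and $11.25
print(reprice(8.00, [10.49, 9.99, 11.25]))       # -> 9.98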

So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?

incrediBILL

9:00 pm on May 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Legally sound, certainly, but not ethically.

Whether stealth bots are legal is questionable IMO and I'd love to see the law tested on such a case, I really would.

Demaestro

11:33 pm on May 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whether stealth bots are legal is questionable IMO and I'd love to see the law tested on such a case, I really would.

Which existing law would you like to see tested?

Theft?

Trespassing?

Hacking?

[edited by: Demaestro at 11:34 pm (utc) on May 2, 2009]

incrediBILL

12:05 am on May 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which existing law would you like to see tested?

I only discussed it a couple of times in this thread if you read it:
"Computer Hacking and Unauthorized Access Laws"

I ran it past a few legal minds; they thought scraping fit the definition of unauthorized access. It just needs a test case to make it stick.

anallawalla

3:51 am on May 4, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This is something that only a court can decide, but I think scraping prices and SKUs is content theft when the entire site is scraped. If the site owners have copyright statements on their site, then it is a breach of copyright as well.

I see the OP's example as akin to someone repurposing stock exchange data where they "only" take the ticker symbol and the trade prices but not the detailed company name, description, etc. I consider this unethical, but it doesn't get me worked up.

Sylver

4:07 am on May 4, 2009 (gmt 0)

10+ Year Member



A lot has been said for both sides of the issue, and I would like to share a few thoughts. Nothing really new, so it's more like a vote:

* Bandwidth: "Bots use bandwidth they do not pay for, therefore it is stealing."

By the same reckoning, visitors use bandwidth without paying for it, therefore they are stealing... not.

A well-behaved bot uses about as much resource as a visitor, which comes down to a few cents, if that (most website owners pay a flat monthly fee for bandwidth and never exceed it, so the actual cost is zero for most of them).
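To put a rough number on "a few cents, if that", here is a back-of-the-envelope calculation with purely illustrative figures; the page size, pages per visit, and per-GB transfer price are assumptions, not numbers from this thread.

# Back-of-the-envelope cost of one well-behaved bot visit.
# All three inputs are assumptions for illustration only.

pages_per_visit = 20          # a thorough bot, or a very curious visitor
avg_page_kb = 50              # text-only page, no images or video
cost_per_gb = 0.10            # dollars per GB of metered transfer

transfer_gb = pages_per_visit * avg_page_kb / (1024 * 1024)
cost_cents = transfer_gb * cost_per_gb * 100
print(f"{cost_cents:.4f} cents")    # roughly 0.0095 cents per visit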

And let's be honest here, guys. The objection is not the few cents of bandwidth used; the real objection is that you do not want your competitors to keep tabs on your business. "Don't compete with me."

With all this talk about "dishonesty", one would think that our paragons of virtue would have the honesty to come right out and say what they really mean instead of hiding behind a flimsy "my bandwidth, mine, my precious".

If bandwidth isn't what you are really concerned about, then please have the honesty to say what is.

In a free market economy, your competitors have a right to know your prices, and vice versa. Every store must do some market research. You can't start in a business without knowing what people normally pay for the type of products/services you sell.

Of course, a bot hitting your website hard is a totally different matter. One guy coming into your store once a month with a pad to check your prices is different from a thousand guys blocking out your customers every day of the week.

* "Let's ban all bots, except those that bring traffic"
In practice, that means banning all bots except existing search engines. Cool, why didn't we do that a few years ago? Google could never have evolved and we would still be searching on Altavista and directories.

* Business intelligence:
To people here complaining about business intelligence gathering, let me ask you a question: How do you feel about checking your competitors' rank on Google or checking their backlinks? What about their keywords? What about looking at their ads and seeing what they are targeting and how?

In a free market economy, market research (or business intelligence, whatever you want to call it) is necessary.

I am not saying that anything goes, but that we ARE competitors. And competitors compete.

I am a translation & software provider. I try to provide better services/products than my competitors, within the same price range, which means I have to know what level of service they provide and do better, and I have to know their price ranges. That requires a certain level of market research. Not too much in my case, but still some.

* Legality:
Before discussing legality, it is necessary to establish jurisdiction.

The Computer Hacking and unauthorized access laws (http://www.ncsl.org/programs/lis/CIP/hacklaw.htm) differ from state to state, not to mention other countries (I am not a US citizen and do not reside in the US).

Unless jurisdiction is established, any talk of legal action is meaningless. (IncrediBill, run that through your legally minded friends and see what their opinion is).

incrediBILL

6:53 am on May 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



* Business intelligence:
To people here complaining about business intelligence gathering, let me ask you a question: How do you feel about checking your competitors' rank on Google or checking their backlinks? What about their keywords? What about looking at their ads and seeing what they are targeting and how?

Not only do I object to all of the above (and most of your post) but I've told people, even my competitors, how to thwart it all for the last few years, most notably 2 times at Pubcon.

Not to brag, but I was the first to ID how Picscout, LinkScape and many others were crawling sites, making screen shots, and much more.

The point you're overlooking is that in these scenarios the business itself often bears the brunt of the expense of the intel gathering, which may even be illegal, and which can typically be thwarted with a little monitoring software.
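The thread never spells out what that "little monitoring software" looks like. As one hedged illustration, a script like the following counts requests per client IP in an access log and flags addresses whose volume looks automated; the log path, threshold, and log-format assumption are placeholders, not a recommendation.

# Hypothetical example of simple bot monitoring: count requests per IP in a
# day's access log and flag heavy hitters. The path, the threshold, and the
# assumption that the IP is the first field (combined log format) are placeholders.

from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"
DAILY_THRESHOLD = 2000        # far above what a human with a browser generates

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        ip = line.split(" ", 1)[0]
        hits[ip] += 1

for ip, count in hits.most_common():
    if count < DAILY_THRESHOLD:
        break
    print(f"suspect bot: {ip} made {count} requests today")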

Unless jurisdiction is established, any talk of legal action is meaningless. (IncrediBill, run that through your legally minded friends and see what their opinion is).

Get real, it's not like I'm going to file suit against someone in Russia, but anyone in the US is fair game, and they are numerous. However, you called me out but missed the point that I'm not likely to be testing these laws, because I have automated procedures in place that stop the problem before it needs to be tested in court.

Someone else will have to give it a try.

Bottom line, it has nothing to do with business intelligence and everything to do with an entitlement mentality: businesses doing whatever they want online without repercussion, which some of us have stymied, and more will follow suit.

Sylver

4:13 pm on May 4, 2009 (gmt 0)

10+ Year Member



* Business intelligence:
To people here complaining about business intelligence gathering, let me ask you a question: How do you feel about checking your competitors' rank on Google or checking their backlinks? What about their keywords? What about looking at their ads and seeing what they are targeting and how?

Not only do I object to all of the above (and most of your post) but I've told people, even my competitors, how to thwart it all for the last few years, most notably 2 times at Pubcon.

Not to brag, but I was the first to ID how Picscout, LinkScape and many others were crawling sites, making screen shots, and much more.

So you are saying that people should not:
* Check who is ranking above them for their main keywords.
* Search for the links of their competitors
* Visit their competitors' websites
* Search Google for keyword ideas (Google's results are basically a listing of their competitors for the said keywords)
* Look at their competitors' ads
* Find out what prices are currently charged in their industry (are people really supposed to just decide on a price with no concern for the going rates?)
...

Why shouldn't they? I don't really see the basis for your objection.

It bears mentioning that if this is really what you mean, you seem to be quite alone in that persuasion, if WebmasterWorld is any indication.


The point you're overlooking is that in these scenarios the business itself often bears the brunt of the expense of the intel gathering, which may even be illegal, and which can typically be thwarted with a little monitoring software.

Looks like we are talking about different things.

First, none of the steps above can actually be prevented in any way with "a little monitoring software". There is no caching or cloaking that could prevent anyone from checking out your rankings on Google, your backlinks or your ads.

Second, "the brunt of the expense of the intel gathering"... As far as I understand, here we are talking about a few Mbs a month (if you pull only text, no pictures/video, a few Mb is a lot of content) so the costs can be calculated in cents and quarters.

Not a very convincing argument.

If we were talking about several GB per day, that would be a different matter (DoS).


Get real, not like I'm going to file suit against someone in Russia, but anyone in the US is fair game and they are numerous. However, you called me out but missed the point that I'll not likely be testing these laws because I have automated procedures in place that stop the problem before it needs to be tested in court.

Someone else will have to give it a try.


I assumed you were talking about a solution to the problem rather than revenge. Sure, there are a lot of US Internet users, but they are an increasingly small minority, so there is no genuine solution in sight down the legal path.


Bottom line, it has nothing to do with business intelligence and everything to do with an entitlement mentality: businesses doing whatever they want online without repercussion, which some of us have stymied, and more will follow suit.

That's not what this is about. There are valid uses of the technology and abuses. Your position appears to be that all spidering that doesn't directly benefit the website owner is illegitimate and should stop at once.

In my opinion, such an attitude stifles innovation and greatly reduces the value of the Web. There is nothing ethical about it.

blend27

2:32 pm on May 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sylver,

--- Not a very convincing argument. ---

I think you're missing the point. Just because I already pay for it does not mean that you are free to do whatever you want with it at will.

-- There are valid uses of the technology and abuses --

There are, but only when authorized by the site owner. The mentality of "I am going to do it until I am forced to stop" is far different from "Can I please have a byte of the Apple that you paid for".

Try running WebPosition on Google's SERPs. Are you gathering business intelligence? I don't think so, not anymore.

incrediBILL

5:33 pm on May 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First, none of the steps above can be actually prevented in any way with "a little monitoring software". There is no caching or cloaking that could prevent anyone from checking out your rankings on Google, your backlinks or your ads.

True, nothing I can do can stop you from scraping Google, but Google has blocked more than a few on their own. However, I can make it as difficult as possible to gather intel, with NOARCHIVE to limit access to full cached pages, and by blocking all archiving sites to eliminate any historical record of how my SEO evolved, etc.
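For readers who haven't used it: NOARCHIVE is normally delivered either as a robots meta tag or as an X-Robots-Tag response header. Below is a minimal sketch of sending it site-wide; the Flask app is hypothetical, and the same header can just as well be set in the web server configuration.

# Hypothetical sketch: send "noarchive" on every response so search engines
# can index pages but keep no public cached copy of them.

from flask import Flask

app = Flask(__name__)

@app.after_request
def add_noarchive(response):
    # Equivalent to <meta name="robots" content="noarchive"> on every page
    response.headers["X-Robots-Tag"] = "noarchive"
    return response

@app.route("/")
def index():
    return "page content"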

I was able to thwart sites such as LinkScape from gathering intel directly from my servers, so competitors snooping on me got a lot less detail than they did about other sites.

Second, "the brunt of the expense of the intel gathering"... As far as I understand, here we are talking about a few Mbs a month (if you pull only text, no pictures/video, a few Mb is a lot of content) so the costs can be calculated in cents and quarters. If we were talking about several GB per day, that would be a different matter (DoS).

What part of thousands of bots doing the same thing don't you see causing an expense?

You miss the point 100%: it's not all about data bandwidth, and in some parts of the world bandwidth allowances start much lower and cost a lot more.

I had to increase the horsepower of the dedicated server to deal with them, additional expense, $$$s more a month.

Upgrading from a 10 Mbps pipe to a 100 Mbps pipe to handle the extra load, EXPENSE, $$s more a month.

Building automation to stop them, software costs money, EXPENSE

Scrapers ranking against me (before I stopped it) costing me money, $$$,$$$ possible lost income

When you have a few hundred, sometimes 1K, automated tools hitting your site per day, it's a DDoS all day, every day, and yes, we were talking about nearly 1 GB/day before I stopped it.

There are valid uses of the technology and abuses. Your position appears to be that all spidering that doesn't directly benefit the website owner is illegitimate and should stop at once.

Without the webmaster expressly authorizing that access, there are no valid uses, only abuses.

Bottom line, it's MY server, not anyone else's, and there are absolutely ZERO valid arguments to rationalize what I should and should not allow to access my servers.

[edited by: incrediBILL at 5:34 pm (utc) on May 5, 2009]

incrediBILL

5:58 pm on May 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let's evaluate who's scraping, er crawling, and how much impact it could have on your sites.

Google, Yahoo and MSN crawling daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Include search engines like Ask, Gigablast, Snap, Fast, etc. and your bandwidth is easily on the rise.

Now expand that list to include the international search engines like Baidu, Sogou, Orange's ViolaBot, Majestic12, Yodao, and on and on, growing list daily

Throw in all the open source search engines like Nutch and Heritrix that everyone thinks they can use to make "the next best thing", and there are literally hundreds of these out there.

Then we have all the spybots that attempt to crawl like Picscout, Cyveillance, Monitor110, Picmole, RTGI, and on and on.

Next add up all the specialty niche bots like Become, Pronto, OptionCarriere, ShopWiki, etc.

Plus the web downloaders, offline readers, directories, and other things doing link checking, making screen shots, and much more.

Don't forget RSS feed readers and aggregators that pull down your RSS feeds that nobody ever reads or those feed finders like IEAutodiscovery that run amok on your site pulling thousands of pages just looking for RSS feeds.

Then we have the family friendly filtering sites plus the anti-virus vendors like AVG all trying to preview pages before they were viewed.

I'm just getting started, and I'm sure I've glossed over a bunch of stuff like the script kiddies, the SEO tools, failed pre-fetch technology in Firefox and Google Web Accelerator, and all sorts of other junk that hammers sites daily.

Legit or not legit, it's TOO MUCH when the bots exceed your visitors.

Yup, no bandwidth wasted there, none whatsoever.

Not to mention the poor ecommerce site owner who doesn't understand what's happening with all this automated noise and is pulling his hair out stressing over all the "bounce" from his site, firing SEOs and designers and webmasters because of it, when in reality a lot less of that traffic is human than he realizes.

No harm tricking sites with stealth crawling, no stress on the poor site owner with all those "customers" that aren't buying anything.

No harm at all, it's all perfectly fine.

[edited by: incrediBILL at 6:00 pm (utc) on May 5, 2009]

Whitey

10:28 pm on May 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My take on whether it's ethical or not is overtaken by "is it legal?"

- Do the owners of the data give permission for it to be re-published and promoted?
- Do you adhere to their conditions?
- Do you disclose the source?

Anyone who takes data without permission is subject to a copyright action, and there are many classic cases that have gone before courts around the world.

Bottom line is, if it's consensual, OK; if not, well, it's not OK in law. Ethics is another issue, but this might help.

Sylver

6:47 am on May 8, 2009 (gmt 0)

10+ Year Member



@IncrediBILL: thanks for the explanation, I can see where you are coming from.

Still, I think that somewhere along the line we must find some kind of middle ground and accommodate notions like fair use (not talking about the copyright concept, just the normal meaning of these two words), serving the users (the reason why most sites exist) and keeping expenses within reasonable limits.

Whenever you put information on the Web, you do so with the understanding that you are taking on expenses (hosting, bandwidth, dev time,...) to provide services to a lot of strangers and possibly to contribute your piece to the sum of human knowledge (sounds lofty until you realize that "where to buy decent blue widgets" is also part of said human knowledge).

Of course, you expect to benefit from this as well, but you know right from the start that a large part of the expenses will not benefit you directly, if at all.

So it comes down to serving users while avoiding leeches. And serving the users is not limited to people visiting your website with their browser of choice. They say "no man is an island" and that's also true for websites.

Allowing Google to spider your website and use your bandwidth ultimately contributes to all of us being able to find stuff on the Web. Not just finding your site. The value is far beyond that. While you may consider foreign engines like Baidu a waste of your bandwidth, they nevertheless assist millions of users and increasingly so.

I recently completed a study of Internet usage per language and GDP (contact me in private if interested) and it turns out that over 70% of the Web's users don't speak English, a number which is increasing at a significant pace. This does shine a different light on foreign search engines in terms of serving the users.

If you go through your list of bots, you will find that "serving the user" is the common denominator of most of them. Offline readers make life a lot easier for users who travel a lot (I am one of those), and some of that "failed prefetch technology" is really making a difference for users with slow connections.

Apart from script kiddies and copycats, it mostly comes down to the level of service, direct or indirect, which you provide to the users.

I agree with you that too much is too much and that a website owner has every right to decide which service he wishes to provide. If a website owner wanted to restrict his website to Opera 5 users browsing from 7-8pm GMT+6 on a Friday night, he would be perfectly within his rights to do so. Unfortunately, that would also make his website mostly useless.

My point is that there has to be a fair middle ground somewhere and that you shouldn't really put in the same bag a bot using a few Mbs every once in a while and folks hitting you 24/7 for all they can get.

It's kind of like managing your hard disk space: cleaning up small individual files which you think you won't need is useless and often damaging ("*** I needed that"), versus isolating and removing the large offenders (for instance, Chrome will use GBs just to store thumbnails!).

I would say that as long as a bot owner respects robots.txt, keeps his usage within reason and does not republish copyrighted content without permission, he is perfectly within his rights.
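As a concrete picture of what "respects robots.txt and keeps usage within reason" could mean, here is a minimal sketch of a well-behaved fetcher; the site URL, user-agent string, and delay are placeholders, not recommendations.

# Hypothetical well-behaved fetcher: checks robots.txt before each request
# and throttles itself so its load stays negligible. All constants are
# placeholders for illustration.

import time
import urllib.robotparser
import urllib.request

SITE = "https://www.example.com"
USER_AGENT = "ExamplePriceBot/1.0 (bot-owner@example.com)"
DELAY_SECONDS = 30            # far slower than a human clicking around

rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

def polite_fetch(path):
    url = SITE + path
    if not rp.can_fetch(USER_AGENT, url):
        return None           # robots.txt says no, so respect it
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    time.sleep(DELAY_SECONDS)  # keep the load on the site negligible
    return data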

@Whitey: Please read the thread before you post.
First, look at [copyright.gov...] for details of what copyright protects. Prices are not copyright protected.
Second, law depends on jurisdiction, so saying something is illegal without specifying the location doesn't amount to much.
Third, we are discussing the ethics of the practice, not the legality of it.

HRoth

3:00 pm on May 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"you shouldn't really put in the same bag a bot using a few Mbs every once in a while and folks hitting you 24/7 for all they can get."

But one point Incredibill made (convincingly, IMO) was that it isn't that it's a bot using a few mb once in a while; it's a thousand bots using a few mb each. That puts a different color on the practice, no?

incrediBILL

3:59 pm on May 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



some of that "failed prefetch technology" is really making a difference for users with slow connections

Not on my site it isn't, because it yanks down pages like LEGAL, PRIVACY, etc. that people never read. It's totally random, which is why I call it FAILED and block it from working on my sites.

If prefetch allowed the webmaster to tag the popular pages to prefetch, or got that information from another source, then it would make sense, because the prefetched page would actually be likely to be read.

I would say that as long as a bot owner respects the robot.txt, keeps its usage within reason and does not republish copyrighted content without permission, he is perfectly within his rights.

He is within his rights if he honors this for my sites:

#allow these bots
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: Mediapartners-Google*
Disallow:

#block all other bots that ask
User-agent: *
Disallow: /

[edited by: incrediBILL at 4:06 pm (utc) on May 8, 2009]

Sylver

4:01 pm on May 8, 2009 (gmt 0)

10+ Year Member




"you shouldn't really put in the same bag a bot using a few Mbs every once in a while and folks hitting you 24/7 for all they can get."

But one point Incredibill made (convincingly, IMO) was that it isn't that it's a bot using a few mb once in a while; it's a thousand bots using a few mb each. That puts a different color on the practice, no?

I got that, but I don't think that's true, even based on IncrediBILL's own post:

Google, Yahoo and MSN crawling daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Include search engines like Ask, Gigablast, Snap, Fast, etc. and your bandwidth is easily on the rise.

Now expand that list to include the international search engines like Baidu, Sogou, Orange's ViolaBot, Majestic12, Yodao, and on and on, growing list daily.

...those feed finders like IEAutodiscovery that run amok on your site pulling thousands of pages just looking for RSS feeds.

Next to the bandwidth used up by any of these, the traffic of small, bandwidth-cautious bots is insignificant.

Let's run a bit of math. Suppose 500 small bots hitting your website every month for 5 MB each.

Now, a small site seldom gets that kind of attention, so this is a scenario that applies to a fairly large site already. This is not a "mom & dad" kind of operation.

All together, they run up about 2.5 GB per month, or 30 GB per year. Based on the price of a regular UK host (a decent one, with data centers and security, and overflow bandwidth charges), and using the *least advantageous* package, those 30 GB represent a total value of $15. Per year.
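Written out, the arithmetic above looks like this; the 500 bots and 5 MB each are the assumptions stated in the post, and the roughly $0.50/GB rate is simply back-calculated from the "$15 for 30 GB" claim rather than taken from a real hosting price list.

# Sylver's figures written out. Inputs are his stated assumptions, not measurements.

bots_per_month = 500
mb_per_bot = 5

gb_per_month = bots_per_month * mb_per_bot / 1000    # 2.5 GB
gb_per_year = gb_per_month * 12                       # 30.0 GB
implied_cost_per_gb = 15 / gb_per_year                # ~$0.50/GB implied by "$15/year"

print(gb_per_month, gb_per_year, round(implied_cost_per_gb, 2))  # 2.5 30.0 0.5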

ONE spybot will eat up more bandwidth in a single day than all your bandwidth conscious bots would use in a single month.

That's what I mean by "don't put them all in the same bag". Did you ever try cleaning up your hard disk by looking at 50 KB files and wondering if you really need them? I think it's a classic. This is the same principle: if you remove a useless 4 GB thumbnail file from Chrome, you don't have to worry about the thousands of small files you aren't sure about.

Same applies to bandwidth, IMO.

I love my money just as much as the next guy, but let's target the real offenders.

incrediBILL

4:19 pm on May 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's some nice historical reading on what happened right here at WebmasterWorld thanks to bots:
[webmasterworld.com...]

Quote from Brett Tabke himself:

We spend 5-8hrs a week here fighting them. It is the biggest problem we have ever faced.

We have pushed the limits of page delivery, banning, ip based, agent based, to avoid the rogue bots - but it is becoming an increasingly difficult problem to control.
