Forum Moderators: buckworks
What am I doing with this content? Actually, I'm throwing most of it out the window. The only thing I'm keeping is SKUs...and prices.
In the past I've mentioned to other webmasters that I'm a scraper, and I could tell that they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry, and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same, because I spend just as much time trying to block their bots from accessing my prices.
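A daily repricing rule like that can be sketched in a few lines. This is a hypothetical illustration, not the poster's actual system; the function name, the margin floor, and the one-cent shave are all assumptions:

```python
# Hypothetical daily repricing rule: undercut the cheapest scraped
# competitor price by one cent, but never drop below our own margin floor.

def reprice(our_cost, competitor_prices, min_margin=0.15, shave=0.01):
    """Return today's price, optimized against scraped competitor prices."""
    floor = round(our_cost * (1 + min_margin), 2)  # lowest acceptable price
    if not competitor_prices:
        return floor                               # no data: fall back to floor
    target = min(competitor_prices) - shave        # undercut the cheapest rival
    return round(max(target, floor), 2)

print(reprice(10.00, [13.99, 12.49, 14.25]))  # 12.48
print(reprice(10.00, []))                     # 11.5
```

In practice the competitor prices would come from the scrape; the point is only that the daily price is a pure function of cost plus whatever the bots collected overnight.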
So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?
[edited by: MrHard at 8:48 pm (utc) on April 28, 2009]
LIA, I got into this because others were doing it to me and I felt I needed to level the playing field. Now I do it better than they do. However, if I were to learn that this is clearly an unethical thing to be doing, I'd have to stop. I respect many of this board's members, and I was hoping that you might help steer me in the right direction moving forward.
Continue with something that works, or abandon it for ethical reasons? I think it's a legitimate question to ask here.
Here are some questions:
1) Does your scraper bot clearly identify itself in its user agent?
2) Does your scraper bot fully obey robots.txt?
3) Do you have a page on your site that describes what your bot does and describe how webmasters can block your bot?
If yes to all three, then I *might* concede you the dubious title of ethical scraper. :)
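For points 1 and 2, Python's standard library already covers the mechanics. In this sketch the bot name and its info-page URL are invented for illustration:

```python
import urllib.robotparser

# Hypothetical identifiers: the bot name and its info page are made up.
UA = "ExamplePriceBot/1.0 (+https://example.com/bot-info)"

def allowed(robots_txt, url):
    """Would this robots.txt let our clearly identified bot fetch the URL?"""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(UA, url)

robots = """\
User-agent: ExamplePriceBot
Disallow: /prices/
"""
print(allowed(robots, "https://shop.example/prices/widget"))  # False
print(allowed(robots, "https://shop.example/about"))          # True
```

A descriptive user-agent with a URL in it is exactly what satisfies point 3 as well: any webmaster who sees it in their logs knows where to read about the bot and how to block it.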
Yes, yes, and...no.
We pass the second one pretty much just by happenstance, but I have to admit that there is an element of stealth to the whole deal.
I think that third item is a good idea. I'll discuss it with my team. I am hoping to find a solution here that still lets me sleep comfortably.
Being a follower rather than a leader.
Meanwhile others are spending that energy developing new positive business relationships, or on other constructive pursuits which will enable them to sell items even at a higher price or otherwise gain an advantage.
After all, there are many legitimate tools designed to do just that, and the data can be collected simply by visiting the relevant pages in a normal browser.
The distaste for "scrapers" so often seen on WebmasterWorld arises from the republishing of our content that often follows the scrape.
Some of us also block most automated tools where we can on the grounds that they use our bandwidth and give nothing in return, but as we are talking about documents that are publicly available anyway this is a matter of personal choice about how we run our sites.
If, as you say, your bot identifies itself and obeys robots.txt then I see no ethical problem with it.
But I will refuse it access.
...
If it's all take and no give, then what you are doing is unethical. But if there's some chance that the sites you scrape will see a benefit, and there's an easy way to opt out if they wish, then carry on.
I understand why many people don't want bots on their pages: they want human visitors. But in many cases, the guy who scrapes your content (but does not republish it!) wouldn't be a valuable visitor even if he did it manually. Think of the competitor checking your prices: he won't buy from you just because he saw your website.
I've been involved in scraping projects myself, where we collected data for internal use. We didn't identify ourselves, but we made sure we didn't hammer the servers: we didn't crawl indiscriminately, but only requested what we needed, at a very low speed. And when presenting the data to users, we always gave them a source link so they could visit the original site.
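That low-and-slow approach, fetching only what's needed and keeping the source link for attribution, might look roughly like this (the function names and the ten-second delay are assumptions):

```python
import time

def throttled_collect(urls, fetch, delay_s=10.0):
    """Request only the pages we actually need, one at a time, with a long
    pause between requests, and keep the source link alongside each result."""
    results = []
    for url in urls:
        results.append({"source": url, "data": fetch(url)})
        time.sleep(delay_s)  # low speed: don't hammer the server
    return results
```

`fetch` would be whatever HTTP client the project uses; passing it in also keeps the collector trivial to test without touching anyone's server.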
Also, I've had quite a few cases where I was asked to scrape specific pages on a site because users needed information that was updated frequently, but the webmasters were incompetent and had faulty scripts (and didn't want to fix them after being told), so we built a small script that would get the data and fix the rest.
It's like SEO: you can work in different shades of gray. I've always wanted to publish a guide called "How to really make a scraper hate you" and write down some of the anti-scrape methods I've seen so far. A few of them really took some work to understand, but not one of them kept me from getting what I wanted. If a bot is in deep cover and the author cares about it not being discovered, you won't see it, unless it's the only user that visits your page ;)
Actually, I don't even think you'd have to identify yourself openly.
As we are discussing ethics, this is where I would disagree with janharders.
While no rules or laws are being broken, deliberately disguising a bot by using (for example) a standard browser user-agent is fundamentally dishonest.
Ethical dishonesty is a concept for philosophers with time on their hands, not webmasters.
If it's all take and no give, then what you are doing is unethical
Rosalind, when you publish on the web you can have no expectations of anything in return.
You can have hopes, but not expectations.
The analogy might be a free newspaper supported by advertising - it is not unethical to take a copy and use it to line a birdcage or soak up a spillage, ignoring the content completely.
...
question to arieng: if these other sites started blocking your bot in robots.txt, can you honestly say that you wouldn't go against their wishes and bypass it? your entire business would go down the pan otherwise.
that kind of implies that your website is reliant on other people creating the info and handing it over for nothing.
Spending all your time on reducing your profit margins with your competitors and fine tuning the mechanism of your demise.
The old saying, "Lose money on every order, but make it up in volume" doesn't work. You're better off maximizing profits, even if it means lowering your revenue.
While no rules or laws are being broken, deliberately disguising a bot by using (for example) a standard browser user-agent is fundamentally dishonest.
you're absolutely right, it's dishonest. still, I do it. I understand some people don't want bots on their site, but unless you're copying their content, I don't see why it should matter. For a bad comparison: if I want to order some food and they ask the color of my pants and reply "sorry, we don't serve people who wear blue jeans", then next time I'd simply tell them whatever they want to hear, as long as the delivery guy doesn't double-check on delivery.
Even with some services I use, I hate having to log in and use their JavaScript-overloaded sites to get my stats. I set up a cron job that mails me my stats. They block my bot unless I pretend to be a browser. There's no reason at all against having my bot fetch the results for me. Some offer web services, which is nice and gets everyone what they want.
It's a thin line, since site owners would argue that I use bandwidth but don't see their ads and don't click on them. OTOH: who doesn't use some sort of ad blocker?
you're absolutely right, it's dishonest. still, I do it
I do not mean to criticize you, but in the context of this thread that is surely unethical.
site owners would argue that I use bandwidth but don't see their ads
In my case it has nothing to do with advertising, which no site of mine will ever carry.
There's no reason at all against having my bot fetch the results for me
My reason would be that I dislike dishonesty.
It is my site and I am entitled to take that view and block your bot (if I can).
...
[edited by: Samizdata at 2:39 pm (utc) on April 29, 2009]
My reason would be that I dislike dishonesty.
so you'd allow reasonable bot usage if it's identifiable? I'd love it if people would put a reasonable bot policy on their sites. "Don't. Just don't" is not something I consider reasonable. (Again: I'm talking about bots that check / do stuff for me, not content scraping & republishing.)
It is my site and I am entitled to take that view and block your bot (if I can).
which is hard. I've always seen it as a small competition, to be honest.
Also, personally, I'd recommend not insulting or threatening bot authors in HTML comments and the like. They usually ignore it anyway, and I know one or two people who might get angry. And you don't want that.
but my important sites all have terms and conditions which explicitly forbid scraping and robotic retrieval of pages
Do you specifically exempt SE crawlers from that policy in your TOS?
I can programmatically set a price each day that is optimized against several competitors.
I guess in my mind this is simply good business. Republishing content definitely crosses a line (legally), but your practice sounds no different from walking into a Home Depot store and noting their prices (they specifically forbid this practice, btw), or visiting a website every day and copying down prices onto a yellow legal pad.
So to answer the question, no, I wouldn't consider it unethical, and honestly expect my competitors to do this already. In fact, I know they do.
All who consider this kind of scraping unethical, I really don't understand what you are talking about.
How is that different from entering a competitor's brick-and-mortar store, disguising yourself as a customer, and spying on their prices, service, etc.? That technique is endorsed by every business book and marketing guru, and has admittedly been used by many renowned entrepreneurs (e.g. the founder of Sears).
Here's one I do hold true though: the Internet is very, very dodgy, generally.
It is unethical that:
- a website tells you that it has the best rate guaranteed, when so does its competitor. Who's lying?
- a website and search engine indirectly assure you that it is an authority, but then allow anyone to add content in an unmoderated fashion.
- a website allows users to view copyright protected material and permits them to insert that material into other webpages for free
- a website tells you that there is no commission being charged to you, but in fact there is.
- a website tells you that it has found the best offer for you, when in fact it's just served you a page with a "searching for best offer" when the best offer is always the same.
- a website tells you that it takes the privacy concerns of its individuals very seriously, but then when asked by official government bodies to implement best practices, they decline to do so.
- you are aware that a system has its faults, but can still technically justify why it is not at fault (the classic: "the system works fine", when we know it doesn't)
How you group the good or bad of this is quite personal and culturally skewed. I think, to some degree, a business acting as an "entity" and not a "person" is the first problem; this allows people to hide behind their actions, which they would handle differently if they were responsible for them at a personal level.
Let me qualify that and say the phrase "it's in the interests of the business" might mean that John, who has been a friend of the company for over 30 years, needs to be shown the door.
In IT we have little in the way of ethical frameworks that can help IT managers or skilled IT professionals reason their way out of ethical dilemmas, yet it is precisely these frameworks that help us prepare for the unexpected. It's usually convenient for a business to apply guiding principles or policies after something bad has happened, rather than have an ethical code of conduct in place beforehand.
Ethics ain't easy.
Spending all your time on reducing your profit margins with your competitors and fine tuning the mechanism of your demise
Actually, quite the opposite. We're not trying to be the lowest price on the web; we could never live on the margins those sites are willing to work with. We strive to be the most knowledgeable, have the best selection, and offer the fastest shipping. If we can be the best provider in those areas, I sure don't want to lose a sale over price. Our pricing algorithms ensure that we're in line with the other top-tier providers, and if we can shave a point or two and still make decent margins, so much the better.
You could do the same thing by having a student stare at the screen for hours.
This brings up a good point: pricing research is nothing new. In the old days, you had some poor employee hand-keying prices out of a print catalog. However, since we started experimenting with automated price retrieval, the data is exponentially more exhaustive and reliable.
My reason would be that I dislike dishonesty.
Me too. When I first started toying with this idea, I didn't really think of it as a dishonest practice. I saw it as a way to get an edge on competitors that I felt were getting an edge on me. Thanks for all the great feedback on this thread, it's given me a lot of food for thought.
But watch out: what if they started doing the same thing? If you automatically undercut their price by $0.01, and then they automatically undercut yours by $0.01, prices on both sites will start spiralling down toward $0.
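That spiral is easy to simulate. In this hypothetical sketch, two bots undercut each other by a cent per round; without a floor they race to zero, and a minimum-price floor is the only thing that stops them:

```python
# Two repricing bots undercutting each other by $0.01 per round.
# A floor price (e.g. cost plus minimum margin) halts the spiral.

def undercut_war(a, b, floor, rounds=1000):
    for _ in range(rounds):
        a = max(round(b - 0.01, 2), floor)  # A undercuts B, but not below floor
        b = max(round(a - 0.01, 2), floor)  # B does the same to A
    return a, b

print(undercut_war(19.99, 19.95, floor=0.0))    # (0.0, 0.0)   -- the race to zero
print(undercut_war(19.99, 19.95, floor=12.00))  # (12.0, 12.0) -- the floor holds
```

Either way both sellers end up at the same price; the floor only decides whether that price is one they can live with.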
Another point: don't crawl them every hour with an aggressive bot. Just as if you were wandering into Home Depot with a notebook, behave respectfully while you're in their store: don't harass the staff, don't make a scene, get a little info and leave, then come back later wearing a different wig.
I wouldn't call it unethical, but it is sneaky. Some people really hate sneaky because sneaky is dishonest. Others would applaud you for being smart. It really depends on your perspective.
If you were scraping me, you're darn right I'd be doing everything I can to identify and block your sorry little bot. LOL
you're darn right I'd be doing everything I can to identify and block your sorry little bot
Tip: don't block. It alerts the bot owner to the fact that they're dealing with someone at least moderately sophisticated.
If it's for e-commerce, and you're sure the bot is being used for pricing intelligence, then by all means give the bot a price. That price doesn't really have to be the same price that you actually show to consumers. ;-)
Some people really hate sneaky because sneaky is dishonest. Others would applaud you for being smart. It really depends on your perspective.
And of course there's the Catch-22: if you act ethically and allow people to block you easily, you are giving up an advantage to competitors who do something similar but act unethically by making their bots difficult to block.