Forum Moderators: buckworks
What am I doing with this content? Actually, I'm throwing most of it out the window. The only thing I'm keeping is SKUs...and prices.
In the past I've mentioned to other webmasters that I'm a scraper, and I can tell that they were disdainful of the idea. However, I have managed to build the most dynamic, real-time pricing mechanism in my industry and frankly I'm proud of it. I can programmatically set a price each day that is optimized against several competitors. Furthermore, I know my competitors are doing the same because I spend just as much time trying to block their bots from accessing my prices.
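To make "a price optimized against several competitors" concrete, here is a minimal sketch of one possible rule. The rule itself (undercut the cheapest competitor by a cent, never drop below cost plus a minimum margin) is an assumption for illustration, not the actual mechanism described in this post, and the function name `daily_price` and its parameters are hypothetical.

```python
from decimal import Decimal


def daily_price(cost, competitor_prices,
                min_margin=Decimal("0.10"), undercut=Decimal("0.01")):
    """Hypothetical repricing rule for one SKU.

    Undercut the cheapest scraped competitor price by `undercut`,
    but never go below cost plus `min_margin` (the price floor).
    """
    # Lowest price we are willing to charge, rounded to cents.
    floor = (cost * (1 + min_margin)).quantize(Decimal("0.01"))
    if not competitor_prices:
        # No competitor data today: fall back to the floor (an assumption).
        return floor
    target = min(competitor_prices) - undercut
    return max(target, floor)
```

For example, with a cost of 8.00 and competitors at 9.99 and 10.49, this rule would settle on 9.98; if a competitor dropped to 8.50, the floor of 8.80 would win instead.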
So my question for the webmastering community is --> is scraping really unethical 100% of the time? Should I consider myself a black hat even if I'm not republishing the content?
In the computer world where large operations can be done at the click of a button, suddenly it becomes overwhelming for web sites who have to cover *all* the costs.
It's worth noting that Arieng did answer "yes" and "yes" to the questions about identifying his bot and respecting robots.txt. In other words, he's like a secret shopper who wears an "I'm a secret shopper" button and respects the "No price monitoring allowed" sign in a brick-and-mortar merchant's display window.
Speaking of robots.txt, Webmaster World has a thread from 2006 about robots.txt whitelisting that may be of interest--not just because it offers a whitelist script, but also because it talks about the whitelist's implications:
[webmasterworld.com...]
It's worth noting that Arieng did answer "yes" and "yes" to the questions about identifying his bot and respecting robots.txt. In other words, he's like a secret shopper who wears an "I'm a secret shopper" button and respects the "No price monitoring allowed" sign in a brick-and-mortar merchant's display window.
I agree that it's both notable and noteworthy...
HOWEVER, the majority do NOT look at their logs, wouldn't understand most of what they saw if they tried to analyze them, and would not perceive the threat that Arieng's bot posed to their ecommerce store.
I've been preaching whitelisting for years now which would solve the problem, yet the majority of websites have either no robots.txt or still do blacklisting like they did back in 1996.
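For anyone unsure what a whitelist robots.txt looks like compared to a blacklist, here is a rough sketch. The two bots named in it are only an assumption for the example (whitelist whichever crawlers you actually find useful), and the `allowed` helper is hypothetical. It uses Python's standard `urllib.robotparser` to show how a well-behaved bot would read the file.

```python
from urllib.robotparser import RobotFileParser

# Whitelist style: name the bots you allow, deny everyone else.
# An empty "Disallow:" means "nothing is disallowed" for that bot.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=WHITELIST_ROBOTS_TXT):
    """Return True if a robots.txt-honoring bot may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that this only controls bots that honor robots.txt. Anything that ignores the file still has to be dealt with at the server.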
Being above board doesn't help the helpless, but it's still noble at best.
An email per domain asking permission is the best answer IMO.
I got hit by an out of control bot a couple of hours ago - IP checked and blocked immediately.
I could say show me a completely ethical businessman and I'll show you a bankrupt!
As for price comparison scraping, it's rife - all the supermarkets are doing it. Some are posting the results in TV ads, some in the front of their stores.
If the big boys are doing it, and they are, then you have to gear yourself up to do it back to them. If you can't defend yourself or play at the level required then it is time to get out of the game.
If you're worried about the bandwidth costs of bots scraping your site then you've got the wrong hosting deal. (I'm not talking about DDOS here but competitor bots - DDOS is another matter entirely)
Face reality - business is about making a profit within the bounds of legality. If price comparison scraping helps in that and delivers results at a reasonable ROI, then until it is made illegal you might as well do it.
I don't get why you hate bots
I don't hate bots. I dislike dishonesty.
And you're even blocking search engines?
I allow those I deem useful.
You cannot control bots
I can effectively control those that are programmed by honest people and declare themselves.
I treat the rest with the contempt they deserve, and in most cases block them.
Some get through, and I lose no sleep over it.
This thread is about ethics, and dishonesty is inherently unethical.
you're absolutely right, it's dishonest. still, I do it
It is precisely that attitude that makes me want to block your bot.
Would you say your position was ethically sound?
...
At the moment, I'm not involved in any e-commerce projects, though I have been in the past.
I've never used automated price comparison when I was working on e-commerce projects. Not on any kind of a moral ground, but just because it wasn't really necessary for those projects. And, I'm against "competitive price" models anyway. I believe in "supply chain/margin pricing." Set your prices based on what it costs to deliver your product. Fight to deliver the highest value you can for the lowest price point at which you can retain a reasonable margin. Manage your costs, supply chain, etc, like a bear.
I just kind of believe (and I'm not alone here), that if you manage all this stuff well, you're going to compete well, without having to constantly look over your shoulder to see what your competition is doing, and how they're pricing.
So, for perspective: I don't scrape, I don't bot, I don't see the need for it in the projects I've worked on.
Do I think what arieng is doing is ethically wrong?
Not one bit.
He's just using a different business model than I do, and his is based on competitive pricing. It's not an invalid model, it's just different.
As for all the snark about "stealing bandwidth" and "stealing content" with no hope of return for the competition?
Boo-frickin-hoo. He's spending just as much money on bandwidth running his bots as you are having bots scan your site. Bandwidth costs money in both directions.
If he's doing what he says he is, keeping the bot activity to a very low ebb (partly for stealth reasons), and not re-using actual content, just data-mining for prices, then he's just acting like a strong competitor.
In some businesses, competitive pricing is mandatory. Just look at all the airfare and hotel fare resellers out there. In that business, if you aren't checking the competition using automated methods, you're not even in the game.
If he was scraping sites and republishing actual content, then I'd be with the torch wielding mob.
But that's not what he's doing.
He's gathering business intelligence. He's doing it well. And he's doing it in a way that wouldn't be considered unethical in a B&M business.
For those that say "the web is different" - grow up.
Business is business.
If he's doing what he says he is, keeping the bot activity to a very low ebb (partly for stealth reasons), and not re-using actual content, just data-mining for prices, then he's just acting like a strong competitor.
Which is exactly why some of us, like myself, do what we do and block access to our sites from all data centers so he can't do what he does from commercial locations. We also track and monitor access to our sites to identify stealth access from residential or office locations, so unless you just pull 1 page per day, which will never get you far on sites with thousands of pages, we'll know you're there and block your bot.
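A rough sketch of the two checks described above: refusing requests from data-center address ranges, and flagging any address that pulls pages faster than a person plausibly would. This is not the actual setup used on any site in this thread. The network ranges are documentation-only placeholders, and the names `from_datacenter`, `RateMonitor`, and `should_block`, along with the thresholds, are hypothetical.

```python
import ipaddress
from collections import defaultdict, deque

# Placeholder ranges (RFC 5737 documentation networks), NOT real hosting
# providers. A real list would come from published data-center allocations.
DATACENTER_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def from_datacenter(ip):
    """Return True if `ip` falls inside a listed data-center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_NETWORKS)


class RateMonitor:
    """Sliding-window page counter per IP address."""

    def __init__(self, max_hits=30, window_seconds=60.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now):
        """Log a request at time `now` (seconds); True if over the limit."""
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits


def should_block(ip, now, monitor):
    """Block data-center traffic outright; otherwise block on request rate."""
    return from_datacenter(ip) or monitor.record(ip, now)
```

The rate check is what catches stealth access from residential or office addresses: a bot that stays under the threshold gets through, but at that pace it will not get far on a site with thousands of pages.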
Bot blocking, that's being a stronger competitor :)
I don't get why you hate bots
I don't hate bots. I dislike dishonesty.
you block every bot, and probably whitelist some - even if they say who they are. You don't even know what they want to do, not at that point.
It is precisely that attitude that makes me want to block your bot.
Would you say your position was ethically sound?
I'd say it's gray. If I was to build a crawler that just unspecifically crawls the web, I'd agree, it should identify itself and obey any rules.
If I build a specific bot to fetch me some information, it's more of an automated extension of myself.
If it's the "fully automated and stealthy" that makes you hate it, how about this:
I'm sure you'll allow any real user to surf without a unique user-agent. If I were to set up a proxy that filters the content it relays and saves the bits and pieces I want it to, then direct my browser to use it and manually click through your page (with the information being extracted by the bot/proxy), would that be okay or not okay?
The clicking isn't the hard and time consuming part in bot-usage, it's the data extraction. That's what's saving the author the time. The automated requesting is just a bonus.
incrediBILL:
Do you push your prices to any portals like google base? In that case, why would anyone even use your site for pricing intelligence? Is your captcha-test unspecific (as in "whoops, 15 pages viewed, here's your captcha") or targeted only on those who are suspected to be bots (by speed, headers they sent or whatever else one could think of)?
If it's unspecific, is the limit so high that very few real users would ever hit it? If it's a shop, aren't you afraid to scare a customer away that's just browsing?
Thread is getting muddy. Pro bots v Con bots. Blackhat v Whitehat.
Reality is bandwidth which is a fixed cost v what is allowed for the shoplifters (scrapers). Cost me bucks... well I must deal with that best way possible: denied.
I want to thank the OP for originating this thread which reveals both sides of the equation of webmasters v webmasters, not webmasters v search engines. The search engines don't hide (much) but webmasters running bots against competitor webmasters do.
Business plan? Ethics?
Live or die seems to be the mantra. Better me than you. I get it. Have been getting it since 1999. My old robots.txt with many denied has been replaced with a white list of (4) okay bots and all others denied. Those bots/UAs that act like bots which don't honor the admission requirements get nuked.
Can't speak for everyone but I'm pretty sure most of us do the same without broadcasting it. The bot guys don't care, 403s mean nothing as their unleashed critters are brainless and unrelenting. The illumination has been instructive. Excuse me while I reaffirm my whitelist and, perhaps, add a few more to the UA blacklist list. :)
Do you push your prices to any portals like google base?
Nope.
Don't allow search engines to cache either, NOARCHIVE on every page.
Don't allow any internet archive sites to make a copy either.
Anyone trying to scrape me has to hit me directly and good luck with that, let me know how it turns out ;)
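For reference, NOARCHIVE is normally set per page with `<meta name="robots" content="noarchive">`. Search engines that support the `X-Robots-Tag` response header accept the same directive there, which is handy for applying it site-wide. Below is a minimal sketch of the header approach as a WSGI wrapper. It is an assumption that a site would do it this way rather than with the meta tag, and `noarchive_middleware` is a hypothetical name.

```python
def noarchive_middleware(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noarchive."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Replace any existing X-Robots-Tag so the directive is not duplicated.
            headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noarchive"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped
```

Like robots.txt, this is only honored by crawlers that choose to respect it.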
Nope.
Don't allow search engines to cache either, NOARCHIVE on every page.
do you add nosnippet? just figured, a scraper could try to remote-scrape prices (if that's all he's after, wouldn't make sense for whole-content-scrapers) by google serp snippets, with something like
site:example.com "this text stands close to the price"
Boo-frickin-hoo. He's spending just as much money on bandwidth running his bots as you are having bots scan your site. Bandwidth costs money in both directions.
There's a HUGE difference.
Let's bring out another real-life analogy. Let's say that you hire a new employee who is supposed to be doing work for you. It turns out that he is actually a corporate spy for your competitor. Not only do you not get any productive work to help your business for the salary and benefits that you pay him, but your competitor finds out about all your trade secrets. You eventually discover what's going on and terminate him (analogous to blocking a bot), but the damage and costs can't be undone.
I am an ecommerce owner probably in one of the most competitive cut throat areas on the net.
What he is doing is most likely cheaper than a competitor hitting my site viewing 100's of pages or doing multiple searches looking at my prices to compare his to mine.
the difference is that he's transferring all the cost from himself to you. in your example he would have to pay out for the staff to do it, so he's cutting down his bill but doesn't mind about yours.
still, where's the difference on the site-owner's end?
and what about the scenario I made up, where a proxy is extracting the data while a real person is browsing the site? How would the site-owners here feel about that?
People will continue to do it whether people consider it ethical or not. So the argument about having to keep up with your competitors is valid. But it all comes down to whether you will engage in unethical practices or not.
I guess I'm calling the kettle black. I admit that when we decide whether to work with a new hotel and what retail prices we set, we will (manually) look at some of our competitors to see if they even sell the same hotel and what their price points are, then may adjust our prices accordingly.
Although I do not know it for an absolute 100% fact, I am pretty sure they do the same regarding us. Do they have a bot doing it or do they do it manually like we do? I don't know.
This thread has given me some things to think about as I had not really thought about some of the same arguments I have been making to others.
Personally, I feel that my bandwidth usage to check their prices probably about equals their bandwidth usage checking our prices. I can live with that.
If I was in a different market and needed a bot to do the same thing automatically on a daily basis and I knew my usage exceeded my competitors', would I continue? I don't know.
still, where's the difference on the site-owner's end?
Well, if you want it in dollars and cents - competitively, I like anything that costs you money. So if you're paying a real person to browse, at least I know that while you may be spying, you're paying for it too. Irritates me to think that I may be footing the entire bill for your intelligence gathering. LOL
Make your decisions deliberately. This one can easily be defended as an effective business strategy. It is not so easily defended as an example of ethical or moral uprightness.
I can be as ruthless, deceptive, and competitive as the next guy. I'm not afraid to own it either. ;) Don't try to sugarcoat what you do. Otherwise you get into the kind of ethical gray-water whirlpool that spawned this thread, unsure on what you think of your own practices. Know what principles you are willing to compromise and why. But don't pretend that there's no conflict here.
If I can legally get ahead of my competition, I'm going to do it. But I'm going to do it with my eyes wide open, and I'm not going to pout if they decide to play the game too.
The comment someone made before about not irritating bot-writers was just plain kid stuff. Someone who would launch a DDOS attack on a site in retaliation for a robots.txt block... well the last thing I'm going to do is cower in fear. It could happen, but someone who would do that isn't getting one ounce of my respect. Even if I end up as another feather in their black hat...and I take plenty of precautions to make sure I don't.
The comment someone made before about not irritating bot-writers was just plain kid stuff. Someone who would launch a DDOS attack on a site in retaliation for a robots.txt block... well the last thing I'm going to do is cower in fear. It could happen, but someone who would do that isn't getting one ounce of my respect.
I really need to work on my writing skills, sorry.
I wasn't referring to robots.txt (nobody would be insulted by a block, that'd be a challenge), I was referring to insults. Read my post on another page of this thread about the huge guy on steroids you shouldn't insult, even if it's your right to do so.
I totally agree, it's lame and doesn't get respect, but what good is your knowledge that "I don't respect that" when your site is down and you're losing money?
It was meant as a "even if you're mad that someone is wasting your resources, don't make it personal", because not only would you waste your time, they might have a lot of time to waste. Like trolls on forums, don't feed them.
Like trolls on forums, don't feed them.
nobody would be insulted by a block, that'd be a challenge
Makes me wonder - if you are going to try to work around any problems you run into with a competitor's robots.txt file - why bother looking at it in the first place? To help you sleep better? ;)
Are 'secret shoppers' illegal/unethical? They use store resources - when they open the door they let the cold air in, the salesperson might talk to them, they could trip and fall and then the store has a lawsuit on their hands, etc. Yet my impression is that many stores use this practice on each other. What's the difference between that and 'data mining'?
Bill
How can I be respectful of people that have cost me many thousands (tens? hundreds?) of dollars using my own content against me?
So you admit you are lumping bad scraping scenarios in with the good ones.
What about plagiarism bots that scrape sites so that teachers can check papers against online posts? Is this wrong?
I have written my own "RSS" scraper for a couple of sites I visit often that don't have an RSS feed. I don't republish the feed, it is for my own use. Is this wrong? Keep in mind I block ads anyway (yes, I also mute my TV for commercials, but don't tell the networks ;)). Also, since my methods don't request the images from a page, I actually save the site owners bandwidth (maybe even up to $.20 a year at the current cost of bandwidth).
Really, some of you sound like huge babies,
"You might cost me $1 every 5 years in extra bandwidth."
If this is your reason for not wanting someone to automate price checking on your website then you really have too much time on your hands.... I wish my problems were this small.
[edited by: Demaestro at 10:06 pm (utc) on April 30, 2009]
What about plagiarism bots that scrape sites so that teachers can check papers against online posts? Is this wrong?
If they honor robots.txt, NO, if they don't, YES
"You might cost me $1 every 5 years in extra bandwidth."
Bandwidth isn't the only issue. People end up paying for bigger and faster servers to accommodate the load, upgrading from the 10mb pipe to the 100mb pipe, and so on and so forth.
We're not just talking about a single bot, I get hit by nearly a thousand a day, on some sites I've seen the bot traffic exceed the human traffic.
It's out of control, we're not going to take it anymore.
Call us babies if you want, but some of us have child-proofed our sites to keep the kiddies from playing.
There is no entitlement for anyone to get whatever they want just because it's online, and as more and more sites rebel against the entitlement mentality and the SEs filter out the scraped sites, the pendulum will swing the other direction.
[edited by: incrediBILL at 10:49 pm (utc) on April 30, 2009]
Let's bring out another real-life analogy. Let's say that you hire a new employee who is supposed to be doing work for you. It turns out that he is actually a corporate spy for your competitor.
Here in KY that scenario would constitute theft by deception, which is a felony. Price checking used to be a misdemeanor, but only if a store had a no price checking sign at the door. I think that statute has been either repealed or struck down by the courts though. And even when it was enforced, which was rarely, all it criminalized was physically copying down more than a certain number of prices per store visit.
Re the topic, what arieng is doing doesn't fit my definition of scraping or black hat. And as long as his bot isn't violating robots.txt, even if just by happenstance, it's ethical.
In the California definitions the phrase that caught my eye:
Knowingly and without permission uses or causes to be used computer services
Therefore one could assume running in stealth mode and ignoring robots.txt could be easily seen as "Knowingly and without permission" because you knew you weren't being allowed and did whatever it took to gain access.
Even if the act itself causes relatively little harm, there is still a potential for penalty according to the laws I read.
So just keep flouting it, and someone somewhere who catches you scraping, feels particularly harmed, and decides it's time to test the legal waters will probably see if they can play this out in court.
Would you say your position was ethically sound?
janharders answered:
I'd say it's gray
I'm afraid that counts as a "no" in the context of the question.
You WOULD NOT say that your position is ethically sound (because you cannot).
Legally sound, certainly, but not ethically.
If it's the "fully automated and stealthy" that makes you hate it
I don't hate bots. I dislike dishonesty. See above.
you block every bot, and probably whitelist some
I don't block every bot. I do whitelist some.
You don't even know what they want to do, not at that point.
I ask them what they want to do.
If they tell me honestly then I consider their request.
If they lie then I tell them "no" without further consideration.
You appear to think that unreasonable.
I disagree.
--
@ arieng
I think the consensus here is that what you were doing was entirely honest, ethical and polite.
I certainly have no problem with it, and I don't hate your bot.
I just dislike dishonesty.
...
Automating a method of gathering content from a website IS NOT BAD if:
- You respect robots.txt
- You don't try to "cover up" your activities
- You make efforts to reduce the bandwidth toll on the site whose content you are retrieving
- If you are blocked, you respect it.
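As a rough illustration of what a bot following the four points above might look like, here is a minimal sketch using only Python's standard library. The bot name, the info URL, the ten-second delay, and the helper names `may_fetch` and `polite_fetch` are all hypothetical. The caller is assumed to have already downloaded the site's robots.txt text.

```python
import time
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a real bot would name itself and link to a page
# explaining who runs it and why (no "covering up").
USER_AGENT = "ExamplePriceBot/1.0 (+https://example.com/bot-info)"


def may_fetch(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the site's robots.txt permits this bot to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch(url, robots_txt, delay_seconds=10.0):
    """Fetch one page if allowed; return its bytes, or None when refused."""
    if not may_fetch(robots_txt, url):
        return None  # respect robots.txt
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return None  # blocked: respect it, no retry and no disguise
        raise
    time.sleep(delay_seconds)  # throttle to keep the load on the site low
    return body
```

Fetching only the pages that carry prices, and skipping images and other assets, keeps the bandwidth toll down further.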
Automating a method of gathering content from a website IS BAD if:
- You are republishing content
- You are not respecting robots.txt
- You make attempts to cover your activities
- You are going against the TOS of the website
- Your method is a bandwidth "hog"
- If you are blocked, and you don't respect it and try to circumvent.
Does this about cover it? Would anyone like to make an addendum to this list?
I think if we can come up with some good guidelines we can reduce the amount of anger people hold when they hear the word "scraping"
@ arieng.... I think you fall into the "IS NOT BAD" list. I think most on here would agree.
[edited by: Demaestro at 8:59 pm (utc) on May 1, 2009]