
Moderators: phranque & physics

Webmaster General Forum

    
Thoughts on scraping?
stef25




msg:4146223
 12:04 pm on Jun 3, 2010 (gmt 0)

I'd like to hear people's thoughts on scraping websites.

In my country there are a few very large sites that are the "go to" sites for buying and renting real estate. They have been dominating the market for the past decade and they all have awful interfaces (cluttered, 25 mandatory fields in the search form, etc.).

I've been playing around with scraping and parsing those sites' results and storing them in my own DB. The idea is to publish this project as a sort of real estate "search engine" from where you can get results from all the big sites on one page.

Assuming I don't list any contact information and that users HAVE to follow my prominently displayed link back to the property detail page on the original site to reach the owner or agency that is selling / renting out the property, I think it's fair to say I'm just casting a wider net or "opening an additional shop front" for the original sites.

Would you as an owner of such a site object to my scraping and if so on what grounds?
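
For readers wondering what "scraping, parsing and storing in my own DB" might look like, here is a minimal sketch in Python using only the standard library. The markup, the example.com URLs and the table layout are all assumptions made up for illustration, not taken from any real site. In line with the idea above, it stores only a title, a price and the link back to the original detail page, and no contact information.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical results-page markup; real sites will differ.
SAMPLE_HTML = """
<div class="listing"><a href="https://example.com/property/101">2-bed flat, city centre</a><span class="price">950</span></div>
<div class="listing"><a href="https://example.com/property/102">Studio near station</a><span class="price">600</span></div>
"""

class ListingParser(HTMLParser):
    """Collects title, price and detail URL from the assumed markup above."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._current = None  # listing currently being built
        self._field = None    # text field we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "listing":
            self._current = {"title": "", "price": "", "url": ""}
        elif self._current is not None and tag == "a":
            self._current["url"] = attrs.get("href", "")
            self._field = "title"
        elif self._current is not None and tag == "span" and attrs.get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "div" and self._current is not None:
            self._current["price"] = int(self._current["price"])
            self.listings.append(self._current)
            self._current = None

def store(listings, conn):
    """Upsert listings keyed on the original detail-page URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (url TEXT PRIMARY KEY, title TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO listings (url, title, price) VALUES (:url, :title, :price)",
        listings,
    )
    conn.commit()

parser = ListingParser()
parser.feed(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
store(parser.listings, conn)
rows = conn.execute("SELECT title, price, url FROM listings ORDER BY price").fetchall()
```

Keying the table on the source URL means a re-scrape updates an existing row instead of duplicating it, and every search result can always link back to the original site.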

 

londrum




msg:4146236
 12:27 pm on Jun 3, 2010 (gmt 0)

why don't you ask the sites themselves if they object? there's not much point getting the say-so from us without getting it from them.

stef25




msg:4146237
 12:34 pm on Jun 3, 2010 (gmt 0)

Agreed, I could do that. I'm just trying to get a feel for the opinion of fellow techies before deciding whether or not it's worth sinking a lot of time into.

Also, I'm pretty sure they will say "no" because they don't understand it. I think they might well even fail to notice they are being scraped.

londrum




msg:4146238
 12:37 pm on Jun 3, 2010 (gmt 0)

if you think they're going to say no there's no point going on, in my opinion. because they might block you. and if there's only a "few very large sites" like you say, getting blocked by even one will make your site little more than junk.

stef25




msg:4146239
 12:43 pm on Jun 3, 2010 (gmt 0)

Ok, imagine it comes down to the point where lawyers are involved. What would be the difference between my site and Google or a niche search engine?

topr8




msg:4146260
 1:23 pm on Jun 3, 2010 (gmt 0)

uninformed opinion:

scraping is not illegal. however it might be against specific sites' terms and conditions to scrape them.

any decent site will already be doing what it can to block scrapers and rogue bots anyway so you may find you are blocked in any case, certainly by one or some of them.

farmboy




msg:4146295
 2:30 pm on Jun 3, 2010 (gmt 0)

How are those large sites obtaining the content they have on their sites? Are they advertising (paying money) to get people to provide the content?


FarmBoy

stef25




msg:4146307
 2:41 pm on Jun 3, 2010 (gmt 0)

Joe Sixpack can post the property he is selling or renting out for free; agencies pay money to put their listings up. There is also a lot of (obtrusive) advertising on the site. I'd imagine those agencies pay to put their listings up on all major real estate sites.

They advertise their site via AdWords and other channels but AFAIK nobody is paid to put up content; quite the contrary.

lammert




msg:4146311
 2:53 pm on Jun 3, 2010 (gmt 0)

"I think they might well even fail to notice they are being scraped."

I don't know about the size of these sites you want to scrape, but one, or a small group of IP addresses downloading a significant number of pages won't go unnoticed, unless these site owners don't analyze the visits to their sites at all.

One simple extra rule in their server config file or firewall and you are out of business, unless you plan to go really blackhat and distribute your requests over a network of anonymous IP addresses.
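
To give an idea of how little analysis it takes to spot this, here is a rough Python sketch that counts requests per client IP in an access log and flags the heavy hitters. The log lines, the IP addresses (taken from the documentation ranges) and the threshold are invented for illustration; it only assumes the client IP is the first space-separated field, as in the common log format.

```python
from collections import Counter

# Invented sample lines; the first field is assumed to be the client IP.
SAMPLE_LOG = [
    '203.0.113.7 - - [03/Jun/2010:14:50:01 +0000] "GET /search?page=1 HTTP/1.1" 200 5120',
    '203.0.113.7 - - [03/Jun/2010:14:50:02 +0000] "GET /search?page=2 HTTP/1.1" 200 5098',
    '198.51.100.23 - - [03/Jun/2010:14:50:02 +0000] "GET / HTTP/1.1" 200 2048',
    '203.0.113.7 - - [03/Jun/2010:14:50:03 +0000] "GET /search?page=3 HTTP/1.1" 200 5177',
]

def heavy_clients(log_lines, threshold):
    """Return {ip: request_count} for clients with at least `threshold` requests."""
    counts = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in counts.items() if n >= threshold}

suspects = heavy_clients(SAMPLE_LOG, threshold=3)
```

Any address that shows up in `suspects` is a candidate for the one-line block rule mentioned above.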

stef25




msg:4146322
 3:14 pm on Jun 3, 2010 (gmt 0)

lammert,

Having worked at an agency that did some dev work for them, I have a feeling they won't notice it very quickly.

But you are right, even if they notice after a year all my work would go down the drain. Unless they appreciate the additional traffic I'd ideally be sending them.

Atm I'm just a dev looking for a fun non-client project, and perhaps to earn a little pocket money along the way. I myself live in a flat found through one of these sites and know half a dozen people in the same situation. I honestly believe this project would provide the public with added value without taking business away from the original sites (on the contrary).

(Isn't that "what they always say", build something YOU see a need for ?)

lammert




msg:4146341
 3:39 pm on Jun 3, 2010 (gmt 0)

Real estate is a niche where a lot of money is involved and you might step on some sensitive toes with a scraper in that niche. I know of one country where a scraper (or search portal, as they like to call themselves) scrapes some large real estate sites on a regular basis. They have had to defend themselves in court a few times over the last few years, but apparently they are still on-line.

rocknbil




msg:4146429
 5:52 pm on Jun 3, 2010 (gmt 0)

"buying and renting real estate."


In the U.S., many of these feeds are controlled and protected, and cost money to acquire, or you have to be part of an organization (MLS, for example.) I think you'll meet resistance. However,

"a few very large sites that are the 'go to' sites"


You need to make your site the "go to" site, using the principles you've laid out: simpler, easier to use, has the elements you find lacking in the go-to sites. Some ideas that might help you do that:

- Find out where the feeds come from. When you pick yourself up off the floor after hearing the costs, make the investment and "do it right." If you're sure of your idea, it will be worth it.

- Make arrangements with these existing sites. Maybe you can offer a fraction of the costs THEY are investing as compensation for access to their feeds. This will actually make your job easier: instead of scraping pages and dealing with all the complications of that (there are many), you may get access to the actual feed via automated FTP or wget on the command line, or even direct access to their database (MySQL was designed for this). The real advantage here is you have the raw data BEFORE their site has munged it up, maybe even fields their sites aren't using that will make yours stand out.

- Add exclusive access to members for their own promo pages on your site, and ability to add their own listings, which can start a following of real estate agencies. Once your site becomes the "go to" site, you won't need the other sites any more. Neither will the realtors.

- In the U.S., there are certain things a registered realtor/real estate agent has access to that most people don't. Team up with a Realtor, develop the site as a business venture. You can split both costs and profits. With a realtor as part owner, you can legally access a lot of stuff you as a consumer cannot.

I have more . . . but that's where I'd start.
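
As a rough illustration of why a raw feed is easier to work with than scraped pages, here is a small Python sketch that loads a feed and searches on a field directly. The CSV layout and every column name (including `energy_rating`) are hypothetical; an actual feed could be XML, a database dump or something else entirely.

```python
import csv
import io

# Hypothetical feed export; column names are invented for illustration.
SAMPLE_FEED = """\
id,title,price,bedrooms,energy_rating
101,"2-bed flat, city centre",950,2,B
102,Studio near station,600,0,D
"""

def load_feed(text):
    """Parse the feed into dicts, converting numeric columns to int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["price"] = int(row["price"])
        row["bedrooms"] = int(row["bedrooms"])
        records.append(row)
    return records

records = load_feed(SAMPLE_FEED)

# Filter on a field that a scraped results page might not expose at all.
efficient = [r["id"] for r in records if r["energy_rating"] in ("A", "B")]
```

There is no HTML to untangle here, and the extra column becomes a search option with one line of code.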

damon_cool




msg:4148370
 8:26 pm on Jun 7, 2010 (gmt 0)

Simple, what does their robots.txt say?
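
Checking that is easy to automate. Below is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the example.com URLs and the `example-aggregator-bot` user agent name are made up for illustration. A real crawler would load the site's actual file instead (for example with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search/
Allow: /
"""

rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

def allowed(url, agent="example-aggregator-bot"):
    """True if the parsed robots.txt rules permit `agent` to fetch `url`."""
    return rp.can_fetch(agent, url)

detail_ok = allowed("https://example.com/property/101")
search_ok = allowed("https://example.com/search/results")
```

With these sample rules the detail page is permitted and the search results path is not, which is exactly the part a results scraper would need.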

nomis5




msg:4148384
 8:39 pm on Jun 7, 2010 (gmt 0)

Are you joking?

Scrape the content from my sites and you won't ever earn another penny from AdSense - guaranteed. It's my content you are scraping and not yours.

Am I missing something? Scraping content is the lowest of the low, if you do it you deserve all the worst of luck.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved