Anyone Try To Crawl Superpages.com?

tennis fan28

4:31 am on Sep 21, 2007 (gmt 0)

I am looking to start building a local search site for my state, and I would like to fill in the business information from a large, trusted source like Superpages.com or Yellowpages.com. I have searched around and have seen threads where people say they have done it, but they post very little information on how. There are any number of open-source spiders out there, as well as programs like eGrabber and Content Grabber. Does anyone have experience doing this that they would like to share? I don't really have a timeline for finishing this; it's more of a hobby, which is why I don't want to shell out big bucks to a large data provider.

Thanks!

vincevincevince

4:34 am on Sep 21, 2007 (gmt 0)

It's fairly straightforward to crawl most sites of that kind, so long as the admin isn't doing something to stop you (page rate limits, bot detection, etc.).

On the other hand, as a collection of data, Superpages or Yellowpages does have a copyright claim to it, and you aren't free to just create a derivative work.
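To make the "page rate limits" point concrete, here is a minimal Python sketch of the two checks a well-behaved spider would usually make before requesting a page: whether robots.txt allows the URL, and whether enough time has passed since the last request. It assumes the robots.txt text has already been downloaded; the names `build_robot_rules`, `Throttle`, `allowed_urls` and the "hobby-bot" user agent are made up for illustration. Note that passing robots.txt says nothing about the site's terms of use or the copyright question raised above.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds):
        self.min_delay_seconds = min_delay_seconds
        self._last_request = None

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            remaining = self.min_delay_seconds - elapsed
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept


if __name__ == "__main__":
    # Hypothetical robots.txt content, not taken from any real site.
    rules = build_robot_rules("User-agent: *\nDisallow: /search\nCrawl-delay: 5\n")
    candidates = ["https://example.com/search?q=pizza", "https://example.com/about"]
    print(allowed_urls(rules, "hobby-bot", candidates))
    print(rules.crawl_delay("hobby-bot"))
```

A spider would call `Throttle.wait()` before every request, using the site's Crawl-delay value when one is published.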

inbound

12:33 pm on Sep 21, 2007 (gmt 0)

You don't have to shell out big bucks for data from just one state. $1000 will get you data that's in a nice format to work with. With data at that price it's going to be cheaper to buy than 'harvest' and you'll stay on the right side of the law.

tennis fan28

6:51 pm on Sep 21, 2007 (gmt 0)

Thanks inbound, I know of the site you speak of and, I'll be honest, I thought about buying that information. When they launched their product, they were careful to say that they didn't harvest from Yahoo Local but instead went to sites where spidering didn't violate the TOS. I am looking at this whole site as a project, which includes spidering sites (I've never done that before and would like to try something new) and seeing what I can learn.

I have seen other data brokers say that they spider a large number of sites. I can't imagine that they would use sites like Chamber of Commerce sites, but maybe they do.

vincevincevince

3:08 am on Sep 22, 2007 (gmt 0)

The key comes down to what is copyrightable and what isn't.

As I've seen it explained, individual items of information (name, address, phone number) aren't copyrightable as there's no creative step, but when you compile them into a directory or similar then you've created something new which you can claim copyright on.

If you are going to take a large section of someone's copyrighted directory and modify it to make your own directory then you will be 'creating a derivative work', something you aren't entitled to do without the permission of the original copyright holder.

Not banning spidering isn't granting you permission to create a derivative work, nor is the absence of a copyright symbol evidence that someone "doesn't claim copyright". Only a specific statement granting you that permission is good enough and copyright is an automatic right produced when you create anything, whether or not you attach a symbol or register it.

Be aware that most major directories have 'deliberate mistakes', fictional entries or incorrectly spelt addresses which allow them to easily demonstrate when someone has copied their content. The only way to find these is to manually check each and every entry you copy.

tennis fan28

4:46 am on Sep 22, 2007 (gmt 0)

vincevincevince,
Very interesting. I'm learning all the time...lol. Just a quick question, and sorry if I am being a pain. I am looking at the website of a large data broker, one of the top ones, and on the page where they describe how they get their information they say "We scrutinize and catalog 5,200 phone books", among other documents. How would my spidering the Yellow Pages or Superpages site be any different from going out and manually inputting the information myself? Granted, they can prove I spidered the site, but really, is there a difference?

Another site, the one that inbound suggested I pay $1000 to for the data, describes its data gathering like this: "I just wanted to address spidering - we are talking about lightweight spidering on sites that do allow." This site has around 14,000,000 listings. How can anyone possibly lightly spider so many sites? There aren't that many large business directories. Sorry if I sound a bit stupid when I say this, but one would have to assume that a) they did spider the large sites in some way, shape or form, along with other smaller ones, b) they bought the information from someone, deduped it and are reselling it, or c) they spidered and deduped an incredible number of sites, which would have to take a lot of time and resources. I am sure there is a missing option d) that I am just not seeing, mainly because I am so new to this. I don't fully understand just how in the heck some sites do it, and I really would like to learn how.
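For anyone wondering what "deduped" involves in options b) and c), here is a minimal Python sketch. It assumes each listing is a dict with "name" and "phone" fields and treats two listings as the same business when the normalized name and phone number match. The field names and the matching rule are assumptions for illustration; real merging across sources is usually fuzzier than this.

```python
import re


def normalize_phone(phone):
    """Keep digits only and drop a leading US/Canada country code."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def normalize_name(name):
    """Lowercase, strip punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def dedupe(listings):
    """Return listings with duplicates removed, keeping the first one seen."""
    seen = set()
    unique = []
    for listing in listings:
        key = (normalize_name(listing["name"]), normalize_phone(listing["phone"]))
        if key not in seen:
            seen.add(key)
            unique.append(listing)
    return unique


if __name__ == "__main__":
    # Made-up sample records from two hypothetical sources.
    sample = [
        {"name": "Joe's Pizza", "phone": "(555) 010-0199"},
        {"name": "JOES  PIZZA", "phone": "1-555-010-0199"},
        {"name": "Elm Street Dental", "phone": "555-010-0142"},
    ]
    for listing in dedupe(sample):
        print(listing["name"], listing["phone"])
```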

AhmedF

9:37 pm on Sep 24, 2007 (gmt 0)

I guess I should clarify [as I assume it is our product that is being discussed at $1000].

We buy data from various sources. You will be surprised by how many of the companies you deal with are more than willing to sell your data [and by companies, I am including your municipal/state/federal government]. For example, we buy data from credit companies. We buy data from the state and federal government. Delivery companies are more than happy to sell your data, and we also get data from large organizations (think restaurants, dentists, etc - anything with an association a 'professional' will want to be a part of). There are also CMRs and other middlemen whose job is to get their clients' information out there (from small-time mom and pop shops to Fortune 100 companies).

The lightweight spidering refers to augmenting the data when we can. It is nowhere near our primary source.

I should add - the hard part is not the data collection. The hard part is making sense of it.

A second addition - data brokers of all kinds (local data, mapping, weather, etc) all poison-pill their data.

[edited by: AhmedF at 9:46 pm (utc) on Sep. 24, 2007]

tennis fan28

3:20 am on Sep 25, 2007 (gmt 0)

Ahmed, of course that was your site I mentioned as it is well respected. As for your reply, I never seriously considered purchasing the information from the various organizations/government agencies since there are just so many and the cost would exceed buying it from a data broker.

I would like to thank you for answering a question that has been bugging me for I don't know how long. While I have you on the line, and since it has been asked in other threads: did you consider buying the information from one of those mailing list/business lead providers, which obviously cost far less (and, I assume, are lower in quality)? If you did buy one, what was your experience with it?

vincevincevince

3:38 am on Sep 25, 2007 (gmt 0)

AhmedF, is yours the site which famously allows the people to "download the data for free" for "one fixed price"? Would be good to have that explained.

AhmedF

3:19 pm on Sep 25, 2007 (gmt 0)

vincevincevince - yes that is us.

tennis - the basic story is simple - we launched in Toronto over 18 months ago. As we mulled over launching in other cities [Canada and US], we didn't like some of the terms [eg revshare]. Coming from a tech background, and with a lot of experience in city databases etc, we decided it would be a better move for us to build our own database.

We had already worked in the whitepages industry (again - the amount of data you can purchase will *stun* Average Joe), so it was easier for us to buy the data than it would be for someone starting new.

So - yes we did use one of the data brokers, and we ended up deciding it was a better move for us (background-considered) to do our own. Plus none were keen on the entire opening up/wiki style system :)

Silvery

8:18 pm on Sep 26, 2007 (gmt 0)

It is copyright infringement to take Superpages listing data and display it elsewhere.

It would be an infringement against Superpages as well as against a number of their partners who provide them with data.

Further, it's a violation of their terms of use to make multiple, automated queries in that fashion, since that can impact performance for real users.

Superpages does have a partnership API and an affiliate program API that would allow you to redisplay some of their content on your website, if you qualify.

SEOPTI

3:16 pm on Oct 5, 2007 (gmt 0)

What do you mean by poison-pill their data?

AhmedF

3:49 pm on Oct 5, 2007 (gmt 0)

Poison pill - purposely putting in incorrect information that you seeded - if anyone else has it, they must have gotten it (directly or indirectly) from you.

Eg mapping (NAVTEQ/Mapquest/Tele-atlas) - they may put a little dead end road somewhere that doesn't really exist. If they find that dead-end road appearing anywhere else - someone must have ripped them off.

Eg business data - you can slightly modify an address (223 Elm Street instead of the correct 222 Elm Street), or slightly modify the name, or even just put in an incorrect listing.

This happens for all commercial-grade 'data', whatever it may be. Data sellers usually do customized poison-pills for every customer, making it damn easy to trace back.
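As a rough illustration of per-customer poison pills, here is a Python sketch under stated assumptions: listings are dicts with "name" and "address" fields, the alteration is the house-number bump from the Elm Street example, and the seller keeps a registry of which altered entry went to which customer. The function names and the hash-based selection are hypothetical and are not how any particular broker does it.

```python
import hashlib


def pick_seed_indices(customer_id, record_count, how_many):
    """Deterministically choose which records to alter for this customer."""
    ranked = sorted(
        range(record_count),
        key=lambda i: hashlib.sha256(f"{customer_id}:{i}".encode()).hexdigest(),
    )
    return sorted(ranked[:how_many])


def alter_address(address):
    """Bump the leading house number by one, e.g. 222 Elm Street -> 223 Elm Street."""
    number, _, rest = address.partition(" ")
    if number.isdigit():
        return f"{int(number) + 1} {rest}"
    return address


def export_for_customer(customer_id, listings, how_many, registry):
    """Copy the listings, seed a few altered addresses, and record them."""
    exported = [dict(listing) for listing in listings]
    for i in pick_seed_indices(customer_id, len(listings), how_many):
        original = exported[i]["address"]
        altered = alter_address(original)
        if altered != original:
            exported[i]["address"] = altered
            # Two customers could receive the same altered entry, so keep a set.
            registry.setdefault((exported[i]["name"], altered), set()).add(customer_id)
    return exported


def trace_leak(name, address, registry):
    """Return the customers whose export contained this exact seeded entry."""
    return registry.get((name, address), set())
```

If a seeded name and address pair later shows up on an unrelated site, `trace_leak` narrows down which customer's copy it came from. A real scheme would presumably vary the alteration itself per customer so that the set it returns contains one entry.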

SEOPTI

10:35 pm on Oct 9, 2007 (gmt 0)

Thanks for the info, AhmedF. I did not know this.

I think it will be really hard for Superpages and local data brokers to find their poison pills, because local is fragmented: 99.9% of local webmasters enjoy a nice -950 penalty, and 99% of local sites suffer from the supplemental index syndrome due to the number of URLs required to get traffic.

[edited by: SEOPTI at 10:36 pm (utc) on Oct. 9, 2007]

AhmedF

4:14 pm on Oct 10, 2007 (gmt 0)

Well - all they do is search Google or Yahoo or MSN for the invalid address/name/whatever, and any site that comes up is using their data.

Cross reference with their customer URLs and voila - you have a list of sites using your data when they shouldn't be :)
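The cross-referencing step can be sketched in a few lines of Python. This assumes the URLs returned by searching for a seeded address have already been collected by some other means, and that customers are identified by the domain of their site; `domain_of` and `unlicensed_sites` are illustrative names only.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def unlicensed_sites(hit_urls, customer_urls):
    """Domains that show the seeded entry but are not on the customer list."""
    licensed = {domain_of(url) for url in customer_urls}
    hits = {domain_of(url) for url in hit_urls}
    return sorted(domain for domain in hits - licensed if domain)


if __name__ == "__main__":
    # Hypothetical search hits for one seeded address.
    hits = [
        "https://www.licensed.example/biz/1",
        "http://copycat.example/listing?id=9",
        "https://copycat.example/other",
    ]
    customers = ["https://licensed.example/"]
    print(unlicensed_sites(hits, customers))
```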