Cost to Crawl

Estimate from a PhD thesis

         

old_expat

4:28 am on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"An estimation for the cost of an entire crawl of the World Wide Web is about US $1.5 Million [CCHM04],
considering just the network bandwidth necessary to download the pages .."

Lord Majestic

11:43 am on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's near zero if you use the right approach: distributed crawling that takes advantage of idle broadband connections that are already paid for via a fixed fee.
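For readers wondering what splitting a crawl across many home connections might look like, here is a minimal sketch. It assumes a hash-based scheme in which every host is assigned to one volunteer machine; the function names and the scheme itself are illustrative assumptions, not a description of any particular project mentioned in this thread.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

def assign_worker(url: str, num_workers: int) -> int:
    """Pick a worker index for a URL by hashing its host name (hypothetical scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then map it to a worker.
    return int.from_bytes(digest[:8], "big") % num_workers

def partition(urls, num_workers):
    """Group URLs by the worker that would fetch them."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[assign_worker(url, num_workers)].append(url)
    return dict(buckets)

if __name__ == "__main__":
    sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    for worker, batch in sorted(partition(sample, 4).items()):
        print(worker, batch)
```

Hashing on the host rather than the full URL keeps all pages of one site on the same worker, which would make it easier to throttle requests per site.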

old_expat

11:57 am on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello LM, can you expand on this a bit? I'm not very technical.

inbound

1:37 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Distributed crawling may not be as cheap as you think.

Costs include (but are not restricted to):

bandwidth involved in getting the information to a central database
servers that the DB sits on
electricity that powers the DB servers
somewhere for the servers to 'live' safely
backup costs
staffing/programming: you can't shove the whole web into a flat-file database; you need people with really good experience of dealing with more data than you can imagine
also:
cost of making people aware of the project
cost of supporting people who opt in

$1.5 million to crawl the whole web (or at least 50 billion pages of it)? If anyone can do it for that (in a reasonable time frame) I know a company who would probably bite your hand off.
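To put those two figures side by side, here is a minimal sketch of the per-page cost they imply. The $1.5 million and 50 billion numbers are taken from this thread; treating them as describing the same crawl is an assumption, since the thesis estimate covers bandwidth only and may refer to a different page count.

```python
# Figures quoted in the thread; pairing them is an assumption for illustration.
CRAWL_COST_USD = 1_500_000       # bandwidth-only estimate quoted from the thesis
PAGES = 50_000_000_000           # "at least 50 billion pages", per the post above

def cost_per_page(total_cost_usd: float, pages: int) -> float:
    """Average cost in US dollars of fetching one page."""
    return total_cost_usd / pages

per_page = cost_per_page(CRAWL_COST_USD, PAGES)
per_million = per_page * 1_000_000

print(f"cost per page: ${per_page:.6f}")
print(f"cost per million pages: ${per_million:.2f}")
```

Under those assumptions the budget works out to roughly $30 per million pages, with nothing left over for the storage, staffing and support costs listed above.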

old_expat

5:48 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The guy (thesis) was talking about bandwidth only, I believe.

runarb

6:52 pm on Apr 18, 2006 (gmt 0)

10+ Year Member



The thesis is located here: [dcc.uchile.cl...]

Lord Majestic

7:26 pm on Apr 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



old_expat: check Wikipedia for "distributed crawling" :)

If anyone can do it for that (in a reasonable time frame) I know a company who would probably bite your hand off.

We crawl 50 million pages per day and it scales linearly, i.e. if 10 times as many people join we will get to 500 million per day; that's about 3 months to crawl 50 billion pages.
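The arithmetic behind that estimate can be checked with a short sketch. It assumes, as the post does, that throughput grows strictly linearly with the number of participants; the function and parameter names are made up for illustration.

```python
def days_to_crawl(total_pages: int, pages_per_day: int, scale_factor: int = 1) -> int:
    """Whole days needed to fetch total_pages, assuming linear scaling."""
    effective_rate = pages_per_day * scale_factor
    # Integer ceiling division, so a partial final day counts as a full day.
    return -(-total_pages // effective_rate)

TOTAL_PAGES = 50_000_000_000     # 50 billion pages, from the thread
PAGES_PER_DAY = 50_000_000       # 50 million pages per day, from the thread

at_current_rate = days_to_crawl(TOTAL_PAGES, PAGES_PER_DAY)
at_ten_times = days_to_crawl(TOTAL_PAGES, PAGES_PER_DAY, scale_factor=10)

print(at_current_rate)  # 1000
print(at_ten_times)     # 100
```

At the current rate the crawl would take 1000 days; with ten times the participants it drops to 100 days, which matches the "3 months" figure.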

Of course the challenges that you mentioned are all real, but our work has pretty much proven that it can be done without investing millions. It certainly takes time and effort, but it's doable; it is a much harder task to actually have good ranking that is competitive with Top tier search engines.

I can't say more because of "self-promotion" rules here but those who seek will find, just like those who dare win ;)

old_expat

2:39 am on Apr 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



runarb - thanks for adding that link. I didn't do it because I'm never sure of the TOS re: linking here on WW.

inbound

5:33 pm on May 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



much harder task to actually have good ranking that is competitive with Top tier search engines.

Indeed: a little look at your project confirms the difficulty of ranking pages well.

What you've achieved is impressive; I applaud people who have the vision and determination to do things on the scale you are attempting.

However, the real value of any search engine is in the ability to remove spam/poor pages and return the most relevant ones for searches. I'm sure you are learning a great deal about how tricky that must be (I can't profess to know much about that).

It looks as though I've got another interesting site to visit on a regular basis.

Best of luck with it.

Lord Majestic

10:13 pm on May 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, inbound. It is true that it's the ranking that's the most difficult thing, rather than getting the data in the first place. I am positive, however, that it's not impossible to achieve with fairly limited resources, and being able to distribute the workload allows us to actually narrow the gap. It's certainly very hard, but nobody said it would be easy, and for me victory is sweeter if it was hard fought for :)

JamesR3

9:54 pm on May 28, 2006 (gmt 0)

10+ Year Member



Lord Majestic, care to PM me or elaborate on what you are doing? Might be some opportunities for collaboration.