
How do SE spider programs work?

... and how many pages do they spider per day?


Joker

9:21 pm on Mar 29, 2002 (gmt 0)

10+ Year Member



I'm curious: how do SE spiders work? Are they Perl (or similar) scripts running on a server?

And my supplementary question: how many pages does one server spider in a day? (Or your best guess.)

Hope this is the right place to post this.

PsychoTekk

6:58 am on Mar 30, 2002 (gmt 0)

10+ Year Member



I have seen some spider scripts written in Perl, but I'm sure there are others, too.

william_dw

11:37 pm on Mar 30, 2002 (gmt 0)

10+ Year Member



Hi there,
High-end spiders are typically written in C++; large-scale engines use multiple servers to speed up crawling.
The initial Google project apparently used Python for downloading and C++ for indexing.

If you wanted to index 50 million pages per crawl, with a typical crawl lasting a month, then you'd need to index
1,666,666 pages a day
69,444 pages an hour
1,157 pages a minute
or 19 pages a second.
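The arithmetic above can be sketched as a quick back-of-the-envelope calculation (the figures assume a 50-million-page crawl over a 30-day month, as in the post):

```python
# Back-of-the-envelope crawl-rate arithmetic for a 50-million-page,
# 30-day crawl. Integer division gives the rounded-down rates.
pages = 50_000_000
days = 30

per_day = pages // days        # 1,666,666 pages a day
per_hour = per_day // 24       # 69,444 pages an hour
per_minute = per_hour // 60    # 1,157 pages a minute
per_second = per_minute // 60  # ~19 pages a second

print(per_day, per_hour, per_minute, per_second)
```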

My personal best with a single computer running at 800 MHz is indexing around 5 pages a second, if I remember correctly.
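To make the "fetch and index" loop concrete, here is a minimal sketch of what a spider script does: fetch a page, extract its links, and queue any unseen URLs. This is a hypothetical illustration, not any real engine's code; a production crawler would add politeness delays, robots.txt checks, and many parallel fetchers to reach the rates discussed above.

```python
# Minimal breadth-first web spider sketch (hypothetical example).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Fetch pages breadth-first, enqueueing new links as they appear."""
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

The same structure works in Perl, C++, or any language with an HTTP client and an HTML parser; the choice mostly affects how many pages a second one box can sustain.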

HTH,
Dw

Joker

10:35 am on Mar 31, 2002 (gmt 0)

10+ Year Member



Thanks for that, William; just the sort of info I was after.

Now to see if the rest of my plan will work.