web Crawler database design

Forum Moderators: open

Message Too Old, No Replies

web Crawler database design

several millions of records

amiz

10:11 am on Dec 26, 2006 (gmt 0)

Crawler database design

Small question about database design concerning a table that will hold several millions of records
Containing URL information.

Let's say that I have a table with about 1000+ root websites
And the crawler is starting to fetch links from the root website and building huge Url_links table
And from time to time I have to get top 1000 urls from this table that are UN crawled URL's grouped by website_ID and ordered by insert date.

When this table is starting to grow (4M records and more) the IO is starting to be very loaded and it is slowing down the process dramatically

Any tweaks to the design of the table that can improve this process
We already have indexes on the website ID and date but it is still very slow…
I was thinking to create a buffer table and separate the UN crawled urls from the crawled ones
But maybe you have more creative thoughts

jatar_k

4:32 pm on Jan 5, 2007 (gmt 0)

>> separate the UN crawled urls from the crawled ones

that would be a good start

you could split them into multiple tables for uncrawled if necessary

you could split all crawled data to another machine, which would help keep the total size of the db lower

Once the data is indexed, not to be crawled again until a future date, you should throw it onto another machine that could then send it back to the main crawler machine once it needs to be recrawled.

you probably need to distribute load as well as tweak the crawling db

stajer

5:58 pm on Jan 5, 2007 (gmt 0)

You definitely need an index on crawled vs. uncrawled and probably a clustered index on date and crawled/uncrawled.

I think a 4m row table that is like this:

id (int)
url (varchar)
websiteid (int)
insertdate (datetime)
craweled (int)

Shouldn't be a problem. YOu may need to add memory so your server can deal with that many rows, but a sql engine shouldn't have too much difficulty dealing with a table of that structure and load.