Forum Moderators: open
Small question about database design concerning a table that will hold several millions of records
Containing URL information.
Let's say that I have a table with about 1000+ root websites
And the crawler is starting to fetch links from the root website and building huge Url_links table
And from time to time I have to get top 1000 urls from this table that are UN crawled URL's grouped by website_ID and ordered by insert date.
When this table is starting to grow (4M records and more) the IO is starting to be very loaded and it is slowing down the process dramatically
Any tweaks to the design of the table that can improve this process
We already have indexes on the website ID and date but it is still very slow…
I was thinking to create a buffer table and separate the UN crawled urls from the crawled ones
But maybe you have more creative thoughts
that would be a good start
you could split them into multiple tables for uncrawled if necessary
you could split all crawled data to another machine, which would help keep the total size of the db lower
Once the data is indexed, not to be crawled again until a future date, you should throw it onto another machine that could then send it back to the main crawler machine once it needs to be recrawled.
you probably need to distribute load as well as tweak the crawling db
I think a 4m row table that is like this:
id (int)
url (varchar)
websiteid (int)
insertdate (datetime)
craweled (int)
Shouldn't be a problem. YOu may need to add memory so your server can deal with that many rows, but a sql engine shouldn't have too much difficulty dealing with a table of that structure and load.