Huge data handling questions

Forum Moderators: open

Message Too Old, No Replies

Huge data handling questions

How to handle 30M rows daily

Moosetick

10:19 pm on May 16, 2008 (gmt 0)

I have about 30 million new rows of data generated daily that I would like to store in a database. Currently it is sitting in a flat file and difficult to mine useful data from. That file is anout 1GB zipped. Each row would have about 10 columns with 10-250 bytes each. I would like to keep 9-12 months of historical data. If you multiply it out, that could total 10+ billion rows of data.

I'm familiar with MySQL, but am willing to use Oracle or whatever is necessary. It may be unrealistic, but I'd like it to be responsive enough so that I can have a web based front end that can do dynamic queries on the data.

I've never had to deal with this much data. Can anyone provide suggestions for the best way to handle/organize the data?

plumsauce

12:30 am on May 17, 2008 (gmt 0)

30 million rows is about 350 records/second.

Of course, that is based on being busy 24 hours.

If you want to actually have a period where you can do analysis, then you might want a bulk dataload that takes less than say, 12 hours. Now, MS SQL server can bulk load from text files at much greater rates than 350/second.

The advantage of MS SQL is that, at least the enterprise editions, allow you to mix and match between operational and data warehousing models.

A years worth of rows is about 11 billion rows. Big, but not huge. You just need to plan on the disk space in advance. Upsizing disk space for db's is not fun. Let's say 1GB stores 1 million rows with indexes. That becomes 11,000GB, or 11TB. With fudge factor, 20TB.

If you want fast access, I think you are going to be using prebuilt data cubes.

mysql can't do much of the above. MS SQL can. At a cost. And no, the Express Edition won't cut it here. You will need whichever edition supports the sizes you want, and the OLAP capabilities.

Your budget is not going to be for the faint of heart.

For one thing, you won't be doing this on a bunch of sata drives. For another thing, the license is not going to be cheap. Finally, sql of any flavour loves cpu and memory. As much as you can afford.

Another alternative is to use an ISAM system. This will take a *lot* of bare metal programming. The limits of the ISAM I use is 16TB per db, but multiple db's are supported. In the right circumstances, it can be an order of magnitude faster.

Now, if you don't need the raw data, summarisation will make the whole thing much easier and faster. But, first you have to load the raw data. You might want to keep 90 days of raw data and summarise the rest.

For web analytics, the raw data is seldom useful after some period of time. The trending numbers are more important. And relationships are important.

Now, if you are doing stock quote analysis, text files can actually be blindingly fast. You just put one security in each file and append the daily numbers.