Flatfiles, or Database, or File system - Databases forum at WebmasterWorld - WebmasterWorld

Forum Moderators: open

Message Too Old, No Replies

Flatfiles, or Database, or File system

mySQL vs. my.txt vs. files - large number of very simple records

claus

12:06 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Here's a question for the DB savvy and those of you who likes to test stuff ;)

What I want to do:

What I want to do is to take "Input A" (a unique string) and return the corresponding "Output A" (another string - not always unique). That's all. I will not need any more than that. A long list of pairs - potentially very long.

Now, this little darling will just accumulate pairs of strings. Over some years, it could get a seven or eight digit number of "pairs". And if it all explodes it might get more, but that's unlikely. I will probably want to support a 10- or 11 digit number of pairs, just to be on the safe side. Getting that many will be very unlikely, but not impossible - it will take several years, though.

How should I do it:

I see three possible solutions, please advice:

(A: Database) A mySQL table with three colums (unique id third, or perhaps an index on the lookup strings would be better?) Also: A relational database isn't really needed as there are no relations, just one simple lookup. So, if you know better DB's for this, please tell

(B: Flatfiles) A collection of flatfiles. With a little effort you could make some kind of system in the lookup strings, so that you could make flatfiles of some arbitrary lenght, say 10,000 lines or something. Each line would be "input value - separator - output value"

(C: File system) "Atomic" use of flatfiles: Make each "Input string" a file name and write the output string to the content of the file. So, when a request comes in, just read the corresponding file.

What do you think: A, B, or C?

- Speed?
- Reliability?
- Cost / processing?
- Backup / Maintenance?

I appreciate all kinds of fedback :)

Frank_Rizzo

12:29 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

What's the total byte size of files? and what's the total number of files?

If just one or two flat files, and the total file size is only a few Mb then go for flat file. Do this especially if the database is mostly read only. You could then just load it into an array, or variable and very little disk access will be used.

If lots of files or dozens of Mb then certainly go for mysql.

I've always gone with the line the keep it simple view. If possible stick with flat files as these are quick, easy to setup and easy to maintain. Only go proper db if the files get large or you need indexing, joins etc., or if you are regularily updating / appending records.

graeme_p

12:34 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

MySQL is definitely overkill, even something like SQLite would be overkill for this.

volatilegx

2:34 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'd use a combination of B and C.

arran

3:37 pm on Dec 15, 2005 (gmt 0)

10+ Year Member

Based on the potential number of records (10,000,000+), why mess around (f)grepping thousands of 10,000 line files? I suppose you could create pseudo-indexes based on filepaths but why re-invent the wheel? In terms of solution simplicity, I would go with the database (one table, one index).

arran.

claus

8:47 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Wow, that was quick - I didn't expect to get so many answers already. Thanks a lot :)

I/O: Each "pair" will have one write and many reads. It will not need changing once it's written (well, one could, but I find it easier to just make a new entry in stead). It could possibly be a thousand times more reads than writes.

In memory: It's a web service, so you can't just read it all into memory like that. Or perhaps you can, but I'm not smart enough to do that ;) I can put it in a variable, but if I have, say 20 simultaneous sessions?

Pseudo Index: My thoughts exactly - as an example you could have one folder for (files with) strings starting with "a", another for strings starting with "b", and so on.

Megabytes: I haven't calculated that as
(1) It's not big strings (the total length of each pair will not exceed, say 265 chars), and
(2) I'm not sure how to - how many characters per megabyte?

Thousands of files: Well, the files could be 100,000 lines just as well as 10,000 - I'm not religious about that. It might even take some years to get to 100,000 but at that time the next 100K will happen a lot quicker.

Q: What filesize would be appropriate?

If I do files, I'll probably read+write them with Perl. It will be processed on a standard Apache shared hosting setup. I have no idea how much RAM or how many other clients on the box as I can't influence that but there's plenty of storage. It's not the sort of thing you would put on a dedicated box.

Overkill: Is a database really overkill? I suspected it might be, as it's not really a "base" just one simple table. But, arran, you're right about the simplicity, that's what appeals to me about this. Is a billion records in one table a piece of cake for mySQL?

Any more thoughts?

stajer

10:47 pm on Dec 15, 2005 (gmt 0)

10+ Year Member

Do you need to record the pairs at all? If they are being generated from a formula couldn't you just use a script to take the input and generate the output?

If not, I would recommend going through the minimal hassle of a db solution. You do it once and you don't have to worry about scalibility for a long time.

claus

10:54 pm on Dec 15, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Well, yes I need to record it. It goes like this:

1) One user inputs a $string.
2) That $string is assigned a $key that is returned to the user.
--> Key+string is what I want to store, nothing else
3) The user dissappears in the cloud
4) Any period of time elapses
5) Another user (perhaps the same, doesn't matter) inputs the $key
6) The system returns the original $string from (1) to this user
7) Steps (4) to (6) repeats.

Oh, $key is not derived from $string, and should not be so. It can't exactly just be a number either. But neither of those issues are my problem - I'm only trying to find the best method for storage+retrieval :)

runner

12:04 am on Dec 16, 2005 (gmt 0)

10+ Year Member

Just out of curiosity, can you describe the application?

claus

12:05 am on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I just described it above. That's really all there is to it :)

claus

1:05 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Just bumping this thread, as I'd really like to know:

(1) How many bytes does one standard ASCII character take up? (how many characters for one megabyte)?

(2) Say a table with one million lines - would that be a very large table for a database system (eg. mySQL) or would it be a piece of cake?

(3) Does anybody know where this stuff starts to break down, giving bad performance, or perhaps regular breakdowns: Hundred thousand records, one million records, one billion records, more?

arran

2:00 pm on Dec 16, 2005 (gmt 0)

10+ Year Member

(1) How many bytes does one standard ASCII character take up? (how many characters for one megabyte)?

7 bits per character. Let's just say 1 byte. So 1Mb = 1 million characters.

(2) Say a table with one million lines - would that be a very large table for a database system (eg. mySQL) or would it be a piece of cake?

1 million rows of the data you describe would be easily handled. The problems occur when your row size increases and you start doing operations involving full table scans. If the only two operations you perform are simple inserts and

select * from blah where id = $id

with an index on the

id

column, you should be ok.

(3) Does anybody know where this stuff starts to break down, giving bad performance, or perhaps regular breakdowns: Hundred thousand records, one million records, one billion records, more?

In theory, MySQL should scale up fairly well because of the simplicity of your data model. Most problems in MySQL stem from concurrency issues when you have a high proportion of write operations (UPDATE/INSERT/UPDATE). Doing some rough calculations based on your potential row sizes, I would guess things might start creaking between 10,000,000 and 100,000,000 rows.

arran.

claus

3:41 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thanks arran, I appreciate it a lot. I just found this MySQL document [dev.mysql.com] that states that I can have at least two Gigabyte in a table.

Also, I did some tests, generated a file with lines of 66 characters. The results (on Win XP) are:

1,000,000 lines: 66.027.520 byte (62,9 Mb)
5,000,000 lines: 330.006.528 byte (314 Mb)

I tried to make 10M lines, but WinXP went belly-up for me around 8,5M (I had a few other programs running).

So, with two Gb I can have around 30M lines of that size, it seems. As the number of characters will exceed 66, a worst-case limit estimate of around 5-6M records seems about right.

And, with the ext3 filesystem (I'll check if the host runs that of course) I will have 4TB, so I should be okay.

All of the above regarding MySQL only, as I'm not sure the host will offer packages of Gigabyte disk space, much less Terabyte. So, regardless of which path I choose I will have to scale the ambitions down ;)

iamlost

1:00 am on Dec 17, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

If you haven't considered the Berkeley DB it might be the ideal single-key look-up solution: basically works as a hash saved to a file, each record value treated as a string and located via its key. Reading of random entries is extremely fast partly through elimination of SQL query processing.

Note: it is a 'C' library that runs as part of your application (the API plays well with perl), max-record-size of 4GB, max-table-size of 256TB, multiple processes of multiple threads per application, AES, etc.

I'm not sure the host will offer packages of Gigabyte disk space, much less Terabyte

Hard drives are getting bigger and cheaper. n-TB storage has not been a host 'problem' for clients in the past couple of years although it does usually require 'negotiation' :-) Perhaps buying/leasing dedicated n-GB drives as required is a cost effective scalable solution?

The size 'limit' is often reached by the OS long before the DB maxes (Win32 is especially limited) but most (all?) current *nix kernels seem adequate. Write a hosting requirement spec list and shop it around - it may be unusual but certainly is not unique.

claus

11:03 pm on Dec 20, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

iamlost, you are not. Thanks a lot.

Right now something else have come up, so it'll be something that I'll put together sometime next year - I'm in no hurry as it's not really urgent, just a fun little thing to do :)