Forum Moderators: open
What I want to do:
What I want to do is to take "Input A" (a unique string) and return the corresponding "Output A" (another string - not always unique). That's all. I will not need any more than that. A long list of pairs - potentially very long.
Now, this little darling will just accumulate pairs of strings. Over some years, it could get a seven or eight digit number of "pairs". And if it all explodes it might get more, but that's unlikely. I will probably want to support a 10- or 11 digit number of pairs, just to be on the safe side. Getting that many will be very unlikely, but not impossible - it will take several years, though.
How should I do it:
I see three possible solutions, please advice:
(A: Database) A mySQL table with three colums (unique id third, or perhaps an index on the lookup strings would be better?) Also: A relational database isn't really needed as there are no relations, just one simple lookup. So, if you know better DB's for this, please tell
(B: Flatfiles) A collection of flatfiles. With a little effort you could make some kind of system in the lookup strings, so that you could make flatfiles of some arbitrary lenght, say 10,000 lines or something. Each line would be "input value - separator - output value"
(C: File system) "Atomic" use of flatfiles: Make each "Input string" a file name and write the output string to the content of the file. So, when a request comes in, just read the corresponding file.
What do you think: A, B, or C?
- Speed?
- Reliability?
- Cost / processing?
- Backup / Maintenance?
I appreciate all kinds of fedback :)
If just one or two flat files, and the total file size is only a few Mb then go for flat file. Do this especially if the database is mostly read only. You could then just load it into an array, or variable and very little disk access will be used.
If lots of files or dozens of Mb then certainly go for mysql.
I've always gone with the line the keep it simple view. If possible stick with flat files as these are quick, easy to setup and easy to maintain. Only go proper db if the files get large or you need indexing, joins etc., or if you are regularily updating / appending records.
I/O: Each "pair" will have one write and many reads. It will not need changing once it's written (well, one could, but I find it easier to just make a new entry in stead). It could possibly be a thousand times more reads than writes.
In memory: It's a web service, so you can't just read it all into memory like that. Or perhaps you can, but I'm not smart enough to do that ;) I can put it in a variable, but if I have, say 20 simultaneous sessions?
Pseudo Index: My thoughts exactly - as an example you could have one folder for (files with) strings starting with "a", another for strings starting with "b", and so on.
Megabytes: I haven't calculated that as
(1) It's not big strings (the total length of each pair will not exceed, say 265 chars), and
(2) I'm not sure how to - how many characters per megabyte?
Thousands of files: Well, the files could be 100,000 lines just as well as 10,000 - I'm not religious about that. It might even take some years to get to 100,000 but at that time the next 100K will happen a lot quicker.
Q: What filesize would be appropriate?
If I do files, I'll probably read+write them with Perl. It will be processed on a standard Apache shared hosting setup. I have no idea how much RAM or how many other clients on the box as I can't influence that but there's plenty of storage. It's not the sort of thing you would put on a dedicated box.
Overkill: Is a database really overkill? I suspected it might be, as it's not really a "base" just one simple table. But, arran, you're right about the simplicity, that's what appeals to me about this. Is a billion records in one table a piece of cake for mySQL?
Any more thoughts?
If not, I would recommend going through the minimal hassle of a db solution. You do it once and you don't have to worry about scalibility for a long time.
1) One user inputs a $string.
2) That $string is assigned a $key that is returned to the user.
--> Key+string is what I want to store, nothing else
3) The user dissappears in the cloud
4) Any period of time elapses
5) Another user (perhaps the same, doesn't matter) inputs the $key
6) The system returns the original $string from (1) to this user
7) Steps (4) to (6) repeats.
Oh, $key is not derived from $string, and should not be so. It can't exactly just be a number either. But neither of those issues are my problem - I'm only trying to find the best method for storage+retrieval :)
(1) How many bytes does one standard ASCII character take up? (how many characters for one megabyte)?
(2) Say a table with one million lines - would that be a very large table for a database system (eg. mySQL) or would it be a piece of cake?
(3) Does anybody know where this stuff starts to break down, giving bad performance, or perhaps regular breakdowns: Hundred thousand records, one million records, one billion records, more?
(1) How many bytes does one standard ASCII character take up? (how many characters for one megabyte)?
7 bits per character. Let's just say 1 byte. So 1Mb = 1 million characters.
(2) Say a table with one million lines - would that be a very large table for a database system (eg. mySQL) or would it be a piece of cake?
1 million rows of the data you describe would be easily handled. The problems occur when your row size increases and you start doing operations involving full table scans. If the only two operations you perform are simple inserts and
select * from blah where id = $id with an index on the id column, you should be ok. (3) Does anybody know where this stuff starts to break down, giving bad performance, or perhaps regular breakdowns: Hundred thousand records, one million records, one billion records, more?
In theory, MySQL should scale up fairly well because of the simplicity of your data model. Most problems in MySQL stem from concurrency issues when you have a high proportion of write operations (UPDATE/INSERT/UPDATE). Doing some rough calculations based on your potential row sizes, I would guess things might start creaking between 10,000,000 and 100,000,000 rows.
arran.
Also, I did some tests, generated a file with lines of 66 characters. The results (on Win XP) are:
1,000,000 lines: 66.027.520 byte (62,9 Mb)
5,000,000 lines: 330.006.528 byte (314 Mb)
I tried to make 10M lines, but WinXP went belly-up for me around 8,5M (I had a few other programs running).
So, with two Gb I can have around 30M lines of that size, it seems. As the number of characters will exceed 66, a worst-case limit estimate of around 5-6M records seems about right.
And, with the ext3 filesystem (I'll check if the host runs that of course) I will have 4TB, so I should be okay.
All of the above regarding MySQL only, as I'm not sure the host will offer packages of Gigabyte disk space, much less Terabyte. So, regardless of which path I choose I will have to scale the ambitions down ;)
Note: it is a 'C' library that runs as part of your application (the API plays well with perl), max-record-size of 4GB, max-table-size of 256TB, multiple processes of multiple threads per application, AES, etc.
I'm not sure the host will offer packages of Gigabyte disk space, much less Terabyte
The size 'limit' is often reached by the OS long before the DB maxes (Win32 is especially limited) but most (all?) current *nix kernels seem adequate. Write a hosting requirement spec list and shop it around - it may be unusual but certainly is not unique.