flat text vs. dbms vs mySQL vs whatever

Forum Moderators: coopster & phranque

Message Too Old, No Replies

flat text vs. dbms vs mySQL vs whatever

littleman

11:28 pm on Apr 12, 2001 (gmt 0)

Seems everyone has been raging about mySQL,like it is a sexy fantasy date. That's cool and all, but it seems overkill for a lot of web stuff. If I'm
selling 1,000 products I think I would opt for a perl and a flat text file. Of course there are web applications that would be better suited for DBMs or mySQL, where that line is I have no clue.

I've been searching all over the world for a good study on when it makes sense to go with what and I can't find ANYTHING substantial. If you all know of any studies, please post, or talk about your own experience.

BoneHeadicus

11:34 pm on Apr 12, 2001 (gmt 0)

Timely post littleman.

Do you happen to know the differences in access times between those db's.

mivox

11:51 pm on Apr 12, 2001 (gmt 0)

I've heard SQL is much faster for larger databases...

My employer's store runs on a flat file DB, and I'm thinking of transitioning it to a multi-database configuration, because retrieval times are getting slow at around 300 items. Of course, each item has 18 database fields associated with it, one of which is a somewhat long text description.

The shopping cart developer says a single flatfile is perfectly fast enough up to 2MB in size, but after that, segmenting your inventory into multiple DBs or switching to SQL is preferrable. That's just for that one program though... and that program *can* use multiple flat file databases.

What works best for in any given situation is whatever works best with the script in question, I'd think.

evinrude

11:55 pm on Apr 12, 2001 (gmt 0)

Hmmm, don't have a study for you (though there are plenty of comparison reviews of various db's out there...)

As someone who tries to use a database for data storage and retreaval as often as possible, my question would be why use a flat file? 1,000 products, to me, would scream database. :)

Are the products tied to a vender? Is there any relational data that might be better suited being broken out into multiple tables (and thus result in a faster data return when querying a single table?) What do you do when you need to edit a particular product (like, it's price changes?) I can't claim any bonus in speed or performance, simply because there are too many variables (processor, memory, query size, etc...) But I feel a database, when properly designed, improves the structure and integrety of the data.

Can you tell I don't like flat files? ;) I wouldn't argue one way or the other for mySQL, I've heard my share of horror stories. Here we use MS-SQL or Oracle, but only 'cause we can afford to do so, and they have tons of other "features". Actually, I've heard more praise for PostgreSQL then mySQL.

mivox

12:15 am on Apr 13, 2001 (gmt 0)

I like flat files. So there. :)

I can open a flat file DB in a plain text editor, do a global find and replace, and check the results, without having to operate through a "database program" front end. I can actually open up a plain text document, and see my data sitting there, unadorned with fancy user interface wingdings. It's not pretty, but it's there.

I like being able to see the code for my web pages and my cgi scripts, and I like being able to see the data in my database, without having to go through a specialized program to touch it.

rcjordan

12:25 am on Apr 13, 2001 (gmt 0)

>I can't claim any bonus in speed or performance, simply because there are too many variables (processor, memory, query size, etc...)

This is really it. I'm running a 3meg flatfile with 26 fields per record and pouring the query results through template files. It's OK -but it's on a Sun box.

I've been in 2 other forums re applications that offer both flatfile and mySQL versions. There have been several threads which seemed to indicate that the mySQL versions did not end up providing the expected improvements in speed for those that upgraded. I believe Brett tried it on the WmW forum files and went back to flatfiles and perl, but I'm not sure if speed was the only issue.

>edit a particular product

For me, as long as I have Ultraedit, updating flatfiles is often faster than other data entry methods and I frequently go straight into the raw files. On the other hand, I wouldn't want to have to train someone how to do that.

evinrude

12:43 am on Apr 13, 2001 (gmt 0)

Hehe, this is probably something that's been debated time and again, but I'll stick my nose in a little further. ;)

I guess what baffles me (and that ain't hard...really) is that nearly all of the flat file databases I've encountered contained whopping loads of redundant data. The more redundancy in the data, the more places there are for an error to occure.

Take the example of a product database. The flat file database entry for every product contains 1)Product Name 2)Price 3)Vender Name 4)Vender Address 5)Vender phone number.

Only five columns for the sake of space. :) Every time a new product is entered, those fields are filled in. Get two products from the same vender, you have the vender entered twice. Once for each item. What happens if you don't get the phone number right in one, or more, entries?

In a non-flatfile relational database, you'd have two seperate tables. One for products and one for venders tied together, most likely, with a vendor ID. Less redundancy, less chance of errors, quicker access time to tables with fewer columns. Your data is more normalized at this point. You don't have to enter a vendor for every item (unless you have a new vendor), simply a vendor ID. Data entry is quicker, data retrieval is quicker.

Woz

5:23 am on Apr 13, 2001 (gmt 0)

Well, Ive never used a flat file system, in fact I seem to spend a goodly proportion of my time when doing sites for other people pulling together their information froma various unrelated sources (read - scraps of paper) into one cohesive database.

However I do agree with evinrude about data redundancy and the lower possibility of errors from relational databases. I use either File Maker Pro or Access and like the speed and the ability to set up pre-saved queries in situ which (theoretically) run faster than a complicated SQL query. I do believe thought that you cannot so this with MySQL, not sure about PostgreSQL.

I also like the ability to set up queries to automatically modify data when necessary within the database prior to upload.

Also, you can view data in tables and do search/replace just as easily as working with a flat file so no real benifit there.

But then, to each his own...

Onya
Woz

sugarkane

11:42 am on Apr 13, 2001 (gmt 0)

What works best for in any given situation...

I agree with that wholeheartedly. But...

I use both flat files and MySql extensively, but am slowly moving over to using MySql for most projects. Apart from evinrude's good points about redundancy, I find it a lot easier to work with SQL queries - I can 'drop in' a new feature far quicker when using SQL by just modifying the query rather than having to write a new subroutine to manipulate the flatfile in a new way.

Also, as an SQL query is the same no matter which language you're using to write the script, it makes life easier when I'm using a mixture of Perl and PHP to access the same data.

It is horses for courses though - whatever works works.

<aside>
I mention MySql not because I think it's the best DB out there, but purely because it's what I happen to use ;)
</aside>

littleman

6:50 pm on Apr 13, 2001 (gmt 0)

Thanks for all your input. I am still pondering all this! So what about DBMs, like SDBM? I came across some old writings talking about how to build a relational database using DBMs. That is kind of interesting to me because SDBM must have a very low overhead. I find it odd that there is little talk about using smallish sdbm systems for cgi applications. It seems everyone jumps from flat files to a full on SQL engine. Perhaps it is just too much work to get the DBMs to do what you like?

And then there is the hybrid systems like DBD::CVS. It is a perl module written in C that allows SQL on flat files. It would seem to me that this may be a practical alternative for some. For smaller dbs it would have to be faster than connecting to a remote MySQL server, yet again I see little buzz on it.

Brett_Tabke

3:20 am on Jun 6, 2001 (gmt 0)

There are some good dbm's out there little. I used PerlFect for a search engine over on sew for awhile. It used a sdbm db and was extremely fast on large amounts of data. I've stayed away from them, mainly because flatfiles offer you edit ability in a text editor - whereas dbms are just hashes dumped to disk (partly binary and uneditable).

Xoc

2:56 pm on Jun 6, 2001 (gmt 0)

I use XML for most things that you would use a database for. Has lots of advantages. There are still things that are appropriate for a database: when you have lots of data, although a SAX parser on XML may work just as well. By lots, I mean gigabytes. But the main thing for databases is when you need multiple users to write. None of the XML parsers that I know of support locking except at a file level.

Everyman

4:24 pm on Jun 6, 2001 (gmt 0)

Speed, speed, speed is the name of the game if you ever plan to allow data queries by spiders. Here's the big question: Will you end up with database queries in URL links, either hard-coded on your page or returned by some other link?

If so, then will you allow spiders into your cgi-bin?

Can interpreters such as Java or Perl, interfaced with mySQL, handle the speed you need? Google, for example, will hit you as often as four times a second, in my experience.

As other spiders venture into the cgi-bin "deep" web, you should be aware of speed issues (which also imply memory overhead and CPU overhead, since an increasing load decreases your speed).

I say a custom ordered flat file structure, optimized for redundant data with pointers and look-up files as needed, all accessed with compiled C on a fast Linux box that rarely sees a load of greater than 0.30.

Speed, speed, speed.

And then if you allow data queries by "good" spiders, you will also need coding to detect "bad" spiders so that you can lock them out within a minute or so. Every Gen-Y joystick yahoo with his cap on backwards seems to be using a personal spider. Some of these launch requests at a rate of 18 per second, although it's more common to see about 3 per second. You aren't going to be able to detect this with Perl because your server will begin to resemble a useless boat anchor after about 30 seconds of this.

In fact, these personal spiders are the very spiders who don't bother reading robots.txt, so if your site has lots of interesting pages that are heavily interlinked, you need speed even if you DON'T plan to allow spiders!

Most commercial sites don't have this problem, because no one wants to spider an entire inventory of women's shoes, for example. But some noncommercial sites with thousands of interesting pages are having a real problem with those personal spiders. It's much worse than it was just one or two years ago.

Speed, speed, speed.

littleman

5:35 pm on Jun 6, 2001 (gmt 0)

There is also the overhead factor to consider. Speed in a vacuum may not reflect the daily reality of a server. I have one box that has had more than 120k page requests in one day. The box is modest, 128 memory and 500 celeron. It is running entirely with perl and delimited flat files and 100% of the pages on this box are dynamically made and cloaked. I've tested it against beefier serves running variations of SQL databases. My page variation is very high, as high or higher than what the SQL servers were spitting out, but in every case the simple perl/text file system was faster. I am sure there would be a point where MySQL would take the lead as an application grows in size and scope, where that is, I have no idea.

MySQL and DBI is such an easy combination to use, I could really see how people could get hooked on the SQL quarrying language. It really could streamline development and make for creative quarrying.

Brett, I know sdbm could be very fast in a light weight application. I've done some (not much) experimenting. Also, any box that has perl has sdbm.

Everyman, no doubt a well written application in C would be faster, but it is so much more work to write. I lack the skill/discipline to write good C. I guess that is what is cool about a lot of the modules coming out of cpan these days -- they have a lot of optimized C code embedded into them.

The O'Reilly "Programming the Perl DBI" actually has an excellent examination of the benefits and the drawbacks of each approaches in the first few chapters. It really is a good read.

Brett_Tabke

10:24 pm on Jun 6, 2001 (gmt 0)

>I use XML for most things that you would use a
>database for. Has lots of advantages.

Name just one over flatfiles? (aside from potential browser compat)

Xoc

2:47 am on Jun 7, 2001 (gmt 0)

You can reuse the data in ways that you didn't originally plan for, including totally reorganizing it into a new format for the new purpose using an XSLT style sheet. For example, I have a page of hyperlinks constructed from an XML file that I described in this post: [webmasterworld.com ], full description also on my web site. By applying a slightly different XSLT style sheet to it, I turned it into a Channel Definition File, which is the file format that you can use to load the IE Favorites. So by doing one click on one of my web pages, I load every one of my links into the person's favorites. [This is actually for my own benefit, as I am travelling frequently and am using other people's machines...nice to load my favorites quickly.]

I didn't originally plan to produce CDF files, but all I had to do was construct a short XSLT style sheet and I added a huge new feature.

Since XML is a text file, uploading it and maintaining it is trivially easy. I can move the content to the web server the same as any web page. No having to merge the data into an existing database or deal with binary files that might get corrupted.

XML has all the advantanges of a flat file, but also you have the benefits of someone else writing the parser for you, and the file format isn't flat (instead it is hierarchical), which allows a much richer expression of data. And through XSLT, if you don't like the way the data is formatted, you can easily change it into something else, even on a temporary basis.

To give one more example, I have a list of info on the most common spiders that I keep in XML on my web site. If you wanted to be lazy, you could borrow that list, apply an XSLT style sheet on it turning it into a flat file, and load that into something else, say a MySQL database automatically. I challenge you to do that in reverse as quickly.

BTW, the 2002 version of Excel will save its spreadsheets in XML. So you can have someone who knows nothing about anything except Excel be able to save out some data that you just grab and put on your web site. Apply an XSLT style sheet to it and turn it into HTML.

Proprietary file formats are dead. Get over it ;)!

Brett_Tabke

4:50 am on Jun 7, 2001 (gmt 0)

Ok, thanks for the info.

>have the benefits of someone else writing the parser for you

Got a parser in raw Perl handy?

>Proprietary file formats are dead. Get over it ;)!

I tend to agree with that, but the more complicated the format, the more difficult to maintain and port to new projects.

I probably stick with flat files now simply because I have such a huge library of routines. New projects for me are like digging through a sock drawer trying find that match to the one left over in the dryer.

Bolotomus

9:15 pm on Aug 3, 2001 (gmt 0)

> If I'm selling 1,000 products I think I would opt
> for a perl and a flat text file

Hmmm... well, here's what I would say. IF you can never have to iterate through that file in any of your non-administrative CGI calls, then I would say sure, stick with the flat file.

E.g., you might have a page called "showproduct" that works something like /cgi-bin/showproduct?product=XYZ123

The flat file would require iteration through the whole file (A-Z) until it finally finds the product record. That's O(N) performance, very poor. It even could become the basis of a Denial-of-Service attack if somebody tortured your server by calling that script several times a second.

HOWEVER... suppose you instead had an adminstrative script that iterated through the 1000 products and produced 1000 HTML pages. Now they person calls up /widgets/product_XYZ123.html and there is no pressure on your server's CPU at all. Presumably, all the necessary info for each file (price, etc.) is hard-coded into the page itself. Suddenly the issue of whether it's in a flat file or MySQL doesn't make any difference.

All that being said, my observations tell me that usually 100 products becomes 1000 becomes 10,000 and suddenly the system starts to fall apart at the seams--and the client feels that you've somehow cheated them because the system they paid for was not scaleable. I would advise biting the bullet and going for a full-fledged database even if you have "only" 1000 records.

mark_roach

9:32 pm on Aug 3, 2001 (gmt 0)

>>Can interpreters such as Java or Perl, interfaced with >>mySQL, handle the speed you need? Google, for example, >>will hit you as often as four times a second, in my >>experience.

My site is hosted on a shared server and all the content is served up using Perl and MySql. It works fine with normal visitors. However twice it has been taken down by rogue spiders, the most recent being a couple of days ago when Ask Jeeves "asked" for 40 pages in 3 seconds. My host automatically disables scripts that consume excess resources (and I don't blame them), therefore making the site unavailable to all visitors.

Brett_Tabke

12:27 pm on Aug 4, 2001 (gmt 0)

>That's O(N) performance, very poor.

Depends on the box. [webmasterworld.com...]
Currently, there are 1500 entries parsed on every call. 15-30k views a day.
(I'm headed to trim that back though, that's a touch high). The MySql version was running 4 seconds average generation time - too slow.

Do a search: [joefarmer.com...] - Thats, three five meg flatfiles searched in an average of less than a half second (awesome box - wish we had it here). The MySql version was running 9seconds per search - unacceptable.

Do a search: [searchengineworld.com...] thats 13meg worth of databases searched (touch slow for my tastes, but the MySql version was running 50seconds per search - so I'm sticking with the flatfiles).

For a search engine, I find MySql has such post processing overhead, that any savings on sub 20meg files is chewed up in a hurry when it comes time to actually pump out a result. It's still a slave to perl.

ggrot

4:12 am on Aug 5, 2001 (gmt 0)

Ultimately, either oracle or MS(microsoft)SQL are the best in terms of raw overall speed. They both beat mysql for most things. However, if you are debating between flat files and mysql, remember also that mysql, mssql, and oracle all use an extremely similar SQL language interface. If your traffic or inventory explodes, flat files wont scale. mysql wont either after a certain point, but it is so very easy to port your app to any other SQL language db's from mysql unlike flat files.

MaliciousDan

6:45 am on Aug 5, 2001 (gmt 0)

I remember when I always worked with flat files, and whenever someone suggest SQL I wondered what the point was, it seemed like a lot of work for some very minor benifits... well, these days it's difficult to think of a good reason to use a flat file for anything except, saving temporary files (which are intended to be ready fully and then discarded completely) and for log files. Everything else seems to make more sense in a database, one record or a million records. I suppose very simple programs like page counters would be better off in a flat file, if you used a database there you'd have to provide a way to edit the information stored in the database (taking the script from 5 lines to several hundred). Flat files are nice in that you don't have to rely on someone having access to a database, but my guess is these days if they have access to a CGI bin they should have no problem getting access to a database.

The speed difference between a database and a flat file is entirely up to the skill of the programmer, if they aren't very familiar with how to design a database to run smoothly then it's certain to perform very poorly as the database grows.

MySQL can be a VERY fast database, but only if it's setup correctly. A very large portion of the programmers who build cgi programs that include support for mysql do not know enough about how to setup that tables and access the information so that as the tables grow the information can still return quickly (My theory is most of those programmers sell a product, to businesses that can throw faster servers at slow running scripts instead of having to make the code run more efficiently).

If you read all that and it didn't make sense I blame it on me trying to keep it short.

Bolotomus

4:19 pm on Aug 5, 2001 (gmt 0)

Brett,

9 seconds? 50 seconds? That doesn't sound right to me... I've got a system with millions of records (prices of bonds stored in realtime). A search like "find the highest price of this bond over the last 365 days" (searching on a key field) come back in fractions of seconds. And that's *million* of records--in some cases, 10's of millions. The only time I ever have seen queries take over a whole second it was my own fault, for not indexing properly.

Brett_Tabke

4:49 pm on Aug 5, 2001 (gmt 0)

On a search engine, it is the post processing that is time killer. There's no real solid way of getting context out of MySQL for rankings. So the rankings algo has to be perl processed. If it is a flat file, you can do a great deal of that processing and algo work while you are doing the actual search.

>"find the highest price of this bond over the last 365 days"

Ok, now issue a select to "find the word 'foo' in all fields" (where all fields are either blobs or varchars)

The only other way to do it with MySQL with any speed, is to issue multiple selects on specific fields. You could search specifically for matches on title - then other selects on the other fields you are looking to match. In the end, the multple selects are just as slow as Perl.

I've yet to find a quality (speedy) search engine done in MySQL.

ggrot

9:49 pm on Aug 5, 2001 (gmt 0)

I guess it wouldnt be too terribly effective in your case brett, but you could do like google does and load up a static database for each common search term with the pages as records and a ranking as an id field. Of course this is redundant...but fast.

Bolotomus

10:57 pm on Aug 5, 2001 (gmt 0)

Ah! You are trying to do a search like the SE's do a search, very tricky stuff. Now I can understand why a db doesn't save any time. You have to iterate through the whole table anyhow.

Fortunately MySQL has a new feature which seeks to fix this problem: the FULLTEXT index. It allows for substring searches. It's still a little primitive, though. I've experimented with it recently. It's good stuff, especially if they live up to their promises and add new features like quoted-strings and the + and - operators.

[Tip: if you're going to set it up, change the source-code before you compile to allow 3-letter words into the lexicon (the default is 4). The docs cover how to make that change.]

MaliciousDan

7:07 pm on Aug 6, 2001 (gmt 0)

MySQL is suitable for a search engine, however using varchar or text/blob types on any heavily accessed table will kill speed. There are ways to deal with this, but if that is one of the things you expect to use a lot then mysql isn't the best choice. I haven't played with the full text indexing of mysql yet, if I remember right it relies on the berkley db file format which is supposed to be one of the fastest formats for full text indexing. I haven't actually played with it though (much less with realistic amounts of data in the table).