
Forum Moderators: bakedjake

How many files should one have in a directory?

linux file directory inode

   
2:09 am on Apr 14, 2009 (gmt 0)

5+ Year Member



I am saving user files to disk: hundreds of thousands of them now, and the count will probably explode into the tens of millions (maybe a few orders of magnitude more).

Obviously we are using a hierarchy structure, but we are wondering:
1. What limit should we impose on the number of files any one sub-directory holds?
2. Should sub-directories count toward that total too, or does that even matter?
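For what it's worth, a common way to cap directory sizes is to hash each filename and use leading hex digits as nested bucket directories. A minimal sketch (the paths and the two-level/256-bucket choice are just assumptions for illustration, not a recommendation from this thread):

```python
import hashlib
import os

def shard_path(root, filename, levels=2, width=2):
    """Place filename under root using hex digits of its MD5 hash as
    nested bucket directories, so no directory grows without bound."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *buckets, filename)

# Two levels of 256 buckets each give 65,536 leaf directories, so ten
# million files works out to roughly 150 files per directory on average.
print(shard_path("/var/data", "user_12345.dat"))
```

Writing a file is then just `os.makedirs(os.path.dirname(path), exist_ok=True)` followed by a normal write.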

Thanks!

2:14 am on Apr 14, 2009 (gmt 0)

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I don't know what these files are; but it may be better to use a database rather than individual files. Even small binary strings are fine in most databases; and these systems are designed to handle very large numbers of small records.

I don't think there is any limit on the newer linux filesystems concerning the number of files per directory.

Have you looked into cloud storage solutions? This may be your answer as they are optimised for immense numbers of files, and scale beautifully with your growth.

5:16 am on Apr 14, 2009 (gmt 0)

5+ Year Member



We actually have the files in a DB right now, and MySQL is horrible when tables start getting into gigabytes...

2:46 pm on Apr 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It depends on the database structure and indexing. Partitioning helps a lot. I have a mysql database with 35 gigs of music in it, and it brings up a random song without any problem.

On another server I have a table that's only 14 gigs, but 105 million rows, and it does fine as long as I give it a well-indexed query, otherwise it will hang.

The real downside of the database, I think, is that if it gets corrupted, you could lose more data than you would with a file system. And, it can be more difficult to back up.

5:43 pm on Apr 14, 2009 (gmt 0)

5+ Year Member



Wow, thanks for sharing that mcavic.

In our case one of the offending tables has fewer than 20,000 rows but is about 2 GB in size. There is a mediumblob column in this table that causes even a simple SELECT to take as much as a minute - I'm talking about a SELECT on PRIMARY KEY taking over a minute, such as
SELECT * FROM OFFENDING_TABLE where id='735'

We think we have indexing right... regardless, why would such a simple query take so long?

We also did another simple test where we copied over the table (exported it) as another table, dropped the mediumblob column and ran the same queries. The queries now took a fraction of a second to run - including complex queries with large multiple joins.

So what could I be doing wrong? What can we do differently? Would appreciate any feedback, anyone can offer.

Thanks much!

5:57 pm on Apr 14, 2009 (gmt 0)

WebmasterWorld Administrator lifeinasia is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



We think we have indexing right

For that query, unless you have an index on the id field, I'd say you don't have it right. :)

You may want to try rebuilding the index(es). Over time, indexes can become corrupted (especially if there are a lot of inserts/updates to the fields indexed), so they can periodically need rebuilding/reindexing.

6:17 pm on Apr 14, 2009 (gmt 0)

5+ Year Member



Thanks for your inputs LifeinAsia. I think you missed the fact that I said id is the PRIMARY KEY. Let me add that it is AUTO_INCREMENT as well.

And we run
mysqlcheck -A -c -a -s --use-frm
every day. Periodically we run ANALYZE TABLE as well.

Now what? :(

7:38 pm on Apr 14, 2009 (gmt 0)

WebmasterWorld Administrator lifeinasia is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Oops, ya, missed that. :)

The only other thing that I can think of is that the underlying file structure may be outta whack and needs to be defragged (or whatever the *nix equivalent is).

Also, since the query went so quickly on the second table without the mediumblob column, try the SELECT query on the original table, but exclude the mediumblob column. (In other words, explicitly list all the columns except the mediumblob column.) Does that make any difference?

8:23 pm on Apr 14, 2009 (gmt 0)

5+ Year Member



Yeah, we're not even using the mediumblob column in any of the queries... I just showed SELECT * to make the point. In actuality we do list the columns explicitly.

All operations on that table are HORRIBLY slow, even browsing to the last page of the OFFENDING_TABLE in phpMyAdmin.

We have not done any partitioning (horizontal or vertical). That seems like the only other thing I can think of... and I will spend some time educating myself on that.

Or, just go to flat files and put all these mediumblobs as files.
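Another option short of flat files is manual vertical partitioning: move the mediumblob into a side table keyed by the same id, so browse/pagination queries never touch blob pages. A minimal sketch of the layout, using SQLite only as a stand-in (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Narrow "hot" table: everything the browse pages need.
    CREATE TABLE offending (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );
    -- Wide side table: one blob per row, fetched only on demand.
    CREATE TABLE offending_blob (
        id      INTEGER PRIMARY KEY REFERENCES offending(id),
        payload BLOB
    );
""")
conn.execute("INSERT INTO offending VALUES (735, 'example')")
conn.execute("INSERT INTO offending_blob VALUES (735, ?)", (b"\x00" * 1024,))

# Pagination scans only the narrow table...
page = conn.execute("SELECT id, name FROM offending ORDER BY id LIMIT 30").fetchall()
# ...and the blob is read only when a single record is opened.
payload = conn.execute("SELECT payload FROM offending_blob WHERE id = 735").fetchone()[0]
print(page, len(payload))
```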

Thanks for your inputs. I'd welcome any other ideas.

Regards,
--

12:50 am on Apr 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is very strange. Partitioning shouldn't help much on a 2 gig table, but something interesting to try might be:

ALTER TABLE MyTable
PARTITION BY LINEAR KEY(MyPrimaryKey)
PARTITIONS 30;

That would cut the size of each data file down to 67 megs. And if you select a specific record, it would only have to open one data file and one index file to find it. You could then time it to see how much faster it is.
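Quick arithmetic on that sizing, taking "about 2 gigs" at face value:

```python
table_mb = 2000    # ~2 GB table, in round decimal megabytes
partitions = 30
print(table_mb / partitions)   # ~67 MB of data per partition file
```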

9:17 am on Apr 15, 2009 (gmt 0)

5+ Year Member



I will try this and report back ;) BUT, we need to select more than one record. Typically it is like a "Browse" functionality that the user can then paginate over... makes sense?

Thanks for your help & ideas. Hopefully we will resolve this soon :)
--

4:43 am on Apr 16, 2009 (gmt 0)

5+ Year Member



No matter whether I do it from mysql> prompt or from phpMyAdmin I get the error:

#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'PARTITION BY LINEAR KEY(`id`)
PARTITIONS 20'

...and I did try it with and without the backticks around the table and field names...

Arrrgh! :)
--

4:45 am on Apr 16, 2009 (gmt 0)

5+ Year Member



P.S.
I tried ALTER table as well as CREATE table and both are spitting out the same error.

We are running
mysql Ver 14.12 Distrib 5.0.22, for redhat-linux-gnu (i686) using readline 5.0

Thanks again!
--

5:10 am on Apr 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I should have mentioned that partitioning isn't supported until MySQL 5.1.

5:27 am on Apr 16, 2009 (gmt 0)

5+ Year Member



ok, I suppose we can upgrade our MySQL. Any known gotchas with that?

But it still bothers me that this issue has eaten up days & days of my brain cycles and is still unresolved... especially since you are doing the exact same thing without any problems!

Y'know what else we figured out: when doing a join with another table, instead of joining on OFFENDING_TABLE.id we joined on OFFENDING_TABLE.id+0, and it spits the results right back! Voila!

Basically what we did was force it to not use the index for this table, by applying a math operation (+0 in this example). Hopefully someone else can learn/benefit from this too :)
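The same effect is easy to demonstrate outside MySQL. A sketch using SQLite's EXPLAIN QUERY PLAN, purely as an illustration: wrapping the key in +0 flips the plan from an index search to a full-table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(100)])

# Plain equality on the key lets the optimizer use the primary key...
plan_indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 42").fetchall()
# ...but id + 0 is an expression, so the index is bypassed.
plan_bypassed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE id + 0 = 42").fetchall()

print(plan_indexed[0][-1])   # a SEARCH using the primary key
print(plan_bypassed[0][-1])  # a full-table SCAN
```

Normally you want the SEARCH; the +0 trick only helps in this thread's case because something is pathologically wrong with that particular table's index.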

What is the URL to your app? Rgds,
--

6:12 am on Apr 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When doing a major upgrade, it's recommended to dump all of your databases and re-import them afterward, rather than carrying the binary data files over.

I know! It's bugging me too. Perhaps there's some sort of a bug in your installation that would be resolved by upgrading. If it was a hardware problem I assume there would be other symptoms.

The music database I mentioned is a streaming server that I'm writing in PHP for my own use. I'd be happy to send you a copy, but it's not really finished.

8:44 am on Apr 16, 2009 (gmt 0)

5+ Year Member



Ok, so we upgraded to Server version: 5.1.33-community, and I was able to ALTER the table to add the partitions.

Looks like performance is totally unaffected :(
--

9:41 am on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member



What is the field type of the id column? I know it's a silly question but...

Also - how does the performance of :
SELECT * FROM OFFENDING_TABLE where id='735'
compare to:
SELECT * FROM OFFENDING_TABLE where id=735

11:10 pm on Apr 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think there is any limit on the newer linux filesystems concerning the number of files per directory.

There isn't a limit. But you'll find directory operations (open, create, delete, etc.) can get very slow if there are a lot of files. ReiserFS can help; it uses a b-tree directory structure. With ext3 you usually have to do a linear search of all the directory pages to find a file (or to find out it isn't there during a create operation). Note: ext3 can do b-tree directories too (the dir_index feature, which can be enabled with tune2fs -O dir_index), it's just not the default.

10:55 am on Apr 25, 2009 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



another thing to consider if you go with files is if you are using ftp.
i've seen significant performance problems with some ftp clients and large directories.
i'm sure you can find a working solution but something to consider if you have software preferences or limitations...
 
