When searching a folder, how many files is too many?

londrum

8:01 am on Oct 29, 2013 (gmt 0)




i've set up a little caching system for my php pages, and each directory has its own folder to dump the written files to. the next time someone visits, the script searches through that folder to see if the file has been cached there.

but... i probably didn't plan it too well, because one of the folders has now got 300,000 files in it. so i was wondering... is this going to have an effect on performance? i don't want to spend more time searching for the file than i would have spent just writing it.

does anyone know if there's a rule of thumb for how many files you should put in a single folder? at the moment i'm thinking that 10,000 in each directory might be a handy number.

penders

11:59 am on Oct 29, 2013 (gmt 0)




Rather curious as to why you would have so many cached files... unless you have that many pages?

...the script searches through that folder...


Again, I'm wondering why you would be "searching" for the file? I would have thought that any one resource you want to cache would map directly to one cache file. No "searching" should be required; it should be a direct lookup.
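Something like this, roughly (just a sketch; the cache directory, one-day lifetime, md5-based filename and build_page() helper are assumptions for illustration, not necessarily how your script works):

<?php
// Sketch of a direct cache lookup: each page maps to exactly one file,
// so the cache directory never needs to be scanned.
$cacheDir = '/var/cache/mysite';            // assumed location
$ttl      = 86400;                          // assumed: keep files for one day
$key      = md5($_SERVER['REQUEST_URI']);   // one unique filename per page
$file     = $cacheDir . '/' . $key . '.html';

if (is_file($file) && (time() - filemtime($file)) < $ttl) {
    readfile($file);                        // cache hit: serve the stored copy
    exit;
}

// cache miss (or stale): rebuild the page and store it for next time
$html = build_page();                       // hypothetical page generator
file_put_contents($file, $html);
echo $html;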

...one of the folders has now got 300,000 files in it


Certainly, if you are having to "search" that directory then it's going to start to cripple your caching, if it hasn't already.

How long has it taken to generate that many files? You could even find that you start to reach the limits of the filesystem, or the tools you are using to navigate that filesystem?! Or maintenance becomes a burden. (Just as a comparison... I believe FAT32 has a physical limit of 65535 files in any one directory.)

"How many....?"

I'm guessing... If you are doing direct lookups then 10K should be OK I would have thought. But "searching" would still be an issue. And directory listings / FTP / etc. could also be slow.

londrum

12:14 pm on Oct 29, 2013 (gmt 0)




search was a poor choice of word; it doesn't actually search for anything. the script knows what the filename is and just grabs it (or writes over it, depending on its age)

300,000 is the final number, it doesn't grow beyond that, but they are all unique pages. there are quite a few database queries involved in generating each page, so that's why i have started to cache them

swa66

12:33 pm on Oct 29, 2013 (gmt 0)




Your operating system is most probably doing a search for you in the directory, and that search is most probably a linear search.
That means that it'll overflow the directory name lookup cache (using unix terminology here) in your OS every time you access that directory.
Most systems suffer severely from such use (unless you have a more complex filesystem underneath - but those have their own drawbacks). The effect can be a system that spends most of its CPU time in "system" mode for no apparent reason. Seeing directory cache performance statistics is very hard in most unices.

Best way to proceed: store the files in a hierarchical tree: hash the name (or use some other part of it that actually varies) and make a few levels of subdirectories based on that, so that no directory has more than a thousand or so entries.
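A rough sketch of that in PHP (assuming md5 of the name as the hash and two levels of two-character subdirectories, which gives at most 256 entries per level):

<?php
// Sketch: derive the storage path from a hash of the name, so no single
// directory ever has to hold more than a few hundred entries.
function cache_path($name, $baseDir = '/var/cache/mysite') {
    $hash = md5($name);                           // e.g. "9e107d9d372bb682..."
    $dir  = $baseDir . '/' . substr($hash, 0, 2)  // first level:  00 .. ff
                     . '/' . substr($hash, 2, 2); // second level: 00 .. ff
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);                  // create the subtree on demand
    }
    return $dir . '/' . $hash . '.html';
}

// usage: the lookup is still direct, only the location is spread out
$file = cache_path($_SERVER['REQUEST_URI']);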

Oh BTW: in most OSes the directory itself won't shrink back to a reasonable size just because you delete the files and subdirectories in it; you need to remove and recreate the directory.

penders

12:38 pm on Oct 29, 2013 (gmt 0)




Ah OK. Interesting problem.

Are you able to record any time difference between having 300,000 and say 1,000 cached files?

I came across this statement on another site:
I have a directory with 88,914 files in it. This is used for storing thumbnails and on a Linux server.

Listed files via FTP or a php function is slow yes, but there is also a performance hit on displaying the file. e.g. www.website.com/thumbdir/gh3hg4h2b4h234b3h2.jpg has a wait time of 200-400 ms. As a comparison on another site I have with a around 100 files in a directory the image is displayed after just ~40ms of waiting.


However, another comment suggests that this shouldn't be a problem, and that maybe there could be an issue with the filesystem used?

I'm sure there was a thread by Brett_Tabke ages ago discussing flat-file systems, where it was stated that webmasterworld.com is built on a flat-file system with thousands (and then some) of files.

swa66

1:07 pm on Oct 29, 2013 (gmt 0)




If you can choose what filesystem the OS uses, there are filesystems like XFS that use more complex lookup strategies than a linear search - but as stated, they have other drawbacks.

The simplest solution is not to go there: use 256 subdirectories (00 to ff) and add one or two more layers as you need them. Or, if you store things like thumbnails, put them in a year/month/day/ tree that you create as needed based on the initial upload date (or hour, if you get really many of them at once), and you won't have to worry about DNLC performance at all.
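The date-based variant could look roughly like this (the base directory and naming are assumptions for illustration):

<?php
// Sketch: derive the path from the upload date, so new files go into a
// fresh directory each day instead of one directory growing forever.
function dated_path($filename, $baseDir = '/var/uploads/thumbs') {
    $dir = $baseDir . '/' . date('Y/m/d');   // e.g. .../2013/10/29
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }
    return $dir . '/' . $filename;
}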

londrum

1:32 pm on Oct 29, 2013 (gmt 0)




yup, having 300,000 was a bit nuts.
i regularly have to go in and clear out the directories though: the files are cached for a day at a time, but when changes are made to the script nobody wants to wait around 24 hours for the cached files to update. so having loads and loads of subdirectories would also be a pain, although that is probably the best idea.

i think i'll go for 10,000 max and see how that goes, and reduce it from there.

penders

1:57 pm on Oct 29, 2013 (gmt 0)




i regularly have to go in and clear out the directories though...


Sounds like you need a way to expire the cache?
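For example, you could expire by age plus a cache "version" string that gets bumped whenever the script changes: files written by the old version then have a different name, so they are ignored immediately and can be swept out whenever convenient, instead of being cleared by hand straight away (just a sketch with assumed names, not your actual setup):

<?php
// Sketch: a cached file counts as fresh only if it matches the current
// script version and is younger than the TTL.
$cacheDir     = '/var/cache/mysite';   // assumed location
$cacheVersion = 'v2';                  // assumed: bump this on every script change
$ttl          = 86400;                 // one day, as in the thread

$key  = md5($_SERVER['REQUEST_URI']);
$file = "$cacheDir/$cacheVersion-$key.html";

if (is_file($file) && (time() - filemtime($file)) < $ttl) {
    readfile($file);                   // fresh and current: serve it
    exit;
}
// otherwise regenerate the page and write it to $file as usual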

londrum

2:22 pm on Oct 31, 2013 (gmt 0)




just to follow up... my site has noticeably sped up since i made the change from 300,000 in a folder to under 10,000. it really zips along now.
i've decided to work on my image folders next, because they've got 3,500 files in them.

it seems like a quick and easy way to speed up your site a bit.

swa66

3:13 pm on Oct 31, 2013 (gmt 0)




Actually you speed up your OS ...