How much of a performance hit is file_exists?


HughMungus

7:27 pm on Dec 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a database full of products but don't have images for all products. Originally I was going to have an indication in the database of if there is an image or not (display the image if the "image" field = "y", display a placeholder if it does not). Then I thought about using file_exists, instead (since the images are named after the database id field number).

But now I'm wondering how much of a performance hit this is compared to using the database indicator method since the script would have to actually physically check the hard drive for a file. I'm sure the number of items per page would be a factor. At what point does doing a bunch of file_exists operations become cumbersome? I'm thinking the most I'd have per page is about 20 (that is, 20 products per page).

TIA
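A minimal sketch of the file_exists approach described above, assuming images are stored as "<id>.jpg" in a single directory. The function name product_image, the ".jpg" extension, the placeholder file name, and the temporary directory are all hypothetical choices made for illustration (the type declarations assume PHP 7 or later).

```php
<?php
// Hypothetical helper: image files are assumed to be named "<id>.jpg"
// inside $imageDir; the placeholder name is also an assumption.
function product_image(int $id, string $imageDir, string $placeholder = 'placeholder.jpg'): string
{
    $file = $id . '.jpg';
    // One filesystem check per product; with about 20 products per page
    // that is about 20 file_exists() calls per request.
    return file_exists($imageDir . '/' . $file) ? $file : $placeholder;
}

// Usage: create a throwaway directory containing a single image file.
$dir = sys_get_temp_dir() . '/product_images_' . uniqid();
mkdir($dir);
touch($dir . '/42.jpg');

echo product_image(42, $dir), "\n"; // the image file exists
echo product_image(43, $dir), "\n"; // no image, so the placeholder is used
```

The database flag and the filesystem check answer the same question; the filesystem version simply avoids keeping a second record of which images exist.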

ergophobe

11:02 pm on Dec 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Tiny. Filesystem access is very fast (on the order of 10x faster than DB access). If you are already retrieving the row from the DB, of course, the check is an additional action. Still, it should be relatively quick.

Write a script with a loop in it and have it check file_exists 40,000 times and see what you get for results. If you don't know how to benchmark a script, look for the benchmarking article in the forum library. And remember, report back with results when you do.
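The loop suggested above could be sketched roughly as follows. This is only one way to time it, using microtime(true); the helper name time_calls is hypothetical, and the closure adds some call overhead of its own, so the figure is an upper bound on the cost of file_exists() itself.

```php
<?php
// Time $iterations calls of $fn and return the elapsed seconds.
function time_calls(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$path = __FILE__; // the running script itself, used as the file to check
$n = 40000;

$elapsed = time_calls(function () use ($path) {
    file_exists($path);
}, $n);

printf(
    "%d file_exists() calls: %.4f s (%.2f microseconds each)\n",
    $n,
    $elapsed,
    $elapsed / $n * 1e6
);
```

Absolute numbers will vary widely by machine, so it is more useful to compare against a baseline such as an empty function timed the same way.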

ergophobe

11:18 pm on Dec 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just ran the following with the xdebug profiler on

for ($i = 0; $i < 100000; $i++) {
    file_exists($_SERVER['PHP_SELF']);
}

It took 10.5 seconds for 100,000 iterations.

By comparison, I set it up to call about the simplest user-defined function that I could think of - one that doesn't do anything at all. So calling this instead

function testfunc()
{
    return;
}

Took 3 seconds.

jatar_k

12:25 am on Dec 9, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



since ergophobe has proven the thread I will send this wildly off topic

ergophobe

what are the results if you call that function twice?

what is the rough setup of the server it was tested on?
(ram, dedicated/shared, processor)

ergophobe

1:55 am on Dec 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I thought about posting all that too. Part of why I posted the comparison is because I figured the relative values there were good indicators. I didn't mean the absolute values to have any particular meaning, because they will of course vary so much.

You might be right in wondering how it scales to different systems. The main point I was trying to make is that calling the file_exists() function was only about three times slower than calling a function that performs no function (sorry). I also did a quick test with one that simply assigned an integer to a variable and then quit, and that was slower than the empty function by about 50%, or roughly half the execution time of file_exists().

That was just running on the older computer I was surfing on with a testbed WAMP setup

- Athlon 750MHz
- ca 640MB of RAM (can ergo add? 256 + 256 + 128 = 640)
- lots of other apps running, but no live web server, no concurrent requests.

I suppose the seeks might slow down a fair bit with many concurrent requests waiting for the read head to come around again. I suppose it might depend on how everything was running though. If there weren't a lot of files on the disk, might it just keep the FAT or part of it in memory and not recheck the disk every time? Of course I assume that file_exists does not actually check the file location, but just looks it up in the FAT.

I didn't try calling the function twice (you mean twice per iteration through the loop?)

HughMungus

2:36 am on Dec 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Fascinating. I always assumed that calls to the file system would always be slower than calls to the database. Great learning info. Thanks. I ended up switching from a database entry check to using file_exists. It's fast enough for my taste (especially since I'm only doing 20 per page or so).

Thanks!

jatar_k

4:49 am on Dec 9, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



ahhh, sorry ergophobe, I misunderstood a bit

yes, for a moment I thought a single call to that function took 3 seconds, my mind reeled a bit at the implications.

hehe, I should drink more coffee sometimes but I think this time less would have been better.

Yes, I was wondering about twice per, trying to figure if it would actually double or not. You wouldn't think it should matter but I wonder. Some benchmarking results over the years still don't make sense to me. I can't remember any off the top of my head but I know there are some every once in a while.

I will have to try the same test on my servers tomorrow and see the variations.

ergophobe

6:24 pm on Dec 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




>> Fascinating. I always assumed that calls to the file system would always be slower than calls to the database.

You can think of it with this analogy:

If you were looking for someone in a city and you had their address, you would just go straight to the address - very fast (filesystem when you know the file name and don't have to search through it).

If you only knew the name and had to go around knocking on doors house-by-house until you found the right house - very slow (filesystem when you don't know which file the data is in or it's in a very large file and you have to read it line-by-line until you find it).

If you had a phone book and the name and you could use the name to look up the address and then go straight there - not as fast as just knowing the address, but still quite fast (database situation).

So if you want the whole file and you know where it is, or in your case you know the name and want to see whether the name is valid, it's fastest to use the filesystem. If that file has 86,000 rows of data and you want to find one row, you should use a DB.
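The three cases in the analogy can be sketched in code. This is an illustration only: the data file, its "id,name" row format, and the function names scan_for_id and build_index are invented for the example, and the in-memory array merely stands in for what a database index does (the syntax assumes PHP 7.1 or later).

```php
<?php
// Build a throwaway data file with 1,000 "id,name" rows (hypothetical data).
$dataFile = tempnam(sys_get_temp_dir(), 'rows');
$rows = [];
for ($i = 1; $i <= 1000; $i++) {
    $rows[] = "$i,product-$i";
}
file_put_contents($dataFile, implode("\n", $rows) . "\n");

// 1. "You have the address": the file name is known, one direct check.
$known = file_exists($dataFile);

// 2. "Knocking on doors": read line by line until the id turns up.
function scan_for_id(string $file, int $id): ?string
{
    $handle = fopen($file, 'r');
    $found = null;
    while (($line = fgets($handle)) !== false) {
        [$rowId, $name] = explode(',', rtrim($line), 2);
        if ((int) $rowId === $id) {
            $found = $name;
            break;
        }
    }
    fclose($handle);
    return $found;
}

// 3. "Phone book": build an index once, then look ids up directly.
function build_index(string $file): array
{
    $index = [];
    foreach (file($file, FILE_IGNORE_NEW_LINES) as $line) {
        [$rowId, $name] = explode(',', $line, 2);
        $index[(int) $rowId] = $name;
    }
    return $index;
}

$scanned = scan_for_id($dataFile, 900); // touches 900 lines before it finds the row
$index   = build_index($dataFile);      // one pass up front
$indexed = $index[900];                 // then a direct lookup

echo $scanned, "\n", $indexed, "\n";
unlink($dataFile); // clean up the temporary file
```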

mincklerstraat

1:56 pm on Dec 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting. I'd have guessed that this benchmark would have been unreliable, since the results of file_exists() are cached. The idea is that once it has checked whether the file is there, for the remaining 99,999 times it pretends to check, it's really just taking a peek at where it hid this value in memory. Like your night watchman who does his round once and just ticks off the boxes each following hour since he remembers everything was there. Thought being that if PHP really sees fit to cache this kind of thing, there must be some sort of savings involved. If this were true, I could strut around like Mr. Smartypants.

Anyways, here's the results - first the benchmark code:


require '/somepath/PEAR/benchmark/Benchmark-1.2.1/Timer.php';
$timer = new Benchmark_Timer();
$timer->start();
for ($i = 0; $i < 100000; $i++) {
    file_exists($_SERVER['PHP_SELF']);
}
$timer->setMarker('Mark1');
echo "Elapsed time between Start and Mark1 (file_exists() only): " .
    $timer->timeElapsed('Start', 'Mark1') . "<br />\n";
for ($i = 0; $i < 100000; $i++) {
    clearstatcache();
    file_exists($_SERVER['PHP_SELF']);
}
$timer->setMarker('Mark2');
echo "Elapsed time between Mark1 and Mark2 (clearstatcache() + file_exists() : " .
    $timer->timeElapsed('Mark1', 'Mark2') . "<br />\n";
$timer->setMarker('Mark3');
for ($i = 0; $i < 100000; $i++) {
    clearstatcache();
}
$timer->setMarker('Mark4');
echo "Elapsed time between Mark3 and Mark4 (clearstatcache() only) : " .
    $timer->timeElapsed('Mark3', 'Mark4') . "<br />\n";
$timer->stop();
$timer->display();

the relevant results:


Elapsed time between Start and Mark1 (file_exists() only): 0.479363
Elapsed time between Mark1 and Mark2 (clearstatcache() + file_exists() : 0.524402
Elapsed time between Mark3 and Mark4 (clearstatcache() only) : 0.129451

Conclusion:

  • clearstatcache() offers no serious time loss when checking whether the same file exists 100,000 times in a row (only about 10% slower, and less than the sum of file_exists() and clearstatcache() timed separately). Not really the kind of info you'd need, though, for a 'real life' application.
  • Can't go Mr. Smartypantsing you all here, but my puter's faster than yours, ergophobe! (just had to do that)

Benchmarked on Debian Linux, P4 2.4GHz, 1GB RAM, SATA hard disk w/8MB cache, various stuff open (in X) but no public server.
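To relate these timings back to the original question (about 20 checks per page), the arithmetic can be worked through as below. The only inputs are the three timings reported above and the figure of 20 products per page; the variable names are arbitrary. On these numbers a single file_exists() call comes to roughly 4.8 microseconds, adding clearstatcache() costs roughly 9% extra, and 20 checks come to under a tenth of a millisecond on that machine.

```php
<?php
// Timings reported above, in seconds per 100,000 iterations.
$iterations  = 100000;
$existsOnly  = 0.479363; // file_exists() only
$existsClear = 0.524402; // clearstatcache() + file_exists()
$clearOnly   = 0.129451; // clearstatcache() only

// Cost of one file_exists() call, in microseconds.
$perCallMicro = $existsOnly / $iterations * 1e6;

// Relative extra cost of calling clearstatcache() before each check.
$overheadPct = ($existsClear - $existsOnly) / $existsOnly * 100;

// Cost of the 20 checks per page mentioned in the original question.
$perPageMicro = $perCallMicro * 20;

// Sum of the two functions timed separately, for comparison with $existsClear.
$separateSum = $existsOnly + $clearOnly;

printf("per call: %.2f microseconds\n", $perCallMicro);
printf("clearstatcache() overhead: %.1f%%\n", $overheadPct);
printf("20 checks per page: %.1f microseconds\n", $perPageMicro);
printf("timed separately: %.6f s vs combined: %.6f s\n", $separateSum, $existsClear);
```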

ergophobe

9:39 pm on Dec 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I thought about caching, but I assumed (but then, what empirical value does that have?) that the results of file_exists() would not get cached because to have any value as a function, it must be checked every time through. Otherwise, how would you verify an unlink()?

if (file_exists('my.file'))
{
    unlink('my.file');
}

if (file_exists('my.file'))
{
    echo "Sorry, delete failed";
}

If the results of file_exists() were cached, you would get the failure message every time.
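The scenario above is easy to check empirically. The sketch below uses a temporary file instead of 'my.file' (the temp-file setup is an assumption for the sake of a self-contained test). According to the PHP manual's notes on clearstatcache(), unlink() clears the stat cache itself, so the second check is expected to see that the file is gone.

```php
<?php
// tempnam() creates an empty file and returns its path.
$file = tempnam(sys_get_temp_dir(), 'del');

$before = file_exists($file); // the file was just created

// Per the PHP manual, unlink() clears the stat cache automatically,
// so no explicit clearstatcache() call is made here.
unlink($file);

$after = file_exists($file);  // re-check the same path after deleting

var_dump($before, $after);
if ($after) {
    echo "Sorry, delete failed\n";
}
```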

And yes, I'm old school - sub 1GHz. For a desktop, it seems to do most things just fine until I get around to editing my cinematic masterpiece, but that's down the road.

jatar_k

10:16 pm on Dec 10, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



that's ok, I ran this on sub gig too ;)

I ran a bunch of things just for interest and came up with these numbers. Just using microtime before and after and subtracting. All 100K iterations.

file_exists($_SERVER['PHP_SELF']);

1.64718198776
1.68153810501
1.66256690025

file_exists($_SERVER['PHP_SELF']);
clearstatcache();

1.75823688507
1.76556396484
1.76278805733

testfunc(); // which does the big nada

0.361613035202
0.32157087326
0.342176914215

double testfunc

0.528954029083
0.579276800156
0.52706694603

quad testfunc

0.832072973251
0.829246044159
0.811228990555

Benchmarked on Sunfire V120, Solaris 8 , UltraSparc IIi 550MHz, 512 MB RAM, SCSI hard disk, dev server, runs massive amounts of crap, including Oracle server and client, mysql, Apache, minimal concurrent connections

>> my puter's faster than yours, ergophobe!

geez looks like on paper it's faster than mine too, but then again, I don't think so ;)

HughMungus

10:28 pm on Dec 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great analogy erg. Thanks again. My php stuff is pretty simple and it's not like the difference is going to make or break the site, but I do often wonder what is most efficient (for future reference).

mincklerstraat

10:16 am on Dec 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ergophobe >> I assumed ...

Yeah, this is exactly what I assumed, and I remember when I first read the thing in the manual on the file_exists() page about caching this function, and I thought, what the ***? For the same reasons you mention. Apparently, if you want to do what you mention, you do have to call clearstatcache() [be2.php.net] before checking on a file a second time. This has been so since PHP3 - maybe it's a remnant from the days when people thought PHP would be used a lot for doing batch-like tasks ala PERL, or when people depended more on the filesystem rather than just sticking every imaginable thing in the database, like they seem to do today.
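A defensive pattern for repeated checks on the same path might look like the sketch below. The helper name fresh_file_exists is hypothetical; it relies on the two-argument form of clearstatcache() (available since PHP 5.3), which clears cached data for one file name rather than the whole cache. Whether the extra call is strictly needed depends on how the file changed, so treat this as a precaution, not a requirement.

```php
<?php
// Hypothetical helper: drop any cached stat data for $path, then check.
function fresh_file_exists(string $path): bool
{
    // The file name argument is only honoured when the first argument is true.
    clearstatcache(true, $path);
    return file_exists($path);
}

// Usage with a throwaway file.
$path = tempnam(sys_get_temp_dir(), 'chk'); // creates the file
$first = fresh_file_exists($path);

unlink($path);
$second = fresh_file_exists($path);

var_dump($first, $second);
```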

I'm also not much of a currentest-stuff-in-your-box sorta guy, got this machine to replace a PII 330MHz 192MB jobber with a 14" CRT, and haven't even bothered yet getting the ATI graphics acceleration to work. You've got a nice amount of RAM, though - hardly ever use any more than that, and I think I'd take 750MHz with 640MB over 2.4GHz and 256MB any day.

Jatar: no contest here - sheesh, Solaris / SCSI and Oracle on yer desktop? You get the option of the 20-year nuclear UPS? You're way out-ubering all the ubergeeks I know.

jatar_k

6:00 pm on Dec 11, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



not really my desktop, I have a standard winxp 2.1 or 2.4 GHz, 512 MB ram, 40GB IDE setup. I just don't run any dev on it. All our dev is done on servers at work. Above just happens to be one of my closest friends ;), our primary dev server.

just the thought of solaris as a desktop makes me cringe.

ergophobe

1:23 am on Dec 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sure enough, the results of file_exists() are cached. Now I've been warned. Of course, based on the benchmarking you guys did, the caching doesn't seem to speed things up in this case.