what happened to the other 40%?
is there a limit on the number of files on a disk as well as on bytes?
anyone know what the problem could be?
(nb the server has been rebooted properly after setting up the extra drive, and all that jazz)
Matt
$ df -i
Filesystem              Inodes   IUsed      IFree IUse% Mounted on
/dev/**censored**   4294967295       0 4294967295    0% /
/dev/**censored**      4889248 4441898     447350   91% /**censored**
here is the output for just df...
/dev/**censored**   19421984  4461992 14959992  23% /
/dev/**censored**   38464340 18324652 18185784  51% /**censored**
i guess that means that my primary hard drive does not suffer from inode limitations but my secondary one does. i have to buy a whole new server this year anyway.
thanks for the information. i'll take it into account when purchasing.
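For anyone who wants to watch inode usage from a script rather than by eye, here is a minimal sketch. It assumes a Unix system (os.statvfs is not available elsewhere), and the function names are made up for illustration. The percentage is rounded up, which matches the 91% in the df -i output above; whether df always rounds that way is an assumption.

```python
import os


def inode_use_percent(used: int, free: int) -> int:
    """Percentage of inodes in use, rounded up to a whole number."""
    total = used + free
    if total == 0:
        # Some filesystems report no inode counts at all.
        return 0
    return (used * 100 + total - 1) // total


def inode_report(mount_point: str) -> dict:
    """Read inode counts for a mounted filesystem via statvfs (Unix only)."""
    st = os.statvfs(mount_point)
    used = st.f_files - st.f_ffree
    return {
        "inodes": st.f_files,
        "used": used,
        "free": st.f_ffree,
        "use_percent": inode_use_percent(used, st.f_ffree),
    }


# IUsed and IFree from the second filesystem in the df -i output above.
print(inode_use_percent(4441898, 447350))  # prints 91
```

Calling inode_report("/") on the server itself would give the live figures for the root filesystem.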
Finally, note that some filesystems (XFS and VxFS, for instance) don't suffer from inode limitations at all. XFS is SGI's filesystem which has been ported to Linux, and has been available in the 'standard' kernel since late in the 2.4 series. VxFS is the Veritas filesystem; it's available for purchase (at a premium) from Veritas. There's also FreeVxFS in the Linux kernel, but I've not played with it at all.
Good luck. =)
all i understand about mounting the disk is that i type
mount /dev/something /somethingelse
-----------------
as for the sql scheme, this is my current situation:
i use the millions of files so that i can take any word queried by a user and just attach a file-suffix to the word and open a specific file for reading.
eg if they search for eastwood, it will take the first letter, E and open a file called ..../.../E/eastwood.suffix
so in other words, a user's query becomes simply the naming of a file to be opened.
will sql truly work faster than that? the file, in case you're curious, contains a list of product pages, except the list is boiled down to tiny identifiers which can similarly be prefixed by something (a url) to lead to the appropriate page; that, in turn, is also opened in a similar way, taking the identifier and using it to open yet another microscopic text file.
each file in this second haul is a single row in a database of 3 million rows.
i got forced into this method after my database expanded enormously: my original searching methods caused my server to get totally overloaded and my site to lose much of its utility and profit whilst i poked around for a viable, immediate solution.
when i first started using sql i had a table with an entry for every product and i was getting it to look for keyword matches in each of about 3 million product descriptions - which proved insanely slow.
i have not yet put the system i developed, with its millions of keyword entries, into sql.
i originally thought that it would be pointless, since opening a file with a given filename seemed like something a machine could do instantaneously.
but you've got me wondering now.
if i have a command in a perl script which is
open(my $fh, '<', 'filename.suffix') or die "cannot open filename.suffix: $!";
and the file happens to be one of 3 or 4 million,
then is it going to have to process anything difficult to open that file, or will it do it just as easily as it would if there were only 5 files on the same disk?
if the answer is that it would do it just as easily, then i reckon i don't need sql. if it is not, then i totally believe you and will certainly move the keyword database to sql in the fullness of time.
can i take the drive which currently has the inappropriate filesystem and remount it with the appropriate filesystem?
No, the partition will need to be reformatted. This will (obviously) destroy the data on the disk, so you'll need a migration path (this may be as simple as scheduling downtime where you back up your data, reformat the disk, and restore the data). I'd strongly suggest experimenting with filesystems on a test system, though! =)
Also note that support for the filesystem you're using will need to be present in your kernel if you want to make use of it.
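As to the speed, on a filesystem that indexes its directories (ext3 with dir_index, ReiserFS, XFS) it should open a file in a directory of 1,000,000 files about as fast as it would in a directory with 10 files; without directory indexing, as on ext2, the lookup is a linear scan of the directory and slows down as the directory grows. Either way, just don't do any kind of file globbing operations, or things will get REALLY ugly REALLY fast (ask me how I know). You could clean up things a *bit* by making your directory structure deeper: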
..../e/a/s/eastwood.suffix
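A minimal sketch of how a query word could be mapped onto that deeper layout. The function name, the "index" root directory, the lowercasing, and the alphanumeric-only check are all assumptions for illustration; the check is there so that a query like "../etc" cannot walk out of the index directory.

```python
def sharded_path(root: str, word: str, suffix: str = ".suffix", depth: int = 3) -> str:
    """Build root/e/a/s/eastwood.suffix from the word 'eastwood'."""
    word = word.lower()
    if not word.isalnum():
        # Rejects empty strings, slashes, dots and anything else unexpected.
        raise ValueError(f"unsupported query word: {word!r}")
    # One directory level per leading character; short words get fewer levels.
    levels = list(word[:depth])
    return "/".join([root, *levels, word + suffix])


print(sharded_path("index", "Eastwood"))  # prints index/e/a/s/eastwood.suffix
```

With three levels of single characters, each directory holds only a small slice of the files, which also makes it less painful if something does end up listing a directory.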
* the structure of the database is critical
* proper use of indexes is critical
* table joins are not cheap; they're very useful, but if you don't have to use them for a particular operation...don't.
* integer comparisons are cheaper than string comparisons
So, for instance, let's say that you have a form with a drop-down with a list of $somethings. You'll then take the submitted value and do something like:
SELECT * FROM <Table> WHERE <column> = '<submitted_value>'
When you populate the drop-down, populate it from a database table which maps integers (ids) to strings. Then set up the drop-down like this (populated from the database table):
<SELECT name="stuff_on_my_floor">
<OPTION value="1">Carpet</OPTION>
<OPTION value="2">Sofa</OPTION>
<OPTION value="3">Dustbunnies</OPTION>
</SELECT>
Now, when the data gets submitted, you can do:
SELECT * FROM <Table> WHERE item_id = <submitted_integer>
instead of:
SELECT * FROM <Table> WHERE item_name = '<submitted_string>'
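Here is a rough sketch of the integer lookup, using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The table name "items" is hypothetical; the column names follow the queries above. The submitted value is converted to an integer and passed as a bound parameter rather than pasted into the SQL string, so junk input is rejected before it reaches the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO items (item_id, item_name) VALUES (?, ?)",
    [(1, "Carpet"), (2, "Sofa"), (3, "Dustbunnies")],
)


def find_item(conn, submitted_value):
    """Look up one row by the integer id submitted from the drop-down."""
    item_id = int(submitted_value)  # raises ValueError for non-integer input
    cur = conn.execute(
        "SELECT item_id, item_name FROM items WHERE item_id = ?", (item_id,)
    )
    return cur.fetchone()  # None when no row matches


print(find_item(conn, "2"))  # prints (2, 'Sofa')
```

Because item_id is the primary key, the lookup uses the table's index instead of comparing strings row by row.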
Queries can also be optimized; one of the DB folks at my place of employment optimized a query from tens of seconds down to a few milliseconds. Other things to note:
* plain perl/python/whatever CGIs do not scale well, since every request starts a new interpreter process; languages with embedded interpreters (mod_perl, PHP, etc.) scale far, far better
* use database connection pooling; setting up (and tearing down) connections to a database is a waste of resources. Note that database connection pooling can NOT be done with fork()'d CGIs, since they exit between invocations.
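To make the pooling idea concrete, here is a toy sketch for a long-running process. It uses Python's built-in sqlite3 and queue modules purely as stand-ins; the ConnectionPool class is invented for illustration, and a real deployment would more likely use the pooling that comes with its database driver or framework.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections once and hands them out on demand."""

    def __init__(self, database: str, size: int = 4):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False allows a connection to be borrowed by
            # a different thread than the one that created it.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is borrowed
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it


pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1 + 1").fetchone()[0])  # prints 2
```

The connections are created once, when the process starts, and reused for every request after that, which is exactly what a fork()'d CGI cannot do.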
(edit: spelling correction)
[edited by: sitz at 2:04 am (utc) on April 30, 2005]
i always resort to perl these days because i've got so used to it i don't have to think at all when writing in it.
before i can take the next steps with sql from my linux environment i have to learn more linux - and first get a dummy machine;
besides, before taking the development any further i need to make further adjustments to the current design of how the pages get 'scored' (i.e. we're talking freelance independent bespoke search engine pageranking) and hammer out the final structure of the whole database and scoring system. this phase of the design started in august last year and it's taken all that time to run only 3 updates of the database, one of which got screwed up badly by the other kind of server problem (virus/hacker/etc).
when it is all finally ready, in terms of structure, migrating it from my file-picking mechanism to an sql database will be the easy bit.
the brick wall which i have to deal with in terms of the structure of it all lies in a script which slows down my crawl (done on a mac os x unix server) so it takes 3 months to create the database...
the source of the problem is that when going through 3 million sets of easily-makable scored words, per product, to convert them into sets of scored products per word, i end up using a lot of time making it reread the scores per word every time it writes a new score; without sql one cannot update in a yodalike way - i.e. without opening, reading, changing, rewriting all.
hence my new linux installation will certainly end up housing the next upgraded version of my crawler, and the only way i'll be able to eliminate two and three quarters of the three months it takes to crawl is by putting all these damn files into sql databases.
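A rough sketch of what that conversion could look like once the scores live in a table instead of in files. It uses Python's built-in sqlite3 module as a stand-in for a real sql server; the "postings" table, its columns, and the sample products and scores are all made up for illustration. Each (word, product, score) triple becomes one inserted row, so nothing has to be reopened and rewritten when a new score arrives.

```python
import sqlite3

# Hypothetical crawl output: product id -> {word: score}.
product_scores = {
    101: {"eastwood": 5, "western": 2},
    102: {"eastwood": 3, "poncho": 4},
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings ("
    " word TEXT NOT NULL,"
    " product_id INTEGER NOT NULL,"
    " score INTEGER NOT NULL,"
    " PRIMARY KEY (word, product_id))"
)

# Flatten per-product word scores into one row per (word, product) pair.
rows = (
    (word, product_id, score)
    for product_id, words in product_scores.items()
    for word, score in words.items()
)
with conn:  # a single transaction for the whole batch
    conn.executemany(
        "INSERT INTO postings (word, product_id, score) VALUES (?, ?, ?)", rows
    )


def products_for(conn, word):
    """Scored products for one word, best score first."""
    cur = conn.execute(
        "SELECT product_id, score FROM postings"
        " WHERE word = ? ORDER BY score DESC, product_id",
        (word,),
    )
    return cur.fetchall()


print(products_for(conn, "eastwood"))  # prints [(101, 5), (102, 3)]
```

The primary key on (word, product_id) doubles as the index used by the per-word lookup, so the search side stays a single indexed query.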
oh what fun my new machine will be. the only reason i can't wait to get going on it is so that i can get it over with that much more quickly.
i suppose i should really, now that i've just recently finished a crawl (hence the tarball antics), get my mac's sql running and sort out the top priority issue of getting rid of those 2.75 wasted months per quarter!
--
just investigated that.
my os x is 10.1.4
(although it comes with disks to make it 10.2)
i finally have an opportunity to wipe it clean and reinstall all the software for the first time in 5 years or something.
i've already dug up the page at mysql.com where i can then download sql to install on my mac. this time next week i expect i'll be totally absorbed in making my faster crawler. right now sleep. first thing tomorrow - reinstall mac. i'll even make a goddamn note which the mac itself will thrust in my face at the designated time.
cheerio for now.