Forum Moderators: open

Message Too Old, No Replies

GZ Files into Database

         

Jozsoft

2:10 pm on Jul 18, 2007 (gmt 0)

10+ Year Member



I have an issue that I have never faced therefore I am asking the following question on several forums in hopes that can find a solution.

I am building a database driven web site using PHP and MySQL. It’s a pretty basic setup nothing fancy or anything that I haven’t done before. However, the data that will go into the database is coming to me via comma delimited text files that are compressed in the Gzip format. There are around 100 of these gzip files that need to be pulled off a FTP server and loaded into database.

Now here’s the kicker: The database needs to be updated on a daily basis. Which means that the gzip files must be downloaded from the FTP server every morning. They then need to be decompressed into the comma delimited text files and then each one needs to be loaded into the MySQL database.

Now, if this was for just a few text files then following the procedure above wouldn’t be that big of a deal. But, the thing is, that there are over 100 of these comma delimited text files that have been compressed in gzip. All of these files need to be inserted into the database every morning ASAP, to stay competitive with our competition.

So, my question is: is there any server side utility that can decompress a large collection of gzip files and then insert them into a MySQL database? Or is there something that I can do (PHP script wise or other) to minimize this process?

Here is some info on the server and software that I am running:

Operating system Linux
Kernel version 2.6.17.11-grsechg
Apache version 1.3.37
PERL version 5.8.7
PHP version 4.4.7
MySQL version 4.1.21-standard

Demaestro

3:04 pm on Jul 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This isn't as bad as you may be thinking.

I am about unsure about existing scripts but you should be able to write an import routine for these files. All you need are server resources.

Using a cron task I would hit a Python method that I would write that would be an import routine....

Start by using Wget to grab the file and then unzip it to a directory that would be set up to have this files unzipped in a sub directory that would be named with a date and time stamp of the time that the file was unzipped.

I would then grab all the files and run them through a parsing method that would make sure the file format will work with the import routine.... things like converting weird chars into some type of plain text... like weird Mac line breaks I would change to <line_break> then that could be parsed back into a line break later when put back to screen or whatever you are doing with them.... also make there are the right amount of field delimiters.... you will have to write a parsing method for each file if they are going to be of a different nature.

Then read the file into a variable named 'file'...

Then something like this

rows = [x for x in file.split('/n')] #this will be an array... each item will be a row

for row in rows:
....columns = [x for x in row.split('/t')] #this will give you an array... each item in the array is a column
....name = columns[0]
....address = columns
....phone = columns[2]

#to save resources you don't have to assign the columns to variable names but you should at least comment which column is which field to save you a headache when debugging.

....try:
........sql = 'insert into table ('name', 'address', 'phone') values(name, address, phone)'
........sql_execute(sql)
....except(Exception e)
........write_to_error_log(all the row info)

**********

A lot of time will be spent figuring out the columns and matching them up in code....I have some that are a 100 lines of code just for the import part of it... if a sql statement fails write to a log the columns and line number and as much info so you can go back and look at the file to see why a row didn't go in.... it will usually be a delimiter issue. You will have to make sure the format of the files is static and if any fields are added or removed that you know so you can adjust your routine.

Once and a while zip the directory that stores the files and archive it somewhere so you have the originals. It will act as a back up if you loose the database you can re run the import on all the files.

This is just how I would do it... I am not saying it is the best way but it gives you god like control over the import and when headaches arise you will be glad to have that level of control.

[1][edited by: Demaestro at 3:05 pm (utc) on July 18, 2007]

Jozsoft

4:59 pm on Jul 18, 2007 (gmt 0)

10+ Year Member



To be honest, inserting the comma separated text files is not much of a worry, I am more concerned at the first parts of the process. I think I might be able to use a PHP function to grab from the FTP. However, I have no idea how to decompress the gz files. However, after they are decompressed, i am not to worried about writing a script to insert into the MySQL tables. I guess I am just stuck with the decompressing the gz files.

Demaestro

8:22 pm on Jul 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have messaged you two URLs that have tutorial for dealing with files in PHP including Zipping and unzipping.

Basically you use the "system" or "exec" calls that will allow you to execute command line stuff

**Mods if these are allowed I will put them into the message body