So, I have a large text file, file.txt (about 65 MB), and I want to parse its data into MySQL. The contents look like this (it's not a typo where two words, or a number and a word, run together; that is how they appear in the original file):
1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999
I want to insert this data as follows (in order of appearance):
1000039 - insert into column 1
PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge - insert into column 2
PONTIN D.O.O. - insert into column 3
0600888913 - insert into column 4
19990706 - insert into column 5
13 - insert into column 6
Zadarska - insert into column 7
5207 - insert into column 8
Zadar - insert into column 9
71951 - insert into column 10
Zadar - insert into column 11
23000 - insert into column 12
Gazenicka Cesta - insert into column 13
32 - insert into column 14
5190019950217 - insert into column 15
999 - insert into column 16
What approach should I take to solve this problem? Any help is highly appreciated, as I am rather confused about which approach to take.
Thank you very much for any help provided.
1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999
1000039 - this is fairly easy
PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge - how do you know where this ends?
PONTIN D.O.O. - is it always in capital letters, and is the first word always the same as the first word in no. 2?
0600888913 - this is fairly easy, if there is no space in between
19990706 - is it always 8 digits?
13 - how do you know that it's 13, and not 613?
Zadarska - this is fairly easy
5207 - this as well
Zadar - and this
71951 - this the same
Zadar - and this
23000 - and this
Gazenicka Cesta - will this always be followed by a digit?
32 - are the last three numbers always space-separated? Then it is also fairly easy
5190019950217 - see 13
999 - see 13
To parse that I would use a regex, but please answer the questions above before I create it.
Michal
Before answering your question, there are a few things that need to be cleared up
Of course, no problem.
PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge - how do you know where this ends?
PONTIN D.O.O. - is it always in capital letters, and is the first word always the same as the first word in no. 2?
The first word is always written in capital letters. And yes, the first word is always the same as the first word in no. 2 (PONTIN, for example).
0600888913 - this is fairly easy, if there is no space in between
Honestly, I'm not sure. But, for the sake of conversation, let's assume there is no space in between.
19990706 - is it always 8 digits?
Yes. It's a date of registration and can only be in 8-digits format, no more or less than 8.
13 - how do you know that it's 13, and not 613?
I don't need to know why it is 13. It's just like that in the txt file I got. That number is the geographical identifier (something like a county code) for my country. It is always a two-digit number. Is it going to be a problem to take the number 13 (in this example) on its own, without taking it along with the 19990706 which I need for column 5?
Gazenicka Cesta - will this always be followed by a digit?
In 98% of cases, yes. Gazenicka Cesta is simply a street name. What follows after it is a house number. Now, that number (32, for example) can be 4 digits long, and can contain a letter, a number or a combination of the two.
32 - are the last three numbers always space-separated? Then it is also fairly easy
I assume you were thinking of the number 999 and not 32? Yes, the last three numbers are always space-separated.
To parse that I would use a regex, but please answer the questions above before I create it.
Michal, thank you very much for your cooperation. I knew I should probably go for this with regexes, but unfortunately I don't know regular expressions that well yet. I hope you'll find this reply useful.
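For what it's worth, the fixed widths confirmed above (an 8-digit registration date followed directly by a two-digit county code) are what make the fused token safe to split. Here is a minimal sketch, assuming the token always has that shape; the function name splitDateAndCounty and the array keys are made up for illustration:

```php
<?php
// Hypothetical helper: split a fused token such as "1999070613Zadarska"
// into the 8-digit date, the 2-digit county code and the county name.
function splitDateAndCounty(string $token): ?array
{
    // {8} and {2} are exact counts, so the date can never swallow the code.
    if (!preg_match('/^(\d{8})(\d{2})(\D.*)$/s', $token, $m)) {
        return null; // token does not have the assumed shape
    }
    // Optional sanity check (an assumption, the thread does not ask for it):
    // the first 8 digits must form a real YYYYMMDD calendar date.
    $year  = (int) substr($m[1], 0, 4);
    $month = (int) substr($m[1], 4, 2);
    $day   = (int) substr($m[1], 6, 2);
    if (!checkdate($month, $day, $year)) {
        return null;
    }
    return array('date' => $m[1], 'county_code' => $m[2], 'county_name' => $m[3]);
}

print_r(splitDateAndCounty('1999070613Zadarska'));
```

Because the quantifiers are exact, the "13 or 613?" ambiguity does not arise as long as the date really is always 8 digits.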
<?php
$str = "1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$temp = $str; // copy the original data, so that the original will not be changed
$temp_result = array(); // variable to hold data from preg_match
$parsed = array();

$pattern = '@(^[0-9]+) ([a-z]+) @i';
if (!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data"); // or do whatever you want on error, e.g. continue

$parsed[1] = $temp_result[1]; // 1000039
// get rid of parsed text; substr() is used instead of ltrim() because ltrim()
// strips a set of characters, not a prefix, and could eat leading digits of the name
$temp = substr($temp, strlen($parsed[1]) + 1);

$temp_result = strpos($temp, $temp_result[2], 1); // position of the second "PONTIN"
if ($temp_result === false) die("Error, could not parse the data");
$parsed[2] = substr($temp, 0, $temp_result - 1); // PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge
$temp = substr($temp, $temp_result); // get rid of parsed text
// $temp is now: PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999

$pattern = '@(^[^0-9]+)([0-9]+) ([0-9]{8})([0-9]+)([a-z]+) ([0-9]+) ([a-z]+) ([0-9]+) ([a-z]+) ([0-9]+)([^0-9]+)([0-9]+) ([0-9]+) ([0-9]+)@i';
if (!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data"); // or do whatever you want on error, e.g. continue

$parsed[3] = trim($temp_result[1]); // PONTIN D.O.O.
$parsed[4] = $temp_result[2]; // 0600888913
$parsed[5] = $temp_result[3]; // 19990706
$parsed[6] = $temp_result[4]; // 13
$parsed[7] = $temp_result[5]; // Zadarska
$parsed[8] = $temp_result[6]; // 5207
$parsed[9] = $temp_result[7]; // Zadar
$parsed[10] = $temp_result[8]; // 71951
$parsed[11] = $temp_result[9]; // Zadar
$parsed[12] = $temp_result[10]; // 23000
$parsed[13] = trim($temp_result[11]); // Gazenicka Cesta
$parsed[14] = $temp_result[12]; // 32
$parsed[15] = $temp_result[13]; // 5190019950217
$parsed[16] = $temp_result[14]; // 999
?>
It worked for me. I hope it will for you.
Regards
Michal
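To go from one line to the whole 65 MB file, one option is to read it line by line, so it never has to fit in memory, and insert the rows with a prepared statement. This is a rough sketch and has not been run against the real data. The table name companies, the 16-column layout, the function names readRecords and importFile, and the $parseLine callback are all assumptions. The callback would wrap the parsing code above and return the $parsed array, or null on failure.

```php
<?php
// Hypothetical helper: yield the non-empty lines of a file one at a time.
function readRecords(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = rtrim($line, "\r\n");
            if ($line !== '') {
                yield $line;
            }
        }
    } finally {
        fclose($handle);
    }
}

// Hypothetical import loop. Assumes a table "companies" with 16 columns in
// the same order as $parsed[1]..$parsed[16], and a PDO connection in
// exception error mode (the default since PHP 8).
function importFile(PDO $pdo, string $path, callable $parseLine, int $batchSize = 1000): int
{
    $sql = 'INSERT INTO companies VALUES (' . implode(', ', array_fill(0, 16, '?')) . ')';
    $stmt = $pdo->prepare($sql);
    $inserted = 0;
    $pdo->beginTransaction();
    foreach (readRecords($path) as $line) {
        $row = $parseLine($line);
        if ($row === null) {
            continue; // unparseable line: skip it here, or log it for review
        }
        $stmt->execute(array_values($row));
        if (++$inserted % $batchSize === 0) {
            $pdo->commit(); // commit in batches to keep transactions small
            $pdo->beginTransaction();
        }
    }
    $pdo->commit();
    return $inserted;
}
```

Skipping bad lines instead of calling die() lets one run report how many records the pattern cannot handle yet.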
<?php
$str = "1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$temp = $str; // copy the original data, so that the original will not be changed
$temp_result = array(); // variable to hold data from preg_match
$parsed = array();

$pattern = '@(^[0-9]+) ([^ ]+) @i';
if (!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data"); // or do whatever you want on error, e.g. continue

$parsed[1] = $temp_result[1]; // 1000039
// get rid of parsed text; substr() is used instead of ltrim() because ltrim()
// strips a set of characters, not a prefix, and could eat leading digits of the name
$temp = substr($temp, strlen($parsed[1]) + 1);

$temp_result = strpos($temp, trim($temp_result[2], ","), 1); // next occurrence of the first word
if ($temp_result === false) die("Error, could not parse the data");
$parsed[2] = substr($temp, 0, $temp_result - 1); // PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge
$temp = substr($temp, $temp_result); // get rid of parsed text

// the last field is three characters preceded by a space; take it off the end first
if (substr($temp, -4, 1) == " ") $parsed[16] = substr($temp, -3, 3);
$temp = substr($temp, 0, strlen($temp) - 4);
// $temp is now: PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217

$pattern = '@(^[^0-9]+)([0-9]+) ([0-9]{8})([0-9]+)([^0-9]+)([0-9]+) ([^0-9]+)([0-9]+) ([^0-9]+)([0-9]+)([^0-9]+)([^ ]+) ([0-9 ]+)@i';
if (!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data"); // or do whatever you want on error, e.g. continue

$parsed[3] = trim($temp_result[1]); // PONTIN D.O.O.
$parsed[4] = $temp_result[2]; // 0600888913
$parsed[5] = $temp_result[3]; // 19990706
$parsed[6] = $temp_result[4]; // 13
$parsed[7] = trim($temp_result[5]); // Zadarska
$parsed[8] = $temp_result[6]; // 5207
$parsed[9] = trim($temp_result[7]); // Zadar
$parsed[10] = $temp_result[8]; // 71951
$parsed[11] = trim($temp_result[9]); // Zadar
$parsed[12] = $temp_result[10]; // 23000
$parsed[13] = trim($temp_result[11]); // Gazenicka Cesta
$parsed[14] = $temp_result[12]; // 32
$parsed[15] = $temp_result[13]; // 5190019950217

print_r($parsed);
?>
Works for most of them.
With sheer luck for:
1000201 B I R O M A D.O.O. ZA UREDSKOM OPREMOM I KANCELARIJSKIM MATERIJALOM B I R O M A D.O.O. 080235620 1999052742Grad Zagreb 1353 Grad Zagreb 73150 Zagreb 12000ČIKOŠEVA 10/b 5238112 950233 999
But not for:
$str = "1000560 GO - MO, D.O.O. ZA UGOSTITELJSTVO,TRGOVINU GO - MO D.O.O. 040043692 1997011614Bistarska 4485 Umag 66761 Umag 55370MATTEA BENUSSIA 1 5543015450216 999";
PS. Does anybody know how to find the second repetition? I mean, the position of the second B I R O M A D.O.O. and the second GO - MO?
Regards
Michal
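On the PS question, one possible way to locate the second repetition is to stop searching for the first word as a plain substring (which is why "GO" is found inside "UGOSTITELJSTVO") and instead look for its last occurrence as a whole word. This is only a sketch under two assumptions taken from the samples in this thread: the short name is always followed by " number 8-digit-date 2-digit-code", and the short name starts with the same first word as the long name. The function name splitNames and the array keys are invented for the example.

```php
<?php
// Hypothetical helper: split the start of a record into id, long name, short name.
function splitNames(string $line): ?array
{
    // Lazily capture everything between the id and the numeric anchor
    // " <number> <8-digit date><2-digit code>".
    if (!preg_match('/^(\d+) (.+?) \d+ \d{8}\d{2}/', $line, $m)) {
        return null;
    }
    $names = $m[2];
    $firstWord = rtrim(explode(' ', $names, 2)[0], ',');
    if ($firstWord === '') {
        return null;
    }
    // Whole-word matches only: preceded by a space (so never at offset 0 and
    // never inside another word), followed by a space, a comma or the end.
    $pattern = '/(?<= )' . preg_quote($firstWord, '/') . '(?=[ ,]|$)/';
    if (!preg_match_all($pattern, $names, $hits, PREG_OFFSET_CAPTURE)) {
        return null; // the first word never repeats
    }
    $last = end($hits[0]); // array(matched text, byte offset)
    $offset = $last[1];
    return array(
        'id'         => $m[1],
        'long_name'  => rtrim(substr($names, 0, $offset)),
        'short_name' => substr($names, $offset),
    );
}

print_r(splitNames("1000560 GO - MO, D.O.O. ZA UGOSTITELJSTVO,TRGOVINU GO - MO D.O.O. 040043692 1997011614Bistarska 4485 Umag 66761 Umag 55370MATTEA BENUSSIA 1 5543015450216 999"));
```

Taking the last whole-word match would still go wrong if the short name itself repeats its first word, so it is worth logging any line where the result looks suspicious.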