How to write a custom .txt parser in PHP?

A rather large .txt file needs to be parsed with PHP and imported into MySQL

         

matun

8:58 am on Feb 16, 2007 (gmt 0)

10+ Year Member



Hello everybody. I have found WebmasterWorld to be the best place to ask a technical question and get a relevant answer, so I will ask about something that has been bothering me, as I am a novice PHP programmer.

So, I have a large file.txt (~65 MB), and I want to parse its data into MySQL. The contents of the text file look like this (it is not a typo when two words, or a number and a word, are written together; that is how they appear in the original file):


1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999

I want to insert this data as follows (in order of appearance):

1000039
- insert into column 1
PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge
- insert into column 2
PONTIN D.O.O.
- insert into column 3
0600888913
- insert into column 4
19990706
- insert into column 5
13
- insert into column 6
Zadarska
- insert into column 7
5207
- insert into column 8
Zadar
- insert into column 9
71951
- insert into column 10
Zadar
- insert into column 11
23000
- insert into column 12
Gazenicka Cesta
- insert into column 13
32
- insert into column 14
5190019950217
- insert into column 15
999
- insert into column 16

What approach should I take to solve this problem? Any help is highly appreciated, as I am rather confused about where to start.

Thank you very much for any help provided.

mcibor

10:21 am on Feb 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Before answering your question, a few things need to be cleared up:

1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999

1000039 - this is fairly easy
PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge - how do you know where this ends?
PONTIN D.O.O. - is it always in capital letters? Is the first word always the same as the first word in no. 2?
0600888913 - this is fairly easy, if there is no space in between
19990706 - is it always 8 digits?
13 - how do you know that it's 13, and not 613?
Zadarska - this is fairly easy
5207 - this as well
Zadar - and this
71951 - this the same
Zadar - and this
23000 - and this
Gazenicka Cesta - will this always be followed by a digit?
32 - are the last three numbers always space-separated? Then it is also fairly easy
5190019950217 - see 13
999 - see 13

To parse that I would use a regex, but please answer the questions before I write it.

Michal

matun

11:44 am on Feb 16, 2007 (gmt 0)

10+ Year Member



Before answering your question, there are few things that need to be cleared

Of course, no problem.

PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge - how do you know, where this ends?
PONTIN D.O.O. - is it always capital letters, is the first word alway the same as first word in no 2?

The first word is always written in capital letters. And yes, the first word is always the same as the first word in no. 2 (e.g. PONTIN).

0600888913 - this is fairly easy, if there is no space in between

Honestly, I'm not sure. But, for the sake of conversation, let's assume there is no space in between.

19990706 - is it always 8 digits?

Yes. It's the date of registration and is always in 8-digit format, no more and no less.

13 - how do you know that it's 13, and not 613?

I don't need to know why it is 13. It's just like that in the txt file I got. That number is a geographical identifier (something like a county code) for my country. It is always a two-digit number. Is it going to be a problem to take the number 13 (in this example) on its own, separately from the 19990706 that I need for column 5?

Gazenicka Cesta - will this always be followed by a digit?

In 98% of cases, yes. Gazenicka Cesta is simply a street name. What follows it is a house number. That number (e.g. 32) can be up to 4 characters long and can contain a letter, a number, or a combination of the two.

32 - are the three last numbers space separated always? Then it is also fairly easy

I assume you were thinking of the number 999 and not 32? Yes, the last three numbers are always space-separated.


To parse that I would perform a regex, but please answer the question before creating the regex

Michal, thank you very much for your cooperation. I knew I should probably tackle this with regexes, but unfortunately I don't know regular expressions that well yet. I hope you'll find this reply useful.
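
Given the answers above (registration date always 8 digits, county code always 2 digits), the glued token can be split by width rather than guessed. A minimal sketch, assuming the token has already been isolated; the function name splitDateToken is made up for illustration and is not from the thread:

```php
<?php
// Hypothetical helper: split a token such as "1999070613Zadarska" into
// the registration date (8 digits), the county code (2 digits) and whatever follows.
function splitDateToken(string $token): ?array
{
    if (!preg_match('/^(\d{8})(\d{2})(.*)$/', $token, $m)) {
        return null; // token does not start with 10 digits
    }
    return ['date' => $m[1], 'county' => $m[2], 'rest' => $m[3]];
}

print_r(splitDateToken('1999070613Zadarska'));
```

Because both widths are fixed, there is no ambiguity between 13 and 613: the first 8 digits are the date and the next 2 are the code.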

mcibor

10:59 am on Feb 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To solve your problem I will use both regex and string functions. With just regex I don't know how to do that.

<?php
$str = "1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$temp = $str; //copy the original data, so that the original will not be changed
$temp_result = array(); //variable to hold data from preg_match
$parsed = array();

$pattern = '@(^[0-9]+) ([a-z]+) @i';
if(!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data");//or do whatever you want on error, eg continue

$parsed[1] = $temp_result[1]; // 1000039
$temp = substr($temp, strlen($parsed[1]) + 1);//get rid of parsed text; ltrim() would strip a set of characters, not a prefix

$pos = strpos($temp, $temp_result[2], 1);//position where the first word repeats
if($pos === false) die("Error, could not find the short name");//or do whatever you want on error, eg continue
$parsed[2] = substr($temp, 0, $pos - 1);//PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge
$temp = substr($temp, $pos);//get rid of parsed text

//PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$pattern = '@(^[^0-9]+)([0-9]+) ([0-9]{8})([0-9]+)([a-z]+) ([0-9]+) ([a-z]+) ([0-9]+) ([a-z]+) ([0-9]+)([^0-9]+)([0-9]+) ([0-9]+) ([0-9]+)@i';
if(!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data");//or do whatever you want on error, eg continue

$parsed[3] = trim($temp_result[1]);//PONTIN D.O.O.
$parsed[4] = $temp_result[2];//0600888913
$parsed[5] = $temp_result[3];//19990706
$parsed[6] = $temp_result[4];//13
$parsed[7] = $temp_result[5];//Zadarska
$parsed[8] = $temp_result[6];//5207
$parsed[9] = $temp_result[7];//Zadar
$parsed[10] = $temp_result[8];//71951
$parsed[11] = $temp_result[9];//Zadar
$parsed[12] = $temp_result[10];//23000
$parsed[13] = trim($temp_result[11]);//Gazenicka Cesta
$parsed[14] = $temp_result[12];//32
$parsed[15] = $temp_result[13];//5190019950217
$parsed[16] = $temp_result[14];//999

?>

It worked for me. I hope it will for you.

Regards
Michal
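
For a file of about 65 MB it is probably safer to read one line at a time than to load everything at once. The sketch below assumes one record per line. readRecords() and importRecords(), the table name companies and the column names c1..c16 are placeholders, not anything from the thread. The parser is passed in as a callable, so the code above can be wrapped in a function and plugged in.

```php
<?php
// Yield one non-empty line at a time so memory use stays flat.
function readRecords(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = rtrim($line, "\r\n");
            if ($line !== '') {
                yield $line;
            }
        }
    } finally {
        fclose($handle);
    }
}

// Insert parsed rows with one prepared statement inside a transaction.
// $parse is assumed to return an array of 16 values, or null for a line it cannot parse.
// Table "companies" and columns c1..c16 are hypothetical.
function importRecords(PDO $pdo, iterable $lines, callable $parse): int
{
    $cols = [];
    for ($i = 1; $i <= 16; $i++) {
        $cols[] = "c$i";
    }
    $sql = 'INSERT INTO companies (' . implode(', ', $cols) . ') VALUES ('
         . implode(', ', array_fill(0, 16, '?')) . ')';
    $stmt = $pdo->prepare($sql);

    $count = 0;
    $pdo->beginTransaction();
    foreach ($lines as $line) {
        $row = $parse($line);
        if ($row === null) {
            continue; // skip (or log) lines that could not be parsed
        }
        $stmt->execute(array_values($row));
        $count++;
    }
    $pdo->commit();
    return $count;
}
```

A call might look like importRecords($pdo, readRecords('file.txt'), 'parseLine'), where parseLine is your own wrapper around the parsing code. Skipping bad lines instead of calling die() lets one malformed record be logged without aborting the whole import.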

mcibor

11:14 am on Feb 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Because there were further restrictions, here is a revised parser:


<?php
$str = "1000039 PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$temp = $str; //copy the original data, so that the original will not be changed
$temp_result = array(); //variable to hold data from preg_match
$parsed = array();

$pattern = '@(^[0-9]+) ([^ ]+) @i';
if(!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data");//or do whatever you want on error, eg continue

$parsed[1] = $temp_result[1]; // 1000039
$temp = substr($temp, strlen($parsed[1]) + 1);//get rid of parsed text; ltrim() would strip a set of characters, not a prefix

$pos = strpos($temp, trim($temp_result[2], ","), 1);//position where the first word repeats
if($pos === false) die("Error, could not find the short name");//or do whatever you want on error, eg continue
$parsed[2] = substr($temp, 0, $pos - 1);//PONTIN drustvo s ogranicenom odgovornoscu za ribarstvo,preradu ribe, trgovinu, turizam i usluge
$temp = substr($temp, $pos);//get rid of parsed text

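if(substr($temp, -4, 1) == " ") {//the last field is a space followed by three characters
$parsed[16] = substr($temp, -3);//999
$temp = substr($temp, 0, -4);//get rid of parsed text only when the field was found
}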

//PONTIN D.O.O. 0600888913 1999070613Zadarska 5207 Zadar 71951 Zadar 23000Gazenicka Cesta 32 5190019950217 999";
$pattern = '@(^[^0-9]+)([0-9]+) ([0-9]{8})([0-9]+)([^0-9]+)([0-9]+) ([^0-9]+)([0-9]+) ([^0-9]+)([0-9]+)([^0-9]+)([^ ]+) ([0-9 ]+)@i';
if(!preg_match($pattern, $temp, $temp_result)) die("Error, could not parse the data");//or do whatever you want on error, eg continue

$parsed[3] = trim($temp_result[1]);//PONTIN D.O.O.
$parsed[4] = $temp_result[2];//0600888913
$parsed[5] = $temp_result[3];//19990706
$parsed[6] = $temp_result[4];//13
$parsed[7] = trim($temp_result[5]);//Zadarska
$parsed[8] = $temp_result[6];//5207
$parsed[9] = trim($temp_result[7]);//Zadar
$parsed[10] = $temp_result[8];//71951
$parsed[11] = trim($temp_result[9]);//Zadar
$parsed[12] = $temp_result[10];//23000
$parsed[13] = trim($temp_result[11]);//Gazenicka Cesta
$parsed[14] = $temp_result[12];//32
$parsed[15] = $temp_result[13];//5190019950217

print_r($parsed);
?>

It works for most of the records.

It works, by sheer luck, for:
1000201 B I R O M A D.O.O. ZA UREDSKOM OPREMOM I KANCELARIJSKIM MATERIJALOM B I R O M A D.O.O. 080235620 1999052742Grad Zagreb 1353 Grad Zagreb 73150 Zagreb 12000ČIKOŠEVA 10/b 5238112 950233 999
But not for:
$str = "1000560 GO - MO, D.O.O. ZA UGOSTITELJSTVO,TRGOVINU GO - MO D.O.O. 040043692 1997011614Bistarska 4485 Umag 66761 Umag 55370MATTEA BENUSSIA 1 5543015450216 999";

PS. Does anybody know how to find the second repetition? I mean the position of the second B I R O M A D.O.O. and the second GO - MO.

Regards
Michal
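
One possible way to find the second repetition, offered as a sketch rather than a tested solution for the whole file: cut the record where the registration number starts, then take the last place in that part where the first word occurs as a whole word. This avoids matching GO inside UGOSTITELJSTVO. The function name splitNames is made up for illustration, and the approach assumes the short name begins with the same first word and does not repeat that word itself.

```php
<?php
// Hypothetical helper. $text is one record with the leading ID already removed.
function splitNames(string $text): ?array
{
    // 1. The names end where " <registration number> <8-digit date><2-digit code>" begins.
    if (!preg_match('/ \d+ \d{10}/', $text, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $head = substr($text, 0, $m[0][1]);

    // 2. First word of the long name, without a trailing comma.
    if (!preg_match('/^[^ ,]+/', $head, $w)) {
        return null;
    }

    // 3. Every occurrence of that word as a whole word
    //    (delimited by a space, a comma or the edge of the string).
    $pattern = '/(?<![^ ,])' . preg_quote($w[0], '/') . '(?![^ ,])/';
    preg_match_all($pattern, $head, $hits, PREG_OFFSET_CAPTURE);
    $last = end($hits[0]);
    if ($last === false || $last[1] === 0) {
        return null; // the first word never repeats
    }

    return [
        'long'  => rtrim(substr($head, 0, $last[1])),
        'short' => substr($head, $last[1]),
        'rest'  => substr($text, $m[0][1] + 1),
    ];
}

print_r(splitNames('GO - MO, D.O.O. ZA UGOSTITELJSTVO,TRGOVINU GO - MO D.O.O. 040043692 1997011614Bistarska 4485 Umag 66761 Umag 55370MATTEA BENUSSIA 1 5543015450216 999'));
```

Taking the last whole-word match rather than the first one after offset 1 is the key difference from strpos(). Records where the short name starts with a different word would still need manual handling.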