Welcome to WebmasterWorld Guest from 54.225.16.10

Forum Moderators: coopster & jatar k & phranque

Message Too Old, No Replies

Splitting Mixed Delimiters in a CSV like data?

comma,"separated,fields",partly,qualified

     
10:28 pm on May 1, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 28, 2004
posts:7999
votes: 0


Task: Export data from offline spreadsheet proggie, upload and import to online database.

I've wrangled with this one on and off. Usually I take the "Google Base" approach and convince the client to quote qualify everything, or nothing, just stay consistent.

But some customers. You know. So what would you do to split up lines, something like this?

id,name,phone,address,city,state.....
1234,"Smith,Steve",123-4567,"124 ""5th"" Street, Apt 24",Anytown,PR.....

I could figure something out, or use a module, just wondering what others do with these.

The quote qualifiers may or may not be present, they won't always be present in any particular field (as in "name" above,) some export programs double-quote field value quotes as shown above . . .

What do you do with these?

2:07 am on May 2, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11080
votes: 106


it might be worth investigating cpan modules such as Text::CSV [search.cpan.org].
2:28 pm on May 2, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12548
votes: 2


I ran into this issue a few years ago too and ended up writing my own as part of my file processing class. The function below accepts each line of a file (csv, tsv, etc.). I cannot take full credit here as you see in the comments -- I peeled some of the logic from the PHP fputcsv function source code, which is written in C. I converted the logic to PHP which you should be able to easily adapt to perl (share the code here if you do please).
// {{{ _outputCsv 
/**
* PHP < 5.1.0 workaround for fputcsv
* PHP Source: /php/ext/standard/file.c (fputcsv)
* Find any enclosures?
* Escape with an additional enclosure
* Next, find any spaces, delimiters or enclosures?
* Wrap the whole works in a set of enclosures
*
* @param string $fields array to process
* @param string $delimiter field delimiter character
* @param string $enclosure field enclosure character
* @param string $newline newline character
* @return void printed output should be buffered
* @access private
*/
private function _outputCsv($fields, $delimiter, $enclosure, $newline)
{
$out = '';
$pattern = '/' . preg_quote($enclosure, '/') . '/';
$enclose = '/[\s' . preg_quote("$delimiter$enclosure", '/') . ']/';
$delimit = '/' . preg_quote($delimiter, '/') . '$/';
foreach ($fields as $field) {
$field = preg_replace($pattern, "$enclosure$enclosure", $field);
if (preg_match($enclose, $field)) {
$field = "$enclosure$field$enclosure";
}
$out .= "$field$delimiter";
}
print preg_replace($delimit, '', $out) . $newline;
}
/**
* End PHP < 5.1.0 workaround for fputcsv
*/
8:37 pm on May 2, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


perl comes with two modules:

Text::Balanced
Text::ParseWords

both or one could work for this type of parsing.

A quick test with ParseWords:


use Text::ParseWords;
$var = '1234,"Smith,Steve",123-4567,"124 ""5th"" Street, Apt 24",Anytown,PR';
@lists = &nested_quotewords(',', 0, $var);
for (@lists) {
print "$_\n" for @{$_};
}

output:

1234
Smith,Steve
123-4567
124 5th Street, Apt 24
Anytown
PR

See Text::ParseWords [perldoc.perl.org] for details

[edited by: phranque at 12:21 am (utc) on May 3, 2009]
[edit reason] added link [/edit]

7:19 pm on May 4, 2009 (gmt 0)

Full Member

10+ Year Member

joined:July 23, 2003
posts:228
votes: 0


Text::xSV is the best .csv module I've found. You can try to write your own program that "almost works", but wait until they put embedded newlines into the data, which is legal in a .csv file.
9:50 pm on May 4, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12548
votes: 2


Yes, embedded newlines are "legal", as are many other characters. The hardest part of this task is that there is no standard just guidelines, so to speak. I remember doing a lot of research before I wrote my own code to handle variations and tested thoroughly.

Side note: I just noticed I documented that method incorrectly, the "fields" param is of type "array", not string.

7:16 am on May 5, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Text::CSV_XS can handle the embedded characters issue also. I have never tried Text::xSV (never even heard of it until just now) probably because it is a very new module. Looks interesting though.
8:43 pm on May 13, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:May 8, 2008
posts: 74
votes: 0

Text::CSV from CPAN is best module for this purpose and will use faster Text::CSV_XS automatically if it is available.
 

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week

Featured Threads

Free SEO Tools

Hire Expert Members