Accented characters in filenames

         

typesk

3:52 am on May 27, 2022 (gmt 0)



I am running PHP 7.0.10 on Windows

I have this function to grab all the filenames in a directory.

function getDirectoryListing($folder) {
    $aryListing = array();

    $dir = new RecursiveDirectoryIterator($folder, FilesystemIterator::SKIP_DOTS);

    // Flatten the recursive iterator, folders come before their files
    $it = new RecursiveIteratorIterator($dir, RecursiveIteratorIterator::SELF_FIRST);

    foreach ($it as $fileinfo) {
        if ($fileinfo->isFile()) {
            $f = array();
            $f['file'] = $fileinfo->getFilename();
            $f['dir'] = "\\" . $it->getSubPath();
            $f['pathfile'] = $it->getSubPathName();
            $f['size'] = $fileinfo->getSize();
            $f['size_human'] = bytesToHuman($fileinfo->getSize());
            $f['time_mod'] = $fileinfo->getMTime();
            $f['time_mod_full'] = date('F j, Y, g:i a', $fileinfo->getMTime());
            $aryListing[] = $f;
        } elseif ($fileinfo->isDir()) {
            //print($fileinfo->__toString() . PHP_EOL);
        } else {
            // echo $fileinfo->getFilename(); // what
        }
    }

    return $aryListing;
}
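
The function above calls bytesToHuman(), which is not shown in the post. For anyone trying to run the listing, here is a minimal sketch of what such a helper might look like. The unit labels, the 1024 divisor, and the two-decimal rounding are all assumptions, not the original implementation.

```php
<?php
// Hypothetical stand-in for the bytesToHuman() helper the listing calls.
// The original implementation is not shown in the thread.
function bytesToHuman($bytes) {
    $units = array('B', 'KB', 'MB', 'GB', 'TB');
    $value = (float) $bytes;
    $i = 0;

    // Divide by 1024 until the value fits the current unit
    // or the largest unit is reached.
    while ($value >= 1024 && $i < count($units) - 1) {
        $value /= 1024;
        $i++;
    }

    // Plain bytes need no decimals; larger units get two.
    return $i === 0
        ? sprintf('%d %s', $value, $units[$i])
        : sprintf('%.2f %s', $value, $units[$i]);
}
```

With this in place, a call such as bytesToHuman($fileinfo->getSize()) returns a short label for the size_human field.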


A couple of directories contain filenames with accented characters, such as ğ [en.wikipedia.org...] or ě [en.wikipedia.org...]

Filenames with those characters seem to come back with a plain e or g instead. When the script then checks whether the entry is a file or a directory, both checks return false, because it is looking for test_e.txt rather than the real filename with the accented e.

This does not happen with all accented characters. For example, a filename with an é, as in café, works fine.

Is there any way to get the actual filename back from the function, without having to rename the files manually to suit the script? I have tried scandir and get the same results.

Thanks

tangor

9:48 am on May 27, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@typesk: Welcome to Webmasterworld.

Are these user generated files? Just asking since you indicate that manually renaming is being done. Do the filtering/replacing in the upload script first. Not THE answer, but is AN answer.

lucy24

5:59 pm on May 27, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“This does not happen with all accented characters.”
That would be because the ones like é or ü or ñ are in Latin-1, while the examples from the previous paragraph (g-breve and e-hacek *, looks like) aren’t, so the function “flattens” them to their diacritic-less form. Can your php function’s encoding be changed?

* The unicode consortium calls it a “caron”, but literally everyone else in the entire universe says it’s a hacek.
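
The Latin-1 distinction can be checked with a small sketch. It assumes the filename is available as a UTF-8 string, and fitsLatin1() is a hypothetical helper name, not a PHP built-in. It tests against the Latin-1 range U+0000 to U+00FF; a Windows code page such as Windows-1252 differs from Latin-1 in a few positions, so treat the result as an approximation.

```php
<?php
// Hypothetical helper: true when every character of a UTF-8 string
// lies in the Latin-1 range U+0000..U+00FF.
function fitsLatin1($utf8Name) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/^[\x{00}-\x{FF}]*$/u', $utf8Name) === 1;
}

// Escapes are used so the example does not depend on the file's own encoding.
$names = array(
    "caf\u{E9}.txt",    // é, U+00E9
    "test_\u{11B}.txt", // ě, U+011B
    "test_\u{11F}.txt", // ğ, U+011F
);

foreach ($names as $name) {
    echo $name, ': ', fitsLatin1($name) ? 'fits Latin-1' : 'outside Latin-1', PHP_EOL;
}
```

Under these assumptions, the é name passes the check while the ě and ğ names do not, which matches the pattern described in the first post.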

typesk

6:14 am on May 28, 2022 (gmt 0)



“Are these user generated files? Just asking since you indicate that manually renaming is being done. Do the filtering/replacing in the upload script first. Not THE answer, but is AN answer.”


I know I can probably find all the files that have this issue (I estimate there are only a handful in total). I was just hoping to fix the script instead.

“Can your php function’s encoding be changed?”


What do you mean? I can modify the function as needed.

lucy24

4:29 pm on May 28, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Disclaimer: I’m not familiar with the function(s) involved, and I only speak about three words of php. But sometimes there is a way to change the encoding used by php (cgi, perl, whatever). First step might be to step through the function and find out at exactly what point the accented characters are being “flattened”. Clearly the server as a whole uses utf-8, or those unusual characters wouldn’t be able to exist in the first place. But equally clearly it knows what the equivalents are, or it wouldn't know to change ā to a and so on.
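
One way to find where the characters get flattened is to look at the raw bytes PHP hands back for each name. The sketch below is a hypothetical diagnostic (dumpFilenameBytes() is an invented name) built on scandir() and bin2hex(). In UTF-8 an ě is the two bytes c4 9b, while a plain e is the single byte 65, so a substituted name shows up immediately in the hex.

```php
<?php
// Hypothetical diagnostic: list each entry returned by scandir() together
// with the hex of its raw bytes, so character substitutions become visible.
function dumpFilenameBytes($folder) {
    $rows = array();
    $entries = scandir($folder);
    if ($entries === false) {
        return $rows; // folder could not be read
    }
    foreach ($entries as $name) {
        if ($name === '.' || $name === '..') {
            continue; // skip the dot entries
        }
        $rows[] = array('name' => $name, 'hex' => bin2hex($name));
    }
    return $rows;
}

// Usage: print the names in the current directory with their byte values.
foreach (dumpFilenameBytes('.') as $row) {
    echo $row['name'], ' => ', $row['hex'], PHP_EOL;
}
```

If the hex already shows 65 where the accented letter should be, the substitution happened before the name reached the script, and nothing later in the function can recover it.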

typesk

8:59 pm on May 28, 2022 (gmt 0)



“First step might be to step through the function and find out at exactly what point the accented characters are being ‘flattened’”


It seems to be having an issue on the very first call, RecursiveDirectoryIterator. I can't find anything in the documentation that might allow me to change encoding on the call.

I dunno, at this point I have been at this for a few days, and I think I am just going to go the route of renaming those particular files.

not2easy

9:46 pm on May 28, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I don't think lucy24 was suggesting that you encode the script differently, but rather that you try a different encoding for the process (PHP). AFAIK, the PHP default has been UTF-8 for the past several years, but you can change the encoding to see whether another one handles these characters better. See the explanation at php.net: [php.net...]
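
Before changing anything, it may help to see what the process is currently using. The sketch below is a hypothetical helper (describeEncodingEnvironment() is an invented name) that only reports settings. Two assumptions to verify against the PHP documentation: that default_charset affects output and string functions rather than filesystem calls, and that PHP on Windows gained UTF-8 path support in 7.1, which would make the 7.0.10 install in the first post predate it.

```php
<?php
// Hypothetical helper that gathers settings relevant to filename encoding.
// It only reports values; it does not change any configuration.
function describeEncodingEnvironment() {
    return array(
        'php_version'     => PHP_VERSION,
        'is_windows'      => strtoupper(substr(PHP_OS, 0, 3)) === 'WIN',
        'default_charset' => ini_get('default_charset'),
        // Assumption: 7.1 is the first release with UTF-8 path support on Windows.
        'at_least_7_1'    => version_compare(PHP_VERSION, '7.1.0', '>='),
    );
}

var_export(describeEncodingEnvironment());
echo PHP_EOL;
```

If is_windows is true and at_least_7_1 is false, testing the same directory under a newer PHP version would be a cheaper experiment than rewriting the function.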