MacOS Big Sur: unexpected diff behaviour - utf-8

On macOS Big Sur (Intel), I see the following unexpected behaviour, which I have not seen on Catalina.
Prerequisites:
My home directory /Users/foo resides on an APFS filesystem.
/mnt/win10 is an SMB share mounted from a Win 10 machine (20H2).
When a filename contains Unicode characters that have more than one representation (precomposed vs. decomposed), the representation changes when the file is copied from one filesystem to the other.
touch /Users/foo/ä
cp /Users/foo/ä /mnt/win10/.
Then diff /Users/foo/ä /mnt/win10/ä shows no differences but diff /Users/foo/. /mnt/win10/. will rather unexpectedly result in:
Only in /Users/foo/.: ä
Only in /mnt/win10/.: ä
Checking the encoding with a bit of Perl in the following script (call with `./script.pl /Users/foo/. /mnt/win10/.`):
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use String::Dump qw( dump_hex dump_bin );
# Print every directory entry together with a hex dump of its name.
while (my $d = shift) {
    opendir(D, "$d") || die "Can't open directory $d: $!\n";
    my @list = readdir(D);
    closedir(D);
    foreach my $f (@list) {
        print "\$f = $f\n";
        print dump_hex($f);
        print "\n";
    }
}
will give the rather unexpected result for the APFS filesystem /Users/foo
$f = ä
C3 A4
and for the SMB share /mnt/win10
$f = ä
61 CC 88
So on APFS we see the precomposed UTF-8 representation of 'ä', while on the SMB share we find the decomposed UTF-8 representation (a small 'a' followed by a combining diaeresis, i.e. the two dots).
I have never encountered this with Catalina (though I have not checked for a while), and I cannot find any documentation of a change in behaviour.
Needless to say that this obviously breaks things (Apple provided diff 2.8.1, homebrew provided diff 3.7). rsync 3.2.3 can work around this when you call it with --iconv=utf-8-mac,utf-8-mac (sic!).
I have my own personal opinion if an OS should behave like that.
What is the proper approach to handle this in a POSIX compliant manner?
Should I convert every filename with iconv(3) to utf-8-mac?
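The two byte sequences printed by the Perl script are the NFC (precomposed) and NFD (decomposed) normal forms of the same character. As a minimal sketch only, and using Python purely for illustration, names can be compared after normalizing both sides to one form. The helper names `hex_bytes` and `same_name` are made up for this example, and whether normalizing is the right policy for a given tool is an assumption, not something POSIX prescribes.

```python
import unicodedata

precomposed = "\u00e4"   # 'ä' as a single code point (NFC)
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS (NFD)


def hex_bytes(name):
    """Return the UTF-8 bytes of a name as space-separated hex."""
    return " ".join(f"{b:02X}" for b in name.encode("utf-8"))


def same_name(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(hex_bytes(precomposed))            # C3 A4
print(hex_bytes(decomposed))             # 61 CC 88
print(precomposed == decomposed)         # False: a plain comparison sees two names
print(same_name(precomposed, decomposed))  # True: equal once normalized
```

A plain string or byte comparison, which is presumably what diff does with the directory listings, treats the two forms as different names.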

Related

Why doesn't this path work to open a Windows file in PERL?

I tried to play with Strawberry Perl, and one of the things that stumped me was reading the files.
I tried to do:
open(FH, "D:\test\numbers.txt");
But it can not find the file (despite the file being there, and no permissions issues).
An equivalent code (100% of the script other than the filename was identical) worked fine on Linux.
As per Perl FAQ 5, you should be using forward slashes in your DOS/Windows filenames (or, as an alternative, escaping the backslashes).
Why can't I use "C:\temp\foo" in DOS paths? Why doesn't `C:\temp\foo.exe` work?
Whoops! You just put a tab and a formfeed into that filename! Remember that within double quoted strings ("like\this"), the backslash is an escape character. The full list of these is in Quote and Quote-like Operators in perlop. Unsurprisingly, you don't have a file called "c:(tab)emp(formfeed)oo" or "c:(tab)emp(formfeed)oo.exe" on your legacy DOS filesystem.
Either single-quote your strings, or (preferably) use forward slashes. Since all DOS and Windows versions since something like MS-DOS 2.0 or so have treated / and \ the same in a path, you might as well use the one that doesn't clash with Perl--or the POSIX shell, ANSI C and C++, awk, Tcl, Java, or Python, just to mention a few. POSIX paths are more portable, too.
So your code should be open(FH, "D:/test/numbers.txt"); instead, to avoid trying to open a file named "D:<TAB>est<NEWLINE>umbers.txt".
As an aside, you could further improve your code by using lexical (instead of global named) filehandle, a 3-argument form of open, and, most importantly, error-checking ALL your IO operations, especially open() calls:
open(my $fh, "<", "D:/test/numbers.txt") or die "Could not open file: $!";
Or, better yet, don't hard-code filenames in IO calls (the following practice MAY have let you figure out a problem sooner):
my $filename = "D:/test/numbers.txt";
open(my $fh, "<", $filename) or die "Could not open file $filename: $!";
Never use interpolated strings when you don't need interpolation! You are trying to open a file name with a tab character and a newline character in it from the \t and the \n!
Use single quotes when you don't need (or want) interpolation.
One of the biggest problems novice Perl programmers seem to run into is that they automatically use "" for everything without thinking. You need to understand the difference between "" and '' and you need to ALWAYS think before you type so that you choose the right one. It's a hard habit to get into, but it's vital if you're going to write good Perl.
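The FAQ entry quoted above lists Python among the languages where the backslash clashes with string escapes, so the same pitfall can be reproduced there. The following is only an illustrative sketch in Python, not Perl; the variable names are invented for the example. It shows that the "escaped" spelling silently contains a tab and a newline, while a raw string, doubled backslashes, or forward slashes keep the path intact.

```python
# "\t" and "\n" are escape sequences, so this is NOT the intended path.
escaped = "D:\test\numbers.txt"

# Three spellings that keep the path intact.
raw = r"D:\test\numbers.txt"        # raw string: backslashes are literal
doubled = "D:\\test\\numbers.txt"   # escaped backslashes
forward = "D:/test/numbers.txt"     # forward slashes, no escaping needed

print("\t" in escaped, "\n" in escaped)  # True True
print(raw == doubled)                    # True
print(len(escaped), len(raw))            # 17 19
```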

How do I write a file whose *filename* contains utf8 characters in Perl?

I am struggling to create a file whose name contains non-ASCII characters.
The following script works fine, if it is called with 0 as parameter but dies when called with 1.
The error message is open: Invalid argument at C:\temp\filename.pl line 15.
The script is started within cmd.exe.
I expect it to write a file whose name is either (depending on the parameter) äöü.txt or äöü☺.txt. But I fail to create the filename containing a smiley.
use warnings;
use strict;
use Encode 'encode';
# Text is stored in utf8 within *this* file.
use utf8;
my $with_smiley = $ARGV[0];
my $filename = 'äöü' .
($with_smiley ? '☺' : '' ).
'.txt';
open (my $fh, '>', encode('cp1252', $filename)) or die "open: $!";
print $fh "Filename: $filename\n";
close $fh;
I am probably missing something that is obvious to others, but I can't find, so I'd appreciate any pointer towards solving this.
First of all, saying "UTF-8 character" is weird. UTF-8 can encode any Unicode character, so the UTF-8 character set is the Unicode character set. That means you want to create a file whose name contains Unicode characters, and more specifically, Unicode characters that aren't in cp1252.
I've answered this on PerlMonks in the past. Answer copied below.
Perl treats file names as opaque strings of bytes. That means that file names need to be encoded as per your "locale"'s encoding (ANSI code page).
In Windows, code page 1252 is commonly used, and thus the encoding is usually cp1252.* However, cp1252 doesn't support Tamil and Hindi characters [or "☺"].
Windows also provides a "Unicode" aka "Wide" interface, but Perl doesn't provide access to it using builtins**. You can use Win32API::File's CreateFileW, though. IIRC, you still need to encode the file name yourself. If so, you'd use UTF-16le as the encoding.
Aforementioned Win32::Unicode appears to handle some of the dirty work of using Win32API::File for you. I'd also recommend starting with that.
* — The code page is returned (as a number) by the GetACP system call. Prepend "cp" to get the encoding.
** — Perl's support for Windows sucks in some respects.
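The encoding limits described above can be checked without touching the filesystem. This is a minimal sketch in Python rather than Perl, with a made-up helper `can_encode`; it only shows which names fit into cp1252 and what the UTF-16LE form handed to a wide API would look like. The lossy variant substitutes "?" for the smiley, which is presumably similar to what the question's encode('cp1252', ...) call produces; since "?" is not allowed in Windows file names, that would be consistent with the "Invalid argument" error, but this link is an assumption.

```python
def can_encode(name, codec):
    """Return True if every character of name exists in the given codec."""
    try:
        name.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


plain = "äöü.txt"
smiley = "äöü☺.txt"

print(can_encode(plain, "cp1252"))    # True
print(can_encode(smiley, "cp1252"))   # False: U+263A is not in cp1252

# Lossy encoding replaces the unsupported character with "?".
lossy = smiley.encode("cp1252", errors="replace")
print(lossy)                          # b'\xe4\xf6\xfc?.txt'

# UTF-16LE with a terminating NUL, the form a wide-character API expects.
wide = (smiley + "\0").encode("utf-16-le")
print(len(wide))                      # 18 (9 code units of 2 bytes each)
```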
The following runs on Windows 7, ActiveState Perl. It writes "hello there" to a file with hebrew characters in its name:
#-----------------------------------------------------------------------
# Unicode file names on Windows using Perl
# Philip R Brenan at gmail dot com, Appa Apps Ltd, 2013
#-----------------------------------------------------------------------
use feature ":5.16";
use Data::Dump qw(dump);
use Encode qw/encode decode/;
use Win32API::File qw(:ALL);
# Create a file with a unicode name
my $e = "\x{05E7}\x{05EA}\x{05E7}\x{05D5}\x{05D5}\x{05D4}".
"\x{002E}\x{0064}\x{0061}\x{0074}\x{0061}"; # File name in UTF-8
my $f = encode("UTF-16LE", $e); # Format supported by NTFS
my $g = eval dump($f); # Remove UTF ness
$g .= chr(0).chr(0); # 0 terminate string
my $F = Win32API::File::CreateFileW
($g, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0); # Create file via Win32API
say $^E if $^E; # Write any error message
# Write to the file
OsFHandleOpen(FILE, $F, "w") or die "Cannot open file";
binmode FILE;
print FILE "hello there\n";
close(FILE);
There is no need to encode the filename (at least not on Linux). This code works on my Linux system:
use warnings;
use strict;
# Text is stored in utf8 within *this* file.
use utf8;
my $with_smiley = $ARGV[0] || 0;
my $filename = 'äöü' .
($with_smiley ? '☺' : '' ).
'.txt';
open my $fh, '>', $filename or die "open: $!";
binmode $fh, ':utf8';
print $fh "Filename: $filename\n";
close $fh;
HTH, Paul

How do you create unicode file names in Windows using Perl

I have the following code
use utf8;
open($file, '>:encoding(UTF-8)', "さっちゃん.txt") or die $!;
print $file "さっちゃん";
But I get the file name as ã•ã£ã¡ã‚ƒã‚“.txt
I was wondering if there is a way of making this work as I would expect (meaning I get a Unicode file name) without resorting to Win32::API, Win32API::* or moving to another platform and using a Samba share to modify the files.
The intent is to ensure we do not have any Win32 specific modules that need to be loaded (even conditionally).
Perl treats file names as opaque strings of bytes. They need to be encoded as per your "locale"'s encoding (ANSI code page).
In Windows, this is usually cp1252. The code page number is returned by the GetACP system call (prepend "cp" to get the encoding name). However, cp1252 doesn't support Japanese characters.
Windows also provides a "Unicode" aka "Wide" interface, but Perl doesn't provide access to it using builtins*. You can use Win32API::File's CreateFileW, though. IIRC, you still need to encode the file name yourself. If so, you'd use UTF-16le as the encoding.
* — Perl's support for Windows sucks in some respects.
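The garbled name in the question is what the UTF-8 bytes of the intended name look like when they are read as cp1252. A minimal Python sketch of that effect follows (Python only for illustration). Dropping the byte values that cp1252 leaves undefined, such as 0x81, is an assumption chosen to mirror how the name was displayed.

```python
name = "さっちゃん.txt"

# The bytes Perl hands to the OS when the string is passed through as UTF-8.
utf8_bytes = name.encode("utf-8")

# The same bytes interpreted as cp1252 (the ANSI code page assumed here).
# Undefined cp1252 bytes such as 0x81 are dropped (errors="ignore").
garbled = utf8_bytes.decode("cp1252", errors="ignore")

print(garbled)  # ã•ã£ã¡ã‚ƒã‚“.txt
```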
Use Encode::Locale:
use utf8;
use Encode::Locale;
use Encode;
open($file, '>:encoding(UTF-8)', encode(locale_fs => "さっちゃん.txt") ) or die $!;
print $file "さっちゃん";
The following produces a unicoded file name on Windows 7 using Activestate Perl.
#-----------------------------------------------------------------------
# Unicode file names on Windows using Perl
# Philip R Brenan at gmail dot com, Appa Apps Ltd, 2013
#-----------------------------------------------------------------------
use feature ":5.16";
use Data::Dump qw(dump);
use Encode qw/encode decode/;
use Win32API::File qw(:ALL);
# Create a file with a unicode name
my $e = "\x{05E7}\x{05EA}\x{05E7}\x{05D5}\x{05D5}\x{05D4}".
"\x{002E}\x{0064}\x{0061}\x{0074}\x{0061}"; # File name in UTF-8
my $f = encode("UTF-16LE", $e); # Format supported by NTFS
my $g = eval dump($f); # Remove UTF ness
$g .= chr(0).chr(0); # 0 terminate string
my $F = Win32API::File::CreateFileW
($g, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0); # Create file via Win32API
say $^E if $^E; # Write any error message
# Write to the file
OsFHandleOpen(FILE, $F, "w") or die "Cannot open file";
binmode FILE;
print FILE "hello there\n";
close(FILE);

Perl regular expression problem

I have this conditional in a perl script:
if ($lnFea =~ m/^(\d+) qid\:([^\s]+).*?\#docid = ([^\s]+) inc = ([^\s]+) prob = ([^\s]+)$/)
and the $lnFea represents this kind of line:
0 qid:7968 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0.000000 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.005175 17:0.000000 18:0.181818 19:0.000000 20:0.003106 21:0.000000 22:0.000000 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0.000000 36:0.000000 37:0.000000 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.055556 44:0.000000 45:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125
The problem is that the if is true on Windows but false on Linux (Fedora 11). Both systems are using the most recent perl version. So what is the reason of this problem?
Assuming that $lnFea is read from a file, I'd wager that the file is in DOS format. That would cause the $ anchor to prevent matching on Linux due to differences in the line endings between those platforms. Perl's automagic newline transformation only works for platform-native text files. If the input file is in DOS format, the Linux box would see an extra carriage return before the end-of-line.
It's probably best to convert the input file to the native format for each platform. If that's not possible you should binmode the filehandle (preventing Perl from performing newline transformations) before reading from it and account for the various newline sequences in the regex and anywhere else the data is used.
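The effect of the stray carriage return can be reproduced with a shortened version of the line from the question. This is a sketch in Python, whose `$` and `\s` behave the same way for this case; the sample line is abbreviated and the variable names are invented for the example. Stripping the line ending before matching is one possible way to account for both newline conventions.

```python
import re

pattern = re.compile(
    r"^(\d+) qid:(\S+).*?#docid = (\S+) inc = (\S+) prob = (\S+)$"
)

# Abbreviated sample line (most feature columns omitted).
line = "0 qid:7968 1:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125"

unix_line = line + "\n"    # native line ending on Linux
dos_line = line + "\r\n"   # DOS line ending read without translation

print(bool(pattern.match(unix_line)))   # True: $ matches before the final newline
print(bool(pattern.match(dos_line)))    # False: the "\r" sits between the number and $

# Removing any trailing CR/LF first makes the match independent of the file format.
m = pattern.match(dos_line.rstrip("\r\n"))
print(m.groups())  # ('0', '7968', 'GX000-00-0000000', '1', '0.0214125')
```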

Batch renaming of files with international chars on Windows XP

I have a whole bunch of files with filenames using our lovely Swedish letters å, ä and ö.
For various reasons I now need to convert these to an [a-zA-Z] range. Just removing anything outside this range is fairly easy. The thing that's causing me trouble is that I'd like to replace å with a, ö with o and so on.
This is charset troubles at their worst.
I have a set of test files:
files\Copy of New Text Documen åäö t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase åäöÅÄÖéÉ.txt
I'm basing my script on this line, piping its results into various commands
for %%X in (files\*.txt) do (echo %%X)
The weird thing is that if I print the results of this (the plain for-loop that is) into a file I get this output:
files\Copy of New Text Documen †„” t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase †„”Ž™‚.txt
So something weird is happening to my filenames before they even reach the other tools (I've been trying to do this using a sed port for Windows from something called GnuWin32 but no luck so far) and doing the replace on these characters doesn't help either.
How would you solve this problem? I'm open to any type of tools, commandline or otherwise…
EDIT: This is a one time problem, so I'm looking for a quick 'n ugly fix
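The garbled output above is consistent with the names being written in the console's OEM code page and then viewed in the ANSI code page. The following Python sketch reproduces that under the assumption that the OEM code page is cp850 and the ANSI code page is cp1252 (typical for Western European Windows, but not confirmed by the question); the function names are made up for the example.

```python
def as_seen_in_ansi(name, oem="cp850", ansi="cp1252"):
    """Encode with the console (OEM) code page, then read the bytes as ANSI."""
    return name.encode(oem).decode(ansi, errors="replace")


def repair(garbled, oem="cp850", ansi="cp1252"):
    """Undo the mix-up by reversing the two steps."""
    return garbled.encode(ansi).decode(oem)


garbled = as_seen_in_ansi("åäö")
print(garbled)           # †„”
print(repair(garbled))   # åäö
```

If this assumption holds, the file names on disk are fine and only the redirected output needs to be read with the matching code page.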
You can use this code (Python)
Rename international files
# -*- coding: cp1252 -*-
import os, shutil
base_dir = "g:\\awk\\" # Base Directory (includes subdirectories)
char_table_1 = "áéíóúñ"
char_table_2 = "aeioun"
adirs = os.walk(base_dir)
for adir in adirs:
    dir = adir[0] + "\\" # Directory
    # print("\nDir : " + dir)
    for file in adir[2]: # List of files
        if os.access(dir + file, os.R_OK):
            file2 = file
            # Replace each character of table 1 with its counterpart in table 2
            for i in range(0, len(char_table_1)):
                file2 = file2.replace(char_table_1[i], char_table_2[i])
            if file2 != file:
                # Different, rename
                print(dir + file + " => " + file2)
                shutil.move(dir + file, dir + file2)
###
You have to change your encoding and your char tables (I tested this script with Spanish files and works fine). You can comment the "move" line to check if it's working ok, and remove the comment later to do the renaming.
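As an alternative to maintaining the two character tables by hand, the accents can be removed generically with Unicode decomposition. This is a sketch of a different technique, not the method of the script above, and `strip_diacritics` is a made-up helper name. It only computes the new name and does not rename anything. Letters that have no decomposition (for example "ø" or "ß") are left unchanged and would still need an explicit mapping.

```python
import unicodedata


def strip_diacritics(name):
    """Decompose to NFD and drop the combining marks (accents, rings, dots)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for old in ["fofo.txt", "worstcase åäöÅÄÖéÉ.txt"]:
    new = strip_diacritics(old)
    if new != old:
        print(old + " => " + new)  # worstcase åäöÅÄÖéÉ.txt => worstcase aaoAAOeE.txt
```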
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Others have proposed using a real programming language. That's fine, especially if you have a language you are very comfortable with. My friend on the C# team says that C# 3.0 (with Linq) is well-suited to whipping up quick, small programs like this. He has stopped writing batch files most of the time.
Personally, I would choose PowerShell. This problem can be solved right on the command line, and in a single line. I'll
EDIT: it's not one line, but it's not a lot of code, either. Also, it looks like StackOverflow doesn't like the syntax "$_.Name", and renders the _ as &#95.
$mapping = @{
    "å" = "a"
    "ä" = "a"
    "ö" = "o"
}
Get-ChildItem -Recurse . *.txt | Foreach-Object {
    $newname = $_.Name
    foreach ($l in $mapping.Keys) {
        $newname = $newname.Replace( $l, $mapping[$l] )
        $newname = $newname.Replace( $l.ToUpper(), $mapping[$l].ToUpper() )
    }
    Rename-Item -WhatIf $_.FullName $newname # remove the -WhatIf when you're ready to do it for real.
}
I would write this in C++, C#, or Java -- environments where I know for certain that you can get the Unicode characters out of a path properly. It's always uncertain with command-line tools, especially out of Cygwin.
Then the code is a simple find/replace or regex/replace. If you can name a language it would be easy to write the code.
I'd write a vbscript (WSH) to scan the directories, then send the filenames to a function that breaks up the filenames into their individual letters, then does a SELECT CASE on the Swedish ones and replaces them with the ones you want. Or, instead of doing that the function could just drop it thru a bunch of REPLACE() functions, reassigning the output to the input string. At the end it then renames the file with the new value.

Resources