How to convert character encoding from Windows exec output in TCL - windows

I'm just going mad with this one. I've already read everything: answers, wiki and manual pages, with no luck.
I need to convert the output returned from a Windows command in a Tcl script while preserving the localized characters; I mean the output of an exec command.
So, let's say:
catch { exec cmd.exe /C dir } _output
puts $_output
This prints any localized characters incorrectly (they come out garbled).
Any advice to solve this?
Thanks for your time.

If the encoding isn't correct by default — it isn't the same as the result of encoding system, which is also the encoding of filenames and miscellaneous other strings fed into the OS — then you can't use exec. But all is not lost. Instead of exec, use this:
# Open as a binary pipeline
set f [open |[list cmd.exe /C dir] "rb"]
set data [read $f]
close $f
Now you have the literal bytes from the command, whatever they are, and can use encoding convertfrom to handle the mess, possibly with the help of regexp and string range and so on to slice-and-dice that string as required. After all, binary data in Tcl is just another kind of string.
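For example, if chcp reports code page 850 (an assumption here; substitute the Tcl encoding name that matches your console, such as cp437 or cp866), the whole thing might look like this sketch:
# Read the raw bytes, then decode them with the console's OEM code page
set f [open |[list cmd.exe /C dir] "rb"]
set data [read $f]
close $f
set _output [encoding convertfrom cp850 $data]  ;# cp850 is assumed; check with chcp
puts $_output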
If you're just using this to list the filenames, use the built-in glob command instead. It's much faster and avoids all these problems except in the most intensely weird cases (such as with strange external devices with wrong filesystem-level metadata).
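For instance, a plain glob call gives you correctly encoded file names without going through cmd.exe at all (a minimal sketch):
# List the files in the current directory; names come back as ordinary Tcl strings
foreach name [lsort [glob -nocomplain *]] {
    puts $name
}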

Related

Batch variable being set to ■1 instead of intended output

I'm putting together a script and need to take a file's content as input for setting a variable. I'm using Out-File to produce a text file:
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine
Then I am using that file to set a variable in batch:
set /P variablename=<C:\Full\Path\To\file.txt
The content of that file is a unique id string that looks practically like this:
1i32l54bl5b2hlthtl098
When I echo this variable, I get this:
echo %variablename%
■1
When I have tried a different string in the input file, I see that what is being echoed is the ■ character and then the first character in the string. So, if my string was "apfvuu244ty0vh" then it would echo "■a" instead.
Why isn't the variable being set to the content of the file? I'm using the method from this stackoverflow post where the chosen answer says to use this syntax with the set command. Am I doing something wrong? Is there perhaps a problem with using a full path as input to a set variable?
tl;dr:
Use Out-File -Encoding oem to produce files that cmd.exe reads correctly.
This effectively limits you to the 256 characters available in the legacy "ANSI" / OEM code pages, except NUL (0x0). See bottom section if you need full Unicode support.
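Applied to your example, the only change on the PowerShell side is the extra -Encoding oem parameter (a sketch reusing your own path):
# PowerShell: write the string in the console's legacy OEM code page
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine -Encoding oem

rem Batch: read it back as before
set /P variablename=<C:\Full\Path\To\file.txt
echo %variablename%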
In Windows PowerShell (but not PowerShell Core), Out-File and its effective alias > default to UTF-16LE character encoding, where most characters are represented as 2-byte sequences; for characters in the ASCII range, the 2nd byte of each sequence is NUL (0x0); additionally, such files start with a BOM that indicates the type of encoding.
By contrast, cmd.exe expects input to use the legacy single-byte OEM encoding (note that starting cmd.exe with /U only controls the encoding of its output).
When cmd.exe (unbeknownst to it) encounters UTF-16LE input:
It interprets the bytes individually as characters (even though characters in UTF-16LE are composed of 2 bytes (typically), or, in rare cases, of 4 (a pair of 2-byte sequences)).
It interprets the 2 bytes that make up the BOM (0xff, 0xfe) as part of the string. With OEM code page 437 (US-English) in effect, 0xff renders like a space, whereas 0xfe renders as ■.
Reading stops once the first NUL (0x0 byte) is encountered, which happens with the 1st character from the ASCII range, which in your sample string is 1.
Therefore, string 1i32l54bl5b2hlthtl098 encoded as UTF-16LE is read as  ■1, as you state.
If you need full Unicode support, use UTF-8 encoding:
Use Out-File -Encoding utf8 in PowerShell.
Before reading the file in cmd.exe (in a batch file), run chcp 65001 in order to switch to the UTF-8 code page.
Caveats:
Not all Unicode chars. may render correctly, depending on the font used in the console window.
Legacy applications may malfunction with code page 65001 in effect, especially on older Windows versions.
A possible strategy to avoid problems is to temporarily switch to code page 65001, as needed, and then switch back.
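A sketch of that temporary switch in a batch file (it assumes the console was on code page 437 beforehand and that the file was written with Out-File -Encoding utf8):
chcp 65001 >nul
set /P variablename=<C:\Full\Path\To\file.txt
chcp 437 >nul
echo %variablename%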
Note that the above only covers communication via files, and only in one direction (PowerShell -> cmd.exe).
To also control the character encoding used for the standard streams (stdin, stdout, stderr), both when sending strings to cmd.exe / external programs and when interpreting strings received from them, see this answer of mine.

Issue with encoding of a character (not able to sed or .gsub)

I am dealing with some multilingual data (English and Arabic) in a JSON file with a weird character I am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim and this is what I got:
"38 0x26"
This is the status line in vim I used to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character).
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim (screenshot omitted).
I tried sed and .gsub to replace this character, unsuccessfully.
Is there a way I can replace this character (preferably with Ruby's .gsub) with '&' or something else?
Thanks
Try something like:
sed 's/[[:alnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)]/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) is all the characters that you want to keep as-is (without the surrounding "()"). Lines containing at least one of those characters are left alone; lines containing none of them have every character replaced with ?.
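Since you asked about Ruby's .gsub, here is a hedged sketch of the same idea (the file name and the choice to replace every non-ASCII character with '&' are assumptions; widen or narrow the character class as needed):
# Read the file, replace anything outside the ASCII range with '&', write it back
text = File.read('data.json')
cleaned = text.gsub(/[^[:ascii:]]/, '&')
File.write('data.json', cleaned)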
JSON is encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by the use of a terminal emulator that doesn't support Unicode; an incorrect $LANG setting; vim not being able to correctly determine the encoding of the file; and likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)

Some symbols don't affect cmd commands while others do

I noticed that cmd seems to accept some characters at the ends of commands. For example, all of the following function correctly:
cls.
cls;
cls(
cls\
cls+
cls=
cls\"whatever"
cls\$
cls\#
and these do not:
cls'
cls$
cls)
cls-
cls#
cls\/
Does anybody know why this happens?
Thanks in advance.
It depends on the batch parser.
;,= are general batch delimiters, so you can append/prepend them to most commands without effect.
;,,= ,=; echo hello
;,cls,;,,
A . (dot) can be appended to most commands, as the parser will first try to find a file named cls (without extension), then cls.exe, cls.bat, and so on; when nothing is found, it falls back to the internal command.
The opening bracket ( is also a special character that the parser removes without error.
The \ backslash is used as a path delimiter, so sometimes it works, but sometimes it can even change which command gets run:
cls\..\..\..\windows\system32\calc.exe

Windows Perl --> Unix not working after port, possible encoding issue

I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line just handles the header alone; there's another one like it to handle the repetitive data lines.
It worked great on Windows, but on Linux it fails. However, if I define a string with the same contents within Perl and run the split on that, it works fine.
I think it's a UTF-16 encoding handling issue, but I'm not sure how to handle it. Does anyone know how I can get Perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.
If STDIN is UTF-16, use one of the following
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
Tom has written a lengthy answer with regard to Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can replace UTF-8 with UTF-16 as needed.
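Putting it together, a minimal sketch assuming the data piped into STDIN really is UTF-16 with a BOM (pick one of the other layers above if there is no BOM):
use strict;
use warnings;

binmode(STDIN, ':encoding(UTF-16)');    # BOM decides the byte order
my $unused_header = <STDIN>;
$unused_header =~ s/\r?\n\z//;          # strip CRLF (Windows) or LF (Unix)
my @header_fields = split(/\|\^\|/, $unused_header, -1);
print join(", ", @header_fields), "\n";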
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings of a file, or you can strip off the line endings yourself in the Perl script:
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)

Why doesn't this path work to open a Windows file in PERL?

I tried to play with Strawberry Perl, and one of the things that stumped me was reading the files.
I tried to do:
open(FH, "D:\test\numbers.txt");
But it cannot find the file (despite the file being there and there being no permission issues).
An equivalent code (100% of the script other than the filename was identical) worked fine on Linux.
As per Perl FAQ 5, you should be using forward slashes in your DOS/Windows filenames (or, as an alternative, escaping the backslashes).
Why can't I use "C:\temp\foo" in DOS paths? Why doesn't `C:\temp\foo.exe` work?
Whoops! You just put a tab and a formfeed into that filename! Remember that within double quoted strings ("like\this"), the backslash is an escape character. The full list of these is in Quote and Quote-like Operators in perlop. Unsurprisingly, you don't have a file called "c:(tab)emp(formfeed)oo" or "c:(tab)emp(formfeed)oo.exe" on your legacy DOS filesystem.
Either single-quote your strings, or (preferably) use forward slashes. Since all DOS and Windows versions since something like MS-DOS 2.0 or so have treated / and \ the same in a path, you might as well use the one that doesn't clash with Perl--or the POSIX shell, ANSI C and C++, awk, Tcl, Java, or Python, just to mention a few. POSIX paths are more portable, too.
So your code should be open(FH, "D:/test/numbers.txt"); instead, to avoid trying to open a file named "D:<TAB>est<NEWLINE>umbers.txt".
As an aside, you could further improve your code by using a lexical (instead of a global named) filehandle, the 3-argument form of open, and, most importantly, error-checking ALL your IO operations, especially open() calls:
open(my $fh, "<", "D:/test/numbers.txt") or die "Could not open file: $!";
Or, better yet, don't hard-code filenames in IO calls (the following practice MAY have let you figure out a problem sooner):
my $filename = "D:/test/numbers.txt";
open(my $fh, "<", $filename) or die "Could not open file $filename: $!";
Never use interpolated strings when you don't need interpolation! You are trying to open a file name with a tab character and a newline character in it from the \t and the \n!
Use single quotes when you don't need (or want) interpolation.
One of the biggest problems novice Perl programmers seem to run into is that they automatically use "" for everything without thinking. You need to understand the difference between "" and '' and you need to ALWAYS think before you type so that you choose the right one. It's a hard habit to get into, but it's vital if you're going to write good Perl.
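A tiny illustration of the difference (a standalone snippet, not part of the original script):
my $interpolated = "D:\test\numbers.txt";   # \t becomes a TAB, \n becomes a newline
my $literal      = 'D:\test\numbers.txt';   # single quotes keep the backslashes as typed
print length($interpolated), " vs ", length($literal), "\n";   # prints "17 vs 19"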
