UTF-16 Perl input/output on Windows

I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.
use open "encoding(UTF-16)";
open INPUT, "< input.txt"
    or die "cannot open input.txt: $!\n";
open(OUTPUT, "> output.txt")
    or die "cannot open output.txt: $!\n";
while (<INPUT>) {
    print OUTPUT "$_\n";
}
Let's just say that my program writes everything from input.txt into output.txt.
This WORKS perfectly fine in my cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int"
But in my Windows environment, which is using "This is perl 5, version 12, subversion 3 (v5.12.3) built for MSWin32-x64-multi-thread",
every line in output.txt except the first is prepended with garbage symbols.
For example:
<FIRST LINE OF TEXT>
਀    ㈀  ㄀Ⰰ ㈀Ⰰ 嘀愀 ㌀ 䌀栀椀愀 䐀⸀⸀⸀  儀甀愀渀最 䠀ഊ<SECOND LINE OF TEXT>
...
Can anyone give some insight on why it works on cygwin but not windows?
EDIT: Here are the PerlIO layers on each handle, printed as suggested in the answer below.
In the Windows environment (INPUT handle first, then OUTPUT):
unix
crlf
encoding(UTF-16)
utf8
unix
crlf
encoding(UTF-16)
utf8
In the Cygwin environment (again INPUT, then OUTPUT):
unix
perlio
encoding(UTF-16)
utf8
unix
perlio
encoding(UTF-16)
utf8
The only difference is the crlf layer on Windows versus the perlio layer on Cygwin.

[ I was going to wait and give a thorough answer, but it's probably better if I give you a quick answer than nothing. ]
The problem is that crlf and the encoding layers are in the wrong order. Not your fault.
For example, say you do print "a\nb\nc\n"; using UTF-16le (since it's simpler and it's probably what you actually want). You'd end up with
61 00 0D 0A 00 62 00 0D 0A 00 63 00 0D 0A 00
instead of
61 00 0D 00 0A 00 62 00 0D 00 0A 00 63 00 0D 00 0A 00
I don't think you can get the right results with the open pragma or with binmode, but it can be done using open.
open(my $fh, '<:raw:encoding(UTF-16):crlf', $qfn)
    or die "cannot open $qfn: $!\n";
You'll need to append a :utf8 layer with some older versions, IIRC.
It works on cygwin because the crlf layer is only added on Windows. There you'd get
61 00 0A 00 62 00 0A 00 63 00 0A 00
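Putting it together, a minimal sketch of the question's copy loop with the layers stacked explicitly (assuming the input file has a BOM, and UTF-16le on output as suggested above):
use strict;
use warnings;

# :raw starts from bare bytes, :encoding(UTF-16) decodes/encodes,
# and the outermost :crlf maps \r\n <-> \n at the character level.
open(my $in, '<:raw:encoding(UTF-16):crlf', 'input.txt')
    or die "cannot open input.txt: $!\n";
open(my $out, '>:raw:encoding(UTF-16le):crlf', 'output.txt')
    or die "cannot open output.txt: $!\n";

print $out $_ while <$in>;

close $out
    or die "cannot close output.txt: $!\n";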

You have a typo in your encoding. It should be use open ":encoding(UTF-16)"; note the colon. Perl seems to make up for it, but it could be what's causing your problem. I don't know why it would work on Cygwin but not Windows; it could also be a 5.12 vs 5.14 thing.
If that doesn't do it, check if the encoding is being applied to your filehandles.
print map { "$_\n" } PerlIO::get_layers(*INPUT);   # layers on the read handle
print map { "$_\n" } PerlIO::get_layers(*OUTPUT);  # layers on the write handle
Use lexical filehandles (i.e. open my $fh, "<", $file). Glob filehandles are global, so something else in your program might be interfering with them.
If all that checks out, if lexical filehandles are getting the encoding(UTF-16) applied, let us know and we can try something else.
UPDATE: This may provide your answer: "BOMed UTF files are not suitable for streaming models, and they must be slurped as binary files instead." It looks like you have to read the file in as binary and do the decoding on the resulting string. This may have been a bug that was fixed in 5.14.
UPDATE 2: Yep, I can confirm this is a bug that was fixed in 5.14.
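For what it's worth, a hedged sketch of that slurp-and-decode workaround (Encode's UTF-16 codec uses the BOM to pick the byte order):
use strict;
use warnings;
use Encode qw(decode);

# Slurp the BOMed file as raw bytes, then decode the string in one go.
open(my $fh, '<:raw', 'input.txt')
    or die "cannot open input.txt: $!\n";
my $bytes = do { local $/; <$fh> };
close $fh;

my $text = decode('UTF-16', $bytes);  # BOM selects UTF-16LE or UTF-16BE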

Related

invalid characters not visible in BASH

I have been working with some devices that allow login via telnet; I extracted data from the devices and generated reports without any problems. Recently, I had to switch to SSH. The rest of the script is all the same; only the login procedure changed from telnet to SSH. After switching to SSH, some of the extracted lines contain invalid characters. Below is an example; as can be seen, there is an invalid character after PON7 in this line:
OLT:LT6.PON7.ONT1,ALARM,Date time,
The problem is that this invalid character is not even visible in bash or in the CSV file; it was only discovered when I copied the line into Notepad++ (or while posting it here).
Now I have two problems:
1st: What is causing these invalid characters when switching from telnet to SSH?
2nd: How do I deal with this invalid character in bash, given that it is not even visible there? The report is being used elsewhere, and these invalid characters are causing problems.
Edit:
Pasting the text into a text-to-hex converter produces this:
4f 4c 54 3a 4c 54 36 2e 50 4f 4e 37 11 2e 4f 4e 54 31 2c 41 4c 41 52 4d 2c 44 61 74 65 20 74 69 6d 65 2c
It looks like there's a DC1 character (hex 11) between the "7" and the ".".
Unfortunately, this edit also has the side effect of removing the character from the sample text.
Passing your text through a text to hexadecimal converter shows that the invisible character is an ASCII DC1 character (hex 11, octal 021). This character is also known as Ctrl-Q or XON. It's sometimes used in flow control.
In a bash script, you could filter it out using the tr program:
echo "$badtext" | tr -d '\021'
SSH doesn't inherently insert DC1 characters into text streams. If you're getting a DC1 character in the output from a device, presumably the device sent that character.
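The same filter works on an entire report file; a minimal sketch, where report.csv is a hypothetical name:
# Strip every DC1 (octal 021) byte from the report, then replace it.
tr -d '\021' < report.csv > report.clean.csv
mv report.clean.csv report.csv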

Why does printing the string "Hühnchen" in the command prompt under Windows 7 cause a "No such file or directory" Error?

Executing the following python program
test.py:
# -*- coding: utf-8 -*-
print "Hühnchen"
hexdump:
00000000 23 20 2d 2a 2d 20 63 6f 64 69 6e 67 3a 20 75 74 |# -*- coding: ut|
00000010 66 2d 38 20 2d 2a 2d 20 0a 0a 70 72 69 6e 74 20 |f-8 -*- ..print |
00000020 22 48 c3 bc 68 6e 63 68 65 6e 22 0a |"H..hnchen".|
from the command prompt in Windows 7 with code page 65001 and the Lucida Console font causes an IOError:
$ python test.py
HühnchenTraceback (most recent call last):
File "test.py", line 3, in <module>
print "Hühnchen"
IOError: [Errno 2] No such file or directory
To exclude any side effects of my Windows installation I reproduced the problem in a fresh virtual machine with the following steps:
Install Windows 7 Ultimate SP1 Build 7601 (SHA1 of the ISO: 36ae90defbad9d9539e649b193ae573b77a71c83) in a virtual machine
Install python 2.7.13 64-bit
Open cmd.exe
Set the font to Lucida Console
Change the code page to 65001 to support UTF-8
Execute the above script (make sure that the file encoding is UTF-8)
The result was the same.
What is happening here?
So the only way I could recreate this error was to switch the font to Lucida Console and run the chcp 65001 command before running the program. According to this post, "Prior to Windows 8, output using codepage 65001 is also broken"[1]. If you want to print "Hühnchen" or any other non-ASCII UTF-8 characters, I think the best solution is to not do so from the command line, as it always seems to mangle the characters. When I ran the same code in PyCharm CE it worked fine, and even when I opened the Python 2.7 interactive prompt it still printed properly. It seems to be a very niche problem you're experiencing, and one that should be possible to work around.
[1] chcp 65001 codepage results in program termination without any error
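One hedged workaround sketch: the bug is specific to console output under code page 65001, so writing the UTF-8 text to a file (out.txt is a hypothetical name) sidesteps it entirely:
# -*- coding: utf-8 -*-
import io

# Write to a file instead of the broken cp65001 console.
with io.open("out.txt", "w", encoding="utf-8") as f:
    f.write(u"Hühnchen\n")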

Windows Powershell Output to File. Strange Characters

Running this command:
echo "foo" > test.txt
I get strange results in test.txt. I actually can't even copy-paste the contents directly here in stack overflow but need to show the hex output of the file. Which looks like this-
ff fe 66 00 6f 00 6f 00 0d 00 0a 00
I'm not sure what FF and FE are, but it looks like it's also placing a NULL between each character.
Can any Windows people point me in the right direction as to why this is happening and how I can resolve it? I just want the contents "foo" to be placed in that file unmolested.
The bytes FF FE are the Unicode byte order mark (U+FEFF) as serialized in UTF-16 LE (little endian), which is normal Windows Unicode; PowerShell's > redirection writes UTF-16 LE by default.
The contents of your file are the two-byte BOM, then 66 00 ("f"), 6F 00 ("o"), 6F 00 ("o"), 0D 00 ("`r", carriage return) and 0A 00 ("`n", new line).
Try:
echo "foo" | Out-File -FilePath test.txt -Encoding utf8

Read and Write File Section of File Only (Without Streams)

Most high-level representations of files are as streams. C's fopen, ActiveX's Scripting.FileSystemObject and ADODB.Stream - in fact, anything built on top of C is very likely to use a stream representation to edit files.
However, when modifying large (~4MiB) fixed-structure binary files, it seems wasteful to read the whole file into memory and write almost exactly the same thing back to disk - this will almost certainly have a performance penalty attached. Looking at the majority of uncompressed filesystems, there seems to me to be no reason why a section of the file couldn't be read from and written to without touching the surrounding data. At a maximum, the block would have to be rewritten, but that's usually on the order of 4KiB; much less than the entire file for large files.
Example:
00 01 02 03
04 05 06 07
08 09 0A 0B
0C 0D 0E 0F
might become:
00 01 02 03
04 F0 F1 F2
F3 F4 F5 0B
0C 0D 0E 0F
A solution using existing ActiveX objects would be ideal, but any way of doing this without having to re-write the entire file would be good.
Okay, here's how to do the exercise in PowerShell (e.g. hello.ps1):
$path = "hello.txt"
# Open the existing file for read/write without truncating it.
$bw = New-Object System.IO.BinaryWriter([System.IO.File]::Open($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::ReadWrite, [System.IO.FileShare]::ReadWrite))
# Seek to byte offset 5, then overwrite six bytes in place.
$null = $bw.BaseStream.Seek(5, [System.IO.SeekOrigin]::Begin)
$bw.Write([byte] 0xF0)
$bw.Write([byte] 0xF1)
$bw.Write([byte] 0xF2)
$bw.Write([byte] 0xF3)
$bw.Write([byte] 0xF4)
$bw.Write([byte] 0xF5)
$bw.Close()
You can test it from command line:
powershell -file hello.ps1
Then, you can invoke this from your HTA as:
var wsh = new ActiveXObject("WScript.Shell");
wsh.Run("powershell -file hello.ps1");
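A hedged companion sketch for the read side: pull the six bytes back from offset 5 with System.IO.BinaryReader, again without touching the rest of the file:
$path = "hello.txt"
# Open read-only and seek to the same offset.
$br = New-Object System.IO.BinaryReader([System.IO.File]::Open($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite))
$null = $br.BaseStream.Seek(5, [System.IO.SeekOrigin]::Begin)
$bytes = $br.ReadBytes(6)
$br.Close()
# Print the bytes as hex, e.g. F0 F1 F2 F3 F4 F5
($bytes | ForEach-Object { $_.ToString("X2") }) -join ' '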

pcl6.exe v9.15 silently converting APOSTROPHE => RIGHT SINGLE QUOTATION MARK

Good afternoon.
I am running pcl6.exe version 9.15 on Windows 8.1.
I am running into a problem where pcl6.exe is silently converting any APOSTROPHE characters into RIGHT SINGLE QUOTATION MARK characters when using the 16602 typeface in a PCL5 file.
Here is the command line I am using:
pcl6.exe -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=test.txt test.prn
test.prn input (hex)
1B 28 30 55 1B 28 73 31 70 31 30 76 31 36 36 30 32 54 1B 26 61 30 76 30 48 3E 27 3C
test.prn input (text ['.' is the escape character])
.(0U.(s1p10v16602T.&a0v0H>'<
test.txt output (hex)
20 20 3E E2 80 99 3C 0D 0A
test.txt output (text)
>’<..
expected test.txt output (hex)
20 20 3E 27 3C 0D 0A
expected test.txt output (text)
>'<..
Is there a flag or an option somewhere that can disable this conversion?
Thank you for your time.
txtwrite does the best it can with the input; PCL tends not to carry much information in the file to let us determine what the glyph should be (PostScript is better, and PDF often better still).
If you think there is a real problem, I would suggest your best bet is to open a bug report. Apart from anything else, I would need to see the PCL file to determine what's going on. Most likely the character code you are using corresponds to an apostrophe, which in this specific font is a right quote. There is no way for the text extraction device to know what shape the font will draw in response to a character code. At least, not in PCL.
The problem was caused by the symbol set.
The sample was using the PCL ISO 6: ASCII symbol set (code 0U):
http://www.pcl.to/symbolset/pcl_0u.pdf
According to the 0U symbol set reference, APOSTROPHE (0x27) maps to RIGHT SINGLE QUOTATION MARK (U+2019). pcl6.exe then emits that code point as its UTF-8 equivalent: 0xE28099.
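For reference, those bytes follow directly from the three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx):
U+2019 =  0010 0000 0001 1001
       -> 11100010 10000000 10011001
       =  E2 80 99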
