Perl on Windows: Problems with Encoding

I have a problem with my Perl scripts. On UNIX-like systems they print all Unicode characters, like ä, properly to the console. In the Windows command line, the characters come out as senseless glyphs. Is there a simple way to avoid this? I'm using use utf8;.
Thanks in advance.

use utf8; simply tells Perl your source is encoded using UTF-8.
It's not working on Unix either. Some strings won't print properly (print chr(0xE9);), and most that do will emit a "Wide character" warning (print chr(0x2660);). You need to decode your inputs and encode your outputs.
On Unix systems, that's usually
use open ':std', ':encoding(UTF-8)';
On Windows systems, you'll need to use chcp to find the console's code page. (437 for me.)
use open ':std', ':encoding(cp437)'; # Encoding used by console
use open IO => ':encoding(cp1252)'; # Encoding used by files
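
Putting those together, a minimal sketch of a script set up for the Windows console might look like this (cp437 and cp1252 are assumptions; check your own console with chcp):

use strict;
use warnings;
use utf8;                              # this source file is saved as UTF-8
use open ':std', ':encoding(cp437)';   # console code page, per chcp
use open IO => ':encoding(cp1252)';    # encoding for other file handles

print "ä\n";                           # now encoded to match the console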

Related

How to prevent \n from being translated to \r\n on Windows

I am using Windows 10 and Strawberry Perl
It is well known that the line terminator in Linux is \n, and in Windows is \r\n.
I found that, on my computer, files with Linux line endings are automatically converted to the Windows-style \r\n after a replacement operation like
perl -i.bak -pe "s/aaa/bbb/g" test.txt
But this is not what I want, and it seems unreasonable. I would like to know whether this is a Strawberry Perl issue or something else.
How can I leave the line terminator unaffected on Windows?
This is standard behavior of Perl on Windows (to convert \n to \r\n).
You can get around it by using binmode, which prevents Perl from doing the automatic line-ending conversion.
Your command would then be changed to look like this: it applies binmode to STDOUT, so the output has to be redirected to another file. The following command should do what you want (though not in place):
perl -pe "BEGIN{ binmode(STDOUT) } s/aaa/bbb/g" test.txt > newtest.txt
"Actually I set unix format as notepad++ default which is my main editor" I think you should make the effort the keep files with the correct line endings for the appropriate system. You won't make any friends if you keep Linux files everywhere, as it will make it very hard for others to work with your non-standard methodology
It isn't very hard to work with both systems properly, as all you have to do is make the change automatically when copying from one system to another. You can use dos2unix and unix2dos when making the copy, but it would be a simple job to write a Perl program to update all of your systems with the relevant version of the text files, along the lines sketched below.
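For instance, a minimal sketch of such a converter (the script name, file handling, and direction are assumptions; it copies a file while normalizing endings to CRLF):

#!/usr/bin/perl
# to_dos.pl (hypothetical): copy a file, converting line endings to CRLF
use strict;
use warnings;

my ($src, $dst) = @ARGV;
open my $in,  '<:raw', $src or die "Can't read $src: $!";
open my $out, '>:raw', $dst or die "Can't write $dst: $!";

while ( my $line = <$in> ) {
    $line =~ s/\r?\n\z/\r\n/;   # LF or CRLF both become CRLF
    print $out $line;
}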
However, if you insist on this plan, this should help you to achieve it.
By default, when running on Windows, perl will use the IO layers :unix and :crlf, which means it works the same as on a Linux system but will translate CRLF to LF on input, and LF to CRLF on output.
You can make individual open calls behave differently by adding an explicit pseudo-layer :raw, which removes the :crlf layer. But if you want to modify the special file handles STDIN, STDOUT and ARGV, then you need a different tactic, because those handles are opened for you by perl.
You can use the open pragma at the top of your program, like this
use open IO => ':raw';
which will implicitly apply the :raw layer to every input or output file handle, including the special handles. You can set this from the command line by using
perl -Mopen=IO,raw program.pl
Or you can set the PERLIO environment variable
set PERLIO=raw
which will affect every program run henceforth from the same cmd window.
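
Applying that to the original command, something like this should leave the line endings untouched (a sketch using the pragma above; I haven't verified it against every Perl build):

perl -Mopen=IO,raw -i.bak -pe "s/aaa/bbb/g" test.txt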

Using terminal to sort data and keep the format in notepad.exe

I'm using Ubuntu Bash within Windows 10 and I have a text document with:
{u'osidjfoij23': 3894798, u'oisjdao':234567, u'oaijsdofj': 984759}
Using tr in the terminal, I change my output to
'osidjfoij23': 3894798,
'oisjdao':234567,
'oaijsdofj': 984759}
when opening the same document via notepad.exe, the newline "\n" added by tr doesn't register, and all the data gets presented as one paragraph.
I know this is because bash and notepad expect different line endings. Is there a way to make these work together, or an alternative I can use for notepad?
You can use unix2dos to convert a file to Windows line endings. Linux programs handle Windows line endings fairly well, so this shouldn't break anything (especially if that's JSON as it appears to be).
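If unix2dos isn't installed, a Perl one-liner can do the same conversion (a sketch; yourfile.txt is a placeholder, and the substitution is idempotent, so existing CRLF endings are left alone):

perl -i.bak -pe 's/\r?\n\z/\r\n/' yourfile.txt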

Read Double byte characters in plist from shell

I am working on a Mac. I have a plist entry containing double-byte Chinese characters,
i.e. ProductRoot /Users/labuser/Desktop/您好.
Now I am running this command in the terminal
defaults read "path to p-list" ProductRoot
and I am getting /Users/labuser/Desktop/\u60a8\u597d
How can I fix this?
"defaults read" doesn't seem to have any way to change the format of the output. Maybe you could pipe that to another command-line tool to unescape the Unicode characters.
Failing that, it'd be very easy to write a tool in Objective-C or Swift to dump just that one value as a string.
As a side note, you claim the file has double-byte characters. If it's being created by native Mac code, it's more likely to be in UTF-8 encoding. I don't know if that would matter at all, but I figured I'd add that in case it's relevant.
You could try this:
defaults read | grep ppt | perl -pe 's/\\\\U(\w\w\w\w)/chr hex $1/ge'
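
If the substituted characters then come out garbled on a UTF-8 terminal, a variation like this might help (an untested sketch: -CO tells perl to write STDOUT as UTF-8, and the pattern tolerates a single or doubled backslash and a lowercase u):

defaults read "path to p-list" ProductRoot | perl -CO -pe 's/\\\\?[Uu](\w{4})/chr hex $1/ge'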

Ruby and Accented Characters

Summary of the wall of text below: How can I display accented characters (so they work via puts, etc) in Ruby?
Hello! I am writing a program for my class which will display some sentences in Spanish. When I try to use accented characters in Ruby, they do not display correctly (in the NetBeans output window (which displays accented characters in Java fine) or in the Command Prompt).
At first, some of my code didn't even run because the accented characters in my arrays were throwing off the Ruby interpreter (I guess?). I got errors like Ruby was expecting a closing bracket.
But I did some research and found a solution: add the following line of code at the beginning of my Ruby file:
# coding: utf-8
In NetBeans, my program ran regardless of this line. But I needed to add this line to get my program to run successfully in Command Prompt. (I don't know why.)
I'm still, however, having a problem actually displaying the characters to the screen. A word such as "será" will display in the NetBeans output window as "seré". And in the command prompt it draws little pipe characters (that I don't know how to type).
Doing some more research, I heard about:
$KCODE = 'UTF-8'
but I'm not having any luck with this.
I'm using Ruby 1.8 and 1.9 (I go back and forth between different machines).
Thanks,
Derek
The command prompt in Windows 7 uses raster fonts by default, and those don't support Unicode. First, change the cmd font to Lucida Console or Consolas. Then change the command prompt's code page with chcp 65001. You can do that manually, or from within your Ruby program:
# encoding: utf-8
`chcp 65001` #change cmd encoding to unicode
puts 'será test '

How can I quickly fix EBCDIC control characters in large files using Perl?

My apologies if this comes across as a newbie question. I'm not a Perl developer, but am trying to use it within an automation process, and I've hit a snag.
The following command runs quickly (a few seconds) on my Linux system (Ubuntu 9.10 x64, Perl 5.10), but is extremely slow on a Windows system (Windows 2003 x86, Strawberry Perl 5.12.1.0).
perl -pe 's/\x00\x42\x00\x11/\x00\x42\x00\xf0/sgx' inputfile > outputfile
The find/replace pattern of hex characters is intended to fix EBCDIC carriage control characters in a file that is between 500 MB and 2 GB in size. I'm not sure if this is even the most efficient way to do this, but it would seem to do the trick... if only it would run quickly on the Windows system it needs to run on.
Any thoughts?
Note that there is a distinction between text and binary files on Windows. Text files are subject to automatic EOL conversion, which I assume might add to the run time as well as potentially mess up your binary substitution (presumably not the case here).
Also, there is no point in using the /s and /x modifiers with this substitution.
I think the heart of the matter boils down to this: With the -p switch, you are supposed to be processing the input line-by-line. Where is the first EOL (as understood by perl) in the file? Are you trying to read a huge string into memory, do the s/// on it and write out?
How about using the following script:
#!/usr/bin/perl
use strict;
use warnings;
use open IO => ':raw';    # binary-safe reads on Windows (no CRLF translation)

binmode STDOUT;           # binary-safe writes, too

$/ = "\x00\x42\x00\x11";  # input record separator: the byte sequence to find
$\ = "\x00\x42\x00\xf0";  # output record separator: the replacement bytes

while ( <> ) {
    if ( chomp ) { print }           # record ended with the search bytes
    else         { local $\; print } # final record: nothing to replace
}
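
You would run it like this (fix_cc.pl is a hypothetical name for the script above):

perl fix_cc.pl inputfile > outputfile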
Also, you absolutely need to use double-quotes on Windows. Compare and contrast:
C:\Temp> perl -pe 's/perl/merl/' t.pl
#!/usr/bin/perl
...
C:\Temp> perl -pe "s/perl/merl/" t.pl
#!/usr/bin/merl
...
