Windows Perl --> Unix not working after port, possible encoding issue - windows

I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line handles the header alone; there's another one like it for the repetitive data lines.
It worked great on Windows, but on Linux it fails. However, if I define a string with the same contents within Perl and run the split on that, it works fine.
I think it's a UTF-16 encoding issue, but I'm not sure how to handle it. Does anyone know how I can get Perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.

If STDIN is UTF-16, use one of the following:
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
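To see why an encoding layer matters here, the following is a minimal sketch in Python rather than Perl, chosen only for illustration. The sample line and the choice of UTF-16LE are assumptions. It shows that the `|^|` delimiter never appears as three contiguous bytes in UTF-16 data, so a split on undecoded input finds nothing, while the same split on decoded text gives the expected fields.

```python
import re

# Hypothetical sample line, encoded the way a Windows tool might write it.
raw = "The|^|Quick|^|Brown|^|Fox|!|\n".encode("utf-16-le")

# In UTF-16 every ASCII character is followed by a NUL byte, so the byte
# sequence b"|^|" does not occur and the split returns the input unchanged.
undecoded_fields = re.split(rb"\|\^\|", raw)

# Decoding first (the analogue of binmode with an :encoding layer) fixes it.
text = raw.decode("utf-16-le").rstrip("\n")
decoded_fields = text.split("|^|")

print(len(undecoded_fields))
print(decoded_fields)
```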

Tom has written a lengthy answer with regard to Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can replace that with UTF-16 as needed.

I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings of a file, or you can strip off the line endings yourself in the Perl script:
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)
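The effect of a leftover carriage return can be sketched as follows, in Python for illustration only. The helper name `chomp_any` and the sample header are hypothetical. Without the stripping step, the `\r` silently becomes part of the last field.

```python
import re

def chomp_any(line):
    # Remove one trailing "\r\n" (Windows) or "\n" (Unix), if present.
    return re.sub(r"\r?\n\Z", "", line)

# Hypothetical header line as it arrives from a file with DOS line endings.
dos_header = "The|^|Quick|^|Brown|^|Fox|!|\r\n"

# Stripping only "\n" leaves the carriage return glued to the last field.
naive_fields = dos_header.rstrip("\n").split("|^|")

# Stripping either line ending gives clean fields on both platforms.
clean_fields = chomp_any(dos_header).split("|^|")

print(repr(naive_fields[-1]))
print(repr(clean_fields[-1]))
```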

Related

How to convert character encoding from Windows exec output in TCL

I'm just going mad with this one. I've already read everything: answers, wiki & manual pages with no luck.
I need to convert the output returned from a Windows command in a TCL script preserving full characters localization, the output of an 'exec' command I mean.
So, let's say:
catch { exec cmd.exe /C dir } _output
puts $_output
This prints the output, but any localized characters are displayed incorrectly.
Any advice to solve this?
Thanks for your time.
If the encoding isn't correct by default — it isn't the same as the result of encoding system, which is also the encoding of filenames and miscellaneous other strings fed into the OS — then you can't use exec. But all is not lost. Instead of exec, use this:
# Open as a binary pipeline
set f [open |[list cmd.exe /C dir] "rb"]
set data [read $f]
close $f
Now you have the literal bytes from the command, whatever they are, and can use encoding convertfrom to handle the mess, possibly with the help of regexp and string range and so on to slice-and-dice that string as required. After all, binary data in Tcl is just another kind of string.
If you're just using this to list the filenames, use the built-in glob command instead. It's much faster and avoids all these problems except in the most intensely weird cases (such as with strange external devices with wrong filesystem-level metadata).
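The read-bytes-then-decode idea is not specific to Tcl. As a rough sketch in Python (for illustration only), the bytes below stand in for what a binary pipe would return; cp850 as the console code page and the sample text are assumptions, not something stated above. Decoding with the wrong codec garbles the accented characters, while decoding with the code page that produced the bytes recovers them.

```python
# Stand-in for raw bytes read from a binary pipe. cp850 is an assumed
# console code page; check the real one on the machine in question.
raw_output = "Répertoire été".encode("cp850")

# Decoding with a codec that does not match the producer garbles the text.
wrong = raw_output.decode("cp1252")

# Decoding with the producing code page recovers the original characters.
right = raw_output.decode("cp850")

print(wrong)
print(right)
```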

running a cmd file with an accented character in its name, in Python 2 on Windows

I have the file t2ű.cmd on Windows with an accented character in its name, and I'd like to run it from Python 2 code.
Opening the file (open(u't2\u0170.cmd')) works if I pass the filename as a unicode literal, but no str literal works, because \u0170 is not on the code page of Windows. (See this question for more on opening files with accented characters in their name: opening a file with an accented character in its name, in Python 2 on Windows.)
Running the file from the Command Prompt without Python works.
I tried passing a str literal to os.system, os.popen, os.spawnl and subprocess.call (both with and without the shell), but none of them was able to find the file.
These don't work, they raise UnicodeEncodeError: 'ascii' codec can't encode character u'\u0170'...:
os.system(u't2\u0170.cmd')
os.popen(u't2\u0170.cmd')
os.spawnl(u't2\u0170.cmd', u't2')
subprocess.call(u't2\u0170.cmd')
subprocess.call(u'"t2\u0170.cmd"')
subprocess.call([u't2\u0170.cmd'])
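The encoding failure can be reproduced without touching the filesystem. The following is a small sketch; the helper name `encodable` and the list of codecs are illustrative assumptions, with cp1252 standing in for a typical Western European Windows code page. It checks which codecs can represent the filename at all.

```python
name = u"t2\u0170.cmd"

def encodable(text, codec):
    # True if every character of text exists in the given codec.
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# cp1250 (Central European) contains U+0170; ascii and cp1252 do not.
results = {codec: encodable(name, codec)
           for codec in ("ascii", "cp1252", "cp1250", "utf-8")}

print(results)
```

If the active code page is one of those that cannot encode the name, no byte string passed to the ANSI APIs can refer to the file, which is consistent with the failures listed above.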
In this project it's not feasible to upgrade to Python 3.
It's not feasible to rename the file, because these files can have arbitrary (user-supplied) names on a read-only share, and also the directory name can contain accented characters.
In C I would use any of the _wsystem, _wpopen or _wspawnl functions in <process.h>.
Preferably I'm looking for a solution which works with the standard Python modules (no need to install packages). But I'm interested in any solution.
I need a solution which doesn't open a new window.
Eventually I want to pass command-line arguments to the program, and the arguments will contain arbitrary Unicode characters.
This is based on the comment by @eryksun.
We need to call the system call CreateProcessW or the C functions _wspawnl, _wsystem or _wpopen. Python 2 doesn't have anything built in that calls any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.
The C functions CreateProcessA, spawnl, system and popen don't work.
As described in PEP 263, if you want to use Unicode characters in a Python script, just add a # -*- coding: utf-8 -*- line at the beginning of your script (it's OK to put it after the shebang):
#!/bin/env python
# -*- coding: utf-8 -*-
import os
os.system('t2ű.cmd')
It should now work directly, with no escape codes.
If you still run into problems, you may want to look at packages such as win-unicode-console.

Issue with encoding of a character (not able to sed or .gsub)

I am dealing with some multilingual data (English and Arabic) in a JSON file that contains a weird character I am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim and this is what I got:
"38 0x26"
This is the status line I used in vim to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character):
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim -
I tried 'sed' and '.gsub' to replace this character, without success.
Is there a way I can replace this character (preferably with Ruby's .gsub) with '&' or something else?
Thanks
Try something like:
sed 's/[[:alnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) stands for all the characters that you want to keep as-is (without the surrounding parentheses).
JSON is encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by the use of a terminal emulator that doesn't support Unicode; an incorrect $LANG setting; vim not being able to correctly determine the encoding of the file; and likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)
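When it is unclear what a character actually is, listing the code points of the non-ASCII characters in the string usually settles it. The sketch below is in Python for illustration. The sample string, with an invisible right-to-left mark (U+200F) next to the ampersand, is purely hypothetical and not taken from the question's data.

```python
import unicodedata

# Hypothetical sample: an invisible U+200F sits right before the ampersand.
sample = "Tom \u200f& Jerry"

# List every non-ASCII character with its code point and Unicode name.
suspects = [(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
            for ch in sample if ord(ch) > 127]

# Once the code point is known, it can be replaced or removed explicitly.
cleaned = sample.replace("\u200f", "")

print(suspects)
print(cleaned)
```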

Why doesn't this path work to open a Windows file in PERL?

I tried to play with Strawberry Perl, and one of the things that stumped me was reading the files.
I tried to do:
open(FH, "D:\test\numbers.txt");
But it can not find the file (despite the file being there, and no permissions issues).
Equivalent code (100% of the script other than the filename was identical) worked fine on Linux.
As per Perl FAQ 5, you should be using forward slashes in your DOS/Windows filenames (or, as an alternative, escaping the backslashes).
Why can't I use "C:\temp\foo" in DOS paths? Why doesn't `C:\temp\foo.exe` work?
Whoops! You just put a tab and a formfeed into that filename! Remember that within double quoted strings ("like\this"), the backslash is an escape character. The full list of these is in Quote and Quote-like Operators in perlop. Unsurprisingly, you don't have a file called "c:(tab)emp(formfeed)oo" or "c:(tab)emp(formfeed)oo.exe" on your legacy DOS filesystem.
Either single-quote your strings, or (preferably) use forward slashes. Since all DOS and Windows versions since something like MS-DOS 2.0 or so have treated / and \ the same in a path, you might as well use the one that doesn't clash with Perl--or the POSIX shell, ANSI C and C++, awk, Tcl, Java, or Python, just to mention a few. POSIX paths are more portable, too.
So your code should be open(FH, "D:/test/numbers.txt"); instead, to avoid trying to open a file named "D:<TAB>est<NEWLINE>umbers.txt".
As an aside, you could further improve your code by using lexical (instead of global named) filehandle, a 3-argument form of open, and, most importantly, error-checking ALL your IO operations, especially open() calls:
open(my $fh, "<", "D:/test/numbers.txt") or die "Could not open file: $!";
Or, better yet, don't hard-code filenames in IO calls (the following practice MAY have let you figure out a problem sooner):
my $filename = "D:/test/numbers.txt";
open(my $fh, "<", $filename) or die "Could not open file $filename: $!";
Never use interpolated strings when you don't need interpolation! You are trying to open a file name with a tab character and a newline character in it from the \t and the \n!
Use single quotes when you don't need (or want) interpolation.
One of the biggest problems novice Perl programmers seem to run into is that they automatically use "" for everything without thinking. You need to understand the difference between "" and '' and you need to ALWAYS think before you type so that you choose the right one. It's a hard habit to get into, but it's vital if you're going to write good Perl.
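The same trap exists in other languages with backslash escapes. As an illustrative sketch in Python (not Perl), where a raw string plays roughly the role of Perl's single quotes, the three spellings of the path from the question behave as follows:

```python
# Ordinary string: \t becomes a tab and \n becomes a newline.
escaped = "D:\test\numbers.txt"

# Raw string: backslashes are kept literally, similar to Perl single quotes.
raw = r"D:\test\numbers.txt"

# Forward slashes: nothing to escape, and Windows accepts them in paths.
forward = "D:/test/numbers.txt"

print(repr(escaped))
print(repr(raw))
print(repr(forward))
```

The `repr` output makes the embedded tab and newline visible, which is a quick way to check what filename a program is really asking for.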

Perl regular expression problem

I have this conditional in a perl script:
if ($lnFea =~ m/^(\d+) qid\:([^\s]+).*?\#docid = ([^\s]+) inc = ([^\s]+) prob = ([^\s]+)$/)
and the $lnFea represents this kind of line:
0 qid:7968 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0.000000 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.005175 17:0.000000 18:0.181818 19:0.000000 20:0.003106 21:0.000000 22:0.000000 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0.000000 36:0.000000 37:0.000000 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.055556 44:0.000000 45:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125
The problem is that the if is true on Windows but false on Linux (Fedora 11). Both systems are using the most recent Perl version. So what is the reason for this problem?
Assuming that $lnFea is read from a file, I'd wager that the file is in DOS format. That would cause the $ anchor to prevent matching on Linux due to differences in the line endings between those platforms. Perl's automagic newline transformation only works for platform-native text files. If the input file is in DOS format, the Linux box would see an extra carriage return before the end-of-line.
It's probably best to convert the input file to the native format for each platform. If that's not possible you should binmode the filehandle (preventing Perl from performing newline transformations) before reading from it and account for the various newline sequences in the regex and anywhere else the data is used.
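The failure mode can be reproduced with a shortened version of the line from the question. This sketch uses Python's `re` module for illustration; in this respect its `$` behaves like Perl's, matching at the end of the string or just before a final newline. The shortened sample line and the `\r?` variant are assumptions added here, not part of the original script.

```python
import re

# The question's pattern without its end anchor, so both variants can share it.
BASE = r"^(\d+) qid:(\S+).*?#docid = (\S+) inc = (\S+) prob = (\S+)"
strict = re.compile(BASE + r"$")
tolerant = re.compile(BASE + r"\r?$")  # also accepts a trailing carriage return

# Shortened sample line (most feature columns omitted).
body = "0 qid:7968 1:0.000000 2:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125"
unix_line = body + "\n"
dos_line = body + "\r\n"

# $ can match before the final "\n", but not before "\r\n".
unix_match = strict.match(unix_line)
dos_match = strict.match(dos_line)
dos_tolerant_match = tolerant.match(dos_line)

print(unix_match is not None)
print(dos_match is not None)
print(dos_tolerant_match.group(5))
```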

Resources