Perl on Windows translates my newlines to CRLF - windows

print FILEHANDLE; - when run from a Windows box - always converts a trailing \n into \r\n - resulting in a DOS formatted file. The difference between a DOS and a UNIX file is that in UNIX, the last character of each line is \n, whereas in Windows it is \r\n. I have tried changing the line termination character $\ = "\n"; but the print command still does the conversion to DOS format. This only occurs on Windows boxes.

If you don't like how Perl decides to output your data, you can change it. In the three-argument open, it looks like this:
open my $fh, '>:raw', $filename;
Or, if you already have the filehandle, you can use binmode:
binmode $fh, ':raw';
binmode $fh; # :raw is the default
The output depends on various IO "layers", each of which gets to stick their dirty fingers into your data before it is output. The perlio docs have the list. There's a :crlf layer that turns unix line endings, and you are probably getting it by default. Note that changing the output record separator is something that happens at the print level, but there are deeper layers that can still do their work.

Related

std::endl equivalent on Ruby?

I just can't find it.
Found you can remove them with chomp, but not how to create them.
There is a global variable $/ which represent input record separator (default to newline (\n)).
>> $/
=> "\n"
Methods like Kernel#gets use this to determine input boundary.
As long as you work with files in text mode (the default), Ruby itself does the translation of the operating system's end-of-line character sequences to "\n" in Ruby:
When reading from a file in text mode, all line endings will appear as "\n".
When writing to a file in text mode, all newline characters "\n" will be written as the operating system's end-of-line character sequence.
So for all practical purposes when dealing with files in text mode, you can use "\n" as a constant to mean the OS-specific line ending, like std::endl.
Source: How to make your Ruby code work on Windows PCs, section "Get your file modes right".

Excel saves tab delimited files without newline (UNIX/Mac os X)

This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I decide to save a spreadsheet in excel (mac 2011) as a tab delimited file it seems to do it perfectly fine. Until I try to parse the file line by line using Perl. For some reason it slurps the whole document in one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline in the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF) and *NIX systems use only a LineFeed (LF).
These characters can be represented in RegEx as \r\n or \n (respectively).
Sometimes, to hash through a text file, you need to parse New Line characters. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed:
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this. Copy data from Excel to clipboard, paste it into a google spreadsheet. Download google spreadsheet file as a 'tab-separated values .tsv'. This gets around the problem and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a Windows Formatted Text (.txt) file type
for a comma-separated file, save the document as a `Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace -- the same as \v -- or the CR LF combination, so it's the same as \r\n|\v
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are
Once you have loaded the module with use PerlIO::eol you can use it in an open statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform

Windows Perl --> Unix not working after port, possible encoding issue

I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my #header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line just does the headre alone, theres another one like it to do the repetitive data lines.
It worked great on windows, but on linux it fails. However, if I define a string with the same contents within Perl, and run the split on that, it works fine.
I think it's a UTF-16 encoding handling issue, but I'm not sure how to handle it. Does anyone know how I can get perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.
If STDIN is UTF-16, use one of the following
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
Tom has written a lengthy answer with regards to perl and unicode. It contains some bolierplate code to properly and fully support UTF-8, but you can replace with UTF-16 as needed.
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings on a file, or you can strip off the line-endings yourself in the Perl script
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)

What changes when a file is saved in Kedit for windows that the unix2dos command doesn't do?

So I have a strange question. I have written a script that re-formats data files. I basically create new files with the right column order, spacing, and such. I then unix2dos these files (the program I am formatting these files for is DIPS for windows, and I assume that the files should be ansi). When I go to open the files in the DIPS Program however an error occurs and the file won't open.
When I create the same kind of data file through the DIPS program and open it in note pad, it matches exactly with the data files I have created with my script.
On the other hand if I open the data files that I have created with my script in Kedit first, save them, and then open them in the DIPS program everything works.
My question is what could saving in Kedit possibly do that unix2dos does not?
(Also if I try using note pad or word pad to save instead of Kedit the file doesn't open in DIPS)
Here is what was created using the diff command in unix
"
1,16c1,16
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
--
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
18c18
--
440c440
--
442c442
-1
-1
"
Any help would be appreciated! Thanks!
Okay! Figured it out.
Simply when you unix2dos your file you do not strip any space characters in between the last letter in a line and the line break character. When saving in Kedit you do strip the spaces between the last letter in a line and the line break character.
In my script I had a poor programing practice in which I was writing a string like this;
echo "This is an example string " >> outfile.txt
The character count is 32, and if you could see the break line character (chr(10)) the line would read;
This is an example string
If you unix2dos outfile.txt the line looks the same as above but with a different break line character. However when you place the file into Kedit and save it, now the character count is 25 and the line looks like this;
This is an example string
This occurs because Kedit does not preserve spaces at the end of a line. It places the return or line break character at the last letter or "non space" character in a line.
So programs that read literal input like DIPS (i'm guessing) or more widely used AutoCAD scripting will have a real problem with extra spaces before the return character. Basically in AutoCAD scripting a space in a line is treated as a return character. So if you have ten extra spaces at the end of a line it's treated the same as ten returns instead of the one you probably intended.
OH and if this helped you out or though it was good please give me a vote up!
unix2dos converts the line-break characters at the end of each line, from unix line breaks (10) to dos line breaks (13, 10)
Kedit could possible change the encoding of the file (like from ansi to UTF-8)
You can change the encoding of a file with the iconv utility (on a linux box)

Perl regular expression problem

I have this conditional in a perl script:
if ($lnFea =~ m/^(\d+) qid\:([^\s]+).*?\#docid = ([^\s]+) inc = ([^\s]+) prob = ([^\s]+)$/)
and the $lnFea represents this kind of line:
0 qid:7968 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0.000000 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.005175 17:0.000000 18:0.181818 19:0.000000 20:0.003106 21:0.000000 22:0.000000 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0.000000 36:0.000000 37:0.000000 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.055556 44:0.000000 45:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125
The problem is that the if is true on Windows but false on Linux (Fedora 11). Both systems are using the most recent perl version. So what is the reason of this problem?
Assuming that $InFea is read from a file, I'd wager that the file is in DOS format. That would cause the $ anchor to prevent matching on Linux due to differences in the line-endings between those platforms. Perl's automagic newline transformation only works for platform-native text files. If the input file is in DOS format, the Linux box would see an extra carriage return before the end-of-line.
It's probably best to convert the input file to the native format for each platform. If that's not possible you should binmode the filehandle (preventing Perl from performing newline transformations) before reading from it and account for the various newline sequences in the regex and anywhere else the data is used.

Resources