Perl regular expression problem - windows

I have this conditional in a Perl script:
if ($lnFea =~ m/^(\d+) qid\:([^\s]+).*?\#docid = ([^\s]+) inc = ([^\s]+) prob = ([^\s]+)$/)
and the $lnFea represents this kind of line:
0 qid:7968 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0.000000 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.005175 17:0.000000 18:0.181818 19:0.000000 20:0.003106 21:0.000000 22:0.000000 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0.000000 36:0.000000 37:0.000000 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.055556 44:0.000000 45:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125
The problem is that the condition is true on Windows but false on Linux (Fedora 11). Both systems are running the most recent Perl version. What is the reason for this?

Assuming that $lnFea is read from a file, I'd wager that the file is in DOS format. On Linux that would stop the $ anchor from matching, because line endings differ between the two platforms. Perl's automagic newline transformation only works for platform-native text files. If the input file is in DOS format, the Linux box sees an extra carriage return (\r) before the end-of-line, and since ([^\s]+) cannot consume it, the last group never reaches a position where $ can match.
It's probably best to convert the input file to the native format for each platform. If that's not possible you should binmode the filehandle (preventing Perl from performing newline transformations) before reading from it and account for the various newline sequences in the regex and anywhere else the data is used.
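To make the effect concrete, here is a minimal sketch. It is written in Ruby rather than Perl, the feature line is a shortened, hypothetical stand-in for the one in the question, and \Z is used because it behaves like Perl's $ (end of string, or just before a final newline). Treat it as an illustration of the carriage-return problem, not as the asker's actual code.

```ruby
# Hypothetical, shortened stand-in for the feature line, as Linux would see it
# when the file has DOS (\r\n) line endings.
line = "0 qid:7968 1:0.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0214125\r\n"

# Same shape as the pattern in the question. \Z stands in for Perl's $.
pattern = /^(\d+) qid:(\S+).*?\#docid = (\S+) inc = (\S+) prob = (\S+)\Z/

# The stray \r sits between the last field and the newline, so nothing matches.
with_cr = pattern.match(line)

# Strip either kind of line ending first, then match again.
stripped = line.sub(/\r?\n\z/, "")
without_cr = pattern.match(stripped)

puts with_cr.inspect
puts without_cr[5]
```

Stripping the line ending before matching (or allowing an optional \r in the pattern) makes the same check behave identically on both platforms.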

Related

Perl on Windows translates my newlines to CRLF

print FILEHANDLE; - when run from a Windows box - always converts a trailing \n into \r\n - resulting in a DOS formatted file. The difference between a DOS and a UNIX file is that in UNIX, the last character of each line is \n, whereas in Windows it is \r\n. I have tried changing the line termination character $\ = "\n"; but the print command still does the conversion to DOS format. This only occurs on Windows boxes.
If you don't like how Perl decides to output your data, you can change it. In the three-argument open, it looks like this:
open my $fh, '>:raw', $filename;
Or, if you already have the filehandle, you can use binmode:
binmode $fh, ':raw';
binmode $fh; # :raw is the default
The output depends on various IO "layers", each of which gets to stick its dirty fingers into your data before it is output. The perlio docs have the list. There's a :crlf layer that turns Unix line endings into CRLF on output, and you are probably getting it by default on Windows. Note that changing the output record separator is something that happens at the print level, but there are deeper layers that can still do their work.

Why is \r\n being converted to \n when a string is saved to a file?

The string is originating as a return value from:
> msg = imap.uid_fetch(uid, ["RFC822"])[0].attr["RFC822"]
In the console if I type msg, a long string is displayed with double quotes and \r\n separating each line:
> msg
"Delivered-To: email@test.com\r\nReceived: by xx.xx.xx.xx with SMTP id;\r\n"
If I match part of it with a regex, the return value has \r\n:
> msg[/Delivered-To:.*?\s+Received:/i]
=> "Delivered-To: email@test.com\r\nReceived:"
If I save the string to a file, read it back in and match it with the same regex, I get \n instead of \r\n:
> File.write('test.txt', msg)
> str = File.read('test.txt')
> str[/Delivered-To:.*?\s+Received:/i]
=> "Delivered-To: email@test.com\nReceived:"
Is \r\n being converted to \n when the string is saved to a file?
Is there a way to save the string to a file, read it back in without the line endings being modified?
This is covered in the IO.new documentation:
The following modes must be used separately, and along with one or more of the modes seen above.
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
"t" Text file mode
In other words, Ruby, like many other languages, senses the OS it's on and will automatically translate line-ends between "\r\n" <-> "\n" when reading/writing a file in text mode. Use binary mode to avoid translation.
str = File.read('test.txt')
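As a minimal sketch of the binary-mode approach, the round trip below uses File.binwrite and File.binread, which open the file in binary mode. The message string and the temporary file are hypothetical stand-ins for the IMAP data in the question.

```ruby
require "tmpdir"

# Hypothetical stand-in for the fetched message; note the \r\n line endings.
msg = "Delivered-To: user@example.com\r\nReceived: by host with SMTP id;\r\n"

round_trip = Dir.mktmpdir do |dir|
  path = File.join(dir, "test.txt")
  File.binwrite(path, msg)   # binary mode: no "\n" <-> "\r\n" translation
  File.binread(path)         # binary mode on the way back in as well
end

matched = round_trip[/Delivered-To:.*?\s+Received:/i]
puts matched.inspect
```

Because neither direction goes through text mode, the bytes read back are the bytes that were written, on any platform. File.binread returns a string tagged ASCII-8BIT, so call force_encoding on it if the rest of the code expects UTF-8.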
A better practice would be to read the file using foreach, which removes the need to even care about line endings; you'll get each line separately. An alternative is readlines, but it slurps the whole file into memory, which can be very costly on large files.
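A small sketch of the foreach approach, assuming a hypothetical file with mixed line endings. Calling chomp on each line strips a trailing "\r\n" as well as "\n", so the loop body never sees the terminator.

```ruby
require "tmpdir"

lines = []
Dir.mktmpdir do |dir|
  path = File.join(dir, "mail.txt")
  # Hypothetical sample with a DOS ending, a Unix ending, and no final newline.
  File.binwrite(path, "first\r\nsecond\nthird")

  # foreach yields one line at a time instead of loading the whole file.
  File.foreach(path) do |line|
    lines << line.chomp   # chomp removes "\r\n", "\n", or "\r" at the end
  end
end

p lines
```

Collecting into an array is only for display here; in real processing each line would be handled inside the block so memory use stays flat.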
Also, if you're processing mail files, I'd strongly recommend using something written to do so rather than write your own. The Mail gem is one such package that's pre-built and well tested.

std::endl equivalent on Ruby?

I just can't find it.
Found you can remove them with chomp, but not how to create them.
There is a global variable $/ which represents the input record separator (it defaults to newline, \n).
>> $/
=> "\n"
Methods like Kernel#gets use this to determine input boundary.
As long as you work with files in text mode (the default), Ruby itself does the translation of the operating system's end-of-line character sequences to "\n" in Ruby:
When reading from a file in text mode, all line endings will appear as "\n".
When writing to a file in text mode, all newline characters "\n" will be written as the operating system's end-of-line character sequence.
So for all practical purposes when dealing with files in text mode, you can use "\n" as a constant to mean the OS-specific line ending, like std::endl.
Source: How to make your Ruby code work on Windows PCs, section "Get your file modes right".
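A minimal sketch of that behaviour, using a hypothetical temporary file: the program only ever deals with "\n", and text mode takes care of whatever the operating system wants on disk.

```ruby
require "tmpdir"

text = "first line\nsecond line\n"

read_back = Dir.mktmpdir do |dir|
  path = File.join(dir, "out.txt")
  File.write(path, text)   # text mode: "\n" becomes the OS line ending on disk
  File.read(path)          # text mode: the OS line ending comes back as "\n"
end

puts read_back == text     # the round trip is transparent in text mode
puts $/.inspect            # default input record separator
```

For console output, puts appends "\n" for you when the string does not already end with one, which is usually the closest everyday equivalent of writing std::endl (without the flush).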

Windows Perl --> Unix not working after port, possible encoding issue

I've got a Perl program that I wrote on Windows. It starts with:
$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);
Which should split input that consists of a very large file of:
The|^|Quick|^|Brown|^|Fox|!|
Into:
{The, Quick, Brown, Fox|!|}
Note: This line just handles the header alone; there's another one like it for the repetitive data lines.
It worked great on Windows, but on Linux it fails. However, if I define a string with the same contents within Perl and run the split on that, it works fine.
I think it's a UTF-16 encoding handling issue, but I'm not sure how to handle it. Does anyone know how I can get Perl to understand the UTF-16 being piped into STDIN?
I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.
If STDIN is UTF-16, use one of the following:
binmode(STDIN, ':encoding(UTF-16le)'); # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)'); # The other byte order.
binmode(STDIN, ':encoding(UTF-16)'); # Use BOM to determine byte order.
Tom has written a lengthy answer regarding Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can substitute UTF-16 as needed.
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.
If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings on a file, or you can strip off the line endings yourself in the Perl script:
$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//; # chop \r\n (Windows) or \n (Unix)
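To show why the leftover carriage return matters for the split, here is a minimal sketch in Ruby rather than Perl, using the sample header from the question with an assumed DOS line ending. It only illustrates the line-ending theory and says nothing about the UTF-16 one.

```ruby
# Sample header from the question, with an assumed DOS (\r\n) line ending.
header = "The|^|Quick|^|Brown|^|Fox|!|\r\n"

# Splitting the raw line leaves the line ending glued to the last field.
raw_fields = header.split("|^|", -1)

# Stripping \r\n (Windows) or \n (Unix) first gives clean fields.
clean_fields = header.sub(/\r?\n\z/, "").split("|^|", -1)

p raw_fields.last
p clean_fields
```

Any later comparison or pattern that expects the last field to end in |!| would fail on the raw version, which is consistent with a script that works on one platform and not the other.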

Windows Command to detect and remove text in a file

I have an ASCII file and in there somewhere is the line:
BEGIN
and later on the line:
END
I'd like to be able to remove those two lines and everything in between from a command line call in Windows. This needs to be completely automated.
EDIT: See sed in Vista - how to delete all symbols between? for details on how to use sed to do this (cygwin has sed).
EDIT: I am finding that sed could work, but when I redirect the output to a file, the carriage returns have been removed. How can I keep these? I'm using this sed script:
/^GlobalSection(TeamFoundationVersionControl) = preSolution$/,/^EndGlobalSection$/{
/^GlobalSection(TeamFoundationVersionControl) = preSolution$/!{
/^EndGlobalSection$/!d
}
}
.. where the start section is 'GlobalSection(TeamFoundationVersionControl) = preSolution' and the end section is 'EndGlobalSection'. I'd also like to delete these lines as well.
EDIT: I am now using something simpler for sed:
/^GlobalSection(TeamFoundationVersionControl) = preSolution$/,/^EndGlobalSection$/d
The line feeds are still an issue though
Alternately, what I use these days for such tasks is a scripting language that plays nicely with Windows, like Ruby or Python. Ruby is easy to install on Windows and makes problems like this child's play.
Here's a script you could use like:
cutBeginEnd.rb myFileName.txt
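# Open for reading and writing; binary mode ("b") leaves \r\n untouched on Windows
File.open(ARGV[0], "r+b") do |source_file|
  # Get the string and do a multiline, non-greedy replace of BEGIN..END
  file_string = source_file.read
  sliced_string = file_string.gsub(/^BEGIN\r?\n.*?^END\r?\n/m, "")
  # Overwrite the file from the start
  source_file.pos = 0
  source_file.print sliced_string
  # Flush, then cut off whatever is left over from the old, longer content
  source_file.flush
  source_file.truncate(source_file.pos)
end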
This does a pretty good job, allows for a lot of flexibility, and is possibly more readable than sed.
Here is a 1-line Perl command that does what you want (just type it from the Command Prompt window):
perl -i.bak -ne "print unless /^BEGIN\r?\n/ .. /^END\r?\n/" myfile.txt
Carriage returns and line feeds will be preserved properly. The original version of myfile.txt will be saved as myfile.txt.bak.
If you don't have Perl installed, get ActivePerl.
Here's how to delete the entire GlobalSection(TeamFoundationVersionControl) = preSolution section using a C# regular expression:
// Create a regex to match against an entire GlobalSection(TeamFoundationVersionControl) section so that it can be removed (including preceding and trailing whitespace).
// The symbols *, +, and ? are greedy by default and will match everything until the LAST occurrence of EndGlobalSection, so we must use their non-greedy counterparts, *?, +?, and ??.
// Example of string to match against: " GlobalSection(TeamFoundationVersionControl) ...... EndGlobalSection "
Regex _regex = new Regex(@"(?i:\s*?GlobalSection\(TeamFoundationVersionControl\)(?:.|\n)*?EndGlobalSection\s*?)", RegexOptions.Compiled);

Resources