Ruby on a Mac -- Regular Expression Spanning Two Lines of Text

On the PC, the following Ruby regular expression matches data. However, when run on the Mac against the same input text file, no matches occur. Am I matching line returns in a way that should work cross-platform?
data = nil
File.open(ARGV[0], "r") do |file|
data = file.readlines.join("").scan(/^Name: (.*?)[\r\n]+Email: (.*?)$/)
end
Versions
PC: ruby 1.9.2p135
Mac: ruby 1.8.6
Thank you,
Ben

The problem was the ^ and $ pattern characters! Ruby doesn't consider \r (a.k.a. ^M) a line boundary. If I modified my pattern, replacing both ^ and $ with "\r", the pattern matched as desired.
data = file.readlines.join.scan(/\rName: (.*?)\rEmail: (.*?)\r/)
Instead of modifying the pattern, I opted to do a gsub on the text, replacing \r with \n before calling scan.
data = file.readlines.join.gsub(/\r/, "\n").scan(/^Name: (.*?)\nEmail: (.*?)$/)
Thank you each for your responses to my question.
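A minimal sketch of the normalize-then-scan approach described above, assuming the input may use \r\n, bare \r, or \n line endings. The sample text and variable names are made up for illustration, and converting \r\n as well as bare \r goes slightly beyond the gsub shown in the answer.

```ruby
# Hypothetical sample using bare \r line endings (old Mac style).
text = "Name: Ann\rEmail: ann@example.com\rName: Bob\rEmail: bob@example.com\r"

# Convert \r\n and bare \r to \n so that ^ and $ see line boundaries.
normalized = text.gsub(/\r\n?/, "\n")

# The original pattern now works unchanged.
pairs = normalized.scan(/^Name: (.*?)\nEmail: (.*?)$/)
pairs.each { |name, email| puts "#{name} <#{email}>" }
```

Normalizing once up front keeps the pattern itself free of platform-specific line-ending handling.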

When going from Windows to a Unix-based system (Mac), I've had this issue: the carriage return (\r) in \r\n gets rendered as a Control-M (^M), which may or may not be interpreted correctly by your regexp.

On Unix (OS X is a Unix), lines end with \n, not \r\n. Simply matching \n will work on a Mac.
To make the script cross-platform, maybe you could first replace each \r\n sequence with a single \n character?

Related

How to obtain basename in ruby from the given file path in unix or windows format?

I need to parse a basename in Ruby from a file path which I get as input. Unix format works fine on Linux.
File.basename("/tmp/text.txt")
returns "text.txt".
However, when I get input in windows format:
File.basename("C:\Users\john\note.txt")
or
File.basename("C:\\Users\\john\\note.txt")
"C:Usersjohn\note.txt" is the output (note that \n is a new line there), but I didn't get "note.txt".
Is there some nice solution in ruby/rails?
Solution:
"C:\\test\\note.txt".split(/\\|\//).last
=> "note.txt"
"/tmp/test/note.txt".split(/\\|\//).last
=> "note.txt"
If the Linux file name doesn't contain \, it will work.
Try pathname:
require 'pathname'
Pathname.new('C:\Users\john\note.txt').basename
# => #<Pathname:note.txt> (on Windows; on Unix-like systems the backslash is not treated as a path separator)
Pathname docs
Ref How to get filename without extension from file path in Ruby
I'm not convinced that you have a problem with your code. I think you have a problem with your test.
Ruby also uses the backslash character for escape sequences in strings, so when you type the String literal "C:\Users\john\note.txt", Ruby sees the first two backslashes as invalid escape sequences, and so ignores the escape character. \n refers to a newline. So, to Ruby, this literal is the same as "C:Usersjohn\note.txt". There aren't any file separators in that sequence, since \n is a newline, not a backslash followed by the letter n, so File.basename just returns it as it receives it.
If you ask for user input in either a graphical user interface (GUI) or command line interface (CLI), the user entering input needn't worry about Ruby String escape sequences; those only matter for String literals directly in the code. Try it! Type gets into IRB or Pry, and type or copy a file path, and press Enter, and see how Ruby displays it as a String literal.
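The escape-sequence behaviour described above can be checked directly. This is a small sketch, not part of the original answers; portable_basename is a hypothetical helper name, and it simply reuses the split-on-either-separator idea from the earlier solution.

```ruby
# Double quotes: \U and \j are not recognized escapes, so the backslash is
# dropped, and \n becomes a newline. No backslash survives in the string.
double_quoted = "C:\Users\john\note.txt"

# Single quotes keep the backslashes as literal characters.
single_quoted = 'C:\Users\john\note.txt'

puts double_quoted.inspect
puts single_quoted.inspect

# Hypothetical helper: split on either separator, regardless of platform.
def portable_basename(path)
  path.split(/[\\\/]/).last
end

puts portable_basename(single_quoted)
puts portable_basename("/tmp/text.txt")
```

Note that such a helper would misbehave on a Unix file name that legitimately contains a backslash, as mentioned earlier in the thread.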
On Windows, Ruby accepts paths given using both "/" (File::SEPARATOR) and "\\" (File::ALT_SEPARATOR), so you don't need to worry about conversion unless you are displaying it to the user.
Backslashes, while how Windows expresses things, are just a giant nuisance. Within a double-quoted string they have special meaning so you either need to do:
File.basename("C:\\Users\\john\\note.txt")
Or use single quotes that avoid the issue:
File.basename('C:\Users\john\note.txt')
Or use regular slashes which aren't impacted:
File.basename("C:/Users/john/note.txt")
Ruby then maps these to the platform-specific path separator for you.

How to determine line ending types in Ruby

I am using CSVLint to run some validation on flat files. The sources for the files can have varied line endings, some are \n, some \r\n. The Validator constructor takes a dialect parameter where I need to specify the line ending type.
Is there a good/quick/easy way to sample the first line of the flat file to determine the line ending type in Ruby?
Update
The answer below is the correct answer to my question. If you need auto line endings in CSVLint, however, try this in the dialect:
"lineTerminator" => :auto
Also, @sawa's answer below pertains to my original question (and typo) of looking for \r and \r\n.
To detect \n and \r\n line endings, simply match the first line against the regular expression /\r?\n$/:
def determine_line_ending(filename)
  File.open(filename, 'r') do |file|
    return file.readline[/\r?\n$/]
  end
end
determine_line_ending('./windows_file.csv')
# => "\r\n"
determine_line_ending('./unix_file.csv')
# => "\n"
This doesn't handle weird edge cases like the Mac OS 9 (discontinued in 2001) \r line ending, but covers everything else. If you want some background on historical line endings, the Wikipedia article is pretty interesting.
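If the bare \r case does matter, one possible variation is to look for the first terminator of any kind in a sample of the text. This is only a sketch: first_line_ending is a hypothetical name, and it works on a string rather than a file so that it can be tried without any fixtures.

```ruby
# Hypothetical helper: returns "\r\n", "\n", "\r", or nil for the first
# line terminator found in a chunk of text. \r\n is listed first so that a
# Windows ending is not reported as a bare \r.
def first_line_ending(sample)
  sample[/\r\n|\n|\r/]
end

p first_line_ending("a,b\r\nc,d\r\n")
p first_line_ending("a,b\nc,d\n")
p first_line_ending("a,b\rc,d\r")

# With a real file, a sample could be read in binary mode so that no
# newline translation happens first, for example:
#   first_line_ending(File.binread(path, 4096))
```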
Edit The following is an answer to the original question, not the question after it has changed.
When you have the first line in a variable line,
line[/[\r\n]+/]
will give you what line ending you have.

Ruby gsub issues

I have a piece of text that resembled the following:
==EXCLUDE
#lots of lines of text
==EXCLUDE
#this is what I actually want
And so I was trying to remove the unwanted bit by doing:
str.gsub!(/==EX.*?==EXCLUDE/, '')
However, it's not working. When I tried to remove the \n chars first, it worked like a dream. The issue is that I can't actually remove the \n characters. How can I do a substitution like this while leaving newlines in place?
By default, the . does not match line break chars. If you enable the m modifier in Ruby (in other languages, this is the s modifier) it should work:
str.gsub!(/==EX.*?==EXCLUDE/m, '')
Here's a live demo on Rubular: http://rubular.com/r/YxLSB1Iq95
Try str.gsub!(/==EX.*?==EXCLUDE/m, '')
That should make it span new lines.
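To see the difference the m modifier makes, here is a short sketch using a made-up string shaped like the one in the question (the sample text is an assumption, not the asker's data):

```ruby
str = "==EXCLUDE\nlots of lines\nof text\n==EXCLUDE\nthis is what I actually want\n"

# Without /m, . stops at each newline, so the pattern never reaches the
# second marker and nothing is replaced.
without_m = str.gsub(/==EX.*?==EXCLUDE/, '')

# With /m, . also matches newlines, so the whole excluded block is removed
# while the remaining newlines stay in place.
with_m = str.gsub(/==EX.*?==EXCLUDE/m, '')

puts without_m == str
puts with_m.inspect
```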

Rubular/Ruby discrepancy in captured text

I've carefully cut and pasted from this Rubular window http://rubular.com/r/YH8Qj2EY9j to my code, yet I get different results. The Rubular match capture is what I want. Yet
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
only gets me the first line, i.e.
<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
I don't think it's my test data, but that's possible. What am I missing?
(ruby 1.9 on Ubuntu 10.10)
Paste your test data into an editor that is able to display control characters and verify your line break characters. Normally it should be only \n on a Linux system as in your regex. (I had unusual linebreaks a few weeks ago and don't know why.)
The other check you can do is to move your parentheses and print the capturing groups, so that you can see which part of your regex matches what.
/^<DD>(.*)\n?(.*)\n/
Another way to get this to work is to change the .*: instead of matching any character, match anything except \n.
^<DD>([^\n]*\n?[^\n]*)\n
I believe you need the multiline modifier in your code:
/m Multiline mode: dot matches newlines. (In Ruby, ^ and $ match at line starts and endings with or without /m.)
The following:
#!/usr/bin/env ruby
desc= '<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
<DT>la la this should not be matched oh good'
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
prints
#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
on my system (Linux, Ruby 1.8.7).
Perhaps your line breaks are really \r\n (Windows style)? What if you try:
desc_pattern = /^<DD>(.*\r?\n?.*)\r?\n/
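The two diagnostic suggestions above (making control characters visible and printing the capturing groups separately) can be tried without a special editor. This sketch uses shortened, made-up test data with Windows-style line breaks, which is only one possible cause of the discrepancy:

```ruby
# Hypothetical sample with \r\n line breaks.
desc = "<DD>first line<br />\r\nsecond line\r\n<DT>not matched"

# String#inspect shows control characters as escapes such as \r and \n.
puts desc.inspect

# Split the capture into two groups to see what each part matched.
# In Ruby, . matches \r (but not \n), so a stray \r ends up in each group.
match = desc.match(/^<DD>(.*)\n?(.*)\n/)
p match.captures
```

A trailing "\r" in the printed captures is a strong hint that the data uses \r\n rather than \n.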

Does Perl's /m regex modifier match differently on Windows?

The following Perl statements behave identically on Unixish machines. Do they behave differently on Windows? If yes, is it because of the magic \n?
split m/\015\012/ms, $http_msg;
split m/\015\012/s, $http_msg;
I got a failure on one of my CPAN modules from a Win32 smoke tester. It looks like it's an \r\n vs \n issue. One change I made recently was to add //m to my regexes.
For these regexes:
m/\015\012/ms
m/\015\012/s
Both /m and /s are meaningless.
/s: makes . match \n too.
Your regex doesn't contain .
/m: makes ^ and $ match next to embedded \n in the string.
Your regex contains no ^ nor $, or their synonyms.
One possibility is that your input handle (a socket?) is in text mode, in which case the \r (\015) characters will already have been stripped on Windows.
So, what to do? I suggest making the \015 characters optional, and split against
/\015?\012/
No need for /m, /s or even the leading m//. Those are just cargo cult.
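For comparison with the Ruby questions earlier on this page, the same optional-carriage-return idea can be sketched in Ruby. The message text below is made up, and this illustrates the pattern only, not the Perl module in question:

```ruby
# Hypothetical message mixing \r\n and bare \n line endings.
http_msg = "GET / HTTP/1.1\r\nHost: example.com\nAccept: */*\r\n"

# \r?\n is the equivalent of \015?\012: the carriage return is optional,
# so the split works whether or not \r was stripped on the way in.
lines = http_msg.split(/\r?\n/)
p lines
```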
There is no magic \n. Both \n and \r always mean exactly one character, and on all ASCII-based platforms that is \cJ and \cM respectively. (The exceptions are EBCDIC platforms (for obvious reasons) and MacOS Classic (where \n and \r both mean \cM).)
The magic that happens on Windows is that when doing I/O through a file handle that is marked as being in text mode, \r\n is translated to \n upon reading and vice versa upon writing. (Also, \cZ is taken to mean end-of-file – surprise!) This is done at the C runtime library layer.
You need to binmode your socket to fix that.
You should also remove the /s and /m modifiers from your pattern: since you do not use the meta-characters whose behaviour they modify (. and the ^/$ pair, respectively), they do nothing – cargo cult.
Why did you add the /m? Are you trying to split on lines? To do that with /m you need to use either ^ or $ in the regex:
my @lines = split /^/m, $big_string;
However, if you want to treat a big string as lines, just open a filehandle on a reference to the scalar:
open my $string_fh, '<', \ $big_string;
while ( <$string_fh> ) {
    # ... process a line
}

Resources