Rubular/Ruby discrepancy in captured text - ruby

I've carefully cut and pasted from this Rubular window http://rubular.com/r/YH8Qj2EY9j to my code, yet I get different results. The Rubular match capture is what I want. Yet
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
only gets me the first line, i.e.
<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
I don't think it's my test data, but that's possible. What am I missing?
(ruby 1.9 on Ubuntu 10.10)

Paste your test data into an editor that is able to display control characters and verify your line break characters. Normally it should be only \n on a Linux system as in your regex. (I had unusual linebreaks a few weeks ago and don't know why.)
The other check you can do is to split your capturing group in two and print each group, so that you can see which part of your regex matches what.
/^<DD>(.*)\n?(.*)\n/
Another idea to get this to work is to change the .*: instead of saying "match any character", say "match anything but \n".
^<DD>([^\n]*\n?[^\n]*)\n
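The checks above can be sketched in Ruby as follows. The sample string is hypothetical (it is not the asker's data) and deliberately uses \r\n line endings to show what String#inspect makes visible.

```ruby
# Hypothetical sample data with Windows-style line endings (an assumption).
desc = "<DD>first line<br />\r\nsecond line\r\n<DT>not wanted"

# String#inspect renders control characters as visible escapes such as \r and \n.
visible = desc.inspect
puts visible

# Split the single capture into two groups to see what each part matches.
match = desc.match(/^<DD>(.*)\n?(.*)\n/)
groups = match ? match.captures : []
p groups

# The negated character class from the answer: anything except \n.
strict = desc[/^<DD>([^\n]*\n?[^\n]*)\n/, 1]
p strict
```

If the printed groups end in \r, the data has carriage returns that the pattern is not accounting for.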

I believe you need the multiline modifier in your code:
/m Multiline mode: dot matches newlines. (In Ruby, ^ and $ match at line starts and endings with or without this modifier.)

The following:
#!/usr/bin/env ruby
desc= '<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
<DT>la la this should not be matched oh good'
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
prints
#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
on my system (Linux, Ruby 1.8.7).
Perhaps your line breaks are really \r\n (Windows style)? What if you try:
desc_pattern = /^<DD>(.*\r?\n?.*)\r?\n/
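A small sketch of how the \r-tolerant pattern behaves on both kinds of line endings. The two input strings are made up for illustration. Because . also matches \r in Ruby, the capture from the Windows-style input can still contain carriage returns, so this sketch strips them afterwards.

```ruby
pattern = /^<DD>(.*\r?\n?.*)\r?\n/

# Hypothetical inputs: the same text with Unix and Windows line endings.
unix    = "<DD>line one<br />\nline two\n<DT>not wanted"
windows = "<DD>line one<br />\r\nline two\r\n<DT>not wanted"

unix_capture    = unix[pattern, 1]
windows_capture = windows[pattern, 1]

# The greedy .* swallows the \r characters, so remove them after matching.
cleaned = windows_capture.delete("\r")

p unix_capture
p windows_capture
p cleaned
```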

Related

Multibyte character issue with .match?

The following code is something I am beginning to test for use within a "Texas Hold Em" style game I am working on.
My question is why, when running the following code, the puts involving a "♥" returns a "\u" in its place. I feel certain it is this multibyte character that is causing the issue, because for the second puts I replaced the ♦ with a d in the array of strings and it returned what I was expecting. See below:
My Code:
#! /usr/bin/env ruby
# encoding: utf-8
table_cards = ["|2♥|", "|8♥|", "|6d|", "|6♣|", "|Q♠|"]
# Array of cards
player_1_face_1 = "8"
player_1_suit_1 = "♦"
# Player 1's face and suit of first card he has
player_1_face_2 = "6"
player_1_suit_2 = "♥"
# Player 1's face and suit of second card he has
test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)
# EX: Searching for match between face values on (player 1's |8♦|) and the |8♥| on the table
test_str_2 = /(\D6\D{2})/.match(table_cards.to_s)
# EX: Searching for match between face values on (player 1's |6♥|) and the |6d| on the table
puts "#{test_str_1}"
puts "#{test_str_2}"
Puts to Screen:
|8\u
|6d|
-- My goal would be to get the first puts to return: |8♥|
I am not so much looking for a solution to this (there may not even be one) but more so a "as simple as possible" explanation of what is causing this issue and why. Thanks ahead of time for any information on what is happening here and how I can tackle the goal.
The "\u" you're seeing is the prefix of a Unicode escape sequence.
For example, Unicode character 'HEAVY BLACK HEART' (U+2764) can be printed as "\u2764".
A friendly Unicode character listing site is http://unicode-table.com/en/sets/
Are you able to launch interactive Ruby in your shell and print a heart like this?
irb
irb> puts "\u2764"
❤
When I run your code in my Ruby, I get the answer you expect:
test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)
=> #<MatchData "|8♥|" 1:"|8♥|">
What happens if you try a regex that is more specific to your cards?
test_str_1 = /(\|8[♥♦♣♠]\|)/.match(table_cards.to_s)
In your example output, you're not seeing the Unicode heart symbol as you want. Instead, your output is printing the "\u" which is the Unicode starter, but then not printing the rest of the expected string which is "2764".
See the comment by the Tin Man that describes encoding for your console. If he's correct, then I expect the more-specific regex will succeed, but still print the wrong output.
See the comment by David Knipe that says it looks like it gets truncated because the regex only matches 4 characters. If he's correct, then I expect the more-specific regex will succeed and also print the right output.
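One way to sidestep both concerns is to match each card string on its own instead of matching against table_cards.to_s. Since Ruby 1.9, Array#to_s returns the same text as Array#inspect, and inspect can escape non-ASCII characters as \uXXXX sequences when the active encodings cannot represent them, which would be consistent with a four-character match ending in \u. A minimal sketch, assuming a UTF-8 source file; card_pattern and found are illustrative names:

```ruby
# encoding: utf-8
table_cards = ["|2♥|", "|8♥|", "|6d|", "|6♣|", "|Q♠|"]

# Array#to_s is the same as Array#inspect, so it is a debugging view of the
# array and not a plain join of its elements.
same_as_inspect = (table_cards.to_s == table_cards.inspect)

# Match each card individually: a pipe, the face value 8, one suit, a pipe.
card_pattern = /\A\|8[♥♦♣♠]\|\z/
found = table_cards.find { |card| card =~ card_pattern }

puts found
```

Working on the elements keeps the regex away from the quotes, commas and possible escapes that the array's string form adds.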
(The rest of this answer is typical for Unix; if you're on Windows, ignore the rest here...)
To show your system language settings, try this in your shell:
echo $LC_ALL
echo $LC_CTYPE
If they are not "UTF-8" or something like that, try this in your shell:
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Then re-run your code -- be sure to use the same shell.
If this works, and you want to make this permanent, one way is to add these here:
# /etc/environment
LC_ALL=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
Then source that file from your .bashrc or .zshrc or whatever shell startup file you use.
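The same settings can also be inspected from inside Ruby. This is only a diagnostic sketch; on Unix-like systems the default external encoding is typically derived from the locale, and utf8_console is an illustrative name.

```ruby
# Show the locale variables Ruby was started with (nil means unset).
puts "LC_ALL=#{ENV['LC_ALL'].inspect} LC_CTYPE=#{ENV['LC_CTYPE'].inspect}"

# The encoding Ruby assumes for data read from and written to the outside world.
external = Encoding.default_external
puts "default_external: #{external}"

# The encoding of this source file.
puts "source encoding: #{__ENCODING__}"

utf8_console = (external == Encoding::UTF_8)
puts "UTF-8 external encoding? #{utf8_console}"
```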

Issue copying file into new file gsub with regex, variable and string?

I'm struggling with a script to target specific XML files in a directory and rename them as copies with a different name.
I put in the puts statements for debugging, and, from what I can tell, everything looks OK until the FileUtils.cp line. I tried this with simpler text and it worked, but my overly complicated cp(file, file.gsub()) seems to be causing problems that I can't figure out.
def piano_treatment(cats)
FileUtils.chdir('12Piano')
src = Dir.glob('*.xml')
src.each do |file|
puts file
cats.each do |text|
puts text
if file =~ /#{text}--\d\d/
puts "Match Found!!"
puts FileUtils.pwd
FileUtils.cp(file, file.gsub!(/#{text}--\d\d/, "#{text}--\d\dBass "))
end
end
end
end
piano_treatment(cats)
I get the following output in Terminal:
12Piano--05Free Stuff--11Test.xml
05Free Stuff
Match Found!!
/Users/mbp/Desktop/Sibelius_Export/12Piano
cp 12Piano--05Free Stuff--ddBass Test.xml 12Piano--05Free Stuff--ddBass Test.xml
/Users/mbp/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:1551:in `stat': No such file or directory - 12Piano--05Free Stuff--ddBass Test.xml (Errno::ENOENT)
Why is \d\d showing up as "dd" when it should actually be numbers? Is this a single vs. double quote issue? Both yield errors.
Any suggestions are appreciated. Thanks.
EDIT One additional change was needed to this code. The FileUtils.chdir('12Piano') would change the directory for the first iteration of the loop, but it would revert to the source directory after that. Instead I did this:
def piano_treatment(cats)
src = Dir.glob('12Piano/*.xml')
which sets the match path for the whole method.
Your replacement string is not a regex, so \d has no special meaning, but is just a literal string. You need to specify a group in your regex, and then you can use the captured group in your replacement string:
FileUtils.cp(file, file.gsub(/#{text}--(\d\d)/, "#{text}--\\1Bass "))
The parentheses in the regex form the group, which can be used (by number) in the replacement string: \1 for the first group, \2 for the second, etc. \0 refers to the entire regex match.
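A short sketch of the difference, using the file name from the question's output. Regexp.escape and the block form of gsub are additions for illustration and are not part of the original answer.

```ruby
file = "12Piano--05Free Stuff--11Test.xml"
text = "05Free Stuff"

# In a double-quoted string, "\d" is just the letter d, which explains "ddBass".
literal = "#{text}--\d\dBass "

# Capture the digits and refer to them with \1 in the replacement string.
# Regexp.escape guards against regex metacharacters in text (an added precaution).
new_name = file.gsub(/#{Regexp.escape(text)}--(\d\d)/, "#{text}--\\1Bass ")

# Equivalent block form, where $1 holds the captured digits.
block_name = file.gsub(/#{Regexp.escape(text)}--(\d\d)/) { "#{text}--#{$1}Bass " }

puts literal
puts new_name
puts block_name
```

The non-bang gsub also matters here: it leaves file unchanged, so the source and destination passed to FileUtils.cp differ.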
Update
Replaced gsub!() with gsub() and escaped the backslash in the replacement string (to treat \1 as the capture group, not a literal character... Doh!).

Weird behavior when changing line separator and then changing it back

I was following the advice from this question when trying to read in multi-line input from the command line:
# change line separator
$/ = 'END'
answer = gets
pp answer
However, I get weird behavior from STDIN#gets when I try to change $/ back:
# put it back to normal
$/ = "\n"
answer = gets
pp answer
pp 'magic'
This produces output like this when executed with Ruby:
$ ruby multiline_input_test.rb
this is
a multiline
awesome input string
FTW!!
END
"this is\n\ta multiline\n awesome input string\n \t\tFTW!!\t\nEND"
"\n"
"magic"
(I input up to the END and the rest is output by the program, then the program exits.)
It does not pause to get input from the user after I change $/ back to "\n". So my question is simple: why?
As part of a larger (but still small) application, I'm trying to devise a way of recording notes; as it is, this weird behavior is potentially devastating, as the rest of my program won't be able to function properly if I can't reset the line separator. I've tried all manner of using double- and single-quotes, but that doesn't seem to be the issue. Any ideas?
The problem you're having is that your input ends with END\n. Ruby sees the END, and there's still a \n left in the buffer. You do successfully set the input record separator back to \n, so that character is immediately consumed by the second gets.
You therefore have two easy options:
1. Set the input record separator to END\n (use double quotes in order to have the newline character work):
$/ = "END\n"
2. Clear the buffer with an extra call to gets:
$/ = 'END'
answer = gets
gets # Consume extra `\n`
I consider option 1 clearer.
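The leftover newline can be reproduced without a terminal. This sketch uses StringIO as a stand-in for STDIN (an assumption made so the example is self-contained) and passes the separator as an argument to gets, which avoids changing the global $/ at all.

```ruby
require 'stringio'

# Simulated user input: a multiline note terminated by END, then another line.
input = StringIO.new("this is\na multiline\nEND\ntest\n")

# Separator "END": reading stops right after END, leaving "\n" in the buffer.
first    = input.gets("END")
leftover = input.gets          # default separator; returns the stray "\n"

# Separator "END\n": the newline is consumed together with END.
input.rewind
note      = input.gets("END\n")
next_line = input.gets

p first, leftover, note, next_line
```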
This shows it working on my system using option 1:
$ ruby multiline_input_test.rb
this is
a multiline
awesome input string
FTW!!
END
"this is\n a multiline\n awesome input string\n FTW!!\nEND\n"
test
"test\n"
"magic"

Ruby gsub issues

I have a piece of text that resembled the following:
==EXCLUDE
#lots of lines of text
==EXCLUDE
#this is what I actually want
And so I was trying to remove the unwanted bit by doing:
str.gsub!(/==EX.*?==EXCLUDE/, '')
However, it's not working. When I tried to remove the \n chars first, it worked like a dream. The issue is that I can't actually remove the \n characters. How can I do a substitution like this while leaving newlines in place?
By default, the . does not match line break chars. If you enable the m modifier in Ruby (in other languages, this is the s modifier) it should work:
str.gsub!(/==EX.*?==EXCLUDE/m, '')
Here's a live demo on Rubular: http://rubular.com/r/YxLSB1Iq95
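A runnable comparison of the two patterns, using a string shaped like the text in the question (the exact contents are illustrative):

```ruby
str = "==EXCLUDE\n#lots of lines of text\n==EXCLUDE\n#this is what I actually want"

# Without /m the dot stops at each newline, so nothing matches.
without_m = str.gsub(/==EX.*?==EXCLUDE/, '')

# With /m the dot also matches newlines, so the block is removed.
with_m = str.gsub(/==EX.*?==EXCLUDE/m, '')

p without_m == str
p with_m
```

The newline that followed the second ==EXCLUDE stays in place, since only the matched span is removed.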
Try str.gsub!(/==EX.*?==EXCLUDE/m, '')
That should make it span new lines.

Ruby on a Mac -- Regular Expression Spanning Two Lines of Text

On the PC, the following Ruby regular expression matches data. However, when run on the Mac against the same input text file, no matches occur. Am I matching line returns in a way that should work cross-platform?
data = nil
File.open(ARGV[0], "r") do |file|
data = file.readlines.join("").scan(/^Name: (.*?)[\r\n]+Email: (.*?)$/)
end
Versions
PC: ruby 1.9.2p135
Mac: ruby 1.8.6
Thank you,
Ben
The problem was the ^ and $ pattern characters! Ruby doesn't consider \r (a.k.a. ^M) a line boundary. If I modified my pattern, replacing both ^ and $ with "\r", the pattern matched as desired.
data = file.readlines.join.scan(/\rName: (.*?)\rEmail: (.*?)\r/)
Instead of modifying the pattern, I opted to do a gsub on the text, replacing \r with \n before calling scan.
data = file.readlines.join.gsub(/\r/, "\n").scan(/^Name: (.*?)\nEmail: (.*?)$/)
Thank you each for your responses to my question.
When going from Windows to a Unix-based system (Mac) I've had this issue with ^M, i.e. the \r in \r\n. The carriage return gets rendered as a Control-M, which may or may not be interpreted correctly by your regexp.
On Unix (OS X is a Unix), end of lines are \n, not \r\n. Simply using [\n] will work on Mac.
To have a cross-platform script, maybe you could first replace each \r\n sequence with a \n character?
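A sketch of that normalization step, assuming the input may contain \r\n or bare \r line endings. The sample records and the example.com addresses are made up.

```ruby
# Hypothetical input mixing Windows (\r\n) and bare carriage return (\r) endings.
text = "Name: Ben\r\nEmail: ben@example.com\r\nName: Ann\rEmail: ann@example.com\r"

# Convert \r\n and lone \r to \n so that ^ and $ see ordinary line boundaries.
normalized = text.gsub(/\r\n?/, "\n")

pairs = normalized.scan(/^Name: (.*?)\nEmail: (.*?)$/)
p pairs
```

Matching \r\n? treats a \r\n pair as a single line break, whereas replacing every \r on its own would turn it into two newlines.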

Resources