Replace words in string by multiple scans - ruby

My goal is to turn any two consecutive commas into ",NA,". This means that:
str = ",,,123,,BLAH,," changes to ",NA,123,NA,BLAH,NA,"
",,," changes to ",NA,NA,"
",,,," changes to ",NA,NA,NA,"
",blah,,hi," changes to ",blah,NA,hi,"
There could be anywhere between 1 and 100,000 commas in the strings with any number of characters between the commas. My code is:
str = str.gsub!(",,",",NA,")
# => ",NA,,123,NA,BLAH,NA,"
I am running into issues because the substitution needs to happen multiple times. If I repeat the gsub! call, I eventually hit the error "undefined method gsub! for nil class": gsub! returns the modified string, but when no substitution is made it returns nil.

ruby > ",,,,,,,,,,,,,,,,,,,,,,".gsub(",",",NA")
=> ",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA"
or alternately:
ruby > ",,,,,,,,,,,,,,,,,,,,,,".gsub(",","NA,")
=> "NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"
Edit: to handle the use case better (I didn't quite get the original question):
2.2.0 :004 > str=",,,123,,BLAH,,"
=> ",,,123,,BLAH,,"
2.2.0 :005 > str.split(",")
=> ["", "", "", "123", "", "BLAH"]
2.2.0 :006 > str.split(",").map{|x|x.length == 0 ? "NA" : x}.join(",")
=> "NA,NA,NA,123,NA,BLAH"

According to your use-case (",,,123,,BLAH,," turning into ",NA,123,NA,BLAH,NA,") I'm assuming you want all commas between characters to turn into ,NA,?
This is easily done using regular expressions with gsub.
str=",,,123,,BLAH,,"
str.gsub!(/,+/,",NA,") #returns ",NA,123,NA,BLAH,NA,"
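The regular expression /,+/ matches one or more commas, so every run of commas (including a single comma) is replaced by one ,NA, regardless of how long the run is.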
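For the ",,," and ",,,," examples in the question, one NA is wanted per empty field rather than one per run of commas. A minimal sketch of one way to get that in a single pass, assuming the rule is "put NA between every pair of adjacent commas": match each comma that is followed by another comma, using a lookahead so the second comma is not consumed. The helper name fill_empty_fields is hypothetical. Note that under this rule the first example comes out as ",NA,NA,123,NA,BLAH,NA,", which differs from the output stated in the question.

```ruby
# Hypothetical helper, not from the original posts.
def fill_empty_fields(str)
  # Match every comma that is immediately followed by another comma.
  # The lookahead (?=,) does not consume the second comma, so that comma
  # can itself be matched later in the same scan.
  # Non-bang gsub always returns a String, never nil.
  str.gsub(/,(?=,)/, ",NA")
end

p fill_empty_fields(",,,")             # => ",NA,NA,"
p fill_empty_fields(",,,,")            # => ",NA,NA,NA,"
p fill_empty_fields(",blah,,hi,")      # => ",blah,NA,hi,"
p fill_empty_fields(",,,123,,BLAH,,")  # => ",NA,NA,123,NA,BLAH,NA,"
```

Because the non-bang gsub is used, the result can be chained or reassigned without the nil problem described in the question.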

Related

Couldn't understand why the Regexp option i got disabled in my code

I have just started playing with Ruby and I'm stuck on something. Is
there some trick to modify the casefold attribute of a Regexp object after
it's been instantiated?
The best idea I have tried so far is the following:
irb(main):001:0> a = Regexp.new('a')
=> /a/
irb(main):002:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
But none of the below seems to work:
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> nil
Something I don't understand is happening here. Where did the 'i' go on line
8?
irb(main):07:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
irb(main):08:0> aA.to_s
=> "(?-mix:a)"
irb(main):09:0>
I am using Ruby 1.9.3.
I also don't understand why the code below returns false:
/(?i:a)/.casefold? #=> false
As your console output shows, a.to_s includes the case sensitivity as an option for your subexpression, so aA is being defined as
/(?-mix:a)/i
So you're asking Ruby for a case-insensitive regular expression, but the only thing inside it is a group in which case sensitivity has been turned back on. The net effect is that 'a' is matched case-sensitively.
Since the result of to_s is just the regular expression string itself - no delimiters or external flags - the flags are translated into the (?i:...) syntax that sets or clears them temporarily inside the expression itself. This lets you get a Regexp object back out via a simple Regexp.new(s) call that will match the same strings.
The wrapping, unfortunately, includes explicitly clearing the flags that are not set on the object. So your first regex gets stringified into something wrapped in (?-mix:...); that is, the casefold option is explicitly turned off between the parentheses. Turning it back on for the object has no effect.
You can use a.source instead of a.to_s to get just the original expression, without the flag settings:
irb(main):001:0> a=/a/
=> /a/
irb(main):002:0> aA = Regexp.new(a.source, Regexp::IGNORECASE)
=> /a/i
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> 0
As Frederick already explains, calling to_s on a regex will add modifiers around it that ensure that its properties like case-sensitiveness are preserved. So if you insert a case-sensitive regex into a case-insensitive regex, the inserted part will still be case-sensitive. Likewise the modifiers given to Regexp.new will have no effect if the first argument is a regex or the result of calling to_s on one.
To solve this issue, call source on the regex instead of to_s. Unlike to_s, source simply returns the source of regex without adding anything:
aA = Regexp.new(a.source, Regexp::IGNORECASE)
I also don't understand why the code below returns false:
/(?i:a)/.casefold?
Because (?i:...) sets the i flag locally, not globally. It only applies to the part of the regex within the parentheses, not the whole regex. Of course in this case the whole regex is within the parentheses, but that doesn't matter as far as methods like casefold? are concerned.
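The difference between an inline flag and a flag on the Regexp object can be checked directly. A small sketch; the variable names plain, inline, global and rebuilt are only for illustration:

```ruby
plain  = /a/
inline = /(?i:a)/   # i is set only inside the group
global = /a/i       # i is set on the Regexp object itself

p inline.casefold?  # => false (the object-level flag is not set)
p global.casefold?  # => true
p inline =~ 'A'     # => 0 (the group still matches case-insensitively)

p plain.to_s        # => "(?-mix:a)" (flags baked into the string)
p plain.source      # => "a" (just the pattern text)

# Rebuilding from source lets the new flag take effect.
rebuilt = Regexp.new(plain.source, Regexp::IGNORECASE)
p rebuilt.casefold? # => true
p rebuilt =~ 'A'    # => 0
```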

Ruby 1.9.3 regular expressions with gsub: Bugs or features?

Take this snippet of code, which is supposed to replace an <a href> tag with its URL:
irb> s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p></p>"
This regex fails (URL is not found). Then I escape the < character in the regex, and it works:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
1: According to RubyMine's inspections, this escape should not be necessary. Is this correct? If so, why is the escape of > apparently not necessary as well?
2: Afterwards in the same IRB session, with the same string, the original regex suddenly works too:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
Is this because the $1 variable is not cleared when calling gsub again? If so, is it intentional behaviour or is this a Ruby regex bug?
3: When I change the string, and reexecute the same command, $1 will only change after calling gsub twice on the changed string:
irb> s='<p><a href="http://localhost/activate/xxxxyyy">Click here!</a></p>'
=> "<p><a href=\"http://localhost/activate/xxxxyyy\">Click here!</a></p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/xxxxyyy</p>"
Is this intentional? If so, what is the logic behind this?
4: As replacement character, some tutorials suggest using "#{$n}", others suggest using '\n'. With the backslash variant, the problems above do not appear. Why - what is the difference between the two?
Thank you!
$1 contains the first capture of the last match. In your example, "#{$1}" is interpolated before the matching happens (in fact, before gsub is even called), so the first time its value is nil because nothing has been matched yet. After that you always get the first capture of the previous match. You do not even need to change your original regex to get the expected result the second time:
s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p></p>"
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p>http://localhost/activate/57f7e805827f</p>"
You can pass a block to gsub though, which is evaluated after the matching, e. g.
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/){ $1 }
# => "<p>http://localhost/activate/57f7e805827f</p>"
This way, $1 behaves as you'd expect. I like to always use named captures so I don't have to keep track of the numbers when I add a capture, though:
s.gsub(/<a href="(?<href>([^ '"]*))"([^>]*)?>([^<]*)<\/a>/){ $~[:href] }
# => "<p>http://localhost/activate/57f7e805827f</p>"
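On the fourth point of the question: a replacement string such as '\1' is not interpolated by Ruby when the call is made. gsub itself expands the backreference for each match, which is why that variant does not show the stale-$1 behaviour. A short sketch comparing it with the block form; the input string is an assumption reconstructed from the outputs quoted in the question, and the regex is a slightly simplified version of the one there:

```ruby
# Assumed input: the anchor tag is reconstructed from the quoted outputs.
html = '<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
link = /<a href="([^ '"]*)"[^>]*>[^<]*<\/a>/

# Single quotes keep the backslash; gsub substitutes \1 per match.
via_backref = html.gsub(link, '\1')

# The block is evaluated after each match, so $1 is current.
via_block = html.gsub(link) { $1 }

p via_backref  # => "<p>http://localhost/activate/57f7e805827f</p>"
p via_block    # => "<p>http://localhost/activate/57f7e805827f</p>"
```

In a double-quoted replacement string the backslash has to be doubled ("\\1") to reach gsub intact.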

How to match a left and right parenthesis and substitute a character for both in a regular expression in Ruby?

I'm trying to pattern match the following and substitute "c" for both a left and right parenthesis.
Example:
string = "(a,b)"
So I want the string to come out like "ca,cb" after I call string.sub(//,"c") on it. I've tried string.sub(/[()]/,"c"), but that only results in "ca,b)". How do I pattern match the left AND right parenthesis?
ruby-1.9.3-p125 :001 > string = "(a,b)"
=> "(a,b)"
ruby-1.9.3-p125 :002 > string.gsub(/[()]/, "c")
=> "ca,bc"
Note the gsub: sub makes a single substitution; gsub ("global sub") substitutes as many as it can.
For single char substitution try tr:
'(a,b)'.tr '()', 'c'
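For comparison, here are the three calls side by side on the string from the question. When the second argument of tr is shorter than the first, its last character is reused, so both parentheses map to "c":

```ruby
string = "(a,b)"

first_only = string.sub(/[()]/, "c")   # replaces the first match only
all_regex  = string.gsub(/[()]/, "c")  # replaces every match
all_tr     = string.tr("()", "c")      # character-for-character translation

p first_only  # => "ca,b)"
p all_regex   # => "ca,bc"
p all_tr      # => "ca,bc"
```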
If your expected output is really "ca,cb" rather than "ca,bc", which is the result of the other answers given so far, then the following should do the trick:
1.9.3-p194 :001 > "(a,b)".tr('(', 'c').gsub(/(.)\)/, 'c\1')
=> "ca,cb"
You have not specified how to handle empty parentheses or multiple levels of nesting, so those cases are not considered.

Chaining array into new split function call

I have the following and am trying to split on '.' and then split the returned first part on '-' and return the last of the first part. I want to return 447.
a="cat-vm-447.json".split('.').split('-')
Also, how would I do this as a regular expression? I have this:
a="cat-vm-447.json".split(/-[\d]+./)
but this is splitting on the value. I want to return the number.
I can do this:
a="cat-vm-447.json".slice(/[\d]+/)
and this gives me back 447 but would really like to specify that the - and . surround it. Adding those in regex return them.
First question. split returns an array, so you need to use Array#[] to get the first (0) or last (-1) element of that array. Alternatively, use the Array#first and Array#last methods.
a="cat-vm-447.json".split('.')[0].split('-')[-1] # => "447"
Second question. You can capture your number in a group and then read it from the result. The group has index 1; index 0 is the full match ("-447." in your case). You can use String#[] or String#match (among other methods) to apply your regex.
"cat-vm-447.json"[/-(\d+)\./, 1] # => "447"
# or
"cat-vm-447.json".match(/-(\d+)\./)[1] # => "447"
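If you would rather require the surrounding - and . without capturing them at all, lookarounds are another option. A sketch, assuming the number is always directly preceded by a hyphen and followed by a dot; the variable names are only for illustration:

```ruby
name = "cat-vm-447.json"

# (?<=-) requires a preceding hyphen, (?=\.) requires a following dot;
# neither becomes part of the match, so no capture group is needed.
number = name[/(?<=-)\d+(?=\.)/]

# Without lookarounds the delimiters are included in the match.
with_delimiters = name[/-\d+\./]

p number           # => "447"
p with_delimiters  # => "-447."
```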
Split returns an array, so you need to specify the index for the next split.
a="cat-vm-447.json".split('.').first.split('-').last
For the regular expression, you need to wrap what you want to capture in parentheses.
/-(\d+)\./
a = "cat-vm-447.json"
b = a.match(/-(\d+)\./)
p b[1] # => "447" (b[0] is the full match, "-447.")
Try something like this:
if "cat-vm-447.json" =~ /([\d]+)/
p $1
else
p "No matches"
end
The parentheses in the regex capture the result into the $1 variable.
When you split your string the second time, you are actually trying to split an Array instead of a String.
ruby-1.9.3-head :003 > "cat-vm-447.json".split('.')
# => ["cat-vm-447", "json"]
In the regexp case, you can split on /[-.]/:
ruby-1.9.3-head :008 > "cat-vm-447.json".split(/[-.]/)
# => ["cat", "vm", "447", "json"]
ruby-1.9.3-head :009 > "cat-vm-447.json".split(/[-.]/)[2]
# => "447"

Split specific string by regular expression

I am trying to get an array that contains aaaaa, bbbbb and ccccc as the split output below.
a_string = "aaaaa[x]bbbbb,ccccc"
split_output = a_string.split(%r{[,|........]+})
What should I put in place of ........ ?
No need for a regex when it's just a literal:
irb(main):001:0> a_string = "aaaaa[x]bbbbb"
irb(main):002:0> a_string.split "[x]"
=> ["aaaaa", "bbbbb"]
If you want to split by "open bracket...anything...close bracket" then:
irb(main):003:0> a_string.split /\[.+?\]/
=> ["aaaaa", "bbbbb"]
Edit: I'm still not sure what your criteria are, but let's guess that what you are really doing is looking for runs of two or more of the same character:
irb(main):001:0> a_string = "aaaaa[x]bbbbb,ccccc"
=> "aaaaa[x]bbbbb,ccccc"
irb(main):002:0> a_string.scan(/((.)\2+)/).map(&:first)
=> ["aaaaa", "bbbbb", "ccccc"]
Edit 2: If you want to split by either of the literal strings "," or "[x]" then:
irb(main):003:0> a_string.split /,|\[x\]/
=> ["aaaaa", "bbbbb", "ccccc"]
The | part of the regular expression allows expressions on either side to match, and the backslashes are needed since otherwise the characters [ and ] have special meaning. (If you tried to split by /,|[x]/ then it would split on either a comma or an x character.)
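If escaping by hand feels error-prone, Regexp.union builds the same kind of alternation from literal strings and escapes the special characters for you. A brief sketch using the string from the question; the variable names are only for illustration:

```ruby
a_string = "aaaaa[x]bbbbb,ccccc"

# Regexp.union escapes each literal and joins them with |.
separators = Regexp.union(",", "[x]")
pieces = a_string.split(separators)

# Unescaped, [x] is a character class, so this splits on "," or "x".
wrong_pieces = a_string.split(/,|[x]/)

p pieces        # => ["aaaaa", "bbbbb", "ccccc"]
p wrong_pieces  # => ["aaaaa[", "]bbbbb", "ccccc"]
```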
No regex is needed; just split on the literal string "[x]".
