Regex for multiple string operations in a single pass? - ruby

How can I do the following in a single gsub? What regex would give the desired output?
string = "Make all the changes within a single pass"
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
The first gsub replaces every character that is not a vowel or whitespace with *.
The second gsub replaces every whitespace character with &.
I ask because chaining gsub calls does not feel like the right way to do this. Please let me know if you think this is a good approach.

This uses String#tr to do the substitution in a single pass. This assumes the string consists of printable ASCII characters.
string.tr " \t\nB-DF-HJ-NP-TV-Zb-df-hj-np-tv-z!-@[-`{-~", '&&&*'
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
For tr, - is the range operator, so the consecutive letters B, C, D can be written as B-D. B-DF-HJ-NP-TV-Z is therefore all the capital letters minus the vowels. The lowercase consonants follow in the same way, and then the printable ASCII punctuation. All of these are replaced by *, because tr pads a shorter replacement set with its last character. Space, tab, and newline are listed explicitly at the front of the string and are each replaced by &. Note that \s in a Ruby regex also matches \r, \f, and \v, which this tr call leaves untouched.
If 2 passes are allowed, then it can be written more concisely as
string.tr(' ','&').tr('^AEIOUaeiou&','*')
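A minimal sketch of the tr behaviour this answer relies on: ranges inside the source set, and a replacement set that is shorter than the source set being padded with its last character. The helper name mask and the sample text are made up for illustration.

```ruby
# Consonant ranges written the way tr expects them (no brackets, no escapes).
CONSONANTS = 'B-DF-HJ-NP-TV-Zb-df-hj-np-tv-z'

# Hypothetical helper: whitespace becomes '&', consonants become '*'.
# The replacement set '&&&*' is shorter than the source set, so tr pads it
# with its last character: the first three source characters (space, tab,
# newline) map to '&' and everything after them maps to '*'.
def mask(text)
  text.tr(" \t\n" + CONSONANTS, '&&&*')
end

masked = mask("Make all\tthe")
puts masked # => *a*e&a**&**e
```

Vowels are simply absent from the source set, so tr leaves them alone.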

Ok so I figured out I can pass a block like this:
string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"}
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
I prefer the solution above but this also works:
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
Benchmark results (corrected), using the Benchmark class on a 900k-character string:
Benchmark.measure { string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"} }
#=> 0.800000 0.010000 0.810000 ( 0.801419 )
Benchmark.measure { string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&') }
#=> 0.230000 0.000000 0.230000 ( 0.231482 )
The second option is about 3.5 times faster (0.23 s versus 0.81 s), making it the clear winner on speed, and it is arguably the more readable of the two.
Update
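Based on @Matt's answer, I was also able to use String#tr. This solution is blazing fast (the fastest of all tested) on the same 900k-character string. Note that tr takes a character set, not a regex, so the [, | and ] below are literal characters that tr will leave untouched; '^aeiou&' would be enough.
string.tr(' ', '&').tr('^[aeiou|&]', '*')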
Benchmark.measure { string.tr(' ', '&').tr('^[aeiou|&]', '*') }
#=> 0.000000 0.000000 0.000000 ( 0.015000 )

string.gsub(/(\s)|([^aeiou])/){$1 ? "&" : "*"}
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
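One detail worth checking in the patterns above: inside a regex character class, | is a literal pipe, not alternation. The sketch below uses a made-up sample string to show the difference, followed by the single-pass block form from the earlier answer.

```ruby
sample = "a|b c"

# '|' inside [...] is a literal character, so this class also protects pipes.
with_pipe = sample.gsub(/[^aeiou|\s]/, '*')

# Without the '|', the pipe is treated like any other non-vowel.
without_pipe = sample.gsub(/[^aeiou\s]/, '*')

# Single pass: match every non-vowel and pick the replacement in the block.
single_pass = sample.gsub(/[^aeiou]/) { |ch| ch =~ /\s/ ? '&' : '*' }

puts with_pipe    # => a|* *
puts without_pipe # => a** *
puts single_pass  # => a**&*
```

For the sentence in the question the two classes give the same result, because it contains no pipe characters.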

Related

How do you delete one apostrophe where it is duplicated in a string?

In ruby, say I have this string: "abc''xyz''"
(those are 2 single quotes after abc and xyz)
Now, I am trying to find a way to make it into this string: "abc'xyz'"
I want to delete only one apostrophe from this string in locations where there are two apostrophes back to back. Thanks in advance.
You can use String#squeeze:
"abc''xyz''".squeeze("'")
#=> "abc'xyz'"
This method removes duplicates of a certain character if they are immediately after each other. It will reduce n characters in a row to just one.
For example, if you had the string " '''''' ", squeezing it would return the following:
" '''''' ".squeeze("'")
#=> " ' "
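Since squeeze collapses a run of any length, it differs from "remove one apostrophe from each doubled pair" as soon as three or more apostrophes appear in a row. A small sketch with a made-up input, in case that distinction matters for your data:

```ruby
input = "abc''xyz'''"

# squeeze reduces every run of apostrophes to a single one.
squeezed = input.squeeze("'")

# A plain string replacement only collapses each adjacent pair,
# so a run of three becomes a run of two.
pairs_only = input.gsub("''", "'")

puts squeezed   # => abc'xyz'
puts pairs_only # => abc'xyz''
```

For input that only ever contains doubled apostrophes, as in the question, both give the same result.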
String#squeeze is what you need and gsub is really a bad idea.
require 'benchmark'
iterations = 1_000_000 # assumed value; the original post does not show it
Benchmark.bm do |bm|
  bm.report("squeeze") do
    iterations.times do
      "e''eee''e'e''''e".squeeze("'")
    end
  end
  bm.report("gsub") do
    iterations.times do
      "e''eee''e'e''''e".gsub(/\'+/, "'")
    end
  end
end
And results:
              user     system      total        real
squeeze   6.109000   0.000000   6.109000 (  6.110040)
gsub     22.454000   0.000000  22.454000 ( 22.469204)

Manipulate string in ruby

I have a group of string variables that look something like "height_low". I want to use something clean, like gsub, to get rid of the underscore and everything after it, so that the result is "height". Does someone have a solution for this? Thanks.
Try this:
strings.map! {|s| s.split('_').first}
Shorter:
my_string.split('_').first
The unavoidable regex answer. (Assuming strings is an array of strings.)
strings.map! { |s| s[/^.+?(?=_)/] }
FWIW, solutions based on String#split perform poorly because they have to parse the whole string and allocate an array. Their performance degrades as the number of underscores increases. The following performs better:
string[0, string.index("_") || string.length]
Benchmark results (with the number of underscores in parentheses):
                       user     system      total        real
String#split (0)   0.640000   0.000000   0.640000 (  0.650323)
String#split (1)   0.760000   0.000000   0.760000 (  0.759951)
String#split (9)   2.180000   0.010000   2.190000 (  2.192356)
String#index (0)   0.610000   0.000000   0.610000 (  0.625972)
String#index (1)   0.580000   0.010000   0.590000 (  0.589463)
String#index (9)   0.600000   0.000000   0.600000 (  0.605253)
Benchmarks:
require 'benchmark'
strings = ["x", "x_x", "x_x_x_x_x_x_x_x_x_x"]
Benchmark.bm(16) do |bm|
  strings.each do |string|
    bm.report("String#split (#{string.count("_")})") do
      1000000.times { string.split("_").first }
    end
  end
  strings.each do |string|
    bm.report("String#index (#{string.count("_")})") do
      1000000.times { string[0, string.index("_") || string.length] }
    end
  end
end
Try as below using str[regexp, capture] → new_str or nil:
If a Regexp is supplied, the matching portion of the string is returned. If a capture, which may be a capture group index or name, follows the regular expression, that component of the MatchData is returned instead.
strings.map { |s| s[/(.*?)_.*$/,1] }
If you're looking for something "like gsub", why not just use gsub?
"height_low".gsub(/_.*$/, "") #=> "height"
In my opinion though, this is a bit cleaner:
"height_low".split('_').first #=> "height"
Another option is to use partition:
"height_low".partition("_").first #=> "height"
Learn to think in terms of searches vs. replacements. It's usually easier, faster, and cleaner to search for, and extract, what you want, than it is to search for, and strip, what you don't want.
Consider this:
'a_b_c'[/^(.*?)_/, 1] # => "a"
It looks for only what you want, which is the text from the start of the string, up to _. Everything preceding _ is captured, and returned.
The alternates:
'a_b_c'.sub(/_.+$/, '') # => "a"
'a_b_c'.gsub(/_.+$/, '') # => "a"
have to find the first _ and then scan all the way to the end of the string before the string can be truncated.
Here's a little benchmark showing how that affects things:
require 'fruity'
s = 'a_b_c' # the string used by string_index below
compare do
  string_capture { 'a_b_c'[/^(.*?)_/, 1] }
  string_sub { 'a_b_c'.sub(/_.+$/, '') }
  string_gsub { 'a_b_c'.gsub(/_.+$/, '') }
  look_ahead { 'a_b_c'[/^.+?(?=_)/] }
  string_index { s[0, s.index("_") || s.length] }
end
# >> Running each test 8192 times. Test will take about 1 second.
# >> string_index is faster than string_capture by 19.999999999999996% ± 10.0%
# >> string_capture is similar to look_ahead
# >> look_ahead is faster than string_sub by 70.0% ± 10.0%
# >> string_sub is faster than string_gsub by 2.9x ± 0.1
Again, searching is going to be faster than any sort of replace, so think about what you're doing.
The downside of the regex-based "search" tactics such as "string_capture" and "look_ahead" is that they don't handle a missing _ (they return nil). If there is any question whether your string will contain _, use the "string_index" method, which falls back to string.length and returns the entire string.
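To make the missing-underscore case concrete, here is a small sketch comparing three of the approaches from this thread. The method names are made up for illustration.

```ruby
# Hypothetical wrappers around three approaches from the answers above.
def prefix_by_capture(s)
  s[/\A(.*?)_/, 1] # nil when there is no underscore
end

def prefix_by_index(s)
  s[0, s.index('_') || s.length] # whole string when there is no underscore
end

def prefix_by_partition(s)
  s.partition('_').first # whole string when there is no underscore
end

with_underscore    = "height_low"
without_underscore = "height"

p prefix_by_capture(with_underscore)      # => "height"
p prefix_by_capture(without_underscore)   # => nil
p prefix_by_index(without_underscore)     # => "height"
p prefix_by_partition(without_underscore) # => "height"
```

Which fallback is right depends on whether a string without an underscore should count as "all prefix" or as "no match".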

Match characters by their place in Ruby string

I am trying to match numeric characters by their place within a string. For example, in the string "1234567", I would like to select the second through the fourth characters: "234". "D9873Y.31" should also turn up "987". Would you have any suggestions?
You don’t need a regex, you can just use String#[]:
s = '1234567'
s[1..3] #=> "234"
s = 'D9873Y.31'
s[1..3] #=> "987"
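String#[] accepts a few equivalent forms for this. A quick sketch, using the strings from the question plus one made-up short string to show what happens when the range runs past the end:

```ruby
s = 'D9873Y.31'

by_range     = s[1..3]  # inclusive range of indexes
by_exclusive = s[1...4] # exclusive end
by_length    = s[1, 3]  # start index and length

# A range that runs past the end is clipped rather than raising.
clipped = '12'[1..3]

p by_range     # => "987"
p by_exclusive # => "987"
p by_length    # => "987"
p clipped      # => "2"
```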
You can use a regex for this, and patterns are flexible enough that you can write them several different ways. I try to keep them very simple, because their cryptic nature can turn them into maintenance nightmares:
"1234567"[/^.(.{3})/, 1]
=> "234"
"D9873Y.31"[/^.(.{3})/, 1]
=> "987"
"1234567".match(/^.(.{3})/)[1]
=> "234"
"D9873Y.31".match(/^.(.{3})/)[1]
=> "987"
You can also take advantage of named-captures:
/^.(?<chars2_4>.{3})/ =~ "1234567"
chars2_4
=> "234"
/^.(?<chars2_4>.{3})/ =~ "D9873Y.31"
chars2_4
=> "987"
All that's nice, but it's really important to dig in and learn them well, because, done wrong, you can grab the wrong data, or worse, really slow your script by making the regex engine work very hard to do something simple.
For instance, I used ^ above. ^ matches the start of a line, which is the start of the string and also the position immediately following each newline. That's OK for a short string, but on long strings, especially ones with embedded newlines, it can slow down the engine. Instead you might want to use \A. The same situation applies to using $ or \Z or \z. This is from the Regexp documentation section for "Anchors":
^ - Matches beginning of line
$ - Matches end of line
\A - Matches beginning of string.
\Z - Matches end of string. If string ends with a newline, it matches just before newline
\z - Matches end of string
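A short sketch of how those anchors differ in practice, using made-up two-line strings:

```ruby
text = "first\nsecond"

line_start   = text[/^second/]   # ^ also matches right after a newline
string_start = text[/\Asecond/]  # \A only matches at the very start

with_newline = "line\n"

end_of_line  = with_newline[/line$/]   # $ matches before the newline
before_final = with_newline[/line\Z/]  # \Z tolerates one trailing newline
absolute_end = with_newline[/line\z/]  # \z requires the true end of string

p line_start   # => "second"
p string_start # => nil
p end_of_line  # => "line"
p before_final # => "line"
p absolute_end # => nil
```

When validating a whole string, \A and \z are usually the anchors you want, since ^ and $ will accept a match on any single line.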
And all that is why you sometimes want to avoid using a regexp and instead use a substring, as @AndrewMarshall recommended.
Here's another reason why the simple substring way is preferable:
require 'benchmark'
N = 1_000_000
Benchmark.bm(13) do |b|
  b.report('string index') { N.times {
    "1234567"[1..3]
    "D9873Y.31"[1..3]
  } }
  b.report('regex index') { N.times {
    "1234567"[/^.(.{3})/, 1]
    "D9873Y.31"[/^.(.{3})/, 1]
  } }
  b.report('match') { N.times {
    "1234567".match(/^.(.{3})/)[1]
    "D9873Y.31".match(/^.(.{3})/)[1]
  } }
  b.report('named capture') { N.times {
    /^.(?<chars2_4>.{3})/ =~ "1234567"
    /^.(?<chars2_4>.{3})/ =~ "D9873Y.31"
  } }
  # The pattern has no capture group, so no capture index is passed here.
  b.report('look behind') { N.times {
    "1234567"[/(?<=^.).{3}/]
    "D9873Y.31"[/(?<=^.).{3}/]
  } }
end
Which returns:
                    user     system      total        real
string index    0.730000   0.000000   0.730000 (  0.727323)
regex index     1.370000   0.000000   1.370000 (  1.377121)
match           4.400000   0.000000   4.400000 (  4.398849)
named capture   5.240000   0.010000   5.250000 (  5.243799)
look behind     1.430000   0.000000   1.430000 (  1.437286)
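You can also do it using a lookbehind with an anchor. For example:
(?<=^.{2}).{3}
gives you 345 for "1234567" (the third through fifth characters). For the second through fourth characters ("234"), shorten the lookbehind to (?<=^.).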

Strip ruby string of a specific control character

This is pretty simple: how do I strip a ruby string of a special character? Here's the character:
http://www.fileformat.info/info/unicode/char/2028/index.htm
And here's the string, with the two special characters between the period and the ending quote:
"Each of the levels requires logic, skill, and brute force to crush the enemy.

"
I've unsuccessfully tried this:
string.gsub!(/[\x00-\x1F\x7F]/, '')
and gsub("/\n/", "")
I'm using ruby 1.9.3p125
String#gsub will work, but is more general and less efficient at this than String#tr
irb> s ="Hello,\u2028 World; here's some ctrl [\1\2\3\4\5\6] chars"
=> "Hello,\u2028 World; here's some ctrl [\u0001\u0002\u0003\u0004\u0005\u0006] chars"
irb> s.tr("\u0000-\u001f\u007f\u2028",'')
=> "Hello, World; here's some ctrl [] chars"
require 'benchmark'
Benchmark.bm {|x|
x.report('tr') { 1_000_000.times{ s.tr("\u0000-\u001f\u007f\u2028",'') } }
x.report('gsub') { 1_000_000.times{ s.gsub(/[\0-\x1f\x7f\u2028]/,'') } }
}
           user     system      total        real
tr     1.440000   0.000000   1.440000 (  1.448090)
gsub   4.110000   0.000000   4.110000 (  4.127100)
I figured it out! .gsub(/\u2028/, '')
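The reason the first attempt in the question did nothing is that U+2028 (LINE SEPARATOR) is not in the ASCII control range \x00-\x1F or \x7F, so it has to be named explicitly. A minimal sketch, assuming a UTF-8 string like the one in the question; String#delete is used here as a more direct alternative to tr with an empty replacement.

```ruby
text = "crush the enemy.\u2028\u2028"

# The original character class only covers ASCII control characters,
# so U+2028 survives and the string comes back unchanged.
old_attempt = text.gsub(/[\x00-\x1F\x7F]/, '')

# Naming the character explicitly works with either method.
by_gsub   = text.gsub(/\u2028/, '')
by_delete = text.delete("\u2028")

p old_attempt == text # => true
p by_gsub             # => "crush the enemy."
p by_delete           # => "crush the enemy."
```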

Ruby, remove last N characters from a string?

What is the preferred way of removing the last n characters from a string?
irb> 'now is the time'[0...-4]
=> "now is the "
If the characters you want to remove are always the same characters, then consider chomp:
'abc123'.chomp('123') # => "abc"
The advantages of chomp are: no counting, and the code more clearly communicates what it is doing.
With no arguments, chomp removes the DOS or Unix line ending, if either is present:
"abc\n".chomp # => "abc"
"abc\r\n".chomp # => "abc"
From the comments, there was a question of the speed of using #chomp versus using a range. Here is a benchmark comparing the two:
require 'benchmark'
S = 'asdfghjkl'
SL = S.length
T = 10_000
A = 1_000.times.map { |n| "#{n}#{S}" }
GC.disable
Benchmark.bmbm do |x|
x.report('chomp') { T.times { A.each { |s| s.chomp(S) } } }
x.report('range') { T.times { A.each { |s| s[0...-SL] } } }
end
Benchmark Results (using CRuby 2.1.3p242):
Rehearsal -----------------------------------------
chomp   1.540000   0.040000   1.580000 (  1.587908)
range   1.810000   0.200000   2.010000 (  2.011846)
-------------------------------- total: 3.590000sec
            user     system      total        real
chomp   1.550000   0.070000   1.620000 (  1.610362)
range   1.970000   0.170000   2.140000 (  2.146682)
So chomp is faster than using a range, by ~22%.
Ruby 2.5+
As of Ruby 2.5 you can use delete_suffix or delete_suffix! to achieve this in a fast and readable manner.
See the String#delete_suffix documentation for details.
If you know what the suffix is, this is idiomatic (and I'd argue, even more readable than other answers here):
'abc123'.delete_suffix('123') # => "abc"
'abc123'.delete_suffix!('123') # => "abc"
It's even significantly faster (almost 40% with the bang method) than the top answer. Here's the result of the same benchmark:
                     user     system      total        real
chomp            0.949823   0.001025   0.950848 (  0.951941)
range            1.874237   0.001472   1.875709 (  1.876820)
delete_suffix    0.721699   0.000945   0.722644 (  0.723410)
delete_suffix!   0.650042   0.000714   0.650756 (  0.651332)
I hope this is useful - note the method doesn't currently accept a regex so if you don't know the suffix it's not viable for the time being. However, as the accepted answer (update: at the time of writing) dictates the same, I thought this might be useful to some people.
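One thing to keep in mind when choosing between the two variants is what they return when the suffix is absent. A small sketch, assuming Ruby 2.5 or later and a made-up file name:

```ruby
name = 'report.csv'.dup # dup so the bang method can modify it

trimmed   = name.delete_suffix('.csv')  # new string without the suffix
unchanged = name.delete_suffix('.txt')  # suffix absent: returns a copy as-is
bang_miss = name.delete_suffix!('.txt') # suffix absent: returns nil

p trimmed   # => "report"
p unchanged # => "report.csv"
p bang_miss # => nil
```

Because the bang version returns nil when nothing was removed, avoid chaining further calls directly onto it.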
str = str[0..-1-n]
Unlike the [0...-n], this handles the case of n=0.
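A quick sketch of the n = 0 edge case mentioned here; drop_last is a hypothetical helper name wrapping the expression above.

```ruby
# Hypothetical helper: return str without its last n characters.
def drop_last(str, n)
  str[0..-1 - n] # parsed as 0..(-1 - n)
end

n = 0
two_off  = drop_last("string", 2)
none_off = drop_last("string", n)

# With the exclusive form, -0 is just 0, so the range is 0...0 (empty).
naive = "string"[0...-n]

p two_off  # => "stri"
p none_off # => "string"
p naive    # => ""
```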
I would suggest chop. I think it has been mentioned in one of the comments but without links or explanations so here's why I think it's better:
It simply removes the last character from a string and you don't have to specify any values for that to happen.
If you need to remove more than one character then chomp is your best bet. This is what the ruby docs have to say about chop:
Returns a new String with the last character removed. If the string
ends with \r\n, both characters are removed. Applying chop to an empty
string returns an empty string. String#chomp is often a safer
alternative, as it leaves the string unchanged if it doesn’t end in a
record separator.
Although this is used mostly to remove separators such as \r\n I've used it to remove the last character from a simple string, for example the s to make the word singular.
name = "my text"
x.times do name.chop! end
Here in the console:
>name = "Nabucodonosor"
=> "Nabucodonosor"
> 7.times do name.chop! end
=> 7
> name
=> "Nabuco"
Dropping the last n characters is the same as keeping the first length - n characters.
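Active Support includes String#first and String#last methods which provide a convenient way to keep or drop the first/last n characters. As far as I know, negative limits were deprecated in Rails 6.0 and raise ArgumentError from Rails 6.1, so the -3 forms below only apply to older versions: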
require 'active_support/core_ext/string/access'
"foobarbaz".first(3) # => "foo"
"foobarbaz".first(-3) # => "foobar"
"foobarbaz".last(3) # => "baz"
"foobarbaz".last(-3) # => "barbaz"
If you are using Rails, try:
"my_string".last(2) # => "ng"
Edit: to get the string WITHOUT the last 2 chars:
n = "my_string".size
"my_string"[0..n-3] # => "my_stri"
Note: the last character of the string is at index n-1, so to remove the last 2 we end the range at n-3.
Check out the slice() method:
http://ruby-doc.org/core-2.5.0/String.html#method-i-slice
You can always use something like
"string".sub!(/.{X}$/,'')
Where X is the number of characters to remove.
Or, when assigning or using the result:
myvar = "string"[0..-X]
where X is one more than the number of characters to remove.
If you're OK with monkey-patching String and want to keep the characters you chop off, try this:
class String
  # Returns [remaining string, removed characters].
  def chop_multiple(amount)
    amount.times.inject([self, '']){ |(s, r)| [s.chop, r.prepend(s[-1])] }
  end
end
hello, world = "hello world".chop_multiple 5
hello #=> 'hello '
world #=> 'world'
Using regex:
str = 'string'
n = 2 #to remove last n characters
str[/\A.{#{str.size-n}}/] #=> "stri"
x = "my_test"
last_char = x.split('').last

Resources