Strip ruby string of a specific control character - ruby

This is pretty simple: how do I strip a ruby string of a special character? Here's the character:
http://www.fileformat.info/info/unicode/char/2028/index.htm
And here's the string, with the two special characters between the period and the ending quote:
"Each of the levels requires logic, skill, and brute force to crush the enemy.

"
I've unsuccessfully tried this:
string.gsub!(/[\x00-\x1F\x7F]/, '')
and gsub("/\n/", "")
I'm using ruby 1.9.3p125

String#gsub will work, but is more general and less efficient at this than String#tr
irb> s ="Hello,\u2028 World; here's some ctrl [\1\2\3\4\5\6] chars"
=> "Hello,\u2028 World; here's some ctrl [\u0001\u0002\u0003\u0004\u0005\u0006] chars"
irb> s.tr("\u0000-\u001f\u007f\u2028",'')
=> "Hello, World; here's some ctrl [] chars"
require 'benchmark'
Benchmark.bm {|x|
x.report('tr') { 1_000_000.times{ s.tr("\u0000-\u001f\u007f\u2028",'') } }
x.report('gsub') { 1_000_000.times{ s.gsub(/[\0-\x1f\x7f\u2028]/,'') } }
}
user system total real
tr 1.440000 0.000000 1.440000 ( 1.448090)
gsub 4.110000 0.000000 4.110000 ( 4.127100)

I figured it out! .gsub(/\u2028/, '')

Related

Regex for multiple string operations in a single pass?

How can I do the following in a single gsub what is the regex to get the desired output?
string = "Make all the changes within a single pass"
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
First gsub if it's not a vowel or a space replace it
with *
Second gsub If it's a space replace it with a &
The reason I ask is because I feel like chaining gsub is not the right way to do this. Please let me know if you think this is a good way..
This uses String#tr to do the substitution in a single pass. This assumes the string consists of printable ASCII characters.
string.tr " \t\nB-DF-HJ-NP-TV-Zb-df-hj-np-tv-z!-#[-`{-~", '&&&*'
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
For tr, - is the range operator. So for the letters B, C, D, since these are consecutive, it can be written as B-D. So B-DF-HJ-NP-TV-Z is basically all the capital letters minus the vowels. Same with lowercase, followed by all printable punctuation on the ASCII chart. These all get replaced by a *. The only 3 whitespace characters that match \s are space, tab, and newline, and these are listed explicitly at the front of the string and are each replaced by &.
If 2 passes are allowed, then it can be written more concisely as
string.tr(' ','&').tr('^AEIOUaeiou&','*')
Ok so I figured out I can pass a block like this:
string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"}
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
I prefer the solution above but this also works:
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
Benchmark results (corrected): Using Benchmark class (900k length string sample size)
Benchmark.measure { string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"} }
#=> 0.800000 0.010000 0.810000 ( 0.801419 )
Benchmark.measure { string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&') }
#=> 0.230000 0.000000 0.230000 ( 0.231482 )
Looks like the second option is many times faster and the clear winner in speed and appears to have the preferred readability.
Update
Based on #Matt's answer I also was able to use: string#tr This solution is blazing fast (fastest of all tested) string #900k char size.
string.tr(' ', '&').tr('^[aeiou|&]', '*')
Benchmark.measure { string.tr(' ', '&').tr('^[aeiou|&]', '*') }
#=> 0.000000 0.000000 0.000000 ( 0.015000 )
string.gsub(/(\s)|([^aeiou])/){$1 ? "&" : "*"}
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"

Manipulate string in ruby

I have a grouping of string variables that will be something like "height_low". I want to use something clean like gsub or something else to get rid of the underscore and everything past it. so it will be like "height". Does someone have a solution for this? Thanks.
Try this:
strings.map! {|s| s.split('_').first}
Shorter:
my_string.split('_').first
The unavoidable regex answer. (Assuming strings is an array of strings.)
strings.map! { |s| s[/^.+?(?=_)/] }
FWIW, solutions based on String#split perform poorly because they have to parse the whole string and allocate an array. Their performance degrades as the number of underscores increases. The following performs better:
string[0, string.index("_") || string.length]
Benchmark results (with number of underscores in parenthesis):
user system total real
String#split (0) 0.640000 0.000000 0.640000 ( 0.650323)
String#split (1) 0.760000 0.000000 0.760000 ( 0.759951)
String#split (9) 2.180000 0.010000 2.190000 ( 2.192356)
String#index (0) 0.610000 0.000000 0.610000 ( 0.625972)
String#index (1) 0.580000 0.010000 0.590000 ( 0.589463)
String#index (9) 0.600000 0.000000 0.600000 ( 0.605253)
Benchmarks:
strings = ["x", "x_x", "x_x_x_x_x_x_x_x_x_x"]
Benchmark.bm(16) do |bm|
strings.each do |string|
bm.report("String#split (#{string.count("_")})") do
1000000.times { string.split("_").first }
end
end
strings.each do |string|
bm.report("String#index (#{string.count("_")})") do
1000000.times { string[0, string.index("_") || string.length] }
end
end
end
Try as below using str[regexp, capture] → new_str or nil:
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
strings.map { |s| s[/(.*?)_.*$/,1] }
If you're looking for something "like gsub", why not just use gsub?
"height_low".gsub(/_.*$/, "") #=> "height"
In my opinion though, this is a bit cleaner:
"height_low".split('_').first #=> "height"
Another option is to use partition:
"height_low".partition("_").first #=> "height"
Learn to think in terms of searches vs. replacements. It's usually easier, faster, and cleaner to search for, and extract, what you want, than it is to search for, and strip, what you don't want.
Consider this:
'a_b_c'[/^(.*?)_/, 1] # => "a"
It looks for only what you want, which is the text from the start of the string, up to _. Everything preceding _ is captured, and returned.
The alternates:
'a_b_c'.sub(/_.+$/, '') # => "a"
'a_b_c'.gsub(/_.+$/, '') # => "a"
have to look backwards until the engine is sure there are no more _, then the string can be truncated.
Here's a little benchmark showing how that affects things:
require 'fruity'
compare do
string_capture { 'a_b_c'[/^(.*?)_/, 1] }
string_sub { 'a_b_c'.sub(/_.+$/, '') }
string_gsub { 'a_b_c'.gsub(/_.+$/, '') }
look_ahead { 'a_b_c'[/^.+?(?=_)/] }
string_index { 'a_b_c'[0, s.index("_") || s.length] }
end
# >> Running each test 8192 times. Test will take about 1 second.
# >> string_index is faster than string_capture by 19.999999999999996% ± 10.0%
# >> string_capture is similar to look_ahead
# >> look_ahead is faster than string_sub by 70.0% ± 10.0%
# >> string_sub is faster than string_gsub by 2.9x ± 0.1
Again, searching is going to be faster than any sort of replace, so think about what you're doing.
The downfall to the "search" regex-based tactics like "string_capture" and "look_ahead" is they don't handle missing _, so if there's any question whether your string will, or will not, have _, then use the "string_index" method which will fall-back to using string.length to grab the entire string.

Match characters by their place in Ruby string

I am trying to match numeric characters by their place within a string. For example, in the string "1234567", I would like to select the second through the fourth characters: "234". "D9873Y.31" should also turn up "987". Would you have any suggestions?
You don’t need a regex, you can just use String#[]:
s = '1234567'
s[1..3] #=> "234"
s = 'D9873Y.31'
s[1..3] #=> "987"
You can use a regex for this, and, patterns are flexible enough to write them several different ways. I try to keep them very simple, because they can become maintenance nightmares due to their cryptic nature:
"1234567"[/^.(.{3})/, 1]
=> "234"
"D9873Y.31"[/^.(.{3})/, 1]
=> "987"
"1234567".match(/^.(.{3})/)[1]
=> "234"
"D9873Y.31".match(/^.(.{3})/)[1]
=> "987"
You can also take advantage of named-captures:
/^.(?<chars2_4>.{3})/ =~ "1234567"
chars2_4
=> "234"
/^.(?<chars2_4>.{3})/ =~ "D9873Y.31"
chars2_4
=> "987"
All that's nice, but it's really important to dig in and learn them well, because, done wrong, you can grab the wrong data, or worse, really slow your script by making the regex engine work very hard to do something simple.
For instance, I used ^ above. ^ matches the start of a line, which is the start of a string and the character immediately following a new-line. That's OK for a short string, but long strings, especially with embedded new-lines can slow down the engine. Instead you might want to use \A. The same situation applies to using $ or \Z or \z. This is from the Regexp documentation section for "Anchors":
^ - Matches beginning of line
$ - Matches end of line
\A - Matches beginning of string.
\Z - Matches end of string. If string ends with a newline, it matches just before newline
\z - Matches end of string
And all that is why you sometimes want to avoid using a regexp and instead use a substring such as #AndrewMarshall recommended.
Here's another reason why the simple substring way is preferable:
require 'benchmark'
N = 1_000_000
Benchmark.bm(13) do |b|
b.report('string index') { N.times {
"1234567"[1..3]
"D9873Y.31"[1..3]
} }
b.report('regex index') { N.times {
"1234567"[/^.(.{3})/, 1]
"D9873Y.31"[/^.(.{3})/, 1]
} }
b.report('match') { N.times {
"1234567".match(/^.(.{3})/)[1]
"D9873Y.31".match(/^.(.{3})/)[1]
} }
b.report('named capture') { N.times {
/^.(?<chars2_4>.{3})/ =~ "1234567"
/^.(?<chars2_4>.{3})/ =~ "D9873Y.31"
} }
b.report('look behind') { N.times {
"1234567"[/(?<=^.{2}).{3}/, 1]
"D9873Y.31"[/(?<=^.{2}).{3}/, 1]
} }
end
Which returns:
user system total real
string index 0.730000 0.000000 0.730000 ( 0.727323)
regex index 1.370000 0.000000 1.370000 ( 1.377121)
match 4.400000 0.000000 4.400000 ( 4.398849)
named capture 5.240000 0.010000 5.250000 ( 5.243799)
look behind 1.430000 0.000000 1.430000 ( 1.437286)
You can do it using a lookbehind with an anchor, example:
(?<=^.{2}).{3}
will give you 345

Fastest way to check if a string matches a regexp in ruby?

What is the fastest way to check if a string matches a regular expression in Ruby?
My problem is that I have to "egrep" through a huge list of strings to find which are the ones that match a regexp that is given at runtime. I only care about whether the string matches the regexp, not where it matches, nor what the content of the matching groups is. I hope this assumption can be used to reduce the amount of time my code spend matching regexps.
I load the regexp with
pattern = Regexp.new(ptx).freeze
I have found that string =~ pattern is slightly faster than string.match(pattern).
Are there other tricks or shortcuts that can used to make this test even faster?
Starting with Ruby 2.4.0, you may use RegExp#match?:
pattern.match?(string)
Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:
Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
This is a simple benchmark:
require 'benchmark'
"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=> 0.610000 0.000000 0.610000 ( 0.578133)
"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=> 0.718000 0.000000 0.718000 ( 0.750010)
irb(main):019:0> "test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=> 1.703000 0.000000 1.703000 ( 1.578146)
So =~ is faster but it depends what you want to have as a returned value. If you just want to check if the text contains a regex or not use =~
This is the benchmark I have run after finding some articles around the net.
With 2.4.0 the winner is re.match?(str) (as suggested by #wiktor-stribiżew), on previous versions, re =~ str seems to be fastest, although str =~ re is almost as fast.
#!/usr/bin/env ruby
require 'benchmark'
str = "aacaabc"
re = Regexp.new('a+b').freeze
N = 4_000_000
Benchmark.bm do |b|
b.report("str.match re\t") { N.times { str.match re } }
b.report("str =~ re\t") { N.times { str =~ re } }
b.report("str[re] \t") { N.times { str[re] } }
b.report("re =~ str\t") { N.times { re =~ str } }
b.report("re.match str\t") { N.times { re.match str } }
if re.respond_to?(:match?)
b.report("re.match? str\t") { N.times { re.match? str } }
end
end
Results MRI 1.9.3-o551:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 2.390000 0.000000 2.390000 ( 2.397331)
str =~ re 2.450000 0.000000 2.450000 ( 2.446893)
str[re] 2.940000 0.010000 2.950000 ( 2.941666)
re.match str 3.620000 0.000000 3.620000 ( 3.619922)
str.match re 4.180000 0.000000 4.180000 ( 4.180083)
Results MRI 2.1.5:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 1.150000 0.000000 1.150000 ( 1.144880)
str =~ re 1.160000 0.000000 1.160000 ( 1.150691)
str[re] 1.330000 0.000000 1.330000 ( 1.337064)
re.match str 2.250000 0.000000 2.250000 ( 2.255142)
str.match re 2.270000 0.000000 2.270000 ( 2.270948)
Results MRI 2.3.3 (there is a regression in regex matching, it seems):
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 3.540000 0.000000 3.540000 ( 3.535881)
str =~ re 3.560000 0.000000 3.560000 ( 3.560657)
str[re] 4.300000 0.000000 4.300000 ( 4.299403)
re.match str 5.210000 0.010000 5.220000 ( 5.213041)
str.match re 6.000000 0.000000 6.000000 ( 6.000465)
Results MRI 2.4.0:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re.match? str 0.690000 0.010000 0.700000 ( 0.682934)
re =~ str 1.040000 0.000000 1.040000 ( 1.035863)
str =~ re 1.040000 0.000000 1.040000 ( 1.042963)
str[re] 1.340000 0.000000 1.340000 ( 1.339704)
re.match str 2.040000 0.000000 2.040000 ( 2.046464)
str.match re 2.180000 0.000000 2.180000 ( 2.174691)
What about re === str (case compare)?
Since it evaluates to true or false and has no need for storing matches, returning match index and that stuff, I wonder if it would be an even faster way of matching than =~.
Ok, I tested this. =~ is still faster, even if you have multiple capture groups, however it is faster than the other options.
BTW, what good is freeze? I couldn't measure any performance boost from it.
Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.
'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.
Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.
The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.
The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though you can write tests that fool you by not being careful.
I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.
The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.
To complete Wiktor Stribiżew and Dougui answers I would say that /regex/.match?("string") about as fast as "string".match?(/regex/).
Ruby 2.4.0 (10 000 000 ~2 sec)
2.4.0 > require 'benchmark'
=> true
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x005563da1b1c80 #label="", #real=2.2060338060000504, #cstime=0.0, #cutime=0.0, #stime=0.04000000000000001, #utime=2.17, #total=2.21>
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x005563da139eb0 #label="", #real=2.260814556000696, #cstime=0.0, #cutime=0.0, #stime=0.010000000000000009, #utime=2.2500000000000004, #total=2.2600000000000007>
Ruby 2.6.2 (100 000 000 ~20 sec)
irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 #label="", #real=24.60139879199778, #cstime=0.0, #cutime=0.0, #stime=0.010000999999999996, #utime=24.565644999999996, #total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 #label="", #real=24.634255946999474, #cstime=0.0, #cutime=0.0, #stime=0.010046, #utime=24.598276, #total=24.608321999999998>
Note: times varies, sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/), the differences maybe only due to the machine activity.

Ruby, remove last N characters from a string?

What is the preferred way of removing the last n characters from a string?
irb> 'now is the time'[0...-4]
=> "now is the "
If the characters you want to remove are always the same characters, then consider chomp:
'abc123'.chomp('123') # => "abc"
The advantages of chomp are: no counting, and the code more clearly communicates what it is doing.
With no arguments, chomp removes the DOS or Unix line ending, if either is present:
"abc\n".chomp # => "abc"
"abc\r\n".chomp # => "abc"
From the comments, there was a question of the speed of using #chomp versus using a range. Here is a benchmark comparing the two:
require 'benchmark'
S = 'asdfghjkl'
SL = S.length
T = 10_000
A = 1_000.times.map { |n| "#{n}#{S}" }
GC.disable
Benchmark.bmbm do |x|
x.report('chomp') { T.times { A.each { |s| s.chomp(S) } } }
x.report('range') { T.times { A.each { |s| s[0...-SL] } } }
end
Benchmark Results (using CRuby 2.13p242):
Rehearsal -----------------------------------------
chomp 1.540000 0.040000 1.580000 ( 1.587908)
range 1.810000 0.200000 2.010000 ( 2.011846)
-------------------------------- total: 3.590000sec
user system total real
chomp 1.550000 0.070000 1.620000 ( 1.610362)
range 1.970000 0.170000 2.140000 ( 2.146682)
So chomp is faster than using a range, by ~22%.
Ruby 2.5+
As of Ruby 2.5 you can use delete_suffix or delete_suffix! to achieve this in a fast and readable manner.
The docs on the methods are here.
If you know what the suffix is, this is idiomatic (and I'd argue, even more readable than other answers here):
'abc123'.delete_suffix('123') # => "abc"
'abc123'.delete_suffix!('123') # => "abc"
It's even significantly faster (almost 40% with the bang method) than the top answer. Here's the result of the same benchmark:
user system total real
chomp 0.949823 0.001025 0.950848 ( 0.951941)
range 1.874237 0.001472 1.875709 ( 1.876820)
delete_suffix 0.721699 0.000945 0.722644 ( 0.723410)
delete_suffix! 0.650042 0.000714 0.650756 ( 0.651332)
I hope this is useful - note the method doesn't currently accept a regex so if you don't know the suffix it's not viable for the time being. However, as the accepted answer (update: at the time of writing) dictates the same, I thought this might be useful to some people.
str = str[0..-1-n]
Unlike the [0...-n], this handles the case of n=0.
I would suggest chop. I think it has been mentioned in one of the comments but without links or explanations so here's why I think it's better:
It simply removes the last character from a string and you don't have to specify any values for that to happen.
If you need to remove more than one character then chomp is your best bet. This is what the ruby docs have to say about chop:
Returns a new String with the last character removed. If the string
ends with \r\n, both characters are removed. Applying chop to an empty
string returns an empty string. String#chomp is often a safer
alternative, as it leaves the string unchanged if it doesn’t end in a
record separator.
Although this is used mostly to remove separators such as \r\n I've used it to remove the last character from a simple string, for example the s to make the word singular.
name = "my text"
x.times do name.chop! end
Here in the console:
>name = "Nabucodonosor"
=> "Nabucodonosor"
> 7.times do name.chop! end
=> 7
> name
=> "Nabuco"
Dropping the last n characters is the same as keeping the first length - n characters.
Active Support includes String#first and String#last methods which provide a convenient way to keep or drop the first/last n characters:
require 'active_support/core_ext/string/access'
"foobarbaz".first(3) # => "foo"
"foobarbaz".first(-3) # => "foobar"
"foobarbaz".last(3) # => "baz"
"foobarbaz".last(-3) # => "barbaz"
if you are using rails, try:
"my_string".last(2) # => "ng"
[EDITED]
To get the string WITHOUT the last 2 chars:
n = "my_string".size
"my_string"[0..n-3] # => "my_stri"
Note: the last string char is at n-1. So, to remove the last 2, we use n-3.
Check out the slice() method:
http://ruby-doc.org/core-2.5.0/String.html#method-i-slice
You can always use something like
"string".sub!(/.{X}$/,'')
Where X is the number of characters to remove.
Or with assigning/using the result:
myvar = "string"[0..-X]
where X is the number of characters plus one to remove.
If you're ok with creating class methods and want the characters you chop off, try this:
class String
def chop_multiple(amount)
amount.times.inject([self, '']){ |(s, r)| [s.chop, r.prepend(s[-1])] }
end
end
hello, world = "hello world".chop_multiple 5
hello #=> 'hello '
world #=> 'world'
Using regex:
str = 'string'
n = 2 #to remove last n characters
str[/\A.{#{str.size-n}}/] #=> "stri"
x = "my_test"
last_char = x.split('').last

Resources