Simple way to remove all non-word characters - ruby

I'd like to remove all non-letter characters from a string in the simplest way possible.
For example,
from "a,sd3 31ds" to "asdds"
I can do it like this:
"a,sd3 31ds".gsub(/\W/, "").gsub(/\d/,"")
# => "asdds"
but it looks a little awkward. Is it possible to merge these regexes into one?

"a,sd3 31ds".gsub(/(\W|\d)/, "")

I would go for the regexp /[\W\d]+/. It is potentially faster than e.g. /(\W|\d)/.
require 'benchmark'
N = 500_000
Regexps = [ "(\\W|\\d)", "(\\W|\\d)+", "(?:\\W|\\d)", "(?:\\W|\\d)+",
            "\\W|\\d", "[\\W\\d]", "[\\W\\d]+" ]
Benchmark.bm(15) do |x|
  Regexps.each do |re_str|
    re = Regexp.new(re_str)
    x.report("/#{re_str}/:") { N.times { "a,sd3 31ds".gsub(re, "") } }
  end
end
gives (with ruby 2.0.0p195 [x64-mingw32])
                    user     system      total         real
/(\W|\d)/:      1.950000   0.000000   1.950000   ( 1.951437)
/(\W|\d)+/:     1.794000   0.000000   1.794000   ( 1.787569)
/(?:\W|\d)/:    1.857000   0.000000   1.857000   ( 1.855515)
/(?:\W|\d)+/:   1.638000   0.000000   1.638000   ( 1.626698)
/\W|\d/:        1.856000   0.000000   1.856000   ( 1.865506)
/[\W\d]/:       1.732000   0.000000   1.732000   ( 1.754596)
/[\W\d]+/:      1.622000   0.000000   1.622000   ( 1.617705)

You can do this with the regex "OR".
"205h2n0bn r0".gsub(/\W|\d/, "")
will do the trick :)

What about
"a,sd3 31ds".gsub(/\W|\d/,"")
You can always join regular expressions by | to express an "or".

You can try this regex:
\P{L}
It matches any character that is not a Unicode letter, but I don't know whether Ruby supports this class.
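For what it's worth, Ruby's regexp engine has accepted Unicode property classes such as \p{L} and \P{L} since 1.9. The following is a minimal sketch; the helper name letters_only is made up for illustration. It also shows one way this differs from the \W-based patterns: \W treats the underscore as a word character, so those patterns leave it in place.

```ruby
# Hypothetical helper: strip every character that is not a Unicode letter.
def letters_only(text)
  text.gsub(/\P{L}+/, "")
end

letters_only("a,sd3 31ds")      # => "asdds"
letters_only("caf\u00e9 2go")   # => "caf\u00e9go" (accented letters survive)

# \W does not match "_", so the combined class keeps underscores.
"a_b1".gsub(/[\W\d]+/, "")      # => "a_b"
letters_only("a_b1")            # => "ab"
```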

A non regex solution:
"a,sd3 31ds".delete('^A-Za-z')

Related

Regex for multiple string operations in a single pass?

How can I do the following in a single gsub? What regex gives the desired output?
string = "Make all the changes within a single pass"
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
The first gsub replaces every character that is not a vowel or whitespace with *.
The second gsub replaces every whitespace character with &.
I ask because chaining gsub doesn't feel like the right way to do this. Please let me know if you think this is a good approach.
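One detail of the pattern in the question is worth spelling out: inside a bracket expression, | is a literal pipe character, not alternation. The sample sentence contains no pipes, so the output is unaffected, but the two patterns below behave differently on other input. This is a small illustration with a made-up sample string:

```ruby
sample = "a|b c"

# With "|" inside the brackets, a literal pipe is excluded from replacement.
with_pipe = sample.gsub(/[^aeiou|\s]/, "*")    # => "a|* *"

# Without it, the pipe is replaced like any other non-vowel.
without_pipe = sample.gsub(/[^aeiou\s]/, "*")  # => "a** *"
```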
This uses String#tr to do the substitution in a single pass. This assumes the string consists of printable ASCII characters.
string.tr " \t\nB-DF-HJ-NP-TV-Zb-df-hj-np-tv-z!-@[-`{-~", '&&&*'
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
For tr, - is the range operator. The consecutive letters B, C, D can therefore be written as B-D, so B-DF-HJ-NP-TV-Z is all the capital letters minus the vowels. The lowercase consonants follow in the same way, and then come ranges covering the printable non-letter characters on the ASCII chart. All of these are replaced by *. The three most common whitespace characters (space, tab, and newline) are listed explicitly at the front of the string and are each replaced by &. Note that \s also matches carriage return, form feed, and vertical tab, which this call does not handle.
If 2 passes are allowed, then it can be written more concisely as
string.tr(' ','&').tr('^AEIOUaeiou&','*')
Ok so I figured out I can pass a block like this:
string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"}
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
I prefer the solution above but this also works:
string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&')
#=> "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
Benchmark results (corrected), using the Benchmark class on a 900k-character sample string:
Benchmark.measure { string.gsub(/[^aeiou]/) {|g| g =~ /\s/ ? "&" : "*"} }
#=> 0.800000 0.010000 0.810000 ( 0.801419 )
Benchmark.measure { string.gsub(/[^aeiou|\s]/, '*').gsub(/\s/, '&') }
#=> 0.230000 0.000000 0.230000 ( 0.231482 )
The second option is roughly 3.5 times faster (0.23 s versus 0.80 s) and, to my eye, is also the more readable of the two.
Update
Based on @Matt's answer I was also able to use String#tr. This solution is blazing fast (the fastest of all tested) on the 900k-character string.
string.tr(' ', '&').tr('^[aeiou|&]', '*')
Benchmark.measure { string.tr(' ', '&').tr('^[aeiou|&]', '*') }
#=> 0.000000 0.000000 0.000000 ( 0.015000 )
string.gsub(/(\s)|([^aeiou])/){$1 ? "&" : "*"}
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
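Another single-pass variant, offered here only as a sketch, passes a Hash to gsub as the replacement: the matched text is looked up as a key, and a default value covers everything without an explicit entry. The variable name replacements is made up for illustration, and only space, tab, and newline are mapped to &.

```ruby
# Every matched character maps to "*" unless it has an explicit entry.
replacements = Hash.new("*")
[" ", "\t", "\n"].each { |ws| replacements[ws] = "&" }

string = "Make all the changes within a single pass"
result = string.gsub(/[^aeiou]/, replacements)
# => "*a*e&a**&**e&**a**e*&*i**i*&a&*i***e&*a**"
```

This keeps the regex trivial and moves the "which replacement?" decision into a lookup table, which can be easier to extend than nested conditionals in a block.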

Manipulate string in ruby

I have a group of string variables with values like "height_low". I want to use something clean, like gsub, to get rid of the underscore and everything after it, so that the result is "height". Does someone have a solution for this? Thanks.
Try this:
strings.map! {|s| s.split('_').first}
Shorter:
my_string.split('_').first
The unavoidable regex answer. (Assuming strings is an array of strings.)
strings.map! { |s| s[/^.+?(?=_)/] }
FWIW, solutions based on String#split perform poorly because they have to parse the whole string and allocate an array. Their performance degrades as the number of underscores increases. The following performs better:
string[0, string.index("_") || string.length]
Benchmark results (with number of underscores in parentheses):
                      user     system      total         real
String#split (0)  0.640000   0.000000   0.640000   ( 0.650323)
String#split (1)  0.760000   0.000000   0.760000   ( 0.759951)
String#split (9)  2.180000   0.010000   2.190000   ( 2.192356)
String#index (0)  0.610000   0.000000   0.610000   ( 0.625972)
String#index (1)  0.580000   0.010000   0.590000   ( 0.589463)
String#index (9)  0.600000   0.000000   0.600000   ( 0.605253)
Benchmark code:
require 'benchmark'
strings = ["x", "x_x", "x_x_x_x_x_x_x_x_x_x"]
Benchmark.bm(16) do |bm|
  strings.each do |string|
    bm.report("String#split (#{string.count("_")})") do
      1000000.times { string.split("_").first }
    end
  end
  strings.each do |string|
    bm.report("String#index (#{string.count("_")})") do
      1000000.times { string[0, string.index("_") || string.length] }
    end
  end
end
Try as below using str[regexp, capture] → new_str or nil:
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, that component of the MatchData is returned instead.
strings.map { |s| s[/(.*?)_.*$/,1] }
If you're looking for something "like gsub", why not just use gsub?
"height_low".gsub(/_.*$/, "") #=> "height"
In my opinion though, this is a bit cleaner:
"height_low".split('_').first #=> "height"
Another option is to use partition:
"height_low".partition("_").first #=> "height"
Learn to think in terms of searches vs. replacements. It's usually easier, faster, and cleaner to search for, and extract, what you want, than it is to search for, and strip, what you don't want.
Consider this:
'a_b_c'[/^(.*?)_/, 1] # => "a"
It looks for only what you want, which is the text from the start of the string, up to _. Everything preceding _ is captured, and returned.
The alternates:
'a_b_c'.sub(/_.+$/, '') # => "a"
'a_b_c'.gsub(/_.+$/, '') # => "a"
have to look backwards until the engine is sure there are no more _, then the string can be truncated.
Here's a little benchmark showing how that affects things:
require 'fruity'
compare do
  string_capture { 'a_b_c'[/^(.*?)_/, 1] }
  string_sub     { 'a_b_c'.sub(/_.+$/, '') }
  string_gsub    { 'a_b_c'.gsub(/_.+$/, '') }
  look_ahead     { 'a_b_c'[/^.+?(?=_)/] }
  string_index   { 'a_b_c'[0, 'a_b_c'.index("_") || 'a_b_c'.length] }
end
# >> Running each test 8192 times. Test will take about 1 second.
# >> string_index is faster than string_capture by 19.999999999999996% ± 10.0%
# >> string_capture is similar to look_ahead
# >> look_ahead is faster than string_sub by 70.0% ± 10.0%
# >> string_sub is faster than string_gsub by 2.9x ± 0.1
Again, searching is going to be faster than any sort of replace, so think about what you're doing.
The downside of the "search" regex-based tactics like "string_capture" and "look_ahead" is that they don't handle a missing _. If there's any question whether your string will or will not contain _, use the "string_index" method, which falls back to string.length to grab the entire string.
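To make the missing-underscore behaviour concrete, here is a small sketch that runs the approaches from this thread against a string with and without an underscore. The extractors hash and its keys are names invented for this illustration.

```ruby
# Each lambda wraps one of the approaches discussed above.
extractors = {
  capture:   ->(s) { s[/^(.*?)_/, 1] },
  lookahead: ->(s) { s[/^.+?(?=_)/] },
  split:     ->(s) { s.split("_").first },
  partition: ->(s) { s.partition("_").first },
  index:     ->(s) { s[0, s.index("_") || s.length] }
}

extractors.each do |name, extract|
  with    = extract.call("height_low").inspect
  without = extract.call("height").inspect
  puts "#{name.to_s.ljust(10)} #{with.ljust(10)} #{without}"
end
# capture and lookahead return nil for "height";
# split, partition, and index all return "height".
```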

How %r(..) differs from /../ in Regexp creation in Ruby? [closed]

I am using Ruby 1.9.3 and am new to the platform.
From the docs I learned that a Regexp can be created with either of the following:
%r{pattern}
/pattern/
Is there any difference between the two styles, in terms of matching speed or restrictions on where each can or cannot be used?
I found one as below :
irb(main):006:0> s= '2/3'
=> "2/3"
irb(main):008:0> /2\/3/ =~ s
=> 0
irb(main):009:0> %r(2/3) =~ s
=> 0
irb(main):010:0> exit
The one difference I found here between %r(..) and /../ is that we don't need \ to escape /. Are there any others from your practical experience?
EDIT
As per @akashspeaking's suggestion I tried this and confirmed what he said:
> re=%r(2/3)
=> /2\/3/ # shown in /../ form, which means Ruby internally converted %r(..) to /../; that conversion would not be needed if we had written the pattern as /../ ourselves.
>
From the above it seems clear, in theory, that %r(..) is slower than /../.
Can anyone run quickbm(10000000) { /2\/3/ =~ s } and quickbm(10000000) { %r(2/3) =~ s } to measure the execution time? I don't have the required benchmark gem installed here, but I'm curious to see the output of the two. If anyone has it, could you try it in your terminal and paste the details here?
Thanks
There is absolutely no difference between %r/foo/ and /foo/.
irb(main):001:0> %r[foo]
=> /foo/
irb(main):002:0> %r{foo}
=> /foo/
irb(main):003:0> /foo/
=> /foo/
The source script will be analyzed by the interpreter at startup and both will be converted to a regexp, which, at run-time, will be the same.
The only difference is the source-code, not the executable. Try this:
require 'benchmark'
str = (('a'..'z').to_a * 256).join + 'foo'
n = 1_000_000
puts RUBY_VERSION, n
puts
Benchmark.bm do |b|
  b.report('%r') { n.times { str[%r/foo/] } }
  b.report('/') { n.times { str[/foo/] } }
end
Which outputs:
1.9.3
1000000
user system total real
%r 8.000000 0.000000 8.000000 ( 8.014767)
/ 8.000000 0.000000 8.000000 ( 8.010062)
That's on an old MacBook Pro running 10.8.2. Think about it, that's 6,656,000,000 (26 * 256 * 1,000,000) characters being searched and both returned what's essentially the same value. Coincidence? I think not.
Running this on a machine and getting an answer that varies significantly between the two tests on that CPU would indicate a difference in run-time performance of the two syntactically different ways of specifying the same thing. I seriously doubt that will happen.
EDIT:
Running it multiple times shows the randomness in action. I adjusted the code a bit to make it do five loops across the benchmarks this morning. The system was scanning the disk while running the tests so they took a little longer, but they still show minor random differences between the two runs:
require 'benchmark'
str = (('a'..'z').to_a * 256).join + 'foo'
n = 1_000_000
puts RUBY_VERSION, n
puts
regex = 'foo'
Benchmark.bm(2) do |b|
  5.times do
    b.report('%r') { n.times { str[%r/#{ regex }/] } }
    b.report('/') { n.times { str[/#{ regex }/] } }
  end
end
And the results:
# user system total real
%r 12.440000 0.030000 12.470000 ( 12.475312)
/ 12.420000 0.030000 12.450000 ( 12.455737)
%r 12.400000 0.020000 12.420000 ( 12.431750)
/ 12.400000 0.020000 12.420000 ( 12.417107)
%r 12.430000 0.030000 12.460000 ( 12.467275)
/ 12.390000 0.020000 12.410000 ( 12.418452)
%r 12.400000 0.030000 12.430000 ( 12.432781)
/ 12.390000 0.020000 12.410000 ( 12.412609)
%r 12.410000 0.020000 12.430000 ( 12.427783)
/ 12.420000 0.020000 12.440000 ( 12.449336)
Running about two seconds later:
# user system total real
%r 12.360000 0.020000 12.380000 ( 12.390146)
/ 12.370000 0.030000 12.400000 ( 12.391151)
%r 12.370000 0.020000 12.390000 ( 12.397819)
/ 12.380000 0.020000 12.400000 ( 12.399413)
%r 12.410000 0.020000 12.430000 ( 12.440236)
/ 12.420000 0.030000 12.450000 ( 12.438158)
%r 12.560000 0.040000 12.600000 ( 12.969364)
/ 12.640000 0.050000 12.690000 ( 12.810051)
%r 13.160000 0.120000 13.280000 ( 14.624694) # <-- opened new browser window
/ 12.650000 0.040000 12.690000 ( 13.040637)
There is no consistent difference in speed.
Here I found one difference between %r(..) and /../ is we don't need
to use \ to escape /.
That is their primary use. Unlike strings, whose delimiters change their semantics, the only real differences between the regular expression literals are the delimiters themselves.
See also the thread "The Ruby %r{ } expression" and the two relevant paragraphs of this doc: http://www.ruby-doc.org/core-1.9.3/Regexp.html
There is no difference, except that after %r you can use any symbol as the delimiter instead of //.
If you use %r notation, you can use an arbitrary symbol as delimiter. For example, you can write a regex as any of the following (and more):
%r{pattern}
%r[pattern]
%r(pattern)
%r!pattern!
This can be useful if your regex contains lots of '/'.
Note: No matter which delimiter you use, the regexp is stored in the default form, i.e.
%r:pattern: is displayed as /pattern/
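As a short illustration of the points made in these answers, the sketch below writes the same pattern with both literal styles and checks that each yields an ordinary Regexp that matches identically. The variable names are made up for this example.

```ruby
path = "usr/local/bin"

# Same pattern, two literal styles: only the escaping of "/" differs.
slashes   = /usr\/local\/bin/
delimited = %r{usr/local/bin}

slashes =~ path      # => 0
delimited =~ path    # => 0

# Both literals produce plain Regexp objects.
delimited.class      # => Regexp
%r{foo} == /foo/     # => true
```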

Fastest way to check if a string matches a regexp in ruby?

What is the fastest way to check if a string matches a regular expression in Ruby?
My problem is that I have to "egrep" through a huge list of strings to find the ones that match a regexp that is given at runtime. I only care about whether the string matches the regexp, not where it matches, nor what the content of the matching groups is. I hope this assumption can be used to reduce the amount of time my code spends matching regexps.
I load the regexp with
pattern = Regexp.new(ptx).freeze
I have found that string =~ pattern is slightly faster than string.match(pattern).
Are there other tricks or shortcuts that can be used to make this test even faster?
Starting with Ruby 2.4.0, you may use Regexp#match?:
pattern.match?(string)
Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:
Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
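A minimal sketch of how this might look for the "egrep" use case in the question, assuming Ruby 2.4 or later. The candidates array and the two helper methods are invented for illustration; the helpers rely on $~ being local to each method call to show that match? leaves it untouched while =~ sets it.

```ruby
pattern = Regexp.new("a+b").freeze
candidates = %w[aab xyz cab]

# Keep only the strings that match; match? returns true or false.
matching = candidates.select { |s| pattern.match?(s) }
# => ["aab", "cab"]

# Hypothetical helpers: each returns whatever $~ holds after the test.
def last_match_after_predicate(re, str)
  re.match?(str)
  $~
end

def last_match_after_operator(re, str)
  re =~ str
  $~
end

last_match_after_predicate(pattern, "aab")  # => nil
last_match_after_operator(pattern, "aab")   # => #<MatchData "aab">
```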
This is a simple benchmark:
require 'benchmark'
"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=> 0.610000 0.000000 0.610000 ( 0.578133)
"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=> 0.718000 0.000000 0.718000 ( 0.750010)
irb(main):019:0> "test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=> 1.703000 0.000000 1.703000 ( 1.578146)
So =~ is faster, but the right choice depends on what you want as the return value. If you just want to check whether the text matches a regex, use =~.
This is the benchmark I have run after finding some articles around the net.
With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew). On previous versions, re =~ str seems to be the fastest, although str =~ re is almost as fast.
#!/usr/bin/env ruby
require 'benchmark'
str = "aacaabc"
re = Regexp.new('a+b').freeze
N = 4_000_000
Benchmark.bm do |b|
  b.report("str.match re\t") { N.times { str.match re } }
  b.report("str =~ re\t") { N.times { str =~ re } }
  b.report("str[re] \t") { N.times { str[re] } }
  b.report("re =~ str\t") { N.times { re =~ str } }
  b.report("re.match str\t") { N.times { re.match str } }
  if re.respond_to?(:match?)
    b.report("re.match? str\t") { N.times { re.match? str } }
  end
end
Results MRI 1.9.3-p551:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 2.390000 0.000000 2.390000 ( 2.397331)
str =~ re 2.450000 0.000000 2.450000 ( 2.446893)
str[re] 2.940000 0.010000 2.950000 ( 2.941666)
re.match str 3.620000 0.000000 3.620000 ( 3.619922)
str.match re 4.180000 0.000000 4.180000 ( 4.180083)
Results MRI 2.1.5:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 1.150000 0.000000 1.150000 ( 1.144880)
str =~ re 1.160000 0.000000 1.160000 ( 1.150691)
str[re] 1.330000 0.000000 1.330000 ( 1.337064)
re.match str 2.250000 0.000000 2.250000 ( 2.255142)
str.match re 2.270000 0.000000 2.270000 ( 2.270948)
Results MRI 2.3.3 (there is a regression in regex matching, it seems):
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re =~ str 3.540000 0.000000 3.540000 ( 3.535881)
str =~ re 3.560000 0.000000 3.560000 ( 3.560657)
str[re] 4.300000 0.000000 4.300000 ( 4.299403)
re.match str 5.210000 0.010000 5.220000 ( 5.213041)
str.match re 6.000000 0.000000 6.000000 ( 6.000465)
Results MRI 2.4.0:
$ ./bench-re.rb | sort -t $'\t' -k 2
user system total real
re.match? str 0.690000 0.010000 0.700000 ( 0.682934)
re =~ str 1.040000 0.000000 1.040000 ( 1.035863)
str =~ re 1.040000 0.000000 1.040000 ( 1.042963)
str[re] 1.340000 0.000000 1.340000 ( 1.339704)
re.match str 2.040000 0.000000 2.040000 ( 2.046464)
str.match re 2.180000 0.000000 2.180000 ( 2.174691)
What about re === str (case compare)?
Since it evaluates to true or false and has no need for storing matches, returning match index and that stuff, I wonder if it would be an even faster way of matching than =~.
OK, I tested this. =~ is still faster than ===, even if you have multiple capture groups; however, === is faster than the other options.
BTW, what good is freeze? I couldn't measure any performance boost from it.
Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.
'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.
Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.
The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.
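As one possible starting point for that kind of experiment, the sketch below times an unanchored and an anchored pattern against a long string that matches neither. The string, patterns, and iteration count are arbitrary choices for illustration, and no particular result is implied; the numbers depend on the engine and Ruby version. Keep in mind that adding \A changes what the pattern matches, so it is only an option when the match must start at the beginning of the string.

```ruby
require 'benchmark'

miss = "x" * 100_000          # contains no "needle" at all
hit  = "needle" + miss        # "needle" at the very start

unanchored = /needle/
anchored   = /\Aneedle/       # can only match at the start of the string

Benchmark.bm(12) do |bm|
  bm.report("unanchored:") { 1_000.times { unanchored =~ miss } }
  bm.report("anchored:")   { 1_000.times { anchored =~ miss } }
end
```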
The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though you can write tests that fool you by not being careful.
I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.
The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.
To complement the answers from Wiktor Stribiżew and Dougui, I would add that /regex/.match?("string") is about as fast as "string".match?(/regex/).
Ruby 2.4.0 (10 000 000 ~2 sec)
2.4.0 > require 'benchmark'
=> true
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21>
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007>
Ruby 2.6.2 (100 000 000 ~20 sec)
irb(main):001:0> require 'benchmark'
=> true
irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
=> #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
=> #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>
Note: times vary; sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/). The differences may be due only to machine activity.

Remove "@" sign and everything after it in Ruby

I am working on an application where I need to pass on anything before the "@" sign of the user's email address as his/her first name and last name. For example, if the user has the email address "user@example.com", then when the user submits the form I remove "@example.com" from the email and assign "user" as the first and last name.
I have done research but was not able to find a way of doing this in Ruby. Any suggestions?
You can split on "@" and just use the first part.
email.split("@")[0]
That will give you the first part before the "@".
To catch anything before the @ sign:
my_string = "user@example.com"
substring = my_string[/[^@]+/]
# => "user"
Just split at the @ symbol and grab what went before it.
string.split('@')[0]
String#split will be useful. Given a string and an argument, it returns an array, splitting the string into separate elements on that argument. So if you had:
e = "test@testing.com"
e.split("@")
#=> ["test", "testing.com"]
Thus you would take e.split("@")[0] for the first part of the address.
Use gsub and a regular expression:
first_name = email.gsub(/@[^\s]+/,"")
irb(main):011:0> Benchmark.bmbm do |x|
irb(main):012:1* email = "user@domain.type"
irb(main):013:1> x.report("split"){100.times{|n| first_name = email.split("@")[0]}}
irb(main):014:1> x.report("regex"){100.times{|n| first_name = email.gsub(/@[a-z.]+/,"")}}
irb(main):015:1> end
Rehearsal -----------------------------------------
split 0.000000 0.000000 0.000000 ( 0.000000)
regex 0.000000 0.000000 0.000000 ( 0.001000)
-------------------------------- total: 0.000000sec
user system total real
split 0.000000 0.000000 0.000000 ( 0.001000)
regex 0.000000 0.000000 0.000000 ( 0.000000)
=> [#<Benchmark::Tms:0x490b810 @label="", @stime=0.0, @real=0.00100016593933105, @utime=0.0, @cstime=0.0, @total=0.0, @cutime=0.0>, #<Benchmark::Tms:0x4910bb0 @label="", @stime=0.0, @real=0.0, @utime=0.0, @cstime=0.0, @total=0.0, @cutime=0.0>]
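One more option, in the spirit of the partition answer in the earlier thread, is String#partition, which stops at the first separator and returns the whole string when the separator is absent. This is only a sketch: the helper name local_part and its separator parameter (defaulting to "@", as in a conventional email address) are made up for illustration.

```ruby
# Hypothetical helper: returns the text before the first separator,
# or the whole string when the separator does not occur.
def local_part(address, separator = "@")
  address.partition(separator).first
end

local_part("user@example.com")  # => "user"
local_part("no-separator")      # => "no-separator"
```

Unlike split, partition does not scan past the first separator, so extra separators later in the string cost nothing.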

Resources