I want to know the best way to make String#include? ignore case. Currently I'm doing the following. Any suggestions? Thanks!
a = "abcDE"
b = "CD"
result = a.downcase.include? b.downcase
Edit:
How about Array#include? as well? All elements of the array are strings.
Summary
If you are only going to test a single word against an array, or if the contents of your array change frequently, the fastest answer is Aaron's:
array.any?{ |s| s.casecmp(mystr)==0 }
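As a small sketch of the same idea: on Ruby 2.4 and later, casecmp? returns true or false directly, so the comparison with 0 can be dropped. Note that casecmp only folds ASCII letters, while casecmp? uses Unicode case folding. The variable names below are illustrative only.

```ruby
words = %w[Foo BAR baz]

# casecmp returns 0 when the strings are equal ignoring (ASCII) case.
found_with_casecmp   = words.any? { |w| w.casecmp("bar").zero? }

# casecmp? (Ruby 2.4+) returns a boolean and applies Unicode case folding.
found_with_casecmp_q = words.any? { |w| w.casecmp?("bar") }
missing              = words.any? { |w| w.casecmp?("qux") }
```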
If you are going to test many words against a static array, it's far better to use a variation of farnoy's answer: create a copy of your array that has all-lowercase versions of your words, and use include?. (This assumes that you can spare the memory to create a mutated copy of your array.)
# Do this once, or each time the array changes
downcased = array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
Even better, create a Set from your array.
# Do this once, or each time the array changes
downcased = Set.new array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
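A minimal sketch of the Set approach, wrapped in a small class so the downcasing happens in one place. The class name CaseInsensitiveLookup is hypothetical and not part of any library.

```ruby
require 'set'

# Hypothetical helper: builds the downcased Set once, then answers
# case-insensitive membership queries against it.
class CaseInsensitiveLookup
  def initialize(words)
    @downcased = Set.new(words.map(&:downcase))
  end

  def include?(word)
    @downcased.include?(word.downcase)
  end
end

lookup = CaseInsensitiveLookup.new(%w[Apple BANANA cherry])
hit  = lookup.include?("banana")  # present, regardless of case
miss = lookup.include?("Durian")  # not in the list
```

If the source array changes, the lookup object has to be rebuilt, which is the same trade-off described above.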
My original answer below is a very poor performer and generally not appropriate.
Benchmarks
Following are benchmarks for looking for 1,000 words with random casing in an array of slightly over 100,000 words, where 500 of the words will be found and 500 will not.
The 'regex' test is my answer here, using any?.
The 'casecmp' test is Aaron's answer, using any? from my comment.
The 'downarray' test is farnoy's answer, re-creating a new downcased array for each of the 1,000 tests.
The 'downonce' test is farnoy's answer, but pre-creating the lookup array once only.
The 'set_once' test is creating a Set from the array of downcased strings, once before testing.
user system total real
regex 18.710000 0.020000 18.730000 ( 18.725266)
casecmp 5.160000 0.000000 5.160000 ( 5.155496)
downarray 16.760000 0.030000 16.790000 ( 16.809063)
downonce 0.650000 0.000000 0.650000 ( 0.643165)
set_once 0.040000 0.000000 0.040000 ( 0.038955)
If you can create a single downcased copy of your array once to perform many lookups against, farnoy's answer is the best (assuming you must use an array). If you can create a Set, though, do that.
If you like, examine the benchmarking code.
Original Answer
I (originally said that I) would personally create a case-insensitive regex (for a string literal) and use that:
re = /\A#{Regexp.escape(str)}\z/i # Match exactly this string, no substrings
all = array.grep(re) # Find all matching strings…
any = array.any?{ |s| s =~ re } # …or see if any matching string is present
Using any? can be slightly faster than grep as it can exit the loop as soon as it finds a single match.
For an array, use:
array.map(&:downcase).include?(string.downcase)
Regexps are very slow and should be avoided.
You can use casecmp to do your comparison, ignoring case.
"abcdef".casecmp("abcde") #=> 1
"aBcDeF".casecmp("abcdef") #=> 0
"abcdef".casecmp("abcdefg") #=> -1
"abcdef".casecmp("ABCDEF") #=> 0
class String
  def caseinclude?(x)
    # Compare downcased copies of the receiver and the argument
    downcase.include?(x.downcase)
  end
end
my_array.map!{|c| c.downcase.strip}
where map! changes my_array, map instead returns a new array.
To farnoy: your example doesn't work in my case. I'm actually looking to match a substring of any element.
Here's my test case.
x = "<TD>", "<tr>", "<BODY>"
y = "td"
x.collect { |r| r.downcase }.include? y
=> false
x[0].include? y
=> false
x[0].downcase.include? y
=> true
Your case works with an exact case-insensitive match.
a = "TD", "tr", "BODY"
b = "td"
a.collect { |r| r.downcase }.include? b
=> true
I'm still experimenting with the other suggestions here.
---EDIT INSERT AFTER HERE---
I found the answer. Thanks to Drew Olsen
var1 = "<TD>", "<tr>","<BODY>"
=> ["<TD>", "<tr>", "<BODY>"]
var2 = "td"
=> "td"
var1.find_all{|item| item.downcase.include?(var2)}
=> ["<TD>"]
var1[0] = "<html>"
=> "<html>"
var1.find_all{|item| item.downcase.include?(var2)}
=> []
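The find_all approach above can be sketched as a pair of small helpers that downcase the search term only once. The method names select_including and any_including? are hypothetical.

```ruby
# Returns every element that contains `needle`, ignoring case.
def select_including(strings, needle)
  needle = needle.downcase
  strings.select { |s| s.downcase.include?(needle) }
end

# Returns true as soon as one element contains `needle`, ignoring case.
def any_including?(strings, needle)
  needle = needle.downcase
  strings.any? { |s| s.downcase.include?(needle) }
end

tags = ["<TD>", "<tr>", "<BODY>"]
matches = select_including(tags, "td")
```

any_including? is preferable when only a yes/no answer is needed, because any? stops at the first match.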
Related
Say I have a string "rubinassociatespa", what I would like to do is detect any substring of that string with 3 characters or more, in any other string.
For example, the following strings should be detected:
rubin
associates
spa
ass
rub
etc.
But what should NOT be detected are the following strings:
rob
cpa
dea
ru
or any other substring that does not appear in my original string, or is shorter than 3 characters.
Basically, I have a string and I am comparing many other strings against it and I only want to match the strings that comprise a substring of the original string.
I hope that's clear.
str = "rubinassociatespa"
arr = %w| rubin associates spa ass rub rob cpa dea ru |
#=> ["rubin", "associates", "spa", "ass", "rub", "rob", "cpa", "dea", "ru"]
Just use String#include?.
def substring?(str, s)
(s.size >= 3) ? str.include?(s) : false
end
arr.each { |s| puts "#{s}: #{substring? str, s}" }
# rubin: true
# associates: true
# spa: true
# ass: true
# rub: true
# rob: false
# cpa: false
# dea: false
# ru: false
you can use match
str = "rubinassociatespa"
test_str = "associates"
str.match(test_str) #=> #<MatchData "associates">
str.match(test_str).to_s #=> "associates"
test_str = 'rob'
str.match(test_str) #=> nil
So, if test_str is a substring of str, then the match method will return the entire test_str, otherwise, it will return nil.
if test_str.length >= 3 && str.match(test_str)
# do stuff here.
end
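One caveat worth illustrating: a String passed to match is compiled into a regexp, so metacharacters in the test string act as patterns rather than literal text. The sketch below compares match, include?, and match with Regexp.escape; the sample test string "r.b" is made up for the illustration.

```ruby
str = "rubinassociatespa"

# "." is a regexp wildcard here, so "r.b" matches "rub".
regexp_hit  = !str.match("r.b").nil?

# include? always compares literally.
literal_hit = str.include?("r.b")

# Escaping the test string makes match behave literally as well.
escaped_hit = !str.match(Regexp.escape("r.b")).nil?
```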
First you need a list of acceptable strings. Something like https://github.com/first20hours/google-10000-english would probably be useful.
Secondly you want a data structure that allows for fast lookups to see if a word is valid. I would use a Bloom filter for this. This gem might be useful if you don't want to implement it on your own: https://github.com/igrigorik/bloomfilter-rb
Then you need to initialize the Bloom filter with all the words in the valid word list.
Then, for each substring in your string, do a lookup in the Bloom filter to see if it is in the valid word list. See this example for how to get all substrings: What is the best way to split a string to get all the substrings by Ruby?
If the Bloom filter returns true you need to do a secondary check to confirm that the word is actually in the list, since a Bloom filter is a probabilistic data structure and can report false positives. You probably need to use a database to store the valid word list, so you can just do a database lookup to confirm that the word is valid.
I hope this gives you an idea on how to proceed.
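The steps above can be sketched roughly as follows. This sketch deliberately swaps the Bloom filter for a plain Set (which gives exact answers, so no secondary check is needed) and uses a tiny made-up word list; VALID_WORDS, substrings, and known_words_in are all hypothetical names.

```ruby
require 'set'

# Hypothetical word list; a real one would be loaded from a file or database.
VALID_WORDS = Set.new(%w[rubin associates spa rub ass])

# All substrings of `str` that are at least `min_len` characters long.
def substrings(str, min_len = 3)
  (0..str.length - min_len).flat_map do |start|
    (min_len..str.length - start).map { |len| str[start, len] }
  end
end

# Substrings of `str` that appear in the word list, without duplicates.
def known_words_in(str, words = VALID_WORDS)
  substrings(str).select { |s| words.include?(s) }.uniq
end

found = known_words_in("rubinassociatespa")
```

A Set holds every word in memory; a Bloom filter trades that exactness for a much smaller footprint, which only matters for very large word lists.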
Even after reading the standard documentation, I still can't understand how Ruby's Array#pack and String#unpack exactly work. Here is the example that's causing me the most trouble:
irb(main):001:0> chars = ["61","62","63"]
=> ["61", "62", "63"]
irb(main):002:0> chars.pack("H*")
=> "a"
irb(main):003:0> chars.pack("HHH")
=> "```"
I expected both these operations to return the same output: "abc". Each of them "fails" in a different manner (not really a fail since I probably expect the wrong thing). So two questions:
What is the logic behind those outputs?
How can I achieve the effect I want, i.e. transforming a sequence of hexadecimal numbers into the corresponding string? Even better: given an integer n, how can I transform it into a string identical to the contents of a file that, when read as a number (say, in a hex editor), equals n?
We were working on a similar problem this morning. If the array size is unknown, you can use:
ary = ["61", "62", "63"]
ary.pack('H2' * ary.size)
=> "abc"
You can reverse it using:
str = "abc"
str.unpack('H2' * str.size)
=> ["61", "62", "63"]
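One caveat, shown as a sketch: unpack works on bytes, while String#size counts characters, so for non-ASCII text the template should be built from bytesize. The sample string is an assumption chosen to contain one multibyte character.

```ruby
str = "h\u00e9"   # "hé": 2 characters, but 3 bytes in UTF-8

by_chars = str.unpack('H2' * str.size)       # stops after 2 bytes
by_bytes = str.unpack('H2' * str.bytesize)   # covers all 3 bytes

# pack returns a binary string, so restore the encoding before comparing.
round_trip = by_bytes.pack('H2' * by_bytes.size).force_encoding('UTF-8')
```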
The 'H' String directive for Array#pack says that array contents should be interpreted as nibbles of hex strings.
In the first example you've provided:
irb(main):002:0> chars.pack("H*")
=> "a"
you're telling pack to treat the first element of the array as a sequence of nibbles (half bytes) of a hex string: 0x61 in this case, which corresponds to the ASCII character 'a'. The remaining elements are ignored because the template contains only one directive.
In the second example:
irb(main):003:0> chars.pack("HHH")
=> "```"
you're telling pack to take 3 elements of the array and use only one nibble of each (the high part in this case): 0x60 corresponds to the ASCII character '`'. The low part or second nibble (0x01) "gets lost" because the '2' or '*' modifier is missing from "aTemplateString".
What you need is:
chars.pack('H*' * chars.size)
in order to pack all the nibbles of all the elements of the array as if they were hex strings.
Using 'H2' * chars.size only works if each array element is a hex string representing exactly 1 byte.
It means that something like chars = ["6161", "6262", "6363"] is going to be incomplete:
2.1.5 :047 > chars = ["6161", "6262", "6363"]
=> ["6161", "6262", "6363"]
2.1.5 :048 > chars.pack('H2' * chars.size)
=> "abc"
while:
2.1.5 :049 > chars.pack('H*' * chars.size)
=> "aabbcc"
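To put the cases side by side, here is a short sketch of how the count after H changes the result. Each directive consumes one array element, and the count says how many nibbles of that element to use.

```ruby
chars = ["61", "62", "63"]

one_directive  = chars.pack("H*")      # only the first element is consumed
single_nibbles = chars.pack("HHH")     # one nibble per element: 0x6 becomes 0x60
two_nibbles    = chars.pack("H2H2H2")  # one full byte per element
all_nibbles    = ["6162", "63"].pack("H*H*")  # every nibble of each element
```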
The Array#pack method is pretty arcane. Addressing question (2), I was able to get your example to work by doing this:
> ["61", "62", "63"].pack("H2H2H2")
=> "abc"
See the Ruby documentation for a similar example. Here is a more general way to do it:
["61", "62", "63"].map {|s| [s].pack("H2") }.join
This is probably not the most efficient way to tackle your problem; I suspect there is a better way, but it would help to know what kind of input you are starting out with.
The #pack method is common to other languages, such as Perl. If Ruby's documentation does not help, you might consult analogous documentation elsewhere.
I expected both these operations to return the same output: "abc".
The easiest way to understand why your approach didn't work, is to simply start with what you are expecting:
"abc".unpack("H*")
# => ["616263"]
["616263"].pack("H*")
# => "abc"
So, it seems that Ruby expects your hex bytes in one long string instead of separate elements of an array. So the simplest answer to your original question would be this:
chars = ["61", "62", "63"]
[chars.join].pack("H*")
# => "abc"
This approach also seems to perform comparably well for large input:
require 'benchmark'
chars = ["61", "62", "63"] * 100000
Benchmark.bmbm do |bm|
bm.report("join pack") do [chars.join].pack("H*") end
bm.report("big pack") do chars.pack("H2" * chars.size) end
bm.report("map pack") do chars.map{ |s| [s].pack("H2") }.join end
end
# user system total real
# join pack 0.030000 0.000000 0.030000 ( 0.025558)
# big pack 0.030000 0.000000 0.030000 ( 0.027773)
# map pack 0.230000 0.010000 0.240000 ( 0.241117)
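For the second part of the question (turning an integer n into the bytes that a hex editor would show as n), the same H* directive can be used on the integer's hex representation. This is a sketch for non-negative integers only; int_to_bytes and bytes_to_int are hypothetical helper names, and unpack1 requires Ruby 2.4 or later.

```ruby
# Converts a non-negative integer into its big-endian byte string.
def int_to_bytes(n)
  hex = n.to_s(16)
  hex = "0#{hex}" if hex.length.odd?   # H* needs whole bytes, so pad on the left
  [hex].pack("H*")
end

# Converts a byte string back into the integer it represents.
def bytes_to_int(str)
  str.unpack1("H*").to_i(16)
end

abc = int_to_bytes(0x616263)
n   = bytes_to_int("abc")
```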
I have a string, for example:
'This is a test string'
and an array:
['test', 'is']
I need to find out how many elements of the array are present in the string (in this case, it would be 2). What's the best/Ruby way of doing this? Also, I am doing this thousands of times, so please keep efficiency in mind.
What I tried so far:
count = 0
array.each do |el|
  count += 1 if string.include?(el)
end
Thanks
['test', 'is'].count{ |s| /\b#{s}\b/ =~ 'This is a test string' }
Edit: adjusted for full word matching.
['test', 'is'].count { |e| 'This is a test string'.split.include? e }
Your question is ambiguous.
If you are counting the occurrences, then:
('This is a test string'.scan(/\w+/).map(&:downcase) & ['test', 'is']).length
If you are counting the tokens, then:
(['test', 'is'] & 'This is a test string'.scan(/\w+/).map(&:downcase)).length
You can further speed up the calculation by replacing Array#& by some operation using a Hash (or Set).
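A minimal sketch of the Set variant, assuming whole-word, case-insensitive matching is what is wanted. The string is tokenized once, and the resulting Set can be reused for every array that needs checking.

```ruby
require 'set'

string = 'This is a test string'
array  = ['test', 'is']

# Tokenize and downcase once; Set lookups are then constant time on average.
words = Set.new(string.downcase.scan(/\w+/))
count = array.count { |el| words.include?(el.downcase) }
```

Unlike String#include?, this does not count 'is' as present merely because it occurs inside 'This'.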
Kyle's answer gave you the simple practical way of doing the job. But looking at it, allow me to remark that more efficient algorithms exist to solve your problem, when n (string length and/or number of matched strings) climbs to millions. We commonly encounter such problems in biology.
The following will work provided there are no duplicates in the string or the array.
str = "This is a test string"
arr = ["test", "is"]
match_count = arr.size - (arr - str.split).size # 2 in this example
What is the preferred way of removing the last n characters from a string?
irb> 'now is the time'[0...-4]
=> "now is the "
If the characters you want to remove are always the same characters, then consider chomp:
'abc123'.chomp('123') # => "abc"
The advantages of chomp are: no counting, and the code more clearly communicates what it is doing.
With no arguments, chomp removes the DOS or Unix line ending, if either is present:
"abc\n".chomp # => "abc"
"abc\r\n".chomp # => "abc"
From the comments, there was a question of the speed of using #chomp versus using a range. Here is a benchmark comparing the two:
require 'benchmark'
S = 'asdfghjkl'
SL = S.length
T = 10_000
A = 1_000.times.map { |n| "#{n}#{S}" }
GC.disable
Benchmark.bmbm do |x|
x.report('chomp') { T.times { A.each { |s| s.chomp(S) } } }
x.report('range') { T.times { A.each { |s| s[0...-SL] } } }
end
Benchmark Results (using CRuby 2.1.3p242):
Rehearsal -----------------------------------------
chomp 1.540000 0.040000 1.580000 ( 1.587908)
range 1.810000 0.200000 2.010000 ( 2.011846)
-------------------------------- total: 3.590000sec
user system total real
chomp 1.550000 0.070000 1.620000 ( 1.610362)
range 1.970000 0.170000 2.140000 ( 2.146682)
So chomp is faster than using a range, by ~22%.
Ruby 2.5+
As of Ruby 2.5 you can use delete_suffix or delete_suffix! to achieve this in a fast and readable manner.
The docs on the methods are here.
If you know what the suffix is, this is idiomatic (and I'd argue, even more readable than other answers here):
'abc123'.delete_suffix('123') # => "abc"
'abc123'.delete_suffix!('123') # => "abc"
It's even significantly faster (almost 40% with the bang method) than the top answer. Here's the result of the same benchmark:
user system total real
chomp 0.949823 0.001025 0.950848 ( 0.951941)
range 1.874237 0.001472 1.875709 ( 1.876820)
delete_suffix 0.721699 0.000945 0.722644 ( 0.723410)
delete_suffix! 0.650042 0.000714 0.650756 ( 0.651332)
I hope this is useful - note the method doesn't currently accept a regex so if you don't know the suffix it's not viable for the time being. However, as the accepted answer (update: at the time of writing) dictates the same, I thought this might be useful to some people.
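A short sketch of two details that are easy to miss: the bang method returns nil when nothing was removed, and String#sub with an anchored regexp is one alternative when the suffix is only known as a pattern. The sample values are made up.

```ruby
original = "abc123"

unchanged = original.delete_suffix("xyz")       # no match: returns an unmodified copy
bang_miss = original.dup.delete_suffix!("xyz")  # no match: returns nil
bang_hit  = original.dup.delete_suffix!("123")  # match: returns the shortened string

# When the suffix is a pattern rather than a fixed string, sub can be used.
trimmed = original.sub(/\d+\z/, "")
```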
str = str[0..-1-n]
Unlike [0...-n], this handles the case of n = 0.
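The difference shows up because -0 is just 0, so the exclusive range 0...0 selects nothing. The sketch below demonstrates this and adds a hypothetical drop_last helper that also copes with n larger than the string length.

```ruby
exclusive = "abc"[0...-0]    # 0...0 selects nothing
inclusive = "abc"[0..-1-0]   # 0..-1 selects the whole string

# Hypothetical helper: keeps the first (length - n) characters, never fewer than 0.
def drop_last(str, n)
  keep = [str.length - n, 0].max
  str[0, keep]
end

shortened = drop_last("now is the time", 4)
```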
I would suggest chop. I think it has been mentioned in one of the comments but without links or explanations so here's why I think it's better:
It simply removes the last character from a string and you don't have to specify any values for that to happen.
If you need to remove more than one character then chomp is your best bet. This is what the ruby docs have to say about chop:
Returns a new String with the last character removed. If the string
ends with \r\n, both characters are removed. Applying chop to an empty
string returns an empty string. String#chomp is often a safer
alternative, as it leaves the string unchanged if it doesn’t end in a
record separator.
Although this is used mostly to remove separators such as \r\n I've used it to remove the last character from a simple string, for example the s to make the word singular.
name = "my text"
x.times do name.chop! end
Here in the console:
>name = "Nabucodonosor"
=> "Nabucodonosor"
> 7.times do name.chop! end
=> 7
> name
=> "Nabuco"
Dropping the last n characters is the same as keeping the first length - n characters.
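Active Support includes String#first and String#last methods which provide a convenient way to keep or drop the first/last n characters (note that negative arguments were deprecated in Rails 6.0 and raise an ArgumentError from Rails 6.1 on):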
require 'active_support/core_ext/string/access'
"foobarbaz".first(3) # => "foo"
"foobarbaz".first(-3) # => "foobar"
"foobarbaz".last(3) # => "baz"
"foobarbaz".last(-3) # => "barbaz"
If you are using Rails, try:
"my_string".last(2) # => "ng"
[EDITED]
To get the string WITHOUT the last 2 chars:
n = "my_string".size
"my_string"[0..n-3] # => "my_stri"
Note: the last string char is at n-1. So, to remove the last 2, we use n-3.
Check out the slice() method:
http://ruby-doc.org/core-2.5.0/String.html#method-i-slice
You can always use something like
"string".sub!(/.{X}$/,'')
Where X is the number of characters to remove.
Or with assigning/using the result:
myvar = "string"[0..-X]
where X is one more than the number of characters to remove.
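With X replaced by real values, the two variants can be sketched as below. The variable n is an assumption standing for the number of characters to remove; \z and the m flag are used so that the pattern anchors at the very end of the string and also covers newlines.

```ruby
n = 2

# Regexp variant: remove exactly n characters at the end of the string.
with_sub   = "string".sub(/.{#{n}}\z/m, "")

# Range variant: the end index is -(n + 1), i.e. one more than n, negated.
with_range = "string"[0..-(n + 1)]
```

If the string is shorter than n, the regexp does not match and sub returns the string unchanged.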
If you're OK with adding a method to the String class and also want the characters you chopped off, try this:
class String
def chop_multiple(amount)
amount.times.inject([self, '']){ |(s, r)| [s.chop, r.prepend(s[-1])] }
end
end
hello, world = "hello world".chop_multiple 5
hello #=> 'hello '
world #=> 'world'
Using regex:
str = 'string'
n = 2 #to remove last n characters
str[/\A.{#{str.size-n}}/] #=> "stri"
x = "my_test"
last_char = x.split('').last
I want to replace the last occurrence of a substring in Ruby. What's the easiest way?
For example, in abc123abc123, I want to replace the last abc to ABC. How do I do that?
How about
new_str = old_str.reverse.sub(pattern.reverse, replacement.reverse).reverse
For instance:
irb(main):001:0> old_str = "abc123abc123"
=> "abc123abc123"
irb(main):002:0> pattern="abc"
=> "abc"
irb(main):003:0> replacement="ABC"
=> "ABC"
irb(main):004:0> new_str = old_str.reverse.sub(pattern.reverse, replacement.reverse).reverse
=> "abc123ABC123"
"abc123abc123".gsub(/(.*(abc.*)*)(abc)(.*)/, '\1ABC\4')
#=> "abc123ABC123"
But probably there is a better way...
Edit:
...which Chris kindly provided in the comment below.
So, as * is a greedy operator, the following is enough:
"abc123abc123".gsub(/(.*)(abc)(.*)/, '\1ABC\3')
#=> "abc123ABC123"
Edit2:
There is also a solution which neatly illustrates parallel array assignment in Ruby:
*a, b = "abc123abc123".split('abc', -1)
a.join('abc')+'ABC'+b
#=> "abc123ABC123"
Since Ruby 2.0 we can use \K which removes any text matched before it from the returned match. Combine with a greedy operator and you get this:
'abc123abc123'.sub(/.*\Kabc/, 'ABC')
#=> "abc123ABC123"
This is about 1.4 times faster than using capturing groups as Hirurg103 suggested, but that speed comes at the cost of lowering readability by using a lesser-known pattern.
more info on \K: https://www.regular-expressions.info/keep.html
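As a sketch, the \K technique can be wrapped in a helper that works for arbitrary literal substrings. The name replace_last is hypothetical; Regexp.escape keeps metacharacters in the substring literal, and the block form of sub keeps backslashes in the replacement from being interpreted.

```ruby
# Hypothetical helper: replaces only the last literal occurrence of `substring`.
def replace_last(str, substring, replacement)
  # .* is greedy, so it runs up to the last occurrence; \K drops it from the match.
  str.sub(/.*\K#{Regexp.escape(substring)}/m) { replacement }
end

result = replace_last("abc123abc123", "abc", "ABC")
```

When the substring does not occur at all, sub finds no match and the original string is returned unchanged.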
Here's another possible solution:
>> s = "abc123abc123"
=> "abc123abc123"
>> s[s.rindex('abc')...(s.rindex('abc') + 'abc'.length)] = "ABC"
=> "ABC"
>> s
=> "abc123ABC123"
When searching in huge streams of data, using reverse will definitely lead to performance issues. I use string.rpartition*:
sub_or_pattern = "!"
replacement = "?"
string = "hello!hello!hello"
array_of_pieces = string.rpartition sub_or_pattern
( array_of_pieces[(array_of_pieces.find_index sub_or_pattern)] = replacement ) rescue nil
p array_of_pieces.join
# "hello!hello?hello"
The same code must work with a string with no occurrences of sub_or_pattern:
string = "hello_hello_hello"
# ...
# "hello_hello_hello"
*rpartition uses rb_str_subseq() internally. I didn't check if that function returns a copy of the string, but I think it preserves the chunk of memory used by that part of the string. reverse uses rb_enc_cr_str_copy_for_substr(), which suggests that copies are done all the time -- although maybe in the future a smarter String class may be implemented (having a flag reversed set to true, and having all of its functions operating backwards when that is set), as of now, it is inefficient.
Moreover, regex patterns can't simply be reversed. The question only asks for replacing the last occurrence of a substring, so that's OK, but readers in need of something more robust won't benefit from the most-voted answer (as of this writing).
You can achieve this with String#sub and greedy regexp .* like this:
'abc123abc123'.sub(/(.*)abc/, '\1ABC')
simple and efficient:
s = "abc123abc123abc"
p = "123"
s.slice!(s.rindex(p), p.size)
s == "abc123abcabc"
string = "abc123abc123"
pattern = /abc/
replacement = "ABC"
matches = string.scan(pattern).length
index = 0
string.gsub(pattern) do |match|
index += 1
index == matches ? replacement : match
end
#=> abc123ABC123
I've used this handy helper method quite a bit:
def gsub_last(str, source, target)
return str unless str.include?(source)
top, middle, bottom = str.rpartition(source)
"#{top}#{target}#{bottom}"
end
If you want to make it more Rails-y, extend it on the String class itself:
class String
def gsub_last(source, target)
return self unless self.include?(source)
top, middle, bottom = self.rpartition(source)
"#{top}#{target}#{bottom}"
end
end
Then you can just call it directly on any String instance, eg "fooBAR123BAR".gsub_last("BAR", "FOO") == "fooBAR123FOO"
.gsub /abc(?=[^abc]*$)/, 'ABC'
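Matches "abc" and then asserts ((?=) is a positive lookahead) that none of the characters a, b, or c appear between the match and the end of the string. Note that [^abc] is a character class, not a negated substring, so this fails on input such as "abc123abcb"; the negative lookahead /abc(?!.*abc)/ avoids that limitation.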
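A short sketch of where the character-class lookahead works and where it does not, next to a negative-lookahead alternative. The input "abc123abcb" is a made-up example with a stray "b" after the last "abc".

```ruby
simple = "abc123abc123".sub(/abc(?=[^abc]*$)/, 'ABC')   # works: only digits follow

tricky = "abc123abcb"
# [^abc]* cannot cross the trailing "b", so nothing matches and nothing changes.
with_char_class    = tricky.sub(/abc(?=[^abc]*$)/, 'ABC')
# (?!.*abc) only requires that no further "abc" follows.
with_neg_lookahead = tricky.sub(/abc(?!.*abc)/m, 'ABC')
```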