Remove duplicate text from multiple strings - ruby

I have:
a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
I'm looking for an algorithm that can do:
> magic(a, b, c)
=> ['A with property B and propery C',
'B with property X and propery Y',
'C having no properties']
I have to find duplicates in 1000+ texts. High performance isn't a must, but would be nice.
-- Update
I'm looking for sequence of words. So if:
d = 'This is Product D with text engraving: "Buy". Buy it now!'
The first "Buy" should not count as a duplicate. I'm guessing I have to use a threshold of n consecutive words before a sequence is treated as a duplicate.

def common_prefix_length(*args)
first = args.shift
(0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end
def magic(*args)
i = common_prefix_length(*args)
args = args.map { |a| a[i..-1].reverse }
i = common_prefix_length(*args)
args.map { |a| a[i..-1].reverse }
end
a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
magic(a,b,c)
# => ["A with property B and propery C",
# "B with property X and propery Y",
# "C having no properties"]

Your data
sentences = [
"This is Product A with property B and propery C. Buy it now!",
"This is Product B with property X and propery Y. Buy it now!",
"This is Product C having no properties. Buy it now!"
]
Your magic
def magic(data)
prefix, postfix = 0, -1
data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break while true
data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break while true
data.map{ |d| d[prefix..postfix] }
end
Your output
magic(sentences)
#=> [
#=> "A with property B and propery C",
#=> "B with property X and propery Y",
#=> "C having no properties"
#=> ]
Or you can use loop instead of while true
def magic(data)
prefix, postfix = 0, -1
loop{ data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break }
loop{ data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break }
data.map{ |d| d[prefix..postfix] }
end

Edit: this code has bugs. I'm leaving my answer up for reference, and because I don't like it when people delete answers after being downvoted. Everyone makes mistakes :-)
I liked @falsetru's approach but felt the code was unnecessarily complex. Here's my attempt:
def common_prefix_length(strings)
i = 0
i += 1 while strings.map{|s| s[i] }.uniq.size == 1
i
end
def common_suffix_length(strings)
common_prefix_length(strings.map(&:reverse))
end
def uncommon_infixes(strings)
pl = common_prefix_length(strings)
sl = common_suffix_length(strings)
strings.map{|s| s[pl...-sl] }
end
As the OP may be concerned about performance, I did a quick benchmark:
require 'fruity'
require 'securerandom'
prefix = 'PREFIX '
suffix = ' SUFFIX'
test_data = Array.new(1000) do
prefix + SecureRandom.hex + suffix
end
def fl00r_meth(data)
prefix, postfix = 0, -1
data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break while true
data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break while true
data.map{ |d| d[prefix..postfix] }
end
def falsetru_common_prefix_length(*args)
first = args.shift
(0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end
def falsetru_meth(*args)
i = falsetru_common_prefix_length(*args)
args = args.map { |a| a[i..-1].reverse }
i = falsetru_common_prefix_length(*args)
args.map { |a| a[i..-1].reverse }
end
def padde_common_prefix_length(strings)
i = 0
i += 1 while strings.map{|s| s[i] }.uniq.size == 1
i
end
def padde_common_suffix_length(strings)
padde_common_prefix_length(strings.map(&:reverse))
end
def padde_meth(strings)
pl = padde_common_prefix_length(strings)
sl = padde_common_suffix_length(strings)
strings.map{|s| s[pl...-sl] }
end
compare do
fl00r do
fl00r_meth(test_data.dup)
end
falsetru do
falsetru_meth(*test_data.dup)
end
padde do
padde_meth(test_data.dup)
end
end
These are the results:
Running each test once. Test will take about 1 second.
fl00r is similar to padde
padde is faster than falsetru by 30.000000000000004% ± 10.0%
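The edit note on this answer says the code has bugs without naming them. Two likely candidates are that `s[pl...-sl]` returns an empty string whenever the common suffix length is 0, and that the prefix loop never terminates when all strings are identical (every `s[i]` becomes nil, so `uniq.size` stays 1). The following is a minimal sketch of how both could be avoided, reusing the method names from the answer; it is one possible repair, not the author's own fix.

```ruby
# Length of the prefix shared by all strings, never running past the shortest one.
def common_prefix_length(strings)
  shortest = strings.map(&:length).min || 0
  i = 0
  i += 1 while i < shortest && strings.map { |s| s[i] }.uniq.size == 1
  i
end

def uncommon_infixes(strings)
  pl = common_prefix_length(strings)
  # Measure the suffix on what remains after the prefix, so the two cannot overlap.
  rests = strings.map { |s| s[pl..-1] }
  sl = common_prefix_length(rests.map(&:reverse))
  # Slice by length instead of a negative index, so sl == 0 keeps the whole remainder.
  rests.map { |s| s[0, s.length - sl] }
end

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"
p uncommon_infixes([a, b, c])
p uncommon_infixes(["abc", "abd"])   # no common suffix
p uncommon_infixes(["same", "same"]) # identical strings
```

Slicing with a start and a length sidesteps the `-0` problem, and the `i < shortest` guard is what stops the loop on identical input.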

Related

How to optimize code - it works, but I know I'm missing much learning

The exercise I'm working on asks "Write a method, coprime?(num_1, num_2), that accepts two numbers as args. The method should return true if the only common divisor between the two numbers is 1."
I've written a method to complete the task, first by finding all the factors then sorting them and looking for duplicates. But I'm looking for suggestions on areas I should consider to optimize it.
The code works, but it is just not clean.
def factors(num)
return (1..num).select { |n| num % n == 0}
end
def coprime?(num_1, num_2)
num_1_factors = factors(num_1)
num_2_factors = factors(num_2)
all_factors = num_1_factors + num_2_factors
new = all_factors.sort
dups = 0
new.each_index do |i|
dups += 1 if new[i] == new[i+1]
end
if dups > 1
false
else
true
end
end
p coprime?(25, 12) # => true
p coprime?(7, 11) # => true
p coprime?(30, 9) # => false
p coprime?(6, 24) # => false

You could use Euclid's algorithm to find the GCD, then check whether it's 1.
def gcd a, b
while a % b != 0
a, b = b, a % b
end
return b
end
def coprime? a, b
gcd(a, b) == 1
end
p coprime?(25, 12) # => true
p coprime?(7, 11) # => true
p coprime?(30, 9) # => false
p coprime?(6, 24) # => false
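To see why the loop above ends on the GCD, it can help to record the pair (a, b) at each step. The sketch below uses the same loop as the answer; `gcd_steps` is a hypothetical helper name introduced only for illustration.

```ruby
# Returns every (a, b) pair the Euclidean loop visits.
# The second element of the last pair is the GCD.
def gcd_steps(a, b)
  steps = [[a, b]]
  while a % b != 0
    a, b = b, a % b
    steps << [a, b]
  end
  steps
end

p gcd_steps(30, 9)  # => [[30, 9], [9, 3]]    GCD is 3, so not coprime
p gcd_steps(25, 12) # => [[25, 12], [12, 1]]  GCD is 1, so coprime
```

Each step replaces the pair with a smaller one that has the same common divisors, which is why the loop stops after only a few iterations even for large inputs.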

You can just use Integer#gcd:
def coprime?(num_1, num_2)
num_1.gcd(num_2) == 1
end

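You don't need to compare all the factors, just the prime ones. Ruby comes with a Prime class:
require 'prime'
def prime_numbers(num_1, num_2)
# Any shared prime factor is at most the smaller number. Stopping at max / 2
# would miss the case where both numbers are the same prime, e.g. (7, 7).
Prime.each([num_1, num_2].min).map(&:itself)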
end
def factors(num, prime_numbers)
prime_numbers.select {|n| num % n == 0}
end
def coprime?(num_1, num_2)
prime_numbers = prime_numbers(num_1, num_2)
# & returns the intersection of 2 arrays (https://stackoverflow.com/a/5678143)
(factors(num_1, prime_numbers) & factors(num_2, prime_numbers)).length == 0
end

Optimising code for matching two strings modulo scrambling

I am trying to write a function scramble(str1, str2) that returns true if a portion of str1 characters can be rearranged to match str2, otherwise returns false. Only lower case letters (a-z) will be used. No punctuation or digits will be included. For example:
str1 = 'rkqodlw'; str2 = 'world' should return true.
str1 = 'cedewaraaossoqqyt'; str2 = 'codewars' should return true.
str1 = 'katas'; str2 = 'steak' should return false.
This is my code:
def scramble(s1, s2)
#sorts strings into arrays
first = s1.split("").sort
second = s2.split("").sort
correctLetters = 0
for i in 0...first.length
#check for occurrences of first letter
occurrencesFirst = first.count(s1[i])
for j in 0...second.length
#scan through second string
occurrencesSecond = second.count(s2[j])
#if letter to be tested is correct and occurrences of first less than occurrences of second
#meaning word cannot be formed
if (s2[j] == s1[i]) && occurrencesFirst < occurrencesSecond
return false
elsif s2[j] == s1[i]
correctLetters += 1
elsif first.count(s1[s2[j]]) == 0
return false
end
end
end
if correctLetters == 0
return false
end
return true
end
I need help optimising this code. Please give me suggestions.

Here is one efficient and Ruby-like way of doing that.
Code
def scramble(str1, str2)
h1 = char_counts(str1)
h2 = char_counts(str2)
h2.all? { |ch, nbr| nbr <= h1[ch] }
end
def char_counts(str)
str.each_char.with_object(Hash.new(0)) { |ch, h| h[ch] += 1 }
end
Examples
scramble('abecacdeba', 'abceae')
#=> true
scramble('abecacdeba', 'abweae')
#=> false
Explanation
The three steps are as follows.
str1 = 'abecacdeba'
str2 = 'abceae'
h1 = char_counts(str1)
#=> {"a"=>3, "b"=>2, "e"=>2, "c"=>2, "d"=>1}
h2 = char_counts(str2)
#=> {"a"=>2, "b"=>1, "c"=>1, "e"=>2}
h2.all? { |ch, nbr| nbr <= h1[ch] }
#=> true
The last statement is equivalent to
2 <= 3 && 1 <= 2 && 1 <= 2 && 2 <= 2
The method char_counts constructs what is sometimes called a "counting hash". To understand how char_counts works, see Hash::new, especially the explanation of the effect of providing a default value as an argument of new. In brief, if a hash is defined h = Hash.new(0), then if h does not have a key k, h[k] returns the default value, here 0 (and the hash is not changed).
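A short sketch of the default-value behaviour described above, using a throwaway hash `h`:

```ruby
h = Hash.new(0)   # 0 is returned for any missing key

h["a"] += 1       # same as h["a"] = h["a"] + 1, which starts from the default 0
h["a"] += 1

p h["a"]          # => 2
p h["b"]          # => 0, the default
p h.key?("b")     # => false, reading the default did not add the key
p h               # => {"a"=>2}
```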
Suppose, for different data,
h1 = { "a"=>2 }
h2 = { "a"=>1, "b"=>2 }
Then we would find that 1 <= 2 #=> true but 2 <= 0 #=> false, so the method would return false. The second comparison is 2 <= h1["b"]. As h1 does not have a key "b", h1["b"] returns the default value, 0.
The method char_counts is effectively a short way of writing the method expressed as follows.
def char_counts(str)
h = {}
str.each_char do |ch|
h[ch] = 0 unless h.key?(ch) # instead of Hash.new(0)
h[ch] = h[ch] + 1 # instead of h[ch] += 1
end
h # not needed when using `each_with_object`
end
See Enumerable#each_with_object, String#each_char (preferable to String#chars, as the latter produces an unneeded temporary array whereas the former returns an enumerator) and Hash#key? (or Hash#has_key?, Hash#include? or Hash#member?).
An Alternative
def scramble(str1, str2)
str2.chars.difference(str1.chars).empty?
end
class Array
def difference(other)
h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
reject { |e| h[e] > 0 && h[e] -= 1 }
end
end
I have found the method Array#difference to be so useful I proposed it be added to the Ruby Core (here). The response has been, er, underwhelming.
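The difference method above removes one occurrence per matching element, unlike `Array#-`, which removes every occurrence. The sketch below shows the contrast with a standalone method instead of reopening Array. `multiset_difference` is a hypothetical name chosen for this example; it also avoids clashing with the core `Array#difference` that, to my knowledge, Ruby 2.6 added with the same set-like semantics as `Array#-`.

```ruby
# Same logic as the Array#difference in the answer, written as a plain method.
def multiset_difference(array, other)
  remaining = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  array.reject do |e|
    if remaining[e] > 0
      remaining[e] -= 1   # use up one occurrence from `other`
      true
    else
      false
    end
  end
end

p multiset_difference([1, 1, 2, 3], [1, 3])  # => [1, 2]
p [1, 1, 2, 3] - [1, 3]                      # => [2]

def scramble(str1, str2)
  multiset_difference(str2.chars, str1.chars).empty?
end

p scramble('rkqodlw', 'world')  # => true
p scramble('katas', 'steak')    # => false
```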

One way:
def scramble(s1,s2)
s2.chars.uniq.all? { |c| s1.count(c) >= s2.count(c) }
end
Another way:
def scramble(s1,s2)
pool = s1.chars.group_by(&:itself)
s2.chars.all? { |c| pool[c]&.pop }
end
Yet another:
def scramble(s1,s2)
('a'..'z').all? { |c| s1.count(c) >= s2.count(c) }
end
Since this appears to be from codewars, I submitted my first two there. Both got accepted and the first one was a bit faster. Then I was shown solutions of others and saw someone using ('a'..'z') and it's fast, so I include that here.
The codewars "performance tests" aren't shown explicitly but they're all up to about 45000 letters long. So I benchmarked these solutions as well as Cary's (yours was too slow to be included) on shuffles of the alphabet repeated to be about that long (and doing it 100 times):
               user     system      total        real
Stefan 1   0.812000   0.000000   0.812000 (  0.811765)
Stefan 2   2.141000   0.000000   2.141000 (  2.127585)
Other      0.125000   0.000000   0.125000 (  0.122248)
Cary 1     2.562000   0.000000   2.562000 (  2.575366)
Cary 2     3.094000   0.000000   3.094000 (  3.106834)
Moral of the story? String#count is fast here. Like, ridiculously fast. Almost unbelievably fast (I actually had to run extra tests to believe it). It counts through about 1.9 billion letters per second (100 times 26 letters times 2 strings of ~45000 letters, all in 0.12 seconds). Note that the difference to my own first solution is just that I do s2.chars.uniq, and that increases the time from 0.12 seconds to 0.81 seconds. Meaning this double pass through one string takes about six times as long as the 52 passes for counting. The counting is about 150 times faster. I did expect it to be very fast, because it presumably just searches a byte in an array of bytes using C code (edit: looks like it does), but this speed still surprised me.
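The throughput figure can be reproduced from the numbers given above. This is only the arithmetic, assuming the string length produced by the benchmark setup (the alphabet repeated 45000 / 26 times) and the "Other" real time from the table:

```ruby
runs          = 100                # n.times in the benchmark
letters       = 26                 # one String#count call per letter of ('a'..'z')
strings       = 2                  # s1 and s2 are both counted
string_length = 26 * (45000 / 26)  # integer division, so 44_980 characters
seconds       = 0.122248           # "real" time reported for "Other"

counted = runs * letters * strings * string_length
rate    = counted / seconds

puts counted                                                # 233896000
puts format("%.2f billion letters per second", rate / 1e9)  # 1.91 billion letters per second
```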
Code:
require 'benchmark'
def scramble_stefan1(s1,s2)
s2.chars.uniq.all? { |c| s1.count(c) >= s2.count(c) }
end
def scramble_stefan2(s1,s2)
pool = s1.chars.group_by(&:itself)
s2.chars.all? { |c| pool[c]&.pop }
end
def scramble_other(s1,s2)
('a'..'z').all? { |c| s1.count(c) >= s2.count(c) }
end
def scramble_cary1(str1, str2)
h1 = char_counts(str1)
h2 = char_counts(str2)
h2.all? { |ch, nbr| nbr <= h1[ch] }
end
def char_counts(str)
str.each_char.with_object(Hash.new(0)) { |ch, h| h[ch] += 1 }
end
def scramble_cary2(str1, str2)
str2.chars.difference(str1.chars).empty?
end
class Array
def difference(other)
h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
reject { |e| h[e] > 0 && h[e] -= 1 }
end
end
Benchmark.bmbm do |x|
n = 100
s1 = (('a'..'z').to_a * (45000 / 26)).shuffle.join
s2 = s1.chars.shuffle.join
x.report('Stefan 1') { n.times { scramble_stefan1(s1, s2) } }
x.report('Stefan 2') { n.times { scramble_stefan2(s1, s2) } }
x.report('Other') { n.times { scramble_other(s1, s2) } }
x.report('Cary 1') { n.times { scramble_cary1(s1, s2) } }
x.report('Cary 2') { n.times { scramble_cary2(s1, s2) } }
end

What's wrong with my code?

def encrypt(string)
alphabet = ("a".."b").to_a
result = ""
idx = 0
while idx < string.length
character = string[idx]
if character == " "
result += " "
else
n = alphabet.index(character)
n_plus = (n + 1) % alphabet.length
result += alphabet[n_plus]
end
idx += 1
end
return result
end
puts encrypt("abc")
puts encrypt("xyz")
I'm trying to get "abc" to print out "bcd" and "xyz" to print "yza". I want to advance the letter forward by 1. Can someone point me to the right direction?

All I had to do was change your alphabet array to go from a to z, not a to b, and it works fine.
def encrypt(string)
alphabet = ("a".."z").to_a
result = ""
idx = 0
while idx < string.length
character = string[idx]
if character == " "
result += " "
else
n = alphabet.index(character)
n_plus = (n + 1) % alphabet.length
result += alphabet[n_plus]
end
idx += 1
end
return result
end
puts encrypt("abc")
puts encrypt("xyz")

Another way to solve the issue, that I think is simpler, personally, is to use String#tr:
ALPHA = ('a'..'z').to_a.join #=> "abcdefghijklmnopqrstuvwxyz"
BMQIB = ('a'..'z').to_a.rotate(1).join #=> "bcdefghijklmnopqrstuvwxyza"
def encrypt(str)
str.tr(ALPHA,BMQIB)
end
def decrypt(str)
str.tr(BMQIB,ALPHA)
end
encrypt('pizza') #=> "qjaab"
decrypt('qjaab') #=> "pizza"

Alternatively, if you don't want to spend memory on storing the alphabet, you can work with character codes and shift the letters with arithmetic:
def encrypt(string)
result = ""
idx = 0
while idx < string.length
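char = string[idx]
# Leave spaces alone; shift lowercase letters by one, wrapping "z" around to "a".
result += (char == " " ? char : ((char.ord - "a".ord + 1) % 26 + "a".ord).chr)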
idx += 1
end
result
end
Another thing about Ruby is that you do not need an explicit return at the end of a method body. A method returns the value of the last expression it evaluates, and relying on that is considered good style among Ruby programmers.
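A small sketch of the implicit return described above. The method names `shift_letter` and `describe` are made up for this illustration.

```ruby
def shift_letter(char)
  code = char.ord + 1
  code.chr                       # last expression, so this is the return value
end

def describe(n)
  return "negative" if n < 0     # an explicit return is still useful for leaving early
  "non-negative"                 # otherwise the last expression is returned
end

p shift_letter("a")  # => "b"
p describe(-1)       # => "negative"
p describe(5)        # => "non-negative"
```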

Your question has been answered, so here are a couple of more Ruby-like ways of doing that.
Use String#gsub with a hash
CODE_MAP = ('a'..'z').each_with_object({}) { |c,h| h[c] = c < 'z' ? c.next : 'a' }
#=> {"a"=>"b", "b"=>"c",..., "y"=>"z", "z"=>"a"}
DECODE_MAP = CODE_MAP.invert
#=> {"b"=>"a", "c"=>"b",..., "z"=>"y", "a"=>"z"}
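# Match only lowercase letters; with /./ any character missing from the map (such as a space) would be replaced by an empty string.
def encrypt(word)
word.gsub(/[a-z]/, CODE_MAP)
end
def decrypt(word)
word.gsub(/[a-z]/, DECODE_MAP)
end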
encrypt('pizza')
#=> "qjaab"
decrypt('qjaab')
#=> "pizza"
Use String#gsub with Array#rotate
LETTERS = ('a'..'z').to_a
#=> ["a", "b", ..., "z"]
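# Match only lowercase letters; LETTERS.index returns nil for anything else (such as a space), which would raise a TypeError.
def encrypt(word)
word.gsub(/[a-z]/) { |c| LETTERS.rotate[LETTERS.index(c)] }
end
def decrypt(word)
word.gsub(/[a-z]/) { |c| LETTERS.rotate(-1)[LETTERS.index(c)] }
end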
encrypt('pizza')
#=> "qjaab"
decrypt('qjaab')
#=> "pizza"

How do I perform variable assignment with a loop, and break?

Here is the logic:
y = 'var to check for'
some_var = some_loop.each do |x|
x if x == y
break if x
end
Is there a better way to write this?
Something like
x && break if x == y
Thank you in advance!

The correct answer is to use include?, e.g.:
found = (array_expression).include?(search_value)
Note that include? takes the value to look for as an argument, not a block, and returns true or false. It's also possible to use each and break out on the first matched value, but the C implementation of include? is faster than a Ruby-level loop with each.
Here is a test program, comparing the performance of invoking include? on a very large array vs. invoking each on the same array with the same argument.
#!/usr/bin/env ruby
#
require 'benchmark'
def f_include a, b
b if a.include?(b)
end
def f_each_break a, b
a.each {|x| return b if x == b }
nil
end
# gen large array of random numbers
a = (1..100000).map{|x| rand 1000000}
# now select 1000 random numbers in the set
nums = (1..1000).map{|x| a[rand a.size]}
# now, check the time for f_include vs. f_each_break
r1 = r2 = nil
Benchmark.bm do |bm|
bm.report('incl') { r1 = nums.map {|n| f_include a,n} }
bm.report('each') { r2 = nums.map {|n| f_each_break a,n} }
end
if r1.size != r2.size || r1 != r2
puts "results differ"
puts "r1.size = #{r1.size}"
puts "r2.size = #{r2.size}"
exceptions = (0..r1.size).select {|x| x if r1[x] != r2[x]}.compact
puts "There were #{exceptions.size} exceptions"
else
puts "results ok"
end
exit
Here is the output from the test:
$ ./test-find.rb
          user     system      total        real
incl  5.150000   0.090000   5.240000 (  7.410580)
each  7.400000   0.140000   7.540000 (  9.269962)
results ok

Why not:
some_var = (some_loop.include?(y) ? y : nil)
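Two further options that stay close to the loop in the question are Enumerable#find and `break` with a value. The sketch below uses a made-up array for `some_loop`; it is an illustration, not a benchmarked recommendation.

```ruby
some_loop = %w[apple banana cherry]
y = 'banana'

# Enumerable#find returns the first element for which the block is truthy, or nil.
some_var = some_loop.find { |x| x == y }
p some_var   # => "banana"

# `break value` makes the each call itself return that value.
via_break = some_loop.each { |x| break x if x == y }
p via_break  # => "banana"

# Without a match, each returns its receiver rather than nil.
no_match = some_loop.each { |x| break x if x == 'kiwi' }
p no_match   # => ["apple", "banana", "cherry"]
```

The last case is the reason find is usually the safer choice when the result is assigned to a variable.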

What is the best way to handle this type of inclusive logic in Ruby?

Is there a better way of handling this in Ruby, while continuing to use the symbols?
pos = :pos1 # can be :pos2, :pos3, etc.
if pos == :pos1 || pos == :pos2 || pos == :pos3
puts 'a'
end
if pos == :pos1 || pos == :pos2
puts 'b'
end
if pos == :pos1
puts 'c'
end
The obvious way would be swapping out the symbols for number constants, but that's not an option.
pos = 3
if pos >= 1
puts 'a'
end
if pos >= 2
puts 'b'
end
if pos >= 3
puts 'c'
end
Thanks.
EDIT
I just figured out that Ruby orders symbols in alpha/num order. This works perfectly.
pos = :pos2 # can be :pos2, :pos3, etc.
if pos >= :pos1
puts 'a'
end
if pos >= :pos2
puts 'b'
end
if pos >= :pos3
puts 'c'
end
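Symbols can be compared this way because Symbol includes Comparable and orders symbols by their string form. That ordering is lexicographic rather than numeric, which matters once the names go past one digit. A short sketch, assuming names of the form `:posN`:

```ruby
p :pos1 < :pos2    # => true
p :pos3 >= :pos2   # => true

# Lexicographic, not numeric: "pos10" sorts before "pos2".
p :pos10 < :pos2                 # => true
p [:pos2, :pos10, :pos1].sort    # => [:pos1, :pos10, :pos2]
```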

I'm not sure this is the best way, but
I would make use of the include? method from array:
puts 'a' if [:pos1, :pos2, :pos3].include? pos
puts 'b' if [:pos1, :pos2].include? pos
puts 'c' if [:pos1].include? pos

Just use the case statement
pos = :pos1 # can be :pos2, :pos3, etc.
case pos
when :pos1 then %w[a b c]
when :pos2 then %w[a b]
when :pos3 then %w[a]
end.each {|x| puts x }

There are lots of different ways to get your output. Which one you
want depends on your specific objections to your if statements.
I've added a bunch of extra formatting to make the output easier
to read.
If you don't like the logical ORs and how they separate the results
from the output, you can use a lookup table:
puts "Lookup table 1:"
lookup_table1 = {
:pos1 => %w{a b c},
:pos2 => %w{a b },
:pos3 => %w{a },
}
[:pos1, :pos2, :pos3].each { |which|
puts "\t#{which}"
lookup_table1[which].each { |x| puts "\t\t#{x}" }
}
Or, if you want all the "work" in the lookup table:
puts "Lookup table 2:"
lookup_table2 = {
:pos1 => lambda do %w{a b c}.each { |x| puts "\t\t#{x}" } end,
:pos2 => lambda do %w{a b }.each { |x| puts "\t\t#{x}" } end,
:pos3 => lambda do %w{a }.each { |x| puts "\t\t#{x}" } end,
}
[:pos1, :pos2, :pos3].each { |which|
puts "\t#{which}"
lookup_table2[which].call
}
If your problem is that symbols aren't ordinals, then you can
ordinalize them by converting them to strings:
puts "Ordinals by .to_s and <="
[:pos1, :pos2, :pos3].each { |which|
puts "\t#{which}"
if which.to_s <= :pos3.to_s
puts "\t\ta"
end
if which.to_s <= :pos2.to_s
puts "\t\tb"
end
if which.to_s <= :pos1.to_s
puts "\t\tc"
end
}
Or you could monkey patch a comparison operator into the Symbol
class (not recommended):
puts "Ordinals by Symbol#<="
class Symbol
def <= (x)
self.to_s <= x.to_s
end
end
[:pos1, :pos2, :pos3].each { |which|
puts "\t#{which}"
if which <= :pos3
puts "\t\ta"
end
if which <= :pos2
puts "\t\tb"
end
if which <= :pos1
puts "\t\tc"
end
}
Or you could use a lookup table to supply your ordinal values:
puts "Ordinals through a lookup table:"
ordinal = {
:pos1 => 1,
:pos2 => 2,
:pos3 => 3,
}
[:pos1, :pos2, :pos3].each { |which|
puts "\t#{which}"
if ordinal[which] <= 3
puts "\t\ta"
end
if ordinal[which] <= 2
puts "\t\tb"
end
if ordinal[which] <= 1
puts "\t\tc"
end
}
Those are the obvious ones off the top of my head. It is hard to say what would be best without more specifics on what your problem with your if approach is; your second example indicates that what you really want is a way to make symbols into ordinals.
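Another way to give the symbols an order, in the same spirit as the ordinal lookup table, is to take each symbol's position in an array. This is a minimal sketch; `ORDER`, `LABELS` and `labels_for` are hypothetical names, and the mapping follows the first if-chain in the question (:pos1 prints a, b and c; :pos3 prints only a).

```ruby
ORDER  = [:pos1, :pos2, :pos3].freeze
LABELS = %w[a b c].freeze

# Returns the labels that apply to the given position.
def labels_for(pos)
  rank = ORDER.index(pos)          # 0 for :pos1, 1 for :pos2, ...
  return [] if rank.nil?           # unknown symbol: nothing applies
  LABELS.first(ORDER.length - rank)
end

p labels_for(:pos1)  # => ["a", "b", "c"]
p labels_for(:pos2)  # => ["a", "b"]
p labels_for(:pos3)  # => ["a"]
```

Adding a position then only means appending to both arrays, with no comparison operators to keep in sync.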

More generically, you can use this:
pos = :pos3
arr = [:pos1,:pos2,:pos3]
curr = 'a'
idx = arr.length
while idx > 0
puts curr if arr.first(idx).include? pos
curr = curr.next
idx -= 1
end
Or this, for your specific example:
puts 'a'
puts 'b' if pos != :pos3
puts 'c' if pos == :pos1
