Remove words from string which are present in some set - ruby

I want to remove from a string every word that appears in a given set. One way is to iterate over the set and remove each word with str.gsub("subString", ""). Does a function like this already exist?
Example string :
"Hotel Silver Stone Resorts"
Strings in set:
["Hotel" , "Resorts"]
Output should be:
" Silver Stone "

You can build a union of several patterns with Regexp::union:
words = ["Hotel" , "Resorts"]
re = Regexp.union(words)
#=> /Hotel|Resorts/
"Hotel Silver Stone Resorts".gsub(re, "")
#=> " Silver Stone "
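Note that Regexp.union escapes string arguments for you, so the words need no manual escaping. The pattern matches substrings, though, so "Hotel" would also be removed from inside "Hotels".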
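A minimal sketch of a whole-word variant, assuming you do not want "Hotel" stripped out of a longer word such as "Hotels". The \b anchors and the squeeze/strip cleanup at the end are additions for illustration, not part of the answer above.

```ruby
words = ["Hotel", "Resorts"]

# Regexp.union escapes plain strings, so regex metacharacters in a word are safe.
union = Regexp.union(words)

# Wrap the alternation in word boundaries so only whole words match.
whole_words = /\b(?:#{union.source})\b/

input = "Hotel Silver Stone Resorts Hotels"

# Remove the words, then collapse the leftover runs of spaces.
cleaned = input.gsub(whole_words, "").squeeze(" ").strip
puts cleaned
```

Whether to collapse the leftover whitespace depends on whether you need the exact output from the question (" Silver Stone ") or a tidy string.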

You can subtract one array from another in Ruby. The result is the first array with every element that also appears in the second array removed.
Split the string on whitespace, remove all the unwanted words in one swift move, and rejoin the sentence.
s = "Hotel Silver Stone Resorts"
junk_words = ['Hotel', 'Resorts']
def strip_junk(original, junk)
  (original.split - junk).join(' ')
end
strip_junk(s, junk_words) # => "Silver Stone"
It certainly looks better (to my eye). Not sure about performance characteristics (too lazy to benchmark it)

I am not sure exactly what you want, but as I understand it:
sentence = 'Hotel Silver Stone Resorts'
remove_words = ["Hotel" , "Resorts"] # you can add words to this array which you wanted to remove
sentence.split.delete_if{|x| remove_words.include?(x)}.join(' ')
=> "Silver Stone"
OR
if you have an array of strings, it's easier:
sentence = 'Hotel Silver Stone Resorts'
remove_words = ["Hotel" , "Resorts"]
(sentence.split - remove_words).join(' ')
=> "Silver Stone"

You could try something different, but I don't know whether it will be faster or not (it depends on the length of your strings and the size of your set):
require 'set'
str = "Hotel Silver Stone Resorts"
setStr = Set.new(str.split)
setToRemove = Set.new( ["Hotel", "Resorts"])
modifiedStr = (setStr.subtract setToRemove).to_a.join " "
Output
"Silver Stone"
It uses the Set class, which is faster for looking up a single element (it is built on Hash).
But again, the conversion back with to_a may cancel out the gain if your strings or sets are very large.
It also implicitly removes duplicates from your string and your set (when you create the sets).
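If dropping duplicate words is not acceptable, one possible variation keeps the Set only for the lookup and filters the word list with reject. This is a sketch under that assumption; the method name remove_words is made up for the example.

```ruby
require 'set'

# Hypothetical helper: removes every word found in `words` from `str`,
# keeping duplicates and the original word order.
def remove_words(str, words)
  stop = words.to_set                       # hash-based membership test
  str.split.reject { |w| stop.include?(w) }.join(" ")
end

result = remove_words("Hotel Silver Silver Stone Resorts", ["Hotel", "Resorts"])
puts result
```

Only the words to remove go into the Set, so the repeated "Silver" in the input survives.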

Related

How to match undefined number of arguments or how to match known keywords in a regular expression

Some questions about regex, simple for you but not for me.
a) I want to match a string using a regular expression.
keyword term1,term2,term3,.....termN
The number of terms is undefined. I know how to begin but after I am lost ;-)
\(\w+)(\s+) but after ?\i
b) A little bit more complicated:
capitale france paris,england london,germany berlin, ...
I want to separate the couples ai bi in order to analyse them.
c) How do I check whether one of several keywords is present?
direction LEFT,RIGHT,UP,DOWN
This isn't a good task for a regular expression, at least not the way you want to use it. In addition, you're asking several questions that have to be addressed in several steps; determining duplicates isn't part of a regex's skill set.
A regex assumes there is a repeating pattern, and if you're trying to parse an entire line with an indeterminate number of elements at once, it will take a very complex pattern.
I'd recommend you use a simple split(',') to break the line on commas:
'keyword term1,term2,term3,.....termN'.split(',')
# => ["keyword term1", "term2", "term3", ".....termN"]
'capitale france paris,england london,germany berlin, ...'.split(',')
# => ["capitale france paris", "england london", "germany berlin", " ..."]
Once you have the line split, if you want to break apart complex entries on white-space, use a bare split:
'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
# ["england", "london"],
# ["germany", "berlin"],
# ["..."]]
This will all fall apart if there are embedded commas in a field. The data you're working with looks like CSV (comma-separated values), and that spec allows for them. If you're working with true CSV data, then use the CSV library that comes with Ruby. It will save your sanity and keep you from trying to reinvent the wheel.
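To illustrate the embedded-comma case, here is a small sketch using CSV.parse_line. The sample line is invented for the example, and it assumes the csv library is available (on recent Ruby versions it ships as a bundled gem rather than as part of the core standard library).

```ruby
require 'csv'

# Invented sample: the second field contains a comma and is therefore quoted.
line = 'capitale france paris,"london, england",germany berlin'

fields = CSV.parse_line(line)

# The quoted field keeps its embedded comma instead of being split in two.
p fields
```

A plain split(',') on the same line would cut the quoted field into two pieces.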
To count keywords you can do something like:
entries = 'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
# ["england", "london"],
# ["germany", "berlin"],
# ["..."]]
keywords = Hash.new { |h, k| h[k] = 0 }
entries.each do |entry|
  entry.each do |e|
    keywords[e] += 1 if e[/\b(?:france|england|germany)\b/i]
  end
end
keywords # => {"france"=>1, "england"=>1, "germany"=>1}
There are other ways to do this using various methods in Enumerable and Array, but this demonstrates the technique. I used a pattern to locate the keyword hits because it's fast and can find the keyword within a string. You could do a lookup using index or find or any? but they'll slow your code as the list of keywords grows.
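One of the "other ways" could look like the sketch below. It assumes Ruby 2.7 or later for Enumerable#tally and uses a Set for the keyword list so that each lookup stays a hash lookup as the list grows. The variable names are illustrative only.

```ruby
require 'set'

line = 'capitale france paris,england london,germany berlin'
keywords = Set['france', 'england', 'germany']

# Flatten the comma-separated entries into one list of lowercase words.
words = line.split(',').flat_map(&:split).map(&:downcase)

# Count how often each keyword occurs.
counts = words.select { |w| keywords.include?(w) }.tally
p counts

# Part (c) of the question: is any keyword present at all?
found = words.any? { |w| keywords.include?(w) }
p found
```

Unlike the regex version, this only counts exact whole-word hits, which may or may not be what you need.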

Ruby diff two strings and make an array of the parts that are the same

With Ruby, how can I get the diff between two strings, then use the identical parts as a base to split the rest?
For example, I have two strings (Not all strings will have this formatting):
String1 = "Computer: Person1, Title: King, Phone: 555-1212"
String2 = "Computer: PersonB, Title: Queen, Phone: 123-4567"
I would like to be able to compare (diff) the two strings so that I get the result:
["Computer: ",", Title:",", Phone:"]
then use this to reparse the original strings to get:
["Person1","King","555-1212"] and ["PersonB","Queen","123-4567"]
which I could label in db/storage with the former array.
Are there features to do this and how would I achieve these results?
The object of this is not need prior knowledge of formatting. This way just the data are analyzed for patterning and then divided as such. It may be comma delimited, new lines, spaced out, etc.
I am looking at gem "diffy" and "diff-lcs" to see if they might help split this up.
I think all you need is a hash; with a hash you can do anything fancy.
>> require 'json'
>> String1 = "Computer: Person1, Title: King, Phone: 555-1212"
>> # quote every run of characters that is not whitespace, a colon, or a comma
>> a = String1.gsub(/[^\s:,]+/) { |w| "\"#{w}\"" }
>> a.insert(0, "{")
>> a.insert(-1, "}")
>> a1 = JSON.parse(a)
>> #=> {
"Computer" => "Person1",
"Title" => "King",
"Phone" => "555-1212"
}
Then you can look up whatever you asked for in the question, like
>> a1["Computer"]
>> #=> "Person1"
Addendum
You can also abstract it into a method:
def str_to_hash(str)
  # quote every key and value so the string becomes valid JSON
  output = str.gsub(/[^\s:,]+/) { |w| "\"#{w}\"" }
  output.insert(0, "{").insert(-1, "}")
  JSON.parse(output)
end
>> h2 = str_to_hash(String2)
>> h2["Computer"]
>> #=>"PersonB"
String1 = "Computer: Person1, Title: King, Phone: 555-1212"
String2 = "Computer: PersonB, Title: Queen, Phone: 123-4567"
keys = String1.split - (String1.split - String2.split)
values = String1.split - keys
You need to find a suitable way to split your specific data. For instance, if values are allowed to contain spaces inside double quotes, you can do something like .split(/"?[^ ]*\ ?[^ ]*"?/), but there is no general solution that will handle every type of data.
And then you need to clean up the resulting values.
Given those strings, I would rather split the columns on "," and then use the part before ":" as the name of the column.
This is related to the longest common subsequence problem, but that algorithm is not smart enough to handle the semantics of the data.
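The "split on commas, then on the colon" idea might be sketched as follows. It assumes every field has the form "name: value" and that neither names nor values contain commas; parse_fields is a made-up name.

```ruby
# Hypothetical helper: turns "Name: value, Name: value" into a Hash.
def parse_fields(str)
  str.split(/,\s*/)                              # one "name: value" chunk per column
     .map { |pair| pair.split(/:\s*/, 2) }       # split only at the first colon
     .to_h
end

record = parse_fields("Computer: Person1, Title: King, Phone: 555-1212")
p record
p record.keys
p record.values
```

The keys give the labels and the values give the data, which is the pair of arrays the question asks for, but only because the format is known in advance.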
s1 = String1.split(' ')
s2 = String2.split(' ')
s1 - s2
=> ["Person1,", "King,", "555-1212"]
s2 - s1
=> ["PersonB,", "Queen,", "123-4567"]
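Building on the token comparison above, a rough sketch that collects labels and values in one pass. It assumes both strings split into the same number of tokens in the same positions, and it needs Ruby 2.5 or later for delete_suffix.

```ruby
s1 = "Computer: Person1, Title: King, Phone: 555-1212".split
s2 = "Computer: PersonB, Title: Queen, Phone: 123-4567".split

labels  = []
values1 = []
values2 = []

# Tokens that are identical in both strings are treated as labels,
# everything else as data (with the trailing comma stripped).
s1.zip(s2).each do |a, b|
  if a == b
    labels << a
  else
    values1 << a.delete_suffix(",")
    values2 << b.delete_suffix(",")
  end
end

p labels
p values1
p values2
```

If two records happen to share a value (the same title, say), that value would be mistaken for a label, so more than two sample strings may be needed in practice.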

Truncate string to the first n words

What's the best way to truncate a string to the first n words?
n = 3
str = "your long long input string or whatever"
str.split[0...n].join(' ')
=> "your long long"
str.split[0...n] # note that there are three dots, which excludes n
=> ["your", "long", "long"]
You could do it like this:
s = "what's the best way to truncate a ruby string to the first n words?"
n = 6
trunc = s[/(\S+\s+){#{n}}/].strip
if you don't mind making a copy.
You could also apply Sawa's Improvement (wish I was still a mathematician, that would be a great name for a theorem) by adjusting the whitespace detection:
trunc = s[/(\s*\S+){#{n}}/]
If you have to deal with an n that is greater than the number of words in s then you could use this variant:
s[/(\S+(\s+)?){,#{n}}/].strip
You can use str.split.first(n).join(' ')
with n being any number.
Contiguous white spaces in the original string are replaced with a single white space in the returned string.
For example, try this in irb:
>> a='apple orange pear banana pineaple grapes'
=> "apple orange pear banana pineaple grapes"
>> b=a.split.first(2).join(' ')
=> "apple orange"
This syntax is very clear (it uses neither a regular expression nor array slicing by index). If you program in Ruby, you know that clarity is an important stylistic choice.
A shorthand for join is *
So this syntax str.split.first(n) * ' ' is equivalent and shorter (more idiomatic, less clear for the uninitiated).
You can also use take instead of first
so the following would do the same thing
a.split.take(2) * ' '
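If you want an ellipsis only when something was actually cut off, a small helper along these lines could work. The name first_words and the omission keyword are assumptions made for this sketch, not an existing API.

```ruby
# Hypothetical helper: returns the first n words of str, appending
# `omission` only when words were dropped.
def first_words(str, n, omission: "...")
  words = str.split
  return words.join(" ") if words.length <= n

  words.first(n).join(" ") + omission
end

puts first_words("your long long input string or whatever", 3)
puts first_words("short string", 5)
```

As with the split-based one-liners, runs of whitespace in the input are collapsed to single spaces.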
In Rails 4.2 or later (which provides truncate_words), this could be:
string_given.squish.truncate_words(number_given, omission: "")

Split a string with multiple delimiters in Ruby

Take for instance, I have a string like this:
options = "Cake or pie, ice cream, or pudding"
I want to be able to split the string on " or ", ", ", and ", or ".
I have been able to do it, but only by splitting on "," and ", or" first, then splitting each array item at "or" and flattening the resulting array, like this:
options = options.split(/(?:\s?or\s)*([^,]+)(?:,\s*)*/).reject(&:empty?);
options.each_index {|index| options[index] = options[index].sub("?","").split(" or "); }
The resultant array is as such: ["Cake", "pie", "ice cream", "pudding"]
Is there a more efficient (or easier) way to split my string on those three delimiters?
What about the following:
options.gsub(/ or /i, ",").split(",").map(&:strip).reject(&:empty?)
This replaces every delimiter except the comma with a comma,
splits the string at each comma,
strips each item, since items like ice cream may be left with a leading space,
and removes all blank strings.
First of all, your method could be simplified a bit with Array#flatten:
>> options.split(',').map{|x|x.split 'or'}.flatten.map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]
I would prefer using a single regex:
>> options.split /\s*, or\s+|\s*,\s*|\s+or\s+/
=> ["Cake", "pie", "ice cream", "pudding"]
You can use | in a regex to give alternatives, and putting , or first guarantees that it won’t produce an empty item. Capturing the whitespace with the regex is probably best for efficiency, since you don’t have to scan the array again.
As Zabba points out, you may still want to reject empty items, prompting this solution:
>> options.split(/,|\sor\s/).map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]
As "or" and "," do the same thing, the best approach is to tell the regex that multiple cases should be treated the same as a single case:
options = "Cake or pie, ice cream, or pudding"
regex = /(?:\s*(?:,|or)\s*)+/
options.split(regex)
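One caveat with the pattern above: because the whitespace around "or" is optional, it can also match an "or" inside a word. A hedged variant adds word boundaries around "or"; the extra items in the sample string are invented to show the difference.

```ruby
# Sample extended with words that contain "or" (invented for illustration).
options = "Cake or pie, ice cream, or pudding, oranges or corn"

# Same idea as above, but "or" must be a whole word to count as a delimiter.
delimiters = /(?:\s*(?:,|\bor\b)\s*)+/

items = options.split(delimiters)
p items
```

Consecutive delimiters such as ", or " still collapse into a single split point, so no empty items appear in the middle of the list.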

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
require 'amatch'
module FuzzyFinder
  # Yields (or collects) [start_offset, word] for every word in the input.
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos += word.length + white.length  # accumulate the running offset
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end
  # Returns [best_score, [candidate start offsets]] for text within doc.
  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      # append following words until the window is as long as the fragment
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end
Obviously there are numerous improvements possible and probably necessary! A few off the top:
Process the document once and store the results, possibly in a database.
Determine a usable length of string for an initial check, process against that initial substring first before trying to match the entire fragment.
Following up on the previous, precalculate starting fragments of that length.
A simple one is fuzzy_match
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborate one (though you wouldn't guess it from this example) is levenshtein, which computes the number of differences.
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
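To show what "number of differences" means, here is a plain-Ruby sketch of the classic dynamic-programming edit distance. It is an illustration only, not the gem's implementation, and levenshtein_distance is a name chosen for this example.

```ruby
# Edit distance: minimum number of single-character insertions,
# deletions, and substitutions needed to turn a into b.
def levenshtein_distance(a, b)
  # prev[j] holds the distance between the processed prefix of a and b[0, j].
  prev = (0..b.length).to_a

  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # delete ca
               curr[j - 1] + 1,        # insert cb
               prev[j - 1] + cost].min # substitute (or keep)
    end
    prev = curr
  end

  prev.last
end

puts levenshtein_distance('test', 'test')
puts levenshtein_distance('test', 'tent')
```

For documents of many pages this quadratic approach gets slow, which is one reason to prefer a native-extension gem for real workloads.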
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
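The character-pair idea can be sketched in a few lines of plain Ruby. This is a simplified illustration of Dice's coefficient on per-word bigrams, not the code of any particular gem; the method names are made up.

```ruby
# All adjacent character pairs of every word in str, lowercased.
def bigrams(str)
  str.downcase.scan(/\w+/).flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice's coefficient: 2 * shared pairs / total pairs, between 0.0 and 1.0.
def dice_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    i = remaining.index(pair)
    remaining.delete_at(i) if i   # consume each matching pair only once
  end

  2.0 * shared / total
end

puts dice_similarity("Healed", "Sealed")
puts dice_similarity("France", "Republic of France")
```

Because only pairs are compared, the score tolerates inserted spaces or punctuation between words better than a raw character-by-character comparison.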
It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z], you can parse the substring and then use Regexp#match on the document:
document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam
(Here, as we do not set any parentheses in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.)
If you are only interested in the start position, you can use =~ operator:
start_pos = document =~ re
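A slightly stricter sketch of the same idea: it escapes each fragment and only allows non-letter characters between them, instead of ".*", so the match cannot stretch across unrelated text. It still assumes the artifacts are never letters; fuzzy_index is a hypothetical name.

```ruby
# Hypothetical helper: returns the index in `document` where `fragment`
# starts, ignoring differences in whitespace and punctuation, or nil.
def fuzzy_index(document, fragment)
  pieces = fragment.scan(/[a-z]+/i).map { |w| Regexp.escape(w) }
  return nil if pieces.empty?

  # Letter runs must appear in order, separated only by non-letters.
  pattern = Regexp.new(pieces.join("[^a-z]*"), Regexp::IGNORECASE)
  document =~ pattern
end

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr   = "tortor - dolessi _%&# +illam"

start_pos = fuzzy_index(document, substr)
puts document[start_pos, 20] if start_pos
```

As soon as a single letter differs this returns nil, so it complements rather than replaces the distance-based approaches above.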
I have used none of them, but I found some libraries just by searching for 'diff' on rubygems.org. All of them can be installed with gem. You might want to try them. I am interested myself, so if you already know these or if you try them out, it would be helpful if you left a comment.
diff
diff-lcs
differ
difflcs
pretty_diff
diffy
kronk
khtmldiff
gdiff
ruby_diff
tdiff
diffrenderer
diffplex
dbdiff
diff_dirs
rsyncdiff
wdiff
diff4all
davidtrogers-htmldiff
edouard-htmldiff
diff2xml
dirdiff
rrdiff
nokogiri-diff
pretty-diff
easy_diff
smartdiff

Resources