Replace all A with B and replace all B with A - algorithm

Suppose I want to switch certain pairs of words. Say, for example, I want to switch dogs with cats and mice with rats, so that
This is my opinion about dogs and cats: I like dogs but I don't like cats. This is my opinion about mice and rats: I'm afraid of mice but I'm not afraid of rats.
becomes
This is my opinion about cats and dogs: I like cats but I don't like dogs. This is my opinion about rats and mice: I'm afraid of rats but I'm not afraid of mice.
The naïve approach
text = text.replace("dogs", "cats")
           .replace("cats", "dogs")
           .replace("mice", "rats")
           .replace("rats", "mice")
is problematic since it can perform replacement on the same words multiple times. Either of the above example sentences would become
This is my opinion about dogs and dogs: I like dogs but I don't like dogs. This is my opinion about mice and mice: I'm afraid of mice but I'm not afraid of mice.
What's the simplest algorithm for replacing string pairs, while preventing something from being replaced multiple times?

Use whichever string search algorithm you deem appropriate, as long as it can search for regular expressions. Search for a regex that matches all the words you want to swap, e.g. dogs|cats|mice|rats. Maintain a separate string for the result, initially empty (in many languages, this needs to be some kind of StringBuilder in order for repeated appending to be fast). For each match, append the characters between the end of the previous match (or the beginning of the string) and the current match, and then append the appropriate replacement (presumably obtained from a hashmap) to the result. Finally, append whatever follows the last match.
Most standard libraries should allow you to do this easily with built-in methods. For a Java example, see the documentation of Matcher.appendReplacement(StringBuffer, String). I recall doing this in C# as well, using a feature where you can specify a lambda function that decides what to replace each match with.
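In Ruby, the same single-pass idea is built in: gsub accepts a hash of replacements, so one pass over the text consults the map for each match and nothing is ever replaced twice. A minimal sketch, using the word pairs from the question:

```ruby
# Single-pass swap: one regex matches every word to swap, and gsub
# looks each match up in the hash instead of chaining replacements.
swaps = {
  "dogs" => "cats", "cats" => "dogs",
  "mice" => "rats", "rats" => "mice"
}
pattern = Regexp.union(swaps.keys)  # => /dogs|cats|mice|rats/

text = "I like dogs but I don't like cats."
text.gsub(pattern, swaps)
# => "I like cats but I don't like dogs."
```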

A naive solution that avoids any unexpected outcomes would be to replace each string with a temporary string, and then replace the temporary strings with the final strings. This assumes, however, that you can form a string which is known not to be in the text, e.g.
text = text.replace("dogs", "{]1[}")
           .replace("cats", "{]2[}")
           .replace("mice", "{]3[}")
           .replace("rats", "{]4[}")
           .replace("{]2[}", "dogs")
           .replace("{]1[}", "cats")
           .replace("{]4[}", "mice")
           .replace("{]3[}", "rats")

I am admittedly not very familiar with regex, so my idea is to create an array, then loop through the elements to see whether each one should be replaced. First split() the sentence into an array of words:
String text = "This is my opinion about dogs and cats: I like dogs but I don't like cats.";
String[] sentence = text.split("[^a-zA-Z]"); //can't avoid regex here
Then use a for loop which contains a series of if statements to replace words:
for (int i = 0; i < sentence.length; i++) {
    if (sentence[i].equals("cats")) {
        sentence[i] = "dogs";
    }
    // more similar "else if" branches; use else-if so a word that was
    // just swapped to "dogs" isn't immediately swapped back
}
Now sentence[] contains the words of the new sentence. Some regex magic should allow you to also keep the punctuation marks. I hope this helps, and please let me know if anything could be improved.

Related

EditPad: How do you work with replacement string conditionals?

I want to be able to search for a letter using Regex and then convert the case of the letter. I know I can refer to the letters and use this:
Search for ^[adw] and replace with \U0, for example, but I can't do this when searching for ^\w so that the expression applies to any letter in the search.
I have read a reply from the author of the product that refers to this:
https://www.editpadpro.com/history.html#800
In here the relevant change is:
Regex: Replacement string conditionals in the form of (?1matched:unmatched) and (?{name}matched:unmatched)
But it's not clear to me what this means. How do I actually use this?
If I have these strings and want to change the case of all the first letters in each sentence how would I use this conditional syntax to achieve it?
an apple a day.
dogs are great animals.
what about a game of football?
The result I want is this:
An apple a day.
Dogs are great animals.
What about a game of football?

Case-sensitive substitutions with gsub

As an exercise, I'm working on an accent translation dictionary. My dictionary is contained in a hash, and I'm thinking of using #gsub! to run inputted strings through the translator.
I'm wondering if there's any way to make the substitutions case-sensitive. For example, I want "didja" to translate to "did you" and "Didja" to translate to "Did you", but I don't want to have to create multiple dictionary entries to deal with case.
I know I can use regex syntax to find strings to replace case-insensitively, with str.gsub!(/#{x}/i,dictionary[x]) where x is a variable. The problem is that this replaces "Didja" with "did you", rather than matching the original case.
Is there any way to make it match the original case?
Suppose we have:
a method to_key that converts a string str to a key in a hash DICTIONARY; and
a method transform that converts the pair [str, DICTIONARY[to_key(str)]] to the replacement for str.
Then str is to be replaced with:
transform(str, DICTIONARY[to_key(str)])
Without loss of generality, I think we can assume that DICTIONARY's keys and values are all of the same case (say, lower case) and that to_key is simply:
def to_key(str)
  str.downcase
end
So all that is necessary is to define the method transform. However, the specification provided does not imply a unique mapping. We therefore must decide what transform should do.
For example, suppose the rule is simply that, if the first character of str and the first character of the dictionary value are both letters, the latter is to be converted to upper case if the former is upper case. Then:
def transform(str, dict_value)
  (str[0] =~ /[A-Z]/) ? dict_value.capitalize : dict_value
end
(I originally had dict_value[0] = dict_value[0].upcase if..., but came to my senses after reading #sawa's answer.)
Note that if DICTIONARY['cat'] => 'dog', 'Cat' will be converted to 'Dog'.
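Putting the pieces above together, here is a runnable sketch (the dictionary entries are illustrative, not from the question):

```ruby
# Wiring to_key and transform into gsub. Assumes, per the answer, that
# DICTIONARY keys and values are all lower case.
DICTIONARY = { 'didja' => 'did you', 'cat' => 'dog' }

def to_key(str)
  str.downcase
end

def transform(str, dict_value)
  (str[0] =~ /[A-Z]/) ? dict_value.capitalize : dict_value
end

# Build the pattern with join rather than interpolating Regexp.union,
# so the trailing /i actually applies to the alternatives.
pattern = /\b(?:#{DICTIONARY.keys.join('|')})\b/i

"Didja see the cat?".gsub(pattern) { |m| transform(m, DICTIONARY[to_key(m)]) }
# => "Did you see the dog?"
```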
One might think that another possibility is that all characters of str that are letters should maintain their case. This is problematic, however, as the dictionary mapping may (without further specification) remove letters, and it may not be clear from DICTIONARY[str] which letters of str were removed, some of which may be lower case and others upper case.
It is not clear what capitalization patterns you have in mind. I assume that you only need to deal with words that are all lower case, or all lower case except the first letter.
str.gsub!(/#{x}/i){|x| x.downcase! ? dictionary[x].capitalize : dictionary[x]}
I don't think this is possible, since in this scenario you need to specify the exact string that must take the place of the replaced string.
With that in mind, this is the best I can suggest:
subs = {'didja' => 'did you'}
subs.clone.each{ |k, v| subs[k.capitalize] = v.capitalize }
# if you want to replace all occurrences i.e. even substrings:
regex = /#{subs.keys.join('|')}/
# if you want to remove complete words only: (as the Tin man points out)
regex = /\b(?:#{subs.keys.join('|')})\b/ # \b checks for word-boundaries
"didja Didja".gsub(regex, subs)
Update:
Because in your example, the case-sensitive character isn't to be replaced by another value, you could use this:
regex = /(?<=(d))idja/i # again, keep in mind the substrings
"didja Didja".gsub(regex, "id you")

Is Regex faster than array comparison in this case?

Say I have an incoming string that I want to scan to see if it contains any of the words I have chosen to be "bad." :)
Is it faster to split the string into an array, as well as keep the bad words in an array, and then iterate through each bad word as well as each incoming word and see if there's a match, kind of like:
badwords.each do |badword|
incoming.each do |word|
trigger = true if badword == word
end
end
OR is it faster to do this:
incoming.each do |word|
  trigger = true if badwords.include? word
end
OR is it faster to leave the string as it is and run a .match() with a regex that looks something like:
/\bbadword1\b|\bbadword2\b|\bbadword3\b/
Or is the performance difference almost completely negligible? Been wondering this for a while.
You're giving the regex an advantage by not stopping your loop when it finds a match. Try:
incoming.find{|word| badwords.include? word}
My money is still on the regex though which should be simplified to:
/\b(badword1|badword2|badword3)\b/
or to make it a fair fight:
/\A(badword1|badword2|badword3)\z/
Once it is compiled, the regex is the fastest in real life (i.e. a really long incoming string, many similar bad words, etc.) since it can run on incoming in situ and will handle overlapping parts of your "bad words" really well.
The answer probably depends on the number of bad words to check: if there is only one bad word it probably doesn't make a huge difference; if there are 50, checking an array would probably get slow. On the other hand, with tens or hundreds of thousands of words the regexp probably won't be too fast either.
If you need to handle large numbers of bad words, you might want to consider splitting into individual words and then using a bloomfilter to test whether the word is likely to be bad or not.
This does not exactly answer your question, but it will definitely help you solve it.
Take some examples of what you are trying to achieve and put them to benchmarks.
You can find out how to do benchmarking in Ruby here:
http://ruby.about.com/od/tasks/f/benchmark.htm
http://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html
Just put the various forms between report blocks, get the benchmarks, and decide for yourself what suits you best.
For better results, use real data for the tests.
Benchmarks are always better than discussions :)
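For instance, a minimal sketch of such a benchmark, comparing the array check against the regex (the word lists are made-up placeholders):

```ruby
require 'benchmark'

badwords = %w[darn heck dang]
incoming = ("the quick brown fox darn jumps over it " * 500).split
text     = incoming.join(' ')
regex    = /\b(?:#{badwords.join('|')})\b/

Benchmark.bm(12) do |b|
  # Array membership test, word by word
  b.report('include?') { 100.times { incoming.any? { |w| badwords.include?(w) } } }
  # Single compiled regex over the whole string
  b.report('regex')    { 100.times { regex.match?(text) } }
end
```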
If you want to scan a string for occurrences of words, use scan to find them.
Use Regexp.union to build a pattern that will find the strings in your black-list. You will want to wrap the result with \b to force matching word-boundaries, and use a case-insensitive search.
To give you an idea of how Regexp.union can help:
words = %w[foo bar]
Regexp.union(words)
=> /foo|bar/
'Daniel Foo killed him a bar'.scan(/\b#{Regexp.union(words)}\b/i)
=> ["bar"]
(Careful: interpolating a Regexp embeds its own options as (?-mix:...), so the /i does not apply inside the union and only "bar" matches here. Build the pattern with join, as below, to keep the search case-insensitive.)
You could also build the pattern using Regexp.new or /.../ if you want a bit more control:
Regexp.new('\b(?:' + words.join('|') + ')\b', Regexp::IGNORECASE)
=> /\b(?:foo|bar)\b/i
/\b(?:#{words.join('|')})\b/i
=> /\b(?:foo|bar)\b/i
'Daniel Foo killed him a bar'.scan(/\b(?:#{words.join('|')})\b/i)
=> ["Foo", "bar"]
As a word of advice, black-listing words you find offensive is easily tricked by a user, and often gives results that are wrong because many "offensive" words are only offensive in a certain context. A user can deliberately misspell them or use "l33t" speak and have an almost inexhaustible supply of alternate spellings that will make you constantly update your list. It's a source of enjoyment to some people to fool a system.
I was once given a similar task and wrote a translator to supply alternate spellings for "offensive" words. I started with a list of words and terms I'd gleaned from the Internet and set my code running. After several million alternates were added to the database I pulled the plug and showed management it was a fool's errand because it was trivial to fool it.

A more elegant way to parse a string with ruby regular expression using variable grouping?

At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works, but only returns the last match separately, because there is only one group.
OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA based Regexp engines (like the one used in Ruby) are sequential as well as lazy/greedy. This means that you can dictate how a pattern is matched using the order in which you give it choices. This explains why scan was returning variable results, it was looking for the first word in the list that matched the criteria and then moved on to the next match. By design it was not looking for the longest match, but the first one. So in order to rectify this all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order, to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
str = "ascarid"
if str.scan(regexp).join.length == str.length
  return true
else
  return false
end
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.
You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
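A quick sketch of those two steps (using a non-capturing group for the validation so the pieces come back whole):

```ruby
s = "catdogbird"
# Step 1: make sure the whole string is built from the allowed words.
if s =~ /\A(?:cat|dog|bird)+\z/
  # Step 2: scan out the individual pieces.
  pieces = s.scan(/cat|dog|bird/)
  # pieces => ["cat", "dog", "bird"]
end
```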
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
words = Set.new(a)
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if parts.any? { |w| !words.include?(w) }
  # 's' didn't match what you expected so throw a
  # hissy fit, format the hard drive, set fire to
  # the backups, or whatever is appropriate.
else
  # Everything you were looking for is in 'parts'
  # so you can check the length (if you care about
  # how many matches there were) or something useful
  # and productive.
end
When you use split with a pattern that contains groups, the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"] and the empty strings will only occur between the separators that we're looking for, so we can reject them and pretend they don't exist. This may be an unexpected use of split since we're more interested in the delimiters than in what is being delimited (except to make sure that nothing is being delimited), but it gets the job done.
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
  # Bail out and complain that 's' doesn't look right
end
Then group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |len|
  re = /(#{by_length[len].map{|w| Regexp.quote(w)}.join('|')})/
  s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.
If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.

Ruby: Fast way to filter out keywords from text based on an array of words

I have a large text string and about 200 keywords that I want to filter out of the text.
There are numerous ways to do this, but I'm stuck on which way is best:
1) Use a for loop with a gsub for each keyword
2) Use a massive regular expression
Any other ideas? What would you guys suggest?
A massive regex is faster as it's going to walk the text only once.
Also, if you don't need the text, only the words, at the end, you can make the text a Set of downcased words and then remove the words that are in the filter array. But this only works if you don't need the "text" to make sense at the end (usually for tags or full text search).
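A sketch of that Set idea, assuming you only need the surviving words and not readable text (the word lists here are illustrative):

```ruby
require 'set'

filter = Set.new(%w[foo bar])
text   = "Foo said hello to bar and baz"

# Downcase, split into words, and drop everything in the filter set.
words = Set.new(text.downcase.scan(/[[:alnum:]-]+/))
kept  = words - filter
# kept contains "said", "hello", "to", "and", "baz"
```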
Create a hash with each valid keyword as key.
keywords = %w[foo bar baz]
keywords_hash = Hash[keywords.map{|k|[k,true]}]
Assuming all keywords are 3 letters or more and consist of alphanumeric characters or a dash, that case is irrelevant, and that you only want each keyword present in the text returned once:
keywords_in_text = text.downcase.scan(/[[:alnum:]-]{3,}/).select { |word|
  keywords_hash.has_key? word
}.uniq
This should be reasonably efficient even when both the text to be searched and the list of valid keywords are very large.
