Replace with multiple patterns mutually exclusively - ruby

I have the following text:
a phrase whith length one, which is "uno"
Using the following dictionary,
1) phrase --- frase
2) a phrase --- una frase
3) one --- uno
4) uno --- one
I'm trying to replace the occurrences of the dictionary items in the text. The desired output is:
[a phrase|una frase] whith length [one|uno], which is "[uno|one]"
I've done this:
text = %(a phrase whith length one, which is "uno")
dictionary.each do |original, translation|
text.gsub! original, "[#{original}|#{translation}]"
end
This snippet outputs the following for each dictionary word:
1) a [phrase|frase] whith length one, which is "uno"
2) a [phrase|frase] whith length one, which is "uno"
3) a [phrase|frase] whith length [one|uno], which is "uno"
4) a [phrase|frase] whith length [one|[uno|one]], which is "[uno|one]"
I see two problems here:
The word phrase is being replaced instead of a phrase. I think that this can be fixed by sorting the dictionary by length, giving priority to longer terms.
The already replaced words are being re-replaced, like uno in [one|uno]. I thought of using some sort of regular expression list (with Regexp.union), but I don't know how efficient and clean it'll be.
Any ideas?

To solve your second problem, you have to replace in a single pass.
Convert the dictionary into a hash with the key-value pairs in the order you mention (sorted by length, perhaps).
dictionary = {
"a phrase" => "[a phrase|una frase]",
"phrase" => "[phrase|frase]",
"one" => "[one|uno]",
"uno" => "[uno|one]",
}
Then replace all in a single pass.
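# Build real Regexp objects: "\b" inside a double-quoted string is a backspace, not a word boundary.
text.gsub(Regexp.union(dictionary.keys.map { |w| /\b#{Regexp.escape(w)}\b/ }), dictionary)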
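As a rough end-to-end sketch of the single-pass idea, the dictionary from the question can be kept as plain original => translation pairs and the bracketed form built inside the block. The names DICTIONARY and annotate are made up for this illustration, and sorting the keys by length is the asker's own suggestion rather than something the regex engine strictly needs here.

```ruby
# Hypothetical dictionary, copied from the question (original => translation).
DICTIONARY = {
  "phrase"   => "frase",
  "a phrase" => "una frase",
  "one"      => "uno",
  "uno"      => "one",
}

# annotate is an illustrative helper name, not part of any library.
def annotate(text, dictionary)
  # Longer terms first, so "a phrase" is preferred over "phrase".
  keys = dictionary.keys.sort_by { |k| -k.length }
  # One alternation of word-bounded, escaped terms.
  pattern = Regexp.union(keys.map { |k| /\b#{Regexp.escape(k)}\b/ })
  # Single pass: already inserted text is never scanned again.
  text.gsub(pattern) { |match| "[#{match}|#{dictionary[match]}]" }
end

puts annotate(%(a phrase whith length one, which is "uno"), DICTIONARY)
# prints: [a phrase|una frase] whith length [one|uno], which is "[uno|one]"
```

Because gsub walks the text once, the "uno" inserted as a translation of "one" is never matched a second time.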


Need explanation of the short Kotlin solution for strings in Codewars

I got a task on code wars.
The task is
In this simple Kata your task is to create a function that turns a string into a Mexican Wave. You will be passed a string and you must return that string in an array where an uppercase letter is a person standing up.
Rules are
The input string will always be lower case but maybe empty.
If the character in the string is whitespace then pass over it as if it was an empty seat
Example
wave("hello") => []string{"Hello", "hEllo", "heLlo", "helLo", "hellO"}
I found a solution, but I want to understand its logic. It is minimalistic and looks cool, but I don't understand what happens in it. The solution is:
fun wave(str: String) = str.indices.map { str.take(it) + str.drop(it).capitalize() }.filter { it != str }
Could you please explain?
str.indices just returns the valid indices of the string. This means the numbers from 0 to and including str.length - 1 - a total of str.length numbers.
Then, these numbers are mapped (in other words, transformed) into strings. We will now refer to each of these numbers as "it", as that is what it refers to in the map lambda.
Here's how we do the transformation: we first take the first it characters of str, then combine that with the last str.length - it characters of str, but with the first of those characters capitalized. How do we get the last str.length - it characters? We drop the first it characters.
Here's an example for when str is "hello", illustrated in a table:
it | str.take(it) | str.drop(it) | str.drop(it).capitalize() | Combined
0  | (empty)      | hello        | Hello                     | Hello
1  | h            | ello         | Ello                      | hEllo
2  | he           | llo          | Llo                       | heLlo
3  | hel          | lo           | Lo                        | helLo
4  | hell         | o            | O                         | hellO
Lastly, the solution also filters out transformed strings that are the same as str. This is to handle Rule #2. Transformed strings can only be the same as str if the capitalised character is a whitespace (because capitalising a whitespace character doesn't change it).
Side note: capitalize is deprecated. For other ways to capitalise the first character, see Is there a shorter replacement for Kotlin's deprecated String.capitalize() function?
Here's another way you could do it:
fun wave2(str: String) = str.mapIndexed { i, c -> str.replaceRange(i, i + 1, c.uppercase()) }
.filter { it.any(Char::isUpperCase) }
The filter in the original is more elegant, in my opinion; this is just an example of another way to check for a condition. replaceRange makes a copy of a string with some of the characters changed; in this case we replace the character at the current index with its uppercase form. Not as clever as the original, but good to know!

Ruby Delete From Array On Criteria

I'm just learning Ruby and have been tackling small code projects to accelerate the process.
What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete the words from the array that are less than 5 characters long. Then where the stdout is at the bottom, I'm intending to use the array. My code currently works, but is very very slow since it has to read the entire file, then individually check each element and delete the appropriate ones. This seems like it's doing too much work.
goal = File.read('big.txt').split(/\s/).map do |word|
word.scan(/[[:alpha:]]+/).uniq
end
goal.each { |word|
if word.length < 5
goal.delete(word)
end
}
puts goal.sample
Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.
You might want to change your regex to catch only words of at least 5 characters to begin with:
goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
word.scan(/[[:alpha:]]{5,}/).uniq
end
Further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:
require 'set'
goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{5,}\b/).each do |w|
goal << w
end
In this case, use the delete_if method:
# goal is your array
goal.delete_if { |w| w.length < 5 }
This deletes the words shorter than 5 characters from goal in place and returns the same array. Use reject instead if you want a new array.
Hope this helps.
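To make the difference between the in-place and the copying variant concrete, here is a small sketch on a made-up word list (the words are only sample data):

```ruby
words = %w[apple fig banana kiwi cherry]

# reject returns a new array and leaves words untouched.
kept = words.reject { |w| w.length < 5 }

# delete_if removes the short words from words itself.
words.delete_if { |w| w.length < 5 }

p kept   # => ["apple", "banana", "cherry"]
p words  # => ["apple", "banana", "cherry"]
```

Either form avoids calling delete on the array while iterating over it with each, which is what the code in the question does.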
I really don't understand what a lot of the stuff you are doing in the first loop is for.
You take every chunk of text separated by white space, and map it to a unique value in an array generated by chunking together groups of letter characters, and plug that into an array.
This is way too complicated for what you want. Try this:
goal = File.readlines('big.txt', chomp: true).select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5
end
This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:
goal = File.readlines('big.txt', chomp: true).select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5 &&
!word.upcase.include?('Q')
end
This assumes that each word in your dictionary is on its own line. You could go back to splitting it on white space, but it makes me wonder if the file you are reading in is written, human-readable text; a.k.a, it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace will not work.
Another note - map is the wrong array function to use. It modifies the values in one array and creates another out of those values. You want to select certain values from an array, but not modify them. The Array#select method is what you want.
Also, feel free to change the regex back to the [[:alpha:]] character class if you are expecting non-ASCII letters.
Edit: Second version
goal = File.read('big.txt').scan(/[a-z][a-z']{4,}/i)
Explanation: Load the whole file as one string and capture all occurrences of a group of letters that is at least 5 characters long and possibly contains, but does not start with, a '. String#scan returns every occurrence as an array, whereas match would return only the first one.
This works well, and it's only one line for your whole task, but it'll match
sugar'
in
I'd like some 'sugar', if you know what I mean
Like above, if your word can't contain q or Q, you could change the regex to
/([a-pr-z][a-pr-z']{4,})[ .'",]/i
And an idea - do another select on goal, removing all those entries that end with a '. This overcomes the limitations of my Regex
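A quick sketch of that idea on an in-memory string instead of a file. The sample sentence is invented for the illustration, and it assumes the scan-based form of the regex:

```ruby
text = "I'd like some 'sugar', maybe honey, and didn't want 'apples'"

# Letters and apostrophes, at least 5 characters, not starting with an apostrophe.
words = text.scan(/[a-z][a-z']{4,}/i)

# Second pass: drop the entries that end with an apostrophe.
cleaned = words.reject { |w| w.end_with?("'") }

p words    # => ["sugar'", "maybe", "honey", "didn't", "apples'"]
p cleaned  # => ["maybe", "honey", "didn't"]
```

Note that rejecting those entries drops the quoted words entirely; stripping the trailing apostrophe instead would keep them, if that is what you want.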

Performing operations on each line of a string

I have a string named "string" that contains six lines.
I want to remove a "z" from the end of each line (which each has) and capitalize the first character in each line (ignoring numbers and white space; e.g., "1. apple" -> "1. Apple").
I have some idea of how to do it, but have no idea how to do it in Ruby. How do I accomplish this? A loop? What would the syntax be?
Using regular expression (See String#gsub):
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
puts s.gsub(/z$/i, '').gsub(/^([^a-z]*)([a-z])/i) { $1 + $2.upcase }
# /z$/i - to match a trailing `z` at the end of lines.
# /^([^a-z]*)([a-z])/i - to match leading non-alphabets and alphabet.
# capture them as group 1 ($1), group 2 ($2)
output:
1. Apple
2. Banana
3. Cat
4. Dog
5. Elephant
6. Fruit
I would approach this by breaking your problem into smaller steps. After we've solved each of the smaller problems, you can put it all back together for a more elegant solution.
Given the initial string put forth by falsetru:
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
1. Break your string into an array of substrings, separated by the newline.
substrings = s.split(/\n/)
This uses the String class' split method and a regular expression. It searches for all occurrences of newline (backslash-n) and treats this as a delimiter, splitting the string into substrings based on this delimiter. Then it throws all of these substrings into an array, which we've named substrings.
2. Iterate through your array of substrings to do some stuff (details on what stuff later)
substrings.each do |substring|
# Do stuff to each substring
end
This is one form for how you iterate across an array in Ruby. You call the Array's each method, and you give it a block of code which it will run on each element in the array. In our example, we'll use the variable name substring within our block of code so that we can do stuff to each substring.
3. Remove the z character at the end of each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
end
Now, as we iterate through the array, the first thing we want to do is remove the z character at the end of each string. You do this with the gsub! method of String, which is a search-and-replace method. The first argument for this method is the regular expression of what you're looking for. In this case, we are looking for a z followed by the end of the line (denoted by the dollar sign; since each substring is a single line, that is also the end of the string). The second argument is an empty string, because we want to replace what's been found with nothing (in other words, we just want to remove what's been found).
4. Find the index of the first letter in each substring
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
end
The String class also has a method called index, which returns the index of the first occurrence of a string that matches the regular expression you provide. In our case, since we want to ignore numbers, symbols, and spaces, we are really just looking for the first letter in the substring. To do this, we use the regular expression /[a-zA-Z]/, which basically says, "Find me anything in the range of small A to small Z or big A to big Z." Now we have an index (using our example strings, the index is 3).
5. Capitalize the letter at the index we have found
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
Based on the index value that we found, we want to replace the letter at that index with that same letter, but capitalized.
6. Put our substrings array back together as a single-string separated by newlines.
Now that we've done everything we need to do to each substring, our each iterator block ends, and we have what we need in the substrings array. To put the array back together as a single string, we use the join method of Array class.
result = substrings.join("\n")
With that, we now have a String called result, which should be what you're looking for.
Putting It All Together
Here is what the entire solution looks like, once we put together all of the steps:
substrings = s.split(/\n/)
substrings.each do |substring|
substring.gsub!(/z$/, '')
index = substring.index(/[a-zA-Z]/)
substring[index] = substring[index].capitalize
end
result = substrings.join("\n")
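Once the individual steps are clear, they can also be chained. The following is only a compact sketch of the same steps, not a replacement for the walkthrough; it uses a shortened three-line sample string so that it is self-contained:

```ruby
s = "1. applez\n2. bananaz\n3. catz\n"

result = s.lines.map do |line|
  line.chomp                                  # drop the trailing newline
      .sub(/z\z/, "")                         # remove a z at the very end
      .sub(/[a-zA-Z]/) { |ch| ch.upcase }     # upcase the first letter found
end.join("\n")

puts result
# 1. Apple
# 2. Banana
# 3. Cat
```

Here sub is enough instead of gsub, because each line needs at most one replacement per step.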

How do I get all words before and after a character?

I never tried regex before today, and I like it so far, but I'm lost on some things.
I have a string that looks like this:
Type OtherType ThirdType - SubType AnotherSubType QuiteTheType
I want two regex, both care about the '-' character.
First I want all words before that character, then all words after it. I will be using Ruby's gsub to turn them into an array of strings, two arrays, which is why I need two regex expressions.
So far I have this: ([a-zA-z]{1,}) (?=-) but that only gets me the word right before the dash, I.E. ThirdType.
If I just use ([a-zA-z]{1,}) I get all words highlighted, but that includes the ones AFTER the - which I don't want yet.
How can I get all occurrences of [a-zA-z]{1,} that happen before - but not necessarily IMMEDIATELY before it?
s = "Type OtherType ThirdType - SubType AnotherSubType QuiteTheType"
words_before, words_after = s.split(/\s*-\s*/).map do |t|
t.split(/\s+/)
end
p words_before # => ["Type", "OtherType", "ThirdType"]
p words_after # => ["SubType", "AnotherSubType", "QuiteTheType"]
Here's how this works:
s.split(/\s*-\s*/)
This splits the string in two, using a regular expression delimiter. The delimiter means "any amount of white-space, then a dash, then any amount of white-space." The result is an array with two strings in it: the part on the left of the delimiter, and the part on the right.
...map do |t|
...
end
map takes an array and transforms it into another array with the same number of elements. It takes each element of the array, passes it to the block, and uses the return value from the block as the new value for that element. We'll use it to transform the two strings into two arrays of words.
So, what's in the block?
t.split(/\s+/)
It's another split. This time we'll split on one or more whitespace characters. That results in an array of words.
Since the map applies that split to first the left side and then the right side, the result of the entire s.split... expression is an array of two arrays.
Now we'll use one of Ruby's fun syntaxes:
words_before, words_after = s.split...
Whenever you have multiple variables on the left side of an assignment, ruby will "decompose" the array on the right side, assigning the first element of the array to the first variable, the second element of the array to the second variable, and so on. Since our array has two elements (the first being an array of words from the left side, and the second being an array of words from the right side), we'll use two variables to hold them.
I don't know exactly how Ruby's regex implementation works, but the following regex in Perl should get you what you want:
/^([a-zA-Z\s]+) \- ([a-zA-Z\s]+)$/
For example:
perl -e '$_="Type OtherType ThirdType - SubType AnotherSubType QuiteTheType";
if(/^([a-zA-Z\s]+) \- ([a-zA-Z\s]+)$/){print "$1\n";print "$2\n";}'
produces
Type OtherType ThirdType
SubType AnotherSubType QuiteTheType
ETA: To explain what's going on, the initial ^ denotes the beginning of the line and the ending $ denotes the end of the line. So, ^([a-zA-Z\s]+) starts at the beginning and (greedily) matches all of the words from the beginning of the line up until the space before the dash (which is escaped by a backslash here, although - only has a special meaning inside a character class). Likewise with ([a-zA-Z\s]+)$.
You can use look-ahead:
(\w+)(?=.*?-)
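In Ruby, the look-ahead can be used with String#scan, and a negative look-ahead gives the words after the dash. This sketch assumes the string contains exactly one dash, as in the question. The capture group is dropped because scan returns nested arrays when the pattern contains groups.

```ruby
s = "Type OtherType ThirdType - SubType AnotherSubType QuiteTheType"

# Words that still have a dash somewhere after them.
words_before = s.scan(/\w+(?=.*?-)/)

# Words with no dash anywhere after them.
words_after = s.scan(/\w+(?!.*-)/)

p words_before  # => ["Type", "OtherType", "ThirdType"]
p words_after   # => ["SubType", "AnotherSubType", "QuiteTheType"]
```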
While regex is powerful and useful, it often leads to a more complicated solution than you need, and complicated results in more work and maintenance.
sentence = 'Type OtherType ThirdType - SubType AnotherSubType QuiteTheType'
sentence.split('-') # => ["Type OtherType ThirdType ", " SubType AnotherSubType QuiteTheType"]
sentence.scan(/[^-]+/) # => ["Type OtherType ThirdType ", " SubType AnotherSubType QuiteTheType"]
If the whitespace surrounding the hyphen is annoying pass the returned sections through strip:
sentence.split('-').map{ |w| w.strip } # => ["Type OtherType ThirdType", "SubType AnotherSubType QuiteTheType"]
sentence.scan(/[^-]+/).map{ |w| w.strip } # => ["Type OtherType ThirdType", "SubType AnotherSubType QuiteTheType"]
If you want the individual words, and not the sentences before and after the hyphen:
sentence.split('-').map{ |w| w.strip.split(' ') } # => [["Type", "OtherType", "ThirdType"], ["SubType", "AnotherSubType", "QuiteTheType"]]
sentence.scan(/[^-]+/).map{ |w| w.strip.split(' ') } # => [["Type", "OtherType", "ThirdType"], ["SubType", "AnotherSubType", "QuiteTheType"]]

Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
Put the 10,000 words into a hash table, then check each word in the block of text to see whether the hash has an entry for it.
Whether it is faster I don't know; it's just another method (it would depend on how many words you are searching for).
Simple Perl example:
my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while(<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
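For comparison, roughly the same lookup in Ruby, using a Set in place of the hash. This is a sketch with the sample data above inlined; in practice the word list would presumably be loaded from wherever the 10,000 words live.

```ruby
require 'set'

# The word list goes into a Set, so each membership test is a hash lookup.
word_list = Set.new(%w[afk lol brownrabbit popped garbage trash sitdown])

word_block = "the guy went afk after being popped by a brownrabbit"

# Check every word of the text against the set.
found = word_block.split.select { |word| word_list.include?(word) }

found.each { |word| puts "found word: #{word}" }
```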
Try out the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
The answer heavily depends on the actual requirements.
How large is the word list?
How large is the text block?
How many text blocks must be processed?
How often must each text block be processed?
Do the text blocks or the word list change? If so, how frequently?
Assuming relatively small text blocks compared to the word list and processing each text block only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out if the word list contains the word.
If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting a text block means creating, for each word, a list of all the text blocks that contain that word.
In still other situations it might be helpful to generate a bit vector for each text block with one bit per word indicating if the word is contained in the text block.
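The inverted approach could look something like the sketch below. The block ids and texts are invented sample data, and index is just an illustrative variable name.

```ruby
# Hypothetical text blocks, keyed by an id.
blocks = {
  1 => "the quick brown fox",
  2 => "the lazy dog",
  3 => "quick dog",
}

# Inverted index: word => ids of the blocks that contain it.
index = Hash.new { |hash, word| hash[word] = [] }
blocks.each do |id, text|
  text.split.uniq.each { |word| index[word] << id }
end

p index["quick"]           # => [1, 3]
p index["dog"]             # => [2, 3]
p index.fetch("cat", [])   # => []
```

Once the index is built, checking a word from the word list against all blocks is a single hash lookup.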
You can build a graph used as a state machine. When you process the ith character of your input word, Ci, you try to go to the ith level of your graph by checking whether your previous node, linked to Ci-1, has a child node linked to Ci.
Example: if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 't'
n32 = 'e'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf in our graph, and the word built from all the upper nodes is associated with this node)
n31.sons = (n41)
n42 = 'are' (here again we have a word)
n32.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33,n34)
n43 = 'bee' (word)
n33.sons = (n43)
During processing, if you reach a leaf while handling the last character of your input word, and only in this case, your input is in your corpus.
This method is more complicated to implement than a single Dictionary or Hashtable, but it can use less memory, because words that share a prefix share nodes.
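A minimal sketch of such a structure in Ruby, using nested hashes as nodes. The class name Trie and the :word marker key are choices made for this illustration, and instead of separate leaf nodes it simply flags the node where a word ends.

```ruby
class Trie
  def initialize(words = [])
    @root = {}
    words.each { |word| insert(word) }
  end

  # Walk down one level per character, creating nodes as needed.
  def insert(word)
    node = @root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:word] = true   # marks the end of a complete word
  end

  # True only if every character has a child node and the last node is marked.
  def include?(word)
    node = @root
    word.each_char do |ch|
      node = node[ch]
      return false unless node
    end
    node.key?(:word)
  end
end

trie = Trie.new(%w[art are be bee])
p trie.include?("bee")  # => true
p trie.include?("ar")   # => false
```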
The Boyer-Moore string search algorithm should work. Depending on the size and number of words in the block of text, you might want to use the block's words as the keys to search the word list (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
