Split string into a list, but keeping the split pattern - ruby

Currently I am splitting a string by a pattern, like this:
outcome_array=the_text.split(pattern_to_split_by)
The problem is that the pattern I split by always gets omitted from the result.
How do I get it to include the split pattern itself?

Thanks to Mark Wilkins for inspiration, but here's a shorter bit of code for doing it:
irb(main):015:0> s = "split on the word on okay?"
=> "split on the word on okay?"
irb(main):016:0> b=[]; s.split(/(on)/).each_slice(2) { |s| b << s.join }; b
=> ["split on", " the word on", " okay?"]
or:
s.split(/(on)/).each_slice(2).map(&:join)
See below the fold for an explanation.
Here's how this works. First, we split on "on", but wrap it in parentheses to make it into a match group. When there's a match group in the regular expression passed to split, Ruby will include that group in the output:
s.split(/(on)/)
# => ["split ", "on", " the word ", "on", " okay?"]
Now we want to join each instance of "on" with the preceding string. each_slice(2) helps by passing two elements at a time to its block. Let's just invoke each_slice(2) to see what results. Since each_slice, when invoked without a block, returns an Enumerator, we'll apply to_a to it so we can see what it will enumerate over:
s.split(/(on)/).each_slice(2).to_a
# => [["split ", "on"], [" the word ", "on"], [" okay?"]]
We're getting close. Now all we have to do is join the words together. And that gets us to the full solution above. I'll unwrap it into individual lines to make it easier to follow:
b = []
s.split(/(on)/).each_slice(2) do |s|
b << s.join
end
b
# => ["split on", " the word on", " okay?"]
But there's a nifty way to eliminate the temporary b and shorten the code considerably:
s.split(/(on)/).each_slice(2).map do |a|
a.join
end
map passes each element of its input array to the block; the result of the block becomes the new element at that position in the output array. In MRI >= 1.8.7, you can shorten it even more, to the equivalent:
s.split(/(on)/).each_slice(2).map(&:join)
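The same idea can be wrapped in a small helper. This is only a sketch: the method name split_keeping is made up for illustration, and it assumes the delimiter is a literal string (hence Regexp.escape) rather than an arbitrary pattern.

```ruby
# Hypothetical helper (name is illustrative): split `text` after each
# occurrence of the literal `delimiter`, keeping the delimiter attached
# to the piece that precedes it.
def split_keeping(text, delimiter)
  # The capture group makes split return the delimiter as its own element.
  pattern = /(#{Regexp.escape(delimiter)})/
  # Pair each piece with the delimiter that follows it, then glue them together.
  text.split(pattern).each_slice(2).map(&:join)
end

split_keeping("split on the word on okay?", "on")
# => ["split on", " the word on", " okay?"]
split_keeping("a.b.c", ".")
# => ["a.", "b.", "c"]
```

Escaping the delimiter matters for strings such as ".", which would otherwise match any character.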

You could use a regular expression assertion to locate the split point without consuming any of the input. Below uses a positive look-behind assertion to split just after 'on':
s = "split on the word on okay?"
s.split(/(?<=on)/)
=> ["split on", " the word on", " okay?"]
Or a positive look-ahead to split just before 'on':
s = "split on the word on okay?"
s.split(/(?=on)/)
=> ["split ", "on the word ", "on okay?"]
With something like this, you might want to make sure 'on' was not part of a larger word (like 'assertion'), and also remove whitespace at the split:
"don't split on assertion".split(/(?<=\bon\b)\s*/)
=> ["don't split on", "assertion"]

If you use a pattern with groups, it will return the pattern in the results as well:
irb(main):007:0> "split it here and here okay".split(/ (here) /)
=> ["split it", "here", "and", "here", "okay"]
Edit: The additional information indicated that the goal is to keep the item on which the string was split together with one of the adjacent pieces. I would think there is a simple way to do that, but I don't know it and haven't had time today to play with it. So in the absence of the clever solution, the following is one way to brute force it. Use the split method as described above to include the split items in the array. Then iterate through the array and combine every second entry (which by definition is the split value) with the previous entry.
s = "split on the word on and include on with previous"
a = s.split(/(on)/)
# iterate through and combine adjacent items together and store
# results in a second array
b = []
a.each_index{ |i|
b << a[i] if i.even?
b[b.length - 1] += a[i] if i.odd?
}
print b
Results in this:
["split on", " the word on", " and include on", " with previous"]

Related

Remove all special char except apostrophe

Given a sentence, I want to count all the duplicated words:
It is an exercise from Exercism.io (Word count).
For example, for the input "olly olly in come free":
olly: 2
in: 1
come: 1
free: 1
I have this test, for example:
def test_with_quotations
phrase = Phrase.new("Joe can't tell between 'large' and large.")
counts = {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
assert_equal counts, phrase.word_count
end
This is my method:
def word_count
phrase = @phrase.downcase.split(/\W+/)
counts = phrase.group_by{|word| word}.map {|k,v| [k, v.count]}
Hash[*counts.flatten]
end
For the test above I have this failure when I run it in the terminal:
2) Failure:
PhraseTest#test_with_apostrophes [word_count_test.rb:69]:
--- expected
+++ actual
@@ -1 +1 @@
-{"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
+{"first"=>1, "don"=>2, "t"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
My problem is removing every special character except the apostrophe.
The regex in the method almost works:
phrase = @phrase.downcase.split(/\W+/)
but it removes the apostrophes too.
I don't want to keep the single quotes around a word: 'Hello' => Hello
but Don't be cruel => Don't be cruel
Maybe something like:
string.scan(/\b[\w']+\b/i).each_with_object(Hash.new(0)){|word, counts| counts[word]+=1}
The regex employs word boundaries (\b).
scan returns an array of the words it finds. Each word is then added to the hash, whose default value of zero is incremented once per occurrence.
It turns out that my solution, whilst finding all items and ignoring case, will still leave the items in the case they were found in originally.
This would now be a decision for Nelly to either accept as is or to perform a downcase on the original string or the array item as it is added to the hash.
I'll leave that decision up to you :)
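If the decision is to downcase the whole string first, the approach above might look like the following. This is a sketch only: it takes the phrase as a plain argument instead of the Phrase class from the question, which is a simplification for illustration.

```ruby
# Sketch: count words case-insensitively, keeping apostrophes inside words
# but dropping single quotes that merely surround a word.
def word_count(phrase)
  # Downcase first so "Large" and "large" land in the same bucket.
  # \b cannot sit between a space and a quote, so surrounding quotes are skipped.
  phrase.downcase.scan(/\b[\w']+\b/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

word_count("Joe can't tell between 'large' and large.")
# => {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
```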
Given:
irb(main):015:0> phrase
=> "First: don't laugh. Then: don't cry."
Try:
irb(main):011:0> Hash[phrase.downcase.scan(/[a-z']+/)
.group_by{|word| word.downcase}
.map{|word, words|[word, words.size]}
]
=> {"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
With your update, if you want to remove single quotes, do that first:
irb(main):038:0> p2
=> "Joe can't tell between 'large' and large."
irb(main):039:0> p2.gsub(/(?<!\w)'|'(?!\w)/,'')
=> "Joe can't tell between large and large."
Then use the same method.
But you say -- gsub(/(?<!\w)'|'(?!\w)/,'') will remove the apostrophe in 'Twas the night before. To which I reply: you will eventually need to build a parser that can distinguish an apostrophe from a single quote if /(?<!\w)'|'(?!\w)/ is not sufficient.
You can also use word boundaries:
irb(main):041:0> Hash[p2.downcase.scan(/\b[a-z']+\b/)
.group_by{|word| word.downcase}
.map{|word, words|[word, words.size]}
]
=> {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
But that does not solve 'Twas the night before either.
Another way:
str = "First: don't 'laugh'. Then: 'don't cry'."
reg = /
[a-z] #single letter
[a-z']+ #one or more letters or apostrophe
[a-z] #single letter
'? #optional single apostrophe
/ix #case-insensitive and free-spacing regex
str.scan(reg).group_by(&:itself).transform_values(&:count)
#=> {"First"=>1, "don't"=>2, "laugh'"=>1, "Then"=>1, "cry'"=>1}

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space, e.g. ['one', 'one two']. I generate a regexp from these keywords like this: /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
The result of this code is
=> ["one", "one"]
How can I match the second keyword, one two, and get a result like this?
=> ["one", "one two"]
Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
re = /\b(?:#{re})\b/i
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I set the whole regex to be case-insensitive with /i, which is easier to read than (?i:...), and that downcasing the string is then redundant. The (?:...) group makes both \b anchors apply to every alternative rather than only to the first and last ones.
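One detail worth checking when an alternation is interpolated between \b anchors is grouping, because | binds more loosely than \b. The sketch below uses the keywords from the question; the variable names and the extra test string are made up for illustration.

```ruby
keywords = ['one', 'one two', 'three']
# Longest keywords first, so "one two" is tried before "one".
source = Regexp.union(keywords.sort_by { |k| -k.length }).source

ungrouped = /\b#{source}\b/i      # \b binds only to the first and last alternatives
grouped   = /\b(?:#{source})\b/i  # \b applies to every alternative

matches = 'Some word one and One Two other word'.scan(grouped)
# => ["one", "One Two"]

# Without the group, keywords can match inside larger words.
stray = 'someone threesome'.scan(ungrouped)
# => ["one", "three"]
clean = 'someone threesome'.scan(grouped)
# => []
```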
The exception is repetition like +, * and friends. They are greedy by default. .+ is going to match as many characters as it can. That's greedy. You can make it lazy, to match the first thing it sees, with a ?. .+? will match a single character.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"
The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in a descending order before building a regex. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# => [ one, one two ]
I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change keywords order.

Searching for single words and combination words in Ruby

I want to search for and count the frequency of the words "candy" and "gram", but also the combinations "candy gram" and "gram candy", in a given text (whole_file).
I am currently using the following code to display the occurrences of "candy" and "gram", but when I add the combinations to the %w list, only the frequencies of "candy" and "gram" display. Should I try a different way? Thanks so much.
myArray = whole_file.split
stop_words= %w{ candy gram 'candy gram' 'gram candy' }
nonstop_words = myArray - stop_words
key_words = myArray - nonstop_words
frequency = Hash.new (0)
key_words.each { |word| frequency[word] +=1 }
key_words = frequency.sort_by {|x,y| x }
key_words.each { |word, frequency| puts word + ' ' + frequency.to_s }
It sounds like you're after n-grams. You could break the text into combinations of consecutive words in the first place, and then count the occurrences in the resulting array of word groupings. Here's an example:
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
[["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]].each do |term|
terms = whole_file.split(/\s+/).each_cons(term.length).to_a
puts "#{term.join(" ")} #{terms.count(term)}"
end
EDIT: As was pointed out in the comments below, I wasn't paying close enough attention and was splitting the file on each loop, which is obviously not a good idea, especially if it's large. I also hadn't accounted for the fact that the original question may have needed to sort by the count, although that wasn't explicitly asked.
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
# This is simplistic. You would need to address punctuation and other characters before
# or at this step.
split_file = whole_file.split(/\s+/)
terms_to_count = [["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]]
counts = []
terms_to_count.each do |term|
terms = split_file.each_cons(term.length).to_a
counts << [term.join(" "), terms.count(term)]
end
# Seemed like you may need to do sorting too, so here that is:
sorted = counts.sort { |a, b| b[1] <=> a[1] }
sorted.each do |count|
puts "#{count[0]} #{count[1]}"
end
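The same counting can be folded into one method that returns a hash. This is a sketch under stated assumptions: the method name ngram_counts is hypothetical, terms are given as space-separated strings, and punctuation is handled crudely by keeping only letters and apostrophes.

```ruby
# Hypothetical helper: count how often each term (one or more words)
# occurs as a run of consecutive words in `text`.
def ngram_counts(text, terms)
  # Crude tokenizer: lower-case runs of letters and apostrophes.
  words = text.downcase.scan(/[a-z']+/)
  terms.map { |term|
    target = term.split
    # each_cons yields every window of target.length consecutive words.
    [term, words.each_cons(target.length).count(target)]
  }.to_h
end

whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
ngram_counts(whole_file, ["candy", "gram", "candy gram", "gram candy"])
# => {"candy"=>3, "gram"=>3, "candy gram"=>1, "gram candy"=>1}
```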
Strip punctuation and convert to lower-case
The first thing you probably want to do is remove all punctuation from the string holding the contents of the file and then convert what's left to lower case, the latter so you don't have to worry about 'Cat' and 'cat' being counted as different words. Those two operations can be done in either order.
Changing upper-case letters to lower-case is easy:
text = whole_file.downcase
To remove the punctuation it is probably easier to decide what to keep rather than what to discard. If we only want to keep lower-case letters and whitespace (the whitespace is needed to keep the words apart), you can do this:
text = whole_file.downcase.gsub(/[^a-z\s]/, '')
That is, substitute an empty string for all characters other than (^) lowercase letters and whitespace.1
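A quick illustration of this clean-up step, assuming whitespace should be preserved so that the text can still be split into words afterwards. The sample string is invented for the example.

```ruby
whole_file = "Candy gram? Yes, a CANDY\ngram!"

# Lower-case everything, then drop every character that is neither
# a lower-case letter nor whitespace.
text = whole_file.downcase.gsub(/[^a-z\s]/, '')
# => "candy gram yes a candy\ngram"

words = text.split
# => ["candy", "gram", "yes", "a", "candy", "gram"]
```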
Determine frequency of individual words
If you want to count the number of times text contains the word 'candy', you can use the method String#scan on the string text and then determine the size of the array that is returned:
text.scan(/\bcandy\b/).size
scan returns an array with every occurrence of the string 'candy'; .size returns the size of that array. Here \b ensures 'candy' has a word "boundary" at each end, which could be whitespace or the beginning or end of a line or the file. That's to prevent 'candycane' from being counted.
A second way is to convert the string text to an array of words, as you have done2:
myArray = text.split
If you don't mind, I'd like to call this:
words = text.split
as I find that more expressive.3
The most direct way to determine the number of times 'candy' appears is to use the method Enumerable#count, like this:
words.count('candy')
You can also use the array difference method, Array#-, as you noted:
words.size - (words - ['candy']).size
If you wish to know the number of times either 'candy' or 'gram' appears, you could of course do the above for each and sum the two counts. Some other ways are:
words.size - (words - ['candy', 'gram']).size
words.count { |word| word == 'candy' || word == 'gram' }
words.count { |word| ['candy', 'gram'].include?(word) }
Determine the frequency of all words that appear in the text
Your use of a hash with a default value of zero was a good choice:
def frequency_of_all_words(words)
frequency = Hash.new(0)
words.each { |word| frequency[word] +=1 }
frequency
end
I wrote this as a method to emphasize that words.each... does not return frequency. Often you would see this written more compactly using the method Enumerable#each_with_object, which returns the hash ("object"):
def frequency_of_all_words(words)
words.each_with_object(Hash.new(0)) { |word, h| h[word] +=1 }
end
Once you have the hash frequency you can sort it as you did:
frequency.sort_by {|word, freq| freq }
or
frequency.sort_by(&:last)
which you could write:
frequency.sort_by {|_, freq| freq }
since you aren't using the first block variable. If you wanted the most frequent words first:
frequency.sort_by(&:last).reverse
or
frequency.sort_by {|_, freq| -freq }
All of these will give you an array. If you want to convert it back to a hash (with the largest values first, say):
Hash[frequency.sort_by(&:last).reverse]
or in Ruby 2.0+,
frequency.sort_by(&:last).reverse.to_h
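On Ruby 2.7 or later, Enumerable#tally builds the same frequency hash in one call. A short sketch, with a made-up word list:

```ruby
words = %w[candy gram candy cane gram candy]

# tally counts occurrences of each element (Ruby 2.7+).
frequency = words.tally
# => {"candy"=>3, "gram"=>2, "cane"=>1}

# Most frequent words first, converted back to a hash.
by_frequency = frequency.sort_by { |_, freq| -freq }.to_h
# => {"candy"=>3, "gram"=>2, "cane"=>1}
```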
Count the number of times a substring appears
Now let's count the number of times the string 'candy gram' appears. You might think we could use String#scan on the string holding the entire file, as we did earlier4:
text.scan(/\bcandy gram\b/).size
The first problem is that this won't catch 'candy\ngram'; i.e., when the words are separated by a newline character. We could fix that by changing the regex to /\bcandy\sgram\b/. A second problem is that 'candy gram' might have been 'candy. Gram' in the file, in which case you might not want to count it.
A better way is to use the method Enumerable#each_cons on the array words. The easiest way to show you how that works is by example:
words = %w{ check for candy gram here candy gram again }
#=> ["check", "for", "candy", "gram", "here", "candy", "gram", "again"]
enum = words.each_cons(2)
#=> #<Enumerator: ["check", "for", "candy", "gram", "here", "candy",
# "gram", "again"]:each_cons(2)>
enum.to_a
#=> [["check", "for"], ["for", "candy"], ["candy", "gram"],
# ["gram", "here"], ["here", "candy"], ["candy", "gram"],
# ["gram", "again"]]
each_cons(2) returns an enumerator; I've converted it to an array to display its contents.
So we can write
words.each_cons(2).map { |word_pair| word_pair.join(' ') }
#=> ["check for", "for candy", "candy gram", "gram here",
# "here candy", "candy gram", "gram again"]
and lastly:
words.each_cons(2).map { |word_pair|
word_pair.join(' ') }.count { |s| s == 'candy gram' }
#=> 2
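Since count also accepts an argument to compare against, the join step can be skipped by comparing each pair of words with an array directly. A small sketch using the same word list:

```ruby
words = %w{ check for candy gram here candy gram again }

# each_cons(2) yields two-element arrays, so count can compare them
# against the target pair with ==.
pair_count = words.each_cons(2).count(%w{ candy gram })
# => 2
```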
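1 If you also wanted to keep dashes, for hyphenated words, add a dash at the start or end of the character class, e.g. /[^-a-z\s]/ or /[^a-z\s-]/ (the \s keeps whitespace, so the words stay separated).
2 Note from String#split that .split is the same as .split(' '); .split(/\s+/) differs only in that it returns a leading empty string when the text starts with whitespace.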
3 Also, Ruby's naming convention is to use lower-case letters and underscores ("snake-case") for variables and methods, such as my_array.

Rails: Remove substring from the string if in array

I know I can easily remove a substring from a string.
Now I need to remove every substring from a string, if the substring is in an array.
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
# some awesome function
string = string.replace_if_in?(arr, '')
# desired output => "Only delete the and the"
All of the functions that remove or adjust part of a string, such as sub, gsub, tr, ... only take one pattern as an argument, not an array. But my array has over 20 elements, so I need a better way than using sub 20 times.
Sadly, it's not only about removing single words but about removing whole substrings such as 1. foo
How would I attempt this?
You can use gsub which accepts a regex, and combine it with Regexp.union:
string.gsub(Regexp.union(arr), '')
# => "Only delete the  and the "
Like this (note that slice! removes only the first occurrence of each substring):
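Removing the substrings leaves the surrounding spaces behind. If the goal is the exact output from the question, one possible clean-up (an assumption about what is wanted, not part of the answer above) is to collapse runs of spaces and trim the ends:

```ruby
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"

# Regexp.union escapes each string, so "." is matched literally.
cleaned = string.gsub(Regexp.union(arr), '').squeeze(' ').strip
# => "Only delete the and the"
```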
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
arr.each {|x| string.slice!(x) }
string # => "Only delete the  and the "
One more thing: since the substrings are matched literally, this also lets you remove text containing regexp special characters such as \ or . (Uri's answer allows this too):
string = "Only delete the 1. foo and the 2. bar and \\...."
arr = ["1. foo", "2. bar", "\\..."]
arr.each {|x| string.slice!(x) }
string # => "Only delete the  and the  and ."
Use #gsub with #join on the array elements
You can use #gsub by calling #join on the elements of the array, joining them with the regex alternation operator. For example:
arr = ["foo", "bar"]
string = "Only delete the foo and the bar"
string.gsub /#{arr.join ?|}/, ''
#=> "Only delete the  and the "
You can then deal with the extra spaces left behind in any way you see fit. This is a better method when you want to censor words. For example:
string.gsub /#{arr.join ?|}/, '<bleep>'
#=> "Only delete the <bleep> and the <bleep>"
On the other hand, split/reject/join might be a better method chain if you need to care about whitespace. There's always more than one way to do something, and your mileage may vary.
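One caveat with joining raw strings into a pattern: regexp metacharacters in the elements are not escaped, so "1. foo" would also match "1x foo". A sketch of the difference, using Regexp.escape on each element (the sample input is made up):

```ruby
arr = ["1. foo", "2. bar"]

unescaped = /#{arr.join('|')}/                               # "." matches any character
escaped   = /#{arr.map { |s| Regexp.escape(s) }.join('|')}/  # "." matches a literal dot

too_greedy = "keep 1x foo".gsub(unescaped, '')
# => "keep "
untouched = "keep 1x foo".gsub(escaped, '')
# => "keep 1x foo"
```

Regexp.union(arr) performs the same escaping in a single call.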

How would this complicated search-and-replace operation be done in Ruby?

I have a big text file. Within this text file, I want to replace all mentions of the word 'pizza' with 'spinach', 'Pizza' with 'Spinach', and 'pizzing' with 'spinning' -- unless those words occur anywhere within curly braces. So {pizza}, {giant.pizza} and {hot-pizza-oven} should remain unchanged.
My best proposed solution so far is to iterate over the file line-by-line, issuing a regex that detects everything before an { or after an }, and using regexes on each of those strings. But that gets really complex and unwieldy and I want to know if there's a proper solution for this problem.
This can be done in a few steps. I'd iterate through the file line by line, and pass each line to this method:
def spinachize line
# list of words to swap
swaps = {
'pizza' => 'spinach',
'Pizza' => 'Spinach',
'pizzing' => 'spinning'
}
# random placeholder for bracketed text
placeholder = 'fdjfafdlskdsfajkldfas'
# save all instances of bracketed text
bracketed_text = line.scan(/\{.*?\}/)
# remove bracketed text from line
line.gsub!(/\{.*?\}/, placeholder)
# replace all swaps
swaps.each do |original_text, new_text|
line.gsub!(original_text, new_text)
end
# re-insert bracketed text
line.gsub(placeholder){bracketed_text.shift}
end
The comments above explain things as we go. Here are a couple of examples:
spinachize "Pizza is good, but more pizza is better"
=> "Spinach is good, but more spinach is better"
spinachize "Leave bracketed instances of {pizza} or {this.pizza} alone"
=> "Leave bracketed instances of {pizza} or {this.pizza} alone"
As you can see, you can specify the items you want swapped, or modify the method to pull the list from a database or flat file somewhere. The placeholder just needs to be something unique that wouldn't come up in the source file naturally.
The process is this: remove bracketed text from the original line, and remember it for later. Swap all text that needs swapping, then add back the bracketed text. It's not a one-liner, but it works well and is readable and easy to update.
The last line of the method might need some clarification. Not many people know that the "gsub" method can take a block instead of a second parameter. That block then determines what gets put in place of the original text. In this case, every time the block is called I remove the first item off our saved bracket list, and use that.
rules = {'pizza' => 'spinach','Pizza' => 'Spinach','pizzing' => 'spinning'}
regexp = /\{[^{}]*\}|#{rules.keys.join('|')}/m
puts(file.read.gsub(regexp) { |s| rules[s] || s })
This constructs a regular expression that matches either bracketed strings or the strings to replace. We then run it through a block that replaces strings with the given value, and will leave bracketed strings unchanged. Because [^{}] is a negated character class, it already matches newlines inside the brackets, so the /m flag (which only changes what . matches) is not required and you can take it out. Either way, no need to iterate line by line.
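A self-contained variant of this single-pass idea, operating on a string instead of a file so it can be run directly. The sample text is invented, and Regexp.union is swapped in for join('|') so that keys containing metacharacters would be escaped.

```ruby
rules = { 'pizza' => 'spinach', 'Pizza' => 'Spinach', 'pizzing' => 'spinning' }

# Either a complete {...} span (no nested braces) or one of the words to swap.
pattern = /\{[^{}]*\}|#{Regexp.union(rules.keys).source}/

text = "Pizza {pizza}\nis not pizzing {hot-\npizza-oven}."

# Bracketed spans are not keys in `rules`, so fetch falls back to the match itself.
result = text.gsub(pattern) { |match| rules.fetch(match, match) }
# => "Spinach {pizza}\nis not spinning {hot-\npizza-oven}."
```

The bracketed span that contains a newline is left untouched even without the /m flag.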
str = "Pizza {pizza} with spinach is not pizzing."
swaps = {'{pizza}' =>'{pizza}',
'{Pizza}' =>'{Pizza}',
'{pizzing}'=> '{pizzing}',
'pizza' => 'spinach',
'Pizza' => 'Spinach',
'pizzing' => 'spinning'}
regex = Regexp.union(swaps.keys)
p str.gsub(regex, swaps) # => "Spinach {pizza} with spinach is not spinning."
I would call the following method for each line of the file.
Code
def doit(line)
replace = {'pizza'=>'spinach', 'Pizza'=>'Spinach', 'pizzing'=>'spinning'}
r = /\{.*?\}/
arr= line.split(r).map { |str|
str.gsub(/\b(?:pizza|Pizza|pizzing)\b/, replace) }
line.scan(r).each_with_object(arr.shift) { |str,res|
res << str << arr.shift }
end
Examples
doit("Pizza Primastrada's {pizza} is the best {pizzing} pizza in town.")
#=> "Spinach Primastrada's {pizza} is the best {pizzing} spinach in town."
doit("{Pizza Primastrada}'s pizza is the best pizzing {pizza} in town.")
#=> "{Pizza Primastrada}'s spinach is the best spinning {pizza} in town."
Explanation
line = "Pizza Primastrada's {pizza} is the best {pizzing} pizza in town."
replace = {'pizza'=>'spinach', 'Pizza'=>'Spinach', 'pizzing'=>'spinning'}
r = /\{.*?\}/
a = line.split(r)
#=> ["Pizza Primastrada's ", " is the best ", " pizza in town."]
b = a.map { |str| str.gsub(/\b(?:pizza|Pizza|pizzing)\b/, replace) }
#=> ["Spinach Primastrada's ", " is the best ", " spinach in town."]
keepers = line.scan(r)
#=> ["{pizza}", "{pizzing}"]
keepers.each_with_object(b.shift) { |str,res| res << str << b.shift }
#=> "Spinach Primastrada's {pizza} is the best {pizzing} spinach in town."
Nested braces
If you wish to permit nested braces, change the regex to:
r = /\{[^{}]*?(?:\{.*?\})*?[^{}]*?\}/
doit("Pizza Primastrada's {{great {great} pizza} is the best pizza.")
#=> "Spinach Primastrada's {{great {great} pizza} is the best spinach."
You referred to the string
{words,salad,#{1,2,3} pizza|}
in a comment. If that is part of a string enclosed in single quotes, not a problem. If enclosed in double quotes, however, # will raise a syntax error. Again, no problem, if the pound character is escaped (\#).
