Split a string delimited by a list of substrings - ruby

I have data like:
str = "CODEA text for first item CODEB text for next item CODEB2 some"\
"more text CODEC yet more text"
and a list:
arr = ["CODEA", "CODEB", "CODEB2", "CODEC", ... ]
I want to divide this string into a hash. The keys of the hash will be CODEA, CODEB, etc. The values of the hash will be the text that follows, until the next CODE. The output should look like this:
"CODEA" => "text for first item",
"CODEB" => "text for next item",
"CODEB2" => "some more text",
"CODEC" => "yet more text"

We are given a string and an array.
str = "CODEA text for first item CODEB text for next item " +
"CODEB2 some more text CODEC yet more text"
arr= %w|CODEC CODEB2 CODEA CODEB|
#=> ["CODEC", "CODEB2", "CODEA", "CODEB"]
This is one way to obtain the desired hash.
str.split.
slice_before { |word| arr.include?(word) }.
map { |word, *rest| [word, rest.join(' ')] }.
to_h
#=> {"CODEA" =>"text for first item",
# "CODEB" =>"text for next item",
# "CODEB2"=>"some more text",
# "CODEC" =>"yet more text"}
See Enumerable#slice_before.
The steps are as follows.
a = str.split
#=> ["CODEA", "text", "for", "first", "item", "CODEB",
# "text", "for", "next", "item", "CODEB2", "some",
# "more", "text", "CODEC", "yet", "more", "text"]
b = a.slice_before { |word| arr.include?(word) }
#=> #<Enumerator:
# #<Enumerator::Generator:0x00005cbdec2b5eb0>:each>
We can see the four elements (arrays) that this enumerator will generate and pass to map by converting it to an array.
b.to_a
#=> [["CODEA", "text", "for", "first", "item"],
# ["CODEB", "text", "for", "next", "item"],
# ["CODEB2", "some", "more", "text"],
# ["CODEC", "yet", "more", "text"]]
Continuing,
c = b.map { |word, *rest| [word, rest.join(' ')] }
#=> [["CODEA", "text for first item"],
# ["CODEB", "text for next item"],
# ["CODEB2", "some more text"],
# ["CODEC", "yet more text"]]
c.to_h
#=> {"CODEA"=>"text for first item",
# "CODEB"=>"text for next item",
# "CODEB2"=>"some more text",
# "CODEC"=>"yet more text"}
The following is perhaps a better way of doing this.
str.split.
slice_before { |word| arr.include?(word) }.
each_with_object({}) { |(word, *rest),h|
h[word] = rest.join(' ') }
When I was a kid this might be done as follows.
last_word = ''
str.split.each_with_object({}) do |word,h|
if arr.include?(word)
h[word]=''
last_word = word
else
h[last_word] << ' ' unless h[last_word].empty?
h[last_word] << word
end
end
last_word must be initialized (to any value) outside the block so that it keeps its value between iterations.

Code:
str = 'CODEA text for first item CODEB text for next item ' +
'CODEB2 some more text CODEC yet more text'
puts Hash[str.scan(/(CODE\S*) (.*?(?= CODE|$))/)]
Result:
{"CODEA"=>"text for first item", "CODEB"=>"text for next item", "CODEB2"=>"some more text", "CODEC"=>"yet more text"}

Another option.
str.split.reverse
.slice_when { |word| word.start_with? 'CODE' }
.map{ |(*v, k)| [k, v.reverse.join(' ')] }.to_h
Enumerable#slice_when, in this case, returns an enumerator that generates these arrays:
[["text", "more", "yet", "CODEC"], ["text", "more", "some", "CODEB2"], ["item", "next", "for", "text", "CODEB"], ["item", "first", "for", "text", "CODEA"]]
Then the array is mapped to build the required hash (I did not reverse the order of the hash):
#=> {"CODEC"=>"yet more text", "CODEB2"=>"some more text", "CODEB"=>"text for next item", "CODEA"=>"text for first item"}

Adding parentheses to the pattern in String#split lets you get both the separators and the fields.
str.split(/(#{Regexp.union(*arr)})/).drop(1).each_slice(2).to_h
# =>
# {
# "CODEA"=>" text for first item ",
# "CODEB"=>"2 somemore text ",
# "CODEC"=>" yet more text"
# }
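The output above loses CODEB2 because the alternation tries CODEB first and matches it inside CODEB2. A minimal sketch of one way around that, assuming the codes always appear as whole words: list longer codes first and anchor the pattern with \b. The names pattern and result are only for illustration, str is the version with a space after "some", and to_h with a block needs Ruby 2.6 or later.

```ruby
str = "CODEA text for first item CODEB text for next item " \
      "CODEB2 some more text CODEC yet more text"
arr = %w[CODEA CODEB CODEB2 CODEC]

# Longer codes first, plus word boundaries, so "CODEB" cannot match inside "CODEB2".
pattern = /\b(#{Regexp.union(arr.sort_by { |code| -code.size })})\b/

# split keeps the captured separators; drop(1) discards the empty leading field.
result = str.split(pattern).drop(1).each_slice(2).to_h do |code, text|
  [code, text.strip]
end
```

Stripping each value also removes the leading and trailing spaces seen in the output above.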

Related

wrong number of arguments and hash issues

I am trying to make a method that counts the number of times it uses a word from a dictionary and is returned as a hash. Here's my code now:
def substrings(words, dictionary)
hash = {}
substrings.downcase!
dictionary.each do |substring|
words.each do |word|
if word.include? substring +=1
end
end
end
hash.to_s
end
dictionary = ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i", "low", "own", "part", "partner", "sit"]
words = "below"
substrings(words, dictionary)
And I get this error:
wrong number of arguments (given 0, expected 2)
I'm looking for something like this:
=> {"below"=>1, "low"=>1}
I have tried multiple things but it never gives me that hash. I either get an undefined method error or this:
=> ["below", ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i", "low", "own", "part", "partner", "sit"]]
Your error is caused by the line "substrings.downcase!". This is a recursive call to your substrings method, which takes two arguments, and you are providing none. Even if you supplied the arguments, you would still get an error: a stack overflow caused by the infinite recursion.
This will produce the desired result, but note that I've renamed the words parameter to word:
def substrings(word, dictionary)
word = word.downcase
dictionary.select { |entry| word.include?(entry.downcase) }
.group_by(&:itself)
.map { |k, v| [k, v.size] }.to_h
end
This results in:
>> dictionary = ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i", "low", "own", "part", "partner", "sit"]
>> word = 'below'
>> substrings(word, dictionary)
=> {"below"=>1, "low"=>1}
It also counts multiple copies of words, which, although not explicitly stated, is presumably what you are after:
>> dictionary = ["below", "be", "below", "below", "low", "be", "pizza"]
>> word = 'below'
>> substrings(word, dictionary)
=> {"below"=>3, "be"=>2, "low"=>1}
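On Ruby 2.7 or later, the group_by / map / to_h chain can probably be shortened with Enumerable#tally. This is a sketch of the same method under that assumption, not a change to the answer's logic:

```ruby
def substrings(word, dictionary)
  word = word.downcase
  # tally counts how many times each selected entry occurs
  dictionary.select { |entry| word.include?(entry.downcase) }.tally
end

dictionary = ["below", "be", "below", "below", "low", "be", "pizza"]
counts = substrings("below", dictionary)
```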
You can use #reduce:
def substrings(sentence, dictionary)
sentence = sentence.downcase
dictionary.reduce(Hash.new(0)) do |counts,word|
counts[word] +=1 if sentence.include?(word.downcase)
counts
end
end
dictionary = ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i", "low", "own", "part", "partner", "sit"]
sentence = "below"
substrings(sentence, dictionary) #=> {"below"=>1, "low"=>1}
Or #each:
def substrings(sentence, dictionary)
sentence = sentence.downcase
counts = Hash.new(0) # Makes the default value `0` instead of `nil`
dictionary.each do |word|
if sentence.include?(word.downcase)
counts[word] += 1
end
end
counts
end
dictionary = ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i", "low", "own", "part", "partner", "sit"]
sentence = "below"
substrings(sentence, dictionary) #=> {"below"=>1, "low"=>1}

Searching through two multidimensional arrays and grouping together similar subarrays

I am trying to search through two multidimensional arrays to find any elements in common in a given subarray and then put the results in a third array where the entire subarrays with similar elements are grouped together (not just the similar elements).
The data is imported from two CSVs:
require 'csv'
array = CSV.read('primary_csv.csv')
#=> [["account_num", "account_name", "primary_phone", "second_phone", "status"],
#=> ["11111", "John Smith", "8675309", " ", "active"],
#=> ["11112", "Tina F.", "5551234", "5555678" , "disconnected"],
#=> ["11113", "Troy P.", "9874321", " ", "active"]]
# and so on...
second_array = CSV.read('customer_service.csv')
#=> [["date", "name", "agent", "call_length", "phone", "second_phone", "complaint"],
#=> ["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", " ", "rude"],
#=> ["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309" , "says shes not a customer"]]
# and so on...
If any number is present as an element in a subarray of both primary_csv.csv and customer_service.csv, I want that entire subarray (as opposed to just the common elements) put into a third array, results_array. The desired output based upon the above sample is:
results_array = [["11111", "John Smith", "8675309", " ", "active"],
["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309" , "says shes not a customer"]] # and so on...
I then want to export the array into a new CSV, where each subarray is its own row of the CSV. I intend to iterate over each subarray by joining it with a , to make it comma delimited and then put the results into a new CSV:
results_array.map! { |j| j.join(",") }
File.open("results.csv", "w") {|f| f.puts results_array}
#=> 11111,John Smith,8675309, ,active
#=> 3/2/15,Mrs. Smith,Stew,1:45,9995678,8675309,says shes not a customer
# and so on...
How can I achieve the desired output? I am aware that the final product will look messy because similar data (for example, phone number) will be in different columns. But I need to find a way to generally group the data together.
Suppose a1 and a2 are the two arrays (excluding header rows).
Code
def combine(a1, a2)
h2 = a2.each_with_index
.with_object(Hash.new { |h,k| h[k] = [] }) { |(arr,i),h|
arr.each { |e| es = e.strip; h[es] << i if number?(es) } }
a1.each_with_object([]) do |arr, b|
d = arr.each_with_object([]) do |str, d|
s = str.strip
d.concat(a2.values_at(*h2[s])) if number?(s) && h2.key?(s)
end
b << d.uniq.unshift(arr) if d.any?
end
end
def number?(str)
str =~ /^\d+$/
end
Example
Here is your example, modified somewhat:
a1 = [
["11111", "John Smith", "8675309", "", "active" ],
["11112", "Tina F.", "5551234", "5555678", "disconnected"],
["11113", "Troy P.", "9874321", "", "active" ]
]
a2 = [
["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", "", "rude"],
["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309", "surly"],
["3/7/15", "Cher", "Sonny", "7:45", "9874321", "8675309", "Hey Jude"]
]
combine(a1, a2)
#=> [[["11111", "John Smith", "8675309", "",
# "active"],
# ["3/2/15", "Mrs. Smith", "Stew", "1:45",
# "9995678", "8675309", "surly"],
# ["3/7/15", "Cher", "Sonny", "7:45",
# "9874321", "8675309", "Hey Jude"]
# ],
# [["11112", "Tina F.", "5551234", "5555678",
# "disconnected"],
# ["3/1/15", "Mary ?", "Bob X", "5:00",
# "5551234", "", "rude"]
# ],
# [["11113", "Troy P.", "9874321", "",
# "active"],
# ["3/7/15", "Cher", "Sonny", "7:45",
# "9874321", "8675309", "Hey Jude"]
# ]
# ]
Explanation
First, we define a helper:
def number?(str)
str =~ /^\d+$/
end
For example:
number?("8675309") #=> 0 (truthy)
number?("3/1/15") #=> nil
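One caveat: in Ruby, ^ and $ match at line boundaries, so a multi-line string that contains a digit-only line also passes number?. If that matters for your data, a stricter variant might look like the sketch below. The name strict_number? is hypothetical, and String#match? assumes Ruby 2.4 or later.

```ruby
# \A and \z anchor to the start and end of the whole string, not of a line.
def strict_number?(str)
  str.match?(/\A\d+\z/)
end

strict_number?("8675309")   #=> true
strict_number?("3/1/15")    #=> false
strict_number?("abc\n123")  #=> false, whereas "abc\n123" =~ /^\d+$/ is truthy
```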
Now index a2 on the values that represent numbers:
h2 = a2.each_with_index
.with_object(Hash.new { |h,k| h[k] = [] }) { |(arr,i),h|
arr.each { |e| es = e.strip; h[es] << i if number?(es) } }
#=> {"5551234"=>[0], "9995678"=>[1], "8675309"=>[1, 2], "9874321"=>[2]}
This says, for example, that the "numeric" field "8675309" is contained in elements at offsets 1 and 2 of a2 (i.e., for Mrs. Smith and Cher).
We can now simply run through the elements of a1 looking for matches.
The code:
arr.each_with_object([]) do |str, d|
s = str.strip
d.concat(a2.values_at(*h2[s])) if number?(s) && h2.key?(s)
end
steps through the elements of arr, assigning each to the block variable str. For example, if arr holds the first element of a1, str will in turn equal "11111", "John Smith", and so on. After s = str.strip, this says that if s has a numerical representation and there is a matching key in h2, the (initially empty) array d is concatenated with the elements of a2 given by the value of h2[s].
After completing this loop we see if d contains any elements of a2:
b << d.uniq.unshift(arr) if d.any?
If it does, we remove duplicates, prepend the array with arr and save it to b.
Note that this allows one element of a2 to match multiple elements of a1.
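The question also asks about writing the result to a CSV. A minimal sketch, assuming the nested shape returned by combine (groups below is an abbreviated, hand-written sample of it): flatten one level so each subarray becomes a row, and let the csv library handle quoting rather than joining with "," by hand.

```ruby
require 'csv'

# Hypothetical sample in the shape combine(a1, a2) returns: an array of groups,
# each group being an array of rows.
groups = [
  [["11111", "John Smith", "8675309", "", "active"],
   ["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309", "surly"]],
  [["11112", "Tina F.", "5551234", "5555678", "disconnected"],
   ["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", "", "rude"]]
]

# flatten(1) removes only the grouping level; each row stays an array of fields.
rows = groups.flatten(1)

csv_text = CSV.generate do |csv|
  rows.each { |row| csv << row }
end
```

To write a file instead of building a string, the same loop should work inside CSV.open("results.csv", "w") { |csv| ... }.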

Convert an Array of Strings to a Hash in Ruby

I have an Array that contains strings:
["First Name", "Last Name", "Location", "Description"]
I need to convert the Array to a Hash, as in the following:
{"A" => "First Name", "B" => "Last Name", "C" => "Location", "D" => "Description"}
Also, this way too:
{"First Name" => "A", "Last Name" => "B", "Location" => "C", "Description" => "D"}
Any thoughts how to handle this the best way?
You could implement as follows
def string_array_to_hash(a=[],keys=false)
headers = ("A".."Z").to_a
Hash[keys ? a.zip(headers.take(a.count)) : headers.take(a.count).zip(a)]
end
Then to get your initial output it would be
a = ["First Name", "Last Name", "Location", "Description"]
string_array_to_hash a
#=> {"A"=>"First Name", "B"=>"Last Name", "C"=>"Location", "D"=>"Description"}
And second output is
a = ["First Name", "Last Name", "Location", "Description"]
string_array_to_hash a, true
#=> {"First Name"=>"A", "Last Name"=>"B", "Location"=>"C", "Description"=>"D"}
Note: this works only while a has at most 26 elements; beyond that you will have to specify a different desired output. This is because (a) the alphabet has only 26 letters and (b) a Hash can only have unique keys.
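If spreadsheet-style labels (A..Z, then AA, AB, ...) are acceptable past 26 elements, one possible workaround is to draw the keys from a longer string range. This is only a sketch; label_columns is a hypothetical name, and the "ZZ" upper bound is an arbitrary choice that caps the input at 702 elements.

```ruby
def label_columns(names)
  # ("A".."ZZ") steps through "A".."Z", then "AA", "AB", ... via String#succ
  labels = ("A".."ZZ").first(names.size)
  labels.zip(names).to_h
end

labelled = label_columns(["First Name", "Last Name", "Location", "Description"])
# labelled.invert gives the name => letter form
```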
You could do this:
arr = ["First Name", "Last Name", "Location", "Description"]
letter = Enumerator.new do |y|
l = ('A'.ord-1).chr
loop do
y.yield l=l.next
end
end
#=> #<Enumerator: #<Enumerator::Generator:0x007f9a00878fd8>:each>
h = arr.each_with_object({}) { |s,h| h[letter.next] = s }
#=> {"A"=>"First Name", "B"=>"Last Name", "C"=>"Location", "D"=>"Description"}
h.invert
#=> {"First Name"=>"A", "Last Name"=>"B", "Location"=>"C", "Description"=>"D"}
or
letter = ('A'.ord-1).chr
#=> "@"
h = arr.each_with_object({}) { |s,h| h[letter = letter.next] = s }
#=> {"A"=>"First Name", "B"=>"Last Name", "C"=>"Location", "D"=>"Description"}
When using the enumerator letter, we have
27.times { puts letter.next }
#=> "A"
# "B"
# ...
# "Z"
# "AA"
If you don't need specific key names, you could try this:
list = ["First Name", "Last Name", "Location", "Description"]
Hash[list.map.with_index{|*x|x}].invert
Output
{0=>"First Name", 1=>"Last Name", 2=>"Location", 3=>"Description"}
Similar solutions are here.
Or you can also try this :)
letter = 'A'
arr = ["First Name", "Last Name", "Location", "Description"]
hash = {}
arr.each { |i|
hash[i] = letter
letter = letter.next
}
#=> {"First Name"=>"A", "Last Name"=>"B", "Location"=>"C", "Description"=>"D"}
or
letter = 'A'
arr = ["First Name", "Last Name", "Location", "Description"]
hash = {}
arr.each { |i|
hash[letter] = i
letter = letter.next
}
#=> {"A"=>"First Name", "B"=>"Last Name", "C"=>"Location", "D"=>"Description"}

categorize by hash value

I have an array of hashes with values like:
by_person = [{ :person => "Jane Smith", :filenames => ["Report.pdf", "File2.pdf"]}, {:person => "John Doe", :filenames => ["Report.pdf"] }]
I would like to end up with another array of hashes (by_file) that has each unique value from the filenames key as a key in the by_file array:
by_file = [{ :filename => "Report.pdf", :people => ["Jane Smith", "John Doe"] }, { :filename => "File2.pdf", :people => ["Jane Smith"] }]
I have tried:
by_file = []
by_person.each do |person|
person[:filenames].each do |file|
unless by_file.include?(file)
# list people that are included in file
by_person_each_file = by_person.select{|person| person[:filenames].include?(file)}
by_person_each_file.each do |person|
by_file << {
:file => file,
:people => person[:person]
}
end
end
end
end
as well as:
by_file.map(&:to_a).reduce({}) {|h,(k,v)| (h[k] ||= []) << v; h}
Any feedback is appreciated, thanks!
Doesn't seem too tricky, but the way you're compiling it isn't very efficient:
by_person = [{ :person => "Jane Smith", :filenames => ["Report.pdf", "File2.pdf"]}, {:person => "John Doe", :filenames => ["Report.pdf"] }]
by_file = by_person.each_with_object({ }) do |entry, index|
entry[:filenames].each do |filename|
set = index[filename] ||= [ ]
set << entry[:person]
end
end.collect do |filename, people|
{
filename: filename,
people: people
}
end
puts by_file.inspect
# => [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]}, {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
This makes use of a hash to group the people by filename, essentially inverting your structure, and then converts that into the final format in a second pass. This is more efficient than working with the final format during compilation as that's not indexed and requires an expensive linear search to find the correct container to insert into.
An alternate method is to create a default hash constructor that makes the structure you're looking for:
by_file_hash = Hash.new do |h,k|
h[k] = {
filename: k,
people: [ ]
}
end
by_person.each do |entry|
entry[:filenames].each do |filename|
by_file_hash[filename][:people] << entry[:person]
end
end
by_file = by_file_hash.values
puts by_file.inspect
# => [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]}, {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
This may or may not be easier to understand.
This is one way to do it.
Code
def convert(by_person)
by_person.each_with_object({}) do |hf,hp|
hf[:filenames].each do |fname|
hp.update({ fname=>[hf[:person]] }) { |_,oh,nh| oh+nh }
end
end.map { |fname,people| { :filename => fname, :people=>people } }
end
Example
by_person = [{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]},
{:person=>"John Doe", :filenames=>["Report.pdf"]}]
convert(by_person)
#=> [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]},
# {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
Explanation
For by_person in the example:
enum1 = by_person.each_with_object({})
#=> #<Enumerator: [{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]},
#                  {:person=>"John Doe", :filenames=>["Report.pdf"]}]:each_with_object({})>
Let's see what values the enumerator enum1 will pass into the block:
enum1.to_a
#=> [[{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]}, {}],
# [{:person=>"John Doe", :filenames=>["Report.pdf"]}, {}]]
As will be shown below, the empty hash in the first element of the enumerator will no longer be empty when the second element is passed into the block.
The first element is assigned to the block variables as follows (I've indented to indicate the block level):
hf = {:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]}
hp = {}
enum2 = hf[:filenames].each
#=> #<Enumerator: ["Report.pdf", "File2.pdf"]:each>
enum2.to_a
#=> ["Report.pdf", "File2.pdf"]
"Report.pdf" is passed to the inner block, assigned to the block variable:
fname = "Report.pdf"
and
hp.update({ "Report.pdf"=>["Jane Smith"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith"]}
is executed, returning the updated value of hp.
Here the block for Hash#update (aka Hash#merge!) is not consulted. It is only needed when the hash hp and the merging hash (here { fname=>["Jane Smith"] }) have one or more common keys. For each common key, the key and the corresponding values from the two hashes are passed to the block. This is elaborated below.
Next, enum2 passes "File2.pdf" into the block and assigns it to the block variable:
fname = "File2.pdf"
and executes
hp.update({ "File2.pdf"=>["Jane Smith"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith"], "File2.pdf"=>["Jane Smith"]}
which returns the updated value of hp. Again, update's block was not consulted. We're now finished with Jane, so enum1 next passes its second and last value into the block and assigns the block variables as follows:
hf = {:person=>"John Doe", :filenames=>["Report.pdf"]}
hp = {"Report.pdf"=>["Jane Smith"], "File2.pdf"=>["Jane Smith"]}
Note that hp has now been updated. We then have:
enum2 = hf[:filenames].each
#=> #<Enumerator: ["Report.pdf"]:each>
enum2.to_a
#=> ["Report.pdf"]
enum2 assigns
fname = "Report.pdf"
and executes:
hp.update({ "Report.pdf"=>["John Doe"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith", "John Doe"], "File2.pdf"=>["Jane Smith"]}
In making this update, hp and the hash being merged both have the key "Report.pdf". The following values are therefore passed to the block variables |k,oh,nh|:
k = "Report.pdf"
oh = ["Jane Smith"]
nh = ["John Doe"]
We don't need the key, so I've replaced it with an underscore. The block returns
["Jane Smith"]+["John Doe"] #=> ["Jane Smith", "John Doe"]
which becomes the new value for the key "Report.pdf".
Before turning to the final step, I'd like to suggest that you consider stopping here. That is, rather than constructing an array of hashes, one for each file, just leave it as a hash with the files as keys and arrays of people as the values:
{ "Report.pdf"=>["Jane Smith", "John Doe"], "File2.pdf"=>["Jane Smith"] }
The final step is straightforward:
hp.map { |fname,people| { :filename => fname, :people=>people } }
#=> [{ :filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"] },
# { :filename=>"File2.pdf", :people=>["Jane Smith"] }]
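To illustrate the earlier suggestion of keeping the hash keyed by filename, here is a small sketch of building and querying it. It uses a default block on the hash instead of update, and the name people_by_file is made up for this example.

```ruby
by_person = [
  { person: "Jane Smith", filenames: ["Report.pdf", "File2.pdf"] },
  { person: "John Doe",   filenames: ["Report.pdf"] }
]

# The default block creates an empty array the first time a filename is seen.
people_by_file = by_person.each_with_object(Hash.new { |h, k| h[k] = [] }) do |entry, index|
  entry[:filenames].each { |fname| index[fname] << entry[:person] }
end

people_by_file["Report.pdf"]             #=> ["Jane Smith", "John Doe"]
# fetch avoids creating an entry for a filename that was never seen
people_by_file.fetch("Missing.pdf", [])  #=> []
```

With this shape, finding the people for a file is a single hash lookup, whereas the array of hashes needs a linear search.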

Cleanest ruby code to split a string with specific rules

Imagine an array like this:
[
"A definition 1: this is the definition text",
"A definition 2: this is some other definition text",
"B definition 3: this could be: the definition text"
]
I want to end up with the following hash
hash = {
:A => ["A definition 1", "this is the definition text", "A definition 2", "this is some other definition text"],
:B => ["B definition 3", "this could be: the definition text"]
}
I'm creating a glossary, with a hash of each letter of the alphabet with definition arrays.
I'm pretty new to Ruby so what I have looks really inelegant and I'm struggling on the split regex of the line on the colon so that the 3rd line only splits on the first occurrence.
Thanks!
Edit
Here's what I have so far
def self.build(lines)
alphabet = Hash.new()
lines.each do |line|
strings = line.split(/:/)
letter = strings[0][0,1].upcase
alphabet[letter] = Array.new if alphabet[letter].nil?
alphabet[letter] << strings[0]
alphabet[letter] << strings[1..(strings.size-1)].join.strip
end
alphabet
end
Provided raw_definitions is your input:
sorted_defs = Hash.new{|hash, key| hash[key] = Array.new;}
raw_definitions.each do |d|
d.match(/^([a-zA-Z])(.*?):(.*)$/)
sorted_defs[$1.upcase]<<$1+$2
sorted_defs[$1.upcase]<<$3.strip
end
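For the specific problem of splitting only on the first colon, the limit argument of String#split may be all that is needed. The sketch below assumes every line contains at least one colon (otherwise text would be nil) and uses symbol keys as in the desired output; build_glossary is a hypothetical name.

```ruby
def build_glossary(lines)
  lines.each_with_object(Hash.new { |h, k| h[k] = [] }) do |line, glossary|
    # A limit of 2 means at most two fields, so later colons stay in the text.
    term, text = line.split(/\s*:\s*/, 2)
    glossary[term[0].upcase.to_sym].push(term, text)
  end
end

defs = [
  "A definition 1: this is the definition text",
  "A definition 2: this is some other definition text",
  "B definition 3: this could be: the definition text"
]
glossary = build_glossary(defs)
```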
Just for fun, here's a purely-functional alternative:
defs = [
"A definition 1: this is the definition text",
"A definition 2: this is some other definition text",
"B definition 3: this could be: the definition text"
]
hash = Hash[
defs.group_by{ |s| s[0].to_sym }.map do |sym,strs|
[ sym, strs.map{ |s| s[2..-1].split(/\s*:\s*/,2) }.flatten ]
end
]
require 'pp'
pp hash
#=> {:A=>
#=> ["definition 1",
#=> "this is the definition text",
#=> "definition 2",
#=> "this is some other definition text"],
#=> :B=>["definition 3", "this could be: the definition text"]}
And a not-purely-functional variation with the same results:
hash = defs.group_by{ |s| s[0].to_sym }.tap do |h|
h.each do |sym,strs|
h[sym] = strs.map{ |s| s[2..-1].split(/\s*:\s*/,2) }.flatten
end
end
Note that these solutions require Ruby 1.9 or later because of s[0].to_sym; to work in 1.8.7 you would have to change this to s[0,1].to_sym. To make the first solution work in 1.8.6 you would further have to replace Hash[ xxx ] with Hash[ *xxx.flatten ].

Resources