Creating hash with multiple words using split method - ruby

So I have a text file and some of the values are :
mom : dad
brother : sister
I created a hash from this file:
hash = Hash[*File.read('file.txt').delete(":").split(/[, \n]+/)]
It works fine with one-word values such as brother : sister, but if I add metallica : rock band it does not work. So I have two questions:
1- I didn't really understand the .split(/[, \n]+/) call there. Why are there so many symbols inside (such as + and /)? What do they do?
2 - How can I create metallica => rock band? Or green day => band? Is it possible to do this with the split method?

Let's create a file for illustration.
f = 't'
File.write(f, <<~END
mom : dad
brother : sister
no colon in this line
metallica : rock band
green day : band
END
)
#=> 90
Have a look.
puts File.read(f)
mom : dad
brother : sister
no colon in this line
metallica : rock band
green day : band
We can read the file line-by-line and build the hash as we go.
IO.foreach(f).with_object({}) do |line, h|
next unless line.count(':') == 1
key, value = line.split(':').map(&:strip)
h[key] = value
end
#=> {"mom"=>"dad", "brother"=>"sister",
# "metallica"=>"rock band", "green day"=>"band"}
Note that without a block, IO::foreach returns an enumerator, which I've chained to Enumerator#with_object.

The issue is that .split(/[, \n]+/) splits on commas, spaces and newlines, which means a multi-word phrase such as "green day" is broken into separate tokens and the key-value pairing falls apart.
An alternative approach is to split the file into lines first, and then split each line into key and value on the : character:
hash = File.readlines('file.txt').map {|line| line.split(':').map(&:strip)}.to_h
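To make the first question concrete, here is a small sketch on made-up sample text (the string below is an assumption for illustration, not the contents of file.txt). In /[, \n]+/ the slashes only delimit the regex literal, [, \n] is a character class matching a comma, a space or a newline, and + means "one or more of those in a row".

```ruby
# Sample contents, assumed for illustration (not read from file.txt).
text = "mom : dad\nmetallica : rock band\n"

# Original approach: every run of commas, spaces or newlines is a separator,
# so "rock band" ends up as two separate tokens.
tokens = text.delete(":").split(/[, \n]+/)
p tokens  #=> ["mom", "dad", "metallica", "rock", "band"]

# Splitting each line on the colon keeps multi-word values intact.
hash = text.each_line.map { |line| line.split(":", 2).map(&:strip) }.to_h
p hash    #=> {"mom"=>"dad", "metallica"=>"rock band"}
```

With five tokens, Hash[*tokens] would raise an ArgumentError because it needs an even number of arguments, which is one way the original one-liner "does not work".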

Related

Parse a string without spaces into an array of individual words

If I have a string "blueberrymuffinsareinsanelydelicious", what is the most efficient way to parse it such that I am left with ["blueberry", "muffins", "are", "insanely", "delicious"]?
I already have my wordlist (mac's /usr/share/dict/words), but how do I ensure that the full word is stored in my array, aka: blueberry, instead of two separate words, blue and berry.
Although there are cases where multiple interpretations are possible and picking the best one can be troublesome, you can always approach it with a fairly naïve algorithm like this:
WORDS = %w[
blueberry
blue
berry
fin
fins
muffin
muffins
are
insane
insanely
in
delicious
deli
us
].sort_by do |word|
[ -word.length, word ]
end
WORD_REGEXP = Regexp.union(*WORDS)
def best_fit(string)
string.scan(WORD_REGEXP)
end
This will parse your example:
best_fit("blueberrymuffinsareinsanelydelicious")
# => ["blueberry", "muffins", "are", "insanely", "delicious"]
Note that this skips any non-matching components.
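A short sketch of what "skips" means in practice, using a reduced word list. The names SAMPLE_WORDS, SAMPLE_REGEXP and fully_parsed? are made up for this example; the check simply compares the joined matches with the input.

```ruby
# Reduced word list, longest words first so they win over their prefixes.
SAMPLE_WORDS = %w[blueberry blue berry muffin muffins].sort_by { |w| [-w.length, w] }
SAMPLE_REGEXP = Regexp.union(*SAMPLE_WORDS)

def best_fit(string)
  string.scan(SAMPLE_REGEXP)
end

# Hypothetical helper: true only if the matches cover the whole input.
def fully_parsed?(string)
  best_fit(string).join == string
end

p best_fit("blueberryxyzmuffins")       #=> ["blueberry", "muffins"]
p fully_parsed?("blueberryxyzmuffins")  #=> false ("xyz" was silently dropped)
p fully_parsed?("blueberrymuffins")     #=> true
```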
Here's a recursive method which finds the correct sentence in 0.4s on my slowish laptop.
It first imports almost 100K English words and sorts them by decreasing size.
For every word, it checks if text starts with it
If it does, it removes the word from the text, keeps the word in an array and recursively calls itself.
If the text is empty, it means a sentence has been found.
It uses a lazy array to stop at the first found sentence.
text = "blueberrymuffinsareinsanelydeliciousbecausethey'rereallymoistandcolorful"
dictionary = File.readlines('/usr/share/dict/american-english')
.map(&:chomp)
.sort_by{ |w| -w.size }
def find_words(text, possible_words, sentence = [])
return sentence if text.empty?
possible_words.lazy.select{ |word|
text.start_with?(word)
}.map{ |word|
find_words(text[word.size..-1], possible_words, sentence + [word])
}.find(&:itself)
end
p find_words(text, dictionary)
#=> ["blueberry", "muffins", "are", "insanely", "delicious", "because", "they're", "really", "moist", "and", "colorful"]
p find_words('someword', %w(no way to find a combination))
#=> nil
p find_words('culdesac', %w(culd no way to find a combination cul de sac))
#=> ["cul", "de", "sac"]
p find_words("carrotate", dictionary)
#=> ["carrot", "ate"]
For a faster lookup, it could be a good idea to use a Trie.
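A minimal sketch of that idea, assuming a trie built from nested hashes. The names Trie, prefixes_of and find_words_with_trie are hypothetical and not part of the answer above. Instead of testing every dictionary word with start_with?, each step walks the trie once and collects only the words that really are prefixes of the remaining text.

```ruby
class Trie
  def initialize(words = [])
    @root = {}
    words.each { |word| insert(word) }
  end

  def insert(word)
    node = word.each_char.reduce(@root) { |n, ch| n[ch] ||= {} }
    node[:end] = true # marks the end of a complete word
  end

  # All stored words that are prefixes of text, longest first.
  def prefixes_of(text)
    found = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node[ch]
      break unless node
      found << text[0..i] if node[:end]
    end
    found.reverse
  end
end

# Same backtracking search as find_words, but driven by the trie.
def find_words_with_trie(text, trie, sentence = [])
  return sentence if text.empty?
  trie.prefixes_of(text).each do |word|
    result = find_words_with_trie(text[word.size..-1], trie, sentence + [word])
    return result if result
  end
  nil
end

trie = Trie.new(%w(culd no way to find a combination cul de sac))
p find_words_with_trie('culdesac', trie)  #=> ["cul", "de", "sac"]
p find_words_with_trie('someword', trie)  #=> nil
```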

Counting words in Ruby with some exceptions

Say that we want to count the number of words in a document. I know we can do the following:
text.each_line(){ |line| totalWords = totalWords + line.split.size }
Say, that I just want to add some exceptions, such that, I don't want to count the following as words:
(1) numbers
(2) standalone letters
(3) email addresses
How can we do that?
Thanks.
You can wrap this up pretty neatly:
text.each_line do |line|
total_words += line.split.reject do |word|
word.match(/\A(\d+|\w|\S*@\S+\.\S+)\z/)
end.length
end
Roughly speaking that defines an approximate email address.
Remember Ruby strongly encourages the use of variables with names like total_words and not totalWords.
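As a rough, self-contained check of this approach, here is a sketch on a made-up two-line string. The sample text and the constant name EXCLUDED are assumptions, and the pattern assumes email addresses are written with @. Note that total_words has to be initialised before the loop.

```ruby
# Tokens that should not be counted: numbers, single characters,
# and anything that loosely looks like an email address.
EXCLUDED = /\A(\d+|\w|\S*@\S+\.\S+)\z/

text = "I have 2 cats\nmail me at a@b.com today\n"

total_words = 0
text.each_line do |line|
  total_words += line.split.reject { |word| word.match?(EXCLUDED) }.length
end

p total_words  #=> 6 ("I", "2" and "a@b.com" are skipped)
```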
Assuming you can represent all the exceptions in a single regular expression regex_variable, you could do:
text.each_line { |line| totalWords = totalWords + line.split.count { |wrd| wrd !~ regex_variable } }
Your regular expression could look something like:
regex_variable = /\A\d+\z|\A[a-z]\z|\A[^@\s]+@(?:[-a-z0-9]+\.)+[a-z]{2,}\z/i
I don't claim to be a regex expert, so you may want to double-check that, particularly the email validation part.
In addition to the other answers, a little gem hunting came up with this:
WordsCounted Gem
Get the following data from any string or readable file:
Word count
Unique word count
Word density
Character count
Average characters per word
A hash map of words and the number of times they occur
A hash map of words and their lengths
The longest word(s) and its length
The most occurring word(s) and its number of occurrences.
Count individual strings for occurrences.
A flexible way to exclude words (or anything) from the count. You can pass a string, a regexp, an array, or a lambda.
Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
Filters special characters but respects hyphens and apostrophes.
Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as ["São", "Paulo"] and not ["S", "", "o", "Paulo"].
Opens and reads files. Pass in a file path or a url instead of a string.
Have you ever started answering a question and found yourself wandering, exploring interesting, but tangential issues, or concepts you didn't fully understand? That's what happened to me here. Perhaps some of the ideas might prove useful in other settings, if not for the problem at hand.
For readability, we might define some helpers in the class String, but to avoid contamination, I'll use Refinements.
Code
module StringHelpers
refine String do
def count_words
remove_punctuation.split.count { |w|
!(w.is_number? || w.size == 1 || w.is_email_address?) }
end
def remove_punctuation
gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/,' ')
end
def is_number?
self =~ /\A-?\d+(?:\.\d+)?\z/
end
def is_email_address?
include?('#') # for testing only
end
end
end
module CountWords
using StringHelpers
def self.count_words_in_file(fname)
IO.foreach(fname).reduce(0) { |t,l| t+l.count_words }
end
end
Note that I've put using inside a module (it could also be a class). It did not work for me in main. As far as I know, using is accepted at the top level of a script file but has historically been rejected when evaluated at the irb prompt, so the restriction seems to be about where the call is evaluated rather than about leaking the methods into Object. (Readers: please correct me if I'm wrong about this.)
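A small sketch of the scoping behaviour, with made-up names (Loud, Greeter, shout): the refined method is visible only where using has been activated, and String itself is left untouched.

```ruby
module Loud
  refine String do
    def shout
      upcase + '!'
    end
  end
end

module Greeter
  using Loud # the refinement is active from here to the end of this module body

  def self.shout_it(str)
    str.shout
  end
end

p Greeter.shout_it("hi")      #=> "HI!"
p "hi".respond_to?(:shout)    #=> false

begin
  "hi".shout
rescue NoMethodError
  puts "shout is not visible outside Greeter"
end
```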
Example
Let's first informally check that the helpers are working correctly:
module CheckHelpers
using StringHelpers
s = "You can reach my dog, a 10-year-old golden, at fido#dogs.org."
p s = s.remove_punctuation
#=> "You can reach my dog a 10 year old golden at fido#dogs.org."
p words = s.split
#=> ["You", "can", "reach", "my", "dog", "a", "10",
# "year", "old", "golden", "at", "fido#dogs.org."]
p '123'.is_number? #=> 0
p '-123'.is_number? #=> 0
p '1.23'.is_number? #=> 0
p '123.'.is_number? #=> nil
p "fido#dogs.org".is_email_address? #=> true
p "fido(at)dogs.org".is_email_address? #=> false
p s.count_words #=> 9 (`'a'`, `'10'` and "fido#dogs.org" excluded)
s = "My cat, who has 4 lives remaining, is at abbie(at)felines.org."
p s = s.remove_punctuation
p s.count_words
end
All looks OK. Next, I'll put some text in a file:
FName = "pets"
text =<<_
My cat, who has 4 lives remaining, is at abbie(at)felines.org.
You can reach my dog, a 10-year-old golden, at fido#dogs.org.
_
File.write(FName, text)
#=> 125
and confirm the file contents:
File.read(FName)
#=> "My cat, who has 4 lives remaining, is at abbie(at)felines.org.\n
# You can reach my dog, a 10-year-old golden, at fido#dogs.org.\n"
Now, count the words:
CountWords.count_words_in_file(FName)
#=> 18 (9 in each line)
Note that there is at least one problem with the removal of punctuation. It has to do with the hyphen. Any idea what that might be?
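One possible reading of the hyphen problem, sketched with plain methods that copy the two regular expressions above (strip_punctuation and number? are names made up for this example): every hyphen becomes a space, so a hyphenated compound is counted as several words and a negative number loses its sign before the numeric check can see it.

```ruby
# Same substitution as remove_punctuation, written as a plain method.
def strip_punctuation(str)
  str.gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/, ' ')
end

# Same numeric test as is_number?.
def number?(str)
  !!(str =~ /\A-?\d+(?:\.\d+)?\z/)
end

p strip_punctuation("a well-known fact").split  #=> ["a", "well", "known", "fact"]
p strip_punctuation("It was -5 outside").split  #=> ["It", "was", "5", "outside"]
p number?("-5")  #=> true, but the sign has already been removed by then
```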
Something like...?
def is_countable(word)
  return false if word.size < 2
  return false if word =~ /\A[0-9]+\z/
  return false if is_an_email_address(word) # you need a gem for this...
  true
end
# count returns the number of words for which the block is truthy
wordCount = text.split.count { |word| is_countable(word) }
Or, since I am jumping to the conclusion that you can just split your entire text into an array with split(), you might need:
wordCount = 0
text.each_line do |line|
line.split.each{|word| wordCount += 1 if is_countable(word) }
end

Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?

This has been asked before but wasn't ever answered.
I want to search through an array of strings and find the most frequent phrases (2 or more words) that occur within those strings, so given:
["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
I would want to get back something along the lines of:
{"from London" => 3, "my name is" => 2 }
I don't really know how to approach this. Any suggestions would be awesome, even if it was just a strategy that I could test out.
This isn't a one stage process, but it is possible. Ruby knows what a character is, what a digit is, what a string is, etc. but it doesn't know what a phrase is.
You'd need to:
Begin with either building a list of phrases, or finding a list online. This would then form the basis of the matching process.
Iterate through the list of phrases for each string, to see whether an instance of any of the phrases from the list occurs within that string.
Record a count of each instance of a phrase within a string.
Although it might not seem like it, this is quite a high-level question, so try to break the task down into smaller tasks.
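The three steps above could be sketched roughly as follows. The phrase list here is hypothetical and hard-coded; in practice it would be built or sourced separately, as described in the first step.

```ruby
strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

# Step 1: a known list of phrases (assumed for this sketch).
phrases = ["from london", "my name is", "good morning"]

# Steps 2 and 3: look for every phrase in every string and record the counts.
counts = phrases.each_with_object(Hash.new(0)) do |phrase, tally|
  strings.each { |str| tally[phrase] += str.downcase.scan(phrase).size }
end
counts.reject! { |_, n| n.zero? }

p counts  #=> {"from london"=>3, "my name is"=>2}
```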
Here is something that might get you started. This is brute forcing and will be very very slow for large data sets.
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
word_maps = x.flat_map do |line|
line = line.downcase.scan(/\w+/)
(2..line.size).flat_map{|ix|line.each_cons(ix).map{|p|p.join(' ')}}
end
word_maps_hash = Hash[word_maps.group_by{|x|x}.reject{|x,y|y.size==1}.map{|x,y|[x,y.size]}]
original_hash_keys = word_maps_hash.keys
word_maps_hash.delete_if{|key, val| original_hash_keys.any?{|ohk| ohk[key] && ohk!=key}}
p word_maps_hash #=> {"from london"=>3, "my name is"=>2}
How about
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
words = x.map { |phrase| phrase.split(/[^\w\'']+/)}.flatten
word_pairs_array = words.each_cons(2)
word_pairs = word_pairs_array.map {|pair| pair.join(' ')}
counts = Hash.new 0
word_pairs.each {|pair| counts[pair] += 1}
pairs_occuring_twice_or_more = counts.select {|pair, count| count > 1}
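One caveat with flattening all the words first is that each_cons will also pair the last word of one string with the first word of the next. A hedged variant that builds the pairs inside each string, and uses Enumerable#tally (which assumes Ruby 2.7 or later) for the counting, might look like this:

```ruby
x = ["hello, my name is Emily, I'm from London",
     "this chocolate from London is really good",
     "my name is James, what did you say yours was",
     "is he from London?"]

# Build the pairs per string so that no pair spans two strings.
word_pairs = x.flat_map do |phrase|
  phrase.split(/[^\w']+/).each_cons(2).map { |pair| pair.join(' ') }
end

# tally counts occurrences; keep only the pairs seen more than once.
repeated = word_pairs.tally.select { |_, count| count > 1 }

p repeated  #=> {"my name"=>2, "name is"=>2, "from London"=>3}
```

This still only finds two-word phrases; changing the argument of each_cons gives longer ones.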

Printing an array of arrays on one line in console (one line per master array object) in Ruby

I have an array of arrays that is currently printing each object in the array on its own line. The master array holds many different people inside of it. Each person has 5 different objects stored to them (e.g. Last Name, First Name, DOB.. etc)
Kournikova
Anna
F
6/3/1975
Red
Hingis
Martina
F
4/2/1979
Green
Seles
Monica
F
12/2/1973
Black
What I'm trying to do is print out each person and their corresponding objects on one line, per person.
Does anyone have a solution for this? Additionally, the output should not contain array brackets ([]) or commas. I'm wondering if it will simply need to be a string, or if there is something I am missing.
Some of my code below:
space_array = [split_space[0],split_space[1],split_space[3],new_date,split_space[5]]
master << space_array
puts master
The ideal output would be something like this:
Kournikova Anna F 6/3/1975 Red
Hingis Martina F 4/2/1979 Green
Seles Monica F 12/2/1973 Black
your_array.each do |person|
puts person.join(" ")
end
The method puts will automatically put a new line. Use print instead to print the text out with no new line.
Or if you want, you can use the join function.
['a', 'b', 'c'].join(' ')
=> 'a b c'
You can just iterate over the outer array and join the inner arrays into a string. Since you provide no example data ready for copying and pasting, here's some example code I made up:
outer_array.each { |inner| puts inner.join(' ') }
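For a copy-and-paste check, here is a sketch using the sample people from the question; the variable name master follows the question's code. Since puts prints each element of an array of strings on its own line, the rows can also be built first and printed in one call.

```ruby
master = [
  %w[Kournikova Anna F 6/3/1975 Red],
  %w[Hingis Martina F 4/2/1979 Green],
  %w[Seles Monica F 12/2/1973 Black]
]

# One string per person; joining the outer array instead would
# flatten everything into a single line.
lines = master.map { |person| person.join(' ') }
puts lines
# Kournikova Anna F 6/3/1975 Red
# Hingis Martina F 4/2/1979 Green
# Seles Monica F 12/2/1973 Black
```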
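Simpler still, since puts prints each element of an array on its own line (calling join directly on the outer array would flatten everything onto a single line):
puts your_array.map { |person| person.join(" ") }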

Find out which words in a large list occur in a small string

I have a static 'large' list of words, about 300-500 words, called 'list1'
given a relatively short string str of about 40 words, what is the fastest method in ruby to get:
the number of times a word in list1 occurs in str (counting multiple occurrences)
a list of which words in list1 occur one or more times in the string str
the number of words in (2)
'Occurring' in str means either as a whole word in str, or as a partial within a word in str. So if 'fred' is in list1 and str contained 'fred' and 'freddie' that would be two matches.
Everything is lowercase, so any matching does not have to care about case.
For example:
list1 ="fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
so str contains sam, jack, fred (twice)
for part (1) the expression would return 4 (sam+jack+fred+fred)
for part (2) the expression would return "sam jack fred"
and part (3) is 3
The 'ruby way' to do this eludes me after 4 hours... with iteration it's easy enough (but slow). Any help would be appreciated!
Here's my shot at it:
def match_freq(exprs, strings)
rs, ss, f = exprs.split.map{|x|Regexp.new(x)}, strings.split, {}
rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s=~r}}
[f.values.inject(0){|a,x|a+x}, f, f.size]
end
list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]
The output of "match_freq" is an array of your three output items (1, 2, 3). The algorithm itself is O(n*m), where n is the number of items in list1 and m is the size of the input string; I don't think you can do better than that in big-O terms. There are smaller optimizations that might pay off, though, such as keeping a running counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.
You can extract just the matching words from the output as follows:
matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"
Note that the order won't necessarily be preserved; if that's important, you'll have to keep a separate list of the order in which they were found.
Here's an alternative implementation, for your edification:
def match_freq( words, str )
words = words.split(/\s+/)
counts = Hash[ words.map{ |w| [w,str.scan(w).length] } ]
counts.delete_if{ |word,ct| ct==0 }
occurring_words = counts.keys
[
counts.values.inject(0){ |sum,ct| sum+ct }, # Sum of counts
occurring_words,
occurring_words.length
]
end
list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
p x #=> [4, ["fred", "sam", "jack"], 3]
Note that if I needed this data I would probably just return the 'counts' hash from the method and then do whatever analysis I wanted on it. If I was going to return multiple 'values' from an analysis method, I might return a Hash of named values. Although, returning an array allows you to unsplat the results:
hits, words, word_count = match_freq(list1, str)
p hits, words, word_count
#=> 4
#=> ["fred", "sam", "jack"]
#=> 3
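A minimal sketch of the "Hash of named values" variant mentioned above; the method name match_stats and the keys :total, :words, :word_count and :counts are made up for this example.

```ruby
def match_stats(words, str)
  # Count the occurrences of each list word in str, dropping words that never occur.
  counts = words.split.map { |w| [w, str.scan(w).length] }.to_h
  counts = counts.reject { |_, ct| ct.zero? }
  {
    total:      counts.values.sum, # part (1)
    words:      counts.keys,       # part (2)
    word_count: counts.size,       # part (3)
    counts:     counts             # raw data for any further analysis
  }
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
stats = match_stats(list1, str)
p stats[:total]   #=> 4
p stats[:words]   #=> ["fred", "sam", "jack"]
p stats[:counts]  #=> {"fred"=>2, "sam"=>1, "jack"=>1}
```

Callers then pick values by name instead of relying on the position in an array.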
For faster regular expressions, use https://github.com/mudge/re2. It is a Ruby wrapper for Google's re2 (https://code.google.com/p/re2/).
