Ruby regex matching strings from an array? - ruby

I'm sort of new to regexes in Ruby (or I suppose regexes in general), but I was wondering if there is a pragmatic way to match a string using an array?
Let me explain, say I have a list of ingredients in this case:
1 1/3 cups all-purpose flour
2 teaspoons ground cinnamon
8 ounces shredded mozzarella cheese
Ultimately I need to split each ingredient into its respective "quantity and measurement" and "ingredient name", so 8 ounces shredded mozzarella cheese would be split into "8 ounces" and "shredded mozzarella cheese".
So instead of having a hugely long regex like (cup\w*|teaspoon\w*|ounce\w* ....... ), how can I use an array to hold those values outside the regex?
update
I did this (thanks cwninja):
# I think the all units should be just singular, then
# use ruby function to pluralize them.
units = [
'tablespoon',
'teaspoon',
'cup',
'can',
'quart',
'gallon',
'pinch',
'pound',
'pint',
'fluid ounce',
'ounce'
# ... shortened for brevity
]
# Note: String#pluralize comes from ActiveSupport (Rails), not core Ruby.
joined_units = (units.collect{|u| u.pluralize} + units).join('|')
# There are actually many ingredients, so this is actually an iterator
# but for example sake we are going to just show one.
ingredient = "1 (10 ounce) can diced tomatoes and green chilies, undrained"
arr = ingredient.split(/([\d\/\.\s]+(\([^)]+\))?)\s(#{joined_units})?\s?(.*)/i)
This gives me close to what I want, so I think this is the direction I want to go.
puts "measurement: #{arr[1]}"
puts "unit: #{arr[-2] if arr.size > 3}"
puts "title: #{arr[-1].strip}"

Personally I'd just build the regexp programmatically, you can do:
ingredients = [...]
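recipe = Regexp.new(ingredients.join("|"), Regexp::IGNORECASE)
or using union method:
regex = Regexp.union(ingredients)
# /#{regex}/i would not work here: an interpolated Regexp keeps its own
# (case-sensitive) flags, so rebuild it from its source instead.
recipe = Regexp.new(regex.source, Regexp::IGNORECASE)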
… then use the recipe regexp.
As long as you save it and don't keep recreating it, it should be fairly efficient.
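As a minimal sketch of that idea applied to the ingredient lines from the question: the UNITS list, the constant names and the split_ingredient helper below are illustrative assumptions, not part of the answer above. Plurals are handled with an optional trailing "s" rather than a pluralize helper, and the parenthesised "(10 ounce)" case from the question is not covered.

```ruby
# Hypothetical unit list; longer names come first so that
# "fluid ounce" is tried before "ounce".
UNITS = ['fluid ounce', 'tablespoon', 'teaspoon', 'ounce', 'cup'].freeze

# Regexp.union escapes each string and joins them with "|".
UNIT_PATTERN = Regexp.union(UNITS)

# Built once and reused. Groups: quantity, unit (optional plural "s"), name.
# The source is wrapped in (?:...) so that "s?" applies to every unit.
INGREDIENT_PATTERN =
  /\A([\d\s.\/]+?)\s+((?:#{UNIT_PATTERN.source})s?)\s+(.+)\z/i

# Returns ["quantity and measurement", "ingredient name"], or nil
# when the line does not start with a quantity followed by a known unit.
def split_ingredient(line)
  match = INGREDIENT_PATTERN.match(line)
  return nil unless match

  ["#{match[1]} #{match[2]}", match[3]]
end

split_ingredient('8 ounces shredded mozzarella cheese')
# => ["8 ounces", "shredded mozzarella cheese"]
```

Keeping the units in an array means new units can be added without touching the pattern itself.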

For an array a, something like this should work:
a.each do |line|
parts = /^([\d\s\.\/]+)\s+(\w+)\s+(.*)$/.match(line)
# Do something with parts[1 .. 3]
end
For example:
a = [
'1 1/3 cups all-purpose flour',
'2 teaspoons ground cinnamon',
'8 ounces shredded mozzarella cheese',
'1.5 liters brandy',
]
puts "amount\tunits\tingredient"
a.each do |line|
parts = /^([\d\s\.\/]+)\s+(\w+)\s+(.*)$/.match(line)
puts parts[1 .. 3].join("\t")
end

Related

ruby refactoring class method

I am new to Ruby. I am trying to create a report_checker function that checks how often the words "Green", "Red" and "Amber" appear and returns the counts in the format: "Green: 2/nAmber: 1/nRed:1".
If a word is not one of the three mentioned, it is replaced with the word 'Unaccounted', but the number of times it appears is still counted.
My code is returning repeats e.g if I give it the input report_checker("Green, Amber, Green"). It returns "Green: 2/nAmber: 1/nGreen: 2" as opposed to "Green: 2/nAmber: 1".
Also, it doesn't count the number of times an unaccounted word appears. Any guidance on where I am going wrong?
def report_checker(string)
array = []
grading = ["Green", "Amber", "Red"]
input = string.tr(',', ' ').split(" ")
input.each do |x|
if grading.include?(x)
array.push( "#{x}: #{input.count(x)}")
else
x = "Unaccounted"
array.push( "#{x}: #{input.count(x)}")
end
end
array.join("/n")
end
report_checker("Green, Amber, Green")
I tried pushing the words into separate words and returning the expected word with its count
There are a lot of things you can do here to steer this into more idiomatic Ruby:
# Use a constant, as this never changes, and a Set, since you only care
# about inclusion, not order. Calling #include? on a Set is always
# quick, while on a longer array it can be very slow.
require 'set' # needed on Ruby versions before 3.2, harmless afterwards
GRADING = Set.new(%w[ Green Amber Red ])
def report_checker(string)
# Do this as a series of transformations:
# 1. More lenient splitting on either comma or space, with optional leading
# and trailing spaces.
# 2. Conversion of invalid inputs into 'Unaccounted'
# 3. Grouping together of identical inputs via the #itself method
# 4. Combining these remapped strings into a single string
string.split(/\s*[,|\s]\s*/).map do |input|
if (GRADING.include?(input))
input
else
'Unaccounted'
end
end.group_by(&:itself).map do |input, samples|
"#{input}: #{samples.length}"
end.join("\n")
end
report_checker("Green, Amber, Green, Orange")
One thing you'll come to learn about Ruby is that simple mappings like this translate into very simple Ruby code. This might look a bit daunting now if you're not used to it, but keep in mind that each component of that transformation isn't that complex, and further, that you can run the chain up to any given point to see what's going on, or even use .tap { |v| p v } in the middle of the chain to inspect what's flowing through there.
Taking this further into the Ruby realm, you'd probably want to use symbols, as in :green and :amber, as these are very tidy as things like Hash keys: { green: 0, amber: 2 } etc.
While this is done as a single method, it might make sense to split this into two concerns: One focused on computing the report itself, as in to a form like { green: 2, amber: 1, unaccounted: 1 } and a second that can convert reports of that form into the desired output string.
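A rough sketch of that two-step split, under some assumptions: the names GRADES, tally_report and format_report are hypothetical, and input is downcased before being turned into symbols, which makes matching case-insensitive (the original method is case-sensitive).

```ruby
# Hypothetical names; not taken from the answer above.
GRADES = %i[green amber red].freeze

# Concern 1: turn the raw string into a Hash of counts keyed by symbol.
def tally_report(string)
  counts = Hash.new(0)
  string.split(/[,\s]+/).reject(&:empty?).each do |word|
    key = word.downcase.to_sym
    key = :unaccounted unless GRADES.include?(key)
    counts[key] += 1
  end
  counts
end

# Concern 2: turn a Hash of counts into the desired output string.
def format_report(counts)
  counts.map { |grade, count| "#{grade.to_s.capitalize}: #{count}" }.join("\n")
end

counts = tally_report('Green, Amber, Green, Orange')
# => {:green=>2, :amber=>1, :unaccounted=>1}
puts format_report(counts)
```

With the counting separated out, the Hash can be tested or reused on its own, and the string format can change without touching the counting logic.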
There are lots and lots of ways to accomplish your end goal in Ruby. I won't go over those, but I will take a moment to point out a few key issues with your code in order to show you where the most notable problems are and how to fix them with as few changes as I can personally think of:
Issue #1:
if grading.include?(x)
array.push( "#{x}: #{input.count(x)}")
This results in a new array element being added each and every time grading includes x. This explains why you are getting repeated array elements ("Green: 2/nAmber: 1/nGreen: 2"). My suggested fix for this issue is to use the uniq method in the last line of your method definition. This will remove any duplicated array elements.
Issue #2
else
x = "Unaccounted"
array.push( "#{x}: #{input.count(x)}")
The reason you're not seeing any quantity for your "Unaccounted" elements is that you're adding the word(string) "Unaccounted" to your array, but you've also re-defined x. The problem here is that input does not actually include any instances of "Unaccounted", so your count is always going to be 0. My suggested fix for this is to simply find the length difference between input and grading which will tell you exactly how many "Unaccounted" elements there actually are.
Issue #3 ??
I'm assuming you meant to include a newline and not a forward slash (/) followed by a literal "n" (n). My suggested fix for this of course is to use a proper newline (\n). If my assumption is incorrect, just ignore that part.
After all changes, your minimally modified code would look like this:
def report_checker(string)
array = []
grading = ["Green", "Amber", "Red"]
input = string.tr(',', ' ').split(" ")
input.each do |x|
if grading.include?(x)
array.push( "#{x}: #{input.count(x)}")
else
array.push( "Unaccounted: #{(input-grading).length}")
end
end
array.uniq.join("\n")
end
report_checker("Green, Amber, Green, Yellow, Blue, Blue")
#=>
Green: 2
Amber: 1
Unaccounted: 3
Again, I'm not suggesting that this is the most effective or efficient approach. I'm just giving you some minor corrections to work with so you can take baby steps if so desired.
Try the code below.
Add your display logic outside of the method:
def report_checker(string, grading = %w[ Green Amber Red ])
data = string.split(/\s*[,|\s]\s*/)
unaccounted = data - grading
(data - unaccounted).tally.merge('Unaccounted' => unaccounted.count)
end
result = report_checker("Green, Amber, Green, Orange, Yellow")
result.each { |k,v| puts "#{k} : #{v}"}
Output
Green : 2
Amber : 1
Unaccounted : 2

Optimize print output where i use check on zero. Ruby

Currently, I'm having print like this
print ((stamp_amount[0], 'first mark') unless stamp_amount[0].zero?), (', ' if !stamp_amount[0].zero? && !stamp_amount[1].zero?),
((stamp_amount[1], 'second mark') unless stamp_amount[1].zero?)
stamp_amount is an array with 2 integer values
Let's say in the current situation stamp_amount[0] = 10 and stamp_amount[1] = 3
Output preview:
10 first mark, 3 second mark
So if stamp_amount[0] = 0, the 10 first mark, part won't be shown. Likewise, if stamp_amount[1] = 0, the , 3 second mark part won't be shown.
For me, it seems a little bit incorrect in terms of theory. Could you please suggest a more correct or less painful way to print this? :)
Cheers!
Your code is trying to join a sequence of up to two elements with a separator. The joining is a solved problem, see Array#join.
The problem can be then reduced to "how can I produce the correct sequence, given my stamp_amount input". Now this can be done in a thousand ways. Here's one:
def my_print(stamp_amount)
ary = [
!stamp_amount[0].zero? && stamp_amount[0],
!stamp_amount[1].zero? && stamp_amount[1],
].select{|elem| elem }
ary.join(', ')
end
my_print([10, 3]) # => "10, 3"
my_print([0, 3]) # => "3"
my_print([10, 0]) # => "10"
my_print([0, 0]) # => ""
Here's another
ary = []
ary << stamp_amount[0] unless stamp_amount[0].zero?
ary << stamp_amount[1] unless stamp_amount[1].zero?
ary.join(', ')
Here's yet another. This version can handle stamp_amount of any length.
ary = stamp_amount.reject(&:zero?)
ary.join(', ')
I'd go with the third, but the second one may be the easiest to understand for a beginner.
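The snippets above join only the numbers, while the output in the question also carries a label per amount. A small sketch of how the labels could be kept, assuming the 'first mark' and 'second mark' wording from the question; LABELS and stamp_summary are hypothetical names:

```ruby
# Labels taken from the question; the constant name is an assumption.
LABELS = ['first mark', 'second mark'].freeze

# Pair each amount with its label, drop the zero amounts,
# then join whatever is left with a comma.
def stamp_summary(stamp_amount)
  stamp_amount.zip(LABELS)
              .reject { |amount, _label| amount.zero? }
              .map { |amount, label| "#{amount} #{label}" }
              .join(', ')
end

stamp_summary([10, 3]) # => "10 first mark, 3 second mark"
stamp_summary([0, 3])  # => "3 second mark"
stamp_summary([0, 0])  # => ""
```

Because the pairing happens before the zeros are rejected, each amount keeps its own label even when an earlier amount is dropped.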
Use select as an alternative to reject (shown in part 3 of the answer by Sergio Tulentsev). It is just as readable, and depending on the context and on future changes to the code, you may prefer one over the other.
puts stamp_amount.select{ |a| !a.zero? }.join(", ")
A few examples of inputs and outputs are:
stamp_amount    output
--------------------------------------------------------------------------
10, 3           10, 3
10, 0           10
0, 3            3
0, 0            (prints an empty line, because the selected array is empty)
You're calculating zero? on index points more often than is needed, but the first thing I would look at refactoring here is the readability of the code. It might be nicer to calculate the message to print outside of the print method and explain what is happening with variable names.
# rubocop is going to complain about variable assignment like this
first_amount, second_amount = *stamp_amount
We can actually use the reason RuboCop prefers .zero? over == 0 to guide our development. zero? is in essence just == 0, but it communicates the meaning of what you are attempting to do in a better manner. I would use this reasoning when assigning strings to variables that explain what they are doing.
some_name_that_explains_what_this_is_0 = "#{first_amount} piecu centu marka"
some_name_that_explains_what_this_is_1 = "#{second_amount} tris centu marka"
Your current code is confusing, as it makes it possible to print a string like "10 tris centu marka", which does not make lexical sense and is probably not what you are after, considering tris evaluates to 'second mark'; this would pose an issue if the first value is zero. We could also reject zero integers before we start converting them to strings.
array = [1, 0].reject(&:zero?)
Now we can take the array and do something like:
# ord class (defined first so that it exists when the loop below runs)
class Ordinalize
def initialize(value)
@value = value
end
def ordinalize
mapping[@value]
end
def mapping
# accounting for zero index
['first', 'second']
end
end
string = []
array.each_with_index { |e, i| string << "#{e} #{Ordinalize.new(i).ordinalize} mark" }
message = string.join(', ')
print(message)
where we are calculating the ordinalization and letting our new class handle the sentence structure for us.
Outputs:
[1, 0] => "1 first mark"
[0, 1] => "1 first mark"
[1, 2] => "1 first mark, 2 second mark"

How do I match a longer string to shorter word or string

I have a database of items with tags, such as:
item1 is tagged with "pork with apple sauce"
item2 is tagged with "pork",
item3 is tagged with "apple sauce".
If I match the string:
"Today I would like to eat pork with apple sauce, it would fill me up"
against the tags, I would get three results. However, I just want to get the most specific one, which in this case would be item1.
This is just an example and I'm not using a particular database, just strings and a map in Ruby. I came across the term "fuzzy search", but I'm not sure if that is the right approach. Can anybody suggest how to solve this particular problem?
Yes, you need to do a fuzzy match (aka approximate match). It is quite a well known problem, and implementing an approximate matching algorithm by hand is not an easy task (but I'm sure it's very fun! =D). There are lots of things that can affect how "similar" two strings, A and B, are, depending on what things you consider important, like how many times A appears in B, or how close the order and distance between the words in A appear in B, or if the "important" words in A appear in B, etc.
If you can get by with an existing library, there seems to be a couple of Ruby gems that can get the job done. For example, using this one called fuzzy-string-match, which uses the Jaro-Winkler distance ported from Lucene (a Java library... it also seems to have preserved the Java convention of camelCased method names ¬¬):
require 'fuzzystringmatch'
matcher = FuzzyStringMatch::JaroWinkler.create(:pure)
tags = ["pork with apple sauce", "pork", "apple sauce"]
input = "Today I would like to eat pork with apple sauce, it would fill me up"
# Select the tag by distance to the input string (distance == 1 means perfect
# match)
best_tag = tags.max_by { |tag| matcher.getDistance(tag, input) }
p best_tag
Will correctly select "pork with apple sauce".
There's also this other gem called amatch that has many other approximate matching algorithms.
Depending on your specific use case, you may not need a fuzzy search.
Maybe a very basic implementation like this is sufficient for you:
class Search
attr_reader :items, :string
def initialize(items, string)
@items = items
@string = string.downcase
end
def best_match
items.max_by { |item| rate(item) }
end
private
def rate(item)
tag_list(item).count { |tag| string.include?(tag) }
end
def tag_list(item)
item[:tags].split(" ")
end
end
items = [
{ id: :item1, tags: "pork with apple sauce" },
{ id: :item2, tags: "pork" },
{ id: :item3, tags: "apple sauce" }
]
string = "Today I would like to eat pork with apple sauce, it would fill me up"
Search.new(items, string).best_match
#=> {:id=>:item1, :tags=>"pork with apple sauce"}
The order or specificity among the items in your database is determined before you match them with a string. You do not make it clear in the question, but I suppose what you have in mind is the length. So, suppose you have the data as a hash:
h = {
item1: "pork with apple sauce",
item2: "pork",
item3: "apple sauce",
}
Then, you can sort this by the length of the tag so that a longer one comes first in the list. At the same time, you can convert the tags into regexes so that you don't need to worry about variation in space. Then, you would have an array like this:
a =
h
.sort_by{|_, s| s.length}.reverse
.map{|k, s| [k, Regexp.new("\\b#{s.gsub(/\s+/, '\\s+')}\\b")]}
# =>
# [
# [
# :item1,
# /\bpork\s+with\s+apple\s+sauce\b/
# ],
# [
# :item3,
# /\bapple\s+sauce\b/
# ],
# [
# :item2,
# /\bpork\b/
# ]
# ]
Once you have this, you just need to find the first item in the list that matches with the string.
s = "Today I would like to eat pork with apple sauce, it would fill me up"
a.find{|_, r| s =~ r}[0]
# => :item1
This will apply to general programming and not Ruby in particular.
I would tokenize both strings, that is, both the needle and the haystack, and then loop through them both while counting the number of occurrences. Then finally compare scores.
Some pseudocode:
needle[] = array of tokens from keysentence
haystack[] array of tokens from search string
int score = 0
do {
haystackToken = haystack's next token
do {
needleToken = needle's next token
if (haystackToken equals needleToken)
score += 1
} while(needle has more token)
} while (haystack has more tokens)
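Since the question is about Ruby, the pseudocode above can be sketched roughly as follows. The helper names tokenize, score and most_specific_tag are hypothetical, and splitting on word characters after downcasing is an assumption about what counts as a token.

```ruby
# Split a string into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Count how many (haystack token, needle token) pairs are equal,
# mirroring the two nested loops in the pseudocode.
def score(needle, haystack)
  needle_tokens = tokenize(needle)
  tokenize(haystack).sum { |token| needle_tokens.count(token) }
end

# Pick the tag with the highest score against the search string.
def most_specific_tag(tags, haystack)
  tags.max_by { |tag| score(tag, haystack) }
end

tags = ['pork with apple sauce', 'pork', 'apple sauce']
input = 'Today I would like to eat pork with apple sauce, it would fill me up'
most_specific_tag(tags, input) # => "pork with apple sauce"
```

Note that this scoring favours tags with more matching words, so a long tag that only partially matches can outrank a short tag that matches completely.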

How can I sort this array?

I have an array, headlines, that holds several sentences, so like:
headlines = ["I see a tree", "Facebook is slow", "plants need water to grow", "There's an orange", "I think we'll agree"]
first = headlines[0]
second = headlines[1]
third = headlines[2]
I am using the ruby_rhymes gem which provides a method #to_phrase.rhymes which prints out rhyming words for the last word in a string you provide it with. Now to check if the array strings rhyme, I do something like:
> first.to_phrase.rhymes.flatten.join(", ").include?(second.to_phrase.rhymes.flatten.join(", "))
=> false
> second.to_phrase.rhymes.flatten.join(", ").include?(third.to_phrase.rhymes.flatten.join(", "))
=> true
I want to save these to a text file, so I want to sort the array so that rhyming pairs are next to one another. I know that sorting so that strings with the same last 3 characters end up adjacent looks like this:
headlines.sort! {|a,b| a[-3,3] <=> b[-3,3] }
But I don't know how to do what I want.
By investigating the output of your suggestion you can see that you are on the right track:
p headlines.sort {|a,b| a[-3,3] <=> b[-3,3] }
# => ["Facebook is slow", "There's an orange", "I see a tree", "I think we'll agree", "plants need water to grow"]
"...slow" and "...grow" are the only unordered sentences, caused by the letters 'r' and 'o'. A simple hack would be to reverse the order of the comparison like that:
p headlines.sort {|a,b| a[-3,3].reverse <=> b[-3,3].reverse }
# => ["I see a tree", "I think we'll agree", "There's an orange", "Facebook is slow", "plants need water to grow"]
So I've figured it out:
headlines.sort_by! { |h| h.to_phrase.rhyme_key }
This doesn't work 100% but that's the fault of the dictionary the gem relies on.

Dynamically Create Arrays in Ruby

Is there a way to dynamically create arrays in Ruby? For example, let's say I wanted to loop through an array of books as input by a user:
books = gets.chomp
The user inputs:
"The Great Gatsby, Crime and Punishment, Dracula, Fahrenheit 451,
Pride and Prejudice, Sense and Sensibility, Slaughterhouse-Five,
The Adventures of Huckleberry Finn"
I turn this into an array:
books_array = books.split(", ")
Now, for each book the user input, I'd like Ruby to create an array. Pseudo-code to do that:
x = 0
books_array.count.times do
x += 1
puts "Please input weekly sales of #{books_array[x]} separated by a comma."
weekly_sales = gets.chomp.split(",")
end
Obviously this doesn't work. It would just re-define weekly_sales over and over again. Is there a way to achieve what I'm after, and with each loop of the .times method create a new array?
weekly_sales = {}
puts 'Please enter a list of books'
book_list = gets.chomp
books = book_list.split(',')
books.each do |book|
puts "Please input weekly sales of #{book} separated by a comma."
weekly_sales[book] = gets.chomp.split(',')
end
In ruby, there is a concept of a hash, which is a key/value pair. In this case, weekly_sales is the hash, we are using the book name as the key, and the array as the value.
A small change I made to your code is instead of doing books.count.times to define the loop and then dereference array elements with the counter, each is a much nicer way to iterate through a collection.
The "push" command will append items to the end of an array.
Ruby Docs->Array->push
result = "The Great Gatsby, Crime and Punishment, Dracula, Fahrenheit 451,
Pride and Prejudice, Sense and Sensibility, Slaughterhouse-Five,
The Adventures of Huckleberry Finn".split(/,\s*/).map do |b|
puts "Please input weekly sales of #{b} separated by a comma."
gets.chomp.split(',') # .map { |e| e.to_i }
end
p result
Remove the comment if you would like the input strings converted to numbers
One way or another you need a more powerful data structure.
Your post gravitates toward the idea that weekly_sales would be an array paralleling the books array. The drawback of this approach is that you have to maintain the parallelism of these two arrays yourself.
A somewhat better solution is to use the book title as a key to hash of arrays, as several answers have suggested. For example: weekly_sales['Fahrenheit 451'] would hold an array of sales data for that book. This approach hinges on the uniqueness of the book titles and has other drawbacks.
A more robust approach, which you might want to consider, is to bundle together each book's info into one package.
At the simplest end of the spectrum would be a list of hashes. Each book would be a self-contained unit along these lines:
books = [
{
'title' => 'Fahrenheit 451',
'sales' => [1,2,3],
},
{
'title' => 'Slaughterhouse-Five',
'sales' => [123,456],
},
]
puts books[1]['title']
At the other end of the spectrum would be to create a proper Book class.
And an intermediate approach would be to use a Struct (or an OpenStruct), which occupies a middle ground between hashes and full-blown objects. For example:
# Define the attributes that a Book will have.
Book = Struct.new(:title, :weekly_sales)
books = []
# Simulate some user input.
books_raw_input = "Fahrenheit 451,Slaughterhouse-Five\n"
sales_raw_input = ['1,2,3', '44,55,66,77']
books_raw_input.chomp.split(',').each do |t|
ws = sales_raw_input.shift.split(",")
# Create a new Book.
books.push Book.new(t, ws)
end
# Now each book is a handy bundle of information.
books.each do |b|
puts b.title
puts b.weekly_sales.join(', ')
end
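For comparison, the "proper Book class" end of the spectrum mentioned above might look roughly like this. It is only a sketch: the add_sales and total_sales methods are hypothetical additions to show what a class can offer beyond a Struct.

```ruby
# Hypothetical Book class; attribute and method names are illustrative.
class Book
  attr_reader :title, :weekly_sales

  def initialize(title, weekly_sales = [])
    @title = title
    @weekly_sales = weekly_sales
  end

  # Parse a comma-separated line such as "1, 2, 3" into integers
  # and append them to this book's sales. Raises ArgumentError on
  # anything that is not a valid integer.
  def add_sales(raw_input)
    @weekly_sales.concat(raw_input.split(',').map { |s| Integer(s.strip) })
    self
  end

  def total_sales
    weekly_sales.sum
  end
end

book = Book.new('Fahrenheit 451').add_sales('1, 2, 3')
book.weekly_sales # => [1, 2, 3]
book.total_sales  # => 6
```

The class keeps input parsing and derived values such as totals next to the data they belong to, at the cost of a little more code than the Struct version.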
Are you happy to end up with an array of arrays? In which case this might be useful:
book_sales = books_array.collect do |book|
puts "Please input weekly sales of #{book} separated by a comma."
gets.chomp.split(",").collect{ |s| s.to_i }
end
Looking at it, you might prefer a hash, keyed by book. Something like this:
book_sales = books_array.inject({}) do |hash, book|
puts "Please input weekly sales of #{book} separated by a comma."
weekly_sales = gets.chomp.split(",").collect{ |s| s.to_i }
hash[book] = weekly_sales
hash # the block must return the accumulator for the next iteration
end
This solution assumes that there will never be a duplicate book title. I figure that is pretty safe, yes?
input = "A list of words"
hash = {}
input.split(/\s+/).each { |word| hash[word] = [] }
# Now do whatever with each entry
hash.each do |word,ary|
ary << ...
end
