How to parse data from a file into a Hash - ruby

I am trying to parse data from a file and store it in a hash, using string comparison rather than regular expressions, but I am getting errors that I have not been able to fix.
The file has a structure like:
"key" + "colon" + "value"
in every line. This structure is repeated throughout the file: every term has an ID key, almost every term has at least one "is_a" key, and a term may also have "is_obsolete" and "replaced_by" keys.
I'm trying to parse it like this:
def get_hpo_data(hpofile="hp.obo")
  hpo_data = Hash.new() #Hash map where i want to store all IDs
  File.readlines(hpofile).each do |line|
    if line.start_with? "id:" #if line is an ID
      hpo_id = line[4..13] #Store ID value
      hpo_data[hpo_id] = Hash.new() #Setting up hash map for that ID
      hpo_data[hpo_id]["parents"] = Array.new()
    elsif line.start_with? "is_obsolete:" #If the ID is obsolete
      hpo_data[hpo_id]["is_obsolete"] = true #store value in the hash
    elsif line.start_with? "replaced_by:" #If the ID is obsolete
      hpo_data[hpo_id]["replaced_by"] = line[13..22]
      #Store the ID term it was replaced by
    elsif line.start_with? "is_a:" #If the ID has a parent ID
      hpo_data[hpo_id]["parents"].push(line[6..15])
      #Store the parent(s) in the array initialized before
    end
  end
  return hpo_data
end
The structure I expected is a global hash in which every ID maps to its own hash holding its data (one string, one boolean, and an array whose length depends on the number of parent IDs of that term), but I'm getting the following error:
table_combination.rb:224:in `block in get_hpo_data': undefined method `[]=' for nil:NilClass (NoMethodError)
This time the error points to the replaced_by elsif branch, but I also get it with any of the other elsif branches, so the code fails when parsing the "is_obsolete", "replaced_by" and "is_a" properties. If I delete these branches, the code successfully creates the global hash with every ID term as a hash.
I also tried giving default values to every hash, but that does not solve the problem. I even get a new error I had not seen before:
table_combination.rb:233:in '[]': no implicit conversion of String into Integer (TypeError)
at this line:
hpo_data[hpo_id]["parents"].push(line[6..15])
Here is an example of what the file looks like for two terms, showing the different keys I want to handle:
[Term]
id: HP:0002578
name: Gastroparesis
def: "Decreased strength of the muscle layer of stomach, which leads to a decreased ability to empty the contents of the stomach despite the absence of obstruction." [HPO:probinson]
subset: hposlim_core
synonym: "Delayed gastric emptying" EXACT layperson [ORCID:0000-0001-5208-3432]
xref: MSH:D018589
xref: SNOMEDCT_US:196753007
xref: SNOMEDCT_US:235675006
xref: UMLS:C0152020
is_a: HP:0002577 ! Abnormality of the stomach
is_a: HP:0011804 ! Abnormal muscle physiology
[Term]
id: HP:0002564
name: obsolete Malformation of the heart and great vessels
is_obsolete: true
replaced_by: HP:0030680

There might be more errors hidden in your code, but one problem is indeed that your hpo_data doesn't have default values.
Calling hpo_data[hpo_id]["replaced_by"] = line[13..22] fails when hpo_data has no entry for hpo_id: hpo_data[hpo_id] then returns nil, and nil has no []= method.
You could define hpo_data like this:
hpo_data = Hash.new { |hash, key| hash[key] = {'parents' => [] } }
and remove
hpo_data = Hash.new() #Hash map where i want to store all IDs
and
hpo_data[hpo_id] = Hash.new() #Setting up hash map for that ID
hpo_data[hpo_id]["parents"] = Array.new()
Any time you call hpo_data[hpo_id], it will be automatically defined to {"parents"=>[]}.
As an example:
hpo_data = Hash.new { |hash, key| hash[key] = {'parents' => [] } }
# => {}
hpo_data[1234]
# => {"parents"=>[]}
hpo_data[1234]["parents"] << 6
# => [6]
hpo_data
# => {1234=>{"parents"=>[6]}}
hpo_data[42]["is_obsolete"] = true
# => true
hpo_data
# => {1234=>{"parents"=>[6]}, 42=>{"parents"=>[], "is_obsolete"=>true}}
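Putting the pieces together, here is a minimal sketch of the parser, not a definitive implementation. It assumes every line of interest has the form "key: value", as in the sample above, and it declares hpo_id before the loop: a variable first assigned inside a block is local to each call of that block, so it would otherwise be nil again on the next line. The method name parse_hpo_lines and the sample array are hypothetical, chosen only for illustration, and other stanza types that an OBO file may contain are not handled.

```ruby
# Hypothetical helper: parses "key: value" lines into a hash keyed by term ID.
def parse_hpo_lines(lines)
  # Any ID seen for the first time gets a hash with an empty parents array.
  hpo_data = Hash.new { |hash, key| hash[key] = { "parents" => [] } }
  hpo_id = nil # declared outside the block so it survives between lines

  lines.each do |line|
    key, value = line.chomp.split(": ", 2)
    next if value.nil? # lines such as "[Term]" have no "key: value" shape

    if key == "id"
      hpo_id = value
      hpo_data[hpo_id] # touching the key creates the entry
      next
    end
    next if hpo_id.nil? # ignore anything that appears before the first ID

    case key
    when "is_a"
      # Keep only the ID, dropping the " ! description" part.
      hpo_data[hpo_id]["parents"] << value.split(" ! ").first
    when "is_obsolete"
      hpo_data[hpo_id]["is_obsolete"] = (value == "true")
    when "replaced_by"
      hpo_data[hpo_id]["replaced_by"] = value
    end
  end
  hpo_data
end

sample = [
  "[Term]",
  "id: HP:0002578",
  "name: Gastroparesis",
  "is_a: HP:0002577 ! Abnormality of the stomach",
  "is_a: HP:0011804 ! Abnormal muscle physiology",
  "[Term]",
  "id: HP:0002564",
  "is_obsolete: true",
  "replaced_by: HP:0030680",
]
result = parse_hpo_lines(sample)
```

For a real file, the same method could be called as parse_hpo_lines(File.readlines("hp.obo")). Splitting on ": " instead of slicing fixed character ranges avoids depending on every ID having exactly the same length.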

Related

Creating a ruby nested hash with array as inner value

I am trying to create a nested hash where the inner values are arrays. For example
{"monday"=>{"morning"=>["John", "Katie", "Dave"],"afternoon"=>["Anne", "Charlie"]},
"tuesday"=>{"morning"=>["Joe"],"afternoon"=>["Chris","Tim","Melissa"]}}
I tried
h = Hash.new { |hash, key| hash[key] = Hash.new([]) }
When I try
h["monday"]["morning"].append("Ben")
and look at h, I get
{"monday" => {}}
rather than
{"monday" => {"morning"=>["Ben"]}}
I'm pretty new to Ruby, any suggestions for getting the functionality I want?
Close. You have to initialise a new hash as the value of the outer key, and give that nested hash its own default block that stores an Array under each new key:
h = Hash.new { |hash, key| hash[key] = Hash.new { |inner, inner_key| inner[inner_key] = Array.new } }
h["monday"]["morning"] << "Ben"
{"monday"=>{"morning"=>["Ben"]}}
This way you do not have to initialise an array every time you want to push a value. The outer block creates a nested hash for each new outer key, and the inner block creates an array for each new key of that nested hash, which you can push to with '<<'. Is this a solution to use in live code? Arguably not, since it's not very readable, but it shows a way of constructing data objects to fit your needs.
Refactored for Explicitness
While it's possible to create a nested initializer using the Hash#new block syntax, it's not really very readable and (as you've seen) it can be hard to debug. It may therefore be more useful to construct your nested hash in steps that you can inspect and debug as you go.
In addition, you already know ahead of time what your keys will be: the days of the week, and morning/afternoon shifts. For this use case, you might as well construct those upfront rather than relying on default values.
Consider the following:
require 'date'
# initialize your hash with a literal
schedule = {}
# use constant from Date module to initialize your
# lowercase keys
Date::DAYNAMES.each do |day|
  # create keys with empty arrays for each shift
  schedule[day.downcase] = {
    "morning" => [],
    "afternoon" => [],
  }
end
This seems more explicit and readable to me, but that's admittedly subjective. Meanwhile, calling pp schedule will show you the new data structure:
{"sunday"=>{"morning"=>[], "afternoon"=>[]},
"monday"=>{"morning"=>[], "afternoon"=>[]},
"tuesday"=>{"morning"=>[], "afternoon"=>[]},
"wednesday"=>{"morning"=>[], "afternoon"=>[]},
"thursday"=>{"morning"=>[], "afternoon"=>[]},
"friday"=>{"morning"=>[], "afternoon"=>[]},
"saturday"=>{"morning"=>[], "afternoon"=>[]}}
The new data structure can then have its nested array values assigned as you currently expect:
schedule["monday"]["morning"].append("Ben")
#=> ["Ben"]
As a further refinement, you could append to your nested arrays in a way that ensures you don't duplicate names within a scheduled shift. For example:
schedule["monday"]["morning"].<<("Ben").uniq!
schedule["monday"]
#=> {"morning"=>["Ben"], "afternoon"=>[]}
There are many ways to create the hash. One simple way is as follows.
days = [:monday, :tuesday]
day_parts = [:morning, :afternoon]
h = days.each_with_object({}) do |d,h|
  h[d] = day_parts.each_with_object({}) { |dp,g| g[dp] = [] }
end
#=> {:monday=>{:morning=>[], :afternoon=>[]},
# :tuesday=>{:morning=>[], :afternoon=>[]}}
Populating the hash will of course depend on the format of the data. For example, if the data were as follows:
people = { "John" =>[:monday, :morning],
"Katie" =>[:monday, :morning],
"Dave" =>[:monday, :morning],
"Anne" =>[:monday, :afternoon],
"Charlie"=>[:monday, :afternoon],
"Joe" =>[:tuesday, :morning],
"Chris" =>[:tuesday, :afternoon],
"Tim" =>[:tuesday, :afternoon],
"Melissa"=>[:tuesday, :afternoon]}
we could build the hash as follows.
people.each { |name,(day,day_part)| h[day][day_part] << name }
#=> {
# :monday=>{
# :morning=>["John", "Katie", "Dave"],
# :afternoon=>["Anne", "Charlie"]
# },
# :tuesday=>{
# :morning=>["Joe"],
# :afternoon=>["Chris", "Tim", "Melissa"]
# }
# }
In your question you defined
h = Hash.new{ |hash, key| hash[key] = Hash.new([]) }
and then tried
h["monday"]["morning"].append("Ben")
Instead, first assign an array to the key; after that you can use array methods like append:
h["monday"]["morning"] = []
h["monday"]["morning"].append("Ben")
This works and gives the desired result.
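To see why the explicit assignment matters, the following sketch compares the two kinds of default side by side. With Hash.new([]), reading a missing key hands back the default array without storing it, so appending changes the default itself and the key never appears. The variable names (shared, nested, leaked) are made up for this illustration.

```ruby
# Inner hashes get a plain default value: one array shared by all missing keys.
shared = Hash.new { |hash, key| hash[key] = Hash.new([]) }
shared["monday"]["morning"].append("Ben")
shared_keys = shared["monday"].keys      # "morning" was never stored
leaked = shared["monday"]["afternoon"]   # same default array, already holding "Ben"

# Inner hashes get a default block, which stores a fresh array under each new key.
nested = Hash.new do |hash, key|
  hash[key] = Hash.new { |inner, inner_key| inner[inner_key] = [] }
end
nested["monday"]["morning"].append("Ben")
nested_keys = nested["monday"].keys      # "morning" is now a real key

puts shared_keys.empty?            # true
puts leaked == ["Ben"]             # true
puts nested_keys == ["morning"]    # true
```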

how to pass variable from a class to another class in ruby

I'm trying to extract data from MongoDB into Elasticsearch. getMongodoc = coll.find().limit(10)
finds the first 10 entries in Mongo.
As you can see, result = ec.mongoConn should get its result from the method mongoConn() in class MongoConnector. When I use p hsh (to check that the output is correct), it prints 10 entries, while p result = ec.mongoConn prints #<Enumerator: #<Mongo::Cursor:0x70284070232580 @view=#<Mongo::Collection::View:0x70284066032180 namespace='mydatabase.mycollection' @filter={} @options={"limit"=>10}>>:each>
When I changed p hsh to return hsh, p result = ec.mongoConn gets a correct result, but it prints only the first entry, not all 10. It seems that the value of hsh is not passed to result = ec.mongoConn correctly. Can anyone tell me what I am doing wrong? Is it because of the way I call the method?
class MongoConncetor
  def mongoConn()
    BSON::OrderedHash.new
    client = Mongo::Client.new([ 'xx.xx.xx.xx:27017' ], :database => 'mydatabase')
    coll = client[:mycollection]
    getMongodoc = coll.find().limit(10)
    getMongodoc.each do |document|
      hsh = symbolize_keys(document.to_hash).select { |hsh| hsh != :_id }
      return hsh
      # p hsh
    end
  end
end
class ElasticConnector < MongoConncetor
  include Elasticsearch::API
  CONNECTION = ::Faraday::Connection.new url: 'http://localhost:9200'
  def perform_request(method, path, params, body)
    puts "--> #{method.upcase} #{path} #{params} #{body}"
    CONNECTION.run_request \
      method.downcase.to_sym,
      path,
      ((
      body ? MultiJson.dump(body) : nil)),
      {'Content-Type' => 'application/json'}
  end
end
ec = ElasticConnector.new
p result = ec.mongoConn
client = ElasticConnector.new
client.bulk index: 'myindex',
            type:'test' ,
            body: result
You are calling return inside a loop (each). This will stop the loop and return the first result. Try something like:
getMongodoc.map do |document|
symbolize_keys(document.to_hash).select { |hsh| hsh != :_id }
end
Notes:
In Ruby you usually don't need the return keyword, as the last evaluated value is returned automatically. Usually you'd use return to exit a method early and prevent the remaining code from being executed.
In Ruby, snake_case is used for variable and method names (as opposed to CamelCase or camelCase).
map enumerates a collection (by calling the block for every item in the collection) and returns a new collection of the same size with the return values from the block.
You don't need empty parens () on method definitions.
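The difference between return inside each and map can be reproduced without a database. The sketch below uses a plain array of hashes as a stand-in for the Mongo cursor; the method names first_only and all_docs and the sample documents are invented for this illustration.

```ruby
# Stand-in for the documents a cursor would yield.
docs = [
  { "_id" => 1, "name" => "a" },
  { "_id" => 2, "name" => "b" },
]

# return inside each leaves the method on the first iteration.
def first_only(docs)
  docs.each do |doc|
    return doc.reject { |key, _value| key == "_id" }
  end
end

# map collects the block's result for every document.
def all_docs(docs)
  docs.map { |doc| doc.reject { |key, _value| key == "_id" } }
end

puts first_only(docs).class  # Hash: only the first document
puts all_docs(docs).length   # 2: one hash per document
```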
UPDATE:
The data structure returned by MongoDB is a Hash (BSON is a special kind of serialization). A Hash is a collection of keys ("_id", "response") that point to values. The difference you point out in your comment is the class of the hash key: string vs. symbol.
In your case, a document in Mongo is represented as a Hash, one hash per document.
If you want to return multiple documents, then an array is required. More specifically, an array of hashes: [{}, {}, ...]
If your target (ES) accepts only one hash at a time, then you will need to loop over the results from Mongo and add them one by one:
list_of_results = get_mongo_data
list_of_results.each do |result|
add_result_to_es(result)
end

How to "split and group" an array of objects based on one of their properties

Context and Code Examples
I have an Array with instances of a class called TimesheetEntry.
Here is the constructor for TimesheetEntry:
def initialize(parameters = {})
  @date = parameters.fetch(:date)
  @project_id = parameters.fetch(:project_id)
  @article_id = parameters.fetch(:article_id)
  @hours = parameters.fetch(:hours)
  @comment = parameters.fetch(:comment)
end
I create an array of TimesheetEntry objects with data from a .csv file:
timesheet_entries = []
CSV.parse(source_file, csv_parse_options).each do |row|
  timesheet_entries.push(TimesheetEntry.new(
    :date => Date.parse(row['Date']),
    :project_id => row['Project'].to_i,
    :article_id => row['Article'].to_i,
    :hours => row['Hours'].gsub(',', '.').to_f,
    :comment => row['Comment'].to_s.empty? ? "N/A" : row['Comment']
  ))
end
I also have a Set of Hashes, each containing two elements, created like this:
all_timesheets = Set.new []
timesheet_entries.each do |entry|
  all_timesheets << { 'date' => entry.date, 'entries' => [] }
end
Now, I want to populate the Array inside of that Hash with TimesheetEntries.
Each Hash array must contain only TimesheetEntries of one specific date.
I have done that like this:
timesheet_entries.each do |entry|
  all_timesheets.each do |timesheet|
    if entry.date == timesheet['date']
      timesheet['entries'].push entry
    end
  end
end
While this approach gets the job done, it's not very efficient (I'm fairly new to this).
Question
What would be a more efficient way of achieving the same end result? In essence, I want to "split" the Array of TimesheetEntry objects, "grouping" objects with the same date.
You can fix the performance problem by replacing the Set with a Hash, which is a dictionary-like data structure.
This means that your inner loop all_timesheets.each do |timesheet| ... if entry.date ... will simply be replaced by a more efficient hash lookup: all_timesheets[entry.date].
Also, there's no need to create the keys in advance and then populate the date groups. These can both be done in one go:
all_timesheets = {}
timesheet_entries.each do |entry|
  all_timesheets[entry.date] ||= [] # create the key if it's not already there
  all_timesheets[entry.date] << entry
end
A nice thing about hashes is that you can customize their behavior when a non-existing key is encountered. You can use the constructor that takes a block to specify what happens in this case. Let's tell our hash to automatically add new keys and initialize them with an empty array. This allows us to drop the all_timesheets[entry.date] ||= [] line from the above code:
all_timesheets = Hash.new { |hash, key| hash[key] = [] }
timesheet_entries.each do |entry|
  all_timesheets[entry.date] << entry
end
There is, however, an even more concise way of achieving this grouping, using the Enumerable#group_by method:
all_timesheets = timesheet_entries.group_by { |e| e.date }
And, of course, there's a way to make this even more concise, using yet another trick:
all_timesheets = timesheet_entries.group_by(&:date)
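As a self-contained illustration of the grouping, the sketch below uses a small Struct as a stand-in for TimesheetEntry. The Entry struct, its two fields, and the sample values are assumptions made for the example, not the original class. It also shows one thing the grouped hash makes easy: summing hours per day.

```ruby
require "date"

# Simplified stand-in for TimesheetEntry (hypothetical, two fields only).
Entry = Struct.new(:date, :hours)

entries = [
  Entry.new(Date.new(2020, 1, 1), 2.0),
  Entry.new(Date.new(2020, 1, 2), 3.5),
  Entry.new(Date.new(2020, 1, 1), 1.0),
]

# One key per distinct date, each pointing at an array of that day's entries.
by_date = entries.group_by(&:date)

# Derived view: total hours per date.
hours_per_day = by_date.transform_values { |group| group.sum(&:hours) }

puts by_date.keys.length                      # 2
puts hours_per_day[Date.new(2020, 1, 1)]      # 3.0
```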

my_hash.keys == [], yet my_hash[key] gives a value?

I'm trying to demonstrate a situation where it's necessary to pass a block to Hash.new in order to set up default values for a given key when creating a hash of hashes.
To show what can go wrong, I've created the following code, which passes a single value as an argument to Hash.new. I expected all outer hash keys to wind up holding a reference to the same inner hash, causing the counts for the "piles" to get mixed together. And indeed, that does seem to have happened. But part_counts.each doesn't seem to find any keys/values to iterate over, and part_counts.keys returns an empty array. Only part_counts[0] and part_counts[1] successfully retrieve a value for me.
piles = [
  [:gear, :spring, :gear],
  [:axle, :gear, :spring],
]
# I do realize this should be:
# Hash.new {|h, k| h[k] = Hash.new(0)}
part_counts = Hash.new(Hash.new(0))
piles.each_with_index do |pile, pile_index|
  pile.each do |part|
    part_counts[pile_index][part] += 1
  end
end
p part_counts # => {}
p part_counts.keys # => []
# The next line prints no output
part_counts.each { |key, value| p key, value }
p part_counts[0] # => {:gear=>3, :spring=>2, :axle=>1}
For context, here is the corrected code that I intend to show after the "broken" code. The parts for each pile within part_counts are separated, as they should be. each and keys work as expected, as well.
# ...same pile initialization code as above...
part_counts = Hash.new {|h, k| h[k] = Hash.new(0)}
# ...same part counting code as above...
p part_counts # => {0=>{:gear=>2, :spring=>1}, 1=>{:axle=>1, :gear=>1, :spring=>1}}
p part_counts.keys # => [0, 1]
# The next line of code prints:
# 0
# {:gear=>2, :spring=>1}
# 1
# {:axle=>1, :gear=>1, :spring=>1}
part_counts.each { |key, value| p key, value }
p part_counts[0] # => {:gear=>2, :spring=>1}
But why don't each and keys work (at all) in the first sample?
We'll start by decomposing this a little bit:
part_counts = Hash.new(Hash.new(0))
That's the same as saying:
default_hash = { }
default_hash.default = 0
part_counts = { }
part_counts.default = default_hash
Later on, you're saying things like this:
part_counts[pile_index][part] += 1
That's the same as saying:
h = part_counts[pile_index]
h[part] += 1
You're not using the (correct) block form of the default value for your Hash so accessing the default value doesn't auto-vivify the key. That means that part_counts[pile_index] doesn't create a pile_index key in part_counts, it just gives you part_counts.default and you're really saying:
h = part_counts.default
h[part] += 1
You're not doing anything else to add keys to part_counts so it has no keys and:
part_counts.keys == [ ]
So why does part_counts[0] give us {:gear=>3, :spring=>2, :axle=>1}? part_counts doesn't have any keys and in particular doesn't have a 0 key so:
part_counts[0]
is the same as
part_counts.default
Up above, where you're accessing part_counts[pile_index], you're really just getting a reference to the default. The Hash won't clone it; you get the very same default object that the Hash will hand out next time. That means that:
part_counts[pile_index][part] += 1
is another way of saying:
part_counts.default[part] += 1
so you're actually just changing part_counts's default value in place. Then, when you evaluate part_counts[0], you're accessing this modified default value, and there's the {:gear=>3, :spring=>2, :axle=>1} that you accidentally built in your loop.
The value given to Hash.new is used as the default value, but this value is not inserted into the hash. So part_counts remains empty. You can get the default value by using part_counts[...], but this has no effect on the hash; it doesn't really contain the key.
When you call part_counts[pile_index][part] += 1, then part_counts[pile_index] returns the default value, and it's this value that is modified with the assignment, not part_counts.
You have something like:
outer = Hash.new({})
outer[1][2] = 3
p outer, outer[1]
which can also be written like:
inner = {}
outer = Hash.new(inner)
inner2 = outer[1] # inner2 refers to the same object as inner, outer is not modified
inner2[2] = 3 # same as inner[2] = 3
p outer, inner
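The two forms of Hash.new can be compared directly in a short sketch. The variable names shared and separate are chosen only for this example. The checks use key?, keys and equal? so that the missing keys and the single shared default object are visible without relying on how a hash is printed.

```ruby
# Default value form: every missing key returns the same inner hash.
shared = Hash.new(Hash.new(0))
shared[0][:gear] += 1
shared[1][:gear] += 1

puts shared.keys.empty?            # true: nothing was ever stored
puts shared.key?(0)                # false
puts shared[0].equal?(shared[1])   # true: both reads return the default object
puts shared.default[:gear]         # 2: both increments hit the default

# Block form: each missing key gets its own inner hash, stored under that key.
separate = Hash.new { |hash, key| hash[key] = Hash.new(0) }
separate[0][:gear] += 1
separate[1][:gear] += 1

puts separate.keys == [0, 1]       # true
puts separate[0][:gear]            # 1
```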

using an alias name for hash key value

I have some json data that I receive and that I JSON.parse to a hash. The hash key names are integer strings like data["0"], data["1"], data["2"], etc... where each value correspond to a state. Like 0 => START, 1 => STOP, 2 => RESTART.
I can't change the source json data to make the key more readable. Each hash will have 5 pairs that correspond to 5 different states.
I was wondering if there was a nice way for me to alias the numbers as meaningful names so when referencing the hash key value I don't have to use the number.
At the moment I'm using constants like below, but was thinking there might be a nicer, more Ruby way. Use another hash or struct so I can use data[STATES.start] or something?
STATE_START = "0"
STATE_STOP = "1"
STATE_RESTART = "2"
data = JSON.parse value
puts data[STATE_START]
Thanks
I think constants are fine. But if you want to rubify this code a bit, you can, for example, wrap the source hash in an object that will translate method names.
class MyHash
  def initialize(hash)
    @hash = hash
  end
  MAPPING = {
    start: '0',
    stop: '1',
    restart: '2',
  }
  # dynamically define methods like
  #
  # def start
  #   @hash['0']
  # end
  #
  # or you can use method_missing
  MAPPING.each do |method_name, hash_key|
    define_method method_name do
      @hash[hash_key]
    end
  end
end
mh = MyHash.new({'0' => 'foo', '1' => 'bar'})
mh.start # => "foo"
mh.stop # => "bar"
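If a wrapper class feels heavier than needed, another option is to rename the keys once, right after parsing. This is only a sketch: the STATE_NAMES constant and the sample JSON string are assumptions for illustration, and Hash#transform_keys requires Ruby 2.5 or later.

```ruby
require "json"

# Hypothetical mapping from the numeric string keys to readable names.
STATE_NAMES = { "0" => :start, "1" => :stop, "2" => :restart }.freeze

data = JSON.parse('{"0": "foo", "1": "bar", "2": "baz"}')

# Rename known keys; fetch with a fallback leaves unknown keys unchanged.
named = data.transform_keys { |key| STATE_NAMES.fetch(key, key) }

puts named[:start]    # foo
puts named[:restart]  # baz
```

The trade-off is that named is a copy, so later changes to data are not reflected in it, whereas the wrapper object reads from the original hash on every call.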

Resources