Filter large JSON file with Ruby

As a total beginner at programming, I am trying to filter a JSON file for my master's thesis at university. The file contains approximately 500 hashes, of which 115 are the ones I am interested in.
What I want to do:
(1) Filter the file and select the hashes I am interested in
(2) For each selected hash, return only some specific keys
The format of the array with the hashes ("loans") included:
{"header": {
"total":546188,
"page":868,
"date":"2013-04-11T10:21:24Z",
"page_size":500},
"loans": [{
"id":427853,
"name":"Peter Pan",
...,
"status":"expired",
"paid_amount":525,
...,
"activity":"Construction Supplies",
"sector":"Construction" },
... ]
}
Being specific, I would like to have the following:
(1) Filter out the "loans" hashes with "status":"expired"
(2) Return for each such "expired" loan certain keys only: "id", "name", "activity", ...
(3) Eventually, export all that into one file that I can analyse in Excel or with some stats software (SPSS or Stata)
What I have come up with myself so far is this:
require 'rubygems'
require 'json'
toberead = File.read('loans_868.json')
another = JSON.parse(toberead)
read = another.select {|hash| hash['status'] == 'expired'}
puts hash
This is obviously totally incomplete. And I feel totally lost.
Right now, I don't know where and how to continue. Despite having googled and read through tons of articles on how to filter JSON...
Is there anyone who can help me with this?

The JSON will be parsed into a hash in which 'header' is one key and 'loans' is another.
So after your JSON.parse line, you can do
loans = another['loans']
Now loans is an array of hashes, each hash representing one of your loans.
You can then do
expired_loans = loans.select {|loan| loan['status'] == 'expired'}
puts expired_loans
to get at your desired output.
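To cover steps (2) and (3) of the question as well, one possible sketch follows. It assumes Ruby's standard json and csv libraries and uses a small inline sample in place of loans_868.json; the sample values other than the "Peter Pan" loan, the wanted_keys list and the variable names are illustrative only.

```ruby
require 'json'
require 'csv'

# Inline sample standing in for File.read('loans_868.json') (assumed shape).
raw = <<~SAMPLE
  {"header": {"total": 2, "page": 1},
   "loans": [
     {"id": 427853, "name": "Peter Pan", "status": "expired",
      "paid_amount": 525, "activity": "Construction Supplies", "sector": "Construction"},
     {"id": 427854, "name": "Wendy", "status": "paid",
      "paid_amount": 300, "activity": "Farming", "sector": "Agriculture"}
   ]}
SAMPLE

# Keys to keep for each expired loan (hypothetical selection).
wanted_keys = %w[id name activity]

data = JSON.parse(raw)
expired_loans = data['loans'].select { |loan| loan['status'] == 'expired' }

# Pull out only the wanted values, in the same order as wanted_keys.
rows = expired_loans.map { |loan| loan.values_at(*wanted_keys) }

# Build CSV text with a header row.
csv_text = CSV.generate do |csv|
  csv << wanted_keys
  rows.each { |row| csv << row }
end

puts csv_text
```

To save the result, something like File.write('expired_loans.csv', csv_text) should work (the filename is an assumption); CSV files can be opened in Excel and imported into SPSS or Stata.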

Related

Iterate through hashes to find values predefined in an array

I have an array with hashes:
test = [
{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
{"type"=>1338, "age"=>18, "name"=>"John Doe"},
{"type"=>1339, "age"=>22, "name"=>"Carl Adley"},
{"type"=>1340, "age"=>25, "name"=>"Anna Brent"}
]
I am interested in getting all the hashes where the name key equals a value that can be found in an array:
get_hash_by_name = ["John Doe","Anna Brent"]
Which would end up in the following:
# test_sorted = would be:
# {"type"=>1338, "age"=>18, "name"=>"John Doe"}
# {"type"=>1340, "age"=>25, "name"=>"Anna Brent"}
I probably have to iterate with test.each somehow, but I am still trying to get a grasp of Ruby. Happy for all help!
Here's something to meditate on:
Iterating over an array to find something is slow, even if it's a sorted array. Computer languages have various structures we can use to improve the speed of lookups, and in Ruby Hash is usually a good starting point. Where an Array is like reading from a sequential file, a Hash is like reading from a random-access file: we can jump right to the record we need.
Starting with your test array-of-hashes:
test = [
{'type'=>1337, 'age'=>12, 'name'=>'Eric Johnson'},
{'type'=>1338, 'age'=>18, 'name'=>'John Doe'},
{'type'=>1339, 'age'=>22, 'name'=>'Carl Adley'},
{'type'=>1340, 'age'=>25, 'name'=>'Anna Brent'},
{'type'=>1341, 'age'=>13, 'name'=>'Eric Johnson'},
]
Notice that I added an additional "Eric Johnson" record. I'll get to that later.
I'd create a hash that mapped the array of hashes to a regular hash where the key of each pair is a unique value. The 'type' key/value pair appears to fit that need well:
test_by_types = test.map { |h| [h['type'], h] }.to_h
# => {1337=>{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
# 1338=>{"type"=>1338, "age"=>18, "name"=>"John Doe"},
# 1339=>{"type"=>1339, "age"=>22, "name"=>"Carl Adley"},
# 1340=>{"type"=>1340, "age"=>25, "name"=>"Anna Brent"},
# 1341=>{"type"=>1341, "age"=>13, "name"=>"Eric Johnson"}}
Now test_by_types is a hash using the type value to point to the original hash.
If I create a similar hash based on names, where each name, unique or not, points to the type values, I can do fast lookups:
test_by_names = test.each_with_object(
Hash.new { |h, k| h[k] = [] }
) { |e, h|
h[e['name']] << e['type']
}.to_h
# => {"Eric Johnson"=>[1337, 1341],
# "John Doe"=>[1338],
# "Carl Adley"=>[1339],
# "Anna Brent"=>[1340]}
Notice that "Eric Johnson" points to two records.
Now, here's how we look up things:
get_hash_by_name = ['John Doe', 'Anna Brent']
test_by_names.values_at(*get_hash_by_name).flatten
# => [1338, 1340]
In one quick lookup Ruby returned the matching types by looking up the names.
We can take that output and grab the original hashes:
test_by_types.values_at(*test_by_names.values_at(*get_hash_by_name).flatten)
# => [{"type"=>1338, "age"=>18, "name"=>"John Doe"},
# {"type"=>1340, "age"=>25, "name"=>"Anna Brent"}]
Because this is running against hashes, it's fast. The hashes can be BIG and it'll still run very fast.
Back to "Eric Johnson"...
When dealing with people's names, collisions are likely, which is why test_by_names allows multiple type values, so with one lookup all the matching records can be retrieved:
test_by_names.values_at('Eric Johnson').flatten
# => [1337, 1341]
test_by_types.values_at(*test_by_names.values_at('Eric Johnson').flatten)
# => [{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
# {"type"=>1341, "age"=>13, "name"=>"Eric Johnson"}]
This will be a lot to chew on if you're new to Ruby, but the Ruby documentation covers it all, so dig through the Hash, Array and Enumerable class documentation.
Also, *, AKA "splat", explodes the array elements from the enclosing array into separate parameters suitable for passing into a method. I can't remember where that's documented.
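As a small illustration of what the splat does in the values_at calls above, here is a sketch with a cut-down index hash (the index and names variables are made up for the example):

```ruby
# A hash index shaped like test_by_names from the answer above.
index = { 'John Doe' => [1338], 'Anna Brent' => [1340], 'Carl Adley' => [1339] }
names = ['John Doe', 'Anna Brent']

# Without the splat, the whole array is passed as ONE key, which the hash does not contain.
without_splat = index.values_at(names)

# With the splat, each element becomes its own argument,
# as if we had written index.values_at('John Doe', 'Anna Brent').
with_splat = index.values_at(*names)

p without_splat # => [nil]
p with_splat    # => [[1338], [1340]]
```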
If you're familiar with database design this will look very familiar, because it's similar to how we do database lookups.
The point of all of this is that it's really important to consider how you're going to store your data when you first ingest it into your program. Do it wrong and you'll jump through major hoops trying to do useful things with it. Do it right and the code and data will flow through very easily, and you'll be able to massage/extract/combine the data easily.
Said differently, Arrays are containers useful for holding things you want to access sequentially, such as jobs you want to print, sites you need to access in order, or files you want to delete in a specific order, but they're lousy when you want to look up and work with a record randomly.
Knowing which container is appropriate is important, and for this particular task, it appears that an array of hashes isn't appropriate, since there's no fast way of accessing specific ones.
And that's why I made my comment above asking what you were trying to accomplish in the first place. See "What is the XY problem?" and "XyProblem" for more about that particular question.
You can use select and include? so
test.select {|object| get_hash_by_name.include? object['name'] }
…should do the job.

Ruby - Merge two hashes with no like keys based on matching value

I would like to find an efficient way to merge two hashes together and the resulting hash must contain all original data and a new key/value pair based on criteria below. There are no keys in common between the two hashes, however the key in one hash matches the value of a key in the adjacent hash.
Also note that the second hash is actually an array of hashes.
I am working with a relatively large data set, so looking for an efficient solution but hoping to keep the code readable at the same time since it will likely end up in production.
Here is the structure of my data:
# Hash
hsh1 = { "devicename1"=>"active", "devicename2"=>"passive", "devicename3"=>"passive" }
# Array of Hashes
hsh2 = [ { "host" => "devicename3", "secure" => true },
{ "host" => "devicename2", "secure" => true },
{ "host" => "devicename1", "secure" => false } ]
Here is what I need to accomplish:
I need to merge the data from hsh1 into hsh2, keeping all of the original key/value pairs in hsh2 and adding a new key called activation_status using the data in hsh1.
The resulting hsh2 would be as follows:
hsh2 = [{ "host"=>"devicename3", "secure"=>true, "activation_status"=>"passive" },
{ "host"=>"devicename2", "secure"=>true, "activation_status"=>"passive" },
{ "host"=>"devicename1", "secure"=>false, "activation_status"=>"active" }]
This may already be answered on StackOverflow but I looked for some time and couldn't find a match. My apologies in advance if this is a duplicate.
I suggest something along the lines of:
hsh3 = hsh2.map do |nestling|
host = nestling["host"]
status = hsh1[host]
nestling["activation_status"] = status
nestling
end
Which of course you can shrink down a bit. This version uses fewer variables and edits hsh2 in place:
hsh2.each do |nestling|
nestling["activation_status"] = hsh1[nestling["host"]]
end
This will do it:
hsh2.map { |h| h.merge 'activation_status' => hsh1[h['host']] }
However, merge returns a new hash for each element, so this makes a copy of the data instead of just walking the array of hashes and adding the appropriate key=>value pair. I don't think it would have a huge impact on performance unless your data set is large enough to consume a significant portion of the memory allocated to your app.
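The difference between the copying and the in-place approach can be checked with the data from the question. This is only a sketch; the merged and untouched variables are names introduced for the example.

```ruby
hsh1 = { 'devicename1' => 'active', 'devicename2' => 'passive', 'devicename3' => 'passive' }
hsh2 = [
  { 'host' => 'devicename3', 'secure' => true },
  { 'host' => 'devicename2', 'secure' => true },
  { 'host' => 'devicename1', 'secure' => false }
]

# Non-mutating: merge returns a new hash per element, so hsh2 is left as it was.
merged = hsh2.map { |h| h.merge('activation_status' => hsh1[h['host']]) }
untouched = hsh2.none? { |h| h.key?('activation_status') }

# Mutating: each hash inside hsh2 gains the new key in place.
hsh2.each { |h| h['activation_status'] = hsh1[h['host']] }

p merged.first
p untouched      # true: the map/merge step did not change hsh2
p hsh2 == merged # true: both approaches end with the same content
```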

Parse text/json data in Ruby

I am collecting HTTP response and it comes back in the text/json form. The original format is as follows:
{"param" => "value", "interesting_param" => [{"parama1"=>vala1,"parama2"=>vala2,"parama3"=>vala3,"parama4"=>vala4,"parama5"=>vala5},
{"paramb1"=>valb1,"paramb2"=>valb2,"paramb3"=>valb3,"paramb4"=>valb4,"paramb5"=>valb5}]}
When I do a JSON.parse(response.body)["interesting_param"], I can retrieve this output:
{"parama1"=>vala1,"parama2"=>vala2,"parama3"=>vala3,"parama4"=>vala4,"parama5"=>vala5},
{"paramb1"=>valb1,"paramb2"=>valb2,"paramb3"=>valb3,"paramb4"=>valb4,"paramb5"=>valb5}
How can I capture only the following from the full result set above?
`parama1-vala1`, `parama2-vala2` and `parama5-vala5`
`paramb1-valb1`, `paramb2-valb2` and `paramb5-valb5`
Update
I tried further on this, and now I am thinking of making use of a loop.
The way I am attempting to do this is:
Find the count of records, for example, if:
test =
{"parama1"=>vala1,"parama2"=>vala2,"parama3"=>vala3,"parama4"=>vala4,"parama5"=>vala5},
{"paramb1"=>valb1,"paramb2"=>valb2,"paramb3"=>valb3,"paramb4"=>valb4,"paramb5"=>valb5}
Then, test.count will be 2.
Now if somehow I can use a loop to iterate over elements in test, then I might be able to capture specific elements.
Thanks.
It looks like you want to map each hash into a list of strings made by joining the string version of the key with the string version of the value, joined by a '-'.
JSON.parse(response.body)["interesting_param"]
The above code should give you a Ruby array of hashes.
interesting_bits = JSON.parse(response.body)["interesting_param"]
result = interesting_bits.map{|bit| bit.map{|k,v| "#{k}-#{v}"}}
Something like that should do the trick.
puts result.inspect
#prints
# [ ["parama1-vala1","parama2-vala2","parama3-vala3","parama4-vala4","parama5-vala5"] , ["paramb1-valb1","paramb2-valb2","paramb3-valb3","paramb4-valb4","paramb5-valb5"] ]
I don't understand what criteria you are using to filter this down to just 1, 2 and 5, but that is easily done too. I would do that to the hashes before converting them to string lists.
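Filtering the hashes first could look roughly like the sketch below. The response body is hypothetical (the real values were elided in the question), and the criterion "keep keys ending in 1, 2 or 5" is only a guess at what is wanted.

```ruby
require 'json'

# Hypothetical response body standing in for response.body.
body = '{"param":"value","interesting_param":[' \
       '{"parama1":"vala1","parama2":"vala2","parama3":"vala3","parama4":"vala4","parama5":"vala5"},' \
       '{"paramb1":"valb1","paramb2":"valb2","paramb3":"valb3","paramb4":"valb4","paramb5":"valb5"}]}'

interesting_bits = JSON.parse(body)['interesting_param']

# Assumed criterion: keep only the pairs whose key ends in 1, 2 or 5,
# then turn each remaining pair into a "key-value" string.
result = interesting_bits.map do |bit|
  bit.select { |key, _value| key.end_with?('1', '2', '5') }
     .map { |key, value| "#{key}-#{value}" }
end

p result
```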

Can you add to an array in Ruby whose name is dependent upon a variable?

Let's suppose a store owner wants to know how well his products are selling around the world, and which are selling best where.
He has the following data: |ID,Currency,Quantity,Location|
Rather than iterate through the data for each currency (data set > 10,000), is there a way to put the data into arrays specific to the currency without explicit designation...i.e., is there a way to avoid
if curr == "USD"; USDid << ID; USDquan << Quantity
elsif...
...and so on? For the purposes of this question, assume the *id and *quan arrays exist for the currencies under observation.
Is there some sort of regex trick that can look at the currency and put the data in the appropriate arrays?
Yes. Use a hash of arrays instead of multiple arrays:
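# The block creates and stores a fresh set of arrays for each new currency.
sale_data = Hash.new { |hash, curr| hash[curr] = {"ID" => [], "Quantity" => [], "Location" => []} }
# Later...
sale_data[curr]["ID"] << id; sale_data[curr]["Quantity"] << quan; #Etc..
The block passed to Hash.new makes it so you can add as many currencies as you want without ever predefining them. So, anywhere in your code, if there is no prior entry for, say, USD, one is created when you add data. (Assigning a single hash with default= would not work here: the same hash would be shared by every currency and never stored under any key.)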
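A runnable sketch of the hash-of-arrays idea, using Hash.new with a block so that each currency gets its own fresh arrays. The sample rows and the lowercase variable names are assumptions made for the example.

```ruby
# Hypothetical rows in the question's |ID,Currency,Quantity,Location| layout.
rows = [
  [1, 'USD', 10, 'New York'],
  [2, 'EUR', 5,  'Berlin'],
  [3, 'USD', 7,  'Chicago']
]

# The block runs once per missing currency and stores a new hash under that key.
sale_data = Hash.new do |hash, currency|
  hash[currency] = { 'ID' => [], 'Quantity' => [], 'Location' => [] }
end

rows.each do |id, curr, quan, location|
  sale_data[curr]['ID'] << id
  sale_data[curr]['Quantity'] << quan
  sale_data[curr]['Location'] << location
end

p sale_data.keys
p sale_data['USD']['Quantity']
```

No if/elsif chain per currency is needed; a currency that has never been seen simply gets its entry on first use.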

How do you modify array mapping data structure resultant from Ruby map?

I believe that I may be missing something here, so please bear with me as I explain two scenarios in hopes to reconcile my misunderstanding:
My end goal is to create a dataset that's acceptable by Highcharts via lazy_high_charts, however in this quest, I'm finding that it is rather particular about the format of data that it receives.
A) I have found that when data is formatted like this going into it, it draws the points just fine:
[0.0000001240,0.0000000267,0.0000000722, ..., 0.0000000512]
I'm able to generate an array like this simply with:
array = Array.new
data.each do |row|
array.push row[:datapoint1].to_f
end
B) Yet, if I attempt to use the map function, I end up with a result like this, which Highcharts fails to render:
[[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], ..., [3.79e-09]]
From code like:
array = data.map{|row| [(row.datapoint1.to_f)] }
Is there a way to coax the map function in B into producing results more akin to the data structure in scenario A?
This gets more involved as I also have to add datetime into this; however, that's another topic, and I just want to understand this first and what can be done to perhaps further control where I'm going.
Ultimately, EVEN SCENARIO B SHOULD WORK according to the data in the example here: http://www.highcharts.com/demo/spline-irregular-time (press the "View options" button at bottom)
Heck, I'll send you a sucker in the mail if you can fill me in on that part! ;)
You can fix arrays like this
[[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], ..., [3.79e-09]]
that have nested arrays inside them by using the flatten method on the array.
But you should be able to avoid generating nested arrays in the first place. Just remove the square brackets from your map line:
array = data.map{|row| row.datapoint1.to_f }
Code
a = [[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], [3.79e-09]]
b = a.flatten.map{|el| "%.10f" % el }
puts b.inspect
Output
["0.0000000067", "0.0000000044", "0.0000000021", "0.0000000025", "0.0000000038"]
Unless I, too, am missing something, your problem is that you're returning a single-element array from your block (thereby creating an array of arrays) instead of just the value. This should do you:
array = data.map {|row| row.datapoint1.to_f }
# => [ 6.67e-09, 4.39e-09, 2.1e-09, 2.52e-09, ..., 3.79e-09 ]
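The two block styles can be compared side by side in a short sketch. The row_class Struct and the sample strings are hypothetical stand-ins for the question's data rows.

```ruby
# Hypothetical stand-in for rows that respond to datapoint1.
row_class = Struct.new(:datapoint1)
data = [row_class.new('0.00000000667'), row_class.new('0.00000000439')]

# Returning a one-element array from the block yields an array of arrays.
nested = data.map { |row| [row.datapoint1.to_f] }

# Returning the bare value yields a flat array of floats.
flat = data.map { |row| row.datapoint1.to_f }

p nested
p flat
p nested.flatten == flat # flatten repairs the nested form after the fact
```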

Resources