Breaking out variably deeply nested hashes into separate hashes with Ruby - ruby

I'm pretty new to Ruby, but I've done a ton of searches, research here on Stack, and experimentation.
I'm getting POST data that contains variable information which I am able to convert into a hash from XML.
My objectives are to:
Get and store the parent key hierarchy.
I'm creating MongoDB records of what I get via these POSTs, and I need to record which keys I receive, storing any new ones that aren't already part of the collection's keys.
Once I have the key hierarchy stored, I need to take the nested hash and break out each top-level key and its children into another hash. These will end up as individual subdocuments in a MongoDB record.
A big obstacle is that I won't know the hierarchy structure or any of the key names up front, so I have to create a parser that doesn't really care what is in the hash; it just organizes the key structure and breaks the hash up into separate hashes, one for each 'top level' key it contains.
I have a nested hash:
{"hashdata"=>
{"ComputersCount"=>
{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}},
"ScansCount"=>
{"Total"=>8,
"Scheduled"=>8,
"Agent"=>0,
"ByScanningProfile"=>{"Profile"=>{"Missing Patches"=>8}}},
"RemediationsCount"=>{"Total"=>1, "ByType"=>{"Type"=>{"9"=>1}}},
"AgentsCount"=>{"Total"=>0},
"RelaysCount"=>{"Total"=>0},
"ScanResultsDatabase"=>{"Type"=>"MSAccess"}}}
In this example, ignoring the 'hashdata' key, the 'top level' parents are:
ComputersCount
ScansCount
RemediationsCount
AgentsCount
RelaysCount
ScanResultsDatabase
So ideally, I would end up with a hash of each parent key and its children keys, and a separate hash for each of the top level parents.
EDIT: I'm not sure of the best way to articulate the 'keys hash', but I know it needs to capture the hierarchy structure, i.e. what level each key sits at and which parent it belongs to.
For the separate hashes themselves it could be as simple as:
{"ComputersCount"=>{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}}}
{"ScansCount"=>{"Total"=>8,"Scheduled"=>8,"Agent"=>0,"ByScanningProfile"=>{"Profile"=>{"Missing Patches"=>8}}}}
{"RemediationsCount"=>{"Total"=>1, "ByType"=>{"Type"=>{"9"=>1}}}}
{"AgentsCount"=>{"Total"=>0}}
{"RelaysCount"=>{"Total"=>0}}
{"ScanResultsDatabase"=>{"Type"=>"MSAccess"}}}
My ultimate goal is to take the key collections and the hash collections and store them in MongoDB; each sub-hash becomes a sub-document, and the keys collection gives me a field-name map for the collection so it can be queried against later.
I've come close to a solution using some recursive methods, for example:
def recurse_hash(h, parent = nil)
  h.each_pair do |k, v|
    case v
    when String, Fixnum
      puts "Parent: #{parent}, Key: #{k}, Value: #{v}"
    when Hash
      recurse_hash(v, k)
    else
      raise ArgumentError, "Unhandled type #{v.class}"
    end
  end
end
But so far, I've only been able to get close to what I'm after. Ultimately, I need to be prepared to get hashes with any level of nesting or value structures because the POST data is highly variable.
Any advice, guidance or other assistance here would be greatly appreciated - I realize I could very well be approaching this entire challenge incorrectly.

Looks like you want an array of hashes like the following:
array = hash["hashdata"].map { |k,v| { k => v } }
# => [{"ComputersCount"=>{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}}}, ... ]
array.first
# => {"ComputersCount"=>{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}}}
array.last
# => {"ScanResultsDatabase"=>{"Type"=>"MSAccess"}}

Here's my best guess at the "key structure hierarchy and parentage."
I gently suggest that it is overkill.
Instead, I think that all you really need to do is just store your hashdata directly as MongoDB documents.
Even if your POST data is highly variable,
in all likelihood it will still be sufficiently well-formed that you can write your application without difficulty.
Here's a test that incorporates "key structure hierarchy and parentage",
but maybe more importantly just shows how trivial it is to store your hashdata directly as a MongoDB document.
The test is run twice to demonstrate new key discovery.
test.rb
require 'mongo'
require 'test/unit'
require 'pp'
def key_structure(h)
  h.keys.sort.collect { |k| v = h[k]; v.is_a?(Hash) ? [k, key_structure(h[k])] : k }
end

class MyTest < Test::Unit::TestCase
  def setup
    @hash_data_coll = Mongo::MongoClient.new['test']['hash_data']
    @hash_data_coll.remove
    @keys_coll = Mongo::MongoClient.new['test']['keys']
  end

  test "key structure hierarchy and parentage" do
    hash_data = {
      "hashdata" =>
        {"ComputersCount" =>
          {"Total" => 1, "Licensed" => 1, "ByOS" => {"OS" => {"Windows 7 x64" => 1}}},
         "ScansCount" =>
          {"Total" => 8,
           "Scheduled" => 8,
           "Agent" => 0,
           "ByScanningProfile" => {"Profile" => {"Missing Patches" => 8}}},
         "RemediationsCount" => {"Total" => 1, "ByType" => {"Type" => {"9" => 1}}},
         "AgentsCount" => {"Total" => 0},
         "RelaysCount" => {"Total" => 0},
         "ScanResultsDatabase" => {"Type" => "MSAccess"}}}
    known_keys = @keys_coll.find.to_a.collect { |doc| doc['key'] }.sort
    puts "known keys: #{known_keys}"
    hash_data_keys = hash_data['hashdata'].keys.sort
    puts "hash data keys: #{hash_data_keys.inspect}"
    new_keys = hash_data_keys - known_keys
    puts "new keys: #{new_keys.inspect}"
    @keys_coll.insert(new_keys.collect { |key| {key: key, structure: key_structure(hash_data['hashdata'][key]), timestamp: Time.now} }) unless new_keys.empty?
    pp @keys_coll.find.to_a unless new_keys.empty?
    @hash_data_coll.insert(hash_data['hashdata'])
    assert_equal(1, @hash_data_coll.count)
    pp @hash_data_coll.find.to_a
  end
end
$ ruby test.rb
Loaded suite test
Started
known keys: []
hash data keys: ["AgentsCount", "ComputersCount", "RelaysCount", "RemediationsCount", "ScanResultsDatabase", "ScansCount"]
new keys: ["AgentsCount", "ComputersCount", "RelaysCount", "RemediationsCount", "ScanResultsDatabase", "ScansCount"]
[{"_id"=>BSON::ObjectId('535976177f11ba278d000001'),
"key"=>"AgentsCount",
"structure"=>["Total"],
"timestamp"=>2014-04-24 20:37:43 UTC},
{"_id"=>BSON::ObjectId('535976177f11ba278d000002'),
"key"=>"ComputersCount",
"structure"=>[["ByOS", [["OS", ["Windows 7 x64"]]]], "Licensed", "Total"],
"timestamp"=>2014-04-24 20:37:43 UTC},
{"_id"=>BSON::ObjectId('535976177f11ba278d000003'),
"key"=>"RelaysCount",
"structure"=>["Total"],
"timestamp"=>2014-04-24 20:37:43 UTC},
{"_id"=>BSON::ObjectId('535976177f11ba278d000004'),
"key"=>"RemediationsCount",
"structure"=>[["ByType", [["Type", ["9"]]]], "Total"],
"timestamp"=>2014-04-24 20:37:43 UTC},
{"_id"=>BSON::ObjectId('535976177f11ba278d000005'),
"key"=>"ScanResultsDatabase",
"structure"=>["Type"],
"timestamp"=>2014-04-24 20:37:43 UTC},
{"_id"=>BSON::ObjectId('535976177f11ba278d000006'),
"key"=>"ScansCount",
"structure"=>
["Agent",
["ByScanningProfile", [["Profile", ["Missing Patches"]]]],
"Scheduled",
"Total"],
"timestamp"=>2014-04-24 20:37:43 UTC}]
[{"_id"=>BSON::ObjectId('535976177f11ba278d000007'),
"ComputersCount"=>
{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}},
"ScansCount"=>
{"Total"=>8,
"Scheduled"=>8,
"Agent"=>0,
"ByScanningProfile"=>{"Profile"=>{"Missing Patches"=>8}}},
"RemediationsCount"=>{"Total"=>1, "ByType"=>{"Type"=>{"9"=>1}}},
"AgentsCount"=>{"Total"=>0},
"RelaysCount"=>{"Total"=>0},
"ScanResultsDatabase"=>{"Type"=>"MSAccess"}}]
.
Finished in 0.028869 seconds.
1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
34.64 tests/s, 34.64 assertions/s
$ ruby test.rb
Loaded suite test
Started
known keys: ["AgentsCount", "ComputersCount", "RelaysCount", "RemediationsCount", "ScanResultsDatabase", "ScansCount"]
hash data keys: ["AgentsCount", "ComputersCount", "RelaysCount", "RemediationsCount", "ScanResultsDatabase", "ScansCount"]
new keys: []
[{"_id"=>BSON::ObjectId('535976197f11ba278e000001'),
"ComputersCount"=>
{"Total"=>1, "Licensed"=>1, "ByOS"=>{"OS"=>{"Windows 7 x64"=>1}}},
"ScansCount"=>
{"Total"=>8,
"Scheduled"=>8,
"Agent"=>0,
"ByScanningProfile"=>{"Profile"=>{"Missing Patches"=>8}}},
"RemediationsCount"=>{"Total"=>1, "ByType"=>{"Type"=>{"9"=>1}}},
"AgentsCount"=>{"Total"=>0},
"RelaysCount"=>{"Total"=>0},
"ScanResultsDatabase"=>{"Type"=>"MSAccess"}}]
.
Finished in 0.015559 seconds.
1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
64.27 tests/s, 64.27 assertions/s

Related

build a hash from iterating over a hash with nested arrays

I'd like to structure data I get back from an Instagram API call:
{"attribution"=>nil,
"tags"=>["loudmouth"],
"location"=>{"latitude"=>40.7181015, "name"=>"Fontanas Bar", "longitude"=>-73.9922791, "id"=>31443955},
"comments"=>{"count"=>0, "data"=>[]},
"filter"=>"Normal",
"created_time"=>"1444181565",
"link"=>"https://instagram.com/p/8hJ-UwIDyC/",
"likes"=>{"count"=>0, "data"=>[]},
"images"=>
{"low_resolution"=>{"url"=>"https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s320x320/e35/12145134_169501263391761_636095824_n.jpg", "width"=>320, "height"=>320},
"thumbnail"=>
{"url"=>"https://scontent.cdninstagram.com/hphotos-xfa1/t51.2885-15/s150x150/e35/c135.0.810.810/12093266_813307028768465_178038954_n.jpg", "width"=>150, "height"=>150},
"standard_resolution"=>
{"url"=>"https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s640x640/sh0.08/e35/12145134_169501263391761_636095824_n.jpg", "width"=>640, "height"=>640}},
"users_in_photo"=>
[{"position"=>{"y"=>0.636888889, "x"=>0.398666667},
"user"=>
{"username"=>"ambersmelson",
"profile_picture"=>"http://photos-h.ak.instagram.com/hphotos-ak-xfa1/t51.2885-19/11909108_1492226137759631_1159527917_a.jpg",
"id"=>"194780705",
"full_name"=>""}}],
"caption"=>
{"created_time"=>"1444181565",
"text"=>"the INCOMPARABLE Amber Nelson closing us out! #loudmouth",
"from"=>
{"username"=>"alex3nglish",
"profile_picture"=>"http://photos-f.ak.instagram.com/hphotos-ak-xaf1/t51.2885-19/s150x150/11906214_483262888501413_294704768_a.jpg",
"id"=>"30822062",
"full_name"=>"Alex English"}}
I'd like to structure it in this way:
hash = {"item1" =>
  {:location => {"latitude"=>40.7181015, "name"=>"Fontanas Bar", "longitude"=>-73.9922791, "id"=>31443955},
   :created_time => "1444181565",
   :images => "https://scontent.cdninstagram.com/hphotos-xaf1/t51.2885-15/s320x320/e35/12145134_169501263391761_636095824_n.jpg",
   :user => "Alex English"}}
I'm iterating over 20 objects, each with their location, images, etc. How can I get a hash structure like the one above?
This is what I've tried:
array_images = Array.new
# iterate through response object to extract what is needed
response.each do |item|
  array_images << { :image => item.images.low_resolution.url,
                    :location => item.location,
                    :created_time => Time.at(item.created_time.to_i),
                    :user => item.user.full_name }
end
Which works fine. So what is the better way, the fastest one?
The hash that you gave is one item in the array stored at the key "data" in a larger hash right? At least that's how it is for the tags/ endpoint so I'll assume it's the same here. (I'm referring to that array of hashes as data)
hash = {}
data.each_with_index do |h, idx|
  hash["item#{idx + 1}"] = {
    location: h["location"], # This grabs the entire hash at "location" because you want all of that data
    created_time: h["created_time"],
    image: h["images"]["low_resolution"]["url"], # You can replace this with whichever resolution.
    caption: h["caption"]["from"]["full_name"]
  }
end
I feel like you want a simpler solution, but I'm not sure how that's going to happen, since you want things nested at different levels and you are pulling values from diverse levels of nesting.
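If you are on Ruby 2.3+ you could also lean on Hash#dig so that a missing intermediate key yields nil instead of raising; a sketch of the same loop, with the same assumptions about the data array as above:
hash = {}
data.each_with_index do |h, idx|
  hash["item#{idx + 1}"] = {
    location:     h["location"],
    created_time: h["created_time"],
    image:        h.dig("images", "low_resolution", "url"),   # nil if any level is missing
    caption:      h.dig("caption", "from", "full_name")
  }
end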

Stream based parsing and writing of JSON

I fetch about 20,000 datasets from a server, in batches of 1,000. Each dataset is a JSON object. Persisted, this comes to around 350 MB of uncompressed plain text.
I have a memory limit of 1GB. Hence, I write each 1,000 JSON objects as an array into a raw JSON file in append mode.
The result is a file with 20 JSON arrays which needs to be aggregated. I need to touch them anyway, because I want to add metadata. Generally the Ruby Yajl Parser makes this possible like so:
raw_file = File.new(path_to_raw_file, 'r')
json_file = File.new(path_to_json_file, 'w')
datasets = []
parser = Yajl::Parser.new
parser.on_parse_complete = Proc.new { |o| datasets += o }
parser.parse(raw_file)
hash = { date: Time.now, datasets: datasets }
Yajl::Encoder.encode(hash, json_file)
Where is the problem with this solution? The problem is that the whole JSON is still parsed into memory, which is exactly what I must avoid.
Basically what I need is a solution which parses the JSON from an IO object and encodes it to another IO object at the same time.
I assumed Yajl offers this, but I haven't found a way, nor did its API give any hints, so I guess not. Is there a JSON Parser library which supports this? Are there other solutions?
The only solution I can think of is to use the IO.seek capabilities: write all the dataset arrays one after another ([...][...][...]), and after every array, seek back and overwrite the ][ boundary with a comma, effectively connecting the arrays manually.
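For what it's worth, that seek-and-patch idea can be sketched roughly like this (illustration only; batches is a hypothetical enumerable of the 1,000-dataset groups and the file name is made up):
require 'yajl'

File.open('raw.json', 'w+') do |f|
  batches.each_with_index do |batch, i|
    json = Yajl::Encoder.encode(batch)   # one JSON array per (non-empty) batch
    if i.zero?
      f.write(json)
    else
      f.seek(-1, IO::SEEK_END)           # sit on the closing "]"
      f.write(',')                       # overwrite it with a comma
      f.write(json[1..-1])               # append the batch minus its leading "["
    end
  end
end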
Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, then emit it with a trailing/delimiting comma?
If you started with a file that only contained [, then appended all your JSON strings, then, on the final entry didn't append a comma, and instead used a closing ], you'd have a JSON array of hashes, and would only have to process one row's worth at a time.
It'd be a tiny bit slower (maybe) but wouldn't impact your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.
For instance, here's a combination of some Sequel example code, and code to extract the rows as JSON and build a larger JSON structure:
require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)

add_comma = false
puts '['
items.order(:price).each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"
Which outputs:
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
Notice the order is now by "price".
Validation is easy:
require 'json'
require 'pp'
pp JSON[<<EOT]
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
EOT
Which results in:
[{"id"=>2, "name"=>"def", "price"=>3.714714089426208},
{"id"=>3, "name"=>"ghi", "price"=>27.0179624376119},
{"id"=>1, "name"=>"abc", "price"=>52.51248221170203}]
This validates the JSON and demonstrates that the original data is recoverable. Each row retrieved from the database should be a minimal "bitesized" piece of the overall JSON structure you want to build.
Building upon that, here's how to read incoming JSON in the database, manipulate it, then emit it as a JSON file:
require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :json
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
items.insert(:json => JSON[:name => 'def', :price => rand * 100])
items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])

add_comma = false
puts '['
items.each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[
    JSON[
      item[:json]
    ].merge('foo' => 'bar', 'time' => Time.now.to_f)
  ]
end
puts "\n]"
Which generates:
[
{"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606},
{"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664},
{"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702},
{"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732},
{"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758},
{"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803},
{"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831},
{"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862},
{"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894}
]
I added the timestamp so you can see that each row is processed individually, AND to give you an idea how fast the rows are being processed. Granted, this is a tiny, in-memory database with no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable DB host should be pretty fast too. Telling the ORM to read the DB in chunks can speed up the processing, because the DBM will be able to return larger blocks to more efficiently fill the packets. You'll have to experiment to determine what size chunks you need, because it will vary based on your network, your hosts, and the size of your records.
Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables appear minuscule. VM slices are common these days and we use them for crunching, so they're often the PCs of yesteryear: a single CPU with a small memory footprint and dinky drives. We can't beat them up or they'll become bottlenecks, so we have to break the data into the smallest atomic pieces we can.
Harping about DB design: Storing JSON in a database is a questionable practice. DBMs these days can spew JSON, YAML and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML or XML strings is a major hit in processing speed, so avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields so your searches are at the highest possible speed. If the data is available in separate fields, then doing good ol' database queries, tweaking in the DBM or your scripting language of choice, and emitting the massaged data becomes a lot easier.
It is possible via JSON::Stream or Yajl::FFI gems. You will have to write your own callbacks though. Some hints on how to do that can be found here and here.
Facing a similar problem, I created the json-streamer gem, which will spare you the need to write your own callbacks. It will yield each object to you one by one, removing it from memory afterwards. You could then pass these to another IO object as intended.
There is a library called oj that does exactly that. It can do parsing and generation. For example, for parsing you can use Oj::Doc:
Oj::Doc.open('[3,[2,1]]') do |doc|
  result = {}
  doc.each_leaf() do |d|
    result[d.where?] = d.fetch()
  end
  result
end #=> {"/1" => 3, "/2/1" => 2, "/2/2" => 1}
You can even backtrack in the file using doc.move(path). It seems very flexible.
For writing documents, you can use Oj::StreamWriter:
require 'oj'

doc = Oj::StreamWriter.new($stdout)

def write_item(doc, item)
  doc.push_object
  doc.push_key "type"
  doc.push_value "item"
  doc.push_key "value"
  doc.push_value item
  doc.pop
end

def write_array(doc, array)
  doc.push_object
  doc.push_key "type"
  doc.push_value "array"
  doc.push_key "value"
  doc.push_array
  array.each do |item|
    write_item(doc, item)
  end
  doc.pop
  doc.pop
end

write_array(doc, [{a: 1}, {a: 2}]) #=> {"type":"array","value":[{"type":"item","value":{":a":1}},{"type":"item","value":{":a":2}}]}

chef 11: any way to turn attributes into a ruby hash?

I'm generating a config for my service in chef attributes. However, at some point, I need to turn the attribute mash into a simple ruby hash. This used to work fine in Chef 10:
node.myapp.config.to_hash
However, starting with Chef 11, this does not work. Only the top level of the attribute is converted to a hash, with the nested values remaining immutable Mash objects. Modifying them leads to errors like this:
Chef::Exceptions::ImmutableAttributeModification
------------------------------------------------
Node attributes are read-only when you do not specify which precedence level to set.
To set an attribute use code like `node.default["key"] = "value"'
I've tried a bunch of ways to get around this issue which do not work:
node.myapp.config.dup.to_hash
JSON.parse(node.myapp.config.to_json)
The json parsing hack, which seems like it should work great, results in:
JSON::ParserError
unexpected token at '"#<Chef::Node::Attribute:0x000000020eee88>"'
Is there any actual reliable way, short of including a nested parsing function in each cookbook, to convert attributes to a simple, ordinary, good old ruby hash?
After a resounding lack of answers both here and on the Opscode Chef mailing list, I ended up using the following hack:
class Chef
  class Node
    class ImmutableMash
      def to_hash
        h = {}
        self.each do |k, v|
          if v.respond_to?('to_hash')
            h[k] = v.to_hash
          else
            h[k] = v
          end
        end
        return h
      end
    end
  end
end
I put this into the libraries dir in my cookbook; now I can use attribute.to_hash in both Chef 10 (which already worked properly and which is unaffected by this monkey-patch) and Chef 11. I've also reported this as a bug to Opscode:
If you don't want to have to monkey-patch your Chef, speak up on this issue:
http://tickets.opscode.com/browse/CHEF-3857
Update: monkey-patch ticket was marked closed by these PRs
I hope I am not too late to the party but merging the node object with an empty hash did it for me:
chef (12.6.0)> {}.merge(node).class
=> Hash
I had the same problem and after much hacking around came up with this:
json_string = node[:attr_tree].inspect.gsub(/\=\>/,':')
my_hash = JSON.parse(json_string, {:symbolize_names => true})
inspect does the deep parsing that is missing from the other methods proposed and I end up with a hash that I can modify and pass around as needed.
This has been fixed for a long time now:
[1] pry(main)> require 'chef/node'
=> true
[2] pry(main)> node = Chef::Node.new
[....]
[3] pry(main)> node.default["fizz"]["buzz"] = { "foo" => [ { "bar" => "baz" } ] }
=> {"foo"=>[{"bar"=>"baz"}]}
[4] pry(main)> buzz = node["fizz"]["buzz"].to_hash
=> {"foo"=>[{"bar"=>"baz"}]}
[5] pry(main)> buzz.class
=> Hash
[6] pry(main)> buzz["foo"].class
=> Array
[7] pry(main)> buzz["foo"][0].class
=> Hash
[8] pry(main)>
Probably fixed sometime around Chef 12.x or 13.x; it is certainly no longer an issue in Chef 15.x/16.x/17.x.
The above answer is a little unnecessary. You can just do this:
json = node[:whatever][:whatever].to_hash.to_json
JSON.parse(json)
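Alternatively, a generic deep-conversion helper avoids depending on behaviour that has shifted between Chef versions (a sketch; deep_to_plain is a made-up name, not Chef API, and it only assumes nested values respond to each_pair or are Arrays):
# Hypothetical helper: recursively convert Mash-like objects into plain Hashes/Arrays.
def deep_to_plain(obj)
  if obj.respond_to?(:each_pair)
    obj.each_pair.each_with_object({}) { |(k, v), h| h[k] = deep_to_plain(v) }
  elsif obj.is_a?(Array)
    obj.map { |v| deep_to_plain(v) }
  else
    obj
  end
end

# plain = deep_to_plain(node[:myapp][:config])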

Algorithm for parsing a flat tree into a non-flat tree

I have the following flat tree:
id   name                        parent_id   is_directory
==========================================================
50   app                         0           1
31   controllers                 50          1
11   application_controller.rb   31          0
46   models                      50          1
12   test_controller.rb          31          0
31   test.rb                     46          0
and I am trying to figure out an algorithm for getting this into the following tree structure:
[{
id: 50,
name: app,
is_directory: true
children: [{
id: 31,
name: controllers,
is_directory: true,
children: [{
id: 11,
name: application_controller.rb
is_directory: false
},{
id: 12,
name: test_controller.rb,
is_directory: false
}],
},{
id: 46,
name: models,
is_directory: true,
children: [{
id: 31,
name: test.rb,
is_directory: false
}]
}]
}]
Can someone point me in the right direction? I'm looking for steps (eg. build an associative array; loop through the array looking for x; etc.).
I'm using Ruby, so I have object-oriented language features at my disposal.
In Ruby, you should be able to do it easily in linear time O(n) with a Hash.
# Put all your nodes into a Hash keyed by id.
# This assumes your objects are already Hashes.
object_hash = nodes.index_by { |node| node[:id] }
object_hash[0] = {:root => true}

# loop through each node, assigning them to their parents
object_hash.each_value { |node|
  continue if node[:root]
  children = object_hash[node[:parent_id]][:children] ||= []
  children << node
}

# then you should have the structure you want, and you can ignore the 'object_hash' variable
tree = object_hash[0]
I have investigated the issue with both recursive and non-recursive approaches. Here are the 2 variants:
"parent_id" = "head_id" # for these examples
Recursively:
require 'pp'

nodes = [{"id"=>"1", "name"=>"User №1 Pupkin1", "head_id"=>nil},
         {"id"=>"2", "name"=>"User №2 Pupkin2", "head_id"=>"1"},
         {"id"=>"3", "name"=>"User №3 Pupkin3", "head_id"=>"2"}]

def to_tree(nodes, head_id = nil)
  with_head, without_head = nodes.partition { |n| n['head_id'] == head_id }
  with_head.map do |node|
    node.merge('children' => to_tree(without_head, node['id']))
  end
end

pp to_tree(nodes)
Pros:
it builds the tree the natural, recursive way
Cons:
Ruby will fail once you have >= 3000 nodes (this happens because Ruby has a stack depth limit, and each level of recursion adds a frame that must be unwound). If you use 'pp' for the output it will fail at >= 200 nodes.
Non-recursively, with a loop:
require 'pp'

nodes = [{"id"=>"1", "name"=>"User №1 Pupkin1", "head_id"=>nil},
         {"id"=>"2", "name"=>"User №2 Pupkin2", "head_id"=>"1"},
         {"id"=>"3", "name"=>"User №3 Pupkin3", "head_id"=>"2"}]

def to_tree(data)
  data.each do |item|
    item['children'] = data.select { |_item| _item['head_id'] == item['id'] }
  end
  data.select { |item| item['head_id'] == nil }
end

pp to_tree(nodes)
Pros:
more Ruby-style
Cons:
it mutates the input data, which is not ideal.
The result of both approaches is:
[{"id"=>"1",
"name"=>"User №1 Pupkin1",
"head_id"=>nil,
"children"=>
[{"id"=>"2",
"name"=>"User №2 Pupkin2",
"head_id"=>"1",
"children"=>
[{"id"=>"3",
"name"=>"User №3 Pupkin3",
"head_id"=>"2",
"children"=>[]}]}]}]
Summary
For production it is better to use the second way; there is probably an even more optimal way to implement it. Hope this is useful.
Build a stack and populate it with the root element.
While there are elements in the stack:
Pop an element off the stack and add it to where it belongs in the tree.
Find all children of this element in your array and push them onto the stack.
To add an element to the tree (step 3), you'd need to find its parent first. A tree data structure should allow you to do that pretty quickly, or you can use a dictionary that contains tree nodes indexed by id.
If you mention which language you're using, a more specific solution could be suggested.
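Since the question mentions Ruby, here is a rough sketch of that stack-based approach (illustration only, using rows shaped like the flat table above; the sample data reuses id 31 for two rows, so a unique id 13 is substituted here to keep the id lookup valid):
# Hypothetical input rows, parent_id 0 marking the roots.
rows = [
  {id: 50, name: 'app',                       parent_id: 0,  is_directory: true},
  {id: 31, name: 'controllers',               parent_id: 50, is_directory: true},
  {id: 11, name: 'application_controller.rb', parent_id: 31, is_directory: false},
  {id: 46, name: 'models',                    parent_id: 50, is_directory: true},
  {id: 12, name: 'test_controller.rb',        parent_id: 31, is_directory: false},
  {id: 13, name: 'test.rb',                   parent_id: 46, is_directory: false}
]

# Index built tree nodes by id so a parent can be found quickly (the "dictionary" above).
nodes = {}
roots = []
stack = rows.select { |r| r[:parent_id] == 0 }

until stack.empty?
  row  = stack.pop
  node = {id: row[:id], name: row[:name], is_directory: row[:is_directory], children: []}
  nodes[row[:id]] = node
  if row[:parent_id] == 0
    roots << node
  else
    nodes[row[:parent_id]][:children] << node
  end
  # Push this row's children so they get attached on later iterations.
  stack.concat(rows.select { |r| r[:parent_id] == row[:id] })
end

roots # => nested structure like the one in the question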
Here are a few changes I had to make to @daniel-beardsley's response to make it work for me.
1) Since I was starting with an ActiveRecord relation, I started by doing an "as_json" to convert it to a hash. Note that all the keys were therefore strings, not symbols.
2) In my case, items without parents had a parent value of nil, not 0.
3) I got a compile error on the "continue" expression, so I changed that to "next" (can someone explain this to me -- maybe it was a typo by @daniel-beardsley when converting to Ruby?)
4) I was getting some crashes for items with deleted parents. I added code to ignore these -- you could also put them at the root if you prefer.
object_hash = myActiveRecordRelation.as_json.index_by { |node| node["id"] }
object_hash[nil] = {:root => true}

object_hash.each_value { |node|
  next if node[:root]
  next if node["parent_id"] && !object_hash[node["parent_id"]] # throw away orphans
  children = object_hash[node["parent_id"]][:children] ||= []
  children << node
}

tree = object_hash[nil]

Ruby: How can I have a Hash take multiple keys?

I'm taking 5 strings (protocol, source IP and port, destination IP and port) and using them to store some values in a hash. The problem is that if the IPs or ports are switched between source and destination, the key is supposed to be the same.
If I was doing this in C#/Java/whatever I'd have to create a new class and overwrite the hashcode()/equals() methods, but that seems error prone from the little I've read about it and I was wondering if there would be a better alternative here.
I am directly copying a paragraph from Programming Ruby 1.9:
Hash keys must respond to the message hash by returning a hash code, and the hash code for a given key must not change. The keys used in hashes must also be comparable using eql?. If eql? returns true for two keys, then those keys must also have the same hash code. This means that certain classes (such as Array and Hash) can't conveniently be used as keys, because their hash values can change based on their contents.
So you might generate your hash as something like ["#{source_ip} #{source_port}", "#{dest_ip} #{dest_port}", protocol.to_s].sort.join.hash such that the result will be identical when the source and destination are switched.
For example:
source_ip = "1.2.3.4"
source_port = 1234
dest_ip = "5.6.7.8"
dest_port = 5678
protocol = "http"
def make_hash(s_ip, s_port, d_ip, d_port, proto)
  ["#{s_ip} #{s_port}", "#{d_ip} #{d_port}", proto.to_s].sort.join.hash
end
puts make_hash(source_ip, source_port, dest_ip, dest_port, protocol)
puts make_hash(dest_ip, dest_port, source_ip, source_port, protocol)
This will output the same hash even though the arguments are in a different order between the two calls. Correctly encapsulating this functionality into a class is left as an exercise to the reader.
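For example, a sketch of one possible encapsulation: a small value class (FlowKey is a made-up name) that implements hash and eql? consistently, so instances can be used directly as Hash keys with source and destination treated symmetrically.
# Hypothetical FlowKey class: order-insensitive with respect to source/destination.
class FlowKey
  def initialize(src_ip, src_port, dst_ip, dst_port, protocol)
    # Sort the two endpoints so (src, dst) and (dst, src) produce the same key.
    @key = [["#{src_ip} #{src_port}", "#{dst_ip} #{dst_port}"].sort, protocol.to_s]
  end

  # Hash keys must implement eql? and hash consistently (see the quoted passage above).
  def eql?(other)
    other.is_a?(FlowKey) && @key == other.key
  end
  alias == eql?

  def hash
    @key.hash
  end

  protected

  attr_reader :key
end

counts = Hash.new(0)
counts[FlowKey.new("1.2.3.4", 1234, "5.6.7.8", 5678, "http")] += 1
counts[FlowKey.new("5.6.7.8", 5678, "1.2.3.4", 1234, "http")] += 1
counts.size # => 1, both directions map to the same key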
I think this is what you mean...
irb(main):001:0> traffic = []
=> []
irb(main):002:0> traffic << {:src_ip => "10.0.0.1", :src_port => "9999", :dst_ip => "172.16.1.1", :dst_port => 80, :protocol => "tcp"}
=> [{:protocol=>"tcp", :src_ip=>"10.0.0.1", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}]
irb(main):003:0> traffic << {:src_ip => "10.0.0.2", :src_port => "9999", :dst_ip => "172.16.1.1", :dst_port => 80, :protocol => "tcp"}
=> [{:protocol=>"tcp", :src_ip=>"10.0.0.1", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}, {:protocol=>"tcp", :src_ip=>"10.0.0.2", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}]
The next, somewhat related, question is how to store the IP. You probably want to use the IPAddr object instead of just a string so you can sort the results more easily.
You can use the following code:
def create_hash(prot, s_ip, s_port, d_ip, d_port, value, x = nil)
  if x
    x[prot] = {s_ip => {s_port => {d_ip => {d_port => value}}}}
  else
    {prot => {s_ip => {s_port => {d_ip => {d_port => value}}}}}
  end
end

# Create a value
h = create_hash('www', '1.2.4.5', '4322', '4.5.6.7', '80', "Some WWW value")

# Add another value
create_hash('https', '1.2.4.5', '4562', '4.5.6.7', '443', "Some HTTPS value", h)

# Retrieve the values
puts h['www']['1.2.4.5']['4322']['4.5.6.7']['80']
puts h['https']['1.2.4.5']['4562']['4.5.6.7']['443']
