Efficiently building a file system tree structure with nested hashes - ruby

I have a list of the diff stats per file for a commit (using diff --numstat in Git) that I need to parse into a tree structure as a hash so I can use it as JSON. The raw data is in a format like this:
1 1 app/assets/javascripts/foo.js.coffee
2 1 app/assets/javascripts/bar.js
16 25 app/assets/javascripts/baz.js.coffee
11 0 app/controllers/foo_controller.rb
3 2 db/schema.rb
41 1 lib/foobar.rb
I need to parse this into a nested hash format something like the following:
{ name: "app", children: [
{ name: "assets", children: [
{ name: "javascripts", children: [
{ name: "foo.js.coffee", add: 1, del: 1 },
{ name: "bar.js", add: 2, del: 1 }
{ name: "baz.js.coffee", add: 16, del: 25 }
], add: 19, del: 27 },
...
] }
] }
Where every level of the tree is represented by its name, its children as an array of hashes, and the total number of additions and deletions for that subtree.
Is there an efficient way to construct a hash like this in Ruby?

Full source here: https://gist.github.com/dimitko/5541709. You can download it and run it directly without any trouble (just make sure you have the awesome_print gem; it shows the object hierarchy in a much more human-readable format).
I enriched your test input a little, to make sure the algorithm doesn't make stupid mistakes.
Given this input:
input = <<TEXT
2 1 app/assets/javascripts/bar.js
16 25 app/assets/javascripts/baz.js.coffee
1 1 app/assets/javascripts/foo.js.coffee
4 9 app/controllers/bar_controller.rb
3 2 app/controllers/baz_controller.rb
11 0 app/controllers/foo_controller.rb
3 2 db/schema.rb
41 1 lib/foobar.rb
12 7 lib/tasks/cache.rake
5 13 lib/tasks/import.rake
TEXT
And this expected result:
[{:name=>"app", :add=>37, :del=>38, :children=>[{:name=>"assets", :add=>19, :del=>27, :children=>[{:name=>"javascripts", :add=>19, :del=>27, :children=>[{:name=>"bar.js", :add=>2, :del=>1}, {:name=>"baz.js.coffee", :add=>16, :del=>25}, {:name=>"foo.js.coffee", :add=>1, :del=>1}]}]}, {:name=>"controllers", :add=>18, :del=>11, :children=>[{:name=>"bar_controller.rb", :add=>4, :del=>9}, {:name=>"baz_controller.rb", :add=>3, :del=>2}, {:name=>"foo_controller.rb", :add=>11, :del=>0}]}]}, {:add=>3, :del=>2, :name=>"db", :children=>[{:name=>"schema.rb", :add=>3, :del=>2}]}, {:add=>58, :del=>21, :name=>"lib", :children=>[{:name=>"foobar.rb", :add=>41, :del=>1}, {:name=>"tasks", :add=>17, :del=>20, :children=>[{:name=>"cache.rake", :add=>12, :del=>7}, {:name=>"import.rake", :add=>5, :del=>13}]}]}]
And this code:
def git_diffnum_parse_paths(list, depth, out)
  to = 1
  base = list.first[:name][depth]
  while list[to] and list[to][:name][depth] == base do
    to += 1
  end
  if list.first[:name][depth + 1]
    out << { name: base, add: 0, del: 0, children: [] }
    # Common directory found for the first N records; recurse deeper.
    git_diffnum_parse_paths(list[0..to - 1], depth + 1, out.last[:children])
    add = del = 0
    out.last[:children].each { |x| add += x[:add].to_i; del += x[:del].to_i }
    out.last[:add] = add
    out.last[:del] = del
  else
    # It's a file; we can't go any deeper.
    out << { name: list.first[:name].last, add: list.first[:add].to_i, del: list.first[:del].to_i }
  end
  if list[to]
    # Recurse to try to find common directories among the remaining records.
    git_diffnum_parse_paths(list[to..-1], depth, out)
  end
  nil
end
def to_git_diffnum_tree(txt)
  items = []
  txt.split("\n").each do |line|
    m = line.match(/(\d+)\s+(\d+)\s+(.+)/).to_a[1..3]
    items << { add: m[0], del: m[1], name: m[2] }
  end
  items.sort! { |a, b| a[:name] <=> b[:name] }
  items.each do |item|
    item[:name] = item[:name].split("/")
  end
  out = []
  git_diffnum_parse_paths(items, 0, out)
  out
end
And this code, which uses it:
require 'awesome_print'
out = to_git_diffnum_tree(input)
puts; ap out; puts
puts; puts "Expected result:"; puts expected.inspect
puts; puts "Actual result: "; puts out.inspect
puts; puts "Are expected and actual results identical: #{expected == out}"
It seems to produce what you want.
Notes:
I am sorting the array of parsed entries by directory/file names. This avoids walking the entire list to search for a common directory; instead, the algorithm can scan the list up to the first non-match (see the short illustration after these notes).
I am far from thinking this is the optimal solution, but it's what I came up with in a free hour.
I have left some [un-]commented puts statements in the gist, in case you want a rough glimpse of how the algorithm works.
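To make the sorting note concrete, here is a tiny standalone illustration (the paths are made up for the example):
# After sorting and splitting, entries sharing a directory prefix sit next to
# each other, so the parser only needs to scan until the first non-match.
paths = ["lib/foobar.rb", "app/controllers/foo_controller.rb", "app/assets/javascripts/bar.js"]
p paths.sort.map { |path| path.split("/") }
#=> [["app", "assets", "javascripts", "bar.js"],
#    ["app", "controllers", "foo_controller.rb"],
#    ["lib", "foobar.rb"]]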
In case you want to give it a more solid test, try something like this:
git diff --numstat `git rev-list --max-parents=0 HEAD | head -n 1` HEAD
That'd give you the number of additions and deletions since the initial commit (provided your Git version is >= 1.7.4.2), which is a far bigger input for giving the algorithm much more rigorous testing.
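For example, here is a rough sketch of feeding that output straight into the parser (run from inside a Git repository; this is my illustration, not part of the gist):
# Shell out to Git and parse the --numstat output with the code above.
root_commit = `git rev-list --max-parents=0 HEAD`.lines.first.strip
numstat = `git diff --numstat #{root_commit} HEAD`
# Binary files appear as "-  -  path" in --numstat output; drop them, since
# the parser above expects numeric add/del counts.
numstat = numstat.lines.reject { |line| line.start_with?("-") }.join
tree = to_git_diffnum_tree(numstat)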
Hope I helped.

Define "efficient".
If your problem is "performance", your solution isn't ruby.
Unless you're literally running this script on the Linux source code, I wouldn't be worrying about performance, just clarity of intent.
I took inspiration from @dimitko's solution and minimized the code.
https://gist.github.com/x1024/3d0f9ad61fcb4b189be3
def git_group lines, root = 'root'
  if lines.count == 1 and lines[0][:name].empty? then
    return {
      name: root,
      add: lines.map { |l| l[:add] }.reduce(0, :+),
      del: lines.map { |l| l[:del] }.reduce(0, :+),
    }
  end
  lines = lines.group_by { |line| line[:name].shift }
               .map { |key, value| git_group(value, key) }
  return {
    name: root,
    add: lines.map { |l| l[:add] }.reduce(0, :+),
    del: lines.map { |l| l[:del] }.reduce(0, :+),
    children: lines
  }
end
def to_git_diffnum_tree(txt)
  data = txt.split("\n")
            .map { |line| line.split }
            .map { |line| { add: line[0].to_i, del: line[1].to_i, name: line[2].split('/') } }
            .sort_by { |item| item[:name] }
  git_group(data)[:children]
end
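Since the question ultimately wants JSON, here is a minimal usage sketch (input being the test input from earlier; the json gem is in the standard library):
require 'json'

# Parse the numstat text and serialize the resulting tree as JSON.
tree = to_git_diffnum_tree(input)
puts JSON.pretty_generate(tree)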
And if you are willing to compromise on your data format (i.e. return the same data, but in a different structure), you can do this with even less code:
https://gist.github.com/x1024/5ecfdfe886e31f8b5ab9
def git_group lines
  dirs = lines.select { |line| line[:name].count > 1 }
  files = (lines - dirs).map! { |file| [file.delete(:name).shift, file] }
  dirs_processed = dirs.group_by { |dir| dir[:name].shift }
                       .map { |key, value| [key, git_group(value)] }
  data = dirs_processed.concat(files)
  return {
    add: data.map { |k, l| l[:add] }.reduce(0, :+),
    del: data.map { |k, l| l[:del] }.reduce(0, :+),
    children: Hash[data]
  }
end
def to_git_diffnum_tree(txt)
  data = txt.split("\n")
            .map { |line| line.split }
            .map { |line| { add: line[0].to_i, del: line[1].to_i, name: line[2].split('/') } }
            .sort_by { |item| item[:name] }
  git_group(data)[:children]
end
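To show what the "different structure" looks like, here is roughly what this version returns for a two-file subset of the input (traced by hand from the code above, so treat it as illustrative):
subset = "3 2 db/schema.rb\n41 1 lib/foobar.rb\n"
p to_git_diffnum_tree(subset)
#=> {"db"=>{:add=>3, :del=>2, :children=>{"schema.rb"=>{:add=>3, :del=>2}}},
#    "lib"=>{:add=>41, :del=>1, :children=>{"foobar.rb"=>{:add=>41, :del=>1}}}}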
Remember kids, writing C++ in Ruby is bad.

Related

iterating over to make hashes within an array

So I know how I can iterate over and make an array within a hash:
travel=["Round Trip Ticket Price:", "Price of Accommodation:", "Number of checked bags:"]
(1..3).each_with_object({}) do |trip, travels|
puts "Please input the following for trip # #{trip}"
travels["trip #{trip}"]= travel.map { |q| print q; gets.chomp.to_f }
end
==>{"trip 1"=>[100.0, 50.0, 1.0], "trip 2"=>[200.0, 100.0, 2.0], "trip 3"=>[300.0, 150.0,
3.0]}
BUT instead I want to iterate to make three individual hashes within one array.
I want it to look something like this:
travels = [{trip_transportation: 100.0, trip_accommodation: 50.0, trip_bags: 50},
           {trip_transportation: 200.0, trip_accommodation: 100.0, trip_2_bags: 100},
           {trip_3_transportation: 300.0, trip_accommodation: 150.0, trip_3_bags: 150}]
I am really confused. Basically, the only thing I want to know is how to make three separate hashes while using a loop.
I want every hash to represent a trip.
Is that even possible?
travel = [{ prompt: "Round Trip Ticket Price: ",
            key: :trip_transportation, type: :float },
          { prompt: "Price of Accommodation : ",
            key: :trip_accommodation, type: :float },
          { prompt: "Number of checked bags : ",
            key: :trip_bags, type: :int }]
nbr_trips = 3
Suppose that, as the following code is run, the user inputs the values given in the question's example.
(1..nbr_trips).map do |trip|
  puts "Please input the following for trip #{trip}"
  travel.map do |h|
    print h[:prompt]
    s = gets
    [h[:key], h[:type] == :float ? s.to_f : s.to_i]
  end.to_h
end
#=> [{:trip_transportation=>100.0, :trip_accommodation=>50.0, :trip_bags=>1},
# {:trip_transportation=>200.0, :trip_accommodation=>100.0, :trip_bags=>2},
# {:trip_transportation=>300.0, :trip_accommodation=>150.0, :trip_bags=>3}]
I see no reason for keys to have different names for different trips (e.g., :trip_2_bags and :trip_3_bags, rather than simply :trip_bags for all trips).
Using a Hash for the setup, similar to Cary Swoveland's answer and to my answer here: https://stackoverflow.com/a/58485997/5239030
travel = { trip_transportation: { question: 'Round Trip Ticket Price:', convert: 'to_f' },
           trip_accommodation: { question: 'Price of Accommodation:', convert: 'to_f' },
           trip_bags: { question: 'Number of checked bags:', convert: 'to_i' } }
n = 2
res = (1..n).map do # |n| # uncomment if (*)
  travel.map.with_object({}) do |(k, v), h|
    puts v[:question]
    # k = k.to_s.split('_').insert(1, n).join('_').to_sym # uncomment if (*)
    h[k] = gets.send(v[:convert])
  end
end
res
#=> [{:trip_transportation=>10.0, :trip_accommodation=>11.0, :trip_bags=>1}, {:trip_transportation=>20.0, :trip_accommodation=>22.0, :trip_bags=>2}]
(*) Uncomment if you want the result to appear like:
#=> [{:trip_1_transportation=>10.0, :trip_1_accommodation=>11.0, :trip_1_bags=>1}, {:trip_2_transportation=>20.0, :trip_2_accommodation=>22.0, :trip_2_bags=>2}]

Serialize an array of hashes

I have an array of hashes:
records = [
  {
    ID: 'BOATY',
    Name: 'McBoatface, Boaty'
  },
  {
    ID: 'TRAINY',
    Name: 'McTrainface, Trainy'
  }
]
I'm trying to combine them into an array of strings:
["ID,BOATY","Name,McBoatface, Boaty","ID,TRAINY","Name,McTrainface, Trainy"]
This doesn't seem to do anything:
irb> records.collect{|r| r.each{|k,v| "\"#{k},#{v}\"" }}
#=> [{:ID=>"BOATY", :Name=>"McBoatface, Boaty"}, {:ID=>"TRAINY", :Name=>"McTrainface, Trainy"}]
** edit **
Formatting (i.e. ["Key0,Value0","Key1,Value1",...]) is required to match a vendor's interface.
** /edit **
What am I missing?
records.flat_map(&:to_a).map { |a| a.join(',') }
#=> ["ID,BOATY", "Name,McBoatface, Boaty", "ID,TRAINY", "Name,McTrainface, Trainy"]
records = [
  {
    ID: 'BOATY',
    Name: 'McBoatface, Boaty'
  },
  {
    ID: 'TRAINY',
    Name: 'McTrainface, Trainy'
  }
]
# straightforward code
result = []
records.each do |hash|
  hash.each do |key, value|
    result << key.to_s
    result << value
  end
end
puts result.inspect
# a rubyish way (probably less efficient, I've not done the benchmark)
puts records.map(&:to_a).flatten.map(&:to_s).inspect
Hope it helps.
li = []
records.each do |rec|
  rec.each do |k, v|
    li << "#{k},#{v}"
  end
end
print li
["ID,BOATY", "Name,McBoatface, Boaty", "ID,TRAINY", "Name,McTrainface,
Trainy"]
You sure you wanna do it this way?
Check out Marshal. Or JSON.
You could even do it this stupid way using Hash#inspect and eval:
serialized_hashes = records.map(&:inspect) # ["{ID: 'Boaty'...", ...]
unserialized = serialized_hashes.map { |s| eval(s) }
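For reference, a quick sketch of the Marshal and JSON round-trips mentioned above (standard library only; note that JSON.parse returns string keys unless you symbolize them):
require 'json'

# Marshal: binary, Ruby-specific, preserves symbol keys.
dumped   = Marshal.dump(records)
restored = Marshal.load(dumped)

# JSON: portable text; symbol keys come back as strings unless symbolized.
json     = records.to_json
restored = JSON.parse(json, symbolize_names: true)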

Find and replace specific hash and its values within array

What is the most efficient way to find a specific hash within an array and replace its values in place, so the array gets changed as well?
I've got this code so far, but in a real-world application with loads of data this becomes the slowest part of the application, and it probably leaks memory: memory usage grows without bound as I perform this operation on each websocket message.
array = [
  { id: 1,
    parameters: {
      omg: "lol"
    },
    options: {
      lol: "omg"
    }
  },
  { id: 2,
    parameters: {
      omg: "double lol"
    },
    options: {
      lol: "double omg"
    }
  }
]
selection = array.select { |a| a[:id] == 1 }[0]
selection[:parameters][:omg] = "triple omg"
p array
# => [{:id=>1, :parameters=>{:omg=>"triple omg"}, :options=>{:lol=>"omg"}}, {:id=>2, :parameters=>{:omg=>"double lol"}, :options=>{:lol=>"double omg"}}]
This will do what you're after, looping through the records only once:
array.each { |hash| hash[:parameters][:omg] = "triple omg" if hash[:id] == 1 }
You could always expand the block to handle other conditions:
array.each do |hash|
  hash[:parameters][:omg] = "triple omg" if hash[:id] == 1
  hash[:parameters][:omg] = "quadruple omg" if hash[:id] == 2
  # etc
end
And it still iterates over the elements just once.
It might also be that you'd be better off adjusting your data into a single hash. Generally speaking, looking up a key in a hash is faster than searching an array, particularly when you've got a unique identifier, as here. Something like:
{
  1 => {
    parameters: {
      omg: "lol"
    },
    options: {
      lol: "omg"
    }
  },
  2 => {
    parameters: {
      omg: "double lol"
    },
    options: {
      lol: "double omg"
    }
  }
}
This way, you could just call the following to achieve what you're after:
hash[1][:parameters][:omg] = "triple omg"
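If you go that route, here is a minimal sketch (my addition, not from the original answer) of building the id-keyed hash once from the existing array and then mutating it per message:
# Build the lookup hash once; each record keeps everything except :id.
# The nested hashes are shared with the original array, so updates show up there too.
by_id = array.each_with_object({}) do |record, h|
  h[record[:id]] = { parameters: record[:parameters], options: record[:options] }
end

# Constant-time lookup on every websocket message instead of scanning the array.
by_id[1][:parameters][:omg] = "triple omg"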
Hope that helps - let me know how you get on with it or if you have any questions.

Ruby - is there a shorthand check for two logical conditionals against one variable

How can I shorten this expression?
if artist != 'Beck' && artist != 'Led Zeppelin'
  5.times { puts 'sorry' }
end
Is there a shorthand check for two logical conditionals against one variable?
As an aside, this turned into the following in our project:
class String
  def is_not?(*arr)
    !arr.include?(self)
  end
end
Now we can do 'foo'.is_not?('bar', 'batz').
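With that patch in place, the original check could read something like this (my sketch):
5.times { puts 'sorry' } if artist.is_not?('Beck', 'Led Zeppelin')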
unless ['Beck', 'Led Zeppelin'].include?(artist)
  5.times { puts 'sorry' }
end
Isn't any "shorter", but no obscure syntax trickery too. Just using regular array api. As a consequence, you can provide that array in any way you want. Load it from a file, for example. With any number of elements.
Your specific case is pretty minimal, but if you have lots of unrelated conditions to test for lots of values you can set the tests up as lambdas in an array and use all?. For instance, the following example filters all the integers between 1 and 100 for those which are > 20, < 50, even, and divisible by 3:
tests = [
  ->(x) { x > 20 },
  ->(x) { x < 50 },
  ->(x) { x.even? },
  ->(x) { x % 3 == 0 }
]

(1..100).each do |i|
  puts i if tests.all? { |test| test[i] }
end
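For reference, with those four tests the loop prints 24, 30, 36, 42 and 48 (the numbers between 21 and 49 that are divisible by 6).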
case artist
when 'Beck', 'Led Zeppelin'
else
  5.times { puts 'sorry' }
end

Lazy enumerator for nested array of hashes

Suppose I have an Array like this
data = [
  {
    key: val,
    important_key_1: { # call this the big hash
      key: val,
      important_key_2: [
        { # call this the small hash
          key: val,
        },
        {
          key: val,
        },
      ]
    },
  },
  {
    key: val,
    important_key_1: {
      key: val,
      important_key_2: [
        {
          key: val,
        },
        {
          key: val,
        },
      ]
    },
  },
]
I want to create a lazy enumerator that returns the next small hash on each #next and, when the current big hash is exhausted, moves on to the next big hash and does the same.
The easy way to return all the internal hashes that I want would be something like this:
data.map do |internal_data|
  internal_data[:important_key_1][:important_key_2]
end.flatten
Is there some way to do this, or do I need to implement my own logic?
This returns a lazy enumerator which iterates over all the small hashes:
def lazy_nested_hashes(data)
  enum = Enumerator.new do |yielder|
    data.each do |internal_data|
      internal_data[:important_key_1][:important_key_2].each do |small_hash|
        yielder << small_hash
      end
    end
  end
  enum.lazy
end
With your input data and this val definition:
@i = 0
def val
  @i += 1
end
It outputs:
puts lazy_nested_hashes(data).to_a.inspect
#=> [{:key=>3}, {:key=>4}, {:key=>7}, {:key=>8}]
puts lazy_nested_hashes(data).map { |x| x[:key] }.find { |k| k > 3 }
#=> 4
For the second example, the second big hash isn't considered at all (thanks to enum.lazy).
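As a side note (my addition, not part of the original answer), the same lazy behaviour can be sketched more compactly with Enumerator::Lazy#flat_map, since arrays returned by the block are enumerated element by element:
def lazy_nested_hashes(data)
  # Lazily yield each small hash in turn; stops early under find, first, etc.
  data.lazy.flat_map { |internal_data| internal_data[:important_key_1][:important_key_2] }
end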
