Ruby's optimized implementation of Histogram/Aggregator

I'm about to write my own, but I was wondering if there are any gems/libs that I can use as an aggregator/histogram.
My goal is to sum up values based on a matching key:
["fish","2"]
["fish","40"]
["meat","56"]
["meat","1"]
This should sum up the values per unique key and return ["fish","42"] and ["meat","57"].
The files I have to aggregate are relatively large, about 4 GB text files made of TSV key/value pairs.
My goal is to avoid temporary files so as not to take up too much space on the machine, so I was wondering if something similar, already optimized, exists. I found a gem on GitHub named 'histogram', but it does not really contain the functionality I need.
Thanks

You can use a Hash with a default value of 0 to do the counting, then at the end you can convert it to an Array to get the format you want, though you might just want to keep using the Hash instead.
data = [
  ["fish","2"],
  ["fish","40"],
  ["meat","56"],
  ["meat","1"]
]

hist = data.each_with_object(Hash.new(0)) do |(k,v), h|
  h[k] += v.to_i
end
hist # => {"fish"=>42, "meat"=>57}
hist.to_a # => [["fish", 42], ["meat", 57]]
# To get String values, "42" instead of 42, etc:
hist.map { |k,v| [k, v.to_s] } # => [["fish", "42"], ["meat", "57"]]
Since you stated you have to read the data from a file, here is the above applied to a file. For this example, the input.txt file contains:
fish,2
fish,40
meat,56
meat,1
Then, to create the same output as before by reading it line by line:
file = File.open('input.txt')
hist = file.each_with_object(Hash.new(0)) do |line, h|
  key, value = line.split(',')
  h[key] += value.to_i
end
file.close
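Since the original files are tab-separated and around 4 GB, here is a minimal sketch of the same idea that streams the file with File.foreach instead of holding it in memory (the input.tsv file name and the tab delimiter are assumptions here):
# Only the aggregate hash (one entry per unique key) is kept in RAM;
# the file itself is streamed line by line.
hist = Hash.new(0)
File.foreach('input.tsv') do |line|
  key, value = line.chomp.split("\t")
  hist[key] += value.to_i
end
hist.map { |k, v| [k, v.to_s] } # => e.g. [["fish", "42"], ["meat", "57"]]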

Related

Ruby Nokogiri parsing omit duplicates

I'm parsing XML files and want to omit duplicate values from being added to my Array. As it stands, the XML looks like this:
<vulnerable-software-list>
<product>cpe:/a:octopus:octopus_deploy:3.0.0</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.1</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.2</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.3</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.4</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.5</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.6</product>
</vulnerable-software-list>
document.xpath("//entry[
number(substring(translate(last-modified-datetime,'-.T:',''), 1, 12)) > #{last_imported_at} and
cvss/base_metrics/access-vector = 'NETWORK'
]").each do |entry|
product = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':')[-2] }
effected_versions = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':').last }
puts product
end
However, because of the XML input, that's parsing quite a bit of duplicates, so I end up with an array like ['Redhat','Redhat','Redhat','Fedora']
I already have the effected_versions taken care of, since those values don't duplicate.
Is there a method of .map to only add unique values?
If you need an array of unique values, just call the uniq method:
product =
  entry.xpath('vulnerable-software-list/product').map do |product|
    product.content.split(':')[-2]
  end.uniq
There are many ways to do this:
require 'set'

input = ['Redhat','Redhat','Redhat','Fedora']

# approach 1
# self explanatory
result = input.uniq

# approach 2
# iterate through vals, and build a hash with the vals as keys
# since hashes cannot have duplicate keys, it provides a 'unique' check
result = input.each_with_object({}) { |val, memo| memo[val] = true }.keys

# approach 3
# similar to the previous: we iterate through vals and add them to a Set
# adding a duplicate value to a set has no effect, and we can convert it to an array
result = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a
If you're not familiar with each_with_object, it's very similar to reduce.
Regarding performance, you can find some info if you search for it, for example: What is the fastest way to make a uniq array?
From a quick test, I see these performing in increasing time: uniq is about 5 times faster than the each_with_object hash approach, which in turn is about 25% slower than the Set.new approach, probably because uniq is implemented in C. I only tested with one arbitrary input, though, so it might not hold for all cases.
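If you want to measure this yourself, here is a minimal benchmark sketch for the three approaches (the input size and labels are arbitrary assumptions):
require 'benchmark'
require 'set'

input = Array.new(100_000) { %w[Redhat Fedora Debian Ubuntu].sample }

Benchmark.bm(18) do |x|
  x.report('uniq')             { input.uniq }
  x.report('each_with_object') { input.each_with_object({}) { |val, memo| memo[val] = true }.keys }
  x.report('Set')              { input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a }
end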

Convert dot notation keys to tree-structured YAML in Ruby

I've sent my I18n files to be translated by a third party. Since my translator is not computer savvy, we made a spreadsheet with the keys; they were sent in dot notation and the values were translated.
For example:
es.models.parent: "Pariente"
es.models.teacher: "Profesor"
es.models.school: "Colegio"
How can I move that into a YAML file?
UPDATE: Just like @tadman said, this already is valid YAML. So if you are fine with that flat format, you are all set.
The rest of this answer focuses on getting the tree structure for the YAML.
The first thing to do is transform this into a Hash.
So the previous info moved into this:
tr = {}
tr["es.models.parent"] = "Pariente"
tr["es.models.teacher"] = "Profesor"
tr["es.models.school"] = "Colegio"
Then we just advanced creating a deeper hash.
result = {} # The resulting hash
tr.each do |k, value|
  h = result
  keys = k.split(".") # This key is a concatenation of keys
  keys.each_with_index do |key, index|
    h[key] = {} unless h.has_key? key
    if index == keys.length - 1 # If it's the last element
      h[key] = value            # then we only need to set the value
    else
      h = h[key]
    end
  end
end

require 'yaml'
puts result.to_yaml # Here it is for your YAMLing pleasure
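As a more compact alternative, here is a sketch of the same transformation using inject to walk (and create) the intermediate hashes before assigning the value at the last key (it reuses the tr hash from above):
require 'yaml'

result = {}
tr.each do |dotted_key, value|
  # split "es.models.parent" into parents ["es", "models"] and last "parent"
  *parents, last = dotted_key.split(".")
  parents.inject(result) { |h, key| h[key] ||= {} }[last] = value
end
puts result.to_yaml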

Pull min and max value from CSV file

I have a CSV file like:
123,hat,19.99
321,cap,13.99
I have this code:
products = {}
products_file = File.open('text.txt')
while !products_file.eof?
  line = products_file.gets.chomp.split(',')
  puts line.inspect
  products[line[0].to_i] = [line[1], line[2].to_f]
end
products_file.close
which is reading the file. While it's not at the end of the file, it reads each line. I don't need the line.inspect in there, but it stores each line as an array inside my products hash.
Now I want to pull the min and max value from the hash.
My code so far is:
read_file = File.open('text.txt', "r+").read
read_file.(?) |line|
products[ products.length] = gets.chomp.to_f
products.min_by { |x| x.size }
smallest = products
puts "Your highest priced product is #{smallest}"
Right now I don't have anything after read_file.(?) |line| so I get an error. I tried using min or max but neither worked.
Without using CSV
If I understand your question correctly, you don't have to use CSV class methods: just read the file (less header) into an array and determine the min and max as follows:
arr = ["123,hat,19.99", "321,cap,13.99",
"222,shoes,33.41", "255,shirt,19.95"]
arr.map { |s| s.split(',').last.to_f }.minmax
#=> [13.99, 33.41]
or
arr.map { |s| s[/\d+\.\d+$/].to_f }.minmax
#=> [13.99, 33.41]
If you want the associated records:
arr.minmax_by { |s| s.split(',').last.to_f }
=> ["321,cap,13.99", "222,shoes,33.41"]
With CSV
If you wish to use CSV to read the file into an array:
arr = [["123", "hat", "19.99"],
["321", "cap", "13.99"],
["222", "shoes", "33.41"],
["255", "shirt", "19.95"]]
then
arr.map(&:last).minmax
# => ["13.99", "33.41"]
or
arr.minmax_by(&:last)
#=> [["321", "cap", "13.99"],
# ["222", "shoes", "33.41"]]
if you want the records. Note that in the CSV examples I didn't convert the last field to a float, assuming that all records have two decimal digits.
You should use the built-in CSV class as such:
require 'csv'
data = CSV.read("text.txt")
data.sort!{ |row1, row2| row1[2].to_f <=> row2[2].to_f }
least_expensive = data.first
most_expensive = data.last
The Array#sort! method modifies data in place, so it stays sorted according to the block's condition for later use. As you can see, the block sorts based on the value at index 2 of each row, in your case the prices. Incidentally, you don't strictly need to convert these values to floats: strings sort the same way as long as every price has the same number of digits before the decimal point. Note that to_f stops working if you have leading non-digit characters (e.g. $), so in that case you might find it better to just keep the values as strings during your sort.
Then you can grab the most and least expensive, or the 5 most expensive, or whatever, at your leisure.
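For instance, a small sketch of the "5 most expensive" case, building on the sorted data array from above (most expensive first; it assumes the price is still the third column):
top_five = data.last(5).reverse
top_five.each { |_id, name, price| puts "#{name}: #{price}" }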

Automatic nested hash keys with ability to append value

I have an arbitrary list of file names I'd like to sort into a hash. I'd like to do it like this:
## Example file name: 'hello.world.random_hex'
file_name_list.each do |file|
  name_array = file.split('.')
  files[name_array[0].to_sym][name_array[1].to_sym] << file
end
Those keys may not exist and I'd like for them to be automatically created with a default value of [] so the << works as expected. The final files hash would look like:
{ :hello => { :world => [ "hello.world.random_hex", "hello.world.other_random_hex" ] } }
How can I initialize files to accomplish this?
If there are always two levels of keys like this, you can do it using the block form of Hash.new:
files = Hash.new { |hash, key| hash[key] = Hash.new { |h, k| h[k] = [] } }
(On the other hand, if the keys can be nested to an arbitrary depth, this is much harder because the Hash can't know whether the value for a nonexistent key should be a Hash or an Array at the time it is accessed.)
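For example, a quick usage sketch of that two-level default (the file names here are made up):
files = Hash.new { |hash, key| hash[key] = Hash.new { |h, k| h[k] = [] } }

["hello.world.aabbcc", "hello.world.ddeeff", "foo.bar.112233"].each do |file|
  first, second, _hex = file.split('.')
  files[first.to_sym][second.to_sym] << file
end

files[:hello][:world] # => ["hello.world.aabbcc", "hello.world.ddeeff"]
files[:foo][:bar]     # => ["foo.bar.112233"]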

Best way of Parsing 2 CSV files and printing the common values in a third file

I am new to Ruby, and I have been struggling with a problem that I suspect has a simple answer. I have two CSV files, one with two columns, and one with a single column. The single column is a subset of values that exist in one column of my first file. Example:
file1.csv:
abc,123
def,456
ghi,789
jkl,012
file2.csv:
def
jkl
All I need to do is look up the column 2 value in file1 for each value in file2 and output the results to a separate file. So in this case, my output file should consist of:
456
012
I’ve got it working this way:
pairs = IO.readlines("file1.csv").map { |columns| columns.split(',') }
f1 = []
pairs.each { |x| f1.push(x[0]) }
f2 = IO.readlines("file2.csv").map(&:chomp)
collection = {}
pairs.each { |x| collection[x[0]] = x[1] }
f = File.open("outputfile.txt", "w")
f2.each { |col1| f.puts collection[col1] }
f.close
...but there has to be a better way. If anyone has a more elegant solution, I'd be very appreciative! (I should also note that I will eventually need to run this on files with millions of lines, so speed will be an issue.)
To be as memory efficient as possible, I'd suggest keeping only the contents of file2 (which I gather is the smaller of the two input files) in memory. I'm using a hash for fast lookups and to store the resulting values, so as you read through file1 you only store the values for the keys you need. You could go one step further and write the output file while reading file1 (see the sketch after the code below).
require 'csv'

# Read file 2, the smaller file, and store its keys in the result Hash
result = {}
CSV.foreach("file2.csv") do |row|
  result[row[0]] = false
end

# Read file 1, the larger file, and look for keys in the result Hash to set values
CSV.foreach("file1.csv") do |row|
  result[row[0]] = row[1] if result.key? row[0]
end

# Write the results
File.open("outputfile.txt", "w") do |f|
  result.each do |key, value|
    f.puts value if value
  end
end
Tested with Ruby 1.9.3
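Here is a minimal sketch of that "one step further" idea, writing matches as file1 is read instead of collecting them first (same file names as above; note the output order then follows file1 rather than file2):
require 'csv'
require 'set'

# Keys we care about, taken from the smaller file
wanted = Set.new
CSV.foreach("file2.csv") { |row| wanted << row[0] }

# Stream the larger file and write matching values immediately
File.open("outputfile.txt", "w") do |out|
  CSV.foreach("file1.csv") do |key, value|
    out.puts value if wanted.include?(key)
  end
end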
Parsing For File 1
require 'csv'

data_csv_file1 = File.read("file1.csv")
data_csv1 = CSV.parse(data_csv_file1, :headers => true)
Parsing For File 2
data_csv_file2 = File.read("file2.csv")
data_csv2 = CSV.parse(data_csv_file2, :headers => true)
Collection of names
names_from_sheet1 = data_csv1.collect {|data| data[0]} #returns an array of names
names_from_sheet2 = data_csv2.collect {|data| data[0]} #returns an array of names
common_names = names_from_sheet1 & names_from_sheet2 #array with common names
Collecting results to be printed
results = [] #this will store the values to be printed
data_csv1.each {|data| results << data[1] if common_names.include?(data[0]) }
Final output
f = File.open("outputfile.txt","w")
results.each {|result| f.puts result }
f.close
