Ruby Nokogiri parsing: omit duplicates

I'm parsing XML files and want to omit duplicate values from being added to my array. As it stands, the XML looks like this:
<vulnerable-software-list>
<product>cpe:/a:octopus:octopus_deploy:3.0.0</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.1</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.2</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.3</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.4</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.5</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.6</product>
</vulnerable-software-list>
document.xpath("//entry[
number(substring(translate(last-modified-datetime,'-.T:',''), 1, 12)) > #{last_imported_at} and
cvss/base_metrics/access-vector = 'NETWORK'
]").each do |entry|
product = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':')[-2] }
effected_versions = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':').last }
puts product
end
However, because of the XML input, that parses quite a few duplicates, so I end up with an array like ['Redhat','Redhat','Redhat','Fedora'].
I already have the effected_versions taken care of, since those values don't duplicate.
Is there a method of .map to only add unique values?

If you need an array of unique values, just call the uniq method on the result:
product =
  entry.xpath('vulnerable-software-list/product').map do |product|
    product.content.split(':')[-2]
  end.uniq

There are many ways to do this:
input = ['Redhat','Redhat','Redhat','Fedora']
# approach 1
# self explanatory
result = input.uniq
# approach 2
# iterate through vals, and build a hash with the vals as keys
# since hashes cannot have duplicate keys, it provides a 'unique' check
result = input.each_with_object({}) { |val, memo| memo[val] = true }.keys
# approach 3
# Similar to the previous, we iterate through vals and add them to a Set.
# Adding a duplicate value to a set has no effect, and we can convert it to an array.
# (May require `require 'set'` first, depending on your Ruby version.)
result = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a
If you're not familiar with each_with_object, it's very similar to reduce.
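For example, the hash-building approach above can be written with reduce like this; the only real difference is that the block has to return the memo explicitly (a sketch using the same input array):
# Equivalent of approach 2 using reduce: the block must return the memo (hash) itself
result = input.reduce({}) { |memo, val| memo[val] = true; memo }.keys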
Regarding performance, you can find some info if you search for it, for example "What is the fastest way to make a uniq array?"
From a quick test, I see these performing in increasing time: uniq is 5 times faster than each_with_object, which is 25% slower than the Set.new approach, probably because uniq is implemented in C. I only tested with an arbitrary input, though, so it might not be true for all cases.
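As a rough illustration (not the exact test referred to above, and timings will vary with input), you could compare the three approaches yourself with the Benchmark module from the standard library:
require 'benchmark'
require 'set'

# Arbitrary sample data, repeated to make the timings measurable
input = ['Redhat', 'Redhat', 'Redhat', 'Fedora'] * 10_000

Benchmark.bm(20) do |x|
  x.report('uniq')             { input.uniq }
  x.report('each_with_object') { input.each_with_object({}) { |v, h| h[v] = true }.keys }
  x.report('Set')              { input.each_with_object(Set.new) { |v, s| s.add(v) }.to_a }
end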

How can I generate a reliable digest from a hash in Ruby 2.4?

I have a very large table containing 2 billion rows of 50 attributes. Not all are filled out and it's a sparse matrix.
I dislike having to build the query off of all of the values, and the indexes are much too large now; I've lost performance.
For my new approach, I want to add a digest column that contains a digest of all of the attributes in a particular row.
There is no security requirement for this hash, so even MD5 would be fine.
Am I better off building a simple string containing representations of all the keys and values together? Or is there a better way?
For example, given hash:
attr_hash = { attribute1: "Please",
              attribute2: nil,
              attribute3: "don't",
              attribute4: nil,
              attribute5: nil,
              attribute6: nil,
              attribute7: "immediately",
              attribute8: "",
              attribute9: "downvote",
              attribute10: "my",
              attribute11: nil,
              attribute12: "question" }
would this be preferable (and I'm sure you'll agree this is beautiful):
attr_str = attr_hash.select{|k,v| v!="" && !v.nil?}.keys.sort.map{|k| "#{k}=#{attr_hash[k]}" }.join("^^")
digest = Digest::MD5.hexdigest(attr_str)
which gives a nice-looking string:
790470349a791b9897afd52a336ab2bb
I can index that column and get very, very fast response times from the database. And I'm unlikely to get many if any collisions from that. And if there's a collision one in 5 or 10 million times, it's fine.
I deeply appreciate any insights.
Lazy way:
Digest::SHA2.hexdigest(attr_hash.inspect)
That presupposes your items have identical ordering. If you need to sort the items first:
Digest::SHA2.hexdigest(attr_hash.to_a.sort_by { |k, _v| k }.inspect)
I'd use JSON.dump(x) instead of x.inspect if I wanted something more portable, like to non-Ruby code-bases.
I also wouldn't bother stripping out empty values. The hash function doesn't care.
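For example, a portable variant along those lines could look like this (a sketch; it sorts the pairs so key order doesn't matter and assumes every value serializes cleanly to JSON):
require 'json'
require 'digest'

# Sort the key/value pairs, serialize them as JSON, then digest the result
digest = Digest::SHA2.hexdigest(JSON.dump(attr_hash.sort_by { |k, _v| k.to_s }))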
This may be useful for somebody who is dealing with nested JSON objects:
def predigest_json(json)
  if json.class == Hash
    # Turn each key/value pair into a canonical "key=>value" string
    arr = []
    json.each { |key, value| arr << "#{predigest_json key}=>#{predigest_json value}" }
    json = arr
  end
  if json.class == Array
    # Recurse into each element, then sort so element order doesn't affect the digest
    str = ''
    json.map! { |value| predigest_json value }.sort!.each { |value| str << value }
  end
  if json.class != String
    # Append the class name so that, for example, 1 and "1" digest differently
    json = json.to_s << json.class.to_s
  end
  json
end
Digest::SHA2.hexdigest(predigest_json(json))

I am getting an "Undefined method 'new' for.... (a number that changes each time)"

I made a simple program with a single method and I'm trying to test it, but I keep getting this weird error, and I have no idea why it keeps happening.
Here's my code for the only method I wrote:
def make_database(lines)
  i = 0
  foods = hash.new()
  while i < lines.length do
    lines[i] = lines[i].chomp()
    words = lines[i].split(',')
    if(words[1].casecmp("b") == 0)
      foods[words[0]] = words[3]
    end
  end
  return foods
end
And then here's what I have for calling the method (inside the same program).
if __FILE__ == $PROGRAM_NAME
  lines = []
  $stdin.each { |line| lines << line }
  foods = make_database(lines).new
  puts foods
end
I am painfully confused, especially since it gives me a different random number for each "Undefined method 'new' for (Random number)".
It's a simple mistake: hash calls a method on the current object that returns a number used by the Hash structure for indexing entries, whereas Hash (with a capital H) is the class you're probably intending:
foods = Hash.new()
Or more succinctly:
foods = { }
It's ideal to use { } in place of Hash.new unless you need to specify things like defaults, as is the case with:
Hash.new(0)
Where all values are initialized to 0 by default. This can be useful when creating simple counters.
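For example, a default of 0 lets you increment a counter without first checking whether the key exists:
counts = Hash.new(0)
%w[b b f b].each { |flag| counts[flag] += 1 }
counts # => {"b"=>3, "f"=>1}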
Ruby classes are identified by leading capital letters to avoid confusion like this. Once you get used to the syntax you'll have an easier time spotting mistakes like that.
Note that when writing Ruby code you will almost always omit parentheses on empty argument lists; that is, x() is expressed simply as x. This keeps code more readable, especially when chaining, like x.y.z instead of x().y().z().
Other things to note include being able to read in all lines with readlines instead of composing the array manually as you do there. Try:
make_database($stdin.readlines.map(&:chomp))
A more aggressive refactoring of your code looks like this:
def make_database(lines)
  # Define a Hash based on key/value pairs in an Array...
  Hash[
    # ...where these pairs are based on the input lines...
    lines.map do |line|
      # ...which have comma-separated components.
      line.split(',')
    end.select do |key, flag, _, value|
      # Pick out only those that have the right flag.
      flag.downcase == 'b'
    end.map do |key, flag, _, value|
      # Convert to a simple key/value pair array.
      [ key, value ]
    end
  ]
end
That might be a little hard to follow, but once you get the hang of chaining together a series of otherwise simple operations your Ruby code will be a lot more flexible and far easier to read.
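For instance, with some made-up input lines in the same shape the original loop expects (flag in the second field, value in the fourth), calling it would look something like:
# Hypothetical sample data, just to illustrate the shape of the result
make_database(["milk,B,dairy,2.50", "bread,x,bakery,1.80"])
# => {"milk"=>"2.50"}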

Pull min and max value from CSV file

I have a CSV file like:
123,hat,19.99
321,cap,13.99
I have this code:
products_file = File.open('text.txt')
while ! products_file.eof?
  line = products_file.gets.chomp
  puts line.inspect
  products[line[0].to_i] = [line[1], line[2].to_f]
end
products_file.close
which is reading the file. While it's not at the end of the file, it reads each line. I don't need the line.inspect in there, but it stores the file in an array inside of my products hash.
Now I want to pull the min and max value from the hash.
My code so far is:
read_file = File.open('text.txt', "r+").read
read_file.(?) |line|
products[ products.length] = gets.chomp.to_f
products.min_by { |x| x.size }
smallest = products
puts "Your highest priced product is #{smallest}"
Right now I don't have anything after read_file.(?) |line| so I get an error. I tried using min or max but neither worked.
Without using CSV
If I understand your question correctly, you don't have to use CSV class methods: just read the file (less header) into an array and determine the min and max as follows:
arr = ["123,hat,19.99", "321,cap,13.99",
"222,shoes,33.41", "255,shirt,19.95"]
arr.map { |s| s.split(',').last.to_f }.minmax
#=> [13.99, 33.41]
or
arr.map { |s| s[/\d+\.\d+$/].to_f }.minmax
#=> [13.99, 33.41]
If you want the associated records:
arr.minmax_by { |s| s.split(',').last.to_f }
=> ["321,cap,13.99", "222,shoes,33.41"]
With CSV
If you wish to use CSV to read the file into an array:
arr = [["123", "hat", "19.99"],
["321", "cap", "13.99"],
["222", "shoes", "33.41"],
["255", "shirt", "19.95"]]
then
arr.map(&:last).minmax
# => ["13.99", "33.41"]
or
arr.minmax_by(&:last)
#=> [["321", "cap", "13.99"],
# ["222", "shoes", "33.41"]]
if you want the records. Note that in the CSV examples I didn't convert the last field to a float, assuming that all records have two decimal digits.
You should use the built-in CSV class as such:
require 'csv'
data = CSV.read("text.txt")
data.sort!{ |row1, row2| row1[2].to_f <=> row2[2].to_f }
least_expensive = data.first
most_expensive = data.last
The Array#sort! method modifies data in place, so it is sorted based on the condition in the block for later usage. As you can see, the block sorts based on the values in each row at index 2 - in your case, the prices. Incidentally, you don't strictly need to convert these values to floats: as long as every price has the same number of digits before the decimal point, the strings sort the same way, and to_f stops working if you have leading non-digit characters (e.g. $), so you might find it better to just keep the values as strings during your sort.
Then you can grab the most and least expensive, or the 5 most expensive, or whatever, at your leisure.
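For example, with data sorted in ascending price order as above, the five cheapest and five most expensive rows are just slices off either end:
cheapest_five       = data.first(5)
most_expensive_five = data.last(5).reverse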

Ruby's optimized implementation of Histogram/Aggregator

I'm about to write my own, but I was wondering if there are any gems/libs that I can use as an aggregator/histogram.
My goal would be to sum up values based on a matching key:
["fish","2"]
["fish","40"]
["meat","56"]
["meat","1"]
It should sum up the values per unique key and return ["fish","42"] and ["meat","57"].
The files I have to aggregate are relatively large, about 4 GB text files made of TSV key/value pairs.
My goal is to avoid temporary files so I don't take up too much space on the machine, so I was wondering if something similar and already optimized exists. I have found a gem on GitHub named 'histogram', but it does not really contain the functionality I need.
Thanks
You can use a Hash with a default value of 0 to do the counting, then in the end you could convert it to Array to yield the format you want, though I think you might just want to keep using the Hash instead.
data = [
  ["fish","2"],
  ["fish","40"],
  ["meat","56"],
  ["meat","1"]
]
hist = data.each_with_object(Hash.new(0)) do |(k,v), h|
  h[k] += v.to_i
end
hist # => {"fish"=>42, "meat"=>57}
hist.to_a # => [["fish", 42], ["meat", 57]]
# To get String values, "42" instead of 42, etc:
hist.map { |k,v| [k, v.to_s] } # => [["fish", "42"], ["meat", "57"]]
Since you stated you had to read the data from a file, here is the above when applied to a file. The input.txt file contents are as follows for this example:
fish,2
fish,40
meat,56
meat,1
Then, to create the same output as before by reading it line by line:
file = File.open('input.txt')
hist = file.each_with_object(Hash.new(0)) do |line, h|
  key, value = line.split(',')
  h[key] += value.to_i
end
file.close
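Since the question mentions multi-gigabyte TSV files, it's worth noting that File.foreach streams the file line by line rather than loading it all into memory; switching to tab-separated input only changes the split. A sketch, assuming a hypothetical input.tsv with the same key/value layout:
hist = Hash.new(0)
File.foreach('input.tsv') do |line|
  key, value = line.chomp.split("\t")
  hist[key] += value.to_i
end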

Is this the best way to grab common elements from a Hash of arrays?

I'm trying to get a common element from a group of arrays in Ruby. Normally, you can use the & operator to compare two arrays, which returns the elements that are common to both arrays. This is all good, except when you're trying to get common elements from more than two arrays. However, I want to get common elements from an unknown, dynamic number of arrays, which are stored in a hash.
I had to resort to using the eval() method in ruby, which executes a string as actual code. Here's the function I wrote:
def get_common_elements_for_hash_of_arrays(hash) # get an array of common elements contained in a hash of arrays, for every array in the hash.
  # ["1","2","3"] & ["2","4","5"] & ["2","5","6"] # => ["2"]
  # eval("[\"1\",\"2\",\"3\"] & [\"2\",\"4\",\"5\"] & [\"2\",\"5\",\"6\"]") # => ["2"]
  eval_string_array = Array.new # an array to store strings of Arrays, ie: "[\"2\",\"5\",\"6\"]", which we will join with & to get all common elements
  hash.each do |key, array|
    eval_string_array << array.inspect
  end
  eval_string = eval_string_array.join(" & ") # create eval string delimited with & so we can get common values
  return eval(eval_string)
end
example_hash = {:item_0 => ["1","2","3"], :item_1 => ["2","4","5"], :item_2 => ["2","5","6"] }
puts get_common_elements_for_hash_of_arrays(example_hash) # => 2
This works and is great, but I'm wondering... eval, really? Is this the best way to do it? Are there even any other ways to accomplish this (besides a recursive function, of course)? If anyone has any suggestions, I'm all ears.
Otherwise, feel free to use this code if you need to grab a common item or element from a group or hash of arrays; this code can also easily be adapted to search an array of arrays.
Behold the power of inject! ;)
[[1,2,3],[1,3,5],[1,5,6]].inject(&:&)
=> [1]
As Jordan mentioned, if your version of Ruby lacks support for &-notation, just use
inject{|acc,elem| acc & elem}
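Applied to the hash from the question, the same idea becomes a one-liner over the hash's values:
example_hash = { :item_0 => ["1","2","3"], :item_1 => ["2","4","5"], :item_2 => ["2","5","6"] }
example_hash.values.inject(:&) # => ["2"]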
Can't you just do a comparison of the first two, take the result and compare it to the next one etc? That seems to meet your criteria.
Why not do this:
def get_common_elements_for_hash_of_arrays(hash)
  ret = nil
  hash.each do |key, array|
    if ret.nil? then
      ret = array
    else
      ret = array & ret
    end
  end
  ret = Array.new if ret.nil? # give back an empty array if passed an empty hash
  return ret
end
