I have a hash that looks like this
h1 = {"4c09a0da6071a593f051de32"=>["4c09a0da6071a593f051de32", "Cafe Bistro", 37.78458803130115, -122.40743637084961, 215.0], "4abbb03ef964a520668420e3"=>["4abbb03ef964a520668420e3", "The Plant Cafe Organic", 37.7977805076241, -122.3957633972168, 83.0] }
I would like to sort it by the final value in each array, e.g. 83.0, 215.0
I have tried
h1 = h1.sort_by{|k,v| v[4]}
but it outputs an array, not a hash. I would like to keep the hash the same, just reordered. How do I do this?
It's not a great idea to count on ordering in a Hash. Ruby didn't order hashes at all in 1.8. The data structure in its canonical form is not ordered.
It's better style to use an Array when ordering is important and a Hash or something else when key lookup is needed.
There is a grey area when writing tests. In that case, it may be reasonable to depend on Hash ordering since you are testing a specific Ruby program in certain conditions and you have, after all, a test that can fail should the implementation assumptions ever change.
You need to convert the array back to a hash:
h1 = Hash[h1.sort_by { |_,v| v[-1] }]
Note that this only works since Ruby 1.9. Before that, hashes were not an ordered data structure.
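As a minimal sketch of the same idea on the question's data, assuming Ruby 2.1 or later (where arrays of pairs respond to to_h), the sort and the conversion back to a hash can be chained. The variable name sorted is only illustrative:

```ruby
h1 = {
  "4c09a0da6071a593f051de32" => ["4c09a0da6071a593f051de32", "Cafe Bistro", 37.78458803130115, -122.40743637084961, 215.0],
  "4abbb03ef964a520668420e3" => ["4abbb03ef964a520668420e3", "The Plant Cafe Organic", 37.7977805076241, -122.3957633972168, 83.0]
}

# sort_by returns an array of [key, value] pairs;
# to_h (Ruby 2.1+) rebuilds a hash in that order.
sorted = h1.sort_by { |_id, row| row.last }.to_h

# Iteration now follows the sorted insertion order.
sorted.each_value { |row| puts "#{row[1]}: #{row.last}" }
```

On Ruby 1.9 and 2.0, Hash[...] as shown above does the same job.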
I have an array with hashes:
test = [
{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
{"type"=>1338, "age"=>18, "name"=>"John Doe"},
{"type"=>1339, "age"=>22, "name"=>"Carl Adley"},
{"type"=>1340, "age"=>25, "name"=>"Anna Brent"}
]
I am interested in getting all the hashes where the name key equals a value that can be found in an array:
get_hash_by_name = ["John Doe","Anna Brent"]
Which would end up in the following:
# test_sorted = would be:
# {"type"=>1338, "age"=>18, "name"=>"John Doe"}
# {"type"=>1340, "age"=>25, "name"=>"Anna Brent"}
I probably have to iterate with test.each somehow, but I'm still trying to get a grasp of Ruby. Happy for all help!
Here's something to meditate on:
Iterating over an array to find something is slow, even if it's a sorted array. Computer languages have various structures we can use to improve the speed of lookups, and in Ruby a Hash is usually a good starting point. Where an Array is like reading from a sequential file, a Hash is like reading from a random-access file: we can jump right to the record we need.
Starting with your test array-of-hashes:
test = [
{'type'=>1337, 'age'=>12, 'name'=>'Eric Johnson'},
{'type'=>1338, 'age'=>18, 'name'=>'John Doe'},
{'type'=>1339, 'age'=>22, 'name'=>'Carl Adley'},
{'type'=>1340, 'age'=>25, 'name'=>'Anna Brent'},
{'type'=>1341, 'age'=>13, 'name'=>'Eric Johnson'},
]
Notice that I added an additional "Eric Johnson" record. I'll get to that later.
I'd create a hash that mapped the array of hashes to a regular hash where the key of each pair is a unique value. The 'type' key/value pair appears to fit that need well:
test_by_types = test.map { |h| [h['type'], h] }.to_h
# => {1337=>{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
# 1338=>{"type"=>1338, "age"=>18, "name"=>"John Doe"},
# 1339=>{"type"=>1339, "age"=>22, "name"=>"Carl Adley"},
# 1340=>{"type"=>1340, "age"=>25, "name"=>"Anna Brent"},
# 1341=>{"type"=>1341, "age"=>13, "name"=>"Eric Johnson"}}
Now test_by_types is a hash using the type value to point to the original hash.
If I create a similar hash based on names, where each name, unique or not, points to the type values, I can do fast lookups:
test_by_names = test.each_with_object(
Hash.new { |h, k| h[k] = [] }
) { |e, h|
h[e['name']] << e['type']
}.to_h
# => {"Eric Johnson"=>[1337, 1341],
# "John Doe"=>[1338],
# "Carl Adley"=>[1339],
# "Anna Brent"=>[1340]}
Notice that "Eric Johnson" points to two records.
Now, here's how we look up things:
get_hash_by_name = ['John Doe', 'Anna Brent']
test_by_names.values_at(*get_hash_by_name).flatten
# => [1338, 1340]
In one quick lookup Ruby returned the matching types by looking up the names.
We can take that output and grab the original hashes:
test_by_types.values_at(*test_by_names.values_at(*get_hash_by_name).flatten)
# => [{"type"=>1338, "age"=>18, "name"=>"John Doe"},
# {"type"=>1340, "age"=>25, "name"=>"Anna Brent"}]
Because this is running against hashes, it's fast. The hashes can be BIG and it'll still run very fast.
Back to "Eric Johnson"...
When dealing with the names of people, collisions are likely, which is why test_by_names allows multiple type values. With one lookup all the matching records can be retrieved:
test_by_names.values_at('Eric Johnson').flatten
# => [1337, 1341]
test_by_types.values_at(*test_by_names.values_at('Eric Johnson').flatten)
# => [{"type"=>1337, "age"=>12, "name"=>"Eric Johnson"},
# {"type"=>1341, "age"=>13, "name"=>"Eric Johnson"}]
This will be a lot to chew on if you're new to Ruby, but the Ruby documentation covers it all, so dig through the Hash, Array and Enumerable class documentation.
Also, *, AKA "splat", explodes the array elements from the enclosing array into separate parameters suitable for passing into a method. I can't remember where that's documented.
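To make the splat concrete, here is a small sketch with a made-up hash (lookup and wanted are illustrative names, not part of the question's data). It shows what values_at receives with and without the *:

```ruby
lookup = { 'a' => 1, 'b' => 2, 'c' => 3 }
wanted = ['a', 'c']

# Without the splat the whole array is passed as ONE key,
# and no such key exists in the hash.
without_splat = lookup.values_at(wanted)

# With the splat this is the same call as lookup.values_at('a', 'c').
with_splat = lookup.values_at(*wanted)

p without_splat
p with_splat
```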
If you're familiar with database design this will look very familiar, because it's similar to how we do database lookups.
The point of all of this is that it's really important to consider how you're going to store your data when you first ingest it into your program. Do it wrong and you'll jump through major hoops trying to do useful things with it. Do it right and the code and data will flow through very easily, and you'll be able to massage/extract/combine the data easily.
Said differently, Arrays are containers useful for holding things you want to access sequentially, such as jobs you want to print, sites you need to access in order, files you want to delete in a specific order, but they're lousy when you want to lookup and work with a record randomly.
Knowing which container is appropriate is important, and for this particular task, it appears that an array of hashes isn't appropriate, since there's no fast way of accessing specific ones.
And that's why I made my comment above asking what you were trying to accomplish in the first place. See "What is the XY problem?" and "XyProblem" for more about that particular question.
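If the type values are not needed as an intermediate step, a shorter sketch of the same index-by-name idea is possible with Enumerable#group_by. This is a variation on the approach above rather than a replacement for it, and by_name, wanted and matches are names chosen only for this example:

```ruby
test = [
  { 'type' => 1337, 'age' => 12, 'name' => 'Eric Johnson' },
  { 'type' => 1338, 'age' => 18, 'name' => 'John Doe' },
  { 'type' => 1339, 'age' => 22, 'name' => 'Carl Adley' },
  { 'type' => 1340, 'age' => 25, 'name' => 'Anna Brent' },
  { 'type' => 1341, 'age' => 13, 'name' => 'Eric Johnson' }
]

# Build the index once: each name maps to an array of matching records.
by_name = test.group_by { |h| h['name'] }

wanted = ['John Doe', 'Anna Brent']

# values_at returns nil for unknown names, so compact drops those
# before the per-name arrays are flattened into one list.
matches = by_name.values_at(*wanted).compact.flatten
```

The index costs one pass over the array to build, so it pays off when many lookups follow.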
You can use select and include? so
test.select {|object| get_hash_by_name.include? object['name'] }
…should do the job.
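A runnable sketch of that approach on the question's data follows. Using a Set for the wanted names is an addition of mine, not part of the answer: Array#include? scans the list on every call, while Set#include? is a hash lookup, which only matters once the list of names gets long.

```ruby
require 'set'

test = [
  { 'type' => 1337, 'age' => 12, 'name' => 'Eric Johnson' },
  { 'type' => 1338, 'age' => 18, 'name' => 'John Doe' },
  { 'type' => 1339, 'age' => 22, 'name' => 'Carl Adley' },
  { 'type' => 1340, 'age' => 25, 'name' => 'Anna Brent' }
]

# The names to keep; a plain array works here too.
get_hash_by_name = Set['John Doe', 'Anna Brent']

# select keeps every hash for which the block returns true,
# preserving the original order of the array.
selected = test.select { |person| get_hash_by_name.include?(person['name']) }
```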
I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to hash the object I'm checking membership for (with murmur3), then add 1, 2 and 3 to that hash (for the 3 different hashes) and hash each result again?
As the murmur3 function has a very good avalanche effect (really spreads out results) wouldn't this for all purposes be reasonable?
Pseudo-code:
function generateHashes(obj) {
long hash = murmur3_hash(obj);
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
(hash1, hash2, hash3)
}
If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale to more hash functions if need be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or however many parts you want for your Bloom filter. So, for example, create a hash of 128 bits and split it into 2 hashes of 64 bits each.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
The hash functions of a Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes as you like this way. For educational purposes it is fine.
But in the real world, running a hash function multiple times is slow, so the usual approach is to derive the additional hashes cheaply instead of computing a full hash for each one.
To correct #memo, this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to get two independent hashes and combine them.
function generateHashes(obj) {
// initialization phase
long h1 = murmur3_hash(obj);
long h2 = murmur3_hash(h1);
int k = 3; // number of desired hash functions
long hash[k];
// generation phase
for (int i=0; i<k; i++) {
hash[i] = h1 + (i*h2);
}
return hash;
}
As you see, this way creating a new hash is a simple multiply-add operation.
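For readers who want to run the idea, here is a rough Ruby sketch of the same h1 + i*h2 scheme wired into a tiny filter. Several things here are assumptions of mine and not part of the answer: SHA-256 from Ruby's standard library stands in for murmur3 (which does not ship with Ruby), the helper names hash64, bloom_indexes, bloom_add and bloom_include? are made up, and the filter is a plain array of booleans rather than a packed bit array.

```ruby
require 'digest'

# 64-bit integer taken from the first 8 bytes of a SHA-256 digest.
# SHA-256 is only a stand-in for murmur3 in this sketch.
def hash64(value)
  Digest::SHA256.digest(value.to_s).unpack('Q>').first
end

# Derive k positions in a filter of `size` slots using h1 + i * h2.
def bloom_indexes(item, k, size)
  h1 = hash64(item)
  h2 = hash64(h1) | 1 # keep the step odd so it is never zero
  Array.new(k) { |i| (h1 + i * h2) % size }
end

# Mark every derived position for the item.
def bloom_add(bits, item, k)
  bloom_indexes(item, k, bits.size).each { |i| bits[i] = true }
end

# true means "possibly present"; false means "definitely absent".
def bloom_include?(bits, item, k)
  bloom_indexes(item, k, bits.size).all? { |i| bits[i] }
end

bits = Array.new(1024, false)
k = 3 # number of derived hash functions
bloom_add(bits, 'foo', k)
bloom_add(bits, 'bar', k)

puts bloom_include?(bits, 'foo', k)
```

Raising k only changes the loop count; the two expensive hash computations per item stay the same.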
It would not be a good approach. Let me try and explain. A Bloom filter allows you to test if an element most likely belongs to a set, or if it absolutely doesn’t. In other words, false positives may occur, but false negatives won’t.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and we pass it to the multiple hash functions. murmur3 hash gives the output K, and subsequent hashes on this hash value give x, y and z
Now assume you have another string 'bar' and as it happens, its murmur3 hash is also K. The remaining hash values? They will be x, y and z because in your proposed approach the subsequent hash functions are not dependent on the input, but instead on the output of first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we perform search for 'foo' or for 'bar' we would say that it is 'likely' that both of them are present. So the % of false positives will increase.
In other words this bloom filter will behave like a simple hash-function. The 'bloom' aspect of it will not come into picture because only the first hash function is determining the outcome of search.
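The argument can be checked with a small experiment. The sketch below is only an illustration under loud assumptions: weak_hash is a deliberately poor base hash (the sum of the bytes) chosen so that a collision is trivial to produce, and mix uses MD5 from the standard library as a stand-in for murmur3. Neither function comes from the question.

```ruby
require 'digest'

# Deliberately weak base hash so that two inputs collide easily.
def weak_hash(str)
  str.bytes.sum
end

# Stand-in for the second round of hashing.
def mix(number)
  Digest::MD5.hexdigest(number.to_s)
end

# The scheme from the question: every derived hash depends only
# on the base hash, never on the original input.
def derived_hashes(str)
  base = weak_hash(str)
  (1..3).map { |i| mix(base + i) }
end

# 'ab' and 'ba' have the same byte sum, so the base hashes collide
# and all three derived hashes collide with them.
collide = derived_hashes('ab') == derived_hashes('ba')
differ  = derived_hashes('ab') != derived_hashes('ac')

puts collide
puts differ
```

With a good 64-bit base hash such collisions are rare, which is the counterpoint made in the other answer; the sketch only shows that the derived hashes add no protection once the base hashes do collide.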
Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.
So I have an array of hashes:
[{"id":"30","name":"Dave"},
{"id":"57","name":"Mike"},
{"id":"9","name":"Kevin"},
...
{"id":"1","name":"Steve"}]
And I want to sort it by the id attribute, so that it looks like this:
[{"id":"1","name":"Steve"},
{"id":"2","name":"Walter"},
...
{"id":"60","name":"Chester"}]
I'm assuming I use the sort_by method but I'm not exactly sure how to do it.
This should work:
array.sort_by { |hash| hash['id'].to_i }
In this case, sort_by is preferred over sort because it is more efficient. While sort calls to_i on every comparison, sort_by does it once for each element in array and remembers the result.
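The to_i call matters because the ids are strings, and strings compare character by character. A short sketch using only the id values shows the difference:

```ruby
ids = %w[30 57 9 1]

# Plain string comparison: "30" sorts before "9" because "3" < "9".
as_strings = ids.sort

# Converting each id to an integer gives numeric order.
as_numbers = ids.sort_by(&:to_i)

p as_strings
p as_numbers
```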
When I see incoming data like that, it's almost always a JSON string. Ruby doesn't automatically understand JSON, nor does it automatically know how to convert it, but Ruby does make it easy for us to convert from/to it:
require 'json'
json_data = '[{"id":"30","name":"Dave"},
{"id":"57","name":"Mike"},
{"id":"9","name":"Kevin"},
{"id":"1","name":"Steve"}]'
ary = JSON[json_data].sort_by{ |e| e['id'].to_i }
ary
# => [{"id"=>"1", "name"=>"Steve"}, {"id"=>"9", "name"=>"Kevin"}, {"id"=>"30", "name"=>"Dave"}, {"id"=>"57", "name"=>"Mike"}]
The only real trick here is:
JSON[json_data]
A lot of the time you'll see people use JSON.parse(json_data), but the [] method is smart enough to recognize whether it's getting a String or an array or a hash. If it's a string, it tries to parse it, assuming it's incoming data. If it's an array or a hash, it converts it to a JSON string for output. The result is that using JSON[...] simplifies the use of the class, so we don't have to use parse or to_json.
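A quick sketch of both directions, using one record from the question (parsed and generated are illustrative names):

```ruby
require 'json'

# String in, Ruby object out (same as JSON.parse).
parsed = JSON['{"id":"30","name":"Dave"}']

# Hash in, JSON string out (same as JSON.generate).
generated = JSON[{ 'id' => '30', 'name' => 'Dave' }]

p parsed
puts generated
```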
Otherwise, using sort_by is preferred over using sort unless you are directly comparing two simple variables, like integer to integer, string to string or character to character. Once you have to dive into an object, or do some sort of calculation to determine how things compare, then you should use sort_by. See Wikipedia's article on Schwartzian Transform to understand what's going on under the covers. It's a very powerful technique that can speed up sorting remarkably.
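As a rough illustration of the decorate-sort-undecorate idea behind the Schwartzian transform, the steps can be written out by hand. This is a sketch of the concept, not a description of how Ruby implements sort_by internally:

```ruby
data = [
  { 'id' => '30', 'name' => 'Dave' },
  { 'id' => '57', 'name' => 'Mike' },
  { 'id' => '9',  'name' => 'Kevin' },
  { 'id' => '1',  'name' => 'Steve' }
]

# Decorate: compute the sort key exactly once per element.
decorated = data.map { |h| [h['id'].to_i, h] }

# Sort: compare only the precomputed keys.
sorted_pairs = decorated.sort { |a, b| a.first <=> b.first }

# Undecorate: throw the keys away and keep the original elements.
manual = sorted_pairs.map(&:last)

# sort_by gives the same result in one call.
builtin = data.sort_by { |h| h['id'].to_i }
```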
Your Hash syntax is wrong. If the keys were symbols, it would look like this:
data = [
{id:"30", name:"Dave"},
{id:"57", name:"Mike"},
{id:"9", name:"Kevin"},
{id:"1", name:"Steve"}
]
sorted_data = data.sort_by{|x| x[:id].to_i}
Edit: Forgot the to_i, fixed. If the keys are strings the : way of defining a hash does not work, so we need hash-rockets instead:
data = [{"id"=>"30","name"=>"Dave"},
{"id"=>"57","name"=>"Mike"},
{"id"=>"9","name"=>"Kevin"},
{"id"=>"1","name"=>"Steve"}]
sorted_data = data.sort_by{|x| x['id'].to_i}
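If the data actually arrives as a JSON string, the choice between string and symbol keys can be made at parse time. The following is a small sketch assuming the json library from Ruby's standard distribution; symbolize_names is an option of JSON.parse, and the shortened json_data string is made up for the example:

```ruby
require 'json'

json_data = '[{"id":"30","name":"Dave"},{"id":"9","name":"Kevin"}]'

# Default: keys come back as strings, so use hash['id'].
string_keyed = JSON.parse(json_data).sort_by { |h| h['id'].to_i }

# With symbolize_names: true, keys come back as symbols, so use hash[:id].
symbol_keyed = JSON.parse(json_data, symbolize_names: true).sort_by { |h| h[:id].to_i }
```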
I believe that I may be missing something here, so please bear with me as I explain two scenarios in hopes to reconcile my misunderstanding:
My end goal is to create a dataset that's acceptable by Highcharts via lazy_high_charts, however in this quest, I'm finding that it is rather particular about the format of data that it receives.
A) I have found that when data is formatted like this going into it, it draws the points just fine:
[0.0000001240,0.0000000267,0.0000000722, ..., 0.0000000512]
I'm able to generate an array like this simply with:
array = Array.new
data.each do |row|
array.push row[:datapoint1].to_f
end
B) Yet, if I attempt to use the map function, I end up with a result like this, which Highcharts fails to render:
[[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], ..., [3.79e-09]]
From code like:
array = data.map{|row| [(row.datapoint1.to_f)] }
Is there a way to coax the map function to produce results in B that are more akin to the data structure in scenario A?
This gets more involved, as I also have to add datetime into this; however, that's another topic and I just want to understand this first and what can be done to perhaps further control where I'm going.
Ultimately, EVEN SCENARIO B SHOULD WORK according to the data in the example here: http://www.highcharts.com/demo/spline-irregular-time (press the "View options" button at bottom)
Heck, I'll send you a sucker in the mail if you can fill me in on that part! ;)
You can fix arrays like this
[[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], ..., [3.79e-09]]
that have nested arrays inside them by using the flatten method on the array.
But you should be able to avoid generating nested arrays in the first place. Just remove the square brackets from your map line:
array = data.map{|row| row.datapoint1.to_f }
Code
a = [[6.67e-09],[4.39e-09],[2.1e-09],[2.52e-09], [3.79e-09]]
b = a.flatten.map{|el| "%.10f" % el }
puts b.inspect
Output
["0.0000000067", "0.0000000044", "0.0000000021", "0.0000000025", "0.0000000038"]
Unless I, too, am missing something, your problem is that you're returning a single-element array from your block (thereby creating an array of arrays) instead of just the value. This should do you:
array = data.map {|row| row.datapoint1.to_f }
# => [ 6.67e-09, 4.39e-09, 2.1e-09, 2.52e-09, ..., 3.79e-09 ]
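Here is a self-contained sketch of the difference. The question does not show what the rows are, so a Struct with a datapoint1 field is used as a hypothetical stand-in, and the sample values are made up:

```ruby
# Hypothetical stand-in for the question's row objects.
row_class = Struct.new(:datapoint1)
data = [row_class.new('6.67e-09'), row_class.new('4.39e-09'), row_class.new('2.1e-09')]

# Wrapping the value in [] makes every block result a one-element array.
nested = data.map { |row| [row.datapoint1.to_f] }

# Returning the bare value gives a flat array of floats.
flat = data.map { |row| row.datapoint1.to_f }

p nested
p flat
```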
In the code below, the order of my items gets changed after the JSON.parse(f) line, i.e., this hash:
{
a => aval,
b => bval,
c => cval,
d => dval
}
becomes something like:
{
b => bval,
c => cval,
a => aval,
d => dval
}
This is a problem because my display code just reads from the json file, so any time I save back to it, and then display, everything gets changed around. Is there anything I can do to retain the order?
CODE:
f = File.read($PLAN_DESC_PATH)
puts ("f " + f.to_s())
hash = JSON.parse(f)
puts ("hash " + hash.to_s())
My Ruby version is 1.8.7. I am using Sinatra. I believe I got the JSON gem from here: http://flori.github.com/json/ (sorry, kinda new to this). Thanks!
In Ruby 1.8.7 the Hash class does not maintain order either by keys or by order added. If you need something like that, you would need to implement something like ActiveSupport::OrderedHash (http://rubydoc.info/docs/rails/ActiveSupport/OrderedHash)
In Ruby 1.9.x hashes are ordered by when they are inserted by default (see http://www.ruby-doc.org/core/classes/Hash.html)
When you serialize a hash to JSON, all bets are off for maintaining order of your keys. You'll need some post processing after your serialization to ensure order if that's necessary for you.
No, hashmaps are not meant to have a specific ordering. If you need ordering, use something different, like an array. Or extract all the keys, sort them the way you want, and then read the values in that order.
In any case, you shouldn't rely on assumptions about ordering inside maps.
A good alternative would be to have:
[ [a, aval], [b, bval], ... ]
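A sketch of how that could look with the JSON library, using placeholder strings for the keys and values from the question. Because a JSON array is ordered, the pairs come back in the order they were saved:

```ruby
require 'json'

# An array of [key, value] pairs instead of a hash.
pairs = [['a', 'aval'], ['b', 'bval'], ['c', 'cval'], ['d', 'dval']]

# Serialize and read back.
json = JSON.generate(pairs)
restored = JSON.parse(json)

# Iterate in the saved order.
restored.each { |key, value| puts "#{key} => #{value}" }

# Array#assoc finds a pair by its first element when a lookup is needed.
c_pair = restored.assoc('c')
```

Lookups with assoc scan the array, so this trades lookup speed for a guaranteed order.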
Jack answered for Ruby, so I'll answer for JSON. From RFC 4627 (emphasis added):
"An object is an unordered collection of zero or more name/value pairs"