I recently started learning Ruby, and hashes in particular. At first I learned that hashes are unordered, which makes sense, but then I found out that hashes are ordered in later versions of Ruby (1.9 and up). I don't really understand why, or the concept behind it.
Could I get some insight into what ordered hashes are for? Possible use cases for a regular hash vs. an ordered hash would be nice too.
Some people like to rely on the ordering of a Hash, because an ordered hash remembers the insertion order of its key/value pairs. This lets the programmer use a hash somewhat like a queue with random access to the values associated with the keys. That is useful if they intend to change values on the fly and then iterate over the queue's key/value pairs to retrieve them in insertion order again.
Also, rather than have to supply indexes into the queue, like they would if they were using an Array-based queue, they can supply a symbolic name.
Instead of:
queue[0]
they can use:
queue[:fred]
That's the only use-case I can see for ordered hashes; it'd be really easy to duplicate the functionality with a separate queue of keys that preserved the insertion order.
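In Ruby 1.9 and later a plain Hash already gives you exactly this queue-like behaviour, since Hash#shift removes the oldest pair. A small sketch (the keys and values here are made up for illustration):

```ruby
# A Ruby Hash (1.9+) preserves insertion order, so it can act as a
# queue that also supports random access by symbolic name.
queue = {}
queue[:fred]   = "first job"
queue[:wilma]  = "second job"
queue[:barney] = "third job"

queue[:wilma] = "updated job"  # change a value in place; order is kept

key, value = queue.shift       # dequeues the OLDEST pair
# key   => :fred
# value => "first job"

queue.each { |k, v| puts "#{k}: #{v}" }  # iterates in insertion order
```

Updating `queue[:wilma]` does not move it to the back; only inserting a brand-new key appends to the order.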
Looking back at some of Matz's previous posts, he was pretty vague about why it was implemented. Check out https://www.ruby-forum.com/topic/166075
He basically states that it was implemented to fit some edge cases, but he didn't elaborate much beyond that. He also stated that there was no performance impact, just a negligible increase in memory consumption.
Imagine git commits handled by a Ruby git wrapper. They are likely stored in a Hash with SHAs as keys, while the ordering (say, by date) makes them easy to iterate over in a human-friendly manner.
Related
I have a list of hashes. A long list. A very long list. I need to check whether a given hash is in that list.
The easiest way is to store the hashes in memory (in a map or a simple array) and check against that. But it would require a lot of RAM/SSD/HDD memory, more than the server(s) can handle.
I'm wondering whether there's a trick to do this with reasonable memory usage. Maybe there's an algorithm I'm not familiar with, or a special collection?
Three thoughts:
Depending on the structure of these hashes, you may be able to borrow some ideas from the concept of a Rainbow Table to implicitly store some of them.
You could use a trie to compress storage for shared prefixes if you have enough hashes; however, given their length and (likely) uniformity, you won't see terrific savings.
You could split each hash into multiple smaller hashes and use those to implement a Bloom filter. This is a probabilistic test, so you'd still need the hashes stored somewhere else (or able to be calculated/derived) whenever there's a perceived "hit", but it may filter out enough "misses" that a slower but more compact data structure becomes feasible.
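The Bloom-filter idea from the last point can be sketched in plain Ruby. This is a toy illustration, not a tuned implementation: the bit-array size and hash count are arbitrary, and SHA-256 is just a convenient way to derive the k index functions.

```ruby
require "digest"

# Minimal Bloom filter: m bits, k index functions sliced out of a
# SHA-256 digest. A "no" answer is definitive; a "yes" answer may be
# a false positive.
class BloomFilter
  def initialize(m_bits, k_hashes)
    @m = m_bits
    @k = k_hashes
    @bits = 0  # a big integer doubles as the bit array
  end

  def indexes(item)
    digest = Digest::SHA256.digest(item)
    # take 4 bytes of the digest per index function
    (0...@k).map { |i| digest[i * 4, 4].unpack1("N") % @m }
  end

  def add(item)
    indexes(item).each { |idx| @bits |= (1 << idx) }
  end

  def maybe_include?(item)
    indexes(item).all? { |idx| @bits[idx] == 1 }
  end
end

filter = BloomFilter.new(10_000, 5)
filter.add("deadbeef")
filter.maybe_include?("deadbeef")  # => true, always
filter.maybe_include?("cafebabe")  # => false (with high probability)
```

In practice you would size m and k from the expected element count and target false-positive rate, and use a real bit vector rather than a bignum.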
I'm currently looking for a data structure where all of the following operations are O(1):
insert(K, V): Insert a value at the end of the queue.
remove_key(K): Remove the value from the queue corresponding to the provided key.
remove_head(): Remove the value from the front of the queue (the oldest one).
The only reasonably easy to implement thing I can think of is using a doubly linked list as the primary data structure, and keeping pointers to the list nodes in a hash table, which would get the desired asymptotic behavior, however this might not be the most efficient option in actual runtime.
I found "addressable priority queues" in the literature, but they are rather complicated (and maybe even more expensive) data structures, so I was wondering if someone has a better suggestion. It seems no one has implemented something like this for Rust so far, which is why I'm hoping it doesn't get too complicated.
I would use a VecDeque<T> and use pop_front() instead of remove_head().
See the doc: VecDeque
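For comparison, the hash-table-plus-doubly-linked-list structure described in the question is straightforward to sketch. This is an illustrative Ruby version (class and method names are my own), but the shape translates to Rust with a HashMap plus an index- or arena-based linked list:

```ruby
# A doubly linked list holds the values in insertion order, and a hash
# maps each key to its list node. All three operations are O(1):
# unlinking a node never requires a traversal.
class LinkedQueue
  Node = Struct.new(:key, :value, :prev, :next)

  def initialize
    @head = Node.new  # sentinel nodes simplify unlinking
    @tail = Node.new
    @head.next = @tail
    @tail.prev = @head
    @nodes = {}       # key => Node
  end

  def insert(key, value)
    node = Node.new(key, value, @tail.prev, @tail)
    @tail.prev.next = node
    @tail.prev = node
    @nodes[key] = node
  end

  def remove_key(key)
    node = @nodes.delete(key) or return nil
    unlink(node)
  end

  def remove_head
    return nil if @head.next == @tail
    node = @head.next
    @nodes.delete(node.key)
    unlink(node)
  end

  private

  def unlink(node)  # splice the node out and return its value
    node.prev.next = node.next
    node.next.prev = node.prev
    node.value
  end
end

q = LinkedQueue.new
q.insert(:a, 1)
q.insert(:b, 2)
q.insert(:c, 3)
q.remove_key(:b)  # => 2
q.remove_head     # => 1 (the oldest entry, :a)
```

Amusingly, in Ruby itself a plain Hash already provides all three operations, since Hash#shift removes the oldest entry; the explicit version above just shows the mechanics.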
Here is an Addressable Binary Heap I implemented in Python, with no third-party dependencies.
I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hash tables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
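To make the idea concrete, here is a minimal toy trie (insert and exact-match lookup only); the node layout and names are illustrative:

```ruby
# Minimal trie: each node is a Hash from character to child node,
# with :end marking that a complete word terminates here.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.each_char do |ch|
      node = node[ch] or return false
    end
    node.key?(:end)
  end
end

trie = Trie.new
trie.insert("car")
trie.insert("cart")
trie.include?("car")   # => true
trie.include?("ca")    # => false (it's only a prefix)
```

Note that deletion from a trie is not O(1) in general (you may have to walk back up pruning empty nodes), which is why it only "may or may not fit" the O(1)-removal requirement here.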
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.
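Ruby ships one in its standard library too: Set, which is backed by a Hash, so insertion, deletion, and membership tests are all O(1) on average. A quick illustration:

```ruby
require "set"

items = Set.new

items.add("alpha")   # O(1) average insert
items.add("beta")
items << "gamma"     # << is an alias for add

items.delete("beta") # O(1) average random deletion

items.include?("beta")   # => false
items.include?("alpha")  # => true

# Iteration works, and (because the backing store is a Hash)
# happens in insertion order, though order wasn't required here.
items.each { |item| puts item }
```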
I'm trying to figure out what data structure would quickly support the following operations:
Add a string (if it's not there, add it, if it is there, increment a counter for the word)
Count a given string (look up by string and then read the counter)
I'm debating between a hash table and a trie. From my understanding, a hash table is fast to look up and add to as long as you avoid collisions. If I don't know my inputs ahead of time, would a trie be a better way to go?
It really depends on the kinds of strings you're going to use as keys. If your strings are highly variable and you don't have a good hash function for them, a trie can outperform a hash table.
However, given a good hash, the lookup will be faster than in a trie. (Given a very bad hash, the opposite is true, though.) If you don't know your inputs, but do have a decent hashing algorithm, I personally prefer using a hash.
Also, most modern languages/frameworks ship very good hashing algorithms, so chances are you'll be able to implement a well-performing lookup using a hash with very little work.
A trie won't buy you much; tries are only interesting when prefixes are important. Hash tables are simpler, and usually part of your language's standard library, if not part of the language itself (Ruby, Python, etc.). Here's a dead-simple way to do this in Ruby:
strings = %w(some words that may be repeated repeated)
counts = Hash.new(0)
strings.each { |s| counts[s] += 1 }
counts  # => {"some"=>1, "words"=>1, "that"=>1, "may"=>1, "be"=>1, "repeated"=>2}
Addenda:
For C++, you can probably use Boost's hash implementation.
Either one is reasonably fast.
It isn't necessary to completely avoid collisions.
Looking at performance a little more closely: hash tables are usually faster than trees, but I doubt a real-life program has ever run too slowly simply because it used a tree instead of a hash table, and some trees are faster than some hash tables.
What else can we say? Well, hash tables are more common than trees.
One advantage of the more complex trees is that they have predictable access times. With hash tables and simple binary trees, the performance you see depends on the data; with a hash table, performance also depends strongly on the quality of the implementation and its configuration with respect to the data-set size.
I have a basic question. Say you have an NSFetchRequest that you want to execute against an NSManagedObjectContext. If the fetch request doesn't have any sort descriptors set explicitly, will the objects come back in a random order every time, or will they be spit out into an array in the order they were originally added to the managed object context? I can't find the answer anywhere in the documentation.
No, they're not guaranteed to be ordered. You might happen to see consistent ordering depending on what type of data store you use (I've never tried), but it's not something you should depend on in any way.
It's easy to order by creation date though. Just add a date attribute to your entity, initialize it to the current date in awakeFromInsert, and specify that sort descriptor in your fetch.
The order may not be "random every time" but as far as I know you cannot/should not depend on it. If you need a specific order, then use sort descriptors.
I see two questions here: will it come out in the same ordering every time? And, is that ordering on insertion order?
It comes out in set order, which is some ordering. Note that NSSet is just an interface; there are private classes that actually implement it. That means that while a given NSSet instance might return objects in a consistent ordering when you call allObjects on it, that ordering is almost assuredly hash ordering, since sets are almost universally implemented as hashed dictionaries.
Since the hashing varies depending on what is stored and how it's hashed, you might "luck out" and see the same ordering every time, only to be caught off guard another time when something changes very slightly.
So, technically, it's not really random and it could be in some stable ordering.
To the second question, I would say it's almost assuredly NOT in insertion order.
Marc's suggestion for handling awakeFromInsert is a good one, and what you would want.
There is no guarantee on the ordering. For example, I could implement an NSAtomicStore or NSIncrementalStore that returns results in random order and it would be completely correct. I have seen the SQLite store return different ordering on different versions of the operating system as well.