Sometimes it's convenient to create a hash table where the hash function uses the hashed object's address. The hash function is easy to implement and cheap to execute.
But there's a problem if you are working on a system that uses some form of copying garbage collection: after an object has been moved, the previously calculated hash value is no longer correct. Even rehashing the table during or after garbage collection seems problematic, since (in principle) a hash function could allocate memory and trigger another garbage collection.
My question is how to work around these issues. I'm not seeing any clean solution without some kind of compromise.
How to work around this issue depends on the tools provided by the ecosystem of the specific programming language you are using.
For example, in C#, one can use RuntimeHelpers.GetHashCode() to get a hash that is guaranteed not to change over the full lifetime of an object, even when a garbage collection takes place and the physical address of the object changes in memory.
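The usual mechanism behind such a guarantee is to compute an identity hash once, on first request, and cache it with the object, so that later moves cannot invalidate it (the JVM, for instance, stores the identity hash in the object header). Here is a minimal sketch of that idea in Rust; ObjectHeader and NEXT_HASH are illustrative names, not any real runtime's API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for a runtime's per-object header.
struct ObjectHeader {
    identity_hash: AtomicU64, // 0 means "not assigned yet"
}

// Global counter used to hand out stable identity hashes.
static NEXT_HASH: AtomicU64 = AtomicU64::new(1);

impl ObjectHeader {
    fn new() -> Self {
        ObjectHeader { identity_hash: AtomicU64::new(0) }
    }

    // Returns the same value for the lifetime of the object, even if a
    // moving collector later relocates it: the hash is computed once
    // (here from a counter; a real runtime might use the current address)
    // and cached in the header, which moves along with the object.
    fn identity_hash(&self) -> u64 {
        let h = self.identity_hash.load(Ordering::Relaxed);
        if h != 0 {
            return h;
        }
        let candidate = NEXT_HASH.fetch_add(1, Ordering::Relaxed);
        match self.identity_hash.compare_exchange(
            0, candidate, Ordering::Relaxed, Ordering::Relaxed,
        ) {
            Ok(_) => candidate,
            Err(existing) => existing, // another thread assigned it first
        }
    }
}

fn main() {
    let obj = ObjectHeader::new();
    let h1 = obj.identity_hash();
    let h2 = obj.identity_hash();
    assert_eq!(h1, h2); // stable across calls (and, conceptually, moves)
    println!("identity hash: {h1}");
}
```

Because the cached value travels with the object when the collector moves it, no rehashing of existing tables is ever needed.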
I have learned about dynamic arrays (non-fixed-size arrays), such as vector in C++ and ArrayList in Java, and how they can be implemented.
Basically, when the array is full, we create another array of double the size and copy the old items into the new array.
So: can we implement an array of non-fixed size with random access, like a vector or ArrayList, without spending time copying the old elements?
In other words, is there a data structure like that (dynamic size, random access, and no need to copy elements)?
Depending on what you mean by "like", the answer ranges from trivially impossible to already exists.
First, the trivially impossible. When we create an array, we mark a section of memory as being only for that array. If you have three such arrays that can grow without bound, one of them will eventually run into another. Given that we can actually create arrays bigger than available memory (they just page to disk), we have to manage this risk, not avoid it.
But how big an issue is the copying? It works out to amortized O(1) per element, no matter how big the array gets, and the constant factor is low. The real cost of this dynamism is that you always need to check where the array currently starts before indexing into it. But that's a pretty fast check.
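To make the amortized claim concrete, here is a minimal sketch of the doubling strategy in Rust (GrowableArray is an illustrative name; a real Vec does the same thing with raw allocations):

```rust
// Illustrative doubling array; a real Vec does this with raw allocations.
struct GrowableArray<T> {
    storage: Box<[Option<T>]>, // fixed-size backing buffer
    len: usize,                // number of occupied slots
}

impl<T> GrowableArray<T> {
    fn new() -> Self {
        GrowableArray { storage: Vec::new().into_boxed_slice(), len: 0 }
    }

    fn push(&mut self, value: T) {
        if self.len == self.storage.len() {
            // Full: allocate a buffer of double the size and move the old
            // elements across. This copy happens so rarely (at sizes 1, 2,
            // 4, 8, ...) that pushes are amortized O(1).
            let new_cap = (self.storage.len() * 2).max(1);
            let mut bigger: Vec<Option<T>> = Vec::with_capacity(new_cap);
            for slot in self.storage.iter_mut() {
                bigger.push(slot.take());
            }
            bigger.resize_with(new_cap, || None);
            self.storage = bigger.into_boxed_slice();
        }
        self.storage[self.len] = Some(value);
        self.len += 1;
    }

    // Random access costs one bounds check plus one indirection.
    fn get(&self, index: usize) -> Option<&T> {
        self.storage.get(index)?.as_ref()
    }
}

fn main() {
    let mut a = GrowableArray::new();
    for i in 0..10 {
        a.push(i);
    }
    assert_eq!(a.get(7), Some(&7));
}
```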
Alternatively, we can move to paged memory. Now an array access looks like, "Check what page it is on, then look at where it is in the page." Now your array can grow, but you never change where anything is. If you want it to grow without bound, though, you have to add levels of pages. We can implement this, and it does avoid copying, but this form of indirection has generally NOT proven worth it for general-purpose programming. However, paging is used in databases. It is also used by operating systems to translate what the program thinks is the address of the data into the actual address in memory. If you want to dive down that rabbit hole, the TLB (translation lookaside buffer) is worth looking at.
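Here is a minimal sketch of the fixed-page idea, assuming a two-level scheme and a deliberately tiny page size (PagedArray and PAGE_SIZE are illustrative names):

```rust
const PAGE_SIZE: usize = 4; // tiny, to keep the example readable

// A two-level paged array: growth only appends new pages, so elements
// never move once written.
struct PagedArray<T> {
    pages: Vec<Box<[Option<T>; PAGE_SIZE]>>,
    len: usize,
}

impl<T> PagedArray<T> {
    fn new() -> Self {
        PagedArray { pages: Vec::new(), len: 0 }
    }

    fn push(&mut self, value: T) {
        if self.len == self.pages.len() * PAGE_SIZE {
            // Out of room: add a fresh page. No existing element is copied.
            self.pages.push(Box::new(std::array::from_fn(|_| None)));
        }
        self.pages[self.len / PAGE_SIZE][self.len % PAGE_SIZE] = Some(value);
        self.len += 1;
    }

    // "Check what page it is on, then look at where it is in the page."
    fn get(&self, index: usize) -> Option<&T> {
        self.pages.get(index / PAGE_SIZE)?[index % PAGE_SIZE].as_ref()
    }
}

fn main() {
    let mut a = PagedArray::new();
    for i in 0..10 {
        a.push(i); // growth adds pages; existing elements stay put
    }
    assert_eq!(a.get(9), Some(&9));
}
```

Note that the page table itself (a small vector of page pointers) may still be reallocated as it grows, but the elements themselves never move; adding more levels removes even that, at the price of more indirection per access.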
There are other options as well. Instead of fixed-size pages, we can have variable-size ones. This approach gets very complicated very quickly, but the result is very useful. Look up ropes for more.
The browser that I wrote this in stores the text of what I wrote using a rope. That is how it can easily offer features like multi-level undo and editing in the middle of the document. However, the raw performance cost of such schemes is significant. It is clearly worthwhile if you need the features, but otherwise we don't do it.
In short, every set of choices we make has tradeoffs. The scheme you'd like to improve on has proven to be the best tradeoff for offering both dynamic size and raw performance. That's why it appears everywhere from Python lists to C++ vectors.
Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would first feed it the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementations of this data structure store data in memory, and I already know that the huge list of strings won't fit in memory and that I'll somehow need to store it on disk.
I could rely on something like SQLite to store the strings in an indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you faced this kind of problem before? Do you know of any library that might be helpful? (I'm quite language-agnostic, feel free to throw whatever you have.)
There are multiple solutions for efficiently finding whether a string is part of a huge set of strings.
A first solution is to use a trie to make the set much more compact. Many strings will likely start with the same prefix, and storing that prefix over and over is not space efficient. Depending on the data, this alone may or may not be enough to fit the full set in memory. If not, the root part of the trie can be kept in memory, referencing leaf-like nodes stored on disk. This enables the application to quickly find which leaf nodes need to be loaded, at a relatively small cost. If the number of strings is not too huge, most leaf parts of the trie hanging off a given node of the in-memory root part can be loaded in one big sequential chunk from the storage device.
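As an illustration of how a trie shares common prefixes, here is a minimal in-memory sketch; an on-disk variant would replace the child map with offsets into a file. The names and types are assumptions for the example, not a library API:

```rust
use std::collections::HashMap;

// Minimal in-memory trie for membership tests.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    terminal: bool, // true if a stored string ends at this node
}

impl TrieNode {
    fn insert(&mut self, s: &[u8]) {
        let mut node = self;
        for &b in s {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    fn contains(&self, s: &[u8]) -> bool {
        let mut node = self;
        for &b in s {
            match node.children.get(&b) {
                Some(child) => node = child,
                None => return false,
            }
        }
        node.terminal
    }
}

fn main() {
    let mut root = TrieNode::default();
    // The common prefix "prefix-" is stored once and shared by both entries.
    root.insert(b"prefix-one");
    root.insert(b"prefix-two");
    assert!(root.contains(b"prefix-one"));
    assert!(!root.contains(b"prefix")); // prefix alone was never inserted
}
```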
Another solution is to use a hash table to find, with low latency, whether a given string exists in the set (e.g. with only two fetches). The idea is to hash the searched string and perform a lookup at a specific slot of a big array stored on the storage device. Open addressing can be used to make the structure more compact at the expense of a possibly higher latency, while only two fetches are needed with closed addressing (the first gets the location of the item list associated with the given hash, and the second gets the actual items).
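Here is a sketch of that two-fetch, closed-addressing lookup, assuming a hypothetical file layout (a directory of bucket offsets followed by item lists of fixed-length strings); the in-memory byte buffer stands in for the file on the storage device:

```rust
// Hypothetical layout: a directory of NUM_BUCKETS little-endian u64 byte
// offsets, each pointing at a bucket stored as a u32 item count followed
// by that many fixed-length (LEN-byte) strings.
const NUM_BUCKETS: u64 = 4; // tiny for the example; millions in practice
const LEN: usize = 8;       // fixed string length

fn bucket_of(key: &[u8]) -> u64 {
    // FNV-1a, chosen only to keep the example self-contained.
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in key {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    h % NUM_BUCKETS
}

fn contains(file: &[u8], key: &[u8; LEN]) -> bool {
    // Fetch 1: the bucket's offset from the directory.
    let dir_pos = (bucket_of(key) * 8) as usize;
    let off = u64::from_le_bytes(file[dir_pos..dir_pos + 8].try_into().unwrap()) as usize;
    // Fetch 2: the item list for that bucket, scanned for the key.
    let count = u32::from_le_bytes(file[off..off + 4].try_into().unwrap()) as usize;
    file[off + 4..off + 4 + count * LEN]
        .chunks_exact(LEN)
        .any(|item| item == key.as_slice())
}

fn main() {
    // Build a toy "file" in memory: directory first, then the buckets.
    let keys: [&[u8; LEN]; 2] = [b"aaaaaaaa", b"bbbbbbbb"];
    let mut buckets: Vec<Vec<&[u8; LEN]>> = vec![Vec::new(); NUM_BUCKETS as usize];
    for k in keys {
        buckets[bucket_of(k) as usize].push(k);
    }
    let mut file = vec![0u8; (NUM_BUCKETS * 8) as usize]; // the directory
    for (i, bucket) in buckets.iter().enumerate() {
        let off = file.len() as u64;
        file[i * 8..i * 8 + 8].copy_from_slice(&off.to_le_bytes());
        file.extend_from_slice(&(bucket.len() as u32).to_le_bytes());
        for k in bucket {
            file.extend_from_slice(&k[..]);
        }
    }
    assert!(contains(&file, b"aaaaaaaa"));
    assert!(!contains(&file, b"cccccccc"));
}
```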
One simple way to implement such data structures so that they work on a storage device is to make use of mapped memory. Mapped memory enables you to access data on a storage device transparently, as if it were in memory (whatever the language used). However, the cost of accessing the data is that of the storage device, not that of RAM, so the data structure's implementation should be adapted to mapped memory for better performance.
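For example, in Rust this could be done with the memmap2 crate (the file name strings.idx is hypothetical); most languages expose mmap in a similar way:

```rust
use std::fs::File;

use memmap2::Mmap; // external crate: memmap2

fn main() -> std::io::Result<()> {
    let file = File::open("strings.idx")?; // hypothetical index file
    // Safety: the mapping is only valid as long as nobody truncates the
    // file underneath us; that is why Mmap::map is unsafe.
    let mmap = unsafe { Mmap::map(&file)? };
    // The mapping behaves like a &[u8] over the whole file; pages are
    // loaded on demand by the OS when they are touched.
    let bytes: &[u8] = &mmap;
    println!("index file is {} bytes", bytes.len());
    Ok(())
}
```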
Finally, you can cache data so that some lookups are much faster. One way to do that is to use a Bloom filter. A Bloom filter is a very compact, probabilistic, hash-based data structure. It can be used to cache membership information in memory without actually storing any string. False positives are possible, but false negatives are not. It is therefore good at discarding searched strings that are not in the set, without the need for any (slow) fetch from the storage device. A big Bloom filter can provide very good accuracy. This data structure needs to be combined with the ones above if deterministic results are required. LRU/LFU caches might also help, depending on the distribution of the searched items.
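Here is a minimal Bloom filter sketch, deriving the probe positions from two base hashes via double hashing; the sizes are illustrative, not tuned for a target false-positive rate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        BloomFilter {
            bits: vec![0; ((num_bits + 63) / 64) as usize],
            num_bits,
            num_hashes,
        }
    }

    // The k probe positions for an item, via double hashing.
    fn positions(&self, item: &[u8]) -> impl Iterator<Item = u64> {
        let (h1, h2) = base_hashes(item);
        let num_bits = self.num_bits;
        (0..self.num_hashes).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
    }

    fn insert(&mut self, item: &[u8]) {
        let positions: Vec<u64> = self.positions(item).collect();
        for p in positions {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    // false => definitely absent; true => probably present.
    fn might_contain(&self, item: &[u8]) -> bool {
        self.positions(item)
            .all(|p| self.bits[(p / 64) as usize] & (1u64 << (p % 64)) != 0)
    }
}

fn base_hashes(item: &[u8]) -> (u64, u64) {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    let h1 = hasher.finish();
    hasher.write_u64(0x9e3779b97f4a7c15); // perturb to get a second hash
    (h1, hasher.finish())
}

fn main() {
    let mut filter = BloomFilter::new(1 << 20, 7);
    filter.insert(b"hello");
    assert!(filter.might_contain(b"hello"));
    if !filter.might_contain(b"absent") {
        println!("definitely not in the set; skip the disk fetch");
    }
}
```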
I am researching hash tables and hash maps, and everything I have read or watched gives a very vague description of the differences. From messing around in NetBeans with both, they seem to have the same functions and do the same things. What are the fundamental differences between these two data structures?
There are no fundamental differences, but you will find the same thing named differently in different programming languages, so what people call it depends on their background and the language they use. For example, C++ calls its standard implementation std::unordered_map, while Java has HashMap (alongside the legacy Hashtable).
Also, one difference could be inferred from the naming: one could read the names as saying that a HashTable stores only hashed keys (like a set) but no values, whereas a HashMap allows retrieving a value by its hashed key. Internally, both would use the same algorithm and can be considered the same data structure.
HashTable sounds to me like a concrete data structure, although it has numerous variants depending on what happens when a collision occurs, when the table fills up, when it empties.
Map sounds like an abstract data structure, something defined by the available operations (Dictionary would be a potential other name for the same data structure, but I'd not be surprised if some nomenclature defined both, with a nuance somewhere).
HashMap sounds like an implementation of the Map abstract data structure using a HashTable concrete data structure.
Again, I'd not be surprised if a language or a library provided both, with a nuance somewhere (HashMap, for instance, could provide only the operations defined for a Map, while HashTable provides everything that makes sense for a HashTable).
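To put that distinction into code, here is a sketch in Rust (the names Map and HashTableMap are illustrative): the abstract data structure is just a set of operations, and the hash table is one concrete way of providing them:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The abstract data structure: defined only by the operations it offers.
trait Map<K, V> {
    fn insert(&mut self, key: K, value: V);
    fn get(&self, key: &K) -> Option<&V>;
}

// A concrete data structure: an array of buckets plus a hash function.
struct HashTableMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
}

impl<K: Hash + Eq, V> HashTableMap<K, V> {
    fn new() -> Self {
        HashTableMap { buckets: (0..16).map(|_| Vec::new()).collect() }
    }
    fn bucket(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }
}

// The "HashMap": the Map operations, implemented with a hash table.
// (A balanced tree could implement the same trait and also "be a Map".)
impl<K: Hash + Eq, V> Map<K, V> for HashTableMap<K, V> {
    fn insert(&mut self, key: K, value: V) {
        let b = self.bucket(&key);
        match self.buckets[b].iter().position(|(k, _)| k == &key) {
            Some(i) => self.buckets[b][i].1 = value,
            None => self.buckets[b].push((key, value)),
        }
    }
    fn get(&self, key: &K) -> Option<&V> {
        self.buckets[self.bucket(key)]
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v)
    }
}

fn main() {
    let mut m = HashTableMap::new();
    m.insert("answer", 42);
    assert_eq!(m.get(&"answer"), Some(&42));
}
```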
When would it make sense to design a data structure that is nesting a Box and a Vec (or vice versa)?
It seems like in most situations where you want to store multiple fixed-size things on the heap, the Box is redundant, since its only (?) role is to heap-allocate a single value, and a normal Vec already heap-allocates its storage.
Context: I am still wrapping my head around the roles of the various Rust types for building up data structures.
There are really only a few times you need to use Box:
Recursive data structures: not relevant for the outermost element, so no need for Vec<Box<T>>.
Owned trait objects, which must be Box<dyn Trait> because the size of the object is dynamic;
Things that are sensitive to particular memory addresses, where the contained object must keep the same memory location (practically never the case, and definitely not the case in any stable public API; some of the handle stuff to do with std::sync::mpsc::Select is the only case I am aware of, and the unsafety and care required there is part of why the select! macro exists: this sort of thing, like Handle::add, is unsafe stuff).
If none of these situations apply, you should not use Box. And Box<Vec<T>> is one such case; the boxing is completely superfluous, adding an additional level of indirection to no benefit whatsoever.
So the simple version is:
Box<Vec<T>>: never.
Vec<Box<T>>: only if T is a trait, i.e. you’re working with trait objects.
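For example (Shape, Circle, and Square are made-up names for illustration):

```rust
// Owned trait objects: each element has a dynamic size, so it must live
// behind a pointer; the Vec stores the fixed-size boxes.
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { radius: f64 }
struct Square { side: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.radius * self.radius }
}
impl Shape for Square {
    fn area(&self) -> f64 { self.side * self.side }
}

fn main() {
    // Vec<Box<dyn Shape>>: justified, because dyn Shape is unsized.
    let shapes: Vec<Box<dyn Shape>> = vec![
        Box::new(Circle { radius: 1.0 }),
        Box::new(Square { side: 2.0 }),
    ];
    let total: f64 = shapes.iter().map(|s| s.area()).sum();
    println!("total area: {total}");

    // Box<Vec<f64>>: the inner Vec already heap-allocates, so the outer
    // Box adds only a pointless extra indirection. Prefer plain Vec<f64>.
    let _redundant: Box<Vec<f64>> = Box::new(vec![1.0, 2.0]);
}
```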
Box<Vec<T>> can be useful on rare occasions in the form Option<Box<Vec<T>>>, if you want to shrink the size of the struct (one word instead of three) and the vector is usually empty. It was used in rustc for that, for example. However, it was lately replaced with the thin-vec crate, which does this better (still one word, by storing the length and capacity inline with the heap allocation, so it does not require two allocations). If you don't want to introduce a dependency, Option<Box<Vec<T>>> can still be useful, though.
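The size difference is easy to verify (the numbers below assume a 64-bit target):

```rust
fn main() {
    // Vec<T> is three words (pointer, length, capacity), while
    // Option<Box<Vec<T>>> is one word: Box's pointer is never null,
    // so the null value is reused to represent None.
    assert_eq!(std::mem::size_of::<Vec<u32>>(), 24);
    assert_eq!(std::mem::size_of::<Option<Box<Vec<u32>>>>(), 8);
    println!("Vec<u32>: {} bytes", std::mem::size_of::<Vec<u32>>());
    println!("Option<Box<Vec<u32>>>: {} bytes",
             std::mem::size_of::<Option<Box<Vec<u32>>>>());
}
```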
I am planning to implement an identity map for a small project which is not using any ORM tool.
The standard implementation in most examples I have seen is just a hash keyed by object id, but it is obvious that such a hash will grow without limit. I was thinking of using memcached or redis with a cache expiration, but that also means that some objects will expire from the cache and their data will be fetched once again from the database into a new and different object (the same entity under different objects in memory).
Considering that most ORMs do not require a running memcached/redis, how do they solve this problem? Do they actually solve it? Is it not a concern to have one entity represented by multiple instances?
The only solution I know of, in languages supporting smart pointers, is storing weak references within the hash. It does not look to me like such an approach could be taken with Ruby, so I am wondering how this pattern is normally implemented by Ruby ORMs.
I think they do use a Hash; it certainly appears that DataMapper uses one. My presumption is that the identity map is per 'session', which is probably flushed after each request (this also ensures transactions are flushed by the end-of-request boundary). Thus it can grow unbounded, but it has a fixed horizon that clears it. If the intention is to have a session that lasts longer and needs periodic cleaning, then WeakRef might be useful. I would be cautious about maintaining an identity map for an extended period, however, particularly if concurrency is involved and there are any expectations of consistent transactional changes. I know ActiveRecord considered adding an IdentityMap and then abandoned that effort. Depending on how rows are fetched there may be duplication, but it's probably less than you would think, or the query should be rethought.
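For illustration, here is what a weak-reference identity map can look like in a language with smart pointers, using Rust's Rc/Weak (a Ruby version would use WeakRef in the same role); the names are illustrative:

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

// An entity loaded from the database; `id` plays the role of the primary key.
struct Entity {
    id: u64,
    name: String,
}

// An identity map holding weak references: entries do not keep entities
// alive, so the map does not by itself grow without limit.
struct IdentityMap {
    entries: HashMap<u64, Weak<Entity>>,
}

impl IdentityMap {
    fn new() -> Self {
        IdentityMap { entries: HashMap::new() }
    }

    // Return the live instance for `id`, or load it via `fetch` (standing
    // in for a database query) and remember it weakly. A real implementation
    // would also prune dead Weak entries occasionally.
    fn get_or_load(&mut self, id: u64, fetch: impl FnOnce(u64) -> Entity) -> Rc<Entity> {
        if let Some(weak) = self.entries.get(&id) {
            if let Some(strong) = weak.upgrade() {
                return strong; // same entity, same in-memory object
            }
        }
        let loaded = Rc::new(fetch(id));
        self.entries.insert(id, Rc::downgrade(&loaded));
        loaded
    }
}

fn main() {
    let mut map = IdentityMap::new();
    let fetch = |id| Entity { id, name: format!("entity {id}") };
    let a = map.get_or_load(1, fetch);
    let b = map.get_or_load(1, fetch);
    assert!(Rc::ptr_eq(&a, &b)); // one identity, not two copies
    println!("loaded #{}: {}", a.id, a.name);
}
```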