I have a mutable array and I would like to arrange it in a nested order based on some criteria.
To achieve this, I'd like to move certain elements into the mutable arrays of other elements, removing them from the main mutable array but without releasing them. So the real question is about removing elements from an array without releasing them. How do I achieve this?
Thanks
You cannot remove an object from an array without the array releasing it. If you want to make sure it sticks around, just retain it yourself first, and release it when you're done. These are pretty cheap operations, so you shouldn't worry too much about it.
Since you're moving items from one array to another, it would be easier if you first added the object to the new array and then removed it from the original array.
When you add it to the new array it is implicitly retained.
When you remove it from the old array it is implicitly released.
This is faster than retaining it, removing it from the array, adding it to the new array and then releasing it.
What techniques are known to prevent iterator invalidation after/during rehashing? In particular, I'm interested in collision-chaining hash tables with incremental rehashing.
Suppose we're iterating over a hash table via an iterator and insert an element during the iteration, and that insertion causes a full or partial rehash of the table. I'm looking for hash table variants that allow the iteration to continue with the guarantee that every element is visited (except possibly the newly inserted one, which doesn't matter) and no element is visited twice.
AFAIK C++ unordered_map invalidates iterators during rehash. Also, AFAIK Go's map has incremental rehashing and doesn't invalidate iterators (range loop state), so it's likely what I'm looking for, but I can't fully understand the source code so far.
One possible solution is to have a doubly-linked list of all elements, parallel to the hash table, which is not affected by rehashing. This solution requires two extra pointers per element. I feel that better solutions should exist.
AFAIK C++ unordered_map invalidates iterators during rehash.
Correct. cppreference.com summarises unordered_map iterator invalidation thus:
Operations                                  Invalidated
==========================================  ==========================
All read only operations, swap, std::swap   Never
clear, rehash, reserve, operator=           Always
insert, emplace, emplace_hint, operator[]   Only if causes rehash
erase                                       Only to the element erased
If you want to use unordered_map, your options are:
call reserve() before you start your iteration/insertions, to avoid rehashing
change max_load_factor() before you start your iterations/insertions, to avoid rehashing
store the elements to be inserted in, say, a vector during the iteration, then move them into the unordered_map afterwards
create e.g. vector<T*> or vector<reference_wrapper<T>> to the elements, iterate over it instead of the unordered_map, but still do your insertions into the unordered_map
If you really want incremental rehashing, you could write a class that wraps two unordered_maps, and when you see an insertion that would cause a rehash of the first map, you start inserting into the second (for which you'd reserve twice the size of the first map). You could manually control when all the elements from the first map were shifted to the second, or have it happen as a side effect of some other operation (e.g. migrate one element each time a new element is inserted). This wrapping approach will be much easier than writing an incremental rehashing table from scratch.
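To make the wrapping idea concrete, here is a rough sketch in Python rather than C++, with two plain dicts standing in for the two unordered_maps. The class name TwoTableMap and the limit parameter are made up for illustration; limit is only a stand-in for "this insertion would cause a rehash", since a Python dict does not expose that. It migrates one element per insertion, as suggested above, and does not attempt to solve iteration itself.

```python
class TwoTableMap:
    """Sketch: two dicts, with elements migrated one at a time on insertion."""

    def __init__(self, limit=4):
        self._old = {}        # table being drained
        self._new = {}        # table receiving all insertions
        self._limit = limit   # hypothetical stand-in for "would cause a rehash"

    def __setitem__(self, key, value):
        if not self._old and len(self._new) >= self._limit:
            # Start a migration: the full table becomes "old", and
            # insertions now go to a fresh table with twice the budget.
            self._old, self._new = self._new, {}
            self._limit *= 2
        self._old.pop(key, None)   # never leave a stale copy in the old table
        self._new[key] = value
        if self._old:              # migrate one element as a side effect
            k, v = self._old.popitem()
            self._new[k] = v

    def __getitem__(self, key):
        if key in self._new:
            return self._new[key]
        return self._old[key]      # raises KeyError if absent from both

    def __len__(self):
        return len(self._old) + len(self._new)
```

Because each insertion drains one element from the old table, the old table is empty again before the new one reaches its doubled limit, so there are never more than two tables in play.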
I am trying to implement a hashtable using linear probing.
Before inserting a (key, value) pair into the hashtable, I want to check if it's half full. If it is, I need to double the size of the underlying array.
Obviously, there are two ways to do that:
One is to create another array with the doubled size, rehash all entries in the old one and add them to the new array. Then, rebind the old array to the new one. This way is easy to implement but uses a lot of space.
The other one is to double the array and do the rehashing in-place. It seems that this way may lead to longer running time because rehashing may cause collisions with both newly hashed slots and old slots.
Which way should I use?
Your second solution only saves space during the resize process if there is in fact room to expand the existing hash table in-place - I think the chances of that being the case for a large hash table are quite slim, so I would just go for your first solution.
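For what it's worth, the first approach is only a few lines. The sketch below is in Python and assumes a table of (key, value) tuples with None marking an empty slot; the class and method names are illustrative, not from any library.

```python
class LinearProbingTable:
    """Linear-probing table that doubles into a fresh array when half full."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity  # each slot is None or a (key, value) pair
        self._count = 0

    def _probe(self, slots, key):
        # Walk forward from the home slot until we hit the key or an empty slot.
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def _grow(self):
        # Allocate an array of twice the size and rehash every entry into it.
        new_slots = [None] * (2 * len(self._slots))
        for entry in self._slots:
            if entry is not None:
                new_slots[self._probe(new_slots, entry[0])] = entry
        self._slots = new_slots  # rebind; the old array is dropped

    def put(self, key, value):
        if 2 * self._count >= len(self._slots):  # half full: resize first
            self._grow()
        i = self._probe(self._slots, key)
        if self._slots[i] is None:
            self._count += 1
        self._slots[i] = (key, value)

    def get(self, key, default=None):
        entry = self._slots[self._probe(self._slots, key)]
        return default if entry is None else entry[1]
```

The extra space is only needed while _grow runs, and keeping the table at most half full guarantees that every probe ends at an empty slot.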
What is a shadow array and how is it implemented?
I came through the term while reading about compiler optimizations but I couldn't find any substantial reference about it.
When using arrays to implement dynamically resizable abstract data types, such as a List, Queue or Stack, the obvious problem that one encounters is that arrays are not themselves freely resized. At some point, then, if one adds enough items to an array, one will eventually run out of space.
The naive solution to this problem is to wait until the array being used runs out of space and then create a new larger array, copy all of the items from the old array into the new array, and then start using the new array.
An implementation of an abstract data type that uses a shadow array is an alternative to this. Instead of waiting until the old array is full, a second, larger array is created once some threshold of fullness is passed on the array that's being used. Thereafter, each time an item is added to the old array, several items are copied from the old array to the shadow array, so that by the time the old array is full, all of its items have already been copied to the new array.
The advantage of using a shadow array implementation instead of the naive "copy everything at the end" approach is that the time required by each add operation is much more consistent.
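Here is a minimal Python sketch of that idea, under some assumptions of mine: the list is append-only, the shadow array is created once the active array is half full, and two items are copied per append. The class name and those constants are illustrative; a real implementation would also have to mirror in-place updates into the shadow array.

```python
class ShadowArrayList:
    """Append-only list that fills a larger shadow array a little at a time."""

    def __init__(self, capacity=4):
        self._active = [None] * capacity
        self._size = 0
        self._shadow = None  # larger array, created once the threshold is passed
        self._copied = 0     # how many leading items are already in the shadow

    def append(self, item):
        if self._size == len(self._active):
            # Active array is full; the shadow already holds every item.
            self._active, self._shadow, self._copied = self._shadow, None, 0
        self._active[self._size] = item
        self._size += 1
        cap = len(self._active)
        if self._shadow is None and 2 * self._size >= cap:
            self._shadow = [None] * (2 * cap)  # threshold passed: start shadowing
        if self._shadow is not None:
            # Copy at most two items per append, so no single append is expensive.
            for _ in range(2):
                if self._copied < self._size:
                    self._shadow[self._copied] = self._active[self._copied]
                    self._copied += 1

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return self._active[index]
```

Copying two items per append is enough here because the shadow array is started no later than one item past the halfway point, so the copying always catches up by the time the active array fills.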
I think of it as a form of dynamic array.
The term shadow would refer to the underlying algorithms that try to resize it with good performance but are hidden behind an easy interface. (For example ArrayList in Java)
Is this what you're looking for? (Scroll to the bottom.)
Some claim that appending to immutable lists is more efficient. Is this true? How?
Producing a modified version of a list by allocating an array large enough to hold the modified version and copying over all of the unmodified elements is somewhat expensive, regardless of whether the modification is an append, an insert, a deletion, replacement, or anything else. The cost is roughly comparable to that of producing an unmodified, but distinct, copy of the list.
If an object Foo wishes to maintain a list of elements in such a way that it can only be changed when Foo changes it, there are two common approaches it can use to do so:
1. It can use an "immutable list" type which guarantees that any instance which has ever been exposed to the outside world will forever hold the same sequence of objects. The object `Foo` would be free to expose references to this list, since nobody would be able to alter it. If `Foo` wants to e.g. add an item to its list, it would generate a new immutable list which contains all the items in the list, plus the new one, and start holding a reference to that instead of the old one.
2. It can create a list object which is mutable, but is never exposed to the outside world. If anyone needs to retrieve the sequence of items from the list, `Foo` would copy the list's contents into a new list which the caller could use in any way it sees fit without affecting `Foo`'s list.
If one uses approach #1, then every time Foo alters the list it must create a new "immutable list" instance, but Foo could answer a request for the list's contents without having to copy it. If one uses approach #2, adding items to the list (and other modifications) will be cheaper, but answering a request for the list's contents will require copying the list. Whether it's better to use approach #1 or approach #2 will depend upon how often the list is updated, versus how often the application will need a copy of it.
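As a rough illustration of the trade-off, here are the two approaches side by side in Python, using a tuple as the immutable list. The class names are made up for the example.

```python
class FooImmutable:
    """Approach #1: hold an immutable tuple; every change builds a new one."""

    def __init__(self):
        self._items = ()

    def add(self, item):
        self._items = self._items + (item,)  # copies all existing items

    def items(self):
        return self._items  # no copy: callers cannot alter a tuple


class FooPrivateMutable:
    """Approach #2: hold a private list; copy it whenever a caller asks."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)  # cheap modification

    def items(self):
        return list(self._items)  # copies all items on every request
```

In the first class the cost of copying is paid on add; in the second it is paid on items. Either way, a snapshot handed to a caller is unaffected by later changes inside Foo.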
Immutable objects can be shared between threads without synchronization. Synchronization negatively affects scaling, and can potentially be more costly than the copy.
What is the best way to remove an entry from a hashtable that uses linear probing? One way to do this would be to use a flag to indicate deleted elements? Are there any ways better than this?
An easy technique is to:
1. Find and remove the desired element.
2. Go to the next bucket.
3. If the bucket is empty, quit.
4. If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
5. Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
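The steps above can be sketched like this in Python. It assumes a set-like table stored as a plain list with None for empty buckets, and that the table always keeps at least one empty bucket; the function names are illustrative.

```python
def _home(slots, key):
    return hash(key) % len(slots)


def insert(slots, key):
    # Normal linear-probing insert; assumes at least one bucket is empty.
    i = _home(slots, key)
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % len(slots)
    slots[i] = key


def contains(slots, key):
    i = _home(slots, key)
    while slots[i] is not None:
        if slots[i] == key:
            return True
        i = (i + 1) % len(slots)
    return False


def delete(slots, key):
    i = _home(slots, key)
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % len(slots)
    if slots[i] is None:
        return False                       # key is not in the table
    slots[i] = None                        # find and remove the element
    i = (i + 1) % len(slots)               # go to the next bucket
    while slots[i] is not None:            # an empty bucket ends the run
        moved, slots[i] = slots[i], None   # remove first, then re-add
        insert(slots, moved)
        i = (i + 1) % len(slots)
    return True
```

Each re-added item lands either in the hole left behind or back in its own bucket, so lookups never run into a gap in the middle of a probe sequence.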
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping up the list (which may have gone empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and freeings to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
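A minimal Python sketch of the three-state approach follows. It is not Knuth's code: the sentinel objects, class name, and fixed capacity are my own choices for illustration.

```python
EMPTY = object()    # bucket has never been used
DELETED = object()  # bucket was occupied and its occupant was removed


class TombstoneTable:
    """Fixed-size linear-probing set with empty/deleted/occupied buckets."""

    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _find(self, key):
        # Return the index holding key, or -1. Probing skips deleted buckets.
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is EMPTY:
                return -1
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i
            i = (i + 1) % len(self.slots)
        return -1

    def add(self, key):
        if self._find(key) != -1:
            return  # already present
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key  # reuse the first free or deleted bucket
                return
            i = (i + 1) % len(self.slots)
        raise RuntimeError("table is full")

    def remove(self, key):
        i = self._find(key)
        if i != -1:
            self.slots[i] = DELETED  # mark, so later probes keep going

    def __contains__(self, key):
        return self._find(key) != -1
```

The catch, as noted, is that deleted buckets accumulate: a lookup for a missing key still has to walk past every one of them, which is why this works best when deletions are rare.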
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink the table (assuming you're not doing a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (à la C++ STL or Java), you should be able to remove them as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. It also requires adding a property to the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table. If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
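Two of the options above, collecting the keys first and building a new table of survivors, look roughly like this in Python, where a dict raises RuntimeError if its size changes while you iterate over it. The function names and the predicate signature are just for illustration.

```python
def remove_matching(table, should_remove):
    """Collect the keys to drop while iterating, then delete them afterwards."""
    # First pass: only read the table, remembering which keys to drop.
    doomed = [key for key, value in table.items() if should_remove(key, value)]
    # Second pass: the iteration is over, so mutating the table is safe.
    for key in doomed:
        del table[key]
    return len(doomed)


def keep_only(table, should_keep):
    """Build a new table holding only the survivors; the caller swaps it in."""
    return {key: value for key, value in table.items() if should_keep(key, value)}
```

The first costs a temporary list of keys; the second costs a whole new table, and only helps if you can replace every reference to the old one.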
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new field is stored.
When deleting something from the hash table, the solution is then equivalent to writing a function that deletes a node from a linked list.