Variable length free list - algorithm

The concept of a free list is commonly used for reusing space: if I have a file full of fixed-length values and I delete one, I put its slot on a free list. Then, when I need to insert a new value, I take a slot from the list and write the value there.
However, I am a bit confused about how to implement a free list for values of variable length. If I delete a value and put its position and length on the free list, how do I retrieve the "best" candidate for a new value?
Using a plain list would be O(n). Using a tree (with length as the key) would make that O(log n). Is there anything better that would give O(1)?

Yes, a hash table! You keep a big hash table whose keys are the sizes of the free blocks and whose values are arrays holding pointers to blocks of the corresponding sizes. So each time you free a block:
hash[block.size()].append(block.address())
And each time you allocate a free block:
block = hash[requested_size].pop()
The problem with this method is that there are too many possible block sizes, so the hash fills up with millions of keys and wastes enormous amounts of memory.
So instead you can have a list and iterate it to find a suitable block:
for block in blocks:
    if block.size() >= requested_size:
        blocks.remove(block)
        return block
Memory efficient but slow because you might have to scan through millions of blocks.
So what you do is combine the two methods. If you set your allocation quantum to 64 bytes, a hash of 256 size classes covers all allocations up to 64 * 256 = 16 KB. Blocks larger than that you store in a tree, which gives you O(log n) insertion and removal.
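To make the combined scheme concrete, here is a rough Python sketch (my own illustration, not part of the original answer): plain integers stand in for block addresses, a list kept sorted with bisect stands in for the O(log n) tree for large blocks, and splitting of oversized blocks is omitted:
import bisect

QUANTUM = 64
NUM_CLASSES = 256
SMALL_LIMIT = QUANTUM * NUM_CLASSES            # 16 KiB

small = [[] for _ in range(NUM_CLASSES)]       # small[c]: addresses of free blocks in class c
large = []                                     # (size, address) pairs, kept sorted by size

def free_block(address, size):
    if size < SMALL_LIMIT:
        small[size // QUANTUM].append(address)
    else:
        bisect.insort(large, (size, address))  # stand-in for the tree of large blocks

def allocate(requested):
    if requested < SMALL_LIMIT:
        # Scan upward from the requested size class: O(1) when the exact class
        # has a block, at worst a fixed 256-entry scan. (A bitmap of non-empty
        # classes plus find-first-set would make this strictly O(1).)
        for c in range((requested + QUANTUM - 1) // QUANTUM, NUM_CLASSES):
            if small[c]:
                return small[c].pop()          # note: the oversized remainder is not split here
    i = bisect.bisect_left(large, (requested, 0))  # smallest large block >= requested
    if i < len(large):
        return large.pop(i)[1]
    return None                                # nothing suitable on the free lists
Freeing is O(1) for small blocks, and allocation touches at most a fixed number of size classes before falling back to the large-block structure.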

Related

Does a "Pyramid List" data-structure already exist?

I am thinking about data structures for environments such as embedded/memory-constrained/filesystem code and came up with an idea for a list-like data structure that has O(1) {access, insert, pop} while also always having O(1) push (non-amortized), even though it can only be grown by a constant amount (e.g. 4 KiB). I cannot find an example of it anywhere and am wondering whether it already exists and, if so, whether anyone knows of a reference implementation.
The basic structure would look something like this:
PyramidList contains:
    size_t numSlots
    size_t sizeSlots
    void **slots: a pointer to an array of sizeSlots pointers, holding pointers to values at indexes up to numSlots
The void **slots array is laid out as follows, index by index. The slots are sized so that 2^i = maxValues, where i is the index and maxValues is the maximum number of values that can be stored at that index and below (i.e. the cumulative capacity up to and including that index):
index 0: contains a pointer directly to a single value (2^0 = 1)
index 1: contains a pointer directly to a single value (2^1 = 2)
index 2: contains a pointer to an array of two values (2^2 = 4)
index 3: contains a pointer to an array of four values (2^3 = 8)
index 4: contains a pointer to an array of eight values (2^4 = 16)
.. etc
index M: contains a pointer to an array of MAX_NUM_VALUES (2^M = MAX_NUM_VALUES*2)
index M+1: contains a pointer to an array of MAX_NUM_VALUES
index M+2: contains a pointer to an array of MAX_NUM_VALUES
etc
Now, suppose I want to access index i. I can use the BSR instruction to get the "power of 2" of the index. If it is less than the power of 2 of MAX_NUM_VALUES, then I have my slot. If it is larger than the power of 2 of MAX_NUM_VALUES, I can act accordingly (subtract and divide). Therefore I can look up the array/single value in O(1) time and then access the index I want in O(1) as well (see the sketch after the list below). Pushing to the PyramidList requires (at most):
allocating a new MAX_NUM_VALUES array and adding its pointer to slots
In some cases slots might not be able to hold it and would have to be grown as well, so this is only really always O(1) up to some limit, but that limit is likely to be extreme for the use cases here.
inserting the value into the proper index
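As an illustration of the lookup described above, here is a small Python sketch of the slot/offset computation, using int.bit_length() in place of the BSR instruction. MAX_POW is a made-up parameter with MAX_NUM_VALUES = 2^(MAX_POW - 1), and the exact comparison against MAX_POW is my reading of the layout shown earlier:
MAX_POW = 12                           # hypothetical: slot MAX_POW is the last doubling slot
MAX_NUM_VALUES = 1 << (MAX_POW - 1)    # so 2^MAX_POW = MAX_NUM_VALUES * 2, as in the layout above

def locate(i):
    """Map a global element index i to (slot, offset within that slot)."""
    if i == 0:
        return 0, 0                    # slot 0 holds the single value at index 0
    p = i.bit_length()                 # BSR(i) + 1: position of the highest set bit
    if p <= MAX_POW:
        return p, i - (1 << (p - 1))   # slot p starts at global index 2^(p-1)
    extra = i - (1 << MAX_POW)         # past the doubling region: fixed-size slots
    return MAX_POW + 1 + extra // MAX_NUM_VALUES, extra % MAX_NUM_VALUES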
A few other benefits
Works great for (embedded/file-system/kernel/etc) memory managers that have a maximum alloc size (i.e. can only allocate 4KiB chunks)
Works great when you truly don't know how large your vector is likely to be. Starts out extremely small and grows by known amounts
Always having (near) constant insertion may be useful for timing-critical interrupts/etc
Does not leave fragmented space behind when growing. Might be great for appending records into a file.
Disadvantages
Is likely less performant (amortized) than a contiguous vector in nearly every way (even insertion). Moving memory is typically less expensive than adding a dereference for every operation, so the amortized cost of a vector is still probably smaller.
Also, it is not truly always O(1), since the slots vector has to be grown when all the slots are full, but this only happens after currentNumSlots*2*MAX_NUM_VALUES values have been added since the last growth.
When you exceed the capacity of an array of size X, and so allocate a new array of size 2X, you can then incrementally move the X items from the old array into the start of the new array over the next X append operations. After that the old array can be discarded when the new array is full, just before you have to allocate a new array of size 4X.
Therefore, it is not necessary to maintain this list of increasing-size arrays in order to achieve O(1) appends (assuming that allocation is O(1)). Incremental doubling is a well-known technique in the de-amortization business, so I think most people desiring this sort of behaviour would turn to that first.
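A minimal sketch of that incremental-doubling scheme (my own illustration, not the answerer's code), assuming allocation itself is O(1); only append and indexing are shown, and the names are made up:
class IncrementalVector:
    """Append-only vector with non-amortized O(1) appends via incremental doubling:
    when the backing array fills, a 2x array is allocated and one old element is
    migrated per subsequent append."""

    def __init__(self):
        self.old = None          # array being drained (or None when no migration is running)
        self.new = [None] * 1    # current backing array
        self.moved = 0           # how many old elements have been copied so far
        self.size = 0            # number of live elements

    def append(self, value):
        if self.size == len(self.new):                  # full: start a new migration
            self.old, self.new = self.new, [None] * (2 * len(self.new))
            self.moved = 0
        self.new[self.size] = value
        self.size += 1
        if self.old is not None:                        # migrate one old element per append
            self.new[self.moved] = self.old[self.moved]
            self.moved += 1
            if self.moved == len(self.old):
                self.old = None                         # old array fully drained

    def __getitem__(self, i):
        if self.old is not None and self.moved <= i < len(self.old):
            return self.old[i]                          # not yet migrated
        return self.new[i]
Each append does at most one allocation, one write, and one migration copy, so it stays O(1) per operation apart from the cost of the allocation itself.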
Nothing like this is commonly used, because memory allocation can almost never be considered O(1). Applications that can't afford to copy a block all at once generally can't afford to use any kind of dynamic memory allocation at all.

"In-place" iterative LSD n-radix sort

Is it possible to implement an "in-place" iterative LSD n-radix sort? To clarify: I've read the Wikipedia article on in-place MSD radix sort. There it says that:
Counting sort is used to determine the size of each bin and their starting index.
and therefore an auxiliary array is needed to store the indexes, but if that is all it needs I still consider it an in-place algorithm (since this array would only be of size n for an n-radix sort). I have also read this answer where, once again, a recursive MSD radix sort is implemented. That one also has no general implementation for an n-radix sort.
EDIT: I've finally described my algorithm in the answer here: https://cs.stackexchange.com/questions/93563/fast-stable-almost-in-place-radix-and-merge-sorts
Yes. Virtually split the input data into pages of some size (4 KB will serve well), then reuse these pages as you consume their data. You will need some extra memory, though: up to n pages for the initial buckets, next-page pointers (one per page), and 3*n pointers for head_page/current_page/current_write_ptr per bucket.
The algo:
Allocate n pages and put them on the list of free pages
Process the input data, moving entries into their buckets. Once an entire page of input data has been processed, add that page to the list of free pages. Once a bucket page is filled, add it to that bucket's page list and allocate a new one from the free list. You will end up with a list of pages for each bucket, whose last page is partially filled
Move data around to fit it back into the original array in the order of the buckets
Of course, if you need multiple LSD passes, you can skip step 3 for all but the last pass and start each sorting pass directly from the lists built in step 2. That requires n extra pages, though.
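Here is a simplified Python simulation of one such pass (my own sketch, not the answerer's code). Pages are plain Python lists, so it only demonstrates the bookkeeping (the free-page list, the per-bucket page chains, and recycling each input page as soon as it is consumed); the final step simply copies the buckets back rather than doing a careful in-place compaction:
def lsd_pass(data, digit, radix=4, page_size=4):
    # Step 1: radix extra pages, handed out as the buckets' first pages.
    free_pages = [[] for _ in range(radix)]
    full_pages = [[] for _ in range(radix)]          # per bucket: finished pages, in order
    current = [free_pages.pop() for _ in range(radix)]

    # Step 2: consume the input page by page, recycling each input page as
    # soon as its last element has been distributed.
    for start in range(0, len(data), page_size):
        end = min(start + page_size, len(data))
        for i in range(start, end):
            b = digit(data[i])
            current[b].append(data[i])
            if i == end - 1:
                free_pages.append([])                # stands in for the consumed input page
            if len(current[b]) == page_size:         # bucket page filled: chain it, grab a new one
                full_pages[b].append(current[b])
                current[b] = free_pages.pop()

    # Step 3: write the buckets back over the original array in bucket order.
    out = 0
    for b in range(radix):
        for page in full_pages[b] + [current[b]]:
            for x in page:
                data[out] = x
                out += 1

# Example usage: two stable base-4 passes over small non-negative integers.
# nums = [7, 3, 12, 0, 9, 15, 4, 1, 8, 2, 14, 6]
# lsd_pass(nums, lambda x: x & 3)
# lsd_pass(nums, lambda x: (x >> 2) & 3)   # nums is now sorted
Because an input page is freed before a replacement bucket page is requested, the free list never runs dry with only the radix extra pages allocated up front.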

Using dynamic array to handle collisions in hash tables

Looking around at some of the hash table implementations, separate chaining seems to be handled via a linked list or a tree. Is there a reason why a dynamic array is not used? I would imagine that having a dynamic array would have better cache performance as well. However, since I've not seen any such implementation, I'm probably missing something.
What am I missing?
One advantage of a linked list over a dynamic array is that rehashing can be accomplished more quickly. Rather than having to make a bunch of new dynamic arrays and then copy all the elements from the old dynamic arrays into the new, the elements from the linked lists can be redistributed into the new buckets without performing any allocations.
Additionally, if the load factor is small, the space overhead of using linked lists may be better than the space overhead for dynamic arrays. When using dynamic arrays, you usually need to store a pointer, a length, and a capacity. This means that if you have an empty dynamic array, you end up needing space for two integers and a pointer, plus any space preallocated to hold the elements. In an empty bucket, this space overhead is large compared to storing just a null pointer for a linked list. On the other hand, if the buckets have large numbers of elements in them, then dynamic arrays will be a bit more space-efficient and have higher performance due to locality of reference.
Hope this helps!
One advantage I can think of is deletion: insertion happens at the head of the chain, but if I want to delete a value it is harder with an array, because the value may be in the middle of the array.
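For comparison, here is a small sketch of separate chaining with dynamic-array buckets (Python lists standing in for the dynamic arrays; the class and method names are illustrative and table resizing is omitted). Deletion uses swap-remove, which sidesteps the concern above about removing from the middle of an array:
class ArrayChainedMap:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite an existing entry
                return
        bucket.append((key, value))            # append at the end of the array

    def get(self, key, default=None):
        for k, v in self._bucket(key):         # linear scan, but contiguous in memory
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = bucket[-1]         # swap-remove: O(1) delete, bucket order not preserved
                bucket.pop()
                return True
        return False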

Finding k most common words in a file - memory usage

Let's say you are given a huge file, say 1GB. The file contains a word on each line (total n words), and you want to find the k most frequent terms in the file.
Now, assuming you have enough memory to store these words, what is the better way to approach the question in terms of reducing the memory usage and the constant overhead in the Big-O complexity? I believe there are two basic algorithms one can use:
Use a hash table and a min-heap to store the occurrences and the top k words seen. This is O(n + n log k) ≈ O(n). (See the sketch below.)
Use a trie to store the words and occurrences and then traverse the trie to count the most frequent words. This is O(n*p) ≈ O(n), where p is the length of the longest word.
Which is a better approach?
Also: if you didn't have enough memory for a hash table/trie (i.e. limited memory of 10MB or so), then what is the best approach?
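As a point of reference, here is a minimal sketch of the first approach (a hash table of counts plus a size-k min-heap), assuming one word per line and enough memory to hold the counts; the file name is a placeholder:
import heapq
from collections import Counter

def top_k_words(path, k):
    counts = Counter()                       # hash table: word -> occurrences
    with open(path) as f:
        for line in f:
            counts[line.strip()] += 1
    # heapq.nlargest keeps an internal heap of size k: O(n log k) for the selection.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# top_k_words("words.txt", 10)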
Which is more efficient in terms of the constant factors is very input-dependent. On one hand, the trie offers a strict O(n) time bound for inserting all elements, while the hash table might decay to quadratic time in the worst case.
On the other hand, tries are not very cache-friendly: each lookup requires O(|S|) random memory accesses, which might degrade performance significantly.
Both approaches are valid, and I think there are multiple considerations that should be taken when choosing one over the other like maximum latency (if it is a real time system), throughput, and time to develop.
If average-case performance is all that matters, I'd suggest generating a bunch of files and running a statistical analysis of which approach is better. The Wilcoxon signed-rank test is the de facto standard hypothesis test to use here.
Regarding embedded (memory-limited) systems: both approaches are still valid, but here:
Each node (or bunch of nodes) of the trie will live on disk rather than in RAM. Note that this means the trie needs O(|S|) random-access disk seeks per entry, which might be slow.
For the hashing solution: you have 10 MB; say 5 MB of it can be used for a hash table of pointers to disk. Assume, pessimistically, that those 5 MB hold only 500 different disk addresses. That leaves 5 MB to load one bucket after each hash lookup, and with 500 buckets and a load factor of 0.5 you can cover 500 * 5 MB * 0.5 ≈ 1.25 GB > 1 GB of your data. So with the hash table solution, each lookup needs only O(1) random disk seeks to find the bucket containing the relevant string.
Note that if this is still not enough, we can apply the same trick to the pointer tables themselves, very similar to what is done with page tables in the virtual-memory mechanism.
From this we can conclude that, for embedded systems, the hash solution is better in most cases (note that it might still suffer from high latency in the worst case; there is no silver bullet here).
P.S. A radix tree is usually faster and more compact than a trie, but it suffers from the same drawbacks as a trie compared to hash tables (though less significantly, of course).
For the limited-memory option you could quicksort the list first, then populate a hash table with k items in it. You would then need one more counter to track how many occurrences the current word has; if that is higher, you replace the lowest item in the hash table with the current word.
This would probably work OK for the initial list, but it would be slower than just scanning the full list and populating a hash table with the counts.
Do you have a drive to store intermediate results? If so, you can use a small metadata structure plus a set of hash tables. Read a portion of the data and fill a hash table while its size stays under, say, 3 MB; once it grows past 3 MB, save it to disk (if your limit is 10 MB, a 3 MB hash table is a reasonable choice). The metadata describes your hash tables: for each one you might store the number of unique words, the total count of all words, and the maximum count of any single word. After that, you can load the hash tables back from disk and merge them; for example, you might load them in ascending order of unique words, or of the maximum single-word count. At this step you can apply some heuristics.
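Here is a hedged sketch of that spill-and-merge idea: count words in an in-memory table up to a budget, write out the partial counts sorted by word, then merge the sorted runs and select the top k. To keep the sketch short, the runs are held in memory and the budget is a count of distinct entries rather than 3 MB; a real version would write each run to a temporary file and stream it back during the merge:
import heapq
from itertools import groupby

def top_k_limited_memory(path, k, max_entries=100_000):
    runs = []        # sorted runs of (word, partial_count); in practice these go to temp files
    counts = {}

    def spill():
        runs.append(sorted(counts.items()))   # sort the partial counts by word
        counts.clear()

    with open(path) as f:
        for line in f:
            word = line.strip()
            counts[word] = counts.get(word, 0) + 1
            if len(counts) >= max_entries:    # stand-in for the "hash table > 3 MB" check
                spill()
    if counts:
        spill()

    # k-way merge of the sorted runs: equal words from different runs come out
    # adjacent, so their partial counts can be summed before the top-k selection.
    merged = heapq.merge(*runs)
    totals = ((word, sum(c for _, c in group))
              for word, group in groupby(merged, key=lambda wc: wc[0]))
    return heapq.nlargest(k, totals, key=lambda wc: wc[1])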

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a large number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
As an idea...
Up the index table to 8 bits, then xor all 3 bytes of the 24 bit word into it.
Then your table would consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB like, a more sophisticated hashing method may be needed.
bit24var & 0x00ff gives you the rightmost byte.
(bit24var >> 8) & 0x00ff gives you the byte beside it.
(bit24var >> 16) & 0x00ff gives you the byte beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
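A quick sketch of that byte-XOR hash with chaining (the names are illustrative, and it only answers exact membership in the 2^7 important values, not the "best fit" question):
def xor_hash(value24):
    # XOR the three bytes of the 24-bit value into one 8-bit hash.
    return (value24 ^ (value24 >> 8) ^ (value24 >> 16)) & 0xFF

def build_table(palette):
    # `palette` stands for the static array of the 2^7 important values.
    table = [[] for _ in range(256)]
    for index7, value24 in enumerate(palette):
        table[xor_hash(value24)].append((value24, index7))   # chain colliding entries
    return table

def lookup(table, value24):
    for candidate, index7 in table[xor_hash(value24)]:
        if candidate == value24:
            return index7
    return None    # not one of the 2^7 important values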
Another idea would be to put your important values in a different array and then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
How many of the 2^24 values do you actually have? Can you sort these values and count them by counting runs of consecutive equal values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply filter the incoming data and assign a value, starting from 0 and going up to 2^7-1, to these values as we encounter them. Of course, we need some way of keeping track of which important values have already been seen and assigned a label in [0, 2^7). For that we can use some sort of tree- or hashtable-based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()   # dictionary to hold the final mapping
    next_index = 0     # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:          # check that the element is important
            if elem not in mapping:    # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1        # the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))      # materialize the range so it can be shuffled
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hash table), where n is the size of the input and k is the number of important values.
Another idea is to represent the set of 24-bit values in a bitmap. An unsigned char can hold 8 bits, so one would need 2^24 / 8 = 2^21 array elements. That's 2,097,152 bytes (2 MiB). If the corresponding bit is set, then you know that that specific 24-bit value is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
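A small sketch of the bitmap idea: one bit per possible 24-bit value takes 2^24 / 8 bytes = 2 MiB, and the linear scan below stands in for a hardware "find first set" instruction:
BITMAP_BYTES = (1 << 24) // 8              # 2 MiB: one bit per possible 24-bit value

def make_bitmap(values):
    bitmap = bytearray(BITMAP_BYTES)
    for v in values:
        bitmap[v >> 3] |= 1 << (v & 7)     # set the bit for value v
    return bitmap

def contains(bitmap, value24):
    return bool(bitmap[value24 >> 3] & (1 << (value24 & 7)))

def next_set_bit(bitmap, start):
    """Index of the first set bit at or after `start`, or None (simple scan;
    a machine 'find first set' instruction would do this per word)."""
    byte = start >> 3
    first = bitmap[byte] & ~((1 << (start & 7)) - 1)   # mask off bits below `start`
    if first:
        return (byte << 3) + (first & -first).bit_length() - 1
    for b in range(byte + 1, len(bitmap)):
        if bitmap[b]:
            return (b << 3) + (bitmap[b] & -bitmap[b]).bit_length() - 1
    return None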
Good luck on your quest.
Let us know how things turn out.
Evil.
