Does a "Pyramid List" data-structure already exist?


I am thinking about data structures that can be used in embedded, memory-constrained, or filesystem environments, and came upon an idea for a list-like data structure that has O(1) access, insert, and pop while also always having O(1) push (non-amortized), even if it can only be grown by a constant amount (e.g. 4 KiB). I cannot find an example of it anywhere and am wondering whether it exists, and if so whether anyone knows of a reference implementation.
The basic structure would look something like this:
PyramidList contains
a size_t numSlots
a size_t sizeSlots
a void** slots pointer to an array of pointers of size sizeSlots with pointers to values in indexes up to numSlots
The void **slots array has the following structure for each index. These are structured in such a way that 2^i = maxValues where i is the index and maxValues is the maximum number of values that can exist at that index or less (i.e. the sum of the count of all values up to that index)
index 0: contains a pointer directly to a single value (2^0 = 1)
index 1: contains a pointer directly to a single value (2^1 = 2)
index 2: contains a pointer to an array of two values (2^2 = 4)
index 3: contains a pointer to an array of four values (2^3 = 8)
index 4: contains a pointer to an array of eight values (2^4 = 16)
.. etc
index M: contains a pointer to an array of MAX_NUM_VALUES (2^M = MAX_NUM_VALUES*2)
index M+1: contains a pointer to an array of MAX_NUM_VALUES
index M+2: contains a pointer to an array of MAX_NUM_VALUES
etc
Now, suppose I want to access index i. I can use the BSR instruction to get the "power of 2" of the index. If it is less than the power of 2 of MAX_NUM_VALUES then I have my index. If it is larger than the power of 2 of MAX_NUM_VALUES I can act accordingly (subtract and divide). Therefore I can look up the array/single-value in O(1) time and then access the index I want in O(1) as well. Pushing to the PyramidList requires (at most):
allocating a new array of MAX_NUM_VALUES and adding its pointer to slots
In some cases slots might not be able to hold it and would have to be grown as well, so this is only really always O(1) up to some limit, but that limit is likely to be extreme for the use cases here.
inserting the value into the proper index
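A minimal Python sketch of the lookup and push described above, under some assumptions: MAX_NUM_VALUES is taken to be 2**max_log2 (so M = max_log2 + 1), slots 0 and 1 are modelled as one-element lists rather than direct pointers, and `int.bit_length()` stands in for the BSR instruction. The class and method names are illustrative only.

```python
class PyramidList:
    """Sketch of the slot layout described above (names are illustrative)."""

    def __init__(self, max_log2=12):
        # Assumption: MAX_NUM_VALUES == 2**max_log2, so M == max_log2 + 1.
        self.max_log2 = max_log2
        self.slots = []      # slots[k] is a fixed-size block of values
        self.length = 0

    def _locate(self, i):
        """Map a flat index to (slot, offset) in a constant number of steps."""
        if i == 0:
            return 0, 0
        k = i.bit_length()               # equals BSR(i) + 1
        m = self.max_log2 + 1
        if k <= m:
            # Doubling region: slot k covers indices [2**(k-1), 2**k).
            return k, i - (1 << (k - 1))
        # Constant-size region: subtract the doubling part, then divide.
        rest = i - (1 << m)
        return m + 1 + (rest >> self.max_log2), rest & ((1 << self.max_log2) - 1)

    def push(self, value):
        slot, offset = self._locate(self.length)
        if slot == len(self.slots):
            # Allocate one new block; existing values are never moved.
            size = 1 if slot == 0 else 1 << min(slot - 1, self.max_log2)
            self.slots.append([None] * size)
        self.slots[slot][offset] = value
        self.length += 1

    def get(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        slot, offset = self._locate(i)
        return self.slots[slot][offset]
```

Note that `self.slots.append` is itself only amortized O(1) in Python, which mirrors the caveat about growing the slots array.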
A few other benefits
Works great for (embedded/file-system/kernel/etc) memory managers that have a maximum alloc size (i.e. can only allocate 4KiB chunks)
Works great when you truly don't know how large your vector is likely to be. Starts out extremely small and grows by known amounts
Always having (near) constant insertion may be useful for timing-critical interrupts/etc
Does not leave fragmented space behind when growing. Might be great for appending records into a file.
Disadvantages
Is likely less performant (amortized) than a contiguous vector in nearly every way (even insertion). Moving memory is typically less expensive than adding a dereference for every operation, so the amortized cost of a vector is still probably smaller.
Also, it is not truly always O(1) since the slots vector has to be grown when all the slots are full, but this only happens when currentNumSlots*2*MAX_NUM_VALUES have been added since the last growth.

When you exceed the capacity of an array of size X, and so allocate a new array of size 2X, you can then incrementally move the X items from the old array into the start of the new array over the next X append operations. After that the old array can be discarded when the new array is full, just before you have to allocate a new array of size 4X.
Therefore, it is not necessary to maintain this list of increasing-size arrays in order to achieve O(1) appends (assuming that allocation is O(1)). Incremental doubling is a well-known technique in the de-amortization business, so I think most people desiring this sort of behaviour would turn to that first.
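A small sketch of incremental doubling as described above. It only models the copy schedule: one old element is moved per append, so the migration of X items finishes exactly when the 2X array fills up. The class name and fields are hypothetical, and Python's `[None] * n` is of course not an O(1) allocation, so the "allocation is O(1)" assumption has to come from the real allocator.

```python
class IncrementalVector:
    """Growable array that never copies more than one element per push."""

    def __init__(self):
        self.cur = [None]    # current backing array, capacity 1
        self.old = None      # previous backing array while it is being migrated
        self.moved = 0       # how many items of old are already in cur
        self.size = 0

    def push(self, value):
        if self.size == len(self.cur):
            # cur is full; the previous migration has finished by now.
            self.old = self.cur
            self.cur = [None] * (2 * len(self.old))
            self.moved = 0
        self.cur[self.size] = value
        self.size += 1
        if self.old is not None:
            # Move exactly one old item per append.
            self.cur[self.moved] = self.old[self.moved]
            self.moved += 1
            if self.moved == len(self.old):
                self.old = None          # old array can be discarded

    def get(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        if self.old is not None and self.moved <= i < len(self.old):
            return self.old[i]           # not migrated yet
        return self.cur[i]
```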
Nothing like this is commonly used, because memory allocation can almost never be considered O(1). Applications that can't afford to copy a block all at once generally can't afford to use any kind of dynamic memory allocation at all.

Related

High performing container for storing a high number of objects

I am looking for an ideal data container with the following objectives:
The behavior of the container must be sort of like Queue, with the following specifications:
1) random access is not a must
2) iterating over the objects in two directions must be super fast ( contiguous data would be better)
3) high performing delete from the front of the list and insert in the back is a must ( a high number of deletes and appends are done at every time step )
4) items are not primitive types, they are objects.
I know double-linked lists are not high performing containers.
Vectors (like std::vector in C++) are good, but they are not really optimized for deleting from the front; also, I don't think vectorization is possible at all given the size of the objects.
I was also looking at the possibility of Slot-Map container, but not sure if it is the best option.
I was wondering if there are better options available?

You might be able to get away with just a regular vector and a start index that tells you where the "real" beginning of your data is.
to append to the back, use the regular method. This has an amortized constant-time complexity, which is probably fine for you given that you will be doing lots of pushing.
to delete from the front, increment start.
to access element i, use vector[start + i].
whenever you delete from anywhere but the front, or insert anywhere except the back, go ahead and recreate the whole vector without any leading deleted entries and reset start to zero.
Pros:
entries are in a contiguous chunk of memory
fast delete from the front and (amortized) fast insert into the back
fast random access and fast iteration
Cons:
slow worst-case insertion behavior
potentially lots of wasted space unless cleaned up periodically
cleaning up on deletes changes delete's worst-case behavior to linear, slow.
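The vector-plus-start-index idea could look roughly like this in Python. The compaction rule (clean up once more than half of the backing list is dead space) is an assumption added for illustration, as are the class and method names.

```python
class OffsetQueue:
    """A list plus a start index marking the real beginning of the data."""

    def __init__(self):
        self._items = []
        self._start = 0

    def __len__(self):
        return len(self._items) - self._start

    def push_back(self, item):
        self._items.append(item)             # amortized O(1)

    def pop_front(self):
        if self._start == len(self._items):
            raise IndexError("pop from empty queue")
        item = self._items[self._start]
        self._items[self._start] = None      # drop the reference
        self._start += 1                     # O(1): just move the start
        if self._start * 2 > len(self._items):
            self._compact()                  # assumed cleanup policy
        return item

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._items[self._start + i]

    def _compact(self):
        # Linear-time cleanup: remove the leading deleted entries.
        del self._items[:self._start]
        self._start = 0
```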
Whatever you do, consider comparing to the natural approach: a doubly-linked list with the head and tail remembered.
fast inserts/deletes from the front/back
no wasted space
True, the items will not be contiguous in memory so there is a potential for more cache misses; however, you could combat this with occasional defragmentation:
allocate enough contiguous space for all nodes in the list
recreate nodes in order by traversing links
release the original nodes and use the new set of nodes as the list
Depending on the pattern of deletes/inserts/traversals, this could be feasible.

If we really care about the performance, the container should never allocate any memory dynamically, i.e. we should define an upper limit of objects in the container.
The interface requirements are indeed queue-like, so it looks like the fastest option would be a circular queue of pointers to objects. The container should have the following fields:
OBJECT * ptrs[SIZE] -- fixed size array of pointers. Sure, we will waste SIZE * sizeof (OBJECT *) bytes here, but performance wise it could be a good trade.
size_t head_idx -- head object index.
size_t tail_idx -- tail object index.
iterating over the objects in two directions must be super fast
The next object is the next index in ptrs[] (valid indices run from head_idx up to, but not including, tail_idx):
if (cur_idx >= tail_idx) return nullptr; // already past the back
return ptrs[(cur_idx++) % SIZE]; // make sure SIZE is a power of 2 constant
The previous object is the previous index in ptrs[]:
if (cur_idx <= head_idx) return nullptr; // already at the front
return ptrs[(--cur_idx) % SIZE]; // make sure SIZE is a power of 2 constant
high performing delete from the front of the list and insert in the back is a must
The pop_front() would be as simple as:
if (tail_idx == head_idx) ... // queue is empty; should not happen, throw an error
head_idx++;
The push_back() would be as simple as:
if (tail_idx - head_idx >= SIZE) ... // queue is full; should not happen, throw an error
ptrs[(tail_idx++) % SIZE] = obj_ptr; // make sure SIZE is a power of 2 constant
items are not primitive types, they are objects
The most generic solution would be to simply store pointers in the cyclic queue, so the size of the object does not matter and you waste just SIZE times pointer, not SIZE times object. But sure, if you can afford to preallocate thousands of objects, it should be even faster...
Those are kind of speculations based on your performance requirements. I am not sure if you can afford to trade some memory for the performance, so I am sorry if it is not the case...
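For comparison, here is the same fixed-size circular queue as a self-contained Python sketch. It keeps the two ever-increasing counters and masks them with SIZE - 1, which is why SIZE has to be a power of two. The class name and the choice of exceptions are assumptions, not part of the description above.

```python
class RingQueue:
    """Fixed-capacity circular queue of object references."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of 2")
        self._slots = [None] * size
        self._mask = size - 1
        self._head = 0       # index of the front element
        self._tail = 0       # one past the back element

    def __len__(self):
        return self._tail - self._head

    def push_back(self, obj):
        if len(self) == len(self._slots):
            raise OverflowError("queue is full")
        self._slots[self._tail & self._mask] = obj
        self._tail += 1

    def pop_front(self):
        if self._head == self._tail:
            raise IndexError("queue is empty")
        obj = self._slots[self._head & self._mask]
        self._slots[self._head & self._mask] = None
        self._head += 1
        return obj

    def __iter__(self):
        # Front to back.
        for i in range(self._head, self._tail):
            yield self._slots[i & self._mask]

    def __reversed__(self):
        # Back to front.
        for i in range(self._tail - 1, self._head - 1, -1):
            yield self._slots[i & self._mask]
```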

Variable length free list

The concept of free list is commonly used for re-using space so if I have a file full of fixed-length values and I delete one, I put it on a free list. Then when I need to insert new value I take one from the list and put it on that spot.
However I am a bit confused as to how to implement a free list for values that are variable length. If I delete the value and I put the position and its length on the free list, how do I retrieve the "best" candidate for a new value?
Using plain list will be O(n) time complexity. Using a tree (with length as key) would make that log(n). Is there anything better that would give O(1)?

Yes, a hash table! So you have a big hashtable containing the sizes of the free blocks as keys and the values are arrays holding pointers to blocks of the corresponding sizes. So each time you free a block:
hash[block.size()].append(block.address())
And each time you allocate a free block:
block = hash[requested_size].pop()
The problem with this method is there are too many possible block sizes. Therefore the hash will fill up with millions of keys, wasting enormous amounts of memory.
So instead you can have a list and iterate it to find a suitable block:
for block in blocks:
    if block.size() >= requested_size:
        blocks.remove(block)
        return block
Memory efficient but slow because you might have to scan through millions of blocks.
So what you do is combine these two methods. If you set your allocation quantum to 64 bytes, then a hash containing 256 size classes can be used for all allocations up to 64 * 256 = 16 KiB. Blocks larger than that you store in a tree, which gives you O(log N) insertion and removal.
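A rough sketch of the combined scheme, with several simplifications that are assumptions rather than part of the answer: freed block sizes are taken to be multiples of the 64-byte quantum, small requests only look in their exact size class, blocks are handed out whole (no splitting), and a sorted Python list searched with `bisect` stands in for the tree (so insertion here is O(N), not O(log N)).

```python
import bisect

QUANTUM = 64
NUM_CLASSES = 256
MAX_SMALL = QUANTUM * NUM_CLASSES        # 16384 bytes


class FreeList:
    def __init__(self):
        # small[c] holds addresses of free blocks of size (c + 1) * QUANTUM.
        self.small = [[] for _ in range(NUM_CLASSES)]
        # large holds (size, address) pairs, kept sorted by size.
        self.large = []

    @staticmethod
    def _round_up(size):
        return -(-size // QUANTUM) * QUANTUM

    def free(self, address, size):
        # Assumption: size is a positive multiple of QUANTUM.
        if size <= MAX_SMALL:
            self.small[size // QUANTUM - 1].append(address)   # O(1)
        else:
            bisect.insort(self.large, (size, address))

    def allocate(self, requested):
        """Return (address, block_size) or None if nothing suitable is free."""
        size = self._round_up(requested)
        if size <= MAX_SMALL:
            bucket = self.small[size // QUANTUM - 1]
            return (bucket.pop(), size) if bucket else None   # O(1)
        # Best fit among large blocks: first entry with block size >= size.
        i = bisect.bisect_left(self.large, (size, -1))
        if i == len(self.large):
            return None
        block_size, address = self.large.pop(i)
        return address, block_size
```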

In-memory tree/index structure for fast lookups and insertions of increasing integer keys

Background: I'm going to be inserting about a billion key value pairs. I need an in-memory index with which I can simultaneously do look ups for the (32 bit integer) value for a (unique, 64 bit integer) key. There's no updating, no deleting and no traversing. The keys are generally gradually increasing with time.
What index structure is most appropriate to handle this?
The requirements I can think of are:
It needs to have efficient rebalancing, due to the increasing keys
It needs to use memory efficiently to fit in ram, preferably < 28GB
It needs to have very efficient lookups

There's probably no more efficient datastructure for this problem than a simple sorted vector. (Actually, given alignment issues and depending on access characteristics, you might want to put keys and values in separate vectors.) But there are a number of practical problems, particularly if you don't know how big the data will be. If you do know this, or if you're prepared to just preallocate too much space and then die if you get more data than will fit in this space, then that's fine, although you still need to worry about keeping the vector sorted.
A possibly better approach is to keep a binary search tree of index ranges, where the leaves of the BST point to "clumps" of data (i.e. vectors). (This is essentially a B+ tree.) The clumps can be reasonably large; I'd say something like the amount of data you expect to receive in a couple of minutes, or several thousand entries. They don't have to all be the same size. (B+-trees usually have a smaller fanout than that, but since your data is "mostly sorted", you should be able to use a larger one. Don't make it too large; the only point is to reduce overhead and possibly cache-thrashing.)
Since your data is "mostly sorted", you can accumulate data for a while, keeping it in an ordinary ordered map (assuming you have such a thing), or even in a vector using insertion sort. When this buffer gets large enough, you can append it to your main data structure as a single clump, repartitioning the last clump to deal with overlaps.
If you're reasonably certain that you will only rarely get out-of-order keys, another option would be to keep a second conventional BST of out-of-order data elements. Any element which cannot be accommodated by repartitioning the new clump and the previous last one can just be added to this BST. To do a lookup, you do a parallel lookup between the main structure and the out-of-order structure.
If you're paranoid or not certain enough about the amount of unordered data, just use the standard B+-tree insertion algorithm, which consists of creating clumps with a little bit of reserved but unused space to allow for insertions (a few per cent; you want to avoid space overhead), and splitting a clump if necessary.
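A much-reduced sketch of the "sorted vector plus out-of-order side structure" idea, assuming keys fit in unsigned 64 bits and values in unsigned 32 bits. It uses a single pair of typed arrays instead of clumps, and a plain dict as a stand-in for the second BST; the class and method names are hypothetical.

```python
from array import array
from bisect import bisect_left


class AppendMostlyIndex:
    def __init__(self):
        self.keys = array('Q')     # sorted keys, stored separately from values
        self.values = array('I')   # values[i] belongs to keys[i]
        self.stragglers = {}       # out-of-order keys (stand-in for the BST)

    def insert(self, key, value):
        if not self.keys or key > self.keys[-1]:
            # Common case: keys arrive in increasing order, so just append.
            self.keys.append(key)
            self.values.append(value)
        else:
            self.stragglers[key] = value

    def lookup(self, key):
        # Binary search in the main sorted arrays first.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        # Then fall back to the out-of-order structure.
        return self.stragglers.get(key)
```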

What makes table lookups so cheap?

A while back, I learned a little bit about big O notation and the efficiency of different algorithms.
For example, looping through each item in an array to do something with it
foreach(item in array)
    doSomethingWith(item)
is an O(n) algorithm, because the number of cycles the program performs is directly proportional to the size of the array.
What amazed me, though, was that table lookup is O(1). That is, looking up a key in a hash table or dictionary
value = hashTable[key]
takes the same number of cycles regardless of whether the table has one key, ten keys, a hundred keys, or a gigabrajillion keys.
This is really cool, and I'm very happy that it's true, but it's unintuitive to me and I don't understand why it's true.
I can understand the first O(n) algorithm, because I can compare it to a real-life example: if I have sheets of paper that I want to stamp, I can go through each paper one-by-one and stamp each one. It makes a lot of sense to me that if I have 2,000 sheets of paper, it will take twice as long to stamp using this method than it would if I had 1,000 sheets of paper.
But I can't understand why table lookup is O(1). I'm thinking that if I have a dictionary, and I want to find the definition of polymorphism, it will take me O(logn) time to find it: I'll open some page in the dictionary and see if it's alphabetically before or after polymorphism. If, say, it was after the P section, I can eliminate all the contents of the dictionary after the page I opened and repeat the process with the remainder of the dictionary until I find the word polymorphism.
This is not an O(1) process: it will usually take me longer to find words in a thousand page dictionary than in a two page dictionary. I'm having a hard time imagining a process that takes the same amount of time regardless of the size of the dictionary.
tl;dr: Can you explain to me how it's possible to do a table lookup with O(1) complexity?
(If you show me how to replicate the amazing O(1) lookup algorithm, I'm definitely going to get a big fat dictionary so I can show off to all of my friends my ninja-dictionary-looking-up skills)
EDIT: Most of the answers seem to be contingent on this assumption:
You have the ability to access any page of a dictionary given its page number in constant time
If this is true, it's easy for me to see. But I don't know why this underlying assumption is true: I would use the same process to look up a page by number as I would by word.
Same thing with memory addresses, what algorithm is used to load a memory address? What makes it so cheap to find a piece of memory from an address? In other words, why is memory access O(1)?

You should read the Wikipedia article.
But the essence is that you first apply a hash function to your key, which converts it to an integer index (this is O(1)). This is then used to index into an array, which is also O(1). If the hash function has been well designed, there should only be one (or a few items) stored at each location in the array, so the lookup is complete.
So in massively-simplified pseudocode:
ValueType array[ARRAY_SIZE];
void insert(KeyType k, ValueType v)
{
    int index = hash(k);
    array[index] = v;
}
ValueType lookup(KeyType k)
{
    int index = hash(k);
    return array[index];
}
Obviously, this doesn't handle collisions, but you can read the article to learn how that's handled.
Update
To address the edited question, indexing into an array is O(1) because underneath the hood, the CPU is doing this:
ADD index, array_base_address -> pointer
LOAD pointer -> some_cpu_register
where LOAD loads data stored in memory at the specified address.
Update 2
And the reason a load from memory is O(1) is really just because this is an axiom we usually specify when we talk about computational complexity (see http://en.wikipedia.org/wiki/RAM_model). If we ignore cache hierarchies and data-access patterns, then this is a reasonable assumption. As we scale the size of the machine, this may not be true (a machine with 100TB of storage may not take the same amount of time as a machine with 100kB). But usually, we assume that the storage capacity of our machine is constant, and much much bigger than any problem size we're likely to look at. So for all intents and purposes, it's a constant-time operation.

I'll address the question from a different perspective than everyone else. Hopefully this will shed light on why accessing x[45] and accessing x[5454563] take the same amount of time.
A RAM is laid out in a grid (i.e. rows and columns) of capacitors. A RAM can address a particular cell of memory by activating a particular column and row on the grid, so let's say if you have a 16-byte capacity RAM, laid out in a 4x4 grid (insanely small for modern computer, but sufficient for illustrative purpose), and you're trying to access the memory address 13 (1101), you first split the address into rows and column, i.e row 3 (11) column 1 (01).
Let's suppose a 0 means taking the left intersection and a 1 means taking a right intersection. So when you want to activate row 3, you send an army of electrons in the row starting gate, the row-army electrons went right, right to reach row 3 activation gate; next you send another army of electrons on the column starting gate, the column-army electrons went left then right to reach the 1st column activation gate. A memory cell can only be read/written if the row and column are both activated, so this would allow the marked cell to be read/written.
The effect of all this gibberish is that the access time of a memory address depends on the address length, and not the particular memory address itself; if an architecture uses a 32-bit address space (i.e. 32 intersections), then addressing memory address 45 and addressing memory address 5454563 both will still have to pass through all 32 intersections (actually 16 intersections for the row electrons and 16 intersections for the columns electrons).
Note that in reality memory addressing takes very little amount of time compared to charging and discharging the capacitors, therefore even if we start having a 512-bit length address space (enough for ~1.4*10^130 yottabyte of RAM, i.e. enough to keep everything under the sun in your RAM), which mean the electrons would have to go through 512 intersections, it wouldn't really add that much time to the actual memory access time.
Note that this is a gross oversimplification of modern RAM. In modern DRAM, if you want to access subsequent memory addresses you only change the columns and not spend time changing the rows, therefore accessing subsequent memory is much faster than accessing totally random addresses. Also, this description is totally ignorant about the effect of CPU cache (although CPU cache also uses a similar grid addressing scheme, however since CPU cache uses the much faster transistor-based capacitor, the negative effect of having large cache address space becomes very critical). However, the point still holds that if you're jumping around the memory, accessing any one of them will take the same amount of time.

You're right, it's surprisingly difficult to find a real-world example of this. The idea of course is that you're looking for something by address and not value.
The dictionary example fails because you don't immediately know the location of page say 278. You still have to look that up the same as you would a word because the page locations are not in your memory.
But say I marked a number on each of your fingers and then I told you to wiggle the one with 15 written on it. You'd have to look at each of them (assuming they're unsorted), and if it's not 15 you check the next one. O(n).
If I told you to wiggle your right pinky. You don't have to look anything up. You know where it is because I just told you where it is. The value I just passed to you is its address in your "memory."
It's kind of like that with databases, but on a much larger scale than just 10 fingers.

Because work is done up front -- the value is put in a bucket that is easily accessible given the hashcode of the key. It would be like if you wanted to look up your word in the dictionary but had marked the exact page the word was on.

Imagine you had a dictionary where everything starting with letter A was on page 1, letter B on page 2...etc. So if you wanted to look up "balloon" you would know exactly what page to go to. This is the concept behind O(1) lookups.
Arbitrary data input => maps to a specific memory address
The trade-off of course being you need more memory to allocate for all the potential addresses, many of which may never be used.

If you have an array with 999999999 locations, how long does it take to find a record by social security number?
Assuming you don't have that much memory, then allocate about 30% more array locations than the number of records you intend to store, and then write a hash function to look it up instead.
A very simple (and probably bad) hash function would be social % numElementsInArray.
The problem is collisions: you can't guarantee that every location holds only one element. But that's OK; instead of storing the record at the array location, you can store a linked list of records. Then you scan linearly for the element you want once you hash to get the right array location.
Worst case this is O(n)--everything goes to the same bucket. Average case is O(1) because in general if you allocate enough buckets and your hash function is good, records generally don't collide very often.
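Here is a minimal sketch of that chained scheme, using the simple `key % number_of_buckets` hash mentioned above and Python lists in place of linked lists. The class name, default bucket count, and the choice to overwrite an existing key are all illustrative assumptions.

```python
class ChainedTable:
    """Hash table with integer keys and one chain (list) per bucket."""

    def __init__(self, num_buckets=13):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Very simple (and probably bad) hash: key modulo the table size.
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, record):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)    # replace the existing record
                return
        bucket.append((key, record))         # collision: extend the chain

    def lookup(self, key):
        # Linear scan, but only within the one bucket the key hashes to.
        for existing_key, record in self._bucket(key):
            if existing_key == key:
                return record
        return None
```

With enough buckets and a reasonable hash the chains stay short, which is where the average-case O(1) comes from; if every key lands in the same bucket, `lookup` degrades to the O(n) scan.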

Ok, hash-tables in a nutshell:
You take a regular array (O(1) access), and instead of using regular Int values to access it, you use MATH.
What you do is take the key value (let's say a string), turn it into a number (some function of the characters), and then use a well-known mathematical formula that gives you a relatively good distribution over the array's range.
So, in that case you are just doing like 4-5 calculations (O(1)) to get an object from that array, using a key which isn't an int.
Now, avoiding collisions, and finding the right mathematical formula for good distribution is the hard part. That's what is explained pretty well in wikipedia: en.wikipedia.org/wiki/Hash_table

Lookup tables know exactly how to access the given item in the table before hand.
This is completely the opposite of, say, finding an item by its value in a sorted array, where you have to access items to check whether each one is what you want.

In theory, a hashtable is a series of buckets (addresses in memory) and a function that maps objects from a domain into those buckets.
Say your domain is 3-letter words: you'd block out 26^3 = 17,576 addresses for all the possible 3-letter words and create a function that maps all 3-letter words to those addresses, e.g., aaa=0, aab=1, etc. Now when you have a word you'd like to look up, say, "and", you know immediately from your O(1) function that it is address number 341 (a=0, n=13, d=3, so 0*676 + 13*26 + 3 = 341).
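The mapping function for that toy domain can be written as a base-26 conversion. This is a sketch that assumes lowercase ASCII words of exactly three letters; the function name is made up for illustration.

```python
def word_index(word):
    """Map a lowercase 3-letter word to an address in [0, 26**3)."""
    if len(word) != 3 or not all('a' <= c <= 'z' for c in word):
        raise ValueError("expected three lowercase letters")
    index = 0
    for c in word:
        # Treat each letter as a base-26 digit with a=0 ... z=25.
        index = index * 26 + (ord(c) - ord('a'))
    return index


print(word_index("aaa"))   # 0
print(word_index("aab"))   # 1
print(word_index("zzz"))   # 17575
```

The work done is three multiplications and additions regardless of how many words are stored, which is the sense in which the lookup is O(1).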

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search which one out of 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?

Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.

As an idea...
Up the index table to 8 bits, then xor all 3 bytes of the 24 bit word into it.
then your table would consist of this 8 bit hash value, plus the index back to the original 24 bit value.
Since your data is RGB like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the right-hand-most byte.
(bit24var >> 8) & 0xff gives you the one beside it.
(bit24var >> 16) & 0xff gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
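A sketch of the XOR-folding hash with chaining, in Python. It only answers exact-match queries (is this 24-bit value one of the important ones, and at which index?), so it does not by itself solve the "best fit" search; the function names are hypothetical.

```python
def xor_hash(value24):
    """Fold the three bytes of a 24-bit value into one 8-bit hash."""
    return (value24 & 0xFF) ^ ((value24 >> 8) & 0xFF) ^ ((value24 >> 16) & 0xFF)


def build_table(important_values):
    """Build 256 chains of (value, index) pairs keyed by the 8-bit hash."""
    table = [[] for _ in range(256)]
    for index, value in enumerate(important_values):
        table[xor_hash(value)].append((value, index))
    return table


def find_index(table, value24):
    """Return the index of value24 in the important list, or None."""
    for value, index in table[xor_hash(value24)]:
        if value == value24:
            return index
    return None
```

With 2^7 important values spread over 256 chains, most chains hold zero or one entry, so a lookup usually costs one hash and at most a comparison or two.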

Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.

How many of the 2^24 values do you actually have? Can you sort these values and count the runs of consecutive equal values?

Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()  # dictionary to hold the final mapping
    next_index = 0    # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # this label is assigned, the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))  # a list, so that it can be shuffled in place
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hashtable), where n is the size of the input and k is the number of important values.

Another idea is to represent the 24BitValue array in a bit map. A nice unsigned char can hold 8 bits, so one would need 2^24 / 8 = 2^21 array elements. That's 2,097,152 bytes (2 MiB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array, and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
Good luck on your quest.
Let us know how things turn out.
Evil.
