I need to store billions of unsigned 64-bit ints in a data structure that supports lookups. Currently the solution is a Bloom filter, which generally works and gives me a probability that a given int is in the stored set. This is fine for now; however, given the error rate configured in the Bloom filter and the sheer number of distinct ints, it takes up several GB of memory (about 5 GB).
Is there a data structure that would allow me to do non-probabilistic lookups while taking up less (or at most as much) space? I looked at tries and x-fast tries, both of which would work but take too much memory.
My understanding of hash tables is that they use hash functions to map keys to locations in memory, with a total number of "buckets" pre-allocated in memory. The goal is for there to be enough buckets that I don't have to use chaining, which degrades my ideal O(1) access time to O(1 + n/m), where n is the number of unique keys to store and m is the number of buckets.
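In code, the scheme I'm describing boils down to something like this (a minimal sketch; the names are made up):

#include <cstddef>
#include <cstdint>
#include <functional>

std::size_t bucket_index(std::uint64_t key, std::size_t m) {
    return std::hash<std::uint64_t>{}(key) % m;   // key -> one of m buckets
}
// on a collision, the key is appended to the linked list chained off buckets[bucket_index(key, m)]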
So if I have 1000 unique items to store, I'll want no fewer than 1000 buckets, and perhaps many more, to minimize the probability of having to use my chained linked list. If this weren't the case, we'd expect the average hash table to have many, many collisions. Now if we've got 1000 pre-allocated buckets, that means I have 1000 bytes of allocated memory, distributed around my memory. Thus every single unique key in my hash table results in a fragment of memory, fragmenting my RAM.
Does this mean that using a hash table is basically guaranteed to cause an amount of fragmentation proportional to the number of unique keys? Furthermore, this seems to suggest that you can greatly reduce fragmentation by using some statistics to pick the number of buckets, if you know how many unique keys there will be. Is that the case?
"1000 bytes of allocated memory, distributed around my memory"
No, you have one array of 1000 entries (each of some size, almost certainly larger than 1 byte).
If each entry is big enough to handle the non-collision case in-place, no extra dynamic allocation is required until you have a collision. (e.g. maybe you use a union and a 1-bit flag to indicate whether this entry is a stand-alone bucket or whether it's a pointer to a linked list.)
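Sketched out, that in-place layout might look like this (hypothetical; a real table would steal a spare bit for the flag rather than spend a whole member on it):

#include <cstdint>

struct Node;  // chain node for the collision case, defined elsewhere

struct Bucket {
    bool is_chain;              // the "1-bit flag": stand-alone entry or chain?
    union {
        std::uint64_t key;      // common case: the single key stored in-place
        Node* chain;            // collision case: head of a linked list
    };
};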
If not, then when you write an entry, space needs to be allocated for it and a pointer stored in the table array itself. (e.g. a key-value hash table with small keys but large values). An empty hash table can still be full of NULL pointers.
You might still want it to hold structs of pointer and hash value (for single-member buckets). Then you can reject definitely-not-present queries without another level of indirection whenever the full hash value doesn't match the query; e.g. for a 32- or 64-bit hash, that's many more bits than the 10 bits used to index a 1024-entry table.
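For example (a sketch, assuming a 64-bit hash; the names are made up):

#include <cstdint>

struct Node;  // heap-allocated entry

struct Bucket {
    std::uint64_t hash;  // full hash of the stored key
    Node* node;          // nullptr while the bucket is empty
};

// Reject a definitely-not-present query without touching the node:
bool worth_dereferencing(const Bucket& b, std::uint64_t query_hash) {
    return b.node != nullptr && b.hash == query_hash;
}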
To reduce overall fragmentation, you can use a slab allocator or other technique for carving nodes out of a contiguous block you get from a global allocator. Having the hash table maintain its own private free-list could help with spatial locality of the linked-list nodes, so they're at least not scattered across many different virtual pages (TLB misses) and hopefully not DRAM pages (even slower cache misses).
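A bare-bones version of that idea might look like the following (a sketch, assuming fixed-size, default-constructible nodes; not production code):

#include <cstddef>
#include <vector>

template <typename Node>
class NodePool {
    std::vector<std::vector<Node>> slabs_;  // big contiguous blocks from the global allocator
    std::vector<Node*> free_;               // private free-list of recycled nodes
    std::size_t used_ = 0;                  // nodes handed out from the current slab
public:
    Node* allocate() {
        const std::size_t slab_size = 4096; // nodes carved out of one slab
        if (!free_.empty()) {
            Node* n = free_.back();
            free_.pop_back();
            return n;                       // reuse a recycled node, keeps locality
        }
        if (slabs_.empty() || used_ == slab_size) {
            slabs_.emplace_back(slab_size); // one allocation serves thousands of nodes
            used_ = 0;
        }
        return &slabs_.back()[used_++];
    }
    void deallocate(Node* n) { free_.push_back(n); }  // recycle privately, never free
};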
I have read about the variants of hash tables, but it is not clear to me which one is most appropriate for a system that is low on memory (we have a hard memory limit).
Linear/quadratic probing works well for sparse tables.
I think double hashing is the same as quadratic probing in this respect.
External chaining does not have issues with clustering.
Most textbooks I have checked seem to assume that extra space will always be available, but in practice most example implementations I have seen take up much more space than really needed, since the hash table is never shrunk (halved) as it empties.
So which variant of a hash table is most efficient when we want to make the best use of memory?
Update:
So my question is not only about the size of the buckets. My understanding is that both the bucket size and the performance under load matter: if the buckets are small but the table degrades at 50% load, then we need to resize to a larger table often.
See this variant of Cuckoo Hashing.
It will require more hash functions from you, but it makes sense: you need to pay something for the memory savings.
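For reference, plain two-table cuckoo hashing looks roughly like this (a sketch; the hash functions here are arbitrary multiplicative mixes, and the bucketized variant packs several keys per bucket to reach higher load factors):

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal two-table cuckoo set for non-zero 64-bit keys (0 marks an empty slot).
class CuckooSet {
    std::vector<std::uint64_t> t1_, t2_;
    std::size_t h1(std::uint64_t k) const { return (k * 0x9E3779B97F4A7C15ULL) % t1_.size(); }
    std::size_t h2(std::uint64_t k) const { return ((k * 0xC2B2AE3D27D4EB4FULL) >> 17) % t2_.size(); }
public:
    explicit CuckooSet(std::size_t slots_per_table)
        : t1_(slots_per_table, 0), t2_(slots_per_table, 0) {}

    bool contains(std::uint64_t k) const {
        return t1_[h1(k)] == k || t2_[h2(k)] == k;   // at most two probes, ever
    }

    bool insert(std::uint64_t k) {
        if (contains(k)) return true;
        for (int kicks = 0; kicks < 500; ++kicks) {  // bounded eviction chain
            std::swap(k, t1_[h1(k)]);                // place k, evict any occupant
            if (k == 0) return true;
            std::swap(k, t2_[h2(k)]);                // evicted key goes to its other table
            if (k == 0) return true;
        }
        return false;  // likely a cycle; a real table would rehash with new functions
    }
};

Lookups touch at most two slots, which is where the memory/time trade-off pays off.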
I'm developing an application that builds a 3D Voronoi diagram from a 3D point cloud, using a dynamically allocated boost multi_array to store the whole diagram.
One of the test cases I'm using requires a large amount of memory (a grid of about [600][600][600]), which is over the allowed limit and results in bad_alloc.
I already tried splitting the diagram into smaller pieces, but that doesn't work either, as the total memory seems to be over the limit already.
My question is: how can I work with such a large 3D volume under the PC's constraints?
EDIT
The element type is a struct, as follows:
struct Elem {
    int R[3];
    int d;
    int label;
};
The elements are indexed in the multi_array based on their position in 3D space.
The multi_array is constructed by setting specific points in the space from a file and then filling the intermediate spaces by passing a forward and a backward mask over the whole volume.
You didn't say how you get all your points. If you read them from a file, then don't read them all at once. If you compute them, then you can probably recompute them as needed. In both cases you can implement a cache that stores the most often used ones. If you know how your algorithm will use the data, then you can predict which values will be needed next. You can even do this in a different thread.
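For example, a small LRU cache along those lines (a sketch; compute_point is a placeholder for however you recompute or re-read a point, and Point3D stands in for your element type):

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

struct Point3D { double x, y, z; };
Point3D compute_point(std::size_t index);  // placeholder: recompute or re-read from file

class PointCache {
    using List = std::list<std::pair<std::size_t, Point3D>>;
    std::size_t capacity_;
    List lru_;                                            // most recently used at the front
    std::unordered_map<std::size_t, List::iterator> pos_;
public:
    explicit PointCache(std::size_t capacity) : capacity_(capacity) {}

    const Point3D& get(std::size_t i) {
        auto it = pos_.find(i);
        if (it != pos_.end()) {                           // hit: move to the front
            lru_.splice(lru_.begin(), lru_, it->second);
            return lru_.front().second;
        }
        if (lru_.size() == capacity_) {                   // full: evict least recently used
            pos_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(i, compute_point(i));          // miss: recompute and remember
        pos_[i] = lru_.begin();
        return lru_.front().second;
    }
};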
The second solution is to work on your data so it fits in your RAM. You have 216 million points, but we don't know the size of a point. They are 3D, but do they use floats or doubles? Are they classes or simple structs? Do they have vtables? Do you use a Debug build? (In Debug builds, objects may be bigger.) Do you allocate the entire array at the beginning, or incrementally? I believe there should be no problem storing 216M 3D points on a current PC, but it depends on the answers to all those questions.
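For reference, with the Elem struct from the edit above, each element is five 4-byte ints, so sizeof(Elem) is 20 bytes (assuming no padding), and 600 * 600 * 600 * 20 bytes is roughly 4.3 GB. That is more than a 32-bit process can address, which by itself would explain the bad_alloc.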
The third way that comes to my mind is to use memory-mapped files, but I have never used them personally.
Here are a few things to try:
Try to allocate in different batches, like 1 * 216M, 1k * 216k, 1M * 216, to see how much memory you can get.
Try changing the boost multi_array to std::vector, or even a raw void* buffer, and compare the maximum RAM you can get (see the probe sketch below).
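For instance, a quick throwaway probe like this shows the largest single contiguous block you can actually get (a diagnostic sketch, nothing more):

#include <cstddef>
#include <cstdio>
#include <new>

int main() {
    // Halve the request until a single contiguous allocation succeeds.
    for (std::size_t mb = 8192; mb > 0; mb /= 2) {
        char* p = new (std::nothrow) char[mb * 1024 * 1024];
        if (p != nullptr) {
            std::printf("largest contiguous block obtained: %zu MB\n", mb);
            delete[] p;
            return 0;
        }
    }
    return 1;
}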
You didn't mention the element type. Given that the element is a four-byte float, a 600*600*600 matrix only takes about 820 MB, which is not very big, actually. I'd suggest checking your operating system's limit on memory usage per process. On Linux, check it with ulimit -a.
If you really cannot allocate the matrix in memory, create a file of the desired size on disk, map it into memory using mmap, and then pass the address returned by mmap to boost::multi_array_ref.
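A sketch of that approach on POSIX (error handling omitted; the sizes match the [600][600][600] test case and the Elem struct from the edit, so this needs a 64-bit build):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <boost/multi_array.hpp>

struct Elem { int R[3]; int d; int label; };

int main() {
    const std::size_t n = 600;
    const std::size_t bytes = n * n * n * sizeof(Elem);  // ~4.3 GB with this Elem

    int fd = open("voronoi.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, bytes);                     // grow the backing file to full size
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // Wrap the mapped region; the grid itself never touches the heap.
    boost::multi_array_ref<Elem, 3> grid(static_cast<Elem*>(p),
                                         boost::extents[n][n][n]);
    grid[0][0][0].label = 42;                 // reads/writes go through the page cache

    munmap(p, bytes);
    close(fd);
    return 0;
}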
I have a large number (hundreds of millions) of fixed-size values stored in random order on a disk. I have the same set of values stored in memory, in a different order. I need to store the values on disk in the order they are in memory. The challenge is this: I need to keep at least one copy of each value on disk at any one time, i.e. it needs to be durable.
I have quite a bit of RAM to work with (the values take up only about 60%), a lot of ephemeral storage, but only a very small amount of space on the durable disk, enough for less than a million of the values.
Given a value on disk, I can find it in memory very, very quickly. But the converse is not true: given a value in memory, it is very slow to find it on disk.
Given these limitations, what's the best algorithm to transfer the order of the values from memory, to disk, as fast as possible?
Sounds like you have a sorting problem where your comparator is the order of elements in RAM (element x is 'bigger' than element y if x appears after y in RAM).
It can be solved using an external sort.
Note that if you allow duplicates, some more processing is needed to make sure your comparator is valid (this can be solved by enumerating the identical values and assigning a 'dupe_id' to each duplicate, both in RAM and on disk).
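Concretely, the comparator is just a lookup of each value's position in RAM. A sketch of the chunk-sort phase, assuming 64-bit values for illustration (the sorted chunks then go through the usual k-way merge):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Build value -> rank once from the in-memory order.
std::unordered_map<std::uint64_t, std::size_t>
build_ranks(const std::vector<std::uint64_t>& ram_order) {
    std::unordered_map<std::uint64_t, std::size_t> rank;
    rank.reserve(ram_order.size());
    for (std::size_t i = 0; i < ram_order.size(); ++i)
        rank.emplace(ram_order[i], i);  // duplicates would need the dupe_id scheme
    return rank;
}

// Sort one chunk read from disk into the RAM order.
void sort_chunk(std::vector<std::uint64_t>& chunk,
                const std::unordered_map<std::uint64_t, std::size_t>& rank) {
    std::sort(chunk.begin(), chunk.end(),
              [&](std::uint64_t a, std::uint64_t b) { return rank.at(a) < rank.at(b); });
}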
When retrieving entries from a database, is there a difference between storing values as a float or decimal vs. an int when using ORDER BY in a SELECT statement?
It depends. You didn't specify the RDBMS, so I can only speak to SQL Server specifically, but data types have different storage costs. Ints range from 1 to 8 bytes, decimals are 5 to 17, and floats are 4 to 8 bytes.
The RDBMS will need to read data pages off disk to find your data (worst case), and it can only fit so many rows on an 8 KB page. So if you have 17-byte decimals, you're going to read 1/17th the number of rows per page that you could have if you sized your data correctly and used a tinyint with a 1-byte storage cost.
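To put numbers on it: a SQL Server page holds about 8,060 bytes of row data, so ignoring per-row overhead, a 17-byte decimal packs roughly 8060 / 17 ≈ 474 values per page, versus around 8,060 tinyint values per page - the same 17x difference in pages read.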
That storage cost has a cascading effect when you go to sort (ORDER BY) your data. The engine will attempt to sort in memory, but if you have a bazillion rows and are starved for memory, it may spill to temp storage for the sort, and then you're paying that cost over and over.
Indexes may help, as the data can be stored pre-sorted, but again, getting that data into memory is not as efficient for obese data types.
[edit]
@Bohemian makes a fine point about the CPU efficiency of integer vs. floating-point comparisons, but it is amazingly rare for the CPU to be the bottleneck on a database server. You are far more likely to be constrained by the disk I/O subsystem and memory, which is why my answer focuses on the speed of getting that data into the engine for the sort operation rather than the CPU cost of the comparison.
(Edited) Since both int and float occupy exactly the same space on disk, and of course in memory (i.e. 32 bits), the only differences are in the way they are processed.
int should be faster to sort than float, because the comparison is simpler: Processors can compare ints in one machine cycle, but a float's bits have to be "interpreted" to get a value before comparing (not sure how many cycles, but probably more than one, although some CPUs may have special support for float comparison).
In general, the choice of datatypes should be driven by whether the datatype is appropriate for storing the values that are required to be stored. If a given datatype is inadequate, it doesn't matter how efficient it is.
In terms of disk I/O, the speed difference is second order. Don't worry about second-order effects until your design is good with regard to first-order effects.
Correct index design results in a huge decrease in delays when a query can be retrieved in sorted order to begin with. However, speeding up that query comes at the cost of slowing down other operations, like those that modify the indexed data. The trade-off has to be weighed to see whether it's worth it.
In short, worry about the things that will double your disk I/O (or worse) before you worry about the things that will add 10% to it.