I need to store many data flows, each consisting of something like:
struct Flow {
    source: Address,
    destination: Address,
    last_seq_num_sent: u32,
    last_seq_num_rcvd: u32,
    last_seq_num_ackd: u32,
}
I need to query by last_seq_num_rcvd. I can guarantee (with off-screen magic) the uniqueness of this field among all flows.
The flows may occur over unreliable connections, so some sequence numbers may get skipped due to network packet loss. I account for this by using a window, which also guarantees uniqueness for its entire stretch. The rates of the data flows are independent of each other, but flows can renumber their sequence numbers before collisions occur.
So the goal is to perform a range query against the flows to find any flow with a last_seq_num_rcvd within a WINDOW_SIZE constant's distance of some given next sequence number.
I gather the BTreeMap is appropriate here for its range query ability.
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included};

const WINDOW_SIZE: u32 = 10;

struct FlowValue { /* All original fields, minus last_seq_num_rcvd, which now acts as the key */ }

let mut flows: BTreeMap<u32, FlowValue> = BTreeMap::new();

let query: u32 = 42;
for (k, v) in flows.range((Excluded(query), Included(query + WINDOW_SIZE))) {
    // This is how I would query for a flow
}
But now my key is something that changes often. It seems like there's no efficient way to update it in place; it requires a full deletion and reinsertion (under the incremented key), which sounds like an expensive operation.
Is the BTreeMap method too expensive? Is there an alternative data structure that isn't? Or could I overload the BTreeMap to actually perform an efficient in-place increment of an integer key?
You're right that a B-Tree map is a little expensive for this application.
Since the window size is constant, a faster implementation would be to partition the sequence numbers into buckets of size about WINDOW_SIZE/2. Then just put the flows into a hash table according to their rcvd bucket.
To find flows for a particular packet, then, you only need to look up the 3 buckets that could possibly contain matching flows, and test each flow in the buckets. This will be faster than a B-Tree lookup.
On update, the situation is even better, because you only need to touch the hash table when an entry changes buckets, and that only happens once every WINDOW_SIZE/2 packets.
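A rough Rust sketch of that bucketing scheme (the FlowTable layout and helper names here are just illustrative, not a reference implementation):

use std::collections::HashMap;

const WINDOW_SIZE: u32 = 10;
const BUCKET_SIZE: u32 = WINDOW_SIZE / 2;

struct FlowValue { /* remaining flow fields */ }

// Flows keyed by the bucket of their last_seq_num_rcvd; each bucket holds
// (exact sequence number, flow) pairs.
struct FlowTable {
    buckets: HashMap<u32, Vec<(u32, FlowValue)>>,
}

impl FlowTable {
    fn bucket_of(seq: u32) -> u32 {
        seq / BUCKET_SIZE
    }

    /// Find a flow whose last_seq_num_rcvd lies in (query, query + WINDOW_SIZE].
    fn find(&self, query: u32) -> Option<&FlowValue> {
        let first = Self::bucket_of(query);
        // Only three consecutive buckets can contain a match.
        (first..=first + 2)
            .filter_map(|b| self.buckets.get(&b))
            .flat_map(|bucket| bucket.iter())
            .find(|(seq, _)| *seq > query && *seq <= query + WINDOW_SIZE)
            .map(|(_, flow)| flow)
    }

    /// Advance a flow's received sequence number; the table only changes
    /// when the flow crosses a bucket boundary (once every BUCKET_SIZE packets).
    fn advance(&mut self, old_seq: u32, new_seq: u32) {
        let (old_b, new_b) = (Self::bucket_of(old_seq), Self::bucket_of(new_seq));
        if old_b == new_b {
            // Cheap in-place update: same bucket, just rewrite the stored number.
            if let Some(bucket) = self.buckets.get_mut(&old_b) {
                if let Some(entry) = bucket.iter_mut().find(|(seq, _)| *seq == old_seq) {
                    entry.0 = new_seq;
                }
            }
        } else {
            // Bucket boundary crossed: remove from the old bucket, reinsert into the new one.
            let mut moved = None;
            if let Some(bucket) = self.buckets.get_mut(&old_b) {
                if let Some(pos) = bucket.iter().position(|(seq, _)| *seq == old_seq) {
                    moved = Some(bucket.swap_remove(pos));
                }
            }
            if let Some((_, flow)) = moved {
                self.buckets.entry(new_b).or_default().push((new_seq, flow));
            }
        }
    }
}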
We have an incoming queue of support cases from customers.
Each support case has at least the following fields:
Case Number (Unique numeric ID)
Creation Time (Timestamp)
Title (Text String)
Description (Text String)
Other fields...
We'd like to split these into three distinct buckets, in a repeatable way.
For example, the first case that comes in goes to queue A, the second to queue B, the third to queue C etc.
It doesn't necessarily need to be in that order, but the distribution needs to be equal (or close to equal).
The case number is monotonically increasing but they are not sequential (that is, there will be gaps - e.g. case 10005, then case 10400, then case 10405 etc.). The reason is that the case numbers are shared among several categories, but we are only looking at a single specific category of cases.
We don't want to maintain a lookup table; instead I was thinking of generating some kind of hash based on the case number + creation time, for example, and then taking a modulus 3 of it?
Question
Does the above approach look sane? Any comments?
What sort of hashing algorithm should I use, and across which fields, in order to get a good distribution for the modulus?
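For concreteness, a minimal Rust sketch of the proposed idea (the FNV-1a hash and the u64 field encodings are arbitrary choices; any stable hash shared by every consumer would do). Hashing first, rather than taking case_number % 3 directly, could help if the gaps in the case numbers follow a pattern:

/// FNV-1a, used here only as a simple, stable example hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01B3);
    }
    hash
}

/// Assign a case to queue 0, 1 or 2 from its case number and creation time
/// (both assumed to be available as u64 values).
fn assign_queue(case_number: u64, creation_time: u64) -> u8 {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&case_number.to_le_bytes());
    buf[8..].copy_from_slice(&creation_time.to_le_bytes());
    (fnv1a(&buf) % 3) as u8
}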
I found out about cuckoo hash tables and they seem pretty good.
But most example code I found implements this using 2 tables.
This seems wrong to me, because the 2 tables may be in different memory pages, so we pay the overhead of fetching random addresses and get no real locality.
Is it not possible to use 1 array instead of 2?
Is it perhaps not possible to detect when an element has already been kicked out twice, meaning it is time for resizing?
You can definitely do a cuckoo hashtable with a single hash table; that is, where the two positions for each object are simply positions within a single hash table.
The only small problem to solve is how to decide, during the cuckoo eviction loop, which of the two positions to use for an evicted key. Of course, you can just compute one position and use the other one if the first turned out to be the key's current position. It should be possible to use SIMD to compute both hashes in parallel, so the cost of this strategy might be small.
However, if you want to guarantee a single hash computation during the cuckoo loop, there is a simple solution: instead of using H0(k) and H1(k) as the two positions, use H0(k) and H0(k) xor H1(k). (If H1 is independent from H0, then so is H0 xor H1, so the xor does not affect the distribution of hash values.) With this modification, you can always find "the other position" of k by xor'ing the current position with H1(k), so only a single hash computation is needed in the loop.
While that allows you to use a single hash table, and may even simplify the code, there is not a lot of evidence that it improves the operation of the algorithm. In my limited testing, it seems to increase the number of loop iterations by 40-50%. (Although it needs to be emphasized that in the vast majority of cases, a new key can be inserted into the table without entering the loop at all, so the increased number of loops is hardly noticeable in the actual execution time.)
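A hedged Rust sketch of that xor variant over a single power-of-two table (the hash functions, table size, and Option<u64> slot layout are arbitrary illustrations, not a tuned implementation):

const TABLE_BITS: u32 = 10;
const TABLE_SIZE: usize = 1 << TABLE_BITS;

// Two independent hash functions; simple multiplicative hashes for illustration.
fn h0(key: u64) -> usize {
    (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - TABLE_BITS)) as usize
}
fn h1(key: u64) -> usize {
    (key.wrapping_mul(0xC2B2_AE3D_27D4_EB4F) >> (64 - TABLE_BITS)) as usize
}

// The two candidate slots are h0(k) and h0(k) ^ h1(k); given the slot a key
// currently occupies, its alternate slot needs only one hash computation.
fn other_slot(key: u64, current: usize) -> usize {
    current ^ h1(key)
}

// Minimal eviction loop over a single table of Option<u64> slots.
fn cuckoo_insert(table: &mut [Option<u64>], mut key: u64, max_kicks: u32) -> bool {
    let mut slot = h0(key);
    for _ in 0..max_kicks {
        match table[slot].replace(key) {
            None => return true, // found an empty slot
            Some(evicted) => {
                // Move the displaced key to its other position: one hash, no branching
                // on "which hash function placed it here".
                slot = other_slot(evicted, slot);
                key = evicted;
            }
        }
    }
    false // too many kicks: time to resize/rehash
}

The table itself would just be vec![None; TABLE_SIZE]; the whole point is that the eviction loop never has to work out which of the two hash functions produced the current position.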
To answer the confusion in the comments: no, this is not language-specific. If you're thinking about memory locality and want to ensure the two tables are close, then a single allocation is the way to go (however you allocate). In Java this may look as follows:
class TwoTables {
    private static final int SIZE_TABLE_FIRST = 11, SIZE_TABLE_SECOND = 29;

    private final int[] m_buffer;

    public TwoTables() {
        // One allocation backing both tables, so they stay adjacent in memory.
        m_buffer = new int[SIZE_TABLE_FIRST + SIZE_TABLE_SECOND];
    }

    // consider similar setters...
    public int getFirst(int key) {
        return m_buffer[toIndex(hashFirst(key), SIZE_TABLE_FIRST, 0)];
    }

    public int getSecond(int key) {
        return m_buffer[toIndex(hashSecond(key), SIZE_TABLE_SECOND, SIZE_TABLE_FIRST)];
    }

    private static int toIndex(int hash, int mod, int offset) {
        // floorMod keeps the index non-negative even for negative hash values
        return Math.floorMod(hash, mod) + offset;
    }

    private static int hashFirst(int key) { return ...; }
    private static int hashSecond(int key) { return ...; }
}
Whether this performs better than accessing two separate arrays depends on your JVM, however: just think of the JIT being able to merge two small allocations into a single larger one on the fly, without you having to perform any index magic.
Well, all forms of hashing are murder on caches.
Anyway, you can easily combine the two into a single table. But then how do you tell whether you're on your first hash function or the second? The options are to add that as metadata to every bucket, or to figure it out by running the first hash function, seeing whether it gives the current location, and running the second only if it does. That either requires extra space or running more hash functions.
Splitting the table into 2 solves that problem more efficiently. And statistically you need the same number of buckets to store the same number of things whether or not the table has been split. So your whole hash table becomes smaller.
Yes.
http://www.spoj.com/problems/CUCKOO/
You can check this problem on SPOJ; it has to be solved using a single hash table and two hash functions.
Hashmaps are usually implemented using an internal array (table) of buckets. When accessing the hashmap by key, we get the key's hashcode using a key-type-specific hash function. Then we need to map that hashcode to an index in the internal bucket table.
key -> (hash function) -> hashcode -> (???) -> index in internal table
Sometimes the internal table shrinks or expands, depending on the hashmap's fill ratio. Then the hashcode-to-index conversion presumably has to change a bit as well.
For example, our hash function returns a 32-bit unsigned integer value, and:
moment A: internal table has capacity 10000
moment B: internal table has capacity 100000
What algorithms or approaches are usually used to perform the hashcode-to-internal-table-index conversion? How is the table resizing issue solved for them?
Usually, a simple modulo will do the job.
To take a quick example from Wikipedia, it's as simple as this:
hash = hashfunc(key)
index = hash % array_size
As you said, resizing happens depending on the hashmap's fill ratio. The array is reallocated (see realloc()), the indices are recalculated given the new array size, and the values are copied to their new indices.
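In Rust terms, that mapping and the rehash-on-resize step might look roughly like this (a sketch with a vector-of-buckets layout; the hash is kept next to each value so it does not have to be recomputed):

fn index_for(hash: u64, table_len: usize) -> usize {
    (hash % table_len as u64) as usize
}

// On resize, every entry is re-indexed against the new table length.
fn resize(old: Vec<Vec<(u64, String)>>, new_len: usize) -> Vec<Vec<(u64, String)>> {
    let mut new_table: Vec<Vec<(u64, String)>> = vec![Vec::new(); new_len];
    for bucket in old {
        for (hash, value) in bucket {
            new_table[index_for(hash, new_len)].push((hash, value));
        }
    }
    new_table
}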
I wrote about this here and here.
When you increase the size of your vector of indices you can be sure that the algorithm that worked well on the shorter vector will work less well on the longer one. It is possible to test beforehand and have new algorithms to put in place when you make the vector longer. Or, as the number of occupied indices in the current vector increases, have a background, lower-priority thread that tests different algorithms on the data.
As the example in one of my answers shows, a "new algorithm" need be nothing more than a different pair of matched prime numbers.
At some point we need to increase the size of the hash table, and normally we just rehash, which leads to reconstructing the whole table.
Is there any better solution so that when we increase the size, we don't need to reconstruct the whole thing?
You could use http://en.wikipedia.org/wiki/Extendible_hashing, although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
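A hedged Rust sketch of that incremental idea (the growth threshold, migration step, and use of std's HashMap for the underlying tables are all arbitrary choices, and deletions are ignored):

use std::collections::HashMap;

/// Keep an active table and, while growing, a partially populated larger one.
/// Each insert migrates a few old entries, so no single operation pays for a
/// full rehash.
struct IncrementalMap {
    active: HashMap<u64, u64>,
    growing: Option<HashMap<u64, u64>>,
    migrate_per_op: usize,
}

impl IncrementalMap {
    fn new() -> Self {
        IncrementalMap {
            active: HashMap::with_capacity(16),
            growing: None,
            migrate_per_op: 4,
        }
    }

    fn insert(&mut self, key: u64, value: u64) {
        // Start building the larger table once the active one is about 3/4 full.
        if self.growing.is_none() && self.active.len() * 4 >= self.active.capacity() * 3 {
            self.growing = Some(HashMap::with_capacity(self.active.capacity() * 4));
        }
        if let Some(next) = &mut self.growing {
            next.insert(key, value);
            // Move a bounded number of old entries per operation.
            for _ in 0..self.migrate_per_op {
                let Some(k) = self.active.keys().next().copied() else { break };
                let v = self.active.remove(&k).unwrap();
                next.entry(k).or_insert(v); // don't clobber a newer value
            }
        } else {
            self.active.insert(key, value);
        }
        // Once the old table is drained, switch over.
        if self.active.is_empty() {
            if let Some(next) = self.growing.take() {
                self.active = next;
            }
        }
    }

    fn get(&self, key: &u64) -> Option<&u64> {
        // While migration is in progress, both tables must be consulted.
        self.growing
            .as_ref()
            .and_then(|next| next.get(key))
            .or_else(|| self.active.get(key))
    }
}

The usual price of this scheme is visible in get(): until the migration finishes, every lookup may touch two tables.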
Any time you resize, there's nothing that says you need to actually re-hash. In fact, all you actually need to do is re-mod (i.e. move everything to its new position).
If you cache the hash (hehe, sounds like the start of a Dr. Seuss book) then you only need to compute it once. So store the hash along with the actual data, and that will save you from needing to calculate it again in the future. However, I'm assuming you're not already doing this; you didn't exactly explain the current process.
// Store these instead of the data directly. This assumes immutable data.
struct hashable_item
{
    data dat;
    int32 hash;
};
I have an array of items that are sorted by a key value; items are retrieved by doing a binary search. A simplified version of the items would look something like this:
struct Item
{
    uint64_t key;
    uint64_t data;
};
I'm looking for ways to reduce the overhead of the key. The key value is not used for anything except searching. Assuming insert cost is not a concern, but retrieval cost is, what alternative data structure could I use to reduce the bookkeeping overhead to something less than 64-bits per item?
The only other "gotcha" is that I need to be able to detect the case where a key isn't present in the set.
One obvious possibility would be to treat your key as 8 individual bytes and build a trie out of them. This combines the common prefixes in your keys, so if you have (for example) a thousand Items with the same first byte, you only store that first byte once instead of a thousand times.
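A rough Rust sketch of such a byte-wise trie (the node layout is just one possibility; a sorted array or bitmap per level would be more compact than a HashMap):

use std::collections::HashMap;

// A fixed-depth trie: each level consumes one byte of the 64-bit key,
// so common prefixes are stored only once.
enum Node {
    Inner(HashMap<u8, Node>),
    Leaf(u64), // the data stored for a full 8-byte key
}

struct Trie {
    root: HashMap<u8, Node>,
}

impl Trie {
    /// Returns the data for `key`, or None if the key is absent.
    fn get(&self, key: u64) -> Option<u64> {
        let bytes = key.to_be_bytes();
        let mut level = &self.root;
        for (i, b) in bytes.iter().enumerate() {
            match level.get(b)? {
                Node::Inner(next) => level = next,
                Node::Leaf(data) if i == bytes.len() - 1 => return Some(*data),
                Node::Leaf(_) => return None, // malformed: leaf before the last byte
            }
        }
        None
    }
}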
In order to be able to detect the absence of a key from your set, you need to store your keys in one way or another. Since the keys are random, you can't compress them into fewer than 64 bits by using clever data structures. Ergo, the way you're doing it now is optimal in terms of memory consumption.
If there was some structure, or predictability, to the keys it would be a different story.
If the "keys are basically random", then you don't have much option other than what you are using right now. For 64bit integers you cannot even assume a dense set of keys.
Are there anything else about the keys that you can exploit? ... Maybe a lot of keys are near to each other ... or something else? ... In this cases you can build multi-level hash tables or tries for storing your data.