Redis HyperLogLog - PFCOUNT side effect - data-structures

Redis recently released its new data structure, the HyperLogLog. It lets us keep an approximate count of unique objects while taking up only 12 KB. What I don't understand is that Redis's PFCOUNT command is said to be technically a write command. Why is this the case?
Note: as a side effect of calling this function, it is possible that the HyperLogLog is modified, since the last 8 bytes encode the latest computed cardinality for caching purposes. So PFCOUNT is technically a write command.

The header of the HyperLogLog object is as follows:
struct hllhdr {
    char magic[4];       /* "HYLL" */
    uint8_t encoding;    /* HLL_DENSE or HLL_SPARSE. */
    uint8_t notused[3];  /* Reserved for future use, must be zero. */
    uint8_t card[8];     /* Cached cardinality, little endian. */
    uint8_t registers[]; /* Data bytes. */
};
Note the card field: it contains the last cardinality evaluated by the algorithm. Calculating an estimation of the cardinality is an expensive operation, so Redis caches the value and keeps it in this field.
When PFADD is called, the HyperLogLog object may be updated or not (there is a good probability it is not). If it is not updated, calling PFCOUNT will reuse the cached value (card field). If it is updated, the card field is invalidated, so the next PFCOUNT will execute the counting algorithm, and write the new value in the card field.
That's why PFCOUNT can alter the HyperLogLog object.
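For illustration, here is a condensed Java sketch of that behaviour. It is not Redis's actual C implementation; the register update and the estimator are simplified stubs. The point it shows is that PFADD only invalidates the cached count when a register actually changes, and PFCOUNT recomputes and writes the cache back lazily.

class HyperLogLogSketch {
    private final byte[] registers = new byte[16384]; // one register per bucket (dense encoding, simplified)
    private long cachedCardinality = -1;              // plays the role of the card field; -1 means "invalid"

    void pfadd(long elementHash) {
        int bucket = (int) (elementHash & 16383);     // low 14 bits select a register
        byte rank = (byte) (Long.numberOfTrailingZeros(elementHash >>> 14) + 1); // simplified rank of the remaining bits
        if (rank > registers[bucket]) {               // the object only changes when a register grows...
            registers[bucket] = rank;
            cachedCardinality = -1;                   // ...and only then is the cached count invalidated
        }
    }

    long pfcount() {
        if (cachedCardinality < 0) {
            cachedCardinality = estimateCardinality(); // expensive estimation, run only on a cache miss;
        }                                              // this write-back is why PFCOUNT is "technically a write"
        return cachedCardinality;
    }

    private long estimateCardinality() {
        return 0; // harmonic-mean estimator omitted in this sketch
    }
}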

Related

Data Structure with Range Search Against Constantly Updating Keys

I have the need to store many data flows consisting of something like:
struct Flow {
source: Address,
destination: Address,
last_seq_num_sent: u32,
last_seq_num_rcvd: u32,
last_seq_num_ackd: u32
}
I need to query by last_seq_num_rcvd. I can guarantee (with off-screen magic) the uniqueness of this field among all flows.
The flow may occur over unreliable connections, so some sequence numbers may get skipped due to network packet loss. I account for this by using a window, one which also guarantees uniqueness for its entire stretch. The rates of data flows are independent of each other, but have the ability to renumber their sequence numbers before collisions occur.
So the goal is to perform a range query against the flows to find any flow with a last_seq_num_rcvd within a WINDOW_SIZE constant's distance of some given next sequence number.
I gather the BTreeMap is appropriate here for its range query ability.
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included};

const WINDOW_SIZE: u32 = 10;

struct FlowValue { /* All original fields, minus last_seq_num_rcvd, which now acts as the key */ }

let mut flows: BTreeMap<u32, FlowValue> = BTreeMap::new();

let query = 42;
for (k, v) in flows.range((Excluded(query), Included(query + WINDOW_SIZE))) {
    // This is how I would query for a flow
}
But now my key is something that changes often. It seems like there's no efficient way to update it in-place; it requires full deletion and reinsertion (under incremented key), which sounds like an expensive operation.
Is the BTreeMap method too expensive? Is there an alternative data structure that isn't? Or could I overload the BTreeMap to actually perform an efficient in-place increment of an integer key?
You're right that a B-Tree map is a little expensive for this application.
Since the window size is constant, a faster implementation would be to partition the sequence numbers into buckets of size about WINDOW_SIZE/2. Then just put the flows into a hash table according to their rcvd bucket.
To find flows for a particular packet, then, you only need to look up the 3 buckets that could possibly contain matching flows, and test each flow in the buckets. This will be faster than a B-Tree lookup.
On update, the situation is even better, because you only need to update the hash table when an entry changes buckets, and that only happens about once every WINDOW_SIZE/2 packets.
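Here is a rough sketch of that idea in Java (the question uses Rust, but the structure carries over directly). WINDOW_SIZE mirrors the question; the class and method names are illustrative only.

import java.util.*;

class FlowIndex {
    static final long WINDOW_SIZE = 10;
    static final long BUCKET = WINDOW_SIZE / 2;   // each bucket covers 5 consecutive sequence numbers

    static class Flow { long lastSeqNumRcvd; /* other fields omitted */ }

    private final Map<Long, List<Flow>> buckets = new HashMap<>();

    void put(Flow f) {
        buckets.computeIfAbsent(f.lastSeqNumRcvd / BUCKET, k -> new ArrayList<>()).add(f);
    }

    // Called when a flow's rcvd counter advances; the map only needs touching when
    // the flow crosses a bucket boundary, i.e. roughly once every BUCKET packets.
    void advance(Flow f, long newRcvd) {
        long oldBucket = f.lastSeqNumRcvd / BUCKET, newBucket = newRcvd / BUCKET;
        f.lastSeqNumRcvd = newRcvd;
        if (oldBucket != newBucket) {
            List<Flow> old = buckets.get(oldBucket);
            if (old != null) old.remove(f);
            put(f);
        }
    }

    // Find a flow with rcvd in (query, query + WINDOW_SIZE]: that range spans at
    // most three consecutive buckets, so only those are scanned.
    Flow find(long query) {
        for (long b = query / BUCKET; b <= (query + WINDOW_SIZE) / BUCKET; b++) {
            for (Flow f : buckets.getOrDefault(b, List.of())) {
                if (f.lastSeqNumRcvd > query && f.lastSeqNumRcvd <= query + WINDOW_SIZE) {
                    return f;
                }
            }
        }
        return null;
    }
}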

Does a "Pyramid List" data-structure already exist?

I am thinking about data structures which can be used in environments such as embedded/memory-constrained/filesystem code, and I came upon an idea for a list-like data structure which has O(1) {access, insert, pop} while also always having O(1) push (non-amortized), even if it can only be grown by a constant amount (e.g. 4 KiB) at a time. I cannot find an example of it anywhere and am wondering whether it already exists, and if so whether anyone knows of a reference implementation.
The basic structure would look something like this:
PyramidList contains:
a size_t numSlots
a size_t sizeSlots
a void **slots pointer to an array of pointers of length sizeSlots, holding pointers to values at indexes up to numSlots
The void **slots array has the following structure at each index. The slots are laid out such that 2^i = maxValues, where i is the slot index and maxValues is the maximum number of values that can exist at that index or below (i.e. the sum of the counts of all values up to and including that index):
index 0: contains a pointer directly to a single value (2^0 = 1)
index 1: contains a pointer directly to a single value (2^1 = 2)
index 2: contains a pointer to an array of two values (2^2 = 4)
index 3: contains a pointer to an array of four values (2^3 = 8)
index 4: contains a pointer to an array of eight values (2^4 = 16)
.. etc
index M: contains a pointer to an array of MAX_NUM_VALUES (2^M = MAX_NUM_VALUES*2)
index M+1: contains a pointer to an array of MAX_NUM_VALUES
index M+2: contains a pointer to an array of MAX_NUM_VALUES
etc
Now, suppose I want to access index i. I can use the BSR instruction to get the "power of 2" of the index. If it is less than the power of 2 of MAX_NUM_VALUES, then I have my slot directly. If it is larger than the power of 2 of MAX_NUM_VALUES, I can act accordingly (subtract and divide). Therefore I can look up the array/single value in O(1) time and then access the index I want in O(1) as well (see the sketch after the list below). Pushing to the PyramidList requires (at most):
allocating a new MAX_NUM_VALUES array and adding its pointer to slots
In some cases slots might not be able to hold it and would have to be grown as well, so this is only really always O(1) up to some limit, but that limit is likely to be extreme for the use cases here.
inserting the value into the proper index
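As an illustration of the index arithmetic described above, here is a small Java sketch of the slot/offset computation. MAX_NUM_VALUES and M are illustrative values, assumed to be powers of two as in the description, and 31 - numberOfLeadingZeros plays the role of BSR.

final class PyramidIndex {
    static final int MAX_NUM_VALUES = 1 << 9;  // e.g. a 4 KiB chunk of 8-byte entries = 512 values per full slot
    static final int M = Integer.numberOfTrailingZeros(MAX_NUM_VALUES) + 1; // so that 2^M = 2 * MAX_NUM_VALUES

    /** Returns {slot, offsetWithinSlot} for logical element index i. */
    static int[] locate(int i) {
        if (i == 0) return new int[] {0, 0};
        int pow = 31 - Integer.numberOfLeadingZeros(i);    // floor(log2(i)), i.e. BSR
        if (pow < M) {
            // Doubling region: slot k holds 2^(k-1) values (slot 1 holds one value, slot M holds MAX_NUM_VALUES).
            return new int[] {pow + 1, i - (1 << pow)};
        }
        // Capped region: every slot past M holds exactly MAX_NUM_VALUES values ("subtract and divide").
        int rest = i - (1 << M);
        return new int[] {M + 1 + rest / MAX_NUM_VALUES, rest % MAX_NUM_VALUES};
    }
}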
A few other benefits
Works great for (embedded/file-system/kernel/etc) memory managers that have a maximum alloc size (e.g. can only allocate 4 KiB chunks)
Works great when you truly don't know how large your vector is likely to be. Starts out extremely small and grows by known amounts
Always having (near) constant insertion may be useful for timing-critical interrupts/etc
Does not leave fragmented space behind when growing. Might be great for appending records into a file.
Disadvantages
Is likely less performant (amortized) than a contiguous vector in nearly every way (even insertion). Moving memory is typically less expensive than adding a dereference for every operation, so the amortized cost of a vector is still probably smaller.
Also, it is not truly always O(1), since the slots vector has to be grown when all the slots are full, but this only happens after currentNumSlots*2*MAX_NUM_VALUES values have been added since the last growth.
When you exceed the capacity of an array of size X, and so allocate a new array of size 2X, you can then incrementally move the X items from the old array into the start of the new array over the next X append operations. After that the old array can be discarded when the new array is full, just before you have to allocate a new array of size 4X.
Therefore, it is not necessary to maintain this list of increasing-size arrays in order to achieve O(1) appends (assuming that allocation is O(1)). Incremental doubling is a well-known technique in the de-amortization business, so I think most people desiring this sort of behaviour would turn to that first.
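A minimal sketch of that incremental-doubling scheme, in Java for concreteness (names illustrative): migrating one old element per push is the simplest rate that finishes the copy exactly when the new array fills up, so no single push ever pays for a full copy.

class DeamortizedVector<T> {
    private Object[] oldArr = new Object[0];
    private Object[] newArr = new Object[4];
    private int size = 0;      // total number of elements stored
    private int migrated = 0;  // how many elements have been moved out of oldArr so far

    void push(T value) {
        if (size == newArr.length) {             // newArr is full: start a new doubling cycle
            oldArr = newArr;
            newArr = new Object[oldArr.length * 2];
            migrated = 0;
        }
        if (migrated < oldArr.length) {          // move one old element per push; by the time newArr
            newArr[migrated] = oldArr[migrated]; // fills up, all of oldArr has been copied over
            migrated++;
        }
        newArr[size++] = value;
    }

    @SuppressWarnings("unchecked")
    T get(int i) {
        // Elements not yet migrated still live in oldArr; new appends live in newArr.
        return (T) (i < migrated || i >= oldArr.length ? newArr[i] : oldArr[i]);
    }
}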
Nothing like this is commonly used, because memory allocation can almost never be considered O(1). Applications that can't afford to copy a block all at once generally can't afford to use any kind of dynamic memory allocation at all.

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two days.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits per character. Even if you allow more characters, the average number of bits won't rise fast.
Third stage: Create a dictionary of usernames in memory. Use a 16-bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each sub-dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
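As a rough illustration of the second and third stages (in Java, with plain lists for brevity; the answer's point about contiguous memory blocks and compressed names still stands), the 16-bit prefix simply selects one of 65,536 sub-dictionaries, each of which is then sorted on its own:

import java.util.*;

class PrefixDictionarySketch {
    @SuppressWarnings("unchecked")
    private final List<byte[]>[] subDicts = new List[1 << 16]; // one sub-dictionary per 16-bit prefix

    void add(byte[] compressedName) {              // assumes names are at least two bytes long
        int prefix = ((compressedName[0] & 0xff) << 8) | (compressedName[1] & 0xff);
        if (subDicts[prefix] == null) subDicts[prefix] = new ArrayList<>();
        subDicts[prefix].add(compressedName);      // initially grouped only by the prefix
    }

    void sortAll() {
        for (List<byte[]> dict : subDicts) {
            if (dict != null) dict.sort(Arrays::compare);  // then sort each sub-dictionary in turn
        }
    }
}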
Given a deadline of two days, I would however dump some of this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.
Hash the keys, sort the hashes, and store sorted hashes in compressed form.
TL;DR
The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.
1. To each key, apply a hash function that maps keys to integers in the range [0..h]. It seems reasonably good to start with h = 2 * number_of_keys.
2. Fill all available memory with these hashes.
3. Sort the hashes.
4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it (once) to the list of duplicates. Keep both lists in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (such as an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge the new duplicates into it. (Steps 1..4 are sketched in code after this list.)
5. Repeat steps 1..4 while there are any unprocessed keys.
6. Read the keys several more times while performing steps 1..5, but ignore all keys that are not in the list of duplicates from the previous pass. Use a different hash function for each pass (for everything except matching against the list of duplicates from the previous pass, which means we need to sort hashes twice, for the two different hash functions).
7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
8. Allocate an array of 2 billion bits.
9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of the entropy decoder which makes it possible to avoid decoding the compressed stream from the very beginning.
10. Process the list of posts and update the array of 2 billion bits.
11. Read the keys once more to convert hashes back to keys.
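A compressed-to-essentials sketch of steps 1..4, in Java. The hash function is a stand-in for a real one (e.g. MurmurHash), the list names are illustrative, and the entropy-coding stage (FSE/ANS/range coder) is only indicated by a comment.

import java.util.*;

class HashPassSketch {
    static long hash(String key, long seed) {          // placeholder hash; a real one would mix much better
        long h = seed;
        for (int i = 0; i < key.length(); i++) h = h * 31 + key.charAt(i);
        return h;
    }

    /** One pass over a batch of keys; h is the hash range (about 2 * number of keys). */
    static void onePass(List<String> keys, long h, long seed,
                        List<Long> uniqueDeltas, List<Long> duplicates) {
        long[] hashes = keys.stream().mapToLong(k -> Math.floorMod(hash(k, seed), h)).toArray(); // step 1
        Arrays.sort(hashes);                            // step 3
        long prevUnique = 0;
        for (int i = 0; i < hashes.length; ) {          // step 4
            int j = i;
            while (j < hashes.length && hashes[j] == hashes[i]) j++;
            if (j - i == 1) {
                uniqueDeltas.add(hashes[i] - prevUnique); // unique: keep only the small delta to its predecessor
                prevUnique = hashes[i];
            } else {
                duplicates.add(hashes[i]);              // all copies removed, the value itself remembered once
            }
            i = j;
        }
        // In the real algorithm uniqueDeltas would now be fed to an entropy coder (FSE/ANS),
        // and duplicates would be merged with the duplicate list of earlier batches.
    }
}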
While the value h = 2*number_of_keys seems reasonably good, we could try to vary it to optimize space requirements (setting it too high decreases the compression ratio, setting it too low results in too many duplicates).
This approach does not guarantee a result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed, and it will most likely need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key contributes about 2..3 bits to the compressed stream).
To estimate space requirements precisely we would need either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we can use a partial implementation of steps 1..4. See it on Ideone. It uses a variant of the ANS coder named FSE (taken from https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed        10          20
Optimal h/n                    2.1         1.2
Bits per key                   2.98        2.62
Compressed MB                710.851     625.096
Uncompressed MB               40.474       3.325
Bitmap MB                    238.419     238.419
MB used                      989.744     866.839
Index entries              1'122'520   5'149'840
Indexed fragment size       1781.71     388.361
With the original OP limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). Required memory is slightly less than 1 GB, which allows pretty large indexes (so step 10 would not be very slow). Here lies a little problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB while performing the second pass. This may be fixed if we drop the results (unique hashes) of the first pass and recompute them later, together with step 7.
With the less tight limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and leaves more space for indexes (so that step 10 would be almost 5 times faster).
Loosening the limitation to 40 key scans does not result in any further improvement.
Minimal perfect hashing
Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.) For each user, allocate one bit as a marker: about 238 MB. So memory usage is around 667 MB. Then read the posts, calculate the hash for each user, and set the related bit if needed. Finally, read the user table again, calculate the hash, and check if the bit is set.
Generation
Generating the MPHF is a bit tricky, not because it is slow (it may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments. Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31.
For each user, calculate the hash code modulo 32 (or bitwise AND with 31). If this doesn't match the current segmentId, ignore the user. Otherwise calculate a 64-bit hash code (using a second hash function) and add that to the list. Do this until all users are read.
A segment will contain about 62.5 million keys (2 billion divided by 32), that is, about 477 MB of 64-bit hashes.
Sort this list by key (in place) to detect duplicates. With 64-bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used).
Now calculate the MPHF for this segment. The RecSplit algorithm is the fastest I know of; the CHD algorithm can be used as well, but it needs more space and is slower to generate.
Repeat until all segments are processed.
The above algorithm reads the user list 32 times. This could be reduced to about 10 if more segments are used (for example, one million), and as many segments are read per step as fit in memory. With smaller segments, fewer bits per key are needed due to the reduced probability of duplicates within one segment.
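An illustrative sketch of the segmented generation loop above, in Java. buildMphf stands in for a real RecSplit/CHD construction, both hash functions are placeholders, and a real implementation would use a primitive long[] rather than boxed Longs; only the scan/segment/sort/duplicate-check logic is shown.

import java.util.*;

class SegmentedMphfSketch {
    static final int SEGMENTS = 32;

    static void generate(Iterable<String> users) {
        for (int segmentId = 0; segmentId < SEGMENTS; segmentId++) {
            List<Long> hashes = new ArrayList<>();
            for (String user : users) {                        // one full scan of the user list per segment
                if ((firstHash(user) & (SEGMENTS - 1)) != segmentId) continue;
                hashes.add(secondHash(user));                  // 64-bit hash used inside the segment
            }
            Collections.sort(hashes);                          // sort in place to detect duplicates
            for (int i = 1; i < hashes.size(); i++) {
                if (hashes.get(i).equals(hashes.get(i - 1))) {
                    // placeholder: in practice, switch to another hash function and redo this segment
                    throw new IllegalStateException("collision: retry segment with another hash function");
                }
            }
            buildMphf(segmentId, hashes);                      // RecSplit/CHD construction would go here
        }
    }

    static long firstHash(String s)  { return s.hashCode() & 0xffffffffL; }           // placeholder hash
    static long secondHash(String s) { return 1469598103934665603L ^ s.hashCode(); }  // placeholder 64-bit hash
    static void buildMphf(int segment, List<Long> keys) { /* omitted */ }
}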
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
1. Sort the file of user names by name. You can do this easily enough with the GNU sort utility, or any other program that will sort files larger than what will fit in memory.
2. Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:
Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false
    }
}
// At the end of one of the streams. Clean up:
// if we reached the end of the posts but there are still users left,
// then output each remaining user name.
// The usedEmoji test matters only for the first time through,
// because the current user when the loop above ended might have had
// a post with an emoji.
while not at end of users file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}
// At this point, the names of all the users who haven't
// used an emoji in a post have been written to the output.
// at this point, names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in step 2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji. Basically, if a name appears in both files, you don't output it.

Generic solution to storing unknown size of data and performance

I have a text file with numbers, letters, special characters, and symbols. There are some lines where I want to insert an RLE (Right-to-Left Embedding, U+202B) Unicode control character at the beginning/middle/end of the line.
First I needed to find out how to catch and represent RLE. I thought of streams. I found out that RLE takes up 3 bytes in UTF-8: -30 -128 -85
InputStream input = new BufferedInputStream(new FileInputStream(file_name_here_with_path));
byte[] buffer = new byte[1024];
int bytesRead = input.read(buffer); // fills the array and returns how many bytes were actually read
If what the app reads included an RLE character, then when printing the array you'll get those 3 signed numbers.
Next problem, CURRENT PROBLEM, is to find a suitable container for this information.
input.read(): this returns the byte the app read (as an int, or -1 at the end of the stream). I can save it in a byte array, but I can't even create the array unless I know its size. No, the file size isn't the size of the array, because I need to insert those 3 bytes into the array more than once and at different locations depending on some conditions I set.
input.read(byte[] array): this returns an int representing the number of bytes that were read. The parameter will have all that info saved. Same problem as above: an array of fixed size.
input.read(byte[] array, offset, length): same as the previous one, only I can make it read from any point I want and for as long as I want, unlike the previous ones where it reads from the beginning to the end or until some exception is thrown.
Use BufferedReader: same problem. I read a line, save it in a string, turn the string into a byte array (stringname.getBytes()). Fixed size; can't insert.
The solution to all four methods is to create a new byte array and move the bytes from the old array to the new one while inserting the control character. THE PROBLEM, maybe, is that according to a member here, Javier, the read method is slow. I haven't received confirmation yet, as I wasn't sure if he meant one specific read or all three. Also, even if I knew how many extra slots I needed in the new array, would it be good practice to create the new array at such a size?
This reminds me: my txt file is 200 KB tops. It's not much, but I'm looking for the RIGHT practice. The generic solution!
Anyway, I looked for alternatives. I recall using vectors. Yes, they're obsolete. I don't know why, and since I'm not creating a big app or an app for a client, I can use vectors :P HOWEVER, I thought I should keep reading. Then I came across ArrayList and read a post here about how it performs better.
So... what will it be? The possibly slow read methods, or BufferedReader, or the obsolete Vector, or the fast ArrayList? :P
Vector was replaced by the faster ArrayList, with the caveats that ArrayList is not thread safe (but you can wrap it with Collections.synchronizedList to negate this) and that it has a different growth strategy (when a resize is needed, ArrayList grows by 50% while Vector basically doubles). Apart from this, they are pretty much the same.
Given your options, I would go with an ArrayList that holds Objects (ArrayList<Object>), as in this manner you can retain the original element type. Although, in such a case, you need to keep track of what type of element is in each index position (if that is necessary).
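If the goal is simply to read all the bytes into something growable and splice in the three UTF-8 bytes of the RLE mark wherever your conditions fire, a sketch along those lines could look like this (the file name and the insertion position are placeholders):

import java.io.*;
import java.util.*;

class RleInsertSketch {
    static final byte[] RLE = { (byte) 0xE2, (byte) 0x80, (byte) 0xAB }; // U+202B in UTF-8: -30 -128 -85

    public static void main(String[] args) throws IOException {
        List<Byte> bytes = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.txt"))) {
            int b;
            while ((b = in.read()) != -1) {
                bytes.add((byte) b);            // the list grows as needed; no fixed-size array
            }
        }
        int insertAt = 0;                       // wherever your conditions say the RLE mark belongs
        for (int i = RLE.length - 1; i >= 0; i--) {
            bytes.add(insertAt, RLE[i]);        // insert the three bytes, in order, at that position
        }
        // write the result back out...
    }
}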

What's a simple way to design a memoization system with limited memory?

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometimes the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.
I'd do it by creating a subclass of DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps corresponding to when the property was accessed the last 100 times, and the sixth column contains the variable size.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If it is not, it returns the value and stores the access timestamp. If it is empty, it performs the computation using the handle from column 2 inside a TIC/TOC statement to measure how expensive the computation is (the cost is stored in column 4, and the timestamp in column 5). Then it checks whether there is enough space for storing the result. If yes, it stores the data; otherwise it looks at the sizes, as well as the product of how many times each piece of data was accessed and how long it would take to regenerate, to decide what to cull.
In addition, there is an 'add' method that adds a row to the cell array, creates a dynamic property (using addprop) of the same name as the function handle, and sets its get-method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters with a set method that removes previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties that can tell how many properties there are (the number of rows in the cell array), what they're called (the first column of the cell array), etc.
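To make the culling rule above concrete, here is a small sketch of one way to combine those signals, written in Java rather than MATLAB (class and field names are illustrative): entries are scored by access count times recomputation cost divided by size, and the lowest-scoring ones are dropped first.

import java.util.*;

class CostAwareCacheSketch {
    static class Entry {
        Object data; long bytes; double computeSeconds; long accessCount;
        double score() { return accessCount * computeSeconds / bytes; } // rarely used, cheap to recompute, large => low score
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private long usedBytes = 0;
    private final long maxBytes;

    CostAwareCacheSketch(long maxBytes) { this.maxBytes = maxBytes; }

    void put(String key, Entry e) {
        while (usedBytes + e.bytes > maxBytes && !entries.isEmpty()) {
            String worst = null;
            double worstScore = Double.MAX_VALUE;
            for (Map.Entry<String, Entry> en : entries.entrySet()) {
                if (en.getValue().score() < worstScore) {
                    worstScore = en.getValue().score();
                    worst = en.getKey();
                }
            }
            usedBytes -= entries.remove(worst).bytes;   // cull the least valuable entry first
        }
        entries.put(key, e);
        usedBytes += e.bytes;
    }
}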
Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java (a minimal one is sketched after the links below):
Easy, simple to use LRU cache in java
http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
http://www.codeproject.com/KB/java/lru.aspx
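For reference, the standard-library route those links describe boils down to a few lines: a LinkedHashMap in access order whose removeEldestEntry hook evicts the least recently used entry. Capacity here is an entry count; weighting by memory footprint or recomputation cost, as the question asks for, would need a custom policy like the one sketched earlier.

import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves entries to the most-recently-used end
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // drop the least recently used entry once the cache exceeds capacity
    }
}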
