Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold on the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the values could vary widely: anything from a few bytes up to tens of MB. Still, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed by its features and am wondering why I can't find articles describing its use as a general key-value database. To me there seem to be many advantages to using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?

I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in the sorted order by key.
A limitation of the current implementation is that you need to specify the number of entries that will be stored in the map in advance. If the actual number isn't close to the configured number, you will overuse memory and disk (though not very severely on Linux systems). And if the actual number of entries exceeds the configured number by roughly 20% or more, operation performance starts to degrade, with the hit growing linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Related

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input; specifically, it will receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. The input data is totally random, and one of the requirements is that the application must not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high (and parallel) workloads, I would like to find a proper solution for keeping track of the duplicate entries; checking the whole log (text) file on every write is certainly not suitable. I can think of a solution that maintains some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input volume can be really high, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a huge bitmap of 10 billion bits in memory. However, this takes a lot of RAM: about 1.2 GiB. And since this data structure is big, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check if a value is already seen before, you can check the value of a bit in the bitmap then set it (atomically if done in parallel).
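A minimal single-threaded sketch of that bitmap approach in Python (the parallel, atomic variant is left out; the max_value parameter and the in-memory stream and log used in the last line are illustrative assumptions, not part of the original answer):
import io

def count_duplicates(input_stream, log_file, max_value):
    seen_bits = bytearray((max_value + 7) // 8)   # one bit per possible value
    duplicates = 0
    for value in input_stream:
        byte_index, bit_offset = divmod(value, 8)
        mask = 1 << bit_offset
        if seen_bits[byte_index] & mask:          # bit already set -> duplicate
            duplicates += 1
        else:
            seen_bits[byte_index] |= mask         # mark the value as seen
            log_file.write(f"{value}\n")          # log first occurrences only
    return duplicates

print(count_duplicates([5, 42, 5, 42, 7], io.StringIO(), max_value=1000))  # -> 2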
If you know that your stream contains fewer than about one million integers, or that the stream of random integers is not uniformly distributed, you can use a hash-set data structure instead, as it stores the data in a more compact way (at least in the sequential case).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()                      # Set of values seen so far
for value in inputStream:         # Iterate over the stream values
    if value not in seen:         # O(1) membership check
        log.write(f"{value}\n")   # First occurrence: write it to the log
        seen.add(value)           # O(1) insertion
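And for the Bloom-filter idea mentioned above, a self-contained sketch; the bit count, the number of hash functions, and the use of SHA-256 are illustrative assumptions that would need tuning to the expected number of elements and the target false-positive rate:
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from independently salted hashes.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely new"; True means "maybe seen" and must be
        # confirmed against an exact structure (set, bitmap, ...) if you
        # need deterministic results.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

bf = BloomFilter()
bf.add(123456789)
print(bf.might_contain(123456789))   # True
print(bf.might_contain(987654321))   # False with high probability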

Redis GEORADIUS with one ZSET versus a lot of ZSETs of particular size

What will work faster: one big ZSET with geodata that I query with GEORADIUS using a 100 m radius,
OR
a lot of ZSETs, where each ZSET is responsible for a 100 m x 100 m square (together covering the whole world) and is named after its 100 m square, like:
left_corner1_49_2440000_28_5010000
left_corner2_49_2450000_28_5010000
.......
and have all the points within the 100 meters to the right and below stored inside each set.
So when searching for the nearest point, I'll just drop the redundant digits from the GPS coordinates, e.g. 49.2440408, 28.5011694 becomes
49.2440000, 28.5010000; this way I'll know the name of the ZSET from which to get all the exact values with 100-meter precision.
Or, to put the question in a general form: how are ZSET names stored and accessed in Redis? If I have too many ZSETs, will it impact performance when accessing them?
A precise comparison of these approaches can only be done via a benchmark, and it would be specific to your dataset and configuration. But architecturally speaking, the pros and cons are:
BIG ZSET: less bandwidth and fewer operations (CPU cycles) to execute, no problems at square borders (a possible source of duplicates with many ZSETs), and you can gain throughput with sharding;
MANY ZSETs: lower latency for other operations (while a query against the big ZSET is running, other commands have to wait), and you can gain throughput with sharding AND lower latency with clustering.
As for the bottom-line question: I did not see your implementation code, but set names are stored as keys just like any other keys you use. This is what the Redis FAQ says about the number of keys:
What is the maximum number of keys a single Redis instance can hold? <...>
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance.
UPDATE:
Look at what the Redis docs say about GEORADIUS:
Time complexity: O(N+log(M)) where N is the number of elements inside
the bounding box of the circular area delimited by center and radius
and M is the number of items inside the index.
It means that items outside of your query contribute an O(log(M)) overhead to your query. So, roughly 17 hops for 10M items or 21 hops for 1B items, which is quite affordable. The remaining question is whether you will partition the data between nodes.
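For reference, a minimal sketch of the "many ZSETs" key scheme described in the question, i.e. deriving a bucket name by truncating coordinates to a grid. The key prefix, the 0.001-degree step, and the redis-py call shown in the comments are assumptions (0.001 degrees of longitude is only roughly 100 m at mid latitudes, so the real step would need adjusting):
import math

GRID_STEP = 0.001   # assumed grid step in degrees (~100 m, latitude-dependent)

def bucket_key(lat, lon):
    lat_bucket = math.floor(lat / GRID_STEP) * GRID_STEP
    lon_bucket = math.floor(lon / GRID_STEP) * GRID_STEP
    return f"grid:{lat_bucket:.7f}:{lon_bucket:.7f}"

print(bucket_key(49.2440408, 28.5011694))   # -> grid:49.2440000:28.5010000
# With redis-py one would then store the exact point in that bucket, e.g.
#   r.zadd(bucket_key(lat, lon), {f"{lat},{lon}": score})
# and a nearest-point query would also have to look at the neighbouring
# buckets, or points just across a grid border will be missed.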

Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1 GB of memory available; how can I find the duplicates among those URLs?
I saw one solution in the book "Cracking the Coding Interview". It suggests using a hash function to separate these URLs into 4000 files x.txt, where x = hash(u) % 4000, in the first scan. In the second scan, we can check for duplicates in each x.txt file separately.
But how can I guarantee that each file stores only about 1 GB of URL data? I think there's a chance that some files would store much more URL data than others.
My solution to this problem is to apply the file-separation trick iteratively until the files are small enough for the memory available to me.
Is there any other way to do it?
If you don't mind a solution which requires a bit more code, you can do the following:
Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control of the amount of memory that will be occupied by each chunk of hashcodes. You can also fit a lot more hashcodes in memory than URLs, so you will have fewer chunks.
Find the duplicate hashcodes. Presumably, they are going to be much fewer than 10 billion. They might even all fit in memory.
Go through the URLs again, recomputing hashcodes, seeing if a URL has one of the duplicate hashcodes, and then comparing actual URLs to rule out false positives due to hashcode collisions. (With 10 billion URLs, and with hashcodes only having 4 billion different values, there will be plenty of collisions.)
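A condensed sketch of those three steps in Python; for readability everything stays in memory here, whereas with 10 billion URLs the hashcode pass would be written out to disk in fixed-size chunks as described. The url_source callable (which must return a fresh iterator, since the data is scanned twice) and the use of CRC-32 as the 4-byte hashcode are assumptions:
import zlib
from collections import Counter, defaultdict

def hash32(url):
    return zlib.crc32(url.encode())              # 4-byte hashcode (step 1)

def find_duplicate_urls(url_source):
    counts = Counter(hash32(u) for u in url_source())          # step 1: hashcodes only
    suspect_hashes = {h for h, c in counts.items() if c > 1}   # step 2: duplicate hashcodes

    # Step 3: re-scan, keep only URLs whose hashcode is suspect, and compare
    # the URLs themselves to rule out false positives from collisions.
    seen_by_hash = defaultdict(set)
    duplicates = set()
    for u in url_source():
        h = hash32(u)
        if h in suspect_hashes:
            if u in seen_by_hash[h]:
                duplicates.add(u)
            else:
                seen_by_hash[h].add(u)
    return duplicates

urls = ["http://a.example", "http://b.example", "http://a.example"]
print(find_duplicate_urls(lambda: iter(urls)))   # -> {'http://a.example'}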
This is a bit long for a comment.
The truth is, you cannot guarantee that a file is going to be smaller than 1 Gbyte. I'm not sure where the 4,000 comes from. The total data volume is about 1,000 Gbytes, so the average file size would be 250 Mbytes.
It is highly unlikely that you would ever be off by a factor of 4 in size. Of course, it is possible. In that case, just split the file again into a handful of other files. This adds a negligible amount to the complexity.
What this doesn't account for is a simple case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? Ouch! In that case, you would need to read a file and "reduce" it by combining each value with a count.

LevelDB value size

I'm using LevelDB (Scala bindings) to store large (k, v) pairs on disk. While the keys are usually short strings, the values can be in the tens of MBs (an outlier could even be in the hundreds of MBs). It does not seem to do a good job of storing large values: my application runs into frequent full GCs and things get messy.
I don't see any limit on the value size in the documentation. Has anyone run into similar issues? Should I try breaking my values up into smaller chunks?
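If chunking turns out to be the way to go, here is a hedged sketch of the usual pattern: split each large value into fixed-size pieces stored under derived keys, plus a small manifest entry recording the chunk count. The "<key>#meta" / "<key>#<i>" key layout, the 1 MiB chunk size, and the plain dict standing in for the LevelDB handle are all assumptions; the same idea carries over to the Scala/Java bindings:
CHUNK_SIZE = 1 << 20   # 1 MiB per chunk (assumed; tune to your workload)

def put_chunked(store, key, value):
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)] or [b""]
    store[f"{key}#meta".encode()] = str(len(chunks)).encode()   # manifest: chunk count
    for i, chunk in enumerate(chunks):
        store[f"{key}#{i}".encode()] = chunk

def get_chunked(store, key):
    n = int(store[f"{key}#meta".encode()])
    return b"".join(store[f"{key}#{i}".encode()] for i in range(n))

store = {}   # stand-in for a LevelDB handle with put/get semantics
put_chunked(store, "big-blob", b"x" * (3 * CHUNK_SIZE + 123))
assert get_chunked(store, "big-blob") == b"x" * (3 * CHUNK_SIZE + 123)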

Why does LevelDB need more than two levels?

I think only two levels (level-0 and level-1) would be enough; why does LevelDB need level-2, level-3, and more?
I'll point you in the direction of some articles on LevelDB and its underlying storage structure.
The documentation for LevelDB discusses merges among levels:
These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
LevelDB is similar in structure to log-structured merge (LSM) trees. The LSM-tree paper discusses the different levels, if you're interested in the analysis; if you can get through the mathematics, it seems to be your best bet for understanding the data structure.
A much easier-to-read analysis of LevelDB talks about the datastore's relation to LSM trees, but regarding your question about the levels, all it says is:
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables.
Probably the LevelDB documentation provides the best answer: maximizing the size of the writes and reads, since LevelDB is on-disk (slow-seek) data storage.
Good Luck!
I think it is mostly to do with easy and quick merging of levels.
In LevelDB, level-(i+1) holds approximately 10 times the data of level-i. This is more analogous to a multi-level cache structure: if the database has 1000 records between keys x1 and x2, then 10 of the most frequently accessed ones in that range would be in level-1, 100 in the same range in level-2, and the rest in level-3 (this is not exact, just an intuitive picture of the levels). In this setup, to merge a file in level-i we need to look at, at most, about 10 files in level-(i+1); they can all be brought into memory, quickly merged, and written back. This results in reading relatively small chunks of data for each compaction/merge operation.
On the other hand, if you had just two levels, the key range of one level-0 file could potentially match thousands of files in level-1, and all of them would need to be opened up for merging, which is going to be pretty slow. Note that an important assumption here is that we have fixed-size files (say 2 MB). With variable-length files in level-1, your idea could still work, and I think a variant of that is used in systems like HBase and Cassandra.
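A back-of-the-envelope illustration of the merge-overlap argument above; all numbers (2 MB files, a 10x ratio between levels, 100 GB of data in the bottom level) are assumptions rather than LevelDB's actual defaults:
file_size_mb = 2
level_ratio = 10                 # each level holds ~10x the data of the one above it
bottom_level_mb = 100 * 1024     # ~100 GB in the largest level

# A fixed-size file in level-i spans roughly level_ratio times the key range of a
# file in level-(i+1), so compacting it touches about level_ratio files there:
print("files read per compaction step:", 1 + level_ratio)                        # the input file plus its overlaps
print("data read per compaction step: ~%d MB" % ((1 + level_ratio) * file_size_mb))

# With only two levels, one small batch of fresh keys can overlap almost every
# file of the single huge on-disk level:
print("two-level worst case:", bottom_level_mb // file_size_mb, "files")         # ~51200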
Now, if your concern is lookup delay with many levels: again, this is like a multi-level cache structure, where the most recently written data would be in the upper levels, which helps with typical locality of reference.
Level 0 is the data in memory; the other levels are on disk. The important part is that the data in each level is sorted. If level-1 consists of three 2 MB files, then (as an example) file1 holds keys 0..50 (sorted), file2 holds 150..200, and file3 holds 300..400. When the memory level is full, we need to push its data to disk in the most efficient manner possible, which is sequential writing (using as few disk seeks as possible). Imagine the memory level holds keys 60..120: great, we just write them out sequentially as a file that slots in as the new second file of level-1. Very efficient!
But now imagine that level-1 is much larger than level-0 (which is reasonable, as level-0 is memory). In this case there are many files in level-1, and our in-memory keys (60..120) overlap many of them, because the key ranges in level-1 are very fine-grained. Now, to merge level-0 into level-1, we need to read many files, make a lot of random seeks, build new files in memory, and write them out. This is where the many-levels idea kicks in: we have many layers, each somewhat larger than the previous one (about 10x), but not too much larger, so when we have to migrate data from layer i-1 to layer i, we have a good chance of reading the least possible number of files.
Finally, since data might change, there may be no need to propagate it down to the larger, more expensive layers (it might be overwritten or deleted first), and so we avoid expensive merges altogether. The data that does end up in the last level is statistically the least likely to change, so it is the best fit for the most-expensive-to-merge-with last layer.
