Check for duplicate input items in a data-intensive application - algorithm

I have to build a server-side application that will receive a stream of data as input; specifically, it will receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. The input data is totally random, and one of the requirements is that the application must not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high (and parallel) workloads, I would like to find a proper solution to keep track of the duplicate entries; checking the whole log (text) file on every write is certainly not a suitable solution. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but since the input can be really large, I don't think that is the best way to do it either...
Any idea?

Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap with one bit per possible value: about 10^9 bits (roughly 120 MiB of RAM) for non-negative nine-digit integers, or twice that if negative values are allowed. Since this data structure is much larger than the CPU caches, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using atomic logical operations.
To check whether a value has been seen before, test its bit in the bitmap and then set it (atomically, if done in parallel).
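A minimal sketch of the bitmap approach in sequential Python, assuming non-negative values below 10**9; inputStream and log are placeholders for the integer source and the log file, a bytearray stands in for the bitmap, and a parallel version would need atomic bit operations that plain Python does not provide:

NUM_VALUES = 10**9                        # nine decimal digits: values 0..999_999_999
bitmap = bytearray(NUM_VALUES // 8 + 1)   # ~120 MiB, one bit per possible value
duplicates = 0

for value in inputStream:                 # inputStream is assumed to yield ints
    byte_index, mask = value >> 3, 1 << (value & 7)
    if bitmap[byte_index] & mask:         # bit already set: duplicate
        duplicates += 1
    else:
        bitmap[byte_index] |= mask        # mark the value as seen
        log.write(f"{value}\n")           # first occurrence: write it to the log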
If you know that your stream contains fewer than, say, a million distinct integers, or that the integers are not uniformly distributed, you can use a hash-set data structure instead, as it stores the data in a more compact way (in the sequential case).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with an exact structure if you want deterministic results).
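For illustration, here is a sketch of a Bloom filter used as a cheap pre-filter in front of an exact structure; the sizing, the blake2b-based hashing, and the inputStream/log placeholders are assumptions, and the exact "seen" structure (a plain set here) could just as well live on disk:

import hashlib

class BloomFilter:
    """A tiny illustrative Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, item):
        # False means "definitely never added"; True means "maybe added".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))

bloom = BloomFilter(num_bits=8 * 10**7, num_hashes=4)  # illustrative sizing
seen = set()                                           # exact structure (could be on disk)
for value in inputStream:
    if bloom.might_contain(value) and value in seen:   # exact check removes false positives
        continue                                       # duplicate: skip it
    bloom.add(value)
    seen.add(value)
    log.write(f"{value}\n")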
Here is an example using hash-sets in Python:
seen = set()                    # set of all values seen so far (not only the duplicates)
duplicates = 0                  # number of duplicate items, for periodic reporting
for value in inputStream:       # iterate over the stream of values
    if value not in seen:       # O(1) average-case lookup
        log.write(f"{value}\n") # value not seen before: write it to the log
        seen.add(value)         # O(1) average-case insertion
    else:
        duplicates += 1         # duplicate: count it instead of writing it

Related

External sorting when indices can fit in RAM

I want to sort a multi-TB file full of 20 KB records. I only need to read a few bytes from each record in order to determine its order, so I can sort the indices in memory.
I cannot fit the records themselves in memory, however. Random access is slower than sequential access, and I don't want random-access writes to the output file either. Is there any algorithm known that will take advantage of the sorted indices to "strategize" the optimal way to re-arrange the records as they are copied from the input file to the output file?
There are "reorder array according to sorted index" algorithms, but they involve random access. Even on an SSD, where random access itself is not an issue, reading or writing one record at a time has lower throughput than reading or writing multiple records at a time, which is what an external merge sort typically does.
For a typical external merge sort, the file is read in "chunks" small enough for an internal sort to sort each "chunk", and the sorted "chunks" are written to external media. After this initial pass, a k-way merge is done on the "chunks", multiplying the size of the merged "chunks" by k on each merge pass, until a single sorted "chunk" is produced. The read/write operations can transfer multiple records at a time. Say you have 1 GB of RAM and use a 16-way merge. For a 16-way merge, 16 "input" buffers and 1 "output" buffer are used, so the buffer size could be 63 MB (1 GB / 17, rounded down a bit to leave room for variables), which would allow about 3150 of the 20 KB records to be read or written at a time, greatly reducing random access and command overhead. Assuming the initial pass creates sorted chunks of 0.5 GB, after 3 (16-way) merge passes the chunk size is 2 TB, after 4 passes it is 32 TB, and so on.
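As an illustration only, here is a minimal sketch of one such k-way merge pass in Python, assuming fixed-size 20 KB records whose sort key is a known-length prefix; the file names, key length, and batch size are assumptions, not part of the answer above:

import heapq

RECORD_SIZE = 20 * 1024          # 20 KB records, as in the question
KEY_LEN = 16                     # assume the sort key is the first 16 bytes of a record
BATCH_RECORDS = 3150             # ~63 MB per read, matching the sizing above

def read_records(path):
    """Yield records one at a time, but read them from disk in large batches."""
    with open(path, "rb") as f:
        while True:
            batch = f.read(RECORD_SIZE * BATCH_RECORDS)
            if not batch:
                break
            for i in range(0, len(batch), RECORD_SIZE):
                yield batch[i:i + RECORD_SIZE]

def merge_chunks(chunk_paths, out_path):
    """Merge k sorted chunk files into one sorted output file."""
    streams = [read_records(p) for p in chunk_paths]
    with open(out_path, "wb") as out:
        for record in heapq.merge(*streams, key=lambda r: r[:KEY_LEN]):
            out.write(record)    # writes stay sequential; buffering batches them

# Example: one 16-way pass over chunk files produced by the initial sort pass.
# merge_chunks([f"chunk_{i:02d}.bin" for i in range(16)], "merged_00.bin")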

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys will be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of MBs. Yet, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in sorted order by key.
A limitation of the current implementation is that you need to specify the number of entries that are going to be stored in the map in advance. If the actual number isn't close to the specified number, you are going to overuse memory and disk (not very severely, though, on Linux systems); and if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, with the performance hit growing linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1 GB of memory available; how can I find the duplicates among those URLs?
I saw one solution in the book "Cracking the Coding Interview": it suggests using a hash function to separate these URLs into 4000 files x.txt, where x = hash(u) % 4000, in a first scan. In a second scan, we can then check for duplicates within each x.txt file separately.
But how can I guarantee that each file stores about 1 GB of URL data? I think there's a chance that some files will store much more URL data than others.
My solution to this problem is to apply the file-separation trick iteratively until the files are small enough for the memory available to me.
Is there any other way to do it?
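For illustration, here is a rough sketch of that iterative splitting idea, assuming one URL per line; the fan-out, file names, hash, and size limit are assumptions, and note that a bucket dominated by a single repeated URL will never shrink this way (a point raised in the answers below):

import os
import zlib

MAX_BUCKET_BYTES = 1 << 30      # ~1 GB: what we can dedupe in memory afterwards
FANOUT = 256                    # smaller than the book's 4000, to stay under
                                # typical open-file-handle limits

def partition(path, depth=0):
    """Hash-partition a file of URLs; re-partition any bucket that is still too big."""
    buckets = [open(f"{path}.{depth}.{i}", "w", encoding="utf-8") for i in range(FANOUT)]
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            h = zlib.crc32(f"{depth}:{line}".encode())  # vary the hash at each depth
            buckets[h % FANOUT].write(line)
    for b in buckets:
        b.close()
    for i in range(FANOUT):
        name = f"{path}.{depth}.{i}"
        if os.path.getsize(name) > MAX_BUCKET_BYTES:
            partition(name, depth + 1)                  # still too big: split again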
If you don't mind a solution which requires a bit more code, you can do the following:
1. Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control of the amount of memory that will be occupied by each chunk of hashcodes. You can also fit a lot more hashcodes in memory than URLs, so you will have fewer chunks.
2. Find the duplicate hashcodes. Presumably, they are going to be much fewer than 10 billion. They might even all fit in memory.
3. Go through the URLs again, recomputing the hashcodes, checking whether a URL has one of the duplicate hashcodes, and then comparing the actual URLs to rule out false positives due to hashcode collisions. (With 10 billion URLs, and with hashcodes having only 4 billion different values, there will be plenty of collisions.)
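A rough sketch of those three steps in Python, assuming a url_stream() callable that re-reads the URL list from disk on each pass; the CRC32 hashcode and the helper names are assumptions, and at the stated scale the hashcode counting in step 1 would itself have to be done in chunks or externally:

import zlib
from collections import Counter, defaultdict

def hash32(url):
    return zlib.crc32(url.encode("utf-8"))     # exactly 4 bytes per URL

def find_duplicate_urls(url_stream):
    # Steps 1 and 2: count hashcodes only, then keep those seen more than once.
    counts = Counter(hash32(u) for u in url_stream())
    suspects = {h for h, c in counts.items() if c > 1}

    # Step 3: re-read the URLs and compare the actual strings for suspect
    # hashcodes, to rule out false positives caused by hash collisions.
    seen_by_hash = defaultdict(set)
    duplicates = set()
    for u in url_stream():
        h = hash32(u)
        if h in suspects:
            if u in seen_by_hash[h]:
                duplicates.add(u)
            else:
                seen_by_hash[h].add(u)
    return duplicates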
This is a bit long for a comment.
The truth is, you cannot guarantee that a file is going to be smaller than 1 Gbyte. I'm not sure where the 4,000 comes from. The total data volume is about 1,000 Gbytes, so the average file size would be 250 Mbytes.
It is highly unlikely that you would ever be off by a factor of 4 in size. Of course, it is possible. In that case, just split the file again into a handful of other files. This adds a negligible amount to the complexity.
What this doesn't account for is a simple case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? Ouch! In that case, you would need to read a file and "reduce" it by combining each value with a count.
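A minimal sketch of that "reduce" step for one over-full bucket file, assuming one URL per line (the file names are hypothetical); a bucket dominated by a single hot URL collapses to roughly one output line:

from collections import Counter

counts = Counter()
with open("bucket_0042.txt", "r", encoding="utf-8") as f:           # hypothetical bucket file
    for url in f:
        counts[url.rstrip("\n")] += 1                                # combine each value with a count

with open("bucket_0042.reduced.txt", "w", encoding="utf-8") as out:
    for url, count in counts.items():
        out.write(f"{count}\t{url}\n")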

LevelDB value size

I'm using LevelDB (Scala bindings) to store large (k, v) pairs on disk. While the keys are usually short strings, the values can be in the tens of MBs (an outlier could even be in the hundreds of MBs). It does not seem to be doing a good job of storing large values: my application runs into frequent full GCs and things get messy.
I don't see any limit on the value size in the documentation. Does anyone know of similar issues? Should I try breaking up my value into smaller chunks?
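One way to break values up is to store each large value as a chunk count plus numbered chunk entries under derived keys. A minimal sketch of that idea, assuming a generic store object exposing put(key: bytes, value: bytes) and get(key) -> bytes; the chunk size and key scheme are assumptions, not LevelDB specifics:

CHUNK_SIZE = 1 * 1024 * 1024                 # assumed 1 MB chunks

def put_large(db, key: bytes, value: bytes) -> None:
    """Store a large value as a chunk count plus numbered chunk entries."""
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]
    db.put(key + b":n", str(len(chunks)).encode())
    for i, chunk in enumerate(chunks):
        db.put(key + b":" + str(i).encode(), chunk)

def get_large(db, key: bytes) -> bytes:
    """Reassemble a value stored by put_large."""
    n = int(db.get(key + b":n"))
    return b"".join(db.get(key + b":" + str(i).encode()) for i in range(n))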

Why does LevelDB need more than two levels?

I think only two levels (level-0 and level-1) would be OK; why does LevelDB need level-2, level-3, and more?
I'll point you in the direction of some articles on LevelDB and its underlying storage structure.
The documentation for LevelDB discusses merges among levels:
These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
LevelDB is similar in structure to Log-Structured Merge Trees. The LSM-tree paper discusses the different levels, if you're interested in the analysis; if you can get through the mathematics, it seems to be your best bet for understanding the data structure.
A much easier-to-read analysis of LevelDB discusses the datastore's relation to LSM trees, but regarding your question about the levels, all it says is:
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables.
Probably the LevelDB documentation provides the best answer: maximizing the size of reads and writes, since LevelDB is on-disk (slow-seek) data storage.
Good Luck!
I think it is mostly to do with easy and quick merging of levels.
In LevelDB, level-(i+1) has approximately 10 times the data of level-i. This is analogous to a multi-level cache structure where, if the database has 1000 records between keys x1 and x2, then 10 of the most frequently accessed ones in that range would be in level-1, 100 in the same range would be in level-2, and the rest in level-3 (this is not exact, but it gives an intuitive idea of the levels). In this setup, to merge a file in level-i we need to look at roughly 10 files in level-(i+1) at most; they can all be brought into memory, a quick merge done, and the result written back. This results in reading relatively small chunks of data for each compaction/merging operation.
On the other hand, if you had just 2 levels, the key range in one level-0 file could potentially match thousands of files in level-1, and all of them would need to be opened up for merging, which is going to be pretty slow. Note that an important assumption here is that we have fixed-size files (say 2 MB). With variable-length files in level-1, your idea could still work, and I think a variant of that is used in systems like HBase and Cassandra.
Now, if your concern is lookup delay with many levels: again, this is like a multi-level cache structure; the most recently written data will be in the higher levels, which helps with typical locality of reference.
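A back-of-the-envelope sketch of that argument, assuming fixed-size files whose key ranges evenly partition the same key space at every level (the file counts are illustrative):

def files_touched_per_merge(files_in_level_i, files_in_next_level):
    """Roughly how many next-level files one level-i file's key range overlaps."""
    # One level-i file covers about 1/files_in_level_i of the key space, so it
    # overlaps about that fraction of the next level's files (plus one for edges).
    return files_in_next_level // files_in_level_i + 1

# With a 10x fan-out (e.g. 10, 100, 1000 files), each merge touches ~11 files:
print(files_touched_per_merge(10, 100))     # 11
print(files_touched_per_merge(100, 1000))   # 11

# With only two levels holding the same data (one level-0 file vs ~1110 files),
# a single merge has to read on the order of a thousand files:
print(files_touched_per_merge(1, 1110))     # 1111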
Level 0 is data in memory; the other levels are disk data. The important part is that the data in each level is sorted. If level-1 consists of three 2 MB files, then (as an example) file1 holds keys 0..50 (sorted), file2 holds 150..200, and file3 holds 300..400. So when the memory level is full, we need to write its data to disk in the most efficient manner, which is sequential writing (using as few disk seeks as possible). Imagine that in memory we have keys 60..120: great, we just write them out sequentially as a file, which becomes the new file2 in level-1. Very efficient!
But now imagine that level-1 is much larger than level-0 (which is reasonable, as level-0 is memory). In this case there are many files in level-1, and our keys in memory (60..120) now overlap many files, because the key ranges in level-1 are very fine-grained. To merge level-0 with level-1 we would need to read many files, make a lot of random seeks, build new files in memory, and write them out. So this is where the many-levels idea kicks in: we have many layers, each somewhat larger than the previous one (about 10x), but not much larger, so that when we have to migrate data from layer i-1 to layer i we have a good chance of having to read the smallest possible number of files.
Now, since data might change, there may be no need to propagate it to the higher, more expensive layers (it might be overwritten or deleted first), and so we avoid expensive merges altogether. The data that does end up in the last level is statistically the least likely to change, so it is the best fit for the most-expensive-to-merge-with last layer.
