Hello, good people.
We have recently been faced with the following problem.
We need to store lots of (key, value) pairs, but the keys are always monotonic. You can consider each key a 64-bit integer, and we always insert data in increasing key order (you can think of this as some sort of time series).
Also, writes happen in batches (say, at 5-second intervals), but reads can be arbitrary, and read : write = 200 : 1.
What do you think would be the ideal option for such data? How can I configure RocksDB optimally?
Thanks for your help in advance.
RocksDB performs better when keys are inserted in increasing order. For example, you can encode your integers as big-endian strings and store those as the RocksDB keys.
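A minimal sketch of that encoding, assuming the RocksJava binding (org.rocksdb); the point is that big-endian bytes keep the numeric key order under RocksDB's default bytewise comparator:

import java.nio.ByteBuffer;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class TimeSeriesPut {
    // Big-endian encoding: lexicographic byte order matches numeric order
    // (for non-negative keys; flip the sign bit if you need the full signed range).
    static byte[] encodeKey(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array(); // ByteBuffer is big-endian by default
    }

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/ts-db")) {        // example path
            long key = System.currentTimeMillis();                   // monotonically increasing key
            db.put(encodeKey(key), "some value".getBytes());
        }
    }
}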
I am developing a piece of software that needs to check for duplicate small texts (normally less than 2 KB) using a pre-calculated signature (4 bytes). Currently I've implemented CRC32 (4 bytes) for this purpose, but I suspect that CRC32 would generate a lot of duplicate values. I know it is impossible to make it truly unique, but at least I want to minimize this probability.
-- UPDATE 1 --
NOTE: I cannot increase the size of the hash; it costs me a lot of storage. I am talking about more than 1,000,000 entries: for example, 1,000,000 * 4 bytes = 4,000,000 bytes. I cannot use MD5 because it takes 16 bytes!
-- UPDATE 2 --
I did not want to open up the whole problem, but now I have to.
My project is a dictionary engine that searches many independent databases to find the phrase the user asked for. All results must be prepared instantly (auto-complete feature). All text data is compressed, so I cannot decompress it to check for duplicated results; I have to store hash values of the compressed text in my index. So the hash bytes increase the index size and the disk I/O needed to read, decompress and decode index blocks (the index blocks are also compressed), and the hash values are generally incompressible. The design of this software forced me to compress everything to meet the users' needs (it is used in embedded systems). Now I want to remove duplicate texts from the search results using hash values, to avoid comparing (un)compressed texts (which is unreasonable in my case because of the disk I/O).
It seems we could design a custom checksum that meets these conditions. For example, store the text length in 2 bytes and generate a 2-byte checksum to check for possible duplicates?!
I appreciate any suggestions in advance.
-- UPDATE 3 --
After a lot of investigation, and using the information provided in the answers (thanks to all of you), I found that CRC32 is good enough in my case. I ran some statistical benchmarks on my generated CRCs; after checking for duplicate values, the result was satisfying.
Thanks to all of you.
I will up-vote all answers.
Without further knowledge about the small texts, the best you can hope for is that each hash value is equally probable and that most of the 2³² 4-octet values are used. Even then, you are more likely than not to have a collision with just about 77,000 texts, let alone a million. With a few exceptions (Adler32 coming to mind), well-known hash functions differ very little in collision probability. (They differ in how hard it is to produce collisions/given values on purpose, and in computation/circuit cost.)
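The ~77,000 figure is just the birthday bound for a 32-bit hash; a quick sanity check in plain Java (the class name is mine):

public class BirthdayBound {
    // Probability of at least one collision among n uniformly random 32-bit hashes:
    //   p(n) ~= 1 - exp(-n*(n-1) / (2 * 2^32))
    static double collisionProbability(long n) {
        double space = Math.pow(2, 32);
        return 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * space));
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(77_000));    // ~0.50, a coin flip
        System.out.println(collisionProbability(1_000_000)); // ~1.0, collisions are practically certain
    }
}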
Choose a compromise between collision probability and storage requirements.
For easily computed checksums, have a look at Fletcher's checksums - Adler32 is very similar, but has an increased collision probability with short inputs.
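If the 2-byte checksum mentioned in the question were pursued, a Fletcher-16 sketch would look like this (a minimal illustration only - with 2^16 possible values, collisions become likely after a few hundred texts):

public final class Fletcher16 {
    // Fletcher-16: two running sums modulo 255, combined into a 16-bit checksum.
    static int checksum(byte[] data) {
        int sum1 = 0, sum2 = 0;
        for (byte b : data) {
            sum1 = (sum1 + (b & 0xFF)) % 255;
            sum2 = (sum2 + sum1) % 255;
        }
        return (sum2 << 8) | sum1;
    }

    public static void main(String[] args) {
        System.out.printf("0x%04X%n", checksum("abcde".getBytes())); // prints 0xC8F0
    }
}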
If you get a hash collision, you have to check whether the texts are actually equal. The best approach would be to count how many times collisions happen, gather some statistics, and only optimize if it looks bad. One idea: build 2 different hash values, CRC32 and MD5 (or Luhn or whatever you like), and check for equality only if both hashes have the same values.
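A sketch of that double-check idea using the two checksums that ship with java.util.zip (CRC32 plus Adler32 here, standing in for the CRC32 + MD5 pair); note this costs 8 bytes per entry, which conflicts with the storage limit in Update 1, so treat it purely as an illustration:

import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class DoubleHash {
    // Pack CRC32 and Adler32 into one 64-bit fingerprint; treat two texts as
    // "possibly equal" only when both 32-bit values match.
    static long fingerprint(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        Adler32 adler = new Adler32();
        adler.update(data);
        return (crc.getValue() << 32) | adler.getValue();
    }

    static boolean possiblyEqual(byte[] a, byte[] b) {
        return fingerprint(a) == fingerprint(b); // a real comparison is still needed on a match
    }
}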
I did something very similar in one of my projects. I used something called a Bloom filter - you can watch how the whole thing works, and how to implement it, here. A Bloom filter massively reduces the chance of hash collisions thanks to its use of several hashing algorithms (it is even possible to simulate multiple hash functions using just one hashing function, but that is another topic). Try this out!! It worked for me and will work for you as well.
An actual working implementation of a bloom filter
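A minimal Bloom filter sketch in plain Java (the class and the sizing are made up for illustration); it simulates k hash functions from two base checksums, which is the trick alluded to above. With roughly 10 bits per element and k around 7, the false-positive rate lands near 1%:

import java.util.BitSet;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;  // number of bits
    private final int k;     // number of simulated hash functions

    public SimpleBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th hash from two base hashes: h_i = h1 + i * h2.
    private int index(byte[] data, int i) {
        CRC32 crc = new CRC32();
        crc.update(data);
        Adler32 adler = new Adler32();
        adler.update(data);
        long combined = crc.getValue() + (long) i * adler.getValue();
        return (int) Long.remainderUnsigned(combined, size);
    }

    public void add(byte[] data) {
        for (int i = 0; i < k; i++) bits.set(index(data, i));
    }

    public boolean mightContain(byte[] data) {
        for (int i = 0; i < k; i++) if (!bits.get(index(data, i))) return false;
        return true; // "maybe": false positives are possible, false negatives are not
    }
}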
I have a problem where I am going beyond the amount of RAM in my server. I need to reduce the database size so that I can still use Redis. My application is a massive key/value store where the keys are user-given text strings (directory/file paths). The values are very simple pointers to objects that I create, so it is an object store. The problem is that I have a petabyte of objects, where an object could be 100 KB. I can actually constrain the average object to be no less than 1 MB, so 10^15 / 10^6 = 10^9 objects. Since each object needs a key, that is 10^9, or 1G keys. If each key/value pair is 100 bytes, that is 100 GB in RAM. That almost fits in servers with 128 GB of RAM, but it is not the only thing going on in the server. I'd like to reduce the footprint if I can.
The question is what direction to go in. I tried compressing the input key, but in my testing the result was actually bigger than the original because it is such a short string and not a document. I have thought about using a different data store for smaller-sized files, let's say below 1G; that would reduce what I need to put into Redis. I have also thought about using a hash algorithm that intentionally overlaps and bins the keys, and then putting the hash deltas into the merged keys as values. If that is too confusing, here is a made-up example:
Key Hash
A 15gh2
B 15gh2
C 4Tgnx
I would then store in Redis:
V(15gh2) = A, B, A-Value=A-Object, B-Value=B-Object
V(4Tgnx) = C
There is probably a proper way to represent this algebraically, but I don't know how to do that. "A-Object" is my pointer to the A object. What I'm trying to do is end up with fewer keys, based on some posts I've read about keys being more expensive than Redis hash values (don't confuse a "Redis hash" with the "hash" algorithm). I have access to the full http://ieeexplore.ieee.org/ database to search for papers on this topic, but I'm not quite sure what I should be searching for in the query field. I tried things like "hash chain", but that appears to target encryption more than efficient database storage. Any solution ideas or paths for further research would be appreciated.
Update: As noted in the comments section, the values, or what I call "A-Object" and "B-Object", are encoded "pointers" that are paths to objects. These are actual files in an XFS filesystem. They can be encoded as simply as "1:6:2" to point to the path "/data/d0001/d0006/d0002". So a very short value, "1:6:2", is all that needs to be stored.
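A sketch of that bucketing idea, assuming the Jedis client (bucket count and key names are made up): instead of one top-level Redis key per path, many paths become fields of a single Redis hash, and the value stays the short encoded pointer. How much memory this saves also depends on Redis's hash encoding thresholds (hash-max-ziplist-entries and friends), so measure on your version:

import java.util.zip.CRC32;

import redis.clients.jedis.Jedis;

public class BucketedStore {
    private static final int NUM_BUCKETS = 1 << 16; // made-up figure; tune against memory usage

    private static String bucketKey(String path) {
        CRC32 crc = new CRC32();
        crc.update(path.getBytes());
        return "b:" + (crc.getValue() % NUM_BUCKETS); // e.g. "b:12345"
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String path = "/some/user/given/file";       // the user-supplied key
            jedis.hset(bucketKey(path), path, "1:6:2");   // short encoded pointer as the value
            System.out.println(jedis.hget(bucketKey(path), path)); // "1:6:2"
        }
    }
}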
The standard approach with this much data is to partition data across multiple servers.
See http://redis.io/topics/partitioning for advice on how to do that.
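A minimal sketch of client-side hash partitioning, the simplest of the schemes described on that page (server addresses are made up); if the set of servers can change, consistent hashing or Redis Cluster avoids reshuffling most keys:

import java.util.zip.CRC32;

public class Partitioner {
    private static final String[] SERVERS = { // made-up addresses
        "redis-1:6379", "redis-2:6379", "redis-3:6379", "redis-4:6379"
    };

    // Map a key to one of N Redis instances by hashing it.
    static String serverFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes());
        return SERVERS[(int) (crc.getValue() % SERVERS.length)];
    }
}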
How do I take a join of two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records keyed on the common key, add them in the reducer to, say, a HashMap, and then take a cross product. (e.g. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of two sets is impossible because in many cases the HashMap will end up holding a few million objects, so I encounter a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If it exists, merge the A record into the existing value (the B elements). If not, create the key and add the A elements as the value.
When running on a line of type B, look for the key in Cassandra. If it exists, merge the B record into the existing value (the A elements). If not, create the key and add the B elements as the value.
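A sketch of that merge logic, with a plain in-memory Map standing in for Cassandra (the JoinState type is made up): each input record only touches the state for its own key, so nothing like a full cross product of both datasets ever has to sit in one reducer's heap:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KvStoreJoin {
    // Per join key: the values accumulated so far from each side.
    static class JoinState {
        final List<String> aValues = new ArrayList<>();
        final List<String> bValues = new ArrayList<>();
    }

    // Stand-in for the external store (Cassandra in the approach above).
    private final Map<String, JoinState> store = new HashMap<>();

    // Called once per input line; side is 'A' or 'B'.
    void accept(char side, String key, String value) {
        JoinState state = store.computeIfAbsent(key, k -> new JoinState()); // "if not - create a key"
        if (side == 'A') state.aValues.add(value); // "merge A records into the existing value"
        else             state.bValues.add(value);
    }

    // After both inputs have been consumed, the join for one key is just that key's two lists.
    List<String[]> joined(String key) {
        List<String[]> out = new ArrayList<>();
        JoinState s = store.getOrDefault(key, new JoinState());
        for (String a : s.aValues)
            for (String b : s.bValues)
                out.add(new String[] { key, a, b });
        return out;
    }
}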
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys to spread the key distribution. This makes sure each reducer gets fewer records than it otherwise would. For example, if you were to suffix "1" to 50% of the occurrences of your key "K1" and "2" to the other 50%, you would end up with half the records on the reducer handling "K11" and the other half on the reducer handling "K12".
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
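A sketch of the salting described above (the salt count R and the separator are made up): the large, skewed side gets a random salt appended to its key, and the other side is replicated under every salt so that matching records still meet on some reducer:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SaltedKeys {
    private static final int R = 10; // number of salts, i.e. how far to spread a hot key
    private static final Random RANDOM = new Random();

    // Skewed (large) side: each record goes to exactly one of R salted keys.
    static String saltedKey(String key) {
        return key + "#" + RANDOM.nextInt(R); // e.g. "K1#7"
    }

    // Other side: replicate the record under every salted key so it is
    // co-located with whichever salt the large side happened to pick.
    static List<String> replicatedKeys(String key) {
        List<String> keys = new ArrayList<>(R);
        for (int i = 0; i < R; i++) keys.add(key + "#" + i);
        return keys;
    }
}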
I have an array of items that is sorted by a key value; items are retrieved by doing a binary search. A simplified version of the items would look something like this:
struct Item
{
    uint64_t key;
    uint64_t data;
};
I'm looking for ways to reduce the overhead of the key. The key value is not used for anything except searching. Assuming insert cost is not a concern, but retrieval cost is, what alternative data structure could I use to reduce the bookkeeping overhead to something less than 64 bits per item?
The only other "gotcha" is that I need to be able to detect the case where a key isn't present in the set.
One obvious possibility would be to treat your key as 8 individual bytes and build a trie out of them. This combines the common prefixes in your keys, so if you have (for example) a thousand Items with the same first byte, you only store that first byte once instead of a thousand times.
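A sketch of that byte trie, written in Java for brevity (the same layout carries over to C++): each level consumes one byte of the big-endian key, so shared prefixes are stored once, and a missing child immediately signals that the key is absent:

import java.util.HashMap;
import java.util.Map;

public class ByteTrie {
    private static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        long data;       // payload, meaningful only on leaves here
        boolean isLeaf;
    }

    private final Node root = new Node();

    public void put(long key, long data) {
        Node node = root;
        for (int shift = 56; shift >= 0; shift -= 8) {                // walk the 8 bytes, most significant first
            byte b = (byte) (key >>> shift);
            node = node.children.computeIfAbsent(b, x -> new Node());
        }
        node.isLeaf = true;
        node.data = data;
    }

    // Returns the data, or null if the key is absent (the "gotcha" from the question).
    public Long get(long key) {
        Node node = root;
        for (int shift = 56; shift >= 0; shift -= 8) {
            node = node.children.get((byte) (key >>> shift));
            if (node == null) return null;
        }
        return node.isLeaf ? node.data : null;
    }
}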
In order to be able to detect the absence of a key from your set, you need to store your keys in one way or another. Since the keys are random, you can't compress them into fewer than 64 bits using clever data structures. Ergo, the way you're doing it now is optimal in terms of memory consumption.
If there were some structure, or predictability, to the keys, it would be a different story.
If the "keys are basically random", then you don't have many options other than what you are using right now. For 64-bit integers you cannot even assume a dense set of keys.
Is there anything else about the keys that you can exploit? Maybe a lot of the keys are near each other... or something else? In that case you can build multi-level hash tables or tries for storing your data.
I am working on the parallelization of an algorithm, which roughly does the following:
1. Read several text documents with a total of 10k words.
2. Create an object for every word in the text corpus.
3. Create a pair between all word-objects (yes, O(n²)), and return the most frequent pairs.
I would like to parallelize step 3 by creating the pairs between the first 1000 word-objects and the rest on the first machine, between the second 1000 word-objects and the rest on the next machine, etc.
My question is how to pass the objects created in step 2 to the Mapper. As far as I am aware, I would need input files for this and hence would have to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/.
How about parallelizing all the steps? Use the text documents from step 1 as input to your Mapper and create the object for every word in the Mapper. In the Mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The Reducer can then count the unique pairs.
Hadoop will take care of bringing all the same keys together into the same Reducer.
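A sketch of that Mapper/Reducer against the standard org.apache.hadoop.mapreduce API, with plain whitespace tokenization; how a "word-object" is represented is up to you, so here the pair key is simply the two words joined with a separator, and pairs are formed within each input line (pairing across the whole corpus needs a different scope, but the Hadoop mechanics are the same):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordPairCount {
    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] words = line.toString().split("\\s+");
            // Emit every unordered pair of words with a count of 1.
            for (int i = 0; i < words.length; i++) {
                for (int j = i + 1; j < words.length; j++) {
                    String a = words[i], b = words[j];
                    String pair = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a; // canonical order
                    context.write(new Text(pair), ONE);
                }
            }
        }
    }

    public static class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(pair, new IntWritable(sum)); // frequency of this word pair
        }
    }
}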
Use Twitter's protobufs (elephant-bird). Convert each word into a protobuf object and process it however you want. Protobufs are also much faster and lighter than default Java serialization. Refer to Kevin Weil's presentation on this: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter