I have a giant hashmap in Redis that keeps growing: around 50k new entries per day, though on subsequent days this number will shrink as the same keys start to repeat. I want to cap this hashmap at 1 million key-value pairs, evicting based on LRU.
I know I could do this with a Redis sorted set, using the timestamp as the score and removing entries that fall out of range, but I need to retain the key-value pair structure. If I move to a sorted set I lose that structure, since the value would be the timestamp, and I would need to perform string operations on the key to achieve the equivalent hash functionality (not feasible).
So my requirements are:
- a key-value pair structure
- getting values for a given key, or for multiple keys
- trimming the structure to 1 million pairs with an LRU policy
Can I achieve this with a hash? I am also open to other suggestions. Thanks in advance.
Why not use both a HASH and a Sorted Set?
Save data in HASH
HSET KEY FIELD VALUE
With your data saved in a HASH, you can achieve both the "key-value pair structure" and "get values based on the given key or multiple keys" (HGET for one field, HMGET for several).
Implement LRU with Sorted Set
ZADD KEY TIMESTAMP FIELD
With a Sorted Set, you can save the timestamp as the score of a field. Each time you access a field in the HASH, update that field's score with the current timestamp.
If the number of members in the Sorted Set grows larger than a million (check with ZCARD), get the fields with the smallest scores (with ZRANGE), then remove those fields from both the HASH (HDEL) and the Sorted Set (ZREM).
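The pattern above can be sketched in plain Python, simulating the HASH with one dict and the Sorted Set with a field-to-score dict (the class and method names are illustrative; against a real Redis you would issue HSET/ZADD on writes, ZADD on reads, and ZCARD/ZRANGE/HDEL/ZREM when trimming). A monotonic counter stands in for the timestamp so eviction order is deterministic:

```python
from itertools import count

class LRUHash:
    """Simulates the HASH + Sorted Set pattern: values in one map,
    last-access scores (the sorted-set side) in another."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}        # stands in for the Redis HASH
        self.scores = {}        # stands in for the Sorted Set: field -> score
        self._clock = count()   # monotonic counter in place of a timestamp

    def set(self, field, value):
        self.values[field] = value               # HSET key field value
        self.scores[field] = next(self._clock)   # ZADD key timestamp field
        self._trim()

    def get(self, field):
        if field not in self.values:
            return None
        self.scores[field] = next(self._clock)   # refresh score on every access
        return self.values[field]                # HGET key field

    def _trim(self):
        # Over capacity (ZCARD)? Evict the lowest-scored fields (ZRANGE),
        # removing them from both structures (HDEL + ZREM).
        while len(self.scores) > self.capacity:
            lru_field = min(self.scores, key=self.scores.get)
            del self.scores[lru_field]
            del self.values[lru_field]
```

In real Redis the trimming step would typically run periodically, or inside a Lua script, so the HASH and Sorted Set stay consistent under concurrent access.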
I am reading Data Structures and Algorithms & Software Principles in C to try to wrap my head around some internals of data structures, and two things are really bothering me:
(1) How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
e.g.
Get Key, Value
use Hash algorithm on the key to find the index to try to put value into
if the slot is taken but there is no bucket yet (just a single entry), create a bucket, move the existing item into it, and then add the current value as well.
now I have a bucket with a bunch of values and a "lost and found" problem: you can't tell which value belongs to which key, because all the keys map to the same hash and the entries in the bucket carry no key to search by.
This would work if the bucket saved keys as well as values for each entry, but I am confused because I can't find a site that confirms that hash tables save keys along with the values for their entries.
(2) How do hash tables tell if the value at an index is the correct value for the key, or if probing found a collision and put it elsewhere.
eg.
Get Key, Value
hash the key to find index 0
the index is taken, so use a naive probing algorithm: linear-search until an empty slot is found (slot 1 is empty).
now I look up my key and the hash gives index 0. How does the hash table know that index 0 does not hold the correct item for this key, and that the item was probed into slot 1?
Again, this would make sense to me if the table saved the key as well as the value for each entry, but I am not sure whether hash tables save keys along with values, whether they have another way of ensuring that the item at the hashed index or bucket index is the correct one, or whether I am misunderstanding it.
To clarify the question: do hash tables save key along with value to disambiguate buckets and probe sequences or do they use something else to avoid ambiguity of hashes?
Sorry for the crudely formulated question but I just had to ask.
Thanks ahead of time.
Hash tables save the whole entry. An entry consists of the key and the value.
How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
Because a query is done by passing the key.
The purpose of hashing is to reduce the time needed to find the right bucket. The key is hashed to find that bucket. Then, once the candidates have been reduced from a total of N items to a very small n, you can simply perform a linear search to find the right item among all the keys that share the same hash.
How do hash tables tell if the value at an index is the correct value for the key, or if probing found a collision and put it elsewhere.
Again, that's because the hash table saves entries instead of just values. If, after a collision, the hash table sees that the key found at this slot is not the key being queried, it knows that a collision occurred earlier and that the key may be in the next slot. Note that in this case each slot stores a single entry, unlike the first answer, where a bucket may store a linked list or a tree of entries.
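A minimal open-addressing sketch in Python (the class name and fixed table size are illustrative; a real implementation would also handle resizing and deletion) shows why the stored key is what disambiguates probed slots:

```python
class ProbingTable:
    """Open addressing with linear probing. Each slot stores the whole
    (key, value) entry so a lookup can tell a probed-in neighbour
    from a direct hit."""

    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        # Probe forward until we find this key or an empty slot.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            stored_key, value = self.slots[i]
            if stored_key == key:   # compare keys, not just indices
                return value
            i = (i + 1) % len(self.slots)  # earlier collision: keep probing
        return None
```

With integer keys in CPython, `hash(0)` and `hash(8)` both map to slot 0 of an 8-slot table, so `put(8, ...)` is probed into slot 1; `get(8)` then rejects slot 0 precisely because the stored key there is 0, not 8.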
I am aware of how a hash table works, but I am not sure how get(key) might be implemented when multiple values are stored at the same index with the help of a linked list.
For example:
set(1,'Val1') gets stored at index 7
set(2,'Val2') also gets stored at index 7. (The internal implementation creates a linked list and stores a pointer at index 7; that part is understandable.)
But suppose I now call get(2). How does the hash table know which value to return? My hash function resolves this to index 7, but at index 7 there are two values.
One possible way is to store both the key and the value at each linked node.
Is there any other different implementation possible?
Go through the linked list and do a linear search for the key 2. The properties of the hash function and the hash table size should guarantee that these lists' lengths are O(1) on average.
I think you have misunderstood: hash tables do have to store their keys. The hash function is only for speeding up insertion/lookup.
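A minimal chained-bucket sketch in Python (hypothetical names, no resizing): each node in a bucket's chain holds the key alongside the value, and get(key) does exactly the linear search described above.

```python
class ChainedTable:
    """Separate chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def set(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision just grows this chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:              # linear search within the short chain
            if k == key:
                return v
        return None
```

With an 8-slot table, integer keys 1 and 9 land in the same bucket, and the lookup still distinguishes them because each chain node carries its key.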
"There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition." This is a line of a famous hadoop text book. I am not getting the entire meaning of the second part of it which says "but the records for any given key are all in a single partition." Is this mean that all the records for a single key should be in a single partition or something else.
but the records for any given key are all in a single partition
If you have one key, then that key and its associated value must sit in a single partition. Sometimes the value can be rather large, but this is a constraint on the size of a value: it must be small enough to fit in a single partition.
Note that there may be other constraints on both keys and values, depending on what you use for back-end storage; for example, a single key-value pair may be required to fit into a node's memory.
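The usual mechanism behind the guarantee is a deterministic partitioner: hash the key modulo the number of partitions, so every record with a given key lands in the same partition. A sketch in Python (illustrative names; Hadoop's default HashPartitioner does essentially this in Java):

```python
def partition(key, num_partitions):
    # Deterministic: the same key always maps to the same partition,
    # so all records for that key end up together.
    return hash(key) % num_partitions

# Records with repeated keys, assigned to 4 partitions.
records = [('apple', 1), ('banana', 2), ('apple', 3), ('apple', 4)]
partitions = {}
for key, value in records:
    partitions.setdefault(partition(key, 4), []).append((key, value))
```

All three 'apple' records are guaranteed to end up in the same partition, which is exactly what the textbook sentence is saying.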
I'm thinking the answer is no based on the experimentation I've done. However I wasn't sure if I was doing things correctly.
My function is:
select buyer_key,
       DBMS_UTILITY.get_hash_value(
           buyer_key || '|' || buyer_entity_id || '|' || buyer_io_id || '|' ||
           buyer_line_item_id || '|' || is_billing_enabled || '|' ||
           currency_id_b_trgt || '|' || currency_id_b_prfrd || '|' || ymdh_max,
           1,
           POWER(2, 16) - 1
       ) as hashvalue
  from network_buyer_dim
 order by hashvalue asc;
When I run it, it returns numerous rows with duplicate hash values. But when I go to the database and look at those rows (by the way, each buyer_key is unique), I see that the rows DO NOT contain the same values.
Am I calling the function correctly?
Obviously NOT!!
A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length. For example, a person's name, having a variable length, could be hashed to a single integer. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
This means that if the input domain is larger than the output domain, there must be duplicates (the pigeonhole principle).
In addition, the best hash functions are considered to be those that distribute collisions evenly, i.e. each output value is produced by roughly the same number of possible inputs.
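You can see the pigeonhole effect directly: hash more distinct inputs than there are possible outputs and duplicates are guaranteed. A sketch in Python, using the same 16-bit output range as the POWER(2,16)-1 bound in the query above (md5 stands in for DBMS_UTILITY.get_hash_value; the input strings are made up):

```python
import hashlib

def hash16(s):
    # Map an arbitrary string into 1..65535, a 16-bit output domain,
    # mimicking get_hash_value(s, 1, POWER(2,16)-1).
    digest = hashlib.md5(s.encode()).digest()
    return int.from_bytes(digest[:4], 'big') % (2**16 - 1) + 1

# 100,000 distinct inputs into at most 65,535 outputs: collisions are forced.
values = [hash16(f'buyer_{i}') for i in range(100_000)]
distinct = len(set(values))
```

No matter how good the hash function is, `distinct` can never reach 100,000 here, which is why the query legitimately returns duplicate hash values for rows that are not duplicates.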
Is there any data structure in which locating an item is independent of the volume of data in it?
"locating a data is independent of volume of data in it" - I assume this means O(1) for get operations. That would be a hash map.
This presumes that you fetch the object by its key, i.e. by its hash.
If you have to check each element to see whether an attribute matches a particular value - like your rson or ern or any other part of it - then you have to make that value the key up front.
If you have several values that you need to search on - all of them must be unique and immutable - you can create several maps, one for each value. That lets you search on more than one, but they all have to be unique, immutable, and known up front.
If you don't establish the key up front, lookup is O(N), which means you have to check every element in turn until you find what you want. On average, this time increases as the size of the collection grows; that is what O(N) means.
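The "several maps" idea can be sketched in Python (the field names rson and ern are borrowed from the question; the record contents are made up): index the same records once per attribute you need O(1) lookups on.

```python
records = [
    {'rson': 'R1', 'ern': 'E9', 'name': 'first'},
    {'rson': 'R2', 'ern': 'E7', 'name': 'second'},
]

# One map per searchable attribute. Both maps point at the same record
# objects, so memory cost is per-index, not per-copy. The indexed values
# must be unique and must not change after indexing.
by_rson = {r['rson']: r for r in records}
by_ern = {r['ern']: r for r in records}
```

Looking up `by_rson['R2']` or `by_ern['E7']` is O(1) on average, whereas scanning `records` for a matching attribute is O(N).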