Redis Zset Locate record - data-structures

Why can Redis locate a record in a zset in log(n) time both by score and by key? Does Redis actually store two indexes for a zset?
I thought that if we have a skiplist that identifies a record by its key, we can only index by that key.
SkipNode
    key
        k1  # value
        k2  # score

|------------------------------->|                 (higher skip-list levels)
|-------------->|-------...------|
skipNode1 -> skipNode2 -> ... -> skipNodeN         (level 0: every node)

We can only locate a record by its key, in leftmost (k1, k2) order, so how can we index a record by k2 only?

Why can Redis locate a record in a zset in log(n) time both by score and by key?
The time complexity of searching by key (member) is O(1), and by score it is O(log(n)).
Does Redis actually store two indexes for a zset?
Yes, it has two indexes: a hash table index for the key (member), and a skip list index ordered by score.
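To illustrate the two-index idea only (this is not Redis's actual implementation; ZSet, zadd, zscore and zrangebyscore are just names made up for the sketch), here is a minimal Python version that pairs a dict keyed by member with a list of (score, member) pairs kept sorted via bisect. Redis uses a skip list for the score index, so its insert is O(log n), whereas insort on a Python list is O(n); the lookup pattern is the same.

import bisect

class ZSet:
    def __init__(self):
        self.member_score = {}   # member -> score: the "hash index", O(1) lookup by member
        self.by_score = []       # sorted list of (score, member): the "score index"

    def zadd(self, member, score):
        if member in self.member_score:
            old = (self.member_score[member], member)
            self.by_score.pop(bisect.bisect_left(self.by_score, old))
        self.member_score[member] = score
        bisect.insort(self.by_score, (score, member))

    def zscore(self, member):
        return self.member_score.get(member)          # O(1) via the hash index

    def zrangebyscore(self, lo, hi):
        # ("" sorts before any non-empty member, so this finds the first score >= lo)
        i = bisect.bisect_left(self.by_score, (lo, ""))
        out = []
        while i < len(self.by_score) and self.by_score[i][0] <= hi:
            out.append(self.by_score[i][1])
            i += 1
        return out

z = ZSet()
z.zadd("alice", 3.0)
z.zadd("bob", 1.0)
print(z.zscore("alice"))          # 3.0
print(z.zrangebyscore(0, 2))      # ['bob']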

Related

How to augment a skip list such that we can extract max value of a specific segment of the skiplist efficiently? [Skiplist not sorted by value]

I have a problem I'm struggling with.
I have a skiplist with elements:
element = (date, value)
The dates are the keys of the skiplist, and hence the skiplist is sorted by date.
How can I augment the skiplist such that the function
Max(d1, d2) -> returns the largest value between dates d1 and d2
is as efficient as possible?
The values are integers.
The most efficient way is to iterate over each item from d1 to d2 and select the maximum item. Because the skip list is ordered by date, you cannot assume anything about the order of values: they might as well be randomly ordered. So you'll have to look at each one.
So it's O(log n) (on average: this is a skip list, after all) to find d1, and then it's O(range) to find the maximum element, where range is the number of items between d1 and d2, inclusive.
How you'd implement this is to add a function to the skip list that will allow you to iterate the list starting at an arbitrary element. You almost certainly already have a function that will iterate over the entire list in order, so all you have to do is create a function that will iterate over a range of keys (i.e. from a start key to an end key).
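A minimal sketch of that approach, with a plain sorted list of (date, value) pairs standing in for the skip list and bisect providing the O(log n) search for d1 (max_in_range and entries are names made up for the example); the scan that follows is the O(range) part:

import bisect

def max_in_range(entries, d1, d2):
    # entries: list of (date, value) pairs sorted by date (our stand-in for the skip list)
    # Locate the first entry with date >= d1: O(log n)
    i = bisect.bisect_left(entries, (d1,))
    best = None
    # Scan forward until the date exceeds d2: O(range)
    while i < len(entries) and entries[i][0] <= d2:
        value = entries[i][1]
        best = value if best is None else max(best, value)
        i += 1
    return best

entries = [(1, 7), (3, 2), (5, 9), (8, 4)]
print(max_in_range(entries, 2, 8))   # 9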

fastest algorithm for sum queries in a range

Assume we have the following data, which consists of consecutive 0's and 1's (the nature of the data is that there are very, very few 1s):
data =
[0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
so a huge number of zeros, and then possibly some ones (which indicate that some sort of an event is happening).
You want to query this data many times. The query is that given two indices i and j what is sum(data[i:j]). For example, sum_query(i=12, j=25) = 2 in above example.
Note that you have all these queries in advance.
What sort of a data structure can help me evaluate all the queries as fast as possible?
My initial thoughts:
preprocess the data and obtain two shorter arrays: data_change and data_cumsum. The data_change array will be filled with the indices where a run of 1s starts and where the next run of 0s starts, and so on. The data_cumsum array will contain the corresponding cumulative sums up to the indices in data_change, i.e. data_cumsum[k] = sum(data[0:data_change[k]])
In above example, the preprocessing results in: data_change=[8,11,18,20,31,35] and data_cumsum=[0,3,3,5,5,9]
Then if query comes for i=12 and j=25, I will do a binary search in this sorted data_change array to find the corresponding index for 12 and then for 25, which will result in the 0-based indices: bin_search(data_change, 12)=2 and bin_search(data_change, 25)=4.
Then I simply output the corresponding difference from the cumsum array: data_cumsum[4] - data_cumsum[2]. (I won't go into the detail of handling the situation where an endpoint of the query range falls in the middle of a run of 1s, but those cases can be handled easily with an if-statement.)
With linear space, linear preprocessing, constant query time, you can store an array of sums. The i'th position gets the sum of the first i elements. To get query(i,j) you take the difference of the sums (sums[j] - sums[i-1]).
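A minimal sketch of that answer in Python (prefix_sums and query are names made up for the example). The sums array has length n+1 so that query(0, j) needs no special case, and the range here is inclusive of both endpoints:

def prefix_sums(data):
    # sums[k] = sum of the first k elements, so sums has length len(data) + 1
    sums = [0]
    for x in data:
        sums.append(sums[-1] + x)
    return sums

def query(sums, i, j):
    # inclusive range sum data[i..j] in O(1)
    return sums[j + 1] - sums[i]

data = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
sums = prefix_sums(data)
print(query(sums, 12, 25))   # 2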
I already gave an O(1) time, O(n) space answer. Here are some alternates that trade time for space.
1. Assuming that the number of 1s is O(log n) or better (say O(log n) for argument):
Store an array of ints representing the positions of the ones in the original array. So if the input is [1,0,0,0,1,0,1,1] then A = [0,4,6,7].
Given a query, use binary search on A for the start and end of the query in O(log(|A|)) = O(log(log n)). If the element you're looking for isn't in A, find the smallest bigger index and the largest smaller index. E.g., for query (2,6) you'd return the indices for the 4 and the 6, which are (1,2). Then the answer is one more than the difference (see the sketch after this answer).
2. Take advantage of knowing all the queries up front (as mentioned by the OP in a comment to my other answer). Say Q = (Q1, Q2, ..., Qm) is the set of queries.
Process the queries, storing a map of start and end indices to the query. E.g., if Q1 = (12,92) then our map would include {92 => Q1, 12 => Q1}. This takes O(m) time and O(m) space. Take note of the smallest start index and the largest end index.
Process the input data, starting with the smallest start index. Keep track of the running sum. For each index, check your map of queries. If the index is in the map, associate the current running sum with the appropriate query.
At the end, each query will have two sums associated with it. Add one to the difference to get the answer.
Worst case analysis:
O(n) + O(m) time, O(m) space. However, this is across all queries. The amortized time cost per query is O(n/m). This is the same as my constant time solution (which required O(n) preprocessing).
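Here is a minimal sketch of approach 1 in Python, using bisect over the positions of the ones (ones and count_ones are names made up for the example); the two binary searches count the ones in the inclusive range directly, instead of computing "one more than the difference":

from bisect import bisect_left, bisect_right

data = [1,0,0,0,1,0,1,1]
ones = [i for i, x in enumerate(data) if x == 1]    # A = [0, 4, 6, 7]

def count_ones(i, j):
    # number of ones in data[i..j] inclusive, via two binary searches: O(log |A|)
    return bisect_right(ones, j) - bisect_left(ones, i)

print(count_ones(2, 6))   # 2  (the ones at indices 4 and 6)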
I would probably go with something like this:
# boilerplate testdata
from itertools import chain, permutations
data = [0,0,0,0,0,0,0,1,1,1]
chained = list(chain(*permutations(data,5))) # increase 5 to 10 if you dare
Preprocessing:
frSet = frozenset([i for i in range(len(chained)) if chained[i]==1])
"Counting":
# O(min(len(frSet), len(frozenset(range(200,500)))))
summa = frSet.intersection(frozenset(range(200,500))) # use two sets for faster intersect
counted=len(summa)
"Sanity-Check"
print(sum([1 for x in frSet if x >= 200 and x<500]))
print(summa)
print(len(summa))
No edge cases needed, the intersection does all you need; memory is slightly higher since you store each index rather than ranges of ones. Performance depends on the set-intersection implementation.
This might be helpful: https://wiki.python.org/moin/TimeComplexity#set

what is meant by open addressing in collision handling?

Collisions occur in hashing, and there are different kinds of collision handling:
1) chaining
2) open addressing, etc.
What is meant by open addressing, and how is the index calculated and stored in open addressing?
A collision is the situation where the hashes of two or more data elements in the data set U map to the same location in the hash table. In such a situation two or more data elements would qualify to be stored/mapped to the same location in the hash table.
Open addressing, also called closed hashing, is a method of resolving collisions by probing, i.e. searching through alternate locations in the array until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.
In open addressing, while inserting, if a collision occurs, alternative cells are tried until an empty bucket is found; one of the following probing techniques is adopted to pick those cells.
There are many ways of probing: linear, quadratic, cuckoo hashing (which I have used in my project), double hashing.
Going deeper into what probing means, suppose we want to do insert and search operations in our hash table.
Insert:
When there is a collision we just probe or go to the next slot in the table.
If it is unoccupied – we store the key there.
If it is occupied – we continue probing the next slot.
Search:
If the key hashes to a position that is occupied and there is no match,
we probe the next position.
a) match – successful search
b) empty position – unsuccessful search
c) occupied and no match – continue probing.
When the end of the table is reached, the probing continues from the beginning,
until the original starting position is reached.
To add more to this: in open addressing we do not require an additional data structure to hold the data, whereas in closed addressing (chaining) the data is stored in a linked list whose head pointer is kept in the corresponding slot of the hash table.
The index is calculated using a hash function for each key. Let's say we want to insert into a hash table of size 20 using linear probing.
const int hashTableSize = 20;
std::string hashTable[hashTableSize];   // an empty string marks an unused slot

void insert(const std::string &s)
{
    // Compute the home index using the hash function
    int index = hashFunc(s);
    // Linear probing: while the slot is occupied, try the next one,
    // wrapping around to the start when the end of the table is reached
    while (hashTable[index] != "")
        index = (index + 1) % hashTableSize;
    hashTable[index] = s;
}
Quadratic probing is similar to linear probing; the difference is the step used in the probing sequence. In quadratic probing the i-th probe from the home slot is
index = (home + i*i) % hashTableSize
so the sequence of slots tried is
home, (home + 1) % hashTableSize, (home + 4) % hashTableSize, (home + 9) % hashTableSize, ...
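As a small illustration of that probe sequence (probe_slots is a name made up for this sketch), in Python:

def probe_slots(home, table_size, max_probes):
    # Quadratic probing: the i-th probe lands at (home + i*i) % table_size
    return [(home + i * i) % table_size for i in range(max_probes)]

print(probe_slots(7, 20, 5))   # [7, 8, 11, 16, 3]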

The complexity of the LRU cache algorithm

I have in front of me a task to implement an LRU cache, and the longest operation in the system should take O(log(n)). For my cache I use std::map. I still need a second container for storing key + creation time, sorted by time. When I need to access an entry in the cache, it should take roughly:
Find by key: O(log(n)).
Removal via an iterator: O(1).
Insert a new element: O(log(n)).
The oldest element must naturally reside at container.begin().
I can use only STL.
List - does not suit me.
Finding an element in a list is O(n).
Priority queue - deleting arbitrary items is not supported.
I think it could ideally be stored in a std::set;
std::set<pair<KEY, TIME>>;
Sort std::set:
struct compare
{
bool operator ()(const pair<K, TIME> &a, const pair<K, TIME> &b)
{
return a.second < b.second;
}
};
And to find a key in the std::set, I would write a function which searches by the first element of the pair in std::set<pair<KEY, TIME>>.
What do you think? Can anyone tell me if this meets my specified complexity requirements?
Yes, you can use a map plus a set to get O(log n) complexity for deleting/updating/inserting.
The map stores key, value pairs.
The set should store (time, key) in this order (you have done the opposite). When the cache is full and you want to remove a key, it will correspond to the element in the set that myset.begin() points to.
Having said that, you can improve performance by using a hash table + doubly linked list.
You can achieve O(1) complexity if you choose the proper data structures:
template<typename Key_t, typename Value_t>
class LruCache {
    ....
    using Order_t = std::list<Key_t>;
    Order_t m_order;
    std::unordered_map<Key_t, std::pair<typename Order_t::iterator, Value_t>> m_container;
};
m_order is a list. You need to add some elements at the beginning or at the end of the list (O(1)).
Removing an item from the list if you have an iterator to it: m_order.erase(it) - O(1).
Removing the least recently used key from the list: pop_front/pop_back - O(1).
When you need to find a key, use the unordered_map's find - O(1) on average.
When you find a key, you get the value, which holds the real value and, in addition, an iterator to the proper item in the list.
The whole complexity can then be O(1).
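For illustration only (the question above asks for C++ and the STL), here is a minimal Python sketch of the same hash-table-plus-doubly-linked-list idea using collections.OrderedDict, which combines both structures internally; LruCache and capacity are names made up for the example:

from collections import OrderedDict

class LruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()          # hash table + doubly linked list in one

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used: O(1)
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used: O(1)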

What's the best data structure for storing 2-tuples (a, b) which supports adding and deleting tuples and comparison (either on a or b)?

So here is my problem. I want to store 2-tuple (key, val) and want to perform following operations:
keys are strings and values are Integers
multiple keys can have same value
adding new tuples
updating any key with a new value (any new or updated value is greater than the previous one, like timestamps)
fetching all the keys with values less than or greater than given value
deleting tuples.
A hash seems to be the obvious choice for updating a key's value, but then lookups via values are going to take longer (O(n)). The other option is a balanced binary search tree with key and value switched. Now lookups via values will be fast (O(log n)), but updating a key will take O(n). So is there any data structure which can be used to address these issues?
Thanks.
I'd use two data structures: a hash table from keys to values, and a search tree ordered by value and then by key. When inserting, insert the pair into both structures; when deleting by key, look up the value from the hash and then remove the pair from the tree. Updating is basically delete + insert. Insert, delete and update are O(log n). For fetching all the keys less than a value, look up the value in the search tree and iterate backwards. This is O(log n + k).
The choices for good hash table and search tree implementations depend a lot on your particular distribution of data and operations. That said, a good general purpose implementation of both should be sufficient.
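A minimal sketch of that two-structure approach in Python, assuming the third-party sortedcontainers package is available to stand in for the balanced search tree (TupleStore and its method names are made up for the example):

from sortedcontainers import SortedList   # assumed available; plays the role of the search tree

class TupleStore:
    def __init__(self):
        self.by_key = {}              # key -> value (hash table)
        self.by_value = SortedList()  # (value, key) pairs ordered by value, then key

    def upsert(self, key, value):
        if key in self.by_key:
            self.by_value.remove((self.by_key[key], key))
        self.by_key[key] = value
        self.by_value.add((value, key))

    def delete(self, key):
        value = self.by_key.pop(key)
        self.by_value.remove((value, key))

    def keys_with_value_less_than(self, bound):
        # (bound,) sorts before any (bound, key), so this yields exactly the pairs
        # with value < bound, in ascending value order: O(log n + k)
        return [k for (v, k) in self.by_value.irange(maximum=(bound,), inclusive=(True, False))]

s = TupleStore()
s.upsert("a", 1); s.upsert("b", 2); s.upsert("a", 3)
print(s.keys_with_value_less_than(3))   # ['b']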
For a binary search tree, insert is O(log n) on average and O(n) in the worst case; the same holds for lookup. So this should be your choice, I believe.
Dictionary or Map types tend to be based on one of two structures.
Balanced tree (guarantee O(log n) lookup).
Hash based (best case is O(1), but a poor hash function for the data could result in O(n) lookups).
Any book on algorithms should cover both in lots of detail.
To provide operations both on keys and values, there are also multi-index based collections (with all the extra complexity) which maintain multiple structures (much like an RDBMS table can have multiple indexes). Unless you have a lot of lookups over a large collection the extra overhead might be a higher cost than a few linear lookups.
You can create a custom data structure which holds two dictionaries.
i.e.
a hash table from keys -> values and another hash table from values -> lists of keys.
class Foo:
    def __init__(self):
        self.keys = {}    # (KEY=key, VALUE=value)
        self.values = {}  # (KEY=value, VALUE=list of keys)
    def add_tuple(self, kd, vd):
        self.keys[kd] = vd
        if self.values.has_key(vd):
            self.values[vd].append(kd)
        else:
            self.values[vd] = [kd]
f = Foo()
f.add_tuple('a',1)
f.add_tuple('b',2)
f.add_tuple('c',3)
f.add_tuple('d',3)
print f.keys
print f.values
print f.keys['a']
print f.values[3]
print [f.values[v] for v in f.values.keys() if v > 1]
OUTPUT:
{'a': 1, 'c': 3, 'b': 2, 'd': 3}
{1: ['a'], 2: ['b'], 3: ['c', 'd']}
1
['c', 'd']
[['b'], ['c', 'd']]
