Interval tree with metadata about intervals - data-structures

Is there any existing data structure that acts like an interval tree but stores metadata about the intervals? E.g. if you have a list of intervals that represent when specific people spend time in a store, how would you create an interval tree that you could efficiently query for a particular time to get a list of the people in the store at that time?
HashSet -> HashMap
Interval tree -> ?
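One way to read the analogy: the interval tree's nodes simply carry a payload alongside each interval, the same way a HashMap carries a value alongside each key. Here is a minimal Python sketch of that idea (all names are invented for illustration; a real interval tree would answer the stabbing query in O(log n + k), whereas this sorted-list stand-in may scan every interval whose start precedes the query time):

```python
import bisect

class IntervalStore:
    """Intervals with payloads, kept sorted by start time."""

    def __init__(self):
        self._starts = []   # sorted start times, parallel to _items
        self._items = []    # (start, end, payload) tuples

    def add(self, start, end, payload):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._items.insert(i, (start, end, payload))

    def stab(self, t):
        """Return payloads of all intervals containing time t."""
        # Only intervals starting at or before t can contain t.
        cut = bisect.bisect_right(self._starts, t)
        return [p for s, e, p in self._items[:cut] if e >= t]

store = IntervalStore()
store.add(9, 12, "alice")
store.add(10, 11, "bob")
store.add(13, 14, "carol")
print(store.stab(10.5))  # ['alice', 'bob']
```

Swapping the sorted list for a genuine interval tree (e.g. a centered or augmented tree) keeps the same interface while making the stab query logarithmic in n plus the output size.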

Related

Range search with KNN on two different dimensions

I've a few million records (which are updated often) with 2 properties:
Timestamp
Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with KD trees, but I see that some popular implementations support both range searches and finding K nearest neighbors, but my requirements are a bit peculiar here -- so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.
If you invest in initially sorting a list of (record_name, timestamp) tuples by timestamp, and create a dictionary with the record names as keys and (popularity_score, timestamp_list_idx) tuples as values, you will be able to:
Perform binary search for a particular timestamp O(logn)
Extract the greater than values in O(1) since the array is sorted
Extract the matching popularity vote in O(1) since they are in a dictionary
Update a record popularity score in O(1) due to the dictionary
Update a particular timestamp in O(1) via pulling the index of the record from the tuple in the dictionary value
Suppose you have m records in the wanted timestamp range. You can:
Generate a max heap from them by popularity, which takes O(m), and then perform k pops from that heap in O(k log m), since the root must be repopulated after every pop. The whole operation therefore takes O(m + k log m); assuming k << m, this runs in O(m).
Iterate over the m records while keeping a list of size k to track the top k popular records. After passing over all m records you will have the top k in the list. This takes O(m) as well.
Method 1 takes a little more time than method 2 in terms of complexity, but if you suddenly want to know the (k+1)-th most popular record, you can just pop another item from the heap instead of passing over all m records again with a (k+1)-long list.
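The binary-search-plus-heap scheme described above can be sketched in a few lines of Python (record names and scores here are invented for illustration; heapq.nlargest performs the build-heap-then-pop-k step in O(m + k log m)):

```python
import bisect
import heapq

# Sorted-by-timestamp list and a score dictionary, as described above.
timestamps = [("r1", 5), ("r2", 9), ("r3", 12), ("r4", 20)]
scores = {"r1": 3.0, "r2": 7.5, "r3": 1.2, "r4": 9.9}

def top_k_after(t, k):
    """Top k popular records with timestamp greater than t."""
    # Binary search for the first timestamp > t: O(log n).
    i = bisect.bisect_right([ts for _, ts in timestamps], t)
    candidates = timestamps[i:]  # the m matching records
    # Heapify + k pops over the m candidates: O(m + k log m).
    return heapq.nlargest(k, (name for name, _ in candidates),
                          key=scores.__getitem__)

print(top_k_after(8, 2))  # ['r4', 'r2']
```

Updating a record's popularity is a plain dictionary write, and updating a timestamp uses the stored list index, as the answer outlines.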

Inserting more than one data field in a sorted data structure

What is the best data structure to use when you need to insert more than 1 (like 10) data fields in a node and have the nodes to be stored in the ascending order of their key?
In general, inserting an element takes three steps:
locate where to insert
insert it
adjust the data structure if needed
If the inserted elements have some regularity (e.g., all elements are equal, which saves the locate step in a linked list), you can save some work, but the time complexity is still O(n).
If the inserted elements have no regularity, i.e., they are random enough, there is no difference between inserting one element and inserting a batch.
So in one word: there is no special data structure for batch inserting, but you can improve the constant factors under some special conditions.
PS: a common sorted data structure is the balanced binary tree, which spends O(log n) per update.
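To make the point concrete: a node can carry any number of fields, because the sorted structure orders only by the key. A sketch with a plain sorted list (all names invented; insertion here is O(n), where the balanced binary tree mentioned in the PS would give O(log n)):

```python
import bisect

records = []  # (key, fields) pairs, kept sorted by key

def insert(key, **fields):
    # Locate the insertion point by key alone; the other fields
    # ride along in the payload dict.
    keys = [k for k, _ in records]
    records.insert(bisect.bisect_left(keys, key), (key, fields))

insert(5, name="a", weight=1.0)
insert(2, name="b", weight=2.5)
insert(9, name="c", weight=0.3)
print([k for k, _ in records])  # [2, 5, 9]
```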

What is a data structure to lookup an object based on an id and find the minimum value of an associated field for each object in the DS fast

I have a set of objects that have the following fields:
id
time
I am looking for a data structure that can look up an object by its id fast (about O(log n)). I also need to keep track of the object with the smallest time and remove that minimum object fast, in O(log n). Additionally, an object's time field is mutable.
I thought about using a combination of a binary search tree and priority heap to do this. Essentially, the keys of the priority heap will be time and the values will be the object themselves. The keys of the binary search tree will be id and values will be the object itself along with a pointer to the entry representing the object in the heap. Is there a name for this type of data structure?
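The described combination can be approximated in Python without explicit heap pointers by using lazy deletion: a dict plays the role of the id index, and a stale heap entry is flagged rather than relocated, which emulates "update via a pointer into the heap" without a decrease-key operation. This is only an illustrative sketch of the idea, with invented names:

```python
import heapq

class IdMinTime:
    def __init__(self):
        self._by_id = {}  # id -> live heap entry [time, id, live_flag]
        self._heap = []   # min-heap over the same entry lists

    def insert(self, oid, time):
        entry = [time, oid, True]
        self._by_id[oid] = entry
        heapq.heappush(self._heap, entry)

    def update_time(self, oid, time):
        # Invalidate the old entry and push a fresh one: O(log n).
        self._by_id[oid][2] = False
        self.insert(oid, time)

    def pop_min(self):
        # Skip entries invalidated by update_time.
        while self._heap:
            time, oid, live = heapq.heappop(self._heap)
            if live:
                del self._by_id[oid]
                return oid, time
        raise KeyError("empty")

ds = IdMinTime()
ds.insert("a", 10)
ds.insert("b", 4)
ds.update_time("b", 20)
print(ds.pop_min())  # ('a', 10)
```

A version with true cross-pointers (as the question describes) would instead sift the touched heap entry up or down in place; the asymptotics are the same.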

Efficient data structure for a leaderboard, i.e., a list of records (name, points) - Efficient Search(name), Search(rank) and Update(points)

Please suggest a data structure for representing a list of records in memory. Each record is made up of:
User Name
Points
Rank (based on Points) - optional field - can be either stored in the record or can be computed dynamically
The data structure should support implementation of the following operations efficiently:
Insert(record) - might change ranks of existing records
Delete(record) - might change ranks of existing records
GetRecord(name) - Probably a hash table will do.
GetRecord(rank)
Update(points) - might change ranks of existing records
My main problem is efficient implementation of GetRecord(rank), because ranks can change frequently.
I guess an in-memory DBMS would be a good solution, but please don't suggest it; please suggest a data structure.
Basically, you'll just want a pair of balanced search trees, which will allow O(lg n) insertion, deletion, and getRecord operations. The trick is that instead of storing the actual data in the trees, you'll store pointers to a set of record objects, where each record object will contain 5 fields:
user name
point value
rank
pointer back to the node in the name tree that references the object
pointer back to the node in the point tree that references the object.
The name tree is only modified when new records are added and when records are deleted. The point tree is modified for insertions and deletions, but also for updates, where the appropriate record is found, has its point-tree pointer removed, its point count updated, then a new pointer added to the point-tree.
As you mention, you can use a hash table instead for the name tree if you like. The key here is that you simply maintain separate sorted indexes into a set of otherwise unordered records that themselves contain pointers to their index nodes.
The point tree will be some variation on an order statistic tree, which rather than being a specific data structure is an umbrella term for a binary search tree whose operations are modified to maintain an invariant that makes the requested rank-related operations more efficient than walking the tree. The details of how the invariants are maintained depend on the underlying balanced search tree (red-black tree, AVL tree, etc.) used.
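The two rank queries an order statistic tree provides can be sketched with a flat sorted list (a deliberately simplified stand-in: select-by-rank becomes direct indexing and rank-of-score a binary search, but insertion here is O(n), whereas the real tree keeps subtree sizes to do all of these in O(log n)):

```python
import bisect

points = []  # ascending point values

def insert(score):
    bisect.insort(points, score)

def select(rank):
    """Record at a given rank; rank 1 = highest score."""
    return points[len(points) - rank]

def rank_of(score):
    # Number of scores strictly greater, plus one.
    return len(points) - bisect.bisect_right(points, score) + 1

for s in [50, 70, 60]:
    insert(s)
print(select(1), rank_of(60))  # 70 2
```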
A skiplist + hashmap should work.
Here is an implementation in Go: https://github.com/wangjia184/sortedset
Every node in the set is associated with these properties.
key is a unique identity of the node, which is "User Name" in your case
value is any value associated with the node
score is a number that decides the order (rank) in the set, which is "Points" in your case
Each node in the set is associated with a key. While keys are unique, scores may be repeated. Nodes are kept in order (from low score to high score) rather than sorted afterwards. If scores are equal, nodes are ordered by key in lexicographic order. Each node in the set can also be accessed by rank, which represents its position in the sorted set.
A typical use case of a sorted set is a leaderboard in a massive online game, where every time a new score is submitted you update it using the AddOrUpdate() method. You can easily take the top users using the GetByRankRange() method; you can also, given a user name, return its rank in the listing using the FindRank() method. Using FindRank() and GetByRankRange() together you can show users with a score similar to a given user. All very quickly.
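For readers who don't use Go, the described API can be mimicked in Python with a dict plus a sorted list standing in for the hashmap and skiplist (the method names follow the library's description; everything else is illustrative, and the O(n) list operations would be O(log n) in a real skiplist):

```python
import bisect

scores = {}  # key -> score (the hashmap)
order = []   # sorted (score, key) pairs (stand-in for the skiplist)

def add_or_update(key, score):
    if key in scores:
        order.remove((scores[key], key))  # O(n) here; O(log n) in a skiplist
    scores[key] = score
    bisect.insort(order, (score, key))

def find_rank(key):
    # Rank 1 = lowest score, matching "from low score to high score".
    return bisect.bisect_left(order, (scores[key], key)) + 1

def get_by_rank_range(lo, hi):
    return [k for _, k in order[lo - 1:hi]]

add_or_update("alice", 300)
add_or_update("bob", 150)
add_or_update("alice", 100)  # update replaces the old entry
print(find_rank("alice"), get_by_rank_range(1, 2))  # 1 ['alice', 'bob']
```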
Look for a DBMS that includes a function to select a record by sequential record number.
See: How to select the nth row in a SQL database table?
Construct a table with a UserName column and a Points column. Make UserName the primary index. Construct a secondary non-unique maintained index on Points.
To get the record with rank R, select the index on Points and move to record R.
This makes the DBMS engine do most of the work and keeps your part simple.

return inserted items for a given interval

How would one design a memory-efficient system that accepts items added to it and allows items to be retrieved given a time interval (i.e., return items inserted between time T1 and time T2)? There is no DB involved; items are stored in memory. What data structure and associated algorithm would be involved?
Updated:
Assume extremely high insertion rate compared to data query.
You can use a sorted data structure, keyed by time of arrival. Note the following:
items are not removed
items are inserted in order [if item i was inserted after item j, then key(i) > key(j)].
For this reason a tree is discouraged: it is overpowered here, and insertion into it is O(log n), where you can get O(1) insertion. I suggest using one of the following:
(1) Array: the array is always filled at its end. When the allocated array is full, reallocate a bigger [double-sized] array and copy the existing array into it.
Advantages: good caching is usually expected with arrays; O(1) amortized insertion; used space is at most 2 * elementSize * #elements.
Disadvantages: high latency: when the array is full, it takes O(n) to add an element, so you must expect that once in a while there will be a costly operation.
(2) Skip list: a skip list also gives you O(log n) seek and O(1) insertion at the end, and it doesn't have the latency issue. However, it suffers more from cache misses than an array. Space used is on average elementSize * #elements + 2 * pointerSize * #elements.
Advantages: O(1) insertion, no costly operations.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, you should better use a skip list.
In both, finding the desired interval is:
findInterval(T1, T2):
    start <- data.find(T1)   # first element with key >= T1
    end <- data.find(T2)     # last element with key <= T2
    for each element in data from start to end:
        yield element
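The array variant above can be sketched directly in Python (names are illustrative): because items arrive in time order, insertion is a plain append, and findInterval is two binary searches plus a slice.

```python
import bisect

times, items = [], []  # parallel arrays, sorted by construction

def insert(t, item):
    # Arrival order equals time order, so append keeps times sorted.
    times.append(t)
    items.append(item)

def find_interval(t1, t2):
    """All items inserted between times t1 and t2, inclusive."""
    lo = bisect.bisect_left(times, t1)
    hi = bisect.bisect_right(times, t2)
    return items[lo:hi]

for t, x in [(1, "a"), (3, "b"), (7, "c"), (9, "d")]:
    insert(t, x)
print(find_interval(2, 8))  # ['b', 'c']
```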
Either a B-tree or a binary search tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or red-black tree.
How about a relational interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, as has been said already, the ratio of the anticipated operations matters (a lot, actually). But here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now that has a timestamp from a week ago or something. If that's the case, go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute your query refers to, i.e., the time t(X)).
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible to the left; I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.
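The addendum's blocked layout can be sketched as follows (BLOCK is kept tiny here for demonstration; the text suggests something like 1024 in practice):

```python
BLOCK = 4     # elements per chunk; ~1024 in a real deployment
blocks = []   # top-level structure: list of fixed-size chunks

def append(t, value):
    if not blocks or len(blocks[-1]) == BLOCK:
        blocks.append([])  # the rare step: add a node to the top level
    blocks[-1].append((t, value))  # the common step: append to current chunk

for i in range(10):
    append(i, str(i))
print(len(blocks), len(blocks[-1]))  # 3 2
```

In the real design the top level would be the tree or skip list from the earlier answers rather than a plain Python list, but the amortization argument is the same: only one append in every BLOCK touches the top-level structure.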
