Which data structure should be used for range lookup?

I was thinking of using a HashMap, but I think I would either have to customize it or create a custom data structure. As we know, a HashMap stores key-value pairs, but I need a data structure where, instead of a single key, I can put a range. For example:
Range        Should return
0 to 50      Object1
51 to 100    Object2
90 to 150    Object3
So:
if the user searches for 10, they should get Object1;
if the user searches for 55, they should get Object2;
if the user searches for 95, they should get both Object2 and Object3.
I was thinking of putting the range inside each object, putting all the objects into an ArrayList or LinkedList, and then iterating to find all objects that satisfy the input. But that makes every query O(n): for each input I would have to traverse the whole list. I also thought about a tree, but in the case of overlapping ranges (like 51 to 100 and 90 to 150) I could not figure out how that would help.
Let me know your views; my target is a lookup time close to that of a HashMap.

You could use a B-Tree, or maybe a Disjoint-set structure. Another S.O. user suggests a TreeMap. The final possibility (possibly solving your overlapping-range dilemma) is the R-Tree.
With the B-Tree, you can put a small "directory" field in each node object that immediately tells you what is contained in that node. However, you have to think about what happens when a node becomes full and you have to donate an object to, or adopt one from, another node.
Having said that, the Disjoint-set structure with path compression gives you a near-constant amortized runtime: O(log* N) amortized with path compression alone, and O(α(N)) (the inverse Ackermann function) when combined with union by size or rank. It is also extremely easy to implement; you really only need the two core operations, Union and Find (with union by size and path compression folded in), to get it running.
R-Trees would allow you to handle overlapping ranges, but you also sacrifice a bit of runtime: in the worst case, you end up with a search time of O(M log_M n), which is slower than a HashMap.
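For the non-overlapping case, a TreeMap keyed on each range's lower bound already gets you O(log n) lookups; overlapping ranges need a linear filter (or an interval tree / R-tree) on top. A minimal sketch, assuming a hypothetical Range record that is not part of any library:

    import java.util.*;

    public class RangeLookup {
        // Hypothetical Range record for illustration only.
        record Range(int low, int high, String value) {
            boolean contains(int point) { return low <= point && point <= high; }
        }

        public static void main(String[] args) {
            List<Range> ranges = List.of(
                new Range(0, 50, "Object1"),
                new Range(51, 100, "Object2"),
                new Range(90, 150, "Object3"));

            // TreeMap keyed on each range's lower bound; floorEntry() finds the
            // nearest candidate in O(log n) -- correct only if ranges do NOT overlap.
            TreeMap<Integer, Range> byLow = new TreeMap<>();
            for (Range r : ranges) byLow.put(r.low(), r);
            Map.Entry<Integer, Range> e = byLow.floorEntry(55);
            if (e != null && e.getValue().contains(55))
                System.out.println(e.getValue().value());       // Object2

            // With overlapping ranges (51..100 vs 90..150), fall back to a filter;
            // an interval tree or R-tree would make this O(log n + k) instead.
            ranges.stream().filter(r -> r.contains(95))
                  .forEach(r -> System.out.println(r.value())); // Object2, Object3
        }
    }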

Related

data structure to find value for closest point to input data

I have a dataset where I map a value to a three dimensional point. e.g. (1,2,3)->5; (2,4,1)->7; and so on.
I need to store these and be able to quickly find the desired value. If there were an entry for every possible input, I could just use a 3D array (or a dictionary), use the point as an index, and do no searching at all.
The input however is real valued, so not every single point exists in the dataset. I want to find the n nearest points to the input data and get their related values to perform interpolation.
Which data structure could I use to implement this in an efficient way? The data structure only needs to be created once and does not have to change later.
What you want is a K-d tree.
It is a data structure designed especially to partition points of a k-dimensional space, and it allows you to find the nearest neighbor to a given point remarkably fast (O(log n) on average).
You also probably won't need to implement a k-d tree yourself, since implementations of the structure exist in many languages (I used it in python, and I'd bet you can find solid implementations in most of the common languages).
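If you do end up rolling your own, the core is small. Below is a minimal 3-d tree sketch (my own illustration, not any particular library's API): build once from a mutable list, then query for the nearest neighbor.

    import java.util.*;

    class KdTree {
        record Point(double[] xyz, double value) {}

        private Point point;
        private KdTree left, right;

        // Build once from a mutable list; median split on the cycling axis.
        static KdTree build(List<Point> pts, int depth) {
            if (pts.isEmpty()) return null;
            int axis = depth % 3;
            pts.sort(Comparator.comparingDouble(p -> p.xyz()[axis]));
            int mid = pts.size() / 2;
            KdTree node = new KdTree();
            node.point = pts.get(mid);
            node.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
            node.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1);
            return node;
        }

        Point nearest(double[] q) { return nearest(q, 0, point); }

        private Point nearest(double[] q, int depth, Point best) {
            if (dist2(point.xyz(), q) < dist2(best.xyz(), q)) best = point;
            int axis = depth % 3;
            double diff = q[axis] - point.xyz()[axis];
            KdTree near = diff < 0 ? left : right;
            KdTree far  = diff < 0 ? right : left;
            if (near != null) best = near.nearest(q, depth + 1, best);
            // Descend the far side only if the splitting plane is closer than the best so far.
            if (far != null && diff * diff < dist2(best.xyz(), q))
                best = far.nearest(q, depth + 1, best);
            return best;
        }

        private static double dist2(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < 3; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return s;
        }
    }

Build with, e.g., KdTree.build(new ArrayList<>(points), 0). Extending this to the n nearest neighbors for interpolation means keeping a bounded max-heap of candidates instead of a single best point.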

What is the best data structure for implementing a high scores table

I am trying to implement a high-score table for a game. I think the best solution would be to create a linked list, and then just insert the new score into the table and run through the list up to that point, moving all the rankings down one. Is there a better alternative to this?
I think the best solution is to use a priority queue (usually implemented with a heap); that way new scores can simply be added, and the data structure keeps the maximum at the top for you.
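A minimal Java sketch of this (the sample scores are made up; note the caveat below that a heap only exposes the maximum):

    import java.util.*;

    public class HighScores {
        public static void main(String[] args) {
            // Max-heap via java.util.PriorityQueue (min-heap by default, so reverse it).
            PriorityQueue<Integer> scores = new PriorityQueue<>(Comparator.reverseOrder());
            scores.addAll(List.of(120, 95, 300));
            System.out.println(scores.peek());  // 300 -- highest score in O(1)
            // peek()/poll() expose only the maximum; pulling the top x costs O(x log n).
        }
    }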
I'd say that a balanced binary search tree (red-black/AVL) with a finger (a pointer to the maximum in the tree) would be quite efficient. You can get to the highest score in O(1), and by walking predecessor nodes (which is simple in a BST) you can get as many high scores as you wish.
This would be useful if you have a bunch of scores and want to pick the top x. If you wanted a function that takes a score and returns its ranking, or wanted to ask general range queries (e.g., "give me ranks 10 to 20"), you would need to modify the structure slightly, keeping the size of the subtree in each node and using that information.
Your solution is quite slow, requiring O(n) operations per insert and find.
A heap is a lot faster at finding the top score; however, it gives you only the highest score - you would have to search manually for the top x scores, and general range queries cannot be done efficiently.
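As a concrete stand-in for the balanced BST, Java's red-black-tree-backed TreeMap already supports the "walk down from the maximum" pattern (a sketch with made-up scores; note that keys are unique, so tied scores would need a composite key or a multimap):

    import java.util.*;

    public class TopScores {
        public static void main(String[] args) {
            TreeMap<Integer, String> table = new TreeMap<>();
            table.put(300, "alice");
            table.put(120, "bob");
            table.put(95, "carol");
            // Top 2 scores, highest first, in O(log n + x):
            table.descendingMap().entrySet().stream().limit(2)
                 .forEach(e -> System.out.println(e.getValue() + ": " + e.getKey()));
            // alice: 300
            // bob: 120
        }
    }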

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allows for quick, concurrent querying of the data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional dataset.
Solutions that are not acceptable include a sequential scan of the data; that would be far too slow.
I think the most appropriate data structures for your requirements are the R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). The R-tree is similar to a B+-tree, but also allows multidimensional queries.
Another relevant data structure is the priority search tree. It is good for queries like your examples 1 to 3, but not very efficient if you need frequent updates or on-disk storage. For details see this paper or the book "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point. Then, recursively:
    Create a result set of rectangles.
    For each rectangle in your set:
        If the rectangle is a single point, you are done; it is what you are looking for.
        Otherwise, divide the rectangle in two (specify one additional bit of the z-curve):
            If both halves contain a point:
                If one half is better, add that rectangle to your result set of rectangles;
                otherwise, add both rectangles to your result set of rectangles.
            Otherwise (only one half contains a point), add that rectangle to your result set of rectangles.
    Search your result set of rectangles.
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
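For reference, the z-order key itself is just the bit-interleaving of the coordinates. A minimal sketch (my own helper, not from the answer; it uses 32-bit coordinates because the question's 64-bit pairs would need a 128-bit key, and it assumes non-negative values, since sign bits break the ordering):

    public class Morton {
        // Interleave the bits of x and y: bit i of x lands at position 2i,
        // bit i of y at position 2i + 1.
        static long morton(int x, int y) {
            long key = 0;
            for (int i = 0; i < 32; i++) {
                key |= ((long) ((x >>> i) & 1)) << (2 * i);
                key |= ((long) ((y >>> i) & 1)) << (2 * i + 1);
            }
            return key;
        }

        public static void main(String[] args) {
            // x = 101, y = 011 interleave (y2 x2 y1 x1 y0 x0) to 011011.
            System.out.println(Long.toBinaryString(morton(0b101, 0b011))); // 11011
        }
    }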
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree linked to the next dimension's B+ tree. Search through the first dimension normally; once a leaf is reached, it contains a pointer to the root of the next dimension's B+ tree. Everything in that second B+ tree belongs to the same x value.
The original plan was to store only the unique values for each dimension, along with its count. This amounts to a very simple compression scheme (if you can even call it that) while still allowing the entire data set to be represented. This "linked" dimension scheme also allows extra dimensions to be added later, as they are simply appended to the stack of B+ trees.
Total insert/search/delete time for two dimensions would be something like:
    O(log_b(card(x)) + log_b(card(y)))
where b is the base of each B+ tree and card(x) is the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation; feel free to use or even augment the idea.
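A crude in-memory stand-in for the idea, using nested TreeMaps (my illustration, not the poster's implementation): the outer tree indexes x, and each x maps to a tree of its y values with counts. Query example 2 from the question then becomes an in-order walk:

    import java.util.*;

    public class StackedTrees {
        public static void main(String[] args) {
            // Outer tree: x -> (inner tree: y -> count of (x, y) tuples).
            TreeMap<Long, TreeMap<Long, Integer>> byX = new TreeMap<>();
            insert(byX, 2, 5); insert(byX, 4, 9); insert(byX, 7, 1);

            // "Smallest x such that y >= 8": walk x in order, probe each y-tree.
            for (var e : byX.entrySet()) {
                Long y = e.getValue().ceilingKey(8L);
                if (y != null) {
                    System.out.println("x=" + e.getKey() + " y=" + y); // x=4 y=9
                    break;
                }
            }
        }

        static void insert(TreeMap<Long, TreeMap<Long, Integer>> byX, long x, long y) {
            byX.computeIfAbsent(x, k -> new TreeMap<>()).merge(y, 1, Integer::sum);
        }
    }

Note that this walk can touch many x values when few y values qualify, which is why purpose-built structures like the priority search tree mentioned above handle such queries better.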
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each of which is a key/value pair. Keys and values are variable-length byte sequences; both binary data and character strings can be used. There is no concept of data tables or data types. Records are organized in a hash table, a B+ tree, or a fixed-length array.
Tokyo Cabinet is written in the C language and provides APIs for C, Perl, Ruby, Java, and Lua. It is available on platforms with C99- and POSIX-conforming APIs, and is free software licensed under the GNU Lesser General Public License.
It may be easy for you to embed.

Data Structure for Storing Ranges

I am wondering if anyone knows of a data structure which would efficiently handle the following situation:
The data structure should store several, possibly overlapping, variable length ranges on some continuous timescale.
For example, you might add the ranges a:[0,3], b:[4,7], c:[0,9].
Insertion time does not need to be particularly efficient.
Retrievals would take a range as a parameter, and return all the ranges in the set that overlap with the range, for example:
Get(1,2) would return a and c. Get(6,7) would return b and c. Get(2,6) would return all three.
Retrievals need to be as efficient as possible.
One data structure you could use is a one-dimensional R-tree. These are designed to deal with ranges and to provide efficient retrieval. You will also learn about Allen's operators; there are a dozen relationships between time intervals besides just 'overlaps'.
There are other questions on SO that impinge on this area, including:
Determine Whether Two Date Ranges Overlap
Data structure for non-overlapping ranges within a single dimension
You could go for a binary tree that stores the ranges in a hierarchy. Starting from the root node, which represents an all-encompassing range divided at its middle, you test whether the range you are trying to insert belongs to the left subrange, the right subrange, or both, and recursively carry on in the matching subnodes until you reach a certain depth, at which point you store the actual range.
For lookup, you test your input range against the left and right subranges of the top node and descend into the ones that overlap, repeating until you have reached the actual stored ranges.
This way, retrieval has logarithmic complexity. You'd still need to manage duplicates in your retrieval, as some ranges will belong to several nodes.
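A minimal sketch of that idea (my own illustration, assuming an integer timescale and a fixed maximum depth): each child covers half of its parent's range, and queries deduplicate through an identity set, since one range can be stored under several nodes.

    import java.util.*;

    class RangeTree {
        final int lo, hi;                            // interval this node covers
        final List<int[]> stored = new ArrayList<>();
        RangeTree left, right;

        RangeTree(int lo, int hi) { this.lo = lo; this.hi = hi; }

        void insert(int[] range, int depth) {
            if (range[1] < lo || range[0] > hi) return;   // no overlap with this node
            if (depth == 0 || lo == hi) { stored.add(range); return; }
            int mid = lo + (hi - lo) / 2;
            if (left == null) { left = new RangeTree(lo, mid); right = new RangeTree(mid + 1, hi); }
            left.insert(range, depth - 1);
            right.insert(range, depth - 1);
        }

        void query(int a, int b, Set<int[]> out) {
            if (b < lo || a > hi) return;                 // prune non-overlapping subtrees
            for (int[] r : stored)
                if (r[1] >= a && r[0] <= b) out.add(r);   // same array object => dedup by identity
            if (left != null) { left.query(a, b, out); right.query(a, b, out); }
        }

        public static void main(String[] args) {
            RangeTree root = new RangeTree(0, 15);
            int[] a = {0, 3}, b = {4, 7}, c = {0, 9};
            root.insert(a, 4); root.insert(b, 4); root.insert(c, 4);
            Set<int[]> hits = Collections.newSetFromMap(new IdentityHashMap<>());
            root.query(1, 2, hits);                       // finds a and c
            hits.forEach(r -> System.out.println(Arrays.toString(r)));
        }
    }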

What sort of sorted data structure is optimized for finding items within a range?

Say I have a bunch of objects with dates, and I regularly want to find all the objects that fall between two arbitrary dates. What sort of data structure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects actually in that range (provided the tree is balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
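With Java's TreeSet, for instance, that lower-bound-then-iterate pattern is a one-liner via the subSet() view (a sketch with made-up sample dates):

    import java.time.LocalDate;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    public class DateRange {
        public static void main(String[] args) {
            TreeSet<LocalDate> dates = new TreeSet<>();
            dates.add(LocalDate.of(2024, 1, 5));
            dates.add(LocalDate.of(2024, 3, 1));
            dates.add(LocalDate.of(2024, 6, 30));
            // All dates in [2024-01-01, 2024-04-01], found in O(log N + K):
            NavigableSet<LocalDate> inRange =
                dates.subSet(LocalDate.of(2024, 1, 1), true,
                             LocalDate.of(2024, 4, 1), true);
            System.out.println(inRange);  // [2024-01-05, 2024-03-01]
        }
    }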
Assuming you mean sorted by date, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
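A small sketch of that with Arrays.binarySearch, using long timestamps as stand-ins for dates (caveat: with duplicate keys, binarySearch returns an arbitrary match, so duplicate-heavy data needs a leftmost-match variant):

    import java.util.Arrays;

    public class SortedArrayRange {
        public static void main(String[] args) {
            long[] dates = {3, 7, 9, 15, 20};      // kept sorted by date
            long start = 8, end = 16;
            int lo = Arrays.binarySearch(dates, start);
            if (lo < 0) lo = -lo - 1;              // first index with dates[i] >= start
            int hi = Arrays.binarySearch(dates, end + 1);
            if (hi < 0) hi = -hi - 1;              // first index with dates[i] > end
            for (int i = lo; i < hi; i++)
                System.out.println(dates[i]);      // 9, 15
        }
    }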
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine, then I would just use a list of dates and iterate through it, collecting all dates that fall within the range, as Andrew Grant suggested.
Do you have duplicates in the list?
If you need repeated dates in your collection, then most binary-tree implementations would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most data structures trade off between lookups and updates. If you're doing lots of updates, then a structure optimized for lookups won't be so great.
So: what are the access characteristics, what kind of performance do you need, and what structural characteristics must the data structure support (e.g., must it allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date whenever you insert or remove one, and where finding the boundary of the segment of all objects earlier or later than a given date is easy.
A sorted array is the practical candidate here. (A note of caution: a binary heap is also commonly represented by an array, but the heap array is only partially ordered, so it cannot be sliced between two positions; for this use, the array must be kept fully sorted, with each insertion placed at its binary-search position, which costs O(n) for the shifting.)
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, that is, the position of the earliest element later than A) and the position of B (or B's insert position), and return all the objects between those positions, which is simply that section of the array.
