I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allow for the quick, and concurrent, querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal than some given value
Find the tuple whose x is as small as possible s.t. it's y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. it's y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found are Z-order curves but I cannot seem to figure out how to conduct the queries given my two dimensional data-set.
Solutions that are not acceptable include a sequential scan of the data, this could be far too slow.
I think, the most appropriate data structures for your requirements are R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). R-tree is similar to B+-tree, but also allows multidimensional queries.
Other relevant data structure is Priority Search Tree. It is good for queries like your examples 1 .. 3, but not very efficient if you need frequent updates or on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
Create a result set of rectangles
For each rectangle in your set
If the rectangle is a single point, you are done, it is what you are looking for.
Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
If both halves contain a point
If one half is better
Add that rectangle to your result set of rectangles
Otherwise
Add both rectangles to your result set of rectangles
Otherwise, only one half contains a point
Add that rectangle to your result set of rectangles
Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally, once a leaf is reached it contains a pointer to the root of the next B+ tree which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to only store the unique values for each dimension along with it's count. This employs a very simple compression algorithm (if you can even call it that) while still allowing for the entire data set to be represented. This 'linked' dimension scheme could allow for extra dimensions to be added later as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log b(card(x)) + log b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
may it easy for u to embed?
Related
I have a dataset where I map a value to a three dimensional point. e.g. (1,2,3)->5; (2,4,1)->7; and so on.
I need to store these and be able to quickly find the desired value. If there were an entry for every possible input, I could just use an 3D array (or a dictionary), use the point as an index and do no searching at all.
The input however is real valued, so not every single point exists in the dataset. I want to find the n nearest points to the input data and get their related values to perform interpolation.
Which data structure could I use to implement this in an efficient way? The data structure only needs to be created once and does not have to change later.
What you want is a K-d tree.
It is a data structure designed especially to partition points of a k dimensional space, and it allows to find the nearest neighbor to a given point remarkably fast (O(log n)).
You also probably won't need to implement a k-d tree yourself, since implementations of the structure exist in many languages (I used it in python, and I'd bet you can find solid implementations in most of the common languages).
I am developing a simple ecosystem simulation and I am stuck on choosing an optimal datastructure for the creatures.
What I basically have is a list of all animals in my current world with information about animal type and location. This list can grow very large - probably close to millions.
So, in order to implement some simple AI, I need to make this operation as fast as possible:
Given a point on the map and a radius, give me a list of all animals within the radius sorted by the distance from center point.
The world would be 2d so we are limited to planar coordinates.
Some of the operations I also need to support:
Changing location of animals
Creating new animals
Removing animals
I read about kd trees and their ability to calculate nearest neighbour fast.
Questions:
do you think it will work in my case? If not, what data structure should I use to meet the requirements?
EDIT:
Here are some more details as requested in comments.
The world wouldn't be too big - something that can fit a screen with animals being little circles. I should also support the scenario where the world could become fairly dense.
Finally, I expect not more than a few dozens returned per query but as I will have a great number of those queries (one per each animal althrough this should be cached somehow probably but let's forget about that for simplicity for now) this to be as fast and efficient as possible.
Given your constraints (a large number of moving objects and a small world of presumably fixed size), a simple grid may be a better fit than a kd-tree.
See: Trees or Grids? Indexing Moving Objects in Main Memory
kd-trees should work fine if there are many queries before the positions change (e.g., one per animal). Incremental updates probably would be fine as well, but an immutable kd-tree that is completely rebuilt when necessary has the advantage that the tree structure can be represented succinctly with an array. To find all points within a search disk, traverse the tree while pruning nodes whose half-plane completely excludes the search disk. You'll have to sort the results by distance at the end.
I am working on a tool that requires a 3D "voxel-based" engine. By that I mean it will involve adding and removing cubes from a grid. In order to manage these cubes I need a data structure that allows for quick insertions and deletes. The problem I've seen with k-d trees and octrees is that it seems like they would frequently need to be recreated (or at least rebalanced) because of these operations.
Before I jumped in I wanted to get opinions on what the best way to go about this would be.
Some more details:
x,y,z position is in integer space
needs to be efficient enough for a real-time application
there is no hard limit on the number of cubes that would be used.
In all likelihood the number will most often be inconsequentially
low (<100), however I would like to have the tool handle as many
cubes as possible
I guess the ultimate question is what is the best way to manage what is essentially 3D point data in a way that can handle frequent insertions and deletes?
(No I'm not making Minecraft)
Octrees are easy to update dynamically. Typically the tree is refined based on a per leaf upper/lower population count:
When a new item is inserted, it is pushed onto the item list for the enclosing leaf node. If the upper population count is exceeded, the leaf is refined.
When an existing item is erased, it is removed from the item list for the enclosing leaf node. If the lower population count is reached, the leaf siblings are scanned. If all siblings are leaf nodes and their cummulative item count is less than the upper population count the set of siblings are deleted and the items pushed onto the parent.
Both operations are local, traversing only the height of the tree, which is O(log(n)) for well distributed point sets.
KD-trees, on the other hand, are not easy to update dynamically, since their structure is based on the distribution of the full point set.
There are also a number of other spatial data structures that support dynamic updates - R-trees, Delaunay triangulations to name a few, but it's not clear that they'd offer better performance than an Octree. I'm not aware of any spatial structure that supports better than O(log(n)) dynamic queries.
Hope this helps.
Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a lisp-like S-expression or an (G)Algebraic Data Type in Haskell or Ocaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that elements in each looks like each other.
I am sure there are papers out there which describes approaches but since I am not very known in the area of AI/Clustering/MachineLearning, I want to ask somebody who are what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector set up for a K-means clustering.
To do this, I recursively walk the document tree and for each level I calculate a vector. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied so it does matter less and less the further down the tree I go. The documents final vector is the root of the tree.
Depending on the data at a tree leaf, I apply a function which takes the data into a vector.
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees which has a top structure much like each other. If the similarity is present, but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one need to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the function close to the root are the inner function which fails. I need a decent distance function between stack traces that probably occur because the same event happened in code.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this examplem event 12)
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes
stack traces in distributed Java systems. When combined with third-party
clustering software and adaptive cluster filtering, unusual executions can be
identified.
I am wondering if anyone knows of a data structure which would efficiently handle the following situation:
The data structure should store several, possibly overlapping, variable length ranges on some continuous timescale.
For example, you might add the ranges a:[0,3], b:[4,7], c:[0,9].
Insertion time does not need to be particularly efficient.
Retrievals would take a range as a parameter, and return all the ranges in the set that overlap with the range, for example:
Get(1,2) would return a and c. Get(6,7) would return b and c. Get(2,6) would return all three.
Retrievals need to be as efficient as possible.
One data structure you could use is a one-dimensional R-tree. These are designed to deal with ranges and to provide efficient retrieval. You will also learn about Allen's Operators; there are a dozen other relationships between time intervals than just 'overlaps'.
There are other questions on SO that impinge on this area, including:
Determine Whether Two Date Ranges Overlap
Data structure for non-overlapping ranges within a single dimension
You could go for a binary tree, that stores the ranges in a hierarchy. Starting from the root node, that represents an all-encompassing range divided it its middle, you test if your range you are trying to insert belong to the left subrange, right subrange, or both, and recursively carry on in the matching subnodes until you reach a certain depth, at which you save the actual range.
For lookup, you test your input range against the left and right subranges of the top node, and dive in the ones which overlap, repeating until you have reached actual ranges that you save.
This way, retrieval has a logarithmic complexity. You'd still need to manage duplicates in your retrieval, as some ranges are going to belong to several nodes.