Auto-balancing (or cheaply balanced) 3D datastructure - data-structures

I am working on a tool that requires a 3D "voxel-based" engine. By that I mean it will involve adding and removing cubes from a grid. In order to manage these cubes I need a data structure that allows for quick insertions and deletes. The problem I've seen with k-d trees and octrees is that it seems like they would frequently need to be recreated (or at least rebalanced) because of these operations.
Before I jumped in I wanted to get opinions on what the best way to go about this would be.
Some more details:
x,y,z position is in integer space
needs to be efficient enough for a real-time application
there is no hard limit on the number of cubes that would be used.
In all likelihood the number will most often be inconsequentially
low (<100), however I would like to have the tool handle as many
cubes as possible
I guess the ultimate question is what is the best way to manage what is essentially 3D point data in a way that can handle frequent insertions and deletes?
(No I'm not making Minecraft)

Octrees are easy to update dynamically. Typically the tree is refined based on a per leaf upper/lower population count:
When a new item is inserted, it is pushed onto the item list for the enclosing leaf node. If the upper population count is exceeded, the leaf is refined.
When an existing item is erased, it is removed from the item list for the enclosing leaf node. If the lower population count is reached, the leaf siblings are scanned. If all siblings are leaf nodes and their cummulative item count is less than the upper population count the set of siblings are deleted and the items pushed onto the parent.
Both operations are local, traversing only the height of the tree, which is O(log(n)) for well distributed point sets.
KD-trees, on the other hand, are not easy to update dynamically, since their structure is based on the distribution of the full point set.
There are also a number of other spatial data structures that support dynamic updates - R-trees, Delaunay triangulations to name a few, but it's not clear that they'd offer better performance than an Octree. I'm not aware of any spatial structure that supports better than O(log(n)) dynamic queries.
Hope this helps.

Related

Most performant way to find all the leaf nodes in a tree data structure

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!
Memory layout is essential to optimal retrieval, so the child lists should be contiguous and not linked list, the nodes should be place after each other in retrieval order.
The more static your tree is, the better layout can be done.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use offset to find the child relative to itself
Con
inserting / deleting can be cumbersome
insert / delete O(N)
insert might lead to resize of the array leading to a costly copy
Two array layout
One array for internal nodes
One array for leafs
Internal nodes points to the leafs
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if your mostly only interested in the leafs).
no unneeded page lookups
indirect lookups can be made
Con
if all leafs are ordered insert / delete can be cumbersome
if leafs are unordered insertion is ease, just add at the end.
deleting unordered leafs is also a problem if no tombstones are allowed as the last leaf would have to be moved back and the internal nodes would need fix up. (via a further indirection this can also be fixed see slot-map)
resizing of the either might lead to a large copy, though less than the All-in-one as they could be done independently.
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while removing much of the extra work of linked list children the individual lists are dispersed among all other data making lookup taking extra time.
insert might cause resize and copy of an array.
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things, plus the branch nodes along the way. The constant overhead is the branch nodes.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.

What data structure will be best to store the order of explored nodes in an occupancy grid?

I have multiple robots, which explore an occupancy grid through some algorithm. I am trying to save the order of explored nodes. But I am not sure, which data structure can be used to save them efficiently.
I first thought of an tree, but the order can be repeatable like 1, 2, 5, 1. So, I feel, it may be too complex to store such an order in tree form. Then, I thought of an array, but it can be too much expensive in terms of memory for large grids.
I am a bit confused now. What data structure would be better(suppose grid is of 10,000 nodes). But the point is the order of explored nodes will be greater than 10,000 in this case as there will be overlap.
Thanks!
A tree makes little sense here with a need to preserve insertion order and the need to allow duplicates. Basically, as I understand it, we want to store the path in which the robot has traveled in the tightest form we can.
A compact, contiguous kind of sequence ends up making the most sense here (array, e.g.). It's cheaper than any linked structure (tree included) since there are no links to store.
There's little we can do to compact memory usage any further.
However, an unrolled list might be helpful here. Since it's not one giant contiguous block and instead a series of smaller blocks (ex: 4 kilobytes each) linked together, you can start, say, off-loading blocks at the front of the list to disk if you want to reduce memory use. The link overhead is trivial since we're only storing a link every N elements, where N could be some large number.

Fast query of objects based on location and radius

I am developing a simple ecosystem simulation and I am stuck on choosing an optimal datastructure for the creatures.
What I basically have is a list of all animals in my current world with information about animal type and location. This list can grow very large - probably close to millions.
So, in order to implement some simple AI, I need to make this operation as fast as possible:
Given a point on the map and a radius, give me a list of all animals within the radius sorted by the distance from center point.
The world would be 2d so we are limited to planar coordinates.
Some of the operations I also need to support:
Changing location of animals
Creating new animals
Removing animals
I read about kd trees and their ability to calculate nearest neighbour fast.
Questions:
do you think it will work in my case? If not, what data structure should I use to meet the requirements?
EDIT:
Here are some more details as requested in comments.
The world wouldn't be too big - something that can fit a screen with animals being little circles. I should also support the scenario where the world could become fairly dense.
Finally, I expect not more than a few dozens returned per query but as I will have a great number of those queries (one per each animal althrough this should be cached somehow probably but let's forget about that for simplicity for now) this to be as fast and efficient as possible.
Given your constraints (a large number of moving objects and a small world of presumably fixed size), a simple grid may be a better fit than a kd-tree.
See: Trees or Grids? Indexing Moving Objects in Main Memory
kd-trees should work fine if there are many queries before the positions change (e.g., one per animal). Incremental updates probably would be fine as well, but an immutable kd-tree that is completely rebuilt when necessary has the advantage that the tree structure can be represented succinctly with an array. To find all points within a search disk, traverse the tree while pruning nodes whose half-plane completely excludes the search disk. You'll have to sort the results by distance at the end.

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allow for the quick, and concurrent, querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal than some given value
Find the tuple whose x is as small as possible s.t. it's y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. it's y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found are Z-order curves but I cannot seem to figure out how to conduct the queries given my two dimensional data-set.
Solutions that are not acceptable include a sequential scan of the data, this could be far too slow.
I think, the most appropriate data structures for your requirements are R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). R-tree is similar to B+-tree, but also allows multidimensional queries.
Other relevant data structure is Priority Search Tree. It is good for queries like your examples 1 .. 3, but not very efficient if you need frequent updates or on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
Create a result set of rectangles
For each rectangle in your set
If the rectangle is a single point, you are done, it is what you are looking for.
Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
If both halves contain a point
If one half is better
Add that rectangle to your result set of rectangles
Otherwise
Add both rectangles to your result set of rectangles
Otherwise, only one half contains a point
Add that rectangle to your result set of rectangles
Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally, once a leaf is reached it contains a pointer to the root of the next B+ tree which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to only store the unique values for each dimension along with it's count. This employs a very simple compression algorithm (if you can even call it that) while still allowing for the entire data set to be represented. This 'linked' dimension scheme could allow for extra dimensions to be added later as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log b(card(x)) + log b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
may it easy for u to embed?

Resources