GPU-based inclusive scan on an unbalanced tree

GPU-based inclusive scan on an unbalanced tree - algorithm

I have the following problem: I need to compute the inclusive scans (e.g. prefix sums) of values based on a tree structure on the GPU. These scans are either from the root node (top-down) or from the leaf nodes (bottom-up). The case of a simple chain is easily handled, but the tree structure makes parallelization rather difficult to implement efficiently.
For instance, after a top-down inclusive scan, (12) would hold (0)[op](6)[op](7)[op](8)[op](11)[op](12), and for a bottom-up inclusive scan, (8) would hold (8)[op](9)[op](10)[op](11)[op](12), where [op] is a given binary operator (matrix addition, multiplication etc.).
One also needs to consider the following points:
For a typical scenario, the length of the different branches should not be too long (~10), with something like 5 to 10 branches, so this is something that will run inside a block and work will be split between the threads. Different blocks will simply handle different values of nodes. This is obviously not optimal regarding occupancy, but this is a constraint on the problem that will be tackled sometime later. For now, I will rely on Instruction-level parallelism.
The structure of the graph cannot be changed (it describes an actual system), thus it cannot be balanced (or only by changing the root of the tree, e.g. using (6) as the new root). Nonetheless, a typical tree should not be too unbalanced.
I currently use CUDA for GPGPU, so I am open to any CUDA-enabled template library that can solve this issue.
Node data is already in global memory and the result will be used by other CUDA kernels, so the objective is just to achieve this without making it a huge bottleneck.
There is no "cycle", i.e. branches cannot merge down the tree.
The structure of the tree is fixed and set in an initialization phase.
A single binary operation can be quite expensive (e.g. multiplication of polynomial matrices, i.e. each element is a polynomial of a given order).
In this case, what would be the "best" data structure (for the tree structure) and the best algorithms (for the inclusive scans/prefix sums) to solve this problem?

Probably a harebrained idea, but imagine that you insert nodes of 0 value into the tree, in such a way that you get a 2D matrix. For instance, there would be 3 zero value nodes below the 5 node in your example. Then use one thread to travel each level of the matrix horizontally. For the top-down prefix sum, offset the threads in such a way that each lower thread is delayed by the maximum number of branches the tree can have in that location. So, you get a "wave" with a slanted edge running over the matrix. The upper threads, being further along, calculate those nodes in time for them to be processed further by threads running further down. You would need the same number of threads as the tree is deep.

I think parallel prefix scan may not suitable for your case because:
parallel prefix scan algorithm will increase the total number of [op], in your link of prefix sum, a 16-input parallel prefix scan requires 26 [op], while a sequential version only need 15. parallel algorithm performs better is based on a assumption that there's enough hardware resources to run multiple [op] in parallel.
You could evaluate the cost of your [op] before try the parallel prefix scan.
On the other hand, since the size of the tree is small, I think you could simply consider your tree as 4 (number of the leaf nodes) independent simple chains, and use concurrent kernel execution to improve the performance of these 4 prefix scan kernels
0-1-2-3
0-4-5
0-6-7-8-9-10
0-6-7-8-11-12

I think in Kepler GK110 architecture, you can invoke kernels recursively, which they call dynamic parallelism. So, if you need to compute the sum of the values at each node of the tree, dynamic parallelism would help. However, the depth of recursion might be a constraint.

My first impression is that you could organize the tree nodes in a 1 dimensional array, similar to what Eric suggested. And then you could do a Segmented Prefix Sum Scan (http://en.wikipedia.org/wiki/Segmented_scan) over that array.
Using your tree nodes as an example, the 1-dim array would look like:
0-1-2-3-0-4-5-0-6-7-8-9-10-0-6-7-8-11-12
And then you would have a parallel array of flags indicating where a new list begins (by list I mean a sequence beginning at the root and ending at a leaf node):
1-0-0-0-1-0-0-1-0-0-0-0- 0-1-0-0-0- 0- 0
For the bottom-up case, you could create a separate segment-flag array like so:
0-0-0-1-0-0-1-0-0-0-0-0- 1-0-0-0-0- 0- 1
and traverse it in reverse order using the same algorithm.
As for how to implement a Segmented Prefix Scan, I haven't implemented one myself but I found a couple of references that might be informative on how to do it: http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/24_algorithms.pdf and http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf (see page 23)

Related

non-binary tree search and insertion

I searched a bit but haven't found the answer to this question..
I built a non-binary tree, so each node can have any number of children (called n-ary tree i think)
To help with searching, I gave every node a number when i built the tree, so that every node's child nodes will be bigger, and all the node to the right of the it will be bigger as well.
something like this:
This way I get logn time for search
The problem comes when I want to insert nodes. This model would not work if I want to insert nodes anywhere but the end.
I thought of a few ways to do it,
insert the new nodes at the desired location, then update the number of all the nodes "behind" it.
instead of using a single number to represent each node, use an array of numbers. The numbers in the array will represent its location on that specific level. For example, node 1 will be {0}. Node 9 will be {0,2}. Node 7 will be {0, 0, 1, 2}. Now when inserting, I just need to change the numbers on that level.
forget all the numbering and just compare every node until the correct one is found. Insertion don't need to care about numbers either.
My question is, which way would be better? I'm not sure if using an array of integers to represent each node is very fast.. maybe it is still faster than the first way? Are there other ways of going about this?
Thank you in advance.

I gather that the problem you have is to assign a unique identifier to each node in such a way that you can find the node given its unique id in sublinear time.
This is not usually a problem for transient (in-memory) data structures, because typical tree implementations never move a node in memory (until it is deleted). That lets you use the address of the node as a unique identifier, which provides O(1) access. Languages other than C dress this up in an object like a tree iterator or node reference, but under the hood the principle is the same.
However, there are times when you really need to be able to attach a fixed-for-all-time identifier to a tree node, in a way which will be resilient against, for example, persisting the tree to permanent storage and then reserializing it in a different executable image.
One well-known hack is to use floating-point ids. When a new node is inserted, its id is assigned to be the average of its immediate neighbours. For the purpose of this computation, we pretend that there is a node on the left of the tree with id 0.0 and a node to the right with id 1.0, so every node has two neighbours, even if it is the new left- or right-most node. In particular, the root node is given the id 0.5, which is the average of the 0.0 and 1.0 imaginary boundary nodes.
Unfortunately, floating point precision is not infinite, and this hack works best if insertions are always at random places in the tree. If you insert all the nodes at the end, you will rapidly exhaust the floating point precision. You could periodically renumber all the nodes, but that defeats the purpose of having a permanent unchanging unique node id. (For some problem domains, it's acceptable, though.)
Of course, you don't really have to use floating point. A double on standard architecture has 53 bits of precision, which is plenty if your insertions are stochastic and very little if you always insert at the same place; you can use all 64 bits of an unsigned 64-bit integer by (conceptually) locating a fixed binary point prior to the high-order bit. The average computation works the same, except that you need to special case the computation with the 1.0 value.
This is essentially the same as your idea of labelling the nodes with a vector of indices. That scheme has the advantage of never running out of precision, and the disadvantage that the vectors can get quite long. You could also use a hybrid solution where you start a new level only when you run out of precision with the current level.

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted avg of the entropies of potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote:"For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list. "
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.

Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting that would usually be not the optimal result in terms of generaluzation.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one further often applies pruning (which can be seen as another greedy and heuristic method). And never methods like random forests or BART deal with this problem effectively by averaging over several trees -- so that the deviation of a single tree becomes less important.

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
What's how does the RRB-tree concatenation algorithm work?

They do describe an invariant in section 2.4 "However, as mentioned earlier
B-Trees nodes do not facilitate radix searching. Instead we chose
the initial invariant of allowing the node sizes to range between m
and m - 1. This defines a family of balanced trees starting with
well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This
invariant ensures balancing and achieves radix branch search in the
majority of cases. Occasionally a few step linear search is needed
after the radix search to find the correct branch.
The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.

#mcdowella is correct that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2 m) - 1 because this allows nodes to be split into 2 when they get too big, or 2 nodes joined into one when they are too small without ever needing to change a third node. So it's a typo that the "2" is missing in "2 m" in the paper. Jean Niklas L’orange's masters thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
Their tree also has an even height - all leaf nodes are equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one-another, though not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector because the RRB Tree has one inside it. The two differences between that and the RRB Tree are 1. the RRB Tree allows "relaxed" nodes and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PerSistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

Binary Search Tree for specific intent

We all know there are plenty of self-balancing binary search trees (BST), being the most famous the Red-Black and the AVL. It might be useful to take a look at AA-trees and scapegoat trees too.
I want to do deletions insertions and searches, like any other BST. However, it will be common to delete all values in a given range, or deleting whole subtrees. So:
I want to insert, search, remove values in O(log n) (balanced tree).
I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
It might be useful to delete several values in a row, before balancing the tree
I will most often insert 2 values at once, however this is not a rule (just a tip in case there is a tree data structure that takes this into account)
Is there a variant of AVL or RB that helps me on this? Scapegoat-trees look more like this, but would also need some changes, anyone who has got experience on them can share some thougts?
More precisely, which balancing procedure and/or removal procedure would help me keep this actions time-efficient?

It is possible to delete a range of values a BST in O(logn + objects num).
The easiest way I know is to work with the Deterministic Skip List data structure (you might want to read a bit about this data structure before you go on).
In the deterministic skip list all of the real values are stored in the bottom level, and there are pointers on upper levels to them. Insert, search and remove are done in O(logn).
The range deletion operation can be done according to the following algorithm:
Find the first element in the range - O(logn)
Go forward in the linked list, and remove all elements that are still in the range. If there are elements with pointers to the upper levels - remove them too, until reaching the topmost level (removal from a linked list) - O(number of deleted objects)
Fix the pointers to fit deterministic skip list (2-3 elements between every pointer upward)
The total complexity of the range delete is O(logn + number of objects in the range).
Notice that if you choose to work with a random skip list, you get the same complexity, but on average, and not worst case. The plus is that you don't have to fix the upper level pointers to meet the 2-3 demand.
A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.

Long ago in the pre-STL days I wrote my own B-Tree (BST) algorithm because I had a rather large data set at the time (roughly 700K items in 2 trees that were interdependent). I found that rebalancing after every 100-200 insertions/deletions was the peak performance I could get at the time based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.
In short, you could apply a modification trigger for the rebalancing, and allow for forced rebalancing when you've completed all your modifications.
The improvement was remarkable. The initial straight load was not complete after 25m (killed the process). Rebalancing as we went also was killed after 15m. The restricted modification loads with a rebalance every 100 mods loaded and ran in less than 3m. Note that during the "run" portion, there were 0-8 modification to the tree per initial entry. You really need to consider whether you always need to be in-balance when the tree will be modified again in the near term.

Hmm, what about B-trees? They are also balanced, and if you choose a big-order one --- it depends on how many items do you have ---, you will save a bunch of object creation/destruction times.
To 2. If you have a B-tree of order 100, you can remove up to 100 items by one function call.
To 3. This feature can be applied to almost any of the trees, just implement a RemoveSome() function that removes N items and does a rebalance. For B-trees, it's a bit trickier, but can be done.
Note: I supposed you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.

It should be easy to implement deleting a node and its sub nodes in an AVL tree if every node stores its height instead of a balance factor. After deleting a node keep rotating until the two child nodes differ by no more than one. Then move up the tree and repeat. The only real difference from a normal deletion will be a while instead of an if for testing the heights.

The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.

efficient algorithm to test _which_ sets a particular number belongs to

If I have a large set of continuous ranges ( e.g. [0..5], [10..20], [7..13],[-1..37] ) and can arrange those sets into any data-structure I like, what's the most efficient way to test which sets a particular test_number belongs to?
I've thought about storing the sets in a balanced binary tree based on the low number of a set ( and each node would have all the sets that have the same lowest number of their set). This would allow you to efficiently prune the number of sets based on whether the test_number you're testing against the sets is less than the lowest number of a set, and then prune that node and all the nodes to the right of that node ( which have a low number in their range which is greater than the test_number) . I think that would prune about 25% of the sets on average, but then I would need to linearly look at all the rest of the nodes in the binary tree to determine whether the test_number belonged in those sets. ( I could further optimize by sorting the lists of sets at any one node by the highest number in the set, which would allow me to do binary search within a specific list to determine which set, if any, contain the test_number. Unfortunately, most of the sets I'll be dealing with don't have overlapping set boundaries.)
I think that this problem has been solved in graphics processing since they've figured out ways to efficiently test which polygons in their entire model contribute to a specific pixel, but I don't know the terminology of that type of algorithm.

Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree. It is particularly-well suited for the counting query you want. See also its description in Computational Geometry.

I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree it should be a ternary tree. Each node should have left, middle, and right nodes, where the left node contains a set that is strictly less than the node set, the right is strictly greater, and the middle has overlap.
Set1
/ | \
/ | \
/ | \
Set2 Set3 Set4
It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.
I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.

I think I would organise them in the same way Mediawiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and is pretty easy to implement (even I've managed it, and in SQL at that!!).
Basically, the algorithm for sorting is
For Each SetOfNumbers
For Each NumberInSet
Put SetOfNumbers into Bin(NumberInSet)
Then to query, you can just count the number of items in Bin(MyNumber)
This approach will work well when your SetOfNumbers rarely changes, although if they change regularly it's generally not too hard to keep the Bins updated either. It's chief disadvantage is that it trades space, and initial sorting time, for very fast queries.
Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio