Efficient algorithm to test _which_ sets a particular number belongs to

If I have a large set of continuous ranges (e.g. [0..5], [10..20], [7..13], [-1..37]) and can arrange those sets into any data structure I like, what's the most efficient way to test which sets a particular test_number belongs to?
I've thought about storing the sets in a balanced binary tree keyed on the low number of each set (each node would hold all the sets that share that lowest number). This would let you prune efficiently: if the test_number is less than a node's low number, that node and all the nodes to its right (whose low numbers are greater still) can be discarded. I think that would prune about 25% of the sets on average, but then I would need to look linearly at all the remaining nodes in the tree to determine whether the test_number belongs to those sets. (I could further optimize by sorting the list of sets at any one node by the highest number in the set, which would allow me to binary-search within a specific list for the sets, if any, that contain the test_number. Unfortunately, most of the sets I'll be dealing with don't have overlapping boundaries.)
I think that this problem has been solved in graphics processing, since they've figured out ways to efficiently test which polygons in an entire model contribute to a specific pixel, but I don't know the terminology for that type of algorithm.

Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree; it is particularly well suited to the stabbing query you want (reporting all intervals that contain a point). See also its description in Computational Geometry.
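For concreteness, here is a minimal Python sketch of a segment tree answering stabbing queries. All names are mine and this illustrates the idea rather than the canonical textbook construction: leaves alternate between endpoint values and the open gaps between them, so a query point can be routed exactly.

import bisect

class SegmentTree:
    """Stabbing queries over closed ranges: stab(x) returns the indices
    of every stored range [lo, hi] with lo <= x <= hi. Even leaves are
    endpoint values, odd leaves are the open gaps between them."""

    def __init__(self, ranges):
        self.ranges = list(ranges)
        self.xs = sorted({p for r in self.ranges for p in r})
        self.nleaf = max(2 * len(self.xs) - 1, 1)
        self.store = [[] for _ in range(4 * self.nleaf)]  # range ids per node
        for i, (lo, hi) in enumerate(self.ranges):
            a = 2 * bisect.bisect_left(self.xs, lo)
            b = 2 * bisect.bisect_left(self.xs, hi)
            self._insert(1, 0, self.nleaf - 1, a, b, i)

    def _insert(self, node, nlo, nhi, a, b, i):
        if b < nlo or nhi < a:
            return                        # no overlap with this node's slab
        if a <= nlo and nhi <= b:
            self.store[node].append(i)    # slab lies fully inside range i
            return
        mid = (nlo + nhi) // 2
        self._insert(2 * node, nlo, mid, a, b, i)
        self._insert(2 * node + 1, mid + 1, nhi, a, b, i)

    def stab(self, x):
        if not self.xs or x < self.xs[0] or x > self.xs[-1]:
            return []
        k = bisect.bisect_right(self.xs, x) - 1
        leaf = 2 * k if x == self.xs[k] else 2 * k + 1
        out, node, nlo, nhi = [], 1, 0, self.nleaf - 1
        while True:                       # walk root -> leaf, collecting
            out.extend(self.store[node])
            if nlo == nhi:
                return out
            mid = (nlo + nhi) // 2
            if leaf <= mid:
                node, nhi = 2 * node, mid
            else:
                node, nlo = 2 * node + 1, mid + 1

t = SegmentTree([(0, 5), (10, 20), (7, 13), (-1, 37)])
print([t.ranges[i] for i in t.stab(12)])  # -> [(10, 20), (7, 13), (-1, 37)]

Each range is stored in O(log n) canonical nodes, so the structure takes O(n log n) space, and a query costs O(log n + k) for k reported ranges.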

I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree, it should be a ternary tree. Each node has left, middle, and right children, where the left subtree holds sets strictly less than the node's set, the right holds sets strictly greater, and the middle holds sets that overlap it.
       Set1
      / |  \
     /  |   \
    /   |    \
Set2   Set3   Set4
It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.
I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.
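A minimal Python sketch of that scheme (all names are mine): insertion routes a set left, right, or middle by comparing endpoints, and a point query only descends the sides that could contain the number.

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.mid = self.right = None

def insert(node, lo, hi):
    if node is None:
        return Node(lo, hi)
    if hi < node.lo:                      # strictly less than node's set
        node.left = insert(node.left, lo, hi)
    elif lo > node.hi:                    # strictly greater
        node.right = insert(node.right, lo, hi)
    else:                                 # overlaps node's set
        node.mid = insert(node.mid, lo, hi)
    return node

def query(node, x, out):
    if node is None:
        return
    if node.lo <= x <= node.hi:
        out.append((node.lo, node.hi))
    if x < node.lo:
        query(node.left, x, out)
    elif x > node.hi:
        query(node.right, x, out)
    # overlapping sets can extend past this node's bounds in either
    # direction, so the middle subtree must always be searched
    query(node.mid, x, out)

root = None
for lo, hi in [(0, 5), (10, 20), (7, 13), (-1, 37)]:
    root = insert(root, lo, hi)
hits = []
query(root, 12, hits)
print(hits)   # -> [(10, 20), (7, 13), (-1, 37)]

Note that the middle subtree can never be pruned, since an overlapping set may extend past the node's bounds in either direction; that's why this helps most when, as the poster says, overlaps are rare.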

I think I would organise them the same way MediaWiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and it's pretty easy to implement (even I've managed it, and in SQL at that!).
Basically, the algorithm for sorting is:

For Each SetOfNumbers
    For Each NumberInSet
        Put SetOfNumbers into Bin(NumberInSet)
Then to query, you can just count the number of items in Bin(MyNumber)
This approach will work well when your SetOfNumbers rarely changes, although if they change regularly it's generally not too hard to keep the bins updated either. Its chief disadvantage is that it trades space, and initial sorting time, for very fast queries.
Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.
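A minimal Python sketch, assuming the ranges are over integers (the expansion step only makes sense for a discrete, enumerable domain; the names are mine):

from collections import defaultdict

ranges = {"a": (0, 5), "b": (10, 20), "c": (7, 13), "d": (-1, 37)}

bins = defaultdict(list)          # Bin(number) -> names of containing sets
for name, (lo, hi) in ranges.items():
    for n in range(lo, hi + 1):   # enumerate every number in the range
        bins[n].append(name)

def query(x):
    return bins.get(x, [])        # O(1) lookup, paid for with O(total span) space

print(query(12))                  # -> ['b', 'c', 'd']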

Related

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted average of the entropies of the potential left and right branches, and the feature + splitting feature_value that gives the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node (where m is the number of distinct feature_values at the node), is the same as trying ONLY m - 1 splits:
1. Sort the m distinct feature_values by the percentage of 1's among the samples within the node that take that feature_value for that feature.
2. Only try the m - 1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote: "For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list."
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
"Can someone explain why the above process, which requires (2^m - 2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m - 1 splits?"
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way requires searching an exponential number of candidate partitions and is thus hardly feasible for any problem in practice. Moreover, due to overfitting, the exact optimum would usually not be the best split in terms of generalization anyway.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one often also applies pruning (which can be seen as another greedy, heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees -- so that the deviation of a single tree becomes less important.
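For illustration, here is a minimal Python sketch of the m - 1 ordered-splits procedure for binary classification, using Gini impurity as the criterion (the impurity choice and all names are mine):

def gini(pos, neg):
    # Gini impurity of a group holding pos positives and neg negatives
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_ordered_split(counts):
    # counts: {category: (n_pos, n_neg)}. Sort categories by fraction of
    # positives, then evaluate only the m - 1 prefix splits of that order.
    cats = sorted(counts, key=lambda c: counts[c][0] / sum(counts[c]))
    tot_pos = sum(p for p, _ in counts.values())
    tot_neg = sum(n for _, n in counts.values())
    n = tot_pos + tot_neg
    best_set, best_w = None, float("inf")
    lpos = lneg = 0
    for k in range(len(cats) - 1):        # the m - 1 candidate splits
        lpos += counts[cats[k]][0]
        lneg += counts[cats[k]][1]
        rpos, rneg = tot_pos - lpos, tot_neg - lneg
        w = ((lpos + lneg) / n) * gini(lpos, lneg) \
            + ((rpos + rneg) / n) * gini(rpos, rneg)
        if w < best_w:
            best_set, best_w = set(cats[:k + 1]), w
    return best_set, best_w

print(best_ordered_split({"red": (8, 2), "green": (1, 9), "blue": (5, 5)}))
# -> ({'green'}, 0.363...): send 'green' left, 'blue' and 'red' right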

GPU-based inclusive scan on an unbalanced tree

I have the following problem: I need to compute the inclusive scans (e.g. prefix sums) of values based on a tree structure on the GPU. These scans are either from the root node (top-down) or from the leaf nodes (bottom-up). The case of a simple chain is easily handled, but the tree structure makes parallelization rather difficult to implement efficiently.
For instance, in the example tree (root 0, with the four root-to-leaf chains 0-1-2-3, 0-4-5, 0-6-7-8-9-10, and 0-6-7-8-11-12 listed in the answers below), after a top-down inclusive scan, (12) would hold (0)[op](6)[op](7)[op](8)[op](11)[op](12), and for a bottom-up inclusive scan, (8) would hold (8)[op](9)[op](10)[op](11)[op](12), where [op] is a given binary operator (matrix addition, multiplication etc.).
One also needs to consider the following points:
For a typical scenario, the length of the different branches should not be too long (~10), with something like 5 to 10 branches, so this is something that will run inside a block and work will be split between the threads. Different blocks will simply handle different values of nodes. This is obviously not optimal regarding occupancy, but this is a constraint on the problem that will be tackled sometime later. For now, I will rely on Instruction-level parallelism.
The structure of the graph cannot be changed (it describes an actual system), thus it cannot be balanced (or only by changing the root of the tree, e.g. using (6) as the new root). Nonetheless, a typical tree should not be too unbalanced.
I currently use CUDA for GPGPU, so I am open to any CUDA-enabled template library that can solve this issue.
Node data is already in global memory and the result will be used by other CUDA kernels, so the objective is just to achieve this without making it a huge bottleneck.
There is no "cycle", i.e. branches cannot merge down the tree.
The structure of the tree is fixed and set in an initialization phase.
A single binary operation can be quite expensive (e.g. multiplication of polynomial matrices, i.e. each element is a polynomial of a given order).
In this case, what would be the "best" data structure (for the tree structure) and the best algorithms (for the inclusive scans/prefix sums) to solve this problem?
Probably a harebrained idea, but imagine that you insert nodes of 0 value into the tree, in such a way that you get a 2D matrix. For instance, there would be 3 zero value nodes below the 5 node in your example. Then use one thread to travel each level of the matrix horizontally. For the top-down prefix sum, offset the threads in such a way that each lower thread is delayed by the maximum number of branches the tree can have in that location. So, you get a "wave" with a slanted edge running over the matrix. The upper threads, being further along, calculate those nodes in time for them to be processed further by threads running further down. You would need the same number of threads as the tree is deep.
I think a parallel prefix scan may not be suitable for your case, because:
The parallel prefix scan algorithm increases the total number of [op]: in your link on prefix sums, a 16-input parallel prefix scan requires 26 [op], while a sequential version needs only 15. A parallel algorithm only performs better on the assumption that there are enough hardware resources to run multiple [op] in parallel.
You could evaluate the cost of your [op] before trying the parallel prefix scan.
On the other hand, since the size of the tree is small, I think you could simply consider your tree as 4 independent simple chains (one per leaf node), and use concurrent kernel execution to improve the performance of these 4 prefix-scan kernels:
0-1-2-3
0-4-5
0-6-7-8-9-10
0-6-7-8-11-12
I think that in the Kepler GK110 architecture you can invoke kernels recursively, which they call dynamic parallelism. So, if you need to compute the sum of the values at each node of the tree, dynamic parallelism would help. However, the depth of recursion might be a constraint.
My first impression is that you could organize the tree nodes in a 1 dimensional array, similar to what Eric suggested. And then you could do a Segmented Prefix Sum Scan (http://en.wikipedia.org/wiki/Segmented_scan) over that array.
Using your tree nodes as an example, the 1-dim array would look like:
0-1-2-3-0-4-5-0-6-7-8-9-10-0-6-7-8-11-12
And then you would have a parallel array of flags indicating where a new list begins (by list I mean a sequence beginning at the root and ending at a leaf node):
1-0-0-0-1-0-0-1-0-0-0-0-0-1-0-0-0-0-0
For the bottom-up case, you could create a separate segment-flag array like so:
0-0-0-1-0-0-1-0-0-0-0-0-1-0-0-0-0-0-1
and traverse it in reverse order using the same algorithm.
As for how to implement a Segmented Prefix Scan, I haven't implemented one myself but I found a couple of references that might be informative on how to do it: http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/24_algorithms.pdf and http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf (see page 23)
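As a sanity check on the semantics before writing a CUDA version, here is a minimal sequential Python reference for a segmented inclusive scan over those two arrays (names and structure are mine; the node labels stand in for the node values):

def segmented_inclusive_scan(values, flags, op):
    """Inclusive scan that restarts whenever flags[i] == 1."""
    out, acc = [], None
    for v, f in zip(values, flags):
        acc = v if (f == 1 or acc is None) else op(acc, v)
        out.append(acc)
    return out

values = [0, 1, 2, 3, 0, 4, 5, 0, 6, 7, 8, 9, 10, 0, 6, 7, 8, 11, 12]
flags  = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(segmented_inclusive_scan(values, flags, lambda a, b: a + b))
# each segment accumulates root-to-leaf independently, e.g. the last
# segment yields 0, 6, 13, 21, 32, 44 for the path 0-6-7-8-11-12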

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into a node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, perhaps because I'm not sure what each of the variables means (in particular "a total of p sub-tree branches").
See also: How does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4: "However, as mentioned earlier B-Trees nodes do not facilitate radix searching. Instead we chose the initial invariant of allowing the node sizes to range between m and m - 1. This defines a family of balanced trees starting with well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This invariant ensures balancing and achieves radix branch search in the majority of cases. Occasionally a few step linear search is needed after the radix search to find the correct branch. The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
@mcdowella is correct that that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and 2m - 1, because this allows nodes to be split in two when they get too big, or two nodes to be joined into one when they are too small, without ever needing to change a third node. So it's a typo that the "2" is missing from "2m" in the paper. Jean Niklas L'orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length, which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
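A minimal Python sketch of the difference (names are mine, not from the paper): descending a strict node is pure shift-and-mask, while a relaxed node carries a cumulative-size table consulted by a radix guess plus a short forward linear search.

BITS = 5                          # branching factor 32 = 1 << BITS
MASK = (1 << BITS) - 1

def strict_slot(i, level):
    # Strict node: the child slot for index i at this level (leaves are
    # level 0) is a fixed bit field of i -- no memory lookups needed.
    return (i >> (BITS * level)) & MASK

def relaxed_slot(i, level, size_table):
    # Relaxed node: size_table[k] holds the total number of elements in
    # children 0..k. Start from the radix guess, then do the paper's
    # "few step linear search" until the cumulative count exceeds i.
    slot = (i >> (BITS * level)) & MASK
    while size_table[slot] <= i:
        slot += 1
    return slot

Here i is the index relative to the current subtree; in a relaxed node the radix guess can only undershoot, never overshoot, because relaxed children hold at most as many elements as full strict ones.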
Their tree also has uniform height - all leaf nodes are the same distance from the root. I think it would still work if relaxed subtrees only had to be within, say, one level of one another, though I'm not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector, because the RRB Tree has one inside it. The two differences between that and the RRB Tree are: 1. the RRB Tree allows "relaxed" nodes, and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

Data structure for range query

I was recently asked a coding question on the below problem.
I have some solutions to this problem, but I am not very sure whether they are the most efficient.
Problem:
Write a program to track a set of text ranges. Start point and end point will be strings.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support following operations on this range
Add range - this should merge the ranges if applicable
Delete range - this deletes range from tracked ranges and recompute the ranges
Query range - Given a character, function should return whether it is part of any of tracked ranges or not.
Note that the tracked ranges can be discontinuous.
My solutions:
I came up with two approaches.
Store ranges as a doubly linked list, or
Store ranges as some sort of balanced tree, with leaf nodes holding the actual data and inter-connected as a linked list.
Do you think these solutions are good enough, or can you think of a better way of doing this so that those three APIs give the best performance?
You are probably looking for an interval tree.
Use the data structure with your custom comparator to indicate "What's on range", and you will be able to do the required operations efficiently.
Note that an interval tree is actually an efficient way to implement your 2nd idea (store ranges as some sort of balanced tree).
I'm not clear on what the "delete range" operation is supposed to do. Does it:
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added?
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the datastructure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; even means it's outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) (if the inserted/deleted range is large, anyway).
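A minimal Python sketch of the first approach (both endpoints of every disjoint range in one sorted vector), assuming closed ranges where touching ranges merge; all names are mine:

import bisect

class DisjointRanges:
    def __init__(self):
        self.bounds = []   # flat sorted list: lo0, hi0, lo1, hi1, ...

    def contains(self, x):
        i = bisect.bisect_left(self.bounds, x)
        if i < len(self.bounds) and self.bounds[i] == x:
            return True            # x is exactly an endpoint of some range
        return i % 2 == 1          # odd index: x falls between a lo and its hi

    def add(self, lo, hi):
        i = bisect.bisect_left(self.bounds, lo)
        j = bisect.bisect_right(self.bounds, hi)
        new = []
        if i % 2 == 0:             # lo lands in a gap: it starts the merged range
            new.append(lo)
        if j % 2 == 0:             # hi lands in a gap: it ends the merged range
            new.append(hi)
        self.bounds[i:j] = new     # splice out everything swallowed in between

r = DisjointRanges()
r.add(0, 5); r.add(10, 20); r.add(3, 12)   # merges everything into [0, 20]
print(r.bounds, r.contains(8))             # -> [0, 20] True

Deletion is the mirror image (keep lo or hi in the splice when it lands inside an existing range). The splice in add() is the O(n) manipulation mentioned above; the membership test itself is a single binary search.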
Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before it. You could use this as a building block: if the structure contains a set of disjoint ranges, then its contents alternate between the least element of a range and the greatest element of a range, over and over. When you add a range, you can check whether you have preserved this property; if not, you need to merge ranges. Similarly, you want to preserve it when you delete. Then you can query by just looking to see whether there is a least element just before your query point and a greatest element just after.
If you want to create your own datastructure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for a B+ tree; it's the same as what you have mentioned in your second approach.
Here are some properties of a B+ tree:
All data is stored in leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are a few applications of B+ trees:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically, it helps with range-query lookups and minimizes tree traversal.

Data Structure for Storing Ranges

I am wondering if anyone knows of a data structure which would efficiently handle the following situation:
The data structure should store several, possibly overlapping, variable length ranges on some continuous timescale.
For example, you might add the ranges a:[0,3], b:[4,7], c:[0,9].
Insertion time does not need to be particularly efficient.
Retrievals would take a range as a parameter, and return all the ranges in the set that overlap with the range, for example:
Get(1,2) would return a and c. Get(6,7) would return b and c. Get(2,6) would return all three.
Retrievals need to be as efficient as possible.
One data structure you could use is a one-dimensional R-tree. These are designed to deal with ranges and to provide efficient retrieval. You will also learn about Allen's Operators; there are a dozen other relationships between time intervals besides just 'overlaps'.
There are other questions on SO that impinge on this area, including:
Determine Whether Two Date Ranges Overlap
Data structure for non-overlapping ranges within a single dimension
You could go for a binary tree that stores the ranges in a hierarchy. Starting from the root node, which represents an all-encompassing range divided at its middle, you test whether the range you are trying to insert belongs to the left subrange, the right subrange, or both, and recursively carry on in the matching subnodes until you reach a certain depth, at which point you save the actual range.
For lookup, you test your input range against the left and right subranges of the top node, and descend into the ones that overlap, repeating until you have reached the actual stored ranges, which you collect.
This way, retrieval has logarithmic complexity. You'd still need to manage duplicates in your retrieval, as some ranges will belong to several nodes.
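A minimal Python sketch of that idea (all names are mine); the final exact overlap test together with set-based collection handles the duplicates mentioned above:

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.items = []                  # ranges stored at this depth
        self.left = self.right = None

def insert(node, name, lo, hi, depth):
    if depth == 0:
        node.items.append((lo, hi, name))
        return
    mid = (node.lo + node.hi) / 2.0
    if lo <= mid:                        # overlaps the left half
        if node.left is None:
            node.left = Node(node.lo, mid)
        insert(node.left, name, lo, hi, depth - 1)
    if hi >= mid:                        # overlaps the right half
        if node.right is None:
            node.right = Node(mid, node.hi)
        insert(node.right, name, lo, hi, depth - 1)

def get(node, lo, hi, out):
    if node is None:
        return
    for (a, b, name) in node.items:
        if a <= hi and lo <= b:          # exact overlap check
            out.add(name)
    mid = (node.lo + node.hi) / 2.0
    if lo <= mid:
        get(node.left, lo, hi, out)
    if hi >= mid:
        get(node.right, lo, hi, out)

root = Node(0, 16)
for name, (lo, hi) in {"a": (0, 3), "b": (4, 7), "c": (0, 9)}.items():
    insert(root, name, lo, hi, depth=3)
hits = set()
get(root, 1, 2, hits)
print(hits)                              # -> {'a', 'c'}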
