Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted average of the entropies of the potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote: "For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list."
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
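To make the two procedures in the question concrete, here is a minimal Python sketch that runs both on one categorical feature at one node. The per-category counts and all function names are hypothetical, chosen only for illustration; the impurity measure is the binary entropy described above.

```python
from itertools import combinations
from math import log2


def entropy(pos, n):
    """Binary entropy of a group with `pos` ones among `n` samples."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def gain(counts, left):
    """Entropy drop when the categories in `left` go to the left branch.

    counts maps category -> (number of ones, number of samples).
    """
    pos = sum(p for p, _ in counts.values())
    n = sum(t for _, t in counts.values())
    lp = sum(counts[c][0] for c in left)
    ln = sum(counts[c][1] for c in left)
    rp, rn = pos - lp, n - ln
    return entropy(pos, n) - (ln / n) * entropy(lp, ln) - (rn / n) * entropy(rp, rn)


def best_exhaustive(counts):
    """Try every two-way partition: (2^m - 2) / 2 candidates."""
    cats = sorted(counts)
    first, rest = cats[0], cats[1:]
    best = 0.0
    # Pin the first category to the left branch so mirror images are not
    # counted twice; never put every category on the left.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            best = max(best, gain(counts, (first,) + extra))
    return best


def best_sorted(counts):
    """Sort categories by their fraction of ones, then try the m - 1 cuts."""
    order = sorted(counts, key=lambda c: counts[c][0] / counts[c][1])
    return max(gain(counts, order[:i]) for i in range(1, len(order)))


# Hypothetical node: five categories, ten samples each.
counts = {"a": (1, 10), "b": (8, 10), "c": (3, 10), "d": (9, 10), "e": (5, 10)}
print(best_exhaustive(counts), best_sorted(counts))
```

Running both functions on the counts from a real node is a direct way to check whether the best entropy drop found by the m-1 ordered cuts matches the one found by the exhaustive search.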

Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting, that would usually not be the optimal result in terms of generalization.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one often further applies pruning (which can be seen as another greedy and heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.

Related

Reorder a sequence with minimum number of swaps to fulfil partial order constraints

Input: An array of elements and a partial order on a subset of those elements, seen as a constraint set.
Output: An array (or any ordered sequence) fulfilling the partial order.
The Problem: How can one achieve the reordering efficiently? The number of introduced inversions (or swaps) compared to the original input sequence should be as small as possible. Note that the partial order can be defined for any number of elements (some elements may not be part of it).
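As a way to pin down the objective, the following Python sketch scores a candidate output: it counts the inversions relative to the original array and checks the constraints. It does not solve the problem. The function names are hypothetical, elements are assumed to be distinct and hashable, and the partial order is assumed to be given as (before, after) pairs.

```python
def inversions(original, candidate):
    """Number of pairs whose relative order differs between the two sequences."""
    pos = {x: i for i, x in enumerate(original)}
    seq = [pos[x] for x in candidate]
    # Quadratic pair scan; fine for layer sizes typical of a graph drawing.
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def satisfies(candidate, constraints):
    """True if every (before, after) pair appears in that order."""
    where = {x: i for i, x in enumerate(candidate)}
    return all(where[a] < where[b] for a, b in constraints)


original = list("ABCDE")
constraints = [("E", "B")]          # E must come before B
candidate = list("AECDB")           # swap only the two constrained elements

print(satisfies(original, constraints), inversions(original, original))
print(satisfies(candidate, constraints), inversions(original, candidate))
```

Any reordering strategy can then be compared against another by the pair (constraints satisfied, inversions introduced).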
The context: It arises from a situation in 2-layer graph crossing reduction: After a crossing reduction phase, I want to reorder some of the nodes (thus, the partial order may contain only a small subset).
In general, I had the idea to weaken this a little bit and solve the problem only for the elements that are part of the partial order (though I think this could lead to non-optimal results). Thus, if I have a sequence A B C D E and the partial order only contains A, B and E, then C and D will stay at the same place. It somehow reminds me of the Kemeny Score, but I couldn't yet turn that into an algorithm.
Just to be sure: I am not searching for a topological sort. This would probably introduce a lot more inversions than required.
Edit 1:
Changed wording (sequence to array).
The amount of additional space for solving the problem can be arbitrarily large (well, polynomially bounded). Of course, less is better :) So, something like O(ArrayLen*ArrayLen) at most would be fantastic.
Why the minimum number of swaps or inversions: As this procedure is part of crossing reduction, the input array's ordering is (hopefully) close to an optimum in terms of edge crossings with the second node layer. Every additional swap or inversion would then probably introduce edge crossings again. But in the process of computing the output, the number of swaps or movements done is not really important (though, again, something linear or quadratic would be cool), as only the output quality is important. Right now, I require the constraints to be in a total order and only inspect the nodes of that order, so it becomes trivial to solve. But partial order constraints would be more flexible.
I found a paper which looks promising: "A Fast and Simple Heuristic for Constrained Two-Level Crossing Reduction" by Michael Forster.
Together with the comments below my question, it is answered. Thanks again, @j_random_hacker!

Is the linear formation the best sorting production?

A sorting method usually produces a linearly ordered result (such as "1,7,8,13,109..."), which takes O(N) to search.
Why not sort into a non-linear order, so that finding an element takes O(log N) or so, by iteration, Newton's method, etc.? Is it expensive to build such a higher-order sorted structure?
In short, is it feasible to sort results so that they can be accessed by finding the roots of ax^2 + bx + c = 0? (By contrast, the usual case corresponds to finding the root of ax + c = 0.) For example, we take x1 = 1, x2 = 2 as the roots of a quadratic equation and then insert the following xi. It would then be possible to use smarter ways to search.
I suppose the difficulties lie in these aspects:
Predicting the data can be rather hard, so we cannot construct a general formula that describes the subsequent numbers well (they may be hash values).
Because of the first difficulty, numbers outside a certain range can diverge (example graphed by Google: the graph). The values derived outside [-1,3] are really large, and the difficulty of evaluating the original formula increases rapidly.
This is actually equivalent to hashing, which creates a table that contains the values, with a formula as the production rule.
Executing a "smarter" search may be expensive because of the complexity of the algorithm itself.
Smarter schemes which take advantage of a known statistical distribution are typically faster by some constant. However, that still keeps them at O(log N), which is the same as a trivial binary search. The reason is that in each step, they typically narrow down the range of elements to search by a factor R > 2; for simple binary search that's just R = 2. But you need log(N)/log(R) steps to narrow it down to exactly one element.
Now whether this is a net win depends on log(R) versus the work needed at each step. A simple comparison (for binary search) takes a few cycles. As soon as you need anything more complex than +-*/ (say exp or log) to predict the location of the next element, the profit of needing fewer steps is gone.
So, in summary: binary search is used because each step is efficient, for many real-world distributions.
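One well-known scheme of the distribution-aware kind is interpolation search, which guesses the position from the key value. The Python sketch below puts it next to plain binary search and counts the probes each one makes. It assumes sorted integer keys; the step counters are added only for illustration.

```python
def binary_search(a, x):
    """Classic binary search; returns (index or -1, number of probes)."""
    lo, hi, steps = 0, len(a) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid, steps
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


def interpolation_search(a, x):
    """Guess the position assuming keys are spread roughly uniformly."""
    lo, hi, steps = 0, len(a) - 1, 0
    while lo <= hi and a[lo] <= x <= a[hi]:
        steps += 1
        if a[hi] == a[lo]:
            # All remaining keys are equal, and x lies between them.
            return lo, steps
        # Linear interpolation between the two ends: one multiply, one divide.
        mid = lo + (x - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[mid] == x:
            return mid, steps
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


keys = list(range(0, 1000, 3))   # perfectly uniform keys, the friendliest case
print(binary_search(keys, 300))
print(interpolation_search(keys, 300))
```

On uniform keys the interpolated guess lands on or near the target immediately, but each probe costs a multiplication and a division instead of a single comparison, which is the trade-off described above.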

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
How does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4: "However, as mentioned earlier B-Trees nodes do not facilitate radix searching. Instead we chose the initial invariant of allowing the node sizes to range between m and m - 1. This defines a family of balanced trees starting with well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This invariant ensures balancing and achieves radix branch search in the majority of cases. Occasionally a few step linear search is needed after the radix search to find the correct branch. The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
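The "radix guess plus a few linear steps" lookup can be sketched for a single relaxed node as follows. This is my own illustration in Python, not code from the paper: it assumes the node stores a table of cumulative child sizes, that each child holds at most 2^shift elements, and that the index is smaller than the node's total size. The function name and the tiny branching factor are hypothetical.

```python
def find_slot(cum_sizes, index, shift):
    """Locate the child that holds `index` inside one relaxed node.

    cum_sizes[i] is the total number of elements in children 0..i.
    Returns (slot, index within that child, extra linear steps taken).
    """
    # Radix guess: exact if every child to the left is completely full.
    slot = index >> shift
    extra = 0
    # Under-full children can only push the target to the right,
    # so the guess is corrected by stepping forward.
    while cum_sizes[slot] <= index:
        slot += 1
        extra += 1
    offset = index - (cum_sizes[slot - 1] if slot else 0)
    return slot, offset, extra


# Hypothetical node with m = 4 (shift = 2) and children of sizes 3, 4, 3, 4.
cum_sizes = [3, 7, 10, 14]
print(find_slot(cum_sizes, 9, 2))   # guess is already right
print(find_slot(cum_sizes, 7, 2))   # guess is one slot too low
```

The more elements the children to the left are missing in total, the further the guess can be off, which is why the bound on extra steps grows with the height of the subtree.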
@mcdowella is correct that that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2 m) - 1, because this allows nodes to be split into 2 when they get too big, or 2 nodes to be joined into one when they are too small, without ever needing to change a third node. So it's a typo that the "2" is missing in "2 m" in the paper. Jean Niklas L’orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
Their tree also has an even height - all leaf nodes are equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one-another, though not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector because the RRB Tree has one inside it. The two differences between that and the RRB Tree are 1. the RRB Tree allows "relaxed" nodes and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

Looking for a multidimensional optimization algorithm

Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element has a different distribution of these attributes. This distribution is expressed through positive integer values. For example, element 1 has the attributes A: 42 B: 1337 C: 18. The sum of these attributes is not the same for all elements. Some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories from which each contains up to ~30 elements, so bruteforcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can strip away the repeated computation of already computed values. If we reach a level where one path has less value in all three attributes, we don't follow it anymore from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Can anybody think of an algorithm which may help me solve this problem? Or could you explain why you think that no such algorithm exists?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for this problem and see how well you can apply it to your stated problem.
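One knapsack-style way to approach it is a dynamic program over the categories, sketched below in Python. This is an assumption-laden illustration rather than a reference solution: it relies on the thresholds being minimums (so sums of A and B can be capped at the threshold), and the function name and sample data are hypothetical.

```python
def best_c(categories, need_a, need_b):
    """Maximum total C when picking one element per category such that
    total A >= need_a and total B >= need_b; None if that is impossible.

    Each element is a tuple (a, b, c).
    """
    # State: (A so far, B so far), both capped at the thresholds,
    # mapped to the best C reachable in that state.
    states = {(0, 0): 0}
    for elements in categories:
        nxt = {}
        for (a, b), c in states.items():
            for ea, eb, ec in elements:
                # Anything above a threshold is as good as exactly meeting it.
                key = (min(a + ea, need_a), min(b + eb, need_b))
                if nxt.get(key, -1) < c + ec:
                    nxt[key] = c + ec
        states = nxt
    return states.get((need_a, need_b))


# Two hypothetical categories with two elements each.
categories = [
    [(50, 100, 5), (10, 10, 20)],
    [(40, 60, 7), (30, 50, 9)],
]
print(best_c(categories, 80, 150))
```

Because sums are capped, there are at most (need_a + 1) * (need_b + 1) states per category, so with thresholds like 80 and 150 and 15 categories of 30 elements the work stays far below 30^15.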
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++){
    walk(budget);
    // if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
If that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.

Looking for a sort algorithm with as few as possible compare operations

I want to sort items where the comparison is performed by humans:
Pictures
Priority of work items
...
For these tasks the number of comparisons is the limiting factor for performance.
What is the minimum number of comparisons needed (I assume > N for N items)?
Which algorithm guarantees this minimum number?
To answer this, we need to make a lot of assumptions.
Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.
As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.
Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.
Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people in height order, and quickly. The Navy tells people to get in line, then if you're shorter than the guy in front of you, switch places, and repeat until done. In the worst case, it's N*N comparisons.
The Air Force tells people to stand in a square grid. They shuffle front-to-back on sqrt(N) people, which means worst case sqrt(N)*sqrt(N) == N comparisons. However, the people are only sorted along one dimension. So therefore, the people face left, then do the same shuffle again. Now we're up to 2*N comparisons, and the sort is still imperfect but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to get to perfection effectively. You already know that the very shortest and very tallest men are in two corners. The second-shortest might be behind or beside the shortest, the third-shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.
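The two-pass grid shuffle can be simulated in a few lines of Python. This sketch uses Python's built-in sorted() for each column and row, so it models only the resulting arrangement, not the number of comparisons people would make; the function name and the heights are made up.

```python
def grid_sort(grid):
    """Sort every column front-to-back, then every row left-to-right."""
    rows, cols = len(grid), len(grid[0])
    # First pass: each column is put in height order.
    columns = [sorted(grid[r][c] for r in range(rows)) for c in range(cols)]
    by_column = [[columns[c][r] for c in range(cols)] for r in range(rows)]
    # Second pass: everyone faces left and each row is put in height order.
    return [sorted(row) for row in by_column]


heights = [
    [170, 182, 165],
    [190, 160, 175],
    [168, 185, 172],
]
for row in grid_sort(heights):
    print(row)
```

After both passes the shortest person stands in one corner and the tallest in the opposite one, with rows and columns each in order, yet reading the grid row by row does not generally give a fully sorted line.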
Looking back at the graph analogy, the eight nodes to present each round are eight of those with the currently most common length of longest inbound path. The length of the longest inbound path also represents the node's minimum possible sorted rank.
You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.
From an assignment I once did on this very subject ...
The comparison counts are for various sorting algorithms operating on data in a random order
Size      QkSort    HpSort    MrgSort   ModQk     InsrtSort
2500      31388     48792     25105     27646     1554230
5000      67818     107632    55216     65706     6082243
10000     153838    235641    120394    141623    25430257
20000     320535    510824    260995    300319    100361684
40000     759202    1101835   561676    685937
80000     1561245   2363171   1203335   1438017
160000    3295500   5045861   2567554   3047186
These comparison counts are for various sorting algorithms operating on data that starts out 'nearly sorted'. Amongst other things, it shows the pathological case of quicksort.
Size      QkSort    HpSort    MrgSort   ModQk     InsrtSort
2500      72029     46428     16001     70618     76050
5000      181370    102934    34503     190391    3016042
10000     383228    226223    74006     303128    12793735
20000     940771    491648    158015    744557    50456526
40000     2208720   1065689   336031    1634659
80000     4669465   2289350   712062    3820384
160000    11748287  4878598   1504127   10173850
From this we can see that merge sort is the best by number of comparisons.
I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.
Pigeon hole sorting is order N and works well with humans if the data can be pigeon holed. A good example would be counting votes in an election.
You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.
People are really good at ordering 5-10 things from best to worst and come up with more consistent results when doing so. I think trying to apply a classical sorting algo might not work here because of the typically human multi-compare approach.
I'd argue that you should have a round robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.
It'd be interesting to write too :)
If comparisons are expensive relative to book-keeping costs, you might try the following algorithm which I call "tournament sort". First, some definitions:
Every node has a numeric "score" property (which must be able to hold values from 1 to the number of nodes), as well as "last-beat" and "fellow-loser" properties, which must be able to hold node references.
A node is "better" than another node if it should be output before the other.
An element is considered "eligible" if there are no elements known to be better than it which have not been output, and "ineligible" if any element which has not been output is known to be better than it.
The "score" of a node is the number of nodes it's known to be better than, plus one.
To run the algorithm, initially assign every node a score of 1. Repeatedly compare the two lowest-scoring eligible nodes; after each comparison, mark the loser "ineligible", and add the loser's score to the winner's (the loser's score is unaltered). Set the loser's "fellow loser" property to the winner's "last-beat", and the winner's "last-beat" property to the loser. Iterate this until only one eligible node remains. Output that node, and make eligible all nodes the winner beat (using the winner's "last-beat" and the chain of "fellow-loser" properties). Then continue the algorithm on the remaining nodes.
The number of comparisons with 1,000,000 items was slightly lower than that of a stock library implementation of Quicksort; I'm not sure how the algorithm would compare against a more modern version of QuickSort. Bookkeeping costs are significant, but if comparisons are sufficiently expensive the savings could possibly be worth it. One interesting feature of this algorithm is that it will only perform comparisons relevant to determining the next node to be output; I know of no other algorithm with that feature.
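The description above can be turned into a short Python sketch. This is my reading of the algorithm, not the author's code: a binary heap keyed by score is an assumed way to find the two lowest-scoring eligible nodes, nodes are represented by their indices, and the `better` callback stands in for the expensive comparison.

```python
import heapq


def tournament_sort(items, better):
    """Return items ordered so that `better` elements come first.

    better(a, b) is the costly comparison: True if a should be output before b.
    """
    n = len(items)
    score = [1] * n            # nodes known to be worse, plus one
    last_beat = [None] * n     # most recent node this one defeated
    fellow_loser = [None] * n  # previous victim of the node that beat this one
    eligible = [(1, i) for i in range(n)]
    heapq.heapify(eligible)
    output = []
    while eligible:
        # Play off the two lowest-scoring eligible nodes until one remains.
        while len(eligible) > 1:
            _, i = heapq.heappop(eligible)
            _, j = heapq.heappop(eligible)
            winner, loser = (i, j) if better(items[i], items[j]) else (j, i)
            score[winner] += score[loser]
            fellow_loser[loser] = last_beat[winner]
            last_beat[winner] = loser
            heapq.heappush(eligible, (score[winner], winner))
        _, champion = heapq.heappop(eligible)
        output.append(items[champion])
        # Everyone the champion beat directly becomes eligible again.
        node = last_beat[champion]
        while node is not None:
            heapq.heappush(eligible, (score[node], node))
            node = fellow_loser[node]
    return output


asked = [0]

def ask(a, b):
    asked[0] += 1              # count how often the costly comparison is used
    return a < b

print(tournament_sort([5, 3, 8, 1, 9, 2, 7], ask), asked[0])
```

The counter makes it easy to compare the number of comparisons against other algorithms on your own data.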
I don't think you're likely to get a better answer than the Wikipedia page on sorting.
Summary:
For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
Various algorithms achieve this - see the "comparison of algorithms" section.
The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.
If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?
Here is a comparison of algorithms. The two better candidates are Quick Sort and Merge Sort. Quick Sort is in general better, but has a worse worst case performance.
Merge sort is definitely the way to go here, as you can use a Map/Reduce type algorithm to have several humans doing the comparisons in parallel.
Quicksort is essentially a single threaded sort algorithm.
You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of say five items and ask him or her to rank them.
Another possibility would be to use a ranking system like the one used by the famous "Hot or Not" web site. This requires many, many more comparisons, but the comparisons can happen in any sequence and in parallel, so it would work faster than a classic sort provided you have enough humanoids at your disposal.
The question raises more questions, really.
Are we talking a single human performing the comparisons? It's a very different challenge if you are talking a group of humans trying to arrange objects in order.
What about the questions of trust and error? Not everyone can be trusted to get everything right, and certain sorts would go catastrophically wrong if at any given point you provided the wrong answer to a single comparison.
What about subjectivity? "Rank these pictures in order of cuteness". Once you get to this point, it could get really complex. As someone else mentions, something like "hot or not" is the simplest conceptually, but isn't very efficient. At its most complex, I'd say that Google is a way of sorting objects into an order, where the search engine is inferring the comparisons made by humans.
The best one would be the merge sort
The minimum run time is n*log(n) [Base 2]
The way it is implemented is
If the list is of length 0 or 1, then it is already sorted.
Otherwise:
Divide the unsorted list into two sublists of about half the size.
Sort each sublist recursively by re-applying merge sort.
Merge the two sublists back into one sorted list.
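The steps above can be sketched in Python as follows. The comparison is passed in as a callback so that it could be a question put to a person, and a one-element list serves as a simple counter of how many questions were asked; both of these are illustrative choices, not part of the description above.

```python
def merge_sort(items, before, counter):
    """Sort `items`; before(a, b) is True if a must come before b.

    counter[0] is incremented once per call to `before`.
    """
    # A list of length 0 or 1 is already sorted.
    if len(items) <= 1:
        return list(items)
    # Divide into two halves and sort each recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid], before, counter)
    right = merge_sort(items[mid:], before, counter)
    # Merge the two sorted halves back into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if before(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])   # ties keep the left item first (stable)
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


counter = [0]
print(merge_sort([5, 2, 9, 1, 7, 3, 8, 6], lambda a, b: a < b, counter))
print("comparisons:", counter[0])
```

For a human-driven sort, the lambda would be replaced by a function that shows two items and records which one the person prefers.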

Resources