Is the linear formation the best sorting production? - sorting

Considering usually a sorting method products linearly sorted productions (such as "1,7,8,13,109..."), which consumes O(N) to inquiry.
Why not sort in non-linear order, consuming O(logN) or something to find element(s) by iteration or Newton method etc.? Is it expensive to make such a high-order sorted structure?
Concisely, is it a possible idea to sort results which allowed to be accessed by finding roots for ax^2 + bx + c = 0? (for contrast, usually it's finding root for ax + c = 0.) For example, we have x1 = 1, x2 = 2 as roots of a quadratic equation and just insert following xi(s). Then it is possible to use smarter ways to inquiry.
I suppose difficulty can be encountered by these aspects:
prediction of data can be rather hard. thus we cannot construct a general formula to describe well the following numbers (may be hash values).
due to the first difficulty, numbers out of certain range can be divergent. example graphed by Google:the graph. the values derived out of [-1,3] are really large, as well as rapid increment in difficulty executing the original formula.
that is actually equivalent to hash, which creates a table that contains the values. and the production rule is a formula.
the execution of a "smarter" inquiry may be expensive because of the complexity of algorithm itself.

Smarter schemes which take advantage of a known statistical distribution are typically faster by some constant. However, that still keeps them at O(log N), which is the same as a trivial binary search. The reason is that in each step, they typically narrow down the range of elements to search by a factor R > 2 , for simple binary search that's just R=2. But you need log(N)/log(R) steps to narrow it down to exactly one element.
Now whether this is a net win depends on log(R) versus the work needed at each step. A simple comparison (for binary search) takes a few cycles. As soon as you need anything more complex than +-*/ (say exp or log) to predict the location of the next element, the profit of needing less steps is gone.
So, in summary: binary search is used because each step is efficient, for many real-world distributions.

Related

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted avg of the entropies of potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote:"For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list. "
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting that would usually be not the optimal result in terms of generaluzation.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one further often applies pruning (which can be seen as another greedy and heuristic method). And never methods like random forests or BART deal with this problem effectively by averaging over several trees -- so that the deviation of a single tree becomes less important.

Calculate "moving" Covariance

I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaces by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance][1]
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure if this is sufficiently numerically stable. Dealing with large values I might run into arithmetic overflows or other (for example cancellation) issues.
Is there a better way to do this?
Thanks for any help.
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i, y_i, and x_i*y_i) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
From the comments: the wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
Not sure why no one has mentioned this, but you can use the Welford online algorithm which relies on the running mean:
The equations should look like:
the online mean given by:

How can you compute a shortest addition chain for an arbitrary n <= 600 within one second?

How can you compute a shortest addition chain (sac) for an arbitrary n <= 600 within one second?
Notes
This is the programming competition on codility for this month.
Addition chains are numerically very important, since they are the most economical way to compute x^n (by consecutive multiplications).
Knuth's Art of Computer Programming, Volume 2, Seminumerical Algorithms has a nice introduction to addition chains and some interesting properties, but I didn't find anything that enabled me to fulfill the strict performance requirements.
What I've tried (spoiler alert)
Firstly, I constructed a (highly branching) tree (with the start 1-> 2 -> ( 3 -> ..., 4 -> ...)) such that for each node n, the path from the root to n is a sac for n. But for values >400, the runtime is about the same as for making a coffee.
Then I used that program to find some useful properties for reducing the search space. With that, I'm able to build all solutions up to 600 while making a coffee. But for n, I need to compute all solutions up to n. Unfortunately, codility measures the class initialization's runtime, too...
Since the problem is probably NP-hard, I ended up hard-coding a lookup table. But since codility asked to construct the sac, I don't know if they had a lookup table in mind, so I feel dirty and like a cheater. Hence this question.
Update
If you think a hard-coded, full lookup table is the way to go, can you give an argument why you think a full computation/partly computed solutions/heuristics won't work?
I have just got my Golden Certificate for this problem. I will not provide a full solution because the problem is still available on the site.I will instead give you some hints:
You might consider doing a deep-first search.
There exists a minimal star-chain for each n < 12509
You need to know how prune your search space.
You need a good lower bound for the length of the chain you are looking for.
Remember that you need just one solution, not all.
Good luck.
Addition chains are numerically very important, since they are the
most economical way to compute x^n (by consecutive multiplications).
This is not true. They are not always the most economical way to compute x^n. Graham et. all proved that:
If each step in addition chain is assigned a cost equal to the product
of the numbers at that step, "binary" addition chains are shown to
minimize the cost.
Situation changes dramatically when we compute x^n (mod m), which is a common case, for example in cryptography.
Now, to answer your question. Apart from hard-coding a table with answers, you could try a Brauer chain.
A Brauer chain (aka star-chain) is an addition chain where each new element is formed as the sum of the previous element and some element (possibly the same). Brauer chain is a sac for n < 12509. Quoting Daniel. J. Bernstein:
Brauer's algorithm is often called "the left-to-right 2^k-ary method",
or simply "2^k-ary method". It is extremely popular. It is easy to
implement; constructing the chain for n is a simple matter of
inspecting the bits of n. It does not require much storage.
BTW. Does anybody know a decent C/C++ implementation of Brauer's chain computation? I'm working partially on a comparison of exponentiation times using binary and Brauer's chains for both cases: x^n and x^n (mod m).

String analysis

Given a sequence of operations:
a*b*a*b*a*a*b*a*b
is there a way to get the optimal subdivision to enable reusage of substring.
making
a*b*a*b*a*a*b*a*b => c*a*c, where c = a*b*a*b
and then seeing that
a*b*a*b => d*d, where d = a*b
all in all reducing the 8 initial operations into the 4 described here?
(c = (d = a*b)*d)*a*c
The goal of course is to minimize the number of operations
I'm considering a suffixtree of sorts.
I'm especially interested in linear time heuristics or solutions.
The '*' operations are actually matrix multiplications.
This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.
The pages linked to give a lot of references.
My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.
However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.
sketch of algorithm:
optimise(s):
let sub = longestRepeatingSubstring(s).
optimisedSub = optimise(sub)
return s with sub replaced by optimisedSub
Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.
edit: The orders-of-growth in this answer are needed in addition to the accepted answer in order to run CSE or matrix-chain multiplication
Interestingly, a compression algorithm may be what you want: a compression algorithm seeks to reduce the size of what it's compressing, and if the only way it can do that is substitution, you can trace it and obtain the necessary subcomponents for your algorithm. This may not give nice results though for small inputs.
What subsets of your operations are commutative will be an important consideration in choosing such an algorithm. [edit: OP says no operations are commutative in his/her situation]
We can also define an optimal solution, if we ignore effects such as caching:
input: [some product of matrices to compute]
given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
[[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
[AxB][BxC] -> [AxC]
The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.
However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.
There is the question about whether a greedy approach is feasible for not: whether one SHOULD compress repeating substrings at each step. This may not be the case, e.g.
aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible
However I have a hunch that, if we try all orders of compressing substrings, we will probably not run into this issue too often.
Thus having written down what we want (the costs) and considered possibly issues, we already have a brute-force algorithm which can do this, and it will run for very small numbers of matrices:
# pseudocode
def compress(problem, substring)
x = new Problem(problem)
x.string.replaceall(substring, newsymbol)
x.subcomputations += Subcomputation(newsymbol=substring)
def bestCompression(problem)
candidateCompressions = [compress(problem,substring) for each substring in problem.string]
# etc., recursively return problem with minimum cost
# dynamic programming may help make this more efficient, but one must watch
# out for the note above, how it may be hard to be greedy
Note: according to another answer by Asgeir, this is known as the Matrix Chain Multiplication optimization problem. Nick Fortescue notes this is also known more generally as http://en.wikipedia.org/wiki/Common_subexpression_elimination -- thus one could find any generic CSE or Matrix-Chain-Multiplication algorithm/library from the literature, and plug in the cost orders-of-magnitude I mentioned earlier (you will need those nomatter which solution you use). Note that the cost of the above calculations (multiplication, exponentiation, etc.) assume that they are being done efficiently with state-of-the-art algorithms; if this is not the case, replace the exponents with appropriate values which correspond to the way the operations will be carried out.
If you want to use the fewest arithmetic operations then you should have a look at matrix chain multiplication which can be reduced to O(n log n)
From the top of the head the problem seems in NP for me. Depending on the substitutions you are doing other substitions will be possible or impossible for example for the string
d*e*a*b*c*d*e*a*b*c*d*e*a there are several possibilities.
If you take the longest common string it will be:
f = d*e*a*b*c and you could substitute f*f*e*a leaving you with three multiplications in the end and four intermediate ones (total seven).
If you instead substitute the following way:
f = d*e*a you get f*b*c*f*b*c*f which you can further substitute using g = f*b*c to
g*g*f for a total of six multiplication.
There are other possible substitutions in this problem, but I do not have the time to count them all right now.
I am guessing for a complete minimal substitution it is not only necessary to figure out the longest common substring but also the number of times each substring repeats, which probably means you have to track all substitutions so far and do backtracking. Still it might be faster than the actual multiplications.
Isn't this the Longest repeated substring problem?

Splitting a set of object into several subsets according to certain evaluation

Suppose I have a set of objects, S. There is an algorithm f that, given a set S builds certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting in allotted memory). To overcome this, I split S into several non-intersecting subsets: S = S1 + S2 + ... + Sn and build Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into memory constraints. Since size of f(S) grows faster than S itself, combined size of Di is much less than size of D.
However, it is still desirable to reduce n, i.e. the number of subsets; or reduce the combined size of Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if input objects are "similar enough" to each other.
The problems is that while "similarity" of objects in S and size of f(S) do correlate, there is no way to compute the latter other than just evaluating f(S), and f is not quite fast.
Algorithm I have currently is to iteratively add each next object from S into one of Si, so that this results in the least possible (at this stage) increase in combined Di size:
for x in S:
i = such i that
size(f(Si + {x})) - size(f(Si))
is min
Si = Si + {x}
This gives practically useful results, but certainly pretty far from optimum (i.e. the minimal possible combined size). Also, this is slow. To speed up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to objects already in Si.
Is there any standard approach to such kinds of problems?
I know of branch and bounds algorithm family, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute optimal distribution of S into Si in reasonable time. But is there some common iteratively improving algorithm?
EDIT:
As comments noted, I never defined "similarity". In fact, all I want is to split in such subsets Si that combined size of Di = f(Si) is minimal or at least small enough. "Similarity" is defined only as this and unfortunately simply cannot be computed easily. I do have a simple approximation, but it is only that — an approximation.
So, what I need is a (likely heuristical) algorithm that minimizes sum f(Si) given that there is no simple way to compute the latter — only approximations I use to throw away cases that are very unlikely to give good results.
About the slowness I found that in similar problems a good-enough solution is to compute the match just by picking a fixed number of random candidates.
True that the result will not be the best one (often worse than the full "greedy" solution you implemented) but it in my experience not too bad and you can decide the speed... it can even be implemented in a prescribed amount of time (that is you keep searching until the allocated time expires).
Another option I use is to keep searching until I see no improvement for a while.
To get past the greedy logic you could keep a queue of N "x" elements and trying to pack them simultaneously in groups of "k" (with k < N).
In this case I found that is important to also keep the "age" of an element in the queue and to use it as a "prize" for the result to avoid keeping "bad" elements forever in the queue because others will always match better (this would make the queue search useless and the results would be basically the same as the greedy approach).

Resources