I have an incoming stream of data and a set of transformations which can be applied to the stream in various combinations to get a numerical output value. I need to find which subset of the transformations minimizes that number.
The data is an ordered list of numbers with metadata attached to each one.
The transformations are quasi-linear: they are technically executable code in a Turing-complete language, but they are known to belong to a restricted subset that always halts, and they transform the input number into an output number with arithmetic operations whose control flow depends on the attached metadata. Moreover, the operations are almost always linear (but they are not guaranteed to be, meaning this may be a place for optimization, but not a restriction).
Basically, a brute-force approach involving 2^n steps (where n is the number of transformations) would work, but it is woefully inefficient, and I'm almost certain it would not scale in production. Are there any algorithms that solve this task faster?
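For reference, the brute force I have in mind looks roughly like this (a rough sketch; the way a chosen subset turns the stream into a single number - here each item is pushed through the transformations in order and the results are summed - stands in for my actual aggregation):

    from itertools import combinations

    def evaluate(subset, stream):
        # stream: list of (value, metadata) pairs; each transformation is a
        # callable taking (value, metadata) -> value (signature assumed here)
        total = 0.0
        for value, metadata in stream:
            for t in subset:
                value = t(value, metadata)
            total += value
        return total

    def best_subset(transformations, stream):
        best, best_score = (), float("inf")
        # 2^n subsets: only viable for small n
        for r in range(len(transformations) + 1):
            for subset in combinations(transformations, r):
                score = evaluate(subset, stream)
                if score < best_score:
                    best, best_score = subset, score
        return best, best_score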
If almost all operations are linear, can't you use linear programming as a heuristic?
And maybe, in between, check whether some transformations are particularly slow, in which case you can still switch to brute force.
Do you need to find the optimal output?
Related
This may be a repeat of the question here: Predict Huffman compression ratio without constructing the tree
So basically, I have the probability distributions of two datasets with the same variables but different probabilities. Now, is there any way that, by looking at the variable distributions, I can say with some degree of confidence which dataset, when passed through a Huffman coding implementation, would achieve the higher compression ratio?
One of the solutions that I came across was to calculate the upper bound using conditional entropy and then compute the average code length. Is there any other approach that I can explore before using the said method?
Thanks a lot.
I don't know what "to some degree confidently" means, but you can get a lower bound on the compressed size of each set by computing the zero-order entropy as done in the linked question (the negative of the sum of the probabilities times the log of the probabilities). Then the lower entropy very likely produces a shorter Huffman coding than the higher entropy. It is not definite, as I am sure that one could come up with a counter-example.
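Concretely (a small sketch; the two distributions are made up for illustration):

    import math

    def zero_order_entropy(probs):
        # H = -sum(p * log2(p)), in bits per symbol: a lower bound on the
        # average Huffman code length
        return -sum(p * math.log2(p) for p in probs.values() if p > 0)

    h_a = zero_order_entropy({"x": 0.5, "y": 0.25, "z": 0.25})   # 1.5 bits/symbol
    h_b = zero_order_entropy({"x": 0.9, "y": 0.05, "z": 0.05})   # ~0.57 bits/symbol
    # the lower-entropy dataset (here B) very likely compresses better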
You also need to send a description of the code itself if you want to decode it on the other end, which adds a wrinkle to the comparison. However if the data is much larger than the code description, then that will be lost in the noise.
Simply generating the code, the coded data, and the code description is very fast. The best solution is to do that, and compare the resulting number of bits directly.
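For example, the exact Huffman-coded size can be computed from the symbol counts alone (a sketch; counts_a and counts_b are hypothetical frequency lists, and the cost of transmitting the code description is ignored here):

    import heapq

    def huffman_coded_bits(counts):
        # Sum of merged weights equals the weighted external path length,
        # i.e. the total number of bits Huffman coding uses for this data.
        heap = list(counts)
        heapq.heapify(heap)
        if len(heap) == 1:
            return heap[0]   # degenerate single-symbol case (typically 1 bit/symbol)
        total = 0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            heapq.heappush(heap, a + b)
            total += a + b
        return total

    counts_a = [45, 13, 12, 16, 9, 5]    # hypothetical symbol frequencies
    counts_b = [80, 10, 5, 3, 1, 1]
    print(huffman_coded_bits(counts_a), huffman_coded_bits(counts_b))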
In “Dynamic Creation of Pseudorandom Number Generators”, Matsumoto and Nishimura cautioned against careless initialization of MT PRNGs for the purposes of parallel simulations, which assume the number streams from the different generators are independent of each other:
The usual scheme for PRNG in parallel machines is to use one and the same PRNG for every process, with different initial seeds. However, this procedure may yield bad collision, in particular if the generator is based on a linear recurrence, because in this case the sum of two pseudorandom sequences satisfies the same linear recurrence, and may appear in the third sequence. The danger becomes non-negligible if the number of parallel streams becomes large compared to the size of the state space.
How big does the number of streams have to get for this to be a serious problem? In the case of the standard MT, MT19937, the state space is rather large... I could definitely see there being a linear relationship modulo 2^19937−1 among 20,000 sequences, but how about a birthday-paradox-style relationship among 400 sequences?
It appears that this is actually a serious problem, because parallel PRNG implementations do include DCMT, but it would be nice to have some examples of failure, and some sense for when this becomes a problem.
There was a discussion of this issue here.
Take two streams "far apart" in MT space, and their sum also satisfies the recurrence. So a possible worry about a third stream isn't just about correlation or overlap with the first two streams, but, depending on the app, also about correlation/overlap with the sum of the first two streams. Move to N streams, and there are O(N**2) direct sums to worry about, and then sums of sums, and ... Still won't cause a problem in my statistical life expectancy, but I only have 4 cores ;-)
So it's worse than the birthday paradox: with k streams there are on the order of 2^k sums (and sums of sums) that all satisfy the same recurrence, so the problem plausibly becomes serious at around log2(19937) ≈ 14 sequences for the standard MT (whose recurrence has degree 19937).
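To see the underlying linearity on a toy scale, here is a sketch with a small GF(2) recurrence standing in for MT19937's much larger one; the "sum" (XOR) of two streams satisfies exactly the same recurrence:

    # Toy F2-linear recurrence x[n] = x[n-7] ^ x[n-3]
    def stream(seed, length, p=7, q=3):
        state = list(seed)            # seed must contain p words
        out = []
        for _ in range(length):
            x = state[-p] ^ state[-q]
            state.append(x)
            out.append(x)
        return out

    s1 = stream([1, 5, 7, 2, 9, 4, 11], 50)
    s2 = stream([3, 8, 6, 12, 1, 7, 2], 50)
    xor = [a ^ b for a, b in zip(s1, s2)]
    # the XOR stream satisfies the same recurrence as s1 and s2
    assert all(xor[n] == xor[n - 7] ^ xor[n - 3] for n in range(7, 50))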
Does ELKI fail for data which has many duplicate values in it? I have files with more than 2 million observations (1-D), but they contain only a few hundred unique values; the rest are duplicates. When I run such a file in ELKI, for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the value with the highest frequency. I imagine the LRD calculation must be causing this problem if duplicates are taken as nearest neighbours. But shouldn't it NOT be doing this? Can we rely on the results ELKI produces for such cases?
It is not so much a matter of ELKI, but of the algorithms.
Most outlier detection algorithms use the k nearest neighbors.
If these are identical, the values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably reach NaN due to a division by 0 if there are too many duplicates.
But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit these effects. There are some ways to avoid/reduce the effects:
add jitter to the data set (a small sketch of this workaround follows the list)
remove duplicates (but never consider highly duplicated values outliers!)
increase the neighborhood size
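Here is a small numpy sketch of the underlying issue and of the jitter workaround (plain Python, not ELKI code; the data and the jitter scale are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([5.0] * 1000 + [1.0, 2.0, 9.0])   # heavily duplicated 1-D data
    k = 10

    def mean_knn_distance(data, point, k):
        # mean distance to the k nearest neighbours, a stand-in for the
        # reachability distances that enter the lrd computation
        d = np.sort(np.abs(data - point))[1:k + 1]
        return d.mean()

    # a duplicated point has all k neighbours at distance 0, so
    # lrd ~ 1/0 -> infinity and the downstream LOF/LoOP scores degenerate
    print(1.0 / mean_knn_distance(x, 5.0, k))                     # inf

    # workaround: tiny jitter (or deduplication) before running LOF/LoOP
    x_jittered = x + rng.normal(scale=1e-6, size=x.shape)
    print(1.0 / mean_knn_distance(x_jittered, x_jittered[0], k))  # finite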
It is easy to prove that such results do arise in LOF/LoOP equations if the data has duplicates.
This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to be close to the original publications, so we avoid making unpublished changes. But if a "LOFdup" method were published and contributed to ELKI, we would obviously add it.
Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered - this allows for both optimizations and much more advanced statistics that would be infeasible or require too many observations on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive") - with the key benefit that they easily scale to large, multivariate data sets. Sometimes naive methods such as naive Bayes can work very well in practice; the same holds for LOF and LoOP: there are some questionable decisions in the algorithms, but they work, and they scale. Just as the independence assumption in naive Bayes is questionable, yet naive Bayes classification often works well and scales very well.
In other words, this is not a bug in ELKI. The implementation does what is published.
Consider the bisection algorithm to find a square root. Every step depends on the previous one, so in my opinion it's not possible to parallelize it. Am I wrong?
Consider also similar algorithms like binary search.
edit
My problem is not exactly bisection, but it is very similar. I have a monotonic function f(mu) and I need to find the mu where f(mu) < alpha. One core needs 2 minutes to compute f(mu), and I need very high precision. We have a farm of ~100 cores. My first attempt was to use only 1 core and scan all values of f with a dynamic step, depending on how close I am to alpha. Now I want to use the whole farm, but my only idea is to compute 100 values of f at equally spaced points.
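Concretely, my idea iterated looks something like this (a rough sketch; f below is a stand-in for the real 2-minute computation and is assumed monotone decreasing, with f(lo) >= alpha > f(hi)):

    import math
    from multiprocessing import Pool

    ALPHA = 0.05       # hypothetical threshold
    N_WORKERS = 100    # size of the farm

    def f(mu):
        return math.exp(-mu)    # placeholder for the expensive evaluation

    def find_crossing(lo, hi, tol=1e-9):
        # each round shrinks the bracket by a factor of N_WORKERS + 1,
        # versus a factor of 2 for plain bisection
        with Pool(N_WORKERS) as pool:
            while hi - lo > tol:
                step = (hi - lo) / (N_WORKERS + 1)
                mus = [lo + step * (i + 1) for i in range(N_WORKERS)]
                values = pool.map(f, mus)        # 100 evaluations in parallel
                crossed = [i for i, v in enumerate(values) if v < ALPHA]
                if crossed:
                    hi = mus[crossed[0]]
                    lo = mus[crossed[0] - 1] if crossed[0] > 0 else lo
                else:
                    lo = mus[-1]
        return 0.5 * (lo + hi)

    if __name__ == "__main__":
        print(find_crossing(0.0, 100.0))   # approx. -ln(0.05) ≈ 3.0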
It depends on what you mean by parallelize, and at what granularity. For example you could use instruction level parallelism (e.g. SIMD) to find square roots for a set of input values.
Binary search is trickier, because the control flow is data-dependent, as is the number of iterations, but you could still conceivably perform a number of binary searches in parallel so long as you allow for the maximum number of iterations (log2 N).
Even if these algorithms could be parallelized (and I'm not sure they can), there is very little point in doing so.
Generally speaking, there is very little point in attempting to parallelize algorithms that already have sub-linear time bounds (that is, o(n)). These algorithms are already so fast that extra hardware will have very little impact.
Furthermore, it is not true (in general) that all algorithms with data dependencies cannot be parallelized. In some cases, for example, it is possible to set up a pipeline where different functional units operate in parallel and feed data sequentially between them. Image processing algorithms, in particular, are frequently amenable to such arrangements.
Problems with no such data dependencies (and thus no need to communicate between processors) are referred to as "embarrassingly parallel". Those problems represent a small subset of the space of all problems that can be parallelized.
Many algorithms have several steps where each step depends on the previous one. Some of those algorithms can be restructured so that the steps run in parallel, and some are impossible to parallelize; I think binary search is of the second type, so you are not wrong. But you can still parallelize binary search by running multiple searches at once.
I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walks and Markov chains, as those are my best guesses about where to look, but so far I have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for falls into a broad research area called metaheuristics (which includes many non-linear optimization algorithms such as genetic algorithms, simulated annealing and tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
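A minimal sketch of that idea on a toy 1-D fitness landscape (the proposal step and temperature are placeholders):

    import math
    import random

    def metropolis_search(fitness, propose, x0, steps=1000, temperature=1.0):
        # random walk weighted by fitness: always accept improvements,
        # accept worse moves with probability exp((new - old) / temperature)
        x, fx = x0, fitness(x0)
        best, best_f = x, fx
        for _ in range(steps):
            y = propose(x)
            fy = fitness(y)
            if fy >= fx or random.random() < math.exp((fy - fx) / temperature):
                x, fx = y, fy
                if fx > best_f:
                    best, best_f = x, fx
        return best, best_f

    fit = lambda x: math.sin(3 * x) - 0.1 * x * x     # bumpy toy landscape
    step = lambda x: x + random.gauss(0, 0.3)
    print(metropolis_search(fit, step, x0=0.0))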
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
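For concreteness, a minimal sketch of that encoding and of one breeding step (the operation names, parameter ranges and breeding scheme are made up; the vote scores would come from your human raters):

    import random

    # hypothetical operations with parameter ranges; a chromosome is a
    # fixed-length list of (operation, parameter) genes
    OPERATIONS = {"blur": (0.0, 5.0), "rotate": (0.0, 360.0), "tint": (0.0, 1.0)}

    def random_gene():
        op, (lo, hi) = random.choice(list(OPERATIONS.items()))
        return (op, random.uniform(lo, hi))

    def random_chromosome(length=6):
        return [random_gene() for _ in range(length)]

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def mutate(chromosome, rate=0.2):
        return [random_gene() if random.random() < rate else g for g in chromosome]

    def next_generation(population, votes, elite=2):
        # votes: one human score per chromosome; breed a new population
        # biased towards the higher-scored ones
        ranked = [c for c, _ in sorted(zip(population, votes),
                                       key=lambda cv: cv[1], reverse=True)]
        new = ranked[:elite]                         # keep the best unchanged
        while len(new) < len(population):
            a, b = random.choices(ranked[:max(2, len(ranked) // 2)], k=2)
            new.append(mutate(crossover(a, b)))
        return new

    # usage: render each chromosome as an image, collect human votes, then
    # population = next_generation(population, votes)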
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
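A minimal sketch of those three components (the operation names, gene length and exact update rule are placeholders):

    import random

    OPS = ["A", "B", "C", "D"]      # hypothetical operations

    # 1. complete directed graph, all edge weights start at 1
    weights = {u: {v: 1.0 for v in OPS if v != u} for u in OPS}

    # 2. weighted traversal of N nodes
    def traverse(n):
        node = random.choice(OPS)
        path = [node]
        for _ in range(n - 1):
            nxt = list(weights[node])
            node = random.choices(nxt, [weights[node][v] for v in nxt])[0]
            path.append(node)
        return path

    # 3. update rule: bump every edge used in a 'pleasing' image,
    #    then renormalise so each node's strongest edge has weight 1
    def reward(path, amount=1.0):
        for u, v in zip(path, path[1:]):
            weights[u][v] += amount
        for u in weights:
            top = max(weights[u].values())
            for v in weights[u]:
                weights[u][v] /= top

    # usage: path = traverse(6); render the image, collect votes,
    # and call reward(path) for the images humans liked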
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. A simplex crawler, for example, can be very efficient, but you can easily melt your brain trying to follow and debug its convergence steps!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more aggressive choice of update function will push promising results faster (for example, weighting new results according to a power law rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in fewer tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.