How to find the best combination of parameters from a very large set?

I have processing logic with 11 parameters (let's say parameters A through K), and different combinations of these parameters can result in different outcomes.
Processing logic example:
    if x > A:
        x = B
    else:
        x = C
    y = math.sin(2*x*x + 1.1416) - D
    # other logic involving parameters E, F, G, H, I, J, K
    return outcome
Here are some examples of the possible values of the parameters (the others are similar and also discrete):
A ∈ [0.01, 0.02, 0.03, ..., 0.2]
E ∈ [1, 2, 3, 4, ..., 200]
I would like to find the combination of these parameters that results in the best outcome.
However, the problem I am facing is that there are 10^19 possible combinations in total, and each combination takes 700 ms of processing time per CPU core. Obviously, the time needed to process all combinations is unacceptable even if I have a large computing cluster.
Could anyone give some advice on what is the correct methodology to handle this problem?
Here are some of my thoughts:
Step 1. Widen the step interval of each parameter so that the total processing time is reduced to an acceptable scope, for example:
A ∈ [0.01, 0.05, 0.09, ..., 0.2]
E ∈ [1, 5, 10, 15, ..., 200]
Step 2. Starting from the best combination found in step 1, do a finer-grained search around that combination to find the best overall combination.
But I am afraid that the best combination might hide somewhere the coarse grid of step 1 cannot perceive, so step 2 would be in vain.
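For illustration only, a minimal sketch of this two-step idea, assuming a black-box evaluate(params) callable that stands in for the 700 ms processing logic (the stride and window values are arbitrary choices, not something from the question):

import itertools

def exhaustive_search(grids, evaluate):
    """Try every combination on the given grids and keep the best one.
    grids maps each parameter name to its list of allowed values;
    evaluate is the black-box objective (the ~700 ms processing logic)."""
    best_params, best_outcome = None, float("-inf")
    for combo in itertools.product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        outcome = evaluate(params)
        if outcome > best_outcome:
            best_params, best_outcome = params, outcome
    return best_params, best_outcome

def coarse_to_fine(full_grids, evaluate, stride=5, window=5):
    # Step 1: subsample each parameter's values to get a tractable coarse grid.
    coarse = {name: values[::stride] for name, values in full_grids.items()}
    best, _ = exhaustive_search(coarse, evaluate)
    # Step 2: search the full-resolution values in a window around the coarse optimum.
    fine = {}
    for name, values in full_grids.items():
        i = values.index(best[name])
        fine[name] = values[max(0, i - window): i + window + 1]
    return exhaustive_search(fine, evaluate)

The concern above still applies: anything the coarse grid skips over is never examined by the refinement step.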

This is an optimization problem. However, you have two distinct problems in what you posed:
There are no restrictions or properties on the evaluation function;
You accept only the best solution of 10^19 possibilities.
The field of optimization serves up many possibilities, most of which are one variation or another of hill-climbing search combined with occasional random perturbation (to help break out of a local maximum that is not the global solution). All of these depend on some manner of continuity or predictability in how the evaluation function depends on its inputs.
Without that continuity, there is no shorter path to the sole optimal solution.
If you do have some predictability, then you have some reading to do on various solution methods. Start with Newton-Raphson, move on to Gradient Descent, and continue to other topics, depending on the fabric of your function.
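As an illustration of the hill-climbing family mentioned above (my own sketch, not something prescribed by this answer), here is random-restart hill climbing over the discrete parameter grids; evaluate is the black-box objective:

import random

def hill_climb(grids, evaluate, restarts=50):
    """Random-restart hill climbing on discrete grids.

    grids maps each parameter name to its sorted list of allowed values.
    From a random start, move one parameter one grid step at a time whenever
    that improves the outcome; restart from a new random point to escape
    local maxima."""
    best_params, best_outcome = None, float("-inf")
    for _ in range(restarts):
        current = {name: random.choice(values) for name, values in grids.items()}
        score = evaluate(current)
        improved = True
        while improved:
            improved = False
            for name, values in grids.items():
                i = values.index(current[name])
                for j in (i - 1, i + 1):          # neighbouring grid values
                    if 0 <= j < len(values):
                        candidate = dict(current, **{name: values[j]})
                        s = evaluate(candidate)
                        if s > score:
                            current, score, improved = candidate, s, True
        if score > best_outcome:
            best_params, best_outcome = current, score
    return best_params, best_outcome

With roughly continuous behaviour this needs far fewer than 10^19 evaluations; without it, nothing of this kind is guaranteed to find the global optimum.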

Have you thought about a purely mathematical approach, i.e. trying to find local/global extrema, or exploiting whether the function is monotonic in each operation?
There are quite decent numerical methods for derivatives/integrals, even ones that can be used in a relatively generic manner.
So, in other words, limit the scope instead of computing every single option; whether this works depends on the general character of the operations you have in mind.

Related

Ways to compare two methods in general

Background:
I have two methods for handling a set of problems. The key performance index is the number of days needed to complete that set of problems, and I want to assess which method is better based upon the overall number of days. For example, there are two methods, A and B, and a set of three problems: x, y, and z. Using method A, these three problems need 3, 5, and 7 days respectively, whereas method B needs 6, 7, and 8 days. I need to compare whether A or B is better. In this naive example, A is obviously better.
However, there is an edge case. It could happen that for some problems, one method takes practically forever to finish. For example, there is another method, C, which needs 1, 1, and 99999 days to finish x, y, and z. My question is how I can compare C to A and B.
There are two ways I am considering. One is the reciprocal of the days; the other is truncation. In the previous example, if I use reciprocals, the score of method A is (1/3 + 1/5 + 1/7)/3, method B is (1/6 + 1/7 + 1/8)/3, and method C is (1/1 + 1/1 + 1/99999)/3. Based on the reciprocal score, C > A > B. If I truncate at 30 (any value bigger than 30 becomes 30), then the average number of days needed is 5 for method A, 7 for method B, and 10.7 for method C.
My questions are:
Which criterion is better?
Are there other ways to assess this?
If truncation is better, is there a principled way to set the cutoff point?
If the reciprocal is better, how should I interpret that number? Average efficiency? And how do I report to my boss what that number means?
Is there a more structured or scientific way to think about this kind of problem in general?
Any help or hints are appreciated. Thank you very much.
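For reference, the figures worked out in the question can be reproduced with a short snippet (the function names here are only illustrative):

days = {"A": [3, 5, 7], "B": [6, 7, 8], "C": [1, 1, 99999]}

def reciprocal_score(times):
    # Mean of 1/days: higher is better (roughly "problems solved per day").
    return sum(1.0 / t for t in times) / len(times)

def truncated_mean(times, cutoff=30):
    # Mean of days with every value capped at the cutoff: lower is better.
    return sum(min(t, cutoff) for t in times) / len(times)

for name, t in days.items():
    print(name, round(reciprocal_score(t), 3), round(truncated_mean(t), 1))
# A 0.225 5.0   B 0.145 7.0   C 0.667 10.7
# The reciprocal score ranks C > A > B, while the truncated mean ranks A best, then B, then C.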

Algorithm for creating a sample that's as close to evenly distributed as possible?

I have a large database of data with dates. There are large gaps and large chunks of data without gaps. I want to get a sample of this data such that the dates are as evenly distributed as possible (i.e. as spread out as possible).
E.g. if the dates are [1, 2, 3, 4, 100] and I want to sample 3 elements, the ideal sample would be [1, 50.5, 100] and I would take [1, 4, 100].
Is this a known problem with an existing algorithm?
My attempt to formalize this problem would be: Given an array A, select a subarray B such that the following is minimized:
Σ_i abs(B_i - (min(A) + i * (max(A) - min(A)) / (len(B) - 1)))
You should be able to model this as an assignment problem. Construct a bipartite graph with vertex sets A and B. The edge from A_i to B_j has weight something like
abs(j / (|B| - 1) - (A_i - min(A)) / (max(A) - min(A)))
where A_0 <= A_1 <= ... <= A_{|A|-1}.
Note that in your problem the graph is dense, so easily represented as a rectangular matrix of weights W[i,j]. No explicit vertex or edge data structures are needed.
A minimum weight matching would identify the elements of A for the sample.
There are several efficient algorithms for solving assignment problems. Perhaps the best known is the Hungarian method, which can be implemented with O(n^3) run time. (I vaguely remember that in this text there is a version with O(n^2 log n) run time, but I don't have access to it right now, so I can't check.) An implementation I used in the '90s ran in a few seconds on a standard desktop machine for a problem with n ≈ 10k. You should be able to do considerably better now.
You didn't give a definition of "large." If the DB is too big to process as a single assignment problem, you can probably get reasonable results by working in chunks.
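As one concrete way to wire this up (my sketch; spread_sample is an illustrative name, and it assumes the dates span a nonzero range), SciPy's linear_sum_assignment solves exactly this kind of rectangular assignment problem:

import numpy as np
from scipy.optimize import linear_sum_assignment

def spread_sample(dates, k):
    """Pick the k dates that best match k evenly spaced target positions."""
    a = np.sort(np.asarray(dates, dtype=float))
    targets = np.linspace(0.0, 1.0, k)                     # ideal positions j / (k - 1)
    normalised = (a - a.min()) / (a.max() - a.min())       # dates mapped onto [0, 1]
    cost = np.abs(targets[:, None] - normalised[None, :])  # |slot position - normalised date|
    row_ind, col_ind = linear_sum_assignment(cost)         # minimum-weight matching
    return a[np.sort(col_ind)]

print(spread_sample([1, 2, 3, 4, 100], 3))   # -> [  1.   4. 100.]

Because the cost matrix is rectangular (k rows, |A| columns), each of the k sample slots is matched to a distinct date; for very large databases the chunking suggestion above still applies.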

Guidance on Algorithmic Thinking (4 fours equation)

I recently saw a logic/math problem called 4 Fours, where you need to use four 4s and a range of operators to create equations that equal each of the integers 0 to N.
How would you go about writing an elegant algorithm to come up with, say, the first 100?
I started by creating base calculations like 4-4, 4+4, 4x4, 4/4, 4!, sqrt(4), and treated these values as integers.
However, I realized that this was going to be a brute-force method, testing the combinations to see if they equal 0, then 1, then 2, then 3, and so on.
I then thought of finding all possible combinations of the above values, checking that each result was less than 100, filling an array and then sorting it... again inefficient, because it may find thousands of numbers over 100.
Any help on how to approach a problem like this would be appreciated... not actual code, but how to think through this problem.
Thanks!!
This is an interesting problem. There are a couple of different things going on here. One issue is how to describe the sequence of operations and operands that go into an arithmetic expression. Using parentheses to establish order of operations is quite messy, so instead I suggest thinking of an expression as a stack of operations and operands, like - 4 4 for 4-4, + 4 * 4 4 for (4*4)+4, * 4 + 4 4 for (4+4)*4, etc. It's like Reverse Polish Notation on an HP calculator. Then you don't have to worry about parentheses, and having a data structure for expressions will help below when we build up larger and larger expressions.
Now we turn to the algorithm for building expressions. Dynamic programming doesn't work in this situation, in my opinion, because (for example) to construct some numbers in the range from 0 to 100 you might have to go outside of that range temporarily.
A better way to conceptualize the problem, I think, is as breadth first search (BFS) on a graph. Technically, the graph would be infinite (all positive integers, or all integers, or all rational numbers, depending on how elaborate you want to get) but at any time you'd only have a finite portion of the graph. A sparse graph data structure would be appropriate.
Each node (number) on the graph would have a weight associated with it, the minimum number of 4's needed to reach that node, and also the expression which achieves that result. Initially, you would start with just the node (4), with the number 1 associated with it (it takes one 4 to make 4) and the simple expression "4". You can also throw in (44) with weight 2, (444) with weight 3, and (4444) with weight 4.
To build up larger expressions, apply all the different operations you have to those initial nodes. For example, unary negation, factorial, square root; binary operations like * 4 at the bottom of your stack for multiply by 4, + 4, - 4, / 4, ^ 4 for exponentiation, and also + 44, etc. The weight of an operation is the number of 4s required for that operation; unary operations would have weight 0, + 4 would have weight 1, * 44 would have weight 2, etc. You add the weight of the operation to the weight of the node on which it operates to get the new weight, so for example + 4 acting on node (44) with weight 2 and expression "44" would result in a new node (48) with weight 3 and expression "+ 4 44". If the new result for 48 has a better weight than the existing result for 48, substitute the new node for (48).
You will have to use some sense when applying functions. factorial(4444) would be a very large number; it would be wise to set a domain for your factorial function which would prevent the result from getting too big or going out of bounds. The same with functions like / 4; if you don't want to deal with fractions, say that non-multiples of 4 are outside of the domain of / 4 and don't apply the operator in that case.
The resulting algorithm is very much like Dijkstra's algorithm for calculating distance in a graph, though not exactly the same.
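To make the bookkeeping concrete, here is a rough sketch of that Dijkstra-like search (my own simplification: it only appends the literals 4 and 44 via +, -, *, / and uses factorial and square root as free unary operations, so it is an illustration of the idea rather than a complete solver):

import heapq
import math

def four_fours(limit=100, max_fours=4, max_value=10**6):
    """Each node is a value; its 'distance' is the fewest 4s needed to reach it."""
    best = {}                                    # value -> (number of 4s, expression)
    heap = []
    for n in range(1, max_fours + 1):            # seed nodes 4, 44, 444, 4444
        heapq.heappush(heap, (n, int("4" * n), "4" * n))
    while heap:
        cost, value, expr = heapq.heappop(heap)
        if value in best and best[value][0] <= cost:
            continue                             # already reached with no more 4s than this
        best[value] = (cost, expr)
        candidates = []
        # Unary operations cost no extra 4s.
        if 3 <= value <= 12:
            candidates.append((cost, math.factorial(value), "(%s)!" % expr))
        if value > 0 and math.isqrt(value) ** 2 == value:
            candidates.append((cost, math.isqrt(value), "sqrt(%s)" % expr))
        # Binary operations with a literal 4 or 44 cost one or two extra 4s.
        for extra, literal in ((1, 4), (2, 44)):
            c = cost + extra
            candidates += [(c, value + literal, "(%s+%d)" % (expr, literal)),
                           (c, value - literal, "(%s-%d)" % (expr, literal)),
                           (c, value * literal, "(%s*%d)" % (expr, literal))]
            if value % literal == 0:
                candidates.append((c, value // literal, "(%s/%d)" % (expr, literal)))
        for c, v, e in candidates:
            if 0 <= v <= max_value and c <= max_fours and (v not in best or c < best[v][0]):
                heapq.heappush(heap, (c, v, e))
    return {v: best[v] for v in range(limit + 1) if v in best}

for value, (cost, expr) in sorted(four_fours().items())[:10]:
    print(value, cost, expr)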
I think that the brute force solution here is the only way to go.
The reasoning behind this is that each number has a different way to get to it, and getting to a certain x might have nothing to do with getting to x+1.
Having said that, you might be able to make the brute force solution a bit quicker by using obvious moves where possible.
For instance, if I got to 20 using three 4s (4*4+4), it is obvious how to get to 16, 24 and 80 with one more 4. Holding an array of 100 bits and marking the numbers already reached would also help avoid repeating work.
Similar to the subset sum problem, this can be solved using dynamic programming (DP) by following the recursive formulas:
D(0,0) = true
D(x,0) = false for x != 0
D(x,i) = D(x-4,i-1) OR D(x+4,i-1) OR D(x*4,i-1) OR D(x/4,i-1)
By computing the above using DP technique, it is easy to find out which numbers can be produced using these 4's, and by walking back the solution, you can find out how each number was built.
The advantage of this method (when implemented with DP) is you do not recalculate multiple values more than once. I am not sure it will actually be effective for 4 4's, but I believe theoretically it could be a significant improvement for a less restricted generalization of this problem.
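Read forwards, the recurrence says that the values reachable with i fours are obtained by combining a value reachable with i-1 fours with one extra 4 via +, -, * or /. A small set-based sketch of that reading (my interpretation; Fractions keep division exact, and the restricted move set means it will not find everything four 4s can express):

from fractions import Fraction

def reachable(max_fours=4):
    """values[i] = every value expressible with exactly i fours under the
    restricted move set of the recurrence (one extra 4 via +, -, *, / per step)."""
    values = {1: {Fraction(4)}}
    for i in range(2, max_fours + 1):
        values[i] = set()
        for x in values[i - 1]:
            values[i].update({x + 4, x - 4, x * 4, x / 4})
    return values

vals = reachable(4)
print(sorted(int(v) for v in vals[4] if v.denominator == 1 and 0 <= v <= 100))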
This answer is just an extension of Amit's.
Essentially, your operations are:
Apply a unary operator to an existing expression to get a new expression (this does not use any additional 4s)
Apply a binary operator to two existing expressions to get a new expression (the new expression has number of 4s equal to the sum of the two input expressions)
For each n from 1..4, calculate Expressions(n) - a list of (Expression, Value) pairs - as follows (for a fixed n, only store one expression in the list for any given value):
1. Initialise the list with the concatenation of n 4s (i.e. 4, 44, 444, 4444).
2. For i from 1 to n-1, and each permitted binary operator op, add an expression (and value) e1 op e2, where e1 is in Expressions(i) and e2 is in Expressions(n-i).
3. Repeatedly apply unary operators to the expressions/values calculated so far in steps 1-3. When to stop (applying step 3 recursively) is a little vague; certainly stop if an iteration produces no new values. Potentially limit the magnitude of the values you allow, or the size of the expressions.
Example unary operators are !, Sqrt, -, etc. Example binary operators are +-*/^ etc. You can easily extend this approach to operators with more arguments if permitted.
You could do something a bit cleverer to cope with step 3 potentially never ending for a given n. The simple way (described above) does not start calculating Expressions(i) until Expressions(j) is complete for all j < i, which requires that we know when to stop. The alternative is to build expressions only up to a certain maximum length for each n, and then, if you need to (because you haven't found certain values), extend that maximum length in an outer loop.
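A sketch of this scheme (my own choices of operators and bounds: +, -, * and exact integer division for the binary step, and negation, factorial and integer square root for the unary step, with arbitrary limits to keep the search finite):

import math

def expressions(max_fours=4, max_abs=10**4):
    """exprs[n] maps each reachable value to one expression using exactly n fours."""
    exprs = {n: {} for n in range(1, max_fours + 1)}
    for n in range(1, max_fours + 1):
        exprs[n][int("4" * n)] = "4" * n                      # step 1: 4, 44, 444, ...
        for i in range(1, n):                                 # step 2: binary combinations
            for v1, e1 in exprs[i].items():
                for v2, e2 in exprs[n - i].items():
                    candidates = [(v1 + v2, "(%s+%s)" % (e1, e2)),
                                  (v1 - v2, "(%s-%s)" % (e1, e2)),
                                  (v1 * v2, "(%s*%s)" % (e1, e2))]
                    if v2 != 0 and v1 % v2 == 0:
                        candidates.append((v1 // v2, "(%s/%s)" % (e1, e2)))
                    for v, e in candidates:
                        if abs(v) <= max_abs:
                            exprs[n].setdefault(v, e)
        changed = True                                        # step 3: unary closure
        while changed:
            changed = False
            for v, e in list(exprs[n].items()):
                unary = [(-v, "(-%s)" % e)]
                if 0 <= v <= 10:
                    unary.append((math.factorial(v), "(%s)!" % e))
                if v > 1 and math.isqrt(v) ** 2 == v:
                    unary.append((math.isqrt(v), "sqrt(%s)" % e))
                for nv, ne in unary:
                    if abs(nv) <= max_abs and nv not in exprs[n]:
                        exprs[n][nv] = ne
                        changed = True
    return exprs

found = expressions()[4]
print(sorted(v for v in found if 0 <= v <= 100))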

Algorithm to calculate all possible subset

This should be a quite simple problem, but I don't have proper algorithmic training and find myself stuck trying to solve this.
I need to calculate the possible combinations to reach a number by adding a limited set of smaller numbers together.
Imagine that we are playing with LEGO and I have a brick that is 12 units long and I need to list the possible substitutions I can make with shorter bricks. For this example we may say that the available bricks are 2, 4, 6 and 12 units long.
What might be a good approach to building an algorithm that can calculate the substitutions? There are no bounds on how many bricks I can use at a time, so it could be 6x2 as well as 1x12; the important thing is that I need to list all of the options.
So the inputs are the target length (in this case 12) and available bricks (an array of numbers (arbitrary length), in this case [2, 4, 6, 12]).
My approach was to start with the lowest number and add it up until I reach the target, then take the next lowest, and so on. But that way I miss out on combinations of multiple numbers, and when I try to factor that in, it gets really messy.
I suggest a recursive approach: given a function f(target, permissibles) that lists all representations of target as a combination of permissibles, you can do this:
def f(target, permissibles):
    for x in permissibles:
        collect f(target - x, permissibles)
If you do not want to differentiate between 12 = 4+4+2+2 and 12 = 2+4+2+4, you need to sort permissibles in descending order and do:
def f(target, permissibles):
    for x in permissibles:
        collect f(target - x, permissibles.remove(larger than x))
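A runnable version of this sketch might look like the following (the non-increasing ordering plays the role of "permissibles.remove(larger than x)", so each combination is produced exactly once):

def combinations(target, bricks):
    """Return every way to write target as a sum of the given brick lengths,
    ignoring order (each result is a non-increasing list of bricks)."""
    def f(remaining, allowed):
        if remaining == 0:
            return [[]]
        result = []
        for x in allowed:
            if x <= remaining:
                # Only allow bricks no longer than x further down the recursion.
                smaller = [b for b in allowed if b <= x]
                for rest in f(remaining - x, smaller):
                    result.append([x] + rest)
        return result
    return f(target, sorted(set(bricks), reverse=True))

print(combinations(12, [2, 4, 6, 12]))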

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins, 0, 1 and 2, with P0, P1 and P2 being their probabilities of landing on heads when tossed. Toss coin 0; if the result is heads, toss coin 1 three times, otherwise toss coin 2 three times. The observed data produced by coins 1 and 2 looks like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins, A and B, with PA and PB being their probabilities of landing on heads when tossed. Each round, select one coin at random, toss it 10 times and record the results. The observed data is the toss results from these two coins, but we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=HHH) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum likelihood estimate of a coin's probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
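To make those weighted-count updates concrete, here is a small illustrative implementation for the three-coin example (my own sketch, not code from the PDF; the starting guesses are arbitrary):

def em_three_coins(observations, p0, p1, p2, iterations=20):
    """EM for the three-coin model: coin 0 (bias p0) decides whether coin 1
    (bias p1) or coin 2 (bias p2) is tossed to produce each observed triple."""
    for _ in range(iterations):
        w_total = 0.0          # expected number of rounds in which coin 0 came up heads
        h1 = t1 = h2 = t2 = 0.0
        for obs in observations:
            h, t = obs.count("H"), obs.count("T")
            # E-step: posterior probability that this triple came from coin 1
            # (i.e. that the hidden toss of coin 0 was heads).
            like1 = p0 * (p1 ** h) * ((1 - p1) ** t)
            like2 = (1 - p0) * (p2 ** h) * ((1 - p2) ** t)
            w = like1 / (like1 + like2)
            # Weighted counts: credit heads/tails to coin 1 with weight w
            # and to coin 2 with weight 1 - w.
            w_total += w
            h1 += w * h
            t1 += w * t
            h2 += (1 - w) * h
            t2 += (1 - w) * t
        # M-step: maximum-likelihood estimates from the weighted counts.
        p0 = w_total / len(observations)
        p1 = h1 / (h1 + t1)
        p2 = h2 / (h2 + t2)
    return p0, p1, p2

print(em_three_coins(["HHH", "TTT", "HHH", "TTT", "HHH"], 0.6, 0.7, 0.4))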
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian), but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihoods of the processes. (Typically we do at least know the number of processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data point is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
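As a toy illustration of the hard versus soft distinction (my own example: a 1-D mixture of two unit-variance Gaussians with known, equal mixing weights):

import math
import random
import statistics

def classify_maximize(data, means, iters=20):
    """Hard assignment ("classify-maximize"): assign each point to the nearest
    mean, then re-estimate each mean from the points assigned to it."""
    for _ in range(iters):
        clusters = ([], [])
        for x in data:
            clusters[0 if abs(x - means[0]) <= abs(x - means[1]) else 1].append(x)
        means = [statistics.mean(c) if c else m for c, m in zip(clusters, means)]
    return means

def soft_em(data, means, iters=20):
    """EM for the same toy model: every point contributes to both means,
    weighted by the posterior probability (responsibility) of each component."""
    for _ in range(iters):
        num, den = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            like = [math.exp(-0.5 * (x - m) ** 2) for m in means]  # unit variance, equal priors
            total = sum(like)
            for k in range(2):
                r = like[k] / total            # responsibility of component k for x
                num[k] += r * x
                den[k] += r
        means = [num[k] / den[k] for k in range(2)]
    return means

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(4, 1) for _ in range(200)]
print(classify_maximize(data, [-1.0, 5.0]))   # hard assignments
print(soft_em(data, [-1.0, 5.0]))             # soft (EM) assignments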
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in your second paper, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: the code works, but I am certain it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import numpy as np
import math

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou ####

def get_mn_log_likelihood(obs, probs):
    """Return the (log) likelihood of obs, given the probs"""
    # Multinomial distribution log PMF:
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])
    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood

# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads) = 0.45

# represent the experiments
head_counts = np.array([5, 9, 8, 4, 7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed repeatedly in Python 3

# initialise pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1 - pA_heads[j]]))  # log likelihood of e given coin A
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1 - pB_heads[j]]))  # log likelihood of e given coin B
        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))  # weight of A, proportional to likelihood of A
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))  # weight of B, proportional to likelihood of B
        expectation_A[i] = np.dot(weightA, e)  # weighted (heads, tails) counts credited to coin A
        expectation_B[i] = np.dot(weightB, e)  # weighted (heads, tails) counts credited to coin B
    pA_heads[j + 1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j + 1] = sum(expectation_B)[0] / sum(sum(expectation_B))
    improvement = max(abs(np.array([pA_heads[j + 1], pB_heads[j + 1]]) - np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables that indicate whether coin 1 or coin 2 was used. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
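For concreteness, the E and M step updates for this model would look roughly as follows (my notation: h_i and t_i are the head and tail counts of sequence i, and w_i is the expected value of c_i1 given the current parameters):
E-step: w_i = \lambda p_1^{h_i} (1-p_1)^{t_i} / [ \lambda p_1^{h_i} (1-p_1)^{t_i} + (1-\lambda) p_2^{h_i} (1-p_2)^{t_i} ], and E[c_i2] = 1 - w_i.
M-step: \lambda = (1/n) \sum_i w_i, p_1 = \sum_i w_i h_i / \sum_i w_i (h_i + t_i), and p_2 = \sum_i (1-w_i) h_i / \sum_i (1-w_i)(h_i + t_i).
Each M-step quantity is just the weighted version of the "proportion of heads" or "mean of the c_i1s" described above.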
