Meaning of each column solver output in Gekko - gekko

I am curious what each column of the IPOPT solver output means. Is there any material that explains this?
Below is solver output from the IPOPT solver, and I'd like to know what the inf_pr, inf_du, lg(mu), ||d||, etc. terms mean.

Below is a description of each column from the IPOPT documentation.
iter: The current iteration count. This includes regular iterations and iterations during the restoration phase. If the algorithm is in the restoration phase, the letter "r" will be appended to the iteration number.
objective: The unscaled objective value at the current point. During the restoration phase, this value remains the unscaled objective value for the original problem.
inf_pr: The unscaled constraint violation at the current point. This quantity is the infinity-norm (max) of the (unscaled) constraints ( gL≤g(x)≤gU in (NLP)). During the restoration phase, this value remains the constraint violation of the original problem at the current point. The option inf_pr_output can be used to switch to the printing of a different quantity.
inf_du: The scaled dual infeasibility at the current point. This quantity measures the infinity-norm (max) of the internal dual infeasibility, Eq. (4a) in the implementation paper [11], including inequality constraints reformulated using slack variables and problem scaling. During the restoration phase, this is the value of the dual infeasibility for the restoration phase problem.
lg(mu): log10 of the value of the barrier parameter μ.
||d||: The infinity norm (max) of the primal step (for the original variables x and the internal slack variables s). During the restoration phase, this value includes the values of additional variables, p and n (see Eq. (30) in [11]).
lg(rg): log10 of the value of the regularization term for the Hessian of the Lagrangian in the augmented system ( δw in Eq. (26) and Section 3.1 in [11]). A dash ("-") indicates that no regularization was done.
alpha_du: The stepsize for the dual variables ( αzk in Eq. (14c) in [11]).
alpha_pr: The stepsize for the primal variables ( αk in Eq. (14a) in [11]). The number is usually followed by a character for additional diagnostic information regarding the step acceptance criterion:
Tag: Description
f: f-type iteration in the filter method w/o second order correction
F: f-type iteration in the filter method w/ second order correction
h: h-type iteration in the filter method w/o second order correction
H: h-type iteration in the filter method w/ second order correction
k: penalty value unchanged in merit function method w/o second order correction
K: penalty value unchanged in merit function method w/ second order correction
n: penalty value updated in merit function method w/o second order correction
N: penalty value updated in merit function method w/ second order correction
R: Restoration phase just started
w: in watchdog procedure
s: step accepted in soft restoration phase
t/T: tiny step accepted without line search
r: some previous iterate restored
ls: The number of backtracking line search steps (does not include second-order correction steps).
There is additional information on interior point methods with code examples on my course website for engineering design optimization and in Section 8.4 of the Design Optimization textbook.
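To reproduce an iteration table like the one described above from Gekko, here is a minimal sketch (assuming a recent Gekko version; the model is the standard Hock-Schittkowski 71 benchmark, chosen only for illustration). Setting SOLVER=3 selects IPOPT, and disp=True prints the iter / objective / inf_pr / inf_du ... log:
from gekko import GEKKO

m = GEKKO(remote=True)                 # remote=False also works with a local solver install
x1 = m.Var(value=1, lb=1, ub=5)
x2 = m.Var(value=5, lb=1, ub=5)
x3 = m.Var(value=5, lb=1, ub=5)
x4 = m.Var(value=1, lb=1, ub=5)
m.Equation(x1*x2*x3*x4 >= 25)                    # inequality constraint
m.Equation(x1**2 + x2**2 + x3**2 + x4**2 == 40)  # equality constraint
m.Minimize(x1*x4*(x1 + x2 + x3) + x3)
m.options.SOLVER = 3                   # 3 = IPOPT
m.solve(disp=True)                     # disp=True prints the IPOPT iteration columns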

Related

Parametric Scoring Function or Algorithm

I'm trying to come up with a way to arrive at a "score" based on an integer number of "points" that is adjustable using a small number (3-5?) of parameters. Preferably it would be simple enough to reasonably enter as a function/calculation in a spreadsheet for tuning the parameters by the "designer" (not a programmer or mathematician). The first point has the most value and eventually additional points have a fixed or nearly fixed value. The transition from the initial slope of point value to final slope would be smooth. See example shapes below.
Points values are always positive integers (0 pts = 0 score)
At some point the curve becomes linear (or nearly so), and all additional points have a fixed value
Preferably, parameters are understandable to a lay person, e.g.: "smoothness of the curve", "value of first point", "place where the additional value of points is fixed", etc
For parameters, an example of something ideal would be:
Value of first point: 10
Value of point #3: 5
Minimum value of additional points: 0.75
The exact shape of the curve is not too important, as long as the corner can be made more smooth or more sharp.
This is not for a game but more of a rating system in which multiple components (several of which might use this kind of scale) will be combined.
This seems like a non-traditional kind of question for SO/SE. I've done mostly financial software in my career, and I'm hoping there's some domain wisdom for this kind of thing that I can tap into.
Implementation of Prune's Solution:
Google Sheet
Parameters:
Initial value (a)
Second value (b)
Minimum value (z)
Your decay ratio is b/a. It's simple from here: iterate through your values, applying the decay at each step, until you "peg" at the minimum:
x[n] = max( z, a * (b/a)^n )
// Take the larger of the computed "decayed" value,
// and the specified minimum.
The sequence x is your values list.
You can also truncate intermediate results if you want integers up to a certain point. Just apply the floor function to each computed value, but still allow z to override that if it gets too small.
Is that good enough? I know there's a discontinuity in the derivative, which will be noticeable if the minimum and decay aren't pleasantly aligned. You can adjust this with a relative decay, translating the exponential decay curve so that it approaches y = z instead of y = 0.
base = z
diff = a-z
ratio = (b-z) / diff
x[n] = z + diff * ratio^n
In this case, you don't need the max function, since the decay term has a natural asymptote of 0, so x[n] levels off at z on its own.
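As a small sketch in Python (using the parameter names a, b, z from above, and assuming the per-point values are summed to get the total score):
def point_values_simple(a, b, z, n):
    # exponential decay a*(b/a)^k per point, floored at the minimum z
    ratio = b / a
    return [max(z, a * ratio**k) for k in range(n)]

def point_values_translated(a, b, z, n):
    # relative decay toward z: no max() needed, the curve approaches z smoothly
    diff = a - z
    ratio = (b - z) / diff
    return [z + diff * ratio**k for k in range(n)]

# Example: first point worth 10, second worth 5, minimum value 0.75
print(point_values_simple(10, 5, 0.75, 8))
print(point_values_translated(10, 5, 0.75, 8))
# The score for p points is then sum(values[:p]).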

Determinate State of Linear Congruential Generator from Results

I am curious how someone would go about determining the state of a Linear Congruential Generator given its output.
X(n+1) = (a*X(n) + c) mod p
Since the values returned are deterministic and the formula is well known, it should be possible to obtain the state's value. What exactly is the best way to do this?
Edit:
I was at work when I posted this and this isn't work related, so I didn't spend much time and should have elaborated (much) further.
Assume this is used to generate non-integer values between 0 and 1, but its only visible output is true or false with a 50/50 spread. Assume the implementation is also known, so the values of a, c and p are known, but not X.
Would it be possible, with a finite amount of output, to determine the value of X?
Well, in the simplest case, the output IS the state -- the output is the sequence X0, X1, X2, ... each element of which is the internal state at one step.
More commonly, the LCRNG will be divided down to generate uniform numbers in the range [0,k) rather than [0,p) (the values output will be floor(k*Xn/p)), so the output only tells you the upper bits of the internal state. In this case, each output value gives you a range of possible values for the state at that step. By correlating multiple consecutive outputs, you can narrow down the range.
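A brute-force sketch of that narrowing idea in Python (toy parameters chosen only for illustration; it assumes p is small enough to enumerate, and that each output is floor(k*X/p), which for k=2 is exactly the true/false top-bit case from the question):
def recover_state(outputs, a, c, p, k):
    # outputs[i] = floor(k * X_i / p); returns the possible values of X_0
    candidates = [x for x in range(p) if (k * x) // p == outputs[0]]
    survivors = []
    for x0 in candidates:
        x, ok = x0, True
        for out in outputs[1:]:
            x = (a * x + c) % p          # advance the LCG one step
            if (k * x) // p != out:      # a later output disagrees: discard this candidate
                ok = False
                break
        if ok:
            survivors.append(x0)
    return survivors

# toy example: 16-bit state, top-bit output (k = 2)
a, c, p, k = 4005, 12345, 2**16, 2
seed = 31337
states = [seed]
for _ in range(29):
    states.append((a * states[-1] + c) % p)
outputs = [(k * s) // p for s in states]
print(recover_state(outputs, a, c, p, k))   # 30 one-bit outputs usually pin down the 16-bit seed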

Gradient boosting practical example or broad hint needed

I recently came into contact with boosted tree algorithms and implemented a small regression tree class for multidimensional outputs y1 .. yo depending on features x1 .. xn.
I have read several articles about boosting and the advantages of Friedman's stochastic gradient boosting, but I got stuck implementing it, since all the math equations are quite hard for me; I need it in a bit more of a 'coding style'.
First of all, I did not find a practical example on that topic anywhere. I know that there are third-party libraries which could suit quite well, but I want to understand what's going on and hence implement it myself ;-)
Okay, assuming that my regression tree yields the required results which are needed for the boosting implementation, here is my problem now:
The algorithm builds up a regression forest consisting of n trees, where the results of one tree are part of the inputs for the next tree. Since this is an iterative approach, it starts with an initial guess of the target function (prediction = intercept + x1*T1 + x2*T2 + ... + xn*Tn), where Tn is the prediction of the regression tree at position n and xn is a weight for that prediction.
Pseudocode:
Initialize the intercept with the mean values of the o target dimensions (columns)
Initialize a matrix "h values" (the current prediction result at the i-th iteration) with n rows, o columns
For iteration = 1 to n (number of trees)
3.1 create a gradient matrix (n rows, o cols) and initialize every row with the current training targets minus the current prediction of the result function (the residuals)
3.2 select a random subset of indexes between 0 and n of a given size (e.g. 50% of all data)
3.3 get the gradient rows as inputs for y and the corresponding x values and set up a new regression tree
3.4 train the tree and add it to the result function
3.5 Modify the "h values" row by row with the new row value += Prediction * LearningRate
-- END OF ITERATIONS
4 return the result function
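For illustration, here is a minimal sketch of that loop in Python/NumPy, using scikit-learn's DecisionTreeRegressor as a stand-in for the custom regression tree class; the residual used as the "gradient" corresponds to squared-error loss, and all names and defaults here are assumptions rather than the original setup:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, Y, n_trees=100, learning_rate=0.1, subsample=0.5):
    n, o = Y.shape
    intercept = Y.mean(axis=0)                 # step 1: column means of the targets
    h = np.tile(intercept, (n, 1))             # step 2: current prediction matrix
    trees = []
    rng = np.random.default_rng(0)
    for _ in range(n_trees):                   # step 3
        residuals = Y - h                      # 3.1: negative gradient of squared loss
        idx = rng.choice(n, size=int(subsample * n), replace=False)   # 3.2
        tree = DecisionTreeRegressor(max_depth=4)
        tree.fit(X[idx], residuals[idx])       # 3.3 / 3.4: fit the tree to the residuals
        h += learning_rate * tree.predict(X).reshape(n, -1)   # 3.5: update every row, not only the subset
        trees.append(tree)
    return intercept, trees

def predict(X, intercept, trees, learning_rate=0.1):
    pred = np.tile(intercept, (X.shape[0], 1))
    for tree in trees:
        pred += learning_rate * tree.predict(X).reshape(X.shape[0], -1)
    return pred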
First of all, do you see an error in my understanding? If I take sample data that performs quite well with a single tree (5% error with a depth of 10 and a minimum of three samples per node), I get an error of about 50%; the graph is nearly a flat line at the mean value.
Second, I am not sure whether I got the point of the "gradient", since I only use (training value - current prediction of the result function) instead of a "real" loss function like I read about everywhere.
Third and last of all, I'm also not quite sure about step 3.5: is that all I have to do there?
Any help or hint would be great.
Thanks a lot in advance
Monga

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 as their respective probabilities of landing on heads when tossed. Toss coin 0; if the result is heads, toss coin 1 three times, else toss coin 2 three times. The observed data produced by coins 1 and 2 looks like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being their probabilities of landing on heads when tossed. Each round, select one coin at random, toss it 10 times and record the results. The observed data is the toss results produced by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three-coins example this is the result of tossing coin 0. If you knew the outcome of that, you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not given - the results from coin 0. For a coin that gets A heads and B tails, the maximum likelihood estimate for the probability of heads is A/(A+B) - it might be worth working this out in detail (see the short derivation below), because it is the building block for the M step.
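For reference, the derivation of that building block is one line: maximize the log likelihood of A heads and B tails over the head probability p.
\ell(p) = A \ln p + B \ln(1 - p)
\ell'(p) = \frac{A}{p} - \frac{B}{1 - p} = 0 \;\Rightarrow\; A(1 - p) = B p \;\Rightarrow\; \hat{p} = \frac{A}{A + B}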
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails attributed to each coin. In the PDF they work out P(Y=H|X=HHH) = 0.6967. This means that you use weight 0.6967 for the case Y=H: you increment the counts for Y=H by 0.6967 and increment the heads count for coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the heads count for coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum likelihood estimate of a coin's head probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct, algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may have been produced by any of N different processes, of which we know the general form (e.g., Gaussian) but not the parameters (e.g., the means and/or variances), and we may not even know the relative likelihoods of the processes. (Typically we do at least know the number of processes. Without that, we are in so-called "non-parametric" territory.) In a sense, the process which generates each data point is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
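To make the hard-versus-soft contrast concrete, here is a toy sketch (assumptions: one-dimensional data, two Gaussian components with known unit variance and equal mixing weights, so only the means are estimated):
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 300)])

mu = np.array([-1.0, 1.0])   # initial guesses for the two means
for _ in range(20):
    # density of every point under each component (unit variance)
    dens = np.vstack([np.exp(-0.5 * (data - m)**2) / np.sqrt(2 * np.pi) for m in mu])

    # classify-maximize: hard assignment, then per-cluster mean (shown, not used below)
    hard = dens.argmax(axis=0)
    mu_hard = np.array([data[hard == k].mean() for k in range(2)])

    # EM: soft responsibilities (columns sum to 1), then weighted means
    resp = dens / dens.sum(axis=0)
    mu = (resp * data).sum(axis=1) / resp.sum(axis=1)

print(mu)   # should approach the true means, roughly [0, 4]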
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in your second paper, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: The code works, but I am certain it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import math
import numpy as np

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou ####

def get_mn_log_likelihood(obs, probs):
    """Return the log likelihood of obs (a vector of counts), given the probs,
    under a multinomial distribution."""
    # Multinomial log PMF:
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(len(obs)):  # loop through the state counts in each observation
        multinomial_coeff_denom += math.log(math.factorial(int(obs[x])))
        prod_probs += obs[x] * math.log(probs[x])
    multinomial_coeff = math.log(math.factorial(int(sum(obs)))) - multinomial_coeff_denom
    return multinomial_coeff + prod_probs

# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads) = 0.45

# represent the experiments
head_counts = np.array([5, 9, 8, 4, 7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed in Python 3

# initialise pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)
    for i in range(len(experiments)):
        e = experiments[i]  # the i-th experiment: (heads, tails)
        # log likelihood of e given coin A and given coin B
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1 - pA_heads[j]]))
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1 - pB_heads[j]]))
        # weights of A and B proportional to their likelihoods (E-step responsibilities)
        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))
        expectation_A[i] = weightA * np.array(e)
        expectation_B[i] = weightB * np.array(e)
    # M-step: new head probabilities from the weighted counts
    pA_heads[j + 1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j + 1] = sum(expectation_B)[0] / sum(sum(expectation_B))
    improvement = max(abs(np.array([pA_heads[j + 1], pB_heads[j + 1]]) -
                          np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1

print('pA(heads) = %0.2f, pB(heads) = %0.2f' % (pA_heads[j], pB_heads[j]))
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
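To make that concrete for the three-coins example, here is a sketch of the updates (my notation, not the paper's: \lambda is coin 0's head probability, h_i is the number of heads in the i-th block of three tosses, m is the number of blocks, and \mu_i = E[c_i1 | data] under the current parameters):
E-step:
\mu_i = \frac{\lambda\, p_1^{h_i} (1-p_1)^{3-h_i}}{\lambda\, p_1^{h_i} (1-p_1)^{3-h_i} + (1-\lambda)\, p_2^{h_i} (1-p_2)^{3-h_i}}
M-step:
\lambda = \frac{1}{m} \sum_i \mu_i, \qquad
p_1 = \frac{\sum_i \mu_i h_i}{3 \sum_i \mu_i}, \qquad
p_2 = \frac{\sum_i (1-\mu_i) h_i}{3 \sum_i (1-\mu_i)}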

Minimizing a function of vectors

I need to minimize the following sum:
minimize: sum over i = 1 to n of fi(v(i), v(i-1), tangent(i))
v and tangent are vectors.
fi takes the 3 vectors as arguments and returns a cost associated with these 3 vectors. For this function, v(i - 1) is the vector chosen in the previous iteration. tangent(i) is also known. fi calculates the cost of choosing a vector v(i), given the other two vectors v(i - 1) and tangent(i). The v(0) and v(n) vectors are known. The tangent(i) values are also known in advance for all i = 0 to n.
My task is to determine all such v(i)s such that the total cost of the function values for i = 1 to n is minimized.
Can you please give me any ideas to solve this?
So far I can think of Branch and Bound or dynamic programming methods.
Thanks!
I think this is a problem in mathematical optimisation, with an objective function built up of dot products and arcCosines, subject to the constraint that your vectors should be unit vectors. You could enforce this either with Lagrange multipliers, or by including a normalising step in the arc-Cosine. If Ti is a unit vector then for Vi calculate cos^-1(Ti.Vi/sqrt(Vi.Vi)). I would have a go at using a conjugate gradient optimiser for this, or perhaps even Newton's method, with my starting point Vi = Ti.
I would hope that this would be reasonably tractable, because the Vi are only related to neighbouring Vi. You might even get somewhere by repeatedly adjusting each Vi in isolation, one by one, to optimise the objective function. It might be worth just seeing what happens if you repeatedly set Vi to be the average of Ti, Vi+1, and Vi-1, and then scaled Vi to be a unit vector again.
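A minimal sketch of that last heuristic (assuming V and T are NumPy arrays of unit row vectors, with V[0] and V[-1] fixed to the known endpoints):
import numpy as np

def smooth_vectors(V, T, iterations=100):
    V = V.copy()
    for _ in range(iterations):
        for i in range(1, len(V) - 1):            # endpoints stay fixed
            avg = (T[i] + V[i - 1] + V[i + 1]) / 3.0
            V[i] = avg / np.linalg.norm(avg)      # rescale back to a unit vector
    return V

# starting point Vi = Ti, as suggested above:
# V = smooth_vectors(T.copy(), T)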
