Stanford NER - Caught OutOfMemory, changing m - stanford-nlp

While training a model using Stanford NER 2016-10-31 the following line was output:
Iter 7 evals 17 <D> [M 1,000E0] {Caught OutOfMemory, changing m from 25 to 5}] |2,519E4| {1,000E-1} 2,295E-2 -
What does this mean (what is m?) and how does it affect the training of the model?
Thanks!

I got a similar warning. Here is a minimal explanation:
References:
(1) An implementation of L-BFGS for quasi-Newton unconstrained minimization in Stanford CoreNLP.
(2) Jorge Nocedal and Stephen J. Wright. Numerical Optimization, second edition, 2006.
As per (1), m is "mem - the number of previous estimate vector pairs to store; generally 15 is plenty."
As mentioned in book (2), "the algorithm tends to be less robust when m is small."
So the number of recent correction pairs is decreased on OutOfMemory; however, you still have a good chance of solving the optimisation problem with the limited-memory BFGS algorithm.
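For intuition, scipy's L-BFGS-B exposes the same parameter as maxcor, so you can watch the effect of shrinking the history size. A minimal sketch (the Rosenbrock function is just a stand-in objective, not anything Stanford NER uses):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(50, 2.0)
for m in (25, 5):
    # maxcor plays the role of m: the number of stored correction pairs
    res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                   options={"maxcor": m})
    print(f"m={m}: iterations={res.nit}, f(x*)={res.fun:.3e}")

Typically the m=5 run needs somewhat more iterations but still converges, matching the "less robust, not broken" behaviour described above.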

Related

Multichannel blind deconvolution in the simplest formulation: how to solve?

Recently I began to study deconvolution algorithms and met the following acquisition model:
g = h * f + n,
where f is the original (latent) image, g is the input (observed) image, h is the point spread function (degradation kernel), n is a random additive noise and * is the convolution operator.
If we know g and h, then we can recover f using the Richardson-Lucy algorithm:
f_{i+1} = f_i . (h^ * (g / (h * f_i))), f_0 = g,
where h^(x,y) = h(W-1-x, H-1-y), (W,H) is the size of the rectangular support of h, and multiplication (.) and division are pointwise. Simple enough to code in C++, so I did just so. It turned out that f_i approximates f while i is less than some m, and then the estimate starts to rapidly decay. So the algorithm just needed to be stopped at this m, the most satisfactory iteration.
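For reference, a minimal numpy sketch of this non-blind iteration (my own code, using zero-padded FFT convolution and a small epsilon to avoid division by zero):

import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(g, h, m):
    # g: observed image, h: known PSF normalized to sum to 1, m: iteration to stop at
    f = g.copy()                    # initial guess f_0 = g
    h_flip = h[::-1, ::-1]          # flipped kernel h^(x,y) = h(W-1-x, H-1-y)
    for _ in range(m):
        ratio = g / (fftconvolve(f, h, mode="same") + 1e-12)
        f = f * fftconvolve(ratio, h_flip, mode="same")
    return f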
If the point spread function h is also unknown, then the problem is said to be blind, and a modification of the Richardson-Lucy algorithm can be applied, alternating the same multiplicative update between the two unknowns:
h_{k+1} = h_k . (f^_k * (g / (h_k * f_k))),
f_{k+1} = f_k . (h^_{k+1} * (g / (h_{k+1} * f_k))).
For the initial guess for f we can take g, as before, and for the initial guess for h we can take a trivial PSF, or any simple form that would look similar to the observed image degradation. This algorithm also works quite well on the simulated data.
Now I consider the multiframe blind deconvolution problem with the following acquisition model:
g_k = h_k * f + n_k, k = 1, ..., K.
Is there a way to develop the Richardson-Lucy algorithm for solving the problem in this formulation? If not, is there any other iterative procedure for recovering f that wouldn't be much more complicated than the previous ones?
According to your acquisition model, the latent image (f) remains the same while the observed images differ because of the different PSFs and noise realizations. One way to look at it is as a motion-blur problem, where a sharp and noise-free image (f) is corrupted by a motion blur kernel. As this is an ill-posed problem, in most of the literature it is solved iteratively by alternating between estimating the blur kernel and the latent image. The way you solve this depends entirely on your objective function.
For example in some papers IRLS is used to estimate the blur kernel. You can find a lot of literature on this.
If you want to use Richardson Lucy Blind deconvolution, then use it on just one frame.
One strategy can be, in each iteration while recovering f, to assign different weights to the contribution from each g (observed image), as sketched below. You can incorporate the weights in the objective function or calculate them according to the estimated blur kernels.
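As a rough illustration of that weighting idea (my own sketch, not from a specific paper): run one Richardson-Lucy correction per frame and blend the multiplicative update factors with the weights:

import numpy as np
from scipy.signal import fftconvolve

def weighted_multiframe_rl(gs, hs, ws, iters):
    # gs: observed frames, hs: current per-frame PSF estimates,
    # ws: per-frame weights summing to 1, applied to each RL correction factor
    f = np.mean(gs, axis=0)   # crude initial guess for the latent image
    for _ in range(iters):
        update = np.zeros_like(f)
        for g, h, w in zip(gs, hs, ws):
            ratio = g / (fftconvolve(f, h, mode="same") + 1e-12)
            update += w * fftconvolve(ratio, h[::-1, ::-1], mode="same")
        f = f * update
    return f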
Is there a way to develop Richardson-Lucy algorithm for solving the problem in this formulation?
I'm not a specialist in this area, but I don't think that such a way to construct an algorithm exists, at least not straightforwardly. Here is my argument. The first problem you described (when the PSF is known) is already ill-posed due to the random nature of the noise and the loss of information about the convolution near the image edges. The second problem on your list, single-channel blind deconvolution, is an extension of the previous one: in addition it is underdetermined, so the ill-posedness grows, and it is natural that the method for solving it is developed from the method for the first problem. When we consider the multichannel blind deconvolution formulation, however, we add a lot of additional information to the previous model, and the problem goes from underdetermined to overdetermined. This is a whole other kind of ill-posedness, and hence different approaches to the solution are required.
is there any other iterative procedure for recovering f that wouldn't be much more complicated than the previous ones?
I can recommend the algorithm introduced by Šroubek and Milanfar in [1]. I'm not sure whether it is much more complicated in your opinion or not, but it is by far one of the most recent and robust. The formulation of the problem is precisely the same as you wrote. The algorithm takes as input K > 1 images, an upper bound L on the PSF size, and four tuning parameters: alpha, beta, gamma, delta. To specify gamma, for example, you will need to estimate the variance of the noise in your input images, take the largest variance var, and set gamma = 1/var. The algorithm solves, by alternating minimization, an optimization problem of (schematically) the form
min over f, h of gamma*F(f,h) + alpha*Q(f) + beta*R(h),
where F is the data fidelity term and Q and R are regularizers of the image and the blurs, respectively.
For detailed analysis of the algorithm see [1], for a collection of different deconvolution formulation and their solutions see [2]. Hope it helps.
References:
[1] Filip Šroubek, Peyman Milanfar. Robust Multichannel Blind Deconvolution via Fast Alternating Minimization. IEEE Transactions on Image Processing, vol. 21, no. 4, April 2012.
[2] Patrizio Campisi, Karen Egiazarian. Blind Image Deconvolution: Theory and Applications.

Algorithm for finding the optimal weights for prediction

Consider I have the following weights and quantitative parameters: w_1..w_n, p_1..p_n, with 0 <= w <= 1. I also have a collection of cases, each with parameter values and an associated outcome value.
What algorithms exist for finding the optimal weights to minimize the errors of predicting the value given the parameters? And what algorithms have typically achieved the best results?
For example, I try to predict the quality of an apple based on the parameters p_1 = transport_time and p_2 = days_since_picking. The quality is measured on a subjective Likert scale.
Fifty people have rated apples with scores from 1 to 5, and I know p_1 and p_2 for all those apples. How do I predict the quality and find the weights for p_1 and p_2 that minimize the total error over the cases?
I agree with the comment that you should run a web search on "linear regression". At least three other sources for lists of algorithms come to mind:
NLopt: http://ab-initio.mit.edu/wiki/index.php/NLopt_Algorithms (and my C# wrapper for it: https://github.com/BrannonKing/NLoptNet)
S. Boyd's book: http://stanford.edu/~boyd/cvxbook/
You could probably use a supervised AI algorithm. Neural networks are typically made up of "weights": https://en.wikipedia.org/wiki/Supervised_learning
You could also use a genetic algorithm in conjunction with Gray-coded weight encoding.
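For the concrete apple example, ordinary least squares already gives you weights that minimize the squared prediction error. A minimal sketch with made-up data (the arrays here are placeholders for your own measurements):

import numpy as np

# toy data: 50 rated apples, hypothetical values
rng = np.random.default_rng(0)
transport_time = rng.uniform(1, 10, 50)       # p_1
days_since_picking = rng.uniform(0, 30, 50)   # p_2
ratings = rng.integers(1, 6, 50)              # Likert scores 1..5

# design matrix with an intercept column
X = np.column_stack([np.ones(50), transport_time, days_since_picking])
w, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print("intercept, w_1, w_2 =", w)

If you really need the constraint 0 <= w <= 1, scipy.optimize.lsq_linear accepts box bounds on the coefficients.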

Number of checks to find the median of a set of random 16 numbers

Our teacher gave us a problem: program a way to count the number of checks needed to find the median of a set of 16 random numbers using a decision tree. He also told us that the least possible number of checks would be 2N; however, when we did it on paper we came out with 27 checks, and we double-checked our work and everything was right. So is there a definitive answer for the minimum number of checks?
The paper "Samuel W. Bent and John W. John. Finding the median requires 2n comparisons. In Proceedings
of the Seventeenth Annual ACM Symposium on Theory of Computing, pages 213{216, 1985." proves that 2n is a lower bound.
This paper requires access to ACM, but there is an alternative proof on page 3 of "ON LOWER BOUNDS FOR SELECTING THE MEDIAN" available here.
Perhaps you could explain your proposed solution?
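In the meantime, it can help to count comparisons empirically. A small sketch (my own illustration) that counts the comparisons a randomized quickselect performs when finding the lower median of 16 numbers; note that this straightforward method uses noticeably more than the 2n lower bound:

import random

count = 0

def less(a, b):
    # every comparison ("check") goes through here so we can count it
    global count
    count += 1
    return a < b

def quickselect(xs, k):
    # return the k-th smallest (0-indexed) element of xs
    pivot = xs[random.randrange(len(xs))]
    lo = [x for x in xs if less(x, pivot)]
    hi = [x for x in xs if less(pivot, x)]
    if k < len(lo):
        return quickselect(lo, k)
    if k >= len(xs) - len(hi):
        return quickselect(hi, k - (len(xs) - len(hi)))
    return pivot

nums = [random.random() for _ in range(16)]
median = quickselect(nums, 7)   # lower median of 16 elements
print("comparisons used:", count)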

Probability of failure - Limit State Function - Monte Carlo Method

I want to calculate the probability of failure, pf, adopting the Monte Carlo method.
The limit state equation is obtained by comparing the substance content at a time t, C(x=a, t), with the critical content, Ccrit:
LSF: g(Ccrit, C(x=a,t)) = Ccrit - C(x=a, t) < 0
Ccrit follows a beta distribution, Ccrit ~ B(mean=0.6, s=0.15, a=0.20, b=2.0). Generated distribution (note that scipy's scale argument is the interval width b-a, and size must be an integer):
from scipy.stats import beta, truncnorm

r = ((mean-a)/(b-a))*((((mean-a)*(b-mean))/(s**2))-1)
t = ((b-mean)/(b-a))*((((mean-a)*(b-mean))/(s**2))-1)
Ccrit = beta.rvs(r, t, loc=a, scale=b-a, size=int(1e6))
C(x=a, t) is a function of 11 other variables (beta, normal, deterministic, lognormal, etc.) and varies with time t. These variables have been defined adopting scipy.stats, e.g.:
Var1 = truncnorm.rvs(0, 1000, loc=60e-3, scale=6e-3, size=int(1e6))
(...)
Var11 = Csax = dist.lognormal(l, z, 1e6)
After all the variables are generated I am having difficulty computing the pf.
I have seen that:
P(Ccrit < C) = integral from -inf to +inf of F_Ccrit(c) * f_C(c) dc
leads to the pf but I am clueless on how to calculate it.
Will appreciate your help,
Thank you
Well, as I understood your question, this is the way to compute the probability of failure from a crude Monte Carlo simulation:
pf = sum(I(g(x))) / N
where:
N - is the number of simulations
x - is the vector of all the involved random variables
I(arg) - is an indicator function, defined as:
if arg < 0
I = 1
else
I = 0
end
The simulation methods were basically invented to circumvent complicated or impossible integrals; in this case there is no need for the integration you mentioned.
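In your setup that boils down to something like the following sketch. The lognormal stand-in for C(x=a, t) is purely hypothetical; substitute your 11-variable model evaluated at a fixed time t:

import numpy as np
from scipy.stats import beta

N = int(1e6)
rng = np.random.default_rng(1)

# Ccrit ~ Beta scaled to [a, b], with r and t as in the question
a, b, mean, s = 0.20, 2.0, 0.6, 0.15
r = ((mean-a)/(b-a))*((((mean-a)*(b-mean))/(s**2))-1)
t = ((b-mean)/(b-a))*((((mean-a)*(b-mean))/(s**2))-1)
Ccrit = beta.rvs(r, t, loc=a, scale=b-a, size=N, random_state=rng)

# hypothetical stand-in for C(x=a, t) at a fixed time t
C = rng.lognormal(mean=-1.0, sigma=0.5, size=N)

g = Ccrit - C                        # limit state function
pf = np.mean(g < 0)                  # sum(I(g(x))) / N
cov = np.sqrt((1 - pf) / (pf * N))   # coefficient of variation of the estimate
print(pf, cov)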
Keep in mind that the coefficient of variation of the estimate is proportional to 1/sqrt(N).
I tried to be as clear as possible with the notation; in case it is problematic to follow, see these lecture notes for better formatting.
I assumed you used crude Monte Carlo, but for importance sampling you can find the formulas in the linked source as well.
The above formulation is time-invariant; the fact that your problem involves time makes the task much harder in general.
The solution technique depends on the type of time-variance; because no details are given in this regard, I can only recommend a book (Melchers, Structural Reliability Analysis and Prediction) where the question is treated in detail.
In general, time-variant problems can be reduced (at least in an approximate manner) to time-invariant problems, and then the above formulation can be used. Alternatively, you might calculate the probability of failure at every time instant with the above sketched 'method', if that makes sense for your problem.
Because C is a substance content, the problem might contain no stochastic process but only a monotonically increasing (in time) random variable. In that case the probability of failure is the time-invariant probability of failure at the last time instant (when the concentration is closest to the critical value), so the above mentioned Monte Carlo technique could be used directly. This type of problem is called a right-boundary problem; for more details see:
Construction Reliability: Safety, Variability and Sustainability. Chapter 10.
If this is not what you want to accomplish, please give us more details.

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km with hilly courses, i.e. every competition has its own difficulty.
The finishing times of the users are almost inverse-normally distributed for every competition.
One can write this problem as a matrix:
Comp1 Comp2 Comp3
User1 20min ?? 10min
User2 25min 20min 12min
User3 30min 25min ??
User4 30min ?? ??
I would like to complete the matrix above, which has size 1000x20 and a sparseness of 8% (!).
There should be a fairly easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1.
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of the ability of a user and the difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error over the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a conjugate gradient solver or a BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs; see the sketch below. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
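A compact scipy sketch of this (my own toy code; u and v are packed into one flat parameter vector because the optimizer expects a 1-D array, and the sum is restricted to observed entries via a 0/1 mask):

import numpy as np
from scipy.optimize import minimize

def complete(Y, mask, n_users, n_courses, seed=0):
    # Y: observed speeds y_ij = 1/t_ij (zero where missing); mask: 1 where observed
    def loss_and_grad(x):
        u, v = x[:n_users], x[n_users:]
        R = (np.outer(u, v) - Y) * mask           # residuals on observed entries only
        f = np.sum(R ** 2)
        g = 2 * np.concatenate([R @ v, R.T @ u])  # df/du and df/dv from above
        return f, g

    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n_users + n_courses)
    res = minimize(loss_and_grad, x0, jac=True, method="BFGS")
    u, v = res.x[:n_users], res.x[n_users:]
    return np.outer(u, v)   # completed speed matrix; predicted times are the reciprocals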
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations of this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + lambda * ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m and lambda trades off data fit against rank. Implementations will again end up computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this; perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can: fit a function to the distribution of all race results, apply its inverse, and then subtract the mean and divide by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
total_error = 0
for all i, j:
    if participant i participated in race j:
        actual = ActualRaceResult(i, j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
Calculate the partial derivatives of this function with respect to the feature vectors and adjust them incrementally, as in any usual gradient-based ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors; it appears in the sketch below.)
Let me know if this architecture is clear to you or you need further elaboration.
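A minimal SGD sketch of this architecture (my own toy code; N is the hyperparameter from the list above and lam is the regularization weight):

import numpy as np

def train(results, n_participants, n_races, N=5, lam=0.1, lr=0.01, epochs=200, seed=0):
    # results: list of (participant i, race j, normalized race result) triples
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_participants, N))  # participant feature vectors
    R = rng.normal(scale=0.1, size=(n_races, N))         # race feature vectors
    for _ in range(epochs):
        for i, j, actual in results:
            predicted = P[i] @ R[j]       # scalar product prediction
            err = actual - predicted
            # gradient step on squared error plus L2 regularization
            P[i] += lr * (err * R[j] - lam * P[i])
            R[j] += lr * (err * P[i] - lam * R[j])
    return P, R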
I think this is a classical task of missing data recovery. Several different methods exist. One that I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it is assumed that every athlete record is a pattern, and every competition's data is a feature.
Basically, you should divide your data into 2 sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is feasible because the sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weights corresponding to the missing features, as illustrated below.
As an alternative, you could skip dividing the data into 2 sets and train the net on all patterns, including those with missing features. But for such patterns the learning process should be altered in a similar way, that is, the BMU should be calculated only on the features existing in each pattern.
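A bare-bones numpy illustration of the imputation step (not a full SOM implementation; weights stands for an already-trained codebook):

import numpy as np

def impute_with_som(pattern, weights):
    # pattern: 1-D array with np.nan where a feature is missing
    # weights: trained SOM codebook, shape (n_units, n_features)
    observed = ~np.isnan(pattern)
    # BMU search restricted to the features present in this pattern
    d = np.sum((weights[:, observed] - pattern[observed]) ** 2, axis=1)
    bmu = weights[np.argmin(d)]
    filled = pattern.copy()
    filled[~observed] = bmu[~observed]   # take missing features from the BMU
    return filled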
I think you can have a look at the recent low-rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimensions:
min rank(M)
s.t. ||P(M - M')||_F = 0
M is the final result, and M' is the incomplete matrix you currently have.
This program minimizes the rank of your matrix M. P in the constraint is an operator that takes the known entries of your matrix M' and constrains the corresponding entries of M to be the same as in M'.
This optimization problem has a relaxed version:
min ||M||_* + lambda * ||P(M - M')||_F
Here rank(M) is relaxed to its convex envelope ||M||_*, and you trade off the two terms by controlling the parameter lambda, as in the sketch below.
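For a quick experiment, an iterative soft-thresholding (soft-impute style) sketch of this kind of relaxation (my own toy code; lam plays the role of lambda):

import numpy as np

def soft_impute(M_obs, mask, lam=1.0, iters=100):
    # M_obs: matrix with the known entries filled in; mask: 1 where known, 0 elsewhere
    M = np.where(mask, M_obs, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - lam, 0.0)     # soft-threshold singular values (nuclear norm prox)
        X = (U * s) @ Vt
        M = np.where(mask, M_obs, X)     # keep known entries, fill the rest from X
    return X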
