Does Spark MLlib support nonlinear optimization with nonlinear constraints? - apache-spark-mllib

Component: Spark MLlib
Level: Beginner
Scenario: Does Spark MLlib support nonlinear optimization with nonlinear constraints?
Our business application supports two types of functional forms, convex and S-shaped curves, together with linear and non-linear constraints. These constraints can be combined with either functional form, one at a time.
Example of convex curve:
Y = c^k*pow(a^k,p^k)
Example of S-shaped curve:
Y = c^k*pow(a^k,p^k)/(b^k + pow(a^k,p^k))
Example of non-linear constraints:
Min Bound (50%) < ∑(k=0 to n) c^k*pow(a^k,p^k) < Max Bound (150%)
Example of linear constraints:
Min Bound (50%) < a+b+c < Max Bound (150%)
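(For illustration only, the functional forms and constraints above could be written down in Python as follows; the coefficient arrays and bounds are hypothetical placeholders, not part of the original problem statement.)
import numpy as np

def convex_curve(c, a, p):
    # Y^k = c^k * (a^k)^(p^k), evaluated elementwise over the k curves
    return c * a ** p

def s_shaped_curve(c, a, p, b):
    # Y^k = c^k * (a^k)^(p^k) / (b^k + (a^k)^(p^k))
    ap = a ** p
    return c * ap / (b + ap)

def nonlinear_constraint_ok(c, a, p, min_bound, max_bound):
    # Min Bound < sum over k of c^k * (a^k)^(p^k) < Max Bound
    total = np.sum(c * a ** p)
    return min_bound < total < max_bound

def linear_constraint_ok(a, b, c, min_bound, max_bound):
    # Min Bound < a + b + c < Max Bound
    return min_bound < a + b + c < max_bound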
At present we are using SAS to solve these business problems. We are looking for replacement software that can solve similar kinds of problems with performance equivalent to SAS.
Also, please share benchmarks of its performance: how does it perform as the number of variables keeps increasing?

Related

Which algorithm performs better for high-dimensional features and a small sample size?

I am trying to deal with high-dimensional data and a small sample size. For example, Y is a 500x1 matrix and X is a 500x10000 matrix. Are there regression methods better suited to this data?
One applicable solution is to apply a dimensionality reduction method such as PCA (principal component analysis) to X and then run the regression on the PCA scores.
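For example, a minimal sketch with scikit-learn (synthetic data standing in for your 500x10000 X; the number of components and the Ridge penalty are placeholders to tune):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with the shapes from the question.
X = np.random.randn(500, 10000)
y = np.random.randn(500)

# Reduce X to a handful of principal components, then regress on them.
model = make_pipeline(PCA(n_components=50), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())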

Large Residual-Online Outlier Detection for Kalman Filter

I am trying to find outliers in the residual. I used three algorithms; basically, when the residual magnitudes are small the algorithms perform well, but when the residual magnitudes are large they do not.
1) X^2 = (y - h(x))^T S^(-1) (y - h(x)) - Chi-Square Test
If the matrix is 3x3, the degrees of freedom are 4.
X^2 > 13.277
2) Residual(i) > 3 * sqrt(H P H^T + R) - Measurement Covariance Noise
3) Residual(i) > 3-Sigma
I have applied these three algorithms to find the outliers: the first is the chi-square test, the second checks against the measurement noise covariance, and the third uses the 3-sigma rule.
Can you give any suggestions about these algorithms, or suggest a new approach I could implement?
The third case cannot be correct in all cases, because it will fail whenever there is a large residual. The second one is more stable because it is tied to the measurement noise covariance, so the acceptable residual changes according to the measurement covariance error.
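For illustration, a minimal sketch of the chi-square gate on the innovation (assuming numpy arrays H, P, R from your filter and the residual y - h(x); the significance level is a placeholder):
import numpy as np
from scipy.stats import chi2

def chi_square_gate(residual, H, P, R, df, alpha=0.01):
    # Innovation covariance S = H P H^T + R
    S = H @ P @ H.T + R
    # X^2 = (y - h(x))^T S^(-1) (y - h(x))
    d2 = float(residual @ np.linalg.solve(S, residual))
    threshold = chi2.ppf(1 - alpha, df)  # e.g. chi2.ppf(0.99, 4) ~= 13.277
    return d2 > threshold                # True -> flag this residual as an outlier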

Probability of failure - Limit State Function - Monte Carlo Method

I want to calculate the probability of failure, pf, using the Monte Carlo method.
The limit state equation is obtained by comparing the substance content at a time t, C(x=a,t), and the critical content, Ccrit:
LSF: g(Ccrit, C(x=a,t)) = Ccrit - C(x=a, t) < 0
Ccrit follows a beta distribution Ccrit~B(mean=0.6, s=0.15, a=0.20, b=2.0). Generated distribution:
r = ((mean - a) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
t = ((b - mean) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
Ccrit = beta.rvs(r, t, loc=a, scale=b - a, size=int(1e6))
C(x=a, t) is function of 11 other variables (beta, normal, deterministic, lognormal etc) and varies with time t. These variables have been defined adopting scipy.stats eg:
Var1 = truncnorm.rvs(0, 1000, loc=60e-3, scale=6e-3, size=int(1e6))
(...)
Var11=Csax=dist.lognormal(l, z, 1e6)
After all the variables are generated I am having difficulty computing the pf.
I have seen that:
P(Ccrit < C) = integral from -inf to +inf of F_Ccrit(c) * f_C(c) dc
leads to the pf, but I am unsure how to calculate it.
Will appreciate your help,
Thank you
Well, as I understood your question, this is the way to compute the probability of failure from a crude Monte Carlo simulation:
pf = sum(I(g(x))) / N
where:
N - is the number of simulations
x - is the vector of all the involved random variables
I(arg) - is an indicator function, defined as:
if arg < 0
I = 1
else
I = 0
end
Simulation methods were basically invented to circumvent complicated or impossible integrals, so there is no need in this case for the integration you mentioned.
Keep in mind that the coefficient of variation of the estimate is proportional to 1/sqrt(N).
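A minimal sketch of this estimator (the Ccrit parameters are taken from your question; the samples for C(x=a, t) are a placeholder you would replace with the ones built from your 11 variables):
import numpy as np
from scipy.stats import beta, norm

N = int(1e6)
mean, s, a, b = 0.6, 0.15, 0.20, 2.0
r = ((mean - a) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
t = ((b - mean) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
Ccrit = beta.rvs(r, t, loc=a, scale=b - a, size=N)

# Placeholder for C(x=a, t): substitute the samples generated from your 11 variables.
C = norm.rvs(loc=0.4, scale=0.1, size=N)

g = Ccrit - C                        # limit state samples; g < 0 means failure
pf = np.mean(g < 0)                  # = sum(I(g(x))) / N
cov = np.sqrt((1 - pf) / (pf * N))   # coefficient of variation, proportional to 1/sqrt(N)
print(pf, cov)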
I tried to be as clear as possible with the notation; in case it is hard to follow, see these lecture notes for better formatting.
I assumed you used crude Monte Carlo, but for importance sampling you can find the formulas in the linked source as well.
The above formulation is time-invariant; the fact that your problem involves time makes the task much harder in general.
The solution technique depends on the type of time-variance; because no details are given in this regard, I can only recommend a book (Melchers, Structural Reliability Analysis and Prediction) where the question is treated in detail:
In general, time-variant problems can be reduced (at least in an approximate manner) to time-invariant problems, and the above formulation can then be used. Or you might calculate the probability of failure at every time instant with the above sketched 'method', if that makes sense for your problem.
Because C is a substance content, the problem might contain no stochastic process but only a random variable that increases monotonically in time; in this case the probability of failure is the time-invariant probability of failure at the last time instant (when the concentration is closest to the critical value), so the above-mentioned Monte Carlo technique could be used directly. This type of problem is called a right-boundary problem; more details:
Construction Reliability: Safety, Variability and Sustainability. Chapter 10.
If this is not what you want to accomplish, please give us more details.

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km on hilly courses, i.e. every competition has its own difficulty.
The finishing times of the users are approximately inverse-normally distributed for every competition.
One can write this problem as a matrix:
        Comp1   Comp2   Comp3
User1   20min   ??      10min
User2   25min   20min   12min
User3   30min   25min   ??
User4   30min   ??      ??
I would like to complete the matrix above, which has size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover, the correlation between the competitions is very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of ability of a user and difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
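For concreteness, a sketch of Algorithm 1 using scipy (assuming the observed speeds are stored in a numpy array Y of shape participants x races, with np.nan marking missing entries; a rank-one model as above):
import numpy as np
from scipy.optimize import minimize

def fit_rank_one(Y):
    mask = ~np.isnan(Y)
    P, R = Y.shape

    def cost_and_grad(x):
        u, v = x[:P], x[P:]
        # Prediction error only on observed entries; missing entries contribute 0.
        err = np.where(mask, np.outer(u, v) - Y, 0.0)
        f = np.sum(err ** 2)
        du = 2 * err @ v      # df/du_i = 2 * sum_j (u_i v_j - y_ij) v_j
        dv = 2 * err.T @ u    # df/dv_j = 2 * sum_i (u_i v_j - y_ij) u_i
        return f, np.concatenate([du, dv])

    x0 = np.random.rand(P + R)
    res = minimize(cost_and_grad, x0, jac=True, method="L-BFGS-B")
    u, v = res.x[:P], res.x[P:]
    return np.outer(u, v)     # completed speed matrix; predicted times are 1 / speeds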
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this; perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
total_error = 0
forall i,j
if (Participant i participated in Race j)
actual = ActualRaceResult(i,j)
predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
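A minimal sketch of this alternating scheme (assuming the observations come as (participant, race, normalized_result) triples; the hyperparameters and names are placeholders):
import numpy as np

def train(results, n_participants, n_races, N=8, lr=0.01, reg=0.1, epochs=100):
    rng = np.random.default_rng(0)
    PF = rng.normal(scale=0.1, size=(n_participants, N))  # participant feature vectors
    RF = rng.normal(scale=0.1, size=(n_races, N))         # race feature vectors
    for _ in range(epochs):
        for i, j, actual in results:
            predicted = PF[i] @ RF[j]       # scalar product prediction
            err = actual - predicted
            # Gradient step on the squared error plus L2 regularization.
            PF[i] += lr * (err * RF[j] - reg * PF[i])
            RF[j] += lr * (err * PF[i] - reg * RF[j])
    return PF, RF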
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical missing data recovery task. There are several different methods; one I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it's assumed that every athlete record is a pattern and every competition is a feature.
Basically, you should divide your data into 2 sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is feasible because sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed the patterns from the second set to it. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU the weights corresponding to the missing features, as in the sketch below.
As an alternative, you could skip dividing the whole dataset into 2 sets and instead train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way; that is, the BMU should be calculated only on the existing features of every pattern.
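A small sketch of the BMU lookup restricted to existing features (assuming a trained codebook `weights` of shape (n_units, n_features) and np.nan marking missing competitions; names are illustrative):
import numpy as np

def impute_with_som(weights, pattern):
    known = ~np.isnan(pattern)
    # Distance to each unit computed only over the known features.
    d = np.sum((weights[:, known] - pattern[known]) ** 2, axis=1)
    bmu = weights[np.argmin(d)]
    filled = pattern.copy()
    filled[~known] = bmu[~known]   # take the missing values from the BMU's weights
    return filled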
I think you can have a look at recent low-rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M' and constrains those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope ||M||_*. Then you trade off the two terms by controlling the parameter lambda.
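A sketch of a soft-impute / singular value thresholding style iteration for this relaxed problem (assuming the incomplete matrix uses np.nan for unknown entries; lambda and the iteration count are placeholders):
import numpy as np

def soft_impute(M_obs, lam=1.0, n_iter=100):
    mask = ~np.isnan(M_obs)
    M = np.where(mask, M_obs, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - lam, 0.0)      # soft-threshold the singular values
        M_low = (U * s) @ Vt              # low nuclear norm estimate
        M = np.where(mask, M_obs, M_low)  # keep the observed entries fixed
    return M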

How to implement eigenvalue calculation with MapReduce/Hadoop?

It should be possible, because PageRank is a form of eigenvalue problem and that is part of why MapReduce was introduced. But there seem to be problems in an actual implementation, such as every slave computer having to maintain a copy of the matrix?
PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.
If NxM matrix A describes the link weight (amount of flow) from node n to node m, then
p_{n+1} = A . p_{n}
In the limit where p has converged to a steady state (p_{n+1} = p_n), this is an eigenvector problem with eigenvalue 1.
The PageRank algorithm doesn't require the matrix to be held in memory, but is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- you need locality and broad exchange among nodes -- and you should instead look at LAPACK and MPI and friends.
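For reference, a minimal sketch of the power iteration itself in numpy (A is assumed to be a column-stochastic link matrix, dense here for simplicity; damping and the distributed aspects are ignored):
import numpy as np

def power_iteration(A, tol=1e-9, max_iter=1000):
    n = A.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = A @ p                     # p_{n+1} = A . p_n
        p_next /= np.abs(p_next).sum()     # renormalize
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p                               # dominant eigenvector (eigenvalue 1 for PageRank)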
You can see a working PageRank implementation in the wukong library (Hadoop streaming for Ruby) or in the Heritrix pagerank submodule. (The Heritrix code runs independently of Heritrix.)
(disclaimer: I am an author of wukong.)
PREAMBLE:
Given the right sequestration of data, one could achieve parallel computing results without a complete dataset on every machine.
Take for example the following loop:
for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        m[i][j]++;
    }
}
And given a matrix of the following layout:
j=0 j=1 j=2
i=0 [ ] [ ] [ ]
i=1 [ ] [ ] [ ]
i=2 [ ] [ ] [ ]
Parallel constructs exist such that each column j can be sent to a different computer and the individual columns are computed in parallel. The difficult part of parallelization comes when you've got loops that contain dependencies.
for (int i = 0; i < m.length; i++)
{
    for (int j = 0; j < m[i].length; j++)
    {
        // For obvious reasons, matrix index verification code removed
        m[i][j] = m[i/2][j] + m[i][j+7];
    }
}
Obviously a loop like the one above becomes extremely problematic (notice the matrix indexers). But techniques do exist for unrolling these types of loops and creating effective parallel algorithms.
ANSWER:
It is possible that Google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave computers, or they used something like Monte Carlo or some other approximation algorithm to develop a "close enough" calculation.
In fact, I'd go so far as to say that Google will have gone to great lengths to make any calculation required for their PageRank algorithm as efficient as possible. When you're running machines such as these and this (notice the Ethernet cable), you can't be transferring large datasets (100s of gigs), because that is impossible given the hardware limitations of commodity NIC cards.
With that said, Google is good at surprising the programmer community and their implementation could be entirely different.
POSTAMBLE:
Some good resources for parallel computing include OpenMP and MPI. The two approach parallel computing from very different paradigms, some of which stem from the machine implementation (cluster vs. distributed computing).
I suspect it is intractable for most matrices except those w/ special structures (e.g. sparse matrices or ones w/ certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.
PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)
I can answer myself now. The PageRank algorithm takes advantage of the sparse matrix: it converges to the dominant eigenvector after several self-multiplications. Thus, in PageRank practice, the Map/Reduce procedure is valid: you can perform the matrix multiplication in the Map step and form the sparse result in the Reduce step. But for general matrix eigenvalue finding, it is still a tricky problem.
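A toy sketch of one such iteration step expressed as map and reduce phases (the (row, col, value) triple storage and the plain-Python driver are illustrative assumptions, not Hadoop code):
from collections import defaultdict

def map_phase(triples, p):
    # Emit (row, partial product) pairs; each mapper needs only its own triples
    # plus the current vector p, not the whole matrix.
    for row, col, value in triples:
        yield row, value * p[col]

def reduce_phase(pairs):
    # Sum the partial products per row to form the next vector.
    acc = defaultdict(float)
    for row, partial in pairs:
        acc[row] += partial
    return acc

# One multiply A @ p of the power iteration:
# p_next = reduce_phase(map_phase(triples, p))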
The Apache Hama project has an interesting implementation of the Jacobi eigenvalue algorithm. It runs on Hadoop. Note the rotation happens in the scan of the matrix, not in the map-reduce step.
