Ceres bundle adjustment, DENSE_NORMAL_CHOLESKY is faster than SCHUR, but shouldn't be?

I have a simple bundle adjustment problem with two cameras and 230 points that I am trying to solve using Ceres.
My goal is to get the absolute fastest solve that I can, but the results I see seem to contradict the documentation about bundle adjustment problems.
As stated here:
one way to solve this problem is to set Solver::Options::linear_solver_type to SPARSE_NORMAL_CHOLESKY and call Solve(). And while this is a reasonable thing to do, bundle adjustment problems have a special sparsity structure that can be exploited to solve them much more efficiently. Ceres provides three specialized solvers (collectively known as Schur-based solvers) for this task.
However, when I use DENSE_NORMAL_CHOLESKY with the solver settings:
options.sparse_linear_algebra_library_type = SUITE_SPARSE;
options.linear_solver_type = ceres::DENSE_NORMAL_CHOLESKY;
options.minimizer_progress_to_stdout = false;
options.logging_type = ceres::SILENT;
options.max_num_iterations = 20;
It gives me:
Time (in seconds):
Preprocessor 0.006372
Residual only evaluation 0.000359 (12)
Jacobian & residual evaluation 0.003254 (12)
Linear solver 0.001549 (12)
Minimizer 0.008216
Postprocessor 0.000008
Total 0.014596
However, when I switch to the Schur solvers, as below:
options.use_explicit_schur_complement = true;
options.sparse_linear_algebra_library_type = SUITE_SPARSE;
options.linear_solver_type = ceres::ITERATIVE_SCHUR;
options.minimizer_progress_to_stdout = false;
options.logging_type = ceres::SILENT;
options.max_num_iterations = 20;
options.preconditioner_type = SCHUR_JACOBI;
It runs slower, with:
Time (in seconds):
Preprocessor 0.007213
Residual only evaluation 0.000306 (10)
Jacobian & residual evaluation 0.002611 (10)
Linear solver 0.007781 (10)
Minimizer 0.013027
Postprocessor 0.000009
Total 0.020249
Is there anything I can do to get a faster result? I have tried ordering, every kind of linear_solver_type, and different preconditioners. Setting options.num_threads = 8; makes no noticeable difference either. Am I missing something?

Use analytical derivatives. The bulk of your time is being spent there.
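In Ceres, analytic derivatives mean implementing the cost function's Evaluate() yourself (e.g. by subclassing SizedCostFunction) so the Jacobian is filled in directly, instead of relying on AutoDiffCostFunction or numeric differentiation. As a language-neutral illustration of why this helps, here is a hedged Python sketch with scipy.optimize.least_squares (the exponential model and data are made up, not the poster's problem): supplying an analytic Jacobian avoids the extra residual evaluations that finite differencing performs at every iteration.

import numpy as np
from scipy.optimize import least_squares

# Toy data for the model y = a * exp(-b * t).
t = np.linspace(0.0, 1.0, 200)
y = 2.0 * np.exp(-1.3 * t) + 0.01 * np.random.default_rng(0).normal(size=t.size)

def residuals(p):
    a, b = p
    return a * np.exp(-b * t) - y

def jacobian(p):
    a, b = p
    e = np.exp(-b * t)
    return np.column_stack([e, -a * t * e])  # d(res)/da, d(res)/db

sol_fd = least_squares(residuals, x0=[1.0, 1.0])                # finite-difference Jacobian
sol_an = least_squares(residuals, x0=[1.0, 1.0], jac=jacobian)  # analytic Jacobian, fewer evaluations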

Related

Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think the matrix division is faster than the Penrose-inverse method. Should this be expected? I don't know exactly what either function does for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \, since they are related by transposing the matrices). I'd rather not delve too deeply, because my knowledge of numerical linear algebra is quite limited.
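(Aside: in Python's SciPy the underlying LAPACK least-squares driver can be chosen explicitly, which makes that algorithmic choice visible. A hedged sketch mirroring the toy problem above, where 'gelsy' is pivoted QR, akin to backslash, and 'gelsd' is SVD-based, akin to the pinv route:)

import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
R_true = rng.random((3_000, 4_000), dtype=np.float32)
G = rng.integers(0, 2, (4_000, 5_000)).astype(np.float32)
A = R_true @ G

# Solve G' R' = A' for R' with two different LAPACK drivers.
Rt_qr, *_  = lstsq(G.T, A.T, lapack_driver='gelsy')  # pivoted QR
Rt_svd, *_ = lstsq(G.T, A.T, lapack_driver='gelsd')  # SVD-based
R_fitted = Rt_qr.T  # shape (3_000, 4_000)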
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, matrices of size around 10k work fine and take a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try to do matrices around size 20k, I get right away the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes, this is really my problem; please refrain from commenting "nobody should ever need such large least squares".
Naive answer
Using PyTorch, this will require at least 30 GB of GPU memory:
import torch
import torch
# Solve R @ G = A via least squares on G.T @ R.T = A.T (torch.lstsq is deprecated).
G = torch.randint(0, 2, (50_000, 40_000), device='cuda', dtype=torch.float32).T  # 40k x 50k
A = torch.randint(0, 2, (50_000, 30_000), device='cuda', dtype=torch.float32).T  # 30k x 50k
R = torch.linalg.lstsq(G.T, A.T).solution.T  # 30k x 40k
If the system can sustain the same operation throughput as my laptop, you should have an answer in about 15 minutes.
I would suggest trying a generalized version, scaling up the dimensions, to get a better feel for how your system will handle it:
def try_it(a, b, c):
    # a is the shared dimension; at full scale (a, b, c) = (50_000, 40_000, 30_000).
    G = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    A = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    return torch.linalg.lstsq(G.T, A.T).solution.T
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
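For instance (the sizes are just the ones from the question), a warm-up run at roughly one-tenth scale before committing to the full problem:

R_small = try_it(5_000, 4_000, 3_000)     # ~1/10 scale warm-up
R_full  = try_it(50_000, 40_000, 30_000)  # full problem; expect ~30 GB of GPU memory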
You can't take much advantage of the entries being integers. This type of problem is easier to solve over the reals than over the integers, because finding integer solutions would require searching the solution space, while the real solution can be found by algebraic manipulation.

GEKKO (IPOPT) Solution Not Found due to Out of memory?

I am getting the following error while simulating a chemical reactor with 50 CSTRs in series in dynamic mode (m.options.IMODE=4) with time-series data. The steady-state run works just fine. Also, dynamic simulation seems to work for a simpler model with 15 CSTRs.
Is there a solution for this issue?
MUMPS returned INFO(1) =-13 - out of memory when trying to allocate 219104583 bytes.
In some cases it helps to decrease the value of the option "mumps_mem_percent".
WARNING: Problem in step computation; switching to emergency mode.
1r0.0000000e+000 2.87e+001 9.99e+002 1.5 0.00e+000 - 0.00e+000 0.00e+000R 1
MUMPS returned INFO(1) =-13 - out of memory when trying to allocate 219104583 bytes.
In some cases it helps to decrease the value of the option "mumps_mem_percent".
WARNING: Problem in step computation; switching to emergency mode.
Restoration phase is called at point that is almost feasible,
with constraint violation 0.000000e+000. Abort.
Restoration phase in the restoration phase failed.
Number of Iterations....: 1
(scaled) (unscaled)
Objective...............: 0.0000000000000000e+000 0.0000000000000000e+000
Dual infeasibility......: 0.0000000000000000e+000 0.0000000000000000e+000
Constraint violation....: 2.8680600237259355e+001 2.8680600237259355e+001
Complementarity.........: 0.0000000000000000e+000 0.0000000000000000e+000
Overall NLP error.......: 2.8680600237259355e+001 2.8680600237259355e+001
Number of objective function evaluations = 2
Number of objective gradient evaluations = 2
Number of equality constraint evaluations = 2
Number of inequality constraint evaluations = 0
Number of equality constraint Jacobian evaluations = 2
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations = 2
Total CPU secs in IPOPT (w/o function evaluations) = 1.672
Total CPU secs in NLP function evaluations = 4.237
EXIT: Restoration Failed!
An error occured.
The error code is -2
If the simultaneous mode (IMODE=4) problem is too large and runs out of memory, then I recommend trying the sequential mode (IMODE=7).
from gekko import GEKKO
m = GEKKO(remote=False)
m.options.IMODE=7
# Your model
m.solve(disp=False)
A couple of other tips when switching to IMODE=7:
Use remote=False to solve on your computer instead of a public server
Use disp=False to not show the solver output. Print statements can slow down the code.
IMODE=4 and IMODE=7 should give equivalent results, but they are different solution methods. The simultaneous mode is reviewed in the collocation material in the Dynamic Optimization course.
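To illustrate the switch, here is a minimal hedged sketch of a sequential dynamic simulation, with a single first-order ODE standing in for the CSTR model:

import numpy as np
from gekko import GEKKO

m = GEKKO(remote=False)            # solve locally, not on a public server
m.time = np.linspace(0, 10, 101)   # time grid for the time-series data
y = m.Var(value=1.0)               # stand-in state variable
m.Equation(y.dt() == -0.5 * y)     # toy ODE in place of the CSTR balances
m.options.IMODE = 7                # sequential simulation (IMODE=4 is simultaneous)
m.solve(disp=False)                # suppress solver output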

CPLEX giving different OPTIMAL solutions with very different objective values

When running CPLEX on the same ILP problem (exactly the same input file):
With MIPEmphasis = 3 I get an objective value of 6.81613e-06
With MIPEmphasis = 4 I get an objective value of 1.03858
In both cases, CPLEX returns an OPTIMAL status.
From the CPLEX user manual:
To make clear a point that has been alluded to so far: every choice of MIPEmphasis results in the search algorithm proceeding in a manner that eventually will find and prove an optimal solution, or will prove that no integer feasible solution exists. The choice of emphasis only guides CPLEX to produce feasible solutions in a way that is in keeping with the user's particular purposes, but the accuracy and completeness of the algorithm is not sacrificed in the process.
Am I missing something here? I am facing this problem not only with the MIPEmphasis parameter, but with other parameters as well (ScaInd for example), where by varying the parameter I get different OPTIMAL solutions that greatly vary in quality.
Here's some more info which I can't seem to decipher.
For MIPEmphasis = 3:
Maximum condition number = 5.03484e+12,
Attention level = 0.290111,
Suspicious bases = 0.0111111,
Unstable bases = 0.966667,
Ill-posed bases = 0,
CPLEX Status = `OptimalTol`
For MIPEmphasis = 4:
Maximum condition number = 4.73342e+08,
Attention level = 0.00925,
Suspicious bases = 0.925,
Unstable bases = 0,
Ill-posed bases = 0,
CPLEX Status = `Optimal`
This looks like numerical trouble, which is common and depends greatly on your modelling (e.g. the use of big-M constants).
I have never used CPLEX, but this official page talks about ill-conditioned MIP models.
Small excerpt relevant here:
You should reconsider your model if CPLEX reports any ill-posed bases or more than 5% unstable bases.
In your case A, you got more than 95% unstable bases:
For MIPEmphasis = 3: .... Unstable bases = 0.966667 ...
So it's quite possible that the result of A can't be trusted, and I would try to reformulate the model.
If we look at B, you got more than 92.5% suspicious bases, so maybe even in this case the model is asking for trouble.
As I'm not familiar with all the tunings and defaults, I can't give any insight into the source of these very different computational results with respect to MIPEmphasis and co. (maybe MIPEmphasis generating more cutting planes results in a more stable problem; just guessing).
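If reformulation alone doesn't help, the usual knobs for numerical trouble can also be set programmatically. A hedged sketch with the CPLEX Python API (the file name model.lp is hypothetical, and the parameter values are illustrative; check your CPLEX version's documentation before relying on them):

import cplex

c = cplex.Cplex("model.lp")                   # hypothetical input file
c.parameters.emphasis.numerical.set(1)        # emphasize numerical stability
c.parameters.read.scale.set(1)                # more aggressive scaling (the ScaInd setting)
c.parameters.mip.tolerances.mipgap.set(1e-9)  # tighten the relative MIP gap
c.solve()
print(c.solution.get_objective_value())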

How does one do Algebra in Lua?

I've looked and tried, but I can't find anything really helpful, so thank you in advance.
My problem is that I have a changing variable, "balance"; for the moment I have it represented as 200. I need to use this equation to find how much money I should withdraw in a game, but I don't know how to write a Lua script that solves algebra.
The equation is 200/(x+x^2+x^3+x^4+x^5) = 0.00001001. How would I set about solving for x?
I have tried incrementing x by 0.0000001 while 200/(x+x^2+x^3+x^4+x^5) doesn't equal 0.00001001, but it is very impractical and I haven't gotten it to work. This is the only way I can come up with at the moment. Any help would be appreciated.
This solution finds a zero of any continuous function (not only algebraic, and not necessarily differentiable) and requires knowing an interval that brackets the root.
local function find_zero(f, x_left, x_right, eps)
  eps = eps or 0.0000000001 -- precision
  local f_left, f_right = f(x_left), f(x_right)
  assert(x_left <= x_right and f_left * f_right <= 0, "Root is not bracketed")
  while x_right - x_left > eps do
    local x_middle = (x_left + x_right) / 2
    local f_middle = f(x_middle)
    if f_middle * f_left > 0 then
      x_left, f_left = x_middle, f_middle
    else
      x_right, f_right = x_middle, f_middle
    end
  end
  return (x_left + x_right) / 2
end

local function my_func(x)
  return 200/(x+x^2+x^3+x^4+x^5) - 0.00001001
end

-- Assuming that the root is between 1 and 1000
local x = find_zero(my_func, 1.0, 1000.0)
print(x) --> 28.643931367544
200/(x+x^2+x^3+x^4+x^5)=0.00001001 is equivalent to 200 = 0.00001001 * (x+x^2+x^3+x^4+x^5), so you have a polynomial equation to solve, and traditionally it is this form of the equation that people like to deal with.
If you want to stay in Lua, then if the form of the equation is predictable enough that you can find a place where the right side is always less than the left (e.g. x = 0) and a place where the right side is always greater than the left (e.g. very large values of x), then you can use binary search - not terribly efficient, but certain and easy to code.
For general polynomial equations, one well-known method is https://en.wikipedia.org/wiki/Newton's_method. Given f(x) = 0 and a guess for x, a better guess might be x - f(x) / f'(x), where f'(x) is the derivative of f(x). There are a few pathological cases where this fails for various reasons, though, so again you probably want to know that your equation is reliably tractable.
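For illustration, a hedged sketch of that update rule applied to this particular equation (written in Python for brevity; it translates line for line to Lua; the starting guess of 10.0 is an assumption):

def f(x):
    return 200.0 / (x + x**2 + x**3 + x**4 + x**5) - 0.00001001

def f_prime(x):
    s = x + x**2 + x**3 + x**4 + x**5
    return -200.0 * (1 + 2*x + 3*x**2 + 4*x**3 + 5*x**4) / s**2

x = 10.0  # initial guess; must avoid x = 0, where f blows up
for _ in range(100):
    step = f(x) / f_prime(x)
    x = x - step
    if abs(step) < 1e-12:
        break
print(x)  # converges to the same root as the bisection answer, ~28.6439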
Since you have Lua, you may be able to bring in code that calls out to a maths library such as http://commons.apache.org/proper/commons-math/ (a Java library). It has a routine called LaguerreSolver() which will reasonably reliably solve polynomial equations for you, defending itself against all of the pathological cases. Most math libraries embody far more work than any single person is likely to put into an individual problem, and are of correspondingly higher quality than a do-it-yourself approach such as the one I describe above.

SVM training performance

I'm using SVMLib to train a simple SVM on the MNIST dataset. It contains 60,000 training examples. However, I have several performance issues: the training seems to be endless (after a few hours, I had to shut it down by hand, because it wasn't responding). My code is very simple; I just call ovrtrain on the dataset with a linear kernel (-t 0) and no special constants:
function features = readFeatures(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 4, "int32", 0, "ieee-be");
  if header(1) ~= 2051
    fprintf("Wrong magic number!");
  end
  M = header(2);
  rows = header(3);
  columns = header(4);
  features = fread(fid, [M, rows*columns], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

function labels = readLabels(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 2, "int32", 0, "ieee-be");
  if header(1) ~= 2049
    fprintf("Wrong magic number!");
  end
  M = header(2);
  labels = fread(fid, [M, 1], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction
labels = readLabels("train-labels.idx1-ubyte");
features = readFeatures("train-images.idx3-ubyte");
model = ovrtrain(labels, features, "-t 0"); % doesn't respond...
My question: is this normal? I'm running it on Ubuntu in a virtual machine. Should I wait longer?
I don't know whether you got your answer already, but let me tell you what I suspect about your situation. 60,000 examples is not a lot for a powerful trainer like LibSVM. Currently I am working on a training set of 6,000 examples, and it takes 3 to 5 seconds to train. However, parameter selection is important, and that is probably what is taking so long. If the number of unique features in your data set is very high, then for any example there will be lots of zero feature values for the features that are absent. If the tool scales your training data, then most probably those zeros will be mapped to some non-zero value, leaving you with an astronomical number of unique, non-zero feature values for each and every example. That makes it very hard for an SVM tool to dig in and extract efficient parameter values.
Long story short: if you have researched SVM tools and understand what I mean, either assign parameter values in the training command before executing it, or find a way to decrease the number of unique features. If you haven't, download the latest version of LibSVM and read the README files as well as the FAQ on the tool's website.
If none of these is the case, then sorry for taking your time :) Good luck.
It might be an issue of convergence given the characteristics of your data.
Check which kernel is selected by default and change it. Also check the package's stopping criterion. Additionally, if you are looking for a faster implementation, check MSVMpack, which is a parallel implementation of SVM.
Finally, feature selection is desirable in your case. You can end up with a good feature subset of almost half of what you have. In addition, you only need a portion of the data for training, e.g. 60-70% is sufficient.
First of all, 60k examples is a lot of training data. Training on that much data with a linear kernel will take a very long time unless you have a supercomputer. Also, you have selected a linear kernel. It is better to use a Gaussian or higher-degree polynomial kernel (degree 4 used with the same dataset showed good training accuracy). Try adding the LibSVM options -c (cost), -m (kernel cache size in MB), and -e (epsilon, the tolerance of the termination criterion; default 0.001). First run 1,000 samples with a Gaussian or degree-4 polynomial kernel and compare the accuracy, as sketched below.
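A hedged sketch of that experiment in Python with scikit-learn, whose SVC wraps LibSVM (the small built-in digits dataset stands in for MNIST, and the parameter values are illustrative, not tuned):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels to [0, 1]; unscaled features slow SVM training badly

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=1000, random_state=0)

# RBF kernel with cost, cache size, and tolerance set explicitly,
# roughly LibSVM's "-t 2 -c 10 -m 1000 -e 0.001".
clf = SVC(kernel='rbf', C=10.0, cache_size=1000, tol=1e-3)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))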
