GEKKO (IPOPT) Solution Not Found due to Out of memory? - gekko

I am getting the following error while simulating a chemical reactor with 50 CSTRs in series in dynamic mode (m.options.IMODE=4) with time-series data. The steady-state simulation runs just fine, and the dynamic simulation also works for a simpler model with 15 CSTRs.
Is there a solution for this issue?
MUMPS returned INFO(1) =-13 - out of memory when trying to allocate 219104583 bytes.
In some cases it helps to decrease the value of the option "mumps_mem_percent".
WARNING: Problem in step computation; switching to emergency mode.
1r 0.0000000e+000 2.87e+001 9.99e+002 1.5 0.00e+000 - 0.00e+000 0.00e+000R 1
MUMPS returned INFO(1) =-13 - out of memory when trying to allocate 219104583 bytes.
In some cases it helps to decrease the value of the option "mumps_mem_percent".
WARNING: Problem in step computation; switching to emergency mode.
Restoration phase is called at point that is almost feasible,
with constraint violation 0.000000e+000. Abort.
Restoration phase in the restoration phase failed.
Number of Iterations....: 1
(scaled) (unscaled)
Objective...............: 0.0000000000000000e+000 0.0000000000000000e+000
Dual infeasibility......: 0.0000000000000000e+000 0.0000000000000000e+000
Constraint violation....: 2.8680600237259355e+001 2.8680600237259355e+001
Complementarity.........: 0.0000000000000000e+000 0.0000000000000000e+000
Overall NLP error.......: 2.8680600237259355e+001 2.8680600237259355e+001
Number of objective function evaluations = 2
Number of objective gradient evaluations = 2
Number of equality constraint evaluations = 2
Number of inequality constraint evaluations = 0
Number of equality constraint Jacobian evaluations = 2
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations = 2
Total CPU secs in IPOPT (w/o function evaluations) = 1.672
Total CPU secs in NLP function evaluations = 4.237
EXIT: Restoration Failed!
An error occured.
The error code is -2

If the simultaneous mode (IMODE=4) problem is too large and runs out of memory, then I recommend that you try the sequential mode with IMODE=7.
from gekko import GEKKO
m = GEKKO(remote=False)
m.options.IMODE=7
# Your model
m.solve(disp=False)
A couple other tips when switching to IMODE=7:
Use remote=False to solve on your computer instead of a public server
Use disp=False to suppress the solver output; print statements can slow down the code.
IMODE=4 and IMODE=7 should give equivalent results but they are different solution methods. The simultaneous mode is reviewed in the collocation material in the Dynamic Optimization course.
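As a minimal sketch of that comparison (this is not the poster's reactor model; the first-order decay equation, time grid, and rate constant below are made-up placeholders), the same dynamic simulation can be set up once with IMODE=4 and once with IMODE=7 to verify that the trajectories match while the sequential mode uses far less memory:
from gekko import GEKKO
import numpy as np

def simulate(imode):
    m = GEKKO(remote=False)           # solve locally instead of on the public server
    m.time = np.linspace(0, 10, 101)  # time grid for the dynamic simulation
    k = m.Param(value=0.5)            # hypothetical rate constant
    c = m.Var(value=1.0)              # hypothetical concentration state
    m.Equation(c.dt() == -k*c)        # stand-in dynamic model (first-order decay)
    m.options.IMODE = imode           # 4 = simultaneous, 7 = sequential
    m.solve(disp=False)               # suppress solver output
    return np.array(c.value)

c4 = simulate(4)   # simultaneous (collocation) solution
c7 = simulate(7)   # sequential solution, much lower memory use
print(np.max(np.abs(c4 - c7)))   # should be near zero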

Related

Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense, unstructured matrices with (roughly) these shapes: R: (30k x 40k, float32 entries), G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often), and of course A: (30k x 50k, float32 entries).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32; apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(For example, performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation-time issues.) I don't know why this is or whether it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions does exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find which algorithm(s) are used for pinv or right division (/) (although it's probably the same as \, since they are related by transposing the matrices). I'd rather not delve too deeply, because my knowledge of numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, matrices of size around 10k work fine and take about a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try matrices of around size 20k, I immediately get the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using Int32 for indexing, even though I don't understand why it leads to an error in this case. (Edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error DimensionMismatch("LU factored matrix A must be square!"). I guess I have to choose a different algorithm manually? I don't know what it's called. (Although it would probably still crash for large matrices anyway...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using PyTorch, this will require at least 30 GB of GPU memory:
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop, you should have an answer in about 15 minutes.
I would suggest trying a generalized version, scaling up the dimensions, to get a better feel for how your system will handle it:
def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
You can't take much advantage of the entries being integers. This type of problem is easier to solve over the reals than over the integers, because finding integer solutions would require searching the solution space, while the real solution can be found by algebraic manipulation.
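As a side note (not part of the original answer): in recent PyTorch releases torch.lstsq has been deprecated in favour of torch.linalg.lstsq, which solves AX = B directly. A scaled-down sketch using the question's RG = A layout via the transposed system G.T R.T = A.T might look like the following; the sizes are reduced so it runs on a modest GPU or on the CPU:
import torch

# Scaled-down version of the question's shapes: R (3k x 4k), G (4k x 5k), A (3k x 5k).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
R_true = torch.rand(3_000, 4_000, device=device)
G = torch.randint(0, 2, (4_000, 5_000), device=device, dtype=torch.float32)
A = R_true @ G

# Solve R G = A for R via the transposed system G.T @ R.T = A.T.
# torch.linalg.lstsq(M, B) returns the least-squares solution X of M @ X = B.
R_fitted = torch.linalg.lstsq(G.T, A.T).solution.T

print(torch.max(torch.abs(R_fitted - R_true)))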

Ceres bundle adjustment, DENSE_NORMAL_CHOLESKY is faster than SCHUR, but shouldn't be?

I have a simple bundle adjustment problem, with two cameras and 230 points, that I am trying to solve using Ceres.
My goal is to get the absolute fastest solve that I can, but the results that I see seem to contradict the documentation about bundle adjustment problems.
As stated here:
one way to solve this problem is to set Solver::Options::linear_solver_type to SPARSE_NORMAL_CHOLESKY and call Solve(). And while this is a reasonable thing to do, bundle adjustment problems have a special sparsity structure that can be exploited to solve them much more efficiently. Ceres provides three specialized solvers (collectively known as Schur-based solvers) for this task.
However, when I use DENSE_NORMAL_CHOLESKY with the solver settings:
options.sparse_linear_algebra_library_type = SUITE_SPARSE;
options.linear_solver_type = ceres::DENSE_NORMAL_CHOLESKY;
options.minimizer_progress_to_stdout = false;
options.logging_type = ceres::SILENT;
options.max_num_iterations = 20;
It gives me:
Time (in seconds):
Preprocessor 0.006372
Residual only evaluation 0.000359 (12)
Jacobian & residual evaluation 0.003254 (12)
Linear solver 0.001549 (12)
Minimizer 0.008216
Postprocessor 0.000008
Total 0.014596
However, when I switch to the Schur solvers, as below:
options.use_explicit_schur_complement = true;
options.sparse_linear_algebra_library_type = SUITE_SPARSE;
options.linear_solver_type = ceres::ITERATIVE_SCHUR;
options.minimizer_progress_to_stdout = false;
options.logging_type = ceres::SILENT;
options.max_num_iterations = 20;
options.preconditioner_type = SCHUR_JACOBI;
It runs slower, with:
Time (in seconds):
Preprocessor 0.007213
Residual only evaluation 0.000306 (10)
Jacobian & residual evaluation 0.002611 (10)
Linear solver 0.007781 (10)
Minimizer 0.013027
Postprocessor 0.000009
Total 0.020249
Is there anything I can do to get a faster result? I have tried ordering, every kind of linear_solver_type, and different preconditioners. Setting options.num_threads = 8; makes no noticeable difference either. Am I missing something?
Use analytical derivatives. The bulk of your time is being spent there.

Tensorflow: How to check the validation accuracy every 100 iterations?

I am trying to use the following code to check the validation accuracy every 100 iterations; however, the validation accuracy is not changing (the network is fine):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
            validation_accuracy = accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 0.0})
            print('step %d, validation accuracy %g' % (i, validation_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
Since you haven't added your network implementation, my answer will be an educated guess.
TL;DR: You should use keep_prob:1.0 instead of keep_prob:0.0 in your validation step.
By the appearance of keep_prob, I deduce that your network uses dropout. By using feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob:0.0}, you are feeding a 0.0 probability of keeping an activation, which is equivalent to a 1.0 probability of dropping it. The result is that when you are performing validation, you are basically ignoring the input to the network and all hidden layers. This has the effect that the last layer gives you the same output values for all classes of MNIST (this may be only approximately true, depending on the specific implementation), therefore the accuracy is constant.
Dropout is a method for regularization, which drops neurons during training steps, thus improving the generalization ability of the network. When you are not training (such as during a validation step), you want to keep all neurons. Thus, what you probably want to do is feed the value 1.0 instead.
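Concretely, only the keep_prob value in the validation call needs to change (this reuses the question's existing x, y_, keep_prob, accuracy and mnist objects):
# Keep all activations during evaluation: keep_prob=1.0 disables dropout.
validation_accuracy = accuracy.eval(
    feed_dict={x: mnist.test.images,
               y_: mnist.test.labels,
               keep_prob: 1.0})
print('step %d, validation accuracy %g' % (i, validation_accuracy))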

How is a prefix sum a bulk-synchronous algorithmic primitive?

Concerning NVIDIA GPUs, the author of the High Performance and Scalable GPU Graph Traversal paper says:
1-A sequence of kernel invocations is bulk-synchronous: each kernel is
initially presented with a consistent view of the results from the
previous.
2-Prefix-sum is a bulk-synchronous algorithmic primitive
I am unable to understand these two points (I know GPU-based prefix sum, though). Can someone help me with this concept?
1-A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
It's about the parallel computation model: each processor has its own fast memory (like a CPU cache) and performs computation on values stored there without any synchronization. Then non-blocking communication takes place: each processor puts the data it has computed so far and gets data from its neighbours. Then there is another synchronization, a barrier, which makes all of them wait for each other.
2-Prefix-sum is a bulk-synchronous algorithmic primitive
I believe that's about the second step of the BSP model, synchronization: that's the way processors store and get data for the next step.
The name of the model implies that it is highly concurrent (many, many processes that work synchronously relative to each other). And this is how we get to the second point.
As far as we want to live up to the name (be highly concurrent), we want to get rid of sequential parts wherever possible. We can achieve that with a prefix sum.
Consider a prefix sum with the associative operator +. An exclusive scan of [5 2 0 3 1] returns [0 5 7 7 10] (an inclusive scan returns [5 7 7 10 11]). So now we can replace this sequential pseudocode:
foreach i = 1...n
    foo[i] = foo[i-1] + bar(i);
with this pseudocode, which can now be parallel(!):
foreach(i)
    baz[i] = bar(i);
scan(foo, baz);
That is a very naive version, but it's okay for explanation.
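For illustration only (this is plain sequential Python, not the parallel GPU scan the answer alludes to), the two scan variants look like this:
from itertools import accumulate

data = [5, 2, 0, 3, 1]

# Inclusive scan: element i is the sum of data[0..i].
inclusive = list(accumulate(data))    # [5, 7, 7, 10, 11]

# Exclusive scan: element i is the sum of data[0..i-1], starting from the identity 0.
exclusive = [0] + inclusive[:-1]      # [0, 5, 7, 7, 10]

print(inclusive, exclusive)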

SVM training performance

I'm using SVMLib to train a simple SVM over the MNIST dataset. It contains 60,000 training examples. However, I have several performance issues: the training seems to be endless (after a few hours, I had to shut it down by hand, because it wasn't responding). My code is very simple: I just call ovrtrain on the dataset without any kernel options or special constants:
function features = readFeatures(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 4, "int32", 0, "ieee-be");
  if header(1) ~= 2051
    fprintf("Wrong magic number!");
  end
  M = header(2);
  rows = header(3);
  columns = header(4);
  features = fread(fid, [M, rows*columns], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

function labels = readLabels(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 2, "int32", 0, "ieee-be");
  if header(1) ~= 2049
    fprintf("Wrong magic number!");
  end
  M = header(2);
  labels = fread(fid, [M, 1], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

labels = readLabels("train-labels.idx1-ubyte");
features = readFeatures("train-images.idx3-ubyte");
model = ovrtrain(labels, features, "-t 0"); % doesn't respond...
My question: is this normal? I'm running it on Ubuntu in a virtual machine. Should I wait longer?
I don't know whether you have already found your answer or not, but let me tell you what I suspect about your situation. 60,000 examples is not a lot for a powerful trainer like LibSVM. Currently I am working on a training set of 6,000 examples and it takes 3 to 5 seconds to train. However, parameter selection is important, and that is probably what is taking a long time. If the number of unique features in your data set is too high, then for any example there will be lots of zero feature values for non-existing features. If the tool applies data scaling to your training set, then most probably those zero feature values will be scaled to some non-zero value, leaving you with an astronomical number of unique, non-zero-valued features for each and every example. This makes it very complicated for an SVM tool to extract efficient parameter values.
Long story short, if you have done enough research on SVM tools and understand what I mean, you should either assign parameter values in the training command before executing it or find a way to decrease the number of unique features. If you haven't, go ahead and download the latest version of LibSVM and read the README files as well as the FAQ on the tool's website.
If none of this is the case, then sorry for taking your time :) Good luck.
It might be an issue of convergence given the characteristics of your data.
Check which kernel is selected by default and try changing it. Also, check the stopping criterion of the package. Additionally, if you are looking for a faster implementation, check MSVMpack, which is a parallel implementation of SVM.
Finally, feature selection is desirable in your case; you can end up with a good feature subset of almost half of what you have. In addition, you only need a portion of the data for training, e.g. 60-70% is sufficient.
First of all, 60k is a lot of data for training. Training that much data with a linear kernel will take a very long time unless you have a supercomputer. Also, you have selected a linear kernel function of degree 1; it is better to use a Gaussian or higher-degree polynomial kernel (degree 4 used with the same dataset showed good training accuracy). Try adding the LIBSVM options -c (cost), -m (memory cache size) and -e (epsilon, the tolerance of the termination criterion, default 0.001). First run 1,000 samples with a Gaussian or degree-4 polynomial kernel and compare the accuracy.
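As a rough sketch of that advice (this is not the ovrtrain call from the question; it uses scikit-learn's SVC, which wraps LIBSVM, MNIST is loaded via OpenML instead of the raw IDX files, and the C value, cache size and subsample size are placeholders to experiment with):
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

# Load MNIST (70,000 x 784) via OpenML instead of the IDX files used in the question.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0                         # scale pixel values to [0, 1]

# Start with a small random subsample, as suggested above.
rng = np.random.default_rng(0)
idx = rng.choice(len(y), size=1000, replace=False)

# RBF ("Gaussian") kernel with explicit cost, cache size (MB) and stopping tolerance,
# mirroring the LIBSVM options -c, -m and -e.
clf = SVC(kernel='rbf', C=10.0, cache_size=1000, tol=1e-3)
clf.fit(X[idx], y[idx])
print(clf.score(X[idx], y[idx]))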
