I'd like to understand how to efficiently estimate hardware requirements for complex algorithms using some well-known heuristic approach.
That is, I'd like to quickly estimate how much computing power is necessary to crack my TEA (O(2^32)) or XTEA (O(2^115.15)) in some reasonable time, or the other way round:
given a facility with 1,000 4 GHz quad-core CPUs, how much time would it take to execute a given algorithm?
I'd also be interested in complexity estimates for other kinds of algorithms, e.g. O(log N).
regards
bua
OK, so I came up with something like this,
simplifying by assuming that the CPU clock rate equals the instruction rate (one instruction per cycle).
Given a number of instructions, e.g. 2^115.15, and a processor with e.g. a 1 GHz clock,
that is:
i = 2^115.15
clock = 1 GHz
seconds_per_instruction = 1 / 1e9
seconds = i * seconds_per_instruction
in Python:
import math

def sec(N, cpuSpeedHz):
    # 2^N instructions executed at cpuSpeedHz instructions per second
    instructions = math.pow(2, N)
    return instructions * (1. / cpuSpeedHz)
e.g.
>>> sec(115.15, math.pow(10,9)) / (365*24*60*60)
1.4614952014571389e+18
so it would take about 1.46 * 10^18 years to compute.
So with 1 million 4-core 1 GHz processors it would take:
>>> sec(115.15, 1000000*4*math.pow(10,9)) / (365*24*60*60)
365373800364.28467
i.e. about 3.65 * 10^11 years (~365 billion years).
simplified version:
2^115.15 = 2^32 * 2^83.15
clock = 2^32 Hz ~ 4 GHz
so the remaining factor is 2^83.15 seconds, which in years is:
>>> math.pow(2,83.15)/(365*24*60*60)
3.4028086845230746e+17
checking:
2^32 = 10 ^ 9.63295986
>>> sec(115.15, math.pow(2,32))/(365*24*60*60)
3.4028086845230746e+17
Pick whichever answer you like:
More than you can afford
It would be far, far cheaper to keylog your machine
Where are you going to store the 2^20 plaintexts needed to achieve the O(2^115) time complexity?
A whole bunch
If someone really wants your pr0n collection, it is much easier to break the key holder than the key.
Related
Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(E.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either of the functions do exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \ since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge in numerical linear algebra is quite limited.
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, matrices of size around 10k work fine and take about a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try matrices of around size 20k, I immediately get the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using int32 for indexing, even though I don't understand why it leads to an error in this case. (Edit: it's about the size of the array in bytes fitting into 31 bits.)
Trying to use right division with CuArrays gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes this is really my problem, please refrain from commenting "nobody should ever need such large least squares"
Naive answer
Using pytorch, this will require at least 30GB GPU memory
import torch
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
R = torch.lstsq(G.T, A.T)
If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.
I would suggest trying a generalized version, scaling up the dimensions, to get a better feeling for how your system will handle it:
def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    R = torch.lstsq(G.T, A.T)
I transposed the dimensions in the generation in order to make sure G.T and A.T would be contiguous.
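For example (my addition, not part of the original answer), you could time try_it at a few scaled-down sizes before committing to the full 30k/40k/50k problem. Note that on recent PyTorch releases torch.lstsq has been removed in favour of torch.linalg.lstsq, so the call inside try_it may need updating:

import time
import torch

for scale in (0.1, 0.2, 0.4):  # fraction of the full problem size
    a, b, c = int(50_000 * scale), int(40_000 * scale), int(30_000 * scale)
    torch.cuda.synchronize()
    t0 = time.time()
    try_it(a, b, c)
    torch.cuda.synchronize()   # wait for the GPU to actually finish
    print(f"scale {scale}: {time.time() - t0:.1f} s")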
You can't take much advantage of the entries being integers. This type of problem is easier to solve over the reals than over the integers, because finding integer solutions would require a search, whereas the real solution can be found by algebraic manipulation.
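To make the algebraic route concrete, here is a sketch of the normal-equations approach in NumPy (my addition; it assumes G G^T is invertible and reasonably well conditioned, with the usual caveat that normal equations square the condition number):

import numpy as np

# Toy sizes with the same shape pattern as the question (scale up for real data).
m, k, n = 3_000, 4_000, 5_000            # R: m x k, G: k x n, A: m x n
R_true = np.random.rand(m, k).astype(np.float32)
G = np.random.randint(0, 2, (k, n)).astype(np.float32)
A = R_true @ G

# Minimizing ||R G - A||_F gives the normal equations R (G G^T) = A G^T.
# G G^T is only k x k (40k x 40k for the real problem), so that is the only
# matrix that needs factorizing -- much smaller than A or G themselves.
GGt = G @ G.T                            # k x k, symmetric
AGt = A @ G.T                            # m x k
R_fit = np.linalg.solve(GGt, AGt.T).T    # solves (G G^T) R^T = (A G^T)^T

print(np.abs(R_fit - R_true).max())

A more numerically robust (but costlier) variant is np.linalg.lstsq(G.T, A.T)[0].T.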
Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a way to compute the remainder faster?
Also, I'd appreciate it if anyone could explain the relative execution cost of the %, /, *, + and - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
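As a sketch of the power-of-two case (Python used purely for illustration; a decent C compiler emits the equivalent AND automatically when the divisor is a compile-time constant):

# For a power-of-two divisor d = 2**k, the remainder of a non-negative
# integer is just its low k bits: x % d == x & (d - 1).
def rem_pow2(x, d):
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    return x & (d - 1)

assert rem_pow2(1234567, 8) == 1234567 % 8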
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
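A sketch of that identity with C-style truncating division (illustrative only; real hardware and compilers typically get quotient and remainder from the same divide):

def c_style_rem(a, b):
    # C's % truncates the quotient toward zero (Python's // floors instead,
    # which differs for negative operands).
    q = int(a / b)        # truncating division
    return a - q * b

assert c_style_rem(7, 3) == 1
assert c_style_rem(-7, 3) == -1   # C semantics: remainder takes the dividend's sign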
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school: multiply the first number by every digit of the second number and add the results, with some shortcuts made possible by binary. As of a few years ago, multiply was about 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact the hardware generally calculates both at once).
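A sketch of that shift-and-add idea, i.e. the binary version of the grade-school method, which is roughly what a CPU without a hardware multiplier has to loop through:

def shift_add_multiply(a, b):
    # Multiply non-negative integers using only shifts and adds.
    result = 0
    while b:
        if b & 1:          # if the current bit of b is set,
            result += a    #   add the correspondingly shifted a
        a <<= 1            # shift a left (next binary "digit")
        b >>= 1            # move to the next bit of b
    return result

assert shift_add_multiply(123, 456) == 123 * 456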
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.
I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.
I am trying to fit a logistic regression with approximately 1500 parameters.
R is using 7% CPU and has 60+GB memory and is still taking a very long time.
Here is the code:
glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))),
family = binomial(logit), data = df[1:150000,])
Any suggestions to speed this up by a significant amount?
There are a couple packages to speed up glm fitting. fastglm has benchmarks showing it to be even faster than speedglm.
You could also install a more performant BLAS library on your computer (as Ben Bolker suggests in comments), which will help any method.
Although a bit late, I can only second dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feed it to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
Assuming that your design matrix is not sparse, then you can also consider my package parglm. See this vignette for a comparison of computation times and further details. I show a comparison here of computation times on a related question.
One of the methods in the parglm function works as the bam function in mgcv. The method is described in detail in
Wood, S.N., Goude, Y. & Shaw S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155.
One advantage of the method is that it can be implemented with a non-concurrent QR implementation and still perform the computation in parallel. Another advantage is a potentially lower memory footprint. This is used in mgcv's bam function and could also be implemented here with a setup like the one in speedglm's shglm function.
[Environment: MATLAB 64 bit, Windows 7, Intel I5-2320]
I would like to RMS-fit a function to experimental data y, so I am minimizing the following function (by using fminsearch):
minfunc = rms(y - fitfunc)
From the general point of view, does it make sense to minimize:
minfunc = sum((y - fitfunc) .^ 2)
instead and then (after minimization) just do minfunc = sqrt(minfunc / N) to get the fit RMS error?
To reformulate the question, how much time (roughly, in percent) would fminsearch save by not doing sqrt and 1/(N - 1) each time? I wouldn't like to decrease readability of my code if my CPU / MATLAB are so fast that it wouldn't improve performance by at least a percent.
Update: I've tried simple tests, but the results are not clear: depending on the actual value of the minfunc, fminsearch takes more or less time.
The general answer for performance questions:
If you just want to figure out what is faster, design a benchmark and run it a few times.
By just providing general information it is not likely that you will determine which method is 1 percent faster.
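To illustrate what such a benchmark might look like, here is a sketch using SciPy's fmin (Nelder-Mead, the same simplex algorithm fminsearch uses); the toy model, data and names are mine, and the point is only the methodology of timing the two objective variants on identical data:

import time
import numpy as np
from scipy.optimize import fmin

# Hypothetical toy fit: y ~ p0 * exp(p1 * x) plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 1_000)
y = 2.0 * np.exp(-3.0 * x) + 0.01 * rng.standard_normal(x.size)

def resid(p):
    return y - p[0] * np.exp(p[1] * x)

def obj_rms(p):
    return np.sqrt(np.mean(resid(p) ** 2))   # RMS objective

def obj_ssq(p):
    return np.sum(resid(p) ** 2)             # plain sum of squares

for name, obj in [("rms", obj_rms), ("ssq", obj_ssq)]:
    t0 = time.time()
    p_opt = fmin(obj, x0=[1.0, -1.0], disp=False)
    print(name, p_opt, f"{time.time() - t0:.3f} s")

Both objectives have the same minimizer, since sqrt and division by N are monotonic transformations of the sum of squares; only the per-iteration cost can differ.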
An algo takes 0.5 ms for an input size of 100; how long will it take to run if the input size is 500 and the program is O(n lg(n))?
My book says that when the input size doubles, n lg(n), takes "slightly more than twice as long". That isn't really helping me much.
The way I've been doing it is solving for the constant multiplier (which the book doesn't talk about, so I don't know if it's valid):
.5ms = c * 100 * lg(100) => c = .000753
So
.000753 * 500 * lg(500) = 3.37562ms
Is that a valid way to calculate running times, and is there a better way to figure it out?
Yes. That is exactly how it works.
Of course this ignores any possible initialization overhead, as this is not specified in big-o notation, but that's irrelevant for most algorithms.
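A quick sketch of the same calculation in ratio form, which avoids solving for the constant explicitly because it cancels out:

from math import log2

def scale_runtime(t1, n1, n2):
    # Extrapolate an O(n log n) runtime from measurement (n1, t1) to size n2.
    return t1 * (n2 * log2(n2)) / (n1 * log2(n1))

print(scale_runtime(0.5, 100, 500))   # ~3.37 ms, matching the hand calculation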
That's not exactly right. Tomas was right in saying there is overhead, and the real equation is more like:
runtime = inputSize * lg(inputSize) * singleInputProcessTime + overhead
The singleInputProcessTime has to do with machine operations like loading of address spaces, arithmetic or anything that must be done each time you interact with the input. This generally has a runtime that ranges from a few CPU cycles to seconds or minutes depending on your domain. It is important to understand that this time is roughly CONSTANT and therefore does not affect the overall runtime very much AT LARGE ENOUGH input sizes.
The overhead is cost of setting up the problem/solution such as reading the algorithm into memory, spreading the input amongst servers/processes or any operation which only needs to happen once or a set number of times which does NOT depend on the input size. This cost is also constant and can range anywhere from a few CPU cycles to minutes depending on the method used to solve the problem.
The inputSize and n * lg(n) you know about already, but as for your homework problem, as long as you explain how you got to the solution, you should be just fine.
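If you did want to estimate both constants instead of ignoring them, two timing measurements are enough in principle. A sketch (the function and the second measurement are made up for illustration):

from math import log2

def fit_nlogn_model(n1, t1, n2, t2):
    # Solve t = c * n * log2(n) + overhead from two (size, time) measurements.
    w1, w2 = n1 * log2(n1), n2 * log2(n2)
    c = (t2 - t1) / (w2 - w1)
    overhead = t1 - c * w1
    return c, overhead

c, overhead = fit_nlogn_model(100, 0.5, 200, 1.1)   # second data point is hypothetical
print(c * 500 * log2(500) + overhead)               # prediction for n = 500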