Gekko - APMonitor Optimization Suite is unable to solve an optimization problem. I am trying to solve Max a^Tx/b^Tx with the constraint d<=c^Tx <=e, where the decision vector x=[x_1, x_2, ..., x_n] are non-negative integers, and vectors a,b,c are positive real-number vectors, and constants d and e are positive lower and upper bounds. The problem is feasible because I got a feasible solution with the objective being replaced by 0. I was wondering whether APMonitor is capable of solving linear-fractional objective problems or not.
Does anyone have experience handling this kind of issue? Are there any solver options I could turn on to resolve it?
The options I was using are below:
from gekko import GEKKO
model = GEKKO()
model.options.SOLVER=1
model.solver_options = ['minlp_maximum_iterations 100', \
'minlp_max_iter_with_int_sol 10', \
'minlp_as_nlp 0', \
'nlp_maximum_iterations 50', \
'minlp_branch_method 1', \
'minlp_print_level 8', \
'minlp_integer_tol 0.05', \
'minlp_gap_tol 0.001']
model.solve(disp=True)
The output is shown below, where the solver status is inconsistent with APPSTATUS and APPINFO. This may be an APMonitor reporting issue.
apm 67.162.115.84_gk_model0 <br><pre> -----------------------------------------------
-----------------
APMonitor, Version 1.0.1
APMonitor Optimization Suite
----------------------------------------------------------------
--------- APM Model Size ------------
Each time step contains
Objects : 7
Constants : 0
Variables : 5626
Intermediates: 0
Connections : 4914
Equations : 4913
Residuals : 4913
Number of state variables: 5626
Number of total equations: - 4919
Number of slack variables: - 2
---------------------------------------
Degrees of freedom : 705
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter: 1 I: -9 Tm: 75.50 NLPi: 251 Dpth: 0 Lvs: 0 Obj: 0.00E+00 Gap:
NaN
Warning: no more possible trial points and no integer solution
Maximum iterations
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 75.5581999999995 sec
Objective : NaN
Unsuccessful with error code 0
---------------------------------------------------
Creating file: infeasibilities.txt
Use command apm_get(server,app,'infeasibilities.txt') to retrieve file
#error: Solution Not Found
Not successful
Gekko Solvetime: 1.0 s
#################################################
APPINFO = 0 - a successful solution
APPSTATUS =1 - solver converges to a successful solution
Solver status - Not successful, exception thrown
decision variable =[0,0, ...,0].
To maximize the objective, the solver minimizes the value of b so that the objective function goes to +infinity. Try setting a lower bound on b to a small number such as 0.001 to prevent the unbounded solution. Starting with non-zero values (default) can also help to find the solution.
b = model.Array(model.Var,n,value=1,lb=0.001)
Another suggestion is to set a lower bound constraint on b^Tx in case x also goes to zero.
model.Equation(b@x>=0.01)
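To see what a lower bound on the denominator does independently of Gekko, here is a minimal brute-force sketch on a toy instance. All data (a, b, c, d, e), the search range, and the name denom_min are hypothetical values chosen for illustration. It only enumerates small integer vectors and is not a substitute for the MINLP solver.

```python
from itertools import product

# Hypothetical toy data (assumptions, not from the original problem).
a = [3.0, 1.0, 2.0]   # numerator coefficients
b = [1.0, 2.0, 1.5]   # denominator coefficients
c = [1.0, 1.0, 1.0]   # constraint coefficients
d, e = 2.0, 4.0       # bounds: d <= c^T x <= e
denom_min = 0.01      # lower bound on b^T x, as suggested above


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def solve_toy(max_value=4):
    """Enumerate non-negative integer x and maximize a^T x / b^T x."""
    best_x, best_val = None, float("-inf")
    for x in product(range(max_value + 1), repeat=len(a)):
        if not d <= dot(c, x) <= e:
            continue  # violates d <= c^T x <= e
        denom = dot(b, x)
        if denom < denom_min:
            continue  # guards against 0/0 at x = 0 and tiny denominators
        val = dot(a, x) / denom
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


best_x, best_val = solve_toy()
print(best_x, best_val)
```

At x = 0 the ratio is 0/0, which is consistent with the NaN objective in the log above when the solver starts from an all-zero guess.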
If the APOPT solver does not converge with the modified problem, try using an NLP solver such as the Interior Point Method solver IPOPT to initialize the solution. Gekko retains the solution values from one solve to use as the initial guess for the next solve.
model.options.SOLVER=3
model.solve()
model.options.SOLVER=1
model.solver_options = ['minlp_maximum_iterations 100', \
'minlp_max_iter_with_int_sol 10', \
'minlp_as_nlp 0', \
'nlp_maximum_iterations 50', \
'minlp_branch_method 1', \
'minlp_print_level 8', \
'minlp_integer_tol 0.05', \
'minlp_gap_tol 0.001']
model.solve()
Please post a complete and minimal example if more specific suggestions are needed.
I am getting somewhat unexpected results when measuring the processing runtime of the Conv1D layer and wonder if anybody understands the results. Before going on I note that the observation is not only linked to the Conv1D layer but can be observed similarly for the tf.nn.conv1d function.
The code I am using is very simple
import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time
def fun(sigl, cc, bs=10):
    oo = tf.ones((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    ss=cc(oo).numpy()
    dur = time.time() - start_time
    print(f"size {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz su {ss.shape}")
cctf2t = tf.keras.layers.Conv1D(100,10)
for jj in range(2):
    print("====")
    for ii in range(30):
        fun(10000+ii, cctf2t, bs=10)
I was expecting to observe the first call to be slow and the others to show approximately similar runtime. It turns out that the behavior is quite different.
Assuming the code above is stored in a script called debug_conv_speed.py I get the following on an NVIDIA GeForce GTX 1050 Ti
$> ./debug_conv_speed.py
====
size 10000 time: 0.901 speed 111.01kHz su (10, 9991, 100)
size 10001 time: 0.202 speed 554.03kHz su (10, 9992, 100)
...
size 10029 time: 0.178 speed 563.08kHz su (10, 10020, 100)
====
size 10000 time: 0.049 speed 2027.46kHz su (10, 9991, 100)
...
size 10029 time: 0.049 speed 2026.87kHz su (10, 10020, 100)
where ... indicates approximately the same result. So as expected, the first time is slow, then for each input length, I get the same speed of about 550kHz. But then for the repetition, I am astonished to find all operations to run about 4 times faster, with 2MHz.
The results are even more different on a GeForce GTX 1080. There the first time a length is used it runs at about 200kHz, and for the repetitions, I find a speed of 1.8MHz.
In response to the answer at https://stackoverflow.com/a/71184388/3932675, I add a second variant of the code that uses tf.function:
import os
# silence verbose TF feedback
if 'TF_CPP_MIN_LOG_LEVEL' not in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
import tensorflow as tf
import time
from functools import partial
print(tf.config.list_physical_devices())
class run_fun(object):
    def __init__(self, ll, channels):
        self.op = ll
        self.channels = channels
    @tf.function(input_signature=(tf.TensorSpec(shape=[None,None,None]),),
                 experimental_relax_shapes=True)
    def __call__(self, input):
        # printed only while tracing, so it shows when a retrace happens
        print("retracing")
        return self.op(tf.reshape(input, (tf.shape(input)[0], tf.shape(input)[1], self.channels)))
def run_layer(sigl, ll, bs=10):
    oo = tf.random.normal((bs, sigl, 200), dtype=tf.float32)
    start_time = time.time()
    ss=ll(oo).numpy()
    dur = time.time() - start_time
    print(f"len {sigl} time: {dur:.3f} speed {bs*sigl / 1000 / dur:.2f}kHz su {ss.shape}")
ww= tf.ones((10, 200, 100))
ll=partial(tf.nn.conv1d, filters=ww, stride=1, padding="VALID", data_format="NWC")
run_ll = run_fun(ll, 200)
for jj in range(2):
    print(f"=== run {jj+1} ===")
    for ii in range(5):
        run_layer(10000+ii, run_ll)
        # alternatively for eager mode run
        # run_layer(10000+ii, ll)
The result after running on Google's Colab GPU:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
=== run 1 ===
retracing
len 10000 time: 10.168 speed 9.83kHz su (10, 9991, 100)
len 10001 time: 0.621 speed 161.09kHz su (10, 9992, 100)
len 10002 time: 0.622 speed 160.80kHz su (10, 9993, 100)
len 10003 time: 0.644 speed 155.38kHz su (10, 9994, 100)
len 10004 time: 0.632 speed 158.18kHz su (10, 9995, 100)
=== run 2 ===
len 10000 time: 0.080 speed 1253.34kHz su (10, 9991, 100)
len 10001 time: 0.053 speed 1898.41kHz su (10, 9992, 100)
len 10002 time: 0.052 speed 1917.43kHz su (10, 9993, 100)
len 10003 time: 0.067 speed 1499.43kHz su (10, 9994, 100)
len 10004 time: 0.095 speed 1058.60kHz su (10, 9995, 100)
This shows that with the given tf.function args retracing is not happening and the performance shows the same difference.
Does anybody know how to explain this?
The reason for your comparatively slow first iteration is that you are feeding different shapes into cctf2t, which triggers a retracing of your compute graph.
In the second and all subsequent iterations, you no longer encounter new shapes, so no further retracing occurs.
I am pretty sure to have found the explanation in the source of TensorFlow cudnn, and share the insight here for others (notably those who upvoted the question) that encounter the same problem.
cuda supports a number of convolution kernels that in the current version of TensorFlow 2.9.0 are obtained by means of CudnnSupport::GetConvolveRunners
here
https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/stream_executor/cuda/cuda_dnn.cc#L4557
Which is then used here in the various autotune functions
https://github.com/tensorflow/tensorflow/blob/21368c687cafdf97fac3dd0eefaed710df0068a2/tensorflow/core/kernels/conv_ops_gpu.cc#L365
It appears that each time a new configuration consisting of data shape, filter shape, and maybe other parameters is encountered, the cuda driver tests all of the kernels and retains the most efficient one. This is a very nice optimization for most cases, notably training with constant batch shapes or inference with constant image sizes. For inference with audio signals that may have arbitrary lengths (e.g. audio signals with a 48000Hz sample rate and durations from 1s to 20s have nearly 1 million different lengths), the cuda implementation spends most of its time testing all kernel versions. It hardly ever benefits from knowing which kernel is the most efficient for a given configuration, as the same configuration is hardly ever encountered a second time.
For my use case, I now use overlap-add-based processing with fixed signal length and improved inference time by about factor 4.
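The fixed-length idea can be sketched without TensorFlow. The following is a minimal NumPy illustration, assuming a single-channel signal and a "VALID" convolution. It splits the input into overlapping blocks of constant length, so a backend that tunes per shape would only ever see one shape. The function names and block_len are hypothetical, and the overlapping-input-block scheme shown here may differ in detail from the overlap-add implementation mentioned above.

```python
import numpy as np


def valid_conv_full(x, w):
    """Reference: "VALID" 1-D cross-correlation over the whole signal."""
    return np.correlate(x, w, mode="valid")


def valid_conv_blocks(x, w, block_len=64):
    """Same result, computed on fixed-length overlapping input blocks."""
    k = len(w)
    assert block_len >= k, "block must be at least as long as the kernel"
    hop = block_len - (k - 1)        # new output samples produced per block
    n_out = len(x) - k + 1           # length of the full VALID output
    n_blocks = -(-n_out // hop)      # ceil(n_out / hop)
    # Zero-pad so that the last block also has exactly block_len samples.
    padded = np.zeros((n_blocks - 1) * hop + block_len, dtype=x.dtype)
    padded[: len(x)] = x
    pieces = [
        np.correlate(padded[i * hop : i * hop + block_len], w, mode="valid")
        for i in range(n_blocks)
    ]
    # Outputs that depend on the zero padding are cut off here.
    return np.concatenate(pieces)[:n_out]


rng = np.random.default_rng(0)
x = rng.standard_normal(1003)   # arbitrary signal length
w = rng.standard_normal(10)     # kernel of length 10
y_full = valid_conv_full(x, w)
y_blocks = valid_conv_blocks(x, w, block_len=64)
print(y_full.shape, y_blocks.shape)
```

Every call inside valid_conv_blocks receives an input of length block_len regardless of len(x), which is the property that avoids re-running kernel selection for each new signal length.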
This problem originally came up on mathematica.SE, but since multiple programming languages are involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for c = 1:s
    for r = 1:s
        H(r,c) = 1/(r+c-1);
    end
end
toc
%Elapsed time is 1.359625 seconds.... For-loop
tic;
% vectorized version
c = 1:s;
r = c';
HH=1./(r+c-1);
toc
%Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. Though I don't have access to MATLAB so cannot test the sample myself, the timing 1.359625 seems to suggest it's tested on an average PC, just as mine.
But I cannot reproduce the timing with other languages like fortran or julia! (We know, both of them are famous for their performance of numeric calculation. Well, I admit I'm by no means an expert of fortran or julia. )
The following are the samples I used for testing. I'm using a laptop with an i7-8565U CPU running Win 10.
fortran
fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with compile option -Ofast).
program tst
use, intrinsic :: iso_fortran_env
implicit none
integer,parameter::s=15000
integer::r,c
real(real64)::hmn(s,s)
  do r=1,s
    do c=1, s
      hmn(r,c)=1._real64/(r + c - 1)
    end do
  end do
print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds
execution timing: 0.7179657 seconds
julia
Version of julia is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here comes the question:
Is the timing of MATLAB reliable?
If the answer to 1st question is yes, then how can we reproduce the performance (for 2 GHz CPU, the timing should be around 0.05 seconds) with julia, fortran, or any other programming languages?
Just to add on the Julia side: make sure you use BenchmarkTools to benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;
julia> function f_loop!(H)
           for c ∈ 1:size(H, 1)
               for r ∈ 1:size(H, 1)
                   H[r, c] = 1 / (r + c - 1)
               end
           end
       end
f_loop! (generic function with 1 method)
julia> function f_vec!(H)
           c = 1:size(H, 1)
           r = c'
           H .= 1 ./ (r .+ c .- 1)
       end
f_vec! (generic function with 1 method)
julia> H = zeros(s, s);
julia> using BenchmarkTools
julia> @btime f_loop!($H);
625.891 ms (0 allocations: 0 bytes)
julia> H = zeros(s, s);
julia> @btime f_vec!($H);
625.248 ms (0 allocations: 0 bytes)
So both versions come in at the same time, which is what I'd expect for such a straightforward operation where a properly type-inferred code should compile down to roughly the same machine code.
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however
On first run (clear workspace)
Loop approach takes 2.08 sec
Vectorised approach takes 1.04 sec
Vectorisation saves 50% execution time
On second run (workspace not cleared)
Loop approach takes 2.55 sec
Vectorised approach takes 0.065 sec
Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s | Loop (sec) | Vectorised (sec)
1 | Yes | 15000 | 2.10 | 1.41
2 | No | 15000 | 2.73 | 0.07
3 | No | 15000 | 2.50 | 0.07
4 | No | 15001 | 2.74 | 1.73
Note the variation between tests 2 and 3; this is why timeit would have been helpful for an average runtime (see footnote). The difference in output size between tests 3 and 4 is pretty small, but the execution time for the vectorised approach returns to a magnitude similar to that in test 1, suggesting that the re-allocation to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the in-built timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit though is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
I hope that the following modified benchmark could bring some new light to the problem:
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for i =1:10
    for c = 1:s
        for r = 1:s
            H(r,c) = H(r,c) + 1/(r+c-1+i);
        end
    end
end
toc
tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i=1:10
    HH= HH + 1./(r+c-1+i);
end
toc
isequal(H,HH)
In this case any kind of "caching" is avoided by changing the matrix H (HH) at each iteration of the for-loop over i.
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still performance improvement (~ 3x) due to the vectorization, which is probably done by implicit multi-threading implementation of vectorized Matlab commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
To add to this, here is a simple python script which does the vectorized operation with numpy:
from timeit import default_timer
import numpy as np
s = 15000
start = default_timer()
# for-loop
H = np.zeros([s, s])
for c in range(1, s):
    for r in range(1, s):
        H[r, c] = 1 / (r + c - 1)
end = default_timer()
print(end - start)
start = default_timer()
# vectorized
c = np.arange(1, s).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop: 32.94566780002788 seconds
vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted fortran/julia times. Numpy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the fortran/julia compilers weren't able to generate those instructions from the provided code, but numpy/matlab were able to. However, Matlab is still about 10x faster than the numpy code, which I don't think would be explained by better use of SIMD instructions. Instead, they may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the matlab numbers are plausible, but I'm not sure exactly how they're getting their speedup.
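For anyone repeating the NumPy comparison, here is a small sketch of how the two variants could be timed with repeated runs instead of a single measurement. It assumes a deliberately small, hypothetical size s_small so the loop finishes quickly, and it uses 0-based indices that fill all s rows and columns. The function names are made up for this example, and absolute timings will differ between machines.

```python
import timeit

import numpy as np


def hilbert_loop(s):
    """Fill the s-by-s matrix element by element (0-based r, c)."""
    H = np.zeros((s, s))
    for c in range(s):
        for r in range(s):
            # 1-based formula 1/(R + C - 1) with R = r + 1, C = c + 1
            H[r, c] = 1.0 / (r + c + 1)
    return H


def hilbert_vec(s):
    """Build the same matrix with broadcasting."""
    idx = np.arange(1, s + 1)
    return 1.0 / (idx[:, None] + idx[None, :] - 1)


s_small = 200  # hypothetical size, much smaller than 15000
# Take the best of several repeats to reduce warm-up and allocation noise.
t_loop = min(timeit.repeat(lambda: hilbert_loop(s_small), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: hilbert_vec(s_small), number=1, repeat=3))
print(f"for-loop: {t_loop:.6f} s, vectorized: {t_vec:.6f} s")
```

Taking the minimum over repeats plays a similar role to the warm-up that MATLAB's timeit performs, although the two tools are not equivalent.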
I want to perform N=1000 bootstrap resamplings with replacement on gridded data. One computation takes about 0.5s. I have access to an exclusive supercomputer node with 48 cores. Because the resamplings are independent of each other, I naively hoped to distribute the workload over all or at least many cores and get a speedup of about .8 * ncores. But I don't get it.
I still lack a proper understanding of dask. Based on Best practices in setting number of dask workers, I use:
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=8, n_workers=6, memory_limit='32GB')
I also tried with SLURMCluster, but I guess I first need to understand what I do and then scale.
My MWE:
create sample data
write function I want to apply
write resampling inits function
write bootstrapping function with bootstrap (=N) as argument: see many implementations below
perform bootstrapping
import dask
import numpy as np
import xarray as xr
from dask.distributed import Client
inits = np.arange(50)
lats = np.arange(96)
lons = np.arange(192)
data = np.random.rand(len(inits), len(lats), len(lons))
a = xr.DataArray(data,
coords=[inits, lats, lons],
dims=['init', 'lat', 'lon'])
data = np.random.rand(len(inits), len(lats), len(lons))
b = xr.DataArray(data,
coords=[inits, lats, lons],
dims=['init', 'lat', 'lon'])
def func(a,b, dim='init'):
    return (a-b).std(dim)
bootstrap=96
def resample(a):
    smp_init = np.random.choice(inits, len(inits))
    smp_a = a.sel(init=smp_init)
    smp_a['init'] = inits
    return smp_a
# serial function
def bootstrap_func(bootstrap=bootstrap):
    res = (func(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(res,'bootstrap')
    # leave out quantile because not issue here yet
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res
@dask.delayed
def bootstrap_func_delayed_decorator(bootstrap=bootstrap):
    return bootstrap_func(bootstrap=bootstrap)
def bootstrap_func_delayed(bootstrap=bootstrap):
    res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(dask.compute(*res),'bootstrap')
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res
for scheduler in ['synchronous','distributed','multiprocessing','processes','single-threaded','threads']:
    print('scheduler:',scheduler)
    def bootstrap_func_delayed_processes(bootstrap=bootstrap):
        res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
        res = xr.concat(dask.compute(*res, scheduler=scheduler),'bootstrap')
        res = res.quantile([.05,.95],'bootstrap')
        return res
    %time c = bootstrap_func_delayed_processes()
The following results are from my 4 core laptop. But on the supercomputer I also see no speedup, rather decrease by 50%.
Results for serial:
%time c = bootstrap_func()
CPU times: user 814 ms, sys: 58.7 ms, total: 872 ms
Wall time: 862 ms
Results for parallel:
%time c = bootstrap_func_delayed_decorator().compute()
CPU times: user 96.2 ms, sys: 50 ms, total: 146 ms
Wall time: 906 ms
Results for parallelized from the loop:
scheduler: synchronous
CPU times: user 2.57 s, sys: 330 ms, total: 2.9 s
Wall time: 2.95 s
scheduler: distributed
CPU times: user 4.51 s, sys: 2.74 s, total: 7.25 s
Wall time: 8.86 s
scheduler: multiprocessing
CPU times: user 4.18 s, sys: 2.53 s, total: 6.71 s
Wall time: 7.95 s
scheduler: processes
CPU times: user 3.97 s, sys: 2.1 s, total: 6.07 s
Wall time: 7.39 s
scheduler: single-threaded
CPU times: user 2.26 s, sys: 275 ms, total: 2.54 s
Wall time: 2.47 s
scheduler: threads
CPU times: user 2.84 s, sys: 341 ms, total: 3.18 s
Wall time: 2.66 s
Expected results:
- speedup (by .8 * ncores)
Other considerations:
- I also checked whether I should chunk my data, but chunked arrays take longer.
My questions:
- What did I get wrong about dask parallelization?
- Is the client setup not useful that way?
- Did I not implement dask.delayed cleverly enough?
- Is my serial function already executed in parallel because of dask? I think not.
I finally solved this. When posting this challenge, I obviously didn't understand a few aspects of it:
I ran the timings on a laptop with two physical cores. This doesn't allow much parallelization in a CPU-bound problem. Now I ran this on a node with 48 logical CPUs.
I should have thought about which parts of the algorithm are easily parallelizable and which parts are not. Only then can I chunk accordingly.
See my solution here: https://gist.github.com/aaronspring/118abd7b9bf81e555b1fced42eef427f
The game-changers wrt. the code posted initially:
I chunk a dimension (here x) which is not involved in the func (which uses time)
I still use the client as mentioned above: Best practices in setting number of dask workers
I only try to parallelize the iteration part. The quantile method is done in memory.
Conclusion: It is simpler than expected. The gist shows an implementation with dask.delayed and dask.futures, but that's not even needed in my use case. First try to understand parallelism https://realpython.com/python-concurrency/ and read the dask documentation https://dask.org/.
Much faster solution with multidimensional indexing
https://xskillscore.readthedocs.io/en/latest/api/xskillscore.core.resampling.resample_iterations_idx.html#xskillscore.core.resampling.resample_iterations_idx
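The idea behind multidimensional indexing can be sketched in plain NumPy: draw all resampling indices at once and index the array a single time, instead of looping over bootstrap iterations. This is a minimal sketch in that spirit, not the xskillscore implementation. The grid is shrunk to hypothetical small sizes, and the array names mirror the MWE above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes (the MWE above uses 50 x 96 x 192).
n_init, n_lat, n_lon = 50, 4, 6
n_boot = 96

a = rng.random((n_init, n_lat, n_lon))
b = rng.random((n_init, n_lat, n_lon))

# One row of indices per bootstrap iteration, drawn with replacement.
idx = rng.integers(0, n_init, size=(n_boot, n_init))

# Fancy indexing on axis 0 gives shape (n_boot, n_init, n_lat, n_lon).
resampled = a[idx]

# b broadcasts over the bootstrap axis; std over init matches func(a, b).
res = (resampled - b).std(axis=1)

# Confidence bounds over the bootstrap axis, as in the quantile step.
res_ci = np.quantile(res, [0.05, 0.95], axis=0)
print(res.shape, res_ci.shape)
```

The intermediate array is n_boot times larger than a, so for the full grid and N=1000 it may need to be processed in chunks along lat or lon.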
Randomized SVD decomposes a matrix by extracting the first k singular values/vectors using k+p random projections. This works surprisingly well for large matrices.
My question concerns the singular values that are output by the algorithm. Why aren't the values equal to the first k singular values of the full SVD?
Below I have a simple implementation in R. Any suggestions on improving the performance would be appreciated.
rsvd = function(A, k=10, p=5) {
  n = nrow(A)
  y = A %*% matrix(rnorm(n * (k+p)), nrow=n)
  q = qr.Q(qr(y))
  b = t(q) %*% A
  svd = svd(b)
  list(u=q %*% svd$u, d=svd$d, v=svd$v)
}
> set.seed(10)
> A <- matrix(rnorm(500*500),500,500)
> svd(A)$d[1:15]
[1] 44.94307 44.48235 43.78984 43.44626 43.27146 43.15066 42.79720 42.54440 42.27439 42.21873 41.79763 41.51349 41.48338 41.35024 41.18068
> rsvd.o(A,10,5)$d
[1] 34.83741 33.83411 33.09522 32.65761 32.34326 31.80868 31.38253 30.96395 30.79063 30.34387 30.04538 29.56061 29.24128 29.12612 27.61804
Calculation
I reckon that your algorithm is a modification of the algorithm of Martinsson et al. If I understood it correctly, it is especially meant for approximating low-rank matrices. I might be wrong though.
The difference is easily explained by the huge difference between the actual rank of A (500) and the values of k (10) and p (5). Plus, Martinsson et al mention that the value for p should actually be larger than the chosen value for k.
So if we apply your solution taking their considerations into account, using :
set.seed(10)
A <- matrix(rnorm(500*500),500,500) # rank 500
B <- matrix(rnorm(500*50),500,500) # rank 50
We find for the timings that the use of a larger p value still results in a huge speed-up compared to the original svd algorithm.
> system.time(t1 <- svd(A)$d[1:5])
user system elapsed
0.8 0.0 0.8
> system.time(t2 <- rsvd(A,10,5)$d[1:5])
user system elapsed
0.01 0.00 0.02
> system.time(t3 <- rsvd(A,10,30)$d[1:5])
user system elapsed
0.04 0.00 0.03
> system.time(t4 <- svd(B)$d[1:5] )
user system elapsed
0.55 0.00 0.55
> system.time(t5 <-rsvd(B,10,5)$d[1:5] )
user system elapsed
0.02 0.00 0.02
> system.time(t6 <-rsvd(B,10,30)$d[1:5] )
user system elapsed
0.05 0.00 0.05
> system.time(t7 <-rsvd(B,25,30)$d[1:5] )
user system elapsed
0.06 0.00 0.06
But we see that using a higher p for a lower rank matrix indeed gives a better approximation. If we let k also approach the rank a bit closer, the difference between the real solution and the approximation becomes approximately 0, while the speed gain is still substantial.
> round(mean(t2/t1),2)
[1] 0.77
> round(mean(t3/t1),2)
[1] 0.82
> round(mean(t5/t4),2)
[1] 0.92
> round(mean(t6/t4),2)
[1] 0.97
> round(mean(t7/t4),2)
[1] 1
So in general I believe that one could conclude that :
p should be chosen so p > k (Martinsson calls it l if I'm right)
k shouldn't be too much different from rank(A)
For low rank matrices the result is generally better.
Optimization:
As far as I'm concerned, it's a neat way of doing it. I couldn't really find a more optimal way actually. The only thing I could say is that the construct t(q) %*% A is advised against. One should use crossprod(q,A) for that, which is supposed to be a tiny bit faster. But in your example the difference was nonexistent.
The paper by Halko, Martinsson and Tropp also recommends doing a couple of power iterations before computing the QR. We do 3 power iterations by default in the implementation in scikit-learn and we found it to work very well in practice.
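A minimal NumPy sketch of the randomized SVD with power iterations is given below. It is an illustration under stated assumptions, not the scikit-learn code: the function name rsvd_power, the re-orthonormalization with QR in every iteration, and the test matrix are choices made for this example.

```python
import numpy as np


def rsvd_power(A, k=10, p=5, n_iter=3, seed=0):
    """Randomized SVD sketch: k components, p oversampling, n_iter power steps."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random test matrix with k + p columns.
    omega = rng.standard_normal((n_cols, k + p))
    Y = A @ omega
    # Power iterations sharpen the spectrum; QR keeps the basis well conditioned.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(A.T @ Q)
        Y = A @ Z
    Q, _ = np.linalg.qr(Y)
    # Project A onto the captured subspace and take a small exact SVD.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


# Hypothetical rank-20 test matrix: here k + p = 25 exceeds the rank.
rng = np.random.default_rng(1)
A_low = rng.standard_normal((200, 20)) @ rng.standard_normal((20, 200))
s_exact = np.linalg.svd(A_low, compute_uv=False)
U, s_approx, Vt = rsvd_power(A_low, k=10, p=15)
print(np.max(np.abs(s_approx - s_exact[:10])))
```

When k + p is at least the rank, the random projection captures the whole column space and the leading singular values agree with the full SVD up to rounding. For a full-rank Gaussian matrix such as A in the question, the spectrum decays very slowly, so a 15-dimensional subspace misses much of the energy and the returned values come out too small.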