Is my Theano program actually using the GPU? - performance

Theano claims it's using the GPU; it says what device when it starts up, etc. Furthermore nvidia-smi says it's being used.
But the running time seems to be exactly the same regardless of whether or not I use it.
Could it have something to do with integer arithmetic?
import sys
import numpy as np
import theano
import theano.tensor as T
def ariths(v, ub):
    """Given a sorted vector v and scalar ub, returns multiples of elements in v.
    Specifically, returns a vector containing all numbers j * k < ub where j is in
    v and k >= j. Some elements may occur more than once in the output.
    """
    lp = v[0]
    v = T.shape_padright(v)
    a = T.shape_padleft(T.arange(0, (ub + lp - 1) // lp - lp, 1, 'int64'))
    res = v * (a + v)
    return res[(res < ub).nonzero()]
def filter_composites(pv, using_primes):
    a = ariths(using_primes, pv.size)
    return T.set_subtensor(pv[a], 0)
def _iterfn(prev_bnds, pv):
    bstart = prev_bnds[0]
    bend = prev_bnds[1]
    use_primes = pv[bstart:bend].nonzero()[0] + bstart
    pv = filter_composites(pv, use_primes)
    return pv
def primes_to(n):
    if n <= 2:
        return np.asarray([])
    elif n <= 3:
        return np.asarray([2])
    res = T.ones(n, 'int8')
    res = T.set_subtensor(res[:2], 0)
    ubs = [[2, 4]]
    ub = 4
    while ub ** 2 < n:
        prevub = ub
        ub *= 2
        ubs.append([prevub, ub])
    (r, u5) = theano.scan(fn=_iterfn,
                          outputs_info=res, sequences=[np.asarray(ubs)])
    return r[-1].nonzero()[0]
def main(n):
    print(primes_to(n).size.eval())
if __name__ == '__main__':
    main(int(sys.argv[1]))

The answer is yes. And no. If you profile your code in a GPU enabled Theano installation using nvprof, you will see something like this:
==16540== Profiling application: python ./theano_test.py
==16540== Profiling result:
Time(%) Time Calls Avg Min Max Name
49.22% 12.096us 1 12.096us 12.096us 12.096us kernel_reduce_ccontig_node_c8d7bd33dfef61705c2854dd1f0cb7ce_0(unsigned int, float const *, float*)
30.60% 7.5200us 3 2.5060us 832ns 5.7600us [CUDA memcpy HtoD]
13.93% 3.4240us 1 3.4240us 3.4240us 3.4240us [CUDA memset]
6.25% 1.5350us 1 1.5350us 1.5350us 1.5350us [CUDA memcpy DtoH]
That is, there is at least a reduce operation being performed on your GPU. However, if you modify your main like this:
def main():
    n = 100000000
    print(primes_to(n).size.eval())
if __name__ == '__main__':
    import cProfile, pstats
    cProfile.run("main()", "{}.profile".format(__file__))
    s = pstats.Stats("{}.profile".format(__file__))
    s.strip_dirs()
    s.sort_stats("time").print_stats(10)
and use cProfile to profile your code, you will see something like this:
Thu Mar 10 14:35:24 2016 ./theano_test.py.profile
486743 function calls (480590 primitive calls) in 17.444 seconds
Ordered by: internal time
List reduced from 1138 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 6.376 6.376 16.655 16.655 {theano.scan_module.scan_perform.perform}
13 6.168 0.474 6.168 0.474 subtensor.py:2084(perform)
27 2.910 0.108 2.910 0.108 {method 'nonzero' of 'numpy.ndarray' objects}
30 0.852 0.028 0.852 0.028 {numpy.core.multiarray.concatenate}
27 0.711 0.026 0.711 0.026 {method 'astype' of 'numpy.ndarray' objects}
13 0.072 0.006 0.072 0.006 {numpy.core.multiarray.arange}
1 0.034 0.034 17.142 17.142 function_module.py:482(__call__)
387 0.020 0.000 0.052 0.000 graph.py:486(stack_search)
77 0.016 0.000 10.731 0.139 op.py:767(rval)
316 0.013 0.000 0.066 0.000 graph.py:715(general_toposort)
The slowest operation (just) is the scan call, and looking at the source for scan, you can see that GPU execution of scan is presently disabled.
So the answer is: yes, the GPU is being used for something in your code, but no, the most time-consuming operations are running on the CPU, because GPU execution of scan appears to be hard-disabled in the code at present.
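The cProfile step above can also be run in memory, without writing a .profile file. The following is only a sketch of that idea: sieve is a plain-Python stand-in workload (not the Theano code), and profile_top is a made-up helper name.

```python
import cProfile
import io
import pstats


def sieve(n):
    """Plain-Python sieve returning the primes below n (stand-in workload only)."""
    flags = bytearray([1]) * n
    flags[:2] = b"\x00\x00"
    i = 2
    while i * i < n:
        if flags[i]:
            # Clear every multiple of i, starting at i*i.
            flags[i * i::i] = bytearray(len(range(i * i, n, i)))
        i += 1
    return [k for k, f in enumerate(flags) if f]


def profile_top(func, *args, limit=5):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.strip_dirs().sort_stats("tottime").print_stats(limit)
    return result, buf.getvalue()


if __name__ == "__main__":
    primes, report = profile_top(sieve, 1000000)
    print(len(primes))
    print(report)
```

Sorting by "tottime" matches the "internal time" ordering in the listing above, so the entries at the top are the ones where time is actually spent rather than the callers that merely wait on them.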

Related

How to optimize a function and minimize allocations

The following function generates primes up to N. For large N this becomes quite slow; my Julia implementation is 5x faster for N = 10**7. I guess that creating a large integer array and using pack to collect the result is the slowest part. I tried counting the .true. values first, then allocating res(:) and populating it in a loop, but the speedup was negligible (4%) because I iterate over the prims array twice in that case. In Julia I used findall, which does exactly what I did: it iterates over the array twice, first counting the trues and allocating the result, then populating it. Any ideas? Thank you.
Compiler:
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.3.210 Build 20180410 (on Windows 10)
Options: ifort -warn /O3 -heap-arrays:8000000
program main
implicit none
integer, allocatable :: primes(:)
integer :: t0, t1, count_rate, count_max
call system_clock(t0, count_rate, count_max)
primes = do_primes(10**7)
call system_clock(t1)
print '(a,f7.5,a)', 'Elapsed time: ', real(t1-t0)/count_rate, ' seconds'
print *, primes(1:10)
contains
function do_primes(N) result (res)
integer, allocatable :: res(:), array(:)
logical, allocatable :: prims(:)
integer :: N, i, j
allocate (prims(N))
prims = .true.
i = 3
do while (i * i < N)
j = i
do while (j * i < N)
prims(j*i) = .false.
j = j + 2
end do
i = i + 2
end do
prims(1) = .false.
prims(2) = .true.
do i = 4, N, 2
prims(i) = .false.
end do
allocate (array(N))
do i = 1, N
array(i) = i
end do
res = pack(array, prims)
end
end
Timing (147 runs):
Elapsed time: 0.14723 seconds
Edit:
I converted the do while loops to plain do loops as per @IanBush's comment, like this; still no speedup:
do i = 3, sqrt(dble(N)), 2
do j = i, N/i, 2
prims(j*i) = .false.
end do
end do
The Julia implementation:
function do_primes(N)
prims = trues(N)
i = 3
while i * i < N
j = i
while j * i < N
prims[j*i] = false
j = j + 2
end
i = i + 2
end
prims[1] = false
prims[2] = true
prims[4:2:N] .= false
return findall(prims)
end
Timing:
using BenchmarkTools
@benchmark do_primes(10^7)
BenchmarkTools.Trial:
memory estimate: 6.26 MiB
allocs estimate: 5
--------------
minimum time: 32.227 ms (0.00% GC)
median time: 32.793 ms (0.00% GC)
mean time: 34.098 ms (3.92% GC)
maximum time: 94.479 ms (65.46% GC)
--------------
samples: 147
evals/sample: 1
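For readers who want to experiment with the two-pass "count, allocate once, fill" idea described in the question, here is a rough Python sketch of the same odd-only sieve. It is an illustration, not a translation of either program: it skips composite values of i, and it uses <= in the loop bounds, because the strict < in the question appears to leave N itself unmarked when N is an odd composite (for example N = 9).

```python
def do_primes(n):
    """Return the primes <= n using odd-only marking and a two-pass collect."""
    if n < 2:
        return []
    # Index 0 is unused so that indices match the 1-based arrays in the question.
    prims = bytearray([1]) * (n + 1)
    prims[0] = prims[1] = 0
    for k in range(4, n + 1, 2):
        prims[k] = 0
    i = 3
    while i * i <= n:
        if prims[i]:  # deviation from the question: skip i that is already composite
            # Odd multiples of i only: i*i, i*(i+2), i*(i+4), ...
            for m in range(i * i, n + 1, 2 * i):
                prims[m] = 0
        i += 2
    # Two-pass "pack": count the survivors, allocate once, then fill.
    count = sum(prims)
    res = [0] * count
    pos = 0
    for k in range(2, n + 1):
        if prims[k]:
            res[pos] = k
            pos += 1
    return res


if __name__ == "__main__":
    print(do_primes(50))
```

The second pass touches the flag array again, which is the extra traversal the question mentions; whether that matters in practice depends on memory bandwidth rather than on the marking loops.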

Julia: why doesn't shared memory multi-threading give me a speedup?

I want to use shared memory multi-threading in Julia. As done by the Threads.@threads macro, I can use ccall(:jl_threading_run ...) to do this. And whilst my code now runs in parallel, I don't get the speedup I expected.
The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having: [EDIT: See later for even more minimal example]
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
work_idx = 1
my_result = results[Threads.threadid()]
while work_idx > 0
my_result += objects[work_idx]
work_idx += nthreads
if work_idx > test_size
break
end
counts[Threads.threadid()] += 1
end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multi-threading slows things down! Why?
EDIT: A better minimal example can be created using the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
a[Threads.threadid()] += b[i]
calls[Threads.threadid()] += 1
end
I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.
The problem you have is most probably false sharing.
You can solve it by separating the areas you write to far enough like this (here is a "quick and dirty" implementation to show the essence of the change):
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)
julia> @btime f(1);
41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
2.189 ms (35 allocations: 7.63 MiB)
or doing per-thread accumulation on a local variable like this (this is a preferred approach as it should be uniformly faster):
function getrange(n)
tid = Threads.threadid()
nt = Threads.nthreads()
d , r = divrem(n, nt)
from = (tid - 1) * d + min(r, tid - 1) + 1
to = from + d - 1 + (tid ≤ r ? 1 : 0)
from:to
end
function f()
test_size = 10^8
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
Threads.@threads for k = 1 : Threads.nthreads()
local_a = 0.0
local_c = 0.0
for i in getrange(test_size)
for j in 1:10
local_a += b[i]
local_c += 1
end
end
a[Threads.threadid()] = local_a
calls[Threads.threadid()] = local_c
end
a, calls
end
Also note that you are probably using 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gains from threading will not be linear.
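To make the index arithmetic in getrange easier to check, here is a small Python rendering of the same partitioning, combined with the "accumulate locally, write once" pattern. The names get_range, chunk_sum and threaded_sum are invented for this sketch. In standard CPython builds the GIL prevents a real speedup for this kind of loop, so the code only illustrates the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def get_range(n, nt, tid):
    """1-based inclusive (start, stop) of the chunk owned by worker tid in 1..nt."""
    d, r = divmod(n, nt)
    # The first r workers each take one extra element.
    start = (tid - 1) * d + min(r, tid - 1) + 1
    stop = start + d - 1 + (1 if tid <= r else 0)
    return start, stop


def chunk_sum(values, nt, tid):
    """Sum one chunk into a local variable; nothing shared is written in the loop."""
    start, stop = get_range(len(values), nt, tid)
    local = 0.0
    for i in range(start - 1, stop):  # convert to 0-based indexing
        local += values[i]
    return local


def threaded_sum(values, nt=4):
    """Return the list of per-worker partial sums, one entry per worker."""
    with ThreadPoolExecutor(max_workers=nt) as pool:
        return list(pool.map(lambda tid: chunk_sum(values, nt, tid), range(1, nt + 1)))


if __name__ == "__main__":
    data = [float(v) for v in range(1, 11)]
    print([get_range(len(data), 4, t) for t in range(1, 5)])
    print(threaded_sum(data, nt=4))
```

Each worker produces a single result at the end, so no two workers repeatedly write to neighbouring memory locations inside the hot loop, which is the point of the second approach in the answer.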

OpenCV 3.1 optimization

I'm currently trying to implement an algorithm from a paper with OpenCV 3.1 on python 2.7 but the process is taking way too long.
The section of my code that's giving me trouble looks something like this:
width, height = mr.shape[:2]
Pm = []
for i in d:
    M = np.float32([[1,0,-d[i]], [0,1,1]])
    mrd = cv2.warpAffine(mr, M, (height,width))
    C = cv2.subtract(ml, mrd)
    C = cv2.pow(C,2)
    C = np.divide(C, sigma_m)
    C = p0 + (1-p0)**(-C)
    Pm.append(C)
Where ml, mr and mrd are cv2 objects and d, p0 and sigma_m are integers.
The division and final equation in the last 3 lines are the real troublemakers here. Every iteration of this cycle is independent so in theory I could just split the 'for loop' through a few processors, but that seems like a lazy approach where I would just bypass the problem instead of fixing it.
Does anyone know a way to perform those computations faster?
We can leverage the numexpr module to perform all of those latter arithmetic operations efficiently as one evaluate expression.
Thus, these steps :
C = cv2.subtract(ml, mrd)
C = cv2.pow(C,2)
C = np.divide(C, sigma_m)
C = p0 + (1-p0)**(-C)
could be replaced by one expression -
import numexpr as ne
C = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
Let's verify things. The original approach as func -
def original_app(ml, mrd, sigma_m, p0):
    C = cv2.subtract(ml, mrd)
    C = cv2.pow(C,2)
    C = np.divide(C, sigma_m)
    C = p0 + (1-p0)**(-C)
    return C
Verification -
In [28]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [29]: out1 = original_app(ml, mrd, sigma_m, p0)
In [30]: out2 = ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
In [31]: np.allclose(out1, out2)
Out[31]: True
Timings across various sizes of datasets -
In [19]: # Setup inputs
...: S = 1024 # Size parameter
...: ml = np.random.randint(0,255,(S,S))/255.0
...: mrd = np.random.randint(0,255,(S,S))/255.0
...: sigma_m = 0.45
...: p0 = 0.56
...:
In [20]: %timeit original_app(ml, mrd, sigma_m, p0)
10 loops, best of 3: 67.1 ms per loop
In [21]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 12.9 ms per loop
In [22]: # Setup inputs
...: S = 512 # Size parameter
In [23]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 15.3 ms per loop
In [24]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
100 loops, best of 3: 3.39 ms per loop
In [25]: # Setup inputs
...: S = 256 # Size parameter
In [26]: %timeit original_app(ml, mrd, sigma_m, p0)
100 loops, best of 3: 3.65 ms per loop
In [27]: %timeit ne.evaluate('p0 +(1-p0)**(-((ml-mrd)**2)/sigma_m)')
1000 loops, best of 3: 878 µs per loop
Around 5x speedup across various sizes with better speedups for larger arrays!
Also, as a side-note, I would advise using initialized arrays instead of appending as you are doing at the final step. Thus, we could initialize before going into the loop with something like out = np.zeros((len(d), width, height)) / np.empty and at the final step assign into the output array with : out[iteration_ID] = C.

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab).
In Matlab, computing N^3 is slower than NxNxN. This doesn't happen with N^2 and NxN. Matlab uses a different algorithm for higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I ran a few tests of this in Matlab. I translated that code to Julia.
Links to code:
http://pastebin.com/bbeukhTc
(I cant upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x times of the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x times of the smallest)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x times of the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5 times of the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than Matlab, but I am using a relatively old version. I can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
  %result = call double @llvm.powi.f64(double %F, i32 %power)
  ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications. That's why N^3 is slower than N^2.
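The two traces above can be reproduced with a short Python sketch of the same square-and-multiply loop. This is an illustration rather than the compiler-rt code: powi is a made-up name, the multiplication counter is added only for demonstration, and the sketch takes abs(b) up front instead of relying on C's truncating division for negative exponents.

```python
def powi(a, b):
    """Square-and-multiply power; returns (a**b, number of multiplications used)."""
    recip = b < 0
    b = abs(b)  # assumption of this sketch; the C code works on the signed value
    r = 1.0
    mults = 0
    while True:
        if b & 1:
            r *= a
            mults += 1
        b //= 2
        if b == 0:
            break
        a *= a
        mults += 1
    return (1.0 / r if recip else r), mults


if __name__ == "__main__":
    for exponent in (2, 3, 7, 8):
        print(exponent, powi(2.0, exponent))
```

With this loop an exponent of 2 costs two multiplications and an exponent of 3 costs three, which matches the observation that cubing is slower than squaring.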
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: yes, there is a way to force usage of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.

Counting swaps for sorting statistics - what about swaps with only two assignments instead of three

While helping out a student with his classes, I implemented the dual-pivot quicksort algorithm to prepare a session and got intrigued. After running some statistics, then solving the worst-case situation, then running stats again, and again solving the next worst-case situation, and repeating this process several times, the resulting code is no more than 80 lines of simple, straightforward Python code (a bit less than Vladimir's code). The novel part is how the 3 partitions are constructed, in combination with some very simple yet effective post-processing of them. Now I need some help on how to test and produce statistics properly.
Especially about how to count the swaps: most of the swaps only perform two assignments instead of three. So must I count them as full swaps, or is it fair to count them only as a '2/3' swap?
Counting every swap as 1, the Cn in Cn * N * log2(N) is around 0.48 on short lists (<100 elements) and around 0.55 on longer lists of several million elements. That is just the theoretical minimum as calculated by Vladimir Yaroslavskiy.
Counting the lighter swaps as 2/3 instead, the number of needed swaps is almost equal for any list size and is around 0.36 (stdev around 0.015).
The Cn for the number of comparisons is on average around 1.3 for lists of 2 million records, which is less than the theoretical 1.38 (from 2*N*ln(N)), and lower for shorter lists; e.g. for 1024 elements it's around 1.21.
That is for lists with 100% unique numbers and randomly ordered with Python's random.shuffle().
So my question is:
Is it ok to count the lighter swaps as such, and is the result indeed promising or not?
Also interesting is:
The more equal elements in the list, the faster it sorts. Cn is 0.03 and 0.1 for swaps and comparisons respectively for a 2 million list of all equal elements.
Cn for sorted and reversed sorted lists are almost the same for all sizes: 0.3 and 1 for the swaps (counted with 2/3) and comparisons respectively.
I will post a list with more statistics shortly which includes maximum stack depth, number of recursive calls besides the swaps and comparisons. Are there other things I should count?
Also, are there some 'standard' test suites with files of all kinds of situations (with equals, partially sorted etc.) one can use to test a sorting algorithm, and to make the results comparable with other sorting algorithms.
Added May 5:
I improved the algorithm especially for sorted lists.
Here are the results for 20 runs each.
Are these good results?
New statistics:
Random.shuffle(), unique number
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.367 0.922 0.250
64 0.360 1.072 0.500
256 0.342 1.122 0.625
1024 0.358 1.156 0.800
4096 0.359 1.199 0.917
16384 0.359 1.244 1.071
65536 0.360 1.244 1.125
262144 0.360 1.269 1.167
1048576 0.362 1.275 1.200
Sorted, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.172 0.531 0.250
64 0.117 0.586 0.333
256 0.087 0.609 0.375
1024 0.075 0.740 0.500
4096 0.060 0.732 0.500
16384 0.051 0.726 0.500
65536 0.044 0.722 0.500
262144 0.041 0.781 0.556
1048576 0.036 0.774 0.550
2097152 0.035 0.780 0.571
Reversed order, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.344 0.828 0.250
64 0.279 0.812 0.333
256 0.234 0.788 0.375
1024 0.210 0.858 0.500
4096 0.190 0.865 0.500
16384 0.172 0.855 0.500
65536 0.158 0.846 0.500
262144 0.153 0.900 0.556
1048576 0.143 0.892 0.550
2097152 0.140 0.895 0.571
I have chosen to count the assignments executed on the elements to be sorted, instead of 'swaps'. Assignments and comparisons of indexes are not counted.
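The counting convention (count every assignment that reads or writes an element, ignore index arithmetic) can be sketched on a much simpler algorithm. The example below uses insertion sort, not the dual-pivot code, and the names Counters, insertion_sort_counted and normalised are made up for illustration.

```python
import math


class Counters:
    """Holds the two tallies so that no global variables are needed."""

    def __init__(self):
        self.assignments = 0
        self.comparisons = 0


def insertion_sort_counted(a, c):
    """Sort the list a in place, counting element assignments and comparisons in c."""
    for i in range(1, len(a)):
        key = a[i]
        c.assignments += 1      # element copied into key
        j = i
        while j > 0:
            c.comparisons += 1  # key compared with an element
            if key < a[j - 1]:
                a[j] = a[j - 1]
                c.assignments += 1  # element shifted one slot to the right
                j -= 1          # index update: not counted
            else:
                break
        a[j] = key
        c.assignments += 1      # key written back
    return a


def normalised(count, n):
    """Express a raw count as a multiple of N*log2(N), as in the tables."""
    return count / (n * math.log2(n))


if __name__ == "__main__":
    c = Counters()
    data = [5, 2, 9, 1, 7, 3, 8, 6]
    insertion_sort_counted(data, c)
    print(data, c.assignments, c.comparisons)
    print(normalised(c.assignments, len(data)))
```

Counting individual assignments this way sidesteps the question of whether a two-assignment move counts as a full swap or as 2/3 of one, because each element write is tallied exactly once.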
I converted the code Vladimir Yaroslavskiy included in his document (Last updated: September 22, 2009) to Python and added the counters the same way as I did in my own implementation. The code is included at the end.
Any comments are welcome.
Here are the results, the averages of 10 runs.
The columns labeled VY are the results for Vladimir's implementation; the columns labeled JB are those of my own implementation.
Columns: Length of list; F; Function calls per N; Assignments per N.log2(N); Comparisons per N.log2(N); Maximum stack per log2(N). Each of the four measures has a VY and a JB column.
Random.shuffle(), unique number
Version VY JB VY JB VY JB VY JB
64 1 0.170 0.266 1.489 1.029 1.041 1.028 0.417 0.633
256 1 0.171 0.270 1.463 1.016 1.066 1.138 0.575 0.812
1024 1 0.167 0.275 1.451 1.046 1.089 1.165 0.690 1.010
4096 1 0.164 0.273 1.436 1.069 1.119 1.189 0.800 1.075
16384 1 0.166 0.273 1.444 1.077 1.117 1.270 0.843 1.221
65536 1 0.166 0.273 1.440 1.108 1.126 1.258 0.919 1.281
262144 1 0.166 0.273 1.423 1.102 1.134 1.278 0.950 1.306
1048576 1 0.166 0.273 1.426 1.085 1.131 1.273 0.990 1.290
Sorted, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.036 0.349 0.643 0.586 0.333 0.333
256 1 0.156 0.156 0.904 0.262 0.643 0.609 0.375 0.375
1024 1 0.118 0.355 0.823 0.223 0.642 0.740 0.400 0.500
4096 1 0.131 0.267 0.840 0.181 0.679 0.732 0.500 0.500
16384 1 0.200 0.200 0.926 0.152 0.751 0.726 0.500 0.500
65536 1 0.150 0.150 0.866 0.131 0.737 0.722 0.500 0.500
262144 1 0.113 0.338 0.829 0.124 0.728 0.781 0.500 0.556
1048576 1 0.147 0.253 0.853 0.108 0.750 0.774 0.550 0.550
Reversed order, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.320 0.836 0.841 0.802 0.333 0.333
256 1 0.156 0.156 1.118 0.703 0.795 0.783 0.375 0.375
1024 1 0.118 0.312 1.002 0.631 0.768 0.852 0.400 0.500
4096 1 0.125 0.267 0.977 0.569 0.776 0.861 0.500 0.500
16384 1 0.200 0.200 1.046 0.516 0.834 0.852 0.500 0.500
65536 1 0.150 0.150 0.974 0.475 0.813 0.844 0.500 0.500
262144 1 0.113 0.338 0.925 0.459 0.795 0.896 0.500 0.556
1048576 1 0.145 0.253 0.938 0.430 0.811 0.890 0.550 0.550
Random, with increasing frequency of the numbers.
The last row is a list of the same number
Version VY JB VY JB VY JB VY JB
65536 1 0.166 0.273 1.429 1.051 1.113 1.251 0.881 1.156
65536 2 0.167 0.270 1.404 1.075 1.112 1.238 0.894 1.194
65536 4 0.168 0.273 1.373 1.039 1.096 1.213 0.906 1.238
65536 8 0.151 0.245 1.302 1.029 1.069 1.199 0.900 1.262
65536 16 0.132 0.127 1.264 0.970 1.020 1.150 0.912 1.188
65536 32 0.090 0.064 1.127 0.920 0.950 1.099 0.856 1.119
65536 64 0.051 0.032 1.000 0.845 0.879 0.993 0.819 1.019
65536 128 0.026 0.016 0.884 0.792 0.797 0.923 0.725 0.931
65536 256 0.013 0.008 0.805 0.704 0.728 0.840 0.675 0.856
65536 512 0.006 0.004 0.690 0.615 0.652 0.728 0.588 0.669
65536 1024 0.003 0.002 0.635 0.557 0.579 0.654 0.519 0.625
65536 2048 0.002 0.001 0.541 0.487 0.509 0.582 0.438 0.463
65536 4096 0.001 0.000 0.459 0.417 0.434 0.471 0.369 0.394
65536 8192 0.000 0.000 0.351 0.359 0.357 0.405 0.294 0.300
65536 16384 0.000 0.000 0.247 0.297 0.253 0.314 0.206 0.194
65536 32768 0.000 0.000 0.231 0.188 0.209 0.212 0.125 0.081
65536 65536 0.000 0.000 0.063 0.125 0.063 0.125 0.062 0.000
Here is the code of Vladimir's sort in Python:
DIST_SIZE = 13
TINY_SIZE = 17
def dualPivotQuicksort(a, left, right, nesting=0):
    global assignements, comparisons, oproepen, maxnesting
    oproepen += 1
    maxnesting = max(maxnesting, nesting)
    length = right - left
    if length < TINY_SIZE: # insertion sort on tiny array
        # note by JB: rewritten to minimize the assignements
        for i in xrange(left+1, right+1):
            key = a[i]
            assignements += 1
            while i > left:
                comparisons += 1
                if key < a[i - 1]:
                    assignements += 1
                    a[i] = a[i-1]
                    i -= 1
                else:
                    break
            assignements += 1
            a[i] = key
        return
    # median indexes
    sixth = length / 6  # integer division under Python 2; use // on Python 3
    m1 = left + sixth
    m2 = m1 + sixth
    m3 = m2 + sixth
    m4 = m3 + sixth
    m5 = m4 + sixth
    assignements += 9*3
    comparisons += 9
    ## 5-element sorting network
    if a[m1] > a[m2]: a[m1],a[m2] = a[m2],a[m1]
    if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
    if a[m1] > a[m3]: a[m1],a[m3] = a[m3],a[m1]
    if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
    if a[m1] > a[m4]: a[m1],a[m4] = a[m4],a[m1]
    if a[m3] > a[m4]: a[m3],a[m4] = a[m4],a[m3]
    if a[m2] > a[m5]: a[m2],a[m5] = a[m5],a[m2]
    if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
    if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
    # pivots: [ < pivot1 | pivot1 <= && <= pivot2 | > pivot2 ]
    assignements += 2
    pivot1 = a[m2]
    pivot2 = a[m4]
    comparisons += 1
    diffPivots = pivot1 != pivot2
    assignements += 2
    a[m2] = a[left]
    a[m4] = a[right]
    # center part pointers
    less = left + 1
    great = right - 1
    # sorting
    if (diffPivots):
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if (x < pivot1):
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif (x > pivot2):
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x < pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    else:
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 1
            if (x == pivot1):
                k += 1
                continue
            comparisons += 1
            if (x < pivot1):
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            else:
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x < pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # swap
    assignements += 2
    a[left] = a[less - 1]
    a[less - 1] = pivot1
    assignements += 2
    a[right] = a[great + 1]
    a[great + 1] = pivot2
    # left and right parts
    dualPivotQuicksort(a, left, less - 2, nesting+1)
    dualPivotQuicksort(a, great + 2, right, nesting+1)
    # equal elements
    if (great - less > length - DIST_SIZE and diffPivots):
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if (x == pivot1):
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif (x == pivot2):
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if (x == pivot1):
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # center part
    if (diffPivots):
        dualPivotQuicksort(a, less, great, nesting+1)
This code is about 190 lines, my current implementation written with the same formatting is about 110 lines.
So any remarks are welcome.

Resources