Counting swaps for sorting statistics - how to count swaps that use only two assignments instead of three

While helping a student with his classes, I implemented the dual-pivot quicksort algorithm to prepare a session, and got intrigued. After running some statistics, solving the worst case, running stats again, solving the next worst case, and repeating this several times, the resulting code is no more than 80 lines of simple, straightforward Python (a bit less than Vladimir's code). The novel part is how the 3 partitions are constructed, combined with some very simple yet effective post-processing of them. Now I need some help on how to test and gather statistics properly.
In particular, how should I count the swaps? Most of the swaps perform only two assignments instead of three. So should I count them as full swaps, or is it fair to count them as only '2/3' of a swap?
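To make the distinction concrete, here is a minimal sketch (illustrative helper names, not my actual code) of the two kinds of element moves I am counting:

def full_swap(a, i, j):
    # classic swap: three assignments touch the elements
    tmp = a[i]
    a[i] = a[j]
    a[j] = tmp

def light_move(a, i, j, x):
    # 'lighter' swap: x already holds an element loaded earlier,
    # so only two element assignments are needed here
    a[i] = a[j]
    a[j] = x

This is the pattern the counters in the code further down account for with assignements += 2 versus assignements += 3.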
Counting every swap as 1, the Cn in Cn * N * log2(N) is around 0.48 on short lists (<100 elements) and around 0.55 on longer lists of several million elements. That is right at the theoretical minimum as calculated by Vladimir Yaroslavskiy.
Counting the lighter swaps as 2/3 instead, the number of swaps needed is almost the same for any list size: Cn is around 0.36 (stdev around 0.015).
The Cn for the number of comparisons averages around 1.3 for lists of 2 million records, which is less than the theoretical 1.38 (from 2*N*ln(N)), and it is lower for shorter lists, e.g. around 1.21 for 1024 elements.
That is for lists with 100% unique numbers, randomly ordered with Python's random.shuffle().
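For reference, here is a minimal sketch (the helper names are mine, not from the actual harness) of how such a test input and the normalized constant Cn are produced:

import math
import random

def make_input(n):
    # 100% unique keys in random order, as used for the statistics above
    a = list(range(n))
    random.shuffle(a)
    return a

def cn(count, n):
    # the constant Cn such that count == Cn * n * log2(n)
    return count / (n * math.log(n, 2))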
So my question is:
Is it OK to count the lighter swaps this way, and are these results actually promising?
Also interesting is:
the more equal elements in the list, the faster it sorts. Cn is 0.03 and 0.1 for swaps and comparisons respectively for a 2-million-element list where all elements are equal.
Cn for sorted and reverse-sorted lists is almost the same for all sizes: 0.3 and 1 for the swaps (counted with 2/3) and comparisons respectively.
I will shortly post a list with more statistics, including maximum stack depth and number of recursive calls besides the swaps and comparisons. Are there other things I should count?
Also, are there some 'standard' test suites with files covering all kinds of situations (with equals, partially sorted, etc.) that one can use to test a sorting algorithm and to make the results comparable with other sorting algorithms?
Added May 5:
I improved the algorithm especially for sorted lists.
Here are the results, 20 runs each.
Are these good results?
New statistics:
random.shuffle(), unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.367 0.922 0.250
64 0.360 1.072 0.500
256 0.342 1.122 0.625
1024 0.358 1.156 0.800
4096 0.359 1.199 0.917
16384 0.359 1.244 1.071
65536 0.360 1.244 1.125
262144 0.360 1.269 1.167
1048576 0.362 1.275 1.200
Sorted, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.172 0.531 0.250
64 0.117 0.586 0.333
256 0.087 0.609 0.375
1024 0.075 0.740 0.500
4096 0.060 0.732 0.500
16384 0.051 0.726 0.500
65536 0.044 0.722 0.500
262144 0.041 0.781 0.556
1048576 0.036 0.774 0.550
2097152 0.035 0.780 0.571
Reversed order, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.344 0.828 0.250
64 0.279 0.812 0.333
256 0.234 0.788 0.375
1024 0.210 0.858 0.500
4096 0.190 0.865 0.500
16384 0.172 0.855 0.500
65536 0.158 0.846 0.500
262144 0.153 0.900 0.556
1048576 0.143 0.892 0.550
2097152 0.140 0.895 0.571

I have chosen to count the assignments executed on the elements to be sorted, instead of 'swaps'. Assignments and comparisons of indexes are not counted.
I converted the code Vladimir Yaroslavskiy included in his document (Last updated: September 22, 2009) to Python and added the counters the same way as I did in my own implementation. The code is included at the end.
Any comments are welcome.
Here are the results, the averages of 10 runs.
The columns labeled VY are the results for the implementation by Vladimir; the columns labeled JB are those of my own implementation.
Length F Function calls Assignments Comparisons Maximum stack
of list per N per N.log2(N) per N.log2(N) per log2(N)
random.shuffle(), unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.170 0.266 1.489 1.029 1.041 1.028 0.417 0.633
256 1 0.171 0.270 1.463 1.016 1.066 1.138 0.575 0.812
1024 1 0.167 0.275 1.451 1.046 1.089 1.165 0.690 1.010
4096 1 0.164 0.273 1.436 1.069 1.119 1.189 0.800 1.075
16384 1 0.166 0.273 1.444 1.077 1.117 1.270 0.843 1.221
65536 1 0.166 0.273 1.440 1.108 1.126 1.258 0.919 1.281
262144 1 0.166 0.273 1.423 1.102 1.134 1.278 0.950 1.306
1048576 1 0.166 0.273 1.426 1.085 1.131 1.273 0.990 1.290
Sorted, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.036 0.349 0.643 0.586 0.333 0.333
256 1 0.156 0.156 0.904 0.262 0.643 0.609 0.375 0.375
1024 1 0.118 0.355 0.823 0.223 0.642 0.740 0.400 0.500
4096 1 0.131 0.267 0.840 0.181 0.679 0.732 0.500 0.500
16384 1 0.200 0.200 0.926 0.152 0.751 0.726 0.500 0.500
65536 1 0.150 0.150 0.866 0.131 0.737 0.722 0.500 0.500
262144 1 0.113 0.338 0.829 0.124 0.728 0.781 0.500 0.556
1048576 1 0.147 0.253 0.853 0.108 0.750 0.774 0.550 0.550
Reversed order, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.320 0.836 0.841 0.802 0.333 0.333
256 1 0.156 0.156 1.118 0.703 0.795 0.783 0.375 0.375
1024 1 0.118 0.312 1.002 0.631 0.768 0.852 0.400 0.500
4096 1 0.125 0.267 0.977 0.569 0.776 0.861 0.500 0.500
16384 1 0.200 0.200 1.046 0.516 0.834 0.852 0.500 0.500
65536 1 0.150 0.150 0.974 0.475 0.813 0.844 0.500 0.500
262144 1 0.113 0.338 0.925 0.459 0.795 0.896 0.500 0.556
1048576 1 0.145 0.253 0.938 0.430 0.811 0.890 0.550 0.550
Random, with increasing frequency of the numbers (column F).
The last row is a list where every element is the same number. (One plausible way to generate such inputs is sketched after the table.)
Version VY JB VY JB VY JB VY JB
65536 1 0.166 0.273 1.429 1.051 1.113 1.251 0.881 1.156
65536 2 0.167 0.270 1.404 1.075 1.112 1.238 0.894 1.194
65536 4 0.168 0.273 1.373 1.039 1.096 1.213 0.906 1.238
65536 8 0.151 0.245 1.302 1.029 1.069 1.199 0.900 1.262
65536 16 0.132 0.127 1.264 0.970 1.020 1.150 0.912 1.188
65536 32 0.090 0.064 1.127 0.920 0.950 1.099 0.856 1.119
65536 64 0.051 0.032 1.000 0.845 0.879 0.993 0.819 1.019
65536 128 0.026 0.016 0.884 0.792 0.797 0.923 0.725 0.931
65536 256 0.013 0.008 0.805 0.704 0.728 0.840 0.675 0.856
65536 512 0.006 0.004 0.690 0.615 0.652 0.728 0.588 0.669
65536 1024 0.003 0.002 0.635 0.557 0.579 0.654 0.519 0.625
65536 2048 0.002 0.001 0.541 0.487 0.509 0.582 0.438 0.463
65536 4096 0.001 0.000 0.459 0.417 0.434 0.471 0.369 0.394
65536 8192 0.000 0.000 0.351 0.359 0.357 0.405 0.294 0.300
65536 16384 0.000 0.000 0.247 0.297 0.253 0.314 0.206 0.194
65536 32768 0.000 0.000 0.231 0.188 0.209 0.212 0.125 0.081
65536 65536 0.000 0.000 0.063 0.125 0.063 0.125 0.062 0.000
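For reproducibility, here is one plausible way to generate inputs where each value occurs about F times; this is an assumption on my part, not necessarily how the lists above were built:

import random

def make_input_with_duplicates(n, f):
    # draw n values from about n // f distinct keys, so each key occurs
    # roughly f times on average; f == n yields the all-equal last row
    return [random.randrange(max(1, n // f)) for _ in range(n)]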
Here is the code of Vladimir's sort in Python:
DIST_SIZE = 13
TINY_SIZE = 17

def dualPivotQuicksort(a, left, right, nesting=0):
    global assignements, comparisons, oproepen, maxnesting
    oproepen += 1
    maxnesting = max(maxnesting, nesting)
    length = right - left
    if length < TINY_SIZE:  # insertion sort on tiny array
        # note by JB: rewritten to minimize the assignements
        for i in xrange(left+1, right+1):
            key = a[i]
            assignements += 1
            while i > left:
                comparisons += 1
                if key < a[i - 1]:
                    assignements += 1
                    a[i] = a[i-1]
                    i -= 1
                else:
                    break
            assignements += 1
            a[i] = key
        return
    # median indexes
    sixth = length / 6
    m1 = left + sixth
    m2 = m1 + sixth
    m3 = m2 + sixth
    m4 = m3 + sixth
    m5 = m4 + sixth
    assignements += 9*3
    comparisons += 9
    ## 5-element sorting network
    if a[m1] > a[m2]: a[m1], a[m2] = a[m2], a[m1]
    if a[m4] > a[m5]: a[m4], a[m5] = a[m5], a[m4]
    if a[m1] > a[m3]: a[m1], a[m3] = a[m3], a[m1]
    if a[m2] > a[m3]: a[m2], a[m3] = a[m3], a[m2]
    if a[m1] > a[m4]: a[m1], a[m4] = a[m4], a[m1]
    if a[m3] > a[m4]: a[m3], a[m4] = a[m4], a[m3]
    if a[m2] > a[m5]: a[m2], a[m5] = a[m5], a[m2]
    if a[m2] > a[m3]: a[m2], a[m3] = a[m3], a[m2]
    if a[m4] > a[m5]: a[m4], a[m5] = a[m5], a[m4]
    # pivots: [ < pivot1 | pivot1 <= && <= pivot2 | > pivot2 ]
    assignements += 2
    pivot1 = a[m2]
    pivot2 = a[m4]
    comparisons += 1
    diffPivots = pivot1 != pivot2
    assignements += 2
    a[m2] = a[left]
    a[m4] = a[right]
    # center part pointers
    less = left + 1
    great = right - 1
    # sorting
    if diffPivots:
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if x < pivot1:
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif x > pivot2:
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if x < pivot1:
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    else:
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 1
            if x == pivot1:
                k += 1
                continue
            comparisons += 1
            if x < pivot1:
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            else:
                while k < great:
                    comparisons += 1
                    if a[great] > pivot2:
                        great -= 1
                    else:
                        break
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if x < pivot1:
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # swap
    assignements += 2
    a[left] = a[less - 1]
    a[less - 1] = pivot1
    assignements += 2
    a[right] = a[great + 1]
    a[great + 1] = pivot2
    # left and right parts
    dualPivotQuicksort(a, left, less - 2, nesting+1)
    dualPivotQuicksort(a, great + 2, right, nesting+1)
    # equal elements
    if great - less > length - DIST_SIZE and diffPivots:
        k = less
        while k <= great:
            assignements += 1
            x = a[k]
            comparisons += 2
            if x == pivot1:
                comparisons -= 1
                assignements += 2
                a[k] = a[less]
                a[less] = x
                less += 1
            elif x == pivot2:
                assignements += 3
                a[k] = a[great]
                a[great] = x
                great -= 1
                x = a[k]
                comparisons += 1
                if x == pivot1:
                    assignements += 2
                    a[k] = a[less]
                    a[less] = x
                    less += 1
            k += 1
    # center part
    if diffPivots:
        dualPivotQuicksort(a, less, great, nesting+1)
This code is about 190 lines; my current implementation, written with the same formatting, is about 110 lines.
So any remarks are welcome.

Related

Inverse of cumsum in Julia

The matrix Y is defined as
Y = cumsum(cumsum(X,dims=1), dims=2)
For example,
julia> X = [1 4 2 3; 2 4 5 2; 4 3 4 1; 2 5 4 2];
julia> Y = cumsum(cumsum(X,dims=1), dims=2)
4x4 Matrix{Int64}:
1 5 7 10
3 11 18 23
7 18 29 35
9 25 40 48
I want to reproduce the matrix X from Y. It seems that the function diff is helpful. However, as you can see below, we cannot reproduce the first row and first column of X.
julia> diff(diff(Y, dims=1), dims=2)
3x3 Matrix{Int64}:
4 5 2
3 4 1
5 4 2
So, I concatenate zeros. Then, it works.
julia> Y00 = vcat(zeros(Int, 5)', hcat(zeros(Int, 4), Y))
5x5 Matrix{Int64}:
0 0 0 0 0
0 1 5 7 10
0 3 11 18 23
0 7 18 29 35
0 9 25 40 48
julia> diff(diff(Y00, dims=1), dims=2)
4x4 Matrix{Int64}:
1 5 7 10
3 11 18 23
7 18 29 35
9 25 40 48
But I think concatenating takes time and memory.
Is there any better idea to reproduce X from Y?
Context
I want to extend the above from matrices X and Y to arrays of any dimension. For example, I want to reconstruct a three-dimensional array X from a given three-dimensional array
Y = cumsum( cumsum( cumsum(X, dims=1), dims=2), dims=3)
When both speed and succinctness are required, it's hard to beat powerful Julia packages like Tullio.jl. Here is a one-liner that's about 4X faster than the fastest solution by @DanGetz.
using Tullio
cumdiff(Y) = @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
Benchmarking with a 100-by-100 matrix gives:
X = rand(0:100,100,100)
Y = cumsum(cumsum(X,dims=1), dims=2)
@btime cumdiff($Y)
@btime decumsum3($Y)
4.957 μs (17 allocations: 464 bytes)
21.300 μs (2 allocations: 78.17 KiB)
Fix: The code above was using the predefined X instead of creating a new one. This is fixed below, and the speedup is more like 3.5X and not 4X.
function cumdiff(Y)
    X = similar(Y)
    X[1] = Y[1]
    for i = 2:size(Y,1) X[i,1] = Y[i,1] - Y[i-1,1] end
    for j = 2:size(Y,2) X[1,j] = Y[1,j] - Y[1,j-1] end
    @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
end
@btime cumdiff($Y)
@btime decumsum3($Y)
6.000 μs (4 allocations: 78.23 KiB)
21.300 μs (2 allocations: 78.17 KiB)
See EDIT section below.
Some options so far:
decumsum1(X) = begin
    Z = copy(X)
    Z[2:end,:] .-= Z[1:end-1,:]
    Z[:,2:end] .-= Z[:,1:end-1]
    return Z
end

decumsum2(X) = begin  # this is from the question
    r, c = size(X)
    # zero-pad: one extra row (length c+1) on top, one extra column (length r) on the left
    Z = vcat(zeros(eltype(X), c+1)',
             hcat(zeros(eltype(X), r), X))
    return diff(diff(Z, dims=1), dims=2)
end

decumsum3(Y) = [Y[I] - (I[2]==1 ? 0 : Y[I[1],I[2]-1]) -
                (I[1]==1 ? 0 : Y[I[1]-1,I[2]]) +
                ((I[1]==1 || I[2]==1) ? 0 : Y[I[1]-1,I[2]-1])
                for I in CartesianIndices(Y)]

function decumsum5(Y)
    R = similar(Y)
    h, w = size(Y)
    R[1,1] = Y[1,1]
    @inbounds for i=2:h R[i,1] = Y[i,1]-Y[i-1,1] ; end
    @inbounds for j=2:w R[1,j] = Y[1,j]-Y[1,j-1] ; end
    @inbounds for i=2:h, j=2:w R[i,j] = Y[i,j]-Y[i-1,j]-Y[i,j-1]+Y[i-1,j-1] ; end
    return R
end
Giving the following benchmarks:
julia> using BenchmarkTools
julia> decumsum1(Y) == decumsum2(Y) == decumsum3(Y) == X
true
julia> @btime decumsum1($Y);
352.571 ns (5 allocations: 832 bytes)
julia> @btime decumsum2($Y);
475.438 ns (9 allocations: 1.14 KiB)
julia> @btime decumsum3($Y);
96.875 ns (1 allocation: 192 bytes)
julia> @btime decumsum5($Y);
60.805 ns (1 allocation: 192 bytes)
EDIT: Perhaps the prettiest solution is:
decumsum(Y; dims) = [Y[I] - (
        I[dims]==1 ? 0 : Y[(ifelse(k == dims, I[k]-1, I[k])
                            for k in 1:ndims(Y))...]
    ) for I in CartesianIndices(Y)]
and with it, the cumsum can be walked back:
julia> decumsum(decumsum(Y, dims=1), dims=2)
4×4 Matrix{Int64}:
1 4 2 3
2 4 5 2
4 3 4 1
2 5 4 2
julia> decumsum(decumsum(Y, dims=1), dims=2) == X
true
julia> @btime decumsum(decumsum($Y, dims=1), dims=2);
165.656 ns (2 allocations: 384 bytes)
with nice performance and also generalized to any Array dimension.
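For instance, it should also walk back the three-dimensional cumsum from the question; a quick sketch (untimed):

X3 = rand(0:9, 4, 4, 4)
Y3 = cumsum(cumsum(cumsum(X3, dims=1), dims=2), dims=3)
decumsum(decumsum(decumsum(Y3, dims=1), dims=2), dims=3) == X3   # expected: true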
Update: another version decumsum5 added. Still faster.

CUDA kernel: performance drops by 10x when increased loop count by 10%

I have a simple CUDA kernel to test loop unrolling, and then I discovered something else: when the loop count is 10, the kernel takes 34 milliseconds; when the loop count is 90, it takes 59 milliseconds; but when the loop count is 100, it takes 423 milliseconds!
Launch configuration is the same, only loop count changed.
So, my question is, what could be the reason for this performance drop?
Here is the code; the input is an array of 128x1024x1024 elements, and I'm using PyCUDA:
__global__ void copy(float *input, float *output) {
    int tidx = blockIdx.y * blockDim.x + threadIdx.x;
    int stride = 1024 * 1024;
    for (int i = 0; i < 128; i++) {
        int idx = i * stride + tidx;
        float x = input[idx];
        float y = 0;
        for (int j = 0; j < 100; j += 10) {
            x = x + sqrt(float(j));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+1));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+2));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+3));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+4));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+5));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+6));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+7));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+8));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+9));
            y = sqrt(abs(x)) + sin(x) + cos(x);
        }
        output[idx] = y;
    }
}
The loop count I mentioned is this line:
for (int j = 0; j < 100; j += 10)
And sample outputs here:
10 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 34.24 miliseconds
90 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 59.33 miliseconds
100 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 422.96 miliseconds
The problem seems to come from loop unrolling.
Indeed, the 10-loop case can be trivially unrolled by NVCC, since the loop body is actually executed only once (the for line can be removed and j set to 0).
The 90-loop case is also unrolled by NVCC (there are only 9 actual iterations). The resulting code is much bigger but still fast, since no branches are executed (GPUs hate branches). However, the 100-loop case is not unrolled by NVCC (you hit a threshold of the compiler's optimizer). The resulting code is small, but it executes more branches at runtime: one per executed loop iteration (a total of 10).
You can see the assembly code difference here.
You can force unrolling with the #pragma unroll directive. However, keep in mind that increasing the size of the code can reduce its performance.
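For illustration, here is a sketch of the same kernel with the hand-unrolling removed and the pragma added (copy_unrolled is a name I made up; whether this beats the hand-unrolled version still depends on the resulting code size and register pressure):

__global__ void copy_unrolled(float *input, float *output) {
    int tidx = blockIdx.y * blockDim.x + threadIdx.x;
    int stride = 1024 * 1024;
    for (int i = 0; i < 128; i++) {
        int idx = i * stride + tidx;
        float x = input[idx];
        float y = 0;
        // plain 100-iteration loop; the pragma asks the compiler to
        // unroll it fully instead of relying on its threshold heuristic
        #pragma unroll
        for (int j = 0; j < 100; j++) {
            x = x + sqrt(float(j));
            y = sqrt(abs(x)) + sin(x) + cos(x);
        }
        output[idx] = y;
    }
}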
PS: the slightly higher number of registers used in the last version may decrease performance, but simulations show that it should be OK in this case.

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab).
In Matlab, computing N^3 is slower than N*N*N. This doesn't happen with N^2 versus N*N. Matlab uses a different algorithm to calculate higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I ran a few tests of this in Matlab, and I translated that code to Julia.
Links to code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x the fastest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x the fastest)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.4.6:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x the fastest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5x the fastest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than Matlab, but I am using a relatively old version and can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
    box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
    box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
    box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm.
Checking LLVM's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
    %result = call double @llvm.powi.f64(double %F, i32 %power)
    ret double %result
}
Now, __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
    const int recip = b < 0;
    double r = 1;
    while (1)
    {
        if (b & 1)
            r *= a;
        b /= 2;
        if (b == 0)
            break;
        a *= a;
    }
    return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications, and the number of multiplications grows with the exponent. That's why N^3 is slower than N^2.
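For reference, the same square-and-multiply loop can be sketched in Julia (my own transcription, not library code):

function powi(a::Float64, b::Integer)
    # binary powering, as in __powidf2 above
    recip = b < 0
    b = abs(b)
    r = one(a)
    while true
        isodd(b) && (r *= a)
        b ÷= 2
        b == 0 && break
        a *= a
    end
    return recip ? inv(r) : r
end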
jl_powi_llvm (called in fastmath.jl; the "jl_" prefix is added by macro expansion), on the other hand, casts the exponent to floating point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
    jl_value_t *ty = jl_typeof(a);
    if (!jl_is_bitstype(ty))
        jl_error("powi_llvm: a is not a bitstype");
    if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
        jl_error("powi_llvm: b is not a 32-bit bitstype");
    jl_value_t *newv = newstruct((jl_datatype_t*)ty);
    void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
    int sz = jl_datatype_size(ty);
    switch (sz) {
    /* choose the right size c-type operation */
    case 4:
        *(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
        break;
    case 8:
        *(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
        break;
    default:
        jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
    }
    return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: yes, there is a way to force usage of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.
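Alternatively, for the specific cube case you can simply spell the multiplications out, which sidesteps the powi path without enabling all of @fastmath; a trivial sketch:

cube(x) = x * x * x   # two multiplications, no pow/powi call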

Is my Theano program actually using the GPU?

Theano claims it's using the GPU; it reports which device it is using when it starts up, etc. Furthermore, nvidia-smi says it's being used.
But the running time seems to be exactly the same regardless of whether or not I use it.
Could it have something to do with integer arithmetic?
import sys
import numpy as np
import theano
import theano.tensor as T

def ariths(v, ub):
    """Given a sorted vector v and scalar ub, returns multiples of elements in v.

    Specifically, returns a vector containing all numbers j * k < ub where j is in
    v and k >= j. Some elements may occur more than once in the output.
    """
    lp = v[0]
    v = T.shape_padright(v)
    a = T.shape_padleft(T.arange(0, (ub + lp - 1) // lp - lp, 1, 'int64'))
    res = v * (a + v)
    return res[(res < ub).nonzero()]

def filter_composites(pv, using_primes):
    a = ariths(using_primes, pv.size)
    return T.set_subtensor(pv[a], 0)

def _iterfn(prev_bnds, pv):
    bstart = prev_bnds[0]
    bend = prev_bnds[1]
    use_primes = pv[bstart:bend].nonzero()[0] + bstart
    pv = filter_composites(pv, use_primes)
    return pv

def primes_to(n):
    if n <= 2:
        return np.asarray([])
    elif n <= 3:
        return np.asarray([2])
    res = T.ones(n, 'int8')
    res = T.set_subtensor(res[:2], 0)
    ubs = [[2, 4]]
    ub = 4
    while ub ** 2 < n:
        prevub = ub
        ub *= 2
        ubs.append([prevub, ub])
    (r, u5) = theano.scan(fn=_iterfn,
                          outputs_info=res, sequences=[np.asarray(ubs)])
    return r[-1].nonzero()[0]

def main(n):
    print(primes_to(n).size.eval())

if __name__ == '__main__':
    main(int(sys.argv[1]))
The answer is yes. And no. If you profile your code in a GPU-enabled Theano installation using nvprof, you will see something like this:
==16540== Profiling application: python ./theano_test.py
==16540== Profiling result:
Time(%) Time Calls Avg Min Max Name
49.22% 12.096us 1 12.096us 12.096us 12.096us kernel_reduce_ccontig_node_c8d7bd33dfef61705c2854dd1f0cb7ce_0(unsigned int, float const *, float*)
30.60% 7.5200us 3 2.5060us 832ns 5.7600us [CUDA memcpy HtoD]
13.93% 3.4240us 1 3.4240us 3.4240us 3.4240us [CUDA memset]
6.25% 1.5350us 1 1.5350us 1.5350us 1.5350us [CUDA memcpy DtoH]
i.e. there is at least a reduce operation being performed on your GPU. However, if you modify your main like this:
def main():
    n = 100000000
    print(primes_to(n).size.eval())

if __name__ == '__main__':
    import cProfile, pstats
    cProfile.run("main()", "{}.profile".format(__file__))
    s = pstats.Stats("{}.profile".format(__file__))
    s.strip_dirs()
    s.sort_stats("time").print_stats(10)
and use cProfile to profile your code, you will see something like this:
Thu Mar 10 14:35:24 2016 ./theano_test.py.profile
486743 function calls (480590 primitive calls) in 17.444 seconds
Ordered by: internal time
List reduced from 1138 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 6.376 6.376 16.655 16.655 {theano.scan_module.scan_perform.perform}
13 6.168 0.474 6.168 0.474 subtensor.py:2084(perform)
27 2.910 0.108 2.910 0.108 {method 'nonzero' of 'numpy.ndarray' objects}
30 0.852 0.028 0.852 0.028 {numpy.core.multiarray.concatenate}
27 0.711 0.026 0.711 0.026 {method 'astype' of 'numpy.ndarray' objects}
13 0.072 0.006 0.072 0.006 {numpy.core.multiarray.arange}
1 0.034 0.034 17.142 17.142 function_module.py:482(__call__)
387 0.020 0.000 0.052 0.000 graph.py:486(stack_search)
77 0.016 0.000 10.731 0.139 op.py:767(rval)
316 0.013 0.000 0.066 0.000 graph.py:715(general_toposort)
The slowest operation (only just) is the scan call, and looking at the source for scan, you can see that GPU execution of scan is currently disabled.
So the answer is: yes, the GPU is being used for something in your code, but no, the most time-consuming operations are being run on the CPU, because GPU execution of scan appears to be hard-disabled in the code at present.
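If you want to check where individual ops ended up, one way is to print the optimized graph of a compiled function and look for Gpu-prefixed nodes; a sketch with a toy function of my own:

import theano
import theano.tensor as T

x = T.vector('x')
f = theano.function([x], (x ** 2).sum())
# ops that run on the GPU print with a "Gpu" prefix, e.g. GpuElemwise
theano.printing.debugprint(f)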

Applying a function to a distance matrix in R

This question came up today on the manipulatr mailing list.
http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f
I am rephrasing.
Given a distance matrix (calculated with dist), apply a function to the rows of the distance matrix.
Code:
library(plyr)
N <- 100
a <- data.frame(b=1:N,c=runif(N))
d <- dist(a,diag=T,upper=T)
sumd <- adply(as.matrix(d),1,sum)
The problem is that to apply a function by row, you have to store the whole matrix (instead of just the lower triangular part), so it uses too much memory for large matrices. It fails on my computer for matrices of dimension ~10000.
Any ideas?
First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.
Here's another version without using ifelse (that's a relatively slow function):
noeq.2 <- function(i, j, N) {
  i <- i-1
  j <- j-1
  x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
  x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
  idx <- i < j
  x[!idx] <- x2[!idx]
  x[i==j] <- 0
  x
}
And timings on my laptop:
> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
51.31 0.10 52.06
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
2.47 0.02 2.67
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
user system elapsed
0.88 0.01 1.12
And lapply is faster than sapply:
> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
user system elapsed
0.67 0.00 0.67
This is a vectorized version of the function noeq (either argument i or j):
noeq.1 <- function(i, j, N) {
  i <- i-1
  j <- j-1
  ifelse(i < j,
         i*(N-1) - ((i-1)*i)/2 + j - i,
         j*(N-1) - ((j-1)*j)/2 + i - j) * ifelse(i == j, 0, 1)
}
> N <- 4
> sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N)))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
> sapply(1:N, function(i) noeq.1(i, 1:N, N))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
Timings are done on a 2.4 GHz Intel Core 2 Duo (Mac OS 10.6.1):
> N <- 1000
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
0.676 0.061 0.738
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
14.359 0.032 14.410
My solution is to compute the indexes into the distance vector, given a row and the size of the matrix. I got this from CodeGuru:
int Trag_noeq(int row, int col, int N)
{
    //assert(row != col); // you can add this in if you like
    if (row < col)
        return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
    else if (col < row)
        return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
    else
        return -1;
}
After translating it to R, assuming indexes start at 1 and assuming a lower-triangular instead of upper-triangular matrix, I got the following.
EDIT: Using the vectorized version contributed by rcs
noeq.1 <- function(i, j, N) {
  i <- i-1
  j <- j-1
  ix <- ifelse(i < j,
               i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
               j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
  ix
}

## To get the indexes of a row, the following one-liner works:
getrow <- function(z, N) noeq.1(z, 1:N, N)

## To get the row sums:
getsum <- function(d, f=sum) {
  N <- attr(d, "Size")
  sapply(1:N, function(i) {
    if (i %% 100 == 0) print(i)
    f(d[getrow(i, N)])
  })
}
So, with the example:
sumd2 <- getsum(d)
Before vectorizing, this was much slower than as.matrix for small matrices; after vectorizing, it is only about 3x as slow. On an Intel Core 2 Duo (2 GHz), applying sum to each row of a matrix of size 10000 took just over 100 s. The as.matrix method fails at that size. Thanks rcs!
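And since getsum takes the summary function as an argument, the same trick works for any row-wise statistic without materializing the full matrix, e.g.:

meand <- getsum(d, f = mean)   # mean distance from each point to the others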
