gettimeofday() C++ Inconsistency - time

I'm doing a project that involves comparing programming languages by computing the Ackermann function. I tested Java, Python, and Ruby, and got run times between 10 and 30 milliseconds, but C++ seems to take 125 milliseconds. Is this normal, or is it a problem with gettimeofday()? gettimeofday() is declared in sys/time.h.
I'm testing on a (virtual) Ubuntu Natty Narwhal 32-bit. I'm not short processing power (Quad-core 2.13 GHz Intel Xeon).
My code is here:
#include <iostream>
#include <sys/time.h>
using namespace std;
int a(int m,int n) {
if (m == 0) {
return n + 1;
} else if (m > 0 and n == 0) {
return a(m-1,1);
} else if (m > 0 and n > 0) {
return a(m-1,a(m,n-1));
}
}
int main() {
timeval tim;
gettimeofday(&tim,NULL);
double t1 = tim.tv_usec;
int v = a(3,4);
gettimeofday(&tim,NULL);
double t2 = tim.tv_usec;
cout << v << endl << t2-t1;
return 0;
}

Assuming you're talking about the resolution of the data returned, the POSIX specification for gettimeofday states:
The resolution of the system clock is unspecified.
This is due to the fact that systems may have a widely varying capacity for tracking small time periods. Even the ISO standard clock() function includes caveats like this.
If you're talking about how long it takes to call it, the standard makes no guarantees about performance along those lines. An implementation is perfectly free to wait 125 minutes before giving you the time although I doubt such an implementation would have much market success :-)
As an example of the limited resolution, I typed in the following code to check it on my system:
#include <stdio.h>
#include <sys/time.h>
#define NUMBER 30
int main (void) {
struct timeval tv[NUMBER];
int count[NUMBER], i, diff;
gettimeofday (&tv[0], NULL);
for (i = 1; i < NUMBER; i++) {
gettimeofday (&tv[i], NULL);
count[i] = 1;
while ((tv[i].tv_sec == tv[i-1].tv_sec) &&
(tv[i].tv_usec == tv[i-1].tv_usec))
{
count[i]++;
gettimeofday (&tv[i], NULL);
}
}
printf ("%2d: secs = %ld, usecs = %6ld\n", 0, (long) tv[0].tv_sec, (long) tv[0].tv_usec);
for (i = 1; i < NUMBER; i++) {
diff = (tv[i].tv_sec - tv[i-1].tv_sec) * 1000000;
diff += tv[i].tv_usec - tv[i-1].tv_usec;
printf ("%2d: secs = %ld, usecs = %6ld, count = %5d, diff = %d\n",
i, (long) tv[i].tv_sec, (long) tv[i].tv_usec, count[i], diff);
}
return 0;
}
The code basically records the changes in the underlying time, keeping a count of how many calls it took to gettimeofday() for the time to actually change. This is on a reasonably powerful machine so it's not short on processing power (the count indicates how often it was able to call gettimeofday() for each time quantum, around the 5,800 mark, ignoring the first since we don't know when in that quantum we started the measurements).
The output was:
0: secs = 1318554836, usecs = 990820
1: secs = 1318554836, usecs = 991820, count = 5129, diff = 1000
2: secs = 1318554836, usecs = 992820, count = 5807, diff = 1000
3: secs = 1318554836, usecs = 993820, count = 5901, diff = 1000
4: secs = 1318554836, usecs = 994820, count = 5916, diff = 1000
5: secs = 1318554836, usecs = 995820, count = 5925, diff = 1000
6: secs = 1318554836, usecs = 996820, count = 5814, diff = 1000
7: secs = 1318554836, usecs = 997820, count = 5814, diff = 1000
8: secs = 1318554836, usecs = 998820, count = 5819, diff = 1000
9: secs = 1318554836, usecs = 999820, count = 5901, diff = 1000
10: secs = 1318554837, usecs = 820, count = 5815, diff = 1000
11: secs = 1318554837, usecs = 1820, count = 5866, diff = 1000
12: secs = 1318554837, usecs = 2820, count = 5849, diff = 1000
13: secs = 1318554837, usecs = 3820, count = 5857, diff = 1000
14: secs = 1318554837, usecs = 4820, count = 5867, diff = 1000
15: secs = 1318554837, usecs = 5820, count = 5852, diff = 1000
16: secs = 1318554837, usecs = 6820, count = 5865, diff = 1000
17: secs = 1318554837, usecs = 7820, count = 5867, diff = 1000
18: secs = 1318554837, usecs = 8820, count = 5885, diff = 1000
19: secs = 1318554837, usecs = 9820, count = 5864, diff = 1000
20: secs = 1318554837, usecs = 10820, count = 5918, diff = 1000
21: secs = 1318554837, usecs = 11820, count = 5869, diff = 1000
22: secs = 1318554837, usecs = 12820, count = 5866, diff = 1000
23: secs = 1318554837, usecs = 13820, count = 5875, diff = 1000
24: secs = 1318554837, usecs = 14820, count = 5925, diff = 1000
25: secs = 1318554837, usecs = 15820, count = 5870, diff = 1000
26: secs = 1318554837, usecs = 16820, count = 5877, diff = 1000
27: secs = 1318554837, usecs = 17820, count = 5868, diff = 1000
28: secs = 1318554837, usecs = 18820, count = 5874, diff = 1000
29: secs = 1318554837, usecs = 19820, count = 5862, diff = 1000
showing that the resolution seems to be limited to no better than one thousand microseconds. Of course, your system may differ; the bottom line is that it depends on your implementation and/or environment.
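The same kind of probe can be sketched in Python. This is only an illustration, not part of the measurements above: the helper name observed_ticks is made up for this example, and time.time() merely stands in for gettimeofday(). It spins until the clock value changes and records how large each change was.

```python
import time

def observed_ticks(samples=10, clock=time.time):
    """Return the sizes (in seconds) of `samples` consecutive clock changes."""
    steps = []
    last = clock()
    while len(steps) < samples:
        now = clock()
        if now != last:          # the clock value finally changed
            steps.append(now - last)
            last = now
    return steps

if __name__ == "__main__":
    steps = observed_ticks()
    print("smallest observed step: %.9f s" % min(steps))
    # What the Python runtime itself reports for this clock.
    print("reported resolution:    %.9f s" % time.get_clock_info("time").resolution)
```

If the smallest observed step is close to the run time you are trying to measure, a single before/after reading cannot be trusted.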
One way to get around this type of limitation is to not do something once but to do it N times and then divide the elapsed time by N.
For example, let's say you call your function and the timer says it took 125 milliseconds, something that you suspect seems a little high. I would suggest then calling it a thousand times in a loop, measuring the time it took for the entire thousand.
If that turns out to be 125 seconds then, yes, it's probably slow. However, if it takes only 27 seconds, that would indicate your timer resolution is what's causing the seemingly large times, since that would equate to 27 milliseconds per iteration, on par with what you're seeing from the other results.
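Modifying your code to take this into account (and to include tv_sec, so that the difference stays correct when the microseconds field wraps around at a second boundary) would be along the lines of: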
int main() {
const int count = 1000;
timeval tim;
gettimeofday(&tim, NULL);
double t1 = 1.0e6 * tim.tv_sec + tim.tv_usec;
int v;
for (int i = 0; i < count; ++i)
v = a(3, 4);
gettimeofday(&tim, NULL);
double t2 = 1.0e6 * tim.tv_sec + tim.tv_usec;
cout << v << '\n' << ((t2 - t1) / count) << '\n';
return 0;
}
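For comparison, here is a minimal Python sketch of the same repeat-and-divide approach. The helper average_seconds is a name invented for this example, and time.perf_counter() is used as the timer rather than gettimeofday(); treat it as an illustration of the idea, not as the benchmark used in the question.

```python
import time

def ackermann(m, n):
    """Plain recursive Ackermann function, as in the question."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

def average_seconds(func, repeats, *args):
    """Call func(*args) `repeats` times; return (last result, mean seconds per call)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed / repeats

if __name__ == "__main__":
    value, per_call = average_seconds(ackermann, 20, 3, 4)
    print(value)
    print("%.6f seconds per call" % per_call)
```

Averaging over many calls makes the timer's granularity matter far less, because the total elapsed time spans many clock ticks.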

Related

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab)
In MATLAB, computing N^3 is slower than computing NxNxN. This doesn't happen with N^2 and NxN. MATLAB uses a different algorithm for higher-order exponents because it favors accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I ran a few tests of this in MATLAB, and I have translated that code to Julia.
Links to code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x the smallest)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5x the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than MATLAB, but I am using a relatively old version. I can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
%result = call double @llvm.powi.f64(double %F, i32 %power)
ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications. That's why N^3 is slower than N^2.
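To make the multiplication count concrete, here is a rough Python port of the loop in __powidf2. It is a sketch for illustration only: the name powi and the returned multiplication counter are additions for this example, and taking abs(b) up front is a simplification (the C code keeps working on the signed value).

```python
def powi(a, b):
    """Integer power by repeated squaring; returns (value, multiplications)."""
    recip = b < 0
    b = abs(b)            # simplification: the C code works on the signed value
    r = 1.0
    mults = 0
    while True:
        if b & 1:         # low bit set: fold the current square into the result
            r *= a
            mults += 1
        b //= 2
        if b == 0:
            break
        a *= a            # square the base for the next bit
        mults += 1
    return (1 / r if recip else r), mults

if __name__ == "__main__":
    for exponent in (2, 3, 7, 8):
        print(exponent, powi(2.0, exponent))
```

With this loop an exponent of 2 costs two multiplications and an exponent of 3 costs three, which matches the worked examples above.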
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: Yes, there is a way to force usage of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.

octave is slow; suggestions

Have run the following code in both Octave 4.0.0 and MATLAB 2014. Time difference is silly, i.e. more than two orders of magnitude. Running on Windows laptop. What can be done to improve Octave computational speed?
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
while itSum < stopCrit
itSum = itSum + 1/iter;
iter = iter + 1;
if iter > 1e7, break, end
end
iter-1
totTime = cputime - startTime
Octave: totTime ~ 112
MATLAB: totTime < 0.4
It takes a lot of iterations in the loop to compute the result in your code. Vectorizing the code will speed it up a lot. The following code does exactly what yours did, but vectorizes much of the computation. See if it helps.
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
step=1000;
while(itSum < stopCrit && iter <= 1e7)
itSum=itSum+sum(1./(iter:iter+step));
iter = iter + step+ 1;
end
iter=iter-step-1;
itSum=sum(1./(1:iter));
for i=(iter+1):(iter+step)
itSum=itSum+1/i;
if(itSum+1/i>stopCrit)
iter=i-1;
break;
end
end
totTime = cputime - startTime
My runtime is only about 0.6 seconds using the above code. If you do not care about exactly when the loop stops, the following code is even faster:
startTime = cputime;
iter = 1; % iter is the current iteration of the loop
itSum = 0; % itSum is the sum of the iterations
stopCrit = sqrt(275); % stopCrit is the stopping criteria for the while loop
step=1000;
while(itSum < stopCrit && iter <= 1e7)
itSum=itSum+sum(1./(iter:iter+step));
iter = iter + step+ 1;
end
iter=iter-step-1;
totTime = cputime - startTime
My runtime is only about 0.35 seconds in the latter case.
You can also try:
itSum = sum(1./(1:exp(stopCrit)));
%start the iteration
iter = exp(stopCrit-((stopCrit-itSum)/abs(stopCrit-itSum))*(stopCrit-itSum));
itSum = sum(1./(1:iter))
With this method you will only need 1 or 2 iterations, but of course you sum the whole array each time.
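The exp(stopCrit) estimate works because the partial sums of 1/k grow like a logarithm. As a hedged sketch of that idea (not something taken from the answers above), the Python below uses the standard asymptotic expansion H_n ≈ ln n + γ + 1/(2n) − 1/(12n²) + 1/(120n⁴) and a bisection to find the first n whose sum reaches the stopping value, without looping over millions of terms. The function names are invented for this example, and the result may differ slightly from a floating-point loop that accumulates rounding error.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_approx(n):
    """Asymptotic approximation of H_n = 1 + 1/2 + ... + 1/n."""
    return (math.log(n) + EULER_GAMMA + 1 / (2 * n)
            - 1 / (12 * n ** 2) + 1 / (120 * n ** 4))

def first_n_reaching(stop):
    """Smallest n whose approximate harmonic sum is at least `stop`."""
    if stop <= 1:
        return 1                        # H_1 = 1 already reaches it
    lo, hi = 1, 2
    while harmonic_approx(hi) < stop:   # grow the bracket
        lo, hi = hi, hi * 2
    while hi - lo > 1:                  # bisect: approx(lo) < stop <= approx(hi)
        mid = (lo + hi) // 2
        if harmonic_approx(mid) < stop:
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    print(first_n_reaching(math.sqrt(275)))
```

The bisection needs only a few dozen evaluations of the formula, so the cost no longer depends on how large the stopping value is.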

Count number of 1 digits in 11 to the power of N

I came across an interesting problem:
How would you count the number of 1 digits in the representation of 11 to the power of N, 0<N<=1000.
Let d be the number of 1 digits
N=2 11^2 = 121 d=2
N=3 11^3 = 1331 d=2
Worst time complexity expected O(N^2)
The simple approach, where you compute the number and count the 1 digits by getting the last digit and dividing by 10, does not work very well. 11^1000 is not even representable in any standard data type.
Powers of eleven can be stored as a string and calculated quite quickly that way, without a generalised arbitrary precision math package. All you need is multiply by ten and add.
For example, 11^1 is 11. To get the next power of 11 (11^2), you multiply by (10 + 1), which is effectively the number with a zero tacked on the end, added to the number: 110 + 11 = 121.
Similarly, 11^3 can then be calculated as: 1210 + 121 = 1331.
And so on:
11^2   11^3    11^4     11^5      11^6
 110   1210   13310   146410   1610510
 +11   +121   +1331   +14641   +161051
 ---   ----   -----   ------   -------
 121   1331   14641   161051   1771561
So that's how I'd approach, at least initially.
By way of example, here's a Python function to raise 11 to the n'th power, using the method described (I am aware that Python has support for arbitrary precision; keep in mind I'm just using it as a demonstration of how to do this as an algorithm, which is how the question was tagged):
def elevenToPowerOf(n):
    # Anything to the zero is 1.
    if n == 0: return "1"
    # Otherwise, n <- n * 10 + n, once for each level of power.
    num = "11"
    while n > 1:
        n = n - 1
        # Make multiply by eleven easy.
        ten = num + "0"
        num = "0" + num
        # Standard primary school algorithm for adding.
        newnum = ""
        carry = 0
        for dgt in range(len(ten)-1,-1,-1):
            res = int(ten[dgt]) + int(num[dgt]) + carry
            carry = res // 10
            res = res % 10
            newnum = str(res) + newnum
        if carry == 1:
            newnum = "1" + newnum
        # Prepare for next multiplication.
        num = newnum
    # There you go, 11^n as a string.
    return num
And, for testing, a little program which works out those values for each power that you provide on the command line:
import sys
for idx in range(1,len(sys.argv)):
    try:
        power = int(sys.argv[idx])
    except ValueError:
        print("Invalid number [%s]" % (sys.argv[idx]))
        sys.exit(1)
    if power < 0:
        print("Negative powers not allowed [%d]" % (power))
        sys.exit(1)
    number = elevenToPowerOf(power)
    count = 0
    for ch in number:
        if ch == '1':
            count += 1
    print("11^%d is %s, has %d ones" % (power,number,count))
When you run that with:
time python3 prog.py 0 1 2 3 4 5 6 7 8 9 10 11 12 1000
you can see that it's both accurate (checked with bc) and fast (finished in about half a second):
11^0 is 1, has 1 ones
11^1 is 11, has 2 ones
11^2 is 121, has 2 ones
11^3 is 1331, has 2 ones
11^4 is 14641, has 2 ones
11^5 is 161051, has 3 ones
11^6 is 1771561, has 3 ones
11^7 is 19487171, has 3 ones
11^8 is 214358881, has 2 ones
11^9 is 2357947691, has 1 ones
11^10 is 25937424601, has 1 ones
11^11 is 285311670611, has 4 ones
11^12 is 3138428376721, has 2 ones
11^1000 is 2469932918005826334124088385085221477709733385238396234869182951830739390375433175367866116456946191973803561189036523363533798726571008961243792655536655282201820357872673322901148243453211756020067624545609411212063417307681204817377763465511222635167942816318177424600927358163388910854695041070577642045540560963004207926938348086979035423732739933235077042750354729095729602516751896320598857608367865475244863114521391548985943858154775884418927768284663678512441565517194156946312753546771163991252528017732162399536497445066348868438762510366191040118080751580689254476068034620047646422315123643119627205531371694188794408120267120500325775293645416335230014278578281272863450085145349124727476223298887655183167465713337723258182649072572861625150703747030550736347589416285606367521524529665763903537989935510874657420361426804068643262800901916285076966174176854351055183740078763891951775452021781225066361670593917001215032839838911476044840388663443684517735022039957481918726697789827894303408292584258328090724141496484460001, has 105 ones
real 0m0.609s
user 0m0.592s
sys 0m0.012s
That may not necessarily be O(n^2) but it should be fast enough for your domain constraints.
Of course, given those constraints, you can make it O(1) by using a method I call pre-generation. Simply write a program that generates a lookup table, wrapped in a suitable function, which you can plug into your own program. The following Python program does exactly that, for the powers of eleven from 1 to 100 inclusive:
def mulBy11(num):
    # Same length to ease addition.
    ten = num + '0'
    num = '0' + num
    # Standard primary school algorithm for adding.
    result = ''
    carry = 0
    for idx in range(len(ten)-1, -1, -1):
        digit = int(ten[idx]) + int(num[idx]) + carry
        carry = digit // 10
        digit = digit % 10
        result = str(digit) + result
    if carry == 1:
        result = '1' + result
    return result
num = '1'
print('int oneCountInPowerOf11(int n) {')
print('    static int numOnes[] = {-1', end='')
for power in range(1,101):
    num = mulBy11(num)
    count = sum(1 for ch in num if ch == '1')
    print(',%d' % count, end='')
print('};')
# Use >= so that an index one past the end of the table is rejected too.
print('    if ((n < 0) || (n >= sizeof(numOnes) / sizeof(*numOnes)))')
print('        return -1;')
print('    return numOnes[n];')
print('}')
The code output by this script is:
int oneCountInPowerOf11(int n) {
    static int numOnes[] = {-1,2,2,2,2,3,3,3,2,1,1,4,2,3,1,4,2,1,4,4,1,5,5,1,5,3,6,6,3,6,3,7,5,7,4,4,2,3,4,4,3,8,4,8,5,5,7,7,7,6,6,9,9,7,12,10,8,6,11,7,6,5,5,7,10,2,8,4,6,8,5,9,13,14,8,10,8,7,11,10,9,8,7,13,8,9,6,8,5,8,7,15,12,9,10,10,12,13,7,11,12};
    if ((n < 0) || (n >= sizeof(numOnes) / sizeof(*numOnes)))
        return -1;
    return numOnes[n];
}
which should be blindingly fast when plugged into a C program. On my system, the Python code itself (when you up the range to 1..1000) runs in about 0.6 seconds and the C code, when compiled, finds the number of ones in 11^1000 in 0.07 seconds.
Here's my concise solution.
def count1s(N):
    # When 11^(N-1) = result, 11^(N) = (10+1) * result = 10*result + result
    result = 1
    for i in range(N):
        result += 10*result
    # Now count 1's
    count = 0
    for ch in str(result):
        if ch == '1':
            count += 1
    return count
In C#:
private static void Main(string[] args)
{
var res = Elevento(1000);
var countOf1 = res.Select(x => int.Parse(x.ToString())).Count(s => s == 1);
Console.WriteLine(countOf1);
}
private static string Elevento(int n)
{
if (n == 0) return "1";
//Otherwise, n <- n * 10 + n, once for each level of power.
var num = "11";
while (n > 1)
{
n--;
// Make multiply by eleven easy.
var ten = num + "0";
num = "0" + num;
//Standard primary school algorithm for adding.
var newnum = "";
var carry = 0;
foreach (var dgt in Enumerable.Range(0, ten.Length).Reverse())
{
var res = int.Parse(ten[dgt].ToString()) + int.Parse(num[dgt].ToString()) + carry;
carry = res/10;
res = res%10;
newnum = res + newnum;
}
if (carry == 1)
newnum = "1" + newnum;
// Prepare for next multiplication.
num = newnum;
}
//There you go, 11^n as a string.
return num;
}

arrayfire evaluation of equations running really slowly

I have been working on a project to simulate biologically inspired neural networks using arrayfire. I got to the point of doing some timing tests and was disappointed with the results I was getting. I decided to try and go with one of the fastest, dirt-simple models for a timing test case, the Izhikevich model. When I ran the new test with that model the results were worse. The code I am using is below. It is not doing anything fancy. It is just standard matrix algebra. However, it takes over 5 seconds to do a single evaluation of the equation for just 10 neurons! Every step after that takes roughly the same amount of time as well.
Code:
unsigned int neuron_count = 10;
array a = af::constant(0.02, neuron_count);
array b = af::constant(0.2, neuron_count);
array c = af::constant(-65.0, neuron_count);
array d = af::constant(6, neuron_count);
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
v = v + tau*(0.04*pow(v, 2) + 5 * v + 140 - u + i);
//af_print(v);
u = u + tau*a*(b*v - u);
//Leaving off spike threshold checks for now
}
void TestIzhikevich()
{
StepIzhikevich();
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
}
Here are the timing results for different numbers of neurons.
results:
neurons seconds
10 5.18275
100 5.27969
1000 5.20637
10000 4.86609
Increasing the number of neurons does not appear to have a huge effect. The time goes down a little. Am I doing something wrong here? Is there a better way to optimize things with arrayfire to get better results?
When I switched the v equation to use v*v instead of pow(v, 2), the time required for a step went down to 3.75762 seconds. That is still extremely slow though, so something odd is happening.
[EDIT]
I tried to split the processing up into pieces and found something new. Here is the code I am using now.
Code:
unsigned int neuron_count = 10;
array a = af::constant(0.02, neuron_count);
array b = af::constant(0.2, neuron_count);
array c = af::constant(-65.0, neuron_count);
array d = af::constant(6, neuron_count);
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
array g = af::constant(0.0, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
array j = tau*(0.04*pow(v, 2));
//af_print(j);
array k = 5 * v + 140 - u + i;
//af_print(k);
array l = v + j + k;
//af_print(l);
v = l; //If this line is here time is long on second loop
//g = l; //If this is here then time is short.
//u = u + tau*a*(b*v - u);
//Leaving off spike threshold checks for now
}
void TestIzhikevich()
{
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
timer::start();
StepIzhikevich();
printf("elapsed seconds: %g\n", timer::stop());
}
When I run it without reassigning back to v, or assigning it to a new variable g, then the time for the step on both the first and second run are small
results:
elapsed seconds: 0.0036143
elapsed seconds: 0.00340621
However, when I put v = l; back in, then the first time it runs it is fast, but from then on it is slow.
results:
elapsed seconds: 0.0034497
elapsed seconds: 2.98624
Any ideas on what is causing this?
[EDIT 2]
I still do not know why it is doing this, but I have found a workaround by copying the v array before using it again.
Code:
unsigned int neuron_count = 100000;
array v = af::constant(-70.0, neuron_count);
array u = af::constant(-20.0, neuron_count);
array i = af::constant(14, neuron_count);
double tau = 0.2;
void StepIzhikevich()
{
//array vp = v;
array vp = v.copy();
//af_print(vp);
array j = tau*(0.04*pow(vp, 2));
//af_print(j);
array k = 5 * vp + 140 - u + i;
//af_print(k);
array l = vp + j + k;
//af_print(l);
v = l; //If this line is here time is long on second loop
}
void TestIzhikevich()
{
for (int i = 0; i < 10; i++)
{
timer::start();
StepIzhikevich();
printf("loop: %d ", i);
printf("elapsed seconds: %g\n", timer::stop());
timer::start();
}
}
Here are the results now. The second time it runs it is a bit slow, but now it is fast after that. Huge improvement over before.
Results:
loop: 0 elapsed seconds: 0.657355
loop: 1 elapsed seconds: 0.981287
loop: 2 elapsed seconds: 0.000416182
loop: 3 elapsed seconds: 0.000415045
loop: 4 elapsed seconds: 0.000421014
loop: 5 elapsed seconds: 0.000413339
loop: 6 elapsed seconds: 0.00041675
loop: 7 elapsed seconds: 0.000412202
loop: 8 elapsed seconds: 0.000473321
loop: 9 elapsed seconds: 0.000677432

matlab matrix operation speed

I've been asked to make some MATLAB code run faster, and have run into something that seems strange to me.
In one of the functions there's a loop where we multiply a 3x1 vector (let's call it x) - a 3x3 matrix (let's call it A) - and the transpose of x, yielding a scalar. The code has the whole set of element-by-element multiplications and additions, and is pretty cumbersome:
val = x(1)*A(1,1)*x(1) + x(1)*A(1,2)*x(2) + x(1)*A(1,3)*x(3) + ...
x(2)*A(2,1)*x(1) + x(2)*A(2,2)*x(2) + x(2)*A(2,3)*x(3) + ...
x(3)*A(3,1)*x(1) + x(3)*A(3,2)*x(2) + x(3)*A(3,3)*x(3);
I figured I'd just replace it all by:
val = x*A*x';
To my surprise, it ran significantly slower (as in 4-5 times slower). Is it just that the vector and matrix are so small that MATLAB's optimizations don't apply?
EDIT: I improved the tests to give more accurate times. I also optimized the unrolled version which is now much better than what I initially had, still matrix multiplication is way faster as you increase the size.
EDIT2: To make sure that the JIT compiler is working on the unrolled functions, I modified the code to write the generated functions as M-files. Also the comparison can now be seen as fair as both methods get evaluated by passing TIMEIT the function handle: timeit(@myfunc)
I am not convinced that your approach is faster than matrix multiplication for reasonable sizes. So let's compare the two methods.
I am using the Symbolic Math Toolbox to help me get the "unrolled" form of the equation of x'*A*x (try multiplying by hand a 20x20 matrix and a 20x1 vector!):
function f = buildUnrolledFunction(N)
% avoid regenerating files, CCODE below can be really slow!
fname = sprintf('f%d',N);
if exist([fname '.m'], 'file')
f = str2func(fname);
return
end
% construct symbolic vector/matrix of the specified size
x = sym('x', [N 1]);
A = sym('A', [N N]);
% work out the expanded form of the matrix-multiplication
% and convert it to a string
s = ccode(expand(x.'*A*x)); % instead of char(.) to avoid x^2
% a bit of RegExp to fix the notation of the variable names
% also convert indexing into linear indices: A(3,3) into A(9)
s = regexprep(regexprep(s, '^.*=\s+', ''), ';$', '');
s = regexprep(regexprep(s, 'x(\d+)', 'x($1)'), 'A(\d+)_(\d+)', ...
'A(${ int2str(sub2ind([N N],str2num($1),str2num($2))) })');
% build an M-function from the string, and write it to file
fid = fopen([fname '.m'], 'wt');
fprintf(fid, 'function v = %s(A,x)\nv = %s;\nend\n', fname, s);
fclose(fid);
% rehash path and return a function handle
rehash
clear(fname)
f = str2func(fname);
end
I tried to optimize the generated function by avoiding exponentiation (we prefer x*x to x^2). I also converted the subscripts into linear indices (A(9) instead of A(3,3)). Therefore for n=3 we get the same equation you had:
>> s
s =
A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) +
A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) +
A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3)
Given the above method to construct M-functions, we now evaluate it for various sizes and compare it against the matrix-multiplication form (I put it in a separate function to account for function calling overhead). I am using the TIMEIT function instead of tic/toc to get more accurate timings. Also to have a fair comparison, each method is implemented as an M-file function that gets passed all of the needed variables as input arguments.
function results = testMatrixMultVsUnrolled()
% vector/matrix size
N_vec = 2:50;
results = zeros(numel(N_vec),3);
for ii = 1:numel(N_vec);
% some random data
N = N_vec(ii);
x = rand(N,1); A = rand(N,N);
% matrix multiplication
f = @matMult;
results(ii,1) = timeit(@() feval(f, A,x));
% unrolled equation
f = buildUnrolledFunction(N);
results(ii,2) = timeit(@() feval(f, A,x));
% check result
results(ii,3) = norm(matMult(A,x) - f(A,x));
end
% display results
fprintf('N = %2d: mtimes = %.6f ms, unroll = %.6f ms [error = %g]\n', ...
[N_vec(:) results(:,1:2)*1e3 results(:,3)]')
plot(N_vec, results(:,1:2)*1e3, 'LineWidth',2)
xlabel('size (N)'), ylabel('timing [msec]'), grid on
legend({'mtimes','unrolled'})
title('Matrix multiplication: $$x^\mathsf{T}Ax$$', ...
'Interpreter','latex', 'FontSize',14)
end
function v = matMult(A,x)
v = x.' * A * x;
end
The results:
N = 2: mtimes = 0.008816 ms, unroll = 0.006793 ms [error = 0]
N = 3: mtimes = 0.008957 ms, unroll = 0.007554 ms [error = 0]
N = 4: mtimes = 0.009025 ms, unroll = 0.008261 ms [error = 4.44089e-16]
N = 5: mtimes = 0.009075 ms, unroll = 0.008658 ms [error = 0]
N = 6: mtimes = 0.009003 ms, unroll = 0.008689 ms [error = 8.88178e-16]
N = 7: mtimes = 0.009234 ms, unroll = 0.009087 ms [error = 1.77636e-15]
N = 8: mtimes = 0.008575 ms, unroll = 0.009744 ms [error = 8.88178e-16]
N = 9: mtimes = 0.008601 ms, unroll = 0.011948 ms [error = 0]
N = 10: mtimes = 0.009077 ms, unroll = 0.014052 ms [error = 0]
N = 11: mtimes = 0.009339 ms, unroll = 0.015358 ms [error = 3.55271e-15]
N = 12: mtimes = 0.009271 ms, unroll = 0.018494 ms [error = 3.55271e-15]
N = 13: mtimes = 0.009166 ms, unroll = 0.020238 ms [error = 0]
N = 14: mtimes = 0.009204 ms, unroll = 0.023326 ms [error = 7.10543e-15]
N = 15: mtimes = 0.009396 ms, unroll = 0.024767 ms [error = 3.55271e-15]
N = 16: mtimes = 0.009193 ms, unroll = 0.027294 ms [error = 2.4869e-14]
N = 17: mtimes = 0.009182 ms, unroll = 0.029698 ms [error = 2.13163e-14]
N = 18: mtimes = 0.009330 ms, unroll = 0.033295 ms [error = 7.10543e-15]
N = 19: mtimes = 0.009411 ms, unroll = 0.152308 ms [error = 7.10543e-15]
N = 20: mtimes = 0.009366 ms, unroll = 0.167336 ms [error = 7.10543e-15]
N = 21: mtimes = 0.009335 ms, unroll = 0.183371 ms [error = 0]
N = 22: mtimes = 0.009349 ms, unroll = 0.200859 ms [error = 7.10543e-14]
N = 23: mtimes = 0.009411 ms, unroll = 0.218477 ms [error = 8.52651e-14]
N = 24: mtimes = 0.009307 ms, unroll = 0.235668 ms [error = 4.26326e-14]
N = 25: mtimes = 0.009425 ms, unroll = 0.256491 ms [error = 1.13687e-13]
N = 26: mtimes = 0.009392 ms, unroll = 0.274879 ms [error = 7.10543e-15]
N = 27: mtimes = 0.009515 ms, unroll = 0.296795 ms [error = 2.84217e-14]
N = 28: mtimes = 0.009567 ms, unroll = 0.319032 ms [error = 5.68434e-14]
N = 29: mtimes = 0.009548 ms, unroll = 0.339517 ms [error = 3.12639e-13]
N = 30: mtimes = 0.009617 ms, unroll = 0.361897 ms [error = 1.7053e-13]
N = 31: mtimes = 0.009672 ms, unroll = 0.387270 ms [error = 0]
N = 32: mtimes = 0.009629 ms, unroll = 0.410932 ms [error = 1.42109e-13]
N = 33: mtimes = 0.009605 ms, unroll = 0.434452 ms [error = 1.42109e-13]
N = 34: mtimes = 0.009534 ms, unroll = 0.462961 ms [error = 0]
N = 35: mtimes = 0.009696 ms, unroll = 0.489474 ms [error = 5.68434e-14]
N = 36: mtimes = 0.009691 ms, unroll = 0.512198 ms [error = 8.52651e-14]
N = 37: mtimes = 0.009671 ms, unroll = 0.544485 ms [error = 5.68434e-14]
N = 38: mtimes = 0.009710 ms, unroll = 0.573564 ms [error = 8.52651e-14]
N = 39: mtimes = 0.009946 ms, unroll = 0.604567 ms [error = 3.41061e-13]
N = 40: mtimes = 0.009735 ms, unroll = 0.636640 ms [error = 3.12639e-13]
N = 41: mtimes = 0.009858 ms, unroll = 0.665719 ms [error = 5.40012e-13]
N = 42: mtimes = 0.009876 ms, unroll = 0.697364 ms [error = 0]
N = 43: mtimes = 0.009956 ms, unroll = 0.730506 ms [error = 2.55795e-13]
N = 44: mtimes = 0.009897 ms, unroll = 0.765358 ms [error = 4.26326e-13]
N = 45: mtimes = 0.009991 ms, unroll = 0.800424 ms [error = 0]
N = 46: mtimes = 0.009956 ms, unroll = 0.829717 ms [error = 2.27374e-13]
N = 47: mtimes = 0.010210 ms, unroll = 0.865424 ms [error = 2.84217e-13]
N = 48: mtimes = 0.010022 ms, unroll = 0.907974 ms [error = 3.97904e-13]
N = 49: mtimes = 0.010098 ms, unroll = 0.944536 ms [error = 5.68434e-13]
N = 50: mtimes = 0.010153 ms, unroll = 0.984486 ms [error = 4.54747e-13]
At small sizes the two methods perform similarly: for N<8 the expanded version beats mtimes, but the difference is hardly significant. Once we move past tiny sizes, matrix multiplication is orders of magnitude faster.
This is not really surprising; with only N=20 the formula is scary long and involves adding 400 terms. As the MATLAB language is interpreted, I doubt this is very efficient.
Now I agree that there is an overhead for calling an external function vs. directly embedding the code in-line, but how practical is such an approach? Even for a size as small as N=20, the generated line is over 7000 characters! I also noticed the MATLAB editor becoming sluggish on account of the long lines :)
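To see how quickly the written-out form grows, here is a small Python sketch that builds the fully expanded expression as text and checks it against a plain double loop. It is only an illustration: the function names are invented for this example, it uses zero-based A[i][j] indexing rather than MATLAB's linear indices, and unlike the Symbolic Math Toolbox output it does not group the squared terms.

```python
def unrolled_expression(n):
    """Build the fully expanded x'Ax as source text, one term per (i, j) pair."""
    terms = ["A[%d][%d]*x[%d]*x[%d]" % (i, j, i, j)
             for i in range(n) for j in range(n)]
    return " + ".join(terms)

def quadratic_form(A, x):
    """Reference value of x'Ax computed with plain loops."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

if __name__ == "__main__":
    expr = unrolled_expression(20)
    print(expr.count(" + ") + 1, "terms,", len(expr), "characters")
```

The number of terms is N*N, so the text of the expression grows quadratically while the matrix form stays a single short line.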
Besides, the advantage quickly disappears after around N>10. I compared the embedded-code/explicitly-written vs. matrix-multiplication, similar to what @DennisJaheruddin suggested. The results:
N=3:
Elapsed time is 0.062295 seconds. % unroll
Elapsed time is 1.117962 seconds. % mtimes
N=12:
Elapsed time is 1.024837 seconds. % unroll
Elapsed time is 1.126147 seconds. % mtimes
N=19:
Elapsed time is 140.915138 seconds. % unroll
Elapsed time is 1.305382 seconds. % mtimes
... and it only gets worse for the unrolled version. Like I said before, MATLAB is interpreted so the cost of parsing the code starts to show at such huge files.
The way I see it, after doing a million iterations we only gained 1 second at best, which I think does not justify all the trouble and hacks, over using the much more readable and succinct v=x'*A*x. So perhaps there are other places in the code one can improve, rather than focusing on an already optimized operation such as matrix multiplication.
Matrix multiplication in MATLAB is seriously fast (this is what MATLAB is best at!). It really shines once you reach large enough data (as multithreading kicks in):
>> N=5000; x=rand(N,1); A=rand(N,N);
>> tic, for i=1e4, v=x.'*A*x; end, toc
Elapsed time is 0.021959 seconds.
@Amro has given an extensive answer, and I agree that in general you should not bother writing out the explicit calculations and simply use matrix multiplication everywhere in your code.
However, if your matrix is small enough, and you really need to calculate something a few billion times, the written out form can be significantly faster (less overhead). The trick however, is not to put your code in a separate function, as the call overhead will be much larger than the calculation time.
Here is a small example:
x = 1:3;
A = rand(3);
v=0;
unroll = @(x) A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);
regular = @(x) x*A*x';
%Written out, no function call
tic
for t = 1:1e6
v = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);
end
t1=toc;
%Matrix form, no function call
tic
for t = 1:1e6
v = x*A*x';
end
t2=toc;
%Written out, function call
tic
for t = 1:1e6
v = unroll(x);
end
t3=toc;
%Matrix form, function call
tic
for t = 1:1e6
v = regular(x);
end
t4=toc;
[t1;t2;t3;t4]
Which will give these results:
0.0767
1.6988
6.1975
7.9353
So if you call it via an (anonymous) function, the written-out form is not worth using; however, if you really want the best speed, using the written-out form directly can give you a big speedup for tiny matrices.
