Use of pthread increases execution time, suggestions for improvements

I had a piece of code, which looked like this,
r = rand() % node_info[crawler[k]].num_of_nodes;
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
I changed it so that the load can be split among multiple threads. Now it looks like this,
pthread_mutex_lock( &mutex1 );
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
r = rand() % node_info[crawler[k]].num_of_nodes;
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
pthread_mutex_unlock( &mutex1 );
I need the mutexes to protect shared variables. It turns out that my parallel code is slower. Why? Is it because of the mutexes?
Could this have something to do with the cache line size?

You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to be executed sequentially. And since lock/unlock are (potentially) expensive operations, the code is getting slower.
To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.
You could use atomic increment instructions (platform dependent) instead of a plain '++' under a mutex; atomics are generally cheaper than mutexes. But beware of doing this often on data in a single cache line from different threads in parallel (see 'false sharing').
AFAICS, you could rewrite the algorithm as indicated below without needing mutexes or atomic increments at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.
kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
kend = getFirstK(NumOfNodes, NumOfThreads, t+1);
// start the following in a separate thread with kbegin and kend
// copied to thread local vars kbegin_ and kend_
int k, i, r;
unsigned state = kend_; // really bad seed
for (k = kbegin_; k < kend_; k++) { // each thread walks only its own range of k
    r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
    crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
}
// wait for threads/jobs to complete
Generating random numbers this way (one rand_r() state per thread, seeded from kend_) may lead to poor random distributions; see this question for details.
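The partitioning idea above can be sketched in C++ with std::thread. This is only an illustration under assumptions: first_k, step_all and the neighbours adjacency list are hypothetical names standing in for getFirstK, the thread body and node_info/DataBlock, and the seed is arbitrary. Each thread owns a disjoint range of crawler and its own generator, so no mutex is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// First index owned by thread t when n items are split over num_threads threads.
// Also works when n is not an integral multiple of num_threads.
std::size_t first_k(std::size_t n, std::size_t num_threads, std::size_t t) {
    assert(num_threads > 0);
    return n * t / num_threads;
}

// Moves every crawler one random step. Each thread writes only to its own
// slice [kbegin, kend) of `crawler`; `neighbours` is shared but read-only.
void step_all(std::vector<int>& crawler,
              const std::vector<std::vector<int>>& neighbours,
              std::size_t num_threads) {
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t kbegin = first_k(crawler.size(), num_threads, t);
        const std::size_t kend = first_k(crawler.size(), num_threads, t + 1);
        pool.emplace_back([&crawler, &neighbours, kbegin, kend, t] {
            // Per-thread generator; the seed value is an arbitrary choice.
            std::mt19937 rng(static_cast<unsigned>(t) + 12345u);
            for (std::size_t k = kbegin; k < kend; ++k) {
                const std::vector<int>& next =
                    neighbours[static_cast<std::size_t>(crawler[k])];
                if (next.empty()) continue;  // dead end: crawler stays put
                std::uniform_int_distribution<std::size_t> pick(0, next.size() - 1);
                crawler[k] = next[pick(rng)];
            }
        });
    }
    for (std::thread& th : pool) th.join();  // wait for all slices to finish
}
```

Calling step_all(crawler, neighbours, 4) would split the crawlers over four threads; threads that own neighbouring elements may still share a cache line at the slice boundaries, which is harmless for correctness.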


When to prefer for-loop over std::transform or vice-versa

I would like to understand when it is more practical to use std::transform
and when an old fashioned for-loop is better.
This is my code with a for loop, I want to combine two vectors into a complex one:
vector<double> vAmplitude = this->amplitudeData(N);
vector<double> vPhase = this->phaseData(N);
vector<complex<double>,fftalloc<complex<double> > > vComplex(N);
for (size_t i = 0; i < N; ++i)
vComplex[i] = std::polar(vAmplitude[i], vPhase[i]);
This is my std::transform code
vector<double> vAmplitude = this->amplitudeData(N);
vector<double> vPhase = this->phaseData(N);
vector<complex<double>,fftalloc<complex<double> > > vComplex;
std::transform(begin(vPhase), end(vPhase), begin(vAmplitude),
    back_inserter(vComplex),
    [](double p, double a) { return std::polar(a, p); });
Note that vComplex is constructed without a size, so I wonder when the allocations happen. Also, I do not understand why, in the lambda expression, p and a must be reversed relative to their usage.
One consideration in favor of the standard algorithms is that they prepare your code (and you) for the C++17 versions that take an execution policy.
To borrow from JoachimPileborg's answer, say you write your code as
vector<complex<double>,fftalloc<complex<double> > > vComplex(N);
std::transform(begin(vAmplitude), end(vAmplitude), begin(vPhase),
    begin(vComplex), std::polar<double>);
After some time, you realize that this is the bottleneck in your code, and you need to run it in parallel. In this case, all you need to do is pass
std::execution::par (from <execution>) as the first argument to std::transform. In the hand-rolled version, your (standard-compliant) parallelism choices are gone.
Regarding the allocation, that's what std::back_inserter does.
You could also set the size for the destination vector vComplex and use std::begin for it in the std::transform call:
vector<complex<double>,fftalloc<complex<double> > > vComplex(N);
std::transform(begin(vPhase), end(vPhase), begin(vAmplitude),
    begin(vComplex),
    [](double p, double a) { return std::polar(a, p); });
As for the reversal of the arguments in the lambda, it's because you use vPhase as the first container in the std::transform call. If you changed it to use vAmplitude first, you could have passed just a pointer to std::polar instead:
std::transform(begin(vAmplitude), end(vAmplitude), begin(vPhase),
    begin(vComplex), std::polar<double>);
Lastly, as for when to use std::transform, it's more of a personal matter in most cases. I personally prefer to use the standard algorithm functions before trying to do everything myself.
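To make the allocation and argument-order points concrete, here is a minimal self-contained sketch. The function name to_polar and the plain std::vector (instead of the question's fftalloc allocator) are assumptions for illustration. The first range supplies the lambda's first parameter and the second range its second, so putting the amplitudes first removes the need to swap.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <iterator>
#include <vector>

// Combines amplitude/phase pairs into complex numbers.
// Precondition: phase has at least as many elements as amplitude.
std::vector<std::complex<double>> to_polar(const std::vector<double>& amplitude,
                                           const std::vector<double>& phase) {
    assert(phase.size() >= amplitude.size());
    std::vector<std::complex<double>> out;
    out.reserve(amplitude.size());  // one allocation up front
    // back_inserter calls push_back for each result, growing `out` as needed;
    // thanks to reserve() no reallocation happens during the loop.
    std::transform(amplitude.begin(), amplitude.end(), phase.begin(),
                   std::back_inserter(out),
                   [](double a, double p) { return std::polar(a, p); });
    return out;
}
```

Without the reserve() call the code is still correct; the vector would then reallocate a few times as it grows.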

about memory barriers (why the following example is error)

I read an article about memory barriers.
In this document, the following example is shown:
So don't leave out the ACCESS_ONCE().
It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:
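q = ACCESS_ONCE(a);
if (q) {
	barrier();
	ACCESS_ONCE(b) = p;
	do_something();
} else {
	barrier();
	ACCESS_ONCE(b) = p;
	do_something_else();
}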
Unfortunately, current compilers will transform this as follows at high
optimization levels:
q = ACCESS_ONCE(a);
barrier();
ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
if (q) {
	/* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
	do_something();
} else {
	/* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */
	do_something_else();
}
I don't understand why "moved up" is a bug. If I wrote this code, I would move ACCESS_ONCE(b) up myself, because both the if and else branches execute the same code.
It isn't so much that the moving up is a bug, it's that it exposes a bug in the code.
The intention was to use the conditional on q (from a), to ensure that the write to b is done after the read from a; because both stores are "protected" by a conditional and "stores are not speculated", the CPU shouldn't be making the store until it knows the outcome of the condition, which requires the read to have been done first.
The compiler defeats this intention by seeing that both branches of the conditional start with the same thing, so in a formal sense those statements are not conditioned. The problem with this is explained in the next paragraph:
Now there is no conditional between the load from 'a' and the store to
'b', which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
I'm not experienced enough to know exactly what is meant by barrier(), but apparently it is not powerful enough to enforce the ordering between the two independent memory operations.
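For comparison, outside the kernel the same ordering requirement (the store to b must not happen before the load from a) can be stated explicitly instead of relying on a conditional. The following is a sketch in portable C++11 atomics, not the kernel's own primitives; the function name store_after_load is hypothetical, and a, b and p mirror the names in the quoted example.

```cpp
#include <atomic>

std::atomic<int> a{0};
std::atomic<int> b{0};

// Loads a, then stores p to b, and returns the value read from a.
// The acquire load forbids both the compiler and the CPU from moving
// later memory operations in this thread (here: the store to b) before it.
int store_after_load(int p) {
    int q = a.load(std::memory_order_acquire);
    b.store(p, std::memory_order_relaxed);
    return q;
}
```

Here the ordering does not depend on whether an if statement survives optimization, which is the fragility the quoted passage warns about.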

OOP much slower than Structural programming. why and how can be fixed?

As mentioned in the title, I found out the hard way that OOP is slower than structural programming (spaghetti code).
I wrote a simulated annealing program with OOP, then removed one class and wrote its logic structurally in the main form. Suddenly it got much faster. In the OOP program I was calling the removed class in every iteration.
I also checked it with Tabu Search. Same result.
Can anyone tell me why this is happening and how I can fix it in other OOP programs?
Are there any tricks? For example, caching my classes or something like that?
(The programs are written in C#.)
If you have a high-frequency loop, and inside that loop you create new objects and don't call other functions very much, then, yes, you will see that if you can avoid those allocations, say by re-using one copy of the object, you can save a large fraction of total time.
Between new, constructors, destructors, and garbage collection, very little code can waste a whole lot of time.
Use them sparingly.
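The re-use advice is not specific to C#. As a rough illustration in C++ (both function names and the scratch buffer are made up for this sketch), the two variants below compute the same result, but the first pays for an allocation and a deallocation on every iteration while the second allocates once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1: a fresh scratch buffer is heap-allocated on every iteration.
long sum_with_fresh_buffers(std::size_t iterations, std::size_t n) {
    assert(n > 0);
    long total = 0;
    for (std::size_t it = 0; it < iterations; ++it) {
        std::vector<int> scratch(n);  // allocate, zero and later free, each time
        for (std::size_t i = 0; i < n; ++i) scratch[i] = static_cast<int>(i + it);
        total += scratch[n / 2];
    }
    return total;
}

// Variant 2: one buffer is created before the loop and overwritten in place.
long sum_with_reused_buffer(std::size_t iterations, std::size_t n) {
    assert(n > 0);
    std::vector<int> scratch(n);  // single allocation for the whole loop
    long total = 0;
    for (std::size_t it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < n; ++i) scratch[i] = static_cast<int>(i + it);
        total += scratch[n / 2];
    }
    return total;
}
```

How much the second variant gains depends on the allocator and the object size, so it is worth measuring both in the real loop.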
Memory access is often overlooked. The way o.o. tends to lay out data in memory is not conducive to efficient memory access in practice in loops. Consider the following pseudocode:
adult_clients = 0
for client in list_of_all_clients:
    if client.age >= AGE_OF_MAJORITY:
        adult_clients += 1
It so happens that this access pattern is quite inefficient on modern architectures: they like reading large contiguous runs of memory, but we only care about client.age for every client we have, and those fields will not be laid out contiguously in memory.
Focusing on objects that have fields results in data being laid out in memory in such a way that fields holding the same type of information are not stored consecutively. Performance-heavy code tends to involve loops that look at data with the same conceptual meaning. It is conducive to performance that such data be laid out in contiguous memory.
Consider these two examples in Rust:
// struct that contains an id, and an optional value of whether the id is divisible by three
struct Foo {
    id : u32,
    divbythree : Option<bool>,
}
fn main () {
    // create a pretty big vector of these structs with increasing ids, and divbythree initialized as None
    let mut vec_of_foos : Vec<Foo> = (0..100000000).map(|i| Foo{ id : i, divbythree : None }).collect();
    // loop over all these structs, determine if the id is divisible by three
    // and set divbythree accordingly
    let mut divbythrees = 0;
    for foo in vec_of_foos.iter_mut() {
        if foo.id % 3 == 0 {
            foo.divbythree = Some(true);
            divbythrees += 1;
        } else {
            foo.divbythree = Some(false);
        }
    }
    // print the number of times it was divisible by three
    println!("{}", divbythrees);
}
On my system, the real time with rustc -O is 0m0.436s; now let us consider this example:
fn main () {
    // this time we create two vectors rather than a vector of structs
    let vec_of_ids : Vec<u32> = (0..100000000).collect();
    let mut vec_of_divbythrees : Vec<Option<bool>> = vec![None; vec_of_ids.len()];
    // but we basically do the same thing
    let mut divbythrees = 0;
    for i in 0..vec_of_ids.len(){
        if vec_of_ids[i] % 3 == 0 {
            vec_of_divbythrees[i] = Some(true);
            divbythrees += 1;
        } else {
            vec_of_divbythrees[i] = Some(false);
        }
    }
    println!("{}", divbythrees);
}
This runs in 0m0.254s at the same optimization level, close to half the time needed.
Despite having to allocate two vectors instead of one, storing similar values in contiguous memory has almost halved the execution time. Though obviously the o.o. approach provides for much nicer and more maintainable code.
P.s.: it occurs to me that I should probably explain why this matters so much given that the code itself in both cases still indexes memory one field at a time, rather than, say, putting a large swath on the stack. The reason is c.p.u. caches: when the program asks for the memory at a certain address, it actually obtains, and caches, a significant chunk of memory around that address, and if memory next to it be asked quickly again, then it can serve it from the cache, rather than from actual physical working memory. Of course, compilers will also vectorize the bottom code more efficiently as a consequence.

Perl fast matrix multiply

I have implemented the following statistical computation in perl
The results are correct. I know this because I have hundreds of test cases that match input and output. The problem is that I need to compute this many times every time I run the script; the average number of calls to this function is around 530. I used Devel::NYTProf to find this out, as well as where the slow parts are. I have optimized the algorithm to traverse only the top half of the matrix and reflect it onto the bottom, as they are the same. I'm not a Perl expert, but I need to know if there is anything I can try to speed up the Perl code. This script is distributed to clients, so compiling a C file is not an option. Is there another Perl library I can try? This needs to be sub-second in speed if possible.
More information: $MatrixRef is a matrix of floating point numbers that is $rows by $variables. Here is the NYTProf dump for the function.
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s my ($MatrixRef, $rows, $variables) = @_;
526 0s my $r = 0;
526 0s my $c = 0;
526 0s my $k = 0;
526 0s my $sum = 0;
526 0s my @xpx = ();
526 11.0ms for ($r = 0; $r < $variables; $r++)
{
14202 19.0ms my @temp = (0) x $variables;
14202 6.01ms push(@xpx, \@temp);
526 0s }
526 7.01ms for ($r = 0; $r < $variables; $r++)
{
14202 144ms for ($c = $r; $c < $variables; $c++)
{
198828 43.0ms $sum = 0;
#for ($k = 0; $k < $rows; $k++)
198828 101ms foreach my $RowRef (@{$MatrixRef})
{
#$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s $sum += $RowRef->[$r]*$RowRef->[$c];
}
198828 80.1ms $xpx[$r]->[$c] = $sum;
#reflect on other side of matrix
198828 82.1ms $xpx[$c]->[$r] = $sum if ($r != $c);
14202 1.00ms }
526 2.00ms }
526 2.00ms return \@xpx;
}
Since each element of the result matrix can be calculated independently, it should be possible to calculate some/all of them in parallel. In other words, none of the instances of the innermost loop depend on the results of any other, so they could run simultaneously on their own threads.
There really isn't much you can do here without rewriting parts in C or moving to a better framework for mathematical operations than bare-bones Perl (→ PDL!).
Some minor optimization ideas:
You initialize @xpx with arrayrefs containing zeros. This is unnecessary, as you assign a value to every position either way. If you want to pre-allocate array space, assign to the $#array value:
my @array;
$#array = 100; # preallocate space for 101 scalars
This isn't generally useful, but you can benchmark with and without.
Iterate over ranges; don't use C-style for loops:
for my $c ($r .. $variables - 1) { ... }
Perl scalars aren't very fast for math operations, so offloading the range iteration to lower levels will gain a speedup.
Experiment with changing the order of the loops, and toy around with caching a level of array accesses. Keeping my $xpx_r = $xpx[$r] around in a scalar will reduce the number of array accesses. If your input is large enough, this translates into a speed gain. Note that this only works when the cached value is a reference.
Remember that perl does very few “big” optimizations, and that the opcode tree produced by compilation closely resembles your source code.
Edit: On threading
Perl threads are heavyweight beasts that literally clone the current interpreter. It is very much like forking.
Sharing data structures across thread boundaries is possible (use threads::shared; my $variable :shared = "foo") but there are various pitfalls. It is cleaner to pass data around in a Thread::Queue.
Splitting the calculation of one product over multiple threads could end up with your threads doing more communication than calculation. You could benchmark a solution that divides responsibility for certain rows between the threads. But I think recombining the solutions efficiently would be difficult here.
More likely to be useful is to have a bunch of worker threads running from the beginning. All threads listen to a queue which contains a pair of a matrix and a return queue. The worker would then dequeue a problem, and send back the solution. Multiple calculations could be run in parallel, but a single matrix multiplication will be slower. Your other code would have to be refactored significantly to take advantage of the parallelism.
Untested code:
use strict; use warnings; use threads; use Thread::Queue;
# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers
# automatically close threads when program exits
END {
  $problem_queue->enqueue((undef) x @threads);
  $_->join for @threads;
}
# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
  my $return_queue = Thread::Queue->new();
  $problem_queue->enqueue([$return_queue, @_]);
  return sub { $return_queue->dequeue };
}
# The main loop of worker threads
sub worker {
  my ($queue) = @_;
  while(defined(my $problem = $queue->dequeue)) {
    my ($return, @args) = @$problem;
    $return->enqueue(ComputeXpX(@args));
  }
}
sub ComputeXpX { ... } # as before
The async_XpX returns a coderef that will eventually collect the result of the computation. This allows us to carry on with other stuff until we need the result.
# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);
...; # do something else
# collect the results
my $result1 = $future1->();
my $result2 = $future2->();
I benchmarked the bare-bones threading code without doing actual calculations, and the communication is about as expensive as the calculations. I.e. with a bit of luck, you may start to get a benefit on a machine with at least four processors/kernel threads.
A note on profiling threaded code: I know of no way to do that elegantly. Benchmarking threaded code, but profiling with single-threaded test cases may be preferable.

How do you measure the time a function takes to execute?

How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is done preferably with a little assembly instruction so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state. Just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert the pin.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small, and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute each instruction. This will give you the number of cycles required.
Time = # instruction cycles * clock ticks per instruction cycle / processor clock rate
This is easier to do for smaller functions, or code written in assembler (for a PIC microcontroller for example)
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
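For the rollover caveat, a common trick is to keep the readings in an unsigned type of the same width as the counter and let the subtraction wrap. The sketch below assumes a free-running 32-bit counter; elapsed_ticks is a hypothetical helper, and how the counter is actually read is platform specific and not shown.

```cpp
#include <cstdint>

// Ticks elapsed between two readings of a free-running 32-bit counter.
// Unsigned arithmetic wraps modulo 2^32, so the result is still correct
// when the counter rolled over once between `before` and `after`.
std::uint32_t elapsed_ticks(std::uint32_t before, std::uint32_t after) {
    return after - before;
}
```

This only covers a single rollover; if the measured interval can exceed one full counter period, the overflows have to be counted separately.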
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
// begin timing
for (int i = 0; i < 10000; i++) {
    function();
}
// end timing
// divide by 10000 to get the average time per call.
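On a platform with a C++11 standard library (an assumption; a bare-metal target would read a hardware timer instead), the loop-and-divide approach could look roughly like this. average_ns is a made-up helper name, and std::chrono::steady_clock is used because it never jumps backwards.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time per call of f, in nanoseconds, over `runs` calls.
// The result still includes the (usually small) loop overhead.
template <typename F>
double average_ns(F&& f, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        f();
    }
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / runs;
}
```

For example, average_ns([] { function(); }, 10000) would time 10000 calls of the function under test; make sure the call has an observable side effect so the compiler cannot remove it.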
If you're using Linux, you can time a program's runtime by typing on the command line:
time [program_name]
If you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions) but also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
    function();
    function();
}
lap = getTicks();
repeat n times {
    function();
}
finish = getTicks();
// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling function() twice in the first loop and once in the second loop, you could just call it once in the first loop and don't call it at all (i.e. empty loop) in the second, however the empty loop could be optimized out by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values to get a delta in "ticks". Using QueryPerformanceFrequency() you can convert the "delta ticks" to an actual time unit. You can refer to the MSDN documentation about those WIN32 calls.
Other embedded systems
Without operating systems or with only basic OSes you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time longer than the resolution of the timer chosen).
before your function you save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
same after your function
subtract them to get a delta counter tick.
from there it is just a matter of knowing how long a tick means on your CPU/Hardware given the external clock and the de-multiplication you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: Do not forget to disable interrupts before and restore them after getting those timer values (both the carry and the register value); otherwise you risk saving incorrect values.
This is very fast because it takes only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is, AFTER your function.
You may wish to put that code into a function to reuse that code all around but it may slow things a bit because of the function call and the pushing of all the registers to the stack, plus the parameters, then popping them again. In an embedded system this may be significant. It may be better then in C to use MACROS instead or write your own assembly routine saving/restoring only relevant registers.
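The arithmetic behind the steps above can be sketched as follows. The 16-bit timer width, the function names and the example clock values are assumptions for illustration; reading the timer register and the carry variable (with interrupts disabled) is hardware specific and not shown.

```cpp
#include <cassert>
#include <cstdint>

// Combines the overflow count ("carry") and the 16-bit timer register
// into one monotonically increasing tick count.
std::uint64_t total_ticks(std::uint32_t carry, std::uint16_t reg) {
    return (static_cast<std::uint64_t>(carry) << 16) | reg;
}

// Converts a tick delta to microseconds. One tick lasts prescaler / clock_hz
// seconds, where prescaler is the division factor configured for the timer.
double ticks_to_us(std::uint64_t delta_ticks, double clock_hz, unsigned prescaler) {
    assert(clock_hz > 0.0);
    return static_cast<double>(delta_ticks) * prescaler * 1e6 / clock_hz;
}
```

With an 8 MHz clock and a prescaler of 8, one tick is 1 µs, so a delta of 1000 ticks corresponds to 1000 µs. The delta itself is total_ticks(carry_after, reg_after) - total_ticks(carry_before, reg_before), computed after the measured function has returned.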
It depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
/* cast to long long so the multiplication cannot overflow a 32-bit type */
#define SEC_TO_NSEC(s) ((long long)(s) * 1000 * 1000 * 1000)
int work_function(int c) {
    // do some work here
    int i, j;
    int foo = 0;
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            foo ^= i + j;
        }
    }
    return foo + c;
}
int main(int argc, char *argv[]) {
    struct timespec pre;
    struct timespec post;
    volatile int result; /* volatile keeps the call from being optimized away */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
    result = work_function(0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);
    printf("time %lld ns\n",
           (SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
           (SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
    (void)result;
    return 0;
}
On older glibc versions (before 2.17) you will need to link this with the realtime library; just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page on clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command line cpuset to force the code onto only one core.
If you are looking to measure user and system time, then you could use times(), which reports them in clock ticks (something like jiffies). Or, to measure elapsed wall-clock time instead, you can change the parameter for clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC...but be careful of wrap around with CLOCK_MONOTONIC.
For other platforms, you are on your own.
I always implement an interrupt driven ticker routine. This then updates a counter that counts the number of milliseconds since start up. This counter is then accessed with a GetTickCount() function.
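#define TICK_INTERVAL 1 // milliseconds between ticker interrupts
static volatile unsigned long tickCounter; // volatile: modified inside the interrupt handler
interrupt ticker (void) // 'interrupt' keyword is compiler specific
{
    tickCounter += TICK_INTERVAL;
}
unsigned long GetTickCount(void)
{
    // on CPUs where this read is not atomic, disable interrupts around it
    return tickCounter;
}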
In your code you would time the code as follows:
int function(void)
{
    unsigned long ticks = GetTickCount();
    /* do something ... */
    printf("Time is %lu", GetTickCount() - ticks);
    return 0;
}
In OS X terminal (and probably Unix, too), use "time":
time python
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
If you want to get at the hardware counters, try PAPI:
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.
