Go parallel segment runs slower than serial segment

I have built a fairly computationally intensive epidemic model in Go. I'm now trying to build a set of tests for my model, where I change an input and expect a different output. I built a serial version that slowly increases HIV prevalence and observes the effect on HIV deaths. It takes ~200 milliseconds to run.
var q float64
for q = 0.0; q < 1000; q++ {
    inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
    results := costAnalysisHandler(inputs)
    fmt.Println(results.HivDeaths[20])
}
Then I made a "parallel" version using channels, and it takes longer, ~400 milliseconds to run. These small differences add up, as we will be running millions of runs with different inputs, so I would like to make this as efficient as possible. Here is the parallel version:
ch := make(chan ChData)
var q float64
for q = 0.0; q < 1000; q++ {
    go func(q float64, inputs *costanalysis.Inputs, ch chan ChData) {
        inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] = inputs.CountryProfile.HivPrevalenceAdultsByGroup[0] * float32(math.Pow(1.00001, q))
        results := costAnalysisHandler(inputs)
        fmt.Println(results.HivDeaths[20])
        ch <- ChData{int(q), results.HivDeaths[20]}
    }(q, inputs, ch)
}
for q = 0.0; q < 1000; q++ {
    theResults := <-ch
    fmt.Println(theResults)
}
Any thoughts are very much appreciated.

There's overhead to starting and communicating with background tasks. The time spent on your cost analyses probably dwarfs the cost of communication if the whole program was taking 200ms, but if coordination cost ever does kill your app, a common approach is to hand off largish chunks of work at a time--e.g., make each goroutine do analyses for a range of 10 q values instead of just one. (Edit: And as @Innominate says, making a "worker pool" of goroutines that process a queue of job objects is another common approach.)
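To make the chunking idea concrete, here is a rough sketch, not a drop-in replacement: it reuses the ChData, inputs and costAnalysisHandler names from your snippet, assumes a hypothetical copyInputs deep-copy helper (sketched a bit further down), and scales a fresh copy of the inputs for each q rather than compounding the shared inputs the way your loop does:
const chunkSize = 10 // how many q values each goroutine handles; tune this

ch := make(chan ChData)
for start := 0.0; start < 1000; start += chunkSize {
    go func(start float64) {
        for q := start; q < start+chunkSize && q < 1000; q++ {
            // Each goroutine works on its own copy of the inputs
            // (see the race-condition discussion below).
            in := copyInputs(inputs) // hypothetical deep-copy helper
            in.CountryProfile.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, q))
            results := costAnalysisHandler(in)
            ch <- ChData{int(q), results.HivDeaths[20]}
        }
    }(start)
}
for i := 0; i < 1000; i++ {
    fmt.Println(<-ch)
}
The point is simply that each goroutine now amortizes its startup and channel traffic over ten analyses instead of one.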
Also, the code you pasted has a race condition. The contents of your Inputs struct don't get copied each time you spawn a goroutine, because you're passing your function a pointer. So goroutines running in parallel will read from and write to the same Inputs instance.
Simply making a brand new Inputs instance for each analysis, with its own arrays, etc. would avoid the race. If that ended up wasting tons of memory or causing lots of redundant copies, you could 1) recycle Inputs instances, 2) separate out read-only data that can safely be shared (maybe there's country data that's fixed, dunno), or 3) change some of the relatively big arrays to be local variables within costAnalysisHandler rather than stuff that needs to be passed around (maybe it could just take initial HIV prevalence and return HIV deaths at t=20, and everything else is local and on the stack).
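For illustration, a deep-copy helper might look something like this; the field names are taken from your snippet, but the exact layout of Inputs is a guess, so treat this as a sketch rather than working code:
// copyInputs returns an independent copy of in so a goroutine can modify
// prevalence values without racing with other goroutines.
func copyInputs(in *costanalysis.Inputs) *costanalysis.Inputs {
    out := *in // copies the scalar fields

    // Slices share their backing arrays, so copy them explicitly.
    // (This assumes CountryProfile is a struct value; if it's a pointer,
    // copy the pointed-to struct as well.)
    prev := append([]float32(nil), in.CountryProfile.HivPrevalenceAdultsByGroup...)
    out.CountryProfile.HivPrevalenceAdultsByGroup = prev

    // ...repeat for any other slices or maps that the analysis mutates.
    return &out
}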
This doesn't apply to Go today, but did when the question was originally posted: nothing is really running in parallel unless you call runtime.GOMAXPROCS() with your desired concurrency level, e.g., runtime.GOMAXPROCS(runtime.NumCPU()).
Finally, you should only worry about all of this if you're doing some larger analysis and actually have a performance problem; if .2 seconds of waiting is all that performance work can save you here, it's not worth it.

Parallelizing a computationally intensive set of calculations requires that the parallel computations can actually run in parallel on your machine. If they don't then the extra overhead of creating goroutines, channels and reading off the channel will make the program run slower.
I'm guessing that is the problem here.
Try setting the GOMAXPROCS environment variable to the number of CPUs you have before running your code, or call runtime.GOMAXPROCS(runtime.NumCPU()) before you start the parallel computations.
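For example (only needed on old Go versions; since Go 1.5 the runtime already defaults GOMAXPROCS to the number of CPUs):
// at the top of main(), before starting the computations
runtime.GOMAXPROCS(runtime.NumCPU())
or set the environment variable when launching the program, e.g. GOMAXPROCS=4 ./yourprogram (the program name here is just a placeholder).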

I see two issues related to parallel performance,
The first and more obvious one is that you must set GOMAXPROCS in order to get the Go runtime to use more than one CPU/core. Typically one would set it to the number of processors in the machine, but the ideal setting can vary.
The second problem is a bit trickier: your code doesn't appear to parallelize very well. Simply starting a thousand goroutines and assuming they'll work it out isn't going to give good results. You should probably be using some kind of worker pool, running a limited number of simultaneous computations (a good starting number would be the same as GOMAXPROCS) rather than trying to do 1000 at once.
See: http://golang.org/doc/faq#Why_no_multi_CPU
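A minimal sketch of that worker-pool idea, applied to the question's code: it reuses ChData, inputs and costAnalysisHandler from the question plus the hypothetical copyInputs helper from the first answer, and assumes the usual fmt, math, runtime and sync imports.
numWorkers := runtime.NumCPU() // a reasonable default; tune as needed
jobs := make(chan float64, 1000)
results := make(chan ChData, 1000)

var wg sync.WaitGroup
for w := 0; w < numWorkers; w++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for q := range jobs {
            in := copyInputs(inputs) // each worker needs its own mutable copy
            in.CountryProfile.HivPrevalenceAdultsByGroup[0] *= float32(math.Pow(1.00001, q))
            r := costAnalysisHandler(in)
            results <- ChData{int(q), r.HivDeaths[20]}
        }
    }()
}

for q := 0.0; q < 1000; q++ {
    jobs <- q
}
close(jobs)

wg.Wait()
close(results)

for r := range results {
    fmt.Println(r)
}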

Related

How to properly implement waiting of async computations?

I have a small problem and am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while(true) {
    async_enqueue_upload(input);    // completes instantly, but transfer will take 10us
    async_enqueue_calculate();      // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish();             // must wait while output is fully calculated, and there is no junk
    input = process(output);        // I cannot launch the next step without doing this on the host.
}
My question is about the wait_finish() part. I must wait for all devices to finish, so that I can combine the results, process the data, and upload a new portion that is based on the previous computation step. I need to sync data between each step, so I can't parallelize the steps. I know this is not the most performant arrangement. So let's proceed to the question.
I have two ways of checking for completion within wait_finish(). The first is to put the thread to sleep until it is woken by a completion event:
while( !is_completed() )
    Sleep(1);
This has very low performance, because the actual calculation takes, say, 100us, while the minimum Windows scheduler timestep is 1ms, so it gives an unacceptable ~10x loss in performance.
The second way is to check completion in an empty infinite loop:
while( !is_completed() )
    {} // do_nothing();
This gives 10x better computation performance. But it is also an unsuitable solution, because it keeps a CPU core fully utilised doing absolutely useless work. How can I make the CPU "sleep" for exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the amount of calculation time is too big for an active spin-wait, but too small compared to the scheduler timestep? Also, a related sub-question: how would I do this on Linux?
Fortunately, I succeeded in finding an answer on my own. In short: I should use Linux for this.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but it can be loaded in the following manner:
static int(__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) = (int(__stdcall*)(BOOL, PLARGE_INTEGER)) GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
It allows sleep intervals to be specified in 100ns units. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc();  // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    auto s1 = qpc();
    n++;
    auto passed = s1 - s0;
    if (passed >= hpf) {
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields a loop rate of a bit less than 2000 Hz, and the result varies from one output line to the next. That led me to the Windows thread scheduler, which is totally unsuited to real-time tasks, with its minimum interval of 0.5ms (plus overhead). By the way, does anyone know how to tune that value?
Next was the Linux question: what can it do? So I built a custom tiny 4.14 kernel with Buildroot and tested the benchmark code there. I replaced qpc() to return clock_gettime() data with the CLOCK_MONOTONIC clock, qpf() just returns the number of nanoseconds in a second, and sleep_precise() just calls clock_nanosleep(). I failed to find out what the difference is between CLOCK_MONOTONIC and CLOCK_REALTIME.
I was quite surprised to get a whopping 18.4kHz frequency right out of the box, and it was quite stable. While testing several intervals, I found that I can set the loop to almost any frequency up to 18.4kHz, but also that the actual measured wait time is about 1.6 times what I asked for. For example, if I ask to sleep for 100 us it actually sleeps for ~160 us, giving ~6.25 kHz. Nothing else is running on the system, just the kernel, busybox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even higher?

Most efficient number of goroutines on this machine

So I have a concurrent quicksort implementation that I wrote. It looks like this:
func Partition(A []int, p int, r int) int {
    index := MedianOf3(A, p, r)
    swapArray(A, index, r)
    x := A[r]
    j := p - 1
    i := p
    for i < r {
        if A[i] <= x {
            j++
            tmp := A[j]
            A[j] = A[i]
            A[i] = tmp
        }
        i++
    }
    swapArray(A, j+1, r)
    return j + 1
}

func ConcurrentQuicksort(A []int, p int, r int) {
    wg := sync.WaitGroup{}
    if p < r {
        q := Partition(A, p, r)
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, p, q-1)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, p, q-1)
        }
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, q+1, r)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, q+1, r)
        }
    }
    wg.Wait()
}

func Quicksort(A []int, p int, r int) {
    if p < r {
        q := Partition(A, p, r)
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
    }
}
I have a sem buffered channel, which I use to limit the number of running goroutines (if it reaches that number, I don't set up another goroutine, I just do the normal quicksort on the subarray). First I started with 100, then I changed it to 50, then 20. The benchmarks got slightly better. But after switching to 10 it started to go backwards; times started to get bigger. So there is some arbitrary number, at least for my hardware, that makes the algorithm run most efficiently.
When I was implementing this, I actually saw some SO question about the number of goroutines that would be best, and now I cannot find it (annoyingly, Chrome history doesn't save every visited site). Do you know how to calculate such a thing? It would be best if I didn't have to hardcode it, and could just let the program work it out itself.
P.S. I have a non-concurrent Quicksort, which runs about 1.7x slower than this. As you can see in my code, I fall back to Quicksort when the number of running goroutines exceeds the limit I set earlier. I wondered about calling ConcurrentQuicksort there instead, but without the go keyword, just calling it directly, so that if other goroutines finish their work, the ConcurrentQuicksort I called would start launching goroutines again, speeding up the process (because, as you can see, Quicksort only launches recursive quicksorts, without goroutines). I did that, and the time was actually about 10% slower than the regular Quicksort. Do you know why that would happen?
You have to experiment a bit with this stuff, but I don't think the main concern is how many goroutines are running at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.
I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.
The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.
Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)
To get a workable number of tasks--enough to keep all cores busy, but not so many that you create lots of overhead--you can set a minimum size (initial array size/100 or whatever), and only split off sorts of arrays larger than that.
In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:
Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
Your own coordination code (channel sends and sync ops) takes time
Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.
The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.
I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.
If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX time utility).
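Counting launches is only a few lines; a sketch using sync/atomic, dropped into the existing code:
var launches int64 // package-level counter

// wherever a goroutine is actually started:
atomic.AddInt64(&launches, 1)
go func() {
    ConcurrentQuicksort(A, p, q-1)
    <-sem
    wg.Done()
}()

// after the sort returns:
fmt.Println("goroutines launched:", atomic.LoadInt64(&launches))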
The approach I'd suggest for a divide-and-conquer problem like this is only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.
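A sketch of that size-threshold idea applied to the code above; the cutoff value is arbitrary, and this version drops the sem limit entirely and relies on the cutoff to bound the number of goroutines (you could of course keep both):
const minParallelSize = 1000 // don't spawn goroutines for subarrays smaller than this

func ConcurrentQuicksort(A []int, p int, r int) {
    if p >= r {
        return
    }
    q := Partition(A, p, r)
    if r-p < minParallelSize {
        // Small subarray: sort serially, no goroutine/channel overhead.
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
        return
    }
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        ConcurrentQuicksort(A, p, q-1)
    }()
    ConcurrentQuicksort(A, q+1, r) // sort the other half on the current goroutine
    wg.Wait()
}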
And you meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard library sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.
You can also look at other algorithms like radix sort for all or part of the sorting; I played with that in a sorting library of my own, which is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off to other goroutines to sort, and I used a fixed pool of goroutines with a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach then, and I'm almost sure it isn't on today's Go scheduler. Experimentation is fun!
If the operation is CPU bound, my experiments show that the optimal number is the number of CPUs.

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data set, and the operations on each of the data elements are independent, OpenCL can be a good choice for making it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
    [X,Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each of the elements of BigData and produces two other big data sets (X, Y). function2 and function3 are then applied individually to each of the elements of these X and Y data sets, respectively.
Since the operations of all the functions are applied to each of the elements of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c)!=TRUE)
{
    //[X,Y] = function1(BigData);
    1. load kernel1 and BigData on the GPU. Each of the threads will work on one of the
       data elements and save the result in X and Y on the GPU.
    //M = functionA(X);
    2a. load kernel2 on the GPU. Each of the threads will work on one of the
        data elements of X and save the result in M on the GPU.
        (workItems=n1, workgroup size=y1)
    //b = function2(M);
    2b. load kernel2 (same kernel) on the GPU. Each of the threads will work on
        one of the data elements of M and save the result in B on the GPU.
        (workItems=n2, workgroup size=y2)
    3. read the data B into host variable b
    //N = functionB(Y);
    4a. load kernel3 on the GPU. Each of the threads will work on one of the
        data elements of Y and save the result in N on the GPU.
        (workItems=n1, workgroup size=y1)
    //c = function3(N);
    4b. load kernel3 (same kernel) on the GPU. Each of the threads will work
        on one of the data elements of N and save the result in C on the GPU.
        (workItems=n2, workgroup size=y2)
    5. read the data C into host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?), or on what hardware (compute card? laptop IGPU?). "Significant overhead" can mean a lot of different things: 5ms? 1 second?
Intel, NVIDIA and AMD all publish optimisation guides - have you read them?

Testing Erlang function performance with timer

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?
Do you mean like this:
4> foo:foo(10000).
Where:
-module(foo).
-export([foo/1, baz/1]).

foo(N) -> TL = bar(N), {TL, sum(TL)/N}.

bar(0) -> [];
bar(N) ->
    timer:sleep(1),
    {D,_} = timer:tc(?MODULE, baz, [1000]),
    [D|bar(N-1)].

baz(0) -> ok;
baz(N) -> baz(N-1).

sum([]) -> 0;
sum([H|T]) -> H + sum(T).
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? Not always for me anyway.
The above results extract was without the sleep. Typically, without the sleep, timer:tc/3 returns low times like 4 or 5 most of the time, but sometimes big times like 22; with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is not down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code all in one go without it getting moved to another core, without doing so much else with the core, thus making the most out of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably less.
Measuring performance is a complex task, especially on new hardware and in modern OSes. There are many things which can fiddle with your results.
First, you are not alone: when you measure on your desktop or notebook, there can be other processes that interfere with your measurement, including system ones.
Second, there is the hardware itself. Modern CPUs have many features that trade off performance and power consumption. They can boost performance for a short time before overheating, and they can boost performance when there is no work on the other CPUs on the same chip or on the other hyper-thread of the same core. On the other hand, they can enter a power-saving mode when there is not enough work, and the CPU may not react fast enough to a sudden change. It is hard to tell if that is your case, but it is naive to think that previous work, or the lack of it, can't affect your measurement.
You should always take care to measure in a steady state, for a long enough time (seconds at least), and remove as many other things as possible that could affect your measurement. (And do not forget about GC in Erlang as well.)

Measuring execution time of selected loops

I want to measure the running times of selected loops in a C program so as to see what percentage of the total time for executing the program (on Linux) is spent in these loops. I should be able to specify the loops for which the performance should be measured. I have tried out several tools (VTune, HPCToolkit, OProfile) in the last few days and none of them seem to do this. They all find the performance bottlenecks and just show the time for those. That's because these tools only store times above a threshold (~1ms), so if a loop takes less time than that, its execution time won't be reported.
The basic block counting feature of gprof depends on a feature of older compilers that isn't supported any more.
I could manually write a simple timer using gettimeofday or something like that, but in some cases it won't give accurate results. For example:
for (i = 0; i < 1000; ++i)
{
    for (j = 0; j < N; ++j)
    {
        //do some work here
    }
}
Now here I want to measure the total time spent in the inner loop, and I would have to put a call to gettimeofday inside the first loop. So gettimeofday itself will get called 1000 times, which introduces its own overhead, and the result will be inaccurate.
Unless you have an in-circuit emulator or break-out box around your CPU, there's no such thing as timing a single loop or single instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.
If you're wanting to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second to execute, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then take that number and divide it by the number of artificially inflated iterations to get a figure that represents how long one pass through your target loop will take.
If you're wanting to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code in order to get a measurement in which what you're interested in dominates the time slice you're measuring.
This is true whether you're measuring performance using sub-millisecond high performance counters provided by the CPU, the system date time clock, or a wall clock to measure the elapsed time of your test.
Otherwise, you're just measuring white noise.
Typically if you want to measure the time spent in the inner loop, you'll put the timing calls outside of the outer loop and then divide by the (outer) loop count. That is, if you expect the time of the inner loop to be relatively constant for any j.
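Sketched in Go for brevity (the structure is identical with gettimeofday or clock_gettime in C): one pair of timer calls around the outer loop, then divide by the outer iteration count.
start := time.Now()
for i := 0; i < 1000; i++ {
    for j := 0; j < N; j++ {
        // work under test
    }
}
elapsed := time.Since(start)
fmt.Println("average per outer iteration:", elapsed/1000)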
Any profiling instructions incur their own overhead, but presumably that overhead will be the same regardless of where it's inserted, so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of the two things being compared, in which case a pair of function calls like this (you need one at the "end" too, to get the time delta) won't be an issue, since one routine will be 2x or more costly than the other.
Most platforms offer some sort of higher resolution timer, too, although the one we use here is hidden behind an API so that the "client" code is cross-platform. I'm sure with a little looking you can turn it up. Although even here, there's little likelihood that you'll get better than 1ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, natch).
I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.
Suppose the loops take some fraction of time, like 0.2 (20%) and you take N=20 samples. Then the number of samples that should show them will average 20 * 0.2 = 4, and the standard deviation of the number of samples will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) = 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)
