What would be the best and most accurate way to determine how long it took to process a routine, such as a procedure or function?
I ask because I am currently trying to optimize a few functions in my application, and when I test the changes it is hard to tell just by looking whether there has been any improvement at all. If I could get an accurate, or near-accurate, time for how long a routine took to process, I would then have a clearer idea of whether the changes to the code have helped.
I considered using GetTickCount, but I am unsure whether that would be anywhere near accurate.
It would be useful to have a reusable function/procedure to calculate the time of a routine, and use it something like this:
// < prepare for calculation of code
...
ExecuteSomeCode; // < code to test
...
// < stop calculating code and return time it took to process
I look forward to hearing some suggestions.
Thanks.
Craig.
From my knowledge, the most accurate method is by using QueryPerformanceFrequency:
var
  Freq, StartCount, StopCount: Int64;
  TimingSeconds: Real;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(StartCount);
  // Execute the process that you want to time ...
  QueryPerformanceCounter(StopCount);
  TimingSeconds := (StopCount - StartCount) / Freq;
  // Display timing ...
end;
Try Eric Grange's Sampling Profiler.
From Delphi 6 upwards you can use the x86 timestamp counter (TSC).
This counts CPU cycles; on a 1 GHz processor, each count takes one nanosecond.
You can't get more accurate than that.
function RDTSC: Int64; assembler;
asm
  // RDTSC can be executed out of order, so the pipeline needs to be flushed
  // to prevent RDTSC from executing before your code is finished.

  // Flush the pipeline
  XOR   EAX, EAX
  PUSH  EBX
  CPUID
  POP   EBX
  RDTSC // Get the CPU's time stamp counter.
end;
On x64 the following code is more accurate, because it does not suffer from the delay of CPUID.
  rdtscp          // On x64 we can use the serializing version of RDTSC
  push rbx        // Serialize the code that follows, so that out-of-order execution
  push rax        // cannot sneak subsequent instructions in ahead of RDTSCP completing.
  push rdx        // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor  eax, eax
  cpuid
  pop  rdx
  pop  rax
  pop  rbx
  shl  rdx, 32
  or   rax, rdx   // Combine the two 32-bit halves into a single 64-bit result in RAX
Use the above code to get the timestamp before and after executing your code.
Most accurate method possible and easy as pie.
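If you would rather drive the same counters from C instead of Delphi inline assembly, compiler intrinsics do the fencing for you. A minimal sketch, assuming a GCC/Clang toolchain where <x86intrin.h> and <cpuid.h> are available (MSVC has equivalents in <intrin.h>):

/* Sketch only: CPUID before RDTSC and after RDTSCP, mirroring the asm above. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __get_cpuid, used here only as a serializing fence */

static uint64_t tsc_start(void)
{
    unsigned int a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);  /* serialize before reading the TSC */
    return __rdtsc();
}

static uint64_t tsc_stop(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);     /* waits for earlier instructions to finish */
    unsigned int a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);  /* keep later instructions from moving up */
    return t;
}

int main(void)
{
    uint64_t start = tsc_start();
    volatile int x = 0;              /* stand-in for the code under test */
    for (int i = 0; i < 1000000; i++) x += i;
    uint64_t cycles = tsc_stop() - start;
    printf("%llu cycles\n", (unsigned long long)cycles);
    return 0;
}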
Note that you need to run a test at least 10 times to get a good result; on the first pass the cache will be cold, and random hard-disk reads and interrupts can throw off your timings.
Because this thing is so accurate it can give you the wrong idea if you only time the first run.
Why you should not use QueryPerformanceCounter()
QueryPerformanceCounter() reports the same amount of wall-clock time even if the CPU slows down; it compensates for CPU throttling. RDTSC, on the other hand, will report the same number of cycles if your CPU slows down due to overheating or whatnot.
So if your CPU starts running hot and needs to throttle down, QueryPerformanceCounter() will say that your routine is taking more time (which is misleading) and RDTSC will say that it takes the same number of cycles (which is accurate).
This is what you want because you're interested in the amount of CPU-cycles your code uses, not the wall-clock time.
From the latest Intel docs: http://software.intel.com/en-us/articles/measure-code-sections-using-the-enhanced-timer/?wapkw=%28rdtsc%29
Using the Processor Clocks
This timer is very accurate. On a system with a 3GHz processor, this timer can measure events that last less than one nanosecond. [...] If the frequency changes while the targeted code is running, the final reading will be redundant since the initial and final readings were not taken using the same clock frequency. The number of clock ticks that occurred during this time will be accurate, but the elapsed time will be an unknown.
When not to use RDTSC
RDTSC is useful for basic timing. If you're timing multithreaded code on a single-CPU machine, RDTSC will work fine. If you have multiple CPUs, the start count may come from one CPU and the end count from another.
So don't use RDTSC to time multithreaded code on a multi-CPU machine. On a single-CPU machine it works fine, and single-threaded code on a multi-CPU machine is also fine.
Also remember that RDTSC counts CPU cycles. If there is something that takes time but doesn't use the CPU, like disk IO or the network, then RDTSC is not a good tool.
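One common workaround on a multi-CPU Windows machine is to pin the timing thread to a single core for the duration of the measurement, so that both readings come from the same core's counter. A rough sketch, assuming the Win32 SetThreadAffinityMask call:

/* Sketch only: restrict the current thread to core 0 while timing with RDTSC. */
#include <windows.h>

void time_on_one_core(void)
{
    DWORD_PTR old_mask = SetThreadAffinityMask(GetCurrentThread(), 1); /* core 0 only */
    /* ... take the RDTSC readings and run the code under test here ... */
    SetThreadAffinityMask(GetCurrentThread(), old_mask);               /* restore */
}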
But the documentation says RDTSC is not accurate on modern CPUs
RDTSC is not a tool for keeping track of time, it's a tool for keeping track of CPU cycles.
For that it is the only tool that is accurate. Routines that keep track of time are not accurate on modern CPUs because the CPU clock is not absolute like it used to be.
You didn't specify your Delphi version, but Delphi XE has a TStopWatch declared in unit Diagnostics. This will allow you to measure the runtime with reasonable precision.
uses
  Diagnostics;

var
  sw: TStopWatch;
begin
  sw := TStopWatch.StartNew;
  <dosomething>
  Writeln(Format('runtime: %d ms', [sw.ElapsedMilliseconds]));
end;
I ask because I am currently trying to optimize a few functions
It is natural to think that measuring is how you find out what to optimize, but there's a better way.
If something takes a large enough fraction of time (F) to be worth optimizing, then if you simply pause it at random, F is the probability you will catch it in the act.
Do that several times, and you will see precisely why it's doing it, down to the exact lines of code.
More on that.
Here's an example.
Fix it, and then do an overall measurement to see how much you saved, which should be about F.
Rinse and repeat.
Here are some procedures I made to handle checking the duration of a function. I stuck them in a unit I called uTesting and then just throw it into the uses clause during my testing.
Declaration
Procedure TST_StartTiming(Index : Integer = 1);
//Starts the timer by storing now in Time
//Index is the index of the timer to use. 100 are available
Procedure TST_StopTiming(Index : Integer = 1;Display : Boolean = True; DisplaySM : Boolean = False);
//Stops the timer and stores the difference between time and now into time
//Displays the result if Display is true
//Index is the index of the timer to use. 100 are available
Procedure TST_ShowTime(Index : Integer = 1; Detail : Boolean = True; DisplaySM : Boolean = False);
//Displays the time in a ShowMessage (or OutputDebugString)
//Uses TimeToStr if Detail is false, else it breaks the time down (H, M, S, MS)
//Index is the index of the timer to use. 100 are available
variables declared
var
Time : array[1..100] of TDateTime;
Implementation
Procedure TST_StartTiming(Index : Integer = 1);
begin
  Time[Index] := Now;
end;

Procedure TST_StopTiming(Index : Integer = 1; Display : Boolean = True; DisplaySM : Boolean = False);
begin
  Time[Index] := Now - Time[Index];
  if Display then TST_ShowTime(Index, True, DisplaySM); // pass the index on, not just timer 1
end;

Procedure TST_ShowTime(Index : Integer = 1; Detail : Boolean = True; DisplaySM : Boolean = False);
var
  H, M, S, MS : Word;
begin
  if Detail then
  begin
    DecodeTime(Time[Index], H, M, S, MS);
    if DisplaySM then
      ShowMessage('Hour = ' + IntToStr(H) + #13#10 +
                  'Min = '  + IntToStr(M) + #13#10 +
                  'Sec = '  + IntToStr(S) + #13#10 +
                  'MS = '   + IntToStr(MS))
    else
      OutputDebugString(PChar('Hour = ' + IntToStr(H) + #13#10 +
                              'Min = '  + IntToStr(M) + #13#10 +
                              'Sec = '  + IntToStr(S) + #13#10 +
                              'MS = '   + IntToStr(MS)));
  end
  else if DisplaySM then
    ShowMessage(TimeToStr(Time[Index]))
  else
    OutputDebugString(PChar(TimeToStr(Time[Index])));
end;
Use this http://delphi.about.com/od/windowsshellapi/a/delphi-high-performance-timer-tstopwatch.htm
clock_gettime() is the high-resolution solution, precise to nanoseconds; you can also use RDTSC, which is precise to the CPU cycle; and lastly you can simply use gettimeofday().
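As a minimal sketch of the clock_gettime() route (assuming a POSIX system; older glibc needs -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code under test ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}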
Related
I'm working on a program that needs to get the current CPU usage. How can I achieve that in VB.Net?
I tried about 4 code samples but I still get 0% every time. Here is one example of what I've used: Link
Thanks In Advance ,
Anes08
Though answering such questions is frowned upon, here's something that might help you get started:
Dim cpu As New System.Diagnostics.PerformanceCounter
cpu.CategoryName = "Processor"
cpu.CounterName = "% Processor Time"
cpu.InstanceName = "_Total"
MessageBox.Show(cpu.NextValue.ToString + "%")
If it doesn't work, here's a better version:
Dim cpu As PerformanceCounter ' Declare at class level
' On form load (actually you just need to initialize it first)
cpu = New PerformanceCounter("Processor", "% Processor Time", "_Total")
' Finally, get the value:
MsgBox(cpu.NextValue & "%") ' Use .ToString if required
You can use LblCpuUsage.Text = CombinedAllCpuUsageOfEachThread.NextValue().
There is a helper library to get that information: the Performance Data Helper (see Using the PDH Functions to Consume Counter Data (Windows)).
Microsoft's examples are in C, but there are also corresponding VB (not .NET) functions: Performance Counters Functions for Visual Basic (Windows).
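Since the Microsoft samples are in C, a minimal PDH sketch in that language might look like the following. This is an illustration only, assuming Windows Vista or later with pdh.lib available; the one-second Sleep is there because the first sample of a rate counter has no interval to compute over:

#include <windows.h>
#include <pdh.h>
#include <stdio.h>

#pragma comment(lib, "pdh.lib")

int main(void)
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PDH_FMT_COUNTERVALUE value;

    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);

    PdhCollectQueryData(query);   /* first sample: establishes a baseline */
    Sleep(1000);
    PdhCollectQueryData(query);   /* second sample: now a rate can be computed */

    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
    printf("CPU: %.1f%%\n", value.doubleValue);

    PdhCloseQuery(query);
    return 0;
}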
I wanted an average. There were a couple of problems with getting CPU utilization for which it seemed like there should be an easy packaged solution, but I didn't see one.
The first is of course that a value of 0 on the first request is useless. Since you already know that the first response is 0, why doesn't the function just take that into account and return the true .NextValue()?
The second problem is that an instantaneous reading may be wildly inaccurate when trying to make decisions on what resources your app may have available to it since it could be spiking, or between spikes.
My solution was to do a for loop that cycles through and gives you an average over the past few seconds. You can adjust the counter to make it shorter or longer (as long as it is more than 2).
public static float ProcessorUtilization;

public static float GetAverageCPU()
{
    ProcessorUtilization = 0; // reset so repeated calls don't accumulate old samples
    PerformanceCounter cpuCounter = new PerformanceCounter("Process", "% Processor Time", Process.GetCurrentProcess().ProcessName);
    for (int i = 0; i < 11; ++i)
    {
        ProcessorUtilization += (cpuCounter.NextValue() / Environment.ProcessorCount);
        System.Threading.Thread.Sleep(500); // give the counter an interval to measure over
    }
    // Remember the first value is 0, so we don't want to average that in.
    Console.WriteLine(ProcessorUtilization / 10);
    return ProcessorUtilization / 10;
}
Question: Suppose that you have a clock chip operating at 100 kHz, that is, every ten microseconds the clock will tick and subtract one from a counter. When the counter reaches zero, an interrupt occurs. Suppose that the counter is 16 bits, and you can load it with any value from 0 to 65535. How would you implement a time of day clock with a resolution of one second.
My understanding:
You can't store 100,000 in a 16-bit counter, but you can store 50,000, so would you have to use some sort of flag and only act on every other interrupt?
But I'm not sure how to go about implementing that. Any form of pseudocode or a general explanation would be most appreciated.
Since you can't get the range you want in hardware, you would need to extend the hardware counter with some sort of software counter (e.g. every time the hardware counter fires its interrupt, increment the software counter). In your case you just need an integer value to keep track of whether or not you have seen a hardware tick. If you wanted to use this for some sort of scheduling policy, you could do something like this:
static int counter; /* initialized to 0 somewhere */

void trap_handler(int trap_num)
{
    switch(trap_num)
    {
    ...
    case TP_TIMER:
        if(counter == 0)
            counter++;
        else if(counter == 1)
        {
            /* Action here (flush cache/yield/...) */
            fs_cache_flush();
            counter = 0;
        } else {
            /* Impossible case */
            panic("Counter value bad\n");
        }
        break;
    ...
    default:
        panic("Unknown trap: 0x%x\n", trap_num);
        break;
    }
    ...
}
If you want example code in another language let me know and I can change it. I'm assuming you're using C because you tagged your question with OS. It could also make sense to keep your counter on a per process basis (if your operating system supports processes).
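For the time-of-day part of the question specifically, a minimal sketch of the idea is below: load the 16-bit counter with 50,000 (half a second at 100 kHz, since 100,000 doesn't fit in 16 bits) and advance the clock on every second interrupt. load_timer() is a hypothetical stand-in for whatever register write your hardware actually needs:

#include <stdint.h>

#define TIMER_RELOAD 50000u            /* 50,000 ticks * 10 us = 0.5 s */

static volatile uint8_t  half_seconds; /* toggles 0/1 on each interrupt */
static volatile uint32_t seconds_since_boot;

extern void load_timer(uint16_t count); /* hypothetical: reload the 16-bit counter */

void timer_isr(void)                    /* runs when the counter reaches zero */
{
    load_timer(TIMER_RELOAD);           /* restart the half-second countdown */
    if (++half_seconds == 2) {          /* two interrupts = one full second */
        half_seconds = 0;
        seconds_since_boot++;
    }
}

uint32_t time_of_day_seconds(void)
{
    return seconds_since_boot;          /* resolution: one second */
}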
I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state::State, datum::Vector{String})
    ...
    score::Vector{Float64} # Size of score = 10100
I have what amounts to a nested loop.
The outer loop would typically run for 500 iterations:
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
    score_total = zeros(10100)
    for datum in data
        score_total += MyScoringOperation(state, datum)
    end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 0.3.9 installed (and can get 3 more easily, and then can get hundreds more at the next scale).
I have basic experience with @parallel; however, it seems like it is spending a lot of time copying the constant data (it more or less hangs on the smaller test case).
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = @parallel(+) for datum in data
        MyScoringOperation(state, datum)
    end
    state = update(state, score_total)
end
My understanding of the way this implementation works with @parallel is that, for each ii, it:
1. partitions data into a chunk for each worker
2. sends that chunk to each worker
3. each worker processes its chunk
4. the main procedure sums the results as they arrive
I would like to remove step 2,
so that instead of sending a chunk of data to each worker,
I just send a range of indexes to each worker, and they look it up from their own copy of data. Or, even better, give each worker only its own chunk and have it reuse it each time (saving a lot of RAM).
Profiling backs up my belief about how @parallel functions.
For a similarly scoped problem (with even smaller data),
the non-parallel version runs in 0.09 seconds,
and the parallel version runs in 185 seconds.
The profiler shows almost 100% of this time is spent interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
    base_len, remainder = divrem(length(data), nchunks)
    chunk_len = fill(base_len, nchunks)
    chunk_len[1:remainder] += 1 # remainder will always be less than nchunks
    function _it()
        for ii in 1:nchunks
            chunk_start = sum(chunk_len[1:ii-1]) + 1
            chunk_end = chunk_start + chunk_len[ii] - 1
            chunk = data[chunk_start:chunk_end]
            produce(chunk)
        end
    end
    Task(_it)
end
function r_chunk_data(data::Vector)
    all_chunks = get_chunks(data, nworkers()) |> collect;
    remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chunks[ii]) for (ii, pid) in enumerate(workers())]
    # Have to add the type annotation as otherwise it thinks that RemoteRef(pid) might return a RemoteValue
end
function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
    total = nothing
    # TODO: strongly consider wrapping total in a lock, when in 0.4, so that it is guaranteed safe
    @sync for rr in rem_results
        function gather(rr)
            res = fetch(rr)
            if total === nothing
                total = res
            else
                total = red_acc(total, res)
            end
        end
        @async gather(rr)
    end
    total
end
function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
    rem_results = map(r_chunks) do rchunk
        function do_mapred()
            @assert rchunk.where == myid()
            @pipe rchunk |> fetch |> map(map_fun, _) |> reduce(red_acc, _)
        end
        remotecall(rchunk.where, do_mapred)
    end
    @pipe rem_results |> convert(Vector{RemoteRef}, _) |> fetch_reduce(red_acc, _)
end
r_chunk_data breaks the data into chunks (as defined by the get_chunks method) and sends each chunk to a different worker, where it is stored in a RemoteRef.
The RemoteRefs are references to memory on your other processes (and potentially computers).
prechunked_mapreduce does a variation on map-reduce: each worker first runs map_fun on each of its chunk's elements, then reduces over all the elements in its chunk using red_acc (a reduction accumulator function). Finally, each worker returns its result, and the results are combined by reducing them all together with red_acc, this time via fetch_reduce, so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch-and-reduce operation. I believe it has no race conditions, though this may be due to an implementation detail in @async and @sync. When Julia 0.4 comes out, it is easy enough to put a lock in to make it obviously have no race conditions.
This code isn't really battle-hardened. You also might want to look at making the chunk size tunable, so that you can send more data to faster workers (if some have better network or faster CPUs).
You need to re-express your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100),eye(100)])[:] for _ in 1:3000] # 480 MB
r_chunks = r_chunk_data(data)
@time prechunked_mapreduce(r_chunks, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
@time reduce(+,map(mean,data))
took ~0.06 seconds.
I'm working on an algorithm using OpenCL and I need to measure its execution time in both its parallel and sequential versions. Because of this, I'm using an external loop to iterate over both versions and measure their times, but I have obtained:
Sequential: 3.06 s
Parallel: 269 s
The code that I'm using for the parallel version is:
t_start = clock(); /* Start measuring time */
for (i = 0; i <= N; i++) // N is really big, around a million, but is the same for both versions
{
    fitness = 0;
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vdistance, CL_TRUE, 0, siz_mem_distance_code, distance_code, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vsumatorio, CL_TRUE, 0, siz_mem_sumatorio, sumatorio, 0, NULL, NULL);
    fitness = (1/(*sumatorio)) + (*distance_code/12) + ((pow(*distance_code,2))/4) + ((pow(*distance_code,3))/6);
}
t_finish = clock(); /* End measuring time */
Before this piece of code, I have created/initialized all the things needed to run a program using OpenCL (platform, device, context, queue, buffers, kernel, ...), and after this code, I release everything.
I have checked that this increase in time is due to reading both variables (distance_code and sumatorio) in each iteration, but I have to do it because I need to obtain the fitness value, which is a sequential computation and can only be executed once the kernel has finished. So... could you help me? What am I doing wrong?
I hope to have explained myself properly, thanks in advance.
Note: I'm only working with the CPU.
The overhead of launching so many kernels exceeds the benefit of parallelizing a for loop over only 64 data items. You need to rewrite your problem so that you launch relatively few kernels over large batches of data. In that case, and if the OpenCL compiler generates appropriate vectorized machine code, you would see an improvement over the sequential version.
Additionally, you should check with either AMD's CodeXL or Intel's Offline Compiler if the generated code contains any vector instructions.
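To make the batching suggestion concrete, a rough host-side sketch follows. It reuses the question's handles (command_queue, kernel, ret, i, N, fitness) and assumes the kernel has been rewritten so that work-group i writes iteration i's results into per-iteration slots of two new buffers; vdistance_all, vsumatorio_all and the host arrays distance_all, sumatorio_all are hypothetical names, and float is an assumed element type:

/* One launch and two reads replace a million launches and two million reads. */
size_t local_size  = 64;                             /* the original global_item_size */
size_t global_size = (size_t)(N + 1) * local_size;   /* one work-group per iteration */

ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);

/* Read all per-iteration results in one blocking transfer each. */
ret = clEnqueueReadBuffer(command_queue, vdistance_all, CL_TRUE, 0,
                          (N + 1) * sizeof(float), distance_all, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, vsumatorio_all, CL_TRUE, 0,
                          (N + 1) * sizeof(float), sumatorio_all, 0, NULL, NULL);

/* The cheap scalar fitness formula stays on the host, now outside the hot path. */
for (i = 0; i <= N; i++) {
    fitness = (1/sumatorio_all[i]) + (distance_all[i]/12)
            + (pow(distance_all[i],2)/4) + (pow(distance_all[i],3)/6);
}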
How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is done preferably with a little assembly instruction so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state; just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert the pin.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small, and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute every instruction. This will give you the number of cycles required.
Time = # cycles / Processor clock rate
This is easier to do for smaller functions, or code written in assembler (for a PIC microcontroller for example)
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
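On the rollover point: if the counter is read into an unsigned variable of the counter's width, the subtraction stays correct across a single wrap-around, because it is performed modulo 2^32. A small sketch, with read_timestamp() as a hypothetical stand-in for your part's free-running counter read:

#include <stdint.h>

extern uint32_t read_timestamp(void);   /* hypothetical: read the hardware counter */

uint32_t time_it(void (*fn)(void))
{
    uint32_t start = read_timestamp();
    fn();
    uint32_t end = read_timestamp();
    return end - start;                 /* safe across a single wrap-around */
}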
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
so:
// begin timing
for (int i = 0; i < 10000; i++) {
invokeFunction();
}
// end time
// divide by 10000 to get actual time.
If you're using Linux, you can time a program's runtime by typing at the command line:
time [function_name]
If you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions) but also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
myFunction();
myFunction();
}
lap = getTicks();
repeat n times {
myFunction();
}
finish = getTicks();
// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling myFunction() twice in the first loop and once in the second loop, you could just call it once in the first loop and not call it at all (i.e. an empty loop) in the second; however, the empty loop could be optimized out by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values to get a delta of "ticks". Using QueryPerformanceFrequency() you can convert the delta ticks to an actual time unit. You can refer to the MSDN documentation about those Win32 calls.
Other embedded systems
Without operating systems or with only basic OSes you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time longer than the resolution of the timer chosen).
before your function you save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
do the same after your function.
subtract them to get a delta in counter ticks.
from there it is just a matter of knowing how long a tick is on your CPU/hardware, given the external clock and the de-multiplication (prescaler) you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: Do not forget to disable interrupts before, and restore them after, getting those timer values (both the carry and the register value); otherwise you risk saving incorrect values.
NOTES
This is very fast because it is only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is, AFTER your function.
You may wish to put that code into a function so you can reuse it all around, but it may slow things down a bit because of the function call and the pushing of all the registers and parameters onto the stack, then popping them again. In an embedded system this may be significant. It may be better, in C, to use macros instead, or to write your own assembly routine that saves/restores only the relevant registers.
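A rough C sketch of the carry-plus-register snapshot described above, using hypothetical names (TIMER_COUNT for the free-running counter register at a made-up address, timer_carry for the overflow count maintained in the ISR, disable_irq()/enable_irq() for whatever your part provides):

#include <stdint.h>

extern volatile uint32_t timer_carry;   /* incremented in the overflow ISR */
extern void disable_irq(void);          /* hypothetical */
extern void enable_irq(void);           /* hypothetical */
#define TIMER_COUNT (*(volatile uint16_t *)0x4000F004u) /* hypothetical address */

static inline void snapshot(uint32_t *carry, uint16_t *count)
{
    disable_irq();                      /* keep carry and count consistent */
    *carry = timer_carry;
    *count = TIMER_COUNT;
    enable_irq();
}

/* usage:
   uint32_t c0, c1; uint16_t t0, t1;
   snapshot(&c0, &t0);
   my_function();
   snapshot(&c1, &t1);
   ticks = ((uint64_t)(c1 - c0) << 16) + t1 - t0;  // then multiply by your tick length
*/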
It depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#define SEC_TO_NSEC(s) ((s) * 1000 * 1000 * 1000)
int work_function(int c) {
    // do some work here
    int i, j;
    int foo = 0;
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            foo ^= i + j;
        }
    }
    return foo;
}

int main(int argc, char *argv[]) {
    struct timespec pre;
    struct timespec post;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
    work_function(0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);
    printf("time %ld\n",
           (SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
           (SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
    return 0;
}
You will need to link this with the realtime library; just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page on clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command-line cpuset to force the code onto only one core.
If you are looking to measure user and system time, then you could use times(NULL), which returns something like jiffies. Or you can change the parameter for clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC... but be careful of wrap-around with CLOCK_MONOTONIC.
For other platforms, you are on your own.
Drew
I always implement an interrupt-driven ticker routine. This updates a counter that counts the number of milliseconds since start-up. This counter is then accessed with a GetTickCount() function.
Example:
#define TICK_INTERVAL 1 // milliseconds between ticker interrupts

static unsigned long tickCounter;

interrupt ticker (void)
{
    tickCounter += TICK_INTERVAL;
    ...
}

unsigned long GetTickCount(void)
{
    return tickCounter;
}
In your code you would time the code as follows:
int function(void)
{
    unsigned long ticks = GetTickCount();
    do something ...
    printf("Time is %lu", GetTickCount() - ticks);
}
In OS X terminal (and probably Unix, too), use "time":
time python function.py
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
http://linux.die.net/man/3/clock_gettime
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#nanoTime()
If you want to get at the hardware counters, try PAPI:
http://icl.cs.utk.edu/papi/
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.
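As a rough illustration of the PAPI route mentioned above (assuming PAPI is installed and you link with -lpapi), its wall-clock helpers already wrap the hardware timestamp for you:

#include <stdio.h>
#include <papi.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    long long c0 = PAPI_get_real_cyc();
    long long u0 = PAPI_get_real_usec();
    /* ... code under test ... */
    long long cycles = PAPI_get_real_cyc() - c0;
    long long usecs  = PAPI_get_real_usec() - u0;

    printf("%lld cycles, %lld us\n", cycles, usecs);
    return 0;
}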