Heluuu,
I have a rather large program that I'm attempting to thread. So far, this has been successful, and the basics are all working as intended.
I now want to do some fancier work with cascading threads in nested mode. Essentially, I want any threads left free by the main parallel region to be used by the lower parallel regions.
To detail the current system: the main parallel region starts 10 threads. I have 12 cores, so 2 more threads are available. There is a second parallel region where some heavy computing happens, and I want the first two threads that reach this point to each start a new team there with 2 threads. Every thread that enters the lower parallel region after that should just continue serially.
So, this should look like the following.
Main region: 10 threads started.
Lower region: 2 new threads started.
Thread 1: 2 threads in lower region.
Thread 2: 2 threads in lower region.
Thread 3-10: 1 thread in lower region.
Please keep in mind that these numbers are only for clarity, to give a concrete description of my situation; they are not the one and only case in which the program operates.
The code:
int main() {
    ...
    ...
    omp_set_num_threads(n);
    omp_set_dynamic(x);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < iterations; i++) {
            ...
            Compute();
            ...
        }
    }
}
And in Compute
bool Compute() {
    ...
    // Threads still free: the overall thread limit minus the threads already in the outer team
    float nThreads = omp_get_thread_limit() - omp_get_num_threads();
    // Spread the free threads over the members of the outer team
    nThreads = ceil(nThreads / omp_get_num_threads());
    omp_set_num_threads((int)nThreads);
    #pragma omp parallel
    {
        ...
        #pragma omp for
        for (int i = 0; i < nReductSize; i++) {
            ...
        }
    }
}
Now, my problem is that setting the uppermost limit for the whole program (i.e. OMP_THREAD_LIMIT) only works from outside the program. Using
export OMP_THREAD_LIMIT=5
from the bash command line works great. But I want to do it internally. So far, I've tried
putenv("OMP_THREAD_LIMIT=12");
setenv("OMP_THREAD_LIMIT", "12", 1);
but when I call omp_get_thread_limit() or getenv("OMP_THREAD_LIMIT") I get wacky return values. Even when I set the variable with export, calling getenv("OMP_THREAD_LIMIT") returns 0.
So, I would ask for your help in this: How do I properly set OMP_THREAD_LIMIT at runtime?
This is the main function where I set the thread defaults. It is executed well before any threading occurs:
#ifdef _OPENMP
const char *name = "OMP_THREAD_LIMIT";
const char *value = "5";
int overwrite = 1;
int success = setenv(name, value, overwrite);
cout << "Var set (0 is success): " << success << endl;
#endif
Oh, and setenv reports success in setting the variable.
Compiler says
gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
Flags
CCFLAGS = -c -O0 -fopenmp -g -msse -msse2 -msse3 -mfpmath=sse -std=c++0x
OpenMP version is 3.0.
This is correct behavior for an OpenMP implementation: it ignores changes made to the environment from inside the program. As stated in the OpenMP 3.1 standard, page 159:
Modifications to the environment variables after the program has started, even if
modified by the program itself, are ignored by the OpenMP implementation.
You are doing exactly what this paragraph describes.
OpenMP allows changing such parameters only via the omp_set_* functions, but there is no such function for the thread-limit-var ICV:
However, the settings of some of the ICVs can be modified during the execution of the OpenMP
program by the use of the appropriate directive clauses or OpenMP API routines.
I think you can use the num_threads clause of #pragma omp parallel to achieve what you want.
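For example, something along these lines (just a sketch; the variable names are mine, and you still need nested parallelism enabled, e.g. via omp_set_nested(1) or OMP_NESTED=true, for the inner region to fork a real team):

#include <omp.h>

void Compute()
{
    // Illustrative heuristic: how many hardware threads are not already used
    // by the outer team, spread over the members of that team.
    int totalCores   = omp_get_num_procs();
    int outerThreads = omp_get_num_threads();   // size of the enclosing team
    int spare        = totalCores > outerThreads ? totalCores - outerThreads : 0;
    int innerThreads = 1 + spare / outerThreads; // at least 1 thread per inner region

    #pragma omp parallel num_threads(innerThreads)
    {
        #pragma omp for
        for (int i = 0; i < 1000; i++) {
            // ... heavy work ...
        }
    }
}

This does not reproduce the exact first-come-first-served distribution you describe, but it shows the mechanism: num_threads accepts a runtime expression, so each inner region can size its own team without touching any global setting.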
Changing the behavior of OpenMP using OMP_THREAD_LIMIT (or any other OMP_* environment variable) is not possible after the program has started; these are intended to be set by the user, in the environment, before the program runs. You could have the user invoke your program through a script that sets OMP_THREAD_LIMIT and then calls your program, but that's probably not what you need to do in this case.
OMP_NUM_THREADS, omp_set_num_threads, and the num_threads clause are usually used to set the number of threads operating in a region.
It might be off-topic, but you may want to try OpenMP's collapse clause instead of handcrafting the nesting here.
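For example (a minimal sketch; this assumes the two loops can be made perfectly nested, with no code between them, and that the inner bounds don't depend on the outer index -- which is not quite your current structure):

#pragma omp parallel for collapse(2)
for (int i = 0; i < iterations; i++) {
    for (int j = 0; j < nReductSize; j++) {
        // ... work on element (i, j) ...
    }
}

collapse is available from OpenMP 3.0, which your compiler already supports.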
My problem: I am trying to get GCC to vectorize a nested loop.
Compiler flags I added to the basic flags:
-fopenmp
-march=native
-msse2 -mfpmath=sse
-ffast-math
-funsafe-math-optimizations
-ftree-vectorize
-fopt-info-vec-missed
Variables:
// local variables
float RTP, RPE, RR=0;
int SampleLoc, EN;
float THr;
long TN;
// global variables
extern float PR[MaxX - MinX][MaxY - MinY][MaxZ - MinZ];
extern float SampleOffset;
extern float SampleInt;
// global defines
Speed
Loops start at line 500. The code is parallelized but not vectorized. Here is the code:
#pragma omp parallel for
for (TN = 0; TN < NumXmtrFoci; TN++) {
for (XR = XlBound; XR <= XuBound; XR++) {
for (YR = YlBound; YR <= YuBound; YR++) {
for (ZR = ZlBound; ZR <= ZuBound; ZR++) {
for (EN = 0; EN < NUM_RCVR_ELE; EN++) {
RPE = REP[XR - MinX][YR - MinY][ZR - MinZ][EN];
RTP = RPT[XR - MinX][YR - MinY][ZR - MinZ][TN];
RR = RPE + RTP + ZT[TN];
SampleLoc = (int)(floor(RR/(SampleInt*Speed) + SampleOffset));
THr = TimeHistory[SampleLoc][EN];
PR[XR-MinX][YR-MinY][ZR-MinZ] += THr;
} /*for EN*/
} /*for ZR*/
} /*for YR*/
} /*for XR*/
} /*for TN*/
The loop bounds are all #define and range from -64 to 128. Loop iterator variables are of type int. Inner loop variables are of type float.
A sampling of the GCC 'not vectorized' messages relevant to this loop; some are repeated many times at many places in the code.
502|note: not vectorized: multiple nested loops.|
505|note: not vectorized: not suitable for gather load _61 = TimeHistory[_59][_85];|
500|note: not vectorized: no grouped stores in basic block.|
500|note: not vectorized: not enough data-refs in basic block.|
MY QUESTION IS: What do the GCC compiler messages tell me to focus on to get the loop to vectorize?
I do not understand the messages adequately, and searching online has not provided answers so far. I thought multiple nested loops were not a problem, and I do not know what gather loads, grouped stores, or data-refs are.
The main point of the GCC report is that the expression TimeHistory[SampleLoc][EN] cannot easily be vectorized unless gather instructions are used. Indeed, SampleLoc is a variable that can contain non-linear values, and gather instructions are only available in the AVX2/AVX-512 instruction sets. Your compilation flags indicate you use SSE2 and possibly a more advanced instruction set available on the target platform (but that platform is not specified). Without AVX2/AVX-512, GCC cannot vectorize the loops because of such a non-contiguous access pattern. In fact, AVX2 gather instructions are a bit limited compared to general reads, so the compiler may not be able to use them even then. You can see the list of intrinsics/instructions here.
Additionally, the compiler needs to be sure TimeHistory is not modified in the loop. This seems trivial here, but in practice arrays can theoretically alias, so the compiler may fear a possible dependence between each read and subsequent writes and refuse to vectorize the gather loads. Splitting out (replicating) the innermost loop may help. Using the restrict keyword and const can also help.
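For illustration, here is a rough sketch of what "restrict plus splitting the innermost loop" could look like (the helper function, its signature and the flattened indexing are hypothetical; __restrict__ is GCC's spelling of C99 restrict in C++):

// Hypothetical helper: the caller first fills sample_loc[en] for all receivers
// (the replicated index-computation loop), then calls this to do the reads.
static inline float gather_receivers(const float *__restrict__ time_history,
                                     const int   *__restrict__ sample_loc,
                                     int num_rcvr_ele, int num_rcvr_cols)
{
    float sum = 0.0f;
    for (int en = 0; en < num_rcvr_ele; en++)
        sum += time_history[sample_loc[en] * num_rcvr_cols + en]; // gather-style read
    return sum;
}

The const and __restrict__ qualifiers promise that time_history is neither written nor aliased inside the loop, which removes the dependence the compiler worries about; whether the gather itself vectorizes still depends on -march enabling AVX2/AVX-512.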
I'm currently trying to parallelize an existing hierarchical MCMC sampling scheme. The majority of my (so far sequential) source code is written in RcppArmadillo, so I'd like to stick with this framework for parallelization, too.
Before starting to parallelize my code, I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the issue of thread safety with R/Rcpp data structures and OpenMP. The bottom line of all the posts I have read so far is: R and Rcpp are not thread safe, so don't call them from within an omp parallel pragma.
Because of this, the following Rcpp example causes a segfault, when called from R:
#include <Rcpp.h>
#include <omp.h>
using namespace Rcpp;
double rcpp_rootsum_j(Rcpp::NumericVector x)
{
Rcpp::NumericVector ret = sqrt(x);
return sum(ret);
}
// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum(Rcpp::NumericMatrix x, int cores = 2)
{
omp_set_num_threads(cores);
const int nr = x.nrow();
const int nc = x.ncol();
Rcpp::NumericVector ret(nc);
#pragma omp parallel for shared(x, ret)
for (int j=0; j<nc; j++)
ret[j] = rcpp_rootsum_j(x.column(j));
return ret;
}
As Drew explains in his blog post, the segfault happens due to a "hidden" copy, which Rcpp makes in the call to ret[j] = rcpp_rootsum_j(x.column(j));.
Since I'm interested in the behavior of RcppArmadillo in case of parallelization, I have converted Drew's example:
//[[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <omp.h>
double rcpp_rootsum_j_arma(arma::vec x)
{
arma::vec ret = arma::sqrt(x);
return arma::accu(ret);
}
// [[Rcpp::export]]
arma::vec rcpp_rootsum_arma(arma::mat x, int cores = 2)
{
omp_set_num_threads(cores);
const int nr = x.n_rows;
const int nc = x.n_cols;
arma::vec ret(nc);
#pragma omp parallel for shared(x, ret)
for (int j=0; j<nc; j++)
ret(j) = rcpp_rootsum_j_arma(x.col(j));
return ret;
}
Interestingly, the semantically equivalent code does not cause a segfault.
The second thing I noticed during my research is that the aforementioned statement (R and Rcpp are not thread safe, so don't call them from within an omp parallel pragma) does not always seem to hold. For example, the call in the next example does not cause a segfault, although we're reading from and writing to Rcpp data structures.
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::export]]
Rcpp::NumericMatrix rcpp_sweep_(Rcpp::NumericMatrix x, Rcpp::NumericVector vec)
{
Rcpp::NumericMatrix ret(x.nrow(), x.ncol());
#pragma omp parallel for default(shared)
for (int j=0; j<x.ncol(); j++)
{
#pragma omp simd
for (int i=0; i<x.nrow(); i++)
ret(i, j) = x(i, j) - vec(i);
}
return ret;
}
My Questions
Why does the code from the first example cause a segfault, while the code from examples two and three does not?
How do I know whether it is safe to call a method (arma::mat.col(i)) or unsafe (Rcpp::NumericMatrix.column(i))? Do I have to read the framework's source code every time?
Any suggestions on how to avoid these "opaque" situations like in example one?
It might be pure coincidence that my RcppArmadillo example does not fail. See Dirk's comments below.
EDIT 1
In his answer and in both of his comments, Dirk strongly recommends studying the examples in the Rcpp Gallery more closely.
Here are my initial assumptions:
Extracting rows, columns, etc. within an OpenMP pragma is generally not thread safe since it might call back into R to allocate new memory for a hidden copy.
Because RcppArmadillo relies on the same lightweight/proxy model for data structures as Rcpp, my first assumption also applies to RcppArmadillo.
Data structures from the std namespace should be somewhat safer since they don't use the same lightweight/proxy scheme.
Primitive data types also shouldn't cause problems as they live on the stack and don't require R to allocate and manage memory.
Here are code lines from three Rcpp Gallery examples that seem to contradict those assumptions:
Optimizing Code vs...
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
Hierarchical Risk Parity...
interMatrix(_, i) = MAT_COV(_, index_asset); // 3rd code example 3rd method
Using RcppProgress...
thread_sum += R::dlnorm(i+j, 0.0, 1.0, 0); // subsection OpenMP support
In my opinion, the first and second example clearly conflict with the assumptions I made in points one and two. Example three also gives me headaches, since to me it looks like a call back into R...
My updated questions
What is the difference between gallery examples one and two and my first code example?
Where did I get lost in my assumptions?
Any recommendations, besides the Rcpp Gallery and GitHub, on how to get a better idea of the interaction between Rcpp and OpenMP?
Before starting to parallelize my code, I read a couple of blog posts on Rcpp/OpenMP. In the majority of these blog posts (e.g. Drew Schmidt, wrathematics) the authors warn about the issue of thread safety with R/Rcpp data structures and OpenMP. The bottom line of all the posts I have read so far is: R and Rcpp are not thread safe, so don't call them from within an omp parallel pragma.
That is a well-known limitation of R itself not being thread-safe. That means you cannot call back into R, or trigger R events -- which may happen with Rcpp unless you are careful. To be plain: the constraint has nothing to do with Rcpp; it simply means you cannot blindly drop into OpenMP via Rcpp. But you can if you're careful.
We have countless examples of success with OpenMP and related tools both in numerous packages on CRAN, on the Rcpp Gallery and via extension packages like RcppParallel.
You appear to have been very selective in what you chose to read on this topic, and you ended up with something somewhere between wrong and misleading. I suggest you turn to the several examples on the Rcpp Gallery that deal with OpenMP / RcppParallel, as they address this very problem. Or if you're in a hurry: look up RVector and RMatrix in the RcppParallel documentation.
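For illustration, a minimal sketch of your first example redone with RcppParallel (the worker name and layout are mine; the point is that RMatrix / RVector are plain, non-allocating views, so the worker threads never touch R's memory manager):

// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <cmath>

// Worker that computes the column-wise sum of square roots.
struct ColRootSum : public RcppParallel::Worker
{
    const RcppParallel::RMatrix<double> input;  // thread-safe, non-allocating view
    RcppParallel::RVector<double> output;

    ColRootSum(const Rcpp::NumericMatrix input, Rcpp::NumericVector output)
        : input(input), output(output) {}

    void operator()(std::size_t begin, std::size_t end)
    {
        for (std::size_t j = begin; j < end; j++) {
            double s = 0.0;
            for (std::size_t i = 0; i < input.nrow(); i++)
                s += std::sqrt(input(i, j));
            output[j] = s;
        }
    }
};

// [[Rcpp::export]]
Rcpp::NumericVector rcpp_rootsum_parallel(Rcpp::NumericMatrix x)
{
    Rcpp::NumericVector ret(x.ncol());
    ColRootSum worker(x, ret);
    RcppParallel::parallelFor(0, x.ncol(), worker);
    return ret;
}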
Resources:
Six posts on OpenMP at RcppGallery
RcppParallel site
RcppParallel CRAN package
and your largest resource may be some targeted search at GitHub for code involving R, C++ and OpenMP. It will lead you to numerous working examples.
I am quite confused about the ways to specify the number of threads in the parallel part of my code.
I know I can use:
the environment variable OMP_NUM_THREADS
function omp_set_num_threads(int)
num_threads(int) in #pragma omp parallel for num_threads(NB_OF_THREADS)
From what I have gathered so far, the first two are equivalent. But what about the third one?
Can someone provide a more detailed explanation of the difference? I could not find any information on the internet regarding the difference between 1/2 and 3.
OMP_NUM_THREADS and omp_set_num_threads() are not equivalent. The environment variable is only used to set the initial value of the nthreads-var ICV (internal control variable) which controls the maximum number of threads in a team. omp_set_num_threads() can be used to change the value of nthreads-var at any time (outside of any parallel regions, of course) and affects all subsequent parallel regions. Therefore setting a value, e.g. n, to OMP_NUM_THREADS is equivalent to calling omp_set_num_threads(n) before the very first parallel region is encountered.
The algorithm to determine the number of threads in a parallel region is very clearly described in the OpenMP specification that is available freely on the OpenMP website:
if a num_threads clause exists
then let ThreadsRequested be the value of the num_threads clause expression;
else let ThreadsRequested = value of the first element of nthreads-var;
The priority of the different ways to set nthreads-var is listed in the ICV Override Relationships part of the specification:
The num_threads clause and omp_set_num_threads() override the value of the OMP_NUM_THREADS environment variable and the initial value of the first element of the nthreads-var ICV.
Translated into human language, that is:
OMP_NUM_THREADS (if present) specifies initially the number of threads;
calls to omp_set_num_threads() override the value of OMP_NUM_THREADS;
the presence of the num_threads clause overrides both other values.
The actual number of threads used is also affected by whether dynamic team sizes are enabled (dyn-var ICV settable via OMP_DYNAMIC and/or omp_set_dynamic()), by whether a thread limit is imposed by thread-limit-var (settable via OMP_THREAD_LIMIT), as well as by whether nested parallelism (OMP_NESTED / omp_set_nested()) is enabled or not.
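A small sketch of how those settings interact (illustrative only; the numbers you actually observe depend on the implementation, on dyn-var and on any thread limit):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);   // don't let the runtime shrink team sizes on its own
    omp_set_nested(1);    // allow inner parallel regions to create real teams

    #pragma omp parallel num_threads(4)        // clause overrides nthreads-var
    {
        #pragma omp parallel num_threads(2)    // nested team (needs nesting enabled)
        {
            #pragma omp single
            printf("outer thread %d has an inner team of %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_num_threads());
        }
    }
    return 0;
}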
Think of it like scope. Option 3 (num_threads) sets the number of threads for the current team of threads only. The other options are global/state settings. I generally don't set the number of threads and instead just use the defaults. When I do change the number of threads, it's usually only in special cases, so I use option three so that the next time I create a parallel team it goes back to the global (default) setting. See the code below: after I use option 3, the next team of threads goes back to the last global setting.
#include <stdio.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
#pragma omp single
{
printf("%d\n", omp_get_num_threads());
}
}
omp_set_num_threads(8);
#pragma omp parallel
{
#pragma omp single
{
printf("%d\n", omp_get_num_threads());
}
}
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
printf("%d\n", omp_get_num_threads());
}
}
#pragma omp parallel
{
#pragma omp single
{
printf("%d\n", omp_get_num_threads());
}
}
}
4
8
2
8
I've looked at the official definitions, but I'm still quite confused.
firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value it has before the parallel construct.
To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.
lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
I feel like I understand this one a bit better because of the following example:
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i+1];
}
a[i] = b[i];
So, in this example, I understand that lastprivate allows i to be carried outside the loop with the last value it had inside it.
I just started learning OpenMP today.
private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:
#include <stdio.h>
#include <omp.h>
int main (void)
{
int i = 10;
#pragma omp parallel private(i)
{
printf("thread %d: i = %d\n", omp_get_thread_num(), i);
i = 1000 + omp_get_thread_num();
}
printf("i = %d\n", i);
return 0;
}
With four threads it outputs something like:
thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10
(another run of the same program)
thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10
This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).
If i is made firstprivate, then it is initialised with the value that it has before the parallel region:
thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10
Still modifications to the value of i inside the parallel region are not visible after it.
You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).
So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.
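A tiny worksharing example that uses both clauses at once (just a sketch; the values are arbitrary):

#include <stdio.h>

int main(void)
{
    int offset = 100;   // firstprivate: each thread's copy starts at 100
    int last   = -1;    // lastprivate: receives the value from the last iteration

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 8; i++)
        last = offset + i;

    printf("last = %d\n", last);   // prints 107, the value from iteration i == 7
    return 0;
}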
Note that reading an uninitialized local variable such as i is undefined behaviour in C++ (spelled out explicitly since the C++14 standard), so compilers may warn about the demonstration program above.
How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is preferably done with a small amount of assembly so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (or should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state; just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert the pin.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute each instruction. This will give you the total number of cycles required.
Time = # cycles / processor clock frequency
This is easier to do for smaller functions, or code written in assembler (for a PIC microcontroller for example)
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
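For example, on x86 with GCC or Clang the time-stamp counter can be read through the __rdtsc() intrinsic (sketch only; converting ticks to seconds, serializing the measurement and handling rollover are left out):

#include <stdio.h>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang, x86 only

static volatile int sink;            // keeps the work from being optimized away

static void function_under_test(void)
{
    for (int i = 0; i < 1000; i++)
        sink += i;                   // stand-in for the code being measured
}

int main(void)
{
    unsigned long long start = __rdtsc();
    function_under_test();
    unsigned long long stop  = __rdtsc();

    printf("elapsed: %llu time-stamp ticks\n", stop - start);
    return 0;
}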
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
so:
// begin timing
for (int i = 0; i < 10000; i++) {
invokeFunction();
}
// end time
// divide by 10000 to get actual time.
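A concrete version of that idea, using clock() from <time.h> as one possible timer (sketch; invokeFunction() is a stand-in for the real call, and the volatile guard is only there so the compiler cannot delete the loop):

#include <stdio.h>
#include <time.h>

static volatile int guard;

static void invokeFunction(void)
{
    guard++;   // stand-in for the function being measured
}

int main(void)
{
    const int reps = 10000;

    clock_t begin = clock();                 // begin timing
    for (int i = 0; i < reps; i++)
        invokeFunction();
    clock_t end = clock();                   // end timing

    double avg = (double)(end - begin) / CLOCKS_PER_SEC / reps;
    printf("average per call: %.9f seconds\n", avg);
    return 0;
}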
If you're using Linux, you can time a program's runtime by typing on the command line:
time [program_name]
If you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions), but I also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
myFunction();
myFunction();
}
lap = getTicks();
repeat n times {
myFunction();
}
finish = getTicks();
// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling myFunction() twice in the first loop and once in the second loop, you could call it just once in the first loop and not at all (i.e. an empty loop) in the second; however, the empty loop could be optimized away by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values to get a delta in "ticks". Using QueryPerformanceFrequency() you can convert the "delta ticks" to an actual time unit. You can refer to the MSDN documentation about those WIN32 calls.
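A minimal sketch (desktop Win32 shown; error checking omitted):

#include <windows.h>
#include <stdio.h>

static void function_to_measure(void)
{
    Sleep(1);   // stand-in for the real work
}

int main(void)
{
    LARGE_INTEGER freq, start, stop;

    QueryPerformanceFrequency(&freq);   // counter ticks per second
    QueryPerformanceCounter(&start);
    function_to_measure();
    QueryPerformanceCounter(&stop);

    double seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}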
Other embedded systems
Without an operating system, or with only a basic OS, you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time spans longer than the resolution of the chosen timer).
before your function, save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
do the same after your function.
subtract them to get a delta counter tick.
from there it is just a matter of knowing how long a tick is on your CPU/hardware, given the external clock and the divider you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: Do not forget to disable interrupts before and restore them after reading those timer values (both the carry and the register value), otherwise you risk saving incorrect values.
NOTES
This is very fast because it is only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is, AFTER your function.
You may wish to put that code into a function so you can reuse it all around, but it may slow things down a bit because of the function call and the pushing of all the registers to the stack, plus the parameters, and then popping them again. In an embedded system this may be significant. In C it may be better to use macros instead, or to write your own assembly routine that saves/restores only the relevant registers.
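As a rough sketch of the snapshot part (everything hardware-specific here is a placeholder: the timer register address, its width, and the interrupt enable/disable macros all depend on your CPU and toolchain):

#include <stdint.h>

extern volatile uint32_t timer_carry;   /* incremented by the timer-overflow ISR */

/* Placeholders -- replace with your hardware's real register and intrinsics. */
#define TIMER_COUNT          (*(volatile uint16_t *)0x40000024u)
#define DISABLE_INTERRUPTS() __asm__ volatile ("cpsid i")   /* e.g. Cortex-M */
#define ENABLE_INTERRUPTS()  __asm__ volatile ("cpsie i")

typedef struct {
    uint32_t carry;   /* number of timer overflows so far */
    uint16_t count;   /* current hardware timer value     */
} timer_snapshot;

static timer_snapshot take_snapshot(void)
{
    timer_snapshot s;
    DISABLE_INTERRUPTS();        /* keep carry and count consistent */
    s.carry = timer_carry;
    s.count = TIMER_COUNT;
    ENABLE_INTERRUPTS();
    return s;
}

/* Take one snapshot before and one after the function; combine each as
   ((uint32_t)carry << 16) | count, subtract, and convert to time OUTSIDE
   the measured zone. */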
It depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#define SEC_TO_NSEC(s) ((s) * 1000 * 1000 * 1000)
int work_function(int c) {
    // do some work here
    int i, j;
    int foo = 0;
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            foo ^= i + j;
        }
    }
    return foo;
}
int main(int argc, char *argv[]) {
struct timespec pre;
struct timespec post;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
work_function(0);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);
printf("time %d\n",
(SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
(SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
return 0;
}
You will need to link this with the realtime library, just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page for clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command-line cpuset to force the code onto only one core.
If you are looking to measure user and system time, you could use times(NULL), which returns something like a jiffies count. Or you can change the parameter of clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC... but be careful of wrap-around with CLOCK_MONOTONIC.
For other platforms, you are on your own.
Drew
I always implement an interrupt-driven ticker routine. It updates a counter that counts the number of milliseconds since start-up. This counter is then accessed with a GetTickCount() function.
Example:
#define TICK_INTERVAL 1 // milliseconds between ticker interrupts
static unsigned long tickCounter;
interrupt ticker (void)
{
tickCounter += TICK_INTERVAL;
...
}
unsigned long GetTickCount(void)
{
return tickCounter;
}
In your code you would time the code as follows:
int function(void)
{
    unsigned long startTicks = GetTickCount();
    // do something ...
    printf("Time is %lu", GetTickCount() - startTicks);
    return 0;
}
In OS X terminal (and probably Unix, too), use "time":
time python function.py
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
http://linux.die.net/man/3/clock_gettime
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#nanoTime()
If you want to get at the hardware counters, try PAPI:
http://icl.cs.utk.edu/papi/
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.