sleep(seconds) in wasm keeps CPU usage high - sleep

My wasm code calls the POSIX sleep(seconds) function. The call is there to limit CPU consumption, but I notice no difference with or without sleep, whether I pass 1 or 1000 seconds.
My code initially had this structure:
void myfunc(u32 *buff){
    u32 size = 16;
    while (1){
        for (u32 i = 0; i < size; i++){
            // do stuff
        }
    }
}
myfunc() is called by a Web Worker and raises the CPU usage from 3% to 28%; when I terminate() the Web Worker the CPU drops back to 3%.
So I added a limiter to mitigate the CPU usage and keep it lower:
#include <unistd.h>

void myfunc(u32 *buff){
    u32 size = 16;
    while (1){
        sleep(1); // 1s or 1000s, same behavior
        for (u32 i = 0; i < size; i++){
            // do stuff
        }
    }
}
but this change has no effect on CPU usage. I can only see that the sleep works and the thread is suspended for the requested time.
The for loop takes a fraction of a second, so the time spent sleeping is greater than the time spent running.
I would add that when I run my tests there are no other CPU-intensive processes running, so I would expect a lower CPU usage with, for instance, sleep(1000).

This suggests that your environment implements the sleep function with a busy-wait loop (you could probably verify that with a debugger).
If the stack-switching proposal were ready, merged, and implemented in your environment, sleep could probably be built on awaiting a promise, but stack switching is not ready yet.
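In the meantime, here is a minimal sketch of a workaround, assuming the module is built with Emscripten (an assumption about your toolchain, not stated in the question): linking with ASYNCIFY (-sASYNCIFY) lets emscripten_sleep() suspend the wasm code and hand control back to the event loop instead of spinning, so the worker really goes idle between iterations. Note that ASYNCIFY instruments the generated code, so expect some size and speed overhead.

#include <emscripten.h>

void myfunc(u32 *buff){
    u32 size = 16;
    while (1){
        emscripten_sleep(1000); // with ASYNCIFY this yields to the event loop for ~1 s
        for (u32 i = 0; i < size; i++){
            // do stuff
        }
    }
}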

Related

Perf impact of sampling rate on performance - higher sample rates cost *less* overhead on NXP S32?

I am using perf in sampling mode to capture performance statistics of programs running on a multi-core NXP S32 platform running Linux 4.19.
Example configuration:
Core 0 - App0, Core 1 - App1, Core 2 - App2
Without sampling, i.e. at program level, App0 takes 6.9 seconds.
On sampling at 1 million cycles, App0 takes 6.3 sec.
On sampling at 2 million cycles, App0 takes 6.4 sec.
On sampling at 5 million cycles, App0 takes 6.5 sec.
On sampling at 100 million cycles, App0 takes 6.8 sec.
As you can see, with a larger sampling period (100 million cycles) App0 takes longer to finish.
I would have expected the opposite: sampling at 1 million cycles should make the program take more time to execute, due to the higher number of samples generated (perf overhead), compared with 100 million cycles.
I am unable to explain this behavior; what do you think is causing it?
Any leads would be helpful.
P.S. On the Pi 3B the behavior is as expected, i.e. sampling at 1 million cycles results in a longer execution time than sampling at 100 million cycles.
UPDATE: I do not use perf from the command line; instead I make the perf_event_open system call directly, with the following flags set in struct perf_event_attr.
struct perf_event_attr hw_event;
pid_t pid = proccess_id;  // measure the current process/thread
int cpu = -1;             // measure on any CPU
unsigned long flags = 0;
int fd_current;

memset(&hw_event, 0, sizeof(struct perf_event_attr));
hw_event.type = event_type;
hw_event.size = sizeof(struct perf_event_attr);
hw_event.config = event;
if (group_fd == -1)
{
    hw_event.sample_period = 2000000;
    hw_event.sample_type = PERF_SAMPLE_READ;
    hw_event.precise_ip = 1;
}
hw_event.disabled = 1;        // off by default; the counter starts out disabled
hw_event.exclude_kernel = 0;  // 0 = do NOT exclude events that happen in kernel space
hw_event.exclude_hv = 1;      // exclude events that happen in the hypervisor
hw_event.pinned = pinned;     // keep the counter on the CPU if at all possible;
                              // applies only to hardware counters and group leaders
hw_event.exclude_user = 0;    // 0 = do NOT exclude events that happen in user space
hw_event.exclude_callchain_kernel = 0;  // 0 = kernel callchains are included
hw_event.exclude_callchain_user = 0;    // 0 = user callchains are included
hw_event.read_format = PERF_FORMAT_GROUP;  // all counter values in the event group
                                           // can be read with one read()
fd_current = syscall(__NR_perf_event_open, &hw_event, pid, cpu, group_fd, flags);
if (fd_current == -1) {
    printf("Error opening leader %llx\n", hw_event.config);
    exit(EXIT_FAILURE);
}
return fd_current;
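For completeness, here is a hedged sketch (mine, not from the question) of how the group leader returned above might be enabled and read back, given that it is opened with disabled = 1 and read_format = PERF_FORMAT_GROUP: one read() on the leader returns the number of events followed by one value per group member. run_and_read and group_read are names made up for this sketch.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct group_read {
    uint64_t nr;           // number of events in the group
    uint64_t values[16];   // one counter value per event (sized generously)
};

void run_and_read(int leader_fd)
{
    struct group_read data;

    ioctl(leader_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    // ... run the code being measured ...

    ioctl(leader_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    if (read(leader_fd, &data, sizeof(data)) > 0) {
        for (uint64_t i = 0; i < data.nr; i++)
            printf("event %llu: %llu\n",
                   (unsigned long long)i, (unsigned long long)data.values[i]);
    }
}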

Why does disabling interrupts make your code run even slower?

I was doing a bit of benchmarking of how efficiently do_gettimeofday() and getnstimeofday() run in the kernel. I did the experiment by compiling a kernel module and using the following code in its module_init function:
int i;
struct timeval st, en, diff;
struct timespec s, e, dif;

do_gettimeofday(&st);
for (i = 0; i < 10000000; i++)
    do_gettimeofday(&en);
diff.tv_sec = en.tv_sec - st.tv_sec;
if (en.tv_usec < st.tv_usec) {
    diff.tv_usec = 1000000 + en.tv_usec - st.tv_usec;
    diff.tv_sec--;
}
else
    diff.tv_usec = en.tv_usec - st.tv_usec;

getnstimeofday(&s);
for (i = 0; i < 10000000; i++)
    getnstimeofday(&e);
dif = timespec_sub(e, s);

printk("do_gettimeofday: %d times in %lu.%06lu\n", i, diff.tv_sec, diff.tv_usec);
printk("getnstimeofday: %d times in %lu.%09lu\n", i, dif.tv_sec, dif.tv_nsec);
On an AR9331 based development board I got the following output from dmesg:
do_gettimeofday: 10000000 times in 4.452656
getnstimeofday: 10000000 times in 3.170668494
However, if I disable interrupts by enclosing my code between local_irq_save and local_irq_restore, i.e. do the following
unsigned long flags;

local_irq_save(flags);
...{run the above code}...
local_irq_restore(flags);
and run the above code, I got the following output instead:
do_gettimeofday: 10000000 times in 5.417230
getnstimeofday: 10000000 times in 3.661163701
To my understanding, running the code with interrupts enabled should result in a longer running time (e.g. if an interrupt fires while the code is running, the CPU jumps to the interrupt handler and only jumps back later). It feels weird that the code ran even slower with all interrupts disabled on the device. Could anyone explain this behavior?

Keeping Track of Time With Mbed

I'm using the mbed platform to program a motion controller on an ARM MCU.
I need to determine the time at each iteration of a while loop, but am struggling to think of the best way to do this.
I have two potential methods:
1) Define how many iterations can be done per second and use "wait" so each iteration occurs after a regular interval. I can then increment a counter to determine time.
2) Capture system time before going into the loop and then continuously loop, subtracting current system time from original system time to determine time.
Am I thinking along the right tracks or have I completely missed it?
Your first option isn't ideal since the wait and counter portions will throw off the numbers and you will end up with less accurate information about your iterations.
The second option is viable depending on how you implement it. mbed has a library called "Timer.h" that would be an easy solution to your problem. The timer function is interrupt based (using Timer3 if you use a LPC1768) you can see the handbook here: mbed .org/ handbook /Timer. ARM supports 32-bit addresses as part of the Cortex-M3 processors, which means the timers are 32-bit int microsecond counters. What that means for your usability is that this library can keep time up to a maximum of 30 minutes so they are ideal for times between microseconds and seconds (if need more time than that then you will need a real-time clock). It's up to you if you want to know the count in milliseconds or microseconds. If you want micro, you will need to call the function read_us() and if you want milli you will use read_ms(). The utilization of the Timer interrupts will affect your time by 1-2 microseconds, so if you wish to keep track down to that level instead of milliseconds you will have to bear that in mind.
Here is a sample code for what you are trying to accomplish (based on an LPC1768 and written using the online compiler):
#include "mbed.h"
#include "Timer.h"
Timer timer;
Serial device (p9,p10);
int main() {
device.baud(19200); //setting baud rate
int my_num=10; //number of loops in while
int i=0;
float sum=0;
float dev=0;
float num[my_num];
float my_time[my_num]; //initial values of array set to zero
for(int i=0; i<my_num; i++)
{
my_time[i]=0; //initialize array to 0
}
timer.start(); //start timer
while (i < my_num) //collect information on timing
{
printf("Hello World\n");
i++;
my_time[i-1]=timer.read_ms(); //needs to be the last command before loop restarts to be more accurate
}
timer.stop(); //stop timer
sum=my_time[0]; //set initial value of sum to first time input
for(i=1; i < my_num; i++)
{
my_time[i]=my_time[i]-my_time[i-1]; //making the array hold each loop time
sum=sum+my_time[i]; //sum of times for mean and standard deviation
}
sum = sum/my_num; //sum of times is now the mean so as to not waste memory
device.printf("Here are the times for each loop: \n");
for(i=0; i<my_num; i++)
{
device.printf("Loop %d: %.3f\n", i+1, my_time[i]);
}
device.printf("Your average loop time is %.3f ms\n", sum);
for(int i=0; i<my_num; i++)
{
num[i]= my_time[i]-sum;
dev = dev +(num[i])*(num[i]);
}
dev = sqrt(dev/(my_num-1)); //dev is now the value of the standard deviation
device.printf("The standard deviation of your loops is %.3f ms\n", dev);
return 0;
}
Another option is the SysTick timer, which can be used in a similar way to the functions above and would make your code more portable to any ARM device with a Cortex-Mx core, since it is based on the processor's system timer (read more here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/Babieigh.html). It really depends on how precise and portable you want your project to be!
Original source: http://community.arm.com/groups/embedded/blog/2014/09/05/intern-inquiry-95
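Here is a minimal sketch of the SysTick alternative mentioned above, assuming a bare CMSIS project where nothing else (e.g. an RTOS) already owns the SysTick interrupt. "LPC17xx.h" is the CMSIS device header for the LPC1768 (substitute your target's header), and msTicks is just a name made up for this sketch.

#include "LPC17xx.h"             // CMSIS device header (brings in SysTick_Config and SystemCoreClock)

volatile uint32_t msTicks = 0;   // free-running millisecond counter

void SysTick_Handler(void)       // overrides the weak CMSIS handler
{
    msTicks++;                   // one tick per millisecond
}

int main(void)
{
    SysTick_Config(SystemCoreClock / 1000);     // request a 1 ms tick

    uint32_t start = msTicks;
    while (1) {
        // ... loop body ...
        uint32_t elapsed_ms = msTicks - start;  // time since the loop began
        (void)elapsed_ms;
    }
}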

What caused my elapsed time to be much longer than user time?

I am benchmarking some R statements (see details here) and found that my elapsed time is way longer than my user time.
   user  system elapsed
  7.910   7.750  53.916
Could someone help me understand what factors (R or hardware) determine the difference between user time and elapsed time, and how I can improve it? In case it helps: I am running data.table data manipulation on a MacBook Air with a 1.7 GHz i5 and 4 GB of RAM.
Update: My crude understanding is that user time is how long my CPU takes to process my job, and elapsed time is the length of time from when I submit a job until I get the data back. What else did my computer need to do after processing for 8 seconds?
Update: as suggested in the comments, I ran it a couple of times on two data.tables: Y, with 104 columns (sorry, I add more columns as time goes by), and X, a subset of Y with only its 3 key columns. Below are the updates. Please note that I ran these two procedures consecutively, so the memory state should be similar.
X <- Y[, list(Year, MemberID, Month)]

system.time({
    X[, Month := -Month]
    setkey(X, Year, MemberID, Month)
    X[, Month := -Month]
})
   user  system elapsed
  3.490   0.031   3.519

system.time({
    Y[, Month := -Month]
    setkey(Y, Year, MemberID, Month)
    Y[, Month := -Month]
})
   user  system elapsed
  8.444   5.564  36.284
Here are the sizes of the only two objects in my workspace (commas added):
object.size(X)
83,237,624 bytes
object.size(Y)
2,449,521,080 bytes
Thank you
User time is how many seconds the computer spent doing your calculations. System time is how much time the operating system spent responding to your program's requests. Elapsed time is the sum of those two, plus whatever "waiting around" your program and/or the OS had to do. It's important to note that these numbers are the aggregate of time spent. Your program might compute for 1 second, then wait on the OS for one second, then wait on disk for 3 seconds and repeat this cycle many times while it's running.
Based on the fact that your program took about as much system time as user time, it was doing something very IO intensive: reading from disk a lot or writing to disk a lot. RAM is pretty fast, a few hundred nanoseconds usually, so if everything fits in RAM elapsed time is usually just a little bit longer than user time. But disk might take a few milliseconds to seek and even longer to reply with the data. That's slower by a factor of a million.
We've determined that your processor was "doing stuff" for ~8 + ~8 = ~ 16 seconds. What was it doing for the other ~54 - ~16 = ~38 seconds? Waiting for the hard drive to send it the data it asked for.
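A small sketch (an addition, not part of the original answer) showing the same three quantities measured from inside a C program: getrusage() reports the user and system CPU time, clock_gettime() reports wall-clock ("elapsed") time, and a large gap between the two means the process spent that time waiting on something (disk, network, being scheduled out).

#include <stdio.h>
#include <sys/resource.h>
#include <time.h>

int main(void)
{
    struct timespec wall_start, wall_end;
    struct rusage usage;

    clock_gettime(CLOCK_MONOTONIC, &wall_start);

    /* ... the work being measured ... */

    clock_gettime(CLOCK_MONOTONIC, &wall_end);
    getrusage(RUSAGE_SELF, &usage);

    double wall = (wall_end.tv_sec - wall_start.tv_sec)
                + (wall_end.tv_nsec - wall_start.tv_nsec) / 1e9;
    double user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
    double sys  = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;

    printf("user %.3f  sys %.3f  elapsed %.3f  waiting ~%.3f\n",
           user, sys, wall, wall - (user + sys));
    return 0;
}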
UPDATE1:
Matthew has made some excellent points: I'm making assumptions that I probably shouldn't be making. Adam, if you'd care to publish a list of all the columns in your table (data types are all we need), we can get a better idea of what's going on.
I just cooked up a little do-nothing program to validate my assumption that time not spent in userspace and kernel space is likely spent waiting for IO.
#include <stdio.h>

int main()
{
    int i;
    for(i = 0; i < 1000000000; i++)
    {
        int j, k, l, m;
        j = 10;
        k = i;
        l = j + k;
        m = j + k - i + l;
    }
    return 0;
}
When I run the resulting program and time it I see something like this:
mike#computer:~$ time ./waste_user
real 0m4.670s
user 0m4.660s
sys 0m0.000s
mike#computer:~$
As you can see by inspection, the program does no real work, and as such it doesn't ask the kernel to do anything beyond loading it into RAM and starting it running. So nearly ALL the "real" time is spent as "user" time.
Now a kernel-heavy do-nothing program (with fewer iterations to keep the time reasonable):
#include <stdio.h>

int main()
{
    FILE * random;
    random = fopen("/dev/urandom", "r");

    int i;
    for(i = 0; i < 10000000; i++)
    {
        fgetc(random);
    }
    return 0;
}
When I run that one, I see something more like this:
mike#computer:~$ time ./waste_sys
real 0m1.138s
user 0m0.090s
sys 0m1.040s
mike#computer:~$
Again it's easy to see by inspection that the program does little more than ask the kernel to give it random bytes. /dev/urandom is a non-blocking source of entropy. What does that mean? The kernel uses a pseudo-random number generator to quickly generate "random" values for our little test program. That means the kernel has to do some computation but it can return very quickly. So this program mostly waits for the kernel to compute for it, and we can see that reflected in the fact that almost all the time is spent on sys.
Now we're going to make one little change. Instead of reading from /dev/urandom which is non-blocking we'll read from /dev/random which is blocking. What does that mean? It doesn't do much computing but rather it waits around for stuff to happen on your computer that the kernel developers have empirically determined is random. (We'll also do far fewer iterations since this stuff takes much longer)
#include <stdio.h>

int main()
{
    FILE * random;
    random = fopen("/dev/random", "r");

    int i;
    for(i = 0; i < 100; i++)
    {
        fgetc(random);
    }
    return 0;
}
And when I run and time this version of the program, here's what I see:
mike#computer:~$ time ./waste_io
real 0m41.451s
user 0m0.000s
sys 0m0.000s
mike#computer:~$
It took 41 seconds to run, but immeasurably small amounts of time on user and sys. Why is that? All the time was spent in the kernel, but not doing active computation. The kernel was just waiting for stuff to happen. Once enough entropy was collected the kernel would wake back up and send the data back to the program. (Note it might take much less or much more time to run on your computer depending on what all is going on). I argue that the difference in time between user+sys and real is IO.
So what does all this mean? It doesn't prove that my answer is right because there could be other explanations for why you're seeing the behavior that you are. But it does demonstrate the differences between user compute time, kernel compute time and what I'm claiming is time spent doing IO.
Here's my source for the difference between /dev/urandom and /dev/random:
http://en.wikipedia.org/wiki//dev/random
UPDATE2:
I thought I would try and address Matthew's suggestion that perhaps L2 cache misses are at the root of the problem. The Core i7 has a 64 byte cache line. I don't know how much you know about caches, so I'll provide some details. When you ask for a value from memory the CPU doesn't get just that one value, it gets all 64 bytes around it. That means if you're accessing memory in a very predictable pattern -- like say array[0], array[1], array[2], etc -- it takes a while to get value 0, but then 1, 2, 3, 4... are much faster. Until you get to the next cache line, that is. If this were an array of ints, 0 would be slow, 1..15 would be fast, 16 would be slow, 17..31 would be fast, etc.
http://software.intel.com/en-us/forums/topic/296674
In order to test this out I've made two programs. They both have an array of structs in them with 1024*1024 elements. In one case the struct holds a single double, in the other it holds 8 doubles. A double is 8 bytes long, so in the second program each struct fills a whole 64-byte cache line and we're accessing memory in the worst possible fashion for the cache. The first program gets to use the cache nicely.
#include <stdio.h>
#include <stdlib.h>

#define MANY_MEGS 1048576

typedef struct {
    double a;
} PartialLine;

int main()
{
    int i, j;
    PartialLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(PartialLine);

    printf("Striding through %d total bytes, %zu bytes at a time\n",
           total_bytes, sizeof(PartialLine));

    many_lines = (PartialLine*) malloc(total_bytes);

    PartialLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}
When I run this program I see this output:
mike#computer:~$ time ./cache_hits
Striding through 8388608 total bytes, 8 bytes at a time
real 0m3.194s
user 0m3.140s
sys 0m0.016s
mike#computer:~$
Here's the program with the big structs, they each take up 64 bytes of memory, not 8.
#include <stdio.h>
#include <stdlib.h>

#define MANY_MEGS 1048576

typedef struct {
    double a, b, c, d, e, f, g, h;
} WholeLine;

int main()
{
    int i, j;
    WholeLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(WholeLine);

    printf("Striding through %d total bytes, %zu bytes at a time\n",
           total_bytes, sizeof(WholeLine));

    many_lines = (WholeLine*) malloc(total_bytes);

    WholeLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}
And when I run it, I see this:
mike#computer:~$ time ./cache_misses
Striding through 67108864 total bytes, 64 bytes at a time
real 0m14.367s
user 0m14.245s
sys 0m0.088s
mike#computer:~$
The second program -- the one designed to have cache misses -- took five times as long to run for the exact same number of memory accesses.
Also worth noting is that in both cases, all the time was spent in user, not sys. That means the OS counts the time your program spends waiting for data against your program, not against the operating system. Given these two examples I think it's unlikely that cache misses are causing your elapsed time to be substantially longer than your user time.
UPDATE3:
I just saw your update that the really slimmed down table ran about 10x faster than the regular-sized one. That too would indicate to me that (as another Matthew also said) you're running out of RAM.
Once your program tries to use more memory than your computer actually has installed, it starts swapping to disk. This is better than your program crashing, but it's much slower than RAM and can cause substantial slowdowns.
I'll try and put together an example that shows swap problems tomorrow.
UPDATE4:
Okay, here's an example program which is very similar to the previous one. But now the struct is 4096 bytes, not 8 bytes. In total this program will use 2GB of memory rather than 64MB. I also change things up a bit and make sure that I access things randomly instead of element-by-element, so that the kernel can't get smart and start anticipating my program's needs. The caches are driven by hardware (driven solely by simple heuristics), but it's entirely possible that kswapd (the kernel swap daemon) could be substantially smarter than the cache.
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double numbers[512];
} WholePage;

int main()
{
    int memory_ops = 1024*1024;
    int total_memory = memory_ops / 2;
    int num_chunks = 8;
    int chunk_bytes = total_memory / num_chunks * sizeof(WholePage);
    int i, j, k, l;

    printf("Bouncing through %d MB, %zu bytes at a time\n",
           chunk_bytes/1024*num_chunks/1024, sizeof(WholePage));

    WholePage* many_pages[num_chunks];
    for(i = 0; i < num_chunks; i++)
    {
        many_pages[i] = (WholePage*) malloc(chunk_bytes);
        if(many_pages[i] == 0){ exit(1); }
    }

    WholePage* page_list;
    WholePage* page;
    double x;
    for(i = 0; i < 300*memory_ops; i++)
    {
        j = rand() % num_chunks;
        k = rand() % (total_memory / num_chunks);
        l = rand() % 512;

        page_list = many_pages[j];
        page = page_list + k;
        x = page->numbers[l];
    }
    return 0;
}
Going from the program I called cache_hits to cache_misses, we saw memory use increase 8x and execution time increase 5x. What do you expect to see when we run this program? It uses 32x as much memory as cache_misses but has the same number of memory accesses.
mike#computer:~$ time ./page_misses
Bouncing through 2048 MB, 4096 bytes at a time
real 2m1.327s
user 1m56.483s
sys 0m0.588s
mike#computer:~$
It took 8x as long as cache_misses and 40x as long as cache_hits. And this is on a computer with 4GB of RAM. I used 50% of my RAM in this program versus 1.5% for cache_misses and 0.2% for cache_hits. It got substantially slower even though it wasn't using up ALL the RAM my computer has. It was enough to be significant.
I hope this is a decent primer on how to diagnose problems with programs running slow.

OpenCL performance on Lion

Maybe (or rather not maybe) I am doing something wrong, but it seems that I can't get good performance out of an OpenCL kernel, even though it runs significantly faster on the GPU than on the CPU.
Let me explain.
CPU kernel running time is ~100ms.
GPU kernel running time is ~8ms.
The above was measured on a command queue created with clCreateCommandQueue and the CL_QUEUE_PROFILING_ENABLE flag.
The problem, however, is the time needed to call (enqueue) the kernel repeatedly.
200 kernel calls on CPU: ~19s
200 kernel calls on GPU: ~18s
The above was measured with calls to gettimeofday before and after the loop of 200 enqueues, with a clFinish just after the loop to wait until all 200 enqueued kernels are done.
Moreover, only enqueueing and executing the kernel was timed; no data transfer from/to the kernel was involved.
Here is the loop:
size_t global_item_size = LIST_SIZE;

Start_Clock(&startTime);
for(int k=0; k<200; k++)
{
    // Execute the OpenCL kernel on the list
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, NULL, 0, NULL, &event);
}
clFinish(command_queue);
printf("] (in %0.4fs)\n", Stop_Clock(&startTime));
If 200 calls to the kernel take ~18s then it's completely irrelevant that the kernel on GPU is several times faster than on CPU...
What am I doing wrong?
EDIT
I ran some additional tests and it seems that it is actually the assignment of the result of the computation to the output buffer that produces the overhead.
This kernel
__kernel void test_kernel(__global const float *A, __global const float *B, __global float *C)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the work
    C[i] = sqrt(sin(A[i]) + tan(B[i]));
}
executed 200 times has the timings as above. But if I change the C[i] line to
float z = sqrt(sin(A[i]) + tan(B[i]));
then this kernel takes 0.3s on CPU and 2.6s on GPU.
Interesting.
I wonder if it would be possible to speed up the execution by collecting the results in a __local array and then assigning them to the output buffer C only in the last kernel call? (the work-item with the last global id, not the 200th enqueued kernel)
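A hedged sketch, not part of the original question: since the command queue is already created with CL_QUEUE_PROFILING_ENABLE, the event returned by one representative enqueue can be queried to break the per-call cost into queued-to-submit, submit-to-start and start-to-end phases, which shows whether the ~18 s is enqueue/driver overhead or actual kernel execution. This fragment reuses command_queue, kernel, global_item_size, ret and event from the loop above.

cl_ulong t_queued, t_submit, t_start, t_end;

ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL, 0, NULL, &event);
clWaitForEvents(1, &event);   // make sure the event has completed before querying it

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &t_queued, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &t_submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &t_start,  NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &t_end,    NULL);

// all timestamps are in nanoseconds
printf("queued->submit %llu ns, submit->start %llu ns, start->end %llu ns\n",
       (unsigned long long)(t_submit - t_queued),
       (unsigned long long)(t_start - t_submit),
       (unsigned long long)(t_end - t_start));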
