Why can't the sum of "Functions With Most Individual Work" be more than 100%? - visual-studio

I'm using the VS2010 built-in profiler.
My application contains three threads.
One of the threads is really simple:
while (true) {
    if (something) {
        // blah blah, very fast and rarely occurring thing
    }
    Thread.sleep(1000);
}
Visual Studio reports that Thread.sleep takes 36% of the program time.
The question is "why not ~100% of the time?" Why Main methods takes 40% of the time, I definitely was inside this method durring application execution from start to end.
Do profiler devides the result to the number of the threads?
On my another thread I've observed that method takes 34% of the time.
What does it mean? Does it mean that it works only 34% of the time or it works almost all the time?
In my opinion if I have three threads that run in parallel, and if I sum methods time I should get 300% (if application runs for 10 seconds for example, this means that each thread runs for 10 seconds, and if there are 3 threads - it would be 30 seconds totally)

The question is what you are measuring and how you do it. From your question I'm actually unable to reproduce your experience...
The Thread.Sleep() call itself takes a very small amount of time. Its job is to call a native WinAPI function that tells the scheduler (which is responsible for dividing processor time between threads) that the calling thread should not be scheduled at all for the next second. After that, the thread doesn't receive any processor time until that second is over.
But the thread does not take any processor time in that state. I'm not sure how this situation is reported by the profiler.
Here is the code I was experimenting with:
using System;
using System.Threading;

internal class Program
{
    private static int x = 0;

    private static void A()
    {
        // Just to have something in the profiler here
        Console.WriteLine("A");
    }

    private static void Main(string[] args)
    {
        var t = new Thread(() => { while (x == 0) Thread.MemoryBarrier(); });
        t.Start();

        while (true)
        {
            if (DateTime.Now.Millisecond % 3 == 0)
                A();
            Thread.Sleep(1000);
        }
    }
}

Related

Split a loop into smaller parts

I have a function inside a loop that takes a lot of resources and usually does not complete unless the server is under low load.
How can I split it into smaller loops? When I decrease the value, the function executes fine.
As an example, this works:
x = 10
for i = 0; i <= x; i++
{
    myfunction(i)
}
However, when I increase x to 100, the memory-hogging function stops working.
How can one split 100 into chunks of 10 (for example) and run the loop 10 times?
I should be able to use any value, not only 100 or multiples of 10.
Thank you.
Is your function asynchronous, so that you end up with too many instances at once? Is it opening and not closing resources? Perhaps you could put a delay in after every 10 iterations?
x = 1000
for i = 0; i <= x; i++
{
    myfunction(i);
    if (i % 10 == 0)
    {
        Thread.Sleep(1000);
    }
}
If your task is asynchronous, you can use the worker thread pool technique.
For example:
You can create a thread pool with 10 threads.
First you assign 10 tasks to them.
Whenever a task finishes, you assign one of the remaining tasks to that thread; a minimal sketch follows below.
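The question's code is language-neutral, so purely as an illustration, here is a sketch of that pool approach in Java using ExecutorService (myfunction is a hypothetical stand-in for the resource-heavy work):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedWork
{
    // Hypothetical placeholder for the heavy function from the question.
    static void myfunction(int i)
    {
        System.out.println("processing " + i);
    }

    public static void main(String[] args) throws InterruptedException
    {
        int x = 100; // total number of iterations
        // A fixed pool of 10 workers: at most 10 iterations run at once;
        // as soon as one finishes, the next queued iteration starts.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i <= x; i++)
        {
            final int n = i;
            pool.submit(() -> myfunction(n));
        }
        pool.shutdown(); // accept no new tasks, let the queued ones drain
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

This keeps only 10 instances in flight at a time while still covering any value of x, not just multiples of 10.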

Does a lambda expression in Java 8 reduce execution time?

I am new to Java 8 and getting a bit confused about the scope of lambda expressions. I read some articles claiming that lambda expressions reduce execution time, so to check this I wrote the following two programs:
1) Without using Lambda Expression
import java.util.*;

public class testing_without_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        for (int number : numbers)
        {
            System.out.println(number);
        }
        long stopTime = System.currentTimeMillis();
        System.out.print("without lambda:");
        System.out.println(stopTime - startTime);
    } // end main
}
output:
2) With using Lambda Expression
import java.util.*;

public class testing_with_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        numbers.forEach((Integer value) -> System.out.println(value));
        long stopTime = System.currentTimeMillis();
        System.out.print("with lambda:");
        System.out.print(stopTime - startTime);
    } // end main
}
output:
Does this mean a lambda expression requires more time to execute?
No general statement about "execution time" is possible, as even the term "execution time" doesn't always mean the same thing. Of course, there is no reason why just using a lambda expression should reduce execution time in general.
Your code is measuring the initialization time of the code plus its execution time, which is fair when you consider the total execution time of that tiny program, but for real-life applications it has no relevance, as they usually run significantly longer than their initialization time.
What makes the drastic difference in initialization time is the fact that the JRE uses the Collection API itself internally, so its classes are loaded and initialized, and possibly even optimized to some degree, before your application even starts (so you don't measure their costs). In contrast, it doesn't use lambda expressions, so your first use of a lambda expression will load and initialize an entire framework behind the scenes.
Since you are usually interested in how code would perform in a real application, where the initialization already happened, you would have to execute the code multiple times within the same JVM to get a better picture. However, allowing the JVM’s optimizer to process the code bears the possibility that it gets over-optimized due to its simpler nature (compared to a real life scenario) and shows too optimistic numbers then. That’s why it’s recommended to use sophisticated benchmark tools, developed by experts, instead of creating your own. Even with these tools, you have to study their documentation to understand and avoid the pitfalls. See also How do I write a correct micro-benchmark in Java?
When you compare the for loop, also known as "external iteration", with the equivalent forEach call, also known as "internal iteration", the latter does indeed bear the potential of being more efficient, if properly implemented by the particular Collection, but its outcome is hard to predict, as the JVM's optimizer is good at removing the drawbacks of the other solution. Also, your list is far too small to ever exhibit this difference, if it exists.
It must also be emphasized that this principle is not tied to lambda expressions. You could also implement the Consumer via an anonymous inner class, and in your case, where the example suffers from the first-time initialization cost, the anonymous inner class would be faster than the lambda expression.
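To make that last point concrete, here is a minimal sketch of the same forEach call written both ways; the Consumer logic is identical, only the way it is expressed differs:

import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ConsumerForms
{
    public static void main(String[] args)
    {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);

        // Consumer as an anonymous inner class: only plain class loading,
        // no lambda bootstrap machinery involved.
        numbers.forEach(new Consumer<Integer>()
        {
            @Override
            public void accept(Integer value)
            {
                System.out.println(value);
            }
        });

        // The same Consumer as a lambda expression: the first use triggers
        // the invokedynamic-based lambda infrastructure behind the scenes.
        numbers.forEach(value -> System.out.println(value));
    }
}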
In addition to Holger's answer, I want to show how you could have benchmarked your code.
What you're really measuring is the initialization of classes and the system's IO (i.e. class loading and System.out::println).
In order to get rid of these, you should use a benchmark framework like JMH. In addition, you should measure with multiple list or array sizes.
Then your code may look like this:
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    @Param({ "10", "1000", "10000" })
    private int length;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        Random random = new Random(length);
        numbers = random.ints(length).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public void externalIteration(Blackhole bh) {
        for (Integer number : numbers) {
            bh.consume(number);
        }
    }

    @Benchmark
    public void internalIteration(Blackhole bh) {
        numbers.forEach(bh::consume);
    }
}
And the results:
Benchmark                      (length)  Mode  Cnt      Score      Error  Units
MyBenchmark.externalIteration        10  avgt   30     41,002 ±    0,263  ns/op
MyBenchmark.externalIteration      1000  avgt   30   4026,842 ±   71,318  ns/op
MyBenchmark.externalIteration     10000  avgt   30  40423,629 ±  572,055  ns/op
MyBenchmark.internalIteration        10  avgt   30     40,783 ±    0,815  ns/op
MyBenchmark.internalIteration      1000  avgt   30   3888,240 ±   28,790  ns/op
MyBenchmark.internalIteration     10000  avgt   30  41961,320 ±  991,047  ns/op
As you can see there is little to no difference.
I don't think lambda expression code will always be faster to execute; it really depends on the conditions. Could you maybe point me to the article where you read that lambda expressions are faster in execution time?
(It is certainly considered faster to write, due to the functional programming style.)
I ran your test again locally and found something strange:
The first time I ran the code without the lambda, it took almost the same time as with the lambda, in fact slightly more (49 milliseconds).
From the second time onward, the code without the lambda ran very much faster, i.e. in 1 millisecond.
The code with the lambda expression ran in the same amount of time every time; I tried around 3-4 times in total.
General Results:
1
2
3
4
5
6
with lambda:47
1
2
3
4
5
6
without lambda:1
I think it would take a really large sample of numbers to test this properly, and also multiple calls to the same code to remove any initialization burden on the JVM. This is a pretty small test sample.
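As a rough illustration of that last point (not a proper benchmark; see the JMH answer above), one could repeat the same measurement several times inside a single JVM run and watch the first iteration pay the initialization cost:

import java.util.Arrays;
import java.util.List;

public class RepeatedTiming
{
    public static void main(String[] args)
    {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        for (int run = 0; run < 5; run++)
        {
            long start = System.nanoTime();
            numbers.forEach(value -> System.out.println(value));
            long elapsed = System.nanoTime() - start;
            // The first run pays for class loading and the lambda bootstrap;
            // later runs in the same JVM are typically much cheaper.
            System.out.println("run " + run + ": " + elapsed + " ns");
        }
    }
}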

How many tasks in parallel

I built a sample program to check the performance of tasks in parallel, with respect to the number of tasks running in parallel.
A few assumptions:
The operation on one thread is independent of the other threads, so no synchronization mechanisms between threads are needed.
The idea is to check whether it is more efficient to:
1. Spawn as many tasks as possible, or
2. Restrict the number of tasks in parallel and wait for some tasks to complete before spawning the remaining ones.
Following is the program:
static void Main(string[] args)
{
    System.IO.StreamWriter writer = new System.IO.StreamWriter("C:\\TimeLogV2.csv");
    SemaphoreSlim availableSlots;
    for (int slots = 10; slots <= 20000; slots += 10)
    {
        availableSlots = new SemaphoreSlim(slots, slots);
        int maxTasks;
        CountdownEvent countDownEvent;
        Stopwatch watch = new Stopwatch();
        watch.Start();
        maxTasks = 20000;
        countDownEvent = new CountdownEvent(maxTasks);
        for (int i = 0; i < maxTasks; i++)
        {
            Console.WriteLine(i);
            Task task = new Task(() => Thread.Sleep(50));
            task.ContinueWith((t) =>
            {
                availableSlots.Release();
                countDownEvent.Signal();
            });
            availableSlots.Wait();
            task.Start();
        }
        countDownEvent.Wait();
        watch.Stop();
        writer.WriteLine("{0},{1}", slots, watch.ElapsedMilliseconds);
        Console.WriteLine("{0}:{1}", slots, watch.ElapsedMilliseconds);
    }
    writer.Flush();
    writer.Close();
}
Here are the results:
The Y-axis is the time taken in milliseconds; the X-axis is the number of semaphore slots (refer to the above program).
Essentially the trend is: the more parallel tasks, the better. Now my question is: under what conditions does
more parallel tasks = less optimal (in time taken)?
One condition I suppose is that:
The tasks are interdependent, and may have to wait for certain resources to be available.
Have you in any scenario limited the number of parallel tasks?
The TPL will control how many threads are running at once - basically you're just queuing up tasks to be run on those threads. You're not really running all those tasks in parallel.
The TPL will use work-stealing queues to make it all as efficient as possible. If you have all the information about what tasks you need to run, you might as well queue them all to start with, rather than trying to micro-manage it yourself. Of course, this will take memory - so that might be a concern, if you have a huge number of tasks.
I wouldn't try to artificially break your logical tasks into little bits just to get more tasks, however. You shouldn't view "more tasks == better" as a general rule.
(I note that you're including time taken to write a lot of lines to the console in your measurements, by the way. I'd remove those Console.WriteLine calls and try again - they may well make a big difference.)

What's the purpose of sleep(long millis, int nanos)?

In the JDK, it's implemented as:
public static void sleep(long millis, int nanos)
throws InterruptedException {
    if (millis < 0) {
        throw new IllegalArgumentException("timeout value is negative");
    }

    if (nanos < 0 || nanos > 999999) {
        throw new IllegalArgumentException(
                "nanosecond timeout value out of range");
    }

    if (nanos >= 500000 || (nanos != 0 && millis == 0)) {
        millis++;
    }

    sleep(millis);
}
which means the nanos argument does essentially nothing (beyond possibly rounding millis up by one).
Is the idea that, on hardware with more accurate timing, the JVM can provide a better implementation for it?
A regular OS doesn't have fine-grained enough resolution to sleep for nanoseconds at a time. However, real-time operating systems exist, where scheduling an event to take place at an exact moment in time is critical and latencies for many operations are VERY low. An ABS system is one example of an RTOS application. Sleeping for nanoseconds is much more useful on such systems than on normal OSes, where the OS can't reliably sleep for any period less than 15 ms.
However, having two separate JDKs is no solution. Hence on Windows and Linux the JVM will make a best attempt to sleep for x nanoseconds.
It looks like a future-proof addition, for when we all have petaflop laptops and we routinely specify delays in nanoseconds. Meanwhile if you specify a nanosecond delay, you get a millisecond delay.
When hardware improves and the JVM follows, the app will not need to be rewritten.
The problem with future-proofing is backward compatibility. This method has worked this way for so long that if you want sub-microsecond delays you have to use different methods.
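Purely as an illustration of such "different methods", today's usual alternatives express the delay in nanoseconds directly, e.g. LockSupport.parkNanos() or a busy wait on System.nanoTime(); a minimal sketch (actual accuracy still depends on the OS timer resolution):

import java.util.concurrent.locks.LockSupport;

public class FineGrainedDelays
{
    public static void main(String[] args)
    {
        // Ask the scheduler for a roughly 50 microsecond pause.
        LockSupport.parkNanos(50_000);

        // Busy-wait for roughly 50 microseconds; burns CPU, but avoids the
        // scheduler's millisecond-level granularity.
        long deadline = System.nanoTime() + 50_000;
        while (System.nanoTime() < deadline)
        {
            // spin
        }
    }
}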
For comparison,
Object.wait(millis, nano);

How do you measure the time a function takes to execute?

How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is done preferably with a little assembly instruction so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state; just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert the pin.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute every instruction. This will give you the number of cycles required.
Time = # cycles / processor clock rate (i.e. # cycles * clock period)
This is easier to do for smaller functions, or for code written in assembler (for a PIC microcontroller, for example).
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
so:
// begin timing
for (int i = 0; i < 10000; i++) {
    invokeFunction();
}
// end timing
// divide by 10000 to get the average time per call.
If you're using Linux, you can time a program's runtime by typing on the command line:
time [function_name]
If you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions) but also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
    myFunction();
    myFunction();
}
lap = getTicks();
repeat n times {
    myFunction();
}
finish = getTicks();

// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling function() twice in the first loop and once in the second loop, you could just call it once in the first loop and not call it at all (i.e. an empty loop) in the second; however, the empty loop could be optimized out by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values to get a delta in "ticks". Using QueryPerformanceFrequency() you can convert the "delta ticks" to an actual time unit. You can refer to the MSDN documentation about those WIN32 calls.
Other embedded systems
Without operating systems or with only basic OSes you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time longer than the resolution of the timer chosen).
before your function you save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
do the same after your function.
subtract the two to get a delta counter tick.
from there it is just a matter of knowing how long a tick is on your CPU/hardware, given the external clock and the prescaler you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: do not forget to disable interrupts before, and restore them after, getting those timer values (both the carry and the register value), otherwise you risk saving incorrect values.
NOTES
This is very fast because it is only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is, AFTER your function.
You may wish to put that code into a function so you can reuse it all around, but that may slow things down a bit because of the function call and the pushing of all the registers to the stack, plus the parameters, then popping them again. In an embedded system this may be significant. In C it may be better to use macros instead, or to write your own assembly routine that saves/restores only the relevant registers.
It depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define SEC_TO_NSEC(s) ((s) * 1000 * 1000 * 1000)

int work_function(int c) {
    // do some work here
    int i, j;
    int foo = 0;
    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            foo ^= i + j;
        }
    }
    return foo;
}

int main(int argc, char *argv[]) {
    struct timespec pre;
    struct timespec post;

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
    work_function(0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);

    printf("time %ld\n",
           (SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
           (SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
    return 0;
}
You will need to link this with the realtime library, just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page on clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command-line cpuset to force the code onto only one core.
If you are looking to measure user and system time, then you could use times(NULL), which returns something like jiffies. Or you can change the parameter for clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC... but be careful of wraparound with CLOCK_MONOTONIC.
For other platforms, you are on your own.
Drew
I always implement an interrupt-driven ticker routine. This updates a counter that counts the number of milliseconds since start-up. This counter is then accessed with a GetTickCount() function.
Example:
#define TICK_INTERVAL 1 // milliseconds between ticker interrupts

static unsigned long tickCounter;

interrupt ticker (void)
{
    tickCounter += TICK_INTERVAL;
    ...
}

unsigned long GetTickCount(void)
{
    return tickCounter;
}
In your code you would time the code as follows:
int function(void)
{
    unsigned long ticks = GetTickCount();

    do something ...

    printf("Time is %lu", GetTickCount() - ticks);
}
In OS X terminal (and probably Unix, too), use "time":
time python function.py
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
http://linux.die.net/man/3/clock_gettime
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#nanoTime()
If you want to get at the hardware counters, try PAPI:
http://icl.cs.utk.edu/papi/
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.
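For the embedded Java option above, a minimal System.nanoTime() sketch (assuming it is available in your edition) could look like the following; functionUnderTest is a hypothetical stand-in for the function being measured, and averaging over many calls keeps a millisecond-range function above the timer's noise floor:

public class NanoTiming
{
    static double sink; // keep a result so the JIT can't discard the work

    // Hypothetical stand-in for the short function being measured.
    static void functionUnderTest()
    {
        sink += Math.sqrt(12345.678);
    }

    public static void main(String[] args)
    {
        final int iterations = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            functionUnderTest();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("average: " + (elapsed / iterations) + " ns per call");
    }
}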
