Relationship between program size and nice value - linux-kernel

I'm currently studying program priority, and I was wondering whether program size has any impact on the program's nice value. For example, if a program is larger, would its priority be set higher, or vice versa?

if a program is larger would its priority be set higher or vice versa?
No, this is wrong! The nice value does not depend on a program's size.
The nice value influences how much CPU time a process may get, but not as an absolute amount of time in seconds/milliseconds/nanoseconds; under the current scheduler it translates into a relative share of the CPU (think 5%, 10%, 50%, 100% of a CPU).
In the current implementation the most interesting thing is that the absolute value doesn't really matter. What really matters is the difference between one process's nice value and another's: two processes at nice +10 and +11 split the CPU in (roughly) the same proportion as two processes at -5 and -4, because each nice level changes a task's scheduling weight by a constant factor of about 1.25.
The thing that is really responsible for scheduling processes is CFS (the Completely Fair Scheduler).
You'll find more about the scheduler in Documentation/scheduler in the kernel source tree.
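To see concretely why only the difference matters: CFS turns each nice level into a weight (the exact values live in the kernel's sched_prio_to_weight table, with roughly a 1.25x ratio between adjacent levels) and gives each runnable task CPU time in proportion to its weight. The sketch below is my own illustration using the 1.25-per-level approximation rather than the real kernel table, so the percentages are indicative only.

#include <math.h>
#include <stdio.h>

/* Approximate CFS weight for a nice level: nice 0 is ~1024 and each
 * step changes the weight by a factor of ~1.25 (the kernel uses a
 * precomputed table, sched_prio_to_weight, with the same shape).
 * Compile with: cc nice_share.c -lm */
static double weight(int nice)
{
    return 1024.0 / pow(1.25, nice);
}

static void show_pair(int a, int b)
{
    double wa = weight(a), wb = weight(b);
    printf("nice %+3d vs %+3d -> %.1f%% / %.1f%% of the CPU\n",
           a, b, 100.0 * wa / (wa + wb), 100.0 * wb / (wa + wb));
}

int main(void)
{
    show_pair(10, 11);   /* prints roughly 55.6% / 44.4% */
    show_pair(-5, -4);   /* the same split: only the difference matters */
    return 0;
}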

Related

jvmtop CPU usage >100%

I've been using jvmtop for a few months to monitor JVM statistics. When I compare its output with JConsole, I mostly see similar stats in jvmtop as well.
However, during a recent test execution I observed a few CPU% entries going above 100% (120% being the highest). Since I believe jvmtop reports cumulative CPU usage (unlike top, which gives more per-core detail), I need guidance on how to interpret these entries of more than 100% usage.
I have looked at the jvmtop code and can only conclude that its algorithm for computing CPU usage is completely screwed up. Let me explain.
In the simplified form, the formula looks like
CPU_usage = ΔprocessCpuTime / (ΔprocessRunTime * CPU_count)
The idea makes sense, but the problem is how it is actually implemented. I see at least 3 issues:
Process CPU time and process uptime are obtained by two independent method calls. Furthermore, those are remote calls made via RMI. This means there can be an arbitrarily long delay between obtaining those values. So ΔprocessCpuTime and ΔprocessRunTime are computed over different time intervals, and it is not quite correct to divide one by the other. When ΔprocessCpuTime is computed over a longer period than ΔprocessRunTime, the result may turn out to be > 100%.
Process CPU time is based on the OperatingSystemMXBean.getProcessCpuTime() call, while the process uptime relies on RuntimeMXBean.getUptime(). The problem is that these two methods use different clock sources and different time scales. Generally speaking, the times obtained from them are not comparable to each other.
CPU_count is computed as OperatingSystemMXBean.getAvailableProcessors(). However, the number of logical processors visible to the JVM is not always equal to the number of physical processors on the machine. For example, in a container getAvailableProcessors() may return a value based on cpu-shares, while in practice the JVM may use more physical cores. In this case CPU usage may again appear > 100%.
However, as far as I can see, in the latest version the final CPU load value is artificially capped at 99%:
return Math.min(99.0,
        deltaTime / (deltaUptime * osBean.getAvailableProcessors()));
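For comparison, the mismatched-interval problem goes away if both samples are taken back to back from the same source and the deltas cover the same interval. The following is a minimal C/Linux sketch of that idea (using clock_gettime and sysconf, i.e. not jvmtop's code); inside a JVM the equivalent would be to read getProcessCpuTime() and a monotonic wall clock such as System.nanoTime() in the same place, rather than through two separate remote calls.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Sample process CPU time and wall-clock time back to back, so that
 * both deltas cover (almost exactly) the same interval. */
static void sample(struct timespec *cpu, struct timespec *wall)
{
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, cpu);
    clock_gettime(CLOCK_MONOTONIC, wall);
}

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec cpu0, wall0, cpu1, wall1;
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

    sample(&cpu0, &wall0);
    for (volatile long i = 0; i < 100000000L; i++)  /* some busy work */
        ;
    sample(&cpu1, &wall1);

    double usage = seconds(cpu0, cpu1) / (seconds(wall0, wall1) * ncpu);
    printf("CPU usage: %.1f%% of %ld CPUs\n", 100.0 * usage, ncpu);
    return 0;
}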

Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly-language operations, it is still not possible to calculate the computation time from the number of instructions and the CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory accesses, you will likely have cache misses and have to load data from main memory (or even from disk, if the data has been paged out), which takes an unpredictable amount of time.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
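As an illustration of the fork approach mentioned above (my own sketch, not part of the original answer), here is a minimal C program that splits a simple summation loop across four child processes and collects the partial results through a pipe. Whether it actually runs close to 4x faster depends on how evenly the work is balanced and on the per-process overhead.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 4
#define N 400000000UL            /* total number of loop iterations */

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    for (int p = 0; p < NPROC; p++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); return 1; }
        if (pid == 0) {                        /* child: handle one quarter */
            unsigned long lo = p * (N / NPROC);
            unsigned long hi = (p + 1) * (N / NPROC);
            unsigned long long partial = 0;
            for (unsigned long i = lo; i < hi; i++)
                partial += i;                  /* the "work" */
            write(fd[1], &partial, sizeof partial);
            _exit(0);
        }
    }

    unsigned long long total = 0, partial;
    for (int p = 0; p < NPROC; p++) {
        read(fd[0], &partial, sizeof partial); /* collect one partial sum */
        total += partial;
    }
    while (wait(NULL) > 0)
        ;
    printf("sum = %llu\n", total);
    return 0;
}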

Cache miss in an Out of order processor

Imagine an application running on an out-of-order processor that has a lot of last-level cache (LLC) misses (more than 70%). Do you think that if we decrease the frequency of the processor and set it to a smaller value, the execution time of the application will increase significantly, or will it not be affected much? Could you please explain your answer?
Thanks and regards
As is the case with most micro-architectural features, the safe answer would be: "it might, and it might not - it depends on the exact characteristics of your application".
Take, for example, a workload that walks a large graph residing in memory: each node needs to be fetched and processed before the next node can be chosen. If you reduce the frequency, you hurt the execution phase, which is latency-critical, since the next set of memory accesses depends on it.
On the other hand, a workload that is bandwidth-bound (i.e. it performs as fast as the system memory bandwidth allows) is probably not fully utilizing the CPU and would therefore not be hurt as much.
Basically the question boils down to how well your application utilizes the CPU, or rather, where between the CPU and the memory the performance bottleneck lies.
By the way, note that even if reducing the frequency does increase the execution time, it could still be beneficial for your power/performance ratio; that depends on where along the power/perf curve you sit and on the exact values.
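One crude way to see which side of that divide a particular machine and workload are on is to time a latency-bound pointer chase against a bandwidth-bound streaming pass, once at each frequency setting (e.g. by switching cpufreq governors between runs). The sketch below is only my rough illustration of the two access patterns; real measurements would need bigger working sets, warm-up passes, and repeated runs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)              /* 16M elements, larger than a typical LLC */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);  /* pointer-chase workload */
    double *data = malloc(N * sizeof *data);  /* streaming workload */
    if (!next || !data) { fprintf(stderr, "out of memory\n"); return 1; }

    /* Sattolo shuffle: builds one random cycle so every load depends on
     * the previous one (a latency-bound access pattern). */
    for (size_t i = 0; i < N; i++) { next[i] = i; data[i] = 1.0; }
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = now();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];     /* dependent misses */
    double t1 = now();
    double sum = 0;
    for (size_t i = 0; i < N; i++) sum += data[i];  /* independent streams */
    double t2 = now();

    printf("pointer chase: %.3f s   streaming sum: %.3f s   (p=%zu sum=%.0f)\n",
           t1 - t0, t2 - t1, p, sum);
    return 0;
}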

Does Windows provide a monotonically increasing clock to applications

This question is inspired by Does Linux provide a monotonically increasing clock to applications.
Maybe I should be more precise:
I'm looking for a clock function that is strictly increasing, i.e. it never returns the same value twice, independent of how quickly two calls follow each other.
GetTickCount() is monotonic, but it is not strictly increasing: its granularity is typically 10-16 ms, so two calls made close together will return the same value (and its 32-bit result wraps after about 49.7 days). For a higher-fidelity counter, use QueryPerformanceCounter (together with QueryPerformanceFrequency to convert ticks to seconds). Neither of these counters depends on the time of day, but even QueryPerformanceCounter does not guarantee that two back-to-back calls return different values, so if you truly need strictly increasing values you have to enforce that yourself.
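If you really do need strictly increasing values (never the same twice, even for back-to-back calls from multiple threads), one common pattern is to layer a compare-exchange loop over QueryPerformanceCounter so that a reading that is not greater than the last handed-out value gets bumped by one tick. This is my own sketch of that idea, assuming QPC ticks are fine-grained enough for your purposes.

#include <stdio.h>
#include <windows.h>

/* Returns a strictly increasing 64-bit tick value based on
 * QueryPerformanceCounter: if two callers see the same raw reading,
 * the later caller gets last+1 instead. */
static volatile LONG64 g_last;

LONG64 strictly_increasing_ticks(void)
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);

    for (;;) {
        LONG64 last = g_last;
        LONG64 next = (now.QuadPart > last) ? now.QuadPart : last + 1;
        if (InterlockedCompareExchange64(&g_last, next, last) == last)
            return next;
        /* another thread updated g_last first; retry */
    }
}

int main(void)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    printf("QPC frequency: %lld ticks/s\n", (long long)freq.QuadPart);

    LONG64 a = strictly_increasing_ticks();
    LONG64 b = strictly_increasing_ticks();
    printf("%lld then %lld (b is always greater than a)\n",
           (long long)a, (long long)b);
    return 0;
}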

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique CPU cycle count, or something similar, that would be uniform across multiple processes/languages/systems/etc.?
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to @Greg, who put me on the track of QueryPerformanceCounter, which did the trick.
Here's an interesting article! It says not to use RDTSC, but to use QueryPerformanceCounter instead.
Conclusion:
Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operating systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]
However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.
Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).
For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).
For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code in the linked article.
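The article's own code is not reproduced here, but the hybrid scheme it describes is easy to sketch: sleep in ~1 ms chunks (with the timer resolution raised via timeBeginPeriod) while more than a couple of milliseconds remain, then busy-wait on QueryPerformanceCounter for the tail end. The following is my own rough C illustration of that idea, not the article's code.

#include <stdio.h>
#include <windows.h>
#pragma comment(lib, "winmm.lib")    /* timeBeginPeriod/timeEndPeriod */

/* Wait roughly `ms` milliseconds: Sleep(1) while plenty of time remains,
 * then spin on QueryPerformanceCounter for the final stretch. */
void precise_delay(double ms)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    LONGLONG target = start.QuadPart + (LONGLONG)(ms * freq.QuadPart / 1000.0);

    for (;;) {
        QueryPerformanceCounter(&now);
        if (now.QuadPart >= target)
            return;
        double remaining_ms =
            (target - now.QuadPart) * 1000.0 / (double)freq.QuadPart;
        if (remaining_ms > 2.0)
            Sleep(1);      /* ~1-2 ms once timeBeginPeriod(1) is in effect */
        /* otherwise fall through and keep polling QPC */
    }
}

int main(void)
{
    timeBeginPeriod(1);    /* raise the system timer resolution */
    LARGE_INTEGER freq, a, b;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&a);
    precise_delay(16.7);   /* e.g. one 60 Hz frame */
    QueryPerformanceCounter(&b);
    printf("waited %.3f ms\n",
           (b.QuadPart - a.QuadPart) * 1000.0 / (double)freq.QuadPart);
    timeEndPeriod(1);
    return 0;
}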
You can use the RDTSC CPU instruction (assuming x86). It reads the CPU's time-stamp counter. The full counter is 64 bits wide, so in practice it won't wrap around for many years, but if you keep only the low 32 bits it overflows within seconds at GHz clock rates. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.
System.Diagnostics.Stopwatch.GetTimestamp() returns the number of ticks of the underlying timer since a time origin (possibly when the computer started, but I'm not sure), and I've never seen it fail to increase between two calls.
The tick counts are specific to each computer, so you can't use them to merge log files between two computers.
RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.
Use GetTickCount and add a sequence counter as you merge the log files. It won't give you a perfect ordering between the different log files, but it will at least keep all entries from each file in the correct order.
