I am trying to programmatically find the preemption time between two processes on my Ubuntu laptop, by implementing the method described in this research paper: https://ieeexplore.ieee.org/document/8441042. The paper calculates the preemption time of processes on the VxWorks OS, and I want to do the same thing on my Ubuntu system.
As described in the paper, I created two processes (C programs with infinite loops) of different priorities using the nice command. The lower priority task is started immediately (at time-stamp t1), while the higher priority task is started after a gap of 10 microseconds (at time-stamp t2). The preemption time should then be t2 - t1 - 10 microseconds. Both processes are pinned to the same processor with the taskset command, and the chrt command is used to set the scheduling policy to FIFO.
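For concreteness, here is a simplified sketch of what each of the two looping programs could look like if the pinning and the FIFO policy were set inside the program rather than via taskset and chrt (the core and priority arguments are just illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[])
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;   /* CPU to pin to */
    int prio = (argc > 2) ? atoi(argv[2]) : 10;  /* SCHED_FIFO priority */

    /* Pin this process to one core (what taskset does). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Switch to the FIFO real-time policy (what chrt -f does); needs root. */
    struct sched_param sp = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* Record the moment this process actually starts looping. */
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    fprintf(stderr, "start %ld.%09ld\n", (long)t.tv_sec, t.tv_nsec);

    for (;;)   /* infinite busy loop, as in the paper's setup */
        ;
}
```

The 10 microsecond offset between launching the two copies still has to come from whatever starts them; the difference of the two printed timestamps, minus that offset, is the quantity I am treating as the preemption time.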
I cannot think of a better way to implement the ideas of this paper on my Ubuntu system, and I am not sure the above approach is foolproof. There are some obvious problems, such as the possibility that my processes get preempted by some other higher priority task, or get blocked for some reason.
So, if there is a better way to calculate preemption time programmatically, please suggest an approach or point me to resources where I can learn about it. Thanks.
Related
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown in the screenshot below.
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OSX is managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The [smp:8:8] part tells you how many scheduler threads the Erlang VM is running (and how many of them are online). You can disable SMP by passing a flag through to the runtime:
iex --erl '-smp disable'
This will ensure that you have only one scheduler thread. You can achieve a similar result by leaving symmetric multiprocessing enabled, but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:
iex --erl '+S 1:1'
Each scheduler runs in its own operating system thread and executes Erlang processes, and you can easily check how many schedulers you currently have online:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is completely idle (0%), then it is probable that distributing the load more evenly will not speed things up. Why?
CPU usage is measured by probing the processor state at many points in time. These states are either "working" or "idle". 82% CPU usage means that you can still run a couple more tasks on this CPU without slowing the others down.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to, because migration requires copying. Migration occurs, for example, when one of the schedulers is idle; it can then borrow a process from another scheduler's run queue.
Another thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT, so it might be that you are already utilizing your physical cores at 100%.
Another thing: pmap calculates each result in a separate process, but in the end every result is sent back to the caller, which may become a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like computing the Ackermann function. You can even estimate how much of your job is sequential and how much is parallel by measuring execution times for different numbers of cores and applying Amdahl's law.
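For reference, Amdahl's law gives the maximum speedup S(N) on N cores when a fraction p of the work can run in parallel:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
```

Timing the same job for a few different numbers of schedulers and fitting this curve gives you an estimate of p, the parallel fraction.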
To sum up: the CPU utilization from the screenshot looks really great! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to get the most use of the processors you need to break up your tasks into large enough pieces of work that will exceed the standard "reduction" slice of work. In general, pmap style parallelism only gives significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.
I didn't understand this sentence; please explain it to me in detail, using easy English:
Go routines are cooperatively scheduled, rather than relying on the kernel to manage their time sharing.
Disclaimer: this is a rough and inaccurate description of the scheduling in the kernel and in the go runtime aimed at explaining the concepts, not at being an exact or detailed explanation of the real system.
As you may (or may not) know, a CPU can't actually run two programs at the same time: a CPU core has only one execution thread, which can execute one instruction at a time. The direct consequence on early systems was that you couldn't run two programs at the same time, since each program needed (system-wise) a dedicated thread.
The solution currently adopted is called pseudo-parallelism: given a number of logical threads (e.g. multiple programs), the system executes one of them for a certain amount of time, then switches to the next one. Using really small time slices (on the order of milliseconds), this gives the human user the illusion of parallelism. This operation is called scheduling.
The Go language doesn't use this system directly: it implements its own scheduler that runs on top of the system scheduler and schedules the execution of goroutines itself, avoiding the cost of using a real OS thread for each goroutine. This type of system is called light (or green) threads.
Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing up than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is what it means for a program to "stop". Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's busyness factor.
So in the old days of DOS, when the computer was running (almost) only a single task, it was true that the CPU was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process that can put the CPU to sleep, throttle down its speed, etc.
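To make the busy-waiting vs. sleeping distinction concrete, here is a small C sketch (my own illustration, not tied to any particular application) of the two ways a program can "do nothing" for a while:

```c
#include <time.h>

/* Busy-waiting: burns CPU for the whole delay (what old DOS programs did). */
static void delay_busy(double seconds)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) +
             (now.tv_nsec - start.tv_nsec) / 1e9 < seconds);
}

/* Sleeping: asks the OS to deschedule us, so the core can idle or run others. */
static void delay_sleep(double seconds)
{
    struct timespec ts = {
        .tv_sec  = (time_t)seconds,
        .tv_nsec = (long)((seconds - (time_t)seconds) * 1e9)
    };
    nanosleep(&ts, NULL);
}
```

The first version keeps the core at 100% for the whole delay; the second lets the OS schedule the idle task (and possibly put the core into a sleep state) until the timer expires.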
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because the clock signal is gated off from the circuitry that would use it. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute instructions before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and had to flush them after a branch mispredict. So a better metric is to count only those instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
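On Linux, one rough way to collect those two counts yourself for a region of code is the perf_event_open interface; the sketch below uses the generic hardware events as stand-ins for the Intel-specific CPU_CLK_UNHALTED.CORE and INST_RETIRED.ANY counters, and the measured loop is just a placeholder:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Open one hardware counter for the current process on any CPU. */
static int perf_open(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;        /* start stopped; enable around the region */
    attr.exclude_kernel = 1;  /* user-space work only */
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cycles_fd = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int insns_fd  = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cycles_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(insns_fd,  PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(insns_fd,  PERF_EVENT_IOC_ENABLE, 0);

    /* ... the work you want to measure (placeholder loop) ... */
    volatile double x = 0;
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(insns_fd,  PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, insns = 0;
    read(cycles_fd, &cycles, sizeof(cycles));
    read(insns_fd,  &insns,  sizeof(insns));
    printf("CPI = %.2f (%llu cycles / %llu instructions)\n",
           (double)cycles / (double)insns,
           (unsigned long long)cycles, (unsigned long long)insns);
    return 0;
}
```

These are the same cycle and instruction counts that `perf stat` reports for a whole program run.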
Trying to get a more specific metric, such as only the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don't directly contribute to computing the solution (such as control instructions), is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, the average number of SIMD vector elements active per vector instruction executed. Vector instructions, such as SSE or AVX, are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your local friendly Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
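A minimal C sketch of that shape of program (an echo loop on standard input, my own example rather than anything OS-specific): it blocks in the kernel until there is input, handles it, and goes back to waiting, consuming no CPU in between:

```c
#include <poll.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd = { .fd = 0 /* stdin */, .events = POLLIN };

    for (;;) {
        /* Block in the kernel until there is something to read.
           While blocked, the process consumes no CPU time. */
        int ready = poll(&pfd, 1, -1);
        if (ready < 0) break;

        char buf[256];
        ssize_t n = read(0, buf, sizeof(buf));
        if (n <= 0) break;              /* EOF or error: stop */
        write(1, buf, (size_t)n);       /* "handle the event" */
    }
    return 0;
}
```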
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).
I have a multithreaded code that I want to run on all 4 cores that my processor has. I.e. I create four threads, and I want each of them to run on a separate core.
What happens is that it starts running on four cores, but occasionally would switch to only three cores. The only things running are the OS and my exe. This is somewhat disappointing, since it decreases performance by a quarter, which is significant enough for me.
The process affinity that I see in Task Manager allows the process to use any core. I tried restricting thread affinities, but it didn't help. I also tried increasing the priority of the process, but that did not help either.
So the question is, is there any way to force Windows to keep it running on all four cores? If this is not possible, can I reduce the frequency of these interruptions? Thanks!
This is not an issue of affinity unless I am very much mistaken. Certainly the system will not restrict your process's affinity to a specific set of processors. Some other program in the system would have to do that, if indeed that is happening.
Much more likely, however, is that there is simply another thread that is ready to run, and the system is scheduling threads in a round-robin fashion. You have four threads that are always ready to run. If there is another thread that is ready to run, it will get its turn. Now there are 5 threads sharing 4 processors. When the other thread is running, only 3 of yours are able to run.
If you want to be sure that such other threads won't run then you need to do one of the following:
Stop running the other program that wants to use CPU resource.
Make the relative thread priorities such that your threads always run in preference to the other thread.
Now, of these options, the first is to be preferred. If you prioritize your threads above others, then the other threads don't get to run at all. Is that really what you want to happen?
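If you do go with the second option anyway, here is a minimal sketch using SetThreadPriority (the chosen priority level is just an example):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Raise the calling thread above normal priority.
       THREAD_PRIORITY_TIME_CRITICAL would starve other threads even harder. */
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL)) {
        fprintf(stderr, "SetThreadPriority failed: %lu\n", GetLastError());
        return 1;
    }
    printf("priority raised; ready threads at normal priority now lose out\n");
    return 0;
}
```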
In the question you say that there are no other processes running. If that is the case, and nobody is meddling with processor affinity, and only a subset of your threads are executing, then the only conclusion is that not all of your threads are ready to run and have work to do. That might happen if you, for instance, join your threads at the end of one part of work, before continuing on to the next.
Perhaps the next step for you is to narrow things down a little. Use a tool like Process Explorer to diagnose which threads are actually running.
If this is Windows, try SetThreadAffinityMask():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
I would assume that if you only set a single bit, then that forces the thread to run only on the selected processor (core).
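A minimal sketch of that idea, pinning the calling thread to a single logical processor (the core index is just an example):

```c
#include <windows.h>
#include <stdio.h>

/* Pin the calling thread to a single logical processor (0-based index). */
static int pin_current_thread(int core)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core;   /* one bit set = one allowed core */
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (prev == 0) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_current_thread(2) == 0)          /* e.g. the third logical core */
        printf("Thread pinned to core 2\n");
    return 0;
}
```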
other process / thread functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847(v=vs.85).aspx
I use a Windows video program, and it's able to keep all the cores running at near max when rendering video.
I worked on VxWorks 5.5 a long time back and it was the best experience, working on the world's best real-time OS. Since then I never got a chance to work on it again. But a question keeps popping into my head: what makes it so fast and deterministic?
I have not been able to find many references for this question via Google.
So, I just tried thinking what makes a regular OS non-deterministic:
Memory allocation/de-allocation: Wikipedia says an RTOS uses fixed-size blocks, so that these blocks can be directly indexed, but this causes internal fragmentation and I am sure that is not at all desirable on mission-critical systems where memory is already limited.
Paging/segmentation: this is kind of linked to point 1.
Interrupt handling: I am not sure how VxWorks implements it, as this is something VxWorks handles very well.
Context switching: I believe in VxWorks 5.5 all tasks used to execute in the kernel address space, so context switching used to involve just saving register values and nothing like a full PCB (process control block) switch, but I am still not 100% sure.
Process scheduling algorithms: if Windows implements preemptive scheduling (priority/round robin), will process scheduling be as fast as in VxWorks? I don't think so. So, how does VxWorks handle scheduling?
Please correct my understanding wherever required.
I believe the following would account for lots of the difference:
No Paging/Swapping
A deterministic RTOS simply can't swap memory pages to disk. This would kill the determinism, since at any moment you could have to swap memory in or out.
vxWorks requires that your application fit entirely in RAM
No Processes
In vxWorks 5.5 there are tasks, but no processes like in Windows or Linux. The tasks are more akin to threads, and switching context between them is a relatively inexpensive operation. In Linux/Windows, switching processes is quite expensive.
Note that in vxWorks 6.x a process model was introduced, which adds some overhead, but mainly related to transitioning from user mode to supervisor mode. The task switching time is not necessarily directly affected by the new model.
Fixed Priority
In vxWorks, the task priorities are set by the developer and are system wide. The highest priority task that is ready at any given time will be the one running. You can thus design your system to ensure that the task with the tightest deadline always executes before the others.
In Linux/Windows, generally speaking, while you have some control over the priority of processes, the scheduler will eventually let lower priority processes run even if higher priority processes are still active.
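To make the fixed-priority model concrete, spawning two tasks at different priorities looks roughly like this (a rough sketch from memory of the VxWorks 5.x task API; the stack size and option flags are placeholder values):

```c
#include <vxWorks.h>
#include <taskLib.h>

/* Two endless workers; in VxWorks a lower priority number means higher
   priority (0 is the highest, 255 the lowest). */
void highPrioWork(void) { for (;;) { /* tightest-deadline work */ } }
void lowPrioWork(void)  { for (;;) { /* background work */ } }

void spawnTasks(void)
{
    /* The high priority task always runs in preference to the low priority
       one whenever it is ready. */
    taskSpawn("tHigh", 50,  0, 4096, (FUNCPTR)highPrioWork, 0,0,0,0,0,0,0,0,0,0);
    taskSpawn("tLow",  200, 0, 4096, (FUNCPTR)lowPrioWork,  0,0,0,0,0,0,0,0,0,0);
}
```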