I have multiple instances of a Ruby script (running on Linux) that does some automated downloading, and every 30 minutes it calls "ffprobe" to programmatically evaluate the video download.
During the downloading my processor sits at around 60%. However, every 30 minutes (when ffprobe runs), processor usage skyrockets to 100% for 1 to 3 minutes and sometimes ends up crashing other instances of the Ruby program.
Instead, I would like to allocate fewer CPU resources to the processor-heavy ffprobe so that it runs more slowly - i.e. I would like it to use, say, a maximum of 20% of the CPU, and it can run as long as it likes. One might then expect it to take 15 minutes to complete a task that now takes 1 to 3 minutes, and that's fine with me.
This would prevent crashes of my critical downloading program, which should have the highest priority.
Thank you!
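For what it's worth, a low-impact way to do this on Linux is to lower ffprobe's scheduling priority or pin it to a single core rather than letting it compete for everything; cpulimit can enforce an actual percentage cap. This is only a sketch - the file name is illustrative and cpulimit has to be installed separately:

$ nice -n 19 ffprobe video.mp4      # deprioritise ffprobe so the downloader wins any CPU contention
$ taskset -c 0 ffprobe video.mp4    # confine ffprobe to a single core so it cannot saturate the machine
$ cpulimit -e ffprobe -l 20         # cap any running ffprobe process at roughly 20% of one core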
I have a Fortran code that I am using to calculate some quantities related to the work that I do. The code itself involves several nested loops and requires very little disk I/O. Whenever the code is modified, I run it against a suite of several input files (just to make sure it's working properly).
To make a long story short, the most recent update has increased the run time of the program by about a factor of four, and running the input files serially on one CPU takes about 45 minutes (a long time to wait just to see whether anything is broken). Consequently, I'd like to run the input files in parallel across the 4 CPUs on the system. I've been attempting to implement the parallelism via a bash script.
The interesting thing I have noted is that, when only one instance of the program is running on the machine, it takes about three and a half minutes to crank through one of the input files. When four instances are running, it takes more like eleven and a half minutes per input file (bringing my total run time down from about 45 minutes to 36 minutes - an improvement, yes, but not quite what I had hoped for).
I've tried implementing the parallelism using GNU parallel, xargs, wait, and even just starting four instances of the program in the background from the command line. Regardless of how the instances are started, I see the same slowdown. Consequently, I'm pretty sure this isn't an artifact of the shell scripting, but something going on in the program itself.
I have tried rebuilding the program with debugging symbols turned off, and also using static linking. Neither of these had any noticeable impact. I'm currently building the program with the following options:
$ gfortran -Wall -g -O3 -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow,denormal -fbounds-check -finit-real=nan -finit-integer=nan -o [program name] {sources}
Any help or guidance would be much appreciated!
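Two of those flags, -fbounds-check and -ffpe-trap, add runtime checking that can itself be costly, so it may be worth timing an optimised build without them. For reference, a minimal way to fan the suite out over the 4 CPUs with GNU parallel might look like the sketch below (it assumes the program takes the input file as its only argument; the glob is hypothetical):

$ gfortran -Wall -O3 -o [program name] {sources}
$ parallel -j 4 ./[program name] ::: input*.dat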
On modern CPUs you cannot expect a linear speedup. There are several reasons:
Hyperthreading: GNU/Linux will see a hyperthread as a core even though it is not a real core. It is more like 30% of a core.
Shared caches: If your cores share the same cache and a single instance of your program uses the full shared cache, then you will get more cache misses if you run more instances.
Memory bandwidth: A similar case to the shared cache is the shared memory bandwidth. If a single thread uses the full memory bandwidth, then running more jobs in parallel may congest the bandwidth. This can partly be solved by running on a NUMA system where each CPU has some RAM that is "closer" than other RAM.
Turbo mode: Many CPUs can run a single thread at a higher clock rate than multiple threads. This is due to heat.
All of these will exhibit the same symptom: running a single thread will be faster than each of the multiple threads, but the total throughput of the multiple threads will be greater than that of the single thread.
Though I must admit your case sounds extreme: with 4 cores I would have expected a speedup of at least 2x.
How to identify the reason
Hyperthreading: Use taskset to select which cores to run on. If you use 2 of the 4 cores, is there any difference between using #1+#2 and #1+#3? (See the sketch after this list.)
Turbo mode: Use cpufreq-set to force a low frequency. Is the speed now the same whether you run 1 or 2 jobs in parallel?
Shared cache: I am not sure how to test this directly, but if it is somehow possible to disable the cache, then comparing 1 job to 2 jobs run at the same low frequency should give an indication.
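A rough sketch of the first two experiments (the program and input names are hypothetical, the frequency is just an example, and the core numbering is machine-specific - lscpu -e shows which logical CPUs are hyperthread siblings):

$ taskset -c 0,2 sh -c './myprog input1.dat & ./myprog input2.dat & wait'   # two jobs pinned to cores #1+#3
$ taskset -c 0,1 sh -c './myprog input1.dat & ./myprog input2.dat & wait'   # same two jobs pinned to cores #1+#2
$ sudo cpufreq-set -c 0 -u 1.6GHz -d 1.6GHz   # clamp core 0 to a fixed low frequency (repeat per core), then re-time 1 vs 2 jobs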
I am working on a calculation-intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time a particular algorithm takes varies. For example, sometimes running the algorithm 100 times takes about 1100 ms, and another time running it 100 times takes much longer, like 2000 or even 3000 ms. It may vary even within the same run. So it is impossible to measure improvement when I optimize a piece of code. It's just unreliable.
So basically I want to make sure one CPU core is dedicated to my app while I profile. The PC has an old dual-core Intel E5300 CPU running Windows 7 32-bit, so I can't just set process affinity once and give up a core forever; that would make the computer very slow for daily tasks. I need other apps to use a specific core while I'm profiling, and then, when I'm done, have the CPU affinities go back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: is it possible to have a bat file that sets process affinity for every process on Windows 7?
PS: The algorithm is correct and runs the same code path every time. I created an object pool, so after the first run no memory is allocated. I also profiled memory allocation with dotTrace and it showed no allocations after the first run, so I don't believe the GC is triggered while the algorithm is working. Physical memory is available and the system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It produced more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the CPU overheat alarm was off while the temperature was over 100C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is C#, so... the System.Diagnostics.Process object has a ProcessorAffinity property that should perform the same function.
I am just not sure that this will stabilize the CPU times quite the way you expect. A single busy task that is not doing I/O should remain scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are due more to other threads and processes interrupting your algorithm than to the OS randomly shunting your thread to a different core. Unless you set the affinity of every other thread in the system to exclude your preferred core, I can't see this helping.
Intermittently, when I type a command that involves Ruby (like ruby somefile.rb, rake, rspec spec, or irb), it takes a long time for the command to execute. For example, a few minutes ago, it took about a minute for irb to start. A few seconds ago, it took about a second.
While waiting for irb to start, I pressed Control + T repeatedly. Some output I saw included:
load: 1.62 cmd: ruby 12374 uninterruptible 0.45u 0.13s
load: 1.62 cmd: ruby 12374 uninterruptible 0.48u 0.13s
load: 1.62 cmd: ruby 12374 uninterruptible 0.53u 0.15s
On OSX, this output represents "load, command running, pid, status, and user and system CPU time used". It appears that when I had been waiting 53 seconds, the CPU time used was only 0.15 seconds.
My understanding of load is that it's roughly "how many cores are being used". Eg, on a one-core system, 1.0 is full utilization, but on a four-core machine, it's 25% utilization. I don't think the amount of load is the problem, because my machine is multi-core. Also, when irb starts quickly, I can get one line of output with Control + T that's also above 1.0.
load: 1.22 cmd: ruby 12452 running 0.26u 0.02s
I also notice that in the good case, the status is "running", not "uninterruptible".
How can I diagnose and fix these slow startups?
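One thing worth trying while the command hangs: "uninterruptible" normally means the process is stuck waiting on disk or network I/O, and OS X ships tools that can show what it is blocked on. The PID below is the one from the Control + T output above:

$ sample 12374 10          # sample the process's call stacks for 10 seconds
$ sudo fs_usage -w 12374   # show the file-system and I/O calls the process is making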
This is a long shot. Try installing haveged.
http://freecode.com/projects/haveged
I've seen this problem before. That solved it for me. Sometimes there is not enough entropy for libraries or elements of Ruby which are trying to load up a pool of random numbers.
If you notice that something starts faster when you are typing more, moving your mouse, or pushing a lot of network traffic, then it's entropy - which runs counter to what you'd normally expect.
With more processor and RAM usage and more interaction with the system, you'd think things would be slower, but in entropy-depletion situations that extra activity is exactly what you need.
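On Linux you can sanity-check this theory by watching the kernel's entropy pool while the slow command starts (the question above is on OS X, where this file does not exist, but the idea carries over to Linux boxes where haveged is the usual fix):

$ cat /proc/sys/kernel/random/entropy_avail              # bits of entropy currently available
$ watch -n 1 cat /proc/sys/kernel/random/entropy_avail   # a pool stuck near zero suggests entropy starvation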
I know this question has been asked many times in many different ways, but it's still not clear to me what the CPU load % means.
I'll start by explaining how I currently perceive the concepts (of course, I might be, and surely am, wrong):
A CPU core can only execute one instruction at a time. It will not execute the next instruction until it finishes executing the current one.
Suppose your box has one single CPU with one single core. Parallel computing is hence not possible. Your OS's scheduler will pick up a process, set the IP to the entry point, and send that instruction to the CPU. It won't move to the next instruction until the CPU finishes executing the current instruction. After a certain amount of time it will switch to another process, and so on. But it will never switch to another process if the CPU is currently executing an instruction. It will wait until the CPU becomes free to switch to another process. Since you only have one single core, you can't have two processes executing simultaneously.
I/O is expensive. Whenever a process wants to read a file from the disk, it has to wait until the disk accomplishes its task, and the current process can't execute its next instruction until then. The CPU is not doing anything while the disk is working, and so our OS will switch to another process until the disk finishes its job in order not to waste time.
Following these principles, I've come myself to the conclusion that CPU load at a given time can only be one of the following two values:
0% - Idle. CPU is doing nothing at all.
100% - Busy. CPU is currently executing an instruction.
This is obviously false, as taskmgr reports CPU usage values of 1%, 12%, 15%, 50%, and so on.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core (as reported by taskmgr)? While that given process is executing, what happens with the 99%?
What does it mean that the overall CPU usage is 19% (as reported by Rainmeter at the moment)?
If you look in the task manager on Windows, there is an Idle process that does exactly that: it just accounts for the cycles spent not doing anything useful. Yes, the CPU is always busy, but it might just be running in a loop waiting for useful work to come along.
"Since you only have one single core, you can't have two processes executing simultaneously."
This is not really true. Yes, true parallelism is not possible with a single core, but you can create the illusion of it with preemptive multitasking. Yes, it is impossible to interrupt an instruction, but that is not a problem because most instructions take a tiny amount of time to finish. The OS divides time into time slices, which are significantly longer than the execution time of a single instruction.
"What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core?"
Most of the time, applications are not doing anything useful. Think of an application that waits for the user to click a button before it processes something. This app doesn't need the CPU, so it sleeps most of the time; every time it gets a time slice it just goes back to sleep (see the event loop in Windows). GetMessage is blocking, which means the thread will sleep until a message arrives. So what does CPU load really mean? Imagine the app receives some events or data to work on: it will now do work instead of sleeping. If it utilizes X% of the CPU, that means that over the sampling period it used X% of the available CPU time. CPU usage is an averaged metric.
PS: To summarize the concept of CPU load, think of speed in the physics sense. There are instantaneous and average speeds, and likewise there are instantaneous and average measurements of CPU load. The instantaneous value is always either 0% or 100%, because at any given moment a process is either using the CPU or not. If a process used 100% of a CPU for 250 ms and nothing for the next 750 ms, we can say the process loaded the CPU to 25% over a sampling period of 1 second (an average measurement only makes sense relative to some sampling period).
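On Linux you can see this averaging in practice with pidstat from the sysstat package: each reported %CPU value is an average over the one-second sampling window, never an instantaneous 0% or 100% reading (the PID here is hypothetical):

$ pidstat -u -p 12345 1 5   # report CPU usage of PID 12345 once per second, five times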
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
This is basically what CPU load is. "Cars" are processes using a slice of CPU time ("crossing the bridge") or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.
Also see: http://en.wikipedia.org/wiki/Load_(computing)
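The run-queue figures the analogy describes are exactly what the standard tools report (shown here for Linux):

$ uptime            # the last three numbers are the 1-, 5- and 15-minute load averages
$ cat /proc/loadavg # same averages; the fourth field is runnable versus total scheduling entities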
I am running a piece of software that is very parallel. There are about 400 commands I need to run that don't depend on each other at all, so I just fork them all off and hope that having more CPUs means more processes executed per unit time.
Code:
foreach cmd ($CMD_LIST)
$cmd & #fork it off
end
Very simple. Here are my testing results:
On 1 CPU, this takes 1006 seconds, or 16 mins 46 seconds.
With 10 CPUs, this took 600s, or 10 minutes!
Why wouldn't the time taken divide (roughly) by 10? I feel cheated here =(
Edit: of course I'm willing to provide any additional details you might want; I'm just not sure what's relevant, because in the simplest terms this is all I'm doing.
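As an aside, forking all 400 commands at once oversubscribes the machine; bounding the concurrency to the number of CPUs usually behaves better. A sh-flavoured sketch, assuming the commands are listed one per line in a file called commands.txt (GNU parallel runs each input line as a command when no command template is given):

$ parallel -j "$(nproc)" < commands.txt                            # at most one job per CPU at a time
$ xargs -d '\n' -P "$(nproc)" -I CMD sh -c 'CMD' < commands.txt    # roughly equivalent with GNU xargs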
You are assuming your processes are 100% CPU-bound.
If your processes do any disk or network I/O, the bottleneck will be those operations, which cannot be parallelised (e.g. one process will download a file at 100k/s, while two processes will get 50k/s each, so you would see no improvement at all; furthermore, you could see performance degrade because of the extra overhead).
See Amdahl's law: it allows you to estimate the improvement in performance when parallelising a task, given the proportion between the parallelisable part and the non-parallelisable part.
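To put numbers on it: if a fraction p of the work parallelises perfectly over N workers, the expected speedup is 1 / ((1 - p) + p/N). Going from 1006 s to 600 s is a speedup of about 1.68x, which with N = 10 corresponds to p of roughly 0.45 - in other words, more than half of the total work is serialised on something shared (disk, network, a lock). A quick check, with the 0.45 being an assumption inferred from the timings rather than a measured value:

$ awk 'BEGIN { p = 0.45; n = 10; printf "predicted speedup: %.2f\n", 1 / ((1 - p) + p / n) }'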