How can I diagnose my very slow Ruby startup times? - ruby

Intermittently, when I type a command that involves Ruby (like ruby somefile.rb, rake, rspec spec, or irb), it takes a long time for the command to execute. For example, a few minutes ago, it took about a minute for irb to start. A few seconds ago, it took about a second.
While waiting for irb to start, I pressed Control + T repeatedly. Some output I saw included:
load: 1.62 cmd: ruby 12374 uninterruptible 0.45u 0.13s
load: 1.62 cmd: ruby 12374 uninterruptible 0.48u 0.13s
load: 1.62 cmd: ruby 12374 uninterruptible 0.53u 0.15s
On OS X, this output represents "load, command running, pid, status, and user and system CPU time used". It appears that when I had been waiting 53 seconds, the CPU time used was only 0.15 seconds.
My understanding of load is that it's roughly "how many cores are being used". E.g., on a one-core system, 1.0 is full utilization, but on a four-core machine, it's 25% utilization. I don't think the amount of load is the problem, because my machine is multi-core. Also, when irb starts quickly, I can get one line of output with Control + T that's also above 1.0.
load: 1.22 cmd: ruby 12452 running 0.26u 0.02s
I also notice that in the good case, the status is "running", not "uninterruptible".
How can I diagnose and fix these slow startups?

This is a longshot. Try installing haveged.
http://freecode.com/projects/haveged
I've seen this problem before. That solved it for me. Sometimes there is not enough entropy for libraries or elements of Ruby which are trying to load up a pool of random numbers.
If you notice that something starts faster while you are typing more, moving your mouse, or generating a lot of network traffic, then it's entropy - which goes against most of what you'd expect.
More processor and RAM usage and more interaction with the system would normally make things slower, but in entropy-depletion situations that activity is exactly what you need, because it feeds the entropy pool.
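If you want to test this theory on a Linux box (the haveged suggestion is Linux-specific, and OS X does not expose this interface), you can watch the kernel's entropy estimate while a Ruby process is stuck. A minimal sketch, assuming a Linux /proc filesystem; a value that stays very low alongside an "uninterruptible" process would support the entropy-depletion theory:

/* Minimal sketch (Linux only): print the kernel's available-entropy estimate. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
    if (f == NULL) {
        perror("entropy_avail (not a Linux system?)");
        return 1;
    }

    int entropy = 0;
    if (fscanf(f, "%d", &entropy) == 1)
        printf("available entropy: %d bits\n", entropy);

    fclose(f);
    return 0;
}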

Related

Set CPU affinity for profiling

I am working on a calculation-intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time a particular algorithm takes varies. For example, sometimes running the algorithm 100 times takes about 1100 ms, while another run of those 100 iterations takes much longer, like 2000 or even 3000 ms. It may vary even within the same run. So it is impossible to measure improvement when I optimize a piece of code; it's just unreliable.
So basically I want to make sure one CPU is dedicated to my app. The PC has an old dual-core Intel E5300 CPU running 32-bit Windows 7, so I can't just set process affinity and forget about one core forever; that would make the computer very slow for daily tasks. I need other apps to use a specific core while I'm profiling, and then, when I'm done, have the CPU affinities go back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: is it possible to have a bat file that sets process affinity for every process on Windows 7?
PS: The algorithm is correct and runs the same code path every time. I created an object pool, so after the first run zero memory is allocated; I also profiled memory allocation with dotTrace and it showed no allocation after the first run. So I don't believe the GC is triggered while the algorithm is working. Physical memory is available and the system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It gave more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the overheat alarm was off while the temperature was over 100 °C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is C#, so... the System.Diagnostics.Process object has a ProcessorAffinity property that should perform the same function.
I am just not sure that this will stabilize the CPU times quite in the way you expect. A single busy task that is not doing IO should stay scheduled on the same core anyway unless another thread interrupts it, so I suspect your variable times are due more to other threads and processes interrupting your algorithm than to the OS randomly shunting your thread to a different core. Unless you set the affinity of everything else in the system to exclude your preferred core, I can't see this helping.
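For completeness, here is a minimal sketch of the Win32 route, not a finished tool: a tiny helper (the name and command-line handling are my own assumptions) that pins one process, identified by a PID passed on the command line, to the first core. A bat file could loop over PIDs and invoke it; restoring things afterwards would mean recording the original masks with GetProcessAffinityMask first.

/* Minimal sketch: pin the process with the given PID to the first logical CPU. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: setaffinity <pid>\n");
        return 1;
    }

    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);
    HANDLE process = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                                 FALSE, pid);
    if (process == NULL) {
        fprintf(stderr, "OpenProcess(%lu) failed: %lu\n",
                (unsigned long)pid, (unsigned long)GetLastError());
        return 1;
    }

    /* Mask 0x1 = allow only the first logical CPU. */
    if (!SetProcessAffinityMask(process, 0x1))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());

    CloseHandle(process);
    return 0;
}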

Intentionally high CPU usage, GCD, QOS_CLASS_BACKGROUND, and spindump

I am developing a program that happens to use a lot of CPU cycles to do its job. I have noticed that it, and other CPU intensive tasks, like iMovie import/export or Grapher Examples, will trigger a spin dump report, logged in Console:
1/21/16 12:37:30.000 PM kernel[0]: process iMovie[697] thread 22740 caught burning CPU! It used more than 50% CPU (Actual recent usage: 77%) over 180 seconds. thread lifetime cpu usage 91.400140 seconds, (87.318264 user, 4.081876 system) ledger info: balance: 90006145252 credit: 90006145252 debit: 0 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 116147448571
1/21/16 12:37:30.881 PM com.apple.xpc.launchd[1]: (com.apple.ReportCrash[705]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
1/21/16 12:37:30.883 PM ReportCrash[705]: Invoking spindump for pid=697 thread=22740 percent_cpu=77 duration=117 because of excessive cpu utilization
1/21/16 12:37:35.199 PM spindump[423]: Saved cpu_resource.diag report for iMovie version 9.0.4 (1634) to /Library/Logs/DiagnosticReports/iMovie_2016-01-21-123735_cudrnaks-MacBook-Pro.cpu_resource.diag
I understand that high CPU usage may be associated with software errors, but some operations simply require high CPU usage. It seems a waste of resources to watch-dog and report processes/threads that are expected to use a lot of CPU.
In my program, I use four serial GCD dispatch queues, one for each core of the i7 processor. I have tried using QOS_CLASS_BACKGROUND, and spin dump recognizes this:
Primary state: 31 samples Non-Frontmost App, Non-Suppressed, Kernel mode, Thread QoS Background
The fan spins much more slowly when using QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED, and the program takes about 2x longer to complete. As a side issue, Activity Monitor still reports the same % CPU usage and even longer total CPU Time for the same task.
Based on Apple's Energy Efficiency documentation, QOS_CLASS_BACKGROUND seems to be the proper choice for something that takes a long time to complete:
Work takes significant time, such as minutes or hours.
So why then does it still complain about using a lot of CPU time? I've read about methods to disable spindump, but these methods disable it for all processes. Is there a programmatic way to tell the system that this process/thread is expected to use a lot of CPU, so don't bother watch-dogging it?
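For reference, here is a minimal sketch of the kind of setup the question describes - four serial queues created with a background QoS class - using the plain C dispatch API. The work function, loop count, and queue label are placeholders of mine, and this does not attempt to suppress the spindump watchdog:

/* Minimal sketch: four serial queues at QOS_CLASS_BACKGROUND, each running
 * one CPU-heavy task. Build on OS X with: cc -o qos_demo qos_demo.c */
#include <dispatch/dispatch.h>
#include <sys/qos.h>
#include <stdio.h>

static void heavy_work(void *context)
{
    long worker = (long)context;
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)   /* stand-in for the real workload */
        x += (double)i * 0.5;
    printf("worker %ld finished (%f)\n", worker, x);
}

int main(void)
{
    dispatch_group_t group = dispatch_group_create();
    dispatch_queue_attr_t attr =
        dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL,
                                                QOS_CLASS_BACKGROUND, 0);

    for (long i = 0; i < 4; i++) {
        dispatch_queue_t queue = dispatch_queue_create("com.example.worker", attr);
        dispatch_group_async_f(group, queue, (void *)i, heavy_work);
    }

    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    return 0;
}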

Script that benchmarks CPU performance giving a result that can be used to compare against other servers?

Any help with the below will be very much appreciated:
The script should not overload the box to the point where all its resources are consumed.
As an example, the script could run a fixed integer calculation and use the time it takes to complete as the measurement of performance.
My theory is to attach it to a monitoring system to periodically run the test across multiple servers.
Any ideas?
Performance testing isn't nearly as simple as that. All you measure by timing a particular operation is how fast the system completes that particular operation on that particular day.
CPUs are fast now, and they're rarely the bottleneck any more.
So this comes back to the question - what are you trying to accomplish? If you really want a relative measure of CPU speed, then you could try something like using the system utility time.
Something like:
time ssh-keygen -b 8192 -f /tmp/test_ssh -N '' -q
Following on from the comments: this will prompt you to overwrite if /tmp/test_ssh already exists; you can always delete it first. (I wouldn't suggest using a variable filename, as that'll be erratic.)
As an alternative - if it's installed - then I'd suggest using openssl.
E.g.
time openssl genrsa 4096
Time returns 3 numbers by default:
'real' - elapsed wallclock time.
'user' - cpu-seconds spent running the user elements of the task.
'sys' - cpu-seconds spent running system stuff.
For this task I'm not entirely sure how the metrics pan out - I'm not 100% sure whether VMware 'fakes' the CPU time in the virtualisation layer. Real and user should, for this operation, normally be pretty similar.
I've just tested this on a few of my VMs, and have had a fairly pronounced amount of 'jitter' in the results. But then, this might be what you're actually testing for. Bear in mind though - CPU isn't the only contended resource on a VM. Memory/disk/network are also shared.
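As a concrete illustration of the "fixed amount of integer work, report how long it took" idea, here is a minimal sketch in C. The loop bound and the arithmetic are arbitrary placeholders; the point is that the workload is fixed, single-threaded (so it won't saturate a multi-core box), and produces one number a monitoring system can record:

/* Minimal sketch: time a fixed, single-threaded integer workload and print
 * the elapsed seconds. Build with: cc -O2 -o cpubench cpubench.c */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    volatile uint64_t acc = 0;   /* volatile keeps the loop from being optimized away */
    for (uint64_t i = 1; i <= 200000000ULL; i++)
        acc += i % 7;

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;

    printf("%.3f\n", elapsed);   /* seconds to finish the fixed workload */
    return 0;
}

Like the ssh-keygen and openssl variants, this only measures how fast one core grinds through that particular workload on that particular day, so treat it as a relative indicator rather than a full benchmark.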

Limit spikes, flatten processor usage?

I have multiple instances of a Ruby script (running on Linux) that do some automated downloading, and every 30 minutes each one calls "ffprobe" to programmatically evaluate the video download.
Now, during the downloading my processor is at 60%. However, every 30 minutes (when ffprobe runs), my processor usage skyrockets to 100% for 1 to 3 minutes, and that sometimes ends up crashing other instances of the Ruby program.
Instead of this, I would like to allocate fewer CPU resources to the processor-heavy ffprobe so that it runs more slowly - i.e. I would like it to use, say, a maximum of 20% of the CPU, and it can run as long as it likes. So one might expect it to take 15 minutes to complete a task that now takes 1 to 3 minutes. That's fine with me.
This will then prevent crashing of my critical downloading program that should have the highest priority.
Thank you!
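One common approach, offered here only as a sketch rather than a guaranteed fix, is to lower ffprobe's scheduling priority so the downloader wins whenever the two compete. Note that this deprioritizes rather than hard-capping usage at 20%; a hard cap needs something like cpulimit or cgroups. From the Ruby script, the simplest form is prefixing the command with nice -n 19; the C equivalent below spawns ffprobe at the lowest priority (the ffprobe arguments are placeholders):

/* Minimal sketch: run ffprobe as a child process at the lowest scheduling
 * priority (nice 19) so a competing downloader keeps getting CPU. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: drop to the lowest priority, then become ffprobe. */
        setpriority(PRIO_PROCESS, 0, 19);
        execlp("ffprobe", "ffprobe", "-v", "error", "-show_format",
               "video.mp4", (char *)NULL);
        perror("execlp ffprobe");
        return 127;
    }

    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}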

Why does the sleep time of Sleep(1) seem to be variable in Windows?

Last week I needed to test some different algorithmic functions, and to make it easy for myself I added some artificial sleeps and simply measured the wall clock time. Something like this:
start = clock();
for (int i = 0; i < 10000; ++i)
{
    ...
    Sleep(1);
    ...
}
end = clock();
Since the argument of Sleep is expressed in milliseconds, I expected a total wall clock time of about 10 seconds (a bit higher because of the algorithms, but that's not important now), and that was indeed my result.
This morning I had to reboot my PC because of new Microsoft Windows hot fixes and to my surprise Sleep(1) didn't take 1 millisecond anymore, but about 0.0156 seconds.
So my test results were completely screwed up, since the total time grew from 10 seconds to about 156 seconds.
We tested this on several PCs, and apparently on some PCs the result of one Sleep was indeed 1 ms, while on other PCs it was 0.0156 seconds.
Then, suddenly, after a while, the time of Sleep dropped to 0.01 second, and then an hour later back to 0.001 second (1 ms).
Is this normal behavior in Windows?
Is Windows 'sleepy' in the first hours after a reboot, and does it then gradually switch to a finer sleep granularity after a while?
Or are there any other aspects that might explain the change in behavior?
In all my tests no other application was running at the same time (or: at least not taking any CPU).
Any ideas?
OS is Windows 7.
I've not heard about the resolution jumping around like that on its own, but in general the resolution of Sleep follows the clock tick of the task scheduler. So by default it's usually 10 or 15 ms, depending on the edition of Windows. You can set it manually to 1 ms by issuing a timeBeginPeriod.
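A minimal sketch of that approach, assuming the usual winmm.lib link and an arbitrary loop count: request a 1 ms timer period before timing Sleep(1), and restore it afterwards.

/* Minimal sketch: time 1000 calls to Sleep(1) with a 1 ms timer period.
 * Link against winmm.lib. */
#include <windows.h>
#include <stdio.h>

#pragma comment(lib, "winmm.lib")

int main(void)
{
    timeBeginPeriod(1);                 /* request 1 ms timer resolution */

    DWORD start = GetTickCount();
    for (int i = 0; i < 1000; ++i)
        Sleep(1);
    DWORD elapsed = GetTickCount() - start;

    timeEndPeriod(1);                   /* always pair with timeBeginPeriod */

    printf("1000 x Sleep(1) took %lu ms\n", (unsigned long)elapsed);
    return 0;
}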
I'd guess it's the scheduler. Each OS has a certain granularity. If you ask it to do something finer than that, the results aren't perfect. By asking to sleep 1 ms (especially very often) the scheduler may decide you're not important and have you sleep longer, or your sleeps may run up against the end of your time slice.
The sleep call is an advisory call. It tells the OS you want to sleep for an amount of time X. That time can be less than X (due to a signal or something else), or it can be more (as you are seeing).
Another Stack Overflow question has a way to do it, but you have to use winsock.
When you call Sleep, the OS stops that thread until it can resume at a time >= the requested sleep time. Sometimes, due to thread priority (which in some cases is why Sleep(0) can cause your program to hang indefinitely), your thread may resume later because processor cycles were allocated to another thread's work (OS threads, in particular, have higher priority).
I just wrote some words about the sleep() function in the thread Sleep Less Than One Millisecond. The characteristics of the sleep() function depend on the underlying hardware and on the setting of the multimedia timer interface.
A Windows update may change the behavior; for example, Windows 7 treats things differently compared to Vista. See my comment in that thread and its links to learn more about the sleep() function.
Most likely the sleep timer does not have enough resolution.
What kind of resolution do you get when you call the timeGetDevCaps function as explained in the documentation of the Sleep function?
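For reference, a minimal sketch of that query, again linking winmm.lib: it prints the minimum and maximum timer periods the machine supports (the supported range, not necessarily the period currently in effect).

/* Minimal sketch: query the supported timer periods with timeGetDevCaps. */
#include <windows.h>
#include <stdio.h>

#pragma comment(lib, "winmm.lib")

int main(void)
{
    TIMECAPS caps;
    if (timeGetDevCaps(&caps, sizeof(caps)) == TIMERR_NOERROR)
        printf("timer period: min %u ms, max %u ms\n",
               caps.wPeriodMin, caps.wPeriodMax);
    else
        printf("timeGetDevCaps failed\n");
    return 0;
}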
Windows Sleep granularity is normally 16 ms; you get this unless your program or some other program changes it. When you got 1 ms granularity on some days and 16 ms on others, some other program probably set the timer period (which affects your program as well). I think LabVIEW, for example, does this.
