Solaris prstat - definition of "recent" time used in percentages - performance

The man page for prstat (on Solaris 10 in my case) notes that the CPU % output is the "percentage of recent CPU time". I am trying to understand in more depth what "recent" means in this context: is it a defined amount of time prior to the sample, does it relate to the sampling interval, etc.? I would appreciate any insights, particularly with references to supporting documentation. I've searched but haven't been able to find a good answer. Thanks!
Adrian

The kernel maintains the data that you see at the bottom of the display (those three numbers), and it keeps comparable data for each process.
uptime shows you what those numbers are. Those are the 'recent' times for load average - the line at the bottom of prstat. 1 minute, 5 minutes, and 15 minutes.
Recent == 1 minute worth of sampling (the last 60 seconds). Those numbers are averages, which is why the numbers and the process list usually change shortly after you first start prstat.
On the first pass you may see processes like nscd that have lots of cpu but have been up for a long time. The first display iteration is completely historical. After that the numbers reflect recent == last one minute average.
You should consider enabling sar sampling to get a much better picture.
If you want a reference, try:
http://www.amazon.com/Solaris-Internals-OpenSolaris-Architecture-Edition/dp/0131482092

Related

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. On a relatively idle system, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and documented here. I read the counter and reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the resulting value is around 300 million (EDITED), which is roughly 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for columns READ, WRITE and IO are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB READ and WRITE requests belong to IO. Based on this explanation, the above three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The processes on the scheduler are one of swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there are two memory accesses for each retired micro-operation. This seems impossible, especially considering that there are multiple levels of cache. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
Actually, the traffic was mostly caused by the GPU device, which is why it was excluded from the core performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with resolution 3840x2160 and refresh rate 60, set using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).
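As a rough sanity check on the GPU explanation, the memory traffic needed just to scan out the display can be estimated from the resolution and refresh rate. The sketch below assumes an uncompressed framebuffer with 4 bytes per pixel that is read from system DRAM once per refresh; the function name and the bytes-per-pixel default are illustrative assumptions, not something stated in the post.

```python
def scanout_bytes_per_second(width, height, refresh_hz, bytes_per_pixel=4):
    """Bytes read per second to refresh the display once per frame.

    Assumes an uncompressed framebuffer (bytes_per_pixel is an assumption).
    """
    return width * height * bytes_per_pixel * refresh_hz


uhd = scanout_bytes_per_second(3840, 2160, 60)
print(f"3840x2160 @ 60 Hz: {uhd / 1e9:.2f} GB/s")  # prints 1.99 GB/s
```

Under these assumptions the estimate lands in the same range as the IO column described in UPDATE 2, which would be consistent with display refresh dominating the idle DRAM reads.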

H2O - Not seeing much speed-up after moving to powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations and in each loop calls H2ODeepLearningEstimator() 4 times and associated predict() and model_performance(). I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set 80,000 with 122 features (all float) with 20% for validation (10-fold CV). test set 20,000. Doing binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be proportional to cores times core speed. Machine 1 gives 4 x 3.5 = 14 and machine 2 gives 20 x 2.0 = 40, so you might have expected a 40/14 = 2.86 speed-up (i.e. your 24 hours coming down to the 8-10 hour range).
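The arithmetic behind that expectation can be written out as a short sketch. It assumes the idealised model from the answer (throughput proportional to cores times clock speed); the helper name is hypothetical, and real scaling will be worse because of memory bandwidth and synchronisation costs.

```python
def aggregate_ghz(cores, ghz_per_core):
    """Crude capacity estimate: cores multiplied by clock speed."""
    return cores * ghz_per_core


machine1 = aggregate_ghz(4, 3.5)    # Windows 7 box from the question
machine2 = aggregate_ghz(20, 2.0)   # CentOS 7 box from the question

ideal_speedup = machine2 / machine1   # what perfect scaling would give
observed_speedup = 24 / 17            # reported 24 h versus 17 h
ideal_hours = 24 / ideal_speedup      # expected runtime on machine 2

print(f"ideal speed-up:    {ideal_speedup:.2f}x")
print(f"observed speed-up: {observed_speedup:.2f}x")
print(f"ideal runtime:     {ideal_hours:.1f} h")
```

The gap between the ideal and observed figures is what the remaining suggestions try to explain.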
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
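The idea of rebuilding with a different seed for the 80/10/10 split can be sketched in plain Python. This only shows the index bookkeeping under the assumption of 80,000 rows; the function name is hypothetical, and in practice you would more likely use H2O's own frame-splitting facility with a seed argument rather than this helper.

```python
import random


def split_indices(n_rows, seed, ratios=(0.8, 0.1)):
    """Shuffle row indices with a seed and cut them into train/valid/test.

    ratios gives the train and valid shares; the remainder becomes test.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * ratios[0])
    n_valid = round(n_rows * ratios[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


# One split per seed; train a model on each and compare the metrics.
splits = {seed: split_indices(80_000, seed) for seed in (1, 2, 3)}
for seed, (train, valid, test) in splits.items():
    print(seed, len(train), len(valid), len(test))
```

If the metrics from the three seeds agree closely, that supports dropping cross-validation; if they diverge, the splits are worth a closer look.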

VTUNE results of CPU usage and Concurrency Histogram

In VTune results, what do the numbers 0, 1, 2 (and 3) actually represent?
What is the meaning of the blue bar over 0?
It's a histogram - each column represents the portion of time you spend while the variable (the one appearing below the graph) is at any given value.
The left one states that you spend roughly 1/3 of the time with 0 utilized logical CPUs (fully idle), and 2/3 of the time with 1 logical core operating. You never reach 2 simultaneously operating cores.
In the same manner, the right histogram says you spend ~25% of the time with zero active threads, and ~75% with one thread (there's a negligible portion with 2 threads).
Note that the total times are slightly different, and the portion of fully-idle time also varies a bit - if this is taken over the exact same run, then this discrepancy might be explained by the difference between the time when a core becomes active (waking up from a low power state), and the moment that the OS can schedule a thread to actually start running on it.

Getting cpu usage and calculating % used

I need to calculate the CPU usage and aggregate it from the proc file system in Linux.
/proc/stat gives me data, but how do I work out the % of CPU used at a given time?
As far as I can tell, stat gives me the count of processes running on the cores at any time, which does not give me any idea of the % use of the CPU.
I am coding this in Golang and have to do this without scripts.
Thanks in advance!
/proc/stat does not only give you the count of processes on each core. man proc will tell you the exact format of that file. Copied from it, here is the part you should be interested in:
/proc/stat
cpu 3357 0 4313 1362393
The amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures, use
sysconf(_SC_CLK_TCK) to obtain the right value), that the
system spent in user mode, user mode with low priority
(nice), system mode, and the idle task, respectively.
The last value should be USER_HZ times the second entry
in the uptime pseudo-file.
It is then easy to subtract the idle field between two measurements, which gives you the time this CPU spent doing nothing. The other value you can extract is the time spent doing something, which is the difference between two measurements of:
time in user mode + time spent in user mode with low priority + time spent in system mode
You will then have two values: A, the time spent doing nothing, and B, the time spent actually doing something. B / (A + B) is the fraction of time the CPU was busy; multiply it by 100 to get a percentage.
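The calculation can be sketched as follows. The example is in Python for brevity, and the same arithmetic ports directly to Go. It uses only the four fields from the man page excerpt above (newer kernels append more fields such as iowait, which this sketch ignores), and the second sample is a made-up value for illustration; a real program would read /proc/stat twice with a delay in between.

```python
def parse_cpu_line(stat_text):
    """Return (busy, idle) ticks from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(v) for v in fields[1:5])
            return user + nice + system, idle
    raise ValueError("no aggregate 'cpu' line found")


def cpu_busy_percent(earlier, later):
    """Percentage of time the CPU was busy between two /proc/stat snapshots."""
    busy0, idle0 = parse_cpu_line(earlier)
    busy1, idle1 = parse_cpu_line(later)
    b = busy1 - busy0   # B: ticks spent doing something
    a = idle1 - idle0   # A: ticks spent idle
    total = a + b
    return 100.0 * b / total if total else 0.0


sample_1 = "cpu 3357 0 4313 1362393\n"   # line from the man page excerpt
sample_2 = "cpu 3407 0 4363 1362693\n"   # hypothetical later reading
print(cpu_busy_percent(sample_1, sample_2))  # prints 25.0
```

Because both snapshots are in the same USER_HZ units, the unit cancels out and no call to sysconf is needed for the percentage.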

How to handle mass database manipulation every second - threading?

I have a very hard problem:
I have roughly 20-50 objects, which I MUST (that is a given for the problem, please don't spend time thinking of ways around it) put through some logic EVERY SECOND.
The logic itself needs roughly 200-600 milliseconds (200 ms in 90% of cases, 600 ms in 10%).
I have tried to find a way to make it faster, but there isn't one. I must get an object from the DB, I must run a lot of if-else logic, and I must update it. Even if I reduce it to 50 ms or less, the variable number of objects (up to 50) will break my neck with the 1-second timer, because 50 x 50 ms = 2.5 seconds. So a tick takes longer than the tick rate allows.
So my only idea, which I don't think is very smart, is to start a separate thread for every object and have a main thread manage them. The main thread starts x other threads, so only starting them must take under 1 second. After the logic has run, the thread can kill itself and we are all happy, aren't we?
Given the answers so far, let me explain my problem:
I am trying to build an auction site. I have up to 50 auctions running at the same moment, nothing special. So I need to: every single second, look at the auction list, check whether the remaining time is 00:00:01 and, if it is, bid automatically (it's a feature that a user can set up).
So: get 50 objects in a list, iterate through them, check whether an automatic bid is needed, and place it.
With 50 objects and the processing time you've given, on average you are doing 12 seconds' worth of processing every second. Assuming you have 4 cores, threading can get this down to an execution time of about 3 seconds at best. Every second. This means that you're going to start off behind and slip further behind as time goes on.
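The workload estimate follows from the timings given in the question. A minimal sketch of the arithmetic, assuming 50 objects, the 90%/10% timing mix, and perfect scaling across 4 cores (the function name is illustrative):

```python
def average_seconds_per_object(fast=0.2, slow=0.6, fast_share=0.9):
    """Expected processing time per object given the 90%/10% timing mix."""
    return fast_share * fast + (1 - fast_share) * slow


objects = 50
cores = 4

work_per_tick = objects * average_seconds_per_object()  # CPU-seconds per tick
best_case_wall = work_per_tick / cores                  # ideal parallel time

print(f"work per tick:  {work_per_tick:.1f} s")
print(f"best case wall: {best_case_wall:.1f} s on {cores} cores")
```

Even with ideal scaling, the wall-clock time per tick stays several times larger than the 1-second budget, so threads alone cannot close the gap.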
I know you said you tried to think of a way to make it more efficient, but couldn't, but I fear you're going to have to. The problem as stated now is computationally intractable. You're either going to have to process the objects in a rotating window (so each object gets hit once every 4th cycle or so), or you need to make your processing run faster.
First: Profile, if you haven't already. Figure out which sections of your code are taking time. I'd go after that database: how long is the I/O of the objects from the database taking? Can you cache that I/O? (If you're manipulating the same 50 objects, don't load them every second.)
Let's address your threads idea: If you want multiple threads, don't create and destroy them every second. Create your X threads and leave them be, since creating and destroying them are expensive operations. You might find that fewer threads work better, such as 1 or 2 per core, because you spend less time on context switches.
To expand on Jonathan Leffler's comment on the question, as the OP requested: (This answer is a wiki)
Say you have these three things being auctioned, ending at the times indicated:
10 Apples - ends at 1:05:00 PM
20 Blueberries - ends at 2:00:00 PM
15 Pears - ends at 3:50:00 PM
If the current time is 1:00:00 PM, then sleep for 4 minutes, 58 seconds (since the closest item ends in 5 minutes). We use the 2 seconds then for processing - adjust that threshold as needed. Once we're done with the apples, we'll sleep for (2 PM - now() - 2s), for the blueberries.
Note that when we wake up at 1:04:58 PM to process the apples auction, we do not touch the blueberries or the pears -- we know that they're still way out in the future, so we don't care.
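The sleep-until-the-next-deadline idea can be sketched with a priority queue. This is a minimal illustration, not the original commenter's implementation: the names LEAD_TIME and next_wakeup are hypothetical, the date is arbitrary, and a real program would call time.sleep with the returned duration and re-check the queue whenever an auction is added or extended.

```python
import heapq
from datetime import datetime, timedelta

LEAD_TIME = timedelta(seconds=2)  # processing margin before an auction ends


def next_wakeup(auctions, now):
    """Return (sleep_duration, name) for the auction that ends soonest.

    auctions is a heap of (end_time, name) tuples.
    """
    end_time, name = auctions[0]
    sleep_for = max(end_time - LEAD_TIME - now, timedelta(0))
    return sleep_for, name


day = datetime(2024, 1, 1)  # arbitrary date for the example
auctions = [
    (day.replace(hour=13, minute=5), "10 Apples"),
    (day.replace(hour=14, minute=0), "20 Blueberries"),
    (day.replace(hour=15, minute=50), "15 Pears"),
]
heapq.heapify(auctions)

now = day.replace(hour=13, minute=0)
sleep_for, name = next_wakeup(auctions, now)
print(name, sleep_for)  # 10 Apples 0:04:58

heapq.heappop(auctions)  # apples handled; only the next deadline matters now
now = day.replace(hour=13, minute=5)
sleep_for, name = next_wakeup(auctions, now)
print(name, sleep_for)  # 20 Blueberries 0:54:58
```

Only the head of the heap is inspected on each wake-up, so the cost per cycle does not grow with the number of auctions that are still far in the future.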
