I am making a Linux module to calculate CPU usage time for each task in the system. For this purpose, I am reading the task list using for_each_process to iterate over all task_struct entries. I have a couple of questions regarding this method:
In some of the code snippets, I have seen people using for_each_process with RCU lock. Is it necessarily required? I am only reading this data structure.
I am planning to use utime and stime fields of task_struct. But when I try to print these values, I find out that these values are different from the ones that I get using cat /proc/<pid>/stat. There is a huge difference between these values. Am I doing something wrong?
As per my understanding, utime and stime represent the number of clock ticks. How can I derive the time elapsed in seconds from these values?
Apologies if these are kind of basic questions, I am kind of new to the Linux kernel.
Thanks in advance.
I actually got some advice on Reddit.
https://www.reddit.com/r/linuxquestions/comments/kp55ev/calculating_cpu_usage_for_all_pids/?utm_source=share&utm_medium=web2x&context=3
Also, instead of calculating the time in seconds, I am able to find a kernel function task_cputime that returns time in nanoseconds. I used this function and I am able to see the correct time results.
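For comparison from userspace: the utime and stime columns of /proc/<pid>/stat are reported in clock ticks, so dividing by the tick rate from sysconf gives seconds. The following is a minimal sketch of that conversion, not kernel code. The helper names and the sample line are made up for illustration, and the in-kernel task_struct fields may use a different unit depending on the kernel version.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return (user_seconds, system_seconds) from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field N sits at index N - 3.
    utime_ticks = int(after_comm[14 - 3])
    stime_ticks = int(after_comm[15 - 3])
    return utime_ticks / ticks_per_second, stime_ticks / ticks_per_second


def cpu_seconds_for_pid(pid):
    """Hypothetical convenience wrapper for a live Linux system."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_seconds_from_stat(f.read(), os.sysconf("SC_CLK_TCK"))


# Made-up sample line in the /proc/<pid>/stat layout (utime=250, stime=50).
sample = "1234 (my proc) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0 100 1000 10"
user_s, sys_s = cpu_seconds_from_stat(sample, ticks_per_second=100)
print(user_s, sys_s)
```

On a real system, cpu_seconds_for_pid(os.getpid()) would read the values for the current process.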
Related
Disclaimer: I am new to perf and still trying to learn the ins/outs.
I have an executable that is running on my target system running Linux. I would like to use perf in order to profile/monitor its performance over time. For arguments sake I am trying to prove that my CPU utilization that is currently measured from top and collectD can be replaced by monitoring it via perf which will result in much more granular data.
Since we are trying to get a plot over time I have been using perf record -e cycles -p <pid>. Afterwards I can get the data to display via perf report.
Questions:
Just displaying the data with perf report shows me a summary of the entire data set that I took, correct?
If I were to run perf report -D I get a dump of all data. (Just as a side question the timestamp is uptime in ns correct?) Now I would assume that sample is based on the frequency that could be set in perf record correct? I have run into issues by taking the time delta of the timestamp and it appears to be recorded at a random interval.
Once I dump the data there is nothing in here that really shouts out "this is your count!!" So the assumption was that the "period" field from the dump is the raw count. Is this true? Meaning that if period = 100, I could assume that for that interval, my program used 100 cycles? Additionally, I am starting to get the feeling that this is not just for the application but for each library or kernel call that the program makes. I.e. if malloc is called, a different event will be recorded outlining that call's cycles taken. So overall, how can I derive the duration of the event, the number of cycles from the event, and which event it actually was from this field to get a true measure of CPU utilization?
If this application of perf is not what it was intended to do, then I would also like to know why not. Additionally, I think this same type of analysis would be useful for all of the other types of statistics, since you can then pinpoint in time when an anomaly occurred in your running code. Just for reference, I am running perf against top collecting at 1s. I am doing this since I want to compare top output to perf output. Any insight would be helpful since, as I said, I am still learning and new to this powerful tool.
(Linux Kernel version: 3.10.82)
ANSWER #1
Yes mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first. However, you can do much more detailed profiling also using this report.
ANSWER #2
You should ideally use perf script -D to get a trace of all data. The timestamp is in microseconds. However, in kernels newer than the one you specify, you can display the time in nanoseconds as well with the help of a command line switch (--ns). Here is the source -
Timestamp
It is quite hard to tell without looking at what kind of "deltas" you are getting. Remember that the period of collecting samples is usually tuned. There are two ways of specifying the rate at which to collect samples --
You can use perf record -c (c for count) to specify the period at which to collect samples. This means that for every c occurrences of the event that you are measuring, you will get one sample. You can then modify the sampling period and test various values. For example, with a period of 2, the counter overflows on every second occurrence of the event and a sample is recorded.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period, so the samples arrive at irregular intervals.
You can see for yourself in code here:
How perf dynamically updates time
ANSWER #3
Why not? Ideally you should get the number of event samples collected if you run perf report and do a deeper analysis. Also, when you run perf record and finish recording samples, you get a notification on the command line about the number of samples collected for the event you measured. (This may not be available in the kernel version you use; I would suggest you switch to a newer Linux version if possible!) The number of samples should be the raw count - not the period.
If your period is 100, it means that for the whole duration of the trace, perf recorded every 100th event. That means, if a total of 1000 events happened during the trace, perf collected approximately events 100, 200, 300, ..., 1000, i.e. about 10 samples.
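To make the period arithmetic concrete, here is a minimal sketch in Python. It assumes each sample carries a period value (as in the dump the question mentions) and that the total event count can be approximated by summing those periods. The function names are hypothetical and not part of perf.

```python
def expected_samples(total_events, period):
    """Number of samples when one sample is taken every `period` events."""
    return total_events // period


def estimate_event_count(sample_periods):
    """Approximate the total event count by summing each sample's period."""
    return sum(sample_periods)


# Fixed period: 1000 events sampled every 100th event.
n_samples = expected_samples(1000, 100)
fixed_estimate = estimate_event_count([100] * n_samples)

# Frequency mode: the period varies from sample to sample (made-up values).
varying_estimate = estimate_event_count([80, 120, 95, 105])

print(n_samples, fixed_estimate, varying_estimate)
```

The estimate is only as good as the sampling: events that occur after the last overflow, or while samples are being dropped, are not reflected in the sum.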
Yes the samples recorded are not only from the application. In fact, you can use switches like this : perf record -e <event-name:u> or <event-name:k> (u for userspace and k for kernel) to record events. Additionally perf records samples from shared libraries as well. (Please consult the perf man-page for more details).
As I said previously, perf report should be an ideal tool to calculate the number of samples of event cycles recorded by perf. The number of events collected/recorded is not exact because it is simply not possible for hardware to record all cycle events. This is because recording and preparing details of all the events require the kernel to maintain a ring buffer which gets written to periodically as and when the counter overflows. This writing to the buffer happens via interrupts. They take up a fraction of CPU time- this time is lost and could have been used to record events which will now be lost as the CPU was busy servicing interrupts. You can get a really great estimate by perf even then, though.
CONCLUSION
perf does exactly what it is intended to do, given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand it better.
QUESTIONS
I assume you are looking at perf report. I also assume you are talking about the overhead % in perf report. Theoretically, it can be considered to be an arrangement of data from the highest to least occurrence as you specified. But, there are many underlying details that you need to consider and understand to properly make sense of the output. It represents which function has the most overhead (in terms of the number of events that occurred in that function ). There is also a parent-child relationship, based on which function calls which function, between all the functions and their overheads. Please use the Perf Report link to understand more.
As you already know, events are being sampled, not counted. So you cannot get the exact number of events, but you will get the number of samples, and based on the tuned frequency of collecting samples you can also derive an estimate of the raw event count. (Everything should be available to you in the perf report output.)
I'm using TensorFlow 1.2 for image segmentation on an AWS p2 instance (Tesla K80). Is there an easy way for me to find out if I can improve the performance of my code?
Here is what I know:
I measured the execution time of the various parts of my program and 99% of the time is spent calling session run.
sess.run([train_op, loss, labels_modified, output_modified],
feed_dict=feed_dict)
where feed_dict is a mapping from placeholders to tensors.
The session.run method only takes 0.43 seconds to execute for the following parameters: batch_size=1, image_height=512, image_width=512, channels=3.
The network has 14 convolutional layers (no dense layers) with a total of 11 million trainable parameters.
Because I'm doing segmentation I use a batch size of 1 and then compute the pixel-wise loss (512*512 cross entropy losses).
I tried to compile Tensorflow from source and got zero performance improvements.
I read through the performance guide https://www.tensorflow.org/performance/performance_guide but I don't want to spend a lot of time trying all of these suggestions. It already took me 8 hours to compile Tensorflow and it gave me zero benefits!
How can I find out which parts of the session run take most of the time? I have a feeling that it might be the loss calculation.
And is there any clear study that shows how much speedup I can expect from the things mentioned in the performance guide?
You're performing a computationally intensive task that requires a lot of calculations and a lot of memory. Your model has a lot of parameters, and each one is involved in the forward pass, the backward pass and the update step.
The suggestions in the page you linked are OK, and if you followed them all there's nothing else you can do except create one or more additional instances and run the training in parallel. This will give you an Nx speed-up (where N is the number of instances that compute the gradients for your input batch), but it's extremely expensive and not always applicable (moreover, it requires changing your code to follow the client-server architecture for the gradient computation and weight updates).
Based on your small piece of code, I see you're using a feed dictionary. Generally it's best to avoid using feed dictionaries if queues can be used (see https://github.com/tensorflow/tensorflow/issues/2919). The Tensorflow documentation covers the use of queues here. Switching to queues will definitely improve your performance.
Maybe you can run your code with tfprof to do some profiling to find out where the bottleneck is.
Just guessing: the performance problem may be caused by feeding data. I don't know how you prepare your feed_dict, but if you have to read your data from disk to build the feed_dict for every sess.run, the program slows down because reading data and training happen synchronously. You can try converting your data to TFRecords and make loading and training asynchronous by using tf.FIFOQueue.
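The idea of overlapping data loading with training can be sketched without TensorFlow at all. The following is a minimal illustration using only the Python standard library, not the tf.FIFOQueue API. The names prefetching_batches and fake_load are hypothetical, and fake_load stands in for reading one image from disk.

```python
import queue
import threading


def prefetching_batches(load_batch, num_batches, capacity=4):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has a batch ready
        if item is done:
            return
        yield item


def fake_load(i):
    # Hypothetical stand-in for reading and decoding one image from disk.
    return [i] * 3


# The consumer (the training loop) only waits when the buffer is empty.
batches = list(prefetching_batches(fake_load, 5))
print(batches)
```

While the consumer processes one batch, the producer thread is already loading the next, which is the same effect the queue-based input pipeline aims for. This sketch does not handle exceptions raised inside load_batch.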
I'm trying to make the code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized version of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (You can find the functions defined in DefinitionPara.jl in the github page too.) Serial code is defined as main() and parallel code is defined as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over nCapital grid, which consists of many points.
When I run @time on the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the IPython notebook. Is the parallel code slower because of the cost of overhead?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker does significant amount of work, serial code is way faster than the parallel code. When I added more workers, serial code is still faster. I think something is wrong but I haven't been able to figure out... Could it be related to the fact that I pass too many arguments in the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!
If the amount of work is really small between synchronizations, the task sync overhead may be too long. Remember that a common OS timeslicing quantum is 10ms, and you are measuring in the 1ms range, so with a bit of load, 4ms latency for getting all work threads synced is perfectly reasonable.
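A back-of-the-envelope check with the timings from the question illustrates this. The sketch below is in Python rather than Julia and assumes the two workers ran concurrently, so the compute part of the parallel run is bounded by the slowest worker. The function name is hypothetical.

```python
def parallel_overhead(parallel_wall_time, worker_times):
    """Wall time not explained by the slowest worker's own computation."""
    return parallel_wall_time - max(worker_times)


# Timings reported in the question, in seconds.
serial = 0.001041
parallel = 0.004515
workers = [0.000714, 0.000640]

overhead = parallel_overhead(parallel, workers)
overhead_share = overhead / parallel

print(f"overhead: {overhead * 1e3:.2f} ms ({overhead_share:.0%} of the parallel run)")
```

Under that assumption roughly 3.8 ms of the 4.5 ms parallel run is spent outside the workers' own computation, which is consistent with synchronization and scheduling latency dominating at this scale.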
If all tasks access the same shared data structure and that structure is thread-safe, locking overhead may well be the culprit, even with longer parallel tasks.
In some cases, it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the work threads are doing, for example if they are outputting to the same array elements, it might be necessary to give each worker its own output array, and merge them together in the end, but that doesn't seem to be the case with your task.
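The pattern of giving each worker its own output and merging at the end can be sketched as follows. This is an illustration in Python rather than Julia. square_chunk is a made-up stand-in for the real per-point work, and the example shows the structure only (Python threads will not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    # Each worker writes only to its own private output list.
    return [x * x for x in chunk]


def parallel_map_merge(data, n_workers):
    """Split data into contiguous chunks, process each separately, then merge."""
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        partial_outputs = list(pool.map(square_chunk, chunks))
    merged = []
    for part in partial_outputs:
        merged.extend(part)
    return merged


result = parallel_map_merge(list(range(10)), n_workers=3)
print(result)
```

Because no two workers touch the same output elements, no locking is needed until the final merge, which runs serially.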
I'm writing a DDE logging applet in visual c++ that logs several hundred events per minute and I need a faster way to keep time than calling GetSystemTime in winapi. Do you have any ideas?
(asking this because in testing under load, all exceptions were caused by a call to getsystemtime)
There is a mostly undocumented struct named USER_SHARED_DATA at a high usermode readable address that contains the time and other global things, but GetSystemTime just extracts the time from there and calls RtlTimeToTimeFields so there is not much you can gain by using the struct directly.
Possibly crazy thought: do you definitely need an accurate timestamp? Suppose you only got the system time, say, every 10th call - how bad would that be?
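A minimal sketch of that idea, written in Python rather than Visual C++ for brevity: the clock source is only queried once every N calls and the cached value is returned in between. The class name CoarseClock and its parameters are hypothetical. In the real applet the `now` callable would wrap GetSystemTime.

```python
import time


class CoarseClock:
    """Return a cached timestamp, refreshing it only every `refresh_every` calls."""

    def __init__(self, refresh_every=10, now=time.time):
        self._refresh_every = refresh_every
        self._now = now
        self._calls_since_refresh = 0
        self._cached = now()

    def timestamp(self):
        if self._calls_since_refresh >= self._refresh_every:
            self._cached = self._now()
            self._calls_since_refresh = 0
        self._calls_since_refresh += 1
        return self._cached


# Fake clock that returns 0, 1, 2, ... so the refresh points are visible.
ticks = iter(range(100))
clock = CoarseClock(refresh_every=10, now=lambda: next(ticks))
stamps = [clock.timestamp() for _ in range(25)]
print(stamps)
```

The trade-off is that up to refresh_every - 1 consecutive log entries share a stale timestamp, so this only fits if the log does not need per-event accuracy.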
As per your comments; calling GetSystemDateFormat and GetSystemTimeFormat each time will be a waste of time. These are not likely to change, so those values could easily be cached for improved performance. I would imagine (without having actually tested it) that these two calls are far more time-consuming than a call to GetSystemTime.
First of all, find out why your code is throwing exceptions (assuming you have described it correctly, i.e. a real exception has been thrown and the app descends into kernel mode, which is really slow, by the way).
Then you will most likely solve any performance bottleneck.
Chris J.
First, the fastest way I can think of is the RDTSC instruction. RDTSC reads the processor's time-stamp counter, which counts ticks at a very fine (sub-microsecond) granularity rather than reporting nanoseconds directly. Use it in combination with the CPUID instruction. To use those instructions properly, you can read "Using the RDTSC Instruction for Performance Monitoring"; you can then convert the tick count to seconds by dividing by the counter frequency.
Second, consider using QueryPerformanceFrequency() and QueryPerformanceCounter().
Third, not the fastest option but much faster than the standard GetSystemTime(): use timeGetTime().
I know that GetProcessTimes can be used to retrieve the time a process spent in user mode (as opposed to the time it spent in kernel mode or the time it was suspended). Unfortunately, it seems that the resolution is only 16ms (the same as GetTickCount).
Is there a way to retrieve the time with greater precision?
Take a look at QueryPerformanceCounter function. But I have heard that it has problems in multi core environments, not sure though.