How to measure total time spent in a function? - xcode

I have a utility function that I suspect is eating up a large portion of my application's execution time. Using Time Profiler to look at the call stack, this function takes up a large portion of the execution time of any function from which it is called. However, since this utility function is called from many different sources, I am having trouble determining if, overall, this is the best use of my optimization time.
How can I look at total time spent in this function during program execution, regardless of who called it?
For clarity, I want to combine the selected entries with all other calls to that function into a single entry:

For me, what did the trick was ticking "Invert Call Tree". It seems to sort the "leaf" functions in the call tree by how much time they accumulate across all callers, and lets you see what calls them.
The checkbox can be found in the right panel, called "Display Settings" (if hidden: ⌘2 or View->Inspectors->Show Display Settings).

I am not aware of an Instruments-based solution, but here is something you can do from code. I hope somebody provides an Instruments solution, but until then this should get you going.
#include <time.h>

// Have this as a global variable to track time taken by the culprit function
static double time_consumed = 0;

void myTimeConsumingFunction() {
    // Add these lines at the top of the function
    clock_t start, end;
    start = clock();

    // ... main body of the function taking up time ...

    end = clock();
    // Add this at the bottom and keep accumulating time spent across all calls
    time_consumed += (double)(end - start) / CLOCKS_PER_SEC;
}

// At termination/end-of-program, log time_consumed.
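For that last logging step, one possibility (a sketch of my own, not part of the original answer; the handler name report_time_consumed is invented) is to register an atexit handler so the accumulated total is printed on normal program exit:

#include <cstdio>
#include <cstdlib>
#include <ctime>

static double time_consumed = 0; // same accumulator as in the snippet above

// Stand-in for the instrumented function from the snippet above.
static void myTimeConsumingFunction() {
    clock_t start = clock();
    // ... work ...
    clock_t end = clock();
    time_consumed += (double)(end - start) / CLOCKS_PER_SEC;
}

// Hypothetical exit handler; the name is made up for this sketch.
static void report_time_consumed() {
    std::printf("time_consumed: %.3f s\n", time_consumed);
}

int main() {
    std::atexit(report_time_consumed); // register once, early in main
    myTimeConsumingFunction();         // ... normal program work ...
    return 0;
}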

To see the totals for a particular function, follow these steps:
Profile your program with Time Profiler
Find and select any mention of the function of interest in the Call Tree view (you can use Edit->Find)
Summon the context menu over the selected function and choose 'Focus on calls made by' (or use Instrument->Call Tree Data Mining->Focus on Calls Made By)
If your program is multi-threaded and you want a total across all threads, make sure 'Separate by Thread' is not checked.

I can offer the makings of the answer you're looking for but haven't got this working within Instruments yet...
Instruments uses dtrace under the hood. dtrace allows you to respond to events in your program such as a function being entered or returned from. The response to each event can be scripted.
You can create a custom instrument with scripting in Instruments.
Here is a noddy shell script that launches dtrace outside of Instruments and records the time spent in a certain function.
#!/bin/sh
dtrace -c <yourprogram> -n '
unsigned long long totalTime;
self uint64_t lastEntry;

dtrace:::BEGIN
{
    totalTime = 0;
}

pid$target:<yourprogram>:*<yourfunction>*:entry
{
    self->lastEntry = vtimestamp;
}

pid$target:<yourprogram>:*<yourfunction>*:return
{
    totalTime = totalTime + (vtimestamp - self->lastEntry);
    /*@timeByThread[tid] = sum(vtimestamp - self->lastEntry);*/
}

dtrace:::END
{
    printf("\n\nTotal time %dms\n", totalTime / 1000000);
}
'
What I haven't figured out yet is how to transfer this into instruments and get the results to appear in a useful way in the GUI.

I think you can call system("time ls"); twice and it will just work for you. The output will be printed to the debug console.

Related

Is there a race between starting and seeing yourself in WinApi's EnumProcesses()?

I just found this code in the wild:
def _scan_for_self(self):
    win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
    basename = self.cmdline.split()[0]
    pids = win32process.EnumProcesses()
    if not pids:
        UserLog.warn("WindowsProcess", "no pids", pids)
    for pid in pids:
        try:
            handle = win32api.OpenProcess(
                win32con.PROCESS_QUERY_INFORMATION | win32con.PROCESS_VM_READ,
                pywintypes.FALSE, pid)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
        try:
            modlist = win32process.EnumProcessModules(handle)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
This line caught my eye:
win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
It suggests that if you call EnumProcesses() too fast after starting, you won't see yourself. Is there any truth to this?
There is a race, but it's not the race the code tried to protect against.
A successful call to CreateProcess returns only after the kernel object representing the process has been created and enqueued into the kernel's process list. A subsequent call to EnumProcesses accesses the same list, and will immediately observe the newly created process object.
That is, unless the process object has since been destroyed. This isn't entirely unusual since processes in Windows are initialized in-process. The documentation even makes note of that:
Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated.
What this means is that if a call to EnumProcesses immediately following a successful call to CreateProcess doesn't observe the newly created process, it does so because it was late rather than early. If you are late already then adding a delay will only make you more late.
Which swiftly leads to the actual race here: Process IDs uniquely identify processes only for a finite time interval. Once a process object is gone, its ID is up for grabs, and the system will reuse it at some point. The only reliable way to identify a process is by holding a handle to it.
Now it's anyone's guess what the author of _scan_for_self was trying to accomplish. As written, the code takes more time to do something that's probably altogether wrong [1] anyway.
[1] Turns out my gut feeling was correct. This is just your average POSIX developer who, in the process of learning that POSIX is insufficient, would rather call out Microsoft than actually use an all-around superior API.
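To make the "hold a handle" point concrete, here is a small sketch of my own (not code from the answer above): CreateProcess already returns both the process handle and the process ID, so there is no need to rediscover the child via EnumProcesses at all. The command line notepad.exe is just an example:

#include <Windows.h>
#include <iostream>

int main() {
    STARTUPINFOW si{};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi{};

    // Example command line; CreateProcessW may modify the buffer, so it must be writable.
    wchar_t cmd[] = L"notepad.exe";

    if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                        nullptr, nullptr, &si, &pi)) {
        std::cerr << "CreateProcess failed: " << GetLastError() << "\n";
        return 1;
    }

    // pi.hProcess identifies the child for as long as we hold it open,
    // even if the child exits and its PID is later reused by the system.
    std::wcout << L"Child PID: " << pi.dwProcessId << L"\n";

    WaitForSingleObject(pi.hProcess, INFINITE); // e.g. wait for the child to exit

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}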
The documentation for EnumProcesses (Win32 API - EnumProcesses function) does not mention anything about a delay being needed to see the current process in the list it returns.
The example from Microsoft how to use EnumProcess to enumerate all running processes (Enumerating All Processes), also does not contain any delay before calling EnumProcesses.
A small test application I created in C++ (see below) always reports that the current process is in the list (tested on Windows 10):
#include <Windows.h>
#include <Psapi.h>
#include <iostream>
#include <vector>

const DWORD MAX_NUM_PROCESSES = 4096;
DWORD aProcesses[MAX_NUM_PROCESSES];

int main(void)
{
    // Get the list of running process Ids:
    DWORD cbNeeded;
    if (!EnumProcesses(aProcesses, MAX_NUM_PROCESSES * sizeof(DWORD), &cbNeeded))
    {
        return 1;
    }

    // Check if current process is in the list:
    DWORD curProcId = GetCurrentProcessId();
    bool bFoundCurProcId{ false };
    DWORD numProcesses = cbNeeded / sizeof(DWORD);
    for (DWORD i = 0; i < numProcesses; ++i)
    {
        if (aProcesses[i] == curProcId)
        {
            bFoundCurProcId = true;
        }
    }

    std::cout << "bFoundCurProcId: " << bFoundCurProcId << std::endl;
    return 0;
}
Note: I am aware that the fact that the program reported the expected result does not mean that there is no race. Maybe I just couldn't catch it manifesting. But trying to run code like that can sometimes give you a hint (especially if the result had been that there is a race).
The fact that I never had a problem running this test (I did it many times), together with the lack of any mention of a needed delay in Microsoft's documentation, makes me believe that it is not required.
My conclusion is that either:
There is a unique issue when using it from python (doubt it).
or:
The code you found is doing something unnecessary.
There is no race.
EnumProcesses calls an NT API function that switches to kernel mode to walk the linked list of processes. Your own process has been added to the list before it starts running.

Qtimer: calling different slot every 2000 ms

I'm working on a project, but I'm new to C++ and Qt and I'm having some big troubles.
I have my Qt program with my GUI, and I have already written all the slots related to the buttons in the GUI. Now what I need is to call different slots in sequence, every 2000 ms. To explain better, the slots are, for example:
void on_setPort_clicked();
void on_portSearch_clicked();
void on_openPort_clicked();
void on_ledON_clicked();
I need the program (when I push the relevant button) to execute the first one, then after 2 seconds the second one, then after 2 seconds the third one, and so on...
How can I do this? So far I have understood how to make a single slot execute every 2 seconds, but I need a different slot every 2 seconds. I don't know what to put in my .h file and in my .cpp.
Thanks guys, I hope I have been clear in my question; sorry for my English, I'm Italian.
PS: I also need a slot, for example on_STOP_clicked, that stops the sequence from continuing (like stopping the timer) when I push the corresponding button in the GUI.
Qt slots are just normal functions, so what you need to do is have the timer call one slot, have that slot determine which function should be called next, and then call it.
// State held somewhere else that makes sense in your program
// (preferably not just a global)
int nextSlot = 0;

void timer_slot() {
    switch (nextSlot) {
    case 0:
        first_slot();
        break;
    case 1:
        second_slot();
        break;
    // etc....
    }
    // Advance to the next slot and wrap around after the last one
    nextSlot = (nextSlot + 1) % number_of_other_functions;
}
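To show how this might plug into a QTimer together with the stop button the asker mentioned, here is a rough sketch (entirely my own; the class name Sequencer, the member names, and the commented-out slot calls are placeholders, not the asker's actual code):

#include <QObject>
#include <QTimer>

class Sequencer : public QObject {
    Q_OBJECT
public:
    explicit Sequencer(QObject *parent = nullptr) : QObject(parent) {
        connect(&m_timer, &QTimer::timeout, this, &Sequencer::nextStep);
    }

public slots:
    void on_start_clicked() {      // e.g. connected to the start button
        m_step = 0;
        m_timer.start(2000);       // fire every 2000 ms
    }
    void on_STOP_clicked() {       // stops the sequence
        m_timer.stop();
    }

private slots:
    void nextStep() {
        switch (m_step) {
        case 0: /* on_setPort_clicked(); */ break;
        case 1: /* on_portSearch_clicked(); */ break;
        case 2: /* on_openPort_clicked(); */ break;
        case 3: /* on_ledON_clicked(); */ break;
        default: m_timer.stop(); return;   // sequence finished
        }
        ++m_step;
    }

private:
    QTimer m_timer;   // in practice a member of your window class, declared in the .h file
    int m_step = 0;
};

In the asker's case the calls in nextStep would be the existing on_setPort_clicked() etc. slots, and the timer and step counter would live as members of the main window class.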

determine the time taken for my code to execute in MFC

I need a way to find out the amount of time taken by a function, or by a section of code inside a function, to execute.
Does Visual Studio provide any mechanism for doing this, or is it possible to do so from the program using MFC functions? I am new to MFC, so I am not sure how this can be done. I thought this should be a pretty straightforward operation, but I cannot find any examples of how it may be done either.
A quick way, but quite imprecise, is by using GetTickCount():
DWORD time1 = GetTickCount();
// Code to profile
DWORD time2 = GetTickCount();
DWORD timeElapsed = time2-time1;
The problem is GetTickCount() uses the system timer, which has a typical resolution of 10 - 15 ms. So it is only useful with long computations.
It can't tell the difference between a function that takes 2 ms to run and one that takes 9 ms. But if you are in the seconds range, it may well be enough.
If you need more resolution, you can use the performance counter, as RedEye explains.
Or you can try a profiler (maybe this is what you were looking for?). See this question.
There may be better ways, but I do it like this :
// At the start of the function
LARGE_INTEGER lStart;
QueryPerformanceCounter(&lStart);
LARGE_INTEGER lFreq;
QueryPerformanceFrequency(&lFreq);
// At the end of the function
LARGE_INTEGER lEnd;
QueryPerformanceCounter(&lEnd);
// Use the 64-bit QuadPart members so the arithmetic doesn't overflow the way it can with LowPart
TRACE("FunctionName t = %dms\n", (int)((1000 * (lEnd.QuadPart - lStart.QuadPart)) / lFreq.QuadPart));
I use this method quite a lot for optimising graphics code, finding time taken for screen updates etc. There are other methods of doing the same or similar, but this is quick and simple.
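If you find yourself pasting those lines into many functions, one way to package the same idea (my own sketch, not part of the answer; the class name ScopedTimer is invented) is a small RAII helper that reports the elapsed time when it goes out of scope:

#include <Windows.h>
#include <cstdio>

// Invented helper: measures wall-clock time from construction to destruction
// using QueryPerformanceCounter and reports it to the debugger output.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* name) : m_name(name) {
        QueryPerformanceFrequency(&m_freq);
        QueryPerformanceCounter(&m_start);
    }
    ~ScopedTimer() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        double ms = 1000.0 * double(end.QuadPart - m_start.QuadPart) / double(m_freq.QuadPart);
        char buf[128];
        std::snprintf(buf, sizeof(buf), "%s took %.3f ms\n", m_name, ms);
        OutputDebugStringA(buf);   // shows up in the VS Output window, like TRACE
    }
private:
    const char* m_name;
    LARGE_INTEGER m_freq{};
    LARGE_INTEGER m_start{};
};

// Usage: time a whole function, or just a block, by giving the timer a scope
void SomeFunction() {
    ScopedTimer t("SomeFunction");
    // ... code to profile ...
}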

How to check which index in a loop is executing without slow down process?

What is the best way to check which index is executing in a loop without slowing the process down too much?
For example I want to find all long fancy numbers and have a loop like
for( long i = 1; i > 0; i++){
//block
}
and I want to know which i is executing, in real time.
Several ways I know of to do this in the block are printing i every time, checking if(i % 10000), or adding a listener.
Which one of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.start(); // start LoopMonitor
        while (atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;

        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }

        public void run() {
            while (true) {
                try {
                    System.out.println(atomicLong.longValue()); // Print l
                    Thread.sleep(1000); // Sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used it here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified for loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get near-zero-cost reporting (on multi-CPU computers): make your index a "global" property (class-wide, for instance), and have a separate thread read and report its value.
This report could be timer-based (5 times per second or so).
Note: you may also need a boolean stating "are we in the loop?".
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to keep the loop variable in a register. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
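For what it's worth, a minimal C++ sketch of that monitor-thread idea (my own code with invented names, including the "are we in the loop?" flag from the previous answer; it uses std::atomic in place of a bare volatile global, which also takes care of the visibility concern just described):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<long> currentIndex{0};   // shared loop index
std::atomic<bool> inLoop{false};     // "are we in the loop?"

int main() {
    // Monitor thread: only reads and reports the index, never touches the loop's work.
    std::thread monitor([] {
        while (true) {
            if (inLoop.load(std::memory_order_relaxed))
                std::printf("i = %ld\n", currentIndex.load(std::memory_order_relaxed));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });
    monitor.detach();

    inLoop = true;
    for (long i = 1; i < 2000000000L; ++i) {   // stand-in for the question's long-running loop
        currentIndex.store(i, std::memory_order_relaxed); // cheap: no locking
        // ... block ...
    }
    inLoop = false;
    return 0;
}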
Real real time constraint?
So that begs the question of just how real-time your loop is anyway. I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles. Two reasons: a) no modern OS will ever come close to guaranteeing that (you'd have to be running on bare metal), and b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
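As a rough sketch of that pattern in C++ (my own code with invented names, using std::mutex rather than a raw semaphore): the loop publishes a snapshot every 1000 iterations, and the monitor copies the structure under the lock before printing, keeping the locked region short as described.

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct Progress {
    long index = 0;
    long itemsFound = 0;   // example of extra per-loop info worth reporting
    bool running = false;
};

std::mutex progressMutex;
Progress progress;          // shared state, guarded by progressMutex

int main() {
    std::thread monitor([] {
        for (;;) {
            Progress snapshot;
            {
                std::lock_guard<std::mutex> lock(progressMutex);
                snapshot = progress;                 // copy, then release the lock quickly
            }
            std::printf("i = %ld, found = %ld\n", snapshot.index, snapshot.itemsFound);
            if (!snapshot.running && snapshot.index != 0) break;
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    });

    { std::lock_guard<std::mutex> lock(progressMutex); progress.running = true; }
    long found = 0;
    for (long i = 1; i < 2000000000L; ++i) {
        // ... block; update `found` as appropriate ...
        if (i % 1000 == 0) {                         // publish occasionally, not every iteration
            std::lock_guard<std::mutex> lock(progressMutex);
            progress.index = i;
            progress.itemsFound = found;
        }
    }
    { std::lock_guard<std::mutex> lock(progressMutex); progress.running = false; }
    monitor.join();
    return 0;
}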

How to locate idle time (and network IO time, etc.) in XPerf?

Let's say I have a contrived program:
#include <Windows.h>

void useless_function()
{
    Sleep(5000);
}

void useful_function()
{
    // ... do some work
    useless_function();
    // ... do some more work
}

int main()
{
    useful_function();
    return 0;
}
Objective: I want the profiler to tell me that useful_function() is needlessly calling useless_function(), which waits for no obvious reason. Under XPerf, this doesn't show up in any of the graphs I have because the call to WaitForMultipleObjects() seems to be accounted to Idle.exe instead of my own program.
And here's the xperf command line that I currently run:
xperf -on Latency -stackwalk Profile
Any ideas?
(This is not restricted to wait functions. The above might have been solved by placing breakpoints at NtWaitForMultipleObjects. Ideally there could be a way to see the stack sample that's taking up a lot of wall-clock time as opposed to only CPU time)
I think what you are looking for is the Wait Analysis with Ready Thread functionality in Xperf. It captures every context switch and gives you the call stack of the thread once it wakes up from a sleep (or some other blocking operation). In your case, you would see the stack just after the call to Sleep(5000) as well as the time spent sleeping.
The functionality is a bit obscure to use. But it is fortunately well described here:
Use Xperf's Wait Analysis for Application-Performance Troubleshooting
Wait Analysis is the way to do this. You should:
Record the CSWITCH provider, in order to get all context switches
Record call stacks on context switches by adding +CSWITCH to your -stackwalk argument
Probably record call stacks on the ready thread to get more information on who readied you (i.e., who released the Mutex or CS or semaphore, and where) by adding +READYTHREAD to your -stackwalk
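Putting those flags together, the capture command might look roughly like this (a hedged example based on the flag names above; run xperf -help stackwalk to confirm the exact spellings for your WPT version):
xperf -on Latency -stackwalk Profile+CSwitch+ReadyThread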
Then you use CPU Usage (Precise) in WPA (or xperfview, but that's ancient) to look at the context switches and find where your TimeSinceLast is high on a thread that shouldn't be going idle. You'll typically want the columns in CPU Usage (Precise) in this sort of order:
NewProcess (your process being switched in)
NewThreadId
NewThreadStack
ReadyingProcess (who made your thread ready to run)
ReadyingThreadId (optional)
ReadyThreadStack (optional, requires +ReadyThread on -stackwalk)
Orange bar
Count
TimeSinceLast (us) - sort by this column, usually
Whatever other columns you want
For details see these particular articles from my blog:
- https://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
- https://randomascii.wordpress.com/2012/06/19/wpaxperf-trace-analysis-reimagined/
This "profiler" will tell you - just randomly pause it a few times and look at the stack. If do some work takes 5 seconds, and do some more work takes 5 seconds, then 33% of the time the stack will look like this
main: calling useful_function
useful_function: calling useless_function
useless_function: calling Sleep
So roughly 33% of your stack samples will show exactly that. Any line of code that's costing some fraction of wall-clock time will appear on roughly that fraction of samples.
On the rest of the samples you will see it doing the other things.
There are automated profilers that do the same thing in a more pretty way, such as Zoom and LTProf, although they don't actually show you the samples.
I looked at the xperf doc, trying to figure out if you could get stack samples on wall-clock time and get percentages at line-level resolution. It seems you have to be on Windows 7 or Vista. They only bother with functions, not lines, which, if you have realistically big functions, is important. I couldn't figure out how to get access to the individual samples, which I think is important for seeing why the program is spending its time.
