In previous versions of VTune, there was a program called dsep.exe, which could be used to periodically poll hardware counters (specifically related to DRAM reads/writes) from VTune. This allowed me to gather counter data at each instant in time, rather than one summary at the end.
Unfortunately, this tool has been deprecated in 64-bit operating systems. Does anyone know a way to periodically (e.g., every 1 sec) get hardware counter data from VTune (or another program in Windows)?
Thanks in advance for your help.
All right, I wasn't able to completely fix this issue, but I got pretty close.
The latest version of VTune saves all of the hardware counter data in a SQLite database (projectfolder/sqlite-db/dicer.db). Since you can't export all of the hardware counter data directly from the GUI, you can use a SQLite browser to get at the data you need.
Most of the hardware data is stored in the pmu-data table, timestamped with some wonky version of an rdtsc call.
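As a quick illustration, here is a minimal sketch using the SQLite C API that dumps the first few rows of that table. The database path is the one from my project and the column layout is undocumented, so it just prints whatever is there; names may differ between VTune versions:

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3* db;
    if (sqlite3_open("projectfolder/sqlite-db/dicer.db", &db) != SQLITE_OK)
        return 1;

    // The table name contains a dash, so it has to be quoted.
    sqlite3_stmt* stmt;
    if (sqlite3_prepare_v2(db, "SELECT * FROM \"pmu-data\" LIMIT 10;",
                           -1, &stmt, NULL) == SQLITE_OK) {
        int cols = sqlite3_column_count(stmt);
        while (sqlite3_step(stmt) == SQLITE_ROW) {   // one row per counter sample
            for (int i = 0; i < cols; i++)
                printf("%s=%s  ", sqlite3_column_name(stmt, i),
                       (const char*)sqlite3_column_text(stmt, i));
            printf("\n");
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
    return 0;
}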
I would like to know if there is a proper method to track memory accesses across multiple resources at once. For example, I set up a simple dual core CPU by extending the simple.py from learning gem5 (I just added another TimingSimpleCPU and made the port connections).
I took a look at the different debug options and found, for example, the MemoryAccess flag (among others), but this seemed to show only the accesses at the DRAM or at one other resource component.
Nevertheless, I imagine there is a way to track events across the CPU, the bus, and finally the memory.
Does this feature already exist?
What can I try next? Would it be an idea to add my own --debug-flag, or can I work with the TraceCPU for my use case?
I haven't worked much with gem5 yet, so I'm not sure how to achieve this. Until now I have only run in SE mode; would FS mode be a solution?
Finally, I also found the TraceCPUData flag among the --debug-flags, but running my config script with it created no output (like many other flags, by the way). It seems that this is a --debug-flag for the TraceCPU; what kind of output does this flag create, and can it help me?
I'm doing this as a personal project; I want to make a visualizer for this data, but the first step is getting the data.
My current plan is to:
1. make my program debug the target process and step through it,
2. at each step, record the EIP from every thread's context within the target process,
3. construct the memory address the instruction uses from the context and store it.
Is there an easier or built-in way to do this?
Have a look at Intel PIN for dynamic binary instrumentation, i.e. running a hook for every load/store instruction.
Instead of actually single-stepping in a debugger (extremely slow), it does binary-to-binary JIT to add calls to your hooks.
https://software.intel.com/sites/landingpage/pintool/docs/81205/Pin/html/index.html
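To give a feel for the API, here is a memory-trace pintool in the spirit of the pinatrace example from the Pin manual; consider it a sketch to build on rather than a finished tool:

#include "pin.H"
#include <stdio.h>

FILE* trace;

// Analysis routines: called at run time before each memory operand executes.
VOID RecordMemRead(VOID* ip, VOID* addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
VOID RecordMemWrite(VOID* ip, VOID* addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

// Instrumentation routine: called once per instruction when Pin JITs it.
VOID Instruction(INS ins, VOID* v)
{
    UINT32 memOperands = INS_MemoryOperandCount(ins);
    for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
        if (INS_MemoryOperandIsRead(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        if (INS_MemoryOperandIsWritten(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
    }
}

VOID Fini(INT32 code, VOID* v) { fclose(trace); }

int main(int argc, char* argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("memtrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();  // never returns
    return 0;
}

Build it against the Pin kit and launch the target as pin -t yourtool -- app.exe (the tool and program names here are placeholders); every load/store then lands in memtrace.out tagged with its instruction pointer.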
Honestly the best way to do this is probably instrumentation like Peter suggested, depending on your goals. Have you ever run a script that stepped through code in a debugger? Even automated, it's incredibly slow.

The only other alternative I see is page faults, which would also be incredibly slow but should still be faster than single-stepping. Basically you make every page not in the currently executing section inaccessible. Any read/write access outside of executing code will then trigger an exception, where you can log details and handle it. Of course this has a lot of flaws: you can't detect reads/writes in the current page, it's still going to be slow, and it can get complicated, e.g. handling page execution transfers, multiple threads, etc.

The final possible solution I have would be a timer interrupt that checks the read/write access bits for each page. This would be incredibly fast and, although it would provide no specific addresses, it would give you an aggregate of pages written to and read from. I'm actually not entirely sure off the top of my head whether Windows exposes that information already, and I'm also not sure whether there's a reliable way to guarantee your timers fire before the kernel clears those bits.
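To make the page-fault idea concrete, here is a hypothetical single-threaded sketch using VirtualProtect and a vectored exception handler. It logs only the first touch of each page and then unlocks it, and it skips all the hard parts mentioned above (re-protecting pages, execution transfers, multiple threads):

#include <windows.h>
#include <stdio.h>

// Log the faulting access, make the page accessible, and retry the instruction.
LONG CALLBACK OnAccessViolation(EXCEPTION_POINTERS* info)
{
    if (info->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return EXCEPTION_CONTINUE_SEARCH;

    ULONG_PTR isWrite = info->ExceptionRecord->ExceptionInformation[0];
    ULONG_PTR addr    = info->ExceptionRecord->ExceptionInformation[1];
    printf("%s at %p\n", isWrite ? "write" : "read", (void*)addr);

    DWORD old;
    VirtualProtect((LPVOID)addr, 1, PAGE_READWRITE, &old);  // unlocks the whole page
    return EXCEPTION_CONTINUE_EXECUTION;                    // re-run the faulting instruction
}

int main(void)
{
    AddVectoredExceptionHandler(1, OnAccessViolation);
    char* page = (char*)VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_NOACCESS);
    page[100] = 42;                   // faults once, gets logged, then succeeds on retry
    printf("done: %d\n", page[100]);  // no fault: the page is now read/write
    return 0;
}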
We have some code that relies on extensive usage of fork. We have started to hit performance problems, and one of our hypotheses is that a lot of speed is wasted when copy-on-write happens in the forked processes.
Is there a way to specifically detect when and how copy-on-write happens, to get detailed insight into this process?
My platform is OS X, but more general information is also appreciated.
There are a few ways to get this info on OS X. If you're satisfied with just watching information about copy-on-write behavior from the command-line, you can use the vm_stat tool with an interval. E.g., vm_stat 0.5 will print full statistics twice per second. One of the columns is the number of copy-on-write faults.
If you'd like to gather specific information in a more detailed way, but still from outside the actual running process, you can use the Instruments application that comes with OS X. This includes a set of tools for gathering information about a running process, the most useful of which for your case are likely to be the VM Tracker, Virtual Memory Trace, or Shared Memory instruments. These capture lots of useful information over the lifetime of a process. The application is not super intuitive, but it will do what you need.
If you'd like detailed information in-process, I think you'll need to use the (poorly documented) VM statistics API. You can request that the kernel fill a vm_statistics struct using the host_statistics routine. For example, running this code:
#include <mach/mach.h>

mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
vm_statistics_data_t vmstats;
kern_return_t kr = host_statistics(mach_host_self(), HOST_VM_INFO,
                                   (host_info_t)&vmstats, &count);
will fill the vmstats structure with information such as cow_faults, which gives the number of faults triggered by copy-on-write behavior. Check out the headers /usr/include/mach/vm_*, which declare the types and routines for gathering this information.
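One caveat: these counters are host-wide, not per-process. Sampling twice and diffing still gives a usable signal while your fork-heavy workload runs; continuing from the snippet above (and assuming <stdio.h> and <unistd.h> are also included):

natural_t before = vmstats.cow_faults;
sleep(1);
count = HOST_VM_INFO_COUNT;  // host_statistics overwrites count, so reset it
host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vmstats, &count);
printf("CoW faults in the last second: %u\n", vmstats.cow_faults - before);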
I ask this question because I'd like to know this from my kernel-mode Windows driver.
I have some library code ported from user mode that has an accompanying stress test to run; that stress test code needs to know when the CPU is idle.
Simple googling shows no results, at least in the first several pages.
You need to use ZwQuerySystemInformation with the SystemProcessorPerformanceInformation info class (you get an array of SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION structures on output).
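Roughly, that can look like the sketch below. The information-class value (8) and the structure layout are the commonly published but officially undocumented ones, so treat this as an assumption-laden starting point rather than a supported API:

#include <ntddk.h>

// Officially undocumented; this is the commonly published layout for info class 8.
typedef struct _SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION {
    LARGE_INTEGER IdleTime;
    LARGE_INTEGER KernelTime;    // note: KernelTime includes IdleTime
    LARGE_INTEGER UserTime;
    LARGE_INTEGER DpcTime;
    LARGE_INTEGER InterruptTime;
    ULONG         InterruptCount;
} SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION;

// Exported by ntoskrnl but not declared in the WDK headers, so declare it here
// (drop the extern "C" if compiling as C).
extern "C" NTSTATUS ZwQuerySystemInformation(ULONG SystemInformationClass,
    PVOID SystemInformation, ULONG SystemInformationLength, PULONG ReturnLength);

// Sum IdleTime (100 ns units) over all processors.
static NTSTATUS QueryTotalIdleTime(LONGLONG* totalIdle)
{
    ULONG cpus = KeQueryActiveProcessorCount(NULL);
    ULONG size = cpus * sizeof(SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION);
    SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION* info =
        (SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION*)
            ExAllocatePoolWithTag(NonPagedPool, size, 'eldI');
    if (info == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    ULONG returned = 0;
    NTSTATUS status = ZwQuerySystemInformation(
        8 /* SystemProcessorPerformanceInformation */, info, size, &returned);
    if (NT_SUCCESS(status)) {
        *totalIdle = 0;
        for (ULONG i = 0; i < cpus; i++)
            *totalIdle += info[i].IdleTime.QuadPart;
    }
    ExFreePoolWithTag(info, 'eldI');
    return status;
}

Your stress test can call this twice some interval apart; the IdleTime delta divided by (number of CPUs × elapsed time) gives the idle fraction.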
While evaluating the memory performance of the Power8 processor using perf, I ran into the problem of understanding the difference between the PM_DATA_ALL_* and PM_DATA_* events. Most of the counters exist in both versions, but the descriptions in the oprofile documentation and in papi_native_avail are the same; for example:
PM_DATA_FROM_LMEM
The processor's data cache was reloaded from the local chip's Memory due to either only demand loads or demand loads plus prefetches if MMCR1[16] is 1.
I thought I would figure out the difference by measuring some data. If I provide a large enough task, I can observe the expected difference that the *_ALL versions have higher values. (I understand the concept of multiplexing counters when measuring with perf.)
So what actually is the ALL in these events?
After a few more hours of searching, I found another source directly from IBM describing the events as:
PM_DATA_ALL_FROM_LMEM
The processor's data cache was reloaded from the local chip's Memory due to either demand loads or data prefetch
and
PM_DATA_FROM_LMEM
The processor's data cache was reloaded from the local chip's Memory due to a demand load
So the difference is the prefetch loads, which are not included in the second version.
The PAPI and perf tools simply carry the wrong description. These events were contributed directly to oprofile by IBM, but apparently with some mistakes/inaccuracies. As I browse through the PAPI/libpfm sources, I see that the correct description is in the .pme_short_desc field, but the .pme_long_desc fields are both the same, and papi_native_avail reports only the long one.
Thanks for your patience. Summing up the issue like this helped me a lot, and I hope it will help somebody struggling with similar problems.