Is it possible to get information of cache inside of GPU using perf event? - gpgpu

I'm using perf event to collect performance counts and cache information (such as cache access and cache miss counts).
Now I want to get the GPU's cache information, but I'm not sure whether perf event can provide it.
I ran one test:
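ioctl(fd, PERF_EVENT_IOC_RESET, 0);    /* zero the counter */
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   /* start counting */
matrixMulCUDA<<< grid, threads >>> ( ... );
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  /* stop counting */
read(fd, &count, sizeof(count));       /* count: a long long declared earlier */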
I confirmed that some data does come out,
but I can't be sure it is cache information from inside the GPU.
Does anybody know about this?
http://man7.org/linux/man-pages/man2/perf_event_open.2.html (perf event tutorial)

Short version: No.
Longer version, and why this is not possible:
Your host runs one or two CPUs (leaving aside the individual cores of each), for which system tools like perf can start and collect a certain amount of "self-diagnostic telemetry" in the background while the system does its normal work.
The perf syntax supports all the parameters needed to specify ad hoc which process / CPU to inspect, and it records the data on the fly.
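To make the "which process / CPU" point concrete, here is a minimal sketch of how a counter is described and opened through perf_event_open(2), following the pattern in the man page linked in the question. The helper names make_cpu_counter_attr and open_cpu_counter are made up for illustration. Everything in it (event type, pid, cpu) refers to the host CPU's PMU; there is no field that points it at a GPU.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Describe one host-CPU hardware event; the counter starts out disabled. */
static struct perf_event_attr make_cpu_counter_attr(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;   /* events of the CPU's own PMU */
    attr.size = sizeof(attr);
    attr.config = config;             /* e.g. PERF_COUNT_HW_CACHE_MISSES */
    attr.disabled = 1;                /* started later via PERF_EVENT_IOC_ENABLE */
    attr.exclude_kernel = 1;          /* count user space only */
    attr.exclude_hv = 1;
    return attr;
}

/* glibc ships no perf_event_open() wrapper, so call it through syscall(2).
 * pid = 0, cpu = -1: follow the calling thread on whichever CPU it runs.
 * Returns a file descriptor, or -1 with errno set. */
static int open_cpu_counter(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}
```

The returned descriptor is what the ioctl() calls in the question operate on. A counter opened this way sees the cache misses of the host thread that launches the kernel, not those of the kernel running on the GPU.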
While this works fine as an ad-hoc observer for a CPU-hosted process, a typical GPU operates many (read: 14 or more) Streaming Multiprocessors (SMs), each equipped with roughly 16, 32 or 64 parallel code-execution cores.
If this were the only obstacle, one could guess that the host operating system has some "spare" power to sniff at a few of its own processes, but over-subscribing this capability could easily destabilise otherwise smooth operation that still has to fit within the footprint of that "spare power".
Now imagine a hypothetical case in which a GPU could, and would, throw a few hundred times more of such "performance telemetry" data onto the one or two CPUs of your host, leaving you to close your eyes and rely on that same "spare power" of the host operating system to somehow chew through it.
No, this would not be a fair move. And that is before even trying to "swallow" the data flow from a multi-GPU infrastructure ...
Still, what if?
While there might be a need to do something wild in this direction, each SM has its own L1 / texture cache resources (the L2 cache is shared between the SMs), and even supposing a well-designed TLP architecture left some "spare power" in the GPU to spend on SM-diagnostics overhead, you would have to code that yourself, very carefully, so as to limit the warp-divergence side effects on the execution of your GPU kernel. While you can move some limited data through a pin-hole from GPU code into the host address space via DMA / RDMA / pinned memory, the GPU will not let code running on SM[14] "read" telemetry data from SM[13], and so on.
The good news is that ...
to get at least some insight into how the GPU hardware architecture maps your specific GPU kernel onto specific hardware resources, the GPU vendor provides a set of tools. One of them can produce a simulated estimate of the SM cache usage / cache spill-over. It is computed on your host operating system while your kernel is compiled into the target GPU's assembly language, so it can help you design, fine-tune and compiler-optimise the final kernel according to your priorities: minimise latency, maximise processing speed, keep the highest TLP and ILP parallelism of warp execution.
If still in doubt, ...
check how many on-chip JTAG diagnostic connectors the GPU fabric offers you to connect to.
None...

Related

Using perf to monitor memory access of every CPU

I'm trying to use the Linux perf tool to sample the memory accesses in my program. Specifically, I'm using perf to monitor the read/write accesses of every CPU in a NUMA system.
I can now monitor every single CPU's read and write memory accesses, but I also need to know whether each access is a local or a remote memory access.
I went through the event list with perf list, but I only found some events about per-socket memory access.
Questions
Is there any way to get every single CPU's remote memory accesses when using perf?
Is there a better option than perf?
Yes, the PMU in your CPU can probably do what you want through the various uncore counters; in particular, they can count the various offcore responses for non-local memory accesses. This blog post is a reasonable starting point.
The main problem is that the perf tool, which is tied to the specific kernel version, often lags behind in its support of modern processors [1], especially when it comes to uncore and NUMA-related events [2].
To work around that, you can use Andi Kleen's pmu-tools, which provides an ocperf wrapper script that uses whatever underlying perf you have on your system but with up-to-date event IDs downloaded directly from Intel. That will usually give you access to the uncore events you need.
Of course, even when you get that working, these events are often very tough to interpret, especially because the mental model you have of demand-memory requests is complicated by a ton of factors such as prefetch behavior, request-for-ownership, accesses that "hit" in a line-buffer in the process of being filled, etc, etc.
[1] Both because adding new processors/events has some lag, and especially because the tool is tied to the kernel: you likely aren't on a bleeding-edge kernel, so even though mainline perf might have support, you are stuck with the perf version associated with your kernel.
[2] Probably because most kernel developers, like developers in general, aren't working on NUMA systems.

Cycle accurate emulation

I'm currently learning C for my next emulation project, a cycle accurate 68000 core (my last project being a non-cycle accurate Sega Master System emulator written in Java which is now on its third release). My query regards cycle level accuracy as taking things to this level is a new thing for me.
To break things down to a granularity of 1 CPU cycle, presumably I need to know how long memory accesses take and so on, but my question is: for instructions that take multiple cycles in their memory fetch/write stages, what is the CPU doing in each cycle? For example, are x bits copied per cycle?
With my SMS emulator I didn't have to worry too much about M1 stages etc, as it just used a cycle count for each instruction - in other words it is only accurate to an instruction level, not a cycle level. I'm not looking for architecture specific details, merely an idea of what sort of things I should look out for when going to this level of granularity.
68k details are welcome however. Basically I'm wondering what is supposed to happen if a video chip reads from an area of memory whilst a CPU is still writing the data to it mid way through that phase of an instruction, and other similar situations. I hope I've made it clear enough, thank you.
For a really cycle-accurate emulation you first have to decide on a master clock to use as the reference. That should be the fastest clock at whose granularity the running software can detect differences in the order of occurrence. This could be the CPU clock, but in most cases the bus cycle time decides the granularity at which events can be discerned (and that is often only a fraction of the CPU clock).
Then you need to find out the order of precedence among the different devices (ICs) connected to that bus (if there is more than one bus master). An example would be whether (and how) video DMA can delay the CPU.
There are generally no truly simultaneous events. Either the CPU writes before the DMA reads, or the other way around (that is still true for dual-ported devices; you just need to consider the device's inherent precedence mechanism).
Once you have a solid understanding of which clock effectively controls the granularity of discernible events, you can think about how to structure the emulator to reproduce that behaviour exactly.
This way you can create a 100% cycle-exact emulation, provided you have enough information about the behaviour of all the devices.
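One way to picture the master-clock idea is the sketch below. It is not taken from any particular emulator: struct device, its divider field and run_master_clock are hypothetical names, and the dividers are assumed values. Every device is stepped from the same master tick, and the array order stands in for the bus precedence, so two devices never act "at the same time".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device: stepped once every `divider` master-clock ticks. */
struct device {
    unsigned divider;      /* master ticks per device cycle (assumed value) */
    unsigned long cycles;  /* device cycles executed so far */
};

/* Advance the system by `ticks` master-clock ticks. Devices are listed in
 * precedence order, so on a shared tick devs[0] always acts first. */
static void run_master_clock(struct device *devs, size_t n, unsigned long ticks)
{
    assert(devs != NULL);
    for (unsigned long t = 0; t < ticks; ++t) {
        for (size_t i = 0; i < n; ++i) {
            assert(devs[i].divider > 0);
            if (t % devs[i].divider == 0)
                devs[i].cycles++;  /* a real core would do one cycle of work here */
        }
    }
}
```

With a video chip on divider 1 and a CPU on divider 4, sixteen master ticks give the video chip sixteen cycles and the CPU four, always in that order within a tick.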
Sorry I can't give you more detailed info, I know nothing about the specifics of the Sega's hardware.
My guess is that you don't have to get into excruciating detail to get good enough results for the timing for this sort of thing. Which you can't do anyway if you don't want to get into the specifics of the architecture.
Your main question seemed to be "what is supposed to happen if a video chip reads from an area of memory whilst a CPU is still writing the data to it". Generally on these older chips, the bus protocols are pretty simple (they're not packetized) and there is usually a pin that indicates that the bus is busy. So if the CPU is writing to memory, the video chip will simply have to wait until the CPU is done. Because of these sorts of limitations, dual ported ram was popular for a while so that the frame buffer could be simultaneously written by the CPU and read by the RAMDAC.

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which can measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only interested in whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
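Once a tool reports how many bytes went through the memory controllers, "saturated or not" comes down to comparing that figure with the theoretical peak. A rough sketch follows; the function names are made up, and the DDR4-3200 dual-channel figures in the usage note are only an assumed example configuration.

```c
#include <assert.h>

/* Theoretical peak in bytes/s:
 * transfers per second x bus width in bytes x number of channels. */
static double peak_bandwidth(double transfers_per_s, unsigned bus_bytes,
                             unsigned channels)
{
    return transfers_per_s * bus_bytes * channels;
}

/* Fraction of the peak actually used over a measurement interval. */
static double utilization(double bytes_moved, double seconds, double peak)
{
    assert(seconds > 0.0 && peak > 0.0);
    return (bytes_moved / seconds) / peak;
}
```

For example, 3200 MT/s on a 64-bit (8-byte) bus with two channels gives a peak of 51.2 GB/s; a program that moves 25.6 GB in one second is at about 50% of it. Sustained real-world bandwidth tops out below the theoretical peak, so the program can be memory-bound well before this ratio reaches 100%.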
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
It would be hard to find a tool that measures memory bandwidth utilization for your application.
But since the issue you face is a suspected memory bandwidth problem, you could try to measure whether your application is generating a lot of page faults per second, which would definitely mean that you are nowhere near the theoretical memory bandwidth.
You should also measure how cache-friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" for good sources that tell you how to do this.
It isn't possible to properly measure memory bus utilisation with any kind of software-only solution. (It used to be, back in the '80s or so. But then we got pipelining, caches, out-of-order execution, multiple cores, non-uniform memory architectures with multiple busses, etc.)
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specifically for Intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for Linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably-trained computer engineer as well as quite high performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy, with some busses - eg old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are usually used between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even if it normally triggers occasionally, can be monitored for excessively 'long' levels: for the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB fifo hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
You monitor utilisation either way by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity which is occurring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements.
This isn't as straightforward as just looking for periods of no activity: all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAMs do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc).
So the instrument needs to be able to analyse the data deeply enough for you to extract how busy it is.
Your best, and simplest bet is to find a PC hardware (CPU) vendor who has tools which do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows; there are Windows guest drivers for it), and also pin your VM to specific CPUs, whilst also configuring Linux to avoid putting other jobs on those same CPUs.

Software performance (MCPS and Power consumed) in a Embedded system

Assume an embedded environment which has a DSP core (or any other processor core).
If I have code for some application/functionality which is optimized to be among the best in terms of cycles consumed (MCPS), will it also be the best in terms of the power consumed by that code in a real hardware system?
Can code optimized for the least MCPS be guaranteed to have the least power consumption as well?
I know there are many aspects to be considered here, like the architecture of the underlying processor and the hardware system (memory, bus, etc.).
Very difficult to tell without putting a sensitive ammeter between your board and power supply and logging the current drawn. My approach is to test assumptions for various real world scenarios rather than go with the supporting documentation.
No, lowest cycle count will not guarantee lowest power consumption.
It's a good indication, but you didn't take into account that memory bus activity consumes quite a lot of power as well.
Your code may, for example, have a higher cycle count but lower power consumption if you move frequently needed data into internal memory (on-chip RAM). That won't increase the cycle count of your algorithms, but moving the data in and out of the internal memory does.
If your system has a cache as well as internal memory, optimize for best cache utilization as well.
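A back-of-the-envelope model shows how a higher cycle count can still cost less energy. This is only a sketch: struct profile and energy_nj are made-up names, and the per-cycle and per-access costs in the usage note are assumed numbers, not measurements of any real part.

```c
#include <assert.h>

/* Hypothetical profile of one run of an algorithm. */
struct profile {
    unsigned long cycles;        /* CPU cycles spent */
    unsigned long ext_accesses;  /* accesses over the external memory bus */
};

/* Energy in nanojoules under assumed per-cycle and per-access costs. */
static double energy_nj(struct profile p, double nj_per_cycle,
                        double nj_per_ext_access)
{
    assert(nj_per_cycle >= 0.0 && nj_per_ext_access >= 0.0);
    return p.cycles * nj_per_cycle + p.ext_accesses * nj_per_ext_access;
}
```

Assuming 0.1 nJ per cycle and 2 nJ per external access, a version with 1000 cycles and 200 external accesses costs about 500 nJ, while a version that spends 1200 cycles but only 20 external accesses costs about 160 nJ.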
This isn't a direct answer, but I thought this paper (from this answer) was interesting: Real-Time Task Scheduling for Energy-Aware Embedded Systems.
As I understand it, it tries to run each task in the processor's low-power state unless the deadline can't be met without high power. So in a scheme like that, more time-efficient code (fewer cycles) should allow the processor to spend more time throttled back.
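The scheme described here can be sketched as "pick the slowest clock that still meets the deadline". This is not the algorithm from the paper, just a minimal illustration; pick_frequency and the frequency table in the usage note are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* freqs_hz must be sorted in ascending order. Returns the index of the
 * slowest frequency that finishes `cycles` of work within `deadline_s`
 * seconds, or -1 if even the fastest setting is too slow. */
static int pick_frequency(const double *freqs_hz, int n, double cycles,
                          double deadline_s)
{
    assert(freqs_hz != NULL && n >= 0);
    for (int i = 0; i < n; ++i) {
        if (cycles / freqs_hz[i] <= deadline_s)
            return i;
    }
    return -1;
}
```

With settings of 100, 200 and 400 MHz and a 10 ms deadline, a task of 1.5 million cycles needs the 200 MHz setting, while a leaner version of 0.5 million cycles can stay at 100 MHz, which is exactly why fewer cycles help here.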

Power Efficient Software Coding

In a typical handheld/portable embedded device, battery life is a major concern in the design of the H/W, the S/W and the features the device can support. From the software programming perspective, one is aware of MIPS- and memory-optimized (data and program) code.
I am aware of the H/W deep sleep and standby modes that are used to clock the hardware at a lower frequency or to turn off the clock entirely to some unused circuits to save power, but I am looking for ideas from this point of view:
My code is running and it needs to keep executing; given this, how can I write the code "power"-efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures or control structures which I should look at to achieve minimum power consumption for a given functionality?
Are there any high-level S/W design considerations which one should keep in mind at the time of code structure design, or during low-level design, to make the code as power-efficient (least power-consuming) as possible?
Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.
Zeroth, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecond or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power-saving, wait-for-next-interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/WaitForMultipleObjects().
This puts stress on the thread scheduler (and your brain), but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.
Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.
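As a minimal POSIX illustration of waiting instead of polling: the thread below sleeps inside read() until data arrives, so the CPU is free to idle in the meantime. wait_for_byte is a made-up helper name, and a pipe or similar blocking file descriptor is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Block until one byte arrives on fd; the thread sleeps in the kernel
 * while it waits. Returns the byte (0..255), or -1 on EOF or error. */
static int wait_for_byte(int fd)
{
    unsigned char b;
    ssize_t n;
    assert(fd >= 0);
    do {
        n = read(fd, &b, 1);          /* blocks; no busy loop */
    } while (n < 0 && errno == EINTR); /* retry if a signal interrupted us */
    return n == 1 ? (int)b : -1;
}
```

The polling alternative would be a loop that keeps checking a flag or calling a non-blocking read, which keeps the CPU awake even when nothing happens.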
From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.
To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.
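The rule of thumb above (power roughly proportional to f times V squared) is easy to turn into a quick estimate. A small sketch, with relative_power being a made-up name; it covers only this first-order dynamic term and ignores static leakage.

```c
#include <assert.h>

/* Relative dynamic power after scaling clock frequency and supply
 * voltage, using the first-order model P ~ f * V^2. */
static double relative_power(double f_scale, double v_scale)
{
    assert(f_scale >= 0.0 && v_scale >= 0.0);
    return f_scale * v_scale * v_scale;
}
```

Halving the clock and dropping the supply to 80% gives 0.5 x 0.8 x 0.8 = 0.32, i.e. roughly a third of the original power under this model. Whether a given part behaves like that is something only a measurement will tell you.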
Consider using the network interfaces as little as you can. You might want to gather information and send it out in bursts instead of sending it constantly.
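Sending in bursts usually means a small buffer in front of the radio or NIC. The sketch below only counts flushes; struct batcher, BATCH_CAPACITY and the two functions are hypothetical, and the actual transmit call is left as a comment because it depends on your platform.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 4  /* assumed burst size */

struct batcher {
    int items[BATCH_CAPACITY];
    size_t count;    /* samples waiting to be sent */
    size_t flushes;  /* number of bursts sent so far */
};

/* Send everything gathered so far as one burst. */
static void batch_flush(struct batcher *b)
{
    assert(b != NULL);
    if (b->count > 0) {
        /* a real device would transmit b->items[0..count-1] here */
        b->flushes++;
        b->count = 0;
    }
}

/* Queue one sample; the interface is only touched when the batch is full. */
static void batch_add(struct batcher *b, int sample)
{
    assert(b != NULL && b->count < BATCH_CAPACITY);
    b->items[b->count++] = sample;
    if (b->count == BATCH_CAPACITY)
        batch_flush(b);
}
```

Ten samples then cost two bursts plus one final flush for the remaining two, instead of ten separate wake-ups of the interface.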
Look at what your compiler generates, particularly for hot areas of code.
If you have low-priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with them when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.
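The decision logic for such a home-grown timer mechanism can be quite small. A sketch under assumptions: count_wakeups is a made-up name, deadlines are in milliseconds and sorted, and every event is assumed to tolerate being handled up to slack_ms late.

```c
#include <assert.h>
#include <stddef.h>

/* deadlines: event times in ms, sorted ascending. Each event may be
 * handled up to slack_ms late. Returns how many wake-ups are needed
 * when events that fall within the slack window share one wake-up. */
static size_t count_wakeups(const unsigned long *deadlines, size_t n,
                            unsigned long slack_ms)
{
    size_t wakeups = 0;
    size_t i = 0;
    assert(deadlines != NULL || n == 0);
    while (i < n) {
        /* latest moment the first pending event tolerates */
        unsigned long fire_at = deadlines[i] + slack_ms;
        ++wakeups;
        /* everything due by then is processed in the same wake-up */
        while (i < n && deadlines[i] <= fire_at)
            ++i;
    }
    return wakeups;
}
```

For events at 100, 105, 110, 300, 1000 and 1004 ms, a slack of 10 ms needs three wake-ups where a zero slack needs six.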
Simply put, do as little as possible.
Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide it into modules that are more or less independent of each other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreads out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.
Set unused memory or flash to 0xFF, not 0x00. This is certainly true for flash and EEPROM; I'm not sure about SRAM or DRAM. For the PROMs there is an inversion, so a 0 is stored as a 1 and takes more energy, while a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.
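In practice this usually means padding the unused tail of a firmware or data image before it is programmed. A tiny sketch; pad_unused and the buffer layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill the unused tail of an image buffer with 0xFF, the value an
 * erased flash block reads back as. */
static void pad_unused(unsigned char *image, size_t used, size_t total)
{
    assert(image != NULL && used <= total);
    memset(image + used, 0xFF, total - used);
}
```

Many linkers and flashing tools have a fill-value option that does the same job, so check your toolchain before doing it by hand.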
Rather timely, this: an article on Hackaday today about measuring the power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- Make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strength, slew rate, etc., check them and configure them; the defaults are often full power / max speed.
- Returning to the article above, go back and measure the power and see if you can drop it by altering things.
Also, something that is not trivial to do is to reduce the precision of the mathematical operations: go for the smallest dataset available and, if your development environment supports it, pack data and aggregate operations.
Knuth's books could give you all the variants of specific algorithms you need to save memory or CPU, or to work with reduced precision while minimizing the rounding errors.
Also, spend some time checking all of the embedded device's APIs; for example, most Symbian phones could do audio encoding via specialized hardware.
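One common form of reduced precision on small devices is fixed-point arithmetic in place of floating point. The sketch below shows a Q15 multiply (16-bit values representing the range -1 to just under 1); q15_mul is a made-up name, and it assumes the usual arithmetic right shift for negative numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 fixed point: real value = raw / 32768, range [-1, 1).
 * Multiplies two Q15 numbers and truncates the result back to Q15.
 * No saturation is done, so -1 * -1 is excluded. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    assert(!(a == INT16_MIN && b == INT16_MIN));
    int32_t wide = (int32_t)a * (int32_t)b;  /* Q30 intermediate */
    return (int16_t)(wide >> 15);            /* back to Q15 */
}
```

Here 16384 stands for 0.5, so q15_mul(16384, 16384) yields 8192, which is 0.25. Whether this actually saves power compared with an FPU depends on the processor, so measure it.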
Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.
On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/
Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high-level constructs if they expand your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low-power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache thrashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use the floating-point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.
Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
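For a constant factor, the shift-and-add replacement looks like the sketch below (mul10_shift_add is a made-up name). Optimising compilers typically perform this kind of strength reduction on their own when it pays off on the target, so look at the generated code before doing it by hand.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10 without using the multiplier: 10x = 8x + 2x. */
static uint32_t mul10_shift_add(uint32_t x)
{
    assert(x <= UINT32_MAX / 10);  /* result must fit in 32 bits */
    return (x << 3) + (x << 1);
}
```

Two shifts and one add stand in for one multiply here; with more terms than that, the multiplier may well win again, as noted above.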
If you are really serious, get a power-aware debugger, which can correlate power usage with your source code. Like this
