GPU benchmark for nvidia Optimus cards - gpgpu

I need a GPGPU benchmark which will load the GPU so that I can measure the parameters like temperature rise, amount of battery drain etc. Basically I want to alert the user when the GPU is using a lot of power than normal use. Hence I need to decide on threshold values of GPU temperature, clock frequency and battery drain rate above which GPU will be using more power than normal use. I have tried using several graphics benchmark but most of them don't use GPU resources to the fullest.
Please provide me a link to such GPGPU benchmark.

Related

cpu frequency impact on build graphic card

I'm working on building a Dynamic Voltage Frequency Scaling (DVFS) algorithm for a video decoding application operating on an Intel core i7 6500U CPU (Skylake). The application is to support both software as well as hardware decoder modules and the software decoder is working as expected. It controls the operational frequency of the CPU which eventually controls the operational voltage, thereby reducing the overall energy consumption.
My question is regarding the hardware decoder which is available in the Intel skylake processor (Intel HD graphics 520) which performs the hardware decoding. The experimental results for the two decoders suggest that the energy consumption reduction is much less in the hardware decoder compared to the software decoder when using the DVFS algorithm.
Does the CPU frequency level adjusted on the software before passing the video frame to be decoded on the hardware decoder, actually have an impact on the energy consumption of the hardware decoder?.
Does the Intel HD graphics 520 GPU on the same chip as the CPU have any impact on the CPU's operational frequency and the voltage level?
Why did you need to implement your own DVFS in the first place? Didn't Skylake's self-regulating mode work well? (where you let the CPU's hardware power management controller make all the frequency decisions, instead of just choosing whether to turbo or not).
Setting the CPU core clock speeds should have little to no effect on the GPU's DVFS. It's in a separate domain, and not linked to any of the cores (which can each choose their clocks individually). As you can see on Wikipedia, that SKL model can scale its GPU clocks from 300MHz to 1050MHz, and is probably doing so automatically if you're using an OS running Intel's normal graphics drivers.
For more about how Skylake power management works under the hood, see Efraim Rotem's (Lead Client Power Architect) IDF2015 talk (audio+slides, very good stuff). The title is Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency.
There's a link to the list of IDF2015 sessions in the x86 tag wiki.

Why the data transfer is slow from GPU to CPU?

Today I have figured out something that really made me wondering. I have the Samsung Exynos 4412 ARM9 CPU which has a GPU400(QuadCore). I tried to get a texture from the GPU to CPU by all known methods and its really slow. The same scenario and slow speed happens also in modern CPUs and GPUs in the PC Platform. My wondering is how that happens and the Samsung Exynos is an SoC and both of them has the same memory and I should not care about the bus.
Why that happens ?
The data from the GPU to the CPU is transferred by many methods which I have tried glReadpixels, gltexSubImage2D, gltexImage2d, FBO.
The frame rate drops from 40FPS to 7FPs or 7FPS while using any of those methods, on a texture 1024*1024 24bits.
Possible answers taken from the OpenGL forums:
Latency: it takes time for the read command to reach the hardware.
OpenGL command buffering: Reading the data requires the OpenGL driver to complete all outstanding commands.
Hardware buffering: Hardware must empty all GPU core pipelines before doing a readback.
Possible solution:
- Copy the data internally on the GPU to another location and read it back some number of frames after computing it. This should allow everything writing to that location to have completed before you attempt to read it.

What CPU instructions use the most power?

The background is thus: next week our office will have one day with no heating, due to maintenance. Outdoor temperature is expected between 7 and 12 degrees Celcius, so it might become chilly. The portable electric heaters are too few to cater for everyone.
However, I have, in my office of about 6-8 m2, a big honkin' (3 yrs old) workstation (HP xw8600 with 3.0 GHz Quad-core Xeon) that should be able output a couple of hundred Watts of heat. Running Furmark will max out the GPU but I'm not sure how to best work the CPU.
Last time I was in a cold office I either compiled more often or just launched 4-8 DOSBox:es running Norton Commander, but I think one can do better by using SSE1-2-3-4,MMX etc, i.e. stuff that does more work per cycle.
So, what CPU instructions toggle the most transistors each cycle, and thus use cause the CPU to draw most amount of power and thus give off maximum heat?
If I had a power meter available, then I could benchmark myself, but I figure this would be a fun challenge for the SO-crowd. :)
For your specific goal, if you really want to use your system as a heat generator, you need to first make sure that the cooling system is working really well (throwing the heat out of the box). Processors today are designed to throttle themselves when they reach a critical temperature which happens when a proper heatsink is used and the processor is at TDP (Thermal Design Power is the max power for the processor using normal programs). If you have a better heat sink and good ventilation (box fan?), you can probably get beyond TDP assuming that your power supply can handle it. If you turn the fan off, you basically will hit the thermal limit right away.
To be more explicit, the individual instructions that burn the most are generally load instructions that miss in the caches and go out to memory. To guarantee misses, you'll want to allocate a chunk of memory that's bigger than the last level CPU cache and hop around that memory. The pattern of hopping in the maximum power case is a bit complex because you're trying to get the maximum number of misses outstanding at every level of the cache hierarchy simultaneously. If you have 3 levels of cache, in a given period of time, you can have more misses to the L1 than you can to the L2 than you can to the L3 than you can to the DRAM page. (And the specific design of your processor will have a total limit on misses.) Between misses, the instruction doesn't matter too much, but I'd guess that one of the SSE4 multiplies (PMULUDQ) is probably the best since on a lot of modern processors, they execute in pretty quickly and generally do a whole lot of work (compared to say an add).
The funny thing is, running the GPU may limit the amount of heat that you can generate using misses to the L3 cache since the memory may be bogged down by the GPU. In that case, you should make sure that all accesses to the L3 are hits, but that you're missing in the other levels.
For GeForce graphics, my CudaMFLOPS program (free) is quite handy for obtaining high temperatures on the graphics card. If you have an appropriate card details are in:
http://www.roylongbottom.org.uk/cuda1.htm#anchor8
I find that my tests that execute SSE instructions with data from L1 cache generally produce the highest CPU temperatures.
For cpu use Prime95. That is lightweight and will load up all cores nicely. You aren't really going to get much heat out of a 3ghz xeon though. Chips of that age are usually good for over 4ghz with average cooling, and close to 5ghz with high end water loops. With a 6-core chip # >4ghz with extra voltage added you might be hitting 200w TDP but with that system you will be lucky to get the cpu to 100w.
As for the GPU, the Heaven Benchmark is a good one for quickly getting it up to temperature. Again, unless you have a high end card a couple of hundred watts of heat is unlikely. Another alternative on AMD gpus (maybe nvidia too?) is to use crpto-currency mining software, maybe get a USB stick with a mining linux distribution installed and ready to go. You could also use Prime95 on the same rig as mining software uses very little cpu time.
I actually kept a couple of rooms warm over winter with the heat from a computer, only rarely needing extra heating. This was done with a crypto-currency mining rig, which had 4 gpus running at ~80 degrees C, 24/7, with a box fan to circulate the air round the room. That rig had a 1300W PSU. Might I suggest that instead of trying to use the computer to keep you warm, you wear more clothes?

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers
an amazing 10 gigabits per second of transfer speeds in both
directions
1.25 GB/s sounds nice but most processors of the day are clocked around 2 Ghz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P - processor frequency; N - algorithm overhead
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These, at least for conventional software running on a CPU, are dependent on the processor clock, but not only. Writing your code to take advantage of the hardware facilities like caches, instruction-level parallelism, etc. is still a black art but can give you an order of magnitude performance boost.
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, harddisk controllers will decide the harddisk I/O performance, graphics cards will decide maximum resolution and refresh I/O performance, and so on. Don't really understand the question, the CPU is becoming less and less involved in these kinds of things (well, has been for the last 10 years).
I doubt the question will even have bearing on CPUs with integrated GPUs, since the buffer to be output to screen is in external memory sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is, that supports the kinds of bus speeds required by the Thunderbolt chip set version that is in the machine in question.
Thunderbolt is a raw data traffic system and performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free intelligent data shuffling doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when it needs it. For the I/O, the bottleneck will become the read/write speed of your harddisk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

Optimum performance of GPU

I have been asked to measure how "efficiently " does my code use the GPU /what % of peak performance are algorithms achieving.I am not sure how to do this comparison.Till now I have basically had timers put in my code and measure the execution.How can I compare this to optimal performance and find what might be the bottle necks? (I did hear about visual profiler but couldnt get it to work ..it keeps giving me "cannot load output" error).
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the cards maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by run time (I use cuda events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance. These are probably NP-Complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
Can't help with profiler and microoptimisation, but there is a CUDA calculator http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which trys to estimate how does your CUDA code use the hardware resources, based on this values:
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)

Resources