CUDA fallback to CPU?

I have a CUDA application that on one computer (with a GTX 275) works fine and on another, with a GeForce 8400 works about 100 times slower. My suspicion is that there is some kind of fallback that makes the code actually run on the CPU rather than on the GPU.
Is there a way to actually make sure that the code is running on the GPU?
Is this fallback documented somewhere?
What conditions may trigger it?
EDIT: The code is compiled for compute capability 1.1, which is what the 8400 has.

Couldn't it just be that the gap in performance really is that large? This link indicates that the 8400 operates at 22-62 GFLOPS and this link indicates that the GTX 275 operates at 1010.88 GFLOPS.

There are a number of possible reasons for this.
Presumably you're not using the emulation device. Can you run the device query sample from the SDK? That will show if you have the toolkit and driver installed correctly.
You can also query the device properties from within your app to check what device you are attached to.
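For example, here is a minimal sketch (my own illustration, not from the question) of logging the attached device from host code with the standard CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found\n");
        return 1;
    }

    int dev = 0;
    cudaGetDevice(&dev);                    // the device this app is currently attached to
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    std::printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
                dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}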
The 8400 is much lower performance than the GTX275, so it could be real, but see the next point.
One of the major changes in going from compute capability 1.1 to 1.2 and beyond is the way memory accesses are handled. In 1.1 you have to be very careful not only to coalesce your memory accesses but also to make sure that each half-warp is aligned, otherwise each thread will issue its own 32-byte transaction. In 1.2 and beyond the alignment is not such an issue, as it degrades gracefully to minimise transactions.
This, combined with the lower performance of the 8400, could also account for what you are seeing.

If I remember correctly, you can list all available devices (and choose which device to use for your kernel) from the host code. You could try to determine whether the available device is software emulation and issue a warning.
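A rough sketch of such a warning (my own code, not the answerer's; the 9999.9999 compute capability is what the old deviceQuery SDK sample checks for when only the emulation/placeholder device is reported):

#include <cstdio>
#include <cuda_runtime.h>

bool runningOnRealGpu(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;
    if (prop.major == 9999 && prop.minor == 9999) {    // emulation / no real device
        std::fprintf(stderr, "Warning: not running on a real GPU\n");
        return false;
    }
    return true;
}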

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, the new Apple A11 Bionic processor scores more points than a mobile Intel Core i7 in the Geekbench benchmark.
As I understand it, there are a lot of different tests in this benchmark. These tests simulate different loads, including loads that can occur in everyday use.
Some people state that these results cannot be compared to x86 results. They say that x86 is able to perform "more complex tasks". As examples they cite Photoshop, video conversion, and scientific calculations. I agree that software for ARM is often only a "lightweight" version of desktop software. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc.), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program. And on the iPad, Safari works just as well as on the Mac. Moreover, if we take the results of SunSpider (a JS benchmark), it turns out that Safari on the iPad scores higher.
I think that in everyday tasks (web, office, music/films) the performance of ARM (A10X, A11) and x86 (dual-core mobile Intel i7) is comparable.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason? What's stopping Apple from releasing a laptop on ARM? They already did something similar with the migration from PowerPC to x86. Is this a technical restriction, or just marketing?
(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regard to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc.), and how much that contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't, due to negligence or an intention to cheat): the workload should be the same (I've seen different versions compared), and the compiler should be as close as possible (generating ARM vs x86 code is already a difference, but the intermediate optimizations should be similar; when comparing two x86 vendors like Intel and AMD you should prefer the same binary, unless you also want to allow machine-specific optimizations).
Finally, the systems should also be similar, which is not the case when comparing a smartphone against a PC/MacBook. The memory could differ, the core count, etc. These can be legitimate differences, but they're not really related to one architecture being better than the other.
The comparison is somewhat bogus: between the ISA and an application or its source code there are many abstraction levels, and the only metrics we have (execution time, or throughput) depend on many factors that can favour one or the other: the algorithm choices, the optimizations written in the source code, the compiler/interpreter implementation and optimizations, and the operating system's behaviour. So they are not exactly, mathematically comparable.
However, looking at the numbers, and speaking as a management engineer about the usefulness of the mobile applications already written, ARM chips seem capable of running quite well.
I think the only reason is the inertia of the standards already spread around (note that Microsoft offers a variant of Windows running on ARM processors, and Debian ARM variants are ready: https://www.debian.org/distrib/netinst).
The ARMv8 cores seem close to x86-64 ones, judging by the raw numbers:
Note the i7-3770K results here: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
A summary of recent ARMv8 CPU characteristics; note the decode width, dispatch width, and caches, and compare the last column for the Cortex-A73 against the i7-3770K:
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
Intel Ivy Bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
Cortex-A75 details: https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
The topic of power consumption is, again, complex. The basic physics underlying all the frequency/voltage rules (used and abused) around the web is transistor rise time: https://en.wikipedia.org/wiki/Rise_time
There is a fixed delay in the switching of a transistor; this determines the maximum frequency at which it can switch, and when many transistors are chained in a cascade these delays add up in a nonlinear way (it takes some integration to show it). As a result, about ten years ago, to raise clock speeds, companies split the execution of an operation into more stages and ran those operations in a pipelined way, even inside a single logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
The rise time depends on physical characteristics (the materials and shape of the transistors). It can be reduced by increasing the voltage, so the transistor switches faster, since the switching is associated (if you allow the term) with the charge/discharge of a capacitance that opens and closes the transistor channel.
These ARM chips are designed for low-power applications; by changing the design they could easily gain MHz, but they would use much more power. How much? Again, not comparable unless you work inside a foundry and have the numbers.
Examples of server-class ARM processors that are closer to desktop/workstation CPUs in power consumption are the Cavium and Qualcomm Falkor CPUs, and some benchmarks report that they are not bad.

QueryPerformanceCounter on multi-core processor under Windows 10 behaves erratically

Under Windows, my application makes use of QueryPerformanceCounter (and QueryPerformanceFrequency) to perform "high resolution" timestamping.
Since Windows 10 (tested only on Intel i7 processors so far), we have observed erratic behaviour in the values returned by QueryPerformanceCounter.
Sometimes, the value returned by the call will jump far ahead and then back to its previous value.
It feels as if the thread has moved from one core to another and got back a different counter value for a while (no proof, just a gut feeling).
This has never been observed under XP or 7 (no data about Vista, 8 or 8.1).
A "simple" workaround has been to enable the UsePlatformClock boot opiton using BCDEdit (which makes everything behaves wihtout a hitch).
I know about the potentially superior GetSystemTimePreciseAsFileTime but as we still support 7 this is not exactly an option unless we write totatlly different code for different OSes, which we really don't want to do.
Has such behaviour been observed/explained under Windows 10 ?
I'd need much more knowledge about your code, but let me highlight a few things from MSDN:
When computing deltas, the values [from QueryPerformanceCounter] should be clamped to ensure that any bugs in the timing values do not cause crashes or unstable time-related computations.
And especially this:
Set that single thread to remain on a single processor by using the Windows API SetThreadAffinityMask ... While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another. So, it's best to keep the thread on a single processor.
Your case might be hitting one of those bugs. In short:
You should always query the timestamp from one thread (setting its CPU affinity so it can't change cores) and read that value from any other thread (just an interlocked read, no need for fancy synchronization).
Clamp the calculated delta (at least to be sure it's not negative)...
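A minimal sketch of that pattern (my own illustration, assuming C++11 and the Windows API; only the clock thread ever calls QueryPerformanceCounter):

#include <windows.h>
#include <atomic>

std::atomic<long long> g_qpcTicks{0};               // published timestamp, readable from any thread

void ClockThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 1);   // pin to CPU 0 so QPC is always read on one core

    LARGE_INTEGER now;
    for (;;)
    {
        QueryPerformanceCounter(&now);
        long long prev = g_qpcTicks.load(std::memory_order_relaxed);
        if (now.QuadPart > prev)                    // clamp: never publish a value that goes backwards
            g_qpcTicks.store(now.QuadPart, std::memory_order_release);
        Sleep(0);                                   // yield; tune the publish rate to your resolution needs
    }
}

long long ReadTimestamp()                           // safe to call from any thread
{
    return g_qpcTicks.load(std::memory_order_acquire);
}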
Notes:
QueryPerformanceCounter() uses the TSC if possible (see MSDN). The algorithm used to synchronize the TSC (if available, and in your case it should be) changed substantially from Windows 7 to Windows 8; however, note that:
With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
So even if QPC is monotonic in theory, you must always call it from the same thread to be sure of this.
Another note: if synchronization is made by software you may read from Intel documentation that:
...It may be difficult for software to do this in a way that ensures that all logical processors will have the same value for the TSC at a given point in time...
Edit: if your application is multithreaded and you can't (or don't want to) set CPU affinity (especially if you need precise timestamping at the cost of having de-synchronized values between threads), then you may use GetSystemTimePreciseAsFileTime() when running on Windows 8 (or later) and fall back to timeGetTime() for Windows 7 (after setting the granularity to 1 ms with timeBeginPeriod(1), and assuming 1 ms resolution is enough). A very interesting read: The Windows Timestamp Project.
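One way to do that without shipping two binaries is to resolve the Windows 8 API at run time; a sketch under that assumption:

#include <windows.h>
#include <mmsystem.h>                     // timeGetTime / timeBeginPeriod, link winmm.lib

typedef VOID (WINAPI *PreciseTimeFn)(LPFILETIME);

static PreciseTimeFn g_precise =
    (PreciseTimeFn)GetProcAddress(GetModuleHandleW(L"kernel32.dll"),
                                  "GetSystemTimePreciseAsFileTime");

unsigned long long TimestampIn100ns()
{
    if (g_precise) {                      // Windows 8 or later
        FILETIME ft;
        g_precise(&ft);
        return ((unsigned long long)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
    }
    // Windows 7 fallback: milliseconds since boot; call timeBeginPeriod(1) once at startup
    return (unsigned long long)timeGetTime() * 10000ULL;
}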
Edit 2: directly suggested by the OP! This, when applicable (because it's a system setting, not local to your application), might be an easy workaround. You can force QPC to use the HPET instead of the TSC using bcdedit (see MSDN). Latency and resolution should be worse, but it's intrinsically safe from the issues described above.

Windows 7 QueryPerformanceFrequency returns 2.4 MHz-ish?

I'm running some timing code on various OSes. I notice the following patterns in the results from QueryPerformanceCounter:
Standard Windows XP uses the processor frequency, which means it's using RDTSC under the hood.
Vista uses the HPET, 14,318,180 Hz
Any version of Windows with /usepmtimer uses the ACPI clock, 3,579,545 Hz
Windows 7 uses a clock of undetermined origin, returning varying numbers around 2.4 to 2.6 MHz.
Does anyone know what clock Windows 7 is using by default? Why is it even slower than the ACPI clock? Is there a way to force Windows 7 to use the HPET instead?
Windows 7 will pick different QPC sources at boot based on what processor / hardware is available - I believe there are also changes in SP1 regarding this as well.
The change from Vista was most likely made for app-compat reasons: on multicore CPUs the RDTSC values of the different cores are not guaranteed to be in sync, so apps scheduled across multiple CPUs would sometimes see QPC go backwards and would freak out.
Ok this is only a partial answer as I'm still nailing it down, but this 2.x MHz frequency is equal to the nominal TSC speed divided by 1024.
Try to do the math with your QPF result and your own CPU speed and it should be right.
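For example, this little check (my own sketch; the 2.6 GHz nominal TSC value is an assumption to replace with your own CPU's clock) prints both numbers side by side:

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER qpf;
    QueryPerformanceFrequency(&qpf);
    const double nominal_tsc_hz = 2.6e9;                              // adjust to your CPU's nominal clock
    std::printf("QPF        : %lld Hz\n", (long long)qpf.QuadPart);
    std::printf("TSC / 1024 : %.0f Hz\n", nominal_tsc_hz / 1024.0);   // ~2.54 MHz for a 2.6 GHz part
    return 0;
}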
I initially thought it was a division of the HPET rate but it does not seem to be the case.
Now the question is: the LAPIC timer runs at the system bus rate, but so does the TSC (before the multiplier is applied), so we don't know which counter is used before the final division (it could be TSC/1024 or BUS/something else); we only know it's derived from the main motherboard crystal (the one driving the bus).
What doesn't sound right is that some MSDN articles seem to imply the LAPIC timer is barely used (except for hypervisors/virtual machines), but given that the HPET failed to deliver on its promises due to many implementation problems, and that most new platforms feature an invariant TSC, they are changing direction again.
I haven't found any formal statement from Microsoft about the new source used in Windows 7, though... and we can't completely rule out the HPET, since even if it's not used in timer mode its counter can still be read (e.g. by QPF), but then why divide its rate and thus lower its resolution?

How can I limit the processing power given to a specific program?

I develop on a laptop with a dual-core amd 1.8 GHz processor but people frequently run my programs on much weaker systems (300 MHz ARM for example).
I would like to simulate such weak environments on my laptop so I can observe how my program runs. It is an interactive application.
I looked at qemu and I know how to set up an environment, but it's a bit painful and I didn't find the exact incantation of switches I'd need to make qemu simulate a weaker CPU.
I have VirtualBox, but it doesn't seem like I can virtualize fewer than one full host CPU.
I know about http://cpulimit.sourceforge.net/ which uses SIGSTOP and SIGCONT to try to limit the CPU given to a process, but I am worried this is not really an accurate portrayal of a weaker CPU.
Any ideas?
If your CPU is 1800 MHz and your target is 300 MHz, and your code is like this:
while(1) { /*...*/ }
you can rewrite it like:
long last = gettimestamp();
while (1)
{
    long curr = gettimestamp();
    if (curr - last > 167)              // after ~1/6 of each second of real work...
    {
        long target = curr + 833;       // ...busy-wait away the other 5/6
        while (gettimestamp() < target)
            ;                           // spin
        last = gettimestamp();
    }

    // your original code
}
where gettimestamp() is your OS's high-frequency timer, returning milliseconds here.
You can choose to work with smaller values for a smoother experience, say wasting 83 ms out of every 100 ms, or 8 ms out of every 10 ms, and so on. The lower you go, though, the more timer precision loss will ruin your math.
Edit: Or how about this? Create a second process that starts the first, attaches itself to it as a debugger, then periodically pauses and resumes it according to the algorithm above.
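On POSIX systems the same duty-cycle idea can be done from a second process with the stop/continue signals the question already mentions; a rough sketch (the 1/6 duty cycle matches the 1800 MHz to 300 MHz example, and the target PID is passed on the command line):

#include <signal.h>
#include <cstdlib>
#include <unistd.h>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    pid_t target = (pid_t)std::atoi(argv[1]);   // pid of the program to throttle

    for (;;)
    {
        kill(target, SIGCONT);                  // let it run...
        usleep(17000);                          // ...for ~17 ms (about 1/6 of a 100 ms slice)
        kill(target, SIGSTOP);                  // freeze it...
        usleep(83000);                          // ...for the remaining ~83 ms
    }
}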
You may want to look at an emulator that is built for this. For example, from Microsoft you can find this tech note, http://www.nsbasic.com/ce/info/technotes/TN23.htm.
Without knowing more about the languages you are using, and platforms, it is hard to be more specific, but I would trust the emulator programs to do a good job in providing the test environment.
I picked up a PII-MMX 266 laptop somewhere and installed a minimal Debian on it. That was a perfect solution until it died some weeks ago. It is a Panasonic model with a non-standard IDE connector (neither 40-pin nor 44-pin), so I was unable to replace its HDD with a CF card (a CF-to-IDE adapter costs next to nothing). Also, the price of such a machine is about USD 50 / EUR 40.
(I was using it to simulate a slow ARM-based machine for our home automation system, which is planned to be able to run on even the smallest and slowest Linux systems. Meanwhile, we've chosen a small and slow computer for home automation purposes: the GuruPlug. It has a roughly 1.2 GHz CPU.)
(I'm not familiar with QEMU, but the manual says that you can use KVM (kernel virtualization) in order to run programs at native speed; I assume that if it's an extra feature then it can be turned off, so, strange but true, it can emulate x86 on x86.)

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows me to measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only interested in whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
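If you prefer to measure from inside your own process, PCM also ships as a library; a hedged sketch using the names from its cpucounters.h header (treat them as assumptions and verify against the version you build, as the API has moved around):

#include <cstdio>
#include "cpucounters.h"                 // from the Intel PCM project

void measureMemoryTraffic(void (*workload)())
{
    PCM* pcm = PCM::getInstance();
    if (pcm->program() != PCM::Success) {
        std::printf("PCM could not program the counters\n");
        return;
    }

    SystemCounterState before = getSystemCounterState();
    workload();                          // the code under test
    SystemCounterState after  = getSystemCounterState();

    std::printf("bytes read from DRAM : %llu\n",
                (unsigned long long)getBytesReadFromMC(before, after));
    std::printf("bytes written to DRAM: %llu\n",
                (unsigned long long)getBytesWrittenToMC(before, after));
}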
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
It would be hard to find a tool that measures memory bandwidth utilization for your specific application.
But since the issue you face is a suspected memory bandwidth problem, you could try to measure whether your application is generating a lot of page faults per second, which would definitely mean that you are nowhere near the theoretical memory bandwidth.
You should also measure how cache-friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" for good sources that tell you how to do this.
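A crude way to see that effect for yourself (my own sketch, not from the answer): time the same amount of data read with a cache-friendly stride versus a hostile one that touches a new cache line per element.

#include <chrono>
#include <cstdio>
#include <vector>

static double timePass(const std::vector<int>& v, size_t stride)
{
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < v.size(); i += stride)
            sum += v[i];                                  // every element touched exactly once
    auto t1 = std::chrono::steady_clock::now();
    std::printf("stride %zu: sum=%lld\n", stride, sum);   // keep sum from being optimized away
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::vector<int> v(64 * 1024 * 1024, 1);              // 256 MB, far larger than any cache
    std::printf("sequential: %.3f s\n", timePass(v, 1));
    std::printf("strided   : %.3f s\n", timePass(v, 16)); // 64-byte jumps, one int per cache line
    return 0;
}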
It isn't possible to properly measure memory bus utilisation with any kind of software-only solution. (It used to be possible, back in the '80s or so, but then we got pipelining, caches, out-of-order execution, multiple cores, non-uniform memory architectures with multiple buses, etc.)
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specifically for Intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for Linux) with a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably-trained computer engineer as well as quite high performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy, with some busses - eg old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are usually used between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even if it normally triggers occasionally, can be monitored for excessively 'long' levels: for the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB FIFO hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
You monitor utilisation either way by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity which is occurring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements also.
This isn't as straightforward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not coincide with obvious bus activity (some DRAMs do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc.).
So the instrument needs to be able to analyse the data deeply enough for you to extract how busy it is.
Your best, and simplest bet is to find a PC hardware (CPU) vendor who has tools which do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are windows guest drivers for it), and also pin down your VM to specific CPUs, whilst you also configure linux to avoid putting other jobs on those same CPUs.
