I am struggling with "-ta" flag in pgi compiler in order to use GPU acceleration using OpenACC. I did not find any comprehensive answer.
Yes, I know that it is called target accelerator to boost using information about the hardware. So, what -ta should I set, if my GPU hardware is:
weugene#landau:~$ sudo lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104GL [10de:1bb1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GP104GL [Quadro P4000] [10de:11a3]
Physical Slot: 4
Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 0
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
CUDA versions for pgi compiler (/opt/pgi/linux86-64/2019/cuda) are: 9.2, 10.0, 10.1
As you note, "-ta" stands for "target accelerator" and is a way for you to override the default target device when using "-acc" ("-acc" tells the compiler to use OpenACC and using just "-ta" implies "-acc"). PGI currently supports two targets, "multicore" to target a mult-core CPU, or "tesla" to target an NVIDIA Tesla device. Other NVIDIA products such as Quadro and GeForce will also work under the "tesla" flag provided they share the same architecture as a Tesla product.
By default when using "-ta=tesla", the PGI compiler will create a unified binary supporting multiple NVIDIA architectures. The exact set of architectures will depend on the compiler version and the CUDA device driver on the build system. For example with PGI 19.4 on a system with a CUDA 9.2 driver, the compiler will target Kepler (cc35), Maxwell (cc50), Pascal (cc60), and Volta (cc70) architectures. "cc" stands for the compute capability. Note if no CUDA driver can be found on the system, then the 19.4 compiler default to use CUDA 10.0.
In your case, a Quadro P4000 uses the Pascal architecture (cc60) so would be targeted by default. If you wanted to have the compiler only target your device, as opposed to creating a unified binary, you'd used the option "-ta=tesla:cc60"
You can also override which Cuda version to use as a sub-option. For example "-ta=tesla:cuda10.1". For a complete list of sub-options please run "pgcc -help -ta" from the command line or consult PGI's documentation.
If you don't know the compute capability of the device, run the PGI utility "pgaccelinfo" which will give you this information. For example, here's the output for my system which has a V100:
% pgaccelinfo
CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.67 Sat Apr 6 03:07:24 CDT 2019
Device Number: 0
Device Name: Tesla V100-PCIE-16GB
Device Revision Number: 7.0
Global Memory Size: 16914055168
Number of Multiprocessors: 80
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1380 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 877 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 6291456 bytes
Max Threads Per SMP: 2048
Async Engines: 7
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc70
Hope this helps!
Related
I have tried to evaluate an OpenMPI program with Matrix Multiplication algorithm, the written code scales very well on a single thread per core machine in our Laboratory (close to ideal speedup within 48 and 64 cores), However, on some other machines which are hyperthreaded there is strange behavior, as you can see in the screenshot from htop I realized the CPU utilization when I run the same experiment with the same command is different and strange, I executed the program with
mpirun --bind-to hwthread--use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to each hwthread, and can be seen with -n 2 which means I overwrite the variable in such a way to bind the execution on two processors (here hwthreads), however, seems it uses another hwthread with more or less 50% of utilization as well! I found this strange because there is not any extra CPU utilization on other machines, I tried this experiment many times and I'm sure this is not a temporary check or sth by OS and is due to the execution model of OpenMPI.
I appreciate it if someone could explain this behavior and extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe Hyperthreading is not the case and I was misled by this, but the only big difference between these environments are 1) the Hyperthreading and 2) Clock Frequency of the processors which is based on different CPUs is different between 2200 MHz to 4.8 GHz.
I am testing hadoop 3.0 Erasure coding..
For the test, I uploaded 100GB in hadoop 3.2.1, and the results were shown below. (5 datanode)
3 COPY : 150 minute
E C : 250 minute (RS-3-2-1024k)
To increase the speed of the EC by applying ISA-L, I set it up and tested the operation, but the speed came out the same.
zlib: true /lib64/libz.so.1
zstd : false
snappy: false
lz4: true revision:10301
bzip2: false
openssl: false build does not support openssl.
ISA-L: true /lib64/libisal.so.2
(1) it is an old device, so I wonder if the CPU does not support it.
Where can I check the list of CPUs that support ISA-L?
CPU : Intel(R) Xeon(R) CPU E5-2609 v2 # 2.50GHz
(2) Please advise if there is a method of applying ISA-L to be added.
In respect to your question
(1) Tt is an old device, so I wonder if the CPU does not support it.
Where can I check the list of CPUs that support ISA-L?
I were going through
Code Sample: Intel® ISA-L Erasure Code and Recovery
Intel(R) Intelligent Storage Acceleration Library
and haven't found a hint regarding specific CPUs.
Optimizing Storage Solutions Using the Intel® Intelligent Storage Acceleration Library
mention only
Depending on the platform capability, Intel ISA-L can run on various Intel® processor families.
Improvements are obtained by speeding up the computations through the use of the following instruction sets:
Intel® AES-NI – Intel® Advanced Encryption Standard - New Instruction
Intel® SSE – Intel® Streaming SIMD Extensions
Intel® AVX – Intel® Advanced Vector Extensions
Intel® AVX2 - Intel® Advanced Vector Extensions 2
Intel® ISA-L also includes unit tests, performance tests and samples written in C which can be used as usage examples.*
I have a Java Servlet based application running on Apache Tomcat on two different machines with similar hardware (RAM, SSD disk, network interface and bandwidth) but different CPU architectures:
x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6266C CPU # 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
aarch64
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
I have experience profiling Java applications both for CPU and memory usage with tools like Yourkit, JProfiler and Async Profiler. And I think I've found all the obvious performance related problems in our application. Using Apache JMeter (5.3.0) I've created a test plan that simulates real case loading: 9000 virtual users navigate the application, with think time, ramp up time, etc. The JMeter reports for both machines look very similar - after all the tweaking and tuning I was able to reach 1200 requests per second with this JMeter plan. If I increase the number of virtual users or decrease the think time then JMeter starts reporting errors mostly related to timeouts (both connect and read timeouts).
So I've decided to use wrk. With it the client machine (the machine where the load test client runs at) uses much less resources and I was able to get much better throughput:
around 40000 req/s when executing against the x86_64 machine
around 20000 req/s when executing against the aarch64 machine
Now, my question is: How to find out what makes the x86_64 machine twice more performant than the aarch64 one ? What kind of tools would you use to find where is the difference ?
I've tried with perf tool but so far I cannot really grasp how to read and interpret its records.
One thing I know for sure is that it is not the network bandwidth because with iperf I can get 5.48 Gbits/sec, while wrk reaches at most 220 MBit/sec (according to nload). If I am not wrong this is around 5 times below the maximum throughput.
All machines run on Ubuntu 18.04.4
Looking into your own CPU information:
x64 -BogoMIPS: 6000.00
aarch64 - BogoMIPS: 200.00
And as per Wikipedia:
BogoMips (from "bogus" and MIPS) is a crude measurement of CPU speed
made by the Linux kernel when it boots to calibrate an internal
busy-loop.1 An often-quoted definition of the term is "the number of
million times per second a processor can do absolutely nothing"
It's related to the CPU frequency so my expectation is that the ARM processor actual frequency is much lower. You can use sar tool or JMeter PerfMon Plugin in order to check both systems metrics (CPU, RAM, Swap, etc.), this way you will be able to tell for sure what is the bottleneck when it comes to ARM system.
With regards to the tool selection, JMeter is more "heavy" than wrk, however it us more powerful as well due to support of Cookies, Cache, working with embedded resources (parsing the response and automatically downloading images, scripts, styles, etc.)
The Problem is found under Windows XP when I want to write a function like GetTicketCount64 which does not exists on this platform. Here is my test code:
uint64_t GetTickCountEx()
{
#if _WIN32_WINNT > _WIN32_WINNT_WINXP
return GetTickCount64();
#else
// http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408.aspx
LARGE_INTEGER Frequency = {};
LARGE_INTEGER Counter = {};
BOOST_VERIFY(QueryPerformanceFrequency(&Frequency));
BOOST_VERIFY(QueryPerformanceCounter(&Counter));
return 1000 * Counter.QuadPart / Frequency.QuadPart;
#endif
}
for (int i = 0; ++i < 1000; Sleep(30000))
{
const auto utc = time(nullptr); // System time
const auto xp = GetTickCount(); // API of Windows XP SP3
const auto ex = GetTickCountEx(); // Performance counter
const auto diff = ex - xp;
printf("%lld %I32u %I64u %I64u \n", utc, xp, ex, diff);
}
I cannot understand the below result. From this article, reply from Angstrom seems not correct. Last column suggests that the difference of GTC and GPC is closer as time goes by! ... and, will it reaches zero some hours later?
So, my question is: Is my implementation of GetTickCount64 correct, and why?
1401778679 503258484 503355416 96932
1401778709 503288484 503385374 96890
1401778739 503318484 503415354 96870
1401778769 503348484 503445289 96805
1401778799 503378484 503475274 96790
1401778829 503408484 503505272 96788
1401778859 503438484 503535245 96761
1401778889 503468500 503565210 96710
1401778919 503498500 503595143 96643
1401778949 503528500 503625137 96637
1401778979 503558500 503655100 96600
1401779009 503588500 503685069 96569
1401779039 503618500 503715069 96569
1401779069 503648500 503745006 96506
1401779099 503678500 503774951 96451
1401779129 503708500 503804958 96458
1401779159 503738500 503834943 96443
1401779189 503768500 503864911 96411
1401779219 503798500 503894792 96292
1401779249 503828500 503924759 96259
1401779279 503858500 503954607 96107
1401779309 503888500 503984607 96107
1401779339 503918500 504014392 95892
1401779369 503948500 504044362 95862
CPU Core Info from coreinfo.exe:
Coreinfo v3.21 - Dump information on system CPU and memory topology
Copyright (C) 2008-2013 Mark Russinovich
Sysinternals - www.sysinternals.com
Intel(R) Core(TM) i3 CPU M 380 # 2.53GHz
x86 Family 6 Model 37 Stepping 5, GenuineIntel
HTT * Hyperthreading enabled
HYPERVISOR - Hypervisor is present
VMX * Supports Intel hardware-assisted virtualization
SVM - Supports AMD hardware-assisted virtualization
EM64T * Supports 64-bit mode
SMX - Supports Intel trusted execution
SKINIT - Supports AMD SKINIT
NX * Supports no-execute page protection
SMEP - Supports Supervisor Mode Execution Prevention
SMAP - Supports Supervisor Mode Access Prevention
PAGE1GB - Supports 1 GB large pages
PAE * Supports > 32-bit physical addresses
PAT * Supports Page Attribute Table
PSE * Supports 4 MB pages
PSE36 * Supports > 32-bit address 4 MB pages
PGE * Supports global bit in page tables
SS * Supports bus snooping for cache operations
VME * Supports Virtual-8086 mode
RDWRFSGSBASE - Supports direct GS/FS base access
FPU * Implements i387 floating point instructions
MMX * Supports MMX instruction set
MMXEXT - Implements AMD MMX extensions
3DNOW - Supports 3DNow! instructions
3DNOWEXT - Supports 3DNow! extension instructions
SSE * Supports Streaming SIMD Extensions
SSE2 * Supports Streaming SIMD Extensions 2
SSE3 * Supports Streaming SIMD Extensions 3
SSSE3 * Supports Supplemental SIMD Extensions 3
SSE4a - Supports Sreaming SIMDR Extensions 4a
SSE4.1 * Supports Streaming SIMD Extensions 4.1
SSE4.2 * Supports Streaming SIMD Extensions 4.2
AES - Supports AES extensions
AVX - Supports AVX intruction extensions
FMA - Supports FMA extensions using YMM state
MSR * Implements RDMSR/WRMSR instructions
MTRR * Supports Memory Type Range Registers
XSAVE - Supports XSAVE/XRSTOR instructions
OSXSAVE - Supports XSETBV/XGETBV instructions
RDRAND - Supports RDRAND instruction
RDSEED - Supports RDSEED instruction
CMOV * Supports CMOVcc instruction
CLFSH * Supports CLFLUSH instruction
CX8 * Supports compare and exchange 8-byte instructions
CX16 * Supports CMPXCHG16B instruction
BMI1 - Supports bit manipulation extensions 1
BMI2 - Supports bit manipulation extensions 2
ADX - Supports ADCX/ADOX instructions
DCA - Supports prefetch from memory-mapped device
F16C - Supports half-precision instruction
FXSR * Supports FXSAVE/FXSTOR instructions
FFXSR - Supports optimized FXSAVE/FSRSTOR instruction
MONITOR * Supports MONITOR and MWAIT instructions
MOVBE - Supports MOVBE instruction
ERMSB - Supports Enhanced REP MOVSB/STOSB
PCLULDQ - Supports PCLMULDQ instruction
POPCNT * Supports POPCNT instruction
LZCNT - Supports LZCNT instruction
SEP * Supports fast system call instructions
LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode
HLE - Supports Hardware Lock Elision instructions
RTM - Supports Restricted Transactional Memory instructions
DE * Supports I/O breakpoints including CR4.DE
DTES64 * Can write history of 64-bit branch addresses
DS * Implements memory-resident debug buffer
DS-CPL * Supports Debug Store feature with CPL
PCID * Supports PCIDs and settable CR4.PCIDE
INVPCID - Supports INVPCID instruction
PDCM * Supports Performance Capabilities MSR
RDTSCP * Supports RDTSCP instruction
TSC * Supports RDTSC instruction
TSC-DEADLINE - Local APIC supports one-shot deadline timer
TSC-INVARIANT * TSC runs at constant rate
xTPR * Supports disabling task priority messages
EIST * Supports Enhanced Intel Speedstep
ACPI * Implements MSR for power management
TM * Implements thermal monitor circuitry
TM2 * Implements Thermal Monitor 2 control
APIC * Implements software-accessible local APIC
x2APIC - Supports x2APIC
CNXT-ID - L1 data cache mode adaptive or BIOS
MCE * Supports Machine Check, INT18 and CR4.MCE
MCA * Implements Machine Check Architecture
PBE * Supports use of FERR#/PBE# pin
PSN - Implements 96-bit processor serial number
PREFETCHW * Supports PREFETCHW instruction
Maximum implemented CPUID leaves: 0000000B (Basic), 80000008 (Extended).
Logical to Physical Processor Map:
*-*- Physical Processor 0 (Hyperthreaded)
-*-* Physical Processor 1 (Hyperthreaded)
Logical Processor to Socket Map:
**** Socket 0
Logical Processor to NUMA Node Map:
**** NUMA Node 0
Logical Processor to Cache Map:
*-*- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*-*- Instruction Cache 0, Level 1, 32 KB, Assoc 4, LineSize 64
*-*- Unified Cache 0, Level 2, 256 KB, Assoc 8, LineSize 64
-*-* Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-* Instruction Cache 1, Level 1, 32 KB, Assoc 4, LineSize 64
-*-* Unified Cache 1, Level 2, 256 KB, Assoc 8, LineSize 64
**** Unified Cache 2, Level 3, 3 MB, Assoc 12, LineSize 64
You cannot compare the two timing sources, they have drastically different implementations in PCs.
GetTickCount() is derived from the clock tick interrupt, a signal that's generated by the real-time clock. Traditionally a dedicated chip, originally the Motorola MC146818, nowadays integrated in the south-bridge. It has the kind of oscillator that was used in watches, crystal stabilized and usually running at 32768 Hertz. This oscillator keeps running when the machine power is turned off, running off a lithium battery or a super-capacitor.
So resolution is quite poor, but it is made very accurate with very good long-term stability by periodically resynchronizing the clock with time provided by a time server, most Windows machines use time.windows.com. Review GetSystemTimeAdjustment() for details.
QueryPerformanceCounter() uses a frequency source available in the chipset. Traditionally the 8053 counter running at 1193182 Hertz. Nowadays the HPET timer, the HAL (Hardware Abstraction Layer) allows a system integrator to pick any frequency source he's got available. Using the CPU clock is not unusual in cheaper designs.
So resolution is very high, but it is inaccurate and there is no mechanism to calibrate this timer. Being off by 800 ppm from the reported QPF is not unusual. This timer should only ever be used for short interval measurements, the kind that a profiler would use for example.
So no, using QueryPerformanceCounter() as an alternative for GetTickCount64() isn't a very good idea, unless you can live with the inaccuracy. Technically you can synthesize your own 64-bit counter, as long as you keep track of the value of GetTickCount() overflowing. You could, say, increment the course count when the previous value was negative and the new value is positive, indicating that it overflowed. The only requirement is that you sample GetTickCount() often enough to see the transition, at least once in 24 days.
In the same thread as the one you linked, the reply from Raymond Chen specifically says that you should consider neither to be the time 'since' anything. Only consider time differences (intervals) to be relevant quantities. Therefore, what you should be testing is intervals, for instance before your loop the start value(s), and each time in your loop the elapsed time since the start of the loop.
This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, XCode 4.x and CUDA 5.0, CUDA code compiled and ran fine.
Then, I update to OS X 10.9.2, XCode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver CUDA 5.5 shipped with) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more current 5.5.47 (GPU Driver verions 8.24.11 310.90.9b01), deviceQuery indeed passes with the following output.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
Furthermore, I can successfully compile without modification the CUDA 5.5 samples, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT, simpleCUBLAS all fail immediately when run.
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error Code 2 is cudaErrorMemoryAllocation, but I suspect it hides a failed CUDA initialization somehow.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
Actual error code is CUBLAS_STATUS_NOT_INITIALIZED being returned from call to cublasCreate().
Has anyone run into this before and found a fix? Thanks in advance.
I would guess you are running out of memory. Your GPU is being used by the display manager, and it only has 256Mb of RAM. The combined memory footprint of the OS 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:
#include <iostream>
int main(void)
{
size_t mfree, mtotal;
cudaSetDevice(0);
cudaMemGetInfo(&mfree, &mtotal);
std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;
return cudaDeviceReset();
}
[disclaimer: written in browser, never compiled or tested use at own risk ]
That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.
EDIT: Here is an even lighter weight alternative test which doesn't even attempt to establish a context on the device. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then your truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:
#include <iostream>
#include <cuda.h>
int main(void)
{
CUdevice d;
size_t b;
cuInit(0);
cuDeviceGet(&d, 0);
cuDeviceTotalMem(&b, d);
std::cout << "Total memory = " << b << std::endl;
return 0;
}
Note you will need to explicitly link the cuda driver library to get this to work (pass -lcuda to nvcc, for example)