how to increase memory limit in Visual Studio C++ - visual-studio

Need Help.I'm stuck at a problem when running a C++ code on Windows- Visual Studio.
When I run that code in Linux environment, there is no restriction on the memory I am able to allocate dynamically(till the size available in RAM).
But on VS Compiler, it does not let me create an array beyond a limited size.
I've tried /F option and 20-25 of google links to increase memory size but they dont seem to help much.
I am currently able to assign only around 100mb out of 3gb available.
If there is a solution for this in Windows and not in Visual Studio's compiler, I will be glad to hear that as I have a CUDA TeslaC2070 card which is proving to be pretty useless on Windows as I wanted to run my CUDA/C++ code on Windows environment.
Here's my code. it fails when LENGTH>128(no of images 640x480pngs. less than 0.5mb each. I've also calculated the approximate memory size it takes by counting data structures and types used in OpenCV and by me but still it is very less than 2gb). stackoverflow exception. Same with dynamic allocation. I've already maximized the heap and stack sizes.
#include "stdafx.h"
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#define LENGTH 100
#define SIZE1 640
#define SIZE2 480
#include <iostream>
using namespace std;
__global__ void square_array(double *img1_d, long N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
img1_d[idx]= 255.0-img1_d[idx];
}
int _tmain(int argc, _TCHAR* argv[])
{
IplImage *img1[LENGTH];
// Open the file.
for(int i=0;i<LENGTH;i++)
{ img1[i] = cvLoadImage("abstract3.jpg");}
CvMat *mat1[LENGTH];
for(int i=0;i<LENGTH;i++)
{
mat1[i] = cvCreateMat(img1[i]->height,img1[i]->width,CV_32FC3 );
cvConvert( img1[i], mat1[i] );
}
double a[LENGTH][2*SIZE1][SIZE2][3];
for(int m=0;m<LENGTH;m++)
{
for(int i=0;i<SIZE1;i++)
{
for(int j=0;j<SIZE2;j++)
{
CvScalar scal = cvGet2D( mat1[m],j,i);
a[m][i][j][0] = scal.val[0];
a[m][i][j][1] = scal.val[1];
a[m][i][j][2] = scal.val[2];
a[m][i+SIZE1][j][0] = scal.val[0];
a[m][i+SIZE1][j][1] = scal.val[1];
a[m][i+SIZE1][j][2] = scal.val[2];
}
} }
//cuda
double *a_d;
int N=LENGTH*2*SIZE1*SIZE2*3;
cudaMalloc((void **) &a_d, N*sizeof(double));
cudaMemcpy(a_d, a, N*sizeof(double), cudaMemcpyHostToDevice);
int block_size = 370;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
cout<<n_blocks<<block_size;
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaMemcpy(a, a_d, N*sizeof(double), cudaMemcpyDeviceToHost);
//cuda end
char name[]= "Image: 00000";
name[12]='\0';
int x=0,y=0;
for(int m=0;m<LENGTH;m++)
{
for (int i = 0; i < img1[m]->width*img1[m]->height*3; i+=3)
{
img1[m]->imageData[i]= a[m][x][y][0];
img1[m]->imageData[i+1]= a[m][x][y][1];
img1[m]->imageData[i+2]= a[m][x][y][2];
if(x==SIZE1)
{
x=0;
y++;
}
x++;
}
switch(name[11])
{
case '9': switch(name[10])
{
case '9':
switch(name[9])
{
case '9': name[11]='0';name[10]='0';name[9]='0';name[8]++;
break;
default : name[11]='0';
name[10]='0';
name[9]++;
}break;
default : name[11]='0'; name[10]++;break;
}
break;
default : name[11]++;break;
}
// Display the image.
cvNamedWindow(name, CV_WINDOW_AUTOSIZE);
cvShowImage(name,img1);
//cvSaveImage(name ,img1);
}
// Wait for the user to press a key in the GUI window.
cvWaitKey(0);
// Free the resources.
//cvDestroyWindow(x);
//cvReleaseImage(&img1);
//cvDestroyWindow("Image:");
//cvReleaseImage(&img2);
return 0;
}

The problem is that you are allocating a huge multidimensional array on the stack in your main function (double a[..][..][..]). Do not allocate this much memory on the stack. Use malloc/new to allocate on the heap.

Related

CUDA Initialize Array on Device

I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out to show if it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, storing the results in another array and then summing up the entire thing so that I only need to return the host one value.
In the code I am working on all I am only trying to see if I am initializing the array correctly. I am trying to create an array of size N following the patterns of 1,2,3,4,5,6,7,8,1,2,3....
This is the code that I've written and it compiles without issue but when I run it the terminal is hanging and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define ARRAY_SIZE 100
#define BLOCK_SIZE 32
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if(temp != 8){
a_d[x] = temp;
temp++;
} else {
a_d[x] = temp;
temp = 1;
}
}
int main (int argc, char *argv[])
{
//declare pointers for arrays
int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];
//set space for device variables
cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &sum_d, sizeof(int));
// set execution configuration
dim3 dimblock (BLOCK_SIZE);
dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);
// actual computation: call the kernel
cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
cudaError_t result;
// transfer results back to host
result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
if (result != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed.");
exit(1);
}
// print reversed array
printf ("Final state of the array:\n");
for (int i =0; i < ARRAY_SIZE; i++) {
printf ("%d ", a_h[i]);
}
printf ("\n");
}
There are at least 3 issues with your kernel code.
you are using shared memory variable temp without initializing it.
you are not resolving the order in which threads access a shared variable as discussed here.
you are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area
The first item seems self-evident, however naive methods to initialize it in a multi-threaded environment like CUDA are not going to work. Firstly we have the multi-threaded access pattern, again, Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to create the pattern you desire, (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
if (x < size) a_d[x] = (x&7) + 1;
}
Are there other ways to do it? certainly.
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ int temp;
if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
__syncthreads();
if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
These changes will still leave a few values at zero at the end of the array, which would require changes to your grid sizing. There are many questions about this already, or study a sample code like vectorAdd.

Windows 10 poor performance compared to Windows 7 (page fault handling is not scalable, severe lock contention when no of threads > 16)

We set up two identical HP Z840 Workstations with the following specs
2 x Xeon E5-2690 v4 # 2.60GHz (Turbo Boost ON, HT OFF, total 28 logical CPUs)
32GB DDR4 2400 Memory, Quad-channel
and installed Windows 7 SP1 (x64) and Windows 10 Creators Update (x64) on each.
Then we ran a small memory benchmark (code below, built with VS2015 Update 3, 64-bit architecture) which performs memory allocation-fill-free simultaneously from multiple threads.
#include <Windows.h>
#include <vector>
#include <ppl.h>
unsigned __int64 ZQueryPerformanceCounter()
{
unsigned __int64 c;
::QueryPerformanceCounter((LARGE_INTEGER *)&c);
return c;
}
unsigned __int64 ZQueryPerformanceFrequency()
{
unsigned __int64 c;
::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
return c;
}
class CZPerfCounter {
public:
CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {};
void reset() { m_st = ZQueryPerformanceCounter(); };
unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; };
unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); };
unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); };
static unsigned __int64 frequency() { return m_freq; };
private:
unsigned __int64 m_st;
static unsigned __int64 m_freq;
};
unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();
int main(int argc, char ** argv)
{
SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);
int ncpu = sysinfo.dwNumberOfProcessors;
if (argc == 2) {
ncpu = atoi(argv[1]);
}
{
printf("No of threads %d\n", ncpu);
try {
concurrency::Scheduler::ResetDefaultSchedulerPolicy();
int min_threads = 1;
int max_threads = ncpu;
concurrency::SchedulerPolicy policy
(2 // two entries of policy settings
, concurrency::MinConcurrency, min_threads
, concurrency::MaxConcurrency, max_threads
);
concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
}
catch (concurrency::default_scheduler_exists &) {
printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
}
static int cnt = 100;
static int num_fills = 1;
CZPerfCounter pcTotal;
// malloc/free
printf("malloc/free\n");
{
CZPerfCounter pc;
for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
concurrency::parallel_for(0, 50, [i](size_t x) {
std::vector<void *> ptrs;
ptrs.reserve(cnt);
for (int n = 0; n < cnt; n++) {
auto p = malloc(i);
ptrs.emplace_back(p);
}
for (int x = 0; x < num_fills; x++) {
for (auto p : ptrs) {
memset(p, num_fills, i);
}
}
for (auto p : ptrs) {
free(p);
}
});
printf("size %4d MB, elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
pc.reset();
}
}
printf("\n");
printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
}
return 0;
}
Surprisingly, the result is very bad in Windows 10 CU compared to Windows 7. I plotted the result below for 1MB chunk size and 8MB chunk size, varying the number of threads from 2,4,.., up to 28. While Windows 7 gave slightly worse performance when we increased the number of threads, Windows 10 gave much worse scalability.
We have tried to make sure all Windows update is applied, update drivers, tweak BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all gave similar curve for Windows 10. So it seems to be a problem of Windows 10.
Does anyone have similar experience, or maybe know-how about this (maybe we missed something ?). This behavior has made our multithreaded application got significant performance hit.
*** EDITED
Using https://github.com/google/UIforETW (thanks to Bruce Dawson) to analyze the benchmark, we found that most of the time is spent inside kernels KiPageFault. Digging further down the call tree, all leads to ExpWaitForSpinLockExclusiveAndAcquire. Seems that the lock contention is causing this issue.
*** EDITED
Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still a lot better than Win10 CU.
*** EDITED
It happens in Server 2016 as well. I added the tag windows-server-2016.
*** EDITED
Using info from #Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirmed significant improvement compared to when VirtualLock is not used. Overall Win10 is still 30% to 40% slower than Win7 when both using VirtualAlloc and VirtualLock.
Microsoft seems to have fixed this issue with Windows 10 Fall Creators Update and Windows 10 Pro for Workstation.
Here is the updated graph.
Win 10 FCU and WKS has lower overhead than Win 7. In exchange, the VirtualLock seems to have higher overhead.
Unfortunately not an answer, just some additional insight.
Little experiment with a different allocation strategy:
#include <Windows.h>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <atomic>
#include <iostream>
#include <chrono>
class AllocTest
{
public:
virtual void* Alloc(size_t size) = 0;
virtual void Free(void* allocation) = 0;
};
class BasicAlloc : public AllocTest
{
public:
void* Alloc(size_t size) override {
return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}
void Free(void* allocation) override {
VirtualFree(allocation, NULL, MEM_RELEASE);
}
};
class ThreadAlloc : public AllocTest
{
public:
ThreadAlloc() {
t = std::thread([this]() {
std::unique_lock<std::mutex> qlock(this->qm);
do {
this->qcv.wait(qlock, [this]() {
return shutdown || !q.empty();
});
{
std::unique_lock<std::mutex> rlock(this->rm);
while (!q.empty())
{
q.front()();
q.pop();
}
}
rcv.notify_all();
} while (!shutdown);
});
}
~ThreadAlloc() {
{
std::unique_lock<std::mutex> lock1(this->rm);
std::unique_lock<std::mutex> lock2(this->qm);
shutdown = true;
}
qcv.notify_all();
rcv.notify_all();
t.join();
}
void* Alloc(size_t size) override {
void* target = nullptr;
{
std::unique_lock<std::mutex> lock(this->qm);
q.emplace([this, &target, size]() {
target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
VirtualLock(target, size);
VirtualUnlock(target, size);
});
}
qcv.notify_one();
{
std::unique_lock<std::mutex> lock(this->rm);
rcv.wait(lock, [&target]() {
return target != nullptr;
});
}
return target;
}
void Free(void* allocation) override {
{
std::unique_lock<std::mutex> lock(this->qm);
q.emplace([allocation]() {
VirtualFree(allocation, NULL, MEM_RELEASE);
});
}
qcv.notify_one();
}
private:
std::queue<std::function<void()>> q;
std::condition_variable qcv;
std::condition_variable rcv;
std::mutex qm;
std::mutex rm;
std::thread t;
std::atomic_bool shutdown = false;
};
int main()
{
SetProcessWorkingSetSize(GetCurrentProcess(), size_t(4) * 1024 * 1024 * 1024, size_t(16) * 1024 * 1024 * 1024);
BasicAlloc alloc1;
ThreadAlloc alloc2;
AllocTest *allocator = &alloc2;
const size_t buffer_size =1*1024*1024;
const size_t buffer_count = 10*1024;
const unsigned int thread_count = 32;
std::vector<void*> buffers;
buffers.resize(buffer_count);
std::vector<std::thread> threads;
threads.resize(thread_count);
void* reference = allocator->Alloc(buffer_size);
std::memset(reference, 0xaa, buffer_size);
auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
for (int i = thread_id; i < buffer_count; i+= thread_count) {
buffers[i] = allocator->Alloc(buffer_size);
std::memcpy(buffers[i], reference, buffer_size);
allocator->Free(buffers[i]);
}
};
for (int i = 0; i < 10; i++)
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int t = 0; t < thread_count; t++) {
threads[t] = std::thread(func, t);
}
for (int t = 0; t < thread_count; t++) {
threads[t].join();
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << duration << std::endl;
}
DebugBreak();
return 0;
}
Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad core CPU (no HT), there is no constellation in which ThreadAlloc could outperform it. ThreadAlloc is constantly around 30% slower. (Which is actually surprisingly little, and it keeps true even for tiny 1kB allocations!)
However, if the CPU has around 8-12 virtual cores, then it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" on the base line overhead of soft faults.
If you profile the two different allocation strategies, you can see that for a low thread count, KiPageFault shifts from memcpy on BasicAlloc to VirtualLock on ThreadAlloc.
For higher thread and core counts, eventually ExpWaitForSpinLockExclusiveAndAcquire starts emerging from virtually zero load to up to 50% with BasicAlloc, while ThreadAlloc only maintains the constant overhead from KiPageFault itself.
Well, the stall with ThreadAlloc is also pretty bad. No matter how many cores or nodes in a NUMA system you have, you are currently hard capped to around 5-8GB/s in new allocations, across all processes in the system, solely limited by single thread performance. All the dedicated memory management thread achieves, is not wasting CPU cycles on a contended critical section.
You would have expected that Microsoft had a lock free strategy for assigning pages on different cores, but apparently that's not even remotely the case.
The spin-lock was also already present in the Windows 7 and earlier implementations of KiPageFault. So what did change?
Simple answer: KiPageFault itself became much slower. No clue what exactly caused it to slow down, but the spin-lock simply never became a obvious limit, because 100% contention was never possible before.
If someone whishes to disassemble KiPageFault to find the most expensive part - be my guest.

Inverting an image using MPI

I am trying to invert a PGM image using MPI. The grayscale (PGM) image should be loaded on the root processor and then be sent to each of the s^2 processors. Each processor will invert a block of the given image, and the inverted blocks will be gathered back on the root processor, which will assemble the blocks into the final image and write it to a PGM image. I ran the following code, but did not get any output. The image was read after running the code, but there was no indication of writing the resultant image. Could you please let me know what could be wrong with it?
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <time.h>
#include <string.h>
#include <math.h>
#include <memory.h>
#define max(x, y) ((x>y) ? (x):(y))
#define min(x, y) ((x<y) ? (x):(y))
int xdim;
int ydim;
int maxraw;
unsigned char *image;
void ReadPGM(FILE*);
void WritePGM(FILE*);
#define s 2
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
int p, rank;
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
const int NPROWS=s; /* number of rows in _decomposition_ */
const int NPCOLS=s; /* number of cols in _decomposition_ */
const int BLOCKROWS = xdim/NPROWS; /* number of rows in _block_ */
const int BLOCKCOLS = ydim/NPCOLS; /* number of cols in _block_ */
int i, j;
FILE *fp;
float BLimage[BLOCKROWS*BLOCKCOLS];
for (int ii=0; ii<BLOCKROWS*BLOCKCOLS; ii++)
BLimage[ii] = 0;
float BLfilteredMat[BLOCKROWS*BLOCKCOLS];
for (int ii=0; ii<BLOCKROWS*BLOCKCOLS; ii++)
BLfilteredMat[ii] = 0;
if (rank == 0) {
/* begin reading PGM.... */
ReadPGM(fp);
}
MPI_Datatype blocktype;
MPI_Datatype blocktype2;
MPI_Type_vector(BLOCKROWS, BLOCKCOLS, ydim, MPI_FLOAT, &blocktype2);
MPI_Type_create_resized( blocktype2, 0, sizeof(float), &blocktype);
MPI_Type_commit(&blocktype);
int disps[NPROWS*NPCOLS];
int counts[NPROWS*NPCOLS];
for (int ii=0; ii<NPROWS; ii++) {
for (int jj=0; jj<NPCOLS; jj++) {
disps[ii*NPCOLS+jj] = ii*ydim*BLOCKROWS+jj*BLOCKCOLS;
counts [ii*NPCOLS+jj] = 1;
}
}
MPI_Scatterv(image, counts, disps, blocktype, BLimage, BLOCKROWS*BLOCKCOLS, MPI_FLOAT, 0, MPI_COMM_WORLD);
//************** Invert the block **************//
for (int proc=0; proc<p; proc++) {
if (proc == rank) {
for (int j = 0; j < BLOCKCOLS; j++) {
for (int i = 0; i < BLOCKROWS; i++) {
BLfilteredMat[j*BLOCKROWS+i] = 255 - image[j*BLOCKROWS+i];
}
}
} // close if (proc == rank) {
MPI_Barrier(MPI_COMM_WORLD);
} // close for (int proc=0; proc<p; proc++) {
MPI_Gatherv(BLfilteredMat, BLOCKROWS*BLOCKCOLS,MPI_FLOAT, image, counts, disps,blocktype, 0, MPI_COMM_WORLD);
if (rank == 0) {
/* Begin writing PGM.... */
WritePGM(fp);
free(image);
}
MPI_Finalize();
return (1);
}
It is very likely MPI is not the right tool for the job. The reason for this is that your job is inherently bandwidth limited.
Think of it this way: You have a coloring book with images which you all want to color in.
Method 1: you take your time and color them in one by one.
Method 2: you copy each page to a new sheet of paper and mail it to a friend who then colors it in for you. He mails it back to you and in the end you glue all the pages you received from all of your friends together to make one colored-in book.
Note that method two involves copying the whole book, which is arguably the same amount of work needed to color in the whole book. So method two is less time-efficient without even considering the overhead of shoving the pages into an envelope, licking the stamp, going to the post office and waiting for the letter to be delivered.
If you look at your code, every transmitted byte is only touched once throughout the whole program in this line:
BLfilteredMat[j*BLOCKROWS+i] = 255 - image[j*BLOCKROWS+i];
The single processor is much faster at subtracting two integers than it is at sending an integer of the wire, therefore one must advise against using MPI for your particular problem.
My suggestion to solve your problem: Try to avoid unneccessary communication whenever possible. Do all processes have access to the file system on which the files are located? You could try reading them directly from the filesystem.

fftw3 proper using

#include <stdio.h>
#include <Windows.h>
#include <string.h>
#include <stdlib.h>
#include "fftw3.h"
int main(void)
{
FILE *fp;
int rozmiar_pliku;
char standard[5] = {0};
char format[5] = {0};
int samplerate;
int k,i;
fftw_complex in[128];
fftw_complex out[128];
fftw_plan p;
fp = fopen("Kalimba.wav","rb" );
//printf("%d\n",fp);
if (fp)
{
fread(standard,1,4,fp);
printf("%s\n",standard);
printf("RIFF\n");
if (!strcmp(standard,"RIFF" ))
{
fread(&rozmiar_pliku,4,1,fp);
printf("size: %d\n", rozmiar_pliku);
}
fread(format,1,4,fp);
printf("format: %s\n",format);
fseek(fp,24,SEEK_SET);
fread(&samplerate,1,4,fp);
printf("sample rate: %d\n",samplerate);
fseek(fp,44,SEEK_SET);
for(i=0;i<128;++i)
{
in[i][0]=getc(fp);
in[i][1]=in[i][0];
}
/*
p = fftw_plan_dft_1d(128, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(p);
for(int j=0;j<128;++j)
printf("%lf+i*%lf",out[j][0],out[j][1]);
fftw_destroy_plan(p);
fftw_free(in);
fftw_free(out);
*/
}
return 0;
}
I'm trying to read wave file and perform FFT by using FFTW3. If i uncomment part which is commented there's nothing show on screen. If I leave it commented :
RIFF
RIFF
size: 61392422
format: WAVE
sample rate: 44100
If uncommented nothing appears. I don't know why it is going like this. Any use of fftw3 cause this situation.
in and out are statically declared arrays. Try passing &in[0] and &out[0] to match the type expected by fftw_plan_dft_1d.
As it is recommended in the documentation, you should declare in and out using fftw_malloc.
You can allocate them in any way that you like, but we recommend using fftw_malloc.
Then, you'll need to initialize in after creating the plan.
You must create the plan before initializing the input, because FFTW_MEASURE overwrites the in/out arrays. (Technically, FFTW_ESTIMATE does not touch your arrays, but you should always create plans first just to be sure.)
The result with some other modifications, is
#include <stdio.h>
#include <stdlib.h>
#include "fftw3.h"
int main(void)
{
FILE *fp;
int rozmiar_pliku;
char standard[5] = {0};
char format[5] = {0};
int samplerate;
int i;
fftw_complex *in, *out;
fftw_plan p;
fp = fopen("audioFile1.wav","rb" );
if (fp)
{
fread(standard,1,4,fp);
printf("%s\n",standard);
printf("RIFF\n");
if (!strcmp(standard,"RIFF" ))
{
fread(&rozmiar_pliku,4,1,fp);
printf("size: %d\n", rozmiar_pliku);
}
fread(format,1,4,fp);
printf("format: %s\n",format);
fseek(fp,24,SEEK_SET);
fread(&samplerate,1,4,fp);
printf("sample rate: %d\n",samplerate);
fseek(fp,44,SEEK_SET);
// Allocate in and out buffers using fftw_alloc
in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * 128);
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * 128);
// Create plan before initializing in
p = fftw_plan_dft_1d(128, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
// Initialize in after creating plan
for(i=0;i<128;++i)
{
in[i][0]=getc(fp);
in[i][1]=in[i][0];
}
fftw_execute(p);
for(int j=0;j<128;++j)
printf("%lf+i*%lf\n",out[j][0],out[j][1]);
fftw_destroy_plan(p);
fftw_free(in); fftw_free(out);
}
return 0;
}

Breakpoints inside CUDA kernel __global__ not hitting

Using visual studios 2010. Win 7. Nsight 2.1
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// incrementArray.cu
#include <stdio.h>
#include <assert.h>
void incrementArrayOnHost(float *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i]+1.f;
}
__global__ void incrementArrayOnDevice(float *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int j = idx;
int i = 2;
i = i+j; //->breakpoint here
if (idx<N) a[idx] = a[idx]+1.f; //->breakpoint here
}
int main(void)
{
float *a_h, *b_h; // pointers to host memory
float *a_d; // pointer to device memory
int i, N = 10;
size_t size = N*sizeof(float);
// allocate arrays on host
a_h = (float *)malloc(size);
b_h = (float *)malloc(size);
// allocate array on device
cudaMalloc((void **) &a_d, size);
// initialization of host data
for (i=0; i<N; i++) a_h[i] = (float)i;
// copy data from host to device
cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
// do calculation on host
incrementArrayOnHost(a_h, N);
// do calculation on device:
// Part 1 of 2. Compute execution configuration
int blockSize = 4;
int nBlocks = N/blockSize + (N%blockSize == 0?0:1);
// Part 2 of 2. Call incrementArrayOnDevice kernel
incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
// Retrieve result from device and store in b_h
cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// check results
for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
// cleanup
free(a_h); free(b_h); cudaFree(a_d);
return 0;
}
I've tried inserting breakpoints as listed above inside my global void incrementArrayOnDevice(float *a, int N) but they're not hitting.
When I run debug (f5) in visual studios, I tried to step into incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N); but they would skip the entire kernel code section.
tried to add a watch on the variables i and j but there was an error "CXX0017: Error: symbol "i" not found."
Is this issue normal? Can someone please try on their pc and let me know if they can hit the breakpoints? If you can, what possible problem could mine be? Please help! :(
Nsight debugging is different from VS debugging . You need to use Nsight debugging to hit the kernel breakpoints. However, for this you need 2 GPU cards. Do you have 2 cards in the first place? Please check
You can debug on a single GPU but on the following conditions:
You have to be using 5.0 toolkit
You have to be programming on a GPU that suports 303.xx NForceWare or higher

Resources