I'm planning using pthreads and mach semaphores to try to basically farm out a parallel computation to a limited number of CPUs, and I can't quite get a test program to work. Right now I have something that just goes through threads and prints out some identifier so that I could verify that it works. The code is pretty simple, except that I'm on OSX so I have to use mach semaphores instead of POSIX. My code is below
#include <iostream>
#include <pthread.h>
#include <semaphore.h>
#include <errno.h>
#include <mach/semaphore.h>
#include <mach/mach.h>
#define MAX_THREADS 256
semaphore_t free_CPU = 0;
void* t_function(void *arg) {
int* cur_number;
cur_number = (int*) arg;
kern_return_t test = semaphore_wait(free_CPU);
std::cout << "I am thread # " << *cur_number << ". Kernel return is " << test << std::endl;
semaphore_signal(free_CPU);
std::cout << "I am thread # " << *cur_number << ". I just signaled the semaphore." << std::endl;
pthread_exit(NULL);
}
int main (int argc, char * const argv[]) {
int num_reps = 10;
int n_threads = 1;
if (n_threads < MAX_THREADS) {
n_threads += 0;
} else {
n_threads = MAX_THREADS;
}
pthread_t threads[n_threads];
semaphore_create(mach_task_self(), &free_CPU, SYNC_POLICY_FIFO, 1);
// Loop over a bunch of things, feeding out to only nthreads threads at a time!
int i;
int* numbers = new int[num_reps];
for (i = 0; i < num_reps; i++) {
numbers[i] = i;
std::cout << "Throwing thread " << numbers[i] << std::endl;
int rc = pthread_create(&threads[i], NULL, &t_function, &numbers[i]);
if (rc) {
std::cout << "Failed to throw thread " << i << " Error: " << strerror(errno) << std::endl;
exit(1);
}
}
std::cout << "Threw all threads" << std::endl;
// Loop over threads to join
for (i = 0; i < num_reps; i++) {
std::cout << "Joining thread " << i << std::endl;
int rc = pthread_join(threads[i],NULL);
if (rc) {
std::cout << "Failed to join thread " << i << ". Error: " << strerror(errno) << std::endl;
exit(1);
}
}
semaphore_destroy(mach_task_self(), free_CPU);
delete[] numbers;
return 0;
}
Running this code gives me:
Throwing thread 0
Throwing thread 1
Throwing thread 2
Throwing thread 3
Throwing thread 4
Throwing thread 5
Throwing thread 6
Throwing thread 7
Throwing thread 8
Throwing thread 9
Threw all threads
Joining thread 0
I am thread # 0. Kernel return is 0
I am thread # 0. I just signaled the semaphore.
I am thread # 1. Kernel return is 0
I am thread # 1. I just signaled the semaphore.
I am thread # 2. Kernel return is 0
I am thread # 2. I just signaled the semaphore.
I am thread # 3. Kernel return is 0
I am thread # 3. I just signaled the semaphore.
I am thread # 4. Kernel return is 0
I am thread # 4. I just signaled the semaphore.
I am thread # 5. Kernel return is 0
I am thread # 5. I just signaled the semaphore.
I am thread # 6. Kernel return is 0
I am thread # 6. I just signaled the semaphore.
I am thread # 7. Kernel return is 0
I am thread # 7. I just signaled the semaphore.
I am thread # 8. Kernel return is 0
I am thread # 8. I just signaled the semaphore.
I am thread # 9. Kernel return is 0
I am thread # 9. I just signaled the semaphore.
Joining thread 1
Joining thread 2
Joining thread 3
Joining thread 4
Joining thread 5
Joining thread 6
Joining thread 7
Joining thread 8
Failed to join thread 8. Error: Unknown error: 0
To me, it looks like everything is totally fine, except it just bites the dust when it tries to join thread 8. I have no clue what's going on.
Your problem lies here:
#define MAX_THREADS 256
:
int n_threads = 1;
if (n_threads < MAX_THREADS) {
n_threads += 0;
} else {
n_threads = MAX_THREADS;
}
pthread_t threads[n_threads];
This is giving you an array of one thread ID. You're then trying to populate ten of them.
I'm not entirely certain what you're trying to acheive with that. It seems to me that, if you just used num_reps to dimension your array, it would work fine (you'd get an array of ten elements).
Related
I found the method 'vectorized/batch sort' and 'nested sort' on below link. How to use Thrust to sort the rows of a matrix?
When I tried this method for 500 row and 1000 elements, the result of them are
vectorized/batch sort : 66ms
nested sort : 3290ms
I am using 1080ti HOF model to do this operation but it takes too long compared to your case.
But in the below link, it could be less than 10ms and almost 100 microseconds.
(How to find median value in 2d array for each column with CUDA?)
Could you recommend how to optimize this method to reduce operation time?
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>
#define NSORTS 500
#define DSIZE 1000
int my_mod_start = 0;
int my_mod() {
return (my_mod_start++) / DSIZE;
}
bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2) {
return thrust::equal(d1.begin(), d1.end(), d2.begin());
}
struct sort_functor
{
thrust::device_ptr<int> data;
int dsize;
__host__ __device__
void operator()(int start_idx)
{
thrust::sort(thrust::device, data + (dsize*start_idx), data + (dsize*(start_idx + 1)));
}
};
#include <time.h>
#include <windows.h>
unsigned long long dtime_usec(LONG start) {
SYSTEMTIME timer2;
GetSystemTime(&timer2);
LONG end = (timer2.wSecond * 1000) + timer2.wMilliseconds;
return (end-start);
}
int main() {
for (int i = 0; i < 3; i++) {
SYSTEMTIME timer1;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16 * DSIZE*NSORTS));
thrust::host_vector<int> h_data(DSIZE*NSORTS);
thrust::generate(h_data.begin(), h_data.end(), rand);
thrust::device_vector<int> d_data = h_data;
// first time a loop
thrust::device_vector<int> d_result1 = d_data;
thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
GetSystemTime(&timer1);
LONG time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
for (int i = 0; i < NSORTS; i++)
thrust::sort(r1ptr + (i*DSIZE), r1ptr + ((i + 1)*DSIZE));
cudaDeviceSynchronize();
time_ms1 = dtime_usec(time_ms1);
std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
//vectorized sort
thrust::device_vector<int> d_result2 = d_data;
thrust::host_vector<int> h_segments(DSIZE*NSORTS);
thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
thrust::device_vector<int> d_segments = h_segments;
GetSystemTime(&timer1);
time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
cudaDeviceSynchronize();
time_ms1 = dtime_usec(time_ms1);
std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;
//nested sort
thrust::device_vector<int> d_result3 = d_data;
sort_functor f = { d_result3.data(), DSIZE };
thrust::device_vector<int> idxs(NSORTS);
thrust::sequence(idxs.begin(), idxs.end());
GetSystemTime(&timer1);
time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
thrust::for_each(idxs.begin(), idxs.end(), f);
cudaDeviceSynchronize();
time_ms1 = dtime_usec(time_ms1);
std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
}
return 0;
}
The main takeaway from your thrust experience is that you should never compile a debug project or with device debug switch (-G) when you are interested in performance. Compiling device debug code causes the compiler to omit many performance optimizations. The difference in your case was quite dramatic, about a 30x improvement going from debug to release code.
Here is a segmented cub sort, where we are launching 500 blocks and each block is handling a separate 1024 element array. The CUB code is lifted from here.
$ cat t1761.cu
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
#include <iostream>
const int ipt=8;
const int tpb=128;
__global__ void ExampleKernel(int *data)
{
// Specialize BlockRadixSort for a 1D block of 128 threads owning 8 integer items each
typedef cub::BlockRadixSort<int, tpb, ipt> BlockRadixSort;
// Allocate shared memory for BlockRadixSort
__shared__ typename BlockRadixSort::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[ipt];
// just create some synthetic data in descending order 1023 1022 1021 1020 ...
for (int i = 0; i < ipt; i++) thread_keys[i] = (tpb-1-threadIdx.x)*ipt+i;
// Collectively sort the keys
BlockRadixSort(temp_storage).Sort(thread_keys);
__syncthreads();
// write results to output array
for (int i = 0; i < ipt; i++) data[blockIdx.x*ipt*tpb + threadIdx.x*ipt+i] = thread_keys[i];
}
int main(){
const int blks = 500;
int *data;
cudaMalloc(&data, blks*ipt*tpb*sizeof(int));
ExampleKernel<<<blks,tpb>>>(data);
int *h_data = new int[blks*ipt*tpb];
cudaMemcpy(h_data, data, blks*ipt*tpb*sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < 10; i++) std::cout << h_data[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1761 t1761.cu -I/path/to/cub/cub-1.8.0
$ CUDA_VISIBLE_DEVICES="2" nvprof ./t1761
==13713== NVPROF is profiling process 13713, command: ./t1761
==13713== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
0 1 2 3 4 5 6 7 8 9
==13713== Profiling application: ./t1761
==13713== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 60.35% 308.66us 1 308.66us 308.66us 308.66us [CUDA memcpy DtoH]
39.65% 202.79us 1 202.79us 202.79us 202.79us ExampleKernel(int*)
API calls: 98.39% 210.79ms 1 210.79ms 210.79ms 210.79ms cudaMalloc
0.72% 1.5364ms 1 1.5364ms 1.5364ms 1.5364ms cudaMemcpy
0.32% 691.15us 1 691.15us 691.15us 691.15us cudaLaunchKernel
0.28% 603.26us 97 6.2190us 400ns 212.71us cuDeviceGetAttribute
0.24% 516.56us 1 516.56us 516.56us 516.56us cuDeviceTotalMem
0.04% 79.374us 1 79.374us 79.374us 79.374us cuDeviceGetName
0.01% 13.373us 1 13.373us 13.373us 13.373us cuDeviceGetPCIBusId
0.00% 5.0810us 3 1.6930us 729ns 2.9600us cuDeviceGetCount
0.00% 2.3120us 2 1.1560us 609ns 1.7030us cuDeviceGet
0.00% 748ns 1 748ns 748ns 748ns cuDeviceGetUuid
$
(CUDA 10.2.89, RHEL 7)
Above I am running on a Tesla K20x, which has performance that is "closer" to your 1080ti than a Tesla V100. We see that the kernel execution time is ~200us. If I run the exact same code on a Tesla V100, the kernel execution time drops to ~35us:
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t1761
==13814== NVPROF is profiling process 13814, command: ./t1761
0 1 2 3 4 5 6 7 8 9
==13814== Profiling application: ./t1761
==13814== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 82.33% 163.43us 1 163.43us 163.43us 163.43us [CUDA memcpy DtoH]
17.67% 35.073us 1 35.073us 35.073us 35.073us ExampleKernel(int*)
API calls: 98.70% 316.92ms 1 316.92ms 316.92ms 316.92ms cudaMalloc
0.87% 2.7879ms 1 2.7879ms 2.7879ms 2.7879ms cuDeviceTotalMem
0.19% 613.75us 97 6.3270us 389ns 205.37us cuDeviceGetAttribute
0.19% 601.61us 1 601.61us 601.61us 601.61us cudaMemcpy
0.02% 72.718us 1 72.718us 72.718us 72.718us cudaLaunchKernel
0.02% 59.905us 1 59.905us 59.905us 59.905us cuDeviceGetName
0.01% 37.886us 1 37.886us 37.886us 37.886us cuDeviceGetPCIBusId
0.00% 4.6830us 3 1.5610us 546ns 2.7850us cuDeviceGetCount
0.00% 1.9900us 2 995ns 587ns 1.4030us cuDeviceGet
0.00% 677ns 1 677ns 677ns 677ns cuDeviceGetUuid
$
You'll note there is no "input" array, I'm just synthesizing data in the kernel, since we are interested in performance, primarily. If you need to handle an array size like 1000, you should probably just pad each array to 1024 (e.g. pad with a very large number, then ignore the last numbers in the sorted result.)
This code is largely lifted from external documentation. It is offered for instructional purposes. I'm not suggesting it is defect-free or suitable for any particular purpose. Use it at your own risk.
I am writing a simple OpenCL application, which is going to calculate the maximum experiment FLOPS of a target GPU device. I have decided to keep my cl kernel as simple as possible. Here are my OpenCL kernel and my host code. Kernel code is:
__kernel void flops(__global float *data) {
int gid = get_global_id(0);
double s = data[gid];
data[gid] = s * 0.35;
}
And the host code is:
#include <iostream>
#include <sstream>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "support.h"
#include "Event.h"
#include "ResultDatabase.h"
#include "OptionParser.h"
#include "ProgressBar.h"
using namespace std;
std::string kernels_folder = "/home/users/saman/shoc/src/opencl/level3/FlopsFolder/";
std::string kernel_file = "flops.cl";
static const char *opts = "-cl-mad-enable -cl-no-signed-zeros "
"-cl-unsafe-math-optimizations -cl-finite-math-only";
cl_program createProgram (cl_context context,
cl_device_id device,
const char* fileName) {
cl_int errNum;
cl_program program;
std::ifstream kernelFile (fileName, std::ios::in);
if (!kernelFile.is_open()) {
std::cerr << "Failed to open file for reading: " << fileName << std::endl;
}
std::ostringstream oss;
oss << kernelFile.rdbuf();
std::string srcStdStr = oss.str();
const char *srcStr = srcStdStr.c_str();
program = clCreateProgramWithSource (context, 1, (const char **)&srcStr,
NULL, &errNum);
CL_CHECK_ERROR(errNum);
errNum = clBuildProgram (program, 0, NULL, NULL, NULL, NULL);
CL_CHECK_ERROR (errNum);
return program;
}
bool createMemObjects (cl_context context, cl_command_queue queue,
cl_mem* memObject,
const int memFloatsSize, float *a) {
cl_int err;
*memObject = clCreateBuffer (context, CL_MEM_READ_WRITE,
memFloatsSize * sizeof(float), NULL, &err);
CL_CHECK_ERROR(err);
if (*memObject == NULL) {
std::cerr << "Error creating memory objects. " << std::endl;
return false;
}
Event evWrite("write");
err = clEnqueueWriteBuffer (queue, *memObject, CL_FALSE, 0, memFloatsSize * sizeof(float),
a, 0, NULL, &evWrite.CLEvent());
CL_CHECK_ERROR(err);
err = clWaitForEvents (1, &evWrite.CLEvent());
CL_CHECK_ERROR(err);
return true;
}
void cleanup (cl_context context, cl_command_queue commandQueue,
cl_program program, cl_kernel kernel, cl_mem memObject) {
if (memObject != NULL)
clReleaseMemObject (memObject);
if (kernel != NULL)
clReleaseKernel (kernel);
if (program != NULL)
clReleaseProgram (program);
}
void addBenchmarkSpecOptions(OptionParser &op) {
}
void RunBenchmark(cl_device_id id,
cl_context ctx,
cl_command_queue queue,
ResultDatabase &resultDB,
OptionParser &op)
{
for (float i = 0.1; i <= 0.2; i+=0.1 ) {
std::cout << "Deploying " << 100*i << "%" << std::endl;
bool verbose = false;
cl_int errNum;
cl_program program = 0;
cl_kernel kernel;
cl_mem memObject = 0;
char maxFloatsStr[128];
char testStr[128];
program = createProgram (ctx, id, (kernels_folder + kernel_file).c_str());
if (program == NULL) {
exit (0);
}
if (verbose) std::cout << "Program created successfully!" << std::endl;
kernel = clCreateKernel (program, "flops", &errNum);
CL_CHECK_ERROR(errNum);
if (verbose) std::cout << "Kernel created successfully!" << std::endl;
// Identify maximum size of the global memory on the device side
cl_long maxAllocSizeBytes = 0;
cl_long maxComputeUnits = 0;
cl_long maxWorkGroupSize = 0;
clGetDeviceInfo (id, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
sizeof(cl_long), &maxAllocSizeBytes, NULL);
clGetDeviceInfo (id, CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_long), &maxComputeUnits, NULL);
clGetDeviceInfo (id, CL_DEVICE_MAX_WORK_GROUP_SIZE,
sizeof(cl_long), &maxWorkGroupSize, NULL);
// Let's use 80% of this memory for transferring data
cl_long maxFloatsUsageSize = ((maxAllocSizeBytes / 4) * 0.8);
if (verbose) std::cout << "Max floats usage size is " << maxFloatsUsageSize << std::endl;
if (verbose) std::cout << "Max compute unit is " << maxComputeUnits << std::endl;
if (verbose) std::cout << "Max Work Group size is " << maxWorkGroupSize << std::endl;
// Prepare buffer on the host side
float *a = new float[maxFloatsUsageSize];
for (int j = 0; j < maxFloatsUsageSize; j++) {
a[j] = (float) (j % 77);
}
if (verbose) std::cout << "Host buffer been prepared!" << std::endl;
// Creating buffer on the device side
if (!createMemObjects(ctx, queue, &memObject, maxFloatsUsageSize, a)) {
exit (0);
}
errNum = clSetKernelArg (kernel, 0, sizeof(cl_mem), &memObject);
CL_CHECK_ERROR(errNum);
size_t wg_size, wg_multiple;
cl_ulong local_mem, private_usage, local_usage;
errNum = clGetKernelWorkGroupInfo (kernel, id,
CL_KERNEL_WORK_GROUP_SIZE,
sizeof (wg_size), &wg_size, NULL);
CL_CHECK_ERROR (errNum);
errNum = clGetKernelWorkGroupInfo (kernel, id,
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
sizeof (wg_multiple), &wg_multiple, NULL);
CL_CHECK_ERROR (errNum);
errNum = clGetKernelWorkGroupInfo (kernel, id,
CL_KERNEL_LOCAL_MEM_SIZE,
sizeof (local_usage), &local_usage, NULL);
CL_CHECK_ERROR (errNum);
errNum = clGetKernelWorkGroupInfo (kernel, id,
CL_KERNEL_PRIVATE_MEM_SIZE,
sizeof (private_usage), &private_usage, NULL);
CL_CHECK_ERROR (errNum);
if (verbose) std::cout << "Work Group size is " << wg_size << std::endl;
if (verbose) std::cout << "Preferred Work Group size is " << wg_multiple << std::endl;
if (verbose) std::cout << "Local memory size is " << local_usage << std::endl;
if (verbose) std::cout << "Private memory size is " << private_usage << std::endl;
size_t globalWorkSize[1] = {maxFloatsUsageSize};
size_t localWorkSize[1] = {1};
Event evKernel("flops");
errNum = clEnqueueNDRangeKernel (queue, kernel, 1, NULL,
globalWorkSize, localWorkSize,
0, NULL, &evKernel.CLEvent());
CL_CHECK_ERROR (errNum);
if (verbose) cout << "Waiting for execution to finish ";
errNum = clWaitForEvents(1, &evKernel.CLEvent());
CL_CHECK_ERROR(errNum);
evKernel.FillTimingInfo();
if (verbose) cout << "Kernel execution terminated successfully!" << std::endl;
delete[] a;
sprintf (maxFloatsStr, "Size: %d", maxFloatsUsageSize);
sprintf (testStr, "Flops: %f\% Memory", 100*i);
double flopCount = maxFloatsUsageSize * 16000;
double gflop = flopCount / (double)(evKernel.SubmitEndRuntime());
resultDB.AddResult (testStr, maxFloatsStr, "GFLOPS", gflop);
// Now it's time to read back the data
a = new float[maxFloatsUsageSize];
errNum = clEnqueueReadBuffer(queue, memObject, CL_TRUE, 0, maxFloatsUsageSize*sizeof(float), a, 0, NULL, NULL);
CL_CHECK_ERROR(errNum);
if (verbose) {
for (int j = 0; j < 10; j++) {
std::cout << a[j] << " ";
}
}
delete[] a;
if (memObject != NULL)
clReleaseMemObject (memObject);
if (program != NULL)
clReleaseProgram (program);
if (kernel != NULL)
clReleaseKernel (kernel);
}
std::cout << "Program executed successfully!" << std::endl;
}
Explaining the code, in the kernel code I actually do a single floating point operation, which means every single task will do on FOPS. In the host code, I first retrieve the maximum global memory size of the GPU, allocate portion of it (for loop define how much of it), then push the data and kernel execution into it. I will measure the execution time of clEnqueueNDRangeKernel and then calculate the GFLOPS of application. In my current implementation, no matter what is the size of cl_mem, I get around 0.28 GFLOPS of performance, which is much less than the advertised power. I assume I do specific things inefficiently here. Or in general my method for calculating the GPU performance is not right. Does anyone can tell my what kind of changes should I make into the code?
With local group size of 1, you are wasting 31/32 of the resources (thus you can have 1/32 of the peak performance at most). You need local group size of at least 32 (and is multiple of 32) to fully utilize computation resources and 64 to achieve 100% occupancy (100% occupancy is not necessary though).
Memory access has high latency and low bandwidth. Your kernel will always be waiting for memory controllers if other things are right. You need do more arithmetic operations to make the ALU's busy.
You need read the document first and make use of the Visual Profiler. In the previous two parts I just want to tell that things are stranger than you thought. But more strange things are waiting.
You can achieve peak performance eaily on CPU with assembly language (By doing only independent arithmetic operations. If you write such code in C it will simply be dropped by the compiler). NVidia only provides us an IL interface called PTX, and I'm not sure if compiler will optimize it. And you can only use PTX in CUDA I think.
edit: It seems that compiler will optimize unused PTX code away, at least in inline assembers.
I wrote the following for a class, but came across some strange behavior while testing it. arrayProcedure is meant to do things with an array based on the 2 "tweaks" at the top of the function (arrSize, and start). For the assignment, arrSize must be 10,000, and start, 100. Just for kicks, I decided to see what happens if I increase them, and for some reason, if arrSize exceeds around 60,000 (I haven't found the exact limit), the program immediately crashes with a stack overflow when using a debugger:
Unhandled exception at 0x008F6977 in TMA3Question1.exe: 0xC00000FD: Stack overflow (parameters: 0x00000000, 0x00A32000).
If I just run it without a debugger, I don't get any helpful errors; windows hangs for a fraction of a second, then gives me an error TMA3Question1.exe has stopped working.
I decided to play around with debugging it, but that didn't shed any light. I placed breaks above and below the call to arrayProcedure, as well as peppered inside of it. When arrSize doesn't exceed 60,000 it runs fine: It pauses before calling arrayProcedure, properly waits at all the points inside of it, then pauses on the break underneath the call.
If I raise arrSize however, the break before the call happens, but it appears as though it never even steps into arrayProcedure; it immediately gives me a stack overflow without pausing at any of the internal breakpoints.
The only thing I can think of is the resulting arrays exceeds my computer's current memory, but that doesn't seem likely for a couple reasons:
It should only use just under a megabyte:
sizeof(double) = 8 bytes
8 * 60000 = 480000 bytes per array
480000 * 2 = 960000 bytes for both arrays
As far as I know, arrays aren't immediately constructed when I function is entered; they're allocated on definition. I placed several breakpoints before the arrays are even declared, and they are never reached.
Any light that you could shed on this would be appreciated.
The code:
#include <iostream>
#include <ctime>
//CLOCKS_PER_SEC is a macro supplied by ctime
double msBetween(clock_t startTime, clock_t endTime) {
return endTime - startTime / (CLOCKS_PER_SEC * 1000.0);
}
void initArr(double arr[], int start, int length, int step) {
for (int i = 0, j = start; i < length; i++, j += step) {
arr[i] = j;
}
}
//The function we're going to inline in the next question
void helper(double a1, double a2) {
std::cout << a1 << " * " << a2 << " = " << a1 * a2 << std::endl;
}
void arrayProcedure() {
const int arrSize = 70000;
const int start = 1000000;
std::cout << "Checking..." << std::endl;
if (arrSize > INT_MAX) {
std::cout << "Given arrSize is too high and exceeds the INT_MAX of: " << INT_MAX << std::endl;
return;
}
double arr1[arrSize];
double arr2[arrSize];
initArr(arr1, start, arrSize, 1);
initArr(arr2, arrSize + start - 1, arrSize, -1);
for (int i = 0; i < arrSize; i++) {
helper(arr1[i], arr2[i]);
}
}
int main(int argc, char* argv[]) {
using namespace std;
const clock_t startTime = clock();
arrayProcedure();
clock_t endTime = clock();
cout << endTime << endl;
double elapsedTime = msBetween(startTime, endTime);
cout << "\n\n" << elapsedTime << " milliseconds. ("
<< elapsedTime / 60000 << " minutes)\n";
}
The default stack size is 1 MB with Visual Studio.
https://msdn.microsoft.com/en-us/library/tdkhxaks.aspx
You can increase the stack size or use the new operator.
double *arr1 = new double[arrSize];
double *arr2 = new double[arrSize];
...
delete [] arr1;
delete [] arr2;
I'm implementing an openMP version of a sequential program, and for a function that distributes a list for the threads, I need function to know the number of threads.
Boiled down, the code looks like this:
int numberOfThreads = 0;
#pragma omp parallel
{
//split nodeQueue
omp_set_num_threads(NUM_THREADS);
#pragma omp master
{
cout << "Asked for " << NUM_THREADS << endl;
numberOfThreads = omp_get_num_threads();
cout << "Got " << numberOfThreads << " threads" << endl;
splitNodeQueue(numberOfThreads);
}
}
No matter what I set NUM_THREADS to, it seems to get 4 threads, and outputs:
Asked for 1
Got 4 threads
Shouln't it get a maximum of NUM_THREADS when I use omp_set_num_threads(NUM_THREADS)?
It doesn't matter what number of threads I ask for - it always gets 4 (which is the number of threads available on the CPU)...
Can't I force it to use the specified number of threads as maximum?
I think, setting num_threads from within parallel region would not change the number of threads for the fork at the start of the parallel region, it only changes the number of threads for nested parallel regions, which defaults to 1 by OMP specs
I have the following piece of code, which I want to make parallel in a certain way. I am making a mistake, and hence not all threads are running the loop as I thought it should. It would be great if somebody could help me out identifying that mistake.
This is a code to calculate histograms.
#pragma omp parallel default(shared) private(iIndex2, iIndex1, fDist) shared(iSize, dense) reduction(+:iCount)
{
chunk = (unsigned int)(iSize / omp_get_num_threads());
threadID = omp_get_thread_num();
svtout << "Number of threads available " << omp_get_num_threads() << endl;
svtout << "The threadID is " << threadID << endl;
//want each of the thread to execute the loop
for (iIndex1=0; iIndex1 < chunk; iIndex1++)
{
for (iIndex2=iIndex1+1; iIndex2 < chunk; iIndex2++)
{
iCount++;
fDist = (*this)[iIndex1 + threadID*chunk].distance( (*this)[iIndex2 + threadID*chunk] );
idx = (int)(fDist/fWidth);
if ((int)fDist % (int)fWidth >= 0)
{
#pragma omp atomic
dense[idx] += 1;
}
}
}
The iCount variable keeps track of the number of iterations, and I noticed that there is a marked difference between the serial and the parallel version. I guess not all threads are running, and hence the histogram values that I'm obtaining from the parallel program are much less than the actual readings (the dense array stores the histogram values).
Thanks,
Sayan
you are a looping over chunk, rather than iSize with more than one thread.
Try replacing loop bounds with iSize .