Host to kernel sending information issue

The CUDA kernel below changes a cell of an array:
__global__ void test(int *mt[matrix_size])
{
mt[0][0]=12;
}
The code below is meant to copy the kernel's result back to the host, but it doesn't transfer the array correctly:
int *matrix[matrix_size],*d_matrix[matrix_size];
for(int i=0;i<matrix_size;i++)
matrix[i] = (int *)malloc(n*n*sizeof(int));
for(int i=0;i<matrix_size;i++)
cudaMalloc((void**)&d_matrix[i],sizeof(int));
test<<<1,1>>>(d_matrix);
cudaMemcpy(*matrix,*d_matrix,n*n*sizeof(int),cudaMemcpyDeviceToHost);
printf("\n\n %d \n\n",matrix[0][0]); //the result is zero instead of 12
How can I fix the problem?

You have gotten a lot wrong here.
The root cause is that d_matrix is in host memory and can't be passed directly to a kernel. If you check for runtime errors, you will see that first the cudaMemcpy call fails because of the wrong direction argument, and then, once you fix that, the kernel fails with an invalid address error.
To fix this, you need to allocate a copy of d_matrix on the GPU and copy the contents of d_matrix to it. This is necessary because the array you are passing to the kernel decays to a pointer and is not passed by value.
Something like this:
#include <cstdio>

const int n = 9;
const int matrix_size = 16;

__global__ void test(int *mt[matrix_size])
{
    mt[threadIdx.x][0] = 12 + threadIdx.x;
}

int main()
{
    int *matrix[matrix_size], *d_matrix[matrix_size];
    for(int i=0; i<matrix_size; i++) {
        matrix[i] = (int *)malloc(n * n * sizeof(int));
        cudaMalloc((void**)&d_matrix[i], n * n * sizeof(int));
    }

    int **dd_matrix;
    cudaMalloc(&dd_matrix, matrix_size * sizeof(int*));
    cudaMemcpy(dd_matrix, d_matrix, matrix_size * sizeof(int *), cudaMemcpyHostToDevice);

    test<<<1,matrix_size>>>(dd_matrix);

    for(int i=0; i<matrix_size; i++) {
        cudaMemcpy(matrix[i], d_matrix[i], n*n*sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d = %d \n", i, matrix[i][0]);
    }
    return 0;
}
Which when run gives this:
$ nvcc -g -G -arch=sm_52 -o bozocu bozocu.cu
$ ./bozocu
0 = 12
1 = 13
2 = 14
3 = 15
4 = 16
5 = 17
6 = 18
7 = 19
8 = 20
9 = 21
10 = 22
11 = 23
12 = 24
13 = 25
14 = 26
15 = 27
is, I believe, more in line with what you were expecting.

Related

Understanding the speed up of openmp program across NUMA nodes

I came across this speed-up behavior and I am finding it hard to explain. Following is the background:
Program
The program invokes the Gaussian elimination method to solve linear equations within a loop, parallelizing the workload across compute units. We use an augmented matrix of dimension (M by M+1), where the additional column holds the RHS.
HPC Setup - Cray XC50 node with Intel Xeon 6148 Gold with the following configuration
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 95325 MB
node 0 free: 93811 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 96760 MB
node 1 free: 96374 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Although this is not the actual HPC system, the block diagram and the related explanation seem to apply fully (https://www.nas.nasa.gov/hecc/support/kb/skylake-processors_550.html). Specifically, sub-NUMA clustering seems to be disabled.
Jobs submitted through ALPS are as follows:
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 -e N=4000 -e M=200 -e MODE=2 ./gem
In the above, N indicates the number of matrices and M the dimension of each matrix. These are passed as environment variables to the program and used internally. MODE can be ignored for this discussion.
The -cc list explicitly lists the CPUs to bind to. OMP_NUM_THREADS is set to 20. The intent is to use 20 threads across 20 compute units.
Time to run sequentially and in parallel is recorded within the program using omp_get_wtime(), and the results are the following:
CPU Binding | Objective | Speed Up
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 | Load work across 20 physical cores on socket 0 | 13.081944
0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 | Spread across first 10 physical cores on socket 0 & socket 1 | 18.332559
10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 | Spread across 2nd set of 10 physical cores on socket 0 & first 10 of socket 1 | 18.636265
40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 | Spread across virtual cores across sockets (40-0, 60-21) | 15.922209
Why is the speed up less for the first case, when all physical cores on socket 0 are being used? The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower, whereas it seems to be exactly the opposite. Also, what can possibly explain the last scenario, when virtual cores are being used?
Note: We have tried multiple iterations and the results for the above combinations are pretty consistent.
Edit2: Source code
#define _GNU_SOURCE
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include "sched.h"
#include "omp.h"
double drand(double low, double high, unsigned int *seed)
{
return ((double)rand_r(seed) * (high - low)) / (double)RAND_MAX + low;
}
void init_vars(int *N, int *M, int *mode)
{
const char *number_of_instances = getenv("N");
if (number_of_instances) {
*N = atoi(number_of_instances);
}
const char *matrix_dim = getenv("M");
if (matrix_dim) {
*M = atoi(matrix_dim);
}
const char *running_mode = getenv("MODE");
if (running_mode) {
*mode = atoi(running_mode);
}
}
void print_matrix(double *instance, int M)
{
for (int row = 0; row < M; row++) {
for (int column = 0; column <= M; column++) {
printf("%lf ", instance[row * (M + 1) + column]);
}
printf("\n");
}
printf("\n");
}
void swap(double *a, double *b)
{
double temp = *a;
*a = *b;
*b = temp;
}
void init_matrix(double *instance, unsigned int M)
{
unsigned int seed = 45613 + 19 * omp_get_thread_num();
for (int row = 0; row < M; row++) {
for (int column = 0; column <= M; column++) {
instance[row * (M + 1) + column] = drand(-1.0, 1.0, &seed);
}
}
}
void initialize_and_solve(int M)
{
double *instance;
instance = malloc(M * (M + 1) * sizeof(double));
// Initialise the matrix
init_matrix(instance, M);
// Performing elementary operations
int i, j, k = 0, c, flag = 0, m = 0;
for (i = 0; i < M; i++) {
if (instance[i * (M + 2)] == 0) {
c = 1;
while ((i + c) < M && instance[(i + c) * (M + 1) + i] == 0)
c++;
if ((i + c) == M) {
flag = 1;
break;
}
for (j = i, k = 0; k <= M; k++) {
swap(&instance[j * (M + 1) + k], &instance[(j + c) * (M + 1) + k]);
}
}
for (j = 0; j < M; j++) {
// Excluding all i == j
if (i != j) {
// Converting Matrix to reduced row
// echelon form(diagonal matrix)
double pro = instance[j * (M + 1) + i] / instance[i * (M + 2)];
for (k = 0; k <= M; k++)
instance[j * (M + 1) + k] -= (instance[i * (M + 1) + k]) * pro;
}
}
}
// Get the solution in the last column
for (int i = 0; i < M; i++) {
instance[i * (M + 1) + M] /= instance[i * (M + 2)];
}
free(instance);
instance = NULL;
}
double solve_serial(int N, int M)
{
double now = omp_get_wtime();
for (int i = 0; i < N; i++) {
initialize_and_solve(M);
}
return omp_get_wtime() - now;
}
double solve_parallel(int N, int M)
{
double now = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < N; i++) {
initialize_and_solve(M);
}
return omp_get_wtime() - now;
}
int main(int argc, char **argv)
{
// Default parameters
int N = 200, M = 200, mode = 2;
if (argc == 4) {
N = atoi(argv[1]);
M = atoi(argv[2]);
mode = atoi(argv[3]);
}
init_vars(&N, &M, &mode);
if (mode == 0) {
// Serial only
double l2_norm_serial = 0.0;
double serial = solve_serial(N, M);
printf("Time, %d, %d, %lf\n", N, M, serial);
} else if (mode == 1) {
// Parallel only
double l2_norm_parallel = 0.0;
double parallel = solve_parallel(N, M);
printf("Time, %d, %d, %lf\n", N, M, parallel);
} else {
// Both serial and parallel
// Solve using GEM (serial)
double serial = solve_serial(N, M);
// Solve using GEM (parallel)
double parallel = solve_parallel(N, M);
printf("Time, %d, %d, %lf, %lf, %lf\n", N, M, serial, parallel, serial / parallel);
}
return 0;
}
Edit3: Rephrased the first point to clarify what is actually being done ( based on feedback in comment )
You say you implement a "Simple implementation of Gaussian Elimination". Sorry, there is no such thing. There are multiple different algorithms and they all come with their own analysis. But let's assume you use the textbook one. Even then, Gaussian Elimination is not simple.
First of all, you haven't stated that you initialized your data in parallel. If you don't do that, all the data will wind up on socket 0 and you will get bad performance, never mind the speedup. But let's assume you did the right thing here. (If not, google "first touch".)
In the GE algorithm, each of the sequential k iterations works on a smaller and smaller subset of the data. This means that no simple mapping of data to cores is possible. If you place your data in such a way that initially each core works on local data, this will quickly no longer be the case.
In fact, after half the number of iterations, half your cores will be pulling data from the other socket, leading to NUMA coherence delays. Maybe a spread binding is better here than your compact binding.
Why is the speed up less for the first case when all physical cores on socket 0 are being used?
Results often depend on the application, but some patterns recur. My guess is that your application makes heavy use of main RAM, and two sockets bring more DDR4 memory channels into play than one. Indeed, with local NUMA-node allocations, one socket can access RAM at 128 GB/s while two sockets together can access RAM at 256 GB/s. If accesses were instead spread across both nodes' memory rather than kept local, performance would be far worse and bounded by UPI (though I do not expect two sockets to be much slower in that case, because the UPI transfer is full-duplex).
The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower whereas it seems to be exactly the opposite.
UPI is only a bottleneck if data is massively transferred between the two sockets, but a good NUMA-aware application should not do that: each thread should operate on its own NUMA node's memory.
You can check the UPI and RAM throughput using hardware counters.
Also what can possibly explain the last scenario when virtual cores are being used.
I do not have an explanation for this. Note that the higher IDs are the second hardware threads (hyperthreads) of each core, so it is certainly related to low-level hyperthreading behaviour (perhaps some other processes are bound to some PUs, causing pre-emption on the target PUs, or the second PU of each core simply has a lower priority somehow). Note also that physical core IDs and logical PU IDs are often not mapped the same way, so if you use the wrong one you could end up binding two threads to the same core. I advise you to use hwloc to check that.

Speed up random memory access using prefetch

I am trying to speed up a single program by using prefetches. My program is just a test. Here is what it does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
It reads the value at that index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger buffer sizes
At the end, I print the number of voluntary and involuntary CPU context switches
Initially, the first buffer contains at each position the value of its own index (cf. the function createIndexBuffer in the code just below).
It will be clearer in the code of my program:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 100000)
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = i;
}
return (indexBuffer);
}
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
unsigned int computeTimeInMicroSeconds()
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
unsigned long long sum = computeSum(indexBuffer, valueBuffer);
gettimeofday(&endTime, NULL);
printf("Sum = %llu\n", sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
int main()
{
printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
If I launch it, I get the following output:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds
Quick and fast!!!
According to my knowledge (I may be wrong), one of the reasons for having such a fast program is that, as I access my two buffers sequentially, the data can be prefetched into the CPU cache.
We can make it more complex so that the data can (almost) not be prefetched into the CPU cache. For example, we can just change the createIndexBuffer function to:
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
Let's try the program once again:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds
More than 18 times slower!!!
Now we arrive at my problem. Given the new createIndexBuffer function, I would like to speed up the computeSum function using prefetch:
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
Of course, I also have to change my createIndexBuffer so that it allocates a buffer with one more element.
I relaunch my program: no better! Since the prefetch may be slower than one "for" loop iteration, I can prefetch not one element ahead but two elements ahead:
__builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);
No better! Two loop iterations? Three? I tried values up to 50 (!!!) but I could not improve the performance of my computeSum function.
I would like help understanding why.
Thank you very much for your help
I believe that the above code is automatically optimized by the CPU, without any further room for manual optimization.
1. The main problem is that indexBuffer is accessed sequentially. The hardware prefetcher senses this and prefetches further values automatically, without any need to call prefetch manually. So, during iteration #i, the values indexBuffer[i+1], indexBuffer[i+2], ... are already in cache. (By the way, there is no need to add an artificial element to the end of the array: memory access errors are silently ignored by prefetch instructions.)
What you really need to do is to prefetch valueBuffer instead:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
2. But adding the above line of code won't help either in such a simple scenario. The cost of accessing memory is hundreds of cycles, while an add instruction is ~1 cycle. Your code already spends 99% of its time in memory accesses; adding a manual prefetch will make it at best one cycle faster.
Manual prefetching would really work well if your math were much heavier (try it), like using an expression with a large number of divisions that cannot be optimized out (20-30 cycles each) or calling some math function (log, sin).
3. But even that is not guaranteed to help. The dependency between loop iterations is very weak: it is only through the sum variable. This allows the CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1] concurrently while still executing the math for valueBuffer[i].
A prefetch normally fetches a full cache line, typically 64 bytes. So the random version always fetches 64 bytes for a 4-byte int: 16 times the data you actually need, which fits very well with the slowdown by a factor of 18. The code is simply limited by memory throughput, not latency.
Sorry, what I gave you was not the correct version of my code. The correct version is what you said:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
However, even with the right version, it is unfortunately no better.
I then adapted my program to try your suggestion of using the sin function.
My adapted program is the following one:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer(unsigned short prefetchStep)
{
unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + prefetchStep) * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
double sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
unsigned int index = indexBuffer[i];
sum += sin(valueBuffer[index]);
}
return (sum);
}
unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer(prefetchStep);
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
gettimeofday(&endTime, NULL);
printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
int main()
{
printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
{
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
}
The output is:
$ gcc TestPrefetch.c -O3 -o TestPrefetch -lm && taskset -c 7 ./TestPrefetch
sizeof buffers = 781Mb
prefetchStep = 0, Sum = -1107.523504 - Time: 20895326 micro-seconds = 20.895 seconds
prefetchStep = 1, Sum = 13456.262424 - Time: 12706720 micro-seconds = 12.707 seconds
prefetchStep = 2, Sum = -20179.289469 - Time: 12136174 micro-seconds = 12.136 seconds
prefetchStep = 3, Sum = 12068.302534 - Time: 11233803 micro-seconds = 11.234 seconds
prefetchStep = 4, Sum = 21071.238160 - Time: 10855348 micro-seconds = 10.855 seconds
prefetchStep = 5, Sum = -22648.280105 - Time: 10517861 micro-seconds = 10.518 seconds
prefetchStep = 6, Sum = 22665.381676 - Time: 9205809 micro-seconds = 9.206 seconds
prefetchStep = 7, Sum = 2461.741268 - Time: 11391088 micro-seconds = 11.391 seconds
...
So here it works better! Honestly, I was almost sure that it would not improve, because the math function's cost is high compared to the memory access.
If anyone could give me more information about why it is better now, I would appreciate it.
Thank you very much

Algorithm for listing all products of a set of numbers that are less than x

I'm trying to work on a sub-problem of a larger algorithm which I am really struggling with!
The Problem
If I had an array of numbers (say A), how can I efficiently list all the numbers that can be made by multiplying numbers from A together (each can be used as many times as you want) and are less than another number (say x)?
For example, let's say I had A = [7, 11, 13] and x was 1010, the answers would be:
- 7 = 7
- 11 = 11
- 13 = 13
- 7*7 = 49
- 7*11 = 77
- 7*13 = 91
- 11*11 = 121
- 11*13 = 143
- 13*13 = 169
- 7*7*7 = 343
- 7*7*11 = 539
- 7*7*13 = 637
- 7*11*11 = 847
- 7*11*13 = 1001
I tried my best not to miss any (but feel free to edit if I have)!
I can tell this is probably some type of recursion but am really struggling on this one!
Optional
A naive solution will also be nice (that's how much I'm struggling).
Running time is also optional.
UPDATE
All numbers in A are prime numbers (except 1, 2, 3, 5), obtained from the sieve of Eratosthenes.
UPDATE 2
A is also sorted
UPDATE 3
All numbers in A are under the limit
UPDATE 4
The solution does NOT need to use recursion. That was just an idea I had. Java or pseudocode is preferable!
I'd go with using a queue. The algorithm I have in mind would be something like the following (in pseudocode). Each queue entry carries the index of the last factor used, so that we only extend a product with factors at or after that index; otherwise products such as 7*11 and 11*7 would be generated twice:
multiplyUntil(A, X)
{
    queue q = new queue(); // entries are (value, start) pairs
    result = new list();
    for(int i = 0; i < A.length; i++)
        q.add((A[i], i)); // only if the initial elements are guaranteed to be < X, otherwise add a check
    while(!q.isEmpty())
    {
        (element, start) = q.pop();
        result.add(element);
        for(int i = start; i < A.length; i++)
        {
            product = element * A[i];
            // A is sorted so if this product is >= X the following will also be >= X
            if(product >= X)
            {
                // get out of the inner cycle
                break;
            }
            q.add((product, i));
        }
    }
    return result;
}
Let me know if something is unclear.
P.S: Keep in mind that the result is not guaranteed to be sorted. If you want the result sorted, you could use a heap instead of a queue, or sort the result at the end of the computation.
Here's a solution in Java with comments. It's pretty straightforward to translate to other languages.
// numbers is the original numbers like {7, 11, 13}, not modified
// offset is the offset of the currently processed number (0 = first)
// limit is the maximal allowed product
// the current array is the current combination, each element denotes
// the number of times the given number is used. E.g. {1, 2, 0} = 7*11*11
private static void getProducts(int[] numbers, int offset, int limit, int[] current) {
    if(offset == numbers.length) {
        // all numbers processed: output the current combination
        int product = 1;
        StringBuilder res = new StringBuilder();
        for(int i=0; i<offset; i++) {
            for(int j = 0; j<current[i]; j++) {
                if(res.length() > 0) res.append(" * ");
                res.append(numbers[i]);
                product *= numbers[i];
            }
        }
        // instead of printing you may copy the result to some collection
        if(product != 1)
            System.out.println(" - "+res+" = "+product);
        return;
    }
    int n = numbers[offset];
    int count = 0;
    while(limit >= 1) {
        current[offset] = count;
        getProducts(numbers, offset+1, limit, current);
        count++;
        // here is the main trick: we reduce limit for the subsequent recursive calls
        // note that in Java this is integer division
        limit /= n;
    }
}

// Main method to launch
public static void getProducts(int[] numbers, int limit) {
    getProducts(numbers, 0, limit, new int[numbers.length]);
}
Usage:
public static void main(String[] args) {
getProducts(new int[] {7, 11, 13}, 1010);
}
Output:
- 13 = 13
- 13 * 13 = 169
- 11 = 11
- 11 * 13 = 143
- 11 * 11 = 121
- 7 = 7
- 7 * 13 = 91
- 7 * 11 = 77
- 7 * 11 * 13 = 1001
- 7 * 11 * 11 = 847
- 7 * 7 = 49
- 7 * 7 * 13 = 637
- 7 * 7 * 11 = 539
- 7 * 7 * 7 = 343
The resulting products are sorted in different way, but I guess sorting is not a big problem.
Here is my solution in C++. I use a recursive function. The principle is:
the recursive function is given a limit, a current value which is a composite, and a range of primes [start, end)
it outputs all combinations of powers of the primes in the given range, multiplied by the current composite
At each step, the function takes the first prime p from the range and computes all its powers. It multiplies current by p as long as the product cp is under the limit.
We use the fact that the array is sorted by leaving as soon as cp reaches the limit.
Due to the way we compute the numbers, they won't be sorted. But it is easy to add this as a final step once you have collected the numbers (in which case you would use a back_inserter output iterator instead of an ostream_iterator, and sort the collection vector).
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

using namespace std;

template <class It, class Out>
void f(int limit, int current, It start, It end, Out out) {
    // terminal condition
    if(start == end) {
        if(current != 1)
            *(out++) = current;
        return;
    }
    // Output all numbers where the current prime is a factor,
    // starting at p^0 and stopping before current * p^n reaches the limit
    int p = *start;
    for(int cp = current; cp < limit; cp *= p) {
        f(limit, cp, start+1, end, out);
    }
}

int main(int argc, char* argv[]) {
    int const N = 1010;
    vector<int> primes{7, 11, 13};
    f(N, 1, begin(primes), end(primes), ostream_iterator<int>(cout, "\n"));
}

CUDA: Why accessing the same device array is not coalesced?

I am posting a drilled-down code for review. I believe it should compile and execute without any problems, but since I excluded all the irrelevant parts, I might have made some mistake.
struct Users {
double A[96];
double B[32];
double C[32];
};
This is my Users structure with fixed-length arrays. The main function is given below.
int main(int argc, char **argv) {
int numUsers = 10;
Users *users = new Users[numUsers];
double Step[96];
for (int i = 0; i < 32; i++) {
Step[i] = 0.8;
Step[i + 32] = 0.8;
Step[i + 64] = 0.8;
}
for (int usr = 0; usr < numUsers; usr++) {
for (int i = 0; i < 32; i++) {
users[usr].A[i] = 10;
users[usr].A[i + 32] = 20;
users[usr].A[i + 64] = 30;
}
memset(users[usr].B, 0, sizeof(double) * 32);
memset(users[usr].C, 0, sizeof(double) * 32);
}
double *d_Step;
cudaMalloc((void**)&d_Step, sizeof(double) * 96);
cudaMemcpy(d_Step, Step, sizeof(double) * 96, cudaMemcpyHostToDevice);
Users *deviceUsers;
cudaMalloc((void**)&deviceUsers, sizeof(Users) * numUsers);
cudaMemcpy(deviceUsers, users, sizeof(Users) * numUsers, cudaMemcpyHostToDevice);
dim3 grid;
dim3 block;
grid.x = 1;
grid.y = 1;
grid.z = 1;
block.x = 32;
block.y = 10;
block.z = 1;
calc<<<grid, block >>> (deviceUsers, d_Step, numUsers);
delete [] users;
return 0;
}
Please note that the Step array is a 1D array with 96 bins, and I am spanning 10 warps (32 threads in the x direction, with 10 such rows in my block). Each warp will access the same Step array. This can be seen below in the kernel.
__global__ void calc(Users *users, double *Step, int numUsers) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
int uId = threadIdx.y;
while (uId < numUsers) {
double mean00 = users[uId].A[tId] * Step[tId];
double mean01 = users[uId].A[tId + 32] * Step[tId + 32];
double mean02 = users[uId].A[tId + 64] * Step[tId + 64];
users[uId].A[tId] = (mean00 == 0? 0 : 1 / mean00);
users[uId].A[tId + 32] = (mean01 == 0? 0 : 1 / mean01);
users[uId].A[tId + 64] = (mean02 == 0? 0 : 1 / mean02);
uId += 10;
}
}
Now when I use the NVIDIA Visual Profiler, the coalesced retrievals are 47%. I investigated further and found that the Step array, which is accessed by every warp, causes this problem. If I replace it with some constant, the accesses are 100% coalesced.
Q1) As I understand it, coalesced accesses are linked to memory segments, i.e. the bytes accessed by a warp have to fall in aligned segments, whether the elements are integers or doubles. Why am I not getting coalesced accesses?
To my knowledge, whenever CUDA assigns a memory block in device global memory, it assigns an aligned address to it. Thus, as long as the warp accesses the starting point + 32 consecutive locations, the access should be coalesced. Am I correct?
Hardware
Geforce GTX 470, Compute Capability 2.0
Your kernel reads Step 10 times from global memory. Although the L1 cache can reduce the actual accesses to global memory, they are still treated as an inefficient access pattern by the profiler.
My profiler calls this metric 'global load efficiency'. It doesn't say whether the accesses are coalesced or not.

Cuda kernel function only changes matrix's first row

I am trying to sum two matrices a_h_1 and a_h_2 and write the result back to a_h_1. But for some reason my kernel function does not change any of the array elements except the first N. Even if I write a[8] = 45, for example, it is printed as 8 when it is copied back to the host. What is wrong?
#include <stdio.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void matrix_summation(float *a, float *b, int M, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<M*N)
{
a[idx] = blockIdx.x;
}
}
// main routine that executes on the host
int main(void)
{
float *a_h_1,*a_h_2, *a_d_1,*a_d_2; // Pointer to host & device arrays
const int N = 5;
const int M = 5;
// Number of elements in arrays
size_t size = (N * M) * sizeof(float);
a_h_1 = (float *)malloc(size); // Allocate array1 on host
a_h_2 = (float *)malloc(size); // Allocate array2 on host
cudaMalloc((void **) &a_d_1, size); // Allocate array1 on device
cudaMalloc((void **) &a_d_2, size); // Allocate array2 on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N*M; i++){
a_h_1[i] = (float)i;
a_h_2[i] = (float)i;
}
cudaMemcpy(a_d_1, a_h_1, size, cudaMemcpyHostToDevice);
cudaMemcpy(a_d_2, a_h_2, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = M;
int n_blocks = (M*N)/block_size;
matrix_summation <<< n_blocks, block_size >>> (a_d_1, a_d_2, M, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h_1, a_d_1, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
printf("\n\nROW 1 \n");
for (int i=0; i<(M*N); i++)
{
printf(" %f ", a_h_1[i]);
if((i+1)%N == 0)
{
printf("\nROW %d \n", ((i+1)/N)+1);
}
}
// Cleanup
free(a_h_1);
free(a_h_2);
cudaFree(a_d_1);
cudaFree(a_d_2);
system("pause");
}
Here is the output:
ROW 1
0.0 2.0 4.0 6.0 8.0 < this line is correct but others are not
ROW 2
5.0 6.0 7.0 8.0 9.0
ROW 3
10.0 11.0 12.0 13.0 14.0
ROW 4
15.0 16.0 17.0 18.0 19.0
ROW 5
20.0 21.0 22.0 23.0 24.0
It looks like you're not copying all of the device array back to your host array. In this line:
cudaMemcpy(a_h_1, a_d_1, sizeof(float)*N, cudaMemcpyDeviceToHost);
I think you meant to copy sizeof(float)*N*M:
cudaMemcpy(a_h_1, a_d_1, sizeof(float)*N*M, cudaMemcpyDeviceToHost);
