Understanding the speed up of openmp program across NUMA nodes - openmp

I came across this behavior of speed up and I am finding it hard to explain. Following is the background:
Program
The Gaussian Elimination method is invoked to solve a linear system inside a loop, and the loop iterations are parallelized across compute units. We use an augmented matrix of dimension M by (M+1), where the additional column holds the RHS.
HPC Setup - Cray XC50 node with Intel Xeon 6148 Gold with the following configuration
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 95325 MB
node 0 free: 93811 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 96760 MB
node 1 free: 96374 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Although this is not the exact HPC system, the block diagram and the related explanation seem to fully apply (https://www.nas.nasa.gov/hecc/support/kb/skylake-processors_550.html). Specifically, sub-NUMA clustering seems to be disabled.
Jobs are submitted through ALPS (aprun) as follows
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 -e N=4000 -e M=200 -e MODE=2 ./gem
In the above, N indicates the number of matrices and M the dimension of each matrix. These are passed as environment variables to the program and used internally. MODE can be ignored for this discussion.
The -cc list explicitly lists the CPUs to bind to. OMP_NUM_THREADS is set to 20. The intent is to use 20 threads across 20 compute units.
Time to run sequentially and parallel is recorded within the program using omp_get_wtime() and the results are the following
CPU Binding | Objective | Speed Up
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 | Load work across 20 physical cores on socket 0 | 13.081944
0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 | Spread across first 10 physical cores on socket 0 & socket 1 | 18.332559
10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 | Spread across 2nd set of 10 physical cores on socket 0 & first 10 of socket 1 | 18.636265
40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 | Spread across virtual cores across sockets (40-0, 60-21) | 15.922209
Why is the speed up lower for the first case, when all physical cores on socket 0 are being used? The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower, whereas it seems to be exactly the opposite. Also, what can possibly explain the last scenario, where virtual cores are being used?
Note: We have tried multiple iterations and the results for the above combinations are pretty consistent.
Edit1:
Edit2: Source code
#define _GNU_SOURCE
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include "sched.h"
#include "omp.h"

double drand(double low, double high, unsigned int *seed)
{
    return ((double)rand_r(seed) * (high - low)) / (double)RAND_MAX + low;
}

void init_vars(int *N, int *M, int *mode)
{
    const char *number_of_instances = getenv("N");
    if (number_of_instances) {
        *N = atoi(number_of_instances);
    }
    const char *matrix_dim = getenv("M");
    if (matrix_dim) {
        *M = atoi(matrix_dim);
    }
    const char *running_mode = getenv("MODE");
    if (running_mode) {
        *mode = atoi(running_mode);
    }
}

void print_matrix(double *instance, int M)
{
    for (int row = 0; row < M; row++) {
        for (int column = 0; column <= M; column++) {
            printf("%lf ", instance[row * (M + 1) + column]);
        }
        printf("\n");
    }
    printf("\n");
}

void swap(double *a, double *b)
{
    double temp = *a;
    *a = *b;
    *b = temp;
}

void init_matrix(double *instance, unsigned int M)
{
    unsigned int seed = 45613 + 19 * omp_get_thread_num();
    for (int row = 0; row < M; row++) {
        for (int column = 0; column <= M; column++) {
            instance[row * (M + 1) + column] = drand(-1.0, 1.0, &seed);
        }
    }
}

void initialize_and_solve(int M)
{
    double *instance;
    instance = malloc(M * (M + 1) * sizeof(double));
    // Initialise the matrix
    init_matrix(instance, M);
    // Performing elementary operations
    int i, j, k = 0, c, flag = 0, m = 0;
    for (i = 0; i < M; i++) {
        if (instance[i * (M + 2)] == 0) {
            c = 1;
            while ((i + c) < M && instance[(i + c) * (M + 1) + i] == 0)
                c++;
            if ((i + c) == M) {
                flag = 1;
                break;
            }
            for (j = i, k = 0; k <= M; k++) {
                swap(&instance[j * (M + 1) + k], &instance[(j + c) * (M + 1) + k]);
            }
        }
        for (j = 0; j < M; j++) {
            // Excluding all i == j
            if (i != j) {
                // Converting Matrix to reduced row
                // echelon form(diagonal matrix)
                double pro = instance[j * (M + 1) + i] / instance[i * (M + 2)];
                for (k = 0; k <= M; k++)
                    instance[j * (M + 1) + k] -= (instance[i * (M + 1) + k]) * pro;
            }
        }
    }
    // Get the solution in the last column
    for (int i = 0; i < M; i++) {
        instance[i * (M + 1) + M] /= instance[i * (M + 2)];
    }
    free(instance);
    instance = NULL;
}

double solve_serial(int N, int M)
{
    double now = omp_get_wtime();
    for (int i = 0; i < N; i++) {
        initialize_and_solve(M);
    }
    return omp_get_wtime() - now;
}

double solve_parallel(int N, int M)
{
    double now = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        initialize_and_solve(M);
    }
    return omp_get_wtime() - now;
}

int main(int argc, char **argv)
{
    // Default parameters
    int N = 200, M = 200, mode = 2;
    if (argc == 4) {
        N = atoi(argv[1]);
        M = atoi(argv[2]);
        mode = atoi(argv[3]);
    }
    init_vars(&N, &M, &mode);
    if (mode == 0) {
        // Serial only
        double l2_norm_serial = 0.0;
        double serial = solve_serial(N, M);
        printf("Time, %d, %d, %lf\n", N, M, serial);
    } else if (mode == 1) {
        // Parallel only
        double l2_norm_parallel = 0.0;
        double parallel = solve_parallel(N, M);
        printf("Time, %d, %d, %lf\n", N, M, parallel);
    } else {
        // Both serial and parallel
        // Solve using GEM (serial)
        double serial = solve_serial(N, M);
        // Solve using GEM (parallel)
        double parallel = solve_parallel(N, M);
        printf("Time, %d, %d, %lf, %lf, %lf\n", N, M, serial, parallel, serial / parallel);
    }
    return 0;
}
Edit3: Rephrased the first point to clarify what is actually being done (based on feedback in the comments)

You say you implement a "Simple implementation of Gaussian Elimination". Sorry, there is no such thing. There are multiple different algorithms and they all come with their own analysis. But let's assume you use the textbook one. Even then, Gaussian Elimination is not simple.
First of all, you haven't stated that you initialized your data in parallel. If you don't do that, all the data will wind up on socket 0 and you will get bad performance, never mind the speedup. But let's assume you did the right thing here. (If not, google "first touch".)
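To illustrate the first-touch point, here is a minimal sketch (my own illustration, not the poster's code): the thread that first writes a page decides on which NUMA node that page is allocated, so initialization should be done with the same parallel distribution as the later compute loop.
#include <stdlib.h>
#include <omp.h>

#define SZ (1 << 26)

int main(void)
{
    double *a = malloc(SZ * sizeof(double));

    /* Serial first touch: every page is faulted in by the master thread,
       so all pages land on the master thread's NUMA node. */
    /* for (long i = 0; i < SZ; i++) a[i] = 0.0; */

    /* Parallel first touch: each thread faults in the pages of the chunk
       it will later work on, so pages are distributed across NUMA nodes. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < SZ; i++)
        a[i] = 0.0;

    /* ... later, use the same schedule(static) partitioning in the compute
       loop so each thread mostly touches its own node-local pages ... */
    free(a);
    return 0;
}
(In the code posted in Edit2, each matrix is malloc'ed and initialized inside the parallel loop itself, so every thread first-touches the data it later works on.)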
In the GE algorithm, each of the sequential k iterations works on a smaller and smaller subset of the data. This means that no simple mapping of data to cores is possible. If you place your data in such a way that initially each core works on local data, this will quickly no longer be the case.
In fact, after half the number of iterations, half your cores will be pulling data from the other socket, leading to NUMA coherence delays. Maybe a spread binding is better here than your compact binding.

Why is the speed up lower for the first case, when all physical cores on socket 0 are being used?
Results often depend on the application, but some patterns show up regularly. My guess is that your application makes heavy use of main RAM, and 2 sockets bring more DDR4 memory channels into play than only one. Indeed, with local NUMA-node allocations, 1 socket can access its RAM at about 128 GB/s, while 2 sockets together can reach about 256 GB/s. With a balanced use of the DDR4 RAM blocks but without NUMA-local allocations, performance would be far worse and bounded by UPI (though I do not expect the 2-socket case to be much slower than that, because the data transfer is full-duplex).
The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower whereas it seems to be exactly the opposite.
UPI is only a bottleneck if data is massively transferred between the two sockets, but a good NUMA-aware application should not do that, since each thread should operate on its own NUMA node's memory.
You can check the use of the UPI and RAM throughput using hardware counters.
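If hardware counters are not readily available, a rough software-only check is to run a small STREAM-like triad under the same aprun bindings and compare the sustained bandwidth; a minimal sketch (the array size and the 3-bytes-per-element accounting are my own rough choices):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 200L * 1000 * 1000;   /* ~4.8 GB across three arrays */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));

    /* parallel first touch so pages follow the thread binding */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        c[i] = a[i] + 3.0 * b[i];        /* triad: 2 reads + 1 write per element */
    t = omp_get_wtime() - t;

    printf("approx. bandwidth: %.1f GB/s\n", 3.0 * n * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
Running this with the single-socket and the two-socket -cc lists should show roughly the 1x vs. 2x memory bandwidth difference described above.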
Also what can possibly explain the last scenario when virtual cores are being used.
I do not have an explanation for this. Note that the higher IDs are the second hardware threads (hyper-threads) of each core, so it is almost certainly related to low-level hyper-threading behaviour (maybe some other processes are bound to some PUs and pre-empt the target PUs, or the second PU of each core somehow has a lower priority). Note also that physical core IDs and logical PU IDs are often not mapped the same way, so if you use the wrong one you could end up binding 2 threads to the same core. I advise you to use hwloc to check that.
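As a quick complementary sanity check of the binding (a minimal sketch, not hwloc itself), you can print which logical CPU each OpenMP thread actually lands on with the GNU sched_getcpu() call and compare it against the -cc list:
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() reports the logical CPU (PU) the calling thread is
           currently executing on; with a correct -cc binding each thread
           should report a distinct CPU from the requested list. */
        printf("thread %2d runs on logical CPU %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}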

Related

Using an algorithm to determine the ideal size for shipping boxes

I work in a logistics department for a company. Recently we have been trying to narrow down the number of different packaging options that we use.
I have all the necessary product data like length, width, height, volume and also sales data.
So I was thinking: is it possible to use an algorithm to cluster the different volumes of the products, and maybe also take into account which sizes are selling the most, to determine which box sizes would be ideal?
(Taking into account how often a product sells is secondary, so that is not absolutely necessary.)
What I want is to give the algorithm the number of different box sizes I want, and have it determine where to put the limits so that there is a solution for every product that we have. The goal of the optimization is minimum volume wasted while not using more than the set number of different boxes.
Also important to note: the orientation of the products and the amount per box are fixed, so there is no need to determine how to pack the products or how many ideally go into one box.
What kind of algorithms could be used for a problem like this, and what are my options to program them? I was thinking of using Matlab, but would also be open to other options. I want to program it, not simply use an existing program like SPSS.
Thanks in advance, and forgive me if my English is not the best; I'm not a native speaker.
The following C++ program will find optimal solutions for small instances. For 10 input box sizes, each having dimensions randomly chosen in the range 1..100, and for any number 1..10 of box sizes to choose, it computes the answer in a couple of seconds on my computer. For 15 input box sizes, it takes around 10s. For 20 input box sizes, I could compute up to 4 chosen box sizes in about 3 minutes, with memory becoming an issue (it used around 3GB). I had to increase the linker's default stack size to avoid stack overflows.
#include <iostream>
#include <algorithm>
#include <vector>
#include <array>
#include <map>
#include <set>
#include <functional>
#include <climits>
using namespace std;

ostream& operator<<(ostream& os, array<int, 3> a) {
    return os << '(' << a[0] << ", " << a[1] << ", " << a[2] << ')';
}

template <int N>
long long vol(array<int, N> b) {
    return static_cast<long long>(b[0]) * b[1] * b[2];
}

template <int N, int M>
bool fits(array<int, N> a, array<int, M> b) {
    return a[0] <= b[0] && a[1] <= b[1] && a[2] <= b[2];
}

// Compares first by volume, then lexicographically.
struct CompareByVolumeDesc {
    bool operator()(array<int, 3> a, array<int, 3> b) const {
        return vol(a) > vol(b) || vol(a) == vol(b) && a < b;
    }
};

vector<array<int, 3>> candSizes;

struct State {
    vector<array<int, 4>> req;
    int n;
    int k;

    // Needed for map<>
    bool operator<(State const& other) const {
        return make_tuple(n, k, req) < make_tuple(other.n, other.k, other.req);
    }
} dummy = { {}, -1, -1 };

// Memoised results (stored volumes are long long to match solve()'s return type).
map<State, pair<long long, State>> memo;

// Compute the minimum volume required for the given list of box sizes
// if we use exactly k of the first n candidate box sizes.
pair<long long, State> solve(State const& s) {
    if (empty(s.req)) return { 0, dummy };
    if (s.k == 0 || s.k > s.n) return { LLONG_MAX / 4, dummy };
    auto previousAnswer = memo.find(s);
    if (previousAnswer != end(memo)) return (*previousAnswer).second;
    // Try using the nth candidate box size.
    int nFitting = 0;
    vector<array<int, 4>> notFitting;
    for (auto r : s.req) {
        if (fits(r, candSizes[s.n - 1])) {
            nFitting += r[3];
        } else {
            notFitting.push_back(r);
        }
    }
    pair<long long, State> solution;
    solution.second = { s.req, s.n - 1, s.k };
    solution.first = solve(solution.second).first;
    if (nFitting > 0) {
        State useNth = { notFitting, s.n - 1, s.k - 1 };
        long long useNthVol = nFitting * vol(candSizes[s.n - 1]) + solve(useNth).first;
        if (useNthVol < solution.first) solution = { useNthVol, useNth };
    }
    memo[s] = solution;
    return solution;
}

void printOptimalSolution(State s) {
    while (!empty(s.req)) {
        State next = solve(s).second;
        if (next.k < s.k) cout << candSizes[s.n - 1] << endl;
        s = next;
    }
}

int main(int argc, char** argv) {
    int n, k;
    cin >> n >> k;
    vector<array<int, 4>> requestedBoxSizes;
    set<int> lengths, widths, heights;
    for (int i = 0; i < n; ++i) {
        array<int, 4> d; // d[3] is actually the number of requests for this box size
        cin >> d[0] >> d[1] >> d[2] >> d[3];
        sort(begin(d), begin(d) + 3, std::greater<int>());
        requestedBoxSizes.push_back(d);
        lengths.insert(d[0]);
        widths.insert(d[1]);
        heights.insert(d[2]);
    }
    // Generate all candidate box sizes
    for (int l : lengths) {
        for (int w : widths) {
            for (int h : heights) {
                array<int, 3> cand = { l, w, h };
                sort(begin(cand), end(cand), std::greater<int>());
                candSizes.push_back(cand);
            }
        }
    }
    sort(begin(candSizes), end(candSizes), CompareByVolumeDesc());
    candSizes.erase(unique(begin(candSizes), end(candSizes)), end(candSizes));
    cout << "Number of candidate box sizes: " << size(candSizes) << endl;
    State startState = { requestedBoxSizes, static_cast<int>(size(candSizes)), k };
    long long minVolume = solve(startState).first;
    cout << "Minimum achievable volume using " << k << " box sizes: " << minVolume << endl;
    cout << "Optimal set of " << k << " box sizes:" << endl;
    printOptimalSolution(startState);
    return 0;
}
Example input:
15 5
100 61 35 27
17 89 96 47
31 69 30 55
37 23 39 9
94 11 48 19
38 17 29 36
63 79 80 36
59 52 37 51
86 63 54 7
32 30 11 26
50 88 51 5
74 70 33 14
67 46 4 79
83 94 89 58
65 42 37 69
Example output:
Number of candidate box sizes: 2310
Minimum achievable volume using 5 box sizes: 124069460
Optimal set of 5 box sizes:
(94, 48, 11)
(69, 52, 37)
(100, 89, 35)
(88, 79, 63)
(94, 89, 83)
I'll explain the algorithm behind this if there's interest. It's better than considering all possible combinations of k candidate box sizes, but not terribly efficient.
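For the record, the recurrence that solve() memoizes can be read off the code directly (this is just my reading of it, with $R$ the remaining requested sizes, $n$ the number of candidate sizes still considered, $k$ the number of box sizes still allowed, $F_n \subseteq R$ the requested sizes that fit candidate $c_n$, and $m_n$ their total request count):
$$
f(R, n, k) =
\begin{cases}
0 & \text{if } R = \varnothing \\
\infty & \text{if } k = 0 \text{ or } k > n \\
\min\bigl( f(R,\, n-1,\, k),\;\; m_n \cdot \mathrm{vol}(c_n) + f(R \setminus F_n,\, n-1,\, k-1) \bigr) & \text{otherwise}
\end{cases}
$$
where the second option inside the min is only taken when $F_n \neq \varnothing$, and the answer is $f$ applied to all requests, all candidates, and the chosen $k$.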

How to avoid un-coalesced accesses in matrix multiplication CUDA kernel?

I am learning CUDA with the book 'Programming Massively Parallel Processors'. A practice problem from chapter 5 confuses me:
For tiled matrix multiplication out of possible range of values for
BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely
avoid un-coalesced accesses to global memory? (you only need to consider square blocks)
To my understanding, BLOCK_SIZE has little to do with memory coalescing. As long as threads within a single warp access consecutive elements, we will have coalesced accesses. I could not figure out where the kernel has un-coalesced accesses to global memory. Any hints from you guys?
Here is the kernel's source codes:
#define COMMON_WIDTH 512
#define ROW_LEFT 500
#define COL_RIGHT 250
#define K 1000
#define TILE_WIDTH 32
__device__ int D_ROW_LEFT = ROW_LEFT;
__device__ int D_COL_RIGHT = COL_RIGHT;
__device__ int D_K = K;
.....
__global__
void MatrixMatrixMultTiled(float *matrixLeft, float *matrixRight, float *output){
    __shared__ float sMatrixLeft[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sMatrixRight[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int col = bx * TILE_WIDTH + tx;
    int row = by * TILE_WIDTH + ty;
    float value = 0;
    for (int i = 0; i < ceil(D_K/(float)TILE_WIDTH); ++i){
        if (row < D_ROW_LEFT && row * D_K + i * TILE_WIDTH +tx < D_K){
            sMatrixLeft[ty][tx] = matrixLeft[row * D_K + i * TILE_WIDTH +tx];
        }
        if (col < D_COL_RIGHT && (ty + i * TILE_WIDTH) * D_COL_RIGHT + col < D_K ){
            sMatrixRight[ty][tx] = matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
        }
        __syncthreads();
        for (int j = 0; j < TILE_WIDTH; j++){
            value += sMatrixLeft[ty][j] * sMatrixRight[j][tx];
        }
        __syncthreads();
    }
    if (row < D_ROW_LEFT && col < D_COL_RIGHT ){
        output[row * D_COL_RIGHT + col] = value;
    }
}
Your question is incomplete, since the code you have posted does not make any reference to BLOCK_SIZE, and that is certainly at least very relevant to the question posed in the book. More generally, questions that pose a kernel without the launch configuration are often incomplete, since the launch configuration is often relevant to both the correctness and the behavior of a kernel.
I've not re-read this portion of the book right at the moment. However, I'll assume the kernel launch configuration includes a block dimension something like the following (this information is absent from your question but, in my opinion, should have been included for a sensible question):
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(...,...);
And I will assume the kernel launch is given by something like:
MatrixMatrixMultTiled<<<dimGrid, dimBlock>>>(...);
Your statement: "As long as threads within single warp access consecutive elements, we will have a coalesced accesses." is a reasonable working definition. Let's show that that is violated for some choices of BLOCK_SIZE, given the above assumptions to cover over the gaps in your incomplete question.
Coalesced access is a term that applies to global memory accesses only. We will therefore ignore accesses to shared memory. We will also, for this discussion, ignore accesses to the __device__ variables such as D_ROW_LEFT. (The access to those variables appears to be uniform. We can quibble about whether that constitutes coalesced access. My claim would be that it does constitute coalesced access, but we need not unpack that here.) Therefore we are left with just 3 "access" points:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
output[row * D_COL_RIGHT + col]
Now, to pick an example, let's suppose BLOCK_SIZE is 16. Will any of the above access points violate your statement "threads within single warp access consecutive elements"?
Let's start with the block (0,0). Therefore row is equal to threadIdx.y and col is equal to threadIdx.x. Let's consider the first warp in that block. Therefore the first 16 threads in that warp will have a threadIdx.y value of 0, and their threadIdx.x values will be increasing from 0..15. Likewise the second 16 threads in that warp will have a threadIdx.y value of 1, and their threadIdx.x values will be increasing from 0..15.
Now let's compute the actual index generated for the first access point above, across the warp. Let's assume we are on the first loop iteration, so i is zero. Therefore this:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
reduces to:
matrixLeft[threadIdx.y * D_K + threadIdx.x];
D_K here is just the device copy of the K variable, which is 1000. Now let's evaluate the reduced index expression above across our selected warp (0) in our selected block (0,0):
warp lane:     0  1  2  3  4  5  6 .. 15   16   17   18 ..   31
threadIdx.x:   0  1  2  3  4  5  6 .. 15    0    1    2 ..   15
threadIdx.y:   0  0  0  0  0  0  0 ..  0    1    1    1 ..    1
index:         0  1  2  3  4  5  6 .. 15 1000 1001 1002 .. 1015
Therefore the generated index pattern here shows a discontinuity between the 16th and 17th thread in the warp, and the access pattern does not fit your previously stated condition:
"threads within single warp access consecutive elements"
and we do not have coalesced access in this case (at least, for float quantities).
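To make this concrete, here is a tiny host-side C snippet (my own illustration, not from the book) that prints the matrixLeft index each lane of warp 0 in block (0,0) would generate on iteration i = 0, for a given BLOCK_SIZE. With BLOCK_SIZE = 16 you see the jump from 15 to 1000 at lane 16, while BLOCK_SIZE = 32 gives 32 consecutive indices:
#include <stdio.h>

#define D_K 1000  /* K from the question */

/* Print the flat matrixLeft index computed by each lane of warp 0 in
   block (0,0), assuming a BLOCK_SIZE x BLOCK_SIZE thread block and i = 0. */
static void print_warp0_indices(int block_size)
{
    printf("BLOCK_SIZE = %d:\n", block_size);
    for (int lane = 0; lane < 32; lane++) {
        int tx = lane % block_size;      /* threadIdx.x */
        int ty = lane / block_size;      /* threadIdx.y */
        int row = ty;                    /* by = 0       */
        int index = row * D_K + tx;      /* row * D_K + i * TILE_WIDTH + tx, with i = 0 */
        printf("lane %2d -> index %4d\n", lane, index);
    }
}

int main(void)
{
    print_warp0_indices(16);  /* lanes 0..15 -> 0..15, lanes 16..31 -> 1000..1015 */
    print_warp0_indices(32);  /* lanes 0..31 -> 0..31: fully coalesced            */
    return 0;
}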

C/C++ rand() function for biased expectation

I am using the <stdlib.h> rand() function to generate 100 random integers within the range [0 ... 9]. I used the following to generate them with a uniform distribution:
int random_numbers[100];
for(register int i = 0; i < 100; i++){
    random_numbers[i] = rand() % 10;
}
This is working fine. But now I want 100 numbers where around 50% of them are 5. How do I do that?
Extended Problem
I want to get 100 numbers, where 50% of them are between 0 and 2; that is, 50 percent of the numbers consist only of 0, 1 and 2. How do I do that?
I am expecting generalised steps that can be applied beyond the bounds of 10 or 100.
Hmmm, how about choosing a random number between 0 and 17, and if the number is greater than 9, change it to 5?
For 0 - 17, you would get a distribution like
0,1,2,3,4,5,6,7,8,9,5,5,5,5,5,5,5,5
Code:
int random_numbers[100];
for(register int i = 0; i < 100; i++){
    random_numbers[i] = rand() % 18;
    if (random_numbers[i] > 9) {
        random_numbers[i] = 5;
    }
}
You basically add a set of numbers beyond your desired range that, when translated to 5, give you equal numbers of 5 and non-5.
In order to get around 50% of these numbers to be in the [0, 2] range, you can split the full range of rand() into two equal halves and then use the same %-based technique to map the first half to the [0, 2] range and the second half to the [3, 9] range.
int random_numbers[100];
for(int i = 0; i < 100; i++)
{
    int r = rand();
    random_numbers[i] = r <= RAND_MAX / 2 ? r % 3 : r % 7 + 3;
}
To get around 50% of these numbers to be 5, a similar technique will work. Just map the second half to the [0, 9] range with 5 excluded:
int random_numbers[100];
for(int i = 0; i < 100; i++)
{
    int r = rand();
    if (r <= RAND_MAX / 2)
        r = 5;
    else if ((r %= 9) >= 5)
        ++r;
    random_numbers[i] = r;
}
I think it is easy to solve the particular 50% problem using the techniques mentioned in the other answers. Let us try to answer the question for the general case.
Let us say you want a distribution of the numbers {A1, A2, .. An} with the percentages {P1, P2, .. Pn}, where the Pi sum to 100% (and all the percentages are integers; if not, this can be adjusted).
We will create an array of size 100 and fill it with the numbers A1..An.
int distribution[100];
Now we fill in each number, its percentage many times.
int position = 0;
for (int i = 0; i < n; i++) {
    for (int j = 0; j < P[i]; j++) {
        // Add a check here to make sure the sum hasn't crossed 100
        distribution[position] = A[i];
        position++;
    }
}
Now that this one-time initialization is done, you can draw a random number as
int number = distribution[rand() % 100];
In case your percentages are not integers but you want, say, a precision of 0.1%, you can create an array of 1000 instead of 100.
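For completeness, a small self-contained sketch of this lookup-table idea applied to the original "~50% fives" case, reusing the 18-entry distribution from the first answer (the seeding and the final count are just for demonstration):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* 9 of the 18 entries are 5, so a uniform pick yields 5 half the time
       and each other digit 1/18 of the time. */
    static const int distribution[18] =
        { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5, 5, 5, 5, 5, 5 };

    srand((unsigned)time(NULL));

    int random_numbers[100];
    for (int i = 0; i < 100; i++)
        random_numbers[i] = distribution[rand() % 18];

    /* crude check: count how many fives we actually got */
    int fives = 0;
    for (int i = 0; i < 100; i++)
        if (random_numbers[i] == 5) fives++;
    printf("got %d fives out of 100\n", fives);
    return 0;
}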
In both cases, the goal is 50% selected from one set and 50% from another. Code can call rand() and use some bits (one) for choosing the group and the remaining bits for value selection.
If the range of numbers needed is much smaller than RAND_MAX, a first attempt could use:
int rand_special_50percent(int n, int special) {
    int r = rand();
    int r_div_2 = r/2;
    if (r%2) {
        return special;
    }
    int y = r_div_2%(n-1); // 9 numbers left
    if (y >= special) y++;
    return y;
}

int rand_low_50percent(int n, int low_special) {
    int r = rand();
    int r_div_2 = r/2;
    if (r%2) {
        return r_div_2%(low_special+1);
    }
    return r_div_2%(n - low_special) + low_special + 1;
}
Sample
int r5 = rand_special_50percent(10, 5);
int preferred_low_value_max = 2;
int r012 = rand_low_50percent(10, preferred_low_value_max);
Advanced:
With n above RAND_MAX/2, additional calls to rand() are needed.
When using rand()%n, unless (RAND_MAX+1u)%n == 0 (n is a divisor of RAND_MAX+1), a bias is introduced. The above code does not compensate for that.
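If that bias matters, the usual remedy is rejection: discard the top, incomplete bucket of rand()'s range and draw again. A minimal sketch (my addition, not part of the answer above):
#include <stdlib.h>

/* Return an unbiased value in [0, n), assuming 0 < n <= RAND_MAX + 1u.
   Values in the incomplete last "bucket" of rand()'s range are rejected
   and a new value is drawn. */
static int rand_unbiased(int n)
{
    unsigned limit = (RAND_MAX + 1u) - (RAND_MAX + 1u) % (unsigned)n;
    unsigned r;
    do {
        r = (unsigned)rand();
    } while (r >= limit);
    return (int)(r % (unsigned)n);
}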
C++11 solution (not optimal but easy)
std::piecewise_constant_distribution can generate random real numbers (float or double) for given intervals, with a weight for each interval.
Not optimal because this solution generates doubles and converts them to int. Also, getting exactly 50 values from [0,3) out of 100 samples is not guaranteed, but around 50 is.
For your case: 2 intervals, [0,3) and [3,100), with weights [1,1].
Equal weights, so ~50% of the numbers come from [0,3) and ~50% from [3,100).
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <random>

int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::vector<double> intervals{0, 3, 3, 100};
    std::vector<double> weights{ 1, 0, 1};
    std::piecewise_constant_distribution<> d(intervals.begin(), intervals.end(), weights.begin());
    std::map<int, int> hist;
    for(int n=0; n<100; ++n) {
        ++hist[(int)d(gen)];
    }
    for(auto p : hist) {
        std::cout << p.first << " : generated " << p.second << " times" << '\n';
    }
}
Output:
0 : generated 22 times
1 : generated 19 times
2 : generated 16 times
4 : generated 1 times
5 : generated 2 times
8 : generated 1 times
12 : generated 1 times
17 : generated 1 times
19 : generated 1 times
22 : generated 2 times
23 : generated 1 times
25 : generated 1 times
29 : generated 1 times
30 : generated 2 times
31 : generated 1 times
36 : generated 1 times
38 : generated 1 times
44 : generated 1 times
45 : generated 1 times
48 : generated 1 times
49 : generated 1 times
51 : generated 1 times
52 : generated 1 times
53 : generated 1 times
57 : generated 2 times
58 : generated 3 times
62 : generated 1 times
65 : generated 2 times
68 : generated 1 times
71 : generated 1 times
76 : generated 2 times
77 : generated 1 times
85 : generated 1 times
90 : generated 1 times
94 : generated 1 times
95 : generated 1 times
96 : generated 2 times

OpenMP Code Not Scaling due to overheads and cache issues

struct xnode
{
    float *mat;
};

void testScaling( )
{
    int N = 1000000; ///total num matrices
    int dim = 10;

    //memory for matrices
    std::vector<xnode> nodeArray(N);
    for( int k = 0; k < N; ++k )
        nodeArray[k].mat = new float [dim*dim];

    //memory for Y
    std::vector<float*> Y(N,0);
    for( int k = 0; k < N; ++k )
        Y[k] = new float [dim];

    //shared X
    float* X = new float [dim];
    for(int i = 0; i < dim; ++i ) X[i] = 1.0;

    //init mats
    for( int k = 0; k < N; ++k )
    {
        for( int i=0; i<dim*dim; ++i )
            nodeArray[k].mat[i] = 0.25+((float)i)/3;
    }

    int NTIMES = 500;

    //gemv args
    char trans = 'N';
    int lda = dim;
    int incx = 1;
    float alpha =1 , beta = 0;

    //threads
    int thr[4];
    thr[0] =1 ; thr[1] = 2; thr[2] = 4; thr[3] = 8;

    for( int t = 0; t<4; ++t )//test for nthreads
    {
        int nthreads = thr[t];
        double t_1 = omp_get_wtime();
        for( int ii = 0; ii < NTIMES; ++ii )//do matvec NTIMES
        {
            #pragma omp parallel for num_threads(nthreads)
            for( int k=0; k<N; ++k )
            {
                //compute Y[k] = mat[k] * X;
                GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
                //GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
            }
        }
        double t_2 = omp_get_wtime();
        std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
    }

    //clear memory
    for( int k = 0; k < N; ++k )
    {
        delete [] nodeArray[k].mat;
        delete [] Y[k];
    }
    delete [] X;
}
The above code parallelizes the matrix-vector products of N matrices of size dim and stores the results in N output vectors. The average over 500 repetitions is taken as the time per matrix-vector product. The matrix-vector products in the above example are all of equal size, so the threads should be perfectly balanced - we should achieve a performance scaling close to the ideal 8x. The following are the observations (machine: Intel Xeon 3.1 GHz, 2 processors, 8 cores each, Hyper-Threading enabled, Windows, VS2012, Intel MKL, Intel OpenMP library).
OBSERVATION 1:
dim=10 N=1000000
Threads 1 - time 0.138068s
Threads 2 - time 0.0729147s
Threads 4 - time 0.0360527s
Threads 8 - time 0.0224268s (6.1x on 8threads)
OBSERVATION 2 :
dim=20 N=1000000
Threads 1 time 0.326617
Threads 2 time 0.185706
Threads 4 time 0.0886508
Threads 8 time 0.0733666 (4.5x on 8 threads).
Note – I ran VTune on this case. It showed CPUTime 267.8sec, Overhead time 43 sec, Spin time – 8 sec. The overhead time is all spent in a libiomp function (intel library). 8Threads/1Thread scaling is poor for such cases.
Next - in the gemv for loop, we change nodeArray[k].mat to nodeArray[0].mat (see commented statement), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3
dim=20 N=1000000
Threads 1 time 0.152298 (The serial time is halved)
Threads 2 time 0.0769173
Threads 4 time 0.0384086
Threads 8 time 0.019336 (7.87x on 8 threads)
Thus I get almost ideal scaling - why does this happen? VTune says that a significant portion of CPU time is spent in synchronization and thread overhead. Here there seems to be no relation between the load balancing and the thread synchronization. As the matrix size is increased, the granularity should increase and the thread overhead should become proportionately smaller. But as we increase from size 10 to 20, the scaling weakens.
When we use nodeArray[0].mat (only the first matrix) for all the matrix-vector products, the cache is updated only once (since the compiler knows this during optimization) and we get near-ideal scaling. Thus the synchronization overhead seems to be related to some cache-related issue. I have tried a number of other things, like setting KMP_AFFINITY and varying the load distribution, but that did not buy me anything.
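For reference, a back-of-envelope estimate of the working set (my own numbers, assuming 4-byte floats):
dim = 10: 1,000,000 matrices x 10 x 10 x 4 B = 0.4 GB
dim = 20: 1,000,000 matrices x 20 x 20 x 4 B = 1.6 GB
Both far exceed the last-level caches, so every one of the 500 repetitions streams all matrices from main memory, and with 8 threads the loop may well be limited by memory bandwidth rather than compute; that would also be consistent with the near-ideal scaling seen when the single, cache-resident nodeArray[0].mat is reused.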
My questions are:
1. I don't have a clear idea of how cache performance affects OpenMP thread synchronization. Can someone explain this?
2. Can anything be done to improve the scaling and reduce the overhead?
Thanks

Big-O algorithmic analysis

I would say it's not a homework problem. It's just an online tutorial resource from the USACO website for learning dynamic programming concepts.
In the resource, a problem was given as follows.
Question:
A sequence of as many as 10,000 integers (0 < integer < 100,000); what is the maximum-length decreasing subsequence?
The following recursive approach was given:
#include <stdio.h>
long n, sequence[10000];
main () {
    FILE *in, *out;
    int i;
    in = fopen ("input.txt", "r");
    out = fopen ("output.txt", "w");
    fscanf(in, "%ld", &n);
    for (i = 0; i < n; i++) fscanf(in, "%ld", &sequence[i]);
    fprintf (out, "%d\n", check (0, 0, 99999));
    exit (0);
}
check (start, nmatches, smallest) {
    int better, i, best=nmatches;
    for (i = start; i < n; i++) {
        if (sequence[i] < smallest) {
            better = check (i, nmatches+1, sequence[i]);
            if (better > best) best = better;
        }
    }
    return best;
}
I am not good at algorithmic analysis. Would you please tell me the tightest possible Big-O bound for this recursive enumeration solution in the worst case? My personal guess would be O(N^N), but I have no confidence in it, because the runtime is still acceptable for N <= 100. There must be something wrong. Please help me. Thank you.
The USACO website gives the following O(n^2) dynamic programming approach:
#include <stdio.h>
#define MAXN 10000
main () {
    long num[MAXN], bestsofar[MAXN];
    FILE *in, *out;
    long n, i, j, longest = 0;
    in = fopen ("input.txt", "r");
    out = fopen ("output.txt", "w");
    fscanf(in, "%ld", &n);
    for (i = 0; i < n; i++) fscanf(in, "%ld", &num[i]);
    bestsofar[n-1] = 1;
    for (i = n-1-1; i >= 0; i--) {
        bestsofar[i] = 1;
        for (j = i+1; j < n; j++) {
            if (num[j] < num[i] && bestsofar[j] >= bestsofar[i]) {
                bestsofar[i] = bestsofar[j] + 1;
                if (bestsofar[i] > longest) longest = bestsofar[i];
            }
        }
    }
    fprintf(out, "bestsofar is %d\n", longest);
    exit(0);
}
Just look at the parameters with which you call the function. The first determines the third (which, by the way, means you didn't need the third parameter). The first ranges between 0 and n. The second one is smaller than the first. This means that you have at most n^2 different calls to the function.
Now comes the question of how many times you call the function with the same parameters. And the answer is simple: you actually generate every single decreasing subsequence. This means that for the sequence N, N-1, N-2, ... you will generate 2^N subsequences. Pretty poor, right? (Experiment with the sequence I have given you if you want.)
However, if you use the memoization technique you should have already read about, you can improve the complexity to N^3 (at most N operations in every call to the function, there are N^2 different calls, and memoization lets you pay for each distinct call only once).
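For illustration, here is a memoized variant of the recursive check (my sketch, reading from stdin for brevity). It caches only on the start index, which, as noted above, determines the smallest bound; each distinct start is computed once with at most n work, so this particular memoization collapses to the same O(n^2) as the iterative USACO solution, comfortably within the N^3 bound argued above.
#include <stdio.h>

#define MAXN 10000

long n, sequence[MAXN];
int best_from[MAXN];   /* memo: longest decreasing subsequence starting at i; 0 = not yet computed */

/* Length of the longest decreasing subsequence that starts with sequence[start]. */
static int lds_from(int start)
{
    if (best_from[start])
        return best_from[start];
    int best = 1;
    for (int i = start + 1; i < n; i++) {
        if (sequence[i] < sequence[start]) {
            int len = 1 + lds_from(i);
            if (len > best) best = len;
        }
    }
    return best_from[start] = best;
}

int main(void)
{
    scanf("%ld", &n);
    for (int i = 0; i < n; i++) scanf("%ld", &sequence[i]);
    int longest = 0;
    for (int i = 0; i < n; i++) {
        int len = lds_from(i);
        if (len > longest) longest = len;
    }
    printf("%d\n", longest);
    return 0;
}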
