Parallel hashing using openmp - parallel-processing

I have a piece of code for parallel hashing, the insert code is as follows:
int main(int argc, char** argv){
.....
Entry* table;//hash table
for(size_t i=0;i<N;i++){
keys[i]=i;
values[i] = rand();//random key-value pairs
}
int omp_p = omp_get_max_threads();
#pragma omp parallel for
for(int p=0;p<omp_p;p++){
size_t start = p*N/omp_p;
size_t end = (p+1)*N/omp_p;//each thread gets contiguous chunks of the arrays
for(size_t i=start;i<end;i++){
size_t key = keys[i];
size_t value = values[i];
if(insert(table,key,value) == 0){
printf("Failure!\n");
}
}
}
....
return 0;
}
int insert(Entry* table,size_t key, size_t value){
Entry entry = (((Entry)key) << 32)+value; //Coalesce key and value into an entry
/*Use cuckoo hashing*/
size_t location = hash_function_1(key);
for(size_t its=0;its<MAX_ITERATIONS;its++){
entry = __sync_lock_test_and_set(&table[location],entry);
key=get_key(entry);
if(key == KEY_EMPTY)
return1;
}
/*We have replaced a valid key, try to hash it using next available hash function*/
size_t location_1 = hash_function_1(key);
size_t location_2 = hash_function_2(key);
size_t location_3 = hash_function_3(key);
if(location == location_1) location = location_2;
else if(location == location_2) location = location_3;
else location = location_1;
}
return 0;
}
The insert code doesn't scale at all. If I use a single thread, for say, 10M keys, I complete in about 170ms, whereas using 16 threads, I take > 500ms. My suspicion is that this is because the cache line (consisting of the table[] array) is being moved around between the threads during the write operation (__sync_lock_test_and_set(...)) and the invalidation results in a slow down
For example if I modify the insert code to just:
int insert(Entry* table,size_t key, size_t value){
Entry entry = (((Entry)key) << 32)+value; //Coalesce key and value into an entry
size_t location = hash_function_1(key);
table[location] = entry;
return 1;
}
I still get the same bad performance. Since this is hashing, I cannot control, where a particular element hashes to. So any suggestions? Also, if this isn't the right reason, any other pointers as to what might be going wrong? I have tried it from 1M to 100M keys, but the single threaded performance is always better.

I have a few suggestions. Since the run time of your insert function is not constant then you should use schedule(dynamic). Second, you should let OpenMP divide the tasks and not do it yourself (one reason, but not the main reason, is that the way you have it now N has to be a multiple of omp_p). If you want to have some control over how it divides the tasks then try changing the chunksize like this schedule(dynamic,n) where n is the chuck size.
#pragma omp parallel for schedule(dynamic)
for(size_t i=0;i<N;i++){
size_t key = keys[i];
size_t value = values[i];
if(insert(table,key,value) == 0){
printf("Failure!\n");
}
}

I would try experimenting with a strategy based on locks, like this simple snippet shows:
#include<omp.h>
#define NHASHES 4
#define NTABLE 1000000
typedef size_t (hash_f)(size_t);
int main(int argc, char** argv) {
Entry table [NTABLE ];
hash_f hashes[NHASHES];
omp_lock_t locks [NTABLE ]
/* ... */
for(size_t ii = 0; ii < N; ii++) {
keys [ii] = ii;
values [ii] = rand();
}
for(size_t ii = 0; ii < NTABLE; ii++) {
omp_init_lock(&locks[ii]);
}
#pragma omp parallel
{
#pragma omp for schedule(static)
for(int ii = 0; ii < N; ii++) {
size_t key = keys [ii];
size_t value = values[ii];
Entry entry = (((Entry)key) << 32) + value;
for ( jj = 0; jj < NHASHES; jj++ ) {
size_t location = hashes[jj]; // I assume this is the computationally demanding part
omp_set_lock(&locks[location]); // Locks the hash table location before working on it
if ( get_key(table[location]) == KEY_EMPTY ) {
table[location] = entry;
break;
}
omp_unset_lock(&locks[location]); // Unlocks the hash table location
}
// Handle failures here
}
} /* pragma omp parallel */
for(size_t ii = 0; ii < NTABLE; ii++) {
omp_destroy_lock(&locks[ii]);
}
/* ... */
return 0;
}
With a little more machinery you can handle a variable number of locks ranging from 1 (equivalent to a critical section) to NTABLE (equivalent to an atomic construct) and see if the granularity in-between provides some benefit.

Related

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>
#define NUM_THREADS 8
#define size 500
#define num_iter 10
int main (int argc, char *argv[])
{
// omp_set_num_threads(NUM_THREADS);
int *A = new int [size*size];
int *B = new int [size*size];
int *C = new int [size*size];
for (int i=0; i<size; i++)
{
for (int j=0; j<size; j++)
{
A[i*size+j] = j*1;
B[i*size+j] = i*j+2;
C[i*size+j] = 0;
}
}
double total_time = 0;
double start = 0;
for (int t=0; t<num_iter; t++)
{
start = omp_get_wtime();
int i, k;
// #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
for (int j=0; j<size; j++)
{
for (i=0; i<size; i++)
{
for (k=0; k<size; k++)
{
C[i*size+j] += A[i*size+k] * B[k*size+j];
}
}
}
total_time += omp_get_wtime() - start;
}
std::setprecision(5);
std::cout << total_time/num_iter << std::endl;
delete[] A;
delete[] B;
delete[] C;
return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to be to use a BLAS Library for this, rather than writing it yourself. (Remember, "The best code is the code I do not have to write").
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.

CUDA string search in large file, wrong result

I am working on simple naive string search in CUDA.
I am new in CUDA. It works fine fol smaller files ( aprox. ~1MB ). After I make these files bigger ( ctrl+a ctrl+c several times in notepad++ ), my program's results are higher ( about +1% ) than a
grep -o text file_name | wc -l
It is very simple function, so I don't know what could cause this. I need it to work with larger files ( ~500MB ).
Kernel code ( gpuCount is a __device__ int global variable ):
__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength){
int id = blockDim.x*blockIdx.x + threadIdx.x;
if (id < dataLength)
{
int fMatch = 1;
for (int j = 0; j < inputLength; j++)
{
if (data[id + j] != input[j]) fMatch = 0;
}
if (fMatch)
{
atomicAdd(&gpuCount, 1);
}
}
}
This is calling the kernel in main function:
int blocks = 1, threads = fileSize;
if (fileSize > 1024)
{
blocks = (fileSize / 1024) + 1;
threads = 1024;
}
clock_t cpu_start = clock();
// kernel call
stringSearchGpu<<<blocks, threads>>>(cudaBuffer, strlen(buffer), cudaInput, strlen(input));
cudaDeviceSynchronize();
After this I just copy the result to Host and print it.
Can anyone please help me with this?
First of all, you should always check return values of CUDA functions to check for errors. Best way to do so would be the following:
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
Wrap your CUDA calls, such as:
gpuErrchk(cudaDeviceSynchronize());
Second, your kernel accesses out of bounds memory. Suppose, dataLength=100, inputLength=7 and id=98. In your kernel code:
if (id < dataLength) // 98 is less than 100, so condition true
{
int fMatch = 1;
for (int j = 0; j < inputLength; j++) // j runs from [0 - 6]
{
// if j>1 then id+j>=100, which is out of bounds, illegal operation
if (data[id + j] != input[j]) fMatch = 0;
}
Change the condition to something like:
if (id < dataLength - inputLength)

MPI_Recv() invalid buffer pointer

I have a dynamically allocated array that is sent by rank 0 to other ranks using MPI_Send()
On the receiving side, a dynamic array is allocated memory using malloc()
MPI_Recv() happens on the other ranks. At this receive function, I get invalid Buffer Pointer error.
Code is conceptually similar to this:
struct graph{
int count;
int * array;
} a_graph;
int x = 10;
MPI_Status status;
//ONLY 2 RANKS ARE PRESENT. RANK 0 SENDS MSG TO RANK 1
if (rank == 0){
a_graph * my_graph = malloc(sizeof(my_graph))
my_graph->count = x;
my_graph->array = malloc(sizeof(int)*my_graph->count);
for(int i =0; i < my_graph->count; i++)
my_graph->array[i] = i;
MPI_Send(my_graph->array,my_graph->count,int,1,0,MPI_COMM_WORLD);
free(my_graph->array);
free(my_graph);
}
else if (rank == 1){
a_graph * my_graph = malloc(sizeof(my_graph))
my_graph->count = x;
my_graph->array = malloc(sizeof(int)*my_graph->count);
MPI_Recv(my_graph->array,my_graph->count,int,0,0,MPI_COMM_WORLD,&status) // MPI INVALID BUFFER POINTER ERROR HAPPENS AT THIS RECV
}
I dont understand why this happens since memory is allocated in both sender and receiver ranks
Below is a minimal, working, and verifiable (MWVE) example which Zulan suggested you to make. Please provide MWVE in your future questions. Anyway, you need to use MPI datatype MPI_INT instead of int for sending and receiving.
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
typedef struct graph{
int count;
int * array;
} a_graph;
int main()
{
MPI_Init(NULL, NULL);
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int x = 10;
MPI_Status status;
//ONLY 2 RANKS ARE PRESENT. RANK 0 SENDS MSG TO RANK 1
if (rank == 0){
a_graph * my_graph = malloc(sizeof(a_graph));
my_graph->count = x;
my_graph->array = malloc(sizeof(int)*my_graph->count);
for(int i =0; i < my_graph->count; i++)
my_graph->array[i] = i;
MPI_Send(my_graph->array,my_graph->count,MPI_INT,1,0,MPI_COMM_WORLD);
free(my_graph->array);
free(my_graph);
}
else if (rank == 1){
a_graph * my_graph = malloc(sizeof(a_graph));
my_graph->count = x;
my_graph->array = malloc(sizeof(int)*my_graph->count);
MPI_Recv(my_graph->array,my_graph->count,MPI_INT,0,0,MPI_COMM_WORLD,&status);
for (int i=0; i<my_graph->count; ++i)
{
printf("%i\n", my_graph->array[i]);
}
}
MPI_Finalize();
return 0;
}

Save state of c++11 random generator without using iostream

What is the best way to store the state of a C++11 random generator without using the iostream interface. I would like to do like the first alternative listed here[1]? However, this approach requires that the object contains the PRNG state and only the PRNG state. In partucular, it fails if the implementation uses the pimpl pattern(at least this is likely to crash the application when reloading the state instead of loading it with bad data), or there are more state variables associated with the PRNG object that does not have to do with the generated sequence.
The size of the object is implementation defined:
g++ (tdm64-1) 4.7.1 gives sizeof(std::mt19937)==2504 but
Ideone http://ideone.com/41vY5j gives 2500
I am missing member functions like
size_t state_size();
const size_t* get_state() const;
void set_state(size_t n_elems,const size_t* state_new);
(1) shall return the size of the random generator state array
(2) shall return a pointer to the state array. The pointer is managed by the PRNG.
(3) shall copy the buffer std::min(n_elems,state_size()) from the buffer pointed to by state_new
This kind of interface allows more flexible state manipulation. Or are there any PRNG:s whose state cannot be represented as an array of unsigned integers?
[1]Faster alternative than using streams to save boost random generator state
I've written a simple (-ish) test for the approach I mentioned in the comments of the OP. It's obviously not battle-tested, but the idea is represented - you should be able to take it from here.
Since the amount of bytes read is so much smaller than if one were to serialize the entire engine, the performance of the two approaches might actually be comparable. Testing this hypothesis, as well as further optimization, are left as an exercise for the reader.
#include <iostream>
#include <random>
#include <chrono>
#include <cstdint>
#include <fstream>
using namespace std;
struct rng_wrap
{
// it would also be advisable to somehow
// store what kind of RNG this is,
// so we don't deserialize an mt19937
// as a linear congruential or something,
// but this example only covers mt19937
uint64_t seed;
uint64_t invoke_count;
mt19937 rng;
typedef mt19937::result_type result_type;
rng_wrap(uint64_t _seed) :
seed(_seed),
invoke_count(0),
rng(_seed)
{}
rng_wrap(istream& in) {
in.read(reinterpret_cast<char*>(&seed), sizeof(seed));
in.read(reinterpret_cast<char*>(&invoke_count), sizeof(invoke_count));
rng = mt19937(seed);
rng.discard(invoke_count);
}
void discard(unsigned long long z) {
rng.discard(z);
invoke_count += z;
}
result_type operator()() {
++invoke_count;
return rng();
}
static constexpr result_type min() {
return mt19937::min();
}
static constexpr result_type max() {
return mt19937::max();
}
};
ostream& operator<<(ostream& out, rng_wrap& wrap)
{
out.write(reinterpret_cast<char*>(&(wrap.seed)), sizeof(wrap.seed));
out.write(reinterpret_cast<char*>(&(wrap.invoke_count)), sizeof(wrap.invoke_count));
return out;
}
istream& operator>>(istream& in, rng_wrap& wrap)
{
wrap = rng_wrap(in);
return in;
}
void test(rng_wrap& rngw, int count, bool quiet=false)
{
uniform_int_distribution<int> integers(0, 9);
uniform_real_distribution<double> doubles(0, 1);
normal_distribution<double> stdnorm(0, 1);
if (quiet) {
for (int i = 0; i < count; ++i)
integers(rngw);
for (int i = 0; i < count; ++i)
doubles(rngw);
for (int i = 0; i < count; ++i)
stdnorm(rngw);
} else {
cout << "Integers:\n";
for (int i = 0; i < count; ++i)
cout << integers(rngw) << " ";
cout << "\n\nDoubles:\n";
for (int i = 0; i < count; ++i)
cout << doubles(rngw) << " ";
cout << "\n\nNormal variates:\n";
for (int i = 0; i < count; ++i)
cout << stdnorm(rngw) << " ";
cout << "\n\n\n";
}
}
int main(int argc, char** argv)
{
rng_wrap rngw(123456790ull);
test(rngw, 10, true); // this is just so we don't start with a "fresh" rng
uint64_t seed1 = rngw.seed;
uint64_t invoke_count1 = rngw.invoke_count;
ofstream outfile("rng", ios::binary);
outfile << rngw;
outfile.close();
cout << "Test 1:\n";
test(rngw, 10); // test 1
ifstream infile("rng", ios::binary);
infile >> rngw;
infile.close();
cout << "Test 2:\n";
test(rngw, 10); // test 2 - should be identical to 1
return 0;
}

Ways to parallelize this using OpenMP?

Can anybody suggest a best way to parallelize this using openmp? The program gets aborted when I run this code.
void grayerode(int **img, int height, int width, int filterheight,
int filterwidth, int iterations, int pixrange)
{
int maxlabel=0;
int fh, fw, iters, pixval=0, i, j, s, k;
int fhlimit = filterheight/2;
int fwlimit = filterwidth/2;
int **smoothedlabels;
allocate_2D_int_matrix ( &smoothedlabels, height, width );
#pragma omp parallel for shared(smoothedlabels,height,width,k)
for (i=0; i<height; i++)
for (j=0; j<width; j++)
smoothedlabels[i][j] = img[i][j];
int *labeltemp = (int *)malloc(pixrange*sizeof(int));
for (s=0; s<pixrange; s++)
labeltemp[s] = 0;
for (iters=0; iters<iterations; iters++) {
#pragma omp parallel for private(i,j,labeltemp)
for (i=fhlimit; i<height-fhlimit; i++) {
for (j=fwlimit; j<width-fwlimit; j++) {
for (fh=-fhlimit; fh<=fhlimit; fh++)
for (fw=-fwlimit; fw<=fwlimit; fw++) {
labeltemp[img[i+fh][j+fw]]++;
}
for (s=0; s<pixrange; s++) {
if (labeltemp[s]>maxlabel) {
maxlabel = labeltemp[s];
pixval = s;
}
}
smoothedlabels[i][j]=pixval;
for (s=0; s<pixrange; s++)
labeltemp[s] = 0;
maxlabel = 0;
}
}
}
for (i=0; i<height; i++)
for (j=0; j<width; j++)
img[i][j] = smoothedlabels[i][j];
free_2D_int_matrix ( &smoothedlabels );
free(labeltemp);
return;
}
A few things:
You are not declaring private variables correctly. One example of doing it the correct way in your code:
#pragma omp parallel for private(i,j) shared(smoothedlabels, img, width, height)
for(i=0; i<height; i++)
for(j=0; j<width; j++)
smoothedlabels[i][j] = img[i][j]
It is important that j remains private or each thread will try change its value - giving you unexpected behaviour. (Note: i is actually implicitly declared private when you declare the pragma statement, but I always prefer to state it explicitly for better readability)
Try avoiding 2D arrays because they restrict your ability to parallelize. In the same example you could do the following:
#pragma omp parallel for private(i) shared(width, height, smoothedlabels, img)
for(i=0; i<width * height; i++)
smoothedlabels[i] = img[i]
This will parallelize the entire loop for you rather than just the outer loop. You can order your 1D array either column wise or row wise.
Same thing goes for the rest of the loops - just apply the same concept.
Later in your code for example, you have the following:
for (fh=-fhlimit; fh<=fhlimit; fh++)
for (fw=-fwlimit; fw<=fwlimit; fw++) {
labeltemp[img[i+fh][j+fw]]++;
}
If you do not declare fh and fw private, then you will get unexpected behaviour for the same reason not declaring j before would give you unexpected behaviour.

Resources