Creating a parallel arithmetic operator in an infinite loop in C compiler on Mac Terminal - shell

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
void main (void)
{
while(1) // infinite loop
{
int a, b;
printf("Give me an integer a: ");
scanf("%d",&a);
printf("Give me an integer b: ");
scanf("%d",&b);
sum = &a + &b;
product = &a * &b;
difference = &a - &b;
echo Here is your sum, product, and difference:
printf("Sum: %d + d% = %d.\n", a, b, sum);
printf("Product: %d * d% = %d.\n", a, b, product);
printf("Difference: %d - d% = %d.\n", a, b, difference);
return 0;
}
}
I keep getting a syntax error at my void main (void) on line 6.
I want to use child processes to create 3 parallel processes.

This program has a bunch of problems, and it's not doing anything in parallel:
void main (void)
This will usually compile, but according to the C standard it's not correct. The signature of main() that takes no arguments should be int main(void).
sum = &a + &b;
product = &a * &b;
difference = &a - &b;
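You're using the addresses of a and b instead of their values. Adding or multiplying two pointers is not valid C, so the first two lines will not compile, and the third would give a pointer difference rather than a - b. On top of that, sum, product, and difference are never declared. The correct version (after declaring the three variables as int) should be: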
sum = a + b;
product = a * b;
difference = a - b;
echo Here is your sum, product, and difference:
This isn't a C construct, so the compiler will complain about this. Use printf().
return 0;
Although you've indicated that the loop should be infinite, it won't be, because you return from the function at the end of the first iteration.
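For the stated goal of three parallel child processes, a minimal sketch (written as C++; the POSIX calls are the same in C) might fork one child per operation and pass each result back through a pipe. The helper name compute_in_children and the use of pipes are my assumptions, not something from the question, and the code assumes a POSIX system such as macOS or Linux. Separately, note that the printf format strings in the question contain d% where %d was presumably intended.

```cpp
#include <array>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: computes a+b, a*b and a-b in three child processes
// and stores the results in out[0], out[1], out[2] respectively.
// Returns false if a pipe, fork or read fails (error paths are kept short
// for the sketch and do not clean up every descriptor).
bool compute_in_children(int a, int b, std::array<int, 3>& out)
{
    int fds[3][2];
    pid_t pids[3];

    for (int k = 0; k < 3; ++k) {
        if (pipe(fds[k]) == -1) return false;
        pids[k] = fork();
        if (pids[k] == -1) return false;

        if (pids[k] == 0) {
            // Child: do one operation, write the result, and exit.
            close(fds[k][0]);
            const int r = (k == 0) ? a + b : (k == 1) ? a * b : a - b;
            const ssize_t n = write(fds[k][1], &r, sizeof r);
            close(fds[k][1]);
            _exit(n == static_cast<ssize_t>(sizeof r) ? 0 : 1);
        }

        // Parent: keep only the read end of this child's pipe.
        close(fds[k][1]);
    }

    bool ok = true;
    for (int k = 0; k < 3; ++k) {
        int r = 0;
        if (read(fds[k][0], &r, sizeof r) != static_cast<ssize_t>(sizeof r)) {
            ok = false;
        }
        close(fds[k][0]);
        waitpid(pids[k], nullptr, 0);  // reap the child
        out[k] = r;
    }
    return ok;
}
```

Calling compute_in_children(3, 4, out) should leave 7, 12 and -1 in out. Wrapping the call in a loop that reads a and b each time, without the return inside the loop, gives the repeated behaviour the question asks for.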

Related

OpenMP and (Rcpp)Eigen

I am wondering how to write code that at times makes use of the OpenMP parallelization built into the Eigen library and at other times uses parallelization that I specify myself. Hopefully, the code snippet below provides some background on my problem.
I am asking this question at the design stage of my library (sorry I don't have a working / broken code example).
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
void fxn(..., int ncores=-1){
if (ncores > 0) omp_set_num_threads(ncores);
/*
* Code with matrix products
* where I would like to use Eigen's
* OpenMP parallelization
*/
#pragma omp parallel for
for (int i=0; i < iter; i++){
/*
* Code I would like to parallelize "myself"
* even though it involves matrix products
*/
}
}
What is the best practice for controlling the balance between Eigen's own OpenMP parallelization and my own?
UPDATE:
I wrote a simple example and tested ggael's suggestion. In short, I am skeptical that it solves the problem I was posing (or I am doing something else wrong - apologies if it's the latter). Notice that with explicit parallelization of the for loop there is no change in run-time (not even a slow
#ifdef _OPENMP
#include <omp.h>
#endif
#include <RcppEigen.h>
using namespace Rcpp;
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::export]]
Eigen::MatrixXd testing(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
Eigen::setNbThreads(n_threads);
Eigen::MatrixXd C = A*B;
Eigen::setNbThreads(1);
for (int i=0; i < A.cols(); i++){
A.col(i).array() = A.col(i).array()*B.col(i).array();
}
return A;
}
// [[Rcpp::export]]
Eigen::MatrixXd testing_omp(Eigen::MatrixXd A, Eigen::MatrixXd B, int n_threads=1){
Eigen::setNbThreads(n_threads);
Eigen::MatrixXd C = A*B;
Eigen::setNbThreads(1);
#pragma omp parallel for num_threads(n_threads)
for (int i=0; i < A.cols(); i++){
A.col(i).array() = A.col(i).array()*B.col(i).array();
}
return A;
}
/*** R
A <- matrix(rnorm(1000*1000), 1000, 1000)
B <- matrix(rnorm(1000*1000), 1000, 1000)
microbenchmark::microbenchmark(testing(A,B, n_threads=1),
testing_omp(A,B, n_threads=1),
testing(A,B, n_threads=8),
testing_omp(A,B, n_threads=8),
times=10)
*/
Unit: milliseconds
                             expr       min        lq      mean    median        uq       max neval cld
     testing(A, B, n_threads = 1) 169.74272 183.94500 212.83868 218.15756 236.97049 264.52183    10   b
 testing_omp(A, B, n_threads = 1) 166.53132 178.48162 210.54195 227.65258 234.16727 238.03961    10   b
     testing(A, B, n_threads = 8)  56.03258  61.16001  65.15763  62.67563  67.37089  83.43565    10  a
 testing_omp(A, B, n_threads = 8)  54.18672  57.78558  73.70466  65.36586  67.24229 167.90310    10  a
The easiest is probably to disable/enable Eigen's multi-threading at runtime:
Eigen::setNbThreads(1); // single thread mode
#pragma omp parallel for
for (int i=0; i < iter; i++){
// Code I would like to parallelize "myself"
// even though it involves matrix products
}
Eigen::setNbThreads(0); // restore default

range-based for loop over references

This question is out of curiosity, not necessity. One way I have found C++11's range-based for loop useful is for iterating over discrete objects:
#include <iostream>
#include <functional>
int main()
{
int a = 1;
int b = 2;
int c = 3;
// handy:
for (const int& n : {a, b, c}) {
std::cout << n << '\n';
}
I would like to be able to use the same loop style to modify non-const references too, but I believe it is not allowed by the standard (see Why are arrays of references illegal?):
// would be handy but not allowed:
// for (int& n : {a, b, c}) {
// n = 0;
// }
I thought of two workarounds but these seem like they could incur some minor additional cost and they just don't look as clean:
// meh:
for (int* n : {&a, &b, &c}) {
*n = 0;
}
// meh:
using intRef = std::reference_wrapper<int>;
for (int& n : {intRef (a), intRef (b), intRef (c)}) {
n = 0;
}
}
So the question is, is there a cleaner or better way? There may be no answer to this, but I'm always impressed with the clever ideas people have on Stack Overflow, so I thought I would ask.
Picking up @Sombrero Chicken's idea, here is an approach with less typing:
#include <array>
#include <functional>
#include <tuple>
#include <type_traits>
template <class ...Args> constexpr auto ref(Args&&... args)
{
using First = std::tuple_element_t<0, std::tuple<Args...>>;
using WrappedFirst = std::reference_wrapper<std::remove_reference_t<First>>;
constexpr auto n = sizeof...(Args);
return std::array<WrappedFirst, n>{std::ref(args)...};
}
which can be used via
for (int& n : ref(a, b, c))
n = 0;
Instead of constructing a reference_wrapper yourself you could use std::ref, that's as far as you can get:
using std::ref;
for (int& n : {ref(a), ref(b), ref(c)}) {
n = 0;
}
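As a self-contained check of the std::ref approach, here is a small sketch. The function name zero_all is only an illustration and does not come from either answer. The loop works because each std::reference_wrapper<int> in the braced list converts implicitly to int&.

```cpp
#include <functional>
#include <initializer_list>

// Hypothetical helper: sets three separate int variables to zero by
// iterating over a braced list of std::reference_wrapper<int>.
void zero_all(int& a, int& b, int& c)
{
    for (int& n : {std::ref(a), std::ref(b), std::ref(c)}) {
        n = 0;  // writes through the wrapper to the original variable
    }
}
```

After int a = 1, b = 2, c = 3; zero_all(a, b, c); all three variables hold 0.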

Why does this OpenMP code work on dynamic scheduling but not on static?

I'm learning OpenMP by building a simple program to calculate pi using the following algorithm:
pi = 4/1 - 4/3 + 4/5 - 4/7 + 4/9...
The problem is that it does not work correctly when I change the scheduling to static. It works perfectly when the thread count is one. It also runs correctly under dynamic scheduling despite the result differing slightly every time it's run. Any idea what could be the problem?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 100
#define CSIZE 1
#define nthread 2
int pi()
{
int i, chunk;
float pi = 0, x = 1;
chunk = CSIZE;
omp_set_num_threads(nthread);
#pragma omp parallel shared(i, x, chunk)
{
if (omp_get_num_threads() == 0)
{
printf("Number of threads = %d\n", omp_get_num_threads());
}
printf("Thread %d starting...\n", omp_get_thread_num());
#pragma omp for schedule(dynamic, chunk)
for (i = 1; i <= N; i++)
{
if (i % 2 == 0)
pi = pi - 4/x;
else
pi = pi + 4/x;
x = x + 2;
printf("Pi is currently %f at iteration %d with x = %0.0f on thread %d\n",
pi, i, x, omp_get_thread_num());
}
}
return EXIT_SUCCESS;
}
Using printf in the loop when I test your code makes dynamic do all the work on the first thread and none on the second (making the program effectively serial). If you remove the printf statement then you will find that the value of pi is random. This is because you have race conditions in x and pi.
Instead of using x you can divide by 2*i+1 (for i starting at zero). Also instead of using a branch to get the sign you can use sign = -2*(i%2)+1. To get pi you need to do a reduction using #pragma omp for schedule(static) reduction(+:pi).
#include <stdio.h>
#define N 10000
int main() {
float pi;
int i;
pi = 0;
#pragma omp parallel for schedule(static) reduction(+:pi)
for (i = 0; i < N; i++) {
pi += (-2.0f*(i&1)+1)/(2*i+1);
}
pi*=4.0f;
printf("%f\n", pi);
}
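The same reduction can be wrapped in a function that takes the number of terms. This is only a sketch under my own assumptions: the name leibniz_pi is hypothetical, and I switched to double and an explicit sign to limit rounding error, which the answer does not do. Each iteration depends only on i, so there is no shared x to race on, and the reduction clause gives every thread a private partial sum. Without OpenMP enabled the pragma is ignored and the loop runs serially with the same result.

```cpp
// Hypothetical helper: approximates pi with the first `terms` terms of
// 4/1 - 4/3 + 4/5 - 4/7 + ...
double leibniz_pi(int terms)
{
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < terms; ++i) {
        const double sign = (i % 2 == 0) ? 1.0 : -1.0;  // +, -, +, -, ...
        sum += sign / (2.0 * i + 1.0);                  // 1, 1/3, 1/5, ...
    }
    return 4.0 * sum;
}
```

The series converges slowly (the error shrinks roughly like 1/terms), so a call such as leibniz_pi(1000000) is needed for about five correct decimal places.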

MPI hangs during execution

I'm trying to write a simple program with MPI that finds all numbers less than 514 that are equal to a power of the sum of their digits (for example, 512 = (5+1+2)^3). The problem I have is with the main loop: it works just fine for a few iterations (c=10), but when I try to increase the number of iterations (c=x), mpiexec.exe just hangs, seemingly in the middle of a printf call.
I'm pretty sure that deadlocks are to blame, but I couldn't find any.
The source code:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include "mpi.h"
int main(int argc, char* argv[])
{
//our number
int x=514;
//amount of iterations
int c = 10;
//tags for message identification
int tag = 42;
int tagnumber = 43;
int np, me, y1, y2;
MPI_Status status;
/* Initialize MPI */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
/* Check that we run on more than two processors */
if (np < 2)
{
printf("You have to use at least 2 processes to run this program\n");
MPI_Finalize();
exit(0);
}
//begin iterations
while(c>0)
{
//if main thread, then send messages to all created threads
if (me == 0)
{
printf("Amount of threads: %d\n", np);
int b = 1;
while(b<np)
{
int q = x-b;
//sends a number to a secondary thread
MPI_Send(&q, 1, MPI_INT, b, tagnumber, MPI_COMM_WORLD);
printf("Process %d sending to process %d, value: %d\n", me, b, q);
//get a number from secondary thread
MPI_Recv(&y2, 1, MPI_INT, b, tag, MPI_COMM_WORLD, &status);
printf ("Process %d received value %d\n", me, y2);
//compare it with the sent one
if (q==y2)
{
//if they're equal, then print the result
printf("\nValue found: %d\n", q);
}
b++;
}
x = x-b+1;
b = 1;
}
else
{
//if not a main thread, then process the message sent and send the result back.
MPI_Recv (&y1, 1, MPI_INT, 0, tagnumber, MPI_COMM_WORLD, &status);
int sum = 0;
int y2 = y1;
while (y1!=0)
{
//find the number's sum of digits
sum += y1%10;
y1 /= 10;
}
int sum2 = sum;
while(sum2<y2)
{
//calculate the exponentiation
sum2 = sum2*sum;
}
MPI_Send (&sum2, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
c--;
}
MPI_Finalize();
exit(0);
}
And I run the compiled exe-file as "mpiexec.exe -n 4 lab2.exe". I use HPC Pack 2008 SDK, if that's of any use to you guys.
Is there any way to fix it? Or maybe some way to debug that situation properly?
Thanks a lot in advance!
Not sure whether you have already found the problem, but your infinite run happens in this loop:
while(sum2<y2)
{
//calculate the exponentiation
sum2 = sum2*sum;
}
You can confirm this by setting c to about 300 or above and adding a printf call inside this while loop. I haven't completely pinpointed the logic error, but I have marked with comments three places in your code that look strange to me:
while(c>0)
{
if (me == 0)
{
...
while(b<np)
{
int q = x-b; //<-- you subtract b from x here
...
b++;
}
x = x-b+1; //<-- you subtract b again. sure this is what you want?
b = 1; //<-- this is useless
}
Hope this helps.
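One input that, as far as I can tell, makes that loop spin forever is any number above 1 whose digit sum is 1, such as 100 or 10: sum2 starts at 1 and sum2*sum stays 1, so sum2 < y2 never becomes false. Since x counts down from 514, a long enough run eventually sends 100 to a worker, and the master then blocks in MPI_Recv. Below is a sketch of the worker's check with that case guarded. The function name is hypothetical and the code is plain C++ without MPI, so it can be tried on its own.

```cpp
// Hypothetical helper: returns true if y equals some positive integer
// power of the sum of its decimal digits (for example 512 = 8^3).
bool is_power_of_digit_sum(int y)
{
    if (y <= 0) return false;  // non-positive inputs are rejected outright

    int sum = 0;
    for (int t = y; t != 0; t /= 10) {
        sum += t % 10;  // accumulate the digits
    }

    // Every power of 1 is 1, so only y == 1 can match. Without this guard
    // the loop below would never terminate for 10, 100, ...
    if (sum == 1) return y == 1;

    long long p = sum;
    while (p < y) {
        p *= sum;  // sum >= 2 here, so p grows and the loop terminates
    }
    return p == y;
}
```

In the MPI program the worker could send back this boolean (or the reached power) instead of running the unguarded loop.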

CUDA Thrust and sort_by_key

I’m looking for a sorting algorithm on CUDA that can sort an array A of elements (double) and returns an array of keys B for that array A.
I know the sort_by_key function in the Thrust library but I want my array of elements A to remain unchanged.
What can I do?
My code is:
void sortCUDA(double V[], int P[], int N) {
double *Vcpy = (double*) malloc(N*sizeof(double));
memcpy(Vcpy,V,N*sizeof(double));
// sort the copy so that V itself stays unchanged
thrust::sort_by_key(Vcpy, Vcpy + N, P);
free(Vcpy);
}
I'm comparing the Thrust algorithm against others that I have running sequentially on the CPU:
N mergesort sortCUDA
113 0.000008 0.000010
226 0.000018 0.000016
452 0.000036 0.000020
905 0.000061 0.000034
1810 0.000135 0.000071
3621 0.000297 0.000156
7242 0.000917 0.000338
14484 0.001421 0.000853
28968 0.003069 0.001931
57937 0.006666 0.003939
115874 0.014435 0.008025
231749 0.031059 0.016718
463499 0.067407 0.039848
926999 0.148170 0.118003
1853998 0.329005 0.260837
3707996 0.731768 0.544357
7415992 1.638445 1.073755
14831984 3.668039 2.150179
115035495 39.276560 19.812200
230070990 87.750377 39.762915
460141980 200.940501 74.605219
Thrust performance is not bad, but I think that with OMP I could probably get a better CPU time quite easily.
I think this is because of the memcpy.
SOLUTION:
void thrustSort(double V[], int P[], int N)
{
thrust::device_vector<int> d_P(N);
thrust::device_vector<double> d_V(V, V + N);
thrust::sequence(d_P.begin(), d_P.end());
thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());
thrust::copy(d_P.begin(),d_P.end(),P);
}
where V holds the double values to sort and P receives the sorted indices.
You can modify the comparison operator to sort the keys instead of the values. @Robert Crovella correctly pointed out that a raw device pointer cannot be assigned from the host. The modified algorithm is below:
struct cmp : public binary_function<int,int,bool>
{
cmp(const double *ptr) : rawA(ptr) { }
__host__ __device__ bool operator()(const int i, const int j) const
{return rawA[i] > rawA[j];}
const double *rawA; // an array in global mem
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
double *rawA = thrust::raw_pointer_cast(devA.data());
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp(rawA));
// B now contains the sorted keys
}
And here is an alternative with ArrayFire, though I am not sure which one is more efficient, since the ArrayFire solution uses two additional arrays:
void sortkeys(double *A, int n) {
af::array devA(n, A, af::afHost);
af::array vals, indices;
// sort and populate vals/indices arrays
af::sort(vals, indices, devA);
std::cout << devA << "\n" << indices << "\n";
}
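The index-sort idea behind the comparator version (fill an index array with 0, 1, 2, ... and sort the indices by the values they point to) can be tried on the CPU with the standard library alone. This is a host-only sketch, not Thrust code. The function name argsort is hypothetical, and it sorts in ascending order, whereas the cmp functor above uses > and therefore sorts in descending order.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices that would sort `values` in
// ascending order. `values` itself is left unchanged.
std::vector<int> argsort(const std::vector<double>& values)
{
    std::vector<int> idx(values.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&values](int i, int j) { return values[i] < values[j]; });
    return idx;
}
```

For the input {0.7, 0.3, 0.4, 0.2} this returns {3, 1, 2, 0}. The comparator does an indirect lookup on every comparison, which is the same trade-off the GPU version makes in exchange for not copying the values.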
How large is this array? The most efficient way, in terms of speed, will likely be to just duplicate the original array before sorting, if the memory is available.
Building on the answer provided by @asm (I wasn't able to get it working), this code seemed to work for me, and it sorts only the keys. However, I believe it is limited to the case where the keys are in the sequence 0, 1, 2, 3, 4 ... corresponding to the (double) values. Since this is an "index-value" sort, it could be extended to the case of an arbitrary sequence of keys, perhaps by doing an indexed copy. However, I'm not sure that generating the index sequence and then rearranging the original keys will be any faster than just copying the original value data to a new vector (for the case of arbitrary keys).
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
using namespace std;
__device__ double *rawA; // an array in global mem
struct cmp : public binary_function<int, int, bool>
{
__host__ __device__ bool operator()(const int i, const int j) const
{return ( rawA[i] < rawA[j]);}
};
void sortkeys(double *A, int n) {
// move data to the gpu
thrust::device_vector<double> devA(A, A + n);
// rawA = thrust::raw_pointer_cast(&(devA[0]));
double *test = raw_pointer_cast(devA.data());
cudaMemcpyToSymbol(rawA, &test, sizeof(double *));
thrust::device_vector<int> B(n);
// initialize keys
thrust::sequence(B.begin(), B.end());
thrust::sort(B.begin(), B.end(), cmp());
// B now contains the sorted keys
thrust::host_vector<int> hostB = B;
for (int i=0; i<hostB.size(); i++)
std::cout << hostB[i] << " ";
std::cout<<std::endl;
for (int i=0; i<hostB.size(); i++)
std::cout << A[hostB[i]] << " ";
std::cout<<std::endl;
}
int main(){
double C[] = {0.7, 0.3, 0.4, 0.2, 0.6, 1.2, -0.5, 0.5, 0.0, 10.0};
sortkeys(C, 10); // C has 10 elements
std::cout << std::endl;
return 0;
}
