Imagine having n processes, each holding a matrix of 2 rows and 8 elements (stored linearly, not in 2D). I want each process to communicate its rows to all processes with lower ranks. For instance, the process with rank 2 communicates its rows to the processes with ranks 1 and 0; the process with rank 0 does not communicate its rows to any process.
I'm having issues deciding how to approach this problem. Using MPI_Bcast is a possible solution, but I can't seem to get the operation to work as expected. Below you can see a sample of the code I'm executing.
// npes is the number of processes (from MPI_Comm_size, after MPI_Init)
// The value of i below specifies the number of rows still to be received
for (i = (npes - rank - 1) * rowsPerProcess; i > 0; i--) {
    // Receive
    MPI_Bcast(temp, columns, MPI_DOUBLE, i/rowsPerProcess, MPI_COMM_WORLD);
    printf("I'm %d and I received from %d\n", rank, i/rowsPerProcess);
}
if (rank != 0) { // rank 0 does not send data
    for (row = rowsPerProcess - 1; row >= 0; row--) {
        for (j = 0; j < columns; j++) {
            // matrix_chunk is the per-process matrix of 2 rows
            temp[j] = matrix_chunk[row*columns + j];
        }
        // Send
        printf("I'm sender %d\n", rank);
        MPI_Bcast(temp, columns, MPI_DOUBLE, rank, MPI_COMM_WORLD);
    }
}
The output I receive is the following:
I'm 1 and I received from 1
I'm sender 2
I'm sender 2
I'm 0 and I received from 2
I'm 0 and I received from 1
I'm 0 and I received from 1
I'm 0 and I received from 0
I'm 1 and I received from 0
I'm sender 1
I'm sender 1
It seems that the first receiving MPI_Bcast call is executing as a send operation. I have also printed the contents of the received temp buffer, and they are not what I expect them to be.
More than trying to correct this mess, I would like to get some perspective on how to approach this particular communication problem. I feel like I'm coming at it from the wrong direction. Please let me know if you have any suggestions!
I implemented matched MPI_Send and MPI_Recv calls as suggested by High Performance Mark. The problem immediately made sense when I thought of it through this approach.
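For reference, here is a minimal sketch of that approach, reusing rank, npes, rowsPerProcess, columns, matrix_chunk, and temp from the code above (the loop ordering is chosen so sends and receives pair up; with many processes or larger rows, non-blocking calls would be safer):
// Send my rows to every lower-ranked process
for (int dest = 0; dest < rank; dest++)
    for (int row = 0; row < rowsPerProcess; row++)
        MPI_Send(&matrix_chunk[row * columns], columns, MPI_DOUBLE,
                 dest, row, MPI_COMM_WORLD);

// Receive rows from every higher-ranked process
for (int src = rank + 1; src < npes; src++)
    for (int row = 0; row < rowsPerProcess; row++) {
        MPI_Recv(temp, columns, MPI_DOUBLE, src, row,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // ... use the received row in temp here ...
    }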
I am building a lottery smart contract in which a user can buy a ticket that has 5 generated numbers from 1 to 25, without duplicate numbers.
Everything works just fine, except that 30% of the time Metamask says "transaction runs out of gas" because of this function.
The function below generates an array of 5 numbers between 1 and 25:
/// @dev Generate 5 numbers (1 <= x <= 25) without duplicate numbers
/// @param _lotteryCount used for the random number generator function
/// @return uint8[5] an array of 5 generated numbers
function generateRandomTicketNumbers(uint256 _lotteryCount) internal view returns (uint8[5] memory) {
    uint8[5] memory numbers;
    uint256 generatedNumber;

    // Execute 5 times (to generate 5 numbers)
    for (uint256 i = 0; i < 5; i++) {
        // Check duplicate
        bool readyToAdd = false;
        uint256 maxRetry = 5;
        uint256 retry = 0;

        // Generate a new number while it is a duplicate, up to 5 times (to prevent errors and infinite loops)
        while (!readyToAdd && retry <= maxRetry) {
            generatedNumber = (uint256(keccak256(abi.encodePacked(msg.sender, block.timestamp, i, retry, _lotteryCount))) % 25).add(1);
            bool isDuplicate = false;

            // Look through the already generated numbers to see if the new number is already there
            for (uint256 j = 0; j < numbers.length; j++) {
                if (numbers[j] == generatedNumber) {
                    isDuplicate = true;
                    break;
                }
            }
            readyToAdd = !isDuplicate;
            retry++;
        }

        // Throw if we hit the maximum retry: generated a duplicate 5 times in a row
        require(retry < maxRetry, 'Error generating random ticket numbers. Max retry.');
        numbers[i] = uint8(generatedNumber);
    }

    return numbers;
}
I'm not sure I understand how Metamask estimates the gas of a transaction, but I guess it runs the transaction locally, sees how much gas it used, and uses that amount for the real transaction.
If this is correct, that could explain why it fails 30% of the time. Sometimes this function needs to retry multiple times to generate a number in the while loop, and sometimes the while loop only executes once.
I could force the function to run every for and while loop the maximum number of times, but I believe there is a better solution than wasting gas.
Do you have any idea how I can fix this?
Thanks!
First, use Chainlink as a source of randomness (https://docs.chain.link/docs/chainlink-vrf/) and follow the best practices section (https://docs.chain.link/docs/chainlink-vrf-best-practices/). Then, when you get the random number, you can do as they recommend in the docs:
function expand(uint256 randomValue, uint256 n) public pure returns (uint256[] memory expandedValues) {
    expandedValues = new uint256[](n);
    for (uint256 i = 0; i < n; i++) {
        expandedValues[i] = uint256(keccak256(abi.encode(randomValue, i)));
    }
    return expandedValues;
}
I have the following homework:
We have N jobs whose durations are t1, t2, ..., tN and whose deadlines are d1, d2, ..., dN. If a job is not finished by its deadline, the corresponding penalty b1, b2, ..., bN is incurred. In what order should the jobs be done so that the total penalty is minimal?
I've written the code below and it works, but I want to improve it by skipping unnecessary permutations. For example, I know that the jobs in the order:
1 2 3 4 5 - will give me 100 points of penalty, and if I change the order to, say:
2 1 ..... - it instantly gives me 120 penalty, and from this moment I know I don't have to check all the remaining permutations that start with 2 1; I need to skip them somehow.
Here's the code:
int finalPenalty = -1;
bool z = true;
while (next_permutation(jobs.begin(), jobs.end(), compare) || z)
{
    int time = 0;
    int penalty = 0;
    z = false;
    for (int i = 0; i < verseNumber; i++)
    {
        if (penalty > finalPenalty && finalPenalty >= 0)
            break;
        time += jobs[i].duration;
        if (time > jobs[i].deadline)
            penalty += jobs[i].penalty;
    }
    if (finalPenalty < 0 || penalty < finalPenalty)
    {
        sortedJobs = jobs;
        finalPenalty = penalty;
    }
    if (finalPenalty == 0)
        break;
}
I think I should do this somewhere here:
if (penalty > finalPenalty && finalPenalty >= 0)
    break;
But I'm not sure how to do it. This skips one permutation if the penalty is already higher, but it doesn't skip the whole group and it still calls next_permutation. Any ideas?
EDIT:
I'm using a vector, and my job structure looks like this:
struct job
{
    int ID;
    int duration;
    int deadline;
    int penalty;
};
The ID is assigned automatically when reading from the file, and the rest is read from the file (for example: ID = 1, duration = 5, deadline = 10, penalty = 10).
If you are planning to use the next_permutation function provided by the STL, there is not much you can do.
Say the last k digits are redundant to check. If you use next_permutation, a simple yet inefficient strategy is to call next_permutation k! times (i.e. the number of permutations of those last k elements) and simply not compute their penalties, as you know they will be higher. (k! assumes there are no repetitions; with repetitions you would need extra bookkeeping to compute that count.) This would cost you O(k!·n) operations in the worst case, as next_permutation has linear time complexity.
Let's consider how we can improve this. A sound strategy is, once an inefficient setting is found and before calling next_permutation again, to order those k digits in descending order so that the next call effectively skips the whole block of permutations that need not be checked. Consider the following example.
Say our method found that 1 2 3 4 5 has a penalty of 100. Then, while computing 2 1 3 4 5 at the next step, if our method finds that the penalty already exceeds 100 after processing only 2 1, it can sort 3 4 5 in descending order (using sort along with your custom comparison mechanism) and skip the rest of the loop, arriving at another next_permutation call, which gives 2 1 4 3 5 as the next sequence to continue with.
Let's consider how much skipping costs. This method requires sorting those k digits and calling next_permutation, which has an overall time complexity of O(k log k + n). This is a huge improvement over the previous method's O(k!·n).
See below for a crude implementation of the method I propose as an improvement over your existing code. I had to use auto as you did not provide the exact type for jobs. I also sorted and then reversed those k digits, as you did not provide your comparison function and I wanted to emphasize that I am reversing the ascending order.
int finalPenalty = -1;
bool z = true;
while (next_permutation(jobs.begin(), jobs.end(), compare) || z)
{
    int time = 0;
    int penalty = 0;
    z = false;
    auto it = jobs.begin();
    for (int i = 0; i < verseNumber; i++)
    {
        time += jobs[i].duration;
        if (time > jobs[i].deadline)
        {
            penalty += jobs[i].penalty;
            if (finalPenalty >= 0 && penalty > finalPenalty)
            {
                it++; // only the remaining jobs need to be sorted in reverse
                sort(it, jobs.end(), compare);
                reverse(it, jobs.end());
                break;
            }
        }
        it++;
    }
    if (finalPenalty < 0 || penalty < finalPenalty)
    {
        sortedJobs = jobs;
        finalPenalty = penalty;
    }
    if (finalPenalty == 0)
        break;
}
I have an array of integers that I generate randomly, for example between 0 and 9.
I have a set of constraints that I need to apply to this generation.
For example:
1) Constraint 1 -> the numbers at even positions in the array should be 0 or 1.
2) Constraint 2 -> among the even positions, the array should contain at least one 0 and at least one 1.
etc ...
What I do for now:
I generate the array of numbers randomly.
Then for each even position, I randomly pick between 0 and 1.
Then I check that among the even positions I have at least one 0 and at least one 1. If not, I regenerate all the values with the constraint above (for each even position, randomly pick between 0 and 1) until I get something that works (in a do-while loop).
However, this only works because these are very simple constraints.
3) Constraint 3 -> the sum of the differences between consecutive numbers should be greater than a certain value.
etc ...
The issue is that the constraints are interdependent, and I don't want to nest one do-while inside another (and so on) every time I add a constraint.
What would be the proper way to achieve this as cleanly as possible?
Edit:
I realized I was not clear at all... My apologies.
I edited the constraints to make them easier to understand.
My code looks like this (typed in Notepad++, there might be mistakes):
std::vector<int> myVector;
int N = 100; // vector contains 100 values

do {
    myVector.clear();
    for (int i = 0; i < N; i++) {
        if (i % 2 == 0) {
            myVector.push_back(rand() % 2);
        }
        else {
            myVector.push_back(rand() % 10);
        }
    }
} while (!doesContains0and1(myVector));
bool MyClass::doesContains0and1(std::vector<int> avector)
{
    // Check that the even positions contain at least one 0 and at least one 1
    bool has0 = false;
    bool has1 = false;
    for (size_t i = 0; i < avector.size(); i += 2) {
        if (avector[i] == 0) has0 = true;
        if (avector[i] == 1) has1 = true;
    }
    return has0 && has1;
}
Constraint 3) means that if I have, for example:
0 5 1 7 0 9 0 1 1
then there is a constraint that abs(5-0) + abs(1-5) + abs(7-1) + ... etc > a certain value.
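For concreteness, a check for this third constraint could look something like this (the function name and threshold are just illustrative):
#include <cstdlib>
#include <vector>

// Returns true if the sum of absolute differences between consecutive
// values exceeds the given threshold (illustrative helper)
bool satisfiesDifferenceSum(const std::vector<int>& values, int threshold)
{
    int sum = 0;
    for (size_t i = 1; i < values.size(); ++i)
        sum += std::abs(values[i] - values[i - 1]);
    return sum > threshold;
}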
These constraints are examples, I am more looking for some methodology than pure code :)
Thanks for your help !
Let me take a stab at this...
int sum = 0;
for (int i = 0; i < array_size; i++) {
    if (i % 10 == 0) sum = 0;              // reset the count at the start of each block of 10
    if (i % 2 == 0) {
        // even position: must be 0 or 1 (constraint 1)
        sum += array[i] = rand() % 2;
        // last even position of the block: if they were all 0 or all 1, flip it (constraint 2)
        if (i % 10 == 8 && (sum == 0 || sum == 5))
            array[i] = sum ? 0 : 1;
    } else {
        // odd position: any digit 0..9
        array[i] = rand() % 10;
    }
}
This follows the first two constraints. The sum counts the number of 1's among the even positions in each block of 10 elements: all 0's yield a sum of 0 and all 1's yield a sum of 5, in which case the last even element of the block is flipped.
I'm not sure how to interpret the 3rd constraint, but hopefully this gets you started.
Some questions about CUDA.
1) I noticed that, in every sample code, operations which are not parallel (e.g., the computation of a scalar), performed inside __global__ functions, are always assigned to a specific thread. For example, in this simple code for a dot product, thread 0 performs the summation:
__global__ void dot( int *a, int *b, int *c )
{
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}
This is fine with me; however, in a code I wrote I did not specify the thread for the non-parallel operation, and it still works: hence, is it compulsory to specify the thread? In particular, the non-parallel operation I want to perform is the following:
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
The various variables were passed as arguments of the global function. And here comes my second question.
2) I computed the value of V[0] both with a CUDA program and with a serial program on the CPU, obtaining different results. Obviously I thought the problem in CUDA might be that I did not specify the thread, but even with that, the result does not change, and it is still (much) greater than the serial one: 6.71201e+22 vs -2908.05. Where could the problem be? The other calculations performed in the global function are the following:
int tid = threadIdx.x;
if ( tid != 0 && tid < N )
{
    {Various stuff which does not involve V or the variables used to compute V[0]}
    V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}
As you can see, my condition excludes the case tid == 0.
3) Finally, a last question: usually in the sample codes I noticed that, if you want to use on the CPU values allocated and computed in GPU memory, you should copy those values to the CPU (e.g., with cudaMemcpy, specifying cudaMemcpyDeviceToHost). But I manage to use those values directly in the main (CPU) code without any problem. Could this be a clue that there is something wrong with my GPU (or my CUDA installation), which also causes the previous odd behaviour?
Thank you for your help.
== Added on the 5th January ==
Sorry for my late reply. Before invoking the kernel there are all the memory allocations of the arrays to compute (which are quite a lot). In particular, the code for the array involved in my question is:
float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );
At the end of the code I wrote:
float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaFree(V);
cout << V_[0] << endl;
Thank you again for your attention.
if you don't have any cudaMemcpy in your code, that's exactly the problem. ;-)
The GPU is accessing its own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard.
You need to allocate and copy alpha, kappa, One_g and all other arrays to your GPU first, using cudaMemcpy, then run your kernel and after that copy your results back to the CPU.
Also, don't forget to allocate the memory on BOTH sides.
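A rough sketch of that overall pattern, assuming N floats per array (the h_-prefixed host arrays and the kernel name are placeholders):
float *d_alpha, *d_V;                       // device pointers
cudaMalloc((void**)&d_alpha, N * sizeof(float));
cudaMalloc((void**)&d_V,     N * sizeof(float));

// host -> device: copy the input data before launching the kernel
cudaMemcpy(d_alpha, h_alpha, N * sizeof(float), cudaMemcpyHostToDevice);

myKernel<<<blocks, threads>>>(d_alpha, d_V /*, ... other arrays ... */);

// device -> host: copy the results back before using them on the CPU
cudaMemcpy(h_V, d_V, N * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_alpha);
cudaFree(d_V);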
As for the non-parallel stuff: If the result is always the same, all threads will write the same thing, so the result is exactly the same, just quite a bit more inefficient, since all of them try to access the same resources.
Is that the exact code you're using?
In regards to question 1, you should have a __syncthreads() after the assignment to your shared memory, temp.
Otherwise you'll get a race condition where thread 0 can start the summation prior to temp being fully populated.
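With the barrier added, the kernel from your question would look something like this:
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();   // wait until every thread has written its product

    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}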
As for your other question about specifying the thread, if you have
if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}
then every thread will execute that code; for example, if you have X threads executing and epsilon is 1 for all of them, then all X threads will evaluate the same line:
V[0] = B*(Exp - 1 - b);
and hence you'll have another race condition, as all X threads will be writing to V[0]. If all the threads have the same value for B*(Exp - 1 - b), you might not notice a difference, while if they have different values you're liable to get different results each time, depending on the order in which the threads arrive.
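One common way to avoid that, in line with the sample codes from your first question, is to let a single thread perform the scalar assignment:
// Let only thread 0 of the block perform the scalar assignment
if (threadIdx.x == 0)
{
    if (epsilon == 1)
        V[0] = B*(Exp - 1 - b);
    else
        V[0] = B*(Exp - 1 + a);
}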
I have a CUDA program that calls the kernel repeatedly within a for loop. The code computes all rows of a matrix by using the values computed in the previous one until the entire matrix is done. This is basically a dynamic programming algorithm. The code below fills the (i,j) entry of many separate matrices in parallel with the kernel.
for (i = 1; i <= xdim; i++) {
    for (j = 1; j <= ydim; j++) {
        start3time = clock();
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        end3time = clock();
        diff = static_cast<double>(end3time-start3time)/(CLOCKS_PER_SEC / 1000);
        printf("Time for i=%d j=%d is %f\n", i, j, diff);
    }
}
The kernel assign5 is straightforward:
__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    char ch = database[j + id];
    Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id])*dev_matrix[i][index[ch - 'A']];
}
My problem is that when I run this program, the time for each i and j is 0 most of the time, but sometimes it is 10 milliseconds. So the output looks like:
Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.
I don't understand why this is happening. I don't see a thread race condition. If I add
if(i % 20 == 0) cudaThreadSynchronize();
right after the first loop, then the time for i and j is mostly 0. But then the time for the sync is sometimes 10 or even 20 milliseconds. It seems like CUDA is performing many operations at low cost and then charges a lot for later ones. Any help would be appreciated.
I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel call is non-blocking and is only added to the device's queue. If you measure time before and after your kernel call, the difference has nothing to do with how long the kernel takes to execute; it only measures how long it takes to add the kernel call to the queue.
You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns only once all kernels in the queue have finished their work.
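Applied to your timing loop, that looks roughly like this (on newer CUDA versions, cudaDeviceSynchronize() is the preferred call):
start3time = clock();
assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
cudaThreadSynchronize();   // wait for the kernel to actually finish before stopping the clock
end3time = clock();
diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);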
This is why
if(i % 20 == 0) cudaThreadSynchronize();
made spikes in your measurements: at that point the synchronization pays for all the kernel launches queued since the previous one.