Best way to achieve CUDA Vector Diagonalization


What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this?
So far, I start off with an m x n matrix, the result of a previous matrix computation in which each element is calculated using the function y = f(g(x)).
This gives me a matrix with n column elements [f1, f2...fn] where each fn represents a column vector of height m.
From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) w.r.t. g(x) results in a square matrix with elements f'(x). Under constraint, this square matrix reduces to a Jacobian with the elements of each row along the diagonal of the square matrix, and equal to fn', all other elements equaling zero.
This is why it is necessary to construct the diagonal for each of the column vectors fn.
To do this, I take a target vector defined as A(hA x 1) which was extracted from the larger A(m x n) matrix. I then prepared a zeroed matrix defined as C(hA x hA) which will be used to hold the diagonals.
The aim is to diagonalize the vector A into a square matrix with each element of A sitting on the diagonal of C, everything else being zero.
There are probably more efficient ways to accomplish this using some pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.
The kernel code (which works) to accomplish this is shown here:
_cudaDiagonalizeTest << <5, 1 >> >(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);
__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
int ix, iy, idx;
ix = blockIdx.x * blockDim.x + threadIdx.x;
iy = blockIdx.y * blockDim.y + threadIdx.y;
idx = iy * wA + ix;
C[idx * (wC + 1)] = A[idx];
}
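The index arithmetic in the kernel above can be checked on the host. The following is a minimal C++ sketch, not part of the original code, and the function name diagonalize is hypothetical. It assumes a row-major h x h output, where diagonal element i sits at flat index i * (h + 1); this is what C[idx * (wC + 1)] computes when wC equals hA.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: place a[i] on the diagonal of an h x h
// row-major matrix (h = a.size()); every other entry stays zero.
std::vector<float> diagonalize(const std::vector<float>& a) {
    const std::size_t h = a.size();
    std::vector<float> c(h * h, 0.0f);
    for (std::size_t i = 0; i < h; ++i) {
        c[i * (h + 1)] = a[i];  // same mapping as C[idx * (wC + 1)] = A[idx]
    }
    return c;
}
```

Comparing the device result against such a reference is one way to confirm that the launch configuration covers every element of the vector.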
I am a bit suspicious that this is a very naive approach to a solution and was wondering if someone could give an example of how I could do the same using
a) reduction
b) thrust
For vectors of large row size, I would like to be able to use the GPU's multithreading capabilities to chunk the task into small jobs, and combine each result at the end with __syncthreads().
The picture below shows what the desired result is.
I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.
Any assistance or explanation would be very much welcomed.
Thanks.
Matrix A is the target with 4 columns. I want to take each column, and copy its elements into Matrix B as a diagonal, iterating through each column.

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts.
Another approach could be based off the thrust strided_range example.
This example does what you want (fill the diagonals based on the input vector). However, depending on how you proceed with the resulting matrix to your "Differentiating" step, it might still be worth investigating if a sparse storage (without all the zero entries) is possible, since this will reduce memory consumption and ease iterating.
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>
template<typename V>
void print_matrix(const V& mat, int rows, int cols)
{
for(int i = 0; i < rows; ++i)
{
for(int j = 0; j < cols; ++j)
{
std::cout << mat[i + j*rows] << "\t";
}
std::cout << std::endl;
}
}
struct diag_index : public thrust::unary_function<int,int>
{
diag_index(int rows) : rows(rows){}
__host__ __device__
int operator()(const int index) const
{
return (index*rows + (index%rows));
}
const int rows;
};
int main()
{
const int rows = 5;
const int cols = 4;
// allocate memory and fill with demo data
// we use column-major order
thrust::device_vector<int> A(rows*cols);
thrust::sequence(A.begin(), A.end());
thrust::device_vector<int> B(rows*rows*cols, 0);
// fill diagonal matrix
thrust::scatter(A.begin(), A.end(), thrust::make_transform_iterator(thrust::make_counting_iterator(0),diag_index(rows)), B.begin());
print_matrix(A, rows, cols);
std::cout << std::endl;
print_matrix(B, rows, rows*cols);
return 0;
}
This example will output:
0 5 10 15
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
0 0 0 0 0 5 0 0 0 0 10 0 0 0 0 15 0 0 0 0
0 1 0 0 0 0 6 0 0 0 0 11 0 0 0 0 16 0 0 0
0 0 2 0 0 0 0 7 0 0 0 0 12 0 0 0 0 17 0 0
0 0 0 3 0 0 0 0 8 0 0 0 0 13 0 0 0 0 18 0
0 0 0 0 4 0 0 0 0 9 0 0 0 0 14 0 0 0 0 19
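The mapping performed by diag_index can also be written as a plain loop, which may help when checking the scatter on the host. This is a sketch under the same column-major assumption as the thrust example; the function name columns_to_diagonals is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the scatter above. A is rows x cols,
// B is rows x (rows * cols), both stored in column-major order.
std::vector<int> columns_to_diagonals(const std::vector<int>& a, int rows, int cols) {
    std::vector<int> b(static_cast<std::size_t>(rows) * rows * cols, 0);
    for (int i = 0; i < rows * cols; ++i) {
        // Element i of A lands in column i of B, at row i % rows.
        b[static_cast<std::size_t>(i) * rows + (i % rows)] = a[i];
    }
    return b;
}
```

For a 2 x 2 input holding 0, 1, 2, 3 in column-major order, the result is two 2 x 2 diagonal blocks side by side: diag(0, 1) and diag(2, 3).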

An alternate answer that does not use thrust is as follows:
_cudaMatrixTest << <5, 5 >> >(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);
__global__ void _cudaMatrixTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
int ix, iy, idx;
ix = blockIdx.x * blockDim.x + threadIdx.x;
iy = blockIdx.y * blockDim.y + threadIdx.y;
idx = iy * wA + ix;
C[idx * wC + (idx % wC)] = A[threadIdx.x * wA + (ix / wC)];
}
where d_A is
0 5 10 15
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
Both answers are viable solutions. The question is, which is better/faster?

Related

Change the range of IRAND() in Fortran 77 [duplicate]

This is a follow on from a previously posted question:
How to generate a random number in C?
I wish to be able to generate a random number from within a particular range, such as 1 to 6 to mimic the sides of a die.
How would I go about doing this?

All the answers so far are mathematically wrong. Returning rand() % N does not uniformly give a number in the range [0, N) unless N divides the length of the interval into which rand() returns (i.e. is a power of 2). Furthermore, one has no idea whether the moduli of rand() are independent: it's possible that they go 0, 1, 2, ..., which is uniform but not very random. The only assumption it seems reasonable to make is that rand() puts out a Poisson distribution: any two nonoverlapping subintervals of the same size are equally likely and independent. For a finite set of values, this implies a uniform distribution and also ensures that the values of rand() are nicely scattered.
This means that the only correct way of changing the range of rand() is to divide it into boxes; for example, if RAND_MAX == 11 and you want a range of 1..6, you should assign {0,1} to 1, {2,3} to 2, and so on. These are disjoint, equally-sized intervals and thus are uniformly and independently distributed.
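To make the difference concrete, the sketch below simply counts how many source values land on each result, once for the modulo mapping and once for the box mapping. The function names are hypothetical, and the generator is replaced by an exhaustive loop over its possible outputs, so nothing here is random.

```cpp
#include <vector>

// Counts how many of the source values 0..num_values-1 land on each result
// 0..n-1 when reduced with `% n`. Assumes 0 < n <= num_values.
std::vector<int> modulo_counts(int num_values, int n) {
    std::vector<int> counts(n, 0);
    for (int x = 0; x < num_values; ++x) ++counts[x % n];
    return counts;
}

// Same count for equal-sized boxes (x / box_size). Leftover values that do
// not fill a whole box are skipped; a real generator would redraw them.
std::vector<int> box_counts(int num_values, int n) {
    std::vector<int> counts(n, 0);
    const int box_size = num_values / n;
    for (int x = 0; x < box_size * n; ++x) ++counts[x / box_size];
    return counts;
}
```

With 16 source values (a hypothetical RAND_MAX of 15) and n = 6, the modulo mapping gives counts 3, 3, 3, 3, 2, 2, so the results 0 to 3 are more likely than 4 and 5. With the 12 source values from the example above (RAND_MAX == 11), every box receives exactly 2 values.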
The suggestion to use floating-point division is mathematically plausible but suffers from rounding issues in principle. Perhaps double is high-enough precision to make it work; perhaps not. I don't know and I don't want to have to figure it out; in any case, the answer is system-dependent.
The correct way is to use integer arithmetic. That is, you want something like the following:
#include <stdlib.h> // For random(), RAND_MAX
// Assumes 0 <= max <= RAND_MAX
// Returns in the closed interval [0, max]
long random_at_most(long max) {
unsigned long
// max <= RAND_MAX < ULONG_MAX, so this is okay.
num_bins = (unsigned long) max + 1,
num_rand = (unsigned long) RAND_MAX + 1,
bin_size = num_rand / num_bins,
defect = num_rand % num_bins;
long x;
do {
x = random();
}
// This is carefully written not to overflow
while (num_rand - defect <= (unsigned long)x);
// Truncated division is intentional
return x/bin_size;
}
The loop is necessary to get a perfectly uniform distribution. For example, if you are given random numbers from 0 to 2 and you want only ones from 0 to 1, you just keep pulling until you don't get a 2; it's not hard to check that this gives 0 or 1 with equal probability. This method is also described in the link that nos gave in their answer, though coded differently. I'm using random() rather than rand() as it has a better distribution (as noted by the man page for rand()).
If you want to get random values outside the default range [0, RAND_MAX], then you have to do something tricky. Perhaps the most expedient is to define a function random_extended() that pulls n bits (using random_at_most()) and returns in [0, 2**n), and then apply random_at_most() with random_extended() in place of random() (and 2**n - 1 in place of RAND_MAX) to pull a random value less than 2**n, assuming you have a numerical type that can hold such a value. Finally, of course, you can get values in [min, max] using min + random_at_most(max - min), including negative values.

Following on from @Ryan Reich's answer, I thought I'd offer my cleaned up version. The first bounds check isn't required given the second bounds check, and I've made it iterative rather than recursive. It returns values in the range [min, max], where max >= min and 1+max-min < RAND_MAX.
unsigned int rand_interval(unsigned int min, unsigned int max)
{
int r;
const unsigned int range = 1 + max - min;
const unsigned int buckets = RAND_MAX / range;
const unsigned int limit = buckets * range;
/* Create equal size buckets all in a row, then fire randomly towards
* the buckets until you land in one of them. All buckets are equally
* likely. If you land off the end of the line of buckets, try again. */
do
{
r = rand();
} while (r >= limit);
return min + (r / buckets);
}

Here is a formula if you know the max and min values of a range, and you want to generate numbers inclusive in between the range:
r = (rand() % (max + 1 - min)) + min

unsigned int
randr(unsigned int min, unsigned int max)
{
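/* Divide by RAND_MAX + 1.0 so that scaled lies in [0, 1) and the result never exceeds max */
double scaled = (double)rand()/((double)RAND_MAX + 1.0);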
return (max - min +1)*scaled + min;
}
See here for other options.

Wouldn't you just do:
srand(time(NULL));
int r = ( rand() % 6 ) + 1;
% is the modulus operator. Essentially it will just divide by 6 and return the remainder... from 0 - 5

For those who understand the bias problem but can't stand the unpredictable run-time of rejection-based methods, this series produces a progressively less biased random integer in the [0, n-1] interval:
r = n / 2;
r = (rand() * n + r) / (RAND_MAX + 1);
r = (rand() * n + r) / (RAND_MAX + 1);
r = (rand() * n + r) / (RAND_MAX + 1);
...
It does so by synthesising a high-precision fixed-point random number of i * log_2(RAND_MAX + 1) bits (where i is the number of iterations) and performing a long multiplication by n.
When the number of bits is sufficiently large compared to n, the bias becomes immeasurably small.
It does not matter if RAND_MAX + 1 is less than n (as in this question), or if it is not a power of two, but care must be taken to avoid integer overflow if RAND_MAX * n is large.
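The series above can be wrapped in a loop. The following is a sketch only: the function name low_bias_rand and the iterations parameter are hypothetical, and 64-bit arithmetic is my choice for sidestepping the overflow mentioned above when n fits in 32 bits and RAND_MAX is at most 2^31 - 1.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical wrapper around the series above. Returns a value in [0, n-1]
// for iterations >= 1; more iterations reduce the bias.
std::uint32_t low_bias_rand(std::uint32_t n, int iterations) {
    const std::uint64_t span = static_cast<std::uint64_t>(RAND_MAX) + 1;
    std::uint64_t r = n / 2;  // start in the middle, as in the series
    for (int i = 0; i < iterations; ++i) {
        // rand() * n + r < span * n, so the quotient is always below n.
        r = (static_cast<std::uint64_t>(std::rand()) * n + r) / span;
    }
    return static_cast<std::uint32_t>(r);
}
```

Unlike the rejection-based versions, the run time here is fixed by iterations, at the cost of a small residual bias.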

Here is a slightly simpler algorithm than Ryan Reich's solution:
/// Begin and end are *inclusive*; => [begin, end]
uint32_t getRandInterval(uint32_t begin, uint32_t end) {
uint32_t range = (end - begin) + 1;
uint32_t limit = ((uint64_t)RAND_MAX + 1) - (((uint64_t)RAND_MAX + 1) % range);
/* Imagine range-sized buckets all in a row, then fire randomly towards
* the buckets until you land in one of them. All buckets are equally
* likely. If you land off the end of the line of buckets, try again. */
uint32_t randVal = rand();
while (randVal >= limit) randVal = rand();
/// Return the position you hit in the bucket + begin as random number
return (randVal % range) + begin;
}
Example (RAND_MAX := 16, begin := 2, end := 7)
=> range := 6 (1 + end - begin)
=> limit := 12 (RAND_MAX + 1) - ((RAND_MAX + 1) % range)
The limit is always a multiple of the range,
so we can split it into range-sized buckets:
Possible-rand-output: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Buckets: [0, 1, 2, 3, 4, 5][0, 1, 2, 3, 4, 5][X, X, X, X, X]
Buckets + begin: [2, 3, 4, 5, 6, 7][2, 3, 4, 5, 6, 7][X, X, X, X, X]
1st call to rand() => 13
→ 13 is not in the bucket-range anymore (>= limit), while-condition is true
→ retry...
2nd call to rand() => 7
→ 7 is in the bucket-range (< limit), while-condition is false
→ Get the corresponding bucket-value 1 (randVal % range) and add begin
=> 3

In order to avoid the modulo bias (suggested in other answers) you can always use:
arc4random_uniform(MAX-MIN)+MIN
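Where "MAX" is the upper bound and "MIN" is the lower bound. Note that arc4random_uniform(n) returns a value less than n, so MAX itself is never returned; use MAX-MIN+1 if the upper bound should be included. For example, for numbers from 10 up to, but not including, 20: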
arc4random_uniform(20-10)+10
arc4random_uniform(10)+10
Simple solution and better than using "rand() % N".

While Ryan is correct, the solution can be much simpler based on what is known about the source of the randomness. To re-state the problem:
There is a source of randomness, outputting integer numbers in range [0, MAX) with uniform distribution.
The goal is to produce uniformly distributed random integer numbers in range [rmin, rmax] where 0 <= rmin < rmax < MAX.
In my experience, if the number of bins (or "boxes") is significantly smaller than the range of the original numbers, and the original source is cryptographically strong - there is no need to go through all that rigamarole, and simple modulo division would suffice (like output = rnd.next() % (rmax+1), if rmin == 0), and produce random numbers that are distributed uniformly "enough", and without any loss of speed. The key factor is the randomness source (i.e., kids, don't try this at home with rand()).
Here's an example/proof of how it works in practice. I wanted to generate random numbers from 1 to 22, having a cryptographically strong source that produced random bytes (based on Intel RDRAND). The results are:
Rnd distribution test (22 boxes, numbers of entries in each box):
1: 409443 4.55%
2: 408736 4.54%
3: 408557 4.54%
4: 409125 4.55%
5: 408812 4.54%
6: 409418 4.55%
7: 408365 4.54%
8: 407992 4.53%
9: 409262 4.55%
10: 408112 4.53%
11: 409995 4.56%
12: 409810 4.55%
13: 409638 4.55%
14: 408905 4.54%
15: 408484 4.54%
16: 408211 4.54%
17: 409773 4.55%
18: 409597 4.55%
19: 409727 4.55%
20: 409062 4.55%
21: 409634 4.55%
22: 409342 4.55%
total: 100.00%
This is as close to uniform as I need for my purpose (fair dice throw, generating cryptographically strong codebooks for WWII cipher machines such as http://users.telenet.be/d.rijmenants/en/kl-7sim.htm, etc). The output does not show any appreciable bias.
Here's the source of cryptographically strong (true) random number generator:
Intel Digital Random Number Generator
and a sample code that produces 64-bit (unsigned) random numbers.
int rdrand64_step(unsigned long long int *therand)
{
unsigned long long int foo;
int cf_error_status;
asm("rdrand %%rax; \
mov $1,%%edx; \
cmovae %%rax,%%rdx; \
mov %%edx,%1; \
mov %%rax, %0;":"=r"(foo),"=r"(cf_error_status)::"%rax","%rdx");
*therand = foo;
return cf_error_status;
}
I compiled it on Mac OS X with clang-6.0.1 (straight), and with gcc-4.8.3 using "-Wa,q" flag (because GAS does not support these new instructions).

As said before, modulo isn't sufficient because it skews the distribution. Here's my code, which masks off bits and uses them to ensure the distribution isn't skewed.
static uint32_t randomInRange(uint32_t a,uint32_t b) {
uint32_t v;
uint32_t range;
uint32_t upper;
uint32_t lower;
uint32_t mask;
if(a == b) {
return a;
}
if(a > b) {
upper = a;
lower = b;
} else {
upper = b;
lower = a;
}
range = upper - lower;
mask = 0;
//XXX calculate range with log and mask? nah, too lazy :).
while(1) {
if(mask >= range) {
break;
}
mask = (mask << 1) | 1;
}
while(1) {
v = rand() & mask;
if(v <= range) {
return lower + v;
}
}
}
The following simple code lets you look at the distribution:
int main() {
unsigned long long int i;
unsigned int n = 10;
unsigned int numbers[n];
for (i = 0; i < n; i++) {
numbers[i] = 0;
}
for (i = 0 ; i < 10000000 ; i++){
uint32_t rand = randomInRange(0,n - 1);
if(rand >= n){
printf("bug: rand out of range %u\n",(unsigned int)rand);
return 1;
}
numbers[rand] += 1;
}
for(i = 0; i < n; i++) {
printf("%llu: %u\n",i,numbers[i]);
}
}

Will return a floating point number in the range [0,1]:
#define rand01() (((double)random())/((double)(RAND_MAX)))

How to avoid un-coalesced accesses in matrix multiplication CUDA kernel?

I am learning CUDA with the book 'Programming Massively Parallel Processors'. A practice problem from chapter 5 confuses me:
For tiled matrix multiplication out of possible range of values for
BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely
avoid un-coalesced accesses to global memory? (you only need to consider square blocks)
As I understand it, BLOCK_SIZE has little to do with memory coalescing. As long as the threads within a single warp access consecutive elements, the accesses are coalesced. I could not figure out where the kernel has un-coalesced accesses to global memory. Any hints?
Here is the kernel's source code:
#define COMMON_WIDTH 512
#define ROW_LEFT 500
#define COL_RIGHT 250
#define K 1000
#define TILE_WIDTH 32
__device__ int D_ROW_LEFT = ROW_LEFT;
__device__ int D_COL_RIGHT = COL_RIGHT;
__device__ int D_K = K;
.....
__global__
void MatrixMatrixMultTiled(float *matrixLeft, float *matrixRight, float *output){
__shared__ float sMatrixLeft[TILE_WIDTH][TILE_WIDTH];
__shared__ float sMatrixRight[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int col = bx * TILE_WIDTH + tx;
int row = by * TILE_WIDTH + ty;
float value = 0;
for (int i = 0; i < ceil(D_K/(float)TILE_WIDTH); ++i){
if (row < D_ROW_LEFT && row * D_K + i * TILE_WIDTH +tx < D_K){
sMatrixLeft[ty][tx] = matrixLeft[row * D_K + i * TILE_WIDTH +tx];
}
if (col < D_COL_RIGHT && (ty + i * TILE_WIDTH) * D_COL_RIGHT + col < D_K ){
sMatrixRight[ty][tx] = matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
}
__syncthreads();
for (int j = 0; j < TILE_WIDTH; j++){
value += sMatrixLeft[ty][j] * sMatrixRight[j][tx];
}
__syncthreads();
}
if (row < D_ROW_LEFT && col < D_COL_RIGHT ){
output[row * D_COL_RIGHT + col] = value;
}
}

Your question is incomplete, since the code you have posted does not make any reference to BLOCK_SIZE, and that is certainly at least very relevant to the question posed in the book. More generally, questions that pose a kernel without the launch configuration are often incomplete, since the launch configuration is often relevant to both the correctness and the behavior, of a kernel.
I've not re-read this portion of the book right at the moment. However I'll assume the kernel launch configuration includes a block dimension that is something like the following: (this information is absent from your question but should have been included, in my opinion, for a sensible question)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(...,...);
And I will assume the kernel launch is given by something like:
MatrixMatrixMultTiled<<<dimGrid, dimBlock>>>(...);
Your statement: "As long as threads within single warp access consecutive elements, we will have a coalesced accesses." is a reasonable working definition. Let's show that that is violated for some choices of BLOCK_SIZE, given the above assumptions to cover over the gaps in your incomplete question.
Coalesced access is a term that applies to global memory accesses only. We will therefore ignore accesses to shared memory. We will also, for this discussion, ignore accesses to the __device__ variables such as D_ROW_LEFT. (The access to those variables appears to be uniform. We can quibble about whether that constitutes coalesced access. My claim would be that it does constitute coalesced access, but we need not unpack that here.) Therefore we are left with just 3 "access" points:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
output[row * D_COL_RIGHT + col]
Now, to pick an example, let's suppose BLOCK_SIZE is 16. Will any of the above access points violate your statement "threads within single warp access consecutive elements"?
Let's start with the block (0,0). Therefore row is equal to threadIdx.y and col is equal to threadIdx.x. Let's consider the first warp in that block. Therefore the first 16 threads in that warp will have a threadIdx.y value of 0, and their threadIdx.x values will be increasing from 0..15. Likewise the second 16 threads in that warp will have a threadIdx.y value of 1, and their threadIdx.x values will be increasing from 0..15.
Now let's compute the actual index generated for the first access point above, across the warp. Let's assume we are on the first loop iteration, so i is zero. Therefore this:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
reduces to:
matrixLeft[threadIdx.y * D_K + threadIdx.x];
D_K here is just the device copy of the K variable, which is 1000. Now let's evaluate the reduced index expression above across our selected warp (0) in our selected block (0,0):
warp lane:    0  1  2  3  4  5  6 .. 15    16    17    18 ..   31
threadIdx.x:  0  1  2  3  4  5  6 .. 15     0     1     2 ..   15
threadIdx.y:  0  0  0  0  0  0  0 ..  0     1     1     1 ..    1
index:        0  1  2  3  4  5  6 .. 15  1000  1001  1002 .. 1015
Therefore the generated index pattern here shows a discontinuity between the 16th and 17th thread in the warp, and the access pattern does not fit your previously stated condition:
"threads within single warp access consecutive elements"
and we do not have coalesced access in this case (at least, for float quantities).
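The table above can be reproduced for other block sizes with a small host-side loop. This is a sketch under the same assumptions as the answer (a dimBlock(BLOCK_SIZE, BLOCK_SIZE) launch, block (0,0), first loop iteration) plus the assumption that the block holds at least 32 threads. The function name is hypothetical, and it only tests the "consecutive elements" working definition, not actual hardware transaction behaviour.

```cpp
// Hypothetical host-side check: do neighbouring lanes of the first warp in
// block (0,0) read adjacent elements of matrixLeft when i = 0?
// Lanes are numbered with threadIdx.x varying fastest, as in CUDA 2D blocks.
bool first_warp_is_consecutive(int block_size, int row_stride) {
    const int warp_size = 32;
    int previous = 0;
    for (int lane = 0; lane < warp_size; ++lane) {
        const int tx = lane % block_size;        // threadIdx.x
        const int ty = lane / block_size;        // threadIdx.y
        const int index = ty * row_stride + tx;  // row * D_K + tx
        if (lane > 0 && index != previous + 1) return false;
        previous = index;
    }
    return true;
}
```

With row_stride = 1000 (the value of D_K), the check fails for block sizes 8 and 16, where a warp spans more than one matrix row, and passes for 32, where a warp covers exactly one row of the tile.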

C/C++ rand() function for biased expectation

I am using the <stdlib.h> rand() function to generate 100 random integers in the range [0 ... 9]. I generate them with a roughly equal distribution as follows:
int random_numbers[100];
for(register int i = 0; i < 100; i++){
random_numbers[i] = rand() % 10;
}
This is working fine. But now I want to get 100 numbers where I want around 50% of those numbers to be 5. How do I do that?
Extended Problem
I want to get 100 numbers. What if I want 50% of those numbers to be between 0 and 2, i.e. 50 percent of the numbers consist only of 0, 1, and 2? How do I do that?
I am looking for general steps that also apply beyond the limits of 10 or 100.

Hmmm, how about choosing a random number between 0 and 17, and if the number is greater than 9, change it to 5?
For 0 - 17, you would get a distribution like
0,1,2,3,4,5,6,7,8,9,5,5,5,5,5,5,5,5
Code:
int random_numbers[100];
for(register int i = 0; i < 100; i++){
random_numbers[i] = rand() % 18;
if (random_numbers[i] > 9) {
random_numbers[i] = 5;
}
}
You basically add a set of numbers beyond your desired range that, when translated to 5 give you equal numbers of 5 and non-5.

In order to get around 50% of these numbers to be in [0, 2] range you can split the full range of rand() into two equal halves and then use the same %-based technique to map the first half to [0, 2] range and the second half to [3, 9] range.
int random_numbers[100];
for(int i = 0; i < 100; i++)
{
int r = rand();
random_numbers[i] = r <= RAND_MAX / 2 ? r % 3 : r % 7 + 3;
}
To get around 50% of these numbers to be 5, a similar technique will work. Just map the second half to the [0, 9] range with 5 excluded:
int random_numbers[100];
for(int i = 0; i < 100; i++)
{
int r = rand();
if (r <= RAND_MAX / 2)
r = 5;
else if ((r %= 9) >= 5)
++r;
random_numbers[i] = r;
}

I think it is easy to solve the particular problem of 50% using the techniques mentioned by other answers. Let us try to answer the question for a general case -
Let us say you want a distribution where you want the numbers {A1, A2, .. An} with the percentages {P1, P2, .. Pn} and sum of Pi is 100% (and all the percentages are integers, if not it can be adjusted).
We will create an array of 100 size and fill it with the numbers A1-An.
int distribution[100];
Now we fill in each number, repeated its percentage number of times.
int position = 0;
for (int i = 0; i < n; i++) {
for( int j = 0; j < P[i]; j++) {
// Add a check here to make sure the sum hasn't crossed 100
distribution[position] = A[i];
position ++;
}
}
Now that this initialization is done once, you can draw a random number as -
int number = distribution[rand() % 100];
In case your percentages are not integers but say you want precision of 0.1%, you can create an array of 1000 instead of 100.

In both cases, the goal is 50% selected from one set and 50% from another. Code could call rand() and use some bits (one) for choosing the group and the remaining bits for value selection.
If the range of numbers needed is much smaller than RAND_MAX, a first attempt could use:
int rand_special_50percent(int n, int special) {
int r = rand();
int r_div_2 = r/2;
if (r%2) {
return special;
}
int y = r_div_2%(n-1); // 9 numbers left
if (y >= special) y++;
return y;
}
int rand_low_50percent(int n, int low_special) {
int r = rand();
int r_div_2 = r/2;
if (r%2) {
return r_div_2%(low_special+1);
}
// n - low_special - 1 values remain above low_special: [low_special + 1, n - 1]
return r_div_2%(n - low_special - 1) + low_special + 1;
}
Sample
int r5 = rand_special_50percent(10, 5);
int preferred_low_value_max = 2;
int r012 = rand_low_50percent(10, preferred_low_value_max);
Advanced:
With n above RAND_MAX/2, additional calls to rand() are needed.
When using rand()%n, unless (RAND_MAX+1u)%n == 0 (n is a divisor of RAND_MAX+1), a bias is introduced. The above code does not compensate for that.

C++11 solution (not optimal but easy)
std::piecewise_constant_distribution can generate random real numbers (float or double) for given intervals and weights for the each interval.
It is not optimal because it generates a double and converts it to int. Also, getting exactly 50 of the 100 samples from [0,3) is not guaranteed; only approximately 50 can be expected.
For your case : 2 intervals - [0,3), [3,100) and their weights [1,1]
Equal weights, so ~50% of the numbers from [0,3) and ~50% from [3,100)
#include <iostream>
#include <vector>
#include <map>
#include <random>
int main()
{
std::random_device rd;
std::mt19937 gen(rd());
std::vector<double> intervals{0, 3, 3, 100};
std::vector<double> weights{ 1, 0, 1};
std::piecewise_constant_distribution<> d(intervals.begin(), intervals.end(), weights.begin());
std::map<int, int> hist;
for(int n=0; n<100; ++n) {
++hist[(int)d(gen)];
}
for(auto p : hist) {
std::cout << p.first << " : generated " << p.second << " times"<< '\n';
}
}
Output:
0 : generated 22 times
1 : generated 19 times
2 : generated 16 times
4 : generated 1 times
5 : generated 2 times
8 : generated 1 times
12 : generated 1 times
17 : generated 1 times
19 : generated 1 times
22 : generated 2 times
23 : generated 1 times
25 : generated 1 times
29 : generated 1 times
30 : generated 2 times
31 : generated 1 times
36 : generated 1 times
38 : generated 1 times
44 : generated 1 times
45 : generated 1 times
48 : generated 1 times
49 : generated 1 times
51 : generated 1 times
52 : generated 1 times
53 : generated 1 times
57 : generated 2 times
58 : generated 3 times
62 : generated 1 times
65 : generated 2 times
68 : generated 1 times
71 : generated 1 times
76 : generated 2 times
77 : generated 1 times
85 : generated 1 times
90 : generated 1 times
94 : generated 1 times
95 : generated 1 times
96 : generated 2 times

How to get minimum number of moves to solve `game of fifteen`?

I was reading about this and thought to form an algorithm to find the minimum number of moves to solve this.
Constraints I made: an N x N matrix with one empty slot, denoted 0, holds the numbers 0 to N*N-1.
Now we have to rearrange this matrix so that the numbers are in increasing order from left to right, beginning from the top row, with 0 as the last element, i.e. the (N x N)th element.
For example,
Input :
8 4 0
7 2 5
1 3 6
Output:
1 2 3
4 5 6
7 8 0
Now the problem is how to do this in minimum number of steps possible.
As in the game (link provided), you can move left, right, up, or down, shifting the 0 (empty slot) to the corresponding position to reach the final matrix.
The output to be printed for this algorithm is the number of steps, say M, followed by the direction of each tile move: 1 for swapping with the upper adjacent element, 2 for the lower adjacent element, 3 for the left adjacent element, and 4 for the right adjacent element.
Like, for
2 <--- order of N X N matrix
3 1
0 2
Answer should be: 3 4 1 2 where 3 is M and 4 1 2 are steps to tile movement.
So I want to minimise the complexity of this algorithm and find the minimum number of moves. Please suggest the most efficient approach to solving this problem.
Edit:
This is what I coded in C++. Please look at the algorithm rather than pointing out other issues in the code.
#include <bits/stdc++.h>
using namespace std;
int inDex=0,shift[100000],N,initial[500][500],final[500][500];
struct Node
{
Node* parent;
int mat[500][500];
int x, y;
int cost;
int level;
};
Node* newNode(int mat[500][500], int x, int y, int newX,
int newY, int level, Node* parent)
{
Node* node = new Node;
node->parent = parent;
memcpy(node->mat, mat, sizeof node->mat);
swap(node->mat[x][y], node->mat[newX][newY]);
node->cost = INT_MAX;
node->level = level;
node->x = newX;
node->y = newY;
return node;
}
int row[] = { 1, 0, -1, 0 };
int col[] = { 0, -1, 0, 1 };
int calculateCost(int initial[500][500], int final[500][500])
{
int count = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
if (initial[i][j] && initial[i][j] != final[i][j])
count++;
return count;
}
int isSafe(int x, int y)
{
return (x >= 0 && x < N && y >= 0 && y < N);
}
struct comp
{
bool operator()(const Node* lhs, const Node* rhs) const
{
return (lhs->cost + lhs->level) > (rhs->cost + rhs->level);
}
};
void solve(int initial[500][500], int x, int y,
int final[500][500])
{
priority_queue<Node*, std::vector<Node*>, comp> pq;
Node* root = newNode(initial, x, y, x, y, 0, NULL);
Node* prev = newNode(initial,x,y,x,y,0,NULL);
root->cost = calculateCost(initial, final);
pq.push(root);
while (!pq.empty())
{
Node* min = pq.top();
if(min->x > prev->x)
{
shift[inDex] = 4;
inDex++;
}
else if(min->x < prev->x)
{
shift[inDex] = 3;
inDex++;
}
else if(min->y > prev->y)
{
shift[inDex] = 2;
inDex++;
}
else if(min->y < prev->y)
{
shift[inDex] = 1;
inDex++;
}
prev = pq.top();
pq.pop();
if (min->cost == 0)
{
cout << min->level << endl;
return;
}
for (int i = 0; i < 4; i++)
{
if (isSafe(min->x + row[i], min->y + col[i]))
{
Node* child = newNode(min->mat, min->x,
min->y, min->x + row[i],
min->y + col[i],
min->level + 1, min);
child->cost = calculateCost(child->mat, final);
pq.push(child);
}
}
}
}
int main()
{
cin >> N;
int i,j,k=1;
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
cin >> initial[j][i];
}
}
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
final[j][i] = k;
k++;
}
}
final[N-1][N-1] = 0;
int x = 0, y = 1,a[100][100];
solve(initial, x, y, final);
for(i=0;i<inDex;i++)
{
cout << shift[i] << endl;
}
return 0;
}
In the code above, I check each child node for the minimum cost (how many numbers are misplaced compared with the final matrix).
I want to make this algorithm more efficient and reduce its time complexity. Any suggestions would be appreciated.

While this sounds a lot like a homework problem, I'll lend a bit of help.
For sufficiently small problems, like your 2x2 or 3x3, you can just brute force it. Basically, you try every possible combination with every possible move, track how many turns each took, and then print out the smallest.
To improve on this, maintain a list of states you have already seen. Any time a move leads to a state that has been seen before, stop pursuing that branch, since it can't possibly be the shortest.
Example, say I'm in this state (flattening your matrix to a string for ease of display):
5736291084
6753291084
5736291084
Notice that we're back to a state we've seen before. That means it can't possibly be part of the shortest sequence, because the shortest one would be found without returning to a previous state.
You'll want to create a tree doing this, so you'd have something like:
          134
          529
          870
         /   \
        /     \
       /       \
      /         \
    134         134
    529         520
    807         879
   / | \       / | \
  /  |  X     /  X  \
134 134 134 134 134 130
509 529 529 502 529 524
827 087 870 879 870 879
And so on. Notice I marked some with X because they were duplicates, and thus we wouldn't want to pursue them any further since we know they can't be the smallest.
You'd just keep repeating this until you've tried all possible solutions (i.e., all non-stopped leaves reach a solution), then you just see which was the shortest. You could also do it in parallel so you stop once any one has found a solution, saving you time.
This brute force approach won't be effective against large matrices. To solve those, you're looking at some serious software engineering. One approach you could take with it would be to break it into smaller matrices and solve that way, but that may not be the best path.
This is a tricky problem to solve at larger sizes: finding the shortest solution for the general N x N version of this puzzle is NP-hard.
Start from solution, determine ranks of permutation
The reverse of above would be how you can pre-generate a list of all possible values.
Start with the solution. That has a rank of permutation of 0 (as in, zero moves):
012
345
678
Then, make all possible moves from there. All of those moves have rank of permutation of 1, as in, one move to solve.
          012
0         345
          678
         /   \
        /     \
       /       \
    102         312
1   345         045
    678         678
Repeat as above. Every state on a given level has the same rank of permutation. Keep generating all possible moves until every branch has been killed off as a duplicate.
You can then store all of them into an object. Flattening the matrix would make this easy (using JavaScript syntax just for example):
{
'012345678': 0,
'102345678': 1,
'312045678': 1,
'142305678': 2,
// and so on
}
Then, to solve your question "minimum number of moves", just find the entry that is the same as your starting point. The rank of permutation is the answer.
This would be a good solution if you are in a scenario where you can pre-generate the entire solution. It would take time to generate, but lookups would be lightning fast (this is similar to "rainbow tables" for cracking hashes).
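The pre-generation step could look roughly like this in C++ for the 3x3 case. This is only a sketch: `buildTable` is a hypothetical name, and the board is assumed to be flattened to a 9-character string with '0' as the blank, as in the object above.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>

// Map every state reachable from `solved` to the number of moves needed
// to get back to `solved`.
std::unordered_map<std::string, int> buildTable(const std::string& solved) {
    assert(solved.size() == 9);
    std::unordered_map<std::string, int> dist;
    std::queue<std::string> frontier;
    dist[solved] = 0;
    frontier.push(solved);
    const int dr[] = {-1, 1, 0, 0};
    const int dc[] = {0, 0, -1, 1};
    while (!frontier.empty()) {
        const std::string state = frontier.front();
        frontier.pop();
        const int d = dist[state];
        const int blank = static_cast<int>(state.find('0'));
        const int r = blank / 3, c = blank % 3;
        for (int k = 0; k < 4; ++k) {
            const int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= 3 || nc < 0 || nc >= 3) continue;
            std::string next = state;
            std::swap(next[blank], next[nr * 3 + nc]);
            if (dist.count(next)) continue;   // duplicate: already has a smaller or equal rank
            dist[next] = d + 1;
            frontier.push(next);
        }
    }
    return dist;
}
```

After `auto table = buildTable("012345678");`, answering a query is a single lookup such as `table.at("142305678")`. A start state that is missing from the table cannot be solved at all.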
If you must solve on the fly (without pre-generation), then the first approach, working move by move from your starting state until you reach the solution, would be better.
As for complexity: with n cells there are n! possible arrangements, and only half of them (n!/2) can be reached from the solved state. Chopping off duplicates as you go means each reachable state is expanded at most once, so the search visits at most n!/2 states (9!/2 = 181,440 for a 3x3 board).

You can use BFS (breadth-first search).
Each state is one vertex, and there is an edge between two vertices if a single move turns one state into the other.
For example
8 4 0
7 2 5
1 3 6
and
8 0 4
7 2 5
1 3 6
are connected.
Usually you will want to represent each state as a number. For a small grid, you can just read off the digits in order. For example,
8 4 0
7 2 5
1 3 6
is just 840725136.
If the grid is large, you may consider using the rank of the permutation of the numbers as your representation of the state. For example,
0 1 2
3 4 5
6 7 8
should be 0, as it is the first permutation in lexicographic order.
And
0 1 2
3 4 5
6 7 8
(which is represented by 0)
and
1 0 2
3 4 5
6 7 8
(which is represented by some other number X)
being connected is the same as vertices 0 and X being connected in the graph.
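One common way to compute such a rank is the factorial number system (Lehmer code): for each position, count how many smaller symbols appear to its right and weight that count by the factorial of the number of remaining positions. The sketch below assumes the board is given as a string of distinct characters; `permutationRank` is a name chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 0-based lexicographic rank of `board` among all permutations of its characters.
std::uint64_t permutationRank(const std::string& board) {
    const std::size_t n = board.size();
    assert(n <= 20);                  // 20! is the largest factorial that fits in 64 bits
    std::uint64_t rank = 0;
    std::uint64_t factorial = 1;      // equals (n - 1 - i)! when position i is processed
    for (std::size_t i = n; i-- > 0;) {
        std::uint64_t smallerToRight = 0;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (board[j] < board[i]) ++smallerToRight;
        }
        rank += smallerToRight * factorial;
        factorial *= (n - i);         // becomes (n - i)! for the next position to the left
    }
    return rank;
}
```

With this encoding `permutationRank("012345678")` is 0, and the neighbouring state `"102345678"` gets X = 40320, because the leading 1 has one smaller symbol to its right and 8! = 40320.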
The complexity of the algorithm should be O(n!), where n is the number of cells, as there are at most n! vertices/permutations.

Algorithm for Simple Squared Squares

I want to split a square into unequal squares.
After some searching on the web I found this link.
This is the output I need:
Does anyone have an idea for this?

As Yves Daoust said, the algorithm to solve this is going to be slow. The first challenge is to determine which squares COULD be combined to fit into your big square. The second is to figure out whether they WILL fit in there.
I would first filter by area.
To answer the first part, you need to look for a combination of squares whose areas add up to the area of your big one. There are likely multiple combinations, since a 5x5 square takes up the same area as a 3x3 plus a 4x4 square. This is an O(2^n) problem in itself.
Then attempt to arrange them.
I would make a matrix that is the size of your big square. Then, starting at the topmost and then rightmost index, add a square by marking the matrix positions it occupies. Then move to the next unoccupied space, chosen by the same rule, and add an unused square. If no square fits, remove the previous square and continue with the next candidate. This is a method begging for recursion.
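A rough sketch of that recursion, under a few assumptions of mine: the candidate squares are passed in as a list of side lengths, the grid is scanned top to bottom and left to right (the mirror image of the order described above, which works the same way), and the names `packSquares`, `place`, `fits` and `paint` are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// grid[r][c] is -1 while the cell is free, otherwise the index of the square covering it.
typedef std::vector<std::vector<int> > Grid;

// True if an s x s square with its top-left corner at (r, c) stays inside
// the grid and covers only free cells.
bool fits(const Grid& grid, int r, int c, int s) {
    const int side = static_cast<int>(grid.size());
    if (r + s > side || c + s > side) return false;
    for (int i = r; i < r + s; ++i)
        for (int j = c; j < c + s; ++j)
            if (grid[i][j] != -1) return false;
    return true;
}

// Mark (or, with value -1, clear) the cells of an s x s square at (r, c).
void paint(Grid& grid, int r, int c, int s, int value) {
    for (int i = r; i < r + s; ++i)
        for (int j = c; j < c + s; ++j)
            grid[i][j] = value;
}

bool place(Grid& grid, const std::vector<int>& sizes, std::vector<bool>& used) {
    const int side = static_cast<int>(grid.size());
    int r = -1, c = -1;
    // Find the next unoccupied cell.
    for (int i = 0; i < side && r < 0; ++i)
        for (int j = 0; j < side; ++j)
            if (grid[i][j] == -1) { r = i; c = j; break; }
    if (r < 0) return true;                    // nothing left to cover
    for (std::size_t k = 0; k < sizes.size(); ++k) {
        if (used[k] || !fits(grid, r, c, sizes[k])) continue;
        paint(grid, r, c, sizes[k], static_cast<int>(k));
        used[k] = true;
        if (place(grid, sizes, used)) return true;
        used[k] = false;                       // remove the square and try the next one
        paint(grid, r, c, sizes[k], -1);
    }
    return false;
}

// Try to cover a side x side grid using each listed square at most once.
bool packSquares(int side, const std::vector<int>& sizes, Grid& grid) {
    assert(side > 0);
    grid.assign(side, std::vector<int>(side, -1));
    std::vector<bool> used(sizes.size(), false);
    return place(grid, sizes, used);
}
```

This also shows why the area filter alone is not enough: `packSquares(5, {3, 4}, grid)` fails even though 9 + 16 = 25, because a 3x3 and a 4x4 square cannot sit side by side in a 5x5 grid.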
As I said at the beginning, this is a SLOW way to do it, but it will give you a solution if one exists.

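I used a dynamic programming approach for this, but it only works up to n ~ 50. It lists the sets of distinct squares whose areas sum to n * n; it does not check whether they can actually be arranged. I store each solution as a bitset for efficiency.
You can compile the code yourself with: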
$ g++ -O3 -std=c++11 squares.cpp -o squares
#include <bitset>
#include <iostream>
#include <list>
#include <vector>
using namespace std;

// Largest square side that can be stored in the bitset.
constexpr auto N = 116;

// A set of square sides stored as a bitmask: bit i is set when the i x i square is used.
class FastSquareList {
public:
    FastSquareList() = default;
    FastSquareList(int i) { mask_.set(i); }
    FastSquareList operator+(int i) const {
        FastSquareList result = *this;
        result.mask_.set(i);
        return result;
    }
    bool has(int i) const { return mask_.test(i); }
    // Print the areas of the squares in this set.
    void display() const {
        for (auto i = 1; i <= N; ++i) {
            if (has(i)) {
                cout << i * i << " ";
            }
        }
        cout << endl;
    }
private:
    bitset<N + 1> mask_;
};

int main() {
    int n;
    cin >> n;
    // v[a] holds every set of distinct squares whose areas sum to a.
    vector<list<FastSquareList> > v(n * n + 1);
    for (int i = 1; i <= n; ++i) {
        v[i * i].push_back(i);
        for (int a = i * i + 1; a <= n * n; ++a) {
            int p = a - i * i;
            for (const auto& l : v[p]) {
                // Each square may be used at most once.
                if (l.has(i)) {
                    continue;
                }
                v[a].emplace_back(l + i);
            }
        }
    }
    for (const auto& l : v[n * n]) {
        l.display();
    }
    cout << "solutions count = " << v[n*n].size() << endl;
    return 0;
}
An example:
$ ./squares
15
9 16 36 64 100
25 36 64 100
1 4 9 16 25 49 121
4 36 64 121
4 100 121
4 16 25 36 144
1 16 64 144
81 144
4 16 36 169
4 9 16 196
4 25 196
225
solutions count = 12
