CUDA matrix preferred indexing method - parallel-processing

Let's say that I have a 4x4 matrix, which is divided into a 2x2 grid of 2x2 blocks, so func<<<dim3(2,2), dim3(2,2)>>>(). The matrix is stored in a 1D array of size 16. The usual method to calculate x and y is the following:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
It seems like the recommended (at least by multiple examples) way to calculate the global index is:
int index = y * width + x;
This would generate the following indices:
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 4
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 5
So, on each y increment, the index would be strided, which means that only the x threads would benefit from coalescing. Another way to calculate the index is:
int index = y * blockDim.x + x;
Which would give the following indices:
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 2
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 3
In this case, the entire block is coalesced as all threads would access consecutive elements of the array.
Why is the first method generally recommended? Doesn't the second one achieve a better performance?

Why is the first method generally recommended?
One possibility might be that no one really thinks problems involving a 4x4 matrix accessed across a 4x4 grid are useful to tune for. Once you get to large matrices broken into 32x32 tiles, this becomes moot. (&)
Another way to calculate the index is:
int index = y * blockDim.x + x;
I don't think so. One thread in your grid will have an (x,y) ordered pair of (0,1). Another will have an ordered pair of (2,0). Considering your proposed value for blockDim.x of 2, those two threads will yield the same index value. I don't imagine that is what you want.
(&) And with no loss of generality in my opinion, if I wanted to create threadblocks of less than 32x32 = 1024 threads, I would scale down the block y dimension, e.g. 32x16 for 512 threads, or 32x8 for 256 threads. This allows me to use the same indexing "everywhere".
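The collision described in the answer can be checked on the host without a GPU. The following is a minimal Python sketch that enumerates every thread of the 2x2 grid of 2x2 blocks and evaluates both index formulas; the helper names `thread_coords` and `indices` are made up for this illustration and are not part of CUDA.

```python
from itertools import product

def thread_coords(grid_dim, block_dim):
    """Yield the global (x, y) of every thread, mimicking
    x = blockIdx.x * blockDim.x + threadIdx.x (and likewise for y)."""
    for by, bx, ty, tx in product(range(grid_dim[1]), range(grid_dim[0]),
                                  range(block_dim[1]), range(block_dim[0])):
        yield bx * block_dim[0] + tx, by * block_dim[1] + ty

def indices(grid_dim, block_dim, row_stride):
    """Sorted list of y * row_stride + x over all threads."""
    return sorted(y * row_stride + x
                  for x, y in thread_coords(grid_dim, block_dim))

grid_dim, block_dim, width = (2, 2), (2, 2), 4

row_major = indices(grid_dim, block_dim, width)            # y * width + x
block_stride = indices(grid_dim, block_dim, block_dim[0])  # y * blockDim.x + x

print(row_major)     # every element of the 16-entry array exactly once
print(block_stride)  # repeated values: several threads share an index
```

With `y * width + x` each of the 16 threads gets its own element, whereas `y * blockDim.x + x` maps 16 threads onto only 10 distinct indices, for example (0,1) and (2,0) both map to 2.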

Related

Different ways to select ordered triplets from an array of N integers

Given an array A of n integers, I want to find the number of distinct ways of selecting ordered triplets. For example:
A = [1, 2, 1, 1]
different ways are (1, 2, 1), (1, 1, 1) and (2, 1, 1)
so the answer will be 3.
for A = [2, 2, 1, 2, 2]
different ways are (1, 2, 2), (2, 1, 2), (2, 2, 1) and (2, 2, 2)
so the answer will be 4 in this case
If all the numbers are unique then I have come up with a recurrence
f(n) = f(n-1) + ((n-1) * (n-2))/2
where f(3) = 1 and f(2) = f(1) = 0
I am having trouble when numbers are repeated. This needs to be solved in O(n) time and O(n) space.
The dynamic programming relation for the number of unique, ordered sets, from an array of size idx is:
DP[size of set][idx] = DP[size of set][idx-1] + DP[size of set - 1][idx-1] - DP[size of set - 1][last_idx[A[idx]] - 1]
So, to calculate the number of ordered, unique sets of size LEN from an array of idx elements:
Take the number of ordered, unique sets of size LEN that can be created from an array of idx-1 elements
Add the number of ordered, unique sets that can be formed by adding element idx to the end of ordered, unique sets for size LEN-1
Don’t double count. Subtract the number of ordered, unique sets that can be formed by adding the PREVIOUS occurrence of element idx to the end of ordered, unique sets for size LEN-1.
This works because we are always counting unique sets as we go through the array. Counting the unique sets is based on the counts of unique sets for the previous elements.
So, start with sets of size 1, then do size 2, then size 3, etc.
For unique, ordered sets of constant size LEN, my function takes O(LEN * N) memory and O(LEN * N) time. You should be able to reuse the DP array to reduce the memory to a constant independent of LEN, O(constant * N).
Here is the function.
static int answer(int[] A) {
// This example is for 0 <= A[i] <= 9. For an array of arbitrary integers, use a proper
// HashMap instead of an array as a HashMap. Alternatively, one could compress the input array
// down to distinct, consecutive numbers. Either way max memory of the last_idx array is O(n).
// This is left as an exercise to the reader.
final int MAX_INT_DIGIT = 10;
final int SUBSEQUENCE_LENGTH = 3;
int n = A.length;
int[][] dp = new int[SUBSEQUENCE_LENGTH][n];
int[] last_idx = new int[MAX_INT_DIGIT];
Arrays.fill(last_idx, -1);
// Init dp[0] which gives the number of distinct sets of length 1 ending at index i
dp[0][0] = 1;
last_idx[A[0]] = 0;
for (int i = 1; i < n; i++) {
if (last_idx[A[i]] == -1) {
dp[0][i] = dp[0][i - 1] + 1;
} else {
dp[0][i] = dp[0][i - 1];
}
last_idx[A[i]] = i;
}
for (int ss_len = 1; ss_len < SUBSEQUENCE_LENGTH; ss_len++) {
Arrays.fill(last_idx, -1);
last_idx[A[0]] = 0;
for (int i = 1; i < n; i++) {
if (last_idx[A[i]] <= 0) {
dp[ss_len][i] = dp[ss_len][i - 1] + dp[ss_len-1][i - 1];
} else {
dp[ss_len][i] = dp[ss_len][i - 1] + dp[ss_len-1][i - 1] - dp[ss_len-1][last_idx[A[i]] - 1];
}
last_idx[A[i]] = (i);
}
}
return dp[SUBSEQUENCE_LENGTH-1][n - 1];
}
For [3 1 1 3 8 0 5 8 9 0] the answer I get is 62.
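To sanity-check the recurrence, here is a small Python sketch of the same idea next to a brute-force count. It is a re-statement under assumptions, not a port of the Java above: it uses a 1-based prefix length, a dict instead of the `last_idx` array, and the function names are made up for illustration.

```python
from itertools import combinations

def count_distinct_subsequences(a, k=3):
    """Count distinct length-k subsequences of a (order preserved)."""
    n = len(a)
    # dp[j][i]: distinct subsequences of length j within the first i elements
    dp = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(n + 1):
        dp[0][i] = 1  # the empty subsequence
    for j in range(1, k + 1):
        last = {}  # value -> 1-based position of its previous occurrence
        for i in range(1, n + 1):
            v = a[i - 1]
            dp[j][i] = dp[j][i - 1] + dp[j - 1][i - 1]
            if v in last:
                # remove what the previous occurrence of v already produced
                dp[j][i] -= dp[j - 1][last[v] - 1]
            last[v] = i
    return dp[k][n]

def brute_force(a, k=3):
    """Reference count: enumerate all index triples, keep distinct values."""
    return len(set(combinations(a, k)))

print(count_distinct_subsequences([1, 2, 1, 1]))     # 3
print(count_distinct_subsequences([2, 2, 1, 2, 2]))  # 4
```

The brute-force version is cubic in the array length, so it is only useful for checking the DP on short inputs.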

Minimize total area using K rectangles in less than O(N^4)

Given an increasing sequence of N numbers (up to T), we can use at most K rectangles (placed starting at position 0) such that, for the i-th value v in the sequence, there exists a rectangle in positions [v, T) with height at least i + 1.
The total area of the rectangles should be the minimum that satisfies the condition above.
Example: given the sequence [0, 3, 4], T = 5 and K = 2 we can use:
a rectangle from 0 to 2 with height 1 (thus having an area of 3)
a rectangle from 3 to 4 with height 3 (thus having an area of 6).
Using at most 2 rectangles, we cannot get a total area smaller than 9.
This problem can be solved using DP.
int dp[MAXK+1][MAXN][MAXN];
int seq[MAXN];
int filldp(int cur_idx, int cur_value, int cur_K) {
int res = dp[cur_K][cur_idx][cur_value];
if (res != -1) return res;
res = INF;
if (cur_idx == N - 1 && cur_value >= N)
res = min(res, (T - seq[cur_idx]) * cur_value);
else {
if (cur_idx < N - 1 && cur_value >= cur_idx + 1) {
int cur_cost = (seq[cur_idx + 1] - seq[cur_idx]) * cur_value;
res = min(res, cur_cost + filldp(cur_idx + 1, cur_value, cur_K));
}
// Try every possible height for a rectangle
if (cur_K < K)
for (int new_value = cur_value + 1; new_value <= N; new_value++)
res = min(res, filldp(cur_idx, new_value, cur_K + 1));
}
dp[cur_K][cur_idx][cur_value] = res;
return res;
}
Unsurprisingly, this DP approach is not really fast, probably because of the for loop. However, as far as I can understand, this code should not make more than MAXK * MAXN * MAXN significant calls (i.e., no more than one per cell in dp). MAXK and MAXN are both 200, so dp has 8 million cells, which is not too many.
Am I missing anything?
UPDATE: As pointed out by Saeed Amiri (thank you!), the code makes N^2*K significant calls, but each one is O(N). The whole algorithm is then O(N^3*K) = O(N^4).
Can we do better?

select a group of pairs in order to minimize rms of group

Simplified problem
I have ~40 resistors (all the same value +-5%) and I need to select 12 of them so that they are as similar as possible.
Solution: I list them in order and take the 12 consecutive with the smallest RMS.
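A minimal Python sketch of this sort-and-slide idea follows. It assumes "smallest RMS" means the smallest RMS deviation from the window's own mean (the population standard deviation); the function name `select_closest` is made up for illustration.

```python
from statistics import pstdev

def select_closest(values, k):
    """Return the k consecutive sorted values with the smallest spread."""
    ordered = sorted(values)
    starts = range(len(ordered) - k + 1)  # assumes k <= len(values)
    # pstdev is the RMS deviation of a window from its own mean
    best = min(starts, key=lambda s: pstdev(ordered[s:s + k]))
    return ordered[best:best + k]

print(select_closest([10, 1, 2, 3, 20, 2.5], 3))  # [2, 2.5, 3]
```

Restricting the search to consecutive runs of the sorted list is what keeps it cheap: for 40 resistors and k = 12 there are only 29 windows to compare.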
The actual problem
I have ~40 resistors (all the same value +-5%) and I have to choose 12 pairs of them so that the resistance of the pairs is as similar as possible.
Notes
The resistance of the pair (R1,R2) is R1+R2.
I do not really care about the programming language, but let's say that I'm looking for a solution in C++ or Python, the two languages I'm most familiar with.
This gives reasonably good results (in MATLAB)
a = ones(40,1) + rand(40,1)*0.1-0.05; % The resistors
vec = zeros(40,2); % Initialize matrix
indices = zeros(40,2); % Initialize matrix
a = sort(a); % Sort vector of resistors
for ii = 1:length(a)
vec(ii,:) = [a(ii) a(ii)]; % Assign resistor values to row ii of vec
indices(ii,:) = [ii,ii]; % Corresponding resistor number (index)
for jj = 1:length(a)
if sum(abs((a(ii)+a(jj))-2*mean(a))) < abs(sum(vec(ii,:))-2*mean(a))
vec(ii,:) = [a(ii) a(jj)]; % Check if the new set is better than the
indices(ii,:) = [ii, jj]; % previous, and update vec and indices if true.
end
end
end
[x, idx] = sort(sum(vec')'); % Sort the sum of the pairs
final_list = indices(idx); % The indices of the sorted pairs
This is the result when I plot it:
This is not optimal but should give somewhat decent results. It's very fast though so if you ever need to choose 1000 pairs out of 10000 resistors...
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#define GROUPS 12
#define N 40
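int compare (const void * a, const void * b)
{
// the array holds floats, so compare as floats (not as ints)
float fa = *(const float*)a;
float fb = *(const float*)b;
return (fa > fb) - (fa < fb);
}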
int main ()
{
// generate random numbers
float *values = (float *)malloc(sizeof(float) * N);
srand(time(0));
for (int i = 0; i < N; i++)
values[i] = 950 + rand()%101;
qsort(values, N, sizeof(float), compare);
// find "best" pairing
float bestrms = -1;
int beststart = -1;
float bestmean = -1;
for (int start = 0; start <= N - 2 * GROUPS; start++)
{
float sum = 0;
for (int i = start; i < start + 2 * GROUPS; i++)
sum += values[i];
float mean = sum / GROUPS;
float square = 0;
for (int i = 0; i < GROUPS; i++)
{
float first = values[start + i];
// in a sorted sequence of 24 resistors, always pair 1st with 24th, 2nd with 23rd, etc
float second = values[start + 2 * GROUPS - 1 - i];
float err = mean - (first + second);
square += err * err;
}
float rms = sqrt(square/GROUPS);
if (bestrms == -1 || rms < bestrms)
{
bestrms = rms;
beststart = start;
bestmean = mean;
}
}
for (int i = 0; i < GROUPS; i++)
{
float first = values[beststart + i];
float second = values[beststart + 2 * GROUPS - 1 - i];
float err = bestmean - (first + second);
printf("(%f, %f) %f %f\n", first, second, first + second, err);
}
printf("mean %f rms %f\n", bestmean, bestrms);
free(values);
}
Sort them and then pair 1 with 2, 3 with 4, 5 with 6 and so on. Find the difference between each pair and sort again, choosing the 12 with the least difference.
Sort them by resistance.
Pair 1 with 40, 2 with 39, etc., compute R1+R2 for each pair, and pick the best set of 12 pairs (this needs another sorting step). Compute the mean of all selected (R1+R2).
Try to refine this initial solution successively by trying to plug in one of the remaining 16 resistors for one of the 24 chosen ones. An attempt is successful if the combined resistance of the new pair is closer to the mean than the combined resistance of the old pair. Repeat this step until you can't find any further improvement.
This solution will definitely not always compute the optimal solution, but it might be good enough. Another idea would be simulated annealing, but that would be a lot more work and still would not guarantee finding the best solution.
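The first two steps of this answer can be sketched in Python as follows. This is only an illustration under assumptions: "best set of 12 pairs" is taken to mean the run of consecutive pair sums with the smallest RMS deviation, the refinement step is left out, and the name `pair_and_select` is invented here.

```python
from statistics import pstdev

def pair_and_select(resistors, groups):
    """Pair smallest with largest, then keep the `groups` pairs whose
    sums are closest together. Assumes groups <= len(resistors) // 2."""
    values = sorted(resistors)
    n = len(values)
    # pair 1st with last, 2nd with second to last, and so on
    pairs = [(values[i], values[n - 1 - i]) for i in range(n // 2)]
    pairs.sort(key=sum)  # the extra sorting step, by combined resistance
    starts = range(len(pairs) - groups + 1)
    best = min(starts,
               key=lambda s: pstdev([sum(p) for p in pairs[s:s + groups]]))
    return pairs[best:best + groups]

chosen = pair_and_select([1, 2, 3, 4, 5, 6, 7, 8], 2)
print(chosen)  # two pairs whose sums are both 9
```

Like the answer itself, this is a heuristic: it gives a starting point for the swap-based refinement rather than a guaranteed optimum.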

For a given cent amount, minimize the number of coin-tubes if all tubes hold 64 but do not need to be filled

Edit: If someone could provide an explained recursive answer(a link would do) to the famous coin change problem this would help a LOT
For a given cent amount, minimize the number of coin-tubes if all tubes can hold 64 coins.
each tube can ONLY hold a single type of coin.
each tube does NOT need to be fully filled.
e.g. for american coins the amounts would be $0.01, $0.05, $0.10, $0.25, $0.50, and $1.00
6 cents could be done as 6 1cent coins in a single tube,
25 cents could be a tube with a single 25c coin or a tube with five 5c coins.
65 cents would be done as 13 5c coins, as 65 1c coins would need to use 2 tubes.
I'm attempting to write a minecraft plugin, and I am having a LOT of difficulty with this algorithm.
A lookup table is a good method.
int[] Coins = new[] { 100, 50, 25, 10, 5, 1 };
int[,] Table = new int[6,6400];
/// Calculate the number of coins of each type that minimizes the number of
/// tubes used.
int[] Tubes(int cents)
{
int[] counts = new int[Coins.Length];
if (cents >= 6400)
{
counts[0] += (cents / 6400) * 64; // number of coins in filled $1-tubes
cents %= 6400;
}
for (int i = 0; i < Coins.Length; i++)
{
int count = Table[i, cents]; // N coins in (N + 63) / 64 tubes
counts[i] += count;
cents -= count * Coins[i];
}
return counts;
}
To calculate the table, you could use this:
void CalculateTable()
{
for (int i = Coins.Length-1; i >= 0; i--)
{
int coin = Coins[i];
for (int cents = 0; cents < 6400; cents++)
{
if (i == Coins.Length-1)
{
// The 1 cent coin can't be divided further
Table[i,cents] = cents;
}
else
{
// Find the count that minimizes the number of tubes.
int n = cents / coin;
int bestTubes = -1;
int bestCount = 0;
for (int count = cents / coin; count >= 0; count--)
{
int cents1 = cents - count * coin;
int tubes = (count + 63) / 64;
// Use the algorithm from Tubes() above, to optimize the
// lesser coins.
for (int j = i+1; j < Coins.Length; j++)
{
int count1 = Table[j, cents1];
cents1 -= count1 * Coins[j];
tubes += (count1 + 63) / 64;
}
if (bestTubes == -1 || tubes < bestTubes)
{
bestTubes = tubes;
bestCount = count;
}
}
// Store the result
Table[i,cents] = bestCount;
}
}
}
}
CalculateTable runs in a few milliseconds, so you don't have to store it to disk.
Example:
Tubes(3149) -> [ 31, 0, 0, 0, 0, 49]
Tubes (3150) -> [ 0, 63, 0, 0, 0, 0]
Tubes (31500) -> [315, 0, 0, 0, 0, 0]
The numbers mean the number of coins. N coins could be put into (N + 63)/64 tubes.
something like this:
a[0] = 100; //cents
a[1] = 50; a[2] = 25; a[3] = 10; a[4] = 5; a[5] = 1;
int cnt[6]; // cnt[i] stores how many coins of type a[i] are used
void rec(int sum_left, int p /* position in array a */) {
if ( p == 5 ) {
cnt[5] = sum_left;
// count how many tubes the cnt array uses; update the current answer if necessary
return;
}
for ( int i = 0; i <= sum_left/a[p]; i++ ) {
// take i coins of type a[p]
cnt[p] = i;
rec(sum_left - i*a[p], p+1);
}
}
int main() {
rec(sum, 0);
}
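For comparison, the recursion above can be written as a short runnable Python sketch. It is an illustration under assumptions rather than a port: it returns only the minimum number of tubes (not the coin counts), it adds memoisation so that repeated (amount, coin position) states are solved once, and the names `min_tubes` and `tubes_for` are invented here.

```python
from functools import lru_cache

COINS = (100, 50, 25, 10, 5, 1)  # cents, largest first
TUBE_CAPACITY = 64

def tubes_for(count):
    """Tubes needed for `count` coins of a single type."""
    return (count + TUBE_CAPACITY - 1) // TUBE_CAPACITY

@lru_cache(maxsize=None)
def min_tubes(cents, p=0):
    """Fewest tubes that hold exactly `cents` using COINS[p:]."""
    coin = COINS[p]
    if p == len(COINS) - 1:
        return tubes_for(cents)  # only 1-cent coins are left
    best = None
    for count in range(cents // coin + 1):
        # take `count` coins of this type, solve the rest recursively
        total = tubes_for(count) + min_tubes(cents - count * coin, p + 1)
        if best is None or total < best:
            best = total
    return best

print(min_tubes(6), min_tubes(25), min_tubes(65))  # 1 1 1
```

The three printed values correspond to the examples in the question: six 1c coins, one 25c coin, and thirteen 5c coins each fit in a single tube.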
Here is a recursive, heuristic and greedy algorithm.
In the array T, each T[i] holds an array of 6 integers.
If the given sum is 65 then you call tubes(65) and then print T[65].
coins[1..6] = {1, 5, 10, 25, 50, 100}
tubes(sum)
if sum < coins[1]
return
for i = 1 to 6
tubes(sum - coins[i])
best-tubes[1..6] = {64, 64, 64, 64, 64, 64}
for i = 1 to 6
if sum - coins[i] >= 0
current-tubes[1..6] = copy of T[sum - coins[i]]
if current-tubes[i] < 64
current-tubes[i] += 1
if current-tubes is better than best-tubes*
best-tubes = current-tubes
T[sum] = best-tubes
To vastly improve the running time, you can check if the current T[sum] has already been evaluated. Adding this check completes the approach called dynamic programming.
*current-tubes is better than best-tubes if it uses fewer tubes, or the same number of tubes with fewer coins, or the same number of tubes but with tubes that hold larger values. This is the greedy part in action.

algorithm to find ten integers>0 that sum to 2011 but their reciprocals sum to 1

find ten integers>0 that sum to 2011 but their reciprocals sum to 1
e.g.
x1+x2+..+x10 = 2011
1/x1+1/x2+..+1/x10 = 1
I found this problem here http://blog.computationalcomplexity.org/2011/12/is-this-problem-too-hard-for-hs-math.html
I was wondering what the computation complexity was, and what types of algorithms can solve it.
EDIT2:
I wrote the following brute force code which is fast enough. I didn't find any solutions though so I need to tweak my assumptions slightly. I'm now confident I will find the solution.
from fractions import Fraction
pairs = [(i,j) for i in range(2,30) for j in range(2,30)]
x1x2 = set((i+j, Fraction(1,i)+Fraction(1,j)) for i,j in pairs)
print('x1x2',len(x1x2))
x1x2x3x4 = set((s1+s2,f1+f2) for s1,f1 in x1x2 for s2,f2 in x1x2 if f1+f2<1)
print('x1x2x3x4',len(x1x2x3x4))
count = 0
for s,f in x1x2x3x4:
count+=1
if count%1000==0:
print('count',count)
s2 = 2011 - s
f2 = 1 - f
for s3,f3 in x1x2:
s4 = s2-s3
if s4>0:
f4 = f2-f3
if f4>0:
if (s4,f4) in x1x2x3x4:
print('s3f3',s3,f3)
print('sf',s,f)
Note that you cannot define computational complexity for a single problem instance, as once you know the answer the computational complexity is O(1), i.e. constant-time. Computational complexity can be only defined for an infinite family of problems.
One approach for solving this type of a problem would be to use backtracking search. Your algorithm spends too much time in searching parts of the 10-dimensional space that can't contain solutions. An efficient backtracking algorithm would
assign the variables in the order x1, x2, ..., x10
maintain the constraint x1 <= x2 <= ... <= x10
during search, always when number xi has been assigned
let S = x1 + ... + xi
let R = 1/x1 + ... + 1/xi
always check that S <= 2011 - (10 - i) * xi
always check that R <= 1 - (1 / [(2011 - S) / (10 - i)])
if these two constraints are not fulfilled during search there can't be a solution any more and the algorithm should backtrack immediately. Note that the constraints are based on the fact that the numbers are assigned in increasing order, i.e. xi <= xi+1 in all cases
Note: you can speed up the search, limiting the search space and making calculations faster, by assuming that all x1, ..., x10 divide a given number evenly, e.g. 960. That is, you only consider such xi that 960 divided by xi is an integer. This makes calculating the fractional part much easier, as instead of checking that 1/x1 + ... equals 1 you can check that 960/x1 + ... equals 960. Because all the divisions are exact and return integers, you don't need floating-point or rational arithmetic at all; everything works with integers only. Of course, the smaller the fixed modulus is, the fewer solutions you can find, but this also makes the search faster.
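The integer-only variant described in the note can be sketched like this in Python. The function names, the generic `total`/`count`/`modulus` parameters, and the exact pruning tests are assumptions made for this illustration; the small demo instance (three numbers summing to 11) is chosen only because it can be checked by hand.

```python
def divisors(m):
    """All positive divisors of m, in increasing order."""
    return [d for d in range(1, m + 1) if m % d == 0]

def search(total, count, modulus):
    """Non-decreasing tuples of `count` divisors of `modulus` whose sum is
    `total` and whose reciprocals sum to 1 (checked as modulus/x sums)."""
    divs = divisors(modulus)
    solutions = []

    def extend(chosen, start, s, r):
        remaining = count - len(chosen)
        if remaining == 0:
            if s == total and r == modulus:
                solutions.append(tuple(chosen))
            return
        for idx in range(start, len(divs)):
            x = divs[idx]
            if s + remaining * x > total:
                break  # even the smallest completion overshoots the sum
            if r + remaining * (modulus // x) < modulus:
                break  # even the largest reciprocals fall short of 1
            if r + modulus // x > modulus:
                continue  # this x alone pushes the reciprocal sum past 1
            chosen.append(x)
            extend(chosen, idx, s + x, r + modulus // x)
            chosen.pop()

    extend([], 0, 0, 0)
    return solutions

print(search(11, 3, 6))  # [(2, 3, 6)]
```

For the original problem one would call `search(2011, 10, modulus)` with some chosen modulus; whether a given modulus admits a solution is not something this sketch establishes.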
I note that one of the things on the next blog in the series, http://blog.computationalcomplexity.org/2011/12/solution-to-reciprocals-problem.html, is a paper on the problem, and a suggested dynamic programming approach to counting the number of answers. Since it is a dynamic programming approach, you should be able to turn that into a dynamic program to find those answers.
Dynamic programming solution (C#) based on the Bill Gasarch paper someone posted. But this does not necessarily find the optimal (minimum number of numbers used) solution. It is only guaranteed to find a solution if allowed to go high enough, but it doesn't have to be with the desired N. Basically, I feel like it "accidentally" works for (10, 2011).
Some example solutions for 2011:
10 numbers: 2, 4, 5, 80, 80, 80, 160, 320, 640, 640
11 numbers: 3, 6, 4, 12, 12, 24, 30, 480, 480, 480, 480
13 numbers: 2, 4, 5, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200
15 numbers: 3, 6, 6, 12, 16, 16, 32, 32, 32, 64, 256, 256, 256, 512, 512
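The listed tuples can be verified with exact rational arithmetic. A short Python check follows; the dictionary name `candidates` and the helper `is_solution` are made up for this illustration.

```python
from fractions import Fraction

# The example tuples quoted above, keyed by how many numbers they use
candidates = {
    10: [2, 4, 5, 80, 80, 80, 160, 320, 640, 640],
    11: [3, 6, 4, 12, 12, 24, 30, 480, 480, 480, 480],
    13: [2, 4, 5] + [200] * 10,
    15: [3, 6, 6, 12, 16, 16, 32, 32, 32, 64, 256, 256, 256, 512, 512],
}

def is_solution(xs, total=2011):
    """True if xs sums to `total` and its reciprocals sum to exactly 1."""
    return sum(xs) == total and sum(Fraction(1, x) for x in xs) == 1

for n, xs in candidates.items():
    print(n, len(xs) == n, is_solution(xs))
```

Using `Fraction` avoids the rounding questions a floating-point comparison would raise.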
Anyone have an idea how to fix it to work in general?
using System;
using System.Collections.Generic;
namespace Recip
{
class Program
{
static void Main(string[] args)
{
int year = 2011;
int numbers = 20;
int[,,] c = new int[year+1, numbers+1, numbers];
List<int> queue = new List<int>();
// need some initial guesses to expand on - use squares because 1/y * y = 1
int num = 1;
do
{
for (int i = 0; i < num; i++)
c[num * num, num, i] = num;
queue.Add(num * num);
num++;
} while (num <= numbers && num * num <= year);
// expand
while (queue.Count > 0)
{
int x0 = queue[0];
queue.RemoveAt(0);
for (int i = 0; i <= numbers; i++)
{
if (c[x0, i, 0] > 0)
{
int[] coefs ={ 20, 4, 2, 2, 3, 3};
int[] cons = { 11, 6, 8, 9, 6, 8};
int[] cool = { 3, 2, 2, 2, 2, 2};
int[] k1 = { 2, 2, 4, 3, 3, 2};
int[] k2 = { 4, 4, 4, 6, 3, 6};
int[] k3 = { 5, 0, 0, 0, 0, 0};
int[] mul = { 20, 4, 2, 2, 3, 3};
for (int k = 0; k < 6; k++)
{
int x1 = x0 * coefs[k] + cons[k];
int c1 = i + cool[k];
if (x1 <= year && c1 <= numbers && c[x1, c1, 0] == 0)
{
queue.Add(x1);
c[x1, c1, 0] = k1[k];
c[x1, c1, 1] = k2[k];
int index = 2;
if (k == 0)
{
c[x1, c1, index] = k3[k];
index++;
}
int diff = index;
while (c[x0, i, index - diff] > 0)
{
c[x1, c1, index] = c[x0, i, index - diff] * mul[k];
index++;
}
}
}
}
}
}
for (int n = 1; n < numbers; n++)
{
if (c[year, n, 0] == 0) continue;
int ind = 0;
while (ind < n && c[year, n, ind] > 0)
{
Console.Write(c[year, n, ind] + ", ");
ind++;
}
Console.WriteLine();
}
Console.ReadLine();
}
}
}
There are Choose(2011,10) or about 10^26 sets of 10 numbers that add up to 2011. So, in order for a brute force approach to work, the search tree would have to be trimmed significantly.
Fortunately, there are a few ways to do that.
The first obvious way is to require that the numbers are ordered. This reduces the number of options by a factor of around 10^7.
The second is that we can detect early if our current partial solution can never lead to a complete solution. Since our values are ordered, the remaining numbers in the set are at least as large as the current number. Note that the sum of the numbers increases as the numbers get larger, while the sum of the reciprocals decreases.
There are two sure ways we can tell we're at a dead end:
We get the smallest possible total from where we are when we take all remaining numbers to be the same as the current number. If this smallest sum is too big, we'll never get less.
We get the largest possible sum of reciprocals when we take all remaining numbers to be the same as the current number. If this largest sum is less than 1, we'll never get to 1.
These two conditions set an upper bound on the next xi.
Thirdly, we can stop looking if our partial sum of reciprocals is greater than or equal to 1.
Putting all this together, here is a solution in C#:
static int[] x = new int[10];
static void Search(int depth, int xi, int sum, double rsum) {
if (depth == 9) {
// We know exactly what the last number should be
// to make the sum 2011:
xi = 2011 - sum;
// Now check if the sum of reciprocals adds up as well
if (Math.Abs(rsum + 1.0 / xi - 1.0) < 1e-12) {
// We have a winner!
x[depth] = xi;
var s = string.Join(" ", Array.ConvertAll(x, n => n.ToString()));
Console.WriteLine(s);
}
} else {
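int lastxi = xi;
// Reciprocal sum still needed from the remaining numbers
// (positive here, because we only recurse while rsum < 1):
double remainder = 1.0 - rsum;
// There are 2 ways xi can be too large:
xi = Math.Min(
// 1. If adding it (10 - depth) times to the sum
// is greater than our total:
(2011 - sum) / (10 - depth),
// 2. If adding (10 - depth) times its reciprocal
// is less than the remainder (small epsilon guards against rounding):
(int)((10 - depth) / remainder + 1e-9));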
// We iterate towards smaller xi so we can stop
// when the reciprocal sum is too large:
while (xi >= lastxi) {
double newRSum = rsum + 1.0 / xi;
if (newRSum >= 1.0)
break;
x[depth] = xi;
Search(depth + 1, xi, sum + xi, newRSum);
xi--;
}
}
}
Search(0, 1, 0, 0);
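The same two upper bounds can also be sketched in Python with exact fractions instead of a floating-point tolerance. This is an illustration under assumptions, not a translation of the C# above: the function name `find_tuples` and its generic `total`/`count` parameters are invented here, and the demo uses a tiny instance that can be checked by hand rather than the full (2011, 10) search.

```python
from fractions import Fraction

def find_tuples(total, count):
    """Non-decreasing tuples of `count` positive integers that sum to
    `total` and whose reciprocals sum to exactly 1."""
    solutions = []

    def extend(chosen, lo, s, r):
        remaining = count - len(chosen)
        if remaining == 1:
            x = total - s  # the last number is forced by the sum
            if x >= lo and r + Fraction(1, x) == 1:
                solutions.append(tuple(chosen) + (x,))
            return
        need = 1 - r  # reciprocal sum still required, always > 0 here
        # x is too large if remaining * x overshoots the sum, or if
        # remaining copies of 1/x can no longer reach `need`
        hi = min((total - s) // remaining, int(remaining / need))
        for x in range(lo, hi + 1):
            new_r = r + Fraction(1, x)
            if new_r >= 1:
                continue  # nothing would be left for the other numbers
            extend(chosen + [x], x, s + x, new_r)

    extend([], 1, 0, Fraction(0))
    return solutions

print(find_tuples(11, 3))  # [(2, 3, 6)]
```

Exact fractions are slower than doubles, but they remove the need for the 1e-12 comparison at the leaf.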
If you used a brute force algorithm to iterate through all the combinations, you'd end up with the answers. But I don't think it's quite as big as 10*2011*2011. Since you can easily arbitrarily postulate that x1
I think a brute force approach would easily get the answer. However I would imagine that the instructor is looking for a mathematical approach. I'm thinking the '1' must have some significance with regards to finding how to manipulate the equations to the answer. The '2011' seems arbitrary.

Resources