I hope someone can help here.
I have a large byte vector from which I create a small byte vector (based on a mask), which I then process with SIMD.
Currently the mask is an array of baseOffset + submask (byte[256]), optimized for storage as there are > 10^8 of them. I create a max-size subvector, then loop through the mask array, multiply the baseOffset by 256, and for each set bit in the mask load the byte from the large vector and append it to the smaller vector sequentially. The smaller vector is then processed via a number of VPMADDUBSW instructions and accumulated. I can change this structure, e.g. walk the bits once to fill an 8K-bit array buffer and then create the small vector.
Is there a faster way I can create the subarray?
I pulled the code out of the app into a test program but the original is in a state of flux ( moving to AVX2 and pulling more out of C# )
#include "stdafx.h"
#include<stdio.h>
#include <mmintrin.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <smmintrin.h>
#include <immintrin.h>
//from
char N[4096] = { 9, 5, 5, 5, 9, 5, 5, 5, 5, 5 };
//W
char W[4096] = { 1, 2, -3, 5, 5, 5, 5, 5, 5, 5 };
char buffer[4096] ;
__declspec(align(2))
struct packed_destination{
char blockOffset;
__int8 bitMask[32];
};
__m128i sum = _mm_setzero_si128();
packed_destination packed_destinations[10];
void process128(__m128i u, __m128i s)
{
__m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw
__m128i loints = _mm_cvtepi16_epi32(calc);
__m128i hiints = _mm_cvtepi16_epi32(_mm_shuffle_epi32(calc, 0x4e));
sum = _mm_add_epi32(_mm_add_epi32(loints, hiints), sum);
}
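void process_array(char n[], char w[], int length)
{
sum = _mm_setzero_si128();
// One __m128i holds 128 bits = 16 bytes, so step through the arrays 16 bytes at a time.
int blocks = length >> 4;
for (int i = 0; i < blocks; i++)
{
// Unaligned loads: the global char arrays are not guaranteed to be 16-byte aligned.
__m128i u = _mm_loadu_si128((__m128i*)&n[i * 16]);
__m128i s = _mm_loadu_si128((__m128i*)&w[i * 16]);
process128(u, s);
}
}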
void populate_buffer_from_vector(packed_destination packed_destinations[], char n[] , int dest_length)
{
int buffer_dest_index = 0;
for (int i = 0; i < dest_length; i++)
{
int blockOffset = packed_destinations[i].blockOffset <<8 ;
// go through mask and copy to buffer
for (int j = 0; j < 32; j++)
{
// First element covered by mask byte j (8 elements per mask byte).
// Parentheses are needed: + binds tighter than <<.
int joffset = blockOffset + (j << 3);
int mask = packed_destinations[i].bitMask[j];
// Bit k of the mask selects n[joffset + k].
if (mask & (1 << 0))
buffer[buffer_dest_index++] = n[joffset + 0];
if (mask & (1 << 1))
buffer[buffer_dest_index++] = n[joffset + 1];
if (mask & (1 << 2))
buffer[buffer_dest_index++] = n[joffset + 2];
if (mask & (1 << 3))
buffer[buffer_dest_index++] = n[joffset + 3];
if (mask & (1 << 4))
buffer[buffer_dest_index++] = n[joffset + 4];
if (mask & (1 << 5))
buffer[buffer_dest_index++] = n[joffset + 5];
if (mask & (1 << 6))
buffer[buffer_dest_index++] = n[joffset + 6];
if (mask & (1 << 7))
buffer[buffer_dest_index++] = n[joffset + 7];
};
}
}
int _tmain(int argc, _TCHAR* argv[])
{
for (int i = 0; i < 32; ++i)
{
packed_destinations[0].bitMask[i] = 0x0f;
packed_destinations[1].bitMask[i] = 0x04;
}
packed_destinations[1].blockOffset = 1;
populate_buffer_from_vector(packed_destinations, N, 1);
process_array(buffer, W, 256);
int val = sum.m128i_i32[0] +
sum.m128i_i32[1] +
sum.m128i_i32[2] +
sum.m128i_i32[3];
printf("sum is %d" , val);
printf("Press Any Key to Continue\n");
getchar();
return 0;
}
Normally mask usage would be 5-15%; for some workloads it would be 25-100%.
MASKMOVDQU is close, but then we would have to repack/swl according to the mask before saving.
A couple of optimisations for your existing code:
If your data is sparse then it would probably be a good idea to add an additional test of each 8 bit mask value prior to testing the additional bits, i.e.
int mask = packed_destinations[i].bitMask[j];
if (mask != 0)
{
if (mask & (1 << 0))
buffer[buffer_dest_index++] = n[joffset + 0]; // bit k selects n[joffset + k]
if (mask & (1 << 1))
buffer[buffer_dest_index++] = n[joffset + 1];
...
Secondly your process128 function can be optimised considerably:
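inline __m128i process128(const __m128i u, const __m128i s, const __m128i sum)
{
const __m128i vk1 = _mm_set1_epi16(1);
__m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw: 16-bit sums of adjacent byte products
calc = _mm_madd_epi16(calc, vk1); // pmaddwd by 1: adds adjacent 16-bit values into 32-bit lanes
return _mm_add_epi32(sum, calc); // paddd
}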
Note that as well as reducing the SSE instruction count from 6 to 3, I've also made sum a parameter, to get away from any dependency on global variables (it's always a good idea to avoid globals, not only for good software engineering but also because they can inhibit certain compiler optimisations).
It would be interesting to see a profile of your code (using a decent sampling profiler, not via instrumentation), since this would help to prioritise any further optimisation efforts.
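For the sparse case mentioned in the question (5-15% mask usage), another thing that might be worth trying is to read the mask 64 bits at a time and jump straight from one set bit to the next with a bit scan, instead of testing all eight bits of every byte. A minimal sketch, assuming a little-endian target (true for x86) and assuming that bit k of mask byte j selects element base + 8*j + k, which is my reading of the intended layout. `gather_masked_256` and `lowest_set_bit` are hypothetical names, not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the lowest set bit; m must be non-zero.
static inline unsigned lowest_set_bit(std::uint64_t m) {
    assert(m != 0);
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward64(&idx, m);  // MSVC intrinsic, x64 only
    return static_cast<unsigned>(idx);
#else
    return static_cast<unsigned>(__builtin_ctzll(m));  // GCC/Clang builtin
#endif
}

// Copies src[base + k] to dst for every set bit k of a 256-bit mask
// (bit k lives in bit k % 8 of mask_bytes[k / 8]).
// Returns the number of bytes written; dst needs room for up to 256 bytes.
std::size_t gather_masked_256(const std::uint8_t mask_bytes[32],
                              const char* src, std::size_t base, char* dst) {
    std::size_t out = 0;
    for (int word = 0; word < 4; ++word) {
        std::uint64_t m;
        // Assumption: little-endian, so mask byte j lands in bits 8j..8j+7.
        std::memcpy(&m, mask_bytes + 8 * word, sizeof m);
        while (m != 0) {
            unsigned bit = lowest_set_bit(m);
            dst[out++] = src[base + 64 * word + bit];
            m &= m - 1;  // clear the lowest set bit
        }
    }
    return out;
}
```

The loop body runs once per set bit rather than once per mask bit, so empty regions of the mask cost almost nothing. Whether this beats the byte-wise loop depends on how dense the masks are, so it is worth measuring both.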
Is there a performant way to generate an unbiased 64b random integer without 3 set bits in a row, assuming a fast-and-unbiased input PRNG? I don't care about 'wasting bits' of the input source.
That is, something better than the naive rejection-sampling approach:
uint64_t r;
do {
r = get_rand_64();
} while (r & (r >> 1) & (r >> 2));
...which "works", but is very slow. It looks like it's iterating ~187x on average or so.
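The ~187x figure can be checked by counting the valid words: the number of n-bit words without three consecutive set bits follows a Tribonacci-style recurrence, and the expected number of draws is 2^64 divided by that count. A small sketch (the function names are made up for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of n-bit words with no three consecutive set bits:
// T(0) = 1, T(1) = 2, T(2) = 4, T(n) = T(n-1) + T(n-2) + T(n-3).
std::uint64_t count_no111(int n) {
    assert(0 <= n && n <= 64);
    std::uint64_t t[65] = {1, 2, 4};
    for (int i = 3; i <= n; ++i) {
        t[i] = t[i - 1] + t[i - 2] + t[i - 3];
    }
    return t[n];
}

// Expected number of draws for rejection sampling on n-bit words: 2^n / T(n).
double expected_rejection_draws(int n) {
    return std::ldexp(1.0, n) / static_cast<double>(count_no111(n));
}
```

For n = 64 this gives roughly 187.25 draws per accepted value, which matches the observed iteration count.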
One possibility I've explored is roughly:
bool p2 = get_rand_bit();
bool p1 = get_rand_bit();
uint64_t r = (p1 << 1) | p2;
for (int i = 2; i < 64; i++) {
bool p0 = (p1 && p2) ? false : get_rand_bit();
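r |= (uint64_t)p0 << i; // widen before shifting: a bool promotes to int, and shifting an int by up to 63 is undefined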
p2 = p1;
p1 = p0;
}
...however, this is still slow, mainly because with this approach the entire calculation is bit-serial. EDIT: it's also biased. This is easiest to see with a 3-bit integer: with the code above, 0b011 occurs 1/4 of the time and each of the other six values 1/8 of the time, whereas every value should occur 1/7 of the time.
I've tried doing various parallel fixups, but haven't been able to come up with anything unbiased. It's useful to play around with 4-bit integers first, e.g. setting all bits involved in a conflict to random values ends up biased, and drawing out the Markov chain for 4 bits makes that obvious.
Is there a better way to do this?
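To make the 4-bit Markov chain concrete, here is a sketch that computes the exact output distribution of the "re-randomize every bit involved in a conflict" fixup by pushing probability mass through the chain until essentially all of it sits in valid states. The names are illustrative, and the fixed round count is an assumption that is generous for 4 bits (an invalid state keeps at most 1/8 of its mass per round):

```cpp
#include <array>
#include <cassert>

constexpr unsigned kBits = 4;
constexpr unsigned kStates = 1u << kBits;

// Mask of all bits that take part in a run of three or more set bits.
unsigned conflict_mask(unsigned r) {
    unsigned m = r & (r >> 1) & (r >> 2);
    m |= (m << 1) | (m << 2);
    return m & (kStates - 1);
}

// Distribution over 4-bit values after repeatedly re-randomizing conflicts,
// starting from a uniformly random 4-bit value.
std::array<double, kStates> fixup_distribution(int rounds = 200) {
    std::array<double, kStates> cur;
    cur.fill(1.0 / kStates);
    for (int round = 0; round < rounds; ++round) {
        std::array<double, kStates> next{};
        for (unsigned r = 0; r < kStates; ++r) {
            const unsigned mask = conflict_mask(r);
            if (mask == 0) {  // valid word: absorbing state
                next[r] += cur[r];
                continue;
            }
            // XOR with a uniformly random submask of mask, i.e. fresh
            // random bits in every conflicting position.
            unsigned choices = 0;
            for (unsigned x = mask;; x = (x - 1) & mask) {
                ++choices;
                if (x == 0) break;
            }
            for (unsigned x = mask;; x = (x - 1) & mask) {
                next[r ^ x] += cur[r] / choices;
                if (x == 0) break;
            }
        }
        cur = next;
    }
    return cur;
}
```

Working the chain by hand gives the same picture: 0b0000 ends up with probability 3/35 and 0b1001 with 1/15, while an unbiased generator would give each of the 13 valid words 1/13.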
I optimized the lexicographic decoder, resulting in a four-fold speedup relative to my previous answer. There are two new ideas:
Use the one-to-one correspondence implied by the recurrence T(n) = T(k−1) T(n−k) + T(k−2) T(n−k−1) + T(k−2) T(n−k−2) + T(k−3) T(n−k−1) to avoid working one bit at a time;
Cache the small words without 111 in addition to the recurrence values, incurring an L1 cache hit to save a number of arithmetic operations.
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
enum { kTribonacci14 = 5768 };
static uint64_t g_tribonacci[65];
static void InitTribonacci(void) {
for (unsigned i = 0; i < 65; i++) {
g_tribonacci[i] =
i < 3 ? 1 << i
: g_tribonacci[i - 1] + g_tribonacci[i - 2] + g_tribonacci[i - 3];
}
assert(g_tribonacci[14] == kTribonacci14);
}
static uint16_t g_words_no_111[kTribonacci14];
static void InitCachedWordsNo111(void) {
unsigned i = 0;
for (unsigned word = 0; word < ((unsigned)1 << 14); word++) {
if ((word & (word >> 1) & (word >> 2)) == 0) {
assert(i < kTribonacci14);
g_words_no_111[i++] = (uint16_t)word;
}
}
assert(i == kTribonacci14);
}
static bool CaseNo111(uint64_t *restrict result, unsigned *restrict n,
uint64_t *restrict index, unsigned left_n,
unsigned right_n) {
uint64_t left_count = g_tribonacci[left_n];
uint64_t right_count = g_tribonacci[right_n];
uint64_t product = left_count * right_count;
if (*index >= product) {
*index -= product;
return false;
}
*result = (*result << left_n) + g_words_no_111[*index / right_count];
*n = right_n;
*index %= right_count;
return true;
}
static void Append(uint64_t *result, uint64_t bit) {
*result = (*result << 1) + bit;
}
static uint64_t DecodeNo111(unsigned n, uint64_t index) {
assert(0 <= n && n <= 64);
assert(index < g_tribonacci[n]);
uint64_t result = 0;
while (n > 14) {
assert(g_tribonacci[n] == g_tribonacci[12] * g_tribonacci[n - 13] +
g_tribonacci[11] * g_tribonacci[n - 14] +
g_tribonacci[11] * g_tribonacci[n - 15] +
g_tribonacci[10] * g_tribonacci[n - 14]);
if (CaseNo111(&result, &n, &index, 12, n - 13)) {
Append(&result, 0);
} else if (CaseNo111(&result, &n, &index, 11, n - 14)) {
Append(&result, 0);
Append(&result, 1);
Append(&result, 0);
} else if (CaseNo111(&result, &n, &index, 11, n - 15)) {
Append(&result, 0);
Append(&result, 1);
Append(&result, 1);
Append(&result, 0);
} else if (CaseNo111(&result, &n, &index, 10, n - 14)) {
Append(&result, 0);
Append(&result, 1);
Append(&result, 1);
Append(&result, 0);
} else {
assert(false);
}
}
return (result << n) + g_words_no_111[index];
}
static void PrintWord(unsigned n, uint64_t word) {
assert(0 <= n && n <= 64);
while (n-- > 0) {
putchar('0' + ((word >> n) & 1));
}
putchar('\n');
}
int main(void) {
InitTribonacci();
InitCachedWordsNo111();
if ((false)) {
enum { kN = 20 };
for (uint64_t i = 0; i < g_tribonacci[kN]; i++) {
PrintWord(kN, DecodeNo111(kN, i));
}
}
uint64_t sum = 0;
uint64_t index = 0;
for (uint32_t i = 0; i < 10000000; i++) {
sum += DecodeNo111(64, index % g_tribonacci[64]);
index = (index * 2862933555777941757) + 3037000493;
}
return sum & 127;
}
From @John Coleman's comment, here's the start of an approach based on Tribonacci numbers. Basic idea:
Generate an unbiased number in the range [0..T(bits)), where T(0) = 1, T(1) = 2, T(2) = 4, T(n) = T(n-1) + T(n-2) + T(n-3).
Convert to Tribonacci representation.
You're done.
A minimal example is as follows:
// 1, 2, 4, TRIBO[n-3]+TRIBO[n-2]+TRIBO[n-1]
// possible minor perf optimization: reverse TRIBO
static const uint64_t TRIBO[65] = {1, 2, 4, 7, 13, 24, 44, 81, 149, 274, 504, 927, 1705, 3136, 5768, 10609, 19513, 35890, 66012, 121415, 223317, 410744, 755476, 1389537, 2555757, 4700770, 8646064, 15902591, 29249425, 53798080, 98950096, 181997601, 334745777, 615693474, 1132436852, 2082876103, 3831006429, 7046319384, 12960201916, 23837527729, 43844049029, 80641778674, 148323355432, 272809183135, 501774317241, 922906855808, 1697490356184, 3122171529233, 5742568741225, 10562230626642, 19426970897100, 35731770264967, 65720971788709, 120879712950776, 222332455004452, 408933139743937, 752145307699165, 1383410902447554, 2544489349890656, 4680045560037375, 8607945812375585, 15832480722303616, 29120472094716576, 53560898629395777, 98513851446415969];
// exclusive of max
extern uint64_t get_rand_64_range(uint64_t max);
uint64_t get_rand_no111(void) {
uint64_t idx = get_rand_64_range(TRIBO[64]);
uint64_t ret = 0;
for (int i = 63; i >= 0; i--) {
if (idx >= TRIBO[i]) {
ret |= ((uint64_t) 1) << i;
idx -= TRIBO[i];
}
// optional: if (idx == 0) {break;}
}
return ret;
}
(Warning: retyped from Python code. I suggest testing.)
This satisfies the 'unbiased' portion, and is indeed faster than the naive rejection-sampling approach, but unfortunately is still pretty slow, because it's looping ~64 times.
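Since the code above was retyped, one cheap sanity check is to run the same greedy loop exhaustively for small word sizes and confirm that it hits every word without "111" exactly once. A sketch of such a check (the function name and the 20-bit cap are my own choices, not part of the answer):

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Checks, for a small word size, that greedy Tribonacci decoding maps
// [0, T(bits)) one-to-one onto the bits-bit words without "111".
bool greedy_decoding_is_bijection(int bits) {
    assert(1 <= bits && bits <= 20);  // keep the exhaustive loop small
    std::uint64_t tribo[21] = {1, 2, 4};
    for (int i = 3; i <= bits; ++i) {
        tribo[i] = tribo[i - 1] + tribo[i - 2] + tribo[i - 3];
    }
    std::set<std::uint64_t> seen;
    for (std::uint64_t index = 0; index < tribo[bits]; ++index) {
        std::uint64_t idx = index;
        std::uint64_t word = 0;
        for (int i = bits - 1; i >= 0; --i) {  // same loop as get_rand_no111
            if (idx >= tribo[i]) {
                word |= std::uint64_t{1} << i;
                idx -= tribo[i];
            }
        }
        if (idx != 0) return false;                          // index fully consumed
        if (word & (word >> 1) & (word >> 2)) return false;  // no "111"
        seen.insert(word);
    }
    // T(bits) distinct valid words means every valid word is produced once.
    return seen.size() == tribo[bits];
}
```

If the check holds for small sizes, a uniform index in [0, T(64)) gives a uniform word for 64 bits by the same argument.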
The idea behind the code below is to generate the upper 32 bits with the proper (non-uniform!) distribution, then generate the lower 32 conditional on the upper. On my laptop, it’s significantly faster than the baseline, and slightly faster than lexicographic decoding.
You can see the logic behind the non-uniform upper distribution with 4-bit outputs: 00 and 10 have four 2-bit lowers, 01 has three lowers, and 11 has two lowers.
#include <cstdint>
#include <random>
namespace {
using Generator = std::mt19937_64;
template <int bits> std::uint64_t GenerateUniform(Generator &gen) {
static_assert(0 <= bits && bits <= 63);
return gen() & ((std::uint64_t{1} << bits) - 1);
}
template <> std::uint64_t GenerateUniform<64>(Generator &gen) { return gen(); }
template <int bits> std::uint64_t GenerateNo111Baseline(Generator &gen) {
std::uint64_t r;
do {
r = GenerateUniform<bits>(gen);
} while (r & (r >> 1) & (r >> 2));
return r;
}
template <int bits> struct Tribonacci {
static constexpr std::uint64_t value = Tribonacci<bits - 1>::value +
Tribonacci<bits - 2>::value +
Tribonacci<bits - 3>::value;
};
template <> struct Tribonacci<0> { static constexpr std::uint64_t value = 1; };
template <> struct Tribonacci<-1> { static constexpr std::uint64_t value = 1; };
template <> struct Tribonacci<-2> { static constexpr std::uint64_t value = 0; };
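template <int bits> std::uint64_t GenerateNo111(Generator &gen) {
constexpr int upper_bits = 16;
constexpr int lower_bits = bits - upper_bits;
for (;;) {
// Draw a fresh upper word on every attempt; retrying with a rejected one
// would make the upper part uniform instead of weighted by its lowers.
const std::uint64_t upper = GenerateNo111Baseline<upper_bits>(gen);
if ((upper & 1) == 0) {
// Upper ends in 0: all Tribonacci<lower_bits> lowers are allowed.
return (upper << lower_bits) + GenerateNo111<lower_bits>(gen);
}
// Accept with probability (allowed lowers) / Tribonacci<lower_bits>.
std::uint64_t outcome = std::uniform_int_distribution<std::uint64_t>{
0, Tribonacci<lower_bits>::value - 1}(gen);
if ((upper & 2) == 0) {
// Upper ends in 01: the lower may start with 10 ...
if (outcome < Tribonacci<lower_bits - 2>::value) {
return (upper << lower_bits) + (std::uint64_t{1} << (lower_bits - 1)) +
GenerateNo111<lower_bits - 2>(gen);
}
outcome -= Tribonacci<lower_bits - 2>::value;
}
// ... or with 0, which is the only option when the upper ends in 11.
if (outcome < Tribonacci<lower_bits - 1>::value) {
return (upper << lower_bits) + GenerateNo111<lower_bits - 1>(gen);
}
}
}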
#define BASELINE(bits) \
template <> std::uint64_t GenerateNo111<bits>(Generator & gen) { \
return GenerateNo111Baseline<bits>(gen); \
}
BASELINE(0)
BASELINE(1)
BASELINE(2)
BASELINE(3)
BASELINE(4)
BASELINE(5)
BASELINE(6)
BASELINE(7)
BASELINE(8)
BASELINE(9)
BASELINE(10)
BASELINE(11)
BASELINE(12)
BASELINE(13)
BASELINE(14)
BASELINE(15)
BASELINE(16)
#undef BASELINE
static const std::uint64_t TRIBO[65] = {1,
2,
4,
7,
13,
24,
44,
81,
149,
274,
504,
927,
1705,
3136,
5768,
10609,
19513,
35890,
66012,
121415,
223317,
410744,
755476,
1389537,
2555757,
4700770,
8646064,
15902591,
29249425,
53798080,
98950096,
181997601,
334745777,
615693474,
1132436852,
2082876103,
3831006429,
7046319384,
12960201916,
23837527729,
43844049029,
80641778674,
148323355432,
272809183135,
501774317241,
922906855808,
1697490356184,
3122171529233,
5742568741225,
10562230626642,
19426970897100,
35731770264967,
65720971788709,
120879712950776,
222332455004452,
408933139743937,
752145307699165,
1383410902447554,
2544489349890656,
4680045560037375,
8607945812375585,
15832480722303616,
29120472094716576,
53560898629395777,
98513851446415969};
std::uint64_t get_rand_no111(Generator &gen) {
std::uint64_t idx =
std::uniform_int_distribution<std::uint64_t>{0, TRIBO[64] - 1}(gen);
std::uint64_t ret = 0;
for (int i = 63; i >= 0; --i) {
if (idx >= TRIBO[i]) {
ret |= std::uint64_t{1} << i;
idx -= TRIBO[i];
}
}
return ret;
}
} // namespace
int main() {
Generator gen{std::random_device{}()};
std::uint64_t sum = 0;
for (std::int32_t i = 0; i < 10000000; i++) {
if constexpr (true) {
sum += GenerateNo111<64>(gen);
} else {
sum += get_rand_no111(gen);
}
}
return sum & 127;
}
What about the following simple idea:
1. Generate a random r.
2. Find the mask of every window in r that contains 3 or more consecutive 1s.
3. If the mask is 0 (no run of 3 or more set bits), return r.
4. Replace the "incorrect" bits under that mask with new random ones.
5. Go to step 2.
Code sample (not tested, compiled only):
uint64_t rand_no3() {
uint64_t r, mask;
for(r = get_rand_64() ; ; ) {
mask = r & (r >> 1) & (r >> 2);
mask |= (mask << 1) | (mask << 2);
if(mask == 0)
return r;
r ^= mask & get_rand_64();
}
}
Another variant of the same code, with a single call site for get_rand_64():
uint64_t rand_no3() {
uint64_t r = 0, mask = ~0ULL;
do {
r ^= mask & get_rand_64();
mask = r & (r >> 1) & (r >> 2);
mask |= (mask << 1) | (mask << 2);
} while(mask != 0);
return r;
}
Note that r has to be initialized here: the first iteration XORs random bits into r rather than overwriting it, and reading an uninitialized variable is undefined behaviour in C.
You could generate the number one bit at a time, keeping track of the number of consecutive set bits. Whenever you have two consecutive set bits, you insert an unset bit and set the count back to 0.
The product of the sequence of binomials reads
$$\prod_{i=0}^{d-1} (a_i + b_i x),$$
where {a_i} and {b_i} are the coefficients of the binomials.
I need to expand it into a polynomial
$$\sum_{k=0}^{d} c_k x^k$$
and use all coefficients {c_k} of the polynomial afterwards.
How to expand it efficiently? The speed has priority over the memory occupation because the expansion will be used many times.
What I tried
So far I have only come up with an update scheme that expands the polynomial right after absorbing each binomial.
This scheme needs two arrays — one for results up to i-1 and the other for results up to i.
Here is the C++ code for my naive scheme, but I think this question is irrelevant to what language is used.
#include <iostream>
#include <vector>
int main()
{
using namespace std;
// just an example, the coefficients are actually real numbers in [0,1]
unsigned d = 3;
vector<double> a;
vector<double> b;
a.resize(d, 1); b.resize(d, 1);
// given two arrays, a[] and b[], of length d
vector< vector<double> > coefficients(2);
coefficients[0].resize(d + 1);
coefficients[1].resize(d + 1);
if (d > 0) {
auto &coeff = coefficients[0]; // i = 0
coeff[0] = a[0];
coeff[1] = b[0];
for (unsigned i = 1; i < d; ++i) {// i : [1, d-1]
const auto ai = a[i];
const auto bi = b[i];
const auto &oldCoeff = coefficients[(i-1)%2];
auto &coeff = coefficients[i%2];
coeff[0] = oldCoeff[0] * ai; // j = 0
for (unsigned j = 1; j <= i; ++j) { // j : [1, i]
coeff[j] = oldCoeff[j] * ai + oldCoeff[j-1] * bi;
}
coeff[i+1] = oldCoeff[i] * bi; // j = i + 1
}
}
const auto &coeff = coefficients[(d-1)%2];
for (unsigned i = 0; i < d; ++i) {
cout << coeff[i] << "\t";
}
cout << coeff[d] << '\n';
}
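The same update can also be done in a single array: if j runs from the highest power down to 1, coeff[j-1] has not been overwritten yet when coeff[j] is computed, so the second buffer (and the %2 indexing) is not needed. A sketch under that observation; `expand_binomials` is just an illustrative name, and the arithmetic is still O(d^2) as in the code above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands prod_i (a[i] + b[i] * x) and returns c, where c[k] multiplies x^k.
// a and b must have the same length d; the result has d + 1 entries.
std::vector<double> expand_binomials(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> c;
    c.reserve(a.size() + 1);
    c.push_back(1.0);  // the empty product is the constant polynomial 1
    for (std::size_t i = 0; i < a.size(); ++i) {
        c.push_back(0.0);  // make room for the new highest power
        // Walk from the highest power down so c[j - 1] is still the old value.
        for (std::size_t j = c.size() - 1; j >= 1; --j) {
            c[j] = c[j] * a[i] + c[j - 1] * b[i];
        }
        c[0] *= a[i];
    }
    return c;
}
```

With a = b = {1, 1, 1} this returns {1, 3, 3, 1}, the coefficients of (1 + x)^3. If the expansion is called many times with the same d, reusing one preallocated vector instead of returning a new one would avoid the allocation as well.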
I came across this question during an interview -
Convert a number source to target in the minimum number of operations.
Allowed operations:
Multiplication by 2.
Addition of 1.
Subtraction of 1.
0 < source, target <= 1000.
I tried going the naive recursive route (O(3^n)), i.e. subtract 1, add 1 and multiply by 2 at each level, to try to find a solution that I could extend to dynamic programming, but couldn't because of an infinite loop.
//Naive approach Via Recursion
int minMoves(int source, int target){
if(source <1 || source > target){
return -1;
}
int moves =0;
// Potential infinite loop - consider 3,6-> 2,6- >1,6->(0,6)x (2,6)->1,6->(0,6)x (1,6)->(0,6)x (2,6)->1,6..
int movesLeft = minMoves(source -1, target) ==-1? Integer.MAX_VALUE:minMoves(source -1, target);
int movesRight = minMoves(source +1, target) ==-1? Integer.MAX_VALUE:minMoves(source +1, target);
int moves2X = minMoves(2*source, target) ==-1? Integer.MAX_VALUE:minMoves(2*source, target);
moves = 1+ Math.min(Math.min(movesRight,movesLeft), moves2X);
return moves;
}
Any ideas on how I can tweak my solution? Or possibly a better way to solve it?
If you think about your solution as a graph traversal, where each node is an intermediate value you can produce, your recursive solution is like a depth-first search (DFS). You have to fully expand one "branch" of the search space, trying all solutions in it, before you can proceed anywhere else. If you have an infinite loop, this means it will never terminate even if a shorter path exists, and even if you don't have an infinite loop, you still have to search the rest of the solution space to make sure the result is optimal.
Instead, consider an approach similar to breadth-first search (BFS). You expand outward uniformly, and will never search a path longer than the optimal solution. Just use a FIFO queue to schedule which node to visit next. This is the approach I've taken with my solver.
from queue import Queue
def solve(source, target):
    queue = Queue()
    path = [source]
    queue.put(path)
    while source != target:
        queue.put(path + [source * 2])
        queue.put(path + [source + 1])
        queue.put(path + [source - 1])
        path = queue.get()
        source = path[-1]
    return path
if __name__ == "__main__":
    print(solve(4,79))
One way to speed up (and possibly fix) this code while keeping the recursive implementation is to use memoization.
The issue here is that you are recalculating the same values many times. Instead, you can use a map to store the results you have already calculated and reuse them when you need them again.
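Memoizing the forward recursion as written does not remove the cycle between x+1 and x-1 by itself, so here is a sketch of one memoized variant that avoids it by working backwards from the target. This is a swapped-in formulation, not the code from the question: going backwards, the moves are "halve" (even values only), +1 and -1. Above the source, an even value either halves immediately or only counts down, and an odd value first steps to one of its even neighbours. All names are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>

// Minimum number of reverse moves (halve if even, +1, -1) to get from y to s.
static int reverse_cost(int y, int s, std::unordered_map<int, int>& memo) {
    if (y <= s) return s - y;  // only forward decrements can bridge this gap
    auto it = memo.find(y);
    if (it != memo.end()) return it->second;
    int best = y - s;  // never double: reach y from s with increments only
    if (y % 2 == 0) {
        best = std::min(best, 1 + reverse_cost(y / 2, s, memo));
    } else {
        best = std::min(best, 1 + reverse_cost(y - 1, s, memo));
        best = std::min(best, 1 + reverse_cost(y + 1, s, memo));
    }
    memo[y] = best;
    return best;
}

// Minimum number of *2, +1, -1 operations to turn source into target.
int min_moves_memo(int source, int target) {
    assert(source > 0 && target > 0);
    std::unordered_map<int, int> memo;
    return reverse_cost(target, source, memo);
}
```

The recursion terminates because every odd value moves to an even neighbour, and every even value above the source is halved, so the argument shrinks quickly. For source 4 and target 79 it returns 6, the length of the path 4, 5, 10, 20, 40, 80, 79.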
This problem can be solved constructively. First, the easy cases. If s=t, the answer is 0. If s > t, the answer is s-t because subtraction by 1 is the only operation that lowers s, and the other two can only increase the number of subtractions required.
Now let's assume s < t. Since s>0 is given, doubling will always be the fastest way to increase (if s is 1, it's tied with incrementing). So if the challenge was to make s >= t, the answer would always be the number of doublings required to do that. This procedure may overshoot t, but the first doubling greater than t and the last doubling not greater than t must be within a factor of 2 of t.
Let's look at the effect of when we do an addition or subtraction. First, look only at addition:
(((s*2) * 2) * 2) + 1 = 8s + 1
vs:
((((s+1)*2) * 2) * 2) = 8s + 8
Putting an addition before n doublings makes the final result 2^n bigger. So consider if s is 3 and t is 8. The last double not bigger than 8 is 6. This is 2 off, so if we put an addition 1 double before the last double, we get what we want: (3+1) * 2. Alternatively we could try overshooting to the first double greater than 8, which is 12. This is 4 off, so we need to put a subtraction two doublings before the last : (3-1)*2*2 = 8
In general if we are x below the target, we need to put a +1 at n doublings before the last if the binary representation of x has a 1 at the nth place.
Similarly, if we are x above the target, we do likewise with -1's.
This procedure won't help for the 1's in x's binary representation that are at a position more than the number of doublings there are. For example, if s = 100, t=207, there is only 1 doubling to do, but x is 7, which is 111. We can knock out the middle one by doing an addition first, the rest we have to do one by one (s+1)*2 + 1 + 1 + 1 + 1 + 1.
Here is an implementation that has a debug flag that also outputs the list of operations when the flag is defined. The run time is O(log(t)):
#include <iostream>
#include <string>
#include <sstream>
#define DEBUG_INFO
int MinMoves(int s, int t)
{
int ans = 0;
if (t <= s)
{
return s - t; //Only subtraction will help
}
int firstDoubleGreater = s;
int lastDoubleNotGreater = s;
int nDouble = 0;
while(firstDoubleGreater <= t)
{
nDouble++;
lastDoubleNotGreater = firstDoubleGreater;
firstDoubleGreater *= 2;
}
int d1 = t - lastDoubleNotGreater;
int d2 = firstDoubleGreater - t;
if (d1 == 0)
return nDouble -1;
int strat1 = nDouble -1; //Double and increment
int strat2 = nDouble; //Double and decrement
#ifdef DEBUG_INFO
std::cout << "nDouble: " << nDouble << "\n";
std::stringstream s1Ops;
std::stringstream s2Ops;
int s1Tmp = s;
int s2Tmp = s;
#endif
int mask = 1<<strat1;
for(int pos = 0; pos < nDouble-1; pos++)
{
#ifdef DEBUG_INFO
if (d1 & mask)
{
s1Ops << s1Tmp << "+1=" << s1Tmp+1 << "\n" << s1Tmp+1 << "*2= " << (s1Tmp+1)*2 << "\n";
s1Tmp = (s1Tmp + 1) * 2;
}
else
{
s1Ops << s1Tmp << "*2= " << s1Tmp*2 << "\n";
s1Tmp = s1Tmp*2;
}
#endif
if(d1 & mask)
strat1++;
d1 = d1 & ~mask;
mask = mask >> 1;
}
strat1 += d1;
#ifdef DEBUG_INFO
if (d1 != 0)
s1Ops << s1Tmp << " +1 " << d1 << " times = " << s1Tmp + d1 << "\n";
#endif
mask = 1<<strat2;
for(int pos = 0; pos < nDouble; pos++)
{
#ifdef DEBUG_INFO
if (d2 & mask)
{
s2Ops << s2Tmp << "-1=" << s2Tmp-1 << "\n" << s2Tmp-1 << "*2= " << (s2Tmp-1)*2 << "\n";
s2Tmp = (s2Tmp-1)*2;
}
else
{
s2Ops << s2Tmp << "*2= " << s2Tmp*2 << "\n";
s2Tmp = s2Tmp*2;
}
#endif
if(d2 & mask)
strat2++;
d2 = d2 & ~mask;
mask = mask >> 1;
}
strat2 += d2;
#ifdef DEBUG_INFO
if (d2 != 0)
s2Ops << s2Tmp << " -1 " << d2 << " times = " << s2Tmp - d2 << "\n";
std::cout << "Strat1: " << strat1 << "\n";
std::cout << s1Ops.str() << "\n";
std::cout << "\n\nStrat2: " << strat2 << "\n";
std::cout << s2Ops.str() << "\n";
#endif
if (strat1 < strat2)
{
return strat1;
}
else
{
std::cout << "Strat2\n";
return strat2;
}
}
int main()
{
int s = 25;
int t = 193;
std::cout << "s = " << s << " t = " << t << "\n";
std::cout << MinMoves(s, t) << std::endl;
}
Short BFS algorithm. It finds the shortest path in graph where every vertex x is connected to x + 1, x - 1 and x * 2; O(n)
#include <bits/stdc++.h>
using namespace std;
const int _MAX_DIS = 2020;
const int _MIN_DIS = 0;
int minMoves(int begin, int end){
queue<int> Q;
int dis[_MAX_DIS];
fill(dis, dis + _MAX_DIS, -1);
dis[begin] = 0;
Q.push(begin);
while(!Q.empty()){
int v = Q.front(); Q.pop();
int tab[] = {v + 1, v - 1, v * 2};
for(int i = 0; i < 3; i++){
int w = tab[i];
if(_MIN_DIS <= w && w < _MAX_DIS && dis[w] == -1){ // w must be a valid index into dis
Q.push(w);
dis[w] = dis[v] + 1;
}
}
}
return dis[end];
}
int main(){
ios_base::sync_with_stdio(false);
cin.tie(0);
cout.tie(0);
cout << minMoves(1, 1000);
return 0;
}
I want to generate 3 random numbers in the range 0 to 9 in a row which should sum up to a given fixed number. For example, for the given fixed sum 15, one possible solution would be (3, 8, 4). How can I do this ? Thanks.
We can:
Generate random floating-point numbers a, b, c between 0 and 1
Compute the sum of a, b, c
Divide a, b, c by that sum
Multiply a, b, c by the desired integer sum, then round each to the nearest integer
If the rounded values sum to the desired integer, return them; otherwise try again
Check this demo:
Using boost random generator:
#include <iostream>
#include <time.h>
#include <iomanip>
#include <boost/random.hpp>
int main()
{
static time_t seed = time(0);
boost::random::mt19937 RandomNumGen(seed++);
boost::random::uniform_real_distribution<> Range(0, 1);
int Desired_Integer = 15;
int Rand_Max = 9;
int Max_Itr = 100000000;
int Count = 0;
int SumABC[3][10] = { 0 };
float bias = 0.5;
float a, b, c;
for (int Loop = 1; Loop <= Max_Itr; ++Loop)
{
a = Range(RandomNumGen);
b = Range(RandomNumGen);
c = Range(RandomNumGen);
float Sum = a + b + c;
a = a / Sum;
b = b / Sum;
c = c / Sum;
//Round to the nearest integer;
int aI = static_cast<int>(a * Desired_Integer + bias), bI = static_cast<int>(b * Desired_Integer + bias), cI = static_cast<int>(c * Desired_Integer + bias);
if (aI <= Rand_Max && bI <= Rand_Max && cI <= Rand_Max && aI + bI + cI == Desired_Integer)
{
SumABC[0][aI]++;
SumABC[1][bI]++;
SumABC[2][cI]++;
Count++;
}
}
int PaddingWidth = 10;
std::cout << "\n" << Count << " in " << Max_Itr << " loops get desired outcome. \nDistribution of a,b,c: \n";
std::cout << "Number" << std::setw(PaddingWidth) << "a" << std::setw(PaddingWidth) << "b" << std::setw(PaddingWidth) << "c" << std::endl;
for (int i = 0; i < 10; i++)
{
std::cout
<< i << std::setw(PaddingWidth + 8)
<< std::setprecision(4) << 100.0 * SumABC[0][i] / (float)Count << std::setw(PaddingWidth)
<< std::setprecision(4) << 100.0 * SumABC[1][i] / (float)Count << std::setw(PaddingWidth)
<< std::setprecision(4) << 100.0 * SumABC[2][i] / (float)Count << std::endl;
}
std::cout << "\n\n";
system("pause");
return 0;
}
Test efficiency:
When dealing with random variables it's a really good idea to check the work.
I simulated both answers. Xiaotao's not only produces a different overall distribution, it also produces different frequencies for the three positions: aI and bI have the same distribution, but cI is significantly different. All three should have identical distributions.
Kay's solution, by contrast, has the proper distribution: for example, P(a == 1) should be 1.25 times P(a == 0), and it is.
The following is a deterministic solution, and it has exactly the same statistics as Kay's.
Further, the frequency of occurrence of each number looking at it purely from a probability POV from 0 to 9 is 4/73, 5/73, 6/73, 7/73, 8/73, 9/73, 10/73, 9/73, 8/73 and 7/73
A vector of all possible number sequences that sums to 15 is created. Then one element is chosen randomly. Each number set has an identical probability of being selected
#include <algorithm>
#include <array>
#include <cstdio>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
using namespace std;
// Your constants:
static constexpr unsigned DICE_COUNT = 3;
static constexpr unsigned DICE_SIDES = 10;
static constexpr unsigned DESIRED_NUMBER = 15;
int main() {
// Initialize your PRNG:
vector<array<int, 3>> allLegalNumbers;
for (int i=0; i <= 9; i++) // go through all possible sets of 3 numbers from 0 to 9
for (int ii = 0; ii < DICE_SIDES; ii++)
for (int iii = 0; iii < DICE_SIDES; iii++)
if (i + ii + iii == DESIRED_NUMBER) // keep the ones that add up to 15
allLegalNumbers.push_back(array<int, 3> {i, ii, iii});
random_device rd;
mt19937 generator(rd());
uniform_int_distribution<unsigned> distribution(0, allLegalNumbers.size() - 1);
int sum[3][DICE_SIDES]{};
int sum_count = 0;
for (int Loop = 1; Loop < 100000000; ++Loop)
{
auto index = distribution(generator);
sum[0][allLegalNumbers[index][0]]++;
sum[1][allLegalNumbers[index][1]]++;
sum[2][allLegalNumbers[index][2]]++;
sum_count++;
}
for (int i = 0; i < DICE_SIDES; i++)
printf("Percent of aI==%d:%5.2f bI==%d:%5.2f cI==%d:%5.2f\n",
i, 100.0*sum[0][i] / sum_count,
i, 100.0*sum[1][i] / sum_count,
i, 100.0*sum[2][i] / sum_count);
return 0;
}
/* Results:
Percent of aI==0: 5.48 bI==0: 5.48 cI==0: 5.48
Percent of aI==1: 6.85 bI==1: 6.85 cI==1: 6.85
Percent of aI==2: 8.22 bI==2: 8.22 cI==2: 8.22
Percent of aI==3: 9.59 bI==3: 9.59 cI==3: 9.59
Percent of aI==4:10.96 bI==4:10.96 cI==4:10.96
Percent of aI==5:12.33 bI==5:12.33 cI==5:12.34
Percent of aI==6:13.69 bI==6:13.70 cI==6:13.70
Percent of aI==7:12.34 bI==7:12.33 cI==7:12.33
Percent of aI==8:10.96 bI==8:10.96 cI==8:10.95
Percent of aI==9: 9.59 bI==9: 9.59 cI==9: 9.58
*/
Xiaotao's answer simulation: Note the different distribution of cI v aI and bI
#include <cstdio>
#include <cstdlib>
#include <iostream>
int main()
{
int SumI = 15;
int Rand_Max = 9;
float a, b, c;
int sum[3][10]{};
int sum_count = 0;
for (int Loop = 1; Loop < 100000000; ++Loop)
{
a = static_cast<float>(rand() % Rand_Max) / static_cast<float>(Rand_Max);
b = static_cast<float>(rand() % Rand_Max) / static_cast<float>(Rand_Max);
c = static_cast<float>(rand() % Rand_Max) / static_cast<float>(Rand_Max);
float Sum = a + b + c;
a = a / Sum;
b = b / Sum;
c = c / Sum;
//Round to the nearest integer;
int aI = static_cast<int>(a * SumI + 0.5), bI = static_cast<int>(b * SumI + 0.5), cI = static_cast<int>(c * SumI + 0.5);
if (aI <= Rand_Max && bI <= Rand_Max && cI <= Rand_Max && aI + bI + cI == SumI)
{
sum[0][aI]++;
sum[1][bI]++;
sum[2][cI]++;
sum_count++;
}
}
for (int i = 0; i < 10; i++)
printf("Percent of aI==%d:%5.2f bI==%d:%5.2f cI==%d:%5.2f\n",
i, 100.0*sum[0][i] / sum_count,
i, 100.0*sum[1][i] / sum_count,
i, 100.0*sum[2][i] / sum_count);
return 0;
}
/* Results:
Percent of aI==0: 5.84 bI==0: 5.83 cI==0: 5.84
Percent of aI==1: 5.30 bI==1: 5.31 cI==1: 5.31
Percent of aI==2: 7.43 bI==2: 7.43 cI==2: 6.90
Percent of aI==3: 9.55 bI==3: 9.54 cI==3: 9.28
Percent of aI==4:10.61 bI==4:10.61 cI==4:10.60
Percent of aI==5:15.64 bI==5:15.66 cI==5:15.39
Percent of aI==6:16.18 bI==6:16.18 cI==6:17.51
Percent of aI==7:11.41 bI==7:11.40 cI==7:10.88
Percent of aI==8: 9.82 bI==8: 9.81 cI==8:10.08
Percent of aI==9: 8.22 bI==9: 8.22 cI==9: 8.22
*/
Kay's answer does not exhibit this error. Here's that simulation:
#include <algorithm>
#include <array>
#include <cstdio>
#include <iostream>
#include <numeric>
#include <random>
// Don't use "using namespace" in production.
// I only use it to avoid the horizontal scrollbar.
using namespace std;
// Your constants:
static constexpr unsigned DICE_COUNT = 3;
static constexpr unsigned DICE_SIDES = 10;
static constexpr unsigned DESIRED_NUMBER = 15;
int main() {
// Initialize your PRNG:
random_device rd;
mt19937 generator(rd());
uniform_int_distribution<unsigned> distribution(0, DICE_SIDES - 1);
int sum[3][10]{};
int sum_count = 0;
for (int Loop = 1; Loop < 10000000; ++Loop)
{
// Fill the array with three random numbers until you have a match:
array<unsigned, DICE_COUNT> values = { 0 };
while (accumulate(begin(values), end(values), 0) != DESIRED_NUMBER) {
for_each(begin(values), end(values), [&](unsigned &v) {
v = distribution(generator);
//v = rand() % DICE_SIDES; // substitute this to use rand()
});
}
sum[0][values[0]]++;
sum[1][values[1]]++;
sum[2][values[2]]++;
sum_count++;
}
for (int i = 0; i < 10; i++)
printf("Percent of aI==%d:%5.2f bI==%d:%5.2f cI==%d:%5.2f\n",
i, 100.0*sum[0][i] / sum_count,
i, 100.0*sum[1][i] / sum_count,
i, 100.0*sum[2][i] / sum_count);
return 0;
}
/* Results:
Percent of aI==0: 5.48 bI==0: 5.48 cI==0: 5.47
Percent of aI==1: 6.85 bI==1: 6.85 cI==1: 6.85
Percent of aI==2: 8.22 bI==2: 8.19 cI==2: 8.22
Percent of aI==3: 9.60 bI==3: 9.59 cI==3: 9.60
Percent of aI==4:10.97 bI==4:10.96 cI==4:10.99
Percent of aI==5:12.34 bI==5:12.32 cI==5:12.32
Percent of aI==6:13.69 bI==6:13.70 cI==6:13.71
Percent of aI==7:12.31 bI==7:12.34 cI==7:12.30
Percent of aI==8:10.95 bI==8:10.96 cI==8:10.95
Percent of aI==9: 9.60 bI==9: 9.60 cI==9: 9.59
*/
Here's a tutorial on how to generate random numbers in C++11: http://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution
The easiest solution is to try it until you find a match:
#include <algorithm>
#include <array>
#include <iostream>
#include <numeric>
#include <random>
// Don't use "using namespace" in production.
// I only use it to avoid the horizontal scrollbar.
using namespace std;
// Your constants:
static constexpr unsigned DICE_COUNT = 3;
static constexpr unsigned DICE_SIDES = 10;
static constexpr unsigned DESIRED_NUMBER = 15;
int main() {
// Initialize your PRNG:
random_device rd;
mt19937 generator(rd());
uniform_int_distribution<unsigned> distribution(0, DICE_SIDES - 1);
// Fill the array with three random numbers until you have a match:
array<unsigned, DICE_COUNT> values = { 0 };
while (accumulate(begin(values), end(values), 0) != DESIRED_NUMBER) {
for_each(begin(values), end(values), [&](unsigned &v) {
v = distribution(generator);
});
}
// Print the result:
for_each(begin(values), end(values), [&](unsigned &v) {
cout << v << ' ';
});
cout << endl;
return 0;
}
You'll need about nine iterations to have a 50/50 chance that you'll throw a 15:
P(3d10 = 18) ≈ 1/14 (+3 to account for the range shift)
(13/14)^n < 0.5 → n ≈ 9.4
A previous question asked how to find the maximum value of an array in CUDA efficiently: Finding max value in CUDA. The top response provided a link to an NVIDIA presentation on optimizing reduction kernels.
If you are using Visual Studio, simply remove the sys/time.h header reference and everything between the CPU EXECUTION comments.
I set up a variant that finds the max, but its result doesn't match what the CPU finds:
// Returns the maximum value of
// an array of size n
float GetMax(float *maxes, int n)
{
int i = 0;
float max = -100000;
for(i = 0; i < n; i++)
{
if(maxes[i] > max)
max = maxes[i];
}
return max;
}
// Too obvious...
__device__ float MaxOf2(float a, float b)
{
if(a > b) return a;
else return b;
}
__global__ void MaxReduction(int n, float *g_idata, float *g_odata)
{
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(BLOCKSIZE*2) + tid;
unsigned int gridSize = BLOCKSIZE*2*gridDim.x;
sdata[tid] = 0;
//MMX(index,i)
//MMX(index,i+blockSize)
// Final Optimized Kernel
while (i < n) {
sdata[tid] = MaxOf2(g_idata[i], g_idata[i+BLOCKSIZE]);
i += gridSize;
}
__syncthreads();
if (BLOCKSIZE >= 512) { if (tid < 256) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 256]); } __syncthreads(); }
if (BLOCKSIZE >= 256) { if (tid < 128) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 128]); } __syncthreads(); }
if (BLOCKSIZE >= 128) { if (tid < 64) { sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 64]); } __syncthreads(); }
if (tid < 32) {
if (BLOCKSIZE >= 64) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 32]);
if (BLOCKSIZE >= 32) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 16]);
if (BLOCKSIZE >= 16 ) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 8]);
if (BLOCKSIZE >= 8) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 4]);
if (BLOCKSIZE >= 4) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 2]);
if (BLOCKSIZE >= 2) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 1]);
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I have a giant setup to test this algorithm:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <sys/time.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include "book.h"
#define ARRAYSIZE 16384
#define GRIDSIZE 60
#define BLOCKSIZE 32
#define SIZEFLOAT 4
using namespace std;
// Function definitions
float GetMax(float *maxes, int n);
__device__ float MaxOf2(float a, float b);
__global__ void MaxReduction(int n, float *g_idata, float *g_odata);
// Returns random floating point number
float RandomReal(float low, float high)
{
float d;
d = (float) rand() / ((float) RAND_MAX + 1);
return (low + d * (high - low));
}
int main()
{
/*****************VARIABLE SETUP*****************/
// Pointer to CPU numbers
float *numbers;
// Pointer to GPU numbers
float *dev_numbers;
// Counter
int i = 0;
// Randomize
srand(time(0));
// Timers
// Kernel timers
cudaEvent_t start_kernel, stop_kernel;
float elapsedTime_kernel;
HANDLE_ERROR(cudaEventCreate(&start_kernel));
HANDLE_ERROR(cudaEventCreate(&stop_kernel));
// cudaMalloc timers
cudaEvent_t start_malloc, stop_malloc;
float elapsedTime_malloc;
HANDLE_ERROR(cudaEventCreate(&start_malloc));
HANDLE_ERROR(cudaEventCreate(&stop_malloc));
// CPU timers
struct timeval start, stop;
float elapsedTime = 0;
/*****************VARIABLE SETUP*****************/
/*****************CPU ARRAY SETUP*****************/
// Setup CPU array
HANDLE_ERROR(cudaHostAlloc((void**)&numbers, ARRAYSIZE * sizeof(float), cudaHostAllocDefault));
for(i = 0; i < ARRAYSIZE; i++)
numbers[i] = RandomReal(0, 50000.0);
/*****************CPU ARRAY SETUP*****************/
/*****************GPU ARRAY SETUP*****************/
// Start recording cuda malloc time
HANDLE_ERROR(cudaEventRecord(start_malloc,0));
// Allocate memory to GPU
HANDLE_ERROR(cudaMalloc((void**)&dev_numbers, ARRAYSIZE * sizeof(float)));
// Transfer CPU array to GPU
HANDLE_ERROR(cudaMemcpy(dev_numbers, numbers, ARRAYSIZE*sizeof(float), cudaMemcpyHostToDevice));
// An array to temporarily store maximum values on the GPU
float *dev_max;
HANDLE_ERROR(cudaMalloc((void**)&dev_max, GRIDSIZE * sizeof(float)));
// An array to hold the per-block maxes copied back from the GPU
float maxes[GRIDSIZE];
/*****************GPU ARRAY SETUP*****************/
/*****************KERNEL EXECUTION*****************/
// Start recording kernel execution time
HANDLE_ERROR(cudaEventRecord(start_kernel,0));
// Run kernel
MaxReduction<<<GRIDSIZE, BLOCKSIZE, SIZEFLOAT*BLOCKSIZE>>> (ARRAYSIZE, dev_numbers, dev_max);
// Transfer maxes over
HANDLE_ERROR(cudaMemcpy(maxes, dev_max, GRIDSIZE * sizeof(float), cudaMemcpyDeviceToHost));
// Print out the max
cout << GetMax(maxes, GRIDSIZE) << endl;
// Stop recording kernel execution time
HANDLE_ERROR(cudaEventRecord(stop_kernel,0));
HANDLE_ERROR(cudaEventSynchronize(stop_kernel));
// Retrieve recording data
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime_kernel, start_kernel, stop_kernel));
// Stop recording cuda malloc time
HANDLE_ERROR(cudaEventRecord(stop_malloc,0));
HANDLE_ERROR(cudaEventSynchronize(stop_malloc));
// Retrieve recording data
HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime_malloc, start_malloc, stop_malloc));
// Print results
printf("%5.3f\t%5.3f\n", elapsedTime_kernel, elapsedTime_malloc);
/*****************KERNEL EXECUTION*****************/
/*****************CPU EXECUTION*****************/
// Capture the start time
gettimeofday(&start, NULL);
// Find the max on the CPU for comparison
cout << GetMax(numbers, ARRAYSIZE) << endl;
// Capture the stop time
gettimeofday(&stop, NULL);
// Retrieve time elapsed in milliseconds
long int elapsed_sec = stop.tv_sec - start.tv_sec;
long int elapsed_usec = stop.tv_usec - start.tv_usec;
elapsedTime = (float)(1000.0f * elapsed_sec) + (float)(0.001f * elapsed_usec);
// Print results
printf("%5.3f\n", elapsedTime);
/*****************CPU EXECUTION*****************/
// Free memory
cudaFreeHost(numbers);
cudaFree(dev_numbers);
cudaFree(dev_max);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_malloc);
cudaEventDestroy(stop_malloc);
// Exit program
return 0;
}
I ran cuda-memcheck on this test program, with -g & -G switches on, and it reports 0 problems. Can anyone spot the issue?
NOTE: Be sure to have book.h from the CUDA by Example book in your current directory when you compile the program. Source link here: http://developer.nvidia.com/cuda-example-introduction-general-purpose-gpu-programming
Download the source code, and book.h will be under the common directory/folder.
Your kernel looks broken to me. The thread-local search (before the shared memory reduction) should look something like this:
sdata[tid] = g_idata[i];
i += gridSize;
while (i < n) {
sdata[tid] = MaxOf2(sdata[tid], g_idata[i]);
i += gridSize;
}
shouldn't it?
Also note that if you run this on Fermi, the shared memory buffer should be declared volatile, and you will get a noticeable improvement in performance if the thread local search is done with a register variable, rather than in shared memory. There is about an 8 times difference in effective bandwidth between the two.
EDIT: Here is a simplified, working version of your reduction kernel. You should note a number of differences compared to your original:
__global__ void MaxReduction(int n, float *g_idata, float *g_odata)
{
extern __shared__ volatile float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(BLOCKSIZE) + tid;
unsigned int gridSize = BLOCKSIZE*gridDim.x;
float val = g_idata[i];
i += gridSize;
while (i < n) {
val = MaxOf2(g_idata[i],val);
i += gridSize;
}
sdata[tid] = val;
__syncthreads();
// This version uses a single warp for the shared memory
// reduction
# pragma unroll
for(int i=(tid+32); ((tid<32)&&(i<BLOCKSIZE)); i+=32)
sdata[tid] = MaxOf2(sdata[tid], sdata[i]);
if (tid < 16) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 16]);
if (tid < 8) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 8]);
if (tid < 4) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 4]);
if (tid < 2) sdata[tid] = MaxOf2(sdata[tid], sdata[tid + 2]);
if (tid == 0) g_odata[blockIdx.x] = MaxOf2(sdata[tid], sdata[tid + 1]);
}
This code should also be safe on Fermi. You should also familiarise yourself with the CUDA math library, because there is an fmax(x,y) intrinsic (fmaxf(x,y) for single-precision floats) which you should use in place of your MaxOf2 function.