I'm using C++11's random library to produce deterministic random values. I need to restrict the output to various ranges, so naturally I used std::uniform_int_distribution, but much to my dismay the specification gives library implementations too much freedom, and the output differs between e.g. x86-64 Linux and x86 Windows 7.
Is there any option besides implementing my own distribution to ensure that the output is deterministic regardless of library implementation?
Is there any option besides implementing my own distribution to ensure
that the output is deterministic regardless of library implementation?
There is no option besides implementing your own distribution to ensure that the output is deterministic regardless of library implementation. While engines are deterministic, distributions are not.
Typical implementations of uniform_int_distribution involve "rejection" algorithms: repeatedly get a number from the URNG and, if the result is not in the desired range, throw it away and try again. Variations on that theme optimize the algorithm by folding out-of-range values into in-range values so as to minimize rejections without introducing the bias that the simplistic offset + lcg() % range approach does.
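For illustration, here is a minimal sketch of such a rejection scheme written portably on top of a 64-bit engine such as std::mt19937_64. This is not any particular library's implementation; the helper name bounded and the gang of assumptions in the comments are mine:

#include <cstdint>
#include <iostream>
#include <random>

// Minimal rejection-sampling sketch: map raw 64-bit engine output into
// [lo, hi] without modulo bias by discarding the uneven "tail".
// Assumes hi >= lo and that [lo, hi] does not span the whole 64-bit range.
template <class Engine>
std::uint64_t bounded(Engine& eng, std::uint64_t lo, std::uint64_t hi) {
    const std::uint64_t range = hi - lo + 1;
    const std::uint64_t limit = UINT64_MAX - UINT64_MAX % range; // a multiple of range
    std::uint64_t x;
    do {
        x = eng();            // raw engine output, assumed to be 64 bits wide
    } while (x >= limit);     // reject values that would bias the modulo
    return lo + x % range;
}

int main() {
    std::mt19937_64 eng(42);  // fixed seed -> same raw sequence everywhere
    for (int i = 0; i < 10; ++i)
        std::cout << bounded(eng, 0, 1000) << '\n';
}

Because everything here is specified by the caller, two platforms running the same engine with the same seed produce the same sequence of bounded values.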
At least if an LCG is good enough for your purpose, the std::linear_congruential_engine instantiated with some fixed constants your application provides should produce deterministic output.
#include <cstdint>
#include <iostream>
#include <random>

typedef std::uint64_t number;

constexpr number a {1};
constexpr number c {2};
constexpr number m {3};
constexpr number seed {0};

int main() {
    std::linear_congruential_engine<number, a, c, m> lcg {seed};
    std::uniform_int_distribution<number> dist {0, 1000};
    for (int i = 0; i < 10; ++i)
        std::cout << dist(lcg) << std::endl;
}
Note that I've intentionally picked silly values for a, c and m so not to take any side in the debate about what would make a good set of parameters.
Say that I have a construct like this:
for (int i = 0; i < 5000; i++) {
    const int upper_bound = f(i);
    #pragma acc parallel loop
    for (int j = 0; j < upper_bound; j++) {
        // Do work...
    }
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.
The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.
According to the OpenACC 2.6 specification [1], lines 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
Which seems to be the case, so your code is valid.
However, note it is implementation defined how to distribute the work among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define the number of gangs/workers using num_gangs and num_workers; the integer expressions passed to those clauses can depend on the value of your function (see sections 2.5.7 and 2.5.8 of the OpenACC specification), for example as sketched below.
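A minimal sketch of that idea, reusing the question's f and upper_bound; the vector length of 128 and the gang-count formula are arbitrary choices for illustration, not recommendations:

for (int i = 0; i < 5000; i++) {
    const int upper_bound = f(i);
    // Request a schedule derived from the (shrinking) bound; the clause
    // arguments are ordinary integer expressions evaluated when the
    // construct is entered, so they can track upper_bound per iteration.
    #pragma acc parallel loop num_gangs(upper_bound / 128 + 1) vector_length(128)
    for (int j = 0; j < upper_bound; j++) {
        // Do work...
    }
}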
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf
I have a Deck vector with 52 Card, and I want to shuffle it.
vector<Card^> cards;
So I used this:
random_shuffle(cards.begin(), cards.end());
The problem was that it gave me the same result every time, so I used srand to randomize it:
srand(unsigned(time(NULL)));
random_shuffle(cards.begin(),cards.end());
This was still not truly random. When I started dealing cards, it was the same as in the last run. For example: "1. deal: A,6,3,2,K; 2. deal: Q,8,4,J,2", and when I restarted the program I got exactly the same order of deals.
Then I used srand() and random_shuffle with its 3rd parameter:
int myrandom(int i) {
    return std::rand() % i;
}
srand(unsigned(time(NULL)));
random_shuffle(cards.begin(),cards.end(), myrandom);
Now it's working and always gives me different results on re-runs, but I don't know why it works this way. How do these functions work, what did I do here?
This answer required some investigation, looking at the C++ Standard Library headers in VC++ and at the C++ standard itself. I knew what the standard said, but I was curious about how VC++ (including C++/CLI) did its implementation.
First, what does the standard say about std::random_shuffle? We can find that here. In particular it says:
Reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
1) The random number generator is implementation-defined, but the function std::rand is often used.
The key part is that the random number generator is implementation-defined, so results across different compilers will vary. The standard suggests that std::rand is often used, but this isn't a requirement; and if an implementation doesn't use std::rand, it follows that it likely won't use std::srand for a starting seed either. An interesting footnote is that the std::random_shuffle functions are deprecated as of C++14, while std::shuffle remains. My guess is that since std::shuffle requires you to provide a function object, you are explicitly defining the behavior you want when generating random numbers, and that is an advantage over the older std::random_shuffle.
I took my VS2013 installation and looked at the C++ Standard Library headers, and discovered that <algorithm> uses a template class with a completely different pseudo-RNG (PRNG) than std::rand, with an index (seed) set to zero. Although this may vary in detail between different versions of VC++ (including C++/CLI), I think it is probable that most versions of VC++/CLI do something similar. This would explain why each time you run your application you get the same shuffled decks.
The option I would opt for if I am looking for a Pseudo RNG and I'm not doing cryptography is to use something well established like Mersenne Twister:
Advantages: The commonly-used version of Mersenne Twister, MT19937, which produces a sequence of 32-bit integers, has the following desirable properties:
It has a very long period of 2^19937 − 1. While a long period is not a guarantee of quality in a random number generator, short periods (such as the 2^32 common in many older software packages) can be problematic.
It is k-distributed to 32-bit accuracy for every 1 ≤ k ≤ 623 (see definition below).
It passes numerous tests for statistical randomness, including the Diehard tests.
Luckily for us, the C++11 Standard Library (which I believe should work on VS2010 and later C++/CLI) includes a Mersenne Twister function object that can be used with std::shuffle. Please see this C++ documentation for more details. The C++ Standard Library reference provided earlier actually contains code that does this:
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(v.begin(), v.end(), g);
The thing to note is that std::random_device produces non-deterministic (non-repeatable) unsigned integers. We need non-deterministic data to seed our Mersenne Twister (std::mt19937) PRNG. This is similar in concept to seeding rand with srand(time(NULL)), the latter not being an overly good source of randomness.
This looks all well and good but has one disadvantage when dealing with card shuffling. An unsigned integer on the Windows platform is 4 bytes (32 bits) and can store 2^32 values. This means there are only 4,294,967,296 possible starting points (seeds) therefore only that many ways to shuffle the deck. The problem is that there are 52! (52 factorial) ways to shuffle a standard 52 card deck. That happens to be 80658175170943878571660636856403766975289505440883277824000000000000 ways, which is far bigger than the number of unique ways we can get from setting a 32-bit seed.
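As a quick sanity check of that arithmetic, a standalone sketch like the following (not part of the original answer) reproduces the bit count used below:

#include <cmath>
#include <cstdio>

int main() {
    // log2(52!) = sum of log2(k) for k = 2..52
    double bits = 0.0;
    for (int k = 2; k <= 52; ++k)
        bits += std::log2(static_cast<double>(k));
    // Prints roughly 225.58 bits, i.e. 29 bytes rounded up.
    std::printf("log2(52!) ~= %.2f bits (~%.0f bytes)\n", bits, std::ceil(bits / 8));
}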
Thankfully, Mersenne Twister can accept seeds between 0 and 2^19937−1. 52! is a big number, but every possible shuffle can be represented with a seed of 226 bits (about 29 bytes). The Standard Library allows std::mt19937 to be seeded with up to its full 19937-bit state (624 32-bit words) if we so choose, but since we need only 226 bits, the following code creates 29 bytes of non-deterministic data to use as a suitable seed for std::mt19937:
// seed_data is an array holding 29 bytes of seed data, which covers the 226 bits we need
std::array<unsigned char, 29> seed_data;
std::random_device rd;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
// Set the seed for Mersenne *using the 29 byte sequence*
std::mt19937 g(seq);
Then all you need to do is call shuffle with code like:
std::shuffle(cards.begin(),cards.end(), g);
On Windows with VC++/CLI you will get a warning with the code above that you'll want to suppress, so at the top of the file (before other includes) you can add this:
#define _SCL_SECURE_NO_WARNINGS 1
Why have _n versions of copy, fill and generate been provided in C++11, and why only for these algorithms?
In general, the STL only provides primitives from which one can define suitably adapted variants.
The SGI documentation gives the following rationale for providing the exceptions you noted:
copy_n works for Input Iterators that are not also Forward Iterators.
fill_n and generate_n work for Output Iterators that are not also Forward Iterators.
As pointed out by @Jared Hoberock in the comments, the <memory> header also has uninitialized_ versions of copy_n and fill_n, which are optimized for the case when the count is already known.
C++11 provides a few other convenience wrappers (e.g. find_if_not), but with lambda predicates such wrappers become a lot easier to write yourself.
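For instance, here is a small sketch (names and values are just illustrative) of how such a convenience wrapper compares with rolling it yourself using a negated lambda:

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{2, 4, 6, 7, 8};

    // The C++11 convenience algorithm...
    auto a = std::find_if_not(v.begin(), v.end(), [](int x) { return x % 2 == 0; });

    // ...versus writing the same thing with find_if and a negated predicate.
    auto b = std::find_if(v.begin(), v.end(), [](int x) { return x % 2 != 0; });

    return a == b ? 0 : 1; // both point at the 7
}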
Note: there is also a search_n but this has different semantics than search because the latter will look at overlap between two input ranges and the former will look at consecutive elements from a single input range.
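A small sketch of that difference (the values are chosen arbitrarily):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 2, 2, 3};
    std::vector<int> pattern{2, 3};

    // search: looks for one range (pattern) inside another range (v).
    auto sub = std::search(v.begin(), v.end(), pattern.begin(), pattern.end());

    // search_n: looks for a run of 3 consecutive elements equal to 2 within v.
    auto run = std::search_n(v.begin(), v.end(), 3, 2);

    return (sub != v.end() && run != v.end()) ? 0 : 1;
}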
Let's take for example std::generate() and std::generate_n(). The former takes ForwardIterators, pointing to the beginning and end of the range, the latter an OutputIterator. This has subtle implications, for example:
#include <algorithm>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> v;
    v.resize(5); // <-- Elements constructed!!!
    std::generate(v.begin(), v.end(), [](){ return 42; });

    std::vector<int> w;
    w.reserve(5); // Space only reserved but not initialized
    std::generate_n(std::back_inserter(w), 5, [](){ return 42; });
}
That's enough for me to justify the existence of the two versions.
You are absolutely right that in many use cases the functionality of these functions overlap and one of them may look redundant.
why only these algorithms?
Probably because nobody has proposed _n versions for the other algorithms yet. As TemplateRex linked, there could be a std::iota_n() as well: What would be a good implementation of iota_n (missing algorithm from the STL)?
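A minimal hand-rolled version of such an algorithm might look like this (hypothetical, not part of the standard library):

// Hypothetical iota_n: write value, value+1, ... for n steps into an output iterator.
template <class OutputIt, class Size, class T>
OutputIt iota_n(OutputIt first, Size n, T value) {
    for (Size i = 0; i < n; ++i) {
        *first++ = value;
        ++value;
    }
    return first;
}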
Alexander Stepanov (original designer of the STL) discusses this issue (amongst many others) in his excellent video series Efficient Programming with Components. He originally proposed a number of other _n variants of STL algorithms but they were not accepted when STL was originally standardized. Some were added back in for C++11 but there are still some that he believes should be available that are missing.
There are a number of reasons why _n variants of algorithms are useful. You may have an input iterator or output iterator which you know can produce or consume n elements but you don't have a way to obtain a suitable end iterator. You may have a container type like a list which you know is big enough for an operation but which doesn't give you an efficient way to obtain an iterator n positions beyond your begin iterator. You may have an algorithm like binary_search / lower_bound which is most naturally expressed in terms of counted ranges. It may just be more convenient when you have n already but you don't have an end iterator and would have to generate one to call the non _n variant of an algorithm.
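As a concrete sketch of the first point: with a pure input iterator such as std::istream_iterator there is no cheap way to form an end iterator n positions away, so the counted variant is the natural fit (the count of 5 here is arbitrary):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> v;
    // Read exactly five integers from stdin; we cannot name an "end" iterator
    // five positions ahead of an istream_iterator, but we do know the count.
    std::copy_n(std::istream_iterator<int>(std::cin), 5, std::back_inserter(v));
}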
I know enough Haskell to translate the code below, but I don't know much about making it perform well:
typedef unsigned long precision;
typedef unsigned char uc;
const int kSpaceForByte = sizeof(precision) * 8 - 8;
const int kHalfPrec = sizeof(precision) * 8 / 2;
const precision kTop = ((precision)1) << kSpaceForByte;
const precision kBot = ((precision)1) << kHalfPrec;
//This must be called before encoding starts
void RangeCoder::StartEncode() {
    _low = 0;
    _range = (precision) -1;
}
/*
RangeCoder does not concern itself with models of the data.
To encode each symbol, you pass the parameters *cumFreq*, which gives
the cumulative frequency of the possible symbols ordered before this symbol,
*freq*, which gives the frequency of this symbol, and *totFreq*, which gives
the total frequency of all symbols.
This means that you can have different frequency distributions / models for
each encoded symbol, as long as you can restore the same distribution at
this point, when restoring.
*/
void RangeCoder::Encode(precision cumFreq, precision freq, precision totFreq) {
    assert(cumFreq + freq <= totFreq && freq && totFreq <= kBot);
    _low += cumFreq * (_range /= totFreq);
    _range *= freq;
    while ((_low ^ _low + _range) < kTop or
           _range < kBot and ((_range = -_low & kBot - 1), 1)) {
        // The "a or b and (r = .., 1)" idiom is a way to assign r only if a is false.
        OutByte(_low >> kSpaceForByte); // output one byte
        _range <<= sizeof(uc) * 8;
        _low <<= sizeof(uc) * 8;
    }
}
I know, I know "Write several versions and use criterion to see what works". I don't know enough to know what my options are though, or to avoid silly mistakes.
Here are my thoughts so far. One way would be to use the State monad and/or lenses. Another would be to translate the loop and state to explicit recursion. I read somewhere that explicit recursion tends to perform badly with GHC, though. I think using ByteString Builder would be a good way to output each byte. Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Since this is not a 1-1 mapping, pipes with StateP would lead to very neat code, where I would request arguments one at a time and then let the while-loop respond byte for byte. Unfortunately, when I benchmarked it, it seems the pipe overhead is (unsurprisingly) quite large. Since each symbol can lead to many byte outputs, it feels a bit like a concatMap with State. Perhaps this would be the idiomatic solution? Concatenating lists of bytes does not sound very fast to me, though. ByteString has a concatMap. Perhaps this is the correct way? EDIT: no it is not. It takes a ByteString as input.
I intend to release the package on Hackage when I'm done, so any advice (or actual code!) you can give will benefit the community :). I plan to use this compression as a base for writing a very memory efficient compressed map.
I read somewhere that explicit recursion tends to perform badly with GHC though.
No. GHC produces slow machine code for recursion that cannot be reduced (or that GHC "doesn't want" to reduce). If the recursion can be unrolled (I don't see any fundamental problem with that in your snippet), it is translated to almost the same machine code as a while-loop in C or C++.
Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Do you mean Word#? Let GHC deal with it and use boxed types. I've never met a situation where a gain could be achieved only by using unboxed types. Using 32-bit types wouldn't help on a 64-bit platform.
One general rule for optimizing performance with GHC is to avoid data structures where possible. If you can pass pieces of data through function arguments or closures, take the chance.
Which version is faster:
x * 0.5
or
x / 2?
Some time ago I took a university course called Computer Systems. From back then I remember that multiplying two values can be achieved with comparatively "simple" logic gates, but division is not a "native" operation and requires a sum register that is repeatedly increased by the divisor in a loop and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great, I've upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats; for integers it's basically convertible into a bit shift, and for floats into a multiplication by the exact reciprocal.
For floats, even dynamic division by a power of two is much faster than regular (dynamic or static) division, as it basically turns into a subtraction on the exponent.
In all other cases, division appears to be several times slower than multiplication.
For a dynamic divisor, the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8; for static ones, about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two). In the chart labels, ulong means 64-bit unsigned, 1 means a dynamically supplied argument, and 0 means a statically known argument.
The results were generated from the following C template, with the $-placeholders filled in by a bash script:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ulong;

int main(int argc, char** argv) {
    $TYPE arg = atoi(argv[1]);
    $TYPE i = 0, res = 0;
    for (i = 0; i < $IT; i++)
        res += i $OP $ARG;
    printf($FMT, res);
    return 0;
}
with the $-variables assigned, and the resulting program compiled with -O3 and run (the dynamic values came from the command line, as is obvious from the C code).
Well, if it is a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. You can always use whatever is clearest and most readable.
Please refer to this link: Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't reuse a precomputed reciprocal for solutions that require multiple components/coordinates, e.g. when implementing an n-D vector normalize() function; the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
...but this code will:
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
I guess the problem in the first code excerpt is the additional floating-point rounding: the precomputed reciprocal is itself rounded, and that rounding is then carried into every component of the result, introducing additional error.
I didn't look into this for too long, so please share your explanation of why this happens. I tested it with x, y and z being .1f (and also with doubles instead of floats).
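A quick way to see the extra rounding step in isolation is to compare x / l with x * (1.0f / l) directly; this is only a sketch, with 0.1f taken from the test above:

#include <cmath>
#include <cstdio>

int main() {
    float x = 0.1f, y = 0.1f, z = 0.1f;
    float l = std::sqrt(x * x + y * y + z * z);

    float viaDivision   = x / l;            // one rounding: the division itself
    float viaReciprocal = x * (1.0f / l);   // two roundings: the reciprocal, then the multiply
    // The two results may differ in the last bit, which is enough to make
    // the re-computed length miss 1.0 exactly.
    std::printf("%.9g vs %.9g\n", viaDivision, viaReciprocal);
}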