I have some code where I avoid some costly divisions by converting a boost integer to a double. For the real code I will build an fp type that's big enough to hold the maximal value (exponent). To test I am using a double. So I do this:
#define NTYPE_BITS 512
typedef number<cpp_int_backend<NTYPE_BITS, NTYPE_BITS, unsigned_magnitude, unchecked, void> > NTYPE;
NTYPE a1 = BIG_VALUE;
double a1f = (double)a1;
The code generated for that cast is quite complicated. I see it's basically looping over all the elements in a1 (least significant first), scaling them by powers of two.
Now in this case I guess that at most the last two elements could affect the result (64 bits per element, and the most significant element might have fewer than 64 bits in use).
Is there a better way to do this?
First off, NEVER use C-Style casts. (Why use static_cast<int>(x) instead of (int)x?).
Second, avoid using namespace.
(Third, reserve all-caps names for macros).
That said:
double a1f = a1.convert_to<double>();
Is your ticket.
Live On Coliru
#include <boost/multiprecision/cpp_int.hpp>
#include <iostream>
namespace bmp = boost::multiprecision;
//0xDEADBEEFE1E104B1D00008BADF00D000ABADBABE000D15EA5E
#define BIG_VALUE "0xDEADBEEFE1E104B1D00008BADF00D000ABADBABE000D15EA5E"
#define NTYPE_BITS 512
int main() {
    using NTYPE = bmp::number<
        bmp::cpp_int_backend<
            NTYPE_BITS, NTYPE_BITS,
            bmp::unsigned_magnitude, bmp::unchecked, void>>;

    NTYPE a1(BIG_VALUE);

    std::cout << a1 << "\n";
    std::cout << std::hex << a1 << "\n";
    std::cout << a1.convert_to<double>() << "\n";
}
Prints
1397776821048146366831161011449418369017198837637750820563550
deadbeefe1e104b1d00008badf00d000abadbabe000d15ea5e
1.39778e+60
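As an aside, given the advice above about C-style casts: static_cast<double>(a1) should also work here, since number exposes an explicit conversion to built-in types (which is why the (double)a1 in the question compiled at all). A minimal sketch, assuming the NTYPE a1 from the example above and a reasonably recent Boost:

double a1f = static_cast<double>(a1); // equivalent to a1.convert_to<double>()

Both spellings should end up in the same conversion code, so don't expect a performance difference; convert_to<>() merely states the intent more explicitly.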
For an embedded design I want to place a C++ std::array at a specific memory address, which points to a buffer shared by hardware and software. Is this possible?
Try Placement new:
#include <array>
#include <iostream>
int main()
{
    alignas(double) char buffer[100];

    auto parr = new (buffer) std::array<double, 3> {3.14, 15.161, 12};
    const auto & arr = *parr;

    std::cout << (void*) &arr << " " << (void*) buffer << "\n";
    for (auto x : arr)
        std::cout << x << " ";
    std::cout << "\n";
}
which may give this output:
0x7ffcdd906770 0x7ffcdd906770
3.14 15.161 12
One thing to be worried about is whether the address you want to use is consistent with the alignment requirements of the data you want to hold in the array. Most likely you will be dealing with chars, floats or ints, so this condition shouldn't be difficult to enforce.
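If the fixed address is known at build time, that alignment requirement can be checked at compile time rather than trusted. A minimal sketch, where kBufferAddr is a hypothetical stand-in for your real hardware address:

#include <array>
#include <cstdint>

// Hypothetical fixed address of the buffer shared by hardware and software.
constexpr std::uintptr_t kBufferAddr = 0x20000000;

// Fail the build if the address cannot legally hold the array type.
static_assert(kBufferAddr % alignof(std::array<double, 3>) == 0,
              "shared buffer address is not aligned for std::array<double, 3>");

int main() {}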
For embedded code, two simple solutions spring to mind:
Declare extern int myArray[1234]; and then define myArray (or rather its mangled form) in your linker definition file.
Define a macro: #define myArray ((int*)0xC0DEFACE)
The first solution is for the purists; the second is for the pragmatists.
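To make both options concrete, here is a hedged sketch; the names and the 0x20000000 address are hypothetical placeholders, and memory shared with hardware usually also wants to be accessed through volatile:

// Purist approach: only declare the array here; the linker definition file
// pins the symbol to the fixed shared address.
extern int myArrayLinker[1234];

// Pragmatist approach: hard-code a (hypothetical, suitably aligned) address
// and access it through a volatile pointer so the compiler re-reads
// hardware-visible memory on every access.
#define SHARED_BUFFER_ADDR 0x20000000u
static volatile int* const myArray =
    reinterpret_cast<volatile int*>(SHARED_BUFFER_ADDR);

int main() {
    myArray[0] = 42;    // write lands directly in the shared buffer
    return myArray[0];  // read comes straight back from it
}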
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that, given i < j, the elements of subarray i are smaller than the elements of subarray j. An example of such an array would be {5 3 4 2 1 6 9 8 7 10 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is to mark the sub-arrays with keys and provide a custom functor: an ordinary thrust sort functor that orders first by sub-array key and then by value.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
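For reference, the back-to-back approach looks roughly like this sketch (it assumes a device vector of values and a matching vector of segment indices, like the 000111222... keys built in the benchmark below): the first stable sort orders everything by value while carrying the segment ids along, and the second stable sort by segment id regroups the segments, with stability preserving the within-segment order.

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sketch: sort `data` independently within the segments labelled by `seg`.
// Assumes data.size() == seg.size() and seg holds 0,0,...,1,1,...,2,...
void segmented_sort(thrust::device_vector<float>& data,
                    thrust::device_vector<int>&   seg)
{
    // 1) stable sort by value, reordering the segment ids alongside
    thrust::stable_sort_by_key(data.begin(), data.end(), seg.begin());
    // 2) stable sort by segment id, reordering the values alongside;
    //    stability keeps each segment's values in ascending order
    thrust::stable_sort_by_key(seg.begin(), seg.end(), data.begin());
}

Note that this still makes two full passes over the whole array, which is part of why it rarely beats a single plain sort (see below).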
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However I think it's unlikely that any of the thrust sort methods will give a significant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, and which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort", a thrust segmented sort with a functor, and the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
    template <typename T, typename T2>
    __host__ __device__
    bool operator()(T t1, T2 t2){
        // order by sub-array key first, then by value; returning false for
        // equal elements keeps this a strict weak ordering, as thrust::sort requires
        if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
        if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
        return thrust::get<0>(t1) < thrust::get<0>(t2);
    }
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
    typename Key,
    int BLOCK_THREADS,
    int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key *d_in,   // Tile of input
    Key *d_out)  // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
    const int ds = num_blocks*block_size;
    thrust::host_vector<mt> data(ds);
    thrust::host_vector<int> keys(ds);
    for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
    thrust::device_vector<int> d_keys = keys;
    for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
    thrust::device_vector<mt> d_data = data;
    thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
    thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long os = dtime_usec(0);
    thrust::sort(d1.begin(), d1.end()); // ordinary sort
    cudaDeviceSynchronize();
    os = dtime_usec(os);
    thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
    cudaDeviceSynchronize();
    unsigned long long ss = dtime_usec(0);
    thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
    cudaDeviceSynchronize();
    ss = dtime_usec(ss);
    if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
    std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
    thrust::device_vector<mt> d3(ds);
    cudaDeviceSynchronize();
    unsigned long long cs = dtime_usec(0);
    BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
    cudaDeviceSynchronize();
    cs = dtime_usec(cs);
    if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
    std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
In the quest for optimal matrix-matrix multiplication using eigen3 (and hopefully profiting from SIMD support) I wrote the following test:
#include <iostream>
#include <Eigen/Dense>
#include <ctime>
using namespace Eigen;
using namespace std;
const int test_size= 13;
const int test_size_16b= test_size+1;
typedef Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> TestMatrix_dyn16b_t;
typedef Matrix<double, Dynamic, Dynamic> TestMatrix_dynalloc_t;
typedef Matrix<double, test_size, test_size> TestMatrix_t;
typedef Matrix<double, test_size_16b, test_size_16b> TestMatrix_fix16b_t;
template<typename TestMatrix_t> EIGEN_DONT_INLINE void test(const char * msg, int m_size= test_size, int n= 10000) {
    double s= 0.0;
    clock_t elapsed= 0;
    TestMatrix_t m3;
    for(int i= 0; i<n; i++) {
        TestMatrix_t m1 = TestMatrix_t::Random(m_size, m_size);
        TestMatrix_t m2= TestMatrix_t::Random(m_size, m_size);

        clock_t begin = clock();
        m3.noalias()= m1*m2;
        clock_t end = clock();

        elapsed+= end - begin;
        // make sure m3 is not optimized away
        s+= m3(1, 1);
    }
    double elapsed_secs = double(elapsed) / CLOCKS_PER_SEC;
    cout << "Elapsed time " << msg << ": " << elapsed_secs << " size " << m3.cols() << ", " << m3.rows() << endl;
}

int main() {
#ifdef EIGEN_VECTORIZE
    cout << "EIGEN_VECTORIZE on " << endl;
#endif
    test<TestMatrix_t> ("normal ");
    test<TestMatrix_dyn16b_t> ("dyn 16b ");
    test<TestMatrix_dynalloc_t>("dyn alloc");
    test<TestMatrix_fix16b_t> ("fix 16b ", test_size_16b);
}
I compiled this with g++ -msse3 -O2 -DEIGEN_DONT_PARALLELIZE test.cpp and ran it on an Athlon II X2 255. The result rather surprised me:
EIGEN_VECTORIZE on
Elapsed time normal : 0.019193 size 13, 13
Elapsed time dyn 16b : 0.025226 size 13, 13
Elapsed time dyn alloc: 0.018648 size 13, 13
Elapsed time fix 16b : 0.018221 size 14, 14
Similar results are attained with other odd numbers for test_size. What confuses me is this:
From reading the Eigen Vectorization FAQ I would have thought that a 13x13 matrix, whose size is not a multiple of 16 bytes, would not profit from SIMD optimization. I expected the computation time to be much worse, but it isn't.
From reading about Optional template parameters I would have thought that dynamic matrices with a fixed upper bound known at compile time would behave much like dynamically allocated matrices and thus would have a similar computation speed. But they don't. That's actually what surprises me the most and what triggered my initial quest: I wanted to know whether it is better to use a dynamic matrix with a fixed upper bound that is a multiple of 16 bytes than a fixed-size matrix whose size is not a multiple of 16 bytes.
Finally, interesting but not so surprising: a matrix whose fixed size is a multiple of 16 is no slower than a matrix whose column and row length is one less. SIMD just does the extra column and row for free.
Not my original question but also interesting: when the test is compiled without SSE2 support, and thus without vectorization, the relative computation times are roughly proportional. The dynamically sized fixed-memory matrix is again the slowest.
In short: why is Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> so much slower? Can you confirm my observations and maybe even explain them?
The FAQ was obsolete. Since Eigen version 3.3, unaligned vectors and matrices are vectorized.
Then regarding why Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> was slower, that was just an issue in the compile-time selection of the preferred matrix product implementation. The fix will be part of Eigen 3.3.1.
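If you want to confirm which Eigen you are actually building against (and whether vectorization is on) before re-running the benchmark, a minimal check such as this works; the version macros are the ones Eigen itself defines:

#include <Eigen/Core>
#include <iostream>

int main() {
    // Eigen 3.3.1 or later contains the product-selection fix mentioned above.
    std::cout << "Eigen " << EIGEN_WORLD_VERSION << '.'
              << EIGEN_MAJOR_VERSION << '.' << EIGEN_MINOR_VERSION << '\n';
#ifdef EIGEN_VECTORIZE
    std::cout << "EIGEN_VECTORIZE on\n";
#else
    std::cout << "EIGEN_VECTORIZE off\n";
#endif
}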
The code below is meant to generate a list of five pseudo-random numbers in the interval [1,100]. I seed the default_random_engine with time(0), which returns the system time in unix time. When I compile and run this program on Windows 7 using Microsoft Visual Studio 2013, it works as expected (see below). When I do so in Arch Linux with the g++ compiler, however, it behaves strangely.
In Linux, 5 numbers are generated each time. The last 4 numbers differ from run to run (as one would expect), but the first number stays the same.
Example output from 5 executions on Windows and Linux:
      | Windows:       | Linux:
------+----------------+----------------
Run 1 | 54,01,91,73,68 | 25,38,40,42,21
Run 2 | 46,24,16,93,82 | 25,78,66,80,81
Run 3 | 86,36,33,63,05 | 25,17,93,17,40
Run 4 | 75,79,66,23,84 | 25,70,95,01,54
Run 5 | 64,36,32,44,85 | 25,09,22,38,13
Adding to the mystery, that first number periodically increments by one on Linux. After obtaining the above outputs, I waited about 30 minutes and tried again, only to find that the first number had changed and was now always 26. It has continued to increment by 1 periodically and is now at 32. It seems to correspond with the changing value of time(0).
Why does the first number rarely change across runs, and then when it does, increment by 1?
The code. It neatly prints out the 5 numbers and the system time:
#include <iostream>
#include <random>
#include <time.h>
using namespace std;
int main()
{
    const int upper_bound = 100;
    const int lower_bound = 1;
    time_t system_time = time(0);
    default_random_engine e(system_time);
    uniform_int_distribution<int> u(lower_bound, upper_bound);

    cout << '#' << '\t' << "system time" << endl
         << "-------------------" << endl;

    for (int counter = 1; counter <= 5; counter++)
    {
        int secret = u(e);
        cout << secret << '\t' << system_time << endl;
    }
    system("pause");
    return 0;
}
Here's what's going on:
default_random_engine in libstdc++ (GCC's standard library) is minstd_rand0, which is a simple linear congruential engine:
typedef linear_congruential_engine<uint_fast32_t, 16807, 0, 2147483647> minstd_rand0;
The way this engine generates random numbers is x_{i+1} = (16807 * x_i + 0) mod 2147483647.
Therefore, if the seeds are different by 1, then most of the time the first generated number will differ by 16807.
The range of this generator is [1, 2147483646]. The way libstdc++'s uniform_int_distribution maps it to an integer in the range [1, 100] is essentially this: generate a number n. If the number is not greater than 2147483600, then return (n - 1) / 21474836 + 1; otherwise, try again with a new number. It should be easy to see that in the vast majority of cases, two ns that differ by only 16807 will yield the same number in [1, 100] under this procedure. In fact, one would expect the generated number to increase by one about every 21474836 / 16807 = 1278 seconds or 21.3 minutes, which agrees pretty well with your observations.
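To see this concretely, here is a small sketch; the seed is an arbitrary stand-in for time(0), and the exact raw numbers depend on your standard library, but with libstdc++'s minstd_rand0 the two raw outputs should typically differ by 16807 and usually collapse to the same value in [1, 100]:

#include <iostream>
#include <random>

int main() {
    const unsigned seed = 1400000000u;             // arbitrary stand-in for time(0)

    std::minstd_rand0 e1(seed), e2(seed + 1);      // seeds one "second" apart
    std::cout << e1() << " vs " << e2() << '\n';   // raw outputs, typically 16807 apart

    std::minstd_rand0 f1(seed), f2(seed + 1);
    std::uniform_int_distribution<int> u(1, 100);
    std::cout << u(f1) << " vs " << u(f2) << '\n'; // usually the same number in [1, 100]
}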
MSVC's default_random_engine is mt19937, which doesn't have this problem.
The std::default_random_engine is implementation defined. Use std::mt19937 or std::mt19937_64 instead.
In addition, std::time and the <ctime> functions are not very accurate; use the types defined in the <chrono> header instead:
#include <iostream>
#include <random>
#include <chrono>
#include <cstdlib> // for system()

int main()
{
    const int upper_bound = 100;
    const int lower_bound = 1;

    auto t = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    std::mt19937 e;
    e.seed(static_cast<unsigned int>(t)); // Seed engine with timed value.
    std::uniform_int_distribution<int> u(lower_bound, upper_bound);

    std::cout << '#' << '\t' << "system time" << std::endl
              << "-------------------" << std::endl;

    for (int counter = 1; counter <= 5; counter++)
    {
        int secret = u(e);
        std::cout << secret << '\t' << t << std::endl;
    }
    system("pause"); // Windows-specific; harmless to remove elsewhere
    return 0;
}
In Linux, the random function is not a random function in the probabilistic sense of the word, but a pseudo-random number generator.
It is salted with a seed, and based on that seed the numbers it produces are pseudo-random and uniformly distributed.
The Linux way has the advantage that, in the design of certain experiments using information from populations, the experiment can be repeated with known tweaks of the input information and the effect measured. When the final program is ready for real-life testing, the salt (seed) can be created by asking the user to move the mouse, mixing the mouse movement with some keystrokes, and adding in a dash of microsecond counts since the last power on.
Windows obtains its random number seed from a collection of mouse, keyboard, network and time-of-day numbers. It is not repeatable. But this salt value may be reset to a known seed if, as mentioned above, one is involved in the design of an experiment.
Oh yes, Linux has two random number generators. One, the default, is modulo 32 bits; the other is modulo 64 bits. Your choice depends on the accuracy you need and the amount of compute time you wish to spend for your testing or actual use.
#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    double A, R;
    R = 100.64;
    R = R * R;
    A = 3.14159 * R;
    cout << setprecision(3) << A << endl;
    return 0;
}
The reasonably precise and accurate(a) value you would get from those calculations (mathematically) is 31,819.31032.
You have asked for a precision of three digits and, with that value and the floating point format currently active (probably std::defaultfloat), it's only giving you three significant digits:
3.18e+04 (3.18 × 10^4 in mathematical form).
If your intent is to instead show three digits after the decimal point, you can do that with the std::fixed manipulator:
#include <iostream>
#include <iomanip>

int main() {
    double R = 100.64;
    double A = 3.14159 * R * R;
    std::cout << std::setprecision(3) << std::fixed << A << '\n';
    return 0;
}
This gives 31819.310.
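For a side-by-side comparison of the two behaviours, here is a minimal sketch contrasting the default (significant-digits) formatting with std::fixed:

#include <iostream>
#include <iomanip>

int main() {
    double A = 3.14159 * 100.64 * 100.64;
    std::cout << std::setprecision(3) << A << '\n';               // 3.18e+04  (3 significant digits)
    std::cout << std::setprecision(3) << std::fixed << A << '\n'; // 31819.310 (3 digits after the point)
}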
(a) Make sure you never conflate these two; they're different concepts. See, for example, the following values of π you may come up with:
Value            | Properties
-----------------+--------------------------------------
9                | Both im-precise and in-accurate.
3                | Im-precise but accurate.
2.718281828459   | Precise but in-accurate.
3.141592653590   | Both precise and accurate.
π                | Has maximum precision and accuracy.