IEE 754 total order in standard C++11 - c++11

According to the IEEE floating point wikipage (on IEEE 754), there is a total order on double-precision floating points (i.e. on C++11 implementations having IEEE-754 floats, like gcc 4.8 on Linux / x86-64).
Of course, operator < on double is often providing a total order, but NaN are known to be exceptions (it is well known folklore that x != x is a way of testing if x, declared as double x; is a NaN).
The reason I am asking is that I want to have a.g. std::set<double> (actually, a set of JSON-like -or Python like- values) and I would like the set to have some canonical representation (my practical concern is to emit portable JSON -same data, ordered in the same order, both on Linux/x86-64 and e.g. on Linux/ARM, even in weird cases like NaN).
I cannot find any simple way to get that total order. I coded
// a totally ordering function,
// return -1 for less-than, 0 for equal, +1 for greater
int mydoublecompare(double x, double y) {
if (x==y) return 0;
else if (x<y) return -1;
else if (x>y) return 1;
int kx = std::fpclassify(x);
int ky = std::fpclassify(y);
if (kx == FP_INFINITE) return (x>0)?1:-1;
if (ky == FP_INFINITE) return (y>0)?-1:1;
if (kx == FP_NAN && ky == FP_NAN) return 0;
return (kx==ky)?0:(kx<ky)?-1:1;
}
Actually, I do know that it is not a really (mathematically speaking) total order
(since e.g. bit-wise different NaN are all equal), but I am hoping it has the same
(or a very close) behavior on several common architectures.
Any comments or suggestion?
(perhaps I should not care that much; and I deliberately don't care about signaling NaNs)
The overall motivation is that I am coding some dynamically typed interpreter which persists its entire memory state in JSON notation, and I want to be sure that the persistent state is stable between architectures, in other words if I load the JSON state and dump it, it stays idempotent for several architectures (notably all of x86-64, ia-32, ARM 32 bits...).

I would use:
int totalcompare(double x, double y) {
int64_t rx, ry;
memcpy(&rx, &x, sizeof rx);
memcpy(&ry, &y, sizeof ry);
if (rx == ry) return 0;
if (rx < 0) rx ^= INT64_MAX;
if (ry < 0) ry ^= INT64_MAX;
if (rx < ry) return -1; else return 1;
}
This makes 0.0 and -0.0 compare unequal, whereas if (x==y) return 0; in your version makes them compare equal, meaning that your version is only a preorder. NaN values are above the rest and different NaNs compare different. All values comparable for <= should be in the same order for the above relation.
Note: the above function is C. I do not know C++.

Related

Fast random/mutation algorithms (vector to vector) [duplicate]

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
class GradientNoise {
std::uint64_t m_seed;
std::uniform_int_distribution<std::uint8_t> distribution;
const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0), glm::vec2(1.0, -1.0),
glm::vec2(-1.0, -1.0)};
public:
GradientNoise(uint64_t seed) {
m_seed = seed;
distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
}
// 0 -> 1
// just passes the value through, origionally was perlin noise activation
double nonLinearActivationFunction(double value) {
//return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
return value;
}
// 0 -> 1
//cosine interpolation
double interpolate(double a, double b, double t) {
double mu2 = (1 - cos(t * M_PI)) / 2;
return (a * (1 - mu2) + b * mu2);
}
double noise(double x, double y) {
std::mt19937_64 rng;
//first get the bottom left corner associated
// with these coordinates
int corner_x = std::floor(x);
int corner_y = std::floor(y);
// then get the respective distance from that corner
double dist_x = x - corner_x;
double dist_y = y - corner_y;
double corner_0_contrib; // bottom left
double corner_1_contrib; // top left
double corner_2_contrib; // top right
double corner_3_contrib; // bottom right
std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
// each xy pair turns into distance vector from respective corner, corner zero is our starting corner (bottom
// left)
rng.seed(s1);
corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
rng.seed(s2);
corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
rng.seed(s3);
corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
rng.seed(s4);
corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
double u = nonLinearActivationFunction(dist_x);
double v = nonLinearActivationFunction(dist_y);
double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
double total_xy = interpolate(x_bottom, x_top, v);
return total_xy;
}
};
I then generate an OpenGL texture to display with like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for( int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
double d_noise = 0;
d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
d_noise/=5;
uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
temp_texture[j*4 + (i * width * 4) + 0] = (noise);
temp_texture[j*4 + (i * width * 4) + 1] = (noise);
temp_texture[j*4 + (i * width * 4) + 2] = (noise);
temp_texture[j*4 + (i * width * 4) + 3] = (255);
}
}
Which give good results:
But gprof is telling me that the Mersenne twister is taking up 62.4% of my time and growing with larger textures. Nothing else individual takes any where near as much time. While the Mersenne twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required for this to make sure that the same x and y generates the same gradient at each integer point (so you need either a hash function or seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before utilizing output, this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after initial seed to seed the PRNG again and use the value after wards. No difference in result.
What is happening? What can i do to get faster results that are of the same quality as the mersenne twister?
OK BIG UPDATE:
I don't know why this works, I know it has something to do with the prime number utilized, but after messing around a bit, it appears that the following works:
Step 1, incorporate the x and y values as seeds separately (and incorporate some other offset value or additional seed value with them, this number should be a prime/non trivial factor)
Step 2, Use those two seed results into seeding the generator again back into the function (so like geza said, the seeds made were bad)
Step 3, when getting the result, instead of using modulo number of items (4) trying to get, or & 3, modulo the result by a prime number first then apply & 3. I'm not sure if the prime being a mersenne prime matters or not.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's a result of the first hash of that page, it looks OK to me, and it is fast (I think it is much faster than Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>
unsigned int hash(unsigned int a) {
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
unsigned int ivalue(int x, int y) {
return hash(y<<16|x)&0xff;
}
float smooth(float x) {
return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}
float value(float x, float y) {
int ix = floor(x);
int iy = floor(y);
float fx = smooth(x-ix);
float fy = smooth(y-iy);
int v00 = ivalue(iy+0, ix+0);
int v01 = ivalue(iy+0, ix+1);
int v10 = ivalue(iy+1, ix+0);
int v11 = ivalue(iy+1, ix+1);
float v0 = v00*(1-fx) + v01*fx;
float v1 = v10*(1-fx) + v11*fx;
return v0*(1-fy) + v1*fy;
}
unsigned char pic[1024*1024];
int main() {
for (int y=0; y<1024; y++) {
for (int x=0; x<1024; x++) {
float v = 0;
for (int o=0; o<=9; o++) {
v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
}
int r = rint(v*0.5f);
pic[y*1024+x] = r;
}
}
FILE *f = fopen("x.pnm", "wb");
fprintf(f, "P5\n1024 1024\n255\n");
fwrite(pic, 1, 1024*1024, f);
fclose(f);
}
If you want to understand, how a hash function work (or better yet, which properties a good hash have), check out Bob's page, for example this.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except Mersenne Twister, all your tested PRNGs do not seem fit for your purpose. As I have not done further tests myself, I can only suggest to try out and measure further PRNGs.
The randomness of LCGs are known to be sensitive to the choice of their parameters. In particular, the period of a LCG is relative to the m parameter - at most it will be m (your prime factor) & for many values it can be less.
Similarly, the careful parameters selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while other do not. In order to isolate the cause, I would factor out the proc gen stuff & examine the PRNG output directly. An easy way to visualize the data is to build a grey scale image where each pixel value is a (possibly scaled) random value. For image based stuff, I find this to be an easy way to find stuff that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 264 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
What you see is the practical demonstration of quality of PRNG. Mersenne Twister is one of the best PRNGs with good performance, it passes DIEHARD tests. One should know that generating a random numbers is not an easy computational task, so looking for a better performance will inevitably result in poor quality. LCG is known to be simplest and worst PRNG ever designed and it clearly shows two-dimensional correlation as in your picture. The quality of Xorshift generators largely depend on bitness and parameters. They are definitely worse than Mersenne Twister, but some (xorshift128+) may work good enough to pass BigCrush battery of TestU01 tests.
In other words, if you are making an important physical modelling numerical experiment, you better continue to use Mersenne Twister as known to be a good trade-off between speed and quality and it comes in many standard libraries. On a less important case you may try to use xorshift128+ generator. For an ultimate results you need to use cryptographical-quality PRNG (none of mentioned here may be used for cryptographical purposes).

Using SIMD to find the biggest difference of two elements

I wrote an algorithm to get the biggest difference between two elements in an std::vector where the bigger of the two values must be at a higher index than the lower value.
unsigned short int min = input.front();
unsigned short res = 0;
for (size_t i = 1; i < input.size(); ++i)
{
if (input[i] <= min)
{
min = input[i];
continue;
}
int dif = input[i] - min;
res = dif > res ? dif : res;
}
return res != 0 ? res : -1;
Is it possible to optimize this algorithm using SIMD? I'm new to SIMD and so far I've been unsuccessful with this one
You didn't specify any particular architecture so I'll keep this mostly architecture neutral with an algorithm described in English. But it requires a SIMD ISA that can efficiently branch on SIMD compare results to check a usually-true condition, like x86 but not really ARM NEON.
This won't work well for NEON because it doesn't have a movemask equivalent, and SIMD -> integer causes stalls on many ARM microarchitectures.
The normal case while looping over the array is that an element, or a whole SIMD vector of elements, is not a new min, and not diff candidate. We can quickly fly through those elements, only slowing down to get the details right when there's a new min. This is like a SIMD strlen or SIMD memcmp, except instead of stopping at the first search hit, we just go scalar for one block and then resume.
For each vector v[0..7] of the input array (assuming 8 int16_t elements per vector (16 bytes), but that's arbitrary):
SIMD compare vmin > v[0..7], and check for any elements being true. (e.g. x86 _mm_cmpgt_epi16 / if(_mm_movemask_epi8(cmp) != 0)) If there's a new min somewhere, we have a special case: the old min applies to some elements, but the new min applies to others. And it's possible there are multiple new-min updates within the vector, and new-diff candidates at any of those points.
So handle this vector with scalar code (updating a scalar diff which doesn't need to be in sync with the vector diffmax because we don't need position).
Broadcast the final min to vmin when you're done. Or do a SIMD horizontal min so out-of-order execution of later SIMD iterations can get started without waiting for a vmin from scalar. Should work well if the scalar code is branchless, so there are no mispredicts in the scalar code that cause later vector work to be thrown out.
As an alternative, a SIMD prefix-sum type of thing (actually prefix-min) could produce a vmin where every element is the min up to that point. (parallel prefix (cumulative) sum with SSE). You could always do this to avoid any branching, but if new-min candidates are rare then it's expensive. Still, it could be viable on ARM NEON where branching is hard.
If there's no new min, SIMD packed max diffmax[0..7] = max(diffmax[0..7], v[0..7]-vmin). (Use saturating subtraction so you don't get wrap-around to a large unsigned difference, if you're using unsigned max to handle the full range.)
At the end of the loop, do a SIMD horizontal max of the diffmax vector. Notice that since we don't need the position of the max-difference, we don't need to update all elements inside the loop when one finds a new candidate. We don't even need to keep the scalar special-case diffmax and SIMD vdiffmax in sync with each other, just check at the end to take the max of the scalar and SIMD max diffs.
SIMD min/max is basically the same as a horizontal sum, except you use packed-max instead of packed-add. For x86, see Fastest way to do horizontal float vector sum on x86.
Or on x86 with SSE4.1 for 16-bit integer elements, phminposuw / _mm_minpos_epu16 can be used for min or max, signed or unsigned, with appropriate tweaks to the input. max = -min(-diffmax). You can treat diffmax as unsigned because it's known to be non-negative, but Horizontal minimum and maximum using SSE shows how to flip the sign bit to range-shift signed to unsigned and back.
We probably get a branch mispredict every time we find a new min candidate, or else we're finding new min candidates too often for this to be efficient.
If new min candidates are expected frequently, using shorter vectors could be good. Or on discovering there's a new-min in a current vector, then use narrower vectors to only go scalar over fewer elements. On x86, you might use bsf (bit-scan forward) to find which element had the first new-min. That gives your scalar code a data dependency on the vector compare-mask, but if the branch to it was mispredicted then the compare-mask will be ready. Otherwise if branch-prediction can somehow find a pattern in which vectors need the scalar fallback, prediction+speculative execution will break that data dependency.
Unfinished / broken (by me) example adapted from #harold's deleted answer of a fully branchless version that constructs a vector of min-up-to-that-element on the fly, for x86 SSE2.
(#harold wrote it with suffix-max instead of min, which is I think why he deleted it. I partially converted it from max to min.)
A branchless intrinsics version for x86 could look something like this. But branchy is probably better unless you expect some kind of slope or trend that makes new min values frequent.
// BROKEN, see FIXME comments.
// converted from #harold's suffix-max version
int broken_unfinished_maxDiffSSE(const std::vector<uint16_t> &input) {
const uint16_t *ptr = input.data();
// construct suffix-min
// find max-diff at the same time
__m128i min = _mm_set_epi32(-1);
__m128i maxdiff = _mm_setzero_si128();
size_t i = input.size();
for (; i >= 8; i -= 8) {
__m128i data = _mm_loadu_si128((const __m128i*)(ptr + i - 8));
// FIXME: need to shift in 0xFFFF, not 0, for min.
// or keep the old data, maybe with _mm_alignr_epi8
__m128i d = data;
// link with suffix
d = _mm_min_epu16(d, _mm_slli_si128(max, 14));
// do suffix-min within block.
d = _mm_min_epu16(d, _mm_srli_si128(d, 2));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xFA));
d = _mm_min_epu16(d, _mm_shuffle_epi32(d, 0xEE));
max = d;
// update max-diff
__m128i diff = _mm_subs_epu16(data, min); // with saturation to 0
maxdiff = _mm_max_epu16(maxdiff, diff);
}
// horizontal max
maxdiff = _mm_max_epu16(maxdiff, _mm_srli_si128(maxdiff, 2));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xFA));
maxdiff = _mm_max_epu16(maxdiff, _mm_shuffle_epi32(maxdiff, 0xEE));
int res = _mm_cvtsi128_si32(maxdiff) & 0xFFFF;
unsigned scalarmin = _mm_extract_epi16(min, 7); // last element of last vector
for (; i != 0; i--) {
scalarmin = std::min(scalarmin, ptr[i - 1]);
res = std::max(res, ptr[i - 1] - scalarmin);
}
return res != 0 ? res : -1;
}
We could replace the scalar cleanup with a final unaligned vector, if we handle the overlap between the last full vector min.

How to calculate modulus of 64-bit unsigned integer?

Note: This question is different from Fastest way to calculate a 128-bit integer modulo a 64-bit integer.
Here's a C# fiddle:
https://dotnetfiddle.net/QbLowb
Given the pseudocode:
UInt64 a = 9228496132430806238;
UInt32 d = 585741;
How do i calculate
UInt32 r = a % d?
The catch, of course, is that i am not in a compiler that supports the UInt64 data type.1 But i do have access to the Windows ULARGE_INTEGER union:
typedef struct ULARGE_INTEGER {
DWORD LowPart;
DWORD HighPart;
};
Which means really that i can turn my code above into:
//9228496132430806238 = 0x80123456789ABCDE
UInt32 a = 0x80123456; //high part
UInt32 b = 0x789ABCDE; //low part
UInt32 r = 585741;
How to do it
But now comes how to do the actual calculation. I can start with the pencil-and-paper long division:
________________________
585741 ) 0x80123456 0x789ABCDE
To make it simpler, we can work in variables:
Now we are working entirely with 32-bit unsigned types, which my compiler does support.
u1 = a / r; //integer truncation math
v1 = a % r; //modulus
But now i've brought myself to a standstill. Because now i have to calculate:
v1||b / r
In other words, I have to perform division of a 64-bit value, which is what i was unable to perform in the first place!
This must be a solved problem already. But the only questions i can find on Stackoverflow are people trying to calculate:
a^b mod n
or other cryptographically large multi-precision operations, or approximate floating point.
Bonus Reading
Microsoft Research: Division and Modulus for Computer Scientists
https://stackoverflow.com/questions/36684771/calculating-large-mods-by-hand
Fastest way to calculate a 128-bit integer modulo a 64-bit integer (unrelated question; i hate you people)
1But it does support Int64, but i don't think that helps me
Working with Int64 support
I was hoping for the generic solution to the performing modulus against a ULARGE_INTEGER (and even LARGE_INTEGER), in a compiler without native 64-bit support. That would be the correct, good, perfect, and ideal answer, which other people will be able to use when they need.
But there is also the reality of the problem i have. And it can lead to an answer that is generally not useful to anyone else:
cheating by calling one of the Win32 large integer functions (although there is none for modulus)
cheating by using 64-bit support for signed integers
I can check if a is positive. If it is, i know my compiler's built-in support for Int64 will handle:
UInt32 r = a % d; //for a >= 0
Then there's there's how to handle the other case: a is negative
UInt32 ModU64(ULARGE_INTEGER a, UInt32 d)
{
//Hack: Our compiler does support Int64, just not UInt64.
//Use that Int64 support if the high bit in a isn't set.
Int64 sa = (Int64)a.QuadPart;
if (sa >= 0)
return (sa % d);
//sa is negative. What to do...what to do.
//If we want to continue to work with 64-bit integers,
//we could now treat our number as two 64-bit signed values:
// a == (aHigh + aLow)
// aHigh = 0x8000000000000000
// aLow = 0x0fffffffffffffff
//
// a mod d = (aHigh + aLow) % d
// = ((aHigh % d) + (aLow % d)) % d //<--Is this even true!?
Int64 aLow = sa && 0x0fffffffffffffff;
Int64 aHigh = 0x8000000000000000;
UInt32 rLow = aLow % d; //remainder from low portion
UInt32 rHigh = aHigh % d; //this doesn't work, because it's "-1 mod d"
Int64 r = (rHigh + rLow) % d;
return d;
}
Answer
It took a while, but i finally got an answer. I would post it as an answer; but Z29kIGZ1Y2tpbmcgZGFtbiBzcGVybSBidXJwaW5nIGNvY2tzdWNraW5nIHR3YXR3YWZmbGVz people mistakenly decided that my unique question was an exact duplicate.
UInt32 ModU64(ULARGE_INTEGER a, UInt32 d)
{
//I have no idea if this overflows some intermediate calculations
UInt32 Al = a.LowPart;
UInt32 Ah = a.HighPart;
UInt32 remainder = (((Ah mod d) * ((0xFFFFFFFF - d) mod d)) + (Al mod d)) mod d;
return remainder;
}
Fiddle
I just updated my ALU32 class code in this related QA:
Cant make value propagate through carry
As CPU assembly independent code for mul,div was requested. The divider is solving all your problems. However it is using Binary long division so its a bit slover than stacking up 32 bit mul/mod/div operations. Here the relevant part of code:
void ALU32::div(DWORD &c,DWORD &d,DWORD ah,DWORD al,DWORD b)
{
DWORD ch,cl,bh,bl,h,l,mh,ml;
int e;
// edge cases
if (!b ){ c=0xFFFFFFFF; d=0xFFFFFFFF; cy=1; return; }
if (!ah){ c=al/b; d=al%b; cy=0; return; }
// align a,b for binary long division m is the shifted mask of b lsb
for (bl=b,bh=0,mh=0,ml=1;bh<0x80000000;)
{
e=0; if (ah>bh) e=+1; // e = cmp a,b {-1,0,+1}
else if (ah<bh) e=-1;
else if (al>bl) e=+1;
else if (al<bl) e=-1;
if (e<=0) break; // a<=b ?
shl(bl); rcl(bh); // b<<=1
shl(ml); rcl(mh); // m<<=1
}
// binary long division
for (ch=0,cl=0;;)
{
sub(l,al,bl); // a-b
sbc(h,ah,bh);
if (cy) // a<b ?
{
if (ml==1) break;
shr(mh); rcr(ml); // m>>=1
shr(bh); rcr(bl); // b>>=1
continue;
}
al=l; ah=h; // a>=b ?
add(cl,cl,ml); // c+=m
adc(ch,ch,mh);
}
cy=0; c=cl; d=al;
if ((ch)||(ah)) cy=1; // overflow
}
Look the linked QA for description of the class and used subfunctions. The idea behind a/b is simple:
definition
lets assume that we got 64/64 bit division (modulus will be a partial product) and want to use 32 bit arithmetics so:
(ah,al) / (bh,bl) = (ch,cl)
each 64bit QWORD will be defined as high and low 32bit DWORD.
align a,b
exactly like computing division on paper we must align b so it divides a so find sh that:
(bh,bl)<<sh <= (ah,al)
(bh,bl)<<(sh+1) > (ah,al)
and compute m so
(mh,ml) = 1<<sh
beware that in case bh>=0x80000000 stop the shifting or we would overflow ...
divide
set result c = 0 and then simply substract b from a while b>=a. For each substraction add m to c. Once b>a shift both b,m right to align again. Stop if m==0 or a==0.
result
c will hold 64bit result of division so use cl and similarly a holds the remainder so use al as your modulus result. You can check if ch,ah are zero if not overflow occurs (as result is bigger than 32 bit). The same goes for edge cases like division by zero...
Now as you want 64bit/32bit simply set bh=0 ... To do this I needed 64bit operations (+,-,<<,>>) which I did by stacking up 32bit operations with Carry (that is the reason why my ALU32 class was created in the first place) for more info see the link above.

How can I use mach_absolute_time without overflowing?

On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).1
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
U64 high = (i >> 32) * numer;
U64 low = (i & 0xffffffffull) * numer / denom;
U64 highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denum, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
If denom=33333335 as returned by mach_timebase_info, we can handle differences of up to 18 seconds only before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval over 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, source code of systemUptime
just does something similar as previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
DispatchTime start = DispatchTime.now()
// do something
DispatchTime stop = DispatchTime.now()
TimeInterval durationInSeconds = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, source code of DispatchTime.now() says it basically simply returns a struct DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just discards results if the multiplication can't be stored in an UInt64.
If mach_absolute_time() sets the uint64 back to 0 then reset the time calculations if less than the last check.
That's the problem, they don't document what happens when the uint64 reaches all ones (binary).
read it. https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time

How to wrap a number using mod operator

Not sure if this is possible, but is there an automatic way, using mod or something similiar, to automatically correct bad input values? For example:
If r>255, then set r=255 and
if r<0, then set r=0
So basically what I'm asking is whats a clever mathematical way to set this rather than using
if(r>255)
r=255;
if(r<0)
r=0;
How about:
r = std:max(0, std::min(r, 255));
The following function will output what you are looking for:
f(x) = (510*(1 + Sign[-255 + x]) + x*(1 + Sign[255 - x])*(1 + Sign[x]))/4
As shown here:
Could you do something like --
R = MIN(r, 255);
R = MAX(R, 0);
Depending on how your hardware and possibly how your interpreter deal with ints, you can do this:
Assuming that an unsigned int is 16 bits (to keep my masks short):
r = r & 0000000011111111;
If an int was 32 bits, you'd need 16 more zeros at the start of the bit mask.
After that bitwise AND, the maximum value r can have is 255. Depending on the hardware, an unsigned int might do something odd given a value below zero. I believe that case is already handled by the bitmask (at least on the hardware that I've used). If not, you can do r = min(r, 0); first.
I had similar problem when dealing with images. For some special values (like these ones, 0 and 255) you can use this nonportable method:
static inline int trim_8bit(unsigned i){
return 0xff & ((i | -!!(i & ~0xff))) + (i >> 31);
// where "0xff &" can be omitted if you return unsigned char
};
In real cases the clamping have to be performed rarely, so that you could write
static inline unsigned char trim_8bit_v2(unsigned i){
if (__builtin_expect(i & ~0xFF, 0)) // it's for gcc, use __assume for MSVC
return (i >> 31) - 1;
return i;
};
And to be sure which is fastest, measure.

Resources