How to set a negative number to infinity without using an if statement (or ternary) - performance

I have the following piece of code:
for(uint i=0; i<6; i++)
    coeffs[i] = coeffs[i] < 0 ? 1.f/0.f : coeffs[i];
This checks an array of 6 elements: any negative entry is set to infinity, and every other entry is left intact.
I need to do the same thing without using any if statements.

One obvious question would be what infinity you need when the input is less than 0.
Any Infinity
If the result can be negative infinity, I'd do something like this:
coeffs[i] /= (coeffs[i] >= 0.0);
The coeffs[i] >= 0.0 comparison produces 1 if the input is non-negative, and 0 if it is negative (and that result converts to 1.0 or 0.0 in the division). Dividing the input by 1.0 leaves it unchanged; dividing it by 0.0 produces infinity.
Positive Infinity
If it has to be a positive infinity, you'd change that to something like:
coeffs[i] = fabs(coeffs[i]) / (coeffs[i] >= 0.0);
By taking the absolute value before the division, the infinity we produce for a negative input is forced to be positive. Otherwise, the input started out non-negative, so the fabs and the division by 1.0 leave the value intact.
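Put back into the original loop, that looks roughly like this (a minimal C sketch, assuming IEEE-754 arithmetic where division by zero yields infinity rather than trapping; the helper name is made up):
#include <math.h>
/* Any negative entry becomes +infinity; everything else is left alone,
   with no branches at the source level. */
void clamp_negatives_to_inf(float coeffs[6])
{
    for (int i = 0; i < 6; i++)
        coeffs[i] = fabsf(coeffs[i]) / (float)(coeffs[i] >= 0.0f);
}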
Performance
As to whether this will actually improve performance, that's much more open to question. For the moment, let's look at code for the CPU, since Godbolt lets us examine that pretty easily.
If we look at this:
#include <limits>
double f(double in) {
    return in / (in >= 0.0);
}
double g(double in) {
    return in > 0.0 ? in : std::numeric_limits<double>::infinity();
}
So, let's look at the code produced for the first function:
xorpd xmm1, xmm1
cmplesd xmm1, xmm0
movsd xmm2, qword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero
andpd xmm2, xmm1
divsd xmm0, xmm2
ret
So that's not too terrible--branch-free, and (depending on the exact processor involved) a throughput around 8-10 cycles on most reasonably modern processors. On the other hand, here's the code produced for the second function:
xorpd xmm1, xmm1
cmpltsd xmm1, xmm0
andpd xmm0, xmm1
movsd xmm2, qword ptr [rip + .LCPI1_0] # xmm2 = mem[0],zero
andnpd xmm1, xmm2
orpd xmm0, xmm1
ret
This is also branch-free--and doesn't have that (relatively slow) divsd instruction either. Again, performance will vary depending on the specific processor, but we can probably plan on this having a throughput around 6 cycles or so--not tremendously faster than the previous, but probably at least a few cycles faster part of the time, and almost certain to never be any slower. In short, it's probably preferable under nearly any possible CPU.
GPU Code
GPUs have their own instruction sets, of course--but given the penalty they suffer for branches, compilers for them (and the instruction sets they provide) probably do at least as much to help eliminate branches as CPUs do, so chances are that the straightforward code will work just fine on it as well (though to say with certainty, you'd need to either examine the code it produced or profile it).

Big disclaimer up front: I haven't actually tested this, but I doubt it really is faster than using ternaries. Perform benchmarks to see if it really is an optimization!
Also: these are implemented/tested in C. They should be easily portable to GLSL, but you may need explicit type-conversions, which may make them (even) slower.
There are two ways to do it, based on whether you strictly need INFINITY or can just use a large value. Neither uses branching expressions or statements, but they do involve a comparison. Both use the fact that comparison operators in C always return either 0 or 1.
The INFINITY-based way uses a 2-element array and has the comparison output choose the element of the choice-array:
float chooseCoefs[2] = {0.f, INFINITY}; /* initialize choice-array */
for(uint i = 0; i < 6; i++){
    int neg = coefs[i] < 0; /* outputs 1 or 0 */
    /* set 0-element of choice-array to regular value */
    chooseCoefs[0] = coefs[i];
    /* if neg == 0: pick coefs[i], else neg == 1: pick INFINITY */
    coefs[i] = chooseCoefs[neg];
}
If you can use a normal (but big) value instead of INFINITY, you can use two multiplications and one addition instead:
#define BIGFLOAT 1000.f /* a swimming sasquatch... */
for(uint i = 0; i < 6; i++){
    int neg = coefs[i] < 0;
    /* if neg == 1: 1 * BIGFLOAT + 0 * coefs[i] == BIGFLOAT,
       else neg == 0: 0 * BIGFLOAT + 1 * coefs[i] == coefs[i] */
    coefs[i] = neg * BIGFLOAT + !neg * coefs[i];
}
Again, I didn't benchmark these, but my guess is that at least the array-based solution is far slower than simple ternaries. Don't underestimate the optimizing-power of your compiler!

Related

Assembly language using signed int multiplication math to perform shifts

This is a bit of a turn around.
Usually one is attempting to use shifts to perform multiplication and not the other way around.
On the Hitachi/Motorola 6309 there is no shift by n bits. There is only shift by 1 bit.
However there is a 16 bit x 16 bit signed multiply (provides a 32 bit signed result).
(EDIT) Using this is no problem for a 16 bit shift (left); however, I'm trying to use 2 x 16x16 signed mults to do a 32 bit shift. The high order word of the result for the low order word shift is the problem. (Does that make sense?)
Some pseudo code might help:
result.highword = low word of (val.highword * shiftmulttable[shift])
temp = val.lowword * shiftmulttable[shift]
result.lowword = temp.lowword
result.highword = or (result.highword, temp.highword)
(with some magic on temp.highword to consider signed values)
I have been exercising my logic in an attempt to use this instruction to perform the shifts but so far I have failed.
I can easily achieve any positive value shifts by 0 to 14, but when it comes to shifting by 15 bits (multiplying by 0x8000) or shifting any negative values, certain combinations of values require either:
complementing the result by 1
complementing the result by 2
adding 1 to the result
doing nothing to the result
And I just can't see any pattern to these values.
Any ideas appreciated!
Best I can tell from the problem description, implementing the 32-bit shift would work as desired by using an unsigned 16x16->32 bit multiply. This can easily be synthesized from a signed 16x16->32 multiply instruction by exploiting the two's complement integer representation. If the two factors are a and b, adding b to the high-order 16 bits of the signed product when a is negative, and adding a to the high-order 16 bits of the signed product when b is negative will give us the unsigned multiplication result.
The following C code implements this approach and tests it exhaustively:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* signed 16x16->32 bit multiply. Hardware instruction */
int32_t mul16_wide (int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* unsigned 16x16->32 bit multiply (synthetic) */
int32_t umul16_wide (int16_t a, int16_t b)
{
    int32_t p = mul16_wide (a, b); // signed 16x16->32 bit multiply
    if (a < 0) p = p + (b << 16);  // add 'b' to upper 16 bits of product
    if (b < 0) p = p + (a << 16);  // add 'a' to upper 16 bits of product
    return p;
}

/* unsigned 16x16->32 bit multiply (reference) */
uint32_t umul16_wide_ref (uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* test synthetic unsigned multiply exhaustively */
int main (void)
{
    int16_t a, b;
    int32_t res, ref;
    uint64_t count = 0;
    a = -32768;
    do {
        b = -32768;
        do {
            res = umul16_wide (a, b);
            ref = umul16_wide_ref (a, b);
            count++;
            if (res != ref) {
                printf ("!!!! a=%d b=%d res=%d ref=%d\n", a, b, res, ref);
                return EXIT_FAILURE;
            }
            if (b == 32767) break;
            b = b + 1;
        } while (1);
        if (a == 32767) break;
        a = a + 1;
    } while (1);
    printf ("test cases passed: %llx\n", count);
    return EXIT_SUCCESS;
}
I am not familiar with the Hitachi/Motorola 6309 architecture. I assume it uses a special 32-bit register to hold the result of a wide multiply, from which high and low half can be extracted into 16-bit general-purpose registers, and the conditional corrections can then be applied to the register holding the upper 16 bits.
Are you using fixed-point multiplicative inverses to use the high half result for a right shift?
If you're just left-shifting, multiply by 0x8000 should work. The low half of an NxN => 2N-bit multiply is the same whether inputs are treated as signed or unsigned. Or do you need a 32-bit shift result from your 16-bit input?
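The low-half claim is easy to sanity-check in C (a tiny illustration, not part of the original exchange):
#include <assert.h>
#include <stdint.h>
int main (void)
{
    int16_t a = -1234, b = 321;
    /* low 16 bits of the signed product */
    uint16_t lo_signed   = (uint16_t)((int32_t)a * (int32_t)b);
    /* low 16 bits of the unsigned product of the same bit patterns */
    uint16_t lo_unsigned = (uint16_t)((uint32_t)(uint16_t)a * (uint16_t)b);
    assert(lo_signed == lo_unsigned);   /* identical, as claimed */
    return 0;
}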
Is the multiply instruction actually faster than a few 1-bit shifts for small shift counts? (I wouldn't be surprised if compile-time-constant counts of 2 or 3 would be faster with just a chain of 2 or 3 add same,same or left-shift instructions.)
Anyway, for a compile-time-constant shift count of 15, maybe just multiply by 1<<14 and then do the last count with a 1-bit shift (add same,same).
Or if your ISA has rotates, rotate right by 1 and mask away the low bits, skipping the multiply. Or zero a register, right-shift the low bit into the carry flag, then rotate-through-carry into the top of the zeroed register.
(The latter might be useful on an ISA that doesn't have large immediates and couldn't "mask away all the low bits" in one instruction. Or an ISA that only has RCR not ROR. I don't know 6309 at all)
If you're using a runtime count to look up a multiplier from a table, maybe branch for that case, or adjust your LUT so every entry needs an extra 1-bit shift, so you can do mul(lut[count]) and an unconditional extra shift.
(Only works if you don't need to support a shift-count of zero.)
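A rough C sketch of that adjusted-LUT idea (the table and helper are made up for illustration; entry n holds 2^(n-1), so a shift count of zero is not supported):
#include <stdint.h>
static const uint16_t shift_lut[16] = {
    0,       1u << 0, 1u << 1,  1u << 2,  1u << 3,  1u << 4,  1u << 5,  1u << 6,
    1u << 7, 1u << 8, 1u << 9,  1u << 10, 1u << 11, 1u << 12, 1u << 13, 1u << 14
};
/* Shift v left by count (1..15): multiply by half the power of two, then finish
   with one unconditional 1-bit shift, so count == 15 never needs the awkward
   0x8000 multiplier. */
uint32_t shl_var(uint16_t v, unsigned count)
{
    uint32_t t = (uint32_t)v * shift_lut[count & 15];
    return t << 1;
}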
Not that there would be many interested people who would want to see the 6309 code, but here it is:
Compliant with OS9 C ABI.
Pointer to result and arguments pushed on stack right to left.
Stack layout (byte offsets): 0 = U, 2 = PC, 4 = val (4 bytes), 8 = shift (2 bytes), 10 = *result (2 bytes)
* 10,s pointer to long result
* 4,s 4 byte value
* 8,s 2 byte shift
* x = pointer to result
pshs u
ldx 10,s * load pointer to result
ldd 8,s * load shift
* if shift amount is greater than 31 then
* just return zero. OS9 C standard.
cmpd #32
blt _10x
ldq #0
stq 4,s
bra _13x
* if the shift amount is 16 or greater then
* move bottom word of value into top word
* and clear bottom word
_10x
cmpb #16
blt _1x
ldu 6,s
stu 4,s
clr 6,s
clr 7,s
_1x
* setup pointer u and offset e into mult table _2x
leau _2x,pc
andb #15
* if there is no shift value just return value
beq _13x
aslb * need to double shift to use as word table offset
stb 8,s * save double shft
tfr b,e
* shift top word q = val.word.high * multtab[shft]
ldd 4,s
muld e,u
stw ,x * result.word.high = low word of mult
* shift bottom word q = val.word.low * multtab[shft]
lde 8,s * reload double shft
ldd 6,s
muld e,u
stw 2,x * result.word.low = low word of mult
* The high word of mult needs to be corrected for sign
* if val is negative then muld will return negated results
* and we need to un-negate it
lde 8,s * reload double shift
tst 4,s * test top byte of val for negative
bge _11x
addd e,u * add the multtab[shft] again to top word
_11x
* if multtab[shft] is negative (shft is 15 or shft<<1 is 30)
* we also need to un-negate the result
cmpe #30
bne _12x
addd 6,s * add val.word.low to top word
_12x
* combine top and bottom and save bottom half of result
ord ,x
std ,x
bra _14x
* this is only reached if the result is in value (let result = value)
_13x
ldq 4,s * load value
stq ,x * result = value
_14x
puls u,pc
_2x fdb $01,$02,$04,$08,$10,$20,$40,$80,$0100,$0200,$0400,$0800
fdb $1000,$2000,$4000,$8000

What is the fastest way to handle overflow on integer division/remainder without panic?

I'm still improving overflower to handle integer overflow. One goal was to be able use #[overflow(wrap)] to avoid panics on overflow. However, I found out that the .wrapping_div(_) and .wrapping_rem(_) functions of the standard integer types do in fact panic when dividing by zero. Edit: To motivate this use case better: Within interrupt handlers, we absolutely want to avoid panics. I assume that the div-by-zero condition is highly unlikely, but we still need to return a "valid" value for some definition of valid.
One possible solution is saturating the value (which I do when code is annotated with #[overflow(saturate)]), but this is likely relatively slow (especially since other, more common operations are also saturated). So I want to add an #[overflow(no_panic)] mode that avoids panics completely, and is almost as fast as #[overflow(wrap)] in all cases.
My question is: What is the fastest way to return something (don't care what) without panicking on dividing (or getting the remainder) by zero?
Disclaimer: this isn't really a serious answer. It is almost certainly slower than the naive solution of using an if statement to check whether the divisor is zero.
#![feature(asm)]

fn main() {
    println!("18 / 3 = {}", f(18, 3));
    println!("2555 / 10 = {}", f(2555, 10));
    println!("-16 / 3 = {}", f(-16, 3));
    println!("7784388 / 0 = {}", f(7784388, 0));
}

fn f(x: i32, y: i32) -> i32 {
    let z: i32;
    unsafe {
        asm!(
            "
            test %ecx, %ecx
            lahf
            and $$0x4000, %eax
            or %eax, %ecx
            mov %ebx, %eax
            cdq
            idiv %ecx
            "
            : "={eax}"(z)
            : "{ebx}"(x), "{ecx}"(y)
            : "{edx}"
        );
    }
    z
}
pub fn nopanic_signed_div(x: i32, y: i32) -> i32 {
    if y == 0 || y == -1 {
        // Divide by -1 is equivalent to neg; we don't care what
        // divide by zero returns.
        x.wrapping_neg()
    } else {
        // (You can replace this with unchecked_div to make it more
        // obvious this will never panic.)
        x / y
    }
}
This produces the following on x86-64 with "rustc 1.11.0-nightly (6e00b5556 2016-05-29)":
movl %edi, %eax
leal 1(%rsi), %ecx
cmpl $1, %ecx
ja .LBB0_2
negl %eax
retq
.LBB0_2:
cltd
idivl %esi
retq
It should produce something similar on other platforms.
At least one branch is necessary because LLVM IR considers divide by zero to be undefined behavior. Checking for 0 and -1 separately would involve an extra branch. With those constraints, there isn't really any other choice.
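For reference, a plain C sketch of the same combined check the compiler produces above (a hypothetical helper, not part of the original answers): (unsigned)y + 1 <= 1 is true exactly when y is 0 or -1, so both special cases share a single branch.
int nopanic_signed_div_c(int x, int y)
{
    if ((unsigned)y + 1u <= 1u)          /* y == 0 or y == -1 */
        return (int)(0u - (unsigned)x);  /* wrapping negate, avoids signed-overflow UB */
    return x / y;
}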
(It might be possible to come up with something slightly faster with inline assembly, but it would be a terrible idea because you would end up generating much worse code in the case of dividing by a constant.)
Whether this solution is actually appropriate probably depends on what your goal is; a divide by zero is probably a logic error, so silently accepting it seems like a bad idea.

How to divide by 9 using just shifts/add/sub?

Last week I was in an interview and there was a test like this:
Calculate N/9 (given that N is a positive integer), using only
SHIFT LEFT, SHIFT RIGHT, ADD, SUBTRACT instructions.
First, find the representation of 1/9 in binary:
0.000111000111000111... (truncated here to 16 fraction bits: 0.0001110001110001)
That means it is (1/16) + (1/32) + (1/64) + (1/1024) + (1/2048) + (1/4096) + (1/65536),
so x/9 approximately equals (x>>4) + (x>>5) + (x>>6) + (x>>10) + (x>>11) + (x>>12) + (x>>16).
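In C, that shift-and-add version looks roughly like this (a hypothetical helper; each right shift truncates, so the result underestimates x/9, increasingly so for larger x, which the loop and fixed-point answers below address):
#include <stdint.h>
uint32_t div9_approx(uint32_t x)
{
    return (x >> 4) + (x >> 5) + (x >> 6)
         + (x >> 10) + (x >> 11) + (x >> 12)
         + (x >> 16);
}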
Possible optimization (if loops are allowed):
if you loop over 0001110001110001b right shifting it each loop,
add "x" to your result register whenever the carry was set on this shift
and shift your result right each time afterwards,
your result is x/9
mov cx, 16 ; assuming 16 bit registers
mov bx, 7281 ; bit mask of 2^16 * (1/9)
mov ax, 8166 ; sample value, (1/9 of it is 907)
mov dx, 0 ; dx holds the result
div9:
inc ax ; or "add ax,1" if inc's not allowed :)
; workaround for the fact that 7/64
; are a bit less than 1/9
shr bx,1
jnc no_add
add dx,ax
no_add:
shr dx,1
dec cx
jnz div9
(currently cannot test this, may be wrong)
You can use a fixed-point math trick:
scale the value up so that the significant fractional part moves into the integer range, do the math you need, and scale back down.
a/9 = ((a*10000)/9)/10000
As you can see, I scaled by 10000. The integer part of 10000/9 = 1111 is now big enough, so I can write:
a/9 ~= (a*1111)/10000
power of 2 scale
If you use a power-of-2 scale, then you just need a bit shift instead of a division. You have to compromise between precision and input value range. I empirically found that with 32-bit arithmetic the best scale for this is 1<<18, so:
(((a+1)<<18)/9)>>18 = ~a/9;
The (a+1) corrects the rounding errors back to the right range.
Hardcoded multiplication
Rewrite the multiplication constant to binary
q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
Now if you need to compute c = a*q, use hard-coded binary multiplication: for each 1 bit of q, add a<<(position_of_1) to c. If you see a run like 111, you can rewrite it as 1000-1 to minimize the number of operations.
If you put all of this together, you should get something like this C++ code of mine:
DWORD div9(DWORD a)
{
    // ((a+1)*q)>>18 = (((a+1)<<18)/9)>>18 = ~a/9;
    // q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
    // valid for a = < 0 , 147455 >
    DWORD c;
    c =(a<< 3)-(a    );   // c = a*29127
    c+=(a<< 9)-(a<< 6);
    c+=(a<<15)-(a<<12);
    c+=29127;             // c = (a+1)*29127
    c>>=18;               // c = ((a+1)*29127)>>18
    return c;
}
Now, if you look at the binary form, the pattern 111000 repeats, so you can further improve the code a bit:
DWORD div9(DWORD a)
{
    DWORD c;
    c =(a<<3)-a;          // first pattern
    c+=(c<<6)+(c<<12);    // and the other 2...
    c+=29127;
    c>>=18;
    return c;
}
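As a quick sanity check (not part of the original answer), the routine can be compared against plain division over the stated valid range; here DWORD is assumed to be a 32-bit unsigned type:
#include <stdio.h>
#include <stdint.h>
typedef uint32_t DWORD;
DWORD div9(DWORD a);   /* either version from above */
int main(void)
{
    for (DWORD a = 0; a <= 147455; a++)
        if (div9(a) != a / 9) {
            printf("mismatch at %u: got %u, expected %u\n", a, div9(a), a / 9);
            return 1;
        }
    puts("all values in <0, 147455> verified");
    return 0;
}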

Faster lookup tables using AVX2

I'm trying to speed up an algorithm which performs a series of table lookups. I'd like to use SSE2 or AVX2. I've tried using the _mm256_i32gather_epi32 intrinsic, but it is 31% slower. Does anyone have any suggestions for improvements or a different approach?
Timings:
C code = 234
Gathers = 340
static const int32_t g_tables[2][64]; // values between 0 and 63
template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
const int32_t * lut = g_tables[which];
// Leave this code for Broadwell or Skylake since it's 31% slower than C code
// (gather is 12 for Haswell, 7 for Broadwell and 5 for Skylake)
#if 0
if (sizeof(T) == sizeof(int16_t)) {
__m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7;
__m128i sse0, sse1, sse2, sse3, sse4, sse5, sse6, sse7;
__m256i mask = _mm256_set1_epi32(0xffff);
avx0 = _mm256_loadu_si256((__m256i *)(lut));
avx1 = _mm256_loadu_si256((__m256i *)(lut + 8));
avx2 = _mm256_loadu_si256((__m256i *)(lut + 16));
avx3 = _mm256_loadu_si256((__m256i *)(lut + 24));
avx4 = _mm256_loadu_si256((__m256i *)(lut + 32));
avx5 = _mm256_loadu_si256((__m256i *)(lut + 40));
avx6 = _mm256_loadu_si256((__m256i *)(lut + 48));
avx7 = _mm256_loadu_si256((__m256i *)(lut + 56));
avx0 = _mm256_i32gather_epi32((int32_t *)(src), avx0, 2);
avx1 = _mm256_i32gather_epi32((int32_t *)(src), avx1, 2);
avx2 = _mm256_i32gather_epi32((int32_t *)(src), avx2, 2);
avx3 = _mm256_i32gather_epi32((int32_t *)(src), avx3, 2);
avx4 = _mm256_i32gather_epi32((int32_t *)(src), avx4, 2);
avx5 = _mm256_i32gather_epi32((int32_t *)(src), avx5, 2);
avx6 = _mm256_i32gather_epi32((int32_t *)(src), avx6, 2);
avx7 = _mm256_i32gather_epi32((int32_t *)(src), avx7, 2);
avx0 = _mm256_and_si256(avx0, mask);
avx1 = _mm256_and_si256(avx1, mask);
avx2 = _mm256_and_si256(avx2, mask);
avx3 = _mm256_and_si256(avx3, mask);
avx4 = _mm256_and_si256(avx4, mask);
avx5 = _mm256_and_si256(avx5, mask);
avx6 = _mm256_and_si256(avx6, mask);
avx7 = _mm256_and_si256(avx7, mask);
sse0 = _mm_packus_epi32(_mm256_castsi256_si128(avx0), _mm256_extracti128_si256(avx0, 1));
sse1 = _mm_packus_epi32(_mm256_castsi256_si128(avx1), _mm256_extracti128_si256(avx1, 1));
sse2 = _mm_packus_epi32(_mm256_castsi256_si128(avx2), _mm256_extracti128_si256(avx2, 1));
sse3 = _mm_packus_epi32(_mm256_castsi256_si128(avx3), _mm256_extracti128_si256(avx3, 1));
sse4 = _mm_packus_epi32(_mm256_castsi256_si128(avx4), _mm256_extracti128_si256(avx4, 1));
sse5 = _mm_packus_epi32(_mm256_castsi256_si128(avx5), _mm256_extracti128_si256(avx5, 1));
sse6 = _mm_packus_epi32(_mm256_castsi256_si128(avx6), _mm256_extracti128_si256(avx6, 1));
sse7 = _mm_packus_epi32(_mm256_castsi256_si128(avx7), _mm256_extracti128_si256(avx7, 1));
_mm_storeu_si128((__m128i *)(dst), sse0);
_mm_storeu_si128((__m128i *)(dst + 8), sse1);
_mm_storeu_si128((__m128i *)(dst + 16), sse2);
_mm_storeu_si128((__m128i *)(dst + 24), sse3);
_mm_storeu_si128((__m128i *)(dst + 32), sse4);
_mm_storeu_si128((__m128i *)(dst + 40), sse5);
_mm_storeu_si128((__m128i *)(dst + 48), sse6);
_mm_storeu_si128((__m128i *)(dst + 56), sse7);
}
else
#endif
{
for (int32_t i = 0; i < 64; i += 4)
{
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
}
}
}
You're right that gather is slower than a PINSRD loop on Haswell. It's probably nearly break-even on Broadwell. (See also the x86 tag wiki for perf links, especially Agner Fog's insn tables, microarch pdf, and optimization guide)
If your indices are small, or you can slice them up, pshufb can be used as a parallel LUT with 4-bit indices. It gives you sixteen 8-bit table entries, but you can use stuff like punpcklbw to combine two vectors of byte results into one vector of 16-bit results. (Separate tables for the high and low halves of the LUT entries, with the same 4-bit indices.)
This kind of technique gets used for Galois Field multiplies, when you want to multiply every element of a big buffer of GF16 values by the same value. (e.g. for Reed-Solomon error correction codes.) Like I said, taking advantage of this requires taking advantage of special properties of your use-case.
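As a concrete illustration of pshufb-as-LUT (a classic per-byte popcount with 4-bit indices, not the OP's tables or the high/low-half scheme described above):
#include <tmmintrin.h>  /* SSSE3 */
/* Each byte is split into its low and high nibble, each nibble indexes a
   16-entry byte table via pshufb, and the two lookups are summed. */
static __m128i popcount_bytes(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo), _mm_shuffle_epi8(lut, hi));
}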
AVX2 can do two 128b pshufbs in parallel, in each lane of a 256b vector. There is nothing better until AVX512F: __m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b). There are byte (vpermi2b in AVX512VBMI), word (vpermi2w in AVX512BW), dword (this one, vpermi2d in AVX512F), and qword (vpermi2q in AVX512F) element size versions. This is a full cross-lane shuffle, indexing into two concatenated source registers. (Like AMD XOP's vpperm).
The two different instructions behind the one intrinsic (vpermt2d / vpermi2d) give you a choice of overwriting the table with the result, or overwriting the index vector. The compiler will pick based on which inputs are reused.
Your specific case:
*dst++ = src[*lut++];
The lookup-table is actually src, not the variable you've called lut. lut is actually walking through an array which is used as a shuffle-control mask for src.
You should make g_tables an array of uint8_t for best performance. The entries are only 0..63, so they fit. Zero-extending loads into full registers are as cheap as normal loads, so it just reduces the cache footprint. To use it with AVX2 gathers, use vpmovzxbd. The intrinsic is frustratingly difficult to use as a load, because there's no form that takes an int64_t *, only __m256i _mm256_cvtepu8_epi32 (__m128i a) which takes a __m128i. This is one of the major design flaws with intrinsics, IMO.
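For example, producing a gather index vector from byte-sized table entries might look like this (an illustrative sketch; the names are made up):
#include <immintrin.h>
#include <stdint.h>
/* Load 8 byte-sized indices and zero-extend them to 8 x int32 for use with
   _mm256_i32gather_epi32. The 8 bytes have to go through a __m128i first,
   because _mm256_cvtepu8_epi32 has no pointer-taking form. */
static inline __m256i load_indices_u8(const uint8_t *lut8)
{
    __m128i idx8 = _mm_loadl_epi64((const __m128i *)lut8);  /* low 8 bytes */
    return _mm256_cvtepu8_epi32(idx8);                       /* vpmovzxbd */
}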
I don't have any great ideas for speeding up your loop. Scalar code is probably the way to go here. The SIMD code shuffles 64 int16_t values into a new destination, I guess. It took me a while to figure that out, because I didn't find the if (sizeof...) line right away, and there are no comments. :( It would be easier to read if you used sane variable names, not avx0... Using x86 gather instructions for elements smaller than 4B certainly requires annoying masking. However, instead of pack, you could use a shift and OR.
You could make an AVX512 version for sizeof(T) == sizeof(int8_t) or sizeof(T) == sizeof(int16_t), because all of src will fit into one or two zmm registers.
If g_tables was being used as a LUT, AVX512 could do it easily, with vpermi2b. You'd have a hard time without AVX512, though, because a 64 byte table is too big for pshufb. Using four lanes (16B) of pshufb for each input lane could work: Mask off indices outside 0..15, then indices outside 16..31, etc, with pcmpgtb or something. Then you have to OR all four lanes together. So this sucks a lot.
possible speedups: design the shuffle by hand
If you're willing to design a shuffle by hand for a specific value of g_tables, there are potential speedups that way. Load a vector from src, shuffle it with a compile-time constant pshufb or pshufd, then store any contiguous blocks in one go. (Maybe with pextrd or pextrq, or even better movq from the bottom of the vector. Or even a full-vector movdqu).
Actually, loading multiple src vectors and shuffling between them is possible with shufps. It works fine on integer data, with no slowdowns except on Nehalem (and maybe also on Core2). punpcklwd / dq / qdq (and the corresponding punpckhwd etc) can interleave elements of vectors, and give different choices for data movement than shufps.
If it doesn't take too many instructions to construct a few full 16B vectors, you're in good shape.
If g_tables can take on too many possible values, it might be possible to JIT-compile a custom shuffle function. This is probably really hard to do well, though.

Sort points in clockwise order?

Given an array of x,y points, how do I sort the points of this array in clockwise order (around their overall average center point)? My goal is to pass the points to a line-creation function to end up with something looking rather "solid", as convex as possible with no lines intersecting.
For what it's worth, I'm using Lua, but any pseudocode would be appreciated.
Update: For reference, this is the Lua code based on Ciamej's excellent answer (ignore my "app" prefix):
function appSortPointsClockwise(points)
    local centerPoint = appGetCenterPointOfPoints(points)
    app.pointsCenterPoint = centerPoint
    table.sort(points, appGetIsLess)
    return points
end
function appGetIsLess(a, b)
    local center = app.pointsCenterPoint
    -- same tests as the C++ less() below, offset by the center point
    if a.x - center.x >= 0 and b.x - center.x < 0 then return true
    elseif a.x - center.x < 0 and b.x - center.x >= 0 then return false
    elseif a.x - center.x == 0 and b.x - center.x == 0 then return a.y > b.y
    end
    local det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y)
    if det < 0 then return true
    elseif det > 0 then return false
    end
    local d1 = (a.x - center.x) * (a.x - center.x) + (a.y - center.y) * (a.y - center.y)
    local d2 = (b.x - center.x) * (b.x - center.x) + (b.y - center.y) * (b.y - center.y)
    return d1 > d2
end
function appGetCenterPointOfPoints(points)
    local pointsSum = {x = 0, y = 0}
    for i = 1, #points do pointsSum.x = pointsSum.x + points[i].x; pointsSum.y = pointsSum.y + points[i].y end
    return {x = pointsSum.x / #points, y = pointsSum.y / #points}
end
First, compute the center point.
Then sort the points using whatever sorting algorithm you like, but use special comparison routine to determine whether one point is less than the other.
You can check whether one point (a) is to the left or to the right of the other (b) in relation to the center by this simple calculation:
det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y)
If the result is zero, they are on the same line from the center; if it is positive or negative, then it is on one side or the other, so one point will precede the other.
Using it you can construct a less-than relation to compare points and determine the order in which they should appear in the sorted array. But you have to define where that order begins, that is, which angle will be the starting one (e.g. the positive half of the x-axis).
The code for the comparison function can look like this:
bool less(point a, point b)
{
    if (a.x - center.x >= 0 && b.x - center.x < 0)
        return true;
    if (a.x - center.x < 0 && b.x - center.x >= 0)
        return false;
    if (a.x - center.x == 0 && b.x - center.x == 0) {
        if (a.y - center.y >= 0 || b.y - center.y >= 0)
            return a.y > b.y;
        return b.y > a.y;
    }

    // compute the cross product of vectors (center -> a) x (center -> b)
    int det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y);
    if (det < 0)
        return true;
    if (det > 0)
        return false;

    // points a and b are on the same line from the center
    // check which point is closer to the center
    int d1 = (a.x - center.x) * (a.x - center.x) + (a.y - center.y) * (a.y - center.y);
    int d2 = (b.x - center.x) * (b.x - center.x) + (b.y - center.y) * (b.y - center.y);
    return d1 > d2;
}
This will order the points clockwise starting from the 12 o'clock. Points on the same "hour" will be ordered starting from the ones that are further from the center.
If using integer types (which are not really present in Lua) you'd have to assure that det, d1 and d2 variables are of a type that will be able to hold the result of performed calculations.
If you want to achieve something looking solid, as convex as possible, then I guess you're looking for a Convex Hull. You can compute it using the Graham Scan.
In this algorithm, you also have to sort the points clockwise (or counter-clockwise) starting from a special pivot point. Then you repeat simple loop steps each time checking if you turn left or right adding new points to the convex hull, this check is based on a cross product just like in the above comparison function.
Edit:
Added one more if statement if (a.y - center.y >= 0 || b.y - center.y >=0) to make sure that points that have x=0 and negative y are sorted starting from the ones that are further from the center. If you don't care about the order of points on the same 'hour' you can omit this if statement and always return a.y > b.y.
Corrected the first if statements with adding -center.x and -center.y.
Added the second if statement (a.x - center.x < 0 && b.x - center.x >= 0). It was an obvious oversight that it was missing. The if statements could be reorganized now because some checks are redundant. For example, if the first condition in the first if statement is false, then the first condition of the second if must be true. I decided, however, to leave the code as it is for the sake of simplicity. It's quite possible that the compiler will optimize the code and produce the same result anyway.
What you're asking for is a system known as polar coordinates. Conversion from Cartesian to polar coordinates is easily done in any language: r = sqrt(x^2 + y^2), theta = atan2(y, x).
After converting to polar coordinates, just sort by the angle, theta.
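A minimal C sketch of that approach (hypothetical Point type and helper names; qsort needs the center available inside the comparator, handled here with a file-scope variable):
#include <math.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;
static Point g_center;

/* Descending angle around g_center => clockwise order; the sequence starts near
   the negative x-axis, where atan2's +pi/-pi cut falls. */
static int cmp_clockwise(const void *pa, const void *pb)
{
    const Point *a = pa, *b = pb;
    double ta = atan2(a->y - g_center.y, a->x - g_center.x);
    double tb = atan2(b->y - g_center.y, b->x - g_center.x);
    return (ta < tb) - (ta > tb);
}

static void sort_clockwise(Point *pts, size_t n)
{
    g_center.x = g_center.y = 0.0;
    for (size_t i = 0; i < n; i++) { g_center.x += pts[i].x; g_center.y += pts[i].y; }
    g_center.x /= (double)n;
    g_center.y /= (double)n;
    qsort(pts, n, sizeof *pts, cmp_clockwise);
}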
An interesting alternative approach to your problem would be to find an approximate minimum for the Traveling Salesman Problem (TSP), i.e. the shortest route linking all your points. If your points form a convex shape, it should be the right solution; otherwise, it should still look good (a "solid" shape can be defined as one that has a low perimeter/area ratio, which is what we are optimizing here).
You can use any implementation of an optimizer for the TSP, of which I am pretty sure you can find a ton in your language of choice.
Another version (return true if a comes before b in counterclockwise direction):
bool lessCcw(const Vector2D &center, const Vector2D &a, const Vector2D &b) const
{
    // Computes the quadrant for a and b (0-3):
    //     ^
    //   1 | 0
    //  ---+-->
    //   2 | 3
    const int dax = ((a.x() - center.x()) > 0) ? 1 : 0;
    const int day = ((a.y() - center.y()) > 0) ? 1 : 0;
    const int qa = (1 - dax) + (1 - day) + ((dax & (1 - day)) << 1);
    /* The previous computes the following:
       const int qa =
           ( (a.x() > center.x())
             ? ((a.y() > center.y()) ? 0 : 3)
             : ((a.y() > center.y()) ? 1 : 2)); */
    const int dbx = ((b.x() - center.x()) > 0) ? 1 : 0;
    const int dby = ((b.y() - center.y()) > 0) ? 1 : 0;
    const int qb = (1 - dbx) + (1 - dby) + ((dbx & (1 - dby)) << 1);

    if (qa == qb) {
        return (b.x() - center.x()) * (a.y() - center.y()) < (b.y() - center.y()) * (a.x() - center.x());
    } else {
        return qa < qb;
    }
}
This is faster, because the compiler (tested on Visual C++ 2015) doesn't generate a jump to compute dax, day, dbx, dby. Here is the output assembly from the compiler:
; 28 : const int dax = ((a.x() - center.x()) > 0) ? 1 : 0;
vmovss xmm2, DWORD PTR [ecx]
vmovss xmm0, DWORD PTR [edx]
; 29 : const int day = ((a.y() - center.y()) > 0) ? 1 : 0;
vmovss xmm1, DWORD PTR [ecx+4]
vsubss xmm4, xmm0, xmm2
vmovss xmm0, DWORD PTR [edx+4]
push ebx
xor ebx, ebx
vxorps xmm3, xmm3, xmm3
vcomiss xmm4, xmm3
vsubss xmm5, xmm0, xmm1
seta bl
xor ecx, ecx
vcomiss xmm5, xmm3
push esi
seta cl
; 30 : const int qa = (1 - dax) + (1 - day) + ((dax & (1 - day)) << 1);
mov esi, 2
push edi
mov edi, esi
; 31 :
; 32 : /* The previous computes the following:
; 33 :
; 34 : const int qa =
; 35 : ( (a.x() > center.x())
; 36 : ? ((a.y() > center.y()) ? 0 : 3)
; 37 : : ((a.y() > center.y()) ? 1 : 2));
; 38 : */
; 39 :
; 40 : const int dbx = ((b.x() - center.x()) > 0) ? 1 : 0;
xor edx, edx
lea eax, DWORD PTR [ecx+ecx]
sub edi, eax
lea eax, DWORD PTR [ebx+ebx]
and edi, eax
mov eax, DWORD PTR _b$[esp+8]
sub edi, ecx
sub edi, ebx
add edi, esi
vmovss xmm0, DWORD PTR [eax]
vsubss xmm2, xmm0, xmm2
; 41 : const int dby = ((b.y() - center.y()) > 0) ? 1 : 0;
vmovss xmm0, DWORD PTR [eax+4]
vcomiss xmm2, xmm3
vsubss xmm0, xmm0, xmm1
seta dl
xor ecx, ecx
vcomiss xmm0, xmm3
seta cl
; 42 : const int qb = (1 - dbx) + (1 - dby) + ((dbx & (1 - dby)) << 1);
lea eax, DWORD PTR [ecx+ecx]
sub esi, eax
lea eax, DWORD PTR [edx+edx]
and esi, eax
sub esi, ecx
sub esi, edx
add esi, 2
; 43 :
; 44 : if (qa == qb) {
cmp edi, esi
jne SHORT $LN37@lessCcw
; 45 : return (b.x() - center.x()) * (a.y() - center.y()) < (b.y() - center.y()) * (a.x() - center.x());
vmulss xmm1, xmm2, xmm5
vmulss xmm0, xmm0, xmm4
xor eax, eax
pop edi
vcomiss xmm0, xmm1
pop esi
seta al
pop ebx
; 46 : } else {
; 47 : return qa < qb;
; 48 : }
; 49 : }
ret 0
$LN37@lessCcw:
pop edi
pop esi
setl al
pop ebx
ret 0
?lessCcw@@YA_NABVVector2D@@00@Z ENDP ; lessCcw
Enjoy.
Another approach, using the angle to the X axis:
- vector3 a = (1, 0, 0), i.e. the X axis
- vector3 b = any_point - center
- y = |a x b|, x = a . b
- Atan2(y, x) gives the angle between -PI and +PI in radians
- (angle % 360 + 360) % 360 converts it to the range 0 to 360 degrees
- sort the points into the polygon's vertex list by that 0..360 angle
Finally you get the vertices sorted anticlockwise; list.Reverse() gives clockwise order.
I know this is somewhat of an old post with an excellent accepted answer, but I feel like I can still contribute something useful. All the answers so far essentially use a comparison function to compare two points and determine their order, but what if you want to use only one point at a time and a key function?
Not only is this possible, but the resulting code is also extremely compact. Here is the complete solution using Python's built-in sorted function:
import numpy as np

# Create some random points
num = 7
points = np.random.random((num, 2))

# Compute their center
center = np.mean(points, axis=0)

# Make arctan2 function that returns a value from [0, 2 pi) instead of [-pi, pi)
arctan2 = lambda s, c: angle if (angle := np.arctan2(s, c)) >= 0 else 2 * np.pi + angle

# Define the key function
def clockwise_around_center(point):
    diff = point - center
    rcos = np.dot(diff, center)
    rsin = np.cross(diff, center)
    return arctan2(rsin, rcos)

# Sort our points using the key function
sorted_points = sorted(points, key=clockwise_around_center)
This answer would also work in 3D, if the points are on a 2D plane embedded in 3D. We would only have to modify the calculation of rsin by dotting it with the normal vector of the plane. E.g.
rsin = np.dot([0,0,1], np.cross(diff, center))
if that plane has e_z as its normal vector.
The advantage of this code is that it works on only one point at a time using a key function. The quantity rsin, if you work it out at the coefficient level, is exactly the same as what is called det in the accepted answer, except that I compute it between point - center and center, not between point1 - center and point2 - center. But the geometrical meaning of this quantity is the radius times the sine of the angle, hence I call this variable rsin. Similarly for the dot product, which is the radius times the cosine of the angle and hence called rcos.
One could argue that this solution uses arctan2, and is therefore less clean. However, I personally think that the clarity of using a key function outweighs the need for one call to a trig function. Note that I prefer to have arctan2 return a value from [0, 2 pi), because then we get the angle 0 when point happens to be identical to center, and thus it will be the first point in our sorted list. This is an optional choice.
In order to understand why this code works, the crucial insight is that all our points are defined as arrows with respect to the origin, including the center point itself. So if we calculate point - center, this is equivalent to placing the arrow from the tip of center to the tip of point, at the origin. Hence we can sort the arrow point - center with respect to the angle it makes with the arrow pointing to center.
Here's a way to sort the vertices of a rectangle in clock-wise order. I modified the original solution provided by pyimagesearch and got rid of the scipy dependency.
import numpy as np
def pointwise_distance(pts1, pts2):
    """Calculates the distance between pairs of points

    Args:
        pts1 (np.ndarray): array of form [[x1, y1], [x2, y2], ...]
        pts2 (np.ndarray): array of form [[x1, y1], [x2, y2], ...]

    Returns:
        np.array: distances between corresponding points
    """
    dist = np.sqrt(np.sum((pts1 - pts2)**2, axis=1))
    return dist

def order_points(pts):
    """Orders points in form [top left, top right, bottom right, bottom left].

    Source: https://www.pyimagesearch.com/2016/03/21/ordering-coordinates-clockwise-with-python-and-opencv/

    Args:
        pts (np.ndarray): list of points of form [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]

    Returns:
        [type]: [description]
    """
    # sort the points based on their x-coordinates
    x_sorted = pts[np.argsort(pts[:, 0]), :]

    # grab the left-most and right-most points from the sorted
    # x-coordinate points
    left_most = x_sorted[:2, :]
    right_most = x_sorted[2:, :]

    # now, sort the left-most coordinates according to their
    # y-coordinates so we can grab the top-left and bottom-left
    # points, respectively
    left_most = left_most[np.argsort(left_most[:, 1]), :]
    tl, bl = left_most

    # now that we have the top-left coordinate, use it as an
    # anchor to calculate the Euclidean distance between the
    # top-left and right-most points; by the Pythagorean
    # theorem, the point with the largest distance will be
    # our bottom-right point. Note: this is a valid assumption because
    # we are dealing with rectangles only.
    # We need to use this instead of just using min/max to handle the case where
    # there are points that have the same x or y value.
    D = pointwise_distance(np.vstack([tl, tl]), right_most)
    br, tr = right_most[np.argsort(D)[::-1], :]

    # return the coordinates in top-left, top-right,
    # bottom-right, and bottom-left order
    return np.array([tl, tr, br, bl], dtype="float32")
With numpy:
import matplotlib.pyplot as plt
import numpy as np
# List of coords
coords = np.array([7,7, 5, 0, 0, 0, 5, 10, 10, 0, 0, 5, 10, 5, 0, 10, 10, 10]).reshape(-1, 2)
centroid = np.mean(coords, axis=0)
sorted_coords = coords[np.argsort(np.arctan2(coords[:, 1] - centroid[1], coords[:, 0] - centroid[0])), :]
plt.scatter(coords[:,0],coords[:,1])
plt.plot(coords[:,0],coords[:,1])
plt.plot(sorted_coords[:,0],sorted_coords[:,1])
plt.show()
