Which one is faster x+=x or x*=2 - performance

I recently wrote a program to render an image of the Mandelbrot set. For this I wrote a function that returns whether a point belongs to the Mandelbrot set. In this function I found two ways to do my calculation:
let temp=a;
a=a*a-b*b+x;
b=2.0*b*temp+y;
or
let temp=a;
a=a*a-b*b+x;
b*=temp;
b+=b+y;
Which one is faster, if either is? (I'm using Rust, in case that changes anything.)

I've put both of your snippets into the playground as public functions (assuming that your values are all floats, but this shouldn't make any real difference):
pub fn mult(mut a: f32, mut b: f32, x: f32, y: f32) -> f32 {
let temp = a;
a = a * a - b * b + x;
b = 2.0 * b * temp + y;
b
}
pub fn add(mut a: f32, mut b: f32, x: f32, y: f32) -> f32 {
let temp = a;
a = a * a - b * b + x;
b *= temp;
b += b + y;
b
}
The assembly generated in release mode is almost identical (just reordered):
playground::mult:
addss xmm1, xmm1
mulss xmm0, xmm1
addss xmm0, xmm3
ret
playground::add:
mulss xmm0, xmm1
addss xmm3, xmm0
addss xmm0, xmm3
ret
So there should be no measurable difference. However, if you're worried, you should benchmark your real case to see whether either approach leads to missed optimizations in the larger picture.
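One thing worth checking alongside speed: the two forms do not round identically. `2.0 * b * temp + y` rounds the product and then the sum, while `b *= temp; b += b + y` evaluates `b + (b + y)` and so performs one additional rounded addition. The results can therefore differ in the last bits for some inputs. A minimal sketch for measuring how far apart they get on sample data (the function names and the test grid are made up for this illustration):

```rust
/// Imaginary-part update written with a multiplication by 2.
fn next_b_mul(a: f32, b: f32, y: f32) -> f32 {
    2.0 * b * a + y
}

/// Same update written with an extra addition instead of `* 2.0`.
fn next_b_add(a: f32, b: f32, y: f32) -> f32 {
    let mut b = b * a;
    b += b + y; // evaluates as b + (b + y): one more rounded addition
    b
}

/// Largest absolute difference between the two forms over a small grid.
fn max_difference() -> f32 {
    let mut worst = 0.0f32;
    for i in -20..=20 {
        for j in -20..=20 {
            let (a, b) = (i as f32 * 0.1, j as f32 * 0.1);
            let d = (next_b_mul(a, b, 0.3) - next_b_add(a, b, 0.3)).abs();
            worst = worst.max(d);
        }
    }
    worst
}
```

For a Mandelbrot renderer a difference of this size is unlikely to be visible, but it is worth knowing about if you compare outputs of the two versions bit for bit.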

Related

How to set a negative number to infinity without using an if statement (or ternary)

I have the following piece of code:
for(uint i=0; i<6; i++)
coeffs[i] = coeffs[i] < 0 ? 1.f/0.f : coeffs[i];
Which checks an array with 6 elements and if it finds a negative entry it sets it to infinity and otherwise leaves the entry intact.
I need to do the same thing without using any if-statements.
One obvious question would be what infinity you need when the input is less than 0.
Any Infinity
If the result can be negative infinity, I'd do something like this:
coeffs[i] /= (coeffs[i] >= 0.0);
The coeffs[i] >= 0.0 produces 1 if the input is non-negative, and 0 if the input is negative. Dividing the input by 1 leaves it unchanged. Dividing a negative input by 0 produces negative infinity.
Positive Infinity
If it has to be a positive infinity, you'd change that to something like:
coeffs[i] = fabs(coeffs[i]) / (coeffs[i] >= 0.0);
By taking the absolute value before the division, the infinity we produce for a negative is forced to be positive. Otherwise, the input started out positive, so the fabs and division by 1.0 leave the value intact.
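As a rough illustration, the same division trick can be sketched in Rust. The names `negative_to_infinity` and `fix_coeffs` are hypothetical, and whether this beats a plain `if` needs to be measured on your target. Floating-point division by zero does not panic in Rust; it follows IEEE 754 and yields infinity.

```rust
/// Branch-free in the source: divide by 1.0 (keep the value) or by 0.0
/// (a positive numerator then becomes +infinity).
fn negative_to_infinity(x: f32) -> f32 {
    // `bool as u8` is 0 or 1; Rust has no direct `bool as f32` cast.
    let keep = (x >= 0.0) as u8 as f32;
    x.abs() / keep
}

/// Applies the replacement to a 6-element coefficient array in place.
fn fix_coeffs(coeffs: &mut [f32; 6]) {
    for c in coeffs.iter_mut() {
        *c = negative_to_infinity(*c);
    }
}
```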
Performance
As to whether this will actually improve performance, that's probably open to a lot more question. For the moment, let's look at code for the CPU, since Godbolt lets us examine that pretty easily.
If we look at this:
#include <limits>
double f(double in) {
return in / (in >= 0.0);
}
double g(double in) {
return in > 0.0 ? in : std::numeric_limits<double>::infinity();
}
So, let's look at the code produced for the first function:
xorpd xmm1, xmm1
cmplesd xmm1, xmm0
movsd xmm2, qword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero
andpd xmm2, xmm1
divsd xmm0, xmm2
ret
So that's not too terrible--branch-free, and (depending on the exact processor involved) a throughput around 8-10 cycles on most reasonably modern processors. On the other hand, here's the code produced for the second function:
xorpd xmm1, xmm1
cmpltsd xmm1, xmm0
andpd xmm0, xmm1
movsd xmm2, qword ptr [rip + .LCPI1_0] # xmm2 = mem[0],zero
andnpd xmm1, xmm2
orpd xmm0, xmm1
ret
This is also branch-free--and doesn't have that (relatively slow) divsd instruction either. Again, performance will vary depending on the specific processor, but we can probably plan on this having a throughput around 6 cycles or so--not tremendously faster than the previous, but probably at least a few cycles faster part of the time, and almost certain to never be any slower. In short, it's probably preferable under nearly any possible CPU.
GPU Code
GPUs have their own instruction sets, of course--but given the penalty they suffer for branches, compilers for them (and the instruction sets they provide) probably do at least as much to help eliminate branches as CPUs do, so chances are that the straightforward code will work just fine on it as well (though to say with certainty, you'd need to either examine the code it produced or profile it).
Big disclaimer up front: I haven't actually tested this, but I doubt it really is faster than using ternaries. Perform benchmarks to see if it really is an optimization!
Also: these are implemented/tested in C. They should be easily portable to GLSL, but you may need explicit type-conversions, which may make them (even) slower.
There are two ways to do it, based on whether you strictly need INFINITY or can just use a large value. Neither uses branching expressions or statements, but they do involve a comparison. Both use the fact that comparison operators in C always return either 0 or 1.
The INFINITY-based way uses a 2-element array and has the comparison output choose the element of the choice-array:
float chooseCoefs[2] = {0.f, INFINITY}; /* initialize choice-array */
for(uint i = 0; i < 6; i++){
int neg = coefs[i] < 0; /* outputs 1 or 0 */
/* set 0-element of choice-array to regular value */
chooseCoefs[0] = coefs[i];
/* if neg == 0: pick coefs[i], else neg == 1: pick INFINITY */
coefs[i] = chooseCoefs[neg];
}
If you can use a normal (but big) value instead of INFINITY you can use two multiplications and one addition instead:
#define BIGFLOAT 1000.f /* a swimming sasquatch... */
for(uint i = 0; i < 6; i++){
int neg = coefs[i] < 0;
/* if neg == 1: 1 * BIGFLOAT + 0 * coefs[i] == BIGFLOAT,
else neg == 0: 0 * BIGFLOAT + 1 * coefs[i] == coefs[i] */
coefs[i] = neg * BIGFLOAT + !neg * coefs[i];
}
Again, I didn't benchmark these, but my guess is that at least the array-based solution is far slower than simple ternaries. Don't underestimate the optimizing-power of your compiler!
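The reason the multiplication variant needs a finite BIGFLOAT rather than INFINITY is that 0 * infinity is NaN under IEEE 754, so the "unselected" term would poison the sum. A small Rust sketch of the blend makes this visible (the function name `blend` is made up for the illustration):

```rust
/// Blend by multiplication, as in the BIGFLOAT variant:
/// returns `replacement` for negative `x`, otherwise `x`.
fn blend(x: f32, replacement: f32) -> f32 {
    let neg = (x < 0.0) as u8 as f32; // 1.0 if negative, else 0.0
    neg * replacement + (1.0 - neg) * x
}
```

With a finite replacement the blend behaves as intended; with `f32::INFINITY` a non-negative input produces NaN, because the `0.0 * INFINITY` term is NaN.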

Finding the most frequently occurring element in an SSE register

Does anyone have any thoughts on how to calculate the mode (statistic) of a vector of 8-bit integers in SSE4.x? To clarify, this would be 16x8-bit values in a 128-bit register.
I want the result as a vector mask which selects the mode-valued elements. i.e. the result of _mm_cmpeq_epi8(v, set1(mode(v))), as well as the scalar value.
Providing some additional context; while the above problem is an interesting one to solve in its own right, I have been through most algorithms I can think of with linear complexity. This class will wipe out any gains I can get from calculating this number.
I hope to engage you all in searching for some deep magic, here. It's possible that an approximation may be necessary to break this bound, such as "select a frequently occurring element" for example (N.B. difference against the most), which would be of merit. A probabilistic answer would be usable, too.
SSE and x86 have some very interesting semantics. It may be worth exploring a superoptimization pass.
Probably a relatively simple brute force SSEx approach is suitable here, see the code below.
The idea is to byte-rotate the input vector v by 1 to 15 positions and compare the rotated vector
with the original v for equality. To shorten the dependency chain and to increase the
instruction level parallelism, two counters are used to count (vertical sum) these equal elements:
sum1 and sum2, because there might be architectures that benefit from this.
Equal elements are counted as -1. Variable sum = sum1 + sum2 contains the total count with values
between -1 and -16. min_brc contains the horizontal minimum of sum broadcasted to all elements.
mask = _mm_cmpeq_epi8(sum,min_brc) is the mask for the mode-valued elements requested as an
intermediate result by the OP. In the next few lines of the code the actual mode is extracted.
This solution is certainly faster than a scalar solution.
Note that with AVX2 the upper 128-bit lanes can be used to speedup the computation further.
It takes 20 cycles (throughput) to compute only the mask for the mode-valued elements. With the actual mode broadcasted
across the SSE register it takes about 21.4 cycles.
Note the behaviour in the next example:
[1, 1, 3, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] returns mask=[-1,-1,-1,-1,0,0,...,0]
and the mode value is 1, although 1 occurs as often as 3.
The code below is tested, but not thoroughly tested.
#include <stdio.h>
#include <x86intrin.h>
/* gcc -O3 -Wall -m64 -march=nehalem mode_uint8.c */
int print_vec_char(__m128i x);
__m128i mode_statistic(__m128i v){
__m128i sum2 = _mm_set1_epi8(-1); /* Each integer occurs at least one time */
__m128i v_rot1 = _mm_alignr_epi8(v,v,1);
__m128i v_rot2 = _mm_alignr_epi8(v,v,2);
__m128i sum1 = _mm_cmpeq_epi8(v,v_rot1);
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot2));
__m128i v_rot3 = _mm_alignr_epi8(v,v,3);
__m128i v_rot4 = _mm_alignr_epi8(v,v,4);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot3));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot4));
__m128i v_rot5 = _mm_alignr_epi8(v,v,5);
__m128i v_rot6 = _mm_alignr_epi8(v,v,6);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot5));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot6));
__m128i v_rot7 = _mm_alignr_epi8(v,v,7);
__m128i v_rot8 = _mm_alignr_epi8(v,v,8);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot7));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot8));
__m128i v_rot9 = _mm_alignr_epi8(v,v,9);
__m128i v_rot10 = _mm_alignr_epi8(v,v,10);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot9));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot10));
__m128i v_rot11 = _mm_alignr_epi8(v,v,11);
__m128i v_rot12 = _mm_alignr_epi8(v,v,12);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot11));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot12));
__m128i v_rot13 = _mm_alignr_epi8(v,v,13);
__m128i v_rot14 = _mm_alignr_epi8(v,v,14);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot13));
sum2 = _mm_add_epi8(sum2,_mm_cmpeq_epi8(v,v_rot14));
__m128i v_rot15 = _mm_alignr_epi8(v,v,15);
sum1 = _mm_add_epi8(sum1,_mm_cmpeq_epi8(v,v_rot15));
__m128i sum = _mm_add_epi8(sum1,sum2); /* Sum contains values such as -1, -2 ,...,-16 */
/* The next three instructions compute the horizontal minimum of sum */
__m128i sum_shft = _mm_srli_epi16(sum,8); /* Shift right 8 bits, while shifting in zeros */
__m128i min1 = _mm_min_epu8(sum,sum_shft); /* sum and sum_shft are considered as unsigned integers. sum_shft is zero at the odd positions and so is min1 */
__m128i min2 = _mm_minpos_epu16(min1); /* Byte 0 within min2 contains the horizontal minimum of sum */
__m128i min_brc = _mm_shuffle_epi8(min2,_mm_setzero_si128()); /* Broadcast horizontal minimum */
__m128i mask = _mm_cmpeq_epi8(sum,min_brc); /* Mask = -1 at the byte positions where the value of v is equal to the mode of v */
/* comment next 4 lines out if there is no need to broadcast the mode value */
int bitmask = _mm_movemask_epi8(mask);
int indx = __builtin_ctz(bitmask); /* Index of mode */
__m128i v_indx = _mm_set1_epi8(indx); /* Broadcast indx */
__m128i answer = _mm_shuffle_epi8(v,v_indx); /* Broadcast mode to each element of answer */
/* Uncomment lines below to print intermediate results, to see how it works. */
// printf("sum = ");print_vec_char (sum );
// printf("sum_shft = ");print_vec_char (sum_shft );
// printf("min1 = ");print_vec_char (min1 );
// printf("min2 = ");print_vec_char (min2 );
// printf("min_brc = ");print_vec_char (min_brc );
// printf("mask = ");print_vec_char (mask );
// printf("v_indx = ");print_vec_char (v_indx );
// printf("answer = ");print_vec_char (answer );
return answer; /* or return mask, or return both .... :) */
}
int main() {
/* To test throughput set throughput_test to 1, otherwise 0 */
/* Use e.g. perf stat -d ./a.out to test throughput */
#define throughput_test 0
/* Different test vectors */
int i;
char x1[16] = {5, 2, 2, 7, 21, 4, 7, 7, 3, 9, 2, 5, 4, 3, 5, 5};
char x2[16] = {5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
char x3[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
char x4[16] = {1, 2, 3, 2, 1, 6, 7, 8, 2, 2, 2, 3, 3, 2, 15, 16};
char x5[16] = {1, 1, 3, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
printf("\n15...0 = 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0\n\n");
__m128i x_vec = _mm_loadu_si128((__m128i*)x1);
printf("x_vec = ");print_vec_char(x_vec );
__m128i y = mode_statistic (x_vec);
printf("answer = ");print_vec_char(y );
#if throughput_test == 1
__m128i x_vec1 = _mm_loadu_si128((__m128i*)x1);
__m128i x_vec2 = _mm_loadu_si128((__m128i*)x2);
__m128i x_vec3 = _mm_loadu_si128((__m128i*)x3);
__m128i x_vec4 = _mm_loadu_si128((__m128i*)x4);
__m128i x_vec5 = _mm_loadu_si128((__m128i*)x5);
__m128i y1, y2, y3, y4, y5;
__asm__ __volatile__ ( "vzeroupper" : : : ); /* Remove this line on non-AVX processors */
for (i=0;i<100000000;i++){
y1 = mode_statistic (x_vec1);
y2 = mode_statistic (x_vec2);
y3 = mode_statistic (x_vec3);
y4 = mode_statistic (x_vec4);
y5 = mode_statistic (x_vec5);
x_vec1 = mode_statistic (y1 );
x_vec2 = mode_statistic (y2 );
x_vec3 = mode_statistic (y3 );
x_vec4 = mode_statistic (y4 );
x_vec5 = mode_statistic (y5 );
}
printf("mask mode = ");print_vec_char(y1 );
printf("mask mode = ");print_vec_char(y2 );
printf("mask mode = ");print_vec_char(y3 );
printf("mask mode = ");print_vec_char(y4 );
printf("mask mode = ");print_vec_char(y5 );
#endif
return 0;
}
int print_vec_char(__m128i x){
char v[16];
_mm_storeu_si128((__m128i *)v,x);
printf("%3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi | %3hhi %3hhi %3hhi %3hhi\n",
v[15],v[14],v[13],v[12],v[11],v[10],v[9],v[8],v[7],v[6],v[5],v[4],v[3],v[2],v[1],v[0]);
return 0;
}
Output:
15...0 = 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x_vec = 5 5 3 4 | 5 2 9 3 | 7 7 4 21 | 7 2 2 5
sum = -4 -4 -2 -2 | -4 -3 -1 -2 | -3 -3 -2 -1 | -3 -3 -3 -4
min_brc = -4 -4 -4 -4 | -4 -4 -4 -4 | -4 -4 -4 -4 | -4 -4 -4 -4
mask = -1 -1 0 0 | -1 0 0 0 | 0 0 0 0 | 0 0 0 -1
answer = 5 5 5 5 | 5 5 5 5 | 5 5 5 5 | 5 5 5 5
The horizontal minimum is computed with Evgeny Kluev's method.
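When checking an intrinsics version like this against test vectors, a scalar model of the same counting idea can be handy. The following Rust sketch is only an illustration (the name `mode_mask` is made up here) and mirrors the tie behaviour described above: lanes of every value that reaches the top count are selected, and the reported mode comes from the lowest selected lane.

```rust
/// Scalar model of the rotate-and-compare idea: for every lane, count how
/// many lanes hold the same byte, then select the lanes with the top count.
/// Returns (mask, mode); mask lanes are 0xFF where the lane's count equals
/// the maximum count (on ties this covers lanes of several values).
fn mode_mask(v: [u8; 16]) -> ([u8; 16], u8) {
    let mut counts = [0u8; 16];
    for i in 0..16 {
        // rot = 0 compares a lane with itself, like the initial -1 in sum2.
        for rot in 0..16 {
            if v[i] == v[(i + rot) % 16] {
                counts[i] += 1;
            }
        }
    }
    let best = *counts.iter().max().unwrap();
    let mut mask = [0u8; 16];
    for i in 0..16 {
        if counts[i] == best {
            mask[i] = 0xFF;
        }
    }
    // Like __builtin_ctz on the movemask: take the lowest selected lane.
    let first = mask.iter().position(|&m| m == 0xFF).unwrap();
    (mask, v[first])
}
```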
Sort the data in the register.
Insertion sort can be done in 16 (15) steps, by initializing the register to "Infinity", which tries to illustrate a monotonically decreasing array and inserting the new element in parallel to all possible places:
// e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 78
__m128i sorted = _mm_or_si128(my_array, const_FFFFF00);
for (int i = 1; i < 16; ++i)
{
// Trying to insert e.g. A0, we must shift all the FF's to left
// e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF 78 00
__m128i shifted = _mm_bslli_si128(sorted, 1);
// Taking the MAX of shifted and 'A0 on all places'
// e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF A0 A0
shifted = _mm_max_epu8(shifted, _mm_set1_epi8(my_array[i]));
// and minimum of the shifted + original --
// e.g. FF FF FF FF FF FF FF FF FF FF FF FF FF FF A0 78
sorted = _mm_min_epu8(sorted, shifted);
}
Then calculate mask for vec[n+1] == vec[n], move mask to GPR and use that to index a 32768 entry LUT for best index location.
In a real case one probably wants to sort more than just one vector, i.e. sort 16 16-entry vectors at once:
__m128i input[16]; // not 1, but 16 vectors
transpose16x16(input); // in-place vector transpose
sort(input); // 60-stage network exists for 16 inputs
// linear search -- result in 'mode'
__m128i mode = input[0];
__m128i previous = mode;
__m128i count = _mm_setzero_si128();
__m128i max_count = _mm_setzero_si128();
for (int i = 1; i < 16; i++)
{
__m128i &current = input[i];
// histogram count is off by one
// if (current == previous) count++;
// else count = 0;
// if (count > max_count)
// mode = current, max_count = count
__m128i equal = _mm_cmpeq_epi8(previous, current); // -1 where the run continues
count = _mm_and_si128(_mm_sub_epi8(count, equal), equal);
__m128i max_so_far = _mm_cmplt_epi8(max_count, count);
mode = _mm_blendv_epi8(mode, current, max_so_far);
max_count = _mm_max_epi8(max_count, count);
previous = current;
}
The inner loop totals amortized cost of 7-8 instructions per result;
Sorting has typically 2 instructions per stage -- i.e. 8 instructions per result, when 16 results need 60 stages or 120 instructions.
(This still leaves the transpose as an exercise -- but I think it should be vastly faster than sorting?)
So, this should be in the ball park of 24 instructions per 8-bit result.
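For reference, the per-lane logic of the linear search can be written as a scalar Rust sketch operating on a single 16-byte vector. This is an illustration only (the name `mode_by_sorting` is made up), using the standard library sort in place of the sorting network:

```rust
/// Mode of 16 bytes via sort + run-length scan (scalar model of the
/// vectorized linear search).
fn mode_by_sorting(mut v: [u8; 16]) -> u8 {
    v.sort_unstable();
    let (mut mode, mut max_count) = (v[0], 0u8);
    let mut count = 0u8; // run length minus one, i.e. the "off by one" count
    for i in 1..16 {
        count = if v[i] == v[i - 1] { count + 1 } else { 0 };
        if count > max_count {
            max_count = count;
            mode = v[i];
        }
    }
    mode
}
```

On ties this returns the smallest of the most frequent values, because the comparison is strict and the data is scanned in ascending order.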
For a performance comparison, here is scalar code: non-vectorized in the main part, but vectorized for the table clear and tmp initialization (168 cycles per f() call on an FX-8150; 22M calls complete in 1.0002 seconds at 3.7 GHz).
#include <x86intrin.h>
unsigned char tmp[16]; // extracted values are here (single instruction, store_ps)
unsigned char table[256]; // counter table containing zeroes
char f(__m128i values)
{
_mm_store_si128((__m128i *)tmp,values);
int maxOccurence=0;
int currentValue=0;
for(int i=0;i<16;i++)
{
unsigned char ind=tmp[i];
unsigned char t=table[ind];
t++;
if(t>maxOccurence)
{
maxOccurence=t;
currentValue=ind;
}
table[ind]=t;
}
for(int i=0;i<256;i++)
table[i]=0;
return currentValue;
}
g++ 6.3 output:
f: # #f
movaps %xmm0, tmp(%rip)
movaps %xmm0, -24(%rsp)
xorl %r8d, %r8d
movq $-15, %rdx
movb -24(%rsp), %sil
xorl %eax, %eax
jmp .LBB0_1
.LBB0_2: # %._crit_edge
cmpl %r8d, %esi
cmovgel %esi, %r8d
movb tmp+16(%rdx), %sil
incq %rdx
.LBB0_1: # =>This Inner Loop Header: Depth=1
movzbl %sil, %edi
movb table(%rdi), %cl
incb %cl
movzbl %cl, %esi
cmpl %r8d, %esi
cmovgl %edi, %eax
movb %sil, table(%rdi)
testq %rdx, %rdx
jne .LBB0_2
xorps %xmm0, %xmm0
movaps %xmm0, table+240(%rip)
movaps %xmm0, table+224(%rip)
movaps %xmm0, table+208(%rip)
movaps %xmm0, table+192(%rip)
movaps %xmm0, table+176(%rip)
movaps %xmm0, table+160(%rip)
movaps %xmm0, table+144(%rip)
movaps %xmm0, table+128(%rip)
movaps %xmm0, table+112(%rip)
movaps %xmm0, table+96(%rip)
movaps %xmm0, table+80(%rip)
movaps %xmm0, table+64(%rip)
movaps %xmm0, table+48(%rip)
movaps %xmm0, table+32(%rip)
movaps %xmm0, table+16(%rip)
movaps %xmm0, table(%rip)
movsbl %al, %eax
ret

Which is faster for reverse iteration, for or while loops?

I am trying to implement the standard memmove function in Rust and I was wondering which method is faster for downwards iteration (where src < dest):
for i in (0..n).rev() {
//Do copying
}
or
let mut i = n;
while i != 0 {
i -= 1;
// Do copying
}
Will the rev() in the for loops version significantly slow it down?
TL;DR: Use the for loop.
Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:
#[inline(never)]
fn blackhole() {}
#[inline(never)]
fn with_for(n: usize) {
for i in (0..n).rev() { blackhole(); }
}
#[inline(never)]
fn with_while(n: usize) {
let mut i = n;
while i > 0 {
blackhole();
i -= 1;
}
}
This generates this LLVM IR:
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
ret void
}
Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).
Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop to cases where the iteration is irregular.
EDIT: to address starblue's concern of unfitness.
#[link(name = "snappy")]
extern {
fn blackhole(i: libc::c_int) -> libc::c_int;
}
#[inline(never)]
fn with_for(n: i32) {
for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}
#[inline(never)]
fn with_while(n: i32) {
let mut i = n;
while i > 0 {
unsafe { blackhole(i as libc::c_int); }
i -= 1;
}
}
compiles down to:
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %match_case.preheader, label %clean_ast_95_
match_case.preheader: ; preds = %entry-block
br label %match_case
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 @blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
clean_ast_95_.loopexit: ; preds = %match_case
br label %clean_ast_95_
clean_ast_95_: ; preds = %clean_ast_95_.loopexit, %entry-block
ret void
}
; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
%1 = icmp sgt i32 %0, 0
br i1 %1, label %while_body.preheader, label %while_exit
while_body.preheader: ; preds = %entry-block
br label %while_body
while_exit.loopexit: ; preds = %while_body
br label %while_exit
while_exit: ; preds = %while_exit.loopexit, %entry-block
ret void
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 @blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
}
The core loops are:
; -- for loop
match_case: ; preds = %match_case.preheader, %match_case
%.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
%2 = add i32 %.in, -1
%3 = tail call i32 @blackhole(i32 %2)
%4 = icmp sgt i32 %2, 0
br i1 %4, label %match_case, label %clean_ast_95_.loopexit
; -- while loop
while_body: ; preds = %while_body.preheader, %while_body
%i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
%2 = tail call i32 @blackhole(i32 %i.05)
%3 = add nsw i32 %i.05, -1
%4 = icmp sgt i32 %i.05, 1
br i1 %4, label %while_body, label %while_exit.loopexit
And the only difference is that:
for decrements before calling blackhole, while decrements after
for compares against 0, while compares against 1
otherwise, it's the same core loop.
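The "decrements before / decrements after" distinction also changes which values the body sees. A small sketch that records the visited indices makes this concrete (the function names are made up for this illustration):

```rust
/// Indices visited by the `for` loop over a reversed range.
fn for_rev(n: usize) -> Vec<usize> {
    let mut seen = Vec::new();
    for i in (0..n).rev() {
        seen.push(i);
    }
    seen
}

/// `while` loop that decrements before using `i` (as in the question).
fn while_decrement_first(n: usize) -> Vec<usize> {
    let mut seen = Vec::new();
    let mut i = n;
    while i != 0 {
        i -= 1;
        seen.push(i);
    }
    seen
}

/// `while` loop that uses `i` and then decrements (as in the test above).
fn while_decrement_last(n: usize) -> Vec<usize> {
    let mut seen = Vec::new();
    let mut i = n;
    while i > 0 {
        seen.push(i);
        i -= 1;
    }
    seen
}
```

The first two visit `n-1` down to `0`, which is what indexing a buffer needs; the third visits `n` down to `1`, so it is not a drop-in replacement when `i` is used as an index.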
In short: They are (nearly) equally fast -- use the for loop!
Longer version:
First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.
But since you are implementing memmove, performance might actually really matter to you. So let's try to look at the resulting ASM. I used this code:
#![feature(start)]
#![feature(test)]
extern crate test;
#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
for i in (0..n).rev() {
test::black_box(i);
}
}
#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
let mut i = n;
while i > 0 {
test::black_box(i);
i -= 1;
}
}
#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
let random_enough_value = unsafe {
**vargs as usize
};
with_for(random_enough_value);
with_while(random_enough_value);
0
}
(Playground Link)
The #[no_mangle] is to improve readability in the resulting ASM. The #[inline(never)] and the random_enough_value as well as the black_box are used to prevent LLVM from optimizing away things we don't want optimized. The generated ASM of this (in release mode!) with some cleanup looks like:
with_for: | with_while:
testq %rdi, %rdi | testq %rdi, %rdi
je .LBB0_3 | je .LBB1_3
decq %rdi |
leaq -8(%rsp), %rax | leaq -8(%rsp), %rax
.LBB0_2: | .LBB1_2:
movq %rdi, -8(%rsp) | movq %rdi, -8(%rsp)
decq %rdi | decq %rdi
cmpq $-1, %rdi |
jne .LBB0_2 | jne .LBB1_2
.LBB0_3: | .LBB1_3:
retq | retq
The only difference is that with_while has two fewer instructions, because it's counting down to 0 instead of -1, like with_for does.
Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.
If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, comparison with -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably minuscule.
For small N, it really shouldn't matter.
Rust is lazy on iterators; 0..n won't cause any evaluation until you actually ask for an element. rev() asks for the last element first. As far as I know, the Rust counter iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, the rev method is probably even faster.
In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.
As with all benchmarking questions, it depends. Test for your N values yourself!
Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.
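To tie this back to the memmove use case, here is a minimal safe-Rust sketch of the backward copy using the `for` loop. It works on a byte slice with indices instead of raw pointers, and the name `copy_backward` is made up for the illustration; a real memmove would operate on raw pointers.

```rust
/// Copies `n` bytes from `buf[src..]` to `buf[dest..]` where `src <= dest`,
/// walking backwards so overlapping bytes are read before being overwritten.
fn copy_backward(buf: &mut [u8], src: usize, dest: usize, n: usize) {
    assert!(src <= dest && dest + n <= buf.len());
    for i in (0..n).rev() {
        buf[dest + i] = buf[src + i];
    }
}
```

For slices of `Copy` types the standard library already provides `copy_within`, which handles overlap in either direction and is a useful reference to test a hand-written loop against.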

Subtracting two 8-bit integers bit by bit in x86 assembly

So I'm trying to implement this algorithm to calculate the difference of two 8-bit integers:
b = 0
difference = 0
for i = 0 to (n-1)
x = bit i of X
y = bit i of Y
bit i of difference = x xor y xor b
b = ((not x) and y) or ((not x) and b) or (y and b)
end for loop
This is what I did:
calculation:
mov ebx, 0
mov diff, 0
mov ecx, 7
subtract:
mov al, X
and al, 1h ; find bit i of X
mov dl, Y
and dl, 1h ; find bit i of Y
mov ah, al
mov dh, al
xor al, dl
xor al, bl
mov diff, al ; find bit i of the difference
; calculate b value for the next iteration
not ah
and ah, dl
not dh
and dh, dl
and dl, bl
or ah, dh
or ah, dl
mov bl, ah
; rotate X and Y to get ready for the next iteration
rol X, 1
rol Y, 1
loop subtract
The problem with this code is that it only works on the first iteration of the loop.
For example, if I enter 2 as the first number and 1 as the second, then on the first iteration the x value is 0 and the y value is 1, so bit i of the difference is 1 and the calculated b value is 1.
But this only works for the first iteration. On the next iteration I had x = 0, y = 0 and b = 1 (from the last calculation), so I wanted my diff to be 1 and my b value for this iteration to be 1; instead I got 0 for both of them.
Why doesn't the code work, given that I followed the algorithm and implemented it accordingly?
Thanks in advance.
Try a higher level language first to understand the algorithm, then port that to asm.
#include <stdio.h>
//b = 0
//difference = 0
//for i = 0 to (n-1)
//
// x = bit i of X
// y = bit i of Y
// bit i of difference = x xor y xor b
// b = ((not x) and y) or ((not x) and b) or (y and b)
//
//end for loop
int main ( void )
{
unsigned char X,Y,Z;
unsigned char x,y,z,b,bnext;
unsigned char i;
X=0Xf5; Y=0Xf1;
b=0;
Z=0;
for (i=1;i;i<<=1)
{
x=0;
y=0;
if(i&X) x=1;
if(i&Y) y=1;
z=((x^y)^b)&1;
if(z) Z|=i;
bnext = ((~x)&y) | ((~x)&b) | (y&b);
b=bnext&1;
}
printf("0x%02X 0x%02X\n",Z,X-Y);
return(0);
}
You might even rewrite it a few times to approach real instructions.
z=((x^y)^b)&1;
becomes
z = x;
z = z ^ y;
z = z ^ b;
z = z & 1;
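The same borrow-propagating loop can also be sketched compactly in Rust, which makes it easy to test against the built-in wrapping subtraction before porting to assembly. This is an illustrative sketch; the name `sub_bitwise` is made up here.

```rust
/// Subtracts `y_in` from `x_in` one bit at a time, propagating a borrow.
fn sub_bitwise(x_in: u8, y_in: u8) -> u8 {
    let mut borrow = 0u8;
    let mut diff = 0u8;
    for i in 0..8 {
        let x = (x_in >> i) & 1; // bit i of X
        let y = (y_in >> i) & 1; // bit i of Y
        diff |= (x ^ y ^ borrow) << i; // set bit i of the difference
        let not_x = x ^ 1; // one-bit NOT; avoids the stray high bits of `!x`
        borrow = (not_x & y) | (not_x & borrow) | (y & borrow);
    }
    diff
}
```

Two details matter when porting: each result bit has to be OR-ed into position `i` rather than overwriting the whole result, and all eight bits have to be visited from least to most significant.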

Sort points in clockwise order?

Given an array of x,y points, how do I sort the points of this array in clockwise order (around their overall average center point)? My goal is to pass the points to a line-creation function to end up with something looking rather "solid", as convex as possible with no lines intersecting.
For what it's worth, I'm using Lua, but any pseudocode would be appreciated.
Update: For reference, this is the Lua code based on Ciamej's excellent answer (ignore my "app" prefix):
function appSortPointsClockwise(points)
local centerPoint = appGetCenterPointOfPoints(points)
app.pointsCenterPoint = centerPoint
table.sort(points, appGetIsLess)
return points
end
function appGetIsLess(a, b)
local center = app.pointsCenterPoint
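local dax, dbx = a.x - center.x, b.x - center.x
if dax >= 0 and dbx < 0 then return true
elseif dax < 0 and dbx >= 0 then return false
elseif dax == 0 and dbx == 0 then
if a.y - center.y >= 0 or b.y - center.y >= 0 then return a.y > b.y end
return b.y > a.y
end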
local det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y)
if det < 0 then return true
elseif det > 0 then return false
end
local d1 = (a.x - center.x) * (a.x - center.x) + (a.y - center.y) * (a.y - center.y)
local d2 = (b.x - center.x) * (b.x - center.x) + (b.y - center.y) * (b.y - center.y)
return d1 > d2
end
function appGetCenterPointOfPoints(points)
local pointsSum = {x = 0, y = 0}
for i = 1, #points do pointsSum.x = pointsSum.x + points[i].x; pointsSum.y = pointsSum.y + points[i].y end
return {x = pointsSum.x / #points, y = pointsSum.y / #points}
end
First, compute the center point.
Then sort the points using whatever sorting algorithm you like, but use special comparison routine to determine whether one point is less than the other.
You can check whether one point (a) is to the left or to the right of the other (b) in relation to the center by this simple calculation:
det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y)
If the result is zero, then they are on the same line from the center; if it's positive or negative, then b is on one side or the other of a, so one point will precede the other.
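As a quick worked example of the sign convention, here is the determinant as a small Rust function. The `Point` struct and the name `det` are assumptions made for this sketch, and the y axis is assumed to point up.

```rust
#[derive(Clone, Copy)]
struct Point {
    x: f64,
    y: f64,
}

/// Cross product of the vectors (center -> a) and (center -> b).
/// Negative: b lies clockwise from a. Positive: counter-clockwise.
/// Zero: a, b and center are collinear.
fn det(a: Point, b: Point, center: Point) -> f64 {
    (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y)
}
```

With the center at the origin, a = (0, 1) at 12 o'clock and b = (1, 0) at 3 o'clock, the determinant is 0*0 - 1*1 = -1, so a precedes b in clockwise order.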
Using it you can construct a less-than relation to compare points and determine the order in which they should appear in the sorted array. But you have to define where is the beginning of that order, I mean what angle will be the starting one (e.g. the positive half of x-axis).
The code for the comparison function can look like this:
bool less(point a, point b)
{
if (a.x - center.x >= 0 && b.x - center.x < 0)
return true;
if (a.x - center.x < 0 && b.x - center.x >= 0)
return false;
if (a.x - center.x == 0 && b.x - center.x == 0) {
if (a.y - center.y >= 0 || b.y - center.y >= 0)
return a.y > b.y;
return b.y > a.y;
}
// compute the cross product of vectors (center -> a) x (center -> b)
int det = (a.x - center.x) * (b.y - center.y) - (b.x - center.x) * (a.y - center.y);
if (det < 0)
return true;
if (det > 0)
return false;
// points a and b are on the same line from the center
// check which point is closer to the center
int d1 = (a.x - center.x) * (a.x - center.x) + (a.y - center.y) * (a.y - center.y);
int d2 = (b.x - center.x) * (b.x - center.x) + (b.y - center.y) * (b.y - center.y);
return d1 > d2;
}
This will order the points clockwise starting from 12 o'clock. Points on the same "hour" will be ordered starting from the ones that are further from the center.
If using integer types (which are not really present in Lua) you'd have to ensure that the det, d1 and d2 variables are of a type that is able to hold the results of the calculations.
If you want to achieve something looking solid, as convex as possible, then I guess you're looking for a Convex Hull. You can compute it using the Graham Scan.
In this algorithm, you also have to sort the points clockwise (or counter-clockwise) starting from a special pivot point. Then you repeat simple loop steps each time checking if you turn left or right adding new points to the convex hull, this check is based on a cross product just like in the above comparison function.
Edit:
Added one more if statement if (a.y - center.y >= 0 || b.y - center.y >=0) to make sure that points that have x=0 and negative y are sorted starting from the ones that are further from the center. If you don't care about the order of points on the same 'hour' you can omit this if statement and always return a.y > b.y.
Corrected the first if statements with adding -center.x and -center.y.
Added the second if statement (a.x - center.x < 0 && b.x - center.x >= 0). It was an obvious oversight that it was missing. The if statements could be reorganized now because some checks are redundant. For example, if the first condition in the first if statement is false, then the first condition of the second if must be true. I decided, however, to leave the code as it is for the sake of simplicity. It's quite possible that the compiler will optimize the code and produce the same result anyway.
What you're asking for is a system known as polar coordinates. Conversion from Cartesian to polar coordinates is easily done in any language. The formulas can be found in this section.
After converting to polar coordinates, just sort by the angle, theta.
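As a rough sketch of that idea in Python (the helper names to_polar and sort_by_theta are invented for this example, and points are assumed to be (x, y) tuples):

```python
import math

def to_polar(point, center):
    """Return (r, theta) of point relative to center; theta is in (-pi, pi]."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

def sort_by_theta(points, center):
    """Sort points by ascending polar angle theta around center."""
    return sorted(points, key=lambda p: to_polar(p, center)[1])
```

With the y axis pointing up, ascending theta is counter-clockwise, so reverse the result if you need clockwise order. In screen coordinates, where y points down, the same ordering appears clockwise.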
An interesting alternative approach to your problem would be to find an approximate solution to the Traveling Salesman Problem (TSP), i.e. the shortest route linking all your points. If your points form a convex shape, it should be the right solution; otherwise, it should still look good (a "solid" shape can be defined as one that has a low perimeter/area ratio, which is what we are optimizing here).
You can use any implementation of an optimizer for the TSP, of which I am pretty sure you can find a ton in your language of choice.
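If you just want to try the idea without pulling in a solver, a small 2-opt local search is enough for a handful of points. To be clear, this is a simple heuristic swapped in for illustration, not a real TSP optimizer; the function names are made up and points are assumed to be (x, y) tuples.

```python
import math

def tour_length(tour):
    """Length of the closed tour visiting the points in order."""
    n = len(tour)
    return sum(math.dist(tour[i], tour[(i + 1) % n]) for i in range(n))

def two_opt(points):
    """Shorten a closed tour by repeatedly reversing segments (2-opt)."""
    tour = list(points)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a vertex
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Change in length if edges a-b and c-d become a-c and b-d.
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour
```

A tour that 2-opt can no longer improve has no crossing edges, which is usually what makes the resulting polygon look "solid". It is not guaranteed to be the shortest tour.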
Another version (return true if a comes before b in counterclockwise direction):
bool lessCcw(const Vector2D &center, const Vector2D &a, const Vector2D &b)
{
    // Computes the quadrant for a and b (0-3):
    //     ^
    //   1 | 0
    //  ---+-->
    //   2 | 3
    const int dax = ((a.x() - center.x()) > 0) ? 1 : 0;
    const int day = ((a.y() - center.y()) > 0) ? 1 : 0;
    const int qa = (1 - dax) + (1 - day) + ((dax & (1 - day)) << 1);
    /* The previous computes the following:
       const int qa =
           ( (a.x() > center.x())
             ? ((a.y() > center.y())
                 ? 0 : 3)
             : ((a.y() > center.y())
                 ? 1 : 2)); */
    const int dbx = ((b.x() - center.x()) > 0) ? 1 : 0;
    const int dby = ((b.y() - center.y()) > 0) ? 1 : 0;
    const int qb = (1 - dbx) + (1 - dby) + ((dbx & (1 - dby)) << 1);
    if (qa == qb) {
        // Same quadrant: compare the directions with a cross product
        return (b.x() - center.x()) * (a.y() - center.y()) < (b.y() - center.y()) * (a.x() - center.x());
    } else {
        return qa < qb;
    }
}
This is faster because the compiler (tested with Visual C++ 2015) does not generate any jumps to compute dax, day, dbx and dby. Here is the assembly output from the compiler:
; 28 : const int dax = ((a.x() - center.x()) > 0) ? 1 : 0;
vmovss xmm2, DWORD PTR [ecx]
vmovss xmm0, DWORD PTR [edx]
; 29 : const int day = ((a.y() - center.y()) > 0) ? 1 : 0;
vmovss xmm1, DWORD PTR [ecx+4]
vsubss xmm4, xmm0, xmm2
vmovss xmm0, DWORD PTR [edx+4]
push ebx
xor ebx, ebx
vxorps xmm3, xmm3, xmm3
vcomiss xmm4, xmm3
vsubss xmm5, xmm0, xmm1
seta bl
xor ecx, ecx
vcomiss xmm5, xmm3
push esi
seta cl
; 30 : const int qa = (1 - dax) + (1 - day) + ((dax & (1 - day)) << 1);
mov esi, 2
push edi
mov edi, esi
; 31 :
; 32 : /* The previous computes the following:
; 33 :
; 34 : const int qa =
; 35 : ( (a.x() > center.x())
; 36 : ? ((a.y() > center.y()) ? 0 : 3)
; 37 : : ((a.y() > center.y()) ? 1 : 2));
; 38 : */
; 39 :
; 40 : const int dbx = ((b.x() - center.x()) > 0) ? 1 : 0;
xor edx, edx
lea eax, DWORD PTR [ecx+ecx]
sub edi, eax
lea eax, DWORD PTR [ebx+ebx]
and edi, eax
mov eax, DWORD PTR _b$[esp+8]
sub edi, ecx
sub edi, ebx
add edi, esi
vmovss xmm0, DWORD PTR [eax]
vsubss xmm2, xmm0, xmm2
; 41 : const int dby = ((b.y() - center.y()) > 0) ? 1 : 0;
vmovss xmm0, DWORD PTR [eax+4]
vcomiss xmm2, xmm3
vsubss xmm0, xmm0, xmm1
seta dl
xor ecx, ecx
vcomiss xmm0, xmm3
seta cl
; 42 : const int qb = (1 - dbx) + (1 - dby) + ((dbx & (1 - dby)) << 1);
lea eax, DWORD PTR [ecx+ecx]
sub esi, eax
lea eax, DWORD PTR [edx+edx]
and esi, eax
sub esi, ecx
sub esi, edx
add esi, 2
; 43 :
; 44 : if (qa == qb) {
cmp edi, esi
jne SHORT $LN37@lessCcw
; 45 : return (b.x() - center.x()) * (a.y() - center.y()) < (b.y() - center.y()) * (a.x() - center.x());
vmulss xmm1, xmm2, xmm5
vmulss xmm0, xmm0, xmm4
xor eax, eax
pop edi
vcomiss xmm0, xmm1
pop esi
seta al
pop ebx
; 46 : } else {
; 47 : return qa < qb;
; 48 : }
; 49 : }
ret 0
$LN37@lessCcw:
pop edi
pop esi
setl al
pop ebx
ret 0
?lessCcw@@YA_NABVVector2D@@00@Z ENDP ; lessCcw
Enjoy.
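For anyone who wants to sanity-check the quadrant arithmetic in lessCcw without a C++ toolchain, here is a rough Python port. It is only a sketch: points are plain (x, y) tuples instead of Vector2D, and the helpers quadrant and sort_ccw are names invented for this example.

```python
from functools import cmp_to_key

def quadrant(dx, dy):
    """Quadrant index 0-3 (counter-clockwise), same arithmetic as qa/qb in lessCcw."""
    ax = 1 if dx > 0 else 0
    ay = 1 if dy > 0 else 0
    return (1 - ax) + (1 - ay) + ((ax & (1 - ay)) << 1)

def less_ccw(center, a, b):
    """True if a comes before b in counter-clockwise order around center."""
    adx, ady = a[0] - center[0], a[1] - center[1]
    bdx, bdy = b[0] - center[0], b[1] - center[1]
    qa, qb = quadrant(adx, ady), quadrant(bdx, bdy)
    if qa == qb:
        # Same quadrant: fall back to the cross product test.
        return bdx * ady < bdy * adx
    return qa < qb

def sort_ccw(points, center):
    """Sort points counter-clockwise around center using less_ccw."""
    def compare(a, b):
        if less_ccw(center, a, b):
            return -1
        if less_ccw(center, b, a):
            return 1
        return 0
    return sorted(points, key=cmp_to_key(compare))
```

Note that the quadrants are half-open: a point on the positive x axis (dy equal to 0) lands in quadrant 3, so it sorts last rather than first.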
vector3 a = new vector3(1, 0, 0)   // reference direction: the X axis
vector3 b = any_point - Center;
- y = (a x b) . n, x = a . b   (n is the polygon's normal, so in the XY plane y is the z component of a x b; the magnitude |a x b| alone would lose the sign)
- Atan2(y, x) gives the angle between -PI and +PI in radians
- convert it to degrees, then (Input % 360 + 360) % 360 maps it into the range 0 to 360
- sort by adding points to list_of_polygon_verts ordered by that angle (0 to 360)
Finally you get the verts sorted anticlockwise
list.Reverse() gives clockwise order
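The same steps in runnable form, as a small Python sketch. It assumes 2D (x, y) tuples, i.e. a polygon lying in the XY plane, and the function names are invented for this example.

```python
import math

def angle_from_x_axis(point, center):
    """Angle in degrees, in [0, 360), of (point - center) measured from the X axis."""
    ax, ay = 1.0, 0.0                      # reference direction a
    bx, by = point[0] - center[0], point[1] - center[1]
    y = ax * by - ay * bx                  # signed cross product (z component)
    x = ax * bx + ay * by                  # dot product
    degrees = math.degrees(math.atan2(y, x))
    return (degrees % 360 + 360) % 360

def sort_clockwise(points, center):
    """Sort anticlockwise by angle, then reverse to get clockwise order."""
    verts = sorted(points, key=lambda p: angle_from_x_axis(p, center))
    verts.reverse()
    return verts
```

In Python the % operator already returns a non-negative result for a positive modulus, so the extra "+ 360" is only there to mirror the formula above, which is written for languages where % can return a negative value.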
I know this is somewhat of an old post with an excellent accepted answer, but I feel like I can still contribute something useful. All the answers so far essentially use a comparison function to compare two points and determine their order, but what if you want to use only one point at a time and a key function?
Not only is this possible, but the resulting code is also extremely compact. Here is the complete solution using Python's built-in sorted function:
import numpy as np
# Create some random points
num = 7
points = np.random.random((num, 2))
# Compute their center
center = np.mean(points, axis=0)
# Make arctan2 function that returns a value from [0, 2 pi) instead of [-pi, pi)
arctan2 = lambda s, c: angle if (angle := np.arctan2(s, c)) >= 0 else 2 * np.pi + angle
# Define the key function
def clockwise_around_center(point):
    diff = point - center
    rcos = np.dot(diff, center)
    rsin = np.cross(diff, center)
    return arctan2(rsin, rcos)
# Sort our points using the key function
sorted_points = sorted(points, key=clockwise_around_center)
This answer would also work in 3D, if the points are on a 2D plane embedded in 3D. We would only have to modify the calculation of rsin by dotting it with the normal vector of the plane. E.g.
rsin = np.dot([0,0,1], np.cross(diff, center))
if that plane has e_z as its normal vector.
The advantage of this code is that it works on only one point at a time using a key function. The quantity rsin, if you work it out on a coefficient level, is exactly the same as what is called det in the accepted answer, except that I compute it between point - center and center, not between point1 - center and point2 - center. But the geometrical meaning of this quantity is the radius times the sine of the angle, hence I call this variable rsin. Similarly for the dot product, which is the radius times the cosine of the angle and hence called rcos.
One could argue that this solution uses arctan2, and is therefore less clean. However, I personally think that the clarity of using a key function outweighs the cost of one call to a trig function. Note that I prefer to have arctan2 return a value from [0, 2 pi), because then we get the angle 0 when point happens to be identical to center, and thus it will be the first point in our sorted list. This is an optional choice.
In order to understand why this code works, the crucial insight is that all our points are defined as arrows with respect to the origin, including the center point itself. So if we calculate point - center, this is equivalent to placing the arrow from the tip of center to the tip of point, at the origin. Hence we can sort the arrow point - center with respect to the angle it makes with the arrow pointing to center.
Here's a way to sort the vertices of a rectangle in clockwise order. I modified the original solution provided by pyimagesearch and got rid of the scipy dependency.
import numpy as np
def pointwise_distance(pts1, pts2):
    """Calculates the distance between pairs of points
    Args:
        pts1 (np.ndarray): array of form [[x1, y1], [x2, y2], ...]
        pts2 (np.ndarray): array of form [[x1, y1], [x2, y2], ...]
    Returns:
        np.array: distances between corresponding points
    """
    dist = np.sqrt(np.sum((pts1 - pts2)**2, axis=1))
    return dist
def order_points(pts):
    """Orders points in form [top left, top right, bottom right, bottom left].
    Source: https://www.pyimagesearch.com/2016/03/21/ordering-coordinates-clockwise-with-python-and-opencv/
    Args:
        pts (np.ndarray): list of points of form [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
    Returns:
        np.ndarray: the four points ordered as [top left, top right, bottom right, bottom left], dtype float32
    """
    # sort the points based on their x-coordinates
    x_sorted = pts[np.argsort(pts[:, 0]), :]
    # grab the left-most and right-most points from the sorted
    # x-coordinate points
    left_most = x_sorted[:2, :]
    right_most = x_sorted[2:, :]
    # now, sort the left-most coordinates according to their
    # y-coordinates so we can grab the top-left and bottom-left
    # points, respectively
    left_most = left_most[np.argsort(left_most[:, 1]), :]
    tl, bl = left_most
    # now that we have the top-left coordinate, use it as an
    # anchor to calculate the Euclidean distance between the
    # top-left and right-most points; by the Pythagorean
    # theorem, the point with the largest distance will be
    # our bottom-right point. Note: this is a valid assumption because
    # we are dealing with rectangles only.
    # We need to use this instead of just using min/max to handle the case where
    # there are points that have the same x or y value.
    D = pointwise_distance(np.vstack([tl, tl]), right_most)
    br, tr = right_most[np.argsort(D)[::-1], :]
    # return the coordinates in top-left, top-right,
    # bottom-right, and bottom-left order
    return np.array([tl, tr, br, bl], dtype="float32")
With numpy:
import matplotlib.pyplot as plt
import numpy as np
# List of coords
coords = np.array([7,7, 5, 0, 0, 0, 5, 10, 10, 0, 0, 5, 10, 5, 0, 10, 10, 10]).reshape(-1, 2)
centroid = np.mean(coords, axis=0)
sorted_coords = coords[np.argsort(np.arctan2(coords[:, 1] - centroid[1], coords[:, 0] - centroid[0])), :]
plt.scatter(coords[:,0],coords[:,1])
plt.plot(coords[:,0],coords[:,1])
plt.plot(sorted_coords[:,0],sorted_coords[:,1])
plt.show()

Resources