ARM-NEON: Conditional register swapping based on parameters

ARM-NEON: Conditional register swapping based on parameters - performance

I am writing a NEON subroutine for image processing that does color swapping, i.e., I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them.
There are at most 6 permutations:
(RGB) -> { (RGB),(RBG),(GRB),(GBR),(BRG),(BGR) }
The most efficient way would be to have a separate subroutine for each case with the corresponding VSWP instructions. As the subroutine will do several other things, I would prefer to keep everything in just one routine, even if it is less efficient.
I have also read that conditional execution and branching are not advisable. So, if I want to have it in one block of branchless code, the only thing that comes to mind is
New_R = a(0)*R+a(1)*G+a(2)*B
New_G = a(3)*R+a(4)*G+a(5)*B
New_B = a(6)*R+a(7)*G+a(8)*B
where exactly one a(i) in each row and each column is 1 at any time, and the rest are 0.
Question: Is there a smarter way to do it, bearing in mind that it has to be coded in NEON?

VTBL.8 is the most powerful tool in NEON to swap bytes.
Loading 3x8 bytes to registers d0,d1,d2 would look like
R G B R G B R G | B R G B R G B R | G B R G B R G B |
0 1 2 3 4 5 6 7 | 8 9 a b c d e f | 10 ...       17 |
VTBL d3, { d0,d1,d2 }, d6 ;; select bytes to d3 from d0,d1,d2 based on d6
VTBL d4, { d0,d1,d2 }, d7
VTBL d5, { d0,d1,d2 }, d8
where d6, d7, d8 encode the source positions of the bytes to write.
E.g. '0 1 2 3 4 5 6 7' keeps the original order, while '0 2 1 3 5 4 6 8', '7 ...' swaps G and B. The constant vectors d6..d8 need to be loaded just once at the beginning of the routine.
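As a rough illustration of how the index vectors could be derived, here is a small Python model of the table lookup. It is only a sketch: `vtbl` and `make_index_vectors` are hypothetical helper names, and the model works on plain byte strings rather than real NEON registers.

```python
def vtbl(table, indices):
    """Model of a byte table lookup: out-of-range indices yield 0."""
    return bytes(table[i] if i < len(table) else 0 for i in indices)

def make_index_vectors(perm):
    """perm[c] is the source channel (0=R, 1=G, 2=B) written to output channel c."""
    idx = [3 * (i // 3) + perm[i % 3] for i in range(24)]
    return idx[0:8], idx[8:16], idx[16:24]  # contents for d6, d7, d8

pixels = b"RGB" * 8                         # 24 interleaved bytes (d0, d1, d2)
d6, d7, d8 = make_index_vectors((0, 2, 1))  # swap G and B
out = vtbl(pixels, d6) + vtbl(pixels, d7) + vtbl(pixels, d8)
print(d6)   # [0, 2, 1, 3, 5, 4, 6, 8]
print(out)  # the same pixels with G and B exchanged
```

Changing the tuple passed to `make_index_vectors` selects any of the six permutations, so only the three constant vectors differ between configurations.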
Another possibility is an interleaved read followed by a sequence of bitwise selects:
VLD3.8 { d0,d1,d2 }, [r0] ; // Read R, G, B to separate registers
VLD3.8 { d3,d4,d5 }, [r0] ; // Make a second copy (or use some other instruction)
VBIT d3, d1, d6 ; // d3 is now either R or G
VBIT d4, d2, d7 ; // d4 is now either G or B
VBIT d5, d0, d8 ; // d5 is now either B or R
VBIT d0, d4, d9 ; // d0 is now R or (G or B)
VBIT d1, d5, d10 ; // d1 is now G or (B or R)
VBIT d2, d3, d11 ; // d2 is now B or (R or G)
Even though 6 registers for the condition codes are used in the example, 3 independent registers should be enough -- one can also use VBIF if reversed logic needs to be used.
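To see which outputs the six-instruction sequence can produce, the following Python sketch models each 64-bit register as an integer and tries every all-zeros/all-ones mask combination. The function names and the test values for R, G, B are assumptions made for illustration only.

```python
from itertools import permutations, product

ALL = 0xFFFFFFFFFFFFFFFF  # all-ones 64-bit mask: take every bit from the source

def vbit(dst, src, mask):
    """Model of VBIT: copy src bits into dst wherever mask has a 1 bit."""
    return (dst & ~mask) | (src & mask)

def swap_channels(r, g, b, m6, m7, m8, m9, m10, m11):
    d0, d1, d2 = r, g, b        # first copy
    d3, d4, d5 = r, g, b        # second copy
    d3 = vbit(d3, d1, m6)       # R or G
    d4 = vbit(d4, d2, m7)       # G or B
    d5 = vbit(d5, d0, m8)       # B or R
    d0 = vbit(d0, d4, m9)       # R or (G or B)
    d1 = vbit(d1, d5, m10)      # G or (B or R)
    d2 = vbit(d2, d3, m11)      # B or (R or G)
    return d0, d1, d2

R, G, B = 0x1111111111111111, 0x2222222222222222, 0x3333333333333333
reachable = {swap_channels(R, G, B, *m) for m in product((0, ALL), repeat=6)}
missing = [p for p in permutations((R, G, B)) if p not in reachable]
print(missing)  # an empty list means every permutation is reachable
```

In this model the masks are either all zeros or all ones, so each channel is selected as a whole; the sketch does not try to reduce the mask count to three.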

Related

Discussion about how to retrieve an i-th element in the j-th level of a binary tree algorithm

I am solving some problems from a site called codefights, and the last one I solved was about a binary tree, stated as follows:
Consider a special family of Engineers and Doctors. This family has
the following rules:
Everybody has two children. The first child of an Engineer is an
Engineer and the second child is a Doctor. The first child of a Doctor
is a Doctor and the second child is an Engineer. All generations of
Doctors and Engineers start with an Engineer.
We can represent the situation using this diagram:
       E
     /   \
   E       D
  / \     / \
 E   D   D   E
/ \ / \ / \ / \
E D D E D E E D
Given the level and position of a person in the ancestor tree above,
find the profession of the person. Note: in this tree first child is
considered as left child, second - as right.
As there are some space and time restrictions, the solution cannot be based on actually constructing the tree down to the required level and checking which element is in the requested position. So far so good. My proposed solution, written in Python, was:
def findProfession(level, pos):
    size = 2**(level-1)
    shift = False
    # Halve the search window; moving into the right half flips the profession
    while size > 2:
        if pos <= size//2:
            size //= 2
        else:
            size //= 2
            pos -= size
            shift = not shift
    if pos == 1 and shift == False:
        return 'Engineer'
    if pos == 1 and shift == True:
        return 'Doctor'
    if pos == 2 and shift == False:
        return 'Doctor'
    if pos == 2 and shift == True:
        return 'Engineer'
As it solved the problem, I got access to the solutions of other users and I was astonished by this one:
def findProfession(level, pos):
    return ['Engineer', 'Doctor'][bin(pos-1).count("1")%2]
Moreover, I did not understand the logic behind it, which is how we arrived at this question. Could someone explain this algorithm to me?

Let's number the nodes of the tree in the following way:
1) the root has number 1
2) the first child of node x has number 2*x
3) the second child of node x has number 2*x+1
Now, notice that each time you go to the first child, the profession stays the same, and you add a 0 to the binary representation of the node.
And each time you go to the second child, the profession flips and you add a 1 to the binary representation.
Example: Let's find the profession of the 4th node in the 4th level (last level in the diagram you have in the question). First we start at the root with number 1, then we go to the first child with number 2 (10 binary). After that we go to the second child of 2 which is 5 (101 binary). Finally, we go to the second child of 5 which is 11 (1011 binary).
Notice that we started with only one bit equal to 1, then every 1 bit we added to the binary representation flipped the profession. So the number of times we flip a profession is equal to the (number of bits equal to 1) - 1. The parity of this amount decides the profession.
This leads us to the following solution:
X = number of bits equal to 1 in [ 2^(level-1) + pos - 1 ]
Y = (X-1) mod 2
if Y is 0 then the answer is "Engineer"
Otherwise the answer is "Doctor"
since 2^(level-1) is a power of 2, it has exactly one bit equal to 1, therefore you can write:
X = number of bits equal to 1 in [ pos-1 ]
Y = X mod 2
Which is equal to the solution you mentioned in the question.
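The argument can be checked against a brute-force construction of the generations. The sketch below is only a sanity check for small levels; `profession_by_tree` and `profession_by_bits` are names chosen here, not part of the original solutions.

```python
def profession_by_tree(level, pos):
    """Brute force: build each generation from the family rules (1-indexed)."""
    row = "E"
    for _ in range(level - 1):
        # An Engineer has children E, D; a Doctor has children D, E
        row = "".join("ED" if p == "E" else "DE" for p in row)
    return "Engineer" if row[pos - 1] == "E" else "Doctor"

def profession_by_bits(level, pos):
    """Each 1 bit in pos-1 is one step to a second child, i.e. one flip."""
    return ["Engineer", "Doctor"][bin(pos - 1).count("1") % 2]

for level in range(1, 11):
    for pos in range(1, 2 ** (level - 1) + 1):
        assert profession_by_tree(level, pos) == profession_by_bits(level, pos)
```

Note that `level` does not influence the bit-counting result at all, which matches the observation that 2^(level-1) contributes exactly one 1 bit.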

This type of sequence is known as the Thue-Morse sequence. Using the same tree, here is a demonstration of why it gives the correct answer:
p is the 0-indexed position
b is the binary representation of p
c is the number of 1's in b
                 p0
                 E
                 b0
                 c0
            /         \
       p0                  p1
       E                   D
       b0                  b1
       c0                  c1
    /     \             /     \
  p0        p1        p2        p3
  E         D         D         E
  b0        b1        b10       b11
  c0        c1        c1        c2
 /  \      /  \      /  \      /  \
p0   p1   p2   p3   p4   p5   p6   p7
E    D    D    E    D    E    E    D
b0   b1   b10  b11  b100 b101 b110 b111
c0   c1   c1   c2   c1   c2   c2   c3
c is always even for Engineer and odd for Doctor. Therefore:
index = bin(pos-1).count('1') % 2
return ['Engineer', 'Doctor'][index]

Tableau LOD to find median

I have some data:
Inst  Dest_Group  Dest Cipn1  N
I1    C           a           43
I1    F           a           63
I1    U           a           54
I1    C           b           96
I1    F           b           3
I1    U           b           78
I1    C           c           12
I1    F           c           65
I1    U           c           49
I2    C           a           3
I2    F           a           47
etc...
My worksheet is set up so that [Dest Cipn1] is a row, and [Dest Group] is a column. They display [value] as a bar chart. [value] = {include [Inst] : sum([N])} / {fixed [Inst] : sum([N])}
This worksheet is filtered on [Inst] = I1. I would like to add a reference line that shows the median value for each bar (cell) across all the [Inst]. (In the end I will add a band that displays 25th - 75th percentile but I figured working with the median would be simpler first).
I thought this would work, but it doesn't: [AllInstMedian] = {fixed [Inst],[Dest Group], [Dest Cipn1] : Sum([N])} / {fixed [Inst] : Sum([N])}
Any suggestions? I'm attaching a sample workbook here hoping that helps .
This is cross-posted here
Thank you

Steve Mayer commented with an answer on the Tableau link posted in the question. I ended up using a Lookup trick to copy Inst and then used table calculations with WINDOW_PERCENTILE for the 25th and 75th percentiles.

Parity bit checks using General Hamming Algorithm

In a logic circuit, I have an 8-bit data vector that is fed into an ECC IC which I am supposed to develop the logic for and that contains a vector of 5 Parity Bits. My first step to develop the logic (with logic gates, XOR), is to figure out which parity bit is going to check for which Data bits (since they are interlaced). I am using even parity, and following general hamming code rules (a parity bit in every 2^n ), I get the following sequence of output:
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5
Following the General Hamming Algorithm:
For each parity bit, at positions 1, 2, 4, 8, 16 and so on (powers of 2), we skip the first n-1 positions and check 1 bit, then we skip another one, then check another one, etc. We repeat the same process for the other parity bits, but this time checking/skipping every 2^n, where n is the position they occupy in the output array (P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5)
Following that convention, I get:
P1 Checks data bits -> XOR(3 5 7 9 10 12)
P2 Checks data bits -> XOR(3 6 7 10 11)
P3 Checks data bits -> XOR(5 6 10 11 12)
P4 Checks data bits -> XOR(9 10 11)
Am I right? The thing that confuses me is whether I should count the parity bit itself as one of the 2^n bits that are supposed to be checked, or start 1 bit after that specific parity bit. It pretty much comes down to whether the range is inclusive or not.
Thank you for your help in advance!
Cheers!

You can follow this scheme. The bits marked in each row must sum to 0 (mod 2); in other words, for the marked positions in each row the number of set bits must be even.
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8
x     x     x     x     x     x
   x  x        x  x        x  x
         x  x  x  x              x
                     x  x  x  x  x
I don't understand why you have P5 in the scheme.
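The coverage pattern follows from the usual Hamming rule that the parity bit at position 2^k covers every position whose binary index has bit k set, the parity bit itself included. A minimal Python sketch, assuming a 12-bit codeword and even parity (the function names are made up for this example):

```python
def parity_coverage(total_bits):
    """Map each parity position (a power of 2) to the positions it checks."""
    cover = {}
    p = 1
    while p <= total_bits:
        cover[p] = [pos for pos in range(1, total_bits + 1) if pos & p]
        p <<= 1
    return cover

def encode(data_bits, total_bits=12):
    """Place D1..D8 at the non-power-of-2 positions and fill in even parity."""
    code = [0] * (total_bits + 1)  # index 0 is unused
    data_positions = [p for p in range(1, total_bits + 1) if p & (p - 1)]
    for pos, bit in zip(data_positions, data_bits):
        code[pos] = bit
    for p, positions in parity_coverage(total_bits).items():
        code[p] = sum(code[q] for q in positions if q != p) % 2
    return code[1:]

for p, positions in parity_coverage(12).items():
    print(p, positions)
```

For 12 bits this prints the groups starting at positions 1, 2, 4 and 8, so each parity bit is included in its own group and only four parity bits are needed for 8 data bits.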

tinyAVR: How can one multiply by 203, 171, or 173 real fast?

Focussing on worst case cycle count, I've coded integer multiplication routines for Atmel's AVR architecture.
In one particular implementation, I'm stuck with 2+1 worst cases, for each of which I seek a faster implementation. These multiply multiplicands with an even number of bytes with known values of an 8-bit part of the multiplier:
* 11001011 (203 decimal)
* 10101011 (171 decimal)
* 10101101 (173 decimal)
GCC (4.8.1) computes these as *29*7, *19*9, and *(43*4+1) - a nice fit for a 3-address machine, which the tinyAVR isn't (quite: most have register pair move twice as fast as add). For a two byte multiplicand & product, this uses 9+2, 10+2, and 11+2 additions(&subtractions) and moves, respectively, for 20, 22, and 24 cycles. Radix-4 Booth would use 11+1 additions (under not exactly comparable conditions) and 23 cycles.
For reasons beyond this question, I have 16*multiplicand precomputed (a5:a4, 7 cycles including move); both original and shifted multiplicand might be used later on (but for the MSByte). And the product is initialised to the multiplicand for the following assembler code snippets (in which I use a Booth-style recoding notation: . for NOP, +, and -. owing is a label "one instruction before done", performing a one-cycle fix-up):
locb:; .-..+.++ ..--.-.- ++..++.- ++.+.-.- ..--.-.- 29*7
; WC?!? 11001011 s 18
add p0, p0;15 n 16 a4 15 s 16 n 15 s0 17
adc p1, p1
sub p0, a4;13 d 13 z 13 s0 15 s4 12 d 15
sbc p1, a5
add p0, p0;11 s4 11 d 12 z 13 z 10 s0 13
adc p1, p1
add p0, a0; 9 d 9 aZ 10 a4 12 s0 9 z 11
adc p1, a1
add p0, p0; 7 s4 7 d 8 a4 10 d 7 d 10
adc p1, p1
add p0, a0; 5 s0 5 d 6 d 8 az 5 aZ 8
adc p1, a1
rjmp owing; 3 owi 3 s0 4 d 6 owi 3 d 6
; aZ 4 aZ 4
(The comments are, from left to right, a cycle count ("backwards from reaching done"), and further code sequences using the recodings in the same column in "the label line", using a shorthand of n for negate, d for double (partial) product, a0/a4/s0/s4 for adding or subtracting the multiplicand shifted 0 or 4 bits to the left, z for storing the product in ZL:ZH, and aZ/sZ for using that.)
The other worst cases using macros (and the above-mentioned shorthand):
loab:; .-.-.-.- +.++.-.- +.+.+.++ .-.-+.++ +.+.++.-
; WC 10101011
negP ;15 set4 ;16 add4 ;15 d 16 a4 16
sub4 ;12 doubleP ;15 pp2Z ;13 s4 14 d 14
pp2Z ;10 subA ;13 doubleP ;12 z 12 a0 12
doubleP ; 9 pp2Z ;11 doubleP ;10 d 11 d 10
doubleP ; 7 doubleP ;10 addZ ; 8 d 9 a4 8
addZ ; 5 doubleP ; 8 doubleP ; 6 aZ 7 d 6
owing ; 3 addZ ; 6 addA ; 4 a0 5 s0 4
; add4 ; 4
loac:; +.+.++.. ++-.++.. .-.-.-..
load: ; .-.-..-- .-.-.-.+ .--.++.+ 0.9 1.8 0.8 (avg)
; WC 10101101
negP ;15 -1 negP ;16 a4 a0 a0 17
sub4 ;12 -16-1 sub4 ;13 d s4 a0
pp2Z ;10 -16-1 doubleP ;11 a0 Z s4
sub4 ; 9 -32-1 doubleP ; 9 d d d
doubleP ; 7 -64-2 sub4 ; 7 a4 aZ d
addZ ; 5 -80-3 addA ; 5 d d a0
owing ; 3 owing ; 3 a0 a0 s4
(I'm not looking for more than one of the results at any single time/as the result of any single invocation - but if you have a way to get two in less than 23 cycles or all three in less than 26, please let me know. To substantiate my claim to know about CSE, (re)using the notations [rl]sh16 and add16 introduced by vlad_tepesch:
movw C, x ; 1
lsh16(C) ; 3 C=2x
movw A, C ; 4
swap Al ; 5
swap Ah ; 6
cbr Ah, 15 ; 7
add Ah, Al ; 8
cbr Al, 15 ; 9
sub Ah, Al ;10 A=32x
movw X, A ;11
add16(X, C) ;13 X=34x
movw B, X ;14
lsh16(X) ;16
lsh16(X) ;18 X=136x
add16(B, X) ;20 B=170x
add16(B, x) ;22 B=171x
add16(A, B) ;24 A=203x
add16(C, B) ;26 C=173x
- notice that the 22 cycles to the first result are just the same old 20 cycles, plus two register pair moves. The sequence of actions beside those is that of the third column/alternative following the label loab above.)
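The arithmetic of the shared-subexpression listing can be checked with a small Python model. This is only a sketch of the data flow under the assumption of 16-bit wrap-around; it says nothing about cycle counts, and the helper names are invented here.

```python
MASK16 = 0xFFFF

def swap(b):
    """Model of AVR SWAP: exchange the two nibbles of a byte."""
    return ((b << 4) | (b >> 4)) & 0xFF

def shift_left_4(value):
    """16-bit << 4 using the swap/cbr/add/cbr/sub sequence from the listing."""
    al, ah = value & 0xFF, value >> 8
    al = swap(al)           # swap Al
    ah = swap(ah)           # swap Ah
    ah &= 0xF0              # cbr Ah, 15
    ah = (ah + al) & 0xFF   # add Ah, Al
    al &= 0xF0              # cbr Al, 15
    ah = (ah - al) & 0xFF   # sub Ah, Al
    return (ah << 8) | al

def multiples(x):
    """Return (203x, 171x, 173x) modulo 2**16, following the listing."""
    c = (2 * x) & MASK16        # C = 2x
    a = shift_left_4(c)         # A = 32x
    xr = (a + c) & MASK16       # X = 34x
    b = xr                      # B = 34x
    xr = (xr << 2) & MASK16     # X = 136x
    b = (b + xr) & MASK16       # B = 170x
    b = (b + x) & MASK16        # B = 171x
    a = (a + b) & MASK16        # A = 203x
    c = (c + b) & MASK16        # C = 173x
    return a, b, c

print(multiples(3))  # three products of 3
```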
While 20 cycles (15-2(rjmp)+7(*16)) doesn't look that bad, these are the worst cases. On an AVR CPU without a mul instruction,
How can one multiply real fast by each of 203, 171, and 173?
(Moving one case just before done or owing, having the other two of these faster would shorten the critical path/improve worst case cycle count.)

I am not very familiar with avr-asm, but I know the AVRs quite well, so I'll give it a try.
If you need these products at the same place, you could use common intermediate results and try adding multiples of 2.
(a) 203: +128 +64 +8 +2 +1 = +128 +32 +8 +2 +1 +32
(b) 171: +128 +32 +8 +2 +1
(c) 173: +128 +32 +8 +4 +1 = +128 +32 +8 +2 +1 +2
The key is that the 16-bit right shift and the 16-bit addition have to be efficient.
I do not know if I have overlooked something, but:
rsh16 (X):
LSR Xh
ROR Xl
and
add16 (Y,X):
ADD Yl, Xl
ADC Yh, Xh
Both take 2 cycles.
One register pair holds the current x*2^n value (Xh, Xl), and 3 other pairs (Ah, Al, Bh, Bl, Ch, Cl) hold the results.
1. Xh <- x; Xl <- 0 (X = 256*x)
2. rsh16(X) (X = 128*x)
3. B = X (B = 128*x)
4. rsh16(X); rsh16(X) (X = 32*x)
5. add16(B, X) (B = 128*x + 32*x)
6. A = X (A = 32*x)
7. rsh16(X); rsh16(X) (X = 8*x)
8. add16(B, X) (B = 128*x + 32*x+ 8*x)
9. rsh16(X); rsh16(X) (X = 2*x)
10. add16(B, X) (B = 128*x + 32*x + 8*x + 2*x)
11. C = X (C = 2*x)
12. CLR Xh (clear Xh so we only add the carry below)
add Bl, x
adc Bh, Xh (B = 128*x + 32*x + 8*x + 2*x + x)
13. add16(A, B) (A = 32*x + B)
14. add16(C, B) (C = 2*x + B)
If I am correct this would sum up to 32 cycles for all three multiplications and requires 9 Registers (1 in, 6 out, 2 temporary)
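The fourteen steps can be traced in Python to confirm that the three register pairs end up holding the intended products. The sketch assumes an 8-bit multiplicand x (as step 1 implies) and 16-bit register pairs; `three_products` is a name chosen for this illustration.

```python
MASK16 = 0xFFFF

def three_products(x):
    """Follow steps 1-14 for an 8-bit x; returns (A, B, C)."""
    assert 0 <= x <= 0xFF
    X = x << 8               # 1.  X = 256*x
    X >>= 1                  # 2.  X = 128*x
    B = X                    # 3.  B = 128*x
    X >>= 2                  # 4.  X = 32*x
    B = (B + X) & MASK16     # 5.  B = 160*x
    A = X                    # 6.  A = 32*x
    X >>= 2                  # 7.  X = 8*x
    B = (B + X) & MASK16     # 8.  B = 168*x
    X >>= 2                  # 9.  X = 2*x
    B = (B + X) & MASK16     # 10. B = 170*x
    C = X                    # 11. C = 2*x
    B = (B + x) & MASK16     # 12. B = 171*x
    A = (A + B) & MASK16     # 13. A = 203*x
    C = (C + B) & MASK16     # 14. C = 173*x
    return A, B, C

print(three_products(255))  # largest 8-bit input still fits in 16 bits
```

Because x starts in the high byte, every right shift is exact, so no bits are lost before the additions.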

What did (sort of) work for me:
Triplicate the owing&done-code: no jump for each of the worst cases.
(Making all three faster than tens of "runners up" - meh.)

Is there a specialized algorithm, faster than quicksort, to reorder data ACEGBDFH?

I have some data coming from the hardware. Data comes in blocks of 32 bytes, and there are potentially millions of blocks. Data blocks are scattered in two halves the following way (a letter is one block):
A C E G I K M O B D F H J L N P
or if numbered
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
First all blocks with even indexes, then the odd blocks. Is there a specialized algorithm to reorder the data correctly (alphabetical order)?
The constraints are mainly on space. I don't want to allocate another buffer to reorder: just one more block. But I'd also like to keep the number of moves low: a simple quicksort would be O(NlogN). Is there a faster solution in O(N) for this special reordering case?

Since this data is always in the same order, sorting in the classical sense is not needed at all. You do not need any comparisons, since you already know in advance which of two given data points comes first.
Instead you can produce the permutation on the data directly. If you transform this into cyclic form, this will tell you exactly which swaps to do, to transform the permuted data into ordered data.
Here is an example for your data:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now calculate the inverse (I'll skip this step, because I am lazy here, assume instead the permutation I have given above actually is the inverse already).
Here is the cyclic form:
(0)(1 8 4 2)(3 9 12 6)(5 10)(7 11 13 14)(15)
So if you want to reorder a sequence structured like this, you would do
# first cycle
# nothing to do
# second cycle
swap 1 8
swap 8 4
swap 4 2
# third cycle
swap 3 9
swap 9 12
swap 12 6
# so on for the other cycles
If you would have done this for the inverse instead of the original permutation, you would get the correct sequence with a proven minimal number of swaps.
EDIT:
For more details on something like this, see the chapter on Permutations in TAOCP for example.
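A cycle-following pass for this particular layout could look like the Python sketch below. It is an illustration under stated assumptions: the block count is even, one block is held in a temporary, and a visited flag per block (N bits, not N blocks) is used to find cycle starts; `unshuffle_in_place` is a made-up name.

```python
def unshuffle_in_place(blocks):
    """Reorder [evens..., odds...] into natural order by following cycles."""
    n = len(blocks)
    half = n // 2
    visited = [False] * n          # one flag per block, assumed affordable
    for start in range(n):
        if visited[start]:
            continue
        i = start
        carried = blocks[start]    # the single extra block of storage
        while True:
            # Block at i in the first half is item 2*i, otherwise an odd item
            j = 2 * i if i < half else 2 * (i - half) + 1
            blocks[j], carried = carried, blocks[j]
            visited[j] = True
            i = j
            if i == start:
                break
    return blocks

print("".join(unshuffle_in_place(list("ACEGIKMOBDFHJLNP"))))
```

Each block is written exactly once, so the number of moves grows linearly with the number of blocks.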

So you have data coming in in a pattern like
a0 a2 a4...a14 a1 a3 a5...a15
and you want to have it sorted to
b0 b1 b2...b15
With some reordering the permutation can be written like:
a0 -> b0
a8 -> b1
a1 -> b2
a2 -> b4
a4 -> b8
a9 -> b3
a3 -> b6
a6 -> b12
a12 -> b9
a10 -> b5
a5 -> b10
a11 -> b7
a7 -> b14
a14 -> b13
a13 -> b11
a15 -> b15
So if you want to sort it in place with only one block of additional space in a temporary t, this could be done in O(1) with
t = a8; a8 = a4; a4 = a2; a2 = a1; a1 = t
t = a9; a9 = a12; a12 = a6; a6 = a3; a3 = t
t = a10; a10 = a5; a5 = t
t = a11; a11 = a13; a13 = a14; a14 = a7; a7 = t
Edit: The general case (for N != 16), if it is solvable in O(N), is actually an interesting question. I suspect the cycles always start with a prime number which satisfies p < N/2 && N mod p != 0 and the indices follow a recurrence like i[n+1] = 2*i[n] mod (N-1), but I am not able to prove it. If this is the case, deriving an O(N) algorithm is trivial.
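For experimenting with the general case, the cycles can be enumerated with a few lines of Python. The sketch assumes the destination of index i is 2*i taken modulo N-1 (with 0 and N-1 staying in place), which reproduces the mapping table above for N = 16; it does not settle the conjecture about cycle starts.

```python
def cycles(n):
    """Cycles of i -> 2*i mod (n-1) over the indices 1..n-2, for even n."""
    seen, result = set(), []
    for start in range(1, n - 1):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = (2 * i) % (n - 1)
        result.append(cycle)
    return result

print(cycles(16))  # [[1, 2, 4, 8], [3, 6, 12, 9], [5, 10], [7, 14, 13, 11]]
```

Printing `cycles(n)` for other even n gives quick counterexamples or support for guesses about where the cycles begin.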

Maybe I'm misunderstanding, but if the order is always identical to the one given, then you can "pre-program" (i.e. avoiding all comparisons) the optimum solution (which is going to be the one that has the minimum number of swaps to move from the string given to ABCDEFGHIJKLMNOP and which, for something this small, you can work out by hand; see LiKao's answer).

It is easier for me to label your set with numbers:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
Start from the 14 and move all even numbers to place (8 swaps). You will get this:
0 1 2 9 4 6 13 8 3 10 7 12 11 14 15
Now you need another 3 swaps (9 with 3, 7 with 13, 11 with 13 moved from 7).
A total of 11 swaps. Not a general solution, but it could give you some hints.

You can also view the intended permutation as a shuffle of the address bits 'abcd <-> dabc' (with abcd the individual bits of the index). Like:
#include <stdio.h>
#define ROTATE(v,n,i) (((v)>>(i)) | (((v) & ((1u <<(i))-1)) << ((n)-(i))))
/******************************************************/
int main (int argc, char **argv)
{
    unsigned i,a,b;
    for (i=0; i < 16; i++) {
        a = ROTATE(i,4,1);
        b = ROTATE(a,4,3);
        fprintf(stdout,"i=%u a=%u b=%u\n", i, a, b);
    }
    return 0;
}
/******************************************************/

That was count sort I believe
