tinyAVR: How can one multiply by 203, 171, or 173 real fast?


Focussing on worst case cycle count, I've coded integer multiplication routines for Atmel's AVR architecture.
In one particular implementation, I'm stuck with 2+1 worst cases, for each of which I seek a faster implementation. These multiply multiplicands with an even number of bytes with known values of an 8-bit part of the multiplier:
* 11001011 (203 decimal)
* 10101011 (171 decimal)
* 10101101 (173 decimal)
GCC (4.8.1) computes these as *29*7, *19*9, and *(43*4+1) - a nice fit for a 3-address machine, which the tinyAVR isn't (quite: most have register pair move twice as fast as add). For a two byte multiplicand & product, this uses 9+2, 10+2, and 11+2 additions(&subtractions) and moves, respectively, for 20, 22, and 24 cycles. Radix-4 Booth would use 11+1 additions (under not exactly comparable conditions) and 23 cycles.
For reasons beyond this question, I have 16*multiplicand precomputed (a5:a4, 7 cycles including move); both original and shifted multiplicand might be used later on (but for the MSByte). And the product is initialised to the multiplicand for the following assembler code snippets (in which I use a Booth-style recoding notation: . for NOP, +, and -. owing is a label "one instruction before done", performing a one-cycle fix-up):
locb:; .-..+.++ ..--.-.- ++..++.- ++.+.-.- ..--.-.- 29*7
; WC?!? 11001011 s 18
add p0, p0;15 n 16 a4 15 s 16 n 15 s0 17
adc p1, p1
sub p0, a4;13 d 13 z 13 s0 15 s4 12 d 15
sbc p1, a5
add p0, p0;11 s4 11 d 12 z 13 z 10 s0 13
adc p1, p1
add p0, a0; 9 d 9 aZ 10 a4 12 s0 9 z 11
adc p1, a1
add p0, p0; 7 s4 7 d 8 a4 10 d 7 d 10
adc p1, p1
add p0, a0; 5 s0 5 d 6 d 8 az 5 aZ 8
adc p1, a1
rjmp owing; 3 owi 3 s0 4 d 6 owi 3 d 6
; aZ 4 aZ 4
(The comments are, from left to right, a cycle count ("backwards from reaching done"), and further code sequences using the recodings in the same column in "the label line", using a shorthand of n for negate, d for double (partial) product, a0/a4/s0/s4 for adding or subtracting the multiplicand shifted 0 or 4 bits to the left, z for storing the product in ZL:ZH, and aZ/sZ for using that.)
The other worst cases using macros (and the above-mentioned shorthand):
loab:; .-.-.-.- +.++.-.- +.+.+.++ .-.-+.++ +.+.++.-
; WC 10101011
negP ;15 set4 ;16 add4 ;15 d 16 a4 16
sub4 ;12 doubleP ;15 pp2Z ;13 s4 14 d 14
pp2Z ;10 subA ;13 doubleP ;12 z 12 a0 12
doubleP ; 9 pp2Z ;11 doubleP ;10 d 11 d 10
doubleP ; 7 doubleP ;10 addZ ; 8 d 9 a4 8
addZ ; 5 doubleP ; 8 doubleP ; 6 aZ 7 d 6
owing ; 3 addZ ; 6 addA ; 4 a0 5 s0 4
; add4 ; 4
loac:; +.+.++.. ++-.++.. .-.-.-..
load: ; .-.-..-- .-.-.-.+ .--.++.+ 0.9 1.8 0.8 (avg)
; WC 10101101
negP ;15 -1 negP ;16 a4 a0 a0 17
sub4 ;12 -16-1 sub4 ;13 d s4 a0
pp2Z ;10 -16-1 doubleP ;11 a0 Z s4
sub4 ; 9 -32-1 doubleP ; 9 d d d
doubleP ; 7 -64-2 sub4 ; 7 a4 aZ d
addZ ; 5 -80-3 addA ; 5 d d a0
owing ; 3 owing ; 3 a0 a0 s4
(I'm not looking for more than one of the results at any single time/as the result of any single invocation - but if you have a way to get two in less than 23 cycles or all three in less than 26, please let me know. To substantiate my claim to know about CSE, (re)using the notations [rl]sh16 and add16 introduced by vlad_tepesch:
movw C, x ; 1
lsh16(C) ; 3 C=2x
movw A, C ; 4
swap Al ; 5
swap Ah ; 6
cbr Ah, 15 ; 7
add Ah, Al ; 8
cbr Al, 15 ; 9
sub Ah, Al ;10 A=32x
movw X, A ;11
add16(X, C) ;13 X=34x
movw B, X ;14
lsh16(X) ;16
lsh16(X) ;18 X=136X
add16(B, X) ;20 B=170X
add16(B, x) ;22 B=171X
add16(A, B) ;24 A=203X
add16(C, B) ;26 C=173X
- notice that the 22 cycles to the first result are just the same old 20 cycles, plus two register pair moves. The sequence of actions beside those is that of the third column/alternative following the label loab above.)
While 20 cycles (15-2(rjmp)+7(*16)) doesn't look that bad, these are the worst cases. On an AVR CPU without mul-instruction,
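The common-subexpression sequence above can be checked with a small byte-level model. This is only a sketch in Python, not AVR code: the function name mul_cse is made up for illustration, lsh16/add16 are modelled as 16-bit operations modulo 2^16, and the swap/cbr/add/sub steps are modelled on single bytes to show why they leave 16 times the pair's value.

```python
def mul_cse(x):
    """Model of the movw/swap/cbr sequence; x is a 16-bit multiplicand.

    Returns (203*x, 171*x, 173*x), each reduced modulo 2**16.
    mul_cse is a hypothetical name used only for this sketch.
    """
    M = 0xFFFF

    def swap(b):
        # AVR 'swap': exchange the two nibbles of a byte
        return ((b << 4) | (b >> 4)) & 0xFF

    c = (x << 1) & M                 # lsh16(C): C = 2x
    al, ah = c & 0xFF, c >> 8        # movw A, C
    al, ah = swap(al), swap(ah)      # swap Al ; swap Ah
    ah &= 0xF0                       # cbr Ah, 15
    ah = (ah + al) & 0xFF            # add Ah, Al
    al &= 0xF0                       # cbr Al, 15
    ah = (ah - al) & 0xFF            # sub Ah, Al  -> A = 16*C = 32x
    a = (ah << 8) | al
    xr = (a + c) & M                 # X = 34x
    b = xr                           # movw B, X
    xr = (xr << 2) & M               # two lsh16: X = 136x
    b = (b + xr) & M                 # B = 170x
    b = (b + x) & M                  # B = 171x
    a = (a + b) & M                  # A = 203x
    c = (c + b) & M                  # C = 173x
    return a, b, c
```

The model only confirms the arithmetic; it says nothing about cycle counts, which depend on the real instruction timings.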
How can one multiply real fast by each of 203, 171, and 173?
(Moving one case just before done or owing, having the other two of these faster would shorten the critical path/improve worst case cycle count.)

I am not very familiar with avr-asm, but I know the AVRs quite well, so I'll give it a try.
If you need these products at the same place, you could use common intermediate results and try adding multiples of 2.
(a) 203: +128 +64 +8 +2 +1 = +128 +32 +8 +2 +1 +32
(b) 171: +128 +32 +8 +2 +1
(c) 173: +128 +32 +8 +4 +1 = +128 +32 +8 +2 +1 +2
The key is that 16-bit right shift and 16-bit addition have to be efficient.
I do not know if I have overlooked something, but:
rsh16 (X):
LSR Xh
ROR Xl
and
add16 (Y,X)
ADD Yl, Xl
ADC Yh, Xh
Both 2 cycles.
One register pair holds the current x*2^n value (Xh, Xl), and 3 other pairs (Ah, Al, Bh, Bl, Ch, Cl) hold the results.
1. Xh <- x; Xl <- 0 (X = 256*x)
2. rsh16(X) (X = 128*x)
3. B = X (B = 128*x)
4. rsh16(X); rsh16(X) (X = 32*x)
5. add16(B, X) (B = 128*x + 32*x)
6. A = X (A = 32*X)
7. rsh16(X); rsh16(X) (X = 8*x)
8. add16(B, X) (B = 128*x + 32*x+ 8*x)
9. rsh16(X); rsh16(X) (X = 2*x)
10. add16(B, X) (B = 128*x + 32*x + 8*x + 2*x)
11. C = X (C = 2*X)
12. CLR Xh (clear Xh so we only add the carry below)
add Bl, x
adc Bh, Xh (B = 128*x + 32*x + 8*x + 2*x + x)
13. add16(A, B) (A = 32*X + B)
14. add16(C, B) (C = 2*X + B)
If I am correct this would sum up to 32 cycles for all three multiplications and requires 9 Registers (1 in, 6 out, 2 temporary)
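The numbered steps can be traced with a short model to confirm that the three register pairs end up holding the intended products. This is a sketch in Python rather than AVR assembly; it assumes x is a single byte (as in step 1), and the function name shift_add_products is invented for the example.

```python
def shift_add_products(x):
    """Follow steps 1-14 for an 8-bit x; returns (A, B, C)."""
    mask = 0xFFFF
    X = (x << 8) & mask   # 1.  Xh <- x, Xl <- 0      (X = 256*x)
    X >>= 1               # 2.  rsh16(X)              (X = 128*x)
    B = X                 # 3.  B = 128*x
    X >>= 2               # 4.  two rsh16             (X = 32*x)
    B = (B + X) & mask    # 5.  B = 160*x
    A = X                 # 6.  A = 32*x
    X >>= 2               # 7.  two rsh16             (X = 8*x)
    B = (B + X) & mask    # 8.  B = 168*x
    X >>= 2               # 9.  two rsh16             (X = 2*x)
    B = (B + X) & mask    # 10. B = 170*x
    C = X                 # 11. C = 2*x
    B = (B + x) & mask    # 12. add x with carry      (B = 171*x)
    A = (A + B) & mask    # 13. A = 203*x
    C = (C + B) & mask    # 14. C = 173*x
    return A, B, C
```

Because the shifts start from 256*x and x has only 8 bits, no set bit is ever shifted out, and the largest result (203*255) still fits in 16 bits.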

What did (sort of) work for me:
Triplicate the owing&done-code: no jump for each of the worst cases.
(Making all three faster than tens of "runners up" - meh.)

Related

Parity bit checks using General Hamming Algorithm

In a logic circuit, I have an 8-bit data vector that is fed into an ECC IC which I am supposed to develop the logic for and that contains a vector of 5 Parity Bits. My first step to develop the logic (with logic gates, XOR), is to figure out which parity bit is going to check for which Data bits (since they are interlaced). I am using even parity, and following general hamming code rules (a parity bit in every 2^n ), I get the following sequence of output:
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5
Following the General Hamming Algorithm:
For each parity bit, Position 1,2,4,8,16 and so on... (Powers of 2), we skip for the first position n (n-1) and we check 1 bit, then we skip another one, the check another one, etc... we repeat the same process for the other bits, but this time checking/skipping every 2^n, where n is the position they occupy in the output array (P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5)
Following that convention, I get:
P1 Checks data bits -> XOR(3 5 7 9 10 12)
P2 Checks data bits -> XOR(3 6 7 10 11)
P3 Checks data bits -> XOR(5 6 10 11 12)
P4 Checks data bits -> XOR(9 10 11)
Am I right? The thing that confuses me is that if I should start checking counting the parity bit as one of the 2^n bits that are supposed to be checked, or 1 bit after that specific parity bit. Pretty much sums up to if it is inclusive or not.
Thank you for your help in advance!
Cheers!

You can follow this scheme. The bits marked x in each row must sum to 0 (mod 2); in other words, the number of set bits among the marked positions in each row must be even. Each parity bit is included in its own row.
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8
x  .  x  .  x  .  x  .  x  .  x  .
.  x  x  .  .  x  x  .  .  x  x  .
.  .  .  x  x  x  x  .  .  .  .  x
.  .  .  .  .  .  .  x  x  x  x  x
I don't understand why you have P5 in the scheme.
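The rows of such a scheme follow from the usual Hamming rule that the parity bit at position 2^k covers every position whose index has bit k set, itself included. A minimal Python sketch of that rule for the 12-bit layout (the function names are made up for the example):

```python
def layout(n_bits=12):
    """Names of positions 1..n_bits: powers of two are parity bits."""
    names, p, d = [], 1, 1
    for pos in range(1, n_bits + 1):
        if pos & (pos - 1) == 0:      # power of two -> parity position
            names.append(f"P{p}")
            p += 1
        else:
            names.append(f"D{d}")
            d += 1
    return names


def covered(parity_pos, n_bits=12):
    """Positions checked by the parity bit sitting at parity_pos."""
    return [pos for pos in range(1, n_bits + 1) if pos & parity_pos]


if __name__ == "__main__":
    names = layout()
    for parity_pos in (1, 2, 4, 8):
        group = [names[pos - 1] for pos in covered(parity_pos)]
        print(names[parity_pos - 1], "covers", " ".join(group))
```

Under this rule the count is inclusive: each parity bit belongs to its own group, and with 8 data bits four parity bits are enough for single-error correction.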

ARM-NEON: Conditional register swapping based on parameters

I am writing a subroutine in NEON for image processing which does color swapping, i.e., I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them.
There are at most 6 permutations:
(RGB) -> { (RGB),(RBG),(GRB),(GBR),(BRG),(BGR) }
The most efficient way would be to have a separate subroutine for each case with the corresponding VSWP instructions. As the subroutine will do several other things, I would prefer to keep everything in just one sub, even if it is not so efficient.
I have also read that conditional execution and branching are not advisable. So, if I want to have it in a block with branchless code, the only thing coming to my mind is
New_R = a(0)*R+a(1)*G+a(2)*B
New_G = a(3)*R+a(4)*G+a(5)*B
New_B = a(6)*R+a(7)*G+a(8)*B
where only one a(i) in each row and column will be =1 each time, and the rest will be =0
Question: Any smarter way to do it, having in mind that it has to be coded to NEON?

VTBL.8 is the most powerful tool in NEON to swap bytes.
Loading 3x8 bytes to registers d0,d1,d2 would look like
R G B R G B R G | B R G B R G B R | G B R G B R G B |
0 1 2 3 4 5 6 7 8 9 a b c d e f .... 17
VTBL d3, { d0,d1,d2 }, d6 ;; select bytes to d3 from d0,d1,d2 based on d6
VTBL d4, { d0,d1,d2 }, d7
VTBL d5, { d0,d1,d2 }, d8
where d6,d7,d8 encode the positions to read in the new bytes.
e.g. '0 1 2 3 4 5 6 7' for the original permutation and '0 2 1 3 5 4 6 8', '7 ...' to swap G and B. The constant vectors d6..d8 need to be loaded just once in the beginning of the routine.
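The index vectors can be generated mechanically for any of the six permutations. The following Python sketch only models the lookup (it is not NEON code); vtbl and index_vectors are hypothetical helper names, and the model assumes out-of-range indices yield 0, as VTBL does.

```python
def vtbl(table, indices):
    """Model of a table lookup: out[i] = table[indices[i]], 0 if out of range."""
    return [table[i] if i < len(table) else 0 for i in indices]


def index_vectors(perm):
    """Build the three 8-entry index vectors (the d6, d7, d8 of the example).

    perm names the output channel order in terms of the input, e.g. "RBG"
    swaps G and B. Works on 24 interleaved bytes (8 RGB pixels).
    """
    src = ["RGB".index(ch) for ch in perm]
    idx = [(i // 3) * 3 + src[i % 3] for i in range(24)]
    return idx[0:8], idx[8:16], idx[16:24]


if __name__ == "__main__":
    d6, d7, d8 = index_vectors("RBG")
    pixels = list("RGB" * 8)                 # 24 bytes as loaded into d0..d2
    out = vtbl(pixels, d6) + vtbl(pixels, d7) + vtbl(pixels, d8)
    print(d6, "".join(out))
```

For "RBG" the first vector comes out as 0 2 1 3 5 4 6 8 and the second starts with 7, matching the example above.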
Another possibility is to encode the following sequence with interleaved read;
VLD3.8 { d0,d1,d2 }, [r0] ; // Read R, G, B to separate registers
VLD3.8 { d3,d4,d5 }, [r0] ; // Make a second copy (or use some other instruction)
VBIT d3, d1, d6 ; // d3 is now either R or G
VBIT d4, d2, d7 ; // d4 is now either G or B
VBIT d5, d0, d8 ; // d5 is now either B or R
VBIT d0, d4, d9 ; // d0 is now R or (G or B)
VBIT d1, d5, d10 ; // d1 is now G or (B or R)
VBIT d2, d3, d11 ; // d2 is now B or (R or G)
Even though 6 registers for the condition codes are used in the example, 3 independent registers should be enough -- one can also use VBIF if reversed logic needs to be used.

Unification algorithm example in WAM (Warren's Abstract Machine)

Exercise 2.2 in Warren's Abstract Machine: A Tutorial Reconstruction
asks for representations for the terms f(X, g(X, a)) and f(b, Y) and then to perform unification on the address of these terms (denoted a1 and a2 respectively).
I've constructed the heap representations for the terms, and they are as follows:
f(X, g(X, a)):
0 STR 1
1 a/0
2 STR 3
3 g/2
4 REF 4
5 STR 1
6 STR 7
7 f/2
8 REF 4
9 STR 3
f(b, Y):
10 STR 11
11 b/0
12 STR 7
13 STR 11
14 REF 14
and I am now asked to trace unify(a1, a2), but following the algorithm on page 20 in [1] I get:
d1 = deref(a1) = deref(10) = 10
d2 = deref(a2) = deref(0) = 0
0 != 10 so we continue
<t1, v1> = STORE(d1) = STORE(10) = <STR, 11>
<t2, v2> = STORE(d2) = STORE(0) = <STR, 1>
t1 != REF and t2 != REF so we continue
f1 / n1 = STORE(v1) = STORE(11) = b / 0
f2 / n2 = STORE(v2) = STORE(1) = a / 0
and now b != a so the algorithm terminated with fail = true,
and thus unification failed, but obviously there exists
a solution with X = b and Y = g(b, a).
Where is my mistake?

I found the solution myself. Here are my corrections:
Each term should have their own definitions of the functors (ie. the f-functor in the second term should not just link to the first f-functor in the first term, but should have its own) and pointers to the terms (a1 and a2) should point to the outermost term functor.
This means that a1 = 6 and a2 = 12 in the following layout
f(X, g(X, a)):
0 STR 1
1 a/0
2 STR 3
3 g/2
4 REF 4
5 STR 1
6 STR 7
7 f/2
8 REF 4
9 STR 3
f(b, Y):
10 STR 11
11 b/0
12 STR 13
13 f/2
14 STR 11
15 REF 15
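The corrected layout can be traced with a small model of the heap and the unify loop. This is a simplified Python sketch, not the tutorial's code: cells are (tag, value) tuples, bind just points the unbound REF cell at the other cell, and the first argument cell of the second term (address 14) is assumed to be an STR cell pointing at b/0, since a REF cell must not point at a functor cell. All function names are made up for the example.

```python
def example_heap():
    """Heap for f(X, g(X, a)) at a1 = 6 and f(b, Y) at a2 = 12."""
    return [
        ("STR", 1), ("FUN", ("a", 0)),             # 0-1:   a
        ("STR", 3), ("FUN", ("g", 2)),             # 2-3:   g(...)
        ("REF", 4), ("STR", 1),                    # 4-5:   X, a
        ("STR", 7), ("FUN", ("f", 2)),             # 6-7:   f(...)
        ("REF", 4), ("STR", 3),                    # 8-9:   X, g(X, a)
        ("STR", 11), ("FUN", ("b", 0)),            # 10-11: b
        ("STR", 13), ("FUN", ("f", 2)),            # 12-13: f(...)
        ("STR", 11), ("REF", 15),                  # 14-15: b (assumed STR), Y
    ]


def deref(heap, a):
    """Follow REF cells until an unbound variable or a non-REF cell."""
    while heap[a][0] == "REF" and heap[a][1] != a:
        a = heap[a][1]
    return a


def bind(heap, a, b):
    """Point an unbound REF cell at the other cell (simplified bind)."""
    if heap[a][0] == "REF" and (heap[b][0] != "REF" or b < a):
        heap[a] = ("REF", b)
    else:
        heap[b] = ("REF", a)


def unify(heap, a1, a2):
    pdl = [a1, a2]                                 # push-down list of address pairs
    while pdl:
        d1, d2 = deref(heap, pdl.pop()), deref(heap, pdl.pop())
        if d1 == d2:
            continue
        (t1, v1), (t2, v2) = heap[d1], heap[d2]
        if t1 == "REF" or t2 == "REF":
            bind(heap, d1, d2)
        elif heap[v1] != heap[v2]:                 # functor name or arity differs
            return False
        else:
            for i in range(1, heap[v1][1][1] + 1):
                pdl += [v1 + i, v2 + i]
    return True


def show(heap, a):
    """Render the term stored at address a."""
    a = deref(heap, a)
    tag, val = heap[a]
    if tag == "REF":
        return f"_{a}"
    name, arity = heap[val][1]
    if arity == 0:
        return name
    return name + "(" + ", ".join(show(heap, val + i) for i in range(1, arity + 1)) + ")"
```

Calling unify(heap, 6, 12) on this heap succeeds and leaves X (cell 4) bound to b and Y (cell 15) bound to g(b, a), whereas unify(heap, 0, 10) compares a with b and fails, which is what the trace in the question actually computed.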

Is there a specialized algorithm, faster than quicksort, to reorder data ACEGBDFH?

I have some data coming from the hardware. Data comes in blocks of 32 bytes, and there are potentially millions of blocks. Data blocks are scattered in two halves the following way (a letter is one block):
A C E G I K M O B D F H J L N P
or if numbered
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
First all blocks with even indexes, then the odd blocks. Is there a specialized algorithm to reorder the data correctly (alphabetical order)?
The constraints are mainly on space. I don't want to allocate another buffer to reorder: just one more block. But I'd also like to keep the number of moves low: a simple quicksort would be O(NlogN). Is there a faster solution in O(N) for this special reordering case?

Since this data is always in the same order, sorting in the classical sense is not needed at all. You do not need any comparisons, since you already know in advance which of two given data points comes first.
Instead you can produce the permutation on the data directly. If you transform this into cyclic form, this will tell you exactly which swaps to do, to transform the permuted data into ordered data.
Here is an example for your data:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now calculate the inverse (I'll skip this step, because I am lazy here, assume instead the permutation I have given above actually is the inverse already).
Here is the cyclic form:
(0)(1 8 4 2)(3 9 12 6)(5 10)(7 11 13 14)(15)
So if you want to reorder a sequence structured like this, you would do
# first cycle
# nothing to do
# second cycle
swap 1 8
swap 8 4
swap 4 2
# third cycle
swap 3 9
swap 9 12
swap 12 6
# so on for the other cycles
If you would have done this for the inverse instead of the original permutation, you would get the correct sequence with a proven minimal number of swaps.
EDIT:
For more details on something like this, see the chapter on Permutations in TAOCP for example.

So you have data coming in in a pattern like
a0 a2 a4...a14 a1 a3 a5...a15
and you want to have it sorted to
b0 b1 b2...b15
With some reordering the permutation can be written like:
a0 -> b0
a8 -> b1
a1 -> b2
a2 -> b4
a4 -> b8
a9 -> b3
a3 -> b6
a6 -> b12
a12 -> b9
a10 -> b5
a5 -> b10
a11 -> b7
a7 -> b14
a14 -> b13
a13 -> b11
a15 -> b15
So if you want to sort it in place with only one block of additional space in a temporary t, this could be done in O(1) with
t = a8; a8 = a4; a4 = a2; a2 = a1; a1 = t
t = a9; a9 = a12; a12 = a6; a6 = a3; a3 = t
t = a10; a10 = a5; a5 = t
t = a11; a11 = a13; a13 = a14; a14 = a7; a7 = t
Edit: The general case (for N != 16), if it is solvable in O(N), is actually an interesting question. I suspect the cycles always start with a prime number which satisfies p < N/2 && N mod p != 0 and the indices have a recurrence like i_(n+1) = 2*i_n mod (N-1), but I am not able to prove it. If this is the case, deriving an O(N) algorithm is trivial.
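The hand-written assignments generalize to any even N by following each cycle once. The Python sketch below is one possible way to do it, under the assumption that the layout is always "all even-indexed blocks, then all odd-indexed blocks"; src and unshuffle are invented names. It uses a single temporary and moves each block at most once, but it finds cycle leaders by walking each cycle, so the index arithmetic is not strictly O(N).

```python
def src(j, n):
    """Index in the scrambled layout that currently holds block j."""
    return j // 2 if j % 2 == 0 else n // 2 + j // 2


def unshuffle(a):
    """Reorder a (evens first, then odds) in place; len(a) assumed even."""
    n = len(a)
    for start in range(n):
        # Process each cycle only from its smallest index.
        k = src(start, n)
        while k > start:
            k = src(k, n)
        if k < start:
            continue
        # Rotate the cycle: every slot j receives the block from src(j).
        t = a[start]                  # the one extra block of storage
        j = start
        while True:
            s = src(j, n)
            if s == start:
                break
            a[j] = a[s]
            j = s
        a[j] = t
    return a


if __name__ == "__main__":
    print("".join(unshuffle(list("ACEGIKMOBDFHJLNP"))))
```

If a bitmap of N bits is acceptable, marking visited slots instead of re-walking cycles would make the whole pass linear.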

Maybe I'm misunderstanding, but if the order is always identical to the one given then you can "pre-program" (i.e. avoiding all comparisons) the optimum solution (which is going to be the one that has the minimum number of swaps to move from the string given to ABCDEFGHIJKLMNOP and which, for something this small, you can work out by hand - see LiKao's answer).

It is easier for me to label your set with numbers:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
Start from the 14 and move all even numbers to place (8 swaps). You will get this:
0 1 2 9 4 6 13 8 3 10 7 12 11 14 15
Now you need another 3 swaps (9 with 3, 7 with 13, 11 with 13 moved from 7).
A total of 11 swaps. Not a general solution, but it could give you some hints.

You can also view the intended permutation as a shuffle of the address-bits `abcd <-> dabc' (with abcd the individual bits of the index) Like:
#include <stdio.h>
#define ROTATE(v,n,i) (((v)>>(i)) | (((v) & ((1u <<(i))-1)) << ((n)-(i))))
/******************************************************/
int main (int argc, char **argv)
{
    unsigned i,a,b;
    for (i=0; i < 16; i++) {
        a = ROTATE(i,4,1);
        b = ROTATE(a,4,3);
        fprintf(stdout,"i=%u a=%u b=%u\n", i, a, b);
    }
    return 0;
}
/******************************************************/

That was count sort I believe

maximum value of xor operation

I came up with this question.
There is an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, ... xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a, p and q such that p <= j <= q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1 <= N <= 100,000; 1 <= Q <= 50,000). Next line contains N integers x1, x2, ... xn separated by a single space (0 <= xj < 2^15). Each of next Q lines describe a query which consists of three integers ai, pi and qi (0 <= ai < 2^15, 1<= pi <= qi <= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi <= j <= qi in a single line.
Sample Input
1
15 8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10 6 10
1023 7 7
33 5 8
182 5 10
181 1 13
5 10 15
99 8 9
33 10 14
Sample Output
13
1016
41
191
191
15
107
47
Explanation
First Query (10 6 10): x6 xor 10 = 12,
x7 xor 10 = 13, x8 xor 10 = 2, x9 xor 10 = 3, x10 xor 10 = 0,
therefore answer for this query is 13.
Second Query (1023 7 7): x7 xor 1023 = 1016,
therefore answer for this query is 1016.
Third Query (33 5 8): x5 xor 33 = 36, x6 xor 33 = 39,
x7 xor 33 = 38, x8 xor 33 = 41, therefore answer for this query is 41.
Fourth Query (182 5 10): x5 xor 182 = 179,
x6 xor 182 = 176, x7 xor 182 = 177, x8 xor 182 = 190,
x9 xor 182 = 191, x10 xor 182 = 188,
therefore answer for this query is 191.
I tried this by first making the numbers length(in binary)
in the given range equal and then comparing 'a' bit by
bit with the particular xj values.But it is time exceeding.
Maximum time limit in java is 5sec.

I haven't gone through your code in detail, but you seem to have loops over the range of r = p - 1; r < q - 1; r++, and it would be nice not to have to do this.
Given ai, we want to find a value of xi in the given range with as many of its top bits the inverse of ai as possible. Everything is between 0 and 2^15, so there aren't many bits to worry about. For n = 1 to 15 you could divide the xi up according to its n highest bits, so dividing it into 2, 4, 8, 16.. 32768 portions. For each portion keep a list in sorted order of the positions where each possible value is found, so for the top bit you will have two lists, one giving the positions at which the bit pattern is 0.............. and one giving the position at which the bit pattern is 1............ For each triple, you can use binary chop on a particular portion to find if there are any positions within your range at which the top n bits have the bit pattern you are looking for. If they do, fine. If not you will have to accept that one of the xor positions is 0 and slightly modify the pattern you look for with one more top bit set.
The setup cost is 15 linear passes over the xi, which is probably less time than it takes you to read it in. For each line you could do 15 binary chops to see which values of xi match in the top n bits, and modify the pattern of top bits you look for if you can't match a particular bit.
I think your program would be clearer if you separated the I/O from the problem code by making the problem code a separate subroutine. This would also make it easier to compare one version of the problem code with another, to see which is faster and if they both get the same answer.
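The prefix-list idea can be sketched as follows. This is a Python illustration only (the question targets Java), with made-up names build and max_xor; it assumes all values are below 2^15 and positions are 1-based as in the problem statement.

```python
from bisect import bisect_left
from collections import defaultdict

BITS = 15  # all values are assumed to be below 2**15


def build(xs):
    """index[n][prefix] = sorted 1-based positions whose top n bits equal prefix."""
    index = [defaultdict(list) for _ in range(BITS + 1)]
    for pos, x in enumerate(xs, start=1):
        for n in range(1, BITS + 1):
            index[n][x >> (BITS - n)].append(pos)
    return index


def in_range(positions, p, q):
    """Binary chop: is there a stored position within [p, q]?"""
    i = bisect_left(positions, p)
    return i < len(positions) and positions[i] <= q


def max_xor(index, a, p, q):
    """Maximum of a ^ x[j] over p <= j <= q, chosen greedily from the top bit."""
    prefix = 0                                   # top bits of the chosen x so far
    for n in range(1, BITS + 1):
        a_bit = (a >> (BITS - n)) & 1
        want = (prefix << 1) | (a_bit ^ 1)       # prefer the inverse of a's bit
        if in_range(index[n].get(want, ()), p, q):
            prefix = want
        else:
            prefix = (prefix << 1) | a_bit       # settle for a 0 in the xor here
    return a ^ prefix


if __name__ == "__main__":
    idx = build(list(range(1, 16)))
    print(max_xor(idx, 10, 6, 10), max_xor(idx, 1023, 7, 7))
```

Each query costs 15 binary searches regardless of the width of [p, q], at the price of storing every position 15 times during setup.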

The biggest inefficiency that I can spot in the original algorithm is that N can be up to 100,000 but a and x are always below 2^15 = 32768. So I would write pseudocode something like this:
bool set[32768] = { false };
for (j = p; j <= q; j++) set[x[j]] = true;
for (k = 32767; !set[a ^ k]; k--);
return k;
This bounds the search loop at 32768 xor operations in the worst case.
