Is there a specialized algorithm, faster than quicksort, to reorder data ACEGBDFH? - performance

I have some data coming from the hardware. Data comes in blocks of 32 bytes, and there are potentially millions of blocks. Data blocks are scattered in two halves the following way (a letter is one block):
A C E G I K M O B D F H J L N P
or if numbered
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
First all blocks with even indexes, then the odd blocks. Is there a specialized algorithm to reorder the data correctly (alphabetical order)?
The constraints are mainly on space. I don't want to allocate another buffer to reorder: just one more block. But I'd also like to keep the number of moves low: a simple quicksort would be O(N log N). Is there a faster solution in O(N) for this special reordering case?

Since this data always arrives in the same order, sorting in the classical sense is not needed at all. You do not need any comparisons, since you already know in advance how any two given data points compare.
Instead you can apply the permutation to the data directly. If you transform it into cyclic form, it tells you exactly which swaps to do to turn the permuted data into ordered data.
Here is an example for your data:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Now calculate the inverse (I'll skip this step, because I am lazy here, assume instead the permutation I have given above actually is the inverse already).
Here is the cyclic form:
(0)(1 8 4 2)(3 9 12 6)(5 10)(7 11 13 14)(15)
So if you want to reorder a sequence structured like this, you would do
# first cycle
# nothing to do
# second cycle
swap 1 8
swap 8 4
swap 4 2
# third cycle
swap 3 9
swap 9 12
swap 12 6
# so on for the other cycles
If you had done this for the inverse instead of the original permutation, you would get the correct sequence with a provably minimal number of swaps.
EDIT:
For more details on something like this, see the chapter on Permutations in TAOCP for example.
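To make the bookkeeping concrete, here is a small C sketch (my own illustration, not code from the answer; the array and function names are made up) that walks the inverse permutation cycle by cycle and prints exactly the swap list sketched above:

#include <stdio.h>

/* inv[s] = position in the received buffer that currently holds block s,
   i.e. the inverse of the hardware's output order. */
static void print_swaps(const int *inv, int n)
{
    int visited[64] = {0};                /* big enough for this small demo */

    for (int start = 0; start < n; start++) {
        if (visited[start] || inv[start] == start)
            continue;                     /* fixed point or already handled */
        printf("# cycle starting at %d\n", start);
        int i = start;
        while (inv[i] != start) {         /* walk the cycle, emitting swaps */
            printf("swap %d %d\n", i, inv[i]);
            visited[i] = 1;
            i = inv[i];
        }
        visited[i] = 1;                   /* last member closes the cycle */
    }
}

int main(void)
{
    /* in the 16-block example, block s arrives at position inv[s] */
    int inv[16] = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    print_swaps(inv, 16);
    return 0;
}

For the example this prints "swap 1 8", "swap 8 4", "swap 4 2" for the second cycle, and so on, matching the listing above.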

So you have data coming in in a pattern like
a0 a2 a4...a14 a1 a3 a5...a15
and you want to have it sorted to
b0 b1 b2...b15
With some reordering, the permutation can be written like this (ai is the block at position i of the incoming buffer, bj the slot at position j of the sorted buffer; the entries are grouped by cycle):
a0 -> b0
a8 -> b1
a1 -> b2
a2 -> b4
a4 -> b8
a9 -> b3
a3 -> b6
a6 -> b12
a12 -> b9
a10 -> b5
a5 -> b10
a11 -> b7
a7 -> b14
a14 -> b13
a13 -> b11
a15 -> b15
So if you want to sort it in place with only one block of additional space in a temporary t, it comes down to this fixed sequence of moves:
t = a8; a8 = a4; a4 = a2; a2 = a1; a1 = t
t = a9; a9 = a12; a12 = a6; a6 = a3; a3 = t
t = a10; a10 = a5; a5 = t
t = a11; a11 = a13; a13 = a14; a14 = a7; a7 = t
Edit: The general case (for N != 16), if it is solvable in O(N), is actually an interesting question. I suspect the cycles always start with a prime number which satisfies p < N/2 && N mod p != 0, and that the indices follow a recurrence like i(n+1) = 2*i(n) mod N, but I am not able to prove it. If this is the case, deriving an O(N) algorithm is trivial.
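For what it's worth, here is a hedged C sketch of the cycle-following idea for a general even N, written directly from the known layout (first all even-numbered blocks, then the odd ones); the names are illustrative. It keeps one spare block plus a one-byte-per-block visited map for simplicity; if even that is too much memory, the map can be replaced by restarting each cycle only at its minimal index, at the cost of extra index walking.

#include <stdlib.h>
#include <string.h>

#define BLOCK 32    /* bytes per block, as in the question */

/* Position that currently holds the block belonging at position s:
   blocks 0,2,4,... sit in the first half, 1,3,5,... in the second half. */
static size_t source_of(size_t s, size_t n)
{
    return (s % 2 == 0) ? s / 2 : n / 2 + s / 2;
}

static void unshuffle(unsigned char *buf, size_t n)
{
    unsigned char *visited = calloc(n, 1);   /* bookkeeping only, not a data buffer */
    unsigned char spare[BLOCK];

    for (size_t start = 0; start < n; start++) {
        if (visited[start] || source_of(start, n) == start)
            continue;
        memcpy(spare, buf + start * BLOCK, BLOCK);   /* lift the first block out */
        size_t hole = start;
        for (;;) {
            size_t src = source_of(hole, n);
            visited[hole] = 1;
            if (src == start) {                      /* cycle closes here */
                memcpy(buf + hole * BLOCK, spare, BLOCK);
                break;
            }
            memcpy(buf + hole * BLOCK, buf + src * BLOCK, BLOCK);
            hole = src;
        }
    }
    free(visited);
}

Every position is written exactly once, plus one extra copy per cycle for the spare block, so the number of 32-byte moves is O(N).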

Maybe I'm misunderstanding, but if the order is always identical to the one given, then you can "pre-program" (i.e. avoid all comparisons) the optimum solution, which is the one with the minimum number of swaps needed to go from the given order to ABCDEFGHIJKLMNOP and which, for something this small, you can work out by hand - see LiKao's answer.

It is easier for me to label your set with numbers:
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
Start from the 14 and move all the even numbers into place (7 swaps; 0 is already in place). You will get this:
0 1 2 9 4 5 6 13 8 3 10 7 12 11 14 15
Now you need another 3 swaps (9 with 3, 7 with 13, 11 with the 13 moved from 7).
A total of 10 swaps. Not a general solution, but it could give you some hints.

You can also view the intended permutation as a shuffle of the address bits 'abcd <-> dabc' (with abcd the individual bits of the index), like:
#include <stdio.h>

#define ROTATE(v,n,i) (((v)>>(i)) | (((v) & ((1u <<(i))-1)) << ((n)-(i))))

/******************************************************/
int main (int argc, char **argv)
{
    unsigned i,a,b;

    for (i=0; i < 16; i++) {
        a = ROTATE(i,4,1);
        b = ROTATE(a,4,3);
        fprintf(stdout,"i=%u a=%u b=%u\n", i, a, b);
    }
    return 0;
}
/******************************************************/

That was counting sort, I believe.

Related

Converting to and from a number system that doesn't have a zero digit

Consider Microsoft Excel's column-numbering system. Columns are "numbered" A, B, C, ... , Y, Z, AA, AB, AC, ... where A is 1.
The column system is similar to the base-10 numbering system that we're familiar with in that when any digit has its maximum value and is incremented, its value is set to the lowest possible digit value and the digit to its left is incremented, or a new digit is added at the minimum value. The difference is that there isn't a digit that represents zero in the letter numbering system. So if the "digit alphabet" contained ABC or 123, we could count like this:
(base 3 with zeros added for comparison)
letters   base 3 no 0   base 3 with 0   base 10 with 0
-------   -----------   -------------   --------------
-         -             0               0
A         1             1               1
B         2             2               2
C         3             10              3
AA        11            11              4
AB        12            12              5
AC        13            20              6
BA        21            21              7
BB        22            22              8
BC        23            100             9
CA        31            101             10
CB        32            102             11
CC        33            110             12
AAA       111           111             13
Converting from the zeroless system to our base 10 system is fairly simple; it's still a matter of multiplying the power of that space by the value in that space and adding it to the total. So in the case of AAA with the alphabet ABC, it's equivalent to (1*3^2) + (1*3^1) + (1*3^0) = 9 + 3 + 1 = 13.
I'm having trouble converting inversely, though. With a zero-based system, you can use a greedy algorithm moving from largest to smallest digit and grabbing whatever fits. This will not work for a zeroless system, however. For example, converting the base-10 number 10 to the base-3 zeroless system: Though 9 (the third digit slot: 3^2) would fit into 10, this would leave no possible configuration of the final two digits since their minimum values are 1*3^1 = 3 and 1*3^0 = 1 respectively.
Realistically, my digit alphabet will contain A-Z, so I'm looking for a quick, generalized conversion method that can do this without trial and error or counting up from zero.
Edit
The accepted answer by n.m. is primarily a string-manipulation-based solution.
For a purely mathematical solution see kennytm's links:
What is the algorithm to convert an Excel Column Letter into its Number?
How to convert a column number (eg. 127) into an excel column (eg. AA)
Convert to base-3-with-zeroes first (digits 0AB), and from there, convert to base-3-without-zeroes (ABC), using these string substitutions:
A0 => 0C
B0 => AC
C0 => BC
Each substitution either removes a zero, or pushes one to the left. In the end, discard leading zeroes.
It is also possible, as an optimisation, to process longer strings of zeros at once:
A000...000 = 0BBB...BBC
B000...000 = ABBB...BBC
C000...000 = BBBB...BBC
Generalizable to any base.
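For the purely arithmetic route mentioned above, the standard trick for a zeroless (bijective) system is to subtract one before each division, so the remainders land in 0..k-1. A hedged C sketch (function names are mine, not from the linked answers):

#include <stdio.h>

/* Decimal -> zeroless base-k string over the digit alphabet 'A'..('A'+k-1). */
static void to_bijective(unsigned n, unsigned k, char *out)
{
    char tmp[64];
    int len = 0;
    while (n > 0) {
        n -= 1;                           /* shift so remainders run 0..k-1 */
        tmp[len++] = (char)('A' + n % k);
        n /= k;
    }
    for (int i = 0; i < len; i++)         /* digits were produced least-significant first */
        out[i] = tmp[len - 1 - i];
    out[len] = '\0';
}

/* Zeroless base-k string -> decimal, e.g. "AAA" with k = 3 gives 13. */
static unsigned from_bijective(const char *s, unsigned k)
{
    unsigned n = 0;
    for (; *s; s++)
        n = n * k + (unsigned)(*s - 'A' + 1);
    return n;
}

int main(void)
{
    char buf[64];
    for (unsigned n = 1; n <= 13; n++) {
        to_bijective(n, 3, buf);
        printf("%2u -> %-4s -> %u\n", n, buf, from_bijective(buf, 3));
    }
    return 0;
}

With k = 26 and the alphabet A-Z, the same two functions do the Excel column conversion.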

Parity bit checks using General Hamming Algorithm

In a logic circuit, I have an 8-bit data vector that is fed into an ECC IC for which I am supposed to develop the logic, and which contains a vector of 5 parity bits. My first step in developing the logic (with logic gates, XOR) is to figure out which parity bit is going to check which data bits (since they are interleaved). I am using even parity, and following the general Hamming code rule (a parity bit at every position 2^n), I get the following output sequence:
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5
Following the General Hamming Algorithm:
For each parity bit at position 1, 2, 4, 8, 16 and so on (powers of 2), we skip the first (n-1) positions and check 1 bit, then skip another one, then check another one, etc. We repeat the same process for the other parity bits, but this time checking/skipping every 2^n, where n is the position they occupy in the output array (P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 P5).
Following that convention, I get:
P1 Checks data bits -> XOR(3 5 7 9 10 12)
P2 Checks data bits -> XOR(3 6 7 10 11)
P3 Checks data bits -> XOR(5 6 10 11 12)
P4 Checks data bits -> XOR(9 10 11)
Am I right? The thing that confuses me is whether I should start counting the parity bit itself as one of the 2^n bits that are supposed to be checked, or start 1 bit after that specific parity bit. It pretty much comes down to whether it is inclusive or not.
Thank you for your help in advance!
Cheers!
You can follow this scheme. The bits marked in each row must sum to 0 (mod 2); in other words, for the marked positions in each row the number of set bits must be even.
P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8
x     x     x     x     x     x
   x  x        x  x        x  x
         x  x  x  x              x
                     x  x  x  x  x
I don't understand why you have P5 in the scheme.
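If it helps to see the even-parity groups spelled out, here is a small C sketch assuming the 12-bit layout from the table above (no extra P5); how the input byte's bits map onto D1..D8 is an arbitrary assumption here:

#include <stdio.h>

int main(void)
{
    unsigned data = 0xA7;             /* example byte; D1 taken from the LSB (assumption) */
    unsigned code[13] = {0};          /* code[1..12] = P1 P2 D1 P3 D2 D3 D4 P4 D5 D6 D7 D8 */
    int d = 0;

    /* scatter the data bits into the non-power-of-two positions */
    for (int pos = 1; pos <= 12; pos++)
        if ((pos & (pos - 1)) != 0)   /* not 1, 2, 4 or 8 -> data position */
            code[pos] = (data >> d++) & 1u;

    /* each parity bit covers exactly the positions that have its bit set */
    for (int p = 1; p <= 8; p <<= 1) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && pos != p)
                parity ^= code[pos];
        code[p] = parity;             /* even parity over the whole group */
        printf("parity at position %d = %u\n", p, parity);
    }
    return 0;
}

Each row of the scheme above corresponds to one value of p (1, 2, 4, 8).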

tinyAVR: How can one multiply by 203, 171, or 173 real fast?

Focussing on worst case cycle count, I've coded integer multiplication routines for Atmel's AVR architecture.
In one particular implementation, I'm stuck with 2+1 worst cases, for each of which I seek a faster implementation. These multiply multiplicands with an even number of bytes with known values of an 8-bit part of the multiplier:
* 11001011 (203 decimal)
* 10101011 (171 decimal)
* 10101101 (173 decimal)
GCC (4.8.1) computes these as *29*7, *19*9, and *(43*4+1) - a nice fit for a 3-address machine, which the tinyAVR isn't (quite: most have register pair move twice as fast as add). For a two byte multiplicand & product, this uses 9+2, 10+2, and 11+2 additions(&subtractions) and moves, respectively, for 20, 22, and 24 cycles. Radix-4 Booth would use 11+1 additions (under not exactly comparable conditions) and 23 cycles.
For reasons beyond this question, I have 16*multiplicand precomputed (a5:a4, 7 cycles including move); both original and shifted multiplicand might be used later on (but for the MSByte). And the product is initialised to the multiplicand for the following assembler code snippets (in which I use a Booth-style recoding notation: . for NOP, +, and -. owing is a label "one instruction before done", performing a one-cycle fix-up):
locb:; .-..+.++ ..--.-.- ++..++.- ++.+.-.- ..--.-.- 29*7
; WC?!? 11001011 s 18
add p0, p0;15 n 16 a4 15 s 16 n 15 s0 17
adc p1, p1
sub p0, a4;13 d 13 z 13 s0 15 s4 12 d 15
sbc p1, a5
add p0, p0;11 s4 11 d 12 z 13 z 10 s0 13
adc p1, p1
add p0, a0; 9 d 9 aZ 10 a4 12 s0 9 z 11
adc p1, a1
add p0, p0; 7 s4 7 d 8 a4 10 d 7 d 10
adc p1, p1
add p0, a0; 5 s0 5 d 6 d 8 az 5 aZ 8
adc p1, a1
rjmp owing; 3 owi 3 s0 4 d 6 owi 3 d 6
; aZ 4 aZ 4
(The comments are, from left to right, a cycle count ("backwards from reaching done"), and further code sequences using the recodings in the same column in "the label line", using a shorthand of n for negate, d for double (partial) product, a0/a4/s0/s4 for adding or subtracting the multiplicand shifted 0 or 4 bits to the left, z for storing the product in ZL:ZH, and aZ/sZ for using that.)
The other worst cases using macros (and the above-mentioned shorthand):
loab:; .-.-.-.- +.++.-.- +.+.+.++ .-.-+.++ +.+.++.-
; WC 10101011
negP ;15 set4 ;16 add4 ;15 d 16 a4 16
sub4 ;12 doubleP ;15 pp2Z ;13 s4 14 d 14
pp2Z ;10 subA ;13 doubleP ;12 z 12 a0 12
doubleP ; 9 pp2Z ;11 doubleP ;10 d 11 d 10
doubleP ; 7 doubleP ;10 addZ ; 8 d 9 a4 8
addZ ; 5 doubleP ; 8 doubleP ; 6 aZ 7 d 6
owing ; 3 addZ ; 6 addA ; 4 a0 5 s0 4
; add4 ; 4
loac:; +.+.++.. ++-.++.. .-.-.-..
load: ; .-.-..-- .-.-.-.+ .--.++.+ 0.9 1.8 0.8 (avg)
; WC 10101101
negP ;15 -1 negP ;16 a4 a0 a0 17
sub4 ;12 -16-1 sub4 ;13 d s4 a0
pp2Z ;10 -16-1 doubleP ;11 a0 Z s4
sub4 ; 9 -32-1 doubleP ; 9 d d d
doubleP ; 7 -64-2 sub4 ; 7 a4 aZ d
addZ ; 5 -80-3 addA ; 5 d d a0
owing ; 3 owing ; 3 a0 a0 s4
(I'm not looking for more than one of the results at any single time/as the result of any single invocation - but if you have a way to get two in less than 23 cycles or all three in less than 26, please let me know. To substantiate my claim to know about CSE, (re)using the notations [rl]sh16 and add16 introduced by vlad_tepesch:
movw C, x ; 1
lsh16(C) ; 3 C=2x
movw A, C ; 4
swap Al ; 5
swap Ah ; 6
cbr Ah, 15 ; 7
add Ah, Al ; 8
cbr Al, 15 ; 9
sub Ah, Al ;10 A=32x
movw X, A ;11
add16(X, C) ;13 X=34x
movw B, X ;14
lsh16(X) ;16
lsh16(X) ;18 X=136X
add16(B, X) ;20 B=170X
add16(B, x) ;22 B=171X
add16(A, B) ;24 A=203X
add16(C, B) ;26 C=173X
- notice that the 22 cycles to the first result are just the same old 20 cycles, plus two register pair moves. The sequence of actions beside those is that of the third column/alternative following the label loab above.)
While 20 cycles (15 - 2 (rjmp) + 7 (*16)) doesn't look that bad, these are the worst cases. On an AVR CPU without a mul instruction,
How can one multiply real fast by each of 203, 171, and 173?
(Moving one case just before done or owing, having the other two of these faster would shorten the critical path/improve worst case cycle count.)
I am not very familiar with AVR assembly, but I know the AVRs quite well, so I'll give it a try.
If you need these products at the same place, you could use common intermediate results and try adding power-of-two multiples of x:
(a) 203: +128 +64 +8 +2 +1 = +128 +32 +8 +2 +1 +32
(b) 171: +128 +32 +8 +2 +1
(c) 173: +128 +32 +8 +4 +1 = +128 +32 +8 +2 +1 +2
The key is that the 16-bit right shift and the 16-bit addition have to be efficient.
I do not know if I have overlooked something, but:
rsh16 (X):
    LSR Xh
    ROR Xl
and
add16 (Y,X):
    ADD Yl, Xl
    ADC Yh, Xh
Both take 2 cycles.
One register pair holds the current x*2^n value (Xh, Xl), and 3 other pairs (Ah, Al, Bh, Bl, Ch, Cl) hold the results.
1. Xh <- x; Xl <- 0 (X = 256*x)
2. rsh16(X) (X = 128*x)
3. B = X (B = 128*x)
4. rsh16(X); rsh16(X) (X = 32*x)
5. add16(B, X) (B = 128*x + 32*x)
6. A = X (A = 32*X)
7. rsh16(X); rsh16(X) (X = 8*x)
8. add16(B, X) (B = 128*x + 32*x+ 8*x)
9. rsh16(X); rsh16(X) (X = 2*x)
10. add16(B, X) (B = 128*x + 32*x + 8*x + 2*x)
11. C = X (C = 2*X)
12. CLR Xh (clear Xh so we only add the carry below)
    add Bl, x
    adc Bh, Xh (B = 128*x + 32*x + 8*x + 2*x + x)
13. add16(A, B) (A = 32*X + B)
14. add16(C, B) (C = 2*X + B)
If I am correct, this sums up to 32 cycles for all three multiplications and requires 9 registers (1 in, 6 out, 2 temporary).
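The shared chain is easy to sanity-check in C before worrying about registers and cycle counts; this is only a model of the arithmetic (names are mine), not AVR code:

#include <stdint.h>
#include <stdio.h>

/* 171 = 128+32+8+2+1, 203 = 171+32, 173 = 171+2, as in the decomposition above. */
static void mul_all(uint8_t x, uint16_t *p203, uint16_t *p171, uint16_t *p173)
{
    uint16_t x128 = (uint16_t)(x << 7);
    uint16_t x32  = (uint16_t)(x << 5);
    uint16_t x8   = (uint16_t)(x << 3);
    uint16_t x2   = (uint16_t)(x << 1);

    uint16_t b = x128 + x32 + x8 + x2 + x;   /* 171*x */
    *p171 = b;
    *p203 = b + x32;                         /* 203*x = 171*x + 32*x */
    *p173 = b + x2;                          /* 173*x = 171*x +  2*x */
}

int main(void)
{
    uint16_t a, b, c;
    for (unsigned x = 0; x < 256; x++) {
        mul_all((uint8_t)x, &a, &b, &c);
        if (a != 203 * x || b != 171 * x || c != 173 * x)
            printf("mismatch at x=%u\n", x);
    }
    return 0;
}

On the tinyAVR the interesting part is, of course, scheduling these shifts and adds in registers, which is what the step list above does.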
What did (sort of) work for me:
Triplicate the owing&done-code: no jump for each of the worst cases.
(Making all three faster than tens of "runners up" - meh.)

Divide and Conquer Matrix Multiplication

I am having trouble getting divide and conquer matrix multiplication to work. From what I understand, you split the matrices of size n x n into quadrants (each quadrant is n/2 x n/2) and then you do:
C11 = A11⋅ B11 + A12 ⋅ B21
C12 = A11⋅ B12 + A12 ⋅ B22
C21 = A21 ⋅ B11 + A22 ⋅ B21
C22 = A21 ⋅ B12 + A22 ⋅ B22
My output for divide and conquer is really large and I'm having trouble figuring out the problem as I am not very good with recursion.
example output:
Original Matrix A:
4 0 4 3
5 4 0 4
4 0 4 0
4 1 1 1
A x A
Classical:
44 3 35 15
56 20 24 35
32 0 32 12
29 5 21 17
Divide and Conquer:
992 24 632 408
1600 272 720 1232
512 0 512 384
460 17 405 497
Could someone tell me what I am doing wrong for divide and conquer? All my matrices are int[][] and classical method is the traditional 3 for loop matrix multiplication
You are recursively calling divideAndConquer in the wrong way. What your function does is square a matrix. In order for divide and conquer matrix multiplication to work, it needs to be able to multiply two potentially different matrices together.
It should look something like this:
private static int[][] divideAndConquer(int[][] matrixA, int[][] matrixB){
    if (matrixA.length == 2){
        //calculate and return base case
    }
    else {
        //make a11, b11, a12, b12 etc. by dividing a and b into quarters
        int[][] c11 = addMatrix(divideAndConquer(a11,b11),divideAndConquer(a12,b21));
        int[][] c12 = addMatrix(divideAndConquer(a11,b12),divideAndConquer(a12,b22));
        int[][] c21 = addMatrix(divideAndConquer(a21,b11),divideAndConquer(a22,b21));
        int[][] c22 = addMatrix(divideAndConquer(a21,b12),divideAndConquer(a22,b22));
        //combine result quarters into one result matrix and return
    }
}
Some debugging approaches to try:
Try some very simple test matrices as input (e.g. all zeros, with a one or a few strategic ones). You may see a pattern in the "failures" that will show you where your error(s) are.
Make sure your "classical" approach is giving you correct answers. For small matrices, you can use Wolfram Alpha on-line to test answers: http://www.wolframalpha.com/examples/Matrices.html
To debug recursion: add printf() statements at the entry and exit of your function, including the invocation arguments. Run your test matrix, write the output to a log file, and open the log file with a text editor. Step through each case, writing your notes alongside in the editor making sure it's working correctly at each step. Add more printf() statements and run again if needed.
Good luck with the homework!
Could someone tell me what I am doing wrong for divide and conquer?
Yes:
int[][] a = divideAndConquer(topLeft);
int[][] b = divideAndConquer(topRight);
int[][] c = divideAndConquer(bottomLeft);
int[][] d = divideAndConquer(bottomRight);
int[][] c11 = addMatrix(classical(a,a),classical(b,c));
int[][] c12 = addMatrix(classical(a,b),classical(b,d));
int[][] c21 = addMatrix(classical(c,a),classical(d,c));
int[][] c22 = addMatrix(classical(c,b),classical(d,d));
You are going through an extra multiplication step here: you shouldn't be calling both divideAndConquer() and classical().
What you are effectively doing is:
C11 = (A11^2)⋅(B11^2) + (A12^2)⋅(B21^2)
C12 = (A11^2)⋅(B12^2) + (A12^2)⋅(B22^2)
C21 = (A21^2)⋅(B11^2) + (A22^2)⋅(B21^2)
C22 = (A21^2)⋅(B12^2) + (A22^2)⋅(B22^2)
which is not correct.
First, remove the divideAndConquer() calls, and replace a/b/c/d by topLeft/topRight/etc.
See if it gives you the proper results.
Your divideAndConquer() method needs a pair of input parameters, so you can use A*B. Once you get that working, get rid of the calls to classical(), and use divideAndConquer() instead. (or save them for matrices that are not a multiple of 2 in length.)
You might find the Wiki article on Strassen's algorithm helpful.
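For reference, here is a self-contained C sketch of the two-argument recursion (my own illustration, not the asker's Java; it passes offsets into flat arrays instead of copying quadrants, and it accumulates the two products per quadrant directly into C, which is an equivalent way of doing the addMatrix step):

#include <stdio.h>

#define N 4   /* matrix size; a power of two for this sketch */

/* C[rc.., cc..] += A[ra.., ca..] * B[rb.., cb..], each an n x n block
   of a row-major N x N array. */
static void dc_mul(const int *A, int ra, int ca,
                   const int *B, int rb, int cb,
                   int *C, int rc, int cc, int n)
{
    if (n == 1) {
        C[rc * N + cc] += A[ra * N + ca] * B[rb * N + cb];
        return;
    }
    int h = n / 2;
    /* C11 += A11*B11 + A12*B21 */
    dc_mul(A, ra, ca,     B, rb, cb,     C, rc, cc, h);
    dc_mul(A, ra, ca + h, B, rb + h, cb, C, rc, cc, h);
    /* C12 += A11*B12 + A12*B22 */
    dc_mul(A, ra, ca,     B, rb, cb + h,     C, rc, cc + h, h);
    dc_mul(A, ra, ca + h, B, rb + h, cb + h, C, rc, cc + h, h);
    /* C21 += A21*B11 + A22*B21 */
    dc_mul(A, ra + h, ca,     B, rb, cb,     C, rc + h, cc, h);
    dc_mul(A, ra + h, ca + h, B, rb + h, cb, C, rc + h, cc, h);
    /* C22 += A21*B12 + A22*B22 */
    dc_mul(A, ra + h, ca,     B, rb, cb + h,     C, rc + h, cc + h, h);
    dc_mul(A, ra + h, ca + h, B, rb + h, cb + h, C, rc + h, cc + h, h);
}

int main(void)
{
    int A[N * N] = { 4, 0, 4, 3,
                     5, 4, 0, 4,
                     4, 0, 4, 0,
                     4, 1, 1, 1 };
    int C[N * N] = { 0 };

    dc_mul(A, 0, 0, A, 0, 0, C, 0, 0, N);   /* C = A * A */

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%4d", C[i * N + j]);
        printf("\n");
    }
    return 0;
}

Run on the question's matrix, it reproduces the "Classical" output (44 3 35 15 / 56 20 24 35 / ...).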

Compress two or more numbers into one byte

I think this is not really possible but worth asking anyway. Say I have two small numbers (Each ranges from 0 to 11). Is there a way that I can compress them into one byte and get them back later. How about with four numbers of similar sizes.
What I need is something like: a1 + a2 = x. I only know x and from that get a1, a2
For the second part: a1 + a2 + a3 + a4 = x. I only know x and from that get a1, a2, a3, a4
Note: I know you cannot unadd, just illustrating my question.
x must be one byte. a1, a2, a3, a4 range [0, 11].
That's trivial with bit masks. The idea is to divide the byte into smaller units and dedicate them to different elements.
For 2 numbers, it can be like this: first 4 bits are number1, rest are number2. You would use number1 = (x & 0b11110000) >> 4, number2 = (x & 0b00001111) to retrieve values, and x = (number1 << 4) | number2 to compress them.
For two numbers, sure. Each one has 12 possible values, so the pair has a total of 12^2 = 144 possible values, and that's less than the 256 possible values of a byte. So you could do e.g.
x = 12*a1 + a2
a1 = x / 12
a2 = x % 12
(If you only have signed bytes, e.g. in Java, it's a little trickier)
For four numbers from 0 to 11, there are 12^4 = 20736 values, so you couldn't fit them in one byte, but you could do it with two.
x = 12^3*a1 + 12^2*a2 + 12*a3 + a4
a1 = x / 12^3
a2 = (x / 12^2) % 12
a3 = (x / 12) % 12
a4 = x % 12
EDIT: the other answers talk about storing one number per four bits and using bit-shifting. That's faster.
The 0-11 example is pretty easy -- you can store each number in four bits, so putting them into a single byte is just a matter of shifting one 4 bits to the left, and oring the two together.
Four numbers of similar sizes won't fit -- four bits apiece times four gives a minimum of 16 bits to hold them.
Let's say it in general: suppose you want to mix N numbers a1, a2, ... aN, a1 ranging from 0..k1-1, a2 from 0..k2-1, ... and aN from 0 .. kN-1.
Then, the encoded number is:
encoded = a1 + k1*a2 + k1*k2*a3 + ... k1*k2*..*k(N-1)*aN
The decoding is then more tricky, stepwise:
rest = encoded
a1 = rest mod k1
rest = rest div k1
a2 = rest mod k2
rest = rest div k2
...
a(N-1) = rest mod k(N-1)
rest = rest div k(N-1)
aN = rest # rest is already < kN
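A compact C sketch of that encode/decode, with the radices kept in an array (the names are illustrative):

#include <stdio.h>

/* encoded = a[0] + k[0]*a[1] + k[0]*k[1]*a[2] + ... , as in the formula above */
static unsigned pack(const unsigned *a, const unsigned *k, int n)
{
    unsigned enc = 0;
    for (int i = n - 1; i >= 0; i--)      /* Horner's scheme from the last digit down */
        enc = enc * k[i] + a[i];
    return enc;
}

static void unpack(unsigned enc, const unsigned *k, unsigned *a, int n)
{
    for (int i = 0; i < n; i++) {         /* the stepwise mod/div from above */
        a[i] = enc % k[i];
        enc /= k[i];
    }
}

int main(void)
{
    unsigned k[4] = {12, 12, 12, 12};     /* four values in 0..11 */
    unsigned a[4] = {7, 11, 0, 5}, out[4];
    unsigned enc = pack(a, k, 4);         /* 12^4 = 20736, so this needs 15 bits */
    unpack(enc, k, out, 4);
    printf("encoded=%u -> %u %u %u %u\n", enc, out[0], out[1], out[2], out[3]);
    return 0;
}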
If the numbers 0-11 aren't evenly distributed you can do even better by using shorter bit sequences for common values and longer ones for rarer values. It costs at least one bit to code which length you are using so there is a whole branch of CS devoted to proving when it's worth doing.
So a byte can hold up to 256 values, or FF in hex. So you can encode two numbers from 0-15 in a byte.
byte a1 = 0xf;
byte a2 = 0x9;
byte compress = a1 << 4 | (0x0F & a2); // should yield 0xf9 in one byte.
You can do 4 numbers only if you reduce them to the 0-3 range (2 bits each).
Since a single byte is 8 bits, you can easily subdivide it, with smaller ranges of values. The extreme limit of this is when you have 8 single bit integers, which is called a bit field.
If you want to store two 4-bit integers (which gives you 0-15 for each), you simply have to do this:
value = a * 16 + b;
As long as you do proper bounds checking, you will never lose any information here.
To get the two values back, you just have to do this:
a = floor(value / 16)
b = value MOD 16
MOD is modulus, it's the "remainder" of a division.
If you want to store four 2-bit integers (0-3), you can do this:
value = a * 64 + b * 16 + c * 4 + d
And, to get them back:
a = floor(value / 64)
b = floor(value / 16) MOD 4
c = floor(value / 4) MOD 4
d = value MOD 4
I leave the last division as an exercise for the reader ;)
@Mike Caron
your last example (4 integers between 0-3) is much faster with bit-shifting. No need for floor().
value = (a << 6) | (b << 4) | (c << 2) | d;
a = (value >> 6);
b = (value >> 4) % 4;
c = (value >> 2) % 4;
d = (value) % 4;
Use bit masking or bit shifting. The latter is faster.
Test out binary trees for some fun. (It will be handy later on in dev life regarding data and all sorts of dev voodoo, lol.)
Packing four values into one number will require at least 15 bits. This doesn't fit in a single byte, but in two.
What you need to do is a conversion from base 12 to base 65536 and conversely.
B = A1 + 12.(A2 + 12.(A3 + 12.A4))
A1 = B % 12
A2 = (B / 12) % 12
A3 = (B / 144) % 12
A4 = B / 1728
As this takes 2 bytes anyway, conversion from base 12 to (packed) base 16 is by far preferable.
B1 = A1 + 16.A2
B2 = A3 + 16.A4
A1 = B1 % 16
A2 = B1 / 16
A3 = B2 % 16
A4 = B2 / 16
The modulos and divisions are implemented by masking and shifting.
0-9 works much easier. You can easily store 11 random-order decimals in 4 1/2 bytes, which is tighter compression than log(256)÷log(10). Just by creative mapping. Remember, not all compression has to do with dictionaries, redundancies, or sequences.
If you are talking of random numbers 0 - 9 you can have 4 digits per 14 bits not 15.
