General algorithm for bit permutations

Is there a general strategy for creating an efficient bit permutation algorithm? The goal is a fast, branch-less and, if possible, LUT-less algorithm. I'll give an example:
A 13-bit code is to be transformed into another 13-bit code according to the following rule table:
BIT   INPUT (DEC)   INPUT (BIN)     OUTPUT (BIN)    OUTPUT (DEC)
 0          1       0000000000001   0000100000000    256
 1          2       0000000000010   0010000000000   1024
 2          4       0000000000100   0100000000000   2048
 3          8       0000000001000   1000000000000   4096
 4         16       0000000010000   0000001000000     64
 5         32       0000000100000   0000000100000     32
 6         64       0000001000000   0001000000000    512
 7        128       0000010000000   0000000010000     16
 8        256       0000100000000   0000000001000      8
 9        512       0001000000000   0000000000010      2
10       1024       0010000000000   0000000000001      1
11       2048       0100000000000   0000000000100      4
12       4096       1000000000000   0000010000000    128
Example: If the input code is 1+2+4096 = 4099, the resulting output would be 256+1024+128 = 1408.
A naive approach would be:
OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...
That means 3 instructions per bit (AND, SHIFT, OR), i.e. 3*13 - 1 = 38 instructions for the above example (the last OR is omitted). Instead we could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
When inspecting the example table you will of course notice a few obvious optimization opportunities, for example the rows for bits 1, 2 and 3, which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.
Are there general strategies? I think using Karnaugh-Veitch maps to simplify the expression could be one approach, but that is pretty difficult with 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.

For bit permutations, several strategies are known that work in certain cases. There's a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. However, in this case it basically finds only the strategy you suggested, indicating that it is hard to find any pattern to exploit in this permutation.

If one big lookup table is too much, you can try to use two smaller ones:
1. Take the 7 lower bits of the input, look up a 16-bit value in table A.
2. Take the 6 higher bits of the input, look up a 16-bit value in table B.
3. OR the values from 1. and 2. to produce the result.
Table A needs 128*2 bytes and table B needs 64*2 bytes; that's 384 bytes for the lookup tables.
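For illustration, here is a minimal C sketch of that idea (my own, not from the answer; the dst mapping is read off the question's table, and the helper names are made up):
#include <stdint.h>

/* Destination position of each input bit, read off the question's table. */
static const uint8_t dst[13] = {8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7};

static uint16_t tableA[128]; /* covers input bits 0-6  */
static uint16_t tableB[64];  /* covers input bits 7-12 */

static void init_tables(void) {
    for (int i = 0; i < 128; i++)
        for (int b = 0; b < 7; b++)
            if (i & (1 << b))
                tableA[i] |= (uint16_t)1 << dst[b];
    for (int i = 0; i < 64; i++)
        for (int b = 0; b < 6; b++)
            if (i & (1 << b))
                tableB[i] |= (uint16_t)1 << dst[b + 7];
}

static uint16_t permute(uint16_t x) {
    /* One lookup per half, then OR the disjoint results together. */
    return tableA[x & 0x7f] | tableB[(x >> 7) & 0x3f];
}

int main(void) {
    init_tables();
    return permute(4099) == 1408 ? 0 : 1; /* the question's example */
}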

This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.
Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy to find the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide the bit ranges around it.
The three look-up tables have single-byte entries because the range of the transformations for each range is just one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.
Here's the code:
unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2) \
    { 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0), \
      (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0) }
    static const unsigned char t4[]  = LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[]  = LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return ( (x&1)         << 8)   // Bit 0
         + ( (x&14)        << 9)   // Bits 1-3, simple shift
         + (t4[(x>>4)&7]   << 3)   // Bits 4-6, see below
         + (t7[(x>>7)&7]       )   // Bits 7-9, three-bit lookup for LOB
         + (t10[(x>>10)&7]     );  // Bits 10-12, ditto
}
Note on bits 4-6
Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on the x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)
The intermediate results are added instead of using boolean OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR have the same result; using add can take advantage of chip features like LEA.
Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:
permute_bits(unsigned short):
        mov     edx, edi
        mov     ecx, edi
        movzx   eax, di
        shr     di, 4
        shr     dx, 7
        shr     cx, 10
        and     edi, 7
        and     edx, 7
        and     ecx, 7
        movzx   ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
        movzx   edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
        add     edx, ecx
        mov     ecx, eax
        sal     eax, 9
        sal     ecx, 8
        and     ax, 7168
        and     cx, 256
        or      eax, ecx
        add     edx, eax
        movzx   eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
        lea     eax, [rdx+rax*8]
        ret
I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.
I don't know if this is any quicker than trincot's solution (posted in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. It is quite a bit shorter, though, possibly enough to compensate for the 24 bytes of lookup data.

Related

Checking if an address is cache-line aligned

This is a quiz question which I failed in the past, and despite having access to the solution, I don't understand the steps needed to come to the correct answer.
Here is the problem:
Which of these addresses is cache-line aligned?
a. 0x7ffc32a21164
b. 0x560c40e05350
c. 0x560c40e052c0
d. 0x560c3f2d71ff
And the solution to the problem:
Each hex char is represented by 4 bits.
It takes 6 bits to represent 64 addresses, since log2(64) = ln(64)/ln(2) = 6.
So the lowest hex digit must be 0 (4 bits), and the two lowest bits of the next digit must also be 0, which is only the case for these four digits:
0x0 = 0000
0x4 = 0100
0x8 = 1000
0xC = 1100
(the four bit positions have weights 2^3 2^2 2^1 2^0, i.e. 8 4 2 1)
Conclusion: if the address ends in either 00, 40, 80 or c0, then it is aligned on 64 bytes.
The answer is c.
I really don't see how we go from the 6-bit representation to this answer. Can anyone add something to the given solution to make it clearer?
The question boils down to: Which number is a multiple of 64? All that remains is understanding the number system they're using.
In binary, 64 is written as 1000000. In hexadecimal, it's written as 0x40. So multiples of 64 will end in 0x00 (0 * 64), 0x40 (1 * 64), 0x80 (2 * 64), or 0xC0 (3 * 64). (The cycle then repeats.) Answer c is the one with the right ending.
An analogy in decimal would be: Which number is a multiple of 5? 0 * 5 is 0 and 1 * 5 is 5, after which the cycle repeats. So we just need to look at the last digit. If it's a 0 or a 5, we know the number is a multiple of 5.
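A quick way to verify this in code (a sketch of my own, not part of the original answer; it assumes 64-byte cache lines): an address is a multiple of 64 exactly when its six lowest bits are zero.
#include <stdint.h>
#include <stdio.h>

/* 64-byte alignment check: the low 6 bits (0x3F) must all be zero. */
static int is_cacheline_aligned(uintptr_t addr) {
    return (addr & 0x3F) == 0;
}

int main(void) {
    uintptr_t addrs[] = {0x7ffc32a21164, 0x560c40e05350,
                         0x560c40e052c0, 0x560c3f2d71ff};
    for (int i = 0; i < 4; i++)
        printf("%#llx -> %s\n", (unsigned long long)addrs[i],
               is_cacheline_aligned(addrs[i]) ? "aligned" : "not aligned");
    return 0;
}
Only option c (ending in c0) prints "aligned".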

Direct mapped cache example

I am really confused about the topic of direct-mapped caches. I've been looking around for an example with a good explanation, and it's making me more confused than ever.
For example, I have:
2048-byte memory
64-byte cache
8-byte cache lines
With a direct-mapped cache, how do I determine the 'LINE', 'TAG' and 'BYTE OFFSET'?
I believe that the total number of address bits is 11, because 2048 = 2^11.
2048/64 = 2^5 = 32 blocks (0 to 31), so 5 bits are needed (tag).
64/8 = 8 = 2^3, so 3 bits for the index.
8-byte cache lines = 2^3, which means I need 3 bits for the byte offset.
So the address would look like this: 5 bits for the tag, 3 for the index and 3 for the byte offset.
Do I have this figured out correctly?
Did you figure it out correctly? YES.
Explanation
1) Main memory size is 2048 bytes = 2^11. So you need 11 bits to address a byte (if your word size is 1 byte). [word = smallest individual unit that will be accessed with the address]
2) You can calculate the tag bits in direct mapping by doing (main memory size / cache size). But I will explain a little more about tag bits.
Here the size of a cache line (which is always the same as the size of a main memory block) is 8 bytes, which is 2^3 bytes. So you need 3 bits to represent a byte within a cache line. Now 8 bits (11 - 3) of the address remain.
The total number of lines present in the cache is (cache size / line size) = 2^6 / 2^3 = 2^3 = 8.
So you have 3 bits to represent the line in which your required byte is present.
The number of remaining bits is now 5 (8 - 3).
These 5 bits are used for the tag. :)
3) 3 bits for the index. If you were trying to label the number of bits needed to represent a line as the index, yes, you are right.
4) 3 bits will be used to access a byte within a cache line (8 = 2^3).
So:
11-bit total address length = 5 tag bits + 3 bits to represent a line + 3 bits to represent a byte (word) within a line.
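As a small illustration (a sketch of my own, assuming exactly this 2048/64/8 configuration), the three fields can be extracted from an 11-bit address with shifts and masks:
#include <stdio.h>

int main(void) {
    unsigned addr = 0x4AB;                 /* hypothetical 11-bit address */
    unsigned offset = addr & 0x7;          /* bits 0-2: byte within line  */
    unsigned index  = (addr >> 3) & 0x7;   /* bits 3-5: cache line        */
    unsigned tag    = (addr >> 6) & 0x1F;  /* bits 6-10: tag              */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}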
Hope there is no confusion now.

Algorithm to find out matched bits and non-matched bits from two streams

I am adding the corresponding bits of two bit streams in Java, like below:
1 0 1 1 0 0
1 0 1 0 1 0
====================
2 0 2 1 1 0
After this I am adding result as:
2+0+2+1+1+0 = 6
Now I have to find out the number of 1s and 2s in the result (6), that is, the matched and non-matched bits. I have tried hard to devise an algorithm that can tell me the exact number of 1s and 2s the result is made up of, but I have been unable to create one so far.
Multiplying each addition result by a constant number is allowed. Individual bits can be subtracted to achieve the above goal.
I can also multiply these individual bits the way I am adding them above, but then I cannot add the bits or the result. Bits can be multiplied with themselves or with any other bit. I can even represent these bits with numbers of my choice, i.e. I can say that 1=2 and 0=3; then I have:
For addition (Paillier):
2 3 2 2 3 3
2 3 2 3 2 3
====================
4 6 4 5 5 6
For multiplication (RSA):
2 3 2 2 3 3
2 3 2 3 2 3
====================
4 9 4 6 6 9
The only purpose is to find out the number of similar bits (1&1) and non-similar bits (0&1, 1&0) from the overall number, which will be generated either by addition (Paillier) or by multiplication (RSA).
Furthermore, the 2nd bit stream can be represented with different numbers than the above.
The following can also be used:
Multiplication among bits and the result, and exponentiation with a constant
Addition/subtraction among bits and the result, and multiplication with a constant only
Further detail:
I am using the Paillier homomorphic algorithm to encrypt these individual bits. Paillier supports only addition over encrypted data, so I can only add. I have to send this number to some application which has to find out the exact number of matched and non-matched bits.
Alternatively, I can use RSA, but it only allows multiplication.
In general, from two original bit streams a and b, you will want to compute something like
popcnt(a XOR b) = # of non-matching bits (pairs of 0,1 and 1,0)
popcnt(a AND b) = # of matching 1-bits (pairs of 1,1)
where popcnt is the Hamming weight (population count) of the resulting string.
If you happen to be compiling on a processor which supports the SSE4 POPCNT instruction (AMD and Intel processors produced in the last 7 years), this will likely be the most performant way of solving the problem. As an example, an implementation of both (testing an int-sized number of bits from streams a and b at a time)
int nonmatch(int a, int b) {
    return __builtin_popcount(a ^ b);
}

int match(int a, int b) {
    return __builtin_popcount(a & b);
}
will yield the assembly code
; nonmatch
        xor     %esi, %edi
        popcnt  %edi, %eax
        retq
; match
        and     %esi, %edi
        popcnt  %edi, %eax
        retq
respectively.
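Applied to the question's example streams (a small demo of my own; the GCC/Clang builtin is assumed to be available):
#include <stdio.h>

int main(void) {
    int a = 0x2C; /* 101100 from the question */
    int b = 0x2A; /* 101010 from the question */
    /* XOR marks the differing positions; popcount counts them. */
    printf("non-matching bits: %d\n", __builtin_popcount(a ^ b)); /* 2 */
    /* AND keeps the positions where both bits are 1. */
    printf("matching 1-bits:   %d\n", __builtin_popcount(a & b)); /* 2 */
    return 0;
}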

How data is stored in memory or register

I'm new to assembly language and am learning it for exams. I am a programmer and have worked in C, C++, Java and ASP.NET.
I have TASM on Windows XP.
I want to know how data is stored in memory or in a register; I want to know the process. I believe it is something like this:
While entering data, e.g. a number:
Input decimal number -> converted to hex -> store ASCII of hex in registers or memory.
While fetching data:
ASCII of hex in registers or memory -> converted to hex -> show decimal number on monitor.
Is that correct? If not, can anybody explain it with a simple example?
OK, Michael: see the code below, where I am trying to add two 1-digit numbers and display the 2-digit result, like 6+5=11.
sseg segment stack
ends
code segment
    ; 30h to 39h represent the numbers 0-9
    MOV BX, '6'   ; the ASCII code of 6 (36h) is stored in BX
    ADD BX, '5'   ; the ASCII code of 5 (35h) is added to BX, i.e. the total is 6Bh
Thanks, Michael... I accept my mistake.
OK, so here BX = 006Bh, right? Does it mean BL = 00 and BH = 6B?
However, if I do so, I can't figure out how to show the result 11.
Hey Blechdose,
can you help me with one more problem? I am trying to compare 2 values. If both are the same, then dl=1; otherwise dl=0. But the following code displays 0 even for equal values. Why is it not jumping?
sseg segment stack
ends
code segment
assume cs:code
    mov dl, 0
    mov ax, 5
    mov bx, 5
    cmp ax, bx
    jne NotEqual
    je  equal
NotEqual:
    mov dl, 0
    add dl, 30h
    mov ah, 02h
    int 21h
    mov ax, 4c00h
    int 21h
equal:
    mov dl, 1
    add dl, 30h
    mov ah, 02h
    int 21h
    mov ax, 4c00h
    int 21h
code ends
end NotEqual
end equal
Registers consist of bits. A bit can have the logic value 0 or 1. It is a "logic value" for us, but actually it is represented by some kind of voltage inside the hardware. For example, 4-5 volts is interpreted as "logic 1" and 0-1 volts as "logic 0". The BX register has 16 of those bits.
Let's say the current content of BX (the base register) is 0000000000110110. Because those long strings of 0s and 1s are very hard for humans to read, we combine every 4 bits into 1 hex digit to get a more readable format to work with. The CPU does not know what a hex or decimal number is; it can only work with binary code. OK, let us use a more readable format for our BX register:
0000 0000 0011 0110   (actual BX content)
   0    0    3    6   (hex format for us)
                 54   (corresponding decimal value)
When you send this value (36h) to your output terminal, it will interpret it as an ASCII character. Thus it will display a '6' for the value 36h.
When you want to add 6 + 2 in assembly, you put 0110 (6) and 0010 (2) into the registers. Your assembler, TASM, is doing the work for you: it allows you to write '6' (ASCII) or 6h (hex) or even 6 (decimal) in the asm source code and will convert that for you into a binary number which the register accepts. WARNING: '6' will not put the value 6 into the register, but the ASCII code for 6. You cannot calculate with that directly.
Example: 6+2=8
mov BX, 6h   ; we put 0110 (6) into BX (actually 0000 0000 0000 0110,
             ; because BX is 16 bits, but I will drop those leading 0s)
add BX, 2h   ; we add 0010 (2) to 0110 (6); the result 1000 (8) is stored in BX
add BX, 30h  ; we add 00110000 (30h); the result 00111000 (38h) is stored in BX
             ; 38h is the ASCII code which your terminal output will interpret as '8'
When you do a calculation like 6+5 = 11, it gets even more complicated, because you have to convert the result 1011 (11) into two ASCII digits, '1' and '1' (3131h = 00110001 00110001).
After adding 6 (0110) + 5 (0101) = 11 (1011), BX will contain this (without blanks):
0000 0000 0000 1011 (binary)
0 0 0 B (Hex)
11 (decimal)
|__________________|
BX
|________||________|
BH BL
BH is the high byte of BX, while BL is the low byte of BX. In our example BH is 00h, while BL contains 0Bh.
To display your summation result on your terminal output, you need to convert it to ASCII code. In this case you want to display '11', so you need the '1' ASCII character twice. By looking at one of the hundreds of ASCII tables on the internet, you will find that the code for the '1' character is 31h. Consequently you need to send 3131h to your terminal:
0011 0001 0011 0001   (binary)
   3    1    3    1   (hex)
              12593   (decimal)
The trick for doing this is to divide your 11 (1011) by 10 with the div instruction. After the division by 10 you get a quotient and a remainder. You convert the remainder into an ASCII digit, which you save into a buffer. Then you repeat the process, dividing the quotient from the last step by 10 again. You do this until the quotient is 0. (Using the div instruction is a bit tricky; you will have to look that up yourself.)
binary (decimal):
divide 1011 (11) by 1010 (10):
    quotient: 0001 (1)  remainder: 0001 (1) -> convert remainder to ASCII
divide the quotient by 1010 (10) again:
    quotient: 0000 (0)  remainder: 0001 (1) -> convert remainder to ASCII
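In C the same divide-by-10 loop looks roughly like this (a sketch of my own, not tied to TASM; the same logic translates to div in assembly):
#include <stdio.h>

/* Peel decimal digits off the low end with /10 and %10, convert each
   remainder to ASCII by adding '0' (30h), and emit them in reverse. */
static void print_decimal(unsigned n) {
    char buf[10];
    int i = 0;
    do {
        buf[i++] = (char)('0' + n % 10); /* remainder -> ASCII digit  */
        n /= 10;                         /* continue with the quotient */
    } while (n != 0);
    while (i--)
        putchar(buf[i]);                 /* digits were produced backwards */
}

int main(void) {
    print_decimal(11); /* the 6 + 5 example: prints "11" */
    putchar('\n');
    return 0;
}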

What is the best way of sending the data to serial port?

This is related to microcontrollers, but I thought I'd post it here because it is a problem with algorithms and data types, not with any hardware stuff. I'll explain the problem so that someone who doesn't have any hardware knowledge can still participate :)
In the microcontroller there is an analog-to-digital converter with 10-bit resolution (it will output a value between 0 and 1023).
I need to send this value to the PC using the serial port.
But you can only write 8 bits at once (you need to write whole bytes); it is a limitation of the microcontroller.
So in the above case I need to send at least 2 bytes.
My PC application just reads a sequence of numbers for plotting, so it should capture two consecutive bytes and build the number back. But here we will need a delimiter character as well; yet the delimiter character also has a value between 0 and 255, so it will mix up the process.
So what is the simplest way to do this? Should I send the values as a sequence of chars?
Ex: 1023 = "1" "0" "2" "3" vs. Char(255) Char(3)
In summary, I need to send a sequence of 10-bit numbers over serial in the fastest way. :)
You need to send 10 bits, and because you send a byte at a time, you have to send 16 bits. The big question is: how much of a priority is speed, and how synchronised are the sender and receiver? I can think of 3 answers, depending on these conditions.
Regular sampling, unknown join point
If the device is running all the time, you aren't sure when you are going to connect (you could join at any point in the sequence), but the sampling rate is slower than the communication speed so you don't care about size, then I think I'd do it as follows. Suppose you are trying to send the ten bits abcdefghij (each letter one bit).
I'd send pq0abcde then pq1fghij, where p and q are error checking bits. This way:
no delimiter is needed (you can tell which byte you are reading by the 0 or 1)
you can definitely spot any 1 bit error, so you know about bad data
I'm struggling to find a good two-bit error-correcting code, so I guess I'd just make p a parity bit for bits 2, 3 and 4 (0, a, b above) and q a parity bit for bits 5, 6 and 7 (c, d, e above). This might be clearer with an example.
Suppose I want to send 714 = 1011001010.
Split in 2: 10110, 01010
Add bits to indicate first and second byte: 010110, 101010
Calculate the parity for each half: p0 = par(010) = 1, q0 = par(110) = 0, p1 = par(101) = 0, q1 = par(010) = 1
The bytes are then 10010110, 01101010
You then can detect a lot of different error conditions, quickly check which byte you are being sent if you lose synchronisation, and none of the operations take very long in a microcontroller (I'd do the parity with an 8 entry lookup table).
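A sketch of that framing in C (my own reading of the scheme; the exact bit positions for p and q are assumptions, chosen to match the worked example):
#include <stdint.h>

/* Each output byte is p q f ddddd, where f marks the first (0) or
   second (1) byte and p, q are the parity bits described above. */
static uint8_t parity3(uint8_t v) {
    return (uint8_t)((v ^ (v >> 1) ^ (v >> 2)) & 1); /* XOR of 3 bits */
}

static void encode10(uint16_t v, uint8_t out[2]) {
    uint8_t b0 = (uint8_t)((v >> 5) & 0x1f);   /* flag 0 + abcde */
    uint8_t b1 = (uint8_t)(0x20 | (v & 0x1f)); /* flag 1 + fghij */
    out[0] = (uint8_t)((parity3((b0 >> 3) & 7) << 7) |
                       (parity3(b0 & 7) << 6) | b0);
    out[1] = (uint8_t)((parity3((b1 >> 3) & 7) << 7) |
                       (parity3(b1 & 7) << 6) | b1);
}
Feeding 714 into encode10 reproduces the two example bytes, 10010110 and 01101010.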
Dense data, known join point
If you know that the reader starts at the same time as the writer, just send the 4 ten-bit values as 5 bytes. If you always read 5 bytes at a time, then there are no problems. If you want even more space saving, and already have good sample data, I'd compress using Huffman coding.
Dense data, unknown join point
In 7 bytes you can send 5 ten bit values with 6 spare bits. Send 5 values like this:
byte 0: 0 (7 bits)
byte 1: 1 (7 bits)
byte 2: 1 (7 bits)
byte 3: 1 (7 bits)
byte 4: 0 (7 bits)
byte 5: 0 (7 bits)
byte 6: (8 bits)
Then whenever you see three 1s in a row in the most significant bit, you know you have bytes 1, 2 and 3. This idea wastes 1 bit in 56, so it could be made even more efficient, but then you'd have to send more data at a time. E.g. (5 consecutive ones, 120 bits sent in 16 bytes):
byte 0: 0 (7 bits) 7
byte 1: 1 (7 bits) 14
byte 2: 1 (7 bits) 21
byte 3: 1 (7 bits) 28
byte 4: 1 (7 bits) 35
byte 5: 1 (7 bits) 42
byte 6: 0 (7 bits) 49
byte 7: (8 bits) 57
byte 8: (8 bits) 65
byte 9: (8 bits) 73
byte 10: (8 bits) 81
byte 11: 0 (7 bits) 88
byte 12: (8 bits) 96
byte 13: (8 bits) 104
byte 14: (8 bits) 112
byte 15: (8 bits) 120
This is quite a fun problem!
The best method is to convert the data to an ASCII string and send it that way: it makes debugging a lot easier and it avoids various communication issues (special meanings of certain control characters, etc.).
If you really need to use all the available bandwidth, though, you can pack 4 ten-bit values into 5 consecutive 8-bit bytes. You will need to be careful about synchronization.
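One possible packing (a sketch of my own; the answer doesn't prescribe a bit layout):
#include <stdint.h>

/* Pack four 10-bit samples into 5 bytes, most significant bits first. */
static void pack4(const uint16_t v[4], uint8_t out[5]) {
    uint64_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc = (acc << 10) | (v[i] & 0x3ff); /* 40 bits accumulated */
    for (int i = 4; i >= 0; i--) {
        out[i] = (uint8_t)(acc & 0xff);     /* out[4] gets the lowest byte */
        acc >>= 8;
    }
}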
Since you specified "the fastest way", I think expanding the numbers to ASCII is ruled out.
In my opinion, a good compromise between code simplicity and performance can be obtained by the following encoding:
Two 10-bit values are encoded in 3 bytes like this.
first 10-bit value, bits:  abcdefghij
second 10-bit value, bits: klmnopqrst
Bytes to encode:
1abcdefg
0hijklmn
0_opqrst
There is one more bit (_) available that could be used for a parity over all 20 bits for error checking, or just set to a fixed value.
Some example code (it puts 0 at the position _):
#include <assert.h>
#include <inttypes.h>

void
write_byte(uint8_t byte); /* writes a byte to serial */

void
encode(uint16_t a, uint16_t b)
{
    write_byte(((a >> 3) & 0x7f) | 0x80);           /* 1abcdefg */
    write_byte(((a & 7) << 4) | ((b >> 6) & 0x0f)); /* 0hijklmn */
    write_byte(b & 0x3f);                           /* 00opqrst */
}

uint8_t
read_byte(void); /* reads a byte from serial */

void
decode(uint16_t *a, uint16_t *b)
{
    uint16_t x;
    while (((x = read_byte()) & 0x80) == 0) {} /* sync on the marker bit */
    *a = (x & 0x7f) << 3;
    x = read_byte();
    assert((x & 0x80) == 0); /* put better error handling here */
    *a |= (x >> 4) & 7;
    *b = (x & 0x0f) << 6;
    x = read_byte();
    assert((x & 0xc0) == 0); /* put better error handling here */
    *b |= x & 0x3f;
}
I normally use a start byte and a checksum, and in this case a fixed length, so send 4 bytes. The receiver can look for the start byte, and if the next three add up to a known quantity, then it is a good packet: take out the middle two bytes. If not, keep looking. The receiver can always re-sync, and it doesn't waste the bandwidth of ASCII. ASCII is your other option: a start byte that is not a number, and perhaps four numbers for decimal. Decimal is definitely not fun in a microcontroller, so start with something non-hex, like 'X' for example, and then three bytes with the hex ASCII values of your number. Search for the X, examine the next three bytes, and hope for the best.
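A sketch of that packet format (my own interpretation; the start byte value and the checksum rule are assumptions):
#include <stdint.h>

#define START_BYTE 0xAA /* assumed marker value */

/* Send one 10-bit sample as a fixed 4-byte packet:
   start byte, high data byte, low data byte, checksum. */
static void send_sample(uint16_t v, void (*put_byte)(uint8_t)) {
    uint8_t hi = (uint8_t)((v >> 8) & 0x03);
    uint8_t lo = (uint8_t)(v & 0xff);
    put_byte(START_BYTE);
    put_byte(hi);
    put_byte(lo);
    put_byte((uint8_t)(0 - hi - lo)); /* so hi + lo + checksum == 0 mod 256 */
}
The receiver scans for START_BYTE, then accepts the packet only if the following three bytes sum to 0 modulo 256.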
