Cache tag value - caching

In reference to the cache question at the following link :
link
The question is: Using the series of addresses below, show the hits and misses and final cache contents for a two-way set-associative cache with one-word blocks, 4-byte words, and a total size of 16 words. Assume FIFO replacement.
0, 4, 64, 0, 128, 32, 12, 96, 128, 64
My question is: why is the tag value set to word address / 8?
Thanks.

Short explanation: if the cache holds 16 words total (= 64 bytes of cache, pretty small :) and it's 2-way set associative, then you have 8 sets that are directly mapped by the address. You don't need the set bits to be part of the tag because you've already used them to map to the correct set.
Assuming the access granularity is 1 byte, your address has 2 least-significant bits that locate a byte inside a block (4 bytes). You ignore these when accessing the cache, since you're reading the full block (the memory unit then uses these 2 bits to give you the exact bytes within the block according to the read size and alignment). So word address = real_address / 4.
Now, since you have 8 sets, you use the next 3 bits to map to the correct set.
+--------------------------------------+----------------------+-------------------+
| Tag (bits 5 and above) | Set (bits 2,3,4) | Offset (bits 0,1) |
+--------------------------------------+----------------------+-------------------+
That is, addr 0x0 would map to set 0, and addr 0x4 (word addr 0x1) would always sit in set 1, no matter what. Set 2 would hold addr 0x8 (word addr 0x2), set 3 would hold addr 0xC (word addr 0x3), and so on, until set 7, which would be used for addr 0x1C (word addr 0x7).
The next address would simply wrap: addr 0x20 (word addr 0x8) would check bits 2..4, see they're zero, and map to set 0 again, and so on. This is where the tag comes in, to distinguish between addr 0x0, addr 0x20, addr 0x10000, or any other address that maps there (addr % 0x20 == 0, or word_addr % 8 == 0). Since you don't care about the offset inside the line here, and the set bits are already known when you decide to access a given set, the only thing missing that requires storage (aside from the data, of course) is the bits above the set bits. These are required (and sufficient) to determine the identity of a line within a given set, and to know whether a lookup hits or misses. These bits are addr / 0x20 (or addr >> 5), or word_addr / 8 (= word_addr >> 3).
Note that this means a tag alone is not enough to identify the line address; you need both the tag and the set bits to reconstruct it.
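To make the field split concrete, here is a minimal sketch in C, assuming a 32-bit byte address and the 2 offset bits / 3 set bits derived above. The names split_address and cache_fields are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths for the cache in the question: 4-byte blocks, 8 sets. */
enum { OFFSET_BITS = 2, SET_BITS = 3 };

struct cache_fields {
    uint32_t tag;    /* bits 5 and above */
    uint32_t set;    /* bits 2..4 */
    uint32_t offset; /* bits 0..1 */
};

/* Split a byte address into tag, set index and byte offset. */
static struct cache_fields split_address(uint32_t addr)
{
    struct cache_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.set = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1u);
    f.tag = addr >> (OFFSET_BITS + SET_BITS);
    /* Tag and set bits together reconstruct the word address. */
    assert(((f.tag << SET_BITS) | f.set) == (addr >> OFFSET_BITS));
    return f;
}
```

With the addresses from the question, 0, 64, 128, 32 and 96 all land in set 0 with tags 0, 2, 4, 1 and 3, which is why they compete for the two ways of that set, while 4 goes to set 1 and 12 to set 3.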


General algorithm for bit permutations

Is there a general strategy for creating an efficient bit permutation algorithm? The goal is a fast, branchless and, if possible, LUT-less algorithm. I'll give an example:
A 13 bit code is to be transformed into another 13 bit code according to the following rule table:
BIT | INPUT (DEC) | INPUT (BIN)   | OUTPUT (BIN)  | OUTPUT (DEC)
  0 |           1 | 0000000000001 | 0000100000000 |  256
  1 |           2 | 0000000000010 | 0010000000000 | 1024
  2 |           4 | 0000000000100 | 0100000000000 | 2048
  3 |           8 | 0000000001000 | 1000000000000 | 4096
  4 |          16 | 0000000010000 | 0000001000000 |   64
  5 |          32 | 0000000100000 | 0000000100000 |   32
  6 |          64 | 0000001000000 | 0001000000000 |  512
  7 |         128 | 0000010000000 | 0000000010000 |   16
  8 |         256 | 0000100000000 | 0000000001000 |    8
  9 |         512 | 0001000000000 | 0000000000010 |    2
 10 |        1024 | 0010000000000 | 0000000000001 |    1
 11 |        2048 | 0100000000000 | 0000000000100 |    4
 12 |        4096 | 1000000000000 | 0000010000000 |  128
Example: If the input code is 1+2+4096=4099 the resulting output would be 256+1024+128=1408
A naive approach would be:
OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...
That means 3 instructions per bit (AND, SHIFT, OR), so 39-1 instructions (the last OR is omitted) for the above example. We could instead use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
When inspecting the example table, you will of course notice a few obvious possibilities for optimization, for example rows 2/3/4 (bits 1 to 3), which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.
Are there general strategies? I think using Karnaugh-Veitch maps to simplify the expression could be one approach. However, that is pretty difficult for 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.
For bit permutations, several strategies are known that work in certain cases. There's a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. However, in this case, it seems to find only basically the strategy you suggested, indicating that it seems hard to find any pattern to exploit in this permutation.
If one big lookup table is too much, you can try to use two smaller ones:
1. Take the 7 lower bits of the input, look up a 16-bit value in table A.
2. Take the 6 higher bits of the input, look up a 16-bit value in table B.
3. OR the values from 1. and 2. to produce the result.
Table A needs 128*2 bytes, table B needs 64*2 bytes, that's 384 bytes for the lookup tables.
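A minimal sketch of the two-table idea in C follows. The destination positions in dest_bit are read off the rule table in the question; the names dest_bit, table_a, table_b, init_tables and permute_two_luts are made up for this illustration. The tables are filled at start-up here only for brevity, and could just as well be precomputed constants.

```c
#include <assert.h>
#include <stdint.h>

/* Destination bit for each source bit 0..12, taken from the rule table. */
static const uint8_t dest_bit[13] = {8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7};

static uint16_t table_a[128]; /* indexed by input bits 0..6  */
static uint16_t table_b[64];  /* indexed by input bits 7..12 */

/* Slow reference version, used only to fill the tables. */
static uint16_t permute_naive(uint16_t x)
{
    uint16_t out = 0;
    for (int bit = 0; bit < 13; bit++)
        if (x & (1u << bit))
            out |= (uint16_t)(1u << dest_bit[bit]);
    return out;
}

/* Fill both tables once before the first lookup. */
static void init_tables(void)
{
    for (unsigned i = 0; i < 128; i++)
        table_a[i] = permute_naive((uint16_t)i);
    for (unsigned i = 0; i < 64; i++)
        table_b[i] = permute_naive((uint16_t)(i << 7));
}

/* Two lookups and one OR per permutation, no branches. */
static uint16_t permute_two_luts(uint16_t x)
{
    assert(x < (1u << 13));
    return (uint16_t)(table_a[x & 0x7f] | table_b[(x >> 7) & 0x3f]);
}
```

After calling init_tables() once, permute_two_luts(4099) yields 1408, matching the example in the question.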
This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.
Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy to find the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide bit ranges around it.
The three look-up tables have single-byte entries because the range of the transformations for each range is just one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.
Here's the code:
unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2) \
    { 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0), \
      (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[]  = LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[]  = LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return (   (x&1)          << 8)  // Bit 0
         + (   (x&14)         << 9)  // Bits 1-3, simple shift
         + (t4[(x>>4)&7]      << 3)  // Bits 4-6, see below
         + (t7[(x>>7)&7]          )  // Bits 7-9, three-bit lookup for LOB
         + (t10[(x>>10)&7]        ); // Bits 10-12, ditto
}
Note on bits 4-6
Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second last instruction in the assembly below.)
The intermediate results are added instead of using boolean OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR have the same result; using add can take advantage of chip features like LEA.
Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:
permute_bits(unsigned short):
mov edx, edi
mov ecx, edi
movzx eax, di
shr di, 4
shr dx, 7
shr cx, 10
and edi, 7
and edx, 7
and ecx, 7
movzx ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
movzx edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
add edx, ecx
mov ecx, eax
sal eax, 9
sal ecx, 8
and ax, 7168
and cx, 256
or eax, ecx
add edx, eax
movzx eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
lea eax, [rdx+rax*8]
ret
I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.
I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. It is quite a bit shorter, though, possibly enough to compensate for the 24 bytes of lookup data.

Checking if an address is cache-line aligned

This is a quiz question which I failed in the past and, despite having access to the solution, I don't understand the steps that lead to the correct answer.
Here is the problem:
Which of these addresses is cache-line aligned?
a. 0x7ffc32a21164
b. 0x560c40e05350
c. 0x560c40e052c0
d. 0x560c3f2d71ff
And the solution to the problem:
Each hex character is represented by 4 bits
It takes 6 bits to represent 64 addresses, since ln(64)/ln(2) = 6
0x0 0000
0x4 0100
0x8 1000
0xc 1100
________
2^3 2^2 2^1 2^0
8 4 2 1
Conclusion: if the address ends in either 00, 40, 80 or c0, then it is aligned on 64 bytes.
The answer is c.
I really don't see how we get from the 6-bit representation to this answer. Can anyone add something to the given solution to make it clearer?
The question boils down to: Which number is a multiple of 64? All that remains is understanding the number system they're using.
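In binary, 64 is written as 1000000. In hexadecimal, it's written as 0x40. So multiples of 64 will end in 0x00 (0 * 64), 0x40 (1 * 64), 0x80 (2 * 64), or 0xC0 (3 * 64). (The cycle then repeats.) Put in terms of the 6 bits from the solution: the low 6 bits must all be zero, so the last hex digit (4 bits) must be 0 and the low 2 bits of the second-to-last digit must be 0, which leaves only 0, 4, 8 or c for that digit. Answer c is the one with the right ending.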
An analogy in decimal would be: Which number is a multiple of 5? 0 * 5 is 0 and 1 * 5 is 5, after which the cycle repeats. So we just need to look at the last digit. If it's a 0 or a 5, we know the number is a multiple of 5.
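The same test can be written as a bit mask. This is a small sketch in C, assuming a power-of-two line size (64 bytes in the quiz); is_aligned is a name made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of line_size, which must be a power of two. */
static bool is_aligned(uint64_t addr, uint64_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    /* For 64-byte lines the mask is 0x3f, i.e. the low 6 bits. */
    return (addr & (line_size - 1)) == 0;
}
```

Applied to the four candidates with a line size of 64, only 0x560c40e052c0 passes, because 0xc0 & 0x3f is 0.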

Direct mapped cache example

I am really confused by the topic of direct mapped caches. I've been looking around for an example with a good explanation, and it's making me more confused than ever.
For example: I have
2048 byte memory
64 byte big cache
8 byte cache lines
With a direct mapped cache, how do I determine the 'LINE', 'TAG' and 'Byte offset'?
I believe that the total number of address bits is 11, because 2048 = 2^11
2048/64 = 2^5 = 32 blocks (0 to 31) (5 bits needed) (tag)
64/8 = 8 = 2^3 = 3 bits for the index
8 byte cache lines = 2^3, which means I need 3 bits for the byte offset
So the address would look like this: 5 bits for the tag, 3 for the index and 3 for the byte offset
Do I have this figured out correctly?
Do I have this figured out correctly? YES
Explanation
1) Main memory size is 2048 bytes = 2^11, so you need 11 bits to address a byte (if your word size is 1 byte) [word = smallest individual unit that will be accessed with the address].
2) You can calculate the number of tag bits in direct mapping from (main memory size / cache size) = 2048 / 64 = 2^5, which gives 5 bits. But I will explain a little more about the tag bits.
Here the size of a cache line (which is always the same as the size of a main memory block) is 8 bytes, which is 2^3 bytes. So you need 3 bits to represent a byte within a cache line. Now you have 8 bits (11 - 3) remaining in the address.
Now the total number of lines present in the cache is (cache size / line size) = 2^6 / 2^3 = 2^3.
So you have 3 bits to represent the line in which your required byte is present.
The number of remaining bits is now 5 (8 - 3).
These 5 bits can be used to represent a tag. :)
3) 3 bits for the index. If you were labelling the number of bits needed to represent a line as the index, then yes, you are right.
4) 3 bits will be used to access a byte within a cache line (8 = 2^3).
So,
11 bits total address length = 5 tag bits + 3 bits to represent a line + 3 bits to represent a byte (word) within a line
Hope there is no confusion now.
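The arithmetic above can be sketched as a small C helper that derives the three field widths from the sizes. It assumes all sizes are powers of two and byte-addressable memory; log2_exact, dm_layout and direct_mapped_layout are names invented for this illustration.

```c
#include <assert.h>

/* Base-2 logarithm of n, which must be a power of two. */
static unsigned log2_exact(unsigned n)
{
    unsigned bits = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

struct dm_layout {
    unsigned tag_bits;
    unsigned index_bits;
    unsigned offset_bits;
};

/* Field widths of an address for a direct mapped cache. */
static struct dm_layout direct_mapped_layout(unsigned mem_bytes,
                                             unsigned cache_bytes,
                                             unsigned line_bytes)
{
    struct dm_layout l;
    l.offset_bits = log2_exact(line_bytes);               /* byte within a line */
    l.index_bits = log2_exact(cache_bytes / line_bytes);  /* which cache line   */
    l.tag_bits = log2_exact(mem_bytes) - l.index_bits - l.offset_bits;
    return l;
}
```

For the sizes in the question, direct_mapped_layout(2048, 64, 8) gives 5 tag bits, 3 index bits and 3 offset bits.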

Questions about websocket framing

According to the RFC 6455 specification for WebSockets, the data frame structure is as follows:
frame-fin ; 1 bit in length
frame-rsv1 ; 1 bit in length
frame-rsv2 ; 1 bit in length
frame-rsv3 ; 1 bit in length
frame-opcode ; 4 bits in length
frame-masked ; 1 bit in length
frame-payload-length ; either 7, 7+16, or 7+64 bits in length
[ frame-masking-key ] ; 32 bits in length
frame-payload-data ; n*8 bits in length, where n >= 0
So the minimum length of byte array to hold a frame would be 224 bytes (56 bits)? As I read on internet to represent a bit in byte array we need 4 bytes (1000).
How do I mask data? And what data should I mask? Only frame-payload-data or all the frame except the mask key?
The frame-masking-key field is only present when the frame is masked, which is only done for frames sent by a client to a server. And the frame-payload-data is optional; a frame may be empty, containing no data. Therefore the minimum length of a frame in the client-to-server direction is (1+1+1+1+4+1+7+32)=48 bits or 6 bytes, and the minimum length of a frame in the server-to-client direction is (1+1+1+1+4+1+7)=16 bits or 2 bytes.
Those would be frames that carry no payload. Obviously frames that carry payload data will require additional space.
As I read on internet to represent a bit in byte array we need 4 bytes (1000).
Umm, no, each byte holds 8 bits. It might be convenient within a program to use larger data units to represent bit values, but that is completely independent of the format that is used in the actual frame.
How do I mask data? And what data should I mask? Only frame-payload-data or all the frame except the mask key?
You mask by XOR-ing the frame-masking-key over the frame-payload-data. This is described in section 5.3 of RFC 6455.
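As a rough sketch of that step in C: each payload byte is XOR-ed with the masking-key byte at position i mod 4, and applying the same operation again restores the original data. The function name ws_mask is made up for this illustration; it touches only the payload bytes, not the header or the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask or unmask a payload in place with a 4-byte masking key. */
static void ws_mask(uint8_t *payload, size_t len, const uint8_t key[4])
{
    assert(payload != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        payload[i] ^= key[i % 4];
}
```

With the key 0x37 0xfa 0x21 0x3d, the text "Hello" becomes 0x7f 0x9f 0x4d 0x51 0x58, which matches the masked "Hello" example frame given in RFC 6455.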

What is the best way of sending the data to serial port?

This is related with microcontrollers but thought to post it here because it is a problem with algorithms and data types and not with any hardware stuff. I'll explain the problem so that someone that doesn't have any hardware knowledge can still participate :)
In the microcontroller there is an analog-to-digital converter with 10-bit resolution. (It will output a value between 0 and 1023.)
I need to send this value to the PC using the serial port.
But you can only write 8 bits at once. (You need to write bytes.) It is a limitation of the microcontroller.
So in the above case I need to send at least 2 bytes.
My PC application just reads a sequence of numbers for plotting. So it should capture two consecutive bytes and build the number back. But here we will need a delimiter character as well, and since the delimiter character itself has an ASCII value between 0 and 255, it will get mixed up with the data.
So what is a simplest way to do this? Should I send the values as a sequence of chars?
Ex : 1023 = "1""0""2""3" Vs "Char(255)Char(4)"
In summary I need to send a sequence of 10 bit numbers over Serial in fastest way. :)
You need to send 10 bits, and because you send a byte at a time, you have to send 16 bits. The big question is how much is speed a priority, and how synchronised are the sender and receiver? I can think of 3 answers, depending on these conditions.
Regular sampling, unknown join point
If the device is running all the time and you aren't sure when you are going to connect (you could join at any time in the sequence), but the sampling rate is slower than the communication speed so you don't care about size, I'd probably do it as follows. Suppose you are trying to send the ten bits abcdefghij (each letter one bit).
I'd send pq0abcde then pq1fghij, where p and q are error checking bits. This way:
no delimiter is needed (you can tell which byte you are reading by the 0 or 1)
you can definitely spot any 1 bit error, so you know about bad data
I'm struggling to find a good two bit error correcting code, so I guess I'd just make p a parity bit for bits 2, 3 and 4 (0, a, b above, counting from the most significant bit as bit 0) and q a parity bit for bits 5, 6 and 7 (c, d, e above). This might be clearer with an example.
Suppose I want to send 714 = 1011001010.
Split in 2 10110 , 01010
Add bits to indicate first and second byte 010110, 101010
calculate parity for each half: p0=par(010)=1, q0=par(110)=0, p1=par(101)=0, q1=par(010)=1
bytes are then 10010110, 01101010
You then can detect a lot of different error conditions, quickly check which byte you are being sent if you lose synchronisation, and none of the operations take very long in a microcontroller (I'd do the parity with an 8 entry lookup table).
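Here is a minimal C sketch of this two-byte scheme, assuming the parity is a plain XOR of the three covered bits (as in the worked example). The function names are made up for this illustration, and the parity is computed with shifts rather than the 8-entry lookup table mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* XOR of the three low bits of v. */
static uint8_t parity3(uint8_t v)
{
    return (uint8_t)((v ^ (v >> 1) ^ (v >> 2)) & 1u);
}

/* Build one byte laid out as p q m d4 d3 d2 d1 d0,
   where m is 0 for the first byte and 1 for the second. */
static uint8_t frame_byte(uint8_t marker, uint8_t five_bits)
{
    uint8_t low6 = (uint8_t)(((marker & 1u) << 5) | (five_bits & 0x1fu));
    uint8_t p = parity3((uint8_t)(low6 >> 3)); /* covers m, d4, d3  */
    uint8_t q = parity3(low6);                 /* covers d2, d1, d0 */
    return (uint8_t)((p << 7) | (q << 6) | low6);
}

/* Encode one 10-bit sample into two bytes. */
static void encode_sample(uint16_t sample, uint8_t out[2])
{
    assert(sample < 1024);
    out[0] = frame_byte(0, (uint8_t)(sample >> 5));
    out[1] = frame_byte(1, (uint8_t)(sample & 0x1f));
}

/* Returns 1 if both parity bits of a received byte are consistent. */
static int byte_ok(uint8_t b)
{
    return ((b >> 7) & 1u) == parity3((uint8_t)(b >> 3))
        && ((b >> 6) & 1u) == parity3(b);
}
```

Encoding 714 this way gives the bytes 10010110 and 01101010 from the worked example, and flipping any single bit of either byte makes byte_ok return 0.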
Dense data, known join point
If you know that the reader starts at the same time as the writer, just send 4 ten bit values as 5 bytes. If you always read 5 bytes at a time, there are no problems. If you want to save even more space, and already have good sample data, I'd compress using Huffman coding.
Dense data, unknown join point
In 7 bytes you can send 5 ten bit values with 6 spare bits. Send 5 values like this:
byte 0: 0 (7 bits)
byte 1: 1 (7 bits)
byte 2: 1 (7 bits)
byte 3: 1 (7 bits)
byte 4: 0 (7 bits)
byte 5: 0 (7 bits)
byte 6: (8 bits)
Then whenever you see 3 1's in a row for the most significant bit, you know you have bytes 1, 2 and 3. This idea wastes 1 bit in 56, so could be made even more efficient, but you'd have to send more data at a time. Eg (5 consecutive ones, 120 bits sent in 16 bytes):
byte 0: 0 (7 bits) 7
byte 1: 1 (7 bits) 14
byte 2: 1 (7 bits) 21
byte 3: 1 (7 bits) 28
byte 4: 1 (7 bits) 35
byte 5: 1 (7 bits) 42
byte 6: 0 (7 bits) 49
byte 7: (8 bits) 57
byte 8: (8 bits) 65
byte 9: (8 bits) 73
byte 10: (8 bits) 81
byte 11: 0 (7 bits) 88
byte 12: (8 bits) 96
byte 13: (8 bits) 104
byte 14: (8 bits) 112
byte 15: (8 bits) 120
This is quite a fun problem!
The best method is to convert the data to an ASCII string and send it that way - it makes debugging a lot easier and it avoids various communication issues (special meaning of certain control characters etc).
If you really need to use all the available bandwidth though then you can pack 4 10 bit values into 5 consecutive 8 bit bytes. You will need to be careful about synchronization.
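One possible way to do that packing is sketched below in C. The bit order (first value in the most significant bits) is an assumption of this sketch, as are the names pack4 and unpack4; it does nothing about synchronization, which still has to be handled separately.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 10-bit values into five bytes, first value in the top bits. */
static void pack4(const uint16_t v[4], uint8_t out[5])
{
    uint64_t bits = 0;
    for (int i = 0; i < 4; i++) {
        assert(v[i] < 1024);
        bits = (bits << 10) | v[i];
    }
    for (int i = 0; i < 5; i++)
        out[i] = (uint8_t)(bits >> (8 * (4 - i)));
}

/* Inverse of pack4: recover four 10-bit values from five bytes. */
static void unpack4(const uint8_t in[5], uint16_t v[4])
{
    uint64_t bits = 0;
    for (int i = 0; i < 5; i++)
        bits = (bits << 8) | in[i];
    for (int i = 0; i < 4; i++)
        v[i] = (uint16_t)((bits >> (10 * (3 - i))) & 0x3ff);
}
```

The receiver must read in groups of exactly 5 bytes starting at a group boundary; a single dropped byte shifts every later value.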
Since you specified "the fastest way" I think expanding the numbers to ASCII is ruled out.
In my opinion a good compromise of code simplicity and performance can be obtained by the following encoding:
Two 10bit values will be encoded in 3 bytes like this.
first 10bit value bits := abcdefghij
second 10bit value bits := klmnopqrst
Bytes to encode:
1abcdefg
0hijklmn
0_opqrst
There is one bit more (_) available that could be used for a parity over all 20bits for error checking or just set to a fixed value.
Some example code (puts 0 at the position _):
#include <assert.h>
#include <inttypes.h>

void write_byte(uint8_t byte); /* writes byte to serial */

void encode(uint16_t a, uint16_t b)
{
    write_byte(((a >> 3) & 0x7f) | 0x80);           /* 1abcdefg */
    write_byte(((a & 7) << 4) | ((b >> 6) & 0x0f)); /* 0hijklmn */
    write_byte(b & 0x3f);                           /* 00opqrst */
}

uint8_t read_byte(void); /* read a byte from serial */

void decode(uint16_t *a, uint16_t *b)
{
    uint16_t x;

    while (((x = read_byte()) & 0x80) == 0) {} /* sync */
    *a = (x & 0x7f) << 3;                      /* drop the sync bit */
    x = read_byte();
    assert((x & 0x80) == 0); /* put better error handling here */
    *a |= (x >> 4) & 7;      /* h, i, j */
    *b = (x & 0x0f) << 6;    /* k, l, m, n */
    x = read_byte();
    assert((x & 0xc0) == 0); /* put better error handling here */
    *b |= x;
}
I normally use a start byte and a checksum, and in this case a fixed length. So send 4 bytes: the receiver can look for the start byte, and if the next three bytes add up to a known quantity then it is a good packet and you take out the middle two bytes; if not, keep looking. The receiver can always re-sync, and it doesn't waste bandwidth the way ASCII does. ASCII is your other option: a start byte that is not a number, followed by perhaps four digits for decimal. Decimal is definitely not fun in a microcontroller, so start with something non-hex, like X for example, followed by three bytes with the hex ASCII values for your number. Search for the X, examine the next three bytes, and hope for the best.
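A rough C sketch of the binary variant follows. The start byte value 0xA5, the target sum 0xFF and the function names are all assumptions made for this illustration; the answer does not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define START_BYTE 0xA5u /* assumed marker value */
#define CHECK_SUM  0xFFu /* assumed: bytes 1..3 add up to this, modulo 256 */

/* Build a 4-byte packet: start byte, high byte, low byte, checksum. */
static void build_packet(uint16_t sample, uint8_t pkt[4])
{
    assert(sample < 1024);
    pkt[0] = START_BYTE;
    pkt[1] = (uint8_t)(sample >> 8);
    pkt[2] = (uint8_t)(sample & 0xff);
    pkt[3] = (uint8_t)(CHECK_SUM - pkt[1] - pkt[2]);
}

/* Scan buf for the first valid packet. Returns its offset, or -1 if
   none is found; on success the decoded value is stored in *sample. */
static int find_packet(const uint8_t *buf, size_t len, uint16_t *sample)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (buf[i] != START_BYTE)
            continue;
        if ((uint8_t)(buf[i + 1] + buf[i + 2] + buf[i + 3]) != CHECK_SUM)
            continue; /* not a real packet start, keep looking */
        *sample = (uint16_t)((buf[i + 1] << 8) | buf[i + 2]);
        return (int)i;
    }
    return -1;
}
```

A data byte that happens to equal the start byte is rejected by the checksum in most cases, which is what lets the receiver re-sync after joining mid-stream.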
