Cache memory: What is the difference between a tag and an index? - caching

I read a lot of articles and watched videos explaining the concept of cache memory but I still can't get what the difference is between an index and a tag in the address format. They all say that we need an index because otherwise multiple locations would get hashed to the same location within the cache. But I don't understand. Can someone please explain?
Thank you.

An address is a simple number, usually taken as an unsigned integer:
+---------------------------+
| address |
+---------------------------+
The same address — the same overall number — is decomposed by the cache into piece parts called fields:
+----------------------------+
| tag | index | offset |
+----------------------------+
For any given cache, the tag width, index width, and offset width are fixed bit counts, and, for any given address, each field has a value we can determine once we know the address and the field widths for that cache.
Caches store copies of main memory in chunks called blocks.  To find the block address of some address, keep its tag and index bits as they are, but set the block offset bits to zero.
Let's say there are two addresses: A and B.  Each has a tag, an index, and an offset.
We want to know if A and B match to the level of the block of memory — which means we care about tag & index bits matching, but not about offset bits.
You can see from the above that two addresses can be different yet have the same index — many addresses will share the same index yet have different tag or different offset bits.
Now, let's say that B is an address known to be cached.  That means that the block of memory for B's tag and B's index is in the cache.  The whole block is in the cache, which covers all addresses with the same tag & index and any possible offset bits.
Let's say that A is some address the program wants to access.  The cache's job is to determine whether A and B refer to the same block of memory.  If they do, then since B's block is in the cache, the access of A is a hit; if they don't, the access of A is a miss.
Caches employ the notion of an array and use the index field to select an element of that array, which simplifies their operation.  A simple (direct mapped) cache stores one block at each index position in the array; other caches store more than one block at each index position, which is the cache's set associativity level, its number of "ways", as in 2-way, 4-way, etc.  To find a desired element, A, we look in the cache by taking A's index field and using it as the index into the cache array.  If the element already there holds a block for address B, and B's stored tag equals A's tag, then both the index position and the tags match: the index matches because we looked in the right place, and the tags match because the cache stores B's tag and we have all of A, so we can compare A's tag with B's tag.
Such a cache will never store the block for an address at an index position different than the index position value for its address.  So, there is only one index position to look at to see if the cache stores the block associated with an address, A.
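To make that concrete, here is a minimal sketch in C of the direct mapped lookup described above; the 4-bit offset and 7-bit index widths (and the tags/valid arrays) are made-up illustration values, not anything from the answer:
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical field widths: 4 offset bits (16-byte blocks) and 7 index bits (128 entries). */
#define OFFSET_BITS 4
#define INDEX_BITS  7
#define NUM_SETS    (1u << INDEX_BITS)

static uint32_t tags[NUM_SETS];   /* tag stored for the block held at each index position */
static bool     valid[NUM_SETS];  /* whether that index position currently holds a block  */

static bool is_hit(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* middle field selects the entry */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);     /* high field is compared         */
    return valid[index] && tags[index] == tag;               /* offset bits are ignored        */
}
The block address, as described above, would simply be addr with the offset bits cleared: addr & ~((1u << OFFSET_BITS) - 1).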
"They all say that we need an index because otherwise multiple locations would get hashed to the same location within the cache"
There is a degenerate case in cache architecture where the index is 0 bits wide.  This means that regardless of the actual address, A, all addresses are stored at the same single index position.  Such a cache is "fully associative" and does not use the index field (or the index field has zero width).
The benefit of the index (when present in the cache architecture) is a simplification of the hardware: it has to only look at blocks stored at the index of A, and never at blocks stored at other indexes within the cache.
The benefit of not using an index is that one address will never evict another merely due to having the same index; fully associative caches are subject to less cache thrashing.

Related

index and tag switching position

Let us consider the alternative {Index, Tag, Offset}. The usage and size of each field remain the same, e.g. the index is still used to locate a block in the cache, and its bit length is still determined by the number of cache blocks. The only difference is that we now use the most significant bits for the index, the middle portion for the tag, and the last portion for the offset.
What do you think is the shortcoming of this scheme?
This will work, and if the cache is fully associative it won't matter (as there is no index, it's all tag), but if the associativity is limited it will make (far) less effective use of the cache memory.  Why?
Consider an object that is sufficiently large to cross a cache block boundary.
When accessing the object, the addresses of some of its fields will not be in the same cache block as the others.  How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (the most significant bits), the tag will have changed between these two addresses, but the index will be the same.  Thus, there will be a collision at that index, which uses up one of the ways of the set associativity.  If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated access to the same object.
Let's pretend that we have a 12-bit address space, and the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields located at 0x248, so that two of the fields, a and b, are at 0x248 and 0x24c, and the other two, c and d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high-order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in a direct mapped cache both the a & b fields and the c & d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change — it stays at 2.  Thus, on a 1-way set associative cache, accessing fields a or b followed by c or d will evict the a&b fields, which will result in a miss if/when a or b are accessed later.
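As a quick check of that arithmetic, here is a small illustrative program (not part of the original answer) that pulls out the two candidate index digits for both addresses:
#include <stdio.h>

int main(void)
{
    unsigned addrs[2] = { 0x248, 0x250 };            /* the a/b block vs. the c/d block        */
    for (int i = 0; i < 2; i++) {
        unsigned a   = addrs[i];
        unsigned hi  = (a >> 8) & 0xF;               /* high hex digit                          */
        unsigned mid = (a >> 4) & 0xF;               /* middle hex digit                        */
        /* tag|index|offset: index is the middle digit -> 4 then 5 (different cache entries)   */
        /* index|tag|offset: index is the high digit   -> 2 then 2 (same entry, collision)     */
        printf("0x%03X: middle digit %X, high digit %X\n", a, mid, hi);
    }
    return 0;
}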
So, it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as it accessed before.  This happens as we manipulate individual objects, and as we allocate objects that end up being adjacent, and as we repeat accesses to an array (e.g. 2nd loop over an array).
If the index is in the middle, we get more variation as we use different addresses within some block or chunk or area of memory.  In our 12-bit address space example, the index changes every 16 bytes, so adjacent 16-byte blocks can be stored in the cache at the same time.
But if the index is at the beginning we need to consume more memory before we get to a different index — the index changes only every 256 bytes, so two adjacent 16-byte blocks will often have collisions.
Our programs and compilers are generally written assuming locality is favored by the cache — and this means that the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.

Indexing text file for microcontroller

I need to search for a specific record in a large file. The search will be performed on a microprocessor (ESP8266), so I'm working with limited storage and RAM.
The list looks like this:
BSSID,data1,data2
001122334455,float,float
001122334466,float,float
...
I was thinking of using an index to speed up the search. The data are static; the index will be built on a computer and then loaded onto the microcontroller.
What I've done so far is very simplistic.
I created an index on the first byte of the BSSID that points to the first and last file offsets of the records with that BSSID prefix.
The performance is terrible, but the index file is very small and uses very little RAM. I thought about going further with this method and indexing on the first two bytes, but the index table would be 256 times larger, resulting in a table about 1/3 the size of the data file.
This is the index with the first method:
00,0000000000,0000139984
02,0000139984,0000150388
04,0000150388,0000158812
06,0000158812,0000160900
08,0000160900,0000171160
What indexing algorithm do you suggest that I use?
EDIT: Sorry, I didn't include enough background before. I'm storing the data and index file in the flash memory of the chip. I have 30000 records at the moment, but this number could potentially grow until the chip's memory limit is hit. The set is indeed static once it is stored on the microcontroller, but it could be updated later with the help of a computer. The data isn't spread symmetrically between indexes. My goal is to find a good compromise between search speed, index size, and RAM used.
I'm not sure where you're stuck, but I can comment on what you've done so far.
Most of all, the way to determine the "best" method is to
define "best" for your purposes;
research indexing algorithms (basic ones have been published for over 50 years);
choose a handful to implement;
evaluate those implementations according to your definition of "best".
Keep in mind your basic resource restriction: you have limited RAM. If a method requires more RAM than you have, it doesn't work, and is therefore infinitely slower than any method that does work.
You've come close to a critical idea, however: you want your index table to expand to consume any free RAM, using that space as effectively as possible. If you can index 16 bits instead of 8 and still fit the table comfortably into your available space, then you've cut down your linear search time by roughly a factor of 256.
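(With the roughly 30000 records mentioned in the question's edit, an 8-bit prefix leaves on average about 30000 / 256 ≈ 117 records to scan per bucket, while a 16-bit prefix leaves well under one on average, though the uneven distribution the asker mentions will make some buckets much larger.)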
Indexing considerations
Don't put the ending value in each row: it's identical to the starting value in the next row. Omit that, and you save one word in each row of the table, giving you twice the table room.
Will you get better performance if you slice the file into equal parts (the same quantity of BSSIDs for each row of your table), and then store the entire starting BSSID with its record number? If your BSSIDs are heavily clumped, this might improve your overall processing, even though your table would have fewer rows. You can't use a direct index in this case; you have to search the first column to get the proper starting point.
Does that move you toward a good solution?
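A minimal sketch of that equal-count idea, assuming the PC-side tool has already sorted the records, packed each 48-bit BSSID into a uint64_t, and filled the table; the names here are hypothetical, not from the answer:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical equal-count index: one entry per slice of the sorted record file. */
typedef struct {
    uint64_t first_bssid;   /* BSSID of the first record in the slice (48 bits used) */
    uint32_t first_record;  /* record number where the slice starts                  */
} slice_t;

/* Binary search the slice table for the slice that could contain `bssid`.
   Returns the record number at which a linear scan of the data should start.
   Assumes bssid is not smaller than the first slice's starting BSSID; if it is,
   the scan of slice 0 simply fails to find it. */
static uint32_t slice_start(const slice_t *tab, size_t n, uint64_t bssid)
{
    size_t lo = 0, hi = n;                 /* candidate slice is in [lo, hi)         */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].first_bssid <= bssid)
            lo = mid;                      /* slice mid starts at or before bssid    */
        else
            hi = mid;                      /* slice mid starts after bssid           */
    }
    return tab[lo].first_record;
}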
Not sure how much memory you have (I am not familiar with that MCU), but do not forget that these tables are static/constant, so they can be stored in EEPROM instead of RAM; some chips have quite a lot of EEPROM, usually way more than RAM...
Assume your file is sorted by the index. So you have (assuming a 32-bit address) for each entry:
BYTE ix, DWORD beg,DWORD end
Why not this:
struct entry { DWORD beg, end; };
entry ix0[256];
Where the first BYTE is also the address into the index array. This will save 1 byte per entry.
Now, as Prune suggested, you can ignore the end address, as you will scan the following entries in the file anyway until you hit the correct index or an index with a different first BYTE. So you can use:
DWORD ix[256];
where you have only the start address beg.
Now we do not know how many entries you actually have, nor how many entries will share the same second BYTE of the index, so we cannot make any further assumptions to improve this...
You wanted to do something like:
DWORD ix[65536];
But you do not have enough memory for it ... how about doing something like this instead:
const int N = 1024; // number of entries you can store
const int dix = (max_index_value+1)/N;
const DWORD ix[N] = {.....};
so each entry ix[i] will cover all the indexes from i*dix to ((i+1)*dix)-1. So to find an index you do this:
i = ix[index/dix];
for (;i<file_size;)
  {
  read entry from file at i-th position;
  update position i;
  if (file_index == index) { do your stuff; break; }
  if (file_index >  index) { index not found; break; }
  }
To improve performance you can rewrite this linear scan into a binary search between the address ix[index/dix] and ix[(index/dix)+1] (or the file size for the last index) ... assuming each entry in the file has the same size ...
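A sketch of that refinement, assuming fixed-size records (so record numbers and file offsets are interchangeable) and a hypothetical read_bssid() helper standing in for the actual flash/file access:
#include <stdint.h>

#define N   1024                   /* number of groups the index is split into            */
extern uint32_t ix[N + 1];         /* start record of each group; ix[N] = total records   */
extern uint32_t dix;               /* how many index values each group covers             */

/* read_bssid(r) is assumed to read record r from flash and return its BSSID;
   it is a made-up stand-in for whatever access routine the ESP8266 code uses. */
extern uint64_t read_bssid(uint32_t record);

/* Binary search for `bssid` inside its group; index_value is the key value used
   to bucket the records (e.g. the leading bytes of the BSSID).
   Returns the record number, or -1 if not present. */
static int32_t find_record(uint64_t bssid, uint32_t index_value)
{
    uint32_t group = index_value / dix;            /* same bucketing as ix[index/dix] above */
    uint32_t lo = ix[group], hi = ix[group + 1];   /* search only this group's records      */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        uint64_t b = read_bssid(mid);
        if (b == bssid) return (int32_t)mid;       /* found                                  */
        if (b < bssid)  lo = mid + 1;              /* keep the upper half                    */
        else            hi = mid;                  /* keep the lower half                    */
    }
    return -1;                                     /* not present                            */
}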

How does Direct mapped cache implement spatial locality?

An array is declared as float A[2048]. Each array element is 4 bytes in size. This program is run on a computer that has a direct mapped data cache of size 8 Kbytes, with a block (line) size of 16 bytes.
Which elements of the array conflict with element A[0] in the data cache?
Ultimately A[0], A[512], A[1024], A[1536] map to cache block 0
As per my understanding, when A[0] is required for the first time, A[0], A[1], A[2], A[3] (since one cache block can hold 4 elements) are brought into the cache and placed in cache blocks 0, 1, 2, and 3 respectively.
The other approach would be to bring only A[0] and place it in cache block 0 (spatial locality not used here).
What is the general practice in such a scenario?
All four elements A[3:0] are stored in cache block 0, since these 4 elements together form 16 B. Depending on how the hardware system is set up, the next 16 B are then stored in cache block 1. (The decision of which cache line (a 16 B contiguous data granule) maps into which set is made while designing the hardware and is based on certain bits of the address.)
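As a rough sketch of "certain bits of the address" for a 16-byte block size (the set count below is just the cache size divided by 16; this is illustrative, not the question's exact hardware):
#include <stdint.h>

/* For 16-byte blocks, the low 4 address bits are the byte offset within the block,
   and the next bits select the set; num_sets is cache_size / 16 for a direct mapped cache. */
static uint32_t block_offset(uint32_t addr)                  { return addr & 0xF; }
static uint32_t set_index(uint32_t addr, uint32_t num_sets)  { return (addr >> 4) % num_sets; }

/* Element A[i] of a float array starting at `base` lives at base + 4*i, so A[0..3]
   share one block (same addr >> 4) whenever `base` is 16-byte aligned. */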

Determining the tag of a cache

Given a cache with 256 blocks and 16 bytes per block, how can I determine the value of the tag field of a cache block that holds the 24-bit address 0x3CFBCF? Would there be any differences depending on whether the cache was direct-mapped, fully associative, or n-way set-associative?
This gives a great overview on finding the SET, TAG, and OFFSET of an address.
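The linked overview is not reproduced here, but a worked sketch for the numbers in this question: 16 bytes per block gives 4 offset bits, and 256 blocks direct mapped give 8 index bits, leaving 24 - 4 - 8 = 12 tag bits.
offset = 0x3CFBCF & 0xF         = 0xF
index  = (0x3CFBCF >> 4) & 0xFF = 0xBC
tag    = 0x3CFBCF >> 12         = 0x3CF
For a fully associative cache there is no index field, so the tag would be the upper 20 bits, 0x3CFBC; for an n-way set associative cache there are 256/n sets, so the index shrinks to log2(256/n) bits and the tag widens accordingly.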

How does direct mapped cache work?

I am taking a System Architecture course and I have trouble understanding how a direct mapped cache works.
I have looked in several places and they each explain it in a different manner, which gets me even more confused.
What I cannot understand is what the Tag and Index are, and how they are selected.
The explanation from my lecture is:
"Address divided is into two parts
index (e.g 15 bits) used to address (32k) RAMs directly
Rest of address, tag is stored and compared with incoming tag. "
Where does that tag come from? It cannot be the full address of the memory location in RAM, since that would render the direct mapped cache useless (when compared with a fully associative cache).
Thank you very much.
Okay. So let's first understand how the CPU interacts with the cache.
There are three layers of memory (broadly speaking): cache (generally made of SRAM chips), main memory (generally made of DRAM chips), and storage (generally magnetic, like hard disks). Whenever the CPU needs data from some particular location, it first searches the cache to see if it is there. The cache lies closest to the CPU in the memory hierarchy, hence its access time is the least (and its cost the highest), so if the data the CPU is looking for can be found there, it constitutes a 'hit', and the data is obtained from there for use by the CPU. If it is not there, then the data has to be moved from main memory to the cache before it can be accessed by the CPU (the CPU generally interacts only with the cache), which incurs a time penalty.
So to find out whether the data is in the cache or not, various schemes are applied. One is the direct mapped cache method. For simplicity, let's assume a memory system with 10 cache memory locations available (numbered 0 to 9) and 40 main memory locations available (numbered 0 to 39).
There are 40 main memory locations available, but only up to 10 can be accommodated in the cache. So now, by some means, the incoming request from the CPU needs to be redirected to a cache location. That has two problems:
How to redirect? Specifically, how to do it in a predictable way which will not change over time?
If the cache location is already filled with some data, the incoming request from the CPU has to identify whether the address from which it requires the data is the same as the address whose data is stored in that location.
In our simple example, we can redirect by a simple logic. Given that we have to map 40 main memory locations numbered serially from 0 to 39 to 10 cache locations numbered 0 to 9, the cache location for a memory location n can be n%10. So 21 corresponds to 1, 37 corresponds to 7, etc. That becomes the index.
But 37, 17, and 7 all correspond to 7. So to differentiate between them comes the tag. Just like the index is n%10, the tag is int(n/10). So now 37, 17, and 7 will have the same index 7, but different tags: 3, 1, and 0. That is, the mapping can be completely specified by the two values, tag and index.
So now if a request comes for address location 29, it will translate to a tag of 2 and an index of 9. The index corresponds to the cache location number, so cache location no. 9 will be queried to see if it contains any data, and if so, whether the associated tag is 2. If yes, it's a hit and the data will be fetched from that location immediately. If it is empty, or the tag is not 2, it means it contains data corresponding to some other memory address and not 29 (although it will have the same index, which means it contains data from an address like 9, 19, 39, etc.). So it is a miss, and the data from location no. 29 in main memory will have to be loaded into the cache at location 9 (with the tag changed to 2, and any data that was there before evicted), after which it will be fetched by the CPU.
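A toy sketch of that hit/miss check in C, matching the 10-location example (fetch_from_main_memory is a made-up stand-in for the slow path to main memory):
#include <stdbool.h>

/* Toy model of the example above: main memory location n maps to cache slot n % 10
   and is identified there by the tag n / 10. */
#define CACHE_SLOTS 10

static int  tag_of[CACHE_SLOTS];
static bool used[CACHE_SLOTS];
static int  data_of[CACHE_SLOTS];

extern int fetch_from_main_memory(int n);          /* hypothetical slow path            */

static int cached_read(int n)
{
    int index = n % 10, tag = n / 10;
    if (used[index] && tag_of[index] == tag)
        return data_of[index];                     /* hit: data is already cached       */
    data_of[index] = fetch_from_main_memory(n);    /* miss: evict whatever was there    */
    tag_of[index]  = tag;
    used[index]    = true;
    return data_of[index];
}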
Let's use an example. A 64 kilobyte cache with 16-byte cache lines has 4096 different cache lines.
You need to break the address down into three different parts.
The lowest bits are used to tell you the byte within a cache line when you get it back; this part isn't directly used in the cache lookup (bits 0-3 in this example).
The next bits are used to INDEX the cache. If you think of the cache as a big column of cache lines, the index bits tell you which row you need to look in for your data. (bits 4-15 in this example)
All the other bits are TAG bits. These bits are stored in the tag store for the data you have stored in the cache, and we compare the corresponding bits of the cache request to what we have stored to figure out if the data we are caching is the data being requested.
The number of bits you use for the index is log_base_2(number_of_cache_lines) [it's really the number of sets, but in a direct mapped cache, there are the same number of lines and sets]
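In code form, the split for this 64 KB / 16-byte-line example could look like the following sketch (assuming 32-bit addresses):
#include <stdint.h>

/* 64 KB cache with 16-byte lines = 4096 lines, so bits 0-3 are the byte within the line,
   bits 4-15 are the index, and the remaining upper bits are the tag. */
static uint32_t byte_in_line(uint32_t addr) { return addr & 0xF; }
static uint32_t line_index(uint32_t addr)   { return (addr >> 4) & 0xFFF; }
static uint32_t tag_bits(uint32_t addr)     { return addr >> 16; }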
A direct mapped cache is like a table that has rows, also called cache lines, and at least 2 columns: one for the data and the other one for the tag.
Here is how it works: a read access to the cache takes the middle part of the address, called the index, and uses it as the row number. The data and the tag are looked up at the same time.
Next, the tag is compared with the upper part of the address to decide if the line is from the same address range in memory and is valid. At the same time, the lower part of the address can be used to select the requested data from the cache line (I assume a cache line can hold data for several words).
I emphasized that the data access and the tag access + compare happen at the same time, because that is key to reducing the latency (the purpose of a cache). The data-path RAM access doesn't need to be two steps.
The advantage is that a read is basically a simple table lookup and a compare.
But it is direct mapped, which means that for every read address there is exactly one place in the cache where this data could be cached. So the disadvantage is that a lot of other addresses would be mapped to the same place and may compete for this cache line.
I have found a good book at the library that offered me the clear explanation I needed, and I will now share it here in case some other student stumbles across this thread while searching about caches.
The book is "Computer Architecture: A Quantitative Approach", 3rd edition, by Hennessy and Patterson, page 390.
First, keep in mind that the main memory is divided into blocks for the cache.
If the cache blocks are 64 bytes each and we have 1 GB of RAM, the RAM would be divided into 16M blocks of 64 bytes each (1 GB of RAM / 64 B block size = 2^24 = 16M blocks).
From the book:
Where can a block be placed in a cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The destination block is calculated using this formula: <RAM Block Address> MOD <Number of Blocks in the Cache>
So, let's assume we have 32 blocks of RAM and 8 blocks of cache.
If we want to store block 12 from RAM to the cache, RAM block 12 would be stored into Cache block 4. Why? Because 12 / 8 = 1 remainder 4. The remainder is the destination block.
If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed anywhere in a restricted set of places in the cache, the cache is set associative.
Basically, a set is a group of blocks in the cache. A block is first mapped onto a set and then the block can be placed anywhere inside the set.
The formula is: <RAM Block Address> MOD <Number of Sets in the Cache>
So, let's assume we have 32 blocks of RAM and a cache divided into 4 sets (each set having two blocks, meaning 8 blocks in total). This way set 0 would have blocks 0 and 1, set 1 would have blocks 2 and 3, and so on...
If we want to store RAM block 12 into the cache, the RAM block would be stored in cache block 0 or 1. Why? Because 12 / 4 = 3 remainder 0. Therefore set 0 is selected, and the block can be placed anywhere inside set 0 (meaning blocks 0 and 1).
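Written out as a small sketch for the numbers used above (32 RAM blocks, 8 cache blocks, 4 sets), the two placement rules are:
/* Direct mapped:   <RAM block address> MOD <number of cache blocks>; block 12 -> cache block 4.
   Set associative: <RAM block address> MOD <number of sets>;         block 12 -> set 0.        */
static unsigned direct_mapped_block(unsigned ram_block) { return ram_block % 8; }
static unsigned set_of_block(unsigned ram_block)        { return ram_block % 4; }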
Now I'll go back to my original problem with the addresses.
How is a block found if it is in the cache?
Each block frame in the cache has an address. Just to make it clear, a block has both address and data.
The block address is divided into multiple pieces: Tag, Index and Offset.
The tag is used to find the block inside the cache, the index only shows the set in which the block is situated (so it does not need to be stored with the block), and the offset is used to select the data.
By "select the data" I mean that in a cache block there will obviously be more than one memory locations, the offset is used to select between them.
So, if you want to imagine a table, these would be the columns:
TAG | INDEX | OFFSET | DATA 1 | DATA 2 | ... | DATA N
Tag would be used to find the block, index would show in which set the block is, offset would select one of the fields to its right.
I hope that my understanding of this is correct, if it is not please let me know.
