Instruction cache for pipelined simulator - caching

I am trying to complete a simulator for a simplified MIPS computer using Java. I believe I have completed the pipeline logic needed for my assignment, but I am having a hard time understanding what the instruction and data caches are supposed to do.
The instruction cache should be direct-mapped with 4 blocks and the block size is 4 words.
So I am really confused about what the cache is actually doing. Is it going to memory and pulling the instruction from memory? For example, would one block hold just the add instruction?
Would it make sense to implement it as a 2 dimensional array?

First you should know the basics of a cache. You can think of a cache as an intermediate memory that sits between DRAM (main memory) and your processor, but is much more limited in size. When you try to access a memory location, the processor searches for it in the cache first. If it is found there (a cache hit), the processor takes that data and resumes execution; a hit typically costs only a few clock cycles, say 1 or 2. If the data is not found in the cache (a cache miss), it is fetched from main memory, filled into the cache, and fed to the processor, which stalls until the data arrives. A miss normally takes on the order of a few hundred clock cycles, depending on the DRAM you are using. The amount of data fetched from DRAM on a miss is equal to the cache line (block) size; this is what exploits spatial locality of reference, which is worth reading up on.
I think this should get you a start.
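To answer your second question directly: yes, a two-dimensional array is a reasonable way to represent a direct-mapped cache with 4 blocks of 4 words, as long as you also keep a tag and a valid bit per block. Below is a minimal sketch in Java, assuming a word-addressed int[] main memory; the class and method names are only illustrative, not part of any required interface.

// Direct-mapped instruction cache: 4 blocks, 4 words per block.
// A sketch only; the names and the word-addressed memory are assumptions.
public class InstructionCache {
    private static final int NUM_BLOCKS = 4;
    private static final int WORDS_PER_BLOCK = 4;

    private final int[][] blocks = new int[NUM_BLOCKS][WORDS_PER_BLOCK]; // cached instruction words
    private final int[] tags = new int[NUM_BLOCKS];                      // tag stored per block
    private final boolean[] valid = new boolean[NUM_BLOCKS];            // valid bit per block
    private final int[] memory;                                          // word-addressed main memory

    public InstructionCache(int[] memory) {
        this.memory = memory;
    }

    // Fetch one instruction word; on a miss the whole 4-word block is filled from memory.
    public int fetch(int wordAddress) {
        int offset = wordAddress % WORDS_PER_BLOCK;        // which word inside the block
        int blockNumber = wordAddress / WORDS_PER_BLOCK;   // which block of memory
        int index = blockNumber % NUM_BLOCKS;              // which cache block it maps to
        int tag = blockNumber / NUM_BLOCKS;                // remaining high-order bits

        if (!valid[index] || tags[index] != tag) {         // cache miss
            int blockStart = blockNumber * WORDS_PER_BLOCK;
            for (int i = 0; i < WORDS_PER_BLOCK; i++) {
                blocks[index][i] = memory[blockStart + i]; // pull the whole block from memory
            }
            tags[index] = tag;
            valid[index] = true;
        }
        return blocks[index][offset];                      // hit (or block just filled)
    }

    public static void main(String[] args) {
        int[] memory = new int[64];
        for (int i = 0; i < memory.length; i++) memory[i] = i;   // fake "instructions"
        InstructionCache cache = new InstructionCache(memory);
        System.out.println(cache.fetch(5));   // miss: fills words 4..7, prints 5
        System.out.println(cache.fetch(6));   // hit: same block, prints 6
    }
}

If your simulator's program counter counts bytes rather than words, divide it by 4 before calling fetch. The important behaviour is that a miss pulls in the whole 4-word block, so the next few sequential instruction fetches hit.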

Related

How can assembly language make a computer with a specific cache design run faster?

I am new to assembly language and cache design, and recently our professor gave us a question about writing assembly language instructions to make a computer with a specific cache design run faster. I have no clue how to use assembly to improve performance. Can I get any hints?
The two cache designs are like this:
Cache A: 128 sets, 2-way set associative, 32-byte blocks, write-through, and no-write-allocate.
Cache B: 256 sets, direct-mapped, 32-byte blocks, write-back, and write-allocate.
The question is:
Describe a little assembly language program snippet, two instructions are sufficient, that makes Computer A (uses the Cache A design) run as much faster as possible than Computer B (uses the Cache B design).
And there is another question asking the opposite:
Write a little assembly language program snippet, two instructions are sufficient, that makes Computer B run as much faster as possible than Computer A.
To be slow with the direct-mapped cache but fast with the associative cache, your best bet is probably 2 loads [1].
Create a conflict-miss due to cache aliasing on that machine but not the other. i.e. 2 loads that can't both hit in cache back-to-back, because they index the same set.
Assume the snippet will be run in a loop, or that the cache is already hot for some other reason before your snippet runs. You can probably also assume that a register holds a valid pointer with some known alignment relative to a 32-byte cache-line boundary, i.e. you can set preconditions for your snippet.
Footnote 1: Or maybe stores, but load misses more obviously need to stall the CPU, because they can't be hidden by a store buffer; they can only be hidden by scoreboarding that defers the stall until the load result is actually used.
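For example (a sketch with made-up addresses, and only one of several valid answers), two load addresses that differ by the total cache size, 8 KiB for both designs, map to the same set of each cache. Cache A's two ways let both lines stay resident, while Cache B keeps evicting one line for the other:

// Illustrative only: shows which set two aliasing addresses fall into
// for the two cache designs. The addresses are arbitrary choices.
public class SetIndexDemo {
    static int setIndex(int address, int blockBytes, int numSets) {
        return (address / blockBytes) % numSets;
    }

    public static void main(String[] args) {
        int addrX = 0x1000;           // hypothetical first load address
        int addrY = addrX + 8 * 1024; // second address, exactly one cache size (8 KiB) away

        // Cache A: 128 sets, 2-way, 32-byte blocks -> 8 KiB total
        System.out.println("Cache A set of X: " + setIndex(addrX, 32, 128)); // 0
        System.out.println("Cache A set of Y: " + setIndex(addrY, 32, 128)); // 0
        // Same set, but two ways: both lines can stay resident, so both loads hit.

        // Cache B: 256 sets, direct-mapped, 32-byte blocks -> 8 KiB total
        System.out.println("Cache B set of X: " + setIndex(addrX, 32, 256)); // 128
        System.out.println("Cache B set of Y: " + setIndex(addrY, 32, 256)); // 128
        // Same set and only one way: each load evicts the other, a conflict miss every time.
    }
}

So a two-load snippet reading from X and X + 8192 in a loop hits every iteration on Computer A and conflict-misses every iteration on Computer B.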
To make the write-through / no-write-allocate cache run slow, maybe store and then load an adjacent address, or the address you just stored. On a write-back / write-allocate cache, the load will hit. (But only after waiting for the store miss to bring the data into cache.)
Reloading the same address you just stored could be fast on both machines if there's also a store buffer with store-forwarding.
And subsequent runs of the same snippet will get cache hits, because the load would allocate the line in the cache.
If your machine is CISC with post-increment addressing modes, there's more you can do with just 2 instructions if you imagine them as a loop body. It's unclear what kind of pre-conditions you're supposed to / allowed to assume for the cache.
Just 2 stores to the same line or even same address can demonstrate the cost of write-through: with write-back + write-allocate, you'll get a hit on the 2nd store.
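To put rough numbers on the write-policy difference, here is a toy model (my own simplification: it ignores the eventual write-back of the dirty line and any store buffer) that counts main-memory transfers for "store to a cold line, then load an adjacent word of that line":

// Toy model only: counts memory transfers for "store [X]; load [X+4]"
// starting with the line not in cache. Real timing details are omitted.
public class WritePolicyDemo {
    static int memoryTransfers(boolean writeAllocate) {
        int transfers = 0;
        boolean lineInCache = false;

        // The store misses:
        if (writeAllocate) {
            transfers++;          // write-allocate: fetch the line, then write into the cache
            lineInCache = true;
        } else {
            transfers++;          // no-write-allocate + write-through: write goes straight to memory
        }

        // The load of an adjacent word in the same 32-byte line:
        if (!lineInCache) {
            transfers++;          // load miss: the line has to be fetched now
        }
        return transfers;
    }

    public static void main(String[] args) {
        System.out.println("write-back + write-allocate    : " + memoryTransfers(true));  // 1
        System.out.println("write-through + no-write-alloc : " + memoryTransfers(false)); // 2
    }
}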

Cache for heap memory access

In general, desktop CPUs have two kinds of cache to speed up memory access:
1) Instruction cache -> to speed up fetching of executable instructions.
2) Data cache -> to speed up data loads and stores.
As per my understanding, the instruction cache operates on the code segment of a program and the data cache operates on the data segment. Is this right?
Is there no cache advantage for memory allocated from the heap? Is heap memory access covered by the data cache?
The instruction cache operates on the code segment of a program and the data cache operates on the data segment of a program. Is this right?
No, the CPU is unaware of segments.
The instruction cache is used for all execution accesses, whether they are performed inside the code segment or in the heap as dynamically created code.
The data cache is used for all other, non-execution accesses. Data can live in the data segment, on the heap, or even in the code segment as constants.
As per my understanding, the instruction cache operates on the code segment of a program and the data cache operates on the data segment. Is this right?
Is there no cache advantage for memory allocated from the heap? Is heap memory access covered by the data cache?
Memory is memory. The CPU can't tell the difference between the heap and the data segment.
Instruction caches usually just start with the address in the program counter and grab the next N bytes. The CPU still can't tell whether that address belongs to a code segment or a data segment.
When you write a program, it gets translated into machine-readable binary. When the CPU executes instructions, it fetches this binary, decodes what it means, and then executes it. Basically, this binary tells the CPU which instructions to execute. If this binary were stored only in main memory, the CPU would have to access main memory during every fetch stage, which is really slow. Instead, we store some of it in a cache closer to the CPU. Since this cache contains only the binary for the instructions to be executed, we call it the instruction cache. Now, instructions need data to operate on. In your high-level code you might have something like
arrayA[i] = arrayB[i] + arrayC[i], which will translate into a machine instruction similar to
ADD memLocationStoredInRegisterA, memLocationStoredInRegisterB, memLocationStoredInRegisterC
This instruction is stored in the instruction cache, but the data, i.e. arrayA, arrayB and arrayC, is stored in another portion of memory. Again, it would be wasteful to access main memory every time this instruction executes, so we keep some of this data in another cache, which we call the data cache.
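In a simulator (like the one in the first question above), this split simply shows up as two separate cache objects: one consulted by the fetch stage and one by the memory stage. A tiny sketch, with an invented Cache interface, made-up addresses, and no hit/miss logic, just to show which accesses go where:

// Illustrative only: routes instruction fetches and data accesses through
// two separate caches and counts how many accesses each one sees.
interface Cache {
    int read(int address);
}

class CountingCache implements Cache {
    private final int[] memory;
    int accesses;                          // how many reads went through this cache

    CountingCache(int[] memory) { this.memory = memory; }

    @Override
    public int read(int address) {
        accesses++;
        return memory[address];            // hit/miss bookkeeping omitted for brevity
    }
}

public class SplitCacheDemo {
    public static void main(String[] args) {
        int[] memory = new int[1024];
        CountingCache iCache = new CountingCache(memory); // instruction fetches only
        CountingCache dCache = new CountingCache(memory); // operand loads only

        // Simulate the ADD above running over 10 array elements:
        for (int i = 0; i < 10; i++) {
            iCache.read(100);               // fetch the ADD instruction (same PC each iteration)
            int b = dCache.read(200 + i);   // load arrayB[i]
            int c = dCache.read(300 + i);   // load arrayC[i]
            memory[400 + i] = b + c;        // store arrayA[i] (the store path is not modeled here)
        }
        System.out.println("instruction-cache accesses: " + iCache.accesses); // 10
        System.out.println("data-cache accesses:        " + dCache.accesses); // 20
    }
}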

Why implement data cache and instruction cache to reduce miss? [duplicate]

I am stuck on a question like this:
In the context of a memory hierarchy why implement data cache and
instruction cache?
I replied that it is useful to decrease the number of conflict misses and capacity misses. But can the data cache and the instruction cache be sized according to the amount of data and the number of instructions? I assumed that the amount of data is larger than the number of instructions (often we need two operands to execute one instruction), and that the data cache and instruction cache are sized according to those numbers. Is that true or completely wrong? If it is wrong, why implement a data cache and an instruction cache to reduce misses?
The idea of a cache is to deliver cached data in 1 cycle to keep the CPU running at maximum speed.
Now, all CPUs today are pipelined. This means that they have independent stages that, for example, fetch an instruction, decode it, fetch the operands, execute the instruction, and write back the result. Whenever possible, all of these pipeline stages work at the same time on different instructions.
For maximum speed, an instruction fetch has to be done at the same time as the operand fetch of an earlier instruction that was decoded before. Both can only be done at the same time in one cycle (in the optimal case) if there are separate instruction and data caches.
Another possible reason to have two caches (instruction and data) is thrashing. Imagine a situation where your instruction and your data reside in two memory locations whose index bits are the same. Assuming a direct-mapped unified cache (cheeky, I know), it goes like this:
Fetch the instruction from memory, calculate the index and store it there.
Decode the instruction and get the address of data.
Now fetch the data from memory, calculate the index to store data.
There is some data in that location, well too bad, flush it to next level cache and store the newly fetched data.
Execute the instruction.
It's time to fetch the next instruction; well, it's a cache miss, since we replaced that cache entry with our data. Now go fetch it from memory again.
When we fetch it, we have to evict our data again, as it has the same index.
So we will continually be swapping the data and the instruction in and out of the same cache line, a.k.a. thrashing.
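Here is a toy model of exactly that scenario (the block numbers and cache size are my own choices): one instruction block and one data block that share an index in a small unified direct-mapped cache, accessed alternately:

// Toy model only: a unified direct-mapped cache where the instruction
// address and the data address map to the same index, so they keep
// evicting each other. Block numbers and cache size are invented.
public class ThrashingDemo {
    public static void main(String[] args) {
        int numBlocks = 4;
        int[] tagAtIndex = new int[numBlocks];
        boolean[] valid = new boolean[numBlocks];
        int misses = 0;

        int instrBlock = 8;   // block number holding the instruction
        int dataBlock = 12;   // block number holding the data; 12 % 4 == 8 % 4, same index

        for (int cycle = 0; cycle < 10; cycle++) {
            for (int block : new int[] { instrBlock, dataBlock }) {  // fetch instruction, then load data
                int index = block % numBlocks;
                if (!valid[index] || tagAtIndex[index] != block) {
                    misses++;                      // conflict miss: evict whatever was here
                    tagAtIndex[index] = block;
                    valid[index] = true;
                }
            }
        }
        System.out.println("misses with a unified cache: " + misses);  // 20: every single access misses
        // With a split I-cache and D-cache, each stream keeps its own block: only 2 cold misses.
    }
}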

How could cache improve the performance of a pipeline processor?

I understand that accessing cache is much faster than accessing the main memory and I have a basic idea of all those miss rate and miss penalty stuff.
But this just came across my mind : how could cache be useful in a pipeline processor?
From my understanding, the time a single clock cycle takes is lower-bounded by the longest time taken among all the stages. For example, if accessing the cache takes 1 ns and accessing main memory takes 10 ns, then the clock cycle time should be at least 10 ns; otherwise that task could not be completed when needed. Then even if the cache access completes early, the instruction still has to wait there until the next clock cycle.
I was imagining a basic 5-stage pipeline which includes instruction fetch, decode, execute, memory access and write-back.
Am I completely misunderstanding something? Or maybe in reality we have a much more complex pipeline, where memory access is broken down into several pieces like cache checking and main-memory access, so that if we get a hit we can somehow skip the next cycle? But there would be a problem too if the previous instruction didn't skip a cycle while the current instruction does...
I am scratching my head off... Any explanation would be highly appreciated!
The cycle time is not lower-bounded by the longest time taken among all processes.
Actually, a RAM access can take hundreds of cycles.
There are different processor architectures, but typical numbers might be:
1 cycle to access a register.
4 cycles to access L1 cache.
10 cycles to access L2 cache.
75 cycles to access L3 cache.
hundreds of cycles to access main memory.
In the extreme case, if a computation is memory-intensive, and constantly missing the cache, the CPU will be very under-utilized, as it requests to fetch data from memory and waits until the data is available. On the other hand, if an algorithm needs to repeatedly access the same region of memory that fits entirely in L1 cache (like inverting a matrix that is not too big), the CPU will be much better utilized: The algorithm will start by fetching the data into cache, and the rest of the algorithm will just use the cache to read and write data values.
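If you want to see this effect for yourself, here is a rough Java microbenchmark sketch along those lines; the array sizes, the stride and the timing method are ad-hoc choices of mine, so treat the output as qualitative rather than as precise latencies:

// Rough sketch only: compares striding through a small, cache-resident
// array with striding through a large array that keeps missing the caches.
// May need a larger heap (e.g. -Xmx1g) for the 256 MiB array.
public class LocalityDemo {
    // Walks the array with a large stride so successive accesses land on
    // different cache lines; wraps around when it runs off the end.
    static long stridedSum(int[] data, long accesses) {
        final int stride = 4099;               // arbitrary prime, defeats simple prefetching
        long sum = 0;
        int index = 0;
        for (long i = 0; i < accesses; i++) {
            sum += data[index];
            index += stride;
            if (index >= data.length) {
                index -= data.length;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] small = new int[4 * 1024];          // 16 KiB: fits in a typical L1 data cache
        int[] large = new int[64 * 1024 * 1024];  // 256 MiB: far larger than any cache level
        long accesses = 50_000_000L;

        stridedSum(small, accesses);              // warm-up so the JIT has compiled stridedSum
        long t0 = System.nanoTime();
        long a = stridedSum(small, accesses);
        long t1 = System.nanoTime();
        long b = stridedSum(large, accesses);
        long t2 = System.nanoTime();

        System.out.println("cache-resident pass: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("cache-missing pass:  " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println(a + b);                // use the sums so they are not optimized away
    }
}

On typical hardware the second pass is many times slower per access, which is exactly the under-utilization described above.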

Does larger cache size always lead to improved performance?

Since the cache inside the processor increases instruction execution speed, I'm wondering what would happen if we increased the cache size to many MBs, or even something like 1 GB. Is that possible? If it is, will increasing the cache size always result in increased performance?
There is a tradeoff between cache size and hit rate on one side, and read latency and power consumption on the other. So the answer to your first question is: technically (probably) possible, but unlikely to make sense, since the L3 cache in modern CPUs, with a size of just a few MBs, already has a read latency of dozens of cycles.
Performance depends more on the memory access pattern than on cache size. More precisely, if the program is mainly sequential, cache size is not a big deal. If there is a lot of random access (e.g. when associative containers are actively used), cache size really matters.
The above holds for single computational tasks. In a multiprocess environment with several active processes, a bigger cache is generally better, because it reduces inter-process contention.
This is a simplification, but one of the primary reasons a cache increases 'speed' is that it provides fast memory very close to the processor, which is much faster to access than main memory. So, in theory, increasing the size of the cache should allow more information to be stored in this 'fast' memory and thereby improve performance. In the real world things are obviously much more complex than this, and there is added complexity and cost associated with such a large cache, and with dealing with issues like cache coherency, caching algorithms, etc.
A cache stores data temporarily; it is used to quickly find data that has been used frequently. If the cache were increased to 1 GB or more, it would no longer behave like a cache; it would effectively be another RAM. RAM also holds data temporarily, but because it is so large (4 GB or more), it takes the RAM longer to find and fetch the data the processor asks for. So we use the cache as temporary storage for the things we used recently or frequently: the processor accesses the cache directly, and because the cache is small, finding the data there is quick and the RAM does not need to be involved at all. As an analogy, take a large classroom (the RAM), a principal (the processor) and the class representative (the data). If the principal calls for the representative, someone has to find them among 1,000 students in the room, which takes time. If we reserve a specific seat (the cache) for the representative, because the principal calls for them so often, they can be found immediately.
