How can assembly language make a computer with a specific cache design run faster? - caching

I am new to assembly language and cache design, and recently our professor gave us a question about writing assembly language instructions to make computers with a specific cache design run faster. I have no clue how to use assembly to improve performance. Can I get any hints?
The two cache designs are like this:
Cache A: 128 sets, 2-way set-associative, 32-byte blocks, write-through, and no-write-allocate.
Cache B: 256 sets, direct-mapped, 32-byte blocks, write-back, and write-allocate.
The question is:
Describe a little assembly language program snippet (two instructions are sufficient) that makes Computer A (which uses the Cache A design) run as much faster as possible than Computer B (which uses the Cache B design).
And there is another question asking the opposite:
Write a little assembly language program snippet (two instructions are sufficient) that makes Computer B run as much faster as possible than Computer A.

To be slow with the direct-mapped cache but fast with the associative cache, your best bet is probably 2 loads (see footnote 1).
Create a conflict-miss due to cache aliasing on that machine but not the other. i.e. 2 loads that can't both hit in cache back-to-back, because they index the same set.
Assume the snippet will be run in a loop, or that the cache is already hot for some other reason before your snippet runs. You can probably also assume that a register holds a valid pointer with some known alignment relative to a 32-byte cache-line boundary, i.e. you can set pre-conditions for your snippet.
Footnote 1: Or maybe stores, but load misses more obviously need to stall the CPU because they can't be hidden by a store buffer, only by scoreboarding so the CPU doesn't stall until the load result is actually used.
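As a sketch of the access pattern (in C rather than the two asm instructions the assignment wants; the buffer name and loop are just illustrative): Cache B has 256 sets x 32 bytes = 8 KiB, so two addresses 8 KiB apart index the same set and, being direct-mapped, keep evicting each other. Cache A maps them to one set too, but its 2 ways can hold both lines at once.

/* Minimal sketch, not the required asm: two loads whose addresses are
 * 8 KiB apart.  On Cache B (256 sets * 32 B = 8 KiB, direct-mapped)
 * they conflict and evict each other every iteration.  Cache A also
 * puts them in one set (128 sets * 32 B = 4 KiB, and 8 KiB is a
 * multiple of that), but its 2 ways hold both lines, so after the
 * first pass every load hits. */
#include <stdint.h>

volatile uint8_t buf[2 * 8192];          /* two blocks 8 KiB apart */

uint32_t conflict_loop(int iters)
{
    uint32_t sum = 0;
    for (int i = 0; i < iters; i++) {
        sum += buf[0];                   /* load 1: indexes set 0 in both caches */
        sum += buf[8192];                /* load 2: same set again               */
    }
    return sum;                          /* hits on Computer A, misses on B      */
}

Compiled down, the loop body is essentially just the two loads the question asks for.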
To make the write-through / no-write-allocate cache run slow, maybe store and then load an adjacent address, or the address you just stored. On a write-back / write-allocate cache, the load will hit. (But only after waiting for the store miss to bring the data into cache.)
Reloading the same address you just stored could be fast on both machines if there's also a store buffer with store-forwarding.
And subsequent runs of the same snippet will get cache hits, because the load would allocate the line in cache.
If your machine is CISC with post-increment addressing modes, there's more you can do with just 2 instructions if you imagine them as a loop body. It's unclear what kind of pre-conditions you're supposed to / allowed to assume for the cache.
Just 2 stores to the same line or even same address can demonstrate the cost of write-through: with write-back + write-allocate, you'll get a hit on the 2nd store.
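A sketch of that two-stores-to-one-line pattern (again in C rather than the required two asm instructions; the array name is illustrative):

/* On Cache B (write-back, write-allocate) the first store misses once,
 * allocates the line, and every later store just hits and dirties it.
 * On Cache A (write-through, no-write-allocate) every single store is
 * written through to memory and never allocates, so the memory
 * latency/bandwidth cost is paid on each one. */
#include <stdint.h>

volatile uint32_t line[8];               /* one 32-byte block (8 * 4 B) */

void store_loop(int iters)
{
    for (int i = 0; i < iters; i++) {
        line[0] = i;                     /* store to the block           */
        line[1] = i;                     /* second store, same block     */
    }
}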


Can you directly access the cache using assembly?

Caching is a core thing when it comes to efficiency.
I know that caching usually happens automatically.
However, I'd like to control cache usage myself, because I think that I can do better than some heuristics that don't know the exact program.
Therefore I would need assembly instructions to directly move to or from cache memory cells.
like:
movL1 address content
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough, because the hints could be ignored or they maybe aren't sufficient to express anything expressible by such a move-to/from-cache instruction.
Are there any assemblers that allow for complete cache control?
Side note: why I'd like to improve caching:
consider a hypothetical CPU with 1 register and a cache containing 2 cells.
consider the following two programs:
(where x,y,z,a are memory cells)
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move z to x"
"move y to x"
"END"
"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move a to x"
"move y to x"
"END"
In the first case, you'd use the register and the cache for x,y,z (a is only written to once)
In the second case, you'd use the register and the cache for a,x,y (z is only written to once)
If the CPU does the caching, it simply can't decide ahead of time which of the two above cases it's facing.
It has to decide, for each of the memory cells x,y,z, whether its contents should be cached before it knows whether the program being executed is no. 1 or no. 2, because both programs start out the same.
The programmer on the other hand knows ahead of time which memory cells are reused, and when they are reused.
Peter Cordes wrote:
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
This is correct, but the exceptions are of interest....
It is common in DSP ("Digital Signal Processing") chips to provide a limited ability to partition SRAM between "cache" and "scratchpad memory" functionality. There are lots of white papers and reference guides on this topic -- an example is http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf. In this chip, there are three blocks of SRAM -- a small "Level-1 Instruction" SRAM, a small "Level-1 Data" SRAM, and a larger "Level-2" SRAM. Each of the three can be partitioned between Cache and directly-addressed memory, with the details depending on the specific chip. For example, a chip may allow no cache, 1/4 SRAM as cache, 1/2 SRAM as cache, or all SRAM as cache. (The ratios are limited so the allowed cache sizes can be indexed efficiently.)
The IBM "Cell" processor (used in the Sony PlayStation 3, released in 2006) was a multi-core chip with one ordinary general-purpose core and eight co-processor cores. The co-processor cores had a limited instruction set, with load and store instructions that could only access their private 128KiB "scratchpad" memory. In order to access main memory, the co-processors had to program a DMA engine to perform a block copy of main memory to local scratchpad memory (or vice versa). This approach provided (and required) perfect control over data motion, resulting in (a very small amount of) very high-performance software.
Some GPUs also have small on-chip SRAMs that can be configured as either an L1 cache or as explicitly controlled local memory.
All of these are considered to be "very hard" (or worse) to use, but this can be the right approach if the product requires very low cost, completely predictable performance, or very low power.
On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.
Of course, a normal load will definitely bring a cache line into L1d cache, at least temporarily. Nothing stops it from being evicted later, though. e.g. on x86-64: mov eax, [rdi] instead of prefetcht0 [rdi].
Before dedicated prefetch instructions existed, using a plain load as a prefetch was sometimes done (e.g. ahead of some loop-bounds calculations before entering a loop that would start looping over an array). For performance purposes, best-effort software prefetch instructions that the CPU can ignore are usually better.
A plain load has the downside of not being able to retire from the out-of-order back-end until the loaded data actually arrives. (At least I think it can't on x86 CPUs with x86's strongly ordered memory model. Weakly-ordered ISAs that allow out-of-order loads might let the load retire even if it hasn't truly completed yet.) Software prefetch instructions exist to allow prefetch as a hint without bottlenecking the CPU on waiting for the load to finish.
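In C terms, the two options look roughly like this (GCC/Clang builtins assumed; the function names are just for illustration):

#include <stdint.h>

void warm_with_load(const uint32_t *p)
{
    /* Plain demand load: definitely brings the line into L1d, but the
     * load can't retire until the data arrives (on x86, at least). */
    (void)*(volatile const uint32_t *)p;
}

void warm_with_prefetch(const uint32_t *p)
{
    /* Best-effort hint: typically compiles to something like
     * prefetcht0 [rdi]; the CPU may drop it, and it never blocks
     * retirement waiting for the data. */
    __builtin_prefetch(p, 0 /* read */, 3 /* keep in all cache levels */);
}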
On modern x86, forced eviction of a cache line is possible. NT stores guarantee that on Pentium-M or newer (or maybe only on CPUs after Pentium-M, I forget which). Also, clflush and clflushopt exist specifically for that.
clflush is not just a hint that the CPU can drop; it guarantees correctness for non-volatile DIMMs like Optane DC PM. Why does CLFLUSH exist in x86?
Being guaranteed, not just a hint, makes it slow. You generally don't want to do this for performance. As @old_timer says, burning instructions / cycles micro-managing the cache is almost always a waste of time. Leaving things up to the hardware's pseudo-LRU replacement and HW prefetch algorithms usually provides good results in the long run. SW prefetch can help in a few cases.
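If you really do want forced eviction, a minimal sketch with the SSE2 intrinsic looks like this (x86 target assumed; whether you need the fence depends on what the surrounding code relies on):

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

void evict_line(const void *p)
{
    _mm_clflush(p);      /* write back (if dirty) and invalidate the line containing p */
    _mm_mfence();        /* order the flush relative to later loads/stores if needed   */
}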
Xeon Phi can configure its MCDRAM as a large last-level cache, or as architecturally visible "local memory" that's part of physical address space. But at 6 to 16GiB, it's vastly bigger than the on-die L1/L2 caches, or the L1/L2/L3 caches of modern mainstream CPUs.
Also, x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all when no-fill-mode is activated. i.e. only cache is available, and you have to be careful not to evict anything that was cached. It's not usable for any practical purpose except early-boot.
What use is the INVD instruction? and Cache-as-Ram (no fill mode) Executable Code have some details.
I know that there are some instructions that give the "caching system" hints, but I'm not sure if that's enough, because the hints could be ignored or they maybe aren't sufficient to express anything expressible by such a move-to/from-cache instruction.
Direct access to the cache SRAMs has nothing to do with the instruction set. If you have access then you have access, and you access it however the chip/system designers implemented it. It could be as simple as an address space, or it may be some indirect peripheral-like access where you poke at control registers and that logic accesses the item in the cache for you.
And this doesn't mean that all ARM processors can gain access to their cache in the same way (ARM is an IP company, not a chip company), but it might mean that no, you can't do this on any existing x86s. I know for a fact that on the product I am part of we can do this, because we have ECC on those SRAMs and have an access method to initialize the RAMs from software before enabling the monitor. Some of the SRAMs you can reach through normal accesses, but, for example, the ARM core we are using was implemented with parity checking rather than ECC, so we added ECC on the SRAM and a side-door access for init, because trying to go through the cache with normal accesses and get 100% coverage was a PITA and, in the end, not the right solution.
I also worked on a product where the DRAM controller's cache could be accessed directly as an on-chip RAM; it was up to software to decide whether to use it as an L2 cache or as on-chip RAM.
So it has been and can be done, but these are isolated examples. As part of screening the parts there are MBIST tests that run, but often those are driven through JTAG and are not directly available to the processor, and/or the RAM isn't. Sometimes the MBIST can be started and checked by software but the RAM still can't be touched directly, and in some implementations the designers made it so software can touch all of it, including the tag RAM.
Which leads to this: if you think you can do a better job than the hardware and want to move things around, then you will likely also need access to the tag RAM so that you can trace/drive where you want each cache line to go, its status, etc.
Based on this comment:
Sorry, I'm a [beginner] at assembly, could you please explain this more simply? What's a CPU "mode"? What's HBM? How do you set a CPU mode? What are NDAs? – KGM
Two things: one, you generally can't do better than the cache, and two, you are not ready for this task.
Even with experience you generally can't do better than the cache. If you want to work with the cache, you use the same knowledge of how you write your code, where you place it in memory, and where the data you are using lives, so that the cache logic can work better for you. Burning instructions and cycles trying to reposition things at runtime isn't going to help. You generally need access to the design at a level that is not available to the general public, thus an NDA (non-disclosure agreement), and even then it is extremely unlikely that you will get the info you need and/or that the gains will be more than minimal; a trick may only work on one implementation and not across the whole family of products, etc.
More interesting is: what do you think you can do better, and how do you think you can do it? (Also understand that many of us here can make any cache implementation fail and run slower than if it weren't there. Even if you create a newer, better cache, by definition it only improves performance in certain cases.)

MESI protocol. Write with cache miss, but a cache line copy exists on another CPU. Why is a fetch from main memory needed?

According to this diagram, in the case of a write cache miss with a copy in another CPU's cache (for example in the Shared or Exclusive state), the steps are:
1. Snooping cores (with a cache line copy) set their state to Invalid.
2. The requesting cache fetches the fresh value from main memory.
Why can't one of the snooping cores put its cache line value on the bus first, and then go to the Invalid state? The same algorithm is used on a read miss with an existing copy. Thank you.
You're absolutely right in that it's pretty silly to go fetch a line from memory when you already have it right next to you, but this diagram describes the minimal requirement for functional correctness of the coherence protocol (i.e. what must be done to avoid coherence bugs), and that only dictates snooping the data out for modified lines since that's the only correct copy. What you describe is a possible optimization, and some systems indeed behave that way.
However, keep in mind that most systems today employ a shared cache as well (L2 or L3, sometimes even beyond that), and this is often inclusive (with regards to all lines that exist in all cores). In such systems, there's no real need to go all the way to memory, since having the line in another core means it's also in the shared cache, and after invalidation the requesting core can obtain it from there. Your proposal is therefore relevant only for systems with no shared cache, or with a cache that is not strictly inclusive.
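As a toy sketch of that minimal-correctness rule (the state enum and function names are made up for illustration; real protocols add the clean-line forwarding you are asking about):

/* Snoop-side handling of a write miss (BusRdX), following only the
 * minimal rule described above: a Modified copy is the only up-to-date
 * copy and must be supplied; Shared/Exclusive copies are simply
 * invalidated and the requester reads memory (or a shared cache). */
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Returns true if this snooper put its data on the bus. */
bool snoop_bus_rdx(mesi_state *line_state)
{
    bool supplied_data = false;

    switch (*line_state) {
    case MODIFIED:
        supplied_data = true;   /* only correct copy: must be written back / forwarded */
        break;
    case SHARED:
    case EXCLUSIVE:
        /* Clean copy: correctness only requires dropping it,
         * since memory already holds the same data. */
        break;
    case INVALID:
        break;
    }
    *line_state = INVALID;      /* another core is about to write this line */
    return supplied_data;
}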

Is it possible to use CPU cache in Golang?

Consider some memory and CPU intensive task:
e.g.: Task Block: read 16 bytes from memory then do CPU job. Then write back to memory.
And this Task Block is parallelizable, meaning each core can run one Task Block.
e.g.: 8 CPUs need 8*16 bytes of cache, concurrently.
Yes, and just like all other code running on your machine, they all use CPU cache.
It's much too broad a question to tell you how to code your app to make the most efficient use of cache. I highly recommend setting up Go Benchmarks, then refactoring your code and comparing times. (Note: do not benchmark within a VM; VMs of any kind, on any platform, do not have accurate enough clocks for Go's benchmarking. Run all benchmarks natively on your OS instead, with no VM.)
It all comes down to your ability to code the application to make efficient use of that CPU cache. This is a much broader topic covering how you use your variables, how often they get updated, what stays on the stack versus what escapes to the heap and gets garbage-collected, and how often, etc.
One tiny example to point you in the right direction to read more about efficient L1 and L2 cache development...
L1 cache uses 64-byte lines. If you want to store four 16-bit int16 values, typically they will be allocated on the stack and most likely all end up in the same cache line.
Say you want to update one of the int16s? The cache deals in whole lines, not individual values: the entire 64-byte line holding all four values is the unit that gets invalidated, transferred, and written back, even though only 2 bytes changed. This hurts most when another core is also reading or writing values in that same line (false sharing).
Very inefficient.
One solution to that problem is to keep independently updated values on separate lines (for example by using int64s or padding), so that an update only touches its own line and the other values stay cached for quick reads. Are you doing more pushes or pops? etc.
Again, it highly depends on your use case: this may even slow things down if there is a lot of contention around those four ints (e.g. mutex locks), in which case that's a whole different problem to optimize.
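The layout idea itself is language-agnostic; here is a sketch in C (the struct names are just for illustration, and in Go you get the same effect by padding each hot field of a struct out to its own 64-byte region):

/* Keeping independently updated values on separate 64-byte cache lines,
 * so that writing one of them does not force the line holding the
 * others to be invalidated or to bounce between cores. */
#include <stdint.h>

struct packed_counters {          /* all four share one cache line */
    int16_t a, b, c, d;
};

struct padded_counter {           /* each value owns a whole line  */
    int64_t value;
    char    pad[64 - sizeof(int64_t)];
};

struct padded_counters {
    struct padded_counter a, b, c, d;
};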
I recommend reading up on high-frequency scaling and on memory allocation on the stack versus the heap.

Techniques available to control data/instructions in/out of the cache?

I have encountered some Intel compiler intrinsic functions which I believe allow developers to bypass the cache?
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-AF42A867-B796-4D29-8FED-C20193FD87E0.htm
I have also come across the GCC compiler prefetch keyword, although I cannot admit to fully appreciating what this does.
With the above in mind I wondered if any members could either elaborate on the above (which I badly described) or provide other techniques which allow the developer to have close control over which data (or instructions) is/isn't loaded in the CPU cache?
This page contains a lot of information about all intrinsics:
Intel Intrinsics Guide
The series of instructions that will write data to memory, avoiding cache evictions are generally named _mm_stream_.... As the name implies, these are ideal for applications that write a large stream of data that is basically contiguous in memory and unlikely to be accessed again in the near future. So, for example, if you are mixing audio buffers and producing a single waveform output this would work well.
One of the keys to using these instructions effectively is taking advantage of write combining. If your write locations are scattered throughout memory, these instructions will stall as badly as, or possibly worse than, any other kind of memory store instruction you attempt. Since these writes do not wind up in cache, if you're not filling an entire write-combining buffer then your operation essentially becomes a write-through operation, requiring a stall until the write is completed. If you are writing contiguous memory locations then write combining will apply, and make your data writes much more efficient.
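A minimal sketch of the contiguous case (x86 with SSE2 assumed; the 16-byte alignment of dst and count being a multiple of 4 are preconditions here, not something the intrinsic handles for you):

#include <emmintrin.h>   /* _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

void fill_stream(int32_t *dst, int32_t value, size_t count)
{
    __m128i v = _mm_set1_epi32(value);

    /* Sequential 16-byte NT stores let the write-combining buffers
     * fill whole lines; scattered NT stores would defeat that. */
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    _mm_sfence();        /* make the NT stores globally visible before continuing */
}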
The flip side of that coin is prefetching. Prefetching tells the system to start pulling a memory address into the desired level of cache so that by the time the memory read is complete, you are ready to use the data. This is much harder to use, and requires an appropriate data "stride" which takes into account the cache sizes, the cache line size, and the number of instructions which can execute before the memory read completes. Using the hinting parameter, you can "suggest" that the data go into the L1, L2, or L3 cache, or that it is "non-temporal", meaning that you're just going to use it once and it should be first in line for eviction. The hardware has its own prefetching heuristics that work well for most problems without explicit prefetch instructions, but the classic counter-example is a matrix transpose:
Prefetching examples
Prefetching is generally very difficult to use effectively except in some very specific cases like this. Without a more specific problem statement from you, this is about all I can provide.
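For what it's worth, here is a stripped-down sketch of what explicit prefetching looks like (x86 and the _mm_prefetch intrinsic assumed; the distance of 64 elements ahead is an arbitrary placeholder that would have to be tuned against line size, cache size, and per-iteration work):

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Prefetch a fixed distance ahead of the element being used;
         * prefetches past the end of the array are harmless because
         * _mm_prefetch never faults. */
        _mm_prefetch((const char *)&data[i + 64], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}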

Instruction cache for pipelined simulator

I am trying to complete a simulator for a simplified MIPS computer using Java. I believe I have completed the pipeline logic needed for my assignment, but I am having a hard time understanding what the instruction and data caches are supposed to do.
The instruction cache should be direct-mapped with 4 blocks and the block size is 4 words.
So I am really confused about what the cache is doing. Is it going to memory and pulling the instruction from memory? For example, will one block hold just the add instruction?
Would it make sense to implement it as a 2 dimensional array?
First you should know the basics of the cache. You can imagine the cache as an intermediate memory which sits between the DRAM (main memory) and your processor, but is very limited in size. Now, when you try to access a location in memory, you search for it first in the cache. If it is found (a cache hit), the processor takes this data and resumes execution. Generally a cache hit is supposed to take only a few clock cycles, say 1 or 2. If the data is not found in the cache (a cache miss), then the data is fetched from main memory, filled into the cache, and fed to the processor. The processor blocks until the data is fetched; this normally takes a few hundred clock cycles, depending on the DRAM you are using. The amount of data fetched from DRAM is equal to the cache line size. To see why that helps, look up spatial locality of reference in caches.
I think this should get you a start.
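As a concrete starting point, here is a sketch of the lookup/refill logic for exactly that geometry (4 direct-mapped blocks of 4 words). It is in C rather than Java, and fetch_from_memory is a placeholder for your simulator's own memory model, but the [block][word] layout answers your 2-D array question:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4
#define WORDS_PER_BLOCK 4

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t words[WORDS_PER_BLOCK];   /* the "2-D array" part: 4 blocks x 4 words */
} icache_block;

static icache_block icache[NUM_BLOCKS];

extern uint32_t fetch_from_memory(uint32_t byte_addr);   /* placeholder for your memory model */

uint32_t icache_read(uint32_t byte_addr, bool *hit)
{
    uint32_t word_off = (byte_addr >> 2) & (WORDS_PER_BLOCK - 1);  /* bits [3:2] */
    uint32_t index    = (byte_addr >> 4) & (NUM_BLOCKS - 1);       /* bits [5:4] */
    uint32_t tag      = byte_addr >> 6;                            /* remaining bits */
    icache_block *b   = &icache[index];

    *hit = b->valid && b->tag == tag;
    if (!*hit) {
        /* Miss: refill the whole 4-word (16-byte) block from memory. */
        uint32_t base = byte_addr & ~0xFu;
        for (uint32_t w = 0; w < WORDS_PER_BLOCK; w++)
            b->words[w] = fetch_from_memory(base + 4 * w);
        b->valid = true;
        b->tag   = tag;
    }
    return b->words[word_off];
}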
