Modifying the cache access delay in gem5 does not work - caching

When testing the cache access latency in my gem5 setup, the measured latency of L1 is about 100 cycles lower than that of L2. My modification was to change tag_latency, data_latency, and response_latency in the L2 class in gem5/configs/common/Caches.py. Their original value was 20; I changed them all to 5, and also tried setting them all to 0. Every time I recompile gem5 and run it again, the measured time does not change. Why is that?
I am using the classic cache model.
By the way, do data_latency, tag_latency, and response_latency mean the data access delay, the tag lookup delay, and the delay in responding to the CPU?
gem5/build/X86/gem5.opt --debug-flags=O3CPUAll --debug-start=120000000000
--outdir=gem5/results/test/final gem5/configs/example/attack_code_config.py
--cmd=final
--benchmark_stdout=gem5/results/test/final/final.out
--benchmark_stderr=gem5/results/test/final/final.err
--mem-size=4GB --l1d_size=32kB --l1d_assoc=8 --l1i_size=32kB --l1i_assoc=8
--l2_size=256kB --l2_assoc=8 --l1d_replacement=LRU --l1i_replacement=LRU
--caches --cpu-type=DerivO3CPU
--cmd, --l1d_replacement, etc. are options I added to the config script's option parser myself.
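For reference, the L2 class being edited looks roughly like this in the classic-cache configs/common/Caches.py (a sketch; exact defaults and surrounding classes vary between gem5 versions):

# Sketch of the L2 cache class in gem5/configs/common/Caches.py (classic caches).
# Parameter names match the question; defaults may differ by gem5 version.
from m5.objects import Cache

class L2Cache(Cache):
    assoc = 8
    tag_latency = 20       # cycles to look up the tag array
    data_latency = 20      # cycles to read the data array
    response_latency = 20  # cycles to send the response back up toward the CPU
    mshrs = 20
    tgts_per_mshr = 12
    write_buffers = 8

Note that Caches.py is a Python configuration file that gem5 reads at startup, so edits to it normally take effect on the next run without rebuilding gem5.opt.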

Related

What happens to the cache and DRAM when executing "a = 5"?

If a process writes an immediate operand to an address
int a;
a = 5;
what happens to the L1 data cache and to DRAM?
Does DRAM get the value 5 first, or does the L1 data cache?
The compiler assigns some memory address to the variable a. In the second statement, when a = 5 is executed, if the system is a multi-processor system, a request is sent downstream to invalidate all other copies of the line and to give the core executing the store this particular cache line in a unique (exclusive) cache coherency state. The value 5 is then written into the L1 cache; with a write-back cache it reaches memory/DRAM only later, when the dirty line is evicted or explicitly flushed (assuming the line is kept in the cache, i.e. the store is a normal cacheable store rather than one that bypasses the cache or is written straight through to memory/DRAM).
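As an illustration of that ordering, here is a toy model (plain Python, not real hardware or gem5 code) of a write-back, write-allocate L1 handling the store: the cache line is updated and marked dirty first, and DRAM only sees the value when the line is later evicted.

# Toy model of a write-back, write-allocate data cache (illustration only).
class ToyL1Cache:
    def __init__(self, dram):
        self.dram = dram    # backing store: {address: value}
        self.lines = {}     # address -> {"value": v, "dirty": bool}

    def store(self, addr, value):
        if addr not in self.lines:                    # write miss
            old = self.dram.get(addr, 0)              # fetch the line (read-for-ownership)
            self.lines[addr] = {"value": old, "dirty": False}
        line = self.lines[addr]
        line["value"] = value                         # the L1 gets the new value first
        line["dirty"] = True                          # DRAM is now stale

    def evict(self, addr):
        line = self.lines.pop(addr)
        if line["dirty"]:
            self.dram[addr] = line["value"]           # DRAM is updated only here

dram = {}
l1 = ToyL1Cache(dram)
l1.store(0x1000, 5)          # "a = 5": the L1 holds 5, DRAM is still empty
print(dram.get(0x1000))      # None  (DRAM not yet updated)
l1.evict(0x1000)
print(dram[0x1000])          # 5     (written back on eviction)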

Write-Through No-Write-Allocate Penalty Calculation

I am considering a write through, no write allocate (write no allocate) cache. I understand these by the following definitions:
Write Through: information is written to both the block in the cache and to the block in the lower-level memory
No write allocate: on a write miss, the block is modified in main memory and not loaded into the cache.
tcache : the time it takes to access the first level of cache
tmem : the time it takes to access something in memory
We have the following scenarios:
read hit: value is found in cache, only tcache is required
read miss: value is not found in cache, ( tcache + tmem )
write hit: writes to both cache and main memory, ( tcache + tmem )
write miss: writes directly to main memory, ( tcache + tmem )
The Wikipedia flow for write-through / no-write-allocate shows that we always have to go through the cache first, even though we aren't populating the cache. Why, if we know a write will never populate the cache in this situation, can't we spend only tmem performing the operation, rather than ( tcache + tmem )? It seems like we are unnecessarily spending extra time checking something we know we will not update.
My only guess is that Paul A. Clayton's comment on a previous question about this type of cache is the reason we still have to interact with the cache on a write. But even then, I don't see why the cache update and the memory update can't be done in parallel.
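To make the cost comparison concrete, here is a small worked example (the latencies and hit rate are made-up assumptions, not values from the question) comparing the serial check-cache-then-write-memory flow with a hypothetical flow that updates the cache and memory in parallel:

# Worked example: write cost in a write-through, no-write-allocate cache.
# All numbers are illustrative assumptions.
t_cache = 2           # cycles to probe the first-level cache
t_mem = 100           # cycles to access main memory
write_hit_rate = 0.7  # assumed fraction of writes that hit in the cache

# Serial flow (as in the Wikipedia diagram): probe the cache, then write memory.
# Hits and misses cost the same, so the hit rate drops out of the average.
avg_serial = write_hit_rate * (t_cache + t_mem) + (1 - write_hit_rate) * (t_cache + t_mem)

# Hypothetical parallel flow: start the memory write while probing the cache,
# since write-through must reach memory on both hits and misses anyway.
avg_parallel = max(t_cache, t_mem)

print("serial write cost:  ", avg_serial)    # 102.0 cycles
print("parallel write cost:", avg_parallel)  # 100 cycles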

ext4 commit= mount option and dirty_writeback_centisecs

I'm trying to understand how bytes travel from write() to the physical disk platter, so that I can tune my picture server's performance.
What I don't understand is the difference between these two: the commit= mount option and dirty_writeback_centisecs. They seem to be about the same process of writing changes to the storage device, yet they are different.
It is not clear to me which one fires first on my bytes' way to the disk.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user@chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(Edit: these Dirty and Writeback figures are rather high; I had a compile running when I took this snapshot.)
So data waiting to be written out is dirty. Dirty data can still be eliminated (if, say, a temporary file is created, used, and deleted before it goes to writeback, it will never have to be written out at all). As dirty data is moved into writeback, the kernel tries to combine several small dirty requests into single larger I/O requests; this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) enough data is cached to reach vm.dirty_background_ratio, or b) the data becomes older than vm.dirty_expire_centisecs centiseconds (the default of 3000 is 30 seconds). Separately, a writeback (flusher) daemon wakes up every vm.dirty_writeback_centisecs centiseconds (default 500, i.e. 5 seconds) to actually flush out anything that is in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does the same for everything. As far as I know, it does this ASAP, bypassing any attempt to balance disk reads and writes; it stalls the device doing 100% writes until the sync completes.
The default ext4 mount option commit=5 forces a journal commit (effectively a filesystem-wide sync) every 5 seconds on that filesystem. This is intended to ensure that writes are not unduly delayed if there is heavy read activity (ideally losing at most 5 seconds of data if power is cut or the like). What I found with an Ubuntu install on an SD card (in a Chromebook) is that this actually leads to massive filesystem stalls roughly every 5 seconds if you're writing much to the card; ChromeOS uses commit=600, and I applied that on the Ubuntu side to good effect.
dirty_writeback_centisecs configures the Linux kernel daemons related to virtual memory (hence the vm. prefix), which are in charge of writing data back from RAM to all storage devices. So if you configure dirty_writeback_centisecs and have 25 different storage devices mounted on your system, they all share the same writeback interval.
commit, on the other hand, is set per storage device (actually per filesystem) and is tied to the journal commit/sync process rather than to the virtual-memory daemons.
So you can see it as:
dirty_writeback_centisecs: writing from RAM to all filesystems
commit: each filesystem fetches from RAM
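To see both knobs on a running system, something like the following quick check works (standard sysctl and /proc/mounts interfaces; only ext4 lines are shown as an example):

# Quick check of both settings on a running Linux system (illustrative sketch).
def read_sysctl(name):
    with open("/proc/sys/" + name) as f:      # e.g. /proc/sys/vm/dirty_writeback_centisecs
        return f.read().strip()

print("dirty_writeback_centisecs:", read_sysctl("vm/dirty_writeback_centisecs"))
print("dirty_expire_centisecs:   ", read_sysctl("vm/dirty_expire_centisecs"))

# The commit= interval is per filesystem, so look at the mount options instead.
with open("/proc/mounts") as f:
    for line in f:
        if " ext4 " in line:
            print(line.strip())    # commit=N shows up in the options when set explicitly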

implementation of dirty_expire_centisecs

I'm trying to understand the behavior of dirty_expire_centisecs parameter on servers with 2.6 and 3.0 kernels.
Kernel documentation says (vm.txt/dirty_expire_centisecs)
"Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up."
which implies, dirty data that has been in memory for shorter than this interval will not be written.
According to my testing, the behavior of dirty_expire_centisecs is as follows: when the writeback timer fires before the expire timer, no pages are flushed; otherwise all pages are flushed.
If the dirty_background_bytes limit is reached, it flushes all or a portion of the dirty data depending on the write rate, independent of both timers.
My testing tells me that at low write rates (less than 1 MB/s) the dirty_background_bytes trigger flushes all dirty pages, and at slightly higher rates (more than 2 MB/s) it flushes only a portion of the dirty data, independent of the expiry value.
This is different from what vm.txt says. It would make sense not to flush the most recent data. To me, the observed behavior is not logical and is practically useless. What do you think?
My test setup:
Server with 16GB of RAM running Suse 11 SP1, SP2 and RedHat 6.2 (multi boot setup)
vm.dirty_bytes = 50000000 // 50 MB
vm.dirty_background_bytes = 30000000 // 30 MB
vm.dirty_writeback_centisecs = 1000 // 10 seconds
vm.dirty_expire_centisecs = 1500 // 15 seconds
with a file-writing tool where I can control the rate and size of the write() calls per second.
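The tool itself isn't shown in the question; a minimal sketch of such a rate-controlled writer (file name, chunk size, and rates are made-up) might look like this:

# Minimal sketch of a rate-controlled file writer (hypothetical, not the actual
# tool used in the test). Writes chunk_size bytes writes_per_sec times per second
# so the rate at which pages are dirtied can be controlled.
import os, time

def write_at_rate(path, chunk_size=64 * 1024, writes_per_sec=16, seconds=60):
    interval = 1.0 / writes_per_sec
    data = b"x" * chunk_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        for _ in range(int(seconds * writes_per_sec)):
            os.write(fd, data)      # dirties page-cache pages; no fsync, so writeback decides
            time.sleep(interval)    # spaces the writes to hit the target rate
    finally:
        os.close(fd)

# Example: 64 kB x 16 writes/s = about 1 MB/s for one minute.
write_at_rate("/tmp/dirty_test.dat")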
I asked this question on the linux-kernel mailing list and got an answer from Jan Kara. The timestamp that expiration is based on is the modtime of the inode of the file. Thus, multiple pages dirtied in the same file will all be written when the expiration time occurs because they're all associated with the same inode.
http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01585.html

Page fault and dirty pages

I have started reading about CPU caches and I have two questions:
Let's say the CPU takes a page fault and transfers control to the kernel handler. The handler decides to evict a frame in memory which is marked dirty. Let's say the CPU caches are write-back, with valid and modified bits. Now the memory contents of this frame are stale and the cache contains the latest data. How does the kernel force the caches to flush?
The way the page table entry (PTE) gets marked as dirty is as follows: The TLB has a modify bit which is set when the CPU modifies the page's content. This bit is copied back to the PTE on context switch. If we get a page fault, the PTE might be non-dirty but the TLB entry might have the modified bit set (it has not been copied back yet). How is this situation resolved?
As for flushing the cache, that's just a privileged instruction. The OS executes the instruction and the hardware begins flushing. There is one instruction for invalidating all values and signaling an immediate flush without write-back, and another instruction that tells the hardware to write the data back before flushing. After the instruction is issued, the hardware (cache controller and I/O) takes over. There are also privileged instructions that tell the hardware to flush the TLB.
I'm not certain about your second question because it's been a while since I took an operating systems course, but my understanding is that in the event of a page fault the faulting page first has to be brought into memory and mapped in the page table. Which page is removed depends on the available space as well as the page replacement algorithm used. Before the new page can be brought in, if the page it is replacing has the modified bit set, it must be written out first, so an I/O is queued up. If it's not modified, the page is replaced immediately. The same goes for the TLB: if the modified bit is set, then before that page is replaced you must write it back out, so an I/O is queued up and you just have to wait.
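To illustrate the second point, here is a toy model (plain Python, not real OS or hardware code) in which the handler copies the TLB's modified bits back into the PTEs before deciding whether the victim frame needs to be written out:

# Toy model of page-fault handling with TLB dirty-bit write-back (illustration only).
page_table = {0: {"frame": 7, "dirty": False}}    # virtual page -> PTE
tlb = {0: {"frame": 7, "modified": True}}         # cached translation; modified bit set

def sync_tlb_dirty_bits():
    # Copy modified bits from the TLB back into the PTEs, as would happen on a
    # context switch or before the handler inspects the PTE.
    for vpn, entry in tlb.items():
        if entry["modified"]:
            page_table[vpn]["dirty"] = True

def handle_page_fault(victim_vpn):
    sync_tlb_dirty_bits()             # resolves the "TLB says dirty, PTE says clean" case
    pte = page_table[victim_vpn]
    if pte["dirty"]:
        print("queue I/O: write frame", pte["frame"], "to disk before reuse")
    else:
        print("frame", pte["frame"], "is clean; reuse it immediately")

handle_page_fault(0)    # -> queue I/O: write frame 7 to disk before reuse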

Resources