Dirty bit value after changing data to original state - cpu

If the value in some part of the cache is 4 and we change it to 5, that sets the dirty bit for that data to 1. But what if we set the value back to 4: will the dirty bit stay 1, or change back to 0?
I am interested in this because it would mean a higher-level optimization of the computer system when dealing with read-write operations between main memory and the cache.

In order for a cache to work the way you describe, it would need to reserve half of its data space to store the old values.
Caches are expensive precisely because they have a high cost per bit, and consider that:
That mechanism would only detect a two-level write history: A -> B -> A, and nothing deeper (like A -> B -> C -> A).
Every write would imply copying the current values into the old values.
The minimum amount of taggable data in a cache is the line, and the whole line would need to be changed back to its original value. Considering that a line is on the order of 64 bytes, that is very unlikely to happen.
The hierarchical structure of the caches (L1, L2, L3, ...) is there exactly to mitigate the problem of eviction.
The solution you propose has few benefits compared to its costs, and thus it is not implemented.
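
As a toy model of the usual scheme (the names here are mine, and this is only a sketch, not a real design): a store marks the whole line dirty unconditionally, and nothing keeps the line's original contents around to compare against, so writing the old value back cannot clear the bit.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Toy model of a write-back cache line (illustrative names, not a real design).
    struct CacheLine {
        std::array<std::uint8_t, 64> data{};
        bool dirty = false;
    };

    void storeByte(CacheLine& line, std::size_t offset, std::uint8_t value) {
        line.data[offset] = value;
        line.dirty = true;   // set unconditionally; there is no stored "old value" to compare against
    }

    int main() {
        CacheLine line;
        storeByte(line, 0, 5);   // first store: dirty becomes true
        storeByte(line, 0, 0);   // write the original value (0) back: dirty stays true
        return line.dirty ? 0 : 1;
    }

Detecting the revert would require a second 64-byte copy per line plus a full comparison on every store, which is exactly the cost the answer above describes.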

Related

Why is it that two lines that differ in their address by precisely 65,536 bytes cannot be stored in the cache at the same time?

I am reading Andrew Tanenbaum - Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries.In fact, up to 64 KB of contiguous data can be stored in the cache.However,two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value).For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or anyother location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there.If this happens often enough, itcan result in poor behavior.In fact, the worst-case behavior of a cache is worsethan if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. The first is associativity: for every possible input cache-line address (64-byte aligned on a modern x86-64 system) there are only N possible slots in the cache it may occupy.
The second is a problem much like the one encountered with the hash function used within a hashmap. Simply put, some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 KB. 64 KB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So, in this case the address -> cache slot function is a simple AND operation, and it appears the author is talking about a 1-way associative (direct-mapped) cache, where each line may only be stored in ONE location inside the cache. That leads to the mentioned conflict.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.
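
As a concrete sketch of that mapping (assuming 64-byte lines purely for illustration, with the book's 64 KB capacity; the names are mine): the slot is just the low bits of the line number, so two addresses exactly 65,536 bytes apart always collide.

    #include <cstdint>
    #include <iostream>

    constexpr std::uint64_t kLineSize  = 64;                      // assumed line size
    constexpr std::uint64_t kCacheSize = 64 * 1024;               // 64 KB, as in the book
    constexpr std::uint64_t kNumLines  = kCacheSize / kLineSize;  // 1024 slots

    // Direct-mapped index: keep only the low bits of the line number.
    std::uint64_t slotOf(std::uint64_t addr) {
        return (addr / kLineSize) & (kNumLines - 1);  // the "simple AND" mapping
    }

    int main() {
        std::uint64_t x = 0x12345;
        std::cout << slotOf(x) << " == " << slotOf(x + 65536) << '\n';  // same slot -> conflict
    }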

CPU cache: does the distance between two address needs to be smaller than 8 bytes to have cache advantage?

It may seem like a weird question...
Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said that's the case for Intel Core i7).
There are two objects A and B in memory whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache line boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by the CPU, B will be read into the cache, too. So if B is needed and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache line along with A. So we say "the CPU doesn't like to chase pointers around", and this is one of the reasons to avoid heap-allocated, node-based data structures like std::list.
My question is: if N > 64 but is still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?
Rephrased: when A is loaded, let t represent the elapsed time of fetching B; is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.
It would be even better if there is quantitative research on this.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult and limited-context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
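
A toy illustration of that last point (my own example, not from the question): in a list walk each load's address comes from the previous load, so the misses serialize, while in an array walk the addresses are known up front and can be overlapped or prefetched.

    #include <cstddef>
    #include <vector>

    struct Node { Node* next; int payload; };

    int sumList(const Node* head) {
        int sum = 0;
        for (const Node* p = head; p != nullptr; p = p->next)  // each load depends on the previous one
            sum += p->payload;
        return sum;
    }

    int sumArray(const std::vector<int>& v) {
        int sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i)             // independent addresses, easy to prefetch
            sum += v[i];
        return sum;
    }

    int main() {
        std::vector<int> v = {1, 2, 3};
        Node c{nullptr, 3}, b{&c, 2}, a{&b, 1};
        return sumList(&a) == sumArray(v) ? 0 : 1;             // both compute 6
    }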
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE:
2.3.5.4 Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with
the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
See Paul's excellent answer for other factors.
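
If you wanted to lean on that behaviour, one speculative layout trick is to keep two objects that are used together inside the same naturally aligned 128-byte chunk, so a demand miss on one may trigger the pair-line prefetch that covers the other. This is only a sketch with made-up names; the prefetcher is heuristic and may be throttled, so it is not a guaranteed win.

    // Two 64-byte objects placed in one 128-byte-aligned chunk: a miss on 'a'
    // may cause the spatial prefetcher to pull in the line holding 'b' as well.
    struct alignas(128) HotPair {
        char a[64];
        char b[64];
    };
    static_assert(sizeof(HotPair) == 128, "the pair spans exactly two adjacent lines");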

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64-byte cache line, a single line can hold 16 array elements, from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may implement a lower granularity, but this often just adds complexity, since you may have to deal with more cases of line segments being split.
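
A rough software analogy of that path (purely illustrative; real hardware uses a shifter/mux rather than memcpy): the cache hands over the whole 64-byte line, and the requested bytes are picked out of it using the address offset and the access size.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Pick a 4-byte value out of a full 64-byte line, given the byte offset
    // within the line ("rotate and cut" in software form).
    std::uint32_t extractWord(const std::uint8_t (&line)[64], std::size_t byteOffset) {
        std::uint32_t value;
        std::memcpy(&value, line + byteOffset, sizeof(value));
        return value;
    }

    int main() {
        std::uint8_t line[64] = {};
        line[20] = 42;                                // low byte (little-endian) of the int at offset 20, i.e. arr[5]
        std::printf("%u\n", extractWord(line, 20));   // prints 42
    }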

Understanding caches and block sizes

A quick question to make sure I understand the concept behind a "block" and its usage with caches.
Say I have a small cache that holds 4 blocks of 4 words each, and let's say it's also direct-mapped. If I try to access a word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache, or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is triggered, does the CPU load one block's worth of data (4 words) starting at the accessed address, or does it calculate which block that word in memory belongs to and bring in that block instead?
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, i.e., accessed by using some portion of the requested address (a "lookup table key", if you will). If the cache used a block size of 1 word, the entire address (all N bits of it) would be the "key", and each word would be accessible with that granularity.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
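
To tie that back to the question: a block always starts at an address that is a multiple of the block size, so accessing word 2 pulls in words 0-3, not words 2-5. A minimal sketch of the calculation (names are mine):

    #include <cstdio>

    constexpr unsigned kWordsPerBlock = 4;   // the question's 4-word blocks

    // Start of the aligned block containing a given word address.
    unsigned blockBase(unsigned wordAddr) {
        return wordAddr & ~(kWordsPerBlock - 1);   // clear the low offset bits
    }

    int main() {
        std::printf("%u\n", blockBase(2));   // prints 0: words 0-3 are brought in
    }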
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accommodate that situation, the cache fill can take place starting with the "critical cache address" -- the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward -- usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
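
A small sketch of that critical-word-first, modulo-wrap fill order, for the question's 4-word block when the miss was to word 2 (illustrative only):

    #include <cstdio>

    int main() {
        const unsigned wordsPerBlock = 4, criticalWord = 2;
        for (unsigned k = 0; k < wordsPerBlock; ++k)
            std::printf("%u ", (criticalWord + k) % wordsPerBlock);  // prints 2 3 0 1
        std::printf("\n");
    }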
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.
HTH... JoGusto.

Algo - Given a sequence of memory accesses, give the minimum number of cache misses

Given inputs are: size of cache s, number of memory entries n, and a series of memory accesses.
Give the minimum number of cache misses possible.
Example:
s = 3, n = 4
1 2 3 1 4 1 2 3
min_miss = 4
I've been stuck the entire day. Thanks in advance!
You get to decide whatever behaviour the cache takes. You don't have to take in an entry even if it's accessed, for example. And it need not be regular. You don't need to follow a fixed "rule" to cache.
Try following http://en.wikipedia.org/wiki/Page_replacement_algorithm#The_theoretically_optimal_page_replacement_algorithm - when you need to swap something out swap out the item that will not be used again for the longest possible time. Since you get the entire sequence of memory accesses ahead of time this is feasible for you. This is obviously locally optimum at least up to the first cache miss after the cache becomes full, because every other strategy has had at least one cache miss by then. It is not obvious to me that this is globally optimal - searching I find a proof at http://www.stanford.edu/~bvr/psfiles/paging.pdf with a claim that other proofs of its optimality do exist but are even longer.
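
A minimal sketch of that furthest-in-future rule, extended with the bypass option noted above (the element just accessed is itself an eviction candidate, i.e. it may simply not be cached). The function names are mine; on the example sequence it reports 4 misses:

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>
    #include <vector>

    // Index of the next access to 'value' strictly after position 'i',
    // or accesses.size() if it is never used again.
    std::size_t nextUse(const std::vector<int>& accesses, std::size_t i, int value) {
        for (std::size_t j = i + 1; j < accesses.size(); ++j)
            if (accesses[j] == value) return j;
        return accesses.size();
    }

    // Furthest-in-future replacement with bypass: on a miss with a full cache,
    // drop whichever element (including the incoming one) is used furthest away.
    int minMisses(const std::vector<int>& accesses, std::size_t cacheSize) {
        std::unordered_set<int> cache;
        int misses = 0;
        for (std::size_t i = 0; i < accesses.size(); ++i) {
            int x = accesses[i];
            if (cache.count(x)) continue;            // hit
            ++misses;                                // miss, whether cached or bypassed
            if (cache.size() < cacheSize) { cache.insert(x); continue; }
            int victim = x;
            std::size_t victimNext = nextUse(accesses, i, x);
            for (int c : cache) {                    // pick the element used furthest in the future
                std::size_t next = nextUse(accesses, i, c);
                if (next > victimNext) { victim = c; victimNext = next; }
            }
            if (victim != x) { cache.erase(victim); cache.insert(x); }  // otherwise bypass x
        }
        return misses;
    }

    int main() {
        std::vector<int> accesses = {1, 2, 3, 1, 4, 1, 2, 3};
        std::cout << minMisses(accesses, 3) << "\n";   // prints 4, matching the example
    }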

Resources