Physical memory: set + offset bigger than page offset [duplicate] - memory-management

This question already has answers here:
Why does Intel use a VIPT cache and not VIVT or PIPT?
VIPT Cache: Connection between TLB & Cache?
I understand there is a technique whereby, even when the cache's set index + line offset fields are wider than the page offset, the cache doesn't need to wait for the physical page number to be translated: it uses the 2-3 least-significant bits of the virtual page number as the most-significant bits of the set index, so reaching the data takes 2 clocks instead of 3. What is this technique called? Any relevant links are welcome.
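
To make the setup concrete, here is a small worked sketch (the cache geometry below is hypothetical, not taken from any particular CPU): with 4 KiB pages, any index bits above the 12-bit page offset have to come from the virtual page number.

    #include <stdio.h>

    /* Integer log2 for powers of two. */
    static int log2i(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        /* Hypothetical geometry: 4 KiB pages, 64 B lines, 64 KiB 8-way cache. */
        unsigned page_bytes  = 4096;
        unsigned line_bytes  = 64;
        unsigned cache_bytes = 64 * 1024;
        unsigned ways        = 8;

        unsigned sets     = cache_bytes / (line_bytes * ways); /* 128 */
        int offset_bits   = log2i(line_bytes);                 /* 6  */
        int index_bits    = log2i(sets);                       /* 7  */
        int page_off_bits = log2i(page_bytes);                 /* 12 */

        /* Index bits lying above the page offset must come from the
           virtual page number; here 7 + 6 - 12 = 1 bit. */
        int borrowed = index_bits + offset_bits - page_off_bits;
        printf("index+offset = %d bits, page offset = %d bits, borrowed = %d\n",
               index_bits + offset_bits, page_off_bits, borrowed);
        return 0;
    }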


L1 cache - how many simultaneous loads from the same cache line can x86/x64 do? [duplicate]

This question already has answers here:
Load/stores per cycle for recent CPU architecture generations
How many CPU cycles are needed for each assembly instruction?
VIPT Cache: Connection between TLB & Cache?
I have some code which reads from an array. The array is largish. I'd expect it to live substantially in L2 cache. Call this TOld.
I wrote an alternative that reads from an array that fits mainly in a single cache line (that I don't expect to be evicted). Call this TNew.
They should produce the same results, and they do. TOld does a single read of its array to get its result. TNew does 6 reads (and a few simple arithmetic ops which are negligible). In both cases I'd expect the reads to dominate.
Cost of an L2 cache read by TOld: ~15 cycles. Cost of an L1 cache read by TNew: ~5 cycles, but I do 6 of them, so I expect ~30 cycles total. So I'd expect TNew to be about half the speed of TOld. Instead there's just a few percent difference.
This suggests that the L1 cache is capable of doing 2 reads simultaneously, and from the same cache line. Is that possible on x86/x64?
The other alternative is that I haven't correctly aligned TNew's array to land in a single cache line and it's straddling 2 cache lines; maybe that allows 2 simultaneous reads, one per line. Is that possible?
Frankly, neither seems credible, but opinions are welcome.
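
For reference, a minimal sketch of the alignment I have in mind (the name tnew_data and the element count are placeholders, not the real TNew):

    #include <stdint.h>

    /* 6 x 8 bytes = 48 bytes, so if the start of the array is pinned to a
       64-byte boundary the whole array sits in one 64-byte cache line.
       Without _Alignas (C11) the array may straddle two lines. */
    _Alignas(64) static uint64_t tnew_data[6];

    /* All 6 reads then hit the same L1 line. */
    uint64_t sum_tnew(void) {
        uint64_t s = 0;
        for (int i = 0; i < 6; i++)
            s += tnew_data[i];
        return s;
    }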

Why does level 1 use split cache? [duplicate]

This question already has answers here:
What does a 'Split' cache means. And how is it useful(if it is)?
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Why implement data cache and instruction cache to reduce miss? [duplicate]
Generally, processors use a split level-1 cache, while level 2 is a unified cache. Why is that?
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache.

How can a 32-bit CPU transfer 64 or even 128 bits in parallel on a data bus? [duplicate]

This question already has answers here:
Data bus width and word size
How do we determine if a processor is 8-bit; 16-bit or 32-bit
word size and data bus
Well, I just recently started reading the book Structured Computer Organization by Andrew Tanenbaum, and everything was clear to me until I reached this sentence in chapter 2: "Finally, many computers can transfer 64 or 128 bits in parallel on a single bus cycle, even on 32-bit machines". The problem is that I cannot picture how something like this would work, since, as far as I know, a CPU has a single data bus.
If there were, for example, a 32-bit CPU in a 64-bit system (64-bit data bus), how would the CPU transfer the 64 bits "in parallel" in the same bus cycle?
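
For what it's worth, the effect is visible from software too. A sketch (whether the compiler really emits one 128-bit transfer depends on the target and flags):

    #include <string.h>
    #include <stdint.h>

    /* Even on a 32-bit x86 target, a compiler with SSE enabled will
       typically lower this 16-byte copy to a single 128-bit load and
       store (movdqu): 128 bits move between memory and an XMM register
       per instruction, although the general-purpose registers are only
       32 bits wide. */
    typedef struct { uint64_t lo, hi; } pair128;

    void copy128(pair128 *dst, const pair128 *src) {
        memcpy(dst, src, sizeof *dst);
    }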

ARMv8 address fields for cache?

I am reading the ARM Cortex-A Series Programmer's Guide for ARMv8-A.
In 11.1.2, Cache tags and Physical Addresses, there is an example of cache address fields.
Example:
Cache is 4-way, 32 KB
Cache line = 16 words (64 bytes)
And the address fields stated in the document:
Set (index) = 8 bits, Offset = 6 bits, Tag = 30 bits
From my understanding, an 8-bit index corresponds to 256 cache lines in each way (which is illustrated correctly in the example), and the 6-bit offset (2^6 = 64) addresses the bytes inside the 64-byte line.
However, the cache is 4-way, which would make the cache size 4 * 256 * 64 = 64 KB, not 32 KB.
Is my analysis correct, or am I missing something?
Someone asked the same question on the Arm community website: https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/8159/how-to-compute-a-cache-size
Here is the reply to that question:
" Got reply from ARM. It is a document error. It should be 2-way set-associative cache. 16KB * 2 = 32 KB "

Does the 1st-level page table entry contain the full physical address?

I was reading here, and in problem 3, exercise 2, it states:
The top-level page table should not assume that 2nd level page tables are page-aligned. So, we store full physical addresses there. Fortunately, we do not need control bits. So, each entry is at least 44 bits (6 bytes for byte-aligned, 8 bytes for word-aligned). Each top-level page table is therefore 256*6 = 1536 bytes (256 * 8 = 2048 bytes).
So why can't the first level assume the 2nd-level page tables are page-aligned?
Since this presumes that memory is not allocated at page granularity, wouldn't that make allocation significantly more complicated?
By the way, I have tried to read the course lectures, but they aren't that comprehensive.
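
To see what page alignment would buy, a small sketch (the 4 KiB page size is my assumption; the 44-bit physical address and 256 entries come from the quoted text):

    #include <stdio.h>

    int main(void) {
        int phys_bits = 44;  /* physical address width, per the exercise */
        int page_bits = 12;  /* 4 KiB pages -- assumed */
        int entries   = 256; /* top-level entries, per the exercise */

        /* If 2nd-level tables were page-aligned, their low 12 address bits
           would always be zero and each entry could hold just the frame
           number: */
        int aligned_bits = phys_bits - page_bits;  /* 32 bits -> 4 bytes */
        /* Without that guarantee, the full physical address is stored: */
        int full_bits = phys_bits;                 /* 44 bits -> 6 bytes */

        printf("page-aligned: %d-bit entries, table = %d bytes\n",
               aligned_bits, entries * 4);
        printf("unaligned:    %d-bit entries, table = %d bytes\n",
               full_bits, entries * 6);
        return 0;
    }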
