L1 cache - how many simultaneous loads from the same cache line can x86/x64 do?

I have some code which reads from an array. The array is largish. I'd expect it to live substantially in L2 cache. Call this TOld.
I wrote an alternative that reads from an array that fits mainly in a single cache line (that I don't expect to be evicted). Call this TNew.
They should produce the same results, and they do. TOld does a single read of its array to get its result. TNew does 6 reads (and a few simple arithmetic ops which are negligible). In both cases I'd expect the reads to dominate.
An L2 cache read in TOld costs ~15 cycles. An L1 cache read in TNew costs ~5 cycles, but I do 6 of them, so I expect ~30 cycles in total. I therefore expected TNew to run at about half the speed of TOld. Instead there's just a few percent difference.
This suggests that the L1 cache is capable of doing 2 reads simultaneously, and from the same cache line. Is that possible in x86/x64?
The other alternative is that I haven't correctly aligned TNew's array to land in a single cache line and it straddles 2 cache lines; maybe that allows 2 simultaneous reads, one per line. Is that possible?
Frankly neither seems credible, but opinions welcome.
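To make the estimate above concrete, here is a rough sketch that contrasts two simple timing models. It assumes the 6 reads are independent of each other, and `loads_per_cycle` is a hypothetical parameter standing in for how many loads the L1 can start per cycle; neither assumption is confirmed by measurements here.

```python
import math

def serialized_cycles(n_loads, latency):
    """Model used in the estimate above: each load waits for the previous one."""
    return n_loads * latency

def overlapped_cycles(n_loads, latency, loads_per_cycle):
    """Pipelined model: independent loads start back to back,
    so only the last one pays the full latency."""
    issue_cycles = math.ceil(n_loads / loads_per_cycle)
    return latency + issue_cycles - 1

tnew_serial = serialized_cycles(6, 5)        # 30 cycles, the ~30 quoted above
tnew_one_port = overlapped_cycles(6, 5, 1)   # 10 cycles
tnew_two_ports = overlapped_cycles(6, 5, 2)  # 7 cycles
print(tnew_serial, tnew_one_port, tnew_two_ports)
```

In this toy model, overlapping alone shrinks the cost even with one load per cycle, so a small measured gap does not by itself prove that two reads hit the same line simultaneously.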


Do CPUs with AVX2 or newer instruction sets support any form of caching on register renaming?

For example, here is some very simple pseudocode that processes data with many duplicated values:
1 5 1 5 1 2 2 3 8 3 4 5 6 7 7 7
For all data elements:
    get particle id from data array
    idx = id/7
    index = (idx << 8) | id
    aabb = lookup[index]
    test collision of aabb with a ray
so it will very probably recompute the same result for a repeated id such as 1, using the same division followed by the same bitwise operation, with no loop-carried dependency.
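A minimal sketch of how much of that work is duplicated for the sample data, using only the index computation from the pseudocode (the `lookup` table and the ray test are left out, and `lookup_index` is just a name chosen for this illustration):

```python
from collections import Counter

data = [1, 5, 1, 5, 1, 2, 2, 3, 8, 3, 4, 5, 6, 7, 7, 7]

def lookup_index(particle_id):
    """Index computation from the pseudocode: idx = id/7, index = (idx << 8) | id."""
    idx = particle_id // 7
    return (idx << 8) | particle_id

indices = [lookup_index(i) for i in data]
counts = Counter(indices)

# Number of index computations that repeat an earlier one.
redundant = len(indices) - len(counts)
print(redundant, "of", len(indices), "computations are duplicates")
```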
Can new CPUs (e.g. those with AVX-512 or AVX2) remember the pattern (same data + same code path), directly rename an old input register, and return the output quickly (like branch prediction, but predicting the renamed register for a temporary value instead)?
I'm currently developing a collision detection algorithm on an old CPU (Bulldozer v1), and no online C++ compiler gives predictable enough performance, because the CPU is shared by all visitors.
Removing duplicates takes about 15-30 nanoseconds per insert with an unordered map, or about 3-5 nanoseconds per insert with a vectorized scan of a plain array. This is too slow to effectively filter unnecessary duplicates out. Even if a direct-mapped cache is used (that contains just a modulo operator and some assignments), it still fails (due to cache misses) and performs even worse than an unordered map.
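For reference, the kind of direct-mapped software cache meant here can be sketched as follows. This only illustrates the logic (a modulo plus a few assignments); it says nothing about timing, and the class name, slot count and `lookup_index` helper are made up for the example.

```python
class DirectMappedMemo:
    """Software direct-mapped cache: each key maps to exactly one slot."""

    def __init__(self, n_slots, compute):
        self.n_slots = n_slots
        self.compute = compute
        self.keys = [None] * n_slots
        self.values = [None] * n_slots
        self.hits = 0
        self.misses = 0

    def get(self, key):
        slot = key % self.n_slots          # the modulo operator
        if self.keys[slot] == key:         # duplicate: reuse the stored result
            self.hits += 1
            return self.values[slot]
        self.misses += 1                   # new or evicted key: recompute
        value = self.compute(key)
        self.keys[slot] = key
        self.values[slot] = value
        return value

def lookup_index(particle_id):
    return ((particle_id // 7) << 8) | particle_id

data = [1, 5, 1, 5, 1, 2, 2, 3, 8, 3, 4, 5, 6, 7, 7, 7]
memo = DirectMappedMemo(n_slots=8, compute=lookup_index)
results = [memo.get(i) for i in data]
print(memo.hits, memo.misses)
```

With fewer slots than distinct ids, keys such as 1 and 5 start evicting each other, which is the usual weakness of a direct-mapped scheme.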
I'm not expecting a cpu with only hundred(s) of physical registers to actually cache many things, but it could help a lot in computing duplicate values quickly, by just remembering the "same value + same code path" combo only from the last iteration of a loop. At least some physics simulations with collision checking could get a decent boost.
Processing a sorted array is faster, but is that only true for branching code? What about branchless code on the newest CPUs?
Is there any way of harnessing the register renaming performance (zero latency?) as a simple caching of duplicated work?

Data Oriented Design with Mike Acton - Are 'loops per cache line' calculations right?

I've watched Mike Acton's talks about DOD a few times now to better understand it (it is not an easy subject for me). I'm referring to CppCon 2014: Mike Acton "Data-Oriented Design and C++"
and GDC 2015: How to Write Code the Compiler Can Actually Optimize.
But in both talks he presents some calculations that I'm confused by:
This shows that FooUpdateIn takes 12 bytes, but if you stack 32 of them you will get 6 fully packed cache lines. Same goes for FooUpdateOut, it takes 4 bytes and 32 of them gives you 2 fully packed cache lines.
In the UpdateFoos function, you can do ~5.33 loops per cache line (assuming that count is indeed 32). He then assumes that all the math takes about 40 cycles per loop, which means that each cache line would take about 213.33 cycles.
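A quick sketch of that arithmetic, assuming 64-byte cache lines (the constant names are mine, not from the talk):

```python
CACHE_LINE_BYTES = 64   # assumed line size
IN_BYTES = 12           # size of one FooUpdateIn
OUT_BYTES = 4           # size of one FooUpdateOut
COUNT = 32              # number of elements
CYCLES_PER_LOOP = 40    # cost of the math for one element

in_lines = COUNT * IN_BYTES / CACHE_LINE_BYTES    # 384 / 64 = 6 lines
out_lines = COUNT * OUT_BYTES / CACHE_LINE_BYTES  # 128 / 64 = 2 lines

loops_per_line = CACHE_LINE_BYTES / IN_BYTES      # ~5.33 inputs per line
cycles_per_line = loops_per_line * CYCLES_PER_LOOP

print(in_lines, out_lines, round(loops_per_line, 2), round(cycles_per_line, 2))
```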
Now here's where I'm confused, isn't he forgetting about reads and writes? Even though he has 2 fully packed data structures they are in different memory spaces.
In my head this is what's happening:
1. Read in[0].m_Velocity[0] (which would take about 200 cycles based on his previous slides)
2. Since in[0].m_Velocity[1] and in[0].m_Foo are in the same cache line as in[0].m_Velocity[0], their access is free
3. Do all the calculation
4. Write the result to out[0].m_Foo - Here is where I don't know what happens; I assume that it would discard the previous cache line (fetched in 1.) and load the new one to write the result
5. Read in[1].m_Velocity[0], which would again discard another cache line (fetched in 4.) (and would again take about 200 cycles)
So by jumping between in and out, the calculation goes from ~5.33 loops/cache line to 0.5 loops/cache line, which would give 20 cycles per cache line.
Could someone explain why he wasn't concerned about reads/writes? Or what is wrong in my thinking?
Thank you.
If we assume the L1 cache is 64 KB and one cache line is 64 bytes, then there are 1024 cache lines in total. So the write to out[0].m_Foo in step 4 will not discard the data cached in step 2, as they are in different memory locations. This is why he uses a separate structure for the m_Foo output instead of mutating it in place as in his first implementation. His numbers only cover the work up to the point where the value is calculated; writing the value has the same cost as in his first implementation. Also, the processor can optimize loops quite well, since it can do multiple calculations in parallel (not sequentially, because the result of one loop iteration does not depend on the previous one). I hope this helps.

Why can't two lines whose addresses differ by precisely 65,536 bytes be stored in the cache at the same time?

I am reading Andrew Tanenbaum's Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries. In fact, up to 64 KB of contiguous data can be stored in the cache. However, two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value). For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or any other location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there. If this happens often enough, it can result in poor behavior. In fact, the worst-case behavior of a cache is worse than if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. The first is associativity: for every possible input cache-line address (64-byte aligned on a modern x86-64 system) there are only N possible slots in the cache where it may be stored.
The second is a problem much like the one encountered with the hash function of a hash map. Simply put, some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 (presumably binary) kilobytes. 64 KB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So, in this case the address -> cache slot function is a simple AND operation, and it appears the author is talking about a 1-way associative (direct-mapped) cache (that is, each line may be stored in only ONE location inside the cache), leading to the mentioned conflict.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.
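A minimal sketch of that AND-based mapping for a direct-mapped cache. The 32-byte line size and the 2048 entries are assumed for illustration; any pair whose product is 65,536 gives the same conflict distance. The function names are hypothetical.

```python
LINE_BYTES = 32     # assumed line size
N_ENTRIES = 2048    # assumed entry count; 2048 * 32 = 65,536 bytes in total

def line_index(addr):
    """Cache slot: drop the offset within the line, then AND with a mask."""
    return (addr // LINE_BYTES) & (N_ENTRIES - 1)

def tag(addr):
    """Remaining high bits, stored alongside the data to tell lines apart."""
    return addr // (LINE_BYTES * N_ENTRIES)

X = 0x12340
print(line_index(X), line_index(X + 65536))   # same slot
print(tag(X), tag(X + 65536))                 # different tags, so they conflict
print(line_index(X + LINE_BYTES))             # the next line goes to the next slot
```

Adding 65,536 only changes bits above the ones kept by the mask, so both addresses select the same entry and can only be told apart by their tags.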

CPU cache: does the distance between two addresses need to be smaller than 8 bytes to get a cache advantage?

This may seem a weird question.
Say a cache line is 64 bytes. Further, assume that L1, L2 and L3 have the same cache line size (this post said it's the case for Intel Core i7).
There are two objects A, B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache-line boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by CPU, B will be read into the cache, too. So if B is needed, and the cache line is not evicted yet, CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache along with A. So we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated node-based data structures, like std::list.
My question is: if N > 64 but is still small, say N = 70, in other words, A and B do not fit in one cache line but are not too far apart, when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?
To rephrase: when A is loaded, let t represent the time it takes to fetch B. Is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since CPU cache is hierarchical.
It would be even better if there is quantitative research on this.
There are at least three factors which can make a fetch of B after A misses faster. First, a processor may speculatively fetch the next block (independent of any stride-based prefetch engine, which would depend on two misses being encountered near each other in time and location in order to determine the stride; unit stride prefetching does not need to determine the stride value [it is one] and can be started after the first miss). Since such prefetching consumes memory bandwidth and on-chip storage, it will typically have a throttling mechanism (which can be as simple as having a modest sized prefetch buffer and only doing highly speculative prefetching when the memory interface is sufficiently idle).
Second, because DRAM is organized into rows and changing rows (within a single bank) adds latency, if B is in the same DRAM row as A, the access to B may avoid the latency of a row precharge (to close the previously open row) and activate (to open the new row). (This can also improve memory bandwidth utilization.)
Third, if B is in the same address translation page as A, a TLB miss may be avoided. (In many designs hierarchical page table walks are also faster in nearby regions because paging structures can be cached. E.g., in x86-64, if B is in the same 2MiB region as A, a TLB miss may only have to perform one memory access because the page directory may still be cached; furthermore, if the translation for B is in the same 64-byte cache line as the translation for A and the TLB miss for A was somewhat recent, the cache line may still be present.)
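As a rough illustration of this third factor, the following sketch checks whether two addresses share a translation page or a 2MiB region. It assumes 4 KiB base pages, and the address of A is an arbitrary hypothetical value.

```python
PAGE_BYTES = 4096                  # assumed 4 KiB base page
REGION_BYTES = 2 * 1024 * 1024     # 2MiB region covered by one page table

def same_page(a, b):
    """True if both addresses share one page, and so one TLB entry."""
    return a // PAGE_BYTES == b // PAGE_BYTES

def same_2mib_region(a, b):
    """True if both translations live in the same page table."""
    return a // REGION_BYTES == b // REGION_BYTES

A = 0x40000000                     # hypothetical, page-aligned address
for n in (70, 5000, 9999999):
    print(n, same_page(A, A + n), same_2mib_region(A, A + n))
```

For N = 70 the translation for A is reused; for N = 5000 only the 2MiB region is shared; for N = 9999999 neither is.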
In some cases one can also exploit stride-based prefetch engines by arranging objects that are likely to miss together in a fixed, ordered stride. This would seem to be a rather difficult and limited-context optimization.
One obvious way that stride can increase latency is by introducing conflict misses. Most caches use simple modulo a power of two indexing with limited associativity, so power of two strides (or other mappings to the same cache set) can place a disproportionate amount of data in a limited number of sets. Once the associativity is exceeded, conflict misses will occur. (Skewed associativity and non-power-of-two modulo indexing have been proposed to reduce this issue, but these techniques have not been broadly adopted.)
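A small sketch of how a power-of-two stride concentrates accesses in one set under simple modulo indexing. The geometry (64-byte lines, 64 sets, 8 ways, i.e. 32 KiB) is an assumption chosen for illustration, not a statement about any particular CPU.

```python
LINE_BYTES = 64   # assumed line size
N_SETS = 64       # assumed: 32 KiB / (64-byte lines * 8 ways)
WAYS = 8          # assumed associativity

def set_index(addr):
    """Simple modulo-a-power-of-two indexing."""
    return (addr // LINE_BYTES) % N_SETS

def sets_touched(stride, n_accesses, base=0):
    """Number of distinct sets used by n_accesses spaced stride bytes apart."""
    return len({set_index(base + i * stride) for i in range(n_accesses)})

print(sets_touched(4096, 16))        # all 16 lines land in a single set
print(sets_touched(4096 + 64, 16))   # padding by one line spreads them out
```

With 16 lines competing for the 8 ways of a single set, conflict misses occur even though the rest of the cache is empty.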
(By the way, the reason pointer chasing is particularly slow is not just low spatial locality but that the access to B cannot be started until after the access to A has completed because there is a data dependency, i.e., the latency of fetching B cannot be overlapped with the latency of fetching A.)
If B is at a lower address than A, it won't be in the same cache line even if they're adjacent. So your N < 64 case is misnamed: it's really the "same cache line" case.
Since you mention Intel i7: Sandybridge-family has a "spatial" prefetcher in L2, which (if there aren't a lot of outstanding misses already) prefetches the other cache line in a pair to complete a naturally-aligned 128B pair of lines.
From Intel's optimization manual, in section 2.3 SANDY BRIDGE: Data Prefetching
... Some prefetchers fetch into L1.
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with
the pair line that completes it to a 128-byte aligned chunk.
... several other prefetchers try to prefetch into L2
IDK how soon it does this; if it doesn't issue the request until the first cache line arrives, it won't help much for a pointer-chasing case. A dependent load can execute only a couple cycles after the cache line arrives in L1D, if it's really just pointer-chasing without a bunch of computation latency. But if it issues the prefetch soon after the first miss (which contains the address for the 2nd load), the 2nd load could find its data already in L1D cache, having arrived a cycle or two after the first demand-load.
Anyway, this makes 128B boundaries relevant for prefetching in Intel CPUs.
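To see which addresses fall into the same line or the same 128B pair, here is a small sketch (64-byte lines assumed; the helper names and addresses are made up for the example):

```python
LINE_BYTES = 64
PAIR_BYTES = 128

def line_base(addr):
    """Start address of the cache line containing addr."""
    return addr & ~(LINE_BYTES - 1)

def pair_line_base(addr):
    """Start of the other line in the naturally aligned 128B pair."""
    return line_base(addr) ^ LINE_BYTES

def same_line(a, b):
    return line_base(a) == line_base(b)

def same_128b_pair(a, b):
    return a // PAIR_BYTES == b // PAIR_BYTES

A = 0x1000                                 # hypothetical, 128B-aligned
print(same_line(A, A - 8))                 # False: B just below A is in another line
print(same_128b_pair(A, A + 70))           # True: B is in A's pair line
print(same_128b_pair(A + 64, A + 64 + 70)) # False: the pair boundary is crossed
print(hex(pair_line_base(A)), hex(pair_line_base(A + 64)))
```

So for N = 70 it matters whether A sits in the lower or the upper half of its 128B pair.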
See Paul's excellent answer for other factors.

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you first write, within a short period of time, the contents of a full cache line at a previously uncached address, and then soon after read from that address again, can the CPU avoid having to read the old contents of that address from memory?
Effectively, it shouldn't matter what the memory previously contained, because the full cache line's worth of data was overwritten. I can understand that a partial cache line write to an uncached address followed by a read would incur the overhead of having to synchronise with main memory etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time, since an RFO request will occur to pull that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
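The byte-range cases discussed above can be sketched with a small classifier. It only describes how the load and store ranges overlap; whether forwarding actually succeeds in each case depends on the microarchitecture, and the function name is just for illustration.

```python
def classify(store_addr, store_size, load_addr, load_size):
    """Classify how a load's byte range relates to an earlier store's range."""
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    if load_end <= store_addr or load_addr >= store_end:
        return "no overlap"
    if load_addr >= store_addr and load_end <= store_end:
        return "fully contained"   # forwards on modern CPUs, per the text above
    return "partial overlap"       # the case that does not forward

print(classify(10, 4, 9, 4))    # 4-byte write at 10, 4-byte read at 9
print(classify(10, 4, 12, 2))   # 4-byte write at 10, 2-byte read at 12
print(classify(10, 4, 10, 8))   # read bigger than the write
```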
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
The last case, where the read is bigger than the write, is definitely a case where store forwarding stalls. The quoted 11 cycles probably applies when all of the involved bytes are in L1, but when some bytes aren't cached at all (your scenario) it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.
