Are caches of different levels operating in the same frequency domain?

Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.
So, do L2 caches work in the same clock domain as L1 caches? How about the L3 cache (slices), now that it is non-inclusive and shared among all the cores?
And related questions are:
Are all functional units in a core in the same clock domain?
Are the uncore parts all in the same clock domain?
Are cores in the multi-core system synchronous?
I believe clock domain crossing would introduce extra latency. Do most parts of a CPU chip work in the same clock domain?

The private L1i/d caches are always part of each core, not on a separate clock, in modern CPUs¹. L1d is very tightly coupled with load execution units, and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).
On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).
Or for example early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in PPro). (But all in the same "frequency domain"; this was before DVFS (dynamic voltage/frequency scaling) for power management.) L1i/d were tightly integrated into the core like they still are today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.
The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.
How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?
Among mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has a non-inclusive L3 cache. Intel since Nehalem has used an inclusive last-level cache so its tags can serve as a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.
On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.
The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
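As a back-of-the-envelope illustration of that max_concurrency / latency ceiling (the numbers below are illustrative assumptions, roughly 10 outstanding line fills per core and 80 ns load latency, not measurements of any particular CPU):

    /* Rough model of the per-core bandwidth ceiling: one core can only keep a
     * limited number of 64-byte cache-line requests in flight at once, so its
     * bandwidth is capped at in_flight * line_size / latency.
     * Illustrative numbers, not measured values. */
    #include <stdio.h>

    int main(void) {
        const double line_bytes = 64.0;   /* cache-line size */
        const double in_flight  = 10.0;   /* assumed line-fill buffers per core */
        const double latency_ns = 80.0;   /* assumed load latency to DRAM */
        double bw = in_flight * line_bytes / (latency_ns * 1e-9);
        printf("latency-bound single-core bandwidth ~= %.1f GB/s\n", bw / 1e9);
        return 0;
    }

With ~80 ns latency that comes out around 8 GB/s, well below dual-channel DDR4 peak, which is why higher uncore/memory latency directly hurts single-core bandwidth.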
The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.
On Haswell and later Xeons (E5 v3), the uncore (the ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, saying Haswell (v4) when Haswell is actually Xeon E[357]-xxxx v3, but other sources like the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.
On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket by making it slow to keep up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:
On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!
... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%
The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.
I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.
Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 about some changes like new snoop mode options (which hop on the ring bus sends snoops to the other core), but it doesn't mention clocks.

A larger cache may have a higher access time, but it can still sustain one access per cycle per port if it is fully pipelined. However, it may also constrain the maximum supported frequency.
In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in its own frequency domain.
In modern Intel processors (since Nehalem I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a socket is used in a dual-NUMA-node configuration. In this case, I think the uncore partitions of the two NUMA nodes would still be in the same frequency domain.
There is special circuitry for crossing frequency domains, which all cross-domain communication has to pass through. So yes, I think it incurs a small performance overhead.
There are other frequency domains too. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using the Intel Advisor's roof-line plot, which depicts the bandwidths of each CPU data-path available. According to this, the bandwidth is 230 GB/s.
Strong scaling of bandwidth:
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where there are enough cores to saturate a given shared resource, and using more cores only increases the overhead.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and using non-temporal prefetch should improve the performance and scalability of your benchmark a bit. Still, there are other architectural hardware limitations that can keep the benchmark from scaling well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
The layout of Intel Xeon Skylake SP processors is like the following in your case:
Source: Intel
There are two sockets connected with a UPI interconnect, and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor and each is connected to 3 DDR4 channels @ 2666 MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
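A quick sanity check of that figure (just re-doing the arithmetic from the sentence above):

    /* Peak theoretical DRAM bandwidth of the 2-socket Skylake-SP system above:
     * 2 sockets * 2 IMCs * 3 channels * 2666e6 transfers/s * 8 bytes/transfer. */
    #include <stdio.h>

    int main(void) {
        double bw = 2.0 * 2.0 * 3.0 * 2666e6 * 8.0;   /* bytes per second */
        printf("%.0f GB/s = %.0f GiB/s\n", bw / 1e9, bw / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

which prints 256 GB/s = 238 GiB/s.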
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I expect very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune lets you do so relatively easily.
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This lets the processor balance the throughput across all the L3 slices. It also lets the processor balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was also used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
Hashing does not guarantee perfect balancing though (no hash function is perfect, especially ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Bad balancing decreases memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but they have a rather coarse granularity. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you certainly have 4 counters for that (one per IMC).
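To make the idea concrete, here is a toy sketch of an address-to-slice hash. Intel's real function is undocumented and certainly different; the only point of this XOR-fold is to show how many physical-address bits can be mixed together so that contiguous lines spread across the slices:

    /* Toy address-to-L3-slice hash (NOT Intel's real, undocumented function).
     * It just XOR-folds the cache-line address so consecutive lines spread
     * across slices. */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned slice_of(uint64_t phys_addr, unsigned n_slices) {
        uint64_t line = phys_addr >> 6;            /* drop the 64-byte offset bits */
        unsigned h = 0;
        while (line) {                             /* fold the remaining bits */
            h ^= (unsigned)(line & (n_slices - 1));
            line >>= 3;                            /* arbitrary mixing stride */
        }
        return h & (n_slices - 1);                 /* assumes n_slices is a power of 2 */
    }

    int main(void) {
        for (uint64_t a = 0; a < 8 * 64; a += 64)  /* 8 consecutive cache lines */
            printf("addr %#6llx -> slice %u\n", (unsigned long long)a, slice_of(a, 8));
        return 0;
    }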
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split the processor into two NUMA nodes that each have their own dedicated IMC. This solves the balancing issue caused by the hash function and so results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, so that only half the bandwidth is usable. Since your benchmark is supposed to be NUMA-friendly, SNC should be more efficient.
Source: Intel
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the core actually needs them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetch units have to prefetch data a long time in advance to reduce the impact of that latency, and they also need to keep many requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores make the accesses look more random from the caches' and IMCs' point of view. The thing is, DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth). You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput). To mitigate this problem, you can use software non-temporal prefetch (prefetchnta) to request that the processor load data directly into the line-fill buffer instead of the L3 cache, resulting in lower pollution (here is a related answer). This may be slower with few cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that it does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
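A minimal sketch of what that looks like in code, assuming a simple streaming reduction; the prefetch distance is an assumed tuning parameter, not a recommendation:

    /* Streaming read loop with software non-temporal prefetch (prefetchnta). */
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
    #include <stddef.h>

    double sum_stream(const double *a, size_t n) {
        const size_t PF_DIST = 64;                 /* 64 doubles = 8 cache lines ahead (assumed) */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_NTA);
            s += a[i];
        }
        return s;
    }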
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Could multi-cpu access memory simultaneously in common home computer?

As far as I know, in a modern multi-core CPU system, different CPUs share one memory bus. Does that mean only one CPU can access the memory at a given moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?
Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory busses, it's normal for them to be half-duplex (i.e. either loading or storing at any given moment, not sending data in both directions at once like gigabit Ethernet or PCIe).
Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
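For a feel of the timescale (illustrative DDR4-2666 numbers, assuming a 64-bit channel):

    /* How long one 64-byte cache-line burst occupies a single DDR4-2666
     * channel: 64 bytes / 8 bytes per transfer = 8 back-to-back transfers. */
    #include <stdio.h>

    int main(void) {
        double transfers = 64.0 / 8.0;                   /* transfers per cache line */
        double burst_ns  = transfers / 2666e6 * 1e9;     /* time on the data pins */
        printf("%.0f transfers, ~%.1f ns of bus time per line\n", transfers, burst_ns);
        return 0;
    }

So each line occupies the channel for roughly 3 ns of data-transfer time, and the memory controller pipelines many such bursts (plus command/precharge overhead) back to back.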
First of all, remember that when a CPU core "accesses the memory", they don't have to directly read from DRAM. The cache maintains a coherent view of memory shared by all cores, using (a variant of) the MESI cache coherency protocol.
Essential reading for the low-level details about how cache + memory works:
Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).
All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.
Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)
Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).
But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
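A minimal sketch of that ping-pong (no core pinning or timing harness, just the sharing pattern; running it under something like perf c2c, or comparing against a version where each thread uses its own private variable, shows the cost):

    /* One thread keeps writing a shared cache line, another keeps reading it,
     * so the line bounces Modified -> Shared between the two cores' caches. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS (10 * 1000 * 1000L)

    static _Atomic long shared_counter;           /* lives in one cache line */

    static void *writer(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERS; i++)          /* each store puts the line in Modified state here */
            atomic_store_explicit(&shared_counter, i, memory_order_release);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        long last = 0;
        while (last < ITERS - 1)                  /* each load forces a Modified -> Shared transfer */
            last = atomic_load_explicit(&shared_counter, memory_order_acquire);
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        printf("final value: %ld\n", atomic_load(&shared_counter));
        return 0;
    }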
On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.
AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)
Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one to each DIMM.
But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
I know that modern CPUs use caches to achieve low latency. So my question is about the scenario right after the computer starts, when there is no data in the caches, so CPUs will fetch data directly from memory.
Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.
One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).
If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.
Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).
Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.
Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/
Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.
If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
Now I'll explain further.
It all depends on the architecture you're using; for now, let's say modern SMP and SDRAM:
1) If two cores try to access the same address in RAM, this could go several ways:

If they both want to read, simultaneously:

Two cores on the same chip will probably share an intermediate cache at some level (2 or 3), so the read will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready.

Two cores on different chips may not share a cache, but still need to co-ordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response.

If they both want to write:

Two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since memory will be read from and written to RAM per cache line, writes at distinct but sufficiently close addresses can be coalesced into a single write to RAM.

Two cores on different chips do have a conflict, and the cache line will need to be written back to RAM by chip1, fetched into chip2's cache, modified and then written back again (no idea whether the write/fetch can be coalesced by snooping).
2) If two cores try to access different addresses
For a single access, the CAS latency means two operations can potentially be interleaved to take no longer (or perhaps only a little longer) than if the bus were idle.

Query Intel CPU details of execution unit, port, etc

Is it possible to query the number of execution unit/port per core and similar information on Intel CPU?
I have an assembly program, and noticed that the performance is quite different on different CPUs. For example, on a Core i5 4570, some functions consistently take 25% more cycles to complete than on a Core i7 4970HQ. They are both Haswell-based, from the same generation. No memory movement is involved in the part of the program benchmarked. So I am thinking maybe the difference comes from details such as the number of execution units, number of ports, etc. The benchmark measures single-core CPU cycles, so frequency/HT etc. does not come into play.
Am I right to assume such an explanation for the performance difference? If yes, where can I find such information for specific CPUs? And is it possible to query it dynamically? If possible, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
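If your measurement was based on RDTSC, something like this shows the distinction (the loop is just arbitrary CPU-bound work):

    /* RDTSC counts reference cycles at the rated base frequency, not actual
     * core clock cycles.  Compare this number against
     *   perf stat -e cycles ./a.out
     * to see the turbo ratio of the machine you ran on. */
    #include <x86intrin.h>   /* __rdtsc */
    #include <stdio.h>

    static volatile double sink;

    int main(void) {
        unsigned long long t0 = __rdtsc();
        double x = 1.0;
        for (int i = 0; i < 100 * 1000 * 1000; i++)   /* arbitrary CPU-bound work */
            x = x * 1.000000001 + 1e-9;
        unsigned long long t1 = __rdtsc();
        sink = x;                                     /* keep the loop from being optimized away */
        printf("%llu reference cycles (TSC ticks)\n", t1 - t0);
        return 0;
    }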
Every Haswell core is identical, from 5-Watt Core-M CPUs to high-power quad-cores to 18-core Xeons (which actually have a per-core power budget more like a laptop CPU); it's only the L3 cache, the number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2 or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)
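If you do want run-time dispatch, the practical thing to key on is ISA feature support (like the AVX2 that those Pentium/Celeron chips lack), not port counts. A minimal sketch using the GCC/Clang builtins; the two kernel variants are hypothetical placeholders for your hot function:

    /* Runtime dispatch on ISA features; you cannot query execution-port layout. */
    #include <stdio.h>

    static void kernel_avx2(void)     { puts("AVX2 path"); }      /* hypothetical AVX2 build of the hot loop */
    static void kernel_baseline(void) { puts("baseline path"); }  /* hypothetical baseline build */

    int main(void) {
        __builtin_cpu_init();                  /* initialize CPU feature detection */
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();
        else
            kernel_baseline();
        return 0;
    }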

why processor cache stuck at 8Mb?

Eight years ago I could buy a Core 2 Duo processor with 6 MB of cache.
Today we can buy something like an i7 with 8 MB of cache.
Why is cache growing so slowly? Is it too hard to implement, is there no reason to, or is it a different kind of cache?
This is a tricky question indeed. The 8 MB you are talking about is the amount of L3 cache found in some high-end CPUs like the i7 and some Xeons.
The optimal amount of cache is determined by a trade-off between the maximum amount of RAM for the system, the number of physical cores, and the CPU clock speed.
For instance, this Xeon CPU has 45 MB of cache but can handle 8 threads and 1.5 TB of memory.
The CPU's cache is made of multiple levels: L1, L2, L3 and L4 (also known as eDRAM, a high-bandwidth DRAM first seen in video game consoles (like the Xbox 360 and PlayStation 2) and dedicated to the internal GPU. The commercial name for the Intel chips with an internal GPU + eDRAM is Iris Pro. The Haswell microarchitecture was the first Intel microprocessor to offer this graphics-enhanced design. This L4 is used as a victim cache for the L3 cache).
Look at the complete specifications of one of the recent i7 CPUs from Intel to see an example of the kinds of caches you can find inside.
But it's not only the amount of cache that matters, but also its nature. The latest CPUs have an 8-way level 3 (L3) cache (compared with 2- or 4-way in the past), providing synchronous access to the 8 virtual cores.
The closer the cache is to the CPU, the faster it is.
The L1 cache (the fastest and most expensive) is used per physical core, the L2 more or less per thread, and the L3 for DMA channel (buffered) communication with the main memory.
The more cache you have, the fewer cache misses you get. The number of misses is related to the CPU speed (in MHz) and the amount of cache.
According to the statistics, above a specific limit, increasing the amount of cache provides little or no performance improvement, so the cost of the CPU increases for a very small gain.
The amount of cache must respect a ratio between performance improvement and cost.
The same reason explains why you can find more RAM on high-end CPUs.
The speed of RAM is also improving over time, so the amount of cache is less critical for CPU performance than it was in the past.
For a full article on the subject I recommend this very good page.

How are cache memories shared in multicore Intel CPUs?

I have a few questions regarding cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, it has many repercussions when one writes software for multicore processors / multiprocessor systems, hence asking here!)
In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, I believe there might be fewer cache misses: if a particular processor's cache does not have some data but some other processor's cache does, the first processor could avoid a read from memory. Is this assumption valid and true?
Will there be any problems in allowing any processor to access other processor's cache memory?
In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?
Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.
On old and/or low-power CPUs, the next level of cache is typically a unified L2 shared between all cores. Or on 65nm Core 2 Quad (which was two Core 2 Duo dies in one package), each pair of cores had their own last-level cache and couldn't communicate as efficiently.
Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.
32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores
Last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. Typically 1.5 to 2.25 MB of L3 cache per core, so a many-core Xeon might have a 36 MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. i.e. anything cached in a private L1d, L1i, or L2, must also be allocated in L3. See Which cache mapping technique is used in intel core i7 processor?
David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to the shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs.)
Can one processor/core access another's cache memory? If they were allowed to access each other's caches, I believe there might be fewer cache misses: if a particular processor's cache does not have some data but some other processor's cache does, the first processor could avoid a read from memory. Is this assumption valid and true?
No. Each CPU core's L1 caches are tightly integrated into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.
The whole point of multiple levels of cache is that a single cache can't be fast enough for very hot data, but can't be big enough for less-frequently used data that's still accessed regularly. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Or the required mesh network between cores to make this happen would be prohibitive compared to just building a larger / faster L3 cache.
The small/fast caches built-in to other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).
Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.
Will there be any problems in allowing any processor to access other processor's cache memory?
Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.
A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.
The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
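Not hardware code, just a compact reminder of what the states mean for one cache line in one core's cache:

    /* The four MESI states for a given line in a given cache. */
    enum mesi_state {
        MESI_MODIFIED,   /* only copy, dirty: memory is stale, must be written back or forwarded */
        MESI_EXCLUSIVE,  /* only copy, clean: can be written without asking anyone (becomes Modified) */
        MESI_SHARED,     /* clean copy that other caches may also hold: reads OK, writes need an upgrade */
        MESI_INVALID     /* not usable: any access must re-request the line */
    };
    /* Intel's MESIF adds a Forward state: one of the Shared copies is designated
     * to answer requests, so clean lines don't have to be re-read from DRAM. */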
Quick answers
1) Yes. 2) No, but it may depend on which memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.
For a full-length explanation of the issue you should read the 9-part article "What every programmer should know about memory" by Ulrich Drepper (http://lwn.net/Articles/250967/); it will give you the full picture of the issues you seem to be asking about, in good and accessible detail.
To answer your first question: I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache and they share a second-level cache. This helps with both data synchronization and utilization of memory.
To answer your second question: I believe your assumption to be correct. If the processors were able to access each other's caches, there would obviously be fewer cache misses, as there would be more data for the processors to choose from. Consider, however, shared cache. In the case of the Core 2 Duo, having a shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.
To answer your third question: there could potentially be a problem with accessing other processors' cache memory, which goes back to the "Single Write, Multiple Read" principle; we can't allow more than one process to write to the same location in memory at the same time.
For more info on the core 2 duo, read this neat article.
http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/

Resources