How to simulate write-through cache - caching

I learned the cache write-back and write-through strategy. I want to test the impact of different strategies on program IPC. But the emulator I used before was gem5. I just learned from the official mailing list that gem5 does not implement the write-through strategy. Does qemu have the option to set write-back and write-through strategy. I want to test spec 2006. So can qemu be realized? Or are there other mature simulators that can help me?

QEMU does not model caches at all, so you cannot use it to look at the performance of software in the way you are hoping to do. (In general, trying to estimate performance by running code on a software model is tricky at best, because the behaviour of software models is often significantly different from the behaviour of real hardware, especially for modern hardware which is significantly out-of-order, speculative and microarchitecturally complex. There are a lot of pitfalls for the unwary.)

Related

How practical and popular is intel cache allocation technology (CAT)

Intel introduced a software controlled cache partitioning mechanism named Cache Allocation Technology couple of years ago (you can see this website published in 2016 ). Using this technology, you can define the portion of L3 cache an application can use and at the same time other application will not evict your selected application's data from its assigned partition(s). Being software controlled, it is very easy to use. The purpose of this is said to ensure the quality of service. My question is how popular is this technology in practice among developers and system architects?
Also, some researchers have used this technology as a protection mechanism against side channel attack (like prime+probe and flush+reload). You can see this paper in this regard. Do you think it is practical?

How to make sure a piece of code never leaves the CPU cache (L3)?

The latest Intel's XEON processors have 30MB of L3 memory which is enough to fit a thin type 1 Hypervisor.
I'm interested in understanding how to keep such an Hypervisor within the CPU, i.e. prevented from being flushed to RAM or, at least, encrypt data before being sent to memory/disk.
Assumes we are running on bare metal and we can bootstrap this using DRTM (Late Launch), e.g. we load from untrusted memory/disk but we can only load the real operating system if we can unseal() a secret which is used to decrypt the Operating System and which take place after having set the proper rules to make sure anything sent to RAM is encrypted.
p.s. I know TXT's ACEA aka ACRAM (Authenticated Code Execution Area aka Authentication Code RAM) is said to have such guarantee (i.e. it is restrain to the CPU cache) so I wonder if some trickery could be done around this.
p.p.s. It seems like this is beyond current research so I'm actually not quite sure an answer is possible to this point.
Your question is a bit vague, but it seems to broil down to whether you can put cache lines in lockdown on a Xeon. The answer appears to be no because there's no mention of such a feature in Intel docs for Intel 64 or IA-32... at least for the publicly available models. If you can throw a few million $ at Intel, you can probably get a customized Xeon with such a feature. Intel is into the customized processors business now.
Cache lockdown is typically available on embedded processors. The Intel XScale does have this feature, as do many ARM processors etc.
Do note however that cache lockdown does not mean that the cached data/instructions are never found in RAM. What you seem to want is a form of secure private memory (not cache), possibly at the microcode level. But that is not a cache, because it contradicts the definition of cache... As you probably know, every Intel CPU made in the past decade has updatable microcode, which is stored fairly securely inside the cpu, but you need to have the right cryptographic signing keys to produce code that is accepted by the cpu (via microcode update). What you seem want is the equivalent of that, but at x86/x64 instruction level rather than at microcode level. If this is your goal, then licensing an x86/x64-compatible IP core and adding crypto-protected EEPROM to that is the way to go.
The future Intel Software Guard Extensions (SGX), which you mention in your further comments (after your question, via the Invisible Things Lab link), does not solve the issue of your hypervisor code never being stored in clear in RAM. And that is by design in SGX, so the code can be scanned for viruses etc. before being enclaved.
Finally, I cannot really comment on privatecore's tech because I can't find a real technological description of what they do. Twitter comments and news articles on start-up oriented sites don't provide that and neither does their site. Their business model comes down to "trust us, we know what we do" right now. We might see a real security description/analysis of their stuff some day, but I can't find it now. Their claims of being "PRISM proof" are probably making someone inside the NSA chuckle...
Important update: it's apparently possible to actually disable the (whole) cache from writing back to RAM in the x86 world. These are officially undocumented modes known as "cache-as-RAM mode" in AMD land an "no-fill mode" in Intel's. More at https://www.youtube.com/watch?v=EHkUaiomxfE Being undocumented stuff, Intel (at least) reserves the right to break that "feature" in strange ways as discussed at https://software.intel.com/en-us/forums/topic/392495 for example.
Update 2: A 2011 Lenovo patent http://www.google.com/patents/US8037292 discusses using the newer (?) No-Eviction mode (NEM) on Intel CPUs for loading the BIOS in the CPU's cache. The method can probably be used for other type of code, including supervisors. There's a big caveat though. Code other than the already cached stuff will run very slowly, so I don't see this as really usable outside the boot procedure. There's some coreboot code showing how to enable NEM (https://chromium.googlesource.com/chromiumos/third_party/coreboot/+/84defb44fabf2e81498c689d1b0713a479162fae/src/soc/intel/baytrail/romstage/cache_as_ram.inc)

Are there memory leaks in Linux?

This question is purely theoretical.
I was wondering whether the Linux source code could have memory leaks, and how they debugged it, considering that it is Linux, after all, that deals with each program's memory?
I obviously understand that Linux, being written in C, has to deal itself with malloc and free. What I don't understand is how we measure the operating system's memory leaks.
Note that this question is not Linux-specific; it also addresses the corresponding issues in Windows and MacOS X (darwin).
Quite frequently non-mainstream drivers and the staging tree has memory leaks. Follow the LKML and you can see occasional fixes for mistakes in the networking code for corner cases handling lists of SKBs.
Due to the nature of the kernel most work is code review and refactoring, but work is ongoing to make more tools:
http://www.linuxfoundation.org/en/Google_Summer_of_Code#kmemtrace_-_Kernel_Memory_Profiler
In certain cases you can use frameworks like Usermode Linux and then use conventional tools like Valgrind to attempt to peer into the running code:
http://user-mode-linux.sourceforge.net/
The implementation of malloc and free (actually brk/sbrk, since malloc and free are implemented by libc in-process) are not magical or special - it's just code, just like anything else, and there are data structures behind it that describe the mappings.
If you want to test correct behavior, one way is to write test programs in user-space that are known to allocate then free all their memory correctly. Run the app, then check the internal memory allocation structures in kernel mode using a debugger (or better yet, make this check a debug assert on process shutdown).
All software has bugs, including operating systems. Some of those bugs will result in memory leaks.
The Linux has a kernel debugger to help track down these things, but one has to discover that they exist before one can track them down. Usually, once a bug has been discovered and can be replicated at will, it becomes much easier to fix (Relatively speaking! Obviously you need a good coder to do the job). The hard part is finding the bugs in the first place and creating reliable test cases that demonstrate them. This is where you need to have a skilled QA team.
So I guess the short version of this answer is that good QA is as important good coding.

Analogs of Intel's Cluster OpenMP

Are there analogs of Intel Cluster OpenMP? This library simulates shared-memory machine (like SMP or NUMA) while running on distributed memory machine (like Ethernet-connected cluster of PC's).
This library allows to start openmp programs directly on cluster.
I search for
libraries, which allow multithreaded programms run on distributed cluster
or libraries (replacement of e.g. libgomp), which allow OpenMP programms run on distributed cluster
or compilers, capable of generating cluster code from openmp programms, besides Intel C++
The keyword you want to be searching for is "distributed shared memory"; there's a Wikipedia page on the subject. MOSIX, which became openMOSIX, which is now being developed as part of LinuxPMI, is the closest thing I'm aware of; but I don't have much experience with the current LinuxPMI project.
One thing you need to be aware of is that none of these systems work especially well, performance-wise. (Maybe a more optimistic way of saying it is that it's a tribute to the developers that these things work at all). You can't just abstract away the fact that accessing on-node memory is very very different from memory on some other node over a network. Even making local memory systems fast is difficult and requires a lot of hardware; you can't just hope that a little bit of software will hide the fact that you're now doing things over a network.
The performance ramifications are especially important when you consider that OpenMP programs you might want to run are almost always going to be written assuming that memory accesses are local and thus cheap, because, well, that's what OpenMP is for. False sharing is bad enough when you're talking about different sockets accessing a common cache line - page-based false sharing across a network is just disasterous.
Now, it could well be that you have a very simple program with very little actual shared state, and a distributed shared memory system wouldn't be so bad -- but in that case I've got to think you'd be better off in the long run just migrating the problem away from a shared-memory-based model like OpenMP towards something that'll work better in a cluster environment anyway.

Supplying 64 bit specific versions of your software

Would I expect to see any performance gain by building my native C++ Client and Server into 64 bit code?
What sort of applications benefit from having a 64 bit specific build?
I'd imagine anything that makes extensive use of long would benefit, or any application that needs a huge amount of memory (i.e. more than 2Gb), but I'm not sure what else.
Architectural benefits of Intel x64 vs. x86
larger address space
a richer register set
can link against external libraries or load plugins that are 64-bit
Architectural downside of x64 mode
all pointers (and thus many instructions too) take up 2x the memory, cutting the effective processor cache size in half in the worst case
cannot link against external libraries or load plugins that are 32-bit
In applications I've written, I've sometimes seen big speedups (30%) and sometimes seen big slowdowns (> 2x) when switching to 64-bit. The big speedups have happened in number crunching / video processing applications where I was register-bound.
The only big slowdown I've seen in my own code when converting to 64-bit is from a massive pointer-chasing application where one compiler made some really bad "optimizations". Another compiler generated code where the performance difference was negligible.
Benefit of porting now
Writing 64-bit-compatible code isn't that hard 99% of the time, once you know what to watch out for. Mostly, it boils down to using size_t and ptrdiff_t instead of int when referring to memory addresses (I'm assuming C/C++ code here). It can be a pain to convert a lot of code that wasn't written to be 64-bit-aware.
Even if it doesn't make sense to make a 64-bit build for your application (it probably doesn't), it's worth the time to learn what it would take to make the build so that at least all new code and future refactorings will be 64-bit-compatible.
Before working too hard on figuring out whether there is a technical case for the 64-bit build, you must verify that there is a business case. Are your customers asking for such a build? Will it give you a definitive leg up in competition with other vendors? What is the cost for creating such a build and what business costs will be incurred by adding another item to your accounting, sales and marketing processes?
While I recognize that you need to understand the potential for performance improvements before you can get a handle on competitive advantages, I'd strongly suggest that you approach the problem from the big picture perspective. If you are a small or solo business, you owe it to yourself to do the appropriate due diligence. If you work for a larger organization, your superiors will greatly appreciate the effort you put into thinking about these questions (or will consider the whole issue just geeky excess if you seem unprepared to answer them).
With all of that said, my overall technical response would be that the vast majority of user-facing apps will see no advantage from a 64-bit build. Think about it: how much of the performance problems in your current app comes from being processor-bound (or RAM-access bound)? Is there a performance problem in your current app? (If not, you probably shouldn't be asking this question.)
If it is a Client/Server app, my bet is that network latency contributes far more to the performance on the client side (especially if your queries typically return a lot of data). Assuming that this is a database app, how much of your performance profile is due to disk latency times on the server? If you think about the entire constellation of factors that affect performance, you'll get a better handle on whether your specific app would benefit from a 64-bit upgrade and, if so, whether you need to upgrade both sides or whether all of your benefit would derive just from the server-side upgrade.
Not much else, really. Though writing a 64-bit app can have some advantages to you, as the programmer, in some cases. A simplistic example is an application whose primary focus is interacting with the registry. As a 32-bit process, your app would not have access to large swaths of the registry on 64-bit systems.
Continuing #mdbritt's comment, building for 64-bit makes far more sense [currently] if it's a server build, or if you're distributing to Linux users.
It seems that far more Windows workstations are still 32-bit, and there may not be a large customer base for a new build.
On the other hand, many server installs are 64-bit now: RHEL, Windows, SLES, etc. NOT building for them would be cutting-off a lot of potential usage, in my opinion.
Desktop Linux users are also likely to be running the 64-bit versions of their favorite distro (most likely Ubuntu, SuSE, or Fedora).
The main obvious benefit of building for 64-bit, however, is that you get around the 3GB barrier for memory usage.
According to this web page you benefit most from the extra general-purpose registers with 64 bit CPU if you use a lot of and/or deep loops.
You can expect gain thanks to the additional registers and the new passing parameters convention (which are really linked to the additional registers).

Resources