I was curious as to how cache performance is measured. Can you do it programmatically, or do you need specialized tools for this purpose? Does the programming language being used matter?
The ratio of 'Cache Hits' versus 'Cache Misses' (often expressed as the hit rate, hits / (hits + misses)) is the usual indicator of cache performance.
In the Windows/.NET world these would usually be measured by creating custom performance counters.
I will assume that you are on Windows, because on a real operating system all such stats, and related stats, are available in utilities that are part of the o/s.
When you are using an app such as Windows, which is not an o/s (except in the marketing brochures), the easiest thing to do, labour-wise, is to install a Unix layer on top of it. There are many, such as Cygwin. That provides the o/s layer and most of what can be expected from an o/s, such as virtual memory statistics. There is nothing to write; it is just a single-line command. Google for vmstat.
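On Linux you can also gather these numbers programmatically (answering the "can you do it programmatically" part of the question) by reading the hardware counters through the perf_event_open(2) syscall. Below is a minimal sketch of my own, not taken from the answers above; error handling is omitted and the "run the code you want to measure" section is a placeholder.

    /* cache_counters.c - a sketch of reading cache hit/miss counters on Linux. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;            /* which hardware event to count */
        attr.disabled = 1;               /* start stopped                 */
        attr.exclude_kernel = 1;         /* count user-space only         */
        /* pid = 0, cpu = -1: this process, any CPU */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int refs   = open_counter(PERF_COUNT_HW_CACHE_REFERENCES);
        int misses = open_counter(PERF_COUNT_HW_CACHE_MISSES);

        ioctl(refs,   PERF_EVENT_IOC_RESET,  0);
        ioctl(misses, PERF_EVENT_IOC_RESET,  0);
        ioctl(refs,   PERF_EVENT_IOC_ENABLE, 0);
        ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the code you want to measure here ... */

        ioctl(refs,   PERF_EVENT_IOC_DISABLE, 0);
        ioctl(misses, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t r = 0, m = 0;
        read(refs,   &r, sizeof(r));
        read(misses, &m, sizeof(m));
        printf("cache references: %llu, misses: %llu\n",
               (unsigned long long)r, (unsigned long long)m);
        return 0;
    }

Note that these are hardware CPU-cache events; an application-level cache (as in the .NET answer above) needs its own counters.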
I am working on implementing a prototype performance monitoring system. I went through multiple documents and resources to understand the concepts, but I am still confused between profiling and diagnostics. Can somebody provide an explanation of these two terms, how they relate, and when/where we use them?
"Profiling" usually means mapping things happening in the system (e.g., performance monitoring events) to processes, or to functions (or instructions) within processes. Examples of profiling tools in the Unix/Linux world include "gprof" and "oprofile". Intel's "VTune Amplifier" is another commonly used profiler. Some profilers are limited to looking at the performance of a single process, while others (usually requiring elevated privileges) monitor all processes (including the kernel) operating on the system during the measurement period.
"Diagnostics" is not a term I see very often in performance monitoring, but from the context I would assume that this means looking for evidence of "trouble" in the overall operation of the system. As an example, the performance monitoring system at https://github.com/TACC/tacc_stats collects hardware and software performance monitoring data on each server. In TACC's operation, the data is reviewed automatically to look for matches to a variety of heuristics related to known patterns of poor performance (e.g., all memory accesses being made to one socket in a 2-socket system). The data is also used by human performance analysts in response to user queries and is aggregated to provide an overview of performance-related characteristics by application area.
I am trying to optimize critical parts of C code for image processing on ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependent on memory accesses and how they are done.
What is the simplest way (by simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be fed small pieces of assembly and analyze them) to get an idea of how memory accesses are "bottlenecking" the subroutine?
I know this cannot be done exactly without running it on specific hardware under specific conditions, but the purpose is to have a "comparison" trial-and-error tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)
I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way that you can quickly iterate and test/simulate system-level effects. This does get complicated, so you might pay some money for system-level simulators such as the ones included in ARM's RealView. Or I might recommend getting some open source hardware like a Panda Board and using valgrind's cachegrind. With Linux on the Panda Board you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).
Memory access is one thing that simply cannot be modeled from "small pieces of assembly" to generate meaningful guidance. Cache hierarchies, store buffers, load miss queues, cache policy, etc. - even relatively simple processors have an enormous amount of "state" hiding underneath the LSU, and any small-scale analysis cannot accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation" instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.
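To illustrate, here is a minimal sketch of my own using the corresponding C intrinsics from arm_neon.h. (The explicit :128 alignment hint only exists at the assembly level; with intrinsics you rely on the compiler and on keeping the buffers 16-byte aligned.) The copy_and_add() name and the constant-add "computation" are purely illustrative.

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Adds a constant to every byte of src and writes the result to dst.
     * Assumes both pointers are 16-byte aligned and n is a multiple of 16. */
    void copy_and_add(uint8_t *dst, const uint8_t *src, size_t n, uint8_t k)
    {
        uint8x16_t vk = vdupq_n_u8(k);           /* splat the constant             */
        for (size_t i = 0; i < n; i += 16) {
            uint8x16_t v = vld1q_u8(src + i);    /* single-structure load (vld1)   */
            v = vaddq_u8(v, vk);                 /* the "useful computation"       */
            vst1q_u8(dst + i, v);                /* single-structure store (vst1)  */
        }
    }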
I'm reading the O'Reilly Linux Kernel book, and one of the things pointed out during the chapter on paging is that the Pentium cache lets the operating system associate a different cache management policy with each page frame. So I get that there could be scenarios where a program has very little spatial/temporal locality and memory accesses are random/infrequent enough that the probability of cache hits is below some sort of threshold.
I was wondering whether this mechanism is actually used in practice today, or whether it is more of a feature that was necessary back when caches were fairly small and not as efficient as they are now. I could see it being useful for an embedded system with little overhead as far as system calls are concerned; are there other applications I am missing?
Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g., streaming loads/stores, non-temporal prefetches, etc.). The use-cases also vary a lot: whether you're trying to map an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), whether you want to define a write-through region for better integrity management of some database, or whether you just want plain write-back to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires you to locate them further away from the core and pay in latency. So cache management is still, and is going to be for a long while, one of the most important things for performance-critical systems and applications.
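As a concrete illustration of the streaming/non-temporal stores mentioned above, here is a minimal sketch of my own (x86/SSE2, not from the answer): a copy routine that deliberately bypasses the normal write-back policy for destination data that will not be reused soon.

    #include <emmintrin.h>   /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Copies n bytes without polluting the cache with the destination lines.
     * Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
    void stream_copy(void *dst, const void *src, size_t n)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_load_si128(s + i);   /* normal (cached) load           */
            _mm_stream_si128(d + i, v);          /* non-temporal (streaming) store */
        }
        _mm_sfence();   /* make the streaming stores visible, in order */
    }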
A distributed system is described as scalable if it remains effective when there is a significant increase in the number of resources and the number of users. However, these systems sometimes face performance bottlenecks. How can these be avoided?
The question is pretty broad, and depends entirely on what the system is doing.
Here are some things I've seen in systems to reduce bottlenecks.
Use caches, reducing network and disk bottlenecks. But remember that knowing when to evict from a cache can be a hard problem in some situations (a minimal sketch of one eviction strategy follows this list).
Use message queues to decouple components in the system. This way you can add more hardware to specific parts of the system that need it.
Delay computation when possible (often by using message queues). This takes the heat off the system during high-processing times.
Of course, design the system for parallel processing wherever possible. One host doing processing is NOT scalable. Note: most relational databases fall into the one-host bucket, which is why NoSQL has suddenly become popular, though it is not always an appropriate choice.
Use eventual consistency if possible. Strong consistency is much harder to scale.
Some are proponents of CQRS and DDD. Though I have never seen or designed a "CQRS system" nor a "DDD system," those have definitely affected the way I design systems.
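Here is the eviction sketch promised in the caching point above: a deliberately tiny, TTL-based cache in C of my own devising, not a production design. Real systems combine TTL with LRU/LFU and explicit invalidation, which is exactly where eviction gets hard.

    #include <string.h>
    #include <time.h>

    #define CACHE_SLOTS 64
    #define TTL_SECONDS 30

    struct entry {
        char   key[32];
        char   value[128];
        time_t stored_at;               /* 0 means the slot is empty */
    };

    static struct entry cache[CACHE_SLOTS];

    /* Trivial hash to pick a slot; a collision simply overwrites (evicts). */
    static unsigned slot_for(const char *key)
    {
        unsigned h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h % CACHE_SLOTS;
    }

    void cache_put(const char *key, const char *value)
    {
        struct entry *e = &cache[slot_for(key)];
        strncpy(e->key, key, sizeof e->key - 1);
        e->key[sizeof e->key - 1] = '\0';
        strncpy(e->value, value, sizeof e->value - 1);
        e->value[sizeof e->value - 1] = '\0';
        e->stored_at = time(NULL);
    }

    /* Returns the cached value, or NULL on a miss or an expired entry. */
    const char *cache_get(const char *key)
    {
        struct entry *e = &cache[slot_for(key)];
        if (e->stored_at == 0 || strcmp(e->key, key) != 0)
            return NULL;                              /* miss             */
        if (time(NULL) - e->stored_at > TTL_SECONDS) {
            e->stored_at = 0;                         /* expire and evict */
            return NULL;
        }
        return e->value;
    }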
There is a lot of overlap in the points above; some of the techniques may use some of the others.
But experience (your own and others') eventually teaches you about scalable systems. I keep up to date by reading about designs from Google, Amazon, Twitter, Facebook, and the like. Another good starting point is the High Scalability blog.
Just to build on a point discussed in the above post, I would like to add that what you need for your distributed system is a distributed cache. When you intend to scale your application, the distributed cache acts like an "elastic" data fabric, meaning that you can increase the storage capacity of the cache without compromising on performance, while at the same time getting a reliable platform that is accessible to multiple applications.
One such distributed caching solution is NCache. Do take a look!
There is a set of memory management algorithms used in operating system construction, such as paging, segmentation, paged segmentation, segmented paging, and others.
Do you know if they are used outside that area, in not-so-low-level software? Are they used in business applications?
These algorithms are for translating program memory addresses into physical memory addresses. You will very rarely ever have to think about this in an application. In some extreme cases of applications working on very large datasets you may have to create a driver-like module to tune memory translation, but all the rest is still up to the operating system.
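For the curious, here is a toy sketch of my own showing what "translating program addresses to physical addresses" means in the paging case: a single-level page-table lookup. Real hardware uses multi-level tables, TLBs and permission bits; the operating system only gets involved on the page-fault path.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096u
    #define NUM_PAGES 1024u                    /* toy address space: 4 MiB */

    /* One entry per virtual page: the physical frame it maps to, if present. */
    struct pte {
        uint32_t frame;                        /* physical frame number    */
        bool     present;                      /* is the page mapped?      */
    };

    static struct pte page_table[NUM_PAGES];

    /* Translates a virtual address; returns false on a "page fault". */
    bool translate(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number      */
        uint32_t offset = vaddr % PAGE_SIZE;   /* offset within the page   */

        if (vpn >= NUM_PAGES || !page_table[vpn].present)
            return false;                      /* unmapped: the OS must handle it */

        *paddr = page_table[vpn].frame * PAGE_SIZE + offset;
        return true;
    }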
You might never write an OS yourself, but if you ever find yourself having to write a device driver, it will be imperative that you understand these issues. So it is still quite useful to understand how these algorithms work.
Now you might be in school thinking, "Yuck, I'll just avoid that stuff." But you really have no idea where a 40-year career in the industry might take you.