What do trace inputs provided to trace-driven simulators look like? - memory-management

Simulators used to study computer architecture performance are broadly categorized as execution-driven and trace-driven. They work as follows.
Trace-driven simulator: A real machine executes a benchmark program in its native ISA. The binary is usually instrumented (modified) so that, as each instruction executes, information such as the instruction opcode, data address, and branch outcome is written out to a trace file. Later, these traces are read into a simulator, which can run on any machine (even one with a different ISA), and analyzed for the performance study.
Execution-driven simulator: The benchmark is executed directly, and the performance study is performed at the same time.
Can you explain what traces (the input to a trace-driven simulator) look like? Broadly, I know they need to contain things like opcodes, memory references, branch outcomes, etc. What else do they need to store so that the simulator has no issues starting/running the benchmark from an arbitrary point (i.e. not from the start)?

Related

Stall all memory accesses of an application

I want to analyze the effect of using slower memories on applications and need a means to add delay to all memory accesses. So far I have investigated Intel PIN and other software, but they seem to be overkill for what I need. Is there any tool to do this?
Is adding NOP operations in the application's binary right before each LOAD/STORE a feasible way?
Your best bet is to run your application under an x86 simulator such as MARSSx86 or Sniper. Using these simulators you can smoothly vary the modeled memory latency or any other parameter of the system[1] and see how your application's performance varies. This is a common approach in academia (often a generic machine is modeled, rather than x86, which gives you access to more simulator implementations).
The primary disadvantage of using a simulator is that even good simulators are not completely accurate, and how accurate they are depends on the code in question. Certain types of variance from actual performance aren't particularly problematic when answering the question "how does performance vary with latency", but a simulator that doesn't model the memory access path well might produce an answer that is far from reality.
If you really can't use simulation, you could use a binary rewriting tool like PIN to instrument the memory access locations. nop would be a bad choice: it executes very quickly, and you cannot add a dependency between the memory load result and the nop instruction. That means it only adds extra "work" at the location of each load, but that work is independent of the load itself, so it doesn't simulate increased memory latency.
A better approach would be to follow each load with a long-latency operation that uses the result of the load as input and output (but doesn't modify it). Something like imul reg, reg, 1, if reg received the result of the load (though this adds only about 3 cycles, so you might hunt for longer-latency instructions if you want to add a lot of latency).
[1] At least within the set of things modeled by the simulator.

How can I perform a low-level analysis of a performance degradation?

For example, I have a large linear function (1 basic block, ~1000 instructions)
which is called many times. After some fiddling with compiler options I've got
an unexpected 10% performance degradation on Cortex-A57. Presumably it is due to
slightly different instruction scheduling. I'd like to investigate the problem
deeper and find out what instruction combination causes unnecessary pipeline
stalls. But I have no idea how I could do that. I guess, I need a very detailed
execution trace to understand what happens, though I'm not sure if it is
possible to get such a trace.
So, the question is: What tools can I use to investigate such low-level
performance problems? How can I determine what prevents the CPU from executing
the maximum number of instructions every cycle?
PS I'm mostly interested in Cortex-A57 cores, but I'd appreciate useful
information on any other core or even a different architecture.
PPS The function accesses the memory, but it is expected that almost all memory
accesses hit the cache. The assumption is confirmed by perf stat -e r42,r43
(L1D_CACHE_REFILL_LD and L1D_CACHE_REFILL_ST events).
Tools: I'm most familiar with Intel compilers and tools but notice there are several similar tools out there for the ARM ecosystem. Here are some techniques I recommend.
USE YOUR COMPILER
It has many options that can give you a very good idea of what is going on.
Disable any optimizations (compiler option) while compiling your original code. This will tell you if the issue is related to code generation optimizations.
Do a before and after ASM dump, and compare. You may find code differences that you already know are suspect.
Make sure you are not including any debugging information. Debug builds insert checkpoints and other things that can potentially impact the performance of your code, and these bits of code will also change how the code moves through the pipeline.
Change the compiler options one at a time to identify if the issue is related to data or code alignment enforcement, etc. I'm sure you've already done this but am mentioning it for completeness.
Enable any compiler performance-monitoring options that can be dumped to a log file. A lot of useful information can be found in compiler log files. On the other hand, they also contain info that can only be interpreted by those that live on a higher plane of existence, i.e. compiler writers.
USE A TOOL THAT DUMPS PMU EVENTS
I saw quite a few out there. My apologies for not giving references but you can do a simple search "tool arm pmu events". These can be extremely sophisticated and powerful, e.g. Intel VTune, or very basic and still very powerful, e.g. the command line SEP for x86.
Take a look at the performance events (PMU events) available to you and figure out which events you want to monitor. You can get these events from the ARM Cortex-A57 processor tech reference (Chapter 11, Performance Monitoring Unit).
USE A PMU DUMPING SDK
Use an SDK that has functions for acquiring the ARM PMU events. These SDKs provide APIs for selecting and acquiring PMU events, giving you very precise control. Inserting this monitoring code may impact the execution of your code, so be careful about its placement. Again, you can find plenty of such SDKs with a simple search.
STUDY UP ON PIPELINE DEBUGGING (IF YOU ARE REALLY INTO THIS TYPE OF STUFF)
Find a good architectural description of the pipeline, including reservation stations, # of ALUs, etc.
Find a good reference on how to figure out what is going on in the pipeline. Here's an example for x86. ARM is a different beast, but x86 articles will give you the basics (and more) of what you need to analyze and what you can do with what you find.
Good luck. Pipeline debugging can be fun but time consuming.

ARM NEON: Tools to predict performance issues due to memory access limited bandwidth?

I am trying to optimize critical parts of a C code for image processing in ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependant on memory accesses and how they are done.
What is the simplest way to get an idea of how memory accesses are "bottlenecking" the subroutine? (By simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be fed small pieces of assembly and will analyze them.)
I know this cannot be done exactly without running on specific hardware under specific conditions, but the purpose is to have a comparative trial-and-error tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)
I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way to quickly iterate and test/simulate system-level effects. This does get complicated, so you might pay some money for system-level simulators such as the one included in ARM's RealView. Or I might recommend getting some open-source hardware like a Panda Board and using valgrind's cachegrind. With Linux on the Panda Board you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).
Memory access is one thing that simply cannot be modeled from "small pieces of assembly" to generate meaningful guidance. Cache hierarchies, store buffers, load-miss queues, cache policies, etc.: even relatively simple processors have an enormous amount of "state" hiding underneath the LSU, and no small-scale analysis can accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation" instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.

Debugging a micro-processor

One of our co-processors is an 8-bit microprocessor. Its main role is to control the hardware that handles flash memory. We suspect that the code it's running is highly inefficient, since we measured low speeds when reading/writing to flash memory. The problem is, we have only one JTAG port, and it's connected to the main CPU, so debugging the co-processor is not an option. What we do have is a register, readable from the CPU, that contains the micro-processor's program counter. The bad news is that the micro-processor runs at a different frequency than the CPU, so monitoring its program counter from outside is also hard. Measuring time inside the micro-processor is also very difficult, since its registers are only 8 bits long. Needless to say, the code is in assembly and very complex. How would you go about approaching this problem?
I would advise that you start from (or generate) the requirements specification for this part and reimplement the code in C (or even careful use of a C++ subset). If the "complexity" you perceive is merely down to the code rather than the requirements, it would be a good idea to design it out; otherwise it will only make future maintenance more complex, error-prone and expensive.
One of the common arguments for using assembler are size and performance, but more frequently a large body of assembler code is far from optimal; in order to retain a level of productivity and maintainability often "boiler-plate" code is used and reused that is not tailored to the specific situation, whereas a compiler will analyse code changes and perform the kind of "micro-optimisation" that system designers really shouldn't have to sweat about. Make your algorithms and data structures efficient and leave the target instruction set details to the compiler.
Even without the ability to directly debug on the target, the use of a high-level language will allow prototyping and simulation on a PC for example.
Even if you retain the assembler code, if your development tools include an instruction set simulator, that may be a good alternative to hardware debugging; especially if it supports debugger scripts that can be used to simulate the behaviour of hardware devices.
All that said, looking at this as a "black-box" and concluding that the code is inefficient is a bit of a leap. What kind of flash memory is appearing to be slow for example? How is it interfaced to the microcontroller? And how have you measured this performance? Flash memory is intrinsically slow - especially writing and page erase; check the performance specification of the Flash before drawing any conclusion on the software performance.

Profiling for analyzing the low level memory accesses of my program

I have to analyze the memory accesses of several programs. What I am looking for is a profiler that allows me to see which of my programs is more memory-intensive rather than compute-intensive. I am very interested in the number of accesses to the L1 data cache, L2, and main memory.
It needs to run on Linux and, if possible, be usable from the command line only. The programming language is C++. If there is any problem with my question, such as you do not understand what I mean or you need more data, please comment below.
Thank you.
Update with the solution
I have selected the answer of Crashworks as the accepted one because it is the only one that provided some of what I was looking for. But the question is still open; if you know a better solution, please answer.
It is not possible to record every access to memory, and it wouldn't make much sense: an access to memory could be fetching the next instruction (the program itself resides in memory), or your program reading or writing a variable, so your program is accessing memory almost all the time.
What might be more interesting for you is to follow the memory usage of your program (both heap and stack). In this case you can use the standard top command.
You could also monitor system calls (e.g. writing to disk, or attaching/allocating a shared memory segment). In this case you should use the strace command.
A more complete way to control everything would be debugging your program with the gdb debugger. It lets you control your program, for example by setting a watchpoint on a variable so the program is interrupted whenever that variable is read or written (maybe this is what you were looking for). On the other hand, GDB can be tricky to learn, so DDD, a GTK graphical frontend, will help you get started with it.
Update: What you are looking for is really low-level memory access information that is not available at user level (that is the task of the operating system kernel). I am not even sure whether L1 cache management is handled transparently by the CPU and hidden from the kernel.
What is clear is that you need to go down to the kernel level, so look at KDB, explained here, or KDBG, explained here.
Update 2: It seems that the Linux kernel does handle the CPU cache, but only the L1 cache. The book Understanding the Linux Virtual Memory Manager explains how memory management in the Linux kernel works. This chapter explains some of the guts of L1 cache handling.
If you are running Intel hardware, then VTune for Linux is probably the best and most full-featured tool available to you.
Otherwise, you may be obliged to read the performance-counter MSRs directly, using the perfctr library. I haven't any experience with this on Linux myself, but I found a couple of papers that may help you (assuming you are on x86 -- if you're running PPC, please reply and I can provide more detailed answers):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/11169/35961/01704008.pdf?temp=x
http://www.cise.ufl.edu/~sb3/files/pmc.pdf
In general these tools can't tell you exactly which lines your cache misses occur on, because they work by polling a counter. What you will need to do is poll the "L1 cache miss" counter at the beginning and end of each function you're interested in to see how many misses occur inside that function, and of course you may do so hierarchically. This can be simplified by, e.g., writing a class that records the counter value on entering a scope and computes the delta on leaving it.
VTune's instrumented mode does this for you automatically across the whole program. The equivalent AMD tool is CodeAnalyst. Valgrind claims to be an open-source cache profiler, but I've never used it myself.
Perhaps cachegrind (part of the valgrind suite) may be suitable.
Do you need something more than what the Unix command top will provide? It shows the CPU and memory usage of Linux programs in an easy-to-read format.
If you need something more specific, such as a profiler, the programming language (Java/C++/etc.) will help determine which one is best for your situation.
