Estimating the memory size of a software - memory-management

I'm working on the development of Boot that will be embedded in a PROM chip for a project.
I was tasked with making an estimation of the final memory size that the software will probably take but I've never done this before.
I searched a bit around and I'm thinking about doing the following:
Counting all the variables, this size goes directly to the size total
Estimating a number of line of codes each function will take (the code hasn't been written yet)
Finding out an approximate number of asm instruction per c instruction
Total size = Total nb line of codes * avg asm instruction per c instruction * 32bit
My solution could very well be bogus, I hope someone will be able to help.

On principle - You are on the right track:
You need to distinguish between several types of memory footprint:
Stack
Dynamic memory (malloc, new, etc.)
Initialised variables
Un-initialised variables
Code
Stack is mostly impacted by recursion, local variables and function parameters.
Dynamic memory (heap) is obvious and also probably not relevant to you - so I'll ignore it for now.
Initialised variables are interesting since you need to count them twice - once for the program footprint on the PROM (similar to code and constants) and once for the RAM footprint.
Un-initialised variables obviously go toward the RAM and counting the size is almost good enough (you also need to consider alignment and padding.
The hardest to estimate is code or what goes into PROM, you need to count constants and local variables as well as the code, the code itself is more or less what you suspect (after adding padding, alignment, function call overhead, interrupt vector initialisation etc.) but many things can make it larger than expected, such as inline functions, library functions (many seemingly trivial operations involve such functions), casting etc.

On way of answering the question would be from experience or assessment of existing code with similar functionality. However there will be a number of factors that affect code size:
Target architecture and instruction set.
Compiler and compiler options used.
Library code usage.
Capability of development staff.
Required functionality.
The "development of Boot" tells us nothing about the requirements or functionality of your boot process. This will have the greatest affect on code size. As an example of how target can make a difference, 8-bit targets typically have greater code density, but generate more code for arithmetic on larger data types, while on say an ARM target where you can select between Thumb and ARM instruction sets, the code density will change significantly.
If you have no prior experience or representative code base to work from, then I suggest you perform a few experiments to get some metrics you can work with:
Build an empty application - just an empty main() function if C or C++; that will give you the basic fixed overhead of the runtime start-up.
If you are using library code, that will probably take a significant amount of space; add dummy calls to all library interfaces you will make use of in the final application, that will tell you how much code will be taken up by library code (assuming the library code is not in-lined).
Thereafter it will depend on functionality; you might implement a subset of the required functionality, and then estimate what proportion of the final build that might constitute.
Regarding your suggestions, remember that variables do not occupy space in ROM, though any constant initialisers will do so. Typically a boot-loader can use all available RAM because the application start-up will re-establish a new runtime environment for itself, discarding the boot-loader environment and variables.
If you were to provide details of functionality and target, you may be able to leverage the experience of the community in estimating the required resources. For example I might be able to tell you (from experience) that a boot-loader with support for Flash programming that loads via a UART using XMODEM protocol on an ARM7 using ARM instruction set will fit in 4k Bytes, or that adding support for loading via SD card may add a further 6Kb, and say USB Virtual Comm Port a further 4Kb. However your requirements are possibly unique and you will have to determine the resource load for yourself somehow.

Related

Why cannot a kernel be launched with the reason of too many register use when there is a register spilling mechanism?

1) When does a kernel start to spill registers to local memory?
2) When there is not enough registers, how does the CUDA runtime decide to not launch a kernel and throws too many resources requested error? How many registers are enough to launch a kernel?
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
1) When does a kernel start to spill registers to local memory?
This is entirely under control of the compiler. It is not performed by the runtime, and there are no dynamic runtime decisions about it. When your code reaches the point of a spill, it means that the compiler has inserted an instruction like:
STL [R0], R1
In this case, R1 is being stored to local memory, the local memory address given in R0. This would be a spill store. (After that instruction, R1 could be used for/loaded with something else.) The compiler knows when it has done this, of course, and so it can report the number of spill loads and spill stores it has chosen to use/make. You can get this information (along with register usage, and other information) using the -Xptxas=-v compiler switch.
The compiler (unless you restrict it, see below) makes decisions about register usage primarily focused on performance, paying otherwise less attention to how many registers are actually used. The first priority is performance.
2) When there is not enough registers, how does the CUDA runtime decide to not launch a kernel and throws too many resources requested error? How many registers are enough to launch a kernel?
At compile-time, when your kernel code is being compiled, the compiler has no idea how it will be launched. It has no idea what your launch configuration will be like (number of blocks, number of threads per block, amount of dynamically allocated shared memory, etc) In fact the compilation process mostly proceeds as if the thing being compiled is a single thread.
During compilation, the compiler makes a bunch of static decisions about register assignments (how and where registers will be used). CUDA has binary utilities that can help with understanding this. Register assignments don't change at runtime, are not in any way dynamic, and therefore are entirely determined at compile time. Therefore, at the completion of compilation for a given device code function, it is generally possible to determine how many registers are needed. The compiler includes this information in the binary compiled object.
At runtime, at the point of kernel launch, the CUDA runtime now knows:
How many registers (per thread) are needed for a given kernel
What device we are running on, and therefore what the aggregate limits are
What the launch configuration is (blocks, threads)
Assembling these 3 pieces of information means the runtime can immediately know if there is or will be enough "register space" for the launch. Roughly speaking, the pass/fail arithmetic is if the launch would satisfy this inequality:
registers_per_thread*threads_per_block <= max_registers_per_multiprocessor
There is granularity to be considered in this equation as well. Registers are often allocated in groups of 2 or 4 at runtime, i.e. the registers_per_thread quantity may need to be rounded up to the next whole-number multiple of something like 2 or 4, before the inequality test is applied. The registers_per_thread quantity is ascertained by the compiler as already described. The threads_per_block quantity comes from your kernel launch configuration. The max_registers_per_multiprocessor quantity is machine-readable (i.e. it is a function of the GPU you are running on). You can see how to retrieve that quantity yourself if you wish by studying the deviceQuery CUDA sample code.
3) Since there is a register spilling mechanism, shouldn't all CUDA kernels be launched even if there are not enough registers?
I reiterate that the register assignment (and register spill decisions) is/are entirely a static compile-time process. No runtime decisions or alterations are made. The register assignment is entirely inspectable from the compiled code. Therefore, since no adjustments can be made at runtime, no changes could be made to allow an arbitrary launch. Any such change would require recompilation of the code. While this might be theoretically possible, it is not currently implemented in CUDA. Furthermore, it has the possibility to lead to both variable and perhaps unpredictable behavior (in performance) so there might be reasons not to do it.
Its possible to make all kernels "launchable" (with respect to register limitations) by suitably restricting the compiler's choices about register assignment. __launch_bounds__ and the compiler switch -maxrregcount are a couple ways to achieve this. CUDA provides both an occupancy calculator as well as an occupancy API to help with this process.

Drhystone benchmark on 32 bit micrcontroller

Currently I am doing a performance comparison on two 32bit microcontrollers. I used Dhrystone benchmark to run on both microcontrollers. One microcontroller has 4KB I-cache while second coontroller has 8KB of I-cache. Both microcontrollers are using same tool chain. As much as possible I kept same static and run-time settings onboth microcontrollers. But microcontroller with 4KB cache are faster than 8KB cache microcontroller. Both microcontroller are from same vendor and based on same CPU.
Could anyone provide some information why microcontroller with 4KB cache is faster than other?
Benchmarks are in general useless. dhrystone being one of the oldest ones and it may have had a little bit of value back then, before pipelines and too much compiler optimization. I think I gave up on dhrystone 15 years ago roughly about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
sub r0,r0,#1
bne ASMDELAY
bx lr
Which is primarily two INSTRUCTIONS, can vary WIDELY in execution time on the same chip if you understand how modern processors work. The simple trick to seeing this is turn off the caches and prefetchers and such, and place this code at offset 0x0000, call it with some value. place it at 0x0004 repeat, then at 0x0008, repeat. keep doing this. You can put one, two, etc nops between the subtract and branch. try it at various offsets.
THEN, turn on and of caches for each of these alignments, if you have a prefetch outside the processor for the flash, turn that on and off.
THEN, vary your clocks, esp for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Lets just say 20 times longer in some cases than others.
Now take a small program or a small fraction of the dhrystone program. Compile that for your mcu, how many instructions do you see? Make minor to major optimization and other variations on the compile command line, how much does the code change. If two instructions can vary by lets call it 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the dhrystone programs you have right now with the compiler options you have right now, go into your bootstrap, add one nop (causing the entire binary to shift by one instruction in flash) run again. Add two, three, four. You are still not comparing different mcus just running your benchmark on one system.
Run with and without d cache, with and without i cache if you have each of those. turn on and off the flash prefetch if you have one and if you have a write buffer you can turn on and of try that. Still remaining on the same compiler same options same mcu.
Take the different functions in the dhrystone source code, and re-arrange them in the source code. instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this mcu you should now see that the execution time of the same source code which is completely unmodified (other than perhaps re-arranging functions) can and will have vastly different execution times.
if you then start changing compiler options or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that dhrystone benchmarks today or from way back had a single result for each platform? Simple, that result was just one of a wide range, not really representing the platform.
So then if you try to compare two different hardware platforms be it the same arm core inside a different mcu from the same vendor or different vendors. the arm core, assuming (which is not a safe assumption) it is the same source with the same compile/build options, even assuming the same verilog compile and synthesis was used. you can have that same core change based on arm provided options. Anyway, how the vendor be it the same vendor in two instances or two different vendors wraps that same core, you will see variations. Then take a completely different core be it another arm or a mips, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You cant. What you can do is use benchmarks to give the illusion of one thing being better than another, one computer is faster than another, one compiler is faster than another. In order to sell computers or compilers. Sprints coverage is within one percent of Verizons...does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same, but the size of the cache is part of that so it may already be a different cpu implementation as the width or depth of the cache can affect things. In software it is like needing a 32 bit pointer rather than a 16 bit pointer (17 bit instead of 16, but you cant have a 17 bit generally in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, use that same binary exactly for that space, can attach different bootstrap code as needed, note the C library calls strcpy, etc also have to be the same in the same space between platforms to eliminate the compiler and alignment from messing you up. this may or may not level the playing field.
If you want to believe these are the same cpu, then turn the caches off, eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to ram and run in ram, eliminate the flash issues. I assume you have them both clocked the same with the same wait states in the flash?
if they are the same cpu and the chip vendor has with these two chips made the memory system take the same number of clocks say for ram accesses, and it is really the same cpu, you should be able to get the same time by eliminating optimizations (caching, flash prefetching, alignment).
what you are probably seeing is some form of alignment with how the code lies in memory from the compiler vs cache lines, or it could be much much simpler it could be just the differences in the caches, how the hits and misses work and the 4KB is just more lucky than the 8KB for this particular program compiled a certain way, aligned in memory a certain way, etc.
With the simple two instruction loop above it is easy to see some of the reasons why performance varies on the same system, if your modern cpu fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetch may think it needs to fetch another 8 beyond that costing you those clock cycles. Certainly as you exactly straddle two "fetch lines" as I call them with those two instructions it is going to cost you a lot more cycles per loop even with a cache. Same problem happens when those two instructions approach a cache line (as you vary their alignment per test) eventually it takes two cache line reads instead of one to fetch those two instructions. At least the first time through there is an extra cache line read. The extra clocks for that first time through is something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language. There is an epub/etc you can build from github of this book. the 8088 was obsolete when this book came out if that is all you see 8088 stuff, then you are completely missing the point. It applies to the most modern processors today, how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, all came from that books knowledge applied for however many decades I have been doing this.
So again if you have truly eliminated the compiler, alignment, the cpu, the memory system tied to that cpu, etc and it is down to only the size of the cache varies. Then it is probably related to how the cache line fetches hit and miss differently based on alignment of the code relative to the cache lines for the two caches. One is hitting more and missing less and/or evicting better for this particular binary. You can try rearranging the functions, can add nops, or if you cant get at the bootstrap then add whole functions or more code (another printf, etc) at a lower address in the binary causing the linker to slide the code under test through to different addresses changing how the program lines up with the cache lines. Since the functions in the code under test are so big (more than a few instructions to implement) you would have to start modifying the program in order to get a finer grained adjustment of binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and or wholesale re-arrange the binary based on functions being re-arranged.
Bottom line benchmarks dont really tell you much, the results have more of a negative stink to them than a positive joy. Without a re-write a particular benchmark or application may just do better on one platform (be it everything is the same but the size of the cache or two completely different architectures) even when you try to vary the results due to alignment, turning on and off prefetching, write buffering, branch prediction, etc. Pulling out all the tricks you can come up with one may vary from x to y the other from n to m and maybe there is some overlap in the ranges. For these two platforms that you say are similiar except for cache size I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparision. If/when you switch to some other application/benchmark this all starts over again, no reason to assume dhrystone results predict any other code under test. The only program that matters, esp on an MCU is the final build of your application. Just remember changing a single line of code, or even adding a single nop in the bootstrap can have dramatic results in performance sometimes several TIMES slower or faster from a single nop.

Optimizing local memory use with OpenCL

OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory is efficient to use for any particular kernel. For example if I have a work group which contains 64 work items then presumably more than one of these may simultaneously run within a compute unit. However it seems that the local memory size as returned by CL_DEVICE_LOCAL_MEM_SIZE queries is applicable to the whole compute unit, whereas it would be more useful if this information was for the work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means that you are only tuning for one GPU model.
Lastly, I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less? I hear that if you allocate too much then data is just placed in global memory. Is there a way of determining if this is the case?
Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will need. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking it's the local memory required by one kernel). Since you know how much local memory there is per compute unit, you can know the maximum number of workgroups that can coexist on one compute unit.
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (that you should try to maximize). Unfortunately, occupancy will vary depending of the underlying architecture.
AMD publish an article on how to compute occupancy for different architectures here.
NVIDIA provide an xls sheet that compute the occupancy for the different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
Since you might need to compute dynamically the size of the necessary local memory per workgroup, here is a workaround based on the fact that the kernels are compiled in JIT.
You can define a constant in you kernel file and then use the -D option to set its value (previously computed) when calling clBuildProgram.
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again CL_KERNEL_LOCAL_MEM_SIZE is the answer. the standard states:
This includes local memory that may be needed by an implementation to
execute the kernel...
If your work is fairly independent and doesn't re-use input data you can safely ignore everything about work groups and shared local memory. However, if your work items can share any input data (classic example is a 3x3 or 5x5 convolution that re-reads input data) then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is programmer-managed cache.

Why are JIT-ed languages still slower and less memory efficient than native C/C++?

Interpreters do a lot of extra work, so it is understandable that they end up significantly slower than native machine code. But languages such as C# or Java have JIT compilers, which supposedly compile to platform native machine code.
And yet, according to benchmarks that seem legit enough, in most of the cases are still 2-4x times slower than C/C++? Of course, I mean compared to equally optimized C/C++ code. I am well aware of the optimization benefits of JIT compilation and their ability to produce code that is faster than poorly optimized C+C++.
And after all that noise about how good the Java memory allocation is, why such a horrendous memory usage? 2x to 50x, on average about 30x times more memory is being used across that particular benchmark suite, which is nothing to sneeze at...
NOTE that I don't want to start a WAR, I am asking about the technical details which define those performance and efficiency figures.
Some reasons for differences;
JIT compilers mostly compile quickly and skip some optimizations that take longer to find.
VM's often enforce safety and this slows execution. E.g. Array access is always bounds checked in .Net unless guaranteed within the correct range
Using SSE (great for performance if applicable) is easy from C++ and hard from current VM's
Performance gets more priority in C++ over other aspects when compared to VM's
VM's often keep unused memory a while before returning to the OS seeming to 'use' more memory.
Some VM's make objects of value types like int/ulong.. adding object memory overhead
Some VM's auto-Align data structures a lot wasting memory (for performance gains)
Some VM's implement a boolean as int (4 bytes), showing little focus on memoryconservation.
But languages such as C# or Java have JIT compilers, which supposedly compile to platform native machine code.
Interpreters also have to translate to machine code in the end. But JITters spend less effort to compile and optimize for the sake of better start up and execution time. Wasting time on compilation will make the perceived performance worse from the user's point of view so that's only possible when you do only once like in an AoT compiler.
They also have to monitor the compiled result to recompile and optimize the hot spots even more, or de-optimize the paths that are rarely used. And then they have to fire the GC once in a while. Those take more time than a normal compiled binary. Besides, the big memory usage of the JITter and the JITted program may mean less efficient cache usage which in turn also slows the performance down
For more information about memory usage you can reference here
Java memory usage is much heavier than C++'s memory usage because:
There is an 8-byte overhead for each object and 12-byte for each array in Java (32-bit; twice as much in 64-bit java). If the size of an object is not a multiple of 8 bytes, it is rounded up to next multiple of 8. This means an object containing a single byte field occupies 16 bytes and requires a 4-byte reference. Please note that C++ also allocates a pointer (usually 4 or 8 bytes) for every object that declares virtual functions.
Parts of the Java Library must be loaded prior to the program execution (at least the classes that are used "under the hood" by the program).[60] This leads to a significant memory overhead for small applications[citation needed].
Both the Java binary and native recompilations will typically be in memory.
The virtual machine itself consumes a significant amount of memory.
In Java, a composite object (class A which uses instances of B and C) is created using references to allocated instances of B and C. In C++ the memory and performance cost of these types of references can be avoided when the instance of B and/or C exists within A.
Lack of address arithmetic makes creating memory-efficient containers, such as tightly spaced structures and XOR linked lists, impossible.
In most cases a C++ application will consume less memory than the equivalent Java application due to the large overhead of Java's virtual machine, class loading and automatic memory resizing. For applications in which memory is a critical factor for choosing between languages and runtime environments, a cost/benefit analysis is required.
One should also keep in mind that a program that uses a garbage collector can need as much as five times the memory of a program that uses explicit memory management in order to reach the same performance.
why such a horrendous memory usage? 2x to 50x, on average about 30x times more memory is being used across that particular benchmark suite, which is nothing to sneeze at...
See https://softwareengineering.stackexchange.com/a/189552

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU specific
instructions within a DLL?
Slightly longer version:
When downloading (32bit) DLLs from, say, Microsoft it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (ie. the
minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise
CPU specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything else than i386, you will have to write the code using the specific instruction sets in inline assembler. I don't know if there are higher-language toolkits for this. Determining the CPU features is straight forward, but could also need to be done in assembler.
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes - your DLL will be bigger). And even MSVC2005 can do it for very specific cases (E.g. memcpy() can use SSE4)
There are many ways to switch between different versions. A DLL is loaded, because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend on not just function name, but also processor features. Another method uses the fact that the name to address function uses a table of pointers in an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions; so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted for the generic x86 architecture for the simple reason that it has to work across all the multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed) Microsoft used to optimize its DLLs for size rather than speed. This is because the reduction in the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate. This is because speed ups from micro optimization would be decidedly low compared to speed ups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL code that includes additional versions of code that are tuned for different architectures when most of this code is rarely used / are never part of the critical code that consumes most of your CPU cycles.

Resources