AVR's Program memory - avr

I ve written a code in C for ATmega128 and
I d like to know how the changes that I do in the code influence the Program Memory.
To be more specific, let's consider that the code is similar to that one:
d=fun1(a,b);
c=fun2(c,d);
the change that I do in the code is that I call the same functions more times e.g.:
d=fun1(a,b);
c=fun2(c,d);
h=fun1(k,l);
n=fun2(p,m);
etc...
I build the solution at the AtmelStudio 6.1 and I see the changes in the Program Memory.
Is there anyway to foresee, without builiding the solution, how the chages in the code will affect the program memory?
Thanks!!

Generally speaking this is next to impossible using C/C++ (that means the effort does not pay off).
In your simple case (the number of calls increase), you can determine the number of instructions for each call, and multiply by the number. This will only be correct, if the compiler does not inline in all cases, and does not apply optimzations at a higher level.
These calculations might be wrong, if you upgrade to a newer gcc version.
So normally you only get exact numbers when you compare two builds (same compiler version, same optimisations). avr-size and avr-nm gives you all information, for example to compare functions by size. You can automate this task (by converting the output into .csv files), and use a spreadsheet or diff to look for changes.
This method normally only pays off, if you have to squeeze a program into a smaller device (from 4k flash into 2k for example - you already have 128k flash, that's quite a lot).
This process is frustrating, because if you apply the same design pattern in C with small differences, it can lead to different sizes: So from C/C++, you cannot really predict what's going to happen.

Related

How to link a program in GCC to prelinked library?

OK, I have the problem, I do not know exactly the correct terms in order to find what I am looking for on google. So I hope someone here can help me out.
When developing real time programs on embedded devices you might have to iterate a few hundred or thousand times until you get the desired result. When using e.g. ARM devices you wear out the internal flash quite quickly. So typically you develop your programs to reside in the RAM of the device and all is ok. This is done using GCC's functionality to split the code in various sections.
Unfortunately, the RAM of most devices is much smaller than the flash. So at one point in time, your program gets too big to fit in RAM with all variables etc. (You choose the size of the device such that one assumes it will fit the whole code in flash later.)
Classical shared objects do not work as there is nothing like a dynamical linker in my environment. There is no OS or such.
My idea was the following: For the controller it is no problem to execute code from both RAM and flash. When compiling with the correct attributes for the functions this is also no big problem for the compiler to put part of the program in RAM and part in flash.
When I have some functionality running successfully I create a library and put this in the flash. The main development is done in the 'volatile' part of the development in RAM. So the flash gets preserved.
The problem here is: I need to make sure, that the library always gets linked to the exact same location as long as I do not reflash. So a single function must always be on the same address in flash for each compile cycle. When something in the flash is missing it must be placed in RAM or a lining error must be thrown.
I thought about putting together a real library and linking against that. Here I am a bit lost. I need to tell GCC/LD to link against a prelinked file (and create such a prelinked file).
It should be possible to put all the library objects together and link this together in the flash. Then the addresses could be extracted and the main program (for use in RAM) can link against it. But: How to do these steps?
In the internet there is the term prelink as well as a matching program for linux. This is intended to speed up the loading times. I do not know if this program might help me out as a side effect. I doubt it but I do not understand the internals of its work.
Do you have a good idea how to reach the goal?
You are solving a non-problem. Embedded flash usually has a MINIMUM write cycle of 10,000. So even if you flash it 20 times a day, it will last a year and half. An St-Nucleo is $13. So that's less than 3 pennies a day :-). The TYPICAL write cycle is even longer, at about 100,000. It will be a long time before you wear them out.
Now if you are using them for dynamic storage, that might be a concern, depending on the usage patterns.
But to answer your questions, you can build your code into a library .a file easily enough. However, GCC does not guarantee that it links the object code in any order, as it depends on optimization level. Furthermore, only functions that are referenced in a library file is pulled in, so if your function calls change, it may pull in more or less library functions.

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. You analyzers might be simple ("how much space does an instruction take?") or complex ("How many cycles will this isntruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing langauges in general is our
DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done with with several assembly langauges, including the IBM 370, Motorola's 8 bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (computation over an AST) to DMS easily. These are great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analysese, you'll need a processor model that is driven from a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to apply a machine model to compute more complex things as "How long does this instruction take?".
Other analyses such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("control-next for this instruction is...", "data from this instruction flows to,...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?"..)
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, its a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indication the optimization is safe, to the actual optimization steps.
I have written many disassemblers, including arm and thumb. Not production quality but for the purposes of learning the assembler. For both the ARM and Thumb the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations from load/store, etc. maybe an hours worth of work, maybe two. At least up front, you would end up with data values being counted though.
The other poster may be right, as with the chart I am talking about it should be very simple to write a program to examine the ASCII looking for ldr, str, add, etc. No need to parse everything if you are interested in memory operations counts, etc. Of course the downside is that you are likely not going to be able to examine loops. One function may have a load and store, another may have a load and store but have it wrapped by a loop, causing many more memory operations once executed.
Not knowing what you really are interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (and I have used it to compare llvm execution vs gcc execution when it comes to number of instructions executed, fetches, memory operations, etc) The problem may be that it is thumb only, no ARM no Thumb2. Thumb2 could be added easier than ARM. There exists an armulator from arm, which is in the gdb sources among other places. I cant remember now if it executes thumb2. My understanding is that when arm was using it would accurately tell you these sorts of statistics.
You can plug your statistics into LLVM code generator, it's quite flexible and it is already collecting some stats, which could be used as an example.

Windows API calls from assembly while minimizing program size

I'm trying to write a program in assembly and make the resulting executable as small as possible. Some of what I'm doing requires windows API calls to functions such as WriteProcessMemory. I've had some success with calling these functions, but after compiling and linking, my program comes out in the range of 14-15 KB. (From a source of less than 1 KB) I was hoping for much, much less than that.
I'm very new to doing low level things like this so I don't really know what would need to be done to make the program smaller. I understand that the exe format itself takes up quite a bit of space. Can anything be done to minimize that?
I should mention that I'm using NASM and GCC but I can easily change if that would help.
See Tiny PE for a bunch of tips and tricks you can use to reduce the final size of your executable. Be warned that some of the later techniques in that article are extremely fragile.
The default section alignment for most PE files is 4K to align with the natural system memory layout. If you have a .data, .text and .resource section - that's 12K already. Most of it will be 0's and a waste of space.
There are a few things you can do to minimize this waste. First, reduce the section alignment to 512 bytes (don't know the options needed for nasm/gcc). Second, merge the sections so that you only have a single .text section. This can be a problem though for modern machines with the NX bit turned on. This security feature prevents modification of executable sections of code from things like viruses.
There are also a slew of PE compression tools out there that will compact your PE and decompress it when executed.
I suggest using the DumpBin utility (or GNU's objdump) to determine what takes the most space. It may be resource files, huge global variables or something like that.
FWIW, the smallest programs I can assemble using ML or ML64 are on the order of 3kb. (That's just saying hello world and exiting.)
Give me a small C program (not C++), and I'll show you how to make a 1 ko .exe with it. The smallest size of executable I recommend is 1K, because it will fail to run on some Windows if it's not at least this size.
You merely have to play with linker switches to make it happen!
A good linker to do this is polink.
And if you do everything in Assembly, it's even easier. Just go to the MASM32 forum and you'll see plenty of programs like this.

how to minimize a programming language compile time?

I was thinking more about the programming language i am designing. and i was wondering, what are ways i could minimize its compile time?
Your main problem today is I/O. Your CPU is many times faster than main memory and memory is about 1000 times faster than accessing the hard disk.
So unless you do extensive optimizations to the source code, the CPU will spend most of the time waiting for data to be read or written.
Try these rules:
Design your compiler to work in several, independent steps. The goal is to be able to run each step in a different thread so you can utilize multi-core CPUs. It will also help to parallelize the whole compile process (i.e. compile more than one file at the same time)
It will also allow you to load many source files in advance and preprocess them so the actual compile step can work faster.
Try to allow to compile files independently. For example, create a "missing symbol pool" for the project. Missing symbols should not cause compile failures as such. If you find a missing symbol somewhere, remove it from the pool. When all files have been compiled, check that the pool is empty.
Create a cache with important information. For example: File X uses symbols from file Y. This way, you can skip compiling file Z (which doesn't reference anything in Y) when Y changes. If you want to go one step further, put all symbols which are defined anywhere in a pool. If a file changes in such a way that symbols are added/removed, you will know immediately which files are affected (without even opening them).
Compile in the background. Start a compiler process which checks the project directory for changes and compile them as soon as the user saves the file. This way, you will only have to compile a few files each time instead of everything. In the long run, you will compile much more but for the user, turnover times will be much shorter (= time user has to wait until she can run the compiled result after a change).
Use a "Just in time" compiler (i.e. compile a file when it is used, for example in an import statement). Projects are then distributed in source form and compiled when run for the first time. Python does this. To make this perform, you can precompile the library during the installation of your compiler.
Don't use header files. Keep all information in a single place and generate header files from the source if you have to. Maybe keep the header files just in memory and never save them to disk.
what are ways i could minimize its compile time?
No compilation (interpreted language)
Delayed (just in time) compilation
Incremental compilation
Precompiled header files
I've implemented a compiler myself, and ended up having to look at this once people started batch feeding it hundreds of source files. I was quite suprised what I found out.
It turns out that the most important thing you can optimize is not your grammar. It's not your lexical analyzer or your parser either. Instead, the most important thing in terms of speed is the code that reads in your source files from disk. I/O's to disk are slow. Really slow. You can pretty much measure your compiler's speed by the number of disk I/Os it performs.
So it turns out that the absolute best thing you can do to speed up a compiler is to read the entire file into memory in one big I/O, do all your lexing, parsing, etc. from RAM, and then write out the result to disk in one big I/O.
I talked with one of the head guys maintaining Gnat (GCC's Ada compiler) about this, and he told me that he actually used to put everything he could onto RAM disks so that even his file I/O was really just RAM reads and writes.
In most languages (pretty well everything other than C++), compiling individual compilation units is quite fast.
Binding/linking is often what's slow - the linker has to reference the whole program rather than just a single unit.
C++ suffers as - unless you use the pImpl idiom - it requires the implementation details of every object and all inline functions to compile client code.
Java (source to bytecode) suffers because the grammar doesn't differentiate objects and classes - you have to load the Foo class to see if Foo.Bar.Baz is the Baz field of object referenced by the Bar static field of the Foo class, or a static field of the Foo.Bar class. You can make the change in the source of the Foo class between the two, and not change the source of the client code, but still have to recompile the client code, as the bytecode differentiates between the two forms even though the syntax doesn't. AFAIK Python bytecode doesn't differentiate between the two - modules are true members of their parents.
C++ and C suffer if you include more headers than are required, as the preprocessor has to process each header many times, and the compiler compile them. Minimizing header size and complexity helps, suggesting better modularity would improve compilation time. It's not always possible to cache header compilation, as what definitions are present when the header is preprocessed can alter its semantics, and even syntax.
C suffers if you use the preprocessor a lot, but the actual compilation is fast; much of C code uses typedef struct _X* X_ptr to hide implementation better than C++ does - a C header can easily consist of typedefs and function declarations, giving better encapsulation.
So I'd suggest making your language hide implementation details from client code, and if you are an OO language with both instance members and namespaces, make the syntax for accessing the two unambiguous. Allow true modules, so client code only has to be aware of the interface rather than implementation details. Don't allow preprocessor macros or other variation mechanism to alter the semantics of referenced modules.
Here are some performance tricks that we've learned by measuring compilation speed and what affects it:
Write a two-pass compiler: characters to IR, IR to code. (It's easier to write a three-pass compiler that goes characters -> AST -> IR -> code, but it's not as fast.)
As a corollary, don't have an optimizer; it's hard to write a fast optimizer.
Consider generating bytecode instead of native machine code. The virtual machine for Lua is a good model.
Try a linear-scan register allocator or the simple register allocator that Fraser and Hanson used in lcc.
In a simple compiler, lexical analysis is often the greatest performance bottleneck. If you are writing C or C++ code, use re2c. If you're using another language (which you will find much more pleasant), read the paper aboug re2c and apply the lessons learned.
Generate code using maximal munch, or possibly iburg.
Surprisingly, the GNU assembler is a bottleneck in many compilers. If you can generate binary directly, do so. Or check out the New Jersey Machine-Code Toolkit.
As noted above, design your language to avoid anything like #include. Either use no interface files or precompile your interface files. This tactic dramatically reduces the burdern on the lexer, which as I said is often the biggest bottleneck.
Here's a shot..
Use incremental compilation if your toolchain supports it.
(make, visual studio, etc).
For example, in GCC/make, if you have many files to compile, but only make changes in one file, then only that one file is compiled.
Eiffel had an idea of different states of frozen, and recompiling didn't necessarily mean that the whole class was recompiled.
How much can you break up the compliable modules, and how much do you care to keep track of them?
Make the grammar simple and unambiguous, and therefore quick and easy to parse.
Place strong restrictions on file inclusion.
Allow compilation without full information whenever possible (eg. predeclaration in C and C++).
One-pass compilation, if possible.
One thing surprisingly missing in answers so far: make you you're doing a context free grammar, etc. Have a good hard look at languages designed by Wirth such as Pascal & Modula-2. You don't have to reimplement Pascal, but the grammar design is custom made for fast compiling. Then see if you can find any old articles about the tricks Anders pulled implementing Turbo Pascal. Hint: table driven.
it depends on what language/platform you're programming for. for .NET development, minimise the number of projects that you have in your solution.
In the old days you could get dramatic speedups by setting up a RAM drive and compiling there. Don't know if this still holds true, though.
In C++ you could use distributed compilation with tools like Incredibuild
A simple one: make sure the compiler can natively take advantage of multi-core CPUs.
Make sure that everything can be compiled the fist time you try to compile it. E.g. ban forward references.
Use a context free grammar so that you can find the correct parse tree without a symbol table.
Make sure that the semantics can be deduced from the syntax so you can construct the correct AST directly rather than by mucking with a parse tree and symbol table.
How serious a compiler is this?
Unless the syntax is pretty convoluted, the parser should be able to run no more than 10-100 times slower than just indexing through the input file characters.
Similarly, code generation should be limited by output formatting.
You shouldn't be hitting any performance issues unless you're doing a big, serious compiler, capable of handling mega-line apps with lots of header files.
Then you need to worry about precompiled headers, optimization passes, and linking.
I haven't seen much work done for minimizing the compile time. But some ideas do come to mind:
Keep the grammar simple. Convoluted grammar will increase your compile time.
Try making use of parallelism, either using multicore GPU or CPU.
Benchmark a modern compiler and see what are the bottlenecks and what you can do in you compiler/language to avoid them.
Unless you are writing a highly specialized language, compile time is not really an issue..
Make a build system that doesn't suck!
There's a huge amount of programs out there with maybe 3 source files that take under a second to compile, but before you get that far you'd have to sit through an automake script that takes about 2 minutes checking things like the size of an int. And if you go to compile something else a minute later, it makes you sit through almost exactly the same set of tests.
So unless your compiler is doing awful things to the user like changing the size of its ints or changing basic function implementations between runs, just dump that info out to a file and let them get it in a second instead of 2 minutes.

What's a good profiling tool to use when source code isn't available?

I have a big problem. My boss said to me that he wants two "magic black box":
1- something that receives a micropocessor like input and return, like output, the MIPS and/or MFLOPS.
2- something that receives a c code like input and return, like output, something that can characterize the code in term of performance (something like the necessary MIPS that a uP must have to execute the code in some time).
So the first "black box" I think could be a benchmark of EEMBC or SPEC...different uP, same benchmark that returns MIPS/MFLOPS of each uP. The first problem is OK (I hope)
But the second...the second black box is my nightmare...the only thingh that i find is to use profiling tool but I ask a particular profiling tool.
Is there somebody that know a profiling tool that can have, like input, simple c code and gives me, like output, the performance characteristics of my c code (or the times that some assembly instruction is called)?
The real problem is that we must choose the correct uP for a certai c code...but we want a uP tailored for our c code...so if we know a MIPS (and architectural structure of uP, memory structure...) and what our code needed
Thanks to everyone
I have to agree with Adam, though I would be a little more gracious about it. Compiler optimizations only matter in hotspot code, i.e. tight loops that a) don't call functions, and b) take a large percentage of time.
On a positive note, here's what I would suggest:
Run the C code on a processor, any processor. On that processor, find out what takes the most time.
You could use a profiler for this. The simple method I prefer is to just run it under a debugger and manually halt it, some number of times (like 10) and each time write down the call stack. I suppose there is something in the code taking a good percentage of the time, like 50%. If so, you will see it doing that thing on roughly that percentage of samples, so you won't have to guess what it is.
If that activity is something that would be helped by some special processor, then try that processor.
It is important not to guess. If you say "I think this needs a DSP chip", or "I think it needs a multi-core chip", that is a guess. The guess might be right, but probably not. It is probably the case that what takes the most time is something you never would guess, like memory management or I/O formatting. Performance issues are very good at hiding from you.
No. If someone made a tool that could analyse (non-trivial) source code and tell you its performance characteristics, it would be common place. i.e. everyone would be using it.
Until source code is compiled for a particular target architecture, you will not be able to determine its overall performance. For instance, a parallelising compiler targeting n processors might conceivably be able to change an O(n^2) algorithm to one of O(n).
You won't find a tool to do what you want.
Your only option is to cross-compile the code and profile it on an emulator for the architecture you're running. The problem with profiling high level code is the compiler makes a stack of optimizations that are non trivial and you'd need to know how the particular compiler did that.
It sounds dumb, but why do you want to fit your code to a uP and a uP to your code? If you're writing signal processing buy a DSP. If you're building a SCADA box then look into Atmel or ARM stuff. Are you building a general purpose appliance with a user interface? Look into PPC or X86 compatible stuff.
Simply put, choose a bloody architecture that's suitable and provides the features you need. Optimization before choosing the processor is retarded (very roughly paraphrasing Knuth).
Fix the architecture at something roughly appropriate, work out roughly the processing requirements (you can scratch up an estimate by hand which will always be too high when looking at C code) and buy a uP to match.

Resources