Simplest architecture available as a GCC target

I'm looking for a CPU architecture that is supported by GCC (and still maintained) and for which a software simulator is easiest to implement.
It should be something simple, with a flat memory model, a 16-bit or larger address space, and a 16-32 bit ALU. Good code density is preferred, since it will be running programs under program-memory limitations.
A few words about the origin of these requirements: I need a virtual CPU for running 'sandboxed' programs. It will run on microcontrollers with ~5 KBytes of RAM and an ARM CPU at ~20 MHz clock speed.
Performance is not an issue at all; what I really need is to write C/C++ programs and then run them in a sandbox without stdlib. GCC can help with writing the programs; I just need to implement a vCPU for one of its target architectures.
I've acquainted myself with the ARMv7-M and AVR32 references and found them pretty acceptable, but somewhat more powerful than I need. The less (and simpler) code I need to write for the vCPU implementation, the sooner I will have what I need and the fewer bugs there will be.
It seems I have found what I need. It was already answered here: What is the smallest, simplest CPU that gcc can compile for?
Thank you all.


What is meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgeable about and I was hoping to get a better understanding of the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high-level language (e.g. C/C++, Java, etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it to assembly code. Then the assembler converts the assembly code to machine code, which is readable by the processor.
My question is, assuming the above is correct, why is Rosetta 2 required, since a CPU is supposed to translate high-level code into readable machine code anyway? Why would developers need to "optimise" their applications (or care which processor they run on), since they are written (mostly) in a high-level language (which the processor can compile)? I don't get why programmers would care if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.
"a CPU is supposed to translate high level code into readable machine code anyway?"
No, the CPU doesn't do that itself, it happens via software running on the CPU (JIT or ahead-of-time compiler).
With an ahead-of-time compiler (e.g. normal C++ implementations), closed-source software ships only x86 machine code, not source, so you can't just recompile it yourself. That is the gap Rosetta 2 fills. Open-source software is usually easily portable by recompiling.
"Rewritten" is an overstatement for most apps; most can simply be recompiled.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.

Are games/programs compiled for multiple architectures?

This might be a bit broad and somewhat stupid, but it is something I've never understood.
I've never dealt with code that needs to be compiled except Java, which I guess falls between two chairs, so here goes.
I don't understand how games and programs are compiled. Are they compiled for multiple architectures? Or are they compiled during installation (it does not look that way)?
As far as I've understood, code needs to be compiled based on the local architecture in order to make it work. Meaning that you can't compile something for AMD and "copy" the binaries and execute them on a computer running Intel (or similar).
Is there something I've misunderstood here, or do they use an approach that differs from the example I am presenting?
AMD and Intel are manufacturers. You might be thinking of amd64 (also known as x86_64) versus x86. x86_64 is, as the name suggests, based on x86.
Computers running a 64-bit x86_64 OS can normally run x86 apps, but the reverse is not true. So one possibility is to ship 32 bit x86 games, but that limits the amount of RAM that can be accessed per process. That might be OK for a game though.
A bigger issue is shipping for different platforms, such as Playstation and (Windows) PC. The Playstation not only has a completely different CPU architecture (Cell), but a different operating system.
In this case you can't simply cross-compile - and that is because of the operating system difference. You have to have two separate versions of the game - sharing a bunch of common code and media files (also known as assets) - one version for PC and one for Playstation.
You could use Java to overcome that problem, in theory... but that only works when a JVM is available for all target platforms. Also, there is now fragmentation in the Java market, with e.g. Android supporting a different API from JME. And iPhones and iPads don't support Java at all.
Many games publishers do not in fact use Java. An exception is Mojang.

When does the MacOSX 'free' library call madvise, and is there any way to control it?

I've got a C++ program that is notably slower on OSX 10.8.2 than on Linux. Profiling shows that the reason is that calls to free (that result from STL operations, FWIW), are much slower on OSX, because they go and call madvise, and real time gets consumed in there.
Is there any way to modulate this behavior of OS/X?
Well, yes!
I had horrible performance issues with malloc/free in Linux and started looking for a replacement.
Two options came to mind: tbbmalloc (part of Intel TBB, which is free, BTW) and Google malloc.
After extensive testing it wasn't clear which of the two was faster, but both were significantly faster than libc's implementation.
I went with tbbmalloc since it worked more smoothly; Google malloc had a bug that caused virtual memory usage to be very large (reserved but not committed), which was very bad for my app (IT daemons would kill it).
The good:
Much better performance than libc's malloc. It was 3x-300x faster in an STL-heavy app.
Simple integration. No code change. Add/change 1 line in the executable's makefile. No change to SOs.
The bad:
Memory checkers will not work with replacement allocators. For memcheck/valgrind/etc., revert to the original malloc.
The app would take 10-30% more memory.
The app I developed was a CAD application that used 10s of GBs, building and destroying 10s of millions various structures (lots of STL maps, vectors, hash_maps).
How to do this:
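In the linker command, add -ltbbmalloc and make sure the library is in the lib search path (-L flag). Note that for automatic replacement of malloc/free without code changes, TBB ships a separate proxy library (-ltbbmalloc_proxy), so check which one your setup needs.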

Profiling a benchmark compiled for the SPARC v8 on an x86

I'm trying to make a (small) improvement to the leon3 processor (instruction set is SPARC v8) for an academic exercise. Before I decide what to improve, I want to profile a couple of benchmark programs that I want to tailor the improvements to.
I don't have access to a SPARC v8 machine.
Currently, I'm using an evaluation version of 'tsim' (a leon3 simulator), which only does profiling at the functional level. That is not really all that useful.
I have tried weird stuff like compiling with loop unrolling enabled and then counting the interesting instructions in the assembly code, but gcc refuses to unroll the loops, probably because some of them go too deep (e.g. 4 nested 'for' loops).
Ideally, what I'm looking for is a SPARC v8 simulator that runs the benchmark and profiles it at the instruction level (stuff like: 'smul' was executed x times) so that I can decide where to start with the improvement. Of course, if there are other ways to do this without a profiler, I won't mind.
Any ideas?
Simulating the processor in Modelsim could be an option. With Modelsim you can do a functional simulation of the complete LEON3 processor. The simulation will be quite slow and probably complete overkill for your purposes, but Aeroflex Gaisler provides excellent scripts for working with Modelsim.
A student edition of modelsim can be found here:
If you really want to dig that deep into the hardware, you'll find a simulator useful that helps you with that.
Simics comes to mind. They used to have free academic licenses, but since they were bought by Intel, you now need to apply for one, which in my experience takes a couple of weeks. If you are willing to invest this time, you'll certainly get a tool that suits your needs. They support LEON2, not LEON3, as a model, but for profiling this should be fine.
Qemu also has LEON support, but since it relies heavily on dynamic recompilation, it will probably be hard to do instruction-level profiling with it.

How to Benchmark VS2010 Installations

I'm doing one VS2010 installation on a VM and one on a physical PC.
VM Spec:
Xeon CPU 3.33 GHz (dual core)
Windows 7 64 Bit
4 GB of RAM
Physical PC:
Dual-core CPU (speed unknown at this time)
Windows 7 64 Bit
6 GB of RAM
My question is: what is the best way to run some sort of benchmark test with VS2010 to determine which has the best performance?
Real-life benchmarking is easy to do: take a project you are working on (or any project similar to it in structure), and measure the time of a full rebuild.
If the project does not take long enough for the results to be representative or interesting, then I say why would you care about the performance at all?
As projects by different teams differ a lot (some use more templates, some more complex and difficult-to-optimize expressions, some lots of small files, some lots of libraries ..., some C++, some C#), I doubt there could exist a "universal benchmark project" useful enough to you. Taking the real project your developers are working on is the most representative thing you can do.
If you just want a rough "order of magnitude" comparison, you can simply download some large enough open source project in the same language as you use. E.g. for C you might want to try something like the OGG library source or the LibPNG source.
