Profiling a benchmark compiled for the SPARC v8 on an x86 - gcc

I'm trying to make a (small) improvement to the leon3 processor (instruction set is SPARC v8) for an academic exercise. Before I decide what to improve, I want to profile a couple of benchmark programs that I want to tailor the improvements to.
I don't have access to a SPARC v8 machine.
Currently, I'm using an evaluation version of 'tsim' (a leon3 simulator), which does profiling at the function level, and that is not really all that useful.
I have tried weird stuff like compiling with loop unrolling enabled and then counting the interesting instructions in the assembly code, but gcc refuses to unroll the loops, probably because some of them go too deep (e.g. 4 nested 'for' loops).
Ideally, what I'm looking for is a SPARC v8 simulator that runs the benchmark and profiles it at the instruction level (stuff like: 'smul' was executed x times) so that I can decide where to start with the improvement. Of course, if there are other ways to do this without a profiler, I won't mind.
Any ideas?

Simulating the processor in ModelSim could be an option. With ModelSim you can do a functional simulation of the complete LEON3 processor. The simulation will be quite slow and is probably complete overkill for your purposes, but Aeroflex Gaisler provides excellent scripts to work with ModelSim.
A student edition of ModelSim can be found here:
http://www.mentor.com/company/higher_ed/modelsim-student-edition

If you really want to dig that deep into the hardware, you'll want a simulator that helps you with that.
Simics comes to mind. They used to have free academic licenses, but since they were bought by Intel, you now need to apply for one, which in my experience takes a couple of weeks. If you are willing to invest this time, you'll certainly get a tool that suits your needs. They support LEON2, not LEON3, as a model, but for profiling this should be fine.
QEMU also has LEON support, but since it relies heavily on recompiling the guest code, it will probably be hard to do instruction-level profiling with it.
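Whichever simulator you end up with, if it can dump the executed instruction words, turning that into "'smul' was executed x times" is only a few lines. Here is a minimal sketch, assuming the trace is available as an array of 32-bit instruction words. The function name and the trace format are made up for illustration; the field positions (op in bits 31:30, op3 in bits 24:19) and the op3 values are the SPARC v8 format-3 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* op3 values of the integer multiply/divide instructions (SPARC v8, op = 2). */
enum { OP3_UMUL = 0x0A, OP3_SMUL = 0x0B, OP3_UDIV = 0x0E, OP3_SDIV = 0x0F };

/* Hypothetical helper: histogram the op3 field of every format-3
 * arithmetic/logic instruction (op == 2) in a trace of executed words.
 * hist must have 64 entries, one per possible 6-bit op3 value. */
static void count_op3(const uint32_t *trace, size_t n, unsigned long hist[64])
{
    assert(hist != NULL);
    for (size_t i = 0; i < 64; i++)
        hist[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t insn = trace[i];
        if ((insn >> 30) == 2u)               /* bits 31:30 = op  */
            hist[(insn >> 19) & 0x3Fu]++;     /* bits 24:19 = op3 */
    }
}
```

After the call, hist[OP3_SMUL] holds the number of executed 'smul' instructions. Running the same function over the words of the .text section instead of a trace gives static counts only, which says nothing about how often a loop body ran.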

Related

What is meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgeable about and I was hoping to get a better understanding of the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high level language (e.g. C/C++, Java, etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it to assembly code. Then the assembler converts the assembly code to machine code, which is readable by the processor.
My question is, assuming the above is correct, why is Rosetta 2 required, since a CPU is supposed to translate high level code into readable machine code anyway? Why would developers need to "optimise" their applications (or care which processor their applications run on), since they are written (mostly) in a high level language (which the processor can compile)? I don't get why programmers would care if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.
"a CPU is supposed to translate high level code into readable machine code anyway?"
No, the CPU doesn't do that itself; it happens via software running on the CPU (a JIT or ahead-of-time compiler).
With an ahead-of-time compiler (e.g. normal C++ implementations), closed-source software ships only x86 machine code, not source, so you can't just recompile it yourself. Open-source software is usually easily portable by recompiling.
"Rewritten" is an overstatement for most apps; most can just be recompiled.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.
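To make that last point concrete, here is a minimal sketch of what such a port looks like: the same loop written with SSE intrinsics for x86, with NEON intrinsics for AArch64, and with a plain C fallback. The function name is made up for illustration; the intrinsics are the standard ones from xmmintrin.h and arm_neon.h.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* x86 SSE intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* Element-wise add: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* x86: four floats per iteration in a 128-bit XMM register */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
#elif defined(__ARM_NEON)
    /* ARM: the same loop, rewritten by hand with NEON intrinsics */
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    /* Scalar tail, and the whole loop on any other target */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The scalar loop at the bottom recompiles for any target unchanged. The two intrinsic branches are the part somebody has to write and test separately for each architecture.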

OpenCL programming in Charm++

Is it possible to run OpenCL through Charm++, while retaining the same fault tolerance and load balancing capabilities as for CPU or CUDA?
I did not explicitly see anything mentioned in the tutorials or the book.
Background: I'm one of the core developers of Charm++.
It's not clear whether you mean compiling OpenCL code to a Charm++-based parallel program, or calling kernels written in OpenCL from Charm++ code. Regardless, there is nothing explicitly implemented to support either of those cases at present.
Compiling OpenCL to Charm++ would be a large project. I don't know of anyone proposing to do such a thing, but it's not fundamentally implausible.
The research group behind Charm++, the Parallel Programming Laboratory, has looked at the possibility of implementing OpenCL support to match our offload support for CUDA-based accelerators. This would not be particularly hard. However, at present, we don't have any demand for it from the grant-funded projects that support our work. We would welcome contributions of code to do this. There's also the possibility that commercial development may lead to this getting implemented.

Simplest architecture available as a GCC target

I'm looking for a CPU architecture that is supported by GCC (and is still maintained) and for which it is easiest to implement a software simulator.
It should be something simple, with a flat memory model, a 16-bit or larger address space and a 16- to 32-bit ALU. Good code density is preferred, since it will be running programs under program memory limitations.
Just a few words about the origin of those requirements. I need a virtual CPU for running 'sandboxed' programs. It will run on microcontrollers with ~5 KBytes of RAM and an ARM CPU at ~20 MHz clock speed.
Performance is not an issue at all. What I really need is to write C/C++ programs and then run them in a sandbox without stdlib. GCC can help with writing the programs; I just need to implement a vcpu for one of its target architectures.
I've got acquainted with the ARMv7-M and AVR32 references and found them pretty acceptable, but somewhat more powerful than I need. The less and simpler code I need to write for the vcpu implementation, the sooner I will have what I need and the fewer bugs there will be.
UPDATE:
Seems like I found what I need. It was already answered here: What is the smallest, simplest CPU that gcc can compile for?
Thank you all.

IDE Tool choice - cross platform x86 ASM debugging

I'm writing a teaching tutorial to teach university students and programming enthusiasts Compilation concepts on an x86.
I want an IDE tool like WinASM for them to be able to debug their work, but am aware that they may not all run Windows at home.
Is my best choice to target Jasmin? (Or have you got a better suggestion - and why?)
Another approach I've seen is to use a common teaching architecture (such as MIPS) and run it under emulation. For MIPS in particular, there are lots of interactive simulators (like SPIM), as well as full system emulators (like QEMU). The fact that the MIPS architecture is considerably simpler (and less register-starved!) than x86 is definitely a plus as well -- it means you can spend more time focusing on interesting compilation topics, rather than teaching the architecture.
This is another approach (although poor for debugging) - executing assembler inline in C++
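For reference, a minimal sketch of that inline approach using GCC/Clang extended asm, which works the same way on Windows (MinGW), Linux and macOS as long as the target is x86. The function names are made up for illustration, and on non-x86 targets the sketch falls back to plain C so that it still compiles.

```c
#include <assert.h>

/* Add b to a with a single x86 ADD instruction when built with GCC or
 * Clang for x86; plain C elsewhere. */
static int add_with_asm(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %1, %0"   /* AT&T syntax: %0 = %0 + %1 */
            : "+r"(a)       /* operand 0: a, read and written, in a register */
            : "r"(b)        /* operand 1: b, read only, in a register */
            : "cc");        /* ADD modifies the condition flags */
    return a;
#else
    return a + b;
#endif
}

/* Small sanity check, also a convenient place for a breakpoint. */
static void add_with_asm_check(void)
{
    assert(add_with_asm(2, 3) == 5);
}
```

Built with gcc -g, students can step through the ADD in gdb with stepi, but they are debugging the compiler's surrounding code at the same time, which is part of why this is poor for teaching.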
A C repl that generates ASM - for learning about the assembler generated.
Also you could just rely on old gdb.
Have you ever considered an online debugging tool? There are a few of them out there. I personally like this asm debugger.

Windows based development for ARM processors

I am a complete newbie to the ARM world. I need to be able to write C code, compile it, and then download into an ARM emulator, and execute. I need to use the GCC 4.1.2 compiler for the C code compilation.
Can anybody point me in the correct directions for the following issues?
What tool chain to use?
What emulator to use?
Are there tutorials or guides on setting up the tool chain?
Building a gcc cross compiler yourself is pretty easy. The gcc library, the C library and other pieces are not so easy, and an embedded library and such is a little harder. It depends on how embedded you want to get. I have little use for libgcc or a C library, so rolling my own works great for me.
After many years of doing this, perhaps it is an age thing, I now just go get the CodeSourcery tools. The Lite version works great. YAGARTO, devkitARM, WinARM or something like that (the site with a zillion examples) all work fine. Emdebian also has a good pre-built toolchain. A number of these places, if not all, have info on how they built their toolchains from the GNU sources.
You asked about gcc, but bear in mind that LLVM is a strong competitor. As far as cross compiling goes, since it always cross compiles, it is a far easier cross compiler to download, build and get working than gcc. The recent version now produces code (for ARM) that competes with gcc for performance. gcc is in no way a leader in performance; other compilers I have used run circles around it, but it has been improving with each release (well, the 3.x versions sometimes produce better code than the 4.x versions, but you need 4.x for the newer cores and Thumb-2). Even if you go with gcc, try the stable release of LLVM from time to time.
QEMU is a good emulator. Depending on what you are doing, the GBA emulator Virtual GameBoy Advance is good. There are a couple of NDS emulators too. GDB and other places have what appears to be ARM's own ARMulator. I found it hard to extract and use, so I wrote my own, but being lazy I only implemented the Thumb instruction set. I called mine the thumbulator. It is easy to use, and far easier than QEMU and the ARMulator to add peripherals to and to watch and debug your code. YMMV.
Hmmm, I posted a similar answer for someone recently. Google "arm verilog" and at umich you will find a file isc.tgz, in which there is an ARM10 behavioural model (behavioural as in you cannot make a chip from it, which is why you can find the Verilog on the net). For someone wanting to learn an instruction set, watching your code execute at the gate level is about as good as it gets. Be careful: like a drug, you can get addicted, then have a hard time when you go back to silicon, where you have relatively zero visibility into your code while it is executing. Somewhere on Stack Overflow I posted the steps involved to get that ARM10 model and another file or two to turn it into an ARM emulator using Icarus Verilog. GTKWave is a good and free tool for examining the wave (VCD) files.
Above all else you will need the ARM ARM (the ARM Architecture Reference Manual). Just google it and find it on ARM's web site. There is pseudocode for each instruction teaching you what it does. Use the thumbulator or the ARMulator or others if you need to understand more (MAME has an ARM core in it too). I make no guarantees that the thumbulator is 100% debugged or accurate; I took some common programs and compared their output to silicon, both ARM and non-ARM, to debug the core.
Toolchain: you can use YAGARTO http://www.yagarto.de/
Emulator: you can use Proteus ISIS http://www.labcenter.com/index.cfm
(There is a demo version.)
As for tutorials, well, google them =)
Good luck!

Resources