I am adapting a Fortran MPI program from sequential to parallel writing for certain types of files. It uses NetCDF 4.3.3.1 / HDF5 1.8.9 (parallel). I use the Intel compiler, version 14.0.3.174.
When all reads/writes are done, it is time to close the files. At this point the simulation does not continue any more: all the calls are waiting. When I check the call stack on each processor, I can see that the master root's stack is different from the rest.
MPI master processor call stack:
__sched_yield, FP=7ffc6aa978b0
opal_progress, FP=7ffc6aa978d0
ompi_request_default_wait_all, FP=7ffc6aa97940
ompi_coll_tuned_sendrecv_actual, FP=7ffc6aa979e0
ompi_coll_tuned_barrier_intra_recursivedoubling, FP=7ffc6aa97a40
PMPI_Barrier, FP=7ffc6aa97a60
H5AC_rsp__dist_md_write__flush, FP=7ffc6aa97af0
H5AC_flush, FP=7ffc6aa97b20
H5F_flush, FP=7ffc6aa97b50
H5F_flush_mounts, FP=7ffc6aa97b80
H5Fflush, FP=7ffc6aa97ba0
NC4_close, FP=7ffc6aa97be0
nc_close, FP=7ffc6aa97c00
restclo, FP=7ffc6aa98660
driver, FP=7ffc6aaa5ef0
main, FP=7ffc6aaa5f90
__libc_start_main, FP=7ffc6aaa6050
_start,
Remaining processors call stack:
__sched_yield, FP=7fffe330cdd0
opal_progress, FP=7fffe330cdf0
ompi_request_default_wait, FP=7fffe330ce50
ompi_coll_tuned_bcast_intra_generic, FP=7fffe330cf30
ompi_coll_tuned_bcast_intra_binomial, FP=7fffe330cf90
ompi_coll_tuned_bcast_intra_dec_fixed, FP=7fffe330cfb0
mca_coll_sync_bcast, FP=7fffe330cff0
PMPI_Bcast, FP=7fffe330d030
mca_io_romio_dist_MPI_File_set_size, FP=7fffe330d080
PMPI_File_set_size, FP=7fffe330d0a0
H5FD_mpio_truncate, FP=7fffe330d0c0
H5FD_truncate, FP=7fffe330d0f0
H5F_dest, FP=7fffe330d110
H5F_try_close, FP=7fffe330d340
H5F_close, FP=7fffe330d360
H5I_dec_ref, FP=7fffe330d370
H5I_dec_app_ref, FP=7fffe330d380
H5Fclose, FP=7fffe330d3a0
NC4_close, FP=7fffe330d3e0
nc_close, FP=7fffe330d400
RESTCOM`restclo, FP=7fffe330de60
driver, FP=7fffe331b6f0
main, FP=7fffe331b7f0
__libc_start_main, FP=7fffe331b8b0
_start,
I do realize one call stack contains a bcast and the other a barrier, which might cause a deadlock. Yet I do not see how to proceed from here. If an MPI call is not properly matched (e.g. only called on one processor), I would expect an error message instead of this behaviour.
Update: the source code is around 100k lines.
The files are opened this way:
cmode = ior(NF90_NOCLOBBER,NF90_NETCDF4)
cmode = ior(cmode, NF90_MPIIO)
CALL ipslnc( NF90_CREATE(fname,cmode=cmode,ncid=ncfid, comm=MPI_COMM, info=MPI_INFO))
And closed as:
iret = NF90_CLOSE(ncfid)
It turns out that when writing an attribute with NF90_PUT_ATT, the root processor had a different value than the other processors. Once that was fixed, the program ran as expected.
I've read this tutorial: http://www.bravegnu.org/gnu-eprog/c-startup.html
I could follow the guide and run the code, but I have questions.
1) Why do we need both a load address and a run-time address? As I understand it, it is because we have put .data in flash too; so why don't we run the app from there, instead of needing start-up code to copy it into RAM?
2) Why do we need a linker script and start-up code here? Can I not just build the C source as below and run it with QEMU?
arm-none-eabi-gcc -nostdlib -o sum_array.elf sum_array.c
Many thanks
Your first question was answered in the guide.
When you load a program on an operating system, your .data section (basically the non-zero initialized globals) is loaded from the "binary" into the right offset in memory for you, so that when your program starts, the memory locations that represent your variables already have those values.
unsigned int x=5;
unsigned int y;
As a C programmer you write the above code and you expect x to be 5 when you first start using it, yes? Well, if you are booting from flash, bare metal, you don't have an operating system to copy that value into RAM for you; somebody has to do it. Further, all of the .data stuff has to be in flash: that number 5 has to be somewhere in flash so that it can be copied to RAM. So you need a flash address for it and a RAM address for it. Two addresses for the same thing.
And that begins to answer your second question. For every line of C code you write, you assume things: for example, that any function can call any other function. You would like to be able to call functions, yes? And you would like to have local variables, you would like the variable x above to be 5, and you might assume that y will be zero (although, thankfully, compilers are starting to warn about that). The startup code for generic C, at a minimum, sets up the stack pointer, which allows you to call other functions, have local variables, and have functions more than one or two lines of code long; it zeros the .bss so that the y variable above is zero; and it copies the value 5 over to RAM so that x is ready to go when your C entry point function runs.
If you don't have an operating system then you have to have code to do this. And yes, there are many sandboxes and toolchains set up for various platforms that already provide the startup code and linker script, so that you can just
gcc -o myprog.elf myprog.c
Now that doesn't mean you can make system calls without a...system...printf, fopen, etc. But if you download one of these toolchains, it does mean that you don't actually have to write the linker script or the bootstrap yourself.
But it is still valuable information. Note that startup code and a linker script are required for operating-system-based programs too; it is just that native compilers for your operating system assume you are mostly going to write programs for that operating system, and as a result they provide a linker script and startup code in the toolchain.
1) The .data section contains variables. Variables are, well, variable -- they change at run time. The variables need to be in RAM so that they can be easily changed at run time. Flash, unlike RAM, is not easily changed at run time. The flash contains the initial values of the variables in the .data section. The startup code copies the .data section from flash to RAM to initialize the run-time variables in RAM.
2) Linker-script: The object code created by your compiler has not been located into the microcontroller's memory map. This is the job of the linker and that is why you need a linker script. The linker script is input to the linker and provides some instructions on the location and extent of the system's memory.
Startup code: Your C program that begins at main does not run in a vacuum but makes some assumptions about the environment. For example, it assumes that the initialized variables are already initialized before main executes. The startup code is necessary to put in place all the things that are assumed to be in place when main executes (i.e., the "run-time environment"). The stack pointer is another example of something that gets initialized in the startup code, before main executes. And if you are using C++ then the constructors of static objects are called from the startup code, before main executes.
1) Why do we need both load-address and run-time address.
While it is in most cases possible to run code from memory-mapped ROM, code will often execute faster from RAM. In some cases there may also be much more RAM than ROM, and the application code may be stored compressed in ROM, so the executable code is not simply copied from ROM but also decompressed, allowing a much larger application than the available ROM could hold uncompressed.
In situations where the code is stored on non-memory mapped mass-storage media such as NAND flash, it cannot be executed directly in any case and must be loaded into RAM by some sort of bootloader.
2) Why we need linker script and start-up code here. Can I not just build C source as below and run it with qemu?
The linker script defines the memory layout of your target and application. Since this tutorial is for bare-metal programming, there is no OS to handle that for you. Similarly, the start-up code is required to at least set an initial stack pointer, initialise static data, and jump to main. On an embedded system it is also necessary to initialise various hardware such as the PLL, memory controllers, etc.
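For illustration, a minimal GNU ld script in that spirit might look like the following (the memory origins and sizes are made-up values for a hypothetical part; check your device's datasheet):

```ld
MEMORY
{
    FLASH (rx) : ORIGIN = 0x00000000, LENGTH = 256K
    RAM  (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}

SECTIONS
{
    .text : { *(.text*) } > FLASH
    .data : { *(.data*) } > RAM AT > FLASH   /* run in RAM, load from FLASH */
    .bss  : { *(.bss*)  } > RAM
}
```

The `AT > FLASH` on .data is precisely the load-address vs. run-time-address split from question 1: the section is linked to run at RAM addresses but stored in flash, and the start-up code performs the copy.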
I want to test some architecture changes on an already existing architecture (x86) using simulators. However, to properly test them and run benchmarks, I might have to make some changes to the instruction set. Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
    asm __volatile__ (".byte 0x90\n");
    return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your instruction bytes. Of course you won't see the actual instruction in a regular objdump, and the program would likely not run on your system (unless you use one of the nop combinations), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well for you when it doesn't know this instruction, and you should take care and work with inline assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
Complicated solution:
The alternative approach is to implement this in your compiler. It can be done in GCC, but as stated in the comments, LLVM is probably one of the best to play with, as it's designed as a compiler development platform. It's still very complicated, though: LLVM is best suited for IR optimization stages, and is somewhat less friendly when you try to modify the target-specific backends.
Still, it's doable, and you will have to do it if you also plan to have the compiler decide when to issue this instruction. I'd suggest starting with the first option though, to see whether your simulator even works with this addition, and only then spending time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function; there is documentation about this here: http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing ones, by modifying the group of files in GCC called the "machine description": instruction patterns in the <target>.md file, some code in the <target>.c file, predicates, constraints and so on. All of these live in the $GCCHOME/gcc/config/<target>/ folder. All of this is used at the stage of generating assembly code from RTL. You can also change when instructions are emitted by changing other general GCC source files (SSA tree generation, RTL generation, and so on), but all of that is a little more complicated.
A simple explanation of what happens:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
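To give a flavour of those instruction patterns, here is a generic RISC-style add pattern of the kind found in a <target>.md file (the mnemonic and constraints are illustrative; a real port defines its own predicates and constraints):

```lisp
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add %0,%1,%2")
```

The bracketed RTL template is what the middle end matches against; the final string is the assembly GCC emits when the pattern fires. A new instruction means adding a pattern like this (plus, usually, a builtin or intrinsic so you can reach it from C).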
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during compilation that generate further code and data. It's 20+ years since I did it so I have forgotten all the details, sorry.
I've heard the Google talk (http://www.youtube.com/watch?v=_gZK0tW8EhQ) by Ron Garret and read the paper (http://www.flownet.com/gat/jpl-lisp.html), but I'm not understanding how it worked to "correct" supposedly running code with a REPL. Was the DS-1's Lisp code running in some sort of virtual machine? Or was it "live" in the REPL's actual world? Or was the Lisp code an executable that got replaced? What exactly happens when you dynamically change running Lisp code through a REPL?
Whereas most programs are built and distributed as an executable that contains only the necessary components to run the program, Lisp can be distributed as an image that contains not just the components for the specific program, but also much or all of the Lisp runtime and development environment.
The REPL is the quintessential mechanism for providing interactive access to a running Lisp environment. The two key components of the REPL, Read and Eval, expose much of the Lisp runtime system. For example, many Lisp systems today implement Eval by compiling the provided form (read by the Reader) to machine code and then executing the result. This is in contrast to interpreting the form. Some systems, especially in the past, contained both an interpreter, which starts executing quickly and is suitable for interactive access, and a compiler, which produces better code. But modern systems are fast enough that the compile phase isn't noticeable, and they simply forgo the interpreter.
Of course, you can do very similar things today. A simple example is running SSH to your Linux box that's hosting PHP. Your PHP server is up and running and live, serving pages and requests. But you login through SSH, go over and fix a PHP file, and as soon as you save that file, all of your users see the new result in real time -- the system updated itself on the fly.
The fact that PHP is running on a Linux runtime vs Lisp running on a Lisp runtime, is a detail. The effect is the same. The fact that PHP isn't compiled is a detail also. For example, you can do the same thing on a Java server: modify a JSP, save it, and the JSP is converted in to a Servlet as Java source code, then compiled on the fly by the Java runtime, then loaded in to the executing container, replacing the old code.
Lisp's capability to do this is very nice, and it was very interesting far back in the day. Today it's less so, as there are different systems providing similar capabilities.
Addenda:
No, Lisp is not a virtual machine; there's no need for it to be that complicated.
The key to the concept is dynamic dispatch. With dynamic dispatch there is some lookup involved before a function is invoked.
In a static language like C, locations of things are pretty much set in stone once the linker and loader have finished processing the executable in preparation to start executing.
So, in C if you have something simple like:
int add(int i) {
    return i + 1;
}

void main() {
    add(1);
}
After compiling, linking and loading the program, the address of the add function will be set in stone, and thus anything referring to that function will know exactly where to find it.
So, in assembly language: (note this is a contrived assembly language)
add: pop r1 ; pop R1 from the stack, loading the i parameter
add r1, 1; Add 1 to the parameter.
push r1 ; push result of function call
rts ; return from subroutine
main: push 1 ; Push parameter to function
call add ; call function
pop r1 ; gather (and ignore) the result
So, you can see here that add is fixed in place.
In something like Lisp, functions are referred to indirectly. In C terms:
int add(int i) {
    return i + 1;
}

int (*add_ptr)(int) = &add;

void main() {
    (*add_ptr)(1);
}
In assembly you get:
add: pop r1 ; pop R1 from the stack, loading the i parameter
add r1, 1; Add 1 to the parameter.
push r1 ; push result of function call
rts ; return from subroutine
add_ptr: dw add ; put the address of the add routine in add_ptr
main: push 1 ; Push parameter to function
mov r1, add_ptr ; Put the contents of add_ptr into R1
call (r1) ; call function indirectly through R1
pop r1 ; gather (and ignore) the result
Now, you can see here that rather than calling add directly, it is called indirectly through the add_ptr. In a Lisp runtime, it has the capability of compiling new code, and when that happens, add_ptr would be overwritten to point to the newly compiled code. You can see how the code in main never has to change, it will call whatever function add_ptr is pointing to.
Since almost all functions in Lisp are referenced indirectly through their symbols, a lot can change "behind the back" of a running system, and the system will continue to run.
When a function is recompiled, the old function code (assuming no other references) becomes eligible for garbage collection and will typically, eventually, go away.
You can also see that when the system is garbage collected, any code that is moved (such as the code for the add function) can be relocated by the runtime, and its new location updated in add_ptr, so the system continues operating even after code has been moved by the garbage collector.
So, the key to it all, is to have your functions invoked through some lookup mechanism. Having this gives you a lot of flexibility.
Note, you can also do this in a running C system, for example. You can put code in a dynamic library, load the library, execute the code, and if you want you can build a new dynamic library, close the old one, open the new one, and invoke the new code, all in a "running" system. The dynamic library interface provides the lookup mechanism that isolates the code.
I was looking up the PyPy project (Python in Python), and started pondering the issue of what runs the outer layer of Python. Surely, I conjectured, it can't be, as the old saying goes, "turtles all the way down"! After all, Python is not valid x86 assembly!
Soon I remembered the concept of bootstrapping, and looked up compiler bootstrapping. "Ok", I thought, "so it can be either written in a different language or hand compiled from assembly". In the interest of performance, I'm sure C compilers are just built up from assembly.
This is all well, but the question still remains, how does the computer get that assembly file?!
Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
Can someone explain this to me?
Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
I understand what you're asking... what would happen if we had no C compiler and had to start from scratch?
The answer is you'd have to start from assembly or hardware. That is, you can either build a compiler in software or hardware. If there were no compilers in the whole world, these days you could probably do it faster in assembly; however, back in the day I believe compilers were in fact dedicated pieces of hardware. The wikipedia article is somewhat short and doesn't back me up on that, but never mind.
The next question I guess is what happens today? Well, those compiler writers have been busy writing portable C for years, so the compiler should be able to compile itself. It's worth discussing on a very high level what compilation is. Basically, you take a set of statements and produce assembly from them. That's it. Well, it's actually more complicated than that - you can do all sorts of things with lexers and parsers and I only understand a small subset of it, but essentially, you're looking to map C to assembly.
Under normal operation, the compiler produces assembly code matching your platform, but it doesn't have to. It can produce assembly code for any platform you like, provided it knows how to. So the first step in making C work on your platform is to create a target in an existing compiler, start adding instructions and get basic code working.
Once this is done, in theory, you can now cross compile from one platform to another. The next stages are: building a kernel, bootloader and some basic userland utilities for that platform.
Then, you can have a go at compiling the compiler for that platform (once you've got a working userland and everything you need to run the build process). If that succeeds, you've got basic utilities, a working kernel, userland and a compiler system. You're now well on your way.
Note that in the process of porting the compiler, you probably needed to write an assembler and linker for that platform too. To keep the description simple, I omitted them.
If this is of interest, Linux from Scratch is an interesting read. It doesn't tell you how to create a new target from scratch (which is significantly non-trivial); it assumes you're going to build for an existing known target, but it does show you how you cross compile the essentials and begin building up the system.
Python does not actually assemble to assembly. For a start, the running python program keeps track of counts of references to objects, something that a cpu won't do for you. However, the concept of instruction-based code is at the heart of Python too. Have a play with this:
>>> def hello(x, y, z, q):
... print "Hello, world"
... q()
... return x+y+z
...
>>> import dis
>>> dis.dis(hello)
2 0 LOAD_CONST 1 ('Hello, world')
3 PRINT_ITEM
4 PRINT_NEWLINE
3 5 LOAD_FAST 3 (q)
8 CALL_FUNCTION 0
11 POP_TOP
4 12 LOAD_FAST 0 (x)
15 LOAD_FAST 1 (y)
18 BINARY_ADD
19 LOAD_FAST 2 (z)
22 BINARY_ADD
23 RETURN_VALUE
There you can see how Python thinks of the code you entered. This is python bytecode, i.e. the assembly language of python. It effectively has its own "instruction set" if you like for implementing the language. This is the concept of a virtual machine.
Java has exactly the same kind of idea. I took a class function and ran javap -c class to get this:
invalid.site.ningefingers.main:();
Code:
0: aload_0
1: invokespecial #1; //Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: iconst_0
1: istore_1
2: iconst_0
3: istore_1
4: iload_1
5: aload_0
6: arraylength
7: if_icmpge 57
10: getstatic #2;
13: new #3;
16: dup
17: invokespecial #4;
20: ldc #5;
22: invokevirtual #6;
25: iload_1
26: invokevirtual #7;
//.......
}
I take it you get the idea. These are the assembly languages of the python and java worlds, i.e. how the python interpreter and java compiler think respectively.
Something else that would be worth reading up on is JonesForth. This is both a working forth interpreter and a tutorial and I can't recommend it enough for thinking about "how things get executed" and how you write a simple, lightweight language.
In the interest of performance, I'm sure C compilers are just built up from assembly.
C compilers are, nowadays, (almost?) completely written in C (or higher-level languages - Clang is C++, for instance). Compilers gain little to nothing from including hand-written assembly code. The things that take most time are as slow as they are because they solve very hard problems, where "hard" means "big computational complexity" - rewriting in assembly brings at most a constant speedup, but those don't really matter anymore at that level.
Also, most compilers aim for high portability, so architecture-specific tricks in the front and middle end are out of the question (and in the backends they're not desirable either, because they may break cross-compilation).
Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
When you're installing an OS, there's (usually) no C compiler being run. The setup CD is full of readily compiled binaries for that architecture. If there's a C compiler included (as is the case with many Linux distros), that's an already-compiled executable too. And those distros that make you build your own kernel etc. also have at least one executable included: the compiler. That is, of course, unless you have to compile your own kernel on an existing installation of something that already has a C compiler.
If by "new CPU" you mean a new architecture that isn't backwards-compatible with anything yet supported, self-hosting compilers can follow the usual porting procedure: first write a backend for the new target, then compile the compiler itself for it, and suddenly you have a mature compiler with a battle-hardened (it compiled a whole compiler) native backend on the new platform.
If you buy a new machine with a pre-installed OS, it doesn't even need to include a compiler anywhere, because all the executable code has been compiled on some other machine, by whoever provides the OS - your machine doesn't need to compile anything itself.
How do you get to this point if you have a completely new CPU architecture? In this case, you would probably start by writing a new code generation back-end for your new CPU architecture (the "target") for an existing C compiler that runs on some other platform (the "host") - a cross-compiler.
Once your cross-compiler (running on the host) works well enough to generate a correct compiler (and necessary libraries, etc.) that will run on the target, then you can compile the compiler with itself on the target platform, and end up with a target-native compiler, which runs on the target and generates code which runs on the target.
It's the same principle with a new language: you have to write code in an existing language that you do have a toolchain for, which will compile your new language into something that you can work with (let's call this the "bootstrap compiler"). Once you get this working well enough, you can write a compiler in your new language (the "real compiler"), and then compile the real compiler with the bootstrap compiler. At this point you're writing the compiler for your new language in the new language itself, and your language is said to be "self-hosting".