When should I use GCC's -pipe option?

The GCC 4.1.2 documentation has this to say about the -pipe option:
-pipe
Use pipes rather than temporary files for communication between the various stages of compilation. This fails to work on some systems where the assembler is unable to read from a pipe; but the GNU assembler has no trouble.
I assume I'd be able to tell from the error message if my systems' assemblers didn't support pipes, so besides that issue, when does it matter whether I use that option? What factors should go into deciding to use it?

In our experience with a medium-sized project, adding -pipe made no discernible difference in build times. We ran into a couple of problems with it (sometimes failing to delete intermediate files if an error was encountered, IIRC), and so since it wasn't gaining us anything, we quit using it rather than trying to troubleshoot those problems.

It doesn't usually make any difference
It has pluses and minuses. Historically, running the compiler and assembler simultaneously would stress RAM.
GCC is small by today's standards, and -pipe adds a bit of parallel execution that multiple cores can soak up.
But by the same token, the CPU is so fast that it can create that temporary file and read it back without you even noticing. And since -pipe was never the default mode, it occasionally acts up a little. A single developer will generally not notice any time difference.
Now, there are some large projects out there. You can check out a single tree that will build all of Firefox, or NetBSD, or something like that, something that is really big. Something that includes all of X, say, as a minor subsystem component. You may or may not notice a difference when the job involves millions of lines of code in thousands and thousands of C files. As I'm sure you know, people normally work on only a small part of something like this at one time. But if you are a release engineer or working on a build server, or changing something in stdio.h, you may well want to build the whole system to see if you broke anything. And now, every drop of performance probably counts...

Trying this out now, it looks to be moderately faster to build when the source and build destinations are on NFS (Linux network). Memory usage is higher, though. If you never fill RAM and have the source on NFS, -pipe seems like a win.

Honestly, there is very little reason not to use it. -pipe will only use a tad more RAM, and if this box is building code, I'd assume it has a decent amount. It can significantly improve build time if your system uses a more conservative filesystem that writes and then deletes all the temporary files along the way (ext3, for example).

One advantage of -pipe is that the compiler interacts less with the file system. Even on a RAM disk, data still has to go through the block I/O and filesystem layers when using temporary files, whereas with a pipe it is more direct.
With files, the compiler must finish writing before it can call the assembler. Another advantage of pipes is that the compiler and assembler can run at the same time, making a bit more use of SMP architectures: especially when the compiler has to wait for data from the source file (because of blocking I/O calls), the operating system can give the assembler full CPU time and let it do its job faster.
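For intuition, here is a minimal sketch (not GCC's actual code) of the mechanism -pipe relies on: a producer and a consumer joined by a pipe make progress concurrently, while a temporary file would force the producer to finish before the consumer could start:
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                     /* child plays the "assembler" */
        close(fd[1]);
        char buf[64];
        ssize_t n;
        while ((n = read(fd[0], buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);  /* consumes as data arrives */
        _exit(0);
    }

    close(fd[0]);                       /* parent plays the "compiler" */
    for (int i = 0; i < 3; i++) {
        char msg[32];
        int len = snprintf(msg, sizeof msg, "chunk %d\n", i);
        write(fd[1], msg, (size_t)len); /* produce incrementally */
        sleep(1);                       /* child has already echoed earlier chunks */
    }
    close(fd[1]);
    waitpid(pid, NULL, 0);
    return 0;
}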

From a hardware point of view, I guess you would use -pipe to extend the life of your hard drive.

Related

Alternative to Intel C++ compiler for Windows and OSX, which provides CPU dispatching

I'm quite honestly sick of the Intel compiler now, because it's just buggy; it sometimes generates incorrect, crashing code, which is especially bad since compilation takes around 2 hours, so there's really no way to work around it. Profile-guided optimizations, which are needed to keep the executables at least reasonably sized, currently always generate crashing code for me, so...
But it has one perk no other compiler I know of has: dispatching to different instruction sets, which is essential for my use case, signal processing. Is there any other compiler that can do that?
(For the record, I'm even OK with adding a pragma to every loop that needs the CPU dispatching; non-looped operations probably don't need it.)
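For what it's worth, GCC can do per-function CPU dispatching itself via the target_clones attribute (x86, GCC 6 and later): the compiler emits one clone of the function per listed ISA and an ifunc resolver picks the best one at load time. A minimal sketch, with a hypothetical function:
#include <stddef.h>

/* GCC builds an AVX2 clone, an SSE4.2 clone, and a baseline clone of this
   function, and dispatches to the best one for the running CPU. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}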

Bare-metal benchmarks & software

I'm looking for some information about bare-metal programming.
I'm working on different PowerPC platforms, and I'm currently trying to prove that some tests are not impacted by the Linux kernel. These tests are pretty basic: loads and stores in asm volatile, plus some benchmarks (CoreMark, Dhrystone, etc.). They run perfectly on Linux, but I now have to run them bare-metal, an environment I don't really have experience with.
All my platforms have U-Boot installed, and I'm wondering whether there are applications that would let me run my tests powerpc-eabi cross-compiled? For example, would a gdbserver launched by U-Boot be able to communicate via serial port or Ethernet? Is it possible to have a BusyBox called by U-Boot?
U-Boot is a bootloader... use it. You probably have an xmodem or ymodem downloader in U-Boot; if push comes to shove, you can turn your program into a long series of write-word-to-memory commands followed by a branch to that address.
U-Boot will already have set up RAM and the serial port; that is how you are talking to U-Boot in the first place, so you don't have to do any of that. You won't need to configure the serial port, but you will want to find out how to write a character, which means polling the status register until the transmitter is empty and then writing one character to the transmit register. Repeat for every character in the string you want to print.
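In C, that routine looks something like the sketch below. The register addresses and the status bit are made up; take the real ones from your SoC's datasheet:
/* Minimal sketch of polled UART output; all addresses and masks are
   hypothetical placeholders for whatever the datasheet says. */
#define UART_BASE   0xF0004500u
#define UART_STATUS (*(volatile unsigned int *)(UART_BASE + 0x04))
#define UART_TXDATA (*(volatile unsigned int *)(UART_BASE + 0x00))
#define TX_EMPTY    0x01u

void uart_putc(char c)
{
    while ((UART_STATUS & TX_EMPTY) == 0)   /* poll until transmitter is free */
        ;
    UART_TXDATA = (unsigned int)c;
}

void uart_puts(const char *s)
{
    while (*s)
        uart_putc(*s++);
}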
The bootstrap to your C program (assuming it is C) usually involves, at a bare minimum, setting up the stack pointer and branching to your C entry point. (U-Boot is running, so the stack is already set up; you can skip that step so long as you load your program where it doesn't collide with what U-Boot is doing.)
Depending on how you have written your high-level program, you might also have to zero out the .bss area and set up the .data area. The nice thing about using a bootloader to copy a program to RAM and just run it is that you usually don't have to do any of this: the binary that you download and run already has .bss zeroed and .data in the right place. So it comes back to "set up the stack and branch", or simply "branch", since you may not even have to set up the stack.
Building a bare-metal program is the real challenge, because you don't have a system to make system calls to, and that is a hard thing to give up and/or simulate. newlib, for example, makes life a bit easier: it has a very easy-to-replace system backend, so you can, for example, leave the printfs in Dhrystone (versus removing them and finding a different way to output the strings or the results).
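As a sketch of what that backend can look like: newlib expects you to supply a handful of low-level routines (_write, _sbrk, and friends), and routing _write at the UART is enough to keep printf working. The uart_putc() here is the polled-output routine sketched above; the bump-pointer heap is an assumption for illustration:
#include <sys/stat.h>

extern void uart_putc(char c);     /* the polled UART routine from above */

/* newlib funnels all output through _write; send it to the serial port */
int _write(int fd, const char *buf, int len)
{
    for (int i = 0; i < len; i++)
        uart_putc(buf[i]);
    return len;
}

/* stubs that satisfy the rest of newlib's expectations */
int _close(int fd) { return -1; }
int _fstat(int fd, struct stat *st) { st->st_mode = S_IFCHR; return 0; }
int _isatty(int fd) { return 1; }
int _lseek(int fd, int off, int whence) { return 0; }
int _read(int fd, char *buf, int len) { return 0; }

/* a trivial bump-pointer heap for malloc, growing from the end of .bss */
extern char _end;                  /* symbol typically provided by the linker */
static char *heap;
void *_sbrk(int incr)
{
    if (!heap) heap = &_end;
    char *prev = heap;
    heap += incr;
    return prev;
}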
Compiling the C files to objects is easy, and assembling the assembly is easy; you should be able to do that with your powerpc-eabi GCC cross compiler. The next challenge is linking: telling the linker where stuff goes. Since this is likely a flat chunk of RAM, you can probably do something like -Ttext 0x123450000, where the number is the base address of the RAM you want to use. If you have any multiplies or divides, any floats, or any other GCC library functions (which replace things your processor may or may not do, or which need a wrapper to do them properly), or any libc calls, then the linker will try to pull them in. Ideally the GCC library ones are easy, but depending on the cross compiler they can be a challenge; worst case, take the GCC sources and build those functions yourself, or get or build a different GCC cross compiler with different target options (generally an easy thing to do).
I highly recommend you disassemble your binary and make sure, if nothing else, that the entry point of your bootstrap is at the beginning of the binary. Use objcopy to make a binary file: powerpc-...-objcopy myprog.elf -O binary myprog.bin. Then use xmodem or ymodem at the U-Boot prompt to copy that program over and run it.
Backing up: from the datasheet for the part, look up the UART and figure out its base address. First use the U-Boot prompt to write to the address of the UART transmit register (write a 0x30 to that address, for example); if you have the right address, then before U-Boot prints its prompt again after your command, you should see an extra zero ('0') in the output. If you can't get it to do that with a single write from the U-Boot command line, you won't get it to work in a program of any kind: you have the wrong address or you are doing something else wrong.
Then write a very small program in assembly language that outputs a character to the UART by writing to that address, then counts to some big number depending on the speed of your processor (if you are running at 100 MHz, count to 100 million or more, or count down to zero from a few hundred million), then branches back to the beginning and repeats: output, wait, output, wait. Build and link this tiny program, download it with xmodem or whatever, and branch to it. If you can't get it to output a character every few seconds, you won't be able to progress to something more complicated.
Next small program: poll the status register, wait for the tx buffer to be empty, then write a 0x30 to the tx register. Increment the register holding the 0x30 to 0x31, and AND that register with 0x37 so that after 0x37 it wraps back to 0x30. Branch back to the wait-for-tx-empty, output the new value, and make this an infinite loop. Once it's running you should see 01234567012345670... repeated forever; if the numbers get mangled, or aren't 0-7 repeating, you won't be able to progress to something more complicated.
Repeat the last two programs in C with a small bootstrap that branches to the C entry point; if you can't get those working, you won't be able to progress any further.
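A sketch of those tests in C, combining the delay loop of the first with the 0-7 counter of the second (uart_putc() is the polled routine above; "notmain" is a hypothetical entry-point name, and the delay constant depends on your clock):
extern void uart_putc(char c);   /* polled UART output sketched earlier */

void notmain(void)               /* the entry point the bootstrap branches to */
{
    unsigned int c = 0x30;
    for (;;) {
        uart_putc((char)c);
        c = (c + 1) & 0x37;      /* walks 0x30..0x37 ('0'..'7'), then wraps */
        for (volatile unsigned int i = 0; i < 100000000u; i++)
            ;                    /* crude delay from the first test; drop it
                                    for the full-speed 01234567 test */
    }
}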
Start small with any library calls you think you can't do without (printf, for example); if you can't make a simple printf("Hello World\n"); work with all the linking and system backend and such, then you won't be able to run Dhrystone with its system calls left in.
The compiler will likely turn some of Dhrystone into memcpy or memset calls, which you will have to implement. There are most likely hand-tuned assembly versions of these, and your Dhrystone performance numbers can and will be hugely affected by the implementation of functions like these, so you can't simply do this
void memset ( unsigned char *d, unsigned char c, unsigned int len )
{
    while(len--) *(d++)=c;
}
and expect any performance. You can likely grab the GCC library or glibc versions of these, or just steal the ones from the Linux build of one of these tests (disassemble and grab the asm); that way you have apples to apples...
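As a sketch of why the implementation matters so much: merely filling a word at a time, once the pointer is aligned, cuts the number of stores by four, and real library versions add unrolling, cache-line-sized stores, and vector instructions on top of that:
#include <stddef.h>
#include <stdint.h>

void *memset(void *dst, int c, size_t len)
{
    unsigned char *d = dst;
    uint32_t w = (uint32_t)(unsigned char)c * 0x01010101u; /* replicate byte */

    while (len && ((uintptr_t)d & 3)) { *d++ = (unsigned char)c; len--; }
    uint32_t *d4 = (uint32_t *)(void *)d;
    while (len >= 4) { *d4++ = w; len -= 4; }              /* word-at-a-time */
    d = (unsigned char *)d4;
    while (len--) *d++ = (unsigned char)c;                 /* tail bytes */
    return dst;
}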
Benchmarking is often more bogus than real. It is very easy to take the same benchmark source with the same compiler in the same environment (on Linux, on bare metal, etc.) and show dramatically different results by doing various simple things: different compiler options, rearranging the functions, adding a few nops in the bootstrap, anything that builds different code or that takes advantage of (or gets hurt by) the cache. If you want to show bare metal being faster than the operating system, it is likely NOT going to happen without a bit of work: you will need to get the I- and D-caches up, the D-cache likely requires that you get the MMU up, and so on. These can all be research projects. Then you need to know how to control your compiler build: make sure optimizations are on and, as mentioned, add or remove nops in your bootstrap to change the alignment of tight loops with respect to cache lines. On an operating system there are interrupts and other things going on, and possibly you are multitasking, so on bare metal you should be able to get Dhrystone-like tests to run at the same speed as Linux or faster. If you can't, it is not because Linux is faster; it is because you are not doing something right in your bare-metal implementation.
Yes, you can probably use gdb to talk to U-Boot and load programs; I'm not sure, as I never use gdb. I prefer a dumb terminal and x- or y-modem, or JTAG with the OpenOCD terminal (telnet into OpenOCD rather than gdb).
You could try compiling the benchmarks together with U-Boot, so that after U-Boot finishes loading it loads your program. I know that was possible on ARM platforms.
I don't know whether such toolchains exist for PowerPC bare-metal development.
At https://cirosantilli.com/linux-kernel-module-cheat/#dhrystone in this commit I have provided a minimal runnable Dhrystone baremetal example with Newlib on ARM that runs on QEMU and gem5. With this starting point, it should not be hard to port it to PowerPC or other ISAs and real platforms.
In that setup, Newlib implements everything except syscalls themselves as described at: https://electronics.stackexchange.com/questions/223929/c-standard-libraries-on-bare-metal/400077#400077 which makes it much easier to use larger subsets of the C standard library.
And I use newlib through a toolchain built with crosstool-NG.
Some key files in that setup:
linker script
syscall implementations
the full make command showing some of the flags used:
make \
-j 8 \
-C /home/ciro/bak/git/linux-kernel-module-cheat/submodules/dhrystone \
CC=/home/ciro/bak/git/linux-kernel-module-cheat/out/crosstool-ng/build/default/install/aarch64/bin/aarch64-unknown-elf-gcc \
'CFLAGS_EXTRA=-nostartfiles -O0' \
'LDFLAGS_EXTRA=-Wl,--section-start=.text=0x40000000 -T /home/ciro/bak/git/linux-kernel-module-cheat/baremetal/link.ld' \
'EXTRA_OBJS=/home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/bootloader.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/lkmc.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/syscalls_asm.o /home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/lib/syscalls.o' \
OUT_DIR=/home/ciro/bak/git/linux-kernel-module-cheat/out/baremetal/aarch64/qemu/virt/submodules/dhrystone \
-B \
;
Related: How to compile dhrystone benchmark for RV32I

How is -march different from -mtune?

I tried to scrub the GCC man page for this, but still don't get it, really.
What's the difference between -march and -mtune?
When does one use just -march, vs. both? Does it ever make sense to use just -mtune?
If you use -march then GCC will be free to generate instructions that work on the specified CPU, but (typically) not on earlier CPUs in the architecture family.
If you just use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. e.g. setting loop-unrolling heuristics appropriately for that CPU.
-march=foo implies -mtune=foo unless you also specify a different -mtune. This is one reason why using -march is better than just enabling options like -mavx without doing anything about tuning.
Caveat: -march=native on a CPU that GCC doesn't specifically recognize will still enable new instruction sets that GCC can detect, but will leave -mtune=generic. Use a new enough GCC that knows about your CPU if you want it to make good code.
This is what I've googled up:
The -march=X option takes a CPU name X and allows GCC to generate code that uses all features of X. GCC manual explains exactly which CPU names mean which CPU families and features.
Because features are usually added, but not removed, a binary built with -march=X will run on CPU X and has a good chance of running on CPUs newer than X, but it will almost assuredly not run on anything older than X. Certain instruction sets (3DNow!, I guess?) may be specific to a particular CPU vendor; making use of those will probably get you binaries that don't run on competing CPUs, newer or otherwise.
The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. -march=X implies -mtune=X. -mtune=Y will not override -march=X, so, for example, it probably makes no sense to use -march=core2 and -mtune=i686: your code will not run on anything older than core2 anyway, because of -march=core2, so why on Earth would you want to optimize for something older (less featureful) than core2? -march=core2 -mtune=haswell makes more sense: don't use any features beyond what core2 provides (which is still a lot more than what -march=i686 gives you!), but do optimize the code for much newer Haswell CPUs, not for core2.
There's also -mtune=generic. generic makes GCC produce code that runs best on current CPUs (meaning of generic changes from one version of GCC to another). There are rumors on Gentoo forums that -march=X -mtune=generic produces code that runs faster on X than code produced by -march=X -mtune=X does (or just -march=X, as -mtune=X is implied). No idea if this is true or not.
Generally, unless you know exactly what you need, it seems that the best course is to specify -march=<oldest CPU you want to run on> and -mtune=generic (-mtune=generic is there to counter the implicit -mtune=<oldest CPU you want to run on>, because you probably don't want to optimize for the oldest CPU). Or just -march=native, if you are only ever going to run on the same machine you build on.
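One way to see concretely what a given -march enables is to look at GCC's per-feature predefined macros; a quick sketch (x86, hypothetical file name):
#include <stdio.h>

/* Build with e.g. "gcc -march=core2 featprobe.c" vs. "-march=haswell" and
   compare the output; -mtune alone enables none of these feature macros. */
int main(void)
{
#ifdef __SSE2__
    puts("SSE2 enabled");
#endif
#ifdef __SSSE3__
    puts("SSSE3 enabled");
#endif
#ifdef __SSE4_2__
    puts("SSE4.2 enabled");
#endif
#ifdef __AVX2__
    puts("AVX2 enabled");
#endif
    return 0;
}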

How portable is mmap?

I've been considering using mmap for file reading, and was wondering how portable that is.
I'm developing on a Linux platform, but would like my program to work on Mac OS X and Windows.
Can I assume mmap is working on these platforms?
The mmap() function is a POSIX call. It works fine on MacOS X (and Linux, and HP-UX, and AIX, and Solaris).
The problem area will be Windows. I'm not sure whether there is an _mmap() call in the POSIX 'compatibility' sub-system. It is likely to be there — but will have the name with the leading underscore because Microsoft has an alternative view on namespaces and considers mmap() to intrude on the user name space, even if you ask for POSIX functionality. You can find a definition of an alternative Windows interface MapViewOfFile() and discussion about performance in another SO question (mmap() vs reading blocks).
If you try to map large files on a 32-bit system, you may find there isn't enough contiguous space to allocate the whole file in memory, so the memory mapping will fail. Do not assume it will work; decide what your fallback strategy is if it fails.
Using mmap for reading files isn't portable if you rely on mapping large parts of large files into your address space: 32-bit systems can easily lack a single large usable chunk of address space, say 1 GB, so an mmap of 1 GB would fail quite often.
The principle of a memory-mapped file is fairly portable, but you don't have mmap() on Windows (though things like MapViewOfFile() exist). You could take a peek at the Python mmap module's C code to see how they do it for various platforms.
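A rough sketch of that cross-platform approach, mapping a whole file read-only (error handling reduced to returning NULL; a real program should fall back to plain reads when the mapping fails, e.g. from address-space exhaustion on 32-bit hosts):
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

void *map_file_readonly(const char *path, size_t *len)
{
    HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return NULL;
    LARGE_INTEGER size;
    if (!GetFileSizeEx(f, &size)) { CloseHandle(f); return NULL; }
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(f);                     /* the mapping keeps the file alive */
    if (!m) return NULL;
    void *p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(m);                     /* the view keeps the mapping alive */
    if (p) *len = (size_t)size.QuadPart;
    return p;
}
#else
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void *map_file_readonly(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping outlives the fd */
    if (p == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return p;
}
#endif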
I consider memory-mapped I/O on Unix systems unusable for interactive applications, since it can result in a SIGSEGV/SIGBUS if the file is truncated in the meantime by some other process. Ignoring such sick "solutions" as setjmp/longjmp, there is nothing one can do other than terminate the process after getting SIGSEGV/SIGBUS.
The new G++ feature that converts such signals into exceptions seems to be intended mainly for Apple's OS, since the description states that one needs runtime support for it, and there is no information to be found about this G++ feature anywhere. We will probably have to wait a couple of years until structured exception handling, as Windows has had for more than 20 years, makes its way into the Unixes.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU-specific instructions within a DLL?
Slightly longer version: When downloading (32-bit) DLLs from, say, Microsoft, it seems that one size fits all processors. Does this mean that they are strictly built for the lowest common denominator (i.e. the minimum platform supported by the OS)? Or is there some technique that is used to export a single interface within the DLL but utilise CPU-specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
You can find current CPU info in the registry here:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor
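A minimal sketch of that idea, with hypothetical function names, using the Win32 feature-detection API rather than the registry:
#include <windows.h>

/* two hand-written variants of the same routine (hypothetical) */
static void blend_generic(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++) dst[i] = 0.5f * (dst[i] + src[i]);
}
static void blend_sse2(float *dst, const float *src, int n)
{
    /* imagine an SSE2 intrinsics version here */
    blend_generic(dst, src, n);
}

/* the exported function always calls through this pointer;
   it defaults to the lowest-common-denominator version */
static void (*blend_impl)(float *, const float *, int) = blend_generic;

BOOL WINAPI DllMain(HINSTANCE inst, DWORD reason, LPVOID reserved)
{
    if (reason == DLL_PROCESS_ATTACH) {
        if (IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE))
            blend_impl = blend_sse2;    /* SSE2 is available */
    }
    return TRUE;
}

__declspec(dllexport) void Blend(float *dst, const float *src, int n)
{
    blend_impl(dst, src, n);
}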
The DLL is expected to work on every computer WIN32 runs on, so you are stuck to the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used basically is as follows:
- determine CPU features like MMX, SSE in runtime
- if they are present, use them, if not, have fallback code ready
Because you cannot let your compiler optimise for anything other than i386, you will have to write the code that uses the specific instruction sets in inline assembler. I don't know if there are higher-level toolkits for this. Determining the CPU features is straightforward, but may also need to be done in assembler.
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about a fallback; there is no reason to support anything below that unless you have a very niche application.
http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it (OK, you get two cakes: your DLL will be bigger). Even MSVC 2005 can do it for very specific cases (e.g. memcpy() can use SSE4).
There are many ways to switch between different versions. A DLL is loaded because the loading process needs functions from it, and function names are converted into addresses. One solution is to make this lookup depend not just on the function name, but also on processor features. Another method uses the fact that the name-to-address lookup goes through a table of pointers as an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions, so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted at the generic x86 architecture for the simple reason that they have to work across the whole multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if this has changed), Microsoft used to optimize its DLLs for size rather than speed, because reducing the overall size of the DLL gave a higher performance boost than any other optimization the compiler could generate: the speed-ups from micro-optimization are decidedly small compared to the speed-ups from not having the CPU wait for memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro-optimization, simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall into this category. You can rest assured that such critical loops are already optimized in assembler by Microsoft's software engineers to some level, not leaving much behind for the compiler to find. (I know it's expecting too much, but I hope they do this.)
As you can see, there would be only drawbacks to bloating the DLL with additional versions of code tuned for different architectures, when most of that code is rarely used and never part of the critical code that consumes most of your CPU cycles.
