With C++ code built for debugging with g++ (i.e. options "-O0 -ggdb") and using the newest gcc (5.1.0) and gdb (7.9) the display of source code in gdb is still painfully non-linear when using the "next" command. As an example this function call might be expected to step through with a single "next":
7757| SDValue NewRoot = TLI->LowerFormalArguments(
7758| DAG.getRoot(), F.getCallingConv(), F.isVarArg(), Ins, dl, DAG, InVals);
however it takes four, with the displayed execution line being first 7757, then 7758, then 7757 again, then 7758 again. If the function call is condensed to a single line then just one "next" is needed. If the call is absurdly inflated then seven "next"s are needed (the order is shown by the '#' annotations):
7757| SDValue
7758| NewRoot
7759| =
#1,6 7760| TLI
7761| ->
7762| LowerFormalArguments(
#5 7763| DAG.getRoot(),
7764| F.getCallingConv(),
#3 7765| F.isVarArg(),
7766| Ins,
7767| dl,
7768| DAG,
7769| InVals
#2,4,7 7770| );
So it's related to but not as simple as "each function call on a distinct line is a stepping point". This gets especially confusing with breakpoints in recursive functions, where I find myself checking the callstack to see whether it's really a new invocation or just a phony backwards step.
Since reflowing all of the LLVM source to contain function calls in a single line isn't really a viable option, is there some gcc/gdb option for controlling this behaviour?
EDIT: now checked with clang 3.5 and lldb 3.5: when built with clang only three "next"s occur. And gdb and lldb see the same "next" behaviour in either case (i.e. 4 with gcc, 3 with clang)
This sort of debugger behavior is a "GIGO" (garbage in, garbage out) situation: gdb normally just does whatever the debug info tells it to do, so odd stepping behavior is generally due to decisions made by the compiler. It may be a bug, and is probably worth a bug report, but I also wouldn't be surprised if it is intended to work this way for some reason.
You can investigate these kinds of problems by using readelf or objdump to examine the line table.
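As a rough illustration of that workflow, here is a small stand-alone test case with a call spread over several lines. The file name and all function names are hypothetical and have nothing to do with LLVM. The commands in the comment use the decoded line table options of readelf and objdump, and the exact rows you get will depend on the compiler version.

```c
/* linetable_demo.c: a hypothetical test case, not taken from LLVM.
 *
 * Build without optimisation and with debug info, then dump the decoded
 * line table:
 *
 *   gcc -O0 -g -c linetable_demo.c
 *   readelf --debug-dump=decodedline linetable_demo.o
 *   objdump --dwarf=decodedline linetable_demo.o
 *
 * Each row maps an address to a source line. A line that shows up in
 * several non-adjacent rows is one that "next" may visit more than once.
 */

static int first_arg(void)  { return 1; }
static int second_arg(void) { return 2; }

static int combine(int a, int b, int c)
{
    return a + b + c;
}

/* The call is deliberately spread over several lines so that the
 * argument evaluations and the call itself sit on different lines. */
int multi_line_call(void)
{
    int result = combine(
        first_arg(),
        second_arg(),
        3);
    return result;
}
```

Comparing the table for this layout with the table for the same call written on one line should show where the extra stepping points come from.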
I'm running into a deadlock during static initialization in Solaris. The situation strongly resembles that of this user's problem.
My environment is:
Solaris 10
gcc 5.4 installed to a non-standard location
all relevant shared libraries are linked against the libstdc++ and/or libgcc_s libraries from that installation
boost 1.45 (we're moving away from it soon, but for the moment that cannot change)
I see this problem when linking dynamically or statically against boost libraries
The symptoms:
Deadlocks while executing boost::system::generic_category()
generic_category() is being called to initialize global static references in boost/system/error_code.hpp
If I shuffle link order, putting -lboost_system ahead of other libraries being linked in, the problem goes away.
If I set a breakpoint in generic_category() then attempt to step over the 1st line after the first time the breakpoint gets hit, the breakpoint gets hit again when executing the same function in a different shared library's _init() -- that is, it never stops on the 2nd line of generic_category() from when I told it to step over the 1st line.
Since stepping over the 1st line didn't work, I stepped into it then stepped out & again the breakpoint got hit.
I restarted the process & stepped in after the breakpoint got hit then began stepping. Stepping over the call to boost::system::error_category::error_category() I ran into the same problem.
I tried again, this time stepping an instruction at a time when I got to the error_category() call. It attempts to call it through the PLT which calls elf_rtbndr() which is supposed to return the real function's address in %o0, but when I step over the call to elf_rtbndr() it again hits the breakpoint instead of resuming where it left off.
When the breakpoint gets hit for the 2nd time it's calling generic_category() in some other shared library's _init(); that's when the deadlock occurs.
Thanks in advance for your time & help.
This has been reported several times (see this post in Boost and another in GCC). This seems to be a circular dependency issue during Boost initialization which, for some reason, only manifests on Solaris. The usual advice is to work around this by messing with library initialization (e.g. by shuffling the library order as you did with -lboost_system).
Another option is to disable thread-safe guards (the -fno-threadsafe-statics flag). That would get rid of the deadlock but would keep the buggy nested constructor call, which is undesirable.
I understand more or less the idea: When compiling separate modules and producing assembly code, functions calling each other have to respect strictly the calling convention, which kills the opportunity for many optimisations when compiling separate modules.
For instance, if I have function A which calls function B which calls function C, all three in their own separate source files, link-time optimisation makes it possible to allocate registers across the three functions so that no register saving on the stack is needed during those calls. With traditional compile-assemble-link this is not possible, as the caller-saved and callee-saved registers are imposed by the calling convention.
Another optimisation is to inline functions which are called only once. This previously was possible only if a function is local, but thanks to linktime optimisation it's now possible even if the function is in another source file.
Now, if I compile with both -flto and -S flags, I see that instead of normal assembly instructions, gcc generates an encoded representation of the program, such as this:
.section .gnu.lto_.inline.c3c5e6ef8ec983c,"dr0"
.ascii "x\234mQ;N\303#\20}\273\353\17\370C\234\20\242`\"!Q\20\11Ah\322&\25\242\314\231|\4\32\220\220(,$.#\205D\343\3P Z.\341Tn\231\35\274\31L\342\342\355\314\274\371<\317\30\354\376\356\365\357\333\7\262"
.ascii "1\240G\325\273\202\7\216\232\204\36\205"
.ascii "8\242\370\240|\222"
.ascii "8\374\21\205ty\352\"*r\340!:!n\357n%]\224\345\10|\304\23\342\274z\346"
.ascii "8\35\23\370\7\4\1\366s\362\203j\271]\27bb{\316\353\27\343\310\4\371\374\237*n#\220\342rA\31"
.ascii "7\365\263\327\231\26\364\10"
.ascii "2\\-\311\277\255^w\220}|\340\233\306\352\263\362Qo+e+\314\354\277\246\354\252\277\20\364\224%T\233'eR\301{\32\340\372\313\362\263\242\331\314\340\24\6\21s\210\243!\371\347\325\333&m\210\305\203\355\277*\326\236\34\300-\213\327\306\2Td\317\27\231\26tl,\301\26\21cd\27\335#\262L\223"
.ascii "8\353\30\351\264{I\26\316\11\14"
.ascii "9\326h\254\220B}6a\247\13\353\27M\274\231"
.ascii "0\23M\332\272\272%d[\274\36Q\200\37\321\1&\35"
Since the data is in its own dedicated section, the linker sees this and does the code generation. If the module was written in assembly, or compiled without the -flto flag, the linker would see code in the .text section instead, so no confusion is possible for the linker.
The problem is: how can the linker generate code? Normally only gcc can generate code; the linker's role is just to fix up a few offsets and adapt the binary format. In order to generate code, the linker would need to contain a second copy of the entire gcc backend (the half of the compiler which generates assembly code from the intermediate representation), as well as the entire assembler (since no assembly code was produced). How is such a thing possible, especially considering that binutils is a completely separate entity from gcc, developed by different teams?
GCC's -flto emits a serialized form of GCC's internal representation, as you discovered.
Then, at link time, the linker reinvokes GCC and passes it the objects that need final compilation. GCC reads the internal representation and does the work.
I think the actual work is done in collect2, which is part of GCC that is used when invoking the linker (I'm a little fuzzy on the details). There is also a "linker plugin" system that enables this to work a little better (like letting the linker decide how to split the compilation). This is implemented at least by the binutils ld and by gold; but as far as I recall this is just an optimization and isn't needed to get the basic -flto feature to work. You can see a bit more information on the original LTO project page; and maybe links from there would explain more.
There is more overlap between the GCC and binutils teams than you might think. The two projects share some code and have a long history of working together. Some people work on both projects.
From https://gcc.gnu.org/wiki/LinkTimeOptimization:
Despite the "link time" name, LTO does not need to use any special
linker features. The basic mechanism needed is the detection of GIMPLE
sections inside object files. This is currently implemented in
collect2 [which is called by gcc; -ps]. Therefore, LTO will work on any linker already supported by
GCC.
I assume this means you must link by invoking the compiler driver gcc. Simply linking with the system's vanilla linker wouldn't optimize the whole program, as you already concluded.
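A minimal sketch of what that looks like in practice, with hypothetical file and function names. The two halves would normally be separate source files; they are shown in one listing here only so the sketch compiles as-is. Whether GCC actually inlines the helper depends on its heuristics and version.

```c
/* lto_demo.c: hypothetical names throughout. In a real project the two
 * halves below would live in separate files (say a.c and b.c).
 *
 * Separate-file build that lets GCC optimise across the file boundary:
 *
 *   gcc -O2 -flto -c a.c
 *   gcc -O2 -flto -c b.c
 *   gcc -O2 -flto a.o b.o -o prog    (link through the gcc driver)
 *
 * Running ld directly on a.o and b.o would skip the step in which GCC
 * is reinvoked on the serialized representation.
 */

/* ---- would be b.c: a helper with exactly one caller ---- */
int scale_and_offset(int x)
{
    return x * 3 + 1;
}

/* ---- would be a.c: the only caller of the helper ---- */
int scale_and_offset(int x);   /* normally declared in a shared header */

int sum_scaled(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += scale_and_offset(i);
    return total;
}
```

Disassembling the final program (for example with objdump -d) and checking whether the loop still contains a call to scale_and_offset is one way to see whether cross-file inlining happened.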
Update:
https://gcc.gnu.org/onlinedocs/gccint/Collect2.html says
The program collect2 is installed as ld in the directory where the
passes of the compiler are installed. When collect2 needs to find the
real ld it tries the following file names: [...]
(The page goes on detailing how collect2 looks for configuration-dependent executables and ones with well-known names like real-ld, finally even ld; but will not call itself recursively.)
When I compile 32-bit C code with GCC and the -fomit-frame-pointer option, the frame pointer (ebp) is not used unless my function calls Windows API functions with stdcall and at least one parameter.
For example, if I only use GetCommandLine() from the Windows API, which has no parameters/arguments, GCC will omit the frame pointer and use ebp for other things, speeding up the code and not having that useless prologue.
But the moment I call a stdcall Win32 function that accepts at least one argument, GCC completely ignores the -fomit-frame-pointer and uses the frame pointer anyway, and the code is worse in inspection as it can't use ebp for general purpose things. Not to mention I find the frame pointer quite pointless. I mean, I want to compile for release and distribution, why should I care about debugging? (if I want to debug I'll just use a debug build instead after reproducing the bug)
My stack most certainly does NOT contain dynamic allocation like alloca. So, the stack has a defined structure yet GCC chooses the dumb method despite my options? Is there something I'm missing to force it to not use frame pointer?
My second gripe is that it refuses to use "push" instructions for Win32 functions. Every other compiler I tried used push instructions to put arguments on the stack, resulting in much more compact code, not to mention it is the most natural way to pass arguments for stdcall. Yet GCC stubbornly uses "mov" instructions to store each argument manually at an offset relative to esp, because it needs to keep the stack pointer completely static. stdcall is made to be easy on the caller, and yet GCC completely misses the point of stdcall since it generates this crappy code when interfacing with it. What's worse, since the stack pointer is static, it still uses a frame pointer? Just why?
I tried -mpush-args, it doesn't do anything.
I also noticed that if I make my stack big enough to exceed a page (4096 bytes), GCC will add a prologue with a function that does nothing but "bitwise or" the stack with zero every 4096 bytes (which changes nothing). I assume it's for touching the stack and automatically committing memory with page faults if the stack was only reserved? Unfortunately, it does this even if I set the initial commit of the stack (not the reserve) high enough to hold my stack, not to mention this shouldn't even be needed in the first place. Redundant code at its best.
Are these bugs in GCC? Or something I'm missing in options? Should I use something else? Please tell me if I'm missing some options.
I seriously hope I won't have to make an inline asm macro just to call stdcall functions and use push instructions (and this will avoid frame pointer too I guess). That sounds really overkill for something so basic that should be in compilers of today. And yes I use GCC 4.8.1 so not an ancient version.
As an extra question, is it possible to force GCC to not save registers on the stack in the function prologue? I use my own direct entry point with the -nostartfiles argument, because it is a pure Windows application and it works just fine without the standard library startup. If I use __attribute__((noreturn)), it will discard the epilogue restoring the registers but it will still push them on the stack in the prologue. I don't know if there's a way to force it to not save registers for this entry point function. Either way it's not a big deal in the least, it would just feel more complete I guess. Thanks!
See the answer to "Force GCC to push arguments on the stack before calling function (using PUSH instruction)".
I.e. try -mpush-args -mno-accumulate-outgoing-args. It may also require -mno-stack-arg-probe if gcc complains.
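One way to check the effect of these flags is to compare the generated assembly for a tiny test case. The sketch below uses hypothetical names and the default calling convention rather than stdcall, so that it also compiles outside Windows. The -m32 builds assume a toolchain with 32-bit support, and the default output may already use push on some GCC versions and tunings.

```c
/* push_demo.c: hypothetical test case for comparing argument passing.
 *
 * Emit 32-bit assembly with and without the flags from the answer and
 * compare the call sequence in caller():
 *
 *   gcc -m32 -O2 -S push_demo.c -o default.s
 *   gcc -m32 -O2 -S -mpush-args -mno-accumulate-outgoing-args push_demo.c -o push.s
 */

/* noinline keeps the call visible at -O2; the attribute is a GCC extension. */
__attribute__((noinline))
int take_three(int a, int b, int c)
{
    return a * 100 + b * 10 + c;
}

/* The arguments depend on x so the compiler cannot fold them away. */
int caller(int x)
{
    return take_three(x, x + 1, x + 2);
}
```

In push.s the three arguments would be expected to reach the stack through push instructions, while a build that accumulates outgoing arguments stores them with mov at offsets from esp.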
It looks like supplying the -mpush-args -mno-accumulate-outgoing-args -mno-stack-arg-probe works, specifically the last one. Now the code is cleaner and more normal like other compilers, and it uses PUSH for arguments, even makes it easier to track in OllyDbg this way.
Unfortunately, this FORCES the stupid frame pointer to be used, even in small functions that absolutely do not need it at all. Seriously is there a way to absolutely force GCC to disable the frame pointer?!
I want to test some architecture changes on an existing architecture (x86) using simulators. However, to properly test them and run benchmarks, I might have to make some changes to the instruction set. Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
    asm __volatile__ (".byte 0x90\n");
    return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your instruction bytes. Of course you won't see the actual instruction in a regular objdump, and the program would likely not run on your system (unless you use one of the nop combinations), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well for you when it doesn't know this instruction, and you should take care and work with inline assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
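As a sketch of what the operand and clobber lists might look like: raw bytes hard-wire their registers, so the operand has to be pinned to the register the encoding uses. The bytes below encode the existing "inc %eax" instruction and only stand in for a hypothetical new instruction, so that the example runs on real hardware. The function name is made up.

```c
/* Sketch: emit a raw encoding while telling GCC what it reads and writes.
 * The bytes 0xff 0xc0 encode "inc %eax" and are a stand-in for the bytes
 * of a hypothetical new instruction. */
unsigned int custom_inc(unsigned int value)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ (
        ".byte 0xff, 0xc0"   /* raw encoding, hard-wired to EAX */
        : "+a" (value)       /* read-write operand pinned to EAX */
        :                    /* no separate inputs */
        : "cc"               /* the instruction changes the flags */
    );
    return value;
#else
    /* Fallback so the sketch still compiles on other architectures. */
    return value + 1u;
#endif
}
```

If the new instruction also reads or writes memory, a "memory" clobber would be needed as well so that GCC does not keep stale values in registers across it.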
Complicated solution:
The alternative approach is to implement this in your compiler. It can be done in gcc, but as stated in the comments, LLVM is probably one of the best ones to play with, as it's designed as a compiler development platform. It's still very complicated, though: LLVM is best suited for the IR optimization stages and is somewhat less friendly when you try to modify the target-specific backends.
Still, it's doable, and you have to do that if you also plan to have your compiler decide when to issue this instruction. I'd suggest to start from the first option though, to see if your simulator even works with this addition, and only then spending time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function; there's relatively more documentation about this at http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing ones, by modifying a group of files in GCC called the "machine description": instruction patterns in the <target>.md file, some code in the <target>.c file, predicates, constraints and so on. All of these live in the $GCCHOME/gcc/config/<target>/ folder and are used in the step that generates assembly code from RTL. You can also change when instructions are emitted by modifying other, more general GCC source files (SSA tree generation, RTL generation), but all of that is a bit more complicated.
A simple explanation of what happens:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during compilation that generate further code and data. It's 20+ years since I did it so I have forgotten all the details, sorry.
I'm working on the Pintos toy operating system at university, but there's a strange bug when using GCC 4.6.2. When I push my system call arguments (just 3 pushl-s in inline assembly), some mysterious data also appears on the stack, and the arguments are in the wrong order. Setting -fno-omit-frame-pointer gets rid of the strange data, but the arguments are still in the wrong order. GCC 4.5 works fine. Any idea what specific option could fix this?
NOTE: the problem still occurs with -O0.
Without a code example and a listing of the result from your different compilations, it's difficult to help you. But here are three possible causes for your problems:
Make sure you understand how arguments are pushed to the stack. Arguments are pushed from the back, i.e. the last argument first. This makes it possible for printf(char *, ...) to examine the first item to find out how many more there are. If you want to call the function int foo(int a, int b, int c), you'll need to push c, then b and finally a.
Could the strange data on the stack be a return address or EFLAGS? I don't know Pintos and how system calls are made, but make sure that you understand the difference between CALL/RET and INT/IRET. INT pushes the flags onto the stack.
If your inline assembly has side effects, you might want to write volatile/__volatile__ in front of it. Otherwise GCC is allowed to move it when optimizing.
I need to see your code to better understand what's going on.
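To illustrate the first point: because the first argument ends up at a known position, a variadic function can read it to learn how many further arguments to fetch. The sketch below uses a hypothetical sum_ints function with a count in place of printf's format string. The va_arg interface hides the actual stack layout, so the same code also works on ABIs that pass arguments in registers.

```c
#include <stdarg.h>

/* The first (fixed) argument says how many variadic ints follow,
 * much as printf's format string describes the arguments after it. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);

    return total;
}
```

For example, sum_ints(3, 10, 20, 30) walks three variadic arguments and returns their sum.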
The culprit was -fomit-frame-pointer, which is enabled by default in the GCC 4.6 series (I hit it with 4.6.2). -fno-omit-frame-pointer fixed the issue.
Did you clean the parameters off the stack after the syscall? gcc may not be aware that you touch the stack, and may generate code that depends on the stack pointer having the value it expects.
-fno-omit-frame-pointer forces gcc to use ebp/rbp for accessing local data, but it just hides the actual problem.