Regarding GCC-produced assembly code (assembly code not in order?)

I'm using GCC to compile for a 64-bit MIPS machine.
I noticed something interesting in a piece of generated assembly code. The details are below:
00000001200a4348 <get_pa_txr_index+0x50> 2ca2001f sltiu v0,a1,31
00000001200a434c <get_pa_txr_index+0x54> 14400016 bnez v0,00000001200a43a8 <get_pa_txr_index+0xb0>
00000001200a4350 <get_pa_txr_index+0x58> 64a2000e daddiu v0,a1,14
00000001200a43a8 <get_pa_txr_index+0xb0> 000210f8 dsll v0,v0,0x3
00000001200a43ac <get_pa_txr_index+0xb4> 0062102d daddu v0,v1,v0
00000001200a43b0 <get_pa_txr_index+0xb8> dc440008 ld a0,8(v0)
00000001200a43b4 <get_pa_txr_index+0xbc> df9955c0 ld t9,21952(gp)
00000001200a43b8 <get_pa_txr_index+0xc0> 0320f809 jalr t9
00000001200a43bc <get_pa_txr_index+0xc4> 00000000 nop
Normally the bnez would jump straight to 0xb0. I'm certain the code after 0xb0 must use a1 as a parameter, yet a1 never appears in the block after 0xb0.
Instead, a1 is used at 0x58, which sits immediately after the bnez at 0x54.
So is it possible that the instructions at 0x54 and 0x58 are executed at the same time? A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
My question is: how does GCC know my CPU has this capability? What kind of technique is GCC using, and which optimization option makes it generate this kind of assembly code?
Thanks.

This is not superscalar execution; the feature is called a branch delay slot. On MIPS, the instruction immediately following a branch is always executed, whether or not the branch is taken, so the daddiu at 0x58 runs first and its result in v0 is exactly what the block at 0xb0 consumes. Finding an instruction with which to fill a branch delay slot is done during the scheduling phase of the backend of an optimizing compiler (GCC's -fdelayed-branch pass, enabled by optimization); when no useful instruction is found, the slot simply holds a nop, as you can see after the jalr at 0xc0.
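As a hedged reconstruction in C (all names here are invented for illustration), the fragment behaves roughly like this, with the delay-slot daddiu running before the branch target is reached:

typedef void (*handler_t)(void *);

/* Sketch of the disassembly above: the daddiu in the bnez delay slot
   executes whether or not the branch is taken, so v0 already holds
   a1 + 14 when control reaches 0xb0. */
void dispatch(void **tbl, unsigned long a1, handler_t fn)
{
    if (a1 < 31) {                       /* sltiu v0,a1,31 ; bnez v0,0xb0 */
        unsigned long v0 = a1 + 14;      /* daddiu v0,a1,14 (delay slot)  */
        fn(tbl[v0 + 1]);                 /* dsll/daddu/ld at 0xb0+, jalr  */
    }
}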

Insert an undefined instruction in X86 code to be detected by Intel PIN

I'm using a PIN based simulator to test some new architectural modifications. I need to test a "new" instruction with two operands (a register and a memory location) using my simulator.
Since it's tedious to use the GCC machine description to add only one instruction, it seemed logical to use NOPs or undefined instructions. PIN can easily detect a NOP using INS_IsNop, but that would interfere with NOPs occurring naturally in the code, and a NOP also has either no operands or a single memory operand.
The only option left is to use an undefined instruction. Undefined instructions would never interfere with the rest of the code and can be detected by PIN using INS_IsInvalid.
The problem is I don't know how to add an undefined instruction (with operands) using GCC inline assembly. How do I do that?
So it turns out that x86 has an explicit "undefined instruction", ud2. GCC can produce this by simply using:
asm("ud2");
As for an undefined instruction with operands, I'm not sure what that would mean. Once you have an undefined opcode, the additional bytes are all undefined.
But maybe you can get what you want with something like:
asm(".byte 0x0f, 0x0b");
Try using a prefix that doesn't normally apply to an instruction. e.g.
rep add eax, [rsi + rax*4 - 15]
will assemble just fine. Some instruction set extensions are done this way. e.g. lzcnt is encoded as rep bsf, so it executes as bsf on older CPUs, rather than generating an illegal instruction exception. (Prefixes that don't apply are ignored, as required by the x86 ISA.)
This lets you take advantage of the assembler's ability to encode instruction operands, which, as David Wohlferd notes in his answer, is a problem if you use ud2.
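A hedged GCC inline-asm version of that trick (the function name is invented; some assemblers reject a spelled-out rep on a non-string instruction, so the prefix is emitted as a raw byte):

/* Sketch: a 0xF3 (rep) prefix in front of a normal add. Real CPUs ignore
   the inapplicable prefix and execute a plain add, while a PIN tool can
   key on the prefix byte; the assembler still encodes the register
   operands for us. */
static inline int tagged_add(int x, int y)
{
    __asm__ __volatile__ (".byte 0xf3\n\t"   /* rep prefix, ignored by add */
                          "addl %1, %0"
                          : "+r" (x)
                          : "r" (y));
    return x;
}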

GDB doesn't disassemble program running in RAM correctly

I have an application compiled using GCC for an STM32F407 ARM processor. The linker places it in flash, but it executes from RAM. A small bootstrap program copies the application from flash to RAM and then branches to the application's ResetHandler.
memcpy(appRamStart, appFlashStart, appRamSize);
// run the application
__asm volatile (
"ldr r1, =_app_ram_start\n\t" // load a pointer to the application's vectors
"add r1, #4\n\t" // increment vector pointer to the second entry (ResetHandler pointer)
"ldr r2, [r1, #0x0]\n\t" // load the ResetHandler address via the vector pointer
// bit[0] must be 1 for THUMB instructions otherwise a bus error will occur.
"bx r2" // jump to the ResetHandler - does not return from here
);
This all works OK, except that when I try to debug the application from RAM (using GDB from Eclipse), the disassembly is incorrect. The curious thing is that the debugger gets the source code right and will accept and halt on breakpoints that I set; I can single-step the source lines. However, when I single-step the assembly instructions, they make no sense at all, and the listing contains numerous undefined instructions. I'm assuming it is some kind of alignment problem, but it all looks correct to me. Any suggestions?
It is possible that GDB relies on the symbol table to determine the instruction set mode, which can be Thumb(2) or ARM. When you move the code to RAM it probably can't find this information and falls back to ARM mode.
You can use set arm force-mode thumb in GDB to force Thumb-mode disassembly.
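For example, a typical session might look like this (addresses depend on your link script):
(gdb) set arm force-mode thumb
(gdb) x/8i $pc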
As a side note, if you see illegal instructions when debugging an ARM binary, this is generally the cause, provided the output isn't complete nonsense such as an attempt to disassemble data.
I personally find it strange that tools don't try a heuristic approach when disassembling ARM binaries. In auto mode it shouldn't be hard to try both modes, count the decoding errors, and pick the mode with fewer as a last resort.

change instruction set in GCC

I want to test some architecture changes on an already existing architecture (x86) using simulators. However, to properly test them and run benchmarks, I might have to make some changes to the instruction set. Is there a way to add these changes to GCC or any other compiler?
Simple solution:
One common approach is to add inline assembly, and encode the instruction bytes directly.
For example:
int main()
{
asm __volatile__ (".byte 0x90\n");
return 0;
}
compiles (gcc -O3) into:
00000000004005a0 <main>:
4005a0: 90 nop
4005a1: 31 c0 xor %eax,%eax
4005a3: c3 retq
So just replace 0x90 with your instruction bytes. Of course you won't see the actual instruction in a regular objdump, and the program would likely not run on your system (unless you use one of the NOP encodings), but the simulator should recognize it if it's properly implemented there.
Note that you can't expect the compiler to optimize well around an instruction it doesn't know, and you should take care to use the inline-assembly clobber/input/output options if it changes state (registers, memory), to ensure correctness. Use optimizations only if you must.
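For example, a hedged sketch (the 0x0f, 0x0b encoding and the register choices are assumptions for illustration; this traps as ud2 on real hardware and only makes sense inside the simulator):

/* Tell GCC which registers the simulated instruction reads and writes
   so the surrounding optimized code stays correct. */
static inline long sim_insn(long a, long b)
{
    long r;
    __asm__ __volatile__ (".byte 0x0f, 0x0b"  /* placeholder opcode bytes    */
                          : "=a" (r)          /* assumption: result in rax   */
                          : "D" (a), "S" (b)  /* assumption: inputs rdi/rsi  */
                          : "memory");        /* conservative about effects  */
    return r;
}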
Complicated solution
The alternative approach is to implement this in your compiler. It can be done in GCC, but as stated in the comments, LLVM is probably one of the best ones to play with, since it's designed as a compiler development platform. It's still quite involved, though: LLVM is best suited for IR-level optimization work and is somewhat less friendly when you modify the target-specific backends.
Still, it's doable, and you'll have to do it if you also plan to have your compiler decide when to issue this instruction. I'd suggest starting with the first option, though, to see whether your simulator even works with this addition, and only then spend time on the compiler side.
If and when you do decide to implement this in LLVM, your best bet is to define it as an intrinsic function; there's relatively more documentation about this here: http://llvm.org/docs/ExtendingLLVM.html
You can add new instructions, or change existing ones, by modifying the group of files in GCC called the "machine description": instruction patterns in the <target>.md file, some code in the <target>.c file, predicates, constraints, and so on. All of these live in the $GCCHOME/gcc/config/<target>/ directory and are used when generating assembly code from RTL. You can also change where instructions are emitted by modifying other, more general GCC source files (SSA tree generation, RTL generation), but all of that is a little more complicated.
A simple explanation of what happens:
https://www.cse.iitb.ac.in/grc/slides/cgotut-gcc/topic5-md-intro.pdf
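For a flavor of what an instruction pattern looks like, here is a hedged sketch of a define_insn in a <target>.md file (the pattern name, the UNSPEC_MYINSN constant, and the mnemonic are all invented; the unspec would also need to be declared in the target's unspec enum):

;; Hedged sketch: one new instruction taking one register input
;; and producing one register output.
(define_insn "my_new_insn"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_operand:SI 1 "register_operand" "r")]
                   UNSPEC_MYINSN))]
  ""
  "mynewinsn\t%0, %1")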
It's doable, and I've done it, but it's tedious. It is basically the process of porting the compiler to a new platform, using an existing platform as a model. Somewhere in GCC there is a file that defines the instruction set, and it goes through various processes during the build that generate further code and data. It's been 20+ years since I did it, so I have forgotten all the details, sorry.

How do I set a software breakpoint on an ARM processor?

How do I do the equivalent of an x86 software interrupt:
asm( "int $3" )
on an ARM processor (specifically a Cortex-A8) to generate an event that will break execution under GDB?
Using the arm-none-eabi-gdb.exe cross-debugger, this works great for me (thanks to Igor's answer):
__asm__("BKPT");
ARM does not define a single breakpoint instruction; the convention differs between OSes. On ARM Linux it's usually an undefined (UND) opcode (e.g. FE DE FF E7) in ARM mode and BKPT (BE BE) in Thumb.
With GCC compilers, you can usually use the __builtin_trap() intrinsic to generate a platform-specific breakpoint. Another option is raise(SIGTRAP).
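Both are real, portable facilities; a minimal sketch:

#include <signal.h>

/* __builtin_trap() expands to a platform-specific trap (an undefined
   opcode on ARM) and does not return; raise(SIGTRAP) instead asks the
   kernel to deliver SIGTRAP, which a debugger will catch. */
void break_here(int use_signal)
{
    if (use_signal)
        raise(SIGTRAP);
    else
        __builtin_trap();
}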
I have a simple library (scottt/debugbreak) just for this:
#include <debugbreak.h>
...
debug_break();
Just copy the single debugbreak.h header into your code and it'll correctly handle ARM, AArch64, i386, x86-64 and even MSVC.
__asm__ __volatile__ ("bkpt #0");
See the BKPT entry in the ARM instruction set reference.
For Windows on ARM, the intrinsic __debugbreak() still works; it uses an undefined opcode:
nt!DbgBreakPointWithStatus:
defe __debugbreak
Although the original question asked about the Cortex-A8, which is ARMv7-A, on ARMv8 GDB uses
brk #0
On my armv7hl system (i.MX6q with Linux 4.1.15), to set a breakpoint in another process I use:
ptrace(PTRACE_POKETEXT, pid, address, 0xe7f001f0)
I chose that value after strace'ing gdb :)
This works perfectly: I can examine the traced process, restore the original instruction, and restart the process with PTRACE_CONT.
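A hedged sketch of the full plant/continue/restore cycle (assumes a 32-bit ARM tracee running in ARM mode, a 4-byte-aligned address, and that you are already attached with PTRACE_ATTACH):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

void run_to_breakpoint(pid_t pid, unsigned long addr)
{
    int status;
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, (void *)0);
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)0xe7f001f0);
    ptrace(PTRACE_CONT, pid, 0, 0);
    waitpid(pid, &status, 0);   /* tracee stops when it hits the breakpoint */
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)orig);
}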
We can use a breakpoint instruction:
For A64: use the BRK #imm instruction.
For A32 (Arm) and T32 (Thumb): use the BKPT #imm instruction.
Or we can use the UND pseudo-instruction to generate an undefined instruction, which will cause an exception if the processor attempts to execute it.
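With GNU as, the permanently undefined encoding is spelled udf (older assemblers may need a raw .inst or .word instead), e.g.:
__asm__ __volatile__ ("udf #0");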

PPC breakpoints

How is a breakpoint implemented on PPC (on OS X, to be specific)?
For example, on x86 it's typically done with the INT 3 instruction (0xCC) -- is there an instruction comparable to this for ppc? Or is there some other way they're set/implemented?
With gdb and a function that hexdumps itself, I get 0x7fe00008. This appears to be the tw instruction:
0b01111111111000000000000000001000
011111      primary opcode 31
11111       TO field: trap if lt, gt, eq, logically lt, logically gt
00000       rA
00000       rB
0000000100  extended opcode 4 (tw)
0           reserved
i.e. compare r0 to r0 and trap on any result.
The GDB disassembly is simply the extended mnemonic trap
EDIT: I'm using "GNU gdb 6.3.50-20050815 (Apple version gdb-696) (Sat Oct 20 18:20:28 GMT 2007)"
EDIT 2: It's also possible that conditional breakpoints will use other forms of tw or twi if the required values are already in registers and the debugger doesn't need to keep track of the hit count.
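For completeness, GCC for PowerPC will assemble the extended mnemonic directly; a hedged one-liner that plants the same 0x7fe00008 word GDB uses:
__asm__ __volatile__ ("trap");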
Besides software breakpoints, PPC also supports hardware breakpoints, implemented via the IABR (and possibly IABR2, depending on the core version) registers. These are instruction breakpoints, but there are also data breakpoints (implemented with DABR and, possibly, DABR2). If your core supports two sets of hardware breakpoint registers (i.e. IABR2 and DABR2 are present), you can do more than just trigger on a specific address: you can specify a whole contiguous range of addresses as a breakpoint target. For data breakpoints, you can also specify whether they trigger on write, read, or any access.
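As a rough, heavily hedged sketch of arming an instruction hardware breakpoint (supervisor-only; SPR 1010 is the IABR on 7xx/74xx-class cores, and the SPR number and the meaning of the low enable bits vary by core, so check your core's manual):

static inline void set_iabr(unsigned long addr)
{
    /* low bits assumed to be the enable/translation bits on this core */
    __asm__ __volatile__ ("mtspr 1010, %0" :: "r" (addr | 0x3));
}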
Best guess is a 'tw' or 'twi' instruction.
You could dig into the source code of PPC GDB; OS X probably uses the same functionality as its FreeBSD roots.
PowerPC architectures use "traps".
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.aixassem/doc/alangref/twi.htm
Instruction breakpoints are typically realised with the TRAP instruction or with the IABR debug hardware register.
Example implementations:
ArchLinux, Apple, Wii and Wii U.
I'm told by a reliable (but currently inebriated, so take it with a grain of salt) source that it's a zero instruction which is illegal and causes some sort of system trap.
EDIT: Made into community wiki in case my friend is so drunk that he's talking absolute rubbish :-)
