Is tooling available to 'assemble' WebAssembly to x86-64 native code?

Is tooling available to 'assemble' WebAssembly to x86-64 native code? - compilation

I am guessing that a Wasm binary is usually JIT-compiled to native code, but given a Wasm source, is there a tool to see the actual generated x86-64 machine code?
Or asked in a different way, is there a tool that consumes Wasm and outputs native code?

The online WasmExplorer compiles C code to both WebAssembly and FireFox x86, using the SpiderMonkey compiler. Given the following simple function:
int testFunction(int* input, int length) {
int sum = 0;
for (int i = 0; i < length; ++i) {
sum += input[i];
}
return sum;
}
Here is the x86 output:
wasm-function[0]:
sub rsp, 8 ; 0x000000 48 83 ec 08
cmp esi, 1 ; 0x000004 83 fe 01
jge 0x14 ; 0x000007 0f 8d 07 00 00 00
0x00000d:
xor eax, eax ; 0x00000d 33 c0
jmp 0x26 ; 0x00000f e9 12 00 00 00
0x000014:
xor eax, eax ; 0x000014 33 c0
0x000016: ; 0x000016 from: [0x000024]
mov ecx, dword ptr [r15 + rdi] ; 0x000016 41 8b 0c 3f
add eax, ecx ; 0x00001a 03 c1
add edi, 4 ; 0x00001c 83 c7 04
add esi, -1 ; 0x00001f 83 c6 ff
test esi, esi ; 0x000022 85 f6
jne 0x16 ; 0x000024 75 f0
0x000026:
nop ; 0x000026 66 90
add rsp, 8 ; 0x000028 48 83 c4 08
ret
You can view this example online.
WasmExplorer compiles code into wasm / x86 via a service - you can see the scripts that are run on Github - you should be able to use these to construct a command-line tool yourself.

Related

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.

First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq

This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.

As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.

Understanding non-contiguous stack addressing in gcc x86 disassembly

I am using this C source code to compile with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0.
/////////////////////////////////////////////////////////////////////////////////////////////
// Name: megabeets_0x1.c
// Description: Simple crackme intended to teach radare2 framework capabilities.
// Compilation: $ gcc megabeets_0x1.c -o megabeets_0x1 -fno-stack-protector -m32 -z execstac
//
// Author: Itay Cohen (#megabeets)
// Website: https://www.megabeets.net
/////////////////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>
void rot13 (char *s) {
if (s == NULL)
return;
int i;
for (i = 0; s[i]; i++) {
if (s[i] >= 'a' && s[i] <= 'm') { s[i] += 13; continue; }
if (s[i] >= 'A' && s[i] <= 'M') { s[i] += 13; continue; }
if (s[i] >= 'n' && s[i] <= 'z') { s[i] -= 13; continue; }
if (s[i] >= 'N' && s[i] <= 'Z') { s[i] -= 13; continue; }
}
}
int beet(char *name)
{
char buf[128];
strcpy(buf, name);
char string[] = "Megabeets";
rot13(string);
return !strcmp(buf, string);
}
int main(int argc, char *argv[])
{
printf("\n .:: Megabeets ::.\n");
printf("Think you can make it?\n");
if (argc >= 2 && beet(argv[1]))
{
printf("Success!\n\n");
}
else
printf("Nop, Wrong argument.\n\n");
return 0;
}
gcc command used
gcc megabeets_0x1.c -o test32 -fno-stack-protector -z execstack -m32 -no-pie -fno-pic
The disassembly of function beet generated using objdump looks like the following:
080485a8 <beet>:
80485a8: 55 push ebp
80485a9: 89 e5 mov ebp,esp
80485ab: 81 ec 98 00 00 00 sub esp,0x98
80485b1: 83 ec 08 sub esp,0x8
80485b4: ff 75 08 push DWORD PTR [ebp+0x8]
80485b7: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
80485bd: 50 push eax
80485be: e8 6d fd ff ff call 8048330 <strcpy#plt>
80485c3: 83 c4 10 add esp,0x10
80485c6: c7 85 6e ff ff ff 4d mov DWORD PTR [ebp-0x92],0x6167654d
80485cd: 65 67 61
80485d0: c7 85 72 ff ff ff 62 mov DWORD PTR [ebp-0x8e],0x74656562
80485d7: 65 65 74
80485da: 66 c7 85 76 ff ff ff mov WORD PTR [ebp-0x8a],0x73
80485e1: 73 00
80485e3: 83 ec 0c sub esp,0xc
80485e6: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485ec: 50 push eax
80485ed: e8 94 fe ff ff call 8048486 <rot13>
80485f2: 83 c4 10 add esp,0x10
80485f5: 83 ec 08 sub esp,0x8
80485f8: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485fe: 50 push eax
80485ff: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
8048605: 50 push eax
8048606: e8 15 fd ff ff call 8048320 <strcmp#plt>
804860b: 83 c4 10 add esp,0x10
804860e: 85 c0 test eax,eax
8048610: 0f 94 c0 sete al
8048613: 0f b6 c0 movzx eax,al
8048616: c9 leave
8048617: c3 ret
I have few doubts regarding this disassembly,
After pushing ebp and moving esp to ebp, stack pointer is decreased by 0x98 first time, then by 0x8, totalling to 0xA0 which results stack frame aligned to 16 bytes. Why didn't compiler do a direct subtraction of 0xA0 from esp instead of 2 subsequent subtraction?
As can be seen from the C code, variable buf in function beet is 128 bytes. But in this disassembly buf is pointed by ebp-0x88 which means 136 bytes for buffer. Why 136 bytes allocated instead of 128 bytes?
Before calling functions like strcpy or rot13, random number of bytes first subtracted from esp before calling and after execution completion of these functions another random number of bytes is added to esp(which I guess to clear the arguments sent to those functions).
Example- Before calling rot13, 0xc is subtracted from esp, after completion 0x10 added instead of 0xc.
So, these random shifting of esp and pushing data results non-contiguous data, resulting in lower utilization of stack memory. Is there any particular reason behind this behaviour ?
After searching on google or stackoverflow I couldn't find any answer to these doubts.
Thank you
NOTE:
GCC code optimization results almost same disassembly.

Subtracting 0x98 from the stack leaves it 16-byte aligned. The additional 8 bytes is to prepare for pushing the parameters to strcpy, so that the stack is 16-byte aligned again before the call.
It does allocate 128 bytes for buf. The additional bytes between buf and ebp are either for alignment or for compiler temporaries or some other purpose of the compiler. Perhaps there is space for the return value here. In any case, the compiler doesn’t end up needing to use the space. If you enable optimization, it probably wouldn’t be there.
As in #1, the stack pointer is adjusted before pushing the parameters for each call so that the stack is 16-byte aligned before the call.

gcc likely() unlikely() macros and assembly code

I'm trying to see how gcc's likely() and unlikely() branch prediction macros has effect on assembly code. In the following piece of code I don't see any difference in the generated assembly code regardless of which macro i use. Any pointers on what's happening?
0 int main() {
1 volatile int x;
2 unlikely(x)?x++:x--;
3 }
Asm code:
0 0000000000000014 <main>:
1 int main() {
2 14: 55 push rbp
3 15: 48 89 e5 mov rbp,rsp
4 volatile int x;
5 likely(x)?x++:x--;
6 18: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
7 1b: 85 c0 test eax,eax
8 1d: 0f 95 c0 setne al
9 20: 0f b6 c0 movzx eax,al
10 23: 48 85 c0 test rax,rax
11 26: 74 0b je 33 <main+0x1f>
12 28: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
13 2b: 83 c0 01 add eax,0x1
14 2e: 89 45 fc mov DWORD PTR [rbp-0x4],eax
15 31: eb 09 jmp 3c <main+0x28>
16 33: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
17 36: 83 e8 01 sub eax,0x1
18 39: 89 45 fc mov DWORD PTR [rbp-0x4],eax
19 }
20 3c: 5d pop rbp
21 3d: c3 ret

It looks like you compiled without optimization. Basic block reordering is an optimization, so without it, __builtin_expect does not have this effect. With optimization, I observe that the sense of the branch is inverted when switching the expected result.
Note that whether this has any effect on current x86 processors is difficult to say.

How can I insert repeated NOP statements using Visual C++'s inline assembler?

Visual C++, using Microsoft's compiler, allows us to define inline assembly code using:
__asm {
nop
}
What I need is a macro that makes possible to multiply such instruction n times like:
ASM_EMIT_MULT(op, times)
for example:
ASM_EMIT_MULT(0x90, 160)
Is that possible? How could I do this?

With MASM, this is very simple to do. Part of the installation is a file named listing.inc (since everyone gets MASM as part of Visual Studio now, this will be located in your Visual Studio root directory/VC/include). This file defines a series of npad macros that take a single size argument and expand to an appropriate sequence of non-destructive "padding" opcodes. If you only need one byte of padding, you use the obvious nop instruction. But rather than using a long series of nops until you reach the desired length, Intel actually recommends other non-destructive opcodes of the appropriate length, as do other vendors. These pre-defined npad macros free you from having to memorize that table, not to mention making the code much more readable.
Unfortunately, inline assembly is not a full-featured assembler. There are a lot of things missing that you would expect to find in real assemblers like MASM. Macros (MACRO) and repeats (REPEAT/REPT) are among the things that are missing.
However, ALIGN directives are available in inline assembly. These will generate the required number of nops or other non-destructive opcodes to enforce alignment of the next instruction. Using this is drop-dead simple. Here is a very stupid example, where I've taken working code and peppered it with random aligns:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
align 4
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
align 16
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
align 8
}
}
This generates the following output (MSVC's assembly listings use npad x, where x is the number of bytes, just as you'd write it in MASM):
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// enforcing the "align 4"
00008 35 e0 ff ff 3f xor eax, 1073741792
0000d 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
00014 eb 0a 8d a4 24
00 00 00 00 8d
49 00 npad 12 ;// enforcing the "align 16"
00020 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00027 83 d8 00 sbb eax, 0
0002a 8d 9b 00 00 00
00 npad 6 ;// enforcing the "align 8"
00030 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
If you aren't actually wanting to enforce alignment, but just want to insert an arbitrary number of nops (perhaps as filler for later hot-patching?), then you can use C macros to simulate the effect:
#define NOP1 __asm { nop }
#define NOP2 NOP1 NOP1
#define NOP4 NOP2 NOP2
#define NOP8 NOP4 NOP4
#define NOP16 NOP8 NOP8
// ...
#define NOP64 NOP16 NOP16 NOP16 NOP16
// ...etc.
And then pepper your code as desired:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
NOP8
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
NOP4
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
to produce the following output:
PUBLIC CountDigits
_TEXT SEGMENT
_value$ = 8
CountDigits PROC
00000 8b 54 24 04 mov edx, DWORD PTR _value$[esp-4]
00004 0f bd c2 bsr eax, edx
00007 90 npad 1 ;// these are, of course, just good old NOPs
00008 90 npad 1
00009 90 npad 1
0000a 90 npad 1
0000b 90 npad 1
0000c 90 npad 1
0000d 90 npad 1
0000e 90 npad 1
0000f 35 e0 ff ff 3f xor eax, 1073741792
00014 8b 04 85 84 00
00 00 mov eax, DWORD PTR _kMaxDigits[eax*4+132]
0001b 90 npad 1
0001c 90 npad 1
0001d 90 npad 1
0001e 90 npad 1
0001f 3b 14 85 fc ff
ff ff cmp edx, DWORD PTR _kPowers[eax*4-4]
00026 83 d8 00 sbb eax, 0
00029 c2 04 00 ret 4
CountDigits ENDP
_TEXT ENDS
Or, even cooler, we can use a bit of template meta-programming magic to get the same effect in style. Just define the following template function and its specialization (important to prevent infinite recursion):
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
And use it like this:
unsigned long CountDigits(unsigned long value)
{
__asm
{
mov edx, DWORD PTR [value]
bsr eax, edx
}
npad<8>();
__asm
{
xor eax, 1073741792
mov eax, DWORD PTR [4 * eax + kMaxDigits+132]
}
npad<4>();
__asm
{
cmp edx, DWORD PTR [4 * eax + kPowers-4]
sbb eax, 0
}
}
That'll produce the desired output (exactly the same as the one just above) in all optimized builds—whether you optimize for size (/O1) or speed (/O2)—…but not in debugging builds. If you need it in debug builds, you'll have to resort to the C macros. :-(

Base on Cody Gray Answer and code example for metaprogramming using template recursion and inline or forceinline as stated on the code before
template <size_t N> __forceinline void npad()
{
npad<N-1>();
__asm { nop }
}
template <> __forceinline void npad<0>() { }
It won't work on visual studio, without setting some options and is not a guarantee it will work
Although __forceinline is a stronger indication to the compiler than
__inline, inlining is still performed at the compiler's discretion, but no heuristics are used to determine the benefits from inlining this function.
You can read more about this here https://learn.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-4-c4714?view=vs-2019

Why do we allocate 12 bytes for each variable?

In visual Studio 2010 Professional (x86, Windows 7):
... more
00DC1362 B9 39 00 00 00 mov ecx,39h
00DC1367 B8 CC CC CC CC mov eax,0CCCCCCCCh
00DC136C F3 AB rep stos dword ptr es:[edi]
20: int a = 3;
00DC136E C7 45 F8 03 00 00 00 mov dword ptr [ebp-8],3
21: int b = 10;
00DC1375 C7 45 EC 0A 00 00 00 mov dword ptr [ebp-14h],0Ah
22: int c;
23: c = a + b;
00DC137C 8B 45 F8 mov eax,dword ptr [ebp-8]
00DC137F 03 45 EC add eax,dword ptr [ebp-14h]
00DC1382 89 45 E0 mov dword ptr [ebp-20h],eax
24: return 0;
Notice how the relative addressing variable A and B are not aligned by word size of 4?
What is happening here?
Also, why do we skip $ebp - 8 ?
Turning off the optimization will show the ideal addressing scheme.
Can someone please explain the reason? Thanks.
The offset of each variable is 12 bytes. A -> B -> C
I made a mistake. I meant why do we skip the first 8 bytes.

You are looking at the code generated by the default Debug build setting. Particularly the /RTC option (enable run-time error checks). Filling the stack frame with 0xcccccccc helps diagnose uninitialized variables, the gaps around the variables help diagnose buffer overflow.
There isn't much point in looking at this code, you are not going to ship that. It is purely a Debug build artifact, only there to help you get the bugs out of the code. None of it remains in the Release build.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Is tooling available to 'assemble' WebAssembly to x86-64 native code? - compilation

I am guessing that a Wasm binary is usually JIT-compiled to native code, but given a Wasm source, is there a tool to see the actual generated x86-64 machine code? Or asked in a different way, is there a tool that consumes Wasm and outputs native code?

Related

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

Understanding non-contiguous stack addressing in gcc x86 disassembly

gcc likely() unlikely() macros and assembly code

How can I insert repeated NOP statements using Visual C++'s inline assembler?

Why do we allocate 12 bytes for each variable?

Categories

Resources