Grammar for ARM assembly syntax?

I am attempting to do a source code transformation of ARM assembly (specifically, ARMv8-A), and I need a formal grammar of this. Ideally of ARMv8-A for ANTLR, but a grammar for any version of ARM with any format would help.
Strangely, I haven't been able to find one. Is there really no formal grammar for any version of ARM?

TL;DR: There is no formal grammar; the formal specification is the binary encoding. All CPU manufacturers document the binary encoding, and the specific assembler syntax/grammar is left to tool creators to invent.
There are accepted forms of the encoding, for instance register-direct operands and the different operand checks that need to be made. Some instructions have multiple encodings (they map to multiple binary values that compute similar results), and this can be used in forensics to see what tools were used. Different assemblers (ARM's own assembler versus GNU ARM as) often support slightly different notations for operands. To work around this, people often use the C pre-processor to translate generic assembly to a target assembler using conditional substitution.
There is not usually a formal grammar for an assembler because the language is so simple. It is like a goto program: loop constructs do not exist in assembly, so everything is very linear. An assembler might have a 'lex' (or syntax) type file of accepted mnemonics and pseudo-instructions, but instructions can be put in any order without the assembler complaining (although the program may crash due to garbage register values).
The general documentation is just a mnemonic with a binary encoding. The encoding is simple because the hardware (CPU) just examines certain bits to determine the form. For instance, ALU instructions:
ADD - add and don't set condition codes
SUB - subtract and don't set condition codes
ADDS - add and set condition codes
SUBS - subtract and set condition codes
They have two source registers (R0-R15 are candidates) and one destination (R0-R15), so that typically takes 12 bits of the 32-bit instruction. ARM also has a 'conditional execution' field that uses four bits. The assembler just needs to select the leading portion (the instruction type) and stuff the remaining bits with the operands. It is the same for all architectures. The complication comes with labels, where you need to compute offsets from one instruction to another; this is the main job of an assembler. Otherwise it is a one-to-one mapping/translation, and there is only a very limited grammar.
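To make the bit-stuffing concrete, here is a minimal Python sketch, assuming the classic 32-bit (A32) data-processing layout described above (4-bit condition field, opcode, S bit, and register fields); it is illustrative only, covers just the register form of ADD/SUB, and is not the A64 encoding of ARMv8-A.
# Simplified sketch of "selecting the leading portion and stuffing the
# remaining bits" for an A32 data-processing instruction:
#   cond | 00 | 0 | opcode | S | Rn | Rd | operand2
# Only ADD/SUB in register form, no shifts; not a complete encoder.
OPCODES = {"ADD": 0b0100, "SUB": 0b0010}   # data-processing opcode field
COND_AL = 0b1110                           # "always" condition

def encode_dp_reg(mnemonic, rd, rn, rm, set_flags=False, cond=COND_AL):
    # Encode e.g. ADD Rd, Rn, Rm (register form, no shift).
    op = OPCODES[mnemonic.upper()]
    word = (cond << 28) | (op << 21) | (int(set_flags) << 20)
    word |= (rn << 16) | (rd << 12) | rm   # Rm sits in the low operand2 bits
    return word

# ADDS R0, R1, R2  ->  0xe0910002
print(hex(encode_dp_reg("ADD", rd=0, rn=1, rm=2, set_flags=True)))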
Related: Pre-processor as an assembler

Is this an intermediate representation?

I'm looking into how the V8 compiler works. I read an article which states that source code is tokenized, parsed, an AST is constructed, and then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775).
Is this bytecode an intermediate representation?
Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.
Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).
"Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.
"Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. parsed version of the source text) and its output (usually machine code or byte code, but it could be anything, as in a source-to-source compiler).
So in a traditional setup, you have:
source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code
where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs; for example V8's earlier optimizing compiler ("Crankshaft") had two: high-level IR "Hydrogen" and low-level IR "Lithium", whereas V8's current optimizing compiler ("Turbofan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.
In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:
source --(parser)--> AST --(bytecode generator)--> bytecode
This bytecode is primarily used as input for V8's interpreter.
As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:
bytecode --(interpreter)--> program execution
bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
Now here's the part where your perspective comes in: since the bytecode in V8 is not only used as input for the interpreter, but also as input for the optimizing compiler and in that sense as a step on the way from source text to machine code, if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term though. When a compiler theory textbook talks about "intermediate representation", it does not mean "bytecode".

How can a compiler be cross-platform (hardware)?

I just realized that binary compilers convert source code to the binary of the destination platform. Kind of obvious... but if a compiler works that way, then how can the same compiler be used for different systems like x86, ARM, MIPS, etc.?
Aren't they supposed to "know" the machine language of the hardware platform to be able to build the binary? Does a compiler (like gcc) know the machine language of every single platform that it supports?
How is that system possible, and how can a compiler be optimized for that many platforms at the same time?
Yes, they have to "know" the machine language of every single platform they support. This is required to generate machine code. However, compilation is a multi-step process, and usually the first steps are common to most architectures.
Taken from Wikipedia:
Structure of a compiler
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler requires determining the correctness of the syntax of programs, generating correct and efficient object code, run-time organization, and formatting output according to assembler and/or linker conventions. A compiler consists of three main parts: the frontend, the middle-end, and the backend.
The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a useful way. Type checking is also performed by collecting type information. The frontend then generates an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context. The middle-end generates another IR for the following backend. Most optimization efforts are focused on this part.
The back end is responsible for translating the IR from the middle-end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the program variables where possible. The backend utilizes the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most algorithms for optimization are in NP, heuristic techniques are well-developed.
More in this article, which describes the structure of a compiler, and in this one, which deals with cross compilers.
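To make the frontend/middle-end/backend split concrete, here is a toy Python sketch (entirely made up, not GCC's or LLVM's real IR or API): the same tiny "IR" is lowered through two hypothetical per-target tables, which is the only place the machine-specific instructions appear. The register names and mnemonics are illustrative pseudo-assembly, not exact syntax.
# Made-up three-address "IR" for: x = x + y
ir = [("load",  "r1", "x"),
      ("load",  "r2", "y"),
      ("add",   "r1", "r1", "r2"),
      ("store", "x",  "r1")]

# Hypothetical per-target instruction-selection tables (backend knowledge).
X86_LIKE = {"load": "mov {0}, [{1}]", "add": "add {0}, {2}",
            "store": "mov [{0}], {1}"}
ARM_LIKE = {"load": "ldr {0}, {1}",   "add": "add {0}, {1}, {2}",
            "store": "str {1}, {0}"}

def lower(ir, patterns):
    # Map each IR operation to a target instruction via the per-target table.
    return [patterns[op].format(*args) for op, *args in ir]

print(lower(ir, X86_LIKE))  # ['mov r1, [x]', 'mov r2, [y]', 'add r1, r2', 'mov [x], r1']
print(lower(ir, ARM_LIKE))  # ['ldr r1, x', 'ldr r2, y', 'add r1, r1, r2', 'str r1, x']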
The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross-hardware compilers emit an "intermediate representation" of the code, which is hardware agnostic, and then it is customized for the target via the native toolchain.
Yes, it is possible; it's called a cross compiler. Compilers usually first generate object code, which is not directly runnable on the current machine but can be migrated to the destination machine. There, the object code is processed again and linked with the external libraries of the target machine.
TL;DR: Yes, the compiler knows the target's machine language, but you can compile on different hardware.
I recommend reading the linked articles for more information.
Every platform has its own toolchain; a toolchain includes gcc, gdb, ld, nm, etc.
Let's take the specific example of gcc. The GCC source code has many layers, including architecture-dependent and architecture-independent parts. The architecture-dependent part contains procedures to handle architecture-specific things like the stack layout, function calls, and floating-point operations. We need to build gcc from source as a cross compiler for a specific architecture like ARM. You can see the steps here for reference: http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain.
This architecture-dependent part is responsible for handling machine-language operations.

Meaning of 'debugging symbols' with respect to GDB

Wikipedia says:
A debug symbol is information that expresses which programming-language constructs generated a specific piece of machine code in a given executable module.
Any examples of what kind of programming-language constructs are used for the purpose?
What is the meaning of "constructs" in this context? Functions?
The programming-language constructs referred to are things like if statements, while loops, assignment statements, etc.
Debug symbols are usually stored in files that map the addresses of executable chunks of machine code to the original source file and line number they came from. This is what allows you to do things like put a breakpoint on an if statement and have the machine stop when execution reaches that particular bit of machine code.
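As a toy illustration of that mapping (the table below is invented; real DWARF or PDB line tables are generated by the compiler and are much richer), a debugger conceptually does a lookup like this:
import bisect

# Invented address -> (file, line) table, sorted by the start address of
# each code range. Real debug info also describes variables, types, scopes.
line_table = [
    (0x1000, "main.c", 3),   # e.g. the 'int x = 7;' statement
    (0x1008, "main.c", 4),   # the 'if (x > 5)' statement
    (0x1014, "main.c", 5),
]

def source_location(pc):
    # Return the (file, line) whose code range contains address pc.
    starts = [addr for addr, _, _ in line_table]
    i = bisect.bisect_right(starts, pc) - 1
    return line_table[i][1:] if i >= 0 else None

print(source_location(0x100c))   # -> ('main.c', 4)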

Building a custom machine code from the ground up

I have recently begun working with logic level design as an amateur hobbyist but have now found myself running up against software, where I am much less competent. I have completed designing a custom 4 bit CPU in Logisim loosely based on the paper "A Very Simple Microprocessor" by Etienne Sicard. Now that it does the very limited functions that I've built into it (addition, logical AND, OR, and XOR) without any more detectable bugs (crossing fingers) I am running into the problem of writing programs for it. Logisim has the functionality of importing a script of Hex numbers into a RAM or ROM module so I can write programs for it using my own microinstruction code, but where do I start? I'm quite literally at the most basic possible level of software design and don't really know where to go from here. Any good suggestions on resources for learning about this low level of programming or suggestions on what I should try from here? Thanks much in advance, I know this probably isn't the most directly applicable question ever asked on this forum.
I'm not aware of the paper you mention. But if you have designed your own custom CPU and want to write software for it, you have two choices: a) write it in machine code, or b) write your own assembler.
Obviously I'd go with b. This will require that you shift gear a bit and do some high-level programming. What you are aiming to write is an assembler program that runs on a PC, and converts some simple assembly language into your custom machine code. The assembler itself will be a high-level program and as such, I would recommend writing it in a high-level programming language that is good at both string manipulation and binary manipulation. I would recommend Python.
You basically want your assembler to be able to read in a text file like this:
mov a, 7
foo:
mov b, 20
add a, b
cmp a, b
jg foo
(I just made this program up; it's nonsense.)
And convert each line of code into the binary pattern for that instruction, outputting a binary file (or perhaps a hex file, since you said your microcontroller can read in hex values). From there, you will be able to load the program up onto the CPU.
So, I suggest you:
Come up with (on paper) an assembly language that is a simple written representation for each of the opcodes your machine supports (you may have already done this),
Learn simple Python,
Write a Python script that reads one line at a time (sys.stdin.readline()), figures out which opcode it is and what values it takes, and outputs the corresponding machine code to stdout (a rough sketch follows at the end of this answer).
Write some assembly code in your assembly language that will run on your CPU.
Sounds like a fun project.
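As a rough illustration of step 3, here is a minimal sketch using an invented 16-bit format and invented opcode values; it is not a complete assembler, and labels and jumps such as "jg foo" would need a second pass for label resolution.
import sys

# Invented format for illustration: 4-bit opcode | 4-bit register | 8-bit operand.
# A real format would also need a bit to tell register operands from immediates.
# Substitute the encoding of your own CPU.
OPCODES = {"mov": 0x1, "add": 0x2, "cmp": 0x3}
REGS    = {"a": 0x0, "b": 0x1}

def assemble_line(line):
    mnemonic, dst, src = line.replace(",", " ").split()
    if mnemonic not in OPCODES:
        raise SyntaxError("unknown opcode: " + mnemonic)
    operand = REGS[src] if src in REGS else int(src, 0) & 0xFF
    return (OPCODES[mnemonic] << 12) | (REGS[dst] << 8) | operand

for line in sys.stdin:
    line = line.split(";")[0].strip()      # assumes ';' starts a comment
    if line and not line.endswith(":"):    # labels are ignored in this sketch
        print(f"{assemble_line(line):04x}")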
I have done something similar that you might find interesting: I have also created my own CPU design from scratch. It is an 8-bit, multi-cycle RISC CPU based on a Harvard architecture with variable-length instructions.
I started in Logisim, then coded everything in Verilog, and I have synthesized it in an FPGA.
To answer your question, I wrote a simple and rudimentary assembler that translates a program (instructions, i.e. mnemonics + data) into the corresponding machine language, which can then be loaded into the PROG memory. I wrote it as a shell script using awk, which is what I was comfortable with.
I basically do two passes: the first translates mnemonics into their corresponding opcodes and translates data (operands) into hex, while keeping track of all the label addresses; the second pass replaces each label with its corresponding address. (Labels and addresses are for jumps.)
You can see all the project, including the assembler, documented here: https://github.com/adumont/hrm-cpu
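For comparison, here is a minimal illustration of that two-pass idea (my own toy example, in Python rather than awk, and not taken from the linked project):
source = ["        mov a, 7",
          "loop:   add a, b",
          "        jmp loop"]

def first_pass(lines):
    # Record each label's instruction index while collecting the instructions.
    labels, instructions = {}, []
    for line in lines:
        line = line.strip()
        if ":" in line:                        # "label: instruction"
            label, line = line.split(":", 1)
            labels[label.strip()] = len(instructions)
            line = line.strip()
        if line:
            instructions.append(line)
    return labels, instructions

def second_pass(labels, instructions):
    # Replace label operands with their addresses; a real assembler would
    # tokenize operands rather than doing a naive string replacement.
    out = []
    for instr in instructions:
        for name, addr in labels.items():
            instr = instr.replace(name, str(addr))
        out.append(instr)                      # then encode each line to bits
    return out

labels, instrs = first_pass(source)
print(second_pass(labels, instrs))   # ['mov a, 7', 'add a, b', 'jmp 1']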
Because your instruction set is so small, and based on the thread from mguica's answer, I would say the next step is to continue and/or fully test your instruction set. Do you have flags? Do you have branch instructions? For now, just hand-generate the machine code. Flags are tricky, in particular the overflow (V) bit: you have to examine the carry in and carry out on the most-significant-bit adder to get it right. Because the instruction set is small enough, you can try the various combinations of back-to-back instructions: AND followed by OR, AND followed by XOR, AND followed by ADD, OR followed by AND, OR followed by XOR, etc., and mix in the branches. Back to flags: if XOR and OR, for example, do not touch carry and overflow, then make sure you see carry and overflow being zero and not touched by the logical instructions, and being one and not touched, and also independently show that carry and overflow are separate (one on, one off) and not touched by the logicals, etc. Make sure each conditional branch only operates on its one condition; lead into the various conditional branches with flag bits that should be ignored, in both states, ensuring that the conditional branch ignores them. Also verify that if the conditional branch is not supposed to modify the flags, it doesn't, and likewise that if the condition doesn't cause a branch, the condition flags are not touched.
I like to use randomization, but it may be more work than you are after. I like to independently develop a software simulator of the instruction set, which I find easier to work with than the logic, and sometimes easier to use in batch testing. You can then generate a randomized short list of instructions, varying the instructions and the registers; naturally, test the tester by hand-computing some of the results, both the state of the registers after the test completes and the state of the flag bits. Then make that randomized list longer. At some point you can take a long instruction list, run it on the logic simulator, and see whether the logic comes up with the same register results and flag bits as the instruction-set simulator; if they vary, figure out why. If they do not, try another random sequence, and another. Filling the registers with prime numbers before starting a test is a very good idea.
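A rough sketch of that idea, assuming an invented two-register, 4-bit toy instruction set (flags are omitted for brevity, though a real golden model would track them too):
import random

# Tiny made-up "golden" instruction-set simulator plus a random program
# generator. The real comparison would run the same program through your
# Logisim/HDL simulation and diff the final register (and flag) state.
REGS = ["a", "b"]

def simulate(program, state=None):
    state = dict.fromkeys(REGS, 0) if state is None else state
    for op, dst, src in program:
        val = state[src] if src in state else src      # register or literal
        if op == "mov":   state[dst] = val & 0xF        # 4-bit registers
        elif op == "add": state[dst] = (state[dst] + val) & 0xF
        elif op == "xor": state[dst] = state[dst] ^ val
    return state

def random_program(n):
    ops = ["mov", "add", "xor"]
    return [(random.choice(ops), random.choice(REGS),
             random.choice(REGS + [random.randrange(16)])) for _ in range(n)]

prog = random_program(8)
print(prog)
print(simulate(prog))   # compare against the hardware simulation's registers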
Back to individual instruction testing and flags: go through all the corner cases (0xFFFF + 0x0000, 0xFFFF + 1, things like that), placing operands and results just to either side of, and right on, the points where a flag changes: one count away from where the flag changes, at the point where it changes, and just the other side of that. For the logicals, for example, if they use the zero flag, then use various data patterns that produce results on either side of and at zero: 0x0000, 0xFFFF, 0xFFFE, 0x0001, 0x0002, etc. Probably a walking-ones result as well: 0x0001, 0x0002, 0x0004, etc.
Hopefully I understood your question and have not pointed out the obvious or what you have already done thus far.

Should I write a Direct3D Shader Model language compiler using flex/yacc?

I am going to create a compiler for Direct3D's Shader Model language. The compiler's target platform and development environment are Windows/VC++.
For those who are not familiar with the Shader Model Language, here are examples of instructions which the language consists of (some of the instructions are a bit outdated, but the syntax is basically the same as the version I will be using).
Here
And here
I am considering flex/yacc as the framework for developing the compiler. Would these be suitable for the job? Is there any better framework for developing in native C++?
In my opinion, a normal lexer and/or parser generator usually won't help much in writing an assembler. They're mostly helpful in dealing with relatively complex grammars, but in the case of an assembler, the "grammar" is usually so trivial that such a generator is more hindrance than help.
A typical assembler is mostly table driven: you start by creating a table of defined op-codes and the characteristics of the instruction each will generate (e.g. the number and types of registers that must be specified for it). You typically have a (smaller; in the case of shaders, probably much smaller) table defining how to encode addressing modes and such.
Most of the assembler works by consulting that table -- i.e. it reads something from input, and attempts to look it up in the table. If it's not present, it gives an error message saying it's an unknown opcode. If it's found, it gets information from the table about the number of operands associated with that op-code. It attempts to read that many operands. If it can't, it gives an error saying something's wrong with the instruction. If it can, it encodes the instruction, and starts over.
There are a few places it has to handle a bit more than that, of course. Where/when you define something like a label, it has to record the name and position of that label in a symbol table. When it encounters something like a branch to that address, it has to look up the target and encode its address appropriately.
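In sketch form, that loop looks something like this (in Python for brevity; the mnemonics, operand counts, and encodings below are placeholders, not real Shader Model opcodes):
# Table-driven skeleton: mnemonic -> (number of operands, base encoding).
OPCODES = {
    "mov": (2, 0x01),
    "add": (3, 0x02),
    "dp3": (3, 0x03),
}

def assemble(tokens):
    out, i = [], 0
    while i < len(tokens):
        mnemonic = tokens[i]
        if mnemonic not in OPCODES:
            raise SyntaxError("unknown opcode: " + mnemonic)
        count, base = OPCODES[mnemonic]
        operands = tokens[i + 1 : i + 1 + count]
        if len(operands) != count:
            raise SyntaxError(mnemonic + " expects " + str(count) + " operands")
        out.append((base, operands))           # real code would encode bits here
        i += 1 + count
    return out

print(assemble(["mov", "r0", "c0", "add", "r1", "r0", "c1"]))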
Only when/if you decide to support macros do you depart much from that basic model. Depending on how elaborate you get with them, it might be worthwhile to use a parser generator and such for a macro expansion facility. Then again, given that shaders are mostly pretty small, macros aren't likely to be a very high priority for such an assembler.
Edit: rereading it, I should probably clarify/correct one point. The use for a parser generator isn't so much when the grammar itself becomes complex, as when the grammar allows for statements that are complex. Consider a really trivial grammar:
expression := expression '+' value
| expression '-' value
| value
Even though this allows only addition and subtraction, it still defines statements that are arbitrarily complex (or at least arbitrarily long strings of values being added or subtracted). Of course, for even a fairly trivial real language, we'll normally have multiplication, division, function calls, etc.
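For instance, a hand-written evaluator for just that toy grammar already has to loop over an arbitrarily long chain of operators (a Python sketch, with single-character tokens to keep it short):
def evaluate(tokens):
    # expression := value (('+' | '-') value)*
    total = int(tokens[0])
    i = 1
    while i < len(tokens):
        op, value = tokens[i], int(tokens[i + 1])
        total = total + value if op == "+" else total - value
        i += 2
    return total

print(evaluate(list("1+2-3+4")))   # -> 4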
This is considerably different from a typical assembly language, where each instruction has a fixed format. For example, an addition or subtraction operation has exactly two source operands and one destination operand.
