I would appreciate a brief explanation of how my assembler timing loop on a NUCLEO-H723ZG board appears to execute in a single CPU clock cycle. The two instructions used, a SUBS and a BNE, should consume three clock cycles while the loop is branching, so there is some magic afoot! I am using the GPIO BSRR to toggle an LED, and I need a timing-loop count of 275M to achieve approximately one flash per second.
For the Cortex-M0, M3 and M4 the cycle counts are included in the Technical Reference Manual (e.g. the Cortex-M4 TRM). For the M7 they are not published, but it sounds like you have measured the answer for yourself, so you do not need it to be in the manual in this case.
If your code is correct, then the processor is able to execute those two instructions in a single cycle.
This is not surprising. For example, the M4 can carry out a 16-bit data-processing instruction and an IT instruction in a single cycle.
You can disable this if you require deterministic (but worse) performance. See the DISFOLD bit in the Auxiliary Control Register.
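If you want to try toggling that bit, a minimal sketch under CMSIS might look like the following. The header name is the STM32H7 device header as an example, and you should confirm the DISFOLD mask name and bit position against your core_cm7.h and the Cortex-M7 TRM before relying on it.

/* Sketch only: disable instruction folding via the Auxiliary Control Register.
   Assumes a CMSIS device header that defines SCnSCB->ACTLR and the DISFOLD
   mask; check the bit position in your core's TRM. */
#include "stm32h7xx.h"

static void disable_folding(void)
{
    SCnSCB->ACTLR |= SCnSCB_ACTLR_DISFOLD_Msk;  /* more deterministic, but slower */
    __DSB();
    __ISB();                                    /* make sure the change has taken effect */
}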
I found two statements in the Cortex-M3 guide (the red book):
1. The Cortex-M3 supports both little-endian and big-endian operation.
2. After reset, the endianness cannot be changed dynamically.
So is it indirectly saying that the endianness setting should be changed in the reset handler? If so, how do I change the endianness, i.e. which register do I need to configure, and where should I configure it (in the reset handler or in an exception handler)?
I know it is not actually a good idea to change the endianness, but still, out of curiosity, I wanted to see whether the Cortex-M3 really supports both endiannesses or not.
The Cortex-M architecture can be configured to support either big-endian or little-endian operation.
However, a specific Cortex-M implementation can only support one endianness -- it's hard-wired into the silicon, and cannot be changed. Every implementation I'm aware of has chosen little-endian.
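You can also confirm at run time which endianness your particular part was built with: the ARMv7-M AIRCR register has a read-only ENDIANNESS bit. A minimal sketch, assuming a CMSIS device header (CMSIS spells the mask with a single N, SCB_AIRCR_ENDIANESS_Msk):

/* Sketch: returns non-zero on a big-endian implementation. The bit is
   hard-wired by the implementer and read-only, which matches the point above. */
#include "stm32f10x.h"   /* whatever CMSIS device header matches your Cortex-M3 part */

static int core_is_big_endian(void)
{
    return (SCB->AIRCR & SCB_AIRCR_ENDIANESS_Msk) != 0;
}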
You need to be reading the ARM documentation directly. The Technical Reference Manual touches on things like this. If you actually had the source to the Cortex-M3 when building it into a chip, then you would see the outer layers and/or configuration options that you can touch.
From the Cortex-M3 TRM:
SETEND always faults. A configuration pin selects Cortex-M3
endianness.
And then we have it in another hit:
The processor contains a configuration pin, BIGEND, that enables you
to select either the little-endian or BE-8 big-endian format. This
configuration pin is sampled on reset. You cannot change endianness
when out of reset.
Technically it would be possible to build a chip where you could choose: it could be designed with an external strap connected to BIGEND, it could be some fuse or other non-volatile setting that you change before popping reset on the ARM core, or there could be some other processor or logic that manages the booting of the ARM core, which you talk to or program before releasing reset on the ARM core.
In general it is a bad idea to go against the grain of an architecture's default endianness, and on ARM in particular, now that there are two flavours and the later one (BE-8) is more painful than BE-32. Granted, there are other toolchains than gcc, but even with those, the vast majority of the users, and the vast majority of the indirect testing, is in the native (little-endian) mode. I would even wonder how thoroughly the logic has been tested: does anyone outside ARM's design verification actually push that mode? Did they test it hard enough?
Have you tried actually building big-endian Cortex-M3 code? Since the Cortex-M uses a 16-bit instruction set (with Thumb-2 extensions), how does that affect BE-8? With BE-8 on a full-sized ARM running ARM instructions, the 32-bit data swaps but the 32-bit instructions do not. Perhaps this is in the TRM and I should read more, but does it work the same way on the Cortex-M: the 16-bit instructions do not swap but the data does? What about on a full-sized ARM running Thumb instructions? And does the toolchain match what the hardware expects?
BTW, what that implies is that there is a signal named BIGEND in the logic that you interface with when you are building a chip around the Cortex-M3. You can go into that logic and change the default setting for BIGEND (I assume they have provided one), or, as mentioned above, you can add logic in your chip to make it a runtime option rather than a compile-time one.
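If you want to experiment, a small data-endianness probe is enough to see what a big-endian build does to data in memory: compile it once normally and once with arm-none-eabi-gcc's -mbig-endian option (assuming your toolchain was built with big-endian multilibs) and compare. This is only a sketch; how you observe the result (debugger watch, UART, etc.) is up to your board code.

/* Sketch: look at the byte layout of a 32-bit value on the target.
   Little-endian data: the first byte is 0x44. BE-8 data: the first byte is 0x11. */
#include <stdint.h>

volatile uint8_t first_byte;        /* watch this in a debugger */

int data_is_big_endian(void)
{
    uint32_t probe = 0x11223344u;
    const uint8_t *p = (const uint8_t *)&probe;
    first_byte = p[0];
    return p[0] == 0x11;
}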
I have difficulty understanding how the instruction register and the counter register work in an AVR microcontroller such as the ATmega128P.
Is there a simple way to explain what they do?
I suppose you are interested in the difference between the Instruction Register (IR) and the Program Counter (PC).
Their working principles are pretty simple.
IR: Holds the instruction that is about to be, or is currently being, executed.
PC: Holds the address of the instruction that should be loaded into the IR from flash memory in the next clock cycle.
Note that AVRs normally use a 2-stage pipeline (fetch + execute), so the IR is double-buffered: within a single clock cycle the next instruction is fetched from flash while the current instruction is executed.
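To make the fetch/execute overlap concrete, here is a toy C model (this is not AVR code and not how the hardware is built; the flash contents are invented). The point is that while the instruction in the IR executes, the word addressed by the PC is already being fetched into the IR's second buffer:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t flash[] = { 0x1111, 0x2222, 0x3333, 0x4444 };  /* pretend program */
    uint16_t ir      = 0;         /* instruction register: executes this cycle  */
    uint16_t ir_next = flash[0];  /* fetch buffer: word fetched last cycle      */
    uint16_t pc      = 1;         /* program counter: address of the NEXT fetch */

    for (int cycle = 0; cycle < 4; cycle++) {
        ir = ir_next;                          /* previously fetched word executes now */
        if (pc < sizeof flash / sizeof flash[0])
            ir_next = flash[pc];               /* fetch runs in parallel with execute */
        pc++;
        printf("cycle %d: executing IR=0x%04X, PC now %u\n",
               cycle, (unsigned)ir, (unsigned)pc);
    }
    return 0;
}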
I wonder what happens behind the scenes when we use a reg variable in Verilog. We know that in Quartus, targeting any FPGA, we can declare and handle many registers. We also know that a recent 64-bit i7 CPU has only 16 general-purpose registers. Yet when we compile and simulate code that has more than, say, 50 registers, it works. Are these registers stored in some memory block? If so, why do we declare the variable as reg? I mean, when we define a variable as reg in Verilog, the registers should (am I wrong?) be handled in CPU registers, but the CPU has no more than 16 registers. I am confused.
Are these registers stored in some memory block?
No. Unless you ask for a special implementation, the Quartus tools build them from the flip-flops in the logic cells.
Some Altera devices also have UFM (user flash memory), which you can use to hold data:
https://www.altera.com/en_US/pdfs/literature/an/an489.pdf
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/ug/ug_alt_ufm.pdf
Sorry for my English.
I've written a virtual machine in C as a hobby project. This virtual machine executes code that's very similar to Intel syntax x86 assembly. The problem is that the registers this virtual machine uses are only registers in name. In my VM code, registers are used just like x86 registers, but the machine stores them in system memory. There are no performance improvements to using registers over system memory in VM code. (I thought that the locality alone would increase performance somewhat, but in practice, nothing has changed.)
When interpreting a program, this virtual machine stores arguments to instructions as pointers. This allows a virtual instruction to take a memory address, constant value, virtual register, or just about anything as an argument.
Since hardware registers don't have addresses, I can't think of a way to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type doesn't work, because I have to get a pointer to the virtual register to use it as an argument. Is there any way to make these virtual registers perform more like their native counterparts?
I'm perfectly comfortable delving into assembly if necessary. I'm aware that JIT compiling this VM code could allow me to utilize hardware registers, but I'd like to be able to use them with my interpreted code as well.
Machine registers don't have indexing support: you can't access the register with a runtime-specified "index", whatever that would mean, without code generation. Since you're likely decoding the register index from your instructions, the only way is to make a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... }). This is likely a bad idea since it increases the interpreter loop size too much, so it will introduce instruction cache thrashing.
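As a hedged illustration of that idea (the opcode names and two-register layout are invented): because each case names the virtual registers as ordinary local variables with constant "indices", the compiler is free to allocate them to hardware registers.

#include <stdint.h>
#include <stddef.h>

enum { OP_ADD_R0_R1, OP_SUB_R0_R1, OP_HALT };

long run(const uint8_t *code)
{
    long r0 = 0, r1 = 1;                       /* virtual registers as locals */
    for (size_t pc = 0;; pc++) {
        switch (code[pc]) {
        case OP_ADD_R0_R1: r0 += r1; break;    /* one case per register pair */
        case OP_SUB_R0_R1: r0 -= r1; break;
        case OP_HALT:      return r0;
        }
    }
}

The obvious cost, as noted above, is that the number of cases grows with every register combination you want to specialize, which is exactly what bloats the interpreter loop.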
If we're talking about x86, the additional problem is that the amount of general-purpose registers is pretty low; some of them will be used for bookkeeping (storing PC, storing your VM stack state, decoding instructions, etc.) - it's unlikely that you'll have more than one free register for the VM.
Even if register indexing support were available, it's unlikely it would give you a lot of performance. Commonly in interpreters the largest bottleneck is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx * 4 + ecx]), so you would not win much. It's worthwhile though to check the generated assembly - i.e. to make sure the 'register pool' address is stored in the register.
The best way to accelerate interpreters is JITting; even a simple JIT (i.e. without smart register allocation - basically just emitting the same code you would execute with the instruction loop and a switch statement, except the instruction decoding) can boost your performance 3x or more (these are actual results from a simple JITter on top of a Lua-like register-based VM). An interpreter is best kept as reference code (or for cold code to decrease JIT memory cost - the JIT generation cost is a non-issue for simple JITs).
Even if you could directly access hardware registers, the code you would have to wrap around the decision of whether to use a register or memory would make it that much slower.
To get performance you need to design for performance up front.
A few examples.
Prepare an x86 VM by setting up all the traps needed to catch the code leaving its virtual memory space, then execute the code directly: don't emulate, just branch to it and run. When the code reaches outside its memory/I/O space to talk to a device or anything else, trap that, emulate the device or whatever it was reaching for, then return control to the program. If the code is processor-bound it will run really fast; if it is I/O-bound it will be slow, but not as slow as emulating each instruction.
Static binary translation. Disassemble and translate the code before running it. For example, the instruction 0x34, 0x2E (xor al, 0x2E) would turn into C source in a .c file:
al ^= 0x2E;
of = 0;
cf = 0;
sf = al >> 7;   /* sign flag = top bit of the 8-bit result */
Ideally you perform tons of dead-code removal (if the next instruction modifies the flags as well, then don't modify them here, etc.) and let the optimizer in the compiler do the rest. You can get a performance gain this way over an emulator; how big a gain depends on how well you can optimize the code. Being a new program, it runs on the hardware, registers, memory and all, so the processor-bound code is slower than a VM. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the code, but that still has a cost and calls a simulated device anyway, so there are no savings there.
Dynamic translation is similar to static binary translation, but you do it at runtime. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually translated from x86 instructions into native Alpha instructions, so the next time around the Alpha instruction is executed directly instead of the x86 instruction being emulated. Each time through the code, the program executes faster.
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: the readability and maintainability of the code has been sacrificed for performance. When they were written that was important; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking the next opcode up in a table, and deciding not to emulate some or all of the flag calculation for an instruction, can give you a noticeable boost.
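A hedged sketch of that table-driven style in C (the CPU structure, the flag layout and the opcode values are invented for illustration; 0x34 just mirrors the XOR-immediate encoding used earlier):

#include <stdint.h>

typedef struct cpu { uint8_t a, flags; const uint8_t *mem; uint16_t pc; } cpu_t;
typedef void (*op_fn)(cpu_t *);

static void op_nop(cpu_t *c) { (void)c; }

static void op_xor_imm(cpu_t *c)
{
    c->a ^= c->mem[c->pc++];
    c->flags = (c->a == 0) ? 0x40 : 0x00;   /* lazy flags: only the Z bit here */
}

static const op_fn dispatch[256] = {
    [0x00] = op_nop,
    [0x34] = op_xor_imm,
    /* ... remaining opcodes ... */
};

void step(cpu_t *c)
{
    uint8_t opcode = c->mem[c->pc++];
    if (dispatch[opcode])
        dispatch[opcode](c);                /* table lookup instead of a giant switch */
}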
Bottom line: if you are interested in using the x86 hardware registers AX, BX, etc. to emulate the AX, BX, etc. registers while running a program, the only efficient way to do that is to actually execute the instructions, and not execute-and-trap as in single-stepping a debugger, but execute long strings of instructions while preventing them from leaving the VM space. There are different ways to do this, performance results will vary, and it does not mean it will be faster than a performance-efficient emulator. It also limits you to matching the host processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, in that you don't have to match the hardware to the program being run.
Transform your complex, register-based code before execution (ahead of time). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer a register-based solution, choose an "opcode" format that bundles as many instructions as possible (as a rule of thumb, up to four instructions can be bundled into a byte if a MISC-style design is chosen). This way, virtual register accesses are locally resolvable to physical register references for each static super-instruction (clang and gcc are able to perform such an optimization). As a side effect, the lower BTB misprediction rate results in far better performance, regardless of the specific register allocation.
The best threading techniques for C-based interpreters are direct threading (using the labels-as-values extension) and replicated switch threading (which stays within ANSI C).
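A sketch of direct threading using GCC/Clang's labels-as-values extension (this is not ISO C, and the opcode set and program are invented):

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *code)
{
    static void *dispatch[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    long stack[64];
    long *sp = stack;                        /* data stack, TOS at sp[-1] */
    const unsigned char *ip = code;

    goto *dispatch[*ip++];                   /* enter the threaded loop */

op_push1:  *sp++ = 1;                   goto *dispatch[*ip++];
op_add:    sp--; sp[-1] += sp[0];       goto *dispatch[*ip++];
op_print:  printf("%ld\n", sp[-1]);     goto *dispatch[*ip++];
op_halt:   return;
}

int main(void)
{
    unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                               /* prints 2 */
    return 0;
}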
So you're writing an x86 interpreter, which is bound to be between 1 and 3 powers of 10 slower than the actual hardware. On the real hardware, mov mem, foo is going to take a lot more time than mov reg, foo, while in your program mem[adr] = foo is going to take about as long as myRegVars[regnum] = foo (modulo caching). So are you expecting the same speed differential?
If you want to simulate the speed differential between registers and memory, you're going to have to do something like what Cachegrind does: keep a simulated clock, and whenever the interpreted program does a memory reference, add a big number to it.
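A hedged sketch of that bookkeeping (the cycle costs and helper names are made up; a serious model would also simulate the cache to decide which cost applies):

#include <stdint.h>

#define COST_REG 1       /* pretend a register access costs 1 cycle  */
#define COST_MEM 100     /* pretend a memory access costs 100 cycles */

static uint64_t sim_cycles;

long vm_read_reg(const long *regs, int n)
{
    sim_cycles += COST_REG;
    return regs[n];
}

long vm_read_mem(const long *mem, long addr)
{
    sim_cycles += COST_MEM;
    return mem[addr];
}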
Your VM seems to be too complicated for efficient interpretation. An obvious optimisation is to have a "microcode" VM with register load/store instructions, probably even a stack-based one, and to translate your high-level VM code into the simpler form before execution. Another useful optimisation depends on GCC's computed-goto (labels-as-values) extension; see the Objective Caml VM interpreter for an example of such a threaded VM implementation.
To answer the specific question you asked:
You could instruct your C compiler to leave a bunch of registers free for your use. Pointers into the first page of memory are usually not allowed, since that page is reserved for NULL-pointer checks, so you can abuse those low pointer values to mark registers. It helps if you have a few native registers to spare, so my example uses 64-bit mode to simulate 4 registers. It may very well be that the additional overhead of the switch slows down execution instead of speeding it up. Also see the other answers for general advice.
/* compile with gcc */

/* Pin four virtual registers to real x86-64 registers (GCC global register
   variables): r12-r15 stay reserved wherever these declarations are visible. */
register long r0 asm("r12");
register long r1 asm("r13");
register long r2 asm("r14");
register long r3 asm("r15");

/* Decode an argument: the low "pointer" values 1-4 name a pinned register,
   anything else is treated as a real address. */
static inline long get_argument(long* arg)
{
    unsigned long val = (unsigned long)arg;
    switch (val)
    {
        /* leave 0 for NULL pointer */
        case 1: return r0;
        case 2: return r1;
        case 3: return r2;
        case 4: return r3;
        default: return *arg;   /* ordinary memory operand */
    }
}