How to change endianess settings in cortex m3? - endianness

I found two statements in cortex m3 guide(red book)
1. Cortex m3 supports both Little as well as big endianess.
2. After reset endianess cannot be changed dynamically.
So indirectly it is telling change endianess settings in reset handler , is it so?
If yes then how to change endianess. Means which register I need to configure and where to configure ( in reset or in exception handler)
It is not actually good idea to change endianess
But still as a curiosity I wanted to see whether cortex m3 really supports to both endianess or not?

The Cortex-M architecture can be configured to support either big-endian or little-endian operation.
However, a specific Cortex-M implementation can only support one endianness -- it's hard-wired into the silicon, and cannot be changed. Every implementation I'm aware of has chosen little-endian.

You need to be reading the ARM documentation directly. The Technical Reference Manual touches on things like this. If you actually had the source to the cortex-m3 when building it into a chip then you would see the outer layers and or config options that you can touch.
From the cortex-m3 TRM
SETEND always faults. A configuration pin selects Cortex-M3
endianness.
And then we have it in another hit:
The processor contains a configuration pin, BIGEND, that enables you
to select either the little-endian or BE-8 big-endian format. This
configuration pin is sampled on reset. You cannot change endianness
when out of reset.
Technically it would be possible to build a chip where you could choose, it could be designed with an external strap connected to BIGEND, it could be some fuse or other non-volatile thing that you can touch and then pop reset on the ARM core, could have some other processor or logic that manages the booting of the ARM core and you talk to or program that before releasing reset on the ARM core.
In general it is a bad idea to go against the grain on default endianness for an architecture. ARM in particular now that there are two flavors and the latter (BE-8) being more painful (than BE-32). Granted there are other toolchains than gcc, but even with those the vast majority of the users, the vast majority of the indirect testing is in the native (little-endian) mode. Would even wonder how truly tested the logic is, does anyone outside ARMs design verification actually push that mode? Did they test it hard enough?
Have you tried actually building big endian cortex-m3 code? Since the cortex-m is a 16 bit instruction set (with thumb2 extensions) how does that affect BE-8. With BE-8 on a full sized ARM with ARM instructions the 32 bit data swaps but the 32 bit instructions do not. Perhaps this is in the TRM and I should read more, but does that work the same way on the cortex-m? The 16 bit instructions do not swap but the data does? What about on a full sized arm with thumb instructions? And does the toolchain match what the hardware expects?
BTW what that implies is there is a signal named BIGEND in the logic that you interface when you are building a chip around the cortex-m3 and you can go into that logic and change the default setting for BIGEND (I assume they have provided one) or as I have mentioned above you can add logic in your chip to make it a runtime option rather than compile time.

Related

AT32UC3B0512 project compiled as AT32UC3B0256 -> Consequences

I just figured out that I have compiled and programmed my AT32UC3B0512 project using the AT32UC3B0256 as target device.
My application seams to work without problems. Is that possible? What are the differences between AT32UC3B0512 and AT32UC3B0256 (beside flash and ram size)?
in most cases the program EEPROM is the only difference.
You use lower target then you have
so it just limits the size of your program more then you can use in real.
The functionality is not affected at all (tested for quite some time on L0,A0,and A3 series).
The only thing you have to be careful with UC3 chips are the pin incompatibility between package and series
for example TQFP is much different then BGA ...
also the same package with different pin count is incompatible
also you can not change UC3A0 for UC3A3 ...
the last 2/3 digits are only the EEPROM size
and mostly does not affect SW/HW compatibility
[Note]
#SergioFormiggini is right the AT32UC3B0256 does not have DAC
it is the first time I see a difference in HW config of a chip with different memory size only on Atmel chips
Unless they also changed the memory map and or GPIO mappings you should be fine

Is there performance advantage to ARM64

Recently 64-bit ARM mobiles started appearing. But is there any practical advantage to building an application 64-bit? Specifically considering application that does not have much use for the increased virtual address space¹, but would waste some space due to increased pointer size.
So does ARM64 have any other advantages than the larger address that would actually warrant building such application 64bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64 which does have other improvements besides extended virtual address space. I also recall that the situation is indeed specific to x86 and on some other platforms that went 64-bit like Sparc the usual approach was to only compile kernel and the applications that actually did use lot of memory as 64-bit and everything else as 32-bit.
¹The application is multi-platform and it still needs to be built for and run on devices with as little as 48MiB of memory. Does have some large data that it reads from external storage, but it never needs more than some megabytes of it at once.
I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 General-purpose registers gives compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/Store-Pair instructions makes it possible to load 128-bits of data with one instruction, and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-ordering possible.
The change in layout of NEON registers (D0 is still lower half of Q0, but D1 is now lower half of Q1 rather than upper half of Q0) makes more out-of-ordering possible.
64-bit pointers make pointer tagging possible.
CSEL enables all kind of crazy optimizations.
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Aquire/Store-Release semantics remove need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.

What are the ASM options for hardware-monitoring for ARM devices? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
I'm having trouble identifying what could possibly be a good instruction set that is granted to work on ARMv5 SoC and above, I'm also having some issues with the syntax since I'm used to a much simpler gcc asm syntax for X86 and the ARM one looks more complicated, but this is another topic ... I guess.
What I need to do is to check the feature of the SoC, like the frequency, the temperatures and the main features for computation purposes like Thumb or NEON support.
I know that ARM basically just designs and sells the blueprints for the CPU and the companies that buy the license are free to move bits around and make modifications, but I don't think that things are that bad in terms of entropy in the ARM world, or at least this kind of registers ( hardware monitoring and security features like the temperature ) are usually quite standard across the board, at least this is true in the X86 world where some CPUID instructions are probably complex, but you can check the main features of your CPU quite easily and most importantly you can code an application that works on both Intel and AMD with about the same code base.
What is a good set of register for this and if I pick 1 given register there are implication of the ASM syntax that I should use ?
Arm is simpler than x86, give it some time with an open mind and you will see that.
Intel uses different foundries and design teams and technologies so there is no consistency either with temperature at least every other family is a different design team and they often change technologies/size every year or two.
Most of the arm cores provide on the minimal side registers that tell you everything from what processor core it is, what version on up to tons of registers describing which instructions are supported or not.
Your arm is going to run colder and/or faster than an x86 if you could apples to apples.
Unless something has changed if you want to put ARM's name on your chip or associate with your chip you cant go in and muck with the logic. If you look in the TRM for a particular architecture you will see the strap options available. Boot from 0x00000000 or 0xFFFF0000, start big or little endian, etc.
All arm cores from armv4t (ARM7TDMI) to the present support thumb, it is the only universal ARM instruction set. One length neon and such are available in some of the cortex-m cores (cortex-m4) as well as different levels of support for thumb2 extensions. As well as low power consumption, mips to watts while keeping mips to mhz. The cortex-ms are microcontrollers so they will have options to turn items off or not turn them on to help conserve power. but you can also implement that yourself on your on chip peripherals.
The cortex-m's wont give you ARM instructions, only thumb with thumb2 extensions. All of the TRMs (Technical Reference Manuals) for the various arm cores are available at arms website (infocenter.arm.com) which will describe the features, strap options, axi/amba choices or sizes, etc.
Mips is your other primary choice for an soc core, I dont think your mips to watts will be as good. You can of course go with an open core as well the openrisc or altor or mpx or amber or others that are there, but it is all on you for performance, temperature, etc. (and floating point).
Not sure what you mean by hardware monitoring, but you have jtag and other typical debug options available. If it is temperature you are after you have to work with your cell library provider and see what is available for the target foundry/process and then implement that peripheral and connect it to the arm. or outside world or both.
Bottom line you need to do more research, the info you need is available from arm for free or at the cost of an email address.

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or and ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle because you can ldrex, and then change the bit(s), and strex it back, and if the strex fails it means another operation may have changed the reg and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on raspberry pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is not atomic any more...) Also, the ldrex/strex routines work fine on ordinary RAM. Pointer is 32-bit aligned.
So is this a limitation of the strex/ldrex mechanism? or a problem with the BCM2708 implementation, or the way the kernel has set it up? (or somethinge else- maybe I've mapped it wrong)?
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself. Like swp or test and set or whatever your instruction set supports (for arm it is swp and more recently strex/ldrex). You use these instructions on ram, some ram location agreed to by all the parties involved. The processes sharing the resource use the ram location to fight over control of the resource, whoever wins, gets to then actually address the resource. You would never use swp or ldrex/strex on a peripheral itself, that makes no sense. and I could see the memory system not giving you an exclusive okay response (EXOKAY) which is what you need to get out of the ldrex/strex infinite loop.
You have two basic methods for sharing a resource (well maybe more, but here are two). One is you use this shared memory location and each user of the shared resource, fights to win control over the memory location. When you win you then talk to the resource directly. When finished give up control over the shared memory location.
The other method is you have only one piece of software allowed to talk to the peripheral, nobody else is allowed to ever talk to the peripheral. Anyone wishing to have something done on the peripheral asks the one resource to do it for them. It is like everyone being able to share the soft drink fountain, vs the soft drink fountain is behind the counter and only the soft drink fountain employee is allowed to use the soft drink fountain. Then you need a scheme either have folks stand in line or have folks take a number and be called to have their drink filled. Along with the single resource talking to the peripheral you have to come up with a scheme, fifo for example, to essentially make the requests serial in nature.
These are both on the honor system. You expect nobody else to talk to the peripheral who is not supposed to talk to the peripheral, or who has not won the right to talk to the peripheral. If you are looking for hardware solutions to prevent folks from talking to it, well, use the mmu but now you need to manage the who won the lock and how do they get the mmu unblocked (without using the honor system) and re-blocked in a way that
Situations where you might have an interrupt handler and a foreground task sharing a resource, you have one or the other be the one that can touch the resource, and the other asks for requests. for example the resource might be interrupt driven (a serial port for example) and you have the interrupt handlers talk to the serial port hardware directly, if the application/forground task wants to have something done it fills out a request (puts something in a fifo/buffer) the interrupt then looks to see if there is anything in the request queue, and if so operates on it.
Of course there is the, disable interrupts and re-enable critical sections, but those are scary if you want your interrupts to have some notion of timing/latency...Understand what you are doing and they can be used to solve this app+isr two user problem.
ldrex/strex on non-cached memory space:
My extest perhaps has more text on the when you can and cant use ldrex/strex, unfortunately the arm docs are not that good in this area. They tell you to stop using swp, which implies you should use strex/ldrex. But then switch to the hardware manual which says you dont have to support exclusive operations on a uniprocessor system. Which says two things, ldrex/strex are meant for multiprocessor systems and meant for sharing resources between processors on a multiprocessor system. Also this means that ldrex/strex is not necessarily supported on uniprocessor systems. Then it gets worse. ARM logic generally stops either at the edge of the processor core, the L1 cache is contained within this boundary it is not on the axi/amba bus. Or if you purchased/use the L2 cache then the ARM logic stops at the edge of that layer. Then you get into the chip vendor specific logic. That is the logic that you read the hardware manual for where it says you dont NEED to support exclusive accesses on uniprocessor systems. So the problem is vendor specific. And it gets worse, ARM's L1 and L2 cache so far as I have found do support ldrex/strex, so if you have the caches on then ldrex/strex will work on a system whose vendor code does not support them. If you dont have the cache on that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of config registers accessed through copressor reads. buried in there is a "swp instruction supported" bit to determine if you have a swap. didnt the cortex-m3 folks run into the situation of no swap and no ldrex/strex?
The bug in the linux kernel (there are many others as well for other misunderstandings of arm hardware and documentation) is that on a processor that supports ldrex/strex the ldrex/strex solution is chosen without determining if it is multiprocessor, so you can (and I know of two instances) get into an infinite ldrex/strex loop. If you modify the linux code so that it uses the swp solution (there is code there for either solution) they linux will work. why only two people have talked about this on the internet that I know of, is because you have to turn off the caches to have it happen (so far as I know), and who would turn off both caches and try to run linux? It actually takes a fair amount of work to succesfully turn off the caches, modifications to linux are required to get it to work without crashing.
No, I cant tell you the systems, and no I do not now nor ever have worked for ARM. This stuff is all in the arm documentation if you know where to look and how to interpret it.
Generally, the ldrex and strex need support from the memory systems. You may wish to refer to some answers by dwelch as well as his extext application. I would believe that you can not do this for memory mapped I/O. ldrex and strex are intended more for Lock Free algorithms, in normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Software will make requests to that driver via semaphores, etc which can be implement with ldrex and strex in normal SDRAM. So, you can inter-lock these I/O registers, but not in the direct sense.
Often, the I/O registers will support atomic access through write one to clear, multiplexed access and other schemes.
Write one to clear - typically use with hardware events. If code handles the event, then it writes only that bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access - often an interrupt enable/disable will have a register bitmap. However, there are also alternate register that you can write the interrupt number to which enable or disable a particular register. For instance, intmask maybe two 32 bit registers. To enable int3, you could mask 1<<3 to the intmask or write only 3 to an intenable register. They intmask and intenable are hooked to the same bits via hardware.
So, you can emulate an inter-lock with a driver or the hardware itself may support atomic operations through normal register writes. These schemes have served systems well for quiet some time before people even started to talk about lock free and wait free algorithms.
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction isn't being recommended against simply because its very nature is counterproductive in a multi-core system - it was deprecated in ARMv6 and is "optional" to implement in certain ARMv7-A revisions, and most ARMv7-A processors already require it to be explicitly enabled in the cp15 SCTLR. Linux by default does not, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what #dwelch refers to above). So please don't recommend SWP as a valid alternative if you are expecting code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like the ARM terminology getting in the way. A quad-core Cortex-A15 is considered one processor... So testing for "uniprocessor" in Linux would not make one iota of a difference - the architecture and the interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-lite) does not support propagating it, so it cannot use it to synchronize with external masters. It does not support SWP, never introduced in the Thumb instruction set, which its interconnect would also not be able to propagate.
If the chip in question has a toggle register (which is essentially XORed with the output latch when written to) there is a work around.
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
as long as two processes do not modify the same pins (as opposed to "the same port") there is no race condition.
In the case of the bcm2708 you could choose an output pin whose neighbors are either unused or are never changed and write to GPFSELn in byte mode. This will however only ensure that you will not corrupt others. If others are writing in 32 bit mode and you interrupt them they will still corrupt you. So its kind of a hack.
Hope this helps

Getting confuse with ABI calling convention and arch

I am getting confuse with all those terms:
ABI, calling convention, and hardware architecture.
The ABI is link with the architecture: x86-64 have a different ABI than the i386.
But then you can also define your own calling convention cdecl...
Well so what is the link between all those concept?
Which one is defining the other one?
Mostly I think I am confuse with ABI. What do you put inside a part from calling convention?
Thanks
This is a vast topic still to give you some pointers:
The ABI (application binary interface) cover the details that need to be specified in order that application can work on a certain system (usually with an operating system). So, to get to examples:
data type sizes (for example C standard gives just minimum requirements for the types. int type should at least as big as short, and short has to be 16 bits.)
layout in memory of structures and bitfields
calling convention (when a function is invoked where it can find it's parameters, which in registers, which on stack etc)
stack frame (what it is present on the stack, useful for a debugger)
system call numbers
others
Basically any detail that needs to be known in order to build a program that runs together with some other components (libraries, OS) can be included in an ABI. Some ABI specify more and some specify less details.
The hardware architecture can be also seen as a specification but of even lower level (it's about hardware not software). The hardware architecture specifies things like the instruction set available, memory hierarchy and how to access peripherals. For one hardware architecture there can be different ABI-s. Also you can have the same ABI for multiple (but usually similar) hardware architecture.

Resources