nanopb - binary compatibility between different processor architectures

I am using nanopb/google-protobuf between an ARM processor and Infineon Aurix.
QUESTION
Are there possible binary compatibility issues when communicating between different processor architectures?
Are there configuration flags that I could set, on both processors, to make sure encoding/decoding is done identically?
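For context: nanopb emits the standard protobuf wire format, which is defined byte-for-byte (little-endian varints and fixed-width fields) and does not depend on the endianness or word size of either CPU, so the encoded stream itself is portable between the ARM and Aurix sides. A minimal encoding sketch, assuming a message type SensorData generated by nanopb's generator; the message name and field are hypothetical, while pb_ostream_from_buffer and pb_encode are nanopb's actual entry points:

#include <pb_encode.h>
#include "sensordata.pb.h"   /* hypothetical generated header */

bool send_sample(uint8_t *buf, size_t bufsize, size_t *written)
{
    SensorData msg = SensorData_init_zero;
    msg.temperature = 23;                 /* hypothetical field */

    /* serialize into buf; the resulting bytes are plain protobuf
       wire format, identical regardless of the encoding CPU */
    pb_ostream_t stream = pb_ostream_from_buffer(buf, bufsize);
    if (!pb_encode(&stream, SensorData_fields, &msg))
        return false;                     /* see PB_GET_ERROR(&stream) */
    *written = stream.bytes_written;
    return true;
}

The decoding side is symmetric: pb_istream_from_buffer plus pb_decode.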

Related

How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but development also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and it was built into the Intel compiler suite, e.g. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). It seems that most efforts now go into the development of PGAS languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI, as it is definitely not going away in the years to come.
Earlier, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. It had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible relative to the number of accesses to protected pages. This means that once a page is brought up to date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
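A small sketch of that advice (mine, not from the Cluster OpenMP documentation): prefer one coarse synchronization point, such as a reduction, over per-iteration atomics, and let each thread sweep a contiguous block so pages are re-used once fetched:

#include <omp.h>

/* compile with: cc -fopenmp sum.c */
double sum_squares(const double *a, long n)
{
    double sum = 0.0;

    /* schedule(static) gives each thread one contiguous chunk, so a
       page brought up to date on a node is accessed many times; the
       reduction synchronizes once at the end instead of hitting a
       "#pragma omp atomic" on every iteration */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; i++)
        sum += a[i] * a[i];

    return sum;
}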
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives to describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.
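For illustration, the map clauses below are those data-transfer directives: map(to:) copies the input array to the device and map(tofrom:) copies the result back. This is a generic OpenMP 4.5 sketch, not tied to the remote-offloading plugin, and it needs a compiler built with offloading support:

/* y = a*x + y, offloaded to a target device */
void saxpy(float a, const float *x, float *y, int n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}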

Can I run C program compiled on different ARM processor?

Let's say I compiled a C program on a Raspberry Pi; can I run that binary on, say, a Cubietruck?
How do I know for sure that two ARM processors are compatible? Are they all compatible with each other?
There should be an easy answer referring to the instruction sets supported by the processors, but I can't find any good material on that.
There are several conditions for that:
Your executable should use the "least common denominator" of all the ARM microarchitectures you wish to support; see gcc's -march=... option for that, and the example after this list. Assuming you're running Linux, grep '^model' /proc/cpuinfo should give you that information for each platform.
(related) Some features may not be supported by all your target ARM cores (FPU, NEON, etc...), so be very careful with that.
You should, of course, run the same OS on all supported platforms.
You need to make sure that all supported platforms use the same ABI; ARM has a history of ABI changes, so you must take this into consideration.
If you're lucky enough to target only reasonably modern ARM platforms, you should be able to find some common ground (EABI or Hard Float ABI). Otherwise you probably have no choice but to maintain several versions of your executable.
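To make the first point concrete, a hedged example: a Raspberry Pi 1 is an ARMv6 core with VFP, while a Cubietruck (Allwinner A20, Cortex-A7) is ARMv7, so one binary for both would be built for the older core. The flags below match Raspbian's hard-float baseline; verify them against your own distro's ABI:

/* build once for the least common denominator:
 *   gcc -march=armv6 -mfpu=vfp -mfloat-abi=hard -o hello hello.c
 * the resulting binary runs on both boards (same OS and ABI assumed) */
#include <stdio.h>

int main(void)
{
    puts("hello from an ARMv6 baseline build");
    return 0;
}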

What is the minimum target CPU architecture for the various versions of Visual Studio?

What is the minimum target processor architecture (as indicated by the _M_IX86 predefined macro) supported by each of Visual Studio 2008, 2010, and 2012?
For example, MSVS 2012 supports only Pentium Pro and higher.
The classic switch for this was the /G series. The available options differed across versions of the compiler (with newer versions dropping older options, albeit continuing to accept them for compatibility reasons). Here's what you got:
/G3 built code that was optimized for 386 processors (_M_IX86 was set to 300)
/G4 built code that was optimized for the 486 processor (_M_IX86 was set to 400)
/G5 built code that was optimized for the Pentium (_M_IX86 was set to 500)
/G6 built code that was optimized for the Pentium Pro, II, and III (_M_IX86 was set to 600)
/G7 built code that was optimized for the Pentium 4 or AMD Athlon (_M_IX86 was set to 700)
/GB specified "blend" mode: the lowest common denominator that was reasonable when that version of the compiler was released. This was the default option if no other was specified.
And of course, it bears explicit mention that setting this option to optimize for a newer processor architecture did not prevent your code from running on an older processor architecture. It just wasn't optimized for that architecture and might run more slowly.
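As an aside, that numeric scheme is why old code sometimes gated features on the macro's value. A hypothetical sketch of the era's idiom:

/* built with /G5 or higher, _M_IX86 is at least 500, so a
   Pentium-only instruction such as RDTSC can be assumed */
#if defined(_M_IX86) && _M_IX86 >= 500
    /* Pentium or better: use the RDTSC-based timer path */
#endif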
However, if you look up this compiler option in a current version of the documentation, you'll see no mention of any of this. All you see is something about Itanium processors (which we'll put aside). That's because the compiler shipping with VC++ 2005 dropped the /G3–/G7 compiler options altogether:
[The] /G3, /G4, /G5, /G6, /G7, and /GB compiler options have been removed. The compiler now uses a "blended model" that attempts to create the best output file for all architectures.
So, although many of us remember it well from VC++ 6, this code-generation setting was already a historical curiosity as far back as VC++ 2008. Therefore I'm not sure where you get the impression that VS 2012 supports only the Pentium Pro. I can't find mention of that anywhere in the official documentation or elsewhere online. The limiting factor for version 2012 of the compiler is not the processor architecture but the OS version. If you've patched the compiler, libraries, and all the other accoutrements to support targeting Windows XP, then you will be able to run your application on an original Pentium-233, onto which you've masochistically shoe-horned Windows XP.
The purpose of the _M_IX86 macro is really just to indicate that you're targeting the Intel IA-32 processor family (more commonly known as good old 32-bit x86), in contrast to one of the other supported target architectures, like _M_AMD64 for 64-bit x86. You should just treat it as a defined/undefined value now.
Yes, the old table of values for _M_IX86 still appears in the latest version of the preprocessor documentation, but it is utterly obsolete. You'll note that other obsolete symbols appear there as well, such as _M_PPC: what was the last version of MSVC++ that shipped with a PowerPC compiler? 4.2?
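So in modern code the sensible pattern is a plain defined/undefined test, for example:

#if defined(_M_IX86)
    /* 32-bit x86 build */
#elif defined(_M_AMD64)
    /* 64-bit x86 build (_M_X64 is defined here as well) */
#endif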
But that is only part of the story. There are still other compiler options that govern code generation with respect to target architectures.
For example, the /arch switch. From the latest version of the documentation, you have the following options:
/arch:IA32 which essentially sets the lowest common denominator, using x87 for floating point
/arch:SSE which turns on SSE instructions
/arch:SSE2 which turns on SSE2 instructions (and is the default for x86)
/arch:AVX which turns on Intel Advanced Vector Extensions
/arch:AVX2 which turns on Intel Advanced Vector Extensions 2
If you read the Remarks section, you'll also see that these options can imply more than just the specified instruction set. For example, since all processors that support SSE instructions also support the CMOV instruction, the CMOV instruction will be generated when /arch:SSE or higher is specified. The CMOV instruction has nothing to do with SSE; in fact, SSE was introduced with the Pentium III, while CMOV was introduced way back with the Pentium Pro. But it's guaranteed to be available on any processor that supports SSE.
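To illustrate that remark (a sketch; the actual code generation is up to the compiler): under /arch:SSE or higher, a branchless select like the one below may be compiled to CMOV, which is exactly why the resulting binary then needs at least a Pentium Pro:

/* cl /O2 /arch:SSE2 /c min.c */
int min_int(int a, int b)
{
    return a < b ? a : b;   /* a CMOV candidate under /arch:SSE and up */
}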
The other relevant option is the /favor switch. It was new starting with VC++ 2008, and was presumably the replacement for the old /G3–/G7 options. As the documentation says:
/favor:blend is the default and produces code with no unique optimizations
/favor:INTEL64 generates code specific for Intel's implementation of x86-64
/favor:AMD64 generates code specific for AMD's implementation of x86-64
/favor:ATOM generates code specific for Intel's Atom processor

What clang optimization flags should I apply to my OS X 10.5+ universal (i386/x86_64) binary?

Currently, I am compiling:
clang -Oz -g
But I would like to apply an -mtune and if possible an -march flag, and anything else that will be valid on all Intel architectures that OS X Leopard supports.
Specifically I am asking: which -mtune and -march flag should I specify so that my binary is optimized for 10.5, and will run on all supported Intel processors for 10.5?
In addition, I would like to apply different tunings to the 32-bit and 64-bit portions; is this possible? If so, what should I tune the 64-bit portion to?
For bonus points, I am interested in the same for PowerPC, for future reference, though currently I do not support that.
You can build separate binaries using different -march and other flags, though be aware that -march can use instructions not available on earlier processors. -mtune and -mcpu can choose instructions, alignment, etc., that will favour a particular processor, yet run on all processors in that family.
To tune to different architectures (i386, x86-64, ppc, ppc64) you would have to consult the manual pages for clang / gcc. After separate builds it should be possible to use lipo to create a universal binary. There's a simple example here.
For Apple's compilers, you should use the -arch specification, and -mmacosx-version-min=10.5, provided you still have the SDKs for 10.5.
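Putting those pieces together, a per-slice build might look like the following. The -march/-mtune values are illustrative only (every Intel Mac that runs 10.5 has at least SSE3, and the 64-bit-capable ones are Core 2/Xeon class), so verify them against your toolchain:

clang -arch i386   -Oz -g -mmacosx-version-min=10.5 -march=prescott -o app_i386   app.c
clang -arch x86_64 -Oz -g -mmacosx-version-min=10.5 -march=core2    -o app_x86_64 app.c
lipo -create app_i386 app_x86_64 -output app   # glue the slices into one universal binary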

Distro provided cross compiler vs custom built gcc

I intend to cross compile for Raspberry Pi, basically a small ARM computer. The host will be an i686 box running Arch Linux.
My first instinct is to use the cross compiler provided by Arch Linux, arm-elf-gcc-base and arm-elf-binutils. However, every wiki and post I read seems to use some custom gcc build. They seem to spend significant time cooking their own gcc. The problem is that they never say WHY it is important to use their gcc over another.
1. Can stock distro-provided cross compilers be used for building Raspberry Pi (or ARM in general) kernels and apps?
2. Is it necessary to have multiple compilers for the ARM architecture? If so, why, since a single gcc can support all x86 variants?
3. If 2), how can I deduce what target subset is supported by a particular version of gcc?
4. More generally, what use cases call for a custom gcc build?
Please be as technical as you can, I'd like to know WHY as well as how.
When developers talk about building software (cross compiling) for a different machine (target) compared to their own (host) they use the term toolchain to describe the set of tools necessary to build binary files. That's because when you need to build an executable binary, you need more than a compiler.
You need startup routines (crt0.o) that initialize the runtime according to the requirements of the operating system and the standard libraries. You need a standard set of libraries, and those libraries need to be aware of the kernel on the target because of the system-call API and several OS-level configurations (e.g. page size) and data structures (e.g. time structures).
On the hardware side, there is a range of ARM architecture versions. Architectures can be backward compatible, but a toolchain is by nature binary and targeted at a specific architecture. You could target the most widespread architecture by default, but that won't be very fruitful for an already constrained environment (an embedded device). And if you target the latest architecture, the result won't be usable on targets based on older architectures.
When you build a binary on your host for your host, the compiler can look up all the necessary bits from its own environment or use what's on the host, so most of the above details are invisible to the developer. However, when you build for a target different from your host type, the toolchain must know about the hardware, OS, and standard-library details. The way you tell the toolchain about these is by building it according to those details, which might require some level of bootstrapping (or you can do this via an extensive set of parameters, if the toolchain supports that / was built for it).
So when there is a generic (stock) cross-compile toolchain, it already has some target specifics baked in, and those might not meet your requirements. Please see this recent question about the situation on Ubuntu for an example.
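On question 3), one practical way to see what a prebuilt cross gcc was configured for is to ask the toolchain itself (standard gcc options, shown with a hypothetical triple):

arm-linux-gnueabihf-gcc -dumpmachine   # the configured target triple
arm-linux-gnueabihf-gcc -v             # configure flags, incl. --with-arch/--with-fpu defaults
echo | arm-linux-gnueabihf-gcc -dM -E - | grep __ARM_ARCH   # predefined macros for the default -march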
