I am developing an algorithm that uses ARM NEON instructions. I am writing the code in assembler files (.S, no inline asm).
My question is: what is the best way to debug it, i.e. to view registers, memory, etc.?
Currently, I am using the Android NDK to compile and my Android phone to run the algorithm.

Poor man's debug solutions...
You can use gdb / gdbserver to remotely control execution of applications on an Android phone. I'm not giving full details here because they change all the time, but you can start with this answer or do a quick search on the Internet. Learning to use GDB might seem to have a steep learning curve, but the material on the web is extensive, and you can easily find something to your taste.
Single-stepping an ARM core via software tools is hard, which is why the ARM ecosystem is full of expensive tools and extra hardware equipment.
The trick I use is to insert BRK instructions manually in the assembly code. BRK is the self-hosted debug breakpoint instruction on AArch64 (32-bit ARM code uses BKPT instead). When the core sees this instruction it stops and raises a debug exception to the OS. The OS then notifies the debugger and passes control to it. Once the debugger has control, you can check the contents of registers and probably even change them. The last part of the operation is to make your process continue. Since PC still points at the breakpoint instruction, you must advance PC to the instruction after BRK (A64 instructions are 4 bytes wide, so that is PC + 4).
Since you mentioned that you use .S files instead of .s files, you can use gcc to do the preprocessing / macro work. This way, enabling and disabling BRK becomes less of an issue.
The big downside of this way of working is turnaround time. If there is a certain point that you want to investigate with gdb, you must make sure there is a BRK instruction there, and this will probably require another build/push/debug cycle.
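The preprocessor toggle described above can be sketched roughly as follows. This is written in C with inline asm only so that it is easy to try out; in a .S file the same #if block would wrap the bare BRK instruction. The names ENABLE_DEBUG_BREAKS, DEBUG_BREAK and sum_array are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Build with -DENABLE_DEBUG_BREAKS to compile the breakpoints in.
 * Without it, DEBUG_BREAK() expands to nothing. */
#if defined(ENABLE_DEBUG_BREAKS)
#  if defined(__aarch64__)
#    define DEBUG_BREAK() __asm__ volatile("brk #0")
#  elif defined(__arm__)
#    define DEBUG_BREAK() __asm__ volatile("bkpt #0")
#  elif defined(__i386__) || defined(__x86_64__)
#    define DEBUG_BREAK() __asm__ volatile("int3")
#  else
#    define DEBUG_BREAK() ((void)0)
#  endif
#else
#  define DEBUG_BREAK() ((void)0)
#endif

/* Hypothetical routine under investigation. */
int sum_array(const int *values, int count)
{
    assert(values != NULL || count == 0);

    int total = 0;
    DEBUG_BREAK();   /* stop here to inspect registers before the loop */
    for (int i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

With the macro disabled the function runs normally, so the breakpoints can stay in the source between debugging sessions. Moving a breakpoint to a different place still means a rebuild.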


What is meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgeable about, and I was hoping to get a better understanding of the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high level language (e.g. C/C++, Java, etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it to assembly code. Then the assembler converts the assembly code to machine code, which is readable by the processor.
My question is, assuming the above is correct, why is Rosetta 2 required, since a CPU is supposed to translate high level code into readable machine code anyway? Why would developers need to "optimise" their applications (or care about what processor their applications run on), since they are written (mostly) in a high level language (which the processor can compile)? I don't get why programmers would care if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.
"a CPU is supposed to translate high level code into readable machine code anyway?"
No, the CPU doesn't do that itself; it is done by software running on the CPU (a JIT or ahead-of-time compiler).
With an ahead-of-time compiler (e.g. normal C++ implementations), closed-source software ships only x86 machine code, not source, so you can't just recompile it yourself. That is the gap Rosetta 2 fills: it translates existing x86 machine code so that it can run on the ARM CPU. Open-source software is usually easily portable by recompiling.
"Rewritten" is an overstatement for most apps; most can simply be recompiled.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.
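As a rough illustration of what such a port looks like, here is a minimal sketch of one tiny helper written three ways: with SSE intrinsics, with NEON intrinsics, and as plain C. The function name add4 is made up for this example; the intrinsics are the standard ones from xmmintrin.h and arm_neon.h.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* x86 SSE intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* Adds four floats from a and b element-wise into out. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    /* x86 path: this is the kind of code that does not recompile for ARM */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#elif defined(__ARM_NEON)
    /* hand-ported NEON equivalent */
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
#else
    /* portable fallback: recompiles anywhere */
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Only the scalar fallback carries over to a new architecture by recompiling alone; each intrinsic path has to be written by someone for that specific instruction set.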

Test app performance by making it lag

Is there a way or an application to test performance by making the app execute slower? I want to be sure that my app will perform well on older hardware.
Just adding stalls in software won't necessarily imitate any older hardware; it would just show you how the stalled code behaves on the new hardware (and if the stalls aren't properly serializing, they may actually be avoided altogether).
If you just want to see how the code behaves without some specific ISA features, you can disable them at compile time, or even compile for an older architecture. That won't make your CPU run any slower, of course, but the code won't be able to use, for example, AVX/SSE vectors (on x86) or other dedicated instructions.
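One small way to check that a feature really was disabled for a given build is to look at the compiler's predefined macros. A minimal sketch, assuming GCC or Clang on x86, where __AVX__ is defined only when AVX code generation is enabled (for example it is absent when building with -mno-avx); the function name built_with_avx is made up here:

```c
/* Returns 1 if the compiler was allowed to emit AVX instructions for this
 * translation unit, 0 otherwise. */
int built_with_avx(void)
{
#if defined(__AVX__)
    return 1;
#else
    return 0;
#endif
}
```

This only tells you what the compiler was permitted to generate; it says nothing about what the CPU you run on supports.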
If you want an old system+OS configuration, you can use emulation, e.g. DOSBox.
If you want an even higher level of realism, you can find a hardware simulator that models that hardware and run on that (assuming you can cross-compile your code to run on it).
And of course, if you want an even more realistic experiment and are willing to go the extra mile, just get a specimen of that old hardware, wipe the dust off, and build and run on it :)

What are the possible side effects of using GCC profiling flag -pg?

There is a device driver for a camera device provided to us as a .so library file by the vendor.
Only the header file with the APIs is available; it lists the functions that we can use to work with the device. Our application is linked with the .so library file provided by the vendor and uses the interface functions provided for our purposes.
When we wanted to measure the time taken by our application in handling different tasks, we added the GCC -pg flag and rebuilt our application.
But we found that with the executable built with -pg, we observe random failures in the camera image acquire functions. Since we are using the .so library file, we do not know what is going wrong inside those functions.
So, in general, I wanted to understand what the possible reasons for such a failure mode could be. Any pointers or documents that explain what goes on inside profiling and its side effects are appreciated.
This answer is a helpful overview of how the gcc -pg flag profiler actually works. The take-home point is mostly to do with possible changes to timing. If your library has any kind of time-sensitivity in it, introducing profiler overheads might be changing the time it takes to execute parts of the code, and perhaps violating some kind of constraint.
The gprof documentation explains the implementation details:
Profiling works by changing how every function in your program is
compiled so that when it is called, it will stash away some
information about where it was called from. From this, the profiler
can figure out what function called it, and can count how many times
it was called. This change is made by the compiler when your program
is compiled with the `-pg' option, which causes every function to call
mcount (or _mcount, or __mcount, depending on the OS and compiler) as
one of its first operations.
So the timing of your application would change quite a bit when you turn on -pg.
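To get a feel for where that overhead comes from, here is a hand-written stand-in for the call-counting half of what the quoted passage describes. It is not the real mcount (which is inserted by the compiler and also records where each call came from); record_call, calls_to and the two sample functions are hypothetical names for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FUNCTIONS 16

struct call_record {
    const char *name;     /* function name, as given by __func__ */
    unsigned long calls;  /* how many times it was entered */
};

static struct call_record records[MAX_FUNCTIONS];
static size_t record_count;

/* Hand-written stand-in for mcount: count one entry into `name`. */
static void record_call(const char *name)
{
    for (size_t i = 0; i < record_count; i++) {
        if (strcmp(records[i].name, name) == 0) {
            records[i].calls++;
            return;
        }
    }
    if (record_count < MAX_FUNCTIONS) {
        records[record_count].name = name;
        records[record_count].calls = 1;
        record_count++;
    }
}

/* Look up the number of recorded calls; 0 if the function never ran. */
unsigned long calls_to(const char *name)
{
    for (size_t i = 0; i < record_count; i++) {
        if (strcmp(records[i].name, name) == 0)
            return records[i].calls;
    }
    return 0;
}

/* With -pg the compiler inserts the hook; here it is written out by hand. */
int square(int x)
{
    record_call(__func__);
    return x * x;
}

int sum_of_squares(int n)
{
    record_call(__func__);
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i);
    return total;
}
```

Even in this toy version, a function as small as square now does a table lookup on every call, which is the kind of extra work that can shift timings in code that is sensitive to them.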
If you would like to instrument your code without significantly affecting the timings, you could look at oprofile. It does not impose as significant an overhead as gprof does.
Another fairly recent tool that serves as a good lightweight profiling tool is perf.
Profiling tools are useful primarily for understanding the CPU-bound pieces of your library/application, and they can help you optimize those critical pieces. Most of the time they serve to identify some culprit function/method which wastes CPU cycles. So do not rely on them as the sole tool for debugging any and all issues.
Most vendor libraries would also provide means to turn on extra debugging or dumping extra information during runtime. They include means such as environment variables, log files, /proc or /sys interfaces for drivers, etc. and sometimes even tools to increase debugging levels at runtime. See if you can leverage these.
If you have defined APIs in a library/driver, you should run unit-tests on them instead of trying to debug the whole application you've built.
If you find a certain unit-test fails, send the source code of the unit-test to your vendor, and ask them to fix the bug. If it is not a bug, your vendor would at least point you towards the right set of APIs or the semantics to use.

How does a breakpoint in debugger work?

Breakpoints are one of the coolest features supported by most popular debuggers like GDB. But how does a breakpoint work? What code modifications does the compiler make to support breakpoints? Are there any special hardware features used to support breakpoints?
The compiler does not need to "modify" the binary in any way to support breakpoints. However, it is important that:
The compiler includes enough information in the executable (not in the code itself but in special sections of the same file) so that the debugger can relate the source the user wants to debug to the machine code. One typical thing the debugger needs to know to be able to set breakpoints (unless you specify addresses directly) is where (at which address) the program's functions and lines of source code start within the machine code.
The code is not optimized by the compiler in a way that makes it impossible to relate source and machine code. Typically you will want to debug code that was not optimized, or code where only carefully selected optimizations were performed.
The rest of the work is then performed by the debugger itself.
Software breakpoints don't necessarily need special hardware features. Here the debugger relies on modifying the original binary (its copy that is loaded into memory). When you set a breakpoint, the debugger places a special instruction at the location of the breakpoint. This special instruction needs to somehow let the debugger detect when it is executing. It can be an instruction that causes some kind of interrupt/exception that the debugger can hook onto, or an instruction that hands control to a debug unit. If the program runs under an OS, that OS needs to support modifying a running program (with something like ptrace poke/peek). The downside of software breakpoints is that the debugger needs to be able to modify the running program, which is not possible if the program is running from some kind of read-only memory (quite common in the embedded world).
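The patch-and-restore bookkeeping behind a software breakpoint can be sketched on a plain byte buffer standing in for the loaded program. This is a simulation only: a real debugger would write into the target process (e.g. via ptrace) rather than into its own array. The struct and function names are invented for this example, and 0xCC is the x86 INT3 opcode; other architectures use different trap encodings.

```c
#include <stddef.h>
#include <stdint.h>

#define TRAP_OPCODE 0xCC  /* x86 INT3; an assumption of this sketch */

struct sw_breakpoint {
    size_t offset;        /* where in the code the trap was placed */
    uint8_t saved_byte;   /* original byte, needed to resume correctly */
    int active;
};

/* Overwrite one byte of code with the trap opcode, remembering the original.
 * Returns 0 on success, -1 if offset is outside the code buffer. */
int sw_breakpoint_set(struct sw_breakpoint *bp, uint8_t *code,
                      size_t code_len, size_t offset)
{
    if (offset >= code_len)
        return -1;
    bp->offset = offset;
    bp->saved_byte = code[offset];
    bp->active = 1;
    code[offset] = TRAP_OPCODE;
    return 0;
}

/* Put the original byte back so the real instruction can execute. */
void sw_breakpoint_clear(struct sw_breakpoint *bp, uint8_t *code)
{
    if (bp->active) {
        code[bp->offset] = bp->saved_byte;
        bp->active = 0;
    }
}
```

The write into code[] is exactly the step that fails when the program lives in read-only memory, which is the limitation mentioned above.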
Hardware breakpoints (which need to be supported by the CPU) implement similar behavior without modifying the program binary. This is CPU-specific, but usually it lets you at least define a program address at which execution should hit a breakpoint. The CPU continuously compares the current PC with these breakpoint addresses, and once the condition is matched, it breaks execution. The number of these breakpoints is always limited.
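The comparison the CPU performs can be modelled in a few lines. This is a toy simulation, not any particular CPU's debug unit: the count of four comparators, the fixed 4-byte instruction size, and all names are assumptions made for the sketch.

```c
#include <stdint.h>

#define NUM_HW_BREAKPOINTS 4   /* assumed; the real number is CPU-specific */

struct hw_debug_unit {
    uint32_t address[NUM_HW_BREAKPOINTS];
    int enabled[NUM_HW_BREAKPOINTS];
};

/* Claim a free comparator. Returns its index, or -1 if all are in use. */
int hw_breakpoint_set(struct hw_debug_unit *unit, uint32_t address)
{
    for (int i = 0; i < NUM_HW_BREAKPOINTS; i++) {
        if (!unit->enabled[i]) {
            unit->address[i] = address;
            unit->enabled[i] = 1;
            return i;
        }
    }
    return -1;
}

/* The check made for every instruction: does PC match any comparator? */
int hw_breakpoint_hit(const struct hw_debug_unit *unit, uint32_t pc)
{
    for (int i = 0; i < NUM_HW_BREAKPOINTS; i++) {
        if (unit->enabled[i] && unit->address[i] == pc)
            return 1;
    }
    return 0;
}

/* Step PC from start towards end in 4-byte instructions and return the
 * address where execution stops (a breakpoint, or end if none matched). */
uint32_t run_until_break(const struct hw_debug_unit *unit,
                         uint32_t start, uint32_t end)
{
    uint32_t pc = start;
    while (pc < end && !hw_breakpoint_hit(unit, pc))
        pc += 4;
    return pc;
}
```

Nothing in the program's memory is touched here, which is why this approach also works for code in ROM, and the fixed-size array is why the number of such breakpoints is limited.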
To set a breakpoint, we first have to add some special information to the binary. We use the -g flag while compiling the C source files to include this information. The software debugger uses this information to place breakpoints. The best example of hardware breakpoint support, in my experience, is VxWorks.
Basically, the processor halts at the breakpoint. So internally, any step that raises an exception in the processor can be used to implement a software breakpoint, while a hardware breakpoint works by matching the address stored in hardware registers to cause an exception. So a hardware breakpoint is very powerful, but it is heavily architecture-dependent.
A very good explanation is here: What is the difference between hardware and software breakpoints?
A good intro with Processor related information is given here

Resources for generating x86 assembly for gcc

I want to generate x86 assembly for a compiler course I have this semester.
My problem is that my only experience was a long time ago with 8086 assembler and I remember nothing.
I am looking for resources with examples that will work with gcc (as) so that I can test them.
My favourite documentation links:
Please also take note of the Related links section at the lower right of this very screen.
There is a nice 8088/86 emulator, pcemu. I have a fork of it where I removed the BIOS and DOS calls, leaving a processor emulator for learning 8088/86. Use nasm as an assembler, and pcemu or some similar simulator where you can get good visibility into what is going on (printing each instruction in execution order, for example).
If you didn't mean 8086 and meant the modern/current x86/IA processors, then pcemu won't work; you need something like qemu (little to no visibility).
