Can parallel processing be achieved?

Can an MCU really do parallel processing?
Let's say I want to run a countdown, send data through another interface, and do one more task such as lighting up an LED, all at the same time.
Is that even possible?

A processor with multiple execution units or cores can perform parallel processing. Most microcontrollers do not have multiple execution units.
Some architectures support SIMD (Single Instruction, Multiple Data) instructions that generate multiple results from a single instruction; this is a low-level form of parallel processing. Similarly, DSPs (Digital Signal Processors) and microcontrollers with DSP instructions provide dual or multiple MAC (multiply-accumulate) units, which are another form of parallel processing. Both SIMD and MAC units are used primarily for number crunching and signal-processing applications. High-end DSPs often support other instruction-level parallel execution capabilities.
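To make SIMD concrete, here is a minimal sketch assuming an x86 target with SSE intrinsics (on a DSP-capable MCU such as a Cortex-M4 the same idea appears as packed-add instructions): a single instruction produces four sums at once.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* Four float lanes packed into each 128-bit register. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

        __m128 sum = _mm_add_ps(a, b);   /* four additions, one instruction */

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }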
Another low-level architecture feature that allows parallel execution is pipelining. A pipelined processor can retire one result per cycle even when each instruction takes multiple cycles to complete, because different stages of successive instructions execute simultaneously.
Most microcontrollers can support a multi-tasking or multi-threading scheduler that gives the impression of concurrent execution by allocating execution time to each task according to the scheduling algorithm used. This is not parallel processing, and in fact it adds overhead rather than accelerating processing, but it is useful in other ways: it allows functional partitioning of the code and, in the case of a real-time priority-based preemptive scheduler, achieves real-time response to events. For the example use case in your question, this form of scheduling is entirely appropriate and adequate; see Real-time Operating System (RTOS).
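As a rough sketch of that approach, the three activities from the question can each become a task under a preemptive scheduler. This example assumes FreeRTOS; led_toggle() and uart_send_byte() are hypothetical stand-ins for board-specific drivers.

    #include "FreeRTOS.h"
    #include "task.h"

    /* Hypothetical board-specific drivers -- replace with your HAL calls. */
    extern void led_toggle(void);
    extern void uart_send_byte(char c);

    static void countdown_task(void *arg)
    {
        (void)arg;
        for (int i = 10; i >= 0; --i) {
            /* act on the current count here */
            vTaskDelay(pdMS_TO_TICKS(1000));   /* sleep 1 s, yielding the CPU */
        }
        vTaskDelete(NULL);                     /* countdown finished */
    }

    static void blink_task(void *arg)
    {
        (void)arg;
        for (;;) {
            led_toggle();
            vTaskDelay(pdMS_TO_TICKS(250));
        }
    }

    static void comms_task(void *arg)
    {
        (void)arg;
        for (;;) {
            uart_send_byte('x');               /* send data on another interface */
            vTaskDelay(pdMS_TO_TICKS(100));
        }
    }

    int main(void)
    {
        xTaskCreate(countdown_task, "count", 128, NULL, 2, NULL);
        xTaskCreate(blink_task,     "blink", 128, NULL, 1, NULL);
        xTaskCreate(comms_task,     "comms", 128, NULL, 1, NULL);
        vTaskStartScheduler();                 /* hands control to the scheduler */
        for (;;) {}                            /* never reached */
    }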
Microcontroller architectures that do support true parallel processing include XMOS, PicoChip, and the Cell processor. Historically, the Transputer pioneered parallel processing in microprocessors.
A way of achieving a high level of parallelism at a low level is to implement the process on an FPGA, where individual operations of the same process can occur simultaneously (when one does not depend on the result of another, or when a pipeline is used). This essentially implements the processing in hardware rather than software, although the languages used to program FPGAs share similarities with software languages.

A company named Parallax makes an 8-core MCU called the Propeller that does parallel processing. Its programming language, Spin, is interesting: object-oriented and script-like, but with support for inline assembly.

Related

understanding parallelism of FPGAs

I am having a bit of a problem understanding the benefits of FPGAs for parallel processing. Everybody says they are parallel, but to me they do not look truly parallel. Consider this example:
I have a data signal coming in on some pin at 1 bit per clock cycle. The FPGA receives this data, and since the data is already inside the integrated circuit, it can start processing right away. But that is called serial processing, not parallel. If the FPGA waits for the data to accumulate so it can process it in parallel later, then we can say FPGA processing is truly parallel, but what is the benefit of waiting for the data to arrive in large quantities? We just lose time; for example, if we wait for 8 bits of data, we lose 7 cycles. So where is the benefit of the parallelism of FPGAs? I can't get it.
It would be parallel if the data came in in parallel, like with the old DB-25 parallel port connector. But that technology became obsolete because the parallel port could not support high speeds. Today's USB standard is serial, Ethernet is serial, so where is the parallelism?
The parallelism comes in if you have data that arrives in chunks, the chunks arrive faster than they can be processed, and the chunks can be processed individually. Rather than having to slow down the data sender, an FPGA allows you to add more processing "blocks" so that the processing goes faster.
Example:
You receive data (serially or in parallel, it doesn't matter) at 1 MB/s in 50 kB chunks, but your algorithm only allows one chunk to be processed per second. At 1 MB/s the chunks arrive at 20 per second, so in an FPGA you can wire up the "receiver" to distribute the chunks across 20 "processors". Now your sender can still send at full speed, and your receiver sees less overall lag.
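A software analogy of that fan-out, as a minimal sketch assuming POSIX threads (on the FPGA the workers would be replicated hardware blocks rather than threads):

    #include <pthread.h>
    #include <stdio.h>

    #define WORKERS 20   /* the 20 "processors" from the example */
    #define CHUNKS  100

    /* Stand-in for the one-chunk-per-second algorithm. */
    static void *process_chunk(void *arg)
    {
        long id = (long)arg;
        printf("processed chunk %ld\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t pool[WORKERS];
        for (long base = 0; base < CHUNKS; base += WORKERS) {
            /* Fan a batch of chunks out across the workers... */
            for (long w = 0; w < WORKERS && base + w < CHUNKS; ++w)
                pthread_create(&pool[w], NULL, process_chunk,
                               (void *)(base + w));
            /* ...and collect them before taking the next batch. */
            for (long w = 0; w < WORKERS && base + w < CHUNKS; ++w)
                pthread_join(pool[w], NULL);
        }
        return 0;
    }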
Parallelism has several levels, which need to be understood if you want to understand computer architectures. FPGAs are just a tool to build a "computer".
The levels are:
bit level: multiple bits or data words are processed in parallel.
For example, you can build 8-bit, 32-bit, or 4096-bit adders that add two integer numbers in just one cycle
instruction level: multiple instructions of one control flow are executed in parallel
=> pipelining, superscalar architectures
thread level: multiple control flows are executed in parallel
=> multithreading, multi-core, n-socket systems
application level: multiple applications are executed in parallel
=> multiprocessing
dataflow processing: everything in parallel :)
FPGAs can use each level to do everything in parallel.
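To make the bit level concrete in software terms, here is a minimal plain-C sketch: one 64-bit operation combines 64 bit positions at once, where a bit-serial loop needs 64 iterations. An FPGA simply lets you pick such widths freely (8, 32, 4096, ...).

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t a = 0x0123456789ABCDEFULL;
        uint64_t b = 0xFEDCBA9876543210ULL;

        /* Bit-parallel: 64 independent bit positions combined at once. */
        uint64_t parallel = a ^ b;

        /* Bit-serial equivalent: one position per iteration. */
        uint64_t serial = 0;
        for (int i = 0; i < 64; ++i)
            serial |= (((a >> i) ^ (b >> i)) & 1ULL) << i;

        printf("%d\n", parallel == serial);   /* prints 1 */
        return 0;
    }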

Difference between concurrency and simultaneous?

I am currently studying parallel computing and algorithms, and I am a little bit confused about the terms concurrent execution and simultaneous execution.
What is the difference between these terms? When should we say concurrent and when should we say simultaneous in parallel computing?
Simultaneous execution is about utilizing multiple resources (cores, HW threads, etc.) in order to perform multiple tasks at the same time. The tasks don't have to interact in any way; you may have two different applications running simultaneously on two different cores, for example, or on the same core.
The art of designing systems to be able to perform multiple tasks at the same time can be said to deal with simultaneous execution. Hyper-threading, for example, is also called "SMT", simultaneous multi-threading, since it deals with the ability to run two threads with their full contexts at the same time on a single core (this is Intel's approach; AMD has a slightly different solution, see: Difference between intel and AMD multithreading).
Concurrency is a term residing at a higher level of abstraction, relating to the OS world. It's a property of your execution environment in which you have multiple tasks that may be executed over time, while you have no control over the order, or even the form of interleaving, in which they're performed. It doesn't really matter if they operate simultaneously on multiple cores, on one core with SMT, or even on a single-threaded core with some preemption mechanism and a scheduling algorithm that breaks the tasks into chunks and constantly swaps between them. The important thing is that concurrency forces you to design your tasks in a way that guarantees correctness (especially if they interact or share data) on any type of system with any order or interleaving.
If the task is designed correctly (with proper locking, barriers, semaphores, and anything guaranteeing correct data flow) and the OS does its job properly (saving states on context switch for example or clearing caches and shooting down TLB entries when needed), then it can run with any form of execution model "under the hood".
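A minimal sketch of that correctness-by-design, assuming POSIX threads: the mutex keeps a shared counter correct whether the two threads run simultaneously on two cores or interleaved on one.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; ++i) {
            pthread_mutex_lock(&lock);    /* serialize the read-modify-write */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* always 2000000 */
        return 0;
    }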
Since you're referring to parallel algorithms, the proper term for you is probably concurrent execution.
There are quite a lot of examples in this thread (with additional links to sources; I won't copy them here to avoid plagiarism :)): What is the difference between concurrency and parallelism?

When should I use parallel programming?

What would be a typical or real-world problem that calls for parallel programming? It can be quite challenging to implement, and resources on the internet explain how to use it, but not why.
Performance is the most common reason to use parallel programming. But not all programs become faster through parallel programming: in most cases your algorithm consists of parts that are parallelizable and parts that are inherently sequential. You always have to reason about the potential performance gain of using parallel programming; in some cases the overhead of using it will actually make your program slower. Have a look at Amdahl's law to learn more about the potential performance improvements you can achieve.
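For reference, Amdahl's law says that if a fraction p of the work can be parallelized across N processors, the best possible speedup is

    S(N) = 1 / ((1 - p) + p/N)

so with p = 0.9 and N = 8 the speedup is only about 4.7x, and no number of processors can push it beyond 1/(1 - p) = 10x.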
If you just want some examples of parallel computation in use: there are some classes of algorithms that are inherently parallel; see this article: the dwarfs of Berkeley.
Another reason for using a multithreaded application architecture is its responsiveness. Certain functions block program execution for some amount of time, e.g. reads from files, network access, or waiting for user input. While waiting like this does not consume CPU power, it often blocks or slows program flow.
Using threads in such cases is simply good practice for making the code clearer. Instead of using (often complex or unintuitive) checks for inputs, integrating those checks into the program flow, and manually switching between handling input and other tasks, a programmer may choose to use threads and let one thread wait for input while another, for example, performs calculations.
In other words, multiple threads sometimes allow better use of the different resources at your computer's disposal: network, disk, input devices, or simply the monitor.
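As a minimal sketch of the input-waiting pattern (assuming POSIX threads): one thread blocks on user input while the main thread keeps working.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int quit = 0;

    static void *input_thread(void *arg)
    {
        (void)arg;
        char line[64];
        if (fgets(line, sizeof line, stdin) != NULL)  /* blocks here, not in main */
            atomic_store(&quit, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, input_thread, NULL);
        while (!atomic_load(&quit)) {
            /* perform calculations, refresh the display, etc. */
            printf("working...\n");
            sleep(1);
        }
        pthread_join(t, NULL);
        return 0;
    }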
Generalization: using multiple threads (including parallel data processing) is advisable when the speed and responsiveness gains outweigh the synchronization costs and work required to parallelize the application.
The reason there is increased interest in parallel programming is partly that the hardware we use today is more parallel (multicore processors, many-core GPUs). To fully benefit from this hardware you need to program in parallel.
Interestingly, parallel processing also improves battery life:
Having four cores at 1 GHz draws less power than a single core at 4 GHz.
A phone with a multicore CPU will try to run as many tasks as possible simultaneously, so that it can turn the CPU off when all the work is done. This is sometimes called "the rush to idle".
Now, some programs are easier to parallelize than others. You should not randomly try to parallelize your entire code base, but it can be a useful exercise to do so even when there is no business reason: you will be more prepared the day you really need it.
There are very few problems which can't be solved more quickly by a parallel program than by a serial program. There are very few computers which do not have multiple processing units.
I conclude, therefore, that you should use parallel programming all the time.

Hybrid OpenMP + OpenMPI for mixed distributed & shared memory?

I am developing a code to perform a few very large computations by my standards. Based on single-CPU estimates, expected run-time is ~10 CPU years, and memory requirements are ~64 GB. Little to no IO is required. My serial version of the code in question (written in C) is working well enough and I have to start thinking about how to best parallelize the code.
I have access to clusters with ~64 GB RAM and 16 cores per node. I will probably limit myself to using e.g. <= 8 nodes. I'm imagining a setup where memory is shared between threads on a single node, with separate memory used on different nodes and relatively little communication between nodes.
From what I've read so far, the solution I have come up with is to use a hybrid OpenMP + OpenMPI design, using OpenMP to manage threads on individual compute nodes, and OpenMPI to pass information between nodes, like this:
https://www.rc.colorado.edu/crcdocs/openmpi-openmp
My question is whether this is the "best" way to implement this parallelization. I'm an experienced C programmer but have very limited experience in parallel programming (a little bit with OpenMP, none with OpenMPI; most of my jobs in the past were embarrassingly parallel). As an alternative suggestion, is it possible with OpenMPI to efficiently share memory on a single host? If so then I could avoid using OpenMP, which would make things slightly simpler (one API instead of two).
Hybrid OpenMP and MPI coding is most appropriate for problems where one can clearly identify two separate levels of parallelism: a coarse-grained one and a fine-grained one nested inside each coarse subdomain. Since fine-grained parallelism requires lots of communication when implemented with message passing, it doesn't scale well, because the communication overhead can become comparable to the amount of work being done. As OpenMP is a shared-memory paradigm, no data communication is necessary, only access synchronisation, so it is more appropriate for finer-grained parallel tasks. OpenMP also benefits from data sharing between threads (and the corresponding cache sharing on modern multi-core CPUs with a shared last-level cache) and usually requires less memory than the equivalent message-passing code, where some of the data might need to be replicated in all processes. MPI, on the other hand, can run across nodes and is not limited to a single shared-memory system.
Your description suggests that your parallelisation is very coarse-grained, or that the problem belongs to the class of embarrassingly parallel problems. If I were you, I would go hybrid. If you only employ OpenMP pragmas and don't use runtime calls (e.g. omp_get_thread_num()), your code can be compiled either as pure MPI (i.e. with non-threaded MPI processes) or as hybrid, depending on whether you enable OpenMP or not (you can also provide a dummy OpenMP runtime so the code compiles as serial). This gives you the benefits of OpenMP (data sharing, cache reuse) and of MPI (transparent networking, scalability, easy job launching), with the added option to switch off OpenMP and run in MPI-only mode. And as an added bonus, you will be ready for the future, which looks like it will bring us interconnected many-core CPUs.
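A minimal hybrid sketch of that structure (illustrative only; compile with something like mpicc -fopenmp, details vary by cluster): OpenMP pragmas handle the fine grain within a node, MPI the coarse grain between nodes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Fine grain: OpenMP threads share memory within the node. */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < 1000000; i += nranks)
            local += 1.0 / (double)(i + 1);   /* stand-in for real work */

        /* Coarse grain: one message-passing step between nodes. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }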

multi core and parallel processing

What is the difference between parallel processing and multi-core processing?
Parallel and multi-core processing both refer to the same thing: the ability to execute code at the same time (on more than one core/CPU/machine). In this sense, multi-core is just one means of doing parallel processing.
On the other hand, concurrency (which may be what you mean by parallel processing) refers to having multiple units of execution (threads or processes) that are interleaved. This can happen on a single-core CPU, on many cores/CPUs, or even across many machines (clusters).
Summing up: multicore is a subset of parallel, and concurrency can occur with or without parallelism. The field that studies execution across many machines is distributed systems or distributed computing.
Parallel processing just refers to a program running more than one part simultaneously, usually with the different parts communicating in some way. This might be on multiple cores, on multiple threads on one core (which is really simulated parallel processing), on multiple CPUs, or even on multiple machines.
Multicore processing is usually a subset of parallel processing.
Multicore processing means code working on more than one "core" of a single CPU chip. A core is like a little processor within a processor. Making code work for multicore processing is therefore nearly always about the parallelization aspect (though it would also include removing any core-specific assumptions, which you shouldn't normally have anyway).
As far as algorithm design goes, if the algorithm is correct from a parallel-processing point of view, it will be correct on multicore.
However, if you need to optimise your code to get it to run as fast as possible "in parallel" then the differences between multicore, multi-cpu, multi-machine, or vectorised will make a big difference.
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Resources