Difference between concurrent and simultaneous execution? - parallel-processing

I am currently studying parallel computing and algorithms, and I am a little bit confused about the terms concurrent execution and simultaneous execution.
What is the difference between these terms? When do we have to use concurrent and when do we have to use simultaneous in parallel computing?

Simultaneous execution is about utilizing multiple resources (cores, hardware threads, etc.) in order to perform multiple tasks at the same time. The tasks don't have to interact in any way; you may have two different applications running simultaneously on two different cores, for example, or on the same core.
The art of designing systems to be able to perform multiple tasks at the same time can be said to deal with simultaneous execution. Hyper-threading, for example, is also called "SMT", simultaneous multi-threading, since it deals with the ability to run two threads with their full contexts at the same time on a single core (this is Intel's approach; AMD has a slightly different solution, see: Difference between Intel and AMD multithreading).
Concurrency is a term residing on a higher level of abstraction, relating to the OS world. It's a property of your execution environment in which you have multiple tasks that may be executed over time, while you have no control over the order or even the form of interleaving in which they're performed. It doesn't really matter if they operate simultaneously on multiple cores, on one core with SMT, or even on a single-threaded core with some preemption mechanism and some scheduling algorithm that breaks the tasks into chunks and constantly swaps between them. The important thing here is that concurrency forces you to design your tasks in a way that guarantees correctness (especially if they interact or share data) on any type of system with any order or interleaving.
If the task is designed correctly (with proper locking, barriers, semaphores, and anything guaranteeing correct data flow) and the OS does its job properly (saving states on context switch for example or clearing caches and shooting down TLB entries when needed), then it can run with any form of execution model "under the hood".
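As a tiny sketch of what "guaranteeing correctness on any interleaving" means (the shared counter and the iteration count here are invented for illustration, not taken from the question): two tasks increment a shared counter, and the mutex makes the final result correct whether they run interleaved on one core, on SMT siblings, or truly simultaneously on two cores.
#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;
std::mutex m;

void task() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);   // serialize access to the shared counter
        ++counter;
    }
}

int main() {
    std::thread t1(task), t2(task);
    t1.join();
    t2.join();
    std::cout << counter << '\n';   // always 200000, on any execution model
}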
Since you're referring to parallel algorithms, the proper term for you is probably concurrent execution.
There are quite a lot of examples in this thread (with additional links to sources; I won't copy them here to avoid plagiarism): What is the difference between concurrency and parallelism?

Related

Making use of idle cores in a parallel pipeline?

I've been going through this tutorial on parallel pipelines and noticed that, while there is definitely a considerable difference in throughput, couldn't it be even better if the compression stage also took on a read job since it's just waiting around anyway? The same thing goes for the write stage... I mean, why not take on a third compression and then switch over to writing two, and then have one of those cores go back to compressing while the other wraps up the third write, and so on?
I apologize if this is obvious. I imagine this is standard practice and is called something, I'm just not sure what. Is there any overhead involved with switching jobs like this?
And I know this might be the wrong forum for this last question, but can the GPU switch jobs like this or should the programmable shaders/CUDA cores pretty much be left alone after being programmed?
EDIT: I guess I also don't understand how taking the same six cores used in the 2 cores/stage example would be faster than just giving each of the six cores all three stages. Sure, there would be two cores that would do two, but that's still faster than the top scenario. I would understand it better in the GPU's case since there is specialized hardware involved for certain computations, but generally speaking, I don't see it. Maybe this example is weak or something, because I know parallel processing is here to stay.
This is definitely an issue with pipelining and there are a number of different ways to try and mitigate it.
With specialized hardware, the pipeline will often be tuned to try to balance the time taken in each stage for typical workloads. Fixed-function stages in GPUs, for example, are typically balanced around a sample of representative game rendering workloads, with transistors allocated to try to even out the time taken in each stage. Even with static balancing like this, there will usually still be some wasted performance.
An alternative approach that can be used in both software and hardware to balance a pipeline is to break the longer stages down into multiple shorter steps. This is a common strategy in CPU instruction pipelines but can also be useful in software. In your example, the longer running compression step could potentially be broken down into multiple shorter pipeline stages. Depending on the task this may be difficult or impossible to do efficiently however.
Task scheduling systems can be used to help balance workloads across CPUs in a software pipeline. In a task scheduling system, you have a number of worker threads (usually around one per hardware thread) and any task can run on any worker thread. You have an API to set up dependencies between tasks and the task scheduler is responsible for scheduling tasks to run wherever CPU time is available once their dependencies are satisfied. In your example, the cores with idle time running the Read and Write tasks could help out with Compress tasks rather than sitting idle as long as the Compress tasks had their Read task dependencies satisfied.
Traditional OS thread schedulers can give some of the same benefits of a task scheduling system. In your example, if the Read threads waited on a semaphore when their work queues were empty (to be signalled when new work was added to the queues), the OS could schedule Compress threads to run on those idle cores. This can work reasonably well for relatively long running pipeline stages (10s of milliseconds) but for shorter pipeline stages (sub 1ms) the overhead of the OS thread scheduling and the length of the thread time slice will likely mean a task scheduling system would give better performance.
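Here is a minimal sketch of the blocking approach described above (the chunk size, thread counts, and the read/compress placeholders are invented for illustration, not taken from the tutorial): a Read thread pushes chunks into a shared queue, and Compress threads block on a condition variable when the queue is empty, so the OS can schedule other pipeline threads on those cores instead of letting them spin.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> work;   // chunks waiting to be compressed
std::mutex m;
std::condition_variable cv;
bool reading_done = false;

void reader() {                 // Read stage: produces chunks
    for (int i = 0; i < 100; ++i) {
        std::string chunk(1 << 20, 'x');   // placeholder for a disk read
        { std::lock_guard<std::mutex> lock(m); work.push(std::move(chunk)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); reading_done = true; }
    cv.notify_all();
}

void compressor() {             // Compress stage: consumes chunks
    for (;;) {
        std::string chunk;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !work.empty() || reading_done; });   // blocks while idle
            if (work.empty()) return;        // no more chunks will arrive
            chunk = std::move(work.front());
            work.pop();
        }
        // compress(chunk) and a hand-off to the Write stage would go here
    }
}

int main() {
    std::thread read_thread(reader);
    std::vector<std::thread> compress_threads;
    for (int i = 0; i < 4; ++i) compress_threads.emplace_back(compressor);
    read_thread.join();
    for (auto& t : compress_threads) t.join();
}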
Your points are valid. The tutorial is lacking.
If the read, compress, and write operations can all occur at once, independently, the simple non-pipelined case would be the fastest for the six cores. Also notice that in the six-core diagram the reads and writes never overlap, so the same cores could handle both; you only need four cores.
But consider a case where the reads all access the same disk, so issuing too many read operations in parallel makes the reads take longer because they interfere with each other. In this case you can gain by pipelining the reads, since you start the first compress steps sooner, and they are what limits the overall performance.

When should I use parallel programming?

What would be a typical or real problem that calls for parallel programming? It can be quite challenging to implement. On the internet they explain how to use it, but not why.
Performance is the most common reason to use parallel programming. But not all programs will become faster by using parallel programming. In most cases your algorithm consists of parts that are parallelizable and parts that are inherently sequential. You always have to reason about the potential performance gain of using parallel programming; in some cases the overhead of using it will actually make your program slower. Have a look at Amdahl's law to learn more about the potential performance improvements you can reach.
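As a quick, made-up illustration of Amdahl's law: if a fraction p of the program is parallelizable and you run it on n cores, the best possible speedup is

speedup(n) = 1 / ((1 - p) + p / n)

so with p = 0.95 on 8 cores you get at most 1 / (0.05 + 0.95/8) ≈ 5.9x, and no number of cores will ever push it past 1 / 0.05 = 20x.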
If you only want some examples of usage of parallel computations: There are some classes of algorithms that are inherently parallel, see this article the dwarfs of berkeley
Another reason for using a multithreaded application architecture is responsiveness. Certain functions block program execution for a certain amount of time, e.g. reads from files, network access, waiting for user input, etc. While waiting like this does not consume CPU power, it often blocks or slows program flow.
Using threads in such a case is simply good practice that makes the code clearer. Instead of using (often complex or unintuitive) checks for input, integrating those checks into the program flow, and manually switching between handling input and other tasks, a programmer may choose to use threads and let one thread wait for input while another, for example, performs calculations (a small sketch of this follows below).
In other words, multiple threads sometimes allow for better use of different resources at your computer's disposal: network, disk, input devices or simply monitor.
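A small, hypothetical sketch of that input-handling pattern (the "calculation" is just a counter standing in for real work): one thread blocks on user input while the main thread keeps computing and stays responsive.
#include <atomic>
#include <iostream>
#include <string>
#include <thread>

int main() {
    std::atomic<bool> quit{false};

    // This thread blocks on user input without stalling the calculation below.
    std::thread input([&] {
        std::string line;
        while (std::getline(std::cin, line) && line != "quit") {
            // react to other commands here
        }
        quit = true;   // "quit" was typed or the input stream closed
    });

    long long iterations = 0;
    while (!quit) {
        ++iterations;   // stands in for the real calculations
    }
    std::cout << "stopped after " << iterations << " iterations\n";
    input.join();
}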
Generalization: using multiple threads (including parallel data processing) is advisable when the speed and responsiveness gains outweigh the synchronization costs and work required to parallelize the application.
Part of the reason for the increased interest in parallel programming is that the hardware we use today is more parallel (multicore processors, many-core GPUs). To fully benefit from this hardware you need to program in parallel.
Interestingly, parallel processing also improves battery life:
Having 4 cores at 1 GHz draws less power than a single core at 4 GHz.
A phone with a multicore CPU will try to run as many tasks as possible simultaneously, so it can turn off the CPU when all work is done. This is sometimes called "the rush to idle".
Now, some programs are easier to parallelize than others. You should not randomly try to parallelize your entire code base. But it can be a useful exercise to do so even if there is no business reason: then you will be more ready the day you really need it.
There are very few problems which can't be solved more quickly by a parallel program than by a serial program. There are very few computers which do not have multiple processing units.
I conclude, therefore, that you should use parallel programming all the time.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. Here is the scenario:
Suppose I have x instances of a class BloomFilter and some y GB of data (greater than the memory available). I need to test membership of this y GB of data against each of the Bloom filter instances. It is pretty clear that parallel programming will help to speed up the task; moreover, since I am only reading the data, it can be shared across all processes or threads.
Now I am confused about which one to use: Cilk, Cilk++, or OpenMP (which one is better?). I am also confused about whether to go for multithreading or multiprocessing.
Cilk Plus is the current implementation of Cilk by Intel.
Both are multithreaded environments, i.e., multiple threads are spawned during execution.
If you are new to parallel programming, OpenMP is probably better for you since it allows easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragmas to instruct the compiler which portions of the code have to run in parallel. If I understand your problem correctly, you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for (size_t i = 0; i < data_size; ++i) {
    check(data[i], array_of_bloom_filters);   // membership test against each filter
}
The instances of the different Bloom filters are replicated in every thread (that is what firstprivate does) in order to avoid contention, while the data is shared among threads.
Update:
The paper actually considers an application which is very unbalanced, i.e., different tasks (allocated on different threads) may incur very different workloads. Citing the paper that you mentioned: "a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronization.
In other words, good load balancing always comes at a cost. The description of your problem is not very detailed, but it seems to me that the problem you have is quite balanced. If this is not the case, then go for Cilk: its work-stealing approach is probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of the OMP_SCHEDULE environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE="dynamic,2", or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
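For example (process() and the loop bound n here are placeholders, not from the question), a loop scheduled this way can be tuned at run time without recompiling, simply by exporting a different OMP_SCHEDULE value before each run:
#include <chrono>
#include <iostream>
#include <thread>

// Placeholder for work whose cost varies a lot between iterations.
void process(int i) {
    std::this_thread::sleep_for(std::chrono::microseconds((i % 7) * 100));
}

int main() {
    const int n = 1000;
    // schedule(runtime) defers the choice of schedule to the OMP_SCHEDULE
    // environment variable, e.g. OMP_SCHEDULE=guided or OMP_SCHEDULE="dynamic,2".
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; ++i) {
        process(i);
    }
    std::cout << "done\n";
}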
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work stealing, or dynamic scheduling, may lose much of the potential advantage OpenMP has in promoting cache locality by pinning threads, e.g. with OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

multi core and parallel processing

What is the difference between parallel processing and multi-core processing?
Parallel and multi-core processing both refer to the same thing: the ability to execute code at the same time (on more than one core, CPU, or machine). So in this sense multi-core is just a means to do parallel processing.
On the other hand, concurrency (which is probably what you mean by parallel processing) refers to having multiple units of execution (threads or processes) that are interleaved. This can happen on a single-core CPU, on many cores/CPUs, or even across many machines (clusters).
Summing up: multi-core is a subset of parallel processing, and concurrency can occur with or without parallelism. The field that studies this is distributed systems or distributed computing.
Parallel processing just refers to a program running more than one part simultaneously, usually with the different parts communicating in some way. This might be on multiple cores, multiple threads on one core (which is really simulated parallel processing), multiple CPUs, or even multiple machines.
Multicore processing is usually a subset of parallel processing.
Multicore processing means code working on more than one "core" of a single CPU chip. A core is like a little processor within a processor. So making code work for multicore processing will nearly always be talking about the parallelization aspect (though would also include removing any core specific assumptions, which you shouldn't normally have anyway).
As far as algorithm design goes, if it is correct from a parallel-processing point of view, it will be correct on multicore.
However, if you need to optimise your code to get it to run as fast as possible "in parallel" then the differences between multicore, multi-cpu, multi-machine, or vectorised will make a big difference.
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Multiple threads and performance on a single CPU

Is there any performance benefit to using multiple threads on a computer with a single CPU that does not have hyperthreading?
In terms of speed of computation, no. In fact, things will slow down due to the overhead of managing the threads.
In terms of responsiveness, yes. You can for example have one thread wait on an IO operation and have another run a GUI at the same time.
It depends on your application. If it spends all its time using the CPU, then multithreading will just slow things down - though you may be able to use it to be more responsive to the user and thus give the impression of better performance.
However, if your code is limited by other things, for example using the file system, the network, or any other resource, then multithreading can help, since it allows your application to behave asynchronously. So while one thread is waiting for a file to load from disk, another can be querying a remote webserver and another redrawing the GUI, while another is doing various calculations.
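A small sketch of why this helps even on one CPU (the sleeps stand in for blocking file and network calls, and the 300 ms figures are invented): two waits running in separate threads overlap, so the wall-clock time is roughly the longer wait rather than the sum.
#include <chrono>
#include <iostream>
#include <thread>

// Sleeping stands in for blocking I/O (disk, network): the single CPU is idle
// during the wait, so the two waits can overlap.
void fake_io(const char* name, std::chrono::milliseconds wait) {
    std::this_thread::sleep_for(wait);
    std::cout << name << " finished\n";
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread disk(fake_io, "disk read", std::chrono::milliseconds(300));
    std::thread net(fake_io, "web request", std::chrono::milliseconds(300));
    disk.join();
    net.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << "total: " << ms << " ms\n";   // roughly 300 ms, not 600 ms
}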
Working with multiple threads can also simplify your business logic, since you don't have to pay so much attention to how various independent tasks need to interleave. If the operating system's scheduling logic is better than yours, then you may indeed see improved performance.
You can consider using multithreading on a single CPU:
If you use network resources
If you do heavy IO operations
If you pull data from a database
If you rely on other resources with possible delays
If you want your app to react quickly to the user
When you should not use multithreading on a single CPU:
For compute-intensive operations that keep the CPU at almost 100% usage
If you are not sure how to use threads and synchronization
If your application cannot be divided into several parallel processes
Yes, there is a benefit of using multiple threads (or processes) on a single CPU - if one thread is busy waiting for something, others can continue doing useful work.
However this can be offset by the overhead of task switching. You'll have to benchmark and/or profile your application on production grade hardware to find out.
Regardless of the number of CPUs available, if you require preemptive multitasking and/or applications with asynchronous components (i.e. pretty much anything that combines a responsive GUI with a non-trivial amount of computation or continuous I/O processing), multithreading performs much better than the alternative, which is to use multiple processes for each application.
This is because threads in the same process can exchange data much more efficiently than can multiple processes, because they share the same memory context.
See this Wikipedia article on computer multitasking for a fairly concise discussion of these issues.
Absolutely! If you do any kind of I/O, there is a great advantage to having a multithreaded system. While one thread waits for an I/O operation (which is relatively slow), another thread can do useful work.
