Difference between Sequential and Parallel Hotspots - parallel-processing

I am studying for a Parallel Computing exam, and I read that the hotspot problem can occur in both parallel and sequential execution, but that it is more dangerous in the former.
I would like to understand, through simple examples, a case of a sequential hotspot and a case of a parallel hotspot, and if possible the difference between them.
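As an illustration (my own sketch, not from the thread, assuming a C compiler with OpenMP support): a hotspot is a small region of code or data that absorbs a disproportionate share of the activity. Sequentially, a hotspot is simply where the program spends most of its time; in parallel, the same spot becomes a point of contention that serializes the threads, so adding cores can even slow the program down.

/* Illustrative sketch of a parallel hotspot: a shared counter.
 * Run sequentially, the update is merely where the time goes.
 * Run in parallel, every thread must serialize through the
 * critical section, so the counter becomes a contention point. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long iters = 10000000L;
    long counter = 0;                   /* the shared "hot" datum */

    #pragma omp parallel for
    for (long i = 0; i < iters; i++) {
        #pragma omp critical            /* all threads contend here */
        counter++;
    }
    printf("contended counter = %ld\n", counter);

    /* The usual fix: keep updates thread-private, combine once. */
    long counter2 = 0;
    #pragma omp parallel for reduction(+:counter2)
    for (long i = 0; i < iters; i++)
        counter2++;
    printf("reduced counter   = %ld\n", counter2);
    return 0;
}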

Related

Algorithms which are inherently not possible to parallelise

For instance, algorithms such as LZ77 might require previous results in order to proceed, but they can still be executed in parallel, at least to some extent (e.g. http://www.cs.cmu.edu/~jshun/dcc2013-final.pdf).
Are there any specific, real-world algorithms which have to be executed only sequentially?
Wikipedia:
Some problems have no parallel algorithms, and are called inherently serial problems.
Examples:
http://en.wikipedia.org/wiki/Three-body_problem
http://en.wikipedia.org/wiki/Newton%27s_method
You can also have a look at this thread: algorithms that cannot be sped up by parallelisation.
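To make the Newton's method example concrete (a minimal sketch of my own, in C): each iterate x_{n+1} = x_n - f(x_n)/f'(x_n) depends on the previous one, so the iterations form a chain that cannot be spread across processors.

/* Newton's method for sqrt(2), i.e. f(x) = x*x - 2 (illustrative).
 * Iteration n+1 cannot begin until iteration n has finished, which
 * is what makes the method inherently serial. */
#include <stdio.h>

int main(void) {
    double x = 1.0;                      /* initial guess */
    for (int n = 0; n < 8; n++) {
        double fx  = x * x - 2.0;        /* f(x)  */
        double dfx = 2.0 * x;            /* f'(x) */
        x -= fx / dfx;                   /* x_{n+1} depends on x_n */
        printf("iteration %d: x = %.15f\n", n + 1, x);
    }
    return 0;
}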

Suggest an OpenMP program that has noticeable speedup, and the most important concepts in it, for a talk

I am going to give a lecture on OpenMP and I want to write a program using OpenMP live. What program do you suggest that covers the most important concepts of OpenMP and has noticeable speedup? I am looking for a technical and interesting example with nice output.
I want to write two programs live: the first as the better illustration of the most important OpenMP concepts, with impressive speedup, and the second as a hands-on exercise that everyone writes at the same time.
My audience may be quite amateur.
Personally I wouldn't say that the most impressive aspect of OpenMP is the scalability of the codes you can write with it. I'd say that a more impressive aspect is the ease with which one can take an existing serial program and, with only a few OpenMP directives, turn it into a parallel program with satisfactory scalability.
So I'd suggest that you take any program (or part of any program) of interest to your audience, better yet a program your audience is familiar with, and parallelise it right there and then in your lecture, live, as you put it. I'd be impressed if a lecturer could show me, say, a 4x speedup on 8 cores with 5 minutes of coding and a re-compilation. And that leads on to all sorts of interesting topics about why you don't (always, easily) get an 8x speedup on 8 cores.
Of course, like all stage illusionists, you'll have to choose your example carefully and rehearse to ensure that you do get an impressive-enough speedup to support your argument.
Personally I'd be embarrassed to use an embarrassingly parallel program for such a demo; the more perceptive members of the audience might be provoked into a response such as meh.
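To illustrate just how small the change typically is (a sketch of my own, not the answerer's suggested demo): the classic numerical integration of pi parallelises with a single pragma, and timing the loop before and after adding it makes the speedup visible.

/* Classic one-directive OpenMP demo (illustrative sketch): estimate
 * pi by integrating 4/(1+x^2) over [0,1] with the midpoint rule.
 * The serial version is the identical loop minus the pragma. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000L;
    const double h = 1.0 / n;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)   /* the only change */
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;               /* midpoint of slice i */
        sum += 4.0 / (1.0 + x * x);
    }
    double t1 = omp_get_wtime();

    printf("pi ~= %.12f  (%.3f s)\n", h * sum, t1 - t0);
    return 0;
}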
(1) Matrix multiply
Perhaps the simplest meaningful example (though matrix addition would be simpler still).
(2) Mandelbrot
http://en.wikipedia.org/wiki/Mandelbrot_set
Mandelbrot is also embarrassingly parallel, and OpenMP can achieve decent speedups; you can even use graphics to visualize it. It is an interesting example because it has workload imbalance: you may see different speedups depending on the scheduling policy (e.g., schedule(dynamic,1) vs. schedule(static)) and on the threading library (e.g., Cilk Plus or TBB). See the sketch after this list.
(3) A couple of mathematical kernels
For example, an FFT (the non-recursive version) is also embarrassingly parallel.
Take a look at the "OmpSCR" benchmarks (http://sourceforge.net/projects/ompscr/); this suite has simple OpenMP examples.
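Here is a minimal Mandelbrot sketch (my own illustration, not code from the answer, assuming a C compiler with OpenMP): pixels near the set hit the iteration cap while pixels far from it escape quickly, which is exactly the imbalance that makes the scheduling policy matter.

/* Illustrative Mandelbrot sketch showing workload imbalance.
 * Rows take very different amounts of work, so compare the times
 * for schedule(static) and schedule(dynamic,1) on the same image. */
#include <stdio.h>
#include <omp.h>

#define W 1024
#define H 1024
#define MAX_ITER 1000

static unsigned char img[H][W];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 1)
    for (int py = 0; py < H; py++) {
        for (int px = 0; px < W; px++) {
            double cx = -2.0 + 3.0 * px / W;    /* map pixel to plane */
            double cy = -1.5 + 3.0 * py / H;
            double x = 0.0, y = 0.0;
            int it = 0;
            while (x * x + y * y <= 4.0 && it < MAX_ITER) {
                double xt = x * x - y * y + cx; /* z = z*z + c */
                y = 2.0 * x * y + cy;
                x = xt;
                it++;
            }
            img[py][px] = (unsigned char)(255L * it / MAX_ITER);
        }
    }
    printf("rendered %dx%d (center pixel %d) in %.3f s\n",
           W, H, img[H / 2][W / 2], omp_get_wtime() - t0);
    return 0;
}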

Performance optimization in CUDA - Which of these algorithms should I use?

I have an algorithm which consists of two major tasks. Both tasks are embarrassingly parallel, so I can port this algorithm to CUDA in one of the following ways.
Kernel1<<<Block, Threads>>>();   // for task 1
cudaThreadSynchronize();         // wait on the CPU for task 1 to finish
Kernel2<<<Block, Threads>>>();   // for task 2
Or I can do the following:
Kernel<<<Block, Threads>>>()
{
    // 1. Threads work on task 1.
    // 2. Synchronize across the whole device (all blocks).
    // 3. Start on task 2.
}
Note that in the first method we have to come back to the CPU, while in the second we have to use synchronization across all blocks in CUDA. A paper at IPDPS '10 says that the second method, with proper care, can perform better. But in general, which method should be followed?
There is not currently any officially supported method for synchronizing across thread blocks within a single kernel execution in the CUDA programming model. Methods of doing so, in my experience, result in brittle code that can behave incorrectly under changing circumstances such as running on different hardware, or changing driver and CUDA release versions.
Just because something is published in an academic publication does not mean it is a safe idea for production code.
I recommend you stick with your method 1, and I ask you this: have you determined that separating your computation into two separate kernels is really causing a performance problem? Is the cost of a second kernel launch definitely the bottleneck?

How to implement efficient sorting algorithms for multiple processors with Scala?

How do I implement efficient sorting algorithms for multiple processors in Scala? Here's a link to a radix sort algorithm on the GPU:
radix algorithm in GPU
You could use scala.actors.Futures, but it isn't a good solution: you are talking about parallel computation, not concurrent computation, and Futures is aimed at the latter, not the former.
Things like the parallel arrays coming with Java 7, and with a later version of Scala (after 2.8), are more appropriate for parallel algorithms.
Just to explain: a parallel algorithm is one that performs the same computation on multiple processing units, each running the same code. A concurrent computation is one in which each processing unit may be running different code. Relatedly, in parallel algorithms the code being run doesn't change, only the data; in concurrent computation the code itself changes constantly.
By the way, though it is not what you are asking, let me mention that there is a library for Scala to run OpenCL code (i.e., run computations on the GPU). It's called ScalaCL.

What are some hints that an algorithm should be parallelized?

My experience thus far has shown me that even with multi-core processors, parallelizing an algorithm won't always speed it up noticeably. In fact, sometimes it can slow things down. What are some good hints that an algorithm can be sped up significantly by being parallelized?
(Given, of course, the usual caveats about premature optimization and its correlation to evil.)
To gain the most benefit from parallelisation, a task should be able to be broken into similar-sized, coarse-grained chunks that are independent (or mostly so) and require little communication of data or synchronisation between the chunks.
Fine-grained parallelisation almost always suffers from increased overheads and will have a finite speed-up regardless of the number of physical cores available.
[The caveat to this is architectures with a very large number of 'cores', such as the Connection Machine's 64,000 cores. These are well suited to calculations that can be broken into relatively simple actions assigned to a particular topology (like a rectangular mesh).]
If you can divide the work into independent parts then it may be parallelized well.
Remember also Amdahl's Law, which is a sobering reminder of how little we can expect in terms of performance gains from adding more cores to most programs.
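For reference (the standard formula, not part of the original answer): if a fraction p of a program's runtime can be parallelised across n cores, Amdahl's Law bounds the overall speedup at

S(n) = 1 / ((1 - p) + p/n)

For example, with p = 0.9 the limit as n grows is S = 1 / (1 - 0.9) = 10: even with 90% of the work parallelised, no number of cores yields more than a 10x speedup.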
First, check out this paper by the late Jim Gray:
Distributed Computing Economics
Actually, this will clear up some misunderstanding based on what you wrote in the question. Obviously, the less amenable your problem set is to being discretized, the more difficult it will be to parallelize.
Any time you have computations that depend on previous computations, it is not a parallel problem. Things like linear image processing, brute force methods, and genetic algorithms are all easily parallelized.
A good analogy is: what could you work on where you could get a bunch of friends to do different parts at once? For example, putting IKEA furniture together might parallelize well if different people can work on different sections, but rolling wallpaper might not, because you need to do the walls in sequence.
If you're doing large matrix computations, like simulations involving finite element models, these can often be broken down into smaller pieces in straightforward ways. Matrix-vector multiplies can benefit well from parallelization, assuming you are dealing with very large matrices (see the sketch below). Unless there is a real performance bottleneck that is causing code to run slowly, it's probably not necessary to hassle with parallel processing.
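A minimal matrix-vector multiply sketch (my own illustration, assuming C with OpenMP): the rows are independent, so the outer loop parallelises with one directive, and the benefit only shows for large matrices.

/* Illustrative sketch: parallel matrix-vector multiply y = A*x.
 * Each row's dot product is independent of the others, so the
 * outer loop parallelises cleanly; for small n the thread startup
 * overhead outweighs the gain, which is why matrix size matters. */
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void matvec(int n, const double *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* row i dot x */
        y[i] = sum;
    }
}

int main(void) {
    int n = 4096;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (long i = 0; i < (long)n * n; i++) A[i] = 1.0;
    for (int i = 0; i < n; i++) x[i] = 1.0;

    double t0 = omp_get_wtime();
    matvec(n, A, x, y);
    printf("y[0] = %.1f in %.3f s\n", y[0], omp_get_wtime() - t0);

    free(A); free(x); free(y);
    return 0;
}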
Well, if you need lots of locks for it to work, then it's probably one of those difficult algorithms that doesn't parallelise well. Is there any part of the algorithm that can be broken up into separate parts that don't need to touch each other?
