Prefix sum: parallel or sequential algorithm?

I have been wondering when to use a parallel prefix sum instead of a sequential build-up. The algorithm I am using constructs the sums in parallel, but I read somewhere that for a small number of elements (typically fewer than 100), it's better to use the sequential algorithm. This raises the question: is there a threshold above which a parallel implementation yields some gain over the sequential one? I am coding in OpenCL and have implemented the parallel prefix sum following Blelloch (1990).

It depends, as usual: on the implementation, the device, and the size of the data.
GPU Gems 3, chapter 39 has some pretty graphs that show where the thresholds lie for their specific implementations. They didn't implement the algorithm naively, of course: it's an optimized version using shared memory, unrolled loops, and shared-memory bank-conflict avoidance.
Once you have an implementation, you'll just have to benchmark it to find the threshold.
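To make the "just benchmark it" advice concrete, here is a minimal sketch in plain Python. It simulates the up-sweep/down-sweep structure of a Blelloch-style exclusive scan serially (each inner loop stands in for one parallel step) next to a one-pass sequential scan, plus a small timing helper. The function names are made up for illustration, and timings taken from this Python version say nothing about a GPU; only the shape of the experiment carries over to an OpenCL kernel.

```python
import timeit


def sequential_exclusive_scan(values):
    """Exclusive prefix sum computed left to right in a single pass."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def blelloch_exclusive_scan(values):
    """Up-sweep/down-sweep exclusive scan, simulated serially.

    Each inner for-loop corresponds to one parallel step on a GPU.
    The input is zero-padded to the next power of two.
    """
    n = 1
    while n < len(values):
        n *= 2
    data = list(values) + [0] * (n - len(values))

    # Up-sweep (reduce): build partial sums in place.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:len(values)]


def time_scan(scan, size, repeats=5):
    """Best-of-`repeats` wall-clock time (seconds) for one scan of `size` items."""
    data = list(range(size))
    return min(timeit.repeat(lambda: scan(data), number=1, repeat=repeats))
```

Calling `time_scan` for both functions over a range of sizes and looking for the size at which the ordering flips is the whole experiment; on a real device you would time the kernel launches (including any transfers you care about) instead.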


CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm which needs to sort each vector and calculate its average, standard deviation, and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until some stopping criterion is reached, for each vector independently.
I would like to ask someone more experienced what the best approach to parallelizing this algorithm is.
Is it possible to sort more than one vector at once?
Or is it perhaps better not to parallelize the sorting, and instead run the whole algorithm for one vector in a single thread?
"200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory."
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors means 200,000 blocks; but you can't have 2,000 threads in a block, that's excessive (the limit is 1,024 threads per block on current hardware), so each thread would have to handle several elements.
You might be able to use CUB's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with Thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own, although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
"Maybe is it better to not parallelize sorting but the whole algorithm as one thread?"
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.
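As a rough illustration of the Amdahl's Law point, here is a small sketch. The function name and the example fractions are hypothetical; you would plug in the share of run time your own profiling attributes to the sorting step.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup predicted by Amdahl's Law.

    parallel_fraction: share of the original run time that can be parallelized (0..1).
    workers: how many times faster the parallelizable part becomes.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

For example, if sorting were only half of the total run time, `amdahl_speedup(0.5, 1000)` stays just below 2: however fast the sort becomes, the other half still has to run.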

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following. Is it better to:
(1) implement a parallel sorting algorithm on the GPU and run it across all the arrays, or
(2) implement a single-threaded algorithm, like quicksort, and assign each GPU thread a different array?
Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.
The data type to be sorted is long, so I don't believe a radix sort would be possible, because the binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread). There are two related reasons for this:
(1) Most of the fast sorting work has been done along the lines of your first method, not the second.
(2) The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing, with minimal branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder whether the same criticism could be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of the SIMT architecture.
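To give an idea of how many arrays can be sorted "at once", here is a sketch of one common way to build such a vectorized sort: flatten everything into one long array, tag each value with the index of the array it came from, stable-sort by value, then stable-sort by tag. This is plain Python standing in for two stable key-value sorts on the GPU, not the API of thrust or CUB, and the function name is made up.

```python
def sort_segments(arrays):
    """Sort every array in `arrays` using two stable sorts over the flattened data."""
    # Flatten the input and remember which array (segment) each value belongs to.
    values = [v for arr in arrays for v in arr]
    segment_ids = [s for s, arr in enumerate(arrays) for _ in arr]

    # Pass 1: stable sort of all positions by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Pass 2: stable sort by segment id. Stability keeps the value order from
    # pass 1 inside each segment, so every segment ends up sorted.
    order = sorted(order, key=lambda i: segment_ids[i])
    sorted_values = [values[i] for i in order]

    # Cut the flat result back into per-array pieces.
    result, start = [], 0
    for arr in arrays:
        result.append(sorted_values[start:start + len(arr)])
        start += len(arr)
    return result
```

Both passes operate on one big array, so each can use a full-device sort; the per-array structure only reappears when the result is cut back into segments.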

A couple of CUDA-performance questions

This is the first time I've asked a question here, so thanks very much in advance, and please forgive my ignorance. I've also only just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel function just holds on to one point, iteratively reads in all the other points (from global memory), and does the calculation. Here are some of the things I'm confused about:
(1) I'm using a Tesla M2050 with 448 cores, but my current parallel version (kernel<<<128,16,16>>>) achieves a much higher speedup than that (about 600x faster than kernel<<<1,1,1>>>). Is this due to multithreading or to pipelining, or are those actually the same thing?
(2) I want to improve the performance further, so I tried using shared memory to hold some of the input points for each block. But the new code is no faster than before. What's the possible cause? Could it be related to launching too many threads?
(3) Or is it because I have an if statement in the code? I only consider and count the short distances, so I have a statement like (if dist < 200). How much should I worry about this one?
A million thanks!
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined!
Code optimizations (loop unrolling): 2.54x speedup, combined
Having an extra conditional statement does indeed cost something, but it should be the last thing you optimize, if only because you need to know the layout of your code before you build size assumptions into it.
The problem you are working on sounds like the famous n-body problem,
see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a pairwise computation at all, for example when the elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the space into a grid of boxes, have each element put itself into a box via division, and then only evaluate pairwise relations between neighboring boxes. This brings the cost down to roughly O(n*m), with m the number of elements in the neighboring boxes, instead of O(n^2).
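The box method can be sketched in a few lines. This is a serial 2D Python version for illustration only; the function name is made up, and the box size is assumed to equal the distance cutoff, so that any two points closer than the cutoff are guaranteed to sit in the same or in adjacent boxes.

```python
import math
from collections import defaultdict
from itertools import product


def close_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, whose distance is below `cutoff`."""
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")

    # Each point puts itself into a box via division by the box size.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cutoff), math.floor(y / cutoff))].append(idx)

    pairs = []
    for (cx, cy), members in grid.items():
        # Only the box itself and its 8 neighbours can hold close points.
        for dx, dy in product((-1, 0, 1), repeat=2):
            neighbours = grid.get((cx + dx, cy + dy), ())
            for i in members:
                for j in neighbours:
                    if i < j and math.dist(points[i], points[j]) < cutoff:
                        pairs.append((i, j))
    return sorted(pairs)
```

For the question above, the cutoff would be the 200 from the `dist < 200` test; a GPU version would typically sort the points by box index first so that each box occupies a contiguous range.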
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So on each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one step further towards completion. So, to saturate the GPU, you might need something like 448 * 20 = 8,960 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if, even if the condition is true for only a single one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, you are likely to end up with low compute throughput.

What are the most common uses for distributed computing?

I wrote a very simple distributed computing platform (based on the Map/Reduce paradigm), and I'm in the process of writing some demos and showcases. I have a very small team and have to prioritize which demos I'll write first.
To prioritize, I rank the demos with a weight of about 70% on being a relevant, common, significant use case of distributed computing, and 30% on being easy to write.
So far I have it ordered like this:
(1) Discovering pi digits with Monte Carlo
(2) Numerical integration with Monte Carlo
(3) Large matrix multiplication (dense matrices)
(4) Linear regressions
(5) Large matrix inversion
(6) Multiple regressions
(7) Clustering (K-Means)
(8) Clustering (Hierarchical)
Number 1 is on the list because it took 10 minutes to write, although it's completely useless (I'm not sure but I figure there's not a lot of people trying to find more digits to pi).
Due to the nature of my platform, it will shine most in things that are embarrassingly parallel, and not I/O-bound or reduce-dominated.
How would you change my list? What would you add to it? Is sorting useful at all in the enterprise world or is it only for benchmarking distributed computing platforms?
Your list suggests that you are not distinguishing between parallel computing and distributed computing. This is not necessarily wrong but someone looking for a demonstration of the excellence of a distributed computing platform might be left tepidly enthused upon seeing parallel computations, such as your items 2 - 5, being performed.
Sorting is certainly useful everywhere there is data: large enterprises, small enterprises, in your desk drawers, across the Googlesphere. So too is searching, which is a surprising omission from your list. The other omission which strikes me immediately is any sort of data fusion, merging large datasets to get information from their intersections beyond what can be extracted from the datasets individually.
I second Mark in that you are mixing distributed computing and HPC. Here are some comments on each of your topics:
(1) There are people trying to compute as many digits of Pi as they can but the Monte Carlo algorithm is completely useless there as its precision scales with the inverse square root of the number of trials, so in order to get one more decimal digit of precision you would roughly need 100 times more trials. There are other algorithms - see if you can implement some of them using Map/Reduce.
(2) This one is fine, although seldom used - same problem with precision as (1).
(5) Pure matrix inversions are seldom performed, mainly because of numerical instabilities. How about solving a dense system of linear equations instead?
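To put numbers on the precision argument in (1) and (2), here is a small sketch of the Monte Carlo estimate of pi together with its expected statistical error. The function names and the fixed seed are choices made for this illustration; the error formula is the standard error of a binomial proportion scaled by 4.

```python
import math
import random


def estimate_pi(trials, seed=0):
    """Estimate pi from the share of random points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / trials


def expected_error(trials):
    """Standard error of the estimate: 4 * sqrt(p * (1 - p) / trials), p = pi / 4."""
    p = math.pi / 4.0
    return 4.0 * math.sqrt(p * (1.0 - p) / trials)
```

`expected_error(100 * n)` is one tenth of `expected_error(n)`, which is the "100 times more trials per extra decimal digit" rule stated above. The trials themselves are independent, so they map onto Map/Reduce trivially; the method is just a poor way to get digits.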
I would say that you are missing one of the main uses of M/R processing nowadays, namely graph processing (read: analysis of social and other networks/flows). Some more general optimisation problems might also be nice, e.g. genetic algorithms.

Maximum Increase in Processing Speed via Parallelism

Are there any cases in which anything more than a linear speed increase comes from parallelising an algorithm ?
The maximum you can reach from a theory viewpoint is linear speedup.
In practice, superlinear speedup is possible. If you can distribute your problem in a way that lets you leverage processor caches, e.g. because the data does not fit in the cache of a single core but each piece fits once the problem is split up, your problem can scale better than linearly.
In theory, no - but in practice this might be the case (depending on the underlying hardware and your specific problem). It's not trivial to compare parallel and sequential code (you have to compare the fastest sequential implementation with your parallel implementation, not just your parallel implementation running on a single processor/thread).
But still, when someone speaks about more-than-linear speed-up I would always be suspicious; they either didn't measure it correctly (see above), measured an artifact (hardware/OS dependent) and should document it accordingly, or this only works for a specific combination of problem/implementation/hardware.
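The measurement rule from the previous answer fits in a couple of lines. This is a sketch with made-up function names; the one thing it encodes is that the baseline must be the fastest sequential implementation, not the parallel code run on one thread.

```python
def speedup(best_sequential_seconds, parallel_seconds):
    """Speedup relative to the fastest known sequential implementation."""
    if best_sequential_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return best_sequential_seconds / parallel_seconds


def is_superlinear(best_sequential_seconds, parallel_seconds, workers):
    """True if the measured speedup exceeds the number of workers used."""
    return speedup(best_sequential_seconds, parallel_seconds) > workers
```

If `is_superlinear` comes back true, the first things to check are the baseline and the measurement, as described above, before crediting the caches.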
