rxjs performance array vs stream - rxjs

I'm new to the rxjs world and trying to get my head around it. My understanding is one of the reasons to use rxjs is to improve performance with large datasets.
I'm trying to measure the speed improvement that you could get vs normal array higher-order functions (map, reduce).
I have set up this example here https://jsbin.com/bagoli/edit?js,console
The idea is to generate an array and apply some operators to it, measuring the time spent.
I don't understand why the stream calculation is always slower. Am I missing something?
Thank you for your help.

Your calculateWithStreams function is async and will run in parallel to your Array function, which makes it slower. If you run them one at a time, the times are basically the same once you increase the Size a bit.
RxJS does of course have some overhead compared to native Arrays, but it makes up for it with lazy evaluation.
Also consider that the improvement isn't just in execution speed, but also in memory usage. The Array version creates a new array at every step (a call such as map returns a fresh array), so it will take up more memory.
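To make the lazy-evaluation point concrete, here is a minimal sketch in plain JavaScript. It uses generator functions as a stand-in for a lazy pipeline, so it is not the RxJS API, and the helper names (lazyMap, lazyFilter, lazyTake) are made up for this illustration. It counts how many input elements each version actually touches when only the first few results are needed.

```javascript
// Eager: each array method walks the whole input and allocates a new array.
function eagerFirstEvenSquares(numbers, count) {
  let visited = 0;
  const result = numbers
    .map((n) => { visited += 1; return n * n; }) // full intermediate array
    .filter((sq) => sq % 2 === 0)                // second intermediate array
    .slice(0, count);
  return { result, visited };
}

// Lazy: generators pull one element at a time through the whole pipeline.
function* lazyMap(source, fn) {
  for (const item of source) yield fn(item);
}

function* lazyFilter(source, predicate) {
  for (const item of source) {
    if (predicate(item)) yield item;
  }
}

function* lazyTake(source, count) {
  if (count <= 0) return;
  let taken = 0;
  for (const item of source) {
    yield item;
    taken += 1;
    if (taken >= count) return; // stops pulling from upstream
  }
}

function lazyFirstEvenSquares(numbers, count) {
  let visited = 0;
  const squares = lazyMap(numbers, (n) => { visited += 1; return n * n; });
  const evens = lazyFilter(squares, (sq) => sq % 2 === 0);
  const result = [...lazyTake(evens, count)];
  return { result, visited };
}

const input = Array.from({ length: 1000 }, (_, i) => i + 1); // 1..1000
const eager = eagerFirstEvenSquares(input, 3);
const lazy = lazyFirstEvenSquares(input, 3);
console.log(eager.result, eager.visited); // all 1000 inputs are squared
console.log(lazy.result, lazy.visited);   // only the inputs needed are squared
```

If the pipeline has to visit every element anyway, the lazy version has no such advantage and only the per-element overhead remains.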

Related

What are the main advantages in using promises in scheme?

Pragmatically, what are the main advantages of using promises? Can you show me some examples of real-life useful usage of promises?
In Scheme a promise is just a value that has a task that is not necessarily done yet and if you never use the value it will never be calculated. In short it is a way to do lazy evaluation in the otherwise eager Scheme. A typical way is to do computations on streams instead of lists.
With lists you can use higher-order functions: you take a list, filter it for the values you are interested in, transform these values, and perhaps at some point you have enough to produce the value you needed. This is nice since you can abstract each step, writing logic that only does one step and composing steps to make the whole program. But in this scenario the first step needs to finish in full before the next step can handle its result. If you are searching for the first prime number between 0 and 1000, iterating over all the numbers in each step is not very efficient. This is where streams come in.
With streams the code looks the same, but the intermediate results are produced on demand. A stream is a pair where the parts are promises, so the code that would otherwise make a pair is delayed until the values are used. Every step produces just enough data for the next step. So if the first step only needs to iterate over 20% of the elements for the last step to compute the final result, the remaining 80% will never be processed in any of the steps. With such a structure the initial stream can also be infinite, like all the numbers from 0 increasing by 1.
There are penalties involved in using streams. Imagine you make an algorithm that would visit all the elements anyway. Then a stream version of the algorithm would be slower, since creating the promises and forcing them gives the program overhead compared with doing the computation without laziness.
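As a rough illustration of the idea, here is a sketch in JavaScript rather than Scheme. The names delay, force, consStream, integersFrom and streamFilter are chosen for this example, and as a simplification only the tail of each pair is delayed. The "promise" here is a memoizing thunk in the Scheme sense, not a JavaScript Promise. The counter shows how little of the infinite stream is generated when looking for the first prime.

```javascript
// delay: wrap a computation so it runs at most once, and only when forced.
function delay(thunk) {
  let done = false;
  let value;
  return () => {
    if (!done) {
      value = thunk();
      done = true;
    }
    return value;
  };
}
const force = (promise) => promise();

// A stream is a head value plus a delayed tail; null is the empty stream.
const consStream = (head, tailThunk) => ({ head, tail: delay(tailThunk) });

let generated = 0; // how many stream cells have been created so far

// Infinite stream n, n + 1, n + 2, ...
function integersFrom(n) {
  generated += 1;
  return consStream(n, () => integersFrom(n + 1));
}

// Lazily keep only the elements that satisfy the predicate.
function streamFilter(predicate, stream) {
  let current = stream;
  while (current !== null && !predicate(current.head)) {
    current = force(current.tail); // skip forward one cell at a time
  }
  if (current === null) return null;
  const match = current;
  return consStream(match.head, () => streamFilter(predicate, force(match.tail)));
}

function isPrime(n) {
  if (n < 2) return false;
  for (let d = 2; d * d <= n; d += 1) {
    if (n % d === 0) return false;
  }
  return true;
}

const primes = streamFilter(isPrime, integersFrom(0));
const firstPrime = primes.head;
console.log(firstPrime, generated); // only the cells for 0, 1 and 2 were created
```

Forcing primes.tail would generate just one more cell, and forcing it a second time would reuse the memoized result.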
You might be interested in seeing Hal Abelson explaining streams and their pros and cons.
There are other alternatives to streams and lazy evaluation. One is generators. Here you can also make composable procedures that take a generator and produce a generator. The iteration will be by need, like with streams.
Another alternative would be transducers. These are also composable and iterate like streams and generators, but unlike with streams and generators the initial data cannot be an infinite sequence unless the underlying structure supports it.
The advantages of using promises or any other technique in this answer are not Scheme-specific. They hold for all eager programming languages!
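For readers who have not met transducers, here is a small sketch in JavaScript. The helpers mapping, filtering and composeTransducers are written for this example and do not come from any particular library. A transducer takes a reducing function and returns a new one, so the composed steps run in a single pass without building intermediate collections. This sketch leaves out early termination, which real transducer libraries usually support.

```javascript
// A transducer: reducing function in, reducing function out.
const mapping = (fn) => (reducer) => (acc, item) => reducer(acc, fn(item));

const filtering = (predicate) => (reducer) => (acc, item) =>
  predicate(item) ? reducer(acc, item) : acc;

// Compose so that data flows through the transducers from left to right.
const composeTransducers = (...xforms) => (reducer) =>
  xforms.reduceRight((next, xform) => xform(next), reducer);

// Two different final reducers that the same transformation can feed.
const append = (acc, item) => {
  acc.push(item);
  return acc;
};
const add = (acc, item) => acc + item;

// Keep odd numbers, then multiply them by 10.
const oddTimesTen = composeTransducers(
  filtering((n) => n % 2 === 1),
  mapping((n) => n * 10)
);

const asArray = [1, 2, 3, 4, 5].reduce(oddTimesTen(append), []);
const asSum = [1, 2, 3, 4, 5].reduce(oddTimesTen(add), 0);
console.log(asArray, asSum);
```

The transformation is defined independently of both the source and the way results are collected, which is what makes it reusable across different underlying structures.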

Efficient Data Structures in Maple

I'm working with a large amount of data in Maple and I need to know the most efficient way to store it. I started with lists, but I quickly learned how inefficient those are so I have since replaced them. Now I'm using a mixture of Arrays (for structures with a fixed length) and tables (for structures with variable length), but my code actually runs significantly slower than it did when I was only using lists.
So here are my questions:
What is the most efficient data structure to use in Maple for a static-length set of data? for a variable-length set?
Are there any "gotchas" I need to be aware of when using these structures as parameters in a recursive proc? If using Arrays or tables, does each one need to be copied for each iteration to avoid clobbering data?
I think I can wrap this one up now. I made a few performance improvements, mostly just small tweaks that only helped a bit, but I did manage a big improvement by removing as many instances of the copy command as I could (I used it on arrays and tables). It turns out this is what was causing my array/table implementation to be slower than my list-only implementation. But the code still didn't run as fast as I needed, so I re-wrote it in C#. That's probably not the best solution for "how to improve Maple efficiency", but it sure does run a lot faster now.

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:
Is it better to implement a parallel sorting algorithm, on the GPU, and run it across all the arrays
OR
implement a single thread algorithm, like quicksort, and assign each thread, of the GPU, a different array.
Obviously speed is the most important factor. For a single-thread sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.
The data type to be sorted is long, so I don't believe a radix sort would be possible, because the binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread.) There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing, minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism might be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of SIMT architecture.

A couple of CUDA-performance questions

This is the first time I've asked a question here, so thanks very much in advance and please forgive my ignorance. Also, I've just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pair-wise distances. Currently my kernel function just holds on to one point, iteratively reads in all the other points (from global memory), and does the calculation. Here are some things that confuse me:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves a much higher parallelism (about 600x faster than kernel<<<1,1,1>>>). Is it possibly due to multithreading or pipelining, or do those actually refer to the same thing?
I want to further improve the performance, so I figured I would use shared memory to hold some input points for each multiprocessing block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that I set too many threads?
Or is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like (if dist < 200). How much should I worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined!
Code optimizations (loop unrolling): 2.54x speedup, combined
Having an extra operations statement does indeed cause problems, although it will be the last thing you want to optimize, if only because you need to know the layout of your code before implementing the size assumptions!
The problem you are working on sounds like the famous n-body problem; see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a pairwise computation at all, for example when the elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes, have each element put itself into a box via division, and then only evaluate pairwise relations between neighboring boxes. This can be called O(n*m).
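The box idea can be sketched on the CPU as follows. This is plain JavaScript for illustration only, not CUDA, and the function names, the 2D point layout and the sample points are all assumptions made for this example. The box size is set equal to the cutoff distance, so any two points closer than the cutoff must lie in the same box or in adjacent boxes.

```javascript
// Count pairs of 2D points closer than `cutoff`, comparing only points
// that fall into the same or neighbouring grid boxes.
function countClosePairsGrid(points, cutoff) {
  const boxes = new Map();
  const boxOf = (p) => [Math.floor(p.x / cutoff), Math.floor(p.y / cutoff)];
  const keyOf = (bx, by) => `${bx},${by}`;

  // Each point puts itself into a box via division.
  points.forEach((p, index) => {
    const [bx, by] = boxOf(p);
    const key = keyOf(bx, by);
    if (!boxes.has(key)) boxes.set(key, []);
    boxes.get(key).push(index);
  });

  let count = 0;
  points.forEach((p, i) => {
    const [bx, by] = boxOf(p);
    for (let dx = -1; dx <= 1; dx += 1) {
      for (let dy = -1; dy <= 1; dy += 1) {
        const bucket = boxes.get(keyOf(bx + dx, by + dy)) || [];
        for (const j of bucket) {
          if (j <= i) continue; // count each unordered pair once
          const ddx = points[j].x - p.x;
          const ddy = points[j].y - p.y;
          if (ddx * ddx + ddy * ddy < cutoff * cutoff) count += 1;
        }
      }
    }
  });
  return count;
}

// Reference version: check every pair.
function countClosePairsBruteForce(points, cutoff) {
  let count = 0;
  for (let i = 0; i < points.length; i += 1) {
    for (let j = i + 1; j < points.length; j += 1) {
      const ddx = points[j].x - points[i].x;
      const ddy = points[j].y - points[i].y;
      if (ddx * ddx + ddy * ddy < cutoff * cutoff) count += 1;
    }
  }
  return count;
}

const samplePoints = [
  { x: 0, y: 0 }, { x: 50, y: 0 }, { x: 150, y: 0 },
  { x: 400, y: 400 }, { x: 450, y: 450 }, { x: 1000, y: 1000 },
];
console.log(countClosePairsGrid(samplePoints, 200));
console.log(countClosePairsBruteForce(samplePoints, 200));
```

Both functions should agree on the count; the grid version simply skips pairs that cannot possibly be within the cutoff.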
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So for each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only a single one of those threads. If there is a lot of code in the conditional as compared to the rest of your kernel, and relatively few threads go through that code path, it is likely that you end up with low compute throughput.
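A toy model of point (3), written in plain JavaScript rather than CUDA, may help build intuition. It assumes a warp of 32 threads steps through the if body whenever at least one of its threads takes the branch, and skips the body otherwise. The function name and the utilization measure are made up for this sketch, and real hardware behaviour has more detail than this.

```javascript
const WARP_SIZE = 32;

// takesBranch[i] is true when thread i enters the `if` body.
// Returns how many warps execute the body and what fraction of the
// lanes in those warps do useful work there.
function branchLaneUtilization(takesBranch) {
  let activeWarps = 0;
  let usefulLanes = 0;
  for (let start = 0; start < takesBranch.length; start += WARP_SIZE) {
    const warp = takesBranch.slice(start, start + WARP_SIZE);
    const takers = warp.filter(Boolean).length;
    if (takers > 0) {
      activeWarps += 1;      // the whole warp steps through the body
      usefulLanes += takers; // only these lanes produce results
    }
  }
  const issuedLanes = activeWarps * WARP_SIZE;
  const utilization = issuedLanes === 0 ? 1 : usefulLanes / issuedLanes;
  return { activeWarps, utilization };
}

// 64 threads, exactly one thread per warp takes the branch.
const scattered = new Array(64).fill(false);
scattered[0] = true;
scattered[32] = true;

// 64 threads, all takers packed into the first warp.
const packed = new Array(64).fill(false).map((_, i) => i < 32);

console.log(branchLaneUtilization(scattered));
console.log(branchLaneUtilization(packed));
```

In this model the scattered case keeps both warps busy in the body for the sake of two threads, while the packed case runs the body in one fully used warp.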

How does SIMD behave in this case?

I am using an engine that allows SIMD code to be written, and it performs fast. But there is only one block that has all the code.
I understand that this code is run independently on each entity concurrently, but when there is only 1 thing changing, is it still faster to calculate it regardless? Is this the idea with SIMD, parallelism?
For instance:
void simdFunction ()
{
    center = mesh.center(); // always the same
    vert.pos.x = center.x;  // run on each vertex
}
In this case, the center is always the same, so will it be calculated for each vertex on SIMD? If so, is this still efficient?
Basically, does being able to run this in parallel outweigh the cost of calculating it regardless, in the general SIMD programming sense?
"this code is run independently on each entity concurrently"
No, that's not how SIMD works.
With SIMD, all arithmetic units are working in lock-step, performing identical operations. There's no independence whatsoever.
Generally though, you're better off computing shared constants just once, in sequential code. That way the SIMD engine will spend less time on each slice of vertices.
The exception would be if the computation is short, the SIMD is a co-processor (like GPGPU), and the data is already in that co-processor. Then computing it using SIMD might easily beat moving the data to the sequential processor and back.
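The advice to compute shared constants once can be sketched like this in plain JavaScript. The makeMesh helper and its call counter are hypothetical and exist only to make the difference observable; they are not part of any engine's API. The loop body stands for the per-vertex part that a SIMD engine would run.

```javascript
// Hypothetical mesh whose center() counts how often it is called.
function makeMesh(xs) {
  let centerCalls = 0;
  const vertices = xs.map((x) => ({ pos: { x } }));
  return {
    vertices,
    center() {
      centerCalls += 1;
      const total = vertices.reduce((sum, v) => sum + v.pos.x, 0);
      return { x: total / vertices.length };
    },
    centerCallCount: () => centerCalls,
  };
}

// The shared value is computed once in sequential code,
// then only the cheap assignment runs for each vertex.
function snapToCenter(mesh) {
  const center = mesh.center(); // always the same for every vertex
  for (const vert of mesh.vertices) {
    vert.pos.x = center.x; // per-vertex work
  }
}

const mesh = makeMesh([0, 2, 4, 10]);
snapToCenter(mesh);
console.log(mesh.vertices.map((v) => v.pos.x), mesh.centerCallCount());
```

Calling center() inside the loop instead would repeat the same reduction for every vertex.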
