I am writing a new algorithm for my finite element analysis code in C++. Suppose I am working on a 64×64 square mesh that has four surfaces (surface A, B, C, D), and I decompose it into four sub-domains; it happens that the cells of surface A may end up scattered across several of them.
My question is fairly general: I want to calculate a variable on this 64×64 grid, but the variable depends on the coordinate information of the entire surface A, and I found that some sub-domains contain no part of A at all. My code and the new algorithm run very well in serial, but fail immediately in parallel.
Any advice or strategies for handling this kind of situation?
Parallel processing can be tricky. If you are using an algorithm based on the shared-memory paradigm, you need to protect this memory region or you will have a race condition.
Other paradigms, like distributed memory, have other kinds of strategies to handle this, mostly message passing.
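For the concrete case in the question: with MPI, the usual fix is to gather the scattered surface-A data onto every rank before the computation that needs it. A minimal sketch, assuming each rank keeps the coordinates of the surface-A cells it owns in a flat array (names like localA are placeholders, not from your code):

    #include <mpi.h>
    #include <vector>

    // Every rank contributes its (possibly empty) piece of surface A
    // and gets back the full set of coordinates.
    std::vector<double> gatherSurfaceA(const std::vector<double>& localA,
                                       MPI_Comm comm) {
        int nRanks = 0;
        MPI_Comm_size(comm, &nRanks);

        // 1) Announce how many doubles this rank contributes
        //    (zero is fine for sub-domains that own no part of A).
        int myCount = static_cast<int>(localA.size());
        std::vector<int> counts(nRanks);
        MPI_Allgather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

        // 2) Compute displacements and gather the variable-length pieces.
        std::vector<int> displs(nRanks, 0);
        for (int r = 1; r < nRanks; ++r)
            displs[r] = displs[r - 1] + counts[r - 1];
        int total = displs[nRanks - 1] + counts[nRanks - 1];

        std::vector<double> globalA(total);
        MPI_Allgatherv(localA.data(), myCount, MPI_DOUBLE,
                       globalA.data(), counts.data(), displs.data(),
                       MPI_DOUBLE, comm);
        return globalA;   // every rank now sees all of surface A
    }

After this call each sub-domain can evaluate the variable locally, since the dependency on the whole of surface A has been satisfied.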
I've got OpenMP and MPI at my disposal, and was wondering if anyone has come across a parallel version of any flood fill algorithm (preferably in C). If not, I'd be interested in sketches of how to parallelise it - is it even possible, given that it's based on recursion?
Wikipedia's got a pretty good article if you need to refresh your memory on flood fills.
Many thanks for your help.
There's nothing "inherently" recursive about flood fill; it's just that to do some work, you need some information about previously discovered "frontier" cells. If you think of it that way, it's clear that parallelism is eminently possible: even with a single queue, you could use four threads (one for each direction), and only move the tail of the queue when the cell has been examined by each thread - or, equivalently, four queues. Thinking in this way, one might even imagine partitioning the space into multiple queues, bucketed by coordinate ranges, perhaps.
One basic problem is that the problem definition usually includes the proviso that no cell is ever revisited. This implies that each worker needs an up-to-date map of which cells have been considered (globally). Mutable global information is problematic, performance-wise, though it's not hard to think of ways to limit the necessity for propagating updates globally.
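Here is a minimal sketch of the frontier idea with OpenMP, assuming a flat 2D grid; the atomic visited flags are the "global map" mentioned above, and each pass expands the whole frontier in parallel:

    #include <atomic>
    #include <vector>

    // Level-synchronous flood fill: each pass processes the current
    // frontier in parallel; a per-cell atomic flag guarantees no cell
    // is ever claimed twice. Grid layout and types are placeholders.
    void parallelFloodFill(std::vector<int>& grid, int w, int h,
                           int sx, int sy, int target, int fill) {
        std::vector<std::atomic<char>> visited(w * h);
        std::vector<int> frontier = { sy * w + sx };
        visited[sy * w + sx] = 1;

        while (!frontier.empty()) {
            std::vector<int> next;
            #pragma omp parallel
            {
                std::vector<int> local;            // per-thread frontier
                #pragma omp for nowait
                for (long i = 0; i < (long)frontier.size(); ++i) {
                    int c = frontier[i];
                    grid[c] = fill;
                    int x = c % w, y = c / w;
                    const int  nb[4] = { c - 1, c + 1, c - w, c + w };
                    const bool ok[4] = { x > 0, x < w - 1, y > 0, y < h - 1 };
                    for (int k = 0; k < 4; ++k) {
                        if (!ok[k]) continue;
                        char expected = 0;         // claim the cell once
                        if (!visited[nb[k]].compare_exchange_strong(expected, 1))
                            continue;
                        if (grid[nb[k]] == target) // only fillable cells spread
                            local.push_back(nb[k]);
                    }
                }
                #pragma omp critical
                next.insert(next.end(), local.begin(), local.end());
            }
            frontier.swap(next);
        }
    }

This is iterative rather than recursive, which also sidesteps the stack-depth problems of the textbook formulation.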
The answer would seem to be no, because raymarching is highly conditional, i.e. each ray follows a unique execution path: on each step we check for opacity, termination, etc., and these will vary based on the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your vector math using SSE instructions (be aware of the effects discussed here - they also apply to the other approach). This approach would allow the code to stay concise and maintainable.
I assume, however, that your question is about "packet traversal" (or something like it) - in other words, processing multiple scalar values, each belonging to a different ray:
In principle it should be possible to defer the shading to another pass. The SIMD packet could be repopulated with a new ray once the bare marching pass terminates, and the temporary result is stored as input for the shading pass. This allows you to parallelize a certain, case-dependent percentage of your code, exploiting all four SIMD lanes.
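A structural sketch of that repopulation loop, with the per-lane arithmetic left scalar for readability (real code would replace the inner loop with SSE intrinsics; stepRay and the types are stand-ins, not from any particular implementation):

    #include <cstddef>
    #include <vector>

    struct Ray { float t = 0.f; /* origin, direction, ... */ };
    struct Hit { int pixel; float t; };

    // Stand-in for one march step: advance the ray, report whether it
    // is still marching (false = terminated, ready for shading).
    static bool stepRay(Ray& r) { r.t += 0.1f; return r.t < 10.f; }

    void marchAll(const std::vector<Ray>& rays, const std::vector<int>& pixels,
                  std::vector<Hit>& hitsForShading) {
        Ray lane[4]; int lanePixel[4] = {}; bool live[4] = {};
        std::size_t next = 0;

        auto refill = [&](int l) {        // pull the next ray into lane l
            if (next < rays.size()) {
                lane[l] = rays[next]; lanePixel[l] = pixels[next];
                live[l] = true; ++next;
            } else live[l] = false;
        };
        for (int l = 0; l < 4; ++l) refill(l);

        while (live[0] || live[1] || live[2] || live[3]) {
            for (int l = 0; l < 4; ++l) { // one SSE step in real code
                if (!live[l]) continue;
                if (!stepRay(lane[l])) {  // lane finished: defer shading
                    hitsForShading.push_back({lanePixel[l], lane[l].t});
                    refill(l);            // keep all four lanes occupied
                }
            }
        }
    }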
Tiling the image and indexing the rays within it in Morton order might be a good idea too, in order to avoid cache pressure (unless your geometry is strictly procedural).
You won't know whether it pays off unless you try. My guess is that if it does, the amount of speedup might not be worth the complication of the code for just four lanes.
Have you considered using a SIMT architecture such as a programmable GPU? A reasonably up-to-date programmable graphics board allows you to perform raymarching at interactive rates (see it happen in your browser here).
Over the last few days I built a software-based raymarcher for a Menger sponge. At the moment it doesn't use SIMD, and I used no special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. Then I have a camera position and a destination, which I use to calculate the increment vector for the raymarch.
After that I run a constant number of iterations, in which only one branch decides whether there's an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t. If I find it, or reach a maximal distance, I break the loop. At the end I have t; from that I calculate the fragment color.
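In rough code, the loop looks like this (a simplified sketch; insideFractal is a stand-in for my actual fractal test, here just a unit sphere so the snippet is self-contained):

    struct Vec3 { float x, y, z; };

    // Stand-in for the fractal membership test.
    static bool insideFractal(const Vec3& p) {
        return p.x * p.x + p.y * p.y + p.z * p.z < 1.0f;
    }

    // March from eye E along direction D; return the smallest t, or -1.
    float raymarch(const Vec3& E, const Vec3& D) {
        const int   STEPS = 256;      // constant iteration count
        const float DT    = 0.05f;    // fixed step size
        float t = 0.0f;
        for (int i = 0; i < STEPS; ++i, t += DT) {
            Vec3 p = { E.x + t * D.x, E.y + t * D.y, E.z + t * D.z };
            if (insideFractal(p)) return t;  // the single branch in the loop
        }
        return -1.0f;                        // maximal distance reached
    }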
In my opinion it should be possible to parallelize these operations with SSE1/2, because one can resolve the branch by zeroing out that lane of the vector (__m64 / __m128), so further SIMD operations won't apply to it. It really depends on what you raymarch/-cast, but if you just calculate a fragment color from a function (like my fractal curve here) and don't access memory non-linearly, there are some tricks to make it possible.
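To make the zeroing idea concrete, here is a minimal sketch of four rays marching at once with SSE1 (inside4 is a placeholder SIMD version of the fractal test, again a unit sphere). Lanes that have hit stop advancing because their step is masked to zero:

    #include <xmmintrin.h>

    // Stand-in hit test: 0xffffffff in each lane whose point is inside.
    static inline __m128 inside4(__m128 px, __m128 py, __m128 pz) {
        __m128 d2 = _mm_add_ps(_mm_mul_ps(px, px),
                    _mm_add_ps(_mm_mul_ps(py, py), _mm_mul_ps(pz, pz)));
        return _mm_cmplt_ps(d2, _mm_set1_ps(1.0f));
    }

    // Four rays, structure-of-arrays; returns per-lane hit distance t
    // (lanes that hit keep the t they had at the moment of the hit).
    __m128 raymarch4(__m128 ex, __m128 ey, __m128 ez,
                     __m128 dx, __m128 dy, __m128 dz) {
        const __m128 step = _mm_set1_ps(0.05f);
        __m128 t   = _mm_setzero_ps();
        __m128 hit = _mm_setzero_ps();            // accumulated hit mask
        for (int i = 0; i < 256; ++i) {
            __m128 px = _mm_add_ps(ex, _mm_mul_ps(t, dx));
            __m128 py = _mm_add_ps(ey, _mm_mul_ps(t, dy));
            __m128 pz = _mm_add_ps(ez, _mm_mul_ps(t, dz));
            hit = _mm_or_ps(hit, inside4(px, py, pz));
            // advance t only in lanes that have not hit yet
            t = _mm_add_ps(t, _mm_andnot_ps(hit, step));
            if (_mm_movemask_ps(hit) == 0xf) break;  // all four lanes done
        }
        return t;
    }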
Sure, this answer contains speculation, but I will keep you informed when I've parallelized this routine.
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.
I am writing a paper to test a new application that will demonstrate the benefits of parallelized computation (compared to the traditional serialized version of this application). I want to use the canonical examples for parallel computation in my paper.
My first example is the parallel computation of pi. I would ideally like an example where each iteration is very time consuming (because of the additional overhead associated with parallelizing); my first thought is a Bayesian simulation with MCMC and Gibbs sampling.
What other problems are typically discussed in this context? What are good examples of large embarrassingly parallel problems?
Just a few more:
Multiplying matrices
Inverting matrices
FFT
String matching
Rendering 3D scenes (via scanline conversion or ray tracing)
One example I've used in the past of an embarrassingly parallel problem is visualizing the Mandelbrot set. Each pixel can be computed independently.
Conway's Life is interesting as well, in that each value of the "next" board can be computed independently, but will depend on the relevant bits of the "current" board being done already.
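A minimal sketch of the Mandelbrot case with OpenMP (bounds and iteration count are arbitrary): since no pixel depends on another, a single pragma is the entire parallelization.

    #include <complex>
    #include <vector>

    std::vector<int> mandelbrot(int w, int h, int maxIter = 256) {
        std::vector<int> img(w * h);
        #pragma omp parallel for schedule(dynamic)  // rows vary in cost
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                std::complex<double> c(-2.0 + 3.0 * x / w,
                                       -1.5 + 3.0 * y / h), z = 0;
                int n = 0;
                while (n < maxIter && std::norm(z) <= 4.0) { z = z * z + c; ++n; }
                img[y * w + x] = n;                 // iteration count = color
            }
        }
        return img;
    }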
I would suggest that canonical examples of parallel computation and embarrassingly parallel problems are, if not completely, then nearly, disjoint sets. To put it another way, people working in parallel computation aren't terribly excited about embarrassingly parallel problems; we call them that because we'd be embarrassed to be working on them.
I'd be looking, if I were you, at these (a not entirely original list):
linear algebra on large dense matrices, both direct and iterative approaches;
linear algebra on huge sparse matrices;
branch and bound approaches to linear programming (and related) problems;
sequence matching for bioinformatics (outside my field, I may have mis-expressed this);
continuous optimisation.
I expect there are many more.
EDIT: You may be interested in this list of problems which have been selected for benchmarking the next generation of European (academic) supercomputers. It will give you some idea of where that niche is heading.
Molecular dynamics simulations allow you to change the size of the problem until your computer resources are exhausted (i.e. 256 particles vs. 256,000,000 particles). It's truly a "canonical" example if you run the MD simulations under NVT conditions ;-)
My favorite example is Monte Carlo simulation.
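For instance, the classic Monte Carlo estimate of pi mentioned in the question: darts in the unit square, counting those inside the quarter circle. A minimal OpenMP sketch; the only coordination the parallel version needs is a reduction over the hit counter:

    #include <cstdio>
    #include <omp.h>
    #include <random>

    double estimatePi(long samples) {
        long inside = 0;
        #pragma omp parallel reduction(+ : inside)
        {
            // independent random stream per thread
            std::mt19937_64 rng(12345u ^ (unsigned long long)omp_get_thread_num());
            std::uniform_real_distribution<double> u(0.0, 1.0);
            #pragma omp for
            for (long i = 0; i < samples; ++i) {
                double x = u(rng), y = u(rng);
                if (x * x + y * y <= 1.0) ++inside;
            }
        }
        return 4.0 * inside / samples;
    }

    int main() { std::printf("pi ~ %f\n", estimatePi(10000000L)); }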
Word counting seems to be the canonical example for MapReduce.
http://en.wikipedia.org/wiki/MapReduce#Example
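A minimal sketch of that example with the two phases made explicit (sequential here; in a real MapReduce the pairs would be partitioned across workers between the phases):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::string input = "the quick brown fox jumps over the lazy dog the end";

        // map: emit one (word, 1) pair per token
        std::vector<std::pair<std::string, int>> pairs;
        std::istringstream in(input);
        for (std::string w; in >> w; ) pairs.push_back({w, 1});

        // shuffle + reduce: group by key and sum the counts
        std::map<std::string, int> counts;
        for (const auto& p : pairs) counts[p.first] += p.second;

        for (const auto& kv : counts)
            std::cout << kv.first << ": " << kv.second << "\n";
    }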
Finding collisions in hash functions using Paul C. van Oorschot and Michael J. Wiener's method (PDF) comes up often in various cryptographic settings.
I used the Mandelbrot set demo to explain to my mom what parallel programming is about: http://www.ateji.com/px/demo.html
All the examples you mention are mostly heavy data-parallel codes. You'll probably want to mention also task-oriented codes, such as servers responding to many requests in parallel, and data-flow or stream programming examples (MapReduce is a good representative of this class).
First, I am sorry for this rough question, but I don't want to introduce too many details, so I just ask for related resources like articles, libraries or tips.
My program needs to do intensive computation of ray-triangle intersections (there are millions of rays and triangles), and my goal is to make it as fast as I can.
What I have done is:
Use the fastest ray-triangle algorithm that I know.
Use an octree (from Game Programming Gems 1, sections 4.10, 4.11).
Use "An Efficient and Robust Ray–Box Intersection Algorithm", which is used in the octree algorithm.
It is faster than before I applied those better algorithms, but I believe it could be faster still. Could you please shed light on any other places that could make it faster?
Thanks.
The place to ask these questions is ompf2.com, a forum with topics about realtime (although also non-realtime) raytracing.
OMPF forum is the right place for this question, but since I'm here today...
Don't use a ray/box intersection for octree traversal. You may use it for the root node of the tree, but that's it. Once you know the distance to the entry and exit of the root box, you can calculate the distances to the x, y, and z partition planes - the planes that subdivide the box. If the distances to the front and back are f and b respectively, then you can determine which child nodes of the box are hit by analyzing the f, b, x, y, z distances. You can also determine the order in which to traverse the child nodes and completely reject many of them.
At most 4 of the children can be hit since the ray starts in one octant and only changes octants when it crosses one of the 3 partition planes.
Also, since it becomes recursive you'll be needing the entry and exit distances for the child nodes. These distances are chosen from the set (f,b,x,y,z) which you've already computed.
I have been optimizing this for a very long time, and can safely say you have about an order of magnitude performance still on the table for trees many levels deep. I started right where you are now.
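A rough sketch of the interval idea, not the code I use: the crossings of the three partition planes split [f, b] into at most four sub-intervals, each lying in exactly one child, already in front-to-back order.

    #include <algorithm>

    struct Vec3 { float x, y, z; };

    // Octant of the point at parameter t along o + t*d, relative to the
    // node centre c; bit 0 = +x side, bit 1 = +y side, bit 2 = +z side.
    static int octantAt(const Vec3& o, const Vec3& d, float t, const Vec3& c) {
        return ((o.x + t * d.x > c.x) ? 1 : 0) |
               ((o.y + t * d.y > c.y) ? 2 : 0) |
               ((o.z + t * d.z > c.z) ? 4 : 0);
    }

    // Visit the children pierced by the ray, front to back. f and b are
    // the node's entry/exit distances, already known from the parent.
    template <typename Visit>   // Visit(childIndex, tNear, tFar)
    void traverseChildren(const Vec3& o, const Vec3& d, const Vec3& centre,
                          float f, float b, Visit visit) {
        // distances at which the ray crosses the three partition planes
        // (a zero direction component yields inf/NaN and is filtered below)
        const float tp[3] = { (centre.x - o.x) / d.x,
                              (centre.y - o.y) / d.y,
                              (centre.z - o.z) / d.z };
        // crossings inside (f, b) cut the interval into at most 4 pieces
        float cuts[5]; int n = 0;
        cuts[n++] = f;
        for (float t : tp) if (t > f && t < b) cuts[n++] = t;
        cuts[n++] = b;
        std::sort(cuts, cuts + n);
        for (int i = 0; i + 1 < n; ++i) {
            float mid = 0.5f * (cuts[i] + cuts[i + 1]);  // sample the piece
            visit(octantAt(o, d, mid, centre), cuts[i], cuts[i + 1]);
        }
    }

The recursion then feeds each child its (tNear, tFar) pair, so no further ray/box tests are ever needed below the root.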
There are several optimizations you can do, but all of them depend on the exact domain of your problem. As far as general algorithms go, you are on the right track. Depending on the domain, you could:
Introduce a portal system
Move the calculations to a GPU and take advantage of parallel computation
A quite popular trend in raytracing recently is bounding volume hierarchies (BVHs).
You've already gotten a good start using a spatial sort coupled with fast intersection algorithms. For tracing single rays at a time, one of the best structures out there (for static scenes) is a K-d tree built using the Surface Area Heuristic.
However, for truly high-speed ray tracing you need to take advantage of:
Coherent packets of rays
Frusta
SIMD
I would suggest you start with "Ray Tracing Animated Scenes using Coherent Grid Traversal". It gives an easy-to-follow example of such a modern approach. You can also follow the references to see how these ideas are applied to K-d trees and BVHs.
On the same page, also check out "State of the Art in Ray Tracing Animated Scenes".
Another great set of resources is the SIGGRAPH publications over the years. This is a very competitive conference, so these papers tend to be top-notch.
Finally, if you're willing to use existing code, check out the project page for OpenRT.
A useful resource I've seen is the Journal of Graphics Tools. Depending on your scenes, a BVH might be more appropriate than an octree.
Also, if you haven't looked at your performance with a profiler, then you should. Shark is great on OS X, and I've gotten good results with Very Sleepy on Windows.
I'm writing a comparatively straightforward raytracer/path tracer in D (http://dsource.org/projects/stacy), but even with full optimization it still needs several thousand processor cycles per ray. Is there anything else I can do to speed it up? More generally, do you know of good optimizations / faster approaches for ray tracing?
Edit: this is what I'm already doing.
Code is already running highly parallel
temporary data is structured in a cache-efficient fashion as well as aligned to 16b
Screen divided into 32x32-tiles
Destination array is arranged in such a way that all subsequent pixels in a tile are sequential in memory
Basic scene graph optimizations are performed
Common combinations of objects (plane-plane CSG as in boxes) are replaced with preoptimized objects
Vector struct capable of taking advantage of GDC's automatic vectorization support
Subsequent hits on a ray are found via lazy evaluation; this prevents needless calculations for CSG
Triangles are neither supported nor a priority - plain primitives only, plus CSG operations and basic material properties
Bounding is supported
The typical first order improvement of raytracer speed is some sort of spatial partitioning scheme. Based only on your project outline page, it seems you haven't done this.
Probably the most usual approach is an octree, but the best approach may well be a combination of methods (e.g. spatial partitioning trees and things like mailboxing). Bounding box/sphere tests are a quick, cheap and nasty approach, but you should note two things: 1) they don't help much in many situations and 2) if your objects are already simple primitives, you aren't going to gain much (and might even lose). You can implement a regular grid for spatial partitioning more easily than an octree, but it will only work really well for scenes that are somewhat uniformly distributed (in terms of surface locations).
A lot depends on the complexity of the objects you represent, your internal design (i.e. do you allow local transforms, referenced copies of objects, implicit surfaces, etc), as well as how accurate you're trying to be. If you are writing a global illumination algorithm with implicit surfaces the tradeoffs may be a bit different than if you are writing a basic raytracer for mesh objects or whatever. I haven't looked at your design in detail so I'm not sure what, if any, of the above you've already thought about.
Like any performance optimization process, you're going to have to measure first to find where you're actually spending the time, then improve things (algorithmically by preference, then code bumming by necessity).
One thing I learned with my ray tracer is that a lot of the old rules don't apply anymore. For example, many ray tracing algorithms do a lot of testing to get an "early out" of a computationally expensive calculation. In some cases, I found it was much better to eliminate the extra tests and always run the calculation to completion. Arithmetic is fast on a modern machine, but a missed branch prediction is expensive. I got something like a 30% speed-up on my ray-polygon intersection test by rewriting it with minimal conditional branches.
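To illustrate the point, here is a Möller–Trumbore style ray/triangle test written in that spirit (a sketch, not my actual code): everything is computed unconditionally, and a single combined predicate decides at the end.

    #include <cmath>

    struct Vec3 {
        float x, y, z;
        Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    };
    static float dot(const Vec3& a, const Vec3& b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }
    static Vec3 cross(const Vec3& a, const Vec3& b) {
        return { a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x };
    }

    // *t is valid only when the return value is true.
    bool intersect(const Vec3& orig, const Vec3& dir,
                   const Vec3& v0, const Vec3& v1, const Vec3& v2, float* t) {
        const float EPS = 1e-7f;
        Vec3 e1 = v1 - v0, e2 = v2 - v0;
        Vec3 pv = cross(dir, e2);
        float det = dot(e1, pv);
        float inv = 1.0f / det;      // may overflow; the predicate rejects it
        Vec3 tv = orig - v0;
        float u = dot(tv, pv) * inv;
        Vec3 qv = cross(tv, e1);
        float v = dot(dir, qv) * inv;
        *t = dot(e2, qv) * inv;
        // everything computed unconditionally; one combined test at the end
        return std::fabs(det) > EPS && u >= 0.f && v >= 0.f
            && u + v <= 1.f && *t > EPS;
    }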
Sometimes the best approach is counter-intuitive. For example, I found that many scenes with a few large objects ran much faster when I broke them down into a large number of smaller objects. Depending on the scene geometry, this can allow your spatial subdivision algorithm to throw out a lot of intersection tests. And let's face it, intersection tests can be made only so fast. You have to eliminate them to get a significant speed-up.
Hierarchical bounding volumes help a lot, but I finally grokked the kd-tree, and got a HUGE increase in speed. Of course, building the tree has a cost that may make it prohibitive for real-time animation.
Watch for synchronization bottlenecks.
You've got to profile to be sure to focus your attention in the right place.
Is there anything else I can do to speed it up?
D, depending on the implementation and compiler, puts forth reasonably good performance. As you haven't explained what ray tracing methods and optimizations you're using already, I can't give you much help there.
The next step, then, is to run a timing analysis on the program, and recode the most frequently used code, or the slowest code that impacts performance the most, in assembly.
More generally, check out the resources in these questions:
Literature and Tutorials for Writing a Ray Tracer
Anyone know of a really good book about Ray Tracing?
Computer Graphics: Raytracing and Programming 3D Renders
raytracing with CUDA
I really like the idea of using a graphics card (a massively parallel computer) to do some of the work.
There are many other raytracing related resources on this site, some of which are listed in the sidebar of this question, most of which can be found in the raytracing tag.
I don't know D at all, so I'm not able to look at the code and find specific optimizations, but I can speak generally.
It really depends on your requirements. One of the simplest optimizations is just to reduce the number of reflections/refractions that any particular ray can follow, but then you start to lose out on the "perfect result".
Raytracing is also an "embarrassingly parallel" problem, so if you have the resources (such as a multi-core processor), you could look into calculating multiple pixels in parallel.
Beyond that, you'll probably just have to profile and figure out what exactly is taking so long, then try to optimize that. Is it the intersection detection? Then work on optimizing the code for that, and so on.
Some suggestions.
Use bounding objects to fail fast.
Project the scene as a first step (as common graphics cards do) and use raytracing only for light calculations.
Parallelize the code.
Raytrace every other pixel and get the color in between by interpolation. If the colors vary greatly (you are on the edge of an object), raytrace the pixel in between as well (see the sketch after this list). It is cheating, but on simple scenes it can almost double the performance while sacrificing some image quality.
Render the scene on GPU, then load it back. This will give you the first ray/scene hit at GPU speeds. If you do not have many reflective surfaces in the scene, this would reduce most of your work to plain old rendering. Rendering CSG on GPU is unfortunately not completely straightforward.
Read source code of PovRay for inspiration. :)
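A rough sketch of the every-other-pixel trick from the list (checkerboard variant; trace and the threshold are placeholders, and the image border is left to the full tracer):

    #include <cmath>
    #include <vector>

    // Stand-in for the full per-pixel raytrace.
    static float trace(int x, int y) { return float((x * x + y * y) % 7); }

    std::vector<float> renderCheap(int w, int h, float threshold) {
        std::vector<float> img(w * h, 0.f);
        for (int y = 0; y < h; ++y)               // pass 1: half the pixels,
            for (int x = (y & 1); x < w; x += 2)  // checkerboard pattern
                img[y * w + x] = trace(x, y);

        // pass 2: fill the holes from their four traced neighbours
        for (int y = 1; y + 1 < h; ++y) {
            for (int x = (y & 1) ? 2 : 1; x + 1 < w; x += 2) {
                float l = img[y * w + x - 1],   r = img[y * w + x + 1];
                float u = img[(y - 1) * w + x], d = img[(y + 1) * w + x];
                float lo = std::fmin(std::fmin(l, r), std::fmin(u, d));
                float hi = std::fmax(std::fmax(l, r), std::fmax(u, d));
                img[y * w + x] = (hi - lo > threshold)
                    ? trace(x, y)                 // neighbours disagree: edge
                    : 0.25f * (l + r + u + d);    // flat area: interpolate
            }
        }
        return img;
    }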
You first have to make sure that you use very fast algorithms (implementing them can be a real pain, but it depends on what you want to do, how far you want to go and how fast it should be; it's a kind of tradeoff).
Some more hints from me:
- Don't use mailboxing techniques; papers sometimes discuss that they don't scale that well on current architectures because of the counting overhead.
- Don't use BSP trees/octrees; they are relatively slow.
- Don't use the GPU for raytracing; it is far too slow for advanced effects like reflection, shadows, refraction, photon mapping and so on (I use it only for shading, but that's my own business).
For a completely static scene, kd-trees are unbeatable, and for dynamic scenes there are clever algorithms out there that scale very well on a quad-core (I am not sure about the performance beyond that).
And of course, for really good performance you need to use a lot of SSE code (with, of course, not too many jumps), but for merely "not that good" performance (I'm talking maybe 10-15% here) compiler intrinsics are enough to implement your SSE stuff.
And some decent papers about the algorithms I was talking about:
"Fast Ray/Axis-Aligned Bounding Box - Overlap Tests using Ray Slopes"
( very fast very good paralelisizable (SSE) AABB-Ray hit test )( note, the code in the paper is not all code, just google for the title of the paper, youll find it)
http://graphics.tu-bs.de/publications/Eisemann07RS.pdf
"Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies"
http://www.sci.utah.edu/~wald/Publications/2007///BVH/download//togbvh.pdf
If you know how the above algorithm works, then this is an even better algorithm:
"The Use of Precomputed Triangle Clusters for Accelerated Ray Tracing in Dynamic Scenes"
http://garanzha.com/Documents/UPTC-ART-DS-8-600dpi.pdf
I'm also using the Plücker test to determine quickly (it's not that accurate, but well, you can't have everything) whether I hit a polygon; it works very nicely with SSE and above.
So my conclusion is that there are so many great papers out there about so many topics related to raytracing (how to build fast, efficient trees, how to shade (BRDF models), and so on); it is a really amazing and interesting field of "experimentation", but you need to have a lot of spare time, because it is so damn complicated, but fun.
My first question is: are you trying to optimize the tracing of one single still image, or is this about optimizing the tracing of multiple frames in order to calculate an animation?
Optimizing for a single shot is one thing; if you want to calculate successive frames in an animation, there are lots of new things to think about and optimize.
You could
use an SAH-optimized bounding volume hierarchy...
...possibly using packet traversal,
introduce importance sampling,
access the tiles ordered by Morton code for better cache coherency, and
much more - but those were the suggestions I could immediately think of. In more words:
You can build an optimized hierarchy based on statistics in order to quickly identify candidate nodes when intersecting geometry. In your case you'll have to combine the automatic hierarchy with the modeling hierarchy, that is, either constrain the build or possibly have it clone the modeling information.
"Packet traversal" means you use SIMD instructions to compute 4 parallel scalars, each of an own ray for traversing the hierarchy (which is typically the hot spot) in order to squeeze the most performance out of the hardware.
You can perform some per-ray-statistics in order to control the sampling rate (number of secondary rays shot) based on the contribution to the resulting pixel color.
Using a space-filling curve on the tile allows you to decrease the average distance in space between successively traced pixels, and thus increase the probability that your performance benefits from cache hits.