Using healpy spin transforms in a quicker way - healpy

I'm using a code that needs to do frequent spin-2 transforms in healpy using both map2alm and alm2map (predominantly the latter). However, when I go to higher and higher nside this transform inevitably becomes slower.
I am using healpy with openmp and due to the sheer volume of transforms necessary, this still results in a long time to complete a job. Is there any further way to try and speed up the transform that might be applicable to this situation? For example, as I am only interested in part of the sky, is there a way to only reconstruct the values in specified pixels and not the whole sky when using alm2map, and would this be quicker?
Thanks for any help

I don't think healpy can speed this up, but you might be interested in checking out the other Cl estimators, in case you're interested in the power spectra and not only the alm.
For the full sky (in HEALPix):
PolSpice
XPol
For smaller fields:
Poker
For reasons described here, the estimation of the power spectra is not straightforward for a partially masked sky. healpy does not correct for this; the other packages I linked above do.

Related

When should these methods be used to calculate blob orientation?

In image processing, each of the following methods can be used to get the orientation of a blob region:
Using second order central moments
Using PCA to find the axis
Using distance transform to get the skeleton and axis
Other techniques, like fitting the contour of the region with an ellipse.
When should I consider using a specific method? How do they compare, in terms of accuracy and performance?
I'll give you a vague general answer, and I'm sure others will give you more details. This issue comes up all the time in image processing: there are N ways to solve my problem, so which one should I use? The answer is to start with the simplest one that you understand best. For most people, that's probably 1 or 2 in your example. In most cases, they will give nearly identical results and be sufficient. If for some reason the techniques don't work on your data, you have learned for yourself a case where they fail, and you need to start exploring other techniques. This is where the hard work of being an image processing practitioner comes in. There are no silver bullets, only a grab bag of techniques that work in specific contexts, which you have to learn and figure out. When you learn this for yourself, you will become godlike among your peers.
For this specific example, if your data is roughly ellipsoidal, all these techniques will give similar results. As your data moves away from ellipsoidal (say, spider-like), the PCA, second-order moment, and contour approaches will start to give poor results. The skeleton approaches become more robust, but mapping a complex skeleton to a single axis or orientation can be a very difficult problem and may require more a priori knowledge about the blob.
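To make the comparison between methods 1 and 2 concrete, here is a minimal NumPy sketch of both. The function names and the synthetic test ellipse are my own illustrative choices, and the angle is measured in array coordinates (x along columns, y along rows, so y points down in a displayed image).

```python
import numpy as np

def orientation_from_moments(mask):
    """Major-axis angle in radians, in (-pi/2, pi/2], from second-order central moments."""
    ys, xs = np.nonzero(mask)
    x = xs - xs.mean()
    y = ys - ys.mean()
    mu20 = np.mean(x * x)
    mu02 = np.mean(y * y)
    mu11 = np.mean(x * y)
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)

def orientation_from_pca(mask):
    """Major-axis angle in radians from the leading eigenvector of the pixel covariance."""
    ys, xs = np.nonzero(mask)
    pts = np.column_stack((xs, ys)).astype(float)
    pts -= pts.mean(axis=0)
    cov = pts.T @ pts / len(pts)
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    vx, vy = eigvecs[:, -1]            # eigenvector of the largest eigenvalue
    angle = np.arctan2(vy, vx)
    # An axis has no direction, so fold the angle into (-pi/2, pi/2].
    if angle > np.pi / 2:
        angle -= np.pi
    elif angle <= -np.pi / 2:
        angle += np.pi
    return angle

def make_ellipse_mask(size=101, a=30.0, b=10.0, theta=np.deg2rad(30.0)):
    """Synthetic filled ellipse (hypothetical test data) rotated by theta."""
    yy, xx = np.mgrid[0:size, 0:size]
    x = xx - size // 2
    y = yy - size // 2
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

mask = make_ellipse_mask()
angle_moments = orientation_from_moments(mask)
angle_pca = orientation_from_pca(mask)
```

Both functions diagonalise the same 2x2 covariance matrix, which is why they agree on an elliptical blob; the moments version just uses the closed-form angle instead of an eigen-decomposition.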

How to Scale SPICE Matrix so LU-decomposition doesn't Fail

I am implementing a SPICE solver. I have the following problem: say I put two diodes and a current source in series (standard diodes). I use MNA and Boost LU-decomposition. The problem is that the nodal matrix very quickly becomes near-singular. I think I have to scale the values, but I don't know how, and I couldn't find anything on the Internet. Any ideas how to do this scaling?
From the numerical perspective, there is a scaling technique for this kind of near-singular matrix: divide each row of A by the sum (or the maximum) of the absolute values in that row. For more details, look at KLU, a linear solver for circuit simulation.
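As a rough illustration of the row scaling described above, here is a small NumPy sketch. The 2x2 matrix is made-up data chosen so the effect is visible, not a real MNA matrix, and `row_scale` is a hypothetical helper name; this is not how KLU itself is implemented.

```python
import numpy as np

def row_scale(A, b):
    """Divide each row of A (and the matching entry of b) by the row's largest absolute value."""
    scale = np.max(np.abs(A), axis=1)
    scale[scale == 0.0] = 1.0          # leave all-zero rows untouched
    return A / scale[:, None], b / scale

# Made-up system whose rows differ by 18 orders of magnitude.
A = np.array([[1e9, 2e9],
              [3e-9, 1e-9]])
x_true = np.array([1.0, 2.0])
b = A @ x_true

A_scaled, b_scaled = row_scale(A, b)
cond_before = np.linalg.cond(A)
cond_after = np.linalg.cond(A_scaled)
x = np.linalg.solve(A_scaled, b_scaled)   # same solution as the unscaled system
```

Scaling a row and its right-hand side by the same factor does not change the solution, so the solver output can be used directly.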
From the perspective of SPICE simulation, the so-called Gmin stepping technique is used to approach the real answer iteratively. You can find it in the documentation of the SPICE project QUCS (Quite Universal Circuit Simulator).
Scaling does not help when the matrix has both very large and very small entries.
It is necessary to use some or all of the many tricks that were developed for circuit solver applications. A good start is clipping the range of the exponential and log function arguments to reasonable values -- in most circuits a diode forward voltage is never more than 1V and the diode reverse current not less than 1pA.
Actually, look at all library functions and wrap them in code that makes their arguments and results suitable for circuit-solving purposes. Simple clipping is sometimes good enough, but it is way better to make sure the functions stay (twice) differentiable and continuous.
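One possible way to wrap the exponential so that it cannot overflow while staying continuous and twice differentiable is to continue it with its second-order Taylor polynomial above a threshold. The sketch below assumes a threshold of 40 and a thermal voltage of about 25.85 mV; both values and the function names are my own illustrative choices (the 1 pA saturation current echoes the figure mentioned above).

```python
import numpy as np

def limited_exp(x, x_max=40.0):
    """exp(x) for x <= x_max; above that, the quadratic Taylor continuation about x_max.

    Value, first and second derivative all match exp at x_max, so Newton
    iterations see no kink, and the result cannot overflow for sane inputs.
    """
    x = np.asarray(x, dtype=float)
    d = np.maximum(x - x_max, 0.0)                 # zero below the threshold
    return np.exp(np.minimum(x, x_max)) * (1.0 + d + 0.5 * d * d)

def diode_current(v, i_sat=1e-12, v_t=0.02585, x_max=40.0):
    """Shockley diode equation using the limited exponential (parameters are assumptions)."""
    return i_sat * (limited_exp(v / v_t, x_max) - 1.0)

i_normal = diode_current(0.6)      # ordinary forward bias, unaffected by the limit
i_wild = diode_current(100.0)      # absurd Newton iterate, still finite
```

Plain clipping of the argument would make the derivative zero above the threshold, which tends to stall Newton's method; the polynomial continuation keeps a useful slope.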

Can raymarching be accelerated under an SIMD architecture?

The answer would seem to be no, because raymarching is highly conditional i.e. each ray follows a unique execution path, since on each step we check for opacity, termination etc. that will vary based on the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your
vector math using SSE instructions (be aware of the effects discussed
here - also for the other approach). This approach would allow the code
to stay concise and maintainable.
I assume, however, your question is about "packet traversal" (or something
like it), in other words to process multiple scalar values each of a
different ray:
In principle it should be possible by deferring the shading to another pass.
The SIMD packet could be repopulated with a new ray once the bare marching
pass terminates and the temporary result is stored as input for the shading
pass. This will allow you to parallelize a certain, case-dependent percentage
of your code, exploiting all four SIMD lanes.
Tiling the image and indexing the rays within it in Morton-order might be
a good idea too in order to avoid cache pressure (unless your geometry is
strictly procedural).
You won't know whether it pays off unless you try. My guess is, that if it
does, the amount of speedup might not be worth the complication of the code
for just four lanes.
Have you considered using an SIMT architecture such as a programmable GPU?
A somewhat up-to-date programmable graphics board allows you to perform
raymarching at interactive rates (see it happen in your browser here).
Over the last few days I built a software-based raymarcher for a Menger sponge, so far without SIMD and without any special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. Then I take a camera position and a destination, which I use to calculate the increment vector for the raymarch.
After that I perform a constant number of iterations, in which only one branch decides whether there's an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t. If I have found that or reached a maximal distance, I break the loop. At the end I have t, and from that I calculate the fragment color.
In my opinion it should be possible to parallelize these operations by SSE1/2, because one can solve the branch by null'ing the field in the vector (__m64 / __m128), so further SIMD operations won't apply here. It really depends on what you raymarch/-cast but if you just calculate a fragment color from a function (like my fractal curve here is) and don't access memory non-linearly there are some tricks to make it possible.
Sure, this answer contains speculation, but I will keep you informed when I've parallelized this routine.
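The masking idea can be sketched without intrinsics by letting NumPy arrays stand in for SIMD lanes: every ray executes the same instructions, and finished rays are simply advanced by zero. This is an illustration under my own assumptions (a unit-sphere distance function instead of the fractal, and hypothetical function names), not the code from the answer above.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Signed distance from points p (N, 3) to a sphere at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def march(origins, directions, max_steps=64, eps=1e-4, t_max=10.0):
    """Lockstep sphere tracing; 'active' plays the role of the SIMD lane mask."""
    t = np.zeros(len(origins))
    active = np.ones(len(origins), dtype=bool)
    hit = np.zeros(len(origins), dtype=bool)
    for _ in range(max_steps):
        p = origins + t[:, None] * directions
        d = sphere_sdf(p)
        new_hit = active & (d < eps)
        hit |= new_hit
        active &= ~new_hit
        # Masked update: inactive lanes add zero instead of taking a branch.
        t = t + np.where(active, d, 0.0)
        active &= t < t_max
        if not active.any():       # the whole packet is done
            break
    return t, hit

origins = np.array([[0.0, 0.0, -3.0], [0.0, 0.0, -3.0]])
directions = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
t, hit = march(origins, directions)
```

The cost of this scheme is that the packet runs for as long as its slowest ray, which is the utilisation problem the packet-repopulation suggestion earlier in the thread tries to address.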
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.

What is the idea behind scaling an image using Lanczos?

I'm interested in image scaling algorithms and have implemented the bilinear and bicubic methods. However, I have heard of the Lanczos and other more sophisticated methods for even higher quality image scaling, and I am very curious how they work.
Could someone here explain the basic idea behind scaling an image using Lanczos (both upscaling and downscaling) and why it results in higher quality?
I do have a background in Fourier analysis and have done some signal processing stuff in the past, but not with relation to image processing, so don't be afraid to use terms like "frequency response" and such in your answer :)
EDIT: I guess what I really want to know is the concept and theory behind using a convolution filter for interpolation.
(Note: I have already read the Wikipedia article on Lanczos resampling but it didn't have nearly enough detail for me)
The selection of a particular filter for image processing is something of a black art, because the main criterion for judging the result is subjective: in computer graphics, the ultimate question is almost always: "does it look good?". There are a lot of good filters out there, and the choice between the best frequently comes down to a judgement call.
That said, I will go ahead with some theory...
Since you are familiar with Fourier analysis for signal processing, you don't really need to know much more to apply it to image processing -- all the filters of immediate interest are "separable", which basically means you can apply them independently in the x and y directions. This reduces the problem of resampling a (2-D) image to the problem of resampling a (1-D) signal. Instead of a function of time (t), your signal is a function of one of the coordinate axes (say, x); everything else is exactly the same.
Ultimately, the reason you need to use a filter at all is to avoid aliasing: if you are reducing the resolution, you need to filter out high-frequency original data that the new, lower resolution doesn't support, or it will be added to unrelated frequencies instead.
So. While you're filtering out unwanted frequencies from the original, you want to preserve as much of the original signal as you can. Also, you don't want to distort the signal you do preserve. Finally, you want to extinguish the unwanted frequencies as completely as possible. This means -- in theory -- that a good filter should be a "box" function in frequency space: with zero response for frequencies above the cutoff, unity response for frequencies below the cutoff, and a step function in between. And, in theory, this response is achievable: as you may know, a straight sinc filter will give you exactly that.
There are two problems with this. First, a straight sinc filter is unbounded, and doesn't drop off very fast; this means that doing a straightforward convolution will be very slow. Rather than direct convolution, it is faster to use an FFT and do the filtering in frequency space...
However, if you actually do use a straight sinc filter, the problem is that it doesn't actually look very good! As the related question says, perceptually there are ringing artifacts, and practically there is no completely satisfactory way to deal with the negative values that result from "undershoot".
Finally, then: one way to deal with the problem is to start out with a sinc filter (for its good theoretical properties), and tweak it until you have something that also solves your other problems. Specifically, this will get you something like the Lanczos filter:
Lanczos filter: L(x) = sinc(pi x) sinc(pi x/a) box(|x|<a)
frequency response: F[L(x)](f) = box(|f|<1/2) * box(|f|<1/(2a)) * sinc(2 pi f a)
[note that "*" here is convolution, not multiplication]
[also, I am ignoring normalization completely...]
the sinc(pi x) determines the overall shape of the frequency response (for larger a, the frequency response looks more and more like a box function)
the box(|x|<a) gives it finite support, so you can use direct convolution
the sinc(pi x/a) smooths out the edges of the box and (consequently? equivalently?) greatly improves the rejection of undesirable high frequencies
the last two factors ("the window") also tone down the ringing; they make a vast improvement in both the perceptual artifact and the practical incidence of "undershoot" -- though without completely eliminating them
Please note that there is no magic about any of this. There are a wide variety of windows available, which work just about as well. Also, for a=1 and 2, the frequency response does not look much like a step function. However, I hope this answers your question "why sinc", and gives you some idea about frequency responses and so forth.
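For a concrete picture of the kernel, here is a minimal 1-D resampling sketch in NumPy. Note that `np.sinc(x)` is sin(pi x)/(pi x), so it corresponds to the sinc(pi x) written above. The pixel-centre convention, the edge replication, the widening of the kernel when downscaling, and the per-sample weight normalisation are all my own choices, not part of the answer.

```python
import numpy as np

def lanczos_kernel(x, a=3):
    """L(x) = sinc(pi x) sinc(pi x / a) for |x| < a, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < a, np.sinc(x) * np.sinc(x / a), 0.0)

def resample_1d(signal, new_len, a=3):
    """Resample a 1-D signal to new_len samples by direct convolution."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    scale = n / new_len                 # > 1 when downscaling
    stretch = max(scale, 1.0)           # widen the kernel to low-pass when shrinking
    support = a * stretch
    out = np.empty(new_len)
    for i in range(new_len):
        center = (i + 0.5) * scale - 0.5            # output sample in input coordinates
        lo = int(np.floor(center - support)) + 1
        hi = int(np.floor(center + support))
        idx = np.arange(lo, hi + 1)
        w = lanczos_kernel((idx - center) / stretch, a)
        idx = np.clip(idx, 0, n - 1)                # replicate edge samples
        out[i] = np.dot(w, signal[idx]) / w.sum()   # normalise so constants are preserved
    return out

ramp = np.linspace(0.0, 1.0, 16)
upscaled = resample_1d(ramp, 32)
downscaled = resample_1d(ramp, 8)
```

Because the filter is separable, a 2-D image can be handled by applying `resample_1d` along the rows and then along the columns.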

Improving raytracer performance

I'm writing a comparatively straightforward raytracer/path tracer in D (http://dsource.org/projects/stacy), but even with full optimization it still needs several thousand processor cycles per ray. Is there anything else I can do to speed it up? More generally, do you know of good optimizations / faster approaches for ray tracing?
Edit: this is what I'm already doing.
Code is already running highly parallel
temporary data is structured in a cache-efficient fashion as well as aligned to 16b
Screen divided into 32x32-tiles
Destination array is arranged in such a way that all subsequent pixels in a tile are sequential in memory
Basic scene graph optimizations are performed
Common combinations of objects (plane-plane CSG as in boxes) are replaced with preoptimized objects
Vector struct capable of taking advantage of GDC's automatic vectorization support
Subsequent hits on a ray are found via lazy evaluation; this prevents needless calculations for CSG
Triangles neither supported nor priority. Plain primitives only, as well as CSG operations and basic material properties
Bounding is supported
The typical first order improvement of raytracer speed is some sort of spatial partitioning scheme. Based only on your project outline page, it seems you haven't done this.
Probably the most usual approach is an octree, but the best approach may well be a combination of methods (e.g. spatial partitioning trees and things like mailboxing). Bounding box/sphere tests are a quick cheap and nasty approach, but you should note two things: 1) they don't help much in many situations and 2) if your objects are already simple primitives, you aren't going to gain much (and might even lose). You can more easily (than octree) implement a regular grid for spatial partitioning, but it will only work really well for scenes that are somewhat uniformly distributed (in terms of surface locations)
A lot depends on the complexity of the objects you represent, your internal design (i.e. do you allow local transforms, referenced copies of objects, implicit surfaces, etc), as well as how accurate you're trying to be. If you are writing a global illumination algorithm with implicit surfaces the tradeoffs may be a bit different than if you are writing a basic raytracer for mesh objects or whatever. I haven't looked at your design in detail so I'm not sure what, if any, of the above you've already thought about.
Like any performance optimization process, you're going to have to measure first to find where you're actually spending the time, then improving things (algorithmically by preference, then code bumming by necessity)
One thing I learned with my ray tracer is that a lot of the old rules don't apply anymore. For example, many ray tracing algorithms do a lot of testing to get an "early out" of a computationally expensive calculation. In some cases, I found it was much better to eliminate the extra tests and always run the calculation to completion. Arithmetic is fast on a modern machine, but a missed branch prediction is expensive. I got something like a 30% speed-up on my ray-polygon intersection test by rewriting it with minimal conditional branches.
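To show what "minimal conditional branches" can look like, here is a sketch of the well-known slab test for a ray against an axis-aligned box, written with min/max instead of per-axis early outs. It is a different test from the ray-polygon one mentioned above and is only meant to illustrate the style; the function names are hypothetical, and Python itself will not show the branch-prediction benefit.

```python
import numpy as np

def inverse_direction(direction):
    """Precompute 1/direction once per ray; zero components become +/-inf."""
    with np.errstate(divide="ignore"):
        return 1.0 / np.asarray(direction, dtype=float)

def ray_aabb_hit(origin, inv_dir, box_min, box_max):
    """Slab test using only min/max: no per-axis early exit.

    Edge case not handled here: a ray that starts exactly on a slab plane
    with a zero direction component produces 0 * inf = NaN.
    """
    t1 = (box_min - origin) * inv_dir
    t2 = (box_max - origin) * inv_dir
    t_near = np.max(np.minimum(t1, t2))   # latest entry over the three slabs
    t_far = np.min(np.maximum(t1, t2))    # earliest exit over the three slabs
    return bool(t_far >= max(t_near, 0.0))

box_min = np.array([-1.0, -1.0, -1.0])
box_max = np.array([1.0, 1.0, 1.0])
inv = inverse_direction([1.0, 0.0, 0.0])
hits_box = ray_aabb_hit(np.array([-5.0, 0.0, 0.0]), inv, box_min, box_max)
misses_box = ray_aabb_hit(np.array([-5.0, 3.0, 0.0]), inv, box_min, box_max)
box_behind = ray_aabb_hit(np.array([5.0, 0.0, 0.0]), inv, box_min, box_max)
```

In a compiled language the same expressions map onto SIMD min/max instructions, so the only data-dependent branch left is the final comparison.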
Sometimes the best approach is counter-intuitive. For example, I found that many scenes with a few large objects ran much faster when I broke them down into a large number of smaller objects. Depending on the scene geometry, this can allow your spatial subdivision algorithm to throw out a lot of intersection tests. And let's face it, intersection tests can be made only so fast. You have to eliminate them to get a significant speed-up.
Hierarchical bounding volumes help a lot, but I finally grokked the kd-tree, and got a HUGE increase in speed. Of course, building the tree has a cost that may make it prohibitive for real-time animation.
Watch for synchronization bottlenecks.
You've got to profile to be sure to focus your attention in the right place.
Is there anything else I can do to speed it up?
D, depending on the implementation and compiler, puts forth reasonably good performance. As you haven't explained what ray tracing methods and optimizations you're using already, then I can't give you much help there.
The next step, then, is to run a timing analysis on the program and recode in assembly the most frequently used code, or the slowest code that impacts performance the most.
More generally, check out the resources in these questions:
Literature and Tutorials for Writing a Ray Tracer
Anyone know of a really good book about Ray Tracing?
Computer Graphics: Raytracing and Programming 3D Renders
raytracing with CUDA
I really like the idea of using a graphics card (a massively parallel computer) to do some of the work.
There are many other raytracing related resources on this site, some of which are listed in the sidebar of this question, most of which can be found in the raytracing tag.
I don't know D at all, so I'm not able to look at the code and find specific optimizations, but I can speak generally.
It really depends on your requirements. One of the simplest optimizations is just to reduce the number of reflections/refractions that any particular ray can follow, but then you start to lose out on the "perfect result".
Raytracing is also an "embarrassingly parallel" problem, so if you have the resources (such as a multi-core processor), you could look into calculating multiple pixels in parallel.
Beyond that, you'll probably just have to profile and figure out what exactly is taking so long, then try to optimize that. Is it the intersection detection? Then work on optimizing the code for that, and so on.
Some suggestions.
Use bounding objects to fail fast.
Project the scene at a first step (as common graphic cards do) and use raytracing only for light calculations.
Parallelize the code.
Raytrace every other pixel. Get the color in between by interpolation. If the colors vary greatly (you are on an edge of an object), raytrace the pixel in between. It is cheating, but on simple scenes it can almost double the performance while you sacrifice some image quality.
Render the scene on GPU, then load it back. This will give you the first ray/scene hit at GPU speeds. If you do not have many reflective surfaces in the scene, this would reduce most of your work to plain old rendering. Rendering CSG on GPU is unfortunately not completely straightforward.
Read source code of PovRay for inspiration. :)
First you have to make sure that you use very fast algorithms (implementing them can be a real pain; what you want to do, how far you want to go, and how fast it should be is a kind of trade-off).
some more hints from me
- don't use mailboxing techniques; papers sometimes discuss that they don't scale well on current architectures because of the counting overhead
- don't use BSP trees or octrees, they are relatively slow.
- don't use the GPU for raytracing, it is far too slow for advanced effects like reflection, shadows, refraction, photon mapping and so on (I use it only for shading, but that's my own business)
For a completely static scene kd-trees are unbeatable, and for dynamic scenes there are clever algorithms that scale very well on a quad-core (I am not sure about the performance beyond that).
And of course, for really good performance you need to use a lot of SSE code (with not too many jumps, of course), but for not "that good" performance (I'm talking about 10-15% maybe) compiler intrinsics are enough to implement your SSE stuff.
And some decent papers about some of the algorithms I was talking about:
"Fast Ray/Axis-Aligned Bounding Box - Overlap Tests using Ray Slopes"
(a very fast, well parallelizable (SSE) AABB-ray hit test) (note: the code in the paper is not the complete code; just google the title of the paper and you'll find it)
http://graphics.tu-bs.de/publications/Eisemann07RS.pdf
"Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies"
http://www.sci.utah.edu/~wald/Publications/2007///BVH/download//togbvh.pdf
if you know how the above algorithm works then this is a much greater algorithm:
"The Use of Precomputed Triangle Clusters for Accelerated Ray Tracing in Dynamic Scenes"
http://garanzha.com/Documents/UPTC-ART-DS-8-600dpi.pdf
I'm also using the Plücker test to quickly determine (not that accurately, but you can't have everything) whether I hit a polygon; it works very nicely with SSE and above.
So my conclusion is that there are many great papers out there on topics related to raytracing (how to build fast, efficient trees, how to shade (BRDF models), and so on). It is a really amazing and interesting field for experimenting, but you also need a lot of spare time, because it is so damn complicated, but fun.
My first question is - are you trying to optimize the tracing of one single still screen,
or is this about optimizing the tracing of multiple screens in order to calculate an animation ?
Optimizing for a single shot is one thing, if you want to calculate successive frames in an animation there are lots of new things to think about / optimize.
You could
use an SAH-optimized bounding volume hierarchy...
...eventually using packet traversal,
introduce importance sampling,
access the tiles ordered by Morton code for better cache coherency, and
much more - but those were the suggestions I could immediately think of. In more words:
You can build an optimized hierarchy based on statistics in order to quickly identify candidate nodes when intersecting geometry. In your case you'll have to combine the automatic hierarchy with the modeling hierarchy, that is either constrain the build or have it eventually clone modeling information.
"Packet traversal" means you use SIMD instructions to compute 4 scalars in parallel, each belonging to a different ray, when traversing the hierarchy (which is typically the hot spot), in order to squeeze the most performance out of the hardware.
You can perform some per-ray-statistics in order to control the sampling rate (number of secondary rays shot) based on the contribution to the resulting pixel color.
Using a space-filling curve on the tile allows you to decrease the average spatial distance between consecutively processed pixels and thus increase the probability that your performance benefits from cache hits.
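As a small illustration of Morton ordering, the sketch below interleaves the bits of the x and y pixel coordinates and sorts the pixels of a tile by the resulting code. The function names and the 16-bit coordinate limit are my own assumptions.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit sits between each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x, y):
    """Interleave bits: x occupies the even bit positions, y the odd ones."""
    return part1by1(x) | (part1by1(y) << 1)

def tile_pixels_in_morton_order(size=4):
    """All (x, y) pixels of a size-by-size tile, visited along the Z-order curve."""
    coords = [(x, y) for y in range(size) for x in range(size)]
    return sorted(coords, key=lambda c: morton2d(c[0], c[1]))

order = tile_pixels_in_morton_order(4)
# order starts with (0, 0), (1, 0), (0, 1), (1, 1): one 2x2 block before moving on
```

Consecutive codes stay inside small square blocks, so rays traced one after another tend to touch the same scene data and the same region of the destination array.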

Resources