Advice/Guidance on Roofline Model Analysis (Skylake, Thunder X2, Haswell)

I'm learning bandwidth/memory- and CPU-bound performance and roofline graphs at the moment, and I'd love some help/input on how to analyze the following figure.
[Roofline figure from https://www.mdpi.com/2079-3197/8/1/20]
The first thing I'm trying to determine is which of the two kernels, Dirac or LBM, is closer to the empirical upper-bound performance on ThunderX2. My reasoning is that Dirac is closer, since the red triangle (representing TX2's performance) sits nearer to its roofline for Dirac than for LBM. Can anyone correct my justification/approach if it's wrong?
The second conclusion I'm trying to reach is which of the three architectures (Skylake, ThunderX2, or Haswell) is "best-suited" for LBM. There might be multiple approaches here; my guess is that SKL is best-suited for LBM, since it is the highest-performing of the three on that kernel, but it could also be TX2, since its distance from its own roofline is the shortest of the three, which would make it the most efficient/suitable one for LBM.
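For reference, here is how I understand "closeness to the roofline" being quantified (my own notation, not taken from the paper): the attainable performance at arithmetic intensity $I$ is $P_{\text{roof}}(I) = \min(P_{\text{peak}},\, I \cdot B)$, where $B$ is the memory bandwidth and $P_{\text{peak}}$ the compute ceiling, and the fraction of the roofline a kernel achieves is $P_{\text{measured}} / P_{\text{roof}}(I)$. On that reading, "closer to the roofline" means a higher fraction, while "best-suited" could mean either the highest absolute $P_{\text{measured}}$ or the highest fraction.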
Any input, correction, or suggestion would be greatly appreciated!

Related

How does one arrive at "fair" priors for spatial and non-spatial effects

A basic BYM model may be written as
$$y_i \sim \text{Poisson}\bigl(E_i \exp(\beta_0 + s_i + u_i)\bigr),$$
sometimes with covariates, but that doesn't matter much here. Here $s$ are the spatially structured effects and $u$ the unstructured effects over units.
In Congdon (2019) the fair prior on these is referred to as one in which
$$\mathrm{sd}(u_i) \approx \frac{\mathrm{sd}(s_i \mid s_{-i})}{0.7\sqrt{\bar m}},$$
where $\bar m$ is the average number of neighbors in the adjacency matrix.
It is defined similarly (in terms of precision, I think) in Bernardinelli et al. (1995).
However, for the gamma distribution, scaling appears to only impact the scale term: if $\tau \sim \text{Gamma}(a, b)$ with rate $b$, then $c\,\tau \sim \text{Gamma}(a, b/c)$.
I haven't been able to find a worked example of this, and I don't understand how the priors are arrived at, for example, in the well-known lip cancer data.
I am hoping someone could help me understand how these are reached in this setting, even in the simple case of two gamma hyperpriors.
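For concreteness, here is my best attempt at the arithmetic I think is intended, assuming the rule of thumb above is the one being applied (the symbols and the shape-rate parameterization are mine): $\mathrm{sd}(u_i) \approx \mathrm{sd}(s_i \mid s_{-i}) / (0.7\sqrt{\bar m})$ is equivalent to $\tau_u \approx 0.49\,\bar m\,\tau_s$ on the precision scale, so if the structured precision has prior $\tau_s \sim \text{Gamma}(a, b)$, a "fair" prior for the unstructured precision would only rescale the rate,
$$\tau_u \sim \text{Gamma}\!\left(a,\ \frac{b}{0.49\,\bar m}\right),$$
using the fact that $cX \sim \text{Gamma}(a, b/c)$ when $X \sim \text{Gamma}(a, b)$. Is that the intended construction?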
References
Congdon, P. D. (2019). Bayesian Hierarchical Models: With Applications Using R (2nd ed.). Chapman and Hall/CRC.
Bernardinelli, L., Clayton, D. and Montomoli, C. (1995). Bayesian estimates of disease maps: How important are priors? Statistics in Medicine 14 2411–2431.

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates; it is, however, difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
Accumulation of a sum. Unrolling using multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. A cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
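A minimal sketch of what that looks like in plain C (the unroll factor of four is arbitrary; in practice you would tune it to the ADD latency and let the compiler's vectorizer do the rest):

    #include <stddef.h>

    /* Naive sum: one serial dependency chain, throughput limited by ADD latency. */
    float sum_naive(const float *x, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i)
            s += x[i];
        return s;
    }

    /* Four independent accumulators break the dependency chain, so several
       adds can be in flight per cycle and the loop vectorizes cleanly. */
    float sum_unrolled(const float *x, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; ++i)   /* remainder */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }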
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel); a naive baseline is sketched after this list. In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small-kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1], and that sub-ulp accuracy is rarely needed, to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
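As a baseline for the small-kernel convolution item above, a naive single-channel 3x3 version in C might look like this (border handling omitted for brevity); the interesting work is in tiling, keeping rows in registers, and vectorizing the inner loops:

    /* Direct 3x3 convolution of a single-channel float image (row-major),
       skipping the one-pixel border.  Deliberately naive: this is the
       baseline the optimized version has to beat. */
    void conv3x3(const float *src, float *dst, int w, int h, const float k[9])
    {
        for (int y = 1; y < h - 1; ++y) {
            for (int x = 1; x < w - 1; ++x) {
                float acc = 0.0f;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        acc += k[(dy + 1) * 3 + (dx + 1)]
                             * src[(y + dy) * w + (x + dx)];
                dst[y * w + x] = acc;
            }
        }
    }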
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
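The brute-force kernel for the red/blue variation is tiny, which is exactly what makes it a good vectorization target. A plain C sketch (row-major point storage, squared distances, names mine):

    #include <stddef.h>
    #include <float.h>

    /* For each blue point, find the index of the nearest red point by brute
       force.  Points are stored row-major: n points x dim coordinates. */
    void nearest_red(const float *blue, size_t n_blue,
                     const float *red,  size_t n_red,
                     size_t dim, size_t *nearest)
    {
        for (size_t b = 0; b < n_blue; ++b) {
            float best = FLT_MAX;
            size_t best_idx = 0;
            for (size_t r = 0; r < n_red; ++r) {
                float d2 = 0.0f;
                for (size_t k = 0; k < dim; ++k) {
                    float diff = blue[b * dim + k] - red[r * dim + k];
                    d2 += diff * diff;
                }
                if (d2 < best) { best = d2; best_idx = r; }
            }
            nearest[b] = best_idx;
        }
    }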
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
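For reference, an unblocked, unvectorized dense Cholesky in C is only this much code; the project would then be in blocking it for cache and vectorizing the update loops (a sketch, not tuned):

    #include <math.h>

    /* In-place Cholesky factorization A = L * L^T of a symmetric positive
       definite n x n matrix stored row-major; the lower triangle of A is
       overwritten with L.  Returns 0 on success, -1 if A is not SPD. */
    int cholesky(double *A, int n) {
        for (int j = 0; j < n; ++j) {
            double d = A[j * n + j];
            for (int k = 0; k < j; ++k)
                d -= A[j * n + k] * A[j * n + k];
            if (d <= 0.0)
                return -1;                  /* not positive definite */
            d = sqrt(d);
            A[j * n + j] = d;
            for (int i = j + 1; i < n; ++i) {
                double s = A[i * n + j];
                for (int k = 0; k < j; ++k)
                    s -= A[i * n + k] * A[j * n + k];
                A[i * n + j] = s / d;
            }
        }
        return 0;
    }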
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to find the nearest intersection. That is a possible area for speedup: do a quick test to see whether a sphere is farther away than the best hit seen so far, before doing all the 3-D geometry math (sketched after this list).
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
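A sketch of the first idea (this is hypothetical code, not c-ray's actual ray_sphere; the struct layout and names are mine): track the best hit distance found so far and cheaply reject spheres that cannot beat it before doing the full quadratic intersection test.

    #include <math.h>

    typedef struct { double x, y, z; } vec3;
    typedef struct { vec3 pos; double rad; } sphere;
    typedef struct { vec3 orig, dir; } ray;   /* dir assumed normalized */

    /* Call with *best_t initialized to a large value (e.g. the far distance).
       If the sphere centre is farther from the ray origin than best_t plus
       the radius, it cannot produce a closer hit, so skip the full test. */
    int nearest_hit(const ray *r, const sphere *s, int n,
                    double *best_t, int *best_i)
    {
        int hit = 0;
        for (int i = 0; i < n; ++i) {
            double cx = s[i].pos.x - r->orig.x;
            double cy = s[i].pos.y - r->orig.y;
            double cz = s[i].pos.z - r->orig.z;
            double dist2 = cx * cx + cy * cy + cz * cz;
            double reach = *best_t + s[i].rad;
            if (dist2 > reach * reach)      /* cheap reject */
                continue;
            /* full ray-sphere intersection only for surviving candidates */
            double b = cx * r->dir.x + cy * r->dir.y + cz * r->dir.z;
            double disc = b * b - dist2 + s[i].rad * s[i].rad;
            if (disc < 0.0)
                continue;
            double t = b - sqrt(disc);
            if (t > 1e-9 && t < *best_t) { *best_t = t; *best_i = i; hit = 1; }
        }
        return hit;
    }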
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

Cluster similar curves considering "belongingness"?

Currently, I have 6 curves shown in 6 different colors as below.
The 6 curves are in fact generated by 6 trials of the same experiment. Ideally they should be the same curve, but due to noise and different trial participants, they only look similar, not identical.
Now I wish to create an algorithm that is able to identify that the 6 curves are essentially the same and cluster them together into one cluster. What similarity metrics should I use?
Note:
The x-axis does NOT matter at all! I simply align the curves for visual purposes, so feel free to shift them left/right if doing so helps.
"Sub-curves" that are only part of a longer curve may also appear. This "belongingness" is important and needs to be identified as well; again, left/right shifting is allowed.
I have attempted to learn some clustering algorithms, such as DBSCAN, K-means, Fuzzy C-means, etc., but I don't see how they apply in this case, because the "belongingness" needs to be spotted!
Any suggestions or comments are very welcome. I understand that it is hard to give an exact solution to this question; I am only expecting some enlightening suggestions here.
Have a look at time series similarity functions, such as dynamic time warping.
They can be used with e.g. DBSCAN but NOT with k-means (you cannot compute a reasonable "mean" for these distances; k-means is really designed for squared Euclidean distances).
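If it helps, the core DTW recurrence is only a few lines. A minimal C sketch for two 1-D series (no warping window, O(n*m) time and memory; names mine):

    #include <stdlib.h>
    #include <math.h>
    #include <float.h>

    /* Dynamic time warping distance between a[0..n-1] and b[0..m-1],
       using |a_i - b_j| as the local cost.  Returns -1.0 if malloc fails. */
    double dtw(const double *a, size_t n, const double *b, size_t m)
    {
        size_t w = m + 1;
        double *D = malloc((n + 1) * w * sizeof *D);
        if (!D) return -1.0;
        for (size_t i = 0; i <= n; ++i)
            for (size_t j = 0; j <= m; ++j)
                D[i * w + j] = DBL_MAX;
        D[0] = 0.0;
        for (size_t i = 1; i <= n; ++i) {
            for (size_t j = 1; j <= m; ++j) {
                double cost = fabs(a[i - 1] - b[j - 1]);
                double best = D[(i - 1) * w + j];                              /* insertion */
                if (D[i * w + j - 1] < best)       best = D[i * w + j - 1];    /* deletion  */
                if (D[(i - 1) * w + j - 1] < best) best = D[(i - 1) * w + j - 1]; /* match  */
                D[i * w + j] = cost + best;
            }
        }
        double result = D[n * w + m];
        free(D);
        return result;
    }

You would then feed the pairwise DTW distances to DBSCAN as a precomputed distance matrix.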

Random Config Generation for RRT

I am writing code for Rapidly-exploring Random Trees (RRT) for robotic arm movement. I have two doubts:
i) What distance metric should I use to find the nearest node in the tree? If it is Euclidean distance, how do I calculate it? There are two links in each arm configuration of the robot, and I have no idea how to find the Euclidean distance in that case.
How do I find the distance between ADE and ABC if ABC is the nearest config to ADE in the tree?
ii) How do I generate random configs biased towards the goal? My random configs never seem to reach the goal even after 5000 iterations.
Thanks in advance.
Distance Metrics for the Two Revolute-Joint Arm
RRT is pretty robust to the (pseudo-)metric that you choose, but the quality of the trees (and consequently of the paths) will suffer if the metric isn't particularly good. For good overall performance the metric function also needs to be fast, so I'd definitely try simpler things before moving on to something more complex.
In the case of robot arms a number of metrics are possible. Perhaps the simplest is to use the Euclidean distance between the end-effector positions in the two configurations. You'll almost certainly have the forward kinematics for this working already if you're testing the planning algorithm.
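For a two-revolute-joint planar arm, that metric is just forward kinematics followed by a Euclidean norm. A small C sketch (the link lengths l1, l2 and the Config struct are assumptions about your setup):

    #include <math.h>

    typedef struct { double q1, q2; } Config;   /* joint angles in radians */

    /* Planar forward kinematics: end-effector position of a two-link arm. */
    static void fk(Config c, double l1, double l2, double *x, double *y)
    {
        *x = l1 * cos(c.q1) + l2 * cos(c.q1 + c.q2);
        *y = l1 * sin(c.q1) + l2 * sin(c.q1 + c.q2);
    }

    /* Candidate RRT metric: Euclidean distance between the end-effector
       positions of two configurations. */
    double ee_distance(Config a, Config b, double l1, double l2)
    {
        double ax, ay, bx, by;
        fk(a, l1, l2, &ax, &ay);
        fk(b, l1, l2, &bx, &by);
        return hypot(ax - bx, ay - by);
    }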
If you've got a full dynamics model of the system, then it is likely that other metrics based on the energy required to move the arm from one configuration to another will perform better.
Other metrics, based on the (joint-local) angles swept out at the joints (which can be derived by evaluating a path from an inverse kinematics solver), may also be acceptable, but I haven't tried this in practice. This may also be a useful technique to know about if you need to implement your connect-configurations function.
Improving Convergence
Once you've got your metric function working correctly, RRT should just work. However, in practice, you'll almost always need to oversample near the goal configuration to encourage the algorithm to exploit the work done in the rest of the tree building stage. Most commonly, you do this by sampling the goal configuration state with about 5% probability.
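In code, the goal biasing is a one-line change to your sampler. A sketch (the [-pi, pi] joint limits and the Config struct from the earlier sketch are assumptions):

    #include <stdlib.h>

    #define PI 3.14159265358979323846

    typedef struct { double q1, q2; } Config;   /* as in the earlier sketch */

    /* With probability goal_bias (e.g. 0.05) return the goal itself,
       otherwise a configuration drawn uniformly from the joint limits. */
    Config sample_config(Config goal, double goal_bias)
    {
        if ((double)rand() / RAND_MAX < goal_bias)
            return goal;
        Config c;
        c.q1 = ((double)rand() / RAND_MAX) * 2.0 * PI - PI;
        c.q2 = ((double)rand() / RAND_MAX) * 2.0 * PI - PI;
        return c;
    }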

Edge detection: any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of the algorithm. If anyone could give me a reference or a method on how to proceed, it would be really helpful. :)
I do not have ground truth, and the data set includes color as well as grayscale images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
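For the comparison-against-reference-algorithms idea, the bookkeeping is simple once both outputs are binary edge maps. A minimal C sketch (exact pixel-wise counting; a real evaluation would usually allow a small localization tolerance when matching edges):

    #include <stddef.h>

    /* Count disagreements between your binary edge map and a reference map
       (1 = edge, 0 = non-edge), both n_pixels long. */
    void edge_compare(const unsigned char *ours, const unsigned char *ref,
                      size_t n_pixels, size_t *false_pos, size_t *false_neg)
    {
        size_t fp = 0, fn = 0;
        for (size_t i = 0; i < n_pixels; ++i) {
            if (ours[i] && !ref[i]) ++fp;        /* we mark an edge, reference doesn't */
            else if (!ours[i] && ref[i]) ++fn;   /* reference marks an edge, we don't  */
        }
        *false_pos = fp;
        *false_neg = fn;
    }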
Good luck!
For comparisons, quantitative measures like those @Alex I explained are best. To do so, you need to define what is "correct" with a ground-truth set and a way to consistently determine whether a given result is correct, or, at a more granular level, how correct it is (some number like a percentage). @Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
By "user study" I mean polling people on how good a result is, given the input image. You could give them a scale to rate things on and randomly show them samples from both your results and the results of another algorithm.
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look there first, as they've already gone through this same process and determined what was best for them: Google Scholar).
