Question about bandwidth ceilings in roofline models - cpu

I don't quite understand the bandwidth factor in roofline models as described on Wikipedia (like the picture and its caption shown below):
Why can the intersection between the β x I line and the axes change? Why could there be performance while operational intensity is zero?
When changing bandwidth ceilings, why does the slope of the β x I line not change?
An example of a Roofline model with added bandwidth ceilings. In this model, the two additional ceilings represent the absence of software prefetching and NUMA organization of memory. -- Wikipedia

The axes in this graph are logarithmic, so the zero-intensity case is not actually on the graph. Also, because of that logarithmic scale, any constant-factor degradation in bandwidth, for instance from lack of prefetching, appears as a constant vertical displacement of the β x I line; its slope stays the same.
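To make that concrete, here is a minimal sketch of the roofline formula attainable = min(peak, β x I), using hypothetical peak and bandwidth numbers. In the bandwidth-bound region the two ceilings differ by a constant factor everywhere, which on a log-log plot is a constant vertical shift with unchanged slope.

import numpy as np

peak = 100.0                  # hypothetical peak compute rate, GFLOP/s
beta_fast = 25.0              # hypothetical peak bandwidth, GB/s
beta_slow = beta_fast / 4.0   # e.g. an assumed 4x loss without prefetching

def attainable(intensity, beta):
    # Roofline: performance is capped by either compute or memory traffic.
    return np.minimum(peak, beta * intensity)

I = np.logspace(-2, 2, 9)     # operational intensity, FLOP/byte (log-spaced)
for i, fast, slow in zip(I, attainable(I, beta_fast), attainable(I, beta_slow)):
    # In the bandwidth-bound region the two values differ by a constant
    # factor of 4, i.e. a constant vertical offset in log space.
    print(f"I={i:8.3f}  fast={fast:8.3f}  slow={slow:8.3f}")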

Related

Algorithms (millions of solid random intersections)

I am looking for a numerical method to calculate the volume of the intersection of more than two cylinders at any angle (not just 90°, the Steinmetz solid). There is an old paper by Hubbell (1965), but it only works for two cylinders.
Evidently, I could do the calculation by hand, but I need a numerical method since I am making calculations for millions of random intersections.
Exact computation of the intersection volume looks like a major endeavour. The graph of edges can have high complexity, and the edges are complicated skew curves.
I would try a voxelization of space, one bit per voxel (2000³ voxels requiring 1 GB of memory). Maybe an octree representation can help lower the storage requirement, with the number of cells required being closer to the surface area than to the volume.
In any case, filling the cylinders will take quite a significant amount of time.
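A minimal sketch of the voxel-counting idea, in Python/NumPy, with hypothetical cylinder parameters (a finite half-length per cylinder is assumed since the question does not fix one):

import numpy as np

def inside_cylinder(points, center, axis, radius, half_length):
    # points: (N, 3) voxel centers; center: axis midpoint; axis: unit direction.
    v = points - center
    t = v @ axis                       # signed distance along the axis
    radial = v - np.outer(t, axis)     # component perpendicular to the axis
    return (np.abs(t) <= half_length) & \
           (np.einsum('ij,ij->i', radial, radial) <= radius ** 2)

def intersection_volume(cylinders, lo, hi, n):
    # Voxelize the bounding box [lo, hi]^3 with n^3 cells and count the
    # cells whose centers lie inside every cylinder.
    xs = np.linspace(lo, hi, n, endpoint=False) + (hi - lo) / (2 * n)
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing='ij')
    pts = np.stack([X.ravel(), Y.ravel(), Z.ravel()], axis=1)
    mask = np.ones(len(pts), dtype=bool)
    for center, axis, radius, half_length in cylinders:
        mask &= inside_cylinder(pts, np.asarray(center, float),
                                np.asarray(axis, float), radius, half_length)
    return mask.sum() * ((hi - lo) / n) ** 3

# Example: two unit-radius cylinders through the origin crossing at 45 degrees.
cyls = [((0, 0, 0), (1, 0, 0), 1.0, 5.0),
        ((0, 0, 0), (np.sqrt(0.5), np.sqrt(0.5), 0), 1.0, 5.0)]
print(intersection_volume(cyls, -3.0, 3.0, n=150))

Increasing n trades memory and time for accuracy; an octree would refine only near the boundary surface, as suggested above.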

What is the intuition behind the relationship between the dimensions of a model and the performance of k-nearest neighbors?

Regarding the properties of k-nearest neighbors, on page 38 of Elements of Statistical Learning, the authors write:
"...as the dimension p gets large, so does the metric size of the k-nearest neighborhood. So settling for nearest neighborhood as a surrogate for conditioning will fail us miserably."
Does this mean that, holding k constant, as we add features to a model, the distance between outcomes and thus the size of neighborhoods increases, so the model's variance increases?
The curse of dimensionality comes in various shapes. Especially for machine learning, there is a discussion here.
Generally, with increasing dimensionality, the relative difference in distances between points becomes increasingly small. For d=1000 dimensions, it is highly unlikely that any point A in a random dataset is significantly closer to a given point B than any other point is. In a way, this can be explained by saying that with d=1000 it is very unlikely that point A is closer to point B in the vast majority of dimensions (at least unlikely to be closer than any other arbitrary point).
Another aspect is that volumetric properties become unintuitive as d increases. For example, even for a relatively moderate d=25 (if I remember correctly), the volume of the unit cube (edge length 1) is more than 1,000,000 times bigger than the volume of the unit sphere (sphere with diameter 1). I mention this because your quote mentions 'metric size', but I'm not sure how this affects kNN.
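Both effects are easy to check numerically; here is a small sketch (the sample sizes and dimensions are arbitrary):

import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)

# 1) Distance concentration: the spread between the nearest and farthest
#    point, relative to the nearest, shrinks as d grows.
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))
    q = rng.random(d)
    dist = np.linalg.norm(X - q, axis=1)
    print(d, (dist.max() - dist.min()) / dist.min())

# 2) Volume of the unit cube vs. the inscribed sphere of diameter 1.
def sphere_volume(d, r=0.5):
    return pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

for d in (2, 10, 17, 25):
    print(d, 1.0 / sphere_volume(d))   # cube volume is 1, so this is the ratio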

Uncertainty on pose estimate when minimizing measurement errors

Let's say I want to estimate the camera pose for a given image I, and I have a set of measurements (e.g. 2D points ui and their associated 3D coordinates Pi) for which I want to minimize the error (e.g. the sum of squared reprojection errors).
My question is: how do I compute the uncertainty on my final pose estimate?
To make my question more concrete, consider an image I from which I extracted 2D points ui and matched them with 3D points Pi. Denote by Tw the camera pose for this image, which I will be estimating, and by piT the transformation mapping the 3D points to their projected 2D points. Here is a little drawing to clarify things:
My objective statement is as follows:
There exist several techniques to solve the corresponding non-linear least squares problem; suppose I use the following (approximate pseudo-code for the Gauss-Newton algorithm):
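Assuming the objective is the usual one, Tw* = argmin over Tw of sum_i || ui - piT(Pi) ||^2, a minimal sketch of such a Gauss-Newton loop could look like this (the helper functions residuals and jacobian are hypothetical stand-ins for the actual projection model, and a minimal 6-parameter pose parameterization is assumed):

import numpy as np

def gauss_newton(residuals, jacobian, x0, iters=20, tol=1e-9):
    # residuals(x): stacked reprojection errors r(x), length 2N for N points
    # jacobian(x):  Jr, the (2N x 6) Jacobian of r at x
    # x0:           initial pose parameters (a 6-vector)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residuals(x)
        J = jacobian(x)
        # Normal equations: (Jr^T Jr) dx = -Jr^T r
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x, J   # keep the last Jacobian for the covariance estimate below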
I have read in several places that Jr^T · Jr could be considered an estimate of the covariance matrix for the pose estimate. Here is a list of more specific questions:
Can anyone explain why this is the case and/or point to a scientific document explaining it in detail?
Should I be using the value of Jr from the last iteration, or should the successive Jr^T · Jr matrices somehow be combined?
Some people say that this is actually an optimistic estimate of the uncertainty, so what would be a better way to estimate it?
Thanks a lot, any insight on this will be appreciated.
The full mathematical argument is rather involved, but in a nutshell it goes like this:
The product Jr^T · Jr of the Jacobian matrix of the reprojection error at the optimum with itself is an approximation of the Hessian matrix of the least-squares error. The approximation ignores terms of order three and higher in the Taylor expansion of the error function at the optimum. See here (pp. 800-801) for a proof.
The inverse of that Hessian matrix is an approximation of the covariance matrix of the estimated parameters in a neighborhood of their optimal values, under a local linear approximation of the parameters-to-errors transformation (p. 814 of the same reference).
I do not know where the "optimistic" comment comes from. The main assumption underlying the approximation is that the behavior of the cost function (the reprojection error) in a small neighborhood of the optimum is approximately quadratic.
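In practice, a common recipe (a sketch, not a full derivation; it assumes independent, identically distributed measurement noise) is to take the Jacobian from the last iteration only, since it is the one evaluated at the optimum, estimate the noise variance from the residuals, and scale the inverse of Jr^T Jr accordingly:

import numpy as np

def pose_covariance(J, r, n_params=6):
    # J: (2N x 6) Jacobian of the reprojection residuals at the optimum
    # r: (2N,) residual vector at the optimum
    # Estimate the per-measurement noise variance from the residuals,
    # then Cov(pose) ~= sigma^2 * (Jr^T Jr)^-1.
    dof = len(r) - n_params            # degrees of freedom
    sigma2 = (r @ r) / max(dof, 1)
    return sigma2 * np.linalg.inv(J.T @ J)

The Jacobians from intermediate iterations are not combined; only the one at the converged estimate matters for this approximation.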

What is the formal name of the data-structure/algorithm for particle simulations presented here?

For a computer to simulate a system of n particles in a universe where they interact with each other, one could use this rough algorithm:
for interval where dt=10ms
    for each particle a in universe
        for each particle b in universe
            interact(a,b,dt)
    for each particle a in universe
        integrate(a,dt)
It is heavy, calling interact n^2 times per tick, and thus unfeasible for simulating many particles. Most of the time, though, particles that are far apart interact less strongly. The idea is to take advantage of this fact, creating a graph where each node is a particle and each connection is their distance. Particles that are near interact more often than particles that are far. For example,
for interval where dt=10ms
    for each particle a in universe
        for each particle b where 0m <= distance to a < 10m
            interact(a,b,dt)
for interval where dt=20ms
    for each particle a in universe
        for each particle b where 10m <= distance to a < 20m
            interact(a,b,dt)
for interval where dt=40ms
    for each particle a in universe
        for each particle b where 20m <= distance to a < 40m
            interact(a,b,dt)
(...etc)
for interval where dt=10ms
    for each particle a in universe
        integrate(a,dt)
This would obviously be superior, as a particle would interact mostly with those which are near. When a far-away particle gets closer, it will start being refreshed more frequently.
I need to know the math behind this, in order to calculate the optimal refresh rate between two particles as a function of distance. Thus, my question is: what is the formal name of what I am describing here?
To overcome the O(n^2) cost of calculating the full set of pair-wise interactions at each step, N-body simulations of this kind are often implemented using the Barnes-Hut approach. This is similar in spirit to the type of multi-resolution idea that you've described.
Barnes-Hut is an efficient (O(n*log(n))) approximation of the full pair-wise interaction terms based on a hierarchical spatial partitioning strategy. The set of particles is inserted into an octree (a quadtree in R^2), which is a spatial indexing tree with height O(log(n)). In addition to containing pointers to their children, nodes at each level of the tree also contain the center of mass of their set of child particles - tree nodes are in effect lumped "super-particles" at various spatial resolutions.
When calculating the force acting on a particular particle, the tree is traversed from the root, and at each node a decision is made whether to continue traversing into its children, or to just take the approximate 'lumped' contribution based on the center of mass of the children. Typically, this decision is made based on the distance of the center of mass from the particle in question - if the center of mass is "far enough away" the traversal terminates and the "lumped" approximation is taken.
This strategy ensures that the full (and expensive) pair-wise interaction is only computed at "short" particle distances, with approximate "lumped" interactions used as the distance increases.
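Here is a minimal 2D sketch of that build-and-traverse idea in Python (unit masses, a gravitational-style 1/r^2 interaction, and an opening-angle threshold are assumed purely for illustration):

import numpy as np

THETA = 0.5   # opening angle: larger values trade accuracy for speed

class Node:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # square cell: center, half-width
        self.mass = 0.0
        self.com = np.zeros(2)                       # center of mass of the subtree
        self.particle = None
        self.children = None

    def insert(self, pos, m):
        if self.mass == 0.0:                         # empty leaf: store the particle
            self.particle, self.mass, self.com = pos, m, pos.copy()
            return
        if self.children is None:                    # occupied leaf: subdivide
            self.children = [Node(self.cx + dx * self.half / 2,
                                  self.cy + dy * self.half / 2,
                                  self.half / 2)
                             for dx in (-1, 1) for dy in (-1, 1)]
            old, old_m = self.particle, self.mass
            self.particle = None
            self._child_for(old).insert(old, old_m)
        self._child_for(pos).insert(pos, m)
        self.com = (self.com * self.mass + pos * m) / (self.mass + m)
        self.mass += m

    def _child_for(self, pos):
        return self.children[(1 if pos[0] > self.cx else 0) * 2
                             + (1 if pos[1] > self.cy else 0)]

    def force(self, pos):
        if self.mass == 0.0 or (self.particle is not None
                                and np.allclose(self.particle, pos)):
            return np.zeros(2)                       # empty cell or the particle itself
        d = self.com - pos
        r = np.linalg.norm(d) + 1e-12
        if self.children is None or (2 * self.half) / r < THETA:
            return self.mass * d / r ** 3            # "lumped" far-field approximation
        return sum((c.force(pos) for c in self.children), np.zeros(2))

# Usage: rebuild the tree each step, then query the force on every particle.
rng = np.random.default_rng(1)
pts = rng.random((500, 2))
root = Node(0.5, 0.5, 0.5)
for p in pts:
    root.insert(p, 1.0)
forces = np.array([root.force(p) for p in pts])
print(forces[:3])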
State-of-the-art N-body algorithms also incorporate individual (and variable) time-steps for each particle in the system to gain additional efficiency, but this starts to get very complicated!
Hope this helps.
You're doing time-stepped simulation, using a performance-enhancing heuristic called localization.
The general algorithm you describe is an N-body simulation.
I don't think the heuristic you describe has a universal name.

SVM - hard or soft margins?

Given a linearly separable dataset, is it necessarily better to use a hard-margin SVM over a soft-margin SVM?
I would expect soft-margin SVM to be better even when training dataset is linearly separable. The reason is that in a hard-margin SVM, a single outlier can determine the boundary, which makes the classifier overly sensitive to noise in the data.
In the diagram below, a single red outlier essentially determines the boundary, which is the hallmark of overfitting.
To get a sense of what a soft-margin SVM is doing, it's better to look at it in the dual formulation, where you can see that it has the same margin-maximizing objective (the margin can be negative) as the hard-margin SVM, but with an additional constraint that each Lagrange multiplier associated with a support vector is bounded by C. Essentially this bounds the influence of any single point on the decision boundary; for a derivation, see Proposition 6.12 in Cristianini/Shawe-Taylor's "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods".
The result is that a soft-margin SVM could choose a decision boundary that has non-zero training error even if the dataset is linearly separable, and it is less likely to overfit.
Here's an example using libSVM on a synthetic problem. Circled points show the support vectors. You can see that decreasing C causes the classifier to sacrifice linear separability in order to gain stability, in the sense that the influence of any single data point is now bounded by C.
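The original plots aren't reproduced here, but the qualitative effect is easy to recreate; below is a rough sketch using scikit-learn (which wraps libsvm) on a made-up linearly separable set with a single outlier (the data, labels and C values are all arbitrary choices for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# A linearly separable toy set: class -1 lives in x < -1, class +1 in x > 1,
# plus a single class -1 "outlier" at x = 0.9 that squeezes the hard margin.
X = np.vstack([rng.uniform([-3, -1], [-1, 1], (50, 2)),
               rng.uniform([1, -1], [3, 1], (50, 2)),
               [[0.9, 0.0]]])
y = np.array([-1] * 50 + [1] * 50 + [-1])

for C in (0.01, 1.0, 1000.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7}: margin width={2 / np.linalg.norm(w):.3f}, "
          f"support vectors={len(clf.support_)}, "
          f"training errors={int((clf.predict(X) != y).sum())}")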
Meaning of support vectors:
For a hard-margin SVM, the support vectors are the points which are "on the margin". In the picture above, C=1000 is pretty close to a hard-margin SVM, and you can see the circled points are the ones that touch the margin (the margin is almost 0 in that picture, so it's essentially the same as the separating hyperplane).
For a soft-margin SVM, it's easier to explain them in terms of the dual variables. Your support vector predictor, in terms of the dual variables, is the following function.
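In the standard dual form (assuming a kernel K, which is just the dot product in the linear case), it reads:
f(x) = sign( sum_i alpha_i * yi * K(xi, x) + b )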
Here, the alphas and b are parameters found during the training procedure, the xi's and yi's are your training set, and x is the new data point. The support vectors are the data points from the training set which are included in the predictor, i.e. the ones with a non-zero alpha parameter.
In my opinion, a hard-margin SVM overfits to a particular dataset and thus cannot generalize. Even in a linearly separable dataset (as shown in the diagram above), outliers well within the boundaries can influence the margin. A soft-margin SVM has more versatility because we have control over choosing the support vectors by tweaking C.
