f = #(w) sum(log(1 + exp(-t .* (phis * w'))))/size(phis, 1) + coef * w*w';
options = optimset('Display', 'notify', 'MaxFunEvals', 2e+6, 'MaxIter', 2e+6);
w = fminunc(f, ones(1, size(phis, 2)), options);
phis has size N x (N+1)
t has size N x 1
coef is a constant
I'm trying to minimize the function f. At first I used fminsearch, but it takes a long time, so now I use fminunc. There is one problem: I need the gradient of the function to speed things up. Can you help me construct the gradient for f? I always get this warning:
Warning: Gradient must be provided for trust-region algorithm;
using line-search algorithm instead.
What you are trying to do is called logistic regression, with L2 regularization. There are far better ways to solve this problem than a call to a general-purpose MATLAB optimizer, since the log-likelihood function is concave.
You should ask your question on the statistics site, or have a look at my former question there.
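That said, the gradient you need is easy to write down: differentiating the data term gives -(1/N) * sum_i t_i * sigmoid(-t_i * phis_i * w') * phis_i, and the regularizer contributes 2 * coef * w. Below is a small sketch of the loss and gradient in NumPy notation (my own translation of your objective, with my own function name); in MATLAB you would return both values from a single function and set 'GradObj' to 'on' in optimset so fminunc actually uses it.

import numpy as np

def loss_and_grad(w, phis, t, coef):
    # w: (d,) weights, phis: (N, d) features, t: (N,) labels in {-1, +1}
    m = -t * (phis @ w)                                  # -t_i * (phis_i . w)
    loss = np.mean(np.logaddexp(0.0, m)) + coef * w @ w  # stable log(1 + exp(m))
    s = 1.0 / (1.0 + np.exp(-m))                         # sigmoid(-t_i * (phis_i . w))
    grad = -(phis * (t * s)[:, None]).mean(axis=0) + 2.0 * coef * w
    return loss, grad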
I am trying to fit the following dataset:
0.01 3.69470157
0.59744 3.514991345
0.65171 3.265043489
0.70076 2.978933734
0.75021 2.700637918
0.80103 2.413791532
0.84878 2.086939551
0.89572 1.819489189
0.94717 1.532756131
0.99626 1.244667864
1.01643 1.130430784
1.03626 1.024324017
1.05633 0.910153046
1.07605 0.804981232
1.09791 0.708171108
1.11795 0.612456485
1.13841 0.516217721
1.15944 0.421844141
1.18032 0.335218393
1.20003 0.258073446
1.22204 0.181296813
1.24223 0.115157866
1.25935 0.069310744
The first column is x and the second is y.
I have tried a tanh function, polynomials, and am now trying the erf function. Nothing seems to fit correctly.
Is there a way to know what function I should be fitting to this data? And if so, what is the form of such a function? Thank you.
BIG EDIT: the function must decrease monotonically as x increases and have asymptotic behavior at the tail ends, so for this data set it should approach ~3.7 on one end and ~0.0 on the other.
A simple sine(radians) equation with an offset gives a good fit:
y = amplitude * sin(pi * (x - center) / width) + Offset
amplitude = -2.2364202059901910E+00
center = 8.6047683705837374E-01
width = 1.1558269132014631E+00
Offset = 2.0456549443735259E+00
R-squared: 0.99994
RMSE: 0.00909
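For what it's worth, here is a small SciPy sketch (my own; the model function name and starting values are mine) that reproduces this fit from the posted data:

import numpy as np
from scipy.optimize import curve_fit

# data from the question: first column x, second column y
x = np.array([0.01, 0.59744, 0.65171, 0.70076, 0.75021, 0.80103, 0.84878,
              0.89572, 0.94717, 0.99626, 1.01643, 1.03626, 1.05633, 1.07605,
              1.09791, 1.11795, 1.13841, 1.15944, 1.18032, 1.20003, 1.22204,
              1.24223, 1.25935])
y = np.array([3.69470157, 3.514991345, 3.265043489, 2.978933734, 2.700637918,
              2.413791532, 2.086939551, 1.819489189, 1.532756131, 1.244667864,
              1.130430784, 1.024324017, 0.910153046, 0.804981232, 0.708171108,
              0.612456485, 0.516217721, 0.421844141, 0.335218393, 0.258073446,
              0.181296813, 0.115157866, 0.069310744])

def model(x, amplitude, center, width, offset):
    # y = amplitude * sin(pi * (x - center) / width) + offset
    return amplitude * np.sin(np.pi * (x - center) / width) + offset

# starting values rounded from the parameters quoted above
popt, _ = curve_fit(model, x, y, p0=[-2.24, 0.86, 1.16, 2.05])
print("amplitude, center, width, offset:", popt)

residuals = y - model(x, *popt)
print("RMSE:", np.sqrt(np.mean(residuals ** 2)))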
I'm working with SVR, and using this resource. Everything is super clear, with the epsilon-insensitive loss function (from the figure). The prediction comes with a tube that covers most of the training samples and generalizes the bounds, using support vectors.
Then we have this explanation: "This can be described by introducing (non-negative) slack variables, to measure the deviation of training samples outside the epsilon-insensitive zone." I understand this error outside the tube, but I don't know how we can use it in the optimization. Could somebody explain this?
In my local source, I'm trying to achieve a very simple optimization solution without libraries. This is what I have for the loss function:
import numpy as np

# Kernel func, linear by default
def hypothesis(x, weight, k=None):
    k = k if k else lambda z: z
    k_x = np.vectorize(k)(x)
    return np.dot(k_x, np.transpose(weight))

.......

import math

def boundary_loss(x, y, weight, epsilon):
    prediction = hypothesis(x, weight)
    scatter = np.absolute(
        np.transpose(y) - prediction)
    bound = lambda z: z \
        if z >= epsilon else 0
    return np.sum(np.vectorize(bound)(scatter))
First, let's look at the objective function. The first term, 1/2 * w^2 (I wish this site had LaTeX support, but this will suffice), correlates with the margin of the SVM. The article you linked doesn't, in my opinion, explain this very well; it describes this term as "the model's complexity", but perhaps that is not the best way of explaining it. Minimizing this term maximizes the margin (while still representing the data well), which is the predominant goal of using SVMs for regression.
Warning, Math Heavy Explanation: The reason this is the case is that when maximizing the margin, you want to find the "farthest" non-outlier points right on the margin and minimize their distance. Let this farthest point be x_n. We want to find its Euclidean distance d from the plane f(w, x) = 0, which I will rewrite as w^T * x + b = 0 (where w^T is just the transpose of the weights matrix so that we can multiply the two). To find the distance, let us first normalize the plane such that |w^T * x_n + b| = epsilon, which we can do WLOG as w is still able to form all possible planes of the form w^T * x + b = 0.

Then, let's note that w is perpendicular to the plane. This is obvious if you have dealt a lot with planes (particularly in vector calculus), but can be proven by choosing two points on the plane, x_1 and x_2, and noticing that w^T * x_1 + b = 0 and w^T * x_2 + b = 0. Subtracting the two equations we get w^T * (x_1 - x_2) = 0. Since x_1 - x_2 is just any vector strictly on the plane, and its dot product with w is 0, we know that w is perpendicular to the plane.

Finally, to actually calculate the distance between x_n and the plane, we take the vector formed by x_n and some point on the plane x' (that vector is x_n - x') and project it onto the vector w. Doing this, we get d = |w^T * (x_n - x')| / |w|, which we can rewrite as d = (1 / |w|) * |w^T * x_n - w^T * x'|, and then add and subtract b inside the absolute value to get d = (1 / |w|) * |w^T * x_n + b - w^T * x' - b|. Notice that w^T * x_n + b is epsilon (from our normalization above), and that w^T * x' + b is 0, as this is just a point on our plane. Thus, d = epsilon / |w|.

Notice that maximizing this distance, subject to our constraint of finding the x_n and having |w^T * x_n + b| = epsilon, is a difficult optimization problem. What we can do is restructure this optimization problem as minimizing 1/2 * w^T * w subject to the first two constraints in the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon. You may think that I have forgotten the slack variables, and this is true; while focusing on this term and ignoring the second term, we set the slack variables aside for now, and I will bring them back later. The reason these two optimizations are equivalent is not obvious, but the underlying reason lies in discrimination boundaries, which you are free to read more about (it's a lot more math that frankly I don't think this answer needs). Then, note that minimizing 1/2 * w^T * w is the same as minimizing 1/2 * |w|^2, which is the desired result we were hoping for. End of the Heavy Math
Now, notice that we want to make the margin big, but not so big that it includes noisy outliers like the one in the picture you provided.
Thus, we introduce a second term. To bring the margin down to a reasonable size, the slack variables are introduced (I will call them p and p* because I don't want to type out "psi" every time). These slack variables will ignore everything inside the margin, i.e. those are the points that do not harm the objective and the ones that are "correct" in terms of their regression status. However, the points outside the margin are outliers; they do not reflect well on the regression, so we penalize them simply for existing. The slack error function that is given there is relatively easy to understand: it just adds up the slack error of every point (p_i + p*_i) for i = 1,...,N, and then multiplies by a modulating constant C which determines the relative importance of the two terms. A low value of C means that we are okay with having outliers, so the margin will be thinned and more outliers will be produced. A high value of C indicates that we care a lot about not having slack, so the margin will be made bigger to accommodate these outliers at the expense of representing the overall data less well.
A few things to note about p and p*. First, note that they are both always >= 0. The constraint in your picture shows this, but it also intuitively makes sense as slack should always add to the error, so it is positive. Second, notice that if p > 0, then p* = 0 and vice versa as an outlier can only be on one side of the margin. Last, all points inside the margin will have p and p* be 0, since they are fine where they are and thus do not contribute to the loss.
Notice that with the introduction of the slack variables, if you have any outliers then you won't want the condition from the first term, that is, |w^T * x_n + b| = epsilon as the x_n would be this outlier, and your whole model would be screwed up. What we allow for, then, is to change the constraint to be |w^T * x_n + b| = epsilon + (p + p*). When translated to the new optimization's constraint, we get the full constraint from the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon + p + p*. (I combined the two equations into one here, but you could rewrite them as the picture is and that would be the same thing).
Hopefully after covering all this up, the motivation for the objective function and the corresponding slack variables makes sense to you.
If I understand the question correctly, you also want code to calculate this objective/loss function, which I think isn't too bad. I have not tested this (yet), but I think this should be what you want.
# Function for calculating the error/loss for a SVM. I assume that:
# - 'x' is a 2d array representing the vectors of the data points
# - 'y' is an array representing the values each vector actually gives
# - 'weights' is an array of weights that we tune for the regression
# - 'epsilon' is a scalar representing the breadth of our margin.
def optimization_objective(x, y, weights, epsilon):
    # Calculate the first term of the objective (note that norm^2 = dot product)
    margin_term = np.dot(weights, weights) / 2
    # Now calculate the second term of the objective. First get the sum of slacks.
    slack_sum = 0
    for i in range(len(x)):  # For each observation
        # First find the absolute distance between expected and observed.
        diff = abs(hypothesis(x[i], weights) - y[i])
        # Now subtract epsilon
        diff -= epsilon
        # If diff is still more than 0, then it is an 'outlier' and will have slack.
        slack = max(0, diff)
        # Add it to the slack sum
        slack_sum += slack
    # Now we have the slack_sum, so then multiply by C (I picked this as 1 arbitrarily)
    C = 1
    slack_term = C * slack_sum
    # Now, simply return the sum of the two terms, and we are done.
    return margin_term + slack_term
I got this function working on my computer with small data, and you may have to change it a little to work with your data if, for example, the arrays are structured differently, but the idea is there. Also, I am not the most proficient with Python, so this may not be the most efficient implementation, but my intent was to make it understandable.
Now, note that this just calculates the error/loss (whatever you want to call it). To actually minimize it requires going into Lagrangians and intense quadratic programming which is a much more daunting task. There are libraries available for doing this but if you want to do this library free as you are doing with this, I wish you good luck because doing that is not a walk in the park.
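As an illustration of the library route, scikit-learn's SVR solves this same formulation (via its dual); a minimal sketch on toy data follows, where the random data and the hyperparameter values are just placeholders of mine:

import numpy as np
from sklearn.svm import SVR

# Toy data: a noisy linear relationship (placeholder for your own x, y)
rng = np.random.RandomState(0)
x = rng.uniform(-5, 5, size=(100, 1))
y = 0.7 * x.ravel() + rng.normal(scale=0.2, size=100)

# C and epsilon play exactly the roles discussed above
model = SVR(kernel='linear', C=1.0, epsilon=0.1)
model.fit(x, y)

print("weights:", model.coef_, "bias:", model.intercept_)
print("number of support vectors:", len(model.support_))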
Finally, I would like to note that most of this information I got from notes I took in my ML class last year, and the professor (Dr. Abu-Mostafa) was a great help in learning the material. The lectures for this class are online (by the same prof), and the pertinent ones for this topic are here and here (although in my very biased opinion you should watch all the lectures, they were a great help). Leave a comment/question if you need anything cleared up or if you think I made a mistake somewhere. If you still don't understand, I can try to edit my answer to make more sense. Hope this helps!
I'm trying to compare the performance in Halide of a 2-pass, separable approach with an integral-image-based approach to box filtering, to gain a better understanding of Halide scheduling. I cannot find any examples of integral image creation in Halide where the integral image Func is used in the definition of a subsequent Func.
ImageParam input(type_of<uint8_t>(), 3, "image 1");
Var x("x"), y("y"), c("c"), xi("xi"), yi("yi");
Func ip("ip");
ip(x, y, c) = cast<float>(BoundaryConditions::repeat_edge(input)(x, y, c));
Param<int> radius("radius", 15, 1, 50);
RDom imageDomain(input);
RDom r(-radius, radius, -radius, radius);
// Make an integral image
Func integralImage = ip;
integralImage(x, imageDomain.y, c) += integralImage(x, imageDomain.y - 1, c);
integralImage(imageDomain.x, y, c) += integralImage(imageDomain.x - 1, y, c);
integralImage.compute_root(); // Come up with a better schedule for this
// Apply box filter to integral image
Func outputImage;
outputImage(x,y,c) = integralImage(x+radius,y+radius,c)
+ integralImage(x-radius,y-radius,c)
- integralImage(x-radius,y+radius,c)
- integralImage(x-radius,y+radius,c);
Expr normFactor = (2*radius+1) * (2*radius+1);
outputImage(x,y,c) = outputImage(x,y,c) / normFactor;
result(x,y,c) = cast<uint8_t>(outputImage(x,y,c));
result.parallel(y).vectorize(x, 8);
I did find the following code in the tests:
https://github.com/halide/Halide/blob/master/test/correctness/multi_pass_reduction.cpp
But this example uses realize to compute the integral image as a buffer over a fixed domain and doesn't consume the definition of integral image as a function in the definition of a subsequent function.
When I run this code, I observe that:
The computation of the integral image is extremely slow (it drops my pipeline to 0 fps).
I get an incorrect answer; I feel like I must be somehow misdefining my integral image.
I also have a related question: how would one best schedule the computation of an integral image in this type of scenario in Halide?
My problem was in the definition of my integral image. If I change my implementation to the standard one-pass definition of the integral image, I get the expected behavior:
Func integralImage;
integralImage(x,y,c) = 0.0f; // Pure definition
integralImage(intImDom.x,intImDom.y,c) = ip(intImDom.x,intImDom.y,c)
+ integralImage(intImDom.x-1,intImDom.y,c)
+ integralImage(intImDom.x,intImDom.y-1,c)
- integralImage(intImDom.x-1,intImDom.y-1,c);
integralImage.compute_root();
I still have remaining questions about the most efficient algorithm/schedule in Halide for computing an integral image, but I'll re-post that as a more specific question, as my current post was kind of open-ended.
As an aside, there is a second problem in the code above in that padding of the input image is not handled correctly.
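For anyone who wants to sanity-check the recurrence outside of Halide, here is a small NumPy sketch (mine, not Halide code) of the same one-pass integral image and the four-corner box sum it enables:

import numpy as np

def integral_image(img):
    # Same recurrence as above: I(x, y) = img(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1),
    # which is equivalent to a cumulative sum along both axes.
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x0, y0, x1, y1):
    # Sum of img[y0:y1+1, x0:x1+1] from four corner lookups on the integral image.
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.arange(25, dtype=np.float64).reshape(5, 5)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:4, 1:4].sum()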
I built a neural network and successfully trained it by using backpropagation with stochastic gradient descent. Now I'm switching to batch training but I'm a bit confused about when to apply momentum and weight decay.
I know fairly well how backpropagation works in theory; I'm just stuck on implementation details.
With the stochastic approach, all I had to do was apply the updates to the weights immediately after having computed the gradients, as in this pseudo-Python code:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.update_weights()
where the update_weights method is defined as follows:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.weights[h.index][o.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient
            self.weights[h.index][o.index] -= self.decay * self.weights[h.index][o.index]
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            self.weights[i.index][h.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient
            self.weights[i.index][h.index] -= self.decay * self.weights[i.index][h.index]
This works like a charm. (Note that there might be errors: I'm just using Python because it's more understandable; the actual net is coded in C. This code is just to show the steps I took to compute the updates.)
Now, switching to batch updates, the main algorithm should be something like:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.accumulate_updates()
    net.update_weights()
the accumulate method is as follows:
def accumulate_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.accumulator[h.index][o.index] += self.learning_rate * gradient
            # should I compute momentum here?
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            # should I just accumulate the gradient without scaling it by the learning rate here?
            self.accumulator[i.index][h.index] += self.learning_rate * gradient
            # should I compute momentum here?
and the update_weights is like this:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[h.index][o.index] += self.accumulator[h.index][o.index]
            self.accumulator[h.index][o.index] = 0.0
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[i.index][h.index] += self.accumulator[i.index][h.index]
            self.accumulator[i.index][h.index] = 0.0
I'm not sure if I have to:
1) scale the gradient by the learning rate at the time of accumulation or at the time of update
2) apply momentum at the time of accumulation or at the time of update
3) same as 2), but for weight decay
Can somebody help me solve this issue?
I'm sorry for the long question, but I thought being detailed would explain my doubts better.
Just a quick comment on this. Stochastic gradient descent leads most of the time to a non-smooth optimization, and it requires sequential optimization that does not suit current technology advances such as parallel computation.
As such, the mini-batch approach tries to combine the advantages of stochastic optimization with the advantages of batch optimization (parallel computation). Here what you do is create small training blocks which you give in a parallel fashion to the learning algorithm. At the end, each worker should tell you the error on its training sample, which you can average and use as in normal stochastic gradient descent.
This approach leads to a much smoother optimization, and probably to a quicker optimization if you make use of parallel computing.
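A rough sketch of that idea in the question's own pseudocode style (sequential here, but each batch, or its accumulated update, could equally be handed to a parallel worker and averaged; the driver function and its arguments are my own):

def train_minibatch(net, patterns, batch_size, epochs):
    # Accumulate gradients over one mini-batch, then apply a single
    # weight update per batch.
    for epoch in range(epochs):
        for start in range(0, len(patterns), batch_size):
            for p in patterns[start:start + batch_size]:
                outputs = net.feedforward(p.inputs)
                net.backpropagate(outputs, p.targets)
                net.accumulate_updates()
            net.update_weights()   # momentum / decay applied once here, at update time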
It seems that for the first question either is fine. But if you want to combine it with momentum, you had better check the original formula in your implementation. I would say you should not scale the gradient during the accumulation. When computing momentum, use the formula:
v_{t+1} = \mu v_t - \alpha * g_t
where g_t is the gradient and alpha is the learning rate.
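Putting that together with the question's pseudocode, the update step could look roughly like this (a sketch only: 'velocity' is a hypothetical per-weight momentum buffer, 'num_patterns' a hypothetical batch-size attribute, and 'accumulator' now holds only the raw gradients summed over the batch; since the original SGD code adds h.output * o.error to the weights, the minus sign in the formula above becomes a plus here):

def update_weights(self):
    # Addressing 1)-3): scale by the learning rate, apply momentum and
    # apply weight decay here, once per batch, at update time.
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            g = self.accumulator[h.index][o.index] / self.num_patterns  # mean batch gradient
            v = self.momentum * self.velocity[h.index][o.index] + self.learning_rate * g
            self.velocity[h.index][o.index] = v
            self.weights[h.index][o.index] += v
            self.weights[h.index][o.index] -= self.decay * self.weights[h.index][o.index]
            self.accumulator[h.index][o.index] = 0.0
    # ...and the same loop again for the input-to-hidden weights...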
I also recommend using AdaGrad and mini-batches instead of full batch.
Reference: http://firstprayer.github.io/stochastic-gradient-descent-tricks/
I have two sets of 3D points (original and reconstructed) and correspondence information about the pairs, i.e. which point in one set represents which in the other. I need to find the 3D translation and scaling factor that transforms the reconstructed set so that the sum of squared distances is minimal (rotation would be nice too, but the points are rotated similarly, so it is not the main priority and may be omitted for the sake of simplicity and speed). And so my question is: is this solved and available somewhere on the Internet? Personally, I would use the least squares method, but I don't have much time (and although I'm somewhat good at math, I don't use it often, so it would be better for me to avoid it), so I would like to use someone else's solution if one exists. I prefer a solution in C++, for example using OpenCV, but the algorithm alone is good enough.
If there is no such solution, I will calculate it by myself; I don't want to bother you too much.
SOLUTION: (from your answers)
For me it's the Kabsch algorithm.
Base info: http://en.wikipedia.org/wiki/Kabsch_algorithm
General solution: http://nghiaho.com/?page_id=671
STILL NOT SOLVED:
I also need the scale. The scale values from the SVD are not understandable to me; when I expect a scale of about 1-4 for all axes (my estimate), the SVD scale is about [2000, 200, 20], which is not helping at all.
Since you are already using the Kabsch algorithm, just have a look at Umeyama's paper, which extends it to get the scale. All you need to do is get the standard deviation of your points and calculate the scale as:
(1/sigma^2)*trace(D*S)
where D is the diagonal matrix from the SVD decomposition in the rotation estimation and S is either the identity matrix or a [1 1 -1] diagonal matrix, depending on the sign of the determinant of UV (which Kabsch uses to correct reflections into proper rotations). So if you have [2000, 200, 20], multiply the last element by +-1 (depending on the sign of the determinant of UV), sum the elements, and divide by the squared standard deviation (sigma^2) of your points to get the scale.
You can recycle the following code, which uses the Eigen library:
typedef Eigen::Matrix<double, 3, 1, Eigen::DontAlign> Vector3d_U; // microsoft's 32-bit compiler can't put Eigen::Vector3d inside a std::vector. for other compilers or for 64-bit, feel free to replace this by Eigen::Vector3d
/**
 * @brief rigidly aligns two sets of poses
 *
 * This calculates such a relative pose <tt>R, t</tt>, such that:
 *
 * @code
 * _TyVector v_pose = R * r_vertices[i] + t;
 * double f_error = (r_tar_vertices[i] - v_pose).squaredNorm();
 * @endcode
 *
 * The sum of squared errors in <tt>f_error</tt> for each <tt>i</tt> is minimized.
 *
 * @param[in] r_vertices is a set of vertices to be aligned
 * @param[in] r_tar_vertices is a set of vertices to align to
 *
 * @return Returns a relative pose that rigidly aligns the two given sets of poses.
 *
 * @note This requires the two sets of poses to have the corresponding vertices stored under the same index.
 */
static std::pair<Eigen::Matrix3d, Eigen::Vector3d> t_Align_Points(
    const std::vector<Vector3d_U> &r_vertices, const std::vector<Vector3d_U> &r_tar_vertices)
{
    _ASSERTE(r_tar_vertices.size() == r_vertices.size());
    const size_t n = r_vertices.size();

    Eigen::Vector3d v_center_tar3 = Eigen::Vector3d::Zero(), v_center3 = Eigen::Vector3d::Zero();
    for(size_t i = 0; i < n; ++ i) {
        v_center_tar3 += r_tar_vertices[i];
        v_center3 += r_vertices[i];
    }
    v_center_tar3 /= double(n);
    v_center3 /= double(n);
    // calculate centers of positions, potentially extend to 3D

    double f_sd2_tar = 0, f_sd2 = 0; // only one of those is really needed
    Eigen::Matrix3d t_cov = Eigen::Matrix3d::Zero();
    for(size_t i = 0; i < n; ++ i) {
        Eigen::Vector3d v_vert_i_tar = r_tar_vertices[i] - v_center_tar3;
        Eigen::Vector3d v_vert_i = r_vertices[i] - v_center3;
        // get both vertices

        f_sd2 += v_vert_i.squaredNorm();
        f_sd2_tar += v_vert_i_tar.squaredNorm();
        // accumulate squared standard deviation (only one of those is really needed)

        t_cov.noalias() += v_vert_i * v_vert_i_tar.transpose();
        // accumulate covariance
    }
    // calculate the covariance matrix

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(t_cov, Eigen::ComputeFullU | Eigen::ComputeFullV);
    // calculate the SVD

    Eigen::Matrix3d R = svd.matrixV() * svd.matrixU().transpose();
    // compute the rotation

    double f_det = R.determinant();
    Eigen::Vector3d e(1, 1, (f_det < 0)? -1 : 1);
    // calculate determinant of V*U^T to disambiguate rotation sign

    if(f_det < 0)
        R.noalias() = svd.matrixV() * e.asDiagonal() * svd.matrixU().transpose();
    // recompute the rotation part if the determinant was negative

    R = Eigen::Quaterniond(R).normalized().toRotationMatrix();
    // renormalize the rotation (not needed but gives slightly more orthogonal transformations)

    double f_scale = svd.singularValues().dot(e) / f_sd2_tar;
    double f_inv_scale = svd.singularValues().dot(e) / f_sd2; // only one of those is needed
    // calculate the scale

    R *= f_inv_scale;
    // apply scale

    Eigen::Vector3d t = v_center_tar3 - (R * v_center3); // R needs to contain scale here, otherwise the translation is wrong
    // want to align center with ground truth

    return std::make_pair(R, t); // or put it in a single 4x4 matrix if you like
}
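If you would like to cross-check the Eigen routine with something shorter, here is a rough NumPy sketch of the same similarity estimate (my own code and function name; src and tar are assumed to be N x 3 arrays with corresponding rows):

import numpy as np

def align_points(src, tar):
    # Estimate scale s, rotation R, translation t with  tar[i] ~ s * R @ src[i] + t
    # (Umeyama's method; reflections are corrected as in Kabsch).
    mu_src, mu_tar = src.mean(axis=0), tar.mean(axis=0)
    src_c, tar_c = src - mu_src, tar - mu_tar
    cov = tar_c.T @ src_c / len(src)             # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                           # correct a reflection into a rotation
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)      # sigma^2 of the source points
    s = np.trace(np.diag(D) @ S) / var_src       # (1/sigma^2) * trace(D*S), as above
    t = mu_tar - s * R @ mu_src
    return s, R, t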
For 3D points the problem is known as the Absolute Orientation problem. A C++ implementation is available from Eigen (http://eigen.tuxfamily.org/dox/group__Geometry__Module.html#gab3f5a82a24490b936f8694cf8fef8e60) and the paper is at http://web.stanford.edu/class/cs273/refs/umeyama.pdf
You can use it via OpenCV by converting the matrices to Eigen with cv::cv2eigen() calls.
Start by translating both sets of points so that their centroids coincide with the origin of the coordinate system. The translation vector is just the difference between these centroids.
Now we have two sets of coordinates represented as matrices P and Q. One set of points may be obtained from the other by applying some linear operator (which performs both scaling and rotation). This operator is represented by a 3x3 matrix X:
P * X = Q
To find proper scale/rotation we just need to solve this matrix equation, find X, then decompose it into several matrices, each representing some scaling or rotation.
A simple (but probably not numerically stable) way to solve it is to multiply both sides of the equation by the transpose P^T (to get rid of the non-square matrices), and then multiply both sides by the inverse of P^T * P:
P^T * P * X = P^T * Q
X = (P^T * P)^(-1) * P^T * Q
Applying singular value decomposition to matrix X gives two rotation matrices and a matrix with scale factors:
X = U * S * V
Here S is a diagonal matrix with scale factors (one scale for each coordinate), and U and V are rotation matrices: one properly rotates the points so that they may be scaled along the coordinate axes, the other rotates them once more to align their orientation with the second set of points.
Example (2D points are used for simplicity):
P =  1  2      Q =   7.5391   4.3455
     2  3            12.9796   5.8897
    -2  1            -4.5847   5.3159
    -1 -6           -15.9340 -15.5511
After solving the equation:
X = 3.3417 -1.2573
    2.0987  2.8014
After SVD decomposition:
U = -0.7317 -0.6816
    -0.6816  0.7317
S =  4  0
     0  3
V = -0.9689 -0.2474
    -0.2474  0.9689
Here SVD has properly reconstructed all manipulations I performed on matrix P to get matrix Q: rotate by the angle 0.75, scale X axis by 4, scale Y axis by 3, rotate by the angle -0.25.
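The same steps can be reproduced in a few lines of NumPy using the example numbers above (note that NumPy's SVD may distribute the signs between U and V differently than shown):

import numpy as np

# the example above, reproduced numerically (both centroids are already at the origin)
P = np.array([[ 1.0,  2.0], [ 2.0,  3.0], [-2.0,  1.0], [-1.0, -6.0]])
Q = np.array([[  7.5391,   4.3455],
              [ 12.9796,   5.8897],
              [ -4.5847,   5.3159],
              [-15.9340, -15.5511]])

# least-squares solution of P * X = Q, i.e. X = (P^T * P)^(-1) * P^T * Q
X, *_ = np.linalg.lstsq(P, Q, rcond=None)
U, S, Vt = np.linalg.svd(X)
print(X)   # ~ [[3.34, -1.26], [2.10, 2.80]]
print(S)   # ~ [4, 3], the two scale factors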
If the sets of points are scaled uniformly (the scale factor is equal for each axis), this procedure may be significantly simplified.
Just use the Kabsch algorithm to get the translation/rotation values. Then apply this translation and rotation (the centroids should coincide with the origin of the coordinate system). Then, for each pair of points (and for each coordinate), estimate a linear regression. The linear regression coefficient is exactly the scale factor.
A good explanation: Finding optimal rotation and translation between corresponding 3D points
The code is in MATLAB but it's trivial to convert to OpenCV using the cv::SVD function.
You might want to try ICP (Iterative Closest Point).
Given two sets of 3D points, it will tell you the transformation (rotation + translation) to go from the first set to the second one.
If you're interested in a lightweight C++ implementation, try libicp.
Good luck!
The general transformation, as well as the scale, can be retrieved via Procrustes analysis. It works by superimposing the objects on top of each other and estimating the transformation from that setting. It has been used in the context of ICP many times. In fact, your preference, the Kabsch algorithm, is a special case of this.
Moreover, Horn's alignment algorithm (based on quaternions) also finds a very good solution, while being quite efficient. A MATLAB implementation is also available.
Scale can be inferred without the SVD if your points are uniformly scaled in all directions (I could not make sense of the SVD's scale matrix either). Here is how I solved the same problem:
Measure distances of each point to other points in the point cloud to get a 2d table of distances, where entry at (i,j) is norm(point_i-point_j). Do the same thing for the other point cloud, so you get two tables -- one for original and the other for reconstructed points.
Divide all values in one table by the corresponding values in the other table. Because the points correspond to each other, the distances do too. Ideally, the resulting table would have all values equal to each other, and that value is the scale.
The median value of the divisions should be pretty close to the scale you are looking for. The mean value is also close, but I chose median just to exclude outliers.
Now you can use the scale value to scale all the reconstructed points and then proceed to estimating the rotation.
Tip: If there are too many points in the point clouds to find distances between all of them, then a smaller subset of distances will work, too, as long as it is the same subset for both point clouds. Ideally, just one distance pair would work if there is no measurement noise, e.g when one point cloud is directly derived from the other by just rotating it.
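A compact NumPy version of this idea might look as follows (my sketch and function name; for large clouds, restrict i and j to a sampled subset of pairs as noted in the tip):

import numpy as np

def estimate_scale(original, reconstructed):
    # original, reconstructed: N x 3 arrays of corresponding points
    i, j = np.triu_indices(len(original), k=1)        # every pair of points once
    d_orig = np.linalg.norm(original[i] - original[j], axis=1)
    d_rec = np.linalg.norm(reconstructed[i] - reconstructed[j], axis=1)
    ratio = d_orig[d_rec > 0] / d_rec[d_rec > 0]      # guard against duplicate points
    return np.median(ratio)                           # median of the distance ratios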
You can also use the ScaleRatio ICP proposed by BaoweiLin. The code can be found on GitHub.