I am attempting to translate the following python function into C++:
import numpy as np
from scipy.linalg import blas
def scaled_dist(a, b, ls):
    al = a/ls
    bl = b/ls
    tmp1 = np.sum(al**2, axis=1)
    tmp2 = np.sum(bl**2, axis=1)
    tmp3 = np.add.outer(tmp1, tmp2, order='F')
    tau = blas.dgemm(a=al, b=bl, alpha=-2.0, c=tmp3, beta=1, trans_b=1)
    np.clip(tau, 0, np.inf, out=tau)
    return tau
However, I have hit a stumbling block with the line:
tmp3 = np.add.outer(tmp1, tmp2)
My C++ code compiles, but it encounters a runtime error when executed. The code (up to that line) is:
Eigen::MatrixXd test2(const Eigen::MatrixXd &x1, const Eigen::MatrixXd &x2, const Eigen::VectorXd &vec)
{
    Eigen::MatrixXd r = Eigen::MatrixXd::Zero(x1.rows(), x2.rows());
    Eigen::MatrixXd al = x1.array().rowwise() / vec.transpose().array();
    Eigen::VectorXd tmp1 = al.array().square().rowwise().sum();
    Eigen::MatrixXd bl = x2.array().rowwise() / vec.transpose().array();
    Eigen::VectorXd tmp2 = bl.array().square().rowwise().sum();
    r = tmp1.transpose().array() + tmp2.array();   // runtime assertion fails here
    return r;
}
I am able to make sense of the runtime error, which is (I believe) an assertion failure complaining that the left- and right-hand sides of the addition do not match in size. My approach was motivated by the fact that tmp1.transpose() * tmp2 does appear to produce the expected result.
My question is as follows:
Given two vectors, vec1 and vec2, what is the idiomatic way using Eigen of achieving the same functionality as numpy.add.outer(vec1, vec2), namely an "outer" addition whereby a matrix is obtained by adding the (broadcast) rows of one vector to the (broadcast) columns of the other? i.e., if
vec1 = [1,2,3]
vec2 = [3,4,5]
then
outer_add(vec1, vec2) =
[4, 5, 6]
[5, 6, 7]
[6, 7, 8]
You can use replicate for that, e.g.:
Vector3f v1(1,2,3), v2(3,4,5);
MatrixXf r = v1.rowwise().replicate(v2.size())
+ v2.transpose().colwise().replicate(v1.size());
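To connect this back to the original translation, here is a minimal sketch of how the same replicate trick could slot into a dynamic-size version of the function from the question. The helper name outer_add and the rest of the wiring are my own (untested) additions rather than part of the answer above; the final matrix product stands in for the dgemm call in the Python code.

#include <Eigen/Dense>

// Outer addition: result(i, j) = v1(i) + v2(j), the analogue of np.add.outer(v1, v2).
Eigen::MatrixXd outer_add(const Eigen::VectorXd &v1, const Eigen::VectorXd &v2)
{
    return v1.rowwise().replicate(v2.size())
           + v2.transpose().colwise().replicate(v1.size());
}

// Hypothetical full translation of the Python scaled_dist, for illustration only.
Eigen::MatrixXd scaled_dist(const Eigen::MatrixXd &x1, const Eigen::MatrixXd &x2, const Eigen::VectorXd &ls)
{
    Eigen::MatrixXd al = x1.array().rowwise() / ls.transpose().array();
    Eigen::MatrixXd bl = x2.array().rowwise() / ls.transpose().array();
    Eigen::VectorXd tmp1 = al.array().square().rowwise().sum();
    Eigen::VectorXd tmp2 = bl.array().square().rowwise().sum();
    // tau(i, j) = |al_i|^2 + |bl_j|^2 - 2 * al_i . bl_j, clamped at zero like np.clip.
    Eigen::MatrixXd tau = outer_add(tmp1, tmp2) - 2.0 * al * bl.transpose();
    return tau.cwiseMax(0.0);
}

The outer_add(tmp1, tmp2) call is exactly the line that triggered the assertion failure in the original test2, now expressed with the two replicate calls from the answer.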
I'm trying to get a grasp on declarative programming, so I've started learning Sympy and my "hello world" is to try to represent a standard MLP by converting a vanilla numpy implementation I found online. I am getting stuck trying to add the bias vector. Is there a differentiable way to do this operation in Sympy?
#! /usr/bin/env python3
import numpy as np
import random
import sympy as sp
i = 3
o = 1
x = sp.Symbol('x')
w = sp.Symbol('w')
b = sp.Symbol('b')
y = sp.Symbol('y')
Φ = sp.tanh # activation function
mlp = Φ(x*w+b)
L = lambda a, e: a - e # loss function
C = L(mlp, y)
dC = sp.diff(C, w) # partial deriv of C with respect to each weight
η = 0.01 # learning rate
if __name__ == "__main__":
    random.seed(1)
    train_inputs = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
    train_outputs = np.array([[0, 1, 1, 0]]).T
    W = 2 * np.random.rand(i, o) - 1 # TODO parameterize initialization
    W = sp.Matrix(W)
    B = 2 * np.random.rand(1, o) - 1 # TODO parameterize initialization
    B = sp.Matrix(B)
    for temp, ye in zip(train_inputs, train_outputs):
        X = sp.Matrix(temp)
        ya = mlp.subs({'x':X, 'w':W, 'b':B}).n()
        Δ = dC.subs({'y':ye, 'x':X, 'b':B, 'w':W}).n()
        W -= η * Δ
        b -= η * Δ
I am replicating in Julia a sequence of steps originally written in Matlab. In Octave, this procedure takes 1.4582 seconds, while in Julia (using Jupyter) it takes approximately 10 seconds. I'll try to be brief in the scripts. My goal is to match or improve on Octave's performance. First of all, I will describe my variables and some functions:
zgrid (double 1x7 size)
kgrid (double 500x1 size)
V0 (double 500x7 size)
P (double 7x7 size) a transition matrix
delta and beta are fixed parameters.
F(z,k) and u(c) are particular functions and are specified in the Julia script.
% Octave script
% V0 is given
[K, Z, K2] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
K2 = permute(K2, [2, 1, 3]);
C = max(f(Z,K) + (1-delta)*K - K2,0);
U = u(C);
EV = V0*P';% EV is a 500x7 matrix size
EV = permute(repmat(EV, 1, 1, 500), [3, 2, 1]);
H = U + beta*EV;
[TV, index] = max(H, [], 3);
In Julia, I created a function that replicates this procedure. I used loops, but it runs about 9 times slower.
# Julia script
# V0 is the input of my T operator function
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
F = (z,k) -> exp(z)*(k^α);
u = (c) -> (c^(1-μ) - 1)/(1-μ)
# parameters
α = 1/3
β = 0.987
δ = 0.012;
μ = 2
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0, 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
function T(V)
    E = V*P'
    T1 = zeros(Float64, 500, 7)
    aux = zeros(Float64, 500)
    for i = 1:7
        for j = 1:500
            for l = 1:500
                c = maximum((F(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j] - kgrid[l], 0))
                aux[l] = u(c) + β*E[l,i]
            end
            T1[j,i] = maximum(aux)
        end
    end
    return T1
end
I would very much like to improve my performance in Julia. I believe there is a way to improve it, but I am new to Julia programming.
This code runs for me in 5ms. Note that I have made F and u into proper (not anonymous) functions, F_ and u_, but you could get a similar effect by making the anonymous functions const.
Your main problem is that you have a lot of non-const global variables, and also that your main function is doing unnecessary work multiple times, and creating an unnecessary array, aux.
The performance tips section in the manual is essential reading: https://docs.julialang.org/en/v1/manual/performance-tips/
F_(z,k) = exp(z) * (k^(1/3)); # you can still use α, but it must be const
u_(c) = (c^(1-2) - 1)/(1-2)
function T_(V, P, kgrid, zgrid, β, δ)
    E = V * P'
    T1 = similar(V)
    for i in axes(T1, 2)
        for j in axes(T1, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            T1[j,i] = aux
        end
    end
    return T1
end
Benchmark:
V0 = repeat(sqrt.(kgrid), outer = [1,7]);
zgrid = sort!(rand(1, 7); dims=2)
kgrid = sort!(rand(500, 1); dims=1)
P = rand(length(zgrid), length(zgrid))
using BenchmarkTools
@btime T_($V0, $P, $kgrid, $zgrid, $β, $δ);
# output: 5.126 ms (4 allocations: 54.91 KiB)
The following should perform much better. The most noticeable differences are that it calls F 500 times less often and doesn't rely on global variables.
function T(V, P, kgrid, zgrid, β, δ)
    E = V*P'
    T1 = zeros(Float64, 500, 7)
    for j = 1:500
        for i = 1:7
            x = F(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            T1[j,i] = maximum(u(max(x - kgrid[l], 0)) + β*E[l,i] for l in 1:500)
        end
    end
    return T1
end
For my simulation I need to calculate many transformation matrices, so I would like to vectorize a for-loop that I'm using right now.
Is there a way to vectorize the existing for-loop, or do I need a different approach to calculating the vectors and matrices beforehand?
I prepared a little working example:
n_dim = 1e5;
p1_3 = zeros(3,n_dim); % translation vector (no trans.) [3x100000]
tx = ones(1,n_dim)*15./180*pi; % turn angle around x-axis (fixed) [1x100000]
ty = zeros(1,n_dim); % turn angle around y-axis (no turn) [1x100000]
tz = randi([-180 180], 1, n_dim)./180*pi; % turn angle around z-axis (different turn) [1x100000]
hom = [0 0 0 1].*ones(n_dim,4); % vector needed for homogeneous transformation [100000x4]
% calculate sine/cosine values for rotation [100000x1 each]
cx = cos(tx)';
sx = sin(tx)';
cy = cos(ty)';
sy = sin(ty)';
cz = cos(tz)';
sz = sin(tz)';
% calculate rotation matrix [300000x3]
R_full = [ cy.*cz, -cy.*sz, sy; ...
cx.*sz+sx.*sy.*cz, cx.*cz-sx.*sy.*sz, -sx.*cy; ...
sx.*sz-cx.*sy.*cz, cz.*sx+cx.*sy.*sz, cx.*cy];
% preallocate transformation tensor
T = zeros(4,4,n_dim);
% create transformation tensor here
% T = [R11 R12 R13 p1;
% R21 R22 R23 p2;
% R31 R32 R33 p3;
% 0 0 0 1]
tic
for i = 1:n_dim
T(:,:,i) = [[R_full(i,1), R_full(i,2), R_full(i,3); ...
R_full(n_dim+i,1), R_full(n_dim+i,2), R_full(n_dim+i,3); ...
R_full(2*n_dim+i,1), R_full(2*n_dim+i,2), R_full(2*n_dim+i,3)], p1_3(:,i);
hom(i,:)];
end
toc
Try this:
T = permute(reshape(R_full,n_dim,3,3),[2,3,1]);
T(4,4,:) = 1;
Your method:
Elapsed time is 0.839315 seconds.
This method:
Elapsed time is 0.015389 seconds.
EDIT
I included Florian's answer, and of course he wins.
Are you ready for some crazy indexing foo? Here we go:
clear all;
close all;
clc;
n_dim_max = 200;
t_loop = zeros(n_dim_max, 1);
t_indexing = t_loop;
t_permute = t_loop;
fprintf("---------------------------------------------------------------\n");
for n_dim = 1:n_dim_max
p1_3 = zeros(3,n_dim); % translation vector (no trans.) [3x100000]
tx = ones(1,n_dim)*15./180*pi; % turn angle around x-axis (fixed) [1x100000]
ty = zeros(1,n_dim); % turn angle around y-axis (no turn) [1x100000]
tz = randi([-180 180], 1, n_dim)./180*pi; % turn angle around z-axis (different turn) [1x100000]
hom = [0 0 0 1].*ones(n_dim,4); % vector needed for homogeneous transformation [100000x4]
% calculate sine/cosine values for rotation [100000x1 each]
cx = cos(tx)';
sx = sin(tx)';
cy = cos(ty)';
sy = sin(ty)';
cz = cos(tz)';
sz = sin(tz)';
% calculate rotation matrix [300000x3]
R_full = [ cy.*cz, -cy.*sz, sy; ...
cx.*sz+sx.*sy.*cz, cx.*cz-sx.*sy.*sz, -sx.*cy; ...
sx.*sz-cx.*sy.*cz, cz.*sx+cx.*sy.*sz, cx.*cy];
% preallocate transformation tensor
T = zeros(4,4,n_dim);
% create transformation tensor here
% T = [R11 R12 R13 p1;
% R21 R22 R23 p2;
% R31 R32 R33 p3;
% 0 0 0 1]
tic
for i = 1:n_dim
T(:,:,i) = [[R_full(i,1), R_full(i,2), R_full(i,3); ...
R_full(n_dim+i,1), R_full(n_dim+i,2), R_full(n_dim+i,3); ...
R_full(2*n_dim+i,1), R_full(2*n_dim+i,2), R_full(2*n_dim+i,3)], p1_3(:,i);
hom(i,:)];
end
t_loop(n_dim) = toc;
tic
% preallocate transformation tensor
TT = zeros(4, 4);
TT(end) = 1;
TT = repmat(TT, 1, 1, n_dim);
% Crazy index finding.
temp = repmat(1:(3*n_dim):(3*3*n_dim), 3, 1) + n_dim .* ((0:2).' * ones(1, 3));
temp = repmat(temp, 1, 1, n_dim);
t = zeros(1, 1, n_dim);
t(:) = 0:(n_dim-1);
temp = temp + ones(3, 3, n_dim) .* t;
% Direct assignment using crazily found indices.
TT(1:3, 1:3, :) = R_full(temp);
t_indexing(n_dim) = toc;
tic
% preallocate transformation tensor
TTT = zeros(4, 4);
TTT(end) = 1;
TTT = repmat(TTT, 1, 1, n_dim);
TTT(1:3, 1:3, :) = permute(reshape(R_full, n_dim, 3, 3), [2, 3, 1]);
t_permute(n_dim) = toc;
% Check
fprintf("n_dim: %d\n", n_dim);
fprintf("T equals TT: %d\n", (sum(T(:) == TT(:))) == (4 * 4 * n_dim));
fprintf("T equals TTT: %d\n", (sum(T(:) == TTT(:))) == (4 * 4 * n_dim));
fprintf("---------------------------------------------------------------\n");
end
figure(1);
plot(1:n_dim_max, t_loop, 1:n_dim_max, t_indexing, 1:n_dim_max, t_permute);
legend({'Loop', 'Indexing', 'Permute'});
xlabel('Dimension');
ylabel('Elapsed time [s]');
Sorry, the script got lengthy because it's your initial solution, my solution (and Florian's solution), and the testing script all in one. A lazy Friday was the reason I didn't split things properly...
How did I get there? Simple "reverse engineering". I took your solution for n_dim = [2, 3, 4] and determined [~, ii] = ismember(T(1:3, 1:3, :), R_full), i.e. the mapping of R_full to T(1:3, 1:3, :). Then, I analyzed the indexing scheme, and found the proper solution to mimic that mapping for arbitrary n_dim. Done! ;-) Yes, I like crazy indexing stuff.
I am working on finding the initial points of convergence using Newton's iteration method in Mathematica. The newton function works; now I would like to show which initial points from a grid produce Newton iterations that converge to -1, and likewise for points that converge to (1 + 3^(1/2) i)/2, given that:
f(x) = x^3+1
newton[x0_] := (
x = x0;
a1 = {};
b1 = {};
c1 = {};
counter = 0;
error = Abs[f[x]];
While[counter < 20 && error > 0.0001,
If[f'[x] != 0, x = x - N[f[x]/f'[x]]];
counter = counter + 1;
error = Abs[f[x]]];
x)
I created a grid to show which initial points of a+bi converge to the roots.
grid = Table[a + b I, {a, -2, 2, 0.01}, {b, -2, 2, 0.01}];
Then I created a fractal, but whenever I plot it, I get a blank graph on the axes.
There has to be a way for me to identify the convergence points from the grid, but so far I have not been successful. I tried using Which[], but when comparing the values it returns False.
Any help would be appreciated.
Your code is not optimal, to put it mildly, but to give you a head start, why don't you start with something like this:
f[x_] := x^3 + 1;
newton[x0_] := (x = x0;
a1 = {};
b1 = {};
c1 = {};
counter = 0;
error = Abs[f[x]];
While[counter < 20 && error > 0.0001,
If[f'[x] != 0, x = x - N[f[x]/f'[x]]];
counter = counter + 1;
error = Abs[f[x]]];
{x, counter})
Table[Re@newton[a + b I], {a, -2, 2, 0.01}, {b, -2, 2, 0.01}] // Image
Can you do something like Python's yield statement in Mathematica, in order to create generators? See e.g. here for the concept.
Update
Here's an example of what I mean: iterating over all permutations using only O(n) space (algorithm as in Sedgewick's Algorithms book):
gen[f_, n_] := Module[{id = -1, val = Table[Null, {n}], visit},
visit[k_] := Module[{t},
id++; If[k != 0, val[[k]] = id];
If[id == n, f[val]];
Do[If[val[[t]] == Null, visit[t]], {t, 1, n}];
id--; val[[k]] = Null;];
visit[0];
]
Then call it like:
gen[Print, 3], which prints all 6 permutations of length 3.
As I have previously stated, using Compile will give faster code. Using an algorithm from the fxtbook, the following code generates the next permutation in lexicographic order:
PermutationIterator[f_, n_Integer?Positive, nextFunc_] :=
Module[{this = Range[n]},
While[this =!= {-1}, f[this]; this = nextFunc[n, this]];]
The following code assumes we run version 8:
ClearAll[cfNextPartition];
cfNextPartition[target : "MVM" | "C"] :=
cfNextPartition[target] =
Compile[{{n, _Integer}, {this, _Integer, 1}},
Module[{i = n, j = n, ni, next = this, r, s},
While[Part[next, --i] > Part[next, i + 1],
If[i == 1, i = 0; Break[]]];
If[i == 0, {-1}, ni = Part[next, i];
While[ni > Part[next, j], --j];
next[[i]] = Part[next, j]; next[[j]] = ni;
r = n; s = i + 1;
While[r > s, ni = Part[next, r]; next[[r]] = Part[next, s];
next[[s]] = ni; --r; ++s];
next
]], RuntimeOptions -> "Speed", CompilationTarget -> target
];
Then
In[75]:= Reap[PermutationIterator[Sow, 4, cfNextPartition["C"]]][[2, 1]] === Permutations[Range[4]]
Out[75]= True
This is clearly better in performance than the original gen function.
In[83]:= gen[dummy, 9] // Timing
Out[83]= {26.067, Null}
In[84]:= PermutationIterator[dummy, 9, cfNextPartition["C"]] // Timing
Out[84]= {1.03, Null}
Using Mathematica's virtual machine is not much slower:
In[85]:= PermutationIterator[dummy, 9, cfNextPartition["MVM"]] // Timing
Out[85]= {1.154, Null}
Of course this is nowhere near a C implementation, yet it provides a substantial speed-up over pure top-level code.
You probably mean the question to be more general, but the example of iterating over permutations given on the page you link to happens to be built into Mathematica:
Scan[Print, Permutations[{1, 2, 3}]]
The Print there can be replaced with any function.