Could you please explain the low performance of ffi.cast in the following snippet?
prof = require 'profile'
local ffi = require("ffi")
ffi.cdef[[
struct message {
int field_a;
};
]]
function cast_test1()
  bytes = ffi.new("char[100000000]")
  sum = 0
  t1 = prof.rdtsc()
  for i=1,1000000 do
    sum = sum + i
  end
  t2 = prof.rdtsc()
  print("test1", tonumber(t2-t1))
end
function cast_test2()
  bytes = ffi.new("char[100000000]")
  sum = 0
  t1 = prof.rdtsc()
  for i=1,1000000 do
    sum = sum + i
    msg = ffi.cast("struct message *", bytes+ i * 16)
    -- msg.field_a = i
  end
  t2 = prof.rdtsc()
  print("test2", tonumber(t2-t1))
end
cast_test1()
cast_test2()
Looks like the loop with the cast runs about 30 times slower. Any ideas how to overcome this?
% luajit -v cast_tests.lua
LuaJIT 2.0.3 -- Copyright (C) 2005-2014 Mike Pall. http://luajit.org/
test1 3227528
test2 94474000
It looks like the global msg variable was the main culprit. Replacing it with a local gives a 20x speedup :)
This applies to both LuaJIT 2.0.3 and LuaJIT 2.1.
function cast_test3()
  local bytes = ffi.new("char[100000000]")
  local sum = 0
  local t1 = prof.rdtsc()
  for i=1,1000000 do
    sum = sum + i
    local msg = ffi.cast("struct message *", bytes+ i * 4)
    msg.field_a = i
  end
  local t2 = prof.rdtsc()
  local sum2 = 0
  for i=1,1000000 do
    local msg = ffi.cast("struct message *", bytes+ i * 4)
    sum2 = sum2 + msg.field_a
  end
  local t3 = prof.rdtsc()
  print(sum, sum2)
  print("test3", tonumber(t2-t1), tonumber(t3-t2))
end
cast_test3()
Results:
% /usr/bin/luajit -v cast_tests.lua ~/Projects/lua_tests/lua_rdtsc
LuaJIT 2.0.3 -- Copyright (C) 2005-2014 Mike Pall. http://luajit.org/
500000500000 500000500000
test3 4502508 4850884
prob = pulp.LpProblem('C and T', pulp.LpMaximize)
C = pulp.LpVariable("C", lowBound = 0, cat = pulp.LpInteger )
T = pulp.LpVariable('T', lowBound = 0, cat = pulp.LpInteger)
prob += 25*T + 10*C
prob += 5*T + 2 *C <= 30
prob += C >= 3*T
prob.solve()
After solving this optimization problem, I want to see the values of the variables. What is the command for this?
print(f"C={C.value()},T={T.value()}")
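As a cross-check on the values that C.value() and T.value() print, the model is small enough to enumerate by hand. The sketch below uses plain Python rather than PuLP; the helper name `brute_force` and the search bounds are assumptions made for illustration (the bounds follow from 5*T + 2*C <= 30 with both variables non-negative).

```python
def brute_force():
    """Enumerate all feasible integer (C, T) pairs and return the optima."""
    best = None
    optima = []
    # 5*T + 2*C <= 30 with C, T >= 0 implies T <= 6 and C <= 15.
    for T in range(0, 7):
        for C in range(0, 16):
            if 5 * T + 2 * C <= 30 and C >= 3 * T:
                objective = 25 * T + 10 * C
                if best is None or objective > best:
                    best, optima = objective, [(C, T)]
                elif objective == best:
                    optima.append((C, T))
    return best, optima


best, optima = brute_force()
print(best, optima)
```

The objective is 5 times the left-hand side of the first constraint, so more than one (C, T) pair reaches the maximum, and the solver may report any of them.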
I am doing some tests to see where I can improve the performance of my lua code.
I was reading this document: https://www.lua.org/gems/sample.pdf
and I thought using integers as table indices should be considerably faster since it uses the array part of tables and does not require hashing.
So I've written this test program:
print('local x=0 local y=0 local z=0')
local x=0 local y=0 local z=0
t0 = os.clock()
for i=1,1e7 do
x = 1
y = 2
z = 3
end
print(os.clock()-t0 .. "\n")
print("tab = {1,2,3}")
tab = {1,2,3}
t0 = os.clock()
for i=1,1e7 do
tab[1] = 1
tab[2] = 2
tab[3] = 3
end
print(os.clock()-t0 .. "\n")
print("tab = {[1]=1,[2]=2,[3]=3}")
tab = {[1]=1,[2]=2,[3]=3}
t0 = os.clock()
for i=1,1e7 do
tab[1] = 1
tab[2] = 2
tab[3] = 3
end
print(os.clock()-t0 .. "\n")
print("tab = {a=1,b=2,c=3}")
tab = {a=1,b=2,c=3}
t0 = os.clock()
for i=1,1e7 do
tab.a = 1
tab.b = 2
tab.c = 3
end
print(os.clock()-t0 .. "\n")
print('tab = {["bli"]=1,["bla"]=2,["blu"]=3}')
tab = {["bli"]=1,["bla"]=2,["blu"]=3}
t0 = os.clock()
for i=1,1e7 do
tab["bli"] = 1
tab["bla"] = 2
tab["blu"] = 3
end
print(os.clock()-t0 .. "\n")
print("tab = {verylongfieldname=1,anotherevenlongerfieldname=2,superincrediblylongfieldname=3}")
tab = {verylongfieldname=1,anotherevenlongerfieldname=2,superincrediblylongfieldname=3}
t0 = os.clock()
for i=1,1e7 do
tab.verylongfieldname = 1
tab.anotherevenlongerfieldname = 2
tab.superincrediblylongfieldname = 3
end
print(os.clock()-t0 .. "\n")
print('local f = function(p1, p2, p3)')
local f = function(p1, p2, p3)
x = p1
y = p2
z = p3
return x,y,z
end
local a=0
local b=0
local c=0
t0 = os.clock()
for i=1,1e7 do
a,b,c = f(1,2,3)
end
print(os.clock()-t0 .. "\n")
print('local g = function(params)')
local g = function(params)
x = params.p1
y = params.p2
z = params.p3
return {x,y,z}
end
t0 = os.clock()
for i=1,1e7 do
t = g{p1=1, p2=2, p3=3}
end
print(os.clock()-t0 .. "\n")
I've ordered the blocks by what I expected to be increasing time consumption. (I wasn't sure about the function calls, that was just a test.) But here are the surprising results:
local x=0 local y=0 local z=0
0.093613
tab = {1,2,3}
0.678514
tab = {[1]=1,[2]=2,[3]=3}
0.83678
tab = {a=1,b=2,c=3}
0.62888
tab = {["bli"]=1,["bla"]=2,["blu"]=3}
0.733916
tab = {verylongfieldname=1,anotherevenlongerfieldname=2,superincrediblylongfieldname=3}
0.536726
local f = function(p1, p2, p3)
0.475592
local g = function(params)
3.576475
And even the long field names that should cause the longest hashing process are faster than array accessing with integers. Am I doing something wrong?
The 6th page (actual page 20) of the document you linked explains what you are seeing.
If you write something like {[1] = true, [2] = true, [3] = true}, however, Lua is not smart enough to detect that the given expressions (literal numbers, in this case) describe array indices, so it creates a table with four slots in its hash part, wasting memory and CPU time.
You only gain a major benefit from the array part when you construct the table without explicit keys.
table = {1,2,3}
If you are reading from or writing to a table or array that already exists, you will not see a large difference in processing time.
The example in the document includes the creation of the table in the for loop:
for i = 1, 1000000 do
local a = {true, true, true}
a[1] = 1; a[2] = 2; a[3] = 3
end
Results with all local variables inside the loops. Edit: Lengthened long string to 40 bytes as pointed out by siffiejoe
local x=0 local y=0 local z=0
0.18
tab = {1,2,3}
3.089
tab = {[1]=1,[2]=2,[3]=3}
4.59
tab = {a=1,b=2,c=3}
3.79
tab = {["bli"]=1,["bla"]=2,["blu"]=3}
3.967
tab = {verylongfieldnameverylongfieldnameverylongfieldname=1,anotherevenlongerfieldnameanotherevenlongerfieldname=2,superincrediblylongfieldnamesuperincrediblylongfieldname=3}
4.013
local f = function(p1, p2, p3)
1.238
local g = function(params)
6.325
Additionally, Lua performs the hash differently for different key types.
The source code can be viewed in ltable.c from Lua 5.2.4, which contains the code I will be discussing.
The mainposition function decides which hash to perform:
/*
** returns the `main' position of an element in a table (that is, the index
** of its hash value)
*/
static Node *mainposition (const Table *t, const TValue *key) {
  switch (ttype(key)) {
    case LUA_TNUMBER:
      return hashnum(t, nvalue(key));
    case LUA_TLNGSTR: {
      TString *s = rawtsvalue(key);
      if (s->tsv.extra == 0) {  /* no hash? */
        s->tsv.hash = luaS_hash(getstr(s), s->tsv.len, s->tsv.hash);
        s->tsv.extra = 1;  /* now it has its hash */
      }
      return hashstr(t, rawtsvalue(key));
    }
    case LUA_TSHRSTR:
      return hashstr(t, rawtsvalue(key));
    case LUA_TBOOLEAN:
      return hashboolean(t, bvalue(key));
    case LUA_TLIGHTUSERDATA:
      return hashpointer(t, pvalue(key));
    case LUA_TLCF:
      return hashpointer(t, fvalue(key));
    default:
      return hashpointer(t, gcvalue(key));
  }
}
When the key is a lua_Number, we call hashnum:
/*
** hash for lua_Numbers
*/
static Node *hashnum (const Table *t, lua_Number n) {
  int i;
  luai_hashnum(i, n);
  if (i < 0) {
    if (cast(unsigned int, i) == 0u - i)  /* use unsigned to avoid overflows */
      i = 0;  /* handle INT_MIN */
    i = -i;  /* must be a positive value */
  }
  return hashmod(t, i);
}
Here are the other hash implementations for the other types:
#define hashpow2(t,n) (gnode(t, lmod((n), sizenode(t))))
#define hashstr(t,str) hashpow2(t, (str)->tsv.hash)
#define hashboolean(t,p) hashpow2(t, p)
/*
** for some types, it is better to avoid modulus by power of 2, as
** they tend to have many 2 factors.
*/
#define hashmod(t,n) (gnode(t, ((n) % ((sizenode(t)-1)|1))))
#define hashpointer(t,p) hashmod(t, IntPoint(p))
These hashes resolve down to two paths, hashpow2 and hashmod: LUA_TNUMBER keys go through hashnum to hashmod, while LUA_TSHRSTR keys go through hashstr to hashpow2.
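The difference between the two paths can be seen with a little arithmetic. The sketch below is a Python illustration, not Lua's actual code: `hashpow2` and `hashmod` mirror the macros quoted above for a node array of `size` slots, assuming that `lmod(n, size)` reduces to `n & (size - 1)` for power-of-two sizes. The sample keys are made up.

```python
def hashpow2(n, size):
    # lmod for a power-of-two size: keep only the low bits.
    assert size & (size - 1) == 0, "size must be a power of two"
    return n & (size - 1)


def hashmod(n, size):
    # Modulus by an odd number, ((size - 1) | 1), as in the hashmod macro.
    return n % ((size - 1) | 1)


size = 8
keys = [8, 16, 24, 32]  # values with many factors of two
pow2_slots = [hashpow2(k, size) for k in keys]
mod_slots = [hashmod(k, size) for k in keys]
print(pow2_slots)
print(mod_slots)
```

With the power-of-two mask every one of these keys lands in slot 0, while the odd modulus spreads them over different slots, which is the reason the comment in ltable.c gives for using hashmod on such values.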
In the problem I'm working on there is a piece of code like the one shown below. The definitions are just there to show you the sizes of the arrays. Below it I pasted a vectorized version, and it is more than 2x slower. Why does this happen? I know that it happens if vectorization requires large temporary variables, but that does not seem to be the case here.
And generally, what (other than parfor, which I already use) can I do to speed up this code?
maxN = 100;
levels = maxN+1;
xElements = 101;
umn = complex(zeros(levels, levels));
umn2 = umn;
bessels = ones(xElements, xElements, levels); % 1.09 GB
posMcontainer = ones(xElements, xElements, maxN);
tic
for j = 1 : xElements
    for i = 1 : xElements
        for n = 1 : 2 : maxN
            nn = n + 1;
            mm = 1;
            for m = 1 : 2 : n
                umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
                mm = mm + 1;
            end
        end
    end
end
toc % 0.520594 seconds
tic
for j = 1 : xElements
    for i = 1 : xElements
        for n = 1 : 2 : maxN
            nn = n + 1;
            m = 1:2:n;
            numOfEl = ceil(n/2);
            umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
        end
    end
end
toc % 1.275926 seconds
sum(sum(umn-umn2)) % verifying that both versions give the same result
Best regards,
Alex
From the profiler:
Edit:
In reply to @Jason's answer, this alternative takes the same time:
for n = 1:2:maxN
    nn(n) = n + 1;
    numOfEl(n) = ceil(n/2);
end
for j = 1 : xElements
    for i = 1 : xElements
        for n = 1 : 2 : maxN
            umn2(nn(n), 1:numOfEl(n)) = bessels(i, j, nn(n)) * posMcontainer(i, j, 1:2:n);
        end
    end
end
Edit2:
In reply to @EBH:
The point is to do the following:
parfor i = 1 : xElements
    for j = 1 : xElements
        umn = complex(zeros(levels, levels)); % cleaning
        for n = 0:maxN
            mm = 1;
            for m = -n:2:n
                nn = n + 1; % for indexing
                if m < 0
                    umn(nn, mm) = bessels(i, j, nn) * negMcontainer(i, j, abs(m));
                end
                if m > 0
                    umn(nn, mm) = bessels(i, j, nn) * posMcontainer(i, j, m);
                end
                if m == 0
                    umn(nn, mm) = bessels(i, j, nn);
                end
                mm = mm + 1; % for indexing
            end % m
        end % n
        beta1 = sum(sum(Aj1.*umn));
        betaSumSq1(i, j) = abs(beta1).^2;
        beta2 = sum(sum(Aj2.*umn));
        betaSumSq2(i, j) = abs(beta2).^2;
    end % j
end % i
I sped it up as much as I was able to. What you have written takes only the last bessels and posMcontainer values, so it does not produce the same result. In the real code, those two containers are filled not with 1, but with precalculated values.
After your edit, I can see that umn is just a temporary variable for another calculation. It can still be mostly vectorized:
betaSumSq1 = zeros(xElements); % preallocating
betaSumSq2 = zeros(xElements); % preallocating
% an index matrix to fetch the right values from negMcontainer and
% posMcontainer:
indmat = tril(repmat([0 1;1 0],ceil((maxN+1)/2),floor(levels/2)));
indmat(end,:) = [];
% an index matrix to fetch the values in correct order for umn:
b_ind = repmat([1;0],ceil((maxN+1)/2),1);
b_ind(end) = [];
tempind = logical([fliplr(indmat) b_ind indmat+triu(ones(size(indmat)))]);
% permute the arrays to prevent squeeze:
PM = permute(posMcontainer,[3 1 2]);
NM = permute(negMcontainer,[3 1 2]);
B = permute(bessels,[3 1 2]);
for k = 1 : maxN+1 % third dim
    for jj = 1 : xElements % columns
        b = B(:,jj,k); % get one vector of B
        % perform b*NM for every row of NM*indmat, then flip the result:
        neg = fliplr(bsxfun(@times,bsxfun(@times,indmat,NM(:,jj,k).'),b));
        % perform b*PM for every row of PM*indmat:
        pos = bsxfun(@times,bsxfun(@times,indmat,PM(:,jj,k).'),b);
        temp = [neg mod(1:levels,2).'.*b pos].'; % concat neg and pos
        % assign them to the right place in umn:
        umn = reshape(temp(tempind.'),[levels levels]).';
        beta1 = Aj1.*umn;
        betaSumSq1(jj,k) = abs(sum(beta1(:))).^2;
        beta2 = Aj2.*umn;
        betaSumSq2(jj,k) = abs(sum(beta2(:))).^2;
    end
end
This reduces the running time from ~95 seconds to less than 3 seconds (both without parfor), an improvement of almost 97%.
I suspect it is memory allocation: you are re-allocating the m array inside a loop nested three deep.
Try rearranging the code:
tic
for n = 1 : 2 : maxN
    nn = n + 1;
    m = 1:2:n;
    numOfEl = ceil(n/2);
    for j = 1 : xElements
        for i = 1 : xElements
            umn2(nn, 1:numOfEl) = bessels(i, j, nn) * posMcontainer(i, j, m);
        end
    end
end
toc % 1.275926 seconds
I tried this in Igor Pro, which is a similar language but with different optimizations, so the direct translations don't time the same way as in Matlab (the vectorized version was slightly faster in Igor). But reordering the loops did speed up the vectorized form.
In the second part of your code, the one that sets umn2, you have this inside the loops:
nn = n + 1;
m = 1:2:n;
numOfEl = ceil(n/2);
Those 3 lines don't require any input from the i and j loops; they only use the n loop. So reordering the loops such that i and j are inside the n loop means that those 3 lines are executed xElements^2 (101^2) times less often. I suspect it is the m = 1:2:n line that takes the time, since it allocates an array.
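The effect of the reordering can be counted directly. The following Python sketch is an illustration, not a translation of the timing test: it counts how often the index vector is rebuilt in each loop order, with `build_m` as a hypothetical stand-in for `m = 1:2:n`.

```python
x_elements = 101
max_n = 100


def build_m(n, counter):
    # Stand-in for MATLAB's m = 1:2:n; counts each allocation.
    counter[0] += 1
    return list(range(1, n + 1, 2))


# Original order: i and j outside, n innermost.
inner = [0]
for j in range(x_elements):
    for i in range(x_elements):
        for n in range(1, max_n + 1, 2):
            m = build_m(n, inner)

# Reordered: n outermost, so m is built once per n.
outer = [0]
for n in range(1, max_n + 1, 2):
    m = build_m(n, outer)
    for j in range(x_elements):
        for i in range(x_elements):
            pass  # the assignment to umn2 would go here

print(inner[0], outer[0], inner[0] // outer[0])
```

The ratio of the two counts is xElements^2, which is the saving described above. How much wall-clock time that buys depends on how expensive the allocation is in the language at hand.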
I'm trying to implement the adaptive unsharp masking algorithm described by Polesel et al. in the article "Image Enhancement via Adaptive Unsharp Masking". The core of the algorithm is the minimization of a cost function defined as:
J(m,n) = E[e(m,n)^2] = E[(gd(m,n)-gy(m,n))^2]
where E[] is the statistical expectation and gy(m,n) is:
gy(m,n) = gx(m,n) + lambda1(m,n)*gzx(m,n) + lambda2(m,n)*gzy(m,n);
I want to find lambda1 and lambda2 for each pixel in order to minimize the cost function in each pixel.
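To make the definitions concrete, here is a minimal Python sketch of the cost for given lambda maps. It assumes that the expectation E[] is approximated by the mean over all pixels and that the images are flattened into plain lists; the function name `cost` and the sample values are made up for illustration.

```python
def cost(gd, gx, gzx, gzy, lambda1, lambda2):
    """Mean squared error between gd and gy over all pixels."""
    total = 0.0
    for d, x, zx, zy, l1, l2 in zip(gd, gx, gzx, gzy, lambda1, lambda2):
        gy = x + l1 * zx + l2 * zy  # gy = gx + lambda1*gzx + lambda2*gzy
        total += (d - gy) ** 2      # e^2 = (gd - gy)^2
    return total / len(gd)


gd = [1.0, 2.0]
gx = [0.0, 0.0]
gzx = [1.0, 1.0]
gzy = [0.0, 1.0]
j_zero = cost(gd, gx, gzx, gzy, [0.0, 0.0], [0.0, 0.0])
j_fit = cost(gd, gx, gzx, gzy, [1.0, 1.0], [0.0, 1.0])
print(j_zero, j_fit)
```

Here gy is built from gx, so the cost changes with the lambdas only through the gzx and gzy terms.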
Here is the code that I have written so far:
function [ o_sharpened_image ] = AdaptativeUnsharpMask( i_image , t1, t2)
%ADAPTATIVEUNSHARPMASK Summary of this function goes here
% Detailed explanation goes here
if isa(i_image,'dip_image')
i_image = dip_array(i_image);
end
if ~isfloat(i_image)
i_image = im2double(i_image);
end
adh = 4;
adl = 3;
g = [-1 -1 -1; -1 8 -1; -1 -1 -1];
dim = size(i_image);
lambda_x = 0.5*ones(dim);
lambda_y = 0.5*ones(dim);
z_x = conv2(i_image,[-1 2 -1],'same');
z_y = conv2(i_image,[-1; 2; -1],'same');
g_x = conv2(i_image,g,'same');
g_zx = conv2(z_x,g,'same');
g_zy = conv2(z_y,g,'same');
a = ones(dim);
variance_map = colfilt(i_image,[3 3],'sliding',@var);
a(variance_map >= t1 & variance_map < t2) = adh;
a(variance_map >= t2) = adl;
g_d = a.*g_x;
lambda = [lambda_x lambda_y];
lambda0 = lambda;
lambda_min = lsqnonlin(@(lambda) UnsharpCostFunction(lambda,g_d,g_zx,g_zy),lambda0);
o_sharpened_image = i_image + lambda_min(:,1:size(i_image,2)).*z_x + lambda_min(:,size(i_image,2)+1:end).*z_y;
end
Here is the code of the cost function:
function [ J ] = UnsharpCostFunction( i_lambda, i_gd, i_gzx, i_gzy )
%UNSHARPCOSTFUNCTION Summary of this function goes here
gy = i_gd + i_lambda(:,1:size(i_gd,2)).*i_gzx + i_lambda(:,size(i_gd,2)+1:end).*i_gzy;
J = mean((i_gd(:) - gy(:)).^2);
end
At each iteration I print the value of J to the command window, and it is always the same. What am I doing wrong?
Thank you.
I have been running a MATLAB program for almost six hours now, and it is still not complete. It is cycling through three while loops (the outer two loops are n=855, the inner loop is n=500). Is it surprising that it is taking this long? Is there anything I can do to increase the speed? I am including the code below, with the variable data types underneath it.
while i < (numAtoms + 1)
    pointAccessible = ones(numPoints,1);
    j = 1;
    while j <(numAtoms + 1)
        if (i ~= j)
            k=1;
            while k < (numPoints + 1)
                if (pointAccessible(k) == 1)
                    sphereCoord = [cell2mat(atomX(i)) + p + sphereX(k), cell2mat(atomY(i)) + p + sphereY(k), cell2mat(atomZ(i)) + p + sphereZ(k)];
                    neighborCoord = [cell2mat(atomX(j)), cell2mat(atomY(j)), cell2mat(atomZ(j))];
                    coords(1,:) = [sphereCoord];
                    coords(2,:) = [neighborCoord];
                    if (pdist(coords) < (atomRadius(j) + p))
                        pointAccessible(k)=0;
                    end
                end
                k = k + 1;
            end
        end
        j = j+1;
    end
    remainingPoints(i) = sum(pointAccessible);
    i = i +1;
end
Variable Data Types:
numAtoms = 855
numPoints = 500
p = 1.4
atomRadius = <855 * 1 double>
pointAccessible = <500 * 1 double>
atomX, atomY, atomZ = <1 * 855 cell>
sphereX, sphereY, sphereZ = <500 * 1 double>
remainingPoints = <855 * 1 double>
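For scale, the three loops visit up to 855 * 854 * 500 (about 365 million) atom/neighbor/point combinations, and every combination whose point is still accessible calls cell2mat and pdist. A minimal Python sketch of the same test, assuming the coordinates are held in plain numeric lists, compares squared distances and stops checking a point as soon as one neighbor blocks it. All names and sample values here are illustrative, not taken from the original program.

```python
def remaining_points(atoms, radii, sphere, p):
    """Count, per atom, the sphere points not blocked by any other atom.

    atoms:  list of (x, y, z) atom centres
    radii:  list of atom radii
    sphere: list of (x, y, z) sphere-point offsets
    p:      probe radius, added to each coordinate as in the loop above
    """
    counts = []
    for i, (ax, ay, az) in enumerate(atoms):
        accessible = 0
        for sx, sy, sz in sphere:
            px, py, pz = ax + p + sx, ay + p + sy, az + p + sz
            blocked = False
            for j, (bx, by, bz) in enumerate(atoms):
                if i == j:
                    continue
                limit = radii[j] + p
                d2 = (px - bx) ** 2 + (py - by) ** 2 + (pz - bz) ** 2
                if d2 < limit * limit:  # same test as distance < limit
                    blocked = True
                    break  # no need to test further neighbors
            if not blocked:
                accessible += 1
        counts.append(accessible)
    return counts


atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
radii = [1.0, 1.0]
sphere = [(0.0, 0.0, 0.0), (9.0, -1.0, -1.0)]
result = remaining_points(atoms, radii, sphere, 1.0)
print(result)
```

Comparing squared distances avoids a square root per pair, and keeping the coordinates in numeric arrays avoids the repeated cell-to-matrix conversions inside the innermost loop.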