Removing a "row" from a structure array - performance

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in MATLAB. For argument's sake, and to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name, structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (a very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? Or rather, how do you do it reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second), so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
    for j=length(stock(i).return):-1:1
        if(isnan(stock(i).return(j)))
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end
    end
    stock(i).return = sort(stock(i).return);
    stock(i).returnLength = length(stock(i).return);
    stock(i).medianReturn = median(stock(i).return);
    stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
    for j = length(stock(i+1).volume):-1:1
        if(isnan(stock(i+1).volume(j)))
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end
    end
    stock(i+1).volume = sort(stock(i+1).volume);
    stock(i+1).volumeLength = length(stock(i+1).volume);
    stock(i+1).medianVolume = median(stock(i+1).volume);
    stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
    for j=stock(i).returnLength:-1:1
        if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end;
    end;
end;
for i=numStock:-1:1
    for j=stock(i+1).volumeLength:-1:1
        if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end;
    end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(@str2double,returns);
sp500=cellfun(@str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
    if (abs(returns(i) - medianReturn) > 3*madReturn)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
    if (abs(volume(i) - medianVolume) > 3*madVolume)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep the for-loops so as to stay with the structure array, since reparsing this data would be far worse time-wise). The new code snippet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
    for j=1:length(stock(i).return)
        stock_return(i,j) = stock(i).return(j);
    end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?

I could not find a convenient way to handle the structures, so I would restructure the code to use plain arrays instead of structures.
For example, instead of stock(i).return(j) I would use stock_returns(i,j).
I'll show you, on one part of your code, how to get rid of the for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
    if(isnan(stock(i).return(j)))
        for k=numStock+1:-1:1
            stock(k).return(j) = [];
        end
    end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
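For example, assuming every stock(i).return is a column vector of the same length (field names taken from your code), the whole matrix can be built and cleaned in two lines; this is only a sketch of the idea:
stock_return = [stock.return].';                            % one row per stock, one column per observation
stock_return = stock_return(:, ~any(isnan(stock_return)));  % drop every column that contains a NaN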
As for the absolute deviation from the median return, you can write similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (eg. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, 1, stock_return_length);
mad_return = repmat(stock_mad_return, 1, stock_return_length);
is_bad = abs(stock_return - median_return) > 3.*mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length of course assumes that all the return vectors have the same length, but your original code implicitly assumes that anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
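A tiny illustration of the difference:
A = [1 NaN 3; 4 5 6];
isnan(A)             % [0 1 0; 0 0 0]  -- plain logical indexing would only drop the single bad value
any(isnan(A))        % [0 1 0]         -- flags the whole column that contains a NaN
A(:, ~any(isnan(A))) % keeps columns 1 and 3 for both rows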
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so you stick to stock(i).return, you can speed up your code using essentially the same scheme, but you will be able to get rid of one less for-loop, which means your program will be substantially slower.
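A sketch of that struct-preserving variant (assuming equal-length return vectors; the field names come from your code): first build one mask of observations that are bad for any stock, then delete those observations once per stock instead of once per element:
bad = false(1, length(stock(1).return));      % one flag per observation
for i = 1:numStock+1
    bad = bad | isnan(stock(i).return(:)).';  % bad for ANY stock
end
for i = 1:numStock+1
    stock(i).return(bad) = [];                % a single deletion per stock
end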

Related

Convert structure fields to arrays efficiently matlab

I have a structure called s in Matlab. This is a structure with two fields a and b. The structure size is 1 x 1,620,000.
It is a very large structure (it probably takes up half the RAM of my machine).
I am looking for an efficient way to concatenate each of the fields a and b into two separate arrays that I can then export to CSV. I built the code below to do so, but even after 12 hours of running it has not even reached a quarter of the loop. Is there a more efficient way of doing this?
a = [];
b = [];
total_n = size(s,2);
count = 1;
while size(s,2)>0
    if size(s(1).a,1)
        a = [a; s(1).a];
    end
    if size(s(1).b,1)
        b = [b; s(1).b];
    end
    s(1) = []; %to save memory
    if mod(count,1000) == 0
        fprintf('Done %2f \n', [count/total_n])
    end
    count = count+1;
end
s(1) = []; %to save memory
Ah, but what a huge misunderstanding that comment is.
If size(s) is 1 x 1,620,000, that line forces the loop to do, under the hood (you don't see it):
snew = zeros(1, size(s,2)-1)   % now you use double the memory
snew = s(2:end)                % now you force an unnecessary copy
So not only does that line make your code require double the memory, but in each iteration you also make an unnecessary copy of a large array.
Just replace your while with a normal for loop, for ii = 1:size(s,2), and then index s!
Now you can hopefully see why the following is an equally big mistake (not only that, but any modern MATLAB version will point out in the editor that this is a bad idea):
a=[]
a=[a;s(1).a]
Here, in each iteration, you force MATLAB to make a new a that is one element bigger than before and to copy the contents of the old a into it.
Instead, preallocate a.
As you don't know what you are going to put there, I suggest using a cell array, as each s(ii).a has a different length.
You can then, after the loop, remove all empty (isempty) cells if you want.
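Putting that advice together, a minimal sketch might look like this (field names a and b are taken from the question; adjust to your data):
n = numel(s);
a_cells = cell(n, 1);                  % preallocated once
b_cells = cell(n, 1);
for ii = 1:n                           % plain for loop, s is never shrunk
    a_cells{ii} = s(ii).a;
    b_cells{ii} = s(ii).b;
end
a = vertcat(a_cells{~cellfun(@isempty, a_cells)});   % drop empties, concatenate once
b = vertcat(b_cells{~cellfun(@isempty, b_cells)});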
Managed to do it efficiently:
s = struct2cell(s);   % 2 x 1 x 1620000 cell array
s = squeeze(s);       % 2 x 1620000
a = s(1,:);           % first row holds the a fields
a = a';
a = vertcat(a{:});
b = s(2,:);           % second row holds the b fields
b = b';
b = vertcat(b{:});
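For reference, starting from the original struct array s, the same concatenation can also be written with comma-separated list expansion, which skips the intermediate cell array (whether it is faster at this scale is worth measuring):
a = vertcat(s.a);   % concatenates every s(i).a in one call
b = vertcat(s.b);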

Is preallocation possible in this code snippet?

The following is a snippet of Ant Colony Optimization code. I've removed whatever I feel is absolutely not necessary to understand the code; the rest I'm not sure about, as I'm unfamiliar with coding in MATLAB. However, I'm running this algorithm on 500 or so cities with 500 ants and 1000 iterations, and the code runs extremely slowly compared to other algorithm implementations in MATLAB. For the purposes of my project I simply need the datasets, not to demonstrate coding ability in MATLAB, and I had time constraints that simply did not allow me to learn MATLAB from scratch (that was neither taken into consideration nor expected when the deadline was given), so I got the algorithm from an online source.
MATLAB recommends preallocating two variables inside a loop because, I believe, they are arrays that change size. However, I do not fully understand the purpose of those two parts of the code, so I haven't been able to do so. I believe both arrays gain a new item every iteration of the loop, so technically they should both be zero-able and could be preallocated to their expected final sizes based on the for-loop condition, but I'm not sure. I've tried preallocating zeros to the two arrays, but it does not seem to fix anything, as MATLAB still shows the preallocate-for-speed recommendation.
I've marked the two variables MATLAB recommends preallocating with comments below. If someone would be so kind as to skim over it and let me know whether it is possible, it'd be much appreciated.
x = 10*rand(50,1);
y = 10*rand(50,1);
n=numel(x);
D=zeros(n,n);
for i=1:n-1
    for j=i+1:n
        D(i,j)=sqrt((x(i)-x(j))^2+(y(i)-y(j))^2);
        D(j,i)=D(i,j);
    end
end
model.n=n;
model.x=x;
model.y=y;
model.D=D;
nVar=model.n;
MaxIt=100;
nAnt=50;
Q=1;
tau0=10*Q/(nVar*mean(model.D(:)));
alpha=1;
beta=5;
rho=0.6;
eta=1./model.D;
tau=tau0*ones(nVar,nVar);
BestCost=zeros(MaxIt,1);
empty_ant.Tour=[];
empty_ant.Cost=[];
ant=repmat(empty_ant,nAnt,1);
BestSol.Cost=inf;
for it=1:MaxIt
    for k=1:nAnt
        ant(k).Tour=randi([1 nVar]);
        for l=2:nVar
            i=ant(k).Tour(end);
            P=tau(i,:).^alpha.*eta(i,:).^beta;
            P(ant(k).Tour)=0;
            P=P/sum(P);
            r=rand;
            C=cumsum(P);
            j=find(r<=C,1,'first');
            ant(k).Tour=[ant(k).Tour j];
        end
        tour = ant(k).Tour;
        n=numel(tour);
        tour=[tour tour(1)]; %MatLab recommends preallocation here
        ant(k).Cost=0;
        for i=1:n
            ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
        end
        if ant(k).Cost<BestSol.Cost
            BestSol=ant(k);
        end
    end
    for k=1:nAnt
        tour=ant(k).Tour;
        tour=[tour tour(1)];
        for l=1:nVar
            i=tour(l);
            j=tour(l+1);
            tau(i,j)=tau(i,j)+Q/ant(k).Cost;
        end
    end
    tau=(1-rho)*tau;
    BestCost(it)=BestSol.Cost;
    figure(1);
    tour=BestSol.Tour;
    tour=[tour tour(1)]; %MatLab recommends preallocation here
    plot(model.x(tour),model.y(tour),'g.-');
end
If you change the size of an array, it has to be copied to a new location in memory. That is not a huge problem for small arrays, but for large arrays it slows down your code immensely. The tour arrays you're using are of fixed size (51, or n+1 in this case), so you should preallocate them as zero arrays. The only thing you then do is append the first element of the tour to the end again, so all you have to do is set the last element of the array.
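To see the effect of growing versus preallocating on its own, here is a toy comparison (a minimal sketch; absolute timings will vary by machine and MATLAB version):
N = 1e5;
tic
grown = [];
for i = 1:N
    grown(end+1) = i;   % array grows every iteration
end
toc
tic
pre = zeros(N, 1);      % allocated once up front
for i = 1:N
    pre(i) = i;
end
toc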
Here is what you should change:
x = 10*rand(50,1);
y = 10*rand(50,1);
n=numel(x);
D=zeros(n,n);
for i=1:n-1
    for j=i+1:n
        D(i,j)=sqrt((x(i)-x(j))^2+(y(i)-y(j))^2);
        D(j,i)=D(i,j);
    end
end
model.n=n;
model.x=x;
model.y=y;
model.D=D;
nVar=model.n;
MaxIt=1000;
nAnt=50;
Q=1;
tau0=10*Q/(nVar*mean(model.D(:)));
alpha=1;
beta=5;
rho=0.6;
eta=1./model.D;
tau=tau0*ones(nVar,nVar);
BestCost=zeros(MaxIt,1);
empty_ant.Tour=zeros(n, 1);
empty_ant.Cost=[];
ant=repmat(empty_ant,nAnt,1);
BestSol.Cost=inf;
for it=1:MaxIt
    for k=1:nAnt
        ant(k).Tour=randi([1 nVar]);
        for l=2:nVar
            i=ant(k).Tour(end);
            P=tau(i,:).^alpha.*eta(i,:).^beta;
            P(ant(k).Tour)=0;
            P=P/sum(P);
            r=rand;
            C=cumsum(P);
            j=find(r<=C,1,'first');
            ant(k).Tour=[ant(k).Tour j];
        end
        n=numel(ant(k).Tour);
        tour = zeros(n+1,1);
        tour(1:n) = ant(k).Tour;
        tour(end) = tour(1); %preallocated: only the last element is set here
        ant(k).Cost=0;
        for i=1:n
            ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
        end
        if ant(k).Cost<BestSol.Cost
            BestSol=ant(k);
        end
    end
    for k=1:nAnt
        tour(1:n)=ant(k).Tour;
        tour(end) = tour(1);
        for l=1:nVar
            i=tour(l);
            j=tour(l+1);
            tau(i,j)=tau(i,j)+Q/ant(k).Cost;
        end
    end
    tau=(1-rho)*tau;
    BestCost(it)=BestSol.Cost;
    figure(1);
    tour(1:n) = BestSol.Tour;
    tour(end) = tour(1); %preallocated: only the last element is set here
    plot(model.x(tour),model.y(tour),'g.-');
end
I think that the warning that the MATLAB Editor gives in this case is misplaced. The array is not repeatedly resized, it is just resized once. In principle, tour(end+1)=tour(1) is more efficient than tour=[tour,tour(1)], but in this case you might not notice the difference in cost.
If you want to speed up this code you could think of vectorizing some of its loops, and of reducing the number of indexing operations performed. For example this section:
tour = ant(k).Tour;
n=numel(tour);
tour=[tour tour(1)]; %MatLab recommends preallocation here
ant(k).Cost=0;
for i=1:n
    ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
end
if ant(k).Cost<BestSol.Cost
    BestSol=ant(k);
end
could be written as:
tour = ant(k).Tour;
ind = sub2ind(size(model.D),tour,circshift(tour,-1));
ant(k).Cost = sum(model.D(ind));
if ant(k).Cost < BestSol.Cost
    BestSol = ant(k);
end
This rewritten code doesn't have a loop, which usually makes things a little faster, and it also doesn't repeatedly do complicated indexing (ant(k).Cost is two indexing operations, within a loop that will slow you down more than necessary).
There are more opportunities for optimization like these, but rewriting the whole function is outside the scope of this answer.
I have not tried running the code, please let me know if there are any errors when using the proposed change.

Employ early bail-out in MATLAB

There is an example of employing early bail-out in this book (http://www.amazon.com/Accelerating-MATLAB-Performance-speed-programs/dp/1482211297) (@YairAltman). For a speed improvement we can convert this code:
data = [];
newData = [];
outerIdx = 1;
while outerIdx <= 20
    outerIdx = outerIdx + 1;
    for innerIdx = -100 : 100
        if innerIdx == 0
            continue % skips to next innerIdx (=1)
        elseif outerIdx > 15
            break % skips to next outerIdx
        else
            data(end+1) = outerIdx/innerIdx;
            newData(end+1) = process(data);
        end
    end % for innerIdx
end % while outerIdx
to this code:
function bailableProcessing()
    for outerIdx = 1 : 5
        middleIdx = 10
        while middleIdx <= 20
            middleIdx = middleIdx + 1;
            for innerIdx = -100 : 100
                data = outerIdx/innerIdx + middleIdx;
                if data == SOME_VALUE
                    return
                else
                    process(data);
                end
            end % for innerIdx
        end % while middleIdx
    end % for outerIdx
end % bailableProcessing()
How was this conversion done? Why does the new code have a different middleIdx range? Where are the checks on innerIdx and outerIdx in the new code? And what is this new data = outerIdx/innerIdx + middleIdx calculation?
We have only this information about the second code:
We could place the code segment that should be bailed-out within a
dedicated function and return from the function when the bail-out
condition occurs.
I am sorry that I did not clarify within the text that the second code segment is not a direct replacement of the first. If you reread the early bail-out section (3.1.3) perhaps you can see that it has two main parts:
The first part of the section (which includes the top code segment) illustrates the basic mechanism of using break/continue in order to bail-out from a complex processing loop, in order to save processing time in computing values that are not needed.
In contrast, the second part of the section deals with cases when we wish to break out of an ancestor loop that is not the direct parent loop. I mention in the text that there are three alternatives that we can use in this case, and the second code segment that you mentioned is one of them (the other alternatives are to use dedicated flags with break/continue and to use try/catch blocks). The three code segments that I provided in this second part of the section should all be equivalent to each other, but they are NOT equivalent to the code-segment at the top of the section.
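For reference, the try/catch alternative mentioned above could be sketched roughly like this (the error identifier and bail-out condition here are made up for illustration):
try
    for outerIdx = 1:1000
        for innerIdx = 1:1000
            if someBailoutCondition(outerIdx, innerIdx)
                error('myCode:bailout', 'bail-out condition reached');
            end
            % <do stuff>
        end
    end
catch err
    if ~strcmp(err.identifier, 'myCode:bailout')
        rethrow(err); % only swallow our own bail-out signal
    end
end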
Perhaps I should have clarified this in the text, or maybe I should have used the same example throughout. I will think about this for the second edition of the book (if and when it ever appears).
I have used a variant of these code segments in other sections of the book to illustrate various other aspects of performance speedups (for example, 3.1.4 & 3.1.6) - in all these cases the code segments are NOT equivalent to each other. They are merely used to illustrate the corresponding text.
I hope you like my book in general and think that it is useful. I would be grateful if you would place a positive feedback about it on Amazon (direct link).
p.s. - @SamRoberts was correct to surmise that mention of my name would act as a "bat-signal", attracting my attention :-)
It's all far simpler than you think!
How was this conversion done?
Irrationally. Those two pieces of code are completely different.
Why does the new code have a different middleIdx range?
Randomly. The author's point is something else entirely.
Where are the checks on innerIdx and outerIdx in the new code?
You don't need them, as it's not intended to be the same code.
And what is this new data = outerIdx/innerIdx + middleIdx calculation?
A random calculation, just like data(end+1) = outerIdx/innerIdx; in the original code.
I suppose the author wants to illustrate something far more profound: if you wrap your code that does (possibly many) loops (fors/whiles, it doesn't matter) inside a function, and you issue a return statement once you somehow detect that you're done, the result is an effectively "bailable" computation, i.e. the method that does the work returns earlier than it normally would. That is illustrated here by the condition that checks data == SOME_VALUE; you can put your favourite bail-out condition there instead :-)
Moreover, the continue/break keywords in the first example are meant to illustrate that you can skip the rest of, or leave, the innermost loop from wherever you call them. In principle, you can implement a bail-out using these, e.g.
bailing = false;
for outer = 1:1000
    for inner = 1:1000
        if <somebailingcondition>
            bailing = true;
            break;
        else
            <do stuff>
        end
    end
    if bailing
        break;
    end
end
but that would be very clumsy, as the "cascade" of breaks needs to be as deep as your loop nesting and messes up the code.
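For contrast, the wrap-in-a-function version of the same bail-out avoids that cascade entirely (a minimal sketch; the condition name is made up):
function bailableLoops()
    for outer = 1:1000
        for inner = 1:1000
            if somebailingcondition(outer, inner)
                return  % leaves every enclosing loop at once
            end
            % <do stuff>
        end
    end
end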
I hope that clarifies your issues.

Sorting a list of objects by property in Matlab, and then deleting the smallest one

I'm trying to use Matlab to implement the MDO algorithm, which requires me to sort an array of objects of a custom-defined mdoVertex class by their degree, and then delete the one with the smallest degree value. My first attempt was this:
for i = 1:m
    if graph(i).degree < minDegree
        minDegree = graph(i).degree;
        elimObject = graph(i);
    end
end
Matlab is complaining that elimObject, or the object to be eliminated after the loop executes, is an undefined function or variable. How, then, can I keep track of not only the current smallest degree the loop has encountered, but also which object it corresponded to? 'graph' is the name of the array holding all of my vertex objects.
I suspect that you're somehow trying to call clear on the object returned from your function. Or is it just a few lines of code in a script? I'm guessing here. In any event, calling clear won't work. As you've noticed, clear expects to be given a variable name.
But in this case, you're not trying to delete a variable, you're trying to remove an element from an array. For that, you do arrayname(indextodelete) = [];
So I think that you want...
minDegree = inf; % See what I did there? I defined the variable, and I did it in such a way that I KNOW that the first vertex will satisfy the condition.
for i = 1:length(graph) % Properly loop over the entire graph
    if graph(i).degree < minDegree % The first vertex will definitely satisfy this. Maybe another one (or more) will later!
        minDegree = graph(i).degree;
        minDegreeIndex = i; % Don't record the value, just remember WHERE it is in the array.
    end
end
graph(minDegreeIndex) = []; % Now, remove the element that you identified from the array!
(By the way, you never showed us how you tried to eliminate elimObject. I assume that you called clear (the object that you identified)? You shouldn't make us guess; show us.)
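As an aside, if degree is a numeric scalar property on every element of graph, the whole loop can be replaced by collecting the degrees and calling min directly (a sketch of the idiom):
[minDegree, minDegreeIndex] = min([graph.degree]); % gather all degrees, find the smallest
graph(minDegreeIndex) = [];                        % remove that vertex from the array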

Returning multiple ints and passing them as multiple arguments in Lua

I have a function that takes a variable number of ints as arguments.
thisFunction(1,1,1,2,2,2,2,3,4,4,7,4,2)
This function was given in a framework and I'd rather not change the code of the function or the .lua file it comes from. So I want a function that repeats a number a certain number of times for me, so this is less repetitive. Something that could work like this and achieve what was done above:
thisFunction(repeatNum(1,3),repeatNum(2,4),3,repeatNum(4,2),7,4,2)
Is this possible in Lua? I'm even comfortable with something like this:
thisFunction(repeatNum(1,3,2,4,3,1,4,2,7,1,4,1,2,1))
I think you're stuck with something along the lines of your second proposed solution, i.e.
thisFunction(repeatNum(1,3,2,4,3,1,4,2,7,1,4,1,2,1))
because if you use a function that returns multiple values in the middle of a list, it's adjusted so that it only returns one value. However, at the end of a list, the function does not have its return values adjusted.
You can code repeatNum as follows. It's not optimized and there's no error-checking. This works in Lua 5.1. If you're using 5.2, you'll need to make adjustments.
function repeatNum(...)
    local results = {}
    local n = #{...}
    for i = 1,n,2 do
        local val = select(i, ...)
        local reps = select(i+1, ...)
        for j = 1,reps do
            table.insert(results, val)
        end
    end
    return unpack(results)
end
I don't have 5.2 installed on this computer, but I believe the only change you need is to replace unpack with table.unpack.
I realise this question has been answered, but I wondered, from a readability point of view, whether using tables to mark the repeats would be clearer; of course it's probably far less efficient.
function repeatnum(...)
    local i = 0
    local t = {...}
    local tblO = {}
    for j,v in ipairs(t) do
        if type(v) == 'table' then
            for k = 1,v[2] do
                i = i + 1
                tblO[i] = v[1]
            end
        else
            i = i + 1
            tblO[i] = v
        end
    end
    return unpack(tblO)
end
print(repeatnum({1,3},{2,4},3,{4,2},7,4,2))
