An efficient, optimized code for matching rows in a huge matrix - performance

I have a huge matrix on which I need to do a matching operation. Here's the code I have written. It works fine, but I think there is room to optimize it, or to write other code that does the matching in less time. Could you please help me with that?
rowsMatched = find(bigMatrix(:, 1) == matchingRow(1, 1) ...
    & bigMatrix(:, 2) == matchingRow(1, 2) ...
    & bigMatrix(:, 3) == matchingRow(1, 3));
The problem with this code is that I cannot use the && operator, so when one of the columns does not match, the program still checks the remaining conditions. How can I avoid this?
Update: Here's the solution to this problem:
rowsMatched = find(all(bsxfun(@eq, bigMatrix, matchingRow), 2));
Thank you

You can use BSXFUN to do it in a vectorized manner:
rowsMatched = find(all(bsxfun(@eq, bigMatrix, matchingRow), 2));
Note that this will work for any number of columns, as long as matchingRow has the same number of columns as bigMatrix.
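For what it's worth, here is a minimal sketch of two equivalent formulations, using the names from the question (the sample data is made up purely for illustration): on MATLAB R2016b and later, implicit expansion lets you drop bsxfun entirely, and ismember with the 'rows' option expresses the same intent directly.
% Sample data, assumed for illustration only
bigMatrix   = [1 2 3; 4 5 6; 1 2 3; 7 8 9];
matchingRow = [1 2 3];
% R2016b+: implicit expansion replaces bsxfun
rowsMatched = find(all(bigMatrix == matchingRow, 2));
% Equivalent, using ismember's 'rows' option
rowsMatched2 = find(ismember(bigMatrix, matchingRow, 'rows'));
Both return [1; 3] for this sample data.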

Related

Writing many Equations for NDSolve

I'm trying to write the master equations for genetic networks; since there are many equations, I'm trying to build a table that writes all of them at once. However, I don't know how to handle the boundaries. I mean:
I wrote a matrix with all the variables that I need:
p={{p11,p12},{p21,p22}}
Then I wrote a table for creating the differential equations:
Table[p[[i,j]]'[t]== p[[i-1,j]][t]+p[[i,j-1]][t]+p[[i+1,j]][t]+p[[i,j+1]][t],{i,1,2},{j,1,2}]
However, the part p[[i-1,j]] when i=1 is p[[0,1]], which doesn't exist, and I need to put 0 in its place, but I don't know how. I tried with If but it doesn't work well. What can I do?
Will this work for you?
pf[i_,j_]:=If[i<1||i>2||j<1||j>2,0,p[[i,j]][t]];
Table[p[[i,j]]'[t] == pf[i-1,j]+pf[i,j-1]+pf[i+1,j]+pf[i,j+1],{i,1,2},{j,1,2}]
which returns
{{p[[1, 1]]'[t] == p[[1, 2]][t] + p[[2, 1]][t], p[[1, 2]]'[t] == p[[1, 1]][t] + p[[2, 2]][t]},
 {p[[2, 1]]'[t] == p[[1, 1]][t] + p[[2, 2]][t], p[[2, 2]]'[t] == p[[1, 2]][t] + p[[2, 1]][t]}}
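To actually hand these equations to NDSolve you still need initial conditions; here is a minimal sketch of how the pieces could fit together (the 1/4 starting values and the time range {t, 0, 10} are placeholder assumptions, not from the question):
p = {{p11, p12}, {p21, p22}};
pf[i_, j_] := If[i < 1 || i > 2 || j < 1 || j > 2, 0, p[[i, j]][t]];
eqs = Flatten[Table[p[[i, j]]'[t] == pf[i - 1, j] + pf[i, j - 1] + pf[i + 1, j] + pf[i, j + 1], {i, 1, 2}, {j, 1, 2}]];
vars = Flatten[p];
ics = Table[v[0] == 1/4, {v, vars}]; (* placeholder initial conditions *)
sol = NDSolve[Join[eqs, ics], vars, {t, 0, 10}]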

Designing an algorithm to check combinations

I'm having serious performance issues with a job that runs every day, and I don't think I can improve the algorithm myself; so I'm going to explain the problem we need to solve and the algorithm we have, and maybe you have some other ideas to solve the problem better.
So the problem we have to solve is:
There is a set of Rules, ~120,000 Rules.
Every rule has a set of combinations of Codes. Codes are basically strings. We have ~8 combinations per rule. Example of a combination: TTAAT;ZZUHH;GGZZU;WWOOF;SSJJW;FFFOLL
There is a set of Objects, ~800 objects.
Every object has a set of ~200 codes.
We have to check, for every Rule, whether at least one of its combinations of Codes is fully contained in an Object's codes. That is:
loop in Rules
    loop in Combinations of the rule
        loop in Objects
            every code of the combination found in the Object? => create relationship rule/object and continue with the next object
        end of loop
    end of loop
end of loop
For example, suppose we have a Rule with this combination of two codes: HHGGT; ZZUUF, and an object with these codes: HHGGT; DHZZU; OIJUH; ZHGTF; HHGGT; JUHZT; ZZUUF; TGRFE; UHZGT; FCDXS. Then we create a relationship between the Object and the Rule, because every code of the rule's combination is contained in the codes of the object. This is what the algorithm has to do.
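In other words, the core check for one combination against one object is set containment; a tiny Python illustration with the codes from this example (the variable names are made up):
combination = {'HHGGT', 'ZZUUF'}
object_codes = {'HHGGT', 'DHZZU', 'OIJUH', 'ZHGTF', 'JUHZT',
                'ZZUUF', 'TGRFE', 'UHZGT', 'FCDXS'}

# The rule/object relationship exists when every code of the
# combination appears among the object's codes:
matches = combination.issubset(object_codes)  # True here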
As you can see this is quite expensive: in the worst-case scenario that is 120,000 x 8 x 800 = 768 million iterations.
This is a simplified version of the real problem; what we actually do inside the loops is a bit more complicated, which is why we have to reduce the iteration count somehow.
I have tried to think of a solution but I don't have any ideas!
Do you see anything wrong here?
Best regards, and thank you for your time :)
Something like this might work better, if I'm understanding correctly (this is in Python):
RULES = [
    ['abc', 'def'],
    ['aaa', 'sfd'],
    ['xyy', 'eff'],
]
OBJECTS = [
    ('rrr', 'abc', 'www', 'def'),
    ('pqs', 'llq', 'aaa', 'sdr'),
    ('xyy', 'hjk', 'fed', 'eff'),
    ('pnn', 'rrr', 'mmm', 'qsq'),
]

# Inverted index: code -> set of objects that contain that code
MapOfCodesToObjects = {}
for obj in OBJECTS:
    for code in obj:
        if code in MapOfCodesToObjects:
            MapOfCodesToObjects[code].add(obj)
        else:
            MapOfCodesToObjects[code] = {obj}

RELATIONS = []
for rule in RULES:
    if len(rule) == 0:
        continue
    # Start from the objects containing the rule's first code
    if rule[0] in MapOfCodesToObjects:
        ValidObjects = MapOfCodesToObjects[rule[0]]
    else:
        continue
    # Narrow down by intersecting with the objects for each further code
    for i in range(1, len(rule)):
        if rule[i] in MapOfCodesToObjects:
            codeObjects = MapOfCodesToObjects[rule[i]]
        else:
            ValidObjects = set()
            break
        ValidObjects = ValidObjects.intersection(codeObjects)
        if len(ValidObjects) == 0:
            break
    for vo in ValidObjects:
        RELATIONS.append((rule, vo))

for R in RELATIONS:
    print(R)
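For the sample data above this prints exactly two relations, (['abc', 'def'], ('rrr', 'abc', 'www', 'def')) and (['xyy', 'eff'], ('xyy', 'hjk', 'fed', 'eff')); the rule ['aaa', 'sfd'] matches nothing because no object contains 'sfd'.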
First you build a map from each code to the objects containing it. If there are nObj objects and nCodePerObj codes per object on average, this takes O(nObj * nCodePerObj) on average, since hash-map insertion is amortized constant time.
Next you iterate through the rules and look up each code of each rule in the map you built. There is a relation if a certain object occurs for every code in the rule, i.e. if it is in the set intersection of the object sets of all the codes in the rule. Since hash lookups are O(1) on average, and set intersection is O(min of the lengths of the two sets), this takes O(nRule * nCodePerRule * nObjectsPerCode). (Note that this is nObjectsPerCode, not nCodePerObj: performance gets worse when one code is contained in many objects.)

Stata: why is my matrix not clearing over the foreach loop

When I run the following code, the two output matrices (diffInDiffOne & diffInDiffTwo) are the same. My guess is that coeffs is not being replaced after each loop, but I have no idea why. I think the coefficients matrix is being overwritten, but I have no idea how. I tried changing the order of the loop, but surprisingly that didn't solve my issue either:
local treatments treat_one treat_two
matrix diffInDiffOne = J(1,9,.)
matrix diffInDiffTwo = J(1,9,.)
foreach treatment in `treatments' {
    reg science inSchool#`treatment'#male
    matrix coeffs = e(b)
    if treat_one == `treatment' {
        matrix diffInDiffOne = diffInDiffOne\coeffs
    }
    if treat_two == `treatment' {
        matrix diffInDiffTwo = diffInDiffTwo\coeffs
    }
}
matrix list diffInDiffOne
matrix list diffInDiffTwo
When I list the matrices they are both the same, despite the fact that the two regressions give different answers. Any help with this issue is much appreciated. Thanks.
This code appears at first sight to reduce to
reg science inSchool#treat_one#male
matrix li e(b)
reg science inSchool#treat_two#male
matrix li e(b)
apart from the detail of adding nine missing values to the matrix.
However, that is not your code, so what is biting you? I guess at something much more subtle.
You need to be very careful with the if command: variables evaluated in if commands are evaluated at their first observation. So, the first time round the loop the conditions are
if treat_one[1] == treat_one[1]
if treat_two[1] == treat_one[1]
The second time, it is
if treat_one[1] == treat_two[1]
if treat_two[1] == treat_two[1]
If it is true in your data that treat_one[1] == treat_two[1] the effect will not be as you may imagine.
If you want to test for equality of strings, do something like
if "`treatment'" == "treat_one"
You may have in mind something more like
foreach treatment in treat_one treat_two {
    reg science inSchool#`treatment'#male
    matrix `treatment' = e(b)
    matrix list `treatment'
}
You seem to want to write very complicated code for rather simple problems. A while back, I recommended thinking in terms of do-files rather than programs. That may be advice to reconsider.
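For completeness, here is a sketch of the original loop with the string comparison applied (same names as in the question; appending coeffs below the J(1,9,.) starter row still assumes e(b) has nine columns):
local treatments treat_one treat_two
matrix diffInDiffOne = J(1, 9, .)
matrix diffInDiffTwo = J(1, 9, .)
foreach treatment in `treatments' {
    reg science inSchool#`treatment'#male
    matrix coeffs = e(b)
    if "`treatment'" == "treat_one" {
        matrix diffInDiffOne = diffInDiffOne \ coeffs
    }
    if "`treatment'" == "treat_two" {
        matrix diffInDiffTwo = diffInDiffTwo \ coeffs
    }
}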

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different.
I have a very large structure array in MATLAB. For argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name
structure(1).returns, structure(2).returns, structure(3).returns
(in my real program I have 647 structures)
Suppose further that structure(i).returns is a very large vector (approximately 2,000,000 entries), and that a condition comes along under which I want to delete the jth entry of structure(i).returns for every i. How do you do this, or rather, how do you do this reasonably fast? I have tried some things (shown below), but they are all insanely slow, so I was wondering if the community knew of faster ways to do this.
I have parsed my data in two different ways. The first way saved everything as cell arrays, but because that hadn't been working well for me I parsed the data again and stored everything as vectors.
What I'm actually trying to do is delete NaN data, together with all data in the corresponding row of my data file, and then do the very same thing again after applying the Hampel filter. The relevant part of my code in the first attempt is:
for i=numStock+1:-1:1
    for j=length(stock(i).return):-1:1
        if(isnan(stock(i).return(j)))
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end
    end
    stock(i).return = sort(stock(i).return);
    stock(i).returnLength = length(stock(i).return);
    stock(i).medianReturn = median(stock(i).return);
    stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
    for j=length(stock(i+1).volume):-1:1
        if(isnan(stock(i+1).volume(j)))
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end
    end
    stock(i+1).volume = sort(stock(i+1).volume);
    stock(i+1).volumeLength = length(stock(i+1).volume);
    stock(i+1).medianVolume = median(stock(i+1).volume);
    stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
    for j=stock(i).returnLength:-1:1
        if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
            for k=numStock+1:-1:1
                stock(k).return(j) = [];
            end
        end;
    end;
end;
for i=numStock:-1:1
    for j=stock(i+1).volumeLength:-1:1
        if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
            for k=numStock:-1:1
                stock(k+1).volume(j) = [];
            end
        end;
    end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried parsing everything in as vectors, and then deleting the appropriate entries in the vectors prior to building the structure array. This doesn't return an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(@str2double,returns);
sp500=cellfun(@str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
    if (abs(returns(i) - medianReturn) > 3*madReturn)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
    if (abs(volume(i) - medianVolume) > 3*madVolume)
        returns(i) = [];
        volume(i)=[];
        date(i)=[];
        ticker(i)=[];
        name(i)=[];
        permno(i)=[];
    end;
end
As I said, this is very slow, probably because I'm using for-loops on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might do what I'm asking in a reasonable amount of time?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into one matrix and all of the volume vectors into another matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even whether princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I decided to trade away speed and keep the structure array with its for-loops, since re-parsing this data would be far worse time-wise). The new code snippet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
    for j=1:length(stock(i).return)
        stock_return(i,j) = stock(i).return(j);
    end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle the structures, so I would restructure the code to use plain arrays instead of structures. For example, instead of stock(i).return(j) I would use stock_return(i,j).
Let me show you, on a part of your code, how to get rid of the for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
    if(isnan(stock(i).return(j)))
        for k=numStock+1:-1:1
            stock(k).return(j) = [];
        end
    end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute deviation from the median, you can write similar code:
% stock_return_length is a scalar (the number of observations)
% stock_median_return is a column vector (e.g. [1;2;3]), one entry per stock
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, 1, stock_return_length);
mad_return = repmat(stock_mad_return, 1, stock_return_length);
is_bad = abs(stock_return - median_return) > 3 .* mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length of course means that the return lengths are all the same, but you implicitly assume that in your original code anyway.
The important point in my answer is the use of any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference for any: http://www.mathworks.co.uk/help/matlab/ref/any.html
If you want to preserve the original structure, so that you stick with stock(i).return, you can speed up your code using essentially the same scheme, but you can get rid of one fewer for-loop, meaning that your program will be substantially slower.
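As a sketch of the restructuring step itself (this assumes every stock(i).return is a column vector of the same length, which the scalar stock_return_length above assumes as well), the element-by-element copy loop in EDIT2 can be replaced by concatenation:
% Concatenate the per-stock return vectors side by side, then transpose,
% so that rows are stocks and columns are observations (stock_return(i,j)).
stock_return = [stock.return].';
% Drop every observation (column) for which any stock has a NaN return.
stock_return(:, any(isnan(stock_return), 1)) = [];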

simple method to keep last n elements in a queue for vb6?

I am trying to keep the last n elements from a changing list of x elements (where x >> n).
I found out about the deque structure, with a fixed length, in other programming languages. I was wondering if there is something similar for VB6.
Create a Class that extends an encapsulated Collection.
Add at the end (anonymous), retrieve and remove from the beginning (index 1). As part of adding, check your MaxDepth property setting (or hard-code it if you like), and if Collection.Count exceeds it, remove the extra item.
Or just hard code it all inline if a Class is a stumper for you.
This is pretty routine.
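A minimal sketch of that idea as a Class module (the class name BoundedQueue and the default depth of 5 are placeholder assumptions, not anything VB6 provides; it also assumes the stored items are plain values, not objects):
' Class module "BoundedQueue": a Collection that keeps only the
' most recent MaxDepth items.
Private m_Col As Collection
Private m_MaxDepth As Long

Private Sub Class_Initialize()
    Set m_Col = New Collection
    m_MaxDepth = 5              ' default; change via the property below
End Sub

Public Property Let MaxDepth(ByVal NewDepth As Long)
    m_MaxDepth = NewDepth
End Property

Public Sub Add(ByVal Item As Variant)
    m_Col.Add Item              ' append at the end
    If m_Col.Count > m_MaxDepth Then m_Col.Remove 1 ' drop the oldest
End Sub

Public Function Item(ByVal Index As Long) As Variant
    Item = m_Col.Item(Index)    ' 1 = oldest retained, Count = newest
End Function

Public Property Get Count() As Long
    Count = m_Col.Count
End Property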
The only thing I can think of is looping through the last 5 values of the dynamic array, using something like:
For i = UBound(Array) - 4 To UBound(Array)
    'Code to store or process these values
Next i
Sorry I don't have a definite answer, but hopefully that might help.
Here's my simplest solution to this:
For i = n - 1 To 1 Step -1
    arrayX(i) = arrayX(i - 1)
Next i
arrayX(0) = latestX
Where:
arrayX = array of values
n = # of array elements
latestX = latest value of interest (assumes the entire code block is itself inside another loop)
