I have two matrices of different dimensions and I want to compare if any of the elements in their first columns match (and eventually remove rows where there is a match). Looping just takes too much time so I am trying vectorized version but nothing I've tried worked. Any help would be much appreciated!
Example.
A = magic(3);
B = magic(6);
UpTo = min( size( A, 1 ), size( B, 1 ) );
CommonRows = B(1 : UpTo, 1) == A(1 : UpTo, 1);
B( CommonRows, : ) = [] % B with rows of same element in column 1 removed
This is the last thing I tried and almost got it but doesn't work when I have repeating values in both matrices.
[C,iC]=setdiff(A(:,1),B(:,1))
[D,iD]=intersect(A(:,1),B(:,1))
newA=A(iC,:)
newBtemp=[A(iD,:);B]
newB=sort(newBtemp)
But I think I got it now finally:
common=ismember(A(:,1),B(:,1))
temp=A(common,:)
A(common,:)=[]
newB=sort([temp;B])
Related
It is a straightforward question: Is there a faster alternative to all(a(:,i)==a,1) in MATLAB?
I'm thinking of a implementation that benefits from short-circuit evaluations in the whole process. I mean, all() definitely benefits from short-circuit evaluations but a(:,i)==a doesn't.
I tried the following code,
% example for the input matrix
m = 3; % m and n aren't necessarily equal to those values.
n = 5000; % It's only possible to know in advance that 'm' << 'n'.
a = randi([0,5],m,n); % the maximum value of 'a' isn't necessarily equal to
% 5 but it's possible to state that every element in
% 'a' is a positive integer.
% all, equal solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
%%%%%%%%%% all and equal solution %%%%%%%%%
ax_boo = all(a(:,i)==a,1);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc
% alternative solution
tic
for i = 1:n % stepping up the elapsed time in orders of magnitude
%%%%%%%%%%% alternative solution %%%%%%%%%%%
ax_boo = a(1,i) == a(1,:);
for k = 2:m
ax_boo(ax_boo) = a(k,i) == a(k,ax_boo);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
toc
but it's intuitive that any "for-loop-solution" within the MATLAB environment will be naturally slower. I'm wondering if there is a MATLAB built-in function written in a faster language.
EDIT:
After running more tests I found out that the implicit expansion does have a performance impact in evaluating a(:,i)==a. If the matrix a has more than one row, all(repmat(a(:,i),[1,n])==a,1) may be faster than all(a(:,i)==a,1) depending on the number of columns (n). For n=5000 repmat explicit expansion has proved to be faster.
But I think that a generalization of Kenneth Boyd's answer is the "ultimate solution" if all elements of a are positive integers. Instead of dealing with a (m x n matrix) in its original form, I will store and deal with adec (1 x n matrix):
exps = ((0):(m-1)).';
base = max(a,[],[1,2]) + 1;
adec = sum( a .* base.^exps , 1 );
In other words, each column will be encoded to one integer. And of course adec(i)==adec is faster than all(a(:,i)==a,1).
EDIT 2:
I forgot to mention that adec approach has a functional limitation. At best, storing adec as uint64, the following inequality must hold base^m < 2^64 + 1.
Since your goal is to count the number of columns that match, my example converts the binary encoding to integer decimals, then you just loop over the possible values (with 3 rows that are 8 possible values) and count the number of matches.
a_dec = 2.^(0:(m-1)) * a;
num_poss_values = 2 ^ m;
num_matches = zeros(num_poss_values, 1);
for i = 1:num_poss_values
num_matches(i) = sum(a_dec == (i - 1));
end
On my computer, using 2020a, Here are the execution times for your first 2 options and the code above:
Elapsed time is 0.246623 seconds.
Elapsed time is 0.553173 seconds.
Elapsed time is 0.000289 seconds.
So my code is 853 times faster!
I wrote my code so it will work with m being an arbitrary integer.
The num_matches variable contains the number of columns that add up to 0, 1, 2, ...7 when converted to a decimal.
As an alternative you can use the third output of unique:
[~, ~, iu] = unique(a.', 'rows');
for i = 1:n
ax_boo = iu(i) == iu;
end
As indicated in a comment:
ax_boo isolates the indices of the columns I have to sum in a row vector b. So, basically the next line would be something like c = sum(b(ax_boo),2);
It is a typical usage of accumarray:
[~, ~, iu] = unique(a.', 'rows');
C = accumarray(iu,b);
for i = 1:n
c = C(i);
end
I am writing in MATLAB a program that checks whether two elements A and B were exchanged in ranking positions.
Example
Assume the first ranking is:
list1 = [1 2 3 4]
while the second one is:
list2 = [1 2 4 3]
I want to check whether A = 3 and B = 4 have exchanged relative positions in the rankings, which in this case is true, since in the first ranking 3 comes before 4 and in the second ranking 3 comes after 4.
Procedure
In order to do this, I have written the following MATLAB code:
positionA1 = find(list1 == A);
positionB1 = find(list1 == B);
positionA2 = find(list2 == A);
positionB2 = find(list2 == B);
if (positionA1 <= positionB1 && positionA2 >= positionB2) || ...
(positionA1 >= positionB1 && positionA2 <= positionB2)
... do something
end
Unfortunately, I need to run this code a lot of times, and the find function is really slow (but needed to get the element position in the list).
I was wondering if there is a way of speeding up the procedure. I have also tried to write a MEX file that performs in C the find operation, but it did not help.
If the lists don't change within your loop, then you can determine the positions of the items ahead of time.
Assuming that your items are always integers from 1 to N:
[~, positions_1] = sort( list1 );
[~, positions_2] = sort( list2 );
This way you won't need to call find within the loop, you can just do:
positionA1 = positions_1(A);
positionB1 = positions_1(B);
positionA2 = positions_2(A);
positionB2 = positions_2(B);
If your loop is going over all possible combinations of A and B, then you can also vectorize that
Find the elements that exchanged relative ranking:
rank_diff_1 = bsxfun(#minus, positions_1, positions_1');
rank_diff_2 = bsxfun(#minus, positions_2, positions_2');
rel_rank_changed = sign(rank_diff_1) ~= sign(rank_diff_2);
[A_changed, B_changed] = find(rel_rank_changed);
Optional: Throw out half of the results, because if (3,4) is in the list, then (4,3) also will be, and maybe you don't want that:
mask = (A_changed < B_changed);
A_changed = A_changed(mask);
B_changed = B_changed(mask);
Now loop over only those elements that have exchanged relative ranking
for ii = 1:length(A_changed)
A = A_changed(ii);
B = B_changed(ii);
% Do something...
end
Instead of find try to compute something like this
Check if there is any exchanged values.
if logical(sum(abs(list1-list2)))
do something
end;
For specific values A and B:
if (list1(logical((list1-list2)-abs((list1-list2))))==A)&&(list1(logical((list1-list2)+abs((list1-list2))))==B)
do something
end;
I have two matrices in Matlab A and B, which have equal number of columns but different number of rows. The number of rows in B is also less than the number of rows in A. B is actually a subset of A.
How can I remove those rows efficiently from A, where the values in columns 1 and 2 of A are equal to the values in columns 1 and 2 of matrix B?
At the moment I'm doing this:
for k = 1:size(B, 1)
A(find((A(:,1) == B(k,1) & A(:,2) == B(k,2))), :) = [];
end
and Matlab complains that this is inefficient and that I should try to use any, but I'm not sure how to do it with any. Can someone help me out with this? =)
I tried this, but it doesn't work:
A(any(A(:,1) == B(:,1) & A(:,2) == B(:,2), 2), :) = [];
It complains the following:
Error using ==
Matrix dimensions must agree.
Example of what I want:
A-B in the results means that the rows of B are removed from A. The same goes with A-C.
try using setdiff. for example:
c=setdiff(a,b,'rows')
Note, if order is important use:
c = setdiff(a,b,'rows','stable')
Edit: reading the edited question and the comments to this answer, the specific usage of setdiff you look for is (as noticed by Shai):
[temp c] = setdiff(a(:,1:2),b(:,1:2),'rows','stable')
c = a(c,:)
Alternative solution:
you can just use ismember:
a(~ismember(a(:,1:2),b(:,1:2),'rows'),:)
Use bsxfun:
compare = bsxfun( #eq, permute( A(:,1:2), [1 3 2]), permute( B(:,1:2), [3 1 2] ) );
twoEq = all( compare, 3 );
toRemove = any( twoEq, 2 );
A( toRemove, : ) = [];
Explaining the code:
First we use bsxfun to compare all pairs of first to column of A and B, resulting with compare of size numRowsA-by-numRowsB-by-2 with true where compare( ii, jj, kk ) = A(ii,kk) == B(jj,kk).
Then we use all to create twoEq of size numRowsA-by-numRowsB where each entry indicates if both corresponding entries of A and B are equal.
Finally, we use any to select rows of A that matches at least one row of B.
What's wrong with original code:
By removing rows of A inside a loop (i.e., A( ... ) = []) you actually resizing A at almost each iteration. See this post on why exactly this is a bad practice.
Using setdiff
In order to use setdiff (as suggested by natan) on only the first two columns you'll need use it's second output argument:
[ignore, ia] = setdiff( A(:,1:2), B(:,1:2), 'rows', 'stable' );
A = A( ia, : ); % keeping only relevant rows, beyond first two columns.
Here's another bsxfun implementation -
A(~any(squeeze(all(bsxfun(#eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2)),2),:)
One more that is dangerously close to Shai's solution, but still avoids two permute to one permute -
A(~any(all(bsxfun(#eq,A(:,1:2),permute(B(:,1:2),[3 2 1])),2),3),:)
I have a matrix, matrix_logical(50000,100000), that is a sparse logical matrix (a lot of falses, some true). I have to produce a matrix, intersect(50000,50000), that, for each pair, i,j, of rows of matrix_logical(50000,100000), stores the number of columns for which rows i and j have both "true" as the value.
Here is the code I wrote:
% store in advance the nonzeros cols
for i=1:50000
nonzeros{i} = num2cell(find(matrix_logical(i,:)));
end
intersect = zeros(50000,50000);
for i=1:49999
a = cell2mat(nonzeros{i});
for j=(i+1):50000
b = cell2mat(nonzeros{j});
intersect(i,j) = numel(intersect(a,b));
end
end
Is it possible to further increase the performance? It takes too long to compute the matrix. I would like to avoid the double loop in the second part of the code.
matrix_logical is sparse, but it is not saved as sparse in MATLAB because otherwise the performance become the worst possible.
Since the [i,j] entry counts the number of non zero elements in the element-wise multiplication of rows i and j, you can do it by multiplying matrix_logical with its transpose (you should convert to numeric data type first, e.g matrix_logical = single(matrix_logical)):
inter = matrix_logical * matrix_logical';
And it works both for sparse or full representation.
EDIT
In order to calculate numel(intersect(a,b))/numel(union(a,b)); (as asked in your comment), you can use the fact that for two sets a and b, you have
length(union(a,b)) = length(a) + length(b) - length(intersect(a,b))
so, you can do the following:
unLen = sum(matrix_logical,2);
tmp = repmat(unLen, 1, length(unLen)) + repmat(unLen', length(unLen), 1);
inter = matrix_logical * matrix_logical';
inter = inter ./ (tmp-inter);
If I understood you correctly, you want a logical AND of the rows:
intersct = zeros(50000, 50000)
for ii = 1:49999
for jj = ii:50000
intersct(ii, jj) = sum(matrix_logical(ii, :) & matrix_logical(jj, :));
intersct(jj, ii) = intersct(ii, jj);
end
end
Doesn't avoid the double loop, but at least works without the first loop and the slow find command.
Elaborating on my comment, here is a distance function suitable for pdist()
function out = distfun(xi,xj)
out = zeros(size(xj,1),1);
for i=1:size(xj,1)
out(i) = sum(sum( xi & xj(i,:) )) / sum(sum( xi | xj(i,:) ));
end
In my experience, sum(sum()) is faster for logicals than nnz(), thus its appearance above.
You would also need to use squareform() to reshape the output of pdist() appropriately:
squareform(pdist(martrix_logical,#distfun));
Note that pdist() includes a 'jaccard' distance measure, but it is actually the Jaccard distance and not the Jaccard index or coefficient, which is the value you are apparently after.
I'm trying to figure out a way to create random numbers that "feel" random over short sequences. This is for a quiz game, where there are four possible choices, and the software needs to pick one of the four spots in which to put the correct answer before filling in the other three with distractors.
Obviously, arc4random % 4 will create more than sufficiently random results over a long sequence, but in a short sequence its entirely possible (and a frequent occurrence!) to have five or six of the same number come back in a row. This is what I'm aiming to avoid.
I also don't want to simply say "never pick the same square twice," because that results in only three possible answers for every question but the first. Currently I'm doing something like this:
bool acceptable = NO;
do {
currentAnswer = arc4random() % 4;
if (currentAnswer == lastAnswer) {
if (arc4random() % 4 == 0) {
acceptable = YES;
}
} else {
acceptable = YES;
}
} while (!acceptable);
Is there a better solution to this that I'm overlooking?
If your question was how to compute currentAnswer using your example's probabilities non-iteratively, Guffa has your answer.
If the question is how to avoid random-clustering without violating equiprobability and you know the upper bound of the length of the list, then consider the following algorithm which is kind of like un-sorting:
from random import randrange
# randrange(a, b) yields a <= N < b
def decluster():
for i in range(seq_len):
j = (i + 1) % seq_len
if seq[i] == seq[j]:
i_swap = randrange(i, seq_len) # is best lower bound 0, i, j?
if seq[j] != seq[i_swap]:
print 'swap', j, i_swap, (seq[j], seq[i_swap])
seq[j], seq[i_swap] = seq[i_swap], seq[j]
seq_len = 20
seq = [randrange(1, 5) for _ in range(seq_len)]; print seq
decluster(); print seq
decluster(); print seq
where any relation to actual working Python code is purely coincidental. I'm pretty sure the prior-probabilities are maintained, and it does seem break clusters (and occasionally adds some). But I'm pretty sleepy so this is for amusement purposes only.
You populate an array of outcomes, then shuffle it, then assign them in that order.
So for just 8 questions:
answer_slots = [0,0,1,1,2,2,3,3]
shuffle(answer_slots)
print answer_slots
[1,3,2,1,0,2,3,0]
To reduce the probability for a repeated number by 25%, you can pick a random number between 0 and 3.75, and then rotate it so that the 0.75 ends up at the previous answer.
To avoid using floating point values, you can multiply the factors by four:
Pseudo code (where / is an integer division):
currentAnswer = ((random(0..14) + lastAnswer * 4) % 16) / 4
Set up a weighted array. Lets say the last value was a 2. Make an array like this:
array = [0,0,0,0,1,1,1,1,2,3,3,3,3];
Then pick a number in the array.
newValue = array[arc4random() % 13];
Now switch to using math instead of an array.
newValue = ( ( ( arc4random() % 13 ) / 4 ) + 1 + oldValue ) % 4;
For P possibilities and a weight 0<W<=1 use:
newValue = ( ( ( arc4random() % (P/W-P(1-W)) ) * W ) + 1 + oldValue ) % P;
For P=4 and W=1/4, (P/W-P(1-W)) = 13. This says the last value will be 1/4 as likely as other values.
If you completely eliminate the most recent answer it will be just as noticeable as the most recent answer showing up too often. I do not know what weight will feel right to you, but 1/4 is a good starting point.