Fast check if element is in MATLAB matrix - performance

I would like to verify whether an element is present in a MATLAB matrix.
At the beginning, I implemented it as follows:
if ~isempty(find(matrix(:) == element))
which is obviously slow. Thus, I changed to:
if sum(matrix(:) == element) ~= 0
but this is again slow: I call the function that contains this instruction many times, and I lose 14 seconds each time!
Is there a way to further optimize this instruction?
Thanks.

If you just need to know if a value exists in a matrix, using the second argument of find to specify that you just want one value will be slightly faster (25-50%) and even a bit faster than using sum, at least on my machine. An example:
matrix = randi(100,1e4,1e4);
element = 50;
~isempty(find(matrix(:)==element,1))
However, in recent versions of Matlab (I'm using R2014b), nnz is finally faster for this operation, so:
matrix = randi(100,1e4,1e4);
element = 50;
nnz(matrix==element)~=0
On my machine this is about 2.8 times faster than any other approach (including using any, strangely) for the example provided. To my mind, this solution also has the benefit of being the most readable.

In my opinion, there are several things you could try to improve performance:
Following your initial idea, I would use the function any to test whether any of the equality tests succeeded:
if any(matrix(:) == element)
I tested this on a 1000 by 1000 matrix and it is faster than the solutions you have tested.
I do not think that unfolding with matrix(:) is a penalty, since it is equivalent to a reshape and Matlab handles it smartly: it does not actually allocate and move memory, because you are not modifying the temporary object matrix(:).
If your matrix does not change between calls to the function, or changes rarely, you could keep another vector containing all the elements of your matrix, but sorted. This way you could use a more efficient O(log(N)) search algorithm to test for the presence of your element.
I personally like the ismember function for this kind of problem. It might not be the fastest, but for non-critical parts of the code it greatly improves readability and code maintenance. I prefer to spend one hour coding something that will take a day to run rather than one day coding something that will run in one hour (this of course depends on how often you use the program, but it is something one should never forget).
If you can have a sorted copy of the elements of your matrix, you could consider using the undocumented Matlab function ismembc but remember that inputs must be sorted non-sparse non-NaN values.
If performance really is critical you might want to write your own mex file and for this task you could even include some simple parallelization using openmp.
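As a rough illustration of the last few suggestions, here is a minimal C++ sketch of the two search strategies that such a mex file could wrap. The MATLAB gateway code is deliberately left out, and the function names (contains_linear, make_sorted_copy, contains_sorted) are hypothetical, chosen only for this example. The sketch assumes the data contains no NaN values, the same restriction mentioned above for ismembc.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Linear scan that stops at the first match, so no temporary
// logical array is built (unlike matrix(:) == element).
bool contains_linear(const double* data, std::size_t n, double element) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == element) {
            return true;  // early exit on the first hit
        }
    }
    return false;
}

// Build once (or whenever the matrix changes): a sorted copy of all elements.
// Assumes no NaN values, since NaN breaks the ordering std::sort relies on.
std::vector<double> make_sorted_copy(const double* data, std::size_t n) {
    std::vector<double> sorted(data, data + n);
    std::sort(sorted.begin(), sorted.end());
    return sorted;
}

// O(log N) membership test on the sorted copy.
bool contains_sorted(const std::vector<double>& sorted, double element) {
    return std::binary_search(sorted.begin(), sorted.end(), element);
}
```

The sorted copy only pays off when the same matrix is queried many times; for a single query the linear scan does less work than sorting.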
Hope this helps,
Adrien.

Related

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions, manually. I wrote this simple piece of code that I know is not that optimized probably:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
    s1(i) = 0;
    for j = t
        s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
    end
end
plot(t,s1);
I have a core i7-7700HQ coupled with 32 GB of RAM. Matlab is stored on my HDD and my Windows is on my SSD. The problem is that this simple code is taking I think at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only taking 18% of my CPU and 3 GB of RAM for this task. Which is I think probably enough, I don't know. But I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and it takes how much it needs. I don't know if I can increase the CPU usage of it or not.
Is there any way to make things a little bit faster? I have 6 or 7 of these for loops in my homework, and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, it highlights the piece of code that is currently running. It is the for loop, the outer one is highlighted)
Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are N-by-1 vectors and the operator '.*' is element-wise multiplication (sometimes called 'dot multiplication'). This tells matlab to compute x(1)*z(1), x(2)*z(2), ..., x(N)*z(N) and assign each product to the corresponding element of the vector y. Additionally, many of the functions in matlab accept vectors as inputs, perform their operation on each element, and return an equal-size vector with the output at each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can replace your whole inner loop (including the s1(i) = 0 initialization) with one line that calls the numeric pulse functions on the vector t and sums the element-wise products:
s1(i) = sum( rectangularPulse(-1,1,t) .* triangularPulse(-1,1,t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle of what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, assume that x is some collection containing ~50 million elements:
def process[T](x: Traversable[T]) = {
  processFirst(x.head)
  x reduce processPair
  processLast(x.last)
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T) = ???
def isLastElement[T](x: T) = ???
def process[T](x: Traversable[T]) = {
  x reduce {
    (left, right) =>
      if (isFirstElement(left))
        processFirst(left)
      else if (isLastElement(right))
        processLast(right)
      processPair(left, right)
  }
}
Which approach is faster? and for ~50 million elements, how much faster?
It seems to me that the first example would be faster because there are fewer conditional checks occurring for all but the first and last elements. However for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the business logic can be placed in the same function body, while in the former case it must be partitioned into three separate function bodies invoked separately.
** EDIT **
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to reflect better the problem I am attempting to solve:
// x is some iterator
def process[T](x: Iterator[T]) = {
  if (x.hasNext) {
    var previous = x.next()
    var current = null.asInstanceOf[T]
    processFirst(previous)
    while (x.hasNext) {
      current = x.next()
      processPair(previous, current)
      previous = current
    }
    processLast(previous)
  }
}
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high performance manner would be another exercise for another question.
How does this code snippet stack up against the second of the two examples above (the single-iteration block approach containing all the branches)? The other thing I realize is that the second of the two examples is also broken on collections containing fewer than two elements.
If your underlying collection has an inexpensive head and last method (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get first, and you can keep updating a second var with the right argument to obtain last, and then do the final operation outside of the loop.)
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, in a 2-element list your second code has a bug (because there is an else instead of a separate test). In a 1-element list, the second code never calls reduce's lambda at all, so again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like-operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. list). For inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to java or even "C". Don't get me wrong, I love scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Is it worth it to rewrite an if statement to avoid branching?

Recently I realized I have been doing too much branching without caring about its negative impact on performance, so I have made up my mind to learn all about avoiding branches. Here is a more extreme case, in an attempt to make the code have as few branches as possible.
Hence for the code
if(expression)
A = C; //A and C have to be the same type here obviously
expression can be A == B, or Q<=B; it could be anything that resolves to true or false, or, as I would like to think of it here, a result of 1 or 0.
I have come up with this non branching version
A += (expression)*(C-A); //Edited with thanks
So my question would be, is this a good solution that maximizes efficiency?
If yes why and if not why?
Depends on the compiler, instruction set, optimizer, etc. When you use a boolean expression as an int value, e.g., (A == B) * C, the compiler has to do the compare, and then set some register to 0 or 1 based on the result. Some instruction sets might not have any way to do that other than branching. Generally speaking, it's better to write simple, straightforward code and let the optimizer figure it out, or find a different algorithm that branches less.
Jeez, no, don't do that!
Anyone who "penalize[s] [you] a lot for branching" would hopefully send you packing for using something that awful.
How is it awful, let me count the ways:
There's no guarantee you can multiply a quantity (e.g., C) by a boolean value (e.g., (A==B) yields true or false). Some languages will, some won't.
Anyone casually reading it is going to observe a calculation, not an assignment statement.
You're replacing a comparison, and a conditional branch with two comparisons, two multiplications, a subtraction, and an addition. Seriously non-optimal.
It only works for integral numeric quantities. Try this with a wide variety of floating point numbers, or with an object, and if you're really lucky it will be rejected by the compiler/interpreter/whatever.
You should only ever consider doing this if you have analyzed the runtime properties of the program and determined that there is a frequent branch misprediction here, and that this is causing an actual performance problem. It makes the code much less clear, and it's not obvious that it would be any faster in general (this is something you would also have to measure, under the circumstances you are interested in).
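To make the two forms from the question concrete, here is a small C++ sketch that puts them side by side for 32-bit integers. The function names are hypothetical, and the arithmetic form assumes that c - a does not overflow, which is an extra restriction the plain if does not have. Nothing here says which form is faster; that still has to be measured.

```cpp
#include <cstdint>

// Branching form from the question: if (expression) A = C;
std::int32_t assign_branched(std::int32_t a, std::int32_t c, bool expression) {
    if (expression) {
        a = c;
    }
    return a;
}

// Arithmetic form from the question: A += (expression) * (C - A);
// In C++ a bool converts to 0 or 1, so this adds either 0 or (c - a).
// Assumption: c - a does not overflow the signed 32-bit range.
std::int32_t assign_arithmetic(std::int32_t a, std::int32_t c, bool expression) {
    a += static_cast<std::int32_t>(expression) * (c - a);
    return a;
}
```

An optimizing compiler is free to turn either function into a conditional move or a branch, so the source-level shape does not guarantee what the generated code looks like.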
After doing some research, I came to the conclusion that when there is a bottleneck, it is good to use a timing profiler, as this kind of code is usually not portable and is mainly used for optimization.
A concrete example came up after I read the following question:
Why is it faster to process a sorted array than an unsorted array?
I tested my code in C++ using that example and found that my implementation was actually slower due to the extra arithmetic.
HOWEVER!
For this case below
if(expression) //branched version
A += C;
//OR
A += (expression)*(C); //non-branching version
The timings were as follows:
The branched version on a sorted list took approximately 2 seconds.
The branched version on an unsorted list took approximately 10 seconds.
My implementation took about 3 seconds, whether the list was sorted or unsorted.
This goes to show that in a bottleneck over unsorted data, where a trivial branch can be replaced by a single multiplication, it is probably worthwhile to consider the implementation that I have suggested.
** Once again, this is mainly for the areas that are deemed to be the bottleneck **
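For reference, the two loop bodies being compared can be sketched in C++ roughly as follows. The threshold comparison stands in for "expression" and is an assumption made for this example, as are the function names; the original benchmark code is not shown in the post, so this is not a reproduction of it.

```cpp
#include <cstdint>
#include <vector>

// Branched version: if (expression) A += C;
std::int64_t sum_branched(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) {
            sum += v;
        }
    }
    return sum;
}

// Multiplication version: A += (expression) * C;
// The comparison converts to 0 or 1, so each element adds either 0 or v.
std::int64_t sum_multiplied(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        sum += static_cast<std::int64_t>(v >= threshold) * v;
    }
    return sum;
}
```

Both functions return the same value for any input, so they can be swapped in a benchmark; whether the second one helps depends on the data order, the compiler and the CPU, and has to be timed on the real workload.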

How to implement a part of histogram equalization in matlab without using for loops and influencing speed and performance

Suppose that I have these three variables in MATLAB: NewGrayLevels, OldGrayLevels and OldHistogram.
I want to extract the distinct values in NewGrayLevels and, for each one, sum the rows of OldHistogram that lie in the same rows as that value.
For example, you see in NewGrayLevels that the first six rows are equal to zero. It means that 0 in NewGrayLevels has taken its value from (0 1 2 3 4 5) of OldGrayLevels. So the corresponding rows in OldHistogram should be summed.
So 0+2+12+38+113+163=328 would be the frequency of the gray level 0 in the equalized histogram and so on.
Those who are familiar with image processing know that it's part of the histogram equalization algorithm.
Note that I don't want to use built-in function "histeq" available in image processing toolbox and I want to implement it myself.
I know how to write the algorithm with for loops. I'm seeking if there is a faster way without using for loops.
The code using for loops:
for k=0:255
    Condition = NewGrayLevels==k;
    ConditionMultiplied = Condition.*OldHistogram;
    NewHistogram(k+1,1) = sum(ConditionMultiplied);
end
I'm afraid this code will be slow for big, high-resolution images. The variables that I have uploaded are for a small image downloaded from the internet, but my code may be used for satellite images.
I know you say you don't want to use histeq, but it might be worth your time to look at the MATLAB source file to see how the developers wrote it and copy the parts of their code that you would like to implement. Just do edit('histeq') or edit('histeq.m'), I forget which.
Usually the MATLAB code is vectorized where possible and runs pretty quick. This could save you from having to reinvent the entire wheel, just the parts you want to change.
I can't think of a way to implement this without a for loop somewhere, but one optimisation you could make would be using indexing instead of multiplication:
for k=0:255
    Condition = NewGrayLevels==k; % These act as logical indices to OldHistogram
    NewHistogram(k+1,1) = sum(OldHistogram(Condition)); % Removes a vector multiplication, some additions, and an index-to-double conversion
end
Edit:
On rereading your initial post, I think that the way to do this without a for loop is to use accumarray (I find this a difficult function to understand, so read the documentation and search online and on here for examples to do so):
NewHistogram = accumarray(1+NewGrayLevels,OldHistogram);
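The result has as many rows as the maximum value in NewGrayLevels plus one (+1 because you are starting at zero), so it matches the length of OldHistogram only when that count equals the length of OldHistogram. NewGrayLevels and OldHistogram must also have the same number of elements.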
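Since accumarray can be hard to picture, here is a minimal C++ sketch of the scatter-add it performs in this use: each entry of the old histogram is added into the bin named by the new gray level in the same row. The function name accumulate_histogram and the default of 256 levels are assumptions for this example, and the sketch assumes both inputs have the same length and every level lies in the range 0 to num_levels - 1.

```cpp
#include <cstddef>
#include <vector>

// For every row i, add old_histogram[i] into the bin new_gray_levels[i].
// Assumptions: both vectors have the same length, and every gray level
// satisfies 0 <= level < num_levels.
std::vector<long> accumulate_histogram(const std::vector<int>& new_gray_levels,
                                       const std::vector<long>& old_histogram,
                                       std::size_t num_levels = 256) {
    std::vector<long> new_histogram(num_levels, 0);
    for (std::size_t i = 0; i < new_gray_levels.size(); ++i) {
        const std::size_t bin = static_cast<std::size_t>(new_gray_levels[i]);
        new_histogram[bin] += old_histogram[i];
    }
    return new_histogram;
}
```

With the six rows from the question (new level 0 paired with counts 0, 2, 12, 38, 113 and 163), bin 0 ends up holding 328. This is a single pass over the rows instead of 256 passes, which is the main reason the accumarray form avoids the loop over k.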
Well, I realized that there's no need to write the code that @Hugh Nolan suggested. See the explanation here:
%The green lines are because after writing the code, I understood that
%there's no need to calculate the equalized histogram in
%"HistogramEqualization" function and after gaining the equalized image
%matrix you can pass it to the "ExtractHistogram" function
% (which there's no loops in it) to acquire the
%equalized histogram.
%But I didn't delete those lines of code because I had tried a lot to
%understand the algorithm and write them.
For more information and studying the code, please see my next question.

Branchless Binary Search

I'm curious if anyone could explain a branchless binary search implementation to me. I saw it mentioned in a recent question but I can't imagine how it would be implemented. I assume it could be useful to avoid branches if the number of items is quite large.
I'm going to assume you're talking about the sentence "Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it." found in this answer.
A "branchless" binary search is basically just an unrolled binary search loop. This only works if you know in advance the number of items in the array you're searching (as you would if it's static const). You can write a program to write the unrolled code if it's too long to do by hand.
Then, you must benchmark your solution to see whether it really is faster than a loop. If your branchless code is too big, it won't fit inside the CPU's fast instruction cache and will take longer to run than the equivalent loop.
If one has a function which returns +1, -1, or 0 based upon the position of the correct item versus the current one, one could initialize position to list size/2, and stepsize to position/2, and then after each comparison do position+=direction*stepsize; stepsize=stepsize/2. Iterate until stepsize is zero.
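A minimal C++ sketch of the step-halving idea follows. It is not the unrolled variant described earlier, and it uses a common lower-bound formulation rather than the exact +1/-1/0 direction function from the paragraph above: the result of the comparison (0 or 1) is multiplied into the step, so the data-dependent if disappears. The function name is hypothetical. The loop condition is still a branch, but it depends only on the array length, not on the data being compared.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first element >= key in a sorted vector,
// or sorted.size() if every element is smaller than key.
std::size_t lower_bound_branchless(const std::vector<int>& sorted, int key) {
    std::size_t n = sorted.size();
    if (n == 0) {
        return 0;
    }
    const int* base = sorted.data();
    while (n > 1) {
        const std::size_t half = n / 2;
        // The comparison yields 0 or 1, so base either stays put
        // or jumps forward by half; no data-dependent if is needed.
        base += static_cast<std::size_t>(base[half] < key) * half;
        n -= half;
    }
    // One final comparison decides between base and the slot after it.
    return static_cast<std::size_t>(base - sorted.data()) + (*base < key);
}
```

To test membership, check that the returned index is in range and that the element there equals the key. Whether the compiler actually emits branch-free machine code for the multiplication is not guaranteed, so, as noted above, this needs benchmarking against a plain loop.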

Resources