Why is the weird nested arrayfun the most time-effective?

I was writing a simple autocorrelation function for a matrix. Each row is a separate time series, and we autocorrelate it with itself for a given set of lags. I have found that one of the most counterintuitive approaches works best.
% 17 second benchmark
A = cell2mat(arrayfun(@(i) AutoCorrelation(array(i,:),lags)',1:size(array,1),'UniformOutput',false))';
where the function itself is the following
function ans = AutoCorrelation(array,lags)
arrayfun(@(x) dot(array(lags(x)+1:end),array(1:end-lags(x)))/(length(array)-lags(x)),1:length(lags));
end
Other things I have tried:
A = zeros(size(array,1),length(lags));
T = size(array,2);
% 97 seconds benchmark
A = cell2mat(arrayfun(@(i) arrayfun(@(x) dot(array(i,lags(x)+1:end),array(i,1:end-lags(x)))/(size(array,2)-lags(x)),1:length(lags))',1:size(array,1),'UniformOutput',false))';
% 100 seconds benchmark
A = arrayfun(@(i,x) dot(array(i,lags(x)+1:end),array(i,1:end-lags(x)))/(size(array,2)-lags(x)),repmat((1:size(array,1))',1,length(lags)),repmat(1:length(lags),size(array,1),1));
% 27 second benchmark
for i = 1:length(lags)
    A(:,i) = dot(array(:,lags(i)+1:end),array(:,1:end-lags(i)),2)/(T-lags(i));
end
% 95 second benchmark
for i = 1:length(lags)
    for j = 1:size(array,1)
        A(j,i) = dot(array(j,lags(i)+1:end),array(j,1:end-lags(i)),2)/(T-lags(i));
    end
end
This is more of a curiosity question. If you asked me, I would have bet that a direct dot-product method would work best. Also, if arrayfun works that well, why doesn't the two-argument arrayfun perform well?
My array was a 512-by-100000 array of doubles.
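Before comparing timings, it can help to confirm on a small input that the variants compute the same values. The following is a minimal sketch and not part of the original benchmark: the small random array, the lag values, and the names A_vec, A_ref and max_diff are assumptions chosen for illustration.

```matlab
% Small self-check (assumed sizes, far smaller than the 512-by-100000 case)
rng(0);                               % fixed seed so the run is repeatable
array = rand(4, 50);                  % 4 short time series, one per row
lags  = [0 1 5 10];                   % lags to evaluate (assumed values)
T     = size(array, 2);

% Variant A: loop over lags, dot product along each row (dimension 2)
A_vec = zeros(size(array,1), numel(lags));
for i = 1:numel(lags)
    A_vec(:,i) = dot(array(:,lags(i)+1:end), array(:,1:end-lags(i)), 2) / (T - lags(i));
end

% Variant B: plain double loop with an explicit sum, used as the reference
A_ref = zeros(size(A_vec));
for j = 1:size(array,1)
    x = array(j,:);
    for i = 1:numel(lags)
        A_ref(j,i) = sum(x(lags(i)+1:end) .* x(1:end-lags(i))) / (T - lags(i));
    end
end

max_diff = max(abs(A_vec(:) - A_ref(:)));   % largest disagreement between the two
fprintf('Largest difference between variants: %g\n', max_diff);
```

Once the variants agree, each one can be wrapped in a function handle and passed to timeit, which repeats the measurement and is usually less noisy than a single tic/toc pair.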


Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.
I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible x91 improvement!). So far so good.
However, I also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations if we use arrayfun() as well. If I try to follow the tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:
Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end
and the results are:
Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612
My questions are:
Am I using arrayfun() correctly in this situation, and it is expected that arrayfun() should be worse?
If so, and it is really just expected that it is slower than the direct gpuArray method, is there any easy (i.e non-MEX) way to speed up such a calculation? (I see they also mention using pagefun for example).
Thanks in advance for any advice.
(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2017a)
After reading @Edric's answer, I think it is important to show a little more of the wider code. One thing I didn't mention in the OP is that, in my actual main code, there is an additional operation inside the k=1:Nt loop: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on:
Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end
I'm very sorry for not including that in the OP - I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun() ?
A few points here. Firstly, (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling toc. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method, and 0.18 seconds for the arrayfun version.
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body so that that loop runs directly on the GPU. Like this:
%%% Function to operate on matrices %%%
function x = test_function(x,Nt)
for ii = 1:Nt
    x = exp(-1i*(x + abs(x).^2));
end
end
You'll need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
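The two timing approaches mentioned above can be sketched as follows. This is a minimal illustration, assuming Parallel Computing Toolbox and a supported GPU are available; the reduced iteration count and the names dev, step, t_loop and t_step are assumptions made for the example, not taken from the question.

```matlab
% Requires Parallel Computing Toolbox and a supported GPU.
Nt    = 100;                                   % assumed, smaller than in the question
gpu_A = gpuArray(complex(rand(500,600), rand(500,600)));
dev   = gpuDevice;                             % handle to the currently selected GPU

% Option 1: tic/toc with an explicit synchronisation before toc
out = gpu_A;
tic
for k = 1:Nt
    out = exp(-1i*(out + abs(out).^2));
end
wait(dev);                                     % block until all queued GPU work has finished
t_loop = toc;

% Option 2: gputimeit synchronises and repeats the measurement itself
step   = @(x) exp(-1i*(x + abs(x).^2));        % one update as an anonymous function
t_step = gputimeit(@() step(gpu_A));           % time for a single update

fprintf('Loop of %d updates: %g s, single update: %g s\n', Nt, t_loop, t_step);
```

Without the wait call, toc can return while kernels are still queued, which makes GPU code look faster than it is.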

A faster alternative to all(a(:,i)==a,1) in MATLAB

It is a straightforward question: Is there a faster alternative to all(a(:,i)==a,1) in MATLAB?
I'm thinking of an implementation that benefits from short-circuit evaluation throughout the whole process. I mean, all() definitely benefits from short-circuit evaluation, but a(:,i)==a doesn't.
I tried the following code,
% example for the input matrix
m = 3; % m and n aren't necessarily equal to those values.
n = 5000; % It's only possible to know in advance that 'm' << 'n'.
a = randi([0,5],m,n); % the maximum value of 'a' isn't necessarily equal to
% 5 but it's possible to state that every element in
% 'a' is a positive integer.
% all, equal solution
for i = 1:n % stepping up the elapsed time in orders of magnitude
    %%%%%%%%%% all and equal solution %%%%%%%%%
    ax_boo = all(a(:,i)==a,1);
end
% alternative solution
for i = 1:n % stepping up the elapsed time in orders of magnitude
    %%%%%%%%%%% alternative solution %%%%%%%%%%%
    ax_boo = a(1,i) == a(1,:);
    for k = 2:m
        ax_boo(ax_boo) = a(k,i) == a(k,ax_boo);
    end
end
but it's intuitive that any "for-loop-solution" within the MATLAB environment will be naturally slower. I'm wondering if there is a MATLAB built-in function written in a faster language.
After running more tests I found out that the implicit expansion does have a performance impact in evaluating a(:,i)==a. If the matrix a has more than one row, all(repmat(a(:,i),[1,n])==a,1) may be faster than all(a(:,i)==a,1) depending on the number of columns (n). For n=5000 repmat explicit expansion has proved to be faster.
But I think that a generalization of Kenneth Boyd's answer is the "ultimate solution" if all elements of a are positive integers. Instead of dealing with a (m x n matrix) in its original form, I will store and deal with adec (1 x n matrix):
exps = ((0):(m-1)).';
base = max(a,[],[1,2]) + 1;
adec = sum( a .* base.^exps , 1 );
In other words, each column will be encoded to one integer. And of course adec(i)==adec is faster than all(a(:,i)==a,1).
I forgot to mention that the adec approach has a functional limitation. At best, storing adec as uint64, the following inequality must hold: base^m < 2^64 + 1.
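As a quick sanity check of the encoding idea, the sketch below compares adec(i)==adec against all(a(:,i)==a,1) for every column of a small random matrix. The sizes, the seed, and the names fits_in_double and all_match are assumptions for illustration. It also checks the range limit for the case where adec is kept as a double rather than uint64: doubles represent integers exactly only up to flintmax (2^53), so that is the bound tested here.

```matlab
rng(1);                               % fixed seed so the run is repeatable
m = 3;  n = 200;                      % assumed sizes, with m << n as in the question
a = randi([1,5], m, n);               % positive integers, as stated in the question

exps = (0:m-1).';                     % one exponent per row
base = max(a(:)) + 1;                 % every entry is a valid digit in this base
adec = sum(a .* base.^exps, 1);       % one integer code per column (implicit expansion, R2016b+)

% The encoding is only exact in double precision while the codes stay below flintmax.
fits_in_double = base^m <= flintmax;

% Compare against the original test for every column.
all_match = true;
for i = 1:n
    ref = all(a(:,i) == a, 1);
    all_match = all_match && isequal(adec(i) == adec, ref);
end
fprintf('Encoding agrees with all(a(:,i)==a,1): %d\n', all_match);
```

Two columns get the same code only if they agree in every row, because each row acts as one digit of a base-`base` number.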
Since your goal is to count the number of columns that match, my example converts the binary encoding to decimal integers; then you just loop over the possible values (with 3 rows there are 8 possible values) and count the number of matches.
a_dec = 2.^(0:(m-1)) * a;
num_poss_values = 2 ^ m;
num_matches = zeros(num_poss_values, 1);
for i = 1:num_poss_values
    num_matches(i) = sum(a_dec == (i - 1));
end
On my computer, using R2020a, here are the execution times for your first 2 options and the code above:
Elapsed time is 0.246623 seconds.
Elapsed time is 0.553173 seconds.
Elapsed time is 0.000289 seconds.
So my code is 853 times faster!
I wrote my code so it will work with m being an arbitrary integer.
The num_matches variable contains the number of columns that add up to 0, 1, 2, ...7 when converted to a decimal.
As an alternative you can use the third output of unique:
[~, ~, iu] = unique(a.', 'rows');
for i = 1:n
    ax_boo = iu(i) == iu;
end
As indicated in a comment:
ax_boo isolates the indices of the columns I have to sum in a row vector b. So, basically the next line would be something like c = sum(b(ax_boo),2);
It is a typical usage of accumarray:
[~, ~, iu] = unique(a.', 'rows');
C = accumarray(iu,b);
for i = 1:n
    c = C(iu(i));
end
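The grouping idea can be checked against the original mask-based sum on a small example. This is a minimal sketch: the sizes, the random row vector b, and the names c_fast, c_ref and max_err are assumptions for illustration. Note that C has one entry per distinct column, so the total for column i is looked up through its group id iu(i).

```matlab
rng(2);                               % fixed seed so the run is repeatable
m = 3;  n = 40;                       % assumed sizes
a = randi([0,2], m, n);               % few distinct values, so many columns repeat
b = rand(1, n);                       % row vector to be summed per group (assumed data)

[~, ~, iu] = unique(a.', 'rows');     % iu(i) is the group id of column i
C = accumarray(iu, b(:));             % C(g) is the sum of b over all columns in group g
c_fast = C(iu).';                     % per-column totals, looked up through the group id

% Reference: the original mask-based computation, one column at a time
c_ref = zeros(1, n);
for i = 1:n
    c_ref(i) = sum(b(all(a(:,i) == a, 1)));
end

max_err = max(abs(c_fast - c_ref));
fprintf('Largest difference: %g\n', max_err);
```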

while-loop faster than for when returning iterator

I'm trying to oversimplify this as much as possible.
Functions f1 and f2 implement a very simplified version of a roulette wheel selection over a Vector R. The only difference between them is that f1 uses a for loop and f2 a while loop. Both functions return the index of the array where the condition was met.
function f1(X::Vector)
    l = length(X)
    r = rand()*X[l]
    for i = 1:l
        if r <= X[i]
            return i
        end
    end
end
function f2(X::Vector)
    l = length(X)
    r = rand()*X[l]
    i = 1
    while true
        if r <= X[i]
            return i
        end
        i += 1
    end
end
now I created a couple of test functions...
M is the number of times we repeat the function execution.
Now this is critical... I want to store the values I get from the functions because I need them later... To oversimplify the code I just created a new variable r where I sum up the returns from the functions.
function test01(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f1(cumR)
        r += a
    end
    return r
end
function test02(M,R)
    cumR = cumsum(R)
    r = 0
    for i = 1:M
        a = f2(cumR)
        r += a
    end
    return r
end
So, next I get:
@time test01(1e7,R)
elapsed time: 1.263974802 seconds (320000832 bytes allocated, 15.06% gc time)
@time test02(1e7,R)
elapsed time: 0.57086421 seconds (1088 bytes allocated)
So, for some reason I can't figure out, f1 allocates a lot of memory, and the amount grows as M gets larger.
I said the line r += a was critical, because if I remove it from both test functions, I get the same result with both tests, so no problems! So I thought there was a problem with the type of a being returned by the functions (because f1 returns the iterator of the for loop, and f2 uses its own variable i "manually declared" inside the function).
aa = f1(cumsum(R))
bb = f2(cumsum(R))
typeof(aa) == typeof(bb)
So... what the hell is going on???
I apologize if this is some sort of basic question but, I've been going over this for over 3 hours now and couldn't find an answer... Even though the functions are fixed by using a while loop I hate not knowing what's going on.
When you see lots of surprising allocations like that, a good first thing to check is type-stability. The @code_warntype macro is very helpful here:
julia> @code_warntype f1(R)
# … lots of annotated code, but the important part is this last line:
Compare that to f2:
julia> @code_warntype f2(R)
# ...
So, why are the two different? Julia thinks that f1 might sometimes return nothing (which is of type Void)! Look again at your f1 function: what would happen if the last element of X is NaN? It'll just fall off the end of the function with no explicit return statement. In f2, however, you'll end up indexing beyond the bounds of X and get an error instead. Fix this type-instability by deciding what to do if the loop completes without finding the answer and you'll see more similar timings.
As I stated in the comment, your functions f1 and f2 both draw a random number internally and use it as the stopping criterion. Thus, there is no deterministic way to measure which of the functions is faster: the running time depends on the random draw, not only on the implementation.
You can rewrite f1 and f2 to accept r as a parameter:
function f1(X::Vector, r)
    for i = 1:length(X)
        if r <= X[i]
            return i
        end
    end
end
function f2(X::Vector, r)
    i = 1
    while i <= length(X)
        if r <= X[i]
            return i
        end
        i += 1
    end
end
And then measure the time properly with the same R and r for both functions:
>>> R = cumsum(rand(100))
>>> r = rand(1_000_000) * R[end] # generate 1_000_000 random thresholds
>>> @time for i=1:length(r); f1(R, r[i]); end;
0.177048 seconds (4.00 M allocations: 76.278 MB, 2.70% gc time)
>>> @time for i=1:length(r); f2(R, r[i]); end;
0.173244 seconds (4.00 M allocations: 76.278 MB, 2.76% gc time)
As you can see, the timings are now nearly identical. Any difference will be caused by external factors (warm-up, or the processor being busy with other tasks).

How to create an array of concatenated contents from an array of labeled arrays

I have the following data:
a cell array of labels (e.g. a cell array of 4 message types, where each type is a string)
a cell array of messages (e.g. a cell array of 5000 messages, where each message is a cell array of many word strings)
a cell array of labels for each message (e.g. a cell array of 5000 strings, where the string in cell i is the type of the message in cell i of the array from item 2).
My goal is to get from this data a cell array with one cell per label, where each cell holds the concatenated contents of all the messages of that label's type (e.g. a cell array of 4 cells, where cell i holds a cell array of all the words from all the messages whose type is i).
I implemented 3 methods to perform this. This is the code for my 3 implementations:
% setting data for tic toc tests
messagesTypesOptions = {'type1';'type2';'type3';'type4'};
messages = cell(5000,1);
for i = 1:5000
    messages{i} = {'word1';'word2';'word3';'word4';'word5';'word6';'word7';'word8';'word9';'word10'};
end
messages_labels = cell(5000,1);
for i = 1:5000
    messages_labels{i} = messagesTypesOptions{randi([1 4])};
end
% start test
% method 1
tic
type_to_msgs1 = cell(size(messagesTypesOptions,1),1);
for i = 1:size(messagesTypesOptions,1)
    type_to_msgs1{i} = messages(strcmp(messages_labels,messagesTypesOptions{i}));
end
type_to_concatenated1 = cell(4,1);
for i = 1:4
    type_to_msgs1{i} = type_to_msgs1{i}';
end
for i = 1:4
    label_msgs = type_to_msgs1{i};
    num_of_label_msgs = size(label_msgs,2);
    for j = 1:num_of_label_msgs
        label_msgs{j} = label_msgs{j}';
    end
    type_to_concatenated1{i} = [label_msgs{:}];
end
toc
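% method 2
tic
type_to_concatenated2 = cell(4,1);
labelStr_to_labelIndex = containers.Map(messagesTypesOptions,1:4);
for textIndex = 1:5000
    type_to_concatenated2{labelStr_to_labelIndex(messages_labels{textIndex})} = ...
        [type_to_concatenated2{labelStr_to_labelIndex(messages_labels{textIndex})}, messages{textIndex}'];
end
toc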
% method 3
tic
type_to_concatenated3 = cell(4,1);
labelStr_to_labelIndex2 = containers.Map(messagesTypesOptions,1:4);
matrix_label_to_isMsgFromLabel = zeros(4,5000);
for textIndex = 1:5000
    matrix_label_to_isMsgFromLabel(labelStr_to_labelIndex2(messages_labels{textIndex}) ...
        ,textIndex) = 1;
end
for i = 1:4
    label_msgs3 = messages(~~matrix_label_to_isMsgFromLabel(i,:))';
    num_of_label_msgs3 = size(label_msgs3,2);
    for j = 1:num_of_label_msgs3
        label_msgs3{j} = label_msgs3{j}';
    end
    type_to_concatenated3{i} = [label_msgs3{:}];
end
toc
Those are the results I get:
Elapsed time is 0.033120 seconds.
Elapsed time is 0.471959 seconds.
Elapsed time is 0.095011 seconds.
So, the conclusion is that method 1 is the fastest.
Now, my question is: Is there a way to solve this in a faster way?
Intuitively, it seems that my method 1 is not very efficient, because it has a for loop with strcmp, and each strcmp call reads all the message labels. So the same labels are read once per type, i.e. as many times as there are types.
So, is there a way to modify one of my methods to get a faster solution? Is there another method which is faster?
EDIT: Here I used constant messages for the examples. But I want a solution for the case where the messages differ from each other and can be of different sizes.
EDIT2: Also, the types are strings that don't necessarily have numbers in them (e.g. instead of type1, type2, ... that I used for the example code, they can be 'error', 'warning', 'valid').
Basically you have messages and need to index into them to get output for each cell of the output cell array and finally concatenate the elements. For indexing you can use logical indexing which in most cases is very efficient. For getting the logical indexing arrays, you can take help of bsxfun. Here's the code to wrap up the discussion -
%// Get the parameters
lbls_len = numel(messages_labels);
msgtypeops_len = numel(messagesTypesOptions);
%// Tag messages_labels and messagesTypesOptions with numbers
alltypes = [messages_labels ; messagesTypesOptions];
[~,~,IDs] = unique(alltypes,'stable');
lbls = IDs(1:lbls_len);
typeops = IDs(lbls_len+1:end);
%// Positions of matches for each label IDs against type IDS
pos = bsxfun(@eq,lbls,typeops'); %//'
%// Logically index into messages and select the ones based on positions
%// obtained in the previous step for the final output and finally
%// concatenate along the rows to get us the final output cell array
out = arrayfun(@(n) vertcat(messages{pos(:,n)})',1:msgtypeops_len,'Uni',0)';
Here are some runtimes comparing Method - 1 that turned out to be best one as listed in the question against the proposed solution.
1) With length of messages_labels as 5000:
------------------ With Method - 1
Elapsed time is 0.072821 seconds.
------------------ With Proposed solution
Elapsed time is 0.053961 seconds.
2) With length of messages_labels as 500000:
------------------ With Method - 1
Elapsed time is 6.998149 seconds.
------------------ With Proposed solution
Elapsed time is 2.765090 seconds.
A roughly 1.3x-2.5x speedup might be good enough for you!
As ever, this boils down to a simple indexing problem, and for cell arrays of strings MATLAB has a nice way to generate those indices: ismember. There might be a clever way to then use that index vector to pull all the messages out in one go, but logical indexing is easy and quick enough, and JIT magic actually makes the trivial loop faster than arrayfun (using R2013b on Linux). That gives us this:
out = cell(4,1);
[~, idx] = ismember(messages_labels, messagesTypesOptions);
for ii=1:4
    out{ii} = vertcat(messages{idx == ii})';
end
With the above added to the end of the original code:
>> test
Elapsed time is 0.056497 seconds.
Elapsed time is 0.857934 seconds.
Elapsed time is 0.201966 seconds.
Elapsed time is 0.017667 seconds.
Not bad :D
Replace all the 5000's with 50000's and it still scales linearly like #1 and #3:
>> test
Elapsed time is 0.550462 seconds.
Elapsed time is 48.685048 seconds.
Elapsed time is 1.965559 seconds.
Elapsed time is 0.162989 seconds.
Just to be sure:
>> isequal(type_to_concatenated1, type_to_concatenated2, type_to_concatenated3, out)
ans =
And, if you can handle the grouped messages being column vectors rather than rows, take out the transpose...
out{ii} = vertcat(messages{idx == ii});
...and it's twice as fast again:
>> test
Elapsed time is 0.552040 seconds.
Elapsed time is <skipped>
Elapsed time is 1.986059 seconds.
Elapsed time is 0.077958 seconds.
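To see how the ismember approach behaves with messages of different lengths and non-numeric type names (the case raised in the question's edits), here is a small sketch. The sample words, the labels, and the variable name types are made up for illustration.

```matlab
% Hypothetical sample data: three types, five messages of varying length
types    = {'error'; 'warning'; 'valid'};
messages = {{'disk';'full'}; {'ok'}; {'low';'battery';'level'}; {'cpu';'hot'}; {'done'}};
labels   = {'error'; 'valid'; 'warning'; 'error'; 'valid'};

% idx(k) is the position of labels{k} within types
[~, idx] = ismember(labels, types);

out = cell(numel(types), 1);
for ii = 1:numel(types)
    % Stack the column cell arrays of all messages of type ii, then transpose to a row
    out{ii} = vertcat(messages{idx == ii}).';
end

disp(out{1});   % words of all 'error' messages, in message order
```

Because ismember maps every label to its position in one pass, the labels are compared against the type list only once instead of once per type.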

Speeding up MATLAB code for FDR estimation

I have 2 input variables:
a vector of p-values (p) with N elements (unsorted)
and an N x M matrix of p-values obtained by random permutations (pr) with M iterations. N is quite large, 10K to 100K or more. M is, let's say, 100.
I'm estimating the False Discovery Rate (FDR) for each element of p, i.e. how many p-values from random permutations would pass if the current p-value (from p) were the threshold.
I wrote the function with ARRAYFUN, but it takes a lot of time for large N (2 min for N=20K), comparable to a for-loop.
function pfdr = fdr_from_random_permutations(p, pr)
%# ... skipping arguments checks
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
end
Any ideas how to make it faster?
Comments about statistical issues here are also welcome.
The test data can be generated as p = rand(N,1); pr = rand(N,M);.
Well, the trick was indeed sorting the vectors. I give credit to @EgonGeerardyn for that. Also, there is no need to use mean. You can just divide everything afterwards by M. When p is sorted, the number of values that are less than or equal to the current x is just a running index. pr is a more interesting case: I used a running index called place to find how many elements are less than or equal to x.
Edit(2): Here is the fastest version I came up with:
function Speedup2()
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1); pr = rand(N,M);
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
tic
out = zeros(numel(p),1);
[p,sortIndex] = sort(p);
pr = sort(pr(:));
pr(end+1) = Inf;
place = 1;
N = numel(pr);
for i=1:numel(p)
    x = p(i);
    while pr(place)<=x
        place = place+1;
    end
    exp1a = place-1;
    exp2 = i;
    out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
end
And the benchmark results for N = 10000/4 ; M = 100/4 :
Elapsed time is 0.898689 seconds.
Elapsed time is 0.007697 seconds.
and for N = 10000 ; M = 100 ;
Elapsed time is 39.730695 seconds.
Elapsed time is 0.088870 seconds.
First of all, try to analyze this using the profiler. Profiling should ALWAYS be the first step when trying to improve performance. We can all guess at what is causing your performance drop, but the only way to be sure and focus on the right part is to inspect the profiler report.
I didn't run the profiler on your code, as I don't want to generate test data to do so; but I have some ideas about what work is being carried out in vain. In your function mean(sum(pr<=x))./sum(p<=x), you are repeatedly summing over p<=x. All in all, one call includes N comparisons and N-1 summations. So for both, you have behavior that is quadratic in N when all N values of p are calculated.
If you step through a sorted version of p, you need less calculations and comparisons, as you can keep track of a running sum (i.e. behavior that is linear in N). I guess a similar method could be applied to the other part of the calculation.
The implementation of my idea as expressed above:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
[pr] = sort(pr(:));
pfdr = NaN(N,1);
parfor iP = 1:N
    x = p(iP);
    m = sum(pr<=x)/M;
    pfdr(iP) = m/iP;
end
pfdr(idxP) = pfdr;
end
If you have access to the parallel computing toolbox, the parfor loop will allow you to gain some performance. I used two basic ideas: mean(sum(pr<=x)) is actually equal to sum(pr(:)<=x)/M. On the other hand, since p is sorted, this allows you to just take the index as the number of elements (in the assumption that every element is unique, otherwise you'll have to work with unique to do the full rigorous analysis).
As you should already know very well by running the profiler yourself, the line m = sum(pr<=x)/M; is the main resource hog. This can be tackled similarly to p by making use of the sorted nature of pr.
I tested my code (both for identical results and for time consumption) against yours. For N=20e3; M=100, I get about 63 seconds to run your code and 43 seconds to run mine on my main computer (MATLAB 2011a on 64 bit Arch Linux, 8 GiB RAM, Core i7 860). For smaller values of M the gain is larger. But this gain is in part due to parallelization.
edit2: Apparently, I arrived at results very similar to Andrey's; my code would have looked much the same had I pursued the same approach.
However, I realised that there are some built-in functions that do more or less what you need, i.e. something quite similar to determining the empirical cumulative distribution function. This can be done by constructing the histogram:
function pfdr = fdr(p,pr)
[N, M] = size(pr);
[p, idxP] = sort(p);
count = histc(pr(:), [0; p]);
count = cumsum(count(1:N));
pfdr = count./(1:N).';
pfdr(idxP) = pfdr/M;
end
For the same M and N as above, this code takes 228 milliseconds on my computer. It takes 104 milliseconds for Andrey's parameters, so on my computer it turns out a bit slower, but I think this code is far more readable than intricate for loops (as was the case in both our examples).
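The histogram version can be checked against the original arrayfun expression on a small input. This is a minimal sketch with assumed sizes; ref, pfdr_hist and max_err are illustrative names. Two caveats: histc is marked as not recommended in recent MATLAB releases, and its bins are half-open, so the histogram counts values strictly below each threshold. The two results can therefore differ when pr contains values exactly equal to an element of p, which does not occur in practice with continuous random test data.

```matlab
rng(3);                               % fixed seed so the run is repeatable
N = 200;  M = 10;                     % assumed sizes, much smaller than in the question
p  = rand(N, 1);
pr = rand(N, M);

% Reference: the original O(N^2) expression from the question
ref = arrayfun(@(x) mean(sum(pr <= x)) ./ sum(p <= x), p);

% Histogram version: bin all permutation p-values between consecutive sorted thresholds
[ps, idxP] = sort(p);
count = histc(pr(:), [0; ps]);        % count(k): values in [edge(k), edge(k+1))
count = cumsum(count(1:N));           % count(k): permutation p-values below ps(k)
pfdr_hist = zeros(N, 1);
pfdr_hist(idxP) = count ./ (1:N).' / M;   % divide by rank and M, then undo the sort

max_err = max(abs(pfdr_hist - ref));
fprintf('Largest difference from the arrayfun version: %g\n', max_err);
```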
Following the discussion between me and Andrey in this question, this very late answer is just to prove to Andrey that vectorized solutions are still faster than JIT'ed loops, they sometimes just aren't as easy to find.
I am more than willing to remove this answer if it is deemed inappropriate by the OP.
Now, on to business, here's the original arrayfun, looped version by Andrey, and vectorized version by Egon:
function test
N = 10000/4 ;
M = 100/4 ;
p = rand(N,1);
pr = rand(N,M);
%% first option
tic
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
toc
%% second option
tic
out = zeros(numel(p),1);
[p2,sortIndex] = sort(p);
pr2 = sort(pr(:));
pr2(end+1) = Inf;
place = 1;
for i=1:numel(p2)
    x = p2(i);
    while pr2(place)<=x
        place = place+1;
    end
    exp1a = place-1;
    exp2 = i;
    out(i) = exp1a/exp2;
end
out(sortIndex) = out/ M;
toc
%% third option
tic
[p2,sortIndex] = sort(p);
count = histc(pr2(:), [0; p2]);
count = cumsum(count(1:N));
out = count./(1:N).';
out(sortIndex) = out/M;
toc
end
Results on my laptop:
Elapsed time is 0.916196 seconds.
Elapsed time is 0.011429 seconds.
Elapsed time is 0.007328 seconds.
and for N=10000; M = 100; :
Elapsed time is 38.082718 seconds.
Elapsed time is 0.127052 seconds.
Elapsed time is 0.042686 seconds.
So: vectorized is roughly 1.5-3 times faster than the loop.
