Vectorization or alternative to speed up MATLAB loop

Vectorization or alternative to speed up MATLAB loop - performance

I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to be able to increase especially N and vec1 by 1 or 2 orders of magnitude and in such a loop it will take quite longer to run.

Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.

Related

Huge memory allocation running a julia function?

I try to run the following function in julia command, but when timing the function I see too much memory allocations which I can't figure out why.
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
for m = 1:length(Pf)
i = 0
for k = 1:iters
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
energy_fin = (y'*y) / L
#inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
#inbounds Pd[m] = i/iters
end
#thresh = erfcinv(2Pf) * sqrt(2/L) + 1
#Pd_the = 0.5 * erfc(((thresh - (snr + 1)) * sqrt(L)) / (2*(snr + 1)))
end
Running that function in the julia command on my laptop, I get the following shocking numbers:
julia> #time pdpf(1000, 10000)
17.621551 seconds (9.00 M allocations: 30.294 GB, 7.10% gc time)
What is wrong with my code? Any help is appreciated.

I don't think this memory allocation is so surprising. For instance, consider all of the times that the inner loop gets executed:
for m = 1:length(Pf) this gives you 100 executions
for k = 1:iters this gives you 10,000 executions based on the arguments you supply to the function.
randn(L) this gives you a random vector of length 1,000, based on the arguments you supply to the function.
Thus, just considering these, you've got 100*10,000*1000 = 1 billion Float64 random numbers being generated. Each one of them takes 64 bits = 8 bytes. I.e. 8GB right there. And, you've got two calls to randn(L) which means that you're at 16GB allocations already.
You then have y = s + n which means another 8GB allocations, taking you up to 24GB. I haven't looked in detail on the remaining code to get you from 24GB to 30GB allocations, but this should show you that it's not hard for the GB allocations to start adding up in your code.
If you're looking at places to improve, I'll give you a hint that these lines can be improved by using the properties of normal random variables:
n = randn(L)
s = sqrt(snr) * randn(L)
y = s + n
You should easily be able to cut down the allocations here from 24GB to 8GB in this way. Note that y will be a normal random variable here as you've defined it, and think up a way to generate a normal random variable with an identical distribution to what y has now.
Another small thing, snr is a constant inside your function. Yet, you keep taking its sqrt 1 million separate times. In some settings, 'checking your work' can be helpful, but I think that you can be confident the computer will get it right the first time and thus you don't need to make it keep re-doing this calculation ; ). There are other similar places you can improve your code to avoid duplicate computations here that I'll leave to you to locate.

aireties gives a good answer for why you have so many allocations. You can do more to reduce the number of allocations. Using this property we know that y = s+n is really y = sqrt(snr) * randn(L) + randn(L) and so we can instead do y = rvvar*randn(L) where rvvar= sqrt(1+sqrt(snr)^2) is defined outside the loop (thanks for the fix!). This will halve the number of random variables needed.
Outside the loop you can save sqrt(2/L) to cut down a little bit of time.
I don't think transpose is special-cased yet, so try using dot(y,y) instead of y'*y. I know dot for sure is just a loop without having to transpose, while the other may transpose depending on the version of Julia.
Something that would help performance (but not allocations) would be to use one big randn(L,iters) and loop through that. The reason is because if you make all of your random numbers all at once it's faster since it can use SIMD and a bunch of other goodies. If you want to implicitly do that without changing your code much, you can use ChunkedArrays.jl where you can use rands = ChunkedArray(randn,L) to initialize it and then everytime you want a randn(L), you instead use next(rands). Inside the ChunkedArray it actually makes bigger vectors and replenishes them as needed, but like this you can just get your randn(L) without having to keep track of all of that.
Edit:
ChunkedArrays probably only save time when L is smaller. This gives the code:
function pdpf(L::Int64, iters::Int64)
snr_dB = -10
snr = 10^(snr_dB/10)
Pf = 0.01:0.01:1
thresh = rand(100)
Pd = rand(100)
rvvar= sqrt(1+sqrt(snr)^2)
for m = 1:length(Pf)
i = 0
for k = 1:iters
y = rvvar*randn(L)
energy_fin = (y'*y) / L
#inbounds thresh[m] = erfcinv(2Pf[m]) * sqrt(2/L) + 1
if energy_fin[1] >= thresh[m]
i += 1
end
end
#inbounds Pd[m] = i/iters
end
end
which runs in half the time as using two randn calls. Indeed from the ProfileViewer we get:
#profile pdpf(1000, 10000)
using ProfileView
ProfileView.view()
I circled the two parts for the line y = rvvar*randn(L), so the vast majority of the time is random number generation. Last time I checked you could still get a decent speedup on random number generation by changing to to VSL.jl library, but you need MKL linked to your Julia build. Note that from the Google Summer of Code page you can see that there is a project to make a repo RNG.jl with faster psudo-rngs. It looks like it already has a few new ones implemented. You may want to check them out and see if they give speedups (or help out with that project!)

Strange observation about timing comparison between Julia and Matlab

The goal of this experiment is to compare speed of Matlab and Julia with a small piece of code below.
First the Matlab code:
>> t = 5000; n = 10000; x = 1:t*n;
>> x = reshape(x, t, n);
>> tic(); y1 = sum(x(:) .* x(:)); toc()
Elapsed time is 0.229563 seconds.
>> y1
y1 =
4.1667e+22
>> tic(); y2 = trace(x * x'); toc()
Elapsed time is 15.332694 seconds.
>> y2
y2 =
4.1667e+22
Versus in Julia
julia> t = 5000; n = 10000; x = 1: t*n;
julia> x = reshape(x, t, n);
julia> tic(); y1 = sum(x[:].* x[:]); toc();
elapsed time: 1.235170533 seconds
julia> y1
-4526945843202100544
julia> tic();y2 = trace(x*x'); toc();
The second one did not finish the job in more than 1 minutes. So what is the matter here with Julia? This piece of code happen to run both slower and overflow in Julia? Is there any problem in my style? I think one reason to convert from Matlab to Julia is the speed, and I used to think that Julia handles big number arithmetic operations by default. Now, it looks like these are not correct. Can someone explain?

There are a couple of things going on here.
Firstly, unlike Matlab, the x is an array of machine integers, not floating point values. This appears to be the main difference in speed, as it is unable to use BLAS routines for linear algebra.
You need to do either
x = 1.0:t*n
or explicitly convert it via
x = float(x)
This is also the reason for a different answer: Julia uses machine arithmetic, which for integers will wrap around on overflow (hence the negative number). You won't have this problem with floating point values.
(Julia does have arbitrary-precision integers, but doesn't promote by default: instead you would need to promote them yourself via big(n))
This will give you some speed up, but you can do better. Julia does require a slightly different programming style than Matlab. See the links that Isaiah provided in the comments above.
In regards to your specific examples
sum(x[:].* x[:])
This is slow as it creates 3 intermediate vectors (2 copies of x[:], though hopefully this will change in future, and their product).
Similarly,
trace(x*x')
computes an intermediate matrix (most of which is not used in the trace).
My suggestion would be to use
dot(vec(x),vec(x))
vec(x) just reshapes x into a vector (so no copying), and dot is the usual sum-product. You could also "roll-your-own":
function test(x)
s = zero(eltype(x)) # this prevents type-instability
for xi in x
s += xi*xi
end
s
end
test(x)
(this needs to be in a function for the JIT compiler to work its magic). It should be reasonably fast, though probably still not as fast as dot, which uses BLAS calls underneath.

Efficient replacement for ppval

I have a loop in which I use ppval to evaluate a set of values from a piecewise polynomial spline. The interpolation is easily the most time consuming part of the loop and I am looking for a way improve the function's efficiency.
More specifically, I'm using a finite difference scheme to calculate transient temperature distributions in friction welds. To do this I need to recalculate the material properties (as a function of temperature and position) at each time step. The rate limiting factor is the interpolation of these values. I could use an alternate finite difference scheme (less restrictive in the time domain) but would rather stick with what I have if at all possible.
I've included a MWE below:
x=0:.1:10;
y=sin(x);
pp=spline(x,y);
tic
for n=1:10000
x_int=10*rand(1000,1);
y_int=ppval(pp,x_int);
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.265442 seconds.
Edit - I should probably mention that I would be more than happy with a simple linear interpolation between values but the interp1 function is slower than ppval
x=0:.1:10;
y=sin(x);
tic
for n=1:10000
x_int=10*rand(1000,1);
y_int=interp1(x,y,x_int,'linear');
end
toc
plot(x,y,x_int,y_int,'*') % plot for sanity of data
Elapsed time is 1.957256 seconds.

This is slow, because you're running into the single most annoying limitation of JIT. It's the cause of many many many oh so many questions in the MATLAB tag here on SO:
MATLAB's JIT accelerator cannot accelerate loops that call non-builtin functions.
Both ppval and interp1 are not built in (check with type ppval or edit interp1). Their implementation is not particularly slow, they just aren't fast when placed in a loop.
Now I have the impression it's getting better in more recent versions of MATLAB, but there are still quite massive differences between "inlined" and "non-inlined" loops. Why their JIT doesn't automate this task by simply recursing into non-builtins, I really have no idea.
Anyway, to fix this, you should copy-paste the essence of what happens in ppval into the loop body:
% Example data
x = 0:.1:10;
y = sin(x);
pp = spline(x,y);
% Your original version
tic
for n = 1:10000
x_int = 10*rand(1000,1);
y_int = ppval(pp, x_int);
end
toc
% "inlined" version
tic
br = pp.breaks.';
cf = pp.coefs;
for n = 1:10000
x_int = 10*rand(1000,1);
[~, inds] = histc(x_int, [-inf; br(2:end-1); +inf]);
x_shf = x_int - br(inds);
zero = ones(size(x_shf));
one = x_shf;
two = one .* x_shf;
three = two .* x_shf;
y_int = sum( [three two one zero] .* cf(inds,:), 2);
end
toc
Profiler:
Results on my crappy machine:
Elapsed time is 2.764317 seconds. % ppval
Elapsed time is 1.695324 seconds. % "inlined" version
The difference is actually less than what I expected, but I think that's mostly due to the sum() -- for this ppval case, I usually only need to evaluate a single site per iteration, which you can do without histc (but with simple vectorized code) and matrix/vector multiplication x*y (BLAS) instead of sum(x.*y) (fast, but not BLAS-fast).
Oh well, a ~60% reduction is not bad :)

It is a bit surprising that interp1 is slower than ppval, but having a quick look at its source code, it seems that it has to check for many special cases and has to loop over all the points since it it cannot be sure if the step-size is constant.
I didn't check the timing, but I guess you can speed up the linear interpolation by a lot if you can guarantee that steps in x of your table are constant, and that the values to be interpolated are stricktly within the given range, so that you do not have to do any checking. In that case, linear interpolation can be converted to a simple lookup problem like so:
%data to be interpolated, on grid with constant step
x = 0:0.5:10;
y = sin(x);
x_int = 0:0.1:9.9;
%make sure it is interpolation, not extrapolation
assert(all(x(1) <= x_int & x_int < x(end)));
% compute mapping, this can be precomputed for constant grid
slope = (length(x) - 1) / (x(end) - x(1));
offset = 1 - slope*x(1);
%map x_int to interval 1..lenght(i)
xmapped = offset + slope * x_int;
ind = floor(xmapped);
frac = xmapped - ind;
%interpolate by taking weighted sum of neighbouring points
y_int = y(ind) .* (1 - frac) + y(ind+1) .* frac;
% make plot to check correctness
plot(x, y, 'o-', x_int, y_int, '.')

a faster way of implementing the nested loop with gamma function

I am trying to evaluate the following integral:
I can find the area for the following polynomial as follows:
pn =
-0.0250 0.0667 0.2500 -0.6000 0
First using the integration by Simpson's rule
fn=#(x) exp(polyval(pn,x));
area=quad(fn,-10,10);
fprintf('area evaluated by Simpsons rule : %f \n',area)
and the result is area evaluated by Simpsons rule : 11.483072
Then with the following code that evaluates the summation in the above formula with gamma function
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
for n=0:40;
for m=0:40;
for p=0:40;
if(rem(n+p,2)==0)
result=result+ (b^n * c^m * d^p) / ( factorial(n)*factorial(m)*factorial(p) ) *...
gamma( (3*n+2*m+p+1)/4 ) / (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
end
result=result*1/2*exp(f)
and this returns 11.4831. More or less the same result with the quad function. Now my question is whether or not it is possible for me to get rid of this nested loop as I will construct the cumulative distribution function so that I can get samples from this distribution using the inverse CDF transform. (for constructing the cdf I will use gammainc i.e. the incomplete gamma function instead of gamma)
I will need to sample from such densities that may have different polynomial coefficients and speed is of concern to me. I can already sample from such densities using Monte Carlo methods but I would like to see whether or not it is possible for me to use exact sampling from the density in order to speed up.
Thank you very much in advance.

There are several things one might do. The simplest is to avoid calling factorial. Instead one can use the relation that
factorial(n) = gamma(n+1)
Since gamma seems to be actually faster than a call to factorial, you can save a bit there. Even better, you can
>> timeit(#() factorial(40))
ans =
4.28681157826087e-05
>> timeit(#() gamma(41))
ans =
2.06671024634146e-05
>> timeit(#() gammaln(41))
ans =
2.17632543333333e-05
Even better, one can do all 4 calls in a single call to gammaln. For example, think about what this does:
gammaln([(3*n+2*m+p+1)/4,n+1,m+1,p+1])*[1 -1 -1 -1]'
Note that this call has no problem with overflows either in case your numbers get large enough. And since gammln is vectorized, that one call is fast. It costs little more time to compute 4 values than it does to compute one.
>> timeit(#() gammaln([15 20 40 30]))
ans =
2.73937416896552e-05
>> timeit(#() gammaln(40))
ans =
2.46521943333333e-05
Admittedly, if you use gammaln, you will need a call to exp at the end to recover the final result. You could do it with a single call to gamma however too. Perhaps like this:
g = gamma([(3*n+2*m+p+1)/4,n+1,m+1,p+1]);
g = g(1)/(g(2)*g(3)*g(4));
Next, you can be more creative in the inner loop on p. Rather than a full loop, coupled with a test to ignore the combinations you don't need, why not just do this?
for p=mod(n,2):2:40
That statement will select only those values of p that would have been used anyway, so now you can drop the if statement completely.
All of the above will give you what I'll guess is about a 5x speed increase in your loops. But it still has a set of nested loops. With some effort, you might be able to improve that too.
For example, rather than computing all of those factorials (or gamma functions) many times, do it ONCE. This should work:
a=pn(1);b=pn(2);c=pn(3);d=pn(4);f=pn(5);
area=0;
result=0;
nlim = 40;
facts = factorial(0:nlim);
gammas = gamma((0:(6*nlim+1))/4);
for n=0:nlim
for m=0:nlim
for p=mod(n,2):2:nlim
result = result + (b.^n * c.^m * d.^p) ...
.*gammas(3*n+2*m+p+1 + 1) ...
./ (facts(n+1).*facts(m+1).*facts(p+1)) ...
./ (-a)^( (3*n+2*m+p+1)/4 );
end
end
end
result=result*1/2*exp(f)
In my test on my machine, I find that your triply nested loops required 4.3 seconds to run. My version above produces the same result, yet required only 0.028418 seconds, a speedup of roughly 150 to 1, despite the triply nested loops.

Well, without even making changes to your code you could install an excellent package from Tom Minka at Microsoft called lightspeed which replaces some built-in matlab functions with much faster versions. I know there's a replacement for gammaln().
You'll get nontrivial speed improvements, though I'm not sure how much, and it's straight-forward to install.

How to write vectorized functions in MATLAB

I am just learning MATLAB and I find it hard to understand the performance factors of loops vs vectorized functions.
In my previous question: Nested for loops extremely slow in MATLAB (preallocated) I realized that using a vectorized function vs. 4 nested loops made a 7x times difference in running time.
In that example instead of looping through all dimensions of a 4 dimensional array and calculating median for each vector, it was much cleaner and faster to just call median(stack, n) where n meant the working dimension of the median function.
But median is just a very easy example and I was just lucky that it had this dimension parameter implemented.
My question is that how do you write a function yourself which works as efficiently as one which has this dimension range implemented?
For example you have a function my_median_1D which only works on a 1-D vector and returns a number.
How do you write a function my_median_nD which acts like MATLAB's median, by taking an n-dimensional array and a "working dimension" parameter?
Update
I found the code for calculating median in higher dimensions
% In all other cases, use linear indexing to determine exact location
% of medians. Use linear indices to extract medians, then reshape at
% end to appropriate size.
cumSize = cumprod(s);
total = cumSize(end); % Equivalent to NUMEL(x)
numMedians = total / nCompare;
numConseq = cumSize(dim - 1); % Number of consecutive indices
increment = cumSize(dim); % Gap between runs of indices
ixMedians = 1;
y = repmat(x(1),numMedians,1); % Preallocate appropriate type
% Nested FOR loop tracks down medians by their indices.
for seqIndex = 1:increment:total
for consIndex = half*numConseq:(half+1)*numConseq-1
absIndex = seqIndex + consIndex;
y(ixMedians) = x(absIndex);
ixMedians = ixMedians + 1;
end
end
% Average in second value if n is even
if 2*half == nCompare
ixMedians = 1;
for seqIndex = 1:increment:total
for consIndex = (half-1)*numConseq:half*numConseq-1
absIndex = seqIndex + consIndex;
y(ixMedians) = meanof(x(absIndex),y(ixMedians));
ixMedians = ixMedians + 1;
end
end
end
% Check last indices for NaN
ixMedians = 1;
for seqIndex = 1:increment:total
for consIndex = (nCompare-1)*numConseq:nCompare*numConseq-1
absIndex = seqIndex + consIndex;
if isnan(x(absIndex))
y(ixMedians) = NaN;
end
ixMedians = ixMedians + 1;
end
end
Could you explain to me that why is this code so effective compared to the simple nested loops? It has nested loops just like the other function.
I don't understand how could it be 7x times faster and also, that why is it so complicated.
Update 2
I realized that using median was not a good example as it is a complicated function itself requiring sorting of the array or other neat tricks. I re-did the tests with mean instead and the results are even more crazy:
19 seconds vs 0.12 seconds.
It means that the built in way for sum is 160 times faster than the nested loops.
It is really hard for me to understand how can an industry leading language have such an extreme performance difference based on the programming style, but I see the points mentioned in the answers below.

Update 2 (to address your updated question)
MATLAB is optimized to work well with arrays. Once you get used to it, it is actually really nice to just have to type one line and have MATLAB do the full 4D looping stuff itself without having to worry about it. MATLAB is often used for prototyping / one-off calculations, so it makes sense to save time for the person coding, and giving up some of C[++|#]'s flexibility.
This is why MATLAB internally does some loops really well - often by coding them as a compiled function.
The code snippet you give doesn't really contain the relevant line of code which does the main work, namely
% Sort along given dimension
x = sort(x,dim);
In other words, the code you show only needs to access the median values by their correct index in the now-sorted multi-dimensional array x (which doesn't take much time). The actual work accessing all array elements was done by sort, which is a built-in (i.e. compiled and highly optimized) function.
Original answer (about how to built your own fast functions working on arrays)
There are actually quite a few built-ins that take a dimension parameter: min(stack, [], n), max(stack, [], n), mean(stack, n), std(stack, [], n), median(stack,n), sum(stack, n)... together with the fact that other built-in functions like exp(), sin() automatically work on each element of your whole array (i.e. sin(stack) automatically does four nested loops for you if stack is 4D), you can built up a lot of functions that you might need just be relying on the existing built-ins.
If this is not enough for a particular case you should have a look at repmat, bsxfun, arrayfun and accumarray which are very powerful functions for doing things "the MATLAB way". Just search on SO for questions (or rather answers) using one of these, I learned a lot about MATLABs strong points that way.
As an example, say you wanted to implement the p-norm of stack along dimension n, you could write
function result=pnorm(stack, p, n)
result=sum(stack.^p,n)^(1/p);
... where you effectively reuse the "which-dimension-capability" of sum.
Update
As Max points out in the comments, also have a look at the colon operator (:) which is a very powerful tool for selecting elements from an array (or even changing it shape, which is more generally done with reshape).
In general, have a look at the section Array Operations in the help - it contains repmat et al. mentioned above, but also cumsum and some more obscure helper functions which you should use as building blocks.

Vectorization
In addition to whats already been said, you should also understand that vectorization involves parallelization, i.e. performing concurrent operations on data as opposed to sequential execution (think SIMD instructions), and even taking advantage of threads and multiprocessors in some cases...
MEX-files
Now although the "interpreted vs. compiled" point has already been argued, no one mentioned that you can extend MATLAB by writing MEX-files, which are compiled executables written in C, that can be called directly as normal function from inside MATLAB. This allows you to implement performance-critical parts using a lower-level language like C.
Column-major order
Finally, when trying to optimize some code, always remember that MATLAB stores matrices in column-major order. Accessing elements in that order can yield significant improvements compared to other arbitrary orders.
For example, in your previous linked question, you were computing the median of set of stacked images along some dimension. Now the order in which those dimensions are ordered greatly affect the performance. Illustration:
%# sequence of 10 images
fPath = fullfile(matlabroot,'toolbox','images','imdemos');
files = dir( fullfile(fPath,'AT3_1m4_*.tif') );
files = strcat(fPath,{filesep},{files.name}'); %'
I = imread( files{1} );
%# stacked images along the 1st dimension: [numImages H W RGB]
stack1 = zeros([numel(files) size(I) 3], class(I));
for i=1:numel(files)
I = imread( files{i} );
stack1(i,:,:,:) = repmat(I, [1 1 3]); %# grayscale to RGB
end
%# stacked images along the 4th dimension: [H W RGB numImages]
stack4 = permute(stack1, [2 3 4 1]);
%# compute median image from each of these two stacks
tic, m1 = squeeze( median(stack1,1) ); toc
tic, m4 = median(stack4,4); toc
isequal(m1,m4)
The timing difference was huge:
Elapsed time is 0.257551 seconds. %# stack1
Elapsed time is 17.405075 seconds. %# stack4

Could you explain to me that why is this code so effective compared to the simple nested loops? It has nested loops just like the other function.
The problem with nested loops is not the nested loops themselves. It's the operations you perform inside.
Each function call (especially to a non-built-in function) generates a little bit of overhead; more so if the function performs e.g. error checking that takes the same amount of time regardless of input size. Thus, if a function has only a 1 ms overhead, if you call it 1000 times, you will have wasted a second. If you can call it once to perform a vectorized calculation, you pay overhead only once.
Furthermore, the JIT compiler (pdf) can help vectorize simple for-loops, where you, for example, only perform basic arithmetic operations. Thus, the loops with simple calculations in your post are sped up by a lot, while the loops calling median are not.

In this case
M = median(A,dim) returns the median values for elements along the dimension of A specified by scalar dim
But with a general function you can try splitting your array with mat2cell (which can work with n-D arrays and not just matrices) and applying your my_median_1D function through cellfun. Below I will use median as an example to show that you get equivalent results, but instead you can pass it any function defined in an m-file, or an anonymous function defined with the #(args) notation.
>> testarr = [[1 2 3]' [4 5 6]']
testarr =
1 4
2 5
3 6
>> median(testarr,2)
ans =
2.5000
3.5000
4.5000
>> shape = size(testarr)
shape =
3 2
>> cellfun(#median,mat2cell(testarr,repmat(1,1,shape(1)),[shape(2)]))
ans =
2.5000
3.5000
4.5000

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio