Accelerator restriction: unsupported operation: RSQRTSS

Accelerator restriction: unsupported operation: RSQRTSS - openacc

I have a simple nbody implementation code and try to compile it for launching on NVIDIA GPUs (Tesla K20m/Geforce GTX 650 Ti). I use the following compiler options:
-Minfo=all -acc -Minline -Mfpapprox -ta=tesla:cc35/nvidia
Everything works without -Mfpapprox, but when I use it, the compilation fails with the following output:
346, Accelerator restriction: unsupported operation: RSQRTSS
The 346 line writes as:
float rdistance=1.0f/sqrtf(drSquared);
where
float drSquared=dx*dx+dy*dy+dz*dz+softening;
and dx, dy, dz are float values. This line is inside the #pragma acc parallel loop independent for() construction.
What is the problem with -Mfpapprox?

-Mfpapprox tells the compiler to use very low-precision CPU instructions to approximate DIV or SQRT. These instructions are not supported on the GPU. The GPU SQRT is both fast and precise so no need for a low-precision version.
Actually even on the CPU, I'd recommend you not use -Mfpapprox unless you really understand the mathematics of your code and it can handle a high degree of imprecision (as much as 5-6 bits or ~20Ulps off). We added this flag about 10 years ago since at the time the CPUs divide operation was very expensive. However, CPU performance for divide has greatly improved since then (as has sqrt) so you're generally better off not sacrificing precision for the little bit of speed-up you might get from this flag.
I'll put in an issue report requesting that the compiler ignore -Mfpapprox for GPU code so you wont see this error.

Related

What could be the causes of this performance regression, and how to investigate it?

Context
I'm writing some high-performance code for ARM64 using NEON SIMD instructions, which I am trying to further optimize. I only use integer operations, no floating-point. This code is fully CPU- or memory-bound: it does not perform system calls or I/O of any kind (filesystem, networking, or anything else). The code is single-threaded by design -- any parallelism should be handled by calling the code from different CPUs with different arguments. The data working set should be small enough to fit in my CPU's L1 D-cache, and if it overflows a little, it will definitely fit in L2 with lots of space to spare.
My development environment is an Apple laptop with the M1 processor, running macOS; as such, the prime choice for a performance investigation tool is Apple's Instruments. I know VTune has some more advanced features such as top-down microarchitecture analysis, but evidently this isn't available for ARM.
The problem
I had an idea that, at a high level, works like this: a certain function f(x, y) can be broken down into two functions g() and h(). I can calculate x2 = g(x), y2 = g(y) and then h(x2, y2), obtaining the same result as f(x, y). However, it turns out that I compute f() many times with different combinations of the same input arguments. By applying all these inputs to g() and caching their outputs, I can directly call the output of h()with these cached values and save some time recomputing theg()-part of f()`.
Benchmarks
I confirmed the basic idea is sound by microbenchmarking with Google Benchmark. If f() takes 100 X (where X is some arbitrary unit of time), then each call to g() takes 14 X, and a call to h() takes 78 X. While it's longer to call g() twice then h() rather than f(), suppose I need to compute f(x, y) and f(x, z), which would ordinarily take 200 X. I can instead compute x2 = g(x), y2 = g(y) and z2 = g(z), taking 3*14 = 42 X, and then h(x2, y2) and h(x2, z2), taking 2*78 = 156 X. In total, I spend 156 + 42 = 198 X, which is already less than 200 X, and the savings would add up for larger examples, up to maximum of 22%, since this is how much less h() costs compared to f() (assuming I compute h() much more often than g()). This would represent a significant speedup for my application.
I proceeded to test this idea on a more realistic example: I have some code which does a bunch of things, plus 3 calls to f() which, among themselves, use combinations of the same 2 arguments. So, I replace 3 calls to f() by 2 calls to g() and 3 calls to h(). The benchmarks above indicate this should reduce execution time by 3*100 - 2*14 - 3*78 = 38 X. However, benchmarking the modified code shows that execution time increases by ~700 X!
I tried replacing each call to f() individually with 2 calls to g() for its arguments and a call to h(). This should increase execution time by 2*14 + 78 - 100 = 6 X, but instead, execution time increases by 230 X (not coincidentally, approximately 1/3 of 700 X).
Performance counter results using Apple Instruments
To bring some data to the discussion, I ran both codes under Apple Instruments using the CPU counters template, monitoring some performance counters I thought might be relevant.
For reference, the original code executes in 7.6 seconds (considering only number of iterations times execution time per iteration, i.e. disregarding Google Benchmark overhead), whereas the new code executes in 9.4 seconds; i.e. a difference of 1.8 seconds. Both versions use the exact same number of iterations and work on the same input, producing the same output. The code runs on the M1's performance core, which I assume is running at its maximum 3.2 GHz clock speed.
Parameter
Original code
New code
Total cycles
22,199,155,777
27,510,276,704
MAP_DISPATCH_BUBBLE
78,611,658
6,438,255,204
L1D_CACHE_MISS_LD
892,442
1,808,341
L1D_CACHE_MISS_ST
2,163,402
4,830,661
L1I_CACHE_MISS_DEMAND
2,620,793
7,698,674
INST_SIMD_ALU
79,448,291,331
78,253,076,740
INST_SIMD_LD
17,254,640,147
16,867,679,279
INST_SIMD_ST
14,169,912,790
14,029,275,120
INST_INT_ALU
4,512,600,211
4,379,585,445
INST_INT_LD
550,965,752
546,134,341
INST_INT_ST
455,541,070
455,298,056
INST_ALL
119,683,934,968
118,972,558,207
MAP_STALL_DISPATCH
6,307,551,337
5,470,291,508
SCHEDULE_UOP
116,252,941,232
113,882,670,763
MAP_REWIND
16,293,616
11,787,119
FLUSH_RESTART_OTHER_NONSPEC
58,616
90,955
FETCH_RESTART
27,417,457
28,119,690
BRANCH_MISPRED_NONSPEC
432,761
465,697
L1I_TLB_MISS_DEMAND
754,161
1,492,705
L2_TLB_MISS_INSTRUCTION
485,702
1,217,474
MMU_TABLE_WALK_INSTRUCTION
486,812
1,219,082
BRANCH_MISPRED_NONSPEC
377,750
440,382
INST_BRANCH
1,194,614,553
1,151,040,641
Instruments won't let me add all these counters to the same run, so some results are from different runs. However, since the code is fully deterministic and runs the same number of iterations, any differences between runs should be just random noise.
EDIT: playing around with Instruments, I found one performance counter that has wildly differing values between the original code and the new code, which is MAP_DISPATCH_BUBBLE. Still doing research on what it means, whether it might explain the issues I'm seeing, and how to work around this.
EDIT 2: I decided to test this code on other ARM processors I have access to (Cortex-X2 and Cortex-A72). On the Cortex-X2, both versions perform identically, and on the Cortex-A72, there was a small (~1.5%) increase in performance with the new code. So I'm more inclined than ever to believe that I hit an M1 front-end bottleneck.
Hypotheses and data analysis
Having faced previous performance problems with this code base before, some ideas sprung to mind:
Memory alignment: SIMD code is sometimes sensitive to memory alignment, particularly for memory-bound code, which I suspect my code may be. However, adding or removing __attribute__((aligned(64))) made no difference, so I don't think that's it.
D-cache misses: the new code allocates some new arrays to cache the output of g(), so it might lead to more cache misses. And indeed there are 3.6 million more L1 D-cache misses (load + store) than the original code. However, as I've mentioned at the beginning, the working set easily fits into L2. Assuming a 10-cycle L2 cache miss cost, that's only 36 million cycles. At 3.2 GHz, that's just 1.1 ms, i.e. < 0.1% of the observed performance difference.
I-cache misses: a similar situation: there's an extra 5.1 million L1 I-cache misses, but at a 10-cycle cost, we're looking at 1.6 ms, again < 0.1% of the observed performance difference.
Inlining/unrolling: I employ aggressive inlining and loop unrolling on my code, as well as LTO and unity builds, since performance is the #1 priority and code size is irrelevant (unless it affects performance via e.g. I-cache misses). I considered the possibility that the new code might be inlining/unrolling less aggressively due to the compiler hitting some kind of heuristic for maximum code size. This might result in more instructions being executed, such as compares/branches for loops, and CALL/RET and function prologues/epilogues for function call. However, the table shows that the new code executes a bit fewer instructions of each kind (as I would expect), and of course, in total (INST_ALL).
Somehow, the original code simply achieves a higher IPC, and I have no idea why. Also, to be clear: both codes perform the same operation using the same algorithm. What I did was to basically the code for f() (a bunch of function calls to other subroutines) between g() and h().
The question
This brings me to my question: what could possibly be making the new code run slower than the old code? What other performance counters could I look at in Instruments to give me insight into this issue?
Beyond answers to this specific question, I'm looking for general advice on how to approach similar problems like this in the future. I've found some books about debugging performance problems, but they generally fall into two camps. The first just describes the profiling process I'm familiar with: find out which functions take the longest to execute and optimize them. The second is represented by books like Systems Performance: Enterprise and the Cloud and The Every Computer Performance Book, and is closer to what I'm looking for. However, they look at system-level issues like I/O, kernel calls, etc.; the kind of code I write is CPU- and maybe memory-bound, with many opportunities to convert to SIMD, and no interaction with the outside world. Basically, I'd like to know how to design meaningful experiments using a profiler and CPU performance counters (cycle counters, cache misses, instructions executed by type such as ALU, memory, etc.) to solve these kinds of performance issues with my code when they arise.

How to analyze the instructions pipelining on Zen4 for AVX-512 packed double computations? (backend bound)

I got access to the AMD Zen4 server and tested AVX-512 packed double performance. I chose Harmonic Series Sum[1/n over positive integers] and compared the performance using standard doubles, AVX2 (4 packed doubles) and AVX-512 (8 packed doubles). The test code is here.
AVX-256 version runs four times faster than the standard double version. I was expecting the AVX-512 version to run two times faster than the AVX-256 version, but there was barely any improvement in runtimes:
Method Runtime (minutes:seconds)
HarmonicSeriesPlain 0:41.33
HarmonicSeriesAVX256 0:10.32
HarmonicSeriesAVX512 0:09.82
I was scratching my head over the results and tested individual operations. See full results. Here is runtime for the division:
Method Runtime (minutes:seconds)
div_plain 1:53.80
div_avx256f 0:28.47
div_avx512f 0:14.25
Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete. HarmonicSeriesAVX256 is doing more operations than div_avx256f - summing up the results and increasing the denominator each time (the number of packed divisions is the same). The speed-up has to be due to the instructions pipelining.
However, I need help finding out more details.
The analysis with the llvm-mca (LLVM Machine Code Analyzer) fails because it does not support Zen4 yet:
gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence.
note: instruction: vdivpd %zmm0, %zmm4, %zmm2
On the Intel platform, I would use
perf stat -M pipeline binary
to find more details, but this metricgroup is not available on Zen4. Any more suggestions on how to analyze the instructions pipelining on Zen4? I have tried these perf stat events:
cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,
fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3
and got the results here.
From this I can see, that the workload is backed bound. AMD's performance event fp_ret_sse_avx_ops.all ( the number of retired SSE/AVX operations) helps, but I still want to get better insights into instructions pipelining on Zen4. Any tips?

Zen 4 execution units are mostly 256-bit wide; handling a 512-bit uop occupies it for 2 cycles. It's normal that 512-bit vectors don't have more raw throughput for any math instructions in general on Zen 4. Although using them on Zen4 does mean more work per uop so out-of-order exec has an easier time.
Or in the case of division, they're occupied for longer since division isn't fully pipelined, like on all modern CPUs. Division is hard to implement.
On Intel Ice Lake for example, divpd throughput is 2 doubles per 4 clocks whether you're using 128-bit, 256-bit, or 512-bit vectors. 512-bit takes extra uops, so we can infer that the actual divider execution unit is 256-bit wide in Ice Lake, but that divpd xmm can use the two halves of it independently. (Unlike AMD).
https://agner.org/optimize/ has instructing timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). https://uops.info/ also has good automated microbenchmark results, free from typos and other human error except sometimes in choosing what to benchmark. (But the actual instruction sequences tested are available, so you can check what they actually tested.) Unfortunately they don't yet have Zen 4 results up, only up to Zen 3.
Zen4 has execution units 256-bit wide for the most part, so 512-bit instructions are single uop but take 2 cycles on most execution units. (Unlike Zen1 where they took 2 uops and thus hurt OoO exec). And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions for 256-bit vector width, which is where a lot of the real value is. (Better shuffles, masking, vpternlogd, vector popcount, etc.)
Division isn't fully pipelined on any modern x86 CPU. Even on Intel CPUs 512-bit vdivpd zmm has about the same doubles-per-clock throughput as vdivpd ymm (Floating point division vs floating point multiplication has some older data on the YMM vs. XMM situation which is similar, although Zen4 apparently can't send different XMM vectors through the halves of its 256-bit-wide divide unit; vdivpd xmm has the same instruction throughput as vdivpd ymm)
Fast-reciprocal + Newton iterations
For something that's almost entirely bottlenecked on division throughput (not front-end or other ports), you might consider approximate-reciprocal with a Newton-Raphson iteration or two to refine the accuracy to close to 1 ulp. (Not quite the 0.5 ulp you'd get from exact division).
AVX-512 has vrcp14pd approx-reciprocal for packed-double. So two rounds of Newton iterations should double the number of correct bits each time, to 28 then 56 (which is more than the 53-bit mantissa of a double). Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision mostly talks about rsqrt, but similar idea.
SSE/AVX1 only had single-precision versions of the fast-reciprocal and rsqrt instructions, with only 12-bit precision. e.g. rcpps.
AVX-512ER has 28-bit precision versions, but only Xeon Phi ever had those; mainstream CPUs haven't included them. (Xeon Phi had very vdivps / pd exact division, so it was much better to use the reciprocals.)

I got the answer for the question from title: How to analyze the instructions pipelining on Zen4? directly from AMD:
For determining if a workload is backend-bound, the recommended
method on Zen 4 is to use the pipeline utilization metrics. We are
the process of providing similar metrics and metric groups through
the perf JSON event files for Zen 4 and they will be out very soon.
Read more details in this email thread
AMD has already posted the patches.
Before the patches land in favorite Linux distribution, you can use the raw events on Zen4. Check this example

Does the using of SIMD load main CPU registers?

Let's imagine we have software developer that's goal is achieve absolute maximum of CPU's performance.
In today's CPUs we have many cores, we can load data in cache for faster processing and we also have SIMD instructions (AVX for example) that allow us to sum\multiply\do other ops with array of items (multiply 8 integers per one CPU clock). The disadvantage of this instruction is the cost of sending data & instructions to SIMD module + overhead of converting vector type to primitive types (sorry I familiar only with C#'s Vector) (We not looling on code complexety for now).
As far as I understand, while we using SIMD, main registers of CPU used only for sending and recieving data to this registers and main ALU blocks used for general purpose calculations are idle at this time.
And here is my question - will using of SIMD instructions load main CPU blocks? For example if we have huge amount of different calculations (let's imagine 40% of them are best to run on SIMD and 60% of them are better to run as a usual), will SIMD allow us to gain performance boost in this way: 100% of all cores performace + n% of SIMD's boost performance?
I'm asking this question because of for example with GPGPU we can use GPU for parallel calculations and CPU used in this case only for sending and recieving data, so it's idle all the time and we can utilize it's performance for sensitive for latency tasks.

Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf

Adding a redundant assignment speeds up code when compiled without optimization

I find an interesting phenomenon:
#include<stdio.h>
#include<time.h>
int main() {
int p, q;
clock_t s,e;
s=clock();
for(int i = 1; i < 1000; i++){
for(int j = 1; j < 1000; j++){
for(int k = 1; k < 1000; k++){
p = i + j * k;
q = p; //Removing this line can increase running time.
}
}
}
e = clock();
double t = (double)(e - s) / CLOCKS_PER_SEC;
printf("%lf\n", t);
return 0;
}
I use GCC 7.3.0 on i5-5257U Mac OS to compile the code without any optimization. Here is the average run time over 10 times:
There are also other people who test the case on other Intel platforms and get the same result.
I post the assembly generated by GCC here. The only difference between two assembly codes is that before addl $1, -12(%rbp) the faster one has two more operations:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. The tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) shows an opposite result which supports that store-forwarding speedup is specific to some Intel CPU.
And BeeOnRope's comment and advice drives me to rewrite the question. :)
The core of this question is the interesting phenomenon which is related to processor architecture and assembly. So I think it may be worth to be discussed.

TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time-optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs and maybe make things worse on some other CPUs. See discussion in comments

Matlab relational operator performance in presence of NaN-values

When testing the following code (notice the *NaN in the second fragment)
tic
x = zeros(1,5000000);
for i=1:10
selector = x > 1;
end
toc
tic
x = zeros(1,5000000)*NaN;
for i=1:10
selector = x > 1;
end
toc
on Matlab revisions
R2012a 64-bit
R2013a 32-bit
I observe the following odd behavior
R2012a 64-bit
Elapsed time is 0.056266 seconds.
Elapsed time is 0.059677 seconds.
R2013a 32-bit
Elapsed time is 0.070116 seconds.
Elapsed time is 3.995697 seconds.
So in case of R2013a 32-bit the presence of NaN values drastically increases runtime. Can anyone give me a hint where this might be comming from?
Best regards,
Thomas

You are using Intel CPU, and of that, for 32-bit code, you are using it's FPU. It is awfully slow with NaN, Inf and denormals and this is an old story. Good news SSE unit is slow with denormals only and handles NaNs at full speed, so if you can convince your compiler to emit SSE code, you should be up to full speed. This is done automatically for x64, because it implies SSE2 and the ABI uses SSE registers, but since x32 floating point ABI uses FPU registers, the FPU is used for doing the calculations to avoid moving things around too much.
I did not dig deeper (we use embedded platforms and not all of them have SSE as of now), but I suspect changing some compiler/flags should help. If it does, checking how things are inlined would be in order to see if you have that SSE-to-FPU-and-back on each function call. If it's a small tight loop somewhere in the code, there is a possibility of using SSE intrinsics.
upd: Oops just noticed this is matlab. The reasoning stays, but for the solutions, you'll have to look yourself.

The problem may be due to the fact that your 32-bit system takes longer to reallocate the ~40MB of memory in the x = zeros(1,5000000)*NaN; line. Perhaps there is not enough available RAM and it needs to swap memory to disk. To check which part (the allocation or the comparison) is problematic, tic-toc these parts separately.
BTW, there is no need to multiply by NaN - you can simply do x = nan(1,5000000);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio