GPGPU threading strategy - Windows

I want to improve the performance of a compute shader.
Each thread group of the shader needs 8 blocks of data, each block has 24 elements.
I'm primarily optimizing for a GeForce GTX 1080 Ti in my development PC and a Tesla V100 in the production servers, but other people also run this code on their workstations; GPUs vary and are not necessarily NVIDIA.
Which way is better:
Option 1: [numthreads( 24, 1, 1 )] with a loop for( uint i = 0; i < 8; i++ )
This wastes 25% of the execution units in each warp, but the memory access pattern is excellent: the VRAM reads of the 24 active threads are either coalesced or full broadcasts.
Option 2: [numthreads( 96, 1, 1 )] with a loop for( uint i = groupThreadID / 24; i < 8; i += 4 )
This looks better in terms of execution-unit utilization, but the VRAM access pattern becomes worse because each warp reads 2 slices of the input data.
I'm also worried about the synchronization cost of the GroupMemoryBarrierWithGroupSync() intrinsic, since the group shared memory is now split over 3 warps.
It is also a bit harder to implement.
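For concreteness on the warp arithmetic behind both options: with [numthreads( 24, 1, 1 )] each 32-lane warp has only 24 active lanes (24 / 32 = 75%, hence the 25% waste), but all active lanes read from the same 24-element block. With [numthreads( 96, 1, 1 )] the group maps to exactly 3 full warps (96 / 32), yet because 32 is not a multiple of 24 each warp's lanes straddle two adjacent blocks, which is why every warp touches 2 slices of the input and why the barrier now has to synchronize threads spread over 3 warps.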

Related

Analysing performance of transpose function

I've written a naive and an "optimized" transpose function for order-3 tensors containing double-precision complex numbers, and I would like to analyze their performance.
Approximate code for naive transpose function:
#pragma omp for schedule(static)
for (auto i2 = std::size_t(0); i2 < n2; ++i2)
{
    for (auto i1 = std::size_t{}; i1 < n1; ++i1)
    {
        for (auto i3 = std::size_t{}; i3 < n3; ++i3)
        {
            tens_tr(i3, i2, i1) = tens(i1, i2, i3);
        }
    }
}
Approximate code for optimized transpose function (remainder loop not shown, assume divisibility):
#pragma omp for schedule(static)
for (auto i2 = std::size_t(0); i2 < n2; ++i2)
{
    // blocked loop
    for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
    {
        for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
        {
            for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
            {
                for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                {
                    cache_buffer[i3 * block_size + i1] = tens(bi1 + i1, i2, bi3 + i3);
                }
            }
            for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
            {
                for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                {
                    tens_tr(bi3 + i1, i2, bi1 + i3) = cache_buffer[i1 * block_size + i3];
                }
            }
        }
    }
}
Assumption: I decided to use a streaming function as reference because I reasoned that the transpose function, in its perfect implementation, would closely resemble any bandwidth-saturating streaming function.
For this purpose, I chose the DAXPY loop as reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
    auto* slice_a = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
    auto* slice_b = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
    const auto slice_size = 2 * tens_a_->get_slice_size(); // 2 doubles for a complex

    #pragma omp simd safelen(8)
    for (auto index = std::size_t{}; index < slice_size; ++index)
    {
        slice_b[index] += lambda_ * slice_a[index]; // fp_count: 2, traffic: 2+1
    }
}
Also, I used a simple copy kernel as a second reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
    const auto* op1_begin = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
    const auto* op1_end = op1_begin + 2 * tens_a_->get_slice_size(); // 2 doubles in a complex
    auto* op2_iter = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));

    #pragma omp simd safelen(8)
    for (auto* iter = op1_begin; iter != op1_end; ++iter, ++op2_iter)
    {
        *op2_iter = *iter;
    }
}
Hardware:
Intel(R) Xeon(R) Platinum 8168 (Skylake) with 24 cores @ 2.70 GHz and L1, L2 and L3 caches sized 32 kB, 1 MB and 33 MB respectively.
Memory of 48 GiB @ 2666 MHz. Intel Advisor's roofline view says the memory BW is 115 GB/s.
Benchmarking: 20 warm-up runs, 100 timed experiments, each with newly allocated data "touched" such that page-faults will not be measured.
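For illustration, a minimal sketch of the kind of first-touch initialization meant by "touched" data above, assuming an OpenMP build; first_touch and buffer are placeholder names, not part of the original benchmark:

#include <cstddef>
#include <vector>

// Touch newly allocated pages in parallel, with the same static schedule as
// the benchmark loops, so that page faults (and NUMA page placement) happen
// here rather than inside the timed region.
void first_touch(std::vector<double>& buffer)
{
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < buffer.size(); ++i)
    {
        buffer[i] = 0.0; // writing forces the page to be mapped
    }
}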
Compiler and flags:
Intel compiler from oneAPI 2022.1.0, optimization flags -O3;-ffast-math;-march=native;-qopt-zmm-usage=high.
Results (sizes assumed to be adequately large):
Using 24 threads pinned on 24 cores (total size of both tensors ~10 GiB):
DAXPY 102 GB/s
Copy 101 GB/s
naive transpose 91 GB/s
optimized transpose 93 GB/s
Using 1 thread pinned on a single core (total size of both tensors ~10 GiB):
DAXPY 20 GB/s
Copy 20 GB/s
naive transpose 9.3 GB/s
optimized transpose 9.3 GB/s
Questions:
Why is my naive transpose function performing so well?
Why is the difference in performance between reference and transpose functions so high when using only 1 thread?
I'm glad to receive any kind of input on the above questions, and I will gladly provide additional information when required. Unfortunately, I cannot provide a minimal reproducer because of the size and complexity of each benchmark program. Thank you very much for your time and help in advance!
Updates:
Could it be that the Intel compiler performed loop blocking for the naive transpose function as an optimization?
Is the above-mentioned assumption valid? [asked before the edit]
Not really.
Transpositions of large arrays tend not to saturate the RAM bandwidth on some platforms. This can be due to cache effects like cache thrashing. For more information about this, you can read this post for example. In your specific case, things work quite well though (see below).
On NUMA platforms, the distribution of data pages across NUMA nodes can have a strong impact on performance. This can be due to a (temporarily) unbalanced page distribution, non-uniform latency, non-uniform throughput, or even the (temporary) saturation of the RAM of one NUMA node. NUMA can be seen on recent AMD processors but also on some Intel ones (e.g. since Skylake, see this post), depending on the system configuration.
Even assuming the above points do not apply in your case, reasoning from the perfect case while the naive code may not behave like a perfect transposition can result in wrong interpretations. If this assumption is broken, the results could, for example, overestimate the performance of the naive implementation.
Why is my naive transpose function performing so well?
A good throughput does not mean the computation is fast. A computation can be slower despite a higher throughput if more data needs to be transferred from RAM, which is possible due to cache misses. More specifically, with a naive access pattern, cache lines can be replaced more frequently with lower reuse (due to cache thrashing), and thus the wall-clock time should be higher. You need to measure the wall-clock time; metrics are good for understanding what is going on, but not for measuring the performance of a kernel.
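For illustration, a minimal sketch of wall-clock timing around one of the kernels, assuming an OpenMP build; time_kernel and run_kernel are placeholder names, not part of the original code:

#include <omp.h>

// Time a kernel with a wall clock instead of deriving "performance" from
// traffic-based metrics alone; omp_get_wtime() returns seconds.
template <typename Kernel>
double time_kernel(Kernel&& run_kernel, int repetitions)
{
    const double start = omp_get_wtime();
    for (int r = 0; r < repetitions; ++r)
    {
        run_kernel(); // e.g. the naive or the blocked transpose
    }
    return (omp_get_wtime() - start) / repetitions; // average seconds per run
}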
In this specific case, the chosen size (i.e. 1050) should not cause too many conflict misses because it is not divisible by a large power of two. In the naive version, the tens_tr writes will fill many cache lines partially (1050 of them) before they can be reused as i1 is increased (up to 8 subsequent increments are needed to fill a cache line). This means 1050 * 64 ~= 66 KiB of cache is needed for the i1-i3 transposition of one given i2 to complete. These cache lines are not reused across i2 values, so the cache does not need to be huge for the transposition to be relatively efficient. That being said, one should also consider the tens reads (though those lines can be evicted from the cache quite quickly). In the end, the 16-way associative L2 cache of 1 MiB should be enough for that. Note that the naive implementation should perform poorly with significantly bigger arrays, since the L2 cache would no longer be large enough for the cache lines to be fully reused (causing data to be reloaded many times from the memory hierarchy, typically from the L3 in sequential runs and from the RAM in parallel). Also note that the naive transposition can perform very poorly on processors with smaller caches (e.g. x86-64 desktop processors, except recent ones that often have bigger caches), or if you change the size of the input array to something divisible by a large power of two.
While blocking enables a better use of the L1 cache, it is not so important in your specific case. The naive computation does not benefit from the L1 cache, but the effect is small since the transposition should be bounded by the L3 cache and the RAM anyway. That being said, better L1 cache usage could help reduce the latency a bit, depending on the target processor architecture. You should see the effect mainly on significantly smaller arrays.
In parallel, the L3 cache is large enough for the 24 cores to run without too many conflict misses. Even if the L3 performed poorly, the kernel would be mainly memory bound, so the impact of the cache misses would not be very visible.
Why is the difference in performance between reference and transpose functions so high when using only 1 thread?
This is likely due to the latency of memory operations. Transpositions perform memory reads/writes with huge strides, and the hardware prefetchers may not be able to fully mitigate the large latency of the L3 cache or of the main RAM. Indeed, the number of pending cache-line requests per core is limited (to about a dozen on Skylake), so the kernel is bound by the latency of those requests: there is not enough concurrency to fully overlap it.
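As a rough sanity check (the latency figure here is an assumption, not a measurement from the post): by Little's law the single-core bandwidth is bounded by roughly pending_lines * line_size / latency. With about 12 pending line-fill requests, 64-byte lines and an effective latency on the order of 80 ns, that gives 12 * 64 B / 80 ns ~= 9.6 GB/s, which is in the same range as the 9.3 GB/s measured for the single-threaded transpose.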
For DAXPY/copy, the hardware prefetchers can hide the latency better, but the amount of concurrency is still too small compared to the latency on a Xeon processor to fully saturate the RAM with 1 thread. This is a fairly reasonable architectural limitation, since such processors are designed to execute applications that scale well across many cores.
With many threads, the per-core limitation vanishes and is replaced by a stronger one: the practical RAM bandwidth.
Could it be that the Intel compiler performed loop-blocking for the naive transpose function as optimization?
This is theoretically possible since the Intel compiler (ICC) has such an optimizer, but it is very unlikely for ICC to do that on a 3D transposition code (it is a fairly complex and rather specific use case). The best way to be sure is to analyse the generated assembly code.
Note on the efficiency of the optimized transposition
Due to cache-line write allocation on x86-64 processors (like your Xeon processor), I would expect the transposition to show a lower throughput, assuming the reported figure does not take this effect into account. Indeed, the processor needs to read the tens_tr cache lines in order to fill them, since it does not know ahead of time whether they will be completely overwritten (tracking that would be impractical for the naive transposition), and they may be evicted before being fully written (e.g. during a context switch, or by another running program).
There are several possible explanations:
the assumption is wrong, and 1/3 of the bandwidth is wasted reading cache lines that are only meant to be written;
the DAXPY code has the same issue, and the reported maximum bandwidth is not really correct either (unlikely);
ICC managed to rewrite the transposition to use the caches efficiently and also generated non-temporal store instructions that avoid this effect (unlikely).
Based on these possibilities, I think the measured throughput already takes write allocation into account and that the transposition implementation can be optimized further. Indeed, the optimized version doing the copy could use non-temporal stores to write the array back to memory without reading it first. This is not possible with the naive implementation. With such an optimization the throughput may stay the same, but the execution time can be about 33% lower (due to better use of the memory bandwidth). This is a good example showing that the initial assumption is simply wrong.
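For illustration, a minimal sketch of what a streaming (non-temporal) write of one contiguous destination row could look like with AVX intrinsics. It assumes the destination pointer is 32-byte aligned and the element count is a multiple of 4 doubles, which a real blocked transpose would have to guarantee (or handle with a remainder loop); it is a sketch, not a drop-in replacement for the blocked kernel above.

#include <immintrin.h>
#include <cstddef>

// Copy 'count' doubles to a 32-byte-aligned destination using non-temporal
// stores, so the written cache lines are not read (allocated) first.
void stream_row(double* dst_aligned, const double* src, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m256d v = _mm256_loadu_pd(src + i); // source may be unaligned
        _mm256_stream_pd(dst_aligned + i, v);       // bypasses the cache on write
    }
    _mm_sfence(); // make the streaming stores globally visible before the data is reused
}

Whether this actually helps depends on the block size and on whether the freshly written tens_tr lines would have been reused from cache anyway.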

Limits of workload that can be put into hardware accelerators

I am interested in understanding what percentage of workloads can almost never be put onto hardware accelerators. While more and more tasks are becoming amenable to domain-specific accelerators, I wonder whether there are tasks that will never benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
I would love to have pointers to resources that speak to this question.
So you have the following question(s) in your original post:
Question:
I wonder whether there are tasks that will never benefit from an accelerator. Put simply, what are the tasks that are less likely to be accelerator-compatible?
Answer:
Of course it's possible. First and foremost, a workload to be accelerated on a hardware accelerator should not involve the following:
dynamic polymorphism and dynamic memory allocation
runtime type information (RTTI)
system calls
........... (some more depending on the hardware accelerator)
Although explaining each of the above points would make this post too lengthy, I can explain a few. There is no support for dynamic memory allocation because hardware accelerators have a fixed set of resources on silicon, and dynamically creating and freeing memory resources is not supported. Similarly, dynamic polymorphism is only supported if the pointed-to object can be determined at compile time. And there can be no system calls, because these ask the operating system to perform some task; OS operations such as file reads/writes, or OS queries like time and date, are not supported.
Having said that, the workloads least likely to be accelerator-compatible are mostly communication-intensive kernels. Such kernels often incur a serious data-transfer overhead compared to CPU execution, which can usually be detected by measuring the CPU-FPGA or CPU-GPU communication time.
For better understanding, let's take the following example:
Communication Intensive Breadth-First Search (BFS):
procedure BFS(G, root) is
    let Q be a queue
    label root as explored
    Q.enqueue(root)
    while Q is not empty do
        v := Q.dequeue()
        if v is the goal then
            return v
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as explored then
                label w as explored
                Q.enqueue(w)
The above pseudocode is the famous breadth-first search (BFS). Why is it not a good candidate for acceleration? Because it traverses all the nodes in a graph without doing any significant computation; it is immensely communication intensive rather than compute intensive. Furthermore, for a data-driven algorithm like BFS, the shape and structure of the input can dictate runtime characteristics like locality and branch behaviour, making it a poor candidate for hardware acceleration.
Now the question arises: why have I focused on compute-intensive vs communication-intensive?
As you have tagged FPGA in your post, I can explain this concept with respect to FPGAs. For instance, in a system that uses a PCIe connection between the CPU and the FPGA, we calculate the PCIe transfer time as the elapsed time of data movement from the host memory to the device memory through PCIe-based direct memory access (DMA).
The PCIe transfer time is a significant factor for filtering out FPGA acceleration of communication-bounded workloads. The above-mentioned BFS can therefore show severe PCIe transfer overheads and hence is not acceleration-compatible.
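As a rough illustration of that filter (the function, figures and names below are hypothetical, not taken from any particular system), a back-of-envelope check simply compares the CPU time against the transfer time plus the accelerator time:

#include <cstdio>

// Rough offload check: offloading only helps if the accelerator time plus the
// PCIe transfer time is smaller than the time the CPU would need on its own.
// All figures are hypothetical placeholders.
bool offload_pays_off(double bytes_moved, double pcie_gbps,
                      double cpu_seconds, double accel_seconds)
{
    const double transfer_seconds = bytes_moved / (pcie_gbps * 1e9);
    return transfer_seconds + accel_seconds < cpu_seconds;
}

int main()
{
    // Example: moving 4 GiB of graph data over ~12 GB/s of effective PCIe 3.0 x16
    // bandwidth already costs ~0.36 s before the accelerator does any work.
    const double bytes = 4.0 * 1024 * 1024 * 1024;
    std::printf("worth offloading: %s\n",
                offload_pays_off(bytes, 12.0, 0.5, 0.2) ? "yes" : "no");
    return 0;
}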
On the other hand, consider the family of object-recognition algorithms implemented as deep neural networks. If you go through these algorithms, you will find that a significant amount of time (maybe more than 90%) is spent in the convolution function. The input data is relatively small and the convolutions are embarrassingly parallel, which makes them an ideal workload to move to a hardware accelerator.
Let's take another example showing a perfect workload for hardware acceleration:
Compute Intensive General Matrix Multiply (GEMM):
void gemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]){
    int i, k, j, jj, kk;
    int i_row, k_row;
    TYPE temp_x, mul;

    loopjj: for (jj = 0; jj < row_size; jj += block_size){
        loopkk: for (kk = 0; kk < row_size; kk += block_size){
            loopi: for (i = 0; i < row_size; ++i){
                loopk: for (k = 0; k < block_size; ++k){
                    i_row = i * row_size;
                    k_row = (k + kk) * row_size;
                    temp_x = m1[i_row + k + kk];
                    loopj: for (j = 0; j < block_size; ++j){
                        mul = temp_x * m2[k_row + j + jj];
                        prod[i_row + j + jj] += mul;
                    }
                }
            }
        }
    }
}
The above code example is General Matrix Multiply (GEMM), a common algorithm in linear algebra, machine learning, statistics, and many other domains. The matrix multiplication in this code is computed using a blocked loop structure: reordering the arithmetic to reuse all of the elements in one block before moving on to the next dramatically improves memory locality. It is extremely compute intensive and a perfect candidate for acceleration.
Hence, to name only a few, we can conclude that the following are the deciding factors for hardware acceleration:
the computational load of your workload,
the data your workload accesses,
how parallel your workload is,
the underlying silicon available for acceleration,
the bandwidth and latency of the communication channels.
Do not forget Amdahl's Law:
Even if you have found the right workload, one that is an ideal candidate for hardware acceleration, the struggle does not end there. Why? Because the famous Amdahl's law comes into play: you might be able to significantly speed up a kernel, but if it accounts for only 2% of the runtime of the application, then even if you speed it up infinitely (take its run time to 0), you will only speed up the overall application by about 2% at the system level. Hence, your ideal workload should not only be ideal algorithmically; it should also contribute significantly to the overall runtime of your system.
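A minimal sketch of that arithmetic, using the 2% example from above:

#include <cstdio>

// Amdahl's law: overall speedup = 1 / ((1 - p) + p / s), where p is the
// accelerated fraction of the runtime and s is the speedup of that fraction.
double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main()
{
    // Accelerating 2% of the runtime by 100x yields ~1.02x overall;
    // even a near-infinite speedup of that 2% tops out at ~1.0204x.
    std::printf("100x on 2%%: %.4f\n", amdahl(0.02, 100.0));
    std::printf("huge on 2%%: %.4f\n", amdahl(0.02, 1e12));
    return 0;
}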

Why does my PC prefer even-numbered cores?

My PC has a 10th-gen Core i7 vPro with virtualization enabled: 8 cores + 8 virtual cores (i7-10875H, Comet Lake).
Each physical core is split into a pair, so core 1 hosts virtual cores 0 & 1, and core 2 hosts virtual cores 2 & 3. I've noticed in Task Manager that the first item of each core pair seems to be the preferred core, judging by its higher usage. I do set some affinities manually for certain heavy programs, but I always set these in groups of 4, either 0-3, 4-7, 8-11 or 12-15, and never mismatch different logical processors.
I'm wondering why this behaviour happens. Do the even-numbered cores equate to physical cores, which could be slightly faster? If so, would I get slightly better clock speeds without virtualisation when running programs that don't have a high thread count?
In general (for "scheduler theory"):
if you care about performance, spread the tasks across physical cores where possible. This prevents a "2 tasks run slower because they're sharing a physical core, while a whole physical core is idle" situation.
if you care about power consumption and not performance, make tasks use logical processors in the same physical core where possible. This may allow you to put entire core/s into a very power efficient "do nothing" state.
if you care about security (and not performance or power consumption), don't let unrelated tasks use logical processors in the same physical core at all (because information, like what kinds of instructions are currently being used, can be "leaked" from one logical processor to another logical processor in the same physical core). Note that it would be fine for related tasks to use logical processors in the same physical core (e.g. 2 threads that belong to the same process and do trust each other, but not threads that belong to different processes that don't trust each other).
Of course a good OS would know the preference for each task (whether it cares about performance, power consumption or security), and would make intelligent decisions to handle a mixture of tasks with different preferences. Sadly there are no good operating systems - most operating systems and APIs were designed in the 1990s or earlier (back when SMP was just starting and all CPUs were identical anyway) and lack the information about tasks that would be necessary to make intelligent decisions; so they assume performance is the only thing that matters for all tasks, leading to the "tasks spread across physical cores where possible, even when it's not ideal" behavior you're seeing.
My guess is that it's due to hyperthreading.
Hyperthreading doesn't double CPU capacity (according to Intel, it adds ~30% on average), so it makes sense to spread the work among physical cores first, and use hyperthreading as a last resort when the overall CPU demand starts exceeding 50%.
Fun fact: a reported 50% overall CPU load on a hyperthreaded system corresponds to roughly 70% of the machine's real throughput, and the remaining reported 50% only buys you the remaining ~30%.
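A rough way to see this, using the ~30% figure above: if hyperthreading adds about 30%, then 16 logical processors deliver roughly 130 "units" of throughput. When one logical processor per physical core is busy, Task Manager reports 50% load, but the machine is already delivering 100 of those 130 units, i.e. on the order of 70-75% of its real capacity; loading the second logical processor of each core only adds the remaining ~30 units.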
If we query the OS to see how logical processors are assigned to cores [1], we will see a situation like this:
Core 0: mask 0x3
Core 1: mask 0xc
Core 2: mask 0x30
Core 3: mask 0xc0
. . .
That means logical processors 0 and 1 are on core 0, 2 and 3 on core 1, etc.
You can disable hyperthreading in the BIOS, but since it adds performance it's a nice-to-have feature. You just need to be careful not to pin work such that it ends up competing for the same physical core.
[1] To check core assignment I use the small C program below. The information might also be available via WMIC.
#include <stdio.h>
#include <stdlib.h>

#undef _WIN32_WINNT
#define _WIN32_WINNT 0x601
#include <Windows.h>

int main() {
    DWORD len = 65536;
    char *buf = (char*)malloc(len);

    if (!GetLogicalProcessorInformationEx(RelationProcessorCore,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        return GetLastError();
    }

    // The returned records are variable-sized, so walk the buffer by info->Size
    // (the anonymous union steps a typed pointer and a byte pointer together).
    union {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info;
        PBYTE infob;
    };
    info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf;

    for (size_t i = 0, n = 0; n < len; i++, n += info->Size, infob += info->Size) {
        switch (info->Relationship) {
        case RelationProcessorCore:
            printf("Core %zd:", i);
            for (int j = 0; j < info->Processor.GroupCount; j++)
                printf(" mask 0x%llx", info->Processor.GroupMask[j].Mask);
            printf("\n");
            break;
        }
    }
    return 0;
}
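Relatedly, if you do pin heavy threads by hand as described above, here is a minimal sketch of keeping them on separate physical cores; the masks assume the 2-logical-processors-per-core layout printed by the program above and should be adjusted to whatever it reports on your machine:

#include <Windows.h>

// Pin the calling thread to the logical processor(s) selected by 'mask'.
// With the layout above, core 0 owns logical processors 0 and 1 (mask 0x3)
// and core 1 owns logical processors 2 and 3 (mask 0xc). Pinning two heavy
// threads to masks 0x1 and 0x4 keeps them on different physical cores,
// whereas 0x1 and 0x2 would make them fight over core 0.
static BOOL pin_current_thread(DWORD_PTR mask) {
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}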

Does data alignment really speed up execution by more than 5%?

I have always carefully considered the alignment of data structures; it hurts to let the CPU shuffle bits around before processing can be done. Gut feelings aside, I measured the cost of unaligned data: write 64-bit longs into some GB of memory and then read their values back, checking correctness.
// c++ code
#include <cstdlib> // malloc, free

const long long MB = 1024 * 1024;
const long long GB = 1024 * MB;

void bench(int offset) // pass 0..7 for different alignments
{
    int n = (1 * GB - 1024) / 8;
    char* mem = (char*) malloc(1 * GB);

    // benchmarked block
    {
        long long* p = (long long*) (mem + offset);
        for (long i = 0; i < n; i++)
        {
            *p++ = i;
        }

        p = (long long*) (mem + offset);
        for (long i = 0; i < n; i++)
        {
            if (*p++ != i) throw "wrong value";
        }
    }

    free(mem);
}
The result surprised me:
offset   1st run   2nd run     %
0          221       217     100 %
1          228       227     105 %
2          260       228     105 %
3          241       228     105 %
4          219       215      99 %
5          233       228     105 %
6          227       229     106 %
7          228       228     105 %
The cost is just 5% (if we stored data at random memory locations, the cost would be 3.75%, since 25% would land aligned). But storing data unaligned has the benefit of being a bit more compact, so that 3.75% could even be compensated for.
Tests were run on an Intel 3770 CPU. I did many variations of this benchmark (e.g. using pointers instead of longs; random read access to change cache effects), all leading to similar results.
Question: Is data structure alignment still as important as we all thought it is?
I know there are atomicity issues when 64-bit values spread across cache lines, but that is not a strong argument for alignment either, because larger data structs (say 30 or 200 bytes or so) will often spread across them anyway.
I have always believed strongly in the speed argument, as laid out nicely here for instance: Purpose of memory alignment, and I do not feel comfortable disobeying the old rule. But: can we measure the claimed performance boosts of proper alignment?
A good answer could provide a reasonable benchmark showing a boost of a factor > 1.25 for aligned vs unaligned data, or demonstrate that other commonly used modern CPUs are much more affected by misalignment.
Thank you for your thoughts and measurements.
edit: I am concerned about classical data structures where structs are held in memory, in contrast to special cases like scientific number-crunching scenarios.
update: insights from comments:
from http://www.agner.org/optimize/blog/read.php?i=142&v=t
Misaligned memory operands handled efficiently on Sandy Bridge
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.
http://danluu.com/3c-conflict/
Unaligned access might be faster(!) on Sandy Bridge due to cache organisation.
Yes, data alignment is an important prerequisite for vectorisation on architectures that only support SSE, which has strict data-alignment requirements, and on newer architectures such as Xeon Phi. Intel AVX does support unaligned access, but aligning data is still considered good practice to avoid unnecessary performance hits:
Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: these instructions still require aligned data.
On these architectures, code where vectorisation is useful (e.g. scientific computing applications with heavy use of floating point) may benefit from meeting the respective alignment prerequisites; the speedup would be proportional to the number of vector lanes in the FPU (4x, 8x, 16x). You can measure the benefits of vectorisation yourself by comparing software such as Eigen or PETSc, or any other scientific software, with and without vectorisation (-xHost for icc, -march=native for gcc); you should easily get a 2x speedup.
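If you want to probe the alignment question directly, here is a minimal sketch of such a micro-benchmark, assuming an AVX-capable x86-64 CPU: it sums the same buffer once from a 32-byte-aligned base and once from a base offset by one double, so that part of the loads split cache lines. It is an illustration rather than a rigorous benchmark, and on Sandy Bridge and later the difference for this streaming pattern will likely be small, which is consistent with your measurements.

#include <immintrin.h>
#include <chrono>
#include <cstddef>
#include <cstdio>

// Sum 'n' doubles starting at 'p' (n assumed divisible by 4) with AVX loads.
static double sum_avx(const double* p, std::size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(p + i)); // load works for any alignment
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

int main()
{
    const std::size_t n = std::size_t(1) << 26;          // 64 Mi doubles (512 MiB)
    double* base = (double*)_mm_malloc((n + 8) * sizeof(double), 64);
    for (std::size_t i = 0; i < n + 8; ++i) base[i] = 1.0;

    for (int offset = 0; offset <= 1; ++offset)          // 0 = aligned, 1 = misaligned by 8 bytes
    {
        const auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        for (int rep = 0; rep < 10; ++rep)
            s += sum_avx(base + offset, n);
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("offset %d: sum %.0f, %.0f ms\n", offset, s,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    _mm_free(base);
    return 0;
}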

RenderScript GPU performance not on par with device GFLOPS?

As a test, I am trying to squeeze as many GFLOPS out of the GPU as possible, just to see how far we can go with compute via RenderScript.
For this I use a GPU-cache-friendly kernel that will (hopefully) not be bound by memory access, for testing purposes:
#pragma rs_fp_relaxed

rs_allocation input;

float __attribute__((kernel)) compute(float in, int x)
{
    float sum = 0;
    if (x < 64) return 0;
    for (int i = 0; i < 64; i++) {
        sum += rsGetElementAt_float(input, x - i);
    }
    return sum;
}
On the Java side I just call the kernel a couple of times:
for (int i = 0; i < 1024; i++) {
    m_script.forEach_compute(m_inAllocation, m_outAllocation);
}
With allocation sizes of 1M floats this maxes out around 1-2 GFLOPS on a GPU that should reach around 100 GFLOPS (Snapdragon 600, APQ8064AB); that is 50x-100x less compute performance!
I have tried unrolling the loop (10% difference), using larger or smaller sums (<5% difference), different allocation sizes (<5% difference) and 1D or 2D allocations (no difference), but I come nowhere near the number of GFLOPS that should be possible on the device. I am even starting to think that the entire kernel only runs on the CPUs.
In a similar vein, looking at the results of a RenderScript benchmark application (https://compubench.com/result.jsp?benchmark=compu20), the top-of-the-line devices only achieve around 60M pixels/s on a Gaussian blur. A 5x5 blur in a naive (non-separable) implementation takes around 50 FLOPS/pixel, resulting in 3 GFLOPS as opposed to the 300 GFLOPS these GPUs have.
Any thoughts?
(see e.g. http://kyokojap.myweb.hinet.net/gpu_gflops/ for an overview of device capabilities)
EDIT:
Using the OpenCL libs that are available on the device (Samsung S4, Android 4.4.2), I have rewritten the RenderScript test program in OpenCL and run it via the NDK. With basically the same setup (1M float buffers, running the kernel 1024 times) I can now get around 25 GFLOPS, i.e. 10x the RenderScript performance, and a factor of 4 below the theoretical device maximum.
With RenderScript there is no way of knowing whether a kernel is running on the GPU. So:
if the RenderScript kernel does run on the GPU, why is it so slow?
if the kernel is not running on the GPU, which devices do run RenderScript on the GPU (aside from most probably the Nexus line)?
Thanks.
What device are you using? Not all devices are shipping with GPU drivers yet.
Also, that kernel will be memory bound, since you've got a 1:1 arithmetic to load ratio.
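To put a rough number on that (a back-of-envelope estimate; the bandwidth figure is an assumption for this class of device, not a measurement): the inner loop does one add per 4-byte float load, i.e. an arithmetic intensity of about 0.25 FLOP/byte. Even when the sliding window mostly hits in cache, every add is paired with a load that has to be issued; and to the extent the data streams from DRAM at the few GB/s such a device provides, 0.25 FLOP/byte caps the kernel at a couple of GFLOPS, which is in line with the 1-2 GFLOPS observed. Getting near the ALU peak would require reusing each loaded value across many more arithmetic operations, e.g. by keeping the window in registers.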

Resources