Following this question, and with reference to the shared memory example in the official guide, I'm trying to build the heat equation matrix, which looks like the one in this poorly drawn image I made.
Here's what I've done so far, as a minimal example:
#define N 32
#define BLOCK_SIZE 16
__global__ void heat_matrix(int* A)
{
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ int temp_sm_A[N*N];
    int* temp_A = &temp_sm_A[0];
    memset(temp_A, 0, N*N*sizeof(int));
    if (tid < N) //(*)
    {
        #pragma unroll
        for (unsigned int m = 0; m < NUM_BLOCKS; ++m)
        {
            #pragma unroll
            for (unsigned int e = 0; e < BLOCK_SIZE ; ++e)
            {
                if ( (tid == 0 && e == 0) || (tid == (N-1) && e == (BLOCK_SIZE-1) ) )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + ( tid==0 ? 1 : -1 )] = 1;
                }
                if ( tid == e )
                {
                    temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N -1));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
                    //printf("temp_A[%d] = -2;\n", (tid + (e + BLOCK_SIZE * m) * N));
                    temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
                    //printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N +1));
                }
            }
        }
    }
    __syncthreads(); //(**)
    memcpy(A, temp_A, N*N*sizeof(int));
}
int main(){
    int* h_A = (int*)malloc(N*N*sizeof(int));
    memset(h_A, 0, N*N*sizeof(int));
    int* d_A;
    checkCudaErrors(cudaMalloc((void**)&d_A, N*N*sizeof(int)));
    checkCudaErrors(cudaMemcpy(d_A, h_A, N*N*sizeof(int), cudaMemcpyHostToDevice));
    dim3 dim_grid((N/2 + BLOCK_SIZE -1)/ BLOCK_SIZE);
    dim3 dim_block(BLOCK_SIZE);
    heat_matrix <<< dim_grid, dim_block >>> (d_A);
    checkCudaErrors(cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost));
}
The code is arranged to suit a large N (larger than 32). I took advantage of the block division. Running the compiled program yields:
CUDA error at code=77(cudaErrorIllegalAddress) "cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost)"
And cuda-memcheck reports only one error (actually there is another, but it comes from cudasuccess=checkCudaErrors(cudaDeviceReset()); ...):
========= Invalid __shared__ write of size 4
========= at 0x00000cd0 in heat_matrix(int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffffffc is out of bounds
I can't see what I did wrong in the code. How can thread 0 in the first block provoke an illegal access? There's even a specific if case to deal with it, and the line of code where the error occurred isn't reported.
Moreover, is there a more efficient way for my code than to deal with all those ifs? Sure there is, but I couldn't find a better parallel expression to split the cases into the second for.
On a side note, to me the (*) seems unnecessary; instead (**) is necessary if I want to follow with other GPU function calls. Am I right?
Check out this line:
temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
For the thread with tid equal to zero during the first iteration, tid + (e + BLOCK_SIZE * m) * N - 1 evaluates to an index of -1. This is exactly what the cuda-memcheck output is complaining about: the reported address 0xfffffffc is a byte offset of -4, i.e. index -1 times sizeof(int), wrapped around as an unsigned 32-bit value.
A similar out-of-bounds access will occur later for the line
temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
when tid, e, and m all assume their maximum value.
You have multiple threads writing to the same memory location. Each thread should write to exactly one array element per inner loop iteration. There is no need to write out neighboring elements because they are already covered by their own threads.
You have a race condition between the initializing memset() and the stores inside the main loops. Put a __syncthreads() after the memset().
The calls to memset() and memcpy() will lead to each thread doing a full set/copy, doing the operations N times instead of just once.
The common way of handling this is to write out the operation explicitly, dividing the work between the threads of the block.
However ...
there is no benefit from creating the matrix in shared memory first and then copying it to global memory later. Writing directly to A in global memory eliminates the need for memset(), memcpy() and __syncthreads() altogether.
Using a block size of just 16 threads leaves half of the resources unused, as thread blocks are allocated in units of 32 threads (a warp).
You may want to re-read the section about the Thread Hierarchy in the CUDA C Programming Guide.
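To make the "one element per thread, written straight to A" idea concrete, here is a minimal host-side sketch in plain C. It is not the asker's kernel: the names heat_entry and fill_heat_matrix are made up for illustration, and the loop variable idx stands in for what would be the global thread index in a kernel launched with n*n threads.

```c
#include <assert.h>
#include <stdlib.h>

/* Value of entry (row, col) of the tridiagonal matrix: -2 on the main
 * diagonal, 1 on the two neighbouring diagonals, 0 everywhere else. */
static int heat_entry(int row, int col)
{
    if (row == col) return -2;
    if (abs(row - col) == 1) return 1;
    return 0;
}

/* Host-side stand-in for a kernel launched with n*n threads: idx plays
 * the role of the global thread index (an assumption of this sketch).
 * Every element is written exactly once, so no zero-fill, no copy and
 * no synchronisation is needed, and no index can leave [0, n*n). */
static void fill_heat_matrix(int *A, int n)
{
    assert(A != NULL && n > 0);
    for (int idx = 0; idx < n * n; ++idx)
        A[idx] = heat_entry(idx / n, idx % n);
}
```

Because each entry depends only on its own (row, col), there is no special case for the first and last rows: a neighbour that would fall outside the row simply never gets written.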
In your kernel, temp_A is a local pointer to the beginning of your shared memory array. Considering:
N = 32;
m in (0,1);
accesses like temp_A[tid + (e + BLOCK_SIZE * m) * N] can easily go out of bounds of the 1024-element array.
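The index arithmetic can be checked on the host. A minimal sketch in plain C, with helper names made up for illustration and int assumed to be 4 bytes, as it is on CUDA targets:

```c
#include <assert.h>
#include <stdint.h>

#define N 32
#define BLOCK_SIZE 16

/* Index used by the "left neighbour" store in the question's kernel:
 * temp_A[tid + (e + BLOCK_SIZE * m) * N - 1]. */
static int32_t left_neighbour_index(int32_t tid, int32_t e, int32_t m)
{
    return tid + (e + BLOCK_SIZE * m) * N - 1;
}

/* Byte offset of element `index` in an array of 4-byte ints, reduced
 * modulo 2^32 the way a 32-bit address would be printed. */
static uint32_t byte_offset_u32(int32_t index)
{
    return (uint32_t)index * (uint32_t)sizeof(int32_t);
}
```

For tid = 0, e = 0, m = 0 the index is -1, and its byte offset printed as an unsigned 32-bit value is 0xfffffffc, which matches the address in the cuda-memcheck report.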
I am learning CUDA with the book 'Programming Massively Parallel Processors'. A practice problem from chapter 5 confuses me:
For tiled matrix multiplication out of possible range of values for
BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely
avoid un-coalesced accesses to global memory? (you only need to consider square blocks)
In my understanding, BLOCK_SIZE has little effect on memory coalescing. As long as threads within single warp access consecutive elements, we will have a coalesced accesses. I could not figure out where the kernel has un-coalesced accesses to global memory. Any hints?
Here is the kernel's source code:
#define COMMON_WIDTH 512
#define ROW_LEFT 500
#define COL_RIGHT 250
#define K 1000
#define TILE_WIDTH 32
__device__ int D_ROW_LEFT = ROW_LEFT;
__device__ int D_COL_RIGHT = COL_RIGHT;
__device__ int D_K = K;
__global__ void MatrixMatrixMultTiled(float *matrixLeft, float *matrixRight, float *output){
    __shared__ float sMatrixLeft[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sMatrixRight[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int col = bx * TILE_WIDTH + tx;
    int row = by * TILE_WIDTH + ty;
    float value = 0;
    for (int i = 0; i < ceil(D_K/(float)TILE_WIDTH); ++i){
        if (row < D_ROW_LEFT && row * D_K + i * TILE_WIDTH +tx < D_K){
            sMatrixLeft[ty][tx] = matrixLeft[row * D_K + i * TILE_WIDTH +tx];
        }
        if (col < D_COL_RIGHT && (ty + i * TILE_WIDTH) * D_COL_RIGHT + col < D_K ){
            sMatrixRight[ty][tx] = matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
        }
        for (int j = 0; j < TILE_WIDTH; j++){
            value += sMatrixLeft[ty][j] * sMatrixRight[j][tx];
        }
    }
    if (row < D_ROW_LEFT && col < D_COL_RIGHT ){
        output[row * D_COL_RIGHT + col] = value;
    }
}
Your question is incomplete, since the code you have posted does not make any reference to BLOCK_SIZE, and that is certainly at least very relevant to the question posed in the book. More generally, questions that pose a kernel without the launch configuration are often incomplete, since the launch configuration is often relevant to both the correctness and the behavior, of a kernel.
I've not re-read this portion of the book right at the moment. However I'll assume the kernel launch configuration includes a block dimension that is something like the following: (this information is absent from your question but should have been included, in my opinion, for a sensible question)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(...,...);
And I will assume the kernel launch is given by something like:
MatrixMatrixMultTiled<<<dimGrid, dimBlock>>>(...);
Your statement: "As long as threads within single warp access consecutive elements, we will have a coalesced accesses." is a reasonable working definition. Let's show that that is violated for some choices of BLOCK_SIZE, given the above assumptions to cover over the gaps in your incomplete question.
Coalesced access is a term that applies to global memory accesses only. We will therefore ignore accesses to shared memory. We will also, for this discussion, ignore accesses to the __device__ variables such as D_ROW_LEFT. (The access to those variables appears to be uniform. We can quibble about whether that constitutes coalesced access. My claim would be that it does constitute coalesced access, but we need not unpack that here.) Therefore we are left with just 3 "access" points:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
matrixRight[(ty + i * TILE_WIDTH) * D_COL_RIGHT + col];
output[row * D_COL_RIGHT + col]
Now, to pick an example, let's suppose BLOCK_SIZE is 16. Will any of the above access points violate your statement "threads within single warp access consecutive elements"?
Let's start with the block (0,0). Therefore row is equal to threadIdx.y and col is equal to threadIdx.x. Let's consider the first warp in that block. Therefore the first 16 threads in that warp will have a threadIdx.y value of 0, and their threadIdx.x values will be increasing from 0..15. Likewise the second 16 threads in that warp will have a threadIdx.y value of 1, and their threadIdx.x values will be increasing from 0..15.
Now let's compute the actual index generated for the first access point above, across the warp. Let's assume we are on the first loop iteration, so i is zero. Therefore this:
matrixLeft[row * D_K + i * TILE_WIDTH +tx];
reduces to:
matrixLeft[threadIdx.y * D_K + threadIdx.x];
D_K here is just the device copy of the K variable, which is 1000. Now let's evaluate the reduced index expression above across our selected warp (0) in our selected block (0,0):
warp lane:    0  1  2  3  4  5  6 .. 15    16    17    18 ..    31
threadIdx.x:  0  1  2  3  4  5  6 .. 15     0     1     2 ..    15
threadIdx.y:  0  0  0  0  0  0  0 ..  0     1     1     1 ..     1
index:        0  1  2  3  4  5  6 .. 15  1000  1001  1002 ..  1015
Therefore the generated index pattern here shows a discontinuity between the 16th and 17th thread in the warp, and the access pattern does not fit your previously stated condition:
"threads within single warp access consecutive elements"
and we do not have coalesced access in this case (at least, for float quantities).
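The table above can be reproduced for other block sizes with a small host-side sketch in plain C. The function names are made up for illustration. It only models the first access point for warp 0 of block (0,0) on the first loop iteration, and it only checks the "consecutive elements" condition, not alignment. Threads in a block are numbered with threadIdx.x varying fastest, which is what maps a warp lane to (threadIdx.x, threadIdx.y) here.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

/* Index into matrixLeft read by warp lane `lane` of warp 0 in block (0,0)
 * with i == 0, for a block_size x block_size thread block and row pitch k:
 * threadIdx.y * k + threadIdx.x. */
static int left_index(int lane, int block_size, int k)
{
    assert(block_size > 0 && lane >= 0 && lane < WARP_SIZE);
    int tx = lane % block_size;   /* threadIdx.x */
    int ty = lane / block_size;   /* threadIdx.y */
    return ty * k + tx;
}

/* True when the 32 lanes of that warp read 32 consecutive elements. */
static bool warp_is_contiguous(int block_size, int k)
{
    for (int lane = 1; lane < WARP_SIZE; ++lane) {
        if (left_index(lane, block_size, k) !=
            left_index(lane - 1, block_size, k) + 1)
            return false;
    }
    return true;
}
```

With k = 1000, a block size of 16 gives the jump from 15 to 1000 shown in the table, while a block size of 32 keeps all 32 lanes of the warp on one row, so their indices are consecutive.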
I need advice on how to optimize my implementation of the Needleman-Wunsch algorithm in CUDA.
I want to optimize my code to fill the DP matrix in CUDA. Due to the data dependence between matrix elements (each next element depends on the other ones - left to it, up to it, and left-up to it), I'm filling anti-diagonal matrix elements in parallel as follows:
__global__ void alignment_kernel(int *T, char *A, char *B, int t_M, int t_N, int d) {
    int row = BLOCK_SIZE_Y * blockIdx.y + threadIdx.y;
    int col = BLOCK_SIZE_X * blockIdx.x + threadIdx.x;
    // Check if we are inside the table boundaries.
    if (!(row < t_M && col < t_N)) {
        return;
    }
    // Check if current thread is on the current diagonal
    if (row + col != d) {
        return;
    }
    int v1;
    int v2;
    int v3;
    int v4;
    v1 = v2 = v3 = v4 = INT_MIN;
    if (row > 0 && col > 0) {
        v1 = T[t_N * (row - 1) + (col - 1)] + score_matrix_read(A[row - 1], B[col - 1]);
    }
    if (row > 0 && col >= 0) {
        v2 = T[t_N * (row - 1) + col] + gap;
    }
    if (row >= 0 && col > 0) {
        v3 = T[t_N * row + (col - 1)] + gap;
    }
    if (row == 0 && col == 0) {
        v4 = 0;
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    T[t_N * row + col] = mmax(v1, v2, v3, v4);
}
Nevertheless, one obvious problem with my code is that I do multiple kernel calls (code below). So far, I don't know how to use threads to process the anti-diagonals synchronously without doing that. I think this is a major obstacle to better performance.
// Invoke kernel.
for (int d = 0; d < t_M + t_N - 1; d++) {
    alignment_kernel<<< gridDim, blockDim >>>(d_T, d_A, d_B, t_M, t_N, d);
}
How can I process the anti-diagonal in parallel and, maybe, use shared memory to increase the speedup?
Beyond this problem, is there any way to do the backtrace step of the Needleman-Wunsch algorithm in parallel?
I am currently working on a parallel implementation of the Needleman-Wunsch algorithm as well (to use in a genome mapper). Depending on how many alignments you will be doing, it may be more efficient to do a single alignment per thread.
However, here is a publication that performs a single alignment in parallel (on a GPU). The novelty of their approach is that it does not generate the matrix sequentially, but rather diagonally. They don't talk about how they backtrack in their publication. They send the matrix back to the host after it is generated, then they perform the backtrack using a CPU. I think that backtracking on the GPU would be terribly inefficient due to branching.
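To illustrate what "diagonally" means here, below is a minimal host-side sketch in plain C that fills the table one anti-diagonal at a time. It is not the method from the publication. The scoring values (+1 match, -1 mismatch, -1 gap) and all function names are assumptions chosen for illustration. The table has strlen(A) + 1 rows and strlen(B) + 1 columns, as in the question's kernel.

```c
#include <assert.h>
#include <string.h>

#define GAP (-1)   /* assumed gap penalty */

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Assumed substitution score: +1 for a match, -1 for a mismatch. */
static int score(char a, char b) { return a == b ? 1 : -1; }

/* Fill the m x n table T (m = strlen(A) + 1, n = strlen(B) + 1) one
 * anti-diagonal at a time and return the bottom-right score. Cells with
 * row + col == d read only diagonals d - 1 and d - 2, so the iterations
 * of the inner loop are independent of each other. */
static int nw_fill_by_diagonals(int *T, const char *A, const char *B)
{
    int m = (int)strlen(A) + 1;
    int n = (int)strlen(B) + 1;
    assert(T != NULL);
    for (int d = 0; d < m + n - 1; ++d) {
        int row_lo = d < n ? 0 : d - n + 1;   /* keeps col = d - row below n */
        int row_hi = d < m ? d : m - 1;       /* keeps row below m */
        for (int row = row_lo; row <= row_hi; ++row) {
            int col = d - row;
            int *cell = &T[row * n + col];
            if (row == 0 && col == 0)
                *cell = 0;
            else if (row == 0)
                *cell = T[col - 1] + GAP;
            else if (col == 0)
                *cell = T[(row - 1) * n] + GAP;
            else
                *cell = max3(T[(row - 1) * n + (col - 1)] + score(A[row - 1], B[col - 1]),
                             T[(row - 1) * n + col] + GAP,
                             T[row * n + (col - 1)] + GAP);
        }
    }
    return T[m * n - 1];
}
```

The inner loop is the part that could be distributed over threads. The outer loop over d is the dependency chain that forces one synchronisation per diagonal, whether that is a kernel launch or something cheaper.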
When calculating an (I)FFT it is possible to calculate "N*2 real" data points using an ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on Stack Overflow already.
This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to plain 2D rotation. I just can't quite wrap my head around it, but it feels like this could be done with fewer trigonometric calls and fewer arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
    unsigned size = 1 << bits;
    unsigned shift = SINE_TABLE_BITS - (bits - 1);
    unsigned n = index; // index for mag_lo, pha_lo
    unsigned z = size - index; // index for mag_hi, pha_hi
    int32_t rsum, rdif, isum, idif, r, i;
    r = smmulr(mag_lo, sine(pha_lo)); // mag_lo * sin(pha_lo)
    i = smmulr(mag_hi, sine(pha_hi)); // mag_hi * sin(pha_hi)
    rsum = r + i; rdif = r - i;
    r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
    i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
    isum = r + i; idif = r - i;
    r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
    i = -sinetable[index << shift]; // sin(pi_c * (index / size) / 2)
    int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
    int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
    complex[n].r = rsum + rtmp;
    complex[n].i = itmp + idif;
    complex[z].r = rsum - rtmp;
    complex[z].i = itmp - idif;
}
// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
    unsigned fft_size = 1 << fft_bits;
    fft_sym_dc(complex, 0, 0); // sets dc bin [0]
    for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
        // calculation of amplitude/index (sawtooth) for both n and z
        fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
    }
}
What do I need to change in my program to be able to compute a higher limit of prime numbers?
Currently my algorithm works only with numbers up to 85 million. In my opinion, it should work with numbers up to 3 billion.
I'm writing my own implementation of the Sieve of Eratosthenes in CUDA and I've hit a wall.
So far the algorithm seems to work fine for small numbers (below 85 million).
However, when I try to compute prime numbers up to 100 million, 2 billion, 3 billion, the system freezes (while it's computing stuff in the CUDA device), then after a few seconds, my linux machine goes back to normal (unfrozen), but the CUDA program crashes with the following error message:
CUDA error at code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
I have a GTX 780 (3 GB) and I'm allocating the sieves in a char array, so if I were to compute prime numbers up to 100,000, it would allocate 100,000 bytes in the device.
I assumed that the GPU would allow up to 3 billion numbers since it has 3 GB of memory, however, it only lets me do 85 million tops (85 million bytes = 0.08 GB)
this is my code:
#include <stdio.h>
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc
// #include <cuda.h>
// #include <cuda_runtime_api.h>
// #include <cuda_runtime.h>
typedef unsigned long long int uint64_t;
/*
 * kernel that initializes the 1st couple of values in the primes array.
 */
__global__ static void sieveInitCUDA(char* primes)
{
    primes[0] = 1; // value of 1 means the number is NOT prime
    primes[1] = 1; // numbers "0" and "1" are not prime numbers
}
/*
 * kernel for sieving the even numbers starting at 4.
 */
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max)
{
    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 4;
    if (index < max)
        primes[index] = 1;
}
/*
 * kernel for finding prime numbers using the sieve of eratosthenes
 * - primes: an array of bools. initially all numbers are set to "0".
 *           A "0" value means that the number at that index is prime.
 * - max: the max size of the primes array
 * - maxRoot: the sqrt of max (the other input). we don't wanna make all threads
 *            compute this over and over again, so it's being passed in
 */
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max,
                                               const uint64_t maxRoot)
{
    // get the starting index, sieve only odds starting at 3
    // 3,5,7,9,11,13...
    /* int index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 3; */
    // apparently the following indexing usage is faster than the one above. Hmm
    int index = blockIdx.x * blockDim.x + threadIdx.x + 3;
    // make sure index won't go out of bounds, also don't start the execution
    // on numbers that are already composite
    if (index < maxRoot && primes[index] == 0)
    {
        // mark off the composite numbers
        for (int j = index * index; j < max; j += index)
        {
            primes[j] = 1;
        }
    }
}
/*
 * checkDevice()
 */
__host__ int checkDevice()
{
    // query the Device and decide on the block size
    int devID = 0; // the default device ID
    cudaError_t error;
    cudaDeviceProp deviceProp;
    error = cudaGetDevice(&devID);
    if (error != cudaSuccess)
    {
        printf("CUDA Device not ready or not supported\n");
        printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__);
        return EXIT_FAILURE;
    }
    error = cudaGetDeviceProperties(&deviceProp, devID);
    if (deviceProp.computeMode == cudaComputeModeProhibited || error != cudaSuccess)
    {
        printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n");
        return EXIT_FAILURE;
    }
    // Use a larger block size for Fermi and above (see compute capability)
    return (deviceProp.major < 2) ? 16 : 32;
}
/*
 * genPrimesOnDevice
 * - inputs: limit - the largest prime that should be computed
 *           primes - an array of size [limit], initialized to 0
 */
__host__ void genPrimesOnDevice(char* primes, uint64_t max)
{
    int blockSize = checkDevice();
    if (blockSize == EXIT_FAILURE)
        return;
    char* d_Primes = NULL;
    int sizePrimes = sizeof(char) * max;
    uint64_t maxRoot = sqrt(max);
    // allocate the primes on the device and set them to 0
    checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes));
    checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes));
    // make sure that there are no errors...
    // setup the execution configuration
    dim3 dimBlock(blockSize);
    dim3 dimGrid((maxRoot + dimBlock.x) / dimBlock.x);
    dim3 dimGridEvens(((max + dimBlock.x) / dimBlock.x) / 2);
    //////// debug
#ifdef DEBUG
    printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
    printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z);
    printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z);
#endif
    // call the kernel
    // NOTE: no need to synchronize after each kernel
    sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize
    sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max);
    sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot);
    // check for kernel errors
    checkCudaErrors(cudaDeviceSynchronize());
    // copy the results back
    checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost));
    // no memory leaks
    checkCudaErrors(cudaFree(d_Primes));
}
to test this code:
int main()
{
    int max = 85000000; // 85 million
    char* primes = (char*)malloc(max);
    // check that it allocated correctly...
    memset(primes, 0, max);
    genPrimesOnDevice(primes, max);
    // if you wish to display results:
    for (uint64_t i = 0; i < max; i++)
    {
        if (primes[i] == 0) // if the value is '0', then the number is prime
        {
            std::cout << i; // use printf if you are using c
            if ((i + 1) != max)
                std::cout << ", ";
        }
    }
}
This error:
CUDA error at code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
doesn't necessarily mean anything other than that your kernel is taking too long. It's not necessarily a numerical limit or a computational error, but a system-imposed limit on the amount of time your kernel is allowed to run. Both Linux and Windows can have such watchdog timers.
If you want to work around it in the linux case, review this document.
You don't mention it, but I assume your GTX780 is also hosting a (the) display. In that case, there is a time limit on kernels by default. If you can use another device as the display, then reconfigure your machine to have X not use the GTX780, as described in the link. If you do not have another GPU to use for the display, then the only option is to modify the interactivity setting indicated in the linked document, if you want to run long-running kernels. And in this situation, the keyboard/mouse/display will become non-responsive while the kernel is running. If your kernel should happen to run too long, it can be difficult to recover the machine, and may require a hard reboot. (You could also SSH into the machine, and kill the process that is using the GPU for CUDA.)
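Separately from the watchdog limit discussed above, the size arithmetic in the question can be checked on the host. This is an observation about ranges, not a diagnosis of the timeout: the question's code keeps the byte count and the loop indices in int, and a count of 3 billion does not fit in a 32-bit signed int. A minimal sketch in plain C, with made-up helper names and int assumed to be 32 bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes needed for a sieve that keeps one char flag per number in
 * [0, limit). */
static uint64_t sieve_bytes(uint64_t limit)
{
    return limit * sizeof(char);
}

/* True when the value is still representable in a 32-bit signed int,
 * which is what `int` is assumed to be in this sketch. */
static bool fits_in_int32(uint64_t value)
{
    return value <= (uint64_t)INT32_MAX;
}
```

85 million bytes fits comfortably, and 3 billion bytes is below 3 GiB, but 3 billion is above INT32_MAX (2147483647), so sizes and indices at that scale would need a wider type such as size_t or uint64_t.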
I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:
void reducebytwo(uint8_t *dst, uint8_t *src)
{
    //src is 640x480, dst is 320x240
}
What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?
As a starting point, I simply would like to do the equivalent of:
for (int h = 0; h < 240; h++)
    for (int w = 0; w < 320; w++)
        dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4;
This is a one-to-one translation of your code to ARM NEON intrinsics:
#include <arm_neon.h>
#include <stdint.h>
static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
    int i;
    for (i=0; i<640; i+=16)
    {
        // load upper line and add neighbor pixels:
        uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));
        // load lower line and add neighbor pixels:
        uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));
        // sum of upper and lower line:
        uint16x8_t c = vaddq_u16 (a,b);
        // divide by 4, convert to char and store:
        vst1_u8 (dest, vshrn_n_u16 (c, 2));
        // move pointers to next chunk of data
        src1+=16;
        src2+=16;
        dest+=8;
    }
}
void resize_image (uint8_t * src, uint8_t * dest)
{
    int h;
    for (h = 0; h < 240; h++)
    {
        resize_line (src+640*(h*2+0),
                     src+640*(h*2+1),
                     dest+320*h);
    }
}
It processes 32 source-pixels and generates 8 output pixels per iteration.
I did a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop and eliminate pipeline stalls. That would give you an estimated factor of three performance boost.
It should be a lot faster than your implementation without assembler changes though.
Note: I haven't tested the code...
If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:
for (i=0; i<640; i+= 32)
{
    uint8x16x2_t a, b;
    uint8x16_t c, d;
    /* load upper row, splitting even and odd pixels into a.val[0]
     * and a.val[1] respectively. */
    a = vld2q_u8(src1);
    /* as above, but for lower row */
    b = vld2q_u8(src2);
    /* compute average of even and odd pixel pairs for upper row */
    c = vrhaddq_u8(a.val[0], a.val[1]);
    /* compute average of even and odd pixel pairs for lower row */
    d = vrhaddq_u8(b.val[0], b.val[1]);
    /* compute average of upper and lower rows, and store result */
    vst1q_u8(dest, vrhaddq_u8(c, d));
    /* advance past the 32 source pixels read per row and the 16 written */
    src1 += 32;
    src2 += 32;
    dest += 16;
}
It works by using the vrhadd (rounding halving add) operation, which has a result the same size as the input. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.
However it is less accurate, because the intermediate sum is quantised, and GCC 4.7 does a terrible job of generating code. GCC 4.8 does just fine.
The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.
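The accuracy trade-off mentioned above can be seen with a scalar model of a single output pixel. This is a minimal sketch in plain C with made-up function names. rhadd_u8 models one lane of the rounding halving add, and avg4_exact is the widen-sum-divide formula from the question.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of a rounding halving add on unsigned bytes:
 * (a + b + 1) >> 1, computed in a wider type so it cannot overflow. */
static uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Fast variant: average the pixel pair of each row, then average the
 * two row results. Each step rounds, so the rounding can accumulate. */
static uint8_t avg4_rhadd(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return rhadd_u8(rhadd_u8(a, b), rhadd_u8(c, d));
}

/* Exact variant from the question: sum all four pixels, divide by 4. */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)(((unsigned)a + b + c + d) / 4u);
}
```

For the pixels 0, 1, 0, 1 the exact version gives 0 (2 / 4 truncates) while the nested rounding version gives 1. For 10, 20, 30, 40 both give 25, so the two variants differ only in the low bit.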
Here is the asm version of resize_line that @Nils Pipenbrinck suggested:
static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
    for(int i=0; i<width; i+=16) {
        asm volatile (
            "pld [%[line1], #0xc00] \n"
            "pld [%[line2], #0xc00] \n"
            "vldm %[line1]!, {d0,d1} \n"
            "vldm %[line2]!, {d2,d3} \n"
            "vpaddl.u8 q0, q0 \n"
            "vpaddl.u8 q1, q1 \n"
            "vadd.u16 q0, q1 \n"
            "vshrn.u16 d0, q0, #2 \n"
            "vst1.8 {d0}, [%[dst]]! \n"
            // the pointers are post-incremented by the asm, so they are read-write operands
            : [line1] "+r"(src1), [line2] "+r"(src2), [dst] "+r"(dest)
            :
            : "q0", "q1", "memory"
        );
    }
}
It is about 4 times faster than the C version (tested on iPhone 5).