CUDA limit seems to be reached, but what limit is that?

I have a CUDA program that seems to be hitting some sort of limit of some resource, but I can't figure out what that resource is. Here is the kernel function:
__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap,
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;
    if(segmentIndex >= segmentCount)
        return;
    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;
        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];
            // only test against segments that are not part of our own
            // polyline and are not fake segments
            bool isLegit = (currentPolylineIndex != polylineIndex &&
                            currentPolylineIndex >= 0);
            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;
            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;
            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;
            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);
            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;
            result = result | (isLegit && !isParallel &&
                intersectionPoint.x > min(p1.x, p2.x) &&
                intersectionPoint.x > min(p3.x, p4.x) &&
                intersectionPoint.x < max(p1.x, p2.x) &&
                intersectionPoint.x < max(p3.x, p4.x) &&
                intersectionPoint.y > min(p1.y, p2.y) &&
                intersectionPoint.y > min(p3.y, p4.y) &&
                intersectionPoint.y < max(p1.y, p2.y) &&
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }
    output[segmentIndex] = result;
}
Here is the call to execute the kernel function:
DoCheck<<<702, 32>>>(
    (float2*)devicePoints,
    deviceSegmentsToPolylineIndexMap,
    numSegments,
    deviceOutput);
The sizes of the parameters are as follows:
devicePoints = 22,464 float2s = 179,712 bytes
deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
numSegments = 1 int = 4 bytes
deviceOutput = 22,463 ints = 89,852 bytes
When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...);, it works. Just to be clear, the parameters are the same, just the number of blocks is different.
Any idea why one crashes the video driver and the other doesn't? The one that fails still seems to be within the card's limit on the number of blocks.
Update
More information on my system configuration:
Video Card: nVidia 8800GT
CUDA Version: 1.1
OS: Windows Server 2008 R2
I also tried it on a laptop with the following configuration, but got the same results:
Video Card: nVidia Quadro FX 880M
CUDA Version: 1.2
OS: Windows 7 64-bit

The resource being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer that kills any kernel taking more than a few seconds to execute. Any code running on a card that is also driving a display is subject to this limit.
On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:
Get a Tesla card and use the TCC driver, which eliminates the problem completely
Try modifying registry settings to increase the timer limit (google for the TdrDelay registry key for more information, but I am not a Windows user and can't be more specific than that)
Modify your kernel code to be "re-entrant" and process the data-parallel workload in several kernel launches rather than one. Kernel launch overhead isn't all that large, and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using (see the sketch below).
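To illustrate the third option, here is a minimal sketch, not the poster's actual code: DoCheckChunk, DoCheckAll, and the chunk size are hypothetical names and numbers. The idea is to add an offset parameter so each launch covers only a slice of the segments, and each launch stays comfortably under the watchdog limit:
__global__ void DoCheckChunk(float2* points, int* segmentToPolylineIndexMap,
                             int segmentCount, int* output, int segmentOffset)
{
    int segmentIndex = segmentOffset + threadIdx.x + blockIdx.x * blockDim.x;
    if(segmentIndex >= segmentCount)
        return;
    // ... same body as DoCheck ...
}

void DoCheckAll(float2* points, int* map, int segmentCount, int* output)
{
    const int blockSize   = 32;
    const int chunkBlocks = 100;                   // ~3200 segments per launch
    const int chunkSize   = chunkBlocks * blockSize;
    for(int offset = 0; offset < segmentCount; offset += chunkSize)
    {
        DoCheckChunk<<<chunkBlocks, blockSize>>>(points, map, segmentCount,
                                                 output, offset);
        cudaDeviceSynchronize(); // pace the launches; the watchdog times each
                                 // kernel individually
    }
}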

Related

Why is local memory in this OpenCL algorithm so slow?

I am writing some OpenCL code. My kernel should create a special "accumulator" output based on an input image. I have tried two concepts and both are equally slow, although the second one uses local memory. Could you please help me identify why the local memory version is so slow? The target GPU for the kernels is an AMD Radeon Pro 450.
// version one
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
    const unsigned int x = get_global_id(0);
    const unsigned int y = get_global_id(1);
    int ind;
    int k;
    for(k = SOME_BEGINNING; k <= SOME_END; k++) {
        // some pretty wild calculation
        // ind is not linear and accesses different areas of the output
        ind = ...
        if(input[y * WIDTH + x] == 255) {
            atomic_inc(&output[ind]);
        }
    }
}
// variant two
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
    const unsigned int x = get_global_id(0);
    const unsigned int y = get_global_id(1);
    __local int buf[7072];
    if(y < 221 && x < 32) {
        buf[y * 32 + x] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    int ind;
    int k;
    for(k = SOME_BEGINNING; k <= SOME_END; k++) {
        // some pretty wild calculation
        // ind is not linear and accesses different areas of the output
        ind = ...
        if(input[y * WIDTH + x] == 255) {
            atomic_inc(&buf[ind]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0) == get_local_size(0) - 1)
        for(k = 0; k < 7072; k++)
            output[k] = buf[k];
}
I would expect that the second variant is faster than the first one, but it isn't. Sometimes it is even slower.
The local buffer __local int buf[7072] (28,288 bytes) is too big. I don't know exactly how much local memory the AMD Radeon Pro 450 has, but it is likely 32 kB or 64 kB per compute unit.
32768/28288 = 1 and 65536/28288 = 2, so only one, or at most two, work-groups can be resident at a time. With so few wavefronts (64 work items each) in flight, occupancy of the compute unit is very low, hence the poor performance.
Your aim should be to reduce the local buffer as much as possible so that more wavefronts can be processed simultaneously.
Use CodeXL to profile your kernel - it has tools that will show you all of this.
Alternatively, have a look at the CUDA occupancy calculator Excel spreadsheet to get a better idea of how occupancy works, if you don't want to run the profiler.
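One way to shrink the buffer is to tile the accumulator: make several passes, each accumulating only a slice of the output that fits in a small local buffer. A rough sketch of the idea, written in CUDA syntax for concreteness (the OpenCL version is analogous); find_points_tiled, TILE_BINS, and the tile size are illustrative, and the "wild calculation" stays elided as in the question:
#define TILE_BINS 1024  // 4 kB of local memory per group (illustrative choice)

__global__ void find_points_tiled(const unsigned char* input,
                                  unsigned int* output, int tileStart)
{
    __shared__ unsigned int buf[TILE_BINS];
    int lid = threadIdx.y * blockDim.x + threadIdx.x;
    for(int k = lid; k < TILE_BINS; k += blockDim.x * blockDim.y)
        buf[k] = 0;                   // every group zeroes its own buffer
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    for(int k = SOME_BEGINNING; k <= SOME_END; k++) {
        int ind = ...                 // the same "pretty wild calculation"
        // only accumulate bins that belong to this pass's tile
        if(ind >= tileStart && ind < tileStart + TILE_BINS &&
           input[y * WIDTH + x] == 255)
            atomicAdd(&buf[ind - tileStart], 1u);
    }
    __syncthreads();

    // flush this tile's partial counts to global memory
    for(int k = lid; k < TILE_BINS; k += blockDim.x * blockDim.y)
        atomicAdd(&output[tileStart + k], buf[k]);
}
The trade-off is that the index calculation is repeated once per tile, but the much higher occupancy usually wins when the per-item work is cheap.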

Mandelbrot optimization in OpenMP

Well, I have to parallelize the Mandelbrot program in C. I think I have done it well, but I can't get better times. My question is whether someone has an idea to improve the code; I've been thinking perhaps about nested parallel regions between the outer and inner for loops...
I also have doubts about whether it is more elegant or recommended to put all the pragmas on a single line, or to write separate pragmas (one for omp parallel with the shared and private variables and a conditional, and another pragma with omp for and schedule dynamic).
I'm also unsure whether constants can be used as private variables, because I think it is cleaner to have constants instead of defined variables.
I have also written a conditional (if numcpu > 1): it makes no sense to use a parallel region otherwise, so it falls back to a normal sequential execution.
Finally, as I have read, the dynamic chunk size depends on hardware and your system configuration... so I have left it as a constant, so it can be easily changed.
I also adapt the number of threads to the number of processors available.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
/* IMAGEWIDTH, IMAGEHEIGHT, ITERATIONS, CHUNK and pixel_t are defined
   elsewhere in the original program */

int main(int argc, char *argv[])
{
    omp_set_dynamic(1);
    int xactual, yactual;
    //each iteration, it calculates: newz = oldz*oldz + p, where p is the current pixel, and oldz starts at the origin
    double pr, pi; //real and imaginary part of the pixel p
    double newRe, newIm, oldRe, oldIm; //real and imaginary parts of new and old z
    double zoom = 1, moveX = -0.5, moveY = 0; //you can change these to zoom and change position
    pixel_t *pixels = malloc(sizeof(pixel_t)*IMAGEHEIGHT*IMAGEWIDTH);
    clock_t begin, end;
    double time_spent;
    begin = clock();
    int numcpu;
    numcpu = omp_get_num_procs();
    //FILE * fp;
    printf("The number of processors we will use is: %d", numcpu);
    omp_set_num_threads(numcpu);
    #pragma omp parallel shared(pixels, moveX, moveY, zoom) private(xactual, yactual, pr, pi, newRe, newIm, oldRe, oldIm) if(numcpu > 1)
    {
        //int xactual=0;
        //int yactual=0;
        #pragma omp for schedule(dynamic, CHUNK)
        //loop through every pixel
        for(yactual = 0; yactual < IMAGEHEIGHT; yactual++)
            for(xactual = 0; xactual < IMAGEWIDTH; xactual++)
            {
                //calculate the initial real and imaginary part of z, based on the pixel location and zoom and position values
                pr = 1.5 * (xactual - IMAGEWIDTH / 2) / (0.5 * zoom * IMAGEWIDTH) + moveX;
                pi = (yactual - IMAGEHEIGHT / 2) / (0.5 * zoom * IMAGEHEIGHT) + moveY;
                newRe = newIm = oldRe = oldIm = 0; //these should start at 0,0
                //"i" will represent the number of iterations
                int i;
                //start the iteration process
                for(i = 0; i < ITERATIONS; i++)
                {
                    //remember value of previous iteration
                    oldRe = newRe;
                    oldIm = newIm;
                    //the actual iteration, the real and imaginary part are calculated
                    newRe = oldRe * oldRe - oldIm * oldIm + pr;
                    newIm = 2 * oldRe * oldIm + pi;
                    //if the point is outside the circle with radius 2: stop
                    if((newRe * newRe + newIm * newIm) > 4) break;
                }
                // color(i % 256, 255, 255 * (i < maxIterations));
                if(i == ITERATIONS)
                {
                    //color(0, 0, 0); // black
                    pixels[yactual*IMAGEWIDTH+xactual][0] = 0;
                    pixels[yactual*IMAGEWIDTH+xactual][1] = 0;
                    pixels[yactual*IMAGEWIDTH+xactual][2] = 0;
                }
                else
                {
                    double z = sqrt(newRe * newRe + newIm * newIm);
                    int brightness = 256 * log2(1.75 + i - log2(log2(z))) / log2((double)ITERATIONS);
                    //color(brightness, brightness, 255)
                    pixels[yactual*IMAGEWIDTH+xactual][0] = brightness;
                    pixels[yactual*IMAGEWIDTH+xactual][1] = brightness;
                    pixels[yactual*IMAGEWIDTH+xactual][2] = 255;
                }
            }
    } //end of parallel region
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    fprintf(stderr, "Elapsed time: %.2lf seconds.\n", time_spent);
You could extend the implementation to leverage SIMD extensions. As far as I know, the latest OpenMP standard includes vector constructs. Check out this article that describes the new capabilities.
This whitepaper explains how SSE3 can be used when calculating the Mandelbrot set.
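For a flavor of what that looks like, here is a minimal sketch of OpenMP 4.0's simd construct; it is not taken from the article or whitepaper, and iterate_row, maxIter, and the argument names are illustrative. Because the escape-time loop has a data-dependent exit, the usual trick is to vectorize across pixels and let every lane run the full iteration count:
#include <omp.h>

/* Computes iteration counts for one row of pixels; each SIMD lane handles
   one pixel. Lanes that escape early keep iterating but stop updating
   their result, which keeps the loop free of data-dependent exits. */
void iterate_row(const double *pr, const double *pi, int *iters,
                 int width, int maxIter)
{
    #pragma omp simd
    for(int x = 0; x < width; x++)
    {
        double re = 0.0, im = 0.0;
        int count = 0;
        for(int k = 0; k < maxIter; k++)
        {
            double re2 = re * re - im * im + pr[x];
            im = 2.0 * re * im + pi[x];
            re = re2;
            if(re * re + im * im <= 4.0)
                count = k + 1;   /* still inside the radius-2 circle */
        }
        iters[x] = count;
    }
}
This does more arithmetic than the scalar version with break, so it only pays off when the vector width is large enough to amortize the wasted lanes.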

Is this part of a real IFFT process really optimal?

When calculating an (I)FFT, it is possible to handle "N*2 real" data points using an ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on stackoverflow already.
This can speed things up a bit when dealing only with such "real" data, which is often the case when working with, for example, sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look, I'm not even going to try to convince anyone I fully understand this, but thanks to the previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to a plain 2D rotation. I can't quite wrap my head around it, but it feels like this could be done with fewer trigonometric calls and fewer arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
pseudo-version:
INPUT
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
INDEX = 0 TO PI/2
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
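For reference, this routine appears to be an instance of the standard even/odd split used to pack a length-2N real transform into a length-N complex one. Up to sign and normalization conventions, which vary between libraries and between the forward and inverse directions, the textbook identity is

$$E_k = \tfrac{1}{2}\left(Z_k + \overline{Z_{N-k}}\right), \qquad O_k = \tfrac{1}{2i}\left(Z_k - \overline{Z_{N-k}}\right),$$
$$X_k = E_k + e^{\mp i\pi k/N}\, O_k,$$

where $Z$ is the length-N complex spectrum. The sum/difference pairs correspond to rsum/rdif and isum/idif above, and the $e^{\mp i\pi k/N}$ twiddle factor corresponds to the (-cos(INDEX), -sin(INDEX)) term, which is why the routine looks so much like a 2D rotation.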
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
    unsigned size = 1 << bits;
    unsigned shift = SINE_TABLE_BITS - (bits - 1);
    unsigned n = index;        // index for mag_lo, pha_lo
    unsigned z = size - index; // index for mag_hi, pha_hi
    int32_t rsum, rdif, isum, idif, r, i;
    r = smmulr(mag_lo, sine(pha_lo));   // mag_lo * sin(pha_lo)
    i = smmulr(mag_hi, sine(pha_hi));   // mag_hi * sin(pha_hi)
    rsum = r + i; rdif = r - i;
    r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
    i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
    isum = r + i; idif = r - i;
    r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
    i = -sinetable[index << shift];                      // sin(pi_c * (index / size) / 2)
    int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
    int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
    complex[n].r = rsum + rtmp;
    complex[n].i = itmp + idif;
    complex[z].r = rsum - rtmp;
    complex[z].i = itmp - idif;
}

// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
    unsigned fft_size = 1 << fft_bits;
    fft_sym_dc(complex, 0, 0); // sets dc bin [0]
    for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
        // calculation of amplitude/index (sawtooth) for both n and z
        fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
    }
}

CUDA: Why is accessing the same device array not coalesced?

I am posting drilled-down code for review. I believe it should compile and execute without any problems, but since I excluded all the irrelevant parts, I might have made some mistakes.
struct Users {
    double A[96];
    double B[32];
    double C[32];
};
This is my Users structure with fixed-length arrays. Below is the main function.
int main(int argc, char **argv) {
    int numUsers = 10;
    Users *users = new Users[numUsers];
    double Step[96];
    for (int i = 0; i < 32; i++) {
        Step[i] = 0.8;
        Step[i + 32] = 0.8;
        Step[i + 64] = 0.8;
    }
    for (int usr = 0; usr < numUsers; usr++) {
        for (int i = 0; i < 32; i++) {
            users[usr].A[i] = 10;
            users[usr].A[i + 32] = 20;
            users[usr].A[i + 64] = 30;
        }
        memset(users[usr].B, 0, sizeof(double) * 32);
        memset(users[usr].C, 0, sizeof(double) * 32);
    }
    double *d_Step;
    cudaMalloc((void**)&d_Step, sizeof(double) * 96);
    cudaMemcpy(d_Step, Step, sizeof(double) * 96, cudaMemcpyHostToDevice);
    Users *deviceUsers;
    cudaMalloc((void**)&deviceUsers, sizeof(Users) * numUsers);
    cudaMemcpy(deviceUsers, users, sizeof(Users) * numUsers, cudaMemcpyHostToDevice);
    dim3 grid;
    dim3 block;
    grid.x = 1;
    grid.y = 1;
    grid.z = 1;
    block.x = 32;
    block.y = 10;
    block.z = 1;
    calc<<<grid, block>>>(deviceUsers, d_Step, numUsers);
    delete[] users; // array form of delete, matching new[]
    return 0;
}
Please note that the Step array is a 1D array with 96 bins, and I am spanning 10 warps (32 threads in the x direction, with 10 of these in my block). Each warp will access the same Step array, as can be seen in the kernel below.
__global__ void calc(Users *users, double *Step, int numUsers) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    int uId = threadIdx.y;
    while (uId < numUsers) {
        double mean00 = users[uId].A[tId] * Step[tId];
        double mean01 = users[uId].A[tId + 32] * Step[tId + 32];
        double mean02 = users[uId].A[tId + 64] * Step[tId + 64];
        users[uId].A[tId] = (mean00 == 0 ? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0 ? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0 ? 0 : 1 / mean02);
        uId += 10;
    }
}
Now when I use the NVIDIA Visual Profiler, only 47% of the retrievals are coalesced. I investigated further and found that the Step array, which is accessed by every warp, causes this problem. If I replace it with some constant, the accesses are 100% coalesced.
Q1) As I understand it, coalesced accesses are linked to memory lines, i.e., the bytes accessed by a warp have to fall in lines that are a multiple of 32 bytes, whether the elements are integers or doubles. Why am I not getting coalesced accesses?
As far as I know, whenever CUDA assigns a memory block in device global memory, it assigns an aligned address to it. Thus, as long as the starting point + 32 locations are accessed by a warp, the access should be coalesced. Am I correct?
Hardware
Geforce GTX 470, Compute Capability 2.0
Your kernel reads Step 10 times from global memory. Although the L1 cache can reduce the actual traffic to global memory, the profiler still treats this as an inefficient access pattern.
My profiler calls this metric 'global load efficiency'. It doesn't say whether the access is coalesced or not.
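One way to avoid the repeated reads is sketched below; calc_shared is a hypothetical variant, not code from the question or answer. It stages Step in shared memory once per block, so all 10 warps read it from on-chip storage instead of re-fetching it from global memory:
__global__ void calc_shared(Users *users, const double *Step, int numUsers) {
    __shared__ double s_Step[96];
    int lid = threadIdx.y * blockDim.x + threadIdx.x; // 0..319 for a 32x10 block
    if (lid < 96)
        s_Step[lid] = Step[lid];      // one global read per element per block
    __syncthreads();

    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    for (int uId = threadIdx.y; uId < numUsers; uId += blockDim.y) {
        double mean00 = users[uId].A[tId]      * s_Step[tId];
        double mean01 = users[uId].A[tId + 32] * s_Step[tId + 32];
        double mean02 = users[uId].A[tId + 64] * s_Step[tId + 64];
        users[uId].A[tId]      = (mean00 == 0 ? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0 ? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0 ? 0 : 1 / mean02);
    }
}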

Node.js / coffeescript performance on a math-intensive algorithm

I am experimenting with node.js to build some server-side logic, and have implemented a version of the diamond-square algorithm described here in CoffeeScript and Java. Given all the praise I have heard for node.js and V8 performance, I was hoping that node.js would not lag too far behind the Java version.
However on a 4096x4096 map, Java finishes in under 1s but node.js/coffeescript takes over 20s on my machine...
These are my full results; the x-axis is grid size. [Log and linear charts omitted.]
Is this because there is something wrong with my coffeescript implementation, or is this just the nature of node.js still?
CoffeeScript
genHeightField = (sz) ->
  timeStart = new Date()
  DATA_SIZE = sz
  SEED = 1000.0
  data = new Array()
  iters = 0
  # warm up the arrays to tell the js engine these are dense arrays
  # seems to have negligible effect when running on node.js though
  for rows in [0...DATA_SIZE]
    data[rows] = new Array()
    for cols in [0...DATA_SIZE]
      data[rows][cols] = 0
  data[0][0] = data[0][DATA_SIZE-1] = data[DATA_SIZE-1][0] =
    data[DATA_SIZE-1][DATA_SIZE-1] = SEED
  h = 500.0
  sideLength = DATA_SIZE-1
  while sideLength >= 2
    halfSide = sideLength / 2
    for x in [0...DATA_SIZE-1] by sideLength
      for y in [0...DATA_SIZE-1] by sideLength
        avg = data[x][y] +
          data[x + sideLength][y] +
          data[x][y + sideLength] +
          data[x + sideLength][y + sideLength]
        avg /= 4.0
        data[x + halfSide][y + halfSide] =
          avg + Math.random() * (2 * h) - h
        iters++
        #console.log "A:" + x + "," + y
    for x in [0...DATA_SIZE-1] by halfSide
      y = (x + halfSide) % sideLength
      while y < DATA_SIZE-1
        avg =
          data[(x-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)][y] +
          data[(x+halfSide)%(DATA_SIZE-1)][y] +
          data[x][(y+halfSide)%(DATA_SIZE-1)] +
          data[x][(y-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)]
        avg /= 4.0
        avg = avg + Math.random() * (2 * h) - h
        data[x][y] = avg
        if x is 0
          data[DATA_SIZE-1][y] = avg
        if y is 0
          data[x][DATA_SIZE-1] = avg
        #console.log "B: " + x + "," + y
        y += sideLength
        iters++
    sideLength /= 2
    h /= 2.0
  #console.log iters
  console.log (new Date() - timeStart)

genHeightField(256+1)
genHeightField(512+1)
genHeightField(1024+1)
genHeightField(2048+1)
genHeightField(4096+1)
Java
import java.util.Random;

class Gen {
    public static void main(String args[]) {
        genHeight(256+1);
        genHeight(512+1);
        genHeight(1024+1);
        genHeight(2048+1);
        genHeight(4096+1);
    }

    public static void genHeight(int sz) {
        long timeStart = System.currentTimeMillis();
        int iters = 0;
        final int DATA_SIZE = sz;
        final double SEED = 1000.0;
        double[][] data = new double[DATA_SIZE][DATA_SIZE];
        data[0][0] = data[0][DATA_SIZE-1] = data[DATA_SIZE-1][0] =
            data[DATA_SIZE-1][DATA_SIZE-1] = SEED;
        double h = 500.0;
        Random r = new Random();
        for(int sideLength = DATA_SIZE-1;
            sideLength >= 2;
            sideLength /= 2, h /= 2.0) {
            int halfSide = sideLength/2;
            for(int x = 0; x < DATA_SIZE-1; x += sideLength) {
                for(int y = 0; y < DATA_SIZE-1; y += sideLength) {
                    double avg = data[x][y] +
                        data[x+sideLength][y] +
                        data[x][y+sideLength] +
                        data[x+sideLength][y+sideLength];
                    avg /= 4.0;
                    data[x+halfSide][y+halfSide] =
                        avg + (r.nextDouble()*2*h) - h;
                    iters++;
                    //System.out.println("A:" + x + "," + y);
                }
            }
            for(int x = 0; x < DATA_SIZE-1; x += halfSide) {
                for(int y = (x+halfSide)%sideLength; y < DATA_SIZE-1; y += sideLength) {
                    double avg =
                        data[(x-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)][y] +
                        data[(x+halfSide)%(DATA_SIZE-1)][y] +
                        data[x][(y+halfSide)%(DATA_SIZE-1)] +
                        data[x][(y-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)];
                    avg /= 4.0;
                    avg = avg + (r.nextDouble()*2*h) - h;
                    data[x][y] = avg;
                    if(x == 0) data[DATA_SIZE-1][y] = avg;
                    if(y == 0) data[x][DATA_SIZE-1] = avg;
                    iters++;
                    //System.out.println("B:" + x + "," + y);
                }
            }
        }
        //System.out.print(iters + " ");
        System.out.println(System.currentTimeMillis() - timeStart);
    }
}
As other answerers have pointed out, JavaScript's arrays are a major performance bottleneck for the type of operations you're doing. Because they're dynamic, it's naturally much slower to access elements than it is with Java's static arrays.
The good news is that there is an emerging standard for statically typed arrays in JavaScript, already supported in some browsers. Though not yet supported in Node proper, you can easily add them with a library: https://github.com/tlrobinson/v8-typed-array
After installing typed-array via npm, here's my modified version of your code:
{Float32Array} = require 'typed-array'

genHeightField = (sz) ->
  timeStart = new Date()
  DATA_SIZE = sz
  SEED = 1000.0
  iters = 0
  # Initialize 2D array of floats
  data = new Array(DATA_SIZE)
  for rows in [0...DATA_SIZE]
    data[rows] = new Float32Array(DATA_SIZE)
    for cols in [0...DATA_SIZE]
      data[rows][cols] = 0
  # The rest is the same...
The key line in there is the declaration of data[rows].
With the line data[rows] = new Array(DATA_SIZE) (essentially equivalent to the original), I get the benchmark numbers:
17
75
417
1376
5461
And with the line data[rows] = new Float32Array(DATA_SIZE), I get
19
47
215
855
3452
So that one small change cuts the running time down by about a third (5461 ms to 3452 ms on the largest grid), i.e. roughly a 1.6x speedup!
It's still not Java, but it's a pretty substantial improvement. Expect future versions of Node/V8 to narrow the performance gap further.
Caveat: It's got to be mentioned that normal JS numbers are double-precision, i.e. 64-bit floats. Using Float32Array will thus reduce precision, making this a bit of an apples-and-oranges comparison; I don't know how much of the performance improvement is from using 32-bit math, and how much is from faster array access. A Float64Array is part of the V8 spec, but isn't yet implemented in the v8-typed-array library.
If you're looking for performance in algorithms like this, both coffee/js and Java are the wrong languages to be using. Javascript is especially poor for problems like this because it does not have an array type - arrays are just hash maps where keys must be integers, which obviously will not be as quick as a real array. What you want is to write this algorithm in C and call that from node (see http://nodejs.org/docs/v0.4.10/api/addons.html). Unless you're really good at hand-optimizing machine code, good C will easily outstrip any other language.
Forget about CoffeeScript for a minute, because that's not the root of the problem. That code just gets compiled down to regular old JavaScript anyway when node runs it.
Just like any other javascript environment, node is single-threaded. The V8 engine is bloody fast, but for certain types of applications you might not be able to exceed the speed of the jvm.
I would first suggest trying to write out your diamond-square algorithm directly in JS before moving to CS. See what kinds of speed optimizations you can make.
Actually, I'm kind of interested in this problem now too and am going to take a look at doing this.
Edit #2: This is my second rewrite, with some optimizations such as pre-populating the data array. It's not significantly faster, but the code is a bit cleaner.
// corner_vals and height_range are assumed to be defined elsewhere
var makegrid = function(size){
    size++; //increment by 1
    var grid = [];
    grid.length = size;
    var gsize = size-1; //frequently used value in later calculations
    //setup grid array
    var len = size;
    while(len--){
        grid[len] = (new Array(size+1).join(0).split('')); //creates an array of length "size" where each index === '0'
    }
    //populate four corners of the grid
    grid[0][0] = grid[gsize][0] = grid[0][gsize] = grid[gsize][gsize] = corner_vals;
    var side_length = gsize;
    while(side_length >= 2){
        var half_side = Math.floor(side_length / 2);
        //generate new square values
        for(var x=0; x<gsize; x += side_length){
            for(var y=0; y<gsize; y += side_length){
                //average the four existing corners and add a random offset
                var avg = ((grid[x][y] + grid[x+side_length][y] + grid[x][y+side_length] + grid[x+side_length][y+side_length]) / 4) + (Math.random() * (2*height_range - height_range));
                //store the value for the center point
                grid[x+half_side][y+half_side] = Math.floor(avg);
            }
        }
        //generate diamond values
        for(var x=0; x<gsize; x+= half_side){
            for(var y=(x+half_side)%side_length; y<gsize; y+= side_length){
                var avg = Math.floor( ((grid[(x-half_side+gsize)%gsize][y] + grid[(x+half_side)%gsize][y] + grid[x][(y+half_side)%gsize] + grid[x][(y-half_side+gsize)%gsize]) / 4) + (Math.random() * (2*height_range - height_range)) );
                grid[x][y] = avg;
                if( x === 0) grid[gsize][y] = avg;
                if( y === 0) grid[x][gsize] = avg;
            }
        }
        side_length /= 2;
        height_range /= 2;
    }
    return grid;
}

makegrid(256)
makegrid(512)
makegrid(1024)
makegrid(2048)
makegrid(4096)
I have always assumed that when people describe JavaScript runtimes as 'fast', they mean relative to other interpreted, dynamic languages. A comparison to Ruby, Python, or Smalltalk would be interesting. Comparing JavaScript to Java is not a fair comparison.
To answer your question, I believe that the results you are seeing are indicative of what you can expect comparing these two vastly different languages.
