Shared memory n-body simulation in Chapel - parallel-processing

I'm trying to re-implement the shared-memory version of the n-body simulation presented in section 6.1.6 of An Introduction to Parallel Programming by Peter Pacheco, where it is implemented using OpenMP.
Here is my parallel implementation using OpenMP. And here is a serial implementation using Chapel. I'm having issues implementing a shared memory parallel implementation using Chapel. Since there is no way to get the rank of a thread in a forall loop, I can't use the same approach as in the OpenMP implementation. I would have to use the coforall loop, create the tasks and distribute the iterations manually. That does not seem practical and suggests that there is a more elegant way to solve this within Chapel.
I'm looking for guidance and suggestions on how to better solve this problem using the tools provided by Chapel.
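For reference, the manual approach I had in mind would look roughly like this (a sketch only, not tested; n_bodies is made a config const here just so the snippet stands alone):
// Roughly the manual approach described above (sketch only): one task per core,
// with the task index tid playing the role of omp_get_thread_num().
config const n_bodies = 8;
const numTasks = here.maxTaskPar;
coforall tid in 0..#numTasks {
  const chunk = (n_bodies + numTasks - 1) / numTasks;               // iterations per task, rounded up
  const myBodies = tid*chunk..min((tid+1)*chunk - 1, n_bodies - 1); // this task's block of bodies
  writeln("task ", tid, " handles bodies ", myBodies);
  // the per-body force computation would go here, indexed by tid as in the OpenMP code
}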

My suggestion would be to use a (+) reduce intent on forces in your forall-loop, which will give each task its own private copy of forces and then (sum) reduce their individual copies back into the original forces variable as the tasks complete. This would be done by attaching the following with-clause to your forall loop:
forall q in 0..#n_bodies with (+ reduce forces) {
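As a tiny standalone illustration of what the reduce intent does (this is not part of the n-body code): each task below gets its own private copy of histogram, and the per-task copies are summed back into the outer variable when the forall finishes.
var histogram: [0..#4] int;
forall i in 0..#1000 with (+ reduce histogram) {
  histogram[i % 4] += 1;   // updates a task-private copy, so there is no race
}
writeln(histogram);        // prints: 250 250 250 250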
While here, I looked for other ways to make the code a bit more elegant and would suggest changing from a 2D array to an array-of-arrays for this problem in order to collapse a bunch of trios of similar code statements for x, y, z components down to a single statement. I also made use of your pDomain variable and created a type alias for [0..#3] real in order to remove some redundancy in the code. Oh, and I removed the uses of the Math and IO modules because they are auto-used in Chapel programs.
Here's where that left me:
config const filename = "input.txt";
config const iterations = 100;
config const out_filename = "out.txt";
const X = 0;
const Y = 1;
const Z = 2;
const G = 6.67e-11;
config const dt = 0.1;
// Read input file, initialize bodies
var f = open(filename, iomode.r);
var reader = f.reader();
var n_bodies = reader.read(int);
const pDomain = {0..#n_bodies};
type vec3 = [0..#3] real;
var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;
for i in pDomain {
  positions[i] = reader.read(vec3);
  velocities[i] = reader.read(vec3);
  masses[i] = reader.read(real);
}
reader.close();
f.close();
for i in 0..#iterations {
  // Reset forces
  forces = [0.0, 0.0, 0.0];
  forall q in pDomain with (+ reduce forces) {
    for k in pDomain {
      if k <= q {
        continue;
      }
      var diff = positions[q] - positions[k];
      var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);
      var dist_cubed = dist**3;
      var tmp = -G * masses[q] * masses[k] / dist_cubed;
      var force_qk = tmp * diff;
      forces[q] += force_qk;
      forces[k] -= force_qk;
    }
  }
  forall q in pDomain {
    positions[q] += dt * velocities[q];
    velocities[q] += dt / masses[q] * forces[q];
  }
}
var outf = open(out_filename, iomode.cw);
var writer = outf.writer();
for q in pDomain {
  writer.writeln("%er %er %er %er %er %er".format(positions[q][X], positions[q][Y], positions[q][Z],
                                                  velocities[q][X], velocities[q][Y], velocities[q][Z]));
}
writer.close();
outf.close();
One other change you could consider making would be to replace the forall-loop that updates positions and velocities with the following whole-array statements:
positions += dt * velocities;
velocities += dt / masses * forces;
where the main tradeoff would be that the forall implements the statements in a fused manner using a single parallel loop, while the whole-array statements would not (at least in the current 1.18 version of the compiler).

Related

Is there a convenient way to fill a sparse array with random values in Chapel?

I am working on comparing matrix multiplication with and without locales, and I am trying to use sparse matrices to work with the linear algebra module. I am planning on using BlockDist and breaking the work out manually with loops, but first I want to see whether I can get a speedup using something simpler. If there is an easy way to work with BlockDist that I am overlooking, I would appreciate hearing about it. Anyway, I am able to get the code to work and to see a speedup when I fill the sparse arrays with a single value, but filling them with random values does not seem to work:
use LayoutCS;
use Time;
use LinearAlgebra, Norm;
use LinearAlgebra.Sparse;
use Random;
use IO;
writeln("Please type the filename with your matrix dimensions. One matrix on each line. The rows in the second need to match the columns in the first");
var filename: string;
filename = stdin.read(string);
// Open an input file with the specified filename in read mode.
var infile = open(filename, iomode.r);
var reader = infile.reader();
// Read the number of rows and columns in the array in from the file.
var r = reader.read(int), c = reader.read(int);
const parentDom = {1..r, 1..c};
var csrDom: sparse subdomain(parentDom) dmapped CS();
var A: [csrDom] real;
A = 2; //instead of this I would like to do something like fillRandom(A) but it seems to not work
var X: [1..r, 1..c] real;
fillRandom(X);
//read in the other matrix
var r1 = reader.read(int), c1 = reader.read(int);
const parentDom1 = {1..r1, 1..c1};
var csrDom1: sparse subdomain(parentDom1) dmapped CS();
var B: [csrDom1] real;
B = 3; //same thing as with matrix A
var Y: [1..r1, 1..c1] real;
fillRandom(Y);
// Close the file.
reader.close();
infile.close();
var t: Timer; //sets up timer
t.start();
var result: [1..r, 1..c1] real; //sets up matrix for results
forall i in 1..r do                      // goes through rows in 1st
  for j in 1..c1 do                      // goes through 2nd matrix columns
    for k in 1..c do {                   // goes through columns in 1st
      result[i, j] += X[i, k] * Y[k, j]; // adds the multiplications to the new slot in results
    }
t.stop();
writeln("multiplication took ", t.elapsed()," seconds");
t.clear();
t.start();
var res = A * B;
t.stop();
writeln("loc multiplication took ", t.elapsed()," seconds");
t.clear();
Does fillRandom not work with sparse arrays, or am I doing it incorrectly? Do I need to go through and manually assign each value in the array with a loop? It is also, of course, possible that I am on the completely wrong path and should look more at BlockDist; I would definitely appreciate any advice or tips on that, as I'm not quite sure how I would make sure each parallel task is actually running on the correct locale sections created by the BlockDist.
Thank you in advance!
Does fillRandom not work with sparse arrays or am I doing it incorrectly?
Random.fillRandom() does not support sparse arrays in Chapel today (1.22.0). However, I think this would be a reasonable feature request, if you would be interested in filing a GitHub issue on it.
Even if fillRandom supported sparse arrays, the user would still need to specify which elements are non-zero before filling those non-zero elements with random values. This is done by adding indices as (int, int) tuples (or arrays of (int, int) tuples) to the sparse domain.
Here is a small example of generating a 100x100 compressed sparse row (CSR) array with 10 non-zero random elements:
/* Example tested with Chapel 1.22 */
import LayoutCS.{CS, isCSType};
import Random;
/* Parent domain dimensions: NxN */
config const N = 100;
/* Number of non-zero elements */
config const nnz = 10;
const parentDom = {1..N, 1..N};
var csrDom: sparse subdomain(parentDom) dmapped CS();
var A: [csrDom] real;
populate(A, csrDom, nnz);
// Print non-zero elements
for (i, j) in csrDom do
  writeln((i, j), ':', A[i, j]);
/* Populate sparse matrix with `nnz` random values */
proc populate(ref A, ref ADom, nnz: int) where isCSType(ADom.dist.type) && isCSType(A.domain.dist.type) {
  // Generate array of random non-zero indices
  var indices: [1..nnz] 2*int;
  var randomIndices = Random.createRandomStream(eltType=int);
  // Replace any duplicate indices with a new random index
  for idx in indices {
    var newIdx = idx;
    while indices.find(newIdx)(1) {
      newIdx = (randomIndices.getNext(ADom.dim(0).low, ADom.dim(0).high),
                randomIndices.getNext(ADom.dim(1).low, ADom.dim(1).high));
    }
    idx = newIdx;
  }
  // Add the non-zero indices to the CSR domain
  ADom += indices;
  // Generate random elements - maybe fillRandom(A) could do this for us some day
  var randomElements = Random.createRandomStream(eltType=A.eltType);
  for idx in ADom {
    A[idx] = randomElements.getNext();
  }
}

AS3: Inefficient to make too many references across classes?

I was told once that I should avoid referencing properties and such of my main game class from other classes as much as possible because it's an inefficient thing to do. Is this actually true? For a trivial example, in my Character class would
MainGame.property = something;
MainGame.property2 = something2;
MainGame.property3 = something3;
//etc.
take more time to execute than putting the same in a function
MainGame.function1();
and calling that (thereby needing to "open" that class only once rather than multiple times)? By the same logic, would
something = MainGame.property;
something2 = MainGame.property;
be slightly less efficient than
variable = MainGame.property;
something = variable;
something2 = variable;?
Think of references as operations. Every "." is an operation, every function call "()" is an operation (and a heavy one), and so is every "+", "-", and so on.
So, indeed, certain approaches will generate more efficient code than others.
But: to feel the result of an inefficient approach, you need a program that performs millions of operations (or correspondingly heavy tasks) each frame. If you are not parsing megabytes of binary data, converting bitmaps, or the like, you need not worry about performance. That said, caring about performance and efficiency is generally a good habit.
If you want to know how efficient your code actually is, measure its performance:
var aTime:int;
var a:int = 10;
var b:int = 20;
var c:int;
var i:int;
var O:Object;
aTime = getTimer();
O = new Object;
for (i = 0; i < 1000000; i++)
{
    O.ab = a + b;
    O.ba = a + b;
}
trace("Test 1. Elapsed", getTimer() - aTime, "ms.");

aTime = getTimer();
O = new Object;
c = a + b;
for (i = 0; i < 1000000; i++)
{
    O.ab = c;
    O.ba = c;
}
trace("Test 2. Elapsed", getTimer() - aTime, "ms.");

aTime = getTimer();
O = new Object;
for (i = 0; i < 1000000; i++)
{
    O['ab'] = a + b;
    O['ba'] = a + b;
}
trace("Test 3. Elapsed", getTimer() - aTime, "ms.");
Be prepared to run million-iteration loops so that total execution time exceeds 1 second, otherwise the precision will be too low.

Mesh Simplification with Assimp and OpenMesh

A few days ago I asked a question about how to use edge collapse with Assimp. Smoothing the obj and removing duplicated vertices in the modeling software solved the basic problems that kept edge collapse from working; I say it works because the mesh can be simplified by MeshLab like this:
It looks good in MeshLab, but then I do it in my engine, which uses Assimp and OpenMesh. The problem is that Assimp imports duplicated vertices and indices, which can leave a half-edge without its opposite pair (is this what is called non-manifold?).
This result snapshot uses OpenMesh's quadric decimation:
To pin down the problem, I ran it without decimation and converted the OpenMesh data structure straight back. Everything works fine as expected (I mean the result without decimation).
The code that I used to decimate the mesh:
Loader::BasicData Loader::TestEdgeCollapse(float vertices[], int vertexLength, int indices[], int indexLength, float texCoords[], int texCoordLength, float normals[], int normalLength)
{
// Mesh type
typedef OpenMesh::TriMesh_ArrayKernelT<> OPMesh;
// Decimater type
typedef OpenMesh::Decimater::DecimaterT< OPMesh > OPDecimater;
// Decimation Module Handle type
typedef OpenMesh::Decimater::ModQuadricT< OPMesh >::Handle HModQuadric;
OPMesh mesh;
std::vector<OPMesh::VertexHandle> vhandles;
int iteration = 0;
for (int i = 0; i < vertexLength; i += 3)
{
vhandles.push_back(mesh.add_vertex(OpenMesh::Vec3f(vertices[i], vertices[i + 1], vertices[i + 2])));
if (texCoords != nullptr)
mesh.set_texcoord2D(vhandles.back(),OpenMesh::Vec2f(texCoords[iteration * 2], texCoords[iteration * 2 + 1]));
if (normals != nullptr)
mesh.set_normal(vhandles.back(), OpenMesh::Vec3f(normals[i], normals[i + 1], normals[i + 2]));
iteration++;
}
for (int i = 0; i < indexLength; i += 3)
mesh.add_face(vhandles[indices[i]], vhandles[indices[i + 1]], vhandles[indices[i + 2]]);
OPDecimater decimater(mesh);
HModQuadric hModQuadric;
decimater.add(hModQuadric);
decimater.module(hModQuadric).unset_max_err();
decimater.initialize();
//decimater.decimate(); // without this, everything is fine as expected.
mesh.garbage_collection();
int verticesSize = mesh.n_vertices() * 3;
float* newVertices = new float[verticesSize];
int indicesSize = mesh.n_faces() * 3;
int* newIndices = new int[indicesSize];
float* newTexCoords = nullptr;
int texCoordSize = mesh.n_vertices() * 2;
if(mesh.has_vertex_texcoords2D())
newTexCoords = new float[texCoordSize];
float* newNormals = nullptr;
int normalSize = mesh.n_vertices() * 3;
if(mesh.has_vertex_normals())
newNormals = new float[normalSize];
Loader::BasicData data;
int index = 0;
for (auto v_it = mesh.vertices_begin(); v_it != mesh.vertices_end(); ++v_it)
{
OpenMesh::Vec3f &point = mesh.point(*v_it);
newVertices[index * 3] = point[0];
newVertices[index * 3 + 1] = point[1];
newVertices[index * 3 + 2] = point[2];
if (mesh.has_vertex_texcoords2D())
{
auto &tex = mesh.texcoord2D(*v_it);
newTexCoords[index * 2] = tex[0];
newTexCoords[index * 2 + 1] = tex[1];
}
if (mesh.has_vertex_normals())
{
auto &normal = mesh.normal(*v_it);
newNormals[index * 3] = normal[0];
newNormals[index * 3 + 1] = normal[1];
newNormals[index * 3 + 2] = normal[2];
}
index++;
}
index = 0;
for (auto f_it = mesh.faces_begin(); f_it != mesh.faces_end(); ++f_it)
for (auto fv_it = mesh.fv_ccwiter(*f_it); fv_it.is_valid(); ++fv_it)
{
int id = fv_it->idx();
newIndices[index] = id;
index++;
}
data.Indices = newIndices;
data.IndicesLength = indicesSize;
data.Vertices = newVertices;
data.VerticesLength = verticesSize;
data.TexCoords = nullptr;
data.TexCoordLength = -1;
data.Normals = nullptr;
data.NormalLength = -1;
if (mesh.has_vertex_texcoords2D())
{
data.TexCoords = newTexCoords;
data.TexCoordLength = texCoordSize;
}
if (mesh.has_vertex_normals())
{
data.Normals = newNormals;
data.NormalLength = normalSize;
}
return data;
}
I also provide the tree obj I tested with, and the face data generated by Assimp (fetched from the Visual Studio debugger), which shows the problem that some of the indices cannot find their opposite pair.
After a few weeks of thinking about this and failing, I initially wanted some academic/mathematical solution for automatically generating these decimated meshes, but now I'm looking for a simpler way to implement this. What I can do is change the structure to load multiple objects (file.obj) into a single custom object (class obj) and switch between them when needed. The benefit of this is that I can manage what should be displayed and sidestep any algorithmic problems.
By the way, here are some of the obstacles that pushed me back to the simple way (a small welding sketch follows this list):
Assimp produces per-face unique indices and vertices. Nothing is wrong with that in itself, but it means the adjacency half-edge structure the algorithm needs cannot be built from it.
OpenMesh can read the object file (*.obj) itself via the read_mesh function, but the disadvantage is the lack of documented examples, and it is hard to use inside my engine.
Writing a custom 3D model importer for every format is hard.
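One workaround I could imagine for the first obstacle (offered only as an untested sketch) is to weld Assimp's duplicated vertices by position before adding them to OpenMesh, so that neighbouring faces share vertex handles and each half-edge gets its opposite pair. The snippet below would replace the first vertex loop in TestEdgeCollapse; it assumes the same OPMesh typedef, mesh variable, and raw arrays, and needs <map> and <tuple> included:
// Weld duplicated positions: map each (x, y, z) to the first handle created for it,
// so adjacent faces share vertices and every half-edge gets its opposite pair.
// Note the exact float comparison; a quantized key may be needed in practice.
std::map<std::tuple<float, float, float>, OPMesh::VertexHandle> welded;
std::vector<OPMesh::VertexHandle> vhandles;
for (int i = 0; i < vertexLength; i += 3)
{
    auto key = std::make_tuple(vertices[i], vertices[i + 1], vertices[i + 2]);
    auto found = welded.find(key);
    if (found == welded.end())
    {
        // first time this position is seen: create a new OpenMesh vertex
        OPMesh::VertexHandle vh = mesh.add_vertex(
            OpenMesh::Vec3f(vertices[i], vertices[i + 1], vertices[i + 2]));
        welded[key] = vh;
        vhandles.push_back(vh);
    }
    else
    {
        vhandles.push_back(found->second);  // duplicate position: reuse the existing handle
    }
}
// faces are then added exactly as before, via vhandles[indices[i]], vhandles[indices[i + 1]], ...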
In conclusion, there are two ways to make level of detail work in an engine: one is to use a mesh simplification algorithm plus more testing to ensure quality; the other is simply to switch between 3D models made in 3D software. The latter is not automatic, but it is stable. I use the second method, and I show the result here :)
However, this is not a real solution to my question, so I won't mark my own post as the answer.

Node.js / coffeescript performance on a math-intensive algorithm

I am experimenting with node.js to build some server-side logic, and have implemented a version of the diamond-square algorithm described here in coffeescript and Java. Given all the praise I have heard for node.js and V8 performance, I was hoping that node.js would not lag too far behind the java version.
However on a 4096x4096 map, Java finishes in under 1s but node.js/coffeescript takes over 20s on my machine...
These are my full results. x-axis is grid size. Log and linear charts:
Is this because there is something wrong with my coffeescript implementation, or is this just the nature of node.js still?
Coffeescript
genHeightField = (sz) ->
  timeStart = new Date()
  DATA_SIZE = sz
  SEED = 1000.0
  data = new Array()
  iters = 0
  # warm up the arrays to tell the js engine these are dense arrays
  # seems to have negligible effect when running on node.js though
  for rows in [0...DATA_SIZE]
    data[rows] = new Array();
    for cols in [0...DATA_SIZE]
      data[rows][cols] = 0
  data[0][0] = data[0][DATA_SIZE-1] = data[DATA_SIZE-1][0] =
    data[DATA_SIZE-1][DATA_SIZE-1] = SEED;
  h = 500.0
  sideLength = DATA_SIZE-1
  while sideLength >= 2
    halfSide = sideLength / 2
    for x in [0...DATA_SIZE-1] by sideLength
      for y in [0...DATA_SIZE-1] by sideLength
        avg = data[x][y] +
          data[x + sideLength][y] +
          data[x][y + sideLength] +
          data[x + sideLength][y + sideLength]
        avg /= 4.0;
        data[x + halfSide][y + halfSide] =
          avg + Math.random() * (2 * h) - h;
        iters++
        #console.log "A:" + x + "," + y
    for x in [0...DATA_SIZE-1] by halfSide
      y = (x + halfSide) % sideLength
      while y < DATA_SIZE-1
        avg =
          data[(x-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)][y] +
          data[(x+halfSide)%(DATA_SIZE-1)][y] +
          data[x][(y+halfSide)%(DATA_SIZE-1)] +
          data[x][(y-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)]
        avg /= 4.0;
        avg = avg + Math.random() * (2 * h) - h;
        data[x][y] = avg;
        if x is 0
          data[DATA_SIZE-1][y] = avg;
        if y is 0
          data[x][DATA_SIZE-1] = avg;
        #console.log "B: " + x + "," + y
        y += sideLength
        iters++
    sideLength /= 2
    h /= 2.0
  #console.log iters
  console.log (new Date() - timeStart)

genHeightField(256+1)
genHeightField(512+1)
genHeightField(1024+1)
genHeightField(2048+1)
genHeightField(4096+1)
Java
import java.util.Random;

class Gen {
    public static void main(String args[]) {
        genHeight(256+1);
        genHeight(512+1);
        genHeight(1024+1);
        genHeight(2048+1);
        genHeight(4096+1);
    }

    public static void genHeight(int sz) {
        long timeStart = System.currentTimeMillis();
        int iters = 0;
        final int DATA_SIZE = sz;
        final double SEED = 1000.0;
        double[][] data = new double[DATA_SIZE][DATA_SIZE];
        data[0][0] = data[0][DATA_SIZE-1] = data[DATA_SIZE-1][0] =
            data[DATA_SIZE-1][DATA_SIZE-1] = SEED;
        double h = 500.0;
        Random r = new Random();
        for (int sideLength = DATA_SIZE-1;
             sideLength >= 2;
             sideLength /= 2, h /= 2.0) {
            int halfSide = sideLength/2;
            for (int x = 0; x < DATA_SIZE-1; x += sideLength) {
                for (int y = 0; y < DATA_SIZE-1; y += sideLength) {
                    double avg = data[x][y] +
                                 data[x+sideLength][y] +
                                 data[x][y+sideLength] +
                                 data[x+sideLength][y+sideLength];
                    avg /= 4.0;
                    data[x+halfSide][y+halfSide] =
                        avg + (r.nextDouble()*2*h) - h;
                    iters++;
                    //System.out.println("A:" + x + "," + y);
                }
            }
            for (int x = 0; x < DATA_SIZE-1; x += halfSide) {
                for (int y = (x+halfSide)%sideLength; y < DATA_SIZE-1; y += sideLength) {
                    double avg =
                        data[(x-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)][y] +
                        data[(x+halfSide)%(DATA_SIZE-1)][y] +
                        data[x][(y+halfSide)%(DATA_SIZE-1)] +
                        data[x][(y-halfSide+DATA_SIZE-1)%(DATA_SIZE-1)];
                    avg /= 4.0;
                    avg = avg + (r.nextDouble()*2*h) - h;
                    data[x][y] = avg;
                    if (x == 0) data[DATA_SIZE-1][y] = avg;
                    if (y == 0) data[x][DATA_SIZE-1] = avg;
                    iters++;
                    //System.out.println("B:" + x + "," + y);
                }
            }
        }
        //System.out.print(iters + " ");
        System.out.println(System.currentTimeMillis() - timeStart);
    }
}
As other answerers have pointed out, JavaScript's arrays are a major performance bottleneck for the type of operations you're doing. Because they're dynamic, it's naturally much slower to access elements than it is with Java's static arrays.
The good news is that there is an emerging standard for statically typed arrays in JavaScript, already supported in some browsers. Though not yet supported in Node proper, you can easily add them with a library: https://github.com/tlrobinson/v8-typed-array
After installing typed-array via npm, here's my modified version of your code:
{Float32Array} = require 'typed-array'

genHeightField = (sz) ->
  timeStart = new Date()
  DATA_SIZE = sz
  SEED = 1000.0
  iters = 0
  # Initialize 2D array of floats
  data = new Array(DATA_SIZE)
  for rows in [0...DATA_SIZE]
    data[rows] = new Float32Array(DATA_SIZE)
    for cols in [0...DATA_SIZE]
      data[rows][cols] = 0
  # The rest is the same...
The key line in there is the declaration of data[rows].
With the line data[rows] = new Array(DATA_SIZE) (essentially equivalent to the original), I get the benchmark numbers:
17
75
417
1376
5461
And with the line data[rows] = new Float32Array(DATA_SIZE), I get
19
47
215
855
3452
So that one small change cuts the running time down by about 1/3, i.e. a 50% speed increase!
It's still not Java, but it's a pretty substantial improvement. Expect future versions of Node/V8 to narrow the performance gap further.
Caveat: It should be mentioned that normal JS numbers are double-precision, i.e. 64-bit floats. Using Float32Array will thus reduce precision, making this a bit of an apples-and-oranges comparison; I don't know how much of the performance improvement is from using 32-bit math, and how much is from faster array access. (A Float64Array is part of the V8 spec, but isn't yet implemented in the v8-typed-array library.)
If you're looking for performance in algorithms like this, both coffee/js and Java are the wrong languages to be using. Javascript is especially poor for problems like this because it does not have an array type - arrays are just hash maps where keys must be integers, which obviously will not be as quick as a real array. What you want is to write this algorithm in C and call that from node (see http://nodejs.org/docs/v0.4.10/api/addons.html). Unless you're really good at hand-optimizing machine code, good C will easily outstrip any other language.
Forget about Coffeescript for a minute, because that's not the root of the problem. That code just gets written to regular old javascript anyway when node runs it.
Just like any other javascript environment, node is single-threaded. The V8 engine is bloody fast, but for certain types of applications you might not be able to exceed the speed of the jvm.
I would first suggest trying to write out your diamond-square algorithm directly in JS before moving to CS. See what kinds of speed optimizations you can make.
Actually, I'm kind of interested in this problem now too and am going to take a look at doing this.
Edit #2: This is my second re-write with some optimizations, such as pre-populating the data array. It's not significantly faster, but the code is a bit cleaner.
var makegrid = function(size){
    size++; //increment by 1
    var grid = [];
    grid.length = size;
    var gsize = size-1; //frequently used value in later calculations.
    var corner_vals = 1000;  //seed value for the four corners (undefined in the original; matches the question's SEED)
    var height_range = 500;  //initial random displacement range (undefined in the original; matches the question's h)
    //setup grid array
    var len = size;
    while(len--){
        grid[len] = (new Array(size+1).join(0).split('')); //creates an array of length "size" where each index === 0
    }
    //populate four corners of the grid
    grid[0][0] = grid[gsize][0] = grid[0][gsize] = grid[gsize][gsize] = corner_vals;
    var side_length = gsize;
    while(side_length >= 2){
        var half_side = Math.floor(side_length / 2);
        //generate new square values
        for(var x=0; x<gsize; x += side_length){
            for(var y=0; y<gsize; y += side_length){
                //calculate average of existing corners plus a random offset
                var avg = ((grid[x][y] + grid[x+side_length][y] + grid[x][y+side_length] + grid[x+side_length][y+side_length]) / 4) + (Math.random() * (2*height_range - height_range));
                //store the value for the center point
                grid[x+half_side][y+half_side] = Math.floor(avg);
            }
        }
        //generate diamond values
        for(var x=0; x<gsize; x+= half_side){
            for(var y=(x+half_side)%side_length; y<gsize; y+= side_length){
                var avg = Math.floor( ((grid[(x-half_side+gsize)%gsize][y] + grid[(x+half_side)%gsize][y] + grid[x][(y+half_side)%gsize] + grid[x][(y-half_side+gsize)%gsize]) / 4) + (Math.random() * (2*height_range - height_range)) );
                grid[x][y] = avg;
                if( x === 0) grid[gsize][y] = avg;
                if( y === 0) grid[x][gsize] = avg;
            }
        }
        side_length /= 2;
        height_range /= 2;
    }
    return grid;
}
makegrid(256)
makegrid(512)
makegrid(1024)
makegrid(2048)
makegrid(4096)
I have always assumed that when people describe JavaScript runtimes as 'fast', they mean relative to other interpreted, dynamic languages. A comparison to Ruby, Python or Smalltalk would be interesting. Comparing JavaScript to Java is not a fair comparison.
To answer your question, I believe that the results you are seeing are indicative of what you can expect comparing these two vastly different languages.

Least Squares solution to simultaneous equations

I am trying to fit a transformation from one set of coordinates to another.
x' = R + Px + Qy
y' = S - Qx + Py
where P, Q, R, S are constants, P = scale*cos(rotation), Q = scale*sin(rotation).
There is a well known 'by hand' formula for fitting P,Q,R,S to a set of corresponding points.
But I need to have an error estimate on the fit - so I need a least squares solution.
I've read 'Numerical Recipes', but I'm having trouble working out how to do this for data sets with both x and y in them.
Can anyone point to an example/tutorial/code sample of how to do this?
I'm not too bothered about the language.
But "just use the built-in feature of Matlab/LAPACK/numpy/R" is probably not helpful!
edit:
I have a large set of old(x,y) new(x,y) to fit to. The problem is overdetermined (more data points than unknowns) so simple matrix inversion isn't enough - and as I said I really need the error on the fit.
The following code should do the trick. I used the following formula for the residuals:
residual[i] = (computed_x[i] - actual_x[i])^2
+ (computed_y[i] - actual_y[i])^2
And then derived the least-squares formulae based on the general procedure described at Wolfram's MathWorld.
I tested out this algorithm in Excel and it performs as expected. I used a collection of ten random points which were then rotated, translated, and scaled by a randomly generated transformation matrix.
With no random noise applied to the output data, this program produces four parameters (P, Q, R, and S) which are identical to the input parameters, and an rSquared value of zero.
As more and more random noise is applied to the output points, the constants start to drift away from the correct values, and the rSquared value increases accordingly.
Here is the code:
// test data
const int N = 1000;
float oldPoints_x[N] = { ... };
float oldPoints_y[N] = { ... };
float newPoints_x[N] = { ... };
float newPoints_y[N] = { ... };
// compute various sums and sums of products
// across the entire set of test data
float Ex = Sum(oldPoints_x, N);
float Ey = Sum(oldPoints_y, N);
float Exn = Sum(newPoints_x, N);
float Eyn = Sum(newPoints_y, N);
float Ex2 = SumProduct(oldPoints_x, oldPoints_x, N);
float Ey2 = SumProduct(oldPoints_y, oldPoints_y, N);
float Exxn = SumProduct(oldPoints_x, newPoints_x, N);
float Exyn = SumProduct(oldPoints_x, newPoints_y, N);
float Eyxn = SumProduct(oldPoints_y, newPoints_x, N);
float Eyyn = SumProduct(oldPoints_y, newPoints_y, N);
// compute the transformation constants
// using least-squares regression
float divisor = Ex*Ex + Ey*Ey - N*(Ex2 + Ey2);
float P = (Exn*Ex + Eyn*Ey - N*(Exxn + Eyyn))/divisor;
float Q = (Exn*Ey + Eyn*Ex + N*(Exyn - Eyxn))/divisor;
float R = (Exn - P*Ex - Q*Ey)/N;
float S = (Eyn - P*Ey + Q*Ex)/N;
// compute the rSquared error value
// low values represent a good fit
float rSquared = 0;
float x;
float y;
for (int i = 0; i < N; i++)
{
    x = R + P*oldPoints_x[i] + Q*oldPoints_y[i];
    y = S - Q*oldPoints_x[i] + P*oldPoints_y[i];
    rSquared += (x - newPoints_x[i]) * (x - newPoints_x[i]);
    rSquared += (y - newPoints_y[i]) * (y - newPoints_y[i]);
}
To find P, Q, R, and S, you can use least squares. I think the confusing thing is that the usual description of least squares uses x and y, but they don't match the x and y in your problem. You just need to translate your problem carefully into the least-squares framework. In your case the independent variables are the untransformed coordinates x and y, the dependent variables are the transformed coordinates x' and y', and the adjustable parameters are P, Q, R, and S. (If this isn't clear enough, let me know and I'll post more detail.)
Once you've found P, Q, R, and S, then scale = sqrt(P^2 + Q^2) and you can then find the rotation from sin rotation = Q / scale and cos rotation = P / scale.
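A minimal sketch of that last step (assuming P and Q hold the fitted constants, as in the code above):
#include <cmath>

// Recover scale and rotation from the fitted constants.
double scale    = std::sqrt(P*P + Q*Q);
double rotation = std::atan2(Q, P);   // radians; sin(rotation) = Q/scale, cos(rotation) = P/scale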
You can use the levmar program to calculate this. It's tested and integrated into multiple products, including mine. It's licensed under the GPL, but if this is a non-open-source project, the author will change the license for you (for a fee).
Define the 3x3 matrix T(P,Q,R,S) such that (x',y',1) = T (x,y,1). Then compute
A = \sum_i |(T (x_i,y_i,1)) - (x'_i,y'_i,1)|^2
and minimize A against (P,Q,R,S).
Coding this yourself is a medium to large sized project unless you can guarantee that the data are well conditioned, especially when you want good error estimates out of the procedure. You're probably best off using an existing minimizer that supports error estimates.
Particle physics types would use MINUIT, either directly from CERNLIB (with the coding most easily done in Fortran 77) or from ROOT (with the coding in C++, or it should be accessible through the Python bindings). But that is a big installation if you don't have one of these tools already.
I'm sure that others can suggest other minimizers.
Thanks eJames, that's almost exactly what I have. I coded it from an old army surveying manual that was based on an earlier "Instructions to Surveyors" note that must be 100 years old! (It uses N and E for North and East rather than x/y.)
The goodness of fit parameter will be very useful - I can interactively throw out selected points if they make the fit worse.
void FindTransformation(vector<Point2D> known, vector<Point2D> unknown)
{
    // accumulators
    double sum_e = 0, sum_n = 0, sum_E = 0, sum_N = 0;
    double sum_deE = 0, sum_dnN = 0, sum_dee = 0, sum_dnn = 0;
    double sum_dnE = 0, sum_deN = 0;
    int n = 0;
    // sums
    for (unsigned int ii = 0; ii < known.size(); ii++) {
        sum_e += unknown[ii].x;
        sum_n += unknown[ii].y;
        sum_E += known[ii].x;
        sum_N += known[ii].y;
        ++n;
    }
    // mean position
    double me = sum_e/(double)n;
    double mn = sum_n/(double)n;
    double mE = sum_E/(double)n;
    double mN = sum_N/(double)n;
    // differences
    for (unsigned int ii = 0; ii < known.size(); ii++) {
        double de = unknown[ii].x - me;
        double dn = unknown[ii].y - mn;
        // for P
        sum_deE += (de*known[ii].x);
        sum_dnN += (dn*known[ii].y);
        sum_dee += (de*unknown[ii].x);
        sum_dnn += (dn*unknown[ii].y);
        // for Q
        sum_dnE += (dn*known[ii].x);
        sum_deN += (de*known[ii].y);
    }
    double P = (sum_deE + sum_dnN) / (sum_dee + sum_dnn);
    double Q = (sum_dnE - sum_deN) / (sum_dee + sum_dnn);
    double R = mE - (P*me) - (Q*mn);
    double S = mN + (Q*me) - (P*mn);
}
One issue is that numeric code like this is often tricky. Even when the algorithms are straightforward, there are often problems that show up in actual computation.
For that reason, if there is a system you can get easily that has a built-in feature, it might be best to use that.
