I have no idea why something like this should be slow:
steps=500
samples=100000
s_0=2.1
r=.02
sigma=.2
k=1.9
at<-matrix(nrow=(steps+1),ncol=samples)
at[1,]=s_0
for(j in 1:samples)
{
for(i in 2:(steps+1))
{
at[i,j]=at[(i-1),j] + sigma*sqrt(.0008)*rnorm(1)
}
}
I tried to rewrite this using sapply, but it was still awful from a performance standpoint.
Am I missing something here? This would be seconds in c++ or even the bloated c#.
R can vectorize certain operations. In your case you can get rid of the outer loop by making the following change.
for(i in 2:(steps + 1))
{
at[i,] = at[(i - 1),] + sigma * sqrt(.0008) * rnorm(samples)
}
According to system.time, the original version for samples = 1000 takes 6.83s, while the modified one takes 0.09s.
How about:
at <- s_0 + rbind(0, apply(matrix(rnorm(samples*steps,sd=sigma*sqrt(8e-4)),
ncol=samples),
2,
cumsum))
(Haven't tested this carefully yet, but I think it should be right, and much faster.)
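For readers who want to see the cumulative-sum idea outside R, here is a minimal sketch in Python using only the standard library. It builds a single path; the names `simulate_path` and `dt` are my own and not from the question, and the point is only that each path is `s_0` plus a running sum of independent normal increments.

```python
import random
from itertools import accumulate

def simulate_path(steps, s_0, sigma, dt=0.0008, rng=random):
    """Return one path of length steps + 1: s_0 followed by running sums."""
    sd = sigma * dt ** 0.5
    # One normal increment per step, all drawn up front
    increments = [rng.gauss(0.0, sd) for _ in range(steps)]
    # accumulate(..., initial=s_0) yields s_0, s_0+e1, s_0+e1+e2, ...
    return list(accumulate(increments, initial=s_0))

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
path = simulate_path(500, 2.1, 0.2, rng=rng)
```

The first element is exactly `s_0` and the path has `steps + 1` entries, which is the layout of one column of `at` in the question.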
To write fast R code, you really need to re-think how you write functions. You want to operate on entire vectors, not just single observations at a time.
If you're really dead set on writing C-style loops, you could also try out Rcpp. It could be handy if you're well accustomed to C++ and prefer writing functions that way.
library(Rcpp)
do_stuff <- cppFunction('NumericMatrix do_stuff(
int steps,
int samples,
double s_0,
double r,
double sigma,
double k ) {
// Ensure RNG scope set
RNGScope scope;
// allocate the output matrix
NumericMatrix at( steps+1, samples );
// fill the first row
for( int i=0; i < at.ncol(); i++ ) {
at(0, i) = s_0;
}
// loop over the matrix and do stuff
for( int j=0; j < samples; j++ ) {
for( int i=1; i < steps+1; i++ ) {
at(i, j) = at(i-1, j) + sigma * sqrt(0.0008) * R::rnorm(0, 1);
}
}
return at;
}')
system.time( out <- do_stuff(500, 100000, 2.1, 0.02, 0.2, 1.9) )
gives me
user system elapsed
3.205 0.092 3.297
So, if you've already got some C++ background, consider learning how to use Rcpp to map data to and from R.
So I need to solve for the linear system (A + i * mu * I) x = b, where A is dense Hermitian matrix (6x6 complex numbers), mu is a real scalar, and I is identity matrix.
Obviously if mu=0, I should just use Cholesky and be done with it. With non-zero mu though, the matrix ceases to be Hermitian and Cholesky fails.
Possible solutions:
Solve normal operator using Cholesky and multiply by the conjugate
Solve directly using LU decomposition
This is in a time-critical performance routine, where I need the most efficient method. Any thoughts on the optimum approach, or if there is a specific method for solving the above shifted Hermitian system?
This is to be deployed in a CUDA kernel, where I'll be solving many linear systems in parallel, e.g., one per thread. This means that I need a solution that minimizes thread divergence. Given the small system size, pivoting can be ignored without too much issue: this removes a possible source of thread divergence. I've already implemented an in-place Cholesky normal method, and while it's working ok, the performance isn't great in double precision.
I can't vouch for the stability of the method below, but if your matrix is reasonably well conditioned, it might be worth a try.
We want to solve
A*X = B
If we pick out the first row and column, say
A = ( a  y  )
    ( z  A_ )
X = ( x  )
    ( X_ )
B = ( b  )
    ( B_ )
The requirement is
a*x + y*X_ = b
z*x + A_*X_ = B_
so
x = (b - y*X_ )/a
(A_ - zy/a) * X_ = B_ - (b/a)z
The solution goes in two stages. First use the second equation to transform A and B, then use the first to form the solution x.
In C:
static void nhsol( int dim, complx* A, complx* B, complx* X)
{
int i, j, k;
complx a, fb, fa;
complx* z;
complx* acol;
// update A and B
for( i=0; i<dim; ++i)
{ z = A + i*dim;
a = z[i];
// update B
fb = B[i]/a;
for( j=i+1; j<dim; ++j)
{ B[j] -= fb*z[j];
}
// update A
for( k=i+1; k<dim; ++k)
{ acol = A + k*dim;
fa = acol[i]/a;
for( j=i+1; j<dim; ++j)
{ acol[j] -= fa*z[j];
}
}
}
// compute x
i = dim-1;
X[i] = B[i] / A[i+dim*i];
while( --i>=0)
{
complx s = B[i];
for( j=i+1; j<dim; ++j)
{ s -= A[i+j*dim]*X[j];
}
X[i] = s/A[i+i*dim];
}
}
where
typedef _Complex double complx;
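As a cross-check of the two stages (eliminate, then back-substitute), here is a small Python sketch of the same no-pivoting elimination. It is not a port of the C routine: it stores A as a list of rows and copies its inputs, and the 2x2 shifted Hermitian system at the bottom is a made-up example, not data from the question.

```python
def solve_no_pivot(A, b):
    """Solve A x = b by Gaussian elimination without pivoting.

    A is a list of rows of complex numbers, b a list; both are copied.
    Assumes every pivot A[i][i] met along the way is non-zero.
    """
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    # Stage 1: eliminate below the diagonal, updating A and b together
    for i in range(n):
        a = A[i][i]
        for j in range(i + 1, n):
            f = A[j][i] / a
            b[j] -= f * b[i]
            for k in range(i + 1, n):
                A[j][k] -= f * A[i][k]
    # Stage 2: back-substitution on the resulting upper triangle
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= A[i][j] * x[j]
        x[i] = s / A[i][i]
    return x

# Hypothetical example: Hermitian H shifted by i*mu on the diagonal
mu = 0.5
H = [[2 + 0j, 1 - 1j], [1 + 1j, 3 + 0j]]
A = [[H[r][c] + (1j * mu if r == c else 0) for c in range(2)] for r in range(2)]
b = [1 + 0j, 2 - 1j]
x = solve_no_pivot(A, b)
residual = max(abs(sum(A[r][c] * x[c] for c in range(2)) - b[r]) for r in range(2))
```

Checking the residual like this against a few representative matrices is a cheap way to see whether skipping pivoting is acceptable for your conditioning.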
If code space is not at a premium it might be worth unrolling the loops. Personally I would do this by writing a program whose sole job was to write the code.
I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
#pragma acc data copy(data[0:length], length)
{
int i,j,k;
for (k = 2; k <= length; k *= 2) {
for (j=k >> 1; j > 0; j = j >> 1) {
#pragma acc parallel loop gang worker vector independent
for (i = 0; i < length; i++) {
int ixj = i ^ j;
if ((ixj) > i) {
if ((i & k) == 0 && data[i] > data[ixj]) {
double buffer = data[i];
data[i] = data[ixj];
data[ixj] = buffer;
}
if ((i & k) != 0 && data[i] < data[ixj]) {
double buffer = data[i];
data[i] = data[ixj];
data[ixj] = buffer;
}
}
}
}
}
}
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and it switches between CPU and GPU for each iteration of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.
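To make the dependency structure explicit, here is a plain sequential Python sketch of the same sorting network (my own reference version, not OpenACC, and it assumes the length is a power of two). The two outer loops carry dependencies from one pass to the next; only the innermost loop touches disjoint pairs and could run in parallel.

```python
def bitonic_sort(data):
    """In-place bitonic sort; len(data) must be a power of two."""
    n = len(data)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # sequential: size of the bitonic blocks
        j = k >> 1
        while j > 0:           # sequential: compare distance within a block
            for i in range(n):     # iterations touch disjoint pairs (i, i ^ j)
                ixj = i ^ j
                if ixj > i:
                    if (i & k) == 0 and data[i] > data[ixj]:
                        data[i], data[ixj] = data[ixj], data[i]
                    if (i & k) != 0 and data[i] < data[ixj]:
                        data[i], data[ixj] = data[ixj], data[i]
            j >>= 1
        k <<= 1
    return data

sample = [5.0, 1.5, 9.0, 3.25, 7.0, 0.5, 8.0, 2.0]
result = bitonic_sort(sample[:])
```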
Is there any fast way to find the largest power of 10 smaller than a given number?
I'm using this algorithm, at the moment, but something inside myself dies anytime I see it:
10**( int( math.log10(x) ) ) # python
pow( 10, (int) log10(x) ) // C
I could implement simple log10 and pow functions for my problems with one loop each, but still I'm wondering if there is some bit magic for decimal numbers.
An alternative algorithm is:
i = 1;
while((i * 10) < x)
i *= 10;
Log and power are expensive operations. If you want fast, you probably want to look up the IEEE binary exponent in a table to get the approximate power of ten, and then check if the mantissa forces a change by +1 or not. This should be 3 or 4 integer machine instructions (alternatively O(1) with a pretty small constant).
Given tables:
int IEEE_exponent_to_power_of_ten[2048]; // indexed by exponent+1024; needs to be 2*max(IEEE_exponent)
double next_power_of_ten[620]; // indexed by power+310; needs to be 2*log10(pow(2,1024))
// you can compute these tables offline if needed
for (p=-1023;p<1024;p++) // bounds are rough, see actual IEEE exponent ranges
IEEE_exponent_to_power_of_ten[p+1024]=(int)floor(log10(pow(2,p))); // you might have to worry about roundoff errors here
for (k=-308;k<=308;k++)
next_power_of_ten[k+310]=pow(10,k);
then your computation should be:
power_of_ten=IEEE_exponent_to_power_of_ten[IEEE_Exponent(x)+1024];
if (x>=next_power_of_ten[power_of_ten+1+310]) power_of_ten++; // mantissa pushes x past the next power
answer=next_power_of_ten[power_of_ten+310];
[You might really need to write this as assembler to squeeze out every last clock.]
[This code not tested.]
However, if you insist on doing this in python, the interpreter overhead may swamp the log/exp time and it might not matter.
So, do you want fast, or do you want short-to-write?
EDIT 12/23: OP now tells us that his "x" is integral. Under the assumption that it is a 64 (or 32) bit integer, my proposal still works but obviously there isn't an "IEEE_Exponent". Most processors have a "find first one" instruction that will tell you the number of 0 bits on the left hand (most significant) part of the value, i.e., the leading zeros. That count is in essence 64 (or 32) minus the power of two for the value. Given exponent = 64 - leadingzeros, you have the power of two exponent and most of the rest of the algorithm is essentially unchanged (modifications left for the reader).
If the processor doesn't have a find-first-one instruction, then probably the best bet is a balanced discrimination tree to determine the power of ten. For 64 bits, such a tree would take at most 18 compares to determine the exponent (10^18 ~~ 2^64).
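A rough Python sketch of the integer variant, for illustration only: `int.bit_length()` stands in for "64 minus leading zeros", and the table names `POWERS` and `EXP_GUESS` are mine. It returns the largest power of ten not exceeding x (the same convention as the log10 expression in the question) and assumes 1 <= x < 2**64.

```python
MAX_BITS = 64
POWERS = [10 ** k for k in range(20)]  # 10^0 .. 10^19 covers 64-bit values

# EXP_GUESS[b] is the exponent of the largest power of ten <= 2**(b-1),
# i.e. for the smallest value that has bit length b. Index 0 is unused.
EXP_GUESS = [0] + [len(str(1 << (b - 1))) - 1 for b in range(1, MAX_BITS + 1)]

def floor_power_of_ten(x):
    """Largest power of ten <= x, for 1 <= x < 2**64."""
    if not 0 < x < (1 << MAX_BITS):
        raise ValueError("x must satisfy 1 <= x < 2**64")
    e = EXP_GUESS[x.bit_length()]      # table lookup on the binary exponent
    if e + 1 < len(POWERS) and x >= POWERS[e + 1]:
        e += 1                         # low bits push x past the next power
    return POWERS[e]
```

One lookup plus one compare is all that is needed, because doubling a number can move it across at most one power of ten.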
Create an array of powers of 10. Search through it for the largest value smaller than x.
If x is fairly small, you may find that a linear search provides better performance than a binary search, due in part to fewer branch mis-predictions.
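Both searches can be sketched in a few lines of Python. The table size (powers up to 10^19) and the function names are my assumptions, and both versions return the largest power of ten not exceeding x, matching what the log10 expression in the question computes. Which one is faster depends on your inputs, so time them on your own data.

```python
from bisect import bisect_right

POWERS_OF_TEN = [10 ** k for k in range(20)]  # 10^0 .. 10^19

def largest_power_binary(x):
    """Binary search: largest table entry <= x (requires x >= 1)."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return POWERS_OF_TEN[bisect_right(POWERS_OF_TEN, x) - 1]

def largest_power_linear(x):
    """Linear scan: often competitive when x is usually small."""
    if x < 1:
        raise ValueError("x must be at least 1")
    result = POWERS_OF_TEN[0]
    for p in POWERS_OF_TEN[1:]:
        if p > x:
            break
        result = p
    return result
```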
The asymptotically fastest way, as far as I know, involves repeated squaring.
func LogFloor(int value, int base) as int
//iterates values of the form (value: base^(2^i), power: 2^i)
val superPowers = iterator
var p = 1
var c = base
while c <= value
yield (c, p)
c *= c
p += p
endwhile
enditerator
//binary search for the correct power
var p = 0
var c = 1
for val ci, pi in superPowers.Reverse()
if c*ci <= value
c *= ci
p += pi
endif
endfor
return p
The algorithm takes logarithmic time and space in N, which is linear in N's representation size. [The time bound is probably a bit worse because I simplified optimistically]
Note that I assumed arbitrarily large integers (watch out for overflow!), since the naive times-10-until-over algorithm is probably fast enough when dealing with just 32-bit integers.
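Here is how I would transcribe the pseudocode above into Python, where integers are arbitrarily large so overflow is not a concern. The function name `log_floor` and the argument checks are my additions.

```python
def log_floor(value, base=10):
    """Return the largest p with base**p <= value (value >= 1, base >= 2)."""
    if value < 1 or base < 2:
        raise ValueError("need value >= 1 and base >= 2")
    # Collect (base**(2**i), 2**i) for as long as the power fits under value
    super_powers = []
    c, p = base, 1
    while c <= value:
        super_powers.append((c, p))
        c *= c
        p += p
    # Greedily combine them, largest first: a binary search on the exponent
    result_val, result_pow = 1, 0
    for ci, pi in reversed(super_powers):
        if result_val * ci <= value:
            result_val *= ci
            result_pow += pi
    return result_pow
```

For example, `10 ** log_floor(x)` then gives the power of ten itself.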
I think the fastest way is O(log(log(n))^2): the while loop takes O(log(log(n))) and the function recurses a bounded number of times (we can say O(c), where c is a constant). The first recursive call takes log(log(sqrt(n))) time, the second takes ..., and the number of square roots needed before sqrt(sqrt(sqrt(...(n)))) < 10 is log(log(n)), which is constant because of machine limitations.
static long findPow10(long n)
{
if (n == 0)
return 0;
long i = 10;
long prevI = 10;
int count = 1;
while (i < n)
{
prevI = i;
i *= i;
count*=2;
}
if (i == n)
return count;
return count / 2 + findPow10(n / prevI);
}
In Python:
10**(len(str(int(x)))-1)
Given that this is language independent, if you can get the power of two that this number is significant to, eg y in x*2^y (which is the way the number is stored, though I'm not sure I have seen an easy way to access y in any language I have used) then if
z = int(y/(ln(10)/ln(2)))
(one floating point division)
10^z or 10^(z+1) will be your answer, though computing 10^z is still not so simple (I beg to be corrected).
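For what it's worth, Python does expose the binary exponent through `math.frexp`. The sketch below is my own reading of the idea above, restricted to floats x >= 1 so that the powers of ten stay exact integers; the function name is hypothetical. Note that `frexp` returns y with 2**(y-1) <= x < 2**y, so the estimate uses y - 1.

```python
import math

def floor_power_of_ten_float(x):
    """Largest power of ten <= x, for a finite float x >= 1."""
    if not (x >= 1.0) or math.isinf(x):
        raise ValueError("x must be a finite float >= 1")
    _, y = math.frexp(x)                 # x == m * 2**y with 0.5 <= m < 1
    # Dividing by ln(10)/ln(2) is the same as multiplying by log10(2)
    z = int((y - 1) * math.log10(2))     # exponent estimate: answer is z or z+1
    if 10 ** (z + 1) <= x:
        z += 1
    return 10 ** z
```

In Python the final `10 ** z` is just integer exponentiation; in a language without big integers a small lookup table would play that role.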
I timed the methods with the following variations in C++ for the value a being a size_t type (inlining improves performance but does not change relative ordering).
Try 1: Multiply until find number.
size_t try1( size_t a )
{
size_t scalar = 1ul;
while( scalar * 10 < a ) scalar *= 10;
return scalar;
}
Try 2: Multiway if (could also be programmed using a lookup table).
size_t try2( size_t a )
{
return ( a < 10ul ? 1ul :
( a < 100ul ? 10ul :
( a < 1000ul ? 100ul :
( a < 10000ul ? 1000ul :
( a < 100000ul ? 10000ul :
( a < 1000000ul ? 100000ul :
( a < 10000000ul ? 1000000ul :
( a < 100000000ul ? 10000000ul :
( a < 1000000000ul ? 100000000ul :
( a < 10000000000ul ? 1000000000ul :
( a < 100000000000ul ? 10000000000ul :
( a < 1000000000000ul ? 100000000000ul :
( a < 10000000000000ul ? 1000000000000ul :
( a < 100000000000000ul ? 10000000000000ul :
( a < 1000000000000000ul ? 100000000000000ul :
( a < 10000000000000000ul ? 1000000000000000ul :
( a < 100000000000000000ul ? 10000000000000000ul :
( a < 1000000000000000000ul ? 100000000000000000ul :
( a < 10000000000000000000ul ? 1000000000000000000ul :
10000000000000000000ul )))))))))))))))))));
}
Try 3: Modified from findPow10 of #Saaed Amiri, which uses squaring to more rapidly find very large powers than Try 1.
size_t try3( size_t a )
{
if (a == 0)
return 0;
size_t i, j = 1;
size_t prev = 1;
while( j != 100 )
{
i = prev;
j = 10;
while (i <= a)
{
prev = i;
i *= j;
j *= j;
}
}
return prev;
}
Try 4: Lookup table indexed using count leading zeros instruction as per #Ira Baxter.
static const std::array<size_t,65> ltable2{
1ul, 1ul, 1ul, 1ul, 1ul, 10ul, 10ul, 10ul,
100ul, 100ul, 100ul, 1000ul, 1000ul, 1000ul,
1000ul, 10000ul, 10000ul, 10000ul, 100000ul,
100000ul, 100000ul, 1000000ul, 1000000ul,
1000000ul, 1000000ul, 10000000ul, 10000000ul,
10000000ul, 100000000ul, 100000000ul,
100000000ul, 1000000000ul, 1000000000ul,
1000000000ul, 1000000000ul, 10000000000ul,
10000000000ul, 10000000000ul, 100000000000ul,
100000000000ul, 100000000000ul, 1000000000000ul,
1000000000000ul, 1000000000000ul, 1000000000000ul,
10000000000000ul, 10000000000000ul, 10000000000000ul,
100000000000000ul, 100000000000000ul, 100000000000000ul,
1000000000000000ul, 1000000000000000ul, 1000000000000000ul,
1000000000000000ul, 10000000000000000ul, 10000000000000000ul,
10000000000000000ul, 100000000000000000ul, 100000000000000000ul,
100000000000000000ul, 100000000000000000ul, 1000000000000000000ul,
1000000000000000000ul,
1000000000000000000ul }; // extra entry: index 64 is reached when the top bit of a is set
size_t try4( size_t a )
{
if( a == 0 ) return 0;
size_t scalar = ltable2[ 64 - __builtin_clzl(a) ];
return (scalar * 10 > a ? scalar : scalar * 10 );
}
Timing is as follows (gcc 4.8)
for( size_t i = 0; i != 1000000000; ++i) try1(i) 6.6
for( size_t i = 0; i != 1000000000; ++i) try2(i) 0.3
for( size_t i = 0; i != 1000000000; ++i) try3(i) 6.5
for( size_t i = 0; i != 1000000000; ++i) try4(i) 0.3
for( size_t i = 0; i != 1000000000; ++i) pow(10,size_t(log10((double)i))) 98.1
The lookup/multiway-if beats everything in C++, but requires that we know integers are a finite size. try3 is slower than try1 in this test for smaller values of the loop end value; for large numbers try3 beats try1. In Python things are made difficult because integers are not limited, so I would combine try2 with try3 to quickly process numbers up to a fixed limit and then handle the possibly very large numbers.
In python I think lookup using a list comprehension is probably faster than a multiway-if.
# where we previously define lookuptable = ( 1, 10, 100, ..... )
scalar = [i for i in lookuptable if i < a][-1]
Which one is faster? Why?
var messages:Array = [.....]
// 1 - for
var len:int = messages.length;
for (var i:int = 0; i < len; i++) {
var o:Object = messages[i];
// ...
}
// 2 - foreach
for each (var o:Object in messages) {
// ...
}
From where I'm sitting, regular for loops are moderately faster than for each loops in the minimal case. Also, as with AS2 days, decrementing your way through a for loop generally provides a very minor improvement.
But really, any slight difference here will be dwarfed by the requirements of what you actually do inside the loop. You can find operations that will work faster or slower in either case. The real answer is that neither kind of loop can be meaningfully said to be faster than the other - you must profile your code as it appears in your application.
Sample code:
var size:Number = 10000000;
var arr:Array = [];
for (var i:int=0; i<size; i++) { arr[i] = i; }
var time:Number, o:Object;
// for()
time = getTimer();
for (i=0; i<size; i++) { arr[i]; }
trace("for test: "+(getTimer()-time)+"ms");
// for() reversed
time = getTimer();
for (i=size-1; i>=0; i--) { arr[i]; }
trace("for reversed test: "+(getTimer()-time)+"ms");
// for each
time = getTimer();
for each(o in arr) { o; }
trace("for each test: "+(getTimer()-time)+"ms");
Results:
for test: 124ms
for reversed test: 110ms
for each test: 261ms
Edit: To improve the comparison, I changed the inner loops so they do nothing but access the collection value.
Edit 2: Answers to oshyshko's comment:
The compiler could skip the accesses in my internal loops, but it doesn't. The loops would exit two or three times faster if it did.
The results change in the sample code you posted because in that version, the for loop now has an implicit type conversion. I left assignments out of my loops to avoid that.
Of course one could argue that it's okay to have an extra cast in the for loop because "real code" would need it anyway, but to me that's just another way of saying "there's no general answer; which loop is faster depends on what you do inside your loop". Which is the answer I'm giving you. ;)
When iterating over an array, for each loops are way faster in my tests.
var len:int = 1000000;
var i:int = 0;
var arr:Array = [];
while(i < len) {
arr[i] = i;
i++;
}
function forEachLoop():void {
var t:Number = getTimer();
var sum:Number = 0;
for each(var num:Number in arr) {
sum += num;
}
trace("forEachLoop :", (getTimer() - t));
}
function whileLoop():void {
var t:Number = getTimer();
var sum:Number = 0;
var i:int = 0;
while(i < len) {
sum += arr[i] as Number;
i++;
}
trace("whileLoop :", (getTimer() - t));
}
forEachLoop();
whileLoop();
This gives:
forEachLoop : 87
whileLoop : 967
Here, probably most of while loop time is spent casting the array item to a Number. However, I consider it a fair comparison, since that's what you get in the for each loop.
My guess is that this difference has to do with the fact that, as mentioned, the as operator is relatively expensive and array access is also relatively slow. With a for each loop, both operations are handled natively, I think, as opposed to being performed in ActionScript.
Note, however, that if type conversion actually takes place, the for each version is much slower and the while version is noticeably faster (though, still, for each beats while):
To test, change array initialization to this:
while(i < len) {
arr[i] = i + "";
i++;
}
And now the results are:
forEachLoop : 328
whileLoop : 366
forEachLoop : 324
whileLoop : 369
I've had this discussion with a few colleagues before, and we have all found different results for different scenarios. However, there was one test that I found quite eloquent for comparison's sake:
var array:Array=new Array();
for (var k:uint=0; k<1000000; k++) {
array.push(Math.random());
}
stage.addEventListener("mouseDown",foreachloop);
stage.addEventListener("mouseUp",forloop);
/////// Array /////
/* 49ms */
function foreachloop(e) {
var t1:uint=getTimer();
var tmp:Number=0;
var i:uint=0;
for each (var n:Number in array) {
i++;
tmp+=n;
}
trace("foreach", i, tmp, getTimer() - t1);
}
/***** 81ms ****/
function forloop(e) {
var t1:uint=getTimer();
var tmp:Number=0;
var l:uint=array.length;
for(var i:uint = 0; i < l; i++)
tmp += Number(array[i]);
trace("for", i, tmp, getTimer() - t1);
}
What I like about this test is that you have a reference for both the key and value in each iteration of both loops (removing the key counter in the "for-each" loop is not that relevant). Also, it operates with Number, which is probably the most common loop that you will want to optimize that much. And most importantly, the winner is the "for-each", which is my favorite loop :P
Notes:
-Referencing the array in a local variable within the function of the "for-each" loop is irrelevant, but in the "for" loop you do get a speed bump (75ms instead of 105ms):
function forloop(e) {
var t1:uint=getTimer();
var tmp:Number=0;
var a:Array=array;
var l:uint=a.length;
for(var i:uint = 0; i < l; i++)
tmp += Number(a[i]);
trace("for", i, tmp, getTimer() - t1);
}
-If you run the same tests with the Vector class, the results are a bit confusing :S
for would be faster for arrays...but depending on the situation it can be foreach that is best...see this .net benchmark test.
Personally, I'd use either until I got to the point where it became necessary for me to optimize the code. Premature optimization is wasteful :-)
Maybe in an array where all elements are present and start at zero (0 to X) it would be faster to use a for loop. In all other cases (sparse arrays) it can be a LOT faster to use for each.
The reason is the use of two data structures in the array: a hash table and a dense array.
Please read my Array analysis using Tamarin source:
http://jpauclair.wordpress.com/2009/12/02/tamarin-part-i-as3-array/
The for loop will check undefined indices, whereas the for each will skip those, jumping to the next element in the hash table.
Guys!
Especially Juan Pablo Califano.
I've checked your test. The main difference is in how the array item is obtained.
If you set var len : int = 40000;, you will see that the 'while' loop is faster.
But with large arrays it loses to for each.
Just an add-on:
A for each...in loop doesn't assure you that the elements in the array/vector get enumerated in the ORDER THEY ARE STORED in them (except for XML).
This IS a vital difference, IMO.
"...Therefore, you should not write code that depends on a for-each-in or for-in loop’s enumeration order unless you are processing XML data..." C. Moock
(i hope not to break law stating this one phrase...)
Happy benchmarking.
sorry to prove you guys wrong, but for each is faster. even a lot. except, if you don't want to access the array values, but a) this does not make sense and b) this is not the case here.
as a result of this, i made a detailed post on my super new blog ... :D
greetz
back2dos
The environment: I am working in a proprietary scripting language where there is no such thing as a user-defined function. I have various loops and local variables of primitive types that I can create and use.
I have two related arrays, "times" and "values". They both contain floating point values. I want to numerically sort the "times" array but have to be sure that the same operations are applied on the "values" array. What's the most efficient way I can do this without the benefit of things like recursion?
You could maintain an index table and sort the index table instead.
This way you will not have to worry about times and values being consistent.
And whenever you need a sorted value, you can lookup on the sorted index.
And if in the future you decided there was going to be a third value, the sorting code will not need any changes.
Here's a sample in C#, but it shouldn't be hard to adapt to your scripting language:
static void Main() {
var r = new Random();
// initialize random data
var index = new int[10]; // the index table
var times = new double[10]; // times
var values = new double[10]; // values
for (int i = 0; i < 10; i++) {
index[i] = i;
times[i] = r.NextDouble();
values[i] = r.NextDouble();
}
// a naive bubble sort
for (int i = 0; i < 10; i++)
for (int j = 0; j < 10; j++)
// compare time value at current index
if (times[index[i]] < times[index[j]]) {
// swap index value (times and values remain unchanged)
var temp = index[i];
index[i] = index[j];
index[j] = temp;
}
// check if the result is correct
for (int i = 0; i < 10; i++)
Console.WriteLine(times[index[i]]);
Console.ReadKey();
}
Note: I used a naive bubble sort there, so watch out. In your case, an insertion sort is probably a good candidate, since you don't want complex recursion.
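A sketch of the same index-table idea with an insertion sort, written in Python with plain loops only (no functions, no recursion) so it should translate to a restricted scripting language. The sample data is made up.

```python
times = [0.9, 0.1, 0.5, 0.3]
values = [9.0, 1.0, 5.0, 3.0]

# The index table starts as 0, 1, 2, ... and is the only thing that gets reordered
index = list(range(len(times)))

# Insertion sort on the index table, comparing the times the entries point at
for i in range(1, len(index)):
    current = index[i]
    j = i - 1
    while j >= 0 and times[index[j]] > times[current]:
        index[j + 1] = index[j]   # shift larger entries one slot to the right
        j -= 1
    index[j + 1] = current

# Read both arrays through the sorted index
sorted_times = [times[k] for k in index]
sorted_values = [values[k] for k in index]
```

Because `times` and `values` are never modified, the pairing between them cannot be broken by the sort.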
Just take your favourite sorting algorithm (e.g. Quicksort or Mergesort) and use it to sort the "times" array. Whenever two values are swapped in "times", also swap the values with the same indices in the "values" array.
So basically you can take any fast sorting algorithm and modify the swap() operation so that elements in both arrays are swapped.
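As an illustration, here is a loop-only sketch in Python that sorts by times and mirrors every swap in values. I picked a swap-based Shell sort purely because it needs no recursion and no helper functions; that choice and the sample data are mine, and any in-place algorithm would work the same way.

```python
times = [0.9, 0.1, 0.5, 0.3, 0.7]
values = [9.0, 1.0, 5.0, 3.0, 7.0]

n = len(times)
gap = n // 2
while gap > 0:
    for i in range(gap, n):
        j = i
        # Bubble element j back through its gap-spaced predecessors
        while j >= gap and times[j - gap] > times[j]:
            # The "modified swap": apply the same exchange to both arrays
            times[j - gap], times[j] = times[j], times[j - gap]
            values[j - gap], values[j] = values[j], values[j - gap]
            j -= gap
    gap //= 2
```

The last pass runs with a gap of 1, which is an ordinary insertion sort, so the result is fully sorted.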
Take a look at the Bottom-Up mergesort at Algorithmist. It's a non-recursive way of performing a mergesort. The version presented there uses function calls, but that can be inlined easily enough.
Like martinus said, every time you change a value in one array, do the exact same thing in the parallel array.
Here's a C-like version of a stable-non-recursive mergesort that makes no function calls, and uses no recursion.
const int arrayLength = 40;
float times_array[arrayLength];
float values_array[arrayLength];
// Fill the two arrays....
// Allocate two buffers
float times_buffer[arrayLength];
float values_buffer[arrayLength];
int blockSize = 1;
while (blockSize <= arrayLength)
{
int i = 0;
while (i < arrayLength-blockSize)
{
int begin1 = i;
int end1 = begin1 + blockSize;
int begin2 = end1;
int end2 = begin2 + blockSize;
if (end2 > arrayLength) end2 = arrayLength; // the last block may be short
int bufferIndex = begin1;
while (begin1 < end1 && begin2 < end2)
{
// compare on times only; values just follow along
if ( times_array[begin1] > times_array[begin2] )
{
times_buffer[bufferIndex] = times_array[begin2];
values_buffer[bufferIndex++] = values_array[begin2++];
}
else
{
times_buffer[bufferIndex] = times_array[begin1];
values_buffer[bufferIndex++] = values_array[begin1++];
}
}
while ( begin1 < end1 )
{
times_buffer[bufferIndex] = times_array[begin1];
values_buffer[bufferIndex++] = values_array[begin1++];
}
while ( begin2 < end2 )
{
times_buffer[bufferIndex] = times_array[begin2];
values_buffer[bufferIndex++] = values_array[begin2++];
}
// copy back only the range that was actually merged
for (int k = i; k < end2; ++k)
{
times_array[k] = times_buffer[k];
values_array[k] = values_buffer[k];
}
i += 2 * blockSize;
}
blockSize *= 2;
}
I wouldn't suggest writing your own sorting routine, as the sorting routines provided as part of the Java language are well optimized.
The way I'd solve this is to copy the code in the java.util.Arrays class into your own class, e.g. org.mydomain.util.Arrays, and add some comments telling yourself not to use the class except when you must have the additional functionality that you're going to add. The Arrays class is quite stable so this is less bad than it would seem, but it's still less than ideal. However, the methods you need to change are private, so you've no real choice.
You then want to create an interface along the lines of:
public static interface SwapHook {
void swap(int a, int b);
}
You then need to add this to the sort method you're going to use, and to every subordinate method called in the sorting procedure that swaps elements in your primary array. You arrange for the hook to get called by your modified sorting routine, and you can then implement the SwapHook interface to achieve the behaviour you want in any secondary (e.g. parallel) arrays.
HTH.