On Xorshift random number generator algorithm

Following is a basic implementation of the Xorshift RNG (copied from the Wikipedia):
uint32_t xor128(void) {
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;
    t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}
I understand that w is the returned value and x, y and z are the state ("memory") variables. However, I can't understand the purpose of more than one memory variable. Can anyone explain this point to me?
Also, I tried to port the above code to Python:
class R2:
    def __init__(self):
        self.x = 123456789
        self.y = 362436069
        self.z = 521288629
        self.w = 88675123
    def __call__(self):
        t = self.x ^ (self.x<<11)
        self.x = self.y
        self.y = self.z
        self.z = self.w
        w = self.w
        self.w = w ^ (w >> 19) ^(t ^ (t >> 8))
        return self.w
Then, I have generated 100 numbers and plotted their log10 values:
r2 = R2()
x2 = [math.log10(r2()) for _ in range(100)]
plot(x2, '.g')
Here is the output of the plot:
And this is what happens when 10000 (and not 100) numbers are generated:
The overall tendency is very clear. And don't forget that the Y axis is log10 of the actual value.
Pretty strange behavior, don't you think?

The problem here is of course that you're using Python to do this.
Python has a notion of big integers, so even though you are copying an implementation that deals with 32-bit numbers, Python just says "I'll just go ahead and keep everything for you".
If you try this instead:
x2 = [r2() for _ in range(100)]
print(x2)
You'll notice that it produces ever-longer numbers, for instance here's the first number:
252977563114
and here's the last:
8735276851455609928450146337670748382228073854835405969246191481699954934702447147582960645
Here's code that has been fixed to handle this:
...
    def __call__(self):
        t = (self.x ^ (self.x<<11)) & 0xffffffff # <-- keep 32 bits
        self.x = self.y
        self.y = self.z
        self.z = self.w
        w = self.w
        self.w = (w ^ (w >> 19) ^(t ^ (t >> 8))) & 0xffffffff # <-- keep 32 bits
        return self.w
...

And with a generator:
def xor128():
    x = 123456789
    y = 362436069
    z = 521288629
    w = 88675123
    while True:
        t = (x ^ (x<<11)) & 0xffffffff
        (x,y,z) = (y,z,w)
        w = (w ^ (w >> 19) ^ (t ^ (t >> 8))) & 0xffffffff
        yield w

"However, I can't understand the purpose of more than one memory variable" - if you need to 'remember' 128 bits then you need 4 x 32bit integers.
As to the very strange distribution of 100 randoms, no idea! I could understand perhaps if you had generated a few million, and the steps in the graph were artifacts, but not 100.
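To make both points concrete, here is a small Python sketch. The names Xorshift128 and unmasked_first_output are made up for illustration and do not come from the posts above. It keeps all four 32-bit words as explicit state (w is part of the 128-bit state too, not only the return value) and compares the first masked output with what Python's unbounded integers produce when the mask is left out.

```python
MASK32 = 0xFFFFFFFF  # emulate uint32_t wrap-around


class Xorshift128:
    """xor128 with the seeds used in the question (illustrative sketch)."""

    def __init__(self):
        # All four words are state: 4 x 32 bits = 128 bits.
        self.state = [123456789, 362436069, 521288629, 88675123]

    def next(self):
        x, y, z, w = self.state
        # Only the left shift can grow past 32 bits, so mask right after it.
        t = (x ^ (x << 11)) & MASK32
        w_new = w ^ (w >> 19) ^ t ^ (t >> 8)
        # Slide the window: the oldest word x is dropped, w_new is appended.
        self.state = [y, z, w, w_new]
        return w_new


def unmasked_first_output():
    """First value when the arithmetic is done without any 32-bit mask."""
    x, w = 123456789, 88675123
    t = x ^ (x << 11)
    return w ^ (w >> 19) ^ (t ^ (t >> 8))


gen = Xorshift128()
first = gen.next()
```

Here first is 3701687786, while unmasked_first_output() returns 252977563114, the first number quoted in the answer above. Without the mask every call adds bits, which is why log10 of the output climbs steadily in the plots.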


m.Equations resulting in TypeError: 'int' object is not subscriptable

I'm having trouble passing my equations of motion to the solver on my control optimization problem.
Just a little explanation on what I'm attempting to do here, because I think there are two problem areas:
First, I'm defining a contact switch c that is used to turn portions of the dynamic equations on and off based on the value a, which is an FV (fixed variable) between 0 and 0.45. I have a loop which sets the value of c[i] based on the value of the time parameter relative to a.
c = [None]*N
for i in range(N):
    difference = m.Intermediate(.5-m.time[i])
    abs = m.if3(difference, -difference, difference)
    c[i] = m.Intermediate(m.if3(abs-(.5-a), 1, 0))
It should resemble a vector of length N:
c= [0, 0, 0, 1, 1, ...., 1, 1, 0, 0, 0]
It's not clear if this was implemented properly, but it's not throwing me errors at this point. (Note: I'm aware that this can be easily implemented as a mixed-integer variable, but I really want to use IPOPT, so I'm using the m.if3() method to create the binary switch.)
Second, I'm getting an error when passing the equations of motion. This exists whether the c is included, so, at least for right now, I know that is not the issue.
m.Equations(xdot.dt()/TF == c*u*(L1*m.sin(q1)-L2*m.sin(q1+q2))/(M*L1*L2*m.sin(2*q1+q2)))
m.Equations(ydot.dt()/TF == -c*u*(L1*m.cos(q1)+L2*m.cos(q1+q2))/(M*L1*L2*m.sin(2*q1+q2))-g/m)
m.Equation(x.dt()/TF == xdot)
m.Equation(y.dt()/TF == ydot)
m.Equation(y*init == y*final) #initial and final y position must be equal
TypeError: 'int' object is not subscriptable
I've attempted to set up an intermediate loop to handle the RH of the equation to no avail:
RH = [None]*N
RH = m.Intermediate([c[i]*u[i]*(L1*m.sin(q1[i])-2*m.sin(q1[i]+q2[i]))/(M*L1*L2*m.sin(2*q1[i]+q2[i])) for i in range(N)])
m.Equations(xdot.dt()/TF == RH)
Below is the full code. Note: there are probably other issues both in my code and problem definition, but I'm just looking to find a way to successfully pass these equations of motion. Much appreciated!
Full code:
import math
import numpy as np
from gekko import GEKKO
#Defining a model
m = GEKKO(remote=True)
v = 1 #set walking speed (m/s)
L1 = .5 #set thigh length (m)
L2 = .5 #set shank length (m)
M = 75 #set mass (kg)
#################################
#Define secondary parameters
D = L1 + L2 #leg length parameter
pi = math.pi #define pi
g = 9.81 #define gravity
#Define initial and final conditions and limits
x0 = -v/2; xf = v/2
xdot0 = v; xdotf = v
ydot0 = 0; ydotf = 0
ymin = .5*D; ymax = 1.5*D
q1min = -pi/2; q1max = pi/2
q2min = -pi/2; q2max = 0
tfmin = D/(2*v); tfmax = 3*D/(2*v)
#Defining the time parameter (0, 1)
N = 100
t = np.linspace(0,1,N)
m.time = t
#Final time Fixed Variable
TF = m.FV(1,lb=tfmin,ub=tfmax); TF.STATUS = 1
end_loc = len(m.time)-1
amin = 0; amax = .45
#Defining initial and final condition vectors
init = np.zeros(len(m.time))
final = np.zeros(len(m.time))
init[1] = 1
final[-1] = 1
init = m.Param(value=init)
final = m.Param(value=final)
#Parameters
M = m.Param(value=M) #cart mass
L1 = m.Param(value=L1) #link 1 length
L2 = m.Param(value=L2) #link 2 length
g = m.Const(value=g) #gravity
#Control Input Manipulated Variable
u = m.MV(0); u.STATUS = 1
#Ground Contact Fixed Variable
a = m.FV(0,lb=amin,ub=amax) #equates to the unscaled time when contact first occurs
#State Variables
x, y, xdot, ydot, q1, q2 = m.Array(m.Var, 6)
x.value = x0;
xdot.value = xdot0; ydot.value = ydot0
y.LOWER = ymin; y.UPPER = ymax
q1.LOWER = q1min; q1.UPPER = q1max
q2.LOWER = q2min; q2.UPPER = q2max
#Intermediates
c = [None]*N
for i in range(N):
    difference = m.Intermediate(.5-m.time[i])
    abs = m.if3(difference, -difference, difference)
    c[i] = m.Intermediate(m.if3(abs-(.5-a), 1, 0))
#Defining the State Space Model
m.Equations(xdot.dt()/TF == c*u*(L1*m.sin(q1)-L2*m.sin(q1+q2))/(M*L1*L2*m.sin(2*q1+q2))) ####This produces the error
m.Equations(ydot.dt()/TF == -c*u*(L1*m.cos(q1)+L2*m.cos(q1+q2))/(M*L1*L2*m.sin(2*q1+q2))-g/m)
m.Equation(x.dt()/TF == xdot)
m.Equation(y.dt()/TF == ydot)
m.Equation(y*init == y*final) #initial and final y position must be equal
#Defining final condition
m.fix_final(x,val=xf)
m.fix_final(xdot,val=xdotf)
m.fix_final(xdot,val=ydotf)
#Try to minimize final time and torque
m.Obj(TF)
m.Obj(0.001*u**2)
m.options.IMODE = 6 #MPC
m.options.SOLVER = 3
m.solve()
m.time = np.multiply(TF, m.time)
Nice application. Here are a few corrections and ideas:
Use a switch condition that uses a NumPy array. There is no need to define the individual points in the horizon with c[i].
#Intermediates
#c = [None]*N
#for i in range(N):
# difference = m.Intermediate(.5-m.time[i])
# abs = m.if3(difference, -difference, difference)
# c[i] = m.Intermediate(m.if3(abs-(.5-a), 1, 0))
diff = 0.5 - m.time
adiff = m.Param(np.abs(diff))
swtch = m.Intermediate(adiff-(0.5-a))
c = m.if3(swtch,1,0)
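As a sanity check on the intended shape of c, here is a plain-Python sketch (no GEKKO involved) that evaluates the same condition for a fixed, assumed value of a. The function name contact_switch and the sample value a = 0.25 are illustrative only; in the real model a is an FV that the solver adjusts. It assumes the m.if3(condition, x1, x2) convention of returning x1 when condition < 0 and x2 otherwise.

```python
def contact_switch(times, a):
    """Return 1 where |0.5 - t| < 0.5 - a (that is, a < t < 1 - a), else 0."""
    return [1 if abs(0.5 - t) - (0.5 - a) < 0 else 0 for t in times]


N = 11                                   # coarse grid, just for display
times = [i / (N - 1) for i in range(N)]  # scaled time from 0 to 1
c = contact_switch(times, a=0.25)
```

With these values c is [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0], the same pattern of zeros, ones, zeros that the question describes.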
You may be able to use the m.integral() function to set the value of c to 1 and keep it there when contact is made.
Use the m.periodic(y) function to set the initial value of y equal to the final value of y.
#m.Equation(y*init == y*final) #initial and final y position must be equal
m.periodic(y)
Try using soft constraints instead of hard constraints if there is a problem with finding a feasible solution.
#Defining final condition
#m.fix_final(x,val=xf)
#m.fix_final(xdot,val=xdotf)
#m.fix_final(ydot,val=ydotf)
m.Minimize(final*(x-xf)**2)
m.Minimize(final*(xdot-xdotf)**2)
m.Minimize(final*(ydot-ydotf)**2)
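The m.if3() function uses binary variables, so it requires a mixed-integer solver such as APOPT (m.options.SOLVER = 1). Try m.if2() for the continuous version that uses MPCCs instead of binary variables to define the switch. The integral function may be an alternative way to avoid a binary variable.
Here is the final code that attempts a solution, but the solver can't yet find one. Note that it also passes each equation of motion through m.Equation() rather than m.Equations(), which expects a list of equations, and replaces g/m (division by the model object m) with g/M. I hope this helps you get a little further on your optimization problem. You may need to use a shooting (sequential) method to find an initial feasible solution.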
import math
import numpy as np
from gekko import GEKKO
#Defining a model
m = GEKKO(remote=True)
v = 1 #set walking speed (m/s)
L1 = .5 #set thigh length (m)
L2 = .5 #set shank length (m)
M = 75 #set mass (kg)
#################################
#Define secondary parameters
D = L1 + L2 #leg length parameter
pi = math.pi #define pi
g = 9.81 #define gravity
#Define initial and final conditions and limits
x0 = -v/2; xf = v/2
xdot0 = v; xdotf = v
ydot0 = 0; ydotf = 0
ymin = .5*D; ymax = 1.5*D
q1min = -pi/2; q1max = pi/2
q2min = -pi/2; q2max = 0
tfmin = D/(2*v); tfmax = 3*D/(2*v)
#Defining the time parameter (0, 1)
N = 100
t = np.linspace(0,1,N)
m.time = t
#Final time Fixed Variable
TF = m.FV(1,lb=tfmin,ub=tfmax); TF.STATUS = 1
end_loc = len(m.time)-1
amin = 0; amax = .45
#Defining initial and final condition vectors
init = np.zeros(len(m.time))
final = np.zeros(len(m.time))
init[1] = 1
final[-1] = 1
init = m.Param(value=init)
final = m.Param(value=final)
#Parameters
M = m.Param(value=M) #cart mass
L1 = m.Param(value=L1) #link 1 length
L2 = m.Param(value=L2) #link 2 length
g = m.Const(value=g) #gravity
#Control Input Manipulated Variable
u = m.MV(0); u.STATUS = 1
#Ground Contact Fixed Variable
a = m.FV(0,lb=amin,ub=amax) #equates to the unscaled time when contact first occurs
#State Variables
x, y, xdot, ydot, q1, q2 = m.Array(m.Var, 6)
x.value = x0;
xdot.value = xdot0; ydot.value = ydot0
y.LOWER = ymin; y.UPPER = ymax
q1.LOWER = q1min; q1.UPPER = q1max
q2.LOWER = q2min; q2.UPPER = q2max
#Intermediates
#c = [None]*N
#for i in range(N):
# difference = m.Intermediate(.5-m.time[i])
# abs = m.if3(difference, -difference, difference)
# c[i] = m.Intermediate(m.if3(abs-(.5-a), 1, 0))
diff = 0.5 - m.time
adiff = m.Param(np.abs(diff))
swtch = m.Intermediate(adiff-(0.5-a))
c = m.if3(swtch,1,0)
#Defining the State Space Model
m.Equation(xdot.dt()/TF == c*u*(L1*m.sin(q1)
-L2*m.sin(q1+q2))
/(M*L1*L2*m.sin(2*q1+q2)))
m.Equation(ydot.dt()/TF == -c*u*(L1*m.cos(q1)
+L2*m.cos(q1+q2))
/(M*L1*L2*m.sin(2*q1+q2))-g/M)
m.Equation(x.dt()/TF == xdot)
m.Equation(y.dt()/TF == ydot)
#m.Equation(y*init == y*final) #initial and final y position must be equal
m.periodic(y)
#Defining final condition
#m.fix_final(x,val=xf)
#m.fix_final(xdot,val=xdotf)
#m.fix_final(ydot,val=ydotf)
m.Minimize(final*(x-xf)**2)
m.Minimize(final*(xdot-xdotf)**2)
m.Minimize(final*(ydot-ydotf)**2)
#Try to minimize final time and torque
m.Minimize(TF)
m.Minimize(0.001*u**2)
m.options.IMODE = 6 #MPC
m.options.SOLVER = 1
m.solve()
m.time = np.multiply(TF, m.time)

For given two integers A and B, find a pair of numbers X and Y such that A = X*Y and B = X xor Y

I'm struggling with this problem, which I found in a competitive programming book that gives no solution for it. For two given integers A and B (each fits in a 64-bit integer type), where A is odd, find a pair of numbers X and Y such that A = X*Y and B = X xor Y.
My approach was to list all divisors of A and try pairing numbers under sqrt(A) with numbers over sqrt(A) that multiply up to A and see if their xor is equal to B. But I don't know if that's efficient enough.
What would be a good solution/algorithm to this problem?
You know that at least one factor is <= sqrt(A). Let's make that one X.
The length of X in bits will be about half the length of A.
The upper bits of X, therefore -- the ones higher in value than sqrt(A) -- are all 0, and the corresponding bits in B must have the same value as the corresponding bits in Y.
Knowing the upper bits of Y gives you a pretty small range for the corresponding factor X = A/Y. Calculate Xmin and Xmax corresponding to the largest and smallest possible values for Y, respectively. Remember that Xmax must also be <= sqrt(A).
Then just try all the possible Xs between Xmin and Xmax. There won't be too many, so it won't take very long.
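A minimal Python sketch of the steps above, assuming A is a positive odd integer. The function name solve_by_high_bits and its variable names are made up for illustration. Note that the candidate range is only small when B has set bits above sqrt(A); if those bits are all zero the range of X can still be wide.

```python
from math import isqrt


def solve_by_high_bits(A, B):
    """Return (X, Y) with X <= Y, X*Y == A and X^Y == B, or None."""
    root = isqrt(A)              # X <= floor(sqrt(A))
    k = root.bit_length()        # so X < 2**k: bits k and above of X are 0
    low_mask = (1 << k) - 1
    b_high = B & ~low_mask       # Y must agree with B on these upper bits
    y_min = max(b_high, root)    # Y >= sqrt(A) because X <= sqrt(A)
    y_max = b_high | low_mask
    x_min = -(-A // y_max)       # ceil(A / y_max)
    x_max = min(A // y_min, root)
    for x in range(x_min, x_max + 1):
        if A % x == 0 and (x ^ (A // x)) == B:
            return x, A // x
    return None


small = solve_by_high_bits(15, 6)
big = solve_by_high_bits(412975927819062465, 9399702487040)
```

For A = 15 and B = 6 the upper bits force Y into [4, 7], which leaves X = 3 as the only candidate, so small is (3, 5).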
The other straightforward way to solve this problem relies on the fact that the lower n bits of X*Y and X xor Y depend only on the lower n bits of X and Y. Therefore, you can use the possible answers for the lower n bits to restrict the possible answers for the lower n+1 bits, until you're done.
I've worked out that, unfortunately, there can be more than one possibility for a single n. I don't know how often there will be a lot of possibilities, but it's probably not too often if at all, so this may be fine in a competitive context. Probabilistically, there will only be a few possibilities, since a solution for n bits will provide either 0 or two solutions for n+1 bits, with equal probability.
It seems to work out pretty well for random input. Here's the code I used to test it:
public static void solve(long A, long B)
{
    List<Long> sols = new ArrayList<>();
    List<Long> prevSols = new ArrayList<>();
    sols.add(0L);
    long tests=0;
    System.out.print("Solving "+A+","+B+"... ");
    for (long bit=1; (A/bit)>=bit; bit<<=1)
    {
        tests += sols.size();
        {
            List<Long> t = prevSols;
            prevSols = sols;
            sols = t;
        }
        final long mask = bit|(bit-1);
        sols.clear();
        for (long prevx : prevSols)
        {
            long prevy = (prevx^B) & mask;
            if ((((prevx*prevy)^A)&mask) == 0)
            {
                sols.add(prevx);
            }
            long x = prevx | bit;
            long y = (x^B)&mask;
            if ((((x*y)^A)&mask) == 0)
            {
                sols.add(x);
            }
        }
    }
    tests += sols.size();
    {
        List<Long> t = prevSols;
        prevSols = sols;
        sols = t;
    }
    sols.clear();
    for (long testx: prevSols)
    {
        if (A/testx >= testx)
        {
            long testy = B^testx;
            if (testx * testy == A)
            {
                sols.add(testx);
            }
        }
    }
    System.out.println("" + tests + " checks -> X=" + sols);
}
public static void main(String[] args)
{
    Random rand = new Random();
    for (int range=Integer.MAX_VALUE; range > 32; range -= (range>>5))
    {
        long A = rand.nextLong() & Long.MAX_VALUE;
        long X = (rand.nextInt(range)) + 2L;
        X|=1;
        long Y = A/X;
        if (Y==0)
        {
            Y = rand.nextInt(65536);
        }
        Y|=1;
        solve(X*Y, X^Y);
    }
}
You can see the results here: https://ideone.com/cEuHkQ
Looks like it usually only takes a couple thousand checks.
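For readers who prefer Python, the same low-bits technique can be sketched as follows. This is a loose port of the idea, not a line-by-line translation of the Java above, and the name solve_low_bits is made up for illustration. It assumes A is a positive odd integer.

```python
def solve_low_bits(A, B):
    """Return every (X, Y) with X <= Y, X*Y == A and X^Y == B."""
    candidates = [0]                 # values of X restricted to the bits seen so far
    bit = 1
    while A // bit >= bit:           # X <= sqrt(A), so stop once bit exceeds it
        mask = bit | (bit - 1)       # the lowest n bits
        extended = []
        for prev in candidates:
            for x in (prev, prev | bit):
                y = (x ^ B) & mask   # low bits of Y are forced by X and B
                if ((x * y) ^ A) & mask == 0:
                    extended.append(x)
        candidates = extended
        bit <<= 1
    # Final check with the full-width Y = X xor B.
    return [(x, x ^ B) for x in candidates
            if x * x <= A and x * (x ^ B) == A]


small = solve_low_bits(15, 6)
big = solve_low_bits(412975927819062465, 9399702487040)
```

For A = 15 and B = 6 the candidate list goes [0] -> [1] -> [1, 3], and only X = 3 survives the final check, so small is [(3, 5)].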
Here's a simple recursion that observes the rules we know: (1) the least significant bits of both X and Y are set since only odd multiplicands yield an odd multiple; (2) if we set X to have the highest set bit of B, Y cannot be greater than sqrt(A); and (3) set bits in X or Y according to the current bit in B.
The following Python code resulted in under 300 iterations for all but one of the random pairs I picked from Matt Timmermans' example code. But the first one took 231,199 iterations :)
from math import sqrt

def f(A, B):
    # Highest set bit of B
    i = 64
    while not ((1<<i) & B):
        i = i - 1
    X = 1 | (1 << i)
    sqrtA = int(sqrt(A))
    # Highest set bit of sqrt(A)
    j = 64
    while not ((1<<j) & sqrtA):
        j = j - 1
    if (j > i):
        i = j + 1
    memo = {"it": 0, "stop": False, "solution": []}

    def g(b, x, y):
        memo["it"] = memo["it"] + 1
        if memo["stop"]:
            return []
        if y > sqrtA or y * x > A:
            return []
        if b == 0:
            if x * y == A:
                memo["solution"].append((x, y))
                memo["stop"] = True
                return [(x, y)]
            else:
                return []
        bit = 1 << b
        if B & bit:
            return g(b - 1, x, y | bit) + g(b - 1, x | bit, y)
        else:
            return g(b - 1, x | bit, y | bit) + g(b - 1, x, y)

    g(i - 1, X, 1)
    return memo

vals = [
    (6872997084689100999, 2637233646), # 1048 checks with Matt's code
    (3461781732514363153, 262193934464), # 8756 checks with Matt's code
    (931590259044275343, 5343859294), # 4628 checks with Matt's code
    (2390503072583010999, 22219728382), # 5188 checks with Matt's code
    (412975927819062465, 9399702487040), # 8324 checks with Matt's code
    (9105477787064988985, 211755297373604352), # 3204 checks with Matt's code
    (4978113409908739575,67966612030), # 5232 checks with Matt's code
    (6175356111962773143,1264664368613886), # 3756 checks with Matt's code
    (648518352783802375, 6) # B smaller than sqrt(A)
]

for A, B in vals:
    memo = f(A, B)
    [(x, y)] = memo["solution"]
    print("x, y: %s, %s" % (x, y))
    print("A: %s" % A)
    print("x*y: %s" % (x * y))
    print("B: %s" % B)
    print("x^y: %s" % (x ^ y))
    print("%s iterations" % memo["it"])
    print("")
Output:
x, y: 4251585939, 1616572541
A: 6872997084689100999
x*y: 6872997084689100999
B: 2637233646
x^y: 2637233646
231199 iterations
x, y: 262180735447, 13203799
A: 3461781732514363153
x*y: 3461781732514363153
B: 262193934464
x^y: 262193934464
73 iterations
x, y: 5171068311, 180154313
A: 931590259044275343
x*y: 931590259044275343
B: 5343859294
x^y: 5343859294
257 iterations
x, y: 22180179939, 107776541
A: 2390503072583010999
x*y: 2390503072583010999
B: 22219728382
x^y: 22219728382
67 iterations
x, y: 9399702465439, 43935
A: 412975927819062465
x*y: 412975927819062465
B: 9399702487040
x^y: 9399702487040
85 iterations
x, y: 211755297373604395, 43
A: 9105477787064988985
x*y: 9105477787064988985
B: 211755297373604352
x^y: 211755297373604352
113 iterations
x, y: 68039759325, 73164771
A: 4978113409908739575
x*y: 4978113409908739575
B: 67966612030
x^y: 67966612030
69 iterations
x, y: 1264664368618221, 4883
A: 6175356111962773143
x*y: 6175356111962773143
B: 1264664368613886
x^y: 1264664368613886
99 iterations
x, y: 805306375, 805306369
A: 648518352783802375
x*y: 648518352783802375
B: 6
x^y: 6
59 iterations

Implementing a FIR filter using Vectors

I have implemented a FIR filter in Haskell. I don't know that much about FIR filters and my code is heavily based on an existing C# implementation. Therefore, I have a feeling that my implementation has too much of a C# style and is not really Haskell-like. I would like to know if there is a more idiomatic Haskell way of implementing my code. Ideally, I'm looking for some combination of higher-order functions (map, filter, fold, etc.) that implements the algorithm.
My Haskell code looks like this:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
  where
    help i = if i >= (U.length b - 1) then loop i (U.length b - 1) else 0
    loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
    vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
This code is based on the following C# code:
public float[] RunFilter(double[] x)
{
    int M = coeff.Length;
    int n = x.Length;
    //y[n]=b0x[n]+b1x[n-1]+....bmx[n-M]
    var y = new float[n];
    for (int yi = 0; yi < n; yi++)
    {
        double t = 0.0f;
        for (int bi = M - 1; bi >= 0; bi--)
        {
            if (yi - bi < 0) continue;
            t += coeff[bi] * x[yi - bi];
        }
        y[yi] = (float) t;
    }
    return y;
}
As you can see, it's almost a straight copy. How can I turn my implementation into a more Haskell-like one? Do you have any ideas? The only thing I could come up with was using Vector.generate.
I know that the DSP library has an implementation available. But it uses lists and is way too slow for my use case. This Vector implementation is a lot faster than the one in DSP.
I've also tried implementing the algorithm using Repa. It is faster than the Vector implementation. Here is the result:
applyFIR :: V.Vector Float -> Array U DIM1 Float -> Array D DIM1 Float
applyFIR b x = R.traverse x id (\_ (Z :. i) -> if i >= len then loop i (len - 1) else 0)
  where
    len = V.length b
    loop :: Int -> Int -> Float
    loop yi bi = if bi < 0 then 0 else (V.unsafeIndex b bi) * x !! (Z :. (yi-bi)) + loop yi (bi-1)
    arr !! i = unsafeIndex arr i
First of all, I don't think that your initial vector code is a faithful translation - that is, I think it disagrees with the C# code. For example, suppose that both "x" and "b" ("b" is coeff in C#) have length 3, and have all values of 1.0. Then for y[0] the C# code would produce x[0] * coeff[0], or 1.0. (it would hit continue for all other values of bi)
With your Haskell code, however, help 0 produces 0. Your Repa version seems to suffer from the same problem.
So let's start with a more faithful translation:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
  where
    help i = loop i (min i $ U.length b - 1)
    loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
    vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
Now, you're basically doing a calculation like this for computing, say, y[3]:
          ... b[3] | b[2] | b[1] | b[0]
              x[0] | x[1] | x[2] | x[3] | x[4] | x[5] | ....
multiply
              b[3]*x[0] | b[2]*x[1] | b[1]*x[2] | b[0]*x[3]
sum
y[3] = b[3]*x[0] + b[2]*x[1] + b[1]*x[2] + b[0]*x[3]
So one way to think of what you're doing is "take the b vector, reverse it, and to compute spot i of the result, line b[0] up with x[i], multiply all the corresponding x and b entries, and compute the sum".
So let's do that:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
  where
    revB = U.reverse b
    bLen = U.length b
    help i = let sliceLen = min (i+1) bLen
                 bSlice = U.slice (bLen - sliceLen) sliceLen revB
                 xSlice = U.slice (i + 1 - sliceLen) sliceLen x
             in U.sum $ U.zipWith (*) bSlice xSlice
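As a cross-check of the values such a filter should produce, here is a short pure-Python sketch of the same causal sum. The name apply_fir is made up for illustration and the code is a reference for small inputs, not a fast implementation. The tap count min(i + 1, len(b)) plays the same role as sliceLen above.

```python
def apply_fir(b, x):
    """y[i] = sum of b[k] * x[i - k] over the taps k that stay inside x."""
    y = []
    for i in range(len(x)):
        taps = min(i + 1, len(b))  # fewer taps are available near the start
        y.append(sum(b[k] * x[i - k] for k in range(taps)))
    return y


ones = apply_fir([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

With three coefficients and three samples all equal to 1.0, the result is [1.0, 2.0, 3.0]: y[0] is 1.0, as in the C# code, rather than the 0 that the original Haskell version returns for the first outputs.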

Finding the continued fraction of 2^(1/3) to very high precision

Here I'll use the notation [a0; a1, a2, ...] = a0 + 1/(a1 + 1/(a2 + ...)).
It is possible to find the continued fraction of a number by computing it and then applying the definition, but that requires at least O(n) bits of memory to find a0, a1 ... an, and in practice it is much worse. Using double floating point precision it is only possible to find a0, a1 ... a19.
An alternative is to use the fact that if a, b, c are rational numbers then there exist unique rationals x, y, z such that 1/(a + b*2^(1/3) + c*2^(2/3)) = x + y*2^(1/3) + z*2^(2/3), namely
x = (a^2 - 2*b*c)/D, y = (2*c^2 - a*b)/D, z = (b^2 - a*c)/D, where D = a^3 + 2*b^3 + 4*c^3 - 6*a*b*c.
So if I represent x, y and z exactly using the Boost rational library, I can obtain floor(x + y*2^(1/3) + z*2^(2/3)) accurately using only double precision for 2^(1/3) and 2^(2/3), because I only need it to be within 1/2 of the true value. Unfortunately the numerators and denominators of x, y and z grow considerably fast, and if you use regular floats instead the errors pile up quickly.
This way I was able to compute a0, a1 ... a10000 in under an hour, but somehow Mathematica can do that in 2 seconds. Here's my code for reference
#include <cstdint>
#include <iostream>
#include <boost/multiprecision/cpp_int.hpp>
namespace mp = boost::multiprecision;
int main()
{
    const double t_1 = 1.259921049894873164767210607278228350570251;
    const double t_2 = 1.587401051968199474751705639272308260391493;
    mp::cpp_rational p = 0;
    mp::cpp_rational q = 1;
    mp::cpp_rational r = 0;
    for(unsigned int i = 1; i != 10001; ++i) {
        double p_f = static_cast<double>(p);
        double q_f = static_cast<double>(q);
        double r_f = static_cast<double>(r);
        uint64_t floor = p_f + t_1 * q_f + t_2 * r_f;
        std::cout << floor << ", ";
        p -= floor;
        //std::cout << floor << " " << p << " " << q << " " << r << std::endl;
        mp::cpp_rational den = (p * p * p + 2 * q * q * q +
                                4 * r * r * r - 6 * p * q * r);
        mp::cpp_rational a = (p * p - 2 * q * r) / den;
        mp::cpp_rational b = (2 * r * r - p * q) / den;
        mp::cpp_rational c = (q * q - p * r) / den;
        p = a;
        q = b;
        r = c;
    }
    return 0;
}
The Lagrange algorithm
The algorithm is described for example in Knuth's book The Art of Computer Programming, vol 2 (Ex 13 in section 4.5.3 Analysis of Euclid's Algorithm, p. 375 in 3rd edition).
Let f be a polynomial of integer coefficients whose only real root is an irrational number x0 > 1. Then the Lagrange algorithm calculates the consecutive quotients of the continued fraction of x0.
I implemented it in Python
def cf(a, N=10):
    """
    a : list - coefficients of the polynomial,
        i.e. f(x) = a[0] + a[1]*x + ... + a[n]*x^n
    N : number of quotients to output
    """
    # Degree of the polynomial
    n = len(a) - 1
    # List of consecutive quotients
    ans = []

    def shift_poly():
        """
        Replaces polynomial f(x) with f(x+1) (shifts its graph to the left).
        """
        for k in range(n):
            for j in range(n - 1, k - 1, -1):
                a[j] += a[j+1]

    for _ in range(N):
        quotient = 1
        shift_poly()
        # While the root is >1 shift it left
        while sum(a) < 0:
            quotient += 1
            shift_poly()
        # Otherwise, we have the next quotient
        ans.append(quotient)
        # Replace polynomial f(x) with -x^n * f(1/x)
        a.reverse()
        a = [-x for x in a]
    return ans
It takes about 1s on my computer to run cf([-2, 0, 0, 1], 10000). (The coefficients correspond to the polynomial x^3 - 2 whose only real root is 2^(1/3).) The output agrees with the one from Wolfram Alpha.
Caveat
The coefficients of the polynomials evaluated inside the function quickly become quite large integers, so in other languages this approach needs some bigint implementation. (Pure Python 3 handles this natively, but NumPy's fixed-width integers, for example, do not.)
You might have more luck computing 2^(1/3) to high accuracy and then trying to derive the continued fraction from that, using interval arithmetic to determine if the accuracy is sufficient.
Here's my stab at this in Python, using Halley iteration to compute 2^(1/3) in fixed point. The dead code is an attempt to compute fixed-point reciprocals more efficiently than Python via Newton iteration -- no dice.
Timing from my machine is about thirty seconds, spent mostly trying to extract the continued fraction from the fixed point representation.
prec = 40000
a = 1 << (3 * prec + 1)
two_a = a << 1
x = 5 << (prec - 2)
while True:
    x_cubed = x * x * x
    two_x_cubed = x_cubed << 1
    x_prime = x * (x_cubed + two_a) // (two_x_cubed + a)
    if -1 <= x_prime - x <= 1: break
    x = x_prime

cf = []
four_to_the_prec = 1 << (2 * prec)
for i in range(10000):
    q = x >> prec
    r = x - (q << prec)
    cf.append(q)
    if True:
        x = four_to_the_prec // r
    else:
        x = 1 << (2 * prec - r.bit_length())
        while True:
            delta_x = (x * ((four_to_the_prec - r * x) >> prec)) >> prec
            if not delta_x: break
            x += delta_x
print(cf)
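The interval-arithmetic check mentioned above can be sketched in a few lines. This is not the Halley-based code from this answer but a simpler variant under stated assumptions: it brackets 2^(1/3) between two rationals with an integer cube root, expands both with Euclid's algorithm, and keeps only the quotients on which the two expansions agree. The names icbrt, cf_terms and cbrt2_cf are made up for illustration.

```python
def icbrt(n):
    """Floor of the cube root of a non-negative integer (binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def cf_terms(num, den):
    """Continued fraction quotients of num/den via Euclid's algorithm."""
    terms = []
    while den:
        q, r = divmod(num, den)
        terms.append(q)
        num, den = den, r
    return terms


def cbrt2_cf(bits=200):
    """Quotients of 2^(1/3) that are certain at the given fixed-point precision."""
    scale = 1 << bits
    lo = icbrt(2 << (3 * bits))       # lo/scale < 2^(1/3) < (lo + 1)/scale
    lower = cf_terms(lo, scale)
    upper = cf_terms(lo + 1, scale)
    agreed = []
    for s, t in zip(lower, upper):
        if s != t:
            break
        agreed.append(s)
    return agreed[:-1]                # drop the last one to stay conservative


terms = cbrt2_cf(200)
```

The expansion starts 1, 3, 1, 5, 1, 1, 4, 1, 1, 8. Raising bits yields more certain quotients, at the cost of larger integers.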

Python performance: iteration and operations on nested lists

Problem Hey folks. I'm looking for some advice on python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
f1 is the initial naive implementation: three nested for loops.
def f1(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
                if rows[i][j] < 255: rows[i][j] += 1  # cap at 255, as f2 does
f2 replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            l = max(0, topleft[1])
            r = min(topleft[1]+(z*2), y)
            rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, specifically recommended against by Guido ("local variable lookups are much faster than global or built-in variable lookups"). I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]

def f3(x,y,n,z):
    inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
    rows = map(g, inputs)

def g(input):
    inputX, inputY = input
    topleft = (inputX - 75, inputY - 75)
    for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
        l = max(0, topleft[1])
        r = min(topleft[1]+(75*2), 1024)
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopheD also pointed out a couple of improvements.
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
    rn = random.random
    rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]

def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
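One caveat on the capping pass in f4: masking with 0xff wraps around rather than saturating (256 & 0xff is 0). With the base case of 400 inputs a node is unlikely to be hit more than 255 times, but with N approaching 100,000 it could be. Below is a Python 3 sketch of the same two-pass idea with a true clamp. The names bump_region and saturate are made up for illustration, and it has not been timed against the versions above.

```python
def bump_region(rows, px, py, z):
    """Add 1 to every node within z of (px, py), clipped to the mesh."""
    l, r = max(0, py - z), min(py + z, len(rows[0]))
    for i in range(max(0, px - z), min(px + z, len(rows))):
        rows[i][l:r] = [v + 1 for v in rows[i][l:r]]


def saturate(rows, cap=255):
    """Clamp every node to cap in a single final pass."""
    for i, row in enumerate(rows):
        rows[i] = [v if v < cap else cap for v in row]


rows = [[0] * 4 for _ in range(4)]
for _ in range(300):              # hit the same small region 300 times
    bump_region(rows, 1, 1, 1)
wrapped = rows[0][0] & 0xff       # what the 0xff mask would store
saturate(rows)
```

After 300 hits the mask would store 44, whereas the clamp leaves 255 in the affected nodes.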
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds: (improved by ChristopheD)
import timeit

def timing(f,x,y,n,z):
    # Arguments are forwarded in the order f expects: (x, y, n, z)
    fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, n, z)
    ctx = "from __main__ import %s" % f.__name__
    results = timeit.Timer(fn, ctx).timeit(10)
    return "%4.4s: %.3f" % (f.__name__, results / 10.0)

if __name__ == "__main__":
    print timing(f, 1024, 1024, 400, 75)
    #add more here.
On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
With a simple C-coded extension, exte.c:
#include "Python.h"

static PyObject*
dopoint(PyObject* self, PyObject* args)
{
    int x, y, z, px, py;
    int b, t, l, r;
    int i, j;
    PyObject* rows;
    if(!PyArg_ParseTuple(args, "iiiiiO",
                         &x, &y, &z, &px, &py, &rows
        ))
        return 0;
    b = px - z;
    if (b < 0) b = 0;
    t = px + z;
    if (t > x) t = x;
    l = py - z;
    if (l < 0) l = 0;
    r = py + z;
    if (r > y) r = y;
    for(i = b; i < t; ++i) {
        PyObject* row = PyList_GetItem(rows, i);
        for(j = l; j < r; ++j) {
            PyObject* pyitem = PyList_GetItem(row, j);
            long item = PyInt_AsLong(pyitem);
            if (item < 255) {
                PyObject* newitem = PyInt_FromLong(item + 1);
                PyList_SetItem(row, j, newitem);
            }
        }
    }
    Py_RETURN_NONE;
}

static PyMethodDef exteMethods[] = {
    {"dopoint", dopoint, METH_VARARGS, "process a point"},
    {0}
};

void
initexte()
{
    Py_InitModule("exte", exteMethods);
}
(Note: I haven't checked it carefully. I think it doesn't leak memory, thanks to the correct interplay of reference stealing and borrowing, but it should be code-inspected very carefully before being put in production;-) With that extension, we could do
import exte
def f4(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).
1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
    rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457
In your f3 rewrite, g can be simplified. (The same can be applied to f4.)
You have the following code inside a for loop.
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.
Based on your f3 version I played with the code. As l and r are constant within the loop, you can avoid recomputing them in the g1 loop. Using the new ternary if instead of min and max also seems to be consistently faster. I also simplified the expression with topleft. On my system the code below is about 20% faster.
def f3b(x,y,n,z):
    rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]

def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/
