mpi4py: Internal Error: invalid error code 409e0e (Ring ids do not match) - parallel-processing

I am coding in python and using mpi4py to do some optimization in parallel. I am using Ordinary Least Squares, and my data is too large to fit on one processor, so I have a master process that then spawns other processes. These child processes each import a section of the data that they respectively work with throughout the optimization process.
I am using scipy.optimize.minimize for the optimization, so the child processes receive a coefficient guess from the parent process, and then report the sum of squared error (SSE) to the parent process, and then scipy.optimize.minimize goes through iterations, trying to find a minimum for the SSE. After each iteration of the minimize function, the parent broadcasts the new coefficient guesses to the child processes, who then calculate the SSE again. In the child processes, this algorithm is set up in a while loop. In the parent process, I simply call scipy.optimize.minimize.
On the part that is giving me a problem, I am doing a nested optimization, or an optimization within an optimization. The inner optimization is an OLS regression as described above, and then the outer optimization is minimizing another function that uses the coefficient of the inner optimization (the OLS regression).
So in my parent process, I have two functions that I minimize, and the second function calls on the first and does a new optimization for every iteration of the second function's optimization. The child processes have a nested while loop for those two optimizations.
Hopefully that all makes sense. If more information is needed, please let me know.
Here is the relevant code for the parent process:
comm = MPI.COMM_SELF.Spawn(sys.executable,args = ['IVQTparallelSlave_cdf.py'],maxprocs=processes)
# First stage: reg D on Z, X
def OLS(betaguess):
comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
SSE = np.array([0],dtype='d')
comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
comm.Bcast([np.array([1],'i'),MPI.INT], root=MPI.ROOT)
return SSE
# Here is the CDF function.
def CDF(yguess, delta_FS, tau):
# Calculate W(y) in the slave process
# Solving the Reduced form after every iteration: reg W(y) on Z, X
comm.Bcast([yguess,MPI.DOUBLE], root=MPI.ROOT)
betaguess = np.zeros(94).astype('d')
###########
# This calculates the reduced form coefficient
coeffs_RF = scipy.minimize(OLS,betaguess,method='Powell')
# This little block is to get the slave processes to stop
comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
SSE = np.array([0],dtype='d')
comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
cont = np.array([0],'i')
comm.Bcast([cont,MPI.INT], root=MPI.ROOT)
###########
contCDF = np.array([1],'i')
comm.Bcast([contCDF,MPI.INT], root=MPI.ROOT) # This is to keep the outer while loop going
delta_RF = coeffs_RF.x[1]
return abs(delta_RF/delta_FS - tau)
########### This one finds Y(1) ##############
betaguess = np.zeros(94).astype('d')
######### First Stage: reg D on Z, X #########
coeffs_FS = scipy.minimize(OLS,betaguess,method='Powell')
print coeffs_FS
# This little block is to get the slave processes' while loops to stop
comm.Bcast([betaguess,MPI.DOUBLE], root=MPI.ROOT)
SSE = np.array([0],dtype='d')
comm.Reduce(None,[SSE,MPI.DOUBLE], op=MPI.SUM, root = MPI.ROOT)
cont = np.array([0],'i')
comm.Bcast([cont,MPI.INT], root=MPI.ROOT)
delta_FS = coeffs_FS.x[1]
######### CDF Function #########
yguess = np.array([3340],'d')
CDF1 = lambda yguess: CDF(yguess, delta_FS, tau)
y_minned_1 = scipy.minimize(CDF1,yguess,method='Powell')
Here is the relevant code for the child processes:
#IVQTparallelSlave_cdf.py
comm = MPI.Comm.Get_parent()
.
.
.
# Importing data. The data is the matrices D, and ZX
.
.
.
########### This one finds Y(1) ##############
######### First Stage: reg D on Z, X #########
cont = np.array([1],'i')
betaguess = np.zeros(94).astype('d')
# This corresponds to 'coeffs_FS = scipy.minimize(OLS,betaguess,method='Powell')' of the parent process
while cont[0]:
comm.Bcast([betaguess,MPI.DOUBLE], root=0)
SSE = np.array(((D - np.dot(ZX,betaguess).reshape(local_n,1))**2).sum(),'d')
comm.Reduce([SSE,MPI.DOUBLE],None, op=MPI.SUM, root = 0)
comm.Bcast([cont,MPI.INT], root=0)
if rank==0: print '1st Stage OLS regression done'
######### CDF Function #########
cont = np.array([1],'i')
betaguess = np.zeros(94).astype('d')
contCDF = np.array([1],'i')
yguess = np.array([0],'d')
# This corresponds to 'y_minned_1 = spicy.minimize(CDF1,yguess,method='Powell')'
while contCDF[0]:
comm.Bcast([yguess,MPI.DOUBLE], root=0)
# This calculates the reduced form coefficient
while cont[0]:
comm.Bcast([betaguess,MPI.DOUBLE], root=0)
W = 1*(Y<=yguess)*D
SSE = np.array(((W - np.dot(ZX,betaguess).reshape(local_n,1))**2).sum(),'d')
comm.Reduce([SSE,MPI.DOUBLE],None, op=MPI.SUM, root = 0)
comm.Bcast([cont,MPI.INT], root=0)
#if rank==0: print cont
comm.Bcast([contCDF,MPI.INT], root=0)
My problem is that after one iteration through the outer minimization, it spits out the following error:
Internal Error: invalid error code 409e0e (Ring ids do not match) in MPIR_Bcast_impl:1328
Traceback (most recent call last):
File "IVQTparallelSlave_cdf.py", line 100, in <module>
if rank==0: print 'CDF iteration'
File "Comm.pyx", line 406, in mpi4py.MPI.Comm.Bcast (src/mpi4py.MPI.c:62117)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Bcast(1478).....: MPI_Bcast(buf=0x2409f50, count=1, MPI_INT, root=0, comm=0x84000005) failed
MPIR_Bcast_impl(1328):
I haven't been able to find any information about this "ring id" error or how to fix it. Help would be much appreciated. Thanks!

Related

TensorFlow - directly calling tf.function much faster than calling tf.function returned from wrapper

I am training a VAE (using federated learning, but that is not so important) and wanted to keep the loss and train functions simple to exchange. The initial approach was to have a tf.function as loss function and a tf.function as train function as follows:
#tf.function
def kl_reconstruction_loss(model, model_input, beta):
x, y = model_input
mean, logvar = model.encode(x, y)
z = model.reparameterize(mean, logvar)
x_logit = model.decode(z, y)
cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
reconstruction_loss = tf.reduce_mean(tf.reduce_sum(cross_ent, axis=[1, 2, 3]), axis=0)
kl_loss = tf.reduce_mean(0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1. - logvar, axis=-1), axis=0)
loss = reconstruction_loss + beta * kl_loss
return loss, kl_loss, reconstruction_loss
#tf.function
def train_fn(model: tf.keras.Model, batch, optimizer, kl_beta):
"""Trains the model on a single batch.
Args:
model: The VAE model.
batch: A batch of inputs [images, labels] for the vae.
optimizer: The optimizer to train the model.
beta: Weighting of KL loss
Returns:
The loss.
"""
def vae_loss():
"""Does the forward pass and computes losses for the generator."""
# N.B. The complete pass must be inside loss() for gradient tracing.
return kl_reconstruction_loss(model, batch, kl_beta)
with tf.GradientTape() as tape:
loss, kl_loss, rc_loss = vae_loss()
grads = tape.gradient(loss, model.trainable_variables)
grads_and_vars = zip(grads, model.trainable_variables)
optimizer.apply_gradients(grads_and_vars)
return loss
For my dataset this results in an epoch duration of approx. 25 seconds. However, since I have to call those functions directly in my code, I would have to enter different ones if I would want to try out different loss/train functions.
So, alternatively, I followed https://github.com/google-research/federated/tree/master/gans and wrapped the loss function in a class and the train function in another function. Now I have:
class VaeKlReconstructionLossFns(AbstractVaeLossFns):
#tf.function
def vae_loss(self, model, model_input, labels, global_round):
# KL Reconstruction loss
mean, logvar = model.encode(model_input, labels)
z = model.reparameterize(mean, logvar)
x_logit = model.decode(z, labels)
cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=model_input)
reconstruction_loss = tf.reduce_mean(tf.reduce_sum(cross_ent, axis=[1, 2, 3]), axis=0)
kl_loss = tf.reduce_mean(0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1. - logvar, axis=-1), axis=0)
loss = reconstruction_loss + self._get_beta(global_round) * kl_loss
if model.losses:
loss += tf.add_n(model.losses)
return loss, kl_loss, reconstruction_loss
def create_train_vae_fn(
vae_loss_fns: vae_losses.AbstractVaeLossFns,
vae_optimizer: tf.keras.optimizers.Optimizer):
"""Create a function that trains VAE, binding loss and optimizer.
Args:
vae_loss_fns: Instance of gan_losses.AbstractVAELossFns interface,
specifying the VAE training loss.
vae_optimizer: Optimizer for training the VAE.
Returns:
Function that executes one step of VAE training.
"""
# We check that the optimizer has not been used previously, which ensures
# that when it is bound the train fn isn't holding onto a different copy of
# the optimizer variables then the copy that is being exchanged b/w server and
# clients.
if vae_optimizer.variables():
raise ValueError(
'Expected vae_optimizer to not have been used previously, but '
'variables were already initialized.')
#tf.function
def train_vae_fn(model: tf.keras.Model,
model_inputs,
labels,
global_round,
new_optimizer_state=None):
"""Trains the model on a single batch.
Args:
model: The VAE model.
model_inputs: A batch of inputs (usually images) for the VAE.
labels: A batch of labels corresponding to the inputs.
global_round: The current glob al FL round for beta calculation
new_optimizer_state: A possible optimizer state to overwrite the current one with.
Returns:
The number of examples trained on.
The loss.
The updated optimizer state.
"""
def vae_loss():
"""Does the forward pass and computes losses for the generator."""
# N.B. The complete pass must be inside loss() for gradient tracing.
return vae_loss_fns.vae_loss(model, model_inputs, labels, global_round)
# Set optimizer vars
optimizer_state = get_optimizer_state(vae_optimizer)
if new_optimizer_state is not None:
# if optimizer is uninitialised, initialise vars
try:
tf.nest.assert_same_structure(optimizer_state, new_optimizer_state)
except ValueError:
initialize_optimizer_vars(vae_optimizer, model)
optimizer_state = get_optimizer_state(vae_optimizer)
tf.nest.assert_same_structure(optimizer_state, new_optimizer_state)
tf.nest.map_structure(lambda a, b: a.assign(b), optimizer_state, new_optimizer_state)
with tf.GradientTape() as tape:
loss, kl_loss, rc_loss = vae_loss()
grads = tape.gradient(loss, model.trainable_variables)
grads_and_vars = zip(grads, model.trainable_variables)
vae_optimizer.apply_gradients(grads_and_vars)
return tf.shape(model_inputs)[0], loss, optimizer_state
return train_vae_fn
This new formulation takes about 86 seconds per epoch.
I am struggling to understand why the second version performs so much worse than the first one. Does anyone have a good explanation for this?
Thanks in advance!
EDIT: My Tensorflow version is 2.5.0

how to formulate the problem of finding the optimal PID paramters in gekko?

I have defined a first order process model and would like to find the optimal PID parameters for this process. The optimization objective is to minimize the IAE ( Integral of absolute error between the setpoint and process value) for set point change over a horizon of 5 times the process time constant.
It is neither a dynamic optimization ( IMODE =6) problem , nor a pure steady state optimization problem (IMODE=3) as it involves the derivatives. How to formulate the above problem in gekko?
m = GEKKO(remote=False)
# Controller model
Kc = m.Var(1.0,lb=0.01,ub=10) # controller gain
tauI = m.Var(2.0,lb=0.01,ub=1000) # controller reset time
tauD = m.Var(1.0,lb=0.0,ub=100) # derivative constant
OP = m.Var(value=0.0,lb=0.0,ub=100) # controller output
PV = m.Var(value=0.0) # process variable
SP = 1.0 # set point
Intgl = m.Var(value=0.0) # integral of the error
err = m.Intermediate(SP-PV) # set point error
m.Equation(Intgl.dt()==err) # integral of the error
m.Equation(OP == Kc*(err + (1/tauI)*Intgl + tauD*PV.dt()))
# Process model
Kp = 2 # process gain
tauP = 10.0 # process time constant
m.Equation(tauP*PV.dt() + PV == Kp*OP)
m.Obj((SP-PV)**2) # how to define the objective to minimize the error over a horizon
m.options.IMODE=3
m.solve(disp=False)
print(str(Kc.VALUE))
print(str(tauI.VALUE))
print(str(tauD.VALUE))
print(str(m.options.OBJFCNVAL))
There is a video tutorial on simulating (00:00-17:00) and optimizing (17:00-23:41) PID tuning parameters with GEKKO. There is starting code as problem #14 in this list of tutorials.
The main points from the video are to switch to IMODE=6 and set the STATUS=1 for the parameters that should be adjusted to minimize the error: (SP-PV)**2.

Tensorflow doesn't calculate summary

I try to understand how to collect summaries for tensorboard and wrote a simple code to increment x from 1 till 5.
For some unknown reason I see variable My_x as 0 in all steps.
import tensorflow as tf
tf.reset_default_graph() # To clear the defined variables/operations
# create the scalar variable
x = tf.Variable(0, name='x')
# ____step 1:____ create the scalar summary
x_summ = tf.summary.scalar(name='My_x', tensor=x)
# accumulate all summaries
merged_summary = tf.summary.merge_all()
# create the op for initializing all variables
model = tf.global_variables_initializer()
# launch the graph in a session
with tf.Session() as session:
# ____step 2:____ creating the writer inside the session
summary_writer = tf.summary.FileWriter('output', session.graph)
for i in range(5):
#initialize variables
session.run(model)
x = x + 1
# ____step 3:____ evaluate the scalar summary
merged_summary_ans, x_summ_ans, x_ans = session.run([merged_summary, x_summ, x])
print(x_ans)
print(x_summ_ans)
print(merged_summary_ans)
# ____step 4:____ add the summary to the writer (i.e. to the event file)
summary_writer.add_summary(summary=x_summ_ans, global_step=i)
summary_writer.flush()
summary_writer.close()
print('Done with writing the scalar summary')
There are two problems that I can see in your code:
1) The first is that in each loop you are re-initialising the global variables again. This is resetting x back to its original value (0).
2) Second of all when you are updating x you are overwriting the link to the variable with a TensorFlow addition operation. Your code to increase x replaces 'x' with a tf.add operation and then your summary value is no longer tracing a tf.Variable but an addition operation. If you add "print(x)" after you define it and have it run once in every loop, you will see that originally it starts out as <tf.Variable 'x:0' shape=() dtype=int32_ref> but then after seeing that "x = x+1" then print(x) becomes Tensor("add:0", shape=(), dtype=int32). Here you can see that tf.summary.scalar is only compatible with the original value and you can see why it can't be updated.
Here is code I altered to get it to work so you can see the linear of the value of x in Tensorboard.
import tensorflow as tf
tf.reset_default_graph()
x = tf.Variable(0, name='x')
x_summary = tf.summary.scalar('x_', x)
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
merged_summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter('output', session.graph)
for i in range(5):
print(x.eval())
summary = session.run(merged_summary_op)
summary_writer.add_summary(summary, i)
session.run(tf.assign(x, x+1))
summary_writer.flush()
summary_writer.close()

reading in parallel from" generator in Keras

I have a big dataset divided in files.
I would like to read and process my data one file at the time and for this I have this keras generator:
def myGenerator():
while 1:
rnd = random.randint(1,200)
strRnd = str(rnd)
lenRnd = len(strRnd)
rndPadded = strRnd.rjust(5, '0')
nSearchesInBatch = 100
f = "path/part-" + rndPadded + "*" #read one block of data
data = sqlContext.read.load(f).toPandas()
imax = int(data.shape[0]/nSearchesInBatch) #number of batches that will be created sequentially from the generator
for i in range(imax):
data_batch = data[i*nSearchesInBatch:(i+1)*nSearchesInBatch]
features = data_batch['features']
output = data_batch['output']
yield features, output
The problem is that the reading takes the biggest part (each file is around 200mb), and in the meanwhile the GPU sits waiting, it is possible to pre-read the next batch while the GPU is traning on the previous one?
At the moment one file is read and split in steps (the inner loop), the CPUs are hidden and the GPU training, as soon as the epoch finishes, the GPU goes idle and the cpu start reading (which takes 20/30 seconds).
Any solution to parallelize this?

PID control - value of process parameter based on PID result

I'm trying to implement a PID controller following http://en.wikipedia.org/wiki/PID_controller
The mechanism I try to control works as follows:
1. I have an input variable which I can control. Typical values would be 0.5...10.
2. I have an output value which I measure daily. My goal for the output is roughly at the same range.
The two variables have strong correlation - when the process parameter goes up, the output generally goes up, but there's quite a bit of noise.
I'm following the implementation here:
http://code.activestate.com/recipes/577231-discrete-pid-controller/
Now the PID seems like it is correlated with the error term, not the measured level of output. So my guess is that I am not supposed to use it as-is for the process variable, but rather as some correction to the current value? How is that supposed to work exactly?
For example, if we take Kp=1, Ki=Kd=0, The process (input) variable is 4, the current output level is 3 and my target is a value of 2, I get the following:
error = 2-3 = -1
PID = -1
Then I should set the process variable to -1? or 4-1=3?
You need to think in terms of the PID controller correcting a manipulated variable (MV) for errors, and that you need to use an I term to get to an on-target steady-state result. The I term is how the PID retains and applies memory of the prior behavior of the system.
If you are thinking in terms of the output of the controller being changes in the MV, it is more of a 'velocity form' PID, and the memory of prior errors and behavior is integrated and accumulated in the prior MV setting.
From your example, it seems like a manipulated value of -1 is not feasible and that you would like the controller to suggest a value like 3 to get a process output (PV) of 2. For a PID controller to make use of "The process (input) variable is 4,..." (MV in my terms) Ki must be non-zero, and if the system was at steady-state, whatever was accumulated in the integral (sum_e=sum(e)) would precisely equal 4/Ki, so:
Kp= Ki = 1 ; Kd =0
error = SV - PV = 2 - 3 = -1
sum_e = sum_e + error = 4/Ki -1
MV = PID = -1(Kp) + (4/Ki -1)Ki = -1Kp + 4 - 1*Ki = -1 +4 -1 = 2
If you used a slower Ki than 1, it would smooth out the noise more and not adjust the MV so quickly:
Ki = 0.1 ;
MV = PID = -1(Kp) + (4/Ki -1)Ki = -1Kp + 4 - 1*Ki = -1 +4 -0.1 = 2.9
At steady state at target (PV = SV), sum_e * Ki should produce the steady-state MV:
PV = SV
error = SV - PV = 0
Kp * error = 0
MV = 3 = PID = 0 * Kp + Ki * sum_e
A nice way to understand the PID controller is to put units on everything and think of Kp, Ki, Kd as conversions of the process error, accumulated error*timeUnit, and rate-of-change of error/timeUnit into terms of the manipulated variable, and that the controlled system converts the controller's manipulated variable into units of output.

Resources