Implementation of model parallelism in tensorflow - parallel-processing

I'm currently working on a system with 2 GPUs, each with 12 GB of memory. I want to implement model parallelism across the two GPUs to train large models. I have looked all over the internet, SO, the TensorFlow documentation, etc. I was able to find explanations of model parallelism and its results, but nowhere did I find a small tutorial or code snippet showing how to implement it using TensorFlow. I mean, we have to exchange activations after every layer, right? So how do we do that? Is there a specific or cleaner way of implementing model parallelism in TensorFlow? It would be very helpful if you could suggest a place where I can learn to implement it, or a simple example like MNIST training on multiple GPUs using model parallelism.
Note: I have done data parallelism, as in the CIFAR10 multi-GPU tutorial, but I haven't found any implementation of model parallelism.

Here's an example. The model has some parts on GPU0, some parts on GPU1 and some parts on the CPU, so this is 3-way model parallelism.
import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.train)
# Each tf.device block pins the ops and variables created inside it to that device.
with tf.device("/gpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/gpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
# The loss consumes results from both GPUs; TensorFlow copies the tensors between devices.
with tf.device("/cpu:0"):
    loss = a + b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)

Related

Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection

I am using the H2O autoencoder in R for anomaly detection. I don't have a separate training dataset, so I am using data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudocode of the model.
# Deeplearning Model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE, activation = "Tanh", hidden = c(25,25,25), variable_importances = TRUE)
# Anomaly Detection Algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to pick a subset of the 10 features before the data go into the deep learning model (with autoencoder=TRUE), in case some features are significantly associated with each other? Or is that unnecessary, since the autoencoder already compresses the data and keeps only the most important information, which would make feature selection redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose action is anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to decrease the number of features fed into the model. I can't say I know what would happen during training, but collinear/associated features could be eliminated in the hidden layers, as you said. You could try adjusting the hidden layers and see how the model behaves, e.g. changing hidden = c(25,25,25) to hidden = c(25,10,25), hidden = c(15,15), or even hidden = c(7, 5, 7) for your few features.
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Will you average their error?
That said, it sounds like Example A responds more sharply to anomalies in the sorted error list (only a few rows have high error). I would sample different rows and check, as a domain expert, whether the errors make sense, i.e. whether rows with higher errors tend to look like anomalies.
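If Example A is used to judge sender-receiver pairs, the per-row reconstruction errors have to be combined per pair somehow. Here is a minimal Python sketch of the averaging option mentioned above. It assumes the errors returned by h2o.anomaly have already been exported as (sender, receiver, error) tuples; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def mean_error_per_pair(rows):
    """Average the reconstruction error over all rows of each (sender, receiver) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [sum of errors, row count]
    for sender, receiver, err in rows:
        acc = totals[(sender, receiver)]
        acc[0] += err
        acc[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


# Hypothetical per-row errors in the Example A layout (one row per action).
rows = [
    ("s1", "r1", 0.002),
    ("s1", "r1", 0.004),
    ("s2", "r1", 0.050),
]
pair_errors = mean_error_per_pair(rows)
# Pairs sorted from most to least anomalous by mean error.
ranked = sorted(pair_errors, key=pair_errors.get, reverse=True)
print(ranked)
```

Averaging is only one choice. Taking the maximum per pair instead would flag a pair as soon as a single action looks unusual, which may or may not match the purpose of the model.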

The implementation of Adaboost on neural network

Hi, I recently took a course and did some reading on AdaBoost.
I have seen some code that uses AdaBoost to boost the performance of a neural network.
As far as I know, with multiple classes AdaBoost works as follows:
(1) Initialize the weight of every training example to 1.
(2) After training, re-weight the data: increase the weight of the examples the classifier got wrong and reduce the weight of the examples it predicted correctly.
(3) Finally, combine all the classifiers and take the class with the highest (probability) score.
I could write some code for it with Keras and sklearn:
model = Model( img_input , o )
model.fit_generator(#some parameters)
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(base_estimator=model,algorithm='SAMME')
adaboost.fit_generator(#some parameters)
My question is:
I would like to know how AdaBoost is used with a neural network.
I can imagine two ways to do this, and I am not sure which one AdaBoost uses here:
(1) After training completes (1 hour), re-weight the training data and train again, repeating until the iterations are over.
(2) Re-weight the training data as soon as the first pass of all the data has been fed into the neural network.
The difference between (1) and (2) is how one iteration of AdaBoost is defined:
(1) would take too long to complete all the iterations.
(2) somehow doesn't make sense to me, because I don't think the whole process would converge that fast, or else the number of iterations would need to be set large.
It seems that only a few people go this way.
I think I would choose the "stack" method.
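The re-weighting step described above can be sketched as a single boosting round in plain Python. This is a minimal illustration of the multi-class SAMME update, not the code scikit-learn runs internally. The function name is hypothetical, and the weights here start at 1/n rather than 1 so that they sum to one.

```python
import math


def samme_round(weights, y_true, y_pred, n_classes):
    """One SAMME round: return the classifier weight alpha and the updated example weights."""
    total = sum(weights)
    # Weighted error rate of the classifier that was just trained.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    # Misclassified examples are scaled up; renormalizing shrinks the correct ones.
    raised = [w * math.exp(alpha) if t != p else w
              for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raised)
    return alpha, [w / norm for w in raised]


# Hypothetical labels and predictions for four examples and three classes.
y_true = [0, 1, 2, 1]
y_pred = [0, 1, 2, 0]
alpha, new_weights = samme_round([0.25] * 4, y_true, y_pred, n_classes=3)
print(alpha, new_weights)
```

In the standard algorithm this update runs once per fully trained base classifier, which corresponds to option (1) in the question: every boosting iteration trains a new network on the re-weighted data.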

Get cross_validation_holdout_predictions() of models from a grid search

I'm trying to calculate performance differently from the metrics that are currently built in for models.
I would like to access raw predictions during cross-validation, so I can calculate performance on my own.
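import h2o
g = h2o.get_grid(grid_id)  # grid_id: ID of a previously trained grid search
rrc = {}  # model_id -> cross-validation holdout predictions
for m in g.models:
    print("Model %s" % m.model_id)
    rrc[m.model_id] = m.cross_validation_holdout_predictions()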
I could just run prediction with a model on my dataset, but I think then this test might be biased because the model has seen this data before, or not? Can I take new predictions made on the same data set and use it to calculate performance?
"I would like to access raw predictions during cross-validation, so I can calculate performance on my own."
If you want to calculate a custom metric on the cross-validated predictions, then set keep_cross_validation_predictions = True and you can access the raw predicted values using the .cross_validation_holdout_predictions() method like you have above.
"Can I take new predictions made on the same data set and use it to calculate performance?"
It sounds like you're asking if you can use only training data to estimate model performance? Yes, using cross-validation. If you set nfolds > 1, H2O will do cross-validation and compute a handful of cross-validated performance metrics for you. Also, if you tell H2O to save the cross-validated predictions, you can compute "cross-validated metrics" of your own.
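To see why holdout predictions are not biased in the way plain predictions on the training data are, here is a small plain-Python sketch of the idea. It is not H2O code. The "model" is just the mean of the training targets, the modulo fold assignment is an assumption made for illustration, and all names and values are hypothetical.

```python
def kfold_holdout_predictions(y, nfolds):
    """Predict every row with a model that was fit without that row's fold."""
    n = len(y)
    preds = [None] * n
    for fold in range(nfolds):
        holdout = [i for i in range(n) if i % nfolds == fold]
        train = [i for i in range(n) if i % nfolds != fold]
        # Stand-in "model": the mean target of the training rows only.
        fold_mean = sum(y[i] for i in train) / len(train)
        for i in holdout:
            preds[i] = fold_mean
    return preds


def mean_absolute_error(y, preds):
    """Example of a custom metric computed on the holdout predictions."""
    return sum(abs(a - b) for a, b in zip(y, preds)) / len(y)


y = [1.0, 2.0, 3.0, 4.0]
preds = kfold_holdout_predictions(y, nfolds=2)
mae = mean_absolute_error(y, preds)
print(preds, mae)
```

Each prediction comes from a model that never saw that row, so a metric computed on these predictions estimates performance on unseen data even though only the training set was used.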

How do I constrain the outputs of Gaussian Processes in PYMC?

So I have a very challenging MCMC run I would like to do in PyMC, which I have run several times before for much simpler analyses. However, my newest challenge requires me to combine many different Gaussian Processes in a very specific way, and I don't know enough about Gaussian processes in general or how they are implemented in PyMC to engineer the code I need.
Here is the problem I am trying to tackle:
The data I have is five time series (we'll call them A(t), B(t), C(t), D(t), and E(t)) , each measurement of which has Gaussian/Normal uncertainties. Each of these can be modeled as the product of one series-specific efficiency function and one underlying function shared between all five time series, so A(t) = a(t) * f(t), B(t) = b(t) * f(t), C(t) = c(t) * f(t), etc... I need to measure the posterior for f(t), or more specifically, the posterior of the integral of f(t) dt over a domain.
So I have read over some documentation about implementing Gaussian Processes in PyMC, but I have a few additional wrinkles with my efficiency functions that need to be addressed specifically before I can start coding up my model. Mainly -
1) I have no strong prior about the shape of the efficiency functions a(t), b(t), etc... So long as they vary smoothly there is no shape that is strongly forbidden.
2) These efficiency functions are physically bound to be between 0 and 1 for all times. So while I have no prior on the shape of the curve it has to fall between these bounds. I do have some prior about its typical value but since I need to marginalize over it I can't put too many other constraints on this.
Has anyone out there tackled a similar type of problem before, and what might be the most elegant way to guarantee that my efficiency priors are implemented in this complex MCMC run? I simply don't know enough about Gaussian Processes/Covariance functions to know how to force these constraints on the data.

Practical guidelines on setting weights for examples in vowpal wabbit

I have a multi-class classification problem on a data set with 6 target classes. The training data has a skewed distribution of class labels. Below are the counts for each of the class labels (1 to 6):
(array([174171, 12, 29, 8285, 9996, 11128]),
I am using vowpal wabbit's oaa scheme to classify and have tried the default weight of 1.0 for each example. However for most models this just results in the model predicting 1.0 for all examples in the evaluation (as label 1 has a very large representation in the training set).
I am trying to now experiment with different weights that I can apply to the examples of each class to help boost the performance of the classifier.
Any pointers or practical tips on techniques to decide on weights of each example would be very useful. One possible technique was to weigh the example in inverse ratio according to their frequency. Unfortunately this seems to result in the classifier being biased greatly towards Labels 2 and 3 , and predicting 2 and 3 for almost everything in the evaluation.
Would the choice of model play a role in deciding the weights? I am experimenting with neural networks and with logistic and hinge loss functions.
There may be better approaches, but I would start, like you did, by inverse weighting the examples based on the rarity of their labels as follows:
Sum of counts of labels = 174171 + 12 + 29 + 8285 + 9996 + 11128 = 203621 so
Label 1 appearing 174171 times (85.5% of total) would be weighted: 203621/174171 = 1.16909
Label 2 appearing 12 times (rarest) would be weighted: 203621/12 = 16968.4
and so on.
Make sure the examples in the train-set are well shuffled. This is of critical importance in online learning. Having the same label examples lumped together is a recipe for very poor online performance.
If you did shuffle well, and you get bad performance on new examples, you can reweight less aggressively, for example take the sqrt() of the inverse weights, then if that's still too aggressive, switch to log() of the inverse weights, etc.
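The weighting schemes above can be computed in a few lines of Python. This is a minimal sketch that uses the label counts from the question; the variable names are hypothetical, and how the weights are written into the training file is left out.

```python
import math

# Label counts from the question (labels 1 to 6).
counts = {1: 174171, 2: 12, 3: 29, 4: 8285, 5: 9996, 6: 11128}
total = sum(counts.values())

# Plain inverse-frequency weights: rare labels get very large weights.
inverse = {label: total / n for label, n in counts.items()}
# Less aggressive alternatives: square root, then log, of the inverse weights.
sqrt_weights = {label: math.sqrt(w) for label, w in inverse.items()}
log_weights = {label: math.log(w) for label, w in inverse.items()}

for label in sorted(counts):
    print(label, inverse[label], sqrt_weights[label], log_weights[label])
```

The ratio between the largest and smallest weight shrinks with each step from inverse to sqrt to log, which is what makes the later schemes less likely to push every prediction toward the rare labels.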
Another approach is to use one of the new cost-sensitive multi-class options, e.g. --csoaa
The VW wiki on github has some examples with details on how to use these options and their training-set formats.
The loss function chosen should definitely have an effect. However note that generally, when using multi-class, or any other reduction-based option in vw, you should leave the --loss_function alone and let the algorithm use its built-in default. If you try a different loss function and get better results than the reduction built-in loss-function, this may be of interest to the developers of vw, please report it as a bug.
