Model takes twice the memory footprint with distributed data parallel

I have a model that trains just fine on a single GPU, but I get CUDA out-of-memory errors when I switch to PyTorch distributed data parallel (DDP). Specifically, the DDP model takes up twice the memory of the model with no parallelism. Here is a minimal reproducible example:
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch

def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]
    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))
    # move the model to the GPU (the first allocation in the output below)
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    print(device_id, torch.cuda.memory_allocated(device_id))

def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)

if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)
    # Then test multiple GPUs
    train_distributed()
Output - note that the GPU usage doubles on both devices when switching to DDP:
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
Why does the model take up twice the space in DDP? Is it intended behavior? Is there a way to avoid this extra memory usage?

I'm adding here the solution that ptrblck wrote on the PyTorch discussion forum.
Here are two quotes.
The statement:
[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel
And the answer:
[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant
So the memory footprint doubles because the Reducer allocates gradient buckets that are as large as the model parameters themselves.
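As a rough sanity check, the numbers printed in the question can be reproduced by hand. The sketch below assumes float32 parameters, that the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes, and that the Reducer places all gradients in one flat bucket. These are assumptions made for illustration, not something stated in the quotes above.

```python
# Back-of-the-envelope memory estimate for torch.nn.Linear(1000, 1000).
BYTES_PER_FLOAT32 = 4
BLOCK = 512  # assumed rounding granularity of the CUDA caching allocator


def rounded(nbytes, block=BLOCK):
    """Round an allocation size up to the next multiple of `block` bytes."""
    return -(-nbytes // block) * block


weight_bytes = 1000 * 1000 * BYTES_PER_FLOAT32  # 1000 x 1000 weight matrix
bias_bytes = 1000 * BYTES_PER_FLOAT32           # 1000-element bias vector

# The parameters are two separate allocations (weight and bias).
params = rounded(weight_bytes) + rounded(bias_bytes)

# Assumption: one flat gradient bucket covering every parameter.
bucket = rounded(weight_bytes + bias_bytes)

total = params + bucket

print(params)  # 4004352, matches the single-GPU output
print(total)   # 8008704, matches the output after wrapping in DDP
```

Under these assumptions the extra 4004352 bytes are exactly one more copy of the parameters, which is consistent with the "2x model_parameter_size" statement.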

Try setting gradient_as_bucket_view=True to save memory. As the documentation says:
gradient_as_bucket_view (bool) – When set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Moreover, it avoids the overhead of copying between gradients and allreduce communication buckets. When gradients are views, detach_() cannot be called on the gradients. If hitting such errors, please fix it by referring to the zero_grad() function in torch/optim/ as a solution.


PyTorch: RAM explodes when using multiprocessing SharedMemory and CUDA

I would like to use multiprocessing to launch multiple training instances on a CUDA device. Since the data is common to all processes, I want to avoid copying it for every process. I'm using Python 3.8's SharedMemory from the multiprocessing module to achieve this, following this SO example.
I can allocate a memory block using SharedMemory and create as many processes as I'd like with constant memory (RAM) usage. However, when I try to send tensors to CUDA, the memory scales linearly with the number of processes. It appears that when c.to(device) is called, the base data is copied for every process.
Does anyone know why this is happening? Any ideas to mitigate this issue?
Here is the sample code I'm using:
import numpy as np
from multiprocessing import shared_memory, get_context
import time
import torch
import copy

dim = 10000
batch_size = 10
sleep_time = 2
npe = 1  # number of parallel executions

# cuda
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = "cpu"
device = torch.device(dev)

def step(i, shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_arr = np.ndarray((dim, dim), dtype=np.float32, buffer=existing_shm.buf)
    b = np_arr[i * batch_size: (i + 1) * batch_size, :]
    b = torch.Tensor(b)
    # This is just to explicitly copy the tensor so that it has nothing to do
    # with the shared memory block
    c = copy.deepcopy(b)
    # If tensor c is sent to the cuda device, then RAM scales linearly
    # with the number of parallel executions.
    # If c is not sent to cuda device, memory consumption is constant.
    c = c.to(device)

def create_shared_block():
    a = np.random.random((dim, dim)).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='sha')
    np_arr = np.ndarray(a.shape, dtype=np.float32, buffer=shm.buf)
    np_arr[:] = a[:]
    return shm, np_arr

if __name__ == '__main__':
    # create shared memory block
    shm, np_arr = create_shared_block()
    # create list of inputs to be executed in parallel
    inp = [[x, 'sha'] for x in range(npe)]
    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)
    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)
        pool.starmap(step, inp)

DataLoader on top of protobuf file using PyTorch's IterableDataset

I am building an RNN using PyTorch.
The data is stored in several protobuf files.
Each record in a protobuf file represents one training example with multiple timestamps.
As this is a very large dataset, reading the whole data into memory, or doing random reads by extending the map-style Dataset class, isn't feasible.
As per the docs, using IterableDataset is recommended.
A DataLoader on top of an IterableDataset would be able to achieve parallelism.
However, I am not able to find an implementation of this on custom data; the docs only talk about a simple range iterator.
import math
import stream
from src import record_pb2
import torch

class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, pb_file):
        self.pb_file = pb_file
        self.start = 0
        self.end = 0
        # One time read of the data to get the total count of records in the dataset
        with stream.open(self.pb_file, 'rb') as data_stream:
            for _ in data_stream:
                self.end += 1

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # Single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:
            # in a worker process, split the workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        data_stream = stream.open(self.pb_file, 'rb')
        # Block to skip the streaming data till the iter start for the current worker process
        i = 0
        for _ in data_stream:
            i += 1
            if i >= iter_start:
                break
        return iter(self.pb_stream)
I am looking for a mechanism by which a parallel data feeder could be built on top of large streaming data (protobuf).
The __iter__ method of the IterableDataset should yield your data samples one at a time. In a parallel setup, you have to choose the samples based on worker_id. As for the DataLoader using this dataset, the shuffle and sampler options will not work, because an IterableDataset has no indices. In other words, have your dataset yield one sample at a time and the data loader will take care of loading them. Does this answer your question?
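Here is a minimal sketch of choosing samples by worker, written in plain Python so it runs without PyTorch or the protobuf stream. The helper names worker_bounds and iter_shard are hypothetical. In a real IterableDataset, worker_id and num_workers would come from torch.utils.data.get_worker_info(), and the list of strings below merely stands in for the record stream.

```python
import math
from itertools import islice


def worker_bounds(start, end, worker_id, num_workers):
    """Return the half-open [lo, hi) record range owned by one worker."""
    per_worker = int(math.ceil((end - start) / num_workers))
    lo = min(start + worker_id * per_worker, end)
    hi = min(lo + per_worker, end)
    return lo, hi


def iter_shard(record_stream, start, end, worker_id, num_workers):
    """Yield only the records that belong to `worker_id`.

    islice skips the first `lo` records of the stream and stops at `hi`,
    so each worker reads sequentially and never needs random access.
    """
    lo, hi = worker_bounds(start, end, worker_id, num_workers)
    return islice(record_stream, lo, hi)


# Stand-in for a stream of 10 serialized records.
records = [f"record-{i}" for i in range(10)]

# Each worker opens its own iterator over the stream and takes its slice.
shards = [
    list(iter_shard(iter(records), 0, len(records), worker_id, 3))
    for worker_id in range(3)
]

for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Every record lands in exactly one shard, so with several workers the DataLoader sees each sample once. Note that each worker still has to read past the records before its range, which is the price of a purely sequential format.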

Customize task resources on Airflow using MesosExecutor

Is it possible to specify resources (CPU, memory, GPU, disk space) for each operator of a DAG when using MesosExecutor?
I know you can specify global values for resources of a task.
For instance, I have several operators that are CPU-expensive and others that are not. I would like to execute the CPU-expensive ones one at a time, but run many of the cheap ones in parallel.
From the code (line 67), it seems that this is not possible, since the CPU and memory values are passed to the scheduler during initialization:
def __init__(self,
             task_queue,
             result_queue,
             task_cpu,
             task_mem):
    self.task_queue = task_queue
    self.result_queue = result_queue
    self.task_cpu = task_cpu
    self.task_mem = task_mem
and those values are used without modification:
cpus = task.resources.add()
cpus.name = "cpus"
cpus.type = mesos_pb2.Value.SCALAR
cpus.scalar.value = self.task_cpu
mem = task.resources.add()
mem.name = "mem"
mem.type = mesos_pb2.Value.SCALAR
mem.scalar.value = self.task_mem
Achieving that requires a custom Executor implementation.

VectorAssembler in spark is very slow, even in trivial cases

I've been using Spark for some data analysis and machine learning.
Having read in some data as trainDF, I construct two pipelines which are logically equivalent, but one of which has a VectorAssembler at the end (with only a single input column) to demonstrate the slowdown:
scala> val assembler = new VectorAssembler().setInputCols(Array("all_description_features")).setOutputCol("features")
assembler: = vecAssembler_a76e6412bc96
scala> val idfDescription = new IDF().setInputCol("all_description_hashed").setOutputCol("all_description_features")
idfDescription: = idf_4b504cf08d86
scala> val descriptionArray = Array(tokensDescription, removerDescription, hashingTFDescription, idfDescription, assembler, lr)
descriptionArray: Array[ with{def copy(extra: with{def copy(extra: with{def copy(extra: with}}}] = Array(regexTok_316674b9209b, stopWords_8ecdf6f09955, hashingTF_48cf3f9cc065, idf_4b504cf08d86, vecAssembler_a76e6412bc96, logreg_f0763c33b304)
scala> val pipeline = new Pipeline().setStages(descriptionArray)
pipeline: = pipeline_4e462d0ee649
scala> time {}
16/09/28 13:04:17 WARN Executor: 1 block locks were not released by TID = 9526:
Elapsed time: 62370646425ns
res94: = pipeline_4e462d0ee649
scala> val idfDescription = new IDF().setInputCol("all_description_hashed").setOutputCol("features")
idfDescription: = idf_264569f76b23
scala> val descriptionArray = Array(tokensDescription, removerDescription, hashingTFDescription, idfDescription, lr)
descriptionArray: Array[ with{def copy(extra: with{def copy(extra: with{def copy(extra: with}}}] = Array(regexTok_316674b9209b, stopWords_8ecdf6f09955, hashingTF_48cf3f9cc065, idf_264569f76b23, logreg_f0763c33b304)
scala> val pipeline = new Pipeline().setStages(descriptionArray)
pipeline: = pipeline_758ec8aa3228
scala> time {}
Elapsed time: 11092968167ns
res95: = pipeline_758ec8aa3228
As you can see, the pipeline with the additional VectorAssembler is significantly slower (about 62 s versus 11 s). This is a toy example, but the actual example I'm using would benefit from a VectorAssembler (whereas in this case there is no point in using one) and suffers from a similar performance impact.
Just wondering if this is to be expected, or whether I am using this wrong. I also notice that with the VectorAssembler I get the warning message about locks not being released which may be related?
Thanks for any assistance and guidance!
Update #1
Some further analysis is showing that the additional time taken is in the logisticRegression fit step, not the actual assembling of features. It is puzzling why this would take longer though as the data it is acting on in both cases is identical (I've proved this to myself by joining the two datasets before they are passed into the fit function, and checking the two feature columns match for all ids).
Update #2
One other thing I noticed is that if I write the two datasets out to disk as parquet (one which has gone through the VectorAssembler, and one which hasn't) the one which went through the VectorAssembler is 10x the size even though they have seemingly identical schema, row count and data.
Update #3
OK - so I think I can see what is going on. Although the data with / without the VectorAssembler is identical, the act of calling transform on the VectorAssembler on my data decorates it with a large amount of (in my case somewhat useless) metadata. This causes the disk size bloat and also presumably the much slower regression due to having to process this additional data.

TensorFlow: Reading images in queue without shuffling

I have a training set of 614 images which have already been shuffled. I want to read the images in order in batches of 5. Because my labels are arranged in the same order, any shuffling of the images when being read into the batch will result in incorrect labelling.
These are my functions to read and add the images to the batch:
# To add files from queue to a batch:
def add_to_batch(image):
    print('Adding to batch')
    image_batch = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
    # Add to summary
    return image_batch

# To read files in queue and process:
def get_batch():
    # Create filename queue of images to read
    filenames = [('/media/jessica/Jessica/TensorFlow/StreetView/training/original/train_%d.png' % i) for i in range(1,614)]
    filename_queue = tf.train.string_input_producer(filenames,shuffle=False,capacity=614)
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    # Read and process image
    # Image is 500 x 275:
    my_image = tf.image.decode_png(value)
    my_image_float = tf.cast(my_image,tf.float32)
    my_image_float = tf.reshape(my_image_float,[275,500,4])
    return add_to_batch(my_image_float)
This is my function to perform the prediction:
def inference(x):
    < Perform convolution, pooling etc.>
    return y_conv
This is my function to calculate loss and perform optimisation:
def train_step(y_label,y_conv):
    """ Calculate loss """
    # Cross-entropy
    loss = -tf.reduce_sum(y_label*tf.log(y_conv + 1e-9))
    # Add to summary
    """ Optimisation """
    opt = tf.train.AdamOptimizer().minimize(loss)
    return loss
This is my main function:
def main ():
    # Training
    images = get_batch()
    y_conv = inference(images)
    loss = train_step(y_label,y_conv)
    # To write and merge summaries
    writer = tf.train.SummaryWriter('/media/jessica/Jessica/TensorFlow/StreetView/SummaryLogs/log_5', graph_def=sess.graph_def)
    merged = tf.merge_all_summaries()
    """ Run session """
    print "Running..."
    for step in range(5):
        # y_1 = <get the correct labels here>
        # Train
        loss_value = sess.run(loss,feed_dict={y_label:y_1})
        print "Step %d, Loss %g"%(step,loss_value)
        # Save summary
        summary_str = sess.run(merged,feed_dict={y_label:y_1})

if __name__ == '__main__':
    main()
When I check my image_summary, the images do not seem to be in sequence. Or rather, what is happening is:
Images 1-5: discarded, Images 6-10: read, Images 11-15: discarded, Images 16-20: read, etc.
So it looks like I am getting my batches twice, throwing away the first one and using the second one? I have tried a few remedies but nothing seems to work. I feel like I am misunderstanding something fundamental about calling images = get_batch() and sess.run().
Your batch operation is a FIFOQueue, so every time you use its output, it advances the state.
Your first call uses images 1-5 in the computation of train_step; your second call asks for the computation of image_summary, which pulls images 6-10 and uses them in the visualization.
If you want to visualize things without affecting the state of the input, it helps to cache queue values in variables and define your summaries with those variables as inputs, rather than depending on the live queue.
(image_batch_live,) = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
image_batch = tf.Variable(
    tf.zeros((batch_size, image_size, image_size, color_channels)),
)
advance_batch = tf.assign(image_batch, image_batch_live)
So now your image_batch is a static value which you can use both for computing the loss and for visualization. Between steps you would call sess.run(advance_batch) to advance the queue.
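To make the difference concrete, here is a toy simulation in plain Python. It is not TensorFlow code: the BatchQueue class is a hypothetical stand-in for the batch queue, and the two loops only mimic "run the training op, then run the summary op" with and without a cached batch.

```python
from collections import deque


class BatchQueue:
    """Toy FIFO batch queue: every read dequeues a fresh batch."""

    def __init__(self, items, batch_size):
        self._items = deque(items)
        self._batch_size = batch_size

    def read(self):
        return [self._items.popleft() for _ in range(self._batch_size)]


# Problem: training and summary each read from the live queue.
live_queue = BatchQueue(range(1, 21), batch_size=5)
live_train, live_summary = [], []
for _ in range(2):
    live_train.append(live_queue.read())    # like running the training op
    live_summary.append(live_queue.read())  # like running the summary op

# Workaround: advance the queue once per step and reuse the cached batch.
cached_queue = BatchQueue(range(1, 21), batch_size=5)
cached_train, cached_summary = [], []
for _ in range(2):
    image_batch = cached_queue.read()       # like running advance_batch
    cached_train.append(image_batch)
    cached_summary.append(image_batch)

print(live_train)      # [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
print(live_summary)    # [[6, 7, 8, 9, 10], [16, 17, 18, 19, 20]]
print(cached_train)    # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(cached_summary)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

With the live queue, the summary only ever shows the batches that training skipped, which matches the pattern described in the question. With the cached batch, both consumers see the same images and nothing is skipped.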
Minor wrinkle with this approach -- the default saver will save your image_batch variable to the checkpoint. If you ever change your batch size, then your checkpoint restore will fail with a dimension mismatch. To work around this, you would need to specify the list of variables to restore manually, and run initializers for the rest.
