Customize task resources on Airflow using MesosExecutor - mesos

Is it possible to specify resources (CPU, memory, GPU, disk space) for each operator of a DAG when using MesosExecutor?
I know you can specify global values for resources of a task.
For instance, I have several operators that are CPU-intensive and others that are not. I would like to run the CPU-intensive ones one at a time, but run many of the lightweight ones in parallel.

From the code (line 67), it seems that this is not possible, since the CPU and memory values are passed to the scheduler during initialization:
def __init__(self, task_queue, result_queue, task_cpu, task_mem):
    self.task_queue = task_queue
    self.result_queue = result_queue
    self.task_cpu = task_cpu
    self.task_mem = task_mem
and those values are used without modification:
cpus = task.resources.add()
cpus.name = "cpus"
cpus.type = mesos_pb2.Value.SCALAR
cpus.scalar.value = self.task_cpu
mem = task.resources.add()
mem.name = "mem"
mem.type = mesos_pb2.Value.SCALAR
mem.scalar.value = self.task_mem
Achieving that requires a custom Executor implementation.
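A minimal sketch of the lookup such a custom executor might perform, assuming a hypothetical per-task override table. `TaskResources`, `resources_for`, `fits_offer`, `OVERRIDES` and the task ids are illustrative names, not part of Airflow or Mesos. The idea is to resolve CPU and memory per task at launch time instead of reading the global task_cpu and task_mem:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical per-task resource request."""
    cpus: float
    mem_mb: float


# Global fallback, playing the role of task_cpu / task_mem (values are assumptions).
DEFAULT_RESOURCES = TaskResources(cpus=1.0, mem_mb=256.0)

# Hypothetical override table keyed by task id.
OVERRIDES = {
    "heavy_transform": TaskResources(cpus=4.0, mem_mb=2048.0),
}


def resources_for(task_id, overrides, default=DEFAULT_RESOURCES):
    """Return the override for task_id, or the global default."""
    return overrides.get(task_id, default)


def fits_offer(offer_cpus, offer_mem_mb, needed):
    """Check whether a resource offer can host a task with these needs."""
    return offer_cpus >= needed.cpus and offer_mem_mb >= needed.mem_mb


heavy = resources_for("heavy_transform", OVERRIDES)
light = resources_for("light_task", OVERRIDES)
print(heavy, fits_offer(2.0, 1024.0, heavy))
print(light, fits_offer(2.0, 1024.0, light))
```

In a custom executor, the resolved values would be written into cpus.scalar.value and mem.scalar.value in place of the global settings.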


Why does Triton serving with shared memory fail when running multiple uvicorn workers to send concurrent requests to the models?

I run a model in Triton serving with shared memory and it works correctly.
To simulate a backend structure, I wrote a FastAPI app for my model and run it with gunicorn with 6 workers. Then I wrote another FastAPI app that routes Locust requests to the first one; this second app runs with uvicorn. The problem is that when I use multiple uvicorn workers, Triton serving fails on the shared memory.
Note: without shared memory everything works, but the response time is much longer than with shared memory, so I need the shared memory option.
Here is my Triton client code.
My client code has a predict function, which uses requestGenerator to share the input_simple and output_simple spaces.
This is my requestGenerator generator:
def requestGenerator(self, triton_client, batched_img_data, input_name, output_name, dtype, batch_data):
    output_simple = "output_simple"
    input_simple = "input_simple"
    input_data = np.ones(
        shape=(batch_data, 3, self.width, self.height), dtype=np.float32)
    input_byte_size = input_data.size * input_data.itemsize
    output_byte_size = input_byte_size * 2
    # Create the output region and register it with the Triton server.
    shm_op0_handle = shm.create_shared_memory_region(
        output_name, output_simple, output_byte_size)
    triton_client.register_system_shared_memory(
        output_name, output_simple, output_byte_size)
    # Create the input region and register it with the Triton server.
    shm_ip0_handle = shm.create_shared_memory_region(
        input_name, input_simple, input_byte_size)
    triton_client.register_system_shared_memory(
        input_name, input_simple, input_byte_size)
    inputs = []
    inputs.append(
        httpclient.InferInput(input_name, batched_img_data.shape, dtype))
    inputs[0].set_data_from_numpy(batched_img_data, binary_data=True)
    outputs = []
    outputs.append(
        httpclient.InferRequestedOutput(output_name, binary_data=True))
    inputs[-1].set_shared_memory(input_name, input_byte_size)
    outputs[-1].set_shared_memory(output_name, output_byte_size)
    yield inputs, outputs, shm_ip0_handle, shm_op0_handle
This is my predict function:
def predict(self, triton_client, batched_data, input_layer, output_layer, dtype):
    responses = []
    results = None
    for inputs, outputs, shm_ip_handle, shm_op_handle in self.requestGenerator(
            triton_client, batched_data, input_layer, output_layer, dtype,
        self.sent_count += 1
        shm.set_shared_memory_region(shm_ip_handle, [batched_data])
        output_buffer = responses[0].get_output(output_layer)
        if output_buffer is not None:
            results = shm.get_contents_as_numpy(
                shm_op_handle, triton_to_np_dtype(output_buffer['datatype']),
                output_buffer['shape'])
    return results
Any help on how to use multiple uvicorn workers to send concurrent requests to my Triton code without it failing would be appreciated.
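One possible cause, offered as an assumption and not confirmed anywhere in the question: every worker process creates and registers its regions under the same region name and the same key ("input_simple", "output_simple"), so concurrent workers may collide. A minimal sketch of deriving names that are unique per process and per call follows. `unique_region_names` is a hypothetical helper, and the leading slash in the key follows the usual POSIX shared-memory naming convention:

```python
import os
import uuid


def unique_region_names(base):
    """Return a (region_name, shm_key) pair that is unique to this call.

    Hypothetical helper: the process id separates workers, and the random
    suffix separates concurrent requests inside one worker.
    """
    suffix = f"{os.getpid()}_{uuid.uuid4().hex[:8]}"
    return f"{base}_{suffix}", f"/{base}_{suffix}"


input_region, input_key = unique_region_names("input_simple")
output_region, output_key = unique_region_names("output_simple")
print(input_region, input_key)
print(output_region, output_key)
```

Under this assumption, the generated region name would replace the layer name in the calls that create, register and reference the region, and each region would be unregistered and destroyed once its response has been read.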

Model takes twice the memory footprint with distributed data parallel

I have a model that trains just fine on a single GPU. But I'm getting CUDA memory errors when I switch to Pytorch distributed data parallel (DDP). Specifically, the DDP model takes up twice the memory footprint compared to the model with no parallelism. Here is a minimal reproducible example:
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch
def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]
    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))
    # move the model to the GPU
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    print(device_id, torch.cuda.memory_allocated(device_id))
def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)
if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)
    # Then test multiple GPUs
    train_distributed()
Output - note that the GPU usage doubles on both devices when switching to DDP:
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
Why does the model take up twice the space in DDP? Is it intended behavior? Is there a way to avoid this extra memory usage?
I'm adding here the solution by @ptrblck from the PyTorch discussion forum.
Here are two quotes.
The statement:
[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel
And the answer:
[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant
So, from here we can see the reason why the memory footprint sometimes doubles.
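The numbers in the output can be checked with a little arithmetic. This is a sketch under two assumptions that the quotes do not state: that the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes, and that all gradients of this small model fit in a single flat DDP bucket. `round_up` is a helper written for this example:

```python
BLOCK = 512  # assumed allocation granularity of the caching allocator, in bytes
FLOAT32_BYTES = 4


def round_up(nbytes, block=BLOCK):
    """Round nbytes up to the next multiple of block."""
    return -(-nbytes // block) * block


# torch.nn.Linear(1000, 1000): a 1000x1000 weight and a 1000-element bias.
weight_bytes = 1000 * 1000 * FLOAT32_BYTES
bias_bytes = 1000 * FLOAT32_BYTES

# Parameters are allocated as two separate tensors.
params_allocated = round_up(weight_bytes) + round_up(bias_bytes)

# Assumption: one flat bucket holds the gradients of both parameters.
bucket_allocated = round_up(weight_bytes + bias_bytes)

print(params_allocated)                     # 4004352
print(params_allocated + bucket_allocated)  # 8008704
```

Both values match the memory_allocated figures printed before and after wrapping the model in DDP.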
Try using gradient_as_bucket_view to save memory. As the documentation says:
gradient_as_bucket_view (bool) – When set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Moreover, it avoids the overhead of copying between gradients and allreduce communication buckets. When gradients are views, detach_() cannot be called on the gradients. If hitting such errors, please fix it by referring to the zero_grad() function in torch/optim/ as a solution.

Pytorch: RAM explodes when using multiprocessing SharedMemory and CUDA

I would like to use multiprocessing to launch multiple training instances on a CUDA device. Since the data is common to all processes, I want to avoid copying it for every process. I'm using Python 3.8's SharedMemory from the multiprocessing module to achieve this, following this SO example.
I can allocate a memory block using SharedMemory and create as many processes as I'd like with constant memory (RAM) usage. However, when I try to send tensors to CUDA, the memory scales linearly with the number of processes. It appears as if, when the tensor is sent to the device, the base data is copied for every process.
Does anyone know why this is happening? Any ideas to mitigate this issue?
Here is the sample code I'm using:
import numpy as np
from multiprocessing import shared_memory, get_context
import time
import torch
import copy
dim = 10000
batch_size = 10
sleep_time = 2
npe = 1  # number of parallel executions
# cuda
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = "cpu"
device = torch.device(dev)
def step(i, shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_arr = np.ndarray((dim, dim), dtype=np.float32, buffer=existing_shm.buf)
    b = np_arr[i * batch_size: (i + 1) * batch_size, :]
    b = torch.Tensor(b)
    # This is just to explicitly copy the tensor so that it has nothing to do
    # with the shared memory block
    c = copy.deepcopy(b)
    # If tensor c is sent to the cuda device, then RAM scales linearly
    # with the number of parallel executions.
    # If c is not sent to cuda device, memory consumption is constant.
    c = c.to(device)
def create_shared_block():
    a = np.random.random((dim, dim)).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='sha')
    np_arr = np.ndarray(a.shape, dtype=np.float32, buffer=shm.buf)
    np_arr[:] = a[:]
    return shm, np_arr
if __name__ == '__main__':
    # create shared memory block
    shm, np_arr = create_shared_block()
    # create list of inputs to be executed in parallel
    inp = [[x, 'sha'] for x in range(npe)]
    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)
    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)
        pool.starmap(step, inp)

What does virtual core in YARN vcore mean?

YARN uses the concept of a virtual core to manage CPU resources. What is the benefit of using virtual cores? Is there a particular reason why YARN uses vcores?
Here is what the documentation states (emphasis mine)
A node's capacity should be configured with virtual cores equal to its
number of physical cores. A container should be requested with the
number of cores it can saturate, i.e. the average number of threads it
expects to have runnable at a time.
Unless a CPU core is hyper-threaded, it can run only one thread at a time. (With hyper-threading, the OS sees two cores for each physical core and can run two threads; that is a bit of a cheat and nowhere near as efficient as having two physical cores.) For the end user this means that a core runs a single thread, so if I want parallelism using Java threads, a reasonably good approximation is a number of threads equal to the number of cores. So if your container process (which is a JVM) needs 2 threads, it is better to map it to 2 vcores; that is what the last line of the quote means. As for the total capacity of the node, the number of vcores should equal the number of physical cores.
The most important thing to remember is that it is still the OS that schedules the threads onto the different cores, as it does for any other application. YARN itself has no control over that; it only provides the best possible approximation of how many threads to allocate to each container. That is why it is important to take into account the other applications running on the OS, the CPU cycles used by the kernel, and so on, since not all cores will be available to the YARN application all the time.
EDIT: Further research
YARN does not impose hard limits on CPU, but going through the code I can see how it tries to influence CPU scheduling or the CPU rate. Technically, YARN can launch different container processes: Java, Python, a custom shell command, etc. Launching containers is the responsibility of the ContainerExecutor component of the NodeManager, and I can see the code for launching the container, along with some hints (depending on platform). For example, DefaultContainerExecutor (which extends ContainerExecutor) uses the "-c" parameter for CPU restriction on Windows and process niceness on Linux. There is another implementation, LinuxContainerExecutor (or better still CgroupsLCEResourcesHandler, as the former does not force the usage of cgroups), which tries to use Linux cgroups to limit the YARN CPU resources on that node. More details can be found here.
ContainerExecutor {
protected String[] getRunCommand(String command, String groupId,
String userName, Path pidFile, Configuration conf, Resource resource) {
boolean containerSchedPriorityIsSet = false;
int containerSchedPriorityAdjustment =
if (conf.get(YarnConfiguration.NM_CONTAINER_EXECUTOR_SCHED_PRIORITY) !=
null) {
containerSchedPriorityIsSet = true;
containerSchedPriorityAdjustment = conf
if (Shell.WINDOWS) {
int cpuRate = -1;
int memory = -1;
if (resource != null) {
if (conf
memory = resource.getMemory();
if (conf.getBoolean(
int containerVCores = resource.getVirtualCores();
int nodeVCores = conf.getInt(YarnConfiguration.NM_VCORES,
// cap overall usage to the number of cores allocated to YARN
int nodeCpuPercentage = Math
nodeCpuPercentage = Math.max(0, nodeCpuPercentage);
if (nodeCpuPercentage == 0) {
String message = "Illegal value for "
+ ". Value cannot be less than or equal to 0.";
throw new IllegalArgumentException(message);
float yarnVCores = (nodeCpuPercentage * nodeVCores) / 100.0f;
// CPU should be set to a percentage * 100, e.g. 20% cpu rate limit
// should be set as 20 * 100. The following setting is equal to:
// 100 * (100 * (vcores / Total # of cores allocated to YARN))
cpuRate = Math.min(10000,
(int) ((containerVCores * 10000) / yarnVCores));
return new String[] { Shell.WINUTILS, "task", "create", "-m",
String.valueOf(memory), "-c", String.valueOf(cpuRate), groupId,
"cmd /c " + command };
} else {
List<String> retCommand = new ArrayList<String>();
if (containerSchedPriorityIsSet) {
retCommand.addAll(Arrays.asList("nice", "-n",
retCommand.addAll(Arrays.asList("bash", command));
return retCommand.toArray(new String[retCommand.size()]);
On Windows (where it uses winutils.exe), it uses the CPU rate.
On Linux, it uses niceness as a parameter to control the CPU priority.
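The Windows CPU rate computed in the Java code above can be reproduced with a short Python sketch. `windows_cpu_rate` and its parameter names are made up for this illustration; only the arithmetic is taken from the lines shown, and any upper clamp on the node CPU percentage, which is not visible in the excerpt, is left out:

```python
def windows_cpu_rate(container_vcores, node_vcores, node_cpu_percentage):
    """Mirror the cpuRate arithmetic from the excerpt above.

    The rate is expressed in units of 1/100 of a percent, so 10000 means 100%.
    """
    node_cpu_percentage = max(0, node_cpu_percentage)
    if node_cpu_percentage == 0:
        raise ValueError("node CPU percentage cannot be less than or equal to 0")
    # vcores actually available to YARN on this node
    yarn_vcores = (node_cpu_percentage * node_vcores) / 100.0
    return min(10000, int((container_vcores * 10000) / yarn_vcores))


print(windows_cpu_rate(2, 8, 100))   # 2500, i.e. 25% of the node
print(windows_cpu_rate(2, 8, 50))    # 5000, i.e. 50% of the node
print(windows_cpu_rate(16, 8, 100))  # 10000, capped at 100%
```

A container asking for 2 of 8 vcores is therefore limited to a quarter of the CPU that YARN may use on that node, and a request larger than the node's share is capped at 100%.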
"Virtual cores" are merely an abstraction of actual cores. This abstraction or "lie" (as i like to call it), allows YARN (and others) to dynamically spin threads (parallel process) based on availability. Take for example running map reduce on an "elastic" cluster with a processing limit constrained only by your wallet... The cloud baby... The. Cloud.
you can read more here

Akka actors and Clustering-I'm having trouble with ClusterSingletonManager- unhandled event in state Start

I've got a system that uses Akka 2.2.4 which creates a bunch of local actors and sets them as the routees of a Broadcast Router. Each worker handles some segment of the total work, according to some hash range we pass it. It works great.
Now, I've got to cluster this application for failover. Based on the requirement that only one worker per hash range exist/be triggered on the cluster, it seems to me that setting up each one as a ClusterSingletonManager would make sense. However, I'm having trouble getting it working. The actor system starts up, it creates the ClusterSingletonManager, it adds the path in the code cited below to a Broadcast Router, but it never instantiates my actual worker actor to handle my messages for some reason. All I get is a log message: "unhandled event ${my message} in state Start". What am I doing wrong? Is there something else I need to do to start up this single instance cluster? Am I sending the wrong actor a message?
Here's my Akka config (I use the default config as a fallback):
min-nr-of-members = 1
role {
workerSystem.min-nr-of-members = 1
daemonic = true
remote {
enabled-transports = ["akka.remote.netty.tcp"]
netty.tcp {
hostname = ""
port = ${akkaPort}
provider = akka.cluster.ClusterActorRefProvider
single-message-bound-mailbox {
# FQCN of the MailboxType. The Class of the FQCN must have a public
# constructor with
# (, com.typesafe.config.Config) parameters.
mailbox-type = "akka.dispatch.BoundedMailbox"
# If the mailbox is bounded then it uses this setting to determine its
# capacity. The provided value must be positive.
# Up to version 2.1 the mailbox type was determined based on this setting;
# this is no longer the case, the type must explicitly be a bounded mailbox.
mailbox-capacity = 1
# If the mailbox is bounded then this is the timeout for enqueueing
# in case the mailbox is full. Negative values signify infinite
# timeout, which should be avoided as it bears the risk of dead-lock.
mailbox-push-timeout-time = 1
type = PinnedDispatcher
executor = "thread-pool-executor"
# Throughput defines the number of messages that are processed in a batch
# before the thread is returned to the pool. Set to 1 for as fair as possible.
throughput = 500
thread-pool-executor {
# Keep alive time for threads
keep-alive-time = 60s
# Min number of threads to cap factor-based core number to
core-pool-size-min = ${workerCount}
# The core pool size factor is used to determine thread pool core size
# using the following formula: ceil(available processors * factor).
# Resulting size is then bounded by the core-pool-size-min and
# core-pool-size-max values.
core-pool-size-factor = 3.0
# Max number of threads to cap factor-based number to
core-pool-size-max = 64
# Minimum number of threads to cap factor-based max number to
# (if using a bounded task queue)
max-pool-size-min = ${workerCount}
# Max no of threads (if using a bounded task queue) is determined by
# calculating: ceil(available processors * factor)
max-pool-size-factor = 3.0
# Max number of threads to cap factor-based max number to
# (if using a bounded task queue)
max-pool-size-max = 64
# Specifies the bounded capacity of the task queue (< 1 == unbounded)
task-queue-size = -1
# Specifies which type of task queue will be used, can be "array" or
# "linked" (default)
task-queue-type = "linked"
# Allow core threads to time out
allow-core-timeout = on
fork-join-executor {
# Min number of threads to cap factor-based parallelism number to
parallelism-min = 1
# The parallelism factor is used to determine thread pool size using the
# following formula: ceil(available processors * factor). Resulting size
# is then bounded by the parallelism-min and parallelism-max values.
parallelism-factor = 3.0
# Max number of threads to cap factor-based parallelism number to
parallelism-max = 1
Here's where I create my actors (it's written in Groovy):
Props clusteredProps = ClusterSingletonManager.defaultProps("worker".toString(), PoisonPill.getInstance(), "workerSystem",
new ClusterSingletonPropsFactory(){
Props create(Object handOverData) {"called in ClusterSingetonManager")
Props.create(WorkerActorCreator.create(applicationContext, it.start, it.end)).withDispatcher("").withMailbox("")
} )
ActorRef manager = system.actorOf(clusteredProps, "worker-${it.start}-${it.end}".toString())
String path = manager.path().child("worker").toString()
When I try to send a message to the actual worker actor, should the path above resolve? Currently it does not.
What am I doing wrong? Also, these actors live within a Spring application, and the worker actors are set up with some @Autowired dependencies. While this Spring integration worked well in a non-clustered environment, are there any gotchas in a clustered environment I should be looking out for?
Thank you.
FYI: I've also posted this in the akka-user Google group. Here's the link.
The path in your code is to the ClusterSingletonManager actor that you start on each node with role "workerSystem". It will create a child actor (WorkerActor) with name "worker-${it.start}-${it.end}" on the oldest node in the cluster, i.e. singleton within the cluster.
You should also define the name of the ClusterSingletonManager, e.g. system.actorOf(clusteredProps, "workerSingletonManager").
You can't send the messages to the ClusterSingletonManager. You must send them to the path of the active worker, i.e. including the address of the oldest node. That is illustrated by the ConsumerProxy in the documentation.
I'm not sure you should use a singleton at all for this. All workers will be running on the same node, the oldest. I would prefer to discuss alternative solutions to your problem at the akka-user google group.
