In the PyTorch Distributed Data Parallel (DDP) tutorial, how does `setup` know it's rank? - parallel-processing

For the tutorial Getting Started with Distributed Data Parallel
How does setup() function knows the rank when mp.spawn() doesn't pass the rank?
def setup(rank, world_size):
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
# initialize the process group
dist.init_process_group("gloo", rank=rank, world_size=world_size)
def demo_basic(rank, world_size):
print(f"Running basic DDP example on rank {rank}.")
setup(rank, world_size)
.......
def run_demo(demo_fn, world_size):
mp.spawn(demo_fn,
args=(world_size,),
nprocs=world_size,
join=True)
if __name__ == "__main__":
n_gpus = torch.cuda.device_count()
assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
world_size = n_gpus
run_demo(demo_basic, world_size)

mp.spawn does pass the rank to the function it calls.
From the torch.multiprocessing.spawn docs
torch.multiprocessing.spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method='spawn')
...
fn (function) -
Function is called as the entrypoint of the spawned process. This
function must be defined at the top level of a module so it can be
pickled and spawned. This is a requirement imposed by multiprocessing. The function is called as fn(i, *args), where i is the process index
and args is the passed through tuple of arguments.
So when spawn invokes fn it passes it the process index as the first argument.

Related

How to create multiprocess with regression function?

I'm trying to build a regression function that call itself in a new process. The new process should not stop the parent process nor wait for it to finish, that is why I don't use join(). Do you have another way to create regression function with multi-process.
I use the following code:
import multiprocessing as mp
import concurrent.futures
import time
def do_something(c, seconds, r_list):
c += 1 # c is a counter that all processes should use
# such that no more than 20 processes are created.
print(f"Sleeping {seconds} second(s)...")
if c < 20:
P_V = mp.Value('d', 0.0, lock=False)
p = mp.Process(group=None, target=do_something, args=(c, 1, r_list,))
p.start()
if not p.is_alive():
r_list.append(P_V.value)
time.sleep(seconds)
print(f"Done Sleeping...{seconds}")
return f"Done Sleeping...{seconds}"
if __name__ == '__main__':
C = 0 # C is a counter that all processes should use
# such that no more than 20 processes are created.
Result_list = [] # results that come from all processes are saved here
Result_list.append(do_something(C, 1, Result_list))
Notice that results from all processes should be compared at the end.
In fact, this code is working well but the child processes, which are created in the recursive method, do not print anything, the list "Result_list" contains only one item from the first call, and C=0 at the end, any idea why?
Here's a simplified example of what I think you're trying to do (side note: launching processes recursively is a great way to accidentally create a "fork bomb". It is extremely more common to create multiple processes in some sort of loop instead)
from multiprocessing import Process, Queue
from time import sleep
from os import getpid
def foo(n_procs, return_Q, arg):
if __name__ == "__main__": #don't actually run the body of foo in the "main" process, just start the recursion
Process(target=foo, args=(n_procs, return_Q, arg)).start()
else:
n_procs -= 1
if n_procs > 0:
Process(target=foo, args=(n_procs, return_Q, arg)).start()
sleep(arg)
print(f"{getpid()} done sleeping {arg} seconds")
return_Q.put(f"{getpid()} done sleeping {arg} seconds") #put the result to a queue so we can get it in the main process
if __name__ == "__main__":
q = Queue()
foo(10, q, 2)
sleep(10) #do something else in the meantime
results = []
#while not q.empty(): #usually better to just know how many results you're expecting as q.empty can be unreliable
for _ in range(10):
results.append(q.get())
print("mp results:")
print("\n".join(results))

Asyncio with multiprocessing : Producers-Consumers model

I am trying retrieve stock prices and process the prices them as they come. I am a beginner with concurrency but I thought this set up seems suited to an asyncio producers-consumers model in which each producers retrieve a stock price, and pass it to the consumers vial a queue. Now the consumers have do the stock price processing in parallel (multiprocessing) since the work is CPU intensive. Therefore I would have multiple consumers already working while not all the producers are finished retrieving data. In addition, I would like to implement a step in which, if the consumer finds that the stock price it's working on is invalid , we spawn a new consumer job for that stock.
So far, i have the following toy code that sort of gets me there, but has issues with my process_data function (the consumer).
from concurrent.futures import ProcessPoolExecutor
import asyncio
import random
import time
random.seed(444)
#producers
async def retrieve_data(ticker, q):
'''
Pretend we're using aiohttp to retrieve stock prices from a URL
Place a tuple of stock ticker and price into asyn queue as it becomes available
'''
start = time.perf_counter() # start timer
await asyncio.sleep(random.randint(4, 8)) # pretend we're calling some URL
price = random.randint(1, 100) # pretend this is the price we retrieved
print(f'{ticker} : {price} retrieved in {time.perf_counter() - start:0.1f} seconds')
await q.put((ticker, price)) # place the price into the asyncio queue
#consumers
async def process_data(q):
while True:
data = await q.get()
print(f"processing: {data}")
with ProcessPoolExecutor() as executor:
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(executor, data_processor, data)
#if output of data_processing failed, send ticker back to queue to retrieve data again
if not result[2]:
print(f'{result[0]} data invalid. Retrieving again...')
await retrieve_data(result[0], q) # add a new task
q.task_done() # end this task
else:
q.task_done() # so that q.join() knows when the task is done
async def main(tickers):
q = asyncio.Queue()
producers = [asyncio.create_task(retrieve_data(ticker, q)) for ticker in tickers]
consumers = [asyncio.create_task(process_data(q))]
await asyncio.gather(*producers)
await q.join() # Implicitly awaits consumers, too. blocks until all items in the queue have been received and processed
for c in consumers:
c.cancel() #cancel the consumer tasks, which would otherwise hang up and wait endlessly for additional queue items to appear
'''
RUN IN JUPYTER NOTEBOOK
'''
start = time.perf_counter()
tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
await main(tickers)
print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
'''
RUN IN TERMINAL
'''
# if __name__ == "__main__":
# start = time.perf_counter()
# tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
# asyncio.run(main(tickers))
# print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
The data_processor() function below, called by process_data() above needs to be in a different cell in Jupyter notebook, or a separate module (from what I understand, to avoid a PicklingError)
from multiprocessing import current_process
def data_processor(data):
ticker = data[0]
price = data[1]
print(f'Started {ticker} - {current_process().name}')
start = time.perf_counter() # start time counter
time.sleep(random.randint(4, 5)) # mimic some random processing time
# pretend we're processing the price. Let the processing outcome be invalid if the price is an odd number
if price % 2==0:
is_valid = True
else:
is_valid = False
print(f"{ticker}'s price {price} validity: --{is_valid}--"
f' Elapsed time: {time.perf_counter() - start:0.2f} seconds')
return (ticker, price, is_valid)
THE ISSUES
Instead of using python's multiprocessing module, i used concurrent.futures' ProcessPoolExecutor, which I read is compatible with asyncio (What kind of problems (if any) would there be combining asyncio with multiprocessing?). But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel. With the construct below, the subprocesses run sequentially, not in parallel.
with ProcessPoolExecutor() as executor:
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(executor, data_processor, data)
Removing result = await in front of loop.run_in_executor(executor, data_processor, data) allows to run several consumers in parallel, but then I can't collect their results from the parent process. I need the await for that. And then of course the remaining of the code block will fail.
How can I have these subprocesses run in parallel and provide the output? Perhaps it needs a different construct or something else than the producers-consumers model
the part of the code that requests invalid stock prices to be retrieved again works (provided I can get the result from above), but it is ran in the subprocess that calls it and blocks new consumers from being created until the request is fulfilled. Is there a way to address this?
#if output of data_processing failed, send ticker back to queue to retrieve data again
if not result[2]:
print(f'{result[0]} data invalid. Retrieving again...')
await retrieve_data(result[0], q) # add a new task
q.task_done() # end this task
else:
q.task_done() # so that q.join() knows when the task is done
But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel.
Luckily this is not the case, you can also use asyncio.gather() to wait for multiple items at once. But you obtain data items one by one from the queue, so you don't have a batch of items to process. The simplest solution is to just start multiple consumers. Replace
# the single-element list looks suspicious anyway
consumers = [asyncio.create_task(process_data(q))]
with:
# now we have an actual list
consumers = [asyncio.create_task(process_data(q)) for _ in range(16)]
Each consumer will wait for an individual task to finish, but that's ok because you'll have a whole pool of them working in parallel, which is exactly what you wanted.
Also, you might want to make executor a global variable and not use with, so that the process pool is shared by all consumers and lasts as long as the program. That way consumers will reuse the worker processes already spawned instead of having to spawn a new process for each job received from the queue. (That's the whole point of having a process "pool".) In that case you probably want to add executor.shutdown() at the point in the program where you don't need the executor anymore.

Python asyncio - Increase the value of Semaphore

I am making use of aiohttp in one of my projects and would like to limit the number of requests made per second. I am using asyncio.Semaphore to do that. My challenge is I may want to increase/decrease the number of requests allowed per second.
For example:
limit = asyncio.Semaphore(10)
async with limit:
async with aiohttp.request(...)
...
await asyncio.sleep(1)
This works great. That is, it limits that aiohttp.request to 10 concurrent requests in a second. However, I may want to increase and decrease the Semaphore._value. I can do limit._value = 20 but I am not sure if this is the right approach or there is another way to do that.
Accessing the private _value attribute is not the right approach for at least two reasons: one that the attribute is private and can be removed, renamed, or change meaning in a future version without notice, and the other that increasing the limit won't be noticed by a semaphore that already has waiters.
Since asyncio.Semaphore doesn't support modifying the limit dynamically, you have two options: implementing your own Semaphore class that does support it, or not using a Semaphore at all. The latter is probably easier as you can always replace a semaphore-enforced limit with a fixed number of worker tasks that receive jobs through a queue. Assuming you currently have code that looks like this:
async def fetch(limit, arg):
async with limit:
# your actual code here
return result
async def tweak_limit(limit):
# here you'd like to be able to increase the limit
async def main():
limit = asyncio.Semaphore(10)
asyncio.create_task(tweak_limit(limit))
results = await asyncio.gather(*[fetch(limit, x) for x in range(1000)])
You could express it without a semaphore by creating workers in advance and giving them work to do:
async def fetch_task(queue, results):
while True:
arg = await queue.get()
# your actual code here
results.append(result)
queue.task_done()
async def main():
# fill the queue with jobs for the workers
queue = asyncio.Queue()
for x in range(1000):
await queue.put(x)
# create the initial pool of workers
results = []
workers = [asyncio.create_task(fetch_task(queue, results))
for _ in range(10)]
asyncio.create_task(tweak_limit(workers, queue, results))
# wait for workers to process the entire queue
await queue.join()
# finally, cancel the now-idle worker tasks
for w in workers:
w.cancel()
# results are now available
The tweak_limit() function can now increase the limit simply by spawning new workers:
async def tweak_limit(workers, queue, results):
while True:
await asyncio.sleep(1)
if need_more_workers:
workers.append(asyncio.create_task(fetch_task(queue, results)))
Using workers and queues is a more complex solution, you have to think about issues like setup, teardown, exception handling and backpressure, etc.
Semaphore can be implemented with Lock, if you don't mind abit of inefficiency (you will see why), here's a simple implemention for a dynamic-value semaphore:
class DynamicSemaphore:
def __init__(self, value=1):
self._lock = asyncio.Lock()
if value < 0:
raise ValueError("Semaphore initial value must be >= 0")
self.value = value
async def __aenter__(self):
await self.acquire()
return None
async def __aexit__(self, exc_type, exc, tb):
self.release()
def locked(self):
return self.value == 0
async def acquire(self):
async with self._lock:
while self.value <= 0:
await asyncio.sleep(0.1)
self.value -= 1
return True
def release(self):
self.value += 1

How to convert event-based communication into async/await using asyncio.Future

I'm trying to expose an event-based communication as a coroutine. Here is an example:
class Terminal:
async def start(self):
loop = asyncio.get_running_loop()
future = loop.create_future()
t = threading.Thread(target=self.run_cmd, args=future)
t.start()
return await future
def run_cmd(self, future):
time.sleep(3) # imitating doing something
future.set_result(1)
But when I run it like this:
async def main():
t = Terminal()
result = await t.start()
print(result)
asyncio.run(main())
I get the following error: RuntimeError: await wasn't used with future
Is it possible to achieve the desired behavior?
There are two issues with your code. One is that the args argument to the Thread constructor requires a sequence or iterable, so you need to write wrap the argument in a container, e.g. args=(future,). Since future is iterable (for technical reasons unrelated to this use case), args=future is not immediately rejected, but leads to the misleading error later down the line.
The other issue is that asyncio objects aren't thread-safe, so you cannot just call future.set_result from another thread. This causes the test program to hang even after fixing the first issue. The correct way to resolve the future from another thread is through the call_soon_threadsafe method on the event loop:
class Terminal:
async def start(self):
loop = asyncio.get_running_loop()
future = loop.create_future()
t = threading.Thread(target=self.run_cmd, args=(loop, future,))
t.start()
return await future
def run_cmd(self, loop, future):
time.sleep(3)
loop.call_soon_threadsafe(future.set_result, 1)
If your thread is really just calling a blocking function whose result you're interested in, consider using run_in_executor instead of manually spawning threads:
class Terminal:
async def start(self):
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, self.run_cmd)
# Executed in a different thread; `run_in_executor` submits the
# callable to a thread pool, suspends the awaiting coroutine until
# it's done, and transfers the result/exception back to asyncio.
def run_cmd(self):
time.sleep(3)
return 1

Dataloader on top of protobuf file using pytorch's torch.utils.data.IterableDataset

I am building a RNN network using pytorch.
The data is stored in various protobuf file.
Each record in protobuf represents one training example with multiple timestamp.
As this is very large dataset, reading the whole data in memory or random read by extending torch.utils.data.Dataset class isn't feasible.
As per the docs using the torch.utils.data.IterableDataset is recommended.
DataLoader on top of IterableDataset would be able to achieve parallelism
However I am not able to find an implementation of this on custom data, docs only talk about a simple range iterator.
import math
import stream
from src import record_pb2
import torch
class MyIterableDataset(torch.utils.data.IterableDataset):
def __init__(self, pb_file):
self.pb_file = pb_file
self.start = 0
self.end = 0
# One time read of the data to get the total count of records in the dataset
with stream.open(self.pb_file, 'rb') as data_stream:
for _ in data_stream:
self.end += 1
def __iter__(self):
worker_info = torch.utils.data.get_worker_info()
if worker_info is None: # Single-process data loading, return the full iterator
iter_start = self.start
iter_end = self.end
else:
# in a worker process, split the workload
per_worker = int(math.ceil((self.end - self.start))/float(worker_info.num_workers))
worker_id = worker_info.id
iter_start = self.start + worker_id * per_worker
iter_end = min(iter_start + per_worker, self.end)
data_stream = stream.open(self.pb_file, 'rb')
# Block to skip the streaming data till the iter start for the current worker process
i = 0
for _ in data_stream:
i += 1
if i >= iter_start:
break
return iter(self.pb_stream)
I am expecting a mechanism by which a parallel data feeder could be designed on top of a large streaming data (protobuf)
The __iter__ method of the IterableDataset would yield your data samples one at a time. In a parallel setup, you have to choose the samples based on worker_id. And with respect to the DataLoader using this dataset, shuffle and sampler options would not work, as an IterableDataset is not going to have any indices. In other words, have your dataset yield one sample at a time and the data loader will take care of loading them. Does this answer?

Resources