Why do we need to add [] to an iterable array and * to asyncio.gather? - python-asyncio

Using this example
import time
import asyncio

async def main(x):
    print(f"Starting Task {x}")
    await asyncio.sleep(3)
    print(f"Finished Task {x}")

async def async_io():
    tasks = []
    for i in range(10):
        tasks += [main(i)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    start_time = time.perf_counter()
    asyncio.run(async_io())
    print(f"Took {time.perf_counter() - start_time} secs")
I noticed that we need to create a list that keeps track of the tasks to do. Understandable, but then why do we add the [] wrapper over the main(i) function? And also in the asyncio.gather(*tasks), why do we need to add the asterisk there as well?

why do we add the [] wrapper over the main(i) function?
There are a few ways to add items into a list. One such way, the way you've chosen, is by concatenating two lists together.
>>> [1] + [2]
[1, 2]
Trying to concatenate a list and something else will lead to a TypeError.
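For example (the exact error wording depends on the Python version):
>>> [1] + 2
Traceback (most recent call last):
  ...
TypeError: can only concatenate list (not "int") to list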
In your particular case you're using augmented assignment, an (often more performant) shorthand for
tasks = tasks + [main(i)]
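The difference is subtle: += on a list extends it in place, while tasks + [...] builds a new list and rebinds the name. A small illustrative sketch (not from the question):
tasks = []
alias = tasks
tasks += [1]          # in-place extend: alias sees the change
print(alias)          # [1]
tasks = tasks + [2]   # builds a new list and rebinds the name 'tasks'
print(alias)          # still [1]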
Another way to accomplish this is with append.
tasks.append(main(i))
If your real code matches your example code, an even better way to spell all of this is
tasks = [main(i) for i in range(10)]
in the asyncio.gather(*tasks), why do we need to add the asterisk there as well?
Because gather will run each positional argument it receives. Calls to gather should look like
asyncio.gather(main(0))
asyncio.gather(main(0), main(1))
Since there are times when you need to use a variable number of positional arguments, Python offers the unpacking operator (* in the case of lists).
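A quick illustration of what the unpacking operator does (the show function here is just a hypothetical stand-in):
def show(*args):
    # args collects whatever positional arguments were passed, as a tuple
    print(args)

nums = [1, 2, 3]
show(nums)    # one positional argument:    ([1, 2, 3],)
show(*nums)   # three positional arguments: (1, 2, 3)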
If you felt so inclined, your example can be rewritten as
async def async_io():
    await asyncio.gather(*[main(i) for i in range(10)])
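As a side note, gather also collects the awaitables' return values into a list, in the order the arguments were passed (not in completion order). A minimal sketch, assuming main were changed to return x:
async def async_io():
    results = await asyncio.gather(*[main(i) for i in range(10)])
    print(results)  # return values, in the order the awaitables were passed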

Related

multiprocessing a geopandas.overlay() throws no error but seemingly never completes

I'm trying to pass a geopandas.overlay() to multiprocessing to speed it up.
I have used custom functions and functools to partially fill function inputs and then pass the iterative component to the function to produce a series of dataframes that I then concat into one.
def taska(id, points, crs):
    return make_break_points((points[points.ID == id]).reset_index(drop=True), crs)

points_gdf = ...  # geodataframe of points with an ID field
grid_gdf = ...    # geodataframe polygon grid

partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)
In the example above I use the list of unique values to subset the df and pass it to a processor to perform the function and return the results. In the end all the results are combined using pd.concat.
When I apply the same approach to a list of dataframes created using numpy.array_split() the process starts with a number of processors, then they all close and everything hangs with no indication that work is being done or that it will ever exit.
def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)

tracks_gdf = ...  # geodataframe of points with an ID field
dfs = np.array_split(tracks_gdf, (cpu_count() - 4))
grid_gdf = ...    # geodataframe polygon grid

partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)
I tried using with get_context('spawn').Pool(cpu_count() - 4) as pool: based on the information here https://pythonspeed.com/articles/python-multiprocessing/ with no change in behavior.
Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) the process is successful and the script carries on to the end with expected results.
Why does the partial function approach work on a list of items but not a list of dataframes?
Is the numpy.array_split() not an iterable object like a list?
How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing capabilities and get back a single dataframe or a series of dataframes to concat?
This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I modified the partial function so the df split is moved inside it, and then I create a list of values from range() as my iterable.
def taskc(num, tracks, grid):
    return gpd.overlay(np.array_split(tracks, cpu_count() - 4)[num], grid, how='union').explode().reset_index(drop=True)

partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)

as_completed identifying coroutine objects

I'm using asyncio to await a set of coroutines in the following way:
# let's assume we have fn defined and that it can throw an exception
coros_objects = []
for x in range(10):
    coros_objects.append(fn(x))

for c in asyncio.as_completed(coros_objects):
    try:
        y = await c
    except Exception:
        # something
        # if possible print(x)
        pass
Question is how can I know which coroutine failed and for which argument?
I could append "x" to the outputs but this would give me info about successful executions only.
I can't know that from the order, because it's different from the order of coros_objects.
Can I somehow identify which coro just yielded a result?
Question is how can I know which coroutine failed and for which argument?
You can't with the current as_completed. Once this PR is merged, it will be possible by attaching the information to the future (because as_completed will then yield the original futures). At the moment there are two workarounds:
wrap the coroutine execution in a wrapper that catches exceptions and stores them, and also stores the original arguments that you need, or
not use as_completed at all, but write your own loop using tools like asyncio.wait.
The second option is easier than most people expect, so here it is (untested):
# create a list of tasks and attach the needed information to each
tasks = []
for x in range(10):
    t = asyncio.create_task(fn(x))
    t.my_task_arg = x
    tasks.append(t)

# emulate as_completed with asyncio.wait()
while tasks:
    done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in done:
        try:
            y = await t
        except Exception as e:
            print(f'{e} happened while processing {t.my_task_arg}')
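For completeness, the first workaround (wrapping each coroutine so it remembers its argument and captures any exception) could look roughly like this; the wrapper name and the returned tuple shape are just illustrative choices, and fn is assumed to be defined as in the question:
async def run_with_arg(x):
    # capture both the argument and any exception raised by fn(x)
    try:
        return x, await fn(x), None
    except Exception as e:
        return x, None, e

# inside an async function:
coros_objects = [run_with_arg(x) for x in range(10)]
for c in asyncio.as_completed(coros_objects):
    x, y, err = await c
    if err is not None:
        print(f'{err} happened while processing {x}')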

Is there a more Pythonic way of changing `None` to `[]` than

Is there a more Pythonic way of doing this?:
if self.name2info[name]['prereqs'] is None:
    self.name2info[name]['prereqs'] = []
if self.name2info[name]['optionals'] is None:
    self.name2info[name]['optionals'] = []
The reason I do this is that I need to iterate over those later. They're None to begin with sometimes because that's the default value. It's my workaround for not making [] a default value.
Thanks.
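(For context, the reason [] is usually avoided as a default value is that default arguments are evaluated once, at function definition time, so a mutable default is shared across calls. A minimal sketch of the pitfall, with a hypothetical add_prereq function:)
def add_prereq(name, prereqs=[]):   # one shared list for every call
    prereqs.append(name)
    return prereqs

print(add_prereq("math"))     # ['math']
print(add_prereq("physics"))  # ['math', 'physics'] -- same list object reused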
If you prefer, you can collapse each of those checks into one line:
self.name2info[name]['prereqs'] = self.name2info[name]['prereqs'] or []
If you can't fix the input you could do this (becomes 'better' if you need to add more):
for prop in ['prereqs', 'optionals']:
    if self.name2info[name][prop] is None:
        self.name2info[name][prop] = []
But replacing these values just so you can iterate over the empty list you've added doesn't make a whole lot of sense (unless maybe you're appending something to that list at some point). So maybe you could just move the test for None-ness right before the iteration:
prereqs = self.name2info[name]['prereqs']
if prereqs is not None:
    for prereq in prereqs:
        do_stuff(prereq)
Slightly going off-topic now, but if you ever want to test if an item is iterable at all, a common (pythonic) way would be to write:
try:
    my_iterable_obj = iter(my_obj)
except TypeError:
    # not iterable
    pass
You could do it this way:
if not self.name2info[name]['prereqs']: self.name2info[name]['prereqs'] = []
or this way
self.name2info[name]['prereqs'] = [] if not self.name2info[name]['prereqs'] else self.name2info[name]['prereqs']
Every one of those attribute and dict lookups takes time and processing. It's Pythonic to look up self.name2info[name] just once, and then work with a temporary name bound to that dict:
rec = self.name2info[name]
for key in "prereqs optionals required elective distance".split():
    if key not in rec or rec[key] is None:
        rec[key] = []
Now if need to add another category, like "AP_credit", you just add that to the string of key names.
If you're iterating over them I assume they're stored in a list. In which case combining some of the above approaches would probably be best.
seq = list(map(lambda x: x or [], seq))
is a concise way of doing it. To my knowledge, conversions in map() are faster than explicit for loops because the looping happens in the underlying C code.
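If performance matters for your case, the map-versus-loop claim is easy to check; a small timeit sketch (made-up sample data, results will vary by data size and Python version):
import timeit

seq = [None, [1], None, [2, 3]] * 10000

print(timeit.timeit(lambda: list(map(lambda x: x or [], seq)), number=100))
print(timeit.timeit(lambda: [x or [] for x in seq], number=100))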

pythonic way to do something N times without an index variable? [duplicate]

This question already has answers here:
Is it possible to implement a Python for range loop without an iterator variable?
(15 answers)
Closed 7 months ago.
I have some code like:
for i in range(N):
    do_something()
I want to do something N times. The code inside the loop doesn't depend on the value of i.
Is it possible to do this simple task without creating a useless index variable, or in an otherwise more elegant way? How?
A slightly faster approach than looping on xrange(N) is:
import itertools
for _ in itertools.repeat(None, N):
    do_something()
Use the _ variable, like so:
# A long way to do integer exponentiation
num = 2
power = 3
product = 1
for _ in range(power):
    product *= num
print(product)
I just use for _ in range(n); it's straight to the point. It's going to generate the entire list for huge numbers in Python 2, but if you're using Python 3 it's not a problem.
Since functions are first-class citizens, you can write a small wrapper (building on Alex's answer):
import itertools

def repeat(f, N):
    for _ in itertools.repeat(None, N): f()
Then you can pass any function as an argument.
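For example (assuming do_something is defined as in the question):
repeat(do_something, 3)   # calls do_something() three times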
The _ is the same thing as x. However, it's a Python idiom used to indicate an identifier that you don't intend to use. In Python these identifiers don't take memory or allocate space the way variables do in other languages; it's easy to forget that. They're just names that point to objects, in this case an integer on each iteration.
I found the various answers really elegant (especially Alex Martelli's) but I wanted to quantify performance first hand, so I cooked up the following script:
from itertools import repeat

N = 10000000

def payload(a):
    pass

def standard(N):
    for x in range(N):
        payload(None)

def underscore(N):
    for _ in range(N):
        payload(None)

def loopiter(N):
    for _ in repeat(None, N):
        payload(None)

def loopiter2(N):
    for _ in map(payload, repeat(None, N)):
        pass

if __name__ == '__main__':
    import timeit
    print("standard: ", timeit.timeit("standard({})".format(N),
          setup="from __main__ import standard", number=1))
    print("underscore: ", timeit.timeit("underscore({})".format(N),
          setup="from __main__ import underscore", number=1))
    print("loopiter: ", timeit.timeit("loopiter({})".format(N),
          setup="from __main__ import loopiter", number=1))
    print("loopiter2: ", timeit.timeit("loopiter2({})".format(N),
          setup="from __main__ import loopiter2", number=1))
I also came up with an alternative solution that builds on Martelli's one and uses map() to call the payload function. OK, I cheated a bit in that I took the freedom of making the payload accept a parameter that gets discarded: I don't know if there is a way around this. Nevertheless, here are the results:
standard: 0.8398549720004667
underscore: 0.8413165839992871
loopiter: 0.7110594899968419
loopiter2: 0.5891903560004721
so using map yields an improvement of approximately 30% over the standard for loop and roughly a further 17% over Martelli's.
Assume that you've defined do_something as a function, and you'd like to perform it N times.
Maybe you can try the following:
todos = [do_something] * N
for doit in todos:
    doit()
What about a simple while loop?
while times > 0:
    do_something()
    times -= 1
You already have the variable; why not use it?

a more pythonic way to express a conditionally bounded loop?

I've got a loop that wants to execute to exhaustion or until some user specified limit is reached. I've got a construct that looks bad yet I can't seem to find a more elegant way to express it; is there one?
def ello_bruce(limit=None):
    for i in xrange(10**5):
        if predicate(i):
            if not limit is None:
                limit -= 1
                if limit <= 0:
                    break

def predicate(i):
    # lengthy computation
    return True
Holy nesting! There has to be a better way. For purposes of a working example, xrange is used where I normally have an iterator of finite but unknown length (and predicate sometimes returns False).
Maybe something like this would be a little better:
from itertools import ifilter, islice

def ello_bruce(limit=None):
    for i in islice(ifilter(predicate, xrange(10**5)), limit):
        # do whatever you want with i here
        pass
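(Note: on Python 3 the same idea is spelled with the built-in filter and range, since ifilter and xrange no longer exist; a sketch, with predicate defined as in the question:)
from itertools import islice

def ello_bruce(limit=None):
    for i in islice(filter(predicate, range(10**5)), limit):
        pass  # do whatever you want with i here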
I'd take a good look at the itertools library. Using that, I think you'd have something like...
# From the itertools examples
from itertools import count, ifilter, imap, islice

def tabulate(function, start=0):
    return imap(function, count(start))

def take(n, iterable):
    return list(islice(iterable, n))

# Then something like:
def ello_bruce(limit=None):
    # take the first `limit` truthy results of predicate
    take(limit, ifilter(None, tabulate(predicate)))
I'd start with
if limit is None: return
since nothing can ever happen to limit when it starts as None (if there are no desirable side effects in the iteration and in the computation of predicate -- if there are, then, in this case you can just do for i in xrange(10**5): predicate(i)).
If limit is not None, then you just want to perform max(limit, 1) computations of predicate that are true, so an itertools.islice of an itertools.ifilter would do:
import itertools as it

def ello_bruce(limit=None):
    if limit is None:
        for i in xrange(10**5): predicate(i)
    else:
        for _ in it.islice(
                it.ifilter(predicate, xrange(10**5)),
                max(limit, 1)):
            pass
You should remove the nested ifs:
if predicate(i) and not limit is None:
...
What you want to do seems perfectly suited for a while loop:
def ello_bruce(limit=None):
    max = 10**5
    # if you consider 0 to be an invalid value for limit you can also do
    # if limit:
    if limit is None:
        limit = max
    i = 0
    while max and limit:
        if predicate(i):
            limit -= 1
        max -= 1
        i += 1
The loop stops if either max or limit reaches zero.
Um. As far as I understand it, predicate just computes in segments, and you totally ignore its return value, right?
This is another take:
import itertools

def ello_bruce(limit=None):
    if limit is None:
        limiter = itertools.repeat(None)
    else:
        limiter = xrange(limit)

    # since predicate is a Python function
    # itertools looping won't be faster, so use plain for.
    # remember to replace the xrange(100000) with your own iterator
    for i, dummy in itertools.izip(xrange(100000), limiter):
        predicate(i)
Also, remove the unneeded return True from the end of predicate.
