as_completed: identifying coroutine objects - python-asyncio

I'm using asyncio to await a set of coroutines in the following way:
# let's assume we have fn defined and that it can throw an exception
coros_objects = []
for x in range(10):
    coros_objects.append(fn(x))

for c in asyncio.as_completed(coros_objects):
    try:
        y = await c
    except Exception:
        # something
        # if possible print(x)
        pass
The question is: how can I know which coroutine failed, and for which argument?
I could append "x" to the outputs, but that would give me information about successful executions only.
I can't tell from the order either, because results are yielded in completion order, which differs from the order of coros_objects.
Can I somehow identify which coroutine just yielded a result?

The question is: how can I know which coroutine failed, and for which argument?
You can't with the current as_completed. Once this PR is merged, it will be possible by attaching the information to the future (because as_completed will then yield the original futures). At the moment there are two workarounds:
wrap the coroutine execution in a wrapper that catches exceptions and stores them, along with the original arguments you need (a sketch of this follows the asyncio.wait example below), or
not use as_completed at all, but write your own loop using tools like asyncio.wait.
The second option is easier than most people expect, so here it is (untested):
# create a list of tasks and attach the needed information to each
tasks = []
for x in range(10):
    t = asyncio.create_task(fn(x))
    t.my_task_arg = x
    tasks.append(t)

# emulate as_completed with asyncio.wait()
while tasks:
    done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in done:
        try:
            y = await t
        except Exception as e:
            print(f'{e} happened while processing {t.my_task_arg}')
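For completeness, here is an equally untested sketch of the first workaround; the wrapper name run_tagged and the (argument, result, error) tuple shape are just illustrative choices, and fn is assumed to be defined as in the question:

import asyncio

async def run_tagged(x):
    # run fn(x) and return the argument together with either the result or the caught exception
    try:
        return x, await fn(x), None
    except Exception as e:
        return x, None, e

async def main():
    coros = [run_tagged(x) for x in range(10)]
    for c in asyncio.as_completed(coros):
        x, y, err = await c
        if err is not None:
            print(f'{err} happened while processing {x}')

With this approach as_completed itself stays unchanged; the extra information simply travels in the wrapper's return value.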

Related

multiprocessing a geopandas.overlay() throws no error but seemingly never completes

I'm trying to pass a geopandas.overlay() to multiprocessing to speed it up.
I have used custom functions and functools.partial to pre-fill the fixed function inputs and then pass the iterative component to the function, producing a series of dataframes that I then concat into one.
def taska(id, points, crs):
    return make_break_points((vms_points[points.ID == id]).reset_index(drop=True), crs)

points_gdf = ...  # geodataframe of points with an id field
grid_gdf = ...    # geodataframe polygon grid

partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)
In the example above I use the list of unique values to subset the df and pass it to a processor to perform the function and return the results. In the end all the results are combined using pd.concat.
When I apply the same approach to a list of dataframes created using numpy.array_split() the process starts with a number of processors, then they all close and everything hangs with no indication that work is being done or that it will ever exit.
def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)

tracks_gdf = ...  # geodataframe of points with an id field
dfs = np.array_split(tracks_gdf, (cpu_count() - 4))
grid_gdf = ...    # geodataframe polygon grid

partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)
I tried using with get_context('spawn').Pool(cpu_count() - 4) as pool: based on the information at https://pythonspeed.com/articles/python-multiprocessing/, with no change in behavior.
Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) the process is successful and the script carries on to the end with expected results.
Why does the partial function approach work on a list of items but not a list of dataframes?
Is the numpy.array_split() not an iterable object like a list?
How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing capabilities and get back a single dataframe or a series of dataframes to concat?
This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I modified the partial function so the df split happens inside it, then created a list of values from range() as my iterable.
def taskc(num, tracks, grid):
    return gpd.overlay(np.array_split(tracks, cpu_count() - 4)[num], grid, how='union').explode().reset_index(drop=True)

partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)

Why do we need to add [] to an iterable array and * to asyncio.gather?

Using this example
import time
import asyncio
async def main(x):
    print(f"Starting Task {x}")
    await asyncio.sleep(3)
    print(f"Finished Task {x}")

async def async_io():
    tasks = []
    for i in range(10):
        tasks += [main(i)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    start_time = time.perf_counter()
    asyncio.run(async_io())
    print(f"Took {time.perf_counter() - start_time} secs")
I noticed that we need to create a list that keeps track of the tasks to do. Understandable, but then why do we add the [] wrapper over the main(i) function? And also in the asyncio.gather(*tasks), why do we need to add the asterisk there as well?
why do we add the [] wrapper over the main(i) function?
There are a few ways to add items into a list. One such way, the way you've chosen, is by concatenating two lists together.
>>> [1] + [2]
[1, 2]
Trying to concatenate a list and something else will lead to a TypeError.
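For example (the exact message may vary slightly between Python versions):

>>> [1] + 2
Traceback (most recent call last):
  ...
TypeError: can only concatenate list (not "int") to list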
In your particular case you're using augmented assignment, an (often more performant) shorthand for
tasks = tasks + [main(i)]
Another way to accomplish this is with append.
tasks.append(main(i))
If your real code matches your example code, an even better way to spell all of this is
tasks = [main(i) for i in range(10)]
in the asyncio.gather(*tasks), why do we need to add the asterisk there as well?
Because gather will run each positional argument it receives. Calls to gather should look like
asyncio.gather(main(0))
asyncio.gather(main(0), main(1))
Since there are times when you need to use a variable number of positional arguments, Python offers the unpacking operator (* in the case of lists).
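As a quick illustration (the function show below is made up for demonstration), unpacking turns the items of a list into separate positional arguments:

def show(*args):
    print(args)

nums = [1, 2, 3]
show(nums)    # one positional argument, the list itself: ([1, 2, 3],)
show(*nums)   # unpacked into three positional arguments: (1, 2, 3)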
If you felt so inclined, your example can be rewritten as
async def async_io():
    await asyncio.gather(*[main(i) for i in range(10)])

Python asyncio: async for statement

Here is a block of code from: https://github.com/ronf/asyncssh/blob/master/examples/math_server.py#L38
async def handle_client(process):
    process.stdout.write('Enter numbers one per line, or EOF when done:\n')
    total = 0
    try:
        async for line in process.stdin:
            line = line.rstrip('\n')
            if line:
                try:
                    total += int(line)
                except ValueError:
                    process.stderr.write('Invalid number: %s\n' % line)
    except asyncssh.BreakReceived:
        pass
There is an async keyword before the def; however, there is also one before the for loop. Looking at the documentation for asyncio here: https://docs.python.org/3/library/asyncio-task.html, I do not see any similar uses of this async keyword.
So, what does the keyword do in this context?
The async for ... in ... construct allows you to loop over an "asynchronous iterable"; as stated in the comments, the detailed explanation is in PEP 492.
In your example, the async for loop waits for stdin input without blocking other tasks on the asyncio loop.
If you used a plain for loop, it would be a blocking operation and no other tasks on the loop could run until you had entered input.
As another example, imagine a MySQL client fetching some rows from a database.
aio-mysql example:
async for row in conn.execute("SELECT * FROM table;"):
    print(row)
This fetches one row at a time without blocking the execution of other tasks on the asyncio loop while waiting for the I/O operation (the MySQL query).
Then you do something with the row data you've obtained.
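If it helps to see the moving parts, here is a minimal self-contained sketch of an asynchronous iterator (the Ticker class is invented for illustration): async for awaits __anext__ on each step, and that await is exactly where the event loop is free to run other tasks.

import asyncio

class Ticker:
    # yields 0, 1, 2, ... with a pause between items, without blocking the loop
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.i >= self.n:
            raise StopAsyncIteration
        await asyncio.sleep(0.1)  # control is handed back to the event loop here
        self.i += 1
        return self.i - 1

async def main():
    async for value in Ticker(3):
        print(value)

asyncio.run(main())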

Julia - Sending GLPK.Prob to worker

I am using GLPK with Julia, and using the methods written by spencerlyon
sendto(2, lp = lp) #lp is type GLPK.Prob
However, I can't seem to send a GLPK.Prob between workers. Whenever I do try to send a GLPK.Prob, it gets 'sent', and calling
remotecall_fetch(2, whos)
confirms that the GLPK.Prob got sent
The problem appears when I try to solve it by calling
simplex(lp)
the error
GLPK.GLPKError("invalid GLPK.Prob")
appears. I know that the GLPK.Prob isn't originally an invalid GLPK.Prob, and if I construct the GLPK.Prob explicitly on another worker, e.g. worker 2, calling simplex runs just fine.
This is a problem, as the GLPK.Prob is generated from a custom type of mine that is a bit on the heavy side.
tl;dr Are there possibly some types that cannot be sent between workers properly?
Update
I see now that calling
remotecall_fetch(2, simplex, lp)
will return the above GLPK error
Furthermore, I've just noticed that the GLPK module has a method called
GLPK.copy_prob(GLPK.Prob, GLPK.Prob, Int)
but deepcopy (and certainly not copy) won't work when copying a GLPK.Prob.
Example
function create_lp()
    lp = GLPK.Prob()
    GLPK.set_prob_name(lp, "sample")
    GLPK.term_out(GLPK.OFF)
    GLPK.set_obj_dir(lp, GLPK.MAX)
    GLPK.add_rows(lp, 3)
    GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
    GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
    GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
    GLPK.add_cols(lp, 3)
    GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
    GLPK.set_obj_coef(lp,1,10)
    GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
    GLPK.set_obj_coef(lp,2,6)
    GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
    GLPK.set_obj_coef(lp,3,4)
    s = spzeros(3,3)
    s[1,1] = 1
    s[1,2] = 1
    s[1,3] = 1
    s[2,1] = 10
    s[3,1] = 2
    s[2,2] = 4
    s[3,2] = 2
    s[2,3] = 5
    s[3,3] = 6
    GLPK.load_matrix(lp, s)
    return lp
end
This will return an lp::GLPK.Prob which yields 733.33 when running
simplex(lp)
result = get_obj_val(lp)  # returns 733.33
However, doing
addprocs(1)
remotecall_fetch(2, simplex, lp)
will result in the error above
It looks like the problem is that your lp object contains a pointer.
julia> lp = create_lp()
GLPK.Prob(Ptr{Void} @0x00007fa73b1eb330)
Unfortunately, working with pointers and parallel processing is difficult - if different processes have different memory spaces then it won't be clear which memory address the process should look at in order to access the memory that the pointer points to. These issues can be overcome, but apparently they require individual work for each data type that involves said pointers, see this GitHub discussion for more.
Thus, my thought would be that if you want to access the pointer on the worker, you could just create it on that worker. E.g.
using GLPK
addprocs(2)

@everywhere begin
    using GLPK
    function create_lp()
        lp = GLPK.Prob()
        GLPK.set_prob_name(lp, "sample")
        GLPK.term_out(GLPK.OFF)
        GLPK.set_obj_dir(lp, GLPK.MAX)
        GLPK.add_rows(lp, 3)
        GLPK.set_row_bnds(lp,1,GLPK.UP,0,100)
        GLPK.set_row_bnds(lp,2,GLPK.UP,0,600)
        GLPK.set_row_bnds(lp,3,GLPK.UP,0,300)
        GLPK.add_cols(lp, 3)
        GLPK.set_col_bnds(lp,1,GLPK.LO,0,0)
        GLPK.set_obj_coef(lp,1,10)
        GLPK.set_col_bnds(lp,2,GLPK.LO,0,0)
        GLPK.set_obj_coef(lp,2,6)
        GLPK.set_col_bnds(lp,3,GLPK.LO,0,0)
        GLPK.set_obj_coef(lp,3,4)
        s = spzeros(3,3)
        s[1,1] = 1
        s[1,2] = 1
        s[1,3] = 1
        s[2,1] = 10
        s[3,1] = 2
        s[2,2] = 4
        s[3,2] = 2
        s[2,3] = 5
        s[3,3] = 6
        GLPK.load_matrix(lp, s)
        return lp
    end
end

a = @spawnat 2 eval(:(lp = create_lp()))
b = @spawnat 2 eval(:(result = simplex(lp)))
fetch(b)
See the documentation below on @spawn for more info on using it, as it can take a bit of getting used to.
The macros @spawn and @spawnat are two of the tools that Julia makes available to assign tasks to workers. Here is an example:
julia> @spawnat 2 println("hello world")
RemoteRef{Channel{Any}}(2,1,3)
julia> From worker 2: hello world
Both of these macros will evaluate an expression on a worker process. The only difference between the two is that @spawnat allows you to choose which worker will evaluate the expression (in the example above worker 2 is specified) whereas with @spawn a worker will be automatically chosen, based on availability.
In the above example, we simply had worker 2 execute the println function. There was nothing of interest to return or retrieve from this. Often, however, the expression we send to the worker will yield something we wish to retrieve. Notice in the example above, when we called @spawnat, before we got the printout from worker 2, we saw the following:
RemoteRef{Channel{Any}}(2,1,3)
This indicates that the @spawnat macro will return a RemoteRef type object. This object in turn will contain the return values from our expression that is sent to the worker. If we want to retrieve those values, we can first assign the RemoteRef that @spawnat returns to an object and then use the fetch() function, which operates on a RemoteRef type object, to retrieve the results stored from an evaluation performed on a worker.
julia> result = @spawnat 2 2 + 5
RemoteRef{Channel{Any}}(2,1,26)
julia> fetch(result)
7
The key to being able to use @spawn effectively is understanding the nature behind the expressions that it operates on. Using @spawn to send commands to workers is slightly more complicated than just typing directly what you would type if you were running an "interpreter" on one of the workers or executing code natively on them. For instance, suppose we wished to use @spawnat to assign a value to a variable on a worker. We might try:
@spawnat 2 a = 5
RemoteRef{Channel{Any}}(2,1,2)
Did it work? Well, let's see by having worker 2 try to print a.
julia> @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,4)
julia>
Nothing happened. Why? We can investigate this more by using fetch() as above. fetch() can be very handy because it will retrieve not just successful results but also error messages as well. Without it, we might not even know that something has gone wrong.
julia> result = @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,5)
julia> fetch(result)
ERROR: On worker 2:
UndefVarError: a not defined
The error message says that a is not defined on worker 2. But why is this? The reason is that we need to wrap our assignment operation into an expression and then use @spawnat to tell the worker to evaluate that expression. Below is an example, with explanation following:
julia> @spawnat 2 eval(:(a = 2))
RemoteRef{Channel{Any}}(2,1,7)
julia> @spawnat 2 println(a)
RemoteRef{Channel{Any}}(2,1,8)
julia> From worker 2: 2
The :() syntax is what Julia uses to designate expressions. We then use the eval() function in Julia, which evaluates an expression, and we use the @spawnat macro to instruct that the expression be evaluated on worker 2.
We could also achieve the same result as:
julia> @spawnat(2, eval(parse("c = 5")))
RemoteRef{Channel{Any}}(2,1,9)
julia> @spawnat 2 println(c)
RemoteRef{Channel{Any}}(2,1,10)
julia> From worker 2: 5
This example demonstrates two additional notions. First, we see that we can also create an expression using the parse() function called on a string. Secondly, we see that we can use parentheses when calling @spawnat, in situations where this might make our syntax more clear and manageable.

How can a value be accumulated when run in parallel/concurent processes?

I'm running some Ruby scripts concurrently using Grosser/Parallel.
During each concurrent test I want to add up the number of times a particular thing has happened, then display that number.
Let's say:
def main
  $this_happened = 0
  do_this_in_parallel
  puts $this_happened
end

def do_this_in_parallel
  Parallel.each(...) {
    $this_happened += 1
  }
end
The final value after do_this_in_parallel has finished will always be 0.
I'd like to know why this happens.
How can I get the desired result, which would be $this_happened > 0?
Thanks.
This doesn't work because separate processes have separate memory spaces: setting variables etc. in one process has no effect on what happens in the other process.
However, you can return a result from your block (because under the hood Parallel sets up pipes so that the processes can be fed input and return results). For example, you could do this:
counts = Parallel.map(...) do
  # the return value of the block should
  # be the number of times the event occurred
end
Then just sum the counts to get your total count (e.g. counts.reduce(:+)). You might also want to read up on map-reduce for more information about this way of parallelising work.
I have never used Parallel, but the documentation seems to suggest that something like this might work:
Parallel.each(..., :finish => lambda {|*_| $this_happened += 1}) { do_work }
