How to run multiple functions (input dependent on prev function) with python asyncio? - python-asyncio

I have 2 dataframes which I need to process separately with replace() before merging them together. In my actual use-case, there are more functions to be applied to each dataframe before the final step which is merging the dataframes.
import pandas as pd

# replace function
def replace(df, old_value, new_value='above 60'):
    df.age = df.age.replace(old_value, new_value)
    return df

# dataframes
dfx = pd.DataFrame({
    'age': ['11', '19', '22', '30', '24', '27', '67'],
    'group': ['A', 'B', 'C', 'C', 'B', 'C', 'B'],
    'count': [3, 5, 2, 1, 4, 5, 2]
})
dfy = pd.DataFrame({
    'age': ['11', '19', '79', '30', '24', '27', '15'],
    'group': ['D', 'D', 'D', 'F', 'E', 'D', 'F'],
    'count': [7, 5, 6, 1, 5, 5, 8]
})
My code to run the functions asynchronously:
import asyncio
import time

async def do_after(delay, what):
    await asyncio.sleep(delay)
    print(what)

async def main():
    task1 = asyncio.create_task(
        do_after(1, replace(dfx, '67')))
    task2 = asyncio.create_task(
        do_after(2, replace(dfx, '79')))
    final_task = asyncio.create_task(
        do_after(3, pd.merge(dfx, dfy, how='inner', on='age')))

    print(f"started at {time.strftime('%X')}")
    # Wait until both tasks are completed (should take
    # around 3 seconds.)
    await task1
    await task2
    await final_task
    print(f"finished at {time.strftime('%X')}")
    return pd.DataFrame(final_task)

await main()
[OUTPUT]:
started at 09:25:18
        age group  count
0        11     A      3
1        19     B      5
2        22     C      2
3        30     C      1
4        24     B      4
5        27     C      5
6  above 60     B      2
        age group  count
0        11     A      3
1        19     B      5
2        22     C      2
3        30     C      1
4        24     B      4
5        27     C      5
6  above 60     B      2
  age group_x  count_x group_y  count_y
0  11       A        3       D        7
1  19       B        5       D        5
2  30       C        1       F        1
3  24       B        4       E        5
4  27       C        5       D        5
finished at 09:25:21
Looking at the last table in the output above, the dataframes have been merged without replace() applied to them.
1. Why is that?
2. How do I return the merged table as a pandas dataframe?
Any help is appreciated.

As stated in the comments, this is not the typical workload that will benefit, at all, from async programming. One could use asyncio to orchestrate calls that work on the dataframes in parallel in different subprocesses, and maybe that could make a difference (but the most optimized operations in the pandas/numpy ecosystem already make use of all CPU cores, so even in that case the gains might be negligible).
That said, the way to ensure ordered execution in asyncio workloads is not to create tasks at all, but simply to use inline await expressions:
async def main():
    ...
    await replace(dfx, '67')
    await replace(dfx, '79')
    await pd.merge(...)
    print("done")
If any steps can actually run in parallel, one can await a single asyncio.gather() call to which all of these steps are passed at once.
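As a rough sketch of that pattern (this example is not part of the original answer, and the coroutine name is just for illustration): since replace() and pd.merge() are ordinary synchronous functions, they have to be wrapped before they can be awaited. With asyncio.to_thread (Python 3.9+), the two replace steps run concurrently via asyncio.gather, and the merge only runs once both have finished; it reuses replace(), dfx and dfy from the question:

import asyncio
import pandas as pd

async def process_and_merge():
    # run both replace() calls concurrently in worker threads
    dfx_done, dfy_done = await asyncio.gather(
        asyncio.to_thread(replace, dfx, '67'),
        asyncio.to_thread(replace, dfy, '79'),
    )
    # merge only after both replacements have completed
    return await asyncio.to_thread(
        pd.merge, dfx_done, dfy_done, how='inner', on='age')

merged = asyncio.run(process_and_merge())   # or: merged = await process_and_merge() in a notebook
print(merged)

For heavy, CPU-bound pandas work a ProcessPoolExecutor via loop.run_in_executor would be the closer fit than threads, but as noted above the gains may well be negligible.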

Related

max() function stops for loop

I don't know why, but the for loop after printing the max value from the array does not work. If I remove print(max(arr)), it works fine.
Input:
3 3 3 34 5
Code:
arr = map(int, set(input().split()))
print(max(arr))
for i in arr:
    print(i)
Expected output:
34
3
34
5
Output:
34
You have exhausted the iterator returned from map(). Instead, create a list from the map():
arr = list(map(int, set(input().split())))  # <-- add list() around map()
print(max(arr))
for i in arr:
    print(i)
Prints (for example):
3 3 3 34 5
34
5
3
34
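To make the underlying issue concrete (this small demo is not part of the original answer): map() returns a one-shot iterator, so once max() has consumed it there is nothing left to loop over:

arr = map(int, {'3', '34', '5'})   # a map object is a lazy, one-shot iterator
print(max(arr))                    # consumes every item -> prints 34
print(list(arr))                   # the iterator is now exhausted -> prints []

Wrapping the map() in list() materializes the values once, so they can be reused any number of times.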

How to parallel compute to improve efficiency in pyspark instead of loop about "for" ?

import numpy as np
import pandas as pd
import pyspark.sql.functions as F

pd_df = pd.DataFrame(np.arange(30).reshape(6, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(pd_df)
spark_df = sqlContext.createDataFrame(pd_df)
for col in ['b', 'c', 'd', 'e']:
    df_groupby = spark_df.groupby(col).count(F.col('a'))
    spark_df = spark_df.join(df_groupby, col, how='left')
output:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
5  25  26  27  28  29
I have a large dataset to process in this way. I think the "for" loop will reduce efficiency. Can anyone tell me how to compute this in parallel in pyspark to improve efficiency?
Sincere thanks.

Pandas multiindex sort

In Pandas 0.19 I have a large dataframe with a Multiindex of the following form
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
I want to sort bar and foo (and many more pairs of rows like them) according to their "two" row, to get the following:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks
Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them. It then sets this order for each row of the original dataframe. It then unravels this order (after adding a constant to offset each row) and the original dataframe values. It then reorders all the original values based on this unraveled, offset and argsorted array before creating a new dataframe with the intended sort order.
import numpy as np
import pandas as pd

rows, cols = df.shape
df_a = np.argsort(df.xs('two', level=1))              # column order for each group, taken from its 'two' row
order = df_a.reindex(df.index.droplevel(-1)).values   # broadcast that order to every row of df
offset = np.arange(len(df)) * cols                    # offset each row into the flattened values
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols),
             index=df.index, columns=df.columns)
Output
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
Some Speed tests
# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop
#Ted
1000 loops, best of 3: 5 ms per loop
Here is a solution, albeit kludgy:
Input dataframe:
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
Custom sorting function:
def sortit(x):
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two', axis=1, inplace=True)
    x.columns = xcolumns
    return x

df.groupby(level=0).apply(sortit)
Output:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3

Drawing from a 2-D prior that is only available as samples in pymc2

I'm trying to play around with Bayesian updating, and have a situation in which I am using a posterior from previous runs as a prior. This is a 2-D prior on alpha and beta, for which I have traces, alphatrace and betatrace. So I stack them and use code adapted from https://gist.github.com/jcrudy/5911624 to make a KDE-based stochastic.
#from https://gist.github.com/jcrudy/5911624
import numpy as np
import pymc
from scipy.stats import gaussian_kde

def KernelSmoothing(name, dataset, bw_method=None, observed=False, value=None):
    '''Create a pymc node whose distribution comes from a kernel smoothing density estimate.'''
    density = gaussian_kde(dataset, bw_method)
    def logp(value):
        #print "VAL", value
        d = density(value)
        if d == 0.0:
            return float('-inf')
        return np.log(d)
    def random():
        result = None
        sample = density.resample(1)
        #print sample, sample.shape
        result = sample[0][0], sample[1][0]
        return result
    if value is None:
        value = random()
    dtype = type(value)
    result = pymc.Stochastic(logp = logp,
                             doc = 'A kernel smoothing density node.',
                             name = name,
                             parents = {},
                             random = random,
                             trace = True,
                             value = None,
                             dtype = dtype,
                             observed = observed,
                             cache_depth = 2,
                             plot = True,
                             verbose = 0)
    return result
Note that the critical thing here is to obtain two values at once from the joint prior: this is why I need a 2-D prior and not two 1-D priors.
The model itself is as follows:
ctrace = np.vstack((alphatrace, betatrace))
cnew = KernelSmoothing("cnew", ctrace)

@pymc.deterministic
def alphanew(cnew=cnew, name='alphanew'):
    return cnew[0]

@pymc.deterministic
def betanew(cnew=cnew, name='betanew'):
    return cnew[1]

newtheta = pymc.Beta("newtheta", alphanew, betanew)
newexp = pymc.Binomial('newexp', n=[14], p=[newtheta], value=[4], observed=True)
model3 = pymc.Model([cnew, alphanew, betanew, newtheta, newexp])
mcmc3 = pymc.MCMC(model3)
mcmc3.sample(20000, 5000, 5)
In case you are wondering, this is to do the 71st experiment in the hierarchical Rat Tumor example in Chapter 5 in Gelman's BDA. The "prior" I am using is the posterior on alpha and beta after 70 experiments.
But, when I sample, things blow up with the error:
ValueError: Maximum competence reported for stochastic cnew is <= 0... you may need to write a custom step method class.
It's not cnew I care about updating as a stochastic, but rather alphanew and betanew. How should I structure the code to make this error go away?
EDIT: initial model which gave me the posteriors I wish to use as the prior:
tumordata="""0 20
0 20
0 20
0 20
0 20
0 20
0 20
0 19
0 19
0 19
0 19
0 18
0 18
0 17
1 20
1 20
1 20
1 20
1 19
1 19
1 18
1 18
3 27
2 25
2 24
2 23
2 20
2 20
2 20
2 20
2 20
2 20
1 10
5 49
2 19
5 46
2 17
7 49
7 47
3 20
3 20
2 13
9 48
10 50
4 20
4 20
4 20
4 20
4 20
4 20
4 20
10 48
4 19
4 19
4 19
5 22
11 46
12 49
5 20
5 20
6 23
5 19
6 22
6 20
6 20
6 20
16 52
15 46
15 47
9 24
"""
tumortuples = [e.strip().split() for e in tumordata.split("\n")]
tumory = np.array([np.int(e[0].strip()) for e in tumortuples if len(e) > 0])
tumorn = np.array([np.int(e[1].strip()) for e in tumortuples if len(e) > 0])
N = tumorn.shape[0]

mu = pymc.Uniform("mu", 0.00001, 1., value=0.13)
nu = pymc.Uniform("nu", 0.00001, 1., value=0.01)

@pymc.deterministic
def alpha(mu=mu, nu=nu, name='alpha'):
    return mu/(nu*nu)

@pymc.deterministic
def beta(mu=mu, nu=nu, name='beta'):
    return (1.-mu)/(nu*nu)

thetas = pymc.Container([pymc.Beta("theta_%i" % i, alpha, beta) for i in range(N)])
deaths = pymc.Binomial('deaths', n=tumorn, p=thetas, value=tumory, size=N, observed=True)
I use the joint posterior on alpha and beta from this model as input to the "new model" at the top. This also raises the question of whether I ought to include theta1..theta70 in the model at the top, as they will update along with alpha and beta thanks to the new data, which is a binomial with n=14, y=4. But I can't even get the little model with only a prior as a 2-D sample array working :-(
I found your question since I ran into a similar problem. According to the documentation of pymc.StepMethod.competence, the problem is that none of the built-in samplers handle the dtype associated with the stochastic variable.
I am not sure what needs to be done to actually resolve that. Maybe one of the sampler methods can be extended to handle special types?
Hopefully someone with more pymc mojo can shed light on what needs to be done. For reference, here is the docstring of pymc.StepMethod.competence:
def competence(s):
    """
    This function is used by Sampler to determine which step method class
    should be used to handle stochastic variables.

    Return value should be a competence
    score from 0 to 3, assigned as follows:

    0:  I can't handle that variable.
    1:  I can handle that variable, but I'm a generalist and
        probably shouldn't be your top choice (Metropolis
        and friends fall into this category).
    2:  I'm designed for this type of situation, but I could be
        more specialized.
    3:  I was made for this situation, let me handle the variable.

    In order to be eligible for inclusion in the registry, a sampling
    method's init method must work with just a single argument, a
    Stochastic object.

    If you want to exclude a particular step method from
    consideration for handling a variable, do this:

    Competence functions MUST be called 'competence' and be decorated by the
    '@staticmethod' decorator. Example:

        @staticmethod
        def competence(s):
            if isinstance(s, MyStochasticSubclass):
                return 2
            else:
                return 0

    :SeeAlso: pick_best_methods, assign_method
    """

Given 5 numbers, by only using addition, multiplication and subtraction, check whether we can generate 42?

Given five numbers between 1 and 52, check whether you can generate 42 by using the operations addition, multiplication and subtraction. You can use these operations any number of times.
I got this question during an online test and couldn't do it.
Assuming each number is to be used once and once only, with only five numbers and three operations, you can quite easily do this with a brute force approach.
It will only have to check 5 * 4 * 3 * 2 * 1 orderings of the numbers times 3 * 3 * 3 * 3 choices of operators, or about 10,000 potential solutions.
As proof-of-concept, here's a Python program for doing this:
import sys
import itertools
if len(sys.argv) != 6:
print "Usage: testprog.py <num1> <num2> <num3> <num4> <num5>"
sys.exit(1)
ops = ['+', '-', '*']
nums = []
for num in sys.argv[1:]:
nums.append(num)
for p in itertools.permutations(nums,len(nums)):
for op1 in ops:
for op2 in ops:
for op3 in ops:
for op4 in ops:
expr = p[0] + op1 + p[1] + op2 + p[2] + op3 + p[3] + op4 + p[4]
result = eval(expr)
if result == 42:
print expr, '=', result
Running that shows the results for the numbers { 1, 2, 3, 4, 5 }:
pax$ time python testprog.py 1 2 3 4 5
2*4*5-1+3 = 42
2*4*5+3-1 = 42
2*5*4-1+3 = 42
2*5*4+3-1 = 42
:
5*4*2-1+3 = 42
5*4*2+3-1 = 42
real 0m0.187s
user 0m0.093s
sys 0m0.077s
and you can see that it completes in about a fifth of a second (on my box).
Assumptions:
- any number may be used multiple times
- operations are applied immediately (no operator precedence)
The approach is to do a breadth-first search over the graph:
from collections import defaultdict

def add(lhs, rhs):
    return lhs+rhs
def sub(lhs, rhs):
    return lhs-rhs
def mul(lhs, rhs):
    return lhs*rhs

ops = [add, sub, mul]   #allowed operations
graph = {0: ["0"]}      #graph key is node(number); value is a list of shortest paths to this node
numbers = [1, 2, 3]     #allowed numbers in operations
target = 12             #target node(number)
gv_edges = []           #edges for optional graphviz output

#breadth first search until target is met
while not target in graph:
    new_graph = defaultdict(list, graph)
    for key in graph:
        #inefficiently searches old nodes also
        for n in numbers:
            for op in ops:
                newkey = op(key, n)
                if newkey not in graph:
                    #not met in previous iterations, keep new edge
                    newvals = ["{} --{}({})--> {}".format(val, op.__name__, n, newkey) for val in new_graph[key]]
                    new_graph[newkey].extend(newvals)
                    gv_edges.append('"{}" -> "{}" [label="{}({})"]'.format(key, newkey, op.__name__, n))
                else:
                    #already met in previous iterations (shorter paths), do not keep new
                    pass
    graph = dict(new_graph)

#print all solutions
print "Solutions:"
print
for val in graph[target]:
    print val
print
print

#print optional graphviz digraph
gv_digraph = 'digraph {{ rankdir=LR ranksep=2\n"{}" [color=green style=filled fillcolor=green]\n{}\n}}'.format(target, "\n".join(gv_edges))
print "Graphviz Digraph for paste into http://stamm-wilbrandt.de/GraphvizFiddle/"
print "do this for reasonable number of edges only"
print
print gv_digraph
results in the following solutions:
0 --add(1)--> 1 --add(3)--> 4 --mul(3)--> 12
0 --add(2)--> 2 --add(2)--> 4 --mul(3)--> 12
0 --add(2)--> 2 --mul(2)--> 4 --mul(3)--> 12
0 --add(3)--> 3 --add(1)--> 4 --mul(3)--> 12
0 --add(2)--> 2 --mul(3)--> 6 --mul(2)--> 12
0 --add(3)--> 3 --mul(2)--> 6 --mul(2)--> 12
0 --add(3)--> 3 --add(3)--> 6 --mul(2)--> 12
0 --add(3)--> 3 --mul(3)--> 9 --add(3)--> 12
the complete graph (only shortest paths!) for depth 3 looks like
