How to do a customized "average" for pandas multilevel dataframe? - performance

I have a pandas multilevel dataframe df to contain the quarterly financial report data for about 2000+ stocks from year 2006 to 2012 . And I am trying to figure out a way to quickly calculate the 'average' values for each data point.
demo_data() is the function to generate the demo data (df = demo_data(stk_qty=2000, col_num=200) can be used to simulate the financial report data):
def demo_data(stk_qty, col_num):
''' generate demo data, return multilevel dataframe '''
import random
import pandas as pd
rpt_date_template = [(yr+qt) for yr in map(str, range(2006, 2013)) for qt in ['0331','0630','0930','1231']]
stk_id_list = ['STK'+str(x).zfill(3) for x in range(0, stk_qty)]
stk_id_column, rpt_date_column = [], []
for i in range(stk_qty):
stk_rpt_date_list = rpt_date_template[random.randint(0,8):] # rpt dates with random start
stk_id_column.extend([stk_id_list[i]] * len(stk_rpt_date_list))
rpt_date_column.extend(stk_rpt_date_list)
index_name = ['STK_ID', 'RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = stk_id_column
second_level_dt = rpt_date_column
dt = pd.DataFrame(np.random.randn(len(stk_id_column), col_num), columns=col_name)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
multilevel_df = dt.set_index(index_name, drop=True, inplace=False)
return multilevel_df
Here is a sample data. (note: sw() is a method to display the four corners data of a big dataframe, source code is at: How to preview a part of a large pandas DataFrame? )
>>> df = demo_data(5,3)
>>> df.sw()
COL000 COL001 COL002
STK_ID RPT_Date
STK000 20060630 1.8196 0.9519 -1.0526
20060930 -0.4074 -0.9025 1.3562
20061231 -1.1750 0.4190 -1.2976
20070331 -0.5609 1.5190 0.4893
20070630 0.4580 -0.3804 0.3705
20070930 -0.4711 -1.1953 -0.0609
20071231 0.3363 1.1949 1.2802
20080331 1.6359 0.8355 -0.2763
20080630 0.2697 -0.8236 -1.7095
20080930 0.6178 -0.3742 -1.1646
.......................................
STK004 20111231 -0.3198 1.6972 -1.3281
20120331 -1.1905 -0.4597 0.3695
20120630 -0.8253 -0.0502 -0.2862
20120930 0.0059 -1.8535 -1.2107
20121231 0.5762 -0.2872 0.0993
Index : ['STK_ID', 'RPT_Date']
Column: COL000,COL001,COL002
row: 117 col: 3
The customized average function I want is named as my_avg() and defined as below rules:
1. Q1's average value is (Q4_of_previous_yr + Q1)/2
2. Q2's average value is (Q4_of_previous_yr + Q1 + Q2)/3
3. Q3's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3)/4
4. Q4's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3 + Q4)/5
5. if some of the data points are not provided, just calculate the normal average of available data points
so the my_avg(df) will have below output for each STK_ID:
STK_ID RPT_Date COL000 COL001 COL002
STK000 20060630 1.819619705 0.951918984 -1.052639309
20060930 0.706112476 0.024688028 0.151757352
20061231 0.079077767 0.156125083 -0.331359614
20070331 -0.867930112 0.969000466 -0.404129827
20070630 -0.425943376 0.519205768 -0.145929753
20070930 -0.437234418 0.090579744 -0.124681449
20071231 -0.282524858 0.3114374 0.156297097
20080331 0.986121631 1.015202552 0.501971496
.......................................
STK004 20111231 xxxxx xxxxxxx xxxxxxx
How to write the code for my_avg() ?
Reference:
I try to write a temp_solution_avg() function. But it has three issues:
1. the average calculation not include 'Q4_of_previous_yr' data point, so the result is not what I want.
2. data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's data is wrong
3. the calculation speed is very very slow.
In [3]: df = demo_data(500,100)
In [4]: timeit temp_solution_avg(df)
1 loops, best of 3: 66.3 s per loop
def temp_solution_avg(df):
''' return the average , Q1: not change, Q2 : (df.Q1 + df.Q2)/2 ,
Q3: (df.Q1 + df.Q2 + df.Q3)/3, Q4 : (df.Q1 + df.Q2 + df.Q3 + df.Q4)/4
data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's
data is wrong .
'''
dt = df.reset_index()
dt['yr'] = dt['RPT_Date'].str[0:4]
dt['temp_stk_id'] = dt['STK_ID']
dt = dt.set_index(['STK_ID','RPT_Date'], drop=True, inplace=False)
rst = dt.groupby(['temp_stk_id','yr']).transform(pd.expanding_mean)
return rst

Related

fable package: Could not find an appropriate ARIMA model

I am trying to fit the Arima model to hourly data. First, I tried fable package, and the ARIMA function could not find the appropriate model. Second, I used forecast package with auto.arima function, which worked perfectly. I have one example series (available here: https://gist.github.com/mizhozan/800fec80682822969e7d35ebba395) and the results as an example here:
data.arima <- read.csv('test.csv', header = TRUE)[,-1]
## fable package
data.arima$Date <- lubridate::ymd_hms(data.arima$Date, truncated = 2)
library(tidyverse)
library(fable)
result.arima <- data.arima %>%
as_tsibble(., index = Date)%>%
model(ARIMA(value ~ PDQ() + pdq() +
fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")) %>%
forecast(h = 24)
Warning message:
1 error encountered for ARIMA(value ~ PDQ() + pdq() + fourier(period = "day", K = 3) +
fourier(period = "week", K = 2), seasonal.test = "ocsb")
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
## forecast package
library(forecast)
series.arima <- msts(data.arima$value, seasonal.periods = c(24, 24*7))
model.arima <- auto.arima(series.arima, seasonal.test = "ocsb", xreg=fourier(series.arima,K=c(3,2)))
Series: series.arima
Regression with ARIMA(4,0,1) errors
Coefficients:
ar1 ar2 ar3 ar4 ma1 intercept S1-24 C1-24 S2-24 C2-24 S3-24 C3-24 S1-168 C1-168 S2-168 C2-168
1.9064 -1.4934 0.8292 -0.3056 -0.8728 664263.21 -310891.13 -349744.23 -133862.32 -20587.2 69313.88 51963.803 43880.66 1524.578 -3823.166 5642.26
s.e. 0.0755 0.1192 0.1085 0.0521 0.0605 7781.72 20778.06 20591.69 11662.66 11606.0 8792.99 8768.856 11342.32 11669.244 12819.074 13091.08
sigma^2 estimated as 5.122e+09: log likelihood=-4225.19
AIC=8484.38 AICc=8486.31 BIC=8549.28
result.arima.2 <- forecast(model.arima, xreg=fourier(series.arima, K = c(3,2), h = 24))
I would appreciate that if someone could explain the problem here.

Python Matplotlib Add 2 Dynamic Title Components

I have 6 subplots that need 2 dynamic title components and I can code for 1 but I'm not sure how to change my code below to add a 2nd dynamic title component on the same line after searching the literature. Here is my for loop to generate the 6 subplots with the "plt.title.." line below:
list = [0,1,2,3,4,5]
now = datetime.datetime.now()
currm = now.month
import calendar
fig, ax = plt.subplots(6)
for x in list:
dam = DS.where(DS['time.year']==rmax.iloc[x,1]).groupby('time.month').mean()#iterate by index of
column "1" or the years
dam = dam.sel(month=3)#current month mean 500
dam = dam.sel(level=500)
damc = dam.to_array()
lats = damc['lat'].data
lons = damc['lon'].data
#plot data
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines(lw=1)
damc = damc.squeeze()
cnplot = plt.contour(lons,lats,damc,cmap='jet')
plt.title('Mean 500mb Hgt + Phase {} 2020'.format(calendar.month_name[currm-1]))
plt.show()
#plt.clf()
I need to add one of each from this list in the loop to the "plt.title.." between the "+" and the word "Phase" line above...?
tindices = ['SOI','AO','NAO','PNA','EPO','PDO']
Thank you for any help with this!
Try accessing the tindices one by one and passing them to the title
plt.title('Mean 500mb Hgt + {} Phase {} 2020'.format(tindices[x],
calendar.month_name[currm-1]))

how to solve a simple mixing operation in gekko?

I am trying to solve a simple mixing operation in gekko. The mixer mx takes two inlet streams Feed1 and Feed2. The expected result is that mass flow of outlet stream mx.outlet should be the summation of mass flows of the inlet streams.
Here is what I have tried.
from gekko import GEKKO, chemical
m = GEKKO(remote=False)
f = chemical.Flowsheet(m)
P = chemical.Properties(m)
c1 = P.compound('Butane')
c2 = P.compound('Propane')
feed1 = f.stream()
m_feed1 = f.massflows(sn= feed1.name)
m_feed1.mdot = 200
m_feed1.mdoti = [50,150]
feed2= f.stream()
m_feed2 = f.massflows(sn= feed2.name)
m_feed2.mdot = 200
m_feed2.mdoti = [50,150]
mx = f.mixer(ni=2)
mx.inlet = [feed1.name,feed2.name]
m.options.SOLVER = 1
mf = f.massflows(sn = mx.outlet)
m.solve()
The code runs successfully. However, on mf.mdot seems to output incorrect value [-1.8220132454e-06]. The expected value is 400. Any help , what is wrong with my code?
Here is source code that works for this mixing application:
from gekko import GEKKO, chemical
import json
m = GEKKO(remote=False)
f = chemical.Flowsheet(m)
P = chemical.Properties(m)
# define compounds
c1 = P.compound('Butane')
c2 = P.compound('Propane')
# create feed streams
feed1 = f.stream(fixed=False)
feed2 = f.stream(fixed=False)
# create massflows objects
m_feed1 = f.massflows(sn=feed1.name)
m_feed2 = f.massflows(sn=feed2.name)
# create mixer
mx = f.mixer(ni=2)
# connect feed streams to massflows objects
f.connect(feed1,mx.inlet[0])
f.connect(feed2,mx.inlet[1])
m.options.SOLVER = 1
mf = f.massflows(sn = mx.outlet)
# specify mass inlet flows
mi = [50,150]
for i in range(2):
m.fix(m_feed1.mdoti[i],val=mi[i])
m.fix(m_feed2.mdoti[i],val=mi[i])
# fix pressure and temperature
m.fix(feed1.P,val=101325)
m.fix(feed2.P,val=101325)
m.fix(feed1.T,val=300)
m.fix(feed2.T,val=305)
m.solve(disp=True)
# print results
print(f'The total massflow out is {mf.mdot.value}')
print('')
# get additional solution information
with open(m.path+'//results.json') as f:
r = json.load(f)
for name, val in r.items():
print(f'{name}={val[0]}')
Below is the solver output. This will only work with APM 0.9.1 and Gekko v0.2.3 (release coming Aug 2019). The thermo and flowsheet object libraries were released with v0.2.2 and there are several features that are still under development. The next release should resolve many of them.
----------------------------------------------------------------
APMonitor, Version 0.9.1
APMonitor Optimization Suite
----------------------------------------------------------------
--------- APM Model Size ------------
Each time step contains
Objects : 6
Constants : 0
Variables : 19
Intermediates: 0
Connections : 44
Equations : 2
Residuals : 2
Number of state variables: 14
Number of total equations: - 14
Number of slack variables: - 0
---------------------------------------
Degrees of freedom : 0
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.86642E-16 1.99000E+02
1 4.39087E-18 1.11937E+01
2 8.33448E-19 6.05819E-01
3 1.84640E-19 1.62783E-01
4 2.91981E-20 7.21250E-02
5 1.55439E-21 2.28110E-02
6 5.51232E-24 1.21437E-03
7 7.03139E-29 4.30650E-06
8 7.03139E-29 4.30650E-06
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.0469 sec
Objective : 0.
Successful solution
---------------------------------------------------
v1 not found in results file
The total massflow out is [400.0]
time=0.0
feed1.h=44154989.486
feed1.x[2]=0.79815448476
feed1.vdot=104.9180373
feed1.dens=0.040621756423
feed1.c[1]=0.0081993193551
feed1.c[2]=0.032422437068
feed1.mdot=200.0
feed1.y[1]=0.25
feed1.y[2]=0.75
feed1.sfrc=0.0
feed1.lfrc=0.0
feed1.vfrc=1.0
feed2.h=44552246.421
feed2.x[2]=0.79815448476
feed2.vdot=106.66667125
feed2.dens=0.03995582599
feed2.c[1]=0.0080649042837
feed2.c[2]=0.031890921707
feed2.mdot=200.0
feed2.y[1]=0.25
feed2.y[2]=0.75
feed2.sfrc=0.0
feed2.lfrc=0.0
feed2.vfrc=1.0
mixer5.outlet.t=381.10062836
mixer5.outlet.h=44353617.96
mixer5.outlet.ndot=8.5239099109
mixer5.outlet.x[1]=0.20184551524
mixer5.outlet.x[2]=0.79815448476
mixer5.outlet.vdot=1.5797241143
mixer5.outlet.dens=5.5635215396
mixer5.outlet.c[1]=1.0891224437
mixer5.outlet.c[2]=4.3066994177
mixer5.outlet.mdot=400.0
mixer5.outlet.y[1]=0.25
mixer5.outlet.y[2]=0.75
mixer5.outlet.sfrc=0.0
mixer5.outlet.lfrc=1.0
mixer5.outlet.vfrc=0.0
v2=300.0
v3=4.2619549555
v4=0.20184551524
v5=0.79815448476
v6=101325.0
v7=305.0
v8=4.2619549555
v9=0.20184551524
v10=0.79815448476
v11=200.0
v12=50.0
v13=150.0
v14=200.0
v15=50.0
v16=150.0
v17=400.0
v18=100.0
v19=300.0

Extract multiple protein sequences from a Protein Data Bank along with Secondary Structure

I want to extract protein sequences and their corresponding secondary structure from any Protein Data bank, say RCSB. I just need short sequences and their secondary structure. Something like,
ATRWGUVT Helix
It is fine even if the sequences are long, but I want a tag at the end that denotes its secondary structure. Is there any programming tool or anything available for this.
As I've shown above I want only this much minimal information. How can I achieve this?
from Bio.PDB import *
from distutils import spawn
Extract sequence:
def get_seq(pdbfile):
p = PDBParser(PERMISSIVE=0)
structure = p.get_structure('test', pdbfile)
ppb = PPBuilder()
seq = ''
for pp in ppb.build_peptides(structure):
seq += pp.get_sequence()
return seq
Extract secondary structure with DSSP as explained earlier:
def get_secondary_struc(pdbfile):
# get secondary structure info for whole pdb.
if not spawn.find_executable("dssp"):
sys.stderr.write('dssp executable needs to be in folder')
sys.exit(1)
p = PDBParser(PERMISSIVE=0)
ppb = PPBuilder()
structure = p.get_structure('test', pdbfile)
model = structure[0]
dssp = DSSP(model, pdbfile)
count = 0
sec = ''
for residue in model.get_residues():
count = count + 1
# print residue,count
a_key = list(dssp.keys())[count - 1]
sec += dssp[a_key][2]
print sec
return sec
This should print both sequence and secondary structure.
You can use DSSP.
The output of DSSP is explained extensively under 'explanation'. The very short summary of the output is:
H = α-helix
B = residue in isolated β-bridge
E = extended strand, participates in β ladder
G = 3-helix (310 helix)
I = 5 helix (π-helix)
T = hydrogen bonded turn
S = bend

Speed up the analysis

I have 2 dataframes in R for example df and dfrefseq.
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to check for each gene in df witch gene in dfrefseq lies closest to the selected df gene.
I first selected "chr1" in both dataframes.
Then I calculated for the first gene in readschr1 the distance between start-start start-stop stop-start and stop-stop sites.
The sum of this calculations say everything about the distance. My question here is, How can I speed up this analyse? Because now I tested only 1 gene against a dataframe, but I need to test 2000 genes.
readschr1 <- subset(df,df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq,dfrefseq[,1]=="chr1")
names<-list()
read_start_start<-list()
read_start_stop<-list()
read_stop_start<-list()
read_stop_stop<-list()
for (i in 1:nrow(refseqchr1)) {
startstart<-abs(readschr1[1,2] - refseqchr1[i,2])
startstop<-abs(readschr1[1,2] - refseqchr1[i,3])
stopstart<-abs(readschr1[1,3] - refseqchr1[i,2])
stopstop<-abs(readschr1[1,3] - refseqchr1[i,3])
read_start_start[[i]]<- matrix(startstart)
read_start_stop[[i]]<- matrix(startstop)
read_stop_start[[i]]<- matrix(stopstart)
read_stop_stop[[i]]<- matrix(stopstop)
names[[i]]<-matrix(refseqchr1[i,4])
}
table<-cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns<-as.numeric(table[,2]) + as.numeric(table[,3])+ as.numeric(table[,4]) + as.numeric(table[,5])
test<-cbind(table, sumtotalcolumns)
test1<-test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
then
library(GenomicRanges)
gr <- with(df,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
You can merge the two separate data.frames together to form one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do when there are cases that do not match. Specifying all = TRUE will return all rows and fill in NAs if there is no match in the other data.frame, i.e. ch2 and ch4 in this case. Once the data.frames have been merged, then it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
read_start_start = abs(start.x - start.y)
, read_start_stop = abs(start.x - stop.y)
, read_stop_start = abs(stop.x - start.y)
, read_stop_stop = abs(stop.x - stop.y)
)
zz <- transform(zz,
sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
)
Here's one approach get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])

Resources