spark sql distance to nearest holiday - sorting

In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which calculates the distance to the nearest holiday and stores that as a new column.
searchsorted http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html was a great solution for pandas as this gives me the index of the next holiday without a high algorithmic complexity Parallelize pandas apply e.g. this approach was a lot quicker then parallel looping.
How can I achieve this in spark or hive?

This can be done using aggregations but this method would have higher complexity than pandas method. But you can achieve similar performance using UDFs. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))
And dataset of dates of 2016 in dataframe:
from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])
The UDF can use pandas searchsorted but would need to install pandas on executors. Insted you can use plan python like this:
def nearest_holiday(date):
last_holiday = index.value[0]
for next_holiday in index.value:
if next_holiday >= date:
break
last_holiday = next_holiday
if last_holiday > date:
last_holiday = None
if next_holiday < date:
next_holiday = None
return (last_holiday, next_holiday)
from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])
from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
And can be used with withColumn:
df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)
+----------+-----------------------+
|date |holiday |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03] |
|2016-01-02|[null,2016-01-03] |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows

Related

How to get daily market cap with yfinance for Python

I'd like to get market cap for tickers on a daily basis. The code I compiled from other examples just show 1 data for each ticker.
import yfinance as yf
import pandas as pd
stocks = ['XP','STNE','PAGS']
df = pd.DataFrame(columns=['Stock','Marketcap'])
df.columns = ['Stock','Marketcap']
for stock in stocks:
info = yf.Ticker(stock).info
marketcap = info['marketCap']
df = df.append({'Stock':stock,'Marketcap':marketcap}, ignore_index=True)
display(df)
Output:
Please use fast_info instead of info. Therefore your syntax should be like this:
for stock in stocks:
info = yf.Ticker(stock).fast_info
marketcap = info['market_cap']
df = df.append({'Stock':stock,'Marketcap':marketcap}, ignore_index=True)
display(df)
Then you should get be able to get back your market cap. Hope this help.

Two StatsModels modules have totally different 'end-runs'

I'm running StatsModels to estimate parameters of a multiple regression model, using county-level data for 3085 counties. When I use statsmodels.formula.api, and drop a few rows from the data, I get desired results. All seems well enough.
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
%matplotlib inline
from statsmodels.compat import lzip
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
eg=pd.read_csv(r'C:/Users/user/anaconda3/une_edu_pipc_06.csv')
pd.options.display.precision = 3
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
sm_col = eg["lt_hsd_17"] + eg["hsd_17"]
eg["ut_hsd_17"] = sm_col
sm_col2 = eg["sm_col_17"] + eg["col_17"]
eg["bnd_hsd_17"] = sm_col2
eg["d_09"]= eg["Rate_09"]-eg["Rate_06"]
eg["d_10"]= eg["Rate_10"]-eg["Rate_06"]
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
res = sm.ols(formula = "Rate_18 ~ p_c_inc_18 + ut_hsd_17 + d_10 + inc_2",
data=eg, missing='drop').fit()
print(res.summary()).
(BTW, eg["p_c_inc_18"]is per-capita income, and inc_2 is p_c_inc_18 squarred).
But when I wish to use import statsmodels.api as smas the module, everything else staying pretty much the same, and run the following code after all appropriate preliminaries,
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
X = eg[["p_c_inc_18","ut_hsd_17","d_10","inc_2"]]
y = eg["Rate_18"]
X = sm.add_constant(X)
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
then things fall apart, and the Python interpreter throws an error, as follows:
[......]
KeyError: "['inc_2'] not in index"
BTW, the only difference between the two 'runs' is that 15 rows are dropped during the first, successful, model run, while I don't as yet know how to drop missing rows from the second model formulation. Could that difference be responsible for why the second run fails? (I chose to omit large parts of the error message, to reduce clutter.)
You need to assign inc_2 in your DataFrame.
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
should be
eg["inc_2"] = eg["p_c_inc_18"]*eg["p_c_inc_18"]

pandas series sort_index() not working with kind='mergesort'

I needed a stable index sorting for DataFrames, when I had this problem:
In cases where a DataFrame becomes a Series (when only a single column matches the selection), the kind argument returns an error. See example:
import pandas as pd
df_a = pd.Series(range(10))
df_b = pd.Series(range(100, 110))
df = pd.concat([df_a, df_b])
df.sort_index(kind='mergesort')
with the following error:
----> 6 df.sort_index(kind='mergesort')
TypeError: sort_index() got an unexpected keyword argument 'kind'
If DataFrames (more then one column is selected), mergesort works ok.
EDIT:
When selecting a single column from a DataFrame for example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame(np.array(range(25)).reshape(5,5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5,5))
df = pd.concat([df_a, df_b])
the following returns an error:
df[0].sort_index(kind='mergesort')
...since the selection is casted to a pandas Series, and as pointed out the pandas.Series.sort_index documentation contains a bug.
However,
df[[0]].sort_index(kind='mergesort')
works alright, since its type continues to be a DataFrame.
pandas.Series.sort_index() has no kind parameter.
here is the definition of this function for Pandas 0.18.1 (file: ./pandas/core/series.py):
# line 1729
#Appender(generic._shared_docs['sort_index'] % _shared_doc_kwargs)
def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
sort_remaining=True):
axis = self._get_axis_number(axis)
index = self.index
if level is not None:
new_index, indexer = index.sortlevel(level, ascending=ascending,
sort_remaining=sort_remaining)
elif isinstance(index, MultiIndex):
from pandas.core.groupby import _lexsort_indexer
indexer = _lexsort_indexer(index.labels, orders=ascending)
indexer = com._ensure_platform_int(indexer)
new_index = index.take(indexer)
else:
new_index, indexer = index.sort_values(return_indexer=True,
ascending=ascending)
new_values = self._values.take(indexer)
result = self._constructor(new_values, index=new_index)
if inplace:
self._update_inplace(result)
else:
return result.__finalize__(self)
file ./pandas/core/generic.py, line 39
_shared_doc_kwargs = dict(axes='keywords for axes', klass='NDFrame',
axes_single_arg='int or labels for object',
args_transpose='axes to permute (int or label for'
' object)')
So most probably it's a bug in the pandas documentation...
Your df is Series, it's not a data frame

Spark/Hive Hours Between Two Datetimes

I would like to know how to precisely get the number of hours between 2 datetimes in spark.
There is a function called datediff which I could use to get the number of days and then convert to hours however this is less precise than I'd like
example of what I want modeled after datediff:
>>> df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
>>> df.select(hourdiff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=22)]
Try using UDF Here is the sample code, You can modify to UDF return what ever granularity as you want.
from pyspark.sql.functions import udf, col
from datetime import datetime, timedelta
from pyspark.sql.types import LongType
def timediff_x():
def _timediff_x(date1, date2):
date11 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
date22 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
return (date11 - date22).days
return udf(_timediff_x, LongType())
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-25 19:15:00')], ['d1', 'd2'])
df.select(timediff_x()(col("d2"), col("d1"))).show()
+----------------------------+
|PythonUDF#_timediff_x(d2,d1)|
+----------------------------+
| 6|
+----------------------------+
If your columns are of type TimestampType(), you can use the answer at the following question:
Spark Scala: DateDiff of two columns by hour or minute
However, if your columns are of type StringType(), you have an option that is easier than defining an UDF, using the built-in functions:
from pyspark.sql.functions import *
diffCol = unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss') - unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss')
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
df2 = df.withColumn('diff_secs', diffCol)

how to sample multiple chains in PyMC3

I'm trying to sample multiple chains in PyMC3. In PyMC2 I would do something like this:
for i in range(N):
model.sample(iter=iter, burn=burn, thin = thin)
How should I do the same thing in PyMC3? I saw there is a 'njobs' argument in the 'sample' method, but it throws an error when I set a value for it. I want to use sampled chains to get 'pymc.gelman_rubin' output.
Better is to use njobs to run chains in parallel:
#!/usr/bin/env python3
import pymc3 as pm
import numpy as np
from pymc3.backends.base import merge_traces
xobs = 4 + np.random.randn(20)
model = pm.Model()
with model:
mu = pm.Normal('mu', mu=0, sd=20)
x = pm.Normal('x', mu=mu, sd=1., observed=xobs)
step = pm.NUTS()
with model:
trace = pm.sample(1000, step, njobs=2)
To run them serially, you can use a similar approach to your PyMC 2
example. The main difference is that each call to sample returns a
multi-chain trace instance (containing just a single chain in this
case). merge_traces will take a list of multi-chain instances and
create a single instance with all the chains.
#!/usr/bin/env python3
import pymc as pm
import numpy as np
from pymc.backends.base import merge_traces
xobs = 4 + np.random.randn(20)
model = pm.Model()
with model:
mu = pm.Normal('mu', mu=0, sd=20)
x = pm.Normal('x', mu=mu, sd=1., observed=xobs)
step = pm.NUTS()
with model:
trace = merge_traces([pm.sample(1000, step, chain=i)
for i in range(2)])

Resources