How to plot shaded areas on a seaborn countplot

I would like to be able to show shaded areas on a seaborn countplot as shown below. The idea is to show the COVID lockdown periods that cover the data period. I have the countplot, but I can't figure out how to add the shaded areas.
My current df contains dates of properties for sale (from domain.com.au), and I have a small dataset of COVID lockdown dates:
The code that generates the seaborn plot:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(16, 10))
sns.countplot(data=df, x='dates', ax=ax)
A representation of what I am looking to produce (created in Excel).

Assuming your property dataset looks like this:
import pandas as pd

df = pd.DataFrame({
    "date": ["2022-07-01", "2022-07-01", "2022-07-01", "2022-07-02", "2022-07-02",
             "2022-07-03", "2022-07-04", "2022-07-05", "2022-07-05", "2022-07-05",
             "2022-07-06"],
    "prop": ['test'] * 11
})
df['date'] = pd.to_datetime(df['date']).dt.date
date prop
0 2022-07-01 test
1 2022-07-01 test
2 2022-07-01 test
3 2022-07-02 test
4 2022-07-02 test
5 2022-07-03 test
6 2022-07-04 test
7 2022-07-05 test
8 2022-07-05 test
9 2022-07-05 test
10 2022-07-06 test
Then you could do a bar plot of the counts and shade date ranges with matplotlib's axvspan:
fig, ax = plt.subplots(figsize=(16, 10))
prop_count = df['date'].value_counts()
ax.bar(prop_count.index, prop_count.values)

# Shade each lockdown period with a translucent red span.
for start_date, end_date in zip(covid_df['start_date'], covid_df['end_date']):
    ax.axvspan(start_date, end_date, alpha=0.1, color='red')
plt.show()
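Note that covid_df isn't shown in the question; a minimal sketch of what the loop above assumes it looks like (hypothetical dates, one row per lockdown period):
# Hypothetical lockdown periods; the real covid_df is not shown in the question.
covid_df = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-07-02", "2022-07-05"]).date,
    "end_date": pd.to_datetime(["2022-07-03", "2022-07-06"]).date,
})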

Related

Performance for adfuller and SARIMAX

This is somewhat of a continuation of a previous post, but I am trying to forecast weekly revenues. My program seems to hang on the adfuller test. It has run before, and the series appears stationary via the p-value, but not consistently. I have added SARIMAX as well, and the code just hangs. If I cancel out, I periodically get a message towards the bottom that says the problem is unconstrained.
Data:
Week    Week_Start_Date  Amount  year
Week 1  2018-01-01        42920  2018
Week 2  2018-01-08        37772  2018
Week 3  2018-01-15        41076  2018
Week 4  2018-01-22        38431  2018
Week 5  2018-01-29       101676  2018
Code:
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

x = organic_search.groupby('Week_Start_Date').Amount.sum()

# Augmented Dickey-Fuller test
ad_fuller_result = adfuller(x)
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

# SARIMA model (note: disp=1, not dis=1, to show convergence output)
plt.figure(2)
best_model = SARIMAX(x, order=(2, 1, 1), seasonal_order=(2, 1, 1, 52)).fit(disp=1)
print(best_model.summary())
best_model.plot_diagnostics(figsize=(15, 12))
I am only working with 185 or so rows, so I don't understand why the code just hangs. Any optimization suggestions are welcome (for both adfuller and SARIMAX).
Fixed by passing organic_search['Amount'] instead of organic_search.groupby('Week_Start_Date').Amount.sum()
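In other words (a minimal sketch of the reported fix; the original organic_search frame isn't shown, so this assumes Amount already holds one value per week):
# Reported fix: pass the raw column instead of re-aggregating it.
x = organic_search['Amount']
ad_fuller_result = adfuller(x)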

Plot side by side box and whisker plots from two dataframes

I'm hoping to take these two box plots and combine them into one image:
[![These are two data files I was able to make box and whisker charts for easily using Seaborn boxplot][1]][1]
The data file I am using comes from multiple Excel spreadsheets and looks like this:
0    1    2    3    4    5    6    ...
5    2    3    5    6    2    5    ...
2    3    4    6    1    2    1    ...
1    2    4    6    7    8    9    ...
...  ...  ...  ...  ...  ...  ...  ...
where the column headers represent hours, and the column values are what I want to use to create the box and whisker plots.
Currently my code is this:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

xls = pd.ExcelFile('ControlDayVar.xlsx')
df1 = pd.read_excel(xls, 'DE_ControlDays').assign(Location=1)
df2 = pd.read_excel(xls, 'DE_FestDays').assign(Location=2)
DE_all = pd.concat([df1, df2])
DE = pd.melt(DE_all, id_vars=['Location'], var_name='Hours', value_name='Concentration')
ax = sns.boxplot(x='Hours', y='Concentration', hue='Location', data=DE)
plt.show()
The result I get is this:
[![Yikes][2]][2]
I expect my issue has to do with the format of my data files, but any help would be appreciated. Thanks!
[1]: https://i.stack.imgur.com/dXo6F.jpg
[2]: https://i.stack.imgur.com/NEpi7.jpg
This could happen if somehow the Concentration values are not properly recognized as a numerical data type anymore.
In that case, the y-axis can no longer be understood as continuous, which can lead to that "yikes" result.
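If that is what happened, coercing the melted values back to a numeric dtype before plotting should restore a continuous y-axis; a minimal sketch, assuming the DE frame built in the question:
# Coerce Concentration to numeric; unparseable cells become NaN, which boxplot ignores.
DE['Concentration'] = pd.to_numeric(DE['Concentration'], errors='coerce')
ax = sns.boxplot(x='Hours', y='Concentration', hue='Location', data=DE)
plt.show()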

Removing categories with patsy and statsmodels

I am using statsmodels and patsy for building a logistic regression model. I'll use pseudocode here. Let's assume I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reasons to believe some of them would be predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy:
formula = 'outcome ~ C(Country) - C(Country)[GB]'
I tried and it did not change anything.
I don't know if there is a way to subset a category with a patsy formula, but you can do it in the DataFrame.
For example
import numpy as np
import pandas as pd
import statsmodels.api as sm

# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
    'outcome': np.random.random(size),
    'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Let us suppose we want to remove Category "US"
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even if there are no more elements with category "US" in the DataFrame, the category itself is still there. If we used this DataFrame in a statsmodels model, we'd get a singular matrix error, so we need to remove the unused category:
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
and now we can fit a model
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
         Current function value: 0.684054
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:                outcome   No. Observations:                   83
Model:                          Logit   Df Residuals:                       79
Method:                           MLE   Df Model:                            3
Date:                Sun, 16 May 2021   Pseudo R-squ.:                 0.01179
Time:                        22:43:37   Log-Likelihood:                -56.776
converged:                       True   LL-Null:                       -57.454
Covariance Type:            nonrobust   LLR p-value:                    0.7160
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.1493      0.438     -0.341      0.733      -1.007       0.708
Country[T.FR]     0.4129      0.614      0.673      0.501      -0.790       1.616
Country[T.IT]    -0.1223      0.607     -0.201      0.840      -1.312       1.068
Country[T.UK]     0.1027      0.653      0.157      0.875      -1.178       1.383
=================================================================================

Calculating some seasonal climate metrics with Iris

I have a new project on calculating some seasonal climate metrics. As part of this, I need to identify, e.g., the wettest quarter in a set of climatological monthly data:
print(pr_cube)
Precipitation / (mm) (time: 12; latitude: 125; longitude: 211)
Dimension coordinates:
time x - -
latitude - x -
longitude - - x
where time is every month, averaged across 30 years, with coord('time') =
DimCoord([2030-01-01 00:00:00, 2030-02-01 00:00:00, 2030-03-01 00:00:00,
2030-04-01 00:00:00, 2030-05-01 00:00:00, 2030-06-01 00:00:00,
2030-07-01 00:00:00, 2030-08-01 00:00:00, 2030-09-01 00:00:00,
2030-10-01 00:00:00, 2030-11-01 00:00:00, 2030-12-01 00:00:00]
I was wondering if I could add a seasons coordinate for all sets of consecutive 3 months, including 'wrapping around', something like this:
iris.coord_categorisation.add_season(cube, coord, name='season',
                                     seasons=('jfm', 'fma', 'mam', 'amj', 'mjj', 'jja',
                                              'jas', 'aso', 'son', 'ond', 'ndj', 'djf'))
or
season = ('jfm', 'fma', 'mam', 'amj', 'mjj', 'jja', 'jas', 'aso', 'son', 'ond', 'ndj', 'djf')
iris.coord_categorisation.add_season_membership(cube, coord, season, name='all_quarters')
I've not tested this yet; I'm just wondering about suggestions or a recommendation.
And then, get the season with the max rainfall?
Qtr_max_rain = pr_cube.collapsed('season', iris.analysis.MAX)
Would that work correctly?
There may be a way to achieve this using coord_categorisation, but I believe the simplest way is to instead use iris.cube.Cube.rolling_window(). There's no native way to wrap around in the way you need, so you can hack it by duplicating Jan and Feb on the end of the existing data.
I've tested the below and it seems to work as intended. Hopefully it works for you.
# Create extra cube based on Jan and Feb from pr_cube.
extra_months_cube = pr_cube[:2, ...]
# Replace time coordinate with another that is advanced by a year - ensures correct sorting.
# Adjust addition depending on the unit of the time coordinate.
extra_months_coord = extra_months_cube.coord("time") + (24 * 365)
extra_months_cube.remove_coord("time")
extra_months_cube.add_dim_coord(extra_months_coord, 0)
# Combine original cube with extra cube.
both_cubes = iris.cube.CubeList([pr_cube, extra_months_cube])
fourteen_month_cube = both_cubes.concatenate_cube()
# Generate cube of 3-month MAX aggregations.
rolling_cube = fourteen_month_cube.rolling_window("time", iris.analysis.MAX, 3)
Once done, you would of course be free to add your suggested three month labels using iris.cube.Cube.add_aux_coord().
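For example, a sketch of that labelling step (assuming rolling_cube ends up with 12 windows in month order, so the labels run 'jfm' through 'djf'; the coordinate name is arbitrary):
import iris.coords

# Hypothetical season labels, one per 3-month rolling window (Jan-Feb-Mar ... Dec-Jan-Feb).
season_labels = ['jfm', 'fma', 'mam', 'amj', 'mjj', 'jja',
                 'jas', 'aso', 'son', 'ond', 'ndj', 'djf']
rolling_cube.add_aux_coord(iris.coords.AuxCoord(season_labels, long_name='season'),
                           data_dims=0)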

Calculate features at multiple training windows in Featuretools

I have a table with customers and transactions. Is there a way to get features that would be filtered for the last 3/6/9/12 months? I would like to automatically generate features such as:
number of trans in last 3 months
....
number of trans in last 12 months
average trans in last 3 months
...
average trans in last 12 months
I've tried using training_window=["1 month", "3 months"], but it does not seem to return multiple features for each window.
Example:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
window_features = ft.dfs(entityset=es,
                         target_entity="customers",
                         training_window=["1 hour", "1 day"],
                         features_only=True)
window_features
Do I have to do individual windows separately and then merge the results?
As you mentioned, in Featuretools 0.2.1 you have to build the feature matrices individually for each training window and then merge the results. With your example, you would do that as follows:
import pandas as pd
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
features = ft.dfs(entityset=es,
                  target_entity="customers",
                  agg_primitives=['count'],
                  trans_primitives=[],
                  features_only=True)
fm_1 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1h',
                                   verbose=True)
fm_2 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1d',
                                   verbose=True)
new_df = fm_1.reset_index()
new_df = new_df.merge(fm_2.reset_index(), on="customer_id", suffixes=("_1h", "_1d"))
Then, the new dataframe will look like:
customer_id  COUNT(sessions)_1h  COUNT(transactions)_1h  COUNT(sessions)_1d  COUNT(transactions)_1d
          1                   1                      17                   3                      43
          2                   3                      36                   3                      36
          3                   0                       0                   1                      25
          4                   0                       0                   0                       0
          5                   1                      15                   2                      29
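The same pattern extends to the 3/6/9/12-month windows from the question; a sketch, under the assumption that day-based offsets ('90d', '180d', ...) are acceptable approximations of those months:
# Build one feature matrix per training window, suffix its columns,
# and join everything on the shared customer_id index.
fms = []
for window in ['90d', '180d', '270d', '365d']:
    fm = ft.calculate_feature_matrix(features,
                                     entityset=es,
                                     cutoff_time=cutoff_times,
                                     training_window=window)
    fms.append(fm.add_suffix('_' + window))
merged = pd.concat(fms, axis=1)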
