Change Color of Pair Plot Points by Column Value using Seaborn

Using the iris data set, seaborn's pairplot with the following code:
sns.pairplot(iris, hue="class", diag_kind='kde', markers=["o", "s", "D"])
gives a grid in which each class has its own color and marker.
I'm trying to replicate this with another data set I'm working with. It's the monthly returns of four publicly traded companies: amzn, fb, ibm, mmm.
Data set looks like this:
AMZN IBM MMM FB
Date
2016-04-29 0.072016 0.039741 0.030894 0.010415
2016-05-31 0.136702 0.000138 0.005589 0.058342
2016-06-30 0.027122 0.023761 0.014004 -0.026895
2016-07-29 0.035005 0.044894 0.051525 0.033821
2016-08-31 0.031521 0.022542 0.006572 0.044401
Any idea how I can simulate the look of the iris pairplot with the stock returns data? Right now I only get a single color and marker when I run the pair plot; having different colors or markers to distinguish each stock's returns would make the pairs look much cleaner.
Update:
I think the question here should actually be: how can I transform the data so that there is a 'Stock' category with four entries per date? It would look like this:
Stock Return
Date
2016-04-29 AMZN 0.072016
2016-04-29 IBM 0.039741
2016-04-29 MMM 0.030894
2016-04-29 FB 0.010415
I'm not sure this would work, but I think it would, since there would then be a category to use for 'hue'.

In your DataFrame you have only value columns and no category column, so you can't reproduce that pair plot: there is nothing to use as hue.

Your dataset has no categorical column to drive the hue color-coding. If all you want is a plain pair plot, you can just use Seaborn's pairplot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Date': ['2016-04-29', '2016-05-31', '2016-06-30', '2016-07-29', '2016-08-31'],
    'AMZN': [0.072016, 0.136702, 0.027122, 0.035005, 0.031521],
    'IBM': [0.039741, 0.000138, 0.023761, 0.044894, 0.022542],
    'MMM': [0.030894, 0.005589, 0.014004, 0.051525, 0.006572],
    'FB': [0.010415, 0.058342, -0.026895, 0.033821, 0.044401]
})
sns.pairplot(df[['AMZN', 'IBM', 'MMM', 'FB']])
plt.show()
If your question is "how to transform the data," then you can just do a melt:
df2 = pd.melt(df, id_vars='Date').rename(columns={'variable': 'Stock'})
print(df2)
Date Stock value
0 2016-04-29 AMZN 0.072016
1 2016-05-31 AMZN 0.136702
2 2016-06-30 AMZN 0.027122
3 2016-07-29 AMZN 0.035005
4 2016-08-31 AMZN 0.031521
5 2016-04-29 IBM 0.039741
6 2016-05-31 IBM 0.000138
7 2016-06-30 IBM 0.023761
8 2016-07-29 IBM 0.044894
9 2016-08-31 IBM 0.022542
10 2016-04-29 MMM 0.030894
11 2016-05-31 MMM 0.005589
12 2016-06-30 MMM 0.014004
13 2016-07-29 MMM 0.051525
14 2016-08-31 MMM 0.006572
15 2016-04-29 FB 0.010415
16 2016-05-31 FB 0.058342
17 2016-06-30 FB -0.026895
18 2016-07-29 FB 0.033821
19 2016-08-31 FB 0.044401
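Once the frame is in long form, the Stock column is exactly what hue needs. melt can also name the output columns directly, which makes the rename step optional. A minimal pandas-only sketch on a cut-down copy of the data; the commented seaborn call is the assumed next step:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2016-04-29', '2016-05-31'],
    'AMZN': [0.072016, 0.136702],
    'IBM': [0.039741, 0.000138],
})

# one row per (Date, Stock) pair, with the columns named up front
df2 = pd.melt(df, id_vars='Date', var_name='Stock', value_name='Return')

# df2 now has columns Date, Stock, Return, so e.g.
# sns.scatterplot(data=df2, x='Date', y='Return', hue='Stock')
# would give each stock its own color.
```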


How to plot shaded areas on a seaborn countplot

I would like to be able to show shaded areas on a seaborn countplot, as shown below. The idea is to mark the COVID lockdown periods that fall within the data period. I have the countplot, but I can't figure out how to add the shaded areas.
My current df contains dates of properties for sale (from domain.com.au), and I have a small dataset of COVID lockdown dates.
The code that generates the seaborn plot:
fig, ax = plt.subplots(figsize=(16, 10))
sns.countplot(data=df, x='dates', ax=ax)
Here is a representation of what I am looking to produce (created in Excel).
Assuming your property dataset looks like this:
df = pd.DataFrame({
    "date": ["2022-07-01", "2022-07-01", "2022-07-01", "2022-07-02", "2022-07-02", "2022-07-03", "2022-07-04", "2022-07-05", "2022-07-05", "2022-07-05", "2022-07-06"],
    "prop": ['test'] * 11
})
df['date'] = pd.to_datetime(df['date']).dt.date
date prop
0 2022-07-01 test
1 2022-07-01 test
2 2022-07-01 test
3 2022-07-02 test
4 2022-07-02 test
5 2022-07-03 test
6 2022-07-04 test
7 2022-07-05 test
8 2022-07-05 test
9 2022-07-05 test
10 2022-07-06 test
Then you could do a bar plot of the counts and shade the date ranges with matplotlib's axvspan (covid_df below stands for your lockdown dataset, assumed to have start_date and end_date columns):
fig, ax = plt.subplots(figsize=(16, 10))
prop_count = df['date'].value_counts()
ax.bar(prop_count.index, prop_count.values)
# shade one red band per lockdown period
for start_date, end_date in zip(covid_df['start_date'], covid_df['end_date']):
    ax.axvspan(start_date, end_date, alpha=0.1, color='red')
plt.show()
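A self-contained sketch of the same idea with made-up counts and a hypothetical covid_df (the Agg backend lets it run headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# made-up daily listing counts
counts = pd.Series([3, 2, 1, 1, 3, 1],
                   index=pd.to_datetime(["2022-07-01", "2022-07-02", "2022-07-03",
                                         "2022-07-04", "2022-07-05", "2022-07-06"]))
# hypothetical lockdown period
covid_df = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-07-02"]),
    "end_date": pd.to_datetime(["2022-07-04"]),
})

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
# one translucent red band per lockdown period
spans = [ax.axvspan(s, e, alpha=0.1, color="red")
         for s, e in zip(covid_df["start_date"], covid_df["end_date"])]
```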

Performance for adfuller and SARIMAX

This is somewhat of a continuation of a previous post: I am trying to forecast weekly revenues. My program seems to hang on the adfuller test. It has run before, and the p-value suggested the series was stationary, but not consistently. I have added SARIMAX as well, and the code just hangs. If I cancel out, I periodically get a message towards the bottom saying the problem is unconstrained.
Data:
Week | Week_Start_Date |Amount |year
Week 1 2018-01-01 42920 2018
Week 2 2018-01-08 37772 2018
Week 3 2018-01-15 41076 2018
Week 4 2018-01-22 38431 2018
Week 5 2018-01-29 101676 2018
Code:
x = organic_search.groupby('Week_Start_Date').Amount.sum()
# Augmented Dickey-Fuller test
ad_fuller_result = adfuller(x)
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
# SARIMA Model
plt.figure(2)
best_model = SARIMAX(x, order=(2, 1, 1), seasonal_order=(2, 1, 1, 52)).fit(disp=1)
print(best_model.summary())
best_model.plot_diagnostics(figsize=(15,12))
I am only working with 185 or so rows. I don't understand why code is just hanging. Any optimization suggestions welcome (for adfuller and SARIMAX).
Fixed via passing organic_search['Amount'] instead of organic_search.groupby('Week_Start_Date').Amount.sum()

Removing categories with patsy and statsmodels

I am using statsmodels and patsy for building a logistic regression model. I'll use pseudocode here. Let's assume I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reasons to believe some of them would be predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy:
formula = 'outcome ~ C(country) - C(country)[GB]'
I tried and it did not change anything.
I don't know if there is a way to subset a category with a patsy formula, but you can do it in the DataFrame.
For example
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
    'outcome': np.random.random(size),
    'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Let us suppose we want to remove Category "US"
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even if there are no more elements with category "US" in the DataFrame, the category is still there. If we use this DataFrame in a statsmodels model, we'd get a singular matrix error, so we need to remove unused categories
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
and now we can fit a model
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 0.684054
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 83
Model: Logit Df Residuals: 79
Method: MLE Df Model: 3
Date: Sun, 16 May 2021 Pseudo R-squ.: 0.01179
Time: 22:43:37 Log-Likelihood: -56.776
converged: True LL-Null: -57.454
Covariance Type: nonrobust LLR p-value: 0.7160
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.1493 0.438 -0.341 0.733 -1.007 0.708
Country[T.FR] 0.4129 0.614 0.673 0.501 -0.790 1.616
Country[T.IT] -0.1223 0.607 -0.201 0.840 -1.312 1.068
Country[T.UK] 0.1027 0.653 0.157 0.875 -1.178 1.383
=================================================================================
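The crucial step is remove_unused_categories; its effect can be checked in isolation on a tiny made-up Series:

```python
import pandas as pd

s = pd.Series(['IT', 'UK', 'US', 'FR', 'US'], dtype='category')
subset = s[s != 'US']                            # the rows are gone...
trimmed = subset.cat.remove_unused_categories()  # ...and now the category is too
print(trimmed.cat.categories)
```

Without the second step, 'US' would still be a level of the categorical, which is what produces the all-zero dummy column and the singular matrix error.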

time data doesn't match format specified

I am trying to convert a string column to datetime in Python. My data appear to match the format, but I still get
'ValueError: time data 11 11 doesn't match format specified'
I am not sure where the "11 11" in the error comes from.
My code is
train_df['date_captured1'] = pd.to_datetime(train_df['date_captured'], format="%Y-%m-%d %H:%M:%S")
Head of data is
print (train_df.date_captured.head())
0 2011-05-13 23:43:18
1 2012-03-17 03:48:44
2 2014-05-11 11:56:46
3 2013-10-06 02:00:00
4 2011-07-12 13:11:16
Name: date_captured, dtype: object
I tried the following, selecting just the first string and running the code with the same datetime format. Both work without a problem.
dt=train_df['date_captured']
dt1=dt[0]
date = datetime.datetime.strptime(dt1, "%Y-%m-%d %H:%M:%S")
print(date)
2011-05-13 23:43:18
and
dt1=pd.to_datetime(dt1, format='%Y-%m-%d %H:%M:%S')
print (dt1)
2011-05-13 23:43:18
But why, when I use the same format in pd.to_datetime to convert the whole column, does it raise the error above?
Thank you.
I solved it.
train_df['date_time'] = pd.to_datetime(train_df['date_captured'], errors='coerce')
print (train_df[train_df.date_time.isnull()])
I found that in row 100372 the date_captured value is '11 11':
category_id date_captured ... height date_time
100372 10 11 11 ... 747 NaT
So with errors='coerce', the invalid value is replaced with NaT instead of raising.
Thank you.
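The pattern generalizes: parse with errors='coerce', then use the NaT rows to locate the offending input. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['2011-05-13 23:43:18', '11 11', '2014-05-11 11:56:46'])
parsed = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S', errors='coerce')
bad = s[parsed.isna()]  # the raw strings that failed to parse
print(bad)
```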

Fill Column in CSV with previous value using shell

I have the following file, generated by pulling information from a webpage; however, as you can see, it is missing dates on rows 4, 5, 7 and 8:
1,Nov 09 2016,Pakistan,Karachi Stock Exchange,Iqbal Day
2,Nov 11 2016,Poland,Warsaw Stock Exchange,Independence Day
3,Nov 14 2016,Colombia,Colombia Stock Exchange,Independence of Cartagena
4,,India,India National Stock Exchange,Guru Nanak Jayanti
5,,Sri Lanka,Colombo Stock Exchange,Ill Full Moon Poya Day
6,Nov 15 2016,Brazil,Sao Paulo Stock Exchange,Republic Day
7,,Palestinian Territory,Ramallah Stock Exchange,Independence Day
8,,Sri Lanka,Colombo Stock Exchange,Ill Full Moon Poya Day
What I need is to fill the date from row 3 into rows 4 and 5, and the date from row 6 into rows 7 and 8. The file could have more than the 8 lines above, and every blank date must be filled with the value from the previous row. I've tried many answers on Stack Overflow, but none do what I require.
Assuming the first row is complete (with a date value in column 2), give this awk one-liner a try:
awk -F, -v OFS="," '{$2=$2?$2:d;d=$2}7' file
With your input, it outputs:
1,Nov 09 2016,Pakistan,Karachi Stock Exchange,Iqbal Day
2,Nov 11 2016,Poland,Warsaw Stock Exchange,Independence Day
3,Nov 14 2016,Colombia,Colombia Stock Exchange,Independence of Cartagena
4,Nov 14 2016,India,India National Stock Exchange,Guru Nanak Jayanti
5,Nov 14 2016,Sri Lanka,Colombo Stock Exchange,Ill Full Moon Poya Day
6,Nov 15 2016,Brazil,Sao Paulo Stock Exchange,Republic Day
7,Nov 15 2016,Palestinian Territory,Ramallah Stock Exchange,Independence Day
8,Nov 15 2016,Sri Lanka,Colombo Stock Exchange,Ill Full Moon Poya Day
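For readability: `$2=$2?$2:d` keeps column 2 if it is non-empty and otherwise substitutes the saved value d, and the trailing 7 is just a truthy pattern meaning "print every line". An expanded, commented version of the same forward-fill, run on a small assumed sample:

```shell
# same logic as the one-liner, spelled out
out=$(printf '1,Nov 14 2016,X\n2,,Y\n3,,Z\n' | awk -F, -v OFS="," '
{
    if ($2 == "") $2 = d   # blank date: reuse the last one seen
    d = $2                 # remember the current date
    print
}')
printf '%s\n' "$out"
```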
