How to get the correlation matrix of a pyspark data frame? NEW 2020 - matrix

I have the same question from this topic:
How to get the correlation matrix of a pyspark data frame?
"I have a big pyspark data frame. I want to get its correlation matrix. I know how to get it with a pandas data frame.But my data is too big to convert to pandas. So I need to get the result with pyspark data frame.I searched other similar questions, the answers don't work for me. Can any body help me? Thanks!"
df4 is my dataset, he has 9 columns and all of them are integers:
reference__YM_unix:integer
tenure_band:integer
cei_global_band:integer
x_band:integer
y_band:integer
limit_band:integer
spend_band:integer
transactions_band:integer
spend_total:integer
I have first done this step:
# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df4.columns, outputCol=vector_col)
df_vector = assembler.transform(df4).select(vector_col)
# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)
And got the following output:
(matrix.collect()[0]["pearson({})".format(vector_col)].values)
Out[33]: array([ 1. , 0.0760092 , 0.09051543, 0.07550633, -0.08058203,
-0.24106848, 0.08229602, -0.02975856, -0.03108094, 0.0760092 ,
1. , 0.14792512, -0.10744735, 0.29481762, -0.04490072,
-0.27454922, 0.23242408, 0.32051685, 0.09051543, 0.14792512,
1. , -0.03708623, 0.13719527, -0.01135489, 0.08706559,
0.24713638, 0.37453265, 0.07550633, -0.10744735, -0.03708623,
1. , -0.49640664, 0.01885793, 0.25877516, -0.05019079,
-0.13878844, -0.08058203, 0.29481762, 0.13719527, -0.49640664,
1. , 0.01080777, -0.42319841, 0.01229877, 0.16440178,
-0.24106848, -0.04490072, -0.01135489, 0.01885793, 0.01080777,
1. , 0.00523737, 0.01244241, 0.01811365, 0.08229602,
-0.27454922, 0.08706559, 0.25877516, -0.42319841, 0.00523737,
1. , 0.32888075, 0.21416322, -0.02975856, 0.23242408,
0.24713638, -0.05019079, 0.01229877, 0.01244241, 0.32888075,
1. , 0.53310864, -0.03108094, 0.32051685, 0.37453265,
-0.13878844, 0.16440178, 0.01811365, 0.21416322, 0.53310864,
1. ])
I've tried to insert this result on arrays or an excel file but it didnt work.
I did:
matrix2 = (matrix.collect()[0]["pearson({})".format(vector_col)])
Then I got the following error when I tried to display this info:
display(matrix2)
Exception: ML model display does not yet support model type <class 'pyspark.ml.linalg.DenseMatrix'>.
I was expecting to insert the name of the columns back from df4 but it didnt succeed, I've read that I need to use df4.columns but I have no idea how does it works.
Finally, I was expecting to print the following graph that I've seen from medium article
https://medium.com/towards-artificial-intelligence/feature-selection-and-dimensionality-reduction-using-covariance-matrix-plot-b4c7498abd07
But also it didn't work:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df4.iloc[:,range(0,7)].values)
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 12},
cmap='coolwarm',
yticklabels=cols,
xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients', size = 18)
plt.tight_layout()
plt.show()
AttributeError: 'DataFrame' object has no attribute 'iloc'
I've tried to replace df4 to matrix2 and didn't work too

You can use the following to get the correlation matrix in a form you can manipulate:
matrix = matrix.toArray().tolist()
From there you can convert to a dataframe pd.DataFrame(matrix) which would allow you to plot the heatmap, or save to excel etc.

Related

to_crs("epsg:4326") retruns different coordinate

I am trying to change coordinates system on my geopandas dataframe from epsg:5179 to epsg:4236.
BUT, .to_crs("epsg:4326") retruns different coordinates... How can I get true cordinates?
geo[geometry].set_crs("epsg:5179", inplace = True)
geo_df = geo[geometry].to_crs("epsg:4326")
Original
LINESTRING (14138122.900 4519000.200, 14138248...LINESTRING (14135761.800 4518881.600, 14135799...
Changed-proj
LINESTRING (-149.90927 12.31701, -149.90912 12...LINESTRING (-149.91219 12.32162, -149.91215 12...
It seems like you got true coordinates with your code which is :
geo[geometry].set_crs("epsg:5179", inplace = True)
geo_df = geo[geometry].to_crs("epsg:4326")
I've been looking through pyproj, and couldn't find error to change coordinates epsg:5179 to epsg:4326.
If you want to get futher more information about cordinates, you can visit here.

Random sample for y variable in catplot seaborn

I'm new to python and trying to create catplot (stripplot and swarmplot) with a jitter in seaborn for x='region' and y='amount' using a random sample of 300 from my y variable. I have tried:
data_sample = data.sample(300)
y = data_sample['amount']
plt.figure(figsize=(8,8))
sns.catplot('region', y, data=data, jitter='1', kind='strip')
Which produces:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Can anyone explain what this error means and how to resolve? Also, below the long list of error recommendations it is actually showing a visual of a catplot, labeled with 'Figure size 2160x2160 with 0 Axes.'
Thank you, help appreciated.

How to calculate shap values for ADABoost model?

I am running 3 different model (Random forest, Gradient Boosting, Ada Boost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for ADA with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on Git that state
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is and add another if statement in the
TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I quite new to this.
I had the same problem and what I did, was to modify the file in the git you are commenting.
In my case I use windows so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers but you can do double click over the error message and the file will be opened.
The next step is to add another elif as the answer of the git help says. In my case I did it from the line 404 as following:
1) Modify the source code.
...
self.objective = objective_name_map.get(model.criterion, None)
self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note in the other models, the code of shap needs the attribute 'criterion' that the AdaBoost classifier doesn't have in a direct way. So in this case this attribute is obtained from the "weak" classifiers with the AdaBoost has been trained, that's why I add model.base_estimator_.criterion .
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated and still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified the previous answer to work with the shap/explainers/tree.py file in lines 598-610
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
self.input_dtype = np.float32
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

Keras Inception-v3 fine-tuning workaraound

I am trying to fine-tune Inception-v3, but no matter which layer I choose to freeze I get random predictions. I found that other people are having the same problem: https://github.com/keras-team/keras/issues/9214 . It seems that the problem comes from setting the BN layer to not trainable.
Now I am trying to get the output of the last layer I want to freeze and use it as an input to the following layers, which I will then train:
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False) base_model = InceptionV3(weights='imagenet', include_top=True, input_shape=(299, 299, 3))
model_features = Model(inputs=base_model.input, outputs=base_model.get_layer(
self.Inception_Fine_Tune_Layers[layer_freeze]).output)
#I want to use this as input
values_train = model_features.predict_generator(train_generator, verbose=1)
However, I get Memory error like this, although I have 12Gb, which is more than what I need:
....
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3268864 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3489024 totalling 3.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4211968 totalling 4.02MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5129472 totalling 4.89MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 3.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 68719476736
InUse: 3886957312
MaxInUse: 3889054464
NumAllocs: 3709
MaxAllocSize: 8388608
Any suggestion how to fix that or another workaround to fine-tune Inception will be very helpful.
I can't tell if you're preprocessing your input properly from what you've provided. However, Keras provides functions for preprocessing that are specific to the pre-trained net, in this case Inception V3.
from keras.applications.inception_v3 import preprocess_input
Try adding this to your data generator as the preprocessing function like so...
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
preprocessing_function=preprocess_input, # <---
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False)
You should then be able to unfreeze all of the layers, or the select few that you want to train.
Hope that helps!

How to calculate number of missing values summed over time dimension in a netcdf file in bash

I have a netcdf file with data as a function of lon,lat and time. I would like to calculate the total number of missing entries in each grid cell summed over the time dimension, preferably with CDO or NCO so I do not need to invoke R, python etc.
I know how to get the total number of missing values
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
as I answered to this related question:
count number of missing values in netcdf file - R
and CDO can tell me the total summed over space with
cdo info in.nc
but I can't work out how to sum over time. Is there a way for example of specifying the dimension to sum over with number_miss in ncap2?
We added the missing() function to ncap2 to solve this problem elegantly as of NCO 4.6.7 (May, 2017). To count missing values through time:
ncap2 -s 'mss_val=three_dmn_var_dbl.missing().ttl($time)' in.nc out.nc
Here ncap2 chains two methods together, missing(), followed by a total over the time dimension. The 2D variable mss_val is in out.nc. The response below does the same but averages over space and reports through time (because I misinterpreted the OP).
Old/obsolete answer:
There are two ways to do this with NCO/ncap2, though neither is as elegant as I would like. Either call assemble the answer one record at a time by calling num_miss() with one record at a time, or (my preference) use the boolean comparison function followed by the total operator along the axes of choice:
zender#aerosol:~$ ncap2 -O -s 'tmp=three_dmn_var_dbl;mss_val=tmp.get_miss();tmp.delete_miss();tmp_bool=(tmp==mss_val);tmp_bool_ttl=tmp_bool.ttl($lon,$lat);print(tmp_bool_ttl);' ~/nco/data/in.nc ~/foo.nc
tmp_bool_ttl[0]=0
tmp_bool_ttl[1]=0
tmp_bool_ttl[2]=0
tmp_bool_ttl[3]=8
tmp_bool_ttl[4]=0
tmp_bool_ttl[5]=0
tmp_bool_ttl[6]=0
tmp_bool_ttl[7]=1
tmp_bool_ttl[8]=0
tmp_bool_ttl[9]=2
or
zender#aerosol:~$ ncap2 -O -s 'for(rec=0;rec<time.size();rec++){nmiss=three_dmn_var_int(rec,:,:).number_miss();print(nmiss);}' ~/nco/data/in.nc ~/foo.nc
nmiss = 0
nmiss = 0
nmiss = 8
nmiss = 0
nmiss = 0
nmiss = 1
nmiss = 0
nmiss = 2
nmiss = 1
nmiss = 2
Even though you are asking for another solution, I would like to show you that it takes only one very short line to find the answer with the help of Python. The variable m_data has exactly the same shape as a variable with missing values read using the netCDF4 package. With the execution of only one np.sum command with the correct axis specified, you have your answer.
import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as nc4
# Generate random data for this experiment.
data = np.random.rand(365, 64, 128)
# Masked data, this is how the data is read from NetCDF by the netCDF4 package.
# For this example, I mask all values less than 0.1.
m_data = np.ma.masked_array(data, mask=data<0.1)
# It only takes one operation to find the answer.
n_values_missing = np.sum(m_data.mask, axis=0)
# Just a plot of the result.
plt.figure()
plt.pcolormesh(n_values_missing)
plt.colorbar()
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
# Save a netCDF file of the results.
f = nc4.Dataset('test.nc', 'w', format='NETCDF4')
f.createDimension('lon', 128)
f.createDimension('lat', 64 )
n_values_missing_nc = f.createVariable('n_values_missing', 'i4', ('lat', 'lon'))
n_values_missing_nc[:,:] = n_values_missing[:,:]
f.close()

Resources