Annotate FacetGrid subplots with values from dataframe: how to loop through/index into df? [duplicate] - seaborn

In my regular data analysis work, I have switched to using 100% Python since the seaborn package became available. Big thanks for this wonderful package.
However, one Excel-chart feature I miss is displaying the polyfit equation and/or the R² value when using the lmplot() function. Does anyone know an easy way to add that?

This can now be done using the FacetGrid methods .map() or .map_dataframe():
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

tips = sns.load_dataset('tips')
g = sns.lmplot(x='total_bill', y='tip', data=tips, row='sex',
               col='time', height=3, aspect=1)

def annotate(data, **kws):
    r, p = stats.pearsonr(data['total_bill'], data['tip'])
    ax = plt.gca()
    ax.text(.05, .8, 'r={:.2f}, p={:.2g}'.format(r, p),
            transform=ax.transAxes)

g.map_dataframe(annotate)
plt.show()

It can't be done automatically with lmplot because it's undefined what that value should correspond to when there are multiple regression fits (i.e. when using a hue, row, or col variable).
But this is part of the similar jointplot function. By default it shows the correlation coefficient and the p value:
import numpy as np
import seaborn as sns

x, y = np.random.randn(2, 40)
sns.jointplot(x=x, y=y, kind="reg")
But you can pass any function. If you want R², you could do:
from scipy import stats

def r2(x, y):
    return stats.pearsonr(x, y)[0] ** 2

sns.jointplot(x=x, y=y, kind="reg", stat_func=r2)
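Note that stat_func has since been removed from seaborn. A sketch of the equivalent with the current API, annotating the returned JointGrid directly (the text coordinates are illustrative):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

x, y = np.random.randn(2, 40)
g = sns.jointplot(x=x, y=y, kind="reg")
# compute R^2 ourselves and place it on the joint axes
r2 = stats.pearsonr(x, y)[0] ** 2
g.ax_joint.annotate('$R^2$ = {:.2f}'.format(r2), xy=(0.05, 0.95),
                    xycoords='axes fraction')
plt.show()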

Related

How to find all local minima of a function efficiently

This question is related to global optimization, and it is simpler. The task is to find all local minima of a function. This is useful sometimes; for example, in physics we might want to find metastable states besides the true ground state in phase space. I have a naive implementation, tested on the scalar function x*sin(x) + x*cos(2x), which randomly searches starting points in the interval. But clearly this is not efficient. The code and output are attached if you are interested.
#!/usr/bin/env python
"""
Search for all of the local minima using random search when the
functional form of the target function is known.
"""
import numpy as np
import matplotlib.pyplot as plt

def function(x):
    return x * np.sin(x) + x * np.cos(2 * x)
    # return x**4 - 3*x**3 + 2

def derivative(x):
    return np.sin(x) + x * np.cos(x) + np.cos(2 * x) - 2 * x * np.sin(2 * x)
    # return 4.*x**3 - 9.*x**2

def plotting(xr, yr, mls):
    plt.plot(xr, yr)
    plt.grid()
    for xm in mls:
        plt.axvline(x=xm, c='r')
    plt.savefig("plotf.png")
    plt.show()

def findlocmin(x, Nit, step_def=0.1, err=0.0001, gamma=0.01):
    """
    Use gradient descent to find a local minimum, using x as the starting point.
    """
    for i in range(Nit):
        slope = derivative(x)
        step = min(step_def, abs(slope) * gamma)
        x = x - step * slope / abs(slope)
        if abs(slope) < err:
            print("Found local minimum using " + str(i) + " iterations")
            break
        if i == Nit - 1:
            raise Exception("local min is not found using Nit=" + str(Nit) + " iterations")
    return x

if __name__ == "__main__":
    xleft, xright = -9, 9
    xs = np.linspace(xleft, xright, 100)
    ys = function(xs)
    minls = []
    Nrand = 100
    Nit = 10000
    for it in range(Nrand):
        xint = np.random.uniform(xleft, xright)
        xlocm = findlocmin(xint, Nit)
        print(xlocm)
        minls.append(xlocm)
    plotting(xs, ys, minls)
I'd like to know if there is a better solution to this.
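For what it's worth, a minimal sketch of one common refinement (not from the original post): keep the random restarts, but hand the descent to scipy.optimize.minimize and then merge restarts that converged to the same point by rounding. The interval, number of restarts, and rounding precision below are illustrative assumptions.
import numpy as np
from scipy import optimize

def function(x):
    return x * np.sin(x) + x * np.cos(2 * x)

# random restarts inside the search interval
starts = np.random.uniform(-9, 9, 100)
minima = []
for x0 in starts:
    res = optimize.minimize(function, x0)
    if res.success:
        minima.append(res.x[0])

# keep only minima inside the interval and collapse duplicates by rounding
minima = [m for m in minima if -9 <= m <= 9]
print(np.unique(np.round(minima, 4)))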

How training and test data is split - Keras on Tensorflow

I am currently training a neural network on my data, using the fit function.
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)
Now I have set validation_split to 20%. What I understand is that my training data will be 80% and my testing data will be 20%. I am confused about how this data is handled on the back end. Are the top 80% of samples taken for training and the bottom 20% for testing, or are the samples picked at random from in between? If I want to provide separate training and testing data, how do I do that using fit()?
Moreover, my second concern is how to check whether the data fits the model well. I can see from the results that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean it is a case of over-fitting or under-fitting?
My last question is what evaluate returns. The documentation says it returns the loss, but I am already getting the loss and accuracy during each epoch (as a return of fit(), in history). What do the accuracy and score returned by evaluate show? If evaluate returns an accuracy of 90%, can I say my data fits well, regardless of what the individual accuracy and loss were for each epoch?
Below is my Code:
import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools
seed = 7
numpy.random.seed(seed)
dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    # input_dim must match the number of feature columns in X (50 here)
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # for multi-class
    return model

model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
pre_cls=model.predict_classes(X)
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)
score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)
The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means that the shuffling happens after the split. There is also a boolean parameter called shuffle, which is set to True by default, so if you don't want your data to be shuffled you can just set it to False.
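For illustration, a sketch reusing the fit call from the question (shuffle is a real fit() parameter; everything else is unchanged):
history = model.fit(X, encoded_Y, batch_size=50, epochs=500,
                    validation_split=0.2, shuffle=False, verbose=1)
# the last 20% of rows become the validation set; with shuffle=False the
# remaining 80% are also fed to training in their original order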
Getting good results on your training data and then bad or mediocre results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns a very specific scenario and can't achieve good results on new data.
Evaluation means testing your model on data it has "never seen before". Usually you divide your data into training and test sets, but sometimes you also want to create a third group, because if you keep adjusting your model to get better and better results on your test data, that is in some way cheating: you are telling your model what the evaluation data looks like, and this can itself cause overfitting.
Also, if you want to split your data without using Keras, I recommend the sklearn train_test_split() function.
It's easy to use and it looks like this:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
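To also answer the "separate training and testing data" part of the question: fit() accepts a validation_data argument. A sketch that combines the two, including the third evaluation group mentioned above (the split fractions are illustrative):
from sklearn.model_selection import train_test_split

# split off a final test set first, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, encoded_Y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25)

history = model.fit(X_train, y_train, batch_size=50, epochs=500,
                    validation_data=(X_val, y_val), verbose=1)
# the untouched test set is used only once, for the final evaluation
score, acc = model.evaluate(X_test, y_test)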

How can I print the intermediate variables in the loss function in TensorFlow and Keras?

I'm writing a custom objective to train a Keras (with TensorFlow backend) model but I need to debug some intermediate computation. For simplicity, let's say I have:
from keras import backend as K

def custom_loss(y_pred, y_true):
    diff = y_pred - y_true
    return K.square(diff)
I could not find an easy way to access, for example, the intermediate variable diff or its shape during training. In this simple example, I know that I could return diff to print its values, but my actual loss is more complex and I can't return intermediate values without getting compilation errors.
Is there an easy way to debug intermediate variables in Keras?
This is not something that is solved in Keras as far as I know, so you have to resort to backend-specific functionality. Both Theano and TensorFlow have Print nodes that are identity nodes (i.e., they return the input node) and have the side-effect of printing the input (or some tensor of the input).
Example for Theano:
diff = y_pred - y_true
diff = theano.printing.Print('shape of diff', attrs=['shape'])(diff)
return K.square(diff)
Example for TensorFlow:
diff = y_pred - y_true
diff = tf.Print(diff, [tf.shape(diff)])
return K.square(diff)
Note that this only works for intermediate values. Keras expects tensors that are passed to other layers to have specific attributes such as _keras_shape. Values processed by the backend, i.e. through Print, usually do not have that attribute. To solve this, you can wrap debug statements in a Lambda layer for example.
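A minimal sketch of that Lambda wrapping (assuming the TF1-era tf.Print from the example above; the tiny functional model exists only to give the debug layer something to wrap):
import tensorflow as tf
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input(shape=(3,))
x = Dense(2)(inp)
# identity layer whose side effect is printing the tensor's shape
x = Lambda(lambda t: tf.Print(t, [tf.shape(t)], message='shape: '))(x)
out = Dense(1)(x)
model = Model(inp, out)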
In TensorFlow 2, you can now add IDE breakpoints in the TensorFlow Keras models/layers/losses, including when using the fit, evaluate, and predict methods. However, you must add model.run_eagerly = True after calling model.compile() for the values of the tensor to be available in the debugger at the breakpoint. For example,
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
def custom_loss(y_pred, y_true):
    diff = y_pred - y_true
    return tf.keras.backend.square(diff)  # Breakpoint in IDE here. =====

class SimpleModel(Model):
    def __init__(self):
        super().__init__()
        self.dense0 = Dense(2)
        self.dense1 = Dense(1)

    def call(self, inputs):
        z = self.dense0(inputs)
        z = self.dense1(z)
        return z

x = tf.convert_to_tensor([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)
y = tf.convert_to_tensor([0, 1], dtype=tf.float32)

model0 = SimpleModel()
model0.run_eagerly = True  # set *before* compile(), so compile() resets it
model0.compile(optimizer=Adam(), loss=custom_loss)
y0 = model0.fit(x, y, epochs=1)  # Values of diff *not* shown at breakpoint. =====

model1 = SimpleModel()
model1.compile(optimizer=Adam(), loss=custom_loss)
model1.run_eagerly = True  # set *after* compile(), as required
y1 = model1.fit(x, y, epochs=1)  # Values of diff shown at breakpoint. =====
This also works for debugging the outputs of intermediate network layers (for example, adding the breakpoint in the call of the SimpleModel).
Note: this was tested in TensorFlow 2.0.0-rc0.
In TensorFlow 2.0, you can use tf.print and print anything inside the definition of your loss function. You can also do something like tf.print("my_intermediate_tensor =", my_intermediate_tensor), i.e. with a message, similar to Python's print. However, you may need to decorate your loss function with @tf.function to actually see the results of the tf.print.
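A minimal sketch, mirroring the loss from the question:
import tensorflow as tf

@tf.function
def custom_loss(y_true, y_pred):
    diff = y_pred - y_true
    tf.print("diff =", diff, "shape =", tf.shape(diff))  # printed at run time
    return tf.square(diff)

# calling it once executes the traced graph and the tf.print side effect
custom_loss(tf.constant([1.0, 2.0]), tf.constant([1.5, 1.0]))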

Using if conditions inside a TensorFlow graph

In the TensorFlow CIFAR-10 tutorial, in cifar10_inputs.py line 174, it is said that you should randomize the order of the random_contrast and random_brightness operations for better data augmentation.
To do so, the first thing I think of is drawing a random variable p_order from the uniform distribution between 0 and 1, and doing:
if p_order > 0.5:
    distorted_image = tf.image.random_contrast(image)
    distorted_image = tf.image.random_brightness(distorted_image)
else:
    distorted_image = tf.image.random_brightness(image)
    distorted_image = tf.image.random_contrast(distorted_image)
However, there are two possible options for getting p_order:
1) Using numpy, which does not satisfy me, as I wanted pure TF, and TF discourages its users from mixing numpy and tensorflow.
2) Using TF; however, p_order can then only be evaluated in a tf.Session().
I do not really know if I should do:
with tf.Session() as sess2:
    p_order_tensor = tf.random_uniform([1,], 0., 1.)
    p_order = float(p_order_tensor.eval())
All of these operations are inside the body of a function and are run from another script which has a different session/graph. Or I could pass the graph from the other script as an argument to this function, but I am confused.
Even the fact that tensorflow functions like this one, or inference for example, seem to define the graph in a global fashion without explicitly returning it as an output is a bit hard for me to understand.
You can use tf.cond(pred, fn1, fn2, name=None) (see doc).
This function allows you to use the boolean value of pred inside the TensorFlow graph (no need to call .eval() or sess.run(), hence no need for a Session).
Here is an example of how to use it:
def fn1():
    distorted_image = tf.image.random_contrast(image)
    distorted_image = tf.image.random_brightness(distorted_image)
    return distorted_image

def fn2():
    distorted_image = tf.image.random_brightness(image)
    distorted_image = tf.image.random_contrast(distorted_image)
    return distorted_image

# Uniform variable in [0,1)
p_order = tf.random_uniform(shape=[], minval=0., maxval=1., dtype=tf.float32)
pred = tf.less(p_order, 0.5)
distorted_image = tf.cond(pred, fn1, fn2)

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process dataset
before it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle
def plot_scores():
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:, 0], array[:, 1])
    pl.show()

def preprocess(dataset):
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, but the original data was not at all random. My question is: how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal for classifying a dataset that looks like this? (Scatter plot of the two scores attached in the original post.)
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like
from sklearn.svm import SVC
def predict(dataset):
    X = ...  # features, extract from dataset
    y = ...  # labels, extract from dataset
    clf = SVC()  # create classifier
    clf.fit(X, y)  # train
    return clf
Though judging from the name of the function (predict), you should check whether it really wants you to return a trained classifier or to return predictions for the given dataset argument, as that would be more typical.
As a classifier you can use basically any one that you like. Your plot looks like your dataset is linearly separable (there are no colors for the classes, but I assume that the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers).
I would recommend taking a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need

def get_data():
    # open your file and parse the data to prepare X as a set of input
    # vectors and Y as a set of targets
    return X, Y

def split_data(X, Y):
    size = len(X)
    indices = list(range(size))  # list() so shuffle works in Python 3
    shuffle(indices)
    train_indices = indices[:size // 2]
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test

def plot_scatter(Y1, Y2):
    plt.figure()
    # scatter takes colour/marker keywords, not a 'bo' format string
    plt.scatter(Y1, Y2, c='b', marker='o')
    plt.show()

# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier; after that, the classifier is the trained estimator you need
classifier.fit(X_train, Y_train)  # some libraries use .train() or another routine
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, rather than creating a function that produces another function.
