How training and test data is split - Keras on Tensorflow - validation

I am currently training a neural network on my data using the fit function:
history=model.fit(X, encoded_Y, batch_size=50, nb_epoch=500, validation_split = 0.2, verbose=1)
I have set validation_split to 20%. What I understood is that my training data will be 80% and my testing data will be 20%. I am confused about how this data is handled on the back end. Are the top 80% of samples taken for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to give separate training and testing data, how do I do that with fit()?
Moreover, my second concern is how to check whether the data fits the model well. I can see from the results that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean it is a case of over-fitting or under-fitting?
My last question is: what does evaluate() return? The documentation says it returns the loss, but I am already getting the loss and accuracy during each epoch (as the return value of fit(), in history). What do the score and accuracy returned by evaluate() show? If evaluate() returns an accuracy of 90%, can I say my data fits well, regardless of what the individual accuracy and loss were for each epoch?
Below is my Code:
import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools
seed = 7
numpy.random.seed(seed)
dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu')) # input_dim must match the 50 input features in X
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # for multi class
    return model

model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, nb_epoch=500, validation_split=0.2, verbose=1)
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
pre_cls=model.predict_classes(X)
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)
score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means the split happens before shuffling, so the last 20% of the samples (in the order you pass them) becomes the validation set. There is also a boolean parameter called shuffle, which defaults to True, so if you don't want your training data to be shuffled you can set it to False.
Getting good results on your training data and then bad (or not so good) results on your validation data usually means that your model is overfitting. Overfitting is when your model learns a very specific scenario and can't achieve good results on new data.
Evaluation means testing your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you also want to create a third group of data: if you keep adjusting your model to get better and better results on your test data, that is in a way cheating, because you are telling your model what the evaluation data looks like, and this can itself cause overfitting.
Also, if you want to split your data yourself rather than through Keras, I recommend sklearn's train_test_split() function.
It's easy to use and it looks like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
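If you want to pass separate training and validation data to fit() yourself, a minimal sketch could look like this (assuming the same X, encoded_Y, model and Keras version as in the question; the variable names are just illustrative):
from sklearn.model_selection import train_test_split

# hold out 20% of the data; the remaining 80% is used for training
X_train, X_test, y_train, y_test = train_test_split(X, encoded_Y, test_size=0.2, random_state=7)

# pass the held-out split explicitly instead of relying on validation_split
history = model.fit(X_train, y_train, batch_size=50, nb_epoch=500,
                    validation_data=(X_test, y_test), verbose=1)

# evaluate() runs a single pass over the given data and returns the loss plus the compiled metrics
score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score)
print('Test accuracy:', acc)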

Related

getting recall and precision of score function in MultinomialNB in sklearn

I want to get the recall and precision of my predictions, but I am confused. The score function of MultinomialNB in sklearn cannot return a list of predictions or anything else that would help me reach the precision and recall. For example:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
model.score(x_test, y_test)
Here I have a model that is going to predict, but MultinomialNB() has no function that provides the precision or recall or a list of predictions, etc.
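A minimal sketch of the usual way around this, assuming the same x_train / x_test / y_train / y_test as above: call predict() yourself and pass the predictions to sklearn.metrics.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, classification_report

model = MultinomialNB()
model.fit(x_train, y_train)

# score() only reports accuracy; predict() gives the label list the metric functions need
y_pred = model.predict(x_test)

# average='macro' is just one choice of averaging; adjust it to your label setup
print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print(classification_report(y_test, y_pred))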

Remove features based on correlation in case of mixed continuous & categorical features

I am working on a machine learning regression task in Python with mixed continuous and categorical features.
I apply one-hot encoding to the categorical features, as can be seen below:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# -----------------------------------------------------------------------------
# Data
# -----------------------------------------------------------------------------
# Ames
X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)
# In this dataset, categorical features have "object" or "non-numerical" data-type.
numerical_features = X.select_dtypes(include='number').columns.tolist() # 37
categorical_features = X.select_dtypes(include='object').columns.tolist() # 43
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# -----------------------------------------------------------------------------
# Data preprocessing
# -----------------------------------------------------------------------------
numerical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
categorical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
preprocessor = ColumnTransformer(transformers=[
        ('number', numerical_preprocessor, numerical_features),
        ('category', categorical_preprocessor, categorical_features)
    ],
    verbose_feature_names_out=True,
)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
I want to remove highly correlated features by the following algorithm:
Find Pearson correlation coefficient between all features.
If correlation > threshold:
Drop one of the features which has lower correlation with objective variable (which is a continuous variable)
However, I am not sure which method is suitable for calculating the correlation between:
continuous features & one-hot encoded categorical features
one-hot encoded categorical features & continuous objective variable
Any advice is appreciated.
Assume instead that the machine learning task is a classification task. Which method do you recommend for calculating the correlation between:
one-hot encoded categorical features & categorical objective variable
continuous features & categorical objective variable
Why not use mutual_info_regression for one-hot encoded categorical features & a continuous objective variable, and mutual_info_classif for continuous features & a categorical objective variable? Here's a template you can build upon (see the sketch below), suitable for inclusion in an ML pipeline.
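The template itself is not included above; a minimal sketch of what it could look like, continuing from the X_train_processed / X_test_processed / y_train / preprocessor objects defined earlier (the k value and the number of features printed are purely illustrative):
import numpy as np
from sklearn.feature_selection import mutual_info_regression, SelectKBest

# Mutual information of every preprocessed feature with the continuous target
mi = mutual_info_regression(X_train_processed, y_train, random_state=0)

feature_names = preprocessor.get_feature_names_out()
top = np.argsort(mi)[::-1][:10]
print('10 most informative features:')
for i in top:
    print(f'  {feature_names[i]}: {mi[i]:.3f}')

# Or wrap the same scoring in a selector so it can sit inside a Pipeline.
# k=50 is arbitrary; for a classification target use mutual_info_classif instead.
selector = SelectKBest(score_func=mutual_info_regression, k=50)
X_train_selected = selector.fit_transform(X_train_processed, y_train)
X_test_selected = selector.transform(X_test_processed)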

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and then get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
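One small thing that helps when comparing the two outputs side by side is to ask gensim for the full distribution rather than only the topics above the default probability cutoff. A minimal sketch, reusing the lda_model, bowDoc, termTopicMatrix and fullDict objects from the question:
import numpy as np

# Full (dense) topic distribution from gensim's own inference, for comparison
dense_gensim = np.zeros(lda_model.num_topics)
for topic_id, prob in lda_model.get_document_topics(bowDoc, minimum_probability=0.0):
    dense_gensim[topic_id] = prob

# The raw matrix product, normalized to sum to 1 so both vectors are on the same scale.
# Note this alone does NOT reproduce gensim's variational inference; it only makes
# the two vectors directly comparable while you trace through the differences.
raw = np.dot(termTopicMatrix, fullDict)
print(dense_gensim)
print(raw / raw.sum())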

Performing K-fold Cross-Validation: Using Same Training Set vs. Separate Validation Set

I am using the Python scikit-learn framework to build a decision tree. I am currently splitting my training data into two separate sets, one for training and the other for validation (implemented via K-fold cross-validation).
To cross-validate my model, should I split my data into two sets as outlined above or simply use the full training set? My main objective is to prevent overfitting. I have seen conflicting answers online about the use and efficacy of both these approaches.
I understand that K-fold cross-validation is commonly used when there is not enough data for a separate validation set. I do not have this limitation. Intuitively speaking, I believe that employing K-fold cross-validation in conjunction with a separate dataset will further reduce overfitting.
Is my supposition correct? Is there a better approach I can use to validate my model?
Split Dataset Approach:
x_train, x_test, y_train, y_test = train_test_split(df[features], df["SeriousDlqin2yrs"], test_size=0.2, random_state=13)
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)
scores = cross_val_score(dt, x_test, y_test, cv=10)
Training Dataset Approach:
x_train=df[features]
y_train=df["SeriousDlqin2yrs"]
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)
scores = cross_val_score(dt, x_train, y_train, cv=10)
OK, it seems that you are confused both about validation and about what cross_val_score does. First things first: you should not use either of the above approaches. If you are not searching for hyperparameters, but just want to answer the question "How good is a DT with min_samples_split=20 on my data?", then you should do:
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
scores = cross_val_score(dt, X, y, cv=10)
Without any splitting. Why? Because cross_val_score does the splitting itself. It splits X and y into 10 parts and, 10 times, fits on the training parts and then tests on the remaining part. In other words, if you do something like
x_train=df[features]
y_train=df["SeriousDlqin2yrs"]
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train) # this line does nothing!
scores = cross_val_score(dt, x_train, y_train, cv=10)
then the fit command is useless, as cross_val_score will call fit again, 10 times. Furthermore, you do not use the test set at all! Similarly, in your other snippet you both fit and test on the test set, which is also incorrect.
However, if you are trying to fit some hyperparameter, let's say this min_samples_split, then you should (assuming that your test set is big enough to be representative):
# train and test are index arrays from an earlier split (e.g. train_test_split)
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]

scores = []
for param in [10, 20, 40]:
    dt = DecisionTreeClassifier(min_samples_split=param, random_state=99)
    scores.append((cross_val_score(dt, X_train, y_train, cv=10).mean(), param))

best_param = max(scores)[1]  # parameter with the highest mean CV score
dt = DecisionTreeClassifier(min_samples_split=best_param, random_state=99)
dt.fit(X_train, y_train)
print(np.mean(dt.predict(X_test) == y_test))  # checking accuracy on the held-out test set
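For completeness, scikit-learn's GridSearchCV wraps this select-by-CV-then-refit loop; a minimal sketch equivalent to the manual search above (the parameter grid values are just the ones used there):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=99),
    param_grid={'min_samples_split': [10, 20, 40]},
    cv=10,
)
search.fit(X_train, y_train)         # 10-fold CV for every candidate, then refit on all of X_train
print(search.best_params_)           # e.g. {'min_samples_split': 20}
print(search.score(X_test, y_test))  # accuracy of the refit best model on the held-out test set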

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process the dataset
before it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle
def plot_scores():
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:, 0], array[:, 1])
    pl.show()

def preprocess(dataset):
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, but the original data was not at all random. My question is, how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal to classify based on a dataset that looks like this:
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like
from sklearn.svm import SVC

def predict(dataset):
    X = ...  # features, extract from dataset
    y = ...  # labels, extract from dataset
    clf = SVC()  # create classifier
    clf.fit(X, y)  # train
    return clf
Though judging from the name of the function (predict), you should check whether it really wants you to return a trained classifier or to return predictions for the given dataset argument, as the latter would be more typical.
As a classifier you can use basically any one you like. Your plot suggests your dataset is linearly separable (there are no colors for the classes, but I assume the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers); a sketch follows below.
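A minimal sketch of such a decision-boundary plot, assuming a trained two-feature classifier clf (e.g. the one returned by predict() above) and an (n, 2) array X of score pairs with labels y:
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, X, y):
    # evaluate the classifier on a grid spanning the two score axes
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.5),
                         np.arange(y_min, y_max, 0.5))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)        # shaded predicted regions
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20)  # labelled points on top
    plt.xlabel('team 1 score')
    plt.ylabel('team 2 score')
    plt.show()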
I would recommend you to take a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need

def get_data():
    # open your file and parse data to prepare X as a set of input vectors and Y as a set of targets
    return X, Y

def split_data(X, Y):
    size = len(X)
    indices = list(range(size))  # list() so shuffle() can reorder it in place
    shuffle(indices)
    train_indices = indices[:size // 2]
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test

def plot_scatter(Y1, Y2):
    plt.figure()
    plt.scatter(Y1, Y2, c='b', marker='o')
    plt.show()

# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier; after that the classifier is the trained estimator you need
classifier.train(X_train, Y_train) # or .fit(X_train, Y_train) or another train routine
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, rather than creating a function that produces another function.
