Getting recall and precision instead of the score function in MultinomialNB in sklearn - multinomial

I want to get the recall and precision for my predictions, but I am confused. The "score" function of MultinomialNB in sklearn does not return a predicted list or anything else that would help me reach the precision and recall. For example:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
model.score(x_test, y_test)
Here I have a model that is going to predict, but MultinomialNB() seems to have no function to provide the precision or recall or a predicted list, etc.
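One way to get these metrics (a minimal sketch, assuming the x_test and y_test from above) is to obtain the predicted list with model.predict() and pass it to sklearn.metrics:
from sklearn.metrics import precision_score, recall_score, classification_report

y_pred = model.predict(x_test)                             # predicted labels for the test set
print(precision_score(y_test, y_pred, average='macro'))    # macro-averaged precision
print(recall_score(y_test, y_pred, average='macro'))       # macro-averaged recall
print(classification_report(y_test, y_pred))               # per-class precision, recall and F1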

Related

Curve fit does not return expected result

I need a little help with my code for curve fitting some data.
I have the following data:
'''
x_data=[0.0, 0.006702200711821348, 0.012673613376102217, 0.01805805116486128, 0.02296065262674275, 0.027460615301376282,
0.03161908492177514, 0.03548425629114566, 0.03909479074665314, 0.06168416627459879, 0.06395092768264225,
0.0952415360565632, 0.0964823380829502, 0.11590819258911032, 0.11676250975220677, 0.18973251809768016,
0.1899603458289615, 0.2585011532435637, 0.2586068948029052, 0.40046782450999047, 0.40067753715444315]
y_data=[0.005278154532534359, 0.004670803439961002, 0.004188802888597246, 0.003796976494876385, 0.003472183813732432,
0.0031985782141146, 0.002964943046115825, 0.0027631157936632137, 0.0025870148284089897, 0.001713418196416643,
0.0016440241050665323, 0.0009291243501697267, 0.0009083385934116964, 0.0006374601714823219, 0.0006276132323039056,
0.00016900738921547616, 0.00016834735819595378, 7.829234957755694e-05, 7.828353274888779e-05, 0.00015519569743801753,
0.00015533437619227267]
'''
I know that the data can be fitted using the following mathematical model:
'''
def model(x, a, b, c):
    return (a*b)/(b*x + 1) + 3*c*x**2
'''
I am trying to obtain the a, b, c coefficients of the calibrated model, so that I obtain the following result (the calibrated model in red, the data sample in blue; figure not included):
My code to achieve the result shown in that picture is:
'''
import numpy as np
from scipy.optimize import curve_fit
popt, _pcov = curve_fit(model, x_data, y_data, maxfev=100000)
x_sample=np.linspace(0,0.5,1000)
y_sample=model(x_sample,*popt)
'''
If I plot the predicted data based on the fitted coefficients (in green) I get this result (figure not included):
For some reason I get coefficients that produce a result I know is wrong. Does anyone know how to solve this issue?
Your model y = (a*b)/(b*x + 1) + 3*c*x**2 does not appear very satisfying. Instead of the hyperbolic term, an exponential term seems better suited to the shape of the data. That is why the proposed model is:
y=A * exp(B * x) + C * x**2
The method to compute approximate values of the parameters A, B, C, and the details of the numerical calculus, were shown in figures that are not included here.
Note:
The parabolic term appears under-represented. This is because there are not enough points at large x compared to the many points at small x.
The method used above is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The method isn't iterative and doesn't need initial "guessed" values. The accuracy is not good in the case of few points, due to the numerical integration (the computation of the Sk).
If necessary, this can be improved by post-processing with non-linear regression, starting from the above approximate values of the parameters.
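For example, here is a minimal sketch (not the integral method from the linked reference) of such a non-linear fit of the proposed model y = A * exp(B * x) + C * x**2 with scipy.optimize.curve_fit on the question's data; the starting values in p0 are rough guesses:
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, A, B, C):
    # proposed model: a decaying exponential plus a parabolic term
    return A * np.exp(B * x) + C * x**2

x = np.array(x_data)
y = np.array(y_data)

# rough starting guesses: A near y(0), B negative (decay), C small
p0 = (y[0], -10.0, 0.001)
popt, pcov = curve_fit(exp_model, x, y, p0=p0, maxfev=100000)
print("A, B, C =", popt)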
An even better model is made of two exponentials.

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, to get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
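For illustration, here is a minimal numpy sketch of the per-document variational update that gensim's inference() performs, assuming access to lda_model.alpha and lda_model.expElogbeta; it omits the convergence check and other details of the real implementation, so treat it as a guide rather than an exact reproduction:
import numpy as np
from scipy.special import psi  # digamma function

def infer_topics(bow, alpha, expElogbeta, iterations=50, eps=1e-100):
    ids = [idx for idx, _ in bow]                          # term ids present in the document
    cts = np.array([cnt for _, cnt in bow], dtype=float)   # their counts
    num_topics = expElogbeta.shape[0]
    # random initialisation of the variational parameter gamma
    gammad = np.random.gamma(100., 1. / 100., num_topics)
    expElogthetad = np.exp(psi(gammad) - psi(gammad.sum()))
    expElogbetad = expElogbeta[:, ids]
    phinorm = expElogthetad.dot(expElogbetad) + eps
    for _ in range(iterations):
        gammad = alpha + expElogthetad * (cts / phinorm).dot(expElogbetad.T)
        expElogthetad = np.exp(psi(gammad) - psi(gammad.sum()))
        phinorm = expElogthetad.dot(expElogbetad) + eps
    return gammad / gammad.sum()                           # normalised topic distribution

# e.g. theta = infer_topics(bowDoc, lda_model.alpha, lda_model.expElogbeta)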

How training and test data is split - Keras on Tensorflow

I am currently training my data using a neural network and the fit function.
history=model.fit(X, encoded_Y, batch_size=50, nb_epoch=500, validation_split = 0.2, verbose=1)
Now I have set validation_split to 20%. What I understood is that my training data will be 80% and my testing data will be 20%. I am confused about how this data is handled on the back end. Is it that the top 80% of samples are taken for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to give separate training and testing data, how will I do that using fit()?
Moreover, my second concern is how to check whether the data is fitting well on the model. I can see from the results that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean it is a case of over-fitting or under-fitting?
My last question is: what does evaluate return? The documentation says it returns the loss, but I am already getting the loss and accuracy during each epoch (as a return of fit(), in history). What do the accuracy and score returned by evaluate show? If the accuracy returned by evaluate is 90%, can I say my data is fitting well, regardless of what the individual accuracy and loss were for each epoch?
Below is my Code:
import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools
seed = 7
numpy.random.seed(seed)
dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi class
    return model
model = create_baseline()
history=model.fit(X, encoded_Y, batch_size=50, nb_epoch=500, validation_split = 0.2, verbose=1)
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
pre_cls=model.predict_classes(X)
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)
score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)
The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means that the shuffle occurs after the split. There is also a boolean parameter called "shuffle", which is set to True by default, so if you don't want your data to be shuffled you could just set it to False.
Getting good results on your training data and then getting bad or not-so-good results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns a very specific scenario and can't achieve good results on new data.
Evaluation means testing your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want to create a third group of data, because if you keep adjusting your model to obtain better and better results on your test data, that is in some way like cheating: you are telling your model what the evaluation data looks like, and this can cause overfitting.
Also, if you want to split your data without using Keras, I recommend using sklearn's train_test_split() function. It's easy to use and it looks like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
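A minimal sketch, assuming the X, encoded_Y and model from the question, of passing your own held-out split to fit() via the validation_data argument and then scoring it with evaluate():
from sklearn.model_selection import train_test_split

# hold out 20% of the data yourself instead of relying on validation_split
X_train, X_test, y_train, y_test = train_test_split(X, encoded_Y, test_size=0.2, random_state=7)

# pass the held-out data explicitly; Keras reports val_loss/val_acc on it each epoch
history = model.fit(X_train, y_train, batch_size=50, nb_epoch=500,
                    validation_data=(X_test, y_test), verbose=1)

# evaluate() returns the loss plus any metrics given at compile time (here: accuracy)
score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score, 'Test accuracy:', acc)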

GenSim Word2Vec unexpectedly pruning

My objective is to find a vector representation of phrases. Below is the code I have, which works partially for bigrams using the Word2Vec model provided by the GenSim library.
from gensim.models import word2vec
def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return None
The problem is that the Word2Vec model is seemingly doing automatic pruning of some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I've tried adjusting various parameters such as trim_rule, min_count, but they don't seem to affect the pruning.
PS - I am aware that bigrams to look up need to be represented using underscore instead of space, i.e. proper way to call my function would be bigram2vec(unigrams, 'this_report')
Thanks to further clarification at the GenSim support forum, the solution is to set the appropriate min_count and threshold values for the Phrases being generated (see documentation for details about these parameters in the Phrases class). The corrected solution code is below.
from gensim.models import word2vec, Phrases
def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams, min_count=1, threshold=0.1)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return []
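A small usage sketch of the corrected function (the toy corpus and the bigram are made up for illustration; real input would be a list of tokenized sentences):
# toy corpus: a list of tokenized sentences
unigrams = [
    ['this', 'report', 'is', 'ready'],
    ['this', 'report', 'needs', 'review'],
    ['another', 'report', 'is', 'late'],
]

vec = bigram2vec(unigrams, 'this_report')   # bigram written with an underscore, as noted above
print(vec)   # a 20-dimensional vector, or [] if the bigram was not kept in the vocabulary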

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process dataset
before it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle
def plot_scores():
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:,0], array[:,1])
    pl.show()

def preprocess(dataset):
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, but the original data was not at all random. My question is, how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal to classify based on a dataset that looks like this:
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like
from sklearn.svm import SVC
def predict(dataset):
    X = ...  # features, extract from dataset
    y = ...  # labels, extract from dataset
    clf = SVC()  # create classifier
    clf.fit(X, y)  # train
    return clf
Though judging from the name of the function (predict) you should check if it really wants you to return a trained classifier or return predictions for the given dataset argument, as that would be more typical.
As a classifier you can use basically any one you like. Your plot looks like your dataset is linearly separable (there are no colors for the classes, but I assume the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries; see here (which also contains an overview of the available classifiers).
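A minimal sketch of plotting a decision boundary for 2-D score data, assuming X is an (n, 2) numpy array of the two teams' scores and y the sport labels (hypothetical names, not taken from the question's code):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# X: (n, 2) array of score pairs, y: integer class labels (the sports)
clf = LogisticRegression().fit(X, y)

# evaluate the classifier on a grid covering the data range
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)          # shaded decision regions
plt.scatter(X[:, 0], X[:, 1], c=y)          # the labeled score pairs
plt.xlabel('team 1 score')
plt.ylabel('team 2 score')
plt.show()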
I would recommend taking a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need
def get_data():
    # open your file and parse data to prepare X as a set of input vectors and Y as a set of targets
    return X, Y

def split_data(X, Y):
    size = len(X)
    indices = list(range(size))
    shuffle(indices)
    train_indices = indices[:size // 2]
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test

def plot_scatter(Y1, Y2):
    plt.figure()
    plt.plot(Y1, Y2, 'bo')  # blue circles: predicted vs. true values
    plt.show()
# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier, after that the classifier is the trained estimator you need
classifier.train(X_train, Y_train) # or .fit(X_train, Y_train) or another train routine
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, instead of creating a function that produces another function.
