Scikit-learn algorithms performing extremely poorly

I'm new to scikit-learn and I'm banging my head against the wall. I've used both real-world and test data, and the scikit-learn algorithms are not performing above chance level in predicting anything. I've tried kNN, decision trees, SVC and naive Bayes.
Basically, I made a test data set consisting of a column of 0s and 1s, with all the 0s having a feature value between 0 and .5 and all the 1s having a feature value between .5 and 1. This should be extremely easy and give near 100% accuracy. However, none of the algorithms perform above chance level; accuracies range from 45 to 55%. I've already tried tweaking a whole bunch of parameters for every algorithm, but nothing helps. I think something is fundamentally wrong with my implementation.
Please help me out. Here's my code:
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
import sklearn
import pandas
import numpy as np
df=pandas.read_excel('Test.xlsx')
# Make data into np arrays
y = np.array(df[1])
y=y.astype(float)
y=y.reshape(399)
x = np.array(df[2])
x=x.astype(float)
x=x.reshape(399, 1)
# Creating training and test data
labels_train, labels_test = train_test_split(y)
features_train, features_test = train_test_split(x)
#####################################################################
# PERCEPTRON
#####################################################################
from sklearn import linear_model
perceptron=linear_model.Perceptron()
perceptron.fit(features_train, labels_train)
perc_pred=perceptron.predict(features_test)
print sklearn.metrics.accuracy_score(labels_test, perc_pred, normalize=True, sample_weight=None)
print 'perceptron'
#####################################################################
# KNN classifier
#####################################################################
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(features_train, labels_train)
knn_pred = knn.predict(features_test)
# Accuracy
print sklearn.metrics.accuracy_score(labels_test, knn_pred, normalize=True, sample_weight=None)
print 'knn'
#####################################################################
## SVC
#####################################################################
from sklearn.svm import SVC
from sklearn import svm
svm2 = SVC(kernel="linear")
svm2 = svm.SVC()
svm2.fit(features_train, labels_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=1.0, kernel='linear', max_iter=-1, probability=False,
random_state=None,
shrinking=True, tol=0.001, verbose=False)
svc_pred = svm2.predict(features_test)
print sklearn.metrics.accuracy_score(labels_test, svc_pred, normalize=True,
sample_weight=None)
#####################################################################
# Decision tree
#####################################################################
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
tree_pred=clf.predict(features_test)
# Accuracy
print sklearn.metrics.accuracy_score(labels_test, tree_pred, normalize=True,
sample_weight=None)
print 'tree'
#####################################################################
# Naive bayes
#####################################################################
import sklearn
from sklearn.naive_bayes import GaussianNB
from time import time
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"
bayes_pred = clf.predict(features_test)
print sklearn.metrics.accuracy_score(labels_test, bayes_pred,
normalize=True, sample_weight=None)

You seem to be using train_test_split the wrong way.
labels_train, labels_test = train_test_split(y) #WRONG
features_train, features_test = train_test_split(x) #WRONG
Each call shuffles and splits independently, so the split of your labels and the split of your data aren't necessarily the same: after these two lines a label no longer corresponds to its feature row. One easy way to split your data manually:
randomvec=np.random.rand(len(data))
randomvec=randomvec>0.5
train_data=data[randomvec]
train_label=labels[randomvec]
test_data=data[np.logical_not(randomvec)]
test_label=labels[np.logical_not(randomvec)]
or to use the scikit method properly:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=42)
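Putting it together, a minimal sketch of the corrected flow (reusing the x and y arrays from the question; with features this cleanly separated, the classifiers should now score close to 100% instead of at chance level):
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# split features and labels together so every label stays paired with its feature row
features_train, features_test, labels_train, labels_test = train_test_split(
    x, y, test_size=0.5, random_state=42)
knn = KNeighborsClassifier()
knn.fit(features_train, labels_train)
print(accuracy_score(labels_test, knn.predict(features_test)))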

Related

Forecasting validation loss fluctuation

I have a question for those who have some experience with time-series forecasting.
I have been experimenting with this field for a few weeks, trying to forecast some time series with both ARIMA and LSTM models to compare the results.
Basically, I plotted the graph in Figure 1, which has 4 panels:
Top left: ARIMA training data points and fitted model points.
Top right: ARIMA test and forecast points.
Bottom left: LSTM training data and fitted data (I could not really find fitted points for the LSTM, so I just forecasted the training data, but you can ignore that part).
Bottom right: Test and forecast data for the LSTM model.
This graph was acceptable, and I also computed the RMSE and MSE; the LSTM gave a lower error, which agrees with most of the literature online stating the superiority of LSTM over ARIMA models.
However, after I plotted the loss and validation loss of the LSTM model to get more insight, I noticed that the validation loss follows a weird fluctuating pattern (Figure 2).
I can explain this as follows: the time series has a lot of outliers or abnormal behaviour, so splitting it into train/validation/test means the validation set cannot really be a good metric of how well the model can learn.
But since research papers never show this graph or explain this problem, I don't have a solid argument to defend this idea.
What do you think?
Thank you in advance.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error
from statsmodels.tsa.seasonal import STL
import numpy as np
from pandas import Series, DataFrame
from scipy import stats
import statsmodels
from statsmodels.tsa.seasonal import seasonal_decompose
from pandas.plotting import register_matplotlib_converters
import pmdarima as pm
register_matplotlib_converters()
import warnings
import time
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional
import keras_tuner as kt
import tensorflow as tf
print(tf.__version__)
from tensorflow import keras
from tensorflow.keras import initializers
from sklearn.preprocessing import MinMaxScaler
from keras.preprocessing.sequence import TimeseriesGenerator
import random as rn
np.random.seed(123)
rn.seed(123)
tf.random.set_seed(123)
tf.keras.utils.set_random_seed(123)
keras.utils.set_random_seed(123)
warnings.filterwarnings('ignore')
df3 = pd.read_csv('favorita_train.csv')
## 1 - Get TS and do STL
print("TS lenbgth : "+str(len(df3)))
results = seasonal_decompose(df3['unit_sales'],period=30)
results.plot();
train_all = df3.iloc[:int(len(df3)*0.8)]
train = df3.iloc[:int(len(df3)*0.6)]
val = df3.iloc[int(len(df3)*0.6):int(len(df3)*0.8)]
test = df3.iloc[int(len(df3)*0.8):]
scaler = MinMaxScaler()
scaler.fit(train_all)
scaled_all = scaler.transform(df3)
scaled_train = scaler.transform(train)
scaled_train_all = scaler.transform(train_all)
scaled_val = scaler.transform(val)
scaled_test = scaler.transform(test)
# We do the same thing, but now instead for 12 months
n_features = 1
n_input =5
train_generator_all = TimeseriesGenerator(scaled_train_all, scaled_train_all, length=n_input, batch_size=1,shuffle=True)
train_generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1,shuffle=True)
val_generator = TimeseriesGenerator(scaled_val, scaled_val, length=n_input, batch_size=1,shuffle=True)
adfPValue = adfuller(scaled_all)
adfPValue=adfPValue[1]
adi = len(scaled_all)/((scaled_all != 0).sum())
sd=scaled_all.std()
mean=scaled_all.mean()
cv2 = np.square(sd/mean)
print("CV2 (describe magnitude of demande variability <0.5 is good) :"+str(cv2))
print("SD (-2,2 is good | mean data variance is low) :"+str(sd))
print("ADI (1.3 or smaller means smooth ts) :"+str(adi))
print("Stationarity test (stationary if <0.05) :"+str(adfPValue))
def model_builder(hp):
    model = keras.Sequential()
    hp_units = hp.Int('units', min_value=1, max_value=50, step=1)
    hp_layers = hp.Int('layers', min_value=1, max_value=3, step=1)
    if hp_layers == 1:
        model.add(Bidirectional(LSTM(hp_units, activation='relu'), input_shape=(n_input, n_features)))
    elif hp_layers == 2:
        model.add(Bidirectional(LSTM(hp_units, activation='relu', return_sequences=True), input_shape=(n_input, n_features)))
        model.add(Bidirectional(LSTM(hp_units, activation='relu')))
    else:
        model.add(Bidirectional(LSTM(hp_units, activation='relu', return_sequences=True), input_shape=(n_input, n_features)))
        for i in range(hp_layers-2):
            model.add(Bidirectional(LSTM(hp_units, activation='relu', return_sequences=True)))
        model.add(Bidirectional(LSTM(hp_units, activation='relu')))
    model.add(Dense(1))
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate), loss='mse', metrics=['accuracy'])
    return model
tuner = kt.Hyperband(model_builder,
                     objective='val_loss',
                     max_epochs=300,
                     factor=3,
                     directory='499',
                     project_name='949',
                     seed=123)
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=30)
tuner.search(train_generator, epochs=300, validation_data=val_generator, shuffle=True, callbacks=[stop_early], batch_size=len(train_generator))
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hps.get('units'))
print(best_hps.get('layers'))
print(best_hps.get('learning_rate'))
model = tuner.hypermodel.build(best_hps)
# refit with the best hyperparameters on the same generators used for the search
history = model.fit(train_generator, epochs=50, validation_data=val_generator)
val_acc_per_epoch = history.history['val_accuracy']
best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
print('Best epoch: %d' % (best_epoch,))
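To look more closely at the fluctuation itself, here is a minimal sketch (my own addition, assuming the history object returned by model.fit above; the 5-epoch rolling window is an arbitrary choice) that overlays the raw validation loss with a smoothed version:
# plot raw training/validation loss plus a rolling mean of the validation loss
# to separate the overall trend from the epoch-to-epoch fluctuation
loss = pd.Series(history.history['loss'])
val_loss = pd.Series(history.history['val_loss'])
plt.plot(loss, label='training loss')
plt.plot(val_loss, alpha=0.4, label='validation loss (raw)')
plt.plot(val_loss.rolling(window=5, min_periods=1).mean(), label='validation loss (5-epoch rolling mean)')
plt.xlabel('epoch')
plt.ylabel('MSE loss')
plt.legend()
plt.show()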

sparse matrix use in pycaret for nlp

First of all, thank you for allowing me to use this wonderful library.
I am doing Korean NLP. I pre-processed the Korean text and converted it with a TfidfVectorizer.
I want to put it into setup() and use it, but I get errors.
Can't I use a sparse matrix with PyCaret?
If not, is there any way to do Korean NLP?
X_train = train_data.Text.tolist()
Y_train =train_data['Label'].values
X_test = test_data.Text.tolist()
Y_test =test_data['Label'].values
from soynlp.word import WordExtractor
word_extractor = WordExtractor(min_frequency=100,
                               min_cohesion_forward=0.05,
                               min_right_branching_entropy=0.0)
word_extractor.train(X_train) # list of str or like
words = word_extractor.extract()
scores = word_extractor.word_scores()
import math
score_dict = {key: scores[key].cohesion_forward *
              math.exp(scores[key].right_branching_entropy)
              for key in scores}
from soynlp.tokenizer import LTokenizer
cohesion_score = {word:score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=score_dict)
import os
from scipy.sparse import save_npz, load_npz
from sklearn.feature_extraction.text import TfidfVectorizer
# if not os.path.isfile('soy_train.npz'):
tfidf = TfidfVectorizer(ngram_range=(1, 2),
                        min_df=3,
                        tokenizer=tokenizer.tokenize,
                        token_pattern=None)
tfidf.fit(X_train)
X_train_soy = tfidf.transform(X_train)
X_test_soy = tfidf.transform(X_test)
save_npz('soy_train.npz', X_train_soy)
save_npz('soy_test.npz', X_test_soy)
type(X_train_soy[0])
import numpy as np
train = pd.DataFrame(X_train_soy)
y_train = np.array(Y_train,dtype=float)
train['Label'] = y_train
from pycaret.classification import *
import numpy as np
exp1 = setup(train,train_size=0.8, target = 'Label',use_gpu = True)
TypeError: Cannot compare types 'ndarray(dtype=object)' and 'float'
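For reference, a hedged sketch (an assumption on my part, not a verified PyCaret fix) of expanding the TF-IDF matrix into a regular DataFrame with one numeric column per term, instead of wrapping the whole sparse matrix in a single object column, which might avoid the dtype comparison error:
import pandas as pd
# assumes pandas >= 0.25 for sparse.from_spmatrix and scikit-learn >= 1.0 for
# get_feature_names_out (use get_feature_names on older versions)
train_dense = pd.DataFrame.sparse.from_spmatrix(X_train_soy,
                                                columns=tfidf.get_feature_names_out())
train_dense['Label'] = np.array(Y_train, dtype=float)
exp1 = setup(train_dense, train_size=0.8, target='Label', use_gpu=True)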

Not able to fit a function with scipy.optimize.curve_fit()

I would like to fit the following function:
def invlaplace_stehfest2(time,EL,tau):
    upsilon=0.25
    pmax=6.9
    E0=0.0154
    M=8
    results=[]
    for t in time:
        func=0
        for k in range(1,2*M+1):
            SUM=0
            for j in range(int(math.floor((k+1)/2)),min(k,M)+1):
                dummy=j**(M+1)*scipy.special.binom(M,j)*scipy.special.binom(2*j,j)*scipy.special.binom(j,k-j)/scipy.math.factorial(M)
                SUM+=dummy
            s=k*math.log(2)/t
            func+=(-1)**(M+k)*SUM*pmax*EL/(mp.exp(tau*s)*mp.expint(1,tau*s)*E0+EL)/s
        func=func*math.log(2)/t
        results.append(func)
    return [float(i) for i in results]
To do so I use the following data:
data_time=np.array([69.0,99.0,139.0,179.0,219.0,259.0,295.5,299.03])
data_relax=np.array([6.2536,6.1652,6.0844,6.0253,5.9782,5.9404,5.9104,5.9066])
With the following guess:
guess=np.array([0.1,0.05])
And scipy.optimize.curve_fit() as follows:
Parameter,Covariance=scipy.optimize.curve_fit(invlaplace_stehfest2,data_time,data_relax,guess)
For a reason that I don't understand, I am not able to fit the data correctly. I get the following graph:
Bad fitting
My function is undoubtedly correct since when I use the correct guess:
guess=np.array([0.33226685047281592707364253044085038793404361200072,8.6682623502960394383501102909774397295654841654769])
I am able to fit my data correctly:
Expected fitting
Any hint on why I am not able to fit correctly? Should I use another method?
Here is the whole program:
##############################################
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.pylab as mplab
import math
from math import *
import numpy as np
import scipy
from scipy.optimize import curve_fit
import mpmath as mp
############################################################################################
def invlaplace_stehfest2(time,EL,tau):
    upsilon=0.25
    pmax=6.9
    E0=0.0154
    M=8
    results=[]
    for t in time:
        func=0
        for k in range(1,2*M+1):
            SUM=0
            for j in range(int(math.floor((k+1)/2)),min(k,M)+1):
                dummy=j**(M+1)*scipy.special.binom(M,j)*scipy.special.binom(2*j,j)*scipy.special.binom(j,k-j)/scipy.math.factorial(M)
                SUM+=dummy
            s=k*math.log(2)/t
            func+=(-1)**(M+k)*SUM*pmax*EL/(mp.exp(tau*s)*mp.expint(1,tau*s)*E0+EL)/s
        func=func*math.log(2)/t
        results.append(func)
    return [float(i) for i in results]
############################################################################################
###Constant###
####Value####
data_time=np.array([69.0,99.0,139.0,179.0,219.0,259.0,295.5,299.03])
data_relax=np.array([6.2536,6.1652,6.0844,6.0253,5.9782,5.9404,5.9104,5.9066])
###Fitting###
guess=np.array([0.33226685047281592707364253044085038793404361200072,8.6682623502960394383501102909774397295654841654769])
#guess=np.array([0.1,0.05])
Parameter,Covariance=scipy.optimize.curve_fit(invlaplace_stehfest2,data_time,data_relax,guess)
print Parameter
residu=sum(data_relax-invlaplace_stehfest2(data_time,Parameter[0],Parameter[1]))
Graph_Curves=plt.figure()
ax = Graph_Curves.add_subplot(111)
ax.plot(data_time,invlaplace_stehfest2(data_time,Parameter[0],Parameter[1]),"-")
ax.plot(data_time,data_relax,"o")
plt.show()
Non-linear fitters such as the default Levenberg-Marquardt solver used in scipy.optimize.curve_fit(), like most iterative solvers, can stop in a local minimum in error space. If error space is "bumpy" then initial parameter estimates become very important, as in this case.
Scipy has added the Differential Evolution genetic algorithm to the optimize module, which can be used to determine initial parameter estimates for curve_fit(). Scipy's implementation uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, requiring parameter upper and lower bounds within which to search. As you can see below, I have used this scipy module to replace the hard-coded values for the value named "guess" in your code. This does not run quickly, but a somewhat slower correct result is much better than a fast wrong result. Try this code:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib.pylab as mplab
import math
from math import *
import numpy as np
import scipy
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import mpmath as mp
############################################################################################
def invlaplace_stehfest2(time,EL,tau):
    upsilon=0.25
    pmax=6.9
    E0=0.0154
    M=8
    results=[]
    for t in time:
        func=0
        for k in range(1,2*M+1):
            SUM=0
            for j in range(int(math.floor((k+1)/2)),min(k,M)+1):
                dummy=j**(M+1)*scipy.special.binom(M,j)*scipy.special.binom(2*j,j)*scipy.special.binom(j,k-j)/scipy.math.factorial(M)
                SUM+=dummy
            s=k*math.log(2)/t
            func+=(-1)**(M+k)*SUM*pmax*EL/(mp.exp(tau*s)*mp.expint(1,tau*s)*E0+EL)/s
        func=func*math.log(2)/t
        results.append(func)
    return [float(i) for i in results]
############################################################################################
###Constant###
####Value####
data_time=np.array([69.0,99.0,139.0,179.0,219.0,259.0,295.5,299.03])
data_relax=np.array([6.2536,6.1652,6.0844,6.0253,5.9782,5.9404,5.9104,5.9066])
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    return np.sum((data_relax - invlaplace_stehfest2(data_time, *parameterTuple)) ** 2)
###Fitting###
#guess=np.array([0.33226685047281592707364253044085038793404361200072,8.6682623502960394383501102909774397295654841654769])
#guess=np.array([0.1,0.05])
parameterBounds = [[0.0, 1.0], [0.0, 10.0]]
# "seed" the numpy random number generator for repeatable results
# note the ".x" here to return only the parameter estimates in this example
guess = differential_evolution(sumOfSquaredError, parameterBounds, seed=3).x
Parameter,Covariance=scipy.optimize.curve_fit(invlaplace_stehfest2,data_time,data_relax,guess)
print Parameter
residu=sum(data_relax-invlaplace_stehfest2(data_time,Parameter[0],Parameter[1]))
Graph_Curves=plt.figure()
ax = Graph_Curves.add_subplot(111)
ax.plot(data_time,invlaplace_stehfest2(data_time,Parameter[0],Parameter[1]),"-")
ax.plot(data_time,data_relax,"o")
plt.show()

Scikit-learn GradientBoostingClassifier random_state not working

So I was messing around with different classifiers in sklearn, and found that regardless of the value of the random_state parameter, GradientBoostingClassifier always returns the same values. For example, when I run the following code:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
scores = []
for i in range(10):
    clf = GradientBoostingClassifier(random_state=i).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores = np.append(scores, score)
print scores
the output is:
[ 0.66666667 0.66666667 0.66666667 0.66666667 0.66666667 0.66666667
0.66666667 0.66666667 0.66666667 0.66666667]
However, when I run the same thing with another classifier, such as RandomForest:
from sklearn.ensemble import RandomForestClassifier
scores = []
for i in range(10):
    clf = RandomForestClassifier(random_state=i).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores = np.append(scores, score)
print scores
The output is what you would expect, i.e. with slight variability:
[ 0.6 0.56666667 0.63333333 0.76666667 0.6 0.63333333
0.66666667 0.56666667 0.66666667 0.53333333]
What could be causing GradientBoostingClassifier to ignore the random state? I checked the classifier info but everything seems normal:
print clf
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, presort='auto', random_state=9,
subsample=1.0, verbose=0, warm_start=False)
I tried messing around with warm_start and presort but it didn't change anything. Any ideas? I've been trying to figure this out for almost an hour so I figured I'd ask here. Thank you for your time!
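For illustration, a hedged sketch of a follow-up experiment (an assumption, not a confirmed explanation: with the default subsample=1.0 and max_features=None the fit may simply have no stochastic component, so enabling row subsampling should reveal whether random_state has any effect at all):
# same loop as above, but with row subsampling enabled so each seed draws
# different subsamples; if the scores now vary, the earlier runs were simply
# deterministic rather than ignoring random_state
scores = []
for i in range(10):
    clf = GradientBoostingClassifier(subsample=0.5, random_state=i).fit(X_train, y_train)
    scores = np.append(scores, clf.score(X_test, y_test))
print scores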

Use Gensim or other python LDA packages to use trained LDA model from Mallet

I have an LDA model trained through Mallet in Java. Three files are generated from the Mallet LDA model, which allow me to run the model from files and infer the topic distribution of a new text.
Now I would like to implement a Python tool which is able to infer a topic distribution for a new text, based on the trained LDA model. I do not want to re-train the LDA model in Python. Therefore, I wonder if it is possible to load the trained Mallet LDA model into Gensim or any other Python LDA package. If so, how can I do it?
Thanks for any answers or comments.
In short, yes you can! That is what is nice about using Mallet: once it is run, you don't have to go through and relabel topics. I'm doing something very similar, so I'll post my code below with a few helpful links. Once your model is trained, save the notebook widget state and you'll be free to run your model on new and different data sets with the same topic allocation. This code includes a test and validation set. Make sure you've downloaded Mallet and Java, then try this:
# future bridges python 2 and 3
from __future__ import print_function
# pandas works with data structures, data manipulation, and analysis specifically for numerical tables, and series like
# the csv we are using here today
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
# Gensim unsupervised topic modeling, natural language processing, statistical machine learning
import gensim
# convert a document to a list of tokens
from gensim.utils import simple_preprocess
# remove stopwords - words that are not telling: "it" "I" "the" "and" etc.
from gensim.parsing.preprocessing import STOPWORDS
# corpus iterator
from gensim import corpora, models
# nltk - Natural Language Toolkit
# lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed
# into present.
# stemmed — words are reduced to their root form.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
# NumPy - multidimensional arrays, matrices, and high-level mathematical formulas
import numpy as np
np.random.seed(2018)
import os
from gensim.models.wrappers import LdaMallet
from pathlib import Path
import codecs
import logging
import re
import numpy as np
import pandas as pd
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)
data = pd.read_csv('YourData.csv', encoding = "ISO-8859-1");
data_text = data[['Preprocessed Document or your comments column title']]
data_text['index'] = data_text.index
documents = data_text
# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifuly, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
# parse docs into individual words, ignoring words that are less than 3 letters long
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens to a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        newStopWords = ['yourStopWord1', 'yourStopWord2']
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            nltk.bigrams(token)
            result.append(lemmatize_stemming(token))
    return result
# gensim.parsing.preprocessing.STOPWORDS
# look at a random row 4310 and see if things worked out
# note that the document created was already preprocessed
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
# let’s look at ten rows passed through the lemmatize stemming and preprocess
documents = documents.dropna(subset=['Preprocessed Document'])
processed_docs = documents['Preprocessed Document'].map(preprocess)
processed_docs[:10]
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_valid = gensim.corpora.Dictionary(processed_docs[20000:])
count = 0
for k, v in dictionary_valid.iteritems():
    print(k, v)
    count += 1
    if count > 30:
        break
# we create a dictionary of all the words in the csv by iterating through
# contains the number of times a word appears in the training set.
dictionary_test = gensim.corpora.Dictionary(processed_docs[:20000])
count = 0
for k, v in dictionary_test.iteritems():
    print(k, v)
    count += 1
    if count > 30:
        break
# we want to throw out words that are so frequent that they tell us little about the topic,
# as well as words that are too infrequent (appearing in fewer than 15 documents), then keep just the 100,000 most common words
dictionary_valid.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# we want to throw out words that are so frequent that they tell us little about the topic,
# as well as words that are too infrequent (appearing in fewer than 15 documents), then keep just the 100,000 most common words
dictionary_test.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words word indexed 2 shows up once
# preview the bag of words
bow_corpus_valid = [dictionary_valid.doc2bow(doc) for doc in processed_docs]
bow_corpus_valid[4310]
# the words become numbers and are then counted for frequency
# consider a random row 4310 - it has 8 words word indexed 2 shows up once
# preview the bag of words
bow_corpus_test = [dictionary_test.doc2bow(doc) for doc in processed_docs]
bow_corpus_test[4310]
# same thing in more words
bow_doc_4310 = bow_corpus_test[4310]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0],
                                                     dictionary_test[bow_doc_4310[i][0]],
                                                     bow_doc_4310[i][1]))
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_test = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_test, num_topics=20, id2word=dictionary_test)
result = (ldamallet_test.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
    print(each)
mallet_path = 'C:/mallet/mallet-2.0.8/bin/mallet.bat'
ldamallet_valid = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus_valid, num_topics=20, id2word=dictionary_valid)
result = (ldamallet_valid.show_topics(num_topics=20, num_words=10,formatted=False))
for each in result:
    print(each)
# Show Topics
for idx, topic in ldamallet_test.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
# Show Topics
for idx, topic in ldamallet_valid.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
# check out the topics - 30 words - 20 topics
ldamallet_valid.print_topics(idx, 30)
# check out the topics - 30 words - 20 topics
ldamallet_test.print_topics(idx, 30)
# Compute Coherence Score
coherence_model_ldamallet_valid = CoherenceModel(model=ldamallet_valid, texts=processed_docs, dictionary=dictionary_valid, coherence='c_v')
coherence_ldamallet_valid = coherence_model_ldamallet_valid.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_valid)
# Compute Coherence Score
coherence_model_ldamallet_test = CoherenceModel(model=ldamallet_test, texts=processed_docs, dictionary=dictionary_test, coherence='c_v')
coherence_ldamallet_test = coherence_model_ldamallet_test.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet_test)
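And for the original question of scoring a brand-new text with an already trained model, a minimal sketch (assuming one of the wrappers trained above, e.g. ldamallet_test with its matching dictionary_test; the example string is made up):
# preprocess the new document the same way as the training data, map it through
# the same dictionary, and index the trained wrapper with the bag-of-words to
# get its topic distribution as (topic_id, probability) pairs
new_text = "your new document text goes here"
new_bow = dictionary_test.doc2bow(preprocess(new_text))
print(ldamallet_test[new_bow])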
Look at section 16 of this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
This helped: https://rare-technologies.com/tutorial-on-mallet-in-python/
and this: https://radimrehurek.com/gensim/models/wrappers/ldamallet.html
I hope this helps and good luck :)

Resources