Accuracy hit in classification with naive Bayes MLlib - Hadoop

I had been using Mahout 0.9's Naive Bayes algorithm to classify document data. For a specific train (2/3 of the data) and test (1/3 of the data) split, I was getting accuracy in the range of 86%. When I switched to Spark's MLlib, the accuracy dropped to 82%. In both cases I used the StandardAnalyzer.
MLlib link: https://spark.apache.org/docs/latest/mllib-naive-bayes.html
Mahout link: http://mahout.apache.org/users/classification/bayesian.html
Please help me in this regard, as I have to use Spark in a production system very soon and this is a blocker for me.
I also found that MLlib takes more time to classify the data than Mahout did.
And can anyone help me increase the accuracy of MLlib's Naive Bayes?
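For reference, here is a minimal PySpark sketch of the MLlib (RDD-based) side of such a pipeline. It assumes the documents sit in a "label<TAB>text" file on HDFS; the path, the hashing dimension, and the smoothing value are placeholders rather than the asker's actual setup.

# Sketch: train and evaluate MLlib's Naive Bayes on TF-IDF vectors (PySpark).
# The input path and feature dimension are assumptions for illustration only.
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="nb-sentiment")

raw = sc.textFile("hdfs:///user/root/docs.tsv")              # placeholder path
labels = raw.map(lambda line: float(line.split("\t", 1)[0]))
tokens = raw.map(lambda line: line.split("\t", 1)[1].lower().split())

tf = HashingTF(numFeatures=1 << 18).transform(tokens)         # term frequencies
tf.cache()
tfidf = IDF(minDocFreq=4).fit(tf).transform(tf)               # down-weight common terms

# Pair each label back with its TF-IDF vector (both RDDs derive from the same parent).
data = labels.zip(tfidf).map(lambda lv: LabeledPoint(lv[0], lv[1]))
train, test = data.randomSplit([2.0, 1.0], seed=42)           # 2/3 train, 1/3 test

model = NaiveBayes.train(train, lambda_=1.0)                   # additive smoothing

predictions = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = predictions.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
print("test accuracy: %.4f" % accuracy)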

Related

Is a train-test split required to evaluate all metrics?

I have a hybrid recommender system. I use precision to evaluate its results. I think it is not essential to split my data into train and test sets, because I don't have any machine learning algorithm in my recommender system and I don't use MSE, RMSE, or other error measurements.
Do you agree with me, or do you think I still have to split my data into train and test sets?
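If one does decide to hold data out, the usual pattern for precision@k is roughly the following sketch; recommend(user, k) is a hypothetical callable standing in for the hybrid recommender being evaluated, and only the train interactions would be fed to it.

# Sketch of a per-user holdout and precision@k evaluation for a recommender.
import random
from collections import defaultdict

def split_interactions(interactions, test_ratio=1.0 / 3, seed=42):
    # interactions: iterable of (user, item) pairs; returns two dicts user -> set(items)
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)
    train, test = defaultdict(set), defaultdict(set)
    for user, items in by_user.items():
        rng.shuffle(items)
        cut = max(1, int(len(items) * (1 - test_ratio)))
        train[user].update(items[:cut])
        test[user].update(items[cut:])
    return train, test

def precision_at_k(recommend, test, k=10):
    hits, total = 0, 0
    for user, held_out in test.items():
        if not held_out:
            continue
        recs = recommend(user, k)            # recommendations built from train data only
        hits += len(set(recs) & held_out)
        total += k
    return hits / total if total else 0.0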

Distributed cross correlation matrix computation

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, possibly in a distributed manner? Any efficient distributed algorithm suggestions would be appreciated.
Update:
I read the implementation of Apache Spark MLlib's correlation.
Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
But to me it looks like all the computation is happening at one node and is not distributed in any real sense.
Please shed some light on this. I also tried executing it on a 3-node Spark cluster; below are the screenshots:
As you can see from the second image, the data is pulled onto one node and then the computation is done. Am I right here?
To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles, MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing the correlations that are robust to outliers.
Each local data set can be converted into standard deviations and covariances.
The standard deviations, covariances, and sums can then be combined into the correlation.
Here is a working example:
https://github.com/jeesim2/distributed-correlation
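A minimal PySpark sketch of that idea (this is not the MLlib code linked above): each partition reduces its rows to small sufficient statistics (count, column sums, and the Gram matrix), and only those summaries are merged, so the full data never has to sit on one node. The toy input and column count are placeholders.

# Distributed Pearson correlation from per-partition sufficient statistics.
import numpy as np
from pyspark import SparkContext

def partition_stats(rows):
    rows = list(rows)
    if not rows:
        return []
    X = np.asarray(rows, dtype=np.float64)
    # n, column sums, and X^T X are all that Pearson needs from this partition.
    return [(X.shape[0], X.sum(axis=0), X.T @ X)]

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def pearson_matrix(rdd):
    n, s, gram = rdd.mapPartitions(partition_stats).reduce(merge)
    mean = s / n
    cov = gram / n - np.outer(mean, mean)      # E[xy] - E[x]E[y]
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

if __name__ == "__main__":
    sc = SparkContext(appName="distributed-pearson")
    data = sc.parallelize(np.random.randn(10000, 5).tolist(), numSlices=8)  # toy data
    print(pearson_matrix(data))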

Sentiment analysis using TensorFlow

I am exploring TensorFlow and would like to do sentiment analysis using the options available. I had a look at the following tutorial: http://www.tensorflow.org/tutorials/recurrent/index.html#language_modeling
I have worked with the Naive Bayes classifier, the Maximum Entropy algorithm, and scikit-learn classifiers, and would like to know if there are any better algorithms offered by TensorFlow. Is this the right place to start, or are there other options?
Any help pointing in the right direction would be greatly appreciated.
Thanks in advance.
A commonly used approach would be using a Convolutional Neural Network (CNN) to do sentiment analysis. You can find a great explanation/tutorial in this WildML blogpost. The accompanying TensorFlow code can be found here.
Another approach would be using an LSTM (or a related network). You can find example implementations online; a good starting point is this blog post.
I would suggest you try a character-level LSTM; it has been shown to achieve state-of-the-art results in many text classification tasks, one of them being sentiment analysis.
I wrote a fairly lengthy article, which you can find here, where I go through its implementation in TensorFlow line by line. The result is a model that is less than 100 MB in size and achieves an accuracy of over 80% on a test set of 80,000 tweets.
Another approach that has proven to be very effective is to use a recursive neural network; you can read the paper from the Stanford NLP Group here.
For me, the easiest tutorial to follow was: https://pythonprogramming.net/data-size-example-tensorflow-deep-learning-tutorial/?completed=/train-test-tensorflow-deep-learning-tutorial/
It walks you through tf.train.AdamOptimizer().minimize(cost) and uses the Sentiment140 dataset (from Stanford, ~1 million examples of positive and negative sentiment).
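For orientation, a compact word-level LSTM sentiment classifier along those lines can be sketched with tf.keras and the Adam optimizer; the vocabulary size, sequence length, layer sizes, and the random placeholder data below are illustrative assumptions, not values taken from Sentiment140 or the linked tutorials.

# Minimal LSTM sentiment classifier sketch with tf.keras (binary: positive vs. negative).
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 20000   # keep only the most frequent words (assumption)
MAX_LEN = 40         # tweets are short; pad/truncate to 40 tokens (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64, dropout=0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Placeholder arrays with the right shapes; in practice x would hold padded
# sequences of word ids from the corpus and y the 0/1 sentiment labels.
x = np.random.randint(0, VOCAB_SIZE, size=(1000, MAX_LEN))
y = np.random.randint(0, 2, size=(1000,))
model.fit(x, y, validation_split=0.2, epochs=2, batch_size=64)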

How to use custom loss function (PU Learning)

I am currently exploring PU learning, i.e. learning from positive and unlabeled data only. One of the publications [Zhang, 2009] asserts that it is possible to learn by modifying the loss function of a binary classifier with probabilistic output (for example, logistic regression). The paper states that one should optimize balanced accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function where I optimize for AUC (ROC), or equivalently, following the paper, 1 - balanced accuracy.
I am unsure where to start. Looking at the code reveals that I need to provide first and second derivatives and some other information. I could also run the standard algorithm with logistic loss and try to adjust L1 and L2 regularization according to my objective (not sure if this is a good idea). I would be glad to get any pointers or advice on how to proceed.
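As an illustration of what "provide first and second derivatives" amounts to (this is not Vowpal Wabbit's actual C++ loss interface), here are the three pieces for plain logistic loss with labels in {-1, +1}; a custom PU-oriented loss would have to supply the analogous functions.

# The three ingredients a loss function typically supplies, shown for logistic loss.
import math

def logistic_loss(prediction, label):
    # label is -1 or +1, prediction is the raw (pre-sigmoid) score
    return math.log(1.0 + math.exp(-label * prediction))

def first_derivative(prediction, label):
    # d loss / d prediction
    return -label / (1.0 + math.exp(label * prediction))

def second_derivative(prediction, label):
    # d^2 loss / d prediction^2 = sigma(label*prediction) * (1 - sigma(label*prediction))
    p = 1.0 / (1.0 + math.exp(-label * prediction))
    return p * (1.0 - p)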
UPDATE
More searching revealed that it is impossible, or at least difficult, to optimize for AUC in online learning: answer
I found two software packages that are immediately ready to do PU learning:
(1) SVMperf from Joachims
Use the "-l 10" option here!
(2) Sofia-ml
Use the "--loop_type roc" option here!
In general you set "+1" labels on your positive examples and "-1" on all unlabeled ones. Then you launch the training procedure followed by prediction.
Both programs give you some performance metrics. I would suggest using the standardized and well-established "perf" binary from the KDD'04 cup. Get it here.
Hope this helps those wondering how it works in practice. Perhaps I prevented the XKCD case.
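For those who would rather stay in Python, roughly the same recipe can be sketched with scikit-learn instead of SVMperf or Sofia-ml: label the positives "+1", label everything unlabeled "-1", train a linear model, and report AUC and balanced accuracy. The synthetic data below is purely a placeholder.

# PU-style labeling plus AUC / balanced accuracy, sketched with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(200, 10))      # known positives
X_unl = rng.normal(loc=0.0, size=(2000, 10))     # unlabeled pool (mostly negatives)

X = np.vstack([X_pos, X_unl])
y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_unl))])  # +1 positives, -1 unlabeled

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

scores = clf.decision_function(X)                 # ranking scores for AUC
print("AUC against the PU labels:", roc_auc_score(y, scores))
print("Balanced accuracy:", balanced_accuracy_score(y, clf.predict(X)))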

Logistic Regression / SVM implementation in Mahout

I am currently working on sentiment analysis of Twitter data for a telecom company. I am loading the data into HDFS and using Mahout's Naive Bayes classifier to predict the sentiment as positive, negative, or neutral.
Here is what I am doing:
I provide training data to the machine (key: sentiment, value: text).
Using the Mahout library, the TF-IDF (term frequency-inverse document frequency) of the text is calculated to create the feature vectors.
mahout seq2sparse -i /user/root/new_model/dataseq --maxDFPercent 1000000 --minSupport 4 --maxNGramSize 2 -a org.apache.lucene.analysis.WhitespaceAnalyzer -o /user/root/new_model/predicted
I split the data into a training set and a testing set.
I pass the feature vectors to the Naive Bayes algorithm to build a model.
mahout trainnb -i /user/root/new_model/train-vectors -el -li /user/root/new_model/labelindex -o /user/root/new_model/model -ow -c
Using this model, I predict the sentiment of new data.
This is a very simple implementation, and with it I am getting very low accuracy even though I have a good training set. So I was thinking of switching to logistic regression/SVM because they give better results for this kind of problem.
So my question is: how can I use these algorithms to build my model and predict the sentiments of tweets? What steps do I need to follow to achieve this?
Try using CrossFoldLearner, though I am doubtful it accepts Naive Bayes as the learning model; I used OnlineLogisticRegression some time ago. Alternatively, you could write your own CrossFoldLearner with Naive Bayes as the learner. Also, I don't think changing the algorithm would improve the results drastically, which means you have to look carefully at the analyzer doing the tokenization. Perhaps consider bigram tokenization instead of only unigram tokens.
Have you given any thought to phonetics, since most Twitter words are not dictionary words?
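Before rewiring the Hadoop pipeline, it may be worth prototyping the two suggestions above (bigram tokenization and logistic regression) outside Mahout. A minimal scikit-learn sketch, with placeholder tweets and labels standing in for the real data:

# Bigram TF-IDF + logistic regression prototype (scikit-learn, not Mahout's
# OnlineLogisticRegression). Texts and labels below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great service today", "worst network ever", "my bill arrived",
         "love the new data plan", "calls keep dropping", "store opens at nine"]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["the network was great but the bill is wrong"]))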
