Logistic Regression\SVM implementation in Mahout - hadoop

I am currently working on sentiment analysis of Twitter data for a telecom company. I am loading the data into HDFS and using Mahout's Naive Bayes classifier to predict each tweet's sentiment as positive, negative, or neutral.
Here is what I am doing:
I provide training data to the machine (key: sentiment, value: text).
Using the Mahout library, I compute the TF-IDF (term frequency-inverse document frequency) of the text to create feature vectors.
mahout seq2sparse -i /user/root/new_model/dataseq --maxDFPercent 1000000 --minSupport 4 --maxNGramSize 2 -a org.apache.lucene.analysis.WhitespaceAnalyzer -o /user/root/new_model/predicted
I split the data into a training set and a testing set.
I pass the feature vectors to the Naive Bayes algorithm to build a model.
mahout trainnb -i /user/root/new_model/train-vectors -el -li /user/root/new_model/labelindex -o /user/root/new_model/model -ow -c
Using this model, I predict the sentiment of new data.
This is a very simple implementation, and it gives me very low accuracy even though I have a good training set. So I was thinking of switching to logistic regression or SVM, because they tend to give better results for this kind of problem.
So my question is: how can I use these two algorithms to build my model and predict the sentiment of tweets? What steps do I need to follow?
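For reference, the TF-IDF step described above can be sketched in plain Python. This is a minimal sketch of the textbook weighting tf * log(N / df), assuming whitespace tokenization; it is not Mahout's exact weighting formula, and the function name tf_idf and the sample tweets are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


# Hypothetical tweets, for illustration only.
sample_tweets = [
    "network is slow",
    "network is great",
    "great network plan",
]
sample_vectors = tf_idf(sample_tweets)
```

A term that occurs in every document ("network" here) gets weight zero, while a term unique to one document ("slow") gets the largest weight, which is the behaviour the feature vectors rely on.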

Try CrossFoldLearner, though I doubt it accepts naive Bayes as the learning model; I used it with OnlineLogisticRegression some time ago. Alternatively, you could write your own cross-fold learner with naive Bayes as the learner. That said, I don't think changing the algorithm will improve the results drastically, which means you should look carefully at the analyzer that does the tokenization. Consider bigram tokenization instead of using only unigram tokens.
Have you considered phonetic matching, since most words on Twitter are not in a dictionary?
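To make the two suggestions above concrete (bigram tokens and an online logistic regression learner), here is a minimal sketch in plain Python. It does not use Mahout's API; it only illustrates the idea of updating weights one example at a time with stochastic gradient descent. As simplifying assumptions, it handles two classes only (1 = positive, 0 = negative), uses binary presence features, and the function names, hyperparameters, and toy tweets are all hypothetical.

```python
import math
from collections import defaultdict

BIAS = "__bias__"  # hypothetical name for the intercept feature


def features(text):
    """Whitespace-tokenize and return the set of unigrams and adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = {a + "_" + b for a, b in zip(words, words[1:])}
    return set(words) | bigrams


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, text):
    """Probability that the text belongs to the positive class."""
    score = weights.get(BIAS, 0.0)
    score += sum(weights.get(f, 0.0) for f in features(text))
    return sigmoid(score)


def train(examples, epochs=20, learning_rate=0.5, l2=1e-4):
    """One SGD update per example; examples are (text, label) pairs with label 0 or 1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            error = label - predict_proba(weights, text)
            weights[BIAS] += learning_rate * error
            for f in features(text):
                # Gradient step on the log-likelihood with a small L2 penalty.
                weights[f] += learning_rate * (error - l2 * weights[f])
    return weights


# Toy, made-up training data.
examples = [
    ("great network love it", 1),
    ("love the new plan", 1),
    ("terrible network hate it", 0),
    ("hate the new plan", 0),
]
weights = train(examples)
print(predict_proba(weights, "love it"), predict_proba(weights, "hate it"))
```

For three classes (positive, negative, neutral), one option is to train one such model per class and pick the highest probability.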

Related

What machine learning algorithm would be best suited for a scenario when you are not sure about the test features/attributes?

E.g., for training, you use data for which users have filled in all the fields (around 40) of a form, along with an expected output.
We then build a model (it could be an artificial neural network, an SVM, logistic regression, etc.).
Finally, a user enters only 3 fields in the form and expects a prediction.
In this scenario, what is the best ML algorithm I can use?
I think it will depend on the specific context of your problem. What are you trying to predict based on what kind of input?
For example, companies like Netflix use recommender systems to predict a user's rating of, say, a movie based on a very sparse feature vector (the user's existing ratings of a tiny percentage of the movies in the catalog).
Another option is to develop a mapping from your sparse feature space to a common latent space in which you perform classification with, e.g., an SVM or a neural network. I believe this paper does something similar. You can also look into papers like this one for a classifier that translates data from two different domains (your training and testing sets, for example, where both contain similar information but one has complete data and the other does not) into a common latent space for classification. There is actually a lot of work out there on domain-independent classification.
Keywords to look up (with some links to get you started): generative adversarial networks (GAN), domain-adversarial training, domain-independent classification, transfer learning.

some confusions in machine learning

I have two points of confusion about using machine learning algorithms. I should say up front that I am only a user of them.
1. There are two categories, A and B. If I want to pick out as many A as possible from their mixture, what kind of algorithm should I use (there is no need to consider the number of samples)? At first I thought it should be a classification algorithm, so I used, for example, the boosted decision tree (BDT) in the TMVA package, but someone told me that BDT is actually a regression algorithm.
2. I find that when I have raw data, if I analyze it myself (form some combinations, ...) before passing it to the BDT, the result is better than when I pass in the raw data directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is unclear, please add a comment. I hope you can give me some advice.
For 2, you have to manipulate the data before feeding it in because the algorithm has no built-in way to analyze it; it only looks at the data it is given and classifies. What you call analysis is known as feature selection or feature engineering, and it has to be done by hand (unless you use a technique that learns features, e.g., deep learning). In machine learning, engineered features have often been seen to perform better than raw features.
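A small illustration of why a hand-made combination can help, even though the raw data "contains all the information". This is a toy sketch, not TMVA: the learner is a single-threshold decision stump (the kind of weak learner a BDT combines), the data are made up, and the class depends only on whether a > b.

```python
def best_stump_accuracy(values, labels):
    """Best accuracy of any rule of the form 'predict 1 if value > threshold' (or its reverse)."""
    best = 0.0
    for threshold in sorted(set(values)):
        hits = sum(1 for v, y in zip(values, labels) if (v > threshold) == (y == 1))
        accuracy = hits / len(values)
        # Reversing the rule turns every hit into a miss and vice versa.
        best = max(best, accuracy, 1.0 - accuracy)
    return best


# Made-up samples (a, b, label); the label is 1 exactly when a > b.
samples = [
    (1, 2, 0), (2, 1, 1), (3, 4, 0), (4, 3, 1),
    (5, 6, 0), (6, 5, 1), (7, 8, 0), (8, 7, 1),
]
labels = [y for _, _, y in samples]

raw_a = best_stump_accuracy([a for a, _, _ in samples], labels)
raw_b = best_stump_accuracy([b for _, b, _ in samples], labels)
engineered = best_stump_accuracy([a - b for a, b, _ in samples], labels)
print(raw_a, raw_b, engineered)
```

A single cut on a or on b alone cannot separate the classes, while a single cut on the engineered feature a - b separates them perfectly. A full BDT could approximate the same boundary from the raw inputs with many cuts, but it needs more trees and more data to get there.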
For 1, I think BDT can be used for regression as well as classification. This looks like a classification problem (to choose or not to choose), so you should use a classification algorithm.
Are you sure ML is the right approach for your problem? If it is, some classification algorithms are:
logistic regression, neural networks, support vector machines, and decision trees, to name a few.

How to use custom loss function (PU Learning)

I am currently exploring PU learning, that is, learning from positive and unlabeled data only. One publication [Zhang, 2009] asserts that this is possible by modifying the loss function of a binary classifier with probabilistic output (for example, logistic regression). The paper states that one should optimize balanced accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function that optimizes for AUC (ROC), or equivalently, following the paper, 1 - Balanced_Accuracy.
I am unsure where to start. Looking at the code, it appears that I need to provide the first and second derivatives and some other information. I could also run the standard algorithm with logistic loss and try to adjust l1 and l2 according to my objective (I am not sure whether this is a good idea). I would be glad to get any pointers or advice on how to proceed.
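To pin down the quantities mentioned above, here is a minimal sketch in plain Python of balanced accuracy and of the logistic loss together with its first and second derivatives with respect to the raw prediction. It assumes labels of +1 and -1; the function names are my own and this is not Vowpal Wabbit's actual loss-function interface.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the +1 class and the recall on the -1 class."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return 0.5 * (true_pos / positives + true_neg / negatives)


def logistic_loss(prediction, label):
    """log(1 + exp(-label * prediction)) for label in {+1, -1}."""
    return math.log1p(math.exp(-label * prediction))


def logistic_first_derivative(prediction, label):
    """d(loss)/d(prediction) = -label / (1 + exp(label * prediction))."""
    return -label / (1.0 + math.exp(label * prediction))


def logistic_second_derivative(prediction, label):
    """d2(loss)/d(prediction)2 = exp(label * prediction) / (1 + exp(label * prediction))^2."""
    e = math.exp(label * prediction)
    return e / (1.0 + e) ** 2
```

Note that balanced accuracy is computed from thresholded predictions, so unlike the logistic loss it has no useful derivative with respect to the raw prediction; that is one reason it is hard to plug directly into a gradient-based learner.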
UPDATE
Further searching revealed that it is difficult, if not impossible, to optimize for AUC in online learning: answer
I found two software suites that are ready to do PU learning out of the box:
(1) SVM perf from Joachims
Use the "-l 10" option here!
(2) Sofia-ml
Use the "--loop_type roc" option here!
In general, you assign the label "+1" to your positive examples and "-1" to all unlabeled ones. Then you launch the training procedure, followed by prediction.
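The labelling step can be sketched as follows. This assumes the input files use the SVM-light sparse format (one example per line: the label followed by index:value pairs in increasing index order); the helper name and the sample feature values are made up for illustration.

```python
def to_svmlight_line(is_positive, features):
    """Format one example; `features` maps 1-based feature index to value."""
    label = "+1" if is_positive else "-1"  # unlabeled examples are treated as -1
    pairs = " ".join(
        f"{index}:{value}" for index, value in sorted(features.items())
    )
    return f"{label} {pairs}"


# Hypothetical data: one known positive and one unlabeled example.
lines = [
    to_svmlight_line(True, {3: 0.5, 1: 1.0}),
    to_svmlight_line(False, {2: 2.0}),
]
print("\n".join(lines))
```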
Both packages report some performance metrics. I would suggest using the standardized, well-established binary from the KDD Cup 2004: "perf". Get it here.
I hope this helps those wondering how this works in practice. Perhaps I have prevented the XKCD case.

Accuracy hit in classification with naive Bayes MlLib

I had been using the Naive Bayes algorithm in Mahout 0.9 to classify document data. For a specific training set (2/3 of the data) and test set (1/3 of the data), I was getting accuracy of around 86%. When I switched to Spark's MLlib, the accuracy dropped to 82%. In both cases I am using the Standard Analyzer.
MLlib link: https://spark.apache.org/docs/latest/mllib-naive-bayes.html
Mahout link: http://mahout.apache.org/users/classification/bayesian.html
Please help me with this, as I have to use Spark in a production system very soon and this is a blocker for me.
I also found that MLlib takes more time to classify data than Mahout.
Can anyone help me increase the accuracy of MLlib's Naive Bayes?

How to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes classifier, following this tutorial.
For the training data, I am using 308 questions, manually tagged into 26 categories.
Before sending the data, I perform NLP preprocessing (punctuation removal, tokenization, stopword removal, and stemming).
I use this filtered data as input for Mahout.
Using Mahout's NBC, I train on this data and get the model file. Now when I run the
mahout testnb
command, I get Correctly Classified Instances of 96%.
For my test data, I am using 100 questions that I have manually tagged. When I use the trained model on the test data, I get Correctly Classified Instances of 1%.
This is pissing me off.
Can anyone tell me what I am doing wrong, or suggest some ways to increase the performance of the NBC?
Also, ideally, how much question data should I use to train and test?
This appears to be the classic problem of overfitting, where you get very high accuracy on the training set but low accuracy in real situations.
You probably need more training instances: with 308 questions spread over 26 categories, you have only about 12 examples per category on average. It is also possible that the 26 categories don't correlate with the features you have. Machine learning isn't magic; it needs some statistical relationship between the variables and the outcomes. What the NBC may be doing here is effectively "memorizing" the training set, which is useless for questions it has not seen.
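The gap between training accuracy and held-out accuracy can be illustrated with a deliberately extreme toy model. This is a sketch in plain Python, not Mahout: the "classifier" is a lookup table that memorizes the training questions and falls back to the most common category otherwise, and all questions and categories are made up.

```python
from collections import Counter


class MemorizingClassifier:
    """Remembers every training text verbatim; guesses the majority label otherwise."""

    def fit(self, texts, labels):
        self.table = dict(zip(texts, labels))
        self.fallback = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, text):
        return self.table.get(text, self.fallback)


def accuracy(model, texts, labels):
    hits = sum(1 for text, label in zip(texts, labels) if model.predict(text) == label)
    return hits / len(texts)


# Hypothetical training and held-out questions.
train_texts = ["why is my bill so high", "how do i pay my bill",
               "no signal at home", "how do i reset my password"]
train_labels = ["billing", "billing", "network", "account"]
test_texts = ["signal keeps dropping", "forgot my password", "bill looks wrong"]
test_labels = ["network", "account", "billing"]

model = MemorizingClassifier().fit(train_texts, train_labels)
print(accuracy(model, train_texts, train_labels))  # accuracy on data it has seen
print(accuracy(model, test_texts, test_labels))    # accuracy on unseen data
```

Measuring accuracy on the same data the model was trained on says very little, so it is worth holding out part of the tagged questions (or using cross-validation) and judging the model only on those.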
