Is a train-test split required to evaluate every metric? - precision

I have a hybrid recommender system, and I use precision to evaluate its results. I don't think it is essential to split my data into training and test sets, because the recommender does not use any machine learning algorithm and I don't use MSE, RMSE, or other error measures.
Do you agree with me, or do you think I have to split my data into training and test sets?
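To make the setup concrete, here is a minimal sketch of precision at k measured against interactions that were hidden from the recommender. The function name `precision_at_k`, the item ids, and the history/held-out split are all hypothetical and only illustrate what "precision" needs as ground truth.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical interactions of one user, split by time.
history = ["a", "b", "c", "d"]   # what the recommender is allowed to look at
held_out = {"e", "f"}            # later interactions, hidden from the recommender

# Hypothetical output of the recommender after seeing only `history`.
recommended = ["e", "x", "f", "y"]

# 2 of the 4 recommended items are in the held-out set.
p = precision_at_k(recommended, held_out, k=4)
print(p)
```

If the items counted as relevant were also visible to the recommender when it produced the list, the same function would still run, but the number would say little about items the user has not seen yet.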

Related

AI bias in sentiment analysis

I am using a sentiment analysis API and want to know how the AI bias that gets in through the training data, along with other biases, can be quantified. Any help would be appreciated.
There are several tools developed to deal with this:
Fairlearn https://fairlearn.github.io/
Interpretability Toolkit https://learn.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability
In Fairlearn you can see how biased an ML model is after it has been trained on the data set, and choose a model that may be less accurate but behaves better with respect to bias. The explainable ML models show how inputs correlate with outputs and, combined with Fairlearn, can give an idea of the health of the ML model.
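As a rough illustration of what "quantifying" bias can mean, the sketch below computes accuracy and positive-prediction rate separately for each group and then takes the gap in positive-prediction rates. This is plain Python, not the Fairlearn API; the function name `per_group_rates`, the group labels, and the toy predictions are assumptions made up for the example.

```python
from collections import defaultdict


def per_group_rates(y_true, y_pred, groups):
    """Return {group: (accuracy, positive_prediction_rate)} for 0/1 labels."""
    counts = defaultdict(lambda: [0, 0, 0])  # [n, n_correct, n_predicted_positive]
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c[0] += 1
        c[1] += int(t == p)
        c[2] += int(p == 1)
    return {g: (c[1] / c[0], c[2] / c[0]) for g, c in counts.items()}


# Toy data: 1 = "positive sentiment"; groups A and B are hypothetical,
# e.g. texts written in two different dialects.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = per_group_rates(y_true, y_pred, groups)

# Gap between the highest and lowest positive-prediction rate across groups.
positive_rates = [pos_rate for _, pos_rate in rates.values()]
parity_gap = max(positive_rates) - min(positive_rates)
print(rates, parity_gap)
```

A large gap does not prove the model is unfair on its own, since the groups may differ in their true label distribution, but it is the kind of disaggregated number such tools report.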

What machine learning algorithm would be best suited for a scenario when you are not sure about the test features/attributes?

E.g., for training you use data for which users have filled in all the fields (around 40) of a form, along with an expected output.
We then build a model (an artificial neural net, an SVM, logistic regression, etc.).
Finally, a user enters only 3 fields in the form and expects a prediction.
In this scenario, what is the best ML algorithm I can use?
I think it will depend on the specific context of your problem. What are you trying to predict based on what kind of input?
For example, companies like Netflix use recommender systems to predict a user's rating of items such as movies from a very sparse feature vector (the user's existing ratings of a tiny percentage of the movies in the catalog).
Another option is to develop a mapping from your sparse feature space to a common latent space on which you perform your classification with, e.g., an SVM or neural network. I believe this paper does something similar. You can also look into papers like this one for a classifier that translates data from two different domains (your training vs. testing set, for example, where both contain similar information, but one has complete data and the other does not) into a common latent space for classification. There is actually a lot out there on domain-independent classification.
Keywords to look up (with some links to get you started): generative adversarial networks (GAN), domain-adversarial training, domain-independent classification, transfer learning.
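Before reaching for the latent-space methods above, a much simpler baseline is worth having for comparison: fill every field the user left empty with its mean from the training data, then feed the completed vector to whatever model was trained. This is a swapped-in baseline, not the approach from the linked papers, and the helper names and the 4-field toy data are hypothetical.

```python
def column_means(rows):
    """Per-field mean over fully filled-in training rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fill_missing(partial, means):
    """`partial` maps field index -> entered value; other fields get the training mean."""
    return [partial.get(i, m) for i, m in enumerate(means)]


# Toy training data with 4 fields instead of 40.
train_X = [
    [1.0, 10.0, 0.0, 5.0],
    [3.0, 20.0, 1.0, 7.0],
    [5.0, 30.0, 1.0, 9.0],
]
means = column_means(train_X)

# The user only filled in fields 0 and 3.
user_input = {0: 4.0, 3: 6.0}
x = fill_missing(user_input, means)
print(x)  # this completed vector can be passed to the trained model
```

How well this works depends heavily on how informative the three entered fields are compared with the imputed ones.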

How to use custom loss function (PU Learning)

I am currently exploring PU learning, i.e. learning from positive and unlabeled data only. One of the publications [Zhang, 2009] asserts that it is possible to learn by modifying the loss function of a binary classifier with probabilistic output (for example, logistic regression). The paper states that one should optimize balanced accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function where I optimize for AUC (ROC), or equivalently, following the paper: 1 - Balanced_Accuracy.
I am unsure where to start. Looking at the code reveals that I need to provide the first and second derivatives and some other information. I could also run the standard algorithm with logistic loss and try to adjust l1 and l2 according to my objective (not sure if this is a good idea). I would be glad to get any pointers or advice on how to proceed.
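For reference, balanced accuracy is the mean of the true-positive rate and the true-negative rate. The sketch below computes it, and the 1 - Balanced_Accuracy quantity mentioned above, for hard predictions in {+1, -1}. It is only an evaluation helper under that label convention, not a Vowpal Wabbit loss function, and the toy labels are made up.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for labels in {+1, -1}."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy labels: 3 positives, 5 examples treated as negatives.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1, -1, 1]

ba = balanced_accuracy(y_true, y_pred)  # 0.5 * (2/3 + 4/5)
objective = 1.0 - ba                    # the quantity to minimize
print(ba, objective)
```

Because this is computed from thresholded predictions, it is piecewise constant in the model weights, which is part of why it cannot be plugged in directly as a loss that needs first and second derivatives.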
UPDATE
Further searching revealed that it is impossible, or at least difficult, to optimize for AUC in online learning: answer
I found two software suites that are immediately ready to do PU learning:
(1) SVM perf from Joachims
Use the "-l 10" option here!
(2) Sofia-ml
Use the "--loop_type roc" option here!
In general, you assign the label "+1" to your positive examples and "-1" to all unlabeled ones. Then you launch the training procedure, followed by prediction.
Both packages report some performance metrics. I would suggest using the standardized and well-established "perf" binary from the KDD Cup 2004. Get it here.
I hope this helps those wondering how this works in practice. Perhaps I have prevented the XKCD case.
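The labeling step can be sketched as follows, assuming the training tool reads the SVM-light text format (one example per line: a label followed by index:value pairs with increasing 1-based indices). The helper name `to_svmlight` and the toy feature dictionaries are hypothetical, and no command-line invocation is shown since the exact calls depend on the tool.

```python
def to_svmlight(features, label):
    """Format one example; `features` maps 1-based feature index -> value."""
    parts = [
        f"{idx}:{val:g}"
        for idx, val in sorted(features.items())
        if val != 0  # zero-valued features are simply omitted in the sparse format
    ]
    return " ".join([f"{label:+d}"] + parts)


# Toy data: known positives and unlabeled examples.
positives = [{1: 0.5, 3: 2.0}]
unlabeled = [{2: 1.0}, {1: 0.1, 2: 0.0}]

# Positives get +1, every unlabeled example is treated as -1.
lines = [to_svmlight(f, +1) for f in positives]
lines += [to_svmlight(f, -1) for f in unlabeled]

print("\n".join(lines))
```

Writing these lines to a file gives a training set in which the unlabeled examples play the role of negatives, which is the setup described above.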

Accuracy hit in classification with naive Bayes in MLlib

I had been using the Naive Bayes algorithm from Mahout 0.9 to classify document data. For a specific training set (2/3 of the data) and test set (1/3 of the data), I was getting accuracy of around 86%. When I switched to Spark's MLlib, the accuracy dropped to 82%. In both cases I used the Standard Analyzer.
MLlib link: https://spark.apache.org/docs/latest/mllib-naive-bayes.html
Mahout link: http://mahout.apache.org/users/classification/bayesian.html
Please help me in this regard as I have to use Spark in a production system very soon and this is a blocker for me.
I also found that MLlib takes more time to classify the data than Mahout.
Can anyone help me increase the accuracy with MLlib naive Bayes?

Where does map-reduce/Hadoop come into machine learning training?

Map-reduce/Hadoop is perfect for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to feed all the training data into the algorithm (be it SVM, logistic regression, or random forest) at once so that the algorithm can come up with a model that accounts for all of it. Can map-reduce/Hadoop help in the training part? If yes, how in general?
Yes. There are many MapReduce implementations, such as Hadoop Streaming, and even some easy tools like Pig, which can be used for learning. In addition, there are distributed learning toolsets built on Map/Reduce, such as Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea of this kind of method is to train on small portions of the data (split by HDFS), then average the models and communicate between the nodes, so the model gets its updates directly from submodels built on parts of the data.
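The "train on shards, then average" idea can be sketched in a single process like this. It is a toy one-shot parameter average over logistic regression models fitted with plain SGD, under the assumption that each shard stands in for one node's slice of the data. It does not reproduce how Vowpal Wabbit or any Hadoop tool actually synchronizes nodes, and all names and the synthetic data are made up.

```python
import math
import random


def train_logreg(shard, n_features, epochs=50, lr=0.1):
    """Fit logistic regression on one shard with plain SGD (the "map" step)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in shard:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(n_features):
                w[j] += lr * (y - p) * x[j]
    return w


def average_models(models):
    """Average the weight vectors coordinate by coordinate (the "reduce" step)."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


random.seed(0)

# Synthetic data: features are [bias, x1], label is 1 when x1 > 0.
data = []
for _ in range(400):
    x1 = random.uniform(-1.0, 1.0)
    data.append(([1.0, x1], 1 if x1 > 0 else 0))

# Four shards stand in for four data blocks on different nodes.
shards = [data[i::4] for i in range(4)]
models = [train_logreg(shard, n_features=2) for shard in shards]
w_avg = average_models(models)

accuracy = sum(predict(w_avg, x) == y for x, y in data) / len(data)
print(w_avg, accuracy)
```

In a real cluster the averaging is usually repeated over several rounds rather than done once, so that each node continues training from the shared model.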

Resources