How to use custom loss function (PU Learning) - vowpalwabbit

I am currently exploring PU learning, i.e., learning from positive and unlabeled data only. One of the publications [Zhang, 2009] asserts that it is possible to learn from such data by modifying the loss function of a binary classifier with probabilistic output (for example, logistic regression). The paper states that one should optimize balanced accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function that optimizes for AUC (ROC) or, equivalently, following the paper, 1 - Balanced Accuracy.
I am unsure where to start. Looking at the code reveals that I need to provide first and second derivatives and some other information. I could also run the standard algorithm with logistic loss and try to adjust the l1 and l2 regularization according to my objective (I am not sure whether this is sound). I would be glad to get any pointers or advice on how to proceed.
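For reference, this is the kind of logistic-loss baseline I have in mind, driving the vw binary from Python; the file names and regularization values are placeholders, not recommendations:

```python
import subprocess

# VW's logistic loss expects labels in {-1, +1}; one example per line,
# e.g. "1 | height:1.5 width:0.3"
subprocess.run([
    "vw", "-d", "train.vw",             # training data in VW's input format
    "--loss_function", "logistic",
    "--l1", "1e-6", "--l2", "1e-6",     # regularization knobs to sweep
    "-f", "model.vw",                   # store the learned model
], check=True)

# score held-out data with the stored model
subprocess.run(["vw", "-d", "test.vw", "-i", "model.vw",
                "-t", "-p", "preds.txt"], check=True)
```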
UPDATE
Further searching revealed that it is difficult, if not impossible, to optimize for AUC in an online-learning setting: answer

I found two software suites that are immediately ready to do PU learning:
(1) SVM perf from Joachims
Use the "-l 10" option here!
(2) Sofia-ml
Use the "--loop_type roc" option here!
In general, you set +1 labels on your positive examples and -1 on all unlabeled ones. Then you launch the training procedure followed by prediction; a sketch of this relabeling step appears below.
Both tools give you some performance metrics. For evaluation, I would suggest using the standardized and well-established binary from the KDD'04 cup: "perf". Get it here.
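For concreteness, here is a minimal sketch of the relabeling step in Python, writing the SVM-light format that both tools accept; the feature values are toy placeholders:

```python
# positives get +1, all unlabeled examples get -1 (toy feature dicts below)
positives = [{1: 0.5, 3: 1.0}]                  # feature_index -> value
unlabeled = [{2: 0.7}, {1: 0.1, 2: 0.2}]

def to_svmlight(label, features):
    feats = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {feats}\n"

with open("train.dat", "w") as f:
    for x in positives:
        f.write(to_svmlight("+1", x))
    for x in unlabeled:
        f.write(to_svmlight("-1", x))

# then train, e.g.:
#   svm_perf_learn -l 10 train.dat model
#   sofia-ml --learner_type pegasos --loop_type roc \
#            --training_file train.dat --model_out model
```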
I hope this helps those wondering how it works in practice. Perhaps I prevented the situation from this XKCD.

Related

some confusions in machine learning

I have two points of confusion about machine learning algorithms. First, I should say that I am just a user of them.
There are two categories, A and B. If I want to pick out as many A as possible from their mixture, what kind of algorithm should I use (no need to consider the number of samples)? At first I thought it should be a classification algorithm, and I used, for example, a boosted decision tree (BDT) in the package TMVA, but someone told me that BDT is actually a regression algorithm.
I find that when I have raw data, if I analyze it (do some combinations, ...) before I feed it to the BDT, the result is better than when I feed the raw data in directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is not clear, please just add a comment. I hope you can give me some advice.
For 2: you have to perform some manipulation on the data before feeding it in, because the algorithm is not built to analyze the data for you; it only looks at the features it is given and classifies. The problem of "analysis", as you put it, is called feature selection or feature engineering, and it has to be done by hand (unless, of course, you are using some technique that learns features, e.g., deep learning). In machine learning, it has been observed many times that manipulated/engineered features perform better than raw features; a toy sketch follows.
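Here is a toy sketch of that effect (not the TMVA setup from the question), using scikit-learn's gradient-boosted trees and an engineered ratio feature:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 2)) + 1.0
y = (X_raw[:, 0] / X_raw[:, 1] > 1.0).astype(int)   # class depends on a ratio

# engineered feature: the ratio itself, appended to the raw columns
X_eng = np.column_stack([X_raw, X_raw[:, 0] / X_raw[:, 1]])

clf = GradientBoostingClassifier(random_state=0)
print("raw features:       ", cross_val_score(clf, X_raw, y).mean())
print("engineered features:", cross_val_score(clf, X_eng, y).mean())
```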
For 1: I think BDT can be used for regression as well as classification. This looks like a classification problem (to choose or not to choose), hence you should use a classification algorithm.
Are you sure ML is the right approach for your problem? In case it is, some classification algorithms would be:
logistic regression, neural networks, support vector machines, and decision trees, just to name a few.

ML algorithm Representation, Evaluation & Optimization

I was reading through a paper published by the University of Washington with tips about Machine Learning algorithms.
They break an ML algorithm down into three parts: representation, evaluation, and optimization. I would like to understand how these three parts work together, i.e., what the process looks like throughout a typical machine learning algorithm.
I understand that my question is very abstract and each algorithm will be different, but if you know of a way to explain it abstractly please do. If not feel free to use a specific algorithm to explain.
Paper: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
See Table 1.
Sebastian Raschka has provided a very nice flowchart of the (supervised) machine learning process.
(I am familiar with text classification, so I will use this as an example)
In short:
Representation: your data needs to be brought into a form suitable for the algorithm. For text classification, you would, for example, extract features from your full-text documents (the input data) and bring them into a bag-of-words representation.
Application of the respective algorithm: here your algorithm is executed. After training, you have to do some kind of evaluation to measure your success (the evaluation criterion depends on the task). Evaluation is the final part of this step.
Optimization: learning algorithms usually have one or more parameters to tune. Having reference pairs of parameter configuration / evaluation result (from the previous step), you can now tweak the parameters and evaluate the effect on your results.
(In the flowchart this is the edge, where we go back to the step "Learning algorithm training")
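To make the three parts concrete, here is a toy sketch on a tiny text-classification task; the documents and labels are made up, and the grid over C is arbitrary:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["free money now", "meeting at noon", "win cash prizes", "lunch tomorrow"]
labels = [1, 0, 1, 0]                      # toy spam / not-spam labels

pipe = Pipeline([
    ("rep", CountVectorizer()),            # representation: bag of words
    ("clf", LogisticRegression()),
])
# optimization: pick the configuration with the best evaluation score
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(docs, labels)                   # evaluation happens inside the CV loop
print(search.best_params_, search.best_score_)
```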

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and was amazed by the results. I decided to give it a try, but I have trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification but haven't heard back yet, so I'd really appreciate any insight you may have to offer.
The high-level description of the algorithm can be found on page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it is the initialization step, and the part I quote indicates that a one-time activation also takes place here. But which item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For comparison, in the BPTS algorithm I'm using, for each item in the training dataset a new neural network, whose topology depends on the current item (data structure), is created on the fly and activated; then the error is backpropagated, the weights are updated and saved, and the temporary network is destroyed.
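To make that concrete, here is a minimal toy sketch (not my actual code) of the per-item unfolding I mean, assuming binary trees, shared weights, and a logistic output at the root:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                  # size of each node's representation
W = rng.normal(0, 0.1, (D, 2 * D))     # shared weights combining two children
b = np.zeros(D)
w_out = rng.normal(0, 0.1, D)          # logistic output layer at the root

def forward(node):
    # leaf: use its feature vector; internal: tanh(W [left; right] + b)
    if "x" in node:
        node["h"] = node["x"]
    else:
        forward(node["l"]); forward(node["r"])
        kids = np.concatenate([node["l"]["h"], node["r"]["h"]])
        node["h"] = np.tanh(W @ kids + b)
    return node["h"]

def backward(node, grad_h, grads):
    # push the gradient from the root toward the frontier nodes,
    # accumulating into the *shared* parameters W and b
    if "x" in node:
        return
    gz = grad_h * (1 - node["h"] ** 2)               # derivative of tanh
    kids = np.concatenate([node["l"]["h"], node["r"]["h"]])
    grads["W"] += np.outer(gz, kids)
    grads["b"] += gz
    gkids = W.T @ gz
    backward(node["l"], gkids[:D], grads)
    backward(node["r"], gkids[D:], grads)

# one training step on one item: the network topology *is* the item's tree
tree = {"l": {"x": rng.normal(size=D)}, "r": {"x": rng.normal(size=D)}}
y, lr = 1.0, 0.1
h_root = forward(tree)
p = 1 / (1 + np.exp(-w_out @ h_root))                # probability at the root
grads = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
backward(tree, (p - y) * w_out, grads)               # dLoss/dh_root for log loss
w_out -= lr * (p - y) * h_root
W -= lr * grads["W"]; b -= lr * grads["b"]
```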
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30), and (34). My understanding is that NT denotes the number of items in the training dataset, but equations (17), (30), and (34) already involve ALL items in the training dataset, so what is the point of solving them NT times?
Yet another thing I failed to grasp is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it is very unclear to me how it works in their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on in this algorithm, I'd be very interested in hearing (or rather, reading) it. Also, if you are aware of other promising algorithms and/or network architectures for structured data classification (could long short-term memory (LSTM) be of use here?), please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

Methods to identify duplicate questions on Twitter?

As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the Twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not a question.
Extract the features from the training set. The features are traditionally words, but every time I tried it, using bigrams significantly improved the results (trigrams were not helpful in my cases).
Build a classifier from the data. I usually found that an SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or kNN (though you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify tweets.[1]
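A minimal sketch of those steps, with toy tweets and labels (scikit-learn standing in for whichever SVM implementation you prefer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["how do I reset my password?", "great game last night",
          "anyone know a good pizza place?", "just landed in NYC"]
is_question = [1, 0, 1, 0]                 # toy labels

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams as features
    LinearSVC(),
)
clf.fit(tweets, is_question)
print(clf.predict(["where can I buy tickets?"]))
```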
(B)
This issue is referred to in the information retrieval world as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build a vector that represents each term's semantics: for a term t, the entry vector_t[i] is the tf-idf score of t with respect to the i-th Wikipedia concept. The idea is described in detail in the article; reading the first 3-4 pages is enough to understand it, no need to read it all.[2]
For each tweet, construct a vector which is a function of the vectors of its terms. Comparing two such vectors lets you identify whether two questions are discussing the same issues.
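As a toy sketch of the comparison step only (the term vectors here are made up; real ones would come from the Wikipedia-based method in the paper):

```python
import numpy as np

term_vectors = {                         # made-up "semantics" per term
    "python": np.array([1.0, 0.0, 0.5]),
    "snake":  np.array([0.9, 0.1, 0.4]),
    "code":   np.array([0.1, 1.0, 0.0]),
}

def tweet_vector(terms):
    # one simple choice of "function of the vectors of its terms": the sum
    return sum(term_vectors.get(t, np.zeros(3)) for t in terms)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(tweet_vector(["python", "code"]),
             tweet_vector(["snake", "code"])))   # high value => similar topics
```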
EDIT:
On second thought, the bag-of-words model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing for feature extraction (for example, for each term also denote whether it appears before or after the subject, as determined by NLP processing) and combine it with machine learning, you will get pretty good results.
[1] For evaluation of your classifier, you can use cross-validation and check the expected accuracy.
[2] I know Evgeny Gabrilovich published their implementation as an open-source project; you just need to look for it.

What are some algorithms for symbol-by-symbol handwriting recognition?

I think there are algorithms that evaluate the difference between a drawn symbol and an expected one, or something like that. Any help will be appreciated :))
You can implement a simple Neural Network to recognize handwritten digits. The simplest type to implement is a feed-forward network trained via backpropagation (it can be trained stochastically or in batch-mode). There are a few improvements that you can make to the backpropagation algorithm that will help your neural network learn faster (momentum, Silva and Almeida's algorithm, simulated annealing).
As far as looking at the difference between a real symbol and an expected image, one algorithm I've seen used is the k-nearest-neighbor algorithm. Here is a paper that describes using k-nearest-neighbors for character recognition (edit: I had the wrong link earlier; the link I've provided now requires you to pay for the paper, and I'm trying to find a free version).
If you were using a neural network to recognize your characters, the steps involved would be:
Design your neural network with an appropriate training algorithm. I suggest starting with the simplest (stochastic backpropagation) and then improving the algorithm as desired, while you train your network.
Get a good sample of training data. For my neural network, which recognizes handwritten digits, I used the MNIST database.
Convert the training data into input vectors for your neural network. For the MNIST data, you will need to binarize the images. I used a threshold value of 128; I started with Otsu's method, but that didn't give me the results I wanted. (This conversion is sketched just after these steps.)
Create your network. Since the images from MNIST come as 28x28 arrays, you have an input vector with 784 components plus 1 bias (so 785 inputs) to your neural network. I used one hidden layer with the number of nodes set as per the guidelines outlined here (along with a bias). Your output vector will have 10 components (one for each digit).
Randomly present training data (randomly ordered digits, with a random input image for each digit) to your network and train it until it reaches the desired error level.
Run test data (MNIST data comes with this as well) against your neural network to verify that it recognizes digits correctly.
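Here is a minimal sketch of the conversion step above (binarize at 128, unroll, add a bias), using a random stand-in for an MNIST image:

```python
import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # stand-in for an MNIST image

binary = (image >= 128).astype(float)              # binarize at threshold 128
input_vector = np.append(binary.ravel(), 1.0)      # 784 pixels + 1 bias input
assert input_vector.shape == (785,)
```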
You can check out an example here (shameless plug) that tries to recognize handwritten digits. I trained the network using data from MNIST.
Expect to spend some time getting up to speed on neural network concepts if you decide to go this route; it took me at least 3-4 days of reading and writing code before I actually understood them. A good resource is heatonresearch.com. I recommend starting by implementing neural networks that simulate the AND, OR, and XOR boolean operations (using a threshold activation function); this should give you an idea of the basic concepts. When it comes down to actually training your network, trying to train a network that recognizes the XOR operator is a good introduction to learning algorithms.
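As a toy sketch of that starting point, here is a single neuron with a threshold activation computing boolean AND, with hand-picked weights rather than trained ones:

```python
import numpy as np

w = np.array([1.0, 1.0])    # one weight per boolean input (hand-picked)
b = -1.5                    # the neuron fires only when both inputs are 1

def neuron(x):
    return int(w @ x + b > 0)    # threshold activation

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(np.array(x)))    # prints 1 only for (1, 1)
```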
When it comes to building the neural network, you can use existing frameworks like Encog, but I found it far more satisfying to build the network myself (you learn more that way, I think). If you want to look at some source, you can check out a project I have on GitHub (shameless plug) that has some basic classes in Java to help you build and train simple neural networks.
Good luck!
EDIT
I've found a few sources that use k-nearest-neighbors for digit and/or character recognition:
Bangla Basic Character Recognition Using Digital Curvelet Transform (the curvelet coefficients of an original image, as well as its morphologically altered versions, are used to train separate k-nearest-neighbor classifiers; the output values of these classifiers are fused using a simple majority-voting scheme to arrive at a final decision)
The Homepage of Nearest Neighbors and Similarity Search
Fast and Accurate Handwritten Character Recognition using Approximate Nearest Neighbors Search on Large Databases
Nearest Neighbor Retrieval and Classification
For resources on Neural Networks, I found the following links to be useful:
CS-449: Neural Networks
Artificial Neural Networks: A neural network tutorial
An introduction to neural networks
Neural Networks with Java
Introduction to backpropagation Neural Networks
Momentum and Learning Rate Adaptation (this page goes over a few enhancements to the standard backpropagation algorithm that can result in faster learning)
Have you checked Detexify? I think it does pretty much what you want:
http://detexify.kirelabs.org/classify.html
It is open source, so you could take a look at how it is implemented.
You can get the code here (if I recall correctly, it is in Haskell):
https://github.com/kirel/detexify-hs-backend
In particular what you are looking for should be in Sim.hs
I hope it helps
Addendum
If you have not implemented machine learning algorithms before, you should really check out www.ml-class.org.
It's a free class taught by Andrew Ng, Director of the Stanford Machine Learning Centre. The course is taught entirely online and focuses specifically on implementing a wide range of machine learning algorithms. It does not go too deeply into the theoretical intricacies of the algorithms, but rather teaches you how to choose, implement, and use them, and how to diagnose their performance. It is unique in that your implementations of the algorithms are checked automatically! It's great for getting started in machine learning, as you get instantaneous feedback.
The class also includes at least two exercises on recognising handwritten digits: Programming Exercise 3 (with multinomial classification) and Programming Exercise 4 (with feed-forward neural networks).
The class started a while ago, but it should still be possible to sign up. If not, a new run should start early next year. If you want to be able to check your implementations, you need to sign up for the "Advanced Track".
One way to implement handwriting recognition
The answer to this question depends on a number of factors, including what kind of resource constraints you have (e.g., an embedded platform) and whether you have a good library of correctly labelled symbols, i.e., different examples of a handwritten letter for which you know which letter they represent.
If you have a decent sized library, implementation of a quick and dirty standard machine learning algorithm is probably the way to go. You can use multinomial classifiers, neural networks or support vector machines.
I believe a support vector machine would be the fastest to implement, as there are excellent libraries out there that handle the machine learning portion of the code for you, e.g. libSVM. If you are familiar with using machine learning algorithms, this should take you less than 30 minutes to implement.
The basic procedure you would probably want to implement is as follows:
Learning what symbols "look like"
Binarise the images in your library.
Unroll the images into vectors / 1-D arrays.
Pass the "vector representation" of the images in your library and their labels to libSVM to get it to learn how the pixel coverage relates to the represented symbol for the images in the library.
The algorithm gives you back a set of model parameters which describe the recognition algorithm that was learned.
You should repeat 1-4 for each character you want to recognise to get an appropriate set of model parameters.
Note: you only have to carry out steps 1-4 once for your library (but once for each symbol you want to recognise). You can do this on your development machine and only include the resulting parameters in the code you ship/distribute.
If you want to recognise a symbol:
Each set of model parameters describes an algorithm which tests whether a symbol represents one specific character or not. You "recognise" a symbol by testing all the models on the current symbol and then selecting the model that best fits it.
This testing is done by again passing the model parameters, together with the symbol to test in unrolled form, to the SVM library, which will return the goodness-of-fit for the tested model.
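Here is a minimal sketch of the whole procedure, with scikit-learn's LinearSVC standing in for libSVM and random placeholder data; the decision value plays the role of the goodness-of-fit:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = (rng.random((300, 784)) > 0.5).astype(float)   # binarised, unrolled images
y = rng.integers(0, 3, size=300)                   # placeholder labels, 3 symbols

models = {}
for symbol in np.unique(y):
    clf = LinearSVC()                              # libSVM would work the same way
    clf.fit(X, (y == symbol).astype(int))          # "this symbol" vs "not"
    models[symbol] = clf

def recognise(x):
    # test every per-symbol model and pick the best goodness-of-fit
    scores = {s: m.decision_function(x[None, :])[0] for s, m in models.items()}
    return max(scores, key=scores.get)

print(recognise(X[0]))
```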
