How does vowpal wabbit works with gradient descend algorithm? - vowpalwabbit

I have been trying to understand vowpal wabbit algorithm.
Is anyone can help me to understand VW and how to implement it

Vowpal Wabbit is focused on online learning (though it can do also batch L-BFGS) and it's main algorithm is Stochastic gradient descent with several (optional, but included in the default) improvements (adaptive, normalized updates, clever importance weighting,...).
The algorithm is described on slides 5 and 11 of the tutorial.
There is no need to implement it, it's already implemented:-) and it's super fast and memory efficient. Therefore, the code contains many optimization tricks, so it's not really suitable for inspection by beginners.

Related

How does a Bayesian Linear Regression work on a non-randomized data set of traffic intensities?

I am trying to predict the intensities of each lane for the next 15 minutes (a part of my thesis research). I have a data set with the intensities of each lane of each 15 minutes of the past 3 months. I have used 6 different Machine Learning algorithms in Azure Machine Learning to check which one predicts the most accurately. I picked the Bayesian Linear Regression to describe the algorithm and what it does step-by-step.
It is still unclear to me how the algorithm works, because I am not good in detailed maths. This is why I used a cloud-computing ready Machine Learning tool to do the work for me. I have seen some sources and explanations on the internet, but they are all too mathematical to me and I still don't get it.
My trained model looks like this when I click on 'Visualize':
My evaluation model:
My question is if someone would like to explain the Bayesian Linear Regression algorithm like I'm a dummy. And why do the multiple other features of the data set that I included in the algorithm, influence the prediction?
Maybe you should start learning what is linear regression first then move to Bayesian Linear. The idea of linear regression is to use a function to
draw a line that passes through the middle of your data. Simpler then that without maths would be hard to explain.
Linear Regression
Linear Regression
https://www.youtube.com/watch?v=zPG4NjIkCjc

Sentiment Analysis using tensorflow

I am exploring tensorflow and would like to do sentiment analysis using the options available. I had a look at the following tutorial http://www.tensorflow.org/tutorials/recurrent/index.html#language_modeling
I have worked woth Naive Bayes Classifier, Maximum Entropy Algorithm and Scikit Learn Classifier and would like to know if there are any better algorithms offered by tensorflow. Is this the right place to start or are there any other options?
Any help pointing in the right direction would be greatly appreciated.
Thanks in advance.
A commonly used approach would be using a Convolutional Neural Network (CNN) to do sentiment analysis. You can find a great explanation/tutorial in this WildML blogpost. The accompanying TensorFlow code can be found here.
Another approach would be using an LSTM (or related network), you can find example implementations online, a good starting point is this blogpost.
I would suggest you try a character-level LSTM, it's been shown to be able to achieve state-of-the-art results in many text classification tasks one of them being sentiment analysis.
I wrote a pretty lengthy article that you can find here where I go through it's implementation in TensorFlow line by line. The result is a model that is less than 100mb in size and that achieves an accuracy of over 80% on a test set of 80,000 tweets.
Another approach that has proven to be very effective is to use a recursive neural network, you can read the paper from Stanford NLP Group here
For me, the easiest tutorial to follow was: https://pythonprogramming.net/data-size-example-tensorflow-deep-learning-tutorial/?completed=/train-test-tensorflow-deep-learning-tutorial/
It walks you throughTensorFlow.train.AdamOptimizer().minimize(cost) and uses Sentiment140 dataset (from Stanford, ~1 mil examples of positive and negative sentiment)

How to use custom loss function (PU Learning)

I am currently exploring PU learning. This is learning from positive and unlabeled data only. One of the publications [Zhang, 2009] asserts that it is possible to learn by modifying the loss function of an algorithm of a binary classifier with probabilistic output (for example Logistic Regression). Paper states that one should optimize Balanced Accuracy.
Vowpal Wabbit currently supports five loss functions [listed here]. I would like to add a custom loss function where I optimize for AUC (ROC), or equivalently, following the paper: 1 - Balanced_Accuracy.
I am unsure where to start. Looking at the code reveals that I need to provide 1st, 2nd derivatives and some other info. I could also run the standard algorithm with Logistic loss but trying to adjust l1 and l2 according to my objective (not sure if this is good). I would be glad to get any pointers or advices on how to proceed.
UPDATE
More search revealed that it is impossible/difficult to optimize for AUC in online learning: answer
I found two software suites that are immediately ready to do PU learning:
(1) SVM perf from Joachims
Use the ``-l 10'' option here!
(2) Sofia-ml
Use ``--loop_type roc'' option here!
In general you set +1'' labels to your positive examples and-1'' to all unlabeled ones. Then you launch the training procedure followed by prediction.
Both softwares give you some performance metrics. I would suggest to use standardized and well established binary from KDD`04 cup: ``perf''. Get it here.
Hope it helps for those wondering how this works in practice. Perhaps I prevented the case XKCD

How do people prove the correctness of Computer Vision methods?

I'd like to pose a few abstract questions about computer vision research. I haven't quite been able to answer these questions by searching the web and reading papers.
How does someone know whether a computer vision algorithm is correct?
How do we define "correct" in the context of computer vision?
Do formal proofs play a role in understanding the correctness of computer vision algorithms?
A bit of background: I'm about to start my PhD in Computer Science. I enjoy designing fast parallel algorithms and proving the correctness of these algorithms. I've also used OpenCV from some class projects, though I don't have much formal training in computer vision.
I've been approached by a potential thesis advisor who works on designing faster and more scalable algorithms for computer vision (e.g. fast image segmentation). I'm trying to understand the common practices in solving computer vision problems.
You just don't prove them.
Instead of a formal proof, which is often impossible to do, you can test your algorithm on a set of testcases and compare the output with previously known algorithms or correct answers (for example when you recognize the text, you can generate a set of images where you know what the text says).
In practice, computer vision is more like an empirical science: You gather data, think of simple hypotheses that could explain some aspect of your data, then test those hypotheses. You usually don't have a clear definition of "correct" for high-level CV tasks like face recognition, so you can't prove correctness.
Low-level algorithms are a different matter, though: You usually have a clear, mathematical definition of "correct" here. For example if you'd invent an algorithm that can calculate a median filter or a morphological operation more efficiently than known algorithms or that can be parallelized better, you would of course have to prove it's correctness, just like any other algorithm.
It's also common to have certain requirements to a computer vision algorithm that can be formalized: For example, you might want your algorithm to be invariant to rotation and translation - these are properties that can be proven formally. It's also sometimes possible to create mathematical models of signal and noise, and design a filter that has the best possible signal to noise-ratio (IIRC the Wiener filter or the Canny edge detector were designed that way).
Many image processing/computer vision algorithms have some kind of "repeat until convergence" loop (e.g. snakes or Navier-Stokes inpainting and other PDE-based methods). You would at least try to prove that the algorithm converges for any input.
This is my personal opinion, so take it for what it's worth.
You can't prove the correctness of most of the Computer Vision methods right now. I consider most of the current methods some kind of "recipe" where ingredients are thrown down until the "result" is good enough. Can you prove that a brownie cake is correct?
It is a bit similar in a way to how machine learning evolved. At first, people did neural networks, but it was just a big "soup" that happened to work more or less. It worked sometimes, didn't on other cases, and no one really knew why. Then statistical learning (through Vapnik among others) kicked in, with some real mathematical backup. You could prove that you had the unique hyperplane that minimized a particular loss function, PCA gives you the closest matrix of fixed rank to a given matrix (considering the Frobenius norm I believe), etc...
Now, there are still a few things that are "correct" in computer vision, but they are pretty limited. What comes to my mind is the wavelet : they are the sparsest representation in an orthogonal basis of function. (i.e : the most compressed way to represent an approximation of an image with minimal error)
Computer Vision algorithms are not like theorems which you can prove, they usually try to interpret the image data into the terms which are more understandable to us humans. Like face recognition, motion detection, video surveillance etc. The exact correctness is not calculable, like in the case of image compression algorithms where you can easily find the result by the size of the images.
The most common methods used to show the results in Computer Vision methods(especially classification problems) are the graphs of precision Vs recall, accuracy Vs false positives. These are measured on standard databases available on various sites. Usually the harsher you set the parameters for correct detection, the more false positives you generate. The typical practice is to choose the point from the graph according to your requirement of 'how many false positives are tolerable for the application'.

Which optimization algorithm should I use to optimize the weights of a multilayer perceptron?

Actually these are 3 questions:
Which optimization algorithm should I use to optimize the weights of a multilayer perceptron, if I knew...
1) only the value of the error function? (blackbox)
2) the gradient? (first derivative)
3) the gradient and the hessian? (second derivative)
I heard CMA-ES should work very well for 1) and BFGS for 2) but I would like to know if there are any alternatives and I don't know wich algorithm to take for 3).
Ok, so this doesn't really answer the question you initially asked, but it does provide a solution to the problem you mentioned in the comments.
Problems like dealing with a continuous action space are normally not dealt with via changing the error measure, but rather by changing the architecture of the overall network. This allows you to keep using the same highly informative error information while still solving the problem you want to solve.
Some possible architectural changes that could accomplish this are discussed in the solutions to this question. In my opinion, I'd suggest using a modified Q-learning technique where the state and action spaces are both represented by self organizing maps, which is discussed in a paper mentioned in the above link.
I hope this helps.
I solved this problem finally: there are some efficient algorithms for optimizing neural networks in reinforcement learning (with fixed topology), e. g. CMA-ES (CMA-NeuroES) or CoSyNE.
The best optimization algorithm for supervised learning seems to be Levenberg-Marquardt (LMA). This is an algorithm that is specifically designed for least square problems. When there are many connections and weights, LMA does not work very well because the required space is huge. In this case I am using Conjugate Gradient (CG).
The hessian matrix does not accelerate optimization. Algorithms that approximate the 2nd derivative are faster and more efficient (BFGS, CG, LMA).
edit: For large scale learning problems often Stochastic Gradient Descent (SGD) outperforms all other algorithms.

Resources