RNTN: stop training early on convergence? - stanford-nlp

I'm currently training some sentiment analysis models with the RNTN within CoreNLP. With the default settings, training runs for 400 iterations which takes a long time. Is there some way to stop training earlier, e.g. if the error does not get smaller? Is there code which allows this?
In the 2013 paper by Socher et al., there is a sentence stating that the RNTN converges after a few hours of training. Can I exploit this?
edit for clarification:
The paper I am referring to is "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" by Socher et al, EMNLP 2013. The RNTN I refer to is part of the Stanford CoreNLP package.
To rephrase and clarify my question:
How can I make edu.stanford.nlp.sentiment.SentimentTraining stop training when the model is "good enough" (for some criterion) instead of going through all 400 iterations?

Unfortunately, the code does not automatically detect when it is no longer improving in order to terminate the run early. However, it does output intermediate models. If you train using a dev set, you can keep the model with the highest dev score at the end of the run.
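One way to act on this after the fact is to record the dev score of each intermediate model and pick the best one yourself. The sketch below is not part of CoreNLP: the function name and the `patience` parameter are made up for illustration, and it assumes you have already collected one dev score per saved model (for example from the training log).

```python
from typing import Optional, Sequence, Tuple


def pick_best_checkpoint(
    dev_scores: Sequence[float], patience: Optional[int] = None
) -> Tuple[int, float, int]:
    """Return (best_index, best_score, last_index_examined).

    dev_scores[i] is the dev-set score of the i-th saved model.
    With patience=None every checkpoint is considered. With an integer
    patience, scanning stops once that many checkpoints in a row have
    failed to beat the best score seen so far.
    """
    if not dev_scores:
        raise ValueError("dev_scores must not be empty")
    best_index, best_score = 0, dev_scores[0]
    last_index = 0
    for i, score in enumerate(dev_scores):
        last_index = i
        if score > best_score:
            best_index, best_score = i, score
        elif patience is not None and i - best_index >= patience:
            break  # no improvement for `patience` checkpoints
    return best_index, best_score, last_index
```

With `patience` set, `last_index_examined` tells you roughly how long you would have needed to train before giving up, which can guide the iteration count for later runs.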

Related

How `vw --audit` internally computes the weights of the features?

In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
From what I understand Vowpal Wabbit tries to fit one linear model to each arm.
So if the weights were calculated as an average across all the arms, they would correlate with getting a reward in general, rather than showing which features make the model pick one variant over another.
I am interested in finding out how they are calculated so that I can interpret the results obtained. I tried searching its GitHub repository but could not find anything meaningful.
I am interested in finding out how they are calculated so that I can interpret the results obtained.
Unfortunately knowing the first does not lead to knowing the second.
Your question is concerned with contextual bandits, but it is important to note that interpreting model parameters is an issue that also occurs in supervised learning. Machine learning has made progress recently (i.e., my lifetime) largely by focusing concern on quality of predictions rather than meaningfulness of model parameters. In a blog post, Phoebe Wong outlines the issue while being entertaining.
The bottom line is that our models are not causal, so you simply cannot conclude that "the weight of feature X for arm A is large" means "if I were to intervene in the system and increase this feature value, I would get more reward for playing arm A".
We are currently working on tools for model inspection that leverage techniques such as permutation importance that will help you answer questions like "if I were to stop using a particular feature how would the frequency of playing each arm change for the trained policy". We're hoping that is helpful information.
Having said all that, let me try to answer your original question ...
In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
The format is documented here. Assuming you are using --cb (not --cb_adf) then there are a fixed number of arms and so the offset field will increment over the arms. So for an example like
1:2:0.4 |foo bar
with --cb 4 you'll get an audit output with namespace of foo, feature of bar, and offset of 0, 1, 2, and 3.
Interpreting the output when using --cb_adf is possible but difficult to explain succinctly.
From what I understand Vowpal Wabbit tries to fit one linear model to each arm.
Shorter answer: With --cb_type dm, essentially VW independently tries to predict the average reward for each arm using only examples where the policy played that arm. So the weight you get from audit at a particular offset N is analogous to what you would get from a supervised learning model trained to predict reward on a subset of the historical data consisting solely of times the historical policy played arm N. With other --cb_type settings the interpretation is more complicated.
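To make the "one supervised model per arm" analogy concrete, here is a minimal sketch in plain Python. It is not VW's code: the function name, the SGD settings and the data layout are assumptions chosen for illustration. Each arm gets its own weight table, and an example only updates the table of the arm that was actually played.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One logged interaction: (features, arm played, observed reward).
Example = Tuple[Dict[str, float], int, float]


def fit_per_arm(
    examples: Iterable[Example], num_arms: int, lr: float = 0.1, epochs: int = 50
) -> List[Dict[str, float]]:
    """Fit one linear reward model per arm with plain SGD on squared loss."""
    data = list(examples)
    weights: List[Dict[str, float]] = [defaultdict(float) for _ in range(num_arms)]
    for _ in range(epochs):
        for features, arm, reward in data:
            w = weights[arm]  # only the played arm's model sees this example
            prediction = sum(w[name] * value for name, value in features.items())
            error = prediction - reward
            for name, value in features.items():
                w[name] -= lr * error * value
    return weights
```

In this picture the list index plays the role of the audit offset: the weight of a feature at index N only reflects the rounds in which arm N was played.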
Longer answer: "Linear model" refers to the representation being used. VW can incorporate nonlinearities into the model but let's ignore that for now. "Fit" is where some important details are. VW takes the partial feedback information of a CB problem (partial feedback = "for this example you don't know the reward of the arms not pulled") and reduces it to a full feedback supervised learning problem (full feedback = "for this example you do know the reward of all arms"). The --cb_type argument selects the reduction strategy. There are several papers on the topic; a good place to start is Dudik et al., and then look for papers that cite it. In terms of code, ultimately things are grounded here, but the code is written more for performance than intelligibility.
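As an illustration of what such a reduction can look like, the sketch below implements textbook inverse propensity scoring (IPS), one of the estimators discussed in that literature. It is a simplified stand-in, not VW's implementation; the function name is hypothetical, and the arguments mirror the `action:cost:probability` label of the example line above.

```python
from typing import List


def ips_cost_vector(action: int, cost: float, probability: float, num_arms: int) -> List[float]:
    """Estimate a cost for every arm from one logged example using IPS.

    `action` is 1-based, as in the `action:cost:probability` label.
    The played arm gets cost / probability; all other arms get 0.
    """
    if not 1 <= action <= num_arms:
        raise ValueError("action out of range")
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    estimates = [0.0] * num_arms
    estimates[action - 1] = cost / probability
    return estimates
```

Dividing by the logging probability compensates for arms that the historical policy rarely played, so that averaged over many examples the estimate for each arm is unbiased.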

Is there any sentiment forum dataset for unsupervised training available?

I recently finished a machine learning course and would like to make a forum sentiment analysis tool, to apply it in stock-related forums.
The idea is to:
Capture (text mining) users with their comments, and evaluate their comment's sentiment (positive, negative, neutral).
Capture what happens (stock market) after those comments, and assign a weight to the user accordingly (bigger weight if the user's sentiments is spot-on and the market follows the same direction)
Use the comments as a tool to predict market direction.
Actually, I do this myself (paying attention to forums) plus my own technical analysis and the obligatory due diligence, and it has been working very well for me. I just wanted to try to automate it a little bit and maybe even allow a program to play with some of my accounts (paper trading first, and if it performs decently, assign some money in a real account).
This would be my first machine learning project (just as a proof-of-concept) so any comments would be very kindly appreciated.
The biggest problem I have found is that I would like to use unsupervised training, and I need a sample dataset to do the training.
Question: Is there any known forum-sentiment dataset available to be used for unsupervised training?
I've found several sentiment datasets (Twitter, IMDb, Amazon reviews), but they are very specific to their niche (short messages, movies, products...), and I'm looking for something more general.
Since you are looking for an unsupervised approach, you can use any set of data that matches your "real case scenario". Text mining and sentiment analysis are often tailored to the problem at hand, so it is easiest to start directly with the real data. The best approach is to build a scraper that directly grabs the forum posts you want to analyze. You can build the scraper easily enough with Python (beautifulsoup/selenium). There are plenty of good tutorials online, e.g.: https://www.dataquest.io/blog/web-scraping-tutorial-python/
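As a rough idea of the extraction step, here is a sketch that uses only the standard library's `html.parser` instead of beautifulsoup, so that it runs without extra installs. The `post` class name is a hypothetical placeholder; a real forum will use its own markup, and downloading the pages is left out.

```python
from html.parser import HTMLParser
from typing import List


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> element (class name is hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        self.posts: List[str] = []
        self._depth = 0  # > 0 while inside a post <div>
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # nested <div> inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.posts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_posts(html: str) -> List[str]:
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

The resulting list of strings is the raw corpus you would then feed into whatever unsupervised method you choose.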

4 fold cross validation | Caffe

So I am trying to perform a 4-fold cross validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation. I repeat this three more times until every quarter has had a chance to be the validation set at least once.
Now after training I have four caffemodels. I test the models on my validation sets. I am getting different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, then you re-train the model on all the data available, because you don't want to waste any of it.
The n-fold cross validation is a practical technique to get some insights on the learning and generalization properties of the model you are trying to train, when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the open-source book Introduction to Statistical Learning, Chapter 5.
The general rule says that after you have trained your n models, you average the prediction error (MSE, accuracy, or whatever metric you use) to get a general idea of the performance of that particular model (in your case maybe the network architecture and learning strategy) on that dataset.
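The bookkeeping for this is small. The sketch below is framework-agnostic (it does not touch Caffe): it builds the k train/validation index splits and then summarises the per-fold scores you measured. Function names are illustrative only.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def kfold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, validation) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    # Fold i holds every k-th sample starting at i.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def summarize(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)
```

The mean is the number to report for the architecture and training recipe; a large standard deviation across folds is a hint that the individual models are not directly comparable, so picking "the best of four" would mostly reward a lucky split.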
The main idea is to assess the models learned on the training splits by checking whether they have an acceptable performance on the validation set. If they do not, then your models probably overfitted the training data. If the errors on both the training and validation splits are high, then the models should be reconsidered, since they don't have predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio who says that for the kind of problem deep learning is meant for, you usually have enough data to simply go with a training/test split. In this case this answer on Stackoverflow could be useful to you.

Natural Language Processing for Smart Homes

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend on the NLP problem, you may use the Wit API (http://wit.ai), which maps natural language sentences to JSON.
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am in no way a pioneer in NLP (I love it though), but let me try my hand at this one. For your project I would suggest you look at the Stanford Parser (SP).
From your problem definition I guess you don't need anything other than verbs and nouns. SP generates part-of-speech (POS) tags that you can use to prune the words you don't require.
For this I can't think of any better option than what you have in mind right now.
For this, again, you can use the grammatical dependency structure from SP, and I am pretty sure that it is good enough to tackle this problem.
This is where your research part lies. I guess you can find enough patterns using grammatical dependencies (GD) and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm would be efficient enough to handle every set of input sentences (structured + unstructured), but something that is more than 85% accurate should be good enough for you.
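The dictionary idea for unnecessary words and synonyms can be prototyped in a few lines before bringing in a parser. The sketch below is only an illustration: the word lists are hypothetical and tiny, and it uses simple lookups rather than POS tags or dependencies.

```python
from typing import Dict, List

# Hypothetical word lists for illustration; a real system would load them from a file.
STOP_WORDS = {"please", "the", "a", "an"}
SYNONYMS: Dict[str, str] = {
    "enable": "turn_on",
    "switch_on": "turn_on",
    "disable": "turn_off",
    "lamp": "light",
    "lamps": "light",
    "lights": "light",
}
# Two-word verbs are merged into one token before the synonym lookup.
PHRASES = {("turn", "on"): "turn_on", ("turn", "off"): "turn_off", ("switch", "on"): "switch_on"}


def normalize(sentence: str) -> List[str]:
    """Lower-case, drop stop words, merge two-word verbs, map synonyms to a canonical token."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    merged: List[str] = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return [SYNONYMS.get(w, w) for w in merged]
```

Note that "my" is deliberately not in the stop-word list, so "turn off my lights" keeps the word that carries meaning in the question's example.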
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of English text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
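The final decision step described here is simple enough to sketch directly. The command identifiers and the function name below are hypothetical; the confidences are assumed to come from whatever model produces the mapping.

```python
from typing import Dict, Optional


def choose_command(confidences: Dict[str, float], threshold: float = 0.70) -> Optional[str]:
    """Return the highest-confidence command id, or None if it does not clear the threshold."""
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    return best if confidences[best] > threshold else None
```

Returning `None` gives the caller a natural place to ask the user to repeat or rephrase the command instead of guessing.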
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be one of the supervised ones. To generate training data, have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work, so other people can help. You can then use these as your training set for your ML algorithm.

Weather prediction algorithm variety

Currently there's a big 'storm' over the predictions by the MetOffice in the UK. They predicted a mild, wet winter, while we have the coldest temperature on record in Northern Ireland and solid snow on the ground, normally rare in December.
It's something I'd love to have a play with, not that I'm claiming I can beat them, but was wondering what algorithms are out there currently that people are working with? What datasets do they base it on?
Possibilities presumably include neural networks modelling input with fitness being the accuracy of the prediction, complex mathematical models, or even the 'same as yesterday' prediction, which I've heard claimed (although not seen evidence) is more reliable for single-day prediction (although it obviously drops off after that).
Ideally I'd like to hear from some developers in weather centres or who get access to the supercomputers; it'd be interesting to hear approaches...
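For anyone who wants to play with this, the 'same as yesterday' prediction mentioned above is a useful baseline to measure any other approach against. A minimal sketch, assuming you have a list of daily observations for one station and one variable (function names are illustrative):

```python
from typing import List, Sequence


def persistence_forecast(observations: Sequence[float]) -> List[float]:
    """Forecast for day i+1 is simply the observation on day i."""
    return list(observations[:-1])


def mean_absolute_error(forecasts: Sequence[float], actuals: Sequence[float]) -> float:
    """Average absolute difference between forecasts and what actually happened."""
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(forecasts)
```

Compare `persistence_forecast(obs)` against `obs[1:]`; a more elaborate model is only worth keeping if its error is lower than this baseline on the same days.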
In short, if you intend to build and run your own forecasting model, you will face three major problems:
Access to observations
Development of a mathematical model
Computational power to run your model
Access to observations
As far as I know, access to good meteorological observations costs a lot of money.
You need to have observations from all over the globe and model the state of oceans and atmosphere for the whole planet. Alternatively, you need to obtain so-called lateral boundary conditions from someone who calculates a global model.
Development of a mathematical model
I'm not and I've never been affiliated with Met Office, but I used to port and optimize a version of their Unified Model to a supercomputer at our center a couple of years ago. Here's how I remember the model.
Met Office has been developing their Unified Model for the last 20+ years, we're talking about millions of lines of code that contain state of the art ocean/atmospheric models and numerical algorithms. Check out this section of (outdated) User Guide for a glimpse of scientific methods used in their model. It's a fruit of, give or take, half a century of well-funded, extensive research by a large community of smart people. If there was a simple solution that would consistently give better results than the complex models, someone would've probably implemented it by now.
To conclude, I guess it's very hard to get even remotely satisfactory results in weather forecasting by building a model from scratch, unless you're a MSc/PhD in atmospheric physics and you've got a couple of years of free time on your hands.
Computational power to run your model
The first forecasting models were run in the middle of the 20th century on machines that cannot match today's cellphones, so, technically, you could calculate something on your PC. However, this type of job is often done on very, very powerful machines. In fact, 10 systems in the Top500 are dedicated solely to weather forecasting and climate research.
Interesting reads
http://en.wikipedia.org/wiki/Weather_forecasting#How_models_create_forecasts
http://en.wikipedia.org/wiki/Numerical_weather_prediction
http://research.metoffice.gov.uk/research/nwp/numerical/operational/index.html
http://ncas-cms.nerc.ac.uk/html_umdocs/UM55_User_Guide/
UPDATE It's possible to obtain the source code of the WRF model for free, together with some met data. Note that WRF, Unified Model, COAMPS, and many other models are written primarily in Fortran.
First off, you can import raw data from http://tgftp.nws.noaa.gov and other weather data sources. The best way for the computer to understand the data is to put it on a map, where each point interacts with the others. The data at each point can represent temperature, pressure, wind speed and direction, cloud coverage, the position of the sun in the sky, visibility, and the last 100 hours of precipitation. You could make predictions, then compare them later to the actual weather as well as the Weather Service's predictions, and then update a climate model for that data point. That way, it could be a self-learning neural network. As far as computation power is concerned, get a Titan, Big Mac!
It seems to be possible to construct a simple forecast model. My watch features a barometer and a thermometer (which is not usable at all, because the watch is warmed by the hand). Based solely on those measurements, it has several times warned me of incoming rain, in spite of sunny forecasts from internet sites. (the cloud picture at upper left corner)
A quick search leads us to the Sager Algorithm, which uses only very simple input data. However, while the implementation claims to be open-source, I have failed to locate either the code or scientific papers on the algorithm.

Resources