How to train on very small data set? - rasa-nlu

We are trying to understand the underlying model of Rasa - the forums there still didnt get us an answer - on two main questions:
we understand that Rasa model is a transformer-based architecture. Was it
pre-trained on any data set? (eg wikipedia, etc)
then, if we
understand correctly, the intent classification is a fine tuning task
on top of that transformer. How come it works with such small
training sets?
appreciate any insights!
thanks
Lior

the transformer model is not pre-trained on any dataset. We use quite a shallow stack of transformer which is not as data hungry as deeper stacks of transformers used in large pre-trained language models.
Having said that, there isn't an exact number of data points that will be sufficient for training your assistant as it varies by the domain and your problem. Usually a good estimate is 30-40 examples per intent.

Related

AI bias in the sentiment analysis

Using sentiment analysis API and want to know how the AI bias that gets in through the training set of data and other biases quantified. Any help would be appreciated.
There are several tools developed to deal with it:
Fair Learn https://fairlearn.github.io/
Interpretability Toolkit https://learn.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability
In Fair Learn you can see how biased a ML model is after it has been trained with the data set and choose a maybe less accurate model which performs better with biases. The explainable ML models provide different correlation of inputs with outputs and combined with Fair Learn can give an idea of the health of the ML model.

CNN features for classification

I am new to deep learning and I hope you guys can help me.
The following site uses CNN features for multi-class classification:
https://www.mathworks.com/help/deeplearning/examples/feature-extraction-using-alexnet.html
This example extracts features from fully connected layer and the extracted features are fed to ECOC classifier.
In this example, regarding to the whole dataset, there are total 15 samples in each category and in the training dataset, there are 11 samples in each category.
My question are related to the dataset size: If I want to use cnn features for ECOC classification as above example, it must be required to have the number of samples in each category the same?
If so, would you like to explain why?
If not, would you like to show the reference papers which have used different numbers?
Thank you.
You may want to have a balanced dataset to prevent your model from learning a wrong probability distribution. If a category represents 95% of your dataset, a model that classifies everything as part of that category, will have an accuracy of 95%.

Is there pre-trained doc2vec model?

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
I don't know of any good one. There's one linked from this project, but:
it's based on a custom fork from an older gensim, so won't load in recent code
it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's unclear what's been discarded
While it takes a long time and significant amount of working RAM, there is a Jupyter notebook demonstrating the creation of a Doc2Vec model from Wikipedia included in gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
So, I would recommend fixing the mistakes in your attempt. (And, if you succeed in creating a model, and want to document it for others, you could upload it somewhere for others to re-use.)
Yes!
I could find two pre-trained doc2vec models at this link
but still could not find any pre-trained doc2vec model which is trained on tweets

4 fold cross validation | Caffe

So I trying to perform a 4-fold cross validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation. I repeat this three more times till all the quarters are given a chance to be the validation set, atleast once.
Now after training I have four caffemodels. I test the models on my validation sets. I am getting different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, then you re-train the model on all the data available, because you don't want to waste any of them.
The n-fold cross validation is a practical technique to get some insights on the learning and generalization properties of the model you are trying to train, when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the open-source book Introduction to Statistical Learning, Chapter 5.
The general rule says that after you trained your n models, you average the prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case maybe the network architecture and learning strategy) on that dataset.
The main idea is to assess the models learned on the training splits checking if they have an acceptable performance on the validation set. If they do not, then your models probably overfitted tha training data. If both the errors on training and validation splits are high, then the models should be reconsidered, since they don't have predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio who says that for the kind of problem deep learning is meant for, you usually have enough data to simply go with a training/test split. In this case this answer on Stackoverflow could be useful to you.

What are techniques and practices on measuring data quality?

If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?
An example would be if I have a crate holding 12 widgets, and I know each widget weighs 1 lb, there should be some data quality 'check' making sure the case weighs 13 lbs maybe.
Another example would be that if I have a lamp and an image representing that lamp, it should look like a lamp. Perhaps the image dimensions should have the same ratio of the lamp dimensions.
With the exception of images, my data is 99% text (which includes height, width, color...).
I've studied AI in school, but have done very little outside of that.
Are standard AI techniques the way to go? If so, how do I map a problem to an algorithm?
Are some languages easier at this than others? Do they have better libraries?
thanks.
Your question is somewhat open-ended, but it sounds like you want is what is known as a "classifier" in the field of machine learning.
In general, a classifier takes a piece of input and "classifies" it, ie: determines a category for the object. Many classifiers provide a probability with this determination, and some may even return multiple categories with probabilities on each.
Some examples of classifiers are bayes nets, neural nets, decision lists, and decision trees. Bayes nets are often used for spam classification. Emails are classified as either "spam" or "not spam" with a probability.
For you question you'd want to classify your objects as "high quality" or "not high quality".
The first thing you'll need is a bunch of training data. That is, a set of objects where you already know the correct classification. One way to obtain this could be to get a bunch of objects and classify them by hand. If there are too many objects for one person to classify you could feed them to Mechanical Turk.
Once you have your training data you'd then build your classifier. You'll need to figure out what attributes are important to your classification. You'll probably need to do some experimentation to see what works well. You then have your classifier learn from your training data.
One approach that's often used for testing is to split your training data into two sets. Train your classifier using one of the subsets, and then see how well it classifies the other (usually smaller) subset.
AI is one path, natural intelligence is another.
Your challenge is a perfect match to Amazon's Mechanical Turk. Divvy your data space up into extremely small verifiable atoms and assign them as HITs on Mechanical Turk. Have some overlap to give yourself a sense of HIT answer consistency.
There was a shop with a boatload of component CAD drawings that needed to be grouped by similarity. They broke it up and set it loose on Mechanical Turk to very satisfying results. I could google for hours and not find that link again.
See here for a related forum post.
This is a tough answer. For example, what defines a lamp? I could google images a picture of some crazy looking lamps. Or even, look up the definition of a lamp (http://dictionary.reference.com/dic?q=lamp). Theres no physical requirements of what a lamp must look like. Thats the crux of the AI problem.
As for data, you could setup Unit testing on the project to ensure that 12 widget() weighs less than 13 lbs in the widetBox(). Regardless, you need to have the data at hand to be able to test things like that.
I hope i was able to answer your question somewhat. Its a bit vauge, and my answers are broad, but hopefully it'll at least send you in a good direction.

Resources