How do you treat multi-class classification use case? - text-classification

I have a list of labelled text. Some have one label, others have 2 and some have even 3. Do you treat this as a multi-class classification problem?

Which type of classification problem you have depends on your goal. I don't know exactly what problem you are trying to solve, but from the shape of your data I presume you are describing a multi-label classification problem.
In any case let's make some clarifications:
Multi-class classification:
you can have many classes (dog, cat, bear, ...), but each sample is assigned to exactly one class; a dog cannot also be a cat.
Multi-label classification:
the goal is to assign a set of labels to each sample. In text classification, for example, the phrase "Today the weather is sunny" may be assigned the set of labels ["weather", "good"].
So, if you need to assign each sample to one class only, based on some metric that can, for example, be tied to the labels, you should use a multi-class algorithm,
but if your goal is to predict all the labels that are appropriate for your sample (text tagging, for example), then we are talking about a multi-label classification problem.
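To make the difference concrete, here is a minimal sketch of how the training targets could look in each case. The toy texts, the label names and the helper to_multi_hot are made up for illustration. A multi-class target is a single class index per sample, while a multi-label target is a 0/1 vector with one slot per label.

```python
# Hypothetical toy data: each text carries one or more labels.
samples = [
    ("Today the weather is sunny", ["weather", "good"]),
    ("Heavy rain expected tomorrow", ["weather"]),
    ("Great match last night", ["sports", "good"]),
]

# Fixed, sorted label vocabulary so every target vector has the same layout.
label_vocab = sorted({label for _, labels in samples for label in labels})


def to_multi_hot(labels, vocab):
    """Multi-label target: one 0/1 slot per label, any number of slots may be 1."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocab]


multi_label_targets = [to_multi_hot(labels, label_vocab) for _, labels in samples]

# Multi-class target for comparison: exactly one class index per sample.
multi_class_target = label_vocab.index("weather")

print(label_vocab)
for (text, _), target in zip(samples, multi_label_targets):
    print(target, text)
```

With targets in the multi-hot form, one common option is to train one binary classifier per label slot.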

Related

Determining the "goodness" of a phrase based on "grammatical" or "contextual" relevancy

Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches - supervised and semi-supervised.
Supervised
Assuming you have a labeled dataset of tuples of the form <sentence> <goodness label> (like the one in your examples), you could first split your dataset up in a train:test fold (e.g. 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural language text). The following piece of code gives you the vector for the sentence the green tree was tall (read more here).
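from transformers import pipeline
import numpy as np

nlp_features = pipeline('feature-extraction')  # loads the pipeline's default pre-trained model
output = nlp_features('the green tree was tall')
np.array(output).shape  # (Samples, Tokens, Vector Size)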
Assuming you vectorize every sentence, you could then train a simple logistic regression model (sklearn) that learns a set of parameters to minimize the prediction errors on the training set. Finally, you run the test-set sentences through this model to see how it behaves.
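One detail worth spelling out: the extracted features have a Tokens axis whose length varies per sentence, while logistic regression expects one fixed-length vector per sample. A common way to bridge that gap (an assumption here, the answer does not say which pooling to use) is to average the token vectors. A minimal sketch with made-up numbers standing in for real BERT output:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Stand-in for one sentence's slice of the extracted features: 3 tokens, 4 dimensions.
# Real BERT vectors are much wider; these numbers are invented for illustration.
fake_tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

sentence_vector = mean_pool(fake_tokens)
print(sentence_vector)
```

Each pooled vector can then serve as one row of the training matrix for the classifier.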
Instead of BERT, you could also use embedded vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labeled data (although you still need a few labeled examples to get started).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
You can use a transformer model from HuggingFace that is fine-tuned for sentence correctness. Specifically, the model has to be fine-tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a Medium article on HuggingFace, transformers, and the fine-tuning process.
You can also get a model that's already fine-tuned and plug it into the text classification pipeline of HuggingFace's transformers library here. That site hosts fine-tuned models, and you can search there for a few others that are fine-tuned for the CoLA task.

Machine learning classifying algorithm with "unknown" class

I understand that if I train an ML classification algorithm on sample pictures of apples, pears and bananas, it will be able to classify new pictures into one of those three categories. But if I provide a picture of a car, it will also classify it into one of those three classes because it has nowhere else to go.
But is there an ML classification algorithm that can tell when an item/picture does not really belong to any of the classes it was trained on? I know I could create an "unknown" class and train it on all sorts of pictures that are neither apples, pears nor bananas, but I assume the training set would need to be huge. That does not sound very practical.
One way to do this can be found in this paper - https://arxiv.org/pdf/1511.06233.pdf
The paper also compares the result generated by simply putting the threshold on the final scores and the (OpenMax) technique proposed by the author.
You should look at One-class classification. This is the problem of learning membership to a class, as opposed to distinguishing between two classes. This is interesting if there are too few examples of a second class ("not-in-class", let's say), or the "not-in-class" class is not well defined.
Where this popped up for me once was classifying Wikipedia articles for being flawed in some way - since it was not clear that an article not flagged as flawed was really not flawed, one approach was one-class classification. I have to add though that for my problem this did not perform well, so you should compare performance with other solutions.
EDIT 02/2019:
I agree with the comments below that the following answer in its original form is not correct. You will absolutely need negative samples to provide some balance to your training dataset; otherwise your model may not learn useful discriminators between positive and negative samples.
That being said, you do not need to train on every possible negative class, only those which may be present when you are performing inference. This is getting more into how you set the problem up and how you plan to use your trained model.
ORIGINAL ANSWER:
Most classification algorithms will output a classification along with a score/certainty measure which indicates how confident that algorithm is that the returned label is correct (based on some internal figuring, this is not an external accuracy evaluation).
If the score is below a certain threshold, you can have it output unknown rather than one of the known classes. There is no need to train with negative examples.
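A minimal sketch of the thresholding idea, with the caveat from the edit above in mind: classifier scores are often overconfident on inputs unlike anything in the training data, so a threshold alone is not a reliable "unknown" detector. The raw scores, the 0.7 threshold and the function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_reject(scores, classes, threshold=0.7):
    """Return the top class, or 'unknown' if its probability is below the threshold."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best] if probs[best] >= threshold else "unknown"


classes = ["apple", "pear", "banana"]
print(predict_with_reject([4.0, 0.5, 0.2], classes))  # one score clearly dominates
print(predict_with_reject([1.0, 0.9, 1.1], classes))  # scores are nearly tied
```

The threshold would normally be tuned on held-out data that includes examples from outside the known classes.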
It certainly helps to have a class of random pictures (without any of the objects you want to detect) labeled as an UNKNOWN class. This prevents lots of false positives and is also best practice. Read here to see it used with AutoML: https://cloud.google.com/vision/automl/docs/prepare

Machine Learning/Artificial Intelligence - Classify column based on the value / pattern

I have been trying some frameworks and algorithms, and I can't find one that does what I want, which is to classify a column of data based on its values.
I tried to use Bayes algorithm, but it isn't very precise because I can't expect that the data that is being searched for is in the training set - but I can expect that the pattern is in the training.
I don't have background in Machine Learning / AI, but I was looking for some working example before really going deeper in the implementation.
I built a smaller ARFF to exemplify. Also tried lots of Weka classifying algorithms but none of them gave me good results.
@relation recommend
@attribute class {name,email,taxid,phone}
@attribute text String
@data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
email,'erik@gmail.com'
email,'steven@spielberg.com'
email,'dogs@cats.com'
taxid,'123611216'
taxid,'123545413'
taxid,'562321677'
taxid,'671312678'
taxid,'123123216'
phone,'438-597-7427'
phone,'478-711-7678'
phone,'321-651-5468'
My expectation is to train on a huge dataset like the one above and get recommendations based on the pattern, e.g.:
joao@bing.com -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit, maybe it's just lack of vocabulary.
Thank you!
What you are trying to do is called named entity recognition (NER). Weka is most likely not much help here. The library Mallet (http://mallet.cs.umass.edu) might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with Weka, you need to change your feature space. Naive Bayes will then do OK on your data as presented.
E.g. add features for:
whether the value contains only letters
whether it is alphanumeric
whether it is numeric
the number of digits
whether it starts capitalized
... (just be creative)
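The feature ideas above could be sketched as follows. The function name value_features, the feature names and the extra "shape" feature are illustrative assumptions, not part of Weka. The point is that the classifier sees the pattern of a value rather than its exact text, so unseen names or phone numbers still produce familiar features.

```python
import re


def value_features(value):
    """Describe the shape of a raw cell value instead of its exact text."""
    tokens = value.split()
    return {
        "only_letters": bool(tokens) and all(tok.isalpha() for tok in tokens),
        "is_alphanumeric": value.isalnum(),
        "is_numeric": value.isdigit(),
        "digit_count": sum(ch.isdigit() for ch in value),
        "starts_capitalized": value[:1].isupper(),
        # Collapse characters into classes: letters become 'A', digits become '9',
        # everything else (spaces, dashes, ...) is kept as-is.
        "shape": re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value)),
    }


for example in ["Joao Vitor", "400-123-5519", "123611216"]:
    print(example, value_features(example))
```

These dictionaries could be written out as additional ARFF attributes, or fed to any classifier that accepts boolean, numeric and nominal features.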

Liblinear how to use it

I'm fairly new to machine learning and text mining in general. I came across a Ruby library called Liblinear: https://github.com/tomz/liblinear-ruby-swig.
What I want to do so far is train the software to identify whether a text mentions anything related to bicycles or not.
Can someone please outline the steps I should follow (e.g. how to preprocess the text), share resources and ideally share a simple example to get me going?
Any help will do, thanks!
The classical approach is:
1. Collect a representative sample of input texts, each labeled as related/unrelated.
2. Divide the sample into training and test sets.
3. Extract all the terms in all the documents of the training set; call this the vocabulary, V.
4. For each document in the training set, convert it into a vector of booleans where the i'th element is true/1 iff the i'th term in the vocabulary occurs in the document.
5. Feed the vectorized training set to the learning algorithm.
6. Now, to classify a document, vectorize it as in step 4 and feed it to the classifier to get a related/unrelated label for it. Compare this with the actual label to see if it went right. You should be able to get at least some 80% accuracy with this simple method.
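The vocabulary and vectorization steps can be sketched in a few lines of plain Python (used here only for illustration; the same logic ports directly to Ruby). The tokenizer and the two toy documents are assumptions, and the resulting vectors are what you would hand to the learning algorithm.

```python
import re


def tokenize(text):
    """Lowercase and split on non-letters; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect every distinct term that occurs in the training documents."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def to_boolean_vector(document, vocabulary):
    """Element i is 1 iff vocabulary term i occurs in the document."""
    terms = set(tokenize(document))
    return [1 if term in terms else 0 for term in vocabulary]


train_docs = ["I ride my bicycle to work", "The stock market fell today"]
vocabulary = build_vocabulary(train_docs)
train_vectors = [to_boolean_vector(doc, vocabulary) for doc in train_docs]

# A new document is vectorized with the training vocabulary;
# terms that were never seen in training are simply dropped.
new_vector = to_boolean_vector("My new bicycle arrived today", vocabulary)
print(vocabulary)
print(new_vector)
```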
To improve this method, replace the booleans with term counts, normalized by document length, or, even better, tf-idf scores.
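For reference, a minimal sketch of one common tf-idf variant: term count normalized by document length, multiplied by the log of the inverse document frequency. Libraries differ in smoothing and normalization details, so treat the exact formula and the toy documents as assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["bicycle", "ride", "bicycle", "road"],
    ["market", "road", "stock", "fell"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document ("road" here) gets weight zero, which is the intended effect: it carries no information for telling documents apart.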

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm?

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is similar, it's not quite the same.
LSI/LSA is ultimately a technique for dimensionality reduction, and is usually coupled with a nearest-neighbor algorithm to turn it into a classification system. In itself, it is only a way of "indexing" the data in a lower dimension using SVD.
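A small sketch of that combination, assuming NumPy is available: a truncated SVD of a toy term-document matrix gives each document a low-dimensional vector, and a 1-nearest-neighbor lookup on top of those vectors does the classifying. The counts, the labels and the choice of k = 2 are made up for illustration.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
counts = np.array([
    [2, 1, 0, 0],   # bicycle
    [1, 2, 0, 0],   # wheel
    [1, 1, 0, 0],   # ride
    [0, 0, 3, 1],   # stock
    [0, 0, 1, 2],   # market
    [0, 0, 1, 1],   # price
], dtype=float)
doc_labels = ["cycling", "cycling", "finance", "finance"]  # hypothetical labels

# The "indexing" part: truncated SVD, keeping the k strongest latent dimensions.
k = 2
U, singular_values, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(singular_values[:k]) @ Vt[:k]).T  # one k-dim row per document


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_document(query_index):
    """Index of the most similar other document in the reduced space."""
    others = [j for j in range(len(doc_vectors)) if j != query_index]
    return max(others, key=lambda j: cosine(doc_vectors[query_index], doc_vectors[j]))


# The classification part is bolted on: borrow the label of the nearest neighbor.
for i in range(len(doc_labels)):
    print(i, "->", doc_labels[nearest_document(i)])
```

The SVD step never looks at doc_labels, which is why LSI on its own is not a classifier.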
Have you read about LSI on Wikipedia? It says that LSI uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly, at evaluation time you can place a new example into exactly one of the clusters (crisp classification) or give it some kind of weighting that quantifies how similar to or different from the "archetype" of each cluster it looks.
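Both options can be sketched with cluster centroids standing in for the "archetypes". The centroids are hypothetical, as if produced by an earlier clustering run, and the exp(-distance) weighting is just one simple choice among many.

```python
import math

# Hypothetical cluster "archetypes" (centroids) found by some earlier clustering run.
centroids = {"A": [0.0, 0.0], "B": [4.0, 0.0], "C": [0.0, 3.0]}


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def crisp_assignment(point):
    """Place the point in exactly one cluster: the one with the nearest centroid."""
    return min(centroids, key=lambda name: distance(point, centroids[name]))


def soft_assignment(point):
    """Weights that sum to 1; closer centroids get larger weights."""
    raw = {name: math.exp(-distance(point, c)) for name, c in centroids.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


new_point = [3.0, 0.0]
print(crisp_assignment(new_point))
print(soft_assignment(new_point))
```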
So in some ways supervised and unsupervised models can yield something which is a "prediction", prediction of class/cluster label, but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.

Resources