What does probability means in OpenNLP for sentence or name detection? - opennlp

Trying to make sense of the outcome of API like SentenceDetecrorEvaluator()
What do precision mean?

Related

Determining the "goodness" of a phrase based on "grammatical" or "contextual" relevancy

Given a random string of words, I would like to assign a "goodness" score to the phrase, where "goodness" is some indication of grammatical and contextual relevancy.
For example:
"the green tree was tall" [Good score]
"delicious tires swim open" [Medium score]
"jump an con porch calmly" [Poor score]
I've been experimenting with the Natural Language Toolkit. I'd considered using a trained tagger to assign parts-of-speech to each word in a phrase, and then parse a corpus for occurrences of that POS pattern. This may give me an indication of grammatical "goodness". However, as the tagger itself is trained on the same corpus that I'm using for validation, I can't imagine the results would be reliable. This approach also does not take into consideration the contextual relevancy of the words.
Is anyone aware of existing projects or research into this sort of thing? How would you approach this?
You could employ two different approaches - supervised and semi-supervised.
Supervised
Assuming you have a labeled dataset of tuples of the form <sentence> <goodness label> (like the one in your examples), you could first split your dataset up in a train:test fold (e.g. 4:1).
Then you could simply use BERT feature vectors (these are pre-trained on large volumes of natural language text). The following piece of code gives you the vector for the sentence the green tree was tall (read more here).
nlp_features = pipeline('feature-extraction')
output = nlp_features('the green tree was tall')
np.array(output).shape # (Samples, Tokens, Vector Size)
Assuming you vectorize every sentence, you could then train a simple logistc regression model (sklearn) that learns a set of parameters to minimize the errors in these predictions on the training set and eventually you throw the test set sentences at this model to see how it behaves.
Instead of BERT, you could also use embedded vectors as inputs to an LSTM network for training the classifier (like the one here).
Semi-supervised
This is applicable when you don't have sufficient labeled data (although you need a few to get you started with).
In this case, I think what you could do is to map the words of a sentence into POS tag sequences, e.g.,
the green tree was tall --> ARTICLE ADJ NOUN VERB ADJ (see here for more details).
This step would make your method depend less on the words themselves. A model trained on these sequences would try to discover some latent distinguishing characteristics of good sentences from the bad ones.
In particular, you could run a standard text classification approach with Bidirectional LSTMs for training your classifier (this time not with words but with a much smaller vocabulary of POS tags).
You can use a transformer model from HuggingFace that is fine tuned for sentence correctness. Specifically, the model has to be fine tuned on the Corpus of Linguistic Acceptability (CoLA). Here's a medium article on HuggingFace, transformers, and the fine tuning process.
You can also get a model that's already fine-tuned and you can put in the text classification pipeline for HuggingFace's transformers library here. That site hosts fine-tuned models and you can search for a few others that are fine tuned for the CoLA task there.

Machine Learning/Artificial Intelligence - Classify column based on the value / pattern

I have been trying some frameworks and algorithms, and I can't find one that do what I want - which is classify the column of the data based on the value.
I tried to use Bayes algorithm, but it isn't very precise because I can't expect that the data that is being searched for is in the training set - but I can expect that the pattern is in the training.
I don't have background in Machine Learning / AI, but I was looking for some working example before really going deeper in the implementation.
I built a smaller ARFF to exemplify. Also tried lots of Weka classifying algorithms but none of them gave me good results.
#relation recommend
#attribute class {name,email,taxid,phone}
#attribute text String
#data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
email,'erik#gmail.com'
email,'steven#spielberg.com'
email,'dogs#cats.com'
taxid,'123611216'
taxid,'123545413'
taxid,'562321677'
taxid,'671312678'
taxid,'123123216'
phone,'438-597-7427'
phone,'478-711-7678'
phone,'321-651-5468'
My expectation is train a huge dataset like the above one and get recommendations based on the pattern, e.g.:
joao#bing.com -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit, maybe it's just lack of vocabulary.
Thank you!
What you are trying to do is called named entity recognition (NER). Weka is most likely not a real help here. The library Mallet (http://mallet.cs.umass.edu) might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with weka, you need to change your feature space. Then Naive bayes will be do ok on your data as presented
E.g. add a features for
whether the word has only characters
whether it is alphanumeric
whether it is numeric data
number of Numbers,
whether it starts captilized
... (just be creative)

Stanford NLP sentiment training set

is there a problem with the original (movie reviews) training set provided by Stanford?
Looking at it, it seems that the words "no" and "not" are always marked as negative and the word "n't" is always marked as neutral. Moreover, words with 2 meanings are also always consistent. One would expect the word "like" to be positive in a phrase such as "I like you" and neutral in a phrase such as "A is like B".
Does anyone know why this is the case?
"Problem" is a relative term. There's not something really wrong, but you could provide arguments for doing things differently.
tl;dr
Annotation was indeed done under the model that one subtree of words (including the limiting case of a single word) always gets the same rating.
The idea here is the principle of compositionality of language: if you want to work out the meaning of a novel large sentence, it's generally accepted that you should work out the meaning of the parts and then work out what happens when those parts are combined. The ratings are doing that for the case of sentiment.
In contrast it's not quite obvious what you'd be doing if you were assigning sentiment to a substring in context. Like if the substring was "[a little bit]" what does it mean to say that you're evaluating it in a context like "The movie was [a little bit] original" or "The movie was [a little bit] boring". Are you evaluating the sentiment of "a little bit" or are you just looking at the context and sticking on the substring a rating which really reflects the sentiment of "original" or "boring"?
Nevertheless, one can still raise questions about the approach. For one thing, there is no use of word senses. One substring gets one rating. As another, it could be argued that sentiment is a kind of gestalt, and even though words have a meaning and larger phrase meanings are calculated compositionally from them, it doesn't really make sense to say that words have a sentiment absent of their use in a particular context. That is, "thin" has a clear meaning, and working from that using your world knowledge, it makes sense that a "thin laptop" is a good thing and "thin walls" are a bad thing, but it doesn't seem like by itself "thin" has a sentiment - it arises as a result of whether the object it refers to is deemed good if thin. Hopefully, in such cases, AMT annotators gave "thin" by itself neutral sentiment, and only gave positive and negative ratings to phrases like "thin laptop" and "thin walls". But, in practice, their mind could easily have been conjuring up a particular context and they judged the word relative to that context.
p.s. This question really seems more Linguistics Stack Exchange than Stack Overflow.

Natural Language Processing for Smart Homes

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend with the NLP problem, you may use the Wit API (http://wit.ai) which maps natural language sentences to JSON:
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the voice-to-speech engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am no way a pioneer in NLP(I love it though) but let me try my hand on this one. For your project I would suggest you to go through Stanford Parser
From your problem definition I guess you don't need anything other then verbs and nouns. SP generates POS(Part of speech tags) That you can use to prune the words that you don't require.
For this I can't think of any better option then what you have in mind right now.
For this again you can use grammatical dependency structure from SP and I am pretty much sure that it is good enough to tackle this problem.
This is where your research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I hardly doubt that any algorithm would be efficient enough to handle every set of input sentence(Structured+unstructured) but something that is more that 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be of the supervised ones. To generate training data have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work so other people can help. You can then use these as your training set for your ML algorithm.

Liblinear how to use it

I'm fairly new at machine learning and text mining in general. It has come to my attention the presence of a ruby library called Liblinear https://github.com/tomz/liblinear-ruby-swig.
What I want to do so far is train the software to identify whether a text mentions anything related to bicycles or not.
Can someone please highlight the steps that I should be following (i.e: preprocessing text and how), share resources and ideally share a simple example to get me going.
Any help will do, thanks!
The classical approach is:
Collect a representative sample of input texts, each labeled as related/unrelated.
Divide the sample into training and test sets.
Extract all the terms in all the documents of the training set; call this the vocabulary, V.
For each document in the training set, convert it into a vector of booleans where the i'th element is true/1 iff the i'th term in the vocabulary occurs in the document.
Feed the vectorized training set to the learning algorithm.
Now, to classify a document, vectorize it as in step 4. and feed it to the classifier to get a related/unrelated label for it. Compare this with the actual label to see if it went right. You should be able to get at least some 80% accuracy with this simple method.
To improve this method, replace the booleans with term counts, normalized by document length, or, even better, tf-idf scores.

Resources