Are there any alternate ways other than Named Entity Recognition to extract event names from sentences? - opennlp

I'm new to NLP and I'm working on NER using OpenNLP. Take a sentence like "We have a dinner party today": here "dinner party" is an event type. Similarly, in "we have a room reservation", "room reservation" is an event type. My goal is to extract such phrases from sentences and label them as "Event_types" in the final output. This can be done reasonably well by training a custom NER model on sentences annotated with the proper tags. But event types can be heterogeneous and arbitrary, so it is very hard to label all possible patterns (i.e. an event type can be anything like "security meeting", "family function", "parents teachers meeting", etc.). So I'm looking for an alternative way to solve this problem. A prompt response would be appreciated. Thanks! :)

Basically you have two options:
1) A list-based approach, where you keep lists of the entities you want to extract from text. To cope with heterogeneous language use, you can train an embedding (e.g. Word2Vec or FastText) to identify phrases that are contextually similar to the ones in your list.
2) Train a custom CRF on data you have annotated (this obviously requires that you annotate a bunch of sentences with the corresponding tags).
I guess the ideal solution really depends on the data and on people's willingness to annotate it.
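To make the list-based option concrete, here is a minimal sketch in plain Python. The seed list and the function names are made up for illustration, and the embedding step is left out: the idea would be to grow EVENT_TYPES with similar phrases found by Word2Vec or FastText before matching.

```python
import re

# Hypothetical seed list of event types (assumption: curated by hand and,
# ideally, expanded with nearest neighbours from a trained embedding).
EVENT_TYPES = ["dinner party", "room reservation", "security meeting",
               "family function", "parents teachers meeting"]


def build_matcher(phrases):
    """Compile one case-insensitive pattern, trying longer phrases first."""
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract_event_types(sentence, matcher):
    """Return (matched text, label) pairs found in the sentence."""
    return [(m.group(0), "Event_types") for m in matcher.finditer(sentence)]


matcher = build_matcher(EVENT_TYPES)
print(extract_event_types("We have a dinner party today", matcher))
```

This only finds phrases that are literally in the list, which is exactly why the embedding-based expansion (or a trained model) matters for unseen event types.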

Related

Google cloud natural language API adding own context classifier

I have been searching for how to create a new entity type in the Google Natural Language API, and found nothing. Can anybody explain how to create a new classifier, so that if I pass a sentence and want 'python' to be detected as a programming language, I get that result? Currently the API labels 'python' as 'other'.
I have also looked into the Cloud AutoML API and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It gave me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems that what you need is closer to content classification than to entity detection. My understanding from your description is that you have content (maybe words, phrases or short sentences) and you want to classify it into some labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this.
The number it provides in eval is not sentiment; it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that java is a programmingLanguage with a probability of 1 (so it's very certain about it).

Add domain-specific entities to spaCy or Stanford NLP training set

We would like to add some custom entities to the training set of either Stanford NLP or spaCy, before re-training the model. We are willing to label our custom entities, but we would like to add these to the existing training set, so as to not spend too much time labeling.
We assume that the NLP model was trained on a large labeled data set, which includes labels for words that are labeled "O" ("other", i.e. nothing of interest) as well as words that are labeled "DATE", "PERSON", "ORGANIZATION", etc. We have a custom set of ORGANIZATION words, but we would like to add these to all the other labeled data, before re-training the model.
Is this possible? How can we do this? Do we have to get the labeled dataset that the models were trained on, so we can add our own data? If so, how can we do that?
We have built prototypes using both Stanford NLP and spaCy, so an answer for either one works for us.
For spaCy, you should just be able to call nlp.update(). This will make a weight update against the current weights, allowing you to resume training. If you want to make many updates, you might want to parse some text with the original model and mix that through your training, to avoid the "catastrophic forgetting" problem.
You can use this entity tagger tool by helkaroui to create your own training set.

How to improve the accuracy of ner of StanfordCoreNLP?

I used the NER of StanfordCoreNLP to recognize entities such as organizations, locations and persons. But something weird happens. For example, when I input "Cleveland Cavaliers", it recognizes 'Cleveland' as a 'location' instead of recognizing 'Cleveland Cavaliers' as an organization.
I am not very familiar with NER and I don't know how it works. My task is to get all the company names in the text, and the results I have got are not very satisfactory. Two ways of solving the problem occur to me. The first is to modify the dict and insert the correct data. The second is to train the model. But I still have some questions.
Will the first way work effectively?
If the answer to question 1 is yes, how do I modify the dict?
Furthermore, the FAQ at https://nlp.stanford.edu/software/crf-faq.shtml#a describes how to train the NER model, but what confuses me most is what I will get if I train my own model.
If I create a dataset containing something like "organization 'Cleveland Cavaliers'" to train the model, what will happen in the model? Will the dict inside the CRFClassifier change?
Will the CRFClassifier then fix the bug, so that when I input 'Cleveland Cavaliers' it recognizes 'Cleveland Cavaliers' as an organization entity?
These are all my questions, and I am preparing the dataset to try the second way. Can anybody answer the 4 questions above?
Thanks
I think the first solution is not very sophisticated, and every time you want to tag a new company you need to update your dictionary.
I prefer your second solution; I have done this before and trained a new model to tag my sentences.
If you have a good corpus that is big enough and properly tagged, it may take some time to train, but it is worth the effort.
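On the dataset question: as far as I recall, the CRF FAQ linked in the question uses training files with one token per line, followed by a tab and its label, with "O" for tokens that are not part of any entity. A rough sketch of producing such rows in Python follows; the helper name and the example sentence are made up, and the exact column layout should be checked against the FAQ.

```python
def to_training_rows(tokens, entity_tokens, label, outside="O"):
    """Label every occurrence of entity_tokens inside tokens.

    Returns one "token<TAB>label" string per token; tokens outside the
    entity get the `outside` label.
    """
    labels = [outside] * len(tokens)
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            labels[i:i + n] = [label] * n
    return [f"{tok}\t{lab}" for tok, lab in zip(tokens, labels)]


rows = to_training_rows(
    ["The", "Cleveland", "Cavaliers", "won", "last", "night", "."],
    ["Cleveland", "Cavaliers"],
    "ORGANIZATION",
)
print("\n".join(rows))
```

Note that a CRF learns feature weights from many such sentences in context rather than updating a lookup dictionary, so a single example is unlikely to change its behaviour much.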

OpenNLP, Training Named Entity Recognition on unsupported languages: clarifications needed

I want to experiment with NER on a specific domain: extracting location names from travel offers in Italian.
So far I've gathered that I need to prepare the training set myself, so I'm going to put the
<START:something><END>
tags in some offers from my training set.
But looking at the OpenNLP documentation on how to train for NER, I ended up with a couple of questions:
1) When defining the START/END tags, am I free to use whatever name I like inside the tags (where I wrote "something" a few lines above), or is there a restricted set I must stick to?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering that I want to train a model for Italian, which is NOT supported, is there any additional task to be done before I can train for NER?
1) Yes, you can specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
2) I think the "lang" parameter has the same meaning and use as in other commands (e.g. opennlp TokenizerTrainer -lang it ...)
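For preparing the annotated training lines themselves, a small helper can insert the START/END markers around token spans. This is only a sketch in Python: the function name, the type name "location" and the Italian example are assumptions, and the spans are expected to be non-overlapping token indexes.

```python
def annotate(tokens, spans):
    """Wrap token spans in OpenNLP-style <START:type> ... <END> markers.

    spans: list of (start, end, type) tuples, end exclusive, non-overlapping.
    """
    starts = {start: kind for start, end, kind in spans}
    ends = {end for start, end, kind in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:            # close a span that finished before this token
            out.append("<END>")
        if i in starts:          # open a span that begins at this token
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:      # span running to the end of the sentence
        out.append("<END>")
    return " ".join(out)


line = annotate(
    ["Volo", "da", "Roma", "a", "New", "York", "da", "299", "euro"],
    [(2, 3, "location"), (4, 6, "location")],
)
print(line)
```

Each such line would go into the training file as one sentence, with tokens separated by whitespace.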

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks whether you want to include X in a mail when you type Y's name, if you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags "related-ness" in an object or database ... thoughts ?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then, when you press the "publish" button, analysing the new post and proposing all tags that match its characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length if you tag posts "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc.). The most obvious choice, however, is to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {
    'code-development': ['programming', 'language', 'python',
                         'function', 'object', 'method'],
    'family': ['Theresa', 'kids', 'uncle Ben', 'holidays'],
}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", in other words to attach to each tag a number that indicates how likely it is to be the right one. For example, if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming" the program should indicate family as the most likely tag to use, as there are many more words hinting.
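A minimal sketch of that scoring, assuming the score is simply the number of hint words found in the post (the function name is made up, and substring matching is a deliberate simplification):

```python
tag_hint_words = {
    'code-development': ['programming', 'language', 'python',
                         'function', 'object', 'method'],
    'family': ['Theresa', 'kids', 'uncle Ben', 'holidays'],
}


def score_tags(post, hints):
    """Count how many hint words of each tag occur in the post.

    Matching is a case-insensitive substring test, which is crude
    (e.g. 'object' would also match 'objection') but keeps the sketch short.
    """
    text = post.lower()
    scores = {tag: sum(1 for word in words if word.lower() in text)
              for tag, words in hints.items()}
    # Highest score first, so the first entry is the best candidate.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


post = ("After months of programming, we finally left for the summer holidays "
        "at uncle Ben's cottage. Theresa and the kids were ecstatic!")
print(score_tags(post, tag_hint_words))
```

With the example sentence, family collects four hits against one for code-development, so it comes out on top.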
Learning AI's
One of the obvious limitations of the above method is that, say, if one day you pick up Java besides Python, you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they indeed... learn!). Some algorithms require initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common approaches are the Naive Bayes classifier and some flavour of neural network.
Although the Wikipedia pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
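To give an idea of how small such a thing can be, here is a rough sketch of a Naive Bayes tagger in plain Python. It assumes a bag-of-words model with add-one smoothing and whitespace tokenisation; the class name and the training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTagger:
    """Multinomial Naive Bayes over bag-of-words with add-one smoothing."""

    def __init__(self):
        self.tag_counts = Counter()               # posts seen per tag
        self.word_counts = defaultdict(Counter)   # word frequencies per tag
        self.vocabulary = set()

    def train(self, text, tag):
        """Learn from one post that the user tagged with `tag`."""
        words = text.lower().split()
        self.tag_counts[tag] += 1
        self.word_counts[tag].update(words)
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score (up to a constant) for each tag."""
        words = text.lower().split()
        total_posts = sum(self.tag_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for tag, count in self.tag_counts.items():
            total_words = sum(self.word_counts[tag].values())
            score = math.log(count / total_posts)  # prior of the tag
            for word in words:
                seen = self.word_counts[tag][word]
                score += math.log((seen + 1) / (total_words + vocab_size))
            result[tag] = score
        return result

    def suggest(self, text):
        """Return the tag with the highest score."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


tagger = NaiveBayesTagger()
tagger.train("python function and method for the object", "code-development")
tagger.train("debugging a programming language", "code-development")
tagger.train("holidays with the kids and uncle Ben", "family")
tagger.train("Theresa and the kids at the cottage", "family")
print(tagger.suggest("a new python method"))
print(tagger.suggest("kids on holidays"))
```

Every time you publish a post and confirm its tags, you would call train() again, so new words and new tags are picked up without touching the code.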
HTH!
You should have a look at this post:
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using collaborative filtering. You might want to give that a look.
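A much simpler relative of collaborative filtering, which may already be enough for a blog, is to count how often two tags appear on the same post. Here is a sketch in Python with made-up names and data; in a database the same information could live in a table of (tag_a, tag_b, count) rows.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts_tags):
    """Count, for every tag, how often each other tag appears on the same post."""
    related = defaultdict(Counter)
    for tags in posts_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def suggest_related(related, tag, limit=3):
    """Return up to `limit` tags most often used together with `tag`."""
    return [other for other, _ in related.get(tag, Counter()).most_common(limit)]


# Hypothetical tag lists of four existing posts.
posts = [
    ["python", "nlp", "opennlp"],
    ["python", "nlp"],
    ["python", "django"],
    ["family", "holidays"],
]
related = build_cooccurrence(posts)
print(suggest_related(related, "python", limit=1))
```

When the author types one tag, the site can then propose the tags returned by suggest_related for it.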
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

Resources