OpenNLP, Training Named Entity Recognition on unsupported languages: clarifications needed - opennlp

I want to experiment NER on a specific domain, that is location names extraction from travel offers in Italian language.
So far I've got that I need to prepare the training set by myself, so I'm going to put the
<START:something><END>
tags in some offers from my training set.
But looking at OpenNLP documentation on how to train for NER, I ended up in having a couple of questions:
1) When defining the START/END tags, I'm I free to use whatever name inside the tags (where I wrote "something" a few line above) or is there a restricted set to be bound?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering I want to train a model on Italian language that is NOT supported, is there any additional task to be done before I could train for NER?

1) Yes, you can specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
2) I think that "lang" parameter has the same meaning/use of other commands (e.g. opennlp TokenizerTrainer -lang it ...)

Related

Clarify steps to add a language variant to Stanza

I would like to add a non-standard variant of a language already supported by Stanza. It should be named differently from the standard variety included in the common distribution of Stanza. I could use a modification of the corpus for training the AI, since the changes are mostly morphological rather than syntactical, but how many steps would I need to take in order to make a new language variety for Stanza from this background? I don't understand what data are input and what are output in the process of adding a new language in the web documentation.
It sounds like you are trying to add a different set of processors rather than a whole new language. The difference being that other steps of the pipeline will still work the same, right? NER models, for example.
If that's the case, if you can follow the steps to retrain the current models, you should be able to then replace the input data with your morphological updates.
I suggest filing an issue on github if you encounter difficulties in the process. It will be a lot easier to back & forth there.
Times when we would actually recommend a whole new language are when 1) it's actually a new language or 2) it uses a different character set - think different writing systems for ZH or for Punjabi, if we had any Punjabi models

Are there any alternate ways other than Named Entity Recognition to extract event names from sentences?

I'm a newbie to NLP and I'm working on NER using OpenNLP. I have a sentence like " We have a dinner party today ". Here "dinner party" is an event type. Similarly consider this sentence- "we have a room reservation" here room reservation is an event type. My goal is to extract such words from sentences and label it as "Event_types" as the final output. This can be fairly achieved by creating custom NER model's by annotating sentences with proper tags in the training dataset. But the event types can be heterogeneous and random and hence it is very hard to label all possible patterns(ie. event types can be anything like "security meeting", "family function","parents teachers meeting", etc,etc,...). So I'm looking for an alternate way to achieve this problem... Immediate response would be appreciated. Thanks ! :)
Basically you have two options: 1) A list-based approach where you have lists of entities you will extract from text. To solve the heterogeneous language use, one can train an embedding (e.g. Word2Vec or FastText) to identify contextually similar phrases for your list. 2) Train a custom CRF with data you have annotated (this obviously requires that you annotate bunch of sentences with corresponding tags). I guess the ideal solution really depends on the data and people's willingness to annotate it.

Which Tagging format is the best for training Stanford NER (IO/ IOB)?

I have trained Stanford NER to extract the organization names from text. I used IO tagging format. It works fine. However, I wonder if changing the tag format to IOB (or other formats) might improve the scores. ?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
On the other hand, for many common types of entities, they can't just run together in normal English text since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.

Google cloud natural language API adding own context classifier

I have been searching how to create a new entity in google natural language API, and found nothing. Can anybody help how to create a new classifier such that if I pass a sentence and I want to detect suppose 'python' as programming language then how would I get that. Current the API is giving 'python' as 'other'.
I have also looked into cloud auto ml api for my solution and tried to create and train a model but It was only able to do sentiment analysis not entity detection.It was giving me the score rather than telling me that Java is programming language.
Thanks in advance.Your help will be appreciated.
Automl content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need to do is closer to content classification than entity detection. My understanding from the description you provided is that you have content (may be words or phrases or short sentences) and you want to classify them into some labels (e.g. programmingLanguage). If you put together a good training set, the automl model should be able to do this.
The number it provides in eval is not sentiment, it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that java is a programmingLanguage with probability of 1 (so, it's very certain about it).

Training caseless NER models with Stanford corenlp

I know how to train an NER model as specified here and have a very successful one in fact. I also know about the 3 provided caseless models as talked about here. But what if I want to train my own caseless model, what is the trick there? I have a bunch of all uppercase documents for training. Do I use the same training process or are there special/different features for the caseless models or are there properties that need to be set? I can't find a description as to how the provided caseless models were created.
There is only one property change in our models, which is that you want to have it invoke a function that removes case information before words are processed for classification. We do that with this property value (which also maps some words to American spelling):
wordFunction = edu.stanford.nlp.process.LowercaseAndAmericanizeFunction
but there is also simply:
wordFunction = edu.stanford.nlp.process.LowercaseFunction
Having more automatic stuff for deciding document format (hard/soft line breaks), case, or even language would be nice, but at present we don't have any of those....

Resources