I am trying to use Stanford NER to parse product data. My training data looks like the following:
iPhone 4 16GB black
Nikon D5100
Apple iPhone 4s
kindle touch
kindle fire
Now I want to train the NER with that data, so I have to categorize it first. The Stanford website provides an example where they parse a chapter of a book and tokenize every word onto a new line. This wouldn't help in my case, because then the data looks like:
iPhone
4
16GB
black
The "4" should not be in a new line, but when I put "iPhone 4" in a line, the NER thinks that "4" is the category of the token "iPhone".
I just need some help on how to train the NER with product data. What would you suggest? And would you categorize "iPhone" as a "phone" and "iPhone 4" also as a "phone"?
I'm wondering whether you'll be able to efficiently extract information using traditional (non-recursive) named entities. In my opinion, you may need something more structured, such as:
<phone>
  <model> iPhone </model>
  <version> 4 </version>
  <capacity> 16GB </capacity>
  <color> black </color>
</phone>
How to recognize structured named entities using CRFs is described, for instance, in this paper. Basically, it learns one CRF per entity type and combines the posterior probabilities (from each individual CRF) to recognize structured named entities.
Indeed, this needs some corpus re-engineering, as entities must have an adequate structure in the training corpora.
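A rough sketch of the one-CRF-per-field idea, using sklearn_crfsuite (the field names, features, and toy data are made up, and the paper's method for combining the posteriors is more involved than printing them):

    import sklearn_crfsuite

    # Toy tokenized products; each token is a feature dict.
    X = [[{"w": "iPhone"}, {"w": "4"}, {"w": "16GB"}, {"w": "black"}],
         [{"w": "Nikon"}, {"w": "D5100"}]]

    # One label sequence per field: a separate CRF learns each field.
    fields = {
        "model":    [["MODEL", "O", "O", "O"], ["MODEL", "O"]],
        "version":  [["O", "VERSION", "O", "O"], ["O", "VERSION"]],
        "capacity": [["O", "O", "CAPACITY", "O"], ["O", "O"]],
    }

    crfs = {}
    for name, y in fields.items():
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
        crf.fit(X, y)
        crfs[name] = crf

    # predict_marginals exposes per-token posterior probabilities,
    # which can then be combined across the individual CRFs.
    for name, crf in crfs.items():
        print(name, crf.predict_marginals(X)[0])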
I'm a newbie to NLP and I'm working on NER using OpenNLP. I have a sentence like "We have a dinner party today". Here "dinner party" is an event type. Similarly, consider the sentence "We have a room reservation"; here "room reservation" is an event type. My goal is to extract such words from sentences and label them as "Event_types" in the final output. This can fairly be achieved by creating custom NER models, annotating sentences with the proper tags in the training dataset. But the event types can be heterogeneous and random, and hence it is very hard to label all possible patterns (i.e. event types can be anything, like "security meeting", "family function", "parents teachers meeting", etc.). So I'm looking for an alternate way to solve this problem. An immediate response would be appreciated. Thanks! :)
Basically you have two options:
1) A list-based approach, where you have lists of entities to extract from the text. To handle heterogeneous language use, you can train an embedding (e.g. Word2Vec or FastText) to identify contextually similar phrases for your list, as sketched below.
2) Train a custom CRF with data you have annotated (this obviously requires that you annotate a bunch of sentences with the corresponding tags).
I guess the ideal solution really depends on the data and on people's willingness to annotate it.
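A minimal sketch of the embedding idea from option 1, using gensim's Word2Vec (the corpus here is a toy; in practice you would train on a large domain corpus, since similarity scores from data this small are meaningless):

    from gensim.models import Word2Vec

    # Toy tokenized corpus; replace with a large corpus from your domain.
    sentences = [
        ["we", "have", "a", "dinner", "party", "today"],
        ["we", "have", "a", "room", "reservation"],
        ["the", "security", "meeting", "is", "tomorrow"],
        ["a", "family", "function", "this", "weekend"],
    ]
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

    # Expand a seed list of event words with contextually similar terms.
    for word, score in model.wv.most_similar("meeting", topn=3):
        print(word, score)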
I have trained Stanford NER to extract organization names from text. I used the IO tagging format, and it works fine. However, I wonder whether changing the tag format to IOB (or another format) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
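To illustrate (using PER as the label), IO admits only one tagging, while IOB can distinguish the two readings:

    IO:              John/I-PER Sam/I-PER Ted/I-PER    (one entity or three?)
    IOB (3 people):  John/B-PER Sam/B-PER Ted/B-PER
    IOB (1 person):  John/B-PER Sam/I-PER Ted/I-PER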
On the other hand, many common types of entities can't just run together in normal English text, since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities that run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
I'm looking into creating training data for a Japanese NER.
I'm wondering whether I need to pre-tokenize the training data, or whether there is a way to specify a tokenizer during model creation.
In the example below, the Japanese has no whitespace:
<START:person> Pierre Vinken <END> 61 years old will join the board as a nonexecutive director Nov. 29 .
<START:person> Pierre Vinken <END> は11月29日、非執行取締役として理事に就任する。
Will this work for training a model, or do I need to provide the training sentences tokenized?
It was a little hard to find the documentation on this, but OpenNLP expects the training data to be pre-tokenized; see here:
The data can be converted to the OpenNLP name finder training format. Which is one sentence per line. Some other formats are available as well. The sentence must be tokenized and contain spans which mark the entities.
This could also be inferred from the English example you gave, since there's a space before the final period. As a rule, CRF training data is usually pre-tokenized, as that makes evaluation across software packages easier.
Once the sentences are tokenized OpenNLP should work fine with Japanese, since it doesn't really care what the strings are.
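For the tokenization step, here's a short sketch using fugashi, a MeCab wrapper (this is just one workable toolchain, not something OpenNLP prescribes; any Japanese morphological analyzer would do, and the exact segmentation depends on the dictionary installed):

    import fugashi  # pip install fugashi[unidic-lite]

    tagger = fugashi.Tagger()

    sentence = "Pierre Vinkenは11月29日、非執行取締役として理事に就任する。"
    tokens = [word.surface for word in tagger(sentence)]

    # Space-separated tokens, ready for the OpenNLP training format.
    print(" ".join(tokens))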
I found this link, which could be useful for your purpose. There is a pretrained NER model for the Japanese language which you can download:
https://www.rondhuit.com/apache-opennlp-1-9-0-ja-ner.html
We would like to add some custom entities to the training set of either Stanford NLP or spaCy, before re-training the model. We are willing to label our custom entities, but we would like to add these to the existing training set, so as to not spend too much time labeling.
We assume that the NLP model was trained on a large labeled data set that includes words labeled "O" ("other", i.e. nothing of interest) as well as words labeled "DATE", "PERSON", "ORGANIZATION", etc. We have a custom set of ORGANIZATION words, and we would like to add these to all the other labeled data before re-training the model.
Is this possible? How can we do this? Do we have to get the labeled dataset that the models were trained on, so we can add our own data? If so, how can we do that?
We have built prototypes using both Stanford NLP and spaCy, so an answer for either one works for us.
For spaCy, you should just be able to call nlp.update(). This will make a weight update against the current weights, allowing you to resume training. If you want to make many updates, you might want to parse some text with the original model and mix that through your training, to avoid the "catastrophic forgetting" problem.
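A minimal sketch of what that looks like (spaCy 2.x API; the training data and model names are placeholders, and spaCy 3.x expects Example objects instead of raw text/annotation pairs):

    import random
    import spacy

    # Placeholder training data: (text, {"entities": [(start, end, label)]})
    TRAIN_DATA = [
        ("Acme Corp acquired Initech last year.",
         {"entities": [(0, 9, "ORG"), (19, 26, "ORG")]}),
    ]

    nlp = spacy.load("en_core_web_sm")   # start from the pretrained model
    optimizer = nlp.resume_training()    # keep the existing weights

    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(itn, losses)

    nlp.to_disk("custom_ner_model")

Mixing in text parsed by the original model, as suggested above, just means adding those (text, annotations) pairs to TRAIN_DATA.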
You can use this entity tagger tool by helkaroui to create your own training set.
I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:-
Let the sentence be "The film is directed by Ryan Fleck-Anna Boden pair."
Now the NER tagger marks "Ryan" as one entity, "Fleck-Anna" as another, and "Boden" as a third. The correct marking would be "Ryan Fleck" as one entity and "Anna Boden" as another.
Is this a problem with the NER tagger, and if so, can it be handled?
How about:
1) Take your data and run it through Stanford NER or some other NER.
2) Look at the results and find all the mistakes.
3) Correctly tag the incorrect results and feed them back into your NER.
4) Lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that? Something like the sketch below.
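A small sketch of that idea, writing out Stanford NER's tab-separated training format (the name lists are made up, and I'm assuming the tokenizer splits the hyphen into its own token; if it keeps "Fleck-Anna" together, adjust the tokens accordingly):

    import random

    # Made-up name lists; swap in names that fit your domain.
    FIRST = ["Ryan", "Anna", "John", "Maria", "Wei", "Fatima"]
    LAST = ["Fleck", "Boden", "Smith", "Garcia", "Chen", "Khan"]

    def synthetic_sentence():
        """One sentence with a hyphenated director pair, as (token, label) pairs."""
        a = (random.choice(FIRST), random.choice(LAST))
        b = (random.choice(FIRST), random.choice(LAST))
        return [
            ("The", "O"), ("film", "O"), ("is", "O"), ("directed", "O"),
            ("by", "O"), ("the", "O"),
            (a[0], "PERSON"), (a[1], "PERSON"),
            ("-", "O"),  # the hyphen as its own O token separates the two names
            (b[0], "PERSON"), (b[1], "PERSON"),
            ("pair", "O"), (".", "O"),
        ]

    with open("hyphenated_names.tsv", "w") as f:
        for _ in range(500):
            for token, label in synthetic_sentence():
                f.write(f"{token}\t{label}\n")
            f.write("\n")  # blank line between sentences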
You should get there by adding more features, more data and training.
Instead of using Stanford CoreNLP, you could try Apache OpenNLP. There is an option available to train a model on your own training data. As this model depends on the names supplied by you, it is able to detect the names you are interested in.
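For reference, OpenNLP's command-line name finder trainer looks like this (the file names are placeholders; train.txt must be in the OpenNLP training format shown earlier):

    opennlp TokenNameFinderTrainer -model en-ner-custom.bin -lang en -data train.txt -encoding UTF-8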