I am trying to train my own model on tweets and in my model I care about NEs inside hashtags. However, I can't think of a way that makes the tool actually learn such patterns in the data. Here is an example training record for OpenNLP:
RAW Text ► Wright State is in #DaytonOH
OpenNLP Training ► <START>Wright State<END> is in #<START>Dayton<END><START>OH<END>
Now, if I prepare the same text for Stanford NER following this link:
Wright LOC
State LOC
is O
in O
# O
Dayton LOC
OH LOC
Would that be OK? How can we make it work for character level instead of only token level? Do you think the CRF module is going to learn such patterns? Or should we just ignore hashtags?
Thanks in advance.
-H
Related
I have trained Stanford NER to extract organization names from text. I used the IO tagging format, and it works fine. However, I wonder if changing the tag format to IOB (or other formats) might improve the scores?
Suppose you have a sentence that lacks normal punctuation, like this:
John Sam Ted are all here.
If you don't have a B tag you won't be able to tell if this should be three entities or one entity with three words.
On the other hand, many common types of entities can't just run together in normal English text, since you'll at least have a comma between them.
If you can set it up, using IOB is better in case you have entities run together, but depending on your data set it may not be an issue. You'll have to look at the data to tell.
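To make the ambiguity concrete, here is a minimal Python sketch (not part of Stanford NER; the function names and the B-/I- prefix spelling are my own assumptions) that decodes the "John Sam Ted" sentence under both schemes:

```python
def decode_io(tokens, tags):
    """IO scheme: adjacent tokens with the same non-O tag merge into one entity."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag != "O" and tag == current_type:
            current.append(token)
            continue
        if current:
            entities.append((" ".join(current), current_type))
        current, current_type = ([token], tag) if tag != "O" else ([], None)
    if current:
        entities.append((" ".join(current), current_type))
    return entities


def decode_iob(tokens, tags):
    """IOB2-style scheme (assumed here): B-X always opens a new entity, I-X continues it."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current and entity_type == current_type:
            current.append(token)
            continue
        if current:
            entities.append((" ".join(current), current_type))
        if prefix in ("B", "I"):  # a stray I- tag is treated as an entity start
            current, current_type = [token], entity_type
        else:
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["John", "Sam", "Ted", "are", "all", "here", "."]
io_tags = ["PERSON", "PERSON", "PERSON", "O", "O", "O", "O"]
iob_tags = ["B-PERSON", "B-PERSON", "B-PERSON", "O", "O", "O", "O"]

print(decode_io(tokens, io_tags))    # one merged entity
print(decode_iob(tokens, iob_tags))  # three separate entities
```

With IO tags the three names can only come out as a single span; the B tag is what lets the decoder keep them apart.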
I'm looking into creating training data for a Japanese NER.
Wondering if I need to pre-tokenize the training data or is there a way to specify a Tokenizer during model creation?
In the example below Japanese doesn't have any whitespace:
<START:person> Pierre Vinken <END> 61 years old will join the board as a nonexecutive director Nov. 29 .
<START:person> Pierre Vinken <END> は11月29日、非執行取締役として理事に就任する。
Will this work for training a model, or do I need to provide the training sentences tokenized?
It was a little hard to find the documentation on this but OpenNLP expects the training data to be pre-tokenized, see here:
The data can be converted to the OpenNLP name finder training format. Which is one sentence per line. Some other formats are available as well. The sentence must be tokenized and contain spans which mark the entities.
This could also be inferred from the English example you gave, since there's a space before the final period. As a rule, CRF training data is usually pre-tokenized, as that makes evaluation across software packages easier.
Once the sentences are tokenized OpenNLP should work fine with Japanese, since it doesn't really care what the strings are.
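As a rough illustration of what "pre-tokenized" means here, this Python sketch builds one training line from a token list plus entity spans. The helper name `to_opennlp_line` and the Japanese segmentation are assumptions for the example; in practice the tokens would come from a morphological analyzer, not be written by hand.

```python
def to_opennlp_line(tokens, spans):
    """Join tokens with single spaces and wrap each span in <START:type> ... <END>.

    spans is a list of (start, end, type) token indices, with end exclusive.
    """
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:  # a span that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


# Hand-made segmentation, for illustration only.
tokens = ["Pierre", "Vinken", "は", "11", "月", "29", "日", "、",
          "非執行", "取締役", "として", "理事", "に", "就任", "する", "。"]
line = to_opennlp_line(tokens, [(0, 2, "person")])
print(line)
```

Every token, including each Japanese one, ends up separated by a space, which is the one-sentence-per-line tokenized form the documentation quoted above asks for.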
https://www.rondhuit.com/apache-opennlp-1-9-0-ja-ner.html
I found this link, which could be useful for your purpose. There is a pretrained NER model for the Japanese language which you can download.
We are using the NER models to identify entities like org, percent, money, number, etc. We would like to add an entity (I don't think we can extend the models) or build another model to tag these entities (we are looking to classify financial securities).
I have just started looking at this and have used the models available so far.
I am looking at https://nlp.stanford.edu/software/crf-faq.shtml#a to get started. For the custom models, are there sample data files I need to look at?
Does this still mean that the only entities that can be tagged are the already available ones like organization, date, money, location, and so on?
Are there any changes one needs to make to the Java files, i.e. which ones would I start with to understand how the classifier works?
Basically, for some text like:
2.200% Notes due October 30, 2020 the principal amount $ 1,500,000,000.00 $ 186,750.00
I'd like to tag:
<security>2.200% Notes due October 30, 2020</security> the principal amount $ 1,500,000,000.00 $ 186,750.00
You can train a new sequence tagger with the following format:
Joe PERSON
Smith PERSON
was O
born O
in O
California LOCATION
. O
He O
works O
for O
Apple ORGANIZATION
. O
Note that a tab character (\t) should separate the token from the tag. You can use any tag you want. The statistical tagger will then be able to apply the tags it saw in the training data.
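If you are generating that file from your own annotations, a small Python sketch like the following may help. The `SECURITY` tag, the helper name `to_tsv`, the tokenization of the example, and the blank line between sentences are all assumptions on my part, not something the tagger mandates.

```python
def to_tsv(sentences):
    """Render (token, tag) pairs as one 'token<TAB>tag' line each.

    A blank line is written after every sentence (an assumed convention
    for column-format training files).
    """
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token}\t{tag}")
        lines.append("")
    return "\n".join(lines)


# Assumed tokenization of the example text; SECURITY is a custom tag.
sentence = [
    ("2.200", "SECURITY"), ("%", "SECURITY"), ("Notes", "SECURITY"),
    ("due", "SECURITY"), ("October", "SECURITY"), ("30", "SECURITY"),
    (",", "SECURITY"), ("2020", "SECURITY"),
    ("the", "O"), ("principal", "O"), ("amount", "O"),
]
training_text = to_tsv([sentence])
print(training_text)
```

Writing the file from code avoids the common mistake of an editor silently turning the tabs into spaces.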
You can see the full details of the properties file you should use if you look at this file in the models jar:
edu/stanford/nlp/models/ner/english.all.3class.distsim.prop
I should note, if what you're trying to extract follows a few basic patterns, you're probably going to get better results with a rule-based approach.
Here is some documentation on rule-based approaches in Stanford CoreNLP:
https://nlp.stanford.edu/software/tokensregex.html
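To show the idea of a rule-based tagger without reproducing TokensRegex syntax, here is a sketch using Python's standard `re` module instead. The pattern covers only the "<rate>% Notes due <Month> <day>, <year>" shape from the question, which is an assumption about how the securities are written; real data would need more patterns.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Matches e.g. "2.200% Notes due October 30, 2020" (assumed shape).
SECURITY = re.compile(
    r"\d+(?:\.\d+)?%\s+Notes\s+due\s+(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}"
)


def tag_securities(text):
    """Wrap every match of the SECURITY pattern in <security> tags."""
    return SECURITY.sub(lambda m: f"<security>{m.group(0)}</security>", text)


text = ("2.200% Notes due October 30, 2020 the principal amount "
        "$ 1,500,000,000.00 $ 186,750.00")
print(tag_securities(text))
```

The same pattern could then be translated into TokensRegex rules if you want to stay inside the CoreNLP pipeline.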
I know how to train an NER model as specified here and have a very successful one in fact. I also know about the 3 provided caseless models as talked about here. But what if I want to train my own caseless model, what is the trick there? I have a bunch of all uppercase documents for training. Do I use the same training process or are there special/different features for the caseless models or are there properties that need to be set? I can't find a description as to how the provided caseless models were created.
There is only one property change in our models, which is that you want to have it invoke a function that removes case information before words are processed for classification. We do that with this property value (which also maps some words to American spelling):
wordFunction = edu.stanford.nlp.process.LowercaseAndAmericanizeFunction
but there is also simply:
wordFunction = edu.stanford.nlp.process.LowercaseFunction
Having more automatic stuff for deciding document format (hard/soft line breaks), case, or even language would be nice, but at present we don't have any of those....
I have been using the Stanford NER tagger to find the named entities in a document. The problem that I am facing is described below:
Let the sentence be The film is directed by Ryan Fleck-Anna Boden pair.
Now the NER tagger marks Ryan as one entity, Fleck-Anna as another and Boden as a third entity. The correct marking should be Ryan Fleck as one and Anna Boden as another.
Is this a problem of the NER tagger and if it is then can it be handled?
How about
take your data and run it through Stanford NER or some other NER.
look at the results and find all the mistakes
correctly tag the incorrect results and feed them back into your NER.
lather, rinse, repeat...
This is a sort of manual boosting technique. But your NER probably won't learn too much this way.
In this case it looks like there is a new feature, hyphenated names, that the NER needs to learn about. Why not make up a bunch of hyphenated names, put them in some text, tag them, and train your NER on that?
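A minimal Python sketch of that idea follows. It assumes the hyphen is split off as its own token before tagging (through a tokenizer setting or a preprocessing step) and that the output is the tab-separated token/tag format; the name list, the sentence template, and the function names are made up for illustration.

```python
from itertools import permutations

# Small illustrative name list; extend it with names relevant to your data.
NAMES = [("Ryan", "Fleck"), ("Anna", "Boden"), ("Joel", "Coen"), ("Ethan", "Coen")]


def make_example(name_a, name_b):
    """Build (token, tag) rows for one 'A-B pair' sentence, hyphen as its own O token."""
    rows = [(word, "O") for word in ["The", "film", "is", "directed", "by"]]
    rows += [(part, "PERSON") for part in name_a]
    rows.append(("-", "O"))
    rows += [(part, "PERSON") for part in name_b]
    rows += [("pair", "O"), (".", "O")]
    return rows


def make_training_text(names):
    """One tab-separated block per ordered pair of names, blank line between blocks."""
    blocks = []
    for name_a, name_b in permutations(names, 2):
        rows = make_example(name_a, name_b)
        blocks.append("\n".join(f"{token}\t{tag}" for token, tag in rows))
    return "\n\n".join(blocks) + "\n"


print(make_training_text(NAMES))
```

Varying the sentence template as well as the names would probably matter more than the number of name pairs, since otherwise the model may just memorize the one context.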
You should get there by adding more features, more data and training.
Instead of using Stanford CoreNLP you could try Apache OpenNLP. There is an option available to train your model on your own training data. As this model depends on the names you supply, it is able to detect the names you are interested in.