How to set up training and feature template files for NER? - CRF++ - crf

For named entity recognition, after tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is a POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?
What does the next column represent? A class like PERSON, LOCATION, etc.? Does it have to be in any particular format?
Is there an example of a completed training file and template for NER?

You can find example training and test data in the crf++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
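As a rough sketch of the layout described above, the following Python helper writes tokenized sentences as space-separated columns with a blank line between sentences, and refuses rows whose column count differs. The function name `to_crfpp` and the sample rows are made up for illustration; they are not part of CRF++.

```python
def to_crfpp(sentences):
    """Render sentences as CRF++-style columnar text.

    `sentences` is a list of sentences; each sentence is a list of rows,
    and each row is a tuple of column values (token, features..., label).
    """
    # Every row must have the same number of columns.
    widths = {len(row) for sent in sentences for row in sent}
    if len(widths) > 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    blocks = ["\n".join(" ".join(row) for row in sent) for sent in sentences]
    # A blank line separates sentences.
    return "\n\n".join(blocks) + "\n"


sample = [
    [("Confidence", "NN", "B"), ("in", "IN", "O")],
    [("the", "DT", "B"), ("pound", "NN", "I")],
]
print(to_crfpp(sample), end="")
```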
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label because Inc. is usually the last part of a company name.
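To make the tagging scheme concrete, here is a minimal sketch that turns entity spans into IOB labels. The helper `spans_to_iob` and its (start, end, label) span format are assumptions chosen for this example, not something CRF++ provides.

```python
def spans_to_iob(tokens, spans):
    """Tag tokens with IOB labels.

    `spans` is a list of (start, end, label) token ranges, end exclusive.
    """
    tags = ["O"] * len(tokens)  # everything is outside an entity by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return list(zip(tokens, tags))


tokens = ["John", "Smith", "ate", "an", "apple", "."]
for token, tag in spans_to_iob(tokens, [(0, 2, "PERSON")]):
    print(token, tag)
```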
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
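For orientation, a CRF++ template is a list of lines such as `U00:%x[-1,0]`, where `%x[row,col]` refers to the token at a relative row offset and an absolute column index, and a lone `B` line adds label bigram features. The sketch below generates a small template of that shape; the helper `build_template` and its parameters are hypothetical, and the templates shipped with the CRF++ examples are the better starting point for real work.

```python
def build_template(window=2, column=0):
    """Build a CRF++ feature template for one input column.

    Emits unigram features (U..) for each offset in [-window, window],
    plus a single bigram line (B) that combines adjacent output labels.
    """
    lines = ["# Unigram"]
    for n, offset in enumerate(range(-window, window + 1)):
        lines.append(f"U{n:02d}:%x[{offset},{column}]")
    lines += ["", "# Bigram", "B"]
    return "\n".join(lines) + "\n"


print(build_template(window=1), end="")
```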
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
import spacy

# spaCy 3+ loads models by package name; older versions accepted the 'en' shortcut
nlp = spacy.load('en_core_web_sm')
names = ['John', 'Jane']  # etc...
text = nlp("John ate an apple.")
for word in text:
    person = 'O'  # default: not a person
    if str(word) in names:
        person = 'B-PERSON'
    print(str(word), word.pos_, person)


How to differentiate code terminology in MEDICAL SERVICE LINES?

In the MEDICAL_SERVICE_LINES table, there is a field ‘PROCEDURE’. The data dictionary notes that this is ‘CPT, HCPCS, or ICD-10-PCS (less commonly)’. Is there a field that indicates which of these terminologies the code is from?
Can you use modifiers to help identify? Or are the code formats the best tool like:
5 numbers or 4 numbers and a letter (in that order)
1 letter and 4 numbers (in that order).
This customer receives PLAID and is not in Sentinel. (data dictionary here)
The code formats would be the best to distinguish definitively what type of code it is. The modifiers are not filled out all the time (some claims may not have modifiers attached to the procedure).
Your layout of the code formats is correct (see section HCPCS Coding here for additional confirmation). HCPCS Level 1 consists of CPT codes. HCPCS Level 2/3 is what we typically regard as just "HCPCS".
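A format check along these lines could be sketched as follows. The function name `classify_procedure_code` is made up for this example, and the ICD-10-PCS rule (seven alphanumeric characters) is an assumption added here, since the thread only spells out the CPT and HCPCS patterns.

```python
import re

# CPT (HCPCS Level 1): 5 digits, or 4 digits followed by a letter.
CPT = re.compile(r"\d{4}[0-9A-Z]")
# HCPCS Level 2: 1 letter followed by 4 digits.
HCPCS = re.compile(r"[A-Z]\d{4}")
# Assumption: ICD-10-PCS codes are 7 alphanumeric characters.
ICD10PCS = re.compile(r"[0-9A-Z]{7}")


def classify_procedure_code(code):
    """Guess the terminology of a procedure code from its format alone."""
    code = code.strip().upper()
    if CPT.fullmatch(code):
        return "CPT"
    if HCPCS.fullmatch(code):
        return "HCPCS"
    if ICD10PCS.fullmatch(code):
        return "ICD-10-PCS"
    return "UNKNOWN"


for example in ["99213", "0001F", "J1100", "0DT70ZZ"]:
    print(example, classify_procedure_code(example))
```

Because CPT codes start with a digit and HCPCS Level 2 codes start with a letter, the two five-character patterns cannot both match the same code.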

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace, see reference:
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
# ARTICLE_TO_SUMMARIZE is the input text (a str), defined elsewhere
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
# Decode the first (and only) generated sequence
summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary contains only coherent sentences with complete thoughts and remains concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but to have the BART model itself produce complete sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set file, with a sufficient number of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At test time, all sentences from the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am only allowed to change the training data, and I am not permitted to write a custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be in the default pipeline described on this page.
One thing I would highly suggest is using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when the entity exists in your look-up table (by searching for typo variations of those entities). There's a whole blog post about it here, on Rasa.
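To illustrate the idea of matching a misspelled token against a look-up table, here is a minimal sketch. It swaps in Python's standard-library `difflib` for fuzzywuzzy so that it runs without extra dependencies; the `LOOKUP` entries, the `fuzzy_lookup` helper, and the 0.75 cutoff are all assumptions chosen for this example.

```python
import difflib

LOOKUP = ["hotel", "hostel", "motel"]  # hypothetical look-up table entries


def fuzzy_lookup(token, choices=LOOKUP, cutoff=0.75):
    """Return the closest look-up entry for a token, or None if nothing is close."""
    matches = difflib.get_close_matches(token.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for typo in ["hotol", "hetel", "hotele", "pizza"]:
    print(typo, "->", fuzzy_lookup(typo))
```

The cutoff trades recall for precision: a lower value catches more typos but also starts mapping unrelated words onto table entries.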
There's a working implementation of fuzzy wuzzy as a custom component:
import json
import os

from fuzzywuzzy import process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # assumed Rasa 1.x/2.x import path

# STOP_WORDS is just a collection of stop words from NLTK
STOP_WORDS = set(stopwords.words('english'))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass  # nothing to train; matching relies on the lookup table

    def process(self, message, **kwargs):
        entities = list(message.get('entities'))
        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')
        for token in tokens:
            if token.text not in STOP_WORDS:
                # Compare the token against the "value" of each lookup entry
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value']
                    if isinstance(a, dict) else a,
                )
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum
Then you just add it to your NLU pipeline in the config.yml file.
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data. E.g. Generate misspelled words (typos)
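As one possible way to automate this, the sketch below generates single-edit misspellings of an entity value and formats them as training examples using the synonym syntax shown above. The helpers `typo_variants` and `rasa_examples`, and the choice of edit types (dropped letters, swapped neighbours, vowel substitutions), are assumptions made for illustration; a dedicated typo generator would cover more cases.

```python
def typo_variants(word):
    """Return single-edit misspellings of `word`, sorted alphabetically."""
    variants = set()
    for i in range(len(word)):
        # Drop one letter.
        variants.add(word[:i] + word[i + 1:])
        # Swap two neighbouring letters.
        if i + 1 < len(word):
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        # Replace a vowel with another vowel.
        if word[i] in "aeiou":
            for vowel in "aeiou":
                variants.add(word[:i] + vowel + word[i + 1:])
    variants.discard(word)  # the correct spelling is not a typo
    return sorted(variants)


def rasa_examples(word, entity, template):
    """Format each variant as a Rasa Markdown example with an entity synonym."""
    return [
        "- " + template.format(f"[{typo}]({entity}:{word})")
        for typo in typo_variants(word)
    ]


for line in rasa_examples("hotel", "place", "looking for a {} in Chennai"):
    print(line)
```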
First of all, add samples for the most common typos of your entities, as advised here.
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline; if not, you need to create a custom component. Dealing with this through training data alone is not feasible, since you can't create samples for every typo.
Using fuzzywuzzy is one option, but it is generally slow and doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code either way.

Find if a word is plural of another

I am writing a program in Go to generate a report of crimes in my University. I have run into a roadblock where I need to find if one word is a plural of another. I am making a map of crimes first
crimes := make(map[string]int)
then, adding crimes to the map with the number of occurrences as int
for i := 0; i < len(feed.Items); i++ {
    // ... increment crimes[...] for the crime named in feed.Items[i]
}
Now, the problem arises when there are entries like, "Armed Robberies (with a count of 1)" and "Armed Robbery (with a count of 2)". I want to check if a word is a plural of another. In this case, I want to make a single entry for "Armed Robbery (with a count of 3)". I could not find a package for doing this. Is there a way to do this?
What you are looking for is called inflection. Basically, it is the black art of determining the various forms of a word, in particular the singular from the plural, or the opposite.
There are libraries for this, mostly inspired by the Ruby on Rails ActiveSupport::Inflector system.
Algorithms for English pluralization also make for very interesting reading.
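To show the shape of the approach, here is a small sketch that singularizes the last word of each crime name and merges the counts. It is written in Python purely for illustration (the same rules translate directly to Go), and the rule set in `singularize` is a deliberately tiny assumption that will mishandle irregular plurals; a real inflection library is the safer choice.

```python
def singularize(word):
    """Apply a very small set of English plural rules (not exhaustive)."""
    lower = word.lower()
    if lower.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"            # Robberies -> Robbery
    if lower.endswith(("xes", "ches", "shes")):
        return word[:-2]                  # Searches -> Search
    if lower.endswith("s") and not lower.endswith("ss"):
        return word[:-1]                  # Assaults -> Assault
    return word


def merge_counts(counts):
    """Merge entries whose last word differs only by plural form."""
    merged = {}
    for name, count in counts.items():
        head, _, last = name.rpartition(" ")
        key = (head + " " if head else "") + singularize(last)
        merged[key] = merged.get(key, 0) + count
    return merged


print(merge_counts({"Armed Robberies": 1, "Armed Robbery": 2, "Assaults": 4}))
```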

Stanford NER: How do I create a new training set that I can use and test out?

From my understanding, to create a training file, you put your words in a text file. Then after each word, add a space or tab along with the tag (such as PERS, LOC, etc...)
I also copied text from a sample properties file into a word pad. How do I get these into a gz file that I can input into the classifier and use?
Please guide me through. I'm a newbie and am fairly inept with technology.
Your training file (say training-data.tsv) should look like this:
drove O
to O
Vancouver LOCATION
yesterday O
where O means "Outside", as in not a named entity, and the space between the columns is a tab.
You don't put them in a ser.gz file. The ser.gz file is the classifier model that is created by the training process.
To train the classifier run:
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop <properties file>
where the properties file would look like this:
trainFile = training-data.tsv
serializeTo = my-classification-model.ser.gz
map = word=0,answer=1
I'd advise you to take a look at the NLTK documentation to learn more about training a parser. Now, it seems that you want to train the CRFClassifier (not the parser!); for that you may want to check this FAQ.
