Relationship Extraction (RE) using Stanford API - stanford-nlp

I have created a custom Named Entity Recognition (NER) classifier and a custom Relationship Extraction (RE) classifier. In the training data for the RE, I have given it a set of 10 sentences in which I have specified the exact entities and the relationships between them.
When I run the code, I get the correct relationships for 6 out of the 10 sentences, but not for all of them. I wanted to understand why the RE code is not able to identify the correct relationships in all the sentences even though I have given it the exact same sentences in the training data.
For example, the following sentence:
The Fund's objective is to help our members achieve the best possible RetOue.
In the training data, the relationship given is
Fund RetOue build
Below are all the RelationMentions found in the sentence. It can be seen that the relation between "Fund" and "RetOue" comes out as _NR with a probability of (_NR, 0.6074190677382846), while the actual relation (build, 0.26265263651796966) has a lower probability. It is the second one in the list below:
RelationMention [type=_NR, start=1, end=9, {_NR, 0.8706606065870188; build, 0.04609463244214589; reply, 0.014127678851794745; cause, 0.01412618987143006; deliver, 0.014028667880335159; calculate, 0.014026673364224201; change, 0.013888249765034161; collaborate, 0.01304730123801706}
EntityMention [type=RESOURCE, objectId=EntityMention-10, hstart=1, hend=2, estart=1, eend=2, headPosition=1, value="Fund", corefID=-1]
EntityMention [type=ROLE, objectId=EntityMention-11, hstart=8, hend=9, estart=8, eend=9, headPosition=8, value="members", corefID=-1]
]
RelationMention [type=_NR, start=1, end=14, {_NR, 0.6074190677382846; build, 0.26265263651796966; collaborate, 0.029635339573025835; reply, 0.020273680468829585; cause, 0.020270355199687763; change, 0.020143296854960534; calculate, 0.019807048865472295; deliver, 0.01979857478176975}
EntityMention [type=RESOURCE, objectId=EntityMention-10, hstart=1, hend=2, estart=1, eend=2, headPosition=1, value="Fund", corefID=-1]
EntityMention [type=RESOURCE, objectId=EntityMention-12, hstart=13, hend=14, estart=13, eend=14, headPosition=13, value="RetOue", corefID=-1]
]
RelationMention [type=_NR, start=1, end=9, {_NR, 0.9088620248226259; build, 0.029826907381364745; cause, 0.01048834533846858; reply, 0.010472406713467062; change, 0.010430417119225247; deliver, 0.010107963031033371; calculate, 0.010090071219976819; collaborate, 0.009721864373838134}
EntityMention [type=ROLE, objectId=EntityMention-11, hstart=8, hend=9, estart=8, eend=9, headPosition=8, value="members", corefID=-1]
EntityMention [type=RESOURCE, objectId=EntityMention-10, hstart=1, hend=2, estart=1, eend=2, headPosition=1, value="Fund", corefID=-1]
]
RelationMention [type=_NR, start=8, end=14, {_NR, 0.6412212367693484; build, 0.0795874107991397; deliver, 0.061375929752833555; calculate, 0.061195561682179045; cause, 0.03964100603702037; reply, 0.039577811103586304; change, 0.03870906323316812; collaborate, 0.038691980622724644}
EntityMention [type=ROLE, objectId=EntityMention-11, hstart=8, hend=9, estart=8, eend=9, headPosition=8, value="members", corefID=-1]
EntityMention [type=RESOURCE, objectId=EntityMention-12, hstart=13, hend=14, estart=13, eend=14, headPosition=13, value="RetOue", corefID=-1]
]
RelationMention [type=_NR, start=1, end=14, {_NR, 0.8650327055005457; build, 0.05264799740623545; collaborate, 0.01878896136615606; reply, 0.012762167223115933; cause, 0.01276049397449083; calculate, 0.012671777715382195; change, 0.012668721250994311; deliver, 0.012667175563079464}
EntityMention [type=RESOURCE, objectId=EntityMention-12, hstart=13, hend=14, estart=13, eend=14, headPosition=13, value="RetOue", corefID=-1]
EntityMention [type=RESOURCE, objectId=EntityMention-10, hstart=1, hend=2, estart=1, eend=2, headPosition=1, value="Fund", corefID=-1]
]
RelationMention [type=_NR, start=8, end=14, {_NR, 0.8687007489440899; cause, 0.019732766828364688; reply, 0.0197319383076219; change, 0.019585387681083893; collaborate, 0.019321463597270272; deliver, 0.018836262558606865; calculate, 0.018763499991179922; build, 0.015327932091782685}
EntityMention [type=RESOURCE, objectId=EntityMention-12, hstart=13, hend=14, estart=13, eend=14, headPosition=13, value="RetOue", corefID=-1]
EntityMention [type=ROLE, objectId=EntityMention-11, hstart=8, hend=9, estart=8, eend=9, headPosition=8, value="members", corefID=-1]
]
I wanted to understand what reasons I should look out for here.
Q.1 My assumption was that entity types being recognized accurately will help the relationship get recognized accurately. Is that correct?
Q.2 How can I improve my training data to make sure I get the accurate relationship as the result?
Q.3 Does it matter how many records of each entity type I have defined? Should I maintain an equal number of definitions for each relation type? For example: if I have 10 examples of the relationship "build" in my training data, should I also define 10 examples of each of the other relationship types, such as "cause", "reply", etc.?
Q.4 My assumption is that the correct NER classification of the entity makes a difference in the relationship extraction. Is that correct?

Your assumption that good NER information will help is correct, but chances are you'll need many more than 10 training examples. You should be thinking more along the lines of thousands of examples, optimally tens or hundreds of thousands of examples.
But, with only 10 examples, the classifier should probably be memorizing the training set nonetheless. What do your training examples look like? Are you using the default features?

There are lots of features that RE can use to improve the accuracy of the relationship classification, and these need to be analysed in detail.
Answers to my questions:
A.1. Yes, entity types being recognized accurately helps the relationship get recognized accurately.
A.2. As far as I know, the training data needs to be annotated and improved manually.
A.3. As far as I know, yes, the number of records defined between entities matters.
A.4. The NER accuracy makes a difference in the RE accuracy.
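To sanity-check point A.3, it helps to count how many examples of each relation label the training data actually contains. Below is a minimal sketch, assuming the relation annotations are stored one per line in the same "entity1 entity2 relation" form as the "Fund RetOue build" example above (relations.txt is just a placeholder file name):

from collections import Counter

# Count how often each relation label appears in the training annotations.
# Assumes one annotation per line, e.g. "Fund RetOue build".
counts = Counter()
with open("relations.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3:  # entity1, entity2, relation
            counts[parts[2]] += 1

for relation, n in counts.most_common():
    print(relation, n)

A heavily skewed label distribution (for example, far more unannotated _NR entity pairs than positive relation examples) is one plausible reason the classifier leans so strongly toward _NR in the output above.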

Related

How to execute search for FHIR patient with multiple given names?

We've implemented the $match operation for patient that takes FHIR parameters with the search criteria. How should this search work when the patient resource in the parameters contains multiple given names? We don't see anything in FHIR that speaks to this. Our best guess is that we treat it as an OR when trying to match on given names in our system.
We do see that composite parameters can be used in the query string as AND or OR, but not sure how this equates when using the $match operation.
$match is intrinsically a 'fuzzy' search. Different servers will implement it differently. Many will allow for alternate spellings, common short names (e.g. 'Dick' for 'Richard'), etc. They may also allow for transposition of month and day and all sorts of similar data entry errors. The 'closeness' of the match is reflected in the score the match is given. It's entirely possible to get back a match candidate that doesn't match any of the given names exactly if the score on other elements is high enough.
So technically, I think SEARCH works this way:
AND
/Patient?given=John&given=Jacob&given=Jingerheimer
The above is an AND clause: there is (or can be) a person with the multiple given names "John", "Jacob", and "Jingerheimer".
Now I realize SEARCH and MATCH are 2 different operations.
But they are loosely related.
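As a rough illustration of the difference, here is a sketch in Python using the requests library against a hypothetical FHIR base URL. The parameter shapes follow the FHIR spec: repeating the given search parameter ANDs the values together, while $match takes a Parameters resource that wraps the candidate Patient.

import requests

BASE = "https://example.org/fhir"  # hypothetical FHIR server base URL

# SEARCH: repeating the 'given' parameter ANDs the values together.
search = requests.get(
    BASE + "/Patient",
    params=[("given", "John"), ("given", "Jacob"), ("given", "Jingerheimer")],
    headers={"Accept": "application/fhir+json"})

# $match: POST a Parameters resource containing the Patient to match against.
match_body = {
    "resourceType": "Parameters",
    "parameter": [{
        "name": "resource",
        "resource": {
            "resourceType": "Patient",
            "name": [{"given": ["John", "Jacob", "Jingerheimer"]}]
        }
    }]
}
match = requests.post(
    BASE + "/Patient/$match",
    json=match_body,
    headers={"Content-Type": "application/fhir+json"})

How the server scores and ranks the $match candidates is, as described above, implementation specific.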
But patient matching is an "art". Be careful: a "false positive" (with a high "score") is, or could be, a very big deal.
But as mentioned by Lloyd, you have a little more flexibility with your implementation of $match.
I have worked on 2 different "teams".
On one team, we never let "out the door" anything that was below an 80% match-score. (How you determine a match-score is a deeper discussion.)
On another team, we made $match work as "IF you give me enough information to find a SINGLE match, I'll give it to you", but if not, we told people "not enough info to match a single patient".
Patient matching is HARD. Do not let anyone tell you otherwise.
At HIMSS and other events, when people show a demo of moving data, I always ask "how did you match this single person on this side... as being that person on the other side?"
As in, "without patient matching... a lot of workflows fall apart at the get-go".
Side note, I actually reported a bug with the MS-FHIR-Server (which the team fixed very quickly) (for SEARCH) here:
https://github.com/microsoft/fhir-server/issues/760
"name": [
{
"use": "official",
"family": "Kirk",
"given": [
"James",
"Tiberious"
]
},
Sidenote:
The Hapi-Fhir object to represent this is "ca.uhn.fhir.rest.param.TokenAndListParam"
Sidenote:
There is a feature request for Patient Match on the MS-FHIR-Server GitHub page:
https://github.com/microsoft/fhir-server/issues/943

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file) with a sufficient amount of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in the training file work fine. But if any input query has a spelling mistake, e.g. hotol/hetel/hotele for the hotel keyword, then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only the training data, and I am not allowed to write any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be part of the default pipeline described in the Rasa documentation.
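For reference, a character-level setup typically looks something like the following in config.yml (a sketch assuming a reasonably recent Rasa version; the surrounding components are elided and will depend on your pipeline):

pipeline:
  # ... tokenizer and other components ...
  - name: CountVectorsFeaturizer
    analyzer: char_wb   # character n-grams within word boundaries
    min_ngram: 1
    max_ngram: 4
  # ... intent classifier / entity extractor ...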
One thing I would highly suggest is using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (by searching for typo variations of those entities). There's a whole blog post about it on the Rasa blog.
There's a working implementation of fuzzywuzzy as a custom component:
import os
import json

from fuzzywuzzy import process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # use rasa_nlu.components on older Rasa versions

# STOP_WORDS is just a collection of stop words from NLTK
STOP_WORDS = set(stopwords.words('english'))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities'))

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')

        for token in tokens:
            if token.text not in STOP_WORDS:
                # Fuzzy-match the token against every lookup entry
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value'] if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
But I didn't implement it myself; it was implemented and validated here: Rasa forum
Then you just add it to your NLU pipeline in the config.yml file.
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents; use some kind of automated tool for generating the training data, e.g. one that generates misspelled words (typos). A rough sketch of that idea follows below.
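As an illustration (my own sketch, not the linked tool), the following generates simple deletion and transposition typos for an entity value and prints training lines using the synonym syntax shown above:

import random

def typo_variants(word, n=3):
    # Simple misspellings via character deletion and adjacent swaps.
    variants = set()
    for i in range(1, len(word) - 1):            # deletions: "hotel" -> "htel", "hoel", ...
        variants.add(word[:i] + word[i + 1:])
    for i in range(len(word) - 1):               # swaps: "hotel" -> "ohtel", "htoel", ...
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return random.sample(sorted(variants), min(n, len(variants)))

templates = [
    "- find good [{}](place:hotel) for me in Mumbai",
    "- looking for a [{}](place:hotel) in Chennai",
    "- [{}](place:hotel) in Berlin please",
]

for typo in typo_variants("hotel"):
    print(random.choice(templates).format(typo))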
First of all, add samples for the most common typos for your entities as advised here
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not you need to create a custom component. Otherwise, dealing with only training data is not feasible. You can't create samples for each typo.
Using fuzzywuzzy is one of the ways; generally it is slow, and it doesn't solve all the issues.
The Universal Sentence Encoder is another solution.
There should be more options for spell correction, but you will need to write code either way.
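For instance, a very small pre-processing corrector along the fuzzywuzzy lines (with a made-up vocabulary; in practice it would come from your lookup tables) could normalise the message before it reaches the NLU pipeline:

from fuzzywuzzy import fuzz, process

# Hypothetical domain vocabulary.
VOCAB = ["hotel", "restaurant", "flight", "mumbai", "chennai", "berlin"]

def correct(message, threshold=80):
    # Replace each word with its closest vocabulary entry if the match is strong enough.
    corrected = []
    for word in message.lower().split():
        best, score = process.extractOne(word, VOCAB, scorer=fuzz.ratio)
        corrected.append(best if score >= threshold else word)
    return " ".join(corrected)

print(correct("find a good hotol in Mumbai"))  # -> "find a good hotel in mumbai"

As noted above, this is slow for large vocabularies and will not catch every error.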

How to set up training and feature template files for NER? - CRF++

For the problem of named entity recognition,
After tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is a POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?
What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the crf++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences), not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
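For orientation, a minimal template in CRF++'s own syntax looks something like the sketch below, where %x[row,col] picks up the feature at a relative row offset and a column index (here column 0 is the word and column 1 the POS tag):

# Unigram features over a window of words, plus the current POS tag
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]

# Bigram feature: combines the previous and the current output label
B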
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
import spacy

nlp = spacy.load('en')
names = ['John', 'Jane']  # etc...

text = nlp("John ate an apple.")
for word in text:
    person = 'O'  # default: not a person
    if str(word) in names:
        person = 'B-PERSON'
    print(str(word), word.pos_, person)

Rewriting sentences while retaining semantic meaning

Is it possible to use WordNet to rewrite a sentence so that the semantic meaning of the sentence still stays the same (or mostly the same)?
Let's say I have this sentence:
Obama met with Putin last week.
Is it possible to use WordNet to rephrase the sentence into alternatives like:
Obama and Putin met the previous week.
Obama and Putin met each other a week ago.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms?
For example:
Obama met Putin the previous week.
If the question is about the possibility of using WordNet to do sentence paraphrasing: it is possible, but it needs a lot of grammatical/syntactic components. You would need a system that can:
First get the individual semantics of the tokens and parse the sentence for its syntax.
Then understand the overall semantics of the composite sentence (especially if it's metaphorical)
Then rehash the sentence with some grammatical generator.
Up till now, the only thing I know of that can do something like that is the ACE parser/generator, and it takes a LOT of hacking of the system to make it work as a paraphrase generator. http://sweaglesw.org/linguistics/ace/
So to answer your questions,
Is it possible to use WordNet to rephrase the sentence into alternatives? Sadly, WordNet isn't a silver bullet. You will need more than semantics for a paraphrase task.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms? Yes, this is possible. BUT figuring out which synonyms are replaceable is hard... And you would also need some morphology/syntax component.
First you will run into a problem of multiple senses per word:
from nltk.corpus import wordnet as wn

sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    print i, len(possible_senses), possible_senses
[out]:
Obama 0 []
met 13 [Synset('meet.v.01'), Synset('meet.v.02'), Synset('converge.v.01'), Synset('meet.v.04'), Synset('meet.v.05'), Synset('meet.v.06'), Synset('meet.v.07'), Synset('meet.v.08'), Synset('meet.v.09'), Synset('meet.v.10'), Synset('meet.v.11'), Synset('suffer.v.10'), Synset('touch.v.05')]
Putin 1 [Synset('putin.n.01')]
the 0 []
previous 3 [Synset('previous.s.01'), Synset('former.s.03'), Synset('previous.s.03')]
week 3 [Synset('week.n.01'), Synset('workweek.n.01'), Synset('week.n.03')]
Then even if you know the sense (let's say the first sense), you get multiple words per sense, and not every word can be replaced in the sentence. Moreover, they are in the lemma form, not a surface form (e.g. verbs are in their base form and nouns are in the singular):
from nltk.corpus import wordnet as wn

sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    if possible_senses:
        print i, possible_senses[0].lemma_names
    else:
        print i
[out]:
Obama
met ['meet', 'run_into', 'encounter', 'run_across', 'come_across', 'see']
Putin ['Putin', 'Vladimir_Putin', 'Vladimir_Vladimirovich_Putin']
the
previous ['previous', 'old']
week ['week', 'hebdomad']
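To make the limitation concrete, here is a small sketch (my own, written for Python 3 and a recent NLTK where lemma_names() is a method) that naively swaps each word for the first lemma of its first synset; the output shows why you still need sense selection and surface-form generation:

from nltk.corpus import wordnet as wn

def naive_paraphrase(sentence):
    # Replace each word with the first lemma of its first synset, if it has one.
    out = []
    for word in sentence.split():
        synsets = wn.synsets(word)
        if synsets:
            out.append(synsets[0].lemma_names()[0].replace('_', ' '))
        else:
            out.append(word)
    return ' '.join(out)

print(naive_paraphrase("Obama met Putin the previous week"))
# e.g. "Obama meet Putin the previous week" -- the past tense is lost and the sense choice is arbitrary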
One approach is grammatical analysis with NLTK (read more here) and, after the analysis, converting your sentence into active voice or passive voice.

Efficient algorithm for association Obj1 to Obj2 based on a rule set

I have a table containing millions of transaction records (Obj1), which looks like this:
TransactionNum  Country  ZipCode  State  TransactionAmount
1               USA      94002    CA     1000
2               USA      00023    FL     1000
I have another table containing sales rep records (Obj2), again in the hundreds of thousands.
SalesrepId  PersonNumber  Name
Srp001      123           Rohan
Srp002      124           Shetty
I have a few ruleset tables, where rules are basically defined as below:
Rule Name : Rule 1
Qualifying criteria : Country = "USA" and (ZipCode = 94002 or State = "FL")
Credit receiving salesreps :
Srp001 gets 70%
Srp002 gets 30%
The qualifying criteria apply to the transactions, which means that if the transaction attributes match the criteria in the rule, then credits are assigned to the sales reps defined in the rule's credit receiver section.
Now, I need an algorithm which populates a result table as below:
ResultId  TransactionNumber  SalesrepId  Credit
1         1                  Srp001      700
2         2                  Srp002      300
What is the efficient algorithm to do this?
So your real problem is how to quickly match transactions to potential rules. You can do this with an inverted index that says which rules match particular values for the attributes. For example, let's say you have these three rules:
Rule 1: if Country = "USA" and State = "FL"
S1 gets 100%
Rule 2: if Country = "USA" and (State = "CO" or ZIP = 78640)
S2 gets 60%
S3 gets 40%
Rule 3: if Country = "UK"
S3 gets 70%
S2 gets 30%
Now, you process your rules and create output like this:
Country,USA,Rule1
State,FL,Rule1
Country,USA,Rule2
State,CO,Rule2
ZIP,78640,Rule2
Country,UK,Rule3
You then process that output (or you can do it while you're processing the rules) and build three tables. One maps Country values to rules, one maps State values to rules, and one maps ZIP values to rules. You end up with something like:
Countries:
USA, {Rule1, Rule2}
UK, {Rule3}
States:
FL, {Rule1}
CO, {Rule2}
"*", {Rule3}
ZIP:
78640, {Rule2}
"*", {Rule1, Rule3}
The "*" value is a "don't care," which will match all rules that don't specifically mention that field. Whether this is required depends on how you've structured your rules.
The above indexes are constructed whenever your rules change. With 4000 rules, it shouldn't take any time at all, and the list size shouldn't be very large.
Now, given a transaction that has a Country value of "USA", you can look in the Countries table to find all the rules that mention that country. Call that list Country_Rules. Do the same thing for States and ZIP codes.
You can then do a list intersection. That is, build another list called Country_And_State_Rules that contains only those rules that exist in both the Country_Rules and State_Rules lists. That will typically be a small set of possible rules. You could then go through them one-by-one, testing country, state, and ZIP code, as required.
What you're building is essentially a search engine for rules. It should allow you to narrow the candidates from 4,000 to just a handful very quickly.
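A minimal sketch of that index-and-narrow idea (my own field and rule names, based on the three example rules above; for simplicity it unions the per-field hits instead of intersecting with "don't care" entries, and then verifies each candidate's full predicate):

from collections import defaultdict

# Each rule: the (field, value) pairs it mentions (used only to build the index),
# a predicate over the transaction, and the credit split for the sales reps.
RULES = {
    "Rule1": {
        "mentions": [("Country", "USA"), ("State", "FL")],
        "predicate": lambda t: t["Country"] == "USA" and t["State"] == "FL",
        "credits": {"S1": 1.0},
    },
    "Rule2": {
        "mentions": [("Country", "USA"), ("State", "CO"), ("ZipCode", "78640")],
        "predicate": lambda t: t["Country"] == "USA"
                               and (t["State"] == "CO" or t["ZipCode"] == "78640"),
        "credits": {"S2": 0.6, "S3": 0.4},
    },
    "Rule3": {
        "mentions": [("Country", "UK")],
        "predicate": lambda t: t["Country"] == "UK",
        "credits": {"S3": 0.7, "S2": 0.3},
    },
}

def build_index(rules):
    # Inverted index: (field, value) -> set of rule names that mention that value.
    index = defaultdict(set)
    for name, rule in rules.items():
        for field, value in rule["mentions"]:
            index[(field, value)].add(name)
    return index

INDEX = build_index(RULES)

def credits_for(transaction):
    # Narrow the candidates via the index, then verify each candidate's full predicate.
    candidates = set()
    for field in ("Country", "State", "ZipCode"):
        candidates |= INDEX.get((field, transaction[field]), set())
    results = []
    for name in candidates:
        rule = RULES[name]
        if rule["predicate"](transaction):
            for rep, share in rule["credits"].items():
                results.append((transaction["TransactionNum"], rep,
                                transaction["TransactionAmount"] * share))
    return results

txn = {"TransactionNum": 2, "Country": "USA", "ZipCode": "00023",
       "State": "FL", "TransactionAmount": 1000}
print(credits_for(txn))  # -> [(2, 'S1', 1000.0)]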
There are a few problems that you'll have to solve. Having conditional logic ("OR") complicates things a little bit, but it's not intractable. Also, you have to determine how to handle ambiguity (what if two rules match?). Or, if no rules match the particular Country and State, then you have to back up and check for rules that only match the Country ... or only match the State. That's where the "don't care" comes in.
If your rules are sufficiently unambiguous, then in the vast majority of cases you should be able to pick the relevant rule very quickly. Some few cases will require you to search many different rules for some transactions. But those cases should be pretty rare. If they're frequent, then you need to consider re-examining your rule set.
Once you know which rule applies to a particular transaction, you can easily look up which salesperson gets how much, since the proportions are stored with the rules.
