I'm trying to extract relation triples from Stanford CoreNLP, and it's working very well for single relation triples in a sentence but doesn't seem to work for multiple ideas in the same sentence.
For example: I drink water, and he eats a cake.
I would expect two triples, (I, drink, water) and (he, eats, cake), but only one shows up.
Here's what I'm currently working with:
import corenlp

text = "I drink water, and he eats a cake."

with corenlp.CoreNLPClient(annotators="tokenize ssplit lemma pos ner depparse natlog openie".split()) as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0].openieTriple
    for x in ann.sentence:
        print(x.openieTriple)
I assume I'm doing something wrong here. Changing max_entailments doesn't fix the problem.
You must do:
for x in ann.sentence:
    for triple in x.openieTriple:
        print(triple)
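For completeness, here is a fuller sketch that prints each triple's parts; the subject, relation, and object field names are taken from my reading of the CoreNLP protobuf, so double-check them against your client version:

import corenlp

text = "I drink water, and he eats a cake."
annotators = "tokenize ssplit lemma pos ner depparse natlog openie".split()

with corenlp.CoreNLPClient(annotators=annotators) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        # openieTriple is a repeated field, so iterate over it
        for triple in sentence.openieTriple:
            # subject/relation/object field names assumed from CoreNLP.proto
            print(triple.subject, triple.relation, triple.object)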
Discovered this today thanks to your question, so thanks!
CoreNLP is returning more triples than we'd expect, i.e. whole sentences or phrases instead of the one or two triples that constitute the essential or basic information conveyed by the sentence.
For example, in the sentence:
"The preliminary diagnosis was notified to Dr. Tom by Roy Coy MD at
16:00 CDT on 11/11/2011."
We expect this triple:
preliminary diagnosis; be notify to; Dr. Tom
But we get triples like these:
1.0 diagnosis be notify by Roy Coy MD at 16:00 cdt on 11/11/2011
1.0 diagnosis be notify to Dr. Tom at 16:00 cdt on 11/11/2011
1.0 preliminary diagnosis be notify to Dr. Tom
which, in addition to the basic information, contain additional details. In an extreme case, CoreNLP returns the whole original sentence.
What arguments could we change in order to reduce the CoreNLP output to basic triples? We have experimented with the maximum number of entailments and the strict-triple setting, but they don't help. We could provide a file with the full list of triples.
Java code:
java -mx1g -cp stanford-openie.jar;stanford-openie-models.jar;slf4j-api.jar edu.stanford.nlp.naturalli.OpenIE -openie.max_entailments_per_clause=1 -openie.triple.strict=true -openie.splitter.disable=true
This is actually by design. It's not always clear a priori what level of granularity people would like from OpenIE systems, so our system tries to produce all the levels of granularity it can. The intended use is to produce triples that can be looked up in a database. So, if someone asks a very specific query, the longer triples are returned. If someone asks a simple query, we return the simple triples (and it doesn't matter that there are some longer ones in there alongside them).
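If you still only want the most compact facts, one pragmatic option (my suggestion, not a CoreNLP feature) is to post-process the output yourself, e.g. keep only the shortest object for each subject/relation pair. A minimal sketch, assuming you have already loaded the triples as plain (confidence, subject, relation, object) tuples:

# Hypothetical post-processing, not part of CoreNLP: keep the shortest object
# seen for each (subject, relation) pair.
def keep_shortest(triples):
    best = {}
    for conf, subj, rel, obj in triples:
        key = (subj, rel)
        if key not in best or len(obj) < len(best[key][3]):
            best[key] = (conf, subj, rel, obj)
    return list(best.values())

# Example triples shaped like the output quoted in the question.
triples = [
    (1.0, "diagnosis", "be notify by", "Roy Coy MD at 16:00 cdt on 11/11/2011"),
    (1.0, "diagnosis", "be notify to", "Dr. Tom at 16:00 cdt on 11/11/2011"),
    (1.0, "preliminary diagnosis", "be notify to", "Dr. Tom"),
]
for conf, subj, rel, obj in keep_shortest(triples):
    print(conf, subj, rel, obj)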
I have a question regarding how CoreNLP assigns parentheses to phrases en route to accumulating an overall sentence score. The main question is the ORDER in which it calculates the sentiment of phrases in a sentence. Does anyone know what algorithm is used? An example will clearly illustrate my question:
In my training model, the scale I am using is 0-4, where 0 is negative, 2 is neutral, and 4 is positive, so the following phrase is scored: (3 (1 lower) (2 (2 oil) (2 production)))
Note: the reason for the jump to positive is that we are predicting oil prices, and lower oil production will lead to higher prices, so a proper prediction of the price of oil increasing would need an overall positive sentiment.
Next, let's assume the following tweet was grabbed: "OPEC decides to lower oil production". I assume the first thing CoreNLP does is assign each individual word a score. In our training model, "lower" has a score of 1 and all other words are not scored, so they will receive a neutral score.
The problem seems to stem from how CoreNLP decides to score phrases (groups of words). If the first thing it did was score "oil production", then score "lower oil production", it would see we have an exact phrase match of "lower oil production" in our model and properly assign a score of 3.
However, what I'm guessing happens is this: first CoreNLP scores "OPEC decides", then "OPEC decides to", then "OPEC decides to lower", then "OPEC decides to lower oil", then "OPEC decides to lower oil production". In this instance, the phrase "lower oil production" is never considered in a vacuum; because there are no phrases matching our training model, the individual word scores decide the overall sentiment, and it gets a score of 1 due to "lower."
The only solution for this would be for someone to tell me the exact bracketing algorithm that CoreNLP uses to group and score phrases. Thanks for the help!
Stanford CoreNLP runs a constituency parser on the sentence. Then it turns the constituency tree into a binary tree with TreeBinarizer.
This is the relevant class:
edu.stanford.nlp.parser.lexparser.TreeBinarizer
Here is a link to source code on GitHub:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/parser/lexparser/TreeBinarizer.java
Here is the source code of where that TreeBinarizer is set up:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/ParserAnnotator.java
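In other words, the grouping is determined by the constituency parse (after binarization), not by scanning the sentence left to right; the sentiment model then scores each node of that binarized tree bottom-up. If you want to see the exact bracketing your sentence gets, something along these lines may help; note that the Python client usage and the protobuf field names binarizedParseTree, value, and child are my assumptions here, so check CoreNLP.proto for your version:

import corenlp

def bracketed(tree):
    # Render a ParseTree proto as an s-expression so the grouping is visible.
    if not tree.child:
        return tree.value
    return "(" + tree.value + " " + " ".join(bracketed(c) for c in tree.child) + ")"

text = "OPEC decides to lower oil production"
annotators = "tokenize ssplit pos parse sentiment".split()
with corenlp.CoreNLPClient(annotators=annotators) as client:
    ann = client.annotate(text)
    # The sentiment model walks this binarized tree bottom-up, so the
    # bracketing printed here shows exactly which sub-phrases get scored.
    print(bracketed(ann.sentence[0].binarizedParseTree))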
Is there any algorithm or standard to verify customer names in different formats?
I mean,
J. Smith
John Smith
John L. Smith
J. Louis Smith
John Louis S.
These could all be the same person and should pass the validation.
Thanks
The accepted answer of "Figure out if a business name is very similar to another one - Python" will definitely help you out, as I myself have worked on a very similar approach to normalize names.
Note that a single standalone metric is not going to suffice. An ensemble approach will have to be implemented, taking character n-gram matching, edit distance, and so on into account, which ultimately returns a strength for the matched words. Devise a formula for calculating the strength of your matched keywords, and once your list of names is exhausted, re-run the algorithm for the names/words whose strength is below a threshold you set. This makes those names resonate to some other cluster of names where the match/strength value is stronger.
You will also have to watch out for the precision/recall trade-off. With the above approach, I have seen that the precision is very good but the recall is not that great.
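For illustration, here is a minimal sketch of that kind of ensemble token scoring, using only the Python standard library; the initial-handling rule, the greedy token matching, and the plain average are illustrative assumptions, not a validated formula:

import difflib

def token_score(a, b):
    # Similarity between two name tokens; initials ("J.") match on first letter.
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if len(a) == 1 or len(b) == 1:
        return 1.0 if a[0] == b[0] else 0.0
    return difflib.SequenceMatcher(None, a, b).ratio()

def name_match_strength(name1, name2):
    # Match each token of the shorter name to its best counterpart and average.
    t1, t2 = name1.split(), name2.split()
    short, long_ = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    return sum(max(token_score(s, l) for l in long_) for s in short) / len(short)

base = "John Louis Smith"
for candidate in ["J. Smith", "John Smith", "John L. Smith",
                  "J. Louis Smith", "John Louis S.", "Jane Doe"]:
    print(candidate, round(name_match_strength(base, candidate), 2))

The matching names all score at or near 1.0, while an unrelated name like "Jane Doe" scores much lower, so a threshold on this strength can drive the validation.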
I read this problem in a book (Interview Question), and wanted to discuss this problem, in detail over here. Kindly throw some lights on it.
The problem is as follows:-
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address, and Social Security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data across multiple columns, was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example set, and see what it actually takes to do this anonymization. I hope the question is clear.
I have no experienced person who can help me deal with such problems. Kindly don't vote to close this question, as I would be helpless if that happens.
Thanks, and if any more explanation of the question is required, please ask.
I just copy-pasted part of your text and stumbled upon the following, which helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need some database to work with, and to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep in the database only the sex (M/F), then there is no way to find out who is who, because there are only two possible values: M and F.
But if you also take the birthdate, then your total number of possible combinations becomes roughly 2*365*80 ≈ 58,000 (I chose 80 years). Even if your database contains 500,000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with such an entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more sophisticated, look into correlated information and PCA.
Edit: Let's give an example. Suppose I'm working with medical data. If I keep only:
The sex: 2 possibilities (M, F)
The blood group: 4 possibilities (O, A, B, AB)
The rhesus: 2 possibilities (+, -)
The state they're living in: 50 possibilities (if you're in the USA)
The month of birth: 12 possibilities (affects death rate of babies)
Their age category: 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total of 2*4*2*50*12*10 = 96,000 categories. Thus, if your database contains 200,000,000 entries (a rough approximation of the number of inhabitants of the USA in your database), there is NO WAY you can identify someone.
This also implies that you do not give out any further information: no ZIP code, etc. With only the 6 attributes given, you can compute some nice statistics (do people born in December live longer?), but no identification is possible because 96,000 is much smaller than 200,000,000.
However, if you only have the database of the city you live in, which has, for example, 200,000 inhabitants, then you cannot guarantee anonymization, because 200,000 is "not much bigger" than 96,000. ("Not much bigger" is a truly complex scientific term that requires knowledge of probabilities :P )
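To make the k-anonymity idea above concrete, here is a small sketch with a toy dataset of my own (the records are invented for illustration); it counts how many rows share each combination of quasi-identifiers, and the dataset is k-anonymous for the smallest such count:

from collections import Counter

# Toy records: (sex, blood_group, rhesus, state, birth_month, age_band)
records = [
    ("M", "O", "+", "MA", 3, "30-39"),
    ("F", "A", "-", "MA", 3, "30-39"),
    ("M", "O", "+", "MA", 3, "30-39"),
    ("M", "AB", "+", "NY", 7, "60-69"),
]

def k_anonymity(rows):
    # Every record falls into the group of rows sharing its quasi-identifiers;
    # the dataset's k is the size of the smallest such group.
    groups = Counter(rows)
    return min(groups.values())

print(k_anonymity(records))  # 1 here: some records are unique, so they are identifiable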
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one on your own, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving it.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The document above was written for the rules of the Canadian public health system, but its concepts are applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.