Can anyone point out the algorithm(s) used by the openNLP NameFinder module?
The code is complex and only sparsely documented, and playing with it as a black box (with the default model provided) gives me the impression that it is mostly heuristic.
Here are some examples of input and output:
Input:
John Smith is frustrated.
john smith is frustrated.
Barak Obama is frustrated.
Hugo Chavez is frustrated. (no more)
Jeff Atwood is frustrated.
Bing Liu is frustrated with openNLP NER module.
Noam Chomsky is frustrated with the world.
Jayden Smith is frustrated.
Smith Jayden is frustrated.
Lady Gaga is frustrated.
Ms. Gaga is frustrated.
Mrs. Gaga is frustrated.
Jayden is frustrated.
Mr. Liu is frustrated.
Output (I changed the angle brackets to square brackets):
[START:person] John Smith [END] is frustrated.
john smith is frustrated.
[START:person] Barak Obama [END] is frustrated.
Hugo Chavez is frustrated. (no more)
[START:person] Jeff Atwood [END] is frustrated.
Bing Liu is frustrated with openNLP NER module.
[START:person] Noam Chomsky [END] is frustrated with the world.
Jayden [START:person] Smith [END] is frustrated.
[START:person] Smith [END] [START:person] Jayden [END] is frustrated.
Lady Gaga is frustrated.
Ms. Gaga is frustrated.
Mrs. Gaga is frustrated.
Jayden is frustrated.
Mr. Liu is frustrated.
It seems that the model simply learns a fixed list of names that are annotated in the training data and allows some tiling and combinations.
Two notable false-negative (FN) examples:
1. Strong name indicators such as Mr. and Mrs. are ignored.
2. Jayden (the #4 most popular name in the US in 2011) wasn't identified, while the following 'Smith' (in "Jayden Smith...") was. I suspect the model "thinks" the capitalized Jayden at the beginning of the sentence is capitalized because it starts the sentence, not because it is an NE. Reversing the order to "Smith Jayden" to test this, openNLP identifies it as two distinct NEs, unlike other full names such as "John Smith", maybe suggesting that 'Smith' is in a last-names list...
I'm puzzled and frustrated, and if anyone could point me to the algorithm (or verify that it sucks) I'll be thankful.
P.S. Both the Stanford and UIUC NER systems perform much better, with some subtle differences that are interesting but off topic (this question is too long as it is).
As the name implies, NameFinderME uses a maximum entropy (ME) model. Here is the seminal paper on ME.
If OpenNLP's performance does not meet your requirements and you cannot use the Stanford or UIUC NERs, I recommend trying Mallet with a CRF. This sample code should get you started.
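To make the ME part concrete, here is a minimal sketch of a per-token maximum-entropy classifier in Python, using scikit-learn's LogisticRegression (multinomial logistic regression is an ME model). The features are invented for illustration and are not OpenNLP's actual feature templates, but they show how a learned model can discount capitalization at the start of a sentence, which matches the "Jayden" behavior above.

# Illustrative sketch only: a per-token maximum-entropy classifier in the
# spirit of NameFinderME. The feature set below is invented and is NOT
# OpenNLP's actual one.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    # Contextual features. Note that "capitalized" is uninformative at
    # position 0, where every token is capitalized anyway.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "capitalized": tok[0].isupper(),
        "sentence_start": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Tiny toy training set: (tokens, per-token labels).
train = [
    (["John", "Smith", "is", "frustrated", "."],
     ["person", "person", "other", "other", "other"]),
    (["The", "smith", "works", "."],
     ["other", "other", "other", "other"]),
]

X, y = [], []
for tokens, labels in train:
    for i, label in enumerate(labels):
        X.append(token_features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["Jayden", "Smith", "is", "frustrated", "."]
feats = vec.transform([token_features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(feats))))

The real NameFinderME also decodes a whole tag sequence (a beam search over start/continue/other outcomes) rather than classifying tokens independently, but the feature-weighting idea is the same.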
Could anyone point me in the right direction on how to use the GPT-J model for text paraphrasing? Generating text is easy, but paraphrasing?
Do I need to fine-tune it on a paraphrasing dataset, or could I just use few-shot learning?
GPT does only one thing: completing the input you provide it with. This means the main lever you have for controlling GPT is the input.
A good way of approaching a given use case is to explicitly write out what the model's task should be, insert the needed variables, and initialize the task.
For your use case, this would look something like the following (actual demo using GPT-J):
Input:
Paraphrase the sentence.
Sentence: The dog was scared of the cat.
Paraphrase:
Output:
Paraphrase the sentence.
Sentence: The dog was scared of the cat.
Paraphrase: The cat scared the dog.
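If you want to reproduce this locally, here is a minimal sketch using the Hugging Face transformers library and the public EleutherAI/gpt-j-6B checkpoint. The sampling values are illustrative assumptions, not tuned recommendations, and the 6B model needs a lot of memory.

# Minimal sketch: running the prompt above through GPT-J with Hugging Face
# transformers. Assumes enough RAM/VRAM for the 6B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "Paraphrase the sentence.\n"
    "Sentence: The dog was scared of the cat.\n"
    "Paraphrase:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,  # illustrative value
    top_p=0.9,        # illustrative value
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))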
GPT-J is very good at paraphrasing content. In order to achieve this, you have to do 2 things:
Properly use few-shot learning (aka "prompting")
Play with the top p and temperature parameters
Here is a few-shot example you could use:
[Original]: Algeria recalled its ambassador to Paris on Saturday and closed its airspace to French military planes a day later after the French president made comments about the northern Africa country.
[Paraphrase]: Last Saturday, the Algerian government recalled its ambassador and stopped accepting French military airplanes in its airspace. It happened one day after the French president made comments about Algeria.
###
[Original]: President Macron was quoted as saying the former French colony was ruled by a "political-military system" with an official history that was based not on truth, but on hatred of France.
[Paraphrase]: Emmanuel Macron said that the former colony was lying and angry at France. He also said that the country was ruled by a "political-military system".
###
[Original]: The diplomatic spat came days after France cut the number of visas it issues for citizens of Algeria and other North African countries.
[Paraphrase]: Diplomatic issues started appearing when France decided to stop granting visas to Algerian people and other North African people.
###
[Original]: After a war lasting 20 years, following the decision taken first by President Trump and then by President Biden to withdraw American troops, Kabul, the capital of Afghanistan, fell within a few hours to the Taliban, without resistance.
[Paraphrase]:
Depending on whether you want GPT-J to stick to the original or be more creative, you should decrease or increase top p and temperature, respectively.
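As a sketch of how you might run this few-shot prompt yourself, again assuming the Hugging Face transformers library and the public EleutherAI/gpt-j-6B checkpoint (the exact top p and temperature values below are just starting points):

# Sketch: sampling a paraphrase from the few-shot prompt above and cutting
# the completion at the "###" separator. The top_p/temperature values are
# illustrative; lower them to stay literal, raise them to be creative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

few_shot_prompt = "..."  # the [Original]/[Paraphrase]/### examples above

inputs = tokenizer(few_shot_prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens and keep the first paraphrase;
# the model tends to continue with another "###" block.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.split("###")[0].strip())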
I actually wrote an article about few-shot learning with GPT-J that you might find useful: effectively using GPT-J with few-shot learning
So now I have a list of commodities bought by many different people:
1. bread, beer, egg, apple
2. carrot, water, glasses
3. apple, egg, bottle
4. meat, egg, soup, juice
5. water, carrot, beer
6. apple, carrot, water
....
I want to know which commodity combo is most popular.
The output for my example would likely be this:
carrot, water
because they are bought together more often than any other combination.
I know the algorithm probably belongs to data mining.
However, I don't know what the keyword is.
I only need the keyword (maybe the algorithm name) and I will do the research by myself!
Thank you all. :)
You are looking for the subdomain known as Frequent Itemset Mining, in particular the APRIORI algorithm.
The lecture Frequent Itemsets from the Stanford CS246 course may help you.
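As a taste of what you will find, here is a minimal plain-Python sketch of the pair-counting step at the heart of APRIORI, using the baskets from the question (the support threshold is an arbitrary choice):

# Minimal sketch of the first Apriori-style step: counting item pairs that
# clear a support threshold. Full Apriori then extends frequent k-itemsets
# to (k+1)-itemsets, pruning any candidate with an infrequent subset.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "beer", "egg", "apple"},
    {"carrot", "water", "glasses"},
    {"apple", "egg", "bottle"},
    {"meat", "egg", "soup", "juice"},
    {"water", "carrot", "beer"},
    {"apple", "carrot", "water"},
]

min_support = 2  # a pair must appear in at least 2 baskets
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(max(frequent_pairs, key=frequent_pairs.get))  # ('carrot', 'water')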
I have a sentence, and someone then edited it. I would like to highlight the changes in the new sentence.
Is there any code or algorithm for this? Please help me.
Ex:
Org: I are a student in national university.
Edited: I am a teacher in high school
Highlight: am teacher high school
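One standard approach is a longest-common-subsequence diff at the word level. Here is a minimal sketch using Python's standard difflib module; the bracket markup is just one way to render the highlight.

# Sketch: word-level diff with Python's difflib, marking words that were
# added or changed in the edited sentence.
import difflib

orig = "I are a student in national university.".split()
edited = "I am a teacher in high school".split()

matcher = difflib.SequenceMatcher(a=orig, b=edited)
highlighted = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    words = edited[j1:j2]
    if op in ("replace", "insert"):
        highlighted.extend(f"[{w}]" for w in words)  # highlight changes
    else:
        highlighted.extend(words)
print(" ".join(highlighted))  # I [am] a [teacher] in [high] [school]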
Take the phrase "A Pedestrian wishes to cross the road".
I learnt English in England and, according to the old rules, the word 'Pedestrian' is a noun. Stanford CoreNLP finds it to be an adjective, regardless of capitalization.
I don't want to contradict the big-brains of Stanford, USA, but that is just wrong. I am new to this semantic stuff, but if the word is tagged as an adjective, the sentence lacks a valid noun phrase.
Have I missed the point of CoreNLP, lost the point of the english language, or should I be seeking more effective analysis tools?
I ask because the example sentence is the very first sentence of my very first processing experiment, and it is most discouraging.
CoreNLP is a statistical analysis tool. It is trained on many texts that have been annotated by pools of human experts. These experts agree in about 90% of cases, so the CoreNLP system cannot beat that percentage, and your sentence is part of the roughly 10% of wrong parses.
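If you want to inspect the tags yourself, here is a quick sketch using Stanza, the Stanford NLP group's Python library. Its models are not the same as the CoreNLP version you used, so the tags it produces may well differ.

# Sketch: inspecting POS tags yourself. Stanza's models differ from the
# CoreNLP version in the question, so the output may differ too.
import stanza

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos")

doc = nlp("A pedestrian wishes to cross the road.")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.xpos)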
For example, given a string:
"Bob went fishing with his friend Jim Smith."
Bob and Jim Smith are both names, but bob and smith are both words. Were it not for the capitalization, there would be little indication of this beyond our knowledge of the sentence. Are there any well-known algorithms for detecting the presence of names, at least Western names?
Take a look at Named Entity Recognition.
http://en.wikipedia.org/wiki/Named_entity_recognition The article links to two good implementations.
I'm not sure if this falls under your definition of grammar analysis, though.
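For a quick experiment, here is a minimal sketch using spaCy, one off-the-shelf NER implementation among many (the small English model is an arbitrary choice; install it with "python -m spacy download en_core_web_sm"):

# Sketch: off-the-shelf NER with spaCy. The entity labels come from the
# model's statistical predictions, so results can vary by model version.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bob went fishing with his friend Jim Smith.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Bob PERSON", "Jim Smith PERSON"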