TokensRegex not merging entity mentions - stanford-nlp

I have noticed that Stanford's default entity mention tags don't merge with my new tags. For example, I want to find and tag telephone numbers such as +1 234-567-8901. Before my rules run, Stanford tags "+1" as NUMBER and "234-567-8901" also as NUMBER, but they remain separate mentions. I have tried creating my own TokensRegex rule to find these and mark them as "TELEPHONE_NUMBER". They are successfully changed to the new tag, but the tokens still remain separate mentions.
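For illustration, a rule along these lines reproduces the behaviour (a simplified sketch rather than my exact rule, and it assumes the number comes out of the tokenizer as the two tokens shown above):

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

# optional "+1"-style country code followed by a ddd-ddd-dddd token
{ pattern: ( /\+?[0-9]{1,3}/? /[0-9]{3}-[0-9]{3}-[0-9]{4}/ ),
  action: (Annotate($0, ner, "TELEPHONE_NUMBER")) }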
I have also noticed that this happens when tagging text together with digits. Say I wanted to tag "my number is 234-567-8901": it tags "my number is" as one mention and "234-567-8901" as a separate one. I have tried using the B-TELEPHONE_NUMBER and I-TELEPHONE_NUMBER tags as mentioned in the documentation, and while the B- and I- prefixes are removed, the pieces still remain separate. I have noticed this behavior only happens with Stanford's own tags; I have tested it with DATE and it does the same thing.
So, my question is, how can I get Stanford's tags to merge with mine? If they can't be merged, is there a way to delete that tag and then add my own tag?
EDIT
Here is another example of the tags not merging:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$CORONA = "/((c[A-Za-z]*?r[A-Za-z]*?a)(.*?v[A-Za-z]*?r[A-Za-z]*?s)?)/"
$COVID = "/(((c[A-Za-z]*?v[A-Za-z]*?d)(.*?19)?|(c[A-Za-z]*?d.*?19)))/"
{ pattern: ( (([/c.*?[vd].*/]) ([/19/])) | ([/c.*?a/] [/v.*?s/]) | ([ $CORONA | $COVID ]) ),
  action: (Annotate($0, ner, "CORONAVIRUS")) }
This sub-pattern is supposed to catch "covid 19":
(([/c.*?[vd].*/]) ([/19/]))
However, "19" is tagged as NUMBER before this rule runs. The rule tags "covid" as CORONAVIRUS and changes the tag on "19" from NUMBER to CORONAVIRUS. It is supposed to then combine the two into one entity mention, but it doesn't.
When it comes across "corona virus", which is also split into two tokens, it does combine the two into one entity mention:
([/c.*?a/] [/v.*?s/])
EDIT 2
I decided to build a small pipeline to replicate the issue. I based it on the example found here: https://stanfordnlp.github.io/CoreNLP/tokensregex.html, with modifications to more closely replicate the pipeline we use. Here is the function:
public static void main(String[] args) throws ClassNotFoundException {
    // set properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    props.setProperty("ner.additional.tokensregex.rules", "basic_ner.rules");
    // build pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // annotate
    Annotation ann = new Annotation("We had to close our business due to covid 19. " +
            "There will be a big announcement by Apple Inc today at 5:00pm. " +
            "She has worked at Miller Corp. for 5 years.");
    pipeline.annotate(ann);
    // show results
    System.out.println("---");
    System.out.println("tokens\n");
    for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
        for (CoreMap mention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
            System.out.println("mention: " + mention);
        }
        System.out.println("---");
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t" + token.ner());
        }
        System.out.println("");
    }
    System.out.println("---");
    System.out.println("matched expressions\n");
    for (CoreMap me : ann.get(MyMatchedExpressionAnnotation.class)) {
        System.out.println(me);
    }
}
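(The basic_ner.rules file also contains rules for the company names in the test sentences; hypothetical rules along these lines, plus the CORONAVIRUS rule from the first edit, would produce the COMPANY tags you see in the output below. This is only a sketch, not the exact file.)

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

# company rules in the style of the documentation example (illustrative only)
{ pattern: ( [{word:"Apple"}] [{word:"Inc"}] ), action: (Annotate($0, ner, "COMPANY")) }
{ pattern: ( [{word:"Miller"}] [{word:/Corp\.?/}] ), action: (Annotate($0, ner, "COMPANY")) }

# coronavirus rule from the first edit
{ pattern: ( (([/c.*?[vd].*/]) ([/19/])) | ([/c.*?a/] [/v.*?s/]) ),
  action: (Annotate($0, ner, "CORONAVIRUS")) }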
Here is the output
---
tokens
mention: covid
mention: 19
---
We O
had O
to O
close O
our O
business O
due O
to O
covid CORONAVIRUS
19 CORONAVIRUS
. O
mention: Apple Inc
mention: today at 5:00pm.
---
There O
will O
be O
a O
big O
announcement O
by O
Apple COMPANY
Inc COMPANY
today DATE
at DATE
5:00 DATE
pm DATE
. DATE
mention: Miller Corp.
mention: 5 years
mention: She
---
She O
has O
worked O
at O
Miller COMPANY
Corp. COMPANY
for O
5 DURATION
years DURATION
. O
---
As you can see, tags like COMPANY are merged into one mention when the tokens are next to each other. However, as mentioned above, "covid 19" remains two separate mentions. This happens when a portion of the tokens already had a tag; in this case "19" was NUMBER.

Related

openNLP NGramModel does not keep the original order of the words?

Here is my simple code using openNLP:
public static void main(String[] args) {
    String text = "This is the original sequence in the text";
    System.out.println(text);
    StringList tokens = new StringList(WhitespaceTokenizer.INSTANCE.tokenize(text));
    System.out.println("Tokens: " + tokens);
    NGramModel nGramModel = new NGramModel();
    nGramModel.add(tokens, 2, 2);
    System.out.println("Total ngrams: " + nGramModel.numberOfGrams());
    for (StringList ngram : nGramModel) {
        System.out.println(nGramModel.getCount(ngram) + " - " + ngram);
    }
}
and it gives the following output:
This is the original sequence in the text
Tokens: [This,is,the,original,sequence,in,the,text]
Total ngrams: 7
1 - [the,text]
1 - [the,original]
1 - [is,the]
1 - [sequence,in]
1 - [This,is]
1 - [original,sequence]
1 - [in,the]
So it does not keep the original order of the words in the sentence? How can I get [This,is] as the very first n-gram, [is,the] as the second, and so on? And if we lose the original ordering of the n-grams, would that hurt?
Thanks for the help!
I think it's important to clarify what your use case is and why you think you need the order preserved.
N-grams are often used in bag-of-words models (which disregard order anyway) and/or in language models, where probability estimates (e.g. based on n-gram counts) are computed at the n-gram level and aggregated using the chain rule.
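If you do need the bigrams in sentence order, one option is to slide over the token array yourself and use NGramModel only for the counts. Here is a minimal sketch, assuming the same OpenNLP classes used in your snippet:

import opennlp.tools.ngram.NGramModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.StringList;

public class OrderedBigrams {
    public static void main(String[] args) {
        String text = "This is the original sequence in the text";
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);

        // the counts still come from NGramModel ...
        NGramModel counts = new NGramModel();
        counts.add(new StringList(tokens), 2, 2);

        // ... but the order comes from walking the token array left to right
        for (int i = 0; i + 1 < tokens.length; i++) {
            StringList bigram = new StringList(tokens[i], tokens[i + 1]);
            System.out.println(counts.getCount(bigram) + " - " + bigram);
        }
    }
}

This prints [This,is] first, then [is,the], and so on, while reporting the same counts; NGramModel is essentially a dictionary of n-grams with counts, so it does not promise to preserve insertion order.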

How to combine two sentences using simplenlg

Given a set of sentences like "John has a cat" and "John has a dog", I would like to create a sentence like "John has a cat and dog".
Can I use simplenlg to do this?
The task you are asking about is called aggregation in Natural Language Generation (NLG). Whilst SimpleNLG does support aggregation with its realisation engine, it will not directly aggregate two strings such as those in your example.
It is possible, however, to use a syntactic parser together with SimpleNLG to perform this task. I will first explain how to generate your target sentence using the SimpleNLG grammar:
import simplenlg.framework.*;
import simplenlg.lexicon.*;
import simplenlg.realiser.english.*;
import simplenlg.phrasespec.*;
import simplenlg.features.*;

public class TestMain {

    public static void main(String[] args) throws Exception {
        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        // Create the SPhraseSpec object (sentence phrase).
        SPhraseSpec p = nlgFactory.createClause();

        // Create a noun phrase and set it as the subject of your sentence
        NPPhraseSpec john = nlgFactory.createNounPhrase("John");
        p.setSubject(john);

        // Create a verb phrase and set it as the verb of your sentence
        VPPhraseSpec have = nlgFactory.createVerbPhrase("have");
        // Note that the verb is "have", not "has". "Have" is the base lemma.
        // The morphology will be handled based on the tense you set (see below).
        p.setVerb(have);

        // Create a determiner 'a'
        NPPhraseSpec a = nlgFactory.createNounPhrase("a");

        // Create two more noun phrases
        // One for cat
        NPPhraseSpec cat = nlgFactory.createNounPhrase("cat");
        // set the determiner
        cat.setDeterminer(a);
        // And one for dog
        NPPhraseSpec dog = nlgFactory.createNounPhrase("dog");
        // set the determiner
        dog.setDeterminer(a);

        // Create a coordinated phrase
        // This tells SimpleNLG that these objects are a collection which should be aggregated
        CoordinatedPhraseElement coord = nlgFactory.createCoordinatedPhrase(cat, dog);
        // Set the coordinated phrase as the object of your sentence
        p.setObject(coord);

        // Print it
        String output = realiser.realiseSentence(p);
        System.out.println(output);
        // => John has a cat and a dog.

        // Now let's see what SimpleNLG can do!
        // Change the tense to past (present was the default)
        p.setTense(Tense.PAST);
        output = realiser.realiseSentence(p);
        System.out.println(output);
        // => John had a cat and a dog.

        // Change the tense to future
        p.setTense(Tense.FUTURE);
        output = realiser.realiseSentence(p);
        System.out.println(output);
        // => John will have a cat and a dog.
    }
}
That is how you work with language in the SimpleNLG realiser. It does not, however, answer your question of aggregating two strings directly. There may be other ways, but my first thought is to use a syntactic parser such as StanfordNLP or spaCy.
I use spaCy in my own work (it is a Python library), so I will show a brief example of what I mean here.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'John has a cat')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
This outputs:
John john PROPN NNP nsubj Xxxx True False
has have VERB VBZ ROOT xxx True True
a a DET DT det x True True
cat cat NOUN NN dobj xxx True False
You can see from the output that each token in the sentence has been marked as a noun, verb, determiner, etc. You could use this information to format the input for SimpleNLG and then aggregate your sentences. I would suggest that the XMLRealiser available in SimpleNLG would be better than just coding the grammar in Java; it takes XML as input.
NLP/NLG work is not trivial. Language is very complex. The above is just one way of approaching such a task. Tools might exist that aggregate based on strings alone, but SimpleNLG is just a surface realiser, so you would have to present it with input data in a suitable format, as shown above.
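To make the parser-plus-realiser idea concrete, here is a rough sketch, kept in plain Java and using only the SimpleNLG calls shown above, of what the aggregation step could look like once a parser has given you the subject, verb and objects of your two sentences. The hard-coded strings are stand-ins for whatever your parser actually extracts:

import simplenlg.framework.*;
import simplenlg.lexicon.*;
import simplenlg.realiser.english.*;
import simplenlg.phrasespec.*;

public class AggregateParsed {
    public static void main(String[] args) {
        // Stand-ins for what a parser (spaCy, CoreNLP, ...) would extract from
        // "John has a cat" and "John has a dog": same subject and verb, different objects.
        String subject = "John";
        String verb = "have";
        String[] objects = {"cat", "dog"};

        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        SPhraseSpec p = nlgFactory.createClause();
        p.setSubject(nlgFactory.createNounPhrase(subject));
        p.setVerb(nlgFactory.createVerbPhrase(verb));

        // Coordinate the objects so SimpleNLG aggregates them with "and"
        CoordinatedPhraseElement coord = nlgFactory.createCoordinatedPhrase();
        for (String obj : objects) {
            NPPhraseSpec np = nlgFactory.createNounPhrase(obj);
            np.setDeterminer("a");
            coord.addCoordinate(np);
        }
        p.setObject(coord);

        System.out.println(realiser.realiseSentence(p));
        // expected output, as in the example above: John has a cat and a dog.
    }
}

The design point is the CoordinatedPhraseElement: once the sentences share a subject and verb, SimpleNLG does the "and" aggregation of the objects for you; the hard part that remains is extracting those shared pieces with the parser.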

Separate characters and numbers following specific rules

I am trying to distinguish flight numbers.
Example:
flightno = "FR556"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # FR
second = split_data[2] # 556
I then go on to query the database to find an airline based on the "FR" in this example and apply some logic with the result, which is Ryanair.
My problem is when the flight number might be:
flightno = "U21920"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # U
second = split_data[2] # 21920
I basically want first to be "U2", not just "U". This is used to search the database of airlines by their IATA code, which in this case is U2.
EDIT
In the interest of clarity: I made some mistakes in terminology when asking my question. Due to the complexities of booking reference numbers, the input is taken from whatever the passenger provides. For an easyJet flight, for example, the passenger may input EZY1920 or U21920, since the airline may provide either; the passenger doesn't really know the difference.
"EZY" = ICAO
"U2" = IATA
I take the input from the user and try to separate the ICAO or IATA code from the flight number ("1920"), but there is no way of determining that without searching the database or asking the user to enter the airline code and the number separately, which I feel is cumbersome from a user-experience point of view.
Using a regex to separate characters from numbers works until the user inputs an IATA code as part of their flight number (the passenger won't know the difference), and as you can see in the example above, this confuses the regex.
The trouble is I can't think of any other pattern for flight numbers. The airline code always has at least two characters, made up of just letters or a mixture of a letter and a number, and can be three characters in length. The numeric part can be as short as 1 digit but can also be as long as 4 digits.
EDIT
As has been mentioned in the comments, there is no fixed size; however, one thing that is always true (at least so far) is that the first character will always be a letter, regardless of whether the code is ICAO or IATA.
After considering everybody's input so far, I'm wondering whether searching the database and returning airlines whose IATA or ICAO code matches the first two letters provided by the user (U2, FR, EZ) might be one way to go. However, this runs into obvious problems should an ICAO or IATA code be released that matches another airline's, for example "EZY" and "EZT". This is not future-proof, and I'm looking for better Ruby or regex solutions.
Appreciate your input.
EDIT
I have answered my own question below. While the other answers provide a solution for handling some conditions, they would fall down if the flight number began with a number, so I worked out a crass but (so far) stable way to analyse the string for digits and then work out whether it starts with an ICAO or IATA code.
A solution I can think of is to match your given flight number against a complete list of ICAO/IATA codes: https://raw.githubusercontent.com/datasets/airport-codes/master/data/airport-codes.csv
Spending some time with Google might give you a more appropriate list.
Then use the first three characters (if that is the maximum) of your flight number to find a match within the ICAO codes. If you find one, you will know where to separate your string.
Here is a minimal, ugly example that should set you on the right track. Feel free to improve it!
ICAOCODES = %w(FR DEU U21) # grab your data here

def retrieve_flight_information(flightnumber)
  ICAOCODES.each do |icao|
    co = flightnumber.match(icao).to_s
    if co.length > 0
      # airline
      puts co
      # flight number
      puts flightnumber.gsub(co, '')
    end
  end
end
retrieve_flight_information("FR556")
#=> FR
#=> 556
retrieve_flight_information("U21214123")
#=> U21
#=> 214123
The biggest flaw lies in using .gsub(), as it might mess up your flight number in case it looks like this: "FR21413FR2".
However, you will find plenty of solutions to that problem on SO.
As mentioned in the comments, a list of ICAO codes is not quite what you are looking for; what is relevant here is that you somehow need a list of strings that you can securely compare against.
I have a fairly crass solution that seems to be working in every scenario I can throw at it to date. I wanted to make this available to anybody else who might find it useful.
The general rule of thumb for flight codes/numbers seems to be:
IATA: two characters, made up of any combination of letters and digits
ICAO: three characters, made up of letters only (to date)
With that in mind, we should be able to work out whether we need to search the database by IATA or ICAO based on the first three characters.
First we take the flight number and convert it to uppercase:
string = "U21920".upcase
Next we analyse the first three characters to check for any numbers.
first_three = string[0,3] # => U21
Is there a digit in first_three?
if first_three =~ /\d/ # => true
  iata = first_three[0,2] # If true, let's get rid of the last character
  # Now we go to the database searching by IATA (U2)
  search = Airline.where('iata LIKE ?', "#{iata}%") # "starts with" search, just in case
Otherwise, if there isn't a digit found in the string:
else
  icao = string.match(/([A-Za-z]+)(\d+)/)
  search = Airline.where('icao LIKE ?', "#{icao[1]}%")
This seems to work for the random flight numbers I've tested it with today from a few major airports' live departure/arrival boards. It's an interesting problem because some airlines issue tickets with either an ICAO or an IATA code as part of the flight number, which means passengers won't know any different; not to mention, some airports provide flight information in their own format. So, assuming there isn't a change to the way ICAO and IATA codes are constructed, the above should work.
Here is an example script you can run
test.rb
puts "What is your flight number?"
string = gets.upcase
first_three = string[0,3]
puts "Taking first three from #{string} is #{first_three}"
if first_three =~ /\d/ # Calling String's =~ method.
puts "The String #{first_three} DOES have a number in it."
iata = first_three[0,2]
search = Airline.where('iata LIKE ?', "#{iata}%")
puts "Searching Airlines starting with IATA #{iata} = #{search.count}"
puts "Found #{search.first.name} from IATA #{iata}"
else
puts "The String #{first_three} does not have a number in it."
icao = string.match(/([A-Za-z]+)(\d+)/)
search = Airline.where('icao LIKE ?', "#{icao[1]}%")
puts "Searching Airlines starting with ICAO #{icao[1]} = #{search.count}"
puts "Found #{search.first.name} from IATA #{icao[1]}"
end
Airline
Airline(id: integer, name: string, iata: string, icao: string, created_at: datetime, updated_at: datetime )
Stick this in your lib folder and run:
rails runner lib/test.rb
Obviously you can remove all of the puts statements to get straight to the result. I'm using rails runner to get access to my Airline model when running the script.

How do we run the Stanford Classifier on an array of Strings?

I have got an array of strings
String strarr[] = {
    "What a wonderful day",
    "beautiful beds",
    "food was awesome"
};
I also have a trained dataset
Room What a beautiful room
Room Wonderful sea-view
Room beds are comfortable
Room bed-spreads are good
Food The dinner was marvellous
Food Tasty foods
Service people are rude
Service waitors were not on time
Service service was horrible
Programmatically, I am unable to get the scores and labels for the strings I want to classify.
If, however, I format the data as a two-column file like the training dataset, it works. My problem is that, in reality, I don't know beforehand which label belongs to each of the strings in my array.
How can I get the classifier to run on the array, instead of having to create a dataset file?
I get an error when trying to run the following:
ColumnDataClassifier cdc = new ColumnDataClassifier("examples/drogo.prop");
Classifier<String, String> cl =
        cdc.makeClassifier(cdc.readTrainingExamples("examples/drogo.train"));
for (String li : strarr) {
    Datum<String, String> d = cdc.makeDatumFromLine(li);
    System.out.println(li + " ==> " + cl.classOf(d) + " (score: " + cl.scoresOf(d) + ")");
}
Error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatum(ColumnDataClassifier.java:738)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromStrings(ColumnDataClassifier.java:275)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeDatumFromLine(ColumnDataClassifier.java:245)
at alchemypoc.DrogoClassifier.main(DrogoClassifier.java:55)
Java Result: 1
Okay, so I did the following and it now seems to work. Since it is a ColumnDataClassifier and it expects columnar (tab-separated) data, I added a tab before each sentence so that the text lands in the second column:
String strarr[] = {
    "\tWhat a wonderful day",
    "\tbeautiful beds",
    "\tfood was awesome"
};
It now gives me the values.
What a wonderful day ==> Room (score: {Service=-0.6692784244930884, Room=1.4113604761865859, Food=-0.7420810715491954})
beautiful beds ==> Room (score: {Service=-2.1042147142001038, Room=3.888249805012589, Food=-1.7840358277259})
food was awesome ==> Food (score: {Service=-0.44203328206155995, Room=-0.9779506257026013, Food=1.4199861760769543})
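For what it's worth, an equivalent approach that avoids the tab trick is to build each Datum from explicit columns with makeDatumFromStrings, which is what makeDatumFromLine calls internally (see the stack trace above). This is only a sketch, reusing cdc, cl and strarr from the snippet above and assuming drogo.prop treats column 0 as the label and column 1 as the text, as in the training file; the empty string stands in for the unknown label.

for (String li : strarr) {
    // column 0 = (unknown) gold label, column 1 = the text to classify
    Datum<String, String> d = cdc.makeDatumFromStrings(new String[]{"", li});
    System.out.println(li + " ==> " + cl.classOf(d) + " (score: " + cl.scoresOf(d) + ")");
}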
If anyone has a different answer or a more correct way to do this, please do post it.

Word comparison algorithm

I am doing a CSV Import tool for the project I'm working on.
The client needs to be able to enter the data in excel, export them as CSV and upload them to the database.
For example I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting.
I plan to do this by comparing the company names in the database with the company names in the CSV.
The comparison should return 0 if the strings are exactly the same and return a value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but
"Acme Company" and "Cmea Mpnyaco" should have a very big difference index
Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different.
Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt him to choose the name he most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one? :)
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.
I've had some success with the Levenshtein distance algorithm; there is also Soundex.
What language are you implementing this in? We may be able to point to specific examples.
I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).
I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Entire blog post here.
There are multiple algorithms to do just that, and most databases even include one by default. It is actually quite a common concern.
If it's just about English words, SQL Server for example includes SOUNDEX, which can be used to compare on the resulting sound of a word.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
I'm implementing it in PHP, and I am now writing a piece of code that will break two strings up into words and compare each word from the first string with the words of the second string using levenshtein(), accepting the lowest possible values. I'll post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
    // prepare the words
    $words1 = explode( " ", preg_replace( "/\s+/", " ", trim($str1) ) );
    $words2 = explode( " ", preg_replace( "/\s+/", " ", trim($str2) ) );
    $found = array(); // array that keeps the best matched words so we don't check them again
    $score = 0;       // total score

    // In my case, strings that have a different number of words can be good matches too.
    // For example, "Acme Company" and "International Acme Company Ltd." are the same thing.
    // I will just add the word-count difference to the total score, and weigh it more later if needed.
    $wordDiff = count( $words1 ) - count( $words2 );

    foreach( $words1 as $word1 )
    {
        $minlevWord = "";
        $minlev = 1000;
        $return = 0; // flag: did we see any word in $words2 at all?
        foreach( $words2 as $word2 )
        {
            $return = 1;
            if( in_array( $word2, $found ) )
                continue;
            $lev = levenshtein( $word1, $word2 );
            if( $lev < $minlev )
            {
                $minlev = $lev;
                $minlevWord = $word2;
            }
        }
        if( !$return )
            break;
        $score += $minlev;
        array_push( $found, $minlevWord );
    }
    return $score + $wordDiff;
}
