Getting output in the desired format using TokensRegex - stanford-nlp

I am using TokensRegex for rule-based entity extraction. It works well, but I am having trouble getting my output in the desired format. The following snippet of code gives me the output shown below for the sentence:
Earlier this month Trump targeted Toyota, threatening to impose a
hefty fee on the world's largest automaker if it builds its Corolla
cars for the U.S. market at a plant in Mexico.
for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    if (matched != null) {
        matched = MatchedExpression.removeNested(matched);
        matched = MatchedExpression.removeNullValues(matched);
        System.out.print("FOR SENTENCE:" + sentence);
    }
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.print("MATCHED ENTITY: " + phrase.getText() + "\t" + "VALUE: " + phrase.getValue());
    }
}
OUTPUT
MATCHED ENTITY: Donald Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market
VALUE: LIST([PERSON])
I know that if I iterate over the tokens using:
for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}
I can get an output that gives the annotations for each token. However, I am using my own rules to detect named entities, and I have sometimes seen issues where one word of a multi-token entity is tagged as PERSON when the whole multi-token expression should have been an ORGANIZATION (mostly in the case of organization and location names).
So the output I am expecting is:
MATCHED ENTITY: Donald Trump VALUE: PERSON
MATCHED ENTITY: Toyota VALUE: ORGANIZATION
How do I change the above code to get the desired output? Do I need to use custom annotations?

I produced a jar of the latest build a week or so ago; use that jar, which is available from GitHub.
This sample code will run the rules and apply the appropriate NER tags.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokensRegexExampleTwo {

  public static void main(String[] args) {
    // set up properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
    props.setProperty("tokensregex.caseInsensitive", "true");
    // set up pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // set up text to annotate
    Annotation annotation = new Annotation("...text to annotate...");
    // annotate text
    pipeline.annotate(annotation);
    // print out found entities
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}
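The rules file itself is not shown above. For reference, here is a minimal sketch of what a file like multi-step-per-org.rules could contain, following the same TokensRegex rule syntax that appears elsewhere in this thread (the pattern and word list below are hypothetical placeholders, not the actual file):
// bind "ner" to the NER annotation key so that Annotate() can set it
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

// tag a proper-noun sequence ending in a company word as an organization
{ ruleType: "tokens",
  pattern: (([pos:/NNP.*/]+) (/Inc\.?|Corp\.?|Motors/)),
  action: Annotate($0, ner, "ORGANIZATION"),
  result: "ORGANIZATION"
}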

I managed to get the output in the desired format.
Annotation document = new Annotation(<Sentence to annotate>);
// use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Note: I didn't put the environment-related settings in the rule file.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
CoreMapExpressionExtractor extractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");

for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
    }
}
Output:
MATCHED ENTITY: Technical Skill VALUE: SKILL
You might want to have a look at my rule file in this question.
Hope this helps!

Answering my own question for those struggling with a similar issue. The key to getting your output in the correct format lies in how you define your rules in the rules file. Here's what I changed in the rules to change the output:
Old Rule:
{ ruleType: "tokens",
  pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
  result: Annotate($1, ner, "LOCATION"),
}
New Rule:
{ ruleType: "tokens",
  pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
  action: Annotate($1, ner, "LOCATION"),
  result: "LOCATION"
}
How you define your result field defines the output format of your data.
Hope this helps!

Related

Entity Framework Core translation capabilities [duplicate]

I have a rather theoretical issue with Entity Framework Core on SQLite.
I have an entity Person { ID, FirstName, LastName, ... }, a class PersonReference { ID, Representation : string }, and an extension method taking a Person that composes a reference out of it like this:
public static PersonReference ComposeReference(this Person from) => new PersonReference
{
    ID = from.ID,
    Representation = from.FirstName + " " + from.LastName
};
I need to compose the references on the SQL side, so I do the following:
var result = dbContext.People.Select(p => p.ComposeReference());
The result is an IQueryable, and the program goes past that line and materializes the collection successfully. But when I look at the query, I see that it selects everything from Person and then the query text ends.
If I rewrite the EF expression directly as
var result = dbContext.People.Select(p => new PersonReference
{
    ID = p.ID,
    Representation = p.FirstName + " " + p.LastName
});
it gives me the expression I want, with a compact select and the string concatenation done on the SQL side.
Is there a way to keep the composition logic in extension method but still do calculations on the SQL side?
The trick is to use System.Linq.Expressions.Expression.
I came across it at work and didn't understand what it was for at first, but it is designed for exactly this purpose.
Declaration:
Expression<Func<Person, PersonReference>> ComposeReference =>
    from => new PersonReference
    {
        ID = from.ID,
        Representation = from.FirstName + " " + from.LastName
    };
Usage:
var result = dbContext.People.Select(ComposeReference);
Note that expressions can be compiled, but never do that in this case, or EF will treat your DbSet as an IEnumerable and run the projection in memory instead of translating it to SQL.
The answer from Svyatoslav's comment referred to some libraries, but I think vanilla EF does well enough on its own.

Combine/Concatenate 2 properties/fields (First Name + Last Name) within a single object using Java Streams

List<CustomerDetails> customerDetailsList = repo.getCustomerDetails();
Set<String> combinedNamesList = new HashSet<>();
customerDetailsList.forEach(i -> {
    combinedNamesList.add((i.getFirstName() != null ? i.getFirstName().toLowerCase() : "")
            + (i.getLastName() != null ? i.getLastName().toLowerCase() : ""));
});
I would like to create the combinedNamesList in one operation using streams. Each CustomerDetails object has properties for a firstName and a lastName. I would like to combine the two properties into a single String in a collection such as:
{BobSmith, RachelSnow, DavidJohnson}
Stream the list, filter all customer objects having a valid firstName and lastName, and then combine the names using String.format:
List<String> combinedNamesList = repo.getCustomerDetails()
        .stream()
        .filter(cust -> cust.getFirstName() != null && cust.getLastName() != null)
        .map(cust -> String.format("%s%s", cust.getFirstName(), cust.getLastName()))
        .collect(Collectors.toList());
Adding to Deadpool's answer, thinking this might help someone too.
Person p = new Person("Mohamed", "Anees");
Person p1 = new Person("Hello", "World");
Person p2 = new Person("Hello", "France");
System.out.println(
        Stream.of(p, p1, p2)
                .map(person -> String.join(" ", person.getFirstName(), person.getLastName()))
                .collect(Collectors.toSet()));
Here String.join() is used to concatenate the names. This also produces a more sensible output than the one you are expecting:
[Mohamed Anees, Hello World, Hello France]
If you really need the names without a space, you can change the String.join() delimiter from " " to "".
You can also add filter() to the stream for null checks before converting to lowercase, as sketched below.
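For example, a minimal sketch combining both suggestions (an empty delimiter plus a null filter), reusing the Person objects from above:
Set<String> combinedNames = Stream.of(p, p1, p2)
        .filter(person -> person.getFirstName() != null && person.getLastName() != null) // skip incomplete names before lowercasing
        .map(person -> String.join("", person.getFirstName().toLowerCase(), person.getLastName().toLowerCase()))
        .collect(Collectors.toSet());
// e.g. [mohamedanees, helloworld, hellofrance]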

How to have parametrizable "methods" in Elm data-structures

I'm stuck refactoring a large data structure in Elm. I know how I would implement this in OO languages, but I have no experience in a functional setting. I can't express my problem well, because I can only frame it in OOP terms, which don't apply. So I'll go by way of example, using a simplified version of the module I'm refactoring.
Suppose I have this type:
type alias Book = { title : String, text : String }
and I have two Books:
englishBook = { title = "An English Book", text = "This book is an English book." }
frenchBook = { title = "Un livre Francais", text = "Ce livre est un livre Francais." }
There is an associated index function to compute which words are in a Book:
index = String.words >> Set.fromList
Here is my first problem already. If index takes a String, the user of the module must know how to get the text out of the book. Instead, my habits say that the function should do this for us. So index could behave like a method and take a Book as its first argument: index = .text >> String.words >> Set.fromList. But that also feels weird.
That's not the end of it though, because the index generator should be parametrizable. Depending on the book, it should do different things. So I could add the index function like this:
englishBook = { title = "...", text = "...", index = englishIndex }
frenchBook = { title = "...", text = "...", index = frenchIndex }
Now each book has the function to build its index. But the caller still has to supply the record when it wants the index:
wordsInEnglishBook = englishBook.index englishBook.text
which is not a nice solution to me, because it burdens the caller with the internals of the module. Well, what if that part is encapsulated in a constructor?
book title text index = { title = title, text = text, index = \_ -> index text }
Now I've come full circle and have implemented a method. So what is the idiomatic solution for this in Elm?
You can use a custom type to represent the language, and then pattern-match on the language to perform your indexing.
type Language
= English
| French
type alias Book =
{ title : String
, text : String
, language : Language
}
wordsInBook : Book -> Set String
wordsInBook { language, text } =
case language of
English ->
doSomethingWithEnglish text
French ->
doSomethingWithFrench text
or
type Book
= Book Language Data
type Language
= English
| French
type alias Data =
{ title : String
, text : String
}
wordsInBook : Book -> Set String
wordsInBook (Book language data) =
case language of
English ->
doSomethingWithEnglish data
French ->
doSomethingWithFrench data

Using openIE to extract negation

I am trying to test OpenIE with Stanford CoreNLP
http://nlp.stanford.edu/software/openie.html
I am using the following code based on one of the demos available on http://stanfordnlp.github.io/CoreNLP/openie.html
public static void main(String[] args) throws Exception {
    // Create the Stanford CoreNLP pipeline
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    props.setProperty("openie.triple.strict", "false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Annotate an example document.
    //File inputFile = new File("src/test/resources/0.txt");
    //String text = Files.toString(inputFile, Charset.forName("UTF-8"));
    String text = "Cats do not drink milk.";
    Annotation doc = new Annotation(text);
    pipeline.annotate(doc);

    // Loop over sentences in the document
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        // Get the OpenIE triples for the sentence
        Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
        // Print the triples
        for (RelationTriple triple : triples) {
            System.out.println(triple.confidence + "|\t" +
                    triple.subjectLemmaGloss() + "|\t" +
                    triple.relationLemmaGloss() + "|\t" +
                    triple.objectLemmaGloss());
        }
    }
}
This counter-intuitively results in the triple
1.0| cat| drink| milk
being extracted, which is the same result I get using input text "Cats drink milk." If I set "openie.triple.strict" to "true" no triples are extracted at all. Is there a way to extract a triple like cats | do not drink | milk ?
I think you want to set "openie.triple.strict" to true to ensure logically warranted triples. OpenIE does not extract negative relations; it is only designed to find positive ones.
So you are getting the correct behavior when "openie.triple.strict" is set to true (i.e., no relation being extracted). Note that a relation is extracted for "Cats drink milk." when "openie.triple.strict" is set to true.
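For reference, a minimal sketch of that configuration relative to the code in the question (only the property value changes):
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
props.setProperty("openie.triple.strict", "true"); // extract only logically warranted triples
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// "Cats drink milk."        -> 1.0| cat| drink| milk
// "Cats do not drink milk." -> no triples extracted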

Need an algorithm to group several parameters of a person under the persons name

I have a bunch of names in alphabetical order, with multiple instances of the same name, all grouped together. Beside each name, after a comma, I have a role that has been assigned to that name, one name-role pair per line, something like what's shown below:
name1,role1
name1,role2
name1,role3
name1,role8
name2,role8
name2,role2
name2,role4
name3,role1
name4,role5
name4,role1
...
..
.
I am looking for an algorithm that takes the above .csv file as input and creates an output .csv file in the following format:
name1,role1,role2,role3,role8
name2,role8,role2,role4
name3,role1
name4,role5,role1
...
..
.
So basically I want each name to appear only once and then the roles to be printed in csv format next to the names for all names and roles in the input file.
The algorithm should be language independent. I would appreciate it if it does NOT use OOP principles :-) I am a newbie.
This obviously has some formatting bugs, but it will get you started.
var lastName = "";
do {
    var name = readName();
    var role = readRole();
    if (lastName != name) {
        print("\n" + name + ",");
        lastName = name;
    }
    print(role + ",");
} while (reader.isReady());
This is easy to do if your language has associative arrays: arrays that can be indexed by anything (such as a string) rather than just numbers. Some languages call them "hashes," "maps," or "dictionaries."
On the other hand, if you can guarantee that the names are grouped together as in your sample data, Stefan's solution works quite well.
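For illustration, here is a minimal sketch of that associative-array approach in Java (the inline input list is a hypothetical stand-in for reading the file line by line):
Map<String, List<String>> rolesByName = new LinkedHashMap<>();
for (String line : Arrays.asList("name1,role1", "name1,role2", "name2,role8")) {
    String[] parts = line.split(",", 2);
    // group each role under its name, creating the list the first time a name is seen
    rolesByName.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
}
// prints: name1,role1,role2  then  name2,role8
rolesByName.forEach((name, roles) -> System.out.println(name + "," + String.join(",", roles)));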
It's kind of a pity you said it had to be language-agnostic because Python is rather well-qualified for this:
import itertools

def split(s):
    return s.strip().split(',', 1)

with open(filename, 'r') as f:
    for name, lines in itertools.groupby(f, lambda s: split(s)[0]):
        print(name + ',' + ','.join(split(s)[1] for s in lines))
Basically the groupby call takes all consecutive lines with the same name and groups them together.
Now that I think about it, though, Stefan's answer is probably more efficient.
Here is a solution in Java:
Scanner sc = new Scanner(new File(fileName));
Map<String, List<String>> nameRoles = new HashMap<String, List<String>>();
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    String[] args = line.split(",");
    if (nameRoles.containsKey(args[0])) {
        nameRoles.get(args[0]).add(args[1]);
    } else {
        List<String> roles = new ArrayList<String>();
        roles.add(args[1]);
        nameRoles.put(args[0], roles);
    }
}

// then print it out
for (String name : nameRoles.keySet()) {
    List<String> roles = nameRoles.get(name);
    System.out.print(name + ",");
    for (String role : roles) {
        System.out.print(role + ",");
    }
    System.out.println();
}
With this approach, you can work with random input like:
name1,role1
name3,role1
name2,role8
name1,role2
name2,role2
name4,role5
name4,role1
Here it is in C# using nothing fancy. It should be self-explanatory:
static void Main(string[] args)
{
    using (StreamReader file = new StreamReader("input.txt"))
    {
        string prevName = "";
        while (!file.EndOfStream)
        {
            string line = file.ReadLine();     // read a line
            string[] tokens = line.Split(','); // split the name and the parameter
            string name = tokens[0];           // this is the name
            string param = tokens[1];          // this is the parameter

            if (name == prevName) // same name as the previous line: append this param to it (works because the names are sorted)
            {
                Console.Write(param + " ");
            }
            else // otherwise, we are definitely done with the previous name and have printed all of its parameters (due to the sorting)
            {
                if (prevName != "") // make sure we don't print an extra newline the first time around
                {
                    Console.WriteLine();
                }
                Console.Write(name + ": " + param + " "); // write the name followed by its first parameter; the format can easily be tweaked to print commas
                prevName = name; // store the new name as the previous name
            }
        }
    }
}
