I am trying to test OpenIE with Stanford CoreNLP (http://nlp.stanford.edu/software/openie.html). I am using the following code, based on one of the demos available at http://stanfordnlp.github.io/CoreNLP/openie.html:
public static void main(String[] args) throws Exception {
    // Create the Stanford CoreNLP pipeline
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    props.setProperty("openie.triple.strict", "false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Annotate an example document.
    //File inputFile = new File("src/test/resources/0.txt");
    //String text = Files.toString(inputFile, Charset.forName("UTF-8"));
    String text = "Cats do not drink milk.";
    Annotation doc = new Annotation(text);
    pipeline.annotate(doc);

    // Loop over sentences in the document
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        // Get the OpenIE triples for the sentence
        Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
        // Print the triples
        for (RelationTriple triple : triples) {
            System.out.println(triple.confidence + "|\t" +
                    triple.subjectLemmaGloss() + "|\t" +
                    triple.relationLemmaGloss() + "|\t" +
                    triple.objectLemmaGloss());
        }
    }
}
This counter-intuitively results in the triple
1.0| cat| drink| milk
being extracted, which is the same result I get using the input text "Cats drink milk." If I set "openie.triple.strict" to "true", no triples are extracted at all. Is there a way to extract a triple like cats | do not drink | milk?
I think you want to keep "openie.triple.strict" set to true to ensure logically warranted triples. OpenIE does not extract negative relations; it is only designed to find positive ones.
So you are getting the correct behavior when "openie.triple.strict" is set to true (i.e., no relation being extracted). Note that a relation is extracted for "Cats drink milk." when "openie.triple.strict" is set to true.
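If you still need to surface negated relations, OpenIE itself will not produce them, but you can keep strict mode off and flag sentences whose dependency parse contains a negation edge yourself. A minimal sketch (my addition, not part of OpenIE): it assumes the depparse annotator is already in the pipeline, and note that the relation may be named neg or advmod depending on the CoreNLP / Universal Dependencies version.

import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

static boolean hasNegation(CoreMap sentence) {
    // basic dependencies for the sentence, produced by the depparse annotator
    SemanticGraph graph = sentence.get(
            SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    for (SemanticGraphEdge edge : graph.edgeIterable()) {
        // "Cats do not drink milk." yields a negation edge attached to "drink"
        if ("neg".equals(edge.getRelation().getShortName())) {
            return true;
        }
    }
    return false;
}

You could then prefix the relation of triples from such sentences with "not", or simply drop them, depending on what your application needs.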
List<CustomerDetails> customerDetailsList = repo.getCustomerDetails();
Set<String> combinedNamesList = new HashSet<>();
customerDetailsList.forEach(i -> {
    combinedNamesList.add((i.getFirstName() != null ? i.getFirstName().toLowerCase() : "")
            + (i.getLastName() != null ? i.getLastName().toLowerCase() : ""));
});
I would like to create the combinedNamesList in one operation using streams. Each CustomerDetails object has properties for a firstName and a lastName. I would like to combine the two properties into a single String in a collection such as:
{BobSmith, RachelSnow, DavidJohnson}
Stream the list, filter all customer objects having a valid firstName and lastName, and then combine the names using String.format:
List<String> combinedNamesList = repo.getCustomerDetails()
        .stream()
        .filter(cust -> cust.getFirstName() != null && cust.getLastName() != null)
        .map(cust -> String.format("%s%s", cust.getFirstName(), cust.getLastName()))
        .collect(Collectors.toList());
Adding to Deadpool's answer; this might help someone too.
Person p = new Person("Mohamed", "Anees");
Person p1 = new Person("Hello", "World");
Person p2 = new Person("Hello", "France");

System.out.println(
        Stream.of(p, p1, p2)
                .map(person -> String.join(" ", person.getFirstName(), person.getLastName()))
                .collect(Collectors.toSet()));
Here String.join() is used to concatenate the names. This also produces a more sensible output than the one you are expecting:
[Mohamed Anees, Hello World, Hello France]
If you really need the names without a space, you can change the " " delimiter in String.join() to "".
You can add a filter() to the Stream for null checks before converting to lowercase.
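If you instead want to keep the original behavior of substituting an empty string for a missing name (rather than dropping those customers entirely with filter()), a minimal sketch of that variant, assuming the same CustomerDetails accessors:

// null-safe variant: a missing first or last name becomes "",
// matching the original forEach code instead of skipping the customer
List<String> combinedNamesList = repo.getCustomerDetails()
        .stream()
        .map(c -> (c.getFirstName() != null ? c.getFirstName().toLowerCase() : "")
                + (c.getLastName() != null ? c.getLastName().toLowerCase() : ""))
        .collect(Collectors.toList());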
I need, for each pair of words, the number of lines that contain both words. For this purpose I have written the code below.
The input file contains 1,000 lines and about 4,000 words, and the program takes about 4 hours.
Is there a library in Java that can do this faster?
Can I implement this code using Apache Lucene or Stanford CoreNLP to achieve a shorter run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();

BufferedReader br = null;
FileReader fr = null;
try {
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null) {
        for (String term : line.split(" ")) {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

long Count = reviews.size();
for (String term_i : terms) {
    for (String term_j : terms) {
        if (!term_i.equals(term_j)) {
            double p = (double) reviews.parallelStream()
                    .filter(s -> s.contains(term_i) && s.contains(term_j)).count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
Your first loop getting the distinct words relies on ArrayList.contains, which has a linear time complexity, instead of using a Set. So if we assume nd distinct words, it already has a time complexity of “number of lines”×nd.
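For example, swapping the lookup structure alone removes that factor; a sketch of the change (my addition), leaving the rest of the reading loop as it was:

// add() on a Set rejects duplicates in O(1); no linear contains() scan.
// LinkedHashSet keeps insertion order, like the original ArrayList.
Set<String> terms = new LinkedHashSet<>();
while ((line = br.readLine()) != null) {
    terms.addAll(Arrays.asList(line.split(" ")));
    reviews.add(line);
}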
Then, you are creating nd×nd word combinations and probing all 1,000 lines for the presence of these combinations. In other words, if we only assume 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we’re talking about 250,500,000 already.
Instead, you should just create the combinations actually existing in a line and collect them into the map. This will only process those combinations actually existing, and you may improve this further by only checking one of each “a_b”/“b_a” pair, as the probability of both is identical. Then, you are only performing “number of lines”דwords per line”דwords per line” operations, in other words, roughly 16,000 operations in your case.
The following method combines all words of a line, keeping only one of each “a_b”/“b_a” pair, and eliminates duplicates so each combination is counted at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
            .flatMap(word1 -> Arrays.stream(words)
                    .filter(word2 -> word1.compareTo(word2) < 0)
                    .map(word2 -> word1 + '_' + word2))
            .distinct();
}
This method can be used like this:
List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0 / lines.size();
Map<String, Double> pij = lines.stream()
        .flatMap(line -> allCombinations(line))
        .collect(Collectors.groupingBy(Function.identity(),
                Collectors.summingDouble(x -> ratio)));
It ran through my copy of “War and Peace” within a few seconds, without needing any attempt to do parallel processing. Not very surprisingly, “and_the” was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring case.
I am using TokensRegex for rule-based entity extraction. It works well, but I am having trouble getting my output in the desired format. The following snippet of code gives me the output shown below for the sentence:
Earlier this month Trump targeted Toyota, threatening to impose a
hefty fee on the world's largest automaker if it builds its Corolla
cars for the U.S. market at a plant in Mexico.
for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    if (matched != null) {
        matched = MatchedExpression.removeNested(matched);
        matched = MatchedExpression.removeNullValues(matched);
        System.out.print("FOR SENTENCE:" + sentence);
    }
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.print("MATCHED ENTITY: " + phrase.getText() + "\t" + "VALUE: " + phrase.getValue());
    }
}
OUTPUT
MATCHED ENTITY: Donald Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market
VALUE: LIST([PERSON])
I know that if I iterate over tokens using:
for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}
I can get an output that gives an annotation for each token. However, I am using my own rules to detect named entities, and I have sometimes seen issues where one word of a multi-token entity is tagged as PERSON when the whole multi-token expression should have been tagged as ORGANIZATION (mostly in the case of organization and location names).
So the output I am expecting is:
MATCHED ENTITY: Donald Trump VALUE: PERSON
MATCHED ENTITY: Toyota VALUE: ORGANIZATION
How do I change the above code to get the desired output? Do I need to use custom annotations?
I produced a jar of the latest build a week or so ago; use that jar, available from GitHub.
This sample code will run the rules and apply the appropriate NER tags.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokensRegexExampleTwo {

    public static void main(String[] args) {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
        props.setProperty("tokensregex.caseInsensitive", "true");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // set up text to annotate
        Annotation annotation = new Annotation("...text to annotate...");
        // annotate text
        pipeline.annotate(annotation);
        // print out found entities
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t" + token.ner());
            }
        }
    }
}
I managed to get the output in the desired format.
Annotation document = new Annotation(<Sentence to annotate>);
// use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Note: I didn't put the environment-related stuff in the rule file.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
        .createExtractorFromFiles(env, "test_degree.rules");

for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
    }
}
Output:
MATCHED ENTITY: Technical Skill VALUE: SKILL
You might want to have a look at my rule file in this question.
Hope this helps!
Answering my own question for those struggling with a similar issue. The key to getting your output in the correct format lies in how you define your rules in the rules file. Here's what I changed in the rules to change the output:
Old Rule:
{ ruleType: "tokens",
pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
result: Annotate($1, ner, "LOCATION"),
}
New Rule:
{ ruleType: "tokens",
pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
action: Annotate($1, ner, "LOCATION"),
result: "LOCATION"
}
How you define your result field defines the output format of your data.
Hope this helps!
db.student.aggregate([{$project: {rollno: 1, per: {$divide: [{$add: ["$marks1", "$marks2", "$marks3"]}, 3]}}}])
How do I write this query in Java? Here, student is a collection with fields rollno, name, and marks, and I have to find the percentage of the students according to their roll numbers. I am not able to write the code for adding their marks, as the $add operator does not seem to accept multiple values for addition.
This seems to work. There are other builder patterns and conveniences, but this exposes all the workings and leaves room for rich dynamic construction.
DBCollection coll = db.getCollection("student");

List<DBObject> pipe = new ArrayList<DBObject>();

/**
 * Clearly, lots of room for dynamic behavior here.
 * Different sets of marks, etc. And the divisor is
 * the length of these, etc.
 */
String[] marks = new String[]{"$marks1", "$marks2", "$marks3"};
DBObject add = new BasicDBObject("$add", marks);

List<Object> l2 = new ArrayList<Object>();
l2.add(add);
l2.add(marks.length); // 3

DBObject divide = new BasicDBObject("$divide", l2);

DBObject prjflds = new BasicDBObject();
prjflds.put("rollno", 1);
prjflds.put("per", divide);

DBObject project = new BasicDBObject();
project.put("$project", prjflds);

pipe.add(project);

AggregationOutput agg = coll.aggregate(pipe);
for (DBObject result : agg.results()) {
    System.out.println(result);
}
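If you are on the newer driver, the same pipeline can also be built with the aggregation helpers; a sketch, assuming a com.mongodb.client.MongoDatabase handle named db and the same collection and field names:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Projections;
import org.bson.Document;
import java.util.Arrays;

MongoCollection<Document> students = db.getCollection("student");
// per = (marks1 + marks2 + marks3) / 3
Document per = new Document("$divide", Arrays.asList(
        new Document("$add", Arrays.asList("$marks1", "$marks2", "$marks3")),
        3));
for (Document result : students.aggregate(Arrays.asList(
        Aggregates.project(Projections.fields(
                Projections.include("rollno"),
                Projections.computed("per", per)))))) {
    System.out.println(result.toJson());
}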
I have a bunch of names in alphabetical order, with multiple instances of the same name grouped together. Beside each name, after a comma, I have a role that has been assigned to them, one name-role pair per line, something like what's shown below:
name1,role1
name1,role2
name1,role3
name1,role8
name2,role8
name2,role2
name2,role4
name3,role1
name4,role5
name4,role1
...
..
.
I am looking for an algorithm to take the above .csv file as input and create an output .csv file in the following format:
name1,role1,role2,role3,role8
name2,role8,role2,role4
name3,role1
name4,role5,role1
...
..
.
So basically I want each name to appear only once, with the roles printed in CSV format next to the name, for all names and roles in the input file.
The algorithm should be language-independent. I would appreciate it if it does NOT use OOP principles :-) I am a newbie.
Obviously this has some formatting bugs, but it will get you started.
var lastName = "";
do {
    var name = readName();
    var role = readRole();
    if (lastName != name) {
        print("\n" + name + ",");
        lastName = name;
    }
    print(role + ",");
} while (reader.isReady());
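For completeness, here is a runnable Java version of the same single-pass idea, assuming the input is sorted by name as in the question (the file name is a placeholder). It also avoids the trailing-comma and leading-newline glitches:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class GroupRoles {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.csv"))) {
            String lastName = "";
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2); // name, role
                if (!parts[0].equals(lastName)) {
                    // new name: start a new output line (no newline before the first)
                    System.out.print((lastName.isEmpty() ? "" : "\n") + parts[0]);
                    lastName = parts[0];
                }
                System.out.print("," + parts[1]);
            }
            System.out.println();
        }
    }
}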
This is easy to do if your language has associative arrays: arrays that can be indexed by anything (such as a string) rather than just numbers. Some languages call them "hashes," "maps," or "dictionaries."
On the other hand, if you can guarantee that the names are grouped together as in your sample data, Stefan's solution works quite well.
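In Java, for instance, the core indexing step with such a map is a one-liner (a sketch, where name and role are the two fields of the current line; the complete Java answer below does the same thing with explicit branching):

// computeIfAbsent creates the role list the first time a name is seen
Map<String, List<String>> rolesByName = new LinkedHashMap<>();
rolesByName.computeIfAbsent(name, k -> new ArrayList<>()).add(role);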
It's kind of a pity you said it had to be language-agnostic because Python is rather well-qualified for this:
import itertools

def split(s):
    return s.strip().split(',', 1)

with open(filename, 'r') as f:
    for name, lines in itertools.groupby(f, lambda s: split(s)[0]):
        print name + ',' + ','.join(split(s)[1] for s in lines)
Basically the groupby call takes all consecutive lines with the same name and groups them together.
Now that I think about it, though, Stefan's answer is probably more efficient.
Here is a solution in Java:
Scanner sc = new Scanner(new File(fileName));
Map<String, List<String>> nameRoles = new HashMap<String, List<String>>();
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    String args[] = line.split(",");
    if (nameRoles.containsKey(args[0])) {
        nameRoles.get(args[0]).add(args[1]);
    } else {
        List<String> roles = new ArrayList<String>();
        roles.add(args[1]);
        nameRoles.put(args[0], roles);
    }
}

// then print it out
for (String name : nameRoles.keySet()) {
    List<String> roles = nameRoles.get(name);
    System.out.print(name + ",");
    for (String role : roles) {
        System.out.print(role + ",");
    }
    System.out.println();
}
With this approach, you can work with a random input order like:
name1,role1
name3,role1
name2,role8
name1,role2
name2,role2
name4,role5
name4,role1
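One caveat (my note): HashMap iteration order is undefined, so the names may print in any order. If you want them alphabetical, as in the expected output, a TreeMap is a drop-in replacement:

// TreeMap iterates its keys in sorted (alphabetical) order
Map<String, List<String>> nameRoles = new TreeMap<String, List<String>>();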
Here it is in C# using nothing fancy. It should be self-explanatory:
static void Main(string[] args)
{
    using (StreamReader file = new StreamReader("input.txt"))
    {
        string prevName = "";
        while (!file.EndOfStream)
        {
            string line = file.ReadLine();     // read a line
            string[] tokens = line.Split(','); // split the name and the parameter
            string name = tokens[0];           // this is the name
            string param = tokens[1];          // this is the parameter

            if (name == prevName) // if the name is the same as the previous name we read, we add the current param to that name. This works because the names are sorted.
            {
                Console.Write(param + " ");
            }
            else // otherwise, we are definitely done with the previous name, and have printed all of its parameters (due to the sorting).
            {
                if (prevName != "") // make sure we don't print an extra newline the first time around
                {
                    Console.WriteLine();
                }
                Console.Write(name + ": " + param + " "); // write the name followed by the first parameter. The output format can easily be tweaked to print commas.
                prevName = name; // store the new name as the previous name.
            }
        }
    }
}