Stanford NER for phrases or compound entities

I noticed that corenlp.run can identify "10am tomorrow" and parse it out as a time. But the training tutorial and the docs I've seen only allow for one word per line. How do I get it to understand a phrase?
On a related note, is there a way to tag compound entities?

Time-related phrases like that are recognized by the SUTime library. More details can be found here: https://nlp.stanford.edu/software/sutime.html
There is also functionality for extracting entities after the NER tagging has been done.
For instance, if you have tagged the sentence "Joe Smith went to Hawaii ." as "PERSON PERSON O O LOCATION O", you can extract Joe Smith and Hawaii as whole entities. This requires the entitymentions annotator.
Here is some example code:
package edu.stanford.nlp.examples;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import java.util.*;
public class EntityMentionsExample {
  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday.");
    Properties props = new Properties();
    //props.setProperty("regexner.mapping", "small-names.rules");
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
    for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
      System.out.println(entityMention);
      //System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class));
      System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
    }
  }
}
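To make the "PERSON PERSON O O LOCATION O" example concrete, here is a minimal plain-Java sketch of the grouping idea: adjacent tokens that share the same non-O tag are joined into one mention. This is only an illustration and not how CoreNLP implements entitymentions (for one thing, it would wrongly merge two different entities of the same type that sit next to each other). The class name MentionGrouper and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: merges adjacent tokens that share the same non-"O" tag.
// MentionGrouper is a hypothetical helper, not part of CoreNLP.
class MentionGrouper {

    // Returns one string per mention, formatted as "mention text<TAB>TAG".
    static List<String> group(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < tokens.length; i++) {
            // A token continues the open mention only if it is tagged and the tag matches.
            boolean continues = !tags[i].equals("O") && tags[i].equals(currentTag);
            if (!continues) {
                flush(mentions, current, currentTag);
                currentTag = tags[i];
            }
            if (!tags[i].equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens[i]);
            }
        }
        flush(mentions, current, currentTag);
        return mentions;
    }

    // Emits the open mention, if any, and resets the buffer.
    private static void flush(List<String> out, StringBuilder current, String tag) {
        if (current.length() > 0) {
            out.add(current + "\t" + tag);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Joe", "Smith", "went", "to", "Hawaii", "."};
        String[] tags = {"PERSON", "PERSON", "O", "O", "LOCATION", "O"};
        group(tokens, tags).forEach(System.out::println);
    }
}
```

In a real pipeline you would rely on the entitymentions annotator rather than something like this; the sketch just shows why per-token tags are enough to recover multi-word entities.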

Related

Is there a way to combine re-tokenize several tokens into one using TokensRegex?

I want to combine consecutive tokens with the same named entity annotation (say, STANFORD UNIVERSITY, where both tokens "stanford" and "university" have NE "ORGANIZATION") into a single token, so that I just have "STANFORD UNIVERSITY" with NE "ORGANIZATION". Is there a way to do that with tokens regex?
So, this really is a two-part question:
1) How would you write the pattern for an unbroken sequence of tokens with the same NER?
2) How would you write the action to combine captured tokens into one (basically, do the opposite of the Split function)?
Thanks!

You want to use the entitymentions annotator, which will do this for you and extract full entities from the text.
Sample code:
package edu.stanford.nlp.examples;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import java.util.*;
public class EntityMentionsExample {
  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
    // each CoreMap is one full entity mention, e.g. "John Smith" or "Los Angeles"
    for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
      System.out.println(entityMention);
    }
  }
}

How To Deserialize Generic Types with Moshi?

Suppose we have this JSON:
[
  {
    "__typename": "Car",
    "id": "123",
    "name": "Toyota Prius",
    "numDoors": 4
  },
  {
    "__typename": "Boat",
    "id": "4567",
    "name": "U.S.S. Constitution",
    "propulsion": "SAIL"
  }
]
(there could be many more elements to the list; this just shows two)
I have Car and Boat POJOs that use a Vehicle base class for the common fields:
public abstract class Vehicle {
  public final String id;
  public final String name;
}
public class Car extends Vehicle {
  public final Integer numDoors;
}
public class Boat extends Vehicle {
  public final String propulsion;
}
The result of parsing this JSON should be a List<Vehicle>. The problem is that no JSON parser is going to know, out of the box, that __typename is how to distinguish a Boat from a Car.
With Gson, I can create a JsonDeserializer<Vehicle> that can examine the __typename field, identify whether this is a Car or Boat, then use deserialize() on the supplied JsonDeserializationContext to parse the particular JSON object into the appropriate type. This works fine.
However, the particular thing that I am building ought to support pluggable JSON parsers, and I thought that I would try Moshi as an alternative parser. However, this particular problem is not covered well in the Moshi documentation at the present time, and I am having difficulty figuring out how best to address it.
The closest analogue to JsonDeserializer<T> is JsonAdapter<T>. However, fromJson() gets passed a JsonReader, which has a destructive API. To find out what the __typename is, I would have to be able to parse everything by hand from the JsonReader events. While I could call adapter() on the Moshi instance to try to invoke existing Moshi parsing logic once I know the proper concrete type, I will have consumed data off of the JsonReader and broken its ability to provide the complete object description anymore.
Another analogue of JsonDeserializer<Vehicle> would be a @FromJson-annotated method that returns a Vehicle. However, I cannot identify a simple thing to pass into the method. The only thing that I can think of is to create yet another POJO representing the union of all possible fields:
public class SemiParsedKindOfVehicle {
  public final String id;
  public final String name;
  public final Integer numDoors;
  public final String propulsion;
  public final String __typename;
}
Then, in theory, if I have @FromJson Vehicle rideLikeTheWind(SemiParsedKindOfVehicle rawVehicle) on a class that I register as a type adapter with Moshi, Moshi might be able to parse my JSON objects into SemiParsedKindOfVehicle instances and call rideLikeTheWind(). In there, I would look up the __typename, identify the type, and completely build the Car or Boat myself, returning that object.
While doable, this is a fair bit more complex than the Gson approach, and my Car/Boat scenario is on the simple end of the possible data structures that I will need to deal with.
Is there another approach to handling this with Moshi that I am missing?

The moshi-adapters add-on library contains a PolymorphicJsonAdapterFactory class. While the JavaDocs for this library do not seem to be posted, the source does contain a detailed description of its use.
The setup for the example in my question would be:
private val moshi = Moshi.Builder()
  .add(
    PolymorphicJsonAdapterFactory.of(Vehicle::class.java, "__typename")
      .withSubtype(Car::class.java, "Car")
      .withSubtype(Boat::class.java, "Boat")
  )
  .build()
Now, our Moshi object knows how to convert things like List<Vehicle> to/from JSON, based on the __typename property in the JSON, comparing it to "Car" and "Boat" to create the Car and Boat classes, respectively.
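For anyone wondering what "comparing it to Car and Boat" amounts to, here is a rough plain-Java sketch of label-based dispatch: a registry maps each discriminator value to a builder for the matching subtype. This is not Moshi's implementation and it uses no Moshi classes; VehicleRegistry, withSubtype, fromMap and defaults are hypothetical names for this illustration, and the input is assumed to be an already-parsed Map<String, Object>.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of dispatching on a "__typename" label.
// Not Moshi's implementation; it works on an already-parsed Map.
class VehicleRegistry {

    static abstract class Vehicle {
        String id;
        String name;
    }

    static class Car extends Vehicle {
        Integer numDoors;
    }

    static class Boat extends Vehicle {
        String propulsion;
    }

    private final String labelKey;
    private final Map<String, Function<Map<String, Object>, Vehicle>> builders = new HashMap<>();

    VehicleRegistry(String labelKey) {
        this.labelKey = labelKey;
    }

    // Registers the builder to use when the label field equals the given value.
    VehicleRegistry withSubtype(String label, Function<Map<String, Object>, Vehicle> builder) {
        builders.put(label, builder);
        return this;
    }

    // Looks up the label, builds the matching subtype, then fills the shared fields.
    Vehicle fromMap(Map<String, Object> raw) {
        Object label = raw.get(labelKey);
        Function<Map<String, Object>, Vehicle> builder = builders.get(label);
        if (builder == null) {
            throw new IllegalStateException("Unknown " + labelKey + ": " + label);
        }
        Vehicle vehicle = builder.apply(raw);
        vehicle.id = String.valueOf(raw.get("id"));
        vehicle.name = String.valueOf(raw.get("name"));
        return vehicle;
    }

    // Registry matching the Car/Boat example from the question.
    static VehicleRegistry defaults() {
        return new VehicleRegistry("__typename")
            .withSubtype("Car", raw -> {
                Car car = new Car();
                car.numDoors = ((Number) raw.get("numDoors")).intValue();
                return car;
            })
            .withSubtype("Boat", raw -> {
                Boat boat = new Boat();
                boat.propulsion = String.valueOf(raw.get("propulsion"));
                return boat;
            });
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new HashMap<>();
        raw.put("__typename", "Boat");
        raw.put("id", "4567");
        raw.put("name", "U.S.S. Constitution");
        raw.put("propulsion", "SAIL");
        Vehicle vehicle = defaults().fromMap(raw);
        System.out.println(vehicle.getClass().getSimpleName() + " " + vehicle.name);
    }
}
```

With the real library you get this dispatch for free from the factory, plus proper streaming parsing of each subtype, so a hand-rolled registry like this is only useful for understanding the idea.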

UPDATE 2019-05-25: The newer answer is your best bet. I am leaving my original solution here for historical reasons.
One thing that I had not taken into account is that you can create a type adapter using a generic type, like Map<String, Object>. Given that, you can create a VehicleAdapter that looks up __typename. It will be responsible for completely populating the Car and Boat instances (or, optionally, delegate that to constructors on Car and Boat that take the Map<String, Object> as input). Hence, this is still not quite as convenient as Gson's approach. Plus, you have to have a do-nothing @ToJson method, as otherwise Moshi rejects your type adapter. But, otherwise, it works, as is demonstrated by this JUnit4 test class:
import com.squareup.moshi.FromJson;
import com.squareup.moshi.JsonAdapter;
import com.squareup.moshi.Moshi;
import com.squareup.moshi.ToJson;
import com.squareup.moshi.Types;
import org.junit.Test;
import java.io.IOException;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;
import static org.junit.Assert.assertEquals;
public class Foo {
  static abstract class Vehicle {
    public String id;
    public String name;
  }
  static class Car extends Vehicle {
    public Integer numDoors;
  }
  static class Boat extends Vehicle {
    public String propulsion;
  }
  static class VehicleAdapter {
    // Moshi first parses each JSON object into a Map, then hands it to this method
    @FromJson
    Vehicle fromJson(Map<String, Object> raw) {
      String typename = raw.get("__typename").toString();
      Vehicle result;
      if (typename.equals("Car")) {
        Car car = new Car();
        // JSON numbers arrive as Double when parsed into a Map<String, Object>
        car.numDoors = ((Double) raw.get("numDoors")).intValue();
        result = car;
      }
      else if (typename.equals("Boat")) {
        Boat boat = new Boat();
        boat.propulsion = raw.get("propulsion").toString();
        result = boat;
      }
      else {
        throw new IllegalStateException("Could not identify __typename: " + typename);
      }
      result.id = raw.get("id").toString();
      result.name = raw.get("name").toString();
      return result;
    }
    @ToJson
    String toJson(Vehicle vehicle) {
      throw new UnsupportedOperationException("Um, why is this required?");
    }
  }
  static final String JSON = "[\n" +
    "  {\n" +
    "    \"__typename\": \"Car\",\n" +
    "    \"id\": \"123\",\n" +
    "    \"name\": \"Toyota Prius\",\n" +
    "    \"numDoors\": 4\n" +
    "  },\n" +
    "  {\n" +
    "    \"__typename\": \"Boat\",\n" +
    "    \"id\": \"4567\",\n" +
    "    \"name\": \"U.S.S. Constitution\",\n" +
    "    \"propulsion\": \"SAIL\"\n" +
    "  }\n" +
    "]";
  @Test
  public void deserializeGeneric() throws IOException {
    Moshi moshi = new Moshi.Builder().add(new VehicleAdapter()).build();
    Type payloadType = Types.newParameterizedType(List.class, Vehicle.class);
    JsonAdapter<List<Vehicle>> jsonAdapter = moshi.adapter(payloadType);
    List<Vehicle> result = jsonAdapter.fromJson(JSON);
    assertEquals(2, result.size());
    assertEquals(Car.class, result.get(0).getClass());
    Car car = (Car) result.get(0);
    assertEquals("123", car.id);
    assertEquals("Toyota Prius", car.name);
    assertEquals((long) 4, (long) car.numDoors);
    assertEquals(Boat.class, result.get(1).getClass());
    Boat boat = (Boat) result.get(1);
    assertEquals("4567", boat.id);
    assertEquals("U.S.S. Constitution", boat.name);
    assertEquals("SAIL", boat.propulsion);
  }
}

How to use quote annotator

Running
./corenlp.sh -annotators quote -outputFormat xml -file input.txt
on the modified input file
"Stanford University" is located in California. It is a great university, founded in 1891.
yields the following output:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences/>
</document>
</root>
Maybe I misunderstood the intended use of this annotator, but I expected it to mark the parts of the sentence that are between the quotation marks (").
When I run the script with the "usual" annotators tokenize,ssplit,pos,lemma,ner, they are all working well, but adding quote does not change the output. I use the stanford-corenlp-full-2015-12-09 release.
How can I use the quote annotator and what is it meant to do?

If you build a StanfordCoreNLP object in Java code and run it with the quote annotator, the final Annotation object will have the quotes.
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class PipelineExample {
  public static void main (String[] args) throws IOException {
    // build pipeline
    Properties props = new Properties();
    props.setProperty("annotators","tokenize, ssplit, quote");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "\"Stanford University\" is located in California. It is a great university, founded in 1891.";
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    // the quotes are stored on the Annotation object itself
    System.out.println(annotation.get(CoreAnnotations.QuotationsAnnotation.class));
  }
}
Currently none of the outputters (json, xml, text, etc...) output the quotes. I'll make a note we should add this to the output for future versions.
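Since the outputters do not show the quotes, it may help to see the kind of result to look for in the Annotation. The plain-Java sketch below pulls out the text between straight double quotes with a regular expression. It is only a stand-in to show the expected shape of the result, not the quote annotator's actual logic, which I would assume covers more cases (other quote characters, for instance). QuoteSpans and extract are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in illustration: finds text between straight double quotes.
// Not the CoreNLP quote annotator; QuoteSpans is a hypothetical helper.
class QuoteSpans {

    // Matches a double quote, any run of non-quote characters, and a closing double quote.
    private static final Pattern DOUBLE_QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the contents of each quoted span, in order of appearance.
    static List<String> extract(String text) {
        List<String> quotes = new ArrayList<>();
        Matcher matcher = DOUBLE_QUOTED.matcher(text);
        while (matcher.find()) {
            quotes.add(matcher.group(1));
        }
        return quotes;
    }

    public static void main(String[] args) {
        String text = "\"Stanford University\" is located in California. "
            + "It is a great university, founded in 1891.";
        extract(text).forEach(System.out::println);
    }
}
```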

Why sonar(JaCoCo) is asking me to test my packages?

I am trying to close some test gaps in my application and found that the JaCoCo Sonar plugin reports lower coverage for my enums because it thinks I should test the package names.
Why is that?
It shows 97% coverage for one of my enums and displays a red line on top of the package declaration, telling me to test it. It does that in all enums, and on enums only.

I came here looking for the answer to this, and after some more digging I discovered that it's due to static methods in the bytecode of the compiled enum class (the compiler-generated values() and valueOf()), which JaCoCo expects to be covered. After some experimentation, I came up with the following superclass to use for unit tests which are focused on enums, with JUnit 4. This resolved my coverage problems with enums.
import org.junit.Test;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import static org.junit.Assert.assertEquals;
public abstract class EnumTest {
  @Test
  public void verifyEnumStatics() throws NoSuchMethodException, InvocationTargetException, IllegalAccessException {
    Class e = getEnumUnderTest();
    // call the generated static values() method reflectively
    Method valuesMethod = e.getMethod("values");
    Object[] values = (Object[]) valuesMethod.invoke(null);
    // round-trip the first constant through the generated static valueOf(String)
    Method valueOfMethod = e.getMethod("valueOf", String.class);
    assertEquals(values[0], valueOfMethod.invoke(null, ((Enum)values[0]).name()));
  }
  protected abstract Class getEnumUnderTest();
}
And then use it like this:
public class TravelTypeTest extends EnumTest {
  @Override
  protected Class getEnumUnderTest() {
    return TravelType.class;
  }
  // other test methods if needed
}
This is a rough first attempt - it doesn't work on enums that for whatever reason don't have any entries, and doubtless there are better ways to get the same effect, but this will exercise the generated static methods by ensuring that you can retrieve the values of the enum, and that if you pass the name of the first enum entry to the valueOf() method you will get the first enum entry back.
Ideally we'd write a test that searches for all enums in the packages under test and exercise them in the same way automatically (and avoid having to remember to create a new test class for each new enum that inherits from EnumTest), but I don't have many enums so I haven't felt any pressure to attempt this yet.
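As a step in that direction, here is a hedged sketch of a helper that takes an explicit list of enum classes (no package scanning) and calls the generated values() and valueOf(String) on each one reflectively, without failing on enums that have no entries. EnumStaticsExerciser, exercise, TravelType and Empty are hypothetical names for this illustration, and I have not checked the resulting JaCoCo report for this exact variant. Note that for an empty enum only values() gets called, so its valueOf() stays unexercised.

```java
import java.lang.reflect.Method;

// Hypothetical helper: exercises the compiler-generated statics of several enums.
class EnumStaticsExerciser {

    // Stand-in enums for the demo; replace with the enums under test.
    enum TravelType { CAR, TRAIN }

    enum Empty { }

    // Calls values() on each enum and round-trips every constant through valueOf(String).
    // Returns how many constants were checked in total.
    static int exercise(Class<?>... enumTypes) {
        int checked = 0;
        for (Class<?> type : enumTypes) {
            if (!type.isEnum()) {
                throw new IllegalArgumentException(type + " is not an enum");
            }
            try {
                Method values = type.getMethod("values");
                Method valueOf = type.getMethod("valueOf", String.class);
                Object[] constants = (Object[]) values.invoke(null);
                for (Object constant : constants) {
                    Object roundTripped = valueOf.invoke(null, ((Enum<?>) constant).name());
                    if (roundTripped != constant) {
                        throw new AssertionError("valueOf mismatch for " + constant);
                    }
                    checked++;
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not exercise " + type, e);
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println(exercise(TravelType.class, Empty.class));
    }
}
```

A single JUnit test could then call exercise(...) with every enum in the module, at the cost of having to keep that list up to date by hand.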

deserializing multiple types with gson

I receive the following JSON response from my http web service
{"status":100,"data":[{"name":"kitchen chair","price":25.99,"description":null}]}
Now I want to be able to deserialize this. I stumbled upon Gson from Google, which at first worked well for some small testcases but I'm having some trouble getting the above string deserialized.
I know the data field only has one item in it, but it could hold more than one which is why all responses have the data field as an array.
I was reading the Gson User Guide and ideally I would like to have a Response object which has two attributes: status and data, but the data field would have to be a List of Map objects which presumably is making it hard for Gson.
I then looked at this which is an example closer to my problem but I still can't quite figure it out. In the example the whole response is an array, whereas my JSON string has one string element and then an array.
How would I best go about deserializing this?

It's not clear what exactly was already attempted that appeared to be "hard for Gson".
The latest release of Gson handles this simply. Following is an example.
input.json
{
  "status":100,
  "data":
  [
    {
      "name":"kitchen chair",
      "price":25.99,
      "description":null
    }
  ]
}
Response.java
import java.util.List;
import java.util.Map;
class Response
{
  int status;
  List<Map<String, Object>> data;
}
GsonFoo.java
import java.io.FileReader;
import com.google.gson.Gson;
public class GsonFoo
{
  public static void main(String[] args) throws Exception
  {
    Gson gson = new Gson();
    // try-with-resources closes the reader once parsing is done
    try (FileReader reader = new FileReader("input.json"))
    {
      Response response = gson.fromJson(reader, Response.class);
      System.out.println(gson.toJson(response));
    }
  }
}
Output (Gson omits null values such as description when serializing by default):
{"status":100,"data":[{"name":"kitchen chair","price":25.99}]}
