Extracting multi-word named entities using CoreNLP

I'm using CoreNLP for named entity extraction and have run into a bit of an issue.
The issue is that whenever a named entity is composed of more than one token, such as "Han Solo", the annotator does not return "Han Solo" as a single named entity, but as two separate entities, "Han" "Solo".
Is it possible to get the named entity as one token? I know I can make use of the CRFClassifier with classifyWithInlineXML to this end, but my solution requires that I use CoreNLP, since I also need to know the word number of each token.
The following is the code that I have so far:
Properties props = new Properties();
props.put("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        System.out.println(token.get(NamedEntityTagAnnotation.class));
    }
}
Here, text is the input string, for example "Help me Obi-Wan Kenobi. You're my only hope."

PrintWriter writer = null;
try {
    String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York" + " and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";
    String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
    writer = new PrintWriter(new File("output.xml"));
    writer.println("<Sentences>");
    writer.flush();
    String output = "<Sentence>" + classifier.classifyToString(inputLine, "xml", true) + "</Sentence>";
    writer.println(output);
    writer.flush();
    writer.println("</Sentences>");
    writer.flush();
} catch (FileNotFoundException ex) {
    ex.printStackTrace();
} finally {
    if (writer != null) {
        writer.close();
    }
}
I was able to come up with this solution. I write the output to an XML file, "output.xml". From the obtained output, you can merge consecutive nodes in the XML whose entity attribute is "PERSON", "ORGANIZATION", or "LOCATION" into one entity. This format also produces the word number by default.
Here is a snapshot of xml output.
<wi num="11" entity="ORGANIZATION">Federal</wi>
<wi num="12" entity="ORGANIZATION">Reserve</wi>
<wi num="13" entity="ORGANIZATION">Bank</wi>
<wi num="14" entity="ORGANIZATION">of</wi>
<wi num="15" entity="ORGANIZATION">New</wi>
<wi num="16" entity="ORGANIZATION">Yorkand</wi>
From the above output you can see that consecutive words are recognized as "ORGANIZATION", so these words can be combined into one entity.

I use a temp variable to hold the previous NER tag; if the current NER tag equals the temp, the current word is appended to the entity being built. The iteration then continues by assigning the current NER tag to temp.
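As a minimal sketch of that merging loop (mine, not part of the original post), applied directly to the CoreNLP pipeline from the question so the word number stays available; it assumes the default "O" tag for non-entity tokens:

for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
    String prevTag = "O";
    StringBuilder entity = new StringBuilder();
    int firstIndex = -1;                                   // word number of the entity's first token
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String tag = token.get(NamedEntityTagAnnotation.class);
        if (tag.equals(prevTag) && !"O".equals(tag)) {
            entity.append(" ").append(token.word());       // same entity continues
        } else {
            if (!"O".equals(prevTag)) {
                System.out.println(prevTag + " @" + firstIndex + ": " + entity);   // entity ended
            }
            entity = new StringBuilder(token.word());
            firstIndex = token.index();
        }
        prevTag = tag;
    }
    if (!"O".equals(prevTag)) {
        System.out.println(prevTag + " @" + firstIndex + ": " + entity);
    }
}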

Related

OpenCSV : getting the list of header names in the order it appears in csv

I am using Spring Boot + OpenCSV to parse a CSV with 120 columns (sample 1). I upload the file, process each row, and in case of errors return a similar CSV (say, errorCSV). This errorCSV will contain only the rows that errored out, with the 120 original columns and 3 additional columns describing what went wrong. Sample Error file 2
I have used annotation-based processing and the beans are populating fine. But I need to get the header names in the order they appear in the CSV, which is proving quite challenging. I also need to capture the exception and the original data during parsing; the two together can later be used to write the error CSV.
CSVReaderHeaderAware headerReader = new CSVReaderHeaderAware(reader);
try {
    header = headerReader.readMap().keySet();
} catch (CsvValidationException e) {
    e.printStackTrace();
}
However, the header order is jumbled and there is no way to get a header index, because CSVReaderHeaderAware internally uses a HashMap. To solve this I built my own class. It is a replica of CSVReaderHeaderAware 3, except that I used a LinkedHashMap:
public class CSVReaderHeaderOrderAware extends CSVReader {
    private final Map<String, Integer> headerIndex = new LinkedHashMap<>();
}
....
// This code cannot be done with a stream and Collectors.toMap()
// because Map.merge() does not play well with null values. Some
// implementations throw a NullPointerException, others simply remove
// the key from the map.
Map<String, String> resultMap = new LinkedHashMap<>(headerIndex.size()*2);
It does the job, but I wanted to check whether this is the best way out, or whether you can think of a better way to get the header names and failed values back and write them into a CSV.
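For comparison, here is a minimal sketch (my own, not from the original post) of the same idea without subclassing: read the header row with a plain CSVReader, which preserves column order, and build an ordered name-to-index map. It assumes OpenCSV 5.x (where readNext() declares CsvValidationException) and reuses the same reader as above:

try (CSVReader csvReader = new CSVReader(reader)) {
    String[] headerRow = csvReader.readNext();             // first line, in original column order
    Map<String, Integer> headerIndex = new LinkedHashMap<>();
    for (int i = 0; i < headerRow.length; i++) {
        headerIndex.put(headerRow[i], i);
    }
    // headerIndex.keySet() iterates in CSV order; headerIndex.get(name) gives the column index
} catch (IOException | CsvValidationException e) {
    e.printStackTrace();
}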
I referred to following links but couldn't get much help
How to read from particular header in opencsv?

Apache Mahout Database to Sequence File

I am currently trying to play around with Mahout. I purchased the book Mahout in Action.
The whole process is understood and with simple test data sets I was already successful.
Now I have a classification problem that I would like to solve.
The target variable has been identified, which I will call x for now.
The existing data in our database has already been classified with -1, 0 and +1.
We defined several predictor variables which we select with an SQL query.
These are the product's attributes: language, country, category (of the shop), title, description.
Now I want them to be written directly to a SequenceFile, for which I wrote a little helper class that appends to the sequence file each time a new row of the SQL result set has been processed:
public void appendToFile(String classification, String databaseID, String language,
        String country, String vertical, String title, String description) {
    int count = 0;
    Text key = new Text();
    Text value = new Text();
    key.set("/" + classification + "/" + databaseID);
    //??value.set(message);
    try {
        this.writer.append(key, value);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
If I only had the title, I could simply store it in the value. But how do I store multiple values, like country, language, and so on, for that particular key?
Thanks for any help!
You shouldn't be storing structures in a sequence file; just dump all the text you have, separated by spaces.
It's simply a place to put all your content for term counting and the like when using something like Naive Bayes; it doesn't care about structure.
Then, once you have a classification, look up the structure in your database.
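Following that advice, a minimal sketch (my assumption, not the answerer's code) of how the helper from the question could fill the value by concatenating the text fields:

public void appendToFile(String classification, String databaseID, String language,
        String country, String vertical, String title, String description) {
    // key keeps the label and database id; value is just the raw text, space-separated
    Text key = new Text("/" + classification + "/" + databaseID);
    Text value = new Text(String.join(" ", language, country, vertical, title, description));
    try {
        this.writer.append(key, value);
    } catch (IOException e) {
        e.printStackTrace();
    }
}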

Does Avro schema evolution require access to both old and new schemas?

If I serialize an object using a schema version 1, and later update the schema to version 2 (say by adding a field) - am I required to use schema version 1 when later deserializing the object? Ideally I would like to just use schema version 2 and have the deserialized object have the default value for the field that was added to the schema after the object was originally serialized.
Maybe some code will explain better...
schema1:
{"type": "record",
"name": "User",
"fields": [
{"name": "firstName", "type": "string"}
]}
schema2:
{"type": "record",
"name": "User",
"fields": [
{"name": "firstName", "type": "string"},
{"name": "lastName", "type": "string", "default": ""}
]}
using the generic non-code-generation approach:
// serialize
ByteArrayOutputStream out = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
GenericDatumWriter writer = new GenericDatumWriter(schema1);
GenericRecord datum = new GenericData.Record(schema1);
datum.put("firstName", "Jack");
writer.write(datum, encoder);
encoder.flush();
out.close();
byte[] bytes = out.toByteArray();
// deserialize
// I would like to not have any reference to schema1 below here
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema2);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord result = reader.read(null, decoder);
results in an EOFException. Using the jsonEncoder results in an AvroTypeException.
I know it will work if I pass both schema1 and schema2 to the GenericDatumReader constructor, but I'd like to not have to keep a repository of all previous schemas and also somehow keep track of which schema was used to serialize each particular object.
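For reference, a sketch of that working two-schema variant (reusing bytes from the snippet above); GenericDatumReader(schema1, schema2) resolves the writer's schema against the reader's:

DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema1, schema2);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord result = reader.read(null, decoder);
System.out.println(result.get("lastName")); // "" (the default from schema2)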
I also tried the code-gen approach, first serializing to a file using the User class generated from schema1:
User user = new User();
user.setFirstName("Jack");
DatumWriter<User> writer = new SpecificDatumWriter<User>(User.class);
FileOutputStream out = new FileOutputStream("user.avro");
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
out.close();
Then updating the schema to version 2, regenerating the User class, and attempting to read the file:
DatumReader<User> reader = new SpecificDatumReader<User>(User.class);
FileInputStream in = new FileInputStream("user.avro");
Decoder decoder = DecoderFactory.get().binaryDecoder(in, null);
User user = reader.read(null, decoder);
but it also results in an EOFException.
Just for comparison's sake, what I'm trying to do seems to work with protobufs...
format:
option java_outer_classname = "UserProto";
message User {
optional string first_name = 1;
}
serialize:
UserProto.User.Builder user = UserProto.User.newBuilder();
user.setFirstName("Jack");
FileOutputStream out = new FileOutputStream("user.data");
user.build().writeTo(out);
add optional last_name to format, regen UserProto, and deserialize:
FileInputStream in = new FileInputStream("user.data");
UserProto.User user = UserProto.User.parseFrom(in);
as expected, user.getLastName() is the empty string.
Can something like this be done with Avro?
Avro and Protocol Buffers have different approaches to handling versioning, and which approach is better depends on your use case.
In Protocol Buffers you have to explicitly tag every field with a number, and those numbers are stored along with the fields' values in the binary representation. Thus, as long as you never change the meaning of a number in a subsequent schema version, you can still decode a record encoded in a different schema version. If the decoder sees a tag number that it doesn't recognise, it can simply skip it.
Avro takes a different approach: there are no tag numbers, instead the binary layout is completely determined by the program doing the encoding — this is the writer's schema. (A record's fields are simply stored one after another in the binary encoding, without any tagging or separator, and the order is determined by the writer's schema.) This makes the encoding more compact, and saves you from having to manually maintain tags in the schema. But it does mean that for reading, you have to know the exact schema with which the data was written, or you won't be able to make sense of it.
If knowing the writer's schema is essential for decoding Avro, the reader's schema is a layer of niceness on top of it. If you're doing code generation in a program that needs to read Avro data, you can do the codegen off the reader's schema, which saves you from having to regenerate it every time the writer's schema changes (assuming it changes in a way that can be resolved). But it doesn't save you from having to know the writer's schema.
Pros & Cons
Avro's approach is good in an environment where you have lots of records that are known to have the exact same schema version, because you can just include the schema in the metadata at the beginning of the file, and know that the next million records can all be decoded using that schema. This happens a lot in a MapReduce context, which explains why Avro came out of the Hadoop project.
Protocol Buffers' approach is probably better for RPC, where individual objects are being sent over the network (as request parameters or return value). If you use Avro here, you may have different clients and different servers all with different schema versions, so you'd have to tag every binary-encoded blob with the Avro schema version it's using, and maintain a registry of schemas. At that point you might as well have used Protocol Buffers' built-in tagging.
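To make the container-file point concrete, here is a minimal sketch (mine, not part of the original answer) using Avro's object container format; schema1 and schema2 are assumed to be Schema objects parsed from the JSON definitions above. DataFileWriter stores the writer's schema in the file header, so the reader only needs schema2:

// write with schema1; the file header records schema1 automatically
DataFileWriter<GenericRecord> fileWriter =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema1));
fileWriter.create(schema1, new File("user.avro"));
GenericRecord datum = new GenericData.Record(schema1);
datum.put("firstName", "Jack");
fileWriter.append(datum);
fileWriter.close();

// read supplying only schema2; the writer's schema comes from the file header,
// and lastName resolves to its default value ""
DataFileReader<GenericRecord> fileReader = new DataFileReader<>(
        new File("user.avro"), new GenericDatumReader<GenericRecord>(schema2));
GenericRecord result = fileReader.next();
System.out.println(result.get("lastName")); // ""
fileReader.close();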
To do what you are trying to do, you need to make the last_name field optional by allowing null values: the type for last_name should be ["null", "string"] instead of "string".
I tried to work around this problem as well; I am putting it here:
I also tried using two schemas, where one schema is just the other with one additional column, using the reflection API of Avro. I have the following schemas:
Employee (having name, age, ssn)
ExtendedEmployee (extending Employee and adding a gender column)
I am assuming a file which earlier held Employee objects now also holds ExtendedEmployee objects, and I tried to read that file as:
RecordHandler rh = new RecordHandler();
if (rh.readObject(employeeSchema, dbLocation) instanceof Employee) {
    Employee e = (Employee) rh.readObject(employeeSchema, dbLocation);
    System.out.print(e.toString());
} else if (rh.readObject(schema, dbLocation) instanceof ExtendedEmployee) {
    ExtendedEmployee e = (ExtendedEmployee) rh.readObject(schema, dbLocation);
    System.out.print(e.toString());
}
This solves the problem here. However, I would love to know if there is an API in which we can give the ExtendedEmployee schema and read the Employee objects as well.

LINQ to XML exception when there is no summary node present

I have been working with LINQ to XML and am stuck on an issue. I would really appreciate any help. I am new to LINQ to XML, but I have found it easy to work with.
I have two different syndication feeds that I aggregate to one single syndication feed using Union. The final syndication feed contains 10 items.
I am trying to write the syndication feed to an XML file using XDocument and XElement, and have been able to do that successfully for the most part. But some of the items in the feed do not have a description node. When I reach an item that lacks this node, I get an exception. How can I check each item for a description node before writing the XML file? And if an item does not contain the description node, how can I populate it with a default value? Could you please suggest a solution? Thank you for all your time!
SyndicationFeed combinedfeed = new SyndicationFeed(
    newFeed1.Items.Union(newFeed2.Items).OrderByDescending(u => u.PublishDate));
// save the filtered xml file to a folder
XDocument filteredxmlfile = new XDocument(
    new XDeclaration("2.0", "utf-8", "yes"),
    new XElement("channel",
        from filteredlist in combinedfeed.Items
        select new XElement("item",
            new XElement("title", filteredlist.Title.Text),
            new XElement("source", FormatContent(filteredlist.Links[0].Uri.ToString())[0]),
            new XElement("url", FormatContent(filteredlist.Links[0].Uri.ToString())[1]),
            new XElement("pubdate", filteredlist.PublishDate.ToString("r")),
            new XElement("date", filteredlist.PublishDate.Date.ToShortDateString()),
            // I get an exception here as the summary/description node is not present for all the items
            new XElement("date", filteredlist.Summary.Text)
        )));
string savexmlpath = Server.MapPath(ConfigurationManager.AppSettings["FilteredFolder"]) + "sorted.xml";
filteredxmlfile.Save(savexmlpath);
Just check for null:
new XElement("date",filteredlist.Summary !=null ? filteredlist.Summary.Text : "default summary")

How do I search for a wildcard character in Microsoft CRM 4.0?

I need to search for accounts in Microsoft CRM, using a wildcard search to get a "contains" search for the user's input. So if the user enters "ABC", I use ConditionOperator.Like and the value "%ABC%".
My question is, how would I search for a customer name that contains a percentage sign, such as "100% Great llc"? I can't find a way to escape the %.
Sounds like you're looking for a SQL-based approach so I'm not sure if this helps.
One way I know is through the user interface with an asterisk *
So if you want to find all of the accounts that have a % sign just type in *% into the account search.
Try using square brackets around special characters, for instance [%]. So the condition would be: 100[%] Great llc, or %100[%] Great llc% for a contains search.
--EDIT--
This is in response to your comment.
Try utilizing a ConditionExpression, something like the following:
//1. Condition expression.
ConditionExpression nameCondition= new ConditionExpression();
nameCondition.AttributeName = "AccountName";
nameCondition.Operator = ConditionOperator.Like;
nameCondition.Values = new string[] { "%100[%] Great llc%" };
//2. Create filter expression
FilterExpression nameFilter = new FilterExpression();
nameFilter.Conditions = new ConditionExpression[] { nameCondition };
//3. Provide columns
ColumnSet resultSetColumns = new ColumnSet();
resultSetColumns.Attributes = new string[] { "name", "address" };
//4. Prepare query expression
QueryExpression qryExpression = new QueryExpression();
qryExpression.Criteria = nameFilter;
qryExpression.ColumnSet = resultSetColumns;
//5. Set the table to query.
qryExpression.EntityName = EntityName.account.ToString();
//6. Execute the query.
BusinessEntityCollection accountsResultSet = service.RetrieveMultiple(qryExpression);
Though I have played a lot with CRM, I never came across the special-characters scenario. Let me know your findings. This article has some revelations.
