Iterate through tokens and find the entity for a token - stanford-nlp

Problem
After running CoreNLP over some text, I want to reconstruct a sentence adding the POS-tag for each Token and grouping the tokens that form an entity.
This could be easily done if there was a way to see which entity a Token belongs to.
Approach
One option I am considering is going through sentence.tokens() and finding each Token's index in a list containing only the Tokens from all the CoreEntityMentions for that sentence. Then I could see which CoreEntityMention that Token belongs to, so I can group them.
Another option could be to look at the offsets of each Token in the sentence and compare them to the offsets of each CoreEntityMention.
I think the question is similar to what was asked here, but since it was a while ago, maybe the API has changed since.
This is the setup:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
pipeline = new StanfordCoreNLP(props);
String text = "Some text with entities goes here";
CoreDocument coreDoc = new CoreDocument(text);
// annotate the document
pipeline.annotate(coreDoc);
for (CoreSentence sentence : coreDoc.sentences()) {
    // Code goes here
    List<CoreEntityMention> em = sentence.entityMentions();
}

Each token in an entity mention carries the index of the entity mention in the document that it belongs to. Given a token's CoreLabel cl, you can read it with:
cl.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
I'll make a note to add a convenience method for this in future versions.
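A minimal sketch of how that index could drive the grouping the question asks for. To keep it runnable without CoreNLP on the classpath, the `Token` class below is a hypothetical stand-in for `CoreLabel`, and its `mentionIndex` field stands in for the value of `EntityMentionIndexAnnotation`. It assumes the value is null for tokens outside any entity mention, which is worth verifying against your CoreNLP version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntityGrouping {

    // Hypothetical stand-in for CoreLabel: word, POS tag, and entity mention index
    // (assumed to be null when the token is not part of any mention).
    static final class Token {
        final String word;
        final String pos;
        final Integer mentionIndex;

        Token(String word, String pos, Integer mentionIndex) {
            this.word = word;
            this.pos = pos;
            this.mentionIndex = mentionIndex;
        }
    }

    // Rebuilds the sentence as "word/POS" items and wraps consecutive tokens
    // that share a mention index in square brackets.
    static String render(List<Token> tokens) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i);
            if (t.mentionIndex == null) {
                parts.add(t.word + "/" + t.pos);
                i++;
                continue;
            }
            List<String> group = new ArrayList<>();
            while (i < tokens.size() && t.mentionIndex.equals(tokens.get(i).mentionIndex)) {
                Token g = tokens.get(i);
                group.add(g.word + "/" + g.pos);
                i++;
            }
            parts.add("[" + String.join(" ", group) + "]");
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("Joe", "NNP", 0),
            new Token("Smith", "NNP", 0),
            new Token("lives", "VBZ", null),
            new Token("in", "IN", null),
            new Token("Hawaii", "NNP", 1));
        System.out.println(render(tokens)); // [Joe/NNP Smith/NNP] lives/VBZ in/IN [Hawaii/NNP]
    }
}
```

With real CoreNLP tokens, the index would come from cl.get(CoreAnnotations.EntityMentionIndexAnnotation.class), the word from cl.word() and the tag from cl.tag().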

Related

Is it safe to pass a Lucene Query String directly from a user into a QueryParser?

tldr: Can I securely pass a raw query string (retrieved as a URL parameter) into a Lucene QueryParser without any added input sanitization?
I'm not a security expert, but I need some advice. As the title states, is it safe to use this controller method:
@CrossOrigin(origins = "${allowed-origin}")
@GetMapping(value = "/search/{query_string}", produces = MediaType.APPLICATION_JSON_VALUE)
public List doSearch(@PathVariable("query_string") String queryString) {
return searchQueryHandlerService.doSearch(queryString);
}
In tandem with this service method (the error handling is for testing only):
public List doSearch(String queryString) {
LOGGER.debug("Parsing query string: " + queryString);
try {
Query q = new QueryParser(null, standardAnalyzer).parse(queryString);
FullTextEntityManager manager = Search.getFullTextEntityManager(entityManager);
FullTextQuery fullTextQuery = manager.createFullTextQuery(q, Poem.class, Book.class, Section.class);
return fullTextQuery.getResultList();
} catch (ParseException e) {
LOGGER.error(e);
return Collections.emptyList();
}
}
With only basic input sanitization? If this isn't safe are there measures I can take to make it safe?
Any help is greatly appreciated.
I've been looking into this on and off for the last few weeks and I cannot find any reason why it wouldn't be safe, but it's such an obscure question (in an area I'm unfamiliar with) that I may be missing some obvious, fundamental problem that anyone working in the area would see immediately.
A FullTextQuery is always read only, so you don't have to be concerned with people dropping tables or similar issues that you might have to consider when dealing with SQL injection.
But you might want to be careful if you have security restrictions on what data can be seen by your users.
The API also restricts the operation to a certain set of indexes - in your case those containing the Poem entities - so it's also not possible to break out of the chosen indexes.
But you need to consider:
is it OK if the user is somehow able to find a different Poem than the one you expected them to look for?
if you share the same index with other entities, there might be some ways to infer data about these other entities.
So to be security conscious you might want to:
make sure each entity type gets indexed into its own index (which is the default).
enable a FullTextFilter to restrict the user query based on your custom rules.
actually check the content of each result before rendering it, so as to remove content that your other filters didn't catch.
If you are extremely paranoid, consider that any full-text index can actually reveal a bit about how frequent certain terms are in the whole index. People are normally not too concerned about this as it's extremely hard to take advantage of, and only minimal clues about the data distribution are revealed.
So, back to your example: if this index just contains poems and you're OK with allowing any user to see any poem you have stored, giving away clues about which poems you are making available is normally not a security concern but rather the whole point of your service.
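The last measure in the list above, checking each result before rendering it, could be sketched roughly as follows. This is plain Java with no Hibernate Search or Lucene types. The `keepVisible` helper and the "draft:" rule are hypothetical and only illustrate the idea of a final visibility check on whatever the query returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class ResultPostFilter {

    // Keeps only the results that pass the caller's visibility rule.
    static <T> List<T> keepVisible(List<T> results, Predicate<? super T> isVisible) {
        List<T> visible = new ArrayList<>();
        for (T result : results) {
            if (isVisible.test(result)) {
                visible.add(result);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Hypothetical rule: titles starting with "draft:" are not public.
        List<String> hits = Arrays.asList("Ode to Autumn", "draft: Untitled", "The Raven");
        System.out.println(keepVisible(hits, title -> !title.startsWith("draft:")));
        // [Ode to Autumn, The Raven]
    }
}
```

In the service method from the question, such a check would sit between fullTextQuery.getResultList() and the return statement, with the predicate encoding whatever access rules apply to the current user.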

Print Probabilities from CoreNLP Pipeline

I'm aware that printProbs on a classifier prints the probabilities that a particular token is of a particular NER type. However, how can I access the CRFClassifier used by the CoreNLP pipeline in the code below to actually call the printProbs method?
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
One possible approach would be to write your own custom NER annotator which resembles the CRFClassifier very closely and add that to your pipeline instead of ner. Basically, start with a copy of that code; from within it you can access the CRFClassifier methods, which contain the cliques and the probabilities.
One could also use reflection (hahaha)
NERCombinerAnnotator nerAnnotator = (NERCombinerAnnotator) StanfordCoreNLP.getExistingAnnotator(Annotator.STANFORD_NER);
Field field = nerAnnotator.getClass().getDeclaredField("ner");
field.setAccessible(true);
NERClassifierCombiner classifier = (NERClassifierCombiner) field.get(nerAnnotator);
Field field2 = classifier.getClass().getSuperclass().getDeclaredField("baseClassifiers");
field2.setAccessible(true);
// one of these will be the CRFClassifier used
List baseClassifiers = (List) field2.get(classifier);
You should also realize that a number of the usual reflection exceptions could be raised using this code.
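A minimal sketch of the same reflection pattern with the checked exceptions handled. `BaseCombiner` and `Combiner` are hypothetical stand-ins for the CoreNLP classes so that the example runs on its own. The helper wraps `NoSuchFieldException` and `IllegalAccessException` in an unchecked exception, which is one possible choice rather than a recommendation.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.List;

public class PrivateFieldReader {

    // Hypothetical stand-ins: a base class holding a private list, and a subclass.
    static class BaseCombiner {
        private final List<String> baseClassifiers = Arrays.asList("crf-1", "crf-2");
    }

    static class Combiner extends BaseCombiner { }

    // Reads a private field declared on the direct superclass of the target's class.
    static Object readSuperclassField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getSuperclass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            // The field was renamed, moved, or is not accessible in this version.
            throw new IllegalStateException("Cannot read field '" + fieldName + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readSuperclassField(new Combiner(), "baseClassifiers")); // [crf-1, crf-2]
    }
}
```

Because the field names are internal details of the library, a lookup like this can break on any upgrade, so failing loudly is usually preferable to silently returning null.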

How to add POS tag feature in OpenNLP named entity recognition tool

I am trying to setup the OpenNLP NameFinder in a project with part-of-speech tag feature.
I extended my feature class from the FeatureGeneratorAdapter class and overrode the following method. Unfortunately, this method takes only the raw tokens as a parameter. How can I pass POS tag information into this method?
public void createFeatures(List features, String[] tokens, int index, String[] previousOutcomes)
Try just passing in the POS as part of the tokens, i.e. append the POS tag to each word like this:
bob_nn, went_vv, etc.
The goal of the method in the interface is to hand back the "List features" reference filled with the tokens, so you may as well put the word_pos combinations straight into the list to begin with. I have never tried this before, so I hope it helps.
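A rough sketch of that idea in plain Java, without the OpenNLP classes. `attachPos` is a hypothetical helper that builds the word_pos tokens, and `createFeatures` mirrors the parameter list from the question but is a standalone method here, not the real override. The feature prefixes "w=" and "pos=" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PosTokenFeatures {

    // Joins each word with its POS tag, e.g. "bob" + "nn" -> "bob_nn".
    static String[] attachPos(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        String[] combined = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            combined[i] = words[i] + "_" + tags[i];
        }
        return combined;
    }

    // Splits the combined token at the last underscore and adds one feature
    // for the word and one for the POS tag.
    static void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        String token = tokens[index];
        int split = token.lastIndexOf('_');
        if (split < 0) {
            // No tag attached: fall back to a word-only feature.
            features.add("w=" + token);
            return;
        }
        features.add("w=" + token.substring(0, split));
        features.add("pos=" + token.substring(split + 1));
    }

    public static void main(String[] args) {
        String[] tokens = attachPos(new String[] {"bob", "went"}, new String[] {"nn", "vv"});
        List<String> features = new ArrayList<>();
        createFeatures(features, tokens, 0, new String[0]);
        System.out.println(features); // [w=bob, pos=nn]
    }
}
```

Splitting at the last underscore keeps words that themselves contain underscores intact, as long as the POS tags do not contain one.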

retrieve the Quote Detail with c#

I'm trying to create a custom workflow (for Dynamics CRM 2011) which must send an email with information from the Quote Details of a quote.
I am creating it in Visual Studio 2010 with the SDK.
The workflow is triggered manually from a quote.
I am able to retrieve the value of customerid, but I am unable to get the attached documents or the quote details of the Quote. When I launch the workflow, I get this exception:
System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
at Microsoft.Xrm.Sdk.Entity.get_Item(String attributeName)
at CPageCRM.Workflow.QuoteSendMailNotificationRIP.Execute(CodeActivityContext executionContext)
My code is :
//to get the current Quote
Entity preImageEntity = context.PreEntityImages.Values.FirstOrDefault();
//preImageEntity is a Quote because I trigger the workflow from a Quote
//the next two lines work, I can retrieve the good value of the Quote
string natureDevis = Utils.GetOptionSetValueLabel(service, preImageEntity, "new_nature", (OptionSetValue)preImageEntity["new_nature"]);
string prospectDevis = ((EntityReference)preImageEntity["customerid"]).Name;
//I get the exception after that :
List<QuoteDetail> listQuoteDetail = new List<QuoteDetail>();
listQuoteDetail = preImageEntity["quote_details"] as List<QuoteDetail>; //I get the exception
I don't understand why quote_details doesn't exist in the dictionary, because when I do:
Quote devis = new Quote();
devis.quote_details //<= (the autocompletion is working)
I have the same problem when I try to get sharepointdocumentlocation.
Does anyone have an explanation? How can I retrieve the Quote Details and the documents attached to my Quote from the code?
Thanks
A comment and potential answer.
My comment is that when retrieving stuff out of the Images, I often find it easier to let the compiler grab the proper type and just use 'var'.
My answer is that quote_details isn't just a field, but an actual 1-n relationship (as you can see by looking in the metadata browser). You may need to get the related entities in a separate retrieve.
Edit:
For example: _service.Retrieve("quote", quoteId, new ColumnSet("quote_details"))
will retrieve the quote details from the service. However, you could also check and see if you are passing in the quote_details attribute from the PreImage.
I succeeded with a LINQ query.
I had to search for the quotedetail records that were linked to the quote:
var queryQuoteDetail = from r in orgServiceContext.CreateQuery("quotedetail")
where ((EntityReference)r["quoteid"]).Id.Equals(context.PrimaryEntityId)
select r;

Field returns empty string

I create a new profile document with the following code:
Set doc = db.Createdocument()
doc.Form = "SMBPrivateProfile"
Call doc.Computewithform(True,True)
Call doc.Save(True, False)
But whenever I want to read a field with @GetProfileField, I get an empty string, even if the field I want to read has a default value.
After opening & saving the document manually everything works.
Further details:
I improved an application and hit Application --> Replace Design.... The new version includes a new field within the profile document. When reading one of these new fields, the result is an empty string. When reading an 'old' field within the same document the result is the expected string.
e.g.:
MessageBox([OK];"Title"; @GetProfileField("SMBPrivateProfile"; "OLD_FIELD"; @ThisName))
--> Will result in: "This is a fancy old default value"
MessageBox([OK];"Title"; @GetProfileField("SMBPrivateProfile"; "NEW_FIELD"; @ThisName))
--> Will result in: "" (instead of "This is a fancy new default value")
That's not a profile document. To create profile document use:
db.GetProfileDocument("SMBPrivateProfile");
You can also add a second parameter for a unique key in addition to profile name.
Also consider if you really want to use profile documents. They are heavily cached and not visible in any views.
If I'm reading you right, it appears that you have updated a form and added a new field with a default value formula. You are then reading an existing document. When you do this, the new field that you added to the form does not yet exist. New fields and formulas aren't applied to existing documents until you do something to force them to be applied.
If it's a regular document (as your original code indicated), you can just open the document in the Notes client, edit, and re-save it. That will create the NEW_FIELD and give it its value. If there are lots of these documents, you could write a simple formula agent to do this via @Command([ToolsRefreshAllDocs]) or @Command([ToolsRefreshSelectedDocs]).
If it is a profile document (as per the response chain to @Panu's answer), then after you do the replace design you will have to write an agent that opens the existing profile document using db.GetProfileDocument and then calls doc.ReplaceItemValue("NEW_FIELD", "new value").

Resources