Pattern matching with Tregex in Stanza's CoreNLP implementation doesn't seem to find the right subtrees - stanford-nlp

I am relatively new to NLP and at the moment I'm trying to extract different phrase structures in German texts. For that I'm using the Stanford CoreNLP implementation of Stanza with the Tregex feature for pattern matching in trees.
So far I haven't had any problems and I was able to match simple patterns like "NPs" or "S > CS".
Now I'm trying to match S nodes that are immediately dominated either by ROOT or by a CS node that is immediately dominated by ROOT. For that I'm using the pattern "S > (CS > ROOT) | > ROOT". But it seems that it doesn't work properly. I'm using the following code:
from stanza.server import CoreNLPClient

text = "Peter kommt und Paul geht."

def linguistic_units(_client, _text, _pattern):
    matches = _client.tregex(_text, _pattern)
    sentences = matches['sentences']
    print('+++++Tree++++')
    print(sentences[0])
    # Count the matches so that there is something to return
    count_units = 0
    for sentence in sentences:
        for match_id in sentence:
            print(sentence[match_id]['spanString'])
            count_units += 1
    return count_units

with CoreNLPClient(properties='./corenlp/StanfordCoreNLP-german.properties',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=300000,
                   be_quiet=True,
                   endpoint='http://localhost:9001',
                   memory='16G') as client:
    result = linguistic_units(client, text, 'S > (CS > ROOT) | > ROOT')
    print(result)
In the example with the text "Peter kommt und Paul geht" the pattern I'm using should match the two phrases "Peter kommt" and "Paul geht", but it doesn't work.
Afterwards I had a look at the tree itself, and the output of the parser was the following:
constituency parse of first sentence
child {
  child {
    child {
      child {
        child {
          value: "Peter"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "kommt"
        }
        value: "VERB"
      }
      value: "S"
    }
    child {
      child {
        value: "und"
      }
      value: "CCONJ"
    }
    child {
      child {
        child {
          value: "Paul"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "geht"
        }
        value: "VERB"
      }
      value: "S"
    }
    value: "CS"
  }
  child {
    child {
      value: "."
    }
    value: "PUNCT"
  }
  value: "NUR"
}
value: "ROOT"
score: 5466.83349609375
I now suspect that this is due to the ROOT node, since it is the last node of the tree. Should the ROOT node not be at the beginning of the tree?
Does anyone know what I am doing wrong?

A few comments:
1.) Assuming you are using a recent version of CoreNLP (4.0.0+), you need to use the mwt annotator with German. So your annotators list should be tokenize,ssplit,mwt,pos,parse.
2.) Here is your sentence in PTB format for clarity:
(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))
As you can see, ROOT is the root node of the tree, but CS is immediately dominated by NUR rather than by ROOT, so your pattern would not match in this sentence. I personally find the PTB format easier for seeing the tree structure and for writing Tregex patterns. You can get it via the json or text output formats (instead of the serialized object): in the client request, set output_format="text".
3.) Here is the latest documentation on using the Stanza client: https://stanfordnlp.github.io/stanza/client_properties.html
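To see why the original pattern finds nothing, the dominance check from comment 2 can be reproduced without a CoreNLP server. The following is a minimal sketch in plain Python, not Tregex: `parse_ptb`, `nodes_with_ancestors`, `words`, and `s_under` are hypothetical helper names introduced only for illustration. It parses the bracketed tree shown above and selects S nodes by their chain of immediate parents.

```python
PTB = ("(ROOT (NUR (CS (S (PROPN Peter) (VERB kommt)) "
       "(CCONJ und) (S (PROPN Paul) (VERB geht)))))")

def parse_ptb(text):
    """Parse a bracketed tree into (label, children) tuples; leaf words stay strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()

def nodes_with_ancestors(node, ancestors=()):
    """Yield every node with its ancestor labels, nearest parent first."""
    yield node, ancestors
    label, children = node
    for child in children:
        if isinstance(child, tuple):
            yield from nodes_with_ancestors(child, (label,) + ancestors)

def words(node):
    """Collect the leaf words under a node, left to right."""
    out = []
    for child in node[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out

def s_under(tree, parent_chain):
    """Spans of S nodes whose chain of parents up to the root equals parent_chain."""
    return [" ".join(words(node))
            for node, ancestors in nodes_with_ancestors(tree)
            if node[0] == "S" and ancestors == tuple(parent_chain)]

tree = parse_ptb(PTB)
original = s_under(tree, ["CS", "ROOT"])          # CS directly under ROOT
adjusted = s_under(tree, ["CS", "NUR", "ROOT"])   # CS under NUR under ROOT
print(original)
print(adjusted)
```

The first query is empty because NUR sits between CS and ROOT; the second returns both clauses. A Tregex pattern along the lines of 'S > (CS > (NUR > ROOT))' should express the same condition, but that is an untested assumption here.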

Related

Is there a way to make assertions only on selected array elements?

Let's say I want to check whether an array contains an element with a certain field value, and that this element fulfills a given assertion, e.g.:
{ array: [ { element1: Mario, element2: White }, { element1: Luigi, element2: Green } ] }
Here I want to check that the element of array which has element1 equal to Mario has element2 equal to White. How can I do so with chai/supertest (or other npm packages)?
You can do it with the jsonpath module:
var jp = require('jsonpath');
expect(jp.query(json, '$..array[?(@.element1=="Mario")]')[0].element2).to.equal('White');
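The same select-then-assert idea can be sketched without any JSONPath library. This is an illustration in Python rather than chai/supertest, and `select` is a hypothetical helper, not part of any package mentioned above: it first filters the array by the identifying field and only then asserts on the selected element.

```python
data = {
    "array": [
        {"element1": "Mario", "element2": "White"},
        {"element1": "Luigi", "element2": "Green"},
    ]
}

def select(items, **criteria):
    """Return the items whose fields equal all of the given criteria."""
    return [item for item in items
            if all(item.get(key) == value for key, value in criteria.items())]

selected = select(data["array"], element1="Mario")

# Fail loudly if the element is missing instead of raising an IndexError later
assert len(selected) == 1, "expected exactly one element with element1 == 'Mario'"
assert selected[0]["element2"] == "White"
```

Checking the length first keeps a missing element from showing up as an unrelated indexing error.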

Elasticsearch - Recursive nested JSON object

I'm trying to parse an HTML document into a nested set of tags and content. It needs to support arbitrary nesting depth. The object (created in Python code) looks like:
{
  "content": [
    "some text about a thing, ",
    {"content": "More text with additional set of tags ",
     "tags": ["strong"]
    }
  ],
  "tags": ["p"]
}
ES seems to dislike this structure, because the content field is of both a text and an object type, producing this error: "reason": "mapper [content] of different type, current_type [text], merged_type [ObjectMapper]"
Does anyone have any ideas on how to index this type of object, and also allow for searches on both tags and content? Ideally I'd like to search by tags associated with the ancestors of a given object too. I can reformat it to
{
  "content": [
    {"content": "some text about a thing, "},
    {"content": "More text with a different set of tags ",
     "tags": ["strong"]
    }
  ],
  "tags": ["p"]
}
But then searching isn't very effective as I need to write content.content:"search string" to get results, which will become hard with multiple levels of nesting.
Why not store the ancestor tags in a separate field? Implementing a nested set should solve your problem too.
Edit: As requested, here is an example of a nested set.
Imagine a tree structure. Every node in this tree has a set of properties, such as a description or other attributes. Each node also holds a reference to its parent node. In addition, there are two numbers: the left and right positions of the node in the tree when traversing it depth-first:
A(parent:null, left:1, right:12, desc:"root node")
B(parent:A, left:2, right:3, desc:"left child")
C(parent:A, left:4, right:11, desc:"right child")
D(parent:C, left:5, right:6, desc:"foo")
E(parent:C, left:7, right:10, desc:"bar")
F(parent:E, left:8, right:9, desc:"baz")
Calculating all ancestors of a node X is now easy:
ancestors(X) = all nodes N WHERE N.left < X.left AND N.right > X.right
For the node F you'll get [E,C,A]. Ordering them by the left value you'll get the proper order for the ancestors of F.
So now you can use this criterion for the filter query in ES and use a second query to search the attributes of the filtered nodes.
This structure is very efficient when looking for subtrees, but has downsides when you change the node order/position.
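The numbering and the ancestor lookup can be sketched in a few lines of Python. This is only an in-memory illustration of the nested-set idea, not an Elasticsearch query; `number_nodes` and `ancestors` are hypothetical helper names, and the tree is the A to F example from above.

```python
# Child lists for the example tree: A -> (B, C), C -> (D, E), E -> (F)
children = {"A": ["B", "C"], "B": [], "C": ["D", "E"],
            "D": [], "E": ["F"], "F": []}

def number_nodes(children, root):
    """Assign (left, right) positions by depth-first traversal."""
    bounds = {}
    counter = 0

    def visit(name):
        nonlocal counter
        counter += 1
        left = counter                # position when entering the node
        for child in children[name]:
            visit(child)
        counter += 1
        bounds[name] = (left, counter)  # counter is now the leaving position

    visit(root)
    return bounds

def ancestors(bounds, name):
    """Nodes whose interval strictly encloses the given node, ordered by left."""
    left, right = bounds[name]
    found = [n for n, (l, r) in bounds.items() if l < left and r > right]
    return sorted(found, key=lambda n: bounds[n][0])

bounds = number_nodes(children, "A")
print(bounds["E"])               # (7, 10)
print(ancestors(bounds, "F"))    # ['A', 'C', 'E']
```

Sorting by the left value lists the ancestors from the root down. The downside mentioned above is visible in `number_nodes`: moving or inserting a node means renumbering every node that comes after it in the traversal.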
If you need further explanation, please add a comment.

Using N-Grams to search within a table of string Xcode, swift 3

Below is my code for the current search; right now it only checks for the search string as a whole. How do I implement n-grams in it? This code compares an array of strings against what the user types in on the front end and returns the strings that contain that input.
N-grams are a solution I found online for what I need. What I want is for users to be able to search within the array without the query having to match a string contiguously. Say the array contains "I love sushi and popcorn,...." and the user searches for "I love popcorn": the string "I love sushi and popcorn" will not show up, but I want it to.
One option is to write a function that splits each string on spaces into another array and have the search function loop through that array, repeating this for every other string. However, I find this inefficient.
Please let me know if there are other solutions. Thanks.
func updateSearchResults(for searchController: UISearchController) {
    // filter results
    // reload table
    let searchString: String = searchController.searchBar.text!
    // Keep every entry that contains the search string, ignoring case
    self.dataToDisplay = self.sampleData.filter({ (dataString: String) -> Bool in
        let match = dataString.range(of: searchString, options: NSString.CompareOptions.caseInsensitive)
        return match != nil
    })
    self.pdfArrSearch = self.pdfArr.filter({ (dataString: String) -> Bool in
        let match = dataString.range(of: searchString, options: NSString.CompareOptions.caseInsensitive)
        return match != nil
    })
}
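The word-splitting option described in the question can be sketched as follows. This is Python rather than Swift, and it is a per-word substring check, not a true n-gram index; `matches_all_words` and `search` are hypothetical names. An entry matches when every word of the query occurs somewhere in it, in any order.

```python
def matches_all_words(query, candidate):
    """True if every whitespace-separated query word occurs in candidate."""
    haystack = candidate.lower()
    return all(word in haystack for word in query.lower().split())

def search(query, data):
    """Return the entries of data that contain all words of the query."""
    return [item for item in data if matches_all_words(query, item)]

sample_data = ["I love sushi and popcorn", "Pizza night"]
results = search("I love popcorn", sample_data)
print(results)   # ['I love sushi and popcorn']
```

Because each word is matched as a substring, a short word such as "I" matches almost any entry, and an empty query matches everything. Whether that is acceptable depends on the data.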

Parsing JSON in Ruby (like XPATH)

I have a JSON document returned from a query to the Google Books API, e.g:
{
"items": [
{
"volumeInfo": {
"industryIdentifiers": [
{
"type": "OTHER",
"identifier": "OCLC:841804665"
}
]
}
},
{
"volumeInfo": {
"industryIdentifiers": [
{
"type": "ISBN_10",
"identifier": "156898118X"...
I need the ISBN number (type: ISBN_10 or ISBN_13) and I've written a simple loop that traverses the parsed JSON (parsed = JSON.parse(my_uri_response)). In this loop, I have a next if k['type'] = "OTHER", which sets "type" to "OTHER" (an assignment rather than a comparison).
How do I best extract just one ISBN number from my JSON example? Not all of them, just one.
Something like XPath search would be helpful.
JSONPath may be just what you're looking for:
require 'jsonpath'
json = #your raw JSON above
path = JsonPath.new('$..industryIdentifiers[?(@.type == "ISBN_10")].identifier')
puts path.on(json)
Result:
156898118X
See this page for how XPath translates to JSONPath. It helped me determine the JSONPath above.
How about:
parsed['items'].map { |book|
  book['volumeInfo']['industryIdentifiers'].find { |prop|
    ['ISBN_10', 'ISBN_13'].include? prop['type']
  }['identifier']
}
If you receive undefined method [] for nil:NilClass, this means that you have an element within the items array that has no volumeInfo key, or that you have a volume with a set of industryIdentifiers without an ISBN. The code below should cover all those cases (plus the case when you have volumeInfo without industryIdentifiers):
parsed['items'].map { |book|
  identifiers = book['volumeInfo'] && book['volumeInfo']['industryIdentifiers']
  isbn_identifier = identifiers && identifiers.find { |prop|
    ['ISBN_10', 'ISBN_13'].include? prop['type']
  }
  isbn_identifier && isbn_identifier['identifier']
}.compact
If you happen to have the andand gem, this might be written as:
parsed['items'].map { |book|
  book['volumeInfo'].andand['industryIdentifiers'].andand.find { |prop|
    ['ISBN_10', 'ISBN_13'].include? prop['type']
  }.andand['identifier']
}.compact
Note that this will return only one ISBN for each volume. If you have volumes with both ISBN_10 and ISBN_13 and you want to get both, you'll need to use the select method instead of find, and .map{|i| i['identifier']} in place of .andand['identifier'].
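Since the question asks for just one ISBN, the nil-safe traversal above can also stop at the first hit. A minimal sketch in Python rather than Ruby: `first_isbn` is a hypothetical helper, the sample data reuses the values from the question, and the third item (an empty volumeInfo) is an assumption added to exercise the missing-key case.

```python
parsed = {
    "items": [
        {"volumeInfo": {"industryIdentifiers": [
            {"type": "OTHER", "identifier": "OCLC:841804665"}]}},
        {"volumeInfo": {"industryIdentifiers": [
            {"type": "ISBN_10", "identifier": "156898118X"}]}},
        {"volumeInfo": {}},   # assumed edge case: no industryIdentifiers
    ]
}

def first_isbn(parsed, wanted=("ISBN_10", "ISBN_13")):
    """Return the first identifier whose type is in wanted, or None."""
    for book in parsed.get("items", []):
        # .get with defaults tolerates missing volumeInfo / industryIdentifiers
        identifiers = book.get("volumeInfo", {}).get("industryIdentifiers", [])
        for entry in identifiers:
            if entry.get("type") in wanted:
                return entry.get("identifier")
    return None

isbn = first_isbn(parsed)
print(isbn)   # 156898118X
```

Returning early avoids building the full list and then discarding all but one element.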

Word 2010 combining INCLUDEPICTURE and IF

Using MS Word 2010 I am trying to place an INCLUDEPICTURE field into a block of an IF statement. While both the IF statement and the INCLUDEPICTURE work correctly separately, they do not work in combination.
IF Statement:
{ IF { MERGEFIELD condition \* MERGEFORMAT } = "expression" "true" "false" \* MERGEFORMAT }
This works correctly.
INCLUDEPICTURE:
{ INCLUDEPICTURE "picture.png" }
This works correctly, too.
Combination of the two:
{ IF { MERGEFIELD condition \* MERGEFORMAT } = "expression" "{ INCLUDEPICTURE "picture.png" }" "false" \* MERGEFORMAT }
This does not work. If the IF expression is true, nothing is displayed at all.
How can I combine both the IF statement and the INCLUDEPICTURE command?
This is a well-known problem (i.e. you are right, it doesn't work).
Unfortunately, there isn't a particularly good solution - the simplest involves using a blank 1-pixel image file.
The usual starting point is to invert the nesting so that you have something more like this...
{ INCLUDEPICTURE "{ IF "{ MERGEFIELD condition }" = "expression" "picture.png" }" \d }
This always tries to insert a picture, and will report (and insert) an error in the case where { MERGEFIELD condition } <> "expression". The simplest resolution is to have a blank 1-pixel picture that you can include instead, e.g.
{ INCLUDEPICTURE "{ IF "{ MERGEFIELD condition }" = "expression" "picture.png" "blank1.png" }" \d }
It is sometimes clearer to remove the test and assignment and do it separately, particularly if there are multiple tests. In this case,
{ SET picname "{ IF "{ MERGEFIELD condition }" = "expression" "picture.png" "blank1.png" }" }
or if you prefer,
{ IF "{ MERGEFIELD condition }" = "expression" "{ SET picname "picture.png" }" "{ SET picname "blank1.png" }" }
You still need an IF nested inside the INCLUDEPICTURE to make it work. You can use:
{ INCLUDEPICTURE "{ IF TRUE { picname } }" \d }
If you merge those nested fields to an output document, the fields will remain in the output. If you want the fields to be resolved (e.g. because you need to send the output to someone who does not have the image files) then you need something more like this:
{ IF { INCLUDEPICTURE "{ IF TRUE { picname } }" } { INCLUDEPICTURE "{ IF TRUE { picname } }" \d } }
I believe you can reduce this to
{ IF { INCLUDEPICTURE "{ picname }" } { INCLUDEPICTURE "{ IF TRUE { picname } }" \d } }
In fact, I believe you can insert the full path+name of any graphic file that you know exists instead of the first { picname }, e.g.
{ IF { INCLUDEPICTURE "the full pathname of blank1.png" } { INCLUDEPICTURE "{ IF TRUE { picname } }" \d } }
But you should check that those work for you.
EDIT
FWIW, some recent tests suggest that whereas the pictures appear unlinked, a save/re-open displays a reconstituted link (with a \* MERGEFORMATINET near the end), and the pictures are expected to be at the locations indicated in those links. Whether this is due to a change in Word I cannot tell. If anything has changed, it looks to be an attempt to allow some relative path addressing in the Relationship records that Word creates inside the .docx.
Some observations...
Make sure paths have doubled-up backslashes, e.g. c:\\mypath\\blank1.png . This is usually necessary for any paths hard-coded into fields. For paths that come in via nested field codes, please check.
As a general point, it is easier to work with INCLUDEPICTURE fields when the document is a .doc, not .docx, and to ensure that File->Options->Advanced->General->Web options->Files->"Update links on save" is checked. Otherwise, Word is more likely to replace INCLUDEPICTURE fields with a result that cannot be redisplayed as a field using Alt-F9.
When you want to treat the comparands in an IF field as strings, it is advisable to surround them with double-quotes, as I have done. Otherwise, a { MERGEFIELD } field that resolves to the name of a bookmark may not behave as you would hope. Otherwise, spacing and quoting is largely a matter of personal choice.
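If the picture paths are prepared by a script before the merge (for example when generating the data source), the backslash doubling can be applied there. A minimal sketch in Python, assuming the input uses single backslashes; `word_field_path` is a hypothetical helper name and not part of Word or any library.

```python
def word_field_path(path):
    """Double each backslash so the path can be hard-coded in a Word field."""
    return path.replace("\\", "\\\\")

escaped = word_field_path(r"c:\mypath\blank1.png")
print(escaped)   # c:\\mypath\\blank1.png
```

A path that is already doubled would be doubled again, so the helper should only be applied once to raw paths.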
So far, none of these field constructions will deal with the situation where you have path names for pictures that may or may not exist. If that is what you need, please modify your original question.
Step by step guide:
bibadia's answer works, but Word does not tell you when you make mistakes, so it is very hard to get it right. I hope this step-by-step answer helps.
Step 1: Add a Picture
In Word 2013 docx (no idea about other versions) add
{ INCLUDEPICTURE "C:\\picture.png" }
Note: Use CTRL+F9 to add { } , don't ever type them in, as they will not work.
Use \\ and not \
Run the mail merge, do Ctrl+A then F9 to show the picture.
Step 2: Auto Show it
To change the mail merge document use (CTRL+A Shift+F9). Change it to
{ SET picname "C:\\picture.png" }
{ INCLUDEPICTURE "{ IF TRUE { picname } }" \d }
Run the mail merge - the picture should show up, no need for Ctrl+A then F9
Step 3: Unlink it
Remove the \d
This will let you email the doc, as the \d causes the document to create a link to the image file rather than include it.
Step 4: add an IF
Use bibadia's solution, i.e.
{ SET picname "{ IF "{ MERGEFIELD condition }" = "expression" "picture.png" "blank1.png" }" }
Another option that I've tested and that works is to use an IF statement to check an expression (in my example, whether the entry is not empty): if it is not empty, display the image; otherwise display some custom text (if you don't want text, just use empty quotation marks, i.e. ""):
{IF {MERGEFIELD my_photo_variable_name} <> "" {INCLUDEPICTURE "{IF TRUE {MERGEFIELD my_photo_variable_name}}" \d} "Text to display if no picture available"}
Which translates as:
If there is a value for the image my_photo_variable_name, include the image in the mail merge.
If there is no value, i.e. no image, then display the custom text "Text to display if no picture available".
