Fastest/shortest way to build unique tree in Ruby? - ruby

What is the fastest/shortest/one-liner (not possible :p) way to build a unique tree of elements from a tree in which many elements are duplicated or missing in some nodes? The tree has a defined set of nodes, and we'd use this algorithm to discover that set so we don't have to do it manually.
It could be XML/JSON(hash), or whatever. So something like this:
root {
  nodes {
    nodeA {}
    nodeB {
      subNodeA {}
    }
  }
  nodes {
    nodeA {
      subNodeA {}
    }
    nodeB {
      subNodeX {}
    }
  }
}
...converted to this:
root {
  nodes {
    nodeA {
      subNodeA {}
    }
    nodeB {
      subNodeA {}
      subNodeX {}
    }
  }
}
Same with xml:
<root>
  <nodes>
    <nodeA/>
    <nodeB>
      <subNodeA/>
    </nodeB>
  </nodes>
  <nodes>
    <nodeA>
      <subNodeA/>
    </nodeA>
    <nodeB>
      <subNodeX/>
    </nodeB>
  </nodes>
</root>
...converted to this:
<root>
  <nodes>
    <nodeA>
      <subNodeA/>
    </nodeA>
    <nodeB>
      <subNodeA/>
      <subNodeX/>
    </nodeB>
  </nodes>
</root>
The xml/json files could be decently large (1MB+), so having to iterate over every element depth-first or something seems like it would take a while. It could also be as small as the example above.

This'll get you a set of unique paths:
require 'nokogiri'
require 'set'
xml = Nokogiri::XML.parse(your_data)
paths = Set.new
xml.traverse {|node| next if node.text?; paths << node.path.gsub(/\[\d+\]/,"").sub(/\/$/,"")}
Does that get you started?
[response to question in comment]
Adding attribute paths is also easy, but let's go at least a little bit multi-line:
xml.traverse do |node|
  next if node.text?
  paths << (npath = node.path.gsub(/\[\d+\]/,"").sub(/\/$/,""))
  paths += node.attributes.map {|k,v| "#{npath}##{k}"}
end
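If the goal is the merged tree rather than a flat list, the unique paths can be folded into a nested hash. This is a minimal sketch, assuming the paths are slash-separated with the positional indexes already stripped, as above. `build_tree` and the sample `paths` are illustrative names and data, not part of Nokogiri:

```ruby
require 'set'

# Hypothetical sample: the kind of paths the traversal above would collect
# for the XML in the question.
paths = Set.new(%w[
  /root
  /root/nodes
  /root/nodes/nodeA
  /root/nodes/nodeA/subNodeA
  /root/nodes/nodeB
  /root/nodes/nodeB/subNodeA
  /root/nodes/nodeB/subNodeX
])

# Fold each path into a nested hash; repeated prefixes land on the same key,
# so duplicated nodes collapse into one.
def build_tree(paths)
  paths.each_with_object({}) do |path, tree|
    path.split('/').reject(&:empty?).reduce(tree) do |node, name|
      node[name] ||= {}
    end
  end
end

tree = build_tree(paths)
```

Each path costs one hash lookup per segment, so the work grows with the total number of path segments rather than with the number of duplicated elements.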

Related

fileTree visit exclude directories

I'm looking for some mechanism that would allow me to execute an action over each file in a directory that matches a certain pattern.
I'm currently trying to make fileTree work this way.
def srcDir = 'myDir'
def includePattern = '*'
def tree = fileTree(srcDir) {
    include includePattern
}
tree.visit { d ->
    logger.info(d.file)
}
My directory looks like this:
myDir/file1
myDir/file2
myDir/subDir/file3
What I would like to have as output is:
/../myDir/file1
/../myDir/file2
But of course subDir also matches the * include pattern. So it gets included in the result.
How can I only visit files?
If you do not want to do a recursive scan but just scan a single directory, you can do something like eachFileMatch:
import static groovy.io.FileType.*
new File('myDir').eachFileMatch FILES, ~/.*/, { f ->
    println f.name
}
which would print:
$ groovy solution.groovy
file2
file1
where f is a java.io.File and ~/.*/ is a regular expression matching on file names. .* means any character, zero or more times. To match say a .txt extension you would do something like ~/.*\.txt/.
Consider:
def srcDir = 'myDir'
task go() {
    doFirst {
        new File(srcDir).eachFileRecurse { f ->
            if (f.isFile()) {
                println f.absolutePath.replaceAll("${projectDir}", "/..")
            }
        }
    }
}
where this is invoked with gradle go.
The best thing I could come up with that will still work with an include pattern (and exclude if needed) is to add an if in the visit closure.
def srcDir = 'myDir'
def includePattern = '*'
def tree = fileTree(srcDir) {
    include includePattern
}
tree.visit { d ->
    if (!d.file.isDirectory()) {
        logger.info(d.file)
    }
}
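For comparison outside Gradle, the same "match a pattern, then skip directories" filter can be sketched in plain Ruby. This is only an analogue of the `isDirectory()` check above, not Gradle API. The helper name `files_only` is hypothetical, and the temporary directory layout mirrors the `myDir` from the question:

```ruby
require 'tmpdir'
require 'fileutils'

# Return the names of regular files directly inside dir that match pattern,
# skipping subdirectories even when their names match the pattern.
def files_only(dir, pattern = '*')
  Dir.glob(File.join(dir, pattern))
     .select { |path| File.file?(path) }
     .map { |path| File.basename(path) }
     .sort
end

# Build myDir/file1, myDir/file2, myDir/subDir/file3 in a scratch directory.
result = Dir.mktmpdir do |root|
  my_dir = File.join(root, 'myDir')
  FileUtils.mkdir_p(File.join(my_dir, 'subDir'))
  %w[file1 file2 subDir/file3].each { |f| FileUtils.touch(File.join(my_dir, f)) }
  files_only(my_dir)
end
```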

Pattern matching with Tregex in Stanza's CoreNLP client doesn't seem to find the right subtrees

I am relatively new to NLP, and at the moment I'm trying to extract different phrase structures in German texts. For that I'm using the Stanford CoreNLP implementation of Stanza with the Tregex feature for pattern matching in trees.
So far I didn't have any problems and I was able to match simple patterns like "NPs" or "S > CS".
Now I'm trying to match S nodes that are immediately dominated either by ROOT or by a CS node that is immediately dominated by ROOT. For that I'm using the pattern "S > (CS > TOP) | > TOP". But it doesn't seem to work properly. I'm using the following code:
from stanza.server import CoreNLPClient

text = "Peter kommt und Paul geht."

def linguistic_units(_client, _text, _pattern):
    matches = _client.tregex(_text, _pattern)
    sentences = matches['sentences']
    print('+++++Tree++++')
    print(sentences[0])
    # Assumption: count_units is meant to be the number of matches found.
    count_units = 0
    for sentence in sentences:
        for match_id in sentence:
            print(sentence[match_id]['spanString'])
            count_units += 1
    return count_units

with CoreNLPClient(properties='./corenlp/StanfordCoreNLP-german.properties',
                   annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=300000,
                   be_quiet=True,
                   endpoint='http://localhost:9001',
                   memory='16G') as client:
    result = linguistic_units(client, text, 'S > (CS > ROOT) | > ROOT')
    print(result)
In the example with the text "Peter kommt und Paul geht" the pattern I'm using should match the two phrases "Peter kommt" and "Paul geht", but it doesn't work.
Afterwards I had a look at the tree itself, and the output of the parser was the following:
constituency parse of first sentence
child {
  child {
    child {
      child {
        child {
          value: "Peter"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "kommt"
        }
        value: "VERB"
      }
      value: "S"
    }
    child {
      child {
        value: "und"
      }
      value: "CCONJ"
    }
    child {
      child {
        child {
          value: "Paul"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "geht"
        }
        value: "VERB"
      }
      value: "S"
    }
    value: "CS"
  }
  child {
    child {
      value: "."
    }
    value: "PUNCT"
  }
  value: "NUR"
}
value: "ROOT"
score: 5466.83349609375
I now suspect that this is due to the ROOT node, since it is the last node of the tree. Should the ROOT node not be at the beginning of the tree?
Does anyone know what I am doing wrong?
A few comments:
1.) Assuming you are using a recent version of CoreNLP (4.0.0+), you need to use the mwt annotator with German. So your annotators list should be tokenize,ssplit,mwt,pos,parse
2.) Here is your sentence in PTB for clarity:
(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))
As you can see, ROOT is the root node of the tree; it only appears last in the serialized output because each node's value is printed after its children. The CS node is immediately dominated by NUR, not by ROOT, so your pattern does not match in this sentence. I personally find the PTB format easier for seeing the tree structure and for writing Tregex patterns. You can get it via the json or text output formats (instead of the serialized object): in the client request, set output_format="text".
3.) Here is the latest documentation on using the Stanza client: https://stanfordnlp.github.io/stanza/client_properties.html
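To see why the pattern finds nothing, the immediate-dominance checks can be replayed by hand on the PTB tree from point 2. The sketch below is plain Ruby, not Tregex or Stanza. The nested-array encoding and the helper names are assumptions made for illustration. It also tries a variant that accounts for the intervening NUR node, which would correspond to a pattern such as "S > (CS > NUR)":

```ruby
# Tree nodes as [label, *children]; leaf words are plain strings.
TREE =
  ['ROOT',
   ['NUR',
    ['CS',
     ['S', ['PROPN', 'Peter'], ['VERB', 'kommt']],
     ['CCONJ', 'und'],
     ['S', ['PROPN', 'Paul'], ['VERB', 'geht']]]]]

# Collect [node, parent, grandparent] for every node carrying the given label.
def with_ancestors(node, label, parent = nil, grandparent = nil, found = [])
  return found unless node.is_a?(Array)
  found << [node, parent, grandparent] if node.first == label
  node.drop(1).each { |child| with_ancestors(child, label, node, parent, found) }
  found
end

# Label of a node, or nil when there is no such ancestor.
def label_of(node)
  node&.first
end

# Leaf words under a node, left to right.
def words(node)
  node.is_a?(Array) ? node.drop(1).flat_map { |c| words(c) } : [node]
end

s_nodes = with_ancestors(TREE, 'S')

# "S > (CS > ROOT) | > ROOT": parent is a CS whose parent is ROOT, or parent is ROOT.
strict = s_nodes.select do |_s, parent, grand|
  (label_of(parent) == 'CS' && label_of(grand) == 'ROOT') || label_of(parent) == 'ROOT'
end

# Variant that takes the NUR node into account: parent is a CS whose parent is NUR.
via_nur = s_nodes.select do |_s, parent, grand|
  label_of(parent) == 'CS' && label_of(grand) == 'NUR'
end

strict_spans  = strict.map  { |s, _, _| words(s).join(' ') }
via_nur_spans = via_nur.map { |s, _, _| words(s).join(' ') }
```

On this tree `strict_spans` is empty, while `via_nur_spans` holds the two clauses. Whether NUR appears in every parse is not something the sketch can tell you, so check a few more sentences before relying on it.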

Get parent node based on some condition in ruby

I have an array of hashes like below.
prop = {"Pets"=>[]},
{"Misc"=>["HOA Frequency: (C101)"], "photos"=>nil},
{"Legal and finance"=>["HOA fee: $300.0"], "photos"=>nil}
I need to get Legal and finance nodes based on some condition.
I tried like below.
prop.find { |feature| feature.keys.include?("Legal and finance") }
But sometimes HOA fee will be under different node. It might be in "Finance" or "Legal and Finance" or "Home Finance" like
{"Finance"=>["HOA fee: $300.0"], "photos"=>nil} or
{"Home Finance"=>["HOA fee: $300.0"], "photos"=>nil}
So I need to get the complete node by checking whether any of its values contains the text "HOA Fee".
prop.find do |feature|
  feature.values.flatten.compact.any? do |value|
    # Compare case-insensitively: the data says "HOA fee", not "HOA Fee".
    value.downcase.include?("hoa fee")
  end
end
This is a very messy data structure, however.
I would strongly advise you to refactor the code to store data in well-defined objects, not hashes of hashes of arrays...
I would do something like this:
prop.find { |hash| hash.keys.any? { |key| key.downcase.include?('finance') } }
#=> { "Legal and finance" => ["HOA fee: $300.0"], "photos" => nil }
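A self-contained version of the value-based lookup, for trying it out. The helper name `find_by_value` is hypothetical, and the sample data uses the "Home Finance" variant from the question to show that the key name does not matter. The comparison ignores case, since the question writes both "HOA fee" and "HOA Fee":

```ruby
prop = [
  { "Pets" => [] },
  { "Misc" => ["HOA Frequency: (C101)"], "photos" => nil },
  { "Home Finance" => ["HOA fee: $300.0"], "photos" => nil }
]

# Return the first hash that has any string value mentioning needle, ignoring case.
def find_by_value(list, needle)
  list.find do |feature|
    feature.values.flatten.compact.any? { |v| v.downcase.include?(needle.downcase) }
  end
end

match = find_by_value(prop, "HOA Fee")
```

Matching on the value rather than the key also avoids false hits on a key such as "Finance" that happens to hold no HOA fee entry.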

How to get nokogiri attribute value?

My xml contains multiple statements like
<House name="bla"><Room id="bla" name="black" ><blah id="blue" name="brown"></blah></Room></House>
I need to get all the values for the given keyword.
I used nodes = doc.css("[name]") to get the <Room id="bla" name="black" ><blah id="blue" name="brown"></blah></Room>.
But how do I get the value for a key from this. Is there any easier way to do this?
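node_names = doc.css("[name]").map { |node| node['name'] }
gives the name of every matching node. For just "black", select the Room element, since at_css("[name]") alone returns the first match in document order (the House element):
black = doc.at_css("Room")['name']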
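The same attribute lookup can be tried without installing a gem by using REXML, which ships with standard Ruby distributions. Note that this swaps in a different library: the calls below are REXML's, not Nokogiri's, and the XPath `//*[@name]` stands in for the CSS selector `[name]`:

```ruby
require 'rexml/document'

xml = '<House name="bla"><Room id="bla" name="black"><blah id="blue" name="brown"></blah></Room></House>'
doc = REXML::Document.new(xml)

# Value of the name attribute for every element that carries one.
names = REXML::XPath.match(doc, '//*[@name]').map { |el| el.attributes['name'] }

# Only the Room element's name.
room_name = REXML::XPath.first(doc, '//Room').attributes['name']
```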

Ruby: Extract and operate on partially extracted Nokogiri objects

require 'nokogiri'
xml = DATA.read
xml_nokogiri = Nokogiri::XML.parse xml
widgets = xml_nokogiri.xpath("//Widget")
dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
puts dates
__END__
<Widgets>
<Widget>
<Price>42</Price>
<DateAdded>04/22/1989</DateAdded>
</Widget>
<Widget>
<Price>29</Price>
<DateAdded>02/05/2015</DateAdded>
</Widget>
</Widgets>
Notes:
This is a contrived example I cooked up, as it's very inconvenient to post the actual code because of too many dependencies. I did it this way so the code is readily testable on copy/paste.
widgets is a Nokogiri::XML::NodeSet object which has two Nokogiri::XML::Elements, each of which is the XML fragment corresponding to a Widget tag.
I intend to operate on each of those fragments with xpath again, but an xpath query that starts with // seems to query from the ROOT of the xml AGAIN, not from the individual fragment.
Any idea why that is? I was expecting dates to hold the tag of each fragment alone.
EDIT: Assume that the tags have a complicated structure that
relative addressing is not practical (like using
xpath("DateAdded"))
.//DateAdded gives you a relative XPath (any nested DateAdded node); a plain DateAdded without leading slashes is also relative and selects only immediate children:
- dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
# for immediate children use 'DateAdded'
+ dates = widgets.map { |widget| widget.xpath("DateAdded").text }
# for nested elements use './/DateAdded'
+ dates = widgets.map { |widget| widget.xpath(".//DateAdded").text }
#⇒ [
# [0] "04/22/1989",
# [1] "02/05/2015"
#]
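The relative form can be tried with REXML, which ships with standard Ruby distributions, if Nokogiri is not at hand. This swaps in a different library, so the method names differ from the Nokogiri code above, but the XPath expression is the same `.//DateAdded`:

```ruby
require 'rexml/document'

xml = <<~XML
  <Widgets>
    <Widget><Price>42</Price><DateAdded>04/22/1989</DateAdded></Widget>
    <Widget><Price>29</Price><DateAdded>02/05/2015</DateAdded></Widget>
  </Widgets>
XML

doc = REXML::Document.new(xml)
widgets = REXML::XPath.match(doc, '//Widget')

# './/DateAdded' starts from the current widget, so each widget
# contributes only its own date.
dates = widgets.map do |widget|
  REXML::XPath.match(widget, './/DateAdded').map(&:text).join
end
```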
