How to select only paragraphs that contain certain child elements with nokogiri? - ruby

I have the following XML:
<w:p w14:paraId="07E73137" w14:textId="77777777" w:rsidP="00D279DF" w:rsidR="00D279DF" w:rsidRDefault="00D279DF">
</w:p>
<w:p w14:paraId="07E73138" w14:textId="77777777" w:rsidP="00D279DF" w:rsidR="00D279DF" w:rsidRDefault="00D279DF>
<w:r w:rsidRPr="00922473">
<w:t xml:space="preserve">Visual attributes </w:t>
</w:r>
<w:ins w:author="RKH RKH" w:date="2016-12-17T16:40:00Z" w:id="0">
<w:r>
<w:t>an</w:t>
</w:r>
</w:ins>
<w:del w:author="RKH RKH" w:date="2016-12-17T16:40:00Z" w:id="1">
<w:r w:rsidDel="008B2A6A">
<w:delText>the</w:delText>
</w:r>
</w:del>
</w:p>
The first <w:p> element does not contain any <w:ins> or <w:del> child elements.
The second <w:p>, however, contains both <w:ins> and <w:del> elements.
I am currently selecting all paragraph elements using the following:
@all_paragraph_nodes = @file.xpath('//w:p')
I would like to select only paragraph elements that contain at least one <w:ins> element or <w:del> element.
How can I do this using Nokogiri?

You can use:
@all_paragraph_nodes = @file.xpath('//w:p[w:ins or w:del]')
Note that you have a typo on the 3rd line of your XML. The attribute value
w:rsidRDefault="00D279DF
is missing its closing quote.

Related

xpath: How to select only nearby parent siblings of current child node?

I'm using XPath 1.0 and I want to select the siblings of the parent of the current child node.
The current node I am on is <w:instrText>. How can I get the parent's sibling <w:r> nodes that lie only in the range between <!-- SELECT FROM HERE --> and <!-- SELECT TILL HERE -->?
<w:p>
<w:pPr>
<w:rPr>
<w:lang w:val='en-US'/>
</w:rPr>
</w:pPr>
<w:r>
<w:t>TEXT 1</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:t xml:space='preserve'> </w:t>
</w:r>
<!-- SELECT FROM HERE -->
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='begin'/>
</w:r>
<w:r w:rsidR='002D5530'>
<w:instrText xml:space='preserve'> SOME DYNAMIC TEXT THAT CAN CHANGE </w:instrText> <!-- My cursor is here! -->
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='separate'/>
</w:r>
<w:r w:rsidR='00E5783C'>
<w:t>SOME DYNAMIC TEXT</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='end'/>
</w:r>
<!-- SELECT TILL HERE -->
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='begin'/>
</w:r>
<w:r w:rsidR='002D5530'>
<w:instrText xml:space='preserve'> SOME DYNAMIC TEXT THAT CAN CHANGE </w:instrText>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='separate'/>
</w:r>
<w:r w:rsidR='00E5783C'>
<w:t>SOME DYNAMIC TEXT</w:t>
</w:r>
<w:r w:rsidR='002D5530'>
<w:fldChar w:fldCharType='end'/>
</w:r>
</w:p>
This is what I tried, but I got only 3 nodes:
../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'] | ../following-sibling::w:r[w:fldChar/@w:fldCharType='end']
Given the limitations of XPath 1.0, and assuming that the order within each set of five w:r elements is consistent, you could use a somewhat brute-force approach. XPath 1.0 has no comma sequence constructor, so the pieces are combined with the union operator |:
(
(../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'])[last()]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='separate'])[position() = 1]
| (../following-sibling::w:r[w:t])[position() = 1]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='end'])[position() = 1]
)
This forms a node-set of four elements:
the last of the preceding "begin" elements
the first of the following "separate" elements
the first of the following "t" elements (w:r containing a w:t)
the first of the following "end" elements
If you also need to include the immediate parent of your context, then just add .. or parent::* or parent::w:r to the union:
(
(../preceding-sibling::w:r[w:fldChar/@w:fldCharType='begin'])[last()]
| ..
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='separate'])[position() = 1]
| (../following-sibling::w:r[w:t])[position() = 1]
| (../following-sibling::w:r[w:fldChar/@w:fldCharType='end'])[position() = 1]
)
However, this may not be efficient if your input document is huge or performance is a significant factor.
If you are curious why your attempt returned three elements instead of two it is probably because you selected both "end" elements that were beyond the position of your context. The answer here may not be precisely what you need, but I suspect the last()/position() predicate examples will set you in the right direction.
BTW, if you ever have the opportunity to use XQuery for something like this, it is a perfect use case for the XQuery 3 "tumbling window" syntax which makes it very easy to identify the boundaries of sibling groups and iterate over the groups.

How to select an element only if the next element is of a specific type using Nokogiri

Given the following XML:
<w:p>
<w:pPr>
<w:lang w:val="en-CA"/>
</w:pPr>
<w:ins>
<w:t>sections</w:t>
</w:ins>
<w:pPr>
<w:lang w:val="en-CA"/>
</w:pPr>
<w:r>
<w:t>I am</w:t>
</w:r>
</w:p>
I want to select the w:pPr elements that are followed by either w:ins or w:del elements.
I tried:
doc.xpath("w:pPr[following-sibling::w:ins[1] or following-sibling::w:del[1]]")
which still returns the second w:pPr element which is followed by a w:r element so it's not what I'm looking for.
How can I do this?
Try this XPath expression:
//w:pPr[following-sibling::*[position()=1][name()='w:del' or name()='w:ins']]

How to select only nodes that are not only spaces using Nokogiri?

I have the following XML document:
<w:p w14:paraId="572705D7" w14:textId="77777777" w:rsidP="00CA0169" w:rsidR="00CA0169" w:rsidRDefault="00CA0169" w:rsidRPr="00777A35">
<w:r>
<w:t xml:space="preserve"/>
</w:r>
<w:r>
<w:t>synthesized in cyanobacteria under unsuitable condition</w:t>
</w:r>
</w:p>
I currently select all w:r nodes as follows:
text_nodes = p.xpath('w:r')
However, I would like to select only those text nodes that contain text and are not only spaces as the first node is as shown in the xml sample above.
I have extended the String Class to test for spaces as follows:
class String
  def spaces?
    x = self =~ /^\s+$/
    x == 0
  end
end
So I can do:
element.text.spaces?
I just don't know how to put it together with p.xpath('w:r') to select only nodes that are NOT only spaces.
w:r[normalize-space(.) != '']
as your XPath expression should do.

XPath to select all nodes between two text markers in OOXML?

I have a big XML file (from Microsoft Word) that contains tables, paragraphs, etc. I'm trying to grab all of the XML between two elements. For example, I want to grab all of the XML between these two paragraphs:
<w:p w:rsidR="00C82C88" w:rsidRDefault="00265695">
<w:r>
<w:t>#StartHere#</w:t>
</w:r>
</w:p>
a whole bunch of XML
<w:p w:rsidR="00C82C88" w:rsidRDefault="00265695" w:rsidP="00265695">
<w:pPr>
<w:pStyle w:val="Caption"/>
</w:pPr>
<w:r>
<w:t xml:space="preserve">Figure </w:t>
</w:r>
<w:r w:rsidR="00F044F8">
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="00F044F8">
<w:instrText xml:space="preserve"> SEQ Figure \* ARABIC </w:instrText>
</w:r>
<w:r w:rsidR="00F044F8">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:t>1</w:t>
</w:r>
<w:r w:rsidR="00F044F8">
<w:rPr>
<w:noProof/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r>
<w:t>: #StopHere#</w:t>
</w:r>
</w:p>
How can I get Nokogiri to grab all of the XML between #StartHere# and #StopHere#, including the elements that this text is wrapped in? I'd like to call something like extracted_data = document[from..stop] somehow.
I can find those points in the document by looking for:
start = doc.at_xpath("//w:p[.//w:t[contains(., '#StartHere#')]]")
stop = doc.at_xpath("//w:p[.//w:t[contains(., '#StopHere#')]]")
but I need to figure out how to say document[start..stop] to grab everything between them, including the start and stop paragraphs themselves.
This XPath,
//node()[ preceding::w:p[w:r/w:t[.='#StartHere#']]
and following::w:p[w:r/w:t[.=': #StopHere#']]]
will select all nodes between the two paragraphs that contain your marker text.
In Nokogiri: doc.xpath("insert above XPath here")

Extract attributes conditionally and recursively for word document xml using xpath in ruby

I have a Word document XML like this:
<w:document>
<w:body>
<w:p w14:paraId="5B6351BB" w14:textId="0D9644FF" w:rsidR="00432348" w:rsidRDefault="00432348" w:rsidP="00432348">
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:bookmarkStart w:id="27" w:name="_Toc435537885"/>
<w:r>
<w:t>TESTPLAN</w:t>
</w:r>
<w:bookmarkEnd w:id="26"/>
<w:r w:rsidR="00B46E57">
<w:t xml:space="preserve"> – PART I</w:t>
</w:r>
<w:bookmarkEnd w:id="27"/>
</w:p>
</w:body>
</w:document>
I want to extract the text for Heading1, so I wrote the following code, but it does not seem to work.
@doc.xpath('//w:document//w:body//w:p[w:pPr//w:pStyle[@val]="Heading1"]//w:r//w:t')
In place of @val, I have tried @w:val, and in place of "Heading" I have tried 'Heading' for comparison. But it still returns a nil value.
You have mistyped the w:pStyle[@w:val]="Heading1" part. Written that way, the predicate only checks that the attribute exists, and the comparison is made against the string value of the w:pStyle element, which is empty. It should be w:pStyle/@w:val="Heading1". Then it will work.
Correct XPath:
//w:document//w:body//w:p[w:pPr//w:pStyle/@w:val="Heading1"]//w:r//w:t
