I have found a node; now I need to select the sibling text after it.
In my case I need to get the text: 10 January
How do I do this?
Try this:
//foo/following-sibling::text()[1]
(Replace //foo with your current XPath expression.)
With this XML:
<data>
<foo>foo</foo>
bar
<baz>baz</baz>
</data>
it gives bar as output.
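For readers working in Python, here is a minimal sketch of the same selection. It is an illustration, not part of the original answer: the standard-library ElementTree supports only a subset of XPath and has no following-sibling axis, so the sketch reads the text that follows the element from its tail attribute instead of evaluating the expression above.

```python
import xml.etree.ElementTree as ET

# Same sample document as above.
xml = "<data><foo>foo</foo>\nbar\n<baz>baz</baz></data>"
root = ET.fromstring(xml)

# ElementTree stores the text node that directly follows an element
# in that element's .tail attribute (None when there is no such text).
foo = root.find("foo")
following_text = (foo.tail or "").strip()

print(following_text)
```

The strip() call only removes the newlines around the text; drop it if the surrounding whitespace matters.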
Given the following Xml:
<Root><Foo Bar="" Bar_Baz="12" /></Root>
Is there an XPath statement (using version 1.0 functions only) that can return Root/Foo/@Bar where there exists some sibling attribute whose name starts with Bar (determined by context) and ends in _Baz, and where that sibling attribute has the value 12?
Bar should be anonymous - the XPath shouldn't care what it's called - but whatever it is called, if it is returned or not should be determined by whether X_Baz exists, and has the value of 12.
I was looking into something like:
//@*[sibling::@*[concat(local-name(), '_Baz') = '12']
But fairly obviously, this would just compare the text Bar_Baz to 12, not the value of that sibling attribute.
I'm making use of this using the .Net XmlDocument class, meaning I'm limited to Microsoft's XPath 1.0 implementation, so please don't make use of subsequent versions of the spec!
EDIT: Per the comment requesting a more diverse set of examples, see below:
<Root>
<Item Foo="" Foo_Baz="12">Yes - @Foo_Baz is 12, and @Foo exists</Item>
<Item Bar="" Bar_Baz="12">Yes - @Bar_Baz is 12, and @Bar exists</Item>
<Item Foo="" Foo_Baz="1">No - Foo_Baz != 12</Item>
<Item Baz="" Foo_Baz="12">No - No @Foo to return</Item>
<Item Foo_Baz="12">No - No @Foo to return</Item>
<Item Foo="" Foo_Haz="12">No - No @Foo_Baz node to check the value of</Item>
</Root>
Edit 2:
Looking at the first couple of answers proposed, I think there is something I haven't been clear on: the names, Foo or Bar, are unknown. The only things that are known are:
There are one or more attributes with a suffix _Baz that has the value 12
They may have siblings whose entire name is whatever came before the suffix
If they do, then that sibling is the node I want to match, provided the _Baz attribute has the value of 12
Another option :
//item[substring-after(local-name(./@*[last()]),"_")="baz" and ./@*[last()]="12"][local-name(./@*[1])=substring-before(local-name(./@*[last()]),"_")]
Shortest form :
//item[@foo or @bar][@bar_baz="12" or @foo_baz="12"]
EDIT : Massive and horrible XPath here, but it should work. It supports up to 5 attributes per item and regardless the position of these attributes inside each item tag.
//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[1])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[3])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[4])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[1]),"_baz") and @*[1]=12][local-name(@*[5])=substring-before(local-name(@*[1]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[1])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[3])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[4])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[2]),"_baz") and @*[2]=12][local-name(@*[5])=substring-before(local-name(@*[2]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[1])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[3])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[4])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[3]),"_baz") and @*[3]=12][local-name(@*[5])=substring-before(local-name(@*[3]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[1])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[3])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[4])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[4]),"_baz") and @*[4]=12][local-name(@*[5])=substring-before(local-name(@*[4]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[1])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[3])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[4])=substring-before(local-name(@*[5]),"_")]|//item[contains(local-name(@*[5]),"_baz") and @*[5]=12][local-name(@*[5])=substring-before(local-name(@*[5]),"_")]
Working sample (4 nodes selected) :
Strictly in terms of xpath, this expression
//Item[attribute::*[contains(local-name(), '_Baz')]='12'][attribute::*[local-name()='Foo'] | attribute::*[local-name()='Bar']]
should get you your desired output.
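If the name matching can be moved out of XPath and into the host language, the rule from the question becomes short. The following is only a sketch in Python with the standard library (the question itself targets .NET's XmlDocument); the function name matching_attributes and the SUFFIX constant are made up for illustration, and the sample data is the question's example with the element text removed.

```python
import xml.etree.ElementTree as ET

SUFFIX = "_Baz"  # hypothetical constant: the known suffix from the question

xml = """<Root>
<Item Foo="" Foo_Baz="12"/>
<Item Bar="" Bar_Baz="12"/>
<Item Foo="" Foo_Baz="1"/>
<Item Baz="" Foo_Baz="12"/>
<Item Foo_Baz="12"/>
<Item Foo="" Foo_Haz="12"/>
</Root>"""

def matching_attributes(root, suffix=SUFFIX, wanted="12"):
    """Return (element tag, attribute name) pairs for every attribute X
    that has a sibling attribute X + suffix whose value equals wanted."""
    hits = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name.endswith(suffix) and value == wanted:
                base = name[: -len(suffix)]  # whatever came before the suffix
                if base and base in elem.attrib:
                    hits.append((elem.tag, base))
    return hits

result = matching_attributes(ET.fromstring(xml))
print(result)
```

Only the first two Item elements produce a hit, which matches the Yes/No annotations in the question's sample.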
I have this data, and I'm looking for the lowest bid.
<root>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
<current_bid>$5.00</current_bid>
</root>
This is my XPath 1.0 attempt:
//current_bid[not(translate (., '$,.','') > translate(//current_bid, '$,.',''))]
And it works fine (returns only the $1.00 bid) with the data above, but if I change the ordering of the data to let's say this here:
<root>
<current_bid>$5.00</current_bid>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
</root>
Then it gives a wrong output (returns all values).
Shouldn't the order be irrelevant when I use //current_bid, since it queries the whole document?
Also: how would I go about it if I wanted the second-lowest bid?
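XPath 1.0 processes nodes in document order, so there is no way to sort them with pure XPath; sorting can be done with XSLT processing instead. The ordering matters in your expression because translate() needs a string, and converting the node-set //current_bid to a string uses only the string-value of its first node in document order.
The approach below therefore works only if the minimum is in first position.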
Xpath:
'//current_bid[(position()<=last()) and not(translate (., "$,.","") > translate(//current_bid, "$,.",""))]'
Sample:
<root>
<current_bid>$1.00</current_bid>
<current_bid>$5.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$4.00</current_bid>
<current_bid>$3.00</current_bid>
</root>
Testing on command line with xmllint
xmllint --xpath '//current_bid[(position()<=last()) and not(translate (., "$,.","") > translate(//current_bid, "$,.",""))]' test.xml ; echo
Result:
<current_bid>$1.00</current_bid>
If the number of nodes is known in advance perhaps it could be done with nested conditions but would give a very complex XPath expression.
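When the XPath is being driven from a host language anyway, sorting there avoids the ordering problem and also covers the second-lowest bid. This is a minimal sketch in Python using only the standard library; the helper name bid_value is made up for illustration, and it assumes every bid looks like "$1.00" (a dollar sign, optional thousands separators, a decimal point).

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The reordered sample from the question, where the pure XPath attempt fails.
xml = """<root>
<current_bid>$5.00</current_bid>
<current_bid>$1.00</current_bid>
<current_bid>$2.00</current_bid>
<current_bid>$3.00</current_bid>
<current_bid>$4.00</current_bid>
</root>"""

def bid_value(elem):
    """Turn text such as '$1,234.50' into a Decimal for comparison."""
    return Decimal(elem.text.strip().replace("$", "").replace(",", ""))

# Sort the elements by numeric value instead of relying on document order.
bids = sorted(ET.fromstring(xml).iter("current_bid"), key=bid_value)

lowest = bids[0].text
second_lowest = bids[1].text
print(lowest, second_lowest)
```

Unlike the translate() call in the question, this keeps the decimal point, so "$1.50" and "$15.0" are not confused.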
From my xml, I can get this :
<home>
<creditors>
<count>2</count>
</creditors>
</home>
OR even this :
<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>
Which xpath expression can I use to get "<count>2</count>" instead of getting only "2" OR to get "<moreThan>2</moreThan>" instead of getting "2" ?
This XPath,
//creditors/count
will select all count child elements of all creditors elements in the XML document.
Update per OP's request in comments for a single XPath that selects both count and moreThan elements:
This XPath,
//creditors/*[self::count or self::moreThan]
will select all count or moreThan child elements of all creditors elements in the XML document.
Assuming that your xpath expression is OK, you just need to convert the element to string:
doc.xpath("home/creditors/*").to_s
=> "<count>2</count>"
Please check with queries returning more than one element, to make sure that it's desired behaviour.
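The same idea, select the element and then serialize it rather than read its text, can be sketched in Python with the standard library. This is an illustration under the assumption that the document is shaped like the samples in the question; it is not the API used in the line above.

```python
import xml.etree.ElementTree as ET

xml = "<home><creditors><count>2</count></creditors></home>"
home = ET.fromstring(xml)

# "creditors/*" selects every child element of creditors, whatever its name,
# so the same code handles both <count> and <moreThan>.
serialized = [
    ET.tostring(child, encoding="unicode")
    for child in home.findall("creditors/*")
]
print(serialized)
```

The result is a list, so a query that matches several elements yields one serialized string per element.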
Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set will first be converted to its string-value. If no arguments are passed to normalize-space(), it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
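The two steps above (take the string-value, then normalize the whitespace) can also be sketched outside XPath. The following Python example uses only the standard library and is an approximation rather than an XPath evaluation: itertext() yields the descendant text nodes in document order, and str.split() collapses any Unicode whitespace, which is slightly broader than the space, tab, CR and LF handled by normalize-space().

```python
import xml.etree.ElementTree as ET

xml = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
div = ET.fromstring(xml)

# Equivalent of string(): concatenate all descendant text nodes in document order.
string_value = "".join(div.itertext())

# Approximation of normalize-space(): trim the ends and collapse whitespace runs.
normalized = " ".join(string_value.split())

print(repr(string_value))
print(normalized)
```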
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall() # every text node under the div
final_txt = ''.join(all_txt).strip() # join the pieces and trim the outer newlines
print(final_txt) # This is an example bolded text
How about this :
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmmss I am not sure about the last part though. You might have to play with that.
The usual expression is
//div[@id='theNode']
to get all the text, but if the text comes back split into pieces, then try
//div[@id='theNode']/text()
Not sure but if you provide me the link I will try
I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Plain and simple, I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours, with a p element added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
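For comparison, here is a minimal Python sketch of the same filter using only the standard library. It is an illustration, not an XPath evaluation, and the helper name own_text is made up. It looks only at the text that belongs directly to each top-level p (the element's own text plus the tails of its children), which is the part that text() selects in /document/p[text()[contains(., 'text')]]. One difference to keep in mind: the pieces are concatenated before searching, so a word split by a child element would still match here.

```python
import xml.etree.ElementTree as ET

xml = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>"""

def own_text(p):
    """Text directly inside p, skipping the content of child elements."""
    parts = [p.text or ""]                            # text before the first child
    parts.extend(child.tail or "" for child in p)     # text after each child
    return "".join(parts)

document = ET.fromstring(xml)

# findall("p") looks at direct children of <document> only,
# so p elements nested inside a footnote are not considered.
matches = [p.get("n") for p in document.findall("p") if "text" in own_text(p)]
print(matches)
```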