Get all elements by partial match of class attribute - ruby

I'm trying to use Nokogiri to display results from a URL (essentially scraping it).
I have some HTML which is similar to:
<p class="mattFacer">Matty</p>
<p class="mattSmith">Matthew</p>
<p class="suzieSmith">Suzie</p>
So I need to find all the elements whose class begins with the word "matt". I need to save both the value of the element and the element's opening tag so I can reference it next time, so I need to capture
"Matty" and "<p class='mattFacer'>"
"Matthew" and "<p class='mattSmith'>"
I haven't worked out how to capture the element HTML, but here's what I have so far for the element text (it doesn't work!)
doc = Nokogiri::HTML(open(url))
tmp = ""
doc.xpath("[class*=matt").each do |item|
tmp += item.text
end
#testy2 = tmp

This should get you started:
doc.xpath('//p[starts-with(@class, "matt")]').each do |el|
p [el.attributes['class'].value, el.children[0].text]
end
["mattFacer", "Matty"]
["mattSmith", "Matthew"]
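For anyone who wants to try the starts-with() predicate without fetching a page, here is a minimal self-contained sketch. It assumes Ruby's bundled REXML library instead of Nokogiri (only so that it runs without installing a gem), the sample markup is the three paragraphs from the question inside an assumed <html> wrapper, and prefixed_paragraphs is a made-up helper name. With Nokogiri, the same XPath string should work unchanged in doc.xpath.

```ruby
require 'rexml/document'

# Sample markup from the question; the <html> wrapper is an assumption.
SAMPLE_HTML = <<~HTML
  <html>
    <p class="mattFacer">Matty</p>
    <p class="mattSmith">Matthew</p>
    <p class="suzieSmith">Suzie</p>
  </html>
HTML

# Returns [class, opening tag, text] for every <p> whose class attribute
# begins with the given prefix. Hypothetical helper; the prefix is
# interpolated into the XPath, so it must not contain a single quote.
def prefixed_paragraphs(markup, prefix)
  doc = REXML::Document.new(markup)
  xpath = "//p[starts-with(@class, '#{prefix}')]"
  REXML::XPath.match(doc, xpath).map do |el|
    klass = el.attributes['class']
    [klass, "<#{el.name} class='#{klass}'>", el.text]
  end
end

prefixed_paragraphs(SAMPLE_HTML, 'matt').each do |_klass, tag, text|
  puts "#{text} and #{tag}"
end
```

This keeps both pieces the question asks for (the text and a rebuilt opening tag) in one array per match, so they can be stored and looked up later.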

Use:
/*/p[starts-with(@class, 'matt')] | /*/p[starts-with(@class, 'matt')]/text()
This selects every p element that is a child of the top element of the XML document and whose class attribute value starts with "matt", together with every text-node child of each such p element.
When evaluated against this XML document (none was provided!):
<html>
<p class="mattFacer">Matty</p>
<p class="mattSmith">Matthew</p>
<p class="suzieSmith">Suzie</p>
</html>
the following nodes are selected (each on a separate line) and can be accessed by position:
<p class="mattFacer">Matty</p>
Matty
<p class="mattSmith">Matthew</p>
Matthew
Here is a quick XSLT verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select=
"/*/p[starts-with(@class, 'matt')]
|
/*/p[starts-with(@class, 'matt')]/text()
">
<xsl:copy-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result of this transformation, when applied on the same XML document (above) is the expected, correct sequence of selected nodes:
<p class="mattFacer">Matty</p>
Matty
<p class="mattSmith">Matthew</p>
Matthew

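require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open(url))
# *= matches "matt" anywhere in the class value; use ^= to match only at the start.
items = doc.css("p[class*=matt]").map(&:text).join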

The accepted answer is great, but another approach would be to use Nikkou, which lets you match via regular expressions (without needing to be familiar with XPath functions):
doc.attr_matches('class', /^matt/).collect do |item|
[item.attributes['class'].value, item.text]
end

Related

extract tags not having sub elements

I have a web page which is something like this
<p> Content </p>
...........
..........
<p> Other content
<b> Use link <b>
<h3> some text <h3>
</p>
...........
........... and some other elements starting with <p> tag having
different sub-elements inside it
What I want to do is extract the text of only those <p> tags which don't have any sub-elements.
The proposed solution is correct when using XSLT. As you tagged this just with xpath, here is the XPath version:
//p[count(*) = 0]/text()
Checking whether any child element exists is often faster than counting them:
//p[not(*)]/text()
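A quick way to check the not(*) predicate is a small Ruby sketch like the one below. It assumes the bundled REXML library, and the sample markup (including the <body> wrapper) and the helper name leaf_paragraph_texts are invented for illustration. It selects the matching <p> elements and reads their text in Ruby, which gives the same strings as appending /text() here because each match has a single text node.

```ruby
require 'rexml/document'

# Hypothetical page fragment modelled on the question; the <body> wrapper
# and the third paragraph are assumptions.
PAGE = <<~XML
  <body>
    <p>Content</p>
    <p>Other content <b>Use link</b></p>
    <p>More content</p>
  </body>
XML

# Returns the trimmed text of every <p> that has no child elements.
def leaf_paragraph_texts(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//p[not(*)]').map { |para| para.text.to_s.strip }
end

p leaf_paragraph_texts(PAGE)
```

The second paragraph is skipped because it contains a <b> child, even though it also has text of its own.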
You can use the template
<xsl:template match="p">
<xsl:if test="count(*) = 0">
<xsl:value-of select="."/>
</xsl:if>
</xsl:template>
This will print the value of a p tag only if it has no other tags inside it.

In XPath I want an expression to select everything under div except the a nodes. The tree structure must remain the same

How to exclude specific descendants of a node? In this direction, the expression *[not(self::nodetag)] seems to discriminate only at the child level of the node, accepting all other descendants in the returned node set. I want an expression that selects everything under div except the a nodes; see the example below. The tree structure must remain the same.
The approach posted by @Dimitri Novatchev seems to be right, but not for the HAP (Html Agility Pack) implementation:
Using this example document:
<div>
<span>
<a>lala</a>
</span>
</div>
HAP returns the following structure for his suggested expression /div/descendant::node()[not(self::a)]:
<div>
<span>
<a>lala</a>
</span>
</div>
<span>
<a>lala</a>
</span>
If another tag other than a were nested in span, it would also be returned as a separate tree. Does anyone know about this strange behavior? Is it a HAP bug?
Thanks
I want an expression that selects everything under div except the a
nodes. The tree structure must remain the same.
Use:
/div/descendant::node()[not(self::a)]
This selects any descendant of the top element div that (the descendant) is not an a.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/div/descendant::node()[not(self::a)]">
<xsl:value-of select="concat('&#xA;', position(), '. &quot;')"/>
<xsl:copy-of select="."/>"
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div>
<span>
<a>lala</a>
</span>
</div>
the XPath expression is evaluated and all selected nodes are output, numbered and quoted to make them clearly visible:
1. "
"
2. "<span>
<a>lala</a>
</span>"
3. "
"
4. "lala"
5. "
"
6. "
"
As we can see, 6 nodes are selected -- one span element, four whitespace-only text nodes and one non-whitespace-only text node -- and none of them is an a.
Update:
In a comment the OP has clarified that he actually wants the XML document to be transformed into another, in which any a descendant of a div is omitted.
Here is one such transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
When this transformation is applied on the same XML document (above), what I guess is the wanted result is produced:
<div>
<span/>
</div>
If we want to produce only the descendants of any div that has an a descendant, then we need almost the same transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div[.//a]"><xsl:apply-templates/></xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
The result of this applied to the same XML document as above is:
<span/>
@Devela: you are confusing the set of nodes selected by the XPath expression with the way that they are then displayed by the application that issued the request. It's quite common for an application to display a node by showing the whole subtree rooted at that node. So if your query is //div, and one of the selected div elements contains an <a> node as a descendant, the results will be shown including that <a> element. You can't change that by changing the XPath expression, because the XPath expression didn't select the <a> element; you can only change it by changing the way the results are displayed.
Now, if you want to display a <div> element that is like the <div> element in your source except that the <a> is omitted, then you are outside the scope of what XPath can do. XPath can only choose a subset of the nodes in your input tree, it can't create a modified tree. For that, you need XSLT or XQuery.
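If the host language exposes a mutable DOM, another option is to let XPath select the unwanted nodes and have the host code remove them. The following is only a sketch under assumptions: it uses Ruby's bundled REXML rather than HAP or XSLT, the document is the one from the question written on a single line, and without_links is a hypothetical helper name.

```ruby
require 'rexml/document'

# The example document from the question, on one line so that no
# whitespace-only text nodes are involved.
SOURCE = '<div><span><a>lala</a></span></div>'

# Parses the markup and returns the document with every <a> below a <div>
# removed. Hypothetical helper name.
def without_links(markup)
  doc = REXML::Document.new(markup)
  # XPath only selects the unwanted nodes; the removal itself is done by
  # the host language, because XPath cannot build a modified tree.
  REXML::XPath.match(doc, '//div//a').each(&:remove)
  doc
end

puts without_links(SOURCE).root.to_s
```

The rest of the tree keeps its shape: the span stays where it was, it just no longer has the a child.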

find next-to-last node with xpath

I have an XML document with chapters and nested sections.
I am trying to find, for any section, the first second-level section ancestor.
That is the next-to-last section in the ancestor-or-self axis.
pseudo-code:
<chapter><title>mychapter</title>
<section><title>first</title>
<section><title>second</title>
<more/><stuff/>
</section>
</section>
</chapter>
my selector:
<xsl:apply-templates
select="ancestor-or-self::section[last()-1]" mode="title.markup" />
Of course, that works only as long as last()-1 is defined; it fails when the current node is the first section.
If the current node is the second section or below it, I want the title second.
Otherwise I want the title first.
Replace your xpath with this:
ancestor-or-self::section[position()=last()-1 or count(ancestor::section)=0][1]
Since you can already find the right node in all cases except one, I updated your xpath to also find the first section (or count(ancestor::section)=0), and then select ([1]) the first match (in reverse document order, since we are using the ancestor-or-self axis).
Here is a shorter and more efficient solution:
(ancestor-or-self::section[position() > last() -2])[last()]
This selects the last, in document order, of the (at most) two topmost ancestors named section. If there is only one such ancestor, it is itself the last.
Here is a complete transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="section">
<xsl:value-of select="title"/>
<xsl:text> --> </xsl:text>
<xsl:value-of select=
"(ancestor-or-self::section[position() > last() -2])[last()]/title"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied on the following document (based on the provided one, with more nested section elements added):
<chapter>
<title>mychapter</title>
<section>
<title>first</title>
<section>
<title>second</title>
<more/>
<stuff/>
<section>
<title>third</title>
</section>
</section>
</section>
</chapter>
the correct results are produced:
first --> first
second --> second
third --> second
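The selection logic of the shorter expression can also be sketched outside XSLT. The version below is plain Ruby over a REXML tree and is not a translation of the XPath evaluation itself: it walks up the parents, keeps the section elements with the outermost first, and then takes the last of the first two. The helper names anchor_section and anchor_titles are made up for this example.

```ruby
require 'rexml/document'

# The test document from the answer above.
BOOK = <<~XML
  <chapter>
    <title>mychapter</title>
    <section>
      <title>first</title>
      <section>
        <title>second</title>
        <more/>
        <stuff/>
        <section>
          <title>third</title>
        </section>
      </section>
    </section>
  </chapter>
XML

# Returns the second-level <section> at or above the given section, or the
# first-level one when there is no second level. Hypothetical helper name.
def anchor_section(section)
  chain = []
  node = section
  while node
    chain.unshift(node) if node.name == 'section' # outermost ends up first
    node = node.parent
  end
  chain.first(2).last
end

# Pairs each section title with the title of its anchor section.
def anchor_titles(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//section').map do |section|
    [section.elements['title'].text,
     anchor_section(section).elements['title'].text]
  end
end

anchor_titles(BOOK).each { |title, anchor| puts "#{title} --> #{anchor}" }
```

Taking first(2).last plays the role of the [position() > last() - 2] filter followed by [last()]: at most two outermost sections survive, and the deeper of them wins.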

Return full text element (including child/descendant elements)

I'm trying to get the text from the first occurrence on the page of div/p, and only the first p. The <p> contains other tags (<b>, <a href>) and the returned text from <p> stops at any other tag. Is there a way to get this line to return all the text between <p> and </p>, even between embedded tags?
puts doc.xpath('html/body/div/p[1]/text()').first
Use:
string((//div/p)[1])
When this XPath expression is evaluated the result is the string value of the first p in the document that is a child of a div.
By definition the string value of an element is the concatenation (in document order) of all of its text-node descendants.
Therefore, you get exactly all the text in the subtree rooted by this p element, with any other nodes (elements, comments, PIs) skipped.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="string(p)"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document (none was provided!):
<p>
Hello <b>
XML
World!</b>
</p>
the result of the evaluated XPath expression is output:
Hello XML
World!
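The string-value rule described above is easy to spell out by hand. The sketch below assumes Ruby's bundled REXML library and an invented sample document; string_value and first_paragraph_text are hypothetical helper names. It concatenates the text nodes below the first div/p in document order and ignores every other node type.

```ruby
require 'rexml/document'

# Hypothetical markup: a paragraph with inline child elements.
SNIPPET = '<div><p>Hello <b>XML</b> and <a href="#">more</a>!</p><p>second</p></div>'

# Concatenates, in document order, every text node at or below the given
# node. Comments and processing instructions contribute nothing.
def string_value(node)
  case node
  when REXML::Text then node.value
  when REXML::Element then node.children.map { |child| string_value(child) }.join
  else ''
  end
end

# Returns the full text of the first <p> that is a child of a <div>.
def first_paragraph_text(markup)
  doc = REXML::Document.new(markup)
  string_value(REXML::XPath.first(doc, '//div/p'))
end

puts first_paragraph_text(SNIPPET)
```

Reading only the first text node of that paragraph would stop at "Hello ", which is the behaviour the question describes for p[1]/text().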
As an alternative to more XPath, Nokogiri offers Nokogiri::XML::Node#inner_text (at_xpath returns only the first match):
puts doc.at_xpath('html/body/div/p[1]').inner_text

XPath "following siblings before"

I'm trying to select elements (a) with XPath 1.0 (or possibly with a regex) that are following siblings of a particular element (b) but that precede the next b element.
<img><b>First</b><br>
<img> First Href - 19:30<br>
<img><b>Second</b><br>
<img> Second Href - 19:30<br>
<img> Third Href - 19:30<br>
I tried to make the sample as close to real world as possible. So in this scenario when I'm at element
<b>First</b>
I need to select
First Href
and when I'm at
<b>Second</b>
I need to select
Second Href
Third Href
Any idea how to achieve that? Thank you!
Dynamically create this XPath:
following-sibling::a[preceding-sibling::b[1][.='xxxx']]
where 'xxxx' is replaced with the text of the current <b>.
This is assuming that all the elements actually are siblings. If they are not, you can try to work with the preceding and following axes, or write a more specific XPath that better resembles the document structure.
In XSLT you could also use:
following-sibling::a[
generate-id(preceding-sibling::b[1]) = generate-id(current())
]
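When the XPath is being driven from a host language anyway, the same grouping can be done with a plain sibling walk. This is a sketch under assumptions, not the XPath above: it uses Ruby's bundled REXML, the markup is modelled on the question with an assumed <t> wrapper and made-up href values, and links_after and links_by_heading are hypothetical helper names.

```ruby
require 'rexml/document'

# Hypothetical markup modelled on the question; the <t> wrapper and the
# href values are assumptions.
LISTING = <<~XML
  <t>
    <img/><b>First</b><br/>
    <img/><a href="#1">First Href</a> - 19:30<br/>
    <img/><b>Second</b><br/>
    <img/><a href="#2">Second Href</a> - 19:30<br/>
    <img/><a href="#3">Third Href</a> - 19:30<br/>
  </t>
XML

# Collects the <a> siblings that follow the given <b>, stopping at the
# next <b> sibling (or at the end of the parent).
def links_after(bold)
  links = []
  node = bold.next_element
  while node && node.name != 'b'
    links << node if node.name == 'a'
    node = node.next_element
  end
  links
end

# Maps the text of each <b> to the texts of the links that belong to it.
def links_by_heading(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '/t/b').to_h do |bold|
    [bold.text, links_after(bold).map(&:text)]
  end
end

p links_by_heading(LISTING)
```

Like the XPath, this only works if the b and a elements really are siblings; nested markup would need a different walk.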
Here is a solution which is just a single XPath expression.
Using the Kaysian formula for intersection of two nodesets $ns1 and $ns2:
$ns1[count(. | $ns2) = count($ns2)]
We simply substitute $ns1 with the nodeset of <a> siblings that follow the current <b> node, and we substitute $ns2 with the nodeset of <a> siblings that precede the next <b> node.
Here is a complete transformation that uses this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="*/b"/>
</xsl:template>
<xsl:template match="b">
At: <xsl:value-of select="."/>
<xsl:variable name="vNextB" select="following-sibling::b[1]"/>
<xsl:variable name="vA-sAfterCurrentB" select="following-sibling::a"/>
<xsl:variable name="vA-sBeforeNextB" select=
"$vNextB/preceding-sibling::a
|
$vA-sAfterCurrentB[not($vNextB)]
"/>
<xsl:copy-of select=
"$vA-sAfterCurrentB
[count(.| $vA-sBeforeNextB)
=
count($vA-sBeforeNextB)
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<t>
<img/>
<b>First</b>
<br />  
<img/>  
First Href - 19:30
<br />
<img/>
<b>Second</b>
<br />
<img/>  
Second Href - 19:30
<br />
<img/> 
Third Href - 19:30
<br />
</t>
the correct result is produced:
At: First First Href
At: Second Second Href
Third Href
