XPath: select following nodes until a given node

In XPath I need to select the <p> nodes that follow each <h2>DATA</h2> node, up to the next <h2>. So given a structure like:
<div class="box">
<h2>NO</h2>
<p>B:<span> Y</span></p>
<h2>DATA</h2>
<p>AA:<span> CONTENT</span></p>
<p>AA:<span> MORE</span></p>
<h2>NO</h2>
<p>C:<span> Z</span></p>
<h2>DATA</h2>
<p>BB:<span> CONTENT</span></p>
<p>BB:<span> MORE</span></p>
</div>
it should select:
<p>AA:<span> CONTENT</span></p>
<p>AA:<span> MORE</span></p>
<p>BB:<span> CONTENT</span></p>
<p>BB:<span> MORE</span></p>

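How about this? For each p, preceding-sibling::h2[1] is the nearest h2 before it among its siblings, and the [.="DATA"] predicate keeps the p only when that heading's text is DATA: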
p[preceding-sibling::h2[1][.="DATA"]]
My Python test for checking the XPath I provided:
>>> from lxml import etree
>>> doc = etree.XML("""<div class="box">
... <h2>NO</h2>
... <p>B:<span> Y</span></p>
... <h2>DATA</h2>
... <p>AA:<span> CONTENT</span></p>
... <p>AA:<span> MORE</span></p>
... <h2>NO</h2>
... <p>C:<span> Z</span></p>
... <h2>DATA</h2>
... <p>BB:<span> CONTENT</span></p>
... <p>BB:<span> MORE</span></p>
... </div>""")
>>> doc.xpath('p[preceding-sibling::h2[1][.="DATA"]]')
[<Element p at 252ef70>, <Element p at 252efc8>, <Element p at 2542050>, <Element p at 25420a8>]
>>> doc.xpath('p[preceding-sibling::h2[1][.="DATA"]]/text()')
['AA:', 'AA:', 'BB:', 'BB:']
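The same selection can be sketched without XPath axes, which may help where lxml is not available: the standard library's xml.etree.ElementTree supports only a limited XPath subset with no preceding-sibling axis. This swaps the XPath for a plain walk over the children; the helper name paragraphs_under and its parameters are made up for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="box">
<h2>NO</h2>
<p>B:<span> Y</span></p>
<h2>DATA</h2>
<p>AA:<span> CONTENT</span></p>
<p>AA:<span> MORE</span></p>
<h2>NO</h2>
<p>C:<span> Z</span></p>
<h2>DATA</h2>
<p>BB:<span> CONTENT</span></p>
<p>BB:<span> MORE</span></p>
</div>"""


def paragraphs_under(root, heading_text, heading_tag="h2", target_tag="p"):
    """Return target_tag children whose nearest preceding heading has heading_text.

    Hypothetical helper: it mimics p[preceding-sibling::h2[1][.="DATA"]]
    by remembering whether the most recent heading matched.
    """
    selected = []
    collecting = False  # True while the latest heading seen has the wanted text
    for child in root:
        if child.tag == heading_tag:
            # Compare the full string value, like "." does in XPath
            collecting = "".join(child.itertext()) == heading_text
        elif child.tag == target_tag and collecting:
            selected.append(child)
    return selected


box = ET.fromstring(HTML)
# itertext() joins the paragraph's own text with the text of its <span>
texts = ["".join(p.itertext()) for p in paragraphs_under(box, "DATA")]
print(texts)
```

Unlike the /text() query above, this collects the full text of each paragraph, including the <span> content.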

Related

Element UI table: get filtered data or indices of the original data

I would like to access the filtered data of an Element UI Table.
https://element.eleme.io/#/en-US/component/table
Suppose the full data of the table looks like this:
$index: {value}
[
0: {A},
1: {B},
2: {C},
3: {D}
]
Now suppose I set a filter via filter-method on a column and the filtered dataset only leaves behind values B and D.
The table now looks like this:
[
0: {B},
1: {D}
]
whereas I would like to have the original indices, or access the leftover data that the table shows:
[
1: {B},
3: {D}
]
How can I do this, does anyone have an idea?
I basically want to color my cells via cell-class-name, but because of this behaviour the cells are not colored correctly while the data is filtered. If I could access the remaining data OR the original indices, that would solve my problem.
Thank you!
<html>
<head>
<script src="https://unpkg.com/vue@2/dist/vue.js"></script>
<script src="https://unpkg.com/element-ui/lib/index.js"></script>
<link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<div id="app">
<template>
<div>
<el-table
:data="data"
style="width: 100%;"
height="600"
>
<el-table-column
label="Index"
width="60">
<template v-slot="scope">
<div class="data">
{{ scope.$index }}
</div>
</template>
</el-table-column>
<el-table-column
label="Data"
:filters="[{text: 'A', value: 'A'}, {text: 'B', value:'B'}, {text: 'C', value: 'C'}, {text: 'D', value:'D'}]"
:filter-method="filterHandler"
width="100">
<template v-slot="scope">
<div class="data">
{{ scope.row.value }}
</div>
</template>
</el-table-column>
</el-table>
</div>
</template>
</div>
<script>
new Vue({
el: '#app', //Tells Vue to render in HTML element with id "app"
data() {
return {
data: [{value: "A"},{value: "B"},{value: "C"},{value: "D"}]
}
},
methods: {
filterHandler(value, row, column) {
return row.value === value;
}
}
});
</script>
</html>

Auto generate XPath for known element in HTML tree using python

Is there any way (using a library, not manually) to generate a relative XPath for a known element in HTML?
Let's say the second <p> element inside the element with class="content":
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>****</p>
<p>****</p>
</div>
</body>
</html>
Use case:
The idea is to guess where the elements I might be interested in are, for example title, content, or author. Once I've found the element, I want to generate an XPath for it and use it later in Python 3.
Try something like this:
from lxml import etree
datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""
root = etree.fromstring(datum)
tree = etree.ElementTree(root)
find_text = etree.XPath("//p[text()='target']")
for target in find_text(root):
    print(tree.getpath(target))
Output:
/html/body/div[2]/p[2]
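To make it clearer what a positional path like this encodes, here is a rough standard-library sketch that builds one by hand. It is not how lxml implements getpath; absolute_xpath is a made-up helper, and it assumes the markup is well-formed enough for xml.etree.ElementTree to parse.

```python
import xml.etree.ElementTree as ET

DATUM = """<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>"""


def absolute_xpath(root, element):
    """Build a positional XPath from root down to element (hypothetical helper)."""
    # ElementTree elements do not know their parent, so build a lookup first
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = element
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        step = node.tag
        if len(same_tag) > 1:
            # XPath positions are 1-based and count only siblings with the same tag
            position = next(i for i, c in enumerate(same_tag, start=1) if c is node)
            step += "[%d]" % position
        steps.append(step)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(DATUM)
target = next(p for p in root.iter("p") if p.text == "target")
path = absolute_xpath(root, target)
print(path)
```

For this input the sketch produces the same path as the getpath call above. Positional paths are brittle if the page layout changes, so they suit stable documents best.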

Use nokogiri with xpath

How can I use Nokogiri to fetch an image via XPath? My main problem is that the div may exist without containing an image:
image_node = @get_doc.xpath( '//*[@id="recaptcha_image"]/img/@src').map {|a| a.value }
#binding.pry
if image_node != nil
rec = Net::HTTP.get( URI.parse( "#{image_node['src']}" ) )
end
but I get
in `[]': can't convert String into Integer (TypeError)
What is the correct way to do this?
Part of the HTML:
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img *****>
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field"
style="width: 295px">
I recommend CSS over XPath for most HTML queries, and many XML ones. Using CSS makes this very "visible":
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
doc.at('#recaptcha_widget img')['src'] # => "path_to_image.jpg"
How do I check whether I have the div but no image?
How do you check if you didn't have the embedded <img> tag inside the <div>? Break your lookup into two parts, and check for a nil:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<div id="recaptcha_image2">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
img = doc.at('#recaptcha_widget img')
img_src = img['src'] # => "path_to_image.jpg"
If the <img> tag doesn't exist you'll get nil:
img = doc.at('#recaptcha_widget2 img') # => nil
From that point you'd continue with a check to see if img was set:
if (img)
# ...do something...
end
Or, use a trailing rescue to catch the NoMethodError raised when [] is called on nil, so that nil is assigned to img_src, then test for it:
img_src = doc.at('#recaptcha_widget img')['src'] rescue nil # => "path_to_image.jpg"
img_src = doc.at('#recaptcha_widget2 img')['src'] rescue nil # => nil
if (img_src)
# do something
end

Use XPath to group siblings from an HTML/XML document?

I want to transform an HTML or XML document by grouping previously ungrouped sibling nodes.
For example, I want to take the following fragment:
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
Into this:
<section>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</section>
<section>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</section>
Is this possible using simple XPath selectors and an XML parser like Nokogiri? Or do I need to implement a SAX parser for this task?
Updated Answer
Here's a general solution that creates a hierarchy of <section> elements based on header levels and their following siblings:
class Nokogiri::XML::Node
  # Create a hierarchy on a document based on heading levels
  #   wrap   : e.g. "<section>" or "<div class='section'>"
  #   stops  : array of tag names that stop all sections; use nil for none
  #   levels : array of tag names that control nesting, in order
  def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
    levels = Hash[ levels.zip(0...levels.length) ]
    stops = stops && Hash[ stops.product([true]) ]
    stack = []
    children.each do |node|
      unless level = levels[node.name]
        level = stops && stops[node.name] && -1
      end
      stack.pop while (top=stack.last) && top[:level]>=level if level
      stack.last[:section].add_child(node) if stack.last
      if level && level >=0
        section = Nokogiri::XML.fragment(wrap).children[0]
        node.replace(section); section << node
        stack << { :section=>section, :level=>level }
      end
    end
  end
end
Here is this code in use, and the result it gives.
The original HTML
<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>
<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>
<hr /><p id="footer">Copyright © All Done</p>
</body>
The conversion code
# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html,&:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section # make the magic happen
puts doc.to_xhtml # show the result with indentation
The result
<body>
  <section>
    <h1>Main Section 1</h1>
    <p>Intro</p>
    <section>
      <h2>Subhead 1.1</h2>
      <p>Meat</p>
      <p>MOAR MEAT</p>
    </section>
    <section>
      <h2>Subhead 1.2</h2>
      <p>Meat</p>
      <section>
        <h3>Caveats</h3>
        <p>FYI</p>
        <section>
          <h4>ProTip</h4>
          <p>Get it done</p>
        </section>
      </section>
    </section>
    <section>
      <h2>Subhead 1.3</h2>
      <p>Meat</p>
    </section>
  </section>
  <section>
    <h1>Main Section 2</h1>
    <section>
      <h3>Jumpin' in it!</h3>
      <p>Level skip!</p>
    </section>
    <section>
      <h2>Subhead 2.1</h2>
      <p>Back up...</p>
      <section>
        <h4>Dive! Dive!</h4>
        <p>...and down</p>
      </section>
    </section>
  </section>
  <hr />
  <p id="footer">Copyright All Done</p>
</body>
Original Answer
Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
<hr>
<p id='footer'>All done!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
  # Provide a block that returns:
  #   true  - for nodes that should start a new section
  #   false - for nodes that should not start a new section
  #   :stop - for nodes that should stop any current section but not start a new one
  def group_under(name="section")
    group = nil
    element_children.each do |child|
      case yield(child)
      when false, nil
        group << child if group
      when :stop
        group = nil
      else
        group = document.create_element(name)
        child.replace(group)
        group << child
      end
    end
  end
end
doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
  if node.name == 'hr'
    :stop
  else
    %w[h1 h2 h3 h4 h5 h6].include?(node.name)
  end
end
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>
For XPath, see XPath : select all following siblings until another sibling
One way using xpath is to select all the p elements that follow your h2 and from them subtract the p elements that also follow the next h2:
doc = Nokogiri::HTML.fragment(html)
doc.css('h2').each do |h2|
  nodeset = h2.xpath('./following-sibling::p')
  next_h2 = h2.at('./following-sibling::h2')
  nodeset -= next_h2.xpath('./following-sibling::p') if next_h2
  section_tag = h2.add_previous_sibling Nokogiri::XML::Node.new('section',doc)
  h2.parent = section_tag
  nodeset.each {|n| n.parent = section_tag}
end
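For readers working in Python rather than Ruby, the grouping idea can be sketched with the standard library alone. This is not a port of the Nokogiri code: instead of subtracting node sets it walks the children once and moves each one into the most recent wrapper. The helper name group_under_sections and the wrapping <body> element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<body>
<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</body>"""


def group_under_sections(parent, start_tag="h2", wrapper_tag="section"):
    """Wrap each start_tag child and the siblings after it in a wrapper_tag element."""
    section = None
    for child in list(parent):  # iterate over a copy: parent changes as we go
        if child.tag == start_tag:
            section = ET.Element(wrapper_tag)
            # Place the new wrapper where the heading currently sits
            parent.insert(list(parent).index(child), section)
        if section is not None:
            # Children seen before the first heading stay where they are
            parent.remove(child)
            section.append(child)


body = ET.fromstring(FRAGMENT)
group_under_sections(body)
grouped = [[child.tag for child in section] for section in body]
print(grouped)
print(ET.tostring(body, encoding="unicode"))
```

Each heading starts a new wrapper, so a later heading closes the previous group implicitly. Nested levels, as in the auto_section answer above, would need a stack instead of a single current section.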
XPath can only select things from your input document; it can't transform it into a new document. For that you need XSLT or some other transformation language. If you're using Nokogiri then the previous answers will be useful, but for completeness, here's what it looks like in XSLT 2.0:
<xsl:for-each-group select="*" group-starting-with="h2">
  <section>
    <xsl:copy-of select="current-group()"/>
  </section>
</xsl:for-each-group>

Extract data from a div that has no class using XPath

Code
<div id="content">
<div class="sample">sample text</div>
<div class="datebar">
<span style="float:right">some text1</span>
<b>some text2</b>
</div>
<p>paragraph 1</p>
<p>paragraph 2</p>
</div>
I want to get the data that is in the <p> tags, or in other words, the data that comes after <div class="datebar">.
//div[@id="content"]/p/text()
Would achieve what you're asking for with your provided sample.
Update
If you only wanted those <p> that come after <div class="datebar">, the following should work:
//div[@id = 'content']/p[preceding-sibling::div[@class='datebar']]/text()
Another Update - For Kirill
Here's a sample of HTML which has an extra <p> before <div class="datebar"> and xpath expressions tested using python.
Obviously, the solution depends on what the full input HTML is and what the OP wants to extract, neither of which are clear at the moment.
>>> from lxml import etree
>>> doc = etree.HTML("""
... <div id="content">
... <div class="sample">sample text</div>
... <p>paragraph 1</p>
... <div class="datebar">
... <span style="float:right">some text1</span>
... <b>some text2</b>
... </div>
... <p>paragraph 2</p>
... <p>paragraph 3</p>
... </div>""")
>>> # My first suggestion
... doc.xpath("//div[@id='content']/p/text()")
['paragraph 1', 'paragraph 2', 'paragraph 3']
>>> # Kirill's solution
... doc.xpath("//div[@id = 'content' and div[@class = 'datebar']]/p/text()")
['paragraph 1', 'paragraph 2', 'paragraph 3']
>>> # My response to Kirill
... doc.xpath("//div[@id = 'content']/p[preceding-sibling::div[@class='datebar']]/text()")
['paragraph 2', 'paragraph 3']
Kirill's expression //div[@id = 'content' and div[@class = 'datebar']]/p/text() does not select
"only those p which parent div has @id = 'content' and have preceding div with @class = 'datebar'",
as stated in his comments.
//div[@id = 'content' and div[@class = 'datebar']]/p/text()
