How to get elements between tags with XPath

I need to get each subtitle of an article and its text. Each subheading is inside an <h3>, so I need to get everything between the first <h3> and the second. Then I will do the same between the second and third, and so on until I finish.
The structure is similar to this:
<article>
<p> introduction </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>
Currently I can get to the first subtitle like this: //h3[1]
But how can I get everything between the first and the second ???

This XPath expression gets the nodes between //h3[1] and //h3[2], inclusive:
//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]
Result on browser console
$x('//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]')
Array(4) [ h3, p, div, h3]
0: <h3>​
1: <p>​
2: <div>​
3: <h3>
length: 4
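If the end goal is one (subtitle, text) pair per heading, it may be simpler to walk the children of the article once instead of building one positional XPath per pair of headings. This is only a minimal sketch using Python's standard xml.etree.ElementTree, not the XPath approach above; the function name sections_by_heading is made up for illustration, and it assumes the markup parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<article>
<p> introduction </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>"""


def sections_by_heading(article, heading_tag="h3"):
    """Group the siblings that follow each heading under that heading.

    Hypothetical helper: returns a list of (heading text, [sibling texts]).
    Anything before the first heading (the introduction) is skipped.
    """
    sections = []
    for child in article:
        # Collapse runs of whitespace in the element's full text.
        text = " ".join("".join(child.itertext()).split())
        if child.tag == heading_tag:
            sections.append((text, []))
        elif sections:
            sections[-1][1].append(text)
    return sections


sections = sections_by_heading(ET.fromstring(ARTICLE))
for title, body in sections:
    print(title, "->", body)
```

Each heading ends up with the text of the p and div that follow it, up to the next heading.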

Related

How to exclude from a contains query all the informations from a child class & after some sibling text?

<root>
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
<e></e>
<a></a>
<a></a>
</root>
In an XML document, how can I exclude from a contains() search all the information from nodes after <d>?
I want to get results only from:
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
I can't just take the first 2 matches for <a> and the first for <b> and <c>, because sometimes a value will exist only after the <d>.
I have this code that is working:
//div[contains(@class,'class searched')]/*[contains(text(), 'Text Searched')] | //div[contains(@class,'class searched')]/*[not(contains(@class,'class excluded'))]/*[contains(text(), 'Text Searched')]
Thanks for your help :)
EDIT for clarity:
<div Class="TopClass">
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
I would like to get only the Text Researched that is between ClassB and ClassC and that is above the "Other Text". Sometimes the "Text Researched" will appear only below the "Other Text", and I don't want that result, so a [1] will not work here. Also, <h3> and <h4> are used elsewhere in the code.
Given this html
<div class="TopClass">
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 1</h3>
<u1 class="ClassC">
<h3> Text Researched 2</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 3</h3>
<u1 class="ClassC">
<h3> Text Researched 4</h3>
</u1>
</div>
</div>
</div>
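This XPath expression only looks inside ClassA divs that are followed by the "Other Text" h4, so it can match only the first 2 h3 tags (narrowed here to the first one by the contains() filter):
//div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()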
Result:
echo -e 'cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()\nbye' | xmllint --shell test.html
/ > cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()
-------
Text Researched 1
/ > bye
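The same "only what comes before the marker" idea can also be sketched in plain Python, in case an XPath engine with axis support is not available. This is an illustration with the standard xml.etree.ElementTree rather than the xmllint approach above; h3_before_marker is a hypothetical helper name, and the sample is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="TopClass">
<div class="ClassA"><div class="ClassB">
<h3> Text Researched 1</h3>
<u1 class="ClassC"><h3> Text Researched 2</h3></u1>
</div></div>
<h4>Other Text</h4>
<div class="ClassA"><div class="ClassB">
<h3> Text Researched 3</h3>
<u1 class="ClassC"><h3> Text Researched 4</h3></u1>
</div></div>
</div>"""


def h3_before_marker(top, marker_text="Other Text"):
    """Collect h3 texts from the children of `top` that precede the h4 marker."""
    found = []
    for child in top:
        if child.tag == "h4" and (child.text or "").strip() == marker_text:
            return found
        # iter() walks the child and all of its descendants in document order.
        found.extend((h3.text or "").strip() for h3 in child.iter("h3"))
    # No marker at all: mirror the following-sibling test, which would match nothing.
    return []


matches = h3_before_marker(ET.fromstring(HTML))
print(matches)
```

Only the two h3 texts above "Other Text" are returned; the ones below the marker are never visited.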

How to select the first occurrence in each element by XPath?

In the following html tags:
<div>
<div>
<h3>
<a href='http://Ali.org'></a>
</h3>
<div>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
<div>
<h4>
<a href='http://Ali.org'></a>
</h4>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
I want to select the two 'a' tags whose href is 'http://Ali.org'. With the following, I can:
//div//a[not(parent::*[not(following-sibling::*)])]
But what about a simpler XPath?
By the following, all of 'a' tags will be selected since they are all the first child of their parents:
//div/div//a[1]
Or by the following, just the first 'a' tag will be selected:
(//div//a)[1]
I want to select 'a' tags that are the first in the 'a' tags of div elements...
// in the middle of a path is an abbreviation for descendant-or-self::node(), so if you do
//div/div//a[1]
this effectively means
//div/div/descendant-or-self::node()/a[1]
This picks the first child a of all descendant nodes. What you want is:
//div/div/descendant::a[1]
which will pick the first descendant a.
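For comparison, here is a rough sketch of the "first descendant a per block" idea in Python's standard xml.etree.ElementTree, where find() returns the first match in document order. It is an illustration, not the XPath above, and it assumes only the two top-level inner divs count as blocks.

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<div>
<h3><a href='http://Ali.org'></a></h3>
<div><p><a href='http://Mohammad.org'></a></p></div>
</div>
<div>
<h4><a href='http://Ali.org'></a></h4>
<p><a href='http://Mohammad.org'></a></p>
</div>
</div>"""

root = ET.fromstring(HTML)

first_links = []
# "./div" visits only the direct child divs of the outer div.
for block in root.findall("./div"):
    # ".//a" searches all descendants; find() stops at the first one.
    link = block.find(".//a")
    if link is not None:
        first_links.append(link.get("href"))

print(first_links)
```

One thing to watch with the XPath version: //div/div also matches the div nested inside the first block, so on this sample descendant::a[1] would additionally return the 'http://Mohammad.org' link from that nested div. Anchoring the path on the outer element, as the loop above does, avoids that.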

How to get concatenated child text nodes in lxml

This is the HTML sample:
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part </p>
<p style="text-align: center;">Third text part</p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part</p>
</div>
</div>
With below code
from lxml import html
tree = html.fromstring(html_sample)
tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/p/text()')
I can get list of text values
['First text part ', 'Second text part ', 'Third text part', 'First text part ', 'Second text part']
However, I want to get all values from each div as single string like
['First text part Second text part Third text part', 'First text part Second text part']
and
//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/normalize-space()
seems to be the exact XPath to solve the problem, but lxml doesn't support the /normalize-space() syntax:
lxml.etree.XPathEvalError: Invalid expression
So how to get desired output in lxml?
Solved with below code:
[" ".join(string.text_content().split()) for string in tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')]
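The same join-and-split trick can be tried without lxml. The sketch below uses the standard xml.etree.ElementTree and itertext() in place of text_content(); the <root> wrapper is an assumption added only so the two sibling divs parse as one XML document.

```python
import xml.etree.ElementTree as ET

# <root> is added here only so the fragment has a single top-level element.
html_sample = """<root>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part </p>
<p style="text-align: center;">Third text part</p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part</p>
</div>
</div>
</root>"""

root = ET.fromstring(html_sample)
wrappers = root.findall(
    ".//div[@class='wpb_text_column']/div[@class='wpb_wrapper']"
)

# Concatenate every text node under a wrapper, then collapse whitespace,
# which is roughly what normalize-space() does for a single node.
joined = [" ".join("".join(w.itertext()).split()) for w in wrappers]
print(joined)
```

In lxml itself, evaluating normalize-space(.) as a function on each wrapper element, e.g. w.xpath('normalize-space(.)'), should give the same strings. It is only the use of the function as a path step that XPath 1.0 rejects.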

goquery- Concatenate a tag with the one that follows

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.
I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:
<html>
<body>
<h1>
<span class="text">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="text">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
I'd like to:
Extract the content of <h1..."text".
Insert (and concatenate) this extracted content into the content of <p..."text".
Only do this for the <p> tag that immediately follows the <h1> tag.
Do this for all of the <h1> tags on the page.
So this is what I want it to look like:
<html>
<body>
<p>
<span class="text">Go totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<p>
<span class="text">debugger should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle</span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
With the code starting off like this,
package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
html_code := strings.NewReader(`code_example_above`)
doc, _ := goquery.NewDocumentFromReader(html_code)
I know that I can read <h1..."text" with:
h1_tag := doc.Find("h1 .text")
I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:
doc.Find("p .text").Before("h1 .text")
^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".
Then, I found out how to get a step closer to what I want:
doc.Find("p .text").First().Before("h1 .text")
^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).
I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)
My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.
If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.
My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.
EDIT
I found out something else I can do:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
nex := s.Next().Text()
fmt.Println(s.Text(), nex, "\n\n")
})
^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?
Or, as mattn pointed out, I could also use doc.Find("h1+p").
I'm still having trouble appending <h1..."text" to <p..."text". I'll post it as another question because you can break this one down into multiple questions, and Mattn already answered one.
I'm not sure exactly what you are trying to write with goquery, but maybe what you need is the adjacent sibling selector:
h1+p
This selects every p tag that immediately follows an h1 tag.

How to exclude content from a rich snippet element?

I'm trying to apply rich snippet data to my web page, following http://schema.org/Article standards. One of the properties is articleBody, which I expect should include the entire body of text that comprises the article.
Unfortunately, the article's HTML representation is dotted with occasional buttons, ads and other hints, which contain text that should not go into the articleBody.
For example:
<div itemscope itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<p>1st Paragraph</p>
<p>2nd paragraph</p>
<a>A few useful links for my users</a>
<p>3rd paragraph</p>
<div>A few text ads</div>
<p>4th paragraph</p>
</div>
</div>
Is there a way to exclude the texts in the ads/links from the article itself?
No, Microdata doesn’t offer a way to exclude content.
articleBody’s value will be the textContent of the element.
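To see why the links and ads end up in the value, here is a small sketch that approximates textContent with Python's standard xml.etree.ElementTree. It is not a Microdata parser; itertext() just concatenates every descendant text node, which is what textContent does, and itemscope="" is written with an explicit value only so the sample parses as XML.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<div itemscope="" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<p>1st Paragraph</p>
<p>2nd paragraph</p>
<a>A few useful links for my users</a>
<p>3rd paragraph</p>
<div>A few text ads</div>
<p>4th paragraph</p>
</div>
</div>"""

root = ET.fromstring(ARTICLE)
body = root.find(".//*[@itemprop='articleBody']")

# Every descendant text node is included, whatever element it sits in.
text_content = "".join(body.itertext())
print(text_content)
```

The printed text contains the link and ad text alongside the four paragraphs, so the markup itself has to change if they are to be left out.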
An ugly "hack" would be to specify several articleBody properties for this item:
<div itemscope itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<p>1st Paragraph</p>
<p>2nd paragraph</p>
</div>
<a>A few useful links for my users</a>
<p itemprop="articleBody">3rd paragraph</p>
<div>A few text ads</div>
<p itemprop="articleBody">4th paragraph</p>
</div>
But note that Microdata does not define how those values should be interpreted, so it’s up to the consumers.
Another ugly method:
Duplicate the information, contained in a meta element:
<div itemscope itemtype="http://schema.org/Article">
<div>
<p>1st Paragraph</p>
<p>2nd paragraph</p>
<a>A few useful links for my users</a>
<p>3rd paragraph</p>
<div>A few text ads</div>
<p>4th paragraph</p>
</div>
<meta itemprop="articleBody" content="1st Paragraph. 2nd paragraph. 3rd paragraph. 4th paragraph." />
</div>
