XPath extract text inside tag - xpath

HTML structure looks like this:
<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>
I would like to extract text just from the parent and the tags apart from the A and B children.
I have tried
/div[@class='Parent']//text()
which extracts text from all the descendant nodes, so I added a constraint like /div[@class='Parent']//text()[not(self::div)]
but it did not change a thing.
Thanks for any advice

/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text() | /div[@class='Parent']/text()
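To see which text nodes that union picks up, here is a rough sketch in Python. It assumes the markup parses as well-formed XML and uses only the standard library's xml.etree.ElementTree, which does not evaluate full XPath (no text() or not()), so the selection is emulated by walking the tree by hand. The helper name text_outside_excluded and the EXCLUDED_IDS set are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the structure from the question.
HTML = """<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>"""

EXCLUDED_IDS = {"A", "B"}  # hypothetical name: ids of the children to skip


def text_outside_excluded(parent):
    """Collect the parent's own text plus all text under non-excluded children."""
    pieces = [parent.text or ""]
    for child in parent:
        is_excluded = child.tag == "div" and child.get("id") in EXCLUDED_IDS
        if not is_excluded:
            # Like /*[not(...)]//text(): every text node inside this child.
            pieces.extend(child.itertext())
        # The tail is a text node of the parent itself (the /text() branch),
        # so it is kept even when the child is excluded.
        pieces.append(child.tail or "")
    # Drop whitespace-only nodes and trim the rest.
    return [p.strip() for p in pieces if p.strip()]


root = ET.fromstring(HTML)
print(text_outside_excluded(root))  # ['more tags', 'and text']
```

With a full XPath 1.0 engine the expression above can be evaluated directly; it would also return the whitespace-only text nodes that this sketch filters out.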

Related

xpath retrieving text inclusive of tag

I am trying to parse a webpage and get all the content inside a div tag with the class div1. I tried ('div[@class="div1"]') which gives me the content below:
<div class="div1">
<p>
something something <br>
abc<br>
def
</p>
</div>
However, I am trying to get everything that is inside the div tag, not including the div tag itself, as shown below:
<p>
something something <br>
abc<br>
def
</p>
Try changing your xpath to
div[@class="div1"]/child::*
Quote from https://www.w3.org/TR/xpath/#location-paths:
child::* selects all element children of the context node
For one thing, you're looking for @id when it's @class

how to get specific text after a div with xpath

I am having trouble getting specific text located between two tags.
I want to get the text after the em tag ("Text after em tag. I want to get this.") and also the text after the second p tag ("text after this p tag. I also want to get this.").
Is there any way of doing that?
Thanks in advance.
<article>
<h1 id='h1'>Heading 1</h1>
<img src='mypath/pictures/pic.jpg'></img>
<p></p>
<div id='div1'>
<time datetime='2016'>2016</time>
</div>
<br></br>
<em>my location, TN, United States</em>
Text after em tag. I want to get this.
<p></p>
text after this p tag. I also want to get this.
<div id='div2'>
</div>
</article>
You can get the following sibling text nodes by using
following-sibling::text()
So, to get the text right after the em tag:
//em/following-sibling::text()[1]
The same works for the p tag; then join the results:
string-join(em/following-sibling::text()[1] | p/following-sibling::text()[1] , ',')
I hope this could help!
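As a rough cross-check of the same idea without an XPath engine, here is a sketch using Python's standard library. It relies on the fact that ElementTree stores the text immediately after an element's end tag in that element's tail attribute, which matches following-sibling::text()[1] only when a text node directly follows the element (true for this markup, an assumption elsewhere). The ARTICLE string is a trimmed copy of the question's HTML. Note also that string-join is an XPath 2.0 function, so the joined expression above needs a 2.0 processor; here the join is done in Python.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the article from the question (img, div1 and br left out).
ARTICLE = """<article>
<h1 id='h1'>Heading 1</h1>
<p></p>
<em>my location, TN, United States</em>
Text after em tag. I want to get this.
<p></p>
text after this p tag. I also want to get this.
<div id='div2'>
</div>
</article>"""

root = ET.fromstring(ARTICLE)

# For every em or p child, take the text that directly follows its end tag.
tails = [el.tail.strip() for el in root if el.tag in ("em", "p") and el.tail]

# The first, empty <p></p> is followed only by a newline, so drop blank results.
wanted = [t for t in tails if t]

joined = ",".join(wanted)
print(joined)
```

The first p element contributes only whitespace, which is why the blank entries are filtered before joining.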

How can I make custom class HTML divisions using AsciiDoctor?

I am beginning with AsciiDoctor and I want to output HTML. I've been trying to figure out how to create divisions with a custom class. I searched Google, the manuals, etc. and couldn't find a solution. What I want to do is simply write something like this:
Type the word [userinput]#asciidoc# into the search bar.
Which generates HTML
<span class="userinput">asciidoc</span>
but I want to have div tags instead of span. Is there any way to do it or should I just use something like
+++<div class="userinput">asciidoc</div>+++ ?
I think what you need is called "role" in Asciidoctor.
This example:
This is some text.
[.userinput]
Type the word asciidoc into the search bar.
This is some text.
Produces:
<div class="paragraph">
<p>This is some text.</p>
</div>
<div class="paragraph userinput">
<p>Type the word asciidoc into the search bar.</p>
</div>
<div class="paragraph">
<p>This is some text.</p>
</div>
You now have a CSS selector, div.userinput, for the div in question.
See 13.5. Setting attributes on an element in the Asciidoctor User Manual (you can also search for "role").
You may want to use an open block for that purpose:
Type the following commands:
[.userinput]
--
command1
command1
--
Producing:
<div class="paragraph">
<p>Type the following commands:</p>
</div>
<div class="openblock userinput">
<div class="content">
<div class="paragraph">
<p>command1</p>
</div>
<div class="paragraph">
<p>command1</p>
</div>
</div>
</div>
The advantage is it can wrap any other block and is not limited to only one paragraph like the other answer.
For slightly different use cases, you may also consider defining a custom style.

Make use of XPath Axes to extract sibling elements' text

Given the following HTML, how can I get a list of tuples (TIME, COMMENT, OOXX) with XPath? I think I need to make use of XPath axes, but I am not sure how to use them. Furthermore, the OOXX does not seem to belong to any tag!
<div class="contents">
<p></p>
<div class="meta">TIME</div>OOXX
<div class="comment">COMMENT</div>
<p></p>
<div class="meta">TIME</div>OOXX
<div class="comment">COMMENT</div>
<p></p>
<div class="meta">TIME</div>OOXX
<div class="comment">COMMENT</div>
<p></p>
<div class="meta">TIME</div>OOXX
<div class="comment">COMMENT</div>
<p></p>
</div>
How you'll want to deal with multiple such tuples in the input XML will depend on your requirements and the facilities of the context of the XPath evaluation.
However, here's how to get the first TIME:
/div/div[@class="meta"][1]/text()
Here's how to get the first COMMENT:
/div/div[@class="comment"][1]/text()
And here's how to get the first OOXX:
/div/div[@class="meta"][1]/following-sibling::text()[1]
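One way to handle the multiple tuples is to loop in the host language. The sketch below is only an illustration with Python's standard library, assuming the input is well-formed XML: for each meta div it takes the div's text (TIME), its tail (the untagged OOXX, which corresponds to following-sibling::text()[1] here) and the first comment div that follows. The helper name extract_rows and the numbered placeholder values are made up so the rows can be told apart.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup with numbered placeholder values.
CONTENTS = """<div class="contents">
<p></p>
<div class="meta">TIME1</div>OOXX1
<div class="comment">COMMENT1</div>
<p></p>
<div class="meta">TIME2</div>OOXX2
<div class="comment">COMMENT2</div>
<p></p>
</div>"""


def extract_rows(root):
    """Return (TIME, COMMENT, OOXX) tuples, one per div.meta child."""
    rows = []
    children = list(root)
    for i, el in enumerate(children):
        if el.tag == "div" and el.get("class") == "meta":
            # First div.comment among the siblings that follow this meta div.
            comment = next(
                (s for s in children[i + 1:]
                 if s.tag == "div" and s.get("class") == "comment"),
                None,
            )
            comment_text = (comment.text or "").strip() if comment is not None else ""
            # el.tail is the untagged text right after </div>, i.e. OOXX.
            rows.append(((el.text or "").strip(), comment_text, (el.tail or "").strip()))
    return rows


print(extract_rows(ET.fromstring(CONTENTS)))
# [('TIME1', 'COMMENT1', 'OOXX1'), ('TIME2', 'COMMENT2', 'OOXX2')]
```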

How do I exclude a nested element when grabbing content using Nokogiri?

I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2 but the #level3 <div> is nested inside of it at the same level as the text I want.
Is it possible to somehow exclude that <div>? Should I be modifying the document and simply removing the element before parsing?
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[@id='level3']").remove.xpath("//*[@id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
If your HTML fragment is in a variable named html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.
