How do I exclude a nested element when grabbing content using Nokogiri? - ruby

I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2, but the #level3 <div> is nested inside it, at the same level as the text I want.
Is it possible to somehow exclude that <div>? Or should I modify the document and simply remove the element before extracting the text?

require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[@id='level3']").remove.xpath("//*[@id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
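
One possible way to do that cleanup, as a minimal sketch in plain Ruby: split the text into lines, strip each one, and drop the blank ones. The helper name clean_lines and the constant RAW_TEXT are made up for illustration; the sample string is the output shown above.

```ruby
# The raw inner_text from the answer above, whitespace and all.
RAW_TEXT = "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "

# Hypothetical helper: returns the non-blank lines with surrounding
# whitespace removed.
def clean_lines(text)
  text.lines.map(&:strip).reject(&:empty?)
end

puts clean_lines(RAW_TEXT).join("\n")
```

Joining with "\n" keeps one sentence per line; join with a space instead if you want a single string.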

If your HTML fragment is stored in a variable named html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.

Related

xPath how to access input text file with generic name

Is there any option to access input in code like this:
(...)
<div class="dialogProp">
<div class="gwt-Label">Name</div>
<div class="floatLeft">
<div>
<input type="text" class="textBox">
</div>
<div class="notVisible"></div>
</div>
<div class="dialogProp">
<div class="gwt-Label">Surname</div>
<div class="floatLeft">
<div>
<input type="text" class="textBox">
</div>
<div class="notVisible"></div>
</div>
(...)
As you can see, I have two inputs, and the only difference between them is the text of the label div that precedes each one. This pattern appears all over the website and I cannot change it. I cannot add ids either.
Do you know if it is possible to include this label text in the XPath?
Let's say I would like to access the first input.
Of course I could use a very long XPath, but I would like to reuse it with the text inside gwt-Label as a variable.
Use the following to locate the input by its label text:
//div[@class="gwt-Label" and .="Name"]/following-sibling::div//input
In Python you can pass the label in from a variable:
label = "Name"
xpath = '//div[@class="gwt-Label" and .="%s"]/following-sibling::div//input' % label
To access the input based on the label text, you can also use the following solution:
labelText = "Name"
# or labelText = "Surname"
xpath = "//div[@class='gwt-Label' and contains(.,'" + labelText + "')]//following::div[1]//input"

xpath retrieving text inclusive of tag

I am trying to parse a webpage and get all the content inside a div tag with the class div1. I tried ('div[@class="div1"]'), which gives me the content below
<div class="div1">
<p>
something something <br>
abc<br>
def
</p>
</div>
However, I am trying to get everything that is inside the div tag, not including the div tag as shown below
<p>
something something <br>
abc<br>
def
</p>
Try changing your XPath to
div[@class="div1"]/child::*
Quote from https://www.w3.org/TR/xpath/#location-paths:
child::* selects all element children of the context node
For one thing, you're looking for @id when it's @class

Print attributes of Nokogiri::XML::Node only, without innerHTML

Does anyone know if there is a native method for printing the attributes of a Nokogiri::XML::Node without its innerHTML or text content?
For example, given the following Nokogiri::XML::Node:
<div id="customer" class="highlighted">
<h1>Customer Name</h1>
<p>Some customer description</p>
</div>
I would like to print only:
<div id="customer" class="highlighted">
or
<div id="customer" class="highlighted"/>
or
<div id="customer" class="highlighted"></div>
I know I could simply loop through the list of attributes using the attributes method, but I was wondering if Nokogiri already supports something like this natively.
You could output the node with its content deleted:
doc = Nokogiri::HTML.fragment(
'<div id="customer" class="highlighted">
<h1>Customer Name</h1>
<p>Some customer description</p>
</div>'
)
node = doc.at_css('#customer').clone
node.content = nil
p node.to_html
#=> "<div id=\"customer\" class=\"highlighted\"></div>"

How do I get the text within HTML tags?

I want to get the text within a certain HTML tag. It looks like:
<div id="data123">data1: value1<br>data2: value2<br> data3: value</div>
My code looks like:
html_page = Nokogiri::HTML open 'my_url'
who_is_raw = html_page.css('div#data123')[0] #.text
I get either the text within the <div> tag without the <br> tags, or the whole <div> with all the <br> tags inside. But I want only the content within that <div> tag, including the <br> tags inside it.
How do I do that?
Try inner_html:
who_is_raw = html_page.css('div#data123')[0].inner_html

XPath extract text inside tag

HTML structure looks like this:
<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>
I would like to extract the text from the parent and its child tags, except for the A and B children.
I have tried
/div[@class='Parent']//text()
which extracts text from all the descendant nodes, so I added a constraint like /div[@class='Parent']//text()[not(self::div)]
but it did not change a thing.
Thanks for any advice
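The predicate [not(self::div)] has no effect there because it is applied to text nodes, and a text node is never a div element. Filter the child elements instead and add the parent's own text nodes:
/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text() | /div[@class='Parent']/text()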
