How do I retrieve innerhtml using watir webdriver - ruby

I have the following HTML, and I need to get the text that is outside of the bold tag. For instance 'Submitted At:' I need to get the timestamp that follows. You will see that 'Submitted At: is surrounded by bold tags and the timestamp follows and I can not retrieve it.
<body>
<h2> … </h2>
<b> … </b>
jenkins
<br></br>
<b> … </b>
<br></br>
<b> … </b>
…
<br></br>
<b> … </b>
<br></br>
<b>
Submitted At:
</b>
29-Jan-2016 17:12:24
Things I have tried.
#browser.body.text.split("\n")
#browser.body.split("\n")
body_html = Nokogiri::HTML.parse(#browser.body.html)
body_html.xpath("//body//b").text
returned: "User: JobName: JobConf: Job-ACLs: All users are allowedSubmitted At: Launched At: Finished At: Status: Analyse This Job"
I have tried several things such as xpath, plain old text retrieval, but I am not able to get what I need. I have also done several searches and can't find what I need.

To start with, html bereft of classes and ids is always going to provide a challenge. It is going to be even worse when you want to access text that is merely in the body tag.
In this specific instance, this should work:
browser.b(index: 4)

InnerHtml is literally what it is - its inside a HTMLstart and end tag. So you are looking at InnerHtml of the outer tag actually - <body>.
The .text of <Body> tag will give you entire text. If the tags are gonna be dynamic index is not going to work. So if you know the timestamp length is gonna always be same, Get the entire text, delimit/unstring based on this string 'Submitted At:' to max timestamp length. This will be stable solution rather than a hardcoded Index value if it may change. Ie pickup substring starting from that tag to max length of timestamp.

The HTML appears to have a structure of:
a <b> tag that is the field description and
a following text node that is the field value.
Watir can only return the concatenation of all an element's text nodes. As a result, it does not deal well with this structure, which needs the text nodes separated. While you could parse the concatenated String, it could be error prone depending on the possible field descriptions/values.
I would therefore suggest parsing the HTML with Nokogiri as it can return individual text nodes. This would look like:
html = browser.html
doc = Nokogiri::HTML(html)
p doc.at_xpath('//b[normalize-space(text()) = "Submitted At:"]
/following-sibling::text()[1]').text.strip
#=> "29-Jan-2016 17:12:24"
Here we are using an XPath to find the <b> tag that contains the relevant field description, "Submitted At:". From that node, we find the text node, ie the "29-Jan-2016 17:12:24", that comes right after it.

Related

Returning multiple strings using XPATH

Source website is here on Nethys
Since I don't know all of the terminologies I'm going to keep this as neutral as possible. I'm trying to gather information from this website into separate columns in a google doc.
I want the bold text in one column, the associated link in the next, and the spell description in another. The issue comes when a description references another spell they put it in italics which breaks the description into multiple parts seen in C153 and C154. I think it would be easier to just grab everything in between the bold text and a line break but I don't know the context.
From an example such as (Forgive me if the formatting is wrong, I'm mostly guessing here);
<p>
<b>
<a href='link1'>
Bold Link 1
</a>
</b>
:Followed by normal text
<br>
<b>
<a href='link2'>
Bold Link 2
</a>
</b>
:Normal Text
<i>with an italic</i>
in between
<br>
<b>
<a href='link3'>
Bold Link 3
</a>
</b>
:Back to this one
<br>
</p>
I can get it to return
:Followed by normal text
Normal text
in between
:Back to this one
But I want it to return :Followed by normal text :Normal text with an italic in between :Back to this one
I don't even know if it's possible to do with a single command but any help would be appreciated.
If you want to select every text node descendant of p root element that it's not also descendant of a you could use this XPath:
/p//text()[not(ancestor::a)]
Or more restricted ussing the Kayian method:
/p//text()[count(.|/p//a//text()) != count(/p//a//text())]
Note: XPath 1.0 has no intersection nor set differenciation operators, but it has union by | operator and cardinality by count() function. Dr. Michael Kay discovered that those were enough to test for set membership: a element is member of B set if and only if {a} union B has the same cardinality than B. From there you build all the others set operations.

XPath query for contains() with multiple text elements

How can I find all tags that have text "world"?
<c>
<a>
<b>hello</b>
world
</a>
</c>
Expected result should be tag 'a'.
I am trying //a[contains(text(),'world')] but it doesn't give anything.
This 'a' tag is kind of mix of text and another tag.
Try this one:
(//*[contains(., "world")])[last()]
or if you know for sure that it will be a node:
//a[contains(.,'world')]
Also check the difference between string value and text node

Trouble accessing a text with XPath query

I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, I get an error message saying not able to locate the element.
"//div[#id='overview']/strong[position()=2]/following-sibling"
I tried this, I get the div with id=sub, but not the text (correctly so)
"//div[#id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there anyway to get the text, other than doing some string matching or regex with contents of overview div?
Thanks.
following-sibling is the axis, you still need to specify the actual node (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node with ::.
Try this:
//div[#id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[#id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.

what xpath to select CDATA content when some childs exist

Let's say I have an XML that looks like this:
<a>
<b>
<![CDATA[some text]]>
<c>xxx</c>
<d>yyy</d>
</b>
</a>
I can't find a way to get "some text". Any idea?
If I'm using "a/b" it returns also xxx and yyy
If I'm using "a/b/text()" it returns nothing
You can't actually select a CDATA section: CDATA is just a way of telling the parser to avoid unescaping special characters, and your input document looks to XPath exactly the same as:
<a>
<b>
some text
<c>xxx</c>
<d>yyy</d>
</b>
</a>
(Having said that, if you're using DOM, then some DOM XPath engines fail to implement the spec correctly, and treat the CDATA content as a separate text node from the text outside the CDATA section).
The XPath expression a/b/text() should select three text nodes, of which the first contains "some text" along with surrounding whitespace.
With the XPath data model the path /a/b/text()[1] should select a text node with the string value
some text
that is a line break, some spaces, the text some text followed by a line break and some spaces.

XPath/Scrapy crawling weirdly formatted pages

I've been playing around with scrapy and I see that knowledge of xpath is vital in order to leverage scrapy sucessfully. I have a webpage I'm trying to gather some information from where the tags are formatted as such
<div id = "content">
<h1></h1>
<p></p>
<p></p>
<h1></h1>
<p></p>
<p></p>
Now the heading contains a title and the first 'p' contains data1 and the second 'p' contains data2. This seems like a pretty straight forward task, and if this were always the case I would have no problem i.e. hsx.select('//*[#id="content"]') etc. etc.
The problem is, sometimes there will only be ONE p tag following a header instead of two.
<div id = "content">
<h1></h1>
<p></p> (a)
<h1></h1>
<p></p> (b)
<p></p> (c)
What i would like is if there is a paragraph tag missing I want to store that information as just blank data in my list. Right now what happens is the lists are storing the first heading 1, the first paragraph tag(a), and then the paragraph tag under the second h1 (b).
What it should be doing is storing
title -> h1[0]
data1[0] -> (a)
data2[0] ->[]
I hope that makes sense. I've been looking for a good xpath or scrapy solution to do this but I can't seem to find one. Any helpful tips would be awesome. thanks
Use:
//div[#id='content']
/h1[1]/following sibling::*
[not(position()>2)][self::p]
This selects the (utmost) two immediate sibling elements, only if they are p, of the first h1 child of any div (we know that this must be just one div) the string value of whoseidattribute is"content"`.
If only the first immediate sibling is a p, then the returned node-list contains only one item.
You can check whether the length of the returned node-list is 1 or 2, and use this to build the control of your processing.
I think you'd want something like this; not 100% though / untested.
//h1/following-sibling::*[2][self::p]/text()|//h1[not(following-sibling::*[2][self::p])]/string('')

Resources