put each text surrounded via html tag, into an array? - ruby

using nokogiri,
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_s
this does the job, however, it puts everything into one flat text.
i need to take each text surrounded via html tags
<b> text</b>
<h1>text3</h1>
and put them into array. ["text", "text3"]
what is the recommended action ?
i thought of doing
doc.xpath("*").text
but I don't know how to iterate through it all.

doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_a

Related

How do I retrieve innerhtml using watir webdriver

I have the following HTML, and I need to get the text that is outside of the bold tags. For instance, for 'Submitted At:' I need to get the timestamp that follows. You will see that 'Submitted At:' is surrounded by bold tags and the timestamp follows it, and I cannot retrieve it.
<body>
<h2> … </h2>
<b> … </b>
jenkins
<br></br>
<b> … </b>
<br></br>
<b> … </b>
…
<br></br>
<b> … </b>
<br></br>
<b>
Submitted At:
</b>
29-Jan-2016 17:12:24
Things I have tried.
@browser.body.text.split("\n")
@browser.body.split("\n")
body_html = Nokogiri::HTML.parse(@browser.body.html)
body_html.xpath("//body//b").text
returned: "User: JobName: JobConf: Job-ACLs: All users are allowedSubmitted At: Launched At: Finished At: Status: Analyse This Job"
I have tried several things such as xpath, plain old text retrieval, but I am not able to get what I need. I have also done several searches and can't find what I need.
To start with, html bereft of classes and ids is always going to provide a challenge. It is going to be even worse when you want to access text that is merely in the body tag.
In this specific instance, this should work:
browser.b(index: 4)
InnerHtml is literally what the name says: the markup between an element's start and end tags. The text you want is therefore part of the InnerHtml of the outer tag, <body>.
The .text of the <body> tag gives you the entire page text. If the tags are dynamic, a hardcoded index is not going to work. If you know the timestamp always has the same length, get the entire text, find the string 'Submitted At:' and take the substring from there up to the maximum timestamp length. This is more stable than a hardcoded index value that may change.
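A rough sketch of that substring approach, assuming the page text has already been fetched (for example from browser.body.text) and that the timestamp always has the shape seen in the question. The body_text value, including the 'Launched At:' line, is made up for illustration:

```ruby
# Hypothetical page text; in practice this might come from browser.body.text.
body_text = "User: jenkins\nSubmitted At: 29-Jan-2016 17:12:24\nLaunched At: 29-Jan-2016 17:12:30"

label = 'Submitted At:'

# Match the label, skip any whitespace, then capture a timestamp
# shaped like 29-Jan-2016 17:12:24.
timestamp_pattern = /#{Regexp.escape(label)}\s*(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2})/

# String#[] with a regexp and a group number returns that capture, or nil.
submitted_at = body_text[timestamp_pattern, 1]

p submitted_at
# => "29-Jan-2016 17:12:24"
```

Matching on the timestamp's shape rather than a fixed character count is a design choice here; it returns nil instead of a wrong slice when the label is missing.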
The HTML appears to have a structure of:
a <b> tag that is the field description and
a following text node that is the field value.
Watir can only return the concatenation of all an element's text nodes. As a result, it does not deal well with this structure, which needs the text nodes separated. While you could parse the concatenated String, it could be error prone depending on the possible field descriptions/values.
I would therefore suggest parsing the HTML with Nokogiri as it can return individual text nodes. This would look like:
html = browser.html
doc = Nokogiri::HTML(html)
p doc.at_xpath('//b[normalize-space(text()) = "Submitted At:"]
/following-sibling::text()[1]').text.strip
#=> "29-Jan-2016 17:12:24"
Here we are using an XPath to find the <b> tag that contains the relevant field description, "Submitted At:". From that node, we find the text node, ie the "29-Jan-2016 17:12:24", that comes right after it.

Ruby Nokogiri extract text after the end of a tag

I have a rather basic question here, which means I'm probably missing something. I'm using Nokogiri to scrape a site.
I want to extract the text AFTER the end of a strong tag within a div which looks like this:
<p style="padding-bottom:0px;"><strong>Location:</strong> Cape Town</p>
Currently my code is as follows:
location = detail_page.css('p[style="padding-bottom:0px;"]').text
Which obviously gives the <strong>Location:</strong> bit as well, is there a way to do this without using a regex?
The reason for asking is that there are other divs in the same format containing information which I need so I can't just delete the strong elements.
Thanks in advance
Marc
You could use XPath:
detail_page.xpath('//p[@style="padding-bottom:0px;"]/strong/following-sibling::text()')
This selects any text nodes that are following siblings of strong elements that are in turn children of p elements with a style attribute with the value padding-bottom:0px;.
Here I would do as below :
require 'nokogiri'
@doc = Nokogiri::HTML.parse('<p style="padding-bottom:0px;"><strong>Location:</strong> Cape Town</p>')
@doc.at_css('p[style*="padding-bottom:0px;"] > text()').text.strip
# => Cape Town

what xpath to select CDATA content when some childs exist

Let's say I have an XML that looks like this:
<a>
<b>
<![CDATA[some text]]>
<c>xxx</c>
<d>yyy</d>
</b>
</a>
I can't find a way to get "some text". Any idea?
If I'm using "a/b" it returns also xxx and yyy
If I'm using "a/b/text()" it returns nothing
You can't actually select a CDATA section: CDATA is just a way of telling the parser to avoid unescaping special characters, and your input document looks to XPath exactly the same as:
<a>
<b>
some text
<c>xxx</c>
<d>yyy</d>
</b>
</a>
(Having said that, if you're using DOM, then some DOM XPath engines fail to implement the spec correctly, and treat the CDATA content as a separate text node from the text outside the CDATA section).
The XPath expression a/b/text() should select three text nodes, of which the first contains "some text" along with surrounding whitespace.
With the XPath data model the path /a/b/text()[1] should select a text node with the string value
some text
that is a line break, some spaces, the text some text followed by a line break and some spaces.

How do I find matching <pre> tags using a regular expression?

I am trying to create a simple blog that has code enclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, code up to the first closing </pre> tag.
What I've come up with so far is the following:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution thanks to your guys help:
require 'nokogiri'
module PostsHelper
def readMore(post)
doc = Nokogiri::HTML(post.message)
intro = doc.search("div[class='intro']")
result = Nokogiri::XML::DocumentFragment.parse(intro)
result << link_to("Read More", post_path(post))
result.to_html
end
end
Basically in my editor for the blog I wrap the blog preview in div class=intro
Thus, only the intro is displayed with read more added on to it.
This is not a job for regular expressions, but for an HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
You can capture all text up to the first closing pre tag by changing your regular expression to:
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text by,
text.match(regex)[1]
which will return only the text up to the first closing pre tag.
Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do the thing, consider using some html parser, like Nokogiri.
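To see the difference between the greedy and the reluctant quantifier side by side, here is a small sketch. The post string is a made-up stand-in for a blog post body:

```ruby
# Hypothetical post body with two <pre> blocks.
post = "<p>Intro</p>\n<pre>first block</pre>\n<p>More</p>\n<pre>second block</pre>\n<p>End</p>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST </pre>.
greedy = post[/\A(.*<\/pre>)/m, 1]

# Reluctant: .*? grabs as little as possible, so the match stops at the FIRST </pre>.
reluctant = post[/\A(.*?<\/pre>)/m, 1]

puts greedy.end_with?('second block</pre>')
# => true
puts reluctant
# => <p>Intro</p>
# => <pre>first block</pre>
```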

Selecting specific using x-path while disregarding certain nodes

I have some html that looks pretty much like this.
<p>
<a img src="img src">
<strong>foo</strong>
<strong>bar</strong>
<strong>baz</strong>
<strong>eek</strong>
This is the text I want to select using xpath.
</p>
How can I select only this particular text node as indicated above using xpath?
How do I get at only this particular
text element in question using xpath?
Use:
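/p/text()[last()]
The XPath expression /p/text() selects every text node that is a direct child of the p element in the markup posted in the question, including the whitespace-only nodes between the tags.
/p/text()[normalize-space()]
The predicate keeps only the text nodes that are not empty after whitespace normalization, which leaves the one node you want. It does not trim the selected node; its string value still carries the surrounding whitespace.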
There is very good tutorial at http://www.w3schools.com/xpath/

Resources