I've got many HTML files in a folder. For each file I want to replace en dashes and em dashes with a line feed or paragraph mark, but only inside elements of a specific HTML class.
For example, I would like to find/replace only text in class "Center".
Original:
class=Center <p class="Center">« Sentence1 — Sentence2 – Sentence3</p>
class=Aligned <p class="Aligned">«Other Sentence4 — Other Sentence5 – OtherSentence6«</p>
Desired result:
<p class="Center">« Sentence1 </p><p></p><p> Sentence2 </p><p></p><p> Sentence3«</p>
<p class="Aligned">«Other Sentence4 — Other Sentence5 – OtherSentence6«</p>
So far I'm using this solution by Helen: https://stackoverflow.com/a/1758239/5471234
But implementing strText = Replace(strText, "–", "< /p>< p>< /p>< p>") performs the find/replace across the whole text.
How can I limit it to class=Center? Is there a way to use RegEx and/or the HTML object's .innerText to grab only a specific class?
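A regex-based approach could look like the sketch below. It is written in Ruby rather than VBScript purely for illustration, and it assumes the paragraphs are written exactly as <p class="Center">...</p> with no nested <p> tags. The name split_center_paragraphs is made up for this example. The idea is to match each Center paragraph first and then replace dashes only inside the matched text.

```ruby
# Hypothetical helper: replaces en/em dashes only inside <p class="Center"> paragraphs.
# Assumes the opening tag is written exactly as shown and paragraphs are not nested.
CENTER_PARAGRAPH = %r{(<p class="Center">)(.*?)(</p>)}m
DASH = /[\u2013\u2014]/ # en dash, em dash

def split_center_paragraphs(html)
  html.gsub(CENTER_PARAGRAPH) do
    open_tag, body, close_tag = Regexp.last_match.captures
    # Only the text of the matched paragraph is searched for dashes.
    open_tag + body.gsub(DASH, '</p><p></p><p>') + close_tag
  end
end

html = "<p class=\"Center\">One \u2014 Two \u2013 Three</p>\n" \
       "<p class=\"Aligned\">Four \u2014 Five</p>"
puts split_center_paragraphs(html)
```

Paragraphs with any other class pass through unchanged because the outer pattern never matches them. For markup that varies (extra attributes, different quoting), an HTML parser would be more robust than a regex.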
Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have more complicated HTML, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
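One caveat with the chomp approach: String#chomp only removes its argument when it is a suffix, so it works here because the span comes last. The sketch below uses plain strings as assumed stand-ins for what Watir's text might return; the helper names without_trailing and without_first are made up for illustration.

```ruby
# Plain strings stand in for the values Watir's #text might return (assumed values).

def without_trailing(text, unwanted)
  # chomp removes `unwanted` only if it is a suffix of `text`
  text.chomp(unwanted).strip
end

def without_first(text, unwanted)
  # sub removes the first occurrence wherever it appears, then tidies the spacing
  text.sub(unwanted, '').squeeze(' ').strip
end

h1_text = 'This is the text I want I do not want this text'
span_text = 'I do not want this text'
puts without_trailing(h1_text, span_text) # span text at the end: removed

middle = 'Intro NOISE outro'
puts without_trailing(middle, 'NOISE')    # not a suffix: unchanged
puts without_first(middle, 'NOISE')       # removed from the middle
```

If the nested element can appear anywhere inside the parent, the sub-based variant is the safer of the two.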
This is useful for top level text.
If there is only one h1, you can omit the id (String#remove comes from ActiveSupport; in plain Ruby, gsub(..., '') does the same):
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more:
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  el.text.chomp(el.children.collect(&:text).join(' '))
end
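To try the helper without a browser, a minimal sketch can swap in stand-in objects. FakeElement below is hypothetical and models only the text and children methods; it is not part of Watir. This version reads the children from the element passed in, so it works for any element rather than only for @b.h1.

```ruby
# Hypothetical stand-in for a Watir element: only #text and #children are modeled.
FakeElement = Struct.new(:text, :children)

def get_top_text(el)
  # Join the text of the direct children, then strip it from the end of the parent text.
  child_text = el.children.map(&:text).join(' ')
  el.text.chomp(child_text).strip
end

span = FakeElement.new('I do not want this text', [])
h1 = FakeElement.new('This is the text I want I do not want this text', [span])
puts get_top_text(h1)
```

As with any chomp-based approach, this only works when the children's text sits at the end of the parent's text.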
This may be an easy question, I'm new to this.
I'm trying to get the data within this div
<div class="search-results-listings
" vocab="http://schema.org/" typeof="SearchResultsPage">
response.xpath("//div[@class='search-results-listings\n']")
and
response.xpath("//div[@class='search-results-listings\n ']")
are returning empty arrays
You can use XPath's contains:
response.xpath("//div[contains(@class, 'search-results-listings')]")
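Note that contains() is a plain substring test, so it would also match a class such as search-results-listings-old. A common stricter XPath 1.0 idiom is contains(concat(' ', normalize-space(@class), ' '), ' search-results-listings '), which compares whole class tokens and ignores stray whitespace or newlines. The sketch below only mirrors the logic of the two tests in plain Ruby; it does not run XPath, and the helper names are made up for illustration.

```ruby
# Mirrors contains(@class, token): a plain substring test.
def substring_match?(class_attr, token)
  class_attr.include?(token)
end

# Mirrors contains(concat(' ', normalize-space(@class), ' '), ' token '):
# collapse whitespace, pad with spaces, then look for the space-delimited token.
def token_match?(class_attr, token)
  normalized = " #{class_attr.split.join(' ')} "
  normalized.include?(" #{token} ")
end

messy = "search-results-listings\n"         # class value with a trailing newline
lookalike = 'search-results-listings-old'   # a different class sharing the prefix

puts substring_match?(messy, 'search-results-listings')     # true
puts substring_match?(lookalike, 'search-results-listings') # true (false positive)
puts token_match?(messy, 'search-results-listings')         # true
puts token_match?(lookalike, 'search-results-listings')     # false
```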
I have a document A and want to build a new document, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
How do I grab the content of <p> tags while preserving the tags inside <p>? Currently, when I call a.at_css("#section#{n} p").text it returns plain text, which is not what's needed. If I call .to_html or .inner_html instead of .text, the HTML comes out escaped, so I get, for example, &lt;p&gt; instead of <p>.
Is there any known true way of assigning nodes at the document building stage? So that I wouldn't dance with text method at all? I.e. how do I assign doc.h1 node with value of a.at_css("#section#{n} h1") node at building stage?
What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can get use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The entities are not escaped when accessing them. They will be escaped if you do something like builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
d.body do
d.div(id:'content') do
d.parent << para.inner_html
end
end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that, you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much of your document will be hard-coded nodes and hierarchy versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?
I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[@id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3']?
Your xpath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node under any $item that has an a child whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace $ns1 in the above expression with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
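The reason the formula works: a node belongs to $ns2 exactly when adding it to $ns2 does not make the set any larger. The sketch below mirrors that logic with Ruby arrays instead of XPath node-sets. The symbols are hypothetical labels for the text nodes of the sample <td> (whitespace-only nodes are ignored for simplicity), and kaysian_intersection is a made-up name.

```ruby
# Keep each member of ns1 whose union with ns2 is no larger than ns2 itself,
# which is the array equivalent of $ns1[count(.|$ns2) = count($ns2)].
# Assumes ns2 has no duplicates, as is always true of an XPath node-set.
def kaysian_intersection(ns1, ns2)
  ns1.select { |node| ([node] | ns2).size == ns2.size }
end

# Hypothetical labels for the three text nodes in the sample <td>.
following_hw2 = %i[text2 text3] # text nodes after <a name="hw2">
preceding_hw3 = %i[text1 text2] # text nodes before <a name="hw3">

p kaysian_intersection(following_hw2, preceding_hw3) # => [:text2]
```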
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".