I have a question that may be very straightforward to sort out.
I'm looking to write a test that looks within an element on the page and stores the value or text within that element so that it can be used later.
Example:
Within this CSS path "#clickable-rows > tbody > tr:nth-child(1) > td:nth-child(1)" is a value that I'd like to extract so that I can use it later.
Is this possible?
Yes, you're just looking for #text right?
element_css_locator = "#clickable-rows > tbody > tr:nth-child(1) > td:nth-child(1)"
# save text of element
element_text = page.find(element_css_locator).text
# later on assert:
page.find(element_css_locator).should have_content element_text
# or
page.should have_selector(element_css_locator, :text => element_text)
It's usually best to find the element both times rather than hanging onto the Capybara element instance.
Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have more complicated HTML, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
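As a rough sketch of how the subtraction behaves, the snippet below uses plain strings as stand-ins for what the two .text calls might return (the values are assumptions, not real Watir output). It shows that chomp only strips a trailing match, so it works here because the span comes last; sub is one alternative if the unwanted text could sit elsewhere.

```ruby
# Assumed stand-ins for top_level.h1.text and top_level.h1.span.text.
h1_text   = 'This is the text I want I do not want this text'
span_text = 'I do not want this text'

# chomp removes the argument only when it is a suffix of the string.
desired_text = h1_text.chomp(span_text).strip

# When the unwanted text is in the middle, chomp leaves the string untouched,
# while sub removes the first occurrence wherever it appears.
mid_text     = 'Before I do not want this text after'
chomp_result = mid_text.chomp(span_text)
sub_result   = mid_text.sub(span_text, '').squeeze(' ')

puts desired_text
puts chomp_result
puts sub_result
```

Note that sub removes the first occurrence only, so this still assumes the span text does not also appear earlier in the parent's own text.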
If there is only one h1, you can omit the id:
@b.h1.text.sub(@b.h1.children.collect(&:text).join(' '), '')
Or specify it if there are more:
@b.h1(id: 'gold').text.sub(@b.h1(id: 'gold').children.collect(&:text).join(' '), '')
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  # Strip the combined text of the element's children from its full text.
  el.text.chomp(el.children.collect(&:text).join(' '))
end
I have this HTML fragment:
<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>
I need to replace the word plane with a link (an <a> element wrapping the word plane),
but only when it's outside of an <a></a> anchor tag, and outside a heading (<h1> to <h6>) tag.
This is what I've tried:
require 'nokogiri'
h = '<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)
# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content
# Try 2: The below line removes headings permanently - I need them to remain
# doc.search(".//h2").remove
# Try 3: This just comes out empty - why?
# doc.xpath('text()')
# doc.xpath('//text()')
# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html
I tried various other variations of xpath to no avail. What am I doing wrong?
After some playing around, it appears you needed to use the XPath selector p/text(). Things then got more complicated because you're trying to replace normal text with a link element.
When I just tried using gsub, Nokogiri was escaping the new link, so I needed to split the text element into multiple sibling elements where I could replace some of the siblings with link elements instead of text nodes.
doc.xpath('p/text()').grep(/plane/) do |node|
node_content, *remaining_texts = node.content.split(/(plane)/)
node.content = node_content
remaining_texts.each do |text|
if text == 'plane'
node = node.add_next_sibling('plane').last
else
node = node.add_next_sibling(text).last
end
end
end
puts doc
# <p>Yes. No. Both. Maybe a plane?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes. No. Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird? Is it a plane? No, it’s Superman.</p>
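The loop above depends on a detail of String#split: when the pattern contains a capture group, the captured delimiter is kept in the result instead of being discarded. A minimal illustration with a plain string (the sample text is made up for the example):

```ruby
text = 'Is it a bird? Is it a plane? No.'

# With a capture group, the matched word is kept as its own array element.
with_group = text.split(/(plane)/)

# Without the group, the matched word is dropped.
without_group = text.split(/plane/)

# The first piece stays in the original text node; the rest become siblings.
first, *rest = with_group

p with_group
p without_group
p first
p rest
```

Because nothing is dropped, joining the pieces gives back the original string, which is what makes it safe to rebuild the node from them.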
A more general purpose XPath selector for all elements, except headings and links, might be:
*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()
You may need to tweak this some as I'm not an XML or Nokogiri expert, but it appears to me to be working for the provided example, at least, so it should get you going.
Let's say I want to scrape the "Weight" attribute from the following content on a website:
<div>
<h2>Details</h2>
<ul>
<li><b>Height:</b>6 ft</li>
<li><b>Weight:</b>6 kg</li>
<li><b>Age:</b>6</li>
</ul>
</div>
All I want is "6 kg". But it's not labeled, and neither is anything around it. But I know that I always want the text after "Weight:". Is there a way of selecting an element based on the text near it or in it?
In pseudocode, this is what it might look like:
require 'selenium-webdriver'
require 'nokogiri'
doc = parsed document
div_of_interest = doc.div where text of h2 == "Details"
element_of_interest = <li> element in div_of_interest with content that contains the string "Weight:"
selected_text = (content in element) minus ("<b>Weight:</b>")
Is this possible?
You can write the following code
p driver.find_elements(xpath: '//li').detect { |li| li.text.include?('Weight') }.text[/:(.*)/, 1]
output
"6 kg"
My suggestion is to use Watir, which is a wrapper around the Ruby Selenium bindings, where you can simply write:
p b.li(text: /Weight/).text[/:(.*)/,1]
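Both one-liners end with the same String#[] call, which takes a regex and a capture-group index. The sketch below isolates that step on plain strings; value_after_colon is a hypothetical helper name, and the input strings are assumed examples of what .text might return.

```ruby
# Hypothetical helper: return whatever follows the first colon, trimmed,
# or nil when the text contains no colon.
def value_after_colon(text)
  captured = text[/:(.*)/, 1] # capture group 1 = everything after the first ':'
  captured&.strip
end

p value_after_colon('Weight:6 kg')
p value_after_colon('Weight: 6 kg')
p value_after_colon('Details')
```

The nil case matters in practice: if no li matches, detect returns nil and calling .text on it raises, so a guard before the slice may be worth adding.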
Yes.
require 'nokogiri'
Nokogiri::HTML.parse(File.read(path_to_file))
.css("div > ul > li")
.children # get the child nodes of each 'li' (the 'b' element and the text after it)
.each_slice(2) # pair a 'b' item and the text following it
.find{|b, text| b.text == "Weight:"}
.last # extract the text element
.text
will return
"6 kg"
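You can locate the element through pure XPath: use the contains() function, which returns a Boolean indicating whether its second argument is found in the first. Pass it the string value of the li (written as .) and the target string, then take the value after "Weight:". The label sits inside a <b> child, so . is used rather than text(), which only sees the li's own text nodes and would not match; the path also starts with // because the div is not the document root.
xpath_locator = '//div/ul/li[contains(., "Weight:")]'
value = driver.find_element(:xpath, xpath_locator).text.partition('Weight:').last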
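For reference, here is how String#partition behaves on its own, using assumed sample strings rather than a live page. It always returns three parts, so .last is safe to call, but it yields an empty string rather than nil when the label is missing.

```ruby
# Label present: text before, the separator itself, text after.
found = 'Weight:6 kg'.partition('Weight:')

# Label absent: the whole string comes first and the other parts are empty.
missing = 'Age:6'.partition('Weight:')

p found
p missing
p found.last
p missing.last.empty?
```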
In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub-elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore just subtract that text from the whole text
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment
I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner HTML of the <dl> using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.
I ended up figuring out a route by myself, by manually parsing the children:
(@context/"dl").each do |dl|
  dl.children.each do |child|
    if child.is_a?(Hpricot::Elem) && child.name == 'dd'
      # do stuff with the element
    elsif child.is_a?(Hpricot::Text)
      text = child.to_s.strip
      puts text unless text.empty?
    end
  end
end
Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted markup. The pattern has to allow attributes on the opening tag, since the <dt> here has an id:
(Hpricot.parse(doc)/:dl).first.inner_html.sub(%r{<dt[^>]*>.+?</dt>}m, '')
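A regex used this way has to tolerate attributes on the opening tag, because the sample <dt> carries an id. The sketch below checks this on a plain string standing in for the inner_html result (an assumption about what Hpricot would return, not actual output):

```ruby
# Assumed stand-in for the inner_html of the <dl> in the question.
inner_html = <<~HTML
  <dt id="example_group_1">This is the heading text</dt>
  Some puts output here
HTML

# A literal "<dt>" never matches here, so the string comes back unchanged.
literal = inner_html.sub(%r{<dt>.+?</dt>}, '')

# Allowing attributes (and newlines inside the element, via /m) removes it.
tolerant = inner_html.sub(%r{<dt[^>]*>.*?</dt>}m, '').strip

puts literal == inner_html
puts tolerant
```

This remains a hack: it handles only the first <dt> and would break on nested or malformed markup, which is why selecting the text node directly is preferable where the parser allows it.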