How to scrape data using Nokogiri from elements having two 'data-' attributes - ruby

I want to scrape data using Nokogiri from some HTML:
<td data-bar="hoge" data-date="2000-01-01" class="modals"></td>
<td data-bar="fuga" data-date="2000-01-02" class="modals"></td>
I wrote:
element = page.css("td[data-bar='hoge'][data-date='2000-01-01']")
but element.length returns 0.
How do I distinguish elements having two data- attributes?

Try using XPath selectors instead. This worked for me:
element = page.xpath("//td[@data-bar='hoge'][@data-date='2000-01-01']")
In this example, the // portion will match any td element (with those attributes) in the document, which may not be desirable. In that case, you would need to write a more explicit XPath to the node.
Here's the documentation for XPath: https://www.w3.org/TR/xpath/

Related

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates nodes rather than removing them. I knew something like price.text would grab text from all sub-elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore, just subtract that text from the whole text:
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL):
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

How to get td text value of nested table by XPath

I have this table inside another table inside another table and so on. And then I want to get the text value of the td element with a specific class.
<tr>
<td width="5%"></td>
<td class="wintxt">The XML ....<br/><br/>Number: xyz</td>
</tr>
I need to get the text content "The XML ....Number: xyz"
I tried using:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr[@class=\"wintxt\"]/td/text()");
...and many other variations, but I always get a zero-element List. Does anyone have a clue?
There is a mistake in your XPath: it looks for text() under a tr that has the class attribute, but in the HTML you provided only the td has a class attribute. Try this instead:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr/td[@class='wintxt']/text()");
Hope it helps..:)

Find element which has no style

In a table, there are rows like this:
<tr id="filtersJob_intrinsicTable_row6" class="evenRow" style="display: none;">some stuff here</tr>
<tr id="filtersJob_intrinsicTable_row7" class="evenRow">some stuff here</tr>
How do I use Watir to get the rows which are to be displayed, i.e. the rows which do NOT have style="display: none;"?
You have a number of ways of collecting elements without the style attribute:
Using a :css locator:
browser.trs(css: 'tr:not([style])')
Using a :xpath locator:
browser.trs(xpath: '//tr[not(@style)]')
You could also check the attribute value:
browser.trs.select { |tr| tr.attribute_value('style').nil? }
Note that you should be cautious about using the style attribute as an indicator of the row being displayed. Someone could add some other unrelated style property and then all of the tests will fail. Instead, I would suggest that you look for rows that are present:
browser.trs.select(&:present?)
I think that this also makes the purpose of the code more obvious and readable.
Using xpath:
browser.element_by_xpath(".//tr[not(@style)]")
[not(@style)] means having no style attribute.

Get a specific tag in a node?

I'm using Ruby, XPath and Nokogiri and trying to retrieve d1 from the following XML:
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
This is my code in a loop:
rs = doc.xpath("//a/b1/c/d1").inner_text
puts rs
It returns nothing (No error).
I want to get the text in <d1>.
You don't ask for the text content in your xpath query:
rs = doc.xpath('//a/b1/c/d1/text()')
You're misusing XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
EOT
doc.at('/a/b1/c/d1').text # => "01/11/2001"
doc.at('//d1').text # => "01/11/2001"
// in XPath-ese means start at the top and look anywhere in your document. Instead, if you're supplying an explicit/absolute selector, start at the top of the document and drill down using '/a/b1/c/d1'. Or, do the simple thing and let the parser search through the document for that particular node using //d1. You can do that if you know there's a single instance of that node.
In my code above, I used at instead of xpath. at returns the first matching node, which is similar to using xpath('//d1').first. xpath returns a NodeSet, which is like an array of nodes, whereas at returns a Node only. Using inner_text on a NodeSet is likely to not give you the results you want, which would be the text of a particular node, so be careful there.
doc.xpath('/a/b1/c/d1/text()').class # => Nokogiri::XML::NodeSet
doc.xpath('//c').inner_text # => "\n 01/11/2001\n 02/02/2004\n "
doc.xpath('/a/b1/c/d1').first.text # => "01/11/2001"
Look at the following lines. Instead of using XPath selectors, I used CSS, which tends to be more readable. Nokogiri supports both.
doc.at('d1').text # => "01/11/2001"
doc.at('a b1 c d1').text # => "01/11/2001"
Also, notice the type of data returned from these two lines:
doc.at('/a/b1/c/d1/text()').class # => Nokogiri::XML::Text
doc.at('/a/b1/c/d1').text.class # => String
While it might seem good/smart to tell the parser to locate the text() node inside <d1>, what will be returned isn't text, and will need to be accessed further to make it usable, so consider forgoing the use of text() unless you know exactly why you need it:
doc.at('/a/b1/c/d1/text()').text # => "01/11/2001"
Finally, Nokogiri has many methods used for locating nodes. As I said above, xpath returns a NodeSet and at returns a Node. xpath is really an XPath-specific version of Nokogiri's search method. search, css and xpath all return NodeSets. at, at_css and at_xpath all return Nodes. The CSS and XPath variants are useful when you have an ambiguous selector that you need to be used as CSS or XPath specifically. Most of the time Nokogiri can figure whether it's CSS or XPath on its own and will do the right thing, so it's OK to use the generic search and at for the majority of your coding. Use the specific versions when you have to specify one or the other.

Parsing inner tags using Nokogiri

I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[@id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby
