Select previous td and clicking link with Mechanize and Nokogiri - ruby

Hi, I am scraping a webpage with Mechanize and Nokogiri. I am selecting a series of links (<a></a>):
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
Then I need to check if the content of each link (<a>content</a>, not the href) matches some stuff in my db. I am doing this:
links.each do |link|
if link.text == @tournament.homologation_number
If my condition is met, I need to select the <td></td> that is right before the <td> of the link I checked, and click the link inside it.
<td></td>
<td>content I check with my condition</td>
How can I achieve this using Mechanize and Nokogiri?

I would iterate over the first td elements, because it's easier to get at following elements than previous ones (with CSS, anyway):
page.search('td[1]').each do |td|
if td.at('+ td a').text == 'foo'
page2 = agent.get td.at('a')[:href]
end
end

First of all, you have to select all the <td></td> elements. The XPath //table/tbody/tr/td[2]/a[1] selects only the first <a></a> element inside each second cell, so you could try something like //table/tbody/tr/td, but this depends on the situation.
Once you have your array of <td></td> you can access their links like this:
tds.each do |td|
  link = td.at('a') # first link in this cell, or nil
  # Compare only the link text; if it matches, follow the link in the previous cell
  next unless link && condition_is_matched(link.text)
  previous_td = td.previous_element # previous sibling element, skipping whitespace nodes
  previous_link = previous_td && previous_td.at('a')
  goto_url previous_link['href'] if previous_link
end

Related

Put the Xpath element's text to array

I am trying to use Selenium. The problem is the following:
The doc structure:
<div class="jsSkills oSkills">
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-testing/" data-contractor="749244">software-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/software-qa-testing/" data-contractor="749244">software-qa-testing</a>
<a class="oTag oTagSmall oSkill" href="/contractors/skill/blog-writing/" data-contractor="749244">blog-writing</a>
</div>
I need to obtain all a's text to be in array like:
{"software-testing", "software-qa-testing", "blog-writing"}
I tried this:
contrSkill = driver.find_element(:xpath, "//div[contains(@class, 'jsSkills')]").text
puts contrSkill
but got this:
"software-testingsoftware-qa-testingblog-writing"
Please explain how to appropriately make an array.
You should get all of the link elements you want (using find_elements). Then you can iterate over each link and collect its text into an array (Ruby has a collect method that helps with this).
# Get all of the link elements within the div
skill_links = driver.find_elements(:xpath, "//div[contains(@class, 'jsSkills')]/a")
# Create an array of the text of each link
skill_text_array = skill_links.collect(&:text)
p skill_text_array
#=> ["software-testing", "software-qa-testing", "blog-writing"]

how to click a link in a table based on the text in a row

Using page-object and watir-webdriver how can I click a link in a table, based on the row text as below:
The table contains 3 rows which have names in the first column, and a corresponding Details link in columns to the right:
DASHBOARD .... Details
EXAMPLE .... Details
and so on.
<div class="basicGridHeader">
<table class="basicGridTable">ALL THE DETAILS:
....soon
</table>
</div>
<div class="basicGridWrapper">
<table class="basicGridTable">
<tbody id="bicFac9" class="ide043">
<tr id="id056">
<td class="bicRowFac10">
<span>
<label class="bicDeco5">
<b>DASHBOARD:</b> ---> Based on this text
</label>
</span>
</td>
<td class="bicRowFac11">
....some element
</td>
<td class="bicRowFac12">
<span>
<a class="bicFacDet5">Details</a> ---> I should able click this link
</span>
</td>
</tr>
</tbody>
</table>
</div>
You could locate a cell that contains the specified text, go to the parent row and then find the details link in that row.
Assuming that there might be other detail links you would want to click, I would define a view_details method that accepts the text of the row you want to locate:
class MyPage
include PageObject
table(:grid){ div_element(:class => 'basicGridWrapper')
.table_element(:class => 'basicGridTable') }
def view_details(label)
grid_element.cell_element(:text => /#{label}/)
.parent
.link_element(:text => 'Details')
.click
end
end
You can then click the link with:
page.view_details('DASHBOARD')
Table elements include the Enumerable module, and I find it very useful in cases like these. http://ruby-doc.org/core-2.0.0/Enumerable.html. You could use the find method to locate and return the row that matches the criteria you are looking for. For example:
class MyPage
include PageObject
table(:grid_table, :class => 'basicGridTable')
def click_link_by_row_text(text_value)
matched_row = locate_row_by_text(text_value)
matched_row.link_element.click
#if you want to make sure you click on the link under the 3rd column you can also do this...
#matched_row[2].link_element.click
end
def locate_row_by_text(text_value)
#find the row that matches the text you are looking for
matched_row = grid_table_element.find { |row| row.text.include? text_value }
fail "Could not locate the row with value #{text_value}" if matched_row.nil?
matched_row
end
end
Here, locate_row_by_text looks for the row that includes the text you are looking for, and raises an exception if it doesn't find it. Once you have the row, you can drill down to the link and click on it, as shown in the click_link_by_row_text method.
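The lookup pattern itself is plain Ruby and can be tried without a browser. The sketch below uses a hypothetical Row struct in place of page-object row elements (only its text is modeled) to show how find returns the first match, or nil when nothing matches.

```ruby
# Row is a hypothetical stand-in for a page-object table row; only #text is modeled.
Row = Struct.new(:text)

rows = [Row.new('DASHBOARD: ... Details'), Row.new('EXAMPLE: ... Details')]

def locate_row_by_text(rows, text_value)
  # Enumerable#find returns the first row whose block is truthy, or nil
  matched_row = rows.find { |row| row.text.include?(text_value) }
  fail "Could not locate the row with value #{text_value}" if matched_row.nil?
  matched_row
end

found = locate_row_by_text(rows, 'EXAMPLE')
puts found.text

begin
  locate_row_by_text(rows, 'MISSING')
rescue RuntimeError => e
  puts e.message
end
```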
Just for posterity, I would like to give an updated answer. It is now possible to traverse through a table using table_element[row_index][column_index].
A little bit more verbose:
row_index could also be the text in a row to be matched - in your case - table_element['DASHBOARD']
Then find the corresponding cell/td element using either the index(zero based) or the header of that column
table_element['DASHBOARD'][2] - Selecting the third element in the
selected row.
Since you do not have a header row (<th> element), you can filter within the row using the link's class attribute. Something like this:
table_element['DASHBOARD'].link_element(:class => 'bicFacDet5').click
So the code would look something like this:
class MyPage
include PageObject
def click_link_by_row_text(text_value)
table_element[text_value].link_element(:class => 'bicFacDet5').click
end
end
Let me know if you need more explanation. Happy to help :)

How can I search for a specific text element?

How can I search for the element containing Click Here to Enter a New Password using Nokogiri::HTML?
My HTML structure is like:
<table border="0" cellpadding="20" cellspacing="0" width="100%">
<tbody>
<tr>
<td class="bodyContent" valign="top">
<div>
<strong>Welcome to</strong>
<h2 style="margin-top:0">OddZ</h2>
Click Here
to Enter a New Password
<p>
Click this link to enter a new Password. This link will expire within 24 hours, so don't delay.
<br>
</p>
</div>
</td>
</tr>
</tbody>
</table>
I tried:
doc = Nokogiri::HTML(@inbox_emails.first.body.raw_source)
password_container = doc.search "[text()*='Click Here to Enter a New Password']"
but this did not find a result. When I tried:
password_container = doc.search "[text()*='Click Here']"
I got no result.
I want to search the complete text.
I found there are many spaces before text " to Enter a New Password" but I have not added any space in the HTML code.
Much of the text you are searching for is outside of the a element.
The best you can do might be:
a = doc.search('a[text()="Click Here"]').find{|a| a.next.text[/to Enter a New Password/]}
You can use a mix of XPath and regex. Nokogiri's XPath engine (XPath 1.0) has no built-in matches function, so you can implement your own as a custom function handler:
class RegexHelper
def content_matches_regex node_set, regex_string
! node_set.select { |node| node.content =~ /#{regex_string}/mi }.empty?
end
def content_matches node_set, string
content_matches_regex node_set, string.gsub(/\s+/, ".*?")
end
end
search_string = "Click Here to Enter a New Password"
matched_nodes = doc.xpath "//*[content_matches(., '#{search_string}')]", RegexHelper.new
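The whitespace-tolerant pattern that content_matches builds can be checked in plain Ruby. This sketch assumes the node text contains line breaks and indentation between the words. It differs from the answer above in two ways: it joins the words with \s+ rather than .*?, which is stricter, and it adds Regexp.escape so that punctuation in the search string is matched literally. The flexible_pattern name is made up for this example.

```ruby
# Build a regex that tolerates any run of whitespace between the words.
def flexible_pattern(string)
  words = string.split(/\s+/).map { |word| Regexp.escape(word) }
  Regexp.new(words.join('\s+'), Regexp::IGNORECASE)
end

# Assumed node text: the words are separated by a newline and indentation.
node_text = "Click Here\n            to Enter a New Password\n"
pattern = flexible_pattern('Click Here to Enter a New Password')

puts pattern.match?(node_text)             # full phrase, split across lines
puts pattern.match?('Click Here to reset') # different phrase
```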
You can try by using CSS selector. I've saved your HTML as a file called, test.html
require 'nokogiri'
@doc = Nokogiri::HTML(File.open('test.html'))
puts @result = @doc.css('p').text.gsub(/\n/,'')
it returns
Click this link to enter a new Password. This link will expire within 24 hours, so don't delay.
There's a good post about Parsing HTML with Nokogiri
You were close. Here's how you find the text's containing element:
doc.search('*[text()*="Click Here"]')
This gives you the <a> tag. Is this what you want? If you actually want the parent element of the <a>, which is the containing <div>, you can modify it like so:
doc.search('//*[text()="Click Here"]/..').text
This selects the containing <div>, the text of which is:
Welcome to
OddZ
Click Here
to Enter a New Password
Click this link to enter a new Password. This link will expire within 24 hours, so don't delay.

Get last word inside table cell?

I want to scrape data from a table with Ruby and Nokogiri.
There are a lot of <td> elements, but I only need the country which is just text after a <br> element. The problem is, the <td> elements differ. Sometimes there is more than just the country.
For example:
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
I want to address the element before the closing </td> tag because the country is always the last element.
How can I do that?
I'd use this:
require 'awesome_print'
require 'nokogiri'
html = '
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
'
doc = Nokogiri::HTML(html)
ap doc.search('td').map{ |td| td.search('text()').last.text }
[
[0] "USA",
[1] "UK",
[2] "Switzerland"
]
The problem is that your HTML being parsed won't have rows of <td> tags, so you'll have to locate the ones you want to parse. Instead, they'll be interspersed between <tr> tags, and maybe even different <table> tags. Because your HTML sample doesn't show the true structure of the document, I can't help you more.
There are a bunch of different solutions. Another one, using only the standard library, is to strip out the parts of the string you don't want.
node_string = <<-STRING
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
STRING
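node_string.split("<td>").collect do |str|
  last_str = str.split("<br>").last
  # Remove only the closing tag and newlines; a character class such as
  # [\n,\<\/td\>] would also strip the letters "t" and "d" from the country name
  last_str.gsub(%r{</td>|\n}, '') unless last_str.nil?
end.compact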
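If you stay with string handling, a sketch like the following takes the last <br>-separated piece of each cell. It assumes simple, un-nested <td> elements like the ones in the sample; for real pages the Nokogiri approach is more robust.

```ruby
html = <<~HTML
  <td>Title1<br>USA</td>
  <td>Title2<br>Michael Powell<br>UK</td>
  <td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
HTML

# Capture each cell's inner HTML, then keep the piece after the last <br>.
countries = html.scan(%r{<td>(.*?)</td>}m).map do |(inner)|
  inner.split(/<br\s*\/?>/i).last.strip
end

p countries
```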

Parse a HTML table using Ruby, Nokogiri omitting the column headers

I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to entirely remove the first <tr> and its corresponding <td> elements, that is, remove Firstname, Lastname and Middle. I want to start extracting the text only from the second <tr>, so that I get only the contents of the table from the second <tr> (tr[2]) onward, with no column headers.
Can someone please provide me a code as to how to do this.
Thanks.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}
Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it also supports the :not CSS pseudo-class, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
Depending on your preferences, you might prefer to avoid negation and just use css('tr:has(td)').

Resources