Nokogiri and xpath for extracting table data - ruby

I'm a bit of a newbie and I'm trying to scrape some data from a table, but am not having much luck using xpath. I can get the first field I need, but then... nothing.
The table structure for each row is as follows:
<tr bgcolor="#FFF7E7">
<td valign="Top"><font color="#8C4510">
<span id="DataGrid1__ctl3_Label2">Index</span>
</font></td>
<td><font color="#8C4510"><font color="#8C4510">Title</font></font></td>
<td><font color="#8C4510"><font color="#8C4510">People</font></font></td>
<td valign="Top"><font color="#8C4510">Date</font></td><td><font color="#8C4510"><a href="javascript:__doPostBack('DataGrid1$_ctl3$_ctl4','')">
<font color="#8C4510">Text</font></a></font></td>
<td><font color="#8C4510"><font color="#8C4510">Outcome</font></font></td>
<td valign="Top">
<font color="#8C4510"><font color="#8C4510">Click link for more</font></font></td>
</tr>
I'm trying to extract the Index, Title, People, Text, Outcome fields as well as the link.
I'm managing to extract the Index, but can't seem to get the rest.
In my ruby code, my call for actually getting the table seems to be working, but then my loop where I'm extracting the fields for each row of the table is not, apart from the Index.
Any help would be great.

With the excerpt you gave there, you can extract text and links with the following XPath query:
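require 'nokogiri'

# Nokogiri reads the whole file while parsing, so the block form of
# File.open can close the handle as soon as the document is built.
doc = File.open('test.html') { |f| Nokogiri::HTML(f) }

# Print the text and href of every link found inside a table cell.
doc.xpath('//tr//td//a').each do |node|
  puts "#{node.text.strip}: #{node['href']}"
end
Without seeing the other rows in the table, though, I can't say whether this will work for the rest of your data.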

Related

correct way to scrape this table (using scrapy / xpath)

Given a table with an unknown number of <tr> rows, each always containing three <td> cells, where the first cell is sometimes struck through (<s>) and that fact should be captured as an additional item (with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...]. I am currently trying something along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle it works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()

XPath to get siblings between two elements

With the following markup I need to get the middle tr's
<tr class="H03">
<td>Artist</td>
...
<tr class="row_alternate">
<td>LIMP</td>
<td>Orion</td>
...
</tr>
<tr class="row_normal">
<td>SND</td>
<td>Tender Love</td>
...
</tr>
<tr class="report_total">
<td> </td>
<td> </td>
...
</tr>
That is every sibling tr between <tr class="H03"> and <tr class="report_total">. I'm scraping using mechanize and nokogiri, so am limited to their xpath support. My best attempt after looking at various StackOverflow questions is
page.search('/*/tr[@class="H03"]/following-sibling::tr[count(. | /*/tr[@class="report_total"]/preceding-sibling::tr)=count(/*/tr[@class="report_total"]/preceding-sibling::tr)]')
which returns an empty array, and is so ridiculously complicated that my limited XPath fu is completely overwhelmed!
You can try the following XPath:
//tr[@class='H03']/following-sibling::tr[following-sibling::tr[@class='report_total']]
This XPath selects every <tr> that follows tr[@class='H03'] and that itself has a following sibling tr[@class='report_total']. In other words, the selected <tr> elements are located after the H03 row and before the report_total row.
Mechanize has a few helper methods that are useful here.
Presuming you are doing something like the following:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.website.com')
start_tr = page.at('.H03')
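At this point, start_tr will be a Nokogiri XML element for the first tr listed in your question.
You can then step to the next row with:
next_tr = start_tr.next_element
Repeat this until you hit the tr at which you want to stop, collecting each row along the way. Here next_element is used rather than next_sibling because it skips whitespace text nodes, and next_tr['class'] reads the value of the class attribute.
trs = []
while next_tr && next_tr['class'] != 'report_total'
  trs << next_tr
  next_tr = next_tr.next_element
end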
If you want the range to be inclusive of the start and stop trs (H03 and report_total) just tweak the code above to include them in the trs array.

Selecting checkbox from the line that contains word Capybara

I have been searching for an answer for a week.
I need to create automated tests, and one of them has to delete a row from a table.
The HTML for the row looks like this:
<tr>
<td colspan="1" class="name" rowspan="1">
Capybara852
</td>
<td colspan="1" class="description" rowspan="1">This whitelist is added by Capybara automated test</td>
<td colspan="1" class="whitelistType" rowspan="1">Internal</td>
<td colspan="1" class="status" rowspan="1">None</td>
<td colspan="1" class="active" rowspan="1">false</td>
<td colspan="1" class="msCount" rowspan="1">40</td>
<td colspan="1" class="modifiedBy" rowspan="1">admin</td>
<td colspan="1" class="modifiedOn" rowspan="1">26.06.2014 11:08</td>
<td colspan="1" class="selected" rowspan="1">
<input class="check" onclick="disableButton('delete', false);" id="check_0" name="check_0" type="checkbox">
<img id="check_0_icon" class="t-error-icon" alt="" src="/mpromoter/assets/4716a6a0a357181/core/spacer.gif" style="display: none;">
</td>
</tr>
I need to check the checkbox that is in the same row as the text "Capybara". I can't select it by id, because the id may be different every time I run these automated tests.
So my question is: how can I select a checkbox without using its own id, name, or class, based only on the fact that it is in the same row that contains "Capybara"?
It is a href part there.
I can select the checkbox by its id, but I want to select the checkbox with an unknown id from the row that contains the word "Capybara" :)
I have tried many different things but nothing works...
So I am asking for some help. What should I do? Any suggestions?
I am not sure I understand what you are asking, but...
You can take each row of your table and check whether the row contains the value 'Capybara':
$('#stack tbody tr').find('td a').text().match(new RegExp('Capybara'))
If that is true, do what you want with the checkbox in the same row.
I would need to set up too many things to give you a Capybara version, but you just need to translate that to Capybara.
Hope that helps.
I would suggest the approach you use be:
Determine how you can find the row you want (ie contains "Capybara")
Use the within method to find the checkbox within that row
For example, you can find any row that contains the word "Capybara" anywhere using:
find('tr', :text => 'Capybara')
You can use these same find options in within. When finding the checkbox within the within block, Capybara will only search in that row. So if you do not care where the word "Capybara" shows up in the row, you can do:
within('tr', :text => 'Capybara') do
  find('input.check').set(true)
end
If needed, you can change the within options to be more specific. For example, you might only want rows where the first column, which has class "name", has the word "Capybara" (rather than "Capybara" just being in the description column). This could be done with:
within(:xpath, '//tr[td[@class="name" and contains(., "Capybara")]]') do
  find('input.check').set(true)
end

Watir: Search table cell by bgcolor tag and get column number

Consider the following HTML as an example. It's a scratch sheet I made to practice, but it has a snippet of the real HTML I am trying to work with.
http://www.carbide-red.com/prog/test_table.html
I am trying to locate a column, and the only consistent identifier I can find is the background color (bgcolor).
<tr bgcolor="#ffffcc">
<td bgcolor="yellow" class="date" align=center>Equipment</td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/12/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/19/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/26/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>9/2/12</font></td>
<td bgcolor="red" align=center class="date"><font color=yellow>9/9/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/16/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/23/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/30/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>10/7/12</font></td>
</tr>
I'm trying to find the <td> that has bgcolor=red. I would then like to save the column index of that cell, so that I can then use it to select the same column of the following rows.
But I can't seem to find a way to search for the bgcolor attribute, and I have not been able to find a way to get Watir to report the column/row indexes so that I can store them in a variable. If I can find the bgcolor attribute, then I can search for something like "Equipment" and count cells until I reach the correct one.
I know the HTML is not ideal, since there is no "name" or other unique identifier, but I can't change that.
I am very new to Ruby and Watir. I tried to manipulate a website in Perl and it was not going very well; then I discovered Watir, which did exactly what I needed (and surprisingly easily). Now I am trying to understand Ruby as well as the finer semantics.
Thanks for any help!
To get text of <td bgcolor="red"> try this:
browser.element(:css => "td[bgcolor=red]").text
You should get back "9/9/12". To click the element, replace text with click.
To store its index in a variable named index, try this:
index = nil
browser.tds.each_with_index {|td, i| index = i if td.attribute_value("bgcolor") == "red" or td.attribute_value("bgcolor") == "#ff0000"}
index variable should be 5.
I would use nokogiri if I were you:
doc = Nokogiri::HTML @browser.html
td = doc.at('td[@bgcolor="red"]')
index = td.search('./preceding-sibling::td').length
Unless there's tricky javascript on the page you're probably better off with mechanize than watir.
Yes, the webpage I'm dealing with uses JavaScript, which is why I had a very hard time using Mechanize::Firefox under Perl. Watir worked much more smoothly.
Thank you for your suggestion! It didn't work at first, but it helped me with Google searches and I was able to get a working version.
require "watir"
require "nokogiri"
browser = Watir::Browser.new
browser.goto "http://www.carbide-red.com/prog/test_table.html"
doc = Nokogiri::HTML.parse(browser.html)
td = doc.at('td[@bgcolor="red"]')
columnindex = td.search('./preceding-sibling::td').length
puts columnindex
browser.close
This returned "5"
Update:
For the sake of others who may find this while searching and learning: to use the columnindex variable to find a specific column within a row, use this code:
textvariable = browser.td(:text => "A58004").parent.td(:index => columnindex).text
puts "Textvariable: #{textvariable}"
This finds a <td> that contains the text "A58004", then takes the cell at index 5 of the same row (the sixth cell, since indexes start at 0) and returns its text. Using the webpage linked in my original question, that would be "W=Sa".

XPath matching text in a table - Ruby - Nokogiri

I have a table that looks like this
<table cellpadding="1" cellspacing="0" width="100%" border="0">
<tr>
<td colspan="9" class="csoGreen"><b class="white">Bill Statement Detail</b></td>
</tr>
<tr style="background-color: #D8E4F6;vertical-align: top;">
<td nowrap="nowrap"><b>Bill Date</b></td>
<td nowrap="nowrap"><b>Bill Amount</b></td>
<td nowrap="nowrap"><b>Bill Due Date</b></td>
<td nowrap="nowrap"><b>Bill (PDF)</b></td>
</tr>
</table>
I am trying to create the XPath to find this table based on the fact that it contains the text "Bill Statement Detail". I want the entire table and not just the td.
Here is what I have tried so far:
page.parser.xpath('//table[contains(text(),"Bill")]')
page.parser.xpath('//table/tbody/tr[contains(text(),"Bill Statement Detail")]')
Any Help is appreciated
Thanks!
Your first XPath example is the closest in that you're selecting table. The second example, if it ever matched, would select tr—this one will not work mainly because, according to your example, the text you want is in a b node, not a tr node.
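This solution is as general as I could make it, because of *. Note that contains() converts a node-set argument to the string value of its first node in document order, so only the table's first descendant element is tested. If the target text will always be under b, change it to descendant::b: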
//table[contains(descendant::*, 'Bill Statement Detail')]
This is as specific, given the example, as I can make:
//table[tr[1]/td/b[. = 'Bill Statement Detail']]
You might want:
//table[contains(descendant::text(),"Bill Statement Detail")]
The suggested expressions don't work well if the matching text is not in the first row. See the related post "Find a table containing specific text".
