XPATH - Ruby - Nokogiri - Nodeset - ruby

I have a NodeSet of a table that looks similar to this:
<table cellpadding="1" cellspacing="0" width="100%" border="0">
<tr>
<td colspan="9" class="csoGreen"><b class="white">Bill Statement Detail</b></td>
</tr>
<tr>
<td><b>Bill Date</b></td>
<td><b>Bill Amount</b></td>
<td><b>Bill Due Date</b></td>
<td><b>Bill (PDF)</b></td>
</tr>
<tr vAlign="top">
<td>blahA</td>
<td>blahB</td>
<td>blahC</td>
<td>View Bill</td>
</tr>
Now I plan on looping through each onclick in the table.
I've been attempting to loop through the NodeSet unsuccessfully.
After many failed attempts, I imagine it would end up looking something like this:
doc_list.each_element ("//a[td/text()='onclick']/@href") do | |
#here I want to scan and save blahA into a variable
end

You want to iterate through everything with an onclick? Maybe:
doc.css('*[onclick]').each do |el|
puts el[:onclick]
end
Edit: what you probably really want is the first td of every row, starting with row 3. In that case:
table.css('td[1]')[2..-1].each do |td|
puts td.text
end

The key to doing this efficiently is not in your question, but in your comment "I want to extract the first td in the tr where there is an onclick".
This expression does exactly that:
doc.xpath('//tr[td/a/@onclick]/td[1]/text()')
In fact this will give you the set of all such matches. No iteration needed.

Related

correct way to scrape this table (using scrapy / xpath)

Given a table with an unknown number of <tr> rows but always three <td> cells per row, where the first cell is sometimes struck through (<s>), which should be captured as an additional item (with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...]. I currently try along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle it works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()

XPath: Find first occurrence in children and siblings

So I have some HTML that looks like this:
<tr class="a">
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>....</td>
<td class="b">A</td>
</tr>
<tr>....</tr>
<tr class="a">
<td class="b">B</td>
<td>....</td>
</tr>
<tr>
<td class="b">Not this</td>
<td>....</td>
</tr>
I basically want to find the first td with class b following a tr with class a. The problem is that it could be either in a child of that tr or in the next tr after it.
I can get the second case with:
//tr[@class="a"]//td[@class="b"]
But that misses the first case, because the td is in a sibling row, not a descendant. Ideas?
For the 2nd case (the td is a descendant of the tr):
//tr[@class="a"]//td[@class="b"][1]
For the 1st case (the td follows the tr), restricted to rows that don't already fall into the 2nd case:
//tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Combining the two XPath queries with the union operator (|) yields the expected output:
//tr[@class="a"]//td[@class="b"][1] | //tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Output:
Element='<td class="b">A</td>'
Element='<td class="b">B</td>'

XPath to get siblings between two elements

With the following markup I need to get the middle tr's
<tr class="H03">
<td>Artist</td>
...
<tr class="row_alternate">
<td>LIMP</td>
<td>Orion</td>
...
</tr>
<tr class="row_normal">
<td>SND</td>
<td>Tender Love</td>
...
</tr>
<tr class="report_total">
<td> </td>
<td> </td>
...
</tr>
That is every sibling tr between <tr class="H03"> and <tr class="report_total">. I'm scraping using mechanize and nokogiri, so am limited to their xpath support. My best attempt after looking at various StackOverflow questions is
page.search('/*/tr[@class="H03"]/following-sibling::tr[count(. | /*/tr[@class="report_total"]/preceding-sibling::tr)=count(/*/tr[@class="report_total"]/preceding-sibling::tr)]')
which returns an empty array, and is so complicated that my limited XPath fu is completely overwhelmed!
You can try the following XPath:
//tr[@class='H03']/following-sibling::tr[following-sibling::tr[@class='report_total']]
The XPath above selects every <tr> that follows tr[@class='H03'] and itself has a following sibling tr[@class='report_total']; in other words, the selected <tr> elements sit before tr[@class='report_total'].
Mechanize has a few helper methods that are useful here.
Presuming you are doing something like the following:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.website.com')
start_tr = page.at('.H03')
At this point, start_tr will be a Nokogiri XML element for the first tr you list in your question.
You can then step to the next sibling element with:
next_tr = start_tr.next_element
Do this until you hit the tr at which you want to stop.
trs = []
# collect each row until the report_total row (or the end of the siblings)
until next_tr.nil? || next_tr['class'] == 'report_total'
  trs << next_tr
  next_tr = next_tr.next_element
end
If you want the range to be inclusive of the start and stop trs (H03 and report_total), just tweak the code above to include them in the trs array.

XPath matching text in a table - Ruby - Nokogiri

I have a table that looks like this
<table cellpadding="1" cellspacing="0" width="100%" border="0">
<tr>
<td colspan="9" class="csoGreen"><b class="white">Bill Statement Detail</b></td>
</tr>
<tr style="background-color: #D8E4F6;vertical-align: top;">
<td nowrap="nowrap"><b>Bill Date</b></td>
<td nowrap="nowrap"><b>Bill Amount</b></td>
<td nowrap="nowrap"><b>Bill Due Date</b></td>
<td nowrap="nowrap"><b>Bill (PDF)</b></td>
</tr>
</table>
I am trying to create the XPath to find this table where it contains the text Bill Statement Detail. I want the entire table and not just the td.
Here is what I have tried so far:
page.parser.xpath('//table[contains(text(),"Bill")]')
page.parser.xpath('//table/tbody/tr[contains(text(),"Bill Statement Detail")]')
Any Help is appreciated
Thanks!
Your first XPath example is the closest in that you're selecting table. The second example, if it ever matched, would select tr—this one will not work mainly because, according to your example, the text you want is in a b node, not a tr node.
This solution is as vague as I could make it, because of *. If the target text will always be under b, change it to descendant::b:
//table[contains(descendant::*, 'Bill Statement Detail')]
This is as specific as I can make it, given the example:
//table[tr[1]/td/b[.='Bill Statement Detail']]
You might want
//table[contains(descendant::text(),"Bill Statement Detail")]
The suggested expressions don't work well if the matching text is not in the first row. See the related post "Find a table containing specific text".

xpath expression to find url and data

I want to get the values in the table and the href value for every <a> within the table given below.
Being new to XPath, I am finding it difficult to write the XPath expression.
Understanding what an existing XPath expression does is somewhat easier, however.
The expected output:
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class="dataTable">
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query which locates the wanted nodes and a second which outputs the text as you want it. Since you don't give more information about the languages you're using I'm putting together some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@href]")
    write node.select("concat(td/a/@href,' ',.)")
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@href | //tr//text()

Resources