With the following markup, I need to get the middle tr elements:
<tr class="H03">
<td>Artist</td>
...
<tr class="row_alternate">
<td>LIMP</td>
<td>Orion</td>
...
</tr>
<tr class="row_normal">
<td>SND</td>
<td>Tender Love</td>
...
</tr>
<tr class="report_total">
<td> </td>
<td> </td>
...
</tr>
That is, every sibling tr between <tr class="H03"> and <tr class="report_total">. I'm scraping using mechanize and nokogiri, so I am limited to their XPath support. My best attempt after looking at various StackOverflow questions is
page.search('/*/tr[@class="H03"]/following-sibling::tr[count(. | /*/tr[@class="report_total"]/preceding-sibling::tr)=count(/*/tr[@class="report_total"]/preceding-sibling::tr)]')
which returns an empty array, and is so ridiculously complicated that my limited XPath fu is completely overwhelmed!
You can try the following XPath:
//tr[@class='H03']/following-sibling::tr[following-sibling::tr[@class='report_total']]
The XPath above selects every <tr> that follows tr[@class='H03'] and that itself has a following sibling tr[@class='report_total']. In other words, the selected <tr> elements are located after the H03 row and before the report_total row.
Mechanize has a few helper methods that would be useful here.
Presuming you are doing something like the following:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.website.com')
start_tr = page.at('.H03')
At this point, start_tr will be a Nokogiri XML element for the first tr listed in your question.
You can then iterate through siblings with:
next_tr = start_tr.next_element
Collect rows until you hit the tr at which you want to stop:
trs = []
# next_element skips the whitespace text nodes that next_sibling would return
until next_tr.nil? || next_tr['class'] == 'report_total'
  trs << next_tr
  next_tr = next_tr.next_element
end
If you want the range to be inclusive of the start and stop trs (H03 and report_total) just tweak the code above to include them in the trs array.
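For comparison, the same "collect siblings between two marker rows" idea can be sketched in Python using only the standard library. This is a rough illustration, not the Mechanize/Nokogiri code from the answer: the sample markup is a trimmed-down, hypothetical version of the question's table, and `rows_between` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's markup (wrapped in a table).
HTML = """
<table>
  <tr class="H03"><td>Artist</td></tr>
  <tr class="row_alternate"><td>LIMP</td><td>Orion</td></tr>
  <tr class="row_normal"><td>SND</td><td>Tender Love</td></tr>
  <tr class="report_total"><td> </td><td> </td></tr>
</table>
"""


def rows_between(table, start_class, stop_class):
    """Return the tr elements strictly between the two marker rows."""
    collected = []
    inside = False
    for tr in table.findall("tr"):
        css_class = tr.get("class")
        if css_class == stop_class:
            break  # reached the closing marker row
        if inside:
            collected.append(tr)
        elif css_class == start_class:
            inside = True  # start collecting from the next row
    return collected


table = ET.fromstring(HTML)
middle = rows_between(table, "H03", "report_total")
print([tr.get("class") for tr in middle])
```

Making the range inclusive would only require appending the marker rows themselves before and after the loop.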
Given a table with an unknown number of <tr> rows but always three <td> cells per row. The first cell is sometimes wrapped in a strikethrough (<s>), which should be captured as an additional item (with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...]. I am currently trying something along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle it works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
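The per-row shape the question asks for can be illustrated with a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and `scrape_rows` is a hypothetical helper name. In Scrapy the same per-row checks could presumably be written with relative XPaths such as `./td[1]/s`, but that is not tested here.

```python
import xml.etree.ElementTree as ET

# Sample table copied from the question, without the elided rows.
HTML = """
<table id="my_id">
  <tr><td>A1</td><td>A2</td><td>A3</td></tr>
  <tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
"""


def scrape_rows(table):
    """Return one [cell1, cell2, cell3, struck] list per table row."""
    records = []
    for tr in table.findall("tr"):
        cells = tr.findall("td")
        # itertext() also picks up text nested inside <s>.
        texts = ["".join(td.itertext()).strip() for td in cells]
        # Flag is 1 when the first cell contains a direct <s> child.
        struck = 1 if cells and cells[0].find("s") is not None else 0
        records.append(texts + [struck])
    return records


table = ET.fromstring(HTML)
rows = scrape_rows(table)
print(rows)
```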
<html>
<table border="1">
<tbody>
<tr>
<td>
<table border="1">
<tbody>
<tr>
<th>aaa</th>
<th>bbb</th>
<th>ccc</th>
<th>ddd</th>
<th>eee</th>
<th>fff</th>
</tr>
<tr>
<td>111</td>
<td>222</td>
<td>333</td>
<td>444</td>
<td>555</td>
<td>666</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</html>
How can I select specific related "cousin" data using XPath? The desired output would be:
<th>aaa</th>
<th>ccc</th>
<th>fff</th>
<td>111</td>
<td>333</td>
<td>666</td>
The most important aspect of the XPath is that I want to be able to include or exclude certain <th> tags and their corresponding <td> tags.
Based on the answers so far, the closest I have is:
//th[not(contains(text(), "ddd"))] | //tr[2]/td[not(position()=4)]
Is there any way to avoid using position()=4 explicitly and reference the corresponding th tag instead?
Using XPath 3.0 you can structure that into
let $th := //table/tbody/tr[1]/th,
$filteredTh := $th[not(. = ("bbb", "ddd", "eee"))],
$pos := $filteredTh!index-of($th, .)
return ($filteredTh, //table/tbody/tr[position() gt 1]/td[position() = $pos])
I'm not sure that this is the best solution, but you might try (this needs XPath 2.0 for index-of)
//th[not(.="bbb") and not(.="ddd") and not(.="eee")] | //tr[2]/td[not(position()=index-of(//th, "bbb")) and not(position()=index-of(//th, "ddd")) and not(position()=index-of(//th, "eee"))]
or shorter version
//th[not(.=("bbb", "ddd", "eee"))]| //tr[2]/td[not(position()=(index-of(//th, "bbb"), index-of(//th, "ddd"),index-of(//th, "eee")))]
that returns
<th>aaa</th>
<th>ccc</th>
<th>fff</th>
<td>111</td>
<td>333</td>
<td>666</td>
You can avoid complicated XPath expressions to get the required output. Try using Python + Selenium features instead:
# find_elements_by_xpath was removed in Selenium 4; use the By locator API
from selenium.webdriver.common.by import By

# Get list of th elements
th_elements = driver.find_elements(By.XPATH, '//th')
# Get list of td elements
td_elements = driver.find_elements(By.XPATH, '//tr[2]/td')
# Get indexes of required th elements - [0, 2, 5]
ok_index = [i for i, th in enumerate(th_elements) if th.text not in ('bbb', 'ddd', 'eee')]
for i in ok_index:
    print(th_elements[i].text)
for i in ok_index:
    print(td_elements[i].text)
Output is
aaa
ccc
fff
111
333
666
If you need an XPath 1.0 solution (no sequences or index-of, so each header is tested separately):
//th[not(.="bbb" or .="ddd" or .="eee")] | //tr[2]/td[not(position()=count(//th[.="bbb"]/preceding-sibling::th)+1 or position()=count(//th[.="ddd"]/preceding-sibling::th)+1 or position()=count(//th[.="eee"]/preceding-sibling::th)+1)]
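An XPath 1.0 expression of this kind, with one `count(.../preceding-sibling::th)+1` test per excluded header, gets long quickly, so it may be easier to generate it from a list. The sketch below only builds the expression string. `build_xpath` is a hypothetical helper name, it keeps the `//tr[2]` assumption from the answers above, and it does not escape header names that contain double quotes.

```python
def build_xpath(excluded):
    """Build an XPath 1.0 expression that drops the given header columns."""
    if not excluded:
        raise ValueError("need at least one header name to exclude")
    # One equality test per excluded header text.
    th_tests = " or ".join(f'.="{name}"' for name in excluded)
    # A td is excluded when its position equals the 1-based index of an
    # excluded th, computed as the number of preceding th siblings plus one.
    td_tests = " or ".join(
        f'position()=count(//th[.="{name}"]/preceding-sibling::th)+1'
        for name in excluded
    )
    return f"//th[not({th_tests})] | //tr[2]/td[not({td_tests})]"


expression = build_xpath(["bbb", "ddd", "eee"])
print(expression)
```

The resulting string can be passed to any XPath 1.0 engine. It contains no sequences and no `index-of`, both of which need XPath 2.0 or later.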
Could you please help me with this XPath? Let's say I have the following HTML code:
<table>
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
<tr>
<td>value2</td>
<td>text</td>
</tr>
</table>
I need to build an XPath that will pick the <tr> that has a <td> with the value text AND a class attribute equal to clickable.
I tried the following XPath expressions:
//tr[contains(.,'text')][contains(./td/@class,'clickable')]
//tr[contains(.,'text')][contains(td/@class,'clickable')]
but neither of them worked.
Any help is appreciated
Thanks
You are almost there:
//tr[contains(td/@class,'clickable') and contains(td, 'text')]
Demo using xmllint:
$ xmllint input.xml --xpath "//tr[contains(td/@class,'clickable') and contains(td, 'text')]"
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
If you want a tr with a td whose value is text and a td (possibly a different one) whose class attribute equals clickable, use @alecxe's answer.
If a single td must satisfy both conditions, then:
//tr[td[.='text' and @class='clickable']]
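The difference between the two readings (both conditions on one td, versus the conditions spread over any tds of the row) is easy to miss, so here is a small Python sketch of it. It uses the standard library instead of an XPath engine, and the third row is a hypothetical addition to the question's table, there only to make the two readings disagree. `same_cell` and `any_cells` are made-up names.

```python
import xml.etree.ElementTree as ET

HTML = """
<table>
  <tr><td class="clickable">text</td><td>value1</td></tr>
  <tr><td>value2</td><td>text</td></tr>
  <tr><td class="clickable">other</td><td>text</td></tr>
</table>
"""


def same_cell(tr):
    """True when a single td has class 'clickable' and the exact text 'text'."""
    return any(
        td.get("class") == "clickable" and td.text == "text"
        for td in tr.findall("td")
    )


def any_cells(tr):
    """True when some td has the class and some td (maybe another) has the text."""
    tds = tr.findall("td")
    has_class = any("clickable" in (td.get("class") or "") for td in tds)
    has_text = any("text" in "".join(td.itertext()) for td in tds)
    return has_class and has_text


rows = ET.fromstring(HTML).findall("tr")
strict = [i for i, tr in enumerate(rows) if same_cell(tr)]
loose = [i for i, tr in enumerate(rows) if any_cells(tr)]
print(strict, loose)
```

On this sample the strict check keeps only the first row, while the loose check also keeps the added third row.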
So I have some HTML that looks like this:
<tr class="a">
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>....</td>
<td class="b">A</td>
</tr>
<tr>....</tr>
<tr class="a">
<td class="b">B</td>
<td>....</td>
</tr>
<tr>
<td class="b">Not this</td>
<td>....</td>
</tr>
I basically want to find the first td with class b following a tr with class a. The problem is that it could be either in a child of that tr or in the next tr after it.
I can get the second case with:
//tr[@class="a"]//td[@class="b"]
But that misses the first case, because the td is in a sibling row, not a descendant. Ideas?
For the 2nd case (td is a descendant of the tr):
//tr[@class="a"]//td[@class="b"][1]
For the 1st case (td follows the tr, and the tr does not fall into the 2nd case):
//tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Combining the two XPath queries with the union operator (|) yields the expected output:
//tr[@class="a"]//td[@class="b"][1] | //tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
output :
Element='<td class="b">A</td>'
Element='<td class="b">B</td>'
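The intent of that union (for each tr with class a, take the first td with class b at or after it in document order) can also be sketched as a single pass over the document. This is a rough Python illustration using only the standard library, not a translation of the XPath. The sample fills the question's elided cells with a placeholder `x`, and `first_b_after_each_a` is a made-up name.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a table, with elided cells shown as "x".
HTML = """
<table>
  <tr class="a"><td>x</td><td>x</td></tr>
  <tr><td>x</td><td class="b">A</td></tr>
  <tr><td>x</td></tr>
  <tr class="a"><td class="b">B</td><td>x</td></tr>
  <tr><td class="b">Not this</td><td>x</td></tr>
</table>
"""


def first_b_after_each_a(root):
    """Collect the text of the first td.b found after each tr.a."""
    found = []
    waiting = False
    # iter() walks elements in document order, parents before children.
    for el in root.iter():
        if el.tag == "tr" and el.get("class") == "a":
            waiting = True
        elif waiting and el.tag == "td" and el.get("class") == "b":
            found.append(el.text)
            waiting = False  # ignore later td.b until the next tr.a
    return found


cells = first_b_after_each_a(ET.fromstring(HTML))
print(cells)
```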
I have a NodeSet of a table that looks similar to this:
<table cellpadding="1" cellspacing="0" width="100%" border="0">
<tr>
<td colspan="9" class="csoGreen"><b class="white">Bill Statement Detail</b></td>
</tr>
<tr>
<td><b>Bill Date</b></td>
<td><b>Bill Amount</b></td>
<td><b>Bill Due Date</b></td>
<td><b>Bill (PDF)</b></td>
</tr>
<tr vAlign="top">
<td>blahA</td>
<td>blahB</td>
<td>blahC</td>
<td>View Bill</td>
</tr>
Now I plan on looping through each onclick in the table.
I've been attempting to loop through the NodeSet unsuccessfully.
I ended up with many failed attempts, but I imagine it would end up looking something like this:
doc_list.each_element ("//a[td/text()='onclick']/@href") do | |
#here I want to scan and save BlahA into a Variable
end
You want to iterate through everything with an onclick? Maybe:
doc.css('*[onclick]').each do |el|
puts el[:onclick]
end
Edit: what you probably really want is the first td of every row, starting with row 3. In that case:
table.css('td[1]')[2..-1].each do |td|
puts td.text
end
The key to doing this efficiently is not in your question, but in your comment "I want to extract the first td in the tr where there is an onclick".
This expression does exactly that:
doc.xpath('//tr[td/a/@onclick]/td[1]/text()')
In fact this will give you the set of all such matches. No iteration needed.
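To make the selection logic concrete, here is a minimal Python sketch of the same filter: keep the rows that contain an `a` element with an `onclick` attribute inside a td, then take the text of the first td. It uses the standard library's limited XPath subset. The sample rows, the `onclick` value and the `href` are hypothetical, since the question's markup does not show the links.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows; only the second one has a link with an onclick handler.
HTML = """
<table>
  <tr><td><b>Bill Date</b></td><td><b>Bill (PDF)</b></td></tr>
  <tr><td>blahA</td><td><a href="#" onclick="viewBill(1)">View Bill</a></td></tr>
  <tr><td>blahD</td><td>n/a</td></tr>
</table>
"""

table = ET.fromstring(HTML)
first_cells = [
    tr.find("td").text                        # text of the row's first td
    for tr in table.iter("tr")
    if tr.find("td/a[@onclick]") is not None  # row has td/a with onclick
]
print(first_cells)
```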