"<i>" interrupting correct node selection - xpath

I'm trying to select a table field with the following structure:
<td class='postac'>proszek do sporz. roztworu do wlewu <I>i.v.</I>
1,5 g
1 fiol. typu Monovial
</td>
After using the XPath expression sel.xpath("//table[@class='table-postaci']/tbody/tr/td[2]/text()").extract() I get two values instead of one:
u'proszek do sporz. roztworu do wlewu ',
u'\r\n 1,5 g\r\n 1 fiol. typu Monovial\r\n '
Is there a clean XPath way to get this "td" field as a single value? I know I could get the field with //table[@class='table-postaci']/tbody/tr/td[2] and then strip the tags in the scrapy pipeline. However, I'm looking for a simpler solution. Thank you

You can loop over each table row tr and for each row join all text node descendants of the 2nd td cell:
In [13]: from scrapy.selector import Selector
In [14]: selector = Selector(text="""<table class='table-postaci'>
....: <thead><th>Nazwa preparatu</th><th>Postać i dawka</th><th>Producent</th><th>Cena 100%</th>
....: <th>Odpłatność po refundacji</th>
....: </thead>
....: <tbody>
....:
....: <tr>
....: <td class='postac'>Zinacef </td>
....: <td class='postac'>proszek do sporz. roztworu do wlewu <I>i.v.</I>
....: 1,5 g
....: 1 fiol. typu Monovial
....: </td>
....: <td>GlaxoSmithKline – Wielka Brytania</td>
....: <td class='cena'> b/d </td>
....: <td>
....: </td>
....: </tr>
....: <tr>
....: <td class='postac'>Zinacef </td>
....: <td class='postac'>proszek do sporz. roztworu do wlewu <I>i.v.</I>
....: 750 mg
....: 1 fiol. typu Monovial
....: </td>
....: <td>GlaxoSmithKline – Wielka Brytania</td>
....: <td class='cena'> b/d </td>
....: <td>
....: </td>
....: </tr>
....: </tbody>
....: </table>""")
In [15]: selector.xpath('//table/tr')
Out[15]: []
In [16]: selector.xpath('//table//tr')
Out[16]:
[<Selector xpath='//table//tr' data=u'<tr><td class="postac">Zinacef </td>\n\t\t<'>,
<Selector xpath='//table//tr' data=u'<tr><td class="postac">Zinacef </td>\n\t\t<'>]
In [17]: for row in selector.xpath('//table//tr'):
....: print row.xpath('td[2]//text()').extract()
....:
[u'proszek do sporz. roztworu do wlewu ', u'i.v.', u'\n 1,5 g\n 1 fiol. typu Monovial\n ']
[u'proszek do sporz. roztworu do wlewu ', u'i.v.', u'\n 750 mg\n 1 fiol. typu Monovial\n ']
In [18]: [u''.join(row.xpath('td[2]//text()').extract()) for row in selector.xpath('//table//tr')]
Out[18]:
[u'proszek do sporz. roztworu do wlewu i.v.\n 1,5 g\n 1 fiol. typu Monovial\n ',
u'proszek do sporz. roztworu do wlewu i.v.\n 750 mg\n 1 fiol. typu Monovial\n ']
In [19]:

You should avoid /text() for exactly this reason. Usually you don't want the individual text nodes, you want the string value of the element, which you can get with the string() function. It's not clear what programming language you are calling XPath from, or whether it's XPath 1.0 or 2.0 - that will affect the detail, e.g. whether to get the string value of the element in the XPath expression or in the host language.
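A minimal sketch of the "string value" idea in Python, using only the standard library's xml.etree.ElementTree rather than scrapy. The markup is a hypothetical, well-formed cut-down of the table in the question, and cell_string_value is an illustrative helper name, not part of any library. It concatenates all descendant text of the second cell in each row and collapses whitespace, roughly what normalize-space(string(.)) would give in XPath:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the table in the question.
MARKUP = """<table class="table-postaci"><tbody>
<tr><td class="postac">Zinacef </td><td class="postac">proszek do sporz. roztworu do wlewu <i>i.v.</i>
 1,5 g
 1 fiol. typu Monovial
</td></tr>
</tbody></table>"""


def cell_string_value(cell):
    """Join every descendant text node of the cell, then collapse whitespace."""
    return " ".join("".join(cell.itertext()).split())


root = ET.fromstring(MARKUP)
# For each row, take the second <td> and compute its string value.
values = [cell_string_value(row.findall("td")[1]) for row in root.iter("tr")]
print(values)
```

ElementTree only accepts well-formed markup, so for real-world HTML a lenient parser would be needed; the joining step stays the same.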

The td node in your question has three child nodes – first a text node with the contents:
proszek do sporz. roztworu do wlewu
second an I element node that has its own child text node, and last another text node with the contents:
\n 1,5 g\n 1 fiol. typu Monovial\n
Your query, the end of which looks like td[2]/text(), only selects the immediate text node children of the td element, so it doesn’t select the I element node or its text node child. The result is the two text nodes that you are seeing.
You could select all text node descendants of the td element using td[2]//text() (note the double slash //). This will return three text nodes in the result – the two as above and a third containing i.v. in between them. You could then join them outside XPath (I’m not familiar with scrapy so I can’t tell you how to do that in this case).
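As far as I know, XPath 1.0 has no function for joining an arbitrary set of nodes, although string(td[2]) (or normalize-space(td[2])) does return the concatenated text of a single td element. XPath 2.0 adds string-join() for the general case.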
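To make the difference between td/text() and td//text() concrete, here is a small sketch in Python using the standard library's xml.etree.ElementTree (an assumption on my part; the question uses scrapy). ElementTree has no text-node objects, so the direct text children are approximated by td.text plus the .tail of each child element:

```python
import xml.etree.ElementTree as ET

td = ET.fromstring(
    "<td class='postac'>proszek do sporz. roztworu do wlewu <I>i.v.</I>\n"
    " 1,5 g\n 1 fiol. typu Monovial\n</td>"
)

# Analogue of td/text(): only text that is a direct child of td.
# ElementTree keeps it as td.text plus the .tail of every child element.
direct = [t for t in [td.text] + [child.tail for child in td] if t]

# Analogue of td//text(): every piece of text at any depth, in document order.
descendants = list(td.itertext())

# Joining outside XPath gives the single value the question asks for.
joined = "".join(descendants)

print(direct)
print(descendants)
print(repr(joined))
```

The first list has two entries (the text inside the I element is missing), the second has three, which matches the behaviour described above.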

Related

xpath selecting text after a certain element or between elements

I'm trying to use xpath to select all the text within the elements:
between the h3 elements "Hay Point" and "Dalrymple Bay"
after h3 element "Dalrymple Bay"
I've got this xpath syntax working which selects all the text within the td tags after <h3>Dalrymple Bay Coal Terminal</h3>.
.//h3[2]/following::td/text()
But I'm having trouble figuring out how to select all the text between the tags that fall between <h3>Hay Point Coal Terminal</h3> and <h3>Dalrymple Bay Coal Terminal</h3>
A sample of the structure of the html is below:
<h3>Hay Point Coal Terminal</h3>
<tr role="row" class="odd"><td headers="table06762r1c1" tabindex="0">July
</td><td style="text-align: left;"
headers="table06762r1c2">4,517,445</td>
<td headers="table06762r1c3">4,261,253</td>
<td headers="table06762r1c4">4,057,239</td>
<td headers="table06762r1c5">3,535,507</td>
</tr>
<h3>Dalrymple Bay Coal Terminal</h3>
<tr><td headers="table06762r1c1">July</td><td style="text-align: left;"
headers="table06762r1c2">5,462,591</td>
<td headers="table06762r1c3">5,625,700</td>
<td headers="table06762r1c4">5,816,977</td>
<td headers="table06762r1c5">5,396,644</td>
</tr>
If I understand your question correctly and given the html in the question, in order to get text nodes related to the <h3>Hay Point Coal Terminal</h3> node, try:
//h3[1]/following-sibling::tr[1]/td/text()
Output:
July
4,517,445
4,261,253
4,057,239
3,535,507
To get those related to the <h3>Dalrymple Bay Coal Terminal</h3> node, use:
//h3[2]/following-sibling::tr[1]/td/text()
or just
//h3[2]/following-sibling::tr/td/text()
Output:
July
5,462,591
5,625,700
5,816,977
5,396,644
To get both:
//h3/following-sibling::tr/td/text()
Assuming you want to group those you would do something like:
for h3 in response.css('h3'):
    item = {
        "h3": h3.css('*::text').extract()[0],
        "tds": h3.css('* + tr td::text').extract()
    }
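The grouping can also be done without scrapy by walking the siblings in document order and attaching each row to the most recent heading. The sketch below uses the standard library's xml.etree.ElementTree; the markup is a hypothetical, trimmed version of the sample wrapped in a div so that it is well-formed, and group_cells_by_heading is an illustrative name:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of the sample, wrapped in <div>
# so that ElementTree can parse it as well-formed XML.
MARKUP = """<div>
<h3>Hay Point Coal Terminal</h3>
<tr><td>July
</td><td>4,517,445</td><td>4,261,253</td></tr>
<h3>Dalrymple Bay Coal Terminal</h3>
<tr><td>July</td><td>5,462,591</td><td>5,625,700</td></tr>
</div>"""


def group_cells_by_heading(root):
    """Map each h3 text to the cell texts of the rows that follow it,
    up to (but not including) the next h3."""
    groups = {}
    current = None
    for node in root:
        if node.tag == "h3":
            current = "".join(node.itertext()).strip()
            groups[current] = []
        elif node.tag == "tr" and current is not None:
            groups[current].extend(
                "".join(td.itertext()).strip() for td in node.iter("td")
            )
    return groups


groups = group_cells_by_heading(ET.fromstring(MARKUP))
for heading, cells in groups.items():
    print(heading, cells)
```

Because each row is assigned to the latest heading seen, this also covers the "between the two h3 elements" case from the question.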

how to get value of a node in nokogiri

I have this
1.9.3-p286 :073 > doc.css("tr[class~=strong]").children[3].children
=> [#<Nokogiri::XML::Element:0x3fee5e077e98 name="a" attributes=[#
<Nokogiri::XML::Attr:0x3fee5e077dd0 name="href"
value="http://somelink">]>]
Sample html:
<tr class='strong bf highbeam'>
<td>December 6th</td>
<td>Foo</td>
<td><a href='http://somelink' title='bar'>December 6th 2012 Episode</a></td>
<td><a href='http://somelink/#disqus_thread'></a></td>
</tr>
How can I fetch the value http://somelink at this point?
Don't use children, refine your css selector until you get the element you want:
doc.at('tr.strong a')[:href]

how do i create random letters in selenium ide?

I'm not an expert in Selenium IDE, I want to declare an array in Selenium IDE HTML and call it in the next line.
<tr>
<td>storeEval</td>
<td>new Array('en','de','da','cs','fi','fr','it','ja','ko','nl','no','pl','pt','ru','sv','tr')</td>
<td>myArray</td>
</tr>
<tr>
<td>type</td>
<td>FieldName</td>
<td>${myArray}</td>
</tr>
Thanks
The code below randomly selects an item from the array and types it into the element with id=FieldName:
<tr>
<td>storeEval</td>
<td>var chars = 'en de da cs fi fr it ja ko nl no pl pt ru sv tr'.split(' '); str = chars[Math.floor(Math.random() * chars.length)];</td>
<td>item</td>
</tr>
<tr>
<td>type</td>
<td>FieldName</td>
<td>${item}</td>
</tr>
To access an item from your initial array (let's say the third item, at index 2), you can add one more command:
<tr>
<td>storeEval</td>
<td>new Array('en','de','da','cs','fi','fr','it','ja','ko','nl','no','pl','pt','ru','sv','tr')</td>
<td>myArray</td>
</tr>
<tr>
<td>getEval</td>
<td>storedVars['item'] = storedVars['myArray'][2]</td>
<td></td>
</tr>
<tr>
<td>type</td>
<td>FieldName</td>
<td>${item}</td>
</tr>
You can pass a random int in the range [0 .. length_of_array - 1] to storedVars['myArray'][randomInt] to retrieve values randomly.

Advice for replacing img tags with text in Ruby?

I'm trying to work out how to store an HTML table of drive stats in a database, but the developers have been a bit clever and started using GIFs to represent pass/fail/health stats.
Here's a snippet of what I've got:
<tr class="status">
<td class="status"><img border="0" src="/tick_green.gif"></td>
<td class="status">8</td>
<td class="status">Ready</td>
<td class="status"><img border="0" src="/bar10.gif"></td>
<td class="status">SEAGATE ST3146807FC</td>
<td class="status">10000 RPM</td>
<td class="status">3HY61AG9</td>
<td class="status">XR12</td>
<td class="status">286749488</td>
<td class="status"> 28.0°C</td>
<td class="status" style="background-color: #00fa00"> 
</td>
And here's some of the ruby that I've written so far to strip the tags:
table = page.parser.xpath('//table/caption[contains(.,"Drive")]/..')
table.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    puts cell.to_html.gsub(/<a[^>]+>/,'').gsub(/<td[^>]+>/,'').gsub(/<\/td[^>]*>/,'').gsub(/<\/a[^>]*>/,'')
    #puts cell.text
  end
end
I can now get semi-rational output
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
But I want to replace a couple of other cell elements with other bits
For example, tick_green.gif can also be '/cross_red.gif' or '/caution.gif', which I want to replace with regular text. Likewise, I want to replace the bar10.gif image with just the text '10'.
Is it best to come up with a whole bunch of values for all of my special cases?
I'd do some 'gsub'iing.
E.g.:
example = <<-STRING
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
STRING
replace = Hash.new("#unknown")
replace['tick_green.gif'] = "[OK]"
replace['bar10.gif'] = "[10]"
regex = /<img [^>]* src="\/(.*)">/
result = example.gsub(regex) { replace[$1] }
I'd like to replace the $1 with a named backreference, but I don't know how yet.
http://ruby-doc.org/core-1.9.3/String.html#method-i-gsub
edit: result from above
[OK]
15
Ready
[10]
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
A case statement will clean that up a little bit:
row.css('td').each do |td|
  img = td.at('img')
  puts case
    when img && img[:src][/bar(\d+)\.gif/] then $1
    when img && img[:src][/tick_green/] then 'ok'
    else td.text.strip
  end
end
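For comparison, the lookup-table idea from the gsub answer can be sketched in Python rather than Ruby, this time with a named capture group. The replacement labels are the ones used in that answer; replace_images and the sample string are hypothetical, and unknown images fall back to "#unknown" in the same way the Ruby Hash default does:

```python
import re

# Image file name -> replacement text (labels taken from the gsub answer).
REPLACEMENTS = {"tick_green.gif": "[OK]", "bar10.gif": "[10]"}

# The named group "file" captures the file name after the leading slash in src.
IMG_TAG = re.compile(r'<img\b[^>]*\bsrc="/(?P<file>[^"]+)"[^>]*>')


def replace_images(text):
    """Swap each <img> tag for its mapped text, or "#unknown" if unmapped."""
    return IMG_TAG.sub(
        lambda match: REPLACEMENTS.get(match.group("file"), "#unknown"), text
    )


sample = (
    '<img border="0" src="/tick_green.gif">\n15\nReady\n'
    '<img border="0" src="/bar10.gif">\n'
    '<img border="0" src="/caution.gif">'
)
result = replace_images(sample)
print(result)
```

Adding a new special case then only means adding one entry to the dictionary.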

How do I parse a plain HTML table with Nokogiri?

I'd like to parse a HTML page with the Nokogiri. There is a table in part of the page which does not use any specific ID. Is it possible to extract something like:
Today,3,455,34
Today,1,1300,3664
Today,10,100000,3444
Yesterday,3454,5656,3
Yesterday,3545,1000,10
Yesterday,3411,36223,15
From this HTML:
<div id="__DailyStat__">
<table>
<tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
<tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
<tr class="blr">
<td>3</td>
<td>455</td>
<td>34</td>
<td class="r">3454</td>
<td class="r">5656</td>
<td class="r">3</td>
</tr>
<tr class="bla">
<td>1</td>
<td>1300</td>
<td>3664</td>
<td class="r">3545</td>
<td class="r">1000</td>
<td class="r">10</td>
</tr>
<tr class="blr">
<td>10</td>
<td>100000</td>
<td>3444</td>
<td class="r">3411</td>
<td class="r">36223</td>
<td class="r">15</td>
</tr>
</table>
</div>
As a quick and dirty first pass I'd do:
html = <<EOT
<div id="__DailyStat__">
<table>
<tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
<tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
<tr class="blr">
<td>3</td>
<td>455</td>
<td>34</td>
<td class="r">3454</td>
<td class="r">5656</td>
<td class="r">3</td>
</tr>
<tr class="bla">
<td>1</td>
<td>1300</td>
<td>3664</td>
<td class="r">3545</td>
<td class="r">1000</td>
<td class="r">10</td>
</tr>
<tr class="blr">
<td>10</td>
<td>100000</td>
<td>3444</td>
<td class="r">3411</td>
<td class="r">36223</td>
<td class="r">15</td>
</tr>
</table>
</div>
EOT
# Today Yesterday
# Qnty Size Length Length Size Qnty
# 3 455 34 3454 5656 3
# 1 1300 3664 3545 1000 10
# 10 100000 3444 3411 36223 15
require 'nokogiri'
doc = Nokogiri::HTML(html)
Use CSS to find the start of the table, and define some places to hold the data we're capturing:
table = doc.at('div#__DailyStat__ table')
today_data = []
yesterday_data = []
Loop over the rows in the table, rejecting the headers:
table.search('tr').each do |tr|
  next if (tr['class'] == 'blh')
Initialize arrays to capture the pertinent data from each row, selectively push the data into the appropriate array:
  today_td_data = [ 'Today' ]
  yesterday_td_data = [ 'Yesterday' ]
  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end
  today_data << today_td_data
  yesterday_data << yesterday_td_data
end
And output the data:
puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }
> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15
Just to help you visualize what's going on, at the exit from the "tr" loop, the today_data and yesterday_data arrays are arrays-of-arrays looking like:
[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]
Alternatively, instead of looping over the "td" tags and sensing the class for the tag, I could have grabbed the contents of the "tr" and then used scan to grab the numbers and sliced the resulting array into "today" and "yesterday" arrays:
tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }
today_td_data = [ 'Today', *tr_data[0, 3] ]
yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]
In real-world development, like at work, I'd use that instead of what I first wrote because it's succinct.
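The scan-and-slice step is independent of the parser, so here is the same idea as a small Python sketch. It assumes, as in the sample table, that every data row holds exactly six numbers, three for today followed by three for yesterday; split_row is an illustrative name:

```python
import re


def split_row(row_text):
    """Pull all integers out of a row's text and split them 3/3."""
    numbers = [int(n) for n in re.findall(r"\d+", row_text)]
    return ["Today", *numbers[:3]], ["Yesterday", *numbers[3:6]]


# Text content of the first data row in the sample table.
today, yesterday = split_row("\n3\n455\n34\n3454\n5656\n3\n")
print(",".join(map(str, today)))
print(",".join(map(str, yesterday)))
```

This relies on column order alone, so it would silently misassign values if a row had a missing or extra cell; the class-based loop is more defensive in that respect.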
And notice that I didn't use XPath. It's very doable in Nokogiri to use XPath and accomplish this, but for simplicity I prefer CSS accessors. XPath would have allowed accessing individual "td" tag contents, but it also would begin to look like line-noise, which is something we want to avoid when writing code, because it impacts maintenance. I could also have used CSS to drill down to the correct "td" tags like 'tr td.r', but I don't think it would improve the code, it would just be an alternate way of doing it.
