XPath: start scraping after a certain word

I am trying to get the location from this HTML using XPath. What I want to say, in human terms, is: "when you see Location:, grab the next piece of text, then stop."
<td width="670">
<h1>Accor Vacation Club - SOLD</h1>
<h2>All Australia, Australia</h2>
<p class="property_number">Property ref: 002</p>
<h3 class="cl2">Description</h3><p class="xh-highlight">Resort: Accor Vacation Club. <br>Location: Australia. <br>Type of Ownership: Points. <br>Season: All. <br>Size of Unit: Studio. <br>Price: SOLD</p><p class="xh-highlight"> </p><p class="xh-highlight"><span style="font-size: 16pt">SOLD</span> </p>
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="photorealestate">
<tbody><tr>
I got this far but can't seem to isolate that word:
//p[./preceding-sibling::h3[contains(., 'Description')]]
//p/text()[./preceding-sibling::h3[contains(., 'Description')]]

If you need to get "Australia" as output, you can use the expression below:
substring-after(//text()[starts-with(., 'Location')], 'Location: ')
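This selects the text node that starts with the word "Location" and returns the substring that follows "Location: ". For the sample above, that text node is "Location: Australia. ", so the result keeps the trailing period and space; wrap the expression in substring-before(..., '.') if you need only "Australia".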

Related

xpath selecting text after a certain element or between elements

I'm trying to use xpath to select all the text within the elements:
between the h3 elements "Hay Point" and "Dalrymple Bay"
after the h3 element "Dalrymple Bay"
I've got this xpath syntax working which selects all the text within the td tags after <h3>Dalrymple Bay Coal Terminal</h3>.
.//h3[2]/following::td/text()
But I'm having trouble figuring out how to select all the text between the tags that fall between <h3>Hay Point Coal Terminal</h3> and <h3>Dalrymple Bay Coal Terminal</h3>
A sample of the structure of the html is below:
<h3>Hay Point Coal Terminal</h3>
<tr role="row" class="odd"><td headers="table06762r1c1" tabindex="0">July
</td><td style="text-align: left;"
headers="table06762r1c2">4,517,445</td>
<td headers="table06762r1c3">4,261,253</td>
<td headers="table06762r1c4">4,057,239</td>
<td headers="table06762r1c5">3,535,507</td>
</tr>
<h3>Dalrymple Bay Coal Terminal</h3>
<tr><td headers="table06762r1c1">July</td><td style="text-align: left;"
headers="table06762r1c2">5,462,591</td>
<td headers="table06762r1c3">5,625,700</td>
<td headers="table06762r1c4">5,816,977</td>
<td headers="table06762r1c5">5,396,644</td>
</tr>
If I understand your question correctly and given the html in the question, in order to get text nodes related to the <h3>Hay Point Coal Terminal</h3> node, try:
//h3[1]/following-sibling::tr[1]/td/text()
Output:
July
4,517,445
4,261,253
4,057,239
3,535,507
To get those related to the <h3>Dalrymple Bay Coal Terminal</h3> node, use:
//h3[2]/following-sibling::tr[1]/td/text()
or just
//h3[2]/following-sibling::tr/td/text()
Output:
July
5,462,591
5,625,700
5,816,977
5,396,644
To get both:
//h3/following-sibling::tr/td/text()
Assuming you want to group those you would do something like:
for h3 in response.css('h3'):
    item = {
        "h3": h3.css('*::text').extract()[0],
        "tds": h3.css('* + tr td::text').extract()
    }

weird encode character in email

I have a Mustache template that I parse in Ruby and render into the email body after marking it html_safe, but the resulting HTML has some weird encoded characters embedded in it, for example:
<body style=3D"min-width:640px;margin: 0 0 0 0;" bgcolor=3D"#f6f6f6" link==3D"#000000" vlink=3D"#000000" alink=3D"#000000" text=3D"#000000">
<br />
<table width=3D"100%" border=3D"0" align=3D"center"
cellpadding=3D"0" c=
ellspacing=3D"0" bgcolor=3D"#f6f6f6">
<tr>
<td bgcolor=3D"#f6f6f6" style=3D"border-bottom: 0;">
<table width=3D"640" style=3D"min-width:640px;"
cellspacing=3D"0"=
cellpadding=3D"0" border=3D"0" align=3D"center">
<tbody>
<tr>
<td bgcolor=3D"#000000">
<table width=3D"640" bgcolor=3D"#000000" cellspacing=3D"0=
" cellpadding=3D"0" border=3D"0" align=3D"center">
<tbody>
<tr>
<td width=3D"600" height=3D"10" bgcolor=3D"#000000"=
style=3D"line-height:0px;font-size:0px;">
<div width=3D"1" height=3D"10" alt=3D"" style=3D"=
display:block; border:0;"></div>
Why do these characters remain even after marking the string as HTML safe? Am I missing something?
The Mustache template is a regular HTML template containing Mustache tags that are replaced dynamically.
That's quoted-printable encoding, which escapes characters much like a URL does. You're probably used to %20 in URLs; here =20 means the same thing (a space).
Since = is itself the escape character, it has to be escaped too, just as & becomes &amp; in HTML and % becomes %25 in a URL. So = is encoded as =3D.
HTML happens to use a lot of = characters, so you'll see =3D all over. A = at the very end of a line is a soft line break, which is why some attributes appear split across two lines.
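To see the encoding in action, here is a minimal sketch using Ruby's built-in "M" pack/unpack directive, which handles quoted-printable. The sample string is a made-up fragment modelled on the markup in the question, not the asker's actual mail body.

```ruby
# Hypothetical sample: a quoted-printable fragment like the one in the question.
encoded = "<td width=3D\"600\" bgcolor=3D\"#000000\"=\n style=3D\"line-height:0px;\">"

# String#unpack1 with the "M" directive decodes quoted-printable:
# "=3D" becomes "=", and "=" at the end of a line joins it with the next line.
decoded = encoded.unpack1("M")
puts decoded

# Array#pack("M") goes the other way, which shows where "=3D" comes from.
reencoded = ['a="b"'].pack("M")
puts reencoded
```

Mail clients decode this automatically, so the =3D sequences normally only show up when looking at the raw message source.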

Extract href from specific tr element using XPath

Given the following HTML code :
<tr>
<th scope="row" class="navbox-group">Family</th>
<td class="navbox-list navbox-even hlist" style="text-align:left;border-left-width:2px;border-left-style:solid;width:100%;padding:0px">
<div style="padding:0em 0.25em">
<ul>
<li>Andrew Parker Bowles <small>(first husband)</small></li>
<li>Tom Parker Bowles <small>(son)</small></li>
<li>Laura Lopes <small>(daughter)</small></li>
<li>Charles, Prince of Wales <small>(second husband)</small></li>
<li>Bruce Shand <small>(father)</small></li>
<li>Rosalind Shand <small>(mother)</small></li>
<li>Annabel Elliot <small>(sister)</small></li>
<li>Mark Shand <small>(brother)</small></li>
</ul>
</div>
</td>
</tr>
I want to get all the href values within the tr element, but only from tr elements that contain:
<th scope="row" class="navbox-group">Family</th>
(Where th='Family')
I tried to write the following XPath:
"//tr[@th='Family']//a/@href"
But I don't get any href.
Thanks a lot.
Shany
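Try the XPath below:
//tr[th="Family"]//@href
It should return the list of links from the tr that contains a th with the text "Family". Your attempt used @th, which tests for an attribute named th on the tr; th (without @) tests the child th element instead.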
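As a rough check of the child-element predicate, here is a small sketch using REXML, the XML library bundled with Ruby. The markup and the href values are invented for illustration (the snippet in the question does not show any links), and the sketch uses //a/@href rather than //@href to keep the result limited to anchor elements.

```ruby
require "rexml/document"

# Hypothetical, well-formed sample; the href values are made up.
xml = <<~XML
  <table>
    <tr>
      <th>Born</th>
      <td><a href="/wiki/London">London</a></td>
    </tr>
    <tr>
      <th>Family</th>
      <td><ul>
        <li><a href="/wiki/Tom">Tom</a></li>
        <li><a href="/wiki/Laura">Laura</a></li>
      </ul></td>
    </tr>
  </table>
XML

doc = REXML::Document.new(xml)

# th='Family' tests the child <th> element; @th would test an attribute on <tr>.
hrefs = REXML::XPath.match(doc, "//tr[th='Family']//a/@href").map(&:value)
puts hrefs
```

Only the links in the "Family" row are returned; the link in the "Born" row is skipped.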

How can I search a table faster?

I am trying to search a table for a specific value using Ruby and selenium-webdriver. I have a method that works, but it takes a lot of time for some reason. It is a one-row table, and the page HTML looks like this:
<div id="permitGridContainer">
<table id="calendar" class="items" style="width:430px;" name="calendar">
<thead>
<tbody>
<tr>
<td id="avail1" class="status r slct" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 0 of 4");">
<div class="permitStatus">R</div>
</td>
<td id="avail2" class="status r" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 0 of 4");">
<div class="permitStatus">R</div>
</td>
<td id="avail3" class="status a" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 89 of 99");">
<a onclick="javascript:setNewArrivalDate("Sun Sep 06 2015", 2);return false;" href="#">
A
<br>
<small>89</small>
</a>
</td>
<td id="avail4" class="status a" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 97 of 99");">
</tr>
</tbody>
</table>
</div>
... I shortened the table; it has 14 columns.
I am looking for a column that has an item available, and I am checking the class for this, but the text also changes, so there are other things I could look for.
This is the code I am using, but it is visibly slow. I used puts statements to see the progress. My sense is that it has to do with the time spent accessing each element, so I was hoping there is a better way to process the table quickly. Thank you.
for j in 1..days_to_check[i]
  check_avail = driver.find_element(id: "avail#{j}")
  check_availclass = check_avail.attribute("class")
  if check_availclass == "status a" || check_availclass == "status a slct"
    # process if
  end
end
Based on your comment, I would suggest the following XPath. A well-targeted XPath is often easier than looping through the HTML table:
//td[(@class='status a') or (@class='status A')]
This XPath finds the td elements whose class is exactly "status a" or "status A".

Advice for replacing img tags with text in Ruby?

I'm trying to work out how to store an html table of drive stats in a database, but the developers have been a bit clever, and started using gifs to represent pass/fail/health stats
Here's a snippet of what I've got:
<tr class="status">
<td class="status"><img border="0" src="/tick_green.gif"></td>
<td class="status">8</td>
<td class="status">Ready</td>
<td class="status"><img border="0" src="/bar10.gif"></td>
<td class="status">SEAGATE ST3146807FC</td>
<td class="status">10000 RPM</td>
<td class="status">3HY61AG9</td>
<td class="status">XR12</td>
<td class="status">286749488</td>
<td class="status"> 28.0°C</td>
<td class="status" style="background-color: #00fa00"> 
</td>
And here's some of the ruby that I've written so far to strip the tags:
table = page.parser.xpath('//table/caption[contains(.,"Drive")]/..')
table.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    puts cell.to_html.gsub(/<a[^>]+>/,'').gsub(/<td[^>]+>/,'').gsub(/<\/td[^>]*>/,'').gsub(/<\/a[^>]*>/,'')
    #puts cell.text
  end
end
I can now get semi-rational output
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
But I want to replace a couple of other cell elements with other bits
For example, the tick_green can also be '/cross_red.gif' or '/caution.gif' which I want to replace with regular text, likewise, the img bar10.gif, I want to replace with just text of '10'
Is it best to come up with a whole bunch of values for all of my special cases?
I'd do some gsub-ing.
E.g.:
example = <<-STRING
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
STRING
replace = Hash.new("#unknown")
replace['tick_green.gif'] = "[OK]"
replace['bar10.gif'] = "[10]"
regex = /<img [^>]* src="\/(.*)">/
result = example.gsub(regex) { replace[$1] }
I'd like to replace the $1 with a named backreference, but I don't know how yet.
http://ruby-doc.org/core-1.9.3/String.html#method-i-gsub
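One way to get the named backreference is a named capture group read through Regexp.last_match inside the gsub block. This is a minimal sketch, not the answerer's code: the group name file is arbitrary, the sample is a shortened version of the one above with a caution.gif line added to show the hash default, and the regex is tightened slightly to stop at the closing quote.

```ruby
# Lookup table; unknown images fall back to the hash default.
replace = Hash.new("#unknown")
replace["tick_green.gif"] = "[OK]"
replace["bar10.gif"] = "[10]"

# Shortened sample, plus one image with no entry in the table.
example = <<~STRING
  <img border="0" src="/tick_green.gif">
  15
  <img border="0" src="/bar10.gif">
  <img border="0" src="/caution.gif">
STRING

# (?<file>...) is a named capture; Regexp.last_match[:file] reads it
# for the current match inside the block, in place of $1.
regex = /<img [^>]*src="\/(?<file>[^"]+)">/
result = example.gsub(regex) { replace[Regexp.last_match[:file]] }
puts result
```

The special variable $~ is equivalent to Regexp.last_match, so $~[:file] works as well.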
edit: result from above
[OK]
15
Ready
[10]
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
A case statement will clean that up a little bit:
row.css('td').each do |td|
  img = td.at('img')
  puts case
    when img && img[:src][/bar(\d+)\.gif/] then $1
    when img && img[:src][/tick_green/] then 'ok'
    else td.text.strip
  end
end
