Ruby Nokogiri Parsing HTML table - ruby

I am using Mechanize/Nokogiri and need to parse the following HTML string.
Can anyone help me with the XPath syntax to do this, or suggest any other method that would work?
<table>
<tr class="darkRow">
<td>
<span>
<a href="?x=mSOWNEBYee31H0eV-V6JA0ZejXANJXLsttVxillWOFoykMg5U65P4x7FtTbsosKRbbBPuYvV8nPhET7b5sFeON4aWpbD10Dq">
<span>4242YP</span>
</a>
</span>
</td>
<td>
<span>Subject of Meeting</span>
</td>
<td>
<span>
<span>01:00 PM</span>
<span>Nov 11 2009</span>
<span>America/New_York</span>
</span>
</td>
<td>
<span>30</span>
</td>
<td>
<span>
<span>example#email.com</span>
</span>
</td>
<td>
<span>39243368</span>
</td>
</tr>
.
.
.
<more table rows with the same format>
</table>
I want this as the output
"4242YP","Subject of Meeting","01:00 PM Nov 11 2009 America/New_York","30","example#email.com", "39243368"
.
.
.
<however many rows exist in the html table>

Something like this?
items=doc.xpath('//tr').map {|row| row.xpath('.//span/text()').select{|item| item.text.match(/\w+/)}.map {|item| item.text} }
returns:
=> [["4242YP", "Subject of Meeting", "01:00 PM", "Nov 11 2009", "America/New_York", "30", "example#email.com", "39243368"], ["abcdefg"]]
The select keeps only text nodes that contain at least one word character (which excludes the whitespace-only text inside some of your spans). You may need to refine the select filter for your specific case.
I added a minimal second row with a single span containing abcdefg, so that you can see the nested array.
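The output above keeps the three date spans as separate items, whereas the question asks for one value per cell. A minimal sketch of grouping by td instead, assuming the markup is well-formed enough to parse as XML: it uses REXML from Ruby's standard distribution so it runs without extra gems, and SAMPLE_TABLE, text_pieces and table_rows are hypothetical names introduced only for this illustration. The same per-cell loop should carry over to Nokogiri.

```ruby
require 'rexml/document'

# Hypothetical sample: one row in the same shape as the question's table.
SAMPLE_TABLE = <<~XML
  <table>
    <tr class="darkRow">
      <td><span><a href="?x=abc"><span>4242YP</span></a></span></td>
      <td><span>Subject of Meeting</span></td>
      <td><span>
        <span>01:00 PM</span>
        <span>Nov 11 2009</span>
        <span>America/New_York</span>
      </span></td>
      <td><span>30</span></td>
    </tr>
  </table>
XML

# Collect the stripped, non-empty text of every text node under +node+.
def text_pieces(node)
  pieces = node.children.flat_map do |child|
    case child
    when REXML::Text    then [child.value.strip]
    when REXML::Element then text_pieces(child)
    else []
    end
  end
  pieces.reject(&:empty?)
end

# One array per <tr>, one space-joined string per <td>.
def table_rows(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//tr').map do |row|
    row.elements.to_a('td').map { |td| text_pieces(td).join(' ') }
  end
end

ROWS = table_rows(SAMPLE_TABLE)
# Print each row as a quoted, comma-separated line.
puts ROWS.map { |cells| cells.map { |c| %("#{c}") }.join(',') }
```

Joining per cell rather than per span is what keeps "01:00 PM Nov 11 2009 America/New_York" together as a single field.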

Here's part of the XSL to transform your input, if you have an XSL transformer:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates select="//tr"/>
</xsl:template>
<xsl:template match="tr">
"<xsl:value-of select="td/span/a/span"/>","<xsl:value-of select="td[position()=2]/span"/>","<xsl:value-of select="td[position()=3]/span/span[position()=1]"/>"
</xsl:template>
</xsl:stylesheet>
Output produced looks like this:
"4242YP","Subject of Meeting","01:00 PM"
"4242YP","Subject of Meeting","01:00 PM"
(I duplicated your first table row).
The XSL select attributes give you a good idea of the XPath expressions you'd need to get the rest.

Related

xsl sort not working as expected

I've the below XML and using XSLT2.0
<A>
<BID>Pt.IV</BID>
<BID>Pt.III</BID>
<BID>Pt.IIIA</BID>
<BID>Pt.IIIB</BID>
<BID>Pt.IIIC</BID>
<BID>Pt.IIID</BID>
<BID>Pt.IIIE</BID>
<BID>Pt.IIIF</BID>
<BID>Pt.IIIAA</BID>
<BID>s.2(1)</BID>
<BID>s.3</BID>
<BID>s.3(1)</BID>
<BID>s.3(2)</BID>
<BID>s.3A</BID>
<BID>s.3B</BID>
<BID>s.4</BID>
<BID>s.4(2)</BID>
<BID>s.4(5)</BID>
<BID>s.4(2A)</BID>
<BID>s.4(4A)</BID>
<BID>s.6(3)</BID>
<BID>s.7</BID>
<BID>s.7A</BID>
<BID>s.8</BID>
<BID>s.9</BID>
<BID>s.12</BID>
<BID>s.13</BID>
<BID>s.20A</BID>
<BID>s.20F</BID>
<BID>s.20O</BID>
<BID>s.20S</BID>
<BID>s.20T</BID>
<BID>s.20W</BID>
<BID>s.21</BID>
<BID>s.21(2)</BID>
<BID>s.21(3)</BID>
<BID>s.21(2A)</BID>
<BID>s.21(4B)</BID>
<BID>s.21(4C)</BID>
<BID>s.21(4D)</BID>
<BID>s.21B</BID>
<BID>s.22(1)</BID>
<BID>s.22(1)(b)</BID>
<BID>s.22(4)</BID>
<BID>s.23</BID>
<BID>s.25(1A)</BID>
<BID>s.27</BID>
<BID>s.28</BID>
<BID>s.31</BID>
<BID>s.20O(2)</BID>
<BID>s.20W(2)</BID>
<BID>s.21B(1)</BID>
<BID>s.21B(2)</BID>
<BID>s.21B(3)</BID>
</A>
Here I'm trying to sort the BID values using the XSLT below.
<xsl:template match="A">
<xsl:for-each select="BID">
<xsl:sort select="substring-after(.,'.')"/>
<table class="toa-entry">
<tbody>
<tr class="secondary-entry">
<td class="entry-name">
<xsl:value-of select="."/></td>
</tr>
</tbody>
</table>
</xsl:for-each>
</xsl:template>
The output that I get is not in the expected order.
The expected output is as below.
s2(1)
s3
s3(1)
s3(2)
s3A
s3B
s4
s4(2)
s4(5)
s4(2A)
s4(4A)
s6(3)
s7
s7A
s8
s9
s12
s13
s20A
s20F
s20O
s20O(2)
s20S
s20T
s20W
s20W(2)
s21
s21(2)
s21(3)
s21(2A)
What's happening is that the sort first returns all the numbers starting with 1, then those starting with 2, and so on,
whereas I want regular ascending order: 1, 2, 2a, 3, 3a and so on.
Please let me know how I can get this output.
Here is a working demo.
DEMO
Thanks
You should try something like:
XSLT 2.0
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/A">
<table >
<xsl:for-each select="BID">
<xsl:sort select="substring-before(., '.')" data-type="text" order="ascending"/>
<xsl:sort select="replace(substring-before(substring-after(concat(., '('), '.'), '('),'[A-Z]', '')" data-type="number" order="ascending"/>
<xsl:sort select="replace(substring-before(substring-after(concat(., '('), '.'), '('),'[0-9]', '')" data-type="text" order="ascending"/>
<xsl:sort select="substring-after(., '(')" data-type="text" order="ascending"/>
<tr>
<td><xsl:value-of select="."/></td>
</tr>
</xsl:for-each>
</table>
</xsl:template>
</xsl:stylesheet>
The (rendered) result, when applied to your example:
Pt.III
Pt.IIIA
Pt.IIIAA
Pt.IIIB
Pt.IIIC
Pt.IIID
Pt.IIIE
Pt.IIIF
Pt.IV
s.2(1)
s.3
s.3(1)
s.3(2)
s.3A
s.3B
s.4
s.4(2)
s.4(2A)
s.4(4A)
s.4(5)
s.6(3)
s.7
s.7A
s.8
s.9
s.12
s.13
s.20A
s.20F
s.20O
s.20O(2)
s.20S
s.20T
s.20W
s.20W(2)
s.21
s.21(2)
s.21(2A)
s.21(3)
s.21(4B)
s.21(4C)
s.21(4D)
s.21B
s.21B(1)
s.21B(2)
s.21B(3)
s.22(1)
s.22(1)(b)
s.22(4)
s.23
s.25(1A)
s.27
s.28
s.31
You can't utilize a text sorting algorithm on numeric data.
Even though you have stripped out the characters, your data values are still text values.
If you require numeric sorting you need to tell the parser the data type of the data, which you can do using the data-type attribute.
data-type text | number | qname
Optional. Specifies the data-type of the data to be sorted. Default is "text"
EDIT: Replace your regex with this: [^a-zA-Z0-9 -]
There is a limitation here because the regex strips all non-numeric characters out of the values. Therefore if the initial list is not already sorted correctly within the numeric factor, for example
s.21(4B)
s.21(4C)
s.21(4D)
then the sorting will ignore the alphabetic component of the values.
If you're using Saxon, there is a collation you can request that treats any sequence of digits in the sort key as a number, so s12 sorts after s9.
collation="http://saxon.sf.net/collation?alphanumeric=yes"
It won't handle roman numerals though: sorting "App IX" after "App VIII" remains a challenge!
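For comparison, the multi-key idea from the first answer (prefix as text, leading number as a number, letter suffix as text, parenthesised part as text) can be sketched in plain Ruby. This is only an illustration of the technique, not a replacement for the XSLT: bid_sort_key and SAMPLE_BIDS are hypothetical names, and the key assumes every identifier has the form prefix.rest as in the question's data.

```ruby
# Build a composite sort key for identifiers such as "s.21B(2)" or "Pt.IIIA".
# Key parts: [prefix, leading number, letter suffix, parenthesised remainder].
def bid_sort_key(bid)
  prefix, rest = bid.split('.', 2)
  main, paren = rest.to_s.split('(', 2)
  main = main.to_s
  number  = main[/\d+/].to_i   # no digits (e.g. roman numerals) gives 0
  letters = main.delete('0-9') # what is left once the digits are removed
  [prefix.to_s, number, letters, paren.to_s]
end

# Hypothetical subset of the question's values, deliberately shuffled.
SAMPLE_BIDS = %w[s.12 s.3A s.9 s.3 s.4(2A) s.4(5) s.4(2) Pt.IV Pt.III s.20O(2) s.20O]

SORTED_BIDS = SAMPLE_BIDS.sort_by { |bid| bid_sort_key(bid) }
puts SORTED_BIDS
```

Ruby compares the key arrays element by element, which mirrors the stacked xsl:sort keys. As with the collation approach, roman numerals are still compared as plain text here.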

XPath to find element with a HTML line break

I need an xpath that will find some text containing HTML line breaks <br/>. For example:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
</ul>
Let's say I'm trying to find the li that contains ABC<br/>DEF. I've tried the following:
$x("//li[normalize-space(.)='ABC DEF']")
$x("//li[text() ='ABC<br/>DEF']")
$x("//li[contains(., 'ABC DEF')]")
But they return nothing. I saw the answer to "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode", but I couldn't figure out how to apply it in my case.
The following expression will get you close:
li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]]
If you need to match only items where the text ends with ABC, it will be a little longer.
The following transform will select the first matching li:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" />
<xsl:template match="/">
<matches>
<xsl:copy-of select="(//li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]
])
[1]" />
</matches>
</xsl:template>
</xsl:stylesheet>
Input:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>
Output:
<?xml version="1.0" encoding="utf-8"?>
<matches>
<li>ABC<br />DEF</li>
</matches>
//li[br]
This should work if any li with a line break is acceptable. It means: select all li elements that have a br child.
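The attempts in the question fail because br contributes no text: the string value of the first li is "ABCDEF", split across two text nodes. If the match can be done outside XPath, one option is to walk the children of each li and treat br as a newline. The sketch below is only an illustration under that assumption; it uses REXML from Ruby's standard distribution, and SAMPLE_LIST, text_with_breaks and find_items are hypothetical names.

```ruby
require 'rexml/document'

# Hypothetical sample matching the question's list.
SAMPLE_LIST = <<~XML
  <ul>
    <li>ABC<br/>DEF</li>
    <li>XYZ<br/>NOP</li>
  </ul>
XML

# Render the direct children of an li as text, writing each <br/> as "\n".
def text_with_breaks(li)
  li.children.map do |child|
    if child.is_a?(REXML::Text)
      child.value
    elsif child.is_a?(REXML::Element) && child.name == 'br'
      "\n"
    else
      '' # ignore comments and other elements
    end
  end.join
end

# Return the li elements whose rendered text equals +wanted+.
def find_items(markup, wanted)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//li').select { |li| text_with_breaks(li) == wanted }
end

MATCHES = find_items(SAMPLE_LIST, "ABC\nDEF")
puts MATCHES.size
```

Only direct children are inspected, so an li that wraps its text in a p (as in the answer's third input item) is not matched, which is the same behaviour as the XPath shown earlier.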

How do I get the maximum length for each column using Nokogiri?

How do I get the maximum length for each column using Nokogiri?
Example HTML:
<table>
<tr>
<td>ONE</td><td>TWO</td><td>THREE</td>
</tr>
<tr>
<td>Monaco</td><td>Bangkok</td><td>Thailand</td>
</tr>
</table>
The result would be the string length inside each <td>.
<td>one</td> => 3
<td>two</td> => 3
<td>three</td> => 5
....
First you would map the length of tr/td:
lengths = doc.search('tr').map{|tr| tr.search('td').map{|td| td.text.length}}
=> [[3, 3, 5], [6, 7, 8]]
transpose that to get columns and get just the max from each:
lengths.transpose.map &:max
=> [6, 7, 8]
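One caveat: Array#transpose raises an IndexError when the rows have different numbers of cells. A minimal sketch that pads short rows with zero lengths first, operating on plain arrays of cell strings (column_max_lengths and SAMPLE_CELLS are hypothetical names; extracting the cell text from the document is left as in the answer above):

```ruby
# Longest string length per column; rows with fewer cells are padded with 0.
def column_max_lengths(rows)
  width = rows.map(&:size).max || 0
  padded = rows.map do |cells|
    cells.map(&:length) + [0] * (width - cells.size)
  end
  padded.transpose.map(&:max)
end

# Cell text from the question's example table.
SAMPLE_CELLS = [%w[ONE TWO THREE], %w[Monaco Bangkok Thailand]]
p column_max_lengths(SAMPLE_CELLS)
```

Padding with 0 never changes a column's maximum, so regular tables give the same result as the plain transpose.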
A pure one-liner XPath 2.0 solution, supposing that the table has regular structure (each row has the same number of columns):
for $i in 1 to count(/*/tr[1]/td)
return
max(/*/tr/td[$i]/string-length())
XSLT 2.0 - based verification:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:sequence select=
"for $i in 1 to count(/*/tr[1]/td)
return
max(/*/tr/td[$i]/string-length())
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<table>
<tr>
<td>ONE</td>
<td>TWO</td>
<td>THREE</td>
</tr>
<tr>
<td>Monaco</td>
<td>Bangkok</td>
<td>Thailand</td>
</tr>
</table>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
6 7 8

Targeting part of a comment using XPath

I'm trying to use xpath to return the value "Vancouver", from either the comment or the text after it. Can anyone point me in the right direction?
The location li is always the first item but is not always present, and the number of list items after it varies for each item.
<item>
<title>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li> <!-- ABC Location=Vancouver -->Location: Vancouver</li>
<li> <!-- More comments -->Text</li>
<li> text</li>
</ul>
</description>
</item>
This will pull it from the text after the comment:
substring-after(//ul[@class='class1']/li[position()=1 and contains(.,'Location:')],'Location: ')
This specifies the first <li> inside the <ul> of class 'class1', only when it contains 'Location:', and takes the string after 'Location:'. If you want to relax the requirement that it be the first li, use this:
substring-after(//ul[@class='class1']/li[contains(.,'Location:')],'Location: ')
This isn't elegant, and it could cause issues if your "Location: #####" were to change structurally, because this is a static solution, but it works for the above:
substring(//item//li[1],12,string-length(//item//li[1])-10)
And this returns the string equivalent, not a node.
Rushed this one a bit, so I'll give a better solution with time but this is just something to think about...
(it just strips off "Location: " and returns whatever's after it..)
Use:
substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
To extract the location from the comment use:
substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
"/>
==========
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
"/>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<item>
<title/>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li>
<!-- ABC Location=Vancouver -->Location: Vancouver
</li>
<li>
<!-- More comments -->Text
</li>
<li> text</li>
</ul>
</description>
</item>
the two XPath expressions are evaluated and the results of the evaluations are copied to the output:
Vancouver
==========
Vancouver
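Since the question notes that the location li is not always present, it can help to see the same two extractions with an explicit "not found" result. The following is a rough Ruby sketch rather than a definitive implementation: it uses REXML from Ruby's standard distribution, and SAMPLE_ITEM, list_items, location_from_comment and location_from_text are hypothetical names introduced for the illustration.

```ruby
require 'rexml/document'

# Hypothetical sample based on the question's markup (with <title/> closed).
SAMPLE_ITEM = <<~XML
  <item>
    <title/>
    <description>
      <ul class="class1">
        <li> <!-- ABC Location=Vancouver -->Location: Vancouver</li>
        <li> <!-- More comments -->Text</li>
      </ul>
    </description>
  </item>
XML

# All li elements under description/ul, in document order.
def list_items(markup)
  REXML::Document.new(markup).root.elements.to_a('description/ul/li')
end

# Value after "Location=" in the li's first comment, or nil if absent.
def location_from_comment(li)
  return nil unless li
  comment = li.children.grep(REXML::Comment).first
  value = comment && comment.string[/Location=(.*)/, 1]
  value && value.strip
end

# Value after "Location:" in the li's own text, or nil if absent.
def location_from_text(li)
  return nil unless li
  value = li.texts.map(&:value).join[/Location:\s*(.*)/, 1]
  value && value.strip
end

ITEMS = list_items(SAMPLE_ITEM)
puts location_from_comment(ITEMS.first)
puts location_from_text(ITEMS.first)
```

Both helpers return nil when the first li is some other entry, so the caller can tell a missing location apart from an empty one.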

XPath: how to get text from this and next tag?

I have HTML like this:
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>
So I need to get each heading together with the paragraph that follows it: Hello1 with World1, Hello2 with World2, etc.
UPDATE: I use Ruby Mechanize library
The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:
require 'mechanize'
require 'pp'
html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
p = header.xpath("following-sibling::p[1]").text
results << [header.text, p]
end
pp results
EDIT:
This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.
EDIT #2:
This example answers a follow-up question to the original solution:
require 'nokogiri'
require 'pp'
html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML
doc = Nokogiri::HTML(html)
results = []
doc.xpath("//h1").each do |header|
  # [1] keeps only the nearest following <p>, assuming the parser has moved each <p> out of its <h1>
  h1 = header.xpath("following-sibling::p[1]/font/b").text
  results << h1
end
pp results
An h1 tag containing block-level elements such as p is invalid HTML, so Nokogiri corrects the error during the parsing process. The process to get at the formerly nested elements is very similar to the original solution.
Note: I glossed over the XPath part of this request. This answer is for an XSLT style sheet instead.
Expanding your XML example to give it a root element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>
You could use a for-each loop along with "following-sibling" to get the elements with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output encoding="UTF-8" method="text"/>
<xsl:template match="/">
<!-- start looking for <h1> nodes -->
<xsl:for-each select="/root/h1">
<!-- output the h1 text -->
<xsl:value-of select="."/>
<!-- print a dash for spacing -->
<xsl:text> - </xsl:text>
<!-- select the next <p> node -->
<xsl:value-of select="following-sibling::p[1]"/>
<!-- print a new line -->
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output would look like this:
Hello1 - World1
Hello2 - World2
Hello3 - World3
