How do I use Nokogiri to parse this HTML? - ruby

I have an HTML document like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>Page Title</title>
<style type="text/css">
</style>
</head>
<body>
<div class="section">
<table>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
</table>
</div>
<div class="section">
<table>
<tr>
<td>test</td><td>test</td><td>test</td><td>test</td>
</tr>
<tr>
<td>test</td><td>test</td><td>test</td><td>test</td>
</tr>
<tr>
<td>test</td><td>test</td><td>test</td><td>test</td>
</tr>
<tr>
<td>test</td><td>test</td><td>test</td><td>test</td>
</tr>
</table>
</div>
<div class="section">
<table>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
<tr>
<td>this_is_what_i_want</td><td>this_is_what_i_want</td><td>test</td><td>test</td>
</tr>
</table>
</div>
</body>
</html>
I want to get the first two td elements in every row of the first and
third table elements. How can I get this result?
Note that the two td elements in a row are related, so you can't treat
all td elements the same way. For example, how do I concatenate the
content of the two td elements in a row?

doc.xpath('//div[position()=1 or position()=3]/table/tr').map{|tr| tr.css('td')[0..1].map(&:text).join(' ')}

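It can also be done using two XPath statements:
doc.xpath('//div[position()=1 or position()=3]/table/tr').map {|row| row.xpath('concat(td[1]," ",td[2])')}
Note that the paths inside concat() are relative to the row (td[1], not //td[1]). An absolute //td[1] would search from the document root and return the cells of the first row in the document every time.
The reason it can't be done in a single XPath statement is that XPath 1.0 string functions convert a node-set argument to the string value of its first node only. You can do node selection or concatenation, but not both.
Note that in XPath 2.0 it can be done using the string-join() function, but Nokogiri supports only XPath 1.0.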

Related

Outlook can't parse multiple tables from HTML with pywin32

I am trying to convert an HTML file to a .msg file, but the conversion stops when it reaches the third table tag in the HTML.
I searched for this problem but didn't find anything; it seems I am the only one who has run into it.
This is the example HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<h1>This Email is using for understand outlook mail synthesis</h1>
<h2>0001</h2>
<table>
<tr>
<td>This is image001.jpg</td>
</tr>
<tr>
<td>
// stop parsing after parse this table
<table>
<tr>
<td>
<img src="cid:image001.jpg" alt="">
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>This is image002.jpg</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>
<img src="cid:image002.jpg" alt="">
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>This is image003.jpg</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>
<img src="cid:image003.jpg" alt="">
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>This is image004.jpg</td>
</tr>
<tr>
<table>
<tr>
<td>
<img src="cid:image004.png" alt="">
</td>
</tr>
</table>
</tr>
</table>
<h1>No!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</h1>
</body>
</html>
And this is my Python code:
from win32com import client as win32
import os

outlook = win32.Dispatch("outlook.application")
mail = outlook.CreateItem(0)  # 0 = mail item
mail.Subject = "This is a subject"
# Read the HTML body from disk
with open(".\\mix.html", "r", encoding="utf-8") as f:
    html = f.read()
mail.HtmlBody = html
current_path = os.getcwd()
# Attach each image and set its content ID (MAPI property 0x3712001F)
# so that the cid: references in the HTML resolve to the attachment
at = mail.Attachments.Add(current_path + "\\image001.jpg")
at.PropertyAccessor.SetProperty("http://schemas.microsoft.com/mapi/proptag/0x3712001F", "image001.jpg")
at = mail.Attachments.Add(current_path + "\\image002.jpg")
at.PropertyAccessor.SetProperty("http://schemas.microsoft.com/mapi/proptag/0x3712001F", "image002.jpg")
at = mail.Attachments.Add(current_path + "\\image003.jpg")
at.PropertyAccessor.SetProperty("http://schemas.microsoft.com/mapi/proptag/0x3712001F", "image003.jpg")
at = mail.Attachments.Add(current_path + "\\image004.png")
at.PropertyAccessor.SetProperty("http://schemas.microsoft.com/mapi/proptag/0x3712001F", "image004.png")
# Save the composed message as a .msg file
mail.SaveAs(current_path + "\\rst.msg")
This is what I see when I open the "rst.msg" file:
[screenshot: parsing stops after the table]
I deleted the table in the second tr tag and ran the Python script again. This is what I get:
[screenshot: parsing stops again after the table]
This is the HTML after I deleted the table in the second tr tag:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<h1>This Email is using for understand outlook mail synthesis</h1>
<h2>0001</h2>
<table>
<tr>
<td>This is image001.jpg</td>
</tr>
<tr>
<td>
</td>
</tr>
<tr>
<td>This is image002.jpg</td>
</tr>
<tr>
<td>
// stop parsing after parse this table
<table>
<tr>
<td>
<img src="cid:image002.jpg" alt="">
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>This is image003.jpg</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>
<img src="cid:image003.jpg" alt="">
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>This is image004.jpg</td>
</tr>
<tr>
<table>
<tr>
<td>
<img src="cid:image004.png" alt="">
</td>
</tr>
</table>
</tr>
</table>
<h1>No!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</h1>
</body>
</html>
Hope you can help me! Thank you very much!
This problem was solved by upgrading Outlook. With the newer version, Outlook generates the email as expected. Note, however, that if you add pictures to the email for use in the HTML and then save the email, the pictures appear in the saved email as attachments. If you instead send the email after generating it, without saving it first, the recipient does not see those attachments, which is probably what you want.

Get a cell that is in a table before the current table

See the HTML below. I have a series of tables that include rows with the attribute name="laneStop". I can select those rows like this in the Chrome dev console:
$x("/html[1]/body[1]//TR[@name='laneStop']")
However, I also need to get the 2nd cell of the 2nd row of the 1st table ABOVE these rows, e.g. the value
abc_123_florida-45
Here is the HTML. What is a way to refer to the value above, given that I am selecting the "laneStop" rows first?
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Date</td>
<td>11/15/2019</td>
</tr>
<tr>
<td>shipment number</td>
<td>abc_123_florida-45</td>
</tr>
<tr>
<td>Departure time:</td>
<td>0430</td>
</tr>
</tbody>
</table>
</td>
<td>
<table>
<tbody>
<tr>
<td>Time arrival</td>
<td>1715</td>
</tr>
<tr>
<td>customer</td>
<td>bob smith</td>
</tr>
<tr>
<td>box type</td>
<td>square</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr name="laneStop">
<td>box1</td>
<td>23.45</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>17.14</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box3</td>
<td>17.18</td>
<td>lane1</td>
<td>north</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>199.14</td>
<td>lane1</td>
<td>west</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
Try the following xpath.
//td[text()='shipment number']/following::td[1]
If you want to travel from your current node (i.e., the "laneStop" rows), one way to do that is to use this xpath expression:
./preceding-sibling::*/ancestor::*[6]/preceding-sibling::table[1]//tr[1]/td[1]/table[1]//td[1]//tr[2]/td[2]
I'm curious to see if it works for you.

How to Get Nokogiri to Show Node and not just HTML

Right now, when I parse some HTML (the front page of Hacker News, for example), it works fine. I can call class on something like doc = Nokogiri::HTML(open('news.ycombinator.com')) and I get back Nokogiri::HTML::Document < Nokogiri::XML::Document
The issue is that, in the terminal, I see the HTML and not the actual Nokogiri element. I want to see the element because it shows me valuable information, such as the element's children or an array of links.
I get the HTML using the Watir Gem using the following method:
[1] pry(main)> browser = Watir::Browser.new(:firefox)
#<Watir::Browser:0x2c5654b29ef00c22 url="about:blank" title="">
[2] pry(main)> browser.goto('news.ycombinator.com')
"http://news.ycombinator.com"
[3] pry(main)> browser.html
Where browser.html is an instance variable (I think?) containing un-parsed HTML.
Here is what I get back right now if I call doc = Nokogiri::HTML.parse(browser.html)
And here is what I would like to get back:
Where am I going wrong?
adding raw code as requested:
Nokogiri::HTML::Document < Nokogiri::XML::Document
[31] pry(main)> doc = Nokogiri::HTML.parse(browser.html)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html op="news">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="referrer" content="origin">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" type="text/css" href="news.css?stXbi7LCyutClfTUMe1b">
<link rel="shortcut icon" href="favicon.ico">
<link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
<title>Hacker News</title>
</head>
<body>
<center><table id="hnmain" width="85%" cellspacing="0" cellpadding="0" border="0" bgcolor="#f6f6ef">
<tbody>
<tr><td bgcolor="#ff6600"><table style="padding:2px" width="100%" cellspacing="0" cellpadding="0" border="0"><tbody><tr>
<td style="width:18px;padding-right:4px"><img src="y18.gif" style="border:1px white solid;" width="18" height="18"></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname">Hacker News</b>
new | past | comments | ask | show | jobs | submit </span></td>
<td style="text-align:right;padding-right:4px;"><span class="pagetop">
login
</span></td>
</tr></tbody></table></td></tr>
<tr id="pagespace" title="" style="height:10px"></tr>
<tr><td>
<table class="itemlist" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr class="athing" id="19388248">
<td class="title" valign="top" align="right"><span class="rank">1.</span></td> <td class="votelinks" valign="top"><center><a id="up_19388248" href="vote?id=19388248&how=up&goto=news"><div class="votearrow" title="upvote"></div></a></center></td>
<td class="title">
Getting Too Absorbed in Your Side Projects<span class="sitebit comhead"> (<span class="sitestr">bennettnotes.com</span>)</span>
</td>
</tr>
<tr>
<td colspan="2"></td>
<td class="subtext">
<span class="score" id="score_19388248">42 points</span> by _davebennett <span class="age">1 hour ago</span> <span id="unv_19388248"></span> | hide | 27 comments </td>
</tr>
<tr class="spacer" style="height:5px"></tr>
<tr class="athing" id="19384878">
<td class="title" valign="top" align="right"><span class="rank">2.</span></td> <td class="votelinks" valign="top"><center><a id="up_19384878" href="vote?id=19384878&how=up&goto=news"><div class="votearrow" title="upvote"></div></a></center></td>
<td class="title">
Facebook’s Data Deals Are Under Criminal Investigation<span class="sitebit comhead"> (<span class="sitestr">nytimes.com</span>)</span>
</td>
</tr>
<tr>
<td colspan="2"></td>
<td class="subtext">
<span class="score" id="score_19384878">661 points</span> by tysone <span class="age">13 hours ago</span> <span id="unv_19384878"></span> | hide | 156 comments </td>
</tr>
<tr class="spacer" style="height:5px"></tr>
<tr class="athing" id="19388091">
<td class="title" valign="top" align="right"><span class="rank">3.</span></td> <td class="votelinks" valign="top"><center><a id="up_19388091" href="vote?id=19388091&how=up&goto=news"><div class="votearrow" title="upvote"></div></a></center></td>
<td class="title">
Krita 4.2.0: First painting application with HDR support on Windows<span class="sitebit comhead"> (<span class="sitestr">krita.org</span>)</span>
</td>
...
It sounds like you want:
doc = Nokogiri::HTML browser.html

pandoc: convert HTML table to DOCX

I have a very simplistic HTML document with a table:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Analysis</title>
</head>
<body>
<TABLE border=1>
<TR> <TD> 18.365 </TD> <TD> 1 </TD> </TR>
<TR> <TD> 23.465 </TD> <TD> 1 </TD> </TR>
<TR> <TD> 26.020 </TD> <TD> 1 </TD> </TR>
<TR> <TD> 14.371 </TD> <TD> 1 </TD> </TR>
<TR> <TD> 17.258 </TD> <TD> 1 </TD> </TR>
</TABLE>
</body>
</html>
and I would like to create a DOCX file from it using pandoc. In the result, however, the table is completely messed up. Can anyone please help me with a working example? It is the last step in a complex workflow of mine, and I assume that converting a table should be possible.
Pandoc version: 1.12.4.2
It's a regression that has already been fixed in the development version
(https://github.com/jgm/pandoc/issues/1341). You can install the development
version from source or revert to a package for 1.12.3.3. This will be fixed in the next pandoc release.

Ruby list files in remote http server

I have a file listing page on a remote server, say http://myserver.com/uploads. How can I get the list of files using Ruby, preferably with net-http only?
This is the HTML code of the page:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!-- saved from url=(0025)http://myserver.com/uploads/ -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Index of /uploads</title>
</head>
<body>
<h1>Index of /uploads</h1>
<table>
<tbody>
<tr>
<th><img src="./Index of uploads_files/blank.gif" alt="[ICO]"></th>
<th>Name</th>
<th>Last modified</th>
<th>Size</th>
<th>Description</th></tr><tr><th colspan="5"><hr></th>
</tr>
<tr>
<td valign="top"><img src="./Index of uploads_files/back.gif" alt="[DIR]"></td>
<td>Parent Directory</td>
<td> </td>
<td align="right"> - </td>
<td> </td>
</tr>
<tr>
<td valign="top"><img src="./Index of uploads_files/compressed.gif" alt="[ ]"></td>
<td>Backup_201305281256.tar.gz</td>
<td align="right">28-May-2013 18:00 </td>
<td align="right"> 13M</td><td> </td>
</tr>
<tr><th colspan="5"><hr></th></tr>
</tbody>
</table>
<address>Apache/2.2.22 (Ubuntu) Server at myserver.com Port 80</address>
</body>
</html>
What you see is an HTML page, generated by the HTTP server, with links to the files.
You'll need to parse this HTML to get the list of files, or use a regex to match the URIs.
Take a look at the URI regex.
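A minimal sketch of the regex approach using only Ruby's standard library. The HTML shown in the question has lost its anchor tags, so this assumes the real listing wraps each name in an a element with an href attribute, as Apache index pages normally do. The names fetch_listing, file_names, and SAMPLE_LISTING are hypothetical, and the sample rows are invented to exercise the filter.

```ruby
require 'net/http'
require 'uri'

# Fetches the raw HTML of the listing page. Not called below because it
# needs network access, e.g. fetch_listing('http://myserver.com/uploads/').
def fetch_listing(url)
  Net::HTTP.get(URI(url))
end

# Collects href targets and drops the ones that do not look like files:
# sort links (start with "?"), absolute paths such as the parent directory
# (start with "/"), and subdirectories (end with "/").
def file_names(html)
  html.scan(/<a\s[^>]*href="([^"]+)"/i)
      .flatten
      .reject { |href| href.start_with?('?', '/') || href.end_with?('/') }
end

# Invented rows in the shape assumed above.
SAMPLE_LISTING = <<~HTML
  <tr><td><a href="/">Parent Directory</a></td></tr>
  <tr><td><a href="?C=N;O=D">Name</a></td></tr>
  <tr><td><a href="Backup_201305281256.tar.gz">Backup_201305281256.tar.gz</a></td></tr>
HTML

p file_names(SAMPLE_LISTING)
```

A regex like this is brittle against unusual markup (single-quoted attributes, for instance), so an HTML parser is the safer choice when an extra dependency is acceptable.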
