Joining text from parsing a complex table structure in Ruby with Nokogiri

I have an HTML table and I want to get the text from some of its td elements. Sometimes the text is in a single td, but sometimes it is spread across several td's. How can I join the text when it is spread across several td's? Here is the HTML:
<table class="detailRecordTable">
<tbody>
<tr><td class="detailSeperator" colspan="6"> </td></tr>
<tr>
<td valign="top" style="width: 11% " class="detailData"><b>02/03/2016</b></td> <td style="width: 3%" class="detailLabels" valign="top"> </td>
<td style="width: 85%" class="detailData alignData" colspan="3"> <b>Disposed- Pet for Writ Denied</b> </td>
<td style="width: 1%" class="detailData"> </td>
</tr>
<tr>
<td colspan="2" style="width: 14% " class="detailLabels" valign="top"> </td>
<td style="width: 86% " class="detailData" colspan="2">ORDER ISSUED: PETITION FOR WRIT OF MANDAMUS DENIED. MANDATE AVAILABLE TO COUNSEL OF RECORD VIA SECURE CASE.NET.</td>
</tr>
<tr><td class="detailSeperator" colspan="6"> </td></tr>
<tr>
<td valign="top" style="width: 11% " class="detailData"><b>01/29/2016</b></td>
<td style="width: 3%" class="detailLabels" valign="top"> </td>
<td style="width: 85%" class="detailData alignData" colspan="3">
<b>Suggestions in Opposition</b></td>
<td style="width: 1%" class="detailData"> </td>
</tr>
<tr>
<td colspan="2" style="width: 14% " class="detailLabels" valign="top"> </td>
<td style="width: 86% " class="detailData" colspan="2">SUGGESTIONS IN OPPOSITION TO RELATORS PETITION FOR WRIT OF MANDAMUS; Electronic Filing Certificate of Service.</td>
</tr>
<tr>
<td colspan="2" style="width: 14%" class="detailLabels"> </td>
<td style="width: 86%" class="detailData" colspan="2"> <b>Filed By:</b>JOHN RICHARD SHANK JR
</td>
</tr><tr>
<td style="width: 14%" class="detailLabels" colspan="2"></td>
<td style="width: 86%" class="detailData" colspan="2"> <b>On Behalf Of:</b>ELIZABETH DAVIS
</td>
</tr>
<tr>
<td class="detailSeperator" colspan="6"> </td></tr>
<tr><td valign="top" style="width: 11% " class="detailData"><b>01/22/2016</b></td><td style="width: 3%" class="detailLabels" valign="top"> </td>
<td style="width: 85%" class="detailData alignData" colspan="3"><b>Court Order Issued</b></td>
<td style="width: 1%" class="detailData"> </td>
</tr>
<tr><td colspan="2" style="width: 14% " class="detailLabels" valign="top"> </td>
<td style="width: 86% " class="detailData" colspan="2">ORDER ISSUED: RESPONDENT REQUESTED TO FILE SUGGESTIONS IN OPPOSITION ON OR BEFORE 2:00 P.M. ON JANUARY 29, 2016.</td>
</tr>
</tbody></table>
I want output like this. I put asterisks around the text that should be joined:
["ORDER ISSUED: PETITION FOR WRIT OF MANDAMUS DENIED. MANDATE AVAILABLE TO COUNSEL OF RECORD VIA SECURE CASE.NET." , "**SUGGESTIONS IN OPPOSITION TO RELATORS PETITION FOR WRIT OF MANDAMUS; Electronic Filing Certificate of Service. Filed By:JOHN RICHARD SHANK JR On Behalf Of:ELIZABETH DAVIS**" , "ORDER ISSUED: RESPONDENT REQUESTED TO FILE SUGGESTIONS IN OPPOSITION ON OR BEFORE 2:00 P.M. ON JANUARY 29, 2016"]
I have tried the following, but it does not join the text. Each piece comes back as a separate item, especially the text surrounded by asterisks:
if !tr.css('td.detailData').empty?
ac_desc = tr.css('td.detailData')[0].text.strip.gsub("\n", '').gsub("\t", '')
end
if ac_desc != ""
acc_descs << ac_desc
end
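One possible approach, sketched here in Python with only the standard library rather than Nokogiri: treat each `detailSeperator` row as a group boundary, skip the header row of each entry (its first cell is `detailData`), and join the `detailData` text of the remaining rows (their first cell is `detailLabels`). The function name `join_descriptions` and the shortened sample are illustrative assumptions, and the sample is a well-formed copy of the markup, since `xml.etree.ElementTree` does not tolerate broken HTML.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the table from the question (assumption:
# a real page would first go through an HTML parser that tolerates bad markup).
SAMPLE = """
<table class="detailRecordTable"><tbody>
<tr><td class="detailSeperator" colspan="6"> </td></tr>
<tr><td class="detailData"><b>02/03/2016</b></td><td class="detailLabels"> </td>
    <td class="detailData alignData" colspan="3"><b>Disposed- Pet for Writ Denied</b></td></tr>
<tr><td class="detailLabels" colspan="2"> </td>
    <td class="detailData" colspan="2">ORDER ISSUED: PETITION FOR WRIT OF MANDAMUS DENIED.</td></tr>
<tr><td class="detailSeperator" colspan="6"> </td></tr>
<tr><td class="detailData"><b>01/29/2016</b></td><td class="detailLabels"> </td>
    <td class="detailData alignData" colspan="3"><b>Suggestions in Opposition</b></td></tr>
<tr><td class="detailLabels" colspan="2"> </td>
    <td class="detailData" colspan="2">SUGGESTIONS IN OPPOSITION; Electronic Filing Certificate of Service.</td></tr>
<tr><td class="detailLabels" colspan="2"> </td>
    <td class="detailData" colspan="2"> <b>Filed By:</b>JOHN RICHARD SHANK JR
    </td></tr>
<tr><td class="detailLabels" colspan="2"></td>
    <td class="detailData" colspan="2"> <b>On Behalf Of:</b>ELIZABETH DAVIS
    </td></tr>
</tbody></table>
""".strip()


def join_descriptions(table):
    """Return one joined description string per entry in the table."""
    groups, current = [], []
    for tr in table.iter("tr"):
        cells = tr.findall("td")
        if not cells:
            continue
        first_classes = cells[0].get("class", "").split()
        if "detailSeperator" in first_classes:
            # A separator row closes the entry collected so far.
            if current:
                groups.append(" ".join(current))
            current = []
        elif "detailLabels" in first_classes:
            # Description rows start with a label cell; header rows do not.
            for td in cells:
                if "detailData" in td.get("class", "").split():
                    text = " ".join("".join(td.itertext()).split())
                    if text:
                        current.append(text)
    if current:
        groups.append(" ".join(current))
    return groups


if __name__ == "__main__":
    for description in join_descriptions(ET.fromstring(SAMPLE)):
        print(description)
```

The same grouping idea should carry over to Nokogiri: walk the rows once, start a new string at each separator row, and append to it instead of pushing every td onto the result array.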

Related

table width not set in iTextSharp when converting html to PDF

I am trying to convert HTML to PDF, but the widths set on the HTML table tags are not applied correctly.
This is my HTML:
<table cellpadding='4' cellspacing='4' border='0' width='100%' style='width:100%'>
<tr style='background-color:#000000'>
<td colspan='2' align='center' valign='middle' width='100%'>
<font face='Calibri' size='6' color='#FFFFFF'>Retail Natural Gas Deal Sheet</font>
</td>
</tr>
<tr>
<td colspan='2' width='100%'> </td>
</tr>
<tr>
<td width='90%' style='width:90%'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr>
<td width='42%'>
<font face='Calibri' size='4'>
<b>Deal Number</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='4'>
<b>15RTLG7149</b>
</font>
</td>
</tr>
<tr>
<td colspan='3' width='100%'> </td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Trade Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Price Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Authorize Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td colspan='3' width='100%'> </td>
</tr>
</table>
</td>
<td width='10%' style='width:10%' valign='top'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr>
<td colspan='2' align='center' width='100%'>
<font face='Calibri' size='2'>
<b>Volumes (MMMBtu)</b>
</font>
</td>
</tr>
</table>
</td>
</tr>
</table>
This is the C# code I am using to generate the PDF:
Document pdfDoc = new Document();
//Document pdfDoc = new Document(PageSize.A4, 10f, 10f, 10f, 0f);
//HTMLWorker htmlparser = new HTMLWorker(pdfDoc);
using (MemoryStream memoryStream = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(pdfDoc, memoryStream);
pdfDoc.Open();
XMLWorkerHelper.GetInstance().ParseXHtml(writer, pdfDoc, new StringReader(HTML));
pdfDoc.Close();
byte[] bytes = memoryStream.ToArray();
memoryStream.Close();
return bytes;
}
But this is how it is rendered in the PDF. I have not been able to find the right answer and would appreciate any help.
http://i.stack.imgur.com/8WyBh.jpg
I copied and pasted your HTML into a text editor (Notepad++; marked 1 in the screenshot below). I opened this HTML in a browser (Chrome; marked 2 in the screenshot below). I converted the HTML to PDF (using XML Worker; the PDF is marked 3 in the screenshot below).
When I compare what I see in the browser with what I see in the PDF, I have the impression that iText's XML Worker is doing a great job. There isn't that much difference between what I see in the browser and what I see in the PDF.
However, when I look at your HTML, I see inconsistencies. Have you tried viewing your HTML in a browser? It doesn't look the way you expected, does it? Seems like the problem isn't caused by iText, but it's caused by the way you create your HTML. Please tell us if the HTML looks the way you expect in a browser. If not, please explain what you expect. Right now, it is hard to understand the problem as what I see in the PDF corresponds really well with what I see in a browser.
Update:
In your question, you didn't add any borders (border='0') and it was hard to see what you mean. I've now added borders, so that the HTML looks like this:
You want the PDF to look like this:
This is very easy if you simplify your HTML like this:
<table cellpadding='4' cellspacing='4' border='1' width='100%' style='width:100%'>
<tr style='background-color:#000000'>
<td colspan='2' align='center' valign='middle'>
<font face='Calibri' size='6' color='#FFFFFF'>XXXX XXXXX XXXXX</font>
</td>
</tr>
<tr>
<td colspan='2'> </td>
</tr>
<tr>
<td width='90%' style='width:90%'>
<table cellpadding='0' cellspacing='0' border='1' width='100%'>
<tr>
<td width='42%'>
<font face='Calibri' size='4'>
<b>Deal Number</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='4'>
<b>XXXXXXXXXX</b>
</font>
</td>
</tr>
<tr>
<td colspan='3' width='100%'> </td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Trade Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Price Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td width='42%'>
<font face='Calibri' size='2'>
<b>Authorize Date</b>
</font>
</td>
<td width='1%'> </td>
<td width='57%'>
<font face='Calibri' size='2'>February 09, 2015</font>
</td>
</tr>
<tr>
<td colspan='3' width='100%'> </td>
</tr>
</table>
</td>
<td width='10%' style='width:10%' valign='top'>
<table cellpadding='0' cellspacing='0' border='1' width='100%'>
<tr>
<td colspan='2' align='center' width='100%'>
<font face='Calibri' size='2'>
<b>Xxxxxxx (XXXXXXX)</b>
</font>
</td>
</tr>
</table>
</td>
</tr>
</table>
What did I change? I removed the width='100%' in the <td> tags where colspan='2'. This information is ambiguous: you are saying that the two columns together should take 100% of the width. However:
You already defined this in the <table> tag where you also have width='100%', and
If a cell has colspan 2 and you say that this cell should take 100% of the width, there is no way to tell the width of each column. It doesn't make sense to put width='100%' there.
iTextSharp defines the width of the columns based on the first row where it can find information about the width. In this case, the first row with such information is a row with colspan 2 in a table with 2 columns. You define the width of these 2 columns combined as 100%, and iTextSharp interprets this as if you want to say that each column takes 50% (100% / 2) of the width.
If you remove this ambiguous information, iText will define the width of the columns based on the widths defined in the third row (which is what you expect).
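The arithmetic can be illustrated with a toy model in Python. This is not iTextSharp's code, only a sketch of the behaviour described above under the assumption that the first row whose cells all carry a width decides the columns and that a spanning cell's width is split evenly; `column_widths` and the `(colspan, width)` tuples are hypothetical names for illustration.

```python
def column_widths(rows, ncols):
    """Toy model: the first row with complete width info defines the columns.

    Each row is a list of (colspan, width_percent_or_None) tuples.
    """
    for row in rows:
        widths = []
        for colspan, width in row:
            if width is None:
                break  # this row has no usable width information
            # A cell spanning several columns shares its width evenly.
            widths.extend([width / colspan] * colspan)
        else:
            if len(widths) == ncols:
                return widths
    return None


# Original HTML: the first row is one cell with colspan 2 and width 100%.
original = [[(2, 100)], [(2, 100)], [(1, 90), (1, 10)]]
# Simplified HTML: the width is removed from the colspan-2 cells.
simplified = [[(2, None)], [(2, None)], [(1, 90), (1, 10)]]

if __name__ == "__main__":
    print(column_widths(original, 2))
    print(column_widths(simplified, 2))
```

In this model the original markup resolves to two equal columns, while the simplified markup falls through to the third row and yields the 90/10 split.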

How to create a two column email newsletter

I am trying to create a two column email flyer but I'm having trouble with the coding as Outlook hates CSS.
I'm using tables to keep it as simple as possible but I want two separate tables on the left and the right so I can add data into it as I wish.
I tried using float left and right on the two tables but Outlook ignores this style.
I know the two grey tables at the bottom are each in their own separate "holder" tables but this is so I can duplicate the grey "data" tables for when I add new articles.
<table class="all" width="auto" height="auto" border="0" cellspacing="0"><tr><td height="504">
<table width="750" height="140" border="0" cellspacing="0">
<tr>
<td width="200" valign="bottom" bgcolor="#E6E6E6"> </td>
<td width="345" align="center" valign="bottom" bgcolor="#E6E6E6"> </td>
<td width="152" align="center" valign="bottom" bgcolor="#E6E6E6"> </td>
<td width="45" align="center" valign="bottom" bgcolor="#E6E6E6"> </td>
</tr>
<tr>
<td width="200" valign="bottom" bgcolor="#E6E6E6"> </td>
<td align="center" valign="bottom" bgcolor="#E6E6E6"><font color="#111111" face="Arial Narrow" size="+2">DECEMBER NEWSLETTER</font></td>
<td width="152" align="center" valign="bottom" bgcolor="#E6E6E6"><font size="2"><strong>#4 - <span class="orange">04.12.13</span></strong></font></td>
<td width="45" align="center" valign="bottom" bgcolor="#E6E6E6"> </td>
</tr>
</table>
<table width="750" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="75" height="50" bgcolor="#E6E6E6" scope="row"> </td>
<td width="600" rowspan="2" scope="row"><img src="http://placehold.it/600x200"/></td>
<td width="75" bgcolor="#E6E6E6" scope="row"> </td>
</tr>
<tr>
<td width="75" height="81" scope="row"> </td>
<td scope="row"> </td>
</tr>
</table>
<table class="holder" width="750" border="0" cellspacing="0" cellpadding="0">
<tr>
<td valign="top" scope="row">
<table class="inlinetableleft" width="360">
<tr>
<td width="371" align="left">
<!------------LEFT COLUMN------------------>
<table width="360" border="0" cellspacing="0" cellpadding="0">
<tr>
<th height="103" colspan="4" align="left" valign="middle" bgcolor="#CCCCCC" scope="row"> </th>
</tr>
</table>
<!--------------LEFT COLUMN END------------->
</td>
</tr>
</table>
<table class="inlinetableright" width="360">
<tr>
<td align="left">
<!------------RIGHT COLUMN------------------>
<table width="360" border="0" cellspacing="0" cellpadding="0">
<tr>
<td height="106" align="left" bgcolor="#CCCCCC" scope="row"> </td>
</tr>
</table>
<!-----------RIGHT COLUMN END-------------->
</td></tr>
</table>
</td>
</tr>
</table>
Here is a fiddle of my newsletter so far. It is the bottom two grey tables that I want to be side by side.
Fiddle
For HTML emails, nested tables are your friend :)
JSFiddle
Note: the border around the table is just to show you where the tables are.
<table border="0" width="600" cellpadding="0" cellspacing="0" align="center">
<tr>
<td colspan="2">
header content here
</td>
</tr>
<tr>
<td width="300">
<table border="0" width="300" cellpadding="1" cellspacing="0" align="left">
<tr>
<td>Left Content</td>
</tr>
</table>
</td>
<td width="300">
<table border="0" width="300" cellpadding="1" cellspacing="0" align="left">
<tr>
<td>Right content</td>
</tr>
</table>
</td>
</tr>
</table>

Based on multiple criteria on parent's siblings

I would like to know whether a single XPath expression can combine two conditions: a preceding row that holds a cell of a certain class with a certain text, and a sibling cell at the same level with a certain text.
For example I would like to find the following cells:
<td class="sdawatt_booknow">Book</td>
by looking up a sibling of class sdawatt_classname containing the text Spin, preceded by a td of class sdawatt_banner with the text Monday - 16 September 2013.
Or the following td:
<td class="sdawatt_booknow">Book</td>
if we look for the date of the 'Friday - 13 September 2013'.
Is this doable in XPath?
<table cellspacing="0" cellpadding="0" border="0" style="border-collapse:collapse;" class="sdawatt_outer">
<tbody><tr>
<td class="sdawatt_hdrcell">Time</td>
<td class="sdawatt_hdrcell">Class</td>
<td class="sdawatt_hdrcell">Level</td>
<td class="sdawatt_hdrcell">Spaces</td>
<td class="sdawatt_hdrcell">Location</td>
<td class="sdawatt_hdrcell">Instructors</td>
<td class="sdawatt_hdrcell">Tags</td>
<td class="sdawatt_hdrcell">Info</td>
<td class="sdawatt_hdrcell">Book</td>
</tr><tr>
<td colspan="9" class="sdawatt_banner">Friday - 13 September 2013</td>
</tr><tr class="sdawatt_classrow">
<td class="sdawatt_time">07:45-08:15</td>
<td class="sdawatt_classname">Boxing</td>
<td class="sdawatt_level"> </td>
<td class="sdawatt_spaces">14 Left</td>
<td class="sdawatt_location">Main Studio</td>
<td class="sdawatt_resources"> Darren</td>
<td class=" sdawatt_infotags"></td>
<td class="sdawatt_info"><img src="https://v4.fitnessandlifestylecentre.com/webaccess/TimetableView/information.gif" class="tiptip" /></td>
<td class="sdawatt_booknow">Book</td>
</tr><tr class="sdawatt_classrow">
<td class="sdawatt_time">12:00-12:45</td>
<td class="sdawatt_classname">Spin</td>
<td class="sdawatt_level"> </td>
<td class="sdawatt_spaces">8 Left</td>
<td class="sdawatt_location">Main Studio</td>
<td class="sdawatt_resources"> Matt</td>
<td class=" sdawatt_infotags"></td>
<td class="sdawatt_info"><img src="https://v4.fitnessandlifestylecentre.com/webaccess/TimetableView/information.gif" class="tiptip" /></td>
<td class="sdawatt_booknow">Book</td>
</tr><tr>
<td colspan="9" class="sdawatt_banner">Monday - 16 September 2013</td>
</tr><tr class="sdawatt_classrow">
<td class="sdawatt_time">13:00-13:45</td>
<td class="sdawatt_classname">Spin</td>
<td class="sdawatt_level"> </td>
<td class="sdawatt_spaces">12 Left</td>
<td class="sdawatt_location">Main Studio</td>
<td class="sdawatt_resources"> Marzena</td>
<td class=" sdawatt_infotags"></td>
<td class="sdawatt_info">
<img src="https://v4.fitnessandlifestylecentre.com/webaccess/TimetableView/information.gif" class="tiptip" /></td>
<td class="sdawatt_booknow">Book</td>
</tr>
</tbody></table>
//tr[
contains(
td[@class="sdawatt_banner"],
"Monday - 16 September 2013")
]
/following-sibling::tr[
contains(
td[@class="sdawatt_classname"],
"Spin")
]/td[@class="sdawatt_booknow"]
yields
<td class="sdawatt_booknow">
Book
</td>
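One point worth keeping in mind: `following-sibling::tr` reaches every later row, so with the Friday banner the expression would also reach the Spin row listed under Monday. A minimal Python sketch of a stricter selection follows, assuming a well-formed, shortened copy of the table; it only accepts rows that sit between the chosen banner and the next one. The function name `book_cells` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the timetable (assumption: fewer columns).
SAMPLE = """
<table class="sdawatt_outer"><tbody>
<tr><td class="sdawatt_hdrcell">Time</td><td class="sdawatt_hdrcell">Class</td>
    <td class="sdawatt_hdrcell">Book</td></tr>
<tr><td colspan="3" class="sdawatt_banner">Friday - 13 September 2013</td></tr>
<tr class="sdawatt_classrow"><td class="sdawatt_time">07:45-08:15</td>
    <td class="sdawatt_classname">Boxing</td><td class="sdawatt_booknow">Book</td></tr>
<tr class="sdawatt_classrow"><td class="sdawatt_time">12:00-12:45</td>
    <td class="sdawatt_classname">Spin</td><td class="sdawatt_booknow">Book</td></tr>
<tr><td colspan="3" class="sdawatt_banner">Monday - 16 September 2013</td></tr>
<tr class="sdawatt_classrow"><td class="sdawatt_time">13:00-13:45</td>
    <td class="sdawatt_classname">Spin</td><td class="sdawatt_booknow">Book</td></tr>
</tbody></table>
""".strip()


def book_cells(table, banner_text, class_name):
    """Return (time, booknow cell) pairs for matching rows under one banner."""
    matches = []
    in_section = False
    for tr in table.iter("tr"):
        banner = tr.find("td[@class='sdawatt_banner']")
        if banner is not None:
            # Every banner row opens a new section and closes the previous one.
            in_section = banner_text in (banner.text or "")
            continue
        name = tr.find("td[@class='sdawatt_classname']")
        if in_section and name is not None and class_name in (name.text or ""):
            time = tr.findtext("td[@class='sdawatt_time']", default="")
            for cell in tr.findall("td[@class='sdawatt_booknow']"):
                matches.append((time, cell))
    return matches


if __name__ == "__main__":
    table = ET.fromstring(SAMPLE)
    for time, cell in book_cells(table, "Friday - 13 September 2013", "Spin"):
        print(time, cell.text)
```

The time is returned alongside each cell only so the two otherwise identical Book cells can be told apart.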

How do I retrieve multiple row node data from an html table in XPATH?

Sometime during the dark ages a script was built that outputs the following HTML:
...
<TABLE BORDER=0 FRAME=ALL_FRAMES RULES=ALL_RULES ALIGN=CENTER BGCOLOR="ffffe5">
<CAPTION ALIGN=TOP>
<FONT COLOR=009594 SIZE=-1><B>Access Information</B></FONT>
</CAPTION>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT COLOR=black SIZE=-1><B>Access Circuit(s):</B></FONT>
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT 111**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT COLOR=black SIZE=-1><B>Other Circuit(s):</B></FONT>
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT AAA**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT BBB**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT CCC**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT COLOR=black SIZE=-1><B>Customer:</B></FONT>
</TD>
...
Sorry, I would show you the table layout but I don't know how without <table> on SO
How can I use XPATH (in PHP) to collect only each DATA TO COLLECT section? So far I've been able to retrieve the first row with //*[*='Access Circuit(s):']/following-sibling::td[1].
Things to note:
This is only a small section of a large document.
I cannot change the script's output.
I won't know how many rows there will be (figure 0 to 6).
The data should be expected to always be in the same "column".
I may only have XPATH version 1. But version 2 answers are still welcomed.
The expression I came up with is this:
//TR[(.//B[.='Access Circuit(s):']) or ((./preceding-sibling::TR//B[.='Access Circuit(s):']) and (./following-sibling::TR//B[.='Customer:']))]//TD[2]
returns
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT 111**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT AAA**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT BBB**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT CCC**</TD>
It uses the knowledge that the first row contains Access Circuit(s): and the first uncollected row contains Customer:. If you can't be sure of either one of those, then I think it can't be done with a single XPath expression.
Step-by-step
1. //TR[
2. (.//B[.="Access Circuit(s):"])
3. or ( (./preceding-sibling::TR//B[.="Access Circuit(s):"])
4. and (./following-sibling::TR//B[.="Customer:"]) )
5. ]//TD[2]
Means
1. all TR nodes
2. that either contain "Access Circuit(s):"
3. or
- (3.) are positioned after "Access Circuit(s):"
- (4.) and are positioned before "Customer:"
5. all TD nodes that are the second TD of their parents
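The same selection can also be written as a plain loop, which may be easier to adapt when the end marker changes. The sketch below is in Python rather than PHP and assumes a well-formed, shortened copy of the markup; `collect_between` is a hypothetical helper that starts at the row containing the start label and stops at the first later row containing the stop label.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the table (assumption: attributes dropped,
# "&nbsp" written as the numeric entity so the XML parser accepts it).
SAMPLE = """
<TABLE>
<TR><TD><FONT><B>Access Circuit(s):</B></FONT></TD><TD>DATA TO COLLECT 111</TD>
    <TD><FONT><B>Other Circuit(s):</B></FONT></TD><TD>&#160;</TD></TR>
<TR><TD>&#160;</TD><TD>DATA TO COLLECT AAA</TD><TD>&#160;</TD><TD>&#160;</TD></TR>
<TR><TD>&#160;</TD><TD>DATA TO COLLECT BBB</TD><TD>&#160;</TD><TD>&#160;</TD></TR>
<TR><TD><FONT><B>Customer:</B></FONT></TD><TD>NOT COLLECTED</TD></TR>
</TABLE>
""".strip()


def collect_between(table, start_label, stop_label, column=1):
    """Collect the text of one column from the start row up to the stop row."""
    values = []
    collecting = False
    for tr in table.iter("TR"):
        labels = [b.text for b in tr.iter("B")]
        if collecting and stop_label in labels:
            break  # first uncollected row reached
        if start_label in labels:
            collecting = True
        if collecting:
            cells = tr.findall("TD")
            if len(cells) > column:
                values.append("".join(cells[column].itertext()).strip())
    return values


if __name__ == "__main__":
    table = ET.fromstring(SAMPLE)
    print(collect_between(table, "Access Circuit(s):", "Customer:"))
```

The `column` argument is zero-based, so the default of 1 corresponds to `TD[2]` in the XPath expression.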

Need query for XPath that finds all <tr> elements that contain 7 <td> elements

Hello and hopefully thanks for the help.
Honestly I am not very experienced at XPath and I am hoping a guru out there will have a quick answer for me.
I am scraping a web page for data. The defining aspect of the data I want is that it is contained in a row <tr> that has 7 <td> elements. Each <td> element has one of the pieces of data I need to import. I am using the HTML Agility Pack on CodePlex to grab the data, but I can't seem to figure out how to define the query.
Contained in the web page is a section like this:
<table border="0" cellpadding="3" cellspacing="1" width="100%">
<tr class="bgWhite" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
<td class="dataHdrText02" valign="top" width="50" align="center"><nobr>SYMBOL</nobr></td>
<td class="dataHdrText02" valign="top" align="center">PERIOD</td>
<td class="dataHdrText02" valign="top" align="center" width="*">EVENT TITLE</td>
<td class="dataHdrText02" valign="top" align="center">EPS ESTIMATE</td>
<td class="dataHdrText02" valign="top" align="center">EPS ACTUAL</td>
<td class="dataHdrText02" valign="top" align="center">PREV. YEAR ACTUAL</td>
<td class="dataHdrText02" valign="top" align="center"><nobr>DATE/TIME (ET)</nobr></td>
</tr>
<tr class="bgWhite">
<td align="center" width="50"><nobr>CSCO </nobr></td>
<td align="center">Q4 2011</td>
<td align="left" width="*">Q4 2011 CISCO Systems Inc Earnings Release</td>
<td align="center">$ 0.38 </td>
<td align="center">n/a </td>
<td align="center">$ 0.43 </td>
<td align="center"><nobr>10-Aug-11</nobr></td>
</tr>
<tr class="bgWhite">
<td align="center" width="50"><nobr>CSCO  </nobr></td>
<td align="center">Q3 2011</td>
<td align="left" width="*">Q3 2011 Cisco Systems Earnings Release</td>
<td align="center">$ 0.37 </td>
<td align="center">$ 0.42 </td>
<td align="center">$ 0.42 </td>
<td align="center"><nobr>11-May-11 AMC</nobr></td>
</tr>
<tr class="bgWhite" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
<td align="center" colspan="7"><img src="/format/cb/images/spacer.gif" width="1" height="4"></td>
</tr>
</table>
My goal is to grab the earnings event data and place it into a database for analysis. My original thought was to grab all <tr> elements with 7 <td> elements then work with that data. Any advice or alternative suggestions would be welcome.
This should do it for you.
//tr[count(td)=7]
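Note that the header row also has seven td elements, so it matches as well, while the spacer row has a single td with colspan 7 and does not. Below is a minimal Python sketch of the same filter, assuming a well-formed, shortened copy of the table; with the HTML Agility Pack the XPath expression itself would be used instead. `rows_with_n_cells` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the earnings table (assumption: attributes
# dropped and only one data row kept).
SAMPLE = """
<table>
<tr><td><nobr>SYMBOL</nobr></td><td>PERIOD</td><td>EVENT TITLE</td>
    <td>EPS ESTIMATE</td><td>EPS ACTUAL</td><td>PREV. YEAR ACTUAL</td>
    <td><nobr>DATE/TIME (ET)</nobr></td></tr>
<tr><td><nobr>CSCO </nobr></td><td>Q4 2011</td>
    <td>Q4 2011 CISCO Systems Inc Earnings Release</td>
    <td>$ 0.38 </td><td>n/a </td><td>$ 0.43 </td><td><nobr>10-Aug-11</nobr></td></tr>
<tr><td colspan="7"><img src="spacer.gif" /></td></tr>
</table>
""".strip()


def rows_with_n_cells(table, n=7):
    """Return the cell texts of every row that has exactly n direct td children."""
    rows = []
    for tr in table.iter("tr"):
        cells = tr.findall("td")  # direct children only, like count(td)
        if len(cells) == n:
            rows.append([" ".join("".join(td.itertext()).split()) for td in cells])
    return rows


if __name__ == "__main__":
    header, *data = rows_with_n_cells(ET.fromstring(SAMPLE))
    for row in data:
        print(dict(zip(header, row)))
```

Treating the first matching row as the header is one way to separate it from the data rows before loading them into a database.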
