Using HtmlAgilityPack to parse HTML with Headers, Tables, Rows, Cells - html-agility-pack

I'm trying to use HtmlAgilityPack to parse a webpage's HTML and extract the rows and cells of its tables.
The code sample almost works, except I get an exception on the Table collection. I presume this might have something to do with Header not formatted as a collection (and I cannot modify the source of the HTML).
Please help with the code, or please suggest alternatives or workarounds.
The structure is:
Header -> Table -> Row -> Cell
There is a collection of Headers (each containing the date); each Header contains a collection of Tables, each Table contains a collection of Rows, and each Row contains a collection of Cells.
string html = @"
<html>
<body>
<h3>February 8, 2014</h3>
<table>
<tr>
<td><b>Site</b></td>
<td><b>ColumnA</b></td>
<td><b>ColumnB</b></td>
<td><b>ColumnC</b></td>
</tr>
<tr>
<td>SiteA</td>
<td>3</td>
<td>6</td>
<td>3</td>
</tr>
<tr>
<td>SiteB</td>
<td>4</td>
<td>6</td>
<td>2</td>
</tr>
<tr>
<td>SiteC</td>
<td>4</td>
<td>9</td>
<td>4</td>
</tr>
</table>
<h3>February 7, 2014</h3>
<table>
<tr>
<td><b>Site </b></td>
<td><b>ColumnA</b></td>
<td><b>ColumnB</b></td>
<td><b>ColumnC</b></td>
</tr>
<tr>
<td>SiteA</td>
<td>2</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>SiteB</td>
<td>1</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>SiteC</td>
<td>2</td>
<td>6</td>
<td>1</td>
</tr>
</table>
</body>
</html>
";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode header in doc.DocumentNode.SelectNodes("//h3"))
{
    string headerDate = header.InnerText;
    foreach (HtmlNode table in header.SelectNodes("table")) //System.NullReferenceException
    {
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            Console.Write(headerDate);
            foreach (HtmlNode cell in row.SelectNodes("td"))
            {
                Console.Write("\t" + cell.InnerText);
            }
            Console.WriteLine();
        }
    }
}
Expected Results:
February 8, 2014 Site ColumnA ColumnB ColumnC
February 8, 2014 SiteA 3 6 3
February 8, 2014 SiteB 4 6 2
February 8, 2014 SiteC 4 9 4
February 7, 2014 Site ColumnA ColumnB ColumnC
February 7, 2014 SiteA 2 4 1
February 7, 2014 SiteB 1 1 2
February 7, 2014 SiteC 2 6 1
Thank you. Jake.

You're iterating over the headers as if you're expecting the tables to be within the header tags, but the tables are not within the header tags, despite what the misleading indentation appears to suggest. The header tags are siblings of the tables, not parents.
<h3>February 8, 2014</h3> <-- </h3> closes the header tag
<table> <-- this is the next element at the same level, not a child
<tr>
<td><b>Site</b></td>
<td><b>ColumnA</b></td>
<td><b>ColumnB</b></td>
<td><b>ColumnC</b></td>
</tr>
</table>
Keep in mind that indentation/whitespace is meaningless in html. It's the tags that rule all.
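One possible way to act on this, offered as a sketch rather than a definitive fix: from each h3, step sideways to the first table that follows it using the XPath axis following-sibling. The sketch assumes every header is followed by its own table, as in the sample HTML, and uses a shortened copy of that HTML. The null checks are there because SelectNodes returns null when nothing matches, which is what caused the original exception.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened version of the HTML from the question.
        string html = @"
<html><body>
<h3>February 8, 2014</h3>
<table>
<tr><td><b>Site</b></td><td><b>ColumnA</b></td></tr>
<tr><td>SiteA</td><td>3</td></tr>
</table>
<h3>February 7, 2014</h3>
<table>
<tr><td><b>Site</b></td><td><b>ColumnA</b></td></tr>
<tr><td>SiteA</td><td>2</td></tr>
</table>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var headers = doc.DocumentNode.SelectNodes("//h3");
        if (headers == null) return;

        foreach (HtmlNode header in headers)
        {
            string headerDate = header.InnerText.Trim();

            // The table is a sibling of the header, not a child:
            // take the first <table> that follows this <h3>.
            HtmlNode table = header.SelectSingleNode("following-sibling::table[1]");
            if (table == null) continue;

            var rows = table.SelectNodes("tr");
            if (rows == null) continue;

            foreach (HtmlNode row in rows)
            {
                Console.Write(headerDate);
                var cells = row.SelectNodes("td");
                if (cells != null)
                {
                    foreach (HtmlNode cell in cells)
                        Console.Write("\t" + cell.InnerText.Trim());
                }
                Console.WriteLine();
            }
        }
    }
}
```

Note that following-sibling::table[1] picks the nearest later table regardless of what lies in between, so a header with no table of its own would be paired with the next header's table.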

Related

Html Agility Pack loop through Specific Row

I have a table like this
<table>
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Gender</th>
</tr>
</thead>
<tbody>
<tr id="data1">
</tr>
<tr>
</tr>
</tbody>
I want to use Html Agility Pack to parse a specific row, i.e. I want to display the row that follows the row with id=data1.
Below is the code I am trying:
//Selecting Document Node....
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(data);
//Selecting Specific Node...
var tableNodes = doc.DocumentNode.SelectNodes("//table");
Your XPath should look like this:
//table/tbody/tr[@id='data1']/following-sibling::tr
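As a rough sketch of how that expression could be used with Html Agility Pack: the data string below is a hypothetical stand-in for the question's data variable, with invented cell contents, and the trailing [1] is an addition that restricts the match to the single row immediately after data1.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the `data` string in the question;
        // the cell contents are invented for illustration.
        string data = @"
<table>
<tbody>
<tr id='data1'><td>first</td></tr>
<tr><td>second</td></tr>
<tr><td>third</td></tr>
</tbody>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(data);

        // [1] keeps only the row immediately after the one with id='data1'.
        HtmlNode next = doc.DocumentNode.SelectSingleNode(
            "//table/tbody/tr[@id='data1']/following-sibling::tr[1]");

        // SelectSingleNode returns null when there is no match.
        if (next != null)
            Console.WriteLine(next.InnerText.Trim());
    }
}
```

Without the [1], SelectNodes with the same expression would return every row after data1 rather than just the next one.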

PDF Layout with MigraDoc

I am trying to achieve following matrix kind of layout:
TABLE1,1 TABLE1,2
CHART2,1 TABLE2,2
TABLE3 --> occupies whole row
CHART4 --> occupies whole row
CHART5,1 CHART5,2
................. List goes on...
These components may span multiple pages. What is the best way to place them side by side and still be able to view them in MigraDoc?
CHART5,1 could be a combination of 4 charts in one cell.
In HTML view I can use following analogy:
<TABLE>
<TR>
<TD>TABLE1,1</TD> <TD>TABLE1,2 </TD>
</TR>
<TR>
<TD>CHART2,1</TD> <TD>TABLE2,2 </TD>
</TR>
<TR>
<TD colspan="2">TABLE3</TD>
</TR>
<TR>
<TD colspan="2">CHART4</TD>
</TR>
<TR>
<TD>CHART5,1</TD> <TD>CHART5,2 </TD>
</TR>
</TABLE>
The MigraDoc equivalent for colspan=2 is MergeRight=1. This is a property of the Cell class.
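A minimal sketch of how that might look, assuming a two-column layout table; the column widths and the LayoutSketch class name are arbitrary choices for illustration, and rendering the document to PDF is left out.

```csharp
using MigraDoc.DocumentObjectModel;
using MigraDoc.DocumentObjectModel.Tables;

class LayoutSketch
{
    static Document Build()
    {
        var document = new Document();
        Section section = document.AddSection();

        // Outer two-column layout table; the widths are arbitrary.
        Table layout = section.AddTable();
        layout.AddColumn("8cm");
        layout.AddColumn("8cm");

        // A row with two side-by-side cells.
        Row row1 = layout.AddRow();
        row1.Cells[0].AddParagraph("TABLE1,1");
        row1.Cells[1].AddParagraph("TABLE1,2");

        // A row whose first cell spans both columns (like colspan=2).
        Row row3 = layout.AddRow();
        row3.Cells[0].MergeRight = 1;
        row3.Cells[0].AddParagraph("TABLE3");

        return document;
    }

    static void Main()
    {
        // Builds the document object model only; no PDF is rendered here.
        Document document = Build();
    }
}
```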

Scraping page with correct xpath using Mechanize and nokogiri

I am trying to access data contained in a table that is itself contained in a table with class ='L1'.
So basically my html structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...ect...ect
</table>
</table>
I need to capture the data in every <a> </a> inside the second <td> of each <tr> </tr>, but only starting with the second <tr> of the table.
So far I came up with that:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start only from the second <tr> onward (second <tr> included).
What would be the right code to do this ?
You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td[2]/a[1]")
As the comments on that SO answer note, XPath positions count from 1, so position()>1 skips only the first tr and starts at the second.

Creating grid with 2 axis

I am trying to create a grid that contains Events on the x axis and Requirements on the y axis and is filled with solutions.
Event1 Event2
Req1 Sol1
Req2 Sol2
My Model contains a list of Events, which contains their related Requirements, which contain their related Solutions. Every Event can have 0 or more Requirements and each Requirement can have 0 or more solutions.
How can I accurately show this grid in razor?
Here is my attempt:
<table border="1">
    <tr>
        <td class="span-6"></td>
        @foreach(var events in Model.Events)
        {
            <td colspan="3">
                @events.Name
            </td>
            requirementsList.AddRange(events.Requirements);
        }
    </tr>
    @foreach(var req in requirementsList)
    {
        <tr>
            <td>
                @req.Name
            </td>
            <!--Insert logic to align solution with Event-->
            <td>
                @req.Solution
            </td>
        </tr>
    }
</table>
Of course this is only showing all solutions in the first event column.
I've done a similar thing in PHP for my timesheet system, where I want the hours worked by each employee on each day (between 2 dates):
| 1 May | 2 May | ... May
Fred | 6 |
George | | 4 |
I used 3 foreach loops: first I output the dates, then I went round each employee, and inside the employee loop I looped round the dates again.
So for you it would be:
<table border="1">
    <tr>
        <td class="span-6"></td>
        @foreach(var events in Model.Events)
        {
            <td colspan="3">
                @events.Name
            </td>
            requirementsList.AddRange(events.Requirements);
        }
    </tr>
    @foreach(var req in requirementsList)
    {
        <tr>
            <td>
                @req.Name
            </td>
            @foreach(var events in Model.Events)
            {
                // Emit exactly one cell per event so the solutions stay aligned.
                // Assumes a requirement belongs to an event when it appears
                // in that event's Requirements list.
                if (events.Requirements.Contains(req))
                {
                    // Output the solution; colspan matches the header cells
                    <td colspan="3">
                        @req.Solution
                    </td>
                }
                else
                {
                    // Output an empty cell
                    <td colspan="3"></td>
                }
            }
        </tr>
    }
</table>

html table max row and ajax navigation

I have a PHP page that returns an HTML table like this:
<table>
<tr>
<td>First Row data</td><td>Second Row data</td><td>Third Row data</td>
</tr>
<tr>
<td>First Row data</td><td>Second Row data</td><td>Third Row data</td>
</tr>
<tr>
<td>First Row data</td><td>Second Row data</td><td>Third Row data</td>
</tr>
<tr>
<td>etc...</td>
</tr>
</table>
What I want to do is add an ajax numerical pagination system (1 2 ... 6) that lets me display at most 3 rows at a time and reach the others through the navigation.
Do you know where can I find a ready script that can help to solve this problem?
Is this about what you're looking for?
http://www.dynamicdrive.com/dynamicindex17/ajaxpaginate/index.htm

Resources