HtmlAgilityPack algorithm question

I’m using HtmlAgilityPack to obtain some Html from a web site.
Here is the received Html:
<table class="table">
<tr>
<td>
<table class="innertable">...</table>
</td>
</tr>
<tr>
<td colspan="2"><strong>Contact</strong></td>
</tr>
<tr>
<td colspan="2">John Doe</td>
</tr>
<tr>
<td colspan="2">Jane Doe</td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td><strong>Units</strong></td>
<td>32</td>
</tr>
<tr>
<td><strong>Year</strong></td>
<td>1998</td>
</tr>
</table>
The Context:
I’m using the following code to get the first <table>:
var table = document.DocumentNode.SelectNodes("//table[@class='table']").FirstOrDefault();
I’m using the following code to get the inner table:
var innerTable = table.SelectNodes(".//table[@class='innertable']").FirstOrDefault();
So far so good!
I need to get some information from the first table and some from the inner table.
Since I begin with the information from the first table I need to skip the first row (which holds the inner table) so I do the following:
var tableCells = table.SelectNodes("tr[position() > 1]/td");
Since I now have all the cells from the first table excluding the inner table, I start doing the following:
string contact1 = HttpUtility.HtmlDecode(tableCells[1].InnerHtml);
string contact2 = HttpUtility.HtmlDecode(tableCells[2].InnerHtml);
string units = HttpUtility.HtmlDecode(tableCells[5].InnerHtml);
string years = HttpUtility.HtmlDecode(tableCells[7].InnerHtml);
The problem:
I’m getting the values I want by hardcoding the indexes into tableCells[], on the assumption that the layout would not change. Unfortunately, it does change.
In some cases I do not have a “Jane Doe” row (as shown in the above Html), this means I may or may not have two contacts.
Because of this, I can’t hardcode the indexes since I might end up having the wrong data in the wrong variables.
So I need to change my approach...
Does anyone know how I could improve my algorithm so that it handles one or two contacts and, ideally, avoids hardcoded indexes?
Thanks in advance!
vlince

There is never a single solution to this kind of problem. Here is an XPath expression that seems to do the job, though:
// Load the page, then print every non-blank text cell in the rows after the "Contact" header row.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(yourHtmlFile);
// doc.Save(Console.Out); // optional: dump the parsed document while debugging
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//tr[td/strong/text() = 'Contact']/following-sibling::tr/td/text()[. != ' ']"))
{
    Console.WriteLine(node.OuterHtml);
}
This will display:
John Doe
Jane Doe
32
1998
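The XPath above returns a flat list, so the caller still has to work out which values are contacts and which are Units or Year. One way to avoid positional indexes entirely is to walk the rows and let the <strong> labels decide where each value goes. The following is only a rough sketch of that idea, written in Python with the standard library's ElementTree standing in for HtmlAgilityPack. The SAMPLE markup is a simplified, well-formed copy of the table from the question, and the parse_table name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page from the question
# (the inner table is emptied and the spacer cell holds a plain space).
SAMPLE = """
<table class="table">
  <tr><td><table class="innertable"></table></td></tr>
  <tr><td colspan="2"><strong>Contact</strong></td></tr>
  <tr><td colspan="2">John Doe</td></tr>
  <tr><td colspan="2">Jane Doe</td></tr>
  <tr><td colspan="2"> </td></tr>
  <tr><td><strong>Units</strong></td><td>32</td></tr>
  <tr><td><strong>Year</strong></td><td>1998</td></tr>
</table>
"""


def parse_table(markup):
    """Return (contacts, fields) by reading labels instead of fixed indexes."""
    table = ET.fromstring(markup)
    contacts, fields = [], {}
    in_contacts = False
    for row in table.findall("tr"):  # direct child rows only
        cells = row.findall("td")
        label = cells[0].find("strong") if cells else None
        if label is not None and len(cells) == 1:
            # A lone <strong> cell is a section header such as "Contact".
            in_contacts = (label.text or "").strip() == "Contact"
        elif label is not None and len(cells) == 2:
            # A label/value pair such as Units -> 32.
            in_contacts = False
            fields[(label.text or "").strip()] = (cells[1].text or "").strip()
        elif in_contacts and len(cells) == 1:
            text = (cells[0].text or "").strip()
            if text:
                contacts.append(text)
            else:
                in_contacts = False  # the blank spacer row ends the contact list
    return contacts, fields


contacts, fields = parse_table(SAMPLE)
print(contacts)
print(fields)
```

Because the values are keyed by label, removing the "Jane Doe" row only shortens the contact list; the Units and Year lookups are unaffected. The same loop could be written in C# over the nodes returned by table.SelectNodes("tr").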

Related

Search for a specific row in the table

I have a table that has no unique values in its columns. How can I check whether the table contains a matching row based on several columns?
For example, in the table below I want to check whether there is a <tr> containing the title some and the value 10.
<table border>
<thead>
<tr>
<th>title</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<td>some</td>
<td>10</td>
</tr>
<tr>
<td>some</td>
<td>0</td>
</tr>
</tbody>
</table>

I guess that this should work:
cy.get('tr td:contains("some") + td:contains("10")').should('exist')
This looks inside a tr for a td containing the text "some" that is immediately followed by a sibling td containing the text "10". Note that :contains matches substrings, so a cell with "100" would also match.
Of course, if you don't want hardcoded values, you could do something like this:
cy.get(`tr td:contains("${titleVariable}") + td:contains("${valueVariable}")`).should('exist')
and before this, assign the proper values to titleVariable and valueVariable.
Hope it helps!

Why does adding non-relevant documents improve system performance? And how to evaluate the new result?

Suppose an IR system returns a ranked list of 20 documents in response to a query from a collection of 10,000 documents. If 5,000 non-relevant documents are added to the collection, we find that the same ranked list is returned for the query. That means the new setting, i.e., changing collection size to 15,000, does not change recall and precision on the 20 results. However, it seems that the system performs better in the new setting because more non-relevant documents need to be dealt with.

I'll answer this question based on my own reasoning:
<table border="1">
<tr>
<td> </td>
<td>relevant</td>
<td>nonrelevant</td>
<td> </td>
</tr>
<tr>
<td>retrieved</td>
<td>tp</td>
<td>fp</td>
<td>fixed</td>
</tr>
<tr>
<td>not retrieved</td>
<td>fn</td>
<td>tn</td>
<td> </td>
</tr>
<tr>
<td></td>
<td></td>
<td>tn increases</td>
<td></td>
</tr>
</table>
The retrieved row is fixed because the same 20 documents are returned. Adding non-relevant documents is equivalent to increasing tn, so the improvement only shows up in a measure that includes tn. The new measure could be fn/(fn+tn).
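To make the effect concrete, here is a small worked example in Python. The counts (8 relevant documents among the 20 retrieved, 4 relevant documents missed) are invented for illustration and do not come from the question. Besides the fn/(fn+tn) ratio suggested above, the sketch also computes fall-out, fp/(fp+tn), which is another measure that depends on tn.

```python
def metrics(tp, fp, fn, tn):
    """Return a few measures computed from one contingency table."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fallout": fp / (fp + tn),   # share of non-relevant docs that were retrieved
        "fn_ratio": fn / (fn + tn),  # the ratio proposed in the answer above
    }


# Hypothetical counts: 20 documents retrieved, 8 of them relevant,
# and 4 relevant documents that were not retrieved.
tp, fp, fn = 8, 12, 4

# tn is whatever remains of the collection once tp, fp and fn are accounted for.
before = metrics(tp, fp, fn, tn=10_000 - tp - fp - fn)
after = metrics(tp, fp, fn, tn=15_000 - tp - fp - fn)

for name in before:
    print(f"{name}: {before[name]:.6f} -> {after[name]:.6f}")
```

Precision and recall come out identical in both settings because tp, fp and fn do not change, while both tn-based ratios shrink once the extra 5,000 non-relevant documents are counted.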

XPath - Get table if child is not specific string

Is it possible to do this? I want to get all of a table's "tr"s except the "tr" that has an element with a specific string.
Example:
<div class="span5">
<table class="table">
<tbody>
<tr>
<th>Apple</th>
<td>Red</td>
</tr>
<tr>
<th>Banana</th>
<td>Yellow</td>
</tr>
<tr>
<th>Potato</th>
<td>Brown</td>
</tr>
</tbody>
</table>
</div>
This is a simple example: a table with two columns. I can select the table with the following XPath:
//div[@class='span5']/table[@class='table']
But is it possible to select the table WITHOUT the "tr" that contains:
//th[.='Potato']
I usually solve this by getting the whole table and then filtering the "tr" contents in Python, but I want to filter with XPath to optimize my code a bit, without loading everything into memory.
Thanks

Your XPath can be a bit simpler, like so:
//div[@class='span5']/table[@class='table']//tr[th != 'Potato']

How to Display data in table format in Shoes

Is there any way to display data in table format in Shoes?
<table border ="1">
<tr>
<th>Name</th>
<th> Content</th>
</tr>
@products.each do |product|
<tr>
<td> product.name </td>
<td>product.detail</td>
</tr>
end
</table>

Shoes does not have a native table construct (yet). There is an open issue on shoes4 about adding one, but don't expect it soon or count on it being implemented.
You can mix flows with fixed width values to achieve a table-like effect: put some flows with a fixed width next to each other, and put these in a stack. I made a table-like construct here

Using jQuery Validate to ensure there is at least one row in my table

I am using a table in an entry program to allow the user to add one or more rows of information (much like this article).
I need to ensure that there is at least one row in this table. Google is not really turning much up for me on people doing this. Can anyone give me direction on this? Can I do a count based on a class name?
Here is the layout of my table:
<table id="editorRows">
...
<tbody class="editorRow">
<tr class="row1">
</tr>
<tr class="row2" style="display: none;">
</tr>
<tr class="row3" style="display:none;">
</tr>
</tbody>
</table>
A "row" in this case is the <tbody> tag. Rows 2 and 3 are shown dynamically based on options in row 1.

You can use $("#editorRows tr").length > 0
