Html Agility Pack extra <A> tag - html-agility-pack

The extra <A> in the following causes SelectNodes() to return too many elements. How can I remove the extra characters?
<DIV align=center><STRONG><A><A class=white
href="javascript: event_info = openWin('/events/search/index_results.cfm?action=plan&event_number=2013292001&cde_comp_group=CONF&cde_comp_type=&NEW_END_DATE1>=&key_stkhldr_event=&mixed_breed=N', 'eventinfo', 'width=800,height=600,toolbar=1,location=0>,directories=0,status=0,menuBar=0,scrollBars=1,resizable=1' ); event_info.focus()"><STRONG>Labrador
Retriever Club of the Piedmont</STRONG></A> </STRONG></DIV
>

You could select only the <a> tags that have an href attribute set:
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so guard before enumerating
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        // process your node here
    }
}

Related

html agility pack text inside in li

I am getting the data inside each <li> tag like this:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var list = new List<string>(doc.DocumentNode.SelectNodes("//li")
.Select(li => li.InnerText));
But if the li has another tag inside, like <em>, that tag is ignored.
How can I keep everything inside the <li> without using InnerHtml?
Thanks.
What you want is OuterHtml.
Quoting from MDN:
The outerHTML attribute of the Element DOM interface gets the
serialized HTML fragment describing the element including its
descendants. It can also be set to replace the element with nodes
parsed from the given string.
To only obtain the HTML representation of the contents of an element,
or to replace the contents of an element, use the innerHTML property
instead.
var doc = new HtmlDocument();
doc.LoadHtml(@"
<ul>
<li>
<em>item 1</em>
</li>
<li>
<span>item</span> <em>2</em> <br/>
</li>
</ul>
");
var ul = doc.DocumentNode.Element("ul");
var lis = ul.Elements("li");
foreach (var li in lis)
{
    Console.WriteLine("----------------- inner text -------------------");
    Console.WriteLine(li.InnerText); // text content only, tags stripped
    Console.WriteLine("----------------- outer html -------------------");
    Console.WriteLine(li.OuterHtml); // <li> + all descendant tags + </li>
}

Web scraping data using Html Agility Pack

Using Html Agility Pack, how can I get the string ABC from the html code:
<td><a data-quoteapi="$cur.symbol href=/asx/{$cur.symbol} (stockLink)" href="/asx/abc">ABC</a></td>
All you need to do is read the element's InnerText. You are looking for a TD element, so ask HtmlAgilityPack to select it; the text it contains is available through its InnerText property.
Based on your sample:
string html = @"<td><a data-quoteapi='$cur.symbol href=/asx/{$cur.symbol} (stockLink)' href='/asx/abc'>ABC</a></td>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var selectedElement = doc.DocumentNode.SelectSingleNode("td");
if (selectedElement != null)
{
    Console.WriteLine(selectedElement.InnerText); // prints ABC
}

check if there is a div that has the words `some text` and has the `i` tag

I have a class:
<div class = "abc def">
<i style="...."></i>
some text 2
</div>
<div class = "abc def">
<i></i>
some text
</div>
<div class = "abc def">
1 some text
</div>
How can I check whether there is a div that contains the words some text and also has an i tag inside it?
For this example, I should get the first and the second div. The third div doesn't have the i tag, so it should not be returned.
I think it should be:
elements = driver.findElement(By.xpath(//div[contains(text(), 'some text')]));
if (elements.length > 0) {
for(var i = 0; i < elements.length; i++) {
if (elements[i].find('<i') != null) {
alert('the item: ' + i + 'is found');
}
}
}
The XPath expression more or less equals the natural language version:
//div[contains(., 'some text') and i]
If the <i/> tag may be nested inside other elements, use .//i instead. In most cases you want to use . instead of text(): it joins all descendant text nodes and searches the combined result, so <em>some</em> text would be matched, too.
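To see the difference between . and text() in practice, here is a minimal sketch. It assumes well-formed markup and uses the XPath 1.0 engine built into System.Xml rather than Selenium or Html Agility Pack. The XPathDemo class, its Count helper, and the fourth div (added to show the <em>some</em> text case) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // The three divs from the question plus one where "some" sits inside <em>.
    public const string Sample =
        "<root>" +
        "<div class='abc def'><i style='x'></i> some text 2</div>" +
        "<div class='abc def'><i></i> some text</div>" +
        "<div class='abc def'>1 some text</div>" +
        "<div class='abc def'><i></i><em>some</em> text</div>" +
        "</root>";

    // Hypothetical helper: number of nodes an XPath expression selects.
    public static int Count(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // "." is the concatenated text of the div and all its descendants.
        Console.WriteLine(Count(Sample, "//div[contains(., 'some text') and i]"));      // 3

        // "text()" is converted to the FIRST direct text node only,
        // so the <em>some</em> text div is missed.
        Console.WriteLine(Count(Sample, "//div[contains(text(), 'some text') and i]")); // 2
    }
}
```

The third div is excluded in both cases because it has no i child; only the last div distinguishes the two expressions.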

Working with ASP.NET Razor and HTML

I have a list of categories and subcategories that is passed from the controller to the view. Now I want them to be represented in the HTML as follows, but I don't know how to achieve this with foreach, a table, or something else.
EDIT : Code
public ActionResult Electronics()
{
var topCategories = pe.Categories.Where(category => category.ParentCategory.CategoryName == "Electronics").ToList();
//var catsAndSubs = pe.Categories.Include("ParentCategory").Where(c => c.ParentCategory.CategoryName == "Electronics");
return View(topCategories);
}
With this view code, I am just able to pull a vertical list.
@foreach (var cats in Model)
{
    <li>@cats.CategoryName</li>
    foreach (var subcats in cats.SubCategories)
    {
        <li>@subcats.CategoryName</li>
    }
}
When designing HTML mark-up it is very important to consider semantics. What meaning are you trying to convey? That doesn't look like tabular data to me so please don't put it in tables :P
Based on your wireframe above, the way I would probably structure this is like this:
<h1>Category Directory</h1>
<h2>Multimedia Projectors</h2>
<h2>Home Audio</h2>
<p>
Amplifiers, Speakers
</p>
Adjust the hX tags to reflect their position within the document's hierarchy. Remember to only ever have ONE h1 per page (or per <article> or <section> if using HTML5).
If instead you wind up turning this into something like a Superfish menu then this is the markup that you would use:
<nav id="category_menu">
<ul>
<li>
Multimedia Projectors
</li>
<li>
Home Audio
<ul>
<li>
Amplifiers
</li>
<li>
Speakers
</li>
</ul>
</li>
</ul>
</nav>
Edit
Your model is not suitable for creating your desired view: the relationship is bottom-up, but to construct the view conveniently you want the relationships defined top-down. Start by converting the data model into a view model, such as:
class CategoryViewModel
{
    public string CategoryName { get; set; }
    public IList<CategoryViewModel> SubCategories { get; set; }
}
and a method to build it from the data model:
IList<CategoryViewModel> Map(IList<CategoryDataModel> dataModel)
{
    var model = new List<CategoryViewModel>();
    // Select the categories with no parent (these are the root categories)
    var rootDataCategories = dataModel.Where(x => x.ParentCategory == null);
    foreach (var dataCat in rootDataCategories)
    {
        // Select the sub-categories for this root category
        var children = dataModel
            .Where(x => x.ParentCategory != null && x.ParentCategory.CategoryName == dataCat.CategoryName)
            .Select(y => new CategoryViewModel { CategoryName = y.CategoryName })
            .ToList();
        var viewCat = new CategoryViewModel
        {
            CategoryName = dataCat.CategoryName,
            SubCategories = children
        };
        model.Add(viewCat);
    }
    return model;
}
Then your view:
<h1>Category Directory</h1>
@foreach (var category in Model)
{
    @Html.Partial("Category", category)
}
Category partial:
<h2>@Html.ActionLink(Model.CategoryName, "Detail", new { Model.CategoryName })</h2>
@if (Model.SubCategories.Count > 0)
{
    <p>
        @for (var i = 0; i < Model.SubCategories.Count; i++)
        {
            var subCat = Model.SubCategories[i];
            @Html.ActionLink(subCat.CategoryName, "Detail", new { subCat.CategoryName })
            if (i < Model.SubCategories.Count - 1)
            {
                <text>,</text>
            }
        }
    </p>
}
Note that my current solution only supports 2 levels of categories (as per your wireframe). It could however be easily extended to be recursive.
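As a rough sketch of that recursive extension, the mapping could call itself for each category's children. The CategoryData and CategoryNode classes and the CategoryMapper.Map signature below are hypothetical stand-ins for your own data and view models, and the code assumes the parent links contain no cycles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat data model: each category only knows its parent (bottom-up).
public class CategoryData
{
    public string CategoryName { get; set; }
    public CategoryData ParentCategory { get; set; }
}

// Hypothetical view model: each category knows its children (top-down).
public class CategoryNode
{
    public string CategoryName { get; set; }
    public List<CategoryNode> SubCategories { get; set; } = new List<CategoryNode>();
}

public static class CategoryMapper
{
    // Returns the categories whose parent is `parent`, each with its own
    // subtree attached. Call with parent == null to start from the roots.
    public static List<CategoryNode> Map(IList<CategoryData> all, CategoryData parent = null)
    {
        return all
            .Where(c => ReferenceEquals(c.ParentCategory, parent))
            .Select(c => new CategoryNode
            {
                CategoryName = c.CategoryName,
                SubCategories = Map(all, c) // recurse into this category's children
            })
            .ToList();
    }
}
```

The Category partial would then render itself for each entry in SubCategories instead of printing a flat, comma-separated list. The lookup scans the whole list at every level, which is fine for a small category tree; for a large one, grouping by parent first would avoid the repeated scans.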

HTMLAgilityPack iterate all text nodes only

Here is an HTML snippet; all I want is to get only the text nodes and iterate over them. Please let me know. Thanks.
<div>
<div>
Select your Age:
<select>
<option>0 to 10</option>
<option>20 and above</option>
</select>
</div>
<div>
Help/Hints:
<ul>
<li>This is required field.
<li>Make sure select the right age.
</ul>
Learn More
</div>
</div>
Result:
Select your Age:
0 to 10
20 and above
Help/Hints:
This is required field.
Make sure select the right age.
Learn More
Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']"))
{
Console.WriteLine(node.InnerText.Trim());
}
Will output this:
Select your Age:
0 to 10
20 and above
Help/Hints:
This is required field.
Make sure select the right age.
Learn More
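The predicate normalize-space(.) != '' in the XPath above is what drops the whitespace-only text nodes between tags. As a small sketch of what that function does, the snippet below evaluates it on a few string literals with the XPath engine built into System.Xml (not Html Agility Pack). The NormalizeSpaceDemo class and its Eval helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class NormalizeSpaceDemo
{
    // Hypothetical helper: evaluates a standalone XPath 1.0 expression
    // against an empty document, so only literals and functions matter.
    public static object Eval(string xpath)
    {
        XPathNavigator nav = new XmlDocument().CreateNavigator();
        return nav.Evaluate(xpath);
    }

    public static void Main()
    {
        // Leading/trailing whitespace is trimmed and inner runs collapse to one space.
        Console.WriteLine(Eval("normalize-space('  Select   your Age:  ')")); // Select your Age:

        // A whitespace-only string normalizes to '', so the predicate is false
        // and such a text node would be filtered out.
        Console.WriteLine(Eval("normalize-space('   ') != ''"));              // False

        // Any visible content keeps the node.
        Console.WriteLine(Eval("normalize-space(' Learn More ') != ''"));     // True
    }
}
```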
I tested @Simon Mourier's answer on the Google home page and got lots of CSS and JavaScript, so I added an extra filter to remove it:
public string getBodyText(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Remove script & style nodes so their contents are not returned as text
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());

    // Simon Mourier's answer; SelectNodes returns null when nothing matches
    var textNodes = doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
    if (textNodes == null)
    {
        return "";
    }

    var sb = new System.Text.StringBuilder();
    foreach (var node in textNodes)
    {
        sb.Append(node.InnerText.Trim()).Append(' ');
    }
    return sb.ToString();
}

Resources