Web scraping data using Html Agility Pack

Using Html Agility Pack, how can I get the string ABC from the html code:
<td><a data-quoteapi="$cur.symbol href=/asx/{$cur.symbol} (stockLink)" href="/asx/abc">ABC</a></td>

All you need is the element's InnerText. You are looking for a TD element, so ask HtmlAgilityPack to select it; the element's text is then available through its InnerText property.
Based on your sample:
string html = @"<td><a data-quoteapi='$cur.symbol href=/asx/{$cur.symbol} (stockLink)' href='/asx/abc'>ABC</a></td>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var selectedElement = doc.DocumentNode.SelectSingleNode("td");
if (selectedElement != null)
    Console.WriteLine(selectedElement.InnerText); //prints ABC

Related

html agility pack text inside in li

I am getting the data inside each <li> tag like this:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var list = new List<string>(doc.DocumentNode.SelectNodes("//li")
.Select(li => li.InnerText));
But if the <li> contains another tag, such as <em>, that tag is dropped.
How can I keep everything inside the <li> without using InnerHtml?
Thanks
What you want is OuterHtml.
Quoting from MDN:
The outerHTML attribute of the Element DOM interface gets the
serialized HTML fragment describing the element including its
descendants. It can also be set to replace the element with nodes
parsed from the given string.
To only obtain the HTML representation of the contents of an element,
or to replace the contents of an element, use the innerHTML property
instead.
var doc = new HtmlDocument();
doc.LoadHtml(@"
<ul>
<li>
<em>item 1</em>
</li>
<li>
<span>item</span> <em>2</em> <br/>
</li>
</ul>
");
var ul = doc.DocumentNode.Element("ul");
var lis = ul.Elements("li");
foreach (var li in lis)
{
    Console.WriteLine("----------------- inner text -------------------");
    Console.WriteLine(li.InnerText); //prints the text only, with all tags stripped
    Console.WriteLine("----------------- outer html -------------------");
    Console.WriteLine(li.OuterHtml); //prints <li> + all descendant tags + </li>
}

How to obtain inner html of a link using an id of a surrounding table cell

I've started using the Html Agility Pack and liking it a lot.
I have the following html:
<td id="1">This Link</td>
<td id="2">Not This Link</td>
I would like to obtain the inner HTML of the anchor when the table cell has an id of 1,
i.e. the end result is that I'm left with "This Link".
I've managed to get the inner html when passing in the href:
var doc = new HtmlWeb().Load("mypage");
var selections = doc.DocumentNode.Descendants("a")
.Where(u => u.GetAttributeValue("href", null).Contains("offIgo"))
.Select(a => a.InnerHtml);
But how would I go about incorporating the table cell information? Is it a case of taking a step back and getting all the information from the tags first and then drilling further in?
Any advice appreciated
OK, I found out what to do. For those who come across this, try the following:
var doc = new HtmlWeb().Load("myPage");
HtmlNodeCollection node = doc.DocumentNode.SelectNodes("//table//tbody//tr//td[@id='r1']");
var myAnchorText = node.Descendants("a")
    // an empty default avoids a null reference on anchors without an href
    .Where(u => u.GetAttributeValue("href", string.Empty).Contains("offIgo.aspx"))
    .Select(a => a.InnerHtml);

Parent of htmlagilitypack text node is select instead of option?

Using Html Agility Pack, I am searching for text nodes in a DOM structure consisting of a select:
<select>
<option>
one
</option>
<option>
two
</option>
</select>
The parent of those nodes seems to be the <select> instead of an <option>. Why?
using System.IO;
using System.Linq;
using HtmlAgilityPack;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace Foo.Test
{
    [TestClass]
    public class HtmlAgilityTest
    {
        [TestMethod]
        public void TestTraverseTextNodesInSelect()
        {
            var html = "<select><option>one</option><option>two</option></select>";
            var doc = new HtmlDocument();
            doc.Load(new StringReader(html));
            var elements = doc.DocumentNode.Descendants().Where(n => n.Name == "#text");
            Assert.AreEqual(2, elements.Count());
            Assert.AreEqual("select", elements.ElementAt(0).ParentNode.Name);
            Assert.AreEqual("select", elements.ElementAt(1).ParentNode.Name);
        }
    }
}
[TestMethod]
public void TestTraverseTextNodesInSelect()
{
    // Remove the default "option" flag before creating the document
    HtmlNode.ElementsFlags.Remove("option");
    var html = "<select><option>one</option><option>two</option></select>";
    var doc = new HtmlDocument();
    doc.Load(new StringReader(html));
    var elements = doc.DocumentNode.Descendants().Where(n => n.Name == "#text");
    Assert.AreEqual(2, elements.Count());
    // With the flag removed, each text node is parented by its <option>
    Assert.AreEqual("option", elements.ElementAt(0).ParentNode.Name);
    Assert.AreEqual("option", elements.ElementAt(1).ParentNode.Name);
}
You can try this. By default, Html Agility Pack is set to treat option tags as empty. The library registers the flag like this, and you need to remove it:
ElementsFlags.Add("option", HtmlElementFlag.Empty);
That's because HtmlAgilityPack drops the closing </option> tag by default. HAP sees your HTML like this:
Console.WriteLine(doc.DocumentNode.OuterHtml);
//result :
//<select><option>one<option>two</select>
As mentioned in the linked question, you can alter that behavior by calling the following line before creating the HtmlDocument:
HtmlNode.ElementsFlags.Remove("option");

Html Agility Pack extra <A> tag

The extra <A> in the following causes SelectNodes() to return too many elements. How can I remove the extra characters?
<DIV align=center><STRONG><A><A class=white
href="javascript: event_info = openWin('/events/search/index_results.cfm?action=plan&event_number=2013292001&cde_comp_group=CONF&cde_comp_type=&NEW_END_DATE1>=&key_stkhldr_event=&mixed_breed=N', 'eventinfo', 'width=800,height=600,toolbar=1,location=0>,directories=0,status=0,menuBar=0,scrollBars=1,resizable=1' ); event_info.focus()"><STRONG>Labrador
Retriever Club of the Piedmont</STRONG></A> </STRONG></DIV>
You could select only those <a> tags that have an attribute set, e.g. href:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchors = doc.DocumentNode
    .SelectNodes("//a[@href]")
    .ToList();
foreach (var anchor in anchors)
{
    //process your node here
}

Get a valid ID of the HTML field in EditorTemplate

I'm using the code below to create the id for an element in my editor template. Is there another way to get a valid id in the editor template?
ViewData.TemplateInfo.HtmlFieldPrefix.Replace('.', '_');
Try it like this:
@{
    var id = ViewData.TemplateInfo.GetFullHtmlFieldId(string.Empty);
}
