HTML Agility Pack: get the text inside an li tag

I am getting the data inside each <li> tag like this:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var list = new List<string>(doc.DocumentNode.SelectNodes("//li")
.Select(li => li.InnerText));
But if an li contains another tag, such as <em>, that tag is dropped from the result.
How can I keep everything inside the <li> without using InnerHtml?
thanks

What you want is OuterHtml.
Quoting from MDN:
The outerHTML attribute of the Element DOM interface gets the
serialized HTML fragment describing the element including its
descendants. It can also be set to replace the element with nodes
parsed from the given string.
To only obtain the HTML representation of the contents of an element,
or to replace the contents of an element, use the innerHTML property
instead.
var doc = new HtmlDocument();
doc.LoadHtml(@"
<ul>
<li>
<em>item 1</em>
</li>
<li>
<span>item</span> <em>2</em> <br/>
</li>
</ul>
");
var ul = doc.DocumentNode.Element("ul");
var lis = ul.Elements("li");
foreach(var li in lis)
{
Console.WriteLine("----------------- inner text -------------------");
Console.WriteLine(li.InnerText); // prints the text content only (tags stripped)
Console.WriteLine("----------------- outer html -------------------");
Console.WriteLine(li.OuterHtml); // prints <li>, all descendant tags, and </li>
}
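To make the difference between text-only and full-markup output concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is only an analogy, not HTML Agility Pack itself: joining itertext() plays the role of InnerText, and ET.tostring() plays the role of OuterHtml. The variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Same list as in the answer above, written as well-formed XML.
markup = """<ul>
<li>
<em>item 1</em>
</li>
<li>
<span>item</span> <em>2</em> <br/>
</li>
</ul>"""

ul = ET.fromstring(markup)

for li in ul.findall("li"):
    # Analogue of InnerText: all descendant text, tags dropped.
    text_only = "".join(li.itertext())
    # Analogue of OuterHtml: the element serialized with its descendants.
    outer_markup = ET.tostring(li, encoding="unicode")
    print("text only:", text_only.split())
    print("outer markup:", outer_markup.strip())

# Keep the first item's results around for inspection.
first_li = ul.find("li")
first_text = "".join(first_li.itertext()).strip()
first_outer = ET.tostring(first_li, encoding="unicode")
```

The text-only form loses the <em> tag, while the serialized form keeps it, which is the behaviour the question is asking for.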

Related

How do we crawl and fetch the data from multiple div tags having same class name? [duplicate]

I have to fetch two labels, 'Text 1' and 'Text 2', which sit in two divs that share the same class 'xyz'. The structure is shown below.
<div class='xyz'>TEXT 1</div>
<div class='xyz'>TEXT 2</div>
Can anyone please help me solve this?
You find elements by className and then use getText() to get the text:
List<WebElement> elements = driver.findElements(By.className("xyz"));
for(WebElement element:elements) {
System.out.println(element.getText());
}
Use the FindElements method and then access the div you need by index, e.g. (C#):
var elements = driver.FindElements(By.CssSelector(".xyz"));
// get the text of the first element
var first = elements[0].Text;
// and of the second
var second = elements[1].Text; // etc.

Get data attribute inside multiple children

I have this issue: there are multiple elements with the same class, each containing several spans, and I want to extract all the data attributes from the first one.
<div class="parent_class">
<span data-year="a_1">Data</span>
<span data-make="b_1">Data</span>
<span data-model="c_1">Data</span>
<span data-motor="d_1">Data</span>
</div>
<div class="parent_class">
<span data-year="a_2">Data 2</span>
<span data-make="b_2">Data 2</span>
<span data-model="c_2">Data 2</span>
<span data-motor="d_2">Data 2</span>
</div>
I have made several attempts and got only the first data attribute without any problem.
var year_response = $('.parent_class:first span').data('year');
Response:
year_response = a_1;
But when I tried for the make and other data attribute I got undefined
Actual:
var make_response = $('.parent_class:first span').data('make');
Response:
make_response = undefined;
Desired:
var make_response = $('.parent_class:first span').data('make');
Response:
make_response = b_1;
How about fetching the data attributes of each span as an object and mapping them to an array:
var data = $.map($('.parent_class:first span'), function(el) {
return $(el).data();
});
Or merge them into a single object, if all the data attribute names are different:
var data = {};
$.each($('.parent_class:first span'), function(i, el) {
$.each($(el).data(), function(k,v) {data[k] = v});
});
You are telling it to get the first span, but you want the second span (the one with make). What about getting the first span that has the make data attribute?
console.log($('.parent_class > span[data-make]:first').data('make'));
You could also select the nth element with the nth-child selector:
console.log($('.parent_class > span:nth-child(2)').data('make'));
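The idea of merging every span's data attributes into one object can also be sketched outside jQuery. The following is a hedged Python illustration using the standard library's xml.etree.ElementTree. The wrapping root element and the variable names are assumptions made for the example, and stripping the "data-" prefix only imitates how jQuery names the keys.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a hypothetical <root> so it parses as XML.
markup = """<root>
<div class="parent_class">
<span data-year="a_1">Data</span>
<span data-make="b_1">Data</span>
<span data-model="c_1">Data</span>
<span data-motor="d_1">Data</span>
</div>
<div class="parent_class">
<span data-year="a_2">Data 2</span>
<span data-make="b_2">Data 2</span>
<span data-model="c_2">Data 2</span>
<span data-motor="d_2">Data 2</span>
</div>
</root>"""

root = ET.fromstring(markup)

# find() returns only the first matching element, like :first in the selector.
first_parent = root.find("div[@class='parent_class']")

# Merge the data-* attributes of every span into one dictionary.
data = {}
for span in first_parent.findall("span"):
    for name, value in span.attrib.items():
        if name.startswith("data-"):
            data[name[len("data-"):]] = value

print(data)
```

Each span carries a different attribute, so reading one key from the first span alone can never return the make, model, or motor values. Walking all the spans avoids that.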

check if there is a div that has the words `some text` and has the `i` tag

I have a class:
<div class = "abc def">
<i style="...."></i>
some text 2
</div>
<div class = "abc def">
<i></i>
some text
</div>
<div class = "abc def">
1 some text
</div>
How can I check whether there is a div that contains the words some text and also contains an i tag?
For this example, I should get the first and the second div. The third div doesn't have the i tag, so it should not be returned.
I think it should be:
elements = driver.findElement(By.xpath(//div[contains(text(), 'some text')]));
if (elements.length > 0) {
for(var i = 0; i < elements.length; i++) {
if (elements[i].find('<i') != null) {
alert('the item: ' + i + 'is found');
}
}
}
The XPath expression more or less equals the natural language version:
//div[contains(., 'some text') and i]
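If the <i/> tag may be contained within other elements, use .//i instead. In most cases you want to use . instead of text(): the dot joins all descendant text nodes and scans the combined result, so <em>some</em> text would be matched too. In XPath 1.0, contains(text(), ...) checks only the first text node of the div, which here is just the whitespace before the <i/> tag.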
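The difference between the combined text of an element and its first text node can be checked with a small Python sketch. ElementTree does not support contains(), so the XPath predicate is emulated in plain Python here. The wrapping root element and the helper name matches are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The three divs from the question, wrapped in a hypothetical <root>.
markup = """<root>
<div class="abc def">
<i style="...."></i>
some text 2
</div>
<div class="abc def">
<i></i>
some text
</div>
<div class="abc def">
1 some text
</div>
</root>"""

root = ET.fromstring(markup)
divs = root.findall("div")

def matches(div):
    """Emulates //div[contains(., 'some text') and i]."""
    has_i_child = div.find("i") is not None       # the "i" step: a direct child
    string_value = "".join(div.itertext())        # ".": all descendant text joined
    return has_i_child and "some text" in string_value

# 1-based positions of the divs that satisfy both conditions.
matching = [index for index, div in enumerate(divs, start=1) if matches(div)]

# div.text is the text before the first child, i.e. the first text node.
first_text_nodes = [(div.text or "").strip() for div in divs]

print(matching)
print(first_text_nodes)
```

In the first two divs the first text node is only whitespace, which is why a test against the first text node alone would miss them.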

Html Agility Pack extra <A> tag

The extra <A> in the following causes SelectNodes() to return too many elements. How can I remove the extra characters?
<DIV align=center><STRONG><A><A class=white
href="javascript: event_info = openWin('/events/search/index_results.cfm?action=plan&event_number=2013292001&cde_comp_group=CONF&cde_comp_type=&NEW_END_DATE1>=&key_stkhldr_event=&mixed_breed=N', 'eventinfo', 'width=800,height=600,toolbar=1,location=0>,directories=0,status=0,menuBar=0,scrollBars=1,resizable=1' ); event_info.focus()"><STRONG>Labrador
Retriever Club of the Piedmont</STRONG></A> </STRONG></DIV
>
You could select only those <a> tags that have an href attribute set:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchors = doc.DocumentNode
.SelectNodes("//a[@href]")
.ToList();
foreach (var anchor in anchors)
{
//process your node here
}

Repeating an object that occurs only a couple of times and has different values, with HtmlAgilityPack in C#

I have a problem I can't seem to solve. Let's say I have some HTML like the one below that I want to parse. All of this HTML is within one list on the page, and the class names repeat as in the example I wrote.
<li class = "seperator"> a date </li>
<li class = "lol"> some text </li>
<li class = "lol"> some text </li>
<li class = "lol"> some text </li>
<li class = "seperator"> a new date </li>
<li class = "lol"> some text </li>
<li class = "seperator"> a nother new date </li>
<li class = "lol"> some text </li>
<li class = "lol"> some text </li>
I did manage to use HtmlAgilityPack to parse every li element separately, and I am almost formatting it how I want. My output currently looks something like this:
"a date" "some text"
"some text"
"some text"
"some text"
"a new date" "some text"
"a nother new date " "some text"
"some text"
"some text"
What I want to achieve:
"a date" "some text"
"a date" "some text"
"a date" "some text"
"a date" "some text"
"a new date" "some text"
"a nother new date " "some text"
"a nother new date " "some text"
"a nother new date " "some text"
The problem is that the number of lol items beneath each seperator varies. One day the page may have one lol item beneath the first date, and the next day it may have 10. So I am wondering whether there is a smart or easy way to count the lol items between the seperators, or another way to solve this, for example within HtmlAgilityPack. And yes, I need the correct date in front of every lol item, not just in front of the first one. This would have been a piece of cake if the seperator element had ended beneath its last lol item, but sadly that is not the case. I don't think I need to paste my code here, but basically I parse the page, extract the seperator and lol items, and add them to a list, where I split them into seperator and lol items. Then I print the result to a file, and since the seperator occurs only 3 times (in the example), I get only 3 separate dates.
Here's the plan: select all the seperator elements, then find all consecutive sibling elements with the desired class.
Unfortunately, there is no simple way to get a collection of siblings in the current versions of HTML Agility Pack; you only have access to the (one) next sibling. It's hard to collect data from linked structures nicely using LINQ, and since there is no real hierarchy in the HTML, this is somewhat of a challenge.
If you have XPath available, you can use the following-sibling axis to get all following sibling elements, in conjunction with the TakeWhile() method:
var htmlStr = @"<li class = ""seperator""> a date </li>
<li class = ""lol""> some text </li>
<li class = ""lol""> some text </li>
<li class = ""lol""> some text </li>
<li class = ""seperator""> a new date </li>
<li class = ""lol""> some text </li>
<li class = ""seperator""> a nother new date </li>
<li class = ""lol""> some text </li>
<li class = ""lol""> some text </li>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var data =
from li in doc.DocumentNode.SelectNodes("li[@class='seperator']")
select new
{
Separator = li.InnerText,
Content = li.SelectNodes("following-sibling::li")
.TakeWhile(sli => sli.Attributes["class"].Value == "lol")
.Select(sli => sli.InnerText)
.ToList(),
};
Otherwise, if you don't have XPath available, you can create an enumerable from any linked structure with the following:
public static class Extensions
{
public static IEnumerable<TSource> ToLinkedEnumerable<TSource>(
this TSource source,
Func<TSource, TSource> nextSelector,
Func<TSource, bool> predicate)
{
for (TSource current = nextSelector(source);
predicate(current);
current = nextSelector(current))
yield return current;
}
public static IEnumerable<TSource> ToLinkedEnumerable<TSource>(
this TSource source, Func<TSource, TSource> nextSelector)
where TSource : class
{
return ToLinkedEnumerable(source, nextSelector, src => src != null);
}
}
Your query then becomes:
var data =
from li in doc.DocumentNode.Elements("li")
where li.Attributes["class"].Value == "seperator"
select new
{
Separator = li.InnerText,
Content = li.ToLinkedEnumerable(sli => sli.NextSibling)
.Where(sli => sli.Name == "li")
.TakeWhile(sli => sli.Attributes["class"].Value == "lol")
.Select(sli => sli.InnerText)
.ToList(),
};
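The grouping logic itself does not depend on HTML Agility Pack. As a minimal sketch of the same take-while idea, assuming the list items have already been extracted into a flat list of (class, text) pairs, it can be written in Python as below. The names items, group_by_separator, and rows are hypothetical and used only for illustration.

```python
from itertools import takewhile

# Hypothetical flat list of (class, text) pairs standing in for the <li> nodes.
items = [
    ("seperator", "a date"),
    ("lol", "some text"),
    ("lol", "some text"),
    ("lol", "some text"),
    ("seperator", "a new date"),
    ("lol", "some text"),
    ("seperator", "a nother new date"),
    ("lol", "some text"),
    ("lol", "some text"),
]

def group_by_separator(items):
    """Pair each seperator with the run of lol items that directly follows it."""
    groups = []
    for index, (cls, text) in enumerate(items):
        if cls != "seperator":
            continue
        # Take following items until the first one that is not a lol item.
        following = takewhile(lambda item: item[0] == "lol", items[index + 1:])
        groups.append((text, [item_text for _, item_text in following]))
    return groups

# Flatten so that every lol item carries its own date in front.
rows = [(date, text) for date, texts in group_by_separator(items) for text in texts]

for date, text in rows:
    print(f'"{date}" "{text}"')
```

Because the run of lol items is taken until the next seperator, the count under each date can change from day to day without any change to the code.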

Resources