Using Html Agility Pack to parse nodes in a context sensitive fashion - html-agility-pack

<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner hmtl 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
I would like to parse the inner html between the tags in such a way that I can
* associate the inner html 1 with header 1 and date 1
* associate the inner html 2 with header 2 and date 2
In other words, at the time I parse the inner html 1, I would like to know that the HTML nodes containing "Date 1" and "Header 1" have already been parsed (but the nodes containing "Date 2" and "Header 2" have not).
If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them together.
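For illustration only, the "remember the last Date and Header" idea can be sketched in Python with the standard library's event-based html.parser. This is not Html Agility Pack code; the names RecordParser and parse_records are made up for the sketch, and it assumes the flat, non-nested div layout shown in the sample.

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner html 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
"""


class RecordParser(HTMLParser):
    """Hypothetical helper: remembers the last date/header seen so far."""

    def __init__(self):
        super().__init__()
        self.kind = None          # class of the open div: "mvb", "mxb", or "" for a body div
        self.last_date = None
        self.last_header = None
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.kind = dict(attrs).get("class", "")

    def handle_endtag(self, tag):
        if tag == "div":
            self.kind = None      # assumes divs are not nested

    def handle_data(self, data):
        text = data.strip()
        if not text or self.kind is None:
            return
        if self.kind == "mvb":
            self.last_date = text
        elif self.kind == "mxb":
            self.last_header = text
        elif self.kind == "":
            # A class-less div: pair its text with the most recent date and header.
            self.records.append((self.last_date, self.last_header, text))


def parse_records(html):
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = parse_records(SAMPLE)
for date, header, body in records:
    print(date, header, body, sep=" | ")
```

When the body div is reached, only the date and header that precede it have been seen, which is the context-sensitive behaviour described above.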

Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose XLinq crap :-). The XPath position() function is context sensitive. Here is some sample code:
HtmlDocument doc = new HtmlDocument();
doc.Load("your html file");
// select all DIV without a CLASS attribute defined
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
{
Console.WriteLine("div=" + div.InnerText.Trim());
Console.WriteLine(" header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
Console.WriteLine(" date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
}
That will print this with your sample:
div=inner hmtl 1
header=Header 1
date=Date 1
div=inner html 2
header=Header 2
date=Date 2

Well, you can do this in several ways...
For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:
Store all dates in a HtmlNodeCollection
Store all headers in a HtmlNodeCollection
Store all inner texts in another HtmlNodeCollection
If everything is okay and the HTML has that layout, you will have the same number of elements in all 3 collections.
Then you can easily do:
for (int i = 0; i < innerTexts.Count; i++) {
//Get Date, Headers and Inner Texts at position i
}
The following should work:
var document = new HtmlWeb().Load("http://www.url.com"); //Or load it from a Stream, local file, etc.
var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
// Nearest preceding <div>; PreviousSibling could be a whitespace text node or null
let previous = node.SelectSingleNode("preceding-sibling::div[1]")
where previous != null && previous.GetAttributeValue("class", "") == "mxb"
select node).ToList();
//Check here if the number of elements of the 3 collections are the same
for (int i = 0; i < dateNodes.Count; i++) {
var date = dateNodes[i].InnerText;
var header = headerNodes[i].InnerText;
var innerText = innerTextNodes[i].InnerText;
//Now you have the set you want: You have the Date, Header and Inner Text
}
This is one way of doing it.
Of course, you should check for errors: make sure the .SelectNodes(..) calls are not returning null, check for errors in the LINQ expression when storing innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
Take into account that the only way to know, in the HTML you posted, which <div> tag contains the inner text is to assume it is the one right after the <div> tag that contains the header. That's why I used the LINQ expression.
Another way of knowing it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags inside it and not just text. There is no magic when parsing HTML :)
Edit:
I have not tested this code. Test it by yourself and let me know if it worked.

Related

Obtaining a partial value from XPath

I have the current HTML code:
<div class="group">
<ul class="smallList">
<li><strong>Date</strong>
13.06.2019
</li>
<li>...</li>
<li>...</li>
</ul>
</div>
and here is my "wrong" XPath:
//div[@class='group']/ul/li[1]
I would like to extract the date with XPath without the text in the strong tag, but I'm not sure how NOT is used in XPath, or whether it can even be used here.
Keep in mind that the date is dynamic.
Use substring-after() to get the date value.
substring-after(//div[@class='group']/ul/li[1],'Date')
Output:
The easiest way to get the date is by using the XPath 1.0 expression
//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1]
The result still includes the surrounding whitespace.
If you want to get rid of that too, use the following expression:
normalize-space(//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1])
Unfortunately, this only works for a single result in XPath 1.0.
If you have XPath 2.0 available, you can append normalize-space() to the end of the expression, which also enables the processing of multiple results:
//div[@class='group']/ul/li[1]/text()[normalize-space(.)][1]/normalize-space()
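As a rough illustration of what "the element's own text nodes" means, here is a small Python sketch using the standard library's xml.etree.ElementTree. It assumes the snippet is well-formed enough to parse as XML (real pages often are not), and own_text is a hypothetical helper name. ElementTree keeps the text that follows a child element in that child's tail attribute, so the date ends up in the tail of the strong element.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<div class="group">
  <ul class="smallList">
    <li><strong>Date</strong>
    13.06.2019
    </li>
    <li>...</li>
    <li>...</li>
  </ul>
</div>
""".strip()


def own_text(element):
    """Return the text directly inside `element`, ignoring text inside its children."""
    parts = [element.text or ""]
    # Text that follows each child element is stored on the child as `.tail`.
    parts.extend(child.tail or "" for child in element)
    # Collapse runs of whitespace, similar to XPath's normalize-space().
    return " ".join("".join(parts).split())


root = ET.fromstring(SNIPPET)
first_li = root.find("./ul/li[1]")
date_text = own_text(first_li)
print(date_text)  # 13.06.2019
```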
Here is a Python method that reads the text directly from the parent element, skipping its children. In your case the data sits directly in ul/li.
Python:
def get_text_exclude_children(element):
    # `driver` is the Selenium WebDriver instance that is already in use
    return driver.execute_script(
        """
        var parent = arguments[0];
        var child = parent.firstChild;
        var textValue = "";
        // Concatenate only the direct text nodes of the parent
        while (child) {
            if (child.nodeType === Node.TEXT_NODE)
                textValue += child.textContent;
            child = child.nextSibling;
        }
        return textValue;""",
        element).strip()
This is how to call this in your case.
ulEle = driver.find_element_by_xpath("//div[@class='group']/ul/li[1]")
datePart = get_text_exclude_children(ulEle)
print(datePart)
Please feel free to convert this to the language you are using, if it's not Python.

CKEditor Plugin - Proper behavior of elementPath

Currently, I have the following HTML content
<span criteria="{"animal":["DOG"]}">abc</span> def <span criteria="{"animal":["CAT"]}">ghi</span>
I want to know two things:
Does my selected text contain a criteria attribute?
If it does, what is its value?
I run the following code.
editor.on('selectionChange', function( ev ) {
var elementPath = editor.elementPath();
var criteriaElement = elementPath.contains( function( el ) {
return el.hasAttribute('criteria');
});
var array = elementPath.elements;
var arrayLength = array.length;
for (var i = 0; i < arrayLength; i++) {
console.log(i + " --> " + array[i].$.innerHTML);
}
if (criteriaElement) {
console.log("criteriaElement is something");
console.log("criteriaElement attribute length is " + criteriaElement.$.attributes.length);
for (var i = 0; i < criteriaElement.$.attributes.length; i++) {
console.log("attribute is " + criteriaElement.$.attributes[i].value);
}
}
});
Test Case 1
When I select the text abc def, I get the following logging
0 --> abc
1 --> <span criteria="{"animal":["DOG"]}">abc</span> def <span criteria="{"animal":["CAT"]}">ghi</span>
criteriaElement is something
criteriaElement attribute length is 1
attribute is {"operator":["DOG"]}
Some doubts in my mind.
I expected 2 elements in elementPath: one for abc and another for def. However, it turns out that my first element is abc (correct), and my second element is the entire text (not what I expected).
Test Case 2
I tried another test. This time, I selected def ghi
I get the following logging
0 --> <span criteria="{"animal":["DOG"]}">abc</span> def <span criteria="{"animal":["CAT"]}">ghi</span>
Some doubts in my mind
Why is there only 1 element? I expected 2 elements in elementPath: one for def and another for ghi.
Although Test Case 1 and Test Case 2 both contain an element with the entire text, why does elementPath.contains... return nothing in Test Case 2?
The elements path is not related to the selection in that way. It represents the stack of elements under the caret. Imagine a situation like this, where [] represents the selection and | represents the caret:
<ul>
<li>Quux</li>
<li>F[oo <span class="bar">Bar</span> <span class="baz">Ba|]z</span></li>
<li>Nerf</li>
</ul>
Your selection visually contains the text "oo Bar Ba" and your caret is between a and z. At that time, the elements path would display "ul > li > span". The other span element "bar" is a sibling of the span element "baz" and is thus not displayed; only ancestors are displayed.
You could think of it this way: the caret can only exist inside an HTML TEXT_NODE, and the elements path displays the ancestors of that text node.
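To make the "ancestors only" point concrete, here is a small Python sketch (not CKEditor code) that computes the same kind of path for the example markup using the standard library's xml.etree.ElementTree. The selection and caret markers are left out of the markup, ancestor_path is a hypothetical helper, and the span with class "baz" stands in for the element that holds the caret.

```python
import xml.etree.ElementTree as ET

MARKUP = """<ul>
<li>Quux</li>
<li>Foo <span class="bar">Bar</span> <span class="baz">Baz</span></li>
<li>Nerf</li>
</ul>"""


def ancestor_path(root, target):
    """Return the tag names from the root down to `target` (ancestors only)."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    path = [target]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return [element.tag for element in reversed(path)]


root = ET.fromstring(MARKUP)
caret_host = root.find(".//span[@class='baz']")  # element containing the caret
path = ancestor_path(root, caret_host)
print(" > ".join(path))  # ul > li > span
```

The sibling span with class "bar" never appears in the result, no matter how much of it the selection covers.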
What are you trying to achieve? To display the data in the current selection? Why? Where do you want it to show? How and why do you want it to show? I'm guessing that there is a different way of fulfilling your requirement than using the elements path (I think this might be an XY problem).
Too long to be a comment: If your toolbar button action targets elements with the criteria attribute, what if there is one span with a criteria attribute and one without? Does their order matter? What if there are two spans with a criteria attribute? What if they are nested like this: <p>F[oo <span criteria="x">Bar <span criteria="y">Ba|]z </span>Quux </span>Xyzzy</p>? The targeting will be difficult. I would suggest that you add a small marker to the elements path if an element has the attribute; then, by clicking the marker or right-clicking the element, you could edit/view the criteria. You could even visually indicate spans with the attribute within the editor by customizing editor.css with a rule like span[criteria]{ color: red; }.

Xpath/HtmlAgilityPack: Getting the specific attributes from href tag

I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name">Super Cool Product</h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
where
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
};
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and URL separately and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could point me in the right direction.
Thanks,
Miles

Capybara writing drop down's options texts into an array

I'd like to put a dropdown list's options into an array generically in Capybara. After the process, I expect to have an array of strings containing all the dropdown options. I've tried the code below, but the length of my array stays 1 regardless of the option count.
periods = Array.new()
periods = all('#MainContent_dd')
print periods.length
The problem is that all('#MainContent_dd') returns all elements that have the id MainContent_dd. Assuming this is your dropdown and ids are unique, it is expected that the periods.length is 1 (ie periods is the select list).
What you want to do is get the option elements instead of the select element.
Assuming your html is:
<select id="MainContent_dd">
<option>Option A</option>
<option>Option B</option>
<option>Option C</option>
</select>
Then you can do:
periods = find('#MainContent_dd').all('option').collect(&:text)
p periods.length
#=> 3
p periods
#=> ["Option A", "Option B", "Option C"]
What this does is:
find('#MainContent_dd') - Finds the select list that you want to get the options from
all('option') - Gets all option elements within the select list
collect(&:text) - Collects the text of each option and returns it as an array
@JustinCo's answer has a problem if the driver in use isn't fast: Capybara will query the driver for every invocation of text. So if the select contains 200 elements, Capybara will make 201 queries to the browser instead of 1, which may be slow.
I suggest doing it in one query with JavaScript (evaluate_script returns the value of the expression, whereas execute_script discards it):
periods = page.evaluate_script("Array.prototype.map.call(document.querySelectorAll('#MainContent_dd > option'), function (o) { return o.textContent; })")
or (shorter variant with jQuery):
periods = page.evaluate_script("$('#MainContent_dd > option').map(function() { return $(this).text() }).get()")

XPath Expression Issue in Html Agility Pack

I'm using Html Agility Pack to do some basic web scraping of Google search results. As a newbie to XPath, I made sure my path expression is correct (with the help of FirePath). However, the returned HtmlNodeCollection is always null.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load("http://www.google.com/search?num=10&q=Hello+World");
// get search result URLs
var items = htmlDoc.DocumentNode.SelectNodes("//div[@id='ires']/ol[@id='rso']/li/div[@class='vsc']/h3/a/@href");
foreach (HtmlNode node in items)
{
Console.WriteLine(node.Attributes);
}
Am I missing something? Can anyone please enlighten me?
Thanks in advance,
HAP can only process the raw HTML that is returned from the URL; it will not run any JavaScript that is on the page. You need to adjust your query accordingly.
In the raw HTML, the ires div exists, but the rso id doesn't get inserted until the JavaScript is run, hence you get no results. There are other transformations done here which you'll have to adjust for as well.
Here's a fragment of the HTML:
<div id="ires">
<ol>
<li class="g">
<h3 class="r">
...
A more appropriate xpath to use for this would be:
var xpath = "//li[contains(concat(' ',@class,' '),' g ')]" +
"/h3[contains(concat(' ',@class,' '),' r ')]" +
"/a/@href";
It'd be easier to find all li with the g class as those correspond to all the results. You'll want to filter all h3 with the r class otherwise you'd include other results (such as image results).
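The concat/contains idiom in that XPath pads the class attribute with spaces so that a class name is matched as a whole token rather than as a substring. A minimal Python sketch of the same string logic, with a made-up helper name and made-up sample class values, shows the difference from a plain substring test:

```python
def xpath_style_has_class(class_attr, token):
    """Mimic contains(concat(' ', @class, ' '), ' token ') on a plain string."""
    return f" {token} " in f" {class_attr} "


# Hypothetical class attribute values, for illustration only.
samples = ["g", "g card", "card g", "big", "gallery g-wide"]

# Token match: only values where "g" is a whole class name.
matches = [value for value in samples if xpath_style_has_class(value, "g")]

# Naive substring match, like contains(@class, 'g'): matches every sample.
naive = [value for value in samples if "g" in value]

print(matches)  # ['g', 'g card', 'card g']
print(naive)
```

Note that this padding trick only treats spaces as separators; class names separated by tabs or newlines would need the attribute normalized first.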

Resources