Parse HTML doc with HtmlAgilityPack XPath / RegExp

I'm trying to parse an image URL from HTML with HtmlAgilityPack. The HTML document contains this img tag:
<a class="css_foto" href="" title="Fotka: MyKe015">
<span>
<img src="http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6"
width="176" height="216" alt="Fotka: MyKe015" />
</span>
</a>
I need to get the src attribute from this img tag, i.e. http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6.
What I know:
- The src attribute contains a URL, and the URL starts with http://213.215.107.125/fotky
- The URL has variable length, and the document contains other img tags whose URLs also start with http://213.215.107.125/fotky
- The alt attribute of the img tag is Fotka: MyKe015
Any advice? I've tried many approaches, but nothing works well.
My last attempt:
List<string> src;
var req = (HttpWebRequest)WebRequest.Create("http://pokec.azet.sk/myke015");
req.Method = "GET";
using (WebResponse odpoved = req.GetResponse())
{
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.Load(odpoved.GetResponseStream());
    var nodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
    src = new List<string>(nodes.Count);
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            if (node.Id != null)
                src.Add(node.Id);
        }
    }
}

Your XPath selects the img nodes themselves, not their src attributes.
Instead of this (which selects all img elements that have a src attribute):
var nodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
use this (which selects the src attributes of all img elements):
var nodes = htmlDoc.DocumentNode.SelectNodes("//img/@src");

To narrow it down to the one image you want, filter on the alt attribute with this XPath 1.0 expression:
//img[@alt='Fotka: MyKe015']/@src
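The filtering idea is easy to sanity-check outside HtmlAgilityPack. A minimal Python sketch using the stdlib xml.etree (the snippet from the question happens to be well-formed XML): select the img element by its alt attribute, then read its src.

```python
import xml.etree.ElementTree as ET

# The snippet from the question.
html = """
<a class="css_foto" href="" title="Fotka: MyKe015">
  <span>
    <img src="http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6"
         width="176" height="216" alt="Fotka: MyKe015" />
  </span>
</a>
"""

root = ET.fromstring(html)
# Select the <img> whose alt matches, then read its src attribute.
img = root.find(".//img[@alt='Fotka: MyKe015']")
src = img.get("src")
print(src)
```

Note that ElementTree only understands a limited XPath subset (no attribute-axis results), so the element is selected first and the attribute read afterwards.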

Related

Dynamic section name in EvoHtmlToPdf header

Is there a way in EvoHtmlToPdf to display the section/subsection of a page in the header/footer (i.e. the text of the "current" h1/h2 tag)?
With wkhtmltopdf and other tools, it is possible to replace special tags via JavaScript and the HTML header template (as described here for example Dynamic section name in wicked_pdf header).
Unfortunately, such a solution does not seem to work with EvoHtmlToPdf.
Here's the HTML code of my header template:
<html id="headerFooterHtml">
<head>
  <script>
    function substHeaderFooter() {
      var vars = {};
      var searchString = document.location.search;
      var debugMessage = document.getElementById("showJavaScriptWasExecuted");
      if (debugMessage)
        debugMessage.textContent = "Search string: [" + searchString + "]";
      var search_list = searchString.substring(1).split('&');
      for (var i in search_list) {
        var content = search_list[i].split('=', 2);
        vars[content[0]] = decodeQueryParam(content[1]);
      }
      var tags = ['section', 'subsection'];
      for (var i in tags) {
        var name = tags[i],
            classElements = document.getElementsByClassName(name);
        for (var j = 0; j < classElements.length; ++j) {
          classElements[j].textContent = vars[name];
        }
      }
    }
  </script>
</head>
<body id="headerFooterBody" onload="substHeaderFooter()">
  <div id="showJavaScriptWasExecuted"></div>
  <div id="sections">{section} / {subsection}</div>
</body>
</html>
Resulting header in PDF
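Whatever EvoHtmlToPdf supports, the substitution the script attempts is plain query-string parsing plus placeholder replacement. A minimal Python sketch of that idea (the section/subsection values here are made up for illustration):

```python
from urllib.parse import parse_qs

def subst_header(template: str, search: str) -> str:
    """Replace {section}/{subsection} placeholders from a location.search string."""
    params = {k: v[0] for k, v in parse_qs(search.lstrip("?")).items()}
    for name in ("section", "subsection"):
        template = template.replace("{" + name + "}", params.get(name, ""))
    return template

# Hypothetical values a wkhtmltopdf-style renderer would pass in the URL.
print(subst_header("{section} / {subsection}", "?section=Intro&subsection=Setup"))
```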
I already added the EvoHtmlToPdf PrepareRenderPdfPageDelegate event handler to my code (if that's the way I have to go) but I don't know how to access the section of the current page there...
Thanks in advance for your help!

HtmlAgilityPack - SelectSingleNode for descendants

I found that HtmlAgilityPack's SelectSingleNode always starts from the first node of the original DOM. Is there an equivalent way to set its starting node?
Sample html
<html>
<body>
  <a href="home.com">Home</a>
  <div id="contentDiv">
    <tr class="blueRow">
      <td scope="row"><a href="iwantthis.com">target</a></td>
    </tr>
  </div>
</body>
</html>
Not working code
// Expected: iwantthis.com  Actual: home.com
string url = contentDiv.SelectSingleNode("//tr[@class='blueRow']")
                       .SelectSingleNode("//a") // What should this be?
                       .GetAttributeValue("href", "");
I have to replace the code above with this:
var tds = contentDiv.SelectSingleNode("//tr[@class='blueRow']").Descendants("td");
string url = "";
foreach (HtmlNode td in tds)
{
    if (td.Descendants("a").Any())
    {
        url = td.ChildNodes.First().GetAttributeValue("href", "");
    }
}
I am using HtmlAgilityPack 1.7.4 on .Net Framework 4.6.2
The XPath you are using always starts at the root of the document. SelectSingleNode("//a") means start at the root of the document and find the first a anywhere in the document; that's why it grabs the Home link.
If you want to start from the current node, you should use the . selector. SelectSingleNode(".//a") would mean find the first a that is anywhere beneath the current node.
So your code would look like this:
string url = contentDiv.SelectSingleNode(".//tr[@class='blueRow']")
                       .SelectSingleNode(".//a")
                       .GetAttributeValue("href", "");
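The relative-search behavior is easy to verify outside HAP. A small Python sketch with the stdlib xml.etree (sample markup reconstructed with the expected/actual URLs from the question): a search rooted at a subnode only sees that subnode's descendants.

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <a href="home.com">Home</a>
  <div id="contentDiv">
    <tr class="blueRow"><td scope="row"><a href="iwantthis.com">target</a></td></tr>
  </div>
</body></html>
"""
root = ET.fromstring(html)

content_div = root.find(".//div[@id='contentDiv']")
# A search rooted at content_div only looks beneath it, so the outer
# "Home" link can never be matched here.
row = content_div.find(".//tr[@class='blueRow']")
url = row.find(".//a").get("href")
print(url)
```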

html-agility-pack extract a background image

How do I extract the URL from the following HTML? I.e., extract:
http://media.somesite.com.au/img-101x76.jpg
from:
<div class="media-img">
<div class=" searched-img" style="background-image: url(http://media.somesite.com.au/img-101x76.jpg);"></div>
</div>
In XPath 1.0 you can generally use a combination of the substring-after() and substring-before() functions to extract part of a text. But HAP's SelectNodes() and SelectSingleNode() can only return nodes, so those XPath functions won't help here.
One possible approach is to get the entire value of the style attribute using XPath and HAP, then process the value further in .NET, using a regex for example:
var html = @"<div class='media-img'>
    <div class=' searched-img' style='background-image: url(http://media.somesite.com.au/img-101x76.jpg);'></div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'searched-img')]");
var url = Regex.Match(div.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
Console.WriteLine(url);
Output:
http://media.somesite.com.au/img-101x76.jpg
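The regex idea is language-agnostic; a quick Python sketch of extracting the URL from the same style value:

```python
import re

style = "background-image: url(http://media.somesite.com.au/img-101x76.jpg);"
# Capture whatever sits between "url(" and the first closing ")".
match = re.search(r"url\((.*?)\)", style)
print(match.group(1))
```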

XPath to find anchor with no text?

I have a couple of anchor tags which have no text content. How do I find those anchors with no text?
I've tried this but it's returning nothing:
global $post;
$doc = new DOMDocument();
$doc->loadHtml($post->post_content);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach ($nodes as $node) {
    $remove_elements[] = $node->getAttribute('href');
}
return $remove_elements;
The html looks like this (empty anchors; href values omitted here):
<a href="..."></a><br/>
<a href="..."></a><br/>
<a href="..."></a><br/>
<a href="..."></a><br/>
If you want to know whether a node has no text(), you can use string-length(), which accepts a node; here . refers to the current element.
You can do:
//a[string-length(.) = 0]
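For comparison, the same "no text at all" filter expressed procedurally in Python with the stdlib xml.etree (the hrefs below are made-up examples):

```python
import xml.etree.ElementTree as ET

html = '<div><a href="keep.com">some text</a><a href="drop1.com"></a><a href="drop2.com"/></div>'
root = ET.fromstring(html)

# Mirror //a[string-length(.) = 0]: keep anchors whose entire
# text content (including descendants) is the empty string.
empty_hrefs = [a.get("href") for a in root.iter("a")
               if not "".join(a.itertext())]
print(empty_hrefs)
```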

Need help reading XML using LINQ

I'm trying to bind the contents of the following file using LINQ but having issues with the syntax.
<metadefinition>
  <page>
    <name>home</name>
    <metas>
      <meta>
        <metaname>
          title
        </metaname>
        <metavalue>
          Welcome Home
        </metavalue>
      </meta>
      <meta>
        <metaname>
          description
        </metaname>
        <metavalue>
          Welcome Home Description
        </metavalue>
      </meta>
    </metas>
  </page>
  <page>
    <name>results</name>
    <metas>
      <meta>
        <metaname>
          title
        </metaname>
        <metavalue>
          Welcome to Results
        </metavalue>
      </meta>
    </metas>
  </page>
</metadefinition>
My query looks like this but as you can see it is missing the retrieval of the metas tag. How do I accomplish this?
var pages = from p in xmlDoc.Descendants(XName.Get("page"))
            where p.Element("name").Value == pageName
            select new MetaPage
            {
                Name = p.Element("name").Value,
                MetaTags = p.Elements("metas").Select(m => new Tag
                {
                    MetaName = m.Element("metaname").Value.ToString(),
                    MetaValue = m.Element("metacontent").Value.ToString()
                }).ToList()
            };
If <metadefinition> is the root element, there's no need to iterate over all descendants of the document; that's inefficient. Note also that your query reads m.Element("metacontent"), but the element in the file is named metavalue, and you need to step through <metas> into its <meta> children:
var pages = from p in xmlDoc.Root.Elements("page")
            where p.Element("name").Value == pageName
            select new MetaPage
            {
                Name = p.Element("name").Value,
                MetaTags = p.Element("metas").Elements("meta").Select(m => new Tag
                {
                    // Trim the whitespace the multi-line element values carry
                    MetaName = m.Element("metaname").Value.Trim(),
                    MetaValue = m.Element("metavalue").Value.Trim()
                }).ToList()
            };
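The shape of the corrected query translates directly to other XML APIs. A Python sketch over the same document with the stdlib xml.etree, likewise trimming the whitespace that the multi-line element values carry:

```python
import xml.etree.ElementTree as ET

xml = """
<metadefinition>
  <page>
    <name>home</name>
    <metas>
      <meta><metaname>title</metaname><metavalue>Welcome Home</metavalue></meta>
      <meta><metaname>description</metaname><metavalue>Welcome Home Description</metavalue></meta>
    </metas>
  </page>
</metadefinition>
"""

root = ET.fromstring(xml)
page_name = "home"

pages = [
    {
        "name": p.findtext("name"),
        # metas -> meta mirrors p.Element("metas").Elements("meta")
        "tags": [(m.findtext("metaname").strip(), m.findtext("metavalue").strip())
                 for m in p.find("metas").findall("meta")],
    }
    for p in root.findall("page")           # like xmlDoc.Root.Elements("page")
    if p.findtext("name") == page_name
]
print(pages[0]["tags"])
```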
