HtmlAgilityPack - SelectSingleNode for descendants - html-agility-pack

I found that HtmlAgilityPack's SelectSingleNode always starts from the first node of the original DOM. Is there an equivalent method that lets me set its starting node?
Sample html
<html>
<body>
Home
<div id="contentDiv">
<tr class="blueRow">
<td scope="row">target</td>
</tr>
</div>
</body>
</html>
Not working code
//Expected:iwantthis.com Actual:home.com,
string url = contentDiv.SelectSingleNode("//tr[@class='blueRow']")
.SelectSingleNode("//a") //What should this be ?
.GetAttributeValue("href", "");
I have to replace the code above with this:
var tds = contentDiv.SelectSingleNode("//tr[@class='blueRow']").Descendants("td");
string url = "";
foreach (HtmlNode td in tds)
{
if (td.Descendants("a").Any())
{
url= td.ChildNodes.First().GetAttributeValue("href", "");
}
}
I am using HtmlAgilityPack 1.7.4 on .Net Framework 4.6.2

The XPath you are using always starts at the root of the document. SelectSingleNode("//a") means start at the root of the document and find the first a anywhere in the document; that's why it grabs the Home link.
If you want to start from the current node, you should use the . selector. SelectSingleNode(".//a") would mean find the first a that is anywhere beneath the current node.
So your code would look like this:
string url = contentDiv.SelectSingleNode(".//tr[@class='blueRow']")
.SelectSingleNode(".//a")
.GetAttributeValue("href", "");
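The idea of scoping a search to a node is not specific to HtmlAgilityPack. As a rough illustration only, here is the same pattern in Python's standard-library xml.etree.ElementTree. The two links and their href values are assumptions made up to mirror the question (the original markup for them did not survive), and a table wrapper is added so the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question; the href values are invented.
html = """
<html>
  <body>
    <a href="home.com">Home</a>
    <div id="contentDiv">
      <table>
        <tr class="blueRow">
          <td scope="row"><a href="iwantthis.com">target</a></td>
        </tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html)

# Searching beneath the document root finds the first <a> anywhere: the Home link.
first_anywhere = root.find(".//a")

# Searching beneath a specific node only considers that node's descendants.
content_div = root.find(".//div[@id='contentDiv']")
row = content_div.find(".//tr[@class='blueRow']")
first_in_row = row.find(".//a")

print(first_anywhere.get("href"))  # home.com
print(first_in_row.get("href"))    # iwantthis.com
```

The leading dot plays the same role here as in the HtmlAgilityPack answer: it makes the path relative to the node the method is called on.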

Related

Selenium not grabbing the selected element but the one loaded by Javascript

I am using Selenium to grab the href attribute of the a tag, but my code is not grabbing "/pros/52698281" as it should.
Is it because my code is wrong, or because some JavaScript is dynamically loading another URL? Is that possible?
Here is the html :
<article class="bi-bloc blocs clearfix bi-pro visited" id="bi-bloc-014805042600000000C0001" data-pjtoggleclasshisto="{"idbloc": {"id_bloc": "014805042600000000C0001", "no_sequence": "" }, "klass":"visited" }">
<div class="zone-bi">
<a class="visible-phone mob-zone-pro pj-lb pj-link" data-pjsearchctx-sethref="" href="/pros/52698281" data-pjstats="{"idTag":"MOB-ZONE-PRO","pos":54,"type_bi":"pro","genreBloc":"1","pjscript":"xt_click({},'C','{%xtn2}','LR_BI::zone_identification::info{%pjstats.type_bi}::identification_pro','A');"}">
<span class="not-visible">
XXXXXXXXXXX
</span>
</a>
I am using this code to grab the href attribute:
elements = driver.find_elements(:css, "article.bi-bloc div.zone-bi a.visible-phone")
elements.each do |e|
p e.attribute("href")
end
Here is the javascript code that, I think, loads dynamically another url (the one printing in my terminal).
<script type="text/javascript">
var pj_searchctx = {
"1989516432": {
"form": {
"quoiqui": "climatisation",
"ou": "paris-75",
"proximite": 0
},
"search": {
"technicalUrl":"/annuaire/chercherlespros?quoiqui=climatisation&ou=paris-75&idOu=L07505600&page=3&contexte=BupKFuSlIjbFtxi68rty83eKL16bkxx3e0d5jKAkSaA%3D&proximite=0&quoiQuiInterprete=climatisation",
"breadcrumb": "Retour aux résultats",
"stats": {
"idTag": "VERS-LR-RESULTATS"
}
}
}
};
Any idea how I can do this?
Since the elements you want are direct children of one another, try the child combinator (>) in your CSS selector instead:
"article.bi-bloc > div.zone-bi > a.visible-phone"
This matches the element you're looking for more specifically.

Scraping framework with xpath support

I'm looking for a web scraping framework that lets me
Hit a given endpoint and load the html response
Search for elements by some css selector
Recover the xpath for that element
Any suggestions? I've seen many that let me search by xpath, but none that actually generate the xpath for an element.
It seems to be true that not many people search by CSS selector yet want a result as an XPath instead, but there are some options to get there.
First I wound up doing this with JQuery plus an additional function. This is because JQuery has pretty nice selection and is easy to find support for. You can use JQuery in Node.js, so you should be able to implement my code in that domain (on a server) instead of on the client (as shown in my simple example). If that's not an option, you can look below for my other potential solution using Python or at the bottom for a C# starter.
For the JQuery approach, the pure JavaScript function is pretty simple for returning the XPath. In the following example (also on JSFiddle) I retrieved the example anchor element with the JQuery selector, got the stripped DOM element, and sent it to my getXPath function:
<html>
<head>
<title>The jQuery Example</title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<script type="text/javascript">
function getXPath( element )
{
var xpath = '';
for ( ; element && element.nodeType == 1; element = element.parentNode )
{
var id = $(element.parentNode).children(element.tagName).index(element) + 1;
id > 1 ? (id = '[' + id + ']') : (id = '');
xpath = '/' + element.tagName.toLowerCase() + id + xpath;
}
return xpath;
}
$(document).ready(function() {
$("#example").click(function() {
alert("Link Xpath: " + getXPath($("#example")[0]));
});
});
</script>
</head>
<body>
<p id="p1">This is an example paragraph.</p>
<p id="p2">This is an example paragraph with a <a id="example" href="#">link inside.</a></p>
</body>
</html>
There is a full library for more robust CSS selector to XPath conversions called css2xpath if you need more complexity than what I provided.
Python (lxml):
For Python you'll want to use lxml's CSS selector class (see link for full tutorial and docs) to get the xml node.
The CSSSelector class
The most important class in the lxml.cssselect module is CSSSelector.
It provides the same interface as the XPath class, but accepts a CSS
selector expression as input:
>>> from lxml.cssselect import CSSSelector
>>> sel = CSSSelector('div.content')
>>> sel #doctest: +ELLIPSIS
<CSSSelector ... for 'div.content'>
>>> sel.css
'div.content'
The selector actually compiles to XPath, and you can see the
expression by inspecting the object:
>>> sel.path
"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]"
To use the selector, simply call it with a document or element object:
>>> from lxml.etree import fromstring
>>> h = fromstring('''<div id="outer">
... <div id="inner" class="content body">
... text
... </div></div>''')
>>> [e.get('id') for e in sel(h)]
['inner']
Using CSSSelector is equivalent to translating with cssselect and
using the XPath class:
>>> from cssselect import GenericTranslator
>>> from lxml.etree import XPath
>>> sel = XPath(GenericTranslator().css_to_xpath('div.content'))
CSSSelector takes a translator parameter to let you choose which
translator to use. It can be 'xml' (the default), 'xhtml', 'html' or a
Translator object.
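If the markup came from a URL, you can record that when building the etree: root = etree.fromstring(xml, base_url="http://where.it/is/from.xml"). Note that base_url only tells lxml where the string came from; it does not download anything, so the content still has to be fetched first.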
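The lxml excerpt shows how to find a node with a CSS selector, but not how to recover a path for the node once found. The following is a minimal sketch that ports the getXPath idea from the jQuery example to Python using only the standard library; get_xpath and the page variable are hypothetical names introduced for illustration, and the input is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET

def get_xpath(root, target):
    """Build an absolute, position-based path from root down to target."""
    # ElementTree elements have no parent pointer, so map child -> parent first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        step = node.tag
        if parent is not None:
            # 1-based position among siblings that share this tag name.
            same_tag = [sib for sib in parent if sib.tag == node.tag]
            position = same_tag.index(node) + 1
            if position > 1:
                step += "[%d]" % position
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))

# Same body as the jQuery example above.
page = ET.fromstring(
    "<html><body>"
    "<p id='p1'>This is an example paragraph.</p>"
    "<p id='p2'>This is an example paragraph with a "
    "<a id='example' href='#'>link inside.</a></p>"
    "</body></html>"
)
link = page.find(".//a[@id='example']")
print(get_xpath(page, link))  # /html/body/p[2]/a
```

Like the JavaScript version, this only adds an index when the element is not the first sibling with its tag name. If you are already using lxml, its ElementTree objects also expose a getpath(element) method for the same purpose.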
C#
There is a library called css2xpath-reloaded which does nothing but CSS to XPath conversion.
String css = "div#test .note span:first-child";
String xpath = css2xpath.Transform(css);
// 'xpath' will contain:
// //div[@id='test']//*[contains(concat(' ',normalize-space(@class),' '),' note ')]*[1]/self::span
Of course, getting a string from the url is very easy with C# utility classes and needs little discussion:
using(WebClient client = new WebClient()) {
string s = client.DownloadString(url);
}
As for the selection with CSS Selectors, you could try Fizzler, which is pretty powerful. Here's the front page example, though you can do much more:
// Load the document using HTMLAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
<html>
<head></head>
<body>
<div>
<p class='content'>Fizzler</p>
<p>CSS Selector Engine</p></div>
</body>
</html>");
// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode
var document = html.DocumentNode;
// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");
// yields: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");
// yields empty sequence
document.QuerySelectorAll("body>p");
// yields [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");
// yields [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

html-agility-pack extract a background image

How do I extract the URL from the following HTML?
That is, extract:
http://media.somesite.com.au/img-101x76.jpg
from:
<div class="media-img">
<div class=" searched-img" style="background-image: url(http://media.somesite.com.au/img-101x76.jpg);"></div>
</div>
In XPath 1.0 in general, you can combine the substring-after() and substring-before() functions to extract part of a text. But HAP's SelectNodes() and SelectSingleNode() can only return nodes, so those XPath functions won't help here.
One possible approach is to get the entire value of the style attribute using XPath and HAP, then process the value further in .NET, using a regex for example:
var html = @"<div class='media-img'>
<div class=' searched-img' style='background-image: url(http://media.somesite.com.au/img-101x76.jpg);'></div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'searched-img')]");
var url = Regex.Match(div.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
Console.WriteLine(url);
Output:
http://media.somesite.com.au/img-101x76.jpg
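The post-processing step does not depend on .NET. As a small sketch of the same idea in Python, the style value from the question is parsed with the standard re module; the pattern below is an assumption of mine (it also tolerates optional quotes around the URL), not the regex from the answer.

```python
import re

# The style attribute value from the question's HTML.
style = "background-image: url(http://media.somesite.com.au/img-101x76.jpg);"

# Capture the text between "url(" and the closing ")", skipping optional quotes.
match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
url = match.group(1) if match else ""

print(url)  # http://media.somesite.com.au/img-101x76.jpg
```

Falling back to an empty string when nothing matches mirrors the default passed to GetAttributeValue in the C# version.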

How to perform click event on an element present in the anchor tag?

<div class="buttonClear_bottomRight">
<div class="buttonBlueOnWhite">
<a onclick="$find('{0}').close(true); callPostBackFromAlert();" href="#">Ok</a><div
class='rightImg'>
</div>
</div>
</div>
In the above code I want to click the Ok button inside the anchor tag, but no id is generated for it, so I cannot perform a click action on it directly. I tried the workaround below.
IElementContainer elm_container = (IElementContainer)pw.Element(Find.ByClass(classname));
foreach (Element element in elm_container.Elements)
{
if (element.TagName.ToString().ToUpper() == "A")
{
element.Click();
}
}
But here elm_container returns null for the initial instances, so we cannot traverse it. Is there an easier way to do this?
Try this...
Div div = browser.Div(Find.ByClass("buttonClear_bottomRight")).Div(Find.ByClass("buttonBlueOnWhite"));
Debug.Assert(div.Exists);
Link link = div.Link(lnk => lnk.GetAttributeValue("onclick").ToLower().Contains(".close(true)"));
Debug.Assert(link.Exists);
link.Click();
Hope it helps!
You can simply click the link by finding it by its text:
var OkButton = Browser.Link(Find.ByText("Ok"));
if(!OkButton.Exists)
{
//Log error here
}
OkButton.Click();
Browser.WaitForComplete();
Or you can find the div containing the link, like this:
var ContainerDiv = Browser.Div(Find.ByClass("buttonBlueOnWhite"));
if(!ContainerDiv.Exists)
{
//Log error here
}
ContainerDiv.Links.First().Click();
Browser.WaitForComplete();

Parse HTML doc with HtmlAgilityPack-Xpath, RegExp

I am trying to parse an image URL from HTML with HtmlAgilityPack. The HTML doc contains this img tag:
<a class="css_foto" href="" title="Fotka: MyKe015">
<span>
<img src="http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6"
width="176" height="216" alt="Fotka: MyKe015" />
</span>
</a>
I need to get the src attribute from this img tag, i.e. http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6.
I know this:
The src attribute contains a URL that starts with http://213.215.107.125/fotky
The URL has variable length, and the HTML doc also contains other img tags whose URLs start with http://213.215.107.125/fotky
I know the alt attribute of the img tag (Fotka: MyKe015)
Any advice? I have tried many ways, but nothing works well.
The last thing I tried is this:
List<string> src;
var req = (HttpWebRequest)WebRequest.Create("http://pokec.azet.sk/myke015");
req.Method = "GET";
using (WebResponse odpoved = req.GetResponse())
{
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(odpoved.GetResponseStream());
var nodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
src = new List<string>(nodes.Count);
if (nodes != null)
{
foreach (var node in nodes)
{
if (node.Id != null)
src.Add(node.Id);
}
}
}
Your XPath selects the img nodes, not the src attributes belonging to them.
Instead of this (which selects all img tags that have a src attribute):
var nodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
Use this (which selects the src attributes of all img elements):
var nodes = htmlDoc.DocumentNode.SelectNodes("//img/@src");
This XPath 1.0 expression:
//img[@alt='Fotka: MyKe015']/@src
selects the src attribute of the img element whose alt attribute is 'Fotka: MyKe015'.
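Since the alt value is known, a predicate on alt narrows the match to one image, and the src can then be read from that node. The following is only a sketch of that approach in Python's standard library, run against the snippet from the question; it assumes the fragment is well-formed, which a full real-world page often is not.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it happens to be well-formed XML.
snippet = """
<a class="css_foto" href="" title="Fotka: MyKe015">
  <span>
    <img src="http://213.215.107.125/fotky/1358/93/v_13589304.jpg?v=6"
         width="176" height="216" alt="Fotka: MyKe015" />
  </span>
</a>
"""

root = ET.fromstring(snippet)

# Select the img by its known alt text, then read src from the element.
img = root.find(".//img[@alt='Fotka: MyKe015']")
src = img.get("src") if img is not None else ""

print(src)
```

Selecting the element and then asking it for the attribute avoids depending on whether the library can return attribute nodes from an XPath query.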
