html-agility-pack extract a background image - xpath

How do I extract the URL from the following HTML?
That is, extract:
http://media.somesite.com.au/img-101x76.jpg
from:
<div class="media-img">
<div class=" searched-img" style="background-image: url(http://media.somesite.com.au/img-101x76.jpg);"></div>
</div>

In XPath 1.0 in general, you can use a combination of the substring-after() and substring-before() functions to extract part of a text. But HAP's SelectNodes() and SelectSingleNode() can only return nodes, so those XPath functions won't help here.
One possible approach is to get the entire value of the style attribute using XPath and HAP, then process the value further in .NET, for example with a regex:
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var html = @"<div class='media-img'>
<div class=' searched-img' style='background-image: url(http://media.somesite.com.au/img-101x76.jpg);'></div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Select the div by class, then pull the url(...) value out of its style attribute
var div = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'searched-img')]");
var url = Regex.Match(div.GetAttributeValue("style", ""), @"(?<=url\()(.*)(?=\))").Groups[1].Value;
Console.WriteLine(url);
output :
http://media.somesite.com.au/img-101x76.jpg
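For comparison, the same post-processing step can be sketched outside .NET. The Python snippet below is only an illustration: the function name extract_background_url is made up here, and the pattern assumes the style value contains a single url(...) entry, optionally wrapped in quotes.

```python
import re

# Hypothetical helper, not part of HAP or the original answer.
# Captures whatever sits between "url(" and ")", ignoring optional quotes.
_URL_IN_STYLE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""", re.IGNORECASE)


def extract_background_url(style):
    """Return the first url(...) target found in a style string, or None."""
    match = _URL_IN_STYLE.search(style)
    return match.group(1) if match else None


style = "background-image: url(http://media.somesite.com.au/img-101x76.jpg);"
print(extract_background_url(style))
```

A capture group is used instead of lookarounds, which keeps the pattern readable and tolerates quoted URLs.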

Related

Getting value from inside div using XPATH

I have the following div
<div data-dmid="product-detail-page" itemscope="" itemtype="http://schema.org/Product" itemid="3600542198158">
from which I would like to extract the itemid -> 3600542198158
I was using the following XPath, which however does not return any value:
//div[@data-dmid='product-detail-page']/@itemid
Could someone please advise how to build the XPath correctly?
#
Unfortunately I have to revise my question.
I had been looking at the code with the Firefox inspection tool.
The HTML source code differs from what the inspection tool shows, and it contains the following relevant part:
<div class="onCanvas content-with-footer">
<div id="container-main" class="content-main">
<div data-dmid="uvp-banner-container" style="height: 54px; width: 100%"></div>
<script>
document.addEventListener("DOMContentLoaded", function() {
var props = {};
ReactInit.initReactComponent("contentViewService", "UvpBannerContainer", props, document.querySelector("[data-dmid='uvp-banner-container']"));
});
</script>
<div id="react-product-detail-page"></div>
<script>
var props = {
gtin: 3600542198158,
locale: dmSettings.localeLanguage
};
ReactInit.initReactComponent("product-detail-page", "ProductDetailPage", props, document.getElementById("react-product-detail-page"));
$(document).ready(function () {
var props = {
locale: dmSettings.localeLanguage
};
ReactInit.initReactComponent("product-detail-page", "PriceLegend", props, document.getElementById("react-price-legend"));
});
</script>
I need to get the gtin (a plain number) from the second script.
I would like to use the XPath in a scraping tool, which is why only plain XPath will work for me.
Thank you again, and please excuse my earlier, not fully correct question.
I am assuming that you don't mind JavaScript and jQuery since you didn't specify:
var itemId = $("div[data-dmid]").attr("itemid");
console.log(itemId);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div data-dmid="product-detail-page" itemscope="" itemtype="http://schema.org/Product" itemid="3600542198158">
I got the answer with the help of another post on Stack Overflow:
Reading a javascript variable's value
The correct code for my updated question is
substring-before(substring-after(//div[@class='onCanvas content-with-footer']//script[2][contains(.,'gtin')]/text(), "gtin: "), ",")
Thank you for any help.
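To see what the nested substring-after() / substring-before() calls do to the script text, here is a small Python sketch that mimics the two XPath 1.0 string functions. The helper names and the sample script_text are written for this illustration only; both helpers return an empty string when the separator is missing, as the XPath functions do.

```python
def substring_after(text, sep):
    """Mimic XPath substring-after(): text following the first sep, else ''."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def substring_before(text, sep):
    """Mimic XPath substring-before(): text preceding the first sep, else ''."""
    head, found, _ = text.partition(sep)
    return head if found else ""


# Sample text standing in for the contents of the second <script> element.
script_text = (
    "var props = {\n"
    "gtin: 3600542198158,\n"
    "locale: dmSettings.localeLanguage\n"
    "};"
)

gtin = substring_before(substring_after(script_text, "gtin: "), ",")
print(gtin)
```

The inner call discards everything up to and including "gtin: ", and the outer call cuts the remainder at the first comma.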

Convert HTML string to HTML and display in DIV

I have an HTML string saved in a database, and I want to display it in a DIV as rendered HTML using JavaScript.
Example:
<p>Dear Friends</p> <h1> You have got invitation </h1>
I have used DOMParser like this
parser = new DOMParser();
htmlDoc = parser.parseFromString(document.getElementById(controlID).innerHTML, "text/html");
console.log(htmlDoc);
document.getElementById("emailBodyArea").innerHTML = parser
but in result I see [htmlObject]
You don't need DOMParser at all; just set the innerHTML of emailBodyArea to the innerHTML of controlID, or to any HTML string.
document.getElementById("emailBodyArea").innerHTML = document.getElementById(controlID).innerHTML;
You can try it on jsfiddle.

Scraping the href value of anchor in Ruby

I'm working on a project where I have to scrape a "website," which is just an HTML file in one of the local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I am also scraping for other things, so ignore the rest. Here is what I have so far:
def self.scrape_index_page(index_url) #responsible for scraping the index page that lists all of the students
#return an array of hashes in which each hash represents one student.
html = index_url
doc = Nokogiri::HTML(open(html))
# doc.css(".student-name").first.text
# doc.css(".student-location").first.text
#student_card = doc.css(".student-card").first
#student_card.css("a").text
end
Here is one of the student profiles. They are all the same, so I'm just interested in scraping the href url value.
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div">
<h3 class="view-profile-text">View Profile</h3>
</div>
<div class="card-text-container">
<h4 class="student-name">Eric Chu</h4>
<p class="student-location">Glenelg, MD</p>
</div>
</a>
</div>
thanks for your help!
Once you get an anchor tag in Nokogiri, you can get the href like this:
anchor["href"]
So in your example, you could get the href by doing the following:
student_card = doc.css(".student-card").first
href = student_card.css("a").first["href"]
If you wanted to collect all of the href values at once, you could do something like this:
hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }
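The "collect every href under a .student-card" idea is not specific to Nokogiri. As a rough comparison, here is a sketch using only Python's standard html.parser module; the class name StudentLinkParser is invented for this example, and it assumes the cards are div elements with properly closed nested divs, as in the markup above.

```python
from html.parser import HTMLParser


class StudentLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside div.student-card."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._card_depth = 0  # > 0 while we are inside a student-card div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs so we know when the card itself closes.
            if self._card_depth or "student-card" in classes:
                self._card_depth += 1
        elif tag == "a" and self._card_depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._card_depth:
            self._card_depth -= 1


sample = """
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div"><h3 class="view-profile-text">View Profile</h3></div>
<div class="card-text-container"><h4 class="student-name">Eric Chu</h4></div>
</a>
</div>
<a href="elsewhere.html">not inside a card</a>
"""

parser = StudentLinkParser()
parser.feed(sample)
print(parser.hrefs)
```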

Scraping framework with xpath support

I'm looking for a web scraping framework that lets me
Hit a given endpoint and load the html response
Search for elements by some css selector
Recover the xpath for that element
Any suggestions? I've seen many that let me search by xpath, but none that actually generate the xpath for an element.
It seems to be true that not many people search by CSS selector yet want a result as an XPath instead, but there are some options to get there.
First I wound up doing this with jQuery plus an additional function. This is because jQuery has pretty nice selection and is easy to find support for. You can use jQuery in Node.js, so you should be able to implement my code in that domain (on a server) instead of on the client (as shown in my simple example). If that's not an option, you can look below for my other potential solution using Python or at the bottom for a C# starter.
For the jQuery approach, the pure JavaScript function for returning the XPath is pretty simple. In the following example (also on JSFiddle) I retrieved the example anchor element with the jQuery selector, got the underlying DOM element, and passed it to my getXPath function:
<html>
<head>
<title>The jQuery Example</title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<script type="text/javascript">
function getXPath( element )
{
var xpath = '';
for ( ; element && element.nodeType == 1; element = element.parentNode )
{
var id = $(element.parentNode).children(element.tagName).index(element) + 1;
id > 1 ? (id = '[' + id + ']') : (id = '');
xpath = '/' + element.tagName.toLowerCase() + id + xpath;
}
return xpath;
}
$(document).ready(function() {
$("#example").click(function() {
alert("Link Xpath: " + getXPath($("#example")[0]));
});
});
</script>
</head>
<body>
<p id="p1">This is an example paragraph.</p>
<p id="p2">This is an example paragraph with a <a id="example" href="#">link inside.</a></p>
</body>
</html>
There is a full library for more robust CSS selector to XPath conversions called css2xpath if you need more complexity than what I provided.
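The getXPath walk (climb from the element to the root, adding a positional index only when the element is not the first sibling with its tag name) can also be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative port, not the answer's code: get_xpath is a name chosen here, and because ElementTree has no parent pointers the sketch builds a child-to-parent map first. It assumes well-formed markup.

```python
import xml.etree.ElementTree as ET


def get_xpath(root, target):
    """Build a positional XPath from root down to target."""
    # ElementTree elements do not know their parent, so map child -> parent.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        step = node.tag
        if parent is not None:
            # Position among siblings that share the same tag name (1-based).
            same_tag = [c for c in parent if c.tag == node.tag]
            position = same_tag.index(node) + 1
            if position > 1:
                step += f"[{position}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


doc = ET.fromstring(
    '<html><body>'
    '<p id="p1">This is an example paragraph.</p>'
    '<p id="p2">With a <a id="example" href="#">link inside.</a></p>'
    '</body></html>'
)
link = doc.find(".//a")
print(get_xpath(doc, link))
```

Like the JavaScript version, the index is omitted for a first sibling, so a step such as /p would match every p child if evaluated as a real XPath; emit [1] explicitly if you need a strictly unique path.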
Python (lxml):
For Python you'll want to use lxml's CSS selector class (see link for full tutorial and docs) to get the xml node.
The CSSSelector class
The most important class in the lxml.cssselect module is CSSSelector.
It provides the same interface as the XPath class, but accepts a CSS
selector expression as input:
>>> from lxml.cssselect import CSSSelector
>>> sel = CSSSelector('div.content')
>>> sel  # doctest: +ELLIPSIS
<CSSSelector ... for 'div.content'>
>>> sel.css
'div.content'
The selector actually compiles to XPath, and you can see the
expression by inspecting the object:
>>> sel.path
"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]"
To use the selector, simply call it with a document or element object:
>>> from lxml.etree import fromstring
>>> h = fromstring('''<div id="outer">
... <div id="inner" class="content body">
... text
... </div></div>''')
>>> [e.get('id') for e in sel(h)]
['inner']
Using CSSSelector is equivalent to translating with cssselect and
using the XPath class:
>>> from cssselect import GenericTranslator
>>> from lxml.etree import XPath
>>> sel = XPath(GenericTranslator().css_to_xpath('div.content'))
CSSSelector takes a translator parameter to let you choose which
translator to use. It can be 'xml' (the default), 'xhtml', 'html' or a
Translator object.
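If the markup came from a URL, you can record that URL when building the tree: root = etree.fromstring(xml, base_url="http://where.it/is/from.xml"). Note that base_url only sets the document's base URL; it does not fetch anything, so the content has to be downloaded first.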
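To make the "a CSS selector compiles to XPath" idea concrete without installing lxml, here is a deliberately tiny translator written for this illustration. The function simple_css_to_xpath is hypothetical and handles only the forms tag, .class and tag.class; real translators such as cssselect cover far more of the selector grammar.

```python
def simple_css_to_xpath(selector):
    """Translate 'tag', '.class' or 'tag.class' into an XPath expression."""
    tag, _, css_class = selector.partition(".")
    xpath = "descendant-or-self::" + (tag or "*")
    if css_class:
        # Pad with spaces so 'content' does not match 'content-wide'.
        xpath += (
            "[@class and contains(concat(' ', normalize-space(@class), ' '), "
            f"' {css_class} ')]"
        )
    return xpath


print(simple_css_to_xpath("div.content"))
```

For div.content this produces the same expression that sel.path shows in the quoted documentation above.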
C#
There is a library called css2xpath-reloaded which does nothing but CSS to XPath conversion.
String css = "div#test .note span:first-child";
String xpath = css2xpath.Transform(css);
// 'xpath' will contain:
// //div[@id='test']//*[contains(concat(' ',normalize-space(@class),' '),' note ')]*[1]/self::span
Of course, getting a string from the url is very easy with C# utility classes and needs little discussion:
using(WebClient client = new WebClient()) {
string s = client.DownloadString(url);
}
As for the selection with CSS Selectors, you could try Fizzler, which is pretty powerful. Here's the front page example, though you can do much more:
// Load the document using HTMLAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
<html>
<head></head>
<body>
<div>
<p class='content'>Fizzler</p>
<p>CSS Selector Engine</p></div>
</body>
</html>");
// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode
var document = html.DocumentNode;
// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");
// yields: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");
// yields empty sequence
document.QuerySelectorAll("body>p");
// yields [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");
// yields [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

How to remove the style tag using asp.net

I want to remove all style attributes from HTML tags using ASP.NET.
string source=@" <div style="font-size: 12pt;"> Hello world</div> <style id=fll margin:19px auto;text-align:center"></style>";
I want the result like this:
<div>Hello world </div>
For that I am using:
string expn =@"(?i)<(table|tr|td)(?:\s+(?:""[^""]""|'[^']'|[^""'>])*)?>";
return System.Text.RegularExpressions.Regex.Replace(source, expn, string.Empty);
I don't know which pattern to use.
Please tell me what expression I should use for this.
This should work (though I don't understand the style tag at the end of your example):
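string source = "<div style=\"font-size: 12pt;\"> Hello world</div>";
// [^\"]* stops at the attribute's closing quote, so later quotes on the line are never swallowed
string pattern = "\\s*style=\"[^\"]*\"";
string result = Regex.Replace(source, pattern, "");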
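The same attribute-stripping idea, sketched in Python for comparison. The helper name strip_style_attributes is made up for this example, and the pattern assumes quoted attribute values; a regex cannot tell markup from text, so for untrusted or messy HTML a real parser is the safer route.

```python
import re

# Matches leading whitespace plus style="..." or style='...'.
# [^"]* / [^']* stop at the closing quote instead of running to the line end.
_STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_style_attributes(html):
    """Remove every quoted style attribute from an HTML string."""
    return _STYLE_ATTR.sub("", html)


source = '<div style="font-size: 12pt;"> Hello world</div>'
print(strip_style_attributes(source))
```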

Resources