Scraping framework with xpath support - xpath

I'm looking for a web scraping framework that lets me
Hit a given endpoint and load the html response
Search for elements by some css selector
Recover the xpath for that element
Any suggestions? I've seen many that let me search by xpath, but none that actually generate the xpath for an element.

It seems to be true that not many people search by CSS selector yet want a result as an XPath instead, but there are some options to get there.
First I wound up doing this with JQuery plus an additional function. This is because JQuery has pretty nice selection and is easy to find support for. You can use JQuery in Node.js, so you should be able to implement my code in that domain (on a server) instead of on the client (as shown in my simple example). If that's not an option, you can look below for my other potential solution using Python or at the bottom for a C# starter.
For the JQuery approach, the pure JavaScript function is pretty simple for returning the XPath. In the following example (also on JSFiddle) I retrieved the example anchor element with the JQuery selector, got the stripped DOM element, and sent it to my getXPath function:
<html>
<head>
<title>The jQuery Example</title>
<script type="text/javascript"
src="http://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
<script type="text/javascript">
function getXPath( element )
{
var xpath = '';
for ( ; element && element.nodeType == 1; element = element.parentNode )
{
var id = $(element.parentNode).children(element.tagName).index(element) + 1;
id > 1 ? (id = '[' + id + ']') : (id = '');
xpath = '/' + element.tagName.toLowerCase() + id + xpath;
}
return xpath;
}
$(document).ready(function() {
$("#example").click(function() {
alert("Link Xpath: " + getXPath($("#example")[0]));
});
});
</script>
</head>
<body>
<p id="p1">This is an example paragraph.</p>
<p id="p2">This is an example paragraph with a <a id="example" href="#">link inside.</a></p>
</body>
</html>
There is a full library for more robust CSS selector to XPath conversions called css2xpath if you need more complexity than what I provided.
Python (lxml):
For Python you'll want to use lxml's CSS selector class (see link for full tutorial and docs) to get the xml node.
The CSSSelector class
The most important class in the lxml.cssselect module is CSSSelector.
It provides the same interface as the XPath class, but accepts a CSS
selector expression as input:
>>> from lxml.cssselect import CSSSelector
>>> sel = CSSSelector('div.content')
>>> sel #doctest: +ELLIPSIS <CSSSelector ... for 'div.content'>
>>> sel.css
'div.content'
The selector actually compiles to XPath, and you can see the
expression by inspecting the object:
>>> sel.path
"descendant-or-self::div[#class and contains(concat(' ', normalize-space(#class), ' '), ' content ')]"
To use the selector, simply call it with a document or element object:
>>> from lxml.etree import fromstring
>>> h = fromstring('''<div id="outer">
... <div id="inner" class="content body">
... text
... </div></div>''')
>>> [e.get('id') for e in sel(h)]
['inner']
Using CSSSelector is equivalent to translating with cssselect and
using the XPath class:
>>> from cssselect import GenericTranslator
>>> from lxml.etree import XPath
>>> sel = XPath(GenericTranslator().css_to_xpath('div.content'))
CSSSelector takes a translator parameter to let you choose which
translator to use. It can be 'xml' (the default), 'xhtml', 'html' or a
Translator object.
If you're looking to load from a url, you can do that directly when building the etree: root = etree.fromstring(xml, base_url="http://where.it/is/from.xml")
C#
There is a library called css2xpath-reloaded which does nothing but CSS to XPath conversion.
String css = "div#test .note span:first-child";
String xpath = css2xpath.Transform(css);
// 'xpath' will contain:
// //div[#id='test']//*[contains(concat(' ',normalize-space(#class),' '),' note ')]*[1]/self::span
Of course, getting a string from the url is very easy with C# utility classes and needs little discussion:
using(WebClient client = new WebClient()) {
string s = client.DownloadString(url);
}
As for the selection with CSS Selectors, you could try Fizzler, which is pretty powerful. Here's the front page example, though you can do much more:
// Load the document using HTMLAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(#"
<html>
<head></head>
<body>
<div>
<p class='content'>Fizzler</p>
<p>CSS Selector Engine</p></div>
</body>
</html>");
// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode
var document = html.DocumentNode;
// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");
// yields: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");
// yields empty sequence
document.QuerySelectorAll("body>p");
// yields [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");
// yields [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

Related

How to use replace in wkhtmltopdf or send value to header

First question:
I want to replace a value in the header. I use --header-HTML header.html for PDF header. For example :
I want to pass 3 values to a PDF:
date
Letter_Number
letter_title
Second question:
Can I use a view for the header? I want to use a view in ASP. For example:
CustomSwitches = "--header-HTML header.cshtml "
About first question
Maybe you could use an HTML page as header, as you actually do, generate new HTML using C# code, and replacing existent HTML file content, with the one you have created, just after generating PDF using Rotativa. The other option I can see, maybe a little bit efficient, because avoids generating all HTML code using C#, is that you use javascript inside your HTML to get this values (not sure if it's completely achievable, since I ignore the origin of the values you mention).
Supposing date value is current date, you could use something like this on your HTML:
<!DOCTYPE html>
<html>
<head>
<script>
function subst() {
var currentDate = new Date();
var dd = String(currentDate.getDate()).padStart(2, '0');
var mm = String(currentDate.getMonth() + 1).padStart(2, '0');
var yyyy = currentDate.getFullYear();
currentDate = dd + '-' + mm + '-' + yyyy;
document.getElementById("dateSpan").innerHTML = currentDate;
}
</script>
</head>
<body onload="subst()">
<div>
Date: <span id="dateSpan"></span>
</div>
</body>
</html>
And on the other side, point to the HTML in custom switches command. Guessing it is located in a folder called PDF, inside Views folder, you could do:
customSwitches = " --header-html " + Server.MapPath("~/Views/PDF/header.html");
I make use of similar code for generating a footer with page number and it works like a charm.
About second question:
I use an MVC action to generate the the partial view that I use as PDF header.
Your code for the custom switches should look like this (using GenerateHeader as action name, PDF as controller and yourModel as the model to be passed to the View, on which you are supposed to store you values):
customSwitches = "--header-html " + Url.Action("GenerateHeader", "PDF", yourModel, Request.Url.Scheme);
For your PDF controller, assuming PdfHeader.cshtml is the view you want to use as PDF header, the code for the action would be as this:
public PartialViewResult GenerateHeader(YourModelType yourModel)
{
return PartialView("PDF/PdfHeader", yourModel);
}
For this PartialView references, remember to include at your controller:
usign System.Web.Mvc;
Hope this helps, if don't, please let me know.

HtmlAgilityPack - SelectSingleNode for descendants

I found that HtmlAgilityPack SelectSingleNode always starts from the first node of the original DOM. Is there an equivalent method to set its starting node ?
Sample html
<html>
<body>
Home
<div id="contentDiv">
<tr class="blueRow">
<td scope="row">target</td>
</tr>
</div>
</body>
</html>
Not working code
//Expected:iwantthis.com Actual:home.com,
string url = contentDiv.SelectSingleNode("//tr[#class='blueRow']")
.SelectSingleNode("//a") //What should this be ?
.GetAttributeValue("href", "");
I have to replace the code above with this:
var tds = contentDiv.SelectSingleNode("//tr[#class='blueRow']").Descendants("td");
string url = "";
foreach (HtmlNode td in tds)
{
if (td.Descendants("a").Any())
{
url= td.ChildNodes.First().GetAttributeValue("href", "");
}
}
I am using HtmlAgilityPack 1.7.4 on .Net Framework 4.6.2
The XPath you are using always starts at the root of the document. SelectSingleNode("//a") means start at the root of the document and find the first a anywhere in the document; that's why it grabs the Home link.
If you want to start from the current node, you should use the . selector. SelectSingleNode(".//a") would mean find the first a that is anywhere beneath the current node.
So your code would look like this:
string url = contentDiv.SelectSingleNode(".//tr[#class='blueRow']")
.SelectSingleNode(".//a")
.GetAttributeValue("href", "");

Ajax Thymeleaf Springboot

I'm trying to use ajax with thymeleaf. I designed a simple html page with two input field. I would like to use addEventHandler for the value of first input text, then I want to send it to controller and make calculation, after that I need to write it in same html form in the second field which returns from controller.
For example:
first input text value -> controller (make calculation) -> (write value) in second input text.
My html page is
<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">
<head>
<meta charset="UTF-8">
<link rel='stylesheet prefetch' href='http://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css'>
<link rel="stylesheet" href="css/style.css">
</head>
<body>
<input type="text" name="Thing" value=""/>
<script th:inline="javascript">
window.onload = function () {
/* event listener */
document.getElementsByName("Thing")[0].addEventListener('change', doThing);
/* function */
function doThing() {
var url = '#{/testurl}';
$("#fill").load(url);
alert('Horray! Someone wrote "' + this.value + '"!');
}
}
</script>
<!-- Results block -->
<div id="fill">
<p th:text="${responseMsg}"/></div>
</div>
</body>
</html>
My controller
#RequestMapping(value = "/testurl", method = RequestMethod.GET)
public String test(Model model) {
model.addAttribute("responseMsg","calcualted value")
return "test";
}
However I cannot call controller from ajax. Could you help me?
There are a few issues with your code. First of all, it looks like you're using the same template for both the initial loading of the application, and returning the calculated result.
You should split these two into different calls if you're using AJAX, since one of the goals of AJAX is that you don't need to reload an entire page for one change.
If you need to return a simple value, you should use a separate request method like this:
#GetMapping("/calculation")
#ResponseBody
public int multiply(#RequestParam int input) {
return input * 2; // The calculation
}
What's important to notice here is that I'm using #ResponseBody and that I'm sending the input to this method as a #RequestParam.
Since you will be returning the calculated value directly, you don't need the Model, nor the responseMsg. So you can remove that from your original request mapping.
You can also remove it from your <div id="fill">, since the goal of your code is to use AJAX to fill this element and not to use Thymeleaf. So you can just have an empty element:
<div id="fill">
</div>
Now, there are also a few issues with your Thymeleaf page. As far as I know, '#{/testurl}' is not the valid syntax for providing URLs. The proper syntax would be to use square brackets:
var url = [[#{/calculation}]];
You also have to make sure you change the url to point to the new request mapping. Additionally, this doesn't look as beautiful since it isn't valid JavaScript, the alternative way to write this is:
var url = /*[[ #{/calculation} ]]*/ null;
Now, your script has also a few issues. Since you're using $().load() you must make sure that you have jQuery loaded somewhere (this looks like jQuery syntax so I'm assuming you want to use jQuery).
You also have to send your input parameter somehow. To do that, you can use the event object that will be passed to the doThing() function, for example:
function doThing(evt) {
var url = [[#{/calculation}]];
$("#fill").load(url + '?input=' + evt.target.value);
alert('Horray! Someone wrote "' + this.value + '"!');
}
As you can see, I'm also adding the ?input=, which will allow you to send the passed value to the AJAX call.
Finally, using $().load() isn't the best way to work with AJAX calls unless you try to load partial HTML templates asynchronously. If you just want to load a value, you could use the following code in stead:
$.get({
url: /*[[ #{/calculation} ]]*/ null,
data: { input: evt.target.value }
}).then(function(result) {
$('#fill').text(result);
});
Be aware that $.get() can be cached by browsers (the same applies to $().load() though). So if the same input parameter can lead to different results, you want to use different HTTP methods (POST for example).

html-agility-pack extract a background image

How do I extract the url from the following HTML.
i.e.. extract:
http://media.somesite.com.au/img-101x76.jpg
from:
<div class="media-img">
<div class=" searched-img" style="background-image: url(http://media.somesite.com.au/img-101x76.jpg);"></div>
</div>
In XPath 1.0 in general, you can use combination of substring-after() and substring-before() functions to extract part of a text. But HAP's SelectNodes() and SelectSingleNode() can't return other than node(s), so those XPath functions won't help.
One possible approach is to get the entire value of style attribute using XPath & HAP, then process the value further from .NET, using regex for example :
var html = #"<div class='media-img'>
<div class=' searched-img' style='background-image: url(http://media.somesite.com.au/img-101x76.jpg);'></div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[contains(#class,'searched-img')]");
var url = Regex.Match(div.GetAttributeValue("style", ""), #"(?<=url\()(.*)(?=\))").Groups[1].Value;
Console.WriteLine(url);
.NET Fiddle Demo
output :
http://media.somesite.com.au/img-101x76.jpg

jQuery GET html as traversable jQuery object

This is a super simple question that I just can't seem to find a good answer too.
$.get('/myurl.html', function(response){
console.log(response); //works!
console.log( $(response).find('#element').text() ); //null :(
}, 'html');
I am just trying to traverse my the html response. So far the only thing I can think of that would works is to regex to inside the body tags, and use that as a string to create my traversable jQuery object. But that just seems stupid. Anyone care to point out the right way to do this?
Maybe its my html?
<html>
<head>
<title>Center</title>
</head>
<body>
<!-- tons-o-stuff -->
</body>
</html>
This also works fine but will not suit my needs:
$('#myelem').load('/myurl.html #element');
It fails because it doesn't like <html> and <body>.
Using the method described here: A JavaScript parser for DOM
$.get('/myurl.html', function(response){
var doc = document.createElement('html');
doc.innerHTML = response;
console.log( $("#element", doc).text() );
}, 'html');
I think the above should work.
When jQuery parses HTML it will normally strip out the html and body tags, so if the element you are searching for is at the top level of your document structure once the html and body tags have been removed then the find function may not be able to locate the element you're searching for.
See this question for further info - Using jQuery to search a string of HTML
Try this:
$("#element", $(response)).text()
This searches for the element ID in the $(response) treating $(response) as a DOM object.
Maybe I'm misunderstanding something, but
.find('#element')
matches elements with an ID of "element," like
<p id="element">
Since I don't see the "tons of stuff" HTML I don't understand what elements you're trying to find.

Resources