Jsoup css selector code (xpath code included) - xpath

I am trying to parse the HTML below using jsoup but cannot get the syntax right.
<div class="info"><strong>Line 1:</strong> some text 1<br>
<b>some text 2</b><br>
<strong>Line 3:</strong> some text 3<br>
</div>
I need to capture some text 1, some text 2 and some text 3 in three different variables.
I have the XPath for the first line (which should be similar for line 3) but am unable to work out the equivalent CSS selector.
//div[@class='info']/strong[1]/following::text()
On a separate note, I have a few hundred HTML files and need to parse and extract data from them to store in a database. Is Jsoup the best choice for this?

It really looks like Jsoup can't handle getting text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:
import java.io.IOException;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;
public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}
This outputs:
some text 1
some text 2
Line 3:
some text 3
Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.

It is possible to get an object reference to individual TextNodes. I think you may have overlooked Jsoup's TextNode object.
The text at the top level of an Element is an instance of TextNode. For instance, " some text 1" and " some text 3" are both TextNode objects under <div class='info'>, and "Line 1:" is a TextNode object under <strong>.
Element objects have a textNodes() method, which you can use to get hold of these TextNode objects.
Check the following code:
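import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
String html = "<html>" +
    "<body>" +
    "<div class=\"info\">" +
    "<strong>Line 1:</strong> some text 1<br>" +
    "<b>some text 2</b><br>" +
    "<strong>Line 3:</strong> some text 3<br>" +
    "</div>" +
    "</body>" +
    "</html>";
Document document = Jsoup.parse(html);
// First <div> whose class list includes "info"
Element infoDiv = document.select("div.info").first();
// Text nodes that are direct children of that div
List<TextNode> infoDivTextNodes = infoDiv.textNodes();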
This code finds the first <div> Element whose class attribute includes "info", then gets a reference to all of the TextNode objects directly under <div class='info'>. That list looks like:
List<TextNode>[" some text 1", " some text 3"]
TextNode objects have useful data and methods associated with them, and TextNode extends Node, which gives you even more functionality.
The following is an example of getting object references for each TextNode inside divs with class="info".
for (Iterator<Element> elementIt = document.select("div.info").iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        // Do your magic with textNode now.
        // You can even reference its parent via the inherited Node method .parent().
    }
}
Using this nested iterator technique you can access all the text nodes of an object and with some clever logic you can just about do anything you want within Jsoup's structure.
I have implemented this logic for a spell checking method I have created in the past and it does have some performance hits on very large html documents with a high number of elements, perhaps a lot of lists or something. But if your files are reasonable in length, you should get sufficient performance.
The following is an example of getting object references for each TextNode of a Document.
Document document = Jsoup.parse(html);
for (Iterator<Element> elementIt = document.body().getAllElements().iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    // Maybe some magic for each element..
    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        // Lots of magic here for each textNode..
    }
}
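For readers who want to see the "direct text nodes only" idea outside of Jsoup, here is a minimal sketch in Python using only the standard library's html.parser. It is not Jsoup's implementation; the class name DirectTextCollector, the helper direct_texts and the VOID_TAGS set are made up for this illustration. It keeps text that sits directly under the div with class "info" and skips text inside child elements such as strong and b.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
# (Hypothetical, deliberately short list for this sketch.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DirectTextCollector(HTMLParser):
    """Collect text located directly under <div class="info">."""

    def __init__(self):
        super().__init__()
        self.depth = None  # None outside the target div, 0 directly inside it
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth is None:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and "info" in classes:
                self.depth = 0
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth is None:
            return
        if self.depth == 0:
            self.depth = None  # leaving the target div
        else:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text whose parent is the target div itself.
        if self.depth == 0 and data.strip():
            self.texts.append(data.strip())


def direct_texts(html):
    collector = DirectTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


html = ('<div class="info"><strong>Line 1:</strong> some text 1<br>'
        '<b>some text 2</b><br>'
        '<strong>Line 3:</strong> some text 3<br></div>')
print(direct_texts(html))
```

As with textNodes(), "some text 2" is not in the result because it belongs to the b element, so it has to be fetched separately.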

Your problem I think is that of the text you're interested in, only one phrase is enclosed within any defining tags, "some text 2" which is enclosed by <b> </b> tags. So this is easily obtainable via:
String text2 = doc.select("div.info b").text();
which returns
some text 2
The other texts of interest can only be defined as text held within your <div class="info"> tag, and that's it. So the only way that I know of to get this is to get all the text held by this larger element:
String text1 = doc.select("div.info").text();
But unfortunately, this gets all the text held by this element:
Line 1: some text 1 some text 2 Line 3: some text 3
That's about the best I can do, and I'm hoping someone can find a better answer and will keep following this question.

Related

how to get value of a tag that has no class or id in html agility pack?

I am trying to get the text value of this a tag:
67 comments
So I'm trying to get '67' from this. However, there are no defining classes or ids.
I've managed to get this far:
IEnumerable<HtmlNode> commentsNode = htmlDoc.DocumentNode.Descendants(0).Where(n => n.HasClass("subtext"));
var storyComments = commentsNode.Select(n =>
n.SelectSingleNode("//a[3]")).ToList();
This only gives me "comments", annoyingly enough.
I can't use the href id, as there are many of these items, so I can't hardcode the href.
How can I extract the number as well?
Just use the @href attribute and a dedicated string function:
substring-before(//a[@href="item?id=22513425"],"comments")
returns 67.
EDIT: Since you can't hardcode all the content of @href, maybe you can use starts-with. XPath 1.0 solution.
Shortest form (+ text has to contain "comments"):
substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")
More restrictive (+ text has to finish with "comments"):
substring-before(//a[starts-with(@href,"item?")][substring(//a, string-length(//a) - string-length('comments')+1) = 'comments'],"c")
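To make the string handling concrete, here is a small Python sketch of what substring-before does to the link text, plus a regex alternative. The helper names substring_before and leading_number are invented for this illustration, and the link text "67 comments" is taken from the question. Note that substring-before keeps whatever whitespace precedes the marker, so the result may need trimming before it is converted to a number.

```python
import re


def substring_before(text, marker):
    """Mimic XPath 1.0 substring-before: text before the first marker, or '' if absent."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""


def leading_number(link_text):
    """Return the integer at the start of the link text, or None if there is none."""
    match = re.match(r"\s*(\d+)", link_text)
    return int(match.group(1)) if match else None


link_text = "67 comments"
print(repr(substring_before(link_text, "comments")))  # note the trailing space
print(leading_number(link_text))
```

The regex version also copes with links that have no count at all, in which case it returns None instead of an empty string.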
I am using the ScrapySharp NuGet package, which adds the CssSelect method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
var doc = new HtmlDocument();
doc.Load(@"C:\desktop\anchor.html"); // I created an HTML file with your <a> element as the body
var anchor = doc.DocumentNode.CssSelect("a").FirstOrDefault();
if (anchor == null) return;
var digits = anchor.InnerText.ToCharArray().Where(c => Char.IsDigit(c));
Console.WriteLine($"anchor text: {anchor.InnerText} - digits only: {new string(digits.ToArray())}");
Output:

HtmlAgilityPack sanitizing string issue

I'm using HtmlAgilityPack to sanitize user entered rich text and strip any harmful/unwanted text. Problem occurs though when a simple text is also treated as html node
If I enter
a<b, c>d
and try to sanitize it, the output generated is
a<b, c="">d</b,>
The code I used was
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(value);
// Sanitizing Logic
var result = doc.DocumentNode.WriteTo();
I tried setting different parameters on HtmlDocument ('OptionCheckSyntax', 'OptionAutoCloseOnEnd', 'OptionWriteEmptyNodes') so that the text is not treated as a node, but nothing worked. Is this a known issue, and is there any workaround?
IMO, there's no way to tell HAP not to treat every '<' as the start of a new HTML node. But you can check whether your HTML is valid by using
string html = "your-html";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
if (doc.ParseErrors.Count() > 0)
{
//here you can ignore or do whatever you want
}

How can I configure the separator character used for :menuselection:?

I am using Sphinx to generate HTML documentation for my project. Under Inline Markup, the Sphinx documentation discusses :menuselection: for marking a sequence of menu selections using markup like:
:menuselection:`Start --> Programs`
This results in the following HTML:
<span class="menuselection">Start ‣ Programs</span>
i.e. the --> gets converted to the small triangle, which I've determined is U+2023, TRIANGULAR BULLET.
That's all well and good, but I'd like to use a different character instead of the triangle. I have searched the Sphinx package and the theme package (sphinx-bootstrap-theme) somewhat exhaustively for 'menuselection', the triangle character, and a few other things, but haven't turned up anything that does the substitution from --> to ‣ (nothing obvious to me, anyway). But something must be converting it between my .rst source and the html.
My question is: what, specifically is doing the conversion (sphinx core? HTML writer? Theme JS?)?
The conversion is done in the sphinx.roles.menusel_role() function. You can create your own version of this function with a different separator character and register it to be used.
Add the following to your project's conf.py:
from docutils import nodes, utils
from docutils.parsers.rst import roles
from sphinx.roles import _amp_re
def patched_menusel_role(typ, rawtext, text, lineno, inliner, options={}, content=[]):
    text = utils.unescape(text)
    if typ == 'menuselection':
        text = text.replace('-->', u'\N{RIGHTWARDS ARROW}') # Here is the patch
    spans = _amp_re.split(text)
    node = nodes.emphasis(rawtext=rawtext)
    for i, span in enumerate(spans):
        span = span.replace('&&', '&')
        if i == 0:
            if len(span) > 0:
                textnode = nodes.Text(span)
                node += textnode
            continue
        accel_node = nodes.inline()
        letter_node = nodes.Text(span[0])
        accel_node += letter_node
        accel_node['classes'].append('accelerator')
        node += accel_node
        textnode = nodes.Text(span[1:])
        node += textnode
    node['classes'].append(typ)
    return [node], []
# Use 'patched_menusel_role' function for processing the 'menuselection' role
roles.register_local_role("menuselection", patched_menusel_role)
When building HTML, run make clean first so that the updated conf.py is re-parsed with the patch.

How to get full classname with xpathquery?

I'm parsing through a HTML document and I need a class name of a div. I know a part of the class name (that never changes) but I need the full class name.
Here's the code I use:
$doc = new DOMDocument;
$doc->loadHTMLFile('http://some_website.com');
$xpath = new DOMXPath($doc);
$classname_of_the_div=$xpath->query('//div[@class="part_of_the_class_name_that_never_changes"]');
When I var_dump() the $classname_of_the_div and $classname_of_the_div->item(0) the result is:
object(DOMNodeList)#3 (1) { ["length"]=> int(0) }
NULL
I know that $classname_of_the_div=$xpath->evaluate('string(//div[@class="part_of_the_class_name_that_never_changes"])'); gives me the content of the div but how do I get the full class name?
P.S.: The part of the classname is separated from the rest of the class name by white spaces, so it's not really a part of the class. The div has just several classes.
I mean the div has several class names, like "class1 class2 class3". I want to select it by "class2", for example, and receive the full class string including "class1 class2 class3".
Then, an XPath expression like
//div[@class="part_of_the_class_name_that_never_changes"]
will never yield a result, save for the situation that a particular div element only has one class, that is, the one "that never changes". That's because the XPath expression above means:
Select div elements that have a class attribute whose string value
exactly corresponds to "part_of_the_class_name_that_never_changes".
But imagine the following situation:
<div class="part_of_the_class_name_that_never_changes other_class1 other_class2"/>
Then, you would need to change the expression to:
//div[contains(@class,'part_of_the_class_name_that_never_changes')]/@class
The expression means:
Look for div elements that have a class attribute whose string
value contains the string
"part_of_the_class_name_that_never_changes" and return the attribute
value.
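One caveat worth spelling out: contains() is a plain substring test, so selecting by "class2" would also match a div whose class is "class22". A common XPath 1.0 idiom for matching a whole class token is contains(concat(' ', normalize-space(@class), ' '), ' class2 '). The sketch below shows the same token-based check in Python with the standard library, on a small made-up document; the function name classes_of_divs_with is hypothetical and this is only an illustration of the matching rule, not a replacement for the PHP code.

```python
import xml.etree.ElementTree as ET


def classes_of_divs_with(root, wanted):
    """Return the full class string of every div that has `wanted` as a whole class token."""
    results = []
    for div in root.iter("div"):
        class_attr = div.get("class", "")
        # split() breaks on any whitespace, so this compares whole tokens only
        if wanted in class_attr.split():
            results.append(class_attr)
    return results


# Made-up sample document for the illustration
doc = ET.fromstring(
    "<body>"
    '<div class="class1 class2 class3"/>'
    '<div class="class22 other"/>'
    "</body>"
)
print(classes_of_divs_with(doc, "class2"))
```

A substring test would have returned both divs here; the token test returns only the first.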

Unable to set InnerText using Html-Agility-Pack

Given an HTML document, I want to identify all the numbers in the document and add custom tags around the numbers.
Right now, i use the following:
HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//body");
MatchCollection numbersColl = Regex.Matches(htmlNode.InnerText, <some regex>);
Once I get the numbersColl, I can traverse through each Match and get the index.
However, I can't change the InnerText since it is read-only.
What I need is that if match.Value = 100 and match.Index=25, I want to replace that 25 with
<span isIdentified='true'> 25 </span>
Any help on this will be greatly appreciated. Currently, since I am not able to modify the inner text, I have to modify the InnerHtml, but some elements might have 25 in their markup, and that should not be touched. How do I identify whether a number is inside an HTML tag? For example, <table border='1'> has 1 in the tag.
Here's what I did to work around the read-only InnerText property of a text node: select the parent node of the text node, note the index of the text node in the parent's child node collection, and then do a ReplaceChild(...).
private void WriteText(HtmlNode node, string text)
{
    if (node.ChildNodes.Count > 0)
    {
        node.ReplaceChild(htmlDocument.CreateTextNode(text), node.ChildNodes.First());
    }
    else
    {
        node.AppendChild(htmlDocument.CreateTextNode(text));
    }
}
In your case I believe you need to create a new Element node that wraps the text into an HtmlElement and then just use it as a replacement of the Text node.
Or even better, see if you can do something like the answer posted here:
Replacing a HTML div InnerText tag using HTML Agility Pack
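The underlying idea, touching numbers only in text and never inside tags, can be sketched in a few lines of Python. This is a rough illustration under a strong assumption: tags are split off with a simple regular expression, which breaks on a '>' inside attribute values, comments or scripts. With a real parser such as Html Agility Pack you would visit the text nodes instead. The names TAG_SPLIT, NUMBER and wrap_numbers are invented for this example.

```python
import re

# Capturing group: re.split keeps the tags, so the result alternates text, tag, text, ...
TAG_SPLIT = re.compile(r"(<[^>]*>)")
NUMBER = re.compile(r"\d+")


def wrap_numbers(html):
    """Wrap each number found in text (not inside tags) in a marker span."""
    parts = TAG_SPLIT.split(html)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            out.append(part)  # a tag: leave untouched, including attribute values
        else:
            out.append(NUMBER.sub(
                lambda m: "<span isIdentified='true'>" + m.group(0) + "</span>",
                part))
    return "".join(out)


sample = "<table border='1'><tr><td>Total 25</td></tr></table>"
print(wrap_numbers(sample))
```

The 1 in border='1' is left alone because it sits in a tag segment, while the 25 in the cell text gets wrapped.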
Creating a text node does not do what it should in this case:
myParentNode.AppendChild(D.CreateTextNode("<script>alert('a');</script>"));
Console.Write(myParentNode.InnerHtml);
The result should be something like
&lt;script....
but it is a working script tag, even though I add it as text and not as HTML. This is a security issue for me because the text would be input from an anonymous user.
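The usual remedy for this kind of problem is to HTML-encode untrusted text yourself before handing it to the document, rather than relying on the text node to do it. In .NET, System.Net.WebUtility.HtmlEncode is one way to do that; whether it fits this exact setup is an assumption to verify. The effect of encoding is easy to see in a short Python sketch using the standard library:

```python
import html

# Untrusted input, taken from the example above
user_input = "<script>alert('a');</script>"

# Encode the markup-significant characters so the string renders as text
safe = html.escape(user_input)
print(safe)
```

After encoding, the string contains no raw '<' or '>' characters, so a browser displays it as text instead of executing it, and html.unescape(safe) gives back the original input.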

Resources