Convert XPath to Jsoup query

Does anyone know of an XPath to Jsoup converter? I get the following XPath from Chrome:
//*[@id="docs"]/div[1]/h4/a
and would like to change it into a Jsoup query. The path points to an element whose href I'm trying to read.

This is very easy to convert manually.
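Something like this (not tested). Note that XPath positions are 1-based, while Jsoup's :eq(n) index is 0-based, so div[1] maps to :nth-of-type(1) (or :eq(0) when the div is the first child):
document.select("#docs > div:nth-of-type(1) > h4 > a").attr("href");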
Documentation:
http://jsoup.org/cookbook/extracting-data/selector-syntax
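The manual conversion described above is mechanical enough to sketch in code. The class below is a hypothetical helper (XPathToCss is not part of Jsoup) that uses only the JDK and handles just the simple paths Chrome tends to copy: child steps, numeric positions, and an @id predicate. It assumes ids need no CSS escaping, and it rejects anything else rather than guessing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jsoup): converts simple Chrome-style XPaths to CSS selectors. */
class XPathToCss {
    // One step such as: div, div[2], or *[@id="docs"]
    private static final Pattern STEP = Pattern.compile(
            "([A-Za-z][\\w-]*|\\*)(?:\\[(\\d+)\\]|\\[@id=[\"']([^\"']+)[\"']\\])?");

    static String convert(String xpath) {
        // A leading "/" or "//" only says where matching starts; drop it.
        String path = xpath.replaceFirst("^/{1,2}", "");
        if (path.isEmpty() || path.contains("//")) {
            throw new IllegalArgumentException("Unsupported XPath: " + xpath);
        }
        List<String> parts = new ArrayList<>();
        for (String step : path.split("/")) {
            Matcher m = STEP.matcher(step);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unsupported XPath step: " + step);
            }
            String tag = m.group(1);
            if (m.group(3) != null) {
                // *[@id="docs"] becomes #docs, div[@id="docs"] becomes div#docs
                parts.add(("*".equals(tag) ? "" : tag) + "#" + m.group(3));
            } else if (m.group(2) != null) {
                // XPath positions are 1-based, and so is :nth-of-type
                parts.add(tag + ":nth-of-type(" + m.group(2) + ")");
            } else {
                parts.add(tag);
            }
        }
        // "/" in XPath is the child axis, which is ">" in CSS
        return String.join(" > ", parts);
    }

    public static void main(String[] args) {
        System.out.println(convert("//*[@id=\"docs\"]/div[1]/h4/a"));
    }
}
```

The resulting string can be passed to document.select(...) as in the answer above. Using :nth-of-type rather than :eq keeps the 1-based XPath position unchanged and counts only siblings with the same tag, which is what div[1] means in XPath.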
Related question from comment
Trying to get the href for the first result here:
cbssports.com/info/search#q=fantasy%20tom%20brady
Code
Elements select = Jsoup.connect("http://solr.cbssports.com/solr/select/?q=fantasy%20tom%20brady")
.get()
.select("response > result > doc > str[name=url]");
for (Element element : select) {
System.out.println(element.html());
}
Result
http://fantasynews.cbssports.com/fantasyfootball/players/playerpage/187741/tom-brady
http://www.cbssports.com/nfl/players/playerpage/187741/tom-brady
http://fantasynews.cbssports.com/fantasycollegefootball/players/playerpage/1825265/brady-lisoski
http://fantasynews.cbssports.com/fantasycollegefootball/players/playerpage/1766777/blake-brady
http://fantasynews.cbssports.com/fantasycollegefootball/players/playerpage/1851211/brady-foltz
http://fantasynews.cbssports.com/fantasycollegefootball/players/playerpage/1860955/brady-earnhardt
http://fantasynews.cbssports.com/fantasycollegefootball/players/playerpage/1673397/brady-amack
Screenshot from Developer Console - grabbing urls

I am using Google Chrome Version 47.0.2526.73 m (64-bit) and I can now directly copy the Selector path which is compatible with JSoup
Copied Selector of the element in the screenshot span.com is
#question > table > tbody > tr:nth-child(1) > td.postcell > div > div.post-text > pre > code > span.com

You don't necessarily need to convert Xpath to JSoup specific selectors.
Instead you can use XSoup which is based on JSoup and supports Xpath.
https://github.com/code4craft/xsoup
Here is an example using XSoup from the docs.
@Test
public void testSelect() {
String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
"<table><tr><td>a</td><td>b</td></tr></table></html>";
Document document = Jsoup.parse(html);
String result = Xsoup.compile("//a/@href").evaluate(document).get();
Assert.assertEquals("https://github.com", result);
List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
Assert.assertEquals("a", list.get(0));
Assert.assertEquals("b", list.get(1));
}

I have tested the following XPath and Jsoup pairs, and they work.
example 1:
[XPath]
//*[@id="docs"]/div[1]/h4/a
[JSoup]
document.select("#docs > div > h4 > a").attr("href");
example 2:
[XPath]
//*[@id="action-bar-container"]/div/div[2]/a[2]
[JSoup]
document.select("#action-bar-container > div > div:eq(1) > a:eq(1)").attr("href");

Here is the working standalone snippet using Xsoup with Jsoup:
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;
public class TestXsoup {
public static void main(String[] args){
String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
"<table><tr><td>a</td><td>b</td></tr></table></html>";
Document document = Jsoup.parse(html);
List<String> filasFiltradas = Xsoup.compile("//tr/td/text()").evaluate(document).list();
System.out.println(filasFiltradas);
}
}
Output:
[a, b]
Libraries included:
xsoup-0.3.1.jar
jsoup-1.10.3.jar

Depends what you want.
Document doc = Jsoup.connect(googleURL).get(); // Jsoup.parse(String) expects HTML, not a URL
doc.select("cite"); // to get all the <cite> elements in the page
doc.select("li > cite"); // to get only the <cite> elements that are direct children of an <li>
doc.select("li.g cite"); // to get only the <cite> elements under <li class="g"> elements
public static void main(String[] args) throws IOException {
String html = getHTML();
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.g > cite");
for(Element elem: elems){
System.out.println(elem.toString());
}
}

Although this question is pretty old, I just want to mention that latest Jsoup release has some Beta features like the one requested in this question.
Release 1.14.3 added a native XPath selector. See it for yourselves: https://jsoup.org/news/release-1.14.3
Now you can use Jsoup native methods:
File downloadedPage = new File("/path/to/your/page.html");
String xPathSelector = "//*[@id=\"docs\"]/div[1]/h4/a";
Document document = Jsoup.parse(downloadedPage, "UTF-8");
Elements elements = document.selectXpath(xPathSelector);
You can iterate over the elements returned!

Related

iText 7 pdfHTML keep table rows together

I am new to iText 7. I am developing a SPA project (ASP.NET, C#, and AngularJS) in which I need to generate a report from an existing HTML page. iText 7 (.NET) offers an easy way to do this: the code below gives me a byte array that I can show in the browser as a PDF or offer as a download.
var memStream = new MemoryStream();
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.SetFontProvider(fontProvider);
converterProperties.SetBaseUri(System.AppDomain.CurrentDomain.BaseDirectory);
HtmlConverter.ConvertToPdf(htmlText, memStream, converterProperties);
My raw HTML contains several tables, each with a handful of rows, and I want to keep each table on one page (that is, if a table's rows do not fit on the current page, the table should start on the next page). I found a solution like the one below:
Paragraph p = new Paragraph("Test");
PdfPTable table = new PdfPTable(2);
for (int i = 1; i < 6; i++) {
table.addCell("key " + i);
table.addCell("value " + i);
}
for (int i = 0; i < 40; i++) {
document.add(p);
}
// Try to keep the table on 1 page
table.setKeepTogether(true);
document.add(table);
But in my case I cannot do it that way, because the content already exists as HTML tables in my existing page.
Thanks in advance to anyone who can help.
This can easily be done using a custom TagWorkerFactory and TableTagWorker class.
Take a look at the code samples below.
The first thing we should do is create a custom TableTagWorker that tells iText to keep the table together. We do this using the code you've mentioned: table.setKeepTogether(true).
class CustomTableTagWorker extends TableTagWorker{
public CustomTableTagWorker(IElementNode element, ProcessorContext context) {
super(element, context);
}
@Override
public void processEnd(IElementNode element, ProcessorContext context) {
super.processEnd(element, context);
((com.itextpdf.layout.element.Table) getElementResult()).setKeepTogether(true);
}
}
As you can see the only thing we changed on our custom TableTagWorker is the fact that it has to keep the table together.
The next step would be to create a custom TagWorkerFactory that maps our CustomTableTagWorker to the table tag in HTML. We do this like so:
class CustomTagWorkerFactory extends DefaultTagWorkerFactory{
@Override
public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
if (tag.name().equalsIgnoreCase("table")) {
return new CustomTableTagWorker(tag, context); // implements ITagWorker
}
return super.getCustomTagWorker(tag, context);
}
}
All we do here is tell iText that if it finds a table tag it should pass the element to the CustomTableTagWorker, in order to be converted to a PDF object (where setKeepTogether == true).
The last step is registering this CustomTagWorkerFactory on our ConverterProperties.
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setTagWorkerFactory(new CustomTagWorkerFactory());
HtmlConverter.convertToPdf(HTML, new FileOutputStream(DEST), converterProperties);
Using these code samples I was able to generate an output PDF in which any table small enough to fit on a single page is never split across pages.
I had a similar issue of trying to keep together content within a div. I applied the following css property and this kept everything together. This worked with itext7 pdfhtml.
page-break-inside: avoid;

How to get "Repro Steps" of a list of work items?

My team has been using VSTS for 8 months. Now our customer is asking for the "Repro Steps" of the work items in VSTS.
Is there any way to get the content of "Repro Steps" without the HTML formatting?
No, because the Repro Steps value is rich text that can contain images and so on, so the value would be incomplete if it were returned without the HTML formatting.
However, you can strip the HTML tags programmatically.
Simple code:
public static string StripHTML(string input)
{
return Regex.Replace(input, "<.*?>", String.Empty);
}
var u = new Uri("[collection URL]");
VssCredentials c = new VssCredentials(new Microsoft.VisualStudio.Services.Common.WindowsCredential(new NetworkCredential("[user name]", "[password]")));
var connection = new VssConnection(u, c);
var workitemClient = connection.GetClient<WorkItemTrackingHttpClient>();
var workitem = workitemClient.GetWorkItemAsync(96).Result;
object repoValue = workitem.Fields["Microsoft.VSTS.TCM.ReproSteps"];
string repoValueWithOutformat = StripHTML(repoValue.ToString());
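A plain "<.*?>" replacement runs the text of adjacent blocks together and leaves entities such as &nbsp; in the output. The sketch below shows one way to soften that. It is written in Java (the language used for most code on this page) rather than C#, uses only the JDK, and the class and method names are hypothetical. It turns a few block-level tags into line breaks before removing the rest, then decodes a handful of common entities. It is a regex-based approximation and not an HTML parser, so unusual markup may still come through imperfectly.

```java
/** Hypothetical helper: reduces rich-text HTML (such as a Repro Steps value) to plain text. */
class ReproStepsText {
    static String stripHtml(String html) {
        String text = html
                // Keep the line structure: <br> and a few closing block tags become newlines.
                .replaceAll("(?i)<br\\s*/?>|</(p|div|li|tr)>", "\n")
                // Drop every remaining tag.
                .replaceAll("<[^>]*>", "")
                // Decode a few common entities; &amp; goes last so "&amp;lt;" stays "&lt;".
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<div>Open the app</div><div>Click&nbsp;<b>Save</b> &amp; wait</div>";
        System.out.println(stripHtml(html));
    }
}
```

The same two-step idea (map block tags to newlines, then strip) carries over directly to the Regex.Replace call in the C# snippet above.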

Concatenating multiple processing instruction results in a for loop in XQuery/XPath

I need to read all processing instructions with NAME="CONTENTTYPE", read their @VALUE, concatenate all the values, and return the result in XQuery/XPath.
My XML:
<REG >
<MARKER MRKEID="SLREG:7.1" MRKTYPE="LD DU" MRKDATE="20130909" MRKTIME="10402688"/>
<?METADATA NAME="CONTENTTYPE" VALUE="STATUTE"?>
<?METADATA NAME="CONTENTTYPE" VALUE="LEGISLATIVEDOCUMENT"?>
<?METADATA NAME="CONTENTTYPE" VALUE="PRIMARYSOURCE"?>
<?METADATA NAME="SLTAXTYPE" VALUE="PRIMARYSOURCE"?>
</REG>
ExpectedOutput:
STATUTE
LEGISLATIVEDOCUMENT
PRIMARYSOURCE
Appreciate your help in writing the XQuery/XPath to get the output as above.
Thanks in Advance.
Regards,
Hari
//processing-instruction('METADATA')[matches(., 'NAME="CONTENTTYPE" VALUE="[^"]*"')]/replace(substring-after(., 'VALUE="'), '"', ''). That's XPath 2.0.
Tagging with JDOM helped me find this.
Long answer coming.... XPath does not have the native ability to parse the 'standard' way of adding 'attributes' to ProcessingInstructions. If you want to do the concatenation of the values as part of a single XPath expression I think you are out of luck.... actually, Martin's answer looks promising, but it will return a number of String values, not ProcessingInstructions. JDOM 2.x will need a Filters.fstring() on the XPath compile(...) call, and you will get a List<String> result from path.evaluate(doc).... I think it's simpler to do it outside of the XPath, especially given that JDOM 2.x has only limited support for XPath 2.0, through the Saxon library.
As for doing it programmatically, JDOM 2.x helps a fair amount. Taking your example XML I did it two ways, the first way uses a custom Filter on the XPath resultset. The second way does effectively the same thing but restricting the PI's further in the loop.
public static void main(String[] args) throws Exception {
SAXBuilder saxb = new SAXBuilder();
Document doc = saxb.build(new File("data.xml"));
// This custom filter will return PI's that have the NAME="CONTENTTYPE" 'pseudo' attribute...
@SuppressWarnings("serial")
Filter<ProcessingInstruction> contenttypefilter = new AbstractFilter<ProcessingInstruction>() {
@Override
public ProcessingInstruction filter(Object obj) {
// because we know the XPath expression selects Processing Instructions
// we can safely cast here:
ProcessingInstruction pi = (ProcessingInstruction)obj;
if ("CONTENTTYPE".equals(pi.getPseudoAttributeValue("NAME"))) {
return pi;
}
return null;
}
};
XPathExpression<ProcessingInstruction> xp = XPathFactory.instance().compile(
// search for all METADATA PI's.
"//processing-instruction('METADATA')",
// The XPath will return ProcessingInstruction content, which we
// refine with our custom filter.
contenttypefilter);
StringBuilder sb = new StringBuilder();
for (ProcessingInstruction pi : xp.evaluate(doc)) {
sb.append(pi.getPseudoAttributeValue("VALUE")).append("\n");
}
System.out.println(sb);
}
This second way uses the simpler, pre-defined Filters.processinginstruction() but then does the additional filtering manually....
public static void main(String[] args) throws Exception {
SAXBuilder saxb = new SAXBuilder();
Document doc = saxb.build(new File("data.xml"));
XPathExpression<ProcessingInstruction> xp = XPathFactory.instance().compile(
// search for all METADATA PI's.
"//processing-instruction('METADATA')",
// Use the pre-defined filter to set the generic type
Filters.processinginstruction());
StringBuilder sb = new StringBuilder();
for (ProcessingInstruction pi : xp.evaluate(doc)) {
if (!"CONTENTTYPE".equals(pi.getPseudoAttributeValue("NAME"))) {
continue;
}
sb.append(pi.getPseudoAttributeValue("VALUE")).append("\n");
}
System.out.println(sb);
}
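For readers without JDOM on the classpath, the same "select with XPath, filter outside it" idea can be sketched with the XPath 1.0 engine that ships with the JDK. The class name and the pseudo-attribute regex below are assumptions made for this illustration. The JDK's DOM API has no equivalent of JDOM's getPseudoAttributeValue, so the NAME="..." and VALUE="..." pairs are pulled out of the processing instruction's data by hand.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

/** Hypothetical JDK-only variant of the JDOM examples above. */
class ContentTypeValues {
    // Matches pseudo-attributes such as NAME="CONTENTTYPE" inside the PI data.
    private static final Pattern PSEUDO_ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static List<String> contentTypes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // XPath 1.0 can select the METADATA PIs but cannot look inside them.
            NodeList pis = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//processing-instruction('METADATA')", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < pis.getLength(); i++) {
                String data = ((ProcessingInstruction) pis.item(i)).getData();
                String name = null;
                String value = null;
                Matcher m = PSEUDO_ATTR.matcher(data);
                while (m.find()) {
                    if ("NAME".equals(m.group(1))) {
                        name = m.group(2);
                    } else if ("VALUE".equals(m.group(1))) {
                        value = m.group(2);
                    }
                }
                // Keep only the CONTENTTYPE entries, as in the filters above.
                if ("CONTENTTYPE".equals(name) && value != null) {
                    values.add(value);
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read METADATA processing instructions", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<REG>"
                + "<?METADATA NAME=\"CONTENTTYPE\" VALUE=\"STATUTE\"?>"
                + "<?METADATA NAME=\"CONTENTTYPE\" VALUE=\"LEGISLATIVEDOCUMENT\"?>"
                + "<?METADATA NAME=\"CONTENTTYPE\" VALUE=\"PRIMARYSOURCE\"?>"
                + "<?METADATA NAME=\"SLTAXTYPE\" VALUE=\"PRIMARYSOURCE\"?>"
                + "</REG>";
        // Join the values with newlines, matching the expected output in the question.
        System.out.println(String.join("\n", contentTypes(xml)));
    }
}
```

The trade-off against the JDOM versions is the hand-written pseudo-attribute parsing. In exchange there is no extra dependency.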

A better solution than element.Elements("Whatever").First()?

I have an XML file like this:
<SiteConfig>
<Sites>
<Site Identifier="a" />
<Site Identifier="b" />
<Site Identifier="c" />
</Sites>
</SiteConfig>
The file is user-editable, so I want to provide reasonable error message in case I can't properly parse it. I could probably write a .xsd for it, but that seems kind of overkill for a simple file.
So anyway, when querying for the list of <Site> nodes, there's a couple of ways I could do it:
var doc = XDocument.Load(...);
var siteNodes = from siteNode in
doc.Element("SiteConfig").Element("Sites").Elements("Site")
select siteNode;
But the problem with this is that if the user has not included the <Sites> node (say), it'll just throw a NullReferenceException, which doesn't tell the user much about what actually went wrong.
Another possibility is just to use Elements() everywhere instead of Element(), but that doesn't always work out when coupled with calls to Attribute(), for example, in the following situation:
var siteNodes = from siteNode in
doc.Elements("SiteConfig")
.Elements("Sites")
.Elements("Site")
where siteNode.Attribute("Identifier").Value == "a"
select siteNode;
(That is, there's no equivalent to Attributes("xxx").Value)
Is there something built-in to the framework to handle this situation a little better? What I would prefer is a version of Element() (and of Attribute() while we're at it) that throws a descriptive exception (e.g. "Looking for element <xyz> under <abc> but no such element was found") instead of returning null.
I could write my own version of Element() and Attribute() but it just seems to me like this is such a common scenario that I must be missing something...
You could implement your desired functionality as an extension method:
public static class XElementExtension
{
public static XElement ElementOrThrow(this XElement container, XName name)
{
XElement result = container.Element(name);
if (result == null)
{
throw new InvalidDataException(string.Format(
"{0} does not contain an element {1}",
container.Name,
name));
}
return result;
}
}
You would need something similar for XDocument. Then use it like this:
var siteNodes = from siteNode in
doc.ElementOrThrow("SiteConfig")
.ElementOrThrow("Sites")
.Elements("Site")
select siteNode;
Then, if the <Sites> element is missing, you will get an exception like this:
SiteConfig does not contain an element Sites
You could use XPathSelectElements
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
class Program
{
static void Main()
{
var ids = from site in XDocument.Load("test.xml")
.XPathSelectElements("//SiteConfig/Sites/Site")
let id = site.Attribute("Identifier")
where id != null
select id;
foreach (var item in ids)
{
Console.WriteLine(item.Value);
}
}
}
Another thing that comes to mind is to define an XSD schema and validate your XML file against this schema. This will generate meaningful error messages and if the file is valid you can parse it without problems.

Umbraco DataTypes. Retrieve list of possible data types.

I have a property in umbraco that uses a drop down data type with a set of prevalues that you can select from.
How do I retrieve a list of all the possible prevalues that are in this drop down list?
There's a helper method in umbraco.library that does that.
From xslt:
<xsl:variable name="prevalues" select="umbraco.library:GetPreValues(1234)" />
From code:
using umbraco;
XPathNodeIterator prevalues = library.GetPreValues(1234);
Replace 1234 with the id of your datatype (You can see it in the bottom of your browser when hovering your mouse over the datatype in the developers section)
Regards
Jesper Hauge
Here is the code that I use in one of my Umbraco datatypes to get a DropDownList containing all possible prevalues:
var prevalues = PreValues.GetPreValues(dataTypeDefinitionId);
DropDownList ddl = new DropDownList();
if (prevalues.Count > 0)
{
for (int i = 0; i < prevalues.Count; i++)
{
var prevalue = (PreValue)prevalues[i];
if (!String.IsNullOrEmpty(prevalue.Value))
{
ddl.Items.Add(new ListItem(prevalue.Value, prevalue.DataTypeId.ToString()));
}
}
}
Replace dataTypeDefinitionId with the id of your datatype.
I know this is an old question, but I created this method based on the information provided in this answer and I think it is worth documenting:
public static class UmbracoExtensions
{
public static IEnumerable<string> GetDropDownDataTypeValues(int dataTypeId)
{
var dataTypeValues = umbraco.library.GetPreValues(dataTypeId);
var dataTypeValuesEnumerator = dataTypeValues.GetEnumerator();
while (dataTypeValues.MoveNext())
{
dynamic dataTypeItem = dataTypeValues.Current;
yield return dataTypeItem.Value;
}
}
}
