iText7 html to pdf conversion with target-counter having dynamic target value not working - itext7

I have been looking into why the page numbers in my generated toc are not working on the latest (3.0.3) version of htmlpdf - even though the release note state that this is now supported.
The devil is in the detail, as noted in the example (https://kb.itextpdf.com/home/it7kb/examples/pdfhtml-support-for-generating-cross-references-for-toc-creation-with-target-counter-target-counters-css-properties) the following works very well:
.preface::before {
content: target-counter(url('#id1'), page) ' ';
}
But as soon as you change this to something more dynamic, in order to not have to not have to specify an entry for each topic you would like to have in the TOC it fails.
.preface::before {
content: target-counter(attr(href), page) ' ';
}
As also indicated in the logging by the following line:
WARN com.itextpdf.html2pdf.attach.impl.layout.PageTargetCountRenderer - Cannot resolve target-counter value with given target "attr(href)"
It looks like the "attr(href)" is taken as a literal and not resolved to the context it is being used in, eg. in this case extracting the tag's href value and using that to find the right page number.
Could this be fixed somehow - or can this be covered by some custom coding?
Thanks.

Using the latest version of the renderer (3.0.5 at the time of writing) effectively supports this properly!

Related

Issue with xsl-fo :footnote when generating pdf/ua-1 document with fop.: "tagged PDF note id is missing"

I have an issue with <fo:footnote> when generating pdf/ua-1 document with fop.
The resulting pdf displays correctly the footnote in the page but don’t pass the pdf-ua validation. A severe error on pdf tag Note “id is missing” is raised so the document is not conformed. I'm using PAC3 for the conformance test.
In the example below I have extracted the basic <fo:footnote> element which has a unique id.
How can I generate the missing Id attribute in the pdf tagged Note element?
Here is the xsl-fo really simple footnote. Note that I used an id to reference the footnote.
some text...
<fo:footnote id="FNE0001">
<fo:inline font-size="6pt" baseline-shift="super">E0001</fo:inline>
<fo:footnote-body>
<fo:block>
<fo:inline>E0001</fo:inline><fo:inline > JO L 139 du 29.5.2002, p. 9.</fo:inline>
</fo:block>
</fo:footnote-body>
</fo:footnote> some text...
Apache FOP has been set to generate pdf-ua through the conf file as follow:
<renderers>
<renderer mime="application/pdf">
<!-- Before setting the pdf-ua-mode, we must insert metadata Title in FO declaration -->
<pdf-ua-mode>PDF/UA-1</pdf-ua-mode>
....
PAC3 is checking against the failure conditions in the Matterhorn Protocol; latest edition at https://www.pdfa.org/resource/the-matterhorn-protocol/.
PAC3 is probably reporting on failure condition 19-003, ID entry of the tag is not present.
fo:footnote-body can have an id property. You could try adding an ID to that. In the fo:inline for the footnote marker, you might also need to add an fo:basic-link that refers to that ID.
FWIW, your footnote does not cause that error when checking PDF/UA generated by AH Formatter, and I'm only guessing at what FOP would be doing differently. (PAC3 and I generally disagree when it complains about a footnote being a possibly incorrect use of a Note tag, but that's another story: PAC3 tries to automate checking the conditions that should be checked by a human, and it doesn't always get it right.)

How to properly scraping filtered content using XPath Query to Google Sheet?

So, this is about a content from a website which I want to get and put it in my Google Sheets, but I'm having difficulty understanding the class of the content.
target link: https://www.cnbc.com/quotes/?symbol=XAU=
This number is what I want to get from. Picture 1: The part which i want to scrape
And this is what the code looks like in inspector. Picture 2: The code shown in inspector
The target is inside a span attribute but the span attribute looks very difficult to me, so I tried to simplify it using this line of code here =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[#class='quote-horizontal regular']//tr/td/span")
Picture 3: List is shown when putting the code
After some tries, I am able to get the right target, but it confuse me, Im using this code =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[#class='quote-horizontal regular']//tr/td/span[#class='last original'][1]")
Picture 4: The right target is shown when the xpath query is more specified
As what you can see in 2nd Picture, 'last original' is not really the full name of the class, when I put the 'last original ng-binding' instead it gave me an error saying imported content is empty
So, correct me if my code is wrong, or accidental worked out somehow because there's another correct way?
How about this answer?
Modified formula 1:
When the name of class is last original and last original ng-binding, how about the following xpath and formula?
=IMPORTXML(A1,"//span[contains(#class,'last original')][1]")
In this case, the URL of https://www.cnbc.com/quotes/?symbol=XAU= is put in the cell "A1".
In this case, //span[contains(#class,'last original')][1] is used as the xpath. The value of span that the name of class includes last original is retrieved. So last original and last original ng-binding can be used.
Modified formula2:
As other xpath, how about the following xpath and formula?
=IMPORTXML(A1,"//meta[#itemprop='price']/#content")
It seems that the value is included in the metadata. So this sample retrieves the value from the metadata.
Reference:
IMPORTXML
To complete #Tanaike's answer, two alternatives :
=IMPORTXML(B2;"//span[#class='year high']")
"Year high" seems always equal to the current stock index value.
Or, with value retrieved from the script element :
=IMPORTXML(B2;"substring-before(substring-after(//script[contains(.,'modApi')],'""last\"":\""'),'\')")
Note : since I'm based in Europe, you need to replace ; with , in the formulas.

XPATH - Retrieve data from nestled html span classes

I am trying to get these two attributes separately. When I try to get the version class the duration also gets lumped in with it as the tag is not closed. Also if there happens to be no version then I'd just get the duration returned. How do I ensure I grab this data separately and correctly?
Here's the html:
<span class="version">Original Version <span class="duration">(6:20)</span></span>
This is my current code and also the results I get now:
.//span[#class='duration'] Result: "(6:20)" CORRECT
.//span[#class='version'] Result: "Original Version (6:20)" INCORRECT!
I tried playing around with the 'not contains' operator but still cannot figure it out. Thanks for any help in advance.
This might be one of the few valid use cases for text():
.//span[#class='version']/text()
would give you just text nodes that are direct children of the version span, and not the text contained in any child elements.
In your example you'd get one text node whose value is "Original Version " (including a trailing space).

HtmlUnit getByXpath returns null

I am coding with Groovy, however, I don't believe its a language specific set of questions.
I actually have two questions
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling javascript creates a mess in my cmd prompt.
Second Question
I am wanting to also get the image but am having trouble because when I attempt to get the XPath (via firebug) it shows up as: //*[#id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPATH was off by one in the predicate filter for the 4th div of the body, it should be the 3rd div. It appears the HTML for the site can/does change from when you had origionally snagged the XPATH using Firebug. You may need to adjust your XPATH to accommodate for potential change and be less sensitive to some differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPATH that you listed will work. It may look odd/short(and may not be the most efficient), but // starts at the root node and looks throughout every node in the tree, * matches on any element(to include the img) and the [] predicate filter restricts it to those that have an id attribute who's value equals "gmi-ResViewSizer_img".
There are many other options for XPATHs that could work as well. It will also depend on how often the HTML structure changes. This is one that also works for the page referenced to select that img:
/html/body/div/div/div/div/img[1]
I had the same problem, I solved when I realize iframe tags on page, try call
((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getElementByXPath(...
where n is the position in frame in iframe collection. It's work for me !!!
Thanks a lot.

Accessing Comments in XML using XPath

How to access the comments inside the XML document using XPath?
For example:
<table>
<length> 12 </length>
<!--Some comment here-->
</table>
I want to access the "Some comment here".
Thanks...
EDIT: I am using MSXML DOM ActiveX and the command comment() seems to be failing... Any idea why?
With the path
/foo/bar/comment()
you can select all comments in the /foo/bar element. May depend on your language of choice, of course. But generally this is how you do it.
Use comment() function for example:-
/table/length/following::comment()[1]
selects the first comment that follows the length element.
Edit
Manoj asks in a comment to this answer why this isn't working in MSXML. The reason will be you are using MSXML3. By default MSXML3 does not use XPath as its selection language, it defaults to an earlier much weaker language (XSL pattern). You need to set XPath as the selection language via the DOMDocument's setProperty method. E.g (in JScript):-
var dom = new ActiveXObject("MSXML2.DOMDocument.3.0");
dom.setProperty("SelectionLanguage", "XPath");
Now the full XPath language will work in your queries (note one breaking change is indexer predicates are 1 based in XPath whereas they were 0 based in XSL Pattern).
Based on the OP's comments to posted answers (and my curiosity as to why this simple thing would not work), here is my suggestion:
Using the XPath expression suggested by #Anthony, I was able to successfully load the comment node with the following JS function:
function SelectComment(s)
{
var xDoc = new ActiveXObject("MSXML2.DOMDocument.6.0");
if (xDoc)
{
xDoc.loadXML(s);
var selNode = xDoc.selectSingleNode("/table/length/following::comment()[1]");
if (selNode != null)
return selNode.text;
else
return "";
}
}
Sample invocation:
SelectComment("<table><length> 12</length><!--Some comment here--></table>");
Output:
"Some comment here"
Notes:
a. Your MSXML version may vary. Please use appropriately.
b. This kind of code is definitely not recommended because it works only on IE. However, since this is your explicitly stated requirement, I have used the ActiveXObject.
c. You have not mentioned in your comments what fails in the suggested XPath expressions. My guess is that you are not querying the text property of the retrieved node. Keep in mind that the SelectSingleNode always returns an IXmlNode and you need to query its data or text properties.
Maybe this coud help,
This sample removes Comments
XmlNodeList list = xmlDoc.SelectNodes("//comment()");
foreach(XmlNode node in list)
node.ParentNode.RemoveChild(node);
Leaned from here link text
<adjustment>
<!-- krishna k -->
<name>FX Update USD</name>
<!-- Since this plan updates existing adj's no ajd's will be created using this id -->
<id>7206</id>
Am facing the similar Issue my application is reading comments which causes stack crash. How can I avoid reading comments by DOM.

Resources