PDFBox generating blank images for PDFs that contain JBIG2 images

Let me give an overview of my project first. I have a PDF that I need to convert into images (one image per page) using the PDFBox API, and then write all of those images into a new PDF, again with PDFBox. In other words, I am converting a PDF into a PDF, which we refer to as PDF transcoding.
For certain PDFs that contain JBIG2 images, PDFBox's convertToImage() method fails silently, without throwing any exception, and the resulting PDF contains only blank (white) pages. The messages I get on the console are:
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.filter.JBIG2Filter decode
SEVERE: Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap getRGBImage
SEVERE: Something went wrong ... the pixelmap doesn't contain any data.
Dec 06, 2013 5:15:42 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
WARNING: getRGBImage returned NULL
How can I resolve this issue? There is a class:
import org.apache.pdfbox.filter.JBIG2Filter;
but I don't know how to use it.
I have been searching, but to no avail. Could anyone please suggest a fix?

Take a look at this PDFBox ticket: https://issues.apache.org/jira/browse/PDFBOX-1067. I think the answer to your question is:
Make sure that you have JAI and the JAI-ImageIO plugins installed for your version of Java. Decent installation instructions are available here: http://docs.geoserver.org/latest/en/user/production/java.html
Use the JBIG2-imageio plugin (newer versions are licensed under the Apache 2 license): https://github.com/levigo/jbig2-imageio/

I had the same problem and fixed it by adding this dependency to my pom.xml:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>jbig2-imageio</artifactId>
<version>3.0.2</version>
</dependency>
Good luck.

I had the exact same problem.
I downloaded the jar from jbig2-imageio and just included it in my project's application libraries, and it worked right out of the box. As adam said, it uses GPL3.

Installing JAI does not seem to be needed.
I only needed to download levigo-jbig2-imageio-1.6.5.jar, place it in the folder with my dependency jars, and add it to the Java build path libraries in Eclipse.
https://github.com/levigo/jbig2-imageio/
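To confirm that the plugin jar is actually being picked up, you can ask ImageIO which readers it has registered. This is only a minimal sketch: the class name Jbig2PluginCheck and the helper readersFor are made up for illustration, and it assumes the plugin registers itself under the format name "JBIG2".

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class, not part of PDFBox or the JBIG2 plugin.
public class Jbig2PluginCheck {

    /** Returns the class names of all ImageIO readers registered for the given format name. */
    static List<String> readersFor(String formatName) {
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Re-scan the classpath so that plugin jars added to it are registered.
        ImageIO.scanForPlugins();
        // Assumption: the JBIG2 plugin registers under the format name "JBIG2".
        List<String> jbig2 = readersFor("JBIG2");
        if (jbig2.isEmpty()) {
            System.out.println("No JBIG2 reader registered; the plugin jar is probably not on the classpath.");
        } else {
            System.out.println("JBIG2 readers: " + jbig2);
        }
    }
}
```

If the list comes back empty when run with the same classpath as your application, the blank pages are most likely caused by the missing jar rather than by PDFBox itself.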

import java.awt.image.BufferedImage
import org.apache.pdfbox.cos.COSName
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.pdmodel.PDPageTree
import org.apache.pdfbox.pdmodel.PDResources
import org.apache.pdfbox.pdmodel.graphics.PDXObject
import org.apache.pdfbox.rendering.ImageType
import org.apache.pdfbox.rendering.PDFRenderer
import org.apache.pdfbox.tools.imageio.ImageIOUtil
import javax.imageio.ImageIO
import javax.imageio.spi.IIORegistry
import javax.imageio.spi.ImageReaderSpi
import javax.swing.*
import javax.swing.filechooser.FileNameExtensionFilter
public class savePDFAsImage {
    // output folder for the generated images (static so the static methods below can use it)
    static String path = "c:/pdfImage/"
    // allow pdf file selection for extracting
    public static File selectPDF() {
        File file = null
        JFileChooser chooser = new JFileChooser()
        FileNameExtensionFilter filter = new FileNameExtensionFilter("PDF", "pdf")
        chooser.setFileFilter(filter)
        chooser.setMultiSelectionEnabled(false)
        int returnVal = chooser.showOpenDialog(null)
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            file = chooser.getSelectedFile()
            println "Please wait..."
        }
        return file
    }
    public static void main(String[] args) {
        try {
            // list the registered ImageIO reader plugins; the JBIG2 and JAI plugins should show up here once added
            ImageIO.scanForPlugins()
            IIORegistry reg = IIORegistry.getDefaultInstance()
            Iterator spIt = reg.getServiceProviders(ImageReaderSpi.class, false)
            spIt.each() {
                println it.getProperties()
            }
            testPDFBoxSaveAsImage()
            testPDFBoxExtractImagesX()
        } catch (Exception e) {
            e.printStackTrace()
        }
    }
    // extract every embedded image XObject of every page as a PNG file
    public static void testPDFBoxExtractImagesX() throws Exception {
        PDDocument document = PDDocument.load(selectPDF())
        PDPageTree list = document.getPages()
        for (PDPage page : list) {
            PDResources pdResources = page.getResources()
            for (COSName c : pdResources.getXObjectNames()) {
                PDXObject o = pdResources.getXObject(c)
                if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                    File file = new File(path + System.nanoTime() + ".png")
                    ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file)
                }
            }
        }
        document.close()
        println "Extraction complete"
    }
    // render every page at 300 DPI and save it as a PNG file
    public static void testPDFBoxSaveAsImage() throws Exception {
        PDDocument document = PDDocument.load(selectPDF().getBytes())
        PDFRenderer pdfRenderer = new PDFRenderer(document)
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY)
            // the format comes from the "png" argument, since we write to a stream
            OutputStream fileOutputStream = new FileOutputStream(path + System.nanoTime() + ".png")
            boolean b = ImageIOUtil.writeImage(bim, "png", fileOutputStream, 300)
            fileOutputStream.close()
        }
        document.close()
        println "Extraction complete"
    }
}

Related

itext 7: converting HTML to PDF fails when using landscape mode in some cases (test repo link included)

I created a small GitHub repo that demonstrates the problem: https://github.com/abrighton/itext-bug.
The repo contains a generated HTML file (TEST.html) that causes iText 7 to throw an exception when converting to PDF in landscape mode:
Exception in thread "main" java.lang.UnsupportedOperationException
at com.itextpdf.layout.renderer.AreaBreakRenderer.draw(AreaBreakRenderer.java:83)
at com.itextpdf.layout.renderer.AbstractRenderer.drawChildren(AbstractRenderer.java:855)
at com.itextpdf.layout.renderer.BlockRenderer.draw(BlockRenderer.java:580)
at com.itextpdf.layout.renderer.AbstractRenderer.drawChildren(AbstractRenderer.java:855)
at com.itextpdf.layout.renderer.BlockRenderer.draw(BlockRenderer.java:580)
at com.itextpdf.layout.renderer.DocumentRenderer.flushSingleRenderer(DocumentRenderer.java:147)
at com.itextpdf.layout.renderer.RootRenderer.processRenderer(RootRenderer.java:380)
at com.itextpdf.layout.renderer.RootRenderer.shrinkCurrentAreaAndProcessRenderer(RootRenderer.java:369)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.shrinkCurrentAreaAndProcessRenderer(HtmlDocumentRenderer.java:347)
at com.itextpdf.layout.renderer.RootRenderer.addChild(RootRenderer.java:264)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.processWaitingElement(HtmlDocumentRenderer.java:234)
at com.itextpdf.html2pdf.attach.impl.layout.HtmlDocumentRenderer.close(HtmlDocumentRenderer.java:194)
at com.itextpdf.layout.Document.close(Document.java:135)
at com.itextpdf.html2pdf.HtmlConverter.convertToPdf(HtmlConverter.java:261)
at com.itextpdf.html2pdf.HtmlConverter.convertToPdf(HtmlConverter.java:221)
at ItextBug$.saveAsPdf(ItextBug.scala:15)
at ItextBug$.delayedEndpoint$ItextBug$1(ItextBug.scala:23)
at ItextBug$delayedInit$body.apply(ItextBug.scala:9)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:73)
at scala.App.$anonfun$main$1$adapted(App.scala:73)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:921)
at scala.App.main(App.scala:73)
at scala.App.main$(App.scala:71)
at ItextBug$.main(ItextBug.scala:9)
at ItextBug.main(ItextBug.scala)
Here is the code:
import java.io.{ByteArrayInputStream, FileOutputStream, OutputStream}
import java.nio.file.{Files, Paths}
import com.itextpdf.html2pdf.HtmlConverter
import com.itextpdf.kernel.geom.PageSize
import com.itextpdf.kernel.pdf.{PdfDocument, PdfWriter}
// Run this from the directory containing TEST.html
object ItextBug extends App {
def saveAsPdf(out: OutputStream, html: String, orientation: String): Unit = {
val pageSize = if (orientation == "landscape") PageSize.LETTER.rotate() else PageSize.LETTER
val writer: PdfWriter = new PdfWriter(out)
val document: PdfDocument = new PdfDocument(writer)
document.setDefaultPageSize(pageSize)
HtmlConverter.convertToPdf(new ByteArrayInputStream(html.getBytes()), document)
out.close()
}
val html = new String(Files.readAllBytes(Paths.get("TEST.html")))
val out = new FileOutputStream("TEST.pdf")
// This version crashes
saveAsPdf(out, html, "landscape")
// This version works
// saveAsPdf(out, html, "portrait")
}
Is there anything wrong with this code?
I have only seen this happen with certain input HTML files. There could be something odd in them; however, the HTML displays fine in the browser. Browsers don't throw exceptions for bad HTML, and the HTML to PDF converter probably should not either, assuming that is the problem.
(Uses Scala-2.13.1, Java-11)

How to retrieve the name of the file being rendered by "mvn site" in site.vm?

I'm creating a Maven Skin (see https://maven.apache.org/doxia/doxia-sitetools/doxia-site-renderer/).
My site.vm needs to highlight links in a nav bar if the link is the current file being rendered.
I therefore need to know the name of the HTML file that site.vm is rendering.
The $currentFileName and $alignedFileName macros work just fine for regular documents (from Markdown source, for example). But for multi-page documents, such as those a Maven Report plugin generates, these macros keep returning the name of the main page of the report rather than the page being rendered.
How can I retrieve the actual name of the file being rendered, and not just the name of the main HTML page of a Maven Report plugin?
I tried the following macros with no luck:
$alignedFileName
$currentFileName
$docRenderingContext.getInputName()
$docRenderingContext.getOutputName()
They all return the same value, which is the HTML filename of the main page of the Maven Report plugin.
KmReference.java (Maven Report plugin, creating several pages with getSinkFactory().createSink(outputDirectory, pageFilename)):
public class KmReference extends AbstractMavenReport {
public String getOutputName() {
return "km-reference";
}
…
@Override
protected void executeReport(Locale locale) throws MavenReportException {
…
// Create a new sink!
Sink kmSink;
try {
kmSink = getSinkFactory().createSink(outputDirectory, pageFilename);
} catch (IOException e) {
throw new MavenReportException("Could not create sink for " + pageFilename + " in " + outputDirectory.getAbsolutePath(), e);
}
site.vm (Velocity):
alignedFileName = $alignedFileName
currentFileName = $currentFileName
getDoxiaSourcePath() = $docRenderingContext.getDoxiaSourcePath()
getGenerator() = $docRenderingContext.getGenerator()
getInputName() = $docRenderingContext.getInputName()
getOutputName() = $docRenderingContext.getOutputName()
getParserId() = $docRenderingContext.getParserId()
getRelativePath() = $docRenderingContext.getRelativePath()
In all HTML files generated by my Maven Report plugin, I will get the exact same values:
another-page.html (not km-reference.html):
alignedFileName = km-reference.html
currentFileName = km-reference.html
getDoxiaSourcePath() = $docRenderingContext.getDoxiaSourcePath()
getGenerator() = com.sentrysoftware.maven:patrolreport-maven-plugin:2.0:km-reference
getInputName() = km-reference.html
getOutputName() = km-reference.html
getParserId() = $docRenderingContext.getParserId()
getRelativePath() = .
I would expect at least $alignedFileName to return the value another-page.html.
This issue (https://issues.apache.org/jira/browse/MSITE-842) has been solved with a PR provided by the OP.
This was a bug in the maven-site-plugin Maven plugin (the one that runs when you execute "mvn site").
It has been fixed in version 3.8 of the plugin (which hasn't been released yet, at the time of writing).

java.lang.ClassNotFoundException:net.ucanaccess.jdbc.ucanaccessDriver

I'm a beginner with Java and use the console to compile and run my programs. I'm trying to read data from an MS Access .accdb file with the UCanAccess driver. I have added the 5 UCanAccess files to C:\Program Files\Java\jdk1.8.0_60\jre\lib\ext, but I still get the exception java.lang.ClassNotFoundException:net.ucanaccess.jdbc.ucanaccessDriver.
Here is my code.
import java.sql.*;
public class jdbcTest
{
public static void main(String[] args)
{
try
{
Class.forName("net.ucanaccess.jdbc.UcanaccessDriver");
String url = "jdbc:ucanaccess://C:javawork/PersonInfoDB/PersonInfo.accdb";
Connection conctn = DriverManager.getConnection(url);
Statement statmnt = conctn.createStatement();
String sql = "SELECT * FROM person";
ResultSet rsltSet = statmnt.executeQuery(sql);
while(rsltSet.next())
{
String name = rsltSet.getString("name-");
String address = rsltSet.getString("address");
String phoneNum = rsltSet.getString("phoneNumber");
System.out.println(name + " " + address + " " + phoneNum);
}
conctn.close();
}
catch(Exception sqlExcptn)
{
System.out.println(sqlExcptn);
}
}
}
Please add the JDBC driver jar to the lib folder.
Download URL download jar
I tried the method described by Gord in his post "Manipulating an Access database from Java without ODBC" and used Eclipse instead of compiling and running from the command line. To learn the Eclipse basics, I also watched the video tutorial https://www.youtube.com/watch?v=mMu-JlBrYXo.
Finally, I was able to read my MS Access database file from my Java code.
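If you want to rule out a classpath problem before touching any JDBC code, a small check like the one below can help. It is only a sketch: DriverCheck and isOnClasspath are hypothetical names. Note that class names are case sensitive, so net.ucanaccess.jdbc.UcanaccessDriver (as used in the question's code) and net.ucanaccess.jdbc.ucanaccessDriver (as shown in the exception message) are different names.

```java
// Hypothetical helper, not part of UCanAccess.
public class DriverCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class (or one of the classes it depends on) could not be loaded.
            return false;
        }
    }

    public static void main(String[] args) {
        // Driver class name as written in the question's code.
        String driver = "net.ucanaccess.jdbc.UcanaccessDriver";
        if (isOnClasspath(driver)) {
            System.out.println("Driver found: " + driver);
        } else {
            // Printing the effective classpath usually shows which jars are missing.
            System.out.println("Driver missing; classpath is: " + System.getProperty("java.class.path"));
        }
    }
}
```

Run it with exactly the same classpath you use for your program; if it reports the driver as missing, the jars are not being seen by the JVM.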

How to use imgscalr using Grails

I've just begun using Groovy and Grails in the last few days. I don't have any prior experience with Java, so you'll have to excuse this (probably) very basic question. I've searched Google and Stack Overflow and haven't found anything that helps me with the actual installation.
I have got an image upload working, and I am storing the file on the server. I used an IBM Grails tutorial to guide me through it. That works fine.
I would also like to resize the file into large, medium, and small formats. I wanted to use imgscalr for this, but I can't get it to work. I have downloaded version 4.2, which contains various .jar files. Do I need to put these somewhere on the server and reference them? The only thing I've done is add these lines to BuildConfig.groovy
dependencies {
// specify dependencies here under either 'build', 'compile', 'runtime', 'test' or 'provided' scopes eg.
// runtime 'mysql:mysql-connector-java:5.1.20'
compile 'org.imgscalr:imgscalr-lib:4.2'
}
and import org.imgscalr.Scalr.* in my PhotoController.groovy
Here's my code for saving the file onto the server; I would also like to resize and save the image.
def save() {
def photoInstance = new Photo(params)
// Handle uploaded file
def uploadedFile = request.getFile('photoFile')
if(!uploadedFile.empty) {
println "Class: ${uploadedFile.class}"
println "Name: ${uploadedFile.name}"
println "OriginalFileName: ${uploadedFile.originalFilename}"
println "Size: ${uploadedFile.size}"
println "ContentType: ${uploadedFile.contentType}"
def webRootDir = servletContext.getRealPath("/")
def originalPhotoDir = new File(webRootDir, "/images/photographs/original")
originalPhotoDir.mkdirs()
uploadedFile.transferTo(new File(originalPhotoDir, uploadedFile.originalFilename))
BufferedImage largeImg = Scalr.resize(uploadedFile, 1366);
def largePhotoDir = new File(webRootDir, "/images/photographs/large")
largePhotoDir.mkdirs()
photoInstance.photoFile = uploadedFile.originalFilename
}
if (!photoInstance.hasErrors() && photoInstance.save()) {
flash.message = "Photo ${photoInstance.id} created"
redirect(action:"list")
}
else {
render(view:"create", model:[photoInstance: photoInstance])
}
}
The error I'm getting is No such property: Scalr for class: garethlewisweb.PhotoController
I'm obviously doing something very wrong. Any guidance appreciated.
This is the first Google result for "How to use imgscalr in grails", and I was surprised by the lack of information and examples when googling it. Although the first answer is close, there are still a few mistakes to be corrected.
For anyone who ended up here through Google like me, here's a more detailed example of how to use this nice library correctly:
First, declare the dependency in your BuildConfig.groovy file:
dependencies {
// specify dependencies here under either 'build', 'compile', 'runtime', 'test' or 'provided' scopes eg.
// runtime 'mysql:mysql-connector-java:5.1.20'
compile 'org.imgscalr:imgscalr-lib:4.2'
}
Then, once the dependency has been resolved, paste this piece of code into your controller, in the action that receives the multipart form with the uploaded image.
def create() {
def userInstance = new User(params)
//saving image
def imgFile = request.getFile('myFile')
def webRootDir = servletContext.getRealPath("/")
userInstance.storeImageInFileSystem(imgFile, webRootDir)
(...)
}
Inside my domain class, I implemented this storeImageInFileSystem method, which resizes the image and stores it in the filesystem. But first, add these imports to the file:
import org.imgscalr.Scalr
import java.awt.image.BufferedImage
import javax.imageio.ImageIO
And then, implement the method:
def storeImageInFileSystem(imgFile, webRootDir){
if (!imgFile.empty)
{
def defaultPath = "/images/userImages"
def systemDir = new File(webRootDir, defaultPath)
if (!systemDir.exists()) {
systemDir.mkdirs()
}
def imgFileDir = new File( systemDir, imgFile.originalFilename)
imgFile.transferTo( imgFileDir )
def imageIn = ImageIO.read(imgFileDir);
BufferedImage scaledImage = Scalr.resize(imageIn, 200); //200 is the size of the image
ImageIO.write(scaledImage, "jpg", new File( systemDir, imgFile.originalFilename )); //write image in filesystem
(...)
}
}
This worked well for me. Change any details as needed, such as the system directory or the size of the image.
Instead of
import org.imgscalr.Scalr.*
You want
import org.imgscalr.Scalr
import java.awt.image.BufferedImage
import javax.imageio.ImageIO
Then resize needs a BufferedImage (looking at the JavaDocs), so try:
def originalPhotoDir = new File(webRootDir, "/images/photographs/original")
originalPhotoDir.mkdirs()
def originalPhotoFile = new File(originalPhotoDir, uploadedFile.originalFilename)
uploadedFile.transferTo( originalPhotoFile )
// Load the image
BufferedImage originalImage = ImageIO.read( originalPhotoFile )
// Scale it
BufferedImage largeImg = Scalr.resize(originalImage, 1366);
// Make the destination folder
def largePhotoDir = new File(webRootDir, "/images/photographs/large" )
largePhotoDir.mkdirs()
// Write the large image out
ImageIO.write( largeImg, 'png', new File( largePhotoDir, uploadedFile.originalFilename ) )
Of course, you'll have to watch for files overwriting already existing images
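One simple way to avoid overwriting is to pick a file name that is not taken yet before writing. The sketch below is plain Java (callable from Groovy); the class UniqueFileNames and the method nonClashing are hypothetical names, and the "-1", "-2" suffix scheme is just one possible convention.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not part of imgscalr or Grails.
public class UniqueFileNames {

    /** Returns a File in dir that does not exist yet, appending -1, -2, ... before the extension if needed. */
    static File nonClashing(File dir, String originalName) {
        int dot = originalName.lastIndexOf('.');
        String base = dot > 0 ? originalName.substring(0, dot) : originalName;
        String ext = dot > 0 ? originalName.substring(dot) : "";
        File candidate = new File(dir, originalName);
        for (int i = 1; candidate.exists(); i++) {
            candidate = new File(dir, base + "-" + i + ext);
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("photos").toFile();
        File first = nonClashing(dir, "cat.png");
        first.createNewFile();
        File second = nonClashing(dir, "cat.png");
        // prints "cat.png then cat-1.png"
        System.out.println(first.getName() + " then " + second.getName());
    }
}
```

You would then pass the returned File to ImageIO.write instead of building the path from originalFilename directly. Keep in mind that the check and the write are not atomic, so two simultaneous uploads with the same name could still collide.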
Put the jar file(s) in the 'lib' directory of your Grails application. You may then delete that line from BuildConfig.groovy

Can Selenium verify text inside a PDF loaded by the browser?

My web application loads a pdf in the browser. I have figured out how to check that the pdf has loaded correctly using:
verifyAttribute
xpath=//embed/@src
{URL of PDF goes here}
It would be really nice to be able to check the contents of the pdf with Selenium - for example verify that some text is present. Is there any way to do this?
While this is not natively supported, I have found a couple of ways to do it using the Java driver. One way is to have the PDF open in your browser (with Adobe Acrobat installed), use keyboard shortcuts to select all text (CTRL+A) and copy it to the clipboard (CTRL+C), and then verify the text in the clipboard. For example:
protected String getLastWindow() {
return session().getEval("var windowId; for(var x in selenium.browserbot.openedWindows ){windowId=x;} ");
}
@Test
public void testTextInPDF() throws InterruptedException {
session().click("link=View PDF");
String popupName = getLastWindow();
session().waitForPopUp(popupName, PAGE_LOAD_TIMEOUT);
session().selectWindow(popupName);
session().windowMaximize();
session().windowFocus();
Thread.sleep(3000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("65"); // Stands for A "ascii code for A"
session().keyUpNative("17"); //Releases CTRL key
Thread.sleep(1000);
session().keyDownNative("17"); // Stands for CTRL key
session().keyPressNative("67"); // Stands for C "ascii code for C"
session().keyUpNative("17"); //Releases CTRL key
TextTransfer textTransfer = new TextTransfer();
assertTrue(textTransfer.getClipboardContents().contains("Some text in my pdf"));
}
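The TextTransfer class used in the test above is not shown in the answer. A minimal sketch of what such a helper might look like, using only the standard AWT clipboard API, is given below; the method textOf is a hypothetical addition that makes the text extraction usable without a real clipboard.

```java
import java.awt.Toolkit;
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.StringSelection;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.IOException;

// Assumed shape of the helper; the original TextTransfer class is not part of the answer.
public class TextTransfer {

    /** Extracts plain text from a Transferable, or returns "" if it holds no text. */
    static String textOf(Transferable contents) {
        if (contents == null || !contents.isDataFlavorSupported(DataFlavor.stringFlavor)) {
            return "";
        }
        try {
            return (String) contents.getTransferData(DataFlavor.stringFlavor);
        } catch (UnsupportedFlavorException | IOException e) {
            return "";
        }
    }

    /** Reads the current system clipboard as text; needs a graphical (non-headless) session. */
    public String getClipboardContents() {
        Transferable contents = Toolkit.getDefaultToolkit().getSystemClipboard().getContents(null);
        return textOf(contents);
    }

    public static void main(String[] args) {
        // StringSelection stands in for whatever CTRL+C placed on the clipboard.
        System.out.println(textOf(new StringSelection("Some text in my pdf")));
    }
}
```

Because getClipboardContents talks to the real system clipboard, the test has to run on a machine with a desktop session, and nothing else should copy to the clipboard while it runs.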
Another way, still in Java, is to download the PDF and then convert it to text with PDFBox; see http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html for an example of how to do this.
You cannot do this natively with WebDriver. However, the PDFBox API can be used to read the content of the PDF file. First, you have to shift focus to the browser window in which the PDF file is open. You can then parse the content of the PDF file and search for the desired text string.
Here is some code that uses the PDFBox API to extract the text of a PDF document.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
public class pdfToTextConverter {
public static void pdfToText(String path_to_PDF_file, String Path_to_output_text_file) throws FileNotFoundException, IOException{
//Parse text from a PDF into a string variable
File f = new File(path_to_PDF_file);
PDFParser parser = new PDFParser(new FileInputStream(f));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDDocument pdDoc = new PDDocument(cosDoc);
PDFTextStripper pdfStripper = new PDFTextStripper();
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
//Write parsed text into a file
PrintWriter pw = new PrintWriter(Path_to_output_text_file);
pw.print(parsedText);
pw.close();
}
}
JAR Source
http://sourceforge.net/projects/pdfbox/files/latest/download?source=files
Unfortunately, you cannot do this at all with Selenium.
There is a way.
Before you click the link you can obtain the href value
element.GetAttribute("href")
Then after the PDF loads you can get the Url
driver.Url
Then you can just check to see if the url contains the href.
It's not the best, but it's better than nothing.
