I use different tools, such as Processing, to create vector plots. These plots are written as single- or multi-page PDFs. I would like to include these plots in a single report-like PDF using PDFBox.
My current workflow includes these PDFs as images, using the following pseudocode:
PDDocument inFile = PDDocument.load(file);
PDPage firstPage = (PDPage) inFile.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = firstPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
PDXObjectImage ximage = new PDPixelMap(document, image);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawXObject(ximage, 0, 0, ximage.getWidth(), ximage.getHeight());
contentStream.close();
While this works, it loses the benefits of the vector file format, especially file size versus print quality.
Is it possible to use PDFBox to include other PDF pages as embedded objects within a page (not added as a separate page)? Could I, for example, use a PDStream? I would prefer a solution similar to the way pdflatex embeds PDF figures into a new PDF document.
What other Java libraries can you recommend for that task?
Is it possible to use pdfbox to include other pdf pages as embedded objects within a page
It should be possible. The PDF format allows the use of so-called form XObjects to serve as such embedded objects. I don't see an explicit implementation for that in PDFBox, but the procedure is similar enough to what PageExtractor or PDFMergerUtility do.
A proof of concept derived from PageExtractor using the current SNAPSHOT of the PDFBox 2.0.0 development version:
PDDocument source = PDDocument.loadNonSeq(SOURCE, null);
List<PDPage> pages = source.getDocumentCatalog().getAllPages();
PDDocument target = new PDDocument();
PDPage page = new PDPage();
PDRectangle cropBox = page.findCropBox();
page.setResources(new PDResources());
target.addPage(page);
PDFormXObject xobject = importAsXObject(target, pages.get(0));
page.getResources().addXObject(xobject, "X");
PDPageContentStream content = new PDPageContentStream(target, page);
AffineTransform transform = new AffineTransform(0, 0.5, -0.5, 0, cropBox.getWidth(), 0);
content.drawXObject(xobject, transform);
transform = new AffineTransform(0.5, 0.5, -0.5, 0.5, 0.5 * cropBox.getWidth(), 0.2 * cropBox.getHeight());
content.drawXObject(xobject, transform);
content.close();
target.save(TARGET);
target.close();
source.close();
This code imports the first page of a source document into a target document as an XObject and puts it onto a page there twice, with different scaling and rotation transformations, e.g. for this source [image: source page] it creates this [image: resulting page].
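To make the six numbers passed to the first AffineTransform above less opaque, here is a minimal sketch using only java.awt.geom (no PDFBox involved). It assumes a US Letter crop box of 612 x 792 points, which is only an assumption about the page in question, and the class and method names are made up for illustration. It suggests that the first transform is a half-size scaling, followed by a quarter turn counter-clockwise, followed by a shift to the right edge of the page.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class PlacementDemo {
    // The same six numbers as in the first drawXObject call above,
    // for a target page whose crop box has the given width.
    static AffineTransform quarterTurnHalfSize(double pageWidth) {
        return new AffineTransform(0, 0.5, -0.5, 0, pageWidth, 0);
    }

    // The same mapping built step by step. Each call is concatenated on the
    // right, so points are scaled first, then rotated, then translated.
    static AffineTransform composed(double pageWidth) {
        AffineTransform t = AffineTransform.getTranslateInstance(pageWidth, 0);
        t.quadrantRotate(1);   // quarter turn, positive x axis towards positive y axis
        t.scale(0.5, 0.5);     // half size
        return t;
    }

    // Applies a transform to a single point.
    static Point2D map(AffineTransform t, double x, double y) {
        return t.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        double w = 612, h = 792; // assumption: US Letter crop box
        AffineTransform t = quarterTurnHalfSize(w);
        // Where the four corners of the source page end up on the target page.
        System.out.println(map(t, 0, 0));
        System.out.println(map(t, w, 0));
        System.out.println(map(t, 0, h));
        System.out.println(map(t, w, h));
    }
}
```

Under that page-size assumption the top right corner (612, 792) of the source page ends up at (216, 306), so the rotated copy occupies the lower right part of the target page.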
The helper method importAsXObject actually doing the import is defined like this:
PDFormXObject importAsXObject(PDDocument target, PDPage page) throws IOException
{
    final PDStream src = page.getContents();
    if (src != null)
    {
        final PDFormXObject xobject = new PDFormXObject(target);
        OutputStream os = xobject.getPDStream().createOutputStream();
        InputStream is = src.createInputStream();
        try
        {
            IOUtils.copy(is, os);
        }
        finally
        {
            IOUtils.closeQuietly(is);
            IOUtils.closeQuietly(os);
        }
        xobject.setResources(page.findResources());
        xobject.setBBox(page.findCropBox());
        return xobject;
    }
    return null;
}
As mentioned above, this is only a proof of concept; corner cases have not yet been taken into account.
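One detail of the helper that could be tightened is the stream copying. The following is a minimal sketch, using plain java.io only, of the same copy-then-close pattern written with try-with-resources and InputStream.transferTo (Java 9 or later). The class and method names are hypothetical, and the byte arrays merely stand in for the PDF content streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamCopyDemo {
    // Copies everything from in to out and closes both streams,
    // whether or not the copy succeeds. Returns the number of bytes copied.
    static long copyAndClose(InputStream in, OutputStream out) throws IOException {
        try (InputStream is = in; OutputStream os = out) {
            return is.transferTo(os); // requires Java 9+
        }
    }

    // Demonstration helper: pushes a byte array through copyAndClose.
    static byte[] roundTrip(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            copyAndClose(new ByteArrayInputStream(data), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "q 1 0 0 1 0 0 cm Q".getBytes(StandardCharsets.US_ASCII);
        System.out.println(roundTrip(content).length + " bytes copied");
    }
}
```

Unlike closeQuietly, this variant does not swallow an exception thrown while closing the output stream, which may matter because a failed close can mean the written content is incomplete.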
To update this question:
There is already a helper class in org.apache.pdfbox.multipdf.LayerUtility to do the import.
Example to show superimposing a PDF page onto another PDF: SuperimposePage.
This class is part of the Apache PDFBox Examples, and sample transformations as shown by @mkl were added to it.
As mkl appropriately suggested, PDFClown is among the Java libraries which provide explicit support for page embedding (so-called Form XObjects (see PDF Reference 1.7, § 4.9)).
In order to let you get a taste of the way PDFClown works, the following code represents the equivalent of mkl's PDFBox solution (NOTE: as mkl later stated, his code sample was by no means optimised, so this comparison may not correspond to the actual status of PDFBox -- comments are welcome to clarify this):
Document source = new File(SOURCE).getDocument();
Pages sourcePages = source.getPages();
Document target = new File().getDocument();
Page targetPage = new Page(target);
target.getPages().add(targetPage);
XObject xobject = sourcePages.get(0).toXObject(target);
PrimitiveComposer composer = new PrimitiveComposer(targetPage);
Dimension2D targetSize = targetPage.getSize();
Dimension2D sourceSize = xobject.getSize();
composer.showXObject(xobject, new Point2D.Double(targetSize.getWidth() * .5, targetSize.getHeight() * .35), new Dimension(sourceSize.getWidth() * .6, sourceSize.getHeight() * .6), XAlignmentEnum.Center, YAlignmentEnum.Middle, 45);
composer.showXObject(xobject, new Point2D.Double(targetSize.getWidth() * .35, targetSize.getHeight()), new Dimension(sourceSize.getWidth() * .4, sourceSize.getHeight() * .4), XAlignmentEnum.Left, YAlignmentEnum.Top, 90);
composer.flush();
target.getFile().save(TARGET, SerializationModeEnum.Standard);
source.getFile().close();
Comparing this code to PDFBox's equivalent you can notice some relevant differences which show PDFClown's neater style (it would be nice if some PDFBox expert could validate my assertions):
Page-to-FormXObject conversion: PDFClown natively supports a dedicated method (Page.toXObject()), so there's no need for additional heavy-lifting such as the helper method importAsXObject();
Resource management: PDFClown automatically (and transparently) allocates page resources, so there's no need for explicit calls such as page.getResources().addXObject(xobject, "X");
XObject drawing: PDFClown supports both high-level (explicit scale, translation and rotation anchors) and low-level (affine transformations) methods to place your FormXObject into the page, so there's no need to necessarily deal with affine transformations.
The whole point is that PDFClown features a rich architecture made up of multiple abstraction layers: according to your requirements, you can choose the most appropriate coding style (either to delve into PDF's low-level basic structures or to leverage its convenient and elegant high-level model). PDFClown lets you tweak every single byte and solve complex tasks with a ridiculously simple method call, at your will.
DISCLOSURE: I'm the lead developer of PDFClown.
I have several PDF documents that supposedly contain scanned images, but upon inspection in Acrobat Pro, each page contains a huge number of tiny "inline images". From what I understand these are not regular images inside XObjects, but rather images embedded directly inside content streams.
How could I go about extracting and merging these images?
The only code I could find online starts out like this:
var reader = new PdfReader(@"path\to\file.pdf");
PdfDocument document = new PdfDocument(reader);
for (var i = 1; i <= document.GetNumberOfPages(); i++)
{
PdfDictionary obj = (PdfDictionary)document.GetPdfObject(i);
// ... more code goes here
}
...but the rest of the code doesn't work because the PdfDictionary returned from GetPdfObject is not a stream, only a dictionary. I don't know how to access the images inside it.
I need to copy annotations from one PDF file to another. I have used the excellent PDFClown library but am unable to manipulate things like color, rotation, etc. Is this possible? I can see the base object information but am unsure how to manipulate it directly.
I can copy the appearance via cloning appearance but can't "edit" it.
Thanks in advance.
Alex
P.S. If Stephano, the author, is listening: is the project dead?
On annotations in general and Callout annotations in particular
I looked into it a bit, and I'm afraid there is not much you can deterministically manipulate for arbitrary inputs using high-level methods. The reason is that there are numerous alternative ways to set the appearance of a Callout annotation, and PDF Clown only supports the less prioritized ways with explicit high-level methods. From high priority downwards:
An explicit appearance in an AP stream. If it is given, it is used, ignoring whether this appearance looks like a Callout annotation at all, let alone like one defined by the other Callout properties.
PDF Clown does not create an appearance for callout annotations from the other values yet, let alone update existing appearances to follow up to some specific attribute (e.g. Color) change. For ISO 32000-2 support, PDF Clown here will have to improve as appearance streams have become mandatory.
If it exists, you can retrieve the appearance using getAppearance() but you only get a FormXObject with its low level drawing instructions, nothing Callout specific.
One thing you can manipulate quite easily given a FormXObject, though: you can rotate or skew the appearance by setting its Matrix accordingly, e.g.
annotation.getAppearance().getNormal().get(null).setMatrix(AffineTransform.getRotateInstance(100, 10));
A rich text string in the RC string or stream. Unless an appearance is given, the text in the Callout text box is generated from this rich text datum (rich text here uses a XHTML 1.0 subset for formatting).
PDF Clown does not create a rich text representation of the Callout text yet, let alone update existing ones to follow up to some specific attribute (e.g. Color) change.
If it exists, you can retrieve the rich text by low level access using getBaseDataObject().get(PdfName.RC), change this string or stream, and set it again using getBaseDataObject().put(PdfName.RC, ...). Similarly you can retrieve, manipulate, and set the rich text default style string using its name PdfName.DS instead.
A number of separate settings for individual aspects, from which the Callout is built in the absence of an appearance stream and (as far as the text content is concerned) a rich text string.
PDF Clown supports (many of) these attributes, in particular if you cast the cloned annotation to StaticNote, e.g. the opacity CA using get/set/withAlpha, the border Border / BS using get/set/withBorder, the background color C using get/set/withColor, ...
By the way, it has an error in its support for the line ending style LE: apparently the code for the Line annotation LE property was copied without checking; unfortunately that attribute follows a different syntax there...
Your tasks
Concerning the attributes you stated you want to change, therefore,
Rotation: There is no rotation attribute in the Callout annotation per se (other than the flag whether or not to follow the page rotation). Thus, you cannot set a rotation as a simple annotation attribute. If the source annotation does have an appearance stream, though, you can manipulate its Matrix to rotate it inside the annotation rectangle, see above.
Border color and font: If your Callout has an appearance stream, you can try and parse its content using a ContentScanner and manipulate color and font setting operations. Otherwise, if rich text information is set, for the font you can try and parse the rich text using some XML parser and manipulate font style attributes. Otherwise, you can parse the default appearance DA string and manipulate its font and color setting instructions.
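A note on the rotation call used above, AffineTransform.getRotateInstance(100, 10): the two-argument form of that JDK method takes a rotation vector, not an angle and a pivot, so the numbers do not mean "100 degrees". The following small sketch uses only java.awt.geom, with made-up class and method names, to compare it against the one-argument angle form.

```java
import java.awt.geom.AffineTransform;

class RotationVectorDemo {
    // Angle in degrees implied by a rotation vector (vecx, vecy).
    static double angleDegrees(double vecx, double vecy) {
        return Math.toDegrees(Math.atan2(vecy, vecx));
    }

    // Largest absolute difference between the matrix entries of two transforms.
    static double maxDifference(AffineTransform a, AffineTransform b) {
        double[] m = new double[6];
        double[] n = new double[6];
        a.getMatrix(m);
        b.getMatrix(n);
        double max = 0;
        for (int i = 0; i < 6; i++) {
            max = Math.max(max, Math.abs(m[i] - n[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Rotation given as a vector: the x axis is turned to point along (100, 10).
        AffineTransform byVector = AffineTransform.getRotateInstance(100, 10);
        // The same rotation given as an angle in radians.
        AffineTransform byAngle = AffineTransform.getRotateInstance(Math.atan2(10, 100));
        System.out.println("angle in degrees: " + angleDegrees(100, 10));
        System.out.println("max matrix difference: " + maxDifference(byVector, byAngle));
    }
}
```

So the example rotates the appearance by roughly 5.7 degrees around the origin of the appearance's coordinate system. For a larger angle, AffineTransform.getRotateInstance(Math.toRadians(angle)) is probably the more readable choice.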
Some example code
I created a file with an example Callout annotation using Adobe Acrobat: Callout-Yellow.pdf. It contains an appearance stream, rich text, and simple attributes, so one can use this file for example manipulations at different levels.
Then I applied this code to it with different values for keepAppearanceStream and keepRichText (you didn't mention whether you used PDF Clown for Java or .Net; so I chose Java; a port to .Net should be trivial, though...):
boolean keepAppearanceStream = ...;
boolean keepRichText = ...;
try ( InputStream sourceResource = GET_STREAM_FOR("Callout-Yellow.pdf");
InputStream targetResource = GET_STREAM_FOR("test123.pdf");
org.pdfclown.files.File sourceFile = new org.pdfclown.files.File(sourceResource);
org.pdfclown.files.File targetFile = new org.pdfclown.files.File(targetResource); ) {
Document sourceDoc = sourceFile.getDocument();
Page sourcePage = sourceDoc.getPages().get(0);
Annotation<?> sourceAnnotation = sourcePage.getAnnotations().get(0);
Document targetDoc = targetFile.getDocument();
Page targetPage = targetDoc.getPages().get(0);
StaticNote targetAnnotation = (StaticNote) sourceAnnotation.clone(targetDoc);
if (keepAppearanceStream) {
// changing properties of an appearance
// rotating the appearance in the appearance rectangle
targetAnnotation.getAppearance().getNormal().get(null).setMatrix(AffineTransform.getRotateInstance(100, 10));
} else {
// removing the appearance to allow lower level properties changes
targetAnnotation.setAppearance(null);
}
// changing text background color
targetAnnotation.setColor(new DeviceRGBColor(0, 0, 1));
if (keepRichText) {
// changing rich text properties
PdfString richText = (PdfString) targetAnnotation.getBaseDataObject().get(PdfName.RC);
String richTextString = richText.getStringValue();
// replacing the font family
richTextString = richTextString.replaceAll("font-family:Helvetica", "font-family:Courier");
richText = new PdfString(richTextString);
targetAnnotation.getBaseDataObject().put(PdfName.RC, richText);
} else {
targetAnnotation.getBaseDataObject().remove(PdfName.RC);
targetAnnotation.getBaseDataObject().remove(PdfName.DS);
}
// changing default appearance properties
PdfString defaultAppearance = (PdfString) targetAnnotation.getBaseDataObject().get(PdfName.DA);
String defaultAppearanceString = defaultAppearance.getStringValue();
// replacing the font
defaultAppearanceString = defaultAppearanceString.replaceFirst("Helv", "HeBo");
// replacing the text and line color
defaultAppearanceString = defaultAppearanceString.replaceFirst(". . . rg", ".5 g");
defaultAppearance = new PdfString(defaultAppearanceString);
targetAnnotation.getBaseDataObject().put(PdfName.DA, defaultAppearance);
// changing the text value
PdfString contents = (PdfString) targetAnnotation.getBaseDataObject().get(PdfName.Contents);
String contentsString = contents.getStringValue();
contentsString = contentsString.replaceFirst("text", "text line");
contents = new PdfString(contentsString);
targetAnnotation.getBaseDataObject().put(PdfName.Contents, contents);
// change the line width and style
targetAnnotation.setBorder(new Border(0, new LineDash(new double[] {3, 2})));
targetPage.getAnnotations().add(targetAnnotation);
targetFile.save(new File(RESULT_FOLDER, "test123-withCalloutCopy.pdf"), SerializationModeEnum.Standard);
}
(CopyCallOut test testCopyCallout)
Beware, the code only has proof-of-concept quality: For arbitrary PDFs you cannot simply expect a string replace of "font-family:Helvetica" by "font-family:Courier" or "Helv" by "HeBo" or ". . . rg" by ".5 g" to do the job: fonts can be given using different style attributes or names, and different coloring instructions may be used.
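To illustrate that caveat for the color replacement: the pattern ". . . rg" is interpreted as a regular expression in which each dot matches exactly one character, so it only catches single-character color components. The following sketch in plain Java (no PDF Clown involved) uses a somewhat more tolerant pattern. The class name, the method name, and the number pattern are my own assumptions, and this is still not a real content stream parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultAppearanceDemo {
    // A PDF number: optional sign, digits with optional fraction, or a fraction with leading dot.
    private static final String NUM = "[+-]?(?:\\d+\\.?\\d*|\\.\\d+)";

    // Three numbers followed by the "rg" operator (non-stroking RGB color).
    private static final Pattern RGB_FILL = Pattern.compile(
            "(?<![\\w.])" + NUM + "\\s+" + NUM + "\\s+" + NUM + "\\s+rg\\b");

    // Replaces the first "r g b rg" instruction in a DA string; returns the input unchanged if there is none.
    static String replaceRgbFill(String da, String replacement) {
        return RGB_FILL.matcher(da).replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Single-digit components: both approaches change the string.
        System.out.println("0 0 0 rg /Helv 12 Tf".replaceFirst(". . . rg", ".5 g"));
        System.out.println(replaceRgbFill("0 0 0 rg /Helv 12 Tf", ".5 g"));
        // Fractional components: the simple pattern finds nothing to replace.
        System.out.println("/Helv 12 Tf 0.2 0.4 1 rg".replaceFirst(". . . rg", ".5 g"));
        System.out.println(replaceRgbFill("/Helv 12 Tf 0.2 0.4 1 rg", ".5 g"));
    }
}
```

Even this variant ignores other color operators (g, k, and so on) and any comments or unusual whitespace in the DA string, so for arbitrary inputs a tokenizing parser would be the safer route.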
Screenshots in Adobe
The original file:
keepAppearanceStream = true:
keepAppearanceStream = false and keepRichText = true:
keepAppearanceStream = false and keepRichText = false:
As a follow-up comment for mkl:
Your great advice is really helpful when creating new annotations. I applied the following as a method of "copying" an existing annotation, where note is the "cloned" annotation and baseAnnotation the source:
foreach (PdfName t in baseAnnotation.BaseDataObject.Keys)
{
if (t.Equals(PdfName.DA) || t.Equals(PdfName.DS) || t.Equals(PdfName.RC) || t.Equals(PdfName.Rotate))
{
note.BaseDataObject[t] = baseAnnotation.BaseDataObject[t];
}
}
Thanks again
I'm trying to add a PNG image to an existing PDF, but the transparent areas are rendered black.
PdfReader reader = new PdfReader(pdfPath);
File f = new File(pdfPath);
String result = f.getParent() + File.separator + UUID.randomUUID().toString() + ".pdf";
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(result));
Image image = Image.getInstance(ImageIO.read(new File(imagePath)), null);
PdfImage stream = new PdfImage(image, null, null);
PdfIndirectObject ref = stamper.getWriter().addToBody(stream);
image.setDirectReference(ref.getIndirectReference());
image.setAbsolutePosition(30, 300);
PdfContentByte canvas = stamper.getOverContent(1);
canvas.addImage(image);
stamper.close();
reader.close();
How can I keep transparency?
First this: I am violating the policy at iText Software by answering this question. You are using an old version of iText, and the policy dictates that voluntary support on iText 5 or earlier has stopped. You should either use iText 7, or you should get a support contract if you still want support for an old iText version.
However, I am curious. I want to know where you found this clunky code (or why you decided to write this code):
Image image = Image.getInstance(ImageIO.read(new File(imagePath)), null);
PdfImage stream = new PdfImage(image, null, null);
PdfIndirectObject ref = stamper.getWriter().addToBody(stream);
image.setDirectReference(ref.getIndirectReference());
image.setAbsolutePosition(30, 300);
PdfContentByte canvas = stamper.getOverContent(1);
canvas.addImage(image);
You don't need ImageIO and you don't need to create a PdfImage, nor do you need to add that image to the body of a PDF file. The code you are using is code specialists would use for a very particular purpose. If you know that particular purpose, please explain.
If adding an image at an absolute position is all you want to do (that's a general purpose, not a particular purpose), your code should be as simple as this:
Image image = Image.getInstance(imagePath);
image.setAbsolutePosition(30, 300);
PdfContentByte canvas = stamper.getOverContent(1);
canvas.addImage(image);
In this case, you don't have to worry about the image mask; iText will take care of that for you.
Please also explain why you're using an outdated version of iText instead of iText 7. If you want your application to be future-proof, you should upgrade to iText 7 now (to avoid wasting time later).
Pango syntax supports some text-only markup. As far as I can see, this does not extend to embedding images as well.
Looking around, I cannot find much in the way of an existing implementation, but I haven't done Pango+Cairo work before, so I might be missing the obvious community for it.
As far as I can tell, a reasonable approach would be to just analyse a string, pull out any tags, create Cairo images, and then modify the Pango layout around them accordingly.
It also seems like something someone might have done before.
I'm specifically looking for answers to these questions:
Does Pango+Cairo already solve this and I have just misread the docs?
Has something like this been done before, and where is a reference?
Is this a reasonable approach, or should I try something else, and what?
(Also note that I am using Ruby, so that may affect my options.)
I've been through the source of the markup parser and it does not allow for "shape" attributes (the way Pango almost incorporates graphics) but it is possible to do it "by hand".
Since there is absolutely no example code on the Web, here's Pango/Cairo/Images 101.
For a simple demo, I created an 800x400 window, added a GtkDrawingArea and connected up the "draw" signal. Before entering the main program loop, I initialized it with the following code:
PangoLayout *Pango;
void init_drawingArea (GtkWidget *pWidget)
{
cairo_surface_t *pImg = cairo_image_surface_create_from_png ("linux.png");
PangoRectangle r = {0, 0, PANGO_SCALE * cairo_image_surface_get_width (pImg),
PANGO_SCALE * cairo_image_surface_get_height(pImg)};
PangoContext *ctxt = gtk_widget_get_pango_context (pWidget);
PangoAttrList *attList = pango_attr_list_new();
PangoAttribute *attr;
Pango = pango_layout_new (ctxt);
pango_cairo_context_set_shape_renderer (ctxt, render, NULL, NULL);
pango_layout_set_text (Pango, pszLorem, -1);
pango_layout_set_width(Pango, PANGO_SCALE * 800);
attr = pango_attr_shape_new_with_data(&r, &r, pImg, NULL, NULL);
attr->start_index = 0; attr->end_index = 1;
pango_attr_list_insert (attList, attr);
attr = pango_attr_shape_new_with_data(&r, &r, pImg, NULL, NULL);
attr->start_index = 152; attr->end_index = 153;
pango_attr_list_insert (attList, attr);
pango_layout_set_attributes (Pango, attList);
}
The context's shape renderer is set to render () and a PangoLayout is created and initialized. It then creates 2 shape attributes, sets the user data to a cairo surface which we populate from a png file and applies the attributes to characters 0 and 152 of the text.
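One thing to keep in mind when picking the positions: the start_index and end_index of a Pango attribute are byte offsets into the UTF-8 text, not character counts. For plain ASCII text like the usual lorem ipsum the two coincide, but not once the text contains non-ASCII characters. A small Java sketch of the conversion follows (Java only to keep the examples on this page in one language; the class and method names are made up):

```java
import java.nio.charset.StandardCharsets;

class ByteIndexDemo {
    // UTF-8 byte offset at which the code point with the given index starts.
    static int utf8Offset(String text, int codePointIndex) {
        // Java strings are UTF-16, so first convert the code point index to a char index.
        int charIndex = text.offsetByCodePoints(0, codePointIndex);
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII only: byte offset and character index are the same.
        System.out.println(utf8Offset("Lorem ipsum", 6));
        // "\u00e9" (e with acute accent) takes two bytes in UTF-8,
        // so everything after it is shifted by one.
        System.out.println(utf8Offset("h\u00e9llo", 2));
    }
}
```

In C the same offset can be obtained with GLib's g_utf8_offset_to_pointer and pointer arithmetic.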
The "draw" signal processing is straightforward.
gboolean onDraw (GtkWidget *pWidget, cairo_t *cr, gpointer user_data)
{
pango_cairo_show_layout (cr, Pango);
return 1;
}
and the render () PangoCairoShapeRenderFunc function is called as needed:
void render (cairo_t *cr, PangoAttrShape *pShape, gboolean do_path, gpointer data)
{
cairo_surface_t *img = (cairo_surface_t *)pShape->data;
double dx, dy;
cairo_get_current_point(cr, &dx, &dy);
cairo_set_source_surface(cr, img, dx, dy);
cairo_rectangle (cr, dx, dy, pShape->ink_rect.width/PANGO_SCALE,
pShape->ink_rect.height/PANGO_SCALE);
cairo_fill(cr);
}
Taking the current point from cairo, it draws a rectangle and fills it with the image.
And that's pretty much all it does. Images were added as an afterthought and it shows. They are subject to the same rules as any other glyph so they are limited to the equivalent of CSS's display: inline.
I've put the code up at http://immortalsofar.com/PangoDemo/ if anyone wants to play with it. Me, I arrived here trying to get around GtkTextBuffer's limitations. Guess I'll just have to go deeper.
With your help I have managed to get a very nice PDF generation tool built. It builds a PDF based on a 5-page template. On the 3rd and 5th pages, additional pages may need to be added, moving the following pages down. The 5th page is even landscape. Everything works perfectly except for one little additional piece of functionality that I am looking for.
The template that I have built has form fields on the fifth page. Therefore, I use the following code to fill the field:
var pdfReader = new PdfReader(existingFileStream);
var stamper = new PdfStamper(pdfReader, newFileStream);
var form = stamper.AcroFields;
form.SetField("fkClientName", clientName);
The field gets filled just fine, but not on the additional pages. Which is weird because I do call this line:
PdfImportedPage templatePage = stamper.GetImportedPage(pdfReader, 5);
I feel like it should see that there are form fields on that fifth page. However, I read that stamper.GetImportedPage does not retrieve form fields. I don't really care if it's a form field or text. I just need the client name at the top of each generated additional page. Here is what my ColumnText code looks like that builds the additional pages:
while (true)
{
ct.SetSimpleColumn(-75, 75, PageSize.A4.Height + 25, PageSize.A4.Width - 200);
if (!ColumnText.HasMoreText(ct.Go()))
break;
pageNum++;
stamper.InsertPage(pageNum, new Rectangle(792f, 612f));
stamper.GetOverContent(pageNum).AddTemplate(templatePage, 0, -1f, 1f, 0, 0, PageSize.A4.Width);
ct.Canvas = stamper.GetOverContent(pageNum);
}
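For readers wondering what the six numbers in the AddTemplate call above do: they are the a, b, c, d, e, f entries of a PDF transformation matrix, which use the same order as the six-argument java.awt.geom.AffineTransform constructor. The sketch below uses plain Java (no iText) and assumes that PageSize.A4 is 595 x 842 points and that the imported page is an unrotated portrait page. Both are assumptions, and the class and method names are made up.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class TemplateMatrixDemo {
    static final double A4_WIDTH = 595;  // assumption: value of PageSize.A4.Width
    static final double A4_HEIGHT = 842; // assumption: value of PageSize.A4.Height

    // The matrix from the AddTemplate call: a=0, b=-1, c=1, d=0, e=0, f=A4 width.
    // It maps a template point (x, y) to (y, A4_WIDTH - x), a quarter turn clockwise.
    static AffineTransform templateMatrix() {
        return new AffineTransform(0, -1, 1, 0, 0, A4_WIDTH);
    }

    // Bounding box on the target page of a template with the given size.
    static Rectangle2D placedBounds(double templateWidth, double templateHeight) {
        Rectangle2D template = new Rectangle2D.Double(0, 0, templateWidth, templateHeight);
        return templateMatrix().createTransformedShape(template).getBounds2D();
    }

    public static void main(String[] args) {
        // A portrait A4 template covers a landscape A4 area after the transformation.
        System.out.println(placedBounds(A4_WIDTH, A4_HEIGHT));
    }
}
```

Under these assumptions the template ends up covering an area 842 points wide and 595 points high, starting at the lower left corner of the inserted page.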
If you had company stationery with some kind of background and you wanted to create a document that has flowing text (a column that can flow over to the next page) that also has a repeating header, then I would prefer using PdfWriter.
I'd use PdfWriter to add the content (without using ColumnText, just use the page size and the margins to define the column) and I would add the background and the header using page events. See for instance the Stationery example from my book.
I'd create a subclass for PdfPageEventHelper and I'd load the page you want to see repeated into a PdfImportedPage instance named page:
PdfReader reader = new PdfReader(STATIONERY);
page = writer.getImportedPage(reader, 1);
You may also want to initialize a Phrase with the name of your customer:
header = new Phrase(customerName);
Then you override the onEndPage() method like this:
public void onEndPage(PdfWriter writer, Document document) {
writer.getDirectContentUnder().addTemplate(page, 0, 0);
ColumnText.showTextAligned(writer.getDirectContent(),
Element.ALIGN_RIGHT, header, 36, 806, 0);
}
Now you don't have to worry about ColumnText and new pages. Every time a new page is created, the background and the header will be added automatically.
However, you are using PdfStamper because your original document isn't company stationery: it's a 5 page document. If this document doesn't contain any interactive elements (you've created it using iTextSharp, so you know if it's a flat document or not), I'd still try the PdfWriter approach and change the page instance in the event whenever a new page is needed.
If you want to keep on using PdfStamper, you'll have to add the header in a different way. For instance using a different ColumnText instance, or, if it's a single line, using ColumnText.showTextAligned(). If you don't know the coordinates for the header, you can retrieve the position of the field using the getFieldPositions() method.