I am learning the PDF "syntax" and trying to create various PDF documents manually (on Windows 7, using Notepad++ to write a first, broken file with unresolved references, then running it through pdftk to produce a valid file with updated references, as explained here...).
My learning materials:
PDF Reference 6th edition, version 1.7
+ various online resources.
My question: I would like to define a path only once in the document, then possibly reuse it many times in other parts of the same document. E.g. I could define a logo once, then reuse it on different pages, maybe several times at different positions on the same page, maybe with different scaling factors... What is the best way to achieve this?
To give a better idea, I could define a logo (here a cross) this way:
3 0 obj
<< /Length 32>>
stream
10 0 10 30 re f
0 10 30 10 re S
endstream
endobj
And would like to reuse the same "shape" (possibly with different scalings) in other places in the document, without having to specify the path again.
I am not looking for software to do the job (e.g. Acrobat...). I want to learn how to write this manually (then ask pdftk to fix the file).

You can look in this recently created GitHub repository for some examples of handcoded PDF files.
This file specifically is re-using a (very small, 2x2 pixels) image multiple times on one page, at different positions, with different skews, scalings and rotations.
Here is a screenshot of that PDF file:
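For the path-reuse case in the question, the PDF mechanism that fits is a Form XObject: the path is stored once in its own stream object and painted with the Do operator, with a cm matrix in front of each use to move or scale it. Below is a minimal Python sketch that writes such a file, including a correct xref table. The resource name /Logo, the object numbering, the page size and the positions are my own choices for illustration, not taken from the repository mentioned above.

```python
def build_pdf() -> bytes:
    # The cross from the question, stored once as a Form XObject (object 5).
    logo = b"10 0 10 30 re f\n0 10 30 10 re S"
    # Page content: paint the same form three times with different matrices.
    page = (
        b"q 1 0 0 1 50 700 cm /Logo Do Q\n"       # original size
        b"q 2 0 0 2 150 650 cm /Logo Do Q\n"      # scaled 2x
        b"q 0.5 0 0 0.5 300 700 cm /Logo Do Q"    # scaled 0.5x
    )
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /XObject << /Logo 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(page), page),
        b"<< /Type /XObject /Subtype /Form /BBox [0 0 30 30] /Length %d >>\n"
        b"stream\n%s\nendstream" % (len(logo), logo),
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    xref_pos = len(out)
    # Each xref entry must be exactly 20 bytes long.
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)
```

Writing the result of build_pdf() to a file opened in binary mode should give a one-page document with three crosses. When hand-coding, the same structure applies: the page's /Resources maps a name to the form object, and every reuse is just another "q ... cm /Name Do Q" line.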


How can I cluster multiple images in a folder?

Due to a hard disk failure I lost my sorted photos. I then recovered them using image recovery software, but now all the images are in one folder. There may be over 500 images in that single folder.
The images also have custom names.
The images are not all the same file size.
The images are not all the same dimensions either.
I am unable to cluster them and separate them into new folders manually, because that is too time consuming. So, is there any online solution or software that can automatically cluster them and move them into folders?
For example :
Image set 1 :
Image set 2 :
Image set 3 :
In each of the above sets of pictures, every image has the same background. So those images should be clustered together and put into one folder.
Along these lines, is there any solution, or API-level solution, to reduce the manual work?
If they are JPEG images, you can try running jhead on them and it should be able to find the dates in the files. See jhead.
It can then rename the files based on the date for you, then you could separate them by their names/dates.
It may also tell you the GPS latitude/longitude, so you could move them to folders based on their proximity to each other.
Try the -v option to see the full information in a file:
jhead -v recovered123.jpg
Get the time information from the EXIF metadata.
Use this to automatically name and sort the images. Since you likely did not operate two cameras at two different events at the same time, this will work extremely well. Unless you managed to destroy this metadata.
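Once the timestamps are known, splitting the photos into events is mostly a matter of looking for long pauses. The following Python sketch assumes you have already extracted a (file name, datetime) pair per image by whatever means (for example by parsing jhead output); the function name and the three-hour threshold are my own assumptions and would need tuning.

```python
from datetime import datetime, timedelta


def group_by_time_gap(stamped, max_gap=timedelta(hours=3)):
    """Split (name, datetime) pairs into groups separated by gaps > max_gap."""
    groups = []
    previous = None
    for name, taken in sorted(stamped, key=lambda item: item[1]):
        if previous is None or taken - previous > max_gap:
            groups.append([])  # a long pause starts a new event
        groups[-1].append(name)
        previous = taken
    return groups


photos = [
    ("b.jpg", datetime(2015, 5, 1, 10, 30)),
    ("a.jpg", datetime(2015, 5, 1, 10, 0)),
    ("c.jpg", datetime(2015, 5, 2, 18, 0)),
]
events = group_by_time_gap(photos)  # two groups: a.jpg + b.jpg, then c.jpg
```

Each resulting group can then be moved into its own folder.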

overlay one pdf with another from the command line: pdftk alternative?

I use a bash script to auto-generate a PDF calendar each month, with the wonderful remind program as the basis for this routine. Great as the calendars I get from that program are, I need a more detailed header for the calendar than just the name of the month and the year. I couldn't puzzle out a way to get remind to enhance the header, but I was able to get the results I wanted by creating a second PDF containing the header enhancements I need, then overlaying that PDF onto the calendar I produce with remind, via the pdftk utility (pdftk calendar.pdf stamp calendar_overlay.pdf output MONTH-YEAR-cal.pdf). Unfortunately, I recently lost the ability to use pdftk, since keeping it on my system would mean no longer applying other system updates. In short, I had to remove it in order to continue updating my system.
So now I'm looking for some alternative that I can incorporate into my bash script. I am not finding any utility that will allow me to overlay one pdf with another, like pdftk allows. It seems I may be able to do something like what I'm after using imagemagick (-convert), though I would likely need to overlay the pdf with an image file like a .jpg rather than with a pdf. Another possible solution may be to use TeX/LaTeX to insert text into the pdf as described at
I wanted to ask here, before investing a lot of time and effort into pursuing one or other of the two potential options I've identified, whether there is some other way, using command line options that can be incorporated into a bash script, of overlaying one pdf with another in the manner described? Input will be appreciated.
LATER EDIT: another link with indications how to do such things using LaTeX
Assuming for simplicity that both of your files are of size 500pt x 200pt,
you can use pdfjam with nup and delta options to trick it into overlaying your source pdf files.
pdfjam bottom.pdf top.pdf --outfile merged.pdf \
--nup "1x2" \
--noautoscale true \
--delta "0 -200pt" \
--papersize "{500pt, 200pt}"
Unfortunately, I've found in my tests that I needed to increase the y delta by one point to get perfect alignment.
pdftk-java is a Java-based port of pdftk which looks to be actively in development. Given that its only real requirement appears to be Java 7+, it should work even in environments such as your own that no longer support the requirements of pdftk, so long as they have a Java runtime installed.

Finding duplicate photos by comparing only the pure image data, and by image similarity?

I have approximately 600GB of photos collected over 13 years, now stored on a FreeBSD ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation software (iPhoto, Picasa, HP and many others :( ), in several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.
So first I did the following:
searched the tree for files of the same size (fast) and computed md5 checksums for those.
collected the duplicated images (same size + same md5 = duplicate)
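For readers who want to reproduce that first pass, here is a rough Python sketch of the same idea: group by size first, then hash only the files whose size is shared. The function names are mine, and this is one possible way to do it rather than the exact commands used above.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large photos do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under root that have identical size and MD5."""
    by_size = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) > 1:  # a unique size cannot have a duplicate
            for path in paths:
                by_content[(size, md5_of(path))].append(path)
    return [sorted(group) for group in by_content.values() if len(group) > 1]
```

The size check is what keeps this fast: only files that share a size with some other file are ever read.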
This helped a lot, but there are still MANY MANY duplicates:
photos that differ only in the EXIF/IPTC data added by some photo management software, while the image is the same (or at least "looks the same" and has the same dimensions)
or they are only resized versions of the original image
or they are the "enhanced" versions of originals, etc..
Now the questions:
How can I find duplicates by checksumming only the "pure image bytes" of a JPG, without EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in their EXIF tags while the image is the same (file checksumming therefore doesn't work, but image checksumming could). This is (I hope) not very complicated, but I need some direction.
What Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
More complex
How can I find "similar" images that are only
resized versions of the originals
"enhanced" versions of the originals (from some photo manipulation programs)
Is there already an algorithm available as a Unix command or Perl module (XS?) that I can use to detect these special "duplicates"?
I'm able to write complex scripts in BASH and I "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600GB over the LAN is not the fastest way)...
My rough idea:
delete images only at the end of workflow
use an Image::ExifTool script to collect duplicate image data based on image creation date and camera model (maybe other EXIF data too).
make a checksum of the pure image data (or extract a histogram - identical images should have the same histogram) - not sure about this
use some similarity detection to find duplicates created by resizing and photo enhancement - no idea how to do this...
Any ideas, help, or (software/algorithm) hints on how to bring order to the chaos?
Here is a nearly identical question: Finding Duplicate image files, but I'm already done with its answer (md5) and am looking for more precise checksumming and image comparison algorithms.
Assuming you can work with a locally mounted FS:
rmlint : fastest tool I've ever used to find exact duplicates
findimagedupes : automates the whole ImageMagick way (like Randal Schwartz's script, which I haven't tested, it seems)
Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
dupeguru-pe (gui) : dedicated tool that is fast and does an excellent job
geeqie (gui) : I find it fast/excellent to finish the job, using the granular deduplication options. You can also then generate an ordered collection of images such that similar images are next to each other, allowing you to 'flip' between the two to see the changes.
Have you looked at this article by Randal Schwartz? He uses a perl script with ImageMagick to compare resized (4x4 RGB grid) versions of the pictures that he then compares in order to flag "similar" pictures.
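To make the idea concrete, here is a small Python sketch in the spirit of that approach. It is not the script from the article. It assumes each picture has already been shrunk to a 4x4 RGB thumbnail by some other tool (ImageMagick, for instance) and flattened into a list of 48 channel values from 0 to 255; the function names and the threshold of 8 are my own guesses.

```python
def grid_distance(grid_a, grid_b):
    """Mean absolute difference between two equal-length lists of 0-255 values."""
    if len(grid_a) != len(grid_b) or not grid_a:
        raise ValueError("grids must be non-empty and the same length")
    return sum(abs(a - b) for a, b in zip(grid_a, grid_b)) / len(grid_a)


def looks_similar(grid_a, grid_b, threshold=8.0):
    """Flag two thumbnails as 'similar' when their average channel gap is small."""
    return grid_distance(grid_a, grid_b) <= threshold


original = [120, 64, 30] * 16          # 16 pixels, 3 channels each
brighter = [v + 5 for v in original]   # e.g. a slightly "enhanced" copy
inverted = [255 - v for v in original]
```

Here looks_similar(original, brighter) holds while looks_similar(original, inverted) does not. Because resizing barely changes a 4x4 thumbnail, resized copies tend to land close together, which is what makes the tiny grid useful as a fingerprint.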
You can remove exif data with mogrify -strip from ImageMagick toolset. So you could, for each image, copy it without exif, md5sum, and then compare md5sums.
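If you would rather not write stripped copies to disk, the same effect can be approximated by hashing the JPEG while skipping its metadata segments. The Python sketch below walks the segment list, leaves out APPn (EXIF, IPTC, XMP, JFIF) and COM segments, and hashes everything else, including the compressed scan data. It is a simplification under my own assumptions: it expects a well-formed baseline file layout, it also ignores APP14 Adobe markers (which can affect how colors are decoded), and it will not match files that were re-encoded.

```python
import hashlib


def jpeg_image_digest(data: bytes) -> str:
    """MD5 of a JPEG with APPn and COM (metadata) segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.md5()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("unexpected byte where a marker should be")
        marker = data[pos + 1]
        if marker == 0xDA:
            # SOS: from here on it is compressed image data up to the end.
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            # Quantization tables, Huffman tables, frame header, ...
            digest.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no SOS marker found")
```

Call it as jpeg_image_digest(open(path, "rb").read()) and group files by the returned digest, exactly as you did with the plain md5 sums.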
When it comes to visually similar images, you can, for example, use compare (also from the ImageMagick toolset) to produce a black/white diff map, as described here, then make a histogram of the difference and check whether there is "enough" white to mean that the images are different.
I had a similar dilemma - several hundred gigs of photos and videos spread and duplicated over about a dozen drives. I know this may not be the exact way you are looking for, but the FSlint Janitor application (on Ubuntu 16.x, then 18.x) was a lifesaver for me. I took the project in chunks, eventually cleaning it all up and ended up with three complete sets (I wanted two off-site backups).
FSLint Janitor:
sudo apt install fslint

PostScript to PDF conversion/slow print issue [GhostScript]

I have several large PDF reports (>500 pages) with grid lines and background shading overlay that I converted from postscript using GhostScript's ps2pdf in a batch process. The PDFs that get created look perfect in the Adobe Reader.
However, when I go to print the PDF from Adobe Reader I get about 4-5 ppm from our Dell laser printer with long, 10+ second pauses between each page. The same report PDF generated from another proprietary process (not GhostScript) yields a fast 25+ ppm on the same printer.
The PDF file sizes on both are nearly the same at around 1.5 MB each, but when I print both versions of the PDF to file (i.e. postscript), the GhostScript generated PDF postscript output is about 5 times larger than that of the other (2.7 mil lines vs 675K) or 48 MB vs 9 MB. Looking at the GhostScript output, I see that the background pattern for the grid lines/shading (referenced by "/PatternType1" tag) is defined many thousands of times throughout the file, where it is only defined once in the other PDF output. I believe this constant re-defining of the background pattern is what is bogging down the printer.
Is there a switch/setting to force GhostScript to define a pattern/image only once? I've tried using the -r and -dPdfsettings=/print switches with no relief.
Patterns (and indeed images) and many other constructs should only be emitted once, you don't need to do anything to have this happen.
Forms, however, do not get reused, and it's possible that this is the source of your actual problem. As Kurt Pfeifle says above, it's not possible to tell without seeing a file which causes the problem.
You could raise a bug report at which will give you the opportunity to attach a file. If you do this please do NOT attach a > 500 page file, it would be appreciated if you would try to find the time to create a smaller file which shows the same kind of size inflation.
Without seeing the PostScript file I can't make any suggestions at all.
I've looked at the source PostScript now, and as suspected the problem is indeed the use of a form. This is a comparatively unusual area of PostScript, and it's even more unusual to see it actually being used properly.
Because of its rare usage, we haven't had any impetus to implement the feature to preserve forms in the output PDF, and this is what results in the large PDF. The way the pattern is defined inside the form doesn't help either. You could try defining the pattern separately; at least that way pdfwrite might be able to detect the multiple pattern usage and only emit it once (the pattern contains an imagemask so this may be worthwhile).
This construction:
GS C20 setpattern 384 151 32 1024 RF GR
GS C20 setpattern 384 1175 32 1024 RF GR
is inefficient: you keep re-instantiating the pattern, which is expensive. This:
GS C20 setpattern
384 151 32 1024 RF
384 1175 32 1024 RF
is more efficient
In any event, there's nothing you can do with pdfwrite to really reduce this problem.
'[...] when I print both versions of the PDF to file (i.e. postscript), the GhostScript generated PDF postscript output is about 5 times larger than that of the other (2.7 mil lines vs 675K) or 48 MB vs 9 MB.'
Which version of Ghostscript do you use? (Try gs -v or gswin32c.exe -v or gswin64c.exe -v to find out.)
How exactly do you 'print to file' the PDFs? (Which OS platform, which application, which kind of settings?)
Also, ps2pdf may not be your best option for the batch process. It's a small shell/batch script anyway, which internally calls a Ghostscript command.
Using Ghostscript directly will give you much more control over the result (though its commandline 'usability' is rather inconvenient and awkward -- that's why tools like ps2pdf are so popular...).
Lastly, without direct access to one of your PS input samples for testing (as well as the PDF generated by the proprietary converter) it will not be easy to come up with good suggestions.

What are the various approaches for generating PDFs?

I have an idea for an app that would take some flash content which contains graphics and images like various geometric shapes and polygons and some random images and convert them to PDF.
Also, since I envision this app being used by multiple users, I want this process to be quick and scalable. One possible solution I could think of is to have a small flash client with the capability of assembling the above mentioned graphics and images, generate some sort of XML, and send it to a server running a Java process which could render the PDF using iText.
I was wondering what are the other possible ways to do it or the best practices. Technology isn't an issue; open source or commercial.
I understand that image uploads etc. will take a variable amount of time, so assume that the images are readily available. Here are the criteria for what I am looking for in a PDF rendering solution:
No constraint on the flash client because of the PDF render engine.
Scalable to multiple users
Speed and Efficiency
Least amount of serialization / deserialization
I would appreciate it if you could share your tech stack ideas. Thanks a lot!
PS: I would appreciate it if you don't get bogged down by my Flash >> XML >> Java approach.
I believe it to be one of the many approaches that could be taken.
If generating the PDF in the browser using Flash is an option, then consider using AlivePdf. If not, then check out XSL-FO; we use it for server side conversion to PDF.
I believe iText generates PDFs in Java code. It may or may not use XML as its data source; POJOs will do just as well.
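Another way is XSL-FO. It requires an XML data source and an XSLT stylesheet that transforms the XML into XSL-FO, which a formatter then renders to PDF. Apache's Xalan (or any other XSLT library) can do the transformation for you, and an FO processor such as Apache FOP can do the rendering.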
"Quick" and "scalable" may require more than these. Uploading a lot of images is a process that has its own timescale and optimizations that have nothing to do with PDFs.
There's pdflib for PHP, and FPDF (also for PHP).
So you're also willing to consider other clients? It sounds like you've got a kids drawing app and want to generate something that'll preserve the state of their drawing at the time.
Let's face it, XML isn't that efficient. That's not its purpose. It's both machine and human readable, validatable, etc etc.
Instead, how about a <Canvas> based web page that submits the state of that canvas to the server in JSON (fewer bytes, and less work to build them)? The server can then work in whatever the hell library/language it wants. There are lots of JSON->my-language libraries floating around out there.
Your choice of PDF libraries is then limited only by what you have installed on your server. You also said you wanted to do as little reading/writing as possible.
The most efficient possible setup would be to have a read-only partial PDF already loaded into memory tailored to minimize the impact of canvas changes (including images). Each session would dupe that partial PDF, convert the JSON to PDF graphic commands, and save the PDF.
To minimize structural changes to the PDF you'd want to use Inline Images. No new objects in the PDF means you don't need to change your cross reference table at all (until you add fonts or want to reuse an existing image). You could build the "doc info" dictionary padded with a specific amount of spaces between objects so you could fill it in without changing any byte offsets (which would force you to recompute the xref table).
You may or may not need to mess with the page size... we are just talking about one page here, right?
So the PDF would look something like...
%PDF-1.x
%<3-4 random high order bytes to convince folks that we're a binary stream>
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/Contents 4 0 R/MediaBox[0 0 612 792]/Parent 2 0 R>>
endobj
5 0 obj
<</Type/DocInfo/Author() --<insert big whitespace gap here>--
/Title() --<ditto>--
/Subject() --<ditto>--
/Keywords() --<ditto>--
/Creator(My app's Name)
/Producer(My pdf library's name)
/CreationDate(encodedDateWhenThisTemplateWasBuilt) D:YYYYMMDDHHMMSS-timeZoneOffset
/ModDate() --<another, smaller whitespace gap>--
>>
endobj
4 0 obj
<</Filter/SeveralDifferentFiltersAvailable/Length --<byte length of the stream in this file>-->>
stream
And your template stops there. You'd have a similar "end of the PDF" template that would look something like this:
endstream
endobj
xref
0 6
0000000000 65535 f
0000000010 00000 n
0000000025 00000 n
0000000039 00000 n
0000000097 00000 n
0000000050 00000 n
trailer
<</Root 1 0 R/Size 6/Info 5 0 R>>
startxref
--<some white space>--
%%EOF
The columns of numbers at the end are all wrong. The first column is the byte offset of that particular object (and I'm not up for counting bytes just now thank you). The second column is largely irrelevant.
PDF filling app will need to know:
The byte offsets of everything you intend to fill in within the first template.
All the "doc info" fields, which are all optional by the way. The /Info key and the dictionary it points to are optional for that matter. You could yank 'em if you cared to.
The /Length key of the content stream. That needs to be the post-filter byte length of the stream itself.
How to convert the JSON into pdf drawing commands. If you wanted to cheat a bit you could use iText[Sharp]'s PdfContentByte class, use its drawing commands, and then get the finished byte stream and slap that into your PDF. Be sure you use Inline Images or this whole scheme goes right out the window. There are probably other libraries you could gut similarly if you felt the need. Or you could just read up on the PDF spec and roll your own. You'll be sticking to a fairly limited subset of PDF's content syntax.
The byte offset of the word "xref" from the start of the file. You can calculate this: LengthOfInitialTemplate + LengthOfContentStream + OffsetFromStartOf2ndTemplateTo'xref'.
The byte offset of the line below "startxref", which is where you write the aforecalculated byte offset of 'xref'
You're not going to get much more efficient than that. You'd read in your templates once. Read/calculate the byte offsets you needed once.
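The offset arithmetic from the list above can be sketched in a few lines of Python. Everything named here is hypothetical: the two 10-byte markers stand in for the "whitespace gaps" in the templates, and I use zero-padded numbers instead of space padding so that filling a slot never changes its width. Treat it as an illustration of the scheme, not a finished PDF writer.

```python
LENGTH_SLOT = b"@@LENGTH@@"  # hypothetical 10-byte marker in the first template
XREF_SLOT = b"@@XREFOF@@"    # hypothetical 10-byte marker below 'startxref'


def fill_template(head: bytes, tail: bytes, content: bytes) -> bytes:
    """Splice an (already filtered) content stream between two templates.

    Assumes `head` ends right after the line holding the 'stream' keyword
    and `tail` begins with the line break that precedes 'endstream'.
    """
    # Offset of 'xref' = first template + content stream + its position
    # inside the second template.
    xref_offset = len(head) + len(content) + tail.index(b"\nxref\n") + 1
    # The replacements are exactly as wide as the markers, so none of the
    # object offsets already stored in the xref table move.
    head = head.replace(LENGTH_SLOT, b"%010d" % len(content))
    tail = tail.replace(XREF_SLOT, b"%010d" % xref_offset)
    return head + content + tail
```

Both templates are read once at startup; each request then costs two small replacements and one concatenation. In a real service you would record the slot positions once and write into them directly rather than searching with replace(), which is the "calculate the byte offsets once" part of the plan.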
