iTextSharp PDFWriter bottleneck - performance

I'm merging 10,000 two-page PDF files into one with iTextSharp.
Here is a rough sketch of what I'm doing:
Document document = new Document();
using (PdfWriter writer = PdfWriter.GetInstance(document, new FileStream("merged.pdf", FileMode.Create)))
{
    // The document has to be opened before any content is written.
    document.Open();
    PdfContentByte cb = writer.DirectContent;
    foreach (string thisFile in files)
    {
        PdfReader reader = new PdfReader(thisFile);
        var page1 = writer.GetImportedPage(reader, 1);
        var page2 = writer.GetImportedPage(reader, 2);
        // Start a new output page for each imported page.
        document.NewPage();
        cb.AddTemplate(page1, 1f, 0, 0, 1f, 0, 0);
        document.NewPage();
        cb.AddTemplate(page2, 1f, 0, 0, 1f, 0, 0);
    }
}
I'm trying to understand where the bottlenecks are. I ran some performance tests, and the two slowest steps are, unsurprisingly, reading each file with PdfReader and the Dispose call that saves the file (it is triggered at the end of the using PdfWriter block).
I'm getting about 25% utilization on all 16 cores for this process. I tried a solid-state drive instead of my 7,200 rpm SATA drive and the speed is almost exactly the same.
How can I speed this process up? Distributing the task is not an option, because the read speed between computers would be even slower. Even if it means changing to another language or library, or writing this at a lower level, I need to get this process done much faster than I currently can. Right now the merge takes about 10 minutes.

I finally solved this. Here are my performance results, with the code of the winning approach below.
I used the same machine for all three of these tests.
iTextSharp - content builder directly on a PdfWriter
    Windows 2008 64 bit
    NTFS partition
    merges about 30 pages per second during processing
    significant overhead at the end when closing out the PdfWriter
    25 pages per second overall
iTextSharp - PdfCopy
    Windows 2008 64 bit
    NTFS partition
    writes the output to disk instead of memory, so no overhead at the end
    40 pages per second
iText (Java) - PdfCopy (exact same code, just ported to Java)
    Ubuntu 12.04 64 bit server edition
    EXT3 partition (going to try ext4 soon)
    also writes the output to disk during processing
    250 pages per second
I haven't tried to figure out why the same code runs faster in Java on Ubuntu, but I'll take it. In general, I defined all major variables (reader, copiedPage, copyPdf) outside of this function, since it gets called 36,000 times during this process.
public void addPage(String inputPdf, String barcodeText, String pageTitle)
{
    try
    {
        // read in the pdf
        reader = new PdfReader(inputPdf);
        // all pdfs must have 2 pages (front and back).
        // set to throw an out of bounds error if not. caught up stream
        for (int i = 1; i <= Math.Min(reader.NumberOfPages, 2); i++)
        {
            // import the page from source pdf
            copiedPage = copyPdf.GetImportedPage(reader, i);
            // add the page to the new document
            copyPdf.AddPage(copiedPage);
        }
        // cleanup this page, keeps a big memory leak away
        copyPdf.FreeReader(reader);
        copyPdf.Flush();
    }
    finally
    {
        reader.Close();
    }
}

Give the PdfSmartCopy a try. Not sure if it's faster or not.
Document document = new Document();
using (PdfSmartCopy copy = new PdfSmartCopy(document, new FileStream("merged.pdf", FileMode.Create)))
{
    document.Open();
    foreach (string thisFile in files)
    {
        PdfReader reader = new PdfReader(thisFile);
        copy.AddPage(copy.GetImportedPage(reader, 1));
        copy.AddPage(copy.GetImportedPage(reader, 2));
        // Release every reader, not just the last one.
        copy.FreeReader(reader);
        reader.Close();
    }
    // Close the document while the writer is still alive.
    document.Close();
}

Related

Set text over a scanned document in itext

I've done a lot of research on this subject, but everything I find says the same thing: "use the getOverContent function of the stamper". I did that and it is still not working.
I wrote a program that merges the PDFs in a directory and then paginates the new document. The original PDFs are either natively created (saved directly as PDF) or scanned. The trouble is with the scanned ones: the page numbers show up on the native PDFs but not on the scanned ones (they probably exist, but are hidden behind the image)!
Here is the pagination code. Does anyone have an idea where I went wrong?
PdfReader reader = new PdfReader(source);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(destination));
for (int i = start + 1; i <= reader.getNumberOfPages(); i++) {
    Phrase noPage = new Phrase((i - start) + "", new Font(FontFamily.COURIER, 14));
    float x = reader.getPageSize(i).getRight(20);
    float y = reader.getPageSize(i).getTop(20);
    PdfContentByte content = stamper.getOverContent(i);
    content.beginText();
    ColumnText.showTextAligned(content, Element.ALIGN_CENTER, noPage, x, y, 0);
    content.endText();
}
stamper.close();
reader.close();
Thanks
After Bruno's answer, I changed the code to the following:
PdfReader reader = new PdfReader(source);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(destination));
for (int i = start + 1; i <= reader.getNumberOfPages(); i++) {
    Phrase noPage = new Phrase((i - start) + "", new Font(FontFamily.COURIER, 14));
    float x = reader.getCropBox(i).getRight(20);
    float y = reader.getCropBox(i).getTop(20);
    PdfContentByte content = stamper.getOverContent(i);
    ColumnText.showTextAligned(content, Element.ALIGN_CENTER, noPage, x, y, 0);
}
stamper.close();
reader.close();
But it's still not working.
Example files: https://www.transfernow.net/24axn1g4wq4l
You wrote: "The original PDFs are either natively created (saved directly as PDF) or scanned. The trouble is with the scanned ones: the page numbers show up on the native PDFs but not on the scanned ones (they probably exist, but are hidden behind the image)!"
The problem is not that the second type of PDFs is scanned but that it uses page rotation.
When a page is rotated, iText inserts a coordinate system rotation instruction at the start of the undercontent and overcontent which ensures that any text drawn without further transformation is displayed upright on the rotated page.
This rotation of the coordinate system obviously needs to be considered when choosing absolute coordinates.
Thus, instead of
    reader.getPageSize(i)
you should use
    reader.getPageSizeWithRotation(i)
Alternatively, you can switch off this iText mechanism using
    stamper.setRotateContents(false);
and then account for the page rotation explicitly in all your subsequent operations.
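To illustrate why coordinates taken from the unrotated page size can land out of sight, here is a minimal sketch in plain Java with no iText dependency. The class and method names (RotatedPageCoords, sizeWithRotation, topRightAnchor, isVisible) are made up for this example. It assumes the page's lower-left corner is at (0, 0) and that the rotation is a multiple of 90 degrees. It only mimics the width/height swap that getPageSizeWithRotation accounts for, and is not meant to show how iText implements it.

```java
final class RotatedPageCoords {

    // Width and height as seen by a viewer once the page rotation is applied.
    // Assumption: rotation is a multiple of 90 degrees.
    static float[] sizeWithRotation(float width, float height, int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        boolean quarterTurn = (r == 90 || r == 270);
        return quarterTurn ? new float[] {height, width} : new float[] {width, height};
    }

    // Anchor point "margin" units in from the top-right corner of the visible page.
    static float[] topRightAnchor(float width, float height, int rotation, float margin) {
        float[] size = sizeWithRotation(width, height, rotation);
        return new float[] {size[0] - margin, size[1] - margin};
    }

    // True if (x, y) falls inside the visible area of the rotated page.
    static boolean isVisible(float x, float y, float width, float height, int rotation) {
        float[] size = sizeWithRotation(width, height, rotation);
        return x >= 0 && x <= size[0] && y >= 0 && y <= size[1];
    }

    public static void main(String[] args) {
        // A 595 x 842 page stored with a rotation of 90 degrees.
        float[] unrotated = topRightAnchor(595f, 842f, 0, 20f);
        float[] rotated = topRightAnchor(595f, 842f, 90, 20f);
        System.out.println("anchor from unrotated size: " + unrotated[0] + ", " + unrotated[1]);
        System.out.println("anchor from rotated size:   " + rotated[0] + ", " + rotated[1]);
        System.out.println("unrotated anchor visible on rotated page: "
                + isVisible(unrotated[0], unrotated[1], 595f, 842f, 90));
    }
}
```

For a 595 x 842 page with a rotation of 90, the anchor computed from the unrotated size is (575, 822), which lies above the 842 x 595 visible area, while the anchor computed from the rotated size is (822, 575).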

iTextSharp Image placement performance

I have about 2000 images at 8 x 11 inches, ranging in size from 10 KB to 1 MB. With the code below I loop through a directory and insert each TIFF file onto a new page within a new PDF file that I create. This process takes about 10 minutes. I'm running it on a Win 8 server with 12 GB of RAM and 2 x 2.13 GHz processors (nothing else running), and I'd like to see if I can get the time down.
I'm not sure if this is the most efficient way of doing this. It works, just a little slowly. It might be the fastest way, but I was wondering if anyone has a better way of doing the above process.
using (ouput = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
    using (doc = new Document(PageSize.LETTER))
    {
        using (writer = PdfWriter.GetInstance(doc, ouput))
        {
            doc.Open();
            foreach (string dir in Directory.GetFiles(TIFFiles, "*.tif", SearchOption.TopDirectoryOnly))
            {
                Console.Write("\rMerging : {0}....", Path.GetFileName(dir));
                iTextSharp.text.Image TIFF = iTextSharp.text.Image.GetInstance(dir);
                TIFF.SetAbsolutePosition(0, 0);
                writer.DirectContent.AddImage(TIFF);
                doc.NewPage();
            }
            Console.WriteLine("");
            Console.WriteLine("End Time: {0}", DateTime.Now.ToString("hh:mm:ss"));
            doc.Close();
            Console.ReadLine();
        }
    }
}

OutOfMemoryException in Image Resizing

We are using a .NET DLL (http://imageresizing.net/download) for image resizing at runtime. It works perfectly. However, after some time (between 1 and 7 days) the system starts to raise this exception in the Event Viewer:
Exception information:
Exception type: OutOfMemoryException
Exception message: Insufficient memory to continue the execution of the program.
After that exception the website usually stops working and throws "System.OutOfMemoryException".
If we recycle the application pool in which the website is running, the problem clears and the website gets back to normal immediately, without any code change.
Before the image resizing DLL we were using our own custom code, and the same problem happened with that too. The code follows.
private Bitmap ConvertImage(Bitmap input, int width, int height, bool arc)
{
    if (input.PixelFormat == PixelFormat.Format1bppIndexed ||
        input.PixelFormat == PixelFormat.Format4bppIndexed ||
        input.PixelFormat == PixelFormat.Format8bppIndexed)
    {
        Bitmap unpackedBitmap = new Bitmap(input.Width, input.Height);
        Graphics g = Graphics.FromImage(unpackedBitmap);
        g.Clear(Color.White);
        g.DrawImage(input, new Rectangle(0, 0, input.Width, input.Height));
        g.Dispose();
        input = unpackedBitmap;
    }
    double aspectRatio = (double)input.Height / (double)input.Width;
    int actualHeight = CommonMethods.GetIntValue(Math.Round(aspectRatio * width, 0));
    Bitmap _imgOut;
    if (actualHeight > height)
    {
        ResizeImage resizeImage = new ResizeImage(width, actualHeight, InterpolationMethod.Bicubic);
        Bitmap _tempBitmap = resizeImage.Apply(input);
        Bitmap _croppedBitmap = new Bitmap(width, height);
        Graphics _crop = Graphics.FromImage(_croppedBitmap);
        _crop.SmoothingMode = System.Drawing.Drawing2D.SmoothingMode.AntiAlias;
        _crop.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        _crop.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
        _crop.DrawImageUnscaledAndClipped(_tempBitmap, new Rectangle(0, 0, width, height));
        _crop.Dispose();
        _imgOut = _croppedBitmap;
    }
    else
    {
        ResizeImage resizeImage = new ResizeImage(width, height, InterpolationMethod.Bicubic);
        _imgOut = resizeImage.Apply(input);
    }
    // Draw the arc if it has been requested
    if (arc)
    {
        Graphics _arc = Graphics.FromImage(_imgOut);
        _arc.SmoothingMode = System.Drawing.Drawing2D.SmoothingMode.AntiAlias;
        _arc.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        _arc.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
        _arc.DrawArc(new Pen(Color.White, 24), new Rectangle(-13, -13, 50, 50), 180, 90);
        _arc.Dispose();
    }
    // job done
    return _imgOut;
}
We request resized images with URLs like: www.mysite.com/images/myimage.jpg?width=196&height=131
Looking forward.
Farrukh
When you encounter an OutOfMemoryException (regardless of where it occurs), it can be caused by a memory leak anywhere in the application. Having debugged dozens of these instances with WinDbg, I've never found any that ended up being due to a bug in http://imageresizing.net.
That said, there's an easy way to determine whether it is a problem with http://imageresizing.net or not; create a separate application pool and subfolder application in IIS for your images and image resizing. Install nothing there except the image resizer. Next time you encounter the error, log on to the server and find out which w3wp.exe instance is responsible for the massive memory usage.
If it's in the ImageResizer app pool, go collect your $20-$50 bug bounty from http://imageresizing.net/support. If not, you need to figure out where you're leaking stuff in your main application.
If you're working with System.Drawing anywhere else in the app, that's the first place to look. Check your code against this list of pitfalls.
If you're positive you're disposing of every System.Drawing.* instance in a using or finally clause, then read this excellent article a few times to make sure you're not failing on any of the basics, then dig in with DebugDiag and/or WinDBG (see bottom of article).

WIA + network scanner with adf = 1 page

I am writing a program to work with a network scanner through WIA.
Everything works fine when scanning only one page. When I turn on the feeder:
foreach (WIA.Property deviceProperty in wia.Properties)
{
    if (deviceProperty.Name == "Document Handling Select")
    {
        int value = duplex ? 0x004 : 0x001;
        deviceProperty.set_Value(value);
    }
}
the program receives one scan and the signal that there are still documents in the feeder, then fails with a COM error (the scanner continues to scan).
Here's the code that checks for pages in the feeder:
// determine if there are any more pages waiting
Property documentHandlingSelect = null;
Property documentHandlingStatus = null;
foreach (Property prop in wia.Properties)
{
    if (prop.PropertyID == WIA_PROPERTIES.WIA_DPS_DOCUMENT_HANDLING_SELECT)
        documentHandlingSelect = prop;
    if (prop.PropertyID == WIA_PROPERTIES.WIA_DPS_DOCUMENT_HANDLING_STATUS)
        documentHandlingStatus = prop;
}
if ((Convert.ToUInt32(documentHandlingSelect.get_Value()) & 0x00000001) != 0)
{
    return ((Convert.ToUInt32(documentHandlingStatus.get_Value()) & 0x00000001) != 0);
}
return false;
The code that gets the picture:
imgFile = (ImageFile)WiaCommonDialog.ShowTransfer(item, wiaFormatJPEG, false);
Unfortunately, I could not find an example of using WIA with WSD. Perhaps there are some settings for getting multiple images through WSD.
I had almost the same problem using WIA 2.0 with VBA to control a Brother MFC-5895CW multi-function scanner.
When I transferred scans from the ADF, I was not able to capture more than 2 pictures into image objects (and I tried probably every existing option and spent days on that problem!).
The only solution I found with that scanner was to use the ShowAcquisitionWizard method of the WIA.CommonDialog object to batch-transfer all scanned files to a specified folder. For me it was more a workaround than a satisfying solution, because the postprocessing would have become more complicated.
Then came a surprise: I tried the same procedure on my client's Neat scanner, and ShowAcquisitionWizard delivered only one scanned page to the specified folder. The other pages disappeared.
To my second surprise, with the CommonDialog.ShowTransfer method I was able to transfer all scanned documents, picture by picture, into image objects in my application.
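One detail in the question's code may be worth checking, independent of the COM error: it sets "Document Handling Select" to 0x004 for duplex and 0x001 otherwise, but the "more pages waiting" check only proceeds when bit 0x001 of that same property is set. The sketch below is plain Java bit arithmetic with no WIA calls. The class, method and constant names (WiaFeederFlags, handlingSelect, hasMorePages, FEEDER, DUPLEX, FEED_READY) are labels chosen for this example; the values are the ones used in the question. Combining FEEDER and DUPLEX is an assumption about what the driver expects, not something verified against a device.

```java
final class WiaFeederFlags {
    // Values taken from the question's code; the names are labels for this sketch.
    static final int FEEDER = 0x001;      // bit tested on "Document Handling Select"
    static final int DUPLEX = 0x004;      // value the question sets for duplex
    static final int FEED_READY = 0x001;  // bit tested on "Document Handling Status"

    // Assumption: duplex scanning from the feeder needs both bits set.
    static int handlingSelect(boolean duplex) {
        return duplex ? (FEEDER | DUPLEX) : FEEDER;
    }

    // Same logic as the question's check, applied to plain integers.
    static boolean hasMorePages(int select, int status) {
        if ((select & FEEDER) != 0) {
            return (status & FEED_READY) != 0;
        }
        return false;
    }

    public static void main(String[] args) {
        int status = FEED_READY; // pretend the feeder reports more paper
        System.out.println("select=0x004 only:  " + hasMorePages(DUPLEX, status));
        System.out.println("select=0x001|0x004: " + hasMorePages(handlingSelect(true), status));
    }
}
```

With the select value set to 0x004 alone, the check reports no more pages even when the status bit is set, so a duplex run would stop after the first transfer. With both bits set, the check follows the feeder status.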

Rendering smallest possible image size with MVC3 vs Webforms Library

I am in the process of moving a WebForms app to MVC3. Ironically enough, everything is going fine except one thing: images are served from a handler, specifically the Microsoft Generated Image Handler. It works really well. On average, a 450 KB photo gets output at roughly 20 KB.
The actual photo on disk weighs in at 417 KB, so I am getting a great reduction.
Moving over to MVC3, I would like to drop the handler and use a controller action. However, I seem to be unable to achieve the same kind of file size reduction when rendering the image. I walked through the source and took an exact copy of their image transform code, yet I am only getting down to about 230 KB, which is still a lot bigger than what the MS handler outputs (16 KB).
You can see an example of both the controller and the handler here
I have walked through the handler source code and cannot see anything that compresses the image further. If you examine both images you can see a difference: the handler-rendered image is less clear and grainier, but still what I would consider satisfactory for my needs.
Can anyone give me any pointers here? Is output compression somehow being used, or am I overlooking something very obvious?
The code below is used in my home controller to render the image, and is an exact copy of the FitImage method in the Image Transform class that the handler uses.
public ActionResult MvcImage()
{
    var file = Server.MapPath("~/Content/test.jpg");
    var img = System.Drawing.Image.FromFile(file);
    var sizedImg = MsScale(img);
    var newFile = Server.MapPath("~/App_Data/test.jpg");
    if (System.IO.File.Exists(newFile))
    {
        System.IO.File.Delete(newFile);
    }
    sizedImg.Save(newFile);
    return File(newFile, "image/jpeg");
}
private Image MsScale(Image img)
{
    var scaled_height = 267;
    var scaled_width = 400;
    int resizeWidth = 400;
    int resizeHeight = 267;
    if (img.Height == 0)
    {
        resizeWidth = img.Width;
        resizeHeight = scaled_height;
    }
    else if (img.Width == 0)
    {
        resizeWidth = scaled_width;
        resizeHeight = img.Height;
    }
    else
    {
        if (((float)img.Width / (float)img.Width < img.Height / (float)img.Height))
        {
            resizeWidth = img.Width;
            resizeHeight = scaled_height;
        }
        else
        {
            resizeWidth = scaled_width;
            resizeHeight = img.Height;
        }
    }
    Bitmap newimage = new Bitmap(resizeWidth, resizeHeight);
    Graphics gra = Graphics.FromImage(newimage);
    SetupGraphics(gra);
    gra.DrawImage(img, 0, 0, resizeWidth, resizeHeight);
    return newimage;
}
private void SetupGraphics(Graphics graphics)
{
    graphics.CompositingMode = CompositingMode.SourceCopy;
    graphics.CompositingQuality = CompositingQuality.HighSpeed;
    graphics.InterpolationMode = InterpolationMode.HighQualityBicubic;
    graphics.SmoothingMode = SmoothingMode.HighSpeed;
}
If you don't set the quality on the encoder, it uses 100 by default. You'll never get a good size reduction by using 100 due to the way image formats like JPEG work. I've got a VB.net code example of how to set the quality parameter that you should be able to adapt.
The 80L in the code is the quality setting. 80 still gives you a fairly high-quality image, but at a drastic size reduction compared to 100.
Dim graphic As System.Drawing.Graphics = System.Drawing.Graphics.FromImage(newImage)
graphic.InterpolationMode = Drawing.Drawing2D.InterpolationMode.HighQualityBicubic
graphic.SmoothingMode = Drawing.Drawing2D.SmoothingMode.HighQuality
graphic.PixelOffsetMode = Drawing.Drawing2D.PixelOffsetMode.HighQuality
graphic.CompositingQuality = Drawing.Drawing2D.CompositingQuality.HighQuality
graphic.DrawImage(sourceImage, 0, 0, width, height)
' now encode and send the new image
' This is the important part
Dim info() As Drawing.Imaging.ImageCodecInfo = Drawing.Imaging.ImageCodecInfo.GetImageEncoders()
Dim encoderParameters As New Drawing.Imaging.EncoderParameters(1)
encoderParameters.Param(0) = New Drawing.Imaging.EncoderParameter(Drawing.Imaging.Encoder.Quality, 80L)
ms = New System.IO.MemoryStream
newImage.Save(ms, info(1), encoderParameters)
When you save or otherwise write the image after setting the encoder parameters, it'll output it using the JPEG encoder (in this case) set to quality 80. That will get you the size savings you're looking for.
I believe it's defaulting to PNG format also, although Tridus' solution solves that also.
However, I highly suggest using this MVC-friendly library instead, as it avoids all the image resizing pitfalls and doesn't leak memory. It's very lightweight, free, and fully supported.
