Looking for some better tcpdf performance? - refactoring

I am trying to speed up PDF creation on my live website. I didn't look into it in my production environment, and now, looking in debug.log, I see this PHP fatal error:
Allowed memory size of 134217728 bytes exhausted (tried to allocate 1966080 bytes) in /var/www/vhosts/mydomain.com/httpdocs/tcpdf_min/tcpdf.php on line 18146
So I am looking into tcpdf.org/perfomance and I was wondering whether I use SetFont the right way. I set it 8 times like so:
$this->SetFont('helvetica',
and only the size or bold style changes between calls.
I am using barcodes_2d and barcodes_1d. Do I need UTF-8 Unicode? The performance page says:
If you do not need UTF-8 Unicode, set the $unicode parameter on TCPDF constructor to false and the $encoding parameter to 'ISO-8859-1' or other character map.
Edit the config/tcpdf_config.php file: manually set the $_SERVER['DOCUMENT_ROOT'], K_PATH_MAIN and K_PATH_URL constants, and remove the automatic calculation part;
Does that mean I should use
$image_file = "http://www.mydomian.com/images/our_logo.jpg"; vs $image_file = K_PATH_IMAGES . "our_logo.jpg";
Any tips or code examples for better TCPDF PDF-creation performance?

Related

Control the Pandoc word document output size / test image sizes

My client wants to convert markdown text to word and we'll be using Pandoc. However, we want to control malicious submissions (e.g., a Markdown doc with 1000 externally hosted images each being 10 MB) that can stress/break the server when attempting to produce the output.
Options are to regex the image patterns in the Markdown and test their size (or even limit their number), or to disallow external images entirely, but I wonder if there's a way to abort Pandoc if the produced docx exceeds a certain size.
Or is there a simple way to get the images and test their size?
Pandoc normally fetches the images while writing the output file, but you can take control of that by using a Lua filter to fetch the images yourself. This lets you stop fetching as soon as the combined size of the images becomes too large.
local total_size_images = 0
local max_images_size = 100000 -- in bytes
-- Process all images
function Image (img)
  -- use pandoc's default method to fetch the image contents
  local mimetype, contents = pandoc.mediabag.fetch(img.src)
  -- check that contents isn't too large
  total_size_images = total_size_images + #contents
  if total_size_images > max_images_size then
    error('images too large!')
  end
  -- replace image path with the hash of the image's contents.
  local new_filename = pandoc.utils.sha1(contents)
  -- store image in pandoc's "mediabag", so it won't be fetched again.
  pandoc.mediabag.insert(new_filename, mimetype, contents)
  img.src = new_filename
  -- return the modified image
  return img
end
Please make sure to read the section "A note on security" in the pandoc manual before publishing the app.
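The pre-check mentioned in the question (regex the image patterns, limit their number, or disallow external images) could be sketched roughly as follows. This is only an illustration in Python, not part of the Lua filter above; the function names, the default limit of 20 images and the regex itself are assumptions, and the pattern only covers inline images of the form ![alt](src), not reference-style images or raw HTML.

```python
import re

# Matches inline Markdown images such as ![alt](https://example.com/a.png "title").
# Reference-style images and raw HTML <img> tags are deliberately not covered.
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(\s*<?([^)\s>]+)>?[^)]*\)')


def find_image_sources(markdown_text):
    """Return the source paths/URLs of inline images, in document order."""
    return IMAGE_PATTERN.findall(markdown_text)


def is_external(src):
    """Treat http(s) and protocol-relative sources as external."""
    return src.lower().startswith(("http://", "https://", "//"))


def check_submission(markdown_text, max_images=20, allow_external=False):
    """Raise ValueError if the document breaks the (hypothetical) limits."""
    sources = find_image_sources(markdown_text)
    if len(sources) > max_images:
        raise ValueError(f"too many images: {len(sources)} > {max_images}")
    if not allow_external:
        external = [s for s in sources if is_external(s)]
        if external:
            raise ValueError(f"external images not allowed: {external[0]}")
    return sources
```

A check like this only rejects obviously abusive input before Pandoc runs; it does not measure the real download size, which is what the Lua filter does.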

how to read from byte array to generate binary file

Here is a code snippet for downloading a binary file using VBScript:
...
Dim fs,ts
varByteArray = http.ResponseBody
Set fs = CreateObject("Scripting.FileSystemObject")
Set ts = fs.CreateTextFile("filetowrite", True)
For lngCounter = 0 to UBound(varByteArray)
ts.Write Chr(255 And Ascb(Midb(varByteArray, lngCounter + 1, 1)))
Next
ts.Close
(full code can be found here)
I am wondering about:
Chr(255 And Ascb(...
From my understanding, Chr generates 2 bytes of UTF-8, not one (https://support.microsoft.com/en-us/kb/145745). But wouldn't a single byte be necessary for correct byte output in a newly generated binary file?
Why do you mask with 255 using an And operator on the code of a one-byte ANSI character? What purpose does this have?
That code is not using "Option Explicit" so variable declarations are useless. It is using undeclared variables. Two declared and initialized variables are not used.
The "binary and" with 255 seems to serve no purpose.
I downloaded a test file of 4 MB using 4 different methods:
Using Chrome, the regular way.
Using the ADO method in the script. It is very fast and is byte-identical to the browser version (hex comparison).
Using the AscB method with "binary and with 255". It is very, very, very slow but is byte-identical to the browser version (hex comparison).
Using the AscB method without "binary and with 255". It is very, very slow (but a little faster than the masked variant) and is byte-identical to the browser version (hex comparison).
Bottom line: that code works. It tries multiple methods to connect in order of preference, and tries two methods to download in order of preference (it tries ADO first and only falls back to the AscB method if ADO fails). I like that code.
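The observation that the mask changes nothing can be checked outside VBScript. The following is a small Python sketch, not the original script: it assumes, as AscB does, that each input value is already a single byte in the range 0 to 255, and the function names are made up for the illustration.

```python
def copy_with_mask(data):
    """Rebuild the byte sequence, masking every value with 255 first."""
    return bytes(255 & b for b in data)


def copy_without_mask(data):
    """Rebuild the byte sequence from the values as they are."""
    return bytes(b for b in data)


# Every possible single-byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# The mask is a no-op for values that already fit in one byte.
same = copy_with_mask(all_bytes) == copy_without_mask(all_bytes) == all_bytes
```

Under that assumption both variants produce identical output, which matches the hex comparisons reported above; the mask only costs time.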

Convert PDF files to PDF/A via Ghostscript

I'd like to convert arbitrary PDF files to PDF/A with Ghostscript 9.15.
Is Ghostscript able to create PDF/A-3b conformant PDFs? There is no parameter which represents a PDF/A conformance level, so I assume there is no possibility. Or is there anything I have overlooked?
I was following a blog post where a Windows batch file is used to convert from PDF to PDF/A (see http://www.mcbsys.com/techblog/2013/04/batch-convert-pdf-to-pdfa/). The gs invocation in the batch is:
"%gs_path%\gswin64c" ^
-dPDFA ^
-dNOOUTERSAVE ^
-sProcessColorModel=DeviceRGB ^
-sDEVICE=pdfwrite ^
-o "GS_%file1%" ^
-dPDFACompatibilityPolicy=1 ^
"%currentdir%\PDFA_def.ps" ^
%inputfilelist%
The PDFA_def.ps is an adjusted version of the official one:
%!
% This prefix file for creating a PDF/A document is derived from
% the sample included with Ghostscript 9.07, released under the
% GNU Affero General Public License.
% Modified 4/15/2013 by MCB Systems.
% Feel free to modify entries marked with "Customize".
% This assumes an ICC profile to reside in the file (AdobeRGB1998.icc),
% unless the user modifies the corresponding line below.
% The color space described by the ICC profile must correspond to the
% ProcessColorModel specified when using this prefix file (GRAY with
% DeviceGray, RGB with DeviceRGB, and CMYK with DeviceCMYK).
% Define entries in the document Info dictionary :
/ICCProfile (... PATH TO ... AdobeRGB1998.icc) % Customize.
def
[ /Title (Title) % Customize.
/DOCINFO pdfmark
% Define an ICC profile :
[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA} <</N systemdict /ProcessColorModel get /DeviceGray eq {1} {systemdict /ProcessColorModel get /DeviceRGB eq {3} {4} ifelse} ifelse >> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark
% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
/Type /OutputIntent % Must be so (the standard requires).
/S /GTS_PDFA1 % Must be so (the standard requires).
/DestOutputProfile {icc_PDFA} % Must be so (see above).
/OutputConditionIdentifier (AdobeRGB1998) % Customize
>> /PUT pdfmark
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ]>> /PUT pdfmark
So, I use AdobeRGB1998.icc, which is obviously usable for PDF files with an RGB color space. Depending on the -sProcessColorModel value (DeviceRGB), a correct value is printed out.
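For readers less used to PostScript, the /N entry computed by the nested ifelse in PDFA_def.ps can be restated as a plain lookup. This Python sketch only mirrors that one expression; the function name is an assumption made for the illustration.

```python
def icc_component_count(process_color_model):
    """Mirror the /N expression in PDFA_def.ps.

    DeviceGray gives 1 component, DeviceRGB gives 3, and anything else
    falls through to 4 (the CMYK case), exactly like the nested ifelse.
    """
    if process_color_model == "DeviceGray":
        return 1
    if process_color_model == "DeviceRGB":
        return 3
    return 4
```

The number of components has to agree with the ICC profile that is embedded, which is why an RGB profile goes together with DeviceRGB here.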
The conversion works for all files. But when I validate the created PDF file against PDF/A-1b, I get different results depending on whether the input file has an RGB color space or not (e.g. CMYK). So, when I have an input PDF file which uses the CMYK color space, the file gets converted by the script, but the validator says something like this:
"input.pdf", 1, 38, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
My question: Is there a way to get the conversion done for arbitrary files (i.e. independent of the used color space in the input file)?
Update
@KenS Thanks for your answer. I've updated my initial post to clarify what I want to achieve.
To make it more explicit, I will use an example. There are two files: input1.pdf (seems to use RGB) and input2.pdf (seems to use CMYK). I want to convert both of them to PDF/A-1. Thanks to your hint, I've let go of the above mentioned batch script and instead tested the command directly in the command line. After reading Ps2pdf.htm#PDFA, I have adjusted the (official) PDFA_def.ps so that AdobeRGB1998.icc is used. Then I invoked the following command on both input files (replaced output1.pdf by output2.pdf and input1.pdf by input2.pdf for the second file):
gswin64c.exe -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=/RGB \
-sOutputICCProfile=AdobeRGB1998.icc -sDEVICE=pdfwrite \
-sOutputFile=output1.pdf -dPDFACompatibilityPolicy=1 \
"PATH/TO/OFFICIAL/PDFA_def.ps" input1.pdf
The conversion was done without any errors. The output1.pdf seems to be valid, but the output2.pdf is still invalid (tested with 3heights Validator):
"output2.pdf", 1, 40, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output2.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
So if I understand your answer correctly, the above command should produce a PDF file which uses the RGB color space, independent of the color space of the input file. If the input file uses CMYK, then the colors have to be translated into RGB by the above command.
If I interpret the first error message correctly, the color space used in output2.pdf is still CMYK (despite command parameters like ColorConversionStrategy=/RGB). Since I used AdobeRGB1998.icc, the validation error appears.
What am I missing in the above command?
Going back to my original question (which is one step further): instead of always converting to RGB (or CMYK), I wanted to somehow detect which color space is used in the input file and then dynamically switch to an RGB or CMYK ICC file. Is it possible to achieve that?
Ghostscript does not support PDF/A-3. The conformance parameter you are looking for is -dPDFA=, where the valid values are nothing (defaults to 1), 1 or 2. You can find this documented in ghostpdl/gs/doc/Ps2pdf.htm#PDFA.
I'm not sure what you are asking for here though. You must either create a PDF/A file (in level 1 or 2 anyway, I haven't read the revision 3 spec yet) which is RGB or CMYK, because you aren't allowed to use both (you can convert everything to device independent colour of course). The colour space used in the input isn't relevant, other than to decide whether it needs to be converted.
This is something you need to decide; we can't decide it for you. One important reason is that the OutputIntent must be consistent with either RGB or CMYK, and the pdfwrite device doesn't check it; it assumes you chose one which matches the device space you are using for the PDF file (by the way, don't set the ProcessColorModel, use ColorConversionStrategy instead). In your case you have set OutputIntent to AdobeRGB1998, so your colours must be specified either in device independent colour or in RGB.
Given the errors you quote, I would suggest the problem is that you haven't specified -sColorConversionStrategy, so the input colours are not being converted to the required device space. I would further guess that the script you copied this from set -dUseCIEColor, and you didn't copy that bit. DO NOT set -dUseCIEColor; it's a horrible, ancient piece of PostScript hackery. Instead set ColorConversionStrategy, which will convert colours in a much better way, as required.
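As a rough illustration of that advice, here is a Python sketch that assembles a Ghostscript argument list with ColorConversionStrategy set and without ProcessColorModel or -dUseCIEColor. The switches are the ones quoted in this thread; the function name, its parameters and the default executable name are assumptions. One more assumption to verify against the Ghostscript documentation: the question wrote -sColorConversionStrategy=/RGB, while this sketch passes the bare name RGB after -s.

```python
def build_pdfa_args(input_pdf, output_pdf, pdfa_def, icc_profile,
                    gs="gswin64c.exe", pdfa_level=1, color_strategy="RGB"):
    """Return an argument list for a PDF/A conversion with pdfwrite.

    Only builds the list; nothing is executed here.
    """
    if pdfa_level not in (1, 2):
        # The answer above states that only levels 1 and 2 are accepted.
        raise ValueError("pdfa_level must be 1 or 2")
    return [
        gs,
        f"-dPDFA={pdfa_level}",
        "-dBATCH",
        "-dNOPAUSE",
        "-dNOOUTERSAVE",
        # Convert input colours to one space instead of setting ProcessColorModel.
        f"-sColorConversionStrategy={color_strategy}",
        f"-sOutputICCProfile={icc_profile}",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
        "-dPDFACompatibilityPolicy=1",
        pdfa_def,   # prefix file that defines the OutputIntent
        input_pdf,
    ]
```

The resulting list could be handed to subprocess.run(args, check=True). The ICC profile and the colour strategy have to describe the same colour space, since pdfwrite does not check that for you.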
Updated answer as this started getting too long for a comment:
I can't immediately see any problems with your command line; can you share an example PDF file? It's much easier to investigate these things with a solid example. I know from our customers and other free users that pdfwrite is capable of producing conforming PDF/A-1b files.
Regarding the second question: it's not possible to do that, because currently you need to set the OutputIntentProfile to either a CMYK one or an RGB one before you start. You can't just run through the input PDF file until you come to a colour operation and then decide. If you feel like some programming it could be done by modifying pdfwrite, because the profile isn't actually used till the output is closed.
One problem is that, in order to do the colour conversion, you need to set the underlying ProcessColorModel (this is done for you automatically by ColorConversionStrategy). The only way to change ProcessColorModel is to execute a setpagedevice, which causes an erasepage. Now I think that's actually fixable with pdfwrite: all it does is write a white rectangle over the page, so you should be able to intercept that and not emit it. Otherwise any marks you made before you encountered an RGB or CMYK operation would be underneath the white rectangle.
So essentially no, you can't do it right now. If it's important to you then you could probably modify the code to do so (don't forget you will also need to supply 2 OutputIntent profiles to choose between as well). We've never had a customer request to do this, so we won't likely take it on as a project. Of course if you did get this working we might very well incorporate it into the code base if you were to offer it back to us.

Disallow editing but allow page extraction in Java iText / PDF

I'm using iText to generate PDF files. I want to disallow editing of the PDF, but allow the reader to extract pages. Here's my code to set encryption:
writer.setEncryption(null, null, 0xffffffff, PdfWriter.STANDARD_ENCRYPTION_128);
The third parameter specifies permissions. I'm using 0xffffffff instead of the individual iText flags ALLOW_PRINTING etc. This will ask iText to enable everything. But this is what I get in the PDF file:
I should think I should be allowed to enable extraction but disable editing, but am not certain. Here are the permissions bits per Adobe:
(From here, but be warned it's 30 meg)
So turn off bits 6 and 11 but leave on the others (especially bits 5 and 10), and that would turn off editing but allow extraction. In any case, by specifying 0xffffffff I would think that everything would be allowed; but instead everything except extraction is allowed.
I've skimmed the iText source code for setting permissions and don't see anything that would cause this. Here is the relevant code from PdfEncryption.setupAllKeys:
permissions |= (revision == STANDARD_ENCRYPTION_128 || revision == AES_128 || revision == AES_256) ? 0xfffff0c0
: 0xffffffc0;
permissions &= 0xfffffffc;
The first line is doing an OR and so wouldn't remove any permissions; the second line is setting the two right-most bits to 0, per the PDF specification.
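The bit arithmetic can be checked by hand. The Python sketch below mirrors the two quoted lines from PdfEncryption.setupAllKeys and clears the bits named above. It assumes bits are numbered from 1 at the low-order end, as in the PDF specification's permission table, and it works on unsigned 32-bit values, whereas Java's int would show 0xffffffff as -1. The function names are made up for this illustration.

```python
MASK_32 = 0xFFFFFFFF


def clear_bits(permissions, *bit_numbers):
    """Clear the given bits, numbered from 1 at the low-order end."""
    for n in bit_numbers:
        permissions &= ~(1 << (n - 1))
    return permissions & MASK_32


def itext_adjust(permissions, strong=True):
    """Mirror the two lines quoted from PdfEncryption.setupAllKeys."""
    permissions |= 0xFFFFF0C0 if strong else 0xFFFFFFC0
    permissions &= 0xFFFFFFFC
    return permissions & MASK_32


def is_set(permissions, n):
    """True if bit n (1-based) is set."""
    return bool(permissions & (1 << (n - 1)))


everything = itext_adjust(0xFFFFFFFF)                     # 0xFFFFFFFC
no_editing = itext_adjust(clear_bits(0xFFFFFFFF, 6, 11))  # 0xFFFFFBDC
```

With the 128-bit mask 0xfffff0c0, the OR only forces bits 7, 8 and 13 to 32 on, so bits 5, 6, 10 and 11 pass through exactly as supplied. That supports the reading that these two lines are not what removes the extraction permission.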
I'm wondering if it's an iText thing, a PDF thing or if I'm doing something else wrong.
Thanks
A similar issue already has been raised here.
Using encryption actually is counter-productive as it can only be used to remove permissions, not to add them.
According to this, it might be helpful to completely unlock the PDF first:
PdfReader reader = new PdfReader(file.toURI().toURL());
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(
file.getAbsolutePath().replace(".pdf", "_UNLOCKED.pdf")));
stamper.close();
reader.close();
Afterwards you can grab the output and start from scratch (mess around with the permission bits). Hope this helps.
EDIT: If you don't have access to the password, the iText sources can be modified. Simply comment out if (!reader.isOpenedWithFullPermissions()) throw ... (line 121 and 122 in version 5.5.0) in com.itextpdf.text.pdf.PdfStamperImp.

Image Copy Issue with Ruby File method each_byte

This problem has bugged me for a while.
I have a jpeg file that is 34.6 kilobytes. Let's call it Image A. Using Ruby, when I copy each line of Image A to a newly created file, called Image B, it is copied exactly. It is exactly the same size as Image A and is accessible.
Here is the code I used:
image_a = File.open('image_a.jpg', 'r')
image_b = File.open('image_b.jpg', 'w+')
image_a.each_line do |l|
image_b.write(l)
end
image_a.close
image_b.close
This code generates a perfect copy of image_a into image_b.
When I try to copy Image A into Image B, byte by byte, it copies successfully but the file size is 88.9 kilobytes rather than the 34.6 kilobytes. I can't access Image B. My mac system alerted me it may be damaged or is using a file format that isn't recognized.
The related code:
# same as before
image_a.each_byte do |b|
  image_b.write(b)
end
# same as before
Why is Image B, when copied into byte by byte, larger than Image A? Why is it also damaged in some way, shape, or form? Why is Image A the same size as B, when copied line by line, and accessible?
My guess is the problem is an encoding issue. If so, Why does encoding format matter when copying byte by byte if they translate into the correct code points? Are code points jumbled up into each other so the parser is unable to differentiate between them?
Do \s and \n matter? It seems like it. I did some more research and I found that Image A had 128 lines of code whereas Image B had only one line.
Thanks for reading!
IO#each_byte iterates over bytes (aka Integers). IO#write, however, takes a string as an argument. So it converts the integer to a string via to_s.
Given the first byte in your image is 255 [1], you'd write the string "255" into image_b. This is why your image_b gets larger. You write number-strings into it.
Try the following when writing back bytes:
image_a.each_byte do |l|
  image_b.write l.chr
end
[1] As @stefan pointed out, JPEG images start with FF D8, so the first byte is 255.
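The size increase can be reproduced outside Ruby. This Python sketch (function names are assumptions) contrasts writing each byte's decimal digits, which is what write(b) with an Integer amounts to, with writing the raw byte, and computes how many characters a byte takes on average when written as a decimal number.

```python
def copy_as_decimal_text(data):
    """Write each byte as its decimal digits, like IO#write with an Integer."""
    return "".join(str(b) for b in data).encode("ascii")


def copy_as_raw_bytes(data):
    """Write each byte as one byte, like write(b.chr)."""
    return b"".join(bytes([b]) for b in data)


def average_digits_per_byte():
    """Average decimal length over all 256 byte values (658 / 256)."""
    return sum(len(str(b)) for b in range(256)) / 256


header = bytes([0xFF, 0xD8, 0x07])
broken = copy_as_decimal_text(header)   # b"2552167", 7 bytes instead of 3
intact = copy_as_raw_bytes(header)      # identical to the input
```

If the byte values are roughly evenly distributed (an assumption about compressed JPEG data), the average is about 2.57 characters per byte, which is close to the reported growth from 34.6 to 88.9 kilobytes.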

Resources