Control the Pandoc Word document output size / test image sizes

My client wants to convert Markdown text to Word, and we'll be using Pandoc. However, we want to guard against malicious submissions (e.g., a Markdown doc with 1000 externally hosted images of 10 MB each) that can stress or break the server when it attempts to produce the output.
Our options are to regex the image patterns in the Markdown and test their size (or limit their number), or to disallow external images entirely, but I wonder if there's a way to abort Pandoc if the produced docx exceeds a certain size?
Or is there a simple way to get the images and test their size?

Pandoc normally fetches the images while writing the output file, but you can take control of that by using a Lua filter to fetch the images yourself. This allows you to stop fetching as soon as the combined size of the images becomes too large.
local total_size_images = 0
local max_images_size = 100000 -- in bytes
-- Process all images
function Image (img)
  -- use pandoc's default method to fetch the image contents
  local mimetype, contents = pandoc.mediabag.fetch(img.src)
  -- check that contents isn't too large
  total_size_images = total_size_images + #contents
  if total_size_images > max_images_size then
    error('images too large!')
  end
  -- replace image path with the hash of the image's contents.
  local new_filename = pandoc.utils.sha1(contents)
  -- store image in pandoc's "mediabag", so it won't be fetched again.
  pandoc.mediabag.insert(new_filename, mimetype, contents)
  img.src = new_filename
  -- return the modified image
  return img
end
Please make sure to read the section "A note on security" in the pandoc manual before publishing the app.
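The regex pre-check mentioned in the question can complement the Lua filter by rejecting a submission before Pandoc runs at all. Below is a minimal sketch, assuming only inline Markdown images of the form ![alt](url) need to be caught. The names IMAGE_PATTERN, external_image_urls and check_submission, and the limit of 20, are hypothetical. The pattern does not cover reference-style images or raw HTML img tags, so it should not be the only safeguard.

```python
import re

# Matches inline Markdown images such as ![alt](http://host/pic.png "title").
# Assumption: reference-style images and raw HTML <img> tags are NOT covered.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(\s*<?([^)\s>]+)")


def external_image_urls(markdown):
    """Return the URLs of inline images that point to http(s) hosts."""
    return [
        url
        for url in IMAGE_PATTERN.findall(markdown)
        if url.lower().startswith(("http://", "https://"))
    ]


def check_submission(markdown, max_images=20):
    """Raise ValueError if the text references too many external images."""
    urls = external_image_urls(markdown)
    if len(urls) > max_images:
        raise ValueError(
            "too many external images: %d > %d" % (len(urls), max_images)
        )
    return urls


sample = (
    "Intro ![a](https://example.com/a.png) and ![b](local.png) "
    '![c](http://example.com/c.jpg "title")'
)
urls = check_submission(sample)
```

A count limit like this is cheap because it needs no network access. The size limit is still best enforced while fetching, as the Lua filter does.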

EOFError when converting gensim word2vec to binary format

I have pretrained embeddings in word2vec text format. I loaded them and then saved them to .bin, but I cannot load the resulting file: I get an EOFError: unexpected end of input; is count incorrect or file otherwise damaged?
My original code is:
model = KeyedVectors.load_word2vec_format(wordfile)
model.save_word2vec_format("file.bin",binary=True,write_header=True)
bin_model = KeyedVectors.load_word2vec_format("file.bin",binary=True)
And I can load this file.bin with a limit argument: KeyedVectors.load_word2vec_format("file.bin",binary=True, limit=10000).
Is there some other process needed when I save embeddings?
There's a good chance that your .bin file has an incorrect leading count, or that the file has otherwise been damaged or truncated, because that error means the file declared in its header (first line) a larger number of word-vectors than were found during the attempted load.
So, if you downloaded it or copied it from somewhere, check the original source, to make sure you've got the full file.
Is there a reason you're performing this conversion? The formats are essentially equivalent, and result in the exact same object-in-Python after loading.
If there's any tiny on-disk size savings in binary-format, you could probably save more by GZIPping the file (which the .load_word2vec_format() will also happily decompress, if it sees a trailing .gz on the filename).
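To test the header-count diagnosis without gensim, one can compare the count declared in the first line with the number of vectors actually present. This is a minimal sketch, assuming the usual word2vec binary layout: a text header "count dimensions", then for each entry the word, a space, and the vector as 4-byte floats. The function name count_word2vec_binary is hypothetical.

```python
import io
import struct


def count_word2vec_binary(stream):
    """Return (declared_count, found_count) for a word2vec binary stream.

    Assumes float32 vectors and 'word<space>vector-bytes' entries.
    """
    header = stream.readline().split()
    declared, dim = int(header[0]), int(header[1])
    vector_bytes = 4 * dim  # each component is a 4-byte float
    found = 0
    while True:
        # Read the word: all bytes up to the separating space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch or ch == b" ":
                break
            if ch != b"\n":  # some writers emit a newline after each vector
                word.extend(ch)
        if not ch or not word:
            break  # end of file reached before another complete word
        if len(stream.read(vector_bytes)) < vector_bytes:
            break  # the last vector is truncated
        found += 1
    return declared, found


# Header claims 3 vectors of dimension 2, but only 2 are present.
damaged = (
    b"3 2\n"
    + b"a " + struct.pack("<2f", 1.0, 2.0)
    + b"b " + struct.pack("<2f", 3.0, 4.0)
)
declared, found = count_word2vec_binary(io.BytesIO(damaged))
```

For a file on disk, pass open("file.bin", "rb") instead of the BytesIO object. If found is smaller than declared, the file is truncated or the header is wrong, which matches the error described above.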

PDFClown MarkerContent gives only first two ContentObjects

I am a newbie to PDFClown and need help parsing my PDF contents.
My PDF has a huge number of MarkedContents, which are displayed when the content is converted to a Stream.
But I am not able to parse them into objects to extract the Path information contained within, which is my objective.
Here is my code:
if(level.Contents[i] is MarkedContent)
{
    PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
    PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;
    PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();
    ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
    IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();
    // Here I am getting only two ContentObjects, whereas the stream has many distinct MarkedContents
    for (int k = 0; k < markerContentObjList.Count; k++)
    {
    }
}
Below is the DOM Inspector screenshot and Stream data
In Short
There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.
The First Content Stream
The screen shot you provided shows your first page content stream:
The second content stream of that page exhibits the same issues as this one:
Non-Matching Starts and Ends of Marked Content Sequences
If we look at the marked content operators, we see
/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC
As you can see, there are two EMC operators for the first BDC. This is invalid. Confer ISO 32000-2 section 14.6 Marked content.
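A quick way to spot such surplus EMC operators is to count nesting depth over the operator sequence. The following is a rough sketch in Python rather than PDF Clown code. It assumes the decoded content stream can be split on whitespace, which ignores string literals and inline image data that might contain these tokens. The function name unmatched_emc is hypothetical.

```python
def unmatched_emc(content):
    """Return (indexes of surplus EMC tokens, count of unclosed sections)."""
    depth = 0
    surplus = []
    for index, token in enumerate(content.split()):
        if token in ("BDC", "BMC"):
            depth += 1  # a marked content sequence starts
        elif token == "EMC":
            if depth == 0:
                surplus.append(index)  # closes something never opened
            else:
                depth -= 1
    return surplus, depth


stream = "/OC /Heading BDC EMC EMC /OC /Heading BDC EMC"
surplus, still_open = unmatched_emc(stream)
```

Here surplus holds the token position of the second EMC, the one without a matching BDC.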
Invalid Fill Operator
Furthermore, there is a Fill operator directly following a text object:
BT
...
ET
f
This is also invalid: path painting operators are only allowed after a path object or a clipping path object, not after a text object. Confer ISO 32000-2 Figure 9 Graphics objects.
A Related PDF Clown Issue
Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.
Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.
Why PDF Clown Stops at the End of the First Stream
As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.
This is due to the PDF Clown issue explained above: based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block, without checking whether the closing operator actually matches the kind of block it closes.
Thus, it matches opening and closing operators in your stream like this:
[Start of page content]
. q
. . /OC /Heading BDC
. . EMC
. EMC
. /OC /Drawing BDC
. EMC
Q
So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.
I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.
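The matching behaviour described above can be modelled with a single stack shared by both kinds of block. This is an illustrative sketch of that description, not PDF Clown's actual code. The function name naive_match and the "page" marker for the start of page content are hypothetical.

```python
def naive_match(operators):
    """Pair each closing operator with the most recently opened block,
    regardless of whether the block kinds agree."""
    stack = ["page"]  # stands for the start of the page content
    pairs = []
    for op in operators:
        if op in ("q", "BDC"):
            stack.append(op)
        elif op in ("Q", "EMC"):
            closed = stack.pop() if stack else None
            pairs.append((op, closed))
    return pairs


# Operator skeleton of the stream discussed above.
pairs = naive_match(["q", "BDC", "EMC", "EMC", "BDC", "EMC", "Q"])
```

In this model the second EMC closes the q block, and the final Q then closes the page content itself, which is the situation the answer suspects makes the parser stop.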

How to load all files from a directory of two different types in MATLAB

I know that it is possible to load all files of type .gif by using:
files = dir('C:\myfolder\*.gif');
However, my problem is that I want to load all files of type .gif and .jpg. What would be a good way of doing this?
You can simply search for both .gif and .jpg files then load and process the images one by one.
Just invoke dir twice, once for each type of image, and store the results in two separate structures. Next, concatenate the two results into one structure, then go ahead and do your processing for all of the images.
Something like this:
%// Specify the folder where your images are stored
folder = fullfile('path', 'to', 'your', 'folder');
%// Specify search pattern for JPEG and GIF files
jpgFileFolder = fullfile(folder, '*.jpg');
gifFileFolder = fullfile(folder, '*.gif');
%// Invoke dir for both types of images
d1 = dir(jpgFileFolder);
d2 = dir(gifFileFolder);
%// Concatenate both dir structures together into a single structure
d = [d1; d2];
%// For each image we have...
for idx = 1 : numel(d)
    %// Get full path to file
    f = fullfile(folder, d(idx).name);
    %// Read in the image
    im = imread(f);
    %// Do something with this image
    %//...
    %//...
end
fullfile allows you to create a directory string that is OS independent. Simply take each subdirectory that is part of your string and place them as separate string arguments into fullfile and it should work fine.

In Python 3, best way to open an image stored in a list as a file object?

Using Python 3.4 on Linux and Windows, I'm trying to create QR code images from a list of string objects. I don't want to just store the image as a file because the list of strings may change frequently. I want to then tile all the objects and display the resulting image on screen for the user to scan with a barcode scanner. For the user to know which code to scan, I need to add some text to the QR code image.
I can create the list of image objects correctly, and calling .show on these objects displays them properly, but I don't know how to treat these objects as file objects in order to open them. The object passed to Image.open in my add_text_to_img function (img_list[0] in my case) needs to support the read, seek and tell methods. When I try this as is, I get an attribute error. I've tried BytesIO and StringIO, but I get an error message that Image.open does not support the buffer interface. Maybe I am not doing that part correctly.
I'm sure there are several ways to do this, but what is the best way to open in memory objects as a file object?
from io import BytesIO
import qrcode
from PIL import ImageFont, ImageDraw, Image

def make_qr_image_list(code_list):
    """
    :param code_list: a list of string objects to encode into QR code image
    :return: a list of image or some type of other data objects
    """
    img_list = []
    for item in code_list:
        qr = qrcode.QRCode(
            version=None,
            error_correction=qrcode.ERROR_CORRECT_L,
            box_size=4,
            border=10
        )
        qr.add_data(item)
        qr_image = qr.make_image(fit=True)
        img_list.append(qr_image)
    return img_list

def add_text_to_img(text_list, img_list):
    """
    While I was working on this, I am only saving the first image. Once
    it's working, I'll save the rest of the images to a list.
    :param text_list: a list of strings to add to the corresponding image.
    :param img_list: the list containing the images already created from
    the text_list
    :return:
    """
    base = Image.open(img_list[0])
    # img = Image.frombytes(mode='P', size=(164,164), data=img_list[0])
    text_img = Image.new('RGBA', base.size, (255,255,255,0))
    font = ImageFont.truetype('sans-serif.ttf', 10)
    draw = ImageDraw.Draw(text_img)
    draw.text((0,-20),text_list[0], (0,0,255,128), font=font)
    # include some method to save the images after the text
    # has been added here. Shouldn't actually save to a file.
    # Should be saved to memory/img_list
    output = Image.alpha_composite(base,text_img)
    output.show()

if __name__ == '__main__':
    test_list = ['AlGaN','n-AlGaN','p-AlGaN','MQW','LED AlN-AlGaN']
    image_list = make_qr_image_list(test_list)
    add_text_to_img(test_list, image_list)
    im = image_list[0]
    im.save('/my_save_path/test_image.png')
    im.show()
Edit: I've been using python for about a year and I feel like this is a pretty common thing to do but I'm not even sure that I'm looking up/searching for the right terms. What topics would you search for to answer this? If anyone can post a link or two to what I need to read up on regarding this, that would be very appreciated.
You already have PIL image objects; qr.make_image() returns (a wrapper around) the right type of object, and you do not need to open them again.
As such, all you need to do is:
base = img_list[0]
and go from there.
You do need to match image modes when compositing; QR codes are black-and-white images (mode 1), so either convert that or use the same mode in your text_img image object. The Image.alpha_composite() operation does require that both images have an alpha channel. Converting the base is easy:
base = img_list[0].convert('RGBA')
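Putting the conversion and the compositing together might look like the sketch below. It assumes Pillow is installed and uses a blank mode 1 image as a stand-in for a QR code, so the qrcode package is not needed to try it. The function name add_label is hypothetical, and ImageFont.load_default() replaces the TrueType font from the question so that no font file is required. The last lines show one way to get a file object without touching the disk, by saving into a BytesIO buffer.

```python
from io import BytesIO

from PIL import Image, ImageDraw, ImageFont


def add_label(image, text):
    """Return an RGBA copy of `image` with `text` drawn near the top-left."""
    base = image.convert("RGBA")  # alpha_composite needs RGBA on both sides
    overlay = Image.new("RGBA", base.size, (255, 255, 255, 0))
    draw = ImageDraw.Draw(overlay)
    # Positive coordinates keep the text inside the canvas.
    draw.text((2, 2), text, fill=(0, 0, 255, 255),
              font=ImageFont.load_default())
    return Image.alpha_composite(base, overlay)


# Stand-in for a QR code: a white black-and-white (mode "1") image.
qr_stand_in = Image.new("1", (164, 164), 1)
labelled = add_label(qr_stand_in, "AlGaN")

# Keep the result in memory as PNG bytes and reopen it as a file object.
buffer = BytesIO()
labelled.save(buffer, format="PNG")
buffer.seek(0)
reopened = Image.open(buffer)
```

The BytesIO round trip is only needed when some API insists on a file object. For compositing and display, the image objects can be used directly.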

iText - adding Image element generates a corrupt PDF file

I'm using iText® 5.2.1 ©2000-2012 1T3XT BVBA and Integration Designer 8.0 to create a PDF file that is exported as a byte array.
I am creating a document with a fair amount of text and want to add a logo at the beginning.
Part of the code that is adding the image is as follows:
BASE64Decoder decoder = new BASE64Decoder();
byte[] decodedBytes = decoder.decodeBuffer(Stringovi.SLIKA1);
Image image1 = Image.getInstance(decodedBytes);
image1.setAbsolutePosition(30f, 770f);
image1.scalePercent(60f);
document.add(image1);
The input image is in byte array format because of the system requirements.
The rest of the document consists of different tables with various content and it's all text.
When I add the image in the aforementioned way, the program finishes and I get a byte output that I run through a Base64 decoder. The resulting PDF cannot be opened, and the error shown is:
"Error [PDF Structure 40]:Invalid reference table (xref)"
I can't see where my mistake is so if anybody could be so kind and point me in the right direction I would very much appreciate it.
The document you presented as a "broken PDF file" is not a complete PDF file. It doesn't end with %%EOF, it doesn't have a cross-reference table,... It's a PDF document that isn't complete.
This means that you don't have the following line in your code:
document.close();
If you do have this line, it isn't reached. For instance: an exception is thrown causing the code to jump to a catch clause, skipping the close() operation.
The error message saying Invalid reference table (xref) is consistent with that diagnosis. This isn't a problem caused by iText. It's a problem caused by bad coding: not closing the document and/or not dealing with exceptions correctly.
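The diagnosis above can be checked on the exported bytes before handing them to a viewer. This is a heuristic sketch in Python, not part of iText. The function name looks_complete_pdf and the 1024-byte tail window are assumptions, and passing the check does not prove that a file is valid.

```python
def looks_complete_pdf(data):
    """Heuristically check that PDF bytes have a header, a startxref
    entry near the end, and a closing %%EOF marker."""
    tail = data[-1024:]
    return (
        data.startswith(b"%PDF-")
        and b"startxref" in tail
        and tail.rstrip().endswith(b"%%EOF")
    )


closed = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\nxref\ntrailer\n<<>>\nstartxref\n30\n%%EOF\n"
unclosed = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
results = (looks_complete_pdf(closed), looks_complete_pdf(unclosed))
```

Output produced without a final document.close() would be expected to fail this check, because the cross-reference table and the %%EOF marker are only written when the document is closed.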
