Adding metadata to PDF - ruby

I need to add metadata to a PDF which I am creating using prawn. That metadata will be extracted later, probably by pdf-reader. It will contain internal document numbers and other information needed by downstream tools.
It would be convenient to associate metadata with each page of the PDF. The PDF specification says that I can store per-page private data in a "Page-Piece Dictionary". Section 14.5 states:
A page-piece dictionary (PDF 1.3) may be used to hold private
conforming product data. The data may be associated with a page or
form XObject by means of the optional PieceInfo entry in the page
object (see Table 30) or form dictionary (see Table 95). Beginning
with PDF 1.4, private data may also be associated with the PDF
document by means of the PieceInfo entry in the document catalogue
(see Table 28).
How can I set a "page-piece dictionary" with prawn? I'm using prawn 0.12.0.
If that's not possible, how else can I achieve my goal of storing metadata about each page, either at the page level, or at the document level?

You can look at the prawn source:
https://github.com/prawnpdf/prawn/commit/131082af5abb71d83de0e2005ecceaa829224904
info = { :Title => "Sample METADATA",
         :Author => "Me",
         :Subject => "Not Working",
         :CreationDate => Time.now }
@pdf = Prawn::Document.new(:template => filename, :info => info)

One way is to do none of the above; that is, don't attach the metadata as a page-piece dictionary, and don't attach it with prawn. Instead, attach the metadata as a file attachment using the pdftk command-line tool.
To do it this way, create a file with the metadata. For example, the file metadata.yaml might contain:
---
- :document_id: '12345'
  :account_id: 10
  :page_numbers:
  - 1
  - 2
  - 3
- :document_id: '12346'
  :account_id: 24
  :page_numbers:
  - 4
After you are done creating the PDF file with prawn, use pdftk to attach the metadata file to it:
$ pdftk foo.pdf attach_files metadata.yaml output foo-with-attachment.pdf
Since pdftk will not modify a file in place, the output file must be different from the input file.
You may be able to extract the metadata file using pdf-reader, but you can certainly do it with pdftk. The following command unpacks metadata.yaml into the unpacked-attachments directory:
$ pdftk foo-with-attachment.pdf unpack_files output unpacked-attachments
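Once the attachment is unpacked, a downstream tool can read it with Ruby's standard YAML library. The following is only a sketch: it embeds the example metadata inline instead of reading unpacked-attachments/metadata.yaml, and the helper names load_metadata and document_for_page are hypothetical, not part of prawn, pdf-reader, or pdftk.

```ruby
require 'yaml'

# Same content as the example metadata.yaml above, inlined so the sketch is
# self-contained. In practice you would read the unpacked file instead.
METADATA_YAML = <<~YML
  ---
  - :document_id: '12345'
    :account_id: 10
    :page_numbers:
    - 1
    - 2
    - 3
  - :document_id: '12346'
    :account_id: 24
    :page_numbers:
    - 4
YML

# Hypothetical helper: parse the YAML, allowing the symbol keys used above.
def load_metadata(yaml_text)
  YAML.safe_load(yaml_text, permitted_classes: [Symbol])
end

# Hypothetical helper: find the entry whose page list contains the given page.
def document_for_page(entries, page_number)
  entries.find { |entry| entry[:page_numbers].include?(page_number) }
end

entries = load_metadata(METADATA_YAML)
puts document_for_page(entries, 4)[:document_id] # prints 12346
```

Keeping the page numbers inside each entry gives per-page lookup even though the attachment itself is stored at the document level.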

Related

Manually populate an ImageField

I have a models.ImageField which I sometimes populate with the corresponding forms.ImageField. Sometimes, instead of using a form, I want to update the image field with an ajax POST. I am passing both the image filename and the image content (base64 encoded), so that in my API view I have everything I need. But I do not really know how to do this manually, since I have always relied on form processing, which automatically populates the models.ImageField.
How can I manually populate the models.ImageField having the filename and the file contents?
EDIT
I have reached the following status:
instance.image.save(file_name, File(StringIO(data)))
instance.save()
And this is updating the file reference, using the right value configured in upload_to in the ImageField.
But it is not saving the image. I would have imagined that the first .save call would:
1. Generate a file name in the configured storage
2. Save the file contents to the selected file, including handling of any kind of storage configured for this ImageField (local FS, Amazon S3, or whatever)
3. Update the reference to the file in the ImageField
And the second .save would actually save the updated instance to the database.
What am I doing wrong? How can I make sure that the new image content is actually written to disk, in the automatically generated file name?
EDIT2
I have a very unsatisfactory workaround, which is working but is very limited. This illustrates the problems that using the ImageField directly would solve:
# TODO: workaround because I do not yet know how to correctly populate the ImageField
# This is very limited because:
# - only uses local filesystem (no AWS S3, ...)
# - does not provide the advanced splitting provided by upload_to
local_file = os.path.join(settings.MEDIA_ROOT, file_name)
with open(local_file, 'wb') as f:
    f.write(data)
instance.image = file_name
instance.save()
EDIT3
So, after some more playing around I have discovered that my first implementation is doing the right thing, but silently failing if the passed data has the wrong format (I was mistakenly passing the base64 string instead of the decoded data). I'll post this as a solution.
Just save the file and the instance:
instance.image.save(file_name, File(StringIO(data)))
instance.save()
No idea where the docs for this use case are.
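You can use InMemoryUploadedFile directly to save data:
import base64
import cStringIO
from django.core.files.uploadedfile import InMemoryUploadedFile

# Decode first so the real byte length is available for the size argument
data = base64.b64decode(request.POST['file'])
file = cStringIO.StringIO(data)
image = InMemoryUploadedFile(file,
                             field_name='file',
                             name=request.POST['name'],
                             content_type="image/jpeg",
                             size=len(data),  # sys.getsizeof(file) would give the object's size, not the data's
                             charset=None)
instance.image = image
instance.save()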

Issues in Updating Metadata while Generating PDF

I am working on an ExtendScript script that saves a FrameMaker book as a PDF. The script is able to save the PDF, but when I try to add PDF metadata (Author, CreationDate, Keywords, Subject, Title, etc.), it is not reflected in the generated PDF.
On closer inspection I found that the metadata elements were not added to the PDFDocInfo property of the book.
Here is the code which I wrote to update the Author Details in PDFDocInfo
$.writeln("Length before" + doc.PDFDocInfo.length);
doc.PDFDocInfo.push("Author");
doc.PDFDocInfo.push("Mr Bond");
$.writeln("Length after" + doc.PDFDocInfo.length);
where doc is an Object of type Book
The output is
Length before0
Length after0
Shouldn't PDFDocInfo have 2 elements in it now? Am I missing anything here?
The following code did the trick...
var pdfDocInfo = new Strings();
pdfDocInfo.push("Author");
pdfDocInfo.push("Mr Bond");
book.PDFDocInfo = pdfDocInfo;

Build Metadata File (txt file) containing JSON

I am building a command line app that will generate metadata files, amongst other things. I have a series of values that I want included, and I would like to insert those values into JSON format and then write the result to a .txt file.
The complicated part (to me at least) is that some of the values are dynamic (i.e. they may change every time a file is created), while other parts of the JSON file need to be static. Is there any sort of templating that may help with this? (json erb)
If I were to use a json erb template, how would I write the result of the template (after it has been populated) to a txt file, since this is not a Rails app and I thus would not be calling the view?
Thank you in advance for any help.
It seems like two things could be helpful to you, but your question is pretty open ended ...
First, if your json templates are complex (static and dynamic parts?) I suggest you look at a tool like RABL ...
https://github.com/nesquena/rabl
There is a railscast on RABL here:
http://railscasts.com/episodes/322-rabl
RABL lets you create templates for generating custom JSON output.
Regarding writing to a file, you may or may not need to call the controller first. But the flow would be something like:
# sample_controller.rb
require 'json'
def get_sample
  @x = {:a => "apple", :b => "baker"}
  render json: @x
end
You can call the controller and get the rendered json.
z = get_sample
File.open(yourfile, 'w') { |file| file.write(z) }
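Since the question is about a command line app rather than a Rails app, the template can also be rendered with nothing but Ruby's standard library. The following is a rough sketch, assuming Ruby 2.5 or later (for ERB#result_with_hash). The template contents, the field names, and the helpers render_metadata and write_metadata are all hypothetical.

```ruby
require 'erb'
require 'json'
require 'tmpdir'

# Hypothetical template: static parts are written literally, dynamic parts
# are ERB tags. Calling to_json on each value handles quoting and escaping.
TEMPLATE = <<~'JSON_ERB'
  {
    "generator": "my-cli-app",
    "title": <%= title.to_json %>,
    "created_at": <%= created_at.to_json %>,
    "tags": <%= tags.to_json %>
  }
JSON_ERB

# Render the template with the given dynamic values and return a JSON string.
def render_metadata(title:, created_at:, tags:)
  ERB.new(TEMPLATE).result_with_hash(title: title, created_at: created_at, tags: tags)
end

# Render, check that the result parses as JSON, then write it to a text file.
def write_metadata(path, **values)
  json = render_metadata(**values)
  JSON.parse(json) # raises JSON::ParserError if the template output is invalid
  File.write(path, json)
  path
end

path = File.join(Dir.tmpdir, 'metadata.txt')
write_metadata(path, title: 'Sample "report"', created_at: '2024-01-01T00:00:00Z', tags: %w[a b])
puts File.read(path)
```

If the static parts are small, building a plain Hash and calling JSON.pretty_generate on it may be simpler than a template; the ERB route mainly helps when the static structure is large.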

axlsx serialize spreadsheet to string

For testing purposes, I'd like to serialize an axlsx spreadsheet to a string. The axlsx documentation indicates it is possible to "Output to file or StringIO". But I haven't found documentation or a code sample that explains how to output to a StringIO. How is it done?
From the code:
# Serialize to a stream
s = package.to_stream()
File.open('example_streamed.xlsx', 'wb') { |f| f.write(s.read) }
In the end, an xlsx file is a zip archive containing multiple XML files and other assets. You can use Package#to_stream to generate an IO stream for streaming purposes, but viewing that archive as a string is probably not what you are looking to do.
If you are just looking to investigate the XML for a specific Worksheet, you can use Worksheet#to_xml_string, which will return a String object with all the goodies in there. (That is how worksheet validation works: we parse that XML and validate it against the schema for the object.)
Hope this helps!

Parsing Liquid in a Jekyll generator before converting to JSON

Best to start by saying that I am very new to Ruby and Liquid. I have searched around looking for some resource on this issue, but as yet haven't been able to find anything of real use.
I have a Jekyll site, which utilises the HTML5 History API. I have a Jekyll generator plugin which creates a single JSON file which holds all the post and page content, ready for use with HTML5 PushState and PopState. This part is functioning properly and is tested.
My problem comes when I have a post/page on the site which has Liquid tags in it. I am guessing I need to parse these Liquid tags to get the template output before I create my JSON object for each post/page. Here is what I have for pages as an example:
# Iterate over all pages
site.pages.each do |page|
  # Encode the page HTML content to JSON
  link = page.url
  @content = Liquid::Template.parse(page.content)
  hash[link] = { "body_class" => page.data['body_class'], "content" => converter.convert(@content.render), "title" => '<h1>' + page.data["content_title"] + '</h1>' }
end
At the moment, this basically removes all Liquid tags from the generated JSON file, leaving nothing in their place.
Here is my full generator file on GitHub, which is based very heavily on nice work by Jezen Thomas.
The output JSON file is also in that repo with the site, or can be accessed quickly here. The blog.html content is the last item in the JSON file and shows the empty h1 and div tags which should have content.

Resources