How do you combine PDFs in ruby? - ruby

This was asked in 2008. Hopefully there's a better answer now.
How can you combine PDFs in ruby?
I'm using the pdf-stamper gem to fill out a form in a PDF. I'd like to take n PDFs, fill out a form in each of them, and save the result as an n-page document.
Can you do this with a native library like prawn? Can you do this with rjb and iText? pdf-stamper is a wrapper on iText.
I'd like to avoid using two libraries (i.e. pdftk and iText), if possible.

As of 2013 you can use Prawn to merge pdfs. Gist: https://gist.github.com/4512859
require "prawn"

# Relies on Prawn's :template option, which newer Prawn releases no longer ship in core.
class PdfMerger
  def merge(pdf_paths, destination)
    # Split off the first path without mutating the caller's array
    first_pdf_path, *other_pdf_paths = pdf_paths
    Prawn::Document.generate(destination, :template => first_pdf_path) do |pdf|
      other_pdf_paths.each do |pdf_path|
        pdf.go_to_page(pdf.page_count)
        template_page_count = count_pdf_pages(pdf_path)
        (1..template_page_count).each do |template_page_number|
          pdf.start_new_page(:template => pdf_path, :template_page => template_page_number)
        end
      end
    end
  end

  private

  def count_pdf_pages(pdf_file_path)
    pdf = Prawn::Document.new(:template => pdf_file_path)
    pdf.page_count
  end
end

After a long search for a pure Ruby solution, I ended up writing code from scratch to parse and combine/merge PDF files.
(I feel it is such a mess with the current tools. I wanted something native, but they all seem to have different issues and dependencies... even Prawn dropped the template support they used to have.)
I posted the gem online and you can find it at GitHub as well.
You can install it with:
gem install combine_pdf
It's very easy to use (with or without saving the PDF data to a file).
For example, here is a "one-liner":
(CombinePDF.load("file1.pdf") << CombinePDF.load("file2.pdf") << CombinePDF.load("file3.pdf")).save("out.pdf")
If you find any issues, please let me know and I will work on a fix.

Use ghostscript to combine PDFs:
options = "-q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite"
system "gs #{options} -sOutputFile=result.pdf file1.pdf file2.pdf"
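If the file names come from user input, interpolating them into a shell string is risky. Here is a minimal sketch of the same Ghostscript call built as an argument list, so nothing passes through the shell. The helper names gs_merge_command and merge_with_ghostscript are made up for this example; the flags are the ones from the answer above.

```ruby
# Build the Ghostscript argument list for merging PDFs.
# (gs_merge_command is a hypothetical helper name, not part of any library.)
def gs_merge_command(inputs, output)
  raise ArgumentError, "need at least one input PDF" if inputs.empty?
  ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output}", *inputs]
end

# Passing separate arguments to system bypasses the shell, so spaces or
# quotes in file names need no escaping. Assumes gs is on the PATH.
def merge_with_ghostscript(inputs, output)
  system(*gs_merge_command(inputs, output))
end
```

Usage would look like merge_with_ghostscript(["file1.pdf", "file2.pdf"], "result.pdf"), which returns true when gs exits successfully.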

I wrote a ruby gem to do this — PDF::Merger. It uses iText. Here's how you use it:
pdf = PDF::Merger.new
pdf.add_file "foo.pdf"
pdf.add_file "bar.pdf"
pdf.save_as "combined.pdf"

I haven't seen great options in Ruby. I got the best results shelling out to pdftk:
system "pdftk #{file_1} multistamp #{file_2} output #{file_combined}"
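Note that multistamp overlays each page of the second PDF onto the corresponding page of the first, which is not the same as appending pages. For plain concatenation, pdftk's cat operation is the usual choice. A minimal sketch, assuming pdftk is on the PATH (pdftk_cat_command is a made-up helper name):

```ruby
# Build the argument list for: pdftk in1.pdf in2.pdf ... cat output out.pdf
# (pdftk_cat_command is a hypothetical helper name for this sketch.)
def pdftk_cat_command(inputs, output)
  raise ArgumentError, "need at least one input PDF" if inputs.empty?
  ["pdftk", *inputs, "cat", "output", output]
end

# Example call (not executed here):
#   system(*pdftk_cat_command(["a.pdf", "b.pdf"], "combined.pdf"))
```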

We're closer than we were in 2008, but not quite there yet.
The latest dev version of Prawn lets you use an existing PDF as a template, but not use a template over and over as you add more pages.

Via iText, this will work... though you should flatten the forms before you merge them to avoid field name conflicts. That or rename the fields one page at a time.
Within PDF, fields with the same name share a value. This is usually not the desired behavior, though it comes in handy from time to time.
Something along the lines of (in java):
Document document = new Document();
PdfCopy mergedForms = new PdfCopy(document, new FileOutputStream(outPath));
document.open(); // the document must be open before pages are added
for (String path : paths) {
    PdfReader reader = new PdfReader(path);
    ByteArrayOutputStream curFormOut = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper(reader, curFormOut);
    stamper.getAcroFields().setField(name, value); // ad nauseam
    stamper.setFormFlattening(true); // flattening setting only takes effect during close()
    stamper.close();
    byte[] curFormBytes = curFormOut.toByteArray();
    PdfReader combineMe = new PdfReader(curFormBytes);
    int pages = combineMe.getNumberOfPages();
    for (int i = 1; i <= pages; ++i) { // "1" is the first page
        mergedForms.addPage(mergedForms.getImportedPage(combineMe, i));
    }
}
document.close();

If you want to add any template (created by macOS Pages or Google Docs) using the combine_pdf gem then you can try with this:
final_pdf = CombinePDF.new
company_template = CombinePDF.load("template_file.pdf").pages[0]
pdf = CombinePDF.load("content_file.pdf")
pdf.pages.each {|page| final_pdf << (company_template << page)}
final_pdf.save "final_document.pdf"

Related

How to save an image in a subdirectory on android Q whilst remaining backwards compatible

I'm creating a simple image editor app and therefore need to load and save image files. I'd like the saved files to appear in the gallery in a separate album. From Android API 28 to 29, there have been drastic changes to what extent an app is able to access storage. I'm able to do what I want in Android Q (API 29) but that way is not backwards compatible.
When I try to achieve the same result on lower API versions, I have so far only found ways that require the use of deprecated code (as of API 29).
These include:
the use of the MediaStore.Images.Media.DATA column
getting the file path to the external storage via Environment.getExternalStoragePublicDirectory(...)
inserting the image directly via MediaStore.Images.Media.insertImage(...)
My question is: is it possible to implement it in such a way, so it's backwards compatible, but doesn't require deprecated code? If not, is it okay to use deprecated code in this situation or will these methods soon be deleted from the sdk? In any case it feels very bad to use deprecated methods so I'd rather not :)
This is the way I found which works with API 29:
ContentValues values = new ContentValues();
String filename = System.currentTimeMillis() + ".jpg";
values.put(MediaStore.Images.Media.TITLE, filename);
values.put(MediaStore.Images.Media.DISPLAY_NAME, filename);
values.put(MediaStore.Images.Media.MIME_TYPE, "image/jpeg");
values.put(MediaStore.Images.Media.DATE_ADDED, System.currentTimeMillis() / 1000);
values.put(MediaStore.Images.Media.DATE_TAKEN, System.currentTimeMillis());
values.put(MediaStore.Images.Media.RELATIVE_PATH, "PATH/TO/ALBUM");
getContentResolver().insert(MediaStore.Images.Media.EXTERNAL_CONTENT_URI,values);
I then use the URI returned by the insert method to save the bitmap. The problem is that the field RELATIVE_PATH was introduced in API 29, so when I run the code on a lower version, the image is put into the "Pictures" folder and not the "PATH/TO/ALBUM" folder.
is it okay to use deprecated code in this situation or will these methods soon be deleted from the sdk?
The DATA option will not work on Android Q: that column is not included in query() results even if you ask for it, and you cannot use the paths it returns even if they do come back.
The Environment.getExternalStoragePublicDirectory(...) option will not work by default on Android Q, though you can add a manifest entry to re-enable it. However, that manifest entry may be removed in Android R, so unless you are short on time, I would not go this route.
AFAIK, MediaStore.Images.Media.insertImage(...) still works, even though it is deprecated.
is it possible to implement it in such a way, so it's backwards compatible, but doesn't require deprecated code?
My guess is that you will need to use two different storage strategies, one for API Level 29+ and one for older devices. I took that approach in this sample app, though there I am working with video content, not images, so insertImage() was not an option.
This is the code that works for me. It saves an image to a subdirectory on your phone. It checks the Android version of the device: on Android Q and above it runs the scoped-storage code, and on older versions it runs the code in the else branch.
Source: https://androidnoon.com/save-file-in-android-10-and-below-using-scoped-storage-in-android-studio/
private void saveImageToStorage(Bitmap bitmap) throws IOException {
    OutputStream imageOutStream;
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
        ContentValues values = new ContentValues();
        values.put(MediaStore.Images.Media.DISPLAY_NAME, "image_screenshot.jpg");
        values.put(MediaStore.Images.Media.MIME_TYPE, "image/jpeg");
        // RELATIVE_PATH needs the directory separator ("/"), not File.pathSeparator (":")
        values.put(MediaStore.Images.Media.RELATIVE_PATH,
                Environment.DIRECTORY_PICTURES + File.separator + "AppName");
        Uri uri = getContentResolver().insert(MediaStore.Images.Media.EXTERNAL_CONTENT_URI, values);
        imageOutStream = getContentResolver().openOutputStream(uri);
    } else {
        File imagesDir = new File(
                Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_PICTURES), "AppName");
        imagesDir.mkdirs(); // make sure the album folder exists before writing into it
        File image = new File(imagesDir, "image_screenshot.jpg");
        imageOutStream = new FileOutputStream(image);
    }
    bitmap.compress(Bitmap.CompressFormat.JPEG, 100, imageOutStream);
    imageOutStream.close();
}
For old API (<29) I place an image into the external media directory and scan it via MediaScannerConnection.
Let's see my code.
This function creates an image file. Pay attention to the appName variable: it is the name of the album in which the image will be displayed.
override fun createImageFile(appName: String): File {
    val dir = File(appContext.externalMediaDirs[0], appName)
    if (!dir.exists()) {
        dir.mkdir()
    }
    return File(dir, createFileName())
}
Then, I place an image into the file, and, at last, I run a media scanner like this:
private suspend fun scanNewFile(shot: File): Uri? {
    return suspendCancellableCoroutine { continuation ->
        MediaScannerConnection.scanFile(
            appContext,
            arrayOf<String>(shot.absolutePath),
            arrayOf(imageMimeType)
        ) { _, uri -> continuation.resume(uri) }
    }
}
After some trial and error, I discovered that it is possible to use MediaStore in a backwards compatible way, such that as much code as possible is shared between the implementations for different versions. The only trick is to remember that if you use MediaColumns.DATA, you need to create the file yourself.
Let's look at the code from my project (Kotlin). This example is for saving audio, not images, but you only need to substitute MIME_TYPE and DIRECTORY_MUSIC for whatever you require.
private fun newFile(): FileDescriptor? {
    // Create a file descriptor for a new recording.
    val date = DateFormat.getDateTimeInstance().format(Calendar.getInstance().time)
    val filename = "$date.mp3"
    val values = ContentValues().apply {
        put(MediaColumns.TITLE, date)
        put(MediaColumns.MIME_TYPE, "audio/mp3")
        // store the file in a subdirectory
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
            put(MediaColumns.DISPLAY_NAME, filename)
            put(MediaColumns.RELATIVE_PATH, saveTo)
        } else {
            // RELATIVE_PATH was added in Q, so work around it by using DATA and creating the file manually
            @Suppress("DEPRECATION")
            val music = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_MUSIC).path
            with(File("$music/P2oggle/$filename")) {
                @Suppress("DEPRECATION")
                put(MediaColumns.DATA, path)
                parentFile!!.mkdir()
                createNewFile()
            }
        }
    }
    val uri = contentResolver.insert(MediaStore.Audio.Media.EXTERNAL_CONTENT_URI, values)!!
    return contentResolver.openFileDescriptor(uri, "w")?.fileDescriptor
}
On Android 10 and above, we use DISPLAY_NAME to set the filename and RELATIVE_PATH to set the subdirectory. On older versions, we use DATA and create the file (and its directory) manually. After this, the implementation for both is the same: we simply extract the file descriptor from MediaStore and return it for use.

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html-file as a string
# xpath_table - specifies the html table as xpath which holds the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath(xpath_headers).map do |column|
    column.inner_text
  end
  row_contents = Array.new
  # Double quotes are required here so that #{xpath_table} is interpolated
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td').map do |cell|
      cell.inner_text
    end
    row_content_hash = Hash.new
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    # Append the hash itself so the result is an array of hashes
    row_contents << row_content_hash
  end
  return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
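The dispatch described above could be sketched roughly like this: a registry maps the name used in a recipe to the Ruby code that implements it, and a small runner walks the recipe steps. This is only an illustration under assumptions. The recipe is written as YAML here rather than the XML-like syntax above, and EXTRACTORS, register_extractor, run_recipe and the "echo" extractor are all made-up names. The stand-in extractor exists only so the sketch runs without Nokogiri.

```ruby
require "yaml"

# Maps the extractor name used in a recipe to the Ruby code that implements it.
EXTRACTORS = {}

# Register an extractor under a name that recipe files can refer to.
def register_extractor(name, &block)
  EXTRACTORS[name.to_s] = block
end

# Run every step of a recipe against one page and collect the results.
# recipe_yaml is a YAML list of steps; each step names an extractor and an XPath.
def run_recipe(recipe_yaml, html)
  steps = YAML.safe_load(recipe_yaml)
  steps.map do |step|
    name = step.fetch("extractor")
    extractor = EXTRACTORS.fetch(name) { raise ArgumentError, "unknown extractor: #{name}" }
    extractor.call(html, step.fetch("xpath"))
  end
end

# Stand-in extractor; a real setup would register basic_table here instead.
register_extractor("echo") { |html, xpath| { "xpath" => xpath, "bytes" => html.bytesize } }

recipe = <<~YAML
  - extractor: echo
    xpath: '//div[@id="grid"]/table[@id="displayGrid"]'
YAML

results = run_recipe(recipe, "<html></html>")
```

With this shape, supporting a new page layout means adding a recipe file, and Ruby code only changes when a genuinely new kind of extraction is needed.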
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Ruby RSS/Atom creation - including content

I am creating an Atom feed using Ruby's stdlib rss library. This library is essentially undocumented, but I have it working using the example provided on this page:
require 'rss'
rss = RSS::Maker.make("atom") do |m|
  m.channel.author = "Steve Wattam"
  m.channel.updated = Time.now
  m.channel.about = "http://stephenwattam.com/blog/"
  m.channel.title = "Steve W's Blog"
  storage.posts.each do |p|
    m.items.new_item do |item|
      item.link = p.link
      item.title = p.title
      item.updated = p.edited
      item.pubDate = p.date
      item.summary = p.summary
    end
  end
end
This works fine. I am unable, however, to add a content element. There is no such thing as item.content=, and I can't seem to find any example code online. A browse of the source indicates that content is stored in the item (docs here), but I lack the knowledge to tease it out.
Does anyone know how I might go about adding a content element?
Incidentally, I'm aware other libraries exist to do this, but would ideally like to get this working without requiring any gems.
By digging through the source of the library, I've discovered that item.content yields an object of type RSS::Maker::Atom::Feed::Items::Item::Content. It's possible to set the content on that object:
item.content.content = 'text to set as content'
This object also responds to #xml_content.
Hope this helps someone!
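For anyone who wants to see it in context, here is a minimal sketch of a complete feed with one entry that sets content this way. All names, URLs and dates are placeholders, and setting the type to "html" is my own assumption about what most blogs would want. On recent Ruby versions rss ships as a bundled gem rather than as part of the default library, so it may need to be installed.

```ruby
require "rss"

feed = RSS::Maker.make("atom") do |m|
  # Placeholder feed metadata
  m.channel.author = "Example Author"
  m.channel.updated = Time.utc(2020, 1, 1)
  m.channel.about = "http://example.com/blog/"
  m.channel.title = "Example Blog"

  m.items.new_item do |item|
    item.link = "http://example.com/blog/first-post"
    item.title = "First post"
    item.updated = Time.utc(2020, 1, 1)
    # item.content returns the content object; set its fields directly
    item.content.type = "html"
    item.content.content = "<p>Full post body</p>"
  end
end

puts feed
```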

Read image IPTC data

I'm having some trouble reading out the IPTC data of some images. I want to do this because my client already has all the keywords in the IPTC data and doesn't want to re-enter them on the site.
So I created this simple script to read them out:
$size = getimagesize($image, $info);
if(isset($info['APP13'])) {
$iptc = iptcparse($info['APP13']);
print '<pre>';
var_dump($iptc['2#025']);
print '</pre>';
}
This works perfectly in most cases, but it's having trouble with some images.
Notice: Undefined index: 2#025
Yet I can clearly see the keywords in Photoshop.
Are there any decent small libraries that could read the keywords in every image? Or am I doing something wrong here?
I've seen a lot of weird IPTC problems. Could be that you have 2 APP13 segments. I noticed that, for some reason, some JPEGs have multiple IPTC blocks. It's possibly a result of using several photo-editing programs or some manual file manipulation.
Could be that PHP is trying to read an empty APP13 segment or even embedded "thumbnail metadata".
Could also be a problem with segment length: APP13 and 8BIM have length marker bytes that might have wrong values.
Try a hex editor and check the file "manually".
I have found that IPTC is almost always embedded as XML using the XMP format, and is often not in the APP13 slot. You can sometimes get the IPTC info by using iptcparse($info['APP1']), but the most reliable way to get it without a third-party library is to simply search through the image file for the relevant XML string (I got this from another answer, but I haven't been able to find it, otherwise I would link!):
The xml for the keywords always has the form "<dc:subject>...<rdf:Seq><rdf:li>Keyword 1</rdf:li><rdf:li>Keyword 2</rdf:li>...<rdf:li>Keyword N</rdf:li></rdf:Seq>...</dc:subject>"
So you can just get the file as a string using file_get_contents(get_attached_file($attachment_id)), use strpos() to find each opening (<rdf:li>) and closing (</rdf:li>) XML tag, and grab the keyword between them using substr().
The following snippet works for all jpegs I have tested it on. It will fill the array $keys with IPTC tags taken from an image on wordpress with id $attachment_id:
$content = file_get_contents(get_attached_file($attachment_id));
// Initialize empty array to store keywords
$keys = array();
// Look for xmp data: xml tag "dc:subject" is where keywords are stored
$subject_pos = strpos($content, '<dc:subject>');
$xmp_data_end = strpos($content, '</dc:subject>');
// Only proceed if able to find dc:subject tag (strict check, since strpos() can return 0)
if ($subject_pos !== false && $xmp_data_end !== false) {
    $xmp_data_start = $subject_pos + 12;
    $xmp_data_length = $xmp_data_end - $xmp_data_start;
    $xmp_data = substr($content, $xmp_data_start, $xmp_data_length);
    // Look for tag "rdf:Seq" where individual keywords are listed
    $seq_pos = strpos($xmp_data, '<rdf:Seq>');
    $key_data_end = strpos($xmp_data, '</rdf:Seq>');
    // Only proceed if able to find rdf:Seq tag
    if ($seq_pos !== false && $key_data_end !== false) {
        $key_data_start = $seq_pos + 9;
        $key_data_length = $key_data_end - $key_data_start;
        $key_data = substr($xmp_data, $key_data_start, $key_data_length);
        // $ctr will track position of each <rdf:li> tag, starting with first
        $ctr = strpos($key_data, '<rdf:li>');
        // While loop stores each keyword and searches for next xml keyword tag
        while ($ctr !== false && $ctr < $key_data_length) {
            // Skip past the tag to get the keyword itself
            $key_begin = $ctr + 8;
            // Keyword ends where closing tag begins
            $key_end = strpos($key_data, '</rdf:li>', $key_begin);
            // Make sure keyword has a closing tag
            if ($key_end === false) break;
            // Make sure keyword is not too long (not sure what WP can handle)
            $key_length = min(100, $key_end - $key_begin);
            // Add keyword to keyword array
            $keys[] = substr($key_data, $key_begin, $key_length);
            // Find next keyword open tag
            $ctr = strpos($key_data, '<rdf:li>', $key_end);
        }
    }
}
I have this implemented in a plugin to put IPTC keywords into WP's "Description" field, which you can find here.
ExifTool is very robust if you can shell out to that (from PHP it looks like?)

Image tag not closing with HTMLAgilityPack

Using the HTMLAgilityPack to write out a new image node, it seems to remove the closing tag of an image, e.g. it should be <img src="..." /> but when you check the outer HTML, it has <img src="...">.
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.CreateNode(strIMG);
This breaks xhtml.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1: Here is how to fix an HTML Agility Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{ HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;}
else
{ HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);}
Replace "img" with any other tag to fix them as well (input, select, and option come up frequently). Repeat as needed. Keep in mind that this will produce <img></img> rather than <img />, because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated html/xml as a string (EG via a web service), the simple solution is to use fake tags that the agility pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string, do a simple replace for each tag prefixed with some string (in this case "fix_"), and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
    // SelectNodes returns null (not an empty collection) when nothing matches
    var tags = doc.DocumentNode.SelectNodes("//" + tagName);
    if (tags == null) return;
    foreach (var tag in tags)
    {
        HtmlNode tagReplacement = HtmlNode.CreateNode("<fix_" + tagName + "></fix_" + tagName + ">");
        foreach (var attr in tag.Attributes)
        {
            tagReplacement.SetAttributeValue(attr.Name, attr.Value);
        }
        // for option tags and other non-empty nodes, the next (text) node will be its inner HTML
        if (hasInnerText && tag.NextSibling != null)
        {
            tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
            tag.NextSibling.Remove();
        }
        tag.ParentNode.ReplaceChild(tagReplacement, tag);
    }
}
As a note, if I were a betting man I would guess that Mike Bridge's answer above inadvertently identifies the source of this bug in the pack: something is causing the closed and empty flags to be mutually exclusive.
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as the HAP codeplex discussion here: This essentially sets the empty flag option listed in Mike Bridge's answer above permanently everywhere.
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce this, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
Outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created an issue in CodePlex for this.
