Ruby + how to split pdf into separate pages? - ruby

I'm using Docsplit to split a PDF into pages with
Docsplit.extract_pages("my.pdf").
But I want to limit the output to the first 4 pages. I tried
Docsplit.extract_pages("my.pdf", :pages => 1..4)
which does not work.
Can anyone suggest what to do?

Install pdftk on your machine if you have not already done so, and set your PATH accordingly.
Remove the ESCAPEs from line 18 of lib/docsplit/page_extractor.rb, like so:
pdftk #{ESCAPE[pdf]} burst output #{ESCAPE[page_path]} 2>&1"
Change it to:
pdftk #{pdf} burst output #{page_path} 2>&1"
By default, the gem ignores the page range you give and creates one PDF file per page. If you're happy with this, the output pages are created in the same folder as your input file.
However, the easiest solution IMHO would be to use the pdftk binary directly. It's quite straightforward: to extract pages 1-4, you could use this snippet:
in_file = 'IN.pdf'
range = 1..4
range_s = range.to_s.gsub('..', '-')
cmd = "pdftk.exe #{in_file} cat #{range_s} output pages#{range_s}.pdf"
res = `#{cmd}`.chomp # interpolate the command string; a bare `cmd` would run a program literally named "cmd"
This works, provided that the pdftk executable is in your PATH.
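A minimal sketch of the same pdftk call, passing the arguments as an array so that file names containing spaces need no shell escaping. The helper names pdftk_cat_args and extract_pages are made up for illustration and are not part of Docsplit or pdftk; the sketch assumes a pdftk executable on the PATH (pdftk.exe on Windows).

```ruby
# Build the argument list for: pdftk IN cat FIRST-LAST output OUT
# (hypothetical helper, not part of Docsplit or pdftk)
def pdftk_cat_args(in_file, range, out_file)
  # min/max also give the right bounds for an exclusive range such as 1...5
  ['pdftk', in_file, 'cat', "#{range.min}-#{range.max}", 'output', out_file]
end

# Run pdftk without going through a shell; returns true on success.
# Assumes pdftk is installed and on the PATH.
def extract_pages(in_file, range, out_file)
  system(*pdftk_cat_args(in_file, range, out_file)) == true
end

p pdftk_cat_args('IN.pdf', 1..4, 'pages1-4.pdf')
# => ["pdftk", "IN.pdf", "cat", "1-4", "output", "pages1-4.pdf"]
```

Calling extract_pages('IN.pdf', 1..4, 'pages1-4.pdf') would then write the four pages to a single output file, and return false if pdftk is missing or fails.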

Related

Rstudio Rmarkdown knit to multiple pdfs?

Can I output multiple pdfs for different page ranges (or using some sort of delimiter) in Rstudio?
Here's a trick I'm using in case you can't find an easy way (on ubuntu, after installing pdftk):
Aside from the rmd file, I create an R script which edits the pdf generated by the rmd file and splits it into smaller pdfs.
example:
# 1 KNIT THE RMD FILE AND GENERATE A SINGLE PDF WITH ALL THE PAGES
rmarkdown::render('~/my_rmd_file.Rmd')
# 2 CUT THE FIRST 5 PAGES OF THE PDF
# 2.1 make up a name for the smaller pdf:
name_for_the_top5pages_pdf <- "my_rmd_file_top5.pdf"
# 2.2 compose the command that edits the pdf:
cmd_extract_first_5_pages <- paste0("pdftk my_rmd_file.pdf cat 1-5 output ",name_for_the_top5pages_pdf)
# 2.3 run the command
system(cmd_extract_first_5_pages)
It will keep the original pdf and create another one with the top 5 pages.

create a .txt file that lists all files in directory, using Matlab on Windows

I wrote this in my Matlab code on macOS:
folder_list = 'folder_list.txt';
cd(folder_paraboles)
if exist(folder_list) == 0
commande = ['ls >',folder_list];
unix(commande)
end
Can anyone give me the corresponding code for Matlab on Windows? Thanks a lot.
Rather than using unix to get the directory listing, you should just use the built-in dir or ls to get a list of files and then write them out to a file using MATLAB's built-in ability to write to files.
files = dir(pwd);
fid = fopen('output.txt', 'wt');
fprintf(fid ,'%s\n', files.name);
fclose(fid);

Convert a PDF to .txt gives me an empty .txt file

Hi, I'm trying to read a PDF in Ruby, and first of all I want to convert it to a .txt file. path is the path to the PDF. The problem is that I get an empty .txt file. Someone told me it is a pdftotext problem, but I don't know how to fix it.
spec = path.sub(/\.pdf$/, '')
`pdftotext #{spec}.pdf`
file = File.new("#{spec}.txt", "w+")
text = []
file.readlines.each do |l|
if l.length > 0
text << l
Rails.logger.info l
end
end
file.close
What's wrong with my code? Thanks!
It's not possible to extract text from every PDF. Some PDF files use a font encoding that makes it impossible to extract text with simple tools such as pdftotext (and some PDF files are even completely immune to direct text extraction with any tool known to me -- in these cases you'll have to apply OCR first to have a chance to extract text...).
So if you test your code with the same "weird" PDF file all the time, it may well happen that you're getting frustrated over your code while in reality the fault lies with the PDF.
First make sure that the command-line usage of pdftotext works well with a given PDF, then test (and develop further) your code with that PDF.
The problem is that you are opening the file in a write mode ("w+"), which truncates the file. You can see a table of file modes and what they mean at http://ruby-doc.org/core-1.9.3/IO.html.
Try something like this. It uses a pdftotext option to send the text to stdout, which avoids creating a temporary file, and uses blocks for more idiomatic Ruby.
text = `pdftotext #{path} -`
text.split("\n").select { |line|
line.length > 0
}.each { |line|
Rails.logger.info(line)
}
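The truncating behaviour of the write modes can be checked in isolation. This is a small sketch, assuming only Ruby's standard library; the helper name lines_after_open is made up for illustration.

```ruby
require 'tempfile'

# Write `content` to a scratch file, reopen it with `mode`,
# and return the lines that can be read back.
# (hypothetical helper for illustration)
def lines_after_open(content, mode)
  scratch = Tempfile.new(['demo', '.txt'])
  scratch.write(content)
  scratch.close # flush to disk; the file itself stays until unlink
  File.open(scratch.path, mode) { |f| f.readlines }
ensure
  scratch.unlink if scratch
end

puts lines_after_open("first\nsecond\n", 'r').length  # 2: read mode leaves the content alone
puts lines_after_open("first\nsecond\n", 'w+').length # 0: "w+" truncates before anything is read
```

This is why reading back the output of pdftotext through a handle opened with "w" or "w+" always yields nothing.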
You would need to open the txt file with write permission.
file = File.new("#{spec}.txt", "w")
You could consult How to create a file in Ruby
Update: your code is incomplete and looks buggy.
I can't tell what path is.
It looks like you are trying to read (file.readlines.each) the text file that you intend to write to.
Check the spelling of length; you have l.lenght.
You may want to paste the actual code.
Check this gist https://gist.github.com/4160587
As mentioned, your code is not working because you are reading and writing to the same file.
Example
Ruby code file_write.rb to do the file write operation
pdf_file = File.open("in.txt")
output_file = File.open("out.txt", "w") # file to which you want to write
#iterate over input file and write the content to output file
pdf_file.readlines.each do |l|
output_file.puts(l)
end
output_file.close
pdf_file.close
Sample txt file in.txt
Some text in file
Another line of text
1. Line 1
2. Not really line 2
Once you run file_write.rb, you should see a new file called out.txt with the same content as in.txt. You can change the content of the input file if you want. In your case you would use a PDF reader to get the content and write it to the text file, so essentially only the first line of the code changes.

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan text of textfiles in Matlab with the textscan function. Before I can open the textfile with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension: *.gz
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program and call it from the command line in Matlab
(2) Use a Matlab 'zip' toolbox. I have heard of gunzip, but don't know about its performance.
Does anyone know a way to unzip these files as quickly as possible from within Matlab?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files, it might be faster to drop down into Java and extract them entry by entry, without explicitly creating unzipped files. Look at the source of unzip.m to get some ideas, and also at the Java documentation.
I've found 7zip-commandline (Windows) / p7zip (Unix) to be somewhat speedier for this.
[edit] From some quick testing, it seems that making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options '_' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
This worked well; I just needed a minor change to Max's syntax for calling the executable:
system([app_path switches ' ' fullfilename options ]);

How can I modify .xfdl files? (Update #1)

The .XFDL file extension identifies XFDL Formatted Document files. These belong to the XML-based document and template formatting standard. This format is exactly like the XML file format; however, it contains a level of encryption for use in secure communications.
I know how to view XFDL files using a file viewer I found here. I can also modify and save these files by doing File:Save/Save As. I'd like, however, to modify these files on the fly. Any suggestions? Is this even possible?
Update #1: I have now successfully decoded and unzipped a .xfdl into an XML file which I can then edit. Now I am looking for a way to re-encode the modified XML file back into base64-gzip (using Ruby or the command line).
If the encoding is base64 then this is the solution I've stumbled upon on the web:
"Decoding XDFL files saved with 'encoding=base64'.
Files saved with:
application/vnd.xfdl;content-encoding="base64-gzip"
are simple base64-encoded gzip files. They can be easily restored to XML by first decoding and then unzipping them. This can be done as follows on Ubuntu:
sudo apt-get install uudeview
uudeview -i yourform.xfdl
gunzip -S "" < UNKNOWN.001 > yourform-unpacked.xfdl
The first command will install uudeview, a package that can decode base64, among others. You can skip this step once it is installed.
Assuming your form is saved as 'yourform.xfdl', the uudeview command will decode the contents as 'UNKNOWN.001', since the xfdl file doesn't contain a file name. The '-i' option makes uudeview non-interactive; remove that option for more control.
The last command gunzips the decoded file into a file named 'yourform-unpacked.xfdl'.
Another possible solution - here
Side Note: Block quoted < code > doesn't work for long strings of code
The only answer I can think of right now is - read the manual for uudeview.
As much as I would like to help you, I am not an expert in this area, so you'll have to wait for someone more knowledgeable to come along and help you.
Meanwhile I can give you links to some documents that might help you:
UUDeview Home Page
Using XDFLengine
Getting started with the XDFL Engine
Sorry if this doesn't help you.
You don't have to leave Ruby to do this. You can use the Base64 module in Ruby to encode the document like this:
irb(main):005:0> require 'base64'
=> true
irb(main):007:0> Base64.encode64("Hello World")
=> "SGVsbG8gV29ybGQ=\n"
irb(main):008:0> Base64.decode64("SGVsbG8gV29ybGQ=\n")
=> "Hello World"
And you can call gzip/gunzip using Kernel#system:
system("gzip foo.something")
system("gunzip foo.something.gz")
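The gzip step can also stay inside Ruby. Below is a minimal sketch of the round trip, assuming Ruby 2.4 or later (for Zlib.gzip and Zlib.gunzip) and a UTF-8 XML payload; the helper names encode_base64_gzip and decode_base64_gzip are made up for illustration. It only handles the encoded payload: if the viewer expects a header line such as the application/vnd.xfdl;content-encoding="base64-gzip" line quoted earlier, that would have to be written back separately, which this sketch does not attempt.

```ruby
require 'base64'
require 'zlib'

# Compress XML text with gzip, then base64-encode the result.
# (hypothetical helper for illustration)
def encode_base64_gzip(xml)
  Base64.encode64(Zlib.gzip(xml))
end

# Reverse step: base64-decode, then gunzip back to XML text.
# Assumption: the XML is UTF-8, so the binary result is relabelled as such.
def decode_base64_gzip(payload)
  Zlib.gunzip(Base64.decode64(payload)).force_encoding(Encoding::UTF_8)
end

xml = '<XFDL><note>edited</note></XFDL>'
payload = encode_base64_gzip(xml)
puts decode_base64_gzip(payload) == xml # true when the round trip is lossless
```

Whether a given XFDL viewer accepts the re-encoded file is something to test against a real form; the sketch only shows that the encoding itself is reversible.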

Resources