ruby pandoc convert html string to docx temp file - ruby

Using the ruby pandoc gem, I'm trying to convert an html page as a string ("...") to a temp docx file, that users of the site can then download.
The documentation for pandoc ruby says to use:
PandocRuby.html("<h1>hello</h1>").to_latex
I assumed this would work for docx as well, but this is the output:
PandocRuby.html("<h1>hello</h1>").to_docx
"PK\x03\x04\x14\x00\x02\x00\b\x00\xF8.\x7FF\x8E\r\x16\xD8]\x01\x00\x00$\x06\x00\x00\x13\x00\x00\x00[Content_Types].xml\xB5\x94\xCBN\xC30\x10E\xF7|E\xE4-J\xDC\xB2#\b5\xE9\x82\xC7\x12*Q>\xC0u&\xAD\x85_\xB2\xA7\xAF\xBFg\x92\xD0\b\xA1*\xA9h\xBB\x89\x94\xCC\xCC\xBD\xC7W\xE3L\xA6;\xA3\x93\r\x84\xA8\x9C\xCD\xD98\e\xB1\x04\xACt\xA5\xB2\xCB\x9C}\xCE_\xD3\a6-n&\xF3\xBD\x87\x98P\xAF\x8D9[!\xFAG\xCE\xA3\\\x81\x111s\x1E,U*\x17\x8C#z\rK\xEE\x85\xFC\x12K\xE0w\xetc...."
Just creating a new file
File.open(yourfile, 'w') { |file| file.write(docx_conversion) }
throws encoding errors, but I didn't think it would work because docx files are zipped doc and xml files.
Many thanks.

Try:
File.open(yourfile, 'wb') { |file| file.write(docx_conversion) }
Setting the file mode to wb tells Ruby that you will be writing binary data: it sets the external encoding to ASCII-8BIT (binary), so Ruby does not try to transcode the bytes, and it suppresses newline conversion on Windows.
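Since the question asks for a temp file, here is a minimal sketch of the same idea using Tempfile from the standard library. The helper name write_docx_tempfile and the 'converted' prefix are made up for illustration, and docx_bytes stands in for whatever string to_docx returned:

```ruby
require 'tempfile'

# Write raw bytes to a temp file with a .docx suffix and return the Tempfile.
# `docx_bytes` is assumed to be the binary string produced by the conversion.
def write_docx_tempfile(docx_bytes)
  file = Tempfile.new(['converted', '.docx'])
  file.binmode          # binary mode, like opening a regular file with 'wb'
  file.write(docx_bytes)
  file.flush            # make sure the bytes are on disk before anyone reads the path
  file.rewind
  file
end
```

The caller is responsible for closing and unlinking the returned file (for example with close!) once the download has been served.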

Related

Is it possible to load a CSV file with Pandoc and produce a markdown file for each line?

I have a CSV file similar to below:
0,Bob's Business,50 some address,zip,telephone
1,Jill's Business,25 some address,zip,telephone
...
I would like to take this CSV file and have Pandoc produce a markdown file for each line in the CSV file. Each column should be accessible as a variable in a markdown template file.
Is it possible to load a CSV file and produce markdown/html files in this way?
I can see three ways.
Use a static site generator
I would probably just use a tool like jekyll with its data files.
Alternative 1: Convert to YAML and use pandoc's template engine
Put something like this in mytemplate.md:
$for(data)$
$data$
$endfor$
Convert the CSV to a JSON or YAML file (with the rows under a top-level data key, so that the template above can loop over them), then load that file with the --metadata-file option and use the template to render the output:
echo '' | pandoc --metadata-file data.yaml -t markdown --template mytemplate.md -o output.md
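The conversion step is not shown above. A minimal Ruby sketch of it, assuming the five columns from the question and made-up column names (id, name, address, zip, telephone), could write one YAML metadata file per CSV row:

```ruby
require 'csv'
require 'yaml'

# Column names are assumptions; the CSV in the question has no header row.
COLUMNS = %w[id name address zip telephone].freeze

# Turn CSV text into one metadata hash per row.
def rows_to_metadata(csv_text)
  CSV.parse(csv_text).map { |row| COLUMNS.zip(row).to_h }
end

# Write one YAML file per row into `dir`, named after the id column,
# and return the list of paths that were written.
def write_metadata_files(csv_text, dir)
  rows_to_metadata(csv_text).map do |meta|
    path = File.join(dir, "#{meta['id']}.yaml")
    File.write(path, meta.to_yaml)
    path
  end
end
```

Each generated file can then be passed to its own pandoc run via --metadata-file. With this flat layout the template would refer to the fields directly (for example $name$) instead of looping over data.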
Alternative 2: Write a pandoc filter
There are many pandoc filters (like pandoc-placetable or pantable) that read csv and convert it to a pandoc table. But you want to convert it to a pandoc metadata format (which is usually parsed from the YAML frontmatter of markdown files). I guess you could adjust one of those pandoc filters to your purposes.

Specifying metadata for input formats other than Markdown

Pandoc allows you to include metadata at the beginning of a Markdown document using a header like
---
title: The Song That Never Ends
subtitle: It Goes On and On My Friends
author: Abraham Lincoln
lang: en_US
---
Is there any way to convey this information to Pandoc when the input format is not Markdown? I’m specifically interested in HTML input. I tried calling Pandoc with --from=html+yaml_metadata_block, but this didn’t seem to change the behavior at all—the YAML block is just interpreted as HTML.
(It is possible to include some metadata in the “percent format” shown in the “pandoc_title_block” section of the manual, but there doesn’t seem to be a way to give a separate title and subtitle with that syntax. It’s also possible to include the YAML header before the HTML and to force Pandoc to interpret the input as Markdown, but this seems hacky, and if you try to convert that to “real” Markdown then the output is full of HTML tags instead of Markdown formatting characters.)
You can use the --metadata (short -M) or --metadata-file options to supply metadata on the command line, for example:
pandoc -M title="The Song That Never Ends"
A simple solution would be to use Lua filters to augment the metadata read from the HTML file as described in the Lua filters doc. Below is an updated version:
-- file: additional-metadata.lua
function read_file_as_markdown_yaml (filename)
  -- read metadata file into string
  local metafile = io.open(filename, 'r')
  local content = metafile:read('*a')
  metafile:close()
  -- get metadata
  return pandoc.read(content, 'markdown').meta
end

function Meta (meta)
  -- read YAML file and add its content to the metadata
  local yaml_meta = read_file_as_markdown_yaml(meta.default_meta_file)
  for k, v in pairs(yaml_meta) do
    -- use YAML metadata as fallback
    meta[k] = meta[k] or v
  end
  return meta
end
Use with
pandoc --lua-filter additional-metadata.lua \
--metadata default_meta_file:YOUR-FILE-HERE.yaml \
your-input-file.html

Read the content of the only file in a zip file in Ruby

I have a zip file in Ruby at a particular location on the file system. There is only 1 file in that zip file. I want to read the content of that file. How can I do it (without knowing the name of the file up-front)? I've tried looking at various libraries/ways but the APIs were either outdated, or the libraries weren't maintained for years.
You can use RubyZip:
require 'zip'
a = Zip::File.open(path_to_zip_file) { |z| z.first.get_input_stream.read }

Ruby Dropbox APP: How to download a word document

I'm having troubles trying to download word documents from a dropbox using an APP controlled by a ruby program. (I would like to have the ability to download any file from a dropbox).
The code they provide is great for "downloading" a .txt file, but if you try using the same code to download a .docx file, the "downloaded" file won't open in word due to "corruption."
The code I'm using:
contents = @client.get_file(path + filename)
open(filename, 'w') {|f| f.puts contents }
For variable examples, path could be '/', and filename could be 'aFile.docx'. This works, but the file, aFile.docx, that is created can not be opened. I am aware that this is simply grabbing the contents of the file and then creating a new file and inserting the contents.
Try this:
open(filename, 'wb') { |f| f.write contents }
Two changes from your code:
I used the file mode wb to specify that I'm going to write binary data. This matters most on Windows, where text mode translates line endings; on Linux and OS X it mainly marks the file's encoding as binary.
I used write instead of puts. puts appends a newline when the data does not already end with one, so it can alter arbitrary binary data, while write emits the bytes as they are. I assume this, together with text mode, is the source of the "corruption."
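To make the second point concrete, here is a small sketch that writes the same bytes with write and with puts and reads back what landed on disk. The helper names and the sample bytes are made up for the demonstration:

```ruby
require 'tempfile'

# Write `data` with IO#write in binary mode and return the bytes on disk.
def bytes_via_write(data)
  Tempfile.create('demo') do |f|
    f.binmode
    f.write(data)
    f.flush
    File.binread(f.path)
  end
end

# Same, but with IO#puts, which adds a newline if the data lacks one.
def bytes_via_puts(data)
  Tempfile.create('demo') do |f|
    f.binmode
    f.puts(data)
    f.flush
    File.binread(f.path)
  end
end

sample = "PK\x03\x04\x14\x00".b   # stand-in for the start of a .docx (zip) file
written = bytes_via_write(sample) # identical to sample
put = bytes_via_puts(sample)      # sample plus one trailing "\n" byte
```

Only the write variant round-trips the data byte for byte.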

Convert a PDF to .txt gives me an empty .txt file

Hi, I'm trying to read a PDF in Ruby, and as a first step I want to convert it to a .txt file. path is the path to the PDF. The problem is that the .txt file I get is empty. Someone told me it is a pdftotext problem, but I don't know how to fix it.
spec = path.sub(/\.pdf$/, '')
`pdftotext #{spec}.pdf`
file = File.new("#{spec}.txt", "w+")
text = []
file.readlines.each do |l|
  if l.length > 0
    text << l
    Rails.logger.info l
  end
end
file.close
What's wrong with my code? Thanks!
It's not possible to extract text from every PDF. Some PDF files use a font encoding that makes it impossible to extract text with simple tools such as pdftotext (and some PDF files are even completely immune to direct text extraction with any tool known to me -- in these cases you'll have to apply OCR first to have a chance to extract text...).
So if you test your code with the same "weird" PDF file all the time, it may well happen that you're getting frustrated over your code while in reality the fault lies with the PDF.
First make sure that pdftotext works well with a given PDF on the command line, then test (and develop further) your code with that PDF.
The problem is that you are opening the file in a write mode ("w+"), which truncates the file to zero length before you read it. You can see a table of file modes and what they mean at http://ruby-doc.org/core-1.9.3/IO.html.
Try something like this. It uses a pdftotext option to send the text to stdout, which avoids creating a temporary file, and it uses blocks for more idiomatic Ruby.
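require 'shellwords'

# "-" as the output file name makes pdftotext write to stdout
text = `pdftotext #{Shellwords.escape(path)} -`
# iterate over lines (split without arguments would yield single words)
text.each_line.map(&:strip).reject(&:empty?).each { |line|
  Rails.logger.info(line)
}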
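The truncation itself is easy to reproduce without pdftotext. In this sketch a scratch file (its name and contents are made up) stands in for the .txt that pdftotext produced; it is read once in "r" mode and once in "w+" mode:

```ruby
require 'tempfile'

# Return whatever can be read from `path` right after opening it in `mode`.
def read_after_open(path, mode)
  File.open(path, mode) { |f| f.read }
end

# Scratch file standing in for the pdftotext output.
scratch = Tempfile.new(['pdftotext_output', '.txt'])
scratch.write("line one\nline two\n")
scratch.close

with_r      = read_after_open(scratch.path, 'r')   # the existing text
with_w_plus = read_after_open(scratch.path, 'w+')  # empty: "w+" truncated the file on open
```

After the "w+" open the file on disk is empty as well, which matches the empty .txt described in the question.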
You would need to open the txt file with write permission.
file = File.new("#{spec}.txt", "w")
You could consult How to create a file in Ruby
Update: your code is not complete and looks buggy.
I can't tell what path is.
It looks like you are trying to read the text file that you intend to write to: file.readlines.each
Check the spelling of length; you have l.lenght.
You may want to paste the actual code.
Check this gist https://gist.github.com/4160587
As mentioned, your code is not working because you are reading and writing to the same file.
Example
Ruby code file_write.rb to do the file write operation
pdf_file = File.open("in.txt")
output_file = File.open("out.txt", "w") # file to which you want to write
# iterate over input file and write the content to output file
pdf_file.readlines.each do |l|
  output_file.puts(l)
end
output_file.close
pdf_file.close
Sample txt file in.txt
Some text in file
Another line of text
1. Line 1
2. Not really line 2
Once you run file_write.rb you should see a new file called out.txt with the same content as in.txt. You could change the content of the input file if you want. In your case you would use a PDF reader to get the content and write it to the text file. Basically, only the first line of the code will change.
