How can I directly use pandoc to generate docx files within a Sinatra app? - ruby

I have a Sinatra app which needs to provide downloadable reports in Microsoft Word format. My approach to creating the reports is to generate the content using ERB, and then convert the resulting HTML into docx. Pandoc seems to be the best tool for accomplishing this, but my implementation involves generating some temporary files which feels kludgy.
Is there a more direct way to generate the docx file and send it to the user?
I know that PandocRuby exists, but I couldn't quite get it working for my purposes. Here is an example of my current implementation:
# set up the docx mime type
configure do
  mime_type :docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
end
# route to generate the report
get '/report/:name' do
  content_type :docx
  input = erb :report, :layout=>false # get the HTML content for the input file
  now = Time.now.to_i.to_s # create a unique file name
  input_path = File.join('tmp', now+'.txt')
  f = File.new(input_path, "w+")
  f.write(input.to_s) # write the HTML to the input file
  f.close()
  output_path = File.join('tmp', now+'.docx') # create a unique output file
  system "pandoc -f html -t docx -o #{output_path} #{input_path}" # convert the input file to docx
  send_file output_path
end

A recent update to pandoc-ruby added support for piping binary output to standard output. Does that solve your problem?
I don't have any experience with Sinatra, and I have not tried to use pandoc-ruby to pipe binary output, but something like
puts PandocRuby.convert(input, :from => :html, :to => :docx)
might do the trick.
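If pandoc-ruby does not fit, the same piping idea can be sketched with Ruby's standard Open3 library, which avoids temporary files entirely. This is only a sketch: pipe_through and PANDOC_HTML_TO_DOCX are hypothetical names, pandoc is assumed to be on the PATH, and the Sinatra route in the comments is untested.

```ruby
require 'open3'

# Command line that reads HTML on stdin and writes docx bytes to stdout.
# "-o -" tells pandoc to use stdout even for binary formats such as docx.
PANDOC_HTML_TO_DOCX = %w[pandoc -f html -t docx -o -].freeze

# Hypothetical helper: feeds +input+ to +command+ on stdin and returns
# whatever the command writes to stdout as a binary string.
def pipe_through(command, input)
  output, status = Open3.capture2(*command, stdin_data: input, binmode: true)
  raise "#{command.first} failed with status #{status.exitstatus}" unless status.success?
  output
end

# Sketch of the route (assumes Sinatra and the :docx mime type from the question):
#
#   get '/report/:name' do
#     content_type :docx
#     attachment "#{params[:name]}.docx"
#     pipe_through(PANDOC_HTML_TO_DOCX, erb(:report, layout: false))
#   end
```

Returning the binary string from the route lets Sinatra send it as the response body, so nothing is written to tmp.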

Related

Ruby: Download zip file and extract

I have a Ruby script that downloads a remote ZIP file from a server using Ruby's open command. When I look at the downloaded content, it shows something like this:
PK\x03\x04\x14\x00\b\x00\b\x00\x9B\x84PG\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x10\x00foobar.txtUX\f\x00\x86\v!V\x85\v!V\xF6\x01\x14\x00K\xCB\xCFOJ,RH\x03S\\\x00PK\a\b\xC1\xC0\x1F\xE8\f\x00\x00\x00\x0E\x00\x00\x00PK\x01\x02\x15\x03\x14\x00\b\x00\b\x00\x9B\x84PG\xC1\xC0\x1F\xE8\f\x00\x00\x00\x0E\x00\x00\x00\n\x00\f\x00\x00\x00\x00\x00\x00\x00\x00#\xA4\x81\x00\x00\x00\x00foobar.txtUX\b\x00\x86\v!V\x85\v!VPK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00D\x00\x00\x00T\x00\x00\x00\x00\x00
I tried using the Rubyzip gem (https://github.com/rubyzip/rubyzip) along with its class Zip::ZipInputStream like this:
stream = open("http://localhost:3000/foobar.zip").read # this outputs the zip content from above
zip = Zip::ZipInputStream.new stream
Unfortunately, this throws an error:
Failure/Error: zip = Zip::ZipInputStream.new stream
ArgumentError:
string contains null byte
My questions are:
Is it possible, in general, to download a ZIP file and extract its content in-memory?
Is Rubyzip the right library for it?
If so, how can I extract the content?
I found the solution myself, and then on Stack Overflow (How to iterate through an in-memory zip file in Ruby):
input = HTTParty.get("http://example.com/somedata.zip").body
Zip::InputStream.open(StringIO.new(input)) do |io|
  while entry = io.get_next_entry
    puts entry.name
    parse_zip_content io.read
  end
end
Download your ZIP file. I'm using HTTParty for this, but you could also use Ruby's open command (require 'open-uri').
Convert it into a StringIO stream using StringIO.new(input).
Iterate over every entry inside the ZIP archive using io.get_next_entry (it returns an instance of Entry).
With io.read you get the content, and with entry.name you get the filename.
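The "string contains null byte" error in the question is what Ruby raises when binary data is used where a file path is expected, which is most likely what happened when the raw bytes were handed to the stream constructor. The sketch below uses only the standard library to show the difference; open_as_path and ZIP_BYTES are hypothetical names used for illustration.

```ruby
require 'stringio'

# First bytes of a ZIP archive, including null bytes (illustrative data only).
ZIP_BYTES = "PK\x03\x04\x14\x00\b\x00".b

# Treats +data+ as a file path, as an API expecting a file name would.
# Returns the ArgumentError instead of raising so the result can be inspected.
def open_as_path(data)
  File.open(data, 'rb') { |f| f.read }
rescue ArgumentError => e
  e
end

error = open_as_path(ZIP_BYTES)
puts "#{error.class}: #{error.message}"

# Wrapping the same bytes in a StringIO gives an IO-like object instead,
# which is what the stream-based APIs above expect.
io = StringIO.new(ZIP_BYTES)
puts io.read(2)
```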
Like I commented in https://stackoverflow.com/a/43303222/4196440, we can just use Zip::File.open_buffer:
require 'open-uri'
# On Ruby 3.0+ Kernel#open no longer opens URLs; use URI.open instead.
content = URI.open('http://localhost:3000/foobar.zip')
Zip::File.open_buffer(content) do |zip|
  zip.each do |entry|
    puts entry.name
    # Do whatever you want with the content files.
  end
end

How do I get 'puts' messages and standard output sent to a file while using RSpec / parallel_rspec?

This is the contents of my .rspec_parallel file. I am using parallel_tests gem to run tests in multiple browser instances. To my knowledge, the gem uses the same formatter options available in RSpec.
--format html --out results<%= ENV['TEST_ENV_NUMBER'] %>.html
This works well, and I'm able to get the HTML output I normally see from RSpec. However, all of the 'puts' messages and basic standard output are logged to my console window, not to the HTML files.
How can I get this output into each individual HTML file that I have set up?
puts writes to $stdout, whereas output is actually an instance variable of the RSpec::Core::Formatters::BaseFormatter class. output defaults to $stdout, but when you pass in a string, the formatter determines that it should create a new StringIO and then write it to the given file name. Thus puts will not append to the formatter's output variable.
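To illustrate the distinction: puts goes to whatever $stdout currently points at, so temporarily swapping $stdout is one way to capture it. The helper below is a minimal sketch with a hypothetical name (capture_stdout); it is not part of RSpec, and it only catches Ruby-level writes to $stdout, not output from child processes.

```ruby
require 'stringio'

# Runs the block with $stdout redirected to an in-memory buffer and
# returns everything that was written to it.
def capture_stdout
  original = $stdout
  $stdout = StringIO.new
  yield
  $stdout.string
ensure
  # Always restore the real stream, even if the block raises.
  $stdout = original
end

captured = capture_stdout { puts 'message from a spec' }
# captured could now be appended to a per-process file, for example one
# named after ENV['TEST_ENV_NUMBER'] (an assumption about how you want
# to separate the parallel runs).
```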
You could do something ugly like create a runner file like
File.open('some_file_name.html','w+') do |file|
  file << `rspec spec --format html`
end
Then this file will have the output from $stdout, but your puts output will not be HTML-formatted in this case. Other than that, you could try building your own custom formatter, but it will probably take quite a bit of source reading to make sure you capture everything appropriately.
That being said, it does seem the reporter was exposed for adding custom messages, but I am uncertain how to use it properly. See Pull Request 1866.
It seems it would be something like
it "has a name" do |ex|
  ex.reporter.message("Custom Message Here")
  # actual test
end
but the html formatter seems to ignore this. I can see the output in $stdout but not in the html file itself.
Best of luck.

How do I find the path of a template file using ERB?

I am using embedded Ruby (ERB) to generate text files. I need to know the directory of the template file in order to locate another file relative to the template file path. Is there a simple method from within ERB that will give me the file name and directory of the current template file?
I'm looking for something similar to __FILE__, but giving the template file instead of (erb).
When you use the ERB api from Ruby, you provide a string to ERB.new, so there isn’t really any way for ERB to know where that file came from. You can however tell the object which file it came from using the filename attribute:
t = ERB.new(File.read('my_template.erb'))
t.filename = 'my_template.erb'
Now you can use __FILE__ in my_template.erb and it will refer to the name of the file. (This is what the erb executable does, which is why __FILE__ works in ERB files that you run from the command line).
To make this a bit more useful, you could monkey patch ERB with a new method to read from a file and set the filename:
require 'erb'
class ERB
  # these args are the args for ERB.new, which we pass through
  # after reading the file into a string
  # (keyword form, Ruby 2.6+; the positional safe_level, trim_mode and
  # eoutvar arguments are deprecated there)
  def self.from_file(file, trim_mode: nil, eoutvar: '_erbout')
    t = new(File.read(file), trim_mode: trim_mode, eoutvar: eoutvar)
    t.filename = file
    t
  end
end
You can now use this method to read ERB files, and __FILE__ should work in them, and refer to the actual file and not just (erb):
t = ERB.from_file 'my_template.erb'
puts t.result
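To connect this back to the original goal of locating another file relative to the template, here is a minimal standalone sketch. render_erb_file is a hypothetical helper (it does not patch ERB), and the file names in the comments are made up for illustration.

```ruby
require 'erb'

# Renders the ERB template at +path+ with its filename set, so that
# __FILE__ inside the template refers to the template file itself.
def render_erb_file(path)
  template = ERB.new(File.read(path))
  template.filename = path
  template.result
end

# Inside a template such as templates/page.erb, a sibling file could then
# be located relative to the template:
#
#   <%= File.read(File.join(File.dirname(__FILE__), 'footer.txt')) %>
#
# and rendered with:
#
#   puts render_erb_file('templates/page.erb')
```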

Trouble conceptualizing how to have LDA-Ruby read multiple .txt files

I am attempting to write a Ruby script that will look at a collection of unstructured plain text files and I am struggling with thinking through the best way to process these files. The current working version of my script for topic modeling is the following:
#!/usr/bin/env ruby -w
require 'rubygems'
require 'lda-ruby'
# Input a directory of files
FILES_DIRECTORY = ARGV[0]
File.open("files.csv", "w") do |f|
  Dir.glob(FILES_DIRECTORY + "*.txt") do |filename|
    file_id = File.basename(filename).gsub(".txt", "")
    text = File.read(filename).clean
    f.puts [file_id, text].join(",")
  end
end
# Read csv
file = File.open("files.csv", "r") { |f| f.read }
# Train topics and infer
corpus = Lda::Corpus.new
corpus.add_document(Lda::TextDocument.new(corpus, file))
lda = Lda::Lda.new(corpus)
lda.verbose = false
lda.num_topics = 20
lda.em('random')
topics = lda.top_words(10)
puts topics
What I'm attempting to modify is having this program read through a collection of plain text files rather than a single file. It's not as easy as just tossing all the text files into a single file (as it currently does with files.csv) because, as I understand it, lda-ruby looks for multiple files to do a correct topic model rather than a single file. (I've come to this conclusion because there is little variance between having this script read a single text file [e.g., corpus.txt] that includes all the text, and the files.csv file.)
So, my question is how can I have lda-ruby iterate through these text files differently? Should the contents of the files be placed into a hash instead? If so, any pointers on where I should start with that? Or, should I scrap this and use a different LDA library?
Thanks ahead of time for any advice.
Basically, you just need to initialize the corpus before going through the directory and then add each file to the corpus in the block the same way you were previously adding your CSV file.
#!/usr/bin/env ruby -w
require 'rubygems'
require 'lda-ruby'
# Input a directory of files
FILES_DIRECTORY = ARGV[0]
# Create the corpus once, then add every file as its own document
corpus = Lda::Corpus.new
Dir.glob(File.join(FILES_DIRECTORY, "*.txt")) do |filename|
  text = File.read(filename)
  corpus.add_document(Lda::TextDocument.new(corpus, text))
end
lda = Lda::Lda.new(corpus)
lda.verbose = false
lda.num_topics = 20
lda.em('random')
topics = lda.top_words(10)
puts topics
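Since the question also asks whether the file contents should go into a hash, the loading step on its own might look like the sketch below. It uses only the standard library; load_documents is a hypothetical helper, and the lda-ruby calls in the comments simply repeat the ones from the code above.

```ruby
# Reads every .txt file in +directory+ into a hash of
# { file name without extension => file contents }.
def load_documents(directory)
  Dir.glob(File.join(directory, '*.txt')).sort.each_with_object({}) do |path, docs|
    docs[File.basename(path, '.txt')] = File.read(path)
  end
end

# Each value can then be added to the corpus as a separate document:
#
#   corpus = Lda::Corpus.new
#   load_documents(ARGV[0]).each_value do |text|
#     corpus.add_document(Lda::TextDocument.new(corpus, text))
#   end
```

Keeping the file names as keys makes it possible to map results back to the source files later, which the CSV approach was also trying to preserve.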
I know this is a rather old question, but I found this question while looking for a solution to a similar problem. Your code helped me so I thought my answer might be helpful to you or others.
If you have a directory of text files you want to use as documents, you can use the following line to create your corpus:
corpus = Lda::DirectoryCorpus.new('path/to/directory')

My file is getting shorter and I don't know why

I need to edit part of an XML file and save it, but part of the file is not being saved. I want to change <mtn:ttl>4</mtn:ttl> to <mtn:ttl>9</mtn:ttl>. The code below makes that change, but when the file is written, only part of it is saved or its formatting changes. Can anyone tell me how to solve this? The original XML file is 79 KB, but after editing and saving it becomes 78 KB.
require "rexml/text"
require "rexml/document"
include REXML
File.open("c://conf//cad-mtn-config.xml") do |config_file|
  # Open the document and edit the file
  config = Document.new(config_file)
  if testField.to_s.match(/<mtn:ttl>/)
    config.root.elements[4].elements[11].elements[1].elements[1].elements[1].elements[8].text="9"
    # Write the result to a new file.
    formatter = REXML::Formatters::Default.new
    File.open("c://mtn-3//mtn-2.2//conf//cad-mtn-config.xml", 'w') do |result|
      formatter.write(config, result)
    end
  end
end
It looks like you're trying to use regular expressions; why not just use REXML? The only requirement is that you need to know the URL of the namespace. Note that if the element were just ttl rather than mtn:ttl, you would not need the namespace.
require 'rexml/document'
file_path = "path to file"
contents = File.read(file_path)
xml_doc = REXML::Document.new(contents)
xml_doc.add_namespace('mtn', "http://url to mtn namespace")
xml_doc.root.elements.each('mtn:ttl') do |element|
  element.text = "9"
end
File.open(file_path, "w") do |data|
  data << xml_doc
end
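Because the element in the question is nested several levels deep, an XPath that searches the whole document with an explicit namespace mapping may be more robust than indexing by position. The following is a sketch only: set_ttl is a hypothetical helper, and the namespace URI http://example.com/mtn is a placeholder for the real one.

```ruby
require 'rexml/document'

# Sets the text of every mtn:ttl element in +xml+ to +value+ and returns
# the serialized document. +ns_uri+ must be the URI bound to the mtn prefix.
def set_ttl(xml, value, ns_uri)
  doc = REXML::Document.new(xml)
  # '//' searches at any depth; the hash maps the prefix used in the
  # XPath expression to the namespace URI.
  REXML::XPath.each(doc, '//mtn:ttl', 'mtn' => ns_uri) do |element|
    element.text = value.to_s
  end
  doc.to_s
end

sample = '<config xmlns:mtn="http://example.com/mtn"><a><b><mtn:ttl>4</mtn:ttl></b></a></config>'
puts set_ttl(sample, 9, 'http://example.com/mtn')
```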
