Using radix reports as blogdown posts, and saving .md from radix - rstudio

Many of my rmarkdown blogdown blog articles use netlify's ability to serve up .md files which RStudio makes from .Rmarkdown files. I'll bet that I can serve up an .html file from a radix document, but these are large. Is there a way to save the intermediate .md file produced by knitting to radix under RStudio, and would such a file be suitable for netlify?

Related

hwpf, xwpf, hssf, and xslf poi picture extraction

I'm looking to extract all images from new and legacy Word documents and spreadsheets to assist in a real time document classification system, and looking at the documentation, I seem to have run into a problem. I'm having no problems finding documentation within the hwpf module and packages for extracting images from the file, but when it comes to the other 3, it seems as though they don't support the same methods.
What I want is one block of code that is document-type agnostic across the 4 types mentioned above. I just want fast, easy access to the pictures in the files so I can move on to my next task, but at this point it looks like only the hwpf module supports extraction of pictures or the methods in 'PicturesTable'.
I'm also somewhat concerned about the performance of the library: it looks like it loads the entire file when all I want to do is scrape the images out of it. Any suggestions on a library that operates directly on the 'Data' bytestream and the folder structure of the .***x zip files?
I've already tried using OLEtools to extract pictures from the streams, and I'm now moving on to this tool. I haven't tried any tools that operate on the lower levels of the documents yet, though.
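For the zip-based formats only (.docx, .xlsx, .pptx), the "operate directly on the folder structure" idea can be sketched with nothing but a zip reader. This is a minimal sketch, not a POI replacement: the function name `extract_ooxml_images` and the constant `MEDIA_PREFIXES` are made up for illustration, it assumes the conventional `word/media/`, `xl/media/` and `ppt/media/` folders, and it does not handle the legacy OLE2 formats (.doc, .xls) at all.

```python
import io
import zipfile

# Folders where Office conventionally stores embedded media inside the package
# (assumption: files written by other tools may use different part names).
MEDIA_PREFIXES = ("word/media/", "xl/media/", "ppt/media/")

def extract_ooxml_images(source):
    """Return {archive_path: raw_bytes} for each media entry in an OOXML file.

    `source` may be a filesystem path or a binary file-like object.
    Only the media entries are decompressed; the document XML is never parsed.
    """
    images = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            # Skip directory entries, keep anything under a media folder.
            if name.startswith(MEDIA_PREFIXES) and not name.endswith("/"):
                images[name] = package.read(name)
    return images
```

Because only the zip directory and the matching entries are read, the cost should scale with the size of the images rather than with the whole document model.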

Append other pdf files using gofpdf

We are using gofpdf to write multiple images to PDF but we would like to also be able to write other pdfs into it (append it to the document or merge it). We have a base64 of the pdf files to merge.
We can't seem to find how to do it in the docs or if it's even possible. Does anyone know how?
I tried to use RawWriteBuf or RawWriteStr with the base64 data, but neither seems to work.

Sphinx: Use already translated files without creating translatable files

I am trying to figure a way to use already translated .md files (Russian Language) with Sphinx. I use readthedocs.io and I have already read the process of making translatable files (.po/.pot) from:
(1) https://docs.readthedocs.io/en/latest/localization.html
(2) https://docs.readthedocs.io/en/latest/guides/manage-translations.html.
This process requires making .po or .pot files, translating them, and then producing translated html files, served under https://project.readthedocs.io/$language/$version
What I want is to use a different directory (for example named ru) and place there the Russian .md files.
Is that possible? How is it possible to avoid creating these .po/.pot files?

Is there an efficient way in docpad to keep static and to-be-rendered files in the same directory?

I am rebuilding a site with docpad and it's very liberating to form a folders structure that makes sense with my workflow of content-creation, but I'm running into a problem with docpad's hard-division of content-to-be-rendered vs 'static'-content.
Docpad recommends that you put things like images in /files instead of /documents, and the documentation makes it sound as if otherwise there will be some processing overhead incurred.
First, I'd like an explanation if anyone has it of why a file with a single extension (therefore no rendering) and no YAML front-matter, such as a .jpg, would impact site-regeneration time when placed within /documents.
Second, the real issue: is there a way, if it does indeed create a performance hit, to mitigate it? For example, to specify an 'ignore' list with regex, etc...
My use case
I would like to do this for posts and their associated images to make authoring a post more natural. I can easily see the images I have to work with and all the related files are in one place.
I am also doing this for an artwork I am displaying. In this case it's an even stronger use case, as the only data in my html.eco file is YAML front matter of various metadata; my layout automatically generates the gallery from all the attached images located in a folder of the same name as the post. I can match the relative output path folder in my /files directory but it's error prone, because you're in one folder (src/files/artworks/) when creating the folder of images and another (src/documents/artworks/) when creating the html file -- typos are far more likely (as you can't ever see the folder and the html file side by side)...
Even without justifying a use case I can't see why docpad should be putting forth such a hard division. A performance consideration should not be passed on to the end user like that if it can be avoided in any way; since with docpad I am likely to be managing my blog through the file system I ought to have full control over that structure and certainly don't want my content divided up based on some framework limitation or performance concern instead of based on logical content divisions.
I think the key is the line about "metadata". Even though a file does NOT have a double extension, it can still have metadata at the top of the file which needs to be scanned and read. The double extension really just tells docpad to convert the file from one format and output it as another. If I create a straight html file in the documents folder I can still include the metadata header in the form:
---
tags: ['tag1','tag2','tag3']
title: 'Some title'
---
When the file is copied to the out directory, this metadata will be removed. If I do the same thing to a html file in the files directory, the file will be copied to the out directory with the metadata header intact. So, the answer to your question is that even though your file has a single extension and is not "rendered" as such, it still needs to be opened and processed.
The point you make about keeping images and documents together, however, is a good one. I can see a good argument for excluding certain file extensions (like image files) from being processed. Or perhaps, only including certain file extensions.
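To illustrate why a single-extension file still has to be opened, here is a rough sketch of the kind of scan involved. It is not docpad's actual code (docpad is not written in Python, and the function name `split_front_matter` is made up); it only shows that deciding whether a file carries a `---` header means reading at least its first lines.

```python
def split_front_matter(text):
    """Split a document into (metadata_block, body).

    Returns (None, text) when the text does not open with a '---' line
    or when the opening marker is never closed.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two markers is metadata;
            # everything after the closing marker is the body.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return None, text
```

An extension-based ignore list would let a generator skip this check entirely for files such as .jpg, which is the kind of exclusion being asked for.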

Pig - load Word documents (.doc & .docx) with pig

I can't load Microsoft Word documents (.doc or .docx) with Pig. When I try to do so using TextLoader(), PigStorage() or no loader at all, it doesn't work. The output is some weird symbols.
I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to program one at the moment.
I would like to put all the .doc file content in a single chararray bag so I could later use a filter function to process it.
How could I do this?
Thanks
They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write the UDF to be able to load the files directly into Pig, or you can do some preprocessing to convert all .doc and .docx files into .txt files so that Pig will be loading those .txt files instead. This link may help you get started in finding a way to convert the files.
However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.
Update: Here are a couple of resources I've used for writing my Java (Load) UDFs in the past. One, Two.
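If you go the preprocessing route first, the .docx half can be sketched with the Python standard library alone, since a .docx file is a zip archive whose main text lives in `word/document.xml`. This is only a sketch under assumptions: the function name `docx_to_text` is made up, it ignores headers, footers and footnotes, paragraphs nested inside text boxes may be repeated, and it does nothing for legacy .doc files, which need a different tool.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back up.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Writing the returned string to a .txt file gives Pig something that TextLoader() can read as plain lines.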
