Extracting JS objects from PDF files using pdfcpu library - go

I'm trying to extract JavaScript from PDF files as part of a safe-upload requirement. Using pdfcpu I can extract a PDF's objects by index, but one problem is that we receive the file as bytes. Is there a way to work with those bytes directly instead of having to write them to a file on disk?
Also, does pdfcpu provide a way to extract JS in a more declarative manner, instead of having to loop over each object?
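To make the bytes part concrete, here is a rough sketch of the direction I have in mind: wrap the uploaded bytes in a bytes.Reader (which satisfies io.ReadSeeker) and walk the xref table looking for /JS and /JavaScript entries, since I'm not aware of a dedicated JavaScript-extraction helper. The api.ReadContext call and the pkg/pdfcpu/types import path are assumptions based on pdfcpu v0.5.x and may differ in other releases.

```go
// Minimal sketch: read a PDF from an in-memory byte slice (no temp file) and
// scan every object dictionary for /JS or /JavaScript entries.
// Assumes the pdfcpu v0.5.x package layout; older releases expose the same
// types under pkg/pdfcpu instead of pkg/pdfcpu/types.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu/types"
)

// findJavaScript walks the cross-reference table and reports objects that
// carry JavaScript-related keys.
func findJavaScript(pdfBytes []byte) error {
	// bytes.Reader satisfies io.ReadSeeker, so nothing is written to disk.
	ctx, err := api.ReadContext(bytes.NewReader(pdfBytes), nil)
	if err != nil {
		return err
	}

	for objNr, entry := range ctx.XRefTable.Table {
		if entry == nil || entry.Object == nil {
			continue
		}
		d, ok := entry.Object.(types.Dict)
		if !ok {
			continue // plain dictionaries only; streams are skipped in this sketch
		}
		if js, found := d["JS"]; found {
			fmt.Printf("object %d has /JS: %v\n", objNr, js)
		}
		if _, found := d["JavaScript"]; found {
			fmt.Printf("object %d references a /JavaScript name tree\n", objNr)
		}
	}
	return nil
}

func main() {
	// pdfBytes would come from the upload handler in a real service.
	pdfBytes := []byte{} // placeholder
	if err := findJavaScript(pdfBytes); err != nil {
		log.Fatal(err)
	}
}
```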

Related

Parsing a JSON file without JSON.parse()

This is my first time using Ruby. I'm writing an application that parses data and performs some calculations based on it; the source of the data is a JSON file. I'm aware I can use JSON.parse() here, but I'm trying to write my program so that it will work with other sources of data. Is there a clear-cut way of doing this? Thank you.
When your source file is JSON, use JSON.parse. Do not implement a JSON parser on your own. If the source file is a CSV, use the CSV class.
When your application should be able to read multiple different formats, add one reader class for each data type, like JSONReader, CSVReader, etc., and then decide based on the file extension which reader to use to read the file.
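As a rough illustration of that reader-per-format idea (sketched in Go to match the rest of this page; the type names and the record shape are made up for the example):

```go
// Hypothetical reader-per-format pattern: each format gets its own reader,
// and the file extension decides which one is used.
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// Reader turns a source file into the records the application works on.
type Reader interface {
	Read(path string) ([]map[string]any, error)
}

type JSONReader struct{}

func (JSONReader) Read(path string) ([]map[string]any, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var records []map[string]any
	return records, json.Unmarshal(b, &records)
}

type CSVReader struct{}

func (CSVReader) Read(path string) ([]map[string]any, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	rows, err := csv.NewReader(f).ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, err
	}
	// First row is treated as the header; remaining rows become records.
	header, records := rows[0], []map[string]any{}
	for _, row := range rows[1:] {
		rec := map[string]any{}
		for i, col := range header {
			rec[col] = row[i]
		}
		records = append(records, rec)
	}
	return records, nil
}

// readerFor picks a reader based on the file extension.
func readerFor(path string) (Reader, error) {
	switch filepath.Ext(path) {
	case ".json":
		return JSONReader{}, nil
	case ".csv":
		return CSVReader{}, nil
	default:
		return nil, fmt.Errorf("unsupported format: %s", path)
	}
}

func main() {
	path := "data.json" // hypothetical input file
	r, err := readerFor(path)
	if err != nil {
		log.Fatal(err)
	}
	records, err := r.Read(path)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(records), "records loaded")
}
```

The same shape carries over to Ruby: one class per format exposing a common read method, selected by File.extname.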

hwpf, xwpf, hssf, and xslf poi picture extraction

I'm looking to extract all images from new and legacy Word documents and spreadsheets to assist in a real-time document classification system, and looking at the documentation, I seem to have run into a problem. I have no trouble finding documentation in the hwpf module and its packages for extracting images from a file, but when it comes to the other 3, it seems as though they don't support the same methods.
What I want is one block of code that is document-type agnostic across the 4 above-mentioned types. I just want fast, easy access to the pictures in the files so I can move on to my next task, but at this point it looks like only the hwpf module supports picture extraction, via the methods in 'PicturesTable'.
I'm also somewhat concerned about the performance of the library: it looks like it loads the entire file when all I want to do is scrape the images out of it. Any suggestions on a library that operates directly on the 'Data' bytestream and the folder structure of the .***x zip files?
I've already tried using OLEtools to try to extract pictures from the streams, and I'm now moving on to this tool. I haven't tried any tools that operate on the lower levels of the documents yet, though.
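For what it's worth, the zip-based formats (.docx, .xlsx, .pptx) keep their embedded images under word/media/, xl/media/ and ppt/media/ inside the container, so a plain zip walk reaches them without loading any document model. Below is a minimal sketch of that approach in Go (to match the rest of this page; the input file name is a placeholder). It does nothing for the legacy .doc/.xls binaries, which still need a library such as POI.

```go
// Sketch: pull embedded images straight out of an OOXML container (.docx,
// .xlsx, .pptx) by walking the zip structure, without loading a document model.
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"log"
	"strings"
)

// extractImages returns the raw bytes of every file stored under a media/
// folder (word/media/, xl/media/, ppt/media/), keyed by its path in the zip.
func extractImages(path string) (map[string][]byte, error) {
	r, err := zip.OpenReader(path)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	images := map[string][]byte{}
	for _, f := range r.File {
		if !strings.Contains(f.Name, "media/") {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		images[f.Name] = data
	}
	return images, nil
}

func main() {
	imgs, err := extractImages("report.docx") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	for name, data := range imgs {
		fmt.Printf("%s: %d bytes\n", name, len(data))
	}
}
```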

Append other pdf files using gofpdf

We are using gofpdf to write multiple images to a PDF, but we would also like to be able to write other PDFs into it (append them to the document or merge them). We have base64-encoded versions of the PDF files to merge.
We can't seem to find how to do it in the docs or if it's even possible. Does anyone know how?
I tried using RawWriteBuf or RawWriteStr with the base64 data, but neither seems to work.
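As far as I understand, RawWriteBuf and RawWriteStr push raw content-stream data into the current page rather than parsing a whole PDF, so they won't merge documents. One possible route is to decode the base64 and hand the bytes to a dedicated merger such as pdfcpu; the sketch below assumes the older api.MergeRaw signature (newer pdfcpu releases add a divider-page flag), so check it against the version you have.

```go
// Sketch: decode the base64-encoded PDFs and merge them with pdfcpu rather
// than gofpdf. The api.MergeRaw signature used here matches older pdfcpu
// releases and is an assumption; verify it against your pdfcpu version.
package main

import (
	"bytes"
	"encoding/base64"
	"io"
	"log"
	"os"

	"github.com/pdfcpu/pdfcpu/pkg/api"
)

func mergeBase64PDFs(encoded []string, w io.Writer) error {
	var readers []io.ReadSeeker
	for _, enc := range encoded {
		raw, err := base64.StdEncoding.DecodeString(enc)
		if err != nil {
			return err
		}
		// bytes.Reader is an io.ReadSeeker, so nothing touches the disk.
		readers = append(readers, bytes.NewReader(raw))
	}
	// nil config falls back to pdfcpu's defaults.
	return api.MergeRaw(readers, w, nil)
}

func main() {
	out, err := os.Create("merged.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	// encodedDocs would hold the base64 strings received by the service.
	encodedDocs := []string{}
	if err := mergeBase64PDFs(encodedDocs, out); err != nil {
		log.Fatal(err)
	}
}
```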

Need to implement bulk PDF extraction using Tesseract API

I have a large number of PDF documents from which I need to extract text; the extracted text is used for further processing. I did this for a small subset of documents using the Tesseract API in a sequential approach and got the required output. However, this takes a very long time when there is a large number of documents.
I tried to use Hadoop's processing capabilities (MapReduce) and storage (HDFS) to solve this issue. However, I am having trouble fitting the Tesseract API into the Hadoop (MapReduce) approach. Since Tesseract converts the files into intermediate image files, I am confused about how the intermediate images produced by the Tesseract process can be handled inside HDFS.
I have searched for and unsuccessfully tried a few options, such as:
I extracted text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class in Hadoop MapReduce, using Apache PDFBox to extract the text. But for scanned PDFs, which contain only images, this solution does not give me the required results.
I found a few answers on the same topic stating either to use -Fuse, or to generate the image files locally and then upload them into HDFS for further processing. I'm not sure whether this is the correct approach.
I would like to know about approaches to this.
This is an approach I found for processing multiple PDFs to extract text using the Hadoop framework, and then using that text for further processing:
Put all the PDFs to be converted to text in one folder.
Create one text file per PDF containing the path to that PDF, e.g. if I have 10 PDFs to convert, then 10 text files are generated, each containing the unique path to the respective PDF.
These text files are given as input to the MapReduce program.
Because each input file is very small, the framework generates only one input split per input file, e.g. if I have 10 PDFs as input, the framework generates 10 input splits.
From each input split, one line (record) is read by the RecordReader and passed to one mapper as a value. So if there are 10 records (line == file path), the mapper runs 10 times. Since there is one record per input split, one mapper is assigned to the task for that input split.
With 10 input splits, 10 mappers run in parallel.
Inside the mapper, Ghostscript generates images from the PDF, using the file path passed in as the mapper's value. The images are converted to text with Tesseract inside the mapper itself, producing the text of each PDF; this is the mapper's output (a sketch of this conversion step follows below).
This output is passed to the reducer to do other analytics work as required.
This is the current solution. I would like feedback on it.
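To make the mapper's conversion step concrete, here is a standalone sketch of the Ghostscript-then-Tesseract hand-off. It is written in Go only to match the rest of this page; in the actual job this logic lives inside the Java mapper, and the output device, resolution and file names are assumptions.

```go
// Sketch of the per-document conversion step: render a PDF to page images
// with Ghostscript, then OCR each image with Tesseract. Paths and flags are
// illustrative; inside the Hadoop job this logic would sit in the Java mapper.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func pdfToText(pdfPath, workDir string) (string, error) {
	// Render every page to a PNG, e.g. page-001.png, page-002.png, ...
	imgPattern := filepath.Join(workDir, "page-%03d.png")
	gs := exec.Command("gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
		"-r300", "-sOutputFile="+imgPattern, pdfPath)
	if out, err := gs.CombinedOutput(); err != nil {
		return "", fmt.Errorf("ghostscript: %v: %s", err, out)
	}

	// OCR each generated page image; tesseract writes <base>.txt next to it.
	pages, _ := filepath.Glob(filepath.Join(workDir, "page-*.png"))
	var text []byte
	for _, page := range pages {
		base := page[:len(page)-len(".png")]
		tess := exec.Command("tesseract", page, base)
		if out, err := tess.CombinedOutput(); err != nil {
			return "", fmt.Errorf("tesseract: %v: %s", err, out)
		}
		t, err := os.ReadFile(base + ".txt")
		if err != nil {
			return "", err
		}
		text = append(text, t...)
	}
	return string(text), nil
}

func main() {
	txt, err := pdfToText("input.pdf", os.TempDir()) // hypothetical input
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(txt)
}
```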

Pig - load Word documents (.doc & .docx) with pig

I can't load Microsoft Word documents (.doc or .docx) with Pig. Indeed, when I try to do so, using TextLoader(), PigStorage(), or no loader at all, it doesn't work; the output is some weird symbols.
I heard that I could write a custom loader in Java, but that seems really difficult and I don't understand how to program one at the moment.
I would like to put all of a .doc file's content in a single chararray bag so that I can later use a filter function to process it.
How could I do this?
Thanks
They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write the UDF to be able to load the files directly into Pig, or you can do some preprocessing to convert all .doc and .docx files into .txt files so that Pig will be loading those .txt files instead. This link may help you get started in finding a way to convert the files.
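If you go the preprocessing route, note that a .docx is just a zip containing word/document.xml, so the visible text can be pulled from the <w:t> elements with a standard library alone. Here is a rough sketch (in Go, to match the rest of this page; the input file name is a placeholder) that covers .docx only, since legacy .doc is a binary format and still needs something like Apache POI or a doc-to-docx conversion first.

```go
// Sketch of the preprocessing route: open the .docx as a zip, find
// word/document.xml, and collect the text held in <w:t> elements.
package main

import (
	"archive/zip"
	"encoding/xml"
	"fmt"
	"log"
	"strings"
)

func docxText(path string) (string, error) {
	r, err := zip.OpenReader(path)
	if err != nil {
		return "", err
	}
	defer r.Close()

	var sb strings.Builder
	for _, f := range r.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		dec := xml.NewDecoder(rc)
		inText := false
		for {
			tok, err := dec.Token()
			if err != nil {
				break // stop at io.EOF (or any malformed XML)
			}
			switch t := tok.(type) {
			case xml.StartElement:
				inText = t.Name.Local == "t" // <w:t> holds the run text
			case xml.EndElement:
				inText = false
			case xml.CharData:
				if inText {
					sb.Write(t)
				}
			}
		}
		rc.Close()
	}
	return sb.String(), nil
}

func main() {
	txt, err := docxText("input.docx") // hypothetical input
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(txt)
}
```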
However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.
Update: Here are a couple of resources I've used for writing my java (Load) UDFs in the past. One, Two.
