Pig - load Word documents (.doc & .docx) with pig - hadoop

I can't load Microsoft Word documents (.doc or .docx) with Pig. Indeed, when I try to do so, using TextLoader(), PigStorage(), or no loader at all, it doesn't work: the output is just weird symbols.
I heard that I could write a custom loader in Java, but that seems really difficult and I don't understand how to program one at the moment.
I would like to put all the .doc file content in a single chararray bag so I could later use a filter function to process it.
How could I do this?
Thanks

They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write a UDF that loads the files directly into Pig, or do some preprocessing to convert all .doc and .docx files into .txt files so that Pig loads those .txt files instead. This link may help you get started in finding a way to convert the files.
However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.
Update: Here are a couple of resources I've used for writing my Java (Load) UDFs in the past. One, Two.
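If you do go the UDF route, here is a rough sketch of what a Load UDF for .docx files could look like, assuming Apache POI is on the classpath and that you also write a whole-file InputFormat (the classic non-splittable, one-file-one-record pattern). The class names WordDocLoader and WholeFileInputFormat are illustrative placeholders, not classes from an existing library:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordDocLoader extends LoadFunc {
    private RecordReader<NullWritable, BytesWritable> reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() {
        // WholeFileInputFormat is assumed here: an InputFormat you write yourself
        // that marks files as non-splittable and emits each file as a single
        // BytesWritable record (the classic pattern from "Hadoop: The Definitive Guide").
        return new WholeFileInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                     // no more files
            }
            BytesWritable value = reader.getCurrentValue();
            byte[] bytes = new byte[value.getLength()];
            System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
            // Extract the plain text of the .docx with POI and return it as a
            // single-field tuple: one chararray per document.
            XWPFWordExtractor extractor =
                new XWPFWordExtractor(new XWPFDocument(new ByteArrayInputStream(bytes)));
            return tupleFactory.newTuple(extractor.getText());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

In a script it could then be used as something like A = LOAD '/docs' USING WordDocLoader(); (after registering the jar), giving one tuple with a single chararray per document, which you can then pass to a filter function.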

Related

hwpf, xwpf, hssf, and xslf poi picture extraction

I'm looking to extract all images from new and legacy Word documents and spreadsheets to assist in a real-time document classification system, and looking at the documentation, I seem to have run into a problem. I'm having no problem finding documentation within the hwpf module and packages for extracting images from a file, but when it comes to the other three, it seems as though they don't support the same methods.
What I want is one block of code that is document-type agnostic across the four types mentioned above. I just want fast, easy access to the pictures in the files so I can move on to my next task, but at this point it looks like only the hwpf module supports picture extraction via the methods in 'PicturesTable'.
I'm also somewhat concerned about the performance of the library: it looks like it loads the entire file when all I want to do is scrape the images out of it. Any suggestions on a library that operates directly on the 'Data' bytestream and the folder structure of the .***x zip files?
I've already tried using OLEtools to try to extract pictures from the streams, and I'm now moving on to this tool. I haven't tried any tools that operate on the lower levels of the documents yet, though.
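For reference, a minimal sketch of the hwpf/PicturesTable path referred to above, assuming a local .doc file; the file names are placeholders and this covers only the legacy-Word case, not the format-agnostic code being asked for:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;

public class HwpfPictureDump {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("legacy.doc")) {
            HWPFDocument doc = new HWPFDocument(in);
            // PicturesTable is the HWPF-specific picture index mentioned in the question.
            List<Picture> pictures = doc.getPicturesTable().getAllPictures();
            for (int i = 0; i < pictures.size(); i++) {
                Picture p = pictures.get(i);
                try (FileOutputStream out =
                        new FileOutputStream("picture-" + i + "." + p.suggestFileExtension())) {
                    out.write(p.getContent());   // raw image bytes
                }
            }
        }
    }
}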

ruby excel - write data to existing xls

I have tried different gems in Ruby and also searched a lot, but Ruby doesn't seem to have a solution for writing to an existing Excel file.
My Excel file 'services.xls' has 3 columns:
1st column name is 'inputxlm'
2nd column name is 'methodtoexecute'
3rd column name is 'output'
I have internal logic which takes the inputxlm, processes it using the method, and generates the output.
How do I write the output back to the 'output' column in 'services.xls'?
Note: I don't want to use win32ole, as my organization has some limitations on it.
This article is a great source to find out which library suits you best: https://spin.atomicobject.com/2017/03/22/parsing-excel-files-ruby/
My company uses axlsx combined with axlsx_rails to render xls files with the Rails rendering engine, and axlsx_styler for styling.
Note that in a simple use case like the one you describe, you might not need an Excel file like xls; a mere CSV would suffice, and for that, Ruby has a built-in CSV library.

Need to implement bulk PDF extraction using Tesseract API

I have a large number of PDF documents from which I need to extract text, which I then use for further processing. I did this for a small subset of documents using the Tesseract API in a linear approach and got the required output. However, this takes a very long time when I have a large number of documents.
I tried to use the Hadoop environment's processing capabilities (Map-Reduce) and storage (HDFS) to solve this issue. However, I am having trouble implementing the Tesseract API in the Hadoop (Map-Reduce) approach. Since Tesseract converts the files into intermediate image files, I am confused about how the intermediate image files of the Tesseract process can be handled inside HDFS.
I have searched and unsuccessfully tried a few options earlier, like:
Extracting text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class using Hadoop Map-Reduce; for this I used Apache PDFBox to extract text from the PDFs, but when it comes to scanned PDFs, which contain images, this solution does not give me the required results (a rough sketch of this approach appears below).
I found a few answers on the same topic suggesting either using -Fuse, or generating the image files locally and then uploading them into HDFS for further processing. I am not sure if this is the correct approach.
I would like to know about approaches to this.
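For context, here is a rough sketch of the kind of PdfInputFormat described in the first option above, assuming Apache PDFBox 2.x. The class names mirror the description, but the details are illustrative; as noted, PDFTextStripper only reads the text layer, which is why scanned (image-only) PDFs come back empty:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;                            // a PDF has to be read as a whole
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PdfRecordReader();
    }

    public static class PdfRecordReader extends RecordReader<Text, Text> {
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean processed = false;
        private FileSplit fileSplit;
        private TaskAttemptContext context;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path);
                 PDDocument doc = PDDocument.load(in)) {
                // Only the text layer is extracted; image-only pages yield nothing.
                value.set(new PDFTextStripper().getText(doc));
            }
            key.set(path.getName());
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}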
Here is an approach I found to process multiple PDFs and extract text using the power of the Hadoop framework, and then use that text for further processing:
Put all the PDFs to be converted to text in one folder.
Create one text file per PDF containing the path to that PDF, e.g. if I have 10 PDFs to convert, then 10 text files are generated, each containing the unique path to the respective PDF.
These text files are given as input to the map-reduce program.
Because the input file size is very small, only 1 input split is generated by the framework per input file, e.g. if I have 10 PDFs as input, then the framework will generate 10 input splits.
From each input split, one line (record) is read by the RecordReader and passed to one mapper as a value. So if there are 10 records (line == file path) in the input text files, the mapper will run 10 times. Since there is one record per input split, one mapper is assigned to the task for that input split.
Since I have 10 input splits, 10 mappers will run in parallel.
Inside the mapper, Ghostscript generates images from the PDF, using the file path passed in the mapper's value. The images are converted to text using Tesseract inside the mapper itself, giving the text of each PDF. This is the output.
This is passed to the reducer to do other analytics work as required.
This is the current solution (a rough sketch of the mapper is shown below). I would like feedback on it.
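A minimal sketch of the mapper step, assuming the gs (Ghostscript) and tesseract binaries are installed on every task node; the class name, temp paths, and command-line flags are illustrative, not taken from the description above:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PdfOcrMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each record is one line of the small input text file: the HDFS path of a PDF.
        String hdfsPath = value.toString().trim();
        File workDir = Files.createTempDirectory("ocr").toFile();
        File localPdf = new File(workDir, "input.pdf");

        // 1. Copy the PDF from HDFS to the local disk of this task node.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        fs.copyToLocalFile(new Path(hdfsPath), new Path(localPdf.getAbsolutePath()));

        // 2. Ghostscript: render each page as a 300-dpi PNG (page-%03d.png).
        run(new String[] {"gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r300",
                "-sOutputFile=" + new File(workDir, "page-%03d.png").getAbsolutePath(),
                localPdf.getAbsolutePath()});

        // 3. Tesseract: OCR every rendered page and concatenate the text.
        StringBuilder text = new StringBuilder();
        File[] pages = workDir.listFiles((dir, name) -> name.endsWith(".png"));
        for (File page : pages) {
            File outBase = new File(workDir, page.getName() + "-out");
            run(new String[] {"tesseract", page.getAbsolutePath(), outBase.getAbsolutePath()});
            text.append(new String(Files.readAllBytes(
                    new File(outBase.getAbsolutePath() + ".txt").toPath())));
        }

        // 4. Emit (pdf path, extracted text); the reducer can then do further analytics.
        context.write(new Text(hdfsPath), new Text(text.toString()));
    }

    // Small helper to run an external command and fail the task if it fails.
    private static void run(String[] cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }
}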

How to load complex web logs syntax with Pig?

I am a complete beginner with Pig. I have installed cdh4 Pig and I am connected to a cdh4 cluster. We need to process web log files that are going to be massive (the files are already being loaded to HDFS). Unfortunately the log syntax is quite involved (not a typical comma-delimited file). A restriction is that I cannot currently pre-process the log files with some other tool, because they are just too huge and we can't afford to store a copy. Here is a raw line from the logs:
"2013-07-02 16:17:12
-0700","?c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a4034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-d053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0
(iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML,
like Gecko) Mobile/10B329
[FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone
OS;FBSV/6.1.3;FBSS/2;
FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn,
nn.nn.0.20"
As you probably noticed, there is some JSON embedded there, but it is URL-encoded. After URL decoding (can Pig do URL decoding?), here is how the JSON looks:
{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a4034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-d053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}
I need to extract the different fields in the JSON, including the "things" field, which is in fact a collection. I also need to extract the other query-string values in the log. Can Pig directly deal with this kind of source data, and if so, could you be so kind as to guide me through how to get Pig to parse and load it?
Thank you!
For such a complicated task, you usually need to write your own Load function.
I recommend Chapter 11, Writing Load and Store Functions, in Programming Pig; the Load/Store Functions section of the official documentation is too brief.
I experimented plenty and learned tons. I tried a couple of JSON libraries, piggybank, and java.net.URLDecoder, and even tried CSVExcelStorage. I registered the libraries and was able to solve the problem partially. When I ran the tests against a larger data set, it started hitting encoding issues in some lines of the source data, resulting in exceptions and job failure. So I ended up using Pig's built-in regex functionality to extract the desired values:
A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\\\"','"');
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');
-- extract each of the fields from the embedded json
A = foreach A GENERATE
REGEX_EXTRACT($0,'^.*"redirectId":"([^"\\}]+).*$',1) as redirectId,
REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\\}]+).*$',1) as fromUserId,
REGEX_EXTRACT($0,'^.*"userId":"([^"\\}]+).*$',1) as userId,
REGEX_EXTRACT($0,'^.*"listId":"([^"\\}]+).*$',1) as listId,
REGEX_EXTRACT($0,'^.*"c":"([^"\\}]+).*$',1) as eventType,
REGEX_EXTRACT($0,'^.*"renderSource":"([^"\\}]+).*$',1) as renderSource,
REGEX_EXTRACT($0,'^.*"renderType":"([^"\\}]+).*$',1) as renderType,
REGEX_EXTRACT($0,'^.*"engageType":"([^"\\}]+).*$',1) as engageType,
REGEX_EXTRACT($0,'^.*"clientTime":"([^"\\}]+).*$',1) as clientTime,
REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\\}]+).*$',1) as clientTimeZone;
I decided not to use REGEX_EXTRACT_ALL in case the order of the fields varies.

Is it wrong to put all Codeigniter translations into one translation file

Looking for some advice...
I'm creating a multi-language site with Codeigniter. CI allows me to create several language files, e.g. one per controller, and to load language files whenever I need them.
To me it sounds easier to just work with one language file and auto-load it, but this approach doesn't seem to be encouraged. Can anyone tell me if working with one language file (per language) is OK, or should I use a language file per controller?
It depends on the size of your file. If your single file is too big, then every time you load it all of its data is loaded and your script takes much more memory; with a big language file it is better to use multiple files and load each one only when needed. If your language file is small, it is better to use a single file, so that you don't have to manage several and it stays simple to use.

Resources