In MapReduce, emitting from a reducer writes to an output file with a name like "part-00000". What if I want to write output into two different files (with two different names, naturally) from within a reducer? If that is possible, how can I change the names of the output files from the default?
Use MultipleTextOutputFormat. MultipleOutputFormat lets you write the output data to different output files; its two variants are MultipleSequenceFileOutputFormat and MultipleTextOutputFormat.
A simple example is sketched below.
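This is a rough illustration, not the library's documented example: assuming the old mapred API (the class name KeyBasedOutput is mine), subclass MultipleTextOutputFormat and override generateFileNameForKeyValue so the key decides which file each record lands in.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Routes each record to a file named after its key, so keys "foo" and
    // "bar" yield output files "foo" and "bar" instead of a single part-00000.
    public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // "name" is the default leaf file name (e.g. part-00000); returning
            // something else redirects the record. Same key => same file.
            return key.toString();
        }
    }

Wire it in with conf.setOutputFormat(KeyBasedOutput.class) on the JobConf; everything the reducer emits under a given key then accumulates in that key's file.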
I have a large number of PDF documents from which I need to extract text; the extracted text is used for further processing. I did this for a small subset of documents using the Tesseract API in a linear, single-machine approach, and I get the required output. However, this takes a very long time when I have a large number of documents.
I tried to use the Hadoop environment's processing capabilities (MapReduce) and storage (HDFS) to solve this issue. However, I am facing problems implementing the Tesseract API in the Hadoop MapReduce approach. Since Tesseract converts the files into intermediate image files, I am confused about how the intermediate image files of the Tesseract process can be handled inside HDFS.
I have searched for, and unsuccessfully tried, a few options earlier, such as:
I extracted text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class in Hadoop MapReduce, using Apache PDFBox to extract the text. But when it comes to scanned PDFs, which contain images, this solution does not give me the required results.
I found a few answers on the same topic stating that using FUSE will help, or that one should generate the image files locally and then upload them into HDFS for further processing. I am not sure if either is the correct approach.
I would like to know what approaches exist for this.
Here is an approach I found for processing multiple PDFs to extract text using the power of the Hadoop framework, and then using that text for further processing:
Put all the PDFs to be converted to text in one folder.
Create one text file per PDF containing the path to that PDF. E.g., if I have 10 PDFs to convert, I generate 10 text files, each containing the unique path to the respective PDF.
These text files are given as input to the MapReduce program.
Because each input file is very small, the framework generates one input split per input file. E.g., if I have 10 PDFs as input, the framework generates 10 input splits.
From each input split, one line (record) is read by the RecordReader and passed to one mapper as a value. So if there are 10 records (one file path per line), the mapper runs 10 times. Since there is one record per input split, one mapper is assigned to each input split.
With 10 input splits, 10 mappers run in parallel.
Inside the mapper, Ghostscript generates images from the PDF whose path arrives in the mapper's value attribute. The images are converted to text using Tesseract inside the mapper itself, yielding the text of each PDF. This is the mapper's output, as sketched after this list.
This text is passed to the reducer to do other analytics work as required.
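A minimal sketch of such a mapper, with heavy assumptions: the gs and tesseract binaries must be installed on every worker node, and all names here (OcrMapper, the scratch file names) are mine, not a tested implementation.

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each input record is the HDFS path of one PDF. The mapper copies the PDF
    // to the task's local working directory, shells out to Ghostscript to render
    // page images, runs Tesseract on each image, and emits (pdf name, OCR text).
    public class OcrMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text pdfPath, Context context)
                throws IOException, InterruptedException {
            Path src = new Path(pdfPath.toString().trim());
            FileSystem.get(context.getConfiguration())
                      .copyToLocalFile(src, new Path("input.pdf"));

            // Render one PNG per page into the local scratch space.
            run("gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m", "-r300",
                "-sOutputFile=page-%03d.png", "input.pdf");

            StringBuilder text = new StringBuilder();
            File[] pages = new File(".").listFiles((d, n) -> n.endsWith(".png"));
            if (pages != null) {
                for (File page : pages) {
                    // Tesseract appends ".txt" to the output base name.
                    run("tesseract", page.getName(), page.getName());
                    text.append(new String(java.nio.file.Files.readAllBytes(
                            new File(page.getName() + ".txt").toPath())));
                }
            }
            context.write(new Text(src.getName()), new Text(text.toString()));
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("command failed: " + String.join(" ", cmd));
            }
        }
    }

Note that the intermediate images never touch HDFS here; they live and die in the task's local scratch directory, which sidesteps the question of storing them.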
This is my current solution. I would like feedback on it.
I have a MapReduce job which reads a text file and creates a Parquet file from it, while at the same time writing to a simple text file as output. I used a multiple-output format for that, but the multiple-output object can be initialized for writing either the Parquet file or the text file at a time, not both. I need to accommodate both in a single mapper. Any help is highly appreciated.
Not sure it's the best way, but you could just initialize a StringBuilder in your mapper's setup() method, append all the text values to it during map(), and then write it to disk in the cleanup() method. Whether this works depends on the size of your text output and whether you have enough memory. That way the text file doesn't need to be a mapper output at all, and your mapper's output can be the Parquet data only.
You could use context.getInputSplit() (or something similar) for the text output file names so that each mapper writes a uniquely named file and you know which output corresponds to which input.
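A minimal sketch of that idea, assuming the side-output directory /side-output and the class name are placeholders, and eliding the Parquet-bound output types:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Buffers the text side-output in memory and flushes it once in cleanup(),
    // leaving the job's real output channel free for the Parquet data.
    public class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            buffer.append(value.toString()).append('\n');
            // ... emit the Parquet-bound record via context.write(...) here ...
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            // Name the file after the input split so each mapper's file is unique.
            FileSplit split = (FileSplit) context.getInputSplit();
            Path out = new Path("/side-output/" + split.getPath().getName() + ".txt");
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (FSDataOutputStream stream = fs.create(out)) {
                stream.write(buffer.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }

If one input file can produce several splits, fold split.getStart() into the file name as well to keep the names unique.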
How can I use WholeFileInputFormat with many files as input?
Many files as one file...
FileInputFormat.addInputPaths(job, String ...); doesn't seem to work properly.
You need to make isSplitable() in your InputFormat return false so that each input file does not get split and is processed by just one mapper; see the sketch below. One small suggestion, though: you could give SequenceFile a try. Combine the multiple files you are trying to process into a single SequenceFile and then process that. It would be more efficient, since SequenceFiles are already in key/value form.
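A minimal sketch, assuming you already have a record reader that delivers a whole file as a single record (often called WholeFileRecordReader; not shown here):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // One record per file: isSplitable() returning false guarantees each file
    // becomes exactly one split, processed whole by a single mapper.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader(); // reads the entire split as one record
        }
    }

On the addInputPaths() call: the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPaths(job, ...) takes a single comma-separated string of paths rather than varargs, which may be why it did not seem to work; alternatively, call addInputPath() once per path.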
Summary: can I specify some action to be executed on each output file after it is written with Hadoop Streaming?
Basically, this is a follow-up to the question "Easiest efficient way to zip output of hadoop mapreduce". For each key X I want its value written to a file X.txt, compressed into an X.zip archive. But when we write the zip output stream, it's hard to tell anything about the key or the name of the resulting file, so we end up with an X.zip archive containing default-name.txt.
Renaming the archive contents would be a very simple operation, but where can I place it? What I don't want to do is download all the zips from S3 and then upload them back.
Consider using a custom MultipleOutputFormat:
Basic use cases:
Case 1: the class is used for a map-reduce job with at least one reducer, where the reducer wants to write data to different files depending on the actual keys. It is assumed that a key (or value) encodes both the actual key (value) and the desired location for that key (value).
Case 2: the class is used for a map-only job that wants an output file name that is either a part of the input file name of the input data, or some derivation of it.
Case 3: the class is used for a map-only job that wants an output file name that depends on both the keys and the input file name.
You can also control which key goes to which reducer (with a custom Partitioner).
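For the per-key zip case, a minimal sketch under the old mapred API (the class name is mine): generateFileNameForKeyValue names the leaf file after the key, and generateActualKey suppresses the key so only the value is written, giving the zip step an X.txt to compress instead of default-name.txt.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Each key X gets its own X.txt containing only the values for that key.
    public class KeyNamedOutput extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return key.toString() + ".txt"; // leaf file named after the key
        }

        @Override
        protected Text generateActualKey(Text key, Text value) {
            return null; // write only the value into X.txt, not "key<TAB>value"
        }
    }

With Hadoop Streaming you would ship this class in a jar via -libjars and select it with -outputformat; the zipping itself stays wherever it is now, but it now sees a sensibly named file.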
I have two files with different data formats in HDFS. What would the job setup look like if I needed to reduce across both data files?
E.g., imagine the common word-count problem, where in one file the word delimiter is a space and in the other an underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.
How can I do that?
Or is there a better solution than mine?
Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path you pass in the InputFormat and, optionally, the Mapper class; see the wiring sketch below.
If you are looking for code examples on Google, search for "reduce-side join", which is where this method is typically used.
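A wiring sketch for the word-count setup described above; SpaceTokenMapper, UnderscoreTokenMapper, and WordCountReducer are assumed to exist and to agree on the intermediate (Text, IntWritable) types:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiFormatDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multi-format word count");
            job.setJarByClass(MultiFormatDriver.class);

            // One mapper per input format; both emit (Text word, IntWritable one).
            MultipleInputs.addInputPath(job, new Path(args[0]),
                    TextInputFormat.class, SpaceTokenMapper.class);
            MultipleInputs.addInputPath(job, new Path(args[1]),
                    TextInputFormat.class, UnderscoreTokenMapper.class);

            job.setReducerClass(WordCountReducer.class); // one shared reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }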
On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space-delimited and another that is underscore-delimited, load both with the same mapper and TextInputFormat, tokenize each line on both possible delimiters, and count the number of tokens from the two resulting token sets. In the word-count example, pick the tokenization with more tokens; a sketch of such a mapper is given below.
This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on commas and then see how many tokens there are: if there are, say, 5 tokens the record is from data set A, and if 7 tokens it is from data set B.
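And a sketch of the single-mapper hack for the space/underscore case (the class name is mine):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tokenizes each line on both candidate delimiters and keeps whichever
    // split produced more tokens, so one mapper handles both file formats.
    public class EitherDelimiterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] bySpace = line.toString().split(" ");
            String[] byUnderscore = line.toString().split("_");
            String[] words = bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
            for (String word : words) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }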