Distributed file processing in Hadoop? - hadoop

I have a large number of compressed tar files, where each tar itself contains several files. I want to extract these files, and I want to use Hadoop or a similar technique to speed up the processing. Are there any tools for this kind of problem? As far as I know, Hadoop and similar frameworks like Spark or Flink do not operate on files directly and don't give you direct access to the filesystem. I also want to do some basic renaming of the extracted files and move them into appropriate directories.
I can imagine a solution where one creates a list of all tar files. This list is then passed to the mappers, and a single mapper extracts one file from the list. Is this a reasonable approach?

It is possible to instruct MapReduce to use an input format where the input to each Mapper is a single file. (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3)
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split: each mapper receives exactly one whole file.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
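The WholeFileRecordReader referenced above isn't reproduced at that link's excerpt, so here is a minimal sketch of the standard whole-file pattern (an illustration, not necessarily the exact code behind that URL): it reads the entire file backing the split into a single BytesWritable value.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: emits exactly one record per split, whose value is the whole file.
public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the entire file into memory; fine for modest tar files,
        // problematic if a single file exceeds the heap.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}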
Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
You don't need to pass a list of files to Hadoop; just put all the tar files in a single HDFS directory and use that directory as your input path.
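For illustration, a rough sketch of such a mapper is below. It assumes gzip-compressed tars and takes the key/value types from the WholeFileInputFormat above; the "extract.target.dir" property and the renaming logic are placeholders you would adapt.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: each map() call receives one whole .tar.gz file as bytes,
// unpacks it with Commons Compress, and writes the entries back to HDFS.
public class TarExtractMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // "extract.target.dir" is a made-up property for this sketch; set it in the driver.
        Path targetDir = new Path(conf.get("extract.target.dir", "/tmp/extracted"));

        TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(
                        new ByteArrayInputStream(value.getBytes(), 0, value.getLength())));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Apply whatever renaming/relocation logic you need here.
                Path out = new Path(targetDir, entry.getName());
                FSDataOutputStream os = fs.create(out, true);
                try {
                    IOUtils.copyBytes(tar, os, conf, false);
                } finally {
                    os.close();
                }
                context.write(new Text(out.toString()), NullWritable.get());
            }
        } finally {
            tar.close();
        }
    }
}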

DistCp copies files from one place to another; you can take a look at its docs, but I don't think it offers any decompress or unpack capability. If a file is bigger than main memory, you will probably get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?

Related

How to use the distributed cache in a Hadoop partitioner?

I am new to Hadoop and MapReduce partitioners. I want to write my own partitioner, and I need to read a file in the partitioner. I have searched many times and found that I should use the distributed cache. My question is: how can I use the distributed cache in my Hadoop partitioner? What should I write in my partitioner?
public static class CaderPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return 0;
    }
}
Thanks
The easiest way to work this out is to look at the example partitioners included with Hadoop. In this case the one to look at is the TotalOrderPartitioner, which reads in a pre-generated file to help direct keys.
You can find the source code here, and here's a gist showing how to use it.
First, in your MapReduce job's driver you need to tell the partitioner where the file can be found (on HDFS):
// Define partition file path.
Path partitionPath = new Path(outputDir + "-part.lst");
// Use Total Order Partitioner.
job.setPartitionerClass(TotalOrderPartitioner.class);
// Point the partitioner at the (pre-generated) partition file.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionPath);
In the TotalOrderPartitioner you'll see that it implements Configurable which gives it access to the configuration so it can get the path to the file on HDFS.
The file is read in the public void setConf(Configuration conf) method, which will be called when the Partitioner object is created. At this point you can read the file and do whatever set-up you want.
I would think you can re-use a lot of the code from this partitioner.
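As a hedged illustration of that pattern, here is a minimal sketch of a partitioner for your key/value types that loads a lookup file from HDFS in setConf(). The "cader.partition.file" property and the tab-separated file format are assumptions for this sketch, not Hadoop APIs; the point is that setConf() is where the file gets read, just as in TotalOrderPartitioner.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: a partitioner that loads a lookup file in setConf(), mirroring
// what TotalOrderPartitioner does with its partition file.
public class CaderPartitioner extends Partitioner<Text, IntWritable> implements Configurable {

    private Configuration conf;
    private final Map<String, Integer> lookup = new HashMap<String, Integer>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // "cader.partition.file" is a made-up property; set it in your driver.
        String file = conf.get("cader.partition.file");
        try {
            Path path = new Path(file);
            FileSystem fs = path.getFileSystem(conf);
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed file format: "key<TAB>partitionNumber"
                String[] parts = line.split("\t");
                lookup.put(parts[0], Integer.parseInt(parts[1]));
            }
            reader.close();
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read partition file " + file, e);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        Integer partition = lookup.get(key.toString());
        return partition == null ? 0 : partition % numReduceTasks;
    }
}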

Alternative solutions to the Hadoop/Hive distributed cache for handling a very large dictionary file?

We are creating a dictionary-like application on Hadoop and Hive.
The general process is batch-scanning billions of log entries (e.g. words) against a big fixed dictionary (about 100 GB, like a multi-language WordNet dictionary).
We already have a single-machine version of the Java application (let's call it "singleApp") to query this dictionary. We currently cannot modify either this Java application or the dictionary file, so we cannot redesign and rewrite a completely new MapReduce application. We need to use this single-machine Java application as the building block and extend it to a MapReduce version.
Currently, we are able to create a MapReduce application by calling this "singleApp" and passing a subset of the dictionary (e.g. a 1 GB dictionary) using the distributed cache. However, if we use the full dictionary (100 GB), the app is very, very slow to start. Furthermore, we really want to install these dictionaries in the Hadoop cluster once, without shipping them each time using the -file or distributed cache options.
We tried copying the dictionary files directly onto the local disks of the slave nodes and pointing the Java app at them, but it could not find the dictionary. Are there any documents on what needs to be done if we want to debug this approach further?
Any suggestions on the best practice/process for handling situations like this (very large dictionary files that we would prefer to keep installed all the time)?
You don't need to use Hadoop for 100 GB of data. You can use your distributed cache as a processing platform as well.
Consider your distributed cache as an in-memory data grid.
Try TayzGrid, an open source in-memory data grid with a MapReduce use case such as yours.
public class ProductAnalysisMapper implements
        com.alachisoft.tayzgrid.runtime.mapreduce.Mapper {

    @Override
    public void map(Object ikey, Object ivalue, OutputMap omap) {
        // This line emits the value count to the map.
        omap.emit(ivalue, 1);
    }
}

public class ProductAnalysisReducer implements
        com.alachisoft.tayzgrid.runtime.mapreduce.Reducer {

    public ProductAnalysisReducer(Object k) { /* ... */ }

    @Override
    public void reduce(Object iv) { /* ... */ }

    @Override
    public void finishReduce(KeyValuePair kvp) { /* ... */ }
}

Maintain an array structure per Vertex

Throughout a Giraph graph, I need to maintain an array per vertex to store the results of several "health" checks done at the vertex level.
Is it as simple as writing a new input format, and will the array get carried over?
My worry is that the actual data that feeds the graph does not need to know about this array.
You don't need to read the data from anywhere; if the array is just there to keep temporary calculations between steps, you don't need to read it, nor write it.
You will need to create a new class which implements Writable. You'll store the array within this class and take care of the serialization/deserialization between the supersteps. This is done in these two methods:
@Override
public void write(DataOutput dataOutput) throws IOException {
    // ...
}

@Override
public void readFields(DataInput dataInput) throws IOException {
    // ...
}
Make sure that you read and write the fields in the same order, as they are written into a buffer and a different order would mess everything up.
Afterwards you just need to specify this class as the generic type for the vertex value type.
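As a hedged illustration (the class and field names are made up for this sketch), a vertex value carrying an array of health-check results could look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of a vertex value type holding an array of health-check results.
// What matters is that write() and readFields() handle the fields in the same order.
public class HealthCheckVertexValue implements Writable {

    private double[] healthChecks = new double[0];

    public double[] getHealthChecks() {
        return healthChecks;
    }

    public void setHealthChecks(double[] healthChecks) {
        this.healthChecks = healthChecks;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        // Write the length first so readFields() knows how much to read back.
        dataOutput.writeInt(healthChecks.length);
        for (double check : healthChecks) {
            dataOutput.writeDouble(check);
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        int length = dataInput.readInt();
        healthChecks = new double[length];
        for (int i = 0; i < length; i++) {
            healthChecks[i] = dataInput.readDouble();
        }
    }
}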
However, if you don't initialize the vertex value during the set-up process when you read the input file, you should do it in the first superstep (== 0).
I wrote a blog post about complex data types in Giraph about a year ago; maybe it will help you further, although some things might have changed in the meantime.

MultipleTextOutputFormat alternative in new API

As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?
I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:
public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
or
public <K,V> void write(String namedOutput, K key, V value,
String baseOutputPath)
The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in a similar fashion.
The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs named outputs using the addNamedOutput function:
public static void addNamedOutput(Job job,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Class<?> keyClass,
Class<?> valueClass)
So if you need different output types than the Context is using, you must use the latter write method.
The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:
multipleOutputs.write("output1", key, value, "dir1/part");
In my case, this created files named "dir1/part-r-00000".
I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
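For reference, here is a minimal sketch of how the pieces fit together on the reducer side. The named output "output1" and the Text types are just example choices, and the driver line shown in the comment is assumed to run before job submission.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// In the driver, declare the named output first:
//   MultipleOutputs.addNamedOutput(job, "output1", TextOutputFormat.class,
//                                  Text.class, Text.class);
public class MultiDirReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Ends up as e.g. dir1/part-r-00000 under the job's output directory.
            multipleOutputs.write("output1", key, value, "dir1/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // MultipleOutputs keeps its own writers open, so it must be closed here.
        multipleOutputs.close();
    }
}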
For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?
Basically you can write to HDFS directly from your reducer. You'll just need to be wary of speculative execution and name your files uniquely. Then you'll need to implement your own OutputCommitter to clean up the aborted attempts (this is the most difficult part if you have truly dynamic output folders: you'll need to step through each folder and delete the attempts associated with aborted/failed tasks). A simpler solution is to turn off speculative execution.
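A rough sketch of the "write directly and name files uniquely" idea follows; the "/data/dynamic" base folder is a placeholder, and the key-per-folder convention is an assumption for illustration.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: write output straight to HDFS, one folder per key, with the task
// attempt ID in the file name so retried attempts never collide.
public class DirectHdfsReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // "/data/dynamic" is a placeholder base folder for this sketch.
        Path out = new Path("/data/dynamic/" + key.toString() + "/"
                + context.getTaskAttemptID().toString());
        FSDataOutputStream os = fs.create(out, true);
        try {
            for (Text value : values) {
                os.writeBytes(value.toString() + "\n");
            }
        } finally {
            os.close();
        }
    }
}

To take the simple route instead, the Hadoop 1.x way to switch reduce-side speculative execution off in the driver is job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false).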
For the best answer, turn to Hadoop: The Definitive Guide, 3rd ed. (starting at page 253).
An excerpt from the book:
"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."
It has an example of how you can control directory structure, file naming and output format using the MultipleOutputs API.
HTH.

Hadoop API: OutputFormat for Reducer

I am totally confused by the Hadoop API (I guess it's changing all the time).
If I am not wrong, JobConf was deprecated and we were supposed to use the Job and Configuration classes instead to run a MapReduce job from Java. It seems, though, that in the recently released Hadoop 1.0.0 JobConf is no longer deprecated!
So I am using the Job and Configuration classes to run a MapReduce job. Now I need to put the reducer's output files in a folder structure based on certain values that are part of my map output. I went through several articles and found that one can achieve that with an OutputFormat class, but we have this class in two packages:
org.apache.hadoop.mapred and
org.apache.hadoop.mapreduce
In our Job object we can set an output format class like this:
job.setOutputFormatClass(SomeOutputFormat.class);
Now, if SomeOutputFormat extends, say, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, we get one method named getRecordWriter(); this does not help in any way to override the output path.
There is another way using JobConf, but that again does not seem to work in terms of setting mappers, reducers, partitions, sorting and grouping classes.
Is there something very obvious that I am missing? I want to write my reduce output files inside a folder whose name is based on a value, for example SomeOutputPrefix/Value1/Value2/realReduceFileName.
Thanks!
I think you need to implement your own output format class and your own RecordWriter, which will write different values to different places.
So your SomeOutputFormat would return new SomeRecordWriter("SomeOutputPrefix") from its getRecordWriter() method, and SomeRecordWriter would write different values to different folders.
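A hedged sketch of that idea is below. The Text/Text types and the convention that the key encodes the "Value1/Value2" folder path are assumptions for illustration, not a prescribed API; the file name includes the task attempt ID so parallel attempts don't collide.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: an output format whose record writer picks a subfolder per record.
public class SomeOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // getOutputPath() is the job's output directory ("SomeOutputPrefix").
        return new SomeRecordWriter(getOutputPath(context), context);
    }

    public static class SomeRecordWriter extends RecordWriter<Text, Text> {

        private final Path prefix;
        private final TaskAttemptContext context;
        private final Map<String, FSDataOutputStream> streams =
                new HashMap<String, FSDataOutputStream>();

        public SomeRecordWriter(Path prefix, TaskAttemptContext context) {
            this.prefix = prefix;
            this.context = context;
        }

        @Override
        public void write(Text key, Text value) throws IOException {
            // Assumed convention: the key encodes "Value1/Value2"; one file per
            // key, named with the task attempt so parallel attempts never collide.
            String folder = key.toString();
            FSDataOutputStream os = streams.get(folder);
            if (os == null) {
                Path file = new Path(prefix, folder + "/" + context.getTaskAttemptID());
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                os = fs.create(file, true);
                streams.put(folder, os);
            }
            os.writeBytes(value.toString() + "\n");
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            for (FSDataOutputStream os : streams.values()) {
                os.close();
            }
        }
    }
}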
