Hadoop approach to outputting millions of small binary/image files

I need to process and manipulate many images in a Hadoop job. The input arrives over the network, with the slow downloads handled via the MultithreadedMapper.
But what is the best approach to the reduce output? I think I should write the raw binary image data into a SequenceFile, transfer those files to their eventual home, then write a small app to extract the individual images from the SequenceFile into individual JPGs and GIFs.
Or is there a better option to consider?

If you feel up to it (or maybe through some Googling you can find an implementation), you could write a FileOutputFormat which wraps a FSDataOutputStream with a ZipOutputStream, giving you a Zip file for each reducer (and thus saving you the effort of writing a SequenceFile extraction program).
Don't be daunted by writing your own OutputFormat, it really isn't that difficult (and much easier than writing custom InputFormats, which have to worry about splits). In fact here's a starting point - you just need to implement the write method:
// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {
    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".zip");
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends
            RecordWriter<Text, BytesWritable> {
        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            zos.close();
        }
    }
}
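The TODO in write might be filled in along these lines (a sketch: it assumes the key holds the path the image should have inside the zip, and copies only the valid portion of the BytesWritable buffer):

```java
// Inside ZipRecordWriter: one zip entry per record
@Override
public void write(Text key, BytesWritable value) throws IOException,
        InterruptedException {
    // Entry name = path of the image inside the zip, taken from the key
    zos.putNextEntry(new ZipEntry(key.toString()));
    // BytesWritable's backing array can be longer than the data,
    // so write only getLength() bytes
    zos.write(value.getBytes(), 0, value.getLength());
    zos.closeEntry();
}
```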

Related

Hadoop MapReduce example for string transformation

I have a large number of strings in a text file and need to transform these strings by the following algorithm: convert each string to lowercase and remove all spaces.
Can you give me an example of a Hadoop MapReduce function which implements that algorithm?
Thank you.
I tried the code below and am getting the output in a single line.
public class toUpper {

    public static class textMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        Text outvalue = new Text();

        public void map(LongWritable key, Text values, Context context)
                throws IOException, InterruptedException {
            String token;
            StringBuffer br = new StringBuffer();
            StringTokenizer st = new StringTokenizer(values.toString());
            while (st.hasMoreTokens()) {
                token = st.nextToken();
                br.append(token.toUpperCase());
            }
            st = null;
            outvalue.set(br.toString());
            context.write(NullWritable.get(), outvalue);
            br = null;
        }
    }

    public static class textReduce extends Reducer<NullWritable, Text, NullWritable, Text> {
        Text outvale = new Text();

        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuffer br = new StringBuffer();
            for (Text st : values) {
                br.append(st.toString());
            }
            outvale.set(br.toString());
            context.write(NullWritable.get(), outvale);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "touipprr");
        job.setJarByClass(toUpper.class);
        job.setMapperClass(textMapper.class);
        job.setReducerClass(textReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 1 : 0);
    }
}
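As a side note on the attempt above: the output lands on a single line because all records share the NullWritable key, so the single reducer concatenates every value into one StringBuffer. The algorithm as originally stated (lowercase, remove all spaces) needs no reducer at all. A minimal map-only sketch (class names are hypothetical):

```java
public class LowerCaseNoSpaces {
    // The transformation itself, kept separate so it is easy to test
    public static String transform(String line) {
        return line.toLowerCase().replaceAll("\\s+", "");
    }

    public static class CleanMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text outValue = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            outValue.set(transform(value.toString()));
            // One output line per input line; keys are not needed
            context.write(NullWritable.get(), outValue);
        }
    }
    // In the driver, call job.setNumReduceTasks(0) to make the job map-only,
    // which preserves one output line per input line
}
```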
In the days when I was playing around with map-reduce, I had a similar thought: there must be some practice or technique through which we can modify every word in a record and do all the cleaning stuff.
If we recap the entire algorithm of map-reduce: we have a map function, which splits the incoming records into tokens with the help of delimiters (perhaps you know about them better). Now, let us try to approach your problem statement in a descriptive manner.
Following are the things that I would try when new to map-reduce:
> I would probably write a map() method which splits the lines for me
> I would possibly run out of options and write a reduce function,
and somehow achieve my objective
The above practice is completely okay, but there is a better technique: it can help you decide whether or not you actually need a reduce function, which gives you more options, lets you focus entirely on achieving your objective, and leaves room to optimize your code.
In situations like the one your problem statement falls into, one class came to my rescue: ChainMapper.
Now, how is the ChainMapper going to work? The following are a few points to consider:
-> The first mapper will read the file from HDFS, split each line as per the delimiter, and store the tokens in the context.
-> The second mapper will get the output from the first mapper, and here you can do all sorts of string-related operations as your business requires, such as encrypting the text or changing to upper or lower case.
-> The operated string, which is the result of the second mapper, is stored in the context again.
-> Now, if you need a reducer to do an aggregation task such as wordcount, go for it.
I have a piece of code which may not be efficient (or some may feel it's horrible), but it serves your purpose, as you might just be playing around with mapreduce.
SplitMapper.java
public class SplitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer xs = new StringTokenizer(value.toString());
        IntWritable dummyValue = new IntWritable(1);
        while (xs.hasMoreElements()) {
            String content = (String) xs.nextElement();
            context.write(new Text(content), dummyValue);
        }
    }
}
LowerCaseMapper.java
public class LowerCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    public void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        String val = key.toString().toLowerCase();
        Text newKey = new Text(val);
        context.write(newKey, value);
    }
}
Since I am performing a wordcount here, I require a reducer.
ChainMapReducer.java
public class ChainMapReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : value) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
To implement the concept of ChainMapper successfully, you must pay attention to every detail of the driver class.
DriverClass.java
public class DriverClass extends Configured implements Tool {
    static Configuration cf;

    public int run(String args[]) throws Exception {
        cf = new Configuration();
        Job j = Job.getInstance(cf);
        // configuration for the first mapper
        Configuration splitMapConfig = new Configuration(false);
        ChainMapper.addMapper(j, SplitMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, splitMapConfig);
        // configuration for the second mapper
        Configuration lowerCaseConfig = new Configuration(false);
        ChainMapper.addMapper(j, LowerCaseMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, lowerCaseConfig);
        j.setJarByClass(DriverClass.class);
        j.setCombinerClass(ChainMapReducer.class);
        j.setReducerClass(ChainMapReducer.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, outputPath);
        outputPath.getFileSystem(cf).delete(outputPath, true);
        return j.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DriverClass(), args);
        System.exit(res);
    }
}
The driver class is pretty much self-explanatory; one only needs to observe the signature of ChainMapper.addMapper(<job-object>, <Mapper-ClassName>, <input/output key and value types>, <configuration-for-the-concerned-mapper>).
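For reference, the full signature in the new API (org.apache.hadoop.mapreduce.lib.chain.ChainMapper) is:

```java
public static void addMapper(Job job,
        Class<? extends Mapper> klass,
        Class<?> inputKeyClass, Class<?> inputValueClass,
        Class<?> outputKeyClass, Class<?> outputValueClass,
        Configuration mapperConf) throws IOException
```

The input key/value classes of each added mapper must match the output key/value classes of the previous one in the chain.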
I hope the solution serves your purpose; please let me know of any issues that might arise when you try to implement it.
Thank you!

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write to your main job output (the keys and values you emit from your job),
or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can store the name/value pairs in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
Using the Configuration, you can only do the opposite:
you can set values in the Driver class
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and get those in the Mapper class
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file stored in HDFS, and then access it in the Driver class. Such files can be treated as "intermediate files".
Just try it and see.
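Reading such an intermediate file back in the driver might look like the sketch below (the path and the one-value-per-line layout are assumptions for illustration):

```java
// In run(), after job.waitForCompletion(true) returns:
FileSystem fs = FileSystem.get(conf);
Path intermediate = new Path("/tmp/feed-info.txt"); // hypothetical path your job wrote to
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(intermediate), "UTF-8"))) {
    String feedName = reader.readLine(); // assumes one value per line
    conf.set("feedName", feedName);      // e.g. hand it to the next job
}
```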

Better way to Split files into 80% and 20% for building model and prediction in MapReduce

I am trying to divide my HDFS file into 2 parts/files,
80% and 20%, for a classification algorithm (80% for modelling and 20% for prediction).
Please provide suggestions for the same.
To split 80% and 20% into 2 separate files, we need to know the exact number of records in the data set, and that is only known if we go through the data set once.
So we would need to write 1 MapReduce job just for counting the number of records, and
a 2nd MapReduce job for separating 80% and 20% into 2 files using MultipleInputs.
Am I on the right track, or is there an alternative for the same?
But again, a small confusion: how do I check whether the reducer got filled with 80% of the data?
I suggest you use Random for splitting the dataset and MultipleOutputs to write the data into separate paths. It can be done with a single map-only job. Here is an example of a mapper that you could use:
public class Splitter extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    MultipleOutputs mos;
    Random rnd = new Random();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        if (rnd.nextDouble() < 0.8) {
            mos.write(key, value, "learning-set");
        } else {
            mos.write(key, value, "test-set");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
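A driver for this splitter might be wired as in the sketch below (names hypothetical). LazyOutputFormat suppresses the default empty part-m-xxxxx files, so only the learning-set-* and test-set-* outputs remain. Note the split is only approximately 80/20, since each record is assigned independently at random:

```java
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "80-20-split");
job.setJarByClass(Splitter.class);
job.setMapperClass(Splitter.class);
job.setNumReduceTasks(0); // map-only job
// Types actually written through MultipleOutputs
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Create output files lazily, so only the named outputs appear
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
```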

Hadoop: Getting the input file name in the mapper only once

I am new to hadoop and currently working on it. I have a small query.
I have around 10 files in the input folder which I need to pass to my map reduce program. I want the file name in my mapper, as my file name contains the time at which the file was created. I saw people using FileSplit to get the file name in the mapper. But if, let's say, my input files contain millions of lines, then every time the mapper code is called it will get the file name and then extract the time from it, which is obviously a repeated, time-consuming operation for the same file. Once I get the time in the mapper, I should not have to derive it from the file name again and again.
How can I achieve this?
You could use the Mapper's setup method to get the filename, as the setup method is guaranteed to run only once, before the map() method is first called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {
    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair
    }
}

How to set multiple Avro schemas with AvroParquetOutputFormat?

In my MapReduce job, I'm using AvroParquetOutputFormat to write to Parquet files using an Avro schema.
The application logic requires multiple types of files getting created by Reducer and each file has its own Avro schema.
The class AvroParquetOutputFormat has a static method setSchema() to set Avro schema of output. Looking at the code, AvroParquetOutputFormat uses AvroWriteSupport.setSchema() which again is a static implementation.
Without extending AvroWriteSupport and hacking the logic, is there a simpler way to achieve multiple Avro schema outputs from AvroParquetOutputFormat in a single MR job?
Any pointers/inputs highly appreciated.
Thanks & Regards
MK
It may be quite late to answer, but I also faced this issue and came up with a solution.
First, there is no built-in support like a 'MultipleAvroParquetOutputFormat' in parquet-mr. But to achieve a similar behavior I used MultipleOutputs.
For a map-only kind of job, set up your mapper like this:
public class EventMapper extends Mapper<LongWritable, BytesWritable, Void, GenericRecord> {
    protected KafkaAvroDecoder deserializer;
    protected String outputPath = "";
    // Using MultipleOutputs to write custom named files
    protected MultipleOutputs<Void, GenericRecord> mos;

    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        outputPath = conf.get(FileOutputFormat.OUTDIR);
        mos = new MultipleOutputs<Void, GenericRecord>(context);
    }

    public void map(LongWritable ln, BytesWritable value, Context context) {
        try {
            GenericRecord record = (GenericRecord) deserializer.fromBytes(value.getBytes());
            AvroWriteSupport.setSchema(context.getConfiguration(), record.getSchema());
            Schema schema = record.getSchema();
            String mergeEventsPath = outputPath + "/" + schema.getName(); // Adding '/' will do no harm
            mos.write((Void) null, record, mergeEventsPath);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
This will create a new RecordWriter for each schema and create a new parquet file per schema, appended with the schema name, for example schema1-r-0000.parquet.
This will also create the default part-r-0000x.parquet files, based on the schema set in the driver. To avoid this, use LazyOutputFormat like:
LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
Hope this helps.
