How to set a custom input format in MapReduce? - hadoop

I am writing a MapReduce program using the classes in org.apache.hadoop.mapred.*. Can anybody tell me the cause of this error? My CustomInputFormat class extends InputFormat and I have overridden the createRecordReader method.
The signature of my CustomInputFormat is:
class ParagraphInputFormat extends InputFormat {

    @Override
    public RecordReader createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new CustomRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext arg0) throws IOException,
            InterruptedException {
        // TODO Auto-generated method stub
        return null;
    }
}
And the signature of CustomRecordReader is class CustomRecordReader extends RecordReader.
While declaring this class I used org.apache.hadoop.mapreduce.*. I am confused between org.apache.hadoop.mapred.* and org.apache.hadoop.mapreduce.*. Eclipse keeps showing deprecation messages sometimes. I heard that Apache added some classes, then removed them, and then added the previous classes back. Is it due to that? Is it affecting my code?
JobConf conf = new JobConf(new Configuration(),MyMRJob.class);
conf.setJobName("NameofJob");
conf.setOutputKeyClass(CutomeKeyClass.class); //no error on this line
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MYMap.class);
conf.setCombinerClass(MyReduce.class);
conf.setReducerClass(MyReduce.class);
conf.setInputFormat(CustomInputFormat.class); //ERROR on this line while typing
conf.setOutputFormat(IntWritable.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);

Your input format extends InputFormat from the mapreduce package (it extends rather than implements, and the signature matches that of the new API), yet your job configuration is using the old API (JobConf rather than Job).
So you'll either need to amend your custom input format to implement the old-API InputFormat (o.a.h.mapred.InputFormat), or amend your job configuration to use the new API (Job).
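For reference, here is a rough new-API version of the driver from the question. It is only a sketch that reuses the question's class names (MYMap, MyReduce, CutomeKeyClass, CustomInputFormat), which are assumed to be written against the org.apache.hadoop.mapreduce classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMRJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "NameofJob"); // on Hadoop 2+ prefer Job.getInstance(conf, "NameofJob")
        job.setJarByClass(MyMRJob.class);
        job.setMapperClass(MYMap.class);
        job.setCombinerClass(MyReduce.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(CutomeKeyClass.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(CustomInputFormat.class); // accepted because Job expects a mapreduce.InputFormat
        job.setOutputFormatClass(TextOutputFormat.class); // an OutputFormat class, not a value class
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}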

Hey, I faced the same problem; then I used the classes from
org.apache.hadoop.mapreduce instead of
org.apache.hadoop.mapred.
It's a problem of the old vs. new API, and for this don't configure the job with JobConf; use the Job class for configuration only...

Related

Hadoop MapReduce example for string transformation

I have a large number of strings in a text file and need to transform these strings by the following algorithm: convert the string to lowercase and remove all spaces.
Can you give me an example of a Hadoop MapReduce job which implements that algorithm?
Thank you.
I tried the code below and am getting the output in a single line.
public class toUpper {

    public static class textMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        Text outvalue = new Text();

        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            String token;
            StringBuffer br = new StringBuffer();
            StringTokenizer st = new StringTokenizer(values.toString());
            while (st.hasMoreTokens()) {
                token = st.nextToken();
                br.append(token.toUpperCase());
            }
            st = null;
            outvalue.set(br.toString());
            context.write(NullWritable.get(), outvalue);
            br = null;
        }
    }

    public static class textReduce extends Reducer<NullWritable, Text, NullWritable, Text> {
        Text outvale = new Text();

        // every record shares the single NullWritable key, so one reduce call
        // concatenates all lines, which is why the output ends up on a single line
        public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            StringBuffer br = new StringBuffer();
            for (Text st : values) {
                br.append(st.toString());
            }
            outvale.set(br.toString());
            context.write(NullWritable.get(), outvale);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "touipprr");
        job.setJarByClass(toUpper.class);
        job.setMapperClass(textMapper.class);
        job.setReducerClass(textReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // exit 0 on success
    }
}
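For comparison, here is a minimal map-only sketch of the transformation the question actually asks for (lowercase, all spaces removed). This is only an illustrative example assuming plain TextInputFormat/TextOutputFormat; with zero reducers, each input line stays on its own output line instead of being concatenated:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ToLowerNoSpaces {
    public static class CleanMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // lowercase the line and drop all whitespace
            out.set(value.toString().toLowerCase().replaceAll("\\s+", ""));
            context.write(NullWritable.get(), out); // NullWritable keys are omitted by TextOutputFormat
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tolower-nospaces"); // new Job(conf, name) on Hadoop 1.x
        job.setJarByClass(ToLowerNoSpaces.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0); // map-only: no global concatenation in a reducer
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}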
In the days when I was playing around with map-reduce, I had a similar thought: there must be some practice or technique through which we can modify every word in a record and do all the cleaning stuff.
When we recap the entire algorithm of map-reduce, we have a map function which splits the incoming records into tokens with the help of delimiters (perhaps you know them better). Now, let us try to approach your problem statement in a descriptive manner.
Following are the things that I would try when new to map-reduce:
> I would probably write a map() method which splits the lines for me
> I would possibly run out of options and write a reduce function
and somehow achieve my objective
The above practice is completely okay, but there is a better technique that can help you decide whether or not you need the reduce function at all, giving you more options, letting you focus on your objective and on optimizing your code.
In situations like yours, a class came to my rescue: ChainMapper.
Now, how is ChainMapper going to work? The following are a few points to be considered:
-> The first mapper will read the file from HDFS, split each line as per the delimiter and store the tokens in the context.
-> The second mapper will get the output from the first mapper, and here you can do all sorts of string-related operations as your business requires, such as encrypting the text or changing it to upper or lower case.
-> The operated string, which is the result of the second mapper, is stored in the context again.
-> Now, if you need a reducer to do an aggregation task such as a word count, go for it.
I have a piece of code which may not be efficient (or some may feel it's horrible), but it serves your purpose, as you might just be playing around with MapReduce.
SplitMapper.java
public class SplitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer xs = new StringTokenizer(value.toString());
        IntWritable dummyValue = new IntWritable(1);
        while (xs.hasMoreElements()) {
            String content = (String) xs.nextElement();
            context.write(new Text(content), dummyValue);
        }
    }
}
LowerCaseMapper.java
public class LowerCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {

    @Override
    public void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        String val = key.toString().toLowerCase();
        Text newKey = new Text(val);
        context.write(newKey, value);
    }
}
Since I am performing a word count here, I require a reducer.
ChainMapReducer.java
public class ChainMapReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
To implement the concept of ChainMapper successfully, you must pay attention to every detail of the driver class.
DriverClass.java
public class DriverClass extends Configured implements Tool {
    static Configuration cf;

    public int run(String args[]) throws IOException, InterruptedException, ClassNotFoundException {
        cf = new Configuration();
        Job j = Job.getInstance(cf);
        // configuration for the first mapper
        Configuration splitMapConfig = new Configuration(false);
        ChainMapper.addMapper(j, SplitMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, splitMapConfig);
        // configuration for the second mapper
        Configuration lowerCaseConfig = new Configuration(false);
        ChainMapper.addMapper(j, LowerCaseMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, lowerCaseConfig);
        j.setJarByClass(DriverClass.class);
        j.setCombinerClass(ChainMapReducer.class);
        j.setReducerClass(ChainMapReducer.class); // the reducer that does the actual aggregation
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, outputPath);
        outputPath.getFileSystem(cf).delete(outputPath, true);
        return j.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DriverClass(), args);
        System.exit(res);
    }
}
The driver class is pretty much self-explanatory; one only needs to observe the signature of ChainMapper.addMapper(<job-object>, <Mapper-class>, <input/output key and value types>, <configuration-for-the-concerned-mapper>).
I hope the solution serves your purpose; please let me know in case any issues arise when you try to implement it.
Thank you!

How to set multiple Avro schemas with AvroParquetOutputFormat?

In my MapReduce job, I'm using AvroParquetOutputFormat to write to Parquet files using an Avro schema.
The application logic requires multiple types of files to be created by the reducer, and each file has its own Avro schema.
The class AvroParquetOutputFormat has a static method setSchema() to set the Avro schema of the output. Looking at the code, AvroParquetOutputFormat uses AvroWriteSupport.setSchema(), which again is a static implementation.
Without extending AvroWriteSupport and hacking the logic, is there a simpler way to get multiple Avro schema outputs from AvroParquetOutputFormat in a single MR job?
Any pointers/inputs highly appreciated.
Thanks & Regards
MK
It may be quite late to answer, but I have also faced this issue and came up with a solution.
First, there is no built-in support like 'MultipleAvroParquetOutputFormat' in parquet-mr. But to achieve similar behavior I used MultipleOutputs.
For a map-only kind of job, set up your mapper like this:
public class EventMapper extends Mapper<LongWritable, BytesWritable, Void, GenericRecord> {

    protected KafkaAvroDecoder deserializer;
    protected String outputPath = "";
    // Using MultipleOutputs to write custom named files
    protected MultipleOutputs<Void, GenericRecord> mos;

    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        outputPath = conf.get(FileOutputFormat.OUTDIR);
        mos = new MultipleOutputs<Void, GenericRecord>(context);
    }

    public void map(LongWritable ln, BytesWritable value, Context context) {
        try {
            GenericRecord record = (GenericRecord) deserializer.fromBytes(value.getBytes());
            AvroWriteSupport.setSchema(context.getConfiguration(), record.getSchema());
            Schema schema = record.getSchema();
            String mergeEventsPath = outputPath + "/" + schema.getName(); // Adding '/' will do no harm
            mos.write((Void) null, record, mergeEventsPath);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
This will create a new RecordWriter for each schema and a new Parquet file, appended with the schema name, for example schema1-r-0000.parquet.
This will also create the default part-r-0000x.parquet files based on the schema set in the driver. To avoid this, use LazyOutputFormat like:
LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
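For context, here is a rough driver sketch of how the pieces above might be wired together. Treat it as an assumption-laden illustration rather than the answer's exact code: the SequenceFile input format, the placeholder default schema, and the import path of AvroParquetOutputFormat (parquet.avro.* on older parquet-mr releases) all depend on your setup.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class MultiSchemaParquetDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-schema-parquet");
        job.setJarByClass(MultiSchemaParquetDriver.class);
        job.setMapperClass(EventMapper.class); // the mapper shown above
        job.setNumReduceTasks(0);              // map-only, as in the answer
        // Assumption: the BytesWritable payloads come from a SequenceFile; adjust to your input.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(Void.class);
        job.setOutputValueClass(GenericRecord.class);
        // A default schema still has to be registered; the mapper overrides it per record.
        AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Default\",\"fields\":[{\"name\":\"ignored\",\"type\":\"string\"}]}"));
        // LazyOutputFormat suppresses the empty default part files.
        LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}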
Hope this helps.

PIG doesn't read my custom InputFormat

I have a custom MyInputFormat that is supposed to deal with the record boundary problem for multi-line inputs. But when I put MyInputFormat into my UDF load function, as follows:
import org.apache.hadoop.mapreduce.InputFormat;

public class EccUDFLogLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() {
        System.out.println("I am in getInputFormat function");
        return new MyInputFormat();
    }
}

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends TextInputFormat {
    public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf) throws IOException {
        System.out.println("I am in createRecordReader");
        // MyRecordReader is supposed to handle the record boundary
        return new MyRecordReader((FileSplit) inputSplit, jobConf);
    }
}
For each mapper, it prints out I am in getInputFormat function but not I am in createRecordReader. I am wondering if anyone can provide a hint on how to hook up my custom MyInputFormat to PIG's UDF loader? Many thanks.
I am using PIG on Amazon EMR.
Your signature doesn't match that of the parent class (you're missing the Reporter argument); try this:
@Override
public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit inputSplit, JobConf jobConf, Reporter reporter)
        throws IOException {
    System.out.println("I am in getRecordReader");
    // MyRecordReader is supposed to handle the record boundary
    return new MyRecordReader((FileSplit) inputSplit, jobConf);
}
EDIT Sorry, I didn't spot this earlier; as you note, you need to use the new API signature instead:
@Override
public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
    System.out.println("I am in createRecordReader");
    // MyRecordReader is supposed to handle the record boundary
    // (its constructor will need a Configuration / TaskAttemptContext rather than a JobConf)
    return new MyRecordReader((FileSplit) split, context.getConfiguration());
}
And your MyRecordReader class needs to extend the org.apache.hadoop.mapreduce.RecordReader class
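To make that concrete, here is a minimal compilable skeleton of such a reader. Purely as an illustrative assumption it delegates everything to LineRecordReader; the real MyRecordReader would implement its own multi-line boundary logic instead of the delegation:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyRecordReader extends RecordReader<LongWritable, Text> {
    // placeholder delegate; replace with custom multi-line boundary handling
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}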

Hadoop MultipleOutputs does not write to multiple files when file formats are custom format

I am trying to read from Cassandra and write the reducer's output to multiple output files using the MultipleOutputs API (Hadoop version 1.0.3). The file formats in my case are custom output formats extending FileOutputFormat. I have configured my job in a similar manner as shown in the MultipleOutputs API.
However, when I run the job, I only get one output file named part-r-0000 which is in text output format. If job.setOutputFormatClass() is not set, by default it considers TextOutputFormat to be the format. Also it will only allow one of the two format classes to be initialized. It completely ignores the output formats I specified in MultipleOutputs.addNamedOutput(job, "format1", MyCustomFileFormat1.class, Text.class, Text.class) and MultipleOutputs.addNamedOutput(job, "format2", MyCustomFileFormat2.class, Text.class, Text.class). Is someone else facing a similar problem, or am I doing something wrong?
I also tried to write a very simple MR program which reads from a text file and writes the output in two formats, TextOutputFormat and SequenceFileOutputFormat, as shown in the MultipleOutputs API. However, no luck there as well. I get only one output file in text output format.
Can someone help me with this ?
Job job = new Job(getConf(), "cfdefGen");
job.setJarByClass(CfdefGeneration.class);
//read input from cassandra column family
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");
//thrift input job configurations
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), HOST);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("classification")));
//ConfigHelper.setRangeBatchSize(job.getConfiguration(), 2048);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
//specification for mapper
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specifications for reducer (writing to files)
job.setReducerClass(ReducerToFileSystem.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//job.setOutputFormatClass(MyCdbWriter1.class);
job.setNumReduceTasks(1);
//set output path for storing output files
Path filePath = new Path(OUTPUT_DIR);
FileSystem hdfs = FileSystem.get(getConf());
if(hdfs.exists(filePath)){
hdfs.delete(filePath, true);
}
MyCdbWriter1.setOutputPath(job, new Path(OUTPUT_DIR));
MultipleOutputs.addNamedOutput(job, "cdb1', MyCdbWriter1.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "cdb2", MyCdbWriter2.class, Text.class, Text.class);
boolean success = job.waitForCompletion(true);
return success ? 0:1;
public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text>
{
    private MultipleOutputs<Text, Text> mos;

    public void setup(Context context){
        mos = new MultipleOutputs<Text, Text>(context);
    }

    //public void reduce(Text key, Text value, Context context)
    //throws IOException, InterruptedException (This was the mistake, changed the signature and it worked fine)
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        for (Text value : values) {
            //context.write(key, value);
            mos.write("cdb1", key, value, OUTPUT_DIR + "/" + "cdb1");
            mos.write("cdb2", key, value, OUTPUT_DIR + "/" + "cdb2");
        }
        context.progress();
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
public class MyCdbWriter1<K, V> extends FileOutputFormat<K, V>
{
    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
    {
        // returns a CdbDataRecord writer (body omitted in the question)
    }

    public static void setOutputPath(Job job, Path outputDir) {
        job.getConfiguration().set("mapred.output.dir", outputDir.toString());
    }

    protected static class CdbDataRecord<K, V> extends RecordWriter<K, V>
    {
        // @Override write() and close() implementations omitted in the question
    }
}
I found my mistake after debugging: my reduce method was never called. My function definition did not match the API's definition, so I changed it from public void reduce(Text key, Text value, Context context) to public void reduce(Text key, Iterable<Text> values, Context context). I don't know why the reduce method did not have the @Override tag; it would have prevented my mistake.
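For reference, here is the signature that actually overrides Reducer.reduce. This is only a minimal sketch with a placeholder body, shown to illustrate how @Override turns the mismatch into a compile-time error:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SignatureCheck {
    public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text> {
        @Override // with this annotation a non-matching signature becomes a compile error
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value); // placeholder body; the real reducer writes via MultipleOutputs
            }
        }
    }
}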
I also encountered a similar issue - mine turned out to be that I was filtering all my records in the Map process so nothing was being passed to Reduce. With un-named multiple outputs in the reduce task, this still resulted in a _SUCCESS file and an empty part-r-00000 file.

Working of RecordReader in Hadoop

Can anyone explain how the RecordReader actually works? How do the methods nextKeyValue(), getCurrentKey() and getProgress() work after the program starts executing?
(new API): The default Mapper class has a run method which looks like this:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
The Context.nextKeyValue(), Context.getCurrentKey() and Context.getCurrentValue() methods are wrappers for the RecordReader methods. See the source file src/mapred/org/apache/hadoop/mapreduce/MapContext.java.
So this loop executes and calls your Mapper implementation's map(K, V, Context) method.
Specifically, what else would you like to know?
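To make the call sequence concrete, here is a small illustrative sketch of what the framework effectively does with a RecordReader for each split. The real driver is MapTask.runNewMapper(); this is only an approximation, not Hadoop's actual code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RecordReaderLifecycle {
    static void drive(InputFormat<LongWritable, Text> format, InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        RecordReader<LongWritable, Text> reader = format.createRecordReader(split, context);
        reader.initialize(split, context);               // open the split
        while (reader.nextKeyValue()) {                  // advance to the next record
            LongWritable key = reader.getCurrentKey();   // key produced by nextKeyValue()
            Text value = reader.getCurrentValue();       // value produced by nextKeyValue()
            float progress = reader.getProgress();       // fraction of the split consumed so far
            // the framework calls mapper.map(key, value, context) here
        }
        reader.close();
    }
}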
org.apache.hadoop.mapred.MapTask
- runNewMapper()
Important steps:
creates the new mapper
gets the input split for the mapper
gets the RecordReader for the split
initializes the RecordReader
using the RecordReader, iterates through nextKeyValue() and passes the key/value to the mapper's map method
cleans up
