I have to use Snappy to compress the map output as well as the final MapReduce job output. Furthermore, this output should be splittable.
From what I have read online, to make Snappy produce splittable output, it has to be used inside a container-like format.
Can you please suggest how to go about this? I tried finding some examples online, but could not find one. I am using Hadoop v0.20.203.
Thanks.
Piyush
For the job output (old API, where conf is a JobConf):
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
For the map output:
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
In the new API, the output format is set on the Job, not on the Configuration. So the first part becomes:
Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
I am looking to write the MapReduce output in the Parquet file format using the parquet-mr library, something like below:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(ParquetOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
job.submit();
However, I keep getting errors like this:
2018-02-23 09:32:58,325 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException: writeSupportClass should not be null
at org.apache.parquet.Preconditions.checkNotNull(Preconditions.java:38)
at org.apache.parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:350)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:293)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
I understand that writeSupportClass needs to be passed/set as something like
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
but how can I specify the schema and implement ProtoWriteSupport or any of the other WriteSupport classes out there? Which methods do I need to implement, and are there any examples of doing this the correct way?
If it helps, my MR job's output should look like the following and be stored in Parquet format:
Text    IntWritable
a       100
Try ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport<T extends MessageOrBuilder> is the implementation of WriteSupport for writing Protocol Buffers.
Check the Javadoc for the list of nested default classes available.
See also the CDH tutorial on using the Parquet file format with MapReduce, Hive, HBase, and Pig.
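If your records are not protobufs, another option is to implement your own WriteSupport. Below is a rough sketch for the Text/IntWritable output in the question; TextIntPair is a hypothetical holder class (with getWord() returning String and getCount() returning int) that the reducer would emit as the value, since ParquetOutputFormat writes only the value side of each record.
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical WriteSupport for a (Text, IntWritable)-style record.
public class TextIntWriteSupport extends WriteSupport<TextIntPair> {

    // schema declared inline; it could also be read from the job configuration
    private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
        "message TextIntPair { required binary word (UTF8); required int32 count; }");

    private RecordConsumer consumer;

    @Override
    public WriteContext init(Configuration configuration) {
        return new WriteContext(SCHEMA, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        this.consumer = recordConsumer;
    }

    @Override
    public void write(TextIntPair record) {
        consumer.startMessage();
        consumer.startField("word", 0);
        consumer.addBinary(Binary.fromString(record.getWord()));
        consumer.endField("word", 0);
        consumer.startField("count", 1);
        consumer.addInteger(record.getCount());
        consumer.endField("count", 1);
        consumer.endMessage();
    }
}
The driver would then call ParquetOutputFormat.setWriteSupportClass(job, TextIntWriteSupport.class) and keep job.setOutputFormatClass(ParquetOutputFormat.class) as in the question.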
I am using MRUnit to write unit tests for my MapReduce jobs.
However, I am having trouble bringing HDFS into the mix. My MR job needs a file from HDFS. How do I mock out the HDFS part in an MRUnit test case?
Edit:
I know that I can specify inputs/expected outputs for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file that has domain data to do its job. This file is in HDFS. How do I mock out this file?
I tried using Mockito, but it didn't work. The reason is that FileSystem.open() returns an FSDataInputStream, which implements several other interfaces besides extending java.io.InputStream. It was too painful to mock out all the interfaces, so I hacked around it in my code by doing the following:
if (System.getProperty("junit_running") != null) {
    inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
    br = new BufferedReader(new InputStreamReader(inputStream));
} else {
    Path pathToRegionData = new Path("/domain_data.txt");
    LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
    if (!fileSystem.exists(pathToRegionData)) {
        LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
        throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
    }
    inputStream = fileSystem.open(pathToRegionData);
    br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test-specific code in my production code. I am still waiting to see if there is an elegant solution out there.
Please follow this small tutorial for MRUnit:
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In an MRUnit test case, we supply the data inside the testMapper() and testReducer() methods, so there is no need for input from HDFS in an MRUnit job. Only real MapReduce jobs require data inputs from HDFS.
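For reference, a bare-bones MRUnit test sketch looks something like this (WordCountMapper stands in for your own Mapper<LongWritable, Text, Text, IntWritable> implementation):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // the data is supplied in the test itself, no HDFS involved
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapper() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .runTest();
    }
}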
I need to implement the functionality below using Hadoop MapReduce:
1) I read one input for a mapper from one source and another input from a different input source.
2) I need to pass both mapper outputs into a single reducer for further processing.
Is there any way to do this in Hadoop MapReduce?
MultipleInputs.addInputPath is what you are looking for. This is how your configuration would look. Make sure both AnyMapper1 and AnyMapper2 write the same output expected by MergeReducer:
JobConf conf = new JobConf(Merge.class);
conf.setJobName("merge");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MergeReducer.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, inputDir1, SequenceFileInputFormat.class, AnyMapper1.class);
MultipleInputs.addInputPath(conf, inputDir2, TextInputFormat.class, AnyMapper2.class);
FileOutputFormat.setOutputPath(conf, outputPath);
You can also create a custom Writable and populate it in the Mapper; later, in the Reducer, you can get the custom Writable object back and do the necessary business operation. A rough sketch is below.
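For illustration, such a Writable might look like this (TaggedValueWritable and its fields are made up; adapt the payload to your own data):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable carrying a "source" tag plus the payload,
// so the reducer can tell which mapper emitted each value.
public class TaggedValueWritable implements Writable {
    private Text source = new Text();   // e.g. "mapper1" or "mapper2"
    private Text payload = new Text();

    public TaggedValueWritable() {}

    public TaggedValueWritable(String source, String payload) {
        this.source.set(source);
        this.payload.set(payload);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        source.write(out);
        payload.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        source.readFields(in);
        payload.readFields(in);
    }

    public String getSource() { return source.toString(); }
    public String getPayload() { return payload.toString(); }
}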
I am using DistributedCache, but there are no files in the cache after the code executes.
I have referred to other similar questions, but the answers do not solve my issue.
Please find the code below:
Configuration conf = new Configuration();
Job job1 = new Job(conf, "distributed cache");
Configuration conf1 = job1.getConfiguration();
DistributedCache.addCacheFile(new Path("File").toUri(), conf1);
System.out.println("distributed cache file "+DistributedCache.getLocalCacheFiles(conf1));
This prints null.
The same call inside the mapper also gives null. Please let me know your suggestions.
Thanks
Try getCacheFiles() instead of getLocalCacheFiles().
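For example (with conf being the job configuration available to the task):
URI[] cacheUris = DistributedCache.getCacheFiles(conf);         // the URIs that were registered
Path[] localCopies = DistributedCache.getLocalCacheFiles(conf); // localized copies, generally only populated once the task is running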
I believe this is (at least partly) due to what Chris White wrote here:
After you create your Job object, you need to pull back the Configuration object, as Job makes a copy of it, and configuring values in conf2 after you create the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
I guess if it still does not work, there is another problem somewhere, but that doesn't mean that Chris White's point is not correct.
When distributing the file, don't forget the local link name, preferably given as a relative path:
The URI is of the form hdfs://host:port/absolute-path#local-link-name
When reading:
if you don't use the distributed cache facilities, you are supposed to use HDFS's FileSystem API to access hdfs://host:port/absolute-path
if you use the distributed cache, then you have to use standard Java file utilities to access the local-link-name, as in the sketch below
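A small sketch of both sides (the namenode address, paths, and link name are placeholders):
// client side: register the file under a local link name
DistributedCache.createSymlink(job.getConfiguration()); // older releases need this for the #link to show up as a symlink
DistributedCache.addCacheFile(
        new URI("hdfs://namenode:8020/data/domain_data.txt#domain_data"),
        job.getConfiguration());

// task side, e.g. in setup(): read the symlinked local copy with plain Java I/O
BufferedReader br = new BufferedReader(new FileReader("domain_data"));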
The cache file needs to be in the Hadoop FileSystem. You can do this:
void copyFileToHDFS(JobConf jobConf, String from, String to) {
    try {
        FileSystem aFS = FileSystem.get(jobConf);
        aFS.copyFromLocalFile(false, true, new Path(from), new Path(to));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Once the files are copied you can add them to the cache, like so:
void fillCache(JobConf jobConf) {
    copyFileToHDFS(jobConf, fromLocation, toLocation);
    try {
        Job job = Job.getInstance(jobConf);
        job.addCacheFile(new URI(toLocation));
        // use newJobConf (which now carries the cache file URI) for the rest of the job setup
        JobConf newJobConf = new JobConf(job.getConfiguration());
    } catch (IOException | URISyntaxException e) {
        throw new RuntimeException(e);
    }
}
I'm trying to set the OutputFormat of my job to MapFileOutputFormat using:
jobConf.setOutputFormat(MapFileOutputFormat.class);
I get this error: mapred.output.format.class is incompatible with new reduce API mode
I suppose I should use the setOutputFormatClass() method of the new Job class, but the problem is that when I try to do this:
job.setOutputFormatClass(MapFileOutputFormat.class);
it expects me to use this class: org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.
In Hadoop 1.0.x there is no such class; it only exists in earlier versions (e.g. 0.x).
How can I solve this problem?
Thank you!
This problem has no decent, easily implementable solution.
I gave up and used SequenceFiles, which fit my requirements too.
Have you tried the following?
import org.apache.hadoop.mapreduce.lib.output.*;
...
LazyOutputFormat.setOutputFormatClass(job, MapFileOutputFormat.class);