Hadoop failure copying input bz2 file from s3

I have a map-only Hadoop job running on Amazon's EMR with the latest AMI version, 3.0.4. Once in a while I get exceptions like this:
Error: com.amazonaws.AmazonClientException: Unable to verify integrity of data download. Client calculated content length didn't match content length received from Amazon S3. The
data may be corrupt.
at com.amazonaws.util.ContentLengthValidationInputStream.validate(ContentLengthValidationInputStream.java:144)
at com.amazonaws.util.ContentLengthValidationInputStream.read(ContentLengthValidationInputStream.java:81)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.read(EmrFileSystem.java:289)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.readAByte(CBZip2InputStream.java:195)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:866)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:423)
at org.apache.hadoop.io.compress.BZip2Codec.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
at org.apache.hadoop.mapred.MapTask.nextKeyValue(MapTask.java:544)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:775)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Is there any way to cure this? Why does this happen? Is it a network problem on Amazon's side? It can't be a problem with the input file, as re-running the same job usually succeeds. Is there a way to catch this exception? Why doesn't Hadoop recover from it automatically?
My main class looks like this:
public class LogParserMapReduce extends Configured implements Tool {

    private static final Log LOG = LogFactory.getLog(LogParserMapReduce.class);

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
        conf.setBoolean("keep.failed.task.files", true);

        /*
         * Instantiate a Job object for your job's configuration.
         */
        Job job = Job.getInstance(conf);

        /*
         * The expected command-line arguments are the paths containing
         * input and output data. Terminate the job if the number of
         * command-line arguments is not exactly 2.
         */
        if (args.length != 2) {
            System.out.printf("Usage: LogParserMapReduce <input dir> <output dir>\n");
            System.exit(-1);
        }

        /*
         * Specify the jar file that contains your driver, mapper, and reducer.
         * Hadoop will transfer this jar file to nodes in your cluster running
         * mapper and reducer tasks.
         */
        job.setJarByClass(LogParserMapReduce.class);

        /*
         * Specify an easily-decipherable name for the job.
         * This job name will appear in reports and logs.
         */
        job.setJobName("LogParser");

        /*
         * Specify the paths to the input and output data based on the
         * command-line arguments.
         */
        FileInputFormat.addInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        /*
         * Specify the mapper class.
         */
        job.setMapperClass(LogParserMapper.class);

        /*
         * The input and output files are in text format - the default format.
         * In text format files, each record is a line delineated by a line
         * terminator.
         *
         * When you use other input formats, you must call the
         * setInputFormatClass method. When you use other output formats,
         * you must call the setOutputFormatClass method.
         */

        /*
         * The mapper's output keys and values have the same data types as
         * the job's output keys and values. When they are not the same data
         * types, you must call the setMapOutputKeyClass and
         * setMapOutputValueClass methods.
         */

        /*
         * Specify the job's output key and value classes.
         */
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        LOG.info("LogParserMapReduce: waitingForCompletion");

        /*
         * Start the MapReduce job and wait for it to finish.
         * If it finishes successfully, return 0. If not, return 1.
         */
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}

The solution was very simple (Amazon's customer support told me): I had to upgrade to the latest AMI (currently 3.1.0), which ships the latest Hadoop (2.4), and also make sure I compiled the Java code against the same Hadoop version. Ever since, I haven't seen this kind of problem.
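As a side note, Hadoop does retry failed map tasks automatically up to a configurable number of attempts. If the error keeps appearing before an AMI upgrade is possible, one possible mitigation (a sketch using the standard Hadoop 2.x configuration key, not something from the original answer) is to raise the per-map-task attempt limit in the driver:

// Sketch only: allow more attempts per map task, so that a transient S3 read
// failure is retried by the framework (the Hadoop 2.x default is 4 attempts).
Configuration conf = super.getConf();
conf.setInt("mapreduce.map.maxattempts", 8);   // the value 8 is illustrative
Job job = Job.getInstance(conf);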

Related

Bulk loading with LoadIncrementalHFiles and subdirectories

I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are split into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:
/user/myuser/map_data/hfiles_0
... /hfiles_1
... /hfiles_2
... /hfiles_3
...
There are about 500 of these in the map_data directory, so I'm looking for a way to call the LoadIncrementalHFiles function automatically and process these subdirectories in iterations later as well.
The corresponding command would be this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable
I need to change this into an iterative command, as this command does not work with subdirectories (when I call it with the /user/myuser/map_data directory)!
I tried to use a Java Process instance to execute the command above automatically, but it doesn't seem to do anything (no output on the console and no new rows in my HBase table).
Using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class from my code also doesn't work; it simply doesn't respond!
Has anybody a working example for me? Or is there a parameter to be able to run the above hbase command on the parent directory? I'm working with HBase 1.1.2 in a Hortonworks Data Platform 2.5 cluster.
EDIT: I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I got an exception related to Snappy compression; see Run LoadIncrementalHFiles from Java client.
The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into its individual arguments (one array element per part of the command); see this Java code snippet:
TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);
for (String hFileDir : subDirs) {
    try {
        String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
        String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
                "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
        ProcessBuilder pb = new ProcessBuilder(execCode);
        pb.redirectErrorStream(true);
        final Process p = pb.start();

        // Write the output of the Process to the console
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null;
                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();

        // Wait for the end of the execution
        p.waitFor();
        ...
}
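The getHFileDirectories helper used above isn't shown in the answer; a minimal sketch, assuming it simply collects the names of the HFile subdirectories under the given HDFS path, might look like this (same class as the snippet above):

// Hypothetical sketch of the helper referenced above: gather the names of all
// subdirectories (hfiles_0, hfiles_1, ...) below the given parent path.
private static TreeSet<String> getHFileDirectories(Path parent, Configuration conf) throws IOException {
    TreeSet<String> dirs = new TreeSet<String>();
    FileSystem fs = parent.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(parent)) {
        if (status.isDirectory()) {
            dirs.add(status.getPath().getName());
        }
    }
    return dirs;
}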

Map Reduce job on EMR successfully running but no output data on S3

I'm running an MR job on the EMR master host.
My input file is in S3 and the output is set to a table in Hive via HCatalog.
The job runs successfully and I do see the reducers' output rows, but looking at the new partition folders on S3 I can only see the 0-byte SUCCESS file that MR writes, and no actual data files.
Note: when the reducer stage starts I do see files being written to a temp folder on S3, but it seems the last operation throws the files away somewhere.
I don't see any errors in the MR logs.
Relevant MR driver code:
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");

// set input paths
Path[] inputPaths = ...; // "file on s3"
FileInputFormat.setInputPaths(job, inputPaths);

// set input/output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
_configureOutputTable(job);

private void _setReducer(Job job) {
    job.setReducerClass(Reducer.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
}

private void _configureOutputTable(Job job) throws IOException {
    OutputJobInfo jobInfo = OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
            _cli.getOptionValue("output-table"), null);
    HCatOutputFormat.setOutput(job, jobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatFieldSchema partitionDate = new HCatFieldSchema("date",
            TypeInfoFactory.stringTypeInfo, null);
    HCatFieldSchema partitionBatchId = new HCatFieldSchema("batch_id",
            TypeInfoFactory.stringTypeInfo, null);
    schema.append(partitionDate);
    schema.append(partitionBatchId);
    HCatOutputFormat.setSchema(job, schema);
}
Any help?

checkpointing: Is fsimage always copied from namenode

On checkpointing, the Definitive Guide says:
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
At the end of checkpointing, the secondary namenode sends the updated fsimage back to the namenode.
Now the secondary namenode has the latest fsimage. During the next checkpoint, will the secondary namenode copy the fsimage from the namenode again? If so, why? Can't it simply compare the two using a checksum?
Yes. When the edits file on the namenode grows to a specific size (default: fs.checkpoint.size = 4194304), the secondary namenode copies both the fsimage and the edits file from the namenode server.
This code from SecondaryNameNode.java illustrates that:
long size = namenode.getEditLogSize();
if (size >= checkpointSize ||
    now >= lastCheckpointTime + 1000 * checkpointPeriod) {
    doCheckpoint();
    lastCheckpointTime = now;
}
Please check when doCheckpoint() is called.
The answer to the "why" lies in the design Hadoop follows (I don't know why they chose this design, though). See the code below for what is being done (I'm keeping only the statements relevant to this question); you can see how the functions downloadCheckpointFiles(sig) and doMerge(sig) are called.
/**
 * Create a new checkpoint
 */
void doCheckpoint() throws IOException {
    //---other code skipped---

    // Tell the namenode to start logging transactions in a new edit file.
    // Returns a token that would be used to upload the merged image.
    CheckpointSignature sig = (CheckpointSignature) namenode.rollEditLog();

    downloadCheckpointFiles(sig);   // Fetch fsimage and edits
    doMerge(sig);                   // Do the merge

    //
    // Upload the new image into the NameNode. Then tell the Namenode
    // to make this new uploaded image as the most current image.
    //
    putFSImage(sig);
    namenode.rollFsImage();
    checkpointImage.endCheckpoint();

    //----other code skipped----
}
And here is what downloadCheckpointFiles(sig), called from within doCheckpoint() above, actually does. See the code below:
/**
 * Download <code>fsimage</code> and <code>edits</code>
 * files from the name-node.
 * @throws IOException
 */
private void downloadCheckpointFiles(final CheckpointSignature sig) throws IOException {
    try {
        UserGroupInformation.getCurrentUser().doAs(new PrivilegedExceptionAction<Void>() {

            @Override
            public Void run() throws Exception {
                // get fsimage
                String fileid = "getimage=1";
                File[] srcNames = checkpointImage.getImageFiles();
                assert srcNames.length > 0 : "No checkpoint targets.";
                TransferFsImage.getFileClient(fsName, fileid, srcNames);
                LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
                        srcNames[0].length() + " bytes.");

                // get edits file
                fileid = "getedit=1";
                srcNames = checkpointImage.getEditsFiles();
                assert srcNames.length > 0 : "No checkpoint targets.";
                TransferFsImage.getFileClient(fsName, fileid, srcNames);
                LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
                        srcNames[0].length() + " bytes.");

                checkpointImage.checkpointUploadDone();
                return null;
            }
        });
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
}
And, as for your third and last question - "can't it simply compare the two using a checksum" -
One possible reason is that they don't want to take any risk, since the checksums of two different files can occasionally be the same. Say the namenode has an fsimage that differs from the one on the secondary namenode, but their checksums somehow come out equal; this could happen and you might never know. Copying the file outright seems to be the safest option they had to ensure both copies are identical.
Hope this helps.

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to the directory "foo".
The foo files have a key=IntWritable, value=IntWritable format (I used a SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() has no scheme defined, and hence the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, is used.
Your program therefore works against the local file system under workingDir/out/foo. It should run in the cluster as well, but it will be looking at the local file system there too.
With that said, I'm not sure why you need the entire contents of the foo directory; you may want to consider other designs. If it is really needed, those files should be copied to HDFS first and read from the overridden setup method of your reducer. Needless to say, close the files you opened in the overridden cleanup method of the reducer. While files can be read in reducers this way, map/reduce programs are not designed for this kind of functionality.
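To illustrate that suggestion, here is a minimal sketch (using the new MapReduce API; the HDFS path and class name are hypothetical) of a reducer that loads the "foo" SequenceFiles from HDFS once, when the task starts:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

public class FooSideDataReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    private final Map<Integer, Integer> sideData = new HashMap<Integer, Integer>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Fully qualified URI so HDFS is used regardless of fs.defaultFS (the path is hypothetical).
        Path dir = new Path("hdfs:///user/me/out/foo");
        FileSystem fs = dir.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.getPath().getName().startsWith("_")) {
                continue; // skip _SUCCESS and similar marker files
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
            try {
                IntWritable key = new IntWritable();
                IntWritable value = new IntWritable();
                while (reader.next(key, value)) {
                    sideData.put(key.get(), value.get());
                }
            } finally {
                reader.close();
            }
        }
    }

    // reduce(...) can then look values up in sideData instead of re-reading the files.
}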

Regular expression configuration in Flume NG

I am trying to load data from a flat file (log file) into HBase using Flume NG (1.2). The flat file has multiple columns, each separated by a colon (:), and they all need to be loaded into separate columns in HBase. Checking the forums, I found that there is a class from Apache that solves this (org.apache.flume.sink.hbase.RegexHbaseEventSerializer), but I am unable to find any configuration files or usage examples on the internet. If someone can help me with the configuration file, that would be helpful.
Contents of the flat file:
1:nn
2:pp
3:mm
Thanks
RegexHbaseEventSerializer has three configuration parameters you can set (as described in the source code); these are:
/** Regular expression used to parse groups from event data. */
public static final String REGEX_CONFIG = "regex";
/** Whether to ignore case when performing regex matches. */
public static final String IGNORE_CASE_CONFIG = "regexIgnoreCase";
/** Comma separated list of column names to place matching groups in. */
public static final String COL_NAME_CONFIG = "colNames";
A sample configuration using RegexHbaseEventSerializer would be like this (partially quoting from Cloudera's Flume and HBase presentation):
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
host1.sinks.sink1.serializer.regex = X
host1.sinks.sink1.serializer.regexIgnoreCase = true
host1.sinks.sink1.serializer.colNames = column_1,column_2,column_3
host1.channels.ch1.type = memory
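For the colon-separated sample above, the serializer lines might be adapted along the following lines; the regex and the column names here are illustrative guesses for a two-field line like 1:nn, not tested values:

# Hypothetical adaptation for lines such as "1:nn": one capture group per HBase column.
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
host1.sinks.sink1.serializer.regex = ^([^:]+):(.*)$
host1.sinks.sink1.serializer.regexIgnoreCase = false
host1.sinks.sink1.serializer.colNames = id,value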
