What's the expected commit/rollback behavior of Camus? - hadoop

We've been running Camus successfully for about a year, pulling Avro payloads from Kafka (ver 0.8.2) and storing them as .avro files in HDFS, using just a few Kafka topics. Recently, a new team within our company registered about 60 new topics in our pre-production environment and started sending data to them. The team made some mistakes when routing their data to Kafka topics, which resulted in errors when Camus deserialized the payloads to Avro for these topics.
The Camus job failed due to exceeding the 'failed other' error threshold. The resulting behavior in Camus after the failure was surprising, and I wanted to check with other developers whether the behavior we observed is expected or whether we have an issue with our implementation.
We noticed this behavior when the Camus job failed due to exceeding the 'failed other' threshold:
1. All of the mapper tasks succeeded, and so the TaskAttempt was allowed to commit - this means that all of the data written by Camus was copied to the final HDFS location.
2. The CamusJob threw an exception when it computed the % error rate (this happens after the mapper commit), which caused the job to fail.
3. Because the job failed (I think), the Kafka offsets weren't advanced.
The problem we ran into with this behavior is that our Camus job is scheduled to run every 5 minutes. So every 5 minutes data was committed to HDFS, the job failed, and the Kafka offsets weren't updated - this meant we kept writing duplicate data until we noticed that our disks were filling up.
I wrote an integration test that confirms the result: it submits 10 good records to a topic and 10 records with an unexpected schema to the same topic, then runs the Camus job with only that topic whitelisted; 10 records are written to HDFS and the Kafka offsets aren't advanced. Below is a snippet of the logs from that test, along with the properties we used when running the job.
Any help is appreciated - I'm not sure whether this is expected Camus behavior or a problem with our implementation, and I'd like to know the best way to prevent this behavior (duplicating data).
Thanks ~ Matt
CamusJob properties for the test:
etl.destination.path=/user/camus/kafka/data
etl.execution.base.path=/user/camus/kafka/workspace
etl.execution.history.path=/user/camus/kafka/history
dfs.default.classpath.dir=/user/camus/kafka/libs
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss Z
mapreduce.output.fileoutputformat.compress=false
mapred.map.tasks=15
kafka.max.pull.hrs=1
kafka.max.historical.days=3
kafka.whitelist.topics=advertising.edmunds.admax
log4j.configuration=true
kafka.client.name=camus
kafka.brokers=<kafka brokers>
max.decoder.exceptions.to.print=5
post.tracking.counts.to.kafka=true
monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka
kafka.message.coder.schema.registry.class=com.linkedin.camus.schemaregistry.AvroRestSchemaRegistry
etl.schema.registry.url=<schema repo url>
etl.run.tracking.post=false
kafka.monitor.time.granularity=10
etl.daily=daily
etl.ignore.schema.errors=false
etl.output.codec=deflate
etl.deflate.level=6
etl.default.timezone=America/Los_Angeles
mapred.output.compress=false
mapred.map.max.attempts=2
Log snippet from the test, showing the commit behavior after the mappers succeed and subsequent job failure due to surpassing the 'other' threshold:
[LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task:attempt_local866350146_0001_m_000000_0 is done. And is in the process of committing
[LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task attempt_local866350146_0001_m_000000_0 is allowed to commit now
[EtlMultiOutputFormat] - work path: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0
[EtlMultiOutputFormat] - Destination base path: /user/camus/kafka/data
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.3.2.2.1467979200000.avro
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.7.8.8.1467979200000.avro
[Task] - Task 'attempt_local866350146_0001_m_000000_0' done.
[LocalJobRunner] - Finishing task: attempt_local866350146_0001_m_000000_0
[LocalJobRunner] - map task executor complete.
[Job] - map 100% reduce 0%
[Job] - Job job_local866350146_0001 completed successfully
[Job] - Counters: 23
File System Counters
FILE: Number of bytes read=117251
FILE: Number of bytes written=350942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=10
Map output records=15
Input split bytes=793
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13
Total committed heap usage (bytes)=251658240
com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
DECODE_SUCCESSFUL=10
SKIPPED_OTHER=10
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=5907
total
data-read=840
decode-time(ms)=123
event-count=20
mapper-time(ms)=58
request-time(ms)=12114
skip-old=0
[CamusJob] - Group: File System Counters
[CamusJob] - FILE: Number of bytes read: 117251
[CamusJob] - FILE: Number of bytes written: 350942
[CamusJob] - FILE: Number of read operations: 0
[CamusJob] - FILE: Number of large read operations: 0
[CamusJob] - FILE: Number of write operations: 0
[CamusJob] - Group: Map-Reduce Framework
[CamusJob] - Map input records: 10
[CamusJob] - Map output records: 15
[CamusJob] - Input split bytes: 793
[CamusJob] - Spilled Records: 0
[CamusJob] - Failed Shuffles: 0
[CamusJob] - Merged Map outputs: 0
[CamusJob] - GC time elapsed (ms): 13
[CamusJob] - Total committed heap usage (bytes): 251658240
[CamusJob] - Group: com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
[CamusJob] - DECODE_SUCCESSFUL: 10
[CamusJob] - SKIPPED_OTHER: 10
[CamusJob] - job failed: 50.0% messages skipped due to other, maximum allowed is 0.1%

I'm facing a pretty similar problem: my Kafka/Camus pipeline had been working well for about a year, but recently I got stuck with a duplication issue while integrating ingestion from a remote broker with a very unstable connection and frequent job failures.
Today, while examining the Gobblin documentation, I realized that the Camus sweeper is possibly the tool we are looking for. Try integrating it into your pipeline.
I also think it would be a good idea to migrate to Gobblin (the Camus successor) in the near future.
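If you need to keep the current pipeline running until then, one stop-gap is to relax Camus's skip thresholds so that undecodable records are counted and skipped instead of failing the whole job after the mappers have already committed their output. A sketch of the relevant properties - the threshold key names below are from memory and may differ between Camus versions, so confirm them against the CamusJob source in your build before relying on this:
# present in your config above (set to false); when true, schema errors are skipped rather than failed
etl.ignore.schema.errors=true
# allowed percentage of messages skipped for 'other' reasons before the job is failed
# (the job above failed against the 0.1% default); verify this key name in your CamusJob
etl.max.percent.skipped.other=100
# analogous threshold for schema-not-found skips
etl.max.percent.skipped.schemanotfound=100
With the job no longer failing, the offsets for the good records advance and the duplication stops, at the cost of silently dropping the bad records - so treat it as a mitigation, not a fix.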

Related

H2O cluster startup frequently timing out

Trying to start an H2O cluster on (MapR) Hadoop via Python:
# startup hadoop h2o cluster
import os
import subprocess
import sys
import h2o
import shlex
import re
from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Push streaming text lines from a separate thread onto a queue.
    See https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(
    shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
    shell=False,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue fed by a daemon thread reading the driver's stdout
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read lines without blocking until the connection url (or an error) appears
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue
    else:  # got line
        print line
        # check for first instance of connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
This frequently throws a timeout error:
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and just searching for the word "timeout"), I was unable to find anything that helped the problem (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
I have noticed that this often happens when another H2O cluster instance is already up and running (which I don't understand, since I would think YARN could support multiple instances), yet it also sometimes happens when no other cluster is initialized.
Does anyone know anything else that can be tried to solve this problem, or how to get more debugging info beyond the error message thrown by H2O?
UPDATE:
Trying to recreate the problem from the command line, I get:
[me#mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and I notice the later output:
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster
I am confused by the reported 0 GB of memory and 0 vcores, because there are no other applications running on the cluster, and looking at the cluster details in the YARN RM web UI shows
(using an image, since I could not find a unified place in the log files for this info; why the memory availability is so uneven despite there being no other running applications, I do not know). At this point, I should mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find the relevant information.
Could it be that I am starting the H2O cluster with -mapperXmx 6g, but (as shown in the image) one of the nodes only has 5 GB of memory available, so that if this node is randomly selected to contribute to the initialized H2O application, it does not have enough memory to support the requested mapper memory? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine whether I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy, and there just isn't room to start new YARN containers.
If you ask for N nodes, then you either get all N nodes or the launch process times out, as you are seeing. You can optionally use the -timeout command-line flag to increase the timeout.
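As a rough sketch of a workaround based on the details above (the 5g figure comes from the experiment described in the update, and the exact timeout you need depends on your cluster): size -mapperXmx so a container fits on the smallest node, and give the driver longer to collect all the requested containers:
# mapper containers sized to fit the smallest node (5 GB heap + ~10% overhead per the driver output),
# with a longer window for YARN to allocate all four of them before giving up
/bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 600 -output hdfsOutputDir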

Unable to run spark-terasort with spark-1.6.1-bin-hadoop1

I am trying to run spark-terasort with spark-1.6.1-bin-hadoop1 (pre-built package for hadoop 1.X).
When I try to run spark:
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen ~/spark-terasort/target/spark-terasort-1.0-jar-with-dependencies.jar 100G hdfs:///input_terasort
I get the error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
This may have to do with different Hadoop versions (between spark and spark-terasort). I have tried playing around with pom.xml (used to compile spark-terasort) but without much success.
How can I use spark-terasort with spark-1.6.1-bin-hadoop1?
spark-terasort is old:
<scala.binary.version>2.10</scala.binary.version>
<spark.version>1.2.1</spark.version>
I am looking into patching it. Will get back..
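A minimal sketch of that kind of patch, assuming the two pom.xml properties quoted above are the ones spark-terasort actually uses (point spark.version at the Spark build you submit against, then rebuild):
<properties>
    <scala.binary.version>2.10</scala.binary.version>
    <spark.version>1.6.1</spark.version>
</properties>
Then run mvn clean package to regenerate the spark-terasort-*-jar-with-dependencies.jar that gets passed to spark-submit.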
Update: I tried with 1.6.0-SNAPSHOT and TeraGen worked fine.
Input size: 1000KB
Total number of records: 10000
Number of output partitions: 2
Number of records/output partition: 5000
===========================================================================
===========================================================================
Number of records written: 10000
This was when running against the local filesystem. I will look at real HDFS in about 12 hours.

hadoop python job on snappy files produces 0 size output

When I run wordcount.py (python mrjob, http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) using hadoop streaming on a text file, it gives me the output, but when the same is run against .snappy files, I get zero-size output.
Options Tried:
[testgen word_count]# cat mrjob.conf
runners:
hadoop: # this will work for both hadoop and emr
jobconf:
mapreduce.task.timeout: 3600000
#mapreduce.max.split.size: 20971520
#mapreduce.input.fileinputformat.split.maxsize: 102400
#mapreduce.map.memory.mb: 8192
mapred.map.child.java.opts: -Xmx4294967296
mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
# "true" must be a string argument, not a boolean! (#323)
#mapreduce.output.compress: "true"
#mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
[testgen word_count]#
command:
[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/
PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows:
mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 0%
HADOOP: map 100% reduce 11%
HADOOP: map 100% reduce 97%
HADOOP: map 100% reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
(no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]#
No errors are thrown and the job completes successfully; I verified the job configuration in the job stats that it picked up.
Is there any other way to troubleshoot?
I think you are not using the options correctly.
In your mrjob.conf file:
mapreduce.output.compress: "true" means that you want a compressed output
mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means that the compression uses Snappy codec
You are apparently expecting that your compressed inputs will be correctly read by your mappers. Unfortunately, it does not work like that. If you really want to feed your job with compressed data, you may look at SequenceFile. Another simpler solution would be to feed your job with text files only.
What about also configuring your input format, like mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
[Edit: you should also remove this symbol # at the beginning of lines that define options. Otherwise, they will be ignored]
Thanks for your input, Yann, but in the end the line below, inserted into the job script, solved the problem:
HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
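For anyone landing here, mrjob sets the streaming input format via a class attribute on the job. A minimal sketch of where that line lives, under the assumption that the .snappy files are actually Snappy-compressed SequenceFiles (the value shown is illustrative; the OP's placeholder above stands in for whatever format matches how the files were written):
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCountSnappy(MRJob):
    # passed to hadoop streaming as -inputformat; replace with the input format
    # that matches how your .snappy files were produced
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCountSnappy.run()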

running camus sample with kafka 0.8

I am new to Camus and I want to try to use it with my Kafka 0.8.
So far I have downloaded the source, created 2 queues like the example expects,
configured the job config file (see below),
and tried to run it on my machine (details below) with this command:
$JAVA_HOME/bin/java -cp camus-example-0.1.0-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob -P /root/Desktop/camus-workspace/camus-master/camus-example/target/camus.properties
The jar contains all the dependencies (like the shaded jar), and I am getting this error:
[EtlInputFormat] - Discrading topic : TestQueue
[EtlInputFormat] - Discrading topic : test
[EtlInputFormat] - Discrading topic : DummyLog2
[EtlInputFormat] - Discrading topic : test3
[EtlInputFormat] - Discrading topic : TwitterQueue
[EtlInputFormat] - Discrading topic : test2
[EtlInputFormat] - Discarding topic (Decoder generation failed) : DummyLog
[CodecPool] - Got brand-new compressor
[JobClient] - Running job: job_local_0001
[JobClient] - map 0% reduce 0%
[JobClient] - Job complete: job_local_0001
[JobClient] - Counters: 0
[CamusJob] - Job finished
When I tried to run it with my IntelliJ IDEA editor,
I got the same error, but found the reason for it:
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
Can someone explain to me what I am doing wrong?
Camus config file:
# Needed Camus properties, more cleanup to come
# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=/root/Desktop/camus-workspace/camus-master/camus-example/target/1
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=/root/Desktop/camus-workspace/camus-master/camus-example/target/2
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=/root/Desktop/camus-workspace/camus-master/camus-example/target3
# Kafka-0.8 handles all zookeeper calls
#zookeeper.hosts=localhost:2181
#zookeeper.broker.topics=/brokers/topics
#zookeeper.broker.nodes=/brokers/ids
# Concrete implementation of the Encoder class to use (used by Kafka Audit, and thus optional for now)
#camus.message.encoder.class=com.linkedin.batch.etl.kafka.coders.DummyKafkaMessageEncoder
# Concrete implementation of the Decoder class to use
camus.message.decoder.class=com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
# Used by avro-based Decoders to use as their Schema Registry
kafka.message.coder.schema.registry.class=com.linkedin.camus.example.DummySchemaRegistry
# Used by the committer to arrange .avro files into a partitioned scheme. This will be the default partitioner for all
# topic that do not have a partitioner specified
#etl.partitioner.class=com.linkedin.camus.etl.kafka.coders.DefaultPartitioner
# Partitioners can also be set on a per-topic basis
#etl.partitioner.class.<topic-name>=com.your.custom.CustomPartitioner
# all files in this dir will be added to the distributed cache and placed on the classpath for hadoop tasks
# hdfs.default.classpath.dir=/root/Desktop/camus-workspace/camus-master/camus-example/target
# max hadoop tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=30
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
# Max minutes for each mapper to pull messages (-1 means no limit)
kafka.max.pull.minutes.per.task=-1
# if whitelist has values, only whitelisted topic are pulled. nothing on the blacklist is pulled
kafka.blacklist.topics=
kafka.whitelist.topics=DummyLog
log4j.configuration=true
# Name of the client as seen by kafka
kafka.client.name=camus
# Fetch Request Parameters
kafka.fetch.buffer.size=
kafka.fetch.request.correlationid=
kafka.fetch.request.max.wait=
kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=localhost:9092
kafka.timeout.value=
#Stops the mapper from getting inundated with Decoder exceptions for the same topic
#Default value is set to 10
max.decoder.exceptions.to.print=5
#Controls the submitting of counts to Kafka
#Default value set to true
post.tracking.counts.to.kafka=true
log4j.configuration=true
# everything below this point can be ignored for the time being, will provide more documentation down the road
##########################
etl.run.tracking.post=false
kafka.monitor.tier=
etl.counts.path=
kafka.monitor.time.granularity=10
etl.hourly=hourly
etl.daily=daily
etl.ignore.schema.errors=false
# configure output compression for deflate or snappy. Defaults to deflate
etl.output.codec=deflate
etl.deflate.level=6
#etl.output.codec=snappy
etl.default.timezone=America/Los_Angeles
etl.output.file.time.partition.mins=60
etl.keep.count.files=false
etl.execution.history.max.of.quota=.8
mapred.output.compress=true
mapred.map.max.attempts=1
kafka.client.buffer.size=20971520
kafka.client.so.timeout=60000
#zookeeper.session.timeout=
#zookeeper.connection.timeout=
Machine details:
Hortonworks HDP 2.0.0.6
with Kafka 0.8 beta 1
There is a mistake in the package name.
Change
camus.message.decoder.class=com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
to
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
Also, you need to specify values for some Kafka-related properties or comment them out (that way Camus will use the default values):
# Fetch Request Parameters
# kafka.fetch.buffer.size=
# kafka.fetch.request.correlationid=
# kafka.fetch.request.max.wait=
# kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=localhost:9092
# kafka.timeout.value=

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

I wrote a MapReduce job to extract some info from a dataset. The dataset is users' ratings of movies. The number of users is about 250K and the number of movies is about 300K. The output of the map is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I process these pairs.
When I run the job, the mappers complete as expected, but the reducers always complain that
Task attempt_* failed to report status for 600 seconds.
I know this is due to failing to update status, so I added a call to context.progress() in my code, like this:
int count = 0;
while (values.hasNext()) {
    // report progress every 100 values so the framework does not kill the task for inactivity
    if (count++ % 100 == 0) {
        context.progress();
    }
    /* other code here */
}
Unfortunately, this does not help; many reduce tasks still fail.
Here is the log:
Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!
BTW, the error happened in the reduce copy phase; the log says:
reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385
Thanks for the help.
The easiest way is to set this configuration parameter:
<property>
    <name>mapred.task.timeout</name>
    <value>1800000</value> <!-- 30 minutes -->
</property>
in mapred-site.xml
Another easy way is to set it in your job Configuration inside the program:
Configuration conf = new Configuration();
long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000 (10 minutes), and you can likewise give any value
conf.setLong("mapred.task.timeout", milliSeconds);
**Before setting it, please check the job file (job.xml) in the JobTracker GUI for the correct property name, i.e. whether it is mapred.task.timeout or mapreduce.task.timeout.
While running the job, check the job file again to confirm that the property has changed to the value you set.
In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:
The number of milliseconds before a task will be terminated if it
neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.
Below is an example setting in the mapred-site.xml:
<property>
    <name>mapreduce.task.timeout</name>
    <value>0</value> <!-- A value of 0 disables the timeout -->
</property>
If you have a Hive query that is timing out, you can set the above configuration as follows:
set mapred.tasktracker.expiry.interval=1800000;
set mapred.task.timeout=1800000;
From https://issues.apache.org/jira/browse/HADOOP-1763, the causes might be:
1. TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TTs.
3. All the reduce tasks connect to all the TTs where maps were run.
4. Since there are lots of reduces wanting to connect to the map output server, the Jetty servers run out of threads (default 40).
5. TaskTrackers continue to make periodic heartbeats to the JT, so they are not marked dead, but their Jetty servers are (temporarily) down.
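If that is the cause you are hitting, one hedged fix on classic MR1 (where the TaskTracker's embedded Jetty serves the map output) is to raise the TaskTracker HTTP thread count in mapred-site.xml on the TaskTrackers, for example:
<property>
    <name>tasktracker.http.threads</name>
    <value>80</value> <!-- default is 40 -->
</property>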

Resources