hadoop python job on snappy files produces 0 size output

When I run wordcount.py (Python mrjob, http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) using Hadoop Streaming on a text file, it gives me the expected output, but when the same job is run against .snappy files I get zero-size output.
Options Tried:
[testgen word_count]# cat mrjob.conf
runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      mapreduce.task.timeout: 3600000
      #mapreduce.max.split.size: 20971520
      #mapreduce.input.fileinputformat.split.maxsize: 102400
      #mapreduce.map.memory.mb: 8192
      mapred.map.child.java.opts: -Xmx4294967296
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
      java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
      # "true" must be a string argument, not a boolean! (#323)
      #mapreduce.output.compress: "true"
      #mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
[testgen word_count]#
command:
[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/
PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 0%
HADOOP: map 100% reduce 11%
HADOOP: map 100% reduce 97%
HADOOP: map 100% reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
(no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]#
No errors are thrown and the job completes successfully; I verified in the job stats that the job configurations were picked up.
Is there any other way to troubleshoot?

I think you are not using the options correctly.
In your mrjob.conf file:
mapreduce.output.compress: "true" means that you want a compressed output
mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means that the compression uses Snappy codec
You are apparently expecting that your compressed inputs will be correctly read by your mappers. Unfortunately, it does not work like that. If you really want to feed your job with compressed data, you may look at SequenceFile. Another simpler solution would be to feed your job with text files only.
What about also configuring your input format, like mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
[Edit: you should also remove the # symbol at the beginning of the lines that define options; otherwise they will be ignored.]
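For reference, here is a sketch of how those suggestions could look in the mrjob.conf shown above, once the lines are uncommented and the input codec is added (the property names are the ones discussed in this thread; verify them against your Hadoop version):
runners:
  hadoop:
    jobconf:
      # "true" must be a string argument, not a boolean! (#323)
      mapreduce.output.compress: "true"
      mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
      mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec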

Thanks for your input, Yann, but in the end the line below, inserted into the job script, solved the problem.
HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
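For context, HADOOP_INPUT_FORMAT is a class-level option in mrjob, so a minimal sketch of the job script could look like the following (the input format class name is the placeholder from above; substitute the input format class that actually ships with your distribution):
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):
    # Placeholder value from this thread; replace it with the actual
    # input format class available on your cluster.
    HADOOP_INPUT_FORMAT = 'org.hadoop.snappy.codec'

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()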

Related

How to suppress info message "io.bytes.per.checksum is deprecated" in grunt shell

When analyzing Big Data, I'm running Apache Pig version 0.17.0 on top of Hadoop 2.7.2. Every time I run a load command in local mode of the grunt> shell, I get the following message:
grunt> A = load '/usr/lib/pig/data.txt' using TextLoader as (date:chararray);
[main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Is there a way to switch off this message, as it becomes very annoying with frequent usage of the grunt> shell?
Check if the solution below works for you.
Create a file named nolog.conf with the following content:
log4j.rootLogger=fatal
and then run pig as follows
pig -x local -4 nolog.conf
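If you would rather keep Pig's normal output and silence only the deprecation notices, a narrower log4j file (passed with the same -4 flag) could look like the sketch below; the logger category name is taken from the message shown above, so double-check it matches what you see:
log4j.rootLogger=info, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
# Raise only the deprecation category above INFO so its messages are hidden.
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN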

Whats the expected commit/rollback behavior of Camus?

We've been running Camus successfully for about a year to pull Avro payloads from Kafka (ver 0.82) and store them as .avro files in HDFS, using just a few Kafka topics. Recently, a new team within our company registered about 60 new topics in our pre-production environment and started sending data to them. The team made some mistakes when routing their data to Kafka topics, which resulted in errors when Camus deserialized the payloads to Avro for these topics.
The Camus job failed due to exceeding the 'failed other' error threshold. The resulting behavior in Camus after the failure was surprising; I wanted to check with other developers to see whether the behavior we observed is expected, or whether we have some issue going on with our implementation.
We noticed this behavior when the Camus job failed due to exceeding the 'failed other' threshold:
1. All of the mapper tasks succeeded, and so the TaskAttempt was allowed to commit - this means that all of the data written by Camus was copied to the final HDFS location.
2. The CamusJob threw an exception when it computed the % error rate (this happens after the mapper commit), which caused the job to fail
3. Because the job failed (I think), the Kafka offsets weren't advanced
The problem we ran into with this behavior is that our Camus job is set to run every 5 minutes. So, every 5 minutes data was committed to HDFS, the job failed, and the Kafka offsets weren't updated - this meant we wrote duplicate data until we noticed that our disks were filling up.
I wrote an integration test that confirms the result - it submits 10 good records to a topic, and 10 records that use an unexpected schema to the same topic, runs the Camus job with only that topic whitelisted, and we can see that 10 records are written to HDFS and the Kafka offsets aren't advanced. Below is a snippet of the logs from that test, as well as the properties we used while running the job.
Any help is appreciated - I'm not sure whether this is expected behavior for Camus or whether we have a problem with our implementation, and what the best method is to prevent this behavior (duplicating data).
Thanks ~ Matt
CamusJob properties for the test:
etl.destination.path=/user/camus/kafka/data
etl.execution.base.path=/user/camus/kafka/workspace
etl.execution.history.path=/user/camus/kafka/history
dfs.default.classpath.dir=/user/camus/kafka/libs
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss Z
mapreduce.output.fileoutputformat.compress=false
mapred.map.tasks=15
kafka.max.pull.hrs=1
kafka.max.historical.days=3
kafka.whitelist.topics=advertising.edmunds.admax
log4j.configuration=true
kafka.client.name=camus
kafka.brokers=<kafka brokers>
max.decoder.exceptions.to.print=5
post.tracking.counts.to.kafka=true
monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka
kafka.message.coder.schema.registry.class=com.linkedin.camus.schemaregistry.AvroRestSchemaRegistry
etl.schema.registry.url=<schema repo url>
etl.run.tracking.post=false
kafka.monitor.time.granularity=10
etl.daily=daily
etl.ignore.schema.errors=false
etl.output.codec=deflate
etl.deflate.level=6
etl.default.timezone=America/Los_Angeles
mapred.output.compress=false
mapred.map.max.attempts=2
Log snippet from the test, showing the commit behavior after the mappers succeed and subsequent job failure due to surpassing the 'other' threshold:
[LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task:attempt_local866350146_0001_m_000000_0 is done. And is in the process of committing
[LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task attempt_local866350146_0001_m_000000_0 is allowed to commit now
[EtlMultiOutputFormat] - work path: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0
[EtlMultiOutputFormat] - Destination base path: /user/camus/kafka/data
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.3.2.2.1467979200000.avro
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.7.8.8.1467979200000.avro
[Task] - Task 'attempt_local866350146_0001_m_000000_0' done.
[LocalJobRunner] - Finishing task: attempt_local866350146_0001_m_000000_0
[LocalJobRunner] - map task executor complete.
[Job] - map 100% reduce 0%
[Job] - Job job_local866350146_0001 completed successfully
[Job] - Counters: 23
File System Counters
FILE: Number of bytes read=117251
FILE: Number of bytes written=350942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=10
Map output records=15
Input split bytes=793
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13
Total committed heap usage (bytes)=251658240
com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
DECODE_SUCCESSFUL=10
SKIPPED_OTHER=10
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=5907
total
data-read=840
decode-time(ms)=123
event-count=20
mapper-time(ms)=58
request-time(ms)=12114
skip-old=0
[CamusJob] - Group: File System Counters
[CamusJob] - FILE: Number of bytes read: 117251
[CamusJob] - FILE: Number of bytes written: 350942
[CamusJob] - FILE: Number of read operations: 0
[CamusJob] - FILE: Number of large read operations: 0
[CamusJob] - FILE: Number of write operations: 0
[CamusJob] - Group: Map-Reduce Framework
[CamusJob] - Map input records: 10
[CamusJob] - Map output records: 15
[CamusJob] - Input split bytes: 793
[CamusJob] - Spilled Records: 0
[CamusJob] - Failed Shuffles: 0
[CamusJob] - Merged Map outputs: 0
[CamusJob] - GC time elapsed (ms): 13
[CamusJob] - Total committed heap usage (bytes): 251658240
[CamusJob] - Group: com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
[CamusJob] - DECODE_SUCCESSFUL: 10
[CamusJob] - SKIPPED_OTHER: 10
[CamusJob] - job failed: 50.0% messages skipped due to other, maximum allowed is 0.1%
I'm facing a pretty similar problem: my Kafka/Camus pipeline had been working well for about a year, but recently I got stuck with a duplication issue while integrating ingestion from a remote broker with a very unstable connection and frequent job failures.
Today, while examining the Gobblin documentation, I realized that the Camus sweeper is possibly the tool we are looking for. Try integrating it into your pipeline.
I also think it would be a good idea to migrate to Gobblin (Camus's successor) in the near future.

running camus sample with kafka 0.8

I am new to Camus and I want to try to use it with my Kafka 0.8.
So far I have downloaded the source, created 2 queues like the example expects,
configured the job config file (see below),
and tried to run it on my machine (details below) with this command:
$JAVA_HOME/bin/java -cp camus-example-0.1.0-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob -P /root/Desktop/camus-workspace/camus-master/camus-example/target/camus.properties
The jar contains all the dependencies (it is a shaded jar),
and I am getting this error:
[EtlInputFormat] - Discrading topic : TestQueue
[EtlInputFormat] - Discrading topic : test
[EtlInputFormat] - Discrading topic : DummyLog2
[EtlInputFormat] - Discrading topic : test3
[EtlInputFormat] - Discrading topic : TwitterQueue
[EtlInputFormat] - Discrading topic : test2
[EtlInputFormat] - Discarding topic (Decoder generation failed) : DummyLog
[CodecPool] - Got brand-new compressor
[JobClient] - Running job: job_local_0001
[JobClient] - map 0% reduce 0%
[JobClient] - Job complete: job_local_0001
[JobClient] - Counters: 0
[CamusJob] - Job finished
When I tried to run it with my IntelliJ IDEA editor,
I got the same error, but I found the reason for it:
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
Can someone explain to me what I am doing wrong?
Camus config file:
# Needed Camus properties, more cleanup to come
# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=/root/Desktop/camus-workspace/camus-master/camus-example/target/1
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=/root/Desktop/camus-workspace/camus-master/camus-example/target/2
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=/root/Desktop/camus-workspace/camus-master/camus-example/target3
# Kafka-0.8 handles all zookeeper calls
#zookeeper.hosts=localhost:2181
#zookeeper.broker.topics=/brokers/topics
#zookeeper.broker.nodes=/brokers/ids
# Concrete implementation of the Encoder class to use (used by Kafka Audit, and thus optional for now)
#camus.message.encoder.class=com.linkedin.batch.etl.kafka.coders.DummyKafkaMessageEncoder
# Concrete implementation of the Decoder class to use
camus.message.decoder.class=com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
# Used by avro-based Decoders to use as their Schema Registry
kafka.message.coder.schema.registry.class=com.linkedin.camus.example.DummySchemaRegistry
# Used by the committer to arrange .avro files into a partitioned scheme. This will be the default partitioner for all
# topics that do not have a partitioner specified
#etl.partitioner.class=com.linkedin.camus.etl.kafka.coders.DefaultPartitioner
# Partitioners can also be set on a per-topic basis
#etl.partitioner.class.<topic-name>=com.your.custom.CustomPartitioner
# all files in this dir will be added to the distributed cache and placed on the classpath for hadoop tasks
# hdfs.default.classpath.dir=/root/Desktop/camus-workspace/camus-master/camus-example/target
# max hadoop tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=30
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
# Max minutes for each mapper to pull messages (-1 means no limit)
kafka.max.pull.minutes.per.task=-1
# if whitelist has values, only whitelisted topics are pulled. Nothing on the blacklist is pulled
kafka.blacklist.topics=
kafka.whitelist.topics=DummyLog
log4j.configuration=true
# Name of the client as seen by kafka
kafka.client.name=camus
# Fetch Request Parameters
kafka.fetch.buffer.size=
kafka.fetch.request.correlationid=
kafka.fetch.request.max.wait=
kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=localhost:9092
kafka.timeout.value=
#Stops the mapper from getting inundated with Decoder exceptions for the same topic
#Default value is set to 10
max.decoder.exceptions.to.print=5
#Controls the submitting of counts to Kafka
#Default value set to true
post.tracking.counts.to.kafka=true
log4j.configuration=true
# everything below this point can be ignored for the time being, will provide more documentation down the road
##########################
etl.run.tracking.post=false
kafka.monitor.tier=
etl.counts.path=
kafka.monitor.time.granularity=10
etl.hourly=hourly
etl.daily=daily
etl.ignore.schema.errors=false
# configure output compression for deflate or snappy. Defaults to deflate
etl.output.codec=deflate
etl.deflate.level=6
#etl.output.codec=snappy
etl.default.timezone=America/Los_Angeles
etl.output.file.time.partition.mins=60
etl.keep.count.files=false
etl.execution.history.max.of.quota=.8
mapred.output.compress=true
mapred.map.max.attempts=1
kafka.client.buffer.size=20971520
kafka.client.so.timeout=60000
#zookeeper.session.timeout=
#zookeeper.connection.timeout=
machine details:
hortonworks - hdp 2.0.0.6
with kafka 0.8 beta 1
There is a mistake in the package name.
Change
camus.message.decoder.class=com.linkedin.batch.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
to
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
Also, you need to specify values for some Kafka-related properties or comment them out (that way Camus will use the default values):
# Fetch Request Parameters
# kafka.fetch.buffer.size=
# kafka.fetch.request.correlationid=
# kafka.fetch.request.max.wait=
# kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=localhost:9092
# kafka.timeout.value=

Loading of concatenated bz2 files is not supported (YARN 2.2 + Pig 0.12)

I was processing a concatenated bz2 file using a pig script on top of Pig 0.12 and YARN 2.2 and got the following error message:
ERROR: java.io.IOException: Encountered additional bytes in the filesplit past the crc block. Loading of concatenated bz2 files is not supported
I thought YARN 2.2 should have the fix for concatenated bz2 file handling (https://issues.apache.org/jira/browse/HADOOP-6835), but apparently not yet? Or does Pig handle bzip2 files on its own instead of relying on the underlying MapReduce framework, or do I need to set some parameter?

Pattern match input files for Amazon Elastic MapReduce

I am trying to run a MapReduce streaming job that takes input files from directories in an S3 bucket that match a given pattern. The pattern is something like bucket-name/[date]/product/logs/[hour]/[logfilename]. An example log would be at a path like bucket-name/2013-05-02/product/logs/05/log123456789.
I can get the job to work by passing only the hour portion of the file name as a wildcard. For example: bucket-name/2013-05-02/product/logs/*/. This successfully picks each log file from each hour, and passes them individually to mappers.
The problem comes when I try to also make the date a wildcard, for example: bucket-name/*/product/logs/*/. When I do this, the job gets created but no tasks are created, and eventually it fails. This error is printed in the syslog:
2013-05-02 08:03:41,549 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Matcher.<init>(Matcher.java:207)
at java.util.regex.Pattern.matcher(Pattern.java:888)
at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:378)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:418)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:523)
at org.apache.hadoop.mapred.SkipBadRecords.getMapperMaxSkipRecords(SkipBadRecords.java:247)
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:146)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:722)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4238)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2013-05-02 08:03:41,549 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...
On further testing, it looks like the multiple wildcard syntax works as expected in the command line client. I had trouble getting it to work at first, before realizing that requiring Ruby 1.8.7 meant it requires exactly Ruby 1.8.7, and nothing later.
