Hadoop streaming "GC overhead limit exceeded"

I am running this command:
hadoop jar hadoop-streaming.jar -D stream.tmpdir=/tmp -input "<input dir>" -output "<output dir>" -mapper "grep 20151026" -reducer "wc -l"
where <input dir> is a directory containing many Avro files, and I am getting this error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hdfs.protocol.DatanodeID.updateXferAddrAndInvalidateHashCode(DatanodeID.java:287)
at org.apache.hadoop.hdfs.protocol.DatanodeID.<init>(DatanodeID.java:91)
at org.apache.hadoop.hdfs.protocol.DatanodeInfo.<init>(DatanodeInfo.java:136)
at org.apache.hadoop.hdfs.protocol.DatanodeInfo.<init>(DatanodeInfo.java:122)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:633)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:793)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1252)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1270)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy15.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNextNoFilter(DistributedFileSystem.java:888)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNext(DistributedFileSystem.java:863)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:267)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
How can this issue be resolved?

It took a while, but I found the solution here.
Prepending HADOOP_CLIENT_OPTS="-Xmx1024M" to the command solves the problem.
The final command line is:
HADOOP_CLIENT_OPTS="-Xmx1024M" hadoop jar hadoop-streaming.jar -D stream.tmpdir=/tmp -input "<input dir>" -output "<output dir>" -mapper "grep 20151026" -reducer "wc -l"
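The out-of-memory error happens in the client JVM while it lists the input directory (the trace runs through FileInputFormat.listStatus during job submission), so the heap being raised here is the client's, not the tasks'. If the listing is very large, the variable can also be exported once for the session and given a larger value; a minimal sketch, where 2048M is only an example value to tune:
# export once instead of prefixing every command; 2048M is an arbitrary example value
export HADOOP_CLIENT_OPTS="-Xmx2048M"
hadoop jar hadoop-streaming.jar -D stream.tmpdir=/tmp -input "<input dir>" -output "<output dir>" -mapper "grep 20151026" -reducer "wc -l"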

Related

Compression codec com.hadoop.compression.lzo.LzoCodec was not found

Trying to run a mapreduce job with compression
hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
randomtextwriter \
-Ddfs.replication=1 -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec \
/tmp/randomtextwriter
I used parcels to distribute LZO to all nodes in the cluster. Even then I am getting the error below:
Error: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec was not found.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:140)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getSequenceWriter(SequenceFileOutputFormat.java:56)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:75)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:659)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1731)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2409)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:138)
... 10 more
As a temporary solution, you can manually add the hadoop-lzo jar to the Hadoop classpath:
curl -O https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/hadoop-lzo-0.4.19.jar
hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
randomtextwriter \
-libjars hadoop-lzo-0.4.19.jar \
-Ddfs.replication=1 -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec \
/tmp/randomtextwriter
Note that -libjars is a generic option and must appear before the positional output path, otherwise it is not parsed. Please make sure you download a hadoop-lzo version that is compatible with your Hadoop version.
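If the codec also fails to resolve in the client JVM at job-submission time, exporting HADOOP_CLASSPATH before launching is a common companion step; a sketch, assuming the jar was downloaded into the current directory:
# make the codec class visible to the submitting JVM as well
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(pwd)/hadoop-lzo-0.4.19.jar
The -libjars flag in the command above still has to be there, since that is what ships the jar to the map and reduce tasks.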

Unable to import data from Hdfs to Hbase using importtsv

I moved a tab-delimited file into HDFS and am now trying to load it into HBase.
Below is my importtsv command:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ok,cf:ek,cf:rk,cf:rsk,cf:pdk,cf:pmk,cf:omk,cf:sok,cf:sdk,cf:cdk,cf:q,cf:uc,cf:up,cf:usp,cf:gm,cf:st,cf:gp -Dimporttsv.skip.bad.lines=false 'sales_fact' hdfs://localhost:54310/my/file.txt
It is trying to read a jar from a location that doesn't exist:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/home/elijah/Downloads/hbase/lib/htrace-core-3.1.0-incubating.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1072)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:265)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:389)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:738)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:747)
I don't understand why it has mixed the HDFS and local directory paths into one:
hdfs://localhost:54310/home/elijah/Downloads/hbase/lib/htrace-core-3.1.0-incubating.jar
The user running the import job has full access to the hbase lib directory on the local filesystem.
I can see the -libjars option is missing. You can use the -libjars option; below is an example usage:
hadoop jar \
hbase-server-0.98.6-cdh5.2.1.jar \
importtsv \
-libjars /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/high-scale-lib-1.1.1.jar \
-Dimporttsv.separator=, -Dimporttsv.bulk.output=output \
-Dimporttsv.columns=HBASE_ROW_KEY,f:count wordcount \
word_count.csv
You can also do something like this:
# export HADOOP_CLASSPATH=`./hbase classpath`
The jar that was missing, hbase/lib/htrace-core-3.1.0-incubating.jar, is included in the hbase classpath output, so this should work in your case.
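Putting the two together, a sketch of how the original command could be rerun (the column list, table name, and namenode address are copied from the question; the assumption is that it is launched from HBase's bin directory so that ./hbase resolves):
export HADOOP_CLASSPATH=`./hbase classpath`
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns=HBASE_ROW_KEY,cf:ok,cf:ek,cf:rk,cf:rsk,cf:pdk,cf:pmk,cf:omk,cf:sok,cf:sdk,cf:cdk,cf:q,cf:uc,cf:up,cf:usp,cf:gm,cf:st,cf:gp \
-Dimporttsv.skip.bad.lines=false \
'sales_fact' hdfs://localhost:54310/my/file.txt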

OrcNewInputformat as a inputformat for hadoop streaming

I am using Hadoop streaming and I want to use OrcNewInputFormat as the input format.
I am executing this command:
hadoop jar hadoop-streaming.jar -libjars /usr/hdp/2.2.4.2-2/hive/lib/hive-exec.jar -input /user/orcfiles -output /streamf -mapper 'cat' -inputformat org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat -outputformat org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat
But I am getting below exception:
Exception in thread "main" java.lang.RuntimeException: class org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat not org.apache.hadoop.mapred.InputFormat
at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:2150)
at org.apache.hadoop.mapred.JobConf.setInputFormat(JobConf.java:702)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:796)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
From the javadocs at http://hive.apache.org/javadocs/r1.2.0/api/ I can see that OrcNewInputFormat extends org.apache.hadoop.mapreduce.InputFormat, but the exception says that org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat is not an org.apache.hadoop.mapred.InputFormat.
What am I missing here?
It is working fine now; I was giving the wrong class name.
This has been a very popular question judging by the number of views, but it still lacks an "answer" giving the correct class names, so to complete it:
The correct arguments are -inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat -outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
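Hadoop streaming configures jobs through the old org.apache.hadoop.mapred API (JobConf.setInputFormat in the stack trace), which is why the mapreduce-API class OrcNewInputFormat is rejected while OrcInputFormat is accepted. Applied to the command from the question, the corrected invocation would look roughly like this:
hadoop jar hadoop-streaming.jar \
-libjars /usr/hdp/2.2.4.2-2/hive/lib/hive-exec.jar \
-input /user/orcfiles -output /streamf \
-mapper 'cat' \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat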
In my case, I had to remove the environment variable below (or set it to false) before running the pig command:
export HADOOP_USE_CLIENT_CLASSLOADER='true'

Mahout - Error while running trainnb

Using Mahout seq2sparse command, I manage to successfully create the following folders in HDFS
df-count
dictionary.file-0
frequency.file-0
tf-vectors
tfidf-vectors
tokenized-documents
wordcount
After that when I run the trainnb command with the following syntax
mahout trainnb -i tweet-vectors -el -li labelindex -o model -ow -c
I get the following error. Does anyone know how to resolve it?
Exception in thread "main" java.lang.IllegalStateException: hdfs://machineinfo:8020/user/hhhh/tweetvectors/df-count
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:115)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:106)
at com.google.common.collect.Iterators$8.transform(Iterators.java:860)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:122)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.FileNotFoundException: File does not exist: /user/hhhh/tweet-vectors/df-count
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1499)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:63)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:110)
... 22 more
It seems mahout cannot see the file /user/hhhh/tweet-vectors/df-count in HDFS.
First, try hadoop dfs -ls /user/hhhh/tweet-vectors/df-count to verify the file exists.
If it doesn't exist, there's your problem. If it does exist, check whether it is a file or a directory; Mahout seems to be looking for a file, not a directory.
If it exists and it is a file, then verify that Mahout is connecting to the same Hadoop namenode instance where the file is stored.
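For example, two quick checks from the client machine (hdfs getconf assumes a reasonably recent Hadoop client; on older releases the configured namenode is in core-site.xml):
hadoop dfs -ls /user/hhhh/tweet-vectors/df-count   # a leading 'd' in the permissions column means it is a directory
hdfs getconf -confKey fs.defaultFS                 # which namenode this client is configured to use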

shell script not found in hadoop

I am new to Hadoop and Hadoop streaming, so this error is probably something obvious that I am missing.
I run an inline awk mapper command and it works fine.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input input -output output -mapper "/usr/bin/awk -F'\t' '\$1==\"and\"'" -reducer NONE
However, when I put the awk command in a file and run it, I get a Java IOException on all machines in the cluster.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input one_to_one/part* -output 4 -mapper test.sh -reducer NONE -file test.sh
test.sh
/usr/bin/awk -F '\t' '(NR == 1) {print NR,OFS,$0;}'
Exception:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:230)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "/Users/hadoop/tmp/taskTracker/hadoop/jobcache/job_201207051227_0012/attempt_201207051227_0012_m_000000_0/work/./test.sh": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
... 24 more
Try the following:
Add a shebang to your test.sh:
#!/bin/bash
Change -file to -files:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar \
-input one_to_one/part* -output 4 -mapper test.sh -reducer NONE -files test.sh
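With the shebang added, test.sh would look like this (same awk body as in the question):
#!/bin/bash
/usr/bin/awk -F '\t' '(NR == 1) {print NR,OFS,$0;}'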
