I recently came across a scenario where a MapReduce job appears to be successful in the ResourceManager (RM), whereas the Pig script returned exit code 8, which refers to "Throwable thrown (an unexpected exception)".
Added the script as requested:
REGISTER '$LIB_LOCATION/*.jar';
-- set number of reducers to 200
SET default_parallel $REDUCERS;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
-- load data from EAP data catalog using given ($ENV = PROD)
data = LOAD 'eap-$ENV://event'
-- using a custom function
USING com.XXXXXX.pig.DataDumpLoadFunc
('{"startDate": "$START_DATE", "endDate" : "$END_DATE", "timeType" : "$TIME_TYPE", "fileStreamType":"$FILESTREAM_TYPE", "attributes": { "all": "true" } }', '$MAPPING_XML_FILE_PATH');
-- filter out null context entity records
filtered = FILTER data BY (attributes#'context_id' IS NOT NULL);
-- group data by session id
session_groups = GROUP filtered BY attributes#'context_id';
-- flatten events
flattened_events = FOREACH session_groups GENERATE FLATTEN(filtered);
-- remove the output directory if exists
RMF $OUTPUT_PATH;
-- store results in specified output location
STORE flattened_events INTO '$OUTPUT_PATH' USING com.XXXX.data.catalog.pig.EventStoreFunc();
And I can see "ERROR 2998: Unhandled internal error. GC overhead limit exceeded" in the Pig logs (log below).
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.mapreduce.FileSystemCounter.values(FileSystemCounter.java:23)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:219)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:199)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:210)
at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:241)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:370)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:391)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:451)
at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:594)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:545)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:543)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:282)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
at org.apache.pig.PigServer.execute(PigServer.java:1405)
at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:624)
Configuration in the pig script looks like below:
SET default_parallel 200;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
Status of the Job in the RM of the Cluster says the job succeeded [can't post the image as my reputation is too low ;) ]
This issue occurs frequently, and we have to restart the job to get it to succeed.
Please let me know a fix for this.
PS: The cluster the job runs on is one of the biggest in the world, so I would say resources and storage space are not a concern.
Thanks
Can you add your pig script here?
I think you get this error because Pig itself (the client process, not the mappers and reducers) can't handle the output.
If you use a DUMP operation in your script, then try to limit the displayed dataset first. Let's assume you have an alias X for your data. Try:
temp = LIMIT X 1;
DUMP temp;
This way you will see only one record and save some resources. You can also use a STORE operation instead (see the Pig manual for how to do it).
You can of course give Pig a bigger heap, but Pig's own heap size is not controlled by the mapreduce.map.* or mapreduce.reduce.* settings. Use the PIG_HEAPSIZE environment variable for that.
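As a minimal sketch of the PIG_HEAPSIZE suggestion above, the following Python wrapper launches Pig with a larger client-side heap. It assumes a `pig` executable is on the PATH and that PIG_HEAPSIZE is interpreted in megabytes, as the stock launcher script does; the helper names, the script name and the value 4096 are placeholders for illustration, not recommendations.

```python
import os
import subprocess


def pig_env(heap_mb, base_env=None):
    """Return a copy of the environment with PIG_HEAPSIZE set (value in MB)."""
    env = dict(os.environ if base_env is None else base_env)
    env["PIG_HEAPSIZE"] = str(heap_mb)
    return env


def run_pig(script_path, heap_mb):
    """Run a Pig script with a bigger client heap and return Pig's exit code."""
    # Assumption: a `pig` launcher is available on the PATH.
    result = subprocess.run(["pig", "-f", script_path], env=pig_env(heap_mb))
    return result.returncode


# Example with placeholder values (not executed here):
# exit_code = run_pig("flatten_events.pig", 4096)
```

Only the Pig client JVM is affected by this variable; the mapper, reducer and ApplicationMaster heaps are still governed by the SET statements in the script.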
From the Oracle docs:
After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so for the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.
As the docs say, you can turn this exception off or increase the heap size.
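To make the quoted rule concrete, here is a small Python illustration of the condition as the docs describe it. It is only a model of the wording above (98% of time in GC, less than 2% of heap recovered, 5 consecutive collections), not the JVM's actual implementation; the function name and the input format are made up for this sketch.

```python
def gc_overhead_limit_exceeded(collections, window=5,
                               max_gc_time=0.98, min_recovered=0.02):
    """Check the quoted rule against a history of garbage collections.

    collections: list of (gc_time_fraction, heap_recovered_fraction),
    oldest first. Returns True only if every one of the last `window`
    collections spent more than `max_gc_time` of the time in GC and
    recovered less than `min_recovered` of the heap.
    """
    if len(collections) < window:
        return False
    recent = collections[-window:]
    return all(t > max_gc_time and r < min_recovered for t, r in recent)
```

A single productive collection inside the window resets the condition, which is why the error tends to appear only when the heap is almost entirely filled with live objects.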
Related
I'm trying to load a dataset (280GB) using the Phoenix csv bulk load tool on a HDInsight Hbase cluster. The job fails with the following error:
18/02/23 06:09:10 INFO mapreduce.Job: Task Id : attempt_1519326441231_0004_m_000067_0, Status : FAILED
Error: Java heap space
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Here's my cluster configuration:
Region Nodes
8 cores, 56 GB RAM, 1.5TB HDD
Master Nodes
4 cores, 28GB, 1.5TB HDD
I tried increasing the value of yarn.nodemanager.resource.memory-mb from 5GB to 38GB, but the job still fails.
Can anyone please help me troubleshoot this issue?
Can you provide more details, such as how you are kicking off the job? Are you following the instructions here: https://blogs.msdn.microsoft.com/azuredatalake/2017/02/14/hdinsight-how-to-perform-bulk-load-with-phoenix/ ?
Specifically, can you provide the command you used and some more information: is the job failing immediately, or does it run for a while and then start to fail? Are there any log messages other than the one you described above?
I am trying to split a large file (15GB) into multiple small files based on a key column inside the file. The same code works fine if I run it on a few thousand rows.
My code is as below.
REGISTER /home/auto/ssachi/piggybank-0.16.0.jar;
input_dt = LOAD '/user/ssachi/sywr_sls_ln_ofr_dtl/sywr_sls_ln_ofr_dtl.txt-10' USING PigStorage(',');
STORE input_dt into '/user/rahire/sywr_sls_ln_ofr_dtl_split' USING org.apache.pig.piggybank.storage.MultiStorage('/user/rahire/sywr_sls_ln_ofr_dtl_split','4','gz',',');
Error is as below
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 6015: During execution, encountered a Hadoop error.
HadoopVersion 2.6.0-cdh5.8.2
PigVersion 0.12.0-cdh5.8.2
I tried setting the below parameters assuming it is a memory issue, but it did not help.
SET mapreduce.map.memory.mb 16000;
SET mapreduce.map.java.opts 14400;
With the above parameters set, I got the error below.
Container exited with a non-zero exit code 1
org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1486048646102_2613_m_000066_3 Info:Exception from container-launch.
What is the cardinality of your key column? Is it in the thousands?
If so, you will get this error because your mappers are dying of an OutOfMemoryError (OOME).
Bear in mind that each mapper then maintains around 1000 file pointers, each with an associated buffer, which together can be enough to occupy the whole heap.
Can you please provide the logs of your mappers for further investigation?
For background, see MultipleOutputs in MapReduce, which is what is being called internally:
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/
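The arithmetic behind the answer above can be sketched in Python as follows. The per-writer buffer size is an assumption made purely for illustration (the real figure depends on the output format and the gz compressor), and the function name is hypothetical.

```python
def writer_buffer_heap_mb(open_files, buffer_bytes):
    """Approximate heap (in MiB) held by buffers of simultaneously open writers."""
    return open_files * buffer_bytes / (1024 * 1024)


# Assumed figures: 1000 distinct keys seen by one mapper, 1 MiB buffered per writer.
estimate_mb = writer_buffer_heap_mb(1000, 1024 * 1024)
print(f"about {estimate_mb:.0f} MiB of heap just for writer buffers")
```

If the estimate approaches the mapper's -Xmx value, raising the heap only postpones the failure; lowering the number of distinct keys each mapper sees is the more direct lever.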
I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found a post which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated, and the new property is called dfs.datanode.max.transfer.threads, with a default value of 4096. If changing this would fix my problem, what new value should I set it to?
Another post indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
Premature EOF can occur for multiple reasons, one of which is spawning a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file and binds a port to it to write to the disk. This limits the number of files that can be written on one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node: the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory overshoot issue, nor could it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
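As a rough sketch of the sizing described above, the following Python helpers estimate how many reducers are needed to stay under a per-reducer file cap. The cap of 12,000 is the approximate figure reported in the answer, not a documented limit; the total of 50,000 files and the assumption that files spread evenly across reducers are hypothetical.

```python
import math


def files_per_reducer(total_files, reducers):
    """Files each reducer writes, assuming an even spread across reducers."""
    return math.ceil(total_files / reducers)


def min_reducers(total_files, max_files_per_reducer):
    """Smallest reducer count that keeps every reducer at or under the cap."""
    return math.ceil(total_files / max_files_per_reducer)


# Hypothetical job writing 50,000 files with a cap of 12,000 files per reducer.
needed = min_reducers(50_000, 12_000)
print(needed, files_per_reducer(50_000, needed))
```

With skewed keys the spread will not be even, so the real per-reducer count can be higher than this estimate.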
I have a large network of over 15 million nodes. I want to remove the property "CONTROL" from all of them using a Cypher query in the neo4j-shell.
If I try and execute any of the following:
MATCH (n) WHERE has(n.`CONTROL`) REMOVE n.`CONTROL` RETURN COUNT(n);
MATCH (n) WHERE has(n.`CONTROL`) REMOVE n.`CONTROL`;
MATCH (n) REMOVE n.`CONTROL`;
the system returns:
Error occurred in server thread; nested exception is:
java.lang.OutOfMemoryError: Java heap space
Even the following query gives the OutOfMemoryError:
MATCH (n) REMOVE n.`CONTROL` RETURN n.`ID` LIMIT 10;
As a test, the following does execute properly:
MATCH (n) WHERE has(n.`CONTROL`) RETURN COUNT(n);
returning 16636351.
Some details:
The memory limit depends on the following settings:
wrapper.java.maxmemory (conf/neo4j-wrapper.conf)
neostore..._memory (conf/neo4j.properties)
Setting these values to a total of 28 GB in both files results in a java_pidXXX.hprof file of about 45 GB (wrapper.java.additional=-XX:+HeapDumpOnOutOfMemoryError).
The only clue I could google was:
...you use the Neo4j-Shell which is just an ops tool and just collects the data in memory before sending back, it was never meant to handle huge result sets.
Is it really not possible to remove properties in large networks using the neo4j-shell and cypher? Or what am I doing wrong?
PS
Additional information:
Neo4j version: 2.1.3
Java versions: Java(TM) SE Runtime Environment (build 1.7.0_76-b13) and OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
The database is 7.4 GB (16636351 nodes, 14724489 relations)
The property "CONTROL" is empty, i.e., it has just been defined for all the nodes without actually assigning a property value.
An example of the exception from data/console.log:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid20541.hprof ...
Dump file is incomplete: file size limit
Exception in thread "GC-Monitor" Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.lang.StringCoding.safeTrim(StringCoding.java:79)
at java.lang.StringCoding.access$300(StringCoding.java:50)
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:305)
at java.lang.StringCoding.encode(StringCoding.java:344)
at java.lang.StringCoding.encode(StringCoding.java:387)
at java.lang.String.getBytes(String.java:956)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.convertToBytes(LayoutWrappingEncoder.java:122)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.doEncode(LayoutWrappingEncoder.java:135)
at ch.qos.logback.core.OutputStreamAppender.writeOut(OutputStreamAppender.java:194)
at ch.qos.logback.core.FileAppender.writeOut(FileAppender.java:209)
at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:219)
at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:103)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:88)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:273)
at ch.qos.logback.classic.Logger.callAppenders(Logger.java:260)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:442)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:396)
at ch.qos.logback.classic.Logger.warn(Logger.java:709)
at org.neo4j.kernel.logging.LogbackService$Slf4jToStringLoggerAdapter.warn(LogbackService.java:243)
at org.neo4j.kernel.impl.cache.MeasureDoNothing.run(MeasureDoNothing.java:84)
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Statistics Gatherer[primitives]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[10.65.4.212:42299]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[10.65.4.212:43614]" java.lang.OutOfMemoryError: Java heap space
see here: http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/
To update data with Cypher it is also necessary to take transaction size into account. For the embedded case, batching transactions is discussed in the next installment of this series. For the remote execution via the Neo4j REST API there are a few important things to remember. Especially with large index lookups and match results, it might happen that the query updates hundreds of thousands of elements. Then a paging mechanism using WITH and SKIP/LIMIT can be put in front of the updating operation.
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WITH a, count(*) AS cnt
SKIP {offset} LIMIT {pagesize}
SET a.movie_count = cnt
RETURN count(*)
Run with pagesize=20000 and increasing offset=0,20000,40000,… until the query returns a count < pagesize
So in your case, repeat the following query until it returns a count of 0. You can also increase the limit to 1M.
MATCH (n) WHERE has(n.`CONTROL`)
WITH n
LIMIT 100000
REMOVE n.`CONTROL`
RETURN COUNT(n);
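The "repeat until it returns 0" loop can be sketched in Python as below. The `run_batch` callable is hypothetical: in practice it would execute the batched Cypher statement above through whatever client you use and return the reported count. Here it is replaced by an in-memory stand-in so the control flow can be run on its own.

```python
def remove_in_batches(run_batch, batch_size=100000, max_rounds=10000):
    """Call run_batch(batch_size) until it reports 0; return the total removed."""
    total = 0
    for _ in range(max_rounds):
        removed = run_batch(batch_size)
        if removed == 0:
            return total
        total += removed
    raise RuntimeError("did not finish within max_rounds")


def make_fake_batch_runner(remaining):
    """Stand-in for the real query: pretends `remaining` nodes still have the property."""
    state = {"remaining": remaining}

    def run_batch(limit):
        n = min(limit, state["remaining"])
        state["remaining"] -= n
        return n

    return run_batch


# Simulated run with the node count from the question.
print(remove_in_batches(make_fake_batch_runner(16636351)))
```

Each round is its own transaction, so only one batch of changes is held in memory at a time.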
My Spark application processes files (average size 20 MB) with a custom Hadoop input format and stores the result in HDFS.
The following is the code snippet.
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
    .newAPIHadoopFile(input, CustomInputFormat.class, Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
    .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
        // my logic goes here
    });
// a few more transformations
result.saveAsTextFile(path);
This application creates one task/partition per file, processes it, and stores the corresponding part file in HDFS.
That is, for 10,000 input files, 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both the mapPartitions and map operations on baseRDD create one task per file.
The SO question "How to set the number of partitions for newAPIHadoopFile?" suggests setting
conf.setInt("mapred.max.split.size", 4); to configure the number of partitions.
But when this parameter is set, the CPU is utilized at maximum and none of the stages start, even after a long time.
If I don't set this parameter, the application completes successfully as described above.
How to set number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with mapred.max.split.size option?
============
update:
What happens with mapred.max.split.size option?
In my use case the files are small, so changing the split size options is irrelevant here.
more info on this SO: Behavior of the parameter "mapred.min.split.size" in HDFS
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the subsequent operations onto fewer partitions, which helps especially if your files are small.
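One way to arrive at "a sane amount" is to size partitions by data volume rather than by file count. The Python sketch below does that arithmetic with the figures from the question (10,000 files of about 20 MB each). The 128 MB target per partition is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math


def suggested_partitions(num_files, avg_file_mb, target_partition_mb=128):
    """Partition count so that each partition holds roughly target_partition_mb."""
    total_mb = num_files * avg_file_mb
    return max(1, math.ceil(total_mb / target_partition_mb))


# Figures from the question: 10,000 input files averaging 20 MB.
print(suggested_partitions(10_000, 20))
```

The result would then be passed as the argument to repartition; it is a starting point to tune against the number of executor cores, not a fixed rule.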