Error in pig script while processing large file - hadoop

I am trying to split a large file (15GB) into multiple small files based on a key column inside the file.The same code works fine if i run it on few 1000s of rows.
My code is as below.
REGISTER /home/auto/ssachi/piggybank-0.16.0.jar;
input_dt = LOAD '/user/ssachi/sywr_sls_ln_ofr_dtl/sywr_sls_ln_ofr_dtl.txt-10' USING PigStorage(',');
STORE input_dt into '/user/rahire/sywr_sls_ln_ofr_dtl_split' USING org.apache.pig.piggybank.storage.MultiStorage('/user/rahire/sywr_sls_ln_ofr_dtl_split','4','gz',',');
Error is as below
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 6015: During execution, encountered a Hadoop error.
HadoopVersion 2.6.0-cdh5.8.2
PigVersion 0.12.0-cdh5.8.2
I tried setting the below parameters assuming it is a memory issue, but it did not help.
SET mapreduce.map.memory.mb 16000;
SET mapreduce.map.java.opts 14400;
With the above parameters set, i got the below error.
Container exited with a non-zero exit code 1
org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1486048646102_2613_m_000066_3 Info:Exception from container-launch.

Whats the Cardinality of your " key column " is it in 1000?
If its in 1000 then you will get the error as your Mappers are dying because of OOME.
Do understand each Mapper now maintain 1000 file pointers and a associated buffer for each filePointer enough to occupy whole of your heap.
Can you please provide logs of your mappers for further investigation
Multioutput in MapReduce which is being called internally.
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/

Related

java.lang.StackoverflowError when writing dataframe into Postgresql using JDBC

I'm trying to write the result of multiple operations into an AWS Aurora PostgreSQL cluster. All the calculations performs right but, when I try to write the result into the database I get the next error:
py4j.protocol.Py4JJavaError: An error occurred while calling o12179.jdbc.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
I already tried to increase cluster size (15 r4.2xlarge machines), change number of partitions for the data to 120 partitions, change executor and driver memory to 4Gb each and I'm facing the same results.
The current SparkSession configuration is the next:
spark = pyspark.sql.SparkSession\
.builder\
.appName("profile")\
.config("spark.sql.shuffle.partitions", 120)\
.config("spark.executor.memory", "4g").config("spark.driver.memory", "4g")\
.getOrCreate()
I don't know if is a Spark configuration problem or if it's a programming problem.
Finally I found the problem.
The problem was an iterative read from S3 creating a really big DAG. I changed the way I read CSV files from S3 with the following instruction.
df = spark.read\
.format('csv')\
.option('header', 'true')\
.option('delimiter', ';')\
.option('mode', 'DROPMALFORMED')\
.option('inferSchema', 'true')\
.load(list_paths)
Where list_paths is a precalculated list of paths to S3 objects.

Phoenix csv Bulk load fails with large data sets

I'm trying to load a dataset (280GB) using the Phoenix csv bulk load tool on a HDInsight Hbase cluster. The job fails with the following error:
18/02/23 06:09:10 INFO mapreduce.Job: Task Id :
attempt_1519326441231_0004_m_000067_0, Status : FAILEDError: Java heap
spaceContainer killed by the ApplicationMaster.Container killed on
request. Exit code is 143Container exited with a non-zero exit code
143
Here's my cluster configuration:
Region Nodes
8 cores, 56 GB RAM, 1.5TB HDD
Master Nodes
4 cores, 28GB, 1.5TB HDD
I tried increasing the value of yarn.nodemanager.resource.memory-mb from 5GB to 38GB, but the job still fails.
Can anyone please help me troubleshoot this issue?
Can you provide more details ? Such as how are you kicking off the job? Are you following the instructions here - https://blogs.msdn.microsoft.com/azuredatalake/2017/02/14/hdinsight-how-to-perform-bulk-load-with-phoenix/ ?
Specifically Can you provide the command you used and also some more info as in is the job failing immediately or does it run for a while and then start to fail? Any other log messages than the one you described above ?

Map Reduce Completed but pig Job Failed

I recently came across this scenario where a MapReduce job seems to be successful in RM where as the PIG script returned with an exit code 8 which refers to "Throwable thrown (an unexpected exception)"
Added the script as requested:
REGISTER '$LIB_LOCATION/*.jar';
-- set number of reducers to 200
SET default_parallel $REDUCERS;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
-- load data from EAP data catalog using given ($ENV = PROD)
data = LOAD 'eap-$ENV://event'
-- using a custom function
USING com.XXXXXX.pig.DataDumpLoadFunc
('{"startDate": "$START_DATE", "endDate" : "$END_DATE", "timeType" : "$TIME_TYPE", "fileStreamType":"$FILESTREAM_TYPE", "attributes": { "all": "true" } }', '$MAPPING_XML_FILE_PATH');
-- filter out null context entity records
filtered = FILTER data BY (attributes#'context_id' IS NOT NULL);
-- group data by session id
session_groups = GROUP filtered BY attributes#'context_id';
-- flatten events
flattened_events = FOREACH session_groups GENERATE FLATTEN(filtered);
-- remove the output directory if exists
RMF $OUTPUT_PATH;
-- store results in specified output location
STORE flattened_events INTO '$OUTPUT_PATH' USING com.XXXX.data.catalog.pig.EventStoreFunc();
And I can see "ERROR 2998: Unhandled internal error. GC overhead limit exceeded" in the pig logs.(log below)
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.mapreduce.FileSystemCounter.values(FileSystemCounter.java:23)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:219)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:199)
at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:210)
at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:241)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:370)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:391)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:451)
at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:594)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:545)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:543)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:282)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
at org.apache.pig.PigServer.execute(PigServer.java:1405)
at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:624)
Configuration in the pig script looks like below:
SET default_parallel 200;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;
SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;
SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;
Status of the Job in the RM of the Cluster says the job succeeded [can't post the image as my reputation is too low ;) ]
This issue occurs frequently and we have to restart the job the job successful.
Please let me know a fix for this.
PS: The cluster the job is running is one of the biggest in the world, so no worry with resources or the storage space I say.
Thanks
Can you add your pig script here?
I think, you get this error because the pig itself (not mappers and reducers) can't handle the output.
If you use DUMP operation it your script, then try to limit the displayed dataset first. Let's assume, you have a X alias for your data. Try:
temp = LIMIT X 1;
DUMP temp;
Thus, you will see only one record and save some resources. You can do a STORE operation as well (see in pig manual how to do it).
Obviously, you can configure pig's heap size to be bigger, but pig's heap size is not mapreduce.map or mapreduce.reduce. Use PIG_HEAPSIZE environment variable to do that.
From oracle docs:
After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so far the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit
As said in docs, you can turn this exception off or increase heap size.

Hadoop MapReduce job I/O Exception due to premature EOF from inputStream

I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads with default value 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
Premature EOF can occur due to multiple reasons, one of which is spawning of huge number of threads to write to disk on one reducer node using FileOutputCommitter. MultipleOutputs class allows you to write to files with custom names and to accomplish that, it spawns one thread per file and binds a port to it to write to the disk. Now this puts a limitation on the number of files that could be written to at one reducer node. I encountered this error when the number of files crossed 12000 roughly on one reducer node, as the threads got killed and the _temporary folder got deleted leading to plethora of these exception messages. My guess is - this is not a memory overshoot issue, nor it could be solved by allowing hadoop engine to spawn more threads. Reducing the number of files being written at one time at one node solved my problem - either by reducing the actual number of files being written, or by increasing reducer nodes.

amazon s3n integration with hadoop mapreduce not working

I am trying to run some map reduce job over the files which are stored in amazon s3. I saw http://wiki.apache.org/hadoop/AmazonS3 and following it to do the integration. Here is my code which sets the input directory for the map reduce job
FileInputFormat.setInputPaths(job, "s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt");
When i run the mapreduce job i am getting this exception
Exception in thread "main" java.lang.IllegalArgumentException:
Wrong FS: s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt,
expected: s3n://myAccessKey:mySecretKey#myS3Bucket
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:294)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:321)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.convertAndLoadData(DataMigrationManager.java:340)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.migrateData(DataMigrationManager.java:300)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.migrate(DataMigrationManager.java:166)
at com.appdynamics.blitz.command.DataMigrationCommand.run(DataMigrationCommand.java:53)
at com.appdynamics.blitz.command.DataMigrationCommand.run(DataMigrationCommand.java:21)
at com.yammer.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:58)
at com.yammer.dropwizard.cli.Cli.run(Cli.java:53)
at com.yammer.dropwizard.Service.run(Service.java:61)
at com.appdynamics.blitz.service.BlitzService.main(BlitzService.java:84)
I can't find resource to help me on this. Any pointer will be deeply appreciated.
You're just gonna have to keep playing with
Wrong FS: s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt
The path you're giving Hadoop just isn't correct and it won't work until it can access the correct files.
So i found the problem. It was caused by this bug
https://issues.apache.org/jira/browse/HADOOP-3733
Even though i replaced the "/" by "%2F" it kept giving the same problem. I regenerated the keys and put one where is no "/" in the secret key and it fixed the issue.

Resources