PySpark job slows down but runs fast when stopped and re-run

I am running PySpark jobs on AWS EMR (EMR 5.19 & Python 2.7).
Logic: The job processes rules one by one. Each rule reads Parquet files from S3, creates DataFrames, and registers temporary in-memory views that subsequent rules reference via Spark SQL. The SQL output is written to S3. There are around 1,500 rules/SQL statements to execute.
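For illustration, one iteration of the loop looks roughly like this (table names, paths and the SQL are placeholders, not the actual rules):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rules-job").getOrCreate()

# Hypothetical rule: read a Parquet source, expose it as a temp view,
# run the rule's SQL (which may also reference views built by earlier rules),
# and write the result to S3.
source_df = spark.read.parquet("s3://bucket-s3/input/source_table/")
source_df.createOrReplaceTempView("source_table")

result_df = spark.sql("SELECT col_a, COUNT(*) AS cnt FROM source_table GROUP BY col_a")
result_df.createOrReplaceTempView("rule_0001_output")  # available to later rules
result_df.write.mode("overwrite").parquet("s3://bucket-s3/output/rule_0001/")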
Problem: While processing roughly the last 100 rules, the job gets stuck either running the SQL or writing to S3, and stays there for hours.
But if I kill the job and re-run it from the same rule, the same SQL runs quickly, processes the same number of records, the S3 write also completes quickly, and it moves on to the next rule.
Kindly help me understand this behavior: why does it run fast when stopped and re-run, but get stuck when all the rules are run together? Also, let me know if there is anything that can be done while the job is running to speed it up, because the SQL itself does not seem slow.
I tried exploring the logs in EMR (stdout and stderr) but nothing obvious came out. Monitoring also shows CPU and RAM utilization within the limits of the allocation. I tried logging into individual nodes and checked CPU (via top) and RAM (free -h); all seemed OK.
The cluster is launched with 3 core nodes of type r5.2xlarge (64 GB RAM, 8 cores, 64 GB EBS). With auto-scaling enabled, it grows to 6 core nodes and 2 task nodes of the same type.
The job is launched via:
nohup spark-submit --master yarn --num-executors 10 --executor-cores 2 --conf spark.driver.memory=15g \
  --conf spark.sql.codegen.wholeStage=false --conf spark.default.parallelism=10 --conf spark.sql.shuffle.partitions=10 \
  --executor-memory 15g --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8" \
  --py-files s3://bucket-s3/main.py --job apps.MyApp
I am writing to S3 from EMR in Parquet to take advantage of the EMRFS S3-optimized committer, as per this link.
FYI, spark.sql.shuffle.partitions and the G1 garbage collector were used to get rid of out-of-memory errors. Executor memory was also increased earlier from 5g to 15g.
Thanks in advance.

Related

Spark 2.2.0 FileOutputCommitter

DirectFileOutputCommitter is no longer available in Spark 2.2.0. This means writing to S3 takes an insanely long time (3 hours vs. 2 minutes). I'm able to work around this by setting the FileOutputCommitter version to 2 in spark-shell like this:
spark-shell --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The same does not work with spark-sql:
spark-sql --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The above command seems to set version=2, but when the query is executed it still shows version 1 behaviour.
Two questions:
1) How do I get FileOutputCommitter version 2 behaviour with spark-sql?
2) Is there a way I can still use DirectFileOutputCommitter in Spark 2.2.0? [I'm fine with a non-zero chance of missing data]
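For reference, the same version-2 setting expressed programmatically from a PySpark session (whether the spark-sql CLI honours it is exactly what is in question here, so treat this as a sketch, not a confirmed fix):

from pyspark.sql import SparkSession

# Pass the Hadoop property through Spark's spark.hadoop.* prefix...
spark = (
    SparkSession.builder
    .appName("committer-v2")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

# ...or set it directly on the underlying Hadoop configuration.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.algorithm.version", "2"
)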
Related items:
Spark 1.6 DirectFileOutputCommitter
I have been hit by this issue. Spark discourages the use of DirectFileOutputCommitter because it can lead to data loss in race situations, and algorithm version 2 doesn't help much.
I tried saving the data to S3 with gzip compression instead of snappy, which gave some benefit.
The real issue is that Spark writes to s3://<output_directory>/_temporary/0 first and then copies the data from the temporary location to the output. This copy is pretty slow on S3 (generally around 6 MB/s), so with a lot of data you get a considerable slowdown.
The alternative is to write to HDFS first and then use distcp / s3-dist-cp to copy the data to S3.
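A rough PySpark sketch of that pattern (placeholder paths; the s3-dist-cp step runs outside Spark, e.g. as an EMR step):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3distcp").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical source

# 1) Write to HDFS first, where the _temporary rename/commit step is cheap.
df.write.mode("overwrite").parquet("hdfs:///tmp/staging/output")

# 2) Then copy the finished output to S3 in one pass, e.g. on EMR:
#      s3-dist-cp --src hdfs:///tmp/staging/output --dest s3://my-bucket/output/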
You could also look at the committer solution Netflix provided; I haven't evaluated it.
EDIT:
The new Spark 2.4 release has solved the problem of slow S3 writes. I have found that the S3 write performance of Spark 2.4 with Hadoop 2.8 in the latest EMR release (5.24) is almost on par with HDFS writes.
See these documents:
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html
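For reference, a minimal sketch of enabling the EMRFS S3-optimized committer explicitly from PySpark; the property name is the one given in the linked EMR release guide and should be verified against that page (on EMR 5.20+ it is documented as on by default, and it only applies to Parquet writes through EMRFS):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("emrfs-optimized-committer")
    # Property name per the linked EMR docs; verify against that page.
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate()
)

# Hypothetical bucket/path, just to exercise a Parquet write through EMRFS.
spark.range(10).write.mode("overwrite").parquet("s3://my-bucket/committer-test/")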

How to unzip large xml files into one HDFS directory

I have a requirement to load zip files from an HDFS directory, unzip them, and write all the unzipped files back to a single HDFS directory. The files are XML and their sizes run into gigabytes.
I first approached this by writing a MapReduce program with a custom InputFormat and a custom RecordReader to unzip the files and feed the contents to the mappers; each mapper then processes the content and writes to HDFS using a multi-output format. The MapReduce job runs on YARN.
This approach works fine and the unzipped files land in HDFS when the input size is in the MBs, but when the input size is in the GBs, the job fails to write and ends up with the following error:
17/06/16 03:49:44 INFO mapreduce.Job:  map 94% reduce 0%
17/06/16 03:49:53 INFO mapreduce.Job:  map 100% reduce 0%
17/06/16 03:51:03 INFO mapreduce.Job: Task Id : attempt_1497463655394_61930_m_000001_2, Status : FAILED
Container [pid=28993,containerID=container_e50_1497463655394_61930_01_000048] is running beyond physical memory limits. Current usage: 2.6 GB of 2.5 GB physical memory used; 5.6 GB of 12.5 GB virtual memory used. Killing container.
It is apparent that each unzipped file is processed by one mapper, and the YARN child container running that mapper is not able to hold the large file in memory.
On the other hand, I would like to try Spark to unzip the files and write them to a single HDFS directory, running on YARN; I wonder whether with Spark, too, each executor has to process a single file.
I'm looking for a solution that processes the files in parallel but, at the end, writes them to a single directory.
Please let me know whether this is possible in Spark, and share some code snippets.
Any help appreciated.
Actually, the task itself is not failing! YARN is killing the container (inside which the map task is running) because that YARN child is using more memory than it requested from YARN. Rather than moving to Spark, you can simply increase the memory given to the MapReduce tasks.
I would recommend that you:
Increase the YARN child memory, as you are handling GBs of data. Some key properties:
yarn.nodemanager.resource.memory-mb => Container Memory
yarn.scheduler.maximum-allocation-mb => Container Memory Maximum
mapreduce.map.memory.mb => Map Task Memory (must be less than yarn.scheduler.maximum-allocation-mb at any point during runtime)
Focus this job on the data processing (unzipping) only, and invoke another job/command to merge the files.
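If you still want to try the Spark route the question mentions, a rough PySpark sketch could look like this (untested, placeholder paths; note that, as with the MapReduce version, each whole archive must still fit in a single task's memory):

import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unzip-to-hdfs").getOrCreate()
sc = spark.sparkContext

# binaryFiles gives one (path, bytes) record per archive, i.e. one task per zip.
zips = sc.binaryFiles("hdfs:///data/zipped/*.zip")

def extract(record):
    _, payload = record
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in zf.namelist():
            yield zf.read(name).decode("utf-8")

# All extracted XML contents end up as part files under a single output directory.
zips.flatMap(extract).saveAsTextFile("hdfs:///data/unzipped/")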

How to configure Hadoop parameters on Amazon EMR?

I run an MR job with one master and two slave nodes on Amazon EMR, but I get lots of error messages like running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container, after map 100% reduce 35%.
I modified my code by adding the following lines to the Hadoop 2.6.0 MR configuration, but I still get the same error messages:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest2");
//conf.set("mapreduce.input.fileinputformat.split.minsize","3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");
What is the correct way to configure these parameters (mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts) on Amazon EMR? Thank you!
Hadoop 2.x allows you to set the map and reduce settings per job, so you are setting the correct section. The problem is that the Java opts -Xmx heap must be less than mapreduce.map.memory.mb / mapreduce.reduce.memory.mb, because those properties represent the total memory for heap plus off-heap usage. Take a look at the defaults as an example: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-task-config.html (the defaults there leave -Xmx at roughly 80% of the container memory, so for an 8192 MB container that would be around -Xmx6554m). If YARN was killing the containers for exceeding memory with the default settings, it means you need to give more memory to the off-heap portion, i.e. increase the gap between -Xmx and the total mapreduce.map/reduce.memory.mb.
Take a look at the documentation for the AWS CLI. There is a section on Hadoop and how to map settings to the specific XML config files at EMR cluster creation. I have found this to be the best approach available on EMR.

Getting node utilization % in YARN (Hadoop 2.6.0)

In a YARN 2.6.0 cluster, is there a way to get the CPU utilization of all connected nodes at the ResourceManager? Also, is the source code modifiable so that we can choose the nodes for a MapReduce job based on utilization? If yes, where would this change take place?
Please find the implementation of the container monitor (CPU utilization) here:
hadoop-2.6.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
It has methods to check whether a container is over its limit. isProcessTreeOverLimit shows how YARN gets the memory usage of a given container (process):
hadoop-2.6.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java
The above file shows how YARN gets memory usage: by tracking the process files under /proc.
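As a toy illustration of that idea (plain Python reading /proc, not the actual YARN code):

import os

def rss_bytes(pid):
    # /proc/<pid>/statm, second field = resident set size in pages (Linux only).
    with open("/proc/%d/statm" % pid) as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")

print(rss_bytes(os.getpid()))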

Hadoop Error: Java heap space

So, after a percent or so of the job running, I get an error that says "Error: Java heap space" and then something along the lines of "Application container killed".
I am literally running an empty map and reduce job. However, the job takes an input that is roughly 100 GB. For whatever reason, I run out of heap space, even though the job does nothing.
I am using the default configuration on a single machine. It is running Hadoop version 2.2 on Ubuntu, and the machine has 4 GB of RAM.
Thanks!
Note: got it figured out.
It turns out I had set the configuration to use a different terminating token/string. The format of the data had changed, so that token/string no longer existed, and the job was trying to pull all 100 GB into RAM for a single key.
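For illustration only (the post does not name the exact property), a custom record delimiter is the kind of setting that causes this: if the configured delimiter never appears in the input, the whole 100 GB is read as a single record/key. A hypothetical version, shown in PySpark for consistency with the rest of the thread:

from pyspark import SparkContext

sc = SparkContext(appName="delimiter-demo")

# Hypothetical delimiter; if "</record>" never appears in the (changed) data,
# the entire input becomes one giant record for a single key.
conf = {"textinputformat.record.delimiter": "</record>"}
records = sc.newAPIHadoopFile(
    "hdfs:///data/input",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
print(records.first())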
