Spark version: 2.3
Hadoop dist: Azure HDInsight 2.6.5
Platform: Azure
Storage: Blob
Nodes in the cluster: 6
Executor instances: 6
Cores per executor: 3
Memory per executor: 8 GB
I am trying to load a CSV file (size 4.5 GB, 280 columns, 2.8 million rows) stored in Azure Blob (wasb) into Parquet format via a Spark DataFrame on the same storage account. I have repartitioned the data with different partition counts, i.e. 20, 40, 60, 100, but I am facing a weird issue where 2 out of the 6 executors, which process a very small subset of the records (< 1%), keep running for an hour or so and eventually complete.
Questions:
1) The partitions processed by these 2 executors have the fewest records (less than 1%), yet they take almost an hour to complete. What is the reason for this? Is this the opposite of a data skew scenario?
2) The local cache folders on the nodes running these executors are filling up (50-60 GB). I am not sure of the reason behind this.
3) Increasing the partition count does bring the overall execution time down to 40 minutes, but I would like to understand the reason for the low throughput on these 2 executors only.
I am new to Spark, so I am looking forward to some pointers for tuning this workload. Additional info from the Spark Web UI is attached.
What Hadoop cluster environment are you using?
1) Ans: Are you using partitionBy while writing the file? If not, try it and check (see the sketch just after this list).
2) Ans: Increase the number of shuffle partitions, i.e. tune "spark.sql.shuffle.partitions".
3) Ans: More specific information (sample data, etc.) is needed to give an answer.
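A minimal sketch of both suggestions combined (paths and the partition column are placeholders, not taken from the question):

// Tune shuffle parallelism before wide operations (the value is only an example)
spark.conf.set("spark.sql.shuffle.partitions", "200")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("wasb:///data/input.csv")               // hypothetical WASB path

df.repartition(100)                            // spread the write across more, smaller tasks
  .write
  .mode("overwrite")
  .partitionBy("some_column")                  // hypothetical column, per suggestion 1
  .parquet("wasb:///data/output_parquet")      // hypothetical output path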
Related
I hope my question is simple. What happens when someone enables Spark's dynamic allocation with a Cassandra database?
I have a 16-node cluster where every node has both Spark and Cassandra installed, in order to also achieve data locality. I am wondering how dynamic allocation works in this case. Spark calculates the workload in order to "hire" workers, right? But how does Spark know the size of the data in Cassandra (in order to calculate the workload) unless it queries it first?
For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and the time needed until Cassandra copies the data from node 3 to node 2?
I tried it with my application and saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors for the further processing. Overall, it took 10 minutes more than the 1 minute the job normally takes without dynamic allocation.
(FYI: I am also using spark-cassandra-connector 3.1.0)
The Spark Cassandra connector estimates the size of the table using the values stored in the system.size_estimates table. For example, if the size_estimates indicates that there are 200K CQL partitions in the table and the mean partition size is 1MB, the estimated table size is:
estimated_table_size = mean_partition_size x number_of_partitions
= 1 MB x 200,000
= 200,000 MB
The connector then calculates the Spark partitions as:
spark_partitions = estimated_table_size / input.split.size_in_mb
= 200,000 MB / 64 MB
= 3,125
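For reference, the split size used in this calculation is configurable on the connector; a small sketch, assuming the 3.x property name and that the value is set when the session is built (128 MB is just an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical contact point
  .config("spark.cassandra.input.split.sizeInMB", "128")    // larger splits => fewer Spark partitions (64 MB in the example above)
  .getOrCreate()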
When there is data locality (the Spark worker/executor JVMs are co-located with the Cassandra JVMs), the connector knows which nodes own the data, so you can take advantage of this functionality by using repartitionByCassandraReplica(): each Spark partition will then be processed by an executor on the same node where the data resides, which avoids shuffling.
For more info, see the Spark Cassandra connector documentation. Cheers!
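A rough sketch of that locality-aware path, assuming an existing SparkSession and an RDD of case-class keys matching the table's partition key (keyspace, table, and column names here are hypothetical):

import com.datastax.spark.connector._

case class StoreKey(store_id: Int)                      // must mirror the table's partition key

// Keys to look up, repartitioned so each Spark partition sits on a replica of the data it needs
val keys = spark.sparkContext.parallelize(Seq(StoreKey(1), StoreKey(2), StoreKey(3)))
val localKeys = keys.repartitionByCassandraReplica("my_keyspace", "my_table", partitionsPerHost = 10)

// Fetch only the matching rows; co-location avoids shuffling them across the network
val rows = localKeys.joinWithCassandraTable("my_keyspace", "my_table")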
I am trying to run a simple query. I assume that running queries with spark.sql("query") has no performance difference compared to the DataFrame API, since I am using Spark 2.1.0 and have the Catalyst optimizer to take care of the optimization part, with Tungsten enabled.
Here I am joining 2 tables with a left-outer join. My 1st table is 200 GB and is the driving table (being on the left side), and the 2nd table is 2 GB; there can be no filters, as per our business requirement.
Configuration of my cluster: as this is a shared cluster, I have been assigned a specific queue which allows me to use 3 TB of memory (yes, 3 terabytes), but the number of vcores is 480. That means I can only run 480 parallel tasks. On top of that, at the YARN level I have a constraint of a maximum of 8 cores per node and a maximum container memory limit of 16 GB. Because of this, I cannot give my executor memory (which is per node) more than 12 GB, as I am giving 3 GB as executor memory overhead to be on the safer side, which comes to 15 GB of per-node memory utilization.
So, calculating from the 480 total allowed vcores with the 8-cores-per-node limit, I get 480/8 = 60 nodes for my computation, which comes to 60*15 = 900 GB of usable memory (I don't know why the total queue memory is assigned as 3 TB). And this is at peak, i.e. if I am the only one using the queue, but that's not always the case.
Now my doubt is how Spark uses this whole 900 GB of memory. From the numbers and stats I would expect my job to run without any issues, as the data size I am trying to process is just 210-250 GB max and I have 900 GB of available memory.
But I keep getting 'container killed' error messages, and I cannot increase the YARN container size because it is set at the YARN level, so the whole cluster would get the increased container size, which is not the right thing to do. I have also tried disabling the vmem-check-enabled property (setting it to false) in my code via sparksession.config(property), but that doesn't help either; maybe I am not allowed to change anything at the YARN level, so it might be ignoring that.
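For reference, a sketch of how the executor sizing described above would typically be expressed (values copied from the description; on Spark 2.1 the overhead property is the YARN-specific one, and these settings are usually passed to spark-submit rather than set on the builder). Note that vmem-check-enabled (yarn.nodemanager.vmem-check-enabled) is a NodeManager-side setting, which is most likely why setting it from the application had no effect.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("big-left-join")                                 // hypothetical app name
  .config("spark.executor.memory", "12g")                   // executor heap, staying under the 16 GB container cap
  .config("spark.yarn.executor.memoryOverhead", "3072")     // 3 GB off-heap overhead, in MB (Spark 2.1 property name)
  .getOrCreate()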
Now, on what basis does Spark split the data initially? Is it based on the block size defined at the cluster level (assuming 128 MB)? I am asking because when my job starts I see that my big table, which is around 200 GB, produces about 2000 tasks, so on what basis does Spark calculate these 2000 tasks (partitions)? Judging by the Input Size/Records and Shuffle Write Size/Records under the Stages tab of the Spark UI, I thought maybe the default partition size when Spark loads my table is quite big, and that this is the reason I get the 'container killed' error and the suggestion to increase the executor memory overhead, which did not help either.
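A rough back-of-the-envelope check of that hypothesis, assuming the splits come out near the 128 MB block size (compression and file layout can change the actual count):

input splits ≈ 200 GB / 128 MB ≈ 1,600

which is in the same ballpark as the ~2,000 tasks observed for the scan stage.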
I tried repartitioning the data from 10k to 100k partitions and tried persisting with MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY, but nothing helped. Many of my tasks were failing, and in the end the job would fail, sometimes with 'container killed', sometimes with direct buffer errors, and others.
Now, what is the use of persist/caching here and how does it behave? I am doing:
val result = spark.sql("query big_table").repartition(10000, $"<column name>").persist()
The column in the repartition is the joining key, so it gets distributed. To make this take effect before the join, I am doing result.show(1), so the action is performed and the data gets persisted to disk; Spark will then read the persisted data from disk for the join, and there will be no load on memory as it is stored in small chunks on disk (am I correct here?).
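On that last point, one caveat: show(1) only computes as many partitions as it needs to return a single row, so it does not necessarily materialize the whole cached result, and persist() with no arguments defaults to MEMORY_AND_DISK for Datasets rather than disk-only. A small sketch of a variant that matches the intent described above (the column name is still a placeholder):

import org.apache.spark.storage.StorageLevel
import spark.implicits._

val result = spark.sql("query big_table")
  .repartition(10000, $"<column name>")          // placeholder join-key column, as above
  .persist(StorageLevel.DISK_ONLY)               // keep the cached blocks off the heap entirely

result.count()                                    // touches every partition, so the full result is materialized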
Why does this same job, with the same big table plus some additional tables in the left join, complete in Hive? Though it takes time, it completes successfully, yet it fails in Spark. Why? Is Spark not a complete replacement for Hive? Doesn't Spark work like Hive when it comes to spilling to disk and writing data to disk while persisting with DISK?
Does the YARN container size play a role if we have a smaller container size but a good number of nodes?
Does Spark combine the memory of all the available nodes (15 GB per node, as per the container size) to load a large partition?
I have been working on a project where we are using Spark SQL as the analytics platform, and currently I am facing issues while joining two DataFrames, df1 and df2:
df1 has 25,000 records
df2 has 127,000 records
When I join these two DataFrames, the join takes a lot of time:
val df_join = df1.join(df2, df2("col1") === df1("col1")).drop(df1("col2"))
I checked the Spark UI for the status and it is showing some astonishing numbers: the input size/records keep increasing weirdly.
Kindly let me know why the input size is increasing so considerably and how I should tune my Spark job.
Attached are screenshots of the cluster:
3-node cluster running on YARN
6 GB for the driver
5 GB allocated per executor and 2 cores per executor
Status of the job after more than 30 minutes: the input size has increased to almost 1000 GB.
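One way to start narrowing this down, assuming the column names from the snippet above: an input size that balloons during a join of two small DataFrames usually points at heavily duplicated join keys (a near-Cartesian result) or at the sources being re-scanned repeatedly, both of which show up in the key counts and the physical plan.

import org.apache.spark.sql.functions.desc

// How many rows share each join key? Heavy duplication on both sides multiplies the output.
df1.groupBy("col1").count().orderBy(desc("count")).show(20)
df2.groupBy("col1").count().orderBy(desc("count")).show(20)

// Inspect the chosen join strategy (broadcast vs. sort-merge) and the scans feeding it
df_join.explain(true)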
I have a 4 GB file with ~16 million lines; the maps run distributed, 6 in parallel out of 15, and generate 35,000 keys. I am using MultipleTextOutputFormat, so each reducer generates an output independent of the other reducers.
I have configured the job with 25-50 reducers, but it always runs only 1 reducer at a time.
Machine: a single machine with 4 cores and 32 GB RAM, running the Hortonworks stack.
How do I get more than one reduce task to run in parallel?
Have a look at the Hadoop MapReduce tutorial:
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Have a look at related SE questions:
How hadoop decides how many nodes will do map and reduce tasks
What is Ideal number of reducers on Hadoop?
After specifying a lower reducer memory of 2 GB (the default in mapred-site.xml was 6 GB), the framework brings up 3 reducers in parallel rather than 1.
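A minimal sketch of the two knobs involved, using the Hadoop MapReduce Java API from Scala (MRv2 property names; the values mirror the numbers mentioned above, and the job name is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
conf.setInt("mapreduce.reduce.memory.mb", 2048)     // 2 GB reducer containers, so more of them fit on a 32 GB node
conf.setInt("mapreduce.job.reduces", 25)            // request 25 reduce tasks (equivalent to job.setNumReduceTasks(25))

val job = Job.getInstance(conf, "multiple-text-output")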
I want to analyze 7 TB of data and store the output in a database, say HBase.
My monthly increment is 500 GB, but to analyze that 500 GB of data I don't need to go through the full 7 TB again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines with the following configuration:
Data node server configuration: 2-2.5 GHz hexa-core CPU, 48 GB RAM, 1 TB 7200 RPM disks (x 8)
Number of data nodes: 5
Name node server: enterprise-class server configuration (x 2) (1 additional for the secondary name node)
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing:
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default values:
repl_count = 3 (default)
comp_ratio = 3-4 (default)
intermediate data size = 30%-50% of raw data size
1.2 factor = temp space
So for your first year you will need about 16.9 TB. You have 8 TB * 5 nodes == 40 TB, so space is not the issue.
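For reference, roughly where the 16.9 TB comes from when plugging the question's numbers into the formula (7 TB initial, ~6 TB yearly growth at 500 GB/month, ~30% intermediate data); with the defaults, the replication and temp-space factors are largely cancelled out by a compression ratio near the top of the 3-4 range:

raw + growth           = 7 TB + 6 TB       = 13 TB
intermediate (~30%)    = 0.3 x 13 TB       ≈ 3.9 TB
first-year data volume = 13 TB + 3.9 TB    ≈ 16.9 TB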
Performance
5 data nodes: reading 1 TB takes on average 2.5 hours on a single drive (source: Hadoop - The Definitive Guide), so 600 GB on one drive would be about 1.5 hours. Assuming the data is replicated so that you can use all 5 nodes in parallel, reading the whole data set with 5 nodes can come down to roughly 18 minutes.
You may have to add some more time depending on what your queries do and how you have configured your data processing.
Memory consumption
48 GB is not much; the RAM for many data nodes these days starts at 128 GB. If you use the cluster only for processing, it might work out, depending a bit on how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you will run into heap errors.
To sum it up:
It depends a lot on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes of processing time for 600 GB of data (as a baseline; real values depend on many factors unknown when answering this question) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give a tremendous speed boost by switching to a columnar compressed format like ORC or Parquet; we're talking about potential 30x-40x improvements in query performance. With the latest Hive you can leverage streaming data ingest into ORC files.
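As one illustration of that conversion step (sketched here with Spark rather than Hive's streaming ingest; paths, columns, and table names are hypothetical, and a SparkSession is assumed):

// Rewrite incoming raw files as compressed, columnar ORC, partitioned for pruning
val raw = spark.read
  .option("header", "true")
  .csv("hdfs:///landing/events/")              // hypothetical landing path for new arrivals

raw.write
  .mode("append")
  .format("orc")
  .partitionBy("event_date")                   // hypothetical partition column
  .saveAsTable("analytics.events_orc")         // hypothetical Hive table name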
You can leave things as you planned (HBase + Hive) and just rely on the brute force of 5 x (6-core, 48 GB, 7200 RPM) nodes, but you don't have to. A bit of work can get you into interactive ad-hoc query territory, which will open up your data analysis.