Spark Performance On Individual Record Lookups - hadoop

I am conducting a performance test which compares queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark showed query execution times that were on par with or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one notable exception: a query that involved key-based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.
After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.
Spark 1.6.1
Spark Conf: Executors 2, Executor Memory 32G, Executor Cores 4.
Data is in an internal Hive Table which is stored as ORC file types compressed with zlib. The total size of the compressed files is ~2.2 GB.
Here is the query code.
#Python API
#orc with zlib key based select
dforczslt = sqlContext.sql("SELECT * FROM dev.perf_test_orc_zlib WHERE test_id= 12345678987654321")
dforczslt.show()
The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried enabling predicate pushdown via the SQL context configs, but this resulted in no noticeable performance increase. Also, when this same test was conducted using Parquet, the query time was on par with Hive. I'm sure there are other ways to improve the performance of these queries, such as using RDDs vs. DataFrames, but I'm really looking to understand how Spark interacts with ORC files in a way that results in this gap.
Let me know if I can provide additional clarification around any of the talking points listed above.

The following steps might help to improve the performance of the Spark SQL query.
In general, Hive can use the memory of the whole Hadoop cluster, which is significantly larger than the executor memory available here (2 * 32 = 64 GB). What is the memory size of the nodes?
Further, the number of executors (2) seems low compared to the number of map/reduce jobs generated by the Hive query. Increasing the number of executors in multiples of 2 might help to improve the performance.
In Spark SQL and DataFrames, optimised execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. These features can be enabled by setting spark.sql.tungsten.enabled to true if they are not already enabled.
sqlContext.setConf("spark.sql.tungsten.enabled", "true")
The columnar nature of the ORC format helps to avoid reading unnecessary columns. However, we are still reading unnecessary rows even if the query has a WHERE clause filter. ORC predicate push-down would improve the performance with its built-in indexes. Here, ORC predicate push-down is disabled in Spark SQL by default and needs to be explicitly enabled.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
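To make the idea behind predicate push-down more concrete, here is a minimal pure-Python sketch (not Spark or ORC code) of how per-stripe min/max statistics can let a reader skip data for an equality filter. The `Stripe` class, the `lookup` function, and the sample keys are hypothetical and only illustrate the principle; real ORC files keep comparable statistics at the file, stripe, and row-group level.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stripe:
    """A toy stand-in for an ORC stripe: rows plus min/max statistics."""
    rows: List[int]

    @property
    def min_key(self) -> int:
        return min(self.rows)

    @property
    def max_key(self) -> int:
        return max(self.rows)


def lookup(stripes: List[Stripe], key: int, pushdown: bool):
    """Return (matching rows, number of rows actually scanned)."""
    matches, scanned = [], 0
    for stripe in stripes:
        # With push-down, consult the statistics first and skip stripes
        # whose [min, max] range cannot contain the key.
        if pushdown and not (stripe.min_key <= key <= stripe.max_key):
            continue
        scanned += len(stripe.rows)
        matches.extend(r for r in stripe.rows if r == key)
    return matches, scanned


# Hypothetical data: 10 stripes of 1,000 sorted keys each.
stripes = [Stripe(list(range(i * 1000, (i + 1) * 1000))) for i in range(10)]

without_pd = lookup(stripes, 4242, pushdown=False)
with_pd = lookup(stripes, 4242, pushdown=True)
print("rows scanned without push-down:", without_pd[1])
print("rows scanned with push-down:   ", with_pd[1])
```

Note that skipping only pays off when the key values are clustered. If the looked-up column is scattered randomly across the file, almost every stripe's min/max range will contain the key and little or nothing can be skipped.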
I would recommend that you do some more research to find any other potential performance blockers.

Related

How do we optimise the spark job if the base table has 130 billions of records

We are joining multiple tables and doing complex transformations and enrichments.
The base table will have around 130 billion records. How can we optimise the Spark job when Spark filters all the records, keeps them in memory, and does the enrichments with other left-outer-joined tables? Currently the Spark job runs for more than 7 hours. Can you suggest some techniques?
Here is what you can try
Partition the base tables that you query. Create the partitions on a specific column that you use when joining, such as Department or Date. If the underlying table is a Hive table, you can also try bucketing.
Try optimised joins that suit your requirement, such as sort-merge join or hash join.
File format: use the Parquet file format, as it is much faster than ORC for analytical queries, and it also stores data in columnar format.
If your query has multiple steps and some steps are reused, try caching, as Spark supports memory and disk caching.
Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how the data is distributed. Try various configurations and see what works best for you.
Spark might perform poorly if there is large skew in the data. If that is the case, you might need further optimisation to handle it.
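One common way to handle skew, not named in the answer above, is key salting. The following is a minimal pure-Python simulation of the idea rather than Spark code: `partition_of` is a hypothetical stand-in for a hash partitioner, and the key distribution is made up for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


rng = random.Random(0)
# Hypothetical skewed data: one hot key plus many small keys.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

plain = partition_sizes(keys)
# Salting: append a random suffix so the hot key spreads across
# several partitions. The other side of the join must be expanded
# with every salt value so that rows still match.
salted = partition_sizes(f"{k}#{rng.randrange(SALT_BUCKETS)}" for k in keys)

print("largest partition without salt:", max(plain))
print("largest partition with salt:   ", max(salted))
```

Without salting, all rows for the hot key end up in a single partition, so one task does most of the work. With salting the same rows are spread over several partitions, at the cost of duplicating the smaller side of the join once per salt value.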
Apart from the techniques mentioned above, you can try the options below as well to optimize your job.
1. You can partition your data by inspecting your data fields. The most common columns used for partitioning are date columns, region ID, country code, etc. Once the data is partitioned, you can explain your DataFrame with df.explain() and see if it is using PartitioningAwareFileIndex.
2. Try tuning the Spark settings and cluster configuration to scale with the input data volume.
Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains by changing this parameter.
Use an appropriate number of executors, cores, and executor memory based on compute needs.
Try analyzing the Spark history to identify the stages and jobs that consume significant time. This would be a good point to start debugging your job.
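As a rough illustration of why spark.sql.files.maxPartitionBytes matters, the sketch below estimates how many scan tasks a file-based read produces for different settings. It is simple arithmetic under stated assumptions: the 1 TiB input size is hypothetical, and Spark's actual split size also takes spark.sql.files.openCostInBytes and the available parallelism into account, so real task counts will differ.

```python
import math

MB = 1024 * 1024


def estimated_scan_partitions(total_bytes: int, max_partition_bytes: int) -> int:
    """Rough estimate: one input partition per max_partition_bytes of data."""
    return math.ceil(total_bytes / max_partition_bytes)


total = 1024 * 1024 * MB  # hypothetical 1 TiB of input files
for size_mb in (128, 256, 512):  # 128 MB is the default setting
    parts = estimated_scan_partitions(total, size_mb * MB)
    print(f"{size_mb} MB -> ~{parts} scan tasks")
```

Fewer, larger partitions reduce per-task scheduling overhead on very large inputs, but each task then needs more memory, so the value should be tuned together with executor memory and cores.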

How to join big dataframes in Spark SQL? (best practices, stability, performance)

I'm getting the same error as in Missing an output location for shuffle when joining big dataframes in Spark SQL. The recommendation there is to set MEMORY_AND_DISK and/or spark.shuffle.memoryFraction 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0 and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or Dataframe, right? Also I'm getting lots of other WARN logs and task retries that lead me to think that the job is not stable.
Therefore, my question is:
What are best practices to join huge dataframes in Spark SQL >= 1.6.0?
More specific questions are:
How to tune number of executors and spark.sql.shuffle.partitions to achieve better stability/performance?
How to find the right balance between level of parallelism (num of executors/cores) and number of partitions? I've found that increasing the num of executors is not always the solution as it may generate I/O reading time out exceptions because of network traffic.
Is there any other relevant parameter to be tuned for this purpose?
My understanding is that joining data stored as ORC or Parquet offers better performance than text or Avro for join operations. Is there a significant difference between Parquet and ORC?
Is there an advantage of SQLContext vs HiveContext regarding stability/performance for join operations?
Is there a difference regarding performance/stability when the dataframes involved in the join are previously registerTempTable() or saveAsTable()?
So far I'm using this answer and this chapter as a starting point. There are also a few more Stack Overflow pages related to this subject, yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
That is a lot of questions. Allow me to answer them one by one:
Your number of executors is most of the time variable in a production environment. This depends on the available resources. The number of partitions is important when you are performing shuffles. Assuming that your data is not skewed, you can lower the load per task by increasing the number of partitions.
A task should ideally take a couple of minutes. If the task takes too long, it is possible that your container gets pre-empted and the work is lost. If the task takes only a few milliseconds, the overhead of starting the task becomes dominant.
For the level of parallelism and tuning your executor sizes, I would like to refer you to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
ORC and Parquet only encode the data at rest. When doing the actual join, the data is in Spark's in-memory format. Parquet is getting more popular since Netflix and Facebook adopted it and put a lot of effort into it. Parquet allows you to store the data more efficiently and has some optimisations (predicate pushdown) that Spark uses.
You should use the SQLContext instead of the HiveContext, since the HiveContext is deprecated. The SQLContext is more general and doesn't only work with Hive.
When performing registerTempTable, the data is stored within the SparkSession. This doesn't affect the execution of the join. What it stores is only the execution plan, which gets invoked when an action is performed (for example saveAsTable). When performing a saveAsTable, the data gets stored on the distributed file system.
Hope this helps. I would also suggest watching our talk at the Spark Summit about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. This might provide you some insights.
Cheers, Fokko

Hive query having 15 tables join is expected to generate 1 Billion records, on 3 datanodes, 16GB RAM each Is this the right way to do?

My name is Vitthal.
The Hortonworks HDP 2.4 Cluster on Amazon is 3 Datanodes, Masters on different Instances.
7 Instances 16GB RAM each.
Total 1TB HDD Space
3 Data Nodes
Hadoop version 2.7
I have pulled data from Postgres into Hadoop Distributed Environment.
The data is 15 tables; among them, 4 tables have 15 million records and the rest are masters.
I've pulled them into HDFS, stored as ORC and compressed with SnappyCodec, and created Hive external tables with schema.
Now I'm firing a query which joins all the 15 tables and selects the columns which I need in a final flat table. The records expected are more than 1.5 Billion.
I have optimized Hive, Yarn, MapReduce Engine viz. Parallel Execution, Vectorization, Optimized Joins, Small Table Condition, Heap Size etc.
The query has been running on the cluster (Hive on Tez) for 20 hours and has reached 90%, where the last reducer is running. It reached 90% long ago and has been stuck there for about 18 hours.
Am I doing it the right way ?
If I understand correctly, you have effectively copied tables in their raw form from your RDBMS into Hadoop in order to create a flattened view in one or more new tables. You're using Hive to do this. All of this sounds fine.
There are many possibilities why this is taking so long, but several come to mind.
First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.
I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers are created by Hive, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed distinctly from others, then reassembled at the end.
If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.
First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.
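As a sketch of what such a statement can look like, the Python helper below assembles a HiveQL CREATE TABLE with a PARTITIONED BY clause. The helper name, the table name `flat_orders`, and all column names are hypothetical; only the general shape of the DDL is the point.

```python
def partitioned_table_ddl(table, columns, partition_columns, stored_as="ORC"):
    """Build a HiveQL CREATE TABLE statement with a PARTITIONED BY clause."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {stored_as}"
    )


# Hypothetical flat table partitioned by a date string such as 20160417.
ddl = partitioned_table_ddl(
    "flat_orders",
    columns=[("order_id", "BIGINT"), ("customer_id", "BIGINT"), ("amount", "DOUBLE")],
    partition_columns=[("order_date", "STRING")],
)
print(ddl)
```

The partition column is declared only in the PARTITIONED BY clause, not in the regular column list, and queries that filter on it can skip whole partition directories.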
The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.
This phenomenon is known as "data skew", so you want to choose partitions carefully to avoid skew. There are some options involving SKEWED BY and CLUSTERED BY that can help with getting evenly sized or smaller data files, which you could consider.
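The effect of skew on cluster utilization can be shown with a small scheduling simulation. This is a hypothetical model, not a measurement: both workloads total 100 hours of work on 10 workers, and tasks are assigned greedily to whichever worker becomes free first.

```python
import heapq


def makespan(task_hours, workers):
    """Greedy scheduling: each task goes to the worker that frees up first."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for t in task_hours:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + t)
    return max(finish_times)


WORKERS = 10
# Hypothetical workloads with the same total of 100 hours.
even = [1.0] * 100                  # e.g. partitions by date
skewed = [0.4] * 200 + [10.0] * 2   # many small customers, two huge ones

for name, tasks in (("even", even), ("skewed", skewed)):
    span = makespan(tasks, WORKERS)
    utilization = sum(tasks) / (span * WORKERS)
    print(f"{name}: finishes in {span:.1f} h, utilization {utilization:.0%}")
```

In the skewed case the small tasks keep every worker busy at first, after which only two workers are still running while the rest sit idle, which is the pattern described above.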
Note that the raw import data should also be partitioned, as partitions act like indexes in an RDBMS and so are important for performance. In this case, choose partitions that use the keys that your larger query joins on. It is possible and common to have multiple levels of partitioning, so a date-based top-level partition with a sub-partition on the join key could be helpful ... maybe ... it depends on your data.
We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. It's hard to read the explain output, so get comfortable with the Hive documentation :-).
Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages) you can create intermediate tables with an additional Hive query that runs to create a partially transformed dataset before building the final one. This seems expensive since you're adding an additional write, and read of new tables, but in the case you describe it may be much faster overall. Also, it's sometimes useful to have intermediate tables just to test or sample data.
Writing Hive is a lot less like writing regular software -- you can get the Hive query done pretty quickly in most cases. Getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use Oracle 11g database and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata 18m rows, 10 GB
trade_economics 18m rows, 15 GB
business_event 18m rows, 11 GB
trade_business_event_link 18m rows, 3 GB
One of our reports is now taking ages to run ( > 5 hours). The underlying proc has been optimized time and again but new criteria keep getting added so we start struggling again. The proc is pretty standard - join all the tables and apply a host of where clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions to get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig etc. but being very new to this space, am a little confused about the best way to proceed.
I imagine this is not a map-reduce problem. HBase does seem to offer Filters but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:
Is the data set large enough for big data solutions (Do I need entry into the billion club first?)
If it is, would HBase be a good choice to implement this?
We are not moving away from Oracle anytime soon even though the volumes are growing steadily. Am I looking at populating the HDFS every day with a dump from the relevant tables? Or is delta write possible everyday?
Thanks very much!
Welcome to the incredibly varied big data eco-system. If your dataset size is big enough that it is taxing your ability to analyze it using traditional tools, then it is big enough for big data technologies. As you have probably seen, there are a huge number of big data tools available with many of them having overlapping capabilities.
First of all, you did not mention whether you have a cluster set up. If not, then I would suggest looking into the products by Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools (HBase, Spark, Sqoop, etc.), and make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next you will need to get your data out of Oracle and into some format in the Hadoop cluster to analyze it. The tool often used to get data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, and files on the Hadoop Distributed Filesystem (HDFS). Sqoop can also do incremental imports for updates instead of whole-table loads. Which of these destinations you choose affects which tools you can use in the next step. HDFS is the most flexible in that you can access it from Pig, MapReduce code you write, Hive, Cloudera Impala, and others. I have found HBase to be very easy to use, but others highly recommend Hive.
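A daily delta load along these lines could be driven by a small script. The sketch below only assembles the argument list for an append-mode incremental Sqoop import; it does not run anything. The JDBC URL, table, check column, last value, and target directory are all hypothetical placeholders, and credentials are deliberately left out.

```python
import shlex


def sqoop_incremental_import(jdbc_url, table, check_column, last_value, target_dir):
    """Assemble the argument list for an incremental (append-mode) Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


cmd = sqoop_incremental_import(
    jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",  # hypothetical
    table="TRADE_METADATA",
    check_column="TRADE_ID",      # hypothetical monotonically increasing key
    last_value=18000000,          # highest value already imported
    target_dir="/data/raw/trade_metadata",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Append mode only picks up rows whose check column exceeds the last imported value. For rows that are updated in place, Sqoop's lastmodified mode with a timestamp column would be the variant to look at.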
An aside: There is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims a 100x speedup compared to traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to do SQL-like queries on big data and get results very quickly (Blog post).
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, then you can reuse a lot of your SQL knowledge instead of having to program Java or learn Pig Latin (not that it’s very hard). Pig translates Pig Latin to MapReduce jobs (as does Hive’s Query Language for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, analytics, and movement of results back out of the cluster (sqoop export for this). Oozie allows you to schedule recurring workflows like yours so you can focus on the results, not the process. The full capabilities of Oozie are documented here.
There are a crazy number of tools at your disposal, and the speed of change in this eco-system can give you whip-lash. Both Cloudera and Hortonworks provide virtual machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of them (like Hive, Pig, Oozie, ...) to see what works best for your application.

What can I expect about hive and hadoop in performance?

I am currently trying to implement a solution with Hadoop using Hive on CDH 5.0 with YARN. So my architecture is:
1 Namenode
3 DataNode
I'm querying ~123 million rows with 21 columns
My nodes are virtualized with 2 vCPUs @ 2.27 GHz and 8 GB RAM
So I tried some requests and got some results, and after that I tried the same requests in a basic MySQL with the same dataset in order to compare the results.
And actually MySQL is much faster than Hive. So I'm trying to understand why. I know I have some bad performance because of my hosts. My main question is: is my cluster well sized?
Do I need to add some DataNodes for this amount of data (which is not very enormous in my opinion)?
And if someone has tried some requests with approximately the same architecture, you are welcome to share your results with me.
Thanks !
I'm querying ~123 million rows with 21 columns [...] which is not very enormous in my opinion
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small data sets like the one you're using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want realtime performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC, and a number of other optimizations that greatly improve response time.
Impala (part of CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
Edit:
I'm saying that with 2 datanodes i get the same performance than with 3
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...) it incurs all the cost that comes with MapReduce. This cost is more or less constant regardless of the size of data and number of datanodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
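This fixed-cost argument can be written as a tiny model. The numbers are hypothetical assumptions chosen only to show the shape of the curve: a constant startup cost of 30 seconds per job and a per-row processing cost of 10 microseconds.

```python
def overhead_share(rows, startup_seconds=30.0, seconds_per_row=1e-5):
    """Fraction of total job time spent on fixed startup cost."""
    processing = rows * seconds_per_row
    return startup_seconds / (startup_seconds + processing)


for rows in (100, 1_000_000, 10_000_000_000):
    print(f"{rows:>14,} rows: {overhead_share(rows):.1%} of time is startup")
```

Under these assumptions, startup dominates completely for tiny inputs and becomes negligible only once the processing time is many times larger than the fixed cost. Adding DataNodes shrinks the processing term but not the startup term.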

Resources