Spark read table performance optimization

I am creating a Spark job and am wondering whether there are any performance benefits to reading a table via spark.sqlContext().read("table") versus spark.sql("select * from table"), or does Spark's logical plan end up the same regardless?

If you use spark.read.jdbc, you can specify a partition column to read the table in parallel, giving Spark multiple partitions to work on. Whether or not this is faster depends on the RDBMS and the physical design of the table, but it will greatly reduce the amount of memory needed by any single executor.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
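To make the parallel-read idea concrete, here is a plain-Python sketch (no Spark required) of roughly how the JDBC source splits a read when you pass partitionColumn, lowerBound, upperBound, and numPartitions: it derives one WHERE clause per partition and runs them as separate queries. The exact clause text is an approximation of Spark's behavior, not its internal code.

```python
# Approximate the per-partition WHERE clauses Spark's JDBC source
# generates from (partitionColumn, lowerBound, upperBound, numPartitions).
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up NULLs and anything below lowerBound.
            predicates.append(f"{column} < {bound + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so values above upperBound are included.
            predicates.append(f"{column} >= {bound}")
        else:
            predicates.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return predicates

for p in jdbc_partition_predicates("id", 0, 1000, 4):
    print(p)
# id < 250 OR id IS NULL
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750
```

Note that lowerBound/upperBound only shape the stride; rows outside the bounds are still read, just all by the first or last partition.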

Related

How do we optimise a Spark job when the base table has 130 billion records?

We are joining multiple tables and doing complex transformations and enrichments.
The base table has around 130 billion records. How can we optimise the Spark job when Spark filters all the records, keeps them in memory, and enriches them via left outer joins with the other tables? Currently the job runs for more than 7 hours. Can you suggest some techniques?
Here is what you can try:
Partition the base tables you query on a specific column, such as Department or Date, that you use during joins. If the underlying table is Hive, you can also try bucketing.
Use the join strategy that suits your requirement, such as a sort-merge join or a hash join.
For the file format, use Parquet; it stores data in columnar form and was much faster than ORC for our analytical queries.
If your query has multiple steps and some intermediate results are reused, use caching, as Spark supports both memory and disk caching.
Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how data is distributed, and try various configurations to see what works best for you.
Spark can perform poorly if there is large skew in the data; if that is the case, you will need further optimisation to handle it.
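One common fix for join skew is "key salting": spread each hot key over N salted variants so its rows land in N partitions instead of one. Here is a minimal plain-Python sketch of the idea (the key name is made up; in a real job you would apply these functions to the join columns of the large and small DataFrames respectively):

```python
import random

N_SALTS = 8  # how many ways to spread each hot key

def salt_key(key):
    # Large (skewed) side: append a random salt so the hot key's rows
    # are distributed across N_SALTS buckets.
    return f"{key}#{random.randrange(N_SALTS)}"

def explode_key(key):
    # Small side: replicate every row once per salt value so the join
    # still matches all salted variants of the key.
    return [f"{key}#{i}" for i in range(N_SALTS)]

# A row with the hot key "DEPT_42" now joins against one of 8 buckets:
print(salt_key("DEPT_42"))        # e.g. "DEPT_42#3"
print(explode_key("DEPT_42")[:2])  # ['DEPT_42#0', 'DEPT_42#1']
```

The cost is an N-fold blow-up of the small side, so this only pays off when the skew is severe and the small side really is small.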
Apart from the techniques mentioned above, you can try the options below to optimize your job.
1. Partition your data after inspecting your data fields. The most common partitioning columns are date columns, region IDs, country codes, and the like. Once the data is partitioned, run df.explain() on your DataFrame and check that it is using a PartitioningAwareFileIndex.
2. Tune the Spark settings and cluster configuration to scale with the input data volume.
Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains from changing this parameter.
Use an appropriate number of executors and cores, and an appropriate amount of executor memory, based on the compute needed.
Analyze the Spark history UI to identify the stages and jobs that consume significant time. That is a good place to start debugging your job.
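As a back-of-envelope check before changing spark.sql.files.maxPartitionBytes, you can estimate how it affects the number of input tasks. The sketch below ignores file boundaries, spark.sql.files.openCostInBytes, and default parallelism, so it is only an approximation:

```python
import math

def approx_input_partitions(total_input_bytes, max_partition_bytes):
    # Roughly one input partition (task) per maxPartitionBytes of input.
    return math.ceil(total_input_bytes / max_partition_bytes)

ONE_TB = 1 << 40
# Default 128 MB vs the suggested 256 MB, for a 1 TB input:
print(approx_input_partitions(ONE_TB, 128 << 20))  # 8192 tasks
print(approx_input_partitions(ONE_TB, 256 << 20))  # 4096 tasks
```

Halving the task count halves scheduling overhead per stage, but each task then needs roughly twice the memory, which is the trade-off to watch in the history UI.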

Is it possible to convert from HBase to a Spark RDD efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering if it is possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase regions are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
Operations in Spark vs operations in HBase
Rule of thumb: use HBase for scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, you can work at the level of partitions. For example, you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
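The reason a shuffle-free join is possible here is that HBase scans return rows sorted by rowkey, so within one co-located partition a simple merge join is enough. A toy plain-Python illustration (assuming unique rowkeys):

```python
def merge_join(left, right):
    # left/right: lists of (rowkey, value) sorted by rowkey, as an
    # HBase scan would return them. Advance two cursors in lockstep.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

a = [("r1", "x"), ("r3", "y"), ("r5", "z")]
b = [("r1", 10), ("r2", 20), ("r5", 30)]
print(merge_join(a, b))  # [('r1', 'x', 10), ('r5', 'z', 30)]
```

In Spark terms, if both RDDs share the same partitioner (derived from the region splits), each partition can be joined locally exactly like this, with no data movement.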
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting : http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below has also some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting you change the scanner cache config, but this holds only for HBase < 1.x.
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from spark/mapreduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform quick enough for most OLAP applications. We also separated the resource management so the OLAP resources were allocated via YARN (On Premise) or Mesos (Cloud) so they would not disturb normal OLTP apps.
I wish you luck in your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.

Spark Performance On Individual Record Lookups

I am conducting a performance test which compares queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark was showing query execution time that was on par or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one noted exception with a query that involved key based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.
After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.
Spark 1.6.1
Spark conf: 2 executors, 32 GB executor memory, 4 executor cores.
Data is in an internal Hive Table which is stored as ORC file types compressed with zlib. The total size of the compressed files is ~2.2 GB.
Here is the query code.
#Python API
#orc with zlib key based select
dforczslt = sqlContext.sql("SELECT * FROM dev.perf_test_orc_zlib WHERE test_id= 12345678987654321")
dforczslt.show()
The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried enabling predicate pushdown via the SQL context configs, but this resulted in no noticeable performance increase. Also, when the same test was conducted using Parquet, the query time was on par with Hive as well. I'm sure there are other ways to increase the performance of the queries, such as using RDDs vs. DataFrames, etc., but I'm really looking to understand how Spark interacts with ORC files in a way that produces this gap.
Let me know if I can provide additional clarification around any of the talking points listed above.
The following steps might help to improve the performance of the Spark SQL query.
In general, Hive can use the memory of the whole Hadoop cluster, which is significantly larger than the executor memory here (2 × 32 GB = 64 GB). What is the memory size of your nodes?
Also, the number of executors (2) seems low compared to the number of map/reduce tasks generated by the Hive query. Increasing the number of executors in multiples of 2 might help improve performance.
In Spark SQL and the DataFrame API, optimised execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. This feature can be enabled by setting spark.sql.tungsten.enabled to true in case it is not already enabled:
sqlContext.setConf("spark.sql.tungsten.enabled", "true")
The columnar nature of the ORC format helps avoid reading unnecessary columns. However, we are still reading unnecessary rows even when the query has a WHERE clause filter. ORC predicate push-down would improve performance through ORC's built-in indexes, but it is disabled in Spark SQL by default and needs to be enabled explicitly:
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
I would recommend doing some more research to find any remaining performance blockers.

Hive query is taking a long time to generate the dataset

I am trying to run Hive queries on a huge amount of data (almost half a petabyte), and these queries run MapReduce internally. It takes a very long time to generate the dataset (i.e., for the MapReduce jobs to complete). What optimization mechanisms for Hive and Hadoop can I use to make these queries faster? One more important question: does the amount of disk available for MapReduce, or in the /tmp directory, matter for faster MapReduce?
There is not too much you can do, but I can give a few directions for what can usually be done with Hive:
Prefer SQL that causes less shuffling. For example, you can trigger map-side joins when possible, and some operations can be written in a way that leads to map-only queries.
Another way is to tune the number of reducers: sometimes Hive chooses far fewer reducers than needed, so you can set the number manually to better utilize your cluster.
If you run a number of queries as part of your transformation, you can set a low replication factor for the temporary data in HDFS.
More help can be provided if you share what exactly you are doing.
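For the reducer-count point above, it helps to know how Hive estimates reducers when you do not set mapred.reduce.tasks yourself: roughly one reducer per hive.exec.reducers.bytes.per.reducer of input, capped at hive.exec.reducers.max. A plain-Python sketch (the default values shown are the classic ones; check your Hive version's documentation):

```python
import math

BYTES_PER_REDUCER = 256 * 1024 * 1024  # hive.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 1009                    # hive.exec.reducers.max

def estimated_reducers(input_bytes):
    # One reducer per BYTES_PER_REDUCER of input, at least 1,
    # capped at MAX_REDUCERS.
    return min(MAX_REDUCERS, max(1, math.ceil(input_bytes / BYTES_PER_REDUCER)))

print(estimated_reducers(10 * 1024**3))  # 10 GB input -> 40 reducers
```

If this estimate is far below what your cluster can run in parallel, lowering bytes-per-reducer (or setting the reducer count explicitly) is the usual fix.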

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.
It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):
Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.
Since HBase and HyperTable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).
From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary.
HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.
Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...
HBase is a column-oriented key-value store based on Bigtable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key or scanning ranges of rows. A major feature is data locality when scanning ranges of row keys for a 'family' of columns.
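The two access patterns named above, a point get and a rowkey range scan, both fall out of the fact that HBase keeps rows sorted by rowkey. A toy plain-Python model of that (the rowkeys are made up for illustration):

```python
import bisect

# HBase stores rows sorted by rowkey; model that as a sorted list.
rows = sorted([("user#001", "alice"), ("user#002", "bob"),
               ("user#003", "carol"), ("item#001", "hammer")])
keys = [k for k, _ in rows]

def get(rowkey):
    # Point lookup: binary search on the sorted rowkeys.
    i = bisect.bisect_left(keys, rowkey)
    if i < len(keys) and keys[i] == rowkey:
        return rows[i][1]
    return None

def scan(start, stop):
    # Range scan over the half-open interval [start, stop),
    # like an HBase Scan with startRow/stopRow.
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return rows[lo:hi]

print(get("user#002"))          # bob
print(scan("user#", "user#~"))  # all three "user#..." rows
```

This is also why rowkey design matters so much: keys that share a prefix sit next to each other, so a prefix scan touches one contiguous range rather than the whole table.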
To my humble knowledge, Hive is more comparable to Pig: Hive is SQL-like and Pig is script-based.
Hive seems more complex, with query optimization and an execution engine, and it requires the end user to specify schema parameters (partitions etc.).
Both are intended to process text files or SequenceFiles.
HBase is for key-value data storage and retrieval: you can scan or filter those key-value pairs (rows), but you cannot run queries over them.
Hive and HBase are used for different purposes.
Hive:
Pros:
Apache Hive is a data warehouse infrastructure built on top of Hadoop.
It allows querying data stored on HDFS for analysis via HQL, an SQL-like language that is converted into a series of MapReduce jobs.
It only runs batch processes on Hadoop.
It is JDBC compliant and also integrates with existing SQL-based tools.
Hive supports partitions.
It supports analytical querying of data collected over a period of time.
Cons:
It does not currently support UPDATE statements.
It must be provided with a predefined schema to map files and directories into columns.
HBase:
Pros:
A scalable, distributed database that supports structured data storage for large tables.
It provides random, real-time read/write access to your big data. HBase operations run in real time on its database rather than as MapReduce jobs.
It supports partitioning of tables, which are further split into column families.
It scales horizontally with huge amounts of data by using Hadoop.
It provides key-based access to data when storing or retrieving, and supports adding or updating rows.
It supports versioning of data.
Cons:
HBase queries are written in a custom language that needs to be learned.
HBase isn't fully ACID compliant.
It can't be used for complicated access patterns (such as joins).
It is also not a complete substitute for HDFS when doing large batch MapReduce jobs.
Summary:
Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
As of the most recent Hive releases, a lot has changed that requires a small update: Hive and HBase are now integrated. This means Hive can be used as a query layer over an HBase datastore. If people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantial performance gains over Hive-based queries on top of HBase; they claim up to 45x faster queries than traditional Hive setups.
To compare Hive with HBase, I'd like to recall the definition below:
A database designed to handle transactions isn't designed to handle analytics. It isn't structured to do analytics well. A data warehouse, on the other hand, is structured to make analytics fast and easy.
Hive is a data warehouse infrastructure built on top of Hadoop, suitable for long-running ETL jobs.
HBase is a database designed to handle real-time transactions.
