I am working on a project in which I have to tune Hive's performance. I have found nine parameters that appear to be the most important for tuning it. They are as follows (in no specific order):
hive.exec.reducers.max
hive.limit.optimize.fetch.max
hive.limit.row.max.size
hive.exec.max.dynamic.partitions
hive.index.compact.query.max.entries
hive.merge.size.per.task
hive.index.compact.query.max.size
hive.metastore.server.min.threads
hive.mapjoin.check.memory.rows
I wanted to know whether I am going in the right direction. Please also let me know if I have missed any other important parameters.
Thanks in advance.
Consider running Hive on the Tez engine, which runs significantly faster than MapReduce:
hive.execution.engine = tez
I'm getting the same error as in Missing an output location for shuffle when joining big dataframes in Spark SQL. The recommendation there is to set MEMORY_AND_DISK and/or spark.shuffle.memoryFraction 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0, and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or DataFrame, right? I'm also getting lots of other WARN logs and task retries that lead me to think the job is not stable.
Therefore, my question is:
What are best practices to join huge dataframes in Spark SQL >= 1.6.0?
More specific questions are:
How to tune number of executors and spark.sql.shuffle.partitions to achieve better stability/performance?
How to find the right balance between the level of parallelism (number of executors/cores) and the number of partitions? I've found that increasing the number of executors is not always the solution, as it may generate I/O read timeout exceptions because of network traffic.
Is there any other relevant parameter to be tuned for this purpose?
My understanding is that data stored as ORC or Parquet offers better performance than text or Avro for join operations. Is there a significant difference between Parquet and ORC?
Is there an advantage of SQLContext vs HiveContext regarding stability/performance for join operations?
Is there a difference in performance/stability when the dataframes involved in the join have previously been registered with registerTempTable() or saved with saveAsTable()?
So far I'm using this answer and this chapter as a starting point, and there are a few more Stack Overflow pages related to this subject. Yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
Those are a lot of questions. Allow me to answer them one by one:
Your number of executors is variable most of the time in a production environment; it depends on the available resources. The number of partitions matters when you are performing shuffles. Assuming that your data is not skewed, you can lower the load per task by increasing the number of partitions.
A task should ideally take a couple of minutes. If a task takes too long, your container may get pre-empted and the work is lost. If a task takes only a few milliseconds, the overhead of starting it becomes dominant.
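To make that concrete, here is a minimal sketch, assuming a Spark 1.6 SQLContext and two hypothetical Parquet inputs dfA and dfB joined on an "id" column, of raising spark.sql.shuffle.partitions before the join:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("join-tuning-sketch"))
    val sqlContext = new SQLContext(sc)

    // Raise the number of shuffle partitions (the default is 200) so that each
    // shuffle task behind the join handles a smaller slice of the data.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Hypothetical inputs; replace with your own sources.
    val dfA = sqlContext.read.parquet("/path/to/a")
    val dfB = sqlContext.read.parquet("/path/to/b")

    // The shuffle behind this join now fans out over 400 tasks instead of 200.
    val joined = dfA.join(dfB, Seq("id"))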
For the level of parallelism and tuning your executor sizes, I would refer you to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
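As a rough illustration of the knobs that guide talks about (the numbers here are purely illustrative assumptions, not recommendations):

    import org.apache.spark.SparkConf

    // Purely illustrative sizing; the right values depend on your cluster and
    // are discussed in detail in the Cloudera guide above.
    val conf = new SparkConf()
      .setAppName("sized-join-job")
      .set("spark.executor.instances", "10") // number of executors (YARN)
      .set("spark.executor.cores", "4")      // concurrent tasks per executor
      .set("spark.executor.memory", "8g")    // heap per executor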
ORC and Parquet only encode the data at rest; when doing the actual join, the data is in Spark's in-memory format. Parquet is getting more popular since Netflix and Facebook adopted it and put a lot of effort into it. Parquet lets you store the data more efficiently and has some optimisations (such as predicate pushdown) that Spark uses.
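As a small illustration of the predicate-pushdown point (again a sketch, reusing the hypothetical sqlContext and dfA from above):

    // Write the join input as Parquet once (columnar storage at rest).
    dfA.write.parquet("/path/to/a_parquet")

    // Simple predicates like this one can be pushed down into the Parquet scan,
    // so row groups outside the range are skipped.
    val fromParquet = sqlContext.read.parquet("/path/to/a_parquet")
      .filter("id > 1000")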
You should use the SQLContext instead of the HiveContext, since the HiveContext is deprecated. The SQLContext is more general and doesn't only work with Hive.
When you call registerTempTable, nothing is materialized within the SparkSession; only the execution plan is stored, and it gets invoked when an action is performed (for example saveAsTable). This doesn't affect the execution of the join. When you perform a saveAsTable, the data gets stored on the distributed file system.
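To illustrate the distinction, a sketch against the Spark 1.6 API with a hypothetical dataframe df (note that saveAsTable writes through the metastore, so it normally needs a HiveContext-backed session):

    // Nothing is materialized here: only the execution plan is registered in the session.
    df.registerTempTable("people_tmp")

    // The plan is executed when an action runs against the temp table.
    sqlContext.sql("SELECT COUNT(*) FROM people_tmp").show()

    // This writes the data out to the distributed file system via the metastore.
    df.write.saveAsTable("people_persisted")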
Hope this helps. I would also suggest watching our talk at the Spark Summit about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. This might provide you some insights.
Cheers, Fokko
I'm actually asking myself about the performance of using Spark SQL with Hive to do real-time analytics.
I know that Hive was created for batch processing, and that Spark is used to do fast queries.
But will using Spark SQL with Hive allow me to do real-time queries? Or will it just make queries faster without being real time?
Should I use another data warehouse instead of Hive, like HBase?
Thanks in advance,
Florian
While Spark can be much faster than Hive, it's still probably not an ideal solution for, say, serving a website. So whether Spark SQL can do "real-time" queries or not depends largely on what sort of latencies you consider real time, whether your dataset is small enough to cache in memory, and whether your queries are able to take advantage of partitioning.
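If your dataset does fit in memory, the kind of setup that gets you fast repeated queries (seconds rather than batch latencies, but still not web-serving latencies) looks roughly like this sketch; the sqlContext, the Parquet path, and the "events" schema are assumptions:

    // Register the dataset and pin it in memory; the first query pays the load
    // cost, subsequent queries read the cached columnar data.
    sqlContext.read.parquet("/path/to/events").registerTempTable("events")
    sqlContext.cacheTable("events")

    // Repeated queries like this one then run against the cached data.
    sqlContext.sql("SELECT day, COUNT(*) AS cnt FROM events GROUP BY day").show()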
Hadoop is not designed to do updates. I tried with Hive: it has to do an INSERT OVERWRITE, which is a costly operation. We can also do some workarounds using MapReduce, which again is costly.
Is there any other tool or way by which I can do frequent updates on Hadoop, or can I use Spark for the same? Please help me; I am not getting enough information about this even after googling 100 times. Thanks in advance.
If you need real-time updates on Hadoop, HBase is the solution you might want to take a look at. Hive is not meant for random/frequent updates; it is more of a data-crunching tool, not a replacement for an RDBMS.
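To give a feel for what a random update looks like with the HBase client API, here is a minimal Scala sketch; the table name "users", the column family "info", and the values are made-up for the example:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Connect and update a single cell in place.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("users"))

    val put = new Put(Bytes.toBytes("user42")) // row key
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("new@example.com"))
    table.put(put)                             // overwrites just this cell, no INSERT OVERWRITE of a whole partition

    table.close()
    conn.close()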
After benchmarking Hive and Pig, I found that the GROUP BY operator in Pig is drastically slower than Hive's. I was wondering whether anybody has experienced the same, and whether people have any tips for improving the performance of this operation? (Adding a DISTINCT as suggested by an earlier post on here doesn't help. I am currently re-running the benchmark with LZO compression enabled.)
It seems that you are looking at this the wrong way. GROUP BY just groups the data in some way; what you do afterwards is very important. When trying to analyze performance in Pig, you should keep these things in mind:
1) Several statements can be merged into a single MR job, so don't look at the statements, look at the performance of the generated MR jobs.
2) There should be a reason for a drastic difference in performance. This may be:
2.1 Different input format, other circumstances when benchmarking Pig vs Hive.
2.2 Combiner being disabled for some reason:
http://pig.apache.org/docs/r0.9.1/perf.html#When+the+Combiner+is+Used
This happens to be the bottleneck for me in most cases.
And in my experience there is no drastic difference between Pig and Hive performance.
I am loading 10 million records into an HBase table through the importtsv tool from a Hadoop multi-node cluster. Right now this task takes 5 minutes, but I was wondering how I could improve its performance. The importtsv tool does not seem to use reducers at all. I was wondering if I could somehow force it to use reducers and whether that would improve performance; any other way you think would improve performance would also be appreciated. Thank you.
Try ImportTsv with HFileOutputFormat and the completebulkload tool.
When it comes to performance, there is no easy answer. If the 5 minutes corresponds to the speed of the network or the speed of the hard disk, you have to move the source data somewhere else or change the hardware.
I don't know importtsv. I would suggest you try a multi-way load; take a look at Sqoop.
You can get the best HBase bulk-load performance by using HFileOutputFormat and CompleteBulkLoad.
Check here.