Hive: find expected run time of a query

I want to find the expected run time of a query in Hive. Using EXPLAIN gives the execution plan. Is there a way to find the expected time?
I need the Hive equivalent of the SQL EXPLAIN COSTS query.
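For reference, here is roughly what is available today (the table and columns below are hypothetical): EXPLAIN, and the more verbose EXPLAIN EXTENDED, print the stage plan but no time or cost estimate.

-- Prints the stage plan for a hypothetical query; no time or cost estimate.
EXPLAIN
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- EXPLAIN EXTENDED adds more detail (file paths, serdes), still without costs.
EXPLAIN EXTENDED
SELECT dept, COUNT(*) FROM employees GROUP BY dept;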

There is no OOTB feature at this moment that facilitates this. One way to achieve this would be to learn from history. Gather patterns based on similar data and queries you have run previously and try to deduce some insights. You might find tools like Starfish helpful in the process.
I would not recommend deciding anything based on a subset of your data, as running queries on a small dataset and on the actual dataset are very different things. This is good for testing functionality but not for any kind of cost approximation. The reason is that a lot of factors are involved in the process, like system resources (disk, CPU slots, network, etc.), system configuration, other running jobs and so on. You might find everything runs smoothly on a small dataset, but as the data size increases all these factors start to play a much more important role. Even a small configuration parameter may matter (you might have noticed that a Hive query sometimes runs fast initially but gradually starts getting slow). Also, execution of a Hive query is much more involved than a simple MR job.
See this JIRA, where they discuss developing cost-based query optimization for joins in Hive, to get some idea. You might also find this helpful.

I think that is not possible, because internally a MapReduce job gets executed for any particular Hive query. Moreover, a MapReduce job's execution time depends on the cluster load and its configuration, so it is tough to predict the execution time. One thing you can do is start a timer before running the query, and once it finishes calculate the exact execution time the query needed.

Maybe you could sample a small percentage of records from your table using partitions, the bucketing feature, etc., then run the query against the small dataset. Note the execution time and then multiply it by the factor (total_size/sample_size).
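As a rough sketch of that idea (table and column names are hypothetical), Hive's TABLESAMPLE can pull out a bucketed sample; you would then time the query and scale by total_size/sample_size:

-- Sample roughly 1% of the table (1 bucket out of 100) on a hypothetical column;
-- this is most efficient if the table is already bucketed on that column.
SELECT dept, COUNT(*)
FROM employees TABLESAMPLE (BUCKET 1 OUT OF 100 ON user_id)
GROUP BY dept;
-- Time this run, then estimate: expected_time ~ observed_time * (total_size / sample_size).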

Related

Performance benchmarking between Hive (on Tez) and Spark for my particular use case

I'm playing around with some data on a cluster and want to do some aggregations --- nothing too complicated, but more complicated than a sum; there are a few joins and count distincts. I have implemented this aggregation in Hive and in Spark with Scala and want to compare the execution times.
When I submit the scripts from the gateway, the Linux time function gives me a real time smaller than the sys time, which I expected. But I'm not sure which one to pick as the proper comparison. Maybe just use sys time and run both queries several times? Is that acceptable, or am I a complete noob in this case?
Real time. From a performance benchmark perspective, you only care about how long (human time) it takes before your query is completed and you can look at the results, not how many processes are getting spun up by the application internally.
Note, I would be very careful with performance benchmarking, as both Spark and Hive have plenty of tunable configuration knobs that greatly affect performance. See here for a few examples to alter Hive performance with vectorization, data format choices, data bucketing and data sorting.
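For illustration, a few of the knobs alluded to above can be toggled per Hive session; the property names are standard Hive settings, but the values and the table name below are only examples, not recommendations.

-- Enable vectorized execution (rows processed in batches).
SET hive.vectorized.execution.enabled = true;
-- Switch the execution engine (Tez here, as in the benchmark).
SET hive.execution.engine = tez;
-- Data format choice: rewrite a hypothetical table as ORC.
CREATE TABLE events_orc STORED AS ORC AS SELECT * FROM events;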
The "general consensus" is that Spark is faster than Hive on Tez, but that Hive can handle huge data sets that don't fit in memory better. (I'm not going to cite a source since I'm lazy, do some googling)

HBase or HDFS: which will be better

I have a use case in which we have a large amount of data on which analytics are to be performed. The data will be continuously fetched and the analytics performed at run time. For this scenario, what will be best to use: HBase+Hive or HDFS+Hive?
From what I have read, I have found that HBase is best for run-time changes. I need some suggestions and advice; please feel free to provide your input.
If you have any such use case in mind, it would be great if you could give an example of it.
Thanks in advance
Based on my experience so far, it often boils down to a choice between HBase and Hive. HBase fits well for use cases involving real-time querying on data that is changing fast (chat messages), and Hive for use cases where analytics (often using SQL) need to be performed over data that has accumulated over a long period of time (website analytics).

Cassandra + Solr/Hadoop/Spark - Choosing the right tools

I'm currently investigating how to store and analyze enriched time-based data with up to 1000 columns per line. At the moment, Cassandra together with either Solr, Hadoop or Spark as offered by DataStax Enterprise seems to roughly fulfill my requirements. But the devil is in the detail.
Out of the 1000 columns, about 60 are used for real-time-like queries (web frontend, user sends a form and expects a quick response). These queries are more or less GROUP BY statements in which the number of occurrences is counted.
As Cassandra itself does not provide the required analytical capabilities (no GROUP BY), I'm left with these alternatives:
Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries
The first approach seems cumbersome and prone to errors… Solr does have some analytic features, but without multifield grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least, there are Hadoop and Spark, the former known not to be the best for real-time queries, the latter pretty new and maybe not production-ready.
So which way to go? There is no one-size-fits-all here, but before I commit to one way I'd like to get some feedback. Maybe I'm overcomplicating this or my expectations are too high :S
Thanks in advance,
Arman
At the place I work now we have a similar set of tech requirements, and the solution is Cassandra-Solr-Spark, exactly in that order.
So if a query can be "covered" by Cassandra indices - good; if not - it's covered by Solr. For testing and less frequent queries - Spark (Scala, no SparkSQL due to the old version of it -- it's a bank, everything should be tested and matured, from cognac to software, argh).
Generally I agree with the solution, though sometimes I have a feeling that some client's requests should NOT be taken seriously at all, saving us from loads of weird queries :)
I would recommend Spark; if you take a look at the list of companies using it you'll find such names as Amazon, eBay and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.
You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.
Hadoop and MapReduce were designed to leverage hard disks under the assumption that, for big data, IO is negligible. As a result, data are read and written at least twice - in the map stage and in the reduce stage. This allows you to recover from failures, as partial results are secured, but that's not what you want when aiming for real-time queries.
Spark not only aims to fix MapReduce shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM and the results are astonishing. Spark jobs will often be 10-100 times faster than MapReduce equivalents.
The only caveat is the amount of memory you have. Most probably your data is going to fit in the RAM you can provide, or you can rely on sampling. Usually when working with data interactively there is no real need to use MapReduce, and it seems to be so in your case.

Why does the same query take a different amount of time to run? [closed]

I have this problem that has been going on for months. I automate reports at my job; we use Oracle. I write a procedure, time it, and it runs in a few minutes. I then set it up for monthly runs.
And then every month, some report runs for hours. It's all the same queries that ran in a few minutes for months before and all of a sudden they're taking hours to run.
I end up rewriting my procedures every now and then and to me this defeats the purpose of automating. No one here can help me.
What am I doing wrong? How can I ensure that my queries will always take the same amount of time to run?
I did some research, and it says that in a correctly set up database with correct statistics you don't even have to use hints; everything should consistently run in about the same time.
Is this true? Or does everyone have this problem and everyone just rewrites their procedures whenever they run?
Sorry for 100 questions, I'm really frustrated about this.
My main question is: why does the same query take a different amount of time (a drastic difference, from minutes to hours) to run on different days?
There are three broad reasons that queries take longer at different times. Either you are getting different performance because the system is under a different sort of load, you are getting different performance because of data volume changes, or you are getting different performance because you are getting different query plans.
Different Data Volume
When you generate your initial timings, are you using data volumes that are similar to the volumes that your query will encounter when it is actually run? If you test a query on the first of the month and that query is getting all the data for the current month and performing a bunch of aggregations, you would expect that the query would get slower and slower over the course of the month because it has to process more and more data. Or you may have a query that runs quickly outside of month-end processing because various staging tables that it depends on only get populated at month end. If you are generating your initial timings in a test database, you'll very likely get different performance because test databases frequently have a small subset of the actual production data.
Different System Load
If I take a query and run it during the middle of the day against my data warehouse, there is a good chance that the data warehouse is mostly idle and therefore has lots of resources to give me to process the query. If I'm the only user, my query may run very quickly. If I try to run exactly the same query during the middle of the nightly load process, on the other hand, my query will be competing for resources with a number of other processes. Even if my query has to do exactly the same amount of work, it can easily take many times more clock time to run. If you are writing reports that will run at month end and they're all getting kicked off at roughly the same time, it's entirely possible that they're all competing with each other for the limited system resources available and that your system simply isn't sized for the load it needs to process.
Different system load can also encompass things like differences in what data is cached at any point in time. If I'm testing a particular query in prod and I run it a few times in a row, it is very likely that most of the data I'm interested in will be cached by Oracle, by the operating system, by the SAN, etc. That can make a dramatic difference in performance if every read is coming from one of the caches rather than requiring a disk read. If you run the same query later, after other work has flushed out most of the blocks your query is interested in, you may end up doing a ton of physical reads rather than being able to use the nicely warmed-up cache. There's not generally much you can do about this sort of thing-- you may be able to cache more data or arrange for processes that need similar data to be run at similar times so that the cache is more efficient, but that is generally expensive and hard to do.
Different Query Plans
Over time, your query plan may also change because statistics have changed (or not changed depending on the statistic in question). Normally, that indicates that Oracle has found a more efficient plan or that your data volumes have changed and Oracle expects a different plan would be more efficient with the new data volume. If, however, you are giving Oracle bad statistics (if, for example, you have tables that get much larger during month-end processing but you gather statistics when the tables are almost empty), you may induce Oracle to choose a very bad query plan. Depending on the version of Oracle, there are various ways to force Oracle to use the same query plan. If you can drill down and figure out what the problem with statistics is, Oracle probably provides a way to give the optimizer better statistics.
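As a hedged sketch of that last point, gathering statistics while a hypothetical staging table actually holds its month-end volumes (the schema and table names are made up) looks something like this:

-- Gather statistics while the table contains representative (month-end) data,
-- so the optimizer does not plan for a nearly empty table.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'REPORTING',       -- hypothetical schema
    tabname => 'MONTHLY_STAGE',   -- hypothetical staging table
    cascade => TRUE);             -- also gather index statistics
END;
/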
If you take a look at AWR/ASH data (if you have the appropriate licenses) or Statspack data (if your DBA has installed that), you should be able to figure out which camp your problems originate in. Are you getting different query plans for different executions? (You may need to capture a query plan from your initial benchmarks and compare it to the current plan, or you may need to increase your AWR retention to retain query plans for a few months in order to see this.) Are you doing the same number of buffer gets over time but getting vastly different amounts of I/O waits? Do you see a lot of contention for resources from other sessions? If so, that probably indicates that the issue is different load at different times.
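If you do have the Diagnostic Pack license, one way to compare the plans captured over time is DBMS_XPLAN.DISPLAY_AWR; the sql_id and the text filter below are placeholders, not real values.

-- Find the sql_id of the slow statement (the text filter is illustrative).
SELECT sql_id, sql_text FROM v$sql WHERE sql_text LIKE '%MONTHLY_REPORT%';
-- List every plan AWR has captured for that sql_id; different plan hash values
-- across snapshots point at a plan problem rather than a load problem.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('0abcd1efgh234'));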
One possibility is that your execution plan is cached so it takes a short amount of time to rerun the query, but when the plan is no longer cached (like after the DB is restarted) it might take significantly longer.
I had a similar issue with Oracle a long while ago where a very complex query for a report ran against a very large amount of data, and it would take hours to complete the first time it was run after the DB was restarted, but after that it finished in a few minutes.
This is not an answer; it is a reply to Justin Cave. I couldn't format it in any readable way in the comments.
Different Data Volume
When ….. data.
Yes, I’m using the same archive tables that I then use for months to come. Of course, data changes but it’s a pretty consistent rise, for example, if a table has 10M rows this month – it might gain 100K rows the next, 200K the next, 100K the next and so on. There are no drastic jumps as far as I know. And I’d understand if today the query took 2 minutes and next month it’d take 5. But not 3 hours. However, thank you for the idea, I will start counting rows in tables from month to month as well.
A question though: how do people code to account for this? Let's say someone works with tables that will get large amounts of data at random times; is there a way to write the query to ensure the run times are at least in the ballpark? Or do people just put up with the fact that in any given month their reports may run 10-20 hours?
Different System Load
If I take a …. to process.
No, I run my queries on different days and times, but I have logs of the days and the times, so I will see if I can find a pattern.
Different system load …hard to do.
So are you saying that the fast times I may be getting at the time of the report design might be fast because of the things I ran on my computer previously?
Also, does the cache get stored on my computer, or on the database under my login, or where?
Different Query Plans
Over time, your query plan … different load at different times.
Thank you for your explanations, you’ve given me enough to start digging.

How to tune mapred.reduce.parallel.copies?

Following reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.
The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?
In order to do that you should basically look at four things: CPU, RAM, disk and network. If your setup is crossing the threshold of these metrics, you can deduce that you are pushing the limits. For example, if you have set the value of "mapred.reduce.parallel.copies" much higher than the number of cores available, you'll end up with too many threads in the waiting state, as threads are created based on this property to fetch the map output. In addition, the network might get overwhelmed. Or, if there is too much intermediate output to be shuffled, your job will become slow, as you will need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a wise value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally good). So, these are the kinds of things that will tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well. I would recommend you go through Chapter 6 of the "Hadoop Definitive Guide".
Some of the measures you could take in order to make your jobs efficient are using a combiner to limit the data transfer, enabling intermediate compression, etc.
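If you happen to drive these jobs from Hive, you can experiment with such properties per session; the property names match the ones discussed above (old mapred.* naming), but the values are purely illustrative starting points, not recommendations.

-- Number of parallel threads each reducer uses to fetch map output.
SET mapred.reduce.parallel.copies = 10;
-- Fraction of reducer heap used to buffer shuffled map output (defaults to about 0.70).
SET mapred.job.shuffle.input.buffer.percent = 0.70;
-- Compress intermediate map output to cut network and disk traffic.
SET mapred.compress.map.output = true;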
HTH
P.S.: The answer is not very specific to just "mapred.reduce.parallel.copies"; it tells you about tuning your job in general. Actually, setting only this property is not going to help you much. You should consider other important properties as well.
Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider the most important, usually overall job time. To figure out what parameters are working I would suggest using the following profiling tools that Hadoop comes with, MrBench, TestDFSIO, and NNBench. These are found in the hadoop-mapreduce-client-jobclient-*.jar.
By running this command you will see a long list of benchmark programs that you can use besides the ones I mentioned above:
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
I would suggest running with the default parameters, run tests to give baseline benchmarks, then changing one parameter and rerunning. A bit time consuming but worth it, especially if you use a script to change parameters and run the benchmarks.
