Analyze the runtime characteristics of a HiveQL query without actual execution - hadoop

How can I determine an approximate runtime of a HiveQL query without (a) executing the query or (b) fetching the results?

The Hive command EXPLAIN gives the execution plan of the query: just add the keyword EXPLAIN before the query and execute it.
Otherwise, instead of returning the result, you could return only the count of the records from the query. That might provide some insight into the execution time.
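For example, a minimal sketch of both approaches, assuming a hypothetical table named web_logs:

    -- Show the execution plan without running the query
    EXPLAIN
    SELECT ip, COUNT(*) AS hits
    FROM web_logs
    GROUP BY ip;

    -- Run the query, but return only a row count instead of the rows
    SELECT COUNT(*)
    FROM (
        SELECT ip, COUNT(*) AS hits
        FROM web_logs
        GROUP BY ip
    ) t;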

As mentioned by #visakh, EXPLAIN gives an execution plan. However, it is cryptic and does NOT give an execution time. You will have to do a fair amount of analysis on the (potentially copious) output of EXPLAIN to derive the information you are looking for.
Running "analyze" on the hive tables helps but still does not make the explain user friendly. The "explain" is a feature that my team at a former major employer requested to HortonWorks to improve.
However, I disagree with the "count" suggestion: a count typically takes about as much time as running the query itself. After all, the data has to be fetched and the various filtering and aggregation operations performed in order to return the count. Unfortunately, Hive is not intelligent enough to discard the sorting/ordering steps when doing the count, so you end up paying essentially the entire "price" of the query.

Related

Impala query with LIMIT 0

As a member of a production support team, I investigate issues with various Impala queries. While researching one issue, I saw a team submit an Impala query with LIMIT 0, which obviously returns no rows, and then submit it again without LIMIT 0 to get the result. I believe they submit these queries from IBM DataStage. Before I ask them why they do this, I wanted to check what the reason might be for someone to run a query with LIMIT 0. Is it just to check the syntax or the connection to Impala? I have seen a similar question discussed here in the context of SQL, but thought I would ask anyway from an Impala perspective. Thanks, Neel
I think you are partially correct.
Please note that LIMIT will process all the data and only then apply the limit clause.
LIMIT 0 is mostly used to:
- check whether the syntax of the SQL is correct. Impala does fetch all the records before applying the limit, so the SQL is completely validated. Some systems use this to check the SQL they generated automatically before actually running it on the server.
- avoid fetching lots of rows from a huge table or data set every time you run a SQL statement.
- create an empty table using the structure of some other table when you do not want to copy the storage format, configuration, etc. (see the sketch after this list).
- avoid burdening Hue, or any other interface that is interacting with Impala. All the data will be processed, but nothing will be returned.
- run a performance test: this gives you some idea of the run time of the SQL. I say "some" because it is not the actual time to complete, but an estimate of the time to complete the SQL.
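For illustration, a hedged sketch of the syntax-check and empty-table use cases (table names are placeholders):

    -- Returns no rows, but the statement is fully validated
    SELECT * FROM sales LIMIT 0;

    -- Create an empty table with the same column structure as sales
    CREATE TABLE sales_empty AS
    SELECT * FROM sales LIMIT 0;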

MonetDB Query Plan

I have a few queries that I am running and I would like to view some sort of query plan for a given query. When I add "explain" before the query, I get a long (~4,000 lines) result that is not possible to interpret.
The MAL plan exposes all parallel activity needed to solve the query. Each line is a relational algebra operator or catalog action.
You might also use PLAN to get an idea of the output of the SQL optimizer.
In the result of EXPLAIN, each part of the physical execution plan that will be executed in parallel is repeated once per core on your machine. That's why EXPLAIN can sometimes produce a huge MAL plan.
If you just want to get an idea of how a query is handled, you can force MonetDB to generate a sequential MAL plan; then, at least, you get rid of the repetitions. For this, you can change the default optimizer pipeline to, e.g., 'sequential_pipe'. This can be done both in a client (it then works only for that client session) and in the server (it then works for the whole server session). For more information: https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/OptimizerPipelines
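A minimal sketch of switching the pipeline for the current client session and then inspecting a query (the table name is a placeholder):

    -- Use the sequential optimizer pipeline for this session
    SET optimizer = 'sequential_pipe';

    -- Output of the SQL optimizer (relational view)
    PLAN SELECT COUNT(*) FROM lineitem;

    -- Physical MAL plan, now without per-core repetition
    EXPLAIN SELECT COUNT(*) FROM lineitem;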

Microsoft Access equivalent of explain in MySQL

I'm working on a very large query in an inherited application. It is a large insert query that involves four tables with well over a million records. I know, I would rather have this in SQL Server too, but there is no infrastructure at this customer to do that :-)
This query has worked for over a year. However, the source tables keep growing, and last week it threw the dreaded 'out of system resources' error. Bummer...!
I think it is possible to optimize this query. Working in MySQL, I would use the EXPLAIN command to see where optimisation might be possible. Is there an equivalent of this in Access? I cannot seem to find it....
kind regards,
Paul
Probably Jet ShowPlan is closest to what you want. You will have to set a registry key. Then query plan information gets dumped to a text file named SHOWPLAN.OUT. You can read about the details in this article on TechRepublic: Use Microsoft Jet's ShowPlan to write more efficient queries
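As a hedged sketch, for Jet 4.0 the registry setting is typically the following (the key path differs for newer ACE engine versions, so verify it against the article before applying):

    Windows Registry Editor Version 5.00

    ; Assumed key path for Jet 4.0; while JETSHOWPLAN is "ON",
    ; SHOWPLAN.OUT is written to the current working directory.
    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Debug]
    "JETSHOWPLAN"="ON"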
Also try the Performance Analyzer wizard. You can ask it to examine your query alone, or also ask it to examine the tables and other queries used by that query.
If you haven't compacted the database recently, see whether that improves performance. Compacting also updates index statistics which allows the engine to make better decisions for the query plan.

Performance with IN clause in PostgreSQL

What are the performance implications if you have something like this in your query:
... AND x.somfield IN (
33620,262,394,450,673,674,675,2331,2370,2903,4191,4687,5153,6776,6898,6899,7127,7217,7225,
7227,7757,8830,8889,8999,9036,9284,9381,9382,9411,9412,9423,10088,10089,10304,10333,10515,
10527,10596,10651,11442,12636,12976,13275,14261,14262,14382,14389,14567,14568,15792,16557,
17043,17459,17675,17699,17700,17712,18240,18370,18591,18980,19023,19024,19025,19026,19211,
19272,20276,20426,20471,20494,20833,21126,21315,21990,22168,22284,22349,22563,22796,23739,
24006,24321,24642,24827,24867,25049,25248,25249,25276,25572,25665,26000,26046,26646,26647,
26656,27343,27406,27753,28560,28850,29796,29817,30026,30090,31020,31505,32188,32347,32629
,32924,32931,33062,33254,33600,33601,33602,33603,33604,33605,33606,33607,33608,34010,34472,
35800,35977,36179,37342,37439,37459,38425,39592,39661,39926,40376,40561,41226,41279,41568,
42272,42481,43483,43867,44958,45295,45408,46022,46258) AND ...
Should I avoid this, or is it okay and fast enough?
Thanks
You certainly want to check the execution plan. Depending on data, it may or may not be "okay".
If the table is large enough, it's possible that PG converts that to "array contains" operation and decides not to use an index on it. This could lead to a seq scan (if you don't have other WHERE criteria on this table).
In some cases OR is better than IN, because it is executed as two index scans that are then combined. That may not work in your case, though, because you have so many values in there. Again, it depends on the data.
Unless your table is small, in such cases you usually need to rely on other criteria that are easily indexed, such as dates, states, "types", etc. Then this IN is merely a "recheck" filter on already limited data.
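A tiny sketch of the OR formulation mentioned above (table name is a placeholder, column name follows the question; only practical for a handful of values):

    -- Each equality can use its own index scan; PostgreSQL can combine
    -- them (e.g. with a BitmapOr) before touching the heap
    SELECT *
    FROM some_table x
    WHERE x.somfield = 33620
       OR x.somfield = 262;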
If the query uses an index on x.somfield, it will be fast enough.
As mentioned, you should use EXPLAIN and EXPLAIN ANALYZE to really understand what's going on there.
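A minimal sketch of checking the plan, reusing the placeholder table name from above (note that EXPLAIN ANALYZE actually executes the query, so be careful with expensive statements):

    -- Plan only, no execution
    EXPLAIN
    SELECT * FROM some_table x
    WHERE x.somfield IN (33620, 262, 394);  -- list abbreviated

    -- Plan plus real run times and row counts (runs the query!)
    EXPLAIN ANALYZE
    SELECT * FROM some_table x
    WHERE x.somfield IN (33620, 262, 394);  -- list abbreviated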

Identify source of linq to sql query

We are starting to have numerous LINQ to SQL queries in our code. We have started to pay more attention to performance and are starting to see queries that we think are coming from LINQ. They have the t1, t2...tN aliases, so we are sure they are LINQ-generated. However, we are having difficulty determining the location in the code that is the source of a given query. Obviously, we have a general idea based on the tables and columns requested.
Is there a way to "tag" or "name" a query so that it shows up in a trace, to more easily identify the query?
You might find my LINQ to SQL query profiler useful; it allows you to log queries together with the stack trace and db-side I/O, timings, execution plans, and other details that can be used to pinpoint both what effect the query had and where it came from (in code, which user action(s) and/or calls triggered it, etc.).
It has a number of filter options that you can control from within your own code, so you can set it up to catch only queries that fulfill specific criteria, e.g. queries that are expensive I/O-wise, have long execution times, do table scans, hit specific tables, or match your own custom filters. It is designed for runtime profiling, so you can distribute the logging component with your apps and switch it on as necessary in production environments.
I have posted a short intro to it here:
http://huagati.blogspot.com/2009/06/profiling-linq-to-sql-applications.html
And you can download the profiler and get a free 45-day trial license from:
http://www.huagati.com/L2SProfiler/
I have, to date, found no way to do this.
