How does one run compute stats on a subset of columns from a hive table using Impala? - hadoop

I have a very long and wide Hive table that is taking an exorbitant amount of time to return query results. I attempted a COMPUTE STATS on the table, but because of the table's width the operation often times out. Is there a way to run COMPUTE STATS on selected columns only? Documentation on the Cloudera website suggests that it is possible, but the syntax does not work.
Here is what I've tried, to no avail. Please advise, as these all result in syntax errors.
COMPUTE STATS database.table field1
COMPUTE STATS database.table field1, field2
COMPUTE STATS database.table (field1, field2)

After further research, it was confirmed that the syntax is, in fact, correct, but the column-list parameter for COMPUTE STATS was not made available until CDH 5.15.x. I am leaving this here in case anybody comes across the same issue.
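For reference, the working form on a new enough release looks like this (a sketch based on the syntax above; database.table, field1, and field2 are placeholders):
-- Requires CDH 5.15.x or later; older releases reject the column list
COMPUTE STATS database.table (field1, field2);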

Hive works by creating jobs that run in a different engine (originally MapReduce, which can be rather slow) and the underlying engine can be changed.
Rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are generally faster than MapReduce.
Newer versions of Hive also support an architecture called LLAP (Live Long And Process) which caches metadata similarly to Impala, reducing query latency.
You may want to test some typical queries against your own tables to see if one of these works better for you than Impala for interactive and ad-hoc queries.
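As a quick illustration, the engine can be switched per session in Hive, assuming Tez or Spark is actually installed and configured on your cluster:
-- Run this session's queries on Tez instead of MapReduce
SET hive.execution.engine=tez;
-- Or on Spark, if Hive-on-Spark is set up
SET hive.execution.engine=spark;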
UNDERSTANDING EXECUTION PLANS
To get a true grasp on what causes a query to take a long time, you need to understand what operations Hive or Impala will perform when it executes a query.
To find this out, you can view the execution plan for a query.
The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.
To see an execution plan for a query, prefix the query with the keyword EXPLAIN, then run it.
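For example (my_table and the filter are placeholders; the same keyword works in both Hive and Impala):
-- Shows the plan without running the query
EXPLAIN SELECT COUNT(*) FROM my_table WHERE col1 = 'some_value';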
Execution plans can be long and complex.
Fully understanding them requires a deep knowledge of MapReduce.
The execution plans provided by Hive and by Impala look slightly different, but at a basic level, they provide more or less the same information.
See also: Hive explain plan understanding
TABLE AND COLUMN STATISTICS
The SQL engines do a certain amount of query optimization on their own: when possible, they look for the best way to proceed with your query.
When the query uses joins, the optimizers can do a better job when they have table statistics and column statistics.
For the table as a whole, these statistics include the number of rows, the number of files used to store the data, and the total size of the data.
The column statistics include the approximate number of distinct values and the maximum and average sizes of the values (not the maximum or average value, but rather the size used in storage).
The optimizers use this information when deciding how to perform the join tasks.
Statistics also help your system prevent issues due to memory usage and resource limitations.
These statistics are not calculated automatically; you have to trigger the calculation manually with a SQL command.
Once statistics are computed, both Hive and Impala can use them, even when they were calculated by the other engine. If you make any changes to the table, such as adding or deleting data, you'll need to recompute the statistics.
However, when you have both Impala and Hive available, Cloudera recommends using Impala's COMPUTE STATS command to calculate and view the statistics.
The method for Hive is a bit more difficult to use.
If you do use Hive, you must refresh Impala's metadata cache for the table if you want Impala to use the statistics.
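That refresh step is a one-liner in impala-shell (a sketch; dbname.tablename is a placeholder, and on some versions the heavier INVALIDATE METADATA is needed instead of REFRESH):
-- Make Impala pick up statistics computed in Hive
REFRESH dbname.tablename;
-- Or force a full metadata reload if REFRESH isn't enough
INVALIDATE METADATA dbname.tablename;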
Statistics in Impala
Impala's syntax for calculating statistics for a table (including statistics for all columns) is COMPUTE STATS dbname.tablename;
If the table is in the active database, you can omit dbname. from the command.
To see the statistics in Impala, run SHOW TABLE STATS dbname.tablename; or
SHOW COLUMN STATS dbname.tablename;
Note: If the statistics have not yet been computed, #Rows for the table shows -1. The #Nulls statistic for each column will always be -1; old versions of Impala would calculate it, but it is not used for optimization, so newer versions skip it.
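Putting the Impala commands above together (dbname.tablename is a placeholder):
-- Compute statistics for the table and all its columns
COMPUTE STATS dbname.tablename;
-- Inspect the results
SHOW TABLE STATS dbname.tablename;
SHOW COLUMN STATS dbname.tablename;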
Statistics in Hive
Hive's syntax for calculating statistics for a table is ANALYZE TABLE dbname.tablename COMPUTE STATISTICS;
If the table is in the active database, you can omit dbname. from the command.
To calculate column statistics, add FOR COLUMNS at the end of the command.
To see the table statistics in Hive, run DESCRIBE FORMATTED dbname.tablename;
The Table Parameters section will include numFiles, numRows, rawDataSize, and totalSize.
To see the statistics for a column, include the column name at the end:
DESCRIBE FORMATTED dbname.tablename columnname;
You can only display column statistics one column at a time.
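The equivalent Hive sequence, with the same placeholders:
-- Table statistics, then column statistics
ANALYZE TABLE dbname.tablename COMPUTE STATISTICS;
ANALYZE TABLE dbname.tablename COMPUTE STATISTICS FOR COLUMNS;
-- Table stats appear under Table Parameters
DESCRIBE FORMATTED dbname.tablename;
-- Column stats, one column at a time
DESCRIBE FORMATTED dbname.tablename columnname;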

Related

Impala query with LIMIT 0

As a production support team member, I investigate issues with various Impala queries. While researching one issue, I saw a team submit an Impala query with LIMIT 0, which obviously returns no rows, and then submit it again without LIMIT 0, which gives them the result. I believe they submit these queries from IBM DataStage. Before I ask them why they do this, I wanted to check what the reason might be for someone to run a query with LIMIT 0. Is it just to check the syntax or the connection with Impala? I have seen a similar question discussed here in the context of SQL, but thought I would ask anyway from an Impala perspective. Thanks, Neel
I think you are partially correct.
Please note that Impala will process all the data and only then apply the LIMIT clause.
LIMIT 0 is mostly used to:
check that the syntax of the SQL is correct. Impala still fetches all the records before applying the limit, so the SQL is completely validated. Some systems use this to check automatically generated SQL before actually running it against the server.
avoid fetching lots of rows from a huge table or data set every time you run a SQL statement.
create an empty table using the structure of another table, when you do not want to copy the stored data, format, configuration, etc.
avoid burdening Hue (or whatever interface is interacting with Impala): all the data is processed, but none of it is returned.
run a rough performance test. This gives you somewhat of an idea of the run time of the SQL; somewhat, because it is not the actual time to complete, but an estimate of it.
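Two of those uses sketched out (table and column names are hypothetical):
-- Validate generated SQL and the Impala connection without returning rows
SELECT cust_id, order_total FROM big_table LIMIT 0;
-- Create an empty table with the same columns as another table
CREATE TABLE empty_copy AS SELECT * FROM big_table LIMIT 0;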

Oracle: Analyze table vs Gather_Table_Stats vs Gather_Schema_Stats

Currently in my DB, my predecessor created a job that, every weekend, gathers a list of all tables in the DB and executes the two commands below, in that order, for each table individually:
ANALYZE TABLE xxTable_Namexx ESTIMATE STATISTICS FOR ALL INDEXED COLUMNS SIZE 75 SAMPLE 100 PERCENT;
EXEC dbms_stats.gather_table_stats(xxSchemaxx,xxTable_Namexx,cascade=>TRUE);
Recently someone suggested to me that:
1.) ANALYZE TABLE is an old method of gathering stats and the optimizer doesn't use it anymore? Is it true that the stats from this command are useless throughout the DB, or are they still used somewhere?
2.) Instead of doing all this, is it enough to run the following daily:
dbms_stats.gather_schema_stats(xxSchemaxx,cascade=>true);
What is the size/sample % if nothing is specified?
3.) What is the general practice/frequency for gathering stats in a database where data is updated, inserted, and deleted daily? I still prefer doing it every weekend.
Thanks in advance.
Yes, ANALYZE is hardly used nowadays:
For the collection of most statistics, use the DBMS_STATS package, which lets you collect statistics in parallel, collect global statistics for partitioned objects, and fine-tune your statistics collection in other ways. See Oracle Database PL/SQL Packages and Types Reference for more information on the DBMS_STATS package.
Use the ANALYZE statement (rather than DBMS_STATS) for statistics collection not related to the cost-based optimizer:
To use the VALIDATE or LIST CHAINED ROWS clauses
To collect information on freelist blocks
What is the size/sample % if nothing is specified?
The parameter is called estimate_percent:
Percentage of rows to estimate (NULL means compute): The valid range is [0.000001,100]. Use the constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine the appropriate sample size for good statistics. This is the default. The default value can be changed using the SET_DATABASE_PREFS, SET_GLOBAL_PREFS, SET_SCHEMA_PREFS and SET_TABLE_PREFS procedures.
You can interrogate the default value with DBMS_STATS.GET_PARAM('ESTIMATE_PERCENT') or DBMS_STATS.GET_PREFS('ESTIMATE_PERCENT').
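For example (a sketch; GET_PREFS is the current interface, GET_PARAM is older and deprecated):
-- Returns the current ESTIMATE_PERCENT preference (AUTO_SAMPLE_SIZE by default)
SELECT DBMS_STATS.GET_PREFS('ESTIMATE_PERCENT') FROM dual;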
By default, Oracle creates an internal scheduler job at installation (called BSLN_MAINTAIN_STATS_JOB) that gathers the statistics automatically at night.

Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100,000 rows.
The following query takes around 1.5 seconds to execute, even when using the vsql client directly on one of the Vertica cluster nodes (to eliminate any network latency issues):
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export only the columns used by the query to a new table and run the query on this new, smaller table, I get a blazingly fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? If so, why is that?
Thanks in advance
It sounds like your query is running against the (default) superprojection that includes all of the table's columns. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
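A minimal sketch of such a projection for the query above (the projection name and sort order are assumptions; in practice the Database Designer proposes these for you):
-- Query-specific projection holding only the two referenced columns
CREATE PROJECTION my_table_narrow AS
SELECT COL_26, COL_132 FROM MY_TABLE
ORDER BY COL_26;
-- Populate the new projection from existing data
SELECT START_REFRESH();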
I was running Vertica 8.1.0-1; it turned out the issue was a Vertica bug in the query planning phase that caused a performance degradation. It was solved in versions >= 8.1.1:
[https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm]
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

How do we define an HBase row key so records are retrieved efficiently when there are millions of records in a table?

I have 30 million records in a table, but when I try to find one record it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched quickly?
Right now I use an auto-increment ID (1, 2, 3, and so on) as the row key. What steps should I take to improve performance? Let me know your concerns.
Generally, when performance matters for a SQL structured table, we follow some basic/general tuning: apply proper indexes to the columns used in queries, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from context switching between hard disk and buffer, and context switching between different servers. You need to figure out how to reduce context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query structure and try to improve performance.
An integer row key will give the best performance, but always create the row key/index when you first create the table, because adding it later kills performance.
When creating external tables in Hive / Impala against HBase tables, map the HBase row key to a string column in Hive / Impala. If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table; use BETWEEN or =, <, >= instead, as in the sketch after this list.
If you are not using a filter against the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of the data.
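A hedged sketch of that mapping and the range-scan pattern (all names are hypothetical; the mapping string follows the standard HBaseStorageHandler convention):
-- Hive external table over HBase, row key mapped to a STRING column
CREATE EXTERNAL TABLE hbase_events (row_key STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');
-- Range scan on the row key; only the matching key range is read
SELECT * FROM hbase_events
WHERE row_key BETWEEN 'user123_20230101' AND 'user123_20230131';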

Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
Here facts has about 422,000 rows and Product has 600 rows. The grouping comes back with 4 rows.
First I tested this query on SQL Server and got a result back in about 150 ms.
With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 sec).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
With Parquet, the first run returned in about 3 sec; the next run took about 900 ms, and then it stabilized at about 500 ms.
From reading around, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process so the scan and join is very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then lookup the data sources, access the data using the connectors, run the query, return the results to the first node for aggregation, and then you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
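For reference, this is roughly how you'd save a table as Parquet from inside Drill (a sketch: dfs.tmp must be a writable workspace, and mssql stands for whatever you named your SQL Server storage plugin):
-- Write query results as Parquet files in a writable workspace
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`facts_parquet` AS SELECT * FROM mssql.dbo.facts;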
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500 ms with Parquet is actually impressive, considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how it translates a SQL query into an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say "data source" so people won't think only of databases or software-based data management systems.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing whatever is needed to execute the query: filtering, sorting, joining, projecting (selecting specific columns), etc.
So by default Drill doesn't use any of the source's capabilities for processing the queried data. In fact, the source may not support any such capability at all!
If you wish to leverage any of the source's data-processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance is this one:
SELECT a.CUST_ID,
(SELECT COUNT(*) FROM SALES.CUSTOMERS WHERE CUST_ID < a.CUST_ID) rowNum
FROM SALES.CUSTOMERS a
ORDER BY CUST_ID
Only because of the < comparison operator, Drill has to load the whole table (actually a Parquet file), sort it, then perform the join.
This query took around 18 minutes to run on my machine, which is not especially powerful, but still, the effort Drill needs to process this query must not be ignored.
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. There are probably other things I can't think of at the moment, but you may find more information in other answers.
