Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100,000 rows.
The following query takes around 1.5 seconds to execute, even when running the vsql client directly on one of the Vertica cluster nodes (to rule out network latency):
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? And if not, why does it behave this way?
Thanks in advance

It sounds like your query is running against the (default) superprojection that includes all of the table's columns. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses it instead of the superprojection. (It's a little more complicated than that, because physical layout is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve them over time.
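As a rough sketch (the projection name is made up, and segmentation/node clauses are omitted), a query-specific projection for the query above might look like this:
CREATE PROJECTION my_table_query_proj AS
SELECT COL_26, COL_132 FROM MY_TABLE
ORDER BY COL_26;
SELECT REFRESH('MY_TABLE'); -- populate the new projection from the existing data
After the refresh, the optimizer can pick the smaller projection for queries that touch only these two columns.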

I was running Vertica 8.1.0-1; the issue turned out to be a bug in the query planning phase that caused a performance degradation. It was fixed in versions >= 8.1.1:
https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

Related

Speed up SQLite 'SELECT EXISTS' query

I've got a 3GB SQLite database file with a single table with 40 million rows and 14 fields (mostly integers and very short strings and one longer string), no indexes or keys or other constraints -- so really nothing fancy. I want to check if there are entries where a specific integer field has a specific value. So of course I'm using
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)
I haven't got much experience with SQLite and databases in general and on my first test query, I was shocked that this simple query took about 30 seconds. Subsequent tests showed that it is much faster if a matching row occurs at the beginning, which of course makes sense.
Now I'm thinking of doing an initial SELECT DISTINCT barField FROM FooTable at application startup and caching the results in software. But I'm sure there must be a cleaner way to do this in SQLite itself; I mean, that should be part of a DBMS's job, right?
But so far, I've only created primary keys to speed up queries, which doesn't work here because the field values are non-unique. So how can I speed up this query so that it runs in roughly constant time? (It doesn't have to be lightning fast; I'd be completely fine if it were under one second.)
Thanks in advance for answering!
P.S. Oh, and there will be about 500K new rows every month for an indefinite period of time, and it would be great if that doesn't significantly increase query time.
Adding an index on barField should speed up the subquery inside the EXISTS clause:
CREATE INDEX barIdx ON FooTable (barField);
To satisfy the query, SQLite would only have to seek the index once and detect that there is at least one matching value.
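To verify that the index is actually being used, you can inspect the plan (the literal 42 is just a placeholder value):
EXPLAIN QUERY PLAN SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField = 42);
-- expected output mentions something like: SEARCH FooTable USING COVERING INDEX barIdx (barField=?)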

How does one run compute stats on a subset of columns from a hive table using Impala?

I have a very long and wide Hive table that is taking an exorbitant amount of time to return query results. I attempted a COMPUTE STATS on the table, but due to its width, the operation often times out. So I was wondering: is there a way to run COMPUTE STATS on select columns? Documentation on the Cloudera website suggests that it is possible, but the syntax does not work.
Here is what I've tried, but to no avail. Please advise, as these all result in syntax errors:
COMPUTE STATS database.table field1
COMPUTE STATS database.table field1, field2
COMPUTE STATS database.table (field1, field2)
After further research, it was confirmed that the syntax is, in fact, correct but the column list parameter for compute stats was not made available until CDH 5.15.x. I am leaving this here in case anybody comes across the same issue.
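So on CDH 5.15.x / Impala 2.12 and later, the parenthesized form from the third attempt above should work:
COMPUTE STATS database.table (field1, field2);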
Hive works by creating jobs that run in a different engine (originally MapReduce, which can be rather slow) and the underlying engine can be changed.
Rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce.
Newer versions of Hive also support an architecture called LLAP (Live Long And Process), which uses long-lived daemons and in-memory caching to reduce query latency, similar to Impala's approach.
You may want to test some typical queries against your own tables to see if one of these works better for you than Impala for interactive and ad-hoc queries.
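If your distribution ships one of the alternative engines, switching is typically a one-line session setting (a sketch; which values are available depends on your Hive version and installation):
SET hive.execution.engine=tez; -- or 'mr' / 'spark', depending on what is installed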
UNDERSTANDING EXECUTION PLANS
To get a true grasp on what causes a query to take a long time, you need to understand what operations Hive or Impala will perform when it executes a query.
To find this out, you can view the execution plan for a query.
The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.
To see the execution plan for a query, prefix the query with the keyword EXPLAIN, then run it.
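For example (table and filter are hypothetical):
EXPLAIN SELECT COUNT(*) FROM mydb.sales WHERE region = 'EU';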
Execution plans can be long and complex.
Fully understanding them requires a deep knowledge of MapReduce.
The execution plans provided by Hive and by Impala look slightly different, but at a basic level, they provide more or less the same information.
See also: Hive explain plan understanding
TABLE AND COLUMN STATISTICS
The SQL engines you use do a certain amount of query optimization on their own: they look for the best way to proceed with your query, when possible.
When the query uses joins, the optimizers can do a better job when they have table statistics and column statistics.
For the table as a whole, these statistics include the number of rows, the number of files used to store the data, and the total size of the data.
The column statistics include the approximate number of distinct values and the maximum and average sizes of the values (not the maximum or average value, but rather the size used in storage).
The optimizers use this information when deciding how to perform the join tasks.
Statistics also help your system prevent issues due to memory usage and resource limitations.
These statistics are not calculated automatically; you have to trigger the calculation manually using a SQL command.
If you make any changes to the table, such as adding or deleting data, you'll need to recompute the statistics.
Once computed, the statistics can be used by both Hive and Impala, even when they were calculated by the other engine.
However, when you have both Impala and Hive available, Cloudera recommends using Impala's COMPUTE STATS command to calculate and view the statistics; the method for Hive is a bit more difficult to use.
If you do compute the statistics in Hive, you must refresh Impala's metadata cache for the table if you want Impala to use them.
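For example (hypothetical table name; which command is needed depends on your Impala version):
INVALIDATE METADATA mydb.sales;
-- or, where a lighter metadata reload is sufficient:
REFRESH mydb.sales;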
Statistics in Impala
Impala's syntax for calculating statistics for a table (including statistics for all columns) is COMPUTE STATS dbname.tablename;
If the table is in the active database, you can omit dbname. from the command.
To see the statistics in Impala, run SHOW TABLE STATS dbname.tablename; or
SHOW COLUMN STATS dbname.tablename;
Note: If the statistics have not yet been computed, #Rows for the table shows -1.
The #Nulls statistics for each column will always be -1;
old versions of Impala would calculate this statistic, but it is not used for optimization, so newer versions skip it.
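Putting it together for a hypothetical table:
COMPUTE STATS mydb.sales;
SHOW TABLE STATS mydb.sales;
SHOW COLUMN STATS mydb.sales;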
Statistics in Hive
Hive's syntax for calculating statistics for a table is ANALYZE TABLE dbname.tablename COMPUTE STATISTICS;
If the table is in the active database, you can omit dbname. from the command.
To calculate column statistics, add FOR COLUMNS at the end of the command.
To see the table statistics in Hive, run DESCRIBE FORMATTED dbname.tablename;
The Table Parameters section will include numFiles, numRows, rawDataSize, and totalSize.
To see the statistics for a column, include the column name at the end:
DESCRIBE FORMATTED dbname.tablename columnname;
You can only display column statistics one column at a time.
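A complete Hive sequence for a hypothetical table and column might look like this:
ANALYZE TABLE mydb.sales COMPUTE STATISTICS;
ANALYZE TABLE mydb.sales COMPUTE STATISTICS FOR COLUMNS;
DESCRIBE FORMATTED mydb.sales;
DESCRIBE FORMATTED mydb.sales region; -- column statistics for the 'region' column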

How to maintain cache in ClickHouse?

I use crontab to run a SQL query against a big table every 2 hours.
select a,b,c,d,e,f,g,h,i,j,k,many_cols from big_table format Null
It takes anywhere from 30 seconds to 5 minutes each time.
What I can see from the query_log is that when the query time is low, the MarkCacheHits value is high; when the time is high, the MarkCacheHits value is low and the MarkCacheMiss value is high.
And I'm wondering how to get as many mark cache hits as possible. (This is probably not the only big table that needs to be warmed up.)
Will the mark cache be replaced by other queries, and what is its limit?
Does warming up by selecting specific columns really work for an aggregate query over those columns? For example, the warm-up SQL is as above, and the aggregate query could be select a,sum(if(b,c,0)) from big_table group by a
My ClickHouse server has been hanging occasionally recently, and I can't see any errors or exceptions in the logs at the corresponding times. Could this be related to my regular warm-up query of the big table?
In reality you are placing the data into the Linux page cache (disk cache).
Will the mark cache be replaced by other queries, and what is its limit?
Yes, it will be replaced. The limit is 5 GB by default: <mark_cache_size>5368709120</mark_cache_size>
Does warming up by selecting specific columns really work for an aggregate query over those columns?
Yes, because you are putting the files into the Linux cache.
Could this be related to my regular warm-up query of the big table?
No.
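If you want to see how much the mark cache currently holds, you can query the server metrics (a sketch; metric names can vary between ClickHouse versions):
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'MarkCache%';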

(TSQL) INSERT doubling time of the query

I have a quite complex multi-join T-SQL SELECT query that runs for about 8 seconds and returns about 300K records, which is currently acceptable. But I need to reuse the results of that query several times later, so I am inserting them into a temp table. The table is created in advance with columns that match the output of the SELECT query. But as soon as I do INSERT INTO ... SELECT, execution time more than doubles to over 20 seconds! The execution plan shows that 46% of the query cost goes to "Table Insert" and 38% to "Table Spool (Eager Spool)".
Any idea why this is happening and how to speed it up?
Thanks!
The "Why" of it hard to say, we'd need a lot more information. (though my SWAG would be that it has to do with logging...)
However, the solution, nine times out of ten, is to use SELECT INTO to make your temp table.
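A minimal sketch, with placeholder table and column names:
SELECT a.col1, b.col2
INTO #MyResults -- the temp table is created automatically to match the select list
FROM TableA a
JOIN TableB b ON b.id = a.id;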
I would start by looking at standard tuning items. Is the disk performing well? Are there sufficient resources (I/O, RAM, CPU, etc.)? Is there a bottleneck in the RDBMS? It doesn't sound like the issue, but what is happening with locking? Does other code give similar results? Is other code performant?
A few things I can suggest based on the information you have provided. If you don't care about dirty reads, you could always change the transaction isolation level (if you're using MS T-SQL):
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select ...
This may speed up your initial query, as locks will not need to be taken on the data you are querying. If you're not using SQL Server, search for how to do the same thing with the technology you are using.
For the insert portion, you said you are inserting into a temp table. Does your database support adding primary keys or indexes on a temp table? If it does, add a dummy column that is indexed. Also, have you tried using a regular database table for this? Depending on your setup, it is possible that this will speed up your insert times.
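For example, in T-SQL (the column list is hypothetical):
CREATE TABLE #Results (
    RowID INT IDENTITY(1,1) PRIMARY KEY, -- dummy indexed column
    Col1 INT,
    Col2 VARCHAR(50)
);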

Slow Performance on Sql Express after inserting big chunks of data

We have noticed that our queries run slower on databases that had big chunks of data added (bulk insert) than on databases that had the data added record by record, but with similar amounts of data.
We use SQL Server 2005 Express, and we tried reindexing all indexes without any better results.
Do you know of some kind of structural problem on the database that can be caused by inserting data in big chunks instead of one by one?
Thanks
One tip I've seen is to turn off Auto-create stats and Auto-update stats before doing the bulk insert:
ALTER DATABASE databasename SET AUTO_CREATE_STATISTICS OFF WITH NO_WAIT
ALTER DATABASE databasename SET AUTO_UPDATE_STATISTICS OFF WITH NO_WAIT
Afterwards, manually create statistics by one of two methods:
--generate statistics quickly using a sample of data from the table
exec sp_createstats
or
--generate statistics using a full scan of the table
exec sp_createstats @fullscan = 'fullscan'
You should probably also turn Auto-create and Auto-update stats back on when you're done.
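That is, mirroring the commands above:
ALTER DATABASE databasename SET AUTO_CREATE_STATISTICS ON WITH NO_WAIT
ALTER DATABASE databasename SET AUTO_UPDATE_STATISTICS ON WITH NO_WAIT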
Another option is to check and defrag the indexes after a bulk insert. Check out Pinal Dave's blog post.
SQL Server probably allocated the new disk space in many small chunks. When doing big transactions, it's better to pre-allocate plenty of space in both the data and log files.
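For example (the logical file names are placeholders; check sys.database_files for the real ones):
ALTER DATABASE mydb MODIFY FILE (NAME = mydb_data, SIZE = 10GB);
ALTER DATABASE mydb MODIFY FILE (NAME = mydb_log, SIZE = 2GB);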
That's an interesting question.
I would have guessed that Express and non-Express have the same storage layout, so when you're searching for other people with similar problems, don't restrict yourself to the Express version. On the other hand, bulk insert is a commonplace operation and performance is important, so I wouldn't consider it likely that this is a previously undetected bug.
One obvious question: which is the clustered index? Is the clustered index also the primary key? Is the primary key unassigned when you insert, and therefore initialized by the database? If so, then maybe there's a difference between the two insert methods in the pattern or sequence of successive values assigned by the database, which affects the way the data is clustered, which then affects performance.
Something else: as well as indexes, people say that SQL Server uses statistics (which it created as a result of running previous queries) to optimize its execution plan. I don't know the details of that, but as well as reindexing all indexes, check the execution plans of your queries in the two test cases to ensure that the plans are identical (and/or check the associated statistics).
