Is 'distinct' an ordinary operation for ClickHouse? - clickhouse

I would like to use ClickHouse for marketing. Most of the time they not just want to know HOW much people use some feature but the exact emails to send spam to.
Is that a good choice to use ClickHouse for such purpose (select DISTINCT email from table where ...)? What is the difference in performance between 'select COUNT' and 'select DISTINCT'?

Is that a good choice to use ClickHouse for such purpose
Yes, ClickHouse has decent HashTable and Aggregator implementations. It heavily uses templated code for static type dispatching and applies a lot of memory tricks. And it stores data in a compact form.
I assume you'd like to compare select count and select count(distinct) as select distinct is a different beast. ClickHouse transforms count(distinct) into aggregator uniqExact which is about 8 times slower than count(*), but is still much faster than tranditional databases like Postgres. There are also approximate aggregators uniq, uniqCombined and uniqHLL12 for faster estimations, which is around 1.5 times slower than count(*). See https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/ if you need more info.
If your goal is select distinct, ClickHouse can still do it well, which uses Set data structure to uniquify the data streams (Set is also used for building its SQL in (...) construct). Without measuring the data output process, it's only 1.3x slower than plain count(*).

Related

Efficient sqlite query based on list of primary keys

For querying an sqlite table based on a list of IDs (i.e. distinct primary keys) I am using following statement (example based on the Chinook Database):
SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)
However, my actual list of IDs might become rather large (>1000). Thus, I was wondering if this approach using the IN statement is the most efficient or if there is a better/optimized way to query an sqlite table based on a list of primary keys.
If the number of elements in the IN is large enough, SQLite constructs a temporary index for them. This is likely to be more efficient than creating a temporary table manually.
The length of the IN list is limited only be the maximum length of an SQL statement, and by memory.
Because the statement you wrote does not include any instructions to SQLite about how to find the rows you want the concept of "optimizing" doesn't really exist -- there's nothing to optimize. The job of planning the best algorithm to retrieve the data belongs to the SQLite query optimizer.
Some databases do have idiosyncrasies in their query optimizers which can lead to performance issues but I wouldn't expect SQLite to have any trouble finding the correct algorithm for this simple query, even with lots of values in the IN list. I would only worry about trying to guide the query optimizer to another execution plan if and when you find that there's a performance problem.
SQLite Optimizer Overview
IN (expression-list) does use an index if available.
Beyond that, I can't glean any guarantees from it, so the following is subject to a performance measaurement.
Axis 1: how to pass the expression-list
hardocde as string. Overhead for int-to-string conversion and string-to-int parsing
bind parameters (i.e. the statement is ... WHERE CustomerID in (?,?,?,?,?,?,?,?,?,?....), which is easier to build from a predefined string than hardcoded values). Prevents int → string → int conversion, but the default limit for number of parameters is 999. This can be increased by SQLITE_LIMIT_VARIABLE_NUMBER, but might lead to excessive allocations.
Temporary table. Possibly less efficient than any of the above methods after the statement is prepared, but that doesn't help if most time is spent preparing the statement
Axis 2: Statement optimization
If the same expression-list is used in multiple queries against changing CustomerIDs, one of the following may help:
reusing a prepared statement with hardcoded values (i.e. don't pass 1001 parameters)
create a temporary table for the CustomerIDs with index (so the index is created once, not on the fly for every query)
If the expression-list is different with every query, ist is probably best to let SQLite do its job. The following might be an improvement
create a temp table for the expression-list
bulk-insert expression-list elements using union all
use a sub query
(from my experience with SQLite, I'd expect it to be on par or slightly worse)
Axis 3 Ask Richard
the sqlite mailing list (yeah I know, that technology even older than rotary phones!) is pretty active with often excellent advise, including from the author of SQLite. 90% chance someone will dismiss you ass "Measure before asking suhc a question!", 10% chance someone gives you detailed insight.

ADO Search Performance

Because I am not familiar with ADO under the hood, I was wonder which of the two methods of finding a record generally yields quicker results using VB6.
Use a 'select' statement using 'where' as a qualifier. If the recordset count yields zero, the record was not found.
Select all records iterating through records with a client-side cursor until record is found, or not at all.
The recordset is in the range of 10,000 records and will grow. Also, I am open to anything that will yield shorter search times other than what was mentioned.
SELECT count(*) FROM foo WHERE some_column='some value'
If the result is greater than 0 the record satisfying your condition was found in the database. It is unlikely you would get any faster than this. Proper indexes on the columns you are using in the WHERE clause could considerably improve performance.
In every case I can think of, selecting using the where clause is faster.
Even in situations where the client code will iterate through the whole database (file-based databases like Access, for example), you will have optimized code written in c or c++ doing the selection (in the database driver.) This is always faster than VB6.
For Database engines (SQL, MySQL, etc), the performance increase can even be more profound. By using the where clause, you limit the amount of data that must be transmitted over the network, vastly improving the response.
Some additional performance tips:
Select only the fields you want.
Build indexes on frequently used fields
Watch what kind of recordset you are returning. Use Forward-only cursors if you are just returning data from a database.
Lastly, I was shocked by VB.NET's database performance, it being several times faster than the fastest VB6 code.

do's and don'ts for writing mysql queries

One thing I always wonder while writing query is that am I writing most optimized query or not? I know certain things like:
1) using SELECT field1, filed2 instead of SELECT *
2) Giving proper indexes to the tables
but I am sure there are more things that should be kept in mind for writing queries, since most of the database can only grow more and optimal query will help in execution time. Can you share some tips and tricks on writing queries?
Testing is the best way to measure performance. Monitor your queries on the live database and make use of things like the slow query log.
I would also recommend enabling the query cache, which will give most typical usage situations a massive boost.
Use proper data types for your fields
Use back-tick character (`) for reserved keywords
When dealing with multiple tables, try using joins
Resource:
See:
20 SQL Tips
As well as the Do's and Dont's, you may find the Hidden Features of MySQL useful.
As a matter of fact, no "tips" can help you.
Database design require deep knowledge, not tips.
There are always "weight" of these "dont's". Most of such listings fall to list most unimportant things and fail to mention important ones. Your list for example, is if it was culinary forum:
Always use a knife with black handle
To prepare good dish you need to choose proper ingredients.
First one is impressing but never help in the real world.
Second one is right, but must be backed with deep knowledge to make it right.
So, it must be a book, not tips. Ones from Paul Dubios are among recommended.
use below fields necessarily in each table
tablename_id( auto increment , unsigned zerofill)
created_by( timestamp)
tablerow_status( enum ('t','f') by default set 't')
always make an comment when u create a field in mysql( it helps when u search in phpmyadmin))
alwayz take care of Normalization forms
if u r doing some field that would be alwayz positive then select unsigned .
use decimal data type instead of float in somw case( like discount, it should be maximum 99.99% so use decimal( 5,2)
use date, time data type whereve needed, don't use timestamp everywhere
Correlated subqueries are very bad, but often not well understood and end up in production. They can often be fixed by using derived tables and a join instead.
http://en.wikipedia.org/wiki/Correlated_subquery
One more thing I found today is regarding the difference between COUNT(*) and COUNT(col)
Using COUNT(*) is faster than COUNT(col)
MYISAM tables cached number of rows in this table, for innoDB doesn't cache row count and may be slower without WHERE clause
It is better to use NOT NULL column for both MYISAM and innoDB than some other column where Null is allowed.
More details here

ABAP select performance hints?

Are there general ABAP-specific tips related to performance of big SELECT queries?
In particular, is it possible to close once and for all the question of FOR ALL ENTRIES IN vs JOIN?
A few (more or less) ABAP-specific hints:
Avoid SELECT * where it's not needed, try to select only the fields that are required. Reason: Every value might be mapped several times during the process (DB Disk --> DB Memory --> Network --> DB Driver --> ABAP internal). It's easy to save the CPU cycles if you don't need the fields anyway. Be very careful if you SELECT * a table that contains BLOB fields like STRING, this can totally kill your DB performance because the blob contents are usually stored on different pages.
Don't SELECT ... ENDSELECT for small to medium result sets, use SELECT ... INTO TABLE instead.
Reason: SELECT ... INTO TABLE performs a single fetch and doesn't keep the cursor open while SELECT ... ENDSELECT will typically fetch a single row for every loop iteration.
This was a kind of urban myth - there is no performance degradation for using SELECT as a loop statement. However, this will keep an open cursor during the loop which can lead to unwanted (but not strictly performance-related) effects.
For large result sets, use a cursor and an internal table.
Reason: Same as above, and you'll avoid eating up too much heap space.
Don't ORDER BY, use SORT instead.
Reason: Better scalability of the application server.
Be careful with nested SELECT statements.
While they can be very handy for small 'inner result sets', they are a huge performance hog if the nested query returns a large result set.
Measure, Measure, Measure
Never assume anything if you're worried about performance. Create a representative set of test data and run tests for different implementations. Learn how to use ST05 and SAT.
There won't be a way to close your second question "once and for all". First of all, FOR ALL ENTRIES IN 'joins' a database table and an internal (memory) table while JOIN only operates on database tables. Since the database knows nothing about the internal ABAP memory, the FOR ALL ENTRIES IN statement will be transformed to a set of WHERE statements - just try and use the ST05 to trace this. Second, you can't add values from the second table when using FOR ALL ENTRIES IN. Third, be aware that FOR ALL ENTRIES IN always implies DISTINCT. There are a few other pitfalls - be sure to consult the on-line ABAP reference, they are all listed there.
If the number of records in the second table is small, both statements should be more or less equal in performance - the database optimizer should just preselect all values from the second table and use a smart joining algorithm to filter through the first table. My recommendation: Use whatever feels good, don't try to tweak your code to illegibility.
If the number of records in the second table exceeds a certain value, Bad Things [TM] happen with FOR ALL ENTRIES IN - the contents of the table are split into multiple sets, then the query is transformed (see above) and re-run for each set.
Another note: The "Avoid SELECT *" statement is true in general, but I can tell you where it is false.
When you are going to take most of the fields anyway, and where you have several queries (in the same program, or different programs that are likely to be run around the same time) which take most of the fields, especially if they are different fields that are missing.
This is because the App Server Data buffers are based on the select query signature. If you make sure to use the same query, then you can ensure that the buffer can be used instead of hitting the database again. In this case, SELECT * is better than selecting 90% of the fields, because you make it much more likely that the buffer will be used.
Also note that as of the last version I tested, the ABAP DB layer wasn't smart enough to recognize SELECT A, B as being the same as SELECT B, A, which means you should always put the fields you take in the same order (preferable the table order) in order to make sure again that the data buffer on the application is being well used.
I usually follow the rules stated in this pdf from SAP: "Efficient Database Programming with ABAP"
It shows a lot of tips in optimizing queries.
This question will never be completely answered.
ABAP statement for accessing database is interpreted several times by different components of whole system (SAP and DB). Behavior of each component depends from component itself, its version and settings. Main part of interpretation is done in DB adapter on SAP side.
The only viable approach for reaching maximum performance is measurement on particular system (SAP version and DB vendor and version).
There are also quite extensive hints and tips in transaction SE30. It even allows you (depending on authorisations) to write code snippets of your own & measure it.
Unfortunately we can't close the "for all entries" vs join debate as it is very dependent on how your landscape is set up, wich database server you are using, the efficiency of your table indexes etc.
The simplistic answer is let the DB server do as much as possible. For the "for all entries" vs join question this means join. Except every experienced ABAP programmer knows that it's never that simple. You have to try different scenarios and measure like vwegert said. Also remember to measure in your live system as well, as sometimes the hardware configuration or dataset is significantly different to have entirely different results in your live system than test.
I usually follow the following conventions:
Never do a select *, Select only the required fields.
Never use 'into corresponding table of' instead create local structures which has all the required fields.
In the where clause, try to use as many primary keys as possible.
If select is made to fetch a single record and all primary keys are included in where clause use Select single, or else use SELECT UP TO TO 1 ROWS, ENDSELECT.
Try to use Join statements to connect tables instead of using FOR ALL ENTRIES.
If for all entries cannot be avoided ensure that the internal table is not empty and a delete the duplicate entries to increase performance.
Two more points in addition to the other answers:
usually you use JOIN for two or more tables in the database and you use FOR ALL ENTRIES IN to join database tables with a table you have in memory. If you can, JOIN.
usually the IN operator is more convinient than FOR ALL ENTRIES IN. But the kernel translates IN into a long select statement. The length of such a statement is limited and you get a dump when it gets too long. In this case you are forced to use FOR ALL ENTRIES IN despite the performance implications.
With in-memory database technologies, it's best if you can finish all data and calculations on the database side with JOINs and database aggregation functions like SUM.
But if you can't, at least try to avoid accessing database in LOOPs. Also avoid reading the database without using indexes, of course.

How can I optimize a dynamic search query in Oracle

I am writing a stored procedure to perform a dynamic search that spans 10+ database tables. With millions of records in each table and a dynamic set of search parameters*, I am having some trouble optimizing the procedure.
Is there a "best practice" for building these kinds of queries? E.g. Use strings to build a dynamic query, use a huge list of IF THEN .. ELSE statements, etc? Can anyone provide a simple example or point me to some literature that will help? Here's some psuedocode for the stored procedure I am developing, which accepts a collection of parameters and a ref cursor.
v_query = "SELECT .....";
v_name = ... -- retrieve "name" parameter from collection
if v_name is not null then
v_query := v_query || ' AND table.Name = ' || v_name;
end if;
open search_cursor for v_query;
...
*By "dynamic set of search parameters," I mean that I pass in a collection of parameters. I figured this would be easier than making the caller pass in 20 parameters if they only want to search on one.
There are problems with using the static query approach; also be very careful about using the CURSOR_SHARING=FORCE option - it can really raise hell with your system if you haven't done a coverage test to ensure that all your other queries will work the way you want.
Problems with static queries:
The (x is null or x = col) predicates tend to kill any chance of using indexes. Since the query plan is computed at the time query is parsed the first time, the indexes you use will be based on the values for the first run of the query; later runs, which may not constrain on the same columns, will still use the same indexes.
Having one static statement with substitution variables will prevent the optimizer from making an intelligent choice about which index to use based on the data distribution. In a dynamic query (or in the first run of a query with bind variables), Oracle will see how selective your constraint is; a highly selective constraint will become a prime candidate for index use. For example, if your table had a row for every person in the U.S., STATE='Alaska' will be much more likely to use the index on STATE than STATE='California'.
Of course, in both these cases, if the dynamic columns in your WHERE clause are not indexed anyway, it doesn't matter, although I'd be surprised if that were the case in a database the size you're talking about.
Also, consider the real cost of all that hard parsing. Yes, hard parses serialize system resources, which makes them expensive, but only in the context of high volume queries. By their nature, ad-hoc queries do not get run very often. The cost you pay for all the hard parses you incur in an entire day will likely be hundreds of times less than the cost of a single query that uses the wrong indexes.
In the past, I've implemented these systems pretty much like you've done here - a base query portion, then iterating over a constraint list and adding WHERE clause predicates. I don't think it's hard for someone to maintain or understand, especially if you're talking about constraints that don't involve adding a lot of subqueries or extra tables to the FROM clause.
One thing to consider: If this system is primarily an offline one (in other words, not constantly being updated or inserted into - populated by periodic loads of bulk data), you may want to look into using BITMAP indexes. Bitmap indexes differ from regular b-tree indexes in that multiple indexes on a single table can be used simultaneously, and bitmap indexes are much, much smaller on disk than b-trees. They work very well for applications like this - where you will have a variety of constraints that can't be defined at design time. You will only want to put bitmap indexes on columns that have relatively few distinct values - say, one value constitutes no less than 1/1000 of the table - so don't use bitmaps on unique columns.
However, the downside is that bitmap indexes will noticeably degrade the performance of inserts and updates. The best practice for bitmaps is to use them in data warehouse applications, and they are dropped prior to loads and recreated afterwards.
Except in very particular cases, I don't think it is advisable (or even possible) to try to generate an optimized query. My advice is not to use dynamic SQL if you can : hard to read, hard to debug, hard to optimize, hard to maintain.
First, write a generic query that will work with any parameter sent to your procedure. According to your example, that would give something like :
SELECT * FROM table WHERE ((v_name IS NULL) OR (table.Name=v_name));
As you see, you could easily add other parameters to this query without using dynamic SQL. This query is much easier to read and debug. Ask your DBA for optimization tips.
Then, if you have a particular set of parameters that you know are often passed together, you could write a particular query for this set that you could specifically optimize. Pseudocode :
IF particular_set
THEN
/* Specific query */
ELSE
/* Generic query */
END IF;
The difficult part is to try not to have too many specific queries here, or you could fall into a maintenance hell.
We've had a similar requirement for one of our clients. They have half a dozen tables with millions of rows, and they wanted adhoc search capability on most of the columns.
The solution was a separate package for each table, which would take the search criteria and construct the SQL to run the search. We took advantage of the old system that was being replaced, to discover what the most common types of searches the users were doing, and made sure that those searches ran the best, by tuning the queries that were being generated (supported by the strategic use of indexes). Because each package was only responsible for queries against one table, it could have specific code designed to work with that table (including the odd hint, in a few rare cases).
One question/problem that you'll need to address is, do you hard-code the criteria (e.g. WHERE SURNAME='SMITH') or use bind variables? Using bind variables reduces hard parsing, which reduces load on the database server; however it can be impractical to use bind variables when the SQL is dynamically generated. The way we ended up going was to set CURSOR_SHARING=FORCE (which has its own disadvantages) which was a reasonable compromise in our case.
Read http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6711305251199

Resources