ClickHouse: ReplacingMergeTree doesn't guarantee absence of duplicates

According to the documentation, the ReplacingMergeTree engine does not guarantee the absence of duplicates. Does that mean I must not count on this engine and have to use my own method to control duplicates?

Yes, it does mean that you must not count on this engine.
For small ReplacingMergeTree tables you can use the FINAL keyword.
select ... from SomeReplacingMergeTreeTable FINAL
https://clickhouse.tech/docs/en/sql-reference/statements/select/from/#select-from-final
Alternatively, use argMax with GROUP BY, or ORDER BY ... LIMIT BY ....
But the best solution is to redesign the dataflow and avoid duplicates.
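A minimal sketch of these options, assuming a hypothetical table events(id, value, updated_at) declared with ReplacingMergeTree(updated_at):

```sql
-- Option 1: FINAL merges duplicates at read time (acceptable for small tables)
SELECT * FROM events FINAL;

-- Option 2: argMax keeps the value from the row with the latest updated_at per id
SELECT id, argMax(value, updated_at) AS value
FROM events
GROUP BY id;

-- Option 3: LIMIT BY keeps the first row per id after sorting
SELECT id, value
FROM events
ORDER BY id, updated_at DESC
LIMIT 1 BY id;
```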

Related

Does the presence of a single UPPER in CosmosDb SQL Queries force a full collection scan on partition?

Given the following SQL, ManufacturerIdUpperCase is the partition key, and a lower-cased value is passed as a hint to direct Cosmos to the correct partition. boat.OwnerIdUpperCase is an indexed property. Will Cosmos use the owner ID to narrow the scan to the subset of documents for this owner, or does the use of the other two UPPER calls require a full collection scan?
SELECT * FROM boat
WHERE boat.ManufacturerIdUpperCase = #ManufacturerId
AND UPPER(boat.Owner.Type) = UPPER(#OwnerType)
AND boat.OwnerIdUpperCase = #BoatOwnerId
AND UPPER(boat.BoatType) = UPPER(#BoatType)
I'm trying to decide whether I need to maintain a lower-cased copy of every property used in the various WHERE clauses, or whether doing so for just one of the remaining UPPER conversions on an indexed property would reduce the dataset enough that a scan is only required on the resulting subset, not the entire partition.
I've read the old posts like the one below, and run the SQL in the sandbox as proposed. In the simple scenario, I am seeing the same result as the author. However, my work scenario is more complex as described above.
DocumentDB: Performance impact of built-in string functions (like UPPER)
Victor, welcome to StackOverflow! I am from the Cosmos DB engineering team.
In this particular query, since all the filter predicates are intersections (ANDs), and not unions (ORs), Cosmos DB will narrow down the set of documents to evaluate and will not do a full scan. Please ensure that all 4 fields (/ManufacturerIdUpperCase, /Owner/Type, /OwnerIdUpperCase, /BoatType) are indexed (added as part of "includedPaths" in the indexingPolicy).
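A sketch of such an indexingPolicy, assuming the field names from the question (the exact policy depends on your container; by default Cosmos DB indexes all paths):

```json
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/ManufacturerIdUpperCase/?" },
    { "path": "/Owner/Type/?" },
    { "path": "/OwnerIdUpperCase/?" },
    { "path": "/BoatType/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}
```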

Is 'distinct' an ordinary operation for ClickHouse?

I would like to use ClickHouse for marketing. Most of the time they want to know not just HOW MANY people use some feature, but the exact emails to send spam to.
Is ClickHouse a good choice for such a purpose (select DISTINCT email from table where ...)? What is the difference in performance between 'select COUNT' and 'select DISTINCT'?
Is that a good choice to use ClickHouse for such purpose
Yes, ClickHouse has decent HashTable and Aggregator implementations. It heavily uses templated code for static type dispatching and applies a lot of memory tricks. And it stores data in a compact form.
I assume you'd like to compare select count and select count(distinct), as select distinct is a different beast. ClickHouse transforms count(distinct) into the aggregator uniqExact, which is about 8 times slower than count(*) but still much faster than traditional databases like Postgres. There are also approximate aggregators uniq, uniqCombined and uniqHLL12 for faster estimations, which are around 1.5 times slower than count(*). See https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/ if you need more info.
If your goal is select distinct, ClickHouse can still do it well: it uses a Set data structure to uniquify the data streams (a Set is also used for building its SQL IN (...) construct). Without measuring the data-output step, it's only about 1.3x slower than plain count(*).
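To illustrate the variants discussed above, on a hypothetical table users(email):

```sql
SELECT count() FROM users;                -- plain row count, fastest
SELECT count(DISTINCT email) FROM users;  -- rewritten into uniqExact internally
SELECT uniq(email) FROM users;            -- approximate count, much cheaper
SELECT uniqCombined(email) FROM users;    -- approximate, better accuracy/memory trade-off
SELECT DISTINCT email FROM users;         -- exact values, built with a Set
```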

Can we boost the performance of COUNT, DISTINCT and LIKE queries?

As far as I understand, when we run a SQL query with COUNT, DISTINCT or LIKE '%query%' (wildcards on both sides), indexes cannot be used and the database has to do a full table scan.
Is there some way to boost the performance of these queries?
Can they really not use indexes, or can we fix this somehow?
Can we make an index-only scan if we need to return only one column? For example, for select count(id) from MY_TABLE: if we have an index on 'id', perhaps we can do an index-only scan and avoid hitting the whole table?
My question is general: could you give me some performance guidelines for the mentioned operators?
UPDATE
I use PostgreSQL.
With PostgreSQL, you can create GIN pg_trgm indexes on text columns to make LIKE '%foo%' faster, though this requires the pg_trgm extension and PostgreSQL 9.1 or higher.
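A sketch of the pg_trgm approach, on a hypothetical docs(body) table:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index makes substring matching indexable
CREATE INDEX docs_body_trgm_idx ON docs USING gin (body gin_trgm_ops);

-- This LIKE with wildcards on both sides can now use the index:
SELECT * FROM docs WHERE body LIKE '%foo%';
```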
I doubt distinct by itself will ever use an index. I tried, in fact, and could not get it to use one. You can sort of force an index to be used with a recursive CTE that pulls individual records out (what can be called a "sparse scan"). We do something like this when pulling individual years out of the accounting records. This requires writing special queries, though, so it isn't really the general case.
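The recursive-CTE "sparse scan" (often called a loose index scan) can be sketched like this, assuming a table tbl with a b-tree index on col:

```sql
WITH RECURSIVE t AS (
    (SELECT col FROM tbl ORDER BY col LIMIT 1)            -- smallest value, via the index
    UNION ALL
    SELECT (SELECT min(col) FROM tbl WHERE col > t.col)   -- jump to the next distinct value
    FROM t
    WHERE t.col IS NOT NULL
)
SELECT col FROM t WHERE col IS NOT NULL;  -- one row per distinct col
```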
count(*) is never going to be able to use an index due to MVCC rules. You can get approximate results by looking in the appropriate system catalogs, however.
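For example, an approximate count from the catalogs, assuming a table named my_table:

```sql
-- reltuples is the planner's row estimate, refreshed by VACUUM / ANALYZE
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'my_table';
```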

performance with IN clause in postgresql

What are the performance implications of having something like this in your query:
... AND x.somfield IN (
33620,262,394,450,673,674,675,2331,2370,2903,4191,4687,5153,6776,6898,6899,7127,7217,7225,
7227,7757,8830,8889,8999,9036,9284,9381,9382,9411,9412,9423,10088,10089,10304,10333,10515,
10527,10596,10651,11442,12636,12976,13275,14261,14262,14382,14389,14567,14568,15792,16557,
17043,17459,17675,17699,17700,17712,18240,18370,18591,18980,19023,19024,19025,19026,19211,
19272,20276,20426,20471,20494,20833,21126,21315,21990,22168,22284,22349,22563,22796,23739,
24006,24321,24642,24827,24867,25049,25248,25249,25276,25572,25665,26000,26046,26646,26647,
26656,27343,27406,27753,28560,28850,29796,29817,30026,30090,31020,31505,32188,32347,32629
,32924,32931,33062,33254,33600,33601,33602,33603,33604,33605,33606,33607,33608,34010,34472,
35800,35977,36179,37342,37439,37459,38425,39592,39661,39926,40376,40561,41226,41279,41568,
42272,42481,43483,43867,44958,45295,45408,46022,46258) AND ...
Should I avoid this, or is it okay and fast enough?
Thanks
You certainly want to check the execution plan. Depending on data, it may or may not be "okay".
If the table is large enough, it's possible that PG converts that to "array contains" operation and decides not to use an index on it. This could lead to a seq scan (if you don't have other WHERE criteria on this table).
In some cases OR is better than IN, because it's executed as two index scans and combined. May not work in your case though, because you have so many values in there. Again, depends on data.
Unless your table is small, in such cases you usually need to rely on other criteria which are easily indexed, such as dates, states, "types" etc. Then this IN is merely a "recheck" filter on limited data.
If the query uses an index on x.somfield, it will be fast enough.
As mentioned, you should use EXPLAIN and EXPLAIN ANALYZE to really understand what's going on there.
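Two things worth trying (the table name x and column somfield follow the question; shortened value list for illustration):

```sql
-- 1. Inspect the plan the planner chooses for the IN list:
EXPLAIN ANALYZE
SELECT * FROM x WHERE x.somfield IN (33620, 262, 394, 450);

-- 2. For very long lists, a join against VALUES sometimes plans better
--    than a huge IN, since the planner treats it like a small table:
SELECT x.*
FROM x
JOIN (VALUES (33620), (262), (394), (450)) AS v(id)
  ON v.id = x.somfield;
```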

do's and don'ts for writing mysql queries

One thing I always wonder while writing a query is whether I am writing the most optimized query or not. I know certain things like:
1) using SELECT field1, field2 instead of SELECT *
2) giving proper indexes to the tables
but I am sure there are more things that should be kept in mind when writing queries, since most databases only grow, and an optimal query helps execution time. Can you share some tips and tricks on writing queries?
Testing is the best way to measure performance. Monitor your queries on the live database and make use of things like the slow query log.
I would also recommend enabling the query cache, which will give most typical usage situations a massive boost.
Use proper data types for your fields
Use back-tick character (`) for reserved keywords
When dealing with multiple tables, try using joins
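For instance, backticks around a reserved word and an explicit join (illustrative table names):

```sql
-- `order` is a reserved word in MySQL, so it needs backticks
SELECT `order`.id, c.name
FROM `order`
JOIN customer AS c ON c.id = `order`.customer_id;
```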
See: 20 SQL Tips
As well as the Do's and Dont's, you may find the Hidden Features of MySQL useful.
As a matter of fact, no "tips" can help you.
Database design requires deep knowledge, not tips.
Every "don't" carries a different weight. Most such lists dwell on the least important things and fail to mention the important ones. Your list, for example, reads as if it came from a culinary forum:
Always use a knife with a black handle
To prepare a good dish you need to choose proper ingredients.
The first is impressive but never helps in the real world.
The second is right, but must be backed with deep knowledge to get it right.
So it must be a book, not tips. The ones by Paul DuBois are among the recommended.
Use these fields in each table:
tablename_id (auto increment, unsigned zerofill)
created_by (timestamp)
tablerow_status (enum('t','f'), default 't')
Always add a comment when you create a field in MySQL (it helps when you search in phpMyAdmin).
Always take care of normalization forms.
If a field will always be positive, make it unsigned.
Use DECIMAL instead of FLOAT in some cases (a discount capped at 99.99% fits DECIMAL(5,2)).
Use the DATE and TIME data types where needed; don't use TIMESTAMP everywhere.
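One way the conventions above might look in a table definition (all names are illustrative):

```sql
CREATE TABLE product (
    product_id INT UNSIGNED ZEROFILL AUTO_INCREMENT PRIMARY KEY,
    created_by TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    discount DECIMAL(5,2) COMMENT 'discount percent, max 99.99',  -- decimal, not float
    tablerow_status ENUM('t','f') NOT NULL DEFAULT 't'
);
```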
Correlated subqueries are very bad but often not well understood, and they end up in production. They can often be fixed by using derived tables and a join instead.
http://en.wikipedia.org/wiki/Correlated_subquery
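A sketch of such a rewrite, using hypothetical employee/sale tables: the correlated form re-runs the subquery for every outer row, while the derived table is aggregated once.

```sql
-- Correlated subquery: executed once per row of employee
SELECT e.name,
       (SELECT MAX(s.amount) FROM sale s WHERE s.emp_id = e.id) AS best_sale
FROM employee e;

-- Equivalent derived table + join: the aggregation runs once
-- (LEFT JOIN keeps employees with no sales, matching the NULL above)
SELECT e.name, m.best_sale
FROM employee e
LEFT JOIN (SELECT emp_id, MAX(amount) AS best_sale
           FROM sale
           GROUP BY emp_id) AS m
  ON m.emp_id = e.id;
```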
One more thing I found today concerns the difference between COUNT(*) and COUNT(col):
Using COUNT(*) is faster than COUNT(col).
MyISAM tables cache the row count; InnoDB does not, so COUNT(*) without a WHERE clause may be slower.
For both MyISAM and InnoDB, it is better to count a NOT NULL column than a column where NULL is allowed.
More details here
