Is there any way to do clickhouse index and partition analysis - clickhouse

In my limited experience with ClickHouse, I need to analyse a table's indexes and partitions. Is there any way to find out which index and which partition a ClickHouse query uses?
Thanks in advance.

With the console client you can use set send_logs_level='debug';. After this, for every query you will see info about the query execution pipeline. This is very helpful for analysing partition, index, memory and filter efficiency.
Example log for query:
2021.03.30 16:17:13.716531 [ 3784 ] {db566a26-e6a8-4a15-ac48-f7b8e2d09652} <Debug> (SelectExecutor): Selected 1607 parts by date, 1607 parts by key, 2265893 marks by primary key, 2265893 marks to read from 1607 ranges
This shows that the query will read from 1607 parts.
If you need to compare different variants of queries or table structures, look at system.query_log; it has lots of information that can be compared more precisely than query execution time alone.
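A minimal sketch of both approaches (the table and column names in the first query are assumptions for illustration):
set send_logs_level='debug';
select count() from hits where event_date >= '2021-03-01';  -- the debug log reports parts and marks selected

-- compare query variants afterwards via the query log
select query, read_rows, read_bytes, query_duration_ms
from system.query_log
where type = 'QueryFinish'
order by event_time desc
limit 10;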

Related

How does one run compute stats on a subset of columns from a hive table using Impala?

I have a very long and wide Hive table that is taking an exorbitant amount of time to return query results. I attempted a 'compute stats' on the table, but due to the width of the table this operation often times out. I was therefore wondering if there is a way to run 'compute stats' on select columns? Documentation on the Cloudera website suggests that it is possible, but the syntax does not work.
Here is what I've tried but to no avail. Please advise as these all result in syntax errors.
COMPUTE STATS database.table field1
COMPUTE STATS database.table field1, field2
COMPUTE STATS database.table (field1, field2)
After further research, it was confirmed that the syntax is, in fact, correct but the column list parameter for compute stats was not made available until CDH 5.15.x. I am leaving this here in case anybody comes across the same issue.
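For reference, the working form on CDH 5.15+ is, as far as I can tell, the parenthesized column list (the third variant above):
COMPUTE STATS database.table (field1, field2);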
Hive works by creating jobs that run in a different engine (originally MapReduce, which can be rather slow) and the underlying engine can be changed.
Rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce.
Newer versions of Hive also support an architecture called LLAP (Live Long And Process) which caches metadata similarly to Impala, reducing query latency.
You may want to test some typical queries against your own tables to see if one of these works better for you than Impala for interactive and ad-hoc queries.
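As a sketch, on clusters where Tez or Spark is installed, the engine can usually be switched per session (availability depends on your distribution and configuration):
-- run subsequent queries on Tez (requires Tez to be installed and configured)
SET hive.execution.engine=tez;
-- or on Spark, if the Hive-on-Spark integration is set up
SET hive.execution.engine=spark;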
UNDERSTANDING EXECUTION PLANS
To get a true grasp on what causes a query to take a long time, you need to understand what operations Hive or Impala will perform when it executes a query.
To find this out, you can view the execution plan for a query.
The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.
To see an execution plan for a query, you can do this:
Prefix the query with the keyword EXPLAIN, then run it.
Execution plans can be long and complex.
Fully understanding them requires a deep knowledge of MapReduce.
The execution plans provided by Hive and by Impala look slightly different, but at a basic level, they provide more or less the same information.
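For example, to see the plan for a simple aggregation (placeholder names):
EXPLAIN
SELECT field1, COUNT(*)
FROM dbname.tablename
GROUP BY field1;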
Hive explain plan understanding
TABLE AND COLUMNS STATISTICS
The SQL engines you use do a certain amount of optimizing of the queries on their own—they look for the best way to proceed with your query, when possible.
When the query uses joins, the optimizers can do a better job when they have table statistics and column statistics.
For the table as a whole, these statistics include the number of rows, the number of files used to store the data, and the total size of the data.
The column statistics include the approximate number of distinct values and the maximum and average sizes of the values (not the maximum or average value, but rather the size used in storage).
The optimizers use this information when deciding how to perform the join tasks.
Statistics also help your system prevent issues due to memory usage and resource limitations.
These statistics are not calculated automatically; you have to trigger the calculation manually using a SQL command.
Once statistics are computed, both Hive and Impala can use them, though if you compute them in Hive, you need to refresh Impala's metadata cache.
If you make any changes to the table, such as adding or deleting data, you'll need to recompute the statistics.
Both Hive and Impala can use the statistics, even when they were calculated by the other engine.
However, when you have both Impala and Hive available, Cloudera recommends using Impala's COMPUTE STATS command to calculate and view the statistics.
The method for Hive is a bit more difficult to use.
If you do use Hive, you must refresh Impala's metadata cache for the table if you want Impala to use the statistics.
Statistics in Impala
Impala's syntax for calculating statistics for a table (including statistics for all columns) is COMPUTE STATS dbname.tablename;
If the table is in the active database, you can omit dbname. from the command.
To see the statistics in Impala, run SHOW TABLE STATS dbname.tablename; or
SHOW COLUMN STATS dbname.tablename;
Note: If the statistics have not yet been computed, #Rows for the table shows -1.
The #Nulls statistics for each column will always be -1;
old versions of Impala would calculate this statistic, but it is not used for optimization, so newer versions skip it.
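Putting this together for a hypothetical table sales.orders (names assumed for illustration):
-- compute table and column statistics in Impala
COMPUTE STATS sales.orders;

-- inspect the results
SHOW TABLE STATS sales.orders;
SHOW COLUMN STATS sales.orders;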
Statistics in Hive
Hive's syntax for calculating statistics for a table is ANALYZE TABLE dbname.tablename COMPUTE STATISTICS;
If the table is in the active database, you can omit dbname. from the command.
To calculate column statistics, add FOR COLUMNS at the end of the command.
To see the table statistics in Hive, run DESCRIBE FORMATTED dbname.tablename;
The Table Parameters section will include numFiles, numRows, rawDataSize, and totalSize.
To see the statistics for a column, include the column name at the end:
DESCRIBE FORMATTED dbname.tablename columnname;
You can only display column statistics one column at a time.
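Taken together, a sketch for the same hypothetical sales.orders table (order_total is an assumed column name):
-- table-level statistics in Hive
ANALYZE TABLE sales.orders COMPUTE STATISTICS;
-- column-level statistics
ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR COLUMNS;

-- view the statistics
DESCRIBE FORMATTED sales.orders;
DESCRIBE FORMATTED sales.orders order_total;
-- if Impala should use these stats, refresh its metadata cache afterwards, e.g. REFRESH sales.orders; in Impala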

Hive Bucketing - How to run hive query for specific bucket

I have a Hive query which reads 5 large tables and outputs the records to the next process. All these tables are partitioned on proc_dt and bucketed on user_id (5 buckets). Joins are done on user_id and filtering on proc_dt.
How can I run this query for a specific bucket of all the tables? For example, I want to run the query for just the first bucket of all tables.
The reason behind doing this is that once I complete the query for the first bucket, I can send the output data to the next process. While the next process is running I can complete the query for the next bucket, and so on. This way the next process is not waiting for the entire query to finish.
If I had one more column containing the user ID mod 5, then I would have gone for partitioning. But there is no such column and I cannot add it.
Could anyone please give me some solution for this. Any suggestions will be really helpful.
I got the answer for it. You can mention the bucket number in the join query; check the link below for more detail, and the sketch after it.
https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/
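For what it's worth, a sketch of reading a single bucket with Hive's TABLESAMPLE clause (the table name mytable and the partition value are assumptions; the tables in the question have 5 buckets on user_id):
-- read only the first of the 5 user_id buckets for one partition
SELECT *
FROM mytable TABLESAMPLE(BUCKET 1 OUT OF 5 ON user_id) t
WHERE proc_dt = '2016-01-01';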
You can specify partitions within query statements but not buckets. Buckets are used for optimization purposes - e.g. faster sampling and map-side joins - but they are not visible to SQL statements.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
So here is the documentation example:
CLUSTERED BY(user_id) INTO 256 BUCKETS;
This clearly does not permit access to individual buckets by value/name.

How to improve performance with MonetDB on OSX?

I am using monetdb on a 16GB Macbook Pro with OSX 10.10.4 Yosemite.
I execute queries with SQLWorkbenchJ (configured with a minimum of 2048M RAM).
I find the performance overall erratic:
performance is acceptable / good with small size tables (<100K rows)
abysmal with tables with many rows: a query with a join of two tables (8670 rows and 242K rows) and a simple sum took 1H 20m!!
My 16GB of memory notwithstanding, in one run I never saw MSERVER5 using more than 35MB of RAM, 450MB in another. On the other hand, most of the time is consumed swapping data to disk (over 160GB of data according to Activity Monitor!).
There are a number of performance-related issues that I would like to understand better:
I have the impression that MonetDB struggles with understanding how much RAM to use / is available in OSX. How can I "force" MonetDB to use more RAM?
I use MonetDB through R. The MonetDB.R driver converts all the character fields into CLOB. I wonder if CLOBs create memory allocation issues?
I find it difficult to explain the many GBs of writes (as mentioned, >150GB!!) even for index creation or temporary results. On the other hand, when I create the DB and load the tables, the whole DB is <50MB. Should I create an artificial integer key and set it as index?
I join 2 tables on a timestamp field (e.g. "2015/01/01 01:00") that again is seen as a text CLOB by MonetDb / MonetDb.R. Should I just convert it to integer before saving it to MonetDb?
I have configured each table with a primary key, using a field of type integer. MonetDB (as a typical columnar database) doesn't need the user to specify an index. Is there any other way to improve performance?
Any recommendation is welcome.
For clarity the two tables I join have the following layout:
Calendar # classic calendar table with one entry per hour in a year = 8760 rows
Fields: datetime, date, month, weekbyhour, monthbyday, yearbyweek, yearbymonth # all fields are CLOBs as mentioned
Activity # around 200K rows
Fields: company, department, subdepartment, function, subfunction, activityname, activityunits, datetime, duration # all CLOBs except activityunits; datetime refers to when the activity has occurred
I have tried various types of join syntax, but an example would be (* used for brevity):
select * from Activity as a, Calendar as b where a.datetime=b.datetime
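For reference, a minimal sketch of the alternative the question raises: the same join with the key declared as TIMESTAMP instead of CLOB (table and column names follow the question, other columns omitted, types assumed):
-- join key stored as TIMESTAMP rather than CLOB
CREATE TABLE calendar (datetime TIMESTAMP, monthbyday VARCHAR(32));
CREATE TABLE activity (datetime TIMESTAMP, activityname VARCHAR(128), activityunits INTEGER);

SELECT a.activityname, SUM(a.activityunits) AS total_units
FROM activity AS a
JOIN calendar AS b ON a.datetime = b.datetime
GROUP BY a.activityname;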

Cassandra query flexibility

I'm pretty new to the field of big data and am currently stuck on a fundamental decision.
For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well (single data center, 4 nodes).
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. The log entries should be queried by different time ranges in the first instance. As recommended, I started to "model my query" with a big-row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional detail:
column keys are a composition of timestamp+uuid, to be unique and to avoid overwrites;
log entries for a specific time are stored next to each other on a node thanks to their identical partition key;
Thus log entries are stored in short time intervals per row, for example every log entry for 2015-01-01 01:05 with minute precision. Queries are not really performed as range queries with a < operator; rather, entries are selected as blocks of a specified minute.
Range-based queries succeed with a decent response time, which is fine for me.
Question:
In the next step we want to gain additional informations by queries, which are mainly focused on the IP field. For example: select all the entries which have IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.
So obviously the current model is not really usable for additional IP-focused CQL queries. The problem is not finding a possible solution, but rather choosing among the various technologies that could provide one:
Try to solve the problem with standalone C* solutions. (Build a second model and administer the same data in a different shape)
Choose additional technologies like Spark...
Switch to HDFS/Hadoop - Cassandra/Hadoop solution...
and so on
With my lack of knowledge in this field, it is pretty hard to find the best way to take, especially with the feeling that using a cluster computing framework would be an excessive solution.
As I understood your question, your table schema looks like this:
create table logs (
    minute timestamp,
    id timeuuid,
    ips list<text>,
    message text,
    primary key (minute, id)
);
With this simple schema, you:
can fetch all logs for a specific minute.
can fetch short inter-minute ranges of log events.
want to query dataset by IP.
From my point of view, there are multiple ways of implementing this idea:
create secondary index on IP addresses. But in C* you will lose the ability to query by timestamp: C* cannot merge primary and secondary indexes (like mysql/pgsql).
denormalize data. Write your log events to two tables at once, the first optimized for timestamp queries (minute+ts as PK), the second for IP-based queries (IP+ts as PK); see the sketch below.
use spark for analytical queries. But spark will need to perform a (full?) table scan (in a nifty distributed map-reduce way, but nevertheless it's a table scan) each time to extract all the data you've requested, so all your queries will require a lot of time to finish. This approach can cause problems if you plan to have a lot of low-latency queries.
use an external index like Elasticsearch for querying, and C* for storing the data.
In my opinion, the C* way of doing such things is to have a set of separate tables for different queries. It gives you the ability to perform blazing-fast queries (but with an increased storage cost).
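As a minimal sketch of option 2 (denormalization), assuming the two IPs are stored as separate columns rather than a list (this schema is an assumption, not the asker's):
create table logs_by_ip (
    ip1 text,
    ip2 text,
    id timeuuid,
    minute timestamp,
    message text,
    primary key ((ip1, ip2), id)
);

-- the application writes every event to both tables; IP queries then become:
select * from logs_by_ip where ip1 = '10.10.10.1' and ip2 = '192.10.10.1';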

Using Partitioning and Indexing on Same Column in Oracle - is there any benefit?

We have a database design with a table that has 1-day interval partitioning on a column named "5mintime", and we have also created an index on the same column.
The "5mintime" column can hold values such as 1-Mar-2011 or 2-Mar-2011; in short, there is no time component in it, and from the UI the user can select only a one-day period as the minimum date range.
My question is: when firing select queries, is there any advantage gained from the index, given that the partitioning is already there? On the flip side, if I remove the index, inserts will become faster. Any help on this would be greatly appreciated.
If I understand you right, then I think there's no need for the index:
A local index is indexed per partition, which in your case has the same value in all rows (i.e. 1-Mar-2011 in the 1-Mar-2011 partition, 2-Mar-2011 in the 2-Mar-2011 partition, and so on).
A global index will actually index the whole table, but a lookup will still only ever find a whole partition's worth of rows, which is also not useful since you already have the partitions...
But, why not check it?
If each day's data goes into its own partition and you can never search within days, but only for entire days worth of data, then, no, I don't see this index adding any value.
You can confirm whether or not SQL queries are using this index by enabling monitoring:
alter index myindex monitoring usage;
And then check to see if it's been used by querying v$object_usage for it some time later.
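A sketch of the monitoring workflow (the index name is taken from the answer; run the check after a representative workload has executed):
-- enable usage monitoring on the index
alter index myindex monitoring usage;

-- some time later, see whether any query has used it
select index_name, table_name, monitoring, used
from v$object_usage
where index_name = 'MYINDEX';

-- turn monitoring off again when done
alter index myindex nomonitoring usage;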

Resources