ClickHouse query with a LIMIT clause inefficiently reads too many rows

I'm querying ClickHouse with a query that has ORDER BY and LIMIT 1, where the ORDER BY matches the table's sort order. The query returns 1 row as expected; however, 50+ rows were scanned to return that result.
I would expect ClickHouse to scan only 1 row, since the ORDER BY follows the table's sort order. What's happening here, and what can I do to fix it?
SELECT * FROM comp_intel_scrapes
ORDER BY
    client_slug,
    client_hotel_id,
    argset_id,
    scrape_datetime,
    preferred_country,
    preferred_currency,
    adults,
    children,
    nights,
    min_checkin_date,
    max_checkin_date
LIMIT 1
----
Elapsed: 0.004s
Read: 54 rows (8.84KB)
By the way, ClickHouse Cloud (clickhouse.com) is being used here.

It depends on the table engine.
The primary index is sparse: https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-design/
Because of this, ClickHouse is unable to read less than one granule, which is ~8192 rows by default.
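If reading a whole granule per query is a real problem, one knob to consider (a hedged sketch, not part of the original answer; the column list and sorting key below are placeholders, not the real table definition) is a smaller index_granularity on the MergeTree table:

-- Sketch: a MergeTree table with smaller granules, so a LIMIT 1 query that follows
-- the sorting key reads fewer rows. Smaller granules mean a larger primary index,
-- so this is a trade-off rather than a general recommendation.
CREATE TABLE comp_intel_scrapes_small_granules
(
    client_slug     String,
    client_hotel_id UInt64,
    scrape_datetime DateTime
    -- ... remaining columns of the original table
)
ENGINE = MergeTree
ORDER BY (client_slug, client_hotel_id, scrape_datetime)
SETTINGS index_granularity = 1024;  -- default is 8192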

Related

Power BI count rows for all tables in one measure

In Power BI I would like to count the rows of all my tables and get this output:
Table Name    Row count
Table1        126
Table2        985
Table3        998
...
As long as I have only a few tables I can do
NEWTABLE = UNION(
    ROW("TableName", "Table1", "Rowcount", COUNTROWS(Table1)),
    ROW("TableName", "Table2", "Rowcount", COUNTROWS(Table2)),
    ...
)
But this starts to get complicated when I have many tables.
Is there a way I can do it? Like a loop or something?
Thank you
If you only need the metrics, then you can use DAX Studio -> View Metrics,
where cardinality is your "rowCounts".
If you need something more, then you can get all table names from a DMV:
select * from $SYSTEM.TMSCHEMA_TABLES
populate this as another table in your model, and use the M language to loop through it.
A useful example:
https://community.powerbi.com/t5/Power-Query/Power-query-Counting-rows-from-all-table-in-query-editor-but-not/td-p/1198489
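As a minimal sketch of the DMV step (run from DAX Studio or any client connected to the model; selecting only [Name] is my choice, not something the thread prescribes), you could pull just the table names and load the result as the table list to loop over:

-- List the model's table names from the TMSCHEMA DMV; the result can then be
-- populated as another table in the model.
select [Name] from $SYSTEM.TMSCHEMA_TABLES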

Efficient use of an index for a self-join with a group by

I'm trying to speed up the following
create table tab2 parallel 24 nologging compress for query high as
select /*+ parallel(24) index(a ix_1) index(b ix_2)*/
a.usr
,a.dtnum
,a.company
,count(distinct b.usr) as num
,count(distinct case when b.checked_1 = 1 then b.usr end) as num_che_1
,count(distinct case when b.checked_2 = 1 then b.usr end) as num_che_2
from tab a
join tab b on a.company = b.company
and b.dtnum between a.dtnum-1 and a.dtnum-0.0000000001
group by a.usr, a.dtnum, a.company;
by using indexes
create index ix_1 on tab(usr, dtnum, company);
create index ix_2 on tab(usr, company, dtnum, checked_1, checked_2);
but the execution plan tells me that it's going to be an index full scan for both indexes, and the calculations are very long (1 day is not enough).
About the data. Table tab has over 3 million records. None of the single columns is unique. The unique values here are pairs of (usr, dtnum), where dtnum is a date with time written as a number in the format yyyy,mmddhh24miss. Columns checked_1 and checked_2 have values from the set (null, 0, 1, 2). Company holds a company id.
Each (usr, dtnum) pair has exactly one value of checked_1, checked_2 and company, since the pair is unique. Each user can appear in multiple pairs with different dtnum.
Edit
#Roberto Hernandez: I've attached the picture with the execution plan. As for parallel 24, in our company we are told to create tables with options 'parallel [num] nologging compress for query high'. I'm using 24 but I'm no expert in this field.
#Sayan Malakshinov: http://sqlfiddle.com/#!4/40b6b/2 Here I've simplified by giving data with checked_1 = checked_2, but in real life this may not be true.
#scaisEdge:
For
create index my_id1 on tab (company, dtnum);
create index my_id2 on tab (company, dtnum, usr);
I get
For table tab your join condition is based on the columns
company, dtnum
so your index should primarily be based on these columns:
create index my_id1 on tab (company, dtnum);
The indexes you are using are useless because they don't contain, in the left-most positions, the columns used in the join/where condition.
Optionally you can add usr in the right-most position to avoid the need for table access and let the db engine retrieve all the information from the index values:
create index my_id1 on tab (company, dtnum, usr, checked_1, checked_2);
Indexes (bitmap or otherwise) are not that useful for this execution. If you look at the execution plan, the optimizer thinks the GROUP BY is going to reduce the output to 1 row, which results in serialization (PX SELECTOR). So I would question the quality of your statistics. What you may need is to create a column group on the three GROUP BY columns, to improve the cardinality estimate of the group by.
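A sketch of what creating that column group could look like (the table and column names follow the question; whether it actually improves the estimate depends on your data and statistics):

-- Create extended statistics (a column group) on the three GROUP BY columns,
-- then re-gather table statistics so the optimizer sees the combined cardinality.
select dbms_stats.create_extended_stats(user, 'TAB', '(USR, DTNUM, COMPANY)')
from dual;

begin
  dbms_stats.gather_table_stats(ownname => user, tabname => 'TAB');
end;
/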

SQLite SELECT with max() performance

I have a table with about 1.5 million rows and three columns. Column 'timestamp' is of type REAL and indexed. I am accessing the SQLite database via PHP PDO.
The following three selects run in less than a millisecond:
select timestamp from trades
select timestamp + 1 from trades
select max(timestamp) from trades
The following select needs almost half a second:
select max(timestamp) + 1 from trades
Why is that?
EDIT:
Lasse asked for an "explain query plan". I ran it from a PHP PDO query since I have no direct access to the SQLite3 command-line tool at the moment. I guess it does not matter; here is the result:
explain query plan select max(timestamp) + 1 from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SCAN TABLE trades (~1000000 rows)
explain query plan select max(timestamp) from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SEARCH TABLE trades USING COVERING INDEX tradesTimestampIdx (~1 rows)
The reason this query
select max(timestamp) + 1 from trades
takes so long is that once MAX() is wrapped in a larger expression, SQLite no longer applies its MIN()/MAX() index optimization; it falls back to a full table scan to find the maximum before adding one to it.
In the query
select timestamp + 1 from trades
you are doing a calculation for each record, but the engine only needs to scan the entire table once. And in this query
select max(timestamp) from trades
the engine does not need to scan the table at all: as your query plan shows, it satisfies the MAX() with a single lookup in the covering index on timestamp.
From the SQLite documentation:
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table.
I emphasized might from the documentation, because the optimization only applies when the bare MIN() or MAX() call is the whole result expression; for a query of the form SELECT MAX(x)+1 FROM table it appears a full table scan is still used, even though x is the left-most column of an index.
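One way around this (a sketch based on the behaviour described above, not taken from the original thread) is to keep the bare MAX() inside a scalar subquery, so the index optimization still applies, and add one outside it:

-- The inner query matches the documented single-MAX() pattern and can be answered
-- from the covering index; the +1 is then applied to that one value.
select (select max(timestamp) from trades) + 1;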

Performance issue in hive version 0.13.1

I use AWS EMR to run my Hive queries, and I have a performance issue while running Hive version 0.13.1.
The newer version of Hive took around 5 minutes to run over 10 rows of data, but the same script over 230,804 rows has been running for 2 days and is still going. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive>select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
In the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row of the 'b' table for every row of the 'a' table. So if 'a' has 5 rows and 'b' has 10 rows, you get the product: 5 multiplied by 10 = 50 rows returned. The BETWEEN predicate is then evaluated as a filter over that intermediate result rather than as a join condition.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to build an intermediate data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN check on those 10 billion rows.
So you will get more benefit from dropping the number of 'b' rows than the 'a' rows: in your example, if you can filter the ip_logs table (table 2), which I am guessing has more rows than your order-number table, it will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out which column your data model would allow joining on. Maybe the data model behind this query could be improved? It may just be me not reading the sample clearly.
Either way, you need to cut down the number of comparisons BEFORE the WHERE clause is applied. Other ways I have done this in Hive are to build a view over a smaller set of data and join/match against the view instead of the original table; one sketch of that idea follows below.
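One hedged sketch (the bucket size of 16777216 and the assumption that a start/end range never crosses a bucket boundary are mine, not from the question) is to derive a coarse bucket on both sides so the join gets an equality key, keeping BETWEEN as a filter:

-- Join on a coarse bucket so Hive can run a normal (non-cross) join, then apply
-- the range check. Assumes no startorderno/endorderno range spans two buckets.
SELECT b.itemcode
FROM (
    SELECT CAST(orderno AS BIGINT) AS orderno,
           FLOOR(CAST(orderno AS BIGINT) / 16777216) AS bucket
    FROM foo
) a
JOIN (
    SELECT startorderno, endorderno, itemcode,
           FLOOR(startorderno / 16777216) AS bucket
    FROM bar
) b
  ON a.bucket = b.bucket
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;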

Hive Count(DISTINCT column) versus SELECT COUNT(*) from (SELECT DISTINCT column)

There have been discussions and claims that query 2 is faster than query 1.
Query 1
SELECT COUNT(DISTINCT A) FROM TAB_X;
Query 2
SELECT COUNT(*) FROM (SELECT DISTINCT A FROM TAB_X) t;
I fail to understand exactly why that is so.
This is my understanding of how these queries would be converted into map-reduce jobs behind the scenes.
Query 1
- Only one stage.
- The mappers emit column A as the key and 1 as the value. **Is this correct? How is DISTINCT achieved?**
- There would be only one reducer, which just has to increment a counter for every key (and its list of values) that it gets. However, I am not sure how that single reducer knows when to emit the final count (**how does it know when to emit eventually?**).
Query 2
- Two stages.
- Stage 1
- The mappers emit column A as the key and 1 as the value.
- There will be a lot of reducers, which can aggregate the results for each key and emit the result for that key (which is a value of column A).
- Stage 2
- The mappers get each distinct value (each user) from stage 1 and emit the same key for all of them, with 1 as the value.
- The reducers just sum these counts and emit the final result.
Can you please help answer my questions inline for query 1 and confirm my understanding of query 2?
