How to do pagination in ClickHouse

Can you please suggest how I can do pagination in ClickHouse?
For example, in Elasticsearch I run an aggregation query like the one below. Elasticsearch takes a partition number and a partition size as parameters and returns the result. Say we have 100 records in total: if we give a partition size of 10 and partition number 2, we get the latest records 11-20.
How can we do this in ClickHouse, considering that data is continuously being inserted into the table?
SearchResponse response = elasticClient.prepareSearch(index)
        .setTypes(documentType)
        .setQuery(boolQueryBuilder)
        .setSize(0)
        .addAggregation(AggregationBuilders.terms("unique_uids")
                .field(Constants.UID_NAME)
                .includeExclude(new IncludeExclude(partition, numPartitions))
                .size(Integer.MAX_VALUE))
        .get();

According to the documentation, the common SQL syntax for LIMIT and OFFSET will work:
LIMIT n, m allows you to select the first m rows from the result after skipping the first n rows. The LIMIT m OFFSET n syntax is also supported.
https://clickhouse.yandex/docs/en/query_language/select/#limit-clause
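For the example from the question, a minimal sketch of fetching page 2 with page size 10 could look like this (my_table and uid are hypothetical stand-ins for your table and the aggregated field):

SELECT uid
FROM my_table
GROUP BY uid
ORDER BY uid           -- a stable ORDER BY is needed for deterministic pages
LIMIT 10 OFFSET 10     -- page 2 of size 10: records 11-20

Because data keeps being inserted, rows can shift between pages across queries; ordering by something stable (and, if needed, fixing an upper time bound captured when paging starts) keeps pages consistent.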

I think you're wanting to select only a subset of the result set? I haven't needed to do this yet, but it seems you could specify the format you want ClickHouse to return the data in (https://clickhouse-docs.readthedocs.io/en/latest/formats/index.html) and go from there. For instance, select one of the JSON formats shown in the documentation above and then extract the subset of results appropriate for your situation from the JSON response.
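As a sketch of that idea (again with hypothetical names), LIMIT/OFFSET can be combined with a JSON output format so the client parses a structured response:

SELECT uid
FROM my_table
ORDER BY uid
LIMIT 10 OFFSET 10
FORMAT JSON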

Related

Faster select when filtering with second or third sorted column

We have a time series table with the following definition
CREATE TABLE timeseries.mytable
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip,ts,src_ip)
SETTINGS index_granularity = 8192
SELECT count(*) FROM timeseries.mytable;
# Elapsed: 0.004 sec. The table has 383M records.
SELECT count(*) FROM timeseries.mytable WHERE dst_ip = 'a.b.c.d';
# Elapsed: 0.085 sec.
SELECT count(*) FROM timeseries.mytable WHERE src_ip = 'a.b.c.d';
# Elapsed: 53.031 sec.
As can be seen above, filtering the data using the first sorted column (dst_ip) is very quick.
How can I make the select using the third sorted column (src_ip) faster?
Some remarks:
the third query (WHERE src_ip = 'a.b.c.d') is slow because the index is not used and ClickHouse falls back to a full scan. There is no good way to make it faster short of redesigning the primary key or, if the query only computes aggregates, adding an AggregatingMergeTree table
the use cases you provided look artificial, because counting rows over the whole dataset is not a key use case for time-series data. Why is the result not restricted by dst_ip and ts?
consider the ClickHouse AggregatingMergeTree approach when you need to calculate aggregated values (such as count in your case)
designing the primary key requires understanding how ClickHouse uses it in query optimization (see Primary Keys and Indexes in Queries, More secrets of ClickHouse Query Performance); the documentation recommends using a monotonic index
to choose the best index, run a series of tests to find the one that best fits your concrete use cases
I would suggest the following primary keys:
/* Remove the date column (use with caution: it makes all date-range queries with a range smaller than daily much slower). */
ORDER BY (dst_ip, src_ip)
/* Define the granularity of the date. Instead of toStartOfHour, any interval smaller than 'daily' can be used (where daily is defined by the partition key). */
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
/* Move the date to the first position (it makes queries with a date range but without dst_ip faster, and gains the monotonic-index advantages). */
ORDER BY (toStartOfHour(ts), dst_ip, src_ip)
For each primary key, you also need to choose the most effective index_granularity value; a sketch of the second suggestion follows below.
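As a sketch only, not a definitive redesign, the second suggested key applied to the table from the question would look like this (mytable_v2 is a hypothetical name; MergeTree allows expressions such as toStartOfHour(ts) in the sorting key, and existing data would have to be reinserted, e.g. with INSERT INTO ... SELECT):

CREATE TABLE timeseries.mytable_v2
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
SETTINGS index_granularity = 8192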
As of 2022, the solution is to use a Data Skipping Index https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes for src_ip, for example:
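A minimal sketch (the index name and parameters are illustrative; a bloom_filter is a common choice for equality filters on a high-cardinality String column):

ALTER TABLE timeseries.mytable
    ADD INDEX src_ip_idx src_ip TYPE bloom_filter(0.01) GRANULARITY 4;
-- build the index for parts that already exist
ALTER TABLE timeseries.mytable MATERIALIZE INDEX src_ip_idx;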
You should try testing with different orders in the ORDER BY clause depending on the value cardinality of your columns. In this case, try moving src_ip before ts in the ORDER BY clause.
In the MergeTree engine, rows are sorted by the ORDER BY keys within each partition.
After that, you can decide the final arrangement of columns in the ORDER BY clause depending on how your application will query the data most of the time.
You can find a similar discussion here.

Adding Index To A Column Having Flag Values

I am a novice at tuning Oracle queries, so I need some help.
If I have a SQL query like:
select a.ID,a.name.....
from a,b,c
where a.id=b.id
and ....
and b.flag='Y';
then will adding an index to the FLAG column of table b help to tune the query in any way? The FLAG column has only two values, Y and N.
With a standard B-tree index, the SQL engine can find the row or rows in the index for the specified value quickly thanks to the index's ordered tree structure, then use the physical address (the rowid) stored in the index to access the desired row in a second hop. It's like looking in the index of a book to find the page number. So that is:
Go to index with the key value you want to look up.
The index tells you the physical address in the table.
Go straight to that physical address.
That is nice and quick for something like a unique customer ID. It's still OK for something nonunique, like a customer ID in a table of orders, although the database has to go through the index entries and for each one go to the indicated address. That can still be faster than slogging through the entire table from top to bottom.
But for a column with only two distinct values, you can see that it is going to be more work going through all of the index entries for 'Y' for example, and for each one going to the indicated location in the table, than it would be to just forget the index and scan the whole table in one shot.
That's unless the values are unevenly distributed. If there are a million Y rows and ten N rows then an index will help you find those N rows fast but be no use for Y.
Adding an index to a column with only 2 values normally isn't very useful, because Oracle might just as well do a full table scan.
From your query it looks like it would be more useful to have an index on id, because that would help with the join a.id=b.id.
If you really want to get into tuning, then learn to use EXPLAIN PLAN, as that will give you some indication of how much work Oracle needs to do for a query. Add (or remove) an index, then rerun the explain plan, for example:
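A minimal sketch of that workflow (the index name is illustrative; the query follows the one in the question):

-- add the index under test
CREATE INDEX b_flag_idx ON b (flag);
-- ask Oracle for its plan
EXPLAIN PLAN FOR
select a.ID, a.name
from a, b
where a.id = b.id
and b.flag = 'Y';
-- display the plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);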

Why is a paginated query slower than a plain one with Spring Data?

Given I have a simple query:
List<Customer> findByEntity(String entity);
This query returns 7k records in 700ms.
Page<Customer> findByEntity(String entity, Pageable pageable);
This query returns 10 records in 1080ms. I am aware of the additional count query for pagination, but still something seems off. Also, one strange thing I've noticed: if I increase the page size from 10 to 1900, the response time is exactly the same, around 1080 ms.
Any suggestions?
It might indeed be the count query that's expensive here. If you insist on knowing about the total number of elements matching in the collection there's unfortunately no way around that additional query. However there are two possibilities to avoid more of the overhead if you're able to sacrifice on information returned:
Using Slice as return type — Slice doesn't expose a method to find out about the total number of elements, but it allows you to find out whether a next slice is available. We avoid the count query here by reading one more element than requested and using its (non-)presence as an indicator of the availability of a next slice (see the sketch after this list).
Using List as return type — That will simply apply the pagination parameters to the query and return the window of elements selected. However it leaves you with no information about whether subsequent data is available.
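The Slice trick corresponds roughly to the following SQL sketch (the customer table and the entity and id columns are assumptions derived from the repository method; the exact LIMIT/OFFSET syntax varies by database):

-- page 3 with page size 10: request pageSize + 1 rows instead of issuing a count
SELECT *
FROM customer
WHERE entity = ?
ORDER BY id
LIMIT 11 OFFSET 20;
-- if 11 rows come back, a next slice exists; only the first 10 are handed to the caller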
A method with pagination runs two queries:
1) select count(e.id) from Entity e //to get number of total records
2) select e from Entity e limit 10 [offset 10] //'offset 10' is used for next pages
The first query runs slowly on 7k records, IMHO.
The upcoming Ingalls release of Spring Data will use an improved algorithm for paginated queries (more info).
Any suggestions?
I think using a paginated query for just 7k records is useless. You should simply limit the results.

Scan on DynamoDB table or Query on a global secondary index or a local index (what's the best solution?)

I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, "Daily Points" and "TimeSpendInTheApp". Now I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local index on points can be a costly operation, as I have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is this: scan the whole table with a filter expression for records above a certain number of points (5000 for example). After this scan, if 50 or more records are found, simply sort the values and take the top 50. If the scan returns no or very few results, reduce the filter threshold (3000 for example) and scan again. If a threshold (2500 for example) returns too many records, like 5000 or more, increase it again. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as it offers no solution for aggregation or sorting.
I would export the data to Hive and run the queries there rather than writing code to determine the result, especially as it is a batch executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:-
-- ORDER BY gives a total ordering (SORT BY only sorts within each reducer); LIMIT 50 keeps the top 50
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;
Hive Reference

Access - Get the total of a column in a select query

I have an Access database set up that takes a bunch of raw data, splits things up in different 'select' queries and pipes the results into various CSV files, where a dashboard set up in Excel will pick it up.
There's some data that I'm trying to calculate in Access: namely, I have a quantity field, and I need to calculate the percentage for each record. In other words, quantity / total of quantity.
Using my rather limited Access abilities, I tried the following query:
SELECT [Sales].*, [Quantity] / Sum([Quantity]) AS QuantityPercent FROM [Sales];
Which comes up with an error:
Your query does not include the specified expression 'company_name' as part of an aggregate function.
Company_name is the first field of the table, and after some Googling and Binging, I'm still quite confused as to what it means in this context.
To sum it up, my question is this: Is there a way to calculate data based off the total of a column/field?
The easy method is to use DSum:
SELECT
[Sales].*,
[Quantity] / DSum("[Quantity]", "[Sales]") AS QuantityPercent
FROM
[Sales];
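If you prefer to avoid the domain function, a scalar subquery achieves the same result in Access SQL; a sketch over the same table:

SELECT
[Sales].*,
[Quantity] / (SELECT Sum([Quantity]) FROM [Sales]) AS QuantityPercent
FROM
[Sales];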
