AggregatingMergeTree order by column not in the sorting key - ClickHouse

What are some options to have AggregatingMergeTree merge by one column but stay ordered by a column that's not in the sorting key?
My application is similar to Zendesk tickets. A ticket has a category, status, and ID. The application emits ticket status change events to ClickHouse, and I'm calculating statistics on the time it took a ticket to close after it was created, over some time range R, grouped by some time period P.
For example, events look like this
{
  "ticket": "A",
  "event_time": "2022-12-08T15:00:00Z",
  "category": "bug",
  "status": "created"
},
{
  "ticket": "A",
  "event_time": "2022-12-08T15:30:00Z",
  "category": "bug",
  "status": "reviewing"
},
{
  "ticket": "A",
  "event_time": "2022-12-08T16:00:00Z",
  "category": "bug",
  "status": "reviewed"
}
My AggregatingMergeTree (more specifically, it's replicated) has a sorting key on the ticket ID to aggregate two states into one.
CREATE TABLE ticket_created_to_reviewed
(
    `ticket` String,
    `created_ticket_event_id` SimpleAggregateFunction(max, String),
    `created_ticket_event_time` SimpleAggregateFunction(max, DateTime64(9)),
    `created_ticket_category` SimpleAggregateFunction(max, String),
    `close_ticket_event_id` SimpleAggregateFunction(max, String),
    `close_ticket_event_time` SimpleAggregateFunction(max, DateTime64(9)),
    `close_ticket_category` SimpleAggregateFunction(max, String)
)
ENGINE = ReplicatedAggregatingMergeTree('<path>', '{replica}')
PARTITION BY toYYYYMM(close_ticket_event_time)
PRIMARY KEY ticket
ORDER BY ticket
TTL date_trunc('second', if(close_ticket_event_time > created_ticket_event_time,
    close_ticket_event_time, created_ticket_event_time)) + toIntervalMonth(12)
SETTINGS index_granularity = 8192
Two materialized views SELECT from the raw events and insert into ticket_created_to_reviewed: one with WHERE status = 'created' and another with WHERE status = 'reviewed'.
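For reference, they are roughly shaped like the following (a minimal sketch: the raw table name ticket_events and its column names are assumptions, and the second MV mirrors this one with the close_* columns and status = 'reviewed'):
CREATE MATERIALIZED VIEW ticket_created_mv TO ticket_created_to_reviewed AS
SELECT
    ticket,
    event_id   AS created_ticket_event_id,
    event_time AS created_ticket_event_time,
    category   AS created_ticket_category
FROM ticket_events
WHERE status = 'created'
-- columns not selected here (the close_* ones) fall back to their defaults in the target table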
So far the data populates correctly, although I have to exclude rows where only one of the two status events is populated. Getting the hourly p90 of ticket time-to-close over the past day for each category looks something like this:
SELECT
    quantile(0.9)(date_diff('second', created_ticket_event_time, close_ticket_event_time)),
    date_trunc('hour', close_ticket_event_time) AS t,
    close_ticket_category AS category
FROM
(
    SELECT
        ticket,
        max(created_ticket_event_id) AS created_ticket_event_id,
        max(created_ticket_event_time) AS created_ticket_event_time,
        max(created_ticket_category) AS created_ticket_category,
        max(close_ticket_event_id) AS close_ticket_event_id,
        max(close_ticket_event_time) AS close_ticket_event_time,
        max(close_ticket_category) AS close_ticket_category
    FROM ticket_created_to_reviewed
    GROUP BY ticket
)
WHERE close_ticket_event_id != '' AND created_ticket_event_id != ''
    AND close_ticket_event_time > addDays(now(), -1)
GROUP BY t, category
The problem is that close_ticket_event_time is not in the sorting key, so the query scans the full table; but I also can't add that column to the sorting key, because then the table would no longer aggregate by ticket ID.
Any suggestions?
Things tried:
Adding an index and/or a projection ordered by close_ticket_event_time (a rough sketch of the projection is shown after this list). However, I think the main problem is that the sorting key is on the ticket ID, so the data is not ordered by time and the matching time range can't be found efficiently; at the same time, putting close_ticket_event_time into the sorting key breaks the aggregation behavior of AggregatingMergeTree.
An MV that joins the created ticket and closed ticket events into a different destination table with close_ticket_event_time as the sorting key. The destination table doesn't contain all the data if the right side of the JOIN isn't available at the time the MV is triggered (i.e. by the left side). This can happen if events are ingested out of order.
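For completeness, the projection variant mentioned in the first item looked roughly like this (a sketch; the projection name by_close_time is made up):
-- reorders a copy of the data inside each part by close time,
-- but parts are still merged and collapsed by the ticket sorting key
ALTER TABLE ticket_created_to_reviewed
    ADD PROJECTION by_close_time
    (
        SELECT *
        ORDER BY close_ticket_event_time
    );

ALTER TABLE ticket_created_to_reviewed MATERIALIZE PROJECTION by_close_time;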
Ideally, what I'm looking for is something like this in AggregatingMergeTree, but it appears this isn't possible due to the nature of how the data is stored.
PRIMARY KEY ticket
ORDER BY close_ticket_event_time
Thanks in advance

Related

Which PostgreSQL index is most efficient for a text column with queries based on similarity

I would like to create an index on a text column for the following use case. We have a table segments with a column content of type text. We perform queries based on similarity using pg_trgm. This is used in a translation editor for finding similar strings.
Here are the table details:
CREATE TABLE public.segments
(
    id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
    language_id integer NOT NULL,
    content text NOT NULL,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
    CONSTRAINT segments_pkey PRIMARY KEY (id),
    CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
        REFERENCES public.languages (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE,
    CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
)
And here is the query (Ruby + Hanami):
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
  aggregate(:translation_records)
    .where(language_id: source_lang)
    .where { similarity(:content, source_text_for_lookup) > sim_score / 100.00 }
    .select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
    .order { similarity(:content, source_text_for_lookup).desc }
end
---EDIT---
This is the query:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" FROM "segments" WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45)) ORDER BY SIMILARITY("content", 'This will not work.') DESC
SELECT "translation_records"."id", "translation_records"."source_segment_id", "translation_records"."target_segment_id", "translation_records"."domain_id",
"translation_records"."style_id",
"translation_records"."created_by", "translation_records"."updated_by", "translation_records"."project_name", "translation_records"."created_at", "translation_records"."updated_at", "translation_records"."language_combination", "translation_records"."uid",
"translation_records"."import_comment" FROM "translation_records" INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id") WHERE ("translation_records"."source_segment_id" IN (27548)) ORDER BY "translation_records"."id"
---END EDIT---
---EDIT 1---
What about re-indexing? Initially we'll import about 2 million legacy records. When and how often, if at all, should we rebuild the index?
---END EDIT 1---
Would something like CREATE INDEX ON segments USING gist (content) be OK? I can't really find which of the available index types would be best suited to our use case.
Best, seba
The 2nd query you show seems to be unrelated to this question.
Your first query can't use a trigram index, as the query would have to be written in operator form, not function form, to do that.
In operator form, it would look like this:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity"
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
In order for % to be equivalent to similarity("content", 'This will not work.') > 0.45, you would first need to run SET pg_trgm.similarity_threshold TO 0.45;.
Now how you get ruby/hanami to generate this form, I don't know.
The % operator can be supported by either the gin_trgm_ops or the gist_trgm_ops operator class. The <-> operator can only be supported by gist_trgm_ops. But it is pretty hard to predict how efficient that support will be. If your "content" column is long or your text to compare is long, it is unlikely to be very efficient, especially in the case of GiST.
Ideally you would partition your table by language_id. If not, then it might be helpful to index both columns, for example:
CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
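To see whether the planner actually picks up the trigram index for the operator-form query, an EXPLAIN run along these lines can help (a sketch; the plan you get will depend on data volume and statistics):
-- make % match the original similarity(...) > 0.45 condition
SET pg_trgm.similarity_threshold TO 0.45;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, similarity(content, 'This will not work.') AS similarity
FROM segments
WHERE language_id = 2
  AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';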

ClickHouse: how to enable performant queries against increasing user-defined attributes

I am designing a system that handles a large number of buried point events. An event record contains:
buried_point_id, for example: 1 means app_launch, 2 means user_register.
happened_at: the event timestamp.
user_id: the user identifier.
other attributes, including basic ones (phone_number, city, country) and user-defined ones (click_item_id, for example; it can literally be any context information). PMs will keep adding more and more user-defined attributes to the event record.
The query pattern is like:
SELECT COUNT(DISTINCT user_id) FROM buried_points WHERE buried_point_id = 1 AND city = 'San Francisco' AND click_item_id = 123;
Since my team invests heavily in ClickHouse, I want to leverage ClickHouse for this problem. I wonder whether it is good practice to use the experimental Map data type to store all attributes in a Map-type column such as {city: 'San Francisco', click_item_id: 123, ...}, or is there any other recommendation? Thanks.
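For illustration, the Map-based layout under consideration would look roughly like this (a sketch; the table name, ordering key, and storing every attribute as String are assumptions, and older ClickHouse versions need allow_experimental_map_type = 1):
CREATE TABLE buried_points
(
    buried_point_id UInt32,
    happened_at     DateTime,
    user_id         UInt64,
    attrs           Map(String, String)  -- basic and user-defined attributes in one column
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(happened_at)
ORDER BY (buried_point_id, happened_at);

SELECT count(DISTINCT user_id)
FROM buried_points
WHERE buried_point_id = 1
  AND attrs['city'] = 'San Francisco'
  AND attrs['click_item_id'] = '123';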

Convert column value from null to value of similar row with similar values

Sorry for the slightly strange title; I couldn't think of a succinct way to describe my problem.
I have a set of data that is created by one person; the data is structured as follows:
ClientID, ShortName, WarehouseZone, RevenueStream, Name, Budget, Period
This data is manually inputted, but as there are many Clients and many RevenueStreams, only lines where Budget != 0 have been included.
This needs to connect to another data set to generate revenue, and there are times when revenue exists but no budget exists.
For this reason I have gathered all customers and cross-joined them to all codes, then appended these values into the main query; however, as WarehouseZone is manually inputted, there are a lot of entries where WarehouseZone is null.
This will always be the same for every instance of the customer.
Now, after my convoluted explanation, here's my question: how can I do the following?
Pseudo-code that I hope makes sense:
SET WarehouseZone = WarehouseZone WHERE ClientID = ClientID AND
WarehouseZone != NULL
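To make the intent concrete, in plain SQL the pseudo-code above would amount to something like this (a sketch only; the table name budget is hypothetical, and the actual fix here needs to happen in Power Query):
-- Copy each client's known WarehouseZone onto that client's rows where it is missing.
UPDATE budget AS b
SET WarehouseZone = (
    SELECT MAX(b2.WarehouseZone)
    FROM budget AS b2
    WHERE b2.ClientID = b.ClientID
      AND b2.WarehouseZone IS NOT NULL
)
WHERE b.WarehouseZone IS NULL;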
Are you sure that a client has only one WarehouseZone? Otherwise you need an aggregation.
Let's check: you can add a custom column that will return a record like this:
Table.Max(
    Table.SelectColumns(
        Table.SelectRows(#"Last Step",
            each [ClientID] = _[ClientID]),
        "WarehouseZone"),
    "WarehouseZone"
)
This creates a new column that brings in the max WarehouseZone for each ClientID. At the end you can expand the record to get the value.
P.S. The calculation is not great for performance.

Power Bi - Filter on currency type

I have transactions in a variety of currencies in a Transaction table (columns TransactionAmount and TransactionCurrency), and also a related Currency table:
Using the column RateToEuro I have been able to convert all my transaction amounts into euros, using a calculated column:
An example of what I would like: I want to select 'Dollar' in my report filter, and the Transaction amount column should then convert all original transaction amounts to dollars. So in my example above, the original trx amount of $2052 would also show as 2052 in the 'Transaction amount ($)' column.
[EDIT:]
Currently I have created a measure that gets the filter value:
CurrencyFilter = IF(LASTNONBLANK('CurrencyFormat'[Name], 1) = "USD", "USD", "EUR")
And a calculated column that, for each transaction, calculates the converted transaction amount (depending on the report filter chosen):
TransactionAmountConverted = CALCULATE(VALUES(Transactions[TransactionAmount]) * (IF([CurrencyFilter] = "EUR", VALUES('Currency'[RateToEuro]), VALUES('Currency'[RateToDollar]))))
But for some reason the IF statement always returns TRUE (i.e. it always uses the RateToEuro column).
Any hint or tip to point me in the right direction would be much appreciated!
Currently this is not possible in Power BI: a calculated column cannot use a slicer value, because calculated columns are evaluated when the data is loaded or refreshed, not at query time.
A measure can be used to get the slicer value, but in this particular case (where a value needs to be calculated for each row), there is unfortunately no possible solution.

Cassandra slow get_indexed_slices speed

We are using Cassandra for log collection.
About 150,000-250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc., and a special indexed column 'indexTimestamp'.
This column contains the time rounded to the hour.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions - by host, errorlevel, etc.
When getting records just by indexTimestamp, everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra takes a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? There are no more than 250,000 records per indexTimestamp value. Isn't it possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Secondary-index queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query arbitrarily. It is a key-value storage system. To understand that, please go here and have a quick look: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows and ranged slice queries.
Let's say you have the object
user : {
    name : "XXXXX",
    country : "UK",
    city : "London",
    postal_code : "N1 2AC",
    age : "24"
}
and of course you want to query by city OR by age (AND & OR would be yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for the age search.
The range query for age EQ 24 would be
get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")
As a note, '`' (backtick) is the character right after '_' in ASCII, so the slice effectively gives you all the columns whose names start with "24_".
This also allows you to query for ages between 21 and 24, for example.
Hope it was useful.
