How to model an array of key/value pairs - ClickHouse

We are using ClickHouse to store internal performance metrics for webpage loads. Each metric contains an array of key/value pairs for the custom load times we care about. We'd like to store these in ClickHouse and be able to query the times like any other time value.
For example, when I get a metric, along with all the standard data I may have values that give me the load times for a bunch of custom events, like this:
TimeStamp=1548268715
CustomEvents="a=10,b=20,c=30"
In this case, I want to store the values a=10, b=20, and c=30 in such a way that:
It's still tied to the original data (so I can filter by timestamp, any other field, etc.).
I can aggregate and query specific "custom events." For example, I may want to do a histogram on all a time values between certain dates.
The challenge is that I don't know in advance which custom events exist. I suppose I could white-list these, but the number of them may become very large, and the cardinality of custom events is very high.
I've got a few ideas of my own, but I'd appreciate any thoughts on this.

The standard approach to this in ClickHouse is to use a Nested structure and select from it using ARRAY JOIN.
Under the hood, a Nested field in ClickHouse is just a group of arrays of the same length.
Sample:
Create the table like this:
CREATE TABLE performance_metrics
(
    timestamp DateTime,
    website String,
    custom_events Nested (
        metric String,
        value UInt64 -- actually you can have more attributes here, if needed
    )
)
ENGINE = MergeTree
PARTITION BY toMonday(timestamp)
ORDER BY (website, timestamp);
Insert the data, referencing the Nested subfields as separate arrays. The array names must be prefixed with the Nested column's name, and all the arrays must have the same length:
INSERT INTO performance_metrics (timestamp, website, custom_events.metric, custom_events.value) VALUES
( '2019-02-04 10:00:00', 'google.com', ['a', 'b', 'c'],[10,20,30]),
( '2019-02-04 10:00:01', 'stackoverflow.com', ['b', 'c', 'd'],[22,29,40]),
( '2019-02-04 10:00:01', 'google.com', ['a','d'], [8,42]);
And now you can select from performance_metrics using ARRAY JOIN:
SELECT
    website,
    custom_events.metric,
    median(custom_events.value),
    min(timestamp),
    max(timestamp)
FROM performance_metrics
ARRAY JOIN custom_events
GROUP BY
    website,
    custom_events.metric
ORDER BY
    website ASC,
    custom_events.metric ASC;
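For the histogram use case from the question (all a values between certain dates), the same ARRAY JOIN can be combined with a filter on the metric name and the timestamp. A minimal sketch, with placeholder dates:
SELECT
    custom_events.value AS a_value,
    count() AS occurrences
FROM performance_metrics
ARRAY JOIN custom_events
WHERE custom_events.metric = 'a'
    AND timestamp BETWEEN '2019-02-01 00:00:00' AND '2019-02-28 23:59:59'
GROUP BY a_value
ORDER BY a_value;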

Related

How to speed up a spatial join in BigQuery?

I have a BigQuery table with point records across a whole country, and I need to assign a "censal zone" to each of them; the zone polygons are stored in another table. I've been trying to do so using a query like this one:
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON ST_CONTAINS(zone_polygon, point_geo)
The first table is quite large, so the query performs very inefficiently, as it compares every possible (point, censal zone) pair. However, both tables have a column identifying the municipality they are in, so the question is: can I rewrite my query so that ST_CONTAINS() is evaluated only for the (point, censal zone) pairs that belong to the same municipality, rather than comparing every censal zone in the country against each point? Can I do this without having to read points_table multiple times?
SELECT id_point, code_censal_zone
FROM `points_table` AS p
JOIN `zones_table` AS z
  ON p.municipality = z.municipality
  AND ST_CONTAINS(z.zone_geo, p.point_geo)
I'm quite new to BigQuery, so I don't really know whether a query like this would actually do what I'm expecting, as I couldn't find anything in the documentation.
Thanks!

Faster select when filtering with second or third sorted column

We have a time series table with the following definition
CREATE TABLE timeseries.mytable
(
    `ts` DateTime('UTC'),
    `src_ip` String,
    `dst_ip` String,
    `col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip, ts, src_ip)
SETTINGS index_granularity = 8192
SELECT count(*) FROM timeseries.mytable;
# Elapsed 0.004 sec. Has 383M records
SELECT count(*) FROM timeseries.mytable WHERE dst_ip = 'a.b.c.d';
# Elapsed: 0.085 sec.
SELECT count(*) FROM timeseries.mytable WHERE src_ip = 'a.b.c.d';
# Elapsed: 53.031 sec
As can be seen above, filtering the data using the first sorted column (dst_ip) is very quick.
How can I make the select using the third sorted column (src_ip) faster?
Some remarks:
the third query (WHERE src_ip = 'a.b.c.d') is slow because the index is not used and ClickHouse falls back to a full scan. There is no good way to make it faster other than redesigning the primary key or, if the query only calculates aggregates, adding an AggregatingMergeTree table
the use cases you provided look artificial, because counting rows over the whole dataset is not a key use case for time-series data. Why is the result not restricted by dst_ip and ts?
consider the ClickHouse AggregatingMergeTree approach when you need to calculate aggregated values (such as count in your case)
designing a primary key requires understanding how ClickHouse uses it in query optimization (see Primary Keys and Indexes in Queries, More Secrets of ClickHouse Query Performance)
the general recommendation is to use a monotonic index
to choose the best index, run a series of tests to find the one that best fits your concrete use cases
I would suggest the following primary keys:
/* Option 1 (a pretty suspicious suggestion): remove the date column; this makes all date-range queries with a range shorter than the daily partition much slower. */
ORDER BY (dst_ip, src_ip)
/* Option 2: define the granularity of the date explicitly. Instead of toStartOfHour, any interval shorter than 'Daily' (where Daily is defined by the partition key) can be used. */
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
/* Option 3: move the date to the first position; this speeds up date-range queries that do not filter on dst_ip and gains the advantages of a monotonic index. */
ORDER BY (toStartOfHour(ts), dst_ip, src_ip)
For each primary key, you also need to choose the most effective index_granularity value.
As of 2022, the solution is to use a data skipping index (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) for src_ip.
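A minimal sketch of what that could look like (the index name, bloom filter false-positive rate, and granularity are arbitrary placeholders, not tuned recommendations):
ALTER TABLE timeseries.mytable
    ADD INDEX src_ip_bf src_ip TYPE bloom_filter(0.01) GRANULARITY 4;
-- rebuild the index for parts written before the index existed
ALTER TABLE timeseries.mytable MATERIALIZE INDEX src_ip_bf;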
You could test different column orders in the ORDER BY clause depending on the cardinality of your columns. In this case, try bringing src_ip before ts in the ORDER BY clause.
In the MergeTree engine, rows are sorted by the ORDER BY keys within each partition.
After that, you can decide the final arrangement of columns in the ORDER BY clause depending on how your application will query the data most of the time.
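One way to test such a reordering is a throwaway copy of the table with a different ORDER BY; the table name below is made up:
CREATE TABLE timeseries.mytable_by_src
(
    `ts` DateTime('UTC'),
    `src_ip` String,
    `dst_ip` String,
    `col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (src_ip, ts, dst_ip)
SETTINGS index_granularity = 8192;
INSERT INTO timeseries.mytable_by_src SELECT * FROM timeseries.mytable;
-- this should now use the primary index instead of a full scan
SELECT count(*) FROM timeseries.mytable_by_src WHERE src_ip = 'a.b.c.d';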
You can find a similar discussion here.

Access - Get the total of a column in a select query

I have an Access database set up that takes a bunch of raw data, splits it up into different SELECT queries, and pipes the results into various CSV files, where a dashboard set up in Excel picks them up.
There's some data that I'm trying to calculate in Access: I have a quantity field, and I need to calculate the percentage of the total for each record. In other words, quantity / total of quantity.
Using my rather limited Access abilities, I tried the following query:
SELECT [Sales].*, [Quantity] / Sum([Quantity]) AS QuantityPercent FROM [Sales];
Which comes up with an error:
Your query does not include the specified expression 'company_name' as part of an aggregate function.
Company_name is the first field of the table, and after some Googling and Binging, I'm still quite confused as to what it means in this context.
To sum it up, my question is this: Is there a way to calculate data based off the total of a column/field?
The easy method is to use DSum:
SELECT
    [Sales].*,
    [Quantity] / DSum("[Quantity]", "[Sales]") AS QuantityPercent
FROM
    [Sales];
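If you would rather avoid the domain function, roughly the same result can be had with a one-row derived table. This is only a sketch, assuming a reasonably recent version of Access that accepts derived tables in the FROM clause:
SELECT
    [Sales].*,
    [Sales].[Quantity] / T.TotalQuantity AS QuantityPercent
FROM
    [Sales],
    (SELECT Sum([Quantity]) AS TotalQuantity FROM [Sales]) AS T;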

SSRS Tablix Cell Calculation based on RowGroup Value

I have looked through several of the posts on SSRS tablix expressions and I can't find the answer to my particular issue.
I have a dashboard I am creating that contains summary data for various managers. They are entering monthly summary data into a single table structured like this:
CREATE TABLE OperationMetrics
(
    [Date] date,
    Plant char(10),
    Sales float,
    ReturnedProduct float
)
The data could use some grouping, so I created a reference table that maps each metric to a report group. It looks like this:
CREATE TABLE OperationsReport
(
    ReportType varchar(50),
    MetricType varchar(50)
)
In this table, 'Sales' and 'ReturnedProduct' are values of the MetricType column, while 'ExecSummary' and 'Quality' are ReportType entries. To do the join, I decided to UNPIVOT the OperationMetrics table...
SELECT [Date], Plant, Metric, MetricType
FROM (SELECT [Date], Plant, Sales, ReturnedProduct FROM OperationMetrics) AS Src
UNPIVOT (Metric FOR MetricType IN (Sales, ReturnedProduct)) AS UnPvt
and join it to the OperationsReport table so I have grouped metrics.
SELECT OpEx.[Date], OpEx.Plant, OpEx.Metric, Rpt.ReportType, OpEx.MetricType
FROM OpMetrics_Unpivoted OpEx
INNER JOIN OperationsReport Rpt ON OpEx.MetricType = Rpt.MetricType
(I understand that elements of this are not ideal, but sometimes we are not in control of our destiny.)
This does not include the whole of the tables, but you get the gist. So the managers fill the OperationMetrics table through a form, and I chose SSRS to display the output.
I created a tablix with the following configuration (I can't post images due to my rep...)
Date is the only column group, grouped on 'MMM-yy'
Parent Row Group is the ReportType
Child Row Group is the MetricType
Now, my problem is that some of the metrics are calculations of other metrics. For instance, 'Returned Product (% of Sales)' is not entered by the manager because it is assumed we can simply calculate that. It would be ReturnedProduct divided by Sales.
I attempted to calculate this by using a lookup function, as below:
Switch(
    Fields!FriendlyName.Value = "Sales", SUM(Fields!Metric.Value),
    Fields!FriendlyName.Value = "ReturnedProduct", SUM(Fields!Metric.Value),
    Fields!FriendlyName.Value = "ReturnedProductPercent",
        Lookup("ReturnedProduct", Fields!FriendlyName.Value, Fields!Metric.Value, "MetricDataSet")
        / Lookup("Sales", Fields!FriendlyName.Value, Fields!Metric.Value, "MetricDataSet")
)
This works great for the first month... but since Lookup returns the first match, it just repeats the same value for all the following months.
I attempted to use this but it got me back to where I was at the beginning since the dataset does not have the value.
Any help with this would be well received. I would like to keep the rowgroup hierarchy.
It sounds like the LookUp is working for you but you just need to include the date to find the right month. LookUp will return the first match which is why it's only working on the first month.
What you can try is concatenating the Metric Name and Date fields in the LookUp.
Lookup("Sales" & CSTR(Fields!DATE.Value), Fields!FriendlyName.Value & CSTR(Fields!DATE.Value), Fields!Metric.Value, "MetricDataSet")
Let me know if I misunderstood the issue.

Query performance in PostgreSQL using 'similar to'

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:
select *
from tableName
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.
I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.
I've tried indexing it:
create index tableName_columnX_idx on tableName (columnX)
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.
Is there any better solution for this problem?
EDIT: Thanks, everybody, for the feedback. It looks like I got to this point because of a design mistake (a topic I've posted about in a separate question).
If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:
CREATE INDEX ix_tablename_columnxlist
    ON tableName
    USING GIN ((REGEXP_SPLIT_TO_ARRAY(columnX, '')))
then search against the index:
SELECT *
FROM tableName
WHERE REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
I agree with @Quassnoi that a GIN index is fastest and simplest - unless write performance or disk space are issues, because it occupies a lot of space and costs quite a bit of performance on INSERT, UPDATE and DELETE.
My additional answer is triggered by your statement:
I can't find a better approach than using similar to.
If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.
Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.
More details in this related answer on dba.SE.
This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.
If so, I'd want to remodel this table to use one of the following approaches:
Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.
Convert columnX to an hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore version of columnX should provide fairly solid performance for those operations (a rough sketch follows this list).
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.
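As a rough sketch, the hstore option from the list above could look like this (the column, index, and dummy-value names are invented, and it assumes the hstore extension can be installed):
CREATE EXTENSION IF NOT EXISTS hstore;
-- store each character of columnX as a key with a dummy value
ALTER TABLE tableName ADD COLUMN columnx_set hstore;
UPDATE tableName
SET columnx_set = hstore(
        regexp_split_to_array(columnX, ''),
        array_fill('y'::text, ARRAY[length(columnX)]));
-- a GiST index supports the ?| ("contains any of these keys") operator
CREATE INDEX ix_tablename_columnx_set ON tableName USING gist (columnx_set);
SELECT *
FROM tableName
WHERE columnx_set ?| ARRAY['A', 'B', 'C', '1', '2', '3'];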
I'll post this as an answer because it may guide other people in the future: why not have six columns, haveA, haveB through have3, and do a six-part OR query? Or use a bitmask?
If there are too many attributes to assign a column each, I might try creating an "attribute" table:
(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')
and let the DBMS worry about the optimization.
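A minimal sketch of that attribute-table idea (it assumes tableName has an integer id primary key; the other names are invented):
CREATE TABLE tablename_attr
(
    fkey integer NOT NULL REFERENCES tableName (id),
    attr text NOT NULL,
    PRIMARY KEY (fkey, attr)
);
-- rows look like (1, 'A'), (1, 'B'), (2, '3') as above
SELECT t.*
FROM tableName t
WHERE EXISTS (
    SELECT 1
    FROM tablename_attr a
    WHERE a.fkey = t.id
      AND a.attr IN ('A', 'B', 'C', '1', '2', '3')
);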
