ClickHouse: how to enable performant queries against a growing set of user-defined attributes

I am designing a system that handles a large volume of buried-point (tracking) events. An event record contains:
buried_point_id, for example: 1 means app_launch, 2 means user_register.
happened_at: the event timestamp.
user_id: the user identifier.
other attributes, including basic ones (phone_number, city, country) and user-defined ones (click_item_id, which can literally be any contextual information). PMs will keep adding more user-defined attributes to the event record.
The query pattern is like:
SELECT COUNT(DISTINCT user_id) FROM buried_points WHERE buried_point_id = 1 AND city = 'San Francisco' AND click_item_id = 123;
Since my team invests heavily in ClickHouse, I want to leverage ClickHouse for this problem. Is it good practice to use the experimental Map data type to store all attributes in a Map-typed column such as {city: San Francisco, click_item_id: 123, ...}, or is there any other recommendation? Thanks.
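
For concreteness, here is a minimal sketch of what the Map approach could look like (the table layout and the ORDER BY key are assumptions, not an actual schema from the question; Map values here are stored as strings, so the comparison uses '123'):

-- on older ClickHouse versions the type must be enabled first:
-- SET allow_experimental_map_type = 1;
CREATE TABLE buried_points
(
    buried_point_id UInt32,
    happened_at     DateTime,
    user_id         UInt64,
    attributes      Map(String, String)  -- basic and user-defined attributes together
)
ENGINE = MergeTree
ORDER BY (buried_point_id, happened_at);

-- COUNT(DISTINCT user_id) via uniqExact
SELECT uniqExact(user_id)
FROM buried_points
WHERE buried_point_id = 1
  AND attributes['city'] = 'San Francisco'
  AND attributes['click_item_id'] = '123';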

Related

How do I create a filter using just one field, where the drop-down list shows grouped items from that field, in Amazon QuickSight?

I need to make regional groups from a 'Company' field in QuickSight. How do you create a filter that will show grouped companies? For example: Region 1, Region 2, Region 3, etc. When one of these groups is chosen in the filter, it needs to show the specific list of companies from the single 'Company' field for that region.
I've tried creating separate parameters (Region 1, Region 2, etc.) with the appropriate companies under each one, but I could not figure out how to use those in a filter. In short, I need to group companies together so the groups can be chosen from a drop-down filter.
I was able to find the answer to my question from another site. I wanted to share.
I had to use the locate function to create my groups in a calculated field.
Syntax for Calculated field:
locate(expression, substring, start)
Looked something like this:
ifelse
(
locate('Hotel by Marriott',{Hotel Name}) > 0,
'Marriott',
locate('Homewood Suites Hotel A, Homewood Suites Hotel B ',{Hotel Name}) > 0,
'Homewood Suites',
locate('Home 2 Suites Hotel A,Home2 Suites Hotel B',{Hotel Name}) > 0,
'Home2 Suites',
'Other Brands'
)
Expression is a comma-separated list, in single quotes, of all the hotel names that belong to the group; locate searches within it.
Substring is the field reference ({Hotel Name}).
The string after each locate(...) > 0 condition (e.g. 'Marriott') is the group name that the ifelse returns; the optional start argument is not used here.
Then create a parameter as a String with Multiple Values. Save it and create a control with Multiple Static Values, which is your list of group names as written in the calculated field.

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination so I need the ORDER BY. How to improve the performance on this?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can evaluate the tenants lazily and stop at 10, rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If Tenant is the only label that can live in apartments, then you could possibly save a Filter step by dropping the label from your query, like MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to PROFILE your query and check whether it uses index-backed ORDER BY, as sketched below.
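A quick way to check (Neo4j 4.x syntax assumed; the index name is arbitrary):

CREATE INDEX tenant_id_idx FOR (t:Tenant) ON (t.id);

PROFILE
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10

If the plan has no Sort operator and the rows come out of an index scan already ordered, the index-backed ORDER BY is being used.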
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
EDIT
I was hoping that a USING JOIN hint on t would make the planner use the index and improve the plan, but it does not.
The query performs slightly better if you add a redundant relationship from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant node; no major gains either way. I tested with 150,000 tenants in City1; perhaps the gains become more visible as you approach millions, but I'm not sure.

Tableau - Filter Measure Based on Different Variables of the Same Dimension

I have the following dimensions: Patients and Collection Type (Blood or Tissue). Measure: Collections.
I am counting how many blood and tissue collections have been made for each patient.
Here is my table: Collections per Patient by Collection Type
Now I want to filter this table: I want to display only those Patients who have more than 2 Blood Collections and more than 2 Tissue Collections.
So, I want to see only Patients B, D, and E.
How can I do this?
There are a variety of ways you could accomplish your desired result. Probably one of the easier ways is to unpivot your data so that 'blood collections' and 'tissue collections' are separate columns instead of one. I don't believe Tableau currently supports this natively while importing a data source; however, you can create two additional calculated fields to replicate an unpivot.
Blood Field:
IF [Collection_Type] = 'Blood'
THEN [Collection]
ELSE Null
END
Tissue Field:
IF [Collection_Type] = 'Tissue'
THEN [Collection]
ELSE Null
END
EDIT: Create a calculated field that contains your desired filter condition, for example:
(SUM([Blood_field]) > 2 AND SUM([Tissue Field]) > 2)
The calculated field will evaluate to TRUE or FALSE. Filter for records where this field is TRUE.

How to design querying multiple tags on analytics database

I would like to store custom tags for each user purchase transaction; for example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags that sellers are interested in querying back to understand their sales.
My idea is: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a-z", then continue "aa, ab, ac, ... zz", and so on. Then keep all the tags given for one transaction in a single varchar column called tag, separated by "|".
Let us assume the mapping is (at the application level):
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
"SHOES" = ay
...
...
"COLOUR_BLACK" = bc
"COLOUR_RED" = bd
"COLOUR_BLUE" = be
...
"SIZE_12" = cq
...
So, storing the above purchase transaction, the tag value will be tag="|a|z|ay|bc|cq|". A seller can then count the number of SHOES sold by adding the condition tag LIKE '%|ay|%'. Now the problem is that I cannot use an index (a sort key in Redshift) for a LIKE pattern that starts with '%'. How do I solve this, given that I might have 100 million records? I don't want a full table scan.
Is there any solution to this?
Update_1:
I have not followed the bridge-table concept (cross-reference table), since I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, so my sum() would be doubled.
I got a suggestion like the one below:
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id) in the WHERE clause, once for each tag (note: this assumes tr is an alias of the transaction table in the surrounding query)
I have not followed this either, since I have to combine tags with AND and OR conditions, for example ("SPORTS" AND "ADIDAS"), or "SHOES" AND ("NIKE" OR "ADIDAS").
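For what it's worth, EXISTS predicates compose under AND and OR like any other boolean expression; a sketch, assuming the transaction_tag layout from that suggestion and a surrounding transactions table (both names are assumptions):

-- "SHOES" AND ("NIKE" OR "ADIDAS")
SELECT COUNT(*)
FROM transactions tr
WHERE EXISTS (SELECT 1 FROM transaction_tag WHERE trans_id = tr.trans_id AND tag_id = 'ay')   -- SHOES
  AND (EXISTS (SELECT 1 FROM transaction_tag WHERE trans_id = tr.trans_id AND tag_id = 'z')   -- NIKE
    OR EXISTS (SELECT 1 FROM transaction_tag WHERE trans_id = tr.trans_id AND tag_id = 'aa')  -- ADIDAS
  );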
Update_2:
I have not followed the bit-field approach, since I don't know whether Redshift supports it. Also, I am assuming my system will eventually have a minimum of 3,500 tags; allocating one bit for each results in about 437 bytes per transaction, even though at most 5 tags can be given for a transaction. Any optimisation here?
Solution_1:
I have thought of storing a min value (SMALLINT) and a max value (SMALLINT) alongside the tags column, and applying an index (sort key) on those.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching SHOES (ay = 51) is:
minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'
And the query for searching SHOES (ay = 51) AND SIZE_12 (cq = 95) is:
minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
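Written out as a full query, that two-tag lookup would read something like this (the transactions table name is assumed; the range predicates only prune rows, the LIKE still does the exact match):

SELECT COUNT(*)
FROM transactions
WHERE minTag <= 51           -- a row can only contain ay (51) if its smallest code is <= 51
  AND maxTag >= 95           -- a row can only contain cq (95) if its largest code is >= 95
  AND tag LIKE '%|ay|%|cq|%';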
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api, for example:
aws s3api put-object-tagging --bucket <bucket> --key <key> --tagging "TagSet=[{Key=Adidas,Value=AY}]"
Capture the tags dynamically by passing the key and value as parameters.
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position, and a 0 if not. For every row, you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; you would still need to look up a position in the middle of the string, but since you know exactly which position to look at, you don't need LIKE, just SUBSTRING. For further optimization, you could convert this bit mask to integer values (unique for each sequence) and match on those, but AFAIK Redshift doesn't support that out of the box yet; you would have to define the rules yourself.
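A sketch of that position lookup (the tag_mask column name and the table name are illustrative, reusing 51 = SHOES and 95 = SIZE_12 from the question):

-- tag_mask is e.g. CHAR(3500), one character per tag
SELECT COUNT(*)
FROM transactions
WHERE SUBSTRING(tag_mask, 51, 1) = '1'   -- SHOES
  AND SUBSTRING(tag_mask, 95, 1) = '1';  -- SIZE_12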
UPD: It looks like the best option here is to keep the tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Then lookups for orders with a particular tag, and further aggregations over those orders, should be efficient. There is no silver bullet for optimizing this in a flat table; at least, I don't know of one that would not bring a lot of unnecessary complexity compared to the "relational" solution.
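A sketch of that layout, including one way to match several tags without the row duplication the asker worried about (table and column names are assumptions):

CREATE TABLE order_tag (
    order_id BIGINT,
    tag_id   VARCHAR(8)
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (tag_id);

-- orders tagged with both SHOES ('ay') and SIZE_12 ('cq'), each order counted once
SELECT order_id
FROM order_tag
WHERE tag_id IN ('ay', 'cq')
GROUP BY order_id
HAVING COUNT(DISTINCT tag_id) = 2;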

Calculated Measure in Analysis Services

The AdventureWorksDW sample has the construct of a financial reporting fact table. I have a similar fact table where the fact contains only the FKs to the dimension tables and a value. The measure gets its context from the DimAccount dimension. Are there any code samples that show how to compute a simple ratio, as a calculated member, between two measures of the AdventureWorks financial reporting sample?
So basically I would like to see, say, Total Long Term Debt / Total Assets from AdventureWorksDW. What I need is the expression, i.e. the MDX.
Thanks in advance.
Use a query like this:
with member [Account].[Accounts].[Balance Sheet].[Debt by Assets] as
IIf([Account].[Accounts].[Assets] <> 0,
[Account].[Accounts].[Long Term Liabilities] / [Account].[Accounts].[Assets],
null
)
,format_string = "0.00%"
select {
[Account].[Accounts].[Assets],
[Account].[Accounts].[Long Term Liabilities],
[Account].[Accounts].[Debt by Assets]
}
on columns,
{ [Measures].[Amount] }
on rows
from [Adventure Works]
You can define members in any hierarchy, not only in Measures. In the definition, you put the parent member before the name of the new member to tell Analysis Services its position in the hierarchy. This matters more for CREATE MEMBER in the cube calculation script than for WITH MEMBER, since it influences where client tools will display the member.
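For example, the CREATE MEMBER form of the same calculation in the cube's calculation script might look like this (a sketch; CURRENTCUBE is the standard placeholder for the containing cube):

CREATE MEMBER CURRENTCUBE.[Account].[Accounts].[Balance Sheet].[Debt by Assets] AS
    IIf([Account].[Accounts].[Assets] <> 0,
        [Account].[Accounts].[Long Term Liabilities] / [Account].[Accounts].[Assets],
        null),
    FORMAT_STRING = "0.00%";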
