I have ~7 million docs in a bucket, and I am struggling to write the correct query/index combination to keep the query from running for more than 5 seconds.
Here is a similar scenario to the one I am trying to solve:
I have multiple coffee shops, each making coffee with different container/lid combos. The field keys also differ between doc types. With each sale that is generated, I keep track of these combos.
Here are a few example docs:
[{
"shopId": "x001",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
},
{
"shopId": "x002",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
}]
What I need to get out of the query is the following:
[{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
}]
I.e. I want the total number of unique containers and lids combined, per shop and day.
I have tried using composite, adaptive and FTS indexes, but I am unable to figure this one out.
Does anybody have a different suggestion? Can someone please help?
CREATE INDEX ix1 ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId]);
If using EE and shopId is immutable, add PARTITION BY HASH(shopId) to the above index definition (with a higher number of partitions).
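A sketch of what that could look like (the num_partition value of 8 is just an illustrative assumption):

CREATE INDEX ix1 ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId])
PARTITION BY HASH(shopId)
WITH {"num_partition": 8};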
SELECT d.shopId,
DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");
Adjust index key order of shopId, day based on the query predicates.
https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/
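As a quick sanity check, you can confirm that the grouping/aggregation was pushed into the index by running EXPLAIN on the query above and looking for an index_group_aggs section in the plan:

EXPLAIN SELECT d.shopId,
       DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
       COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");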
Update:
Based on the EXPLAIN, you have a date predicate and all shopIds, so use the following index:
CREATE INDEX ix2 ON default( DATE_FORMAT_STR(date,"1111-11-11"), shopId, [cappuccinoContainerId, cappuccinoLidId]);
Since you need a DISTINCT over cappuccinoContainerId and cappuccinoLidId, store them as a single index key (an array of 2 elements): [cappuccinoContainerId, cappuccinoLidId]. The advantage is that you can reference this key directly in COUNT(DISTINCT ...), which allows index aggregation. (Do not put DISTINCT inside the index definition; that would turn it into an array index and things will not work as expected.)
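For illustration, a query shaped to match ix2 (the date range is a hypothetical predicate):

SELECT d.shopId,
       DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
       COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE DATE_FORMAT_STR(d.date,"1111-11-11") BETWEEN "2022-01-01" AND "2022-01-31"
GROUP BY DATE_FORMAT_STR(d.date,"1111-11-11"), d.shopId;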
I assume
That the cup types and lid types can be used for any drink type.
That you don't want to add any precomputed stuff to your data.
Perhaps an index like the following would work (my collection keyspace is bulk.sales.amer); note I am not sure whether it performs better or worse (or is even equivalent) compared to the solution posted by vsr:
CREATE INDEX `adv_shopId_concat_nvls`
ON `bulk`.`sales`.`amer`(
    `shopId` INCLUDE MISSING,
    (
        nvl(`cappuccinoContainerId`, "") ||
        nvl(`cappuccinoLidId`, "") ||
        nvl(`latteContainerId`, "") ||
        nvl(`latteLidId`, "") ||
        nvl(`espressoContainerId`, "") ||
        nvl(`espressoLidId`, "")
    ),
    substr0(`date`, 0, 10)
)
And then, using the covering index above, do your query like this:
SELECT
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
) AS uniqueContainersLidsCombined,
SUBSTR0(date, 0, 10) AS day,
COUNT(*) AS cnt
FROM `bulk`.`sales`.`amer`
GROUP BY
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
),
SUBSTR0(date, 0, 10)
Note I used the following 16 lines of data:
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T01:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T02:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T03:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T04:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T05:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T06:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
Applying some sorting by wrapping the above query with
SELECT T1.* FROM
(
-- paste above --
) AS T1
ORDER BY T1.day, T1.shopId, T1.uniqueContainersLidsCombined
We get
cnt day shopId uniqueContainersLidsCombined
1 "2022-01-01" "x001" "a001b001"
1 "2022-01-01" "x002" "a001b001"
1 "2022-01-02" "x001" "a002b002"
1 "2022-01-02" "x001" "a003b003"
1 "2022-01-02" "x002" "a002b002"
1 "2022-01-02" "x002" "a003b003"
1 "2022-01-03" "x001" "a007b005"
2 "2022-01-03" "x001" "a007b004"
2 "2022-01-03" "x002" "a007b004"
5 "2022-01-03" "x002" "a007b005"
If you still don't get the performance you need, you could possibly use the Eventing service to do a continuous map/reduce and an occasional update query to make sure things stay perfectly in sync.
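A very rough sketch of what such an Eventing handler could look like (the stats bucket binding alias and the key scheme are hypothetical, and error handling is omitted):

function OnUpdate(doc, meta) {
    // React only to sale docs that carry a shop and a date.
    if (!doc.shopId || !doc.date) return;
    // Collapse whichever drink-specific container/lid fields are present.
    var combo = (doc.cappuccinoContainerId || doc.latteContainerId || doc.espressoContainerId || "") +
                (doc.cappuccinoLidId || doc.latteLidId || doc.espressoLidId || "");
    if (combo === "") return;
    // One stats doc per shop per day; object keys act as a set, so duplicates collapse.
    var key = "stats::" + doc.shopId + "::" + doc.date.substring(0, 10);
    var stat = stats[key] || { combos: {} };
    stat.combos[combo] = true;
    stat.uniqueContainersLidsCombined = Object.keys(stat.combos).length;
    stats[key] = stat;
}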
I have 3 tables in my database.
The first two tables are just normal tables with an ID and some other columns like:
Table 1
ID | col01
---+------
1  | ...
2  | ...

Table 2
ID | col01
---+------
1  | ...
2  | ...
The third table is some kind of a relation/assignment table:
Table 3
ID | table1_id | table2_id | text
---+-----------+-----------+-----
1  | 1         | 1         | ..
2  | 1         | 2         | ..
3  | 1         | 3         | ..
4  | 2         | 1         | ..
5  | 3         | 3         | ..
Now I do have a SQL statement which does exactly what I want:
SELECT * FROM table_3 where (table1_id, table2_id) in ( (1, 1), (2, 1), (3, 3));
So I'm sending the following request body to the API:
{
"assignments": [
{
"table1_id": 1,
"table2_id": 1
},
{
"table1_id": 2,
"table2_id": 1
},
{
"table1_id": 3,
"table2_id": 3
}
]
}
I validate the request with:
->validate($request, [
    'assignments' => 'required|array',
    'assignments.*.table1_id' => 'required|integer|min:1|max:20',
    'assignments.*.table2_id' => 'required|integer|min:1|max:20',
]);
Now I'm somewhat stuck on how to use the Eloquent methods (e.g. whereIn) to get my desired output.
Thanks in advance!
EDIT
So I took the workaround of arcanedev-maroc mentioned here: https://github.com/laravel/ideas/issues/1021
and edited it to fit my Request.
Works like a charm.
Laravel does not provide a function for this by default; the core team said they would not maintain such a feature (see the laravel/ideas post linked above).
But you can build the query yourself. Here is a function you can adapt to your specification:
public function test(Request $request)
{
    $data = $request->input('assignments');

    // Build one "(?, ?)" placeholder pair per assignment and collect the
    // bindings, so the values are escaped instead of concatenated in.
    $placeholders = [];
    $bindings = [];
    foreach ($data as $datum) {
        $placeholders[] = '(?, ?)';
        $bindings[] = $datum['table1_id'];
        $bindings[] = $datum['table2_id'];
    }

    $query = '(table1_id, table2_id) in (' . implode(', ', $placeholders) . ')';

    return DB::table('table3')->whereRaw($query, $bindings)->get();
}
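Alternatively, here is a sketch of the nested-closure workaround from the laravel/ideas issue mentioned above; it stays in the query builder and avoids raw SQL entirely:

$assignments = $request->input('assignments');

$result = DB::table('table3')
    ->where(function ($query) use ($assignments) {
        foreach ($assignments as $a) {
            // Each pair becomes an OR'ed (table1_id = x AND table2_id = y) group.
            $query->orWhere(function ($q) use ($a) {
                $q->where('table1_id', $a['table1_id'])
                  ->where('table2_id', $a['table2_id']);
            });
        }
    })
    ->get();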
I'm a messenger developer trying to calculate DAU/MAU from an event stream of users' requests using KSQL.
I've tried to calculate it using the following query:
CREATE TABLE ACTIVE_USER_ACTIONS_BY_1_HOUR WITH (
KAFKA_TOPIC='active-user-actions-by-1-hour'
) AS
SELECT
MCCU.UID AS UID,
COUNT(MCCU.UID) AS ACTIVITY_COUNT
FROM METRICS_REQUESTS MR
JOIN METRICS_CONTEXT_CID_UID MCCU ON MCCU.CID = MR.CID
WINDOW TUMBLING (SIZE 1 HOUR)
WHERE
MR.REQ_NAME = 'SendMessage' OR
MR.REQ_NAME = 'UpdateMessage'
GROUP BY MCCU.UID;
I'm getting the following results:
{
"order": 3,
"ROWTIME": 1570095657670,
"ROWKEY": "1365010623 : Window{start=1570093200000 end=-}",
"UID": 1365010623,
"ACTIVITY_COUNT": 3
}
{
"order": 1,
"ROWTIME": 1570095651905,
"ROWKEY": "1637035978 : Window{start=1570093200000 end=-}",
"UID": 1637035978,
"ACTIVITY_COUNT": 9
}
I don't understand how to map those rows to something like:
{
"ACTIVE_UID_COUNT": 2,
"START": 1570093200000,
"END": null
}
If you want to calculate DAU/MAU, you can use the TOPKDISTINCT function, which gathers only the distinct occurrences of user_id from the incoming events.
You could use SQL like the following:
CREATE TABLE dau_1min WITH (KAFKA_TOPIC='dau-1m', VALUE_FORMAT='AVRO') AS
SELECT
country,
WINDOWSTART() AS window_timestamp,
ARRAYLENGTH(TOPKDISTINCT(user_id, 10000000)) AS dau
FROM eventssource
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY country;
As TOPKDISTINCT returns an ARRAY of distinct identifiers, you must implement a custom UDF, e.g. ARRAYLENGTH, that simply returns the size of the given array.
After that you will get an object like {country, window_timestamp, dau}, which is simple enough to store in a persistent database like PostgreSQL.
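For reference, a minimal sketch of such an ARRAYLENGTH UDF using the ksqlDB UDF annotations (the class and package layout are assumptions):

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

import java.util.List;

@UdfDescription(name = "arraylength", description = "Returns the number of elements in an array")
public class ArrayLength {

    @Udf(description = "Size of the given array; 0 when the array is null")
    public int arrayLength(final List<?> array) {
        return array == null ? 0 : array.size();
    }
}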
I have an index with the following data:
{
"_index":"businesses",
"_type":"business",
"_id":"1",
"_version":1,
"found":true,
"_source":{
"business":{
"account_level_id":"2",
"business_city":"Abington",
"business_country":"United States of America",
}
}
}
When I query the index, I want to sort by account_level_id (which is a digit between 1-5). The problem is, I don't want to sort in ASC or DESC order, but by the following: 4..3..5..2..1. This was caused by bad practice a couple years ago, where the account level maxed out at level 4, but then a lower level account was added with the value of 5. Is there a way to tell ES that I want the results returned in that specific order?
You could write a script-based sort, something like this (not tested; sort descending on the script value):
doc['account_level_id'].value == "5" ? 3 : doc['account_level_id'].value == "4" ? 5 : doc['account_level_id'].value == "3" ? 4 : doc['account_level_id'].value == "2" ? 2 : 1;
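Not tested either, but a sketch of how that script could be wired into a search request (this assumes account_level_id has doc values available, e.g. as a keyword field; adjust the field path to your mapping, such as business.account_level_id):

{
  "query": { "match_all": {} },
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": {
        "lang": "painless",
        "source": "def v = doc['account_level_id'].value; v == '4' ? 5 : v == '3' ? 4 : v == '5' ? 3 : v == '2' ? 2 : 1"
      }
    }
  }
}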
Or if possible you could create another field sort_level that maps account_level_id to sensible values that you can sort on.
{
"_index":"businesses",
"_type":"business",
"_id":"1",
"_version":1,
"found":true,
"_source":{
"business":{
"account_level_id":"4",
"business_city":"Abington",
"business_country":"United States of America",
"sort_level": 5
}
}
}
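Sorting then becomes a plain field sort, descending so that level 4 (stored as sort_level 5) comes first (assuming sort_level is mapped as a numeric field; adjust the path to your mapping):

{
  "sort": [
    { "sort_level": { "order": "desc" } }
  ]
}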
If you can sort in DESC order, you can create a function that remaps the integers and sort using the mapped values.
DESC sorts the mapped values like (5 4 3 2 1); remapping 4 to 5, 3 to 4, and 5 to 3 makes that come out as the original levels 4 3 5 2 1.
int map_to(int x){
    switch(x){
        case 1: case 2: return x; /* 1 and 2 keep their rank */
        case 3: return 4;         /* 3 sorts just below 4 */
        case 4: return 5;         /* 4 sorts highest */
        case 5: return 3;         /* 5 sorts between 3 and 2 */
        default: return x;        /* pass unexpected values through */
    }
}
and use it in your sorting algorithm (when the sort has to compare x vs y, it should compare map_to(x) vs map_to(y)); this will make 4 come before 3 and 5, as you want.
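For illustration, a hypothetical qsort comparator built on map_to (descending on the mapped value):

#include <stdlib.h>

/* Orders account levels as 4, 3, 5, 2, 1 by comparing mapped
   values in descending order. */
int cmp_levels(const void *a, const void *b) {
    int ma = map_to(*(const int *)a);
    int mb = map_to(*(const int *)b);
    return mb - ma;
}

/* Usage: qsort(levels, n, sizeof(int), cmp_levels); */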