N1QL query grouping on distinct values with different keys - performance

I have ~7 million docs in a bucket and I am struggling to write the correct query/index combo that keeps it from running for more than 5 seconds.
Here is a similar scenario to the one I am trying to solve:
I have multiple coffee shops, each making coffee with different container/lid combos. These field keys also differ across doc types. With each sale that is generated I keep track of these combos.
Here are a few example docs:
[{
"shopId": "x001",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
},
{
"shopId": "x002",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
}]
What I need to get out of the query is the following:
[{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
}]
I.e., I want the total number of unique containers and lids combined per shop and day.
I have tried using composite, adaptive and FTS indexes, but I am unable to figure this one out.
Does anybody have a different suggestion? Can someone please help?

CREATE INDEX ix1 ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId]);
If using EE and shopId is immutable, add PARTITION BY HASH(shopId) to the above index definition (with a higher number of partitions), as sketched below.
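A sketch of the combined definition (the index name ix1p and the partition count are illustrative assumptions, not from the original advice):
CREATE INDEX ix1p ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId])
PARTITION BY HASH(shopId) /* EE only; assumes shopId never changes */
WITH {"num_partition": 16}; /* assumed example value; tune for your cluster */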
SELECT d.shopId,
DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");
Adjust the index key order of shopId and day based on the query predicates.
https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/
Update:
Based on the EXPLAIN output you have a date predicate and all shopIds, so use the following index:
CREATE INDEX ix2 ON default( DATE_FORMAT_STR(date,"1111-11-11"), shopId, [cappuccinoContainerId, cappuccinoLidId]);
As you need the DISTINCT of cappuccinoContainerId and cappuccinoLidId, store them as a single index key (an array of 2 elements): [cappuccinoContainerId, cappuccinoLidId]. The advantage is that you can reference this key directly in COUNT(DISTINCT ...), which allows index aggregation. (Do not put DISTINCT in the index definition itself; that turns it into an array index and things will not work as expected.)
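For illustration, a query shaped to ix2 might look like this (a sketch; the date range is an assumed placeholder for your real predicate):
SELECT DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
d.shopId,
COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE DATE_FORMAT_STR(d.date,"1111-11-11") BETWEEN "2022-01-01" AND "2022-01-31" /* assumed range */
GROUP BY DATE_FORMAT_STR(d.date,"1111-11-11"), d.shopId;
Grouping in the same order as the ix2 keys keeps the query eligible for index aggregation.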

I assume
That the cup types and lid types can be used for any drink type.
That you don't want to add any precomputed stuff to your data.
Perhaps an index like this (my collection keyspace is bulk.sales.amer); note I am not sure if it performs better or worse (or is even equivalent) with respect to the solution posted by vsr:
CREATE INDEX `adv_shopId_concat_nvls`
ON `bulk`.`sales`.`amer`(
`shopId` MISSING,
(
nvl(`cappuccinoContainerId`, "") ||
nvl(`cappuccinoLidId`, "") ||
nvl(`latteContainerId`, "") ||
nvl(`latteLidId`, "") ||
nvl(`espressoContainerId`, "") ||
nvl(`espressoLidId`, "")
),
substr0(`date`, 0, 10)
)
Then, using the covering index above, do your query like this:
SELECT
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
) AS uniqueContainersLidsCombined,
SUBSTR(date,0,10) AS day,
COUNT(*) AS cnt
FROM `bulk`.`sales`.`amer`
GROUP BY
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
),
SUBSTR(date,0,10)
Note I used the following 16 lines of data:
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T01:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T02:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T03:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T04:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T05:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T06:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
Applying some sorting by wrapping the above query with
SELECT T1.* FROM
(
-- paste above --
) AS T1
ORDER BY T1.day, T1.shopId, T1.uniqueContainersLidsCombined
We get
cnt day shopId uniqueContainersLidsCombined
1 "2022-01-01" "x001" "a001b001"
1 "2022-01-01" "x002" "a001b001"
1 "2022-01-02" "x001" "a002b002"
1 "2022-01-02" "x001" "a003b003"
1 "2022-01-02" "x002" "a002b002"
1 "2022-01-02" "x002" "a003b003"
1 "2022-01-03" "x001" "a007b005"
2 "2022-01-03" "x001" "a007b004"
2 "2022-01-03" "x002" "a007b004"
5 "2022-01-03" "x002" "a007b005"
If you still don't get the performance you need, you could possibly use the Eventing service to do a continuous map/reduce and an occasional update query to make sure things stay perfectly in sync.
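For example, the occasional update query could be an UPSERT of precomputed per-shop, per-day counts into a separate stats keyspace (a sketch; the shop_day_stats keyspace and the key scheme are assumptions, not from the original post):
UPSERT INTO shop_day_stats (KEY _k, VALUE _v)
SELECT d.shopId || "::" || DATE_FORMAT_STR(d.date,"1111-11-11") AS _k, /* assumed key scheme */
{"shopId": d.shopId,
"day": DATE_FORMAT_STR(d.date,"1111-11-11"),
"uniqueContainersLidsCombined": COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId])} AS _v
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");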

Related

ElasticSearch - Combining Queries for 4 separate randomly sorted groups?

I'm fairly new to Elasticsearch (though with a fair bit of SQL experience) and am currently struggling to put a proper query together. I have 2 boolean fields, isPlayer and isEvil, on which an entry is either true or false. Based on that, I want to split my dataset into 4 groups:
isPlayer: true, isEvil: true
isPlayer: true, isEvil: false
isPlayer: false, isEvil: true
isPlayer: false, isEvil: false
These groups I want to randomly sort within themselves, then append them into one long list that I can paginate. I'd like to do that inside the query, as that seems like the "correct" way to do it, since I'd do it similarly in SQL. In that list, the groups are to be sorted in order: first all entries of Group 1 in a random order, then all entries of Group 2 in a random order, then all entries of Group 3, etc. It is necessary that the randomness of the sorting is reproducible given the same inputs, so if the sorting is based on random_score, ideally I'd be using a seed for the randomness.
I can build a single query, but how do I combine 4?
The approaches I've found so far are MultiSearch and Disjunction Max Query. MultiSearch seems like it doesn't support pagination. Regarding Disjunction Max Query, I might be missing the forest for the trees, but I'm struggling to have the subqueries be randomly sorted only within themselves before appending them to one another.
Here is how I write a single query for now, without Disjunction Max Query, in case it helps:
{
"query": {
"bool": {
"should": [
{
"term": {
"isPlayer": true
}
},
{
"term": {
"isEvil": true
}
}
]
}
}
}
The solution to this problem is not doing 4 separate groups, but instead ensuring they all have different ranges of scores and sorting by score. This can be achieved by scoring the hits not by some kind of matching criteria, but through a script_score query. This lets you write code yourself that returns a custom score (the default language is called "Painless", but I've seen examples of Groovy as well).
The logic is fairly simple:
If isPlayer = true, add 2 points to the score
If isEvil = true, add 4 points to the score
Either way, add a random number between 0 and 1 to the score at the end
This creates the 4 groups I wanted with distinct score-ranges:
isPlayer = true, isEvil = true --> Score-range: 6-7
isPlayer = false, isEvil = true --> Score-range: 4-5
isPlayer = true, isEvil = false --> Score-range: 2-3
isPlayer = false, isEvil = false --> Score-range: 0-1
The query would look like this:
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": """
double score = 0;
if (doc['isPlayer'].value) {
score += 2;
}
if (doc['isEvil'].value) {
score += 4;
}
int partialSeed = 1;
score += randomScore(partialSeed, 'id');
return score;
"""
}
}
}
}

How to calculate DAU/MAU using KSQL?

I'm a messenger developer trying to calculate DAU/MAU from an event stream of users' requests using KSQL.
I've tried to calculate it using the following query:
CREATE TABLE ACTIVE_USER_ACTIONS_BY_1_HOUR WITH (
KAFKA_TOPIC='active-user-actions-by-1-hour'
) AS
SELECT
MCCU.UID AS UID,
COUNT(MCCU.UID) AS ACTIVITY_COUNT
FROM METRICS_REQUESTS MR
JOIN METRICS_CONTEXT_CID_UID MCCU ON MCCU.CID = MR.CID
WINDOW TUMBLING (SIZE 1 HOUR)
WHERE
MR.REQ_NAME = 'SendMessage' OR
MR.REQ_NAME = 'UpdateMessage'
GROUP BY MCCU.UID;
I'm getting the following results:
{
"order": 3,
"ROWTIME": 1570095657670,
"ROWKEY": "1365010623 : Window{start=1570093200000 end=-}",
"UID": 1365010623,
"ACTIVITY_COUNT": 3
}
{
"order": 1,
"ROWTIME": 1570095651905,
"ROWKEY": "1637035978 : Window{start=1570093200000 end=-}",
"UID": 1637035978,
"ACTIVITY_COUNT": 9
}
I don't understand how to map those rows to something like:
{
"ACTIVE_UID_COUNT": 2,
"START": 1570093200000,
"END": null
}
If you want to calculate DAU/MAU you can use the TOPKDISTINCT function, which gathers only the distinct occurrences of user_id within the incoming events.
You can use SQL like the following:
CREATE TABLE dau_1min WITH (KAFKA_TOPIC='dau-1m', VALUE_FORMAT='AVRO') AS
SELECT
country,
WINDOWSTART() AS window_timestamp,
ARRAYLENGTH(TOPKDISTINCT(user_id, 10000000)) AS dau
FROM eventssource
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY country;
As TOPKDISTINCT returns an ARRAY of distinct identifiers, you must implement a custom UDF, e.g. ARRAYLENGTH, that just returns the size of a given array.
After that you will get an object like {country, window_timestamp, dau}, which is simple enough to store in a persistent database like PostgreSQL.

Name of sorting algorithm?

I'm trying to figure out the name of a sorting algorithm (or just a method?) that sorts via 3 values.
We start off with 3 values, and the array should sort based on the id of the object, the position, and then the date it was set to that position, allowing both date and position to be the same. Please excuse my horrible explanation; I will give an example.
We have 6 positions; without any edits the array would look something like this:
{id:1,pos:0,date:0}
{id:2,pos:0,date:0}
{id:3,pos:0,date:0}
{id:4,pos:0,date:0}
{id:5,pos:0,date:0}
{id:6,pos:0,date:0}
If I were to move the first object to the second position, it would return this order:
{id:2,pos:0,date:0}
{id:1,pos:2,date:1}
{id:3,pos:0,date:0}
{id:4,pos:0,date:0}
{id:5,pos:0,date:0}
{id:6,pos:0,date:0}
However, if we were to then move the third object into the second position:
{id:2,pos:0,date:0}
{id:3,pos:2,date:2}
{id:1,pos:2,date:1}
{id:4,pos:0,date:0}
{id:5,pos:0,date:0}
{id:6,pos:0,date:0}
Note that the pos does not change, but the object is ordered before positions of the same number based on the higher date value.
We now move the 4th object into position 1
{id:4,pos:1,date:3}
{id:2,pos:0,date:0}
{id:3,pos:2,date:2}
{id:1,pos:2,date:1}
{id:5,pos:0,date:0}
{id:6,pos:0,date:0}
Note that id 2 takes position 2, even though its pos and date are still 0, because its id is less than the id behind it.
We now move id 6 to position 2
{id:4,pos:1,date:3}
{id:6,pos:2,date:4}
{id:2,pos:0,date:0}
{id:3,pos:2,date:2}
{id:1,pos:2,date:1}
{id:5,pos:0,date:0}
id 5 to position 4
{id:4,pos:1,date:3}
{id:6,pos:2,date:4}
{id:2,pos:0,date:0}
{id:5,pos:4,date:5}
{id:3,pos:2,date:2}
{id:1,pos:2,date:1}
And finally id 2 to position 6
{id:4,pos:1,date:3}
{id:6,pos:2,date:4}
{id:5,pos:4,date:5}
{id:3,pos:2,date:2}
{id:1,pos:2,date:1}
{id:2,pos:6,date:6}
I hope my examples aid any response given; I know this is not a question of much quality, and if answered I will do my best to edit the question as best I can.
Just a guess, because your final order doesn't look "sorted", lexicographical sort? See Lexicographical order.
The movement of objects is similar to insertion sort, where an entire sub-array is shifted in order to insert an object. The date indicates the order of operations that were performed, and the position indicates where the object was moved to, but there's no field for where an object was moved from. There's enough information to reproduce the sequence by starting with the initial ordering and following the moves according to the date. I don't know if the sequence can be followed in reverse with the given information.
The original ordering can be restored using any sort algorithm using the id field.
I was unfortunately unable to find the name of the 'sort' (?); however, I was able to achieve the effect I was aiming for using the code below.
(If I missed something entirely, let me know; I'll change it and credit you.)
PHP Implementation.
$data = '[
{"id":"1","pos":"1","date":"0"},
{"id":"2","pos":"5","date":"0"},
{"id":"3","pos":"4","date":"0"},
{"id":"4","pos":"3","date":"0"},
{"id":"5","pos":"4","date":"1"},
{"id":"6","pos":"2","date":"0"}
]'; //simulated data set
$arr = json_decode($data,true);
$final_arr = $arr;
$tmp_array = array();
$actions = array();
for ($i=0; $i < sizeof($arr); $i++) {
$num = $i+1;
$tmp = array();
for ($o=0; $o < sizeof($arr); $o++) {
if($arr[$o]['pos'] == 0)continue;
if($arr[$o]['pos'] == $num){
array_push($tmp,$arr[$o]);
}
}
if($tmp){
usort($tmp,function($a,$b){
return $a['date'] <=> $b['date']; // spaceship operator returns -1/0/1 as usort expects
});
for ($o=0; $o < sizeof($tmp); $o++) {
array_push($tmp_array,$tmp[$o]);
}
}
}
for ($i=0; $i < sizeof($tmp_array); $i++) {
for ($o=0; $o < sizeof($arr); $o++) {
if($final_arr[$o]['id'] == $tmp_array[$i]['id']){
array_splice($final_arr, $tmp_array[$i]['pos']-1, 0, array_splice($final_arr, $o, 1));
}
}
}
$output = json_encode($final_arr,JSON_PRETTY_PRINT);
echo $output;
Result:
[
{
"id": "1",
"pos": "1",
"date": "0"
},
{
"id": "6",
"pos": "2",
"date": "0"
},
{
"id": "4",
"pos": "3",
"date": "0"
},
{
"id": "5",
"pos": "4",
"date": "1"
},
{
"id": "2",
"pos": "5",
"date": "0"
},
{
"id": "3",
"pos": "4",
"date": "0"
}
]

How do I dynamically name a collection?

Pseudo-code: collect(n) AS :Label
The primary purpose of this is for easy reading of the properties in the API Server (node application).
Verbose example:
MATCH (user:User)--(n)
WHERE n:Movie OR n:Actor
RETURN user,
CASE
WHEN n:Movie THEN "movies"
WHEN n:Actor THEN "actors"
END as type, collect(n) as :type
Expected output in JSON:
[{
"user": {
....
},
"movies": [
{
"_id": 1987,
"labels": [
"Movie"
],
"properties": {
....
}
}
],
"actors:" [ .... ]
}]
The closest I've gotten is:
[{
"user": {
....
},
"type": "movies",
"collect(n)": [
{
"_id": 1987,
"labels": [
"Movie"
],
"properties": {
....
}
}
]
}]
The goal is to be able to read the JSON result with ease like so:
neo4j.cypher.query(statement, function(err, results) {
for result of results
var user = result.user
var movies = result.movies
}
Edit:
I apologize for any confusion caused by my inability to correctly name database semantics.
I'm wondering if it's enough just to output the user and their lists of both actors and movies, rather than trying a more complicated means of matching and combining both.
MATCH (user:User)
OPTIONAL MATCH (user)--(m:Movie)
OPTIONAL MATCH (user)--(a:Actor)
RETURN user, COLLECT(m) as movies, COLLECT(a) as actors
This query should return each User and his/her related movies and actors (in separate collections):
MATCH (user:User)--(n)
WHERE n:Movie OR n:Actor
RETURN user,
REDUCE(s = {movies:[], actors:[]}, x IN COLLECT(n) |
CASE WHEN x:Movie
THEN {movies: s.movies + x, actors: s.actors}
ELSE {movies: s.movies, actors: s.actors + x}
END) AS types;
As far as a dynamic solution to your question goes, one that will work with any node connected to your user, there are a few options, but I don't believe you can make the column names dynamic like this, or even the names of the returned collections, though we can associate them with the type.
MATCH (user:User)--(n)
WITH user, LABELS(n) as type, COLLECT(n) as nodes
WITH user, {type:type, nodes:nodes} as connectedNodes
RETURN user, COLLECT(connectedNodes) as connectedNodes
Or, if you prefer working with multiple rows, one row each per node type:
MATCH (user:User)--(n)
WITH user, LABELS(n) as type, COLLECT(n) as collection
RETURN user, {type:type, data:collection} as connectedNodes
Note that LABELS(n) returns a list of labels, since nodes can be multi-labeled. If you are guaranteed that every interested node has exactly one label, then you can use the first element of the list rather than the list itself. Just use LABELS(n)[0] instead.
You can dynamically sort nodes by label, and then convert to the map using the apoc library:
WITH ['Actor','Movie'] as LBS
// What are the nodes we need:
MATCH (U:User)--(N) WHERE size(filter(l in labels(N) WHERE l in LBS))>0
WITH U, LBS, N, labels(N) as nls
UNWIND nls as nl
// Combine the nodes on their labels:
WITH U, LBS, N, nl WHERE nl in LBS
WITH U, nl, collect(N) as RELS
WITH U, collect( [nl, RELS] ) as pairs
// Convert pairs "label - values" to the map:
CALL apoc.map.fromPairs(pairs) YIELD value
RETURN U as user, value

Summarise nested fields

I'm having a hard time getting a specific query right.
I have a schema following this style:
{
"User": "user1" ,
"active": true ,
"points": {
"2015-07": 2 ,
"2015-08": 5 ,
"2015-09": 7 ,
"2015-10": 1 ,
"2015-11": 28 ,
"2015-12": 5 ,
"2016-01": 3
}
}
{
"User": "user2" ,
"active": true ,
"points": {
"2015-01": 8 ,
"2015-02": 4 ,
"2015-09": 6 ,
"2015-10": 12 ,
"2015-11": 34 ,
"2015-12": 1 ,
"2016-01": 2
}
}
How can I write a query that will return each user and the total amount of points?
Just a small typo in Tryneus' example (below): .sum(...) should go at the end. Try this one:
r.db('test').table('test').map(function (x) {
return x.merge( {points: x('points').values().sum()})
})
For those of us unfortunate enough to still be stuck with a version of RethinkDB without a values function, the trick is to first retrieve all keys and use map to retrieve a sequence of values:
r.db("test").table("test").map(function (x) {
return x.merge({
points: x("points").keys().map(function(key) {
return x("points")(key)
}).sum()
})
})
It should be simple enough to do this with a map, summing the values of the "points" object in each row. This code is in JavaScript, so it should work in the Data Explorer, where r.expr(data) is the sequence of values you want to operate on.
r.expr(data).map(function (x) {
return x.merge({points:r.sum(x('points').values())});
})
