ArangoDB Post-filtering aggregated data - filter

I'm stuck with filtering on aggregated data using ArangoDB
Imagine a graph with user documents. Every user has a number of game consoles. So the graph is
User -[]-> GameConsole
I need a query that lists all users with their game consoles, AND I want to be able to FILTER for all users with a specific game console, but the query result still needs to show ALL consoles of the user (if they have more than one).
I took the example for post-filtering aggregated data from the ArangoDB docs: https://www.arangodb.com/docs/stable/aql/examples-grouping.html and modified it to my needs:
FOR u IN User
FOR c IN 1..1 OUTBOUND u plays
COLLECT userData = u INTO consoles = c
FILTER "GameConsole/Playstation3" IN consoles
RETURN {userData, consoles}
Expected result
[
{ "userData":
{
"_id": "User/JohnDoe",
"Name": "John Doe"
},
"consoles": [
{
"_id": "GameConsole/Playstation3",
"Name: "Playstation 3"
},
{
"_id": "GameConsole/Wii",
"Name": "Wii"
}
]
}
]
But the result is an empty array:
[
[]
]
Same for
[...]
FILTER consoles == "GameConsole/Playstation3"
FILTER consoles._id == "GameConsole/Playstation3"
FILTER consoles[*]._id == "GameConsole/Playstation3"
What is the correct query/FILTER statement to show all users that own a Playstation3 AND list all consoles they own?

I've found a solution. As consoles is an array and I need to filter on an attribute (like _id), I need to expand the array when filtering: consoles[*]._id
So the working query is
FOR u IN User
FOR c IN 1..1 OUTBOUND u plays
COLLECT userData = u INTO consoles = c
FILTER "GameConsole/Playstation3" IN consoles[*]._id
RETURN {userData, consoles}
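The [*] expansion works the same way for any attribute of the collected documents. As an untested variation (attribute names taken from the sample data above), this would keep only users who own both a Playstation 3 and a Wii:
FOR u IN User
FOR c IN 1..1 OUTBOUND u plays
COLLECT userData = u INTO consoles = c
FILTER "Playstation 3" IN consoles[*].Name AND "Wii" IN consoles[*].Name
RETURN {userData, consoles}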
Hope that helps anyone else.

Related

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. Now this database holds a field that aggregates a couple of pieces of information like so: key/value; key/value; (don't ask for the reason, I have no idea who designed it like that and why ;-) )
93/4; 34/12;
it can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all the values that are in that field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered number of these categories (they tell me how many minutes longer a specific process took) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, which had a delay up to 5 minutes and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there, but it was a completely wrong approach I guess.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline it seems ok
"delays": {
"62": "3",
"86": "2"
}
but there are two things that don't work.
I can't know upfront how many of these combinations I have, and thus converting the values from string to int in the same pipeline is an issue.
When I want to create a Kibana index pattern I end up with many fields like delay.82 and delay.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (I lack English skills, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "extracts k/v pairs",
"processors": [
{
"script": {
"source": """
def categories = ctx.categories;
def kv_pairs = new ArrayList();
for (def pair : categories.entrySet()) {
def k = pair.getKey();
def v = pair.getValue();
kv_pairs.add(["key": k, "value": v]);
}
ctx.categories = kv_pairs;
"""
}
}
]
},
"docs": [
{
"_source": {
"categories": {
"82": 4,
"93": 7
}
}
}
]
}
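Since the kv processor emits the values as strings (see the "delays" preview in the question), the string-to-int conversion can be handled in the same script. A hedged tweak of the loop body, assuming every value really is a plain integer:
for (def pair : categories.entrySet()) {
// parse the string value so it can later be aggregated numerically
kv_pairs.add(["key": pair.getKey(), "value": Integer.parseInt(pair.getValue().toString())]);
}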
P.S.: Do make sure your categories field is mapped as nested b/c otherwise you'll lose the connections between the keys & the values (also called flattening).
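A minimal mapping sketch for that (the index name and the sub-field types are assumptions, not taken from your setup):
PUT my-index
{
"mappings": {
"properties": {
"categories": {
"type": "nested",
"properties": {
"key": { "type": "keyword" },
"value": { "type": "integer" }
}
}
}
}
}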

How can I combine multimatch query with a boolquery in elasticsearch?

I am trying to write a query that combines a bool query with a multimatch query (possibly with an empty value) in Elasticsearch. Consider the following data example where each item is a document in Elasticsearch. Additional fields are redacted to reduce complexity in the example.
[
{
"name": "something",
"categories": ["python", "lib"]
},
{
"name": "test",
"categories": ["python", "lib"]
},
{
"name": "another",
"categories": ["javascript", "lib"]
}
]
What I am trying to do is write a bool query where categories must match python and lib, and then run a multimatch query on that. So my code structure is:
// assume cat.Filter holds []string{"python", "lib"}
filters := []elastic.Query{}
for _, ff := range cat.Filter {
filters = append(filters, elastic.NewTermQuery("categories", ff))
}
// create multimatch query
e := elastic.NewMultiMatchQuery("something").Fuzziness("2")
// create a query that will match all fields in an array
q := elastic.NewBoolQuery().Must(filters...).Filter(e)
hits, err := client.Search().Index(index).Query(q).Size(limit).Do(ctx)
If I run this query as is, then I get back 1 hit as expected. But if I change the multimatch query to e := elastic.NewMultiMatchQuery("some"), then I get back an empty array.
What I am trying to accomplish is:
When using e := elastic.NewMultiMatchQuery("some").Fuzziness("2"), return an array of 1 item that matches something
When I set e := elastic.NewMultiMatchQuery("").Fuzziness("2"), return an array of two items that match both categories for python and lib. (This works if I remove the multimatch filter.)
My issue is that I can do one or the other, but not both. I have a feeling it's because the Must enforces that something has to be an exact match, and some or "" does not match that. That's what I am trying to overcome: first match all the values in the array, and then query that.
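One pattern that seems to match this description, offered only as an untested sketch (searchTerm is a made-up variable holding the user input): keep the category terms in the filter context and add the multi-match clause only when the input is non-empty, so an empty search falls back to "everything in these categories":
q := elastic.NewBoolQuery().Filter(filters...)
if searchTerm != "" {
// fuzzy multi-match only when there is something to match against
q = q.Must(elastic.NewMultiMatchQuery(searchTerm).Fuzziness("2"))
}
hits, err := client.Search().Index(index).Query(q).Size(limit).Do(ctx)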

How can I filter if any value of an array is contained in another array in rethinkdb/reql?

I want to find any user who is a member of a group I can manage (using the web interface/JavaScript):
Users:
{
"id": 1
"member_in_groups": ["all", "de-south"]
},
{
"id": 2
"member_in_groups": ["all", "de-north"]
}
I tried:
r.db('mydb').table('users').filter(r.row('member_in_groups').map(function(p) {
return r.expr(['de-south']).contains(p);
}))
but both users are always returned. Which command do I have to use, and how can I use an index for this (I read about multi-indexes in https://rethinkdb.com/docs/secondary-indexes/python/#multi-indexes but there only a single value is searched for)?
I got the correct answer in the Slack channel, so I'm posting it here in case anyone else comes to this thread through googling:
First create a multi index as described in
https://rethinkdb.com/docs/secondary-indexes/javascript/, e.g.
r.db('<db-name>').table('<table-name>').indexCreate('<some-index-name>', {multi: true}).run()
(you can omit .run() if using the webadmin)
Then query the data with
r.db('<db-name>').table('<table-name>').getAll('de-north', 'de-west', {index:'<some-index-name>'}).distinct()
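Applied to the example above, with the multi index created directly on the member_in_groups field (the index name here is just the field name), that would be roughly:
r.db('mydb').table('users').indexCreate('member_in_groups', {multi: true}).run()
// once the index is ready:
r.db('mydb').table('users').getAll('de-south', {index: 'member_in_groups'}).distinct()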

AQL: Flattening document for fulltext queries

I have a somewhat complex use-case for the fulltext features in AQL. I have a large, hierarchical document that is returned as the result of a graph traversal. This constructs something like a social network feed. It's analogous to posts of various categories with comments as child documents that contain their own structures. The returned data looks something like this:
[
{
"data": {
"_key": "",
"_id": "someCollection/someKey",
"_rev": "",
"userID": "12345",
"otherAttributeOfFeedEvent": "",
.
.
.
},
"date": "2016-10-25",
"category": "",
"children": [
{
"category": "",
"child": "myCollection/childDocumentKey",
"date": "2016-10-26"
},
{ sameStructureAsAbove },
{ anotherChildLikeAbove },
]
},
{ etc }
]
Of course, the attributes that would be fulltext searched for each of these event types that go into a feed are different and numerous, and I need to, for a given user input, search them all simultaneously. My initial thought is that, since the _key of each document, no matter whether a parent or child in the feed, is guaranteed to be listed in this structure, I could create some sort of collection that contains all the documents as identified by their keys.
A challenge is that this fulltext search needs to retain the hierarchy. Back to the social network comments analogy, if a user searches a term that exists in a comment (i.e. a child event), the query should return the parent event with a flag on every child event that matched the term, so that the interface can display the context for the search result (else, a secondary query to get the context would be needed).
This hierarchical structure as defined above is generated by a graph traversal on a graph with a structure that looks something like this:
profile ---> event ---> childEvent
| ^
| |
\------------------/
The query that generates the data looks something like this:
let events = (
for v, e, p in 1..3 outbound #profileKey graph 'myGraph' options { "uniqueEdges": "global"}
filter e.type == "hasEvent"
filter p.edges[0].category in ["cat1", "cat2", "cat3"]
filter e.category in ["cat1", "cat2", "cat3"]
let children = (
for v1, e1, p1 in outbound v._id graph 'myGraph'
filter e1.type =="hasEvent" or e1.isChildEvent == "True"
sort (e1.date) desc
return {category: e1.category, child: v1._id, date: e1.date }
)
let date = e.date
let category = e.category
let data = v
return distinct { data: data, date: date, category: category, children: children }
)
for event in events
sort(event.date) desc
return event
Bottom line
So to sum up my question: I need to write AQL that will perform fulltext search on several attributes from every document that shows up in the described feed and return a structured result, or something that can be used in a structured result, to display a feed of the same structure as described above containing only events that match or have children that match the fulltext search results.
In my testing, I tried creating a query like this:
let events = (
FOR v, e, p in 1..3 OUTBOUND 'myCollection/myDocument' GRAPH 'myGraph' OPTIONS { "uniqueEdges": "global" }
FILTER e.type == "hasEvent"
FILTER (p.edges[0].category in ["cat1", "cat2", "cat3"] )
FILTER (e.category in ["cat1","cat2","cat3"] )
LET children = (
FOR v1, e1, p1 in OUTBOUND v._id GRAPH 'myGraph'
FILTER e1.type == "hasEvent" OR e1.isChildEvent == "True"
SORT(e1.date) DESC
RETURN {category: e1.category, _id: v1._id, date: e1.date}
)
let date = e.date
let category = e.category
let data = v
RETURN DISTINCT {data: data, date: date, category: category, children: children}
)
let eventIds = (
for event in events
return event.data._id
)
let childEventIds = (
for event in events
for child in event.children
return child._id
)
let allIds = append(eventIds, childEventIds)
let allDocs = (for doc in allIds
return document(doc))
let firstAttributeMatches = (for doc in fulltext(allDocs, "firstAttribute", #queryTerm)
return doc._id)
let secondAttributeMatches = (for doc in fulltext(allDocs, "secondAttribute", #queryTerm)
return doc._id)
let nthAttributeMatches = (for doc in fulltext(allDocs, "nthAttribute", #queryTerm)
return doc._id)
let results = union_distinct(firstAttributeMatches,secondAttributeMatches,nthAttributeMatches)
return results
But this had the error: Query: invalid argument type in call to function 'FULLTEXT()' (while executing)
Presumably, even though there are fulltext indices on all of the attributes I used, because I've collected all these documents into a new collection that is not also fulltext indexed, I cannot simply call fulltext() on them. Does this mean my best bet is to just get a list of all the document collections returned by my first query, perform global fulltext searches on those collections, then inner-join the result to the result of my first query? That sounds extremely complex and time-intensive. Is there some simpler way to do what I'm after?
My next try looked more like this:
let events = (
FOR v, e, p in 1..3 OUTBOUND 'myCollection/myDocument' GRAPH 'myGraph' OPTIONS { "uniqueEdges": "global" }
FILTER e.type == "hasEvent"
FILTER (p.edges[0].category in ["cat1", "cat2", "cat3"] )
FILTER (e.category in ["cat1", "cat2", "cat3"] )
LET children = (
FOR v1, e1, p1 in OUTBOUND v._id GRAPH 'myGraph'
FILTER e1.type == "hasEvent" OR e1.isChildEvent == "True"
SORT(e1.date) DESC
RETURN {category: e1.category, _id: v1._id, date: e1.date}
)
let date = e.date
let category = e.category
let data = v
RETURN DISTINCT {data: data, date: date, category: category, children: children}
)
let eventIds = (
for event in events
return event.data._id
)
let childEventIds = (
for event in events
for child in event.children
return child._id
)
let allIds = append(eventIds, childEventIds)
let losCollections = (for id in allIds
return distinct parse_identifier(id).collection)
let searchAttrs = ["attr1","attr2","attr3","attrN"]
for col in losCollections
for attr in searchAttrs
return (for doc in fulltext(col, attr, #queryTerm) return doc._id)
But this seems to fail whenever it tries an attribute that doesn't have a fulltext index in that collection. Maybe there's a way in AQL to check whether an attribute has a fulltext index, and to only perform the query in that case?
First a few general remarks:
Currently, a fulltext index can only index documents from one collection and can only look at the string value of a single attribute. Corresponding FULLTEXT searches in AQL will only be able to use a single such index and thus will only look into one collection and one attribute. If this is not enough one has to run multiple FULLTEXT queries and unite the results.
A graph query is faster if the full path does not have to be built, so instead of
for v, e, p in 1..3 outbound #profileKey graph 'myGraph' options {"uniqueEdges": "global"}
filter e.type == "hasEvent"
filter p.edges[0].category in ["cat1", "cat2", "cat3"]
filter e.category in ["cat1", "cat2", "cat3"]
one should rather write
for v, e in 1..3 outbound #profileKey graph 'myGraph' options {"uniqueEdges": "global"}
filter e.type == "hasEvent"
filter e.category in ["cat1", "cat2", "cat3"]
which is equivalent but faster (the last filter implies the middle one).
If you have a query of the form
let events = (... return xyz)
for event in events
sort event.date desc
return event
it is usually better to avoid the subquery by writing
...
let event=xyz
sort event.date desc
return event
because then the query engine is not forced to compute the result of the whole subquery before starting with the bottom for statement.
Now I am coming to your concrete question at hand: Both your approaches fail because the FULLTEXT function in AQL can only be used on an existing collection with an existing fulltext index. In particular, it cannot be used to perform a fulltext search on intermediate results produced earlier in the AQL query. That is because an efficient fulltext search needs a fulltext index structure, which does not exist for the intermediate results.
Therefore, my hunch would be that if you want to perform a fulltext search on profiles, events and child events at the same time, you would have to first perform the fulltext search using an index, and then from each result put together the hierarchy as needed using a graph query.
I see two basic approaches to this. The first would be to do three independent fulltext searches on each of the existing collections, and then run a separate graph query for each result to put together the hierarchy. This would have to be different depending on whether your fulltext search finds a profile, event or child-event. Using subqueries, these three approaches could all be done in a single AQL query.
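A rough sketch of the first approach (collection and attribute names are placeholders, and each collection is assumed to have a fulltext index on a searchText attribute):
let profileHits = (for d in fulltext(profiles, "searchText", #queryTerm) return d._id)
let eventHits = (for d in fulltext(events, "searchText", #queryTerm) return d._id)
let childHits = (for d in fulltext(childEvents, "searchText", #queryTerm) return d._id)
for id in union_distinct(profileHits, eventHits, childHits)
// for each id, a traversal like the one in your query would rebuild the feed entry
return id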
The second is to have an additional collection for fulltext search, in which there would be a document for each of the documents in all of the three other collections, which contains the attribute to be fulltext searched. Yes, this is a data denormalisation and it needs extra memory space and extra effort when saving and updating the data, but it would probably speed up the fulltext search.
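For the second approach, the extra collection would contain one small document per searchable source document, e.g. { "source": "someCollection/someKey", "searchText": "all searchable attribute values concatenated" }, and the search itself becomes a single lookup (collection and attribute names again just placeholders):
for hit in fulltext(searchDocs, "searchText", #queryTerm)
return document(hit.source)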
The other idea I would like to mention is that the complexity of your query has reached a level at which one should consider writing it in JavaScript (run on the server, probably in a Foxx app). There it would be relatively straightforward to implement the query logic in a procedural manner. My hunch would be that one could even improve performance this way, even if the JS code has to issue multiple AQL queries. At the very least I would expect the code to be easier to understand.

Grouping non null fields together in Kibana

Given the following three User entries in an Elasticsearch index:
"user": [
{
"userId": "100",
"hobby": "chess"
}
"user": [
{
"userId": "200",
"hobby": "music"
}
"user": [
{
"userId": "300",
"hobby": ""
}
I want to create a vertical bar chart to compare the number of users who have a hobby as opposed to those who do not. Individual hobbies should not be shown separately, but grouped together.
If split along the Y axis, one block would take up two thirds of the height (the two users with hobbies) and one block one third of the height (the one user with no hobbies).
How could one achieve this grouping in Kibana?
Thanks
You'll need to choose Split Bars and then Filters aggregation. Once you have that selected you should see Query 1 with * in it. Change the * to hobby:*. Next hit Add Filter and put in NOT hobby:*
The filters aggregation lets you bucket things pretty much any way you can search for things.
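For reference, the equivalent request directly against Elasticsearch (index name is just a placeholder) is a filters aggregation with two named query_string filters, which is roughly what Kibana builds from those inputs:
POST users/_search
{
"size": 0,
"aggs": {
"hobby_buckets": {
"filters": {
"filters": {
"has_hobby": { "query_string": { "query": "hobby:*" } },
"no_hobby": { "query_string": { "query": "NOT hobby:*" } }
}
}
}
}
}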
