AQL: Flattening document for fulltext queries

I have a somewhat complex use-case for the fulltext features in AQL. I have a large, hierarchical document that is returned as the result of a graph traversal. This constructs something like a social network feed. It's analogous to posts of various categories with comments as child documents that contain their own structures. The returned data looks something like this:
[
{
"data": {
"_key": "",
"_id": "someCollection/someKey",
"_rev": "",
"userID": "12345",
"otherAttributeOfFeedEvent": "",
.
.
.
},
"date": "2016-10-25",
"category": "",
"children": [
{
"category": "",
"child": "myCollection/childDocumentKey",
"date": "2016-10-26"
},
{ sameStructureAsAbove },
{ anotherChildLikeAbove },
]
},
{ etc }
]
Of course, the attributes that would be fulltext searched for each of these event types that go into a feed are different and numerous, and I need to, for a given user input, search them all simultaneously. My initial thought is that, since the _key of each document, no matter whether a parent or child in the feed, is guaranteed to be listed in this structure, I could create some sort of collection that contains all the documents as identified by their keys.
A challenge is that this fulltext search needs to retain the hierarchy. Back to the social network comments analogy, if a user searches a term that exists in a comment (i.e. a child event), the query should return the parent event with a flag on every child event that matched the term, so that the interface can display the context for the search result (else, a secondary query to get the context would be needed).
This hierarchical structure as defined above is generated by a graph traversal on a graph with a structure that looks something like this:
profile ---> event ---> childEvent
| ^
| |
\------------------/
The query that generates the data looks something like this:
let events = (
  for v, e, p in 1..3 outbound @profileKey graph 'myGraph' options { "uniqueEdges": "global" }
    filter e.type == "hasEvent"
    filter p.edges[0].category in ["cat1", "cat2", "cat3"]
    filter e.category in ["cat1", "cat2", "cat3"]
    let children = (
      for v1, e1, p1 in outbound v._id graph 'myGraph'
        filter e1.type == "hasEvent" or e1.isChildEvent == "True"
        sort (e1.date) desc
        return { category: e1.category, child: v1._id, date: e1.date }
    )
    let date = e.date
    let category = e.category
    let data = v
    return distinct { data: data, date: date, category: category, children: children }
)
for event in events
  sort (event.date) desc
  return event
Bottom line
So to sum up my question: I need to write AQL that will perform fulltext search on several attributes from every document that shows up in the described feed and return a structured result, or something that can be used in a structured result, to display a feed of the same structure as described above containing only events that match or have children that match the fulltext search results.
In my testing, I tried creating a query like this:
let events = (
  FOR v, e, p in 1..3 OUTBOUND 'myCollection/myDocument' GRAPH 'myGraph' OPTIONS { "uniqueEdges": "global" }
    FILTER e.type == "hasEvent"
    FILTER (p.edges[0].category in ["cat1", "cat2", "cat3"])
    FILTER (e.category in ["cat1", "cat2", "cat3"])
    LET children = (
      FOR v1, e1, p1 in OUTBOUND v._id GRAPH 'myGraph'
        FILTER e1.type == "hasEvent" OR e1.isChildEvent == "True"
        SORT (e1.date) DESC
        RETURN { category: e1.category, _id: v1._id, date: e1.date }
    )
    let date = e.date
    let category = e.category
    let data = v
    RETURN DISTINCT { data: data, date: date, category: category, children: children }
)
let eventIds = (
  for event in events
    return event.data._id
)
let childEventIds = (
  for event in events
    for child in event.children
      return child._id
)
let allIds = append(eventIds, childEventIds)
let allDocs = (
  for doc in allIds
    return document(doc)
)
let firstAttributeMatches = (
  for doc in fulltext(allDocs, "firstAttribute", @queryTerm)
    return doc._id
)
let secondAttributeMatches = (
  for doc in fulltext(allDocs, "secondAttribute", @queryTerm)
    return doc._id
)
let nthAttributeMatches = (
  for doc in fulltext(allDocs, "nthAttribute", @queryTerm)
    return doc._id
)
let results = union_distinct(firstAttributeMatches, secondAttributeMatches, nthAttributeMatches)
return results
But this had the error: Query: invalid argument type in call to function 'FULLTEXT()' (while executing)
Presumably, even though there are fulltext indices on all of the attributes I used, because I've collected all these documents into a new collection that is not also fulltext indexed, I cannot simply call fulltext() on them. Does this mean my best bet is to just get a list of all the document collections returned by my first query, perform global fulltext searches on those collections, then inner-join the result to the result of my first query? That sounds extremely complex and time-intensive. Is there some simpler way to do what I'm after?
My next try looked more like this:
let events = (
  FOR v, e, p in 1..3 OUTBOUND 'myCollection/myDocument' GRAPH 'myGraph' OPTIONS { "uniqueEdges": "global" }
    FILTER e.type == "hasEvent"
    FILTER (p.edges[0].category in ["cat1", "cat2", "cat3"])
    FILTER (e.category in ["cat1", "cat2", "cat3"])
    LET children = (
      FOR v1, e1, p1 in OUTBOUND v._id GRAPH 'myGraph'
        FILTER e1.type == "hasEvent" OR e1.isChildEvent == "True"
        SORT (e1.date) DESC
        RETURN { category: e1.category, _id: v1._id, date: e1.date }
    )
    let date = e.date
    let category = e.category
    let data = v
    RETURN DISTINCT { data: data, date: date, category: category, children: children }
)
let eventIds = (
  for event in events
    return event.data._id
)
let childEventIds = (
  for event in events
    for child in event.children
      return child._id
)
let allIds = append(eventIds, childEventIds)
let losCollections = (
  for id in allIds
    return distinct parse_identifier(id).collection
)
let searchAttrs = ["attr1", "attr2", "attr3", "attrN"]
for col in losCollections
  for attr in searchAttrs
    return (for doc in fulltext(col, attr, @queryTerm) return doc._id)
But this seems to fail whenever it tries an attribute that isn't a fulltext index in the collection. Maybe there's a way in AQL to check if the attribute has a fulltext index, then only perform the query in that case?

First a few general remarks:
Currently, a fulltext index can only index documents from one collection and can only look at the string value of a single attribute. A corresponding FULLTEXT search in AQL can only use a single such index and thus only looks into one collection and one attribute. If this is not enough, one has to run multiple FULLTEXT queries and unite the results.
A graph query is faster if the full path does not have to be built, so instead of
  for v, e, p in 1..3 outbound @profileKey graph 'myGraph' options {"uniqueEdges": "global"}
    filter e.type == "hasEvent"
    filter p.edges[0].category in ["cat1", "cat2", "cat3"]
    filter e.category in ["cat1", "cat2", "cat3"]
one should rather write
  for v, e in 1..3 outbound @profileKey graph 'myGraph' options {"uniqueEdges": "global"}
    filter e.type == "hasEvent"
    filter e.category in ["cat1", "cat2", "cat3"]
which is faster because the path variable p is no longer needed. Note that e is the last edge of the path while p.edges[0] is the first, so the last filter implies the middle one only at depth 1. At depths 2 and 3 the middle filter additionally constrains the first edge, so drop it only if you do not need that constraint.
If you have a query of the form
  let events = (... return xyz)
  for event in events
    sort event.date desc
    return event
it is usually better to avoid the subquery by writing
  ...
  let event = xyz
  sort event.date desc
  return event
because then the query engine is not forced to compute the result of the whole subquery before starting with the bottom for statement.
Now to your concrete question: both of your approaches fail because the FULLTEXT function in AQL can only be used on an existing collection with an existing fulltext index. In particular, it cannot be used to perform a fulltext search on intermediate results produced earlier in the AQL query. That is because an efficient fulltext search needs a fulltext index structure, which does not exist for intermediate results.
Therefore, my hunch is that if you want to perform a fulltext search on profiles, events and child-events at the same time, you have to perform the fulltext search using an index first, and then put together the hierarchy for each result using a graph query.
I see two basic approaches to this. The first would be to do three independent fulltext searches, one on each of the existing collections, and then run a separate graph query for each result to put together the hierarchy. That graph query would have to differ depending on whether the fulltext search found a profile, an event or a child-event. Using subqueries, these three searches could all be done in a single AQL query.
The second is to have an additional collection for fulltext search, in which there would be a document for each of the documents in all of the three other collections, which contains the attribute to be fulltext searched. Yes, this is a data denormalisation and it needs extra memory space and extra effort when saving and updating the data, but it would probably speed up the fulltext search.
One other idea I would like to mention is that your query has reached a level of complexity at which one should consider writing it in JavaScript (run on the server, probably in a Foxx app). There it would be relatively straightforward to implement the query logic in a procedural manner. My hunch is that one could even improve performance this way, even if the JS code has to issue multiple AQL queries. At the very least, I would expect the code to be easier to understand.
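To illustrate the procedural step, here is a minimal sketch of how the matched ids could be merged back into the feed once the fulltext searches have run. It is written in Python rather than server-side JavaScript purely for illustration. The function name flag_feed, the "matched" attribute and the sample ids are assumptions made for this example; the set of matched ids is assumed to be the united result of the separate FULLTEXT queries.

```python
from typing import Any


def flag_feed(feed: list[dict[str, Any]], matched_ids: set[str]) -> list[dict[str, Any]]:
    """Keep events that match or have a matching child, and flag every hit."""
    result = []
    for event in feed:
        # Copy each child and record whether its document id was a fulltext hit.
        children = [
            {**child, "matched": child["child"] in matched_ids}
            for child in event["children"]
        ]
        event_matched = event["data"]["_id"] in matched_ids
        # Drop events where neither the event itself nor any child matched.
        if event_matched or any(c["matched"] for c in children):
            result.append({**event, "matched": event_matched, "children": children})
    return result


# Hypothetical feed in the shape returned by the traversal in the question.
feed = [
    {
        "data": {"_id": "events/1"},
        "date": "2016-10-25",
        "category": "cat1",
        "children": [
            {"category": "cat2", "child": "comments/10", "date": "2016-10-26"},
            {"category": "cat2", "child": "comments/11", "date": "2016-10-27"},
        ],
    },
    {"data": {"_id": "events/2"}, "date": "2016-10-24", "category": "cat1", "children": []},
]

flagged = flag_feed(feed, {"comments/11"})
```

Because the parent event is kept whenever one of its children matches, the interface gets the context for a hit without a second query.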

Related

Counting occurrences of search terms in Elasticsearch function score script

I have an Elasticsearch index with document structure like below.
{
"id": "foo",
"tags": ["Tag1", "Tag2", "Tag3"],
"special_tags": ["SpecialTag1", "SpecialTag2", "SpecialTag3"],
"reserved_tags": ["ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"],
// rest of the document
}
The fields tags, special_tags, reserved_tags are stored separately for multiple use cases. In one of the queries, I want to order the documents by number of occurrences for searched tags in all the three fields.
For example, if I am searching with three tags Tag1, Tag4 and SpecialTag3, total occurrences are 2 in the above document. Using this number, I want to add a custom score to this document and sort by the score.
I am already using function_score as there are few other attributes on which the scoring depends. To compute the matched number, I tried painless script like below.
def matchedTags = 0;
def searchedTags = ["Tag1", "Tag4", "SpecialTag3"];
for (int i = 0; i < searchedTags.length; ++i) {
  if (doc['tags'].contains(searchedTags[i])) {
    matchedTags++;
    continue;
  }
  if (doc['special_tags'].contains(searchedTags[i])) {
    matchedTags++;
    continue;
  }
  if (doc['reserved_tags'].contains(searchedTags[i])) {
    matchedTags++;
  }
}
// logic to score on matchedTags (returning matchedTags for simplicity)
return matchedTags;
This runs as expected, but extremely slowly. I assume that ES has to count the occurrences for each doc and cannot use indexes here. (If someone can shed light on how this works internally or provide documentation/resource links, that would be helpful.)
I want to have two scoring functions:
1. Score as a function of number of occurrences
2. Score higher for higher occurrences. This is basically same as 1, but the repeated occurrences would be counted.
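To make the difference between the two counts concrete, here is a small sketch in plain Python. It only illustrates the arithmetic on the sample document above and is not a Painless script or an Elasticsearch API; the function names distinct_matches and total_occurrences are made up for this example.

```python
FIELDS = ("tags", "special_tags", "reserved_tags")

doc = {
    "id": "foo",
    "tags": ["Tag1", "Tag2", "Tag3"],
    "special_tags": ["SpecialTag1", "SpecialTag2", "SpecialTag3"],
    "reserved_tags": ["ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"],
}


def distinct_matches(doc: dict, searched: list[str]) -> int:
    """Count each searched tag once if it appears in any of the three fields."""
    present = set()
    for field in FIELDS:
        present.update(doc.get(field, []))
    return len(present & set(searched))


def total_occurrences(doc: dict, searched: list[str]) -> int:
    """Count every appearance of a searched tag, including repeats across fields."""
    wanted = set(searched)
    return sum(1 for field in FIELDS for tag in doc.get(field, []) if tag in wanted)


searched = ["Tag1", "Tag4", "SpecialTag3"]
print(distinct_matches(doc, searched))   # 2: Tag1 and SpecialTag3
print(total_occurrences(doc, searched))  # 3: Tag1 appears in two fields
```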
Is there any way where I can get benefits of both faster searching and also the custom scoring using script?
Any help is appreciated. Thanks.

Elasticsearch: find document doesn't have intersecting nested items

I have a document with a nested collection, and the goal is to find those documents that don't have any inner items intersecting a particular period, also taking each item's status into account.
inb4 I've found at least two SO items that didn't help me (probably I'm noob):
ElasticSearch - find all documents whose nested documents do not intersect with date range
and
Elasticsearch inverse range overlap query
So, my document, let's say, simple (I'll paste a JSON definition, just for simplicity. all mappings are fine, trust me ;) ):
{
"maintenances": [ // <-- this is nested collection
{
"start": "date/time",
"end": "date/time",
"status": boolean
} ]
}
What I need is to write a query for documents that don't have any active (status = true) maintenance items intersecting some period (from and to for instance).
I started from simple expression:
must_not { nested { exists { field: maintenances } } }
or
must_not { nested { must [
{ maintenances.start <= to },
{ maintenances.end >= from },
{ status = true }
] } }
That returned me any document from test entries.
Remembering that a nested query returns the outer document if any nested document matches the expression, I decided to make the query more complex, something like:
maintenances not exists
or
(
any maintenance within range is not active
and
any maintenance outside range is active
)
But it became clear (not very quickly, unfortunately) that this query doesn't work for various edge cases, such as when all nested maintenance items are inactive or all of them are outside the requested bounds.
Currently I'm not sure, but I assume the query would need as many OR-ed clauses as there are edge cases? Like:
maintenances not exists
or
(
any maintenance within range is not active
and
any maintenance outside range
)
or
(
any maintenance within range is not active
and
no maintenance outside range
)
or
(
no maintenance within range
and
any maintenance outside range
)
or
OVER9000 of them
Does anyone know the simplest way to query Elastic for my case?
all mappings are fine, trust me
Suddenly (not really) this turned out to be the issue: the missing mapping for status prevented the data from being filtered correctly.
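For checking query results against test data, the intended condition can also be written as a plain predicate. This is a minimal sketch in Python, assuming each maintenance item carries start, end and status as in the document above; the function name is_free and the sample dates are made up for illustration.

```python
from datetime import datetime


def is_free(doc: dict, period_from: datetime, period_to: datetime) -> bool:
    """True if no active maintenance item overlaps [period_from, period_to]."""
    return not any(
        m["status"] and m["start"] <= period_to and m["end"] >= period_from
        for m in doc.get("maintenances", [])
    )


period_from = datetime(2020, 1, 10)
period_to = datetime(2020, 1, 20)

active_overlap = {"maintenances": [
    {"start": datetime(2020, 1, 15), "end": datetime(2020, 1, 25), "status": True},
]}
inactive_overlap = {"maintenances": [
    {"start": datetime(2020, 1, 15), "end": datetime(2020, 1, 25), "status": False},
]}
active_outside = {"maintenances": [
    {"start": datetime(2020, 2, 1), "end": datetime(2020, 2, 5), "status": True},
]}
no_maintenances = {}
```

The predicate is a single negation over "any active item overlaps the period", so the edge cases (no items, all inactive, all outside the range) need no separate clauses. That is the same shape as the second attempt above: one must_not around a nested query with the three conditions.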

Couchbase Filter Query -> number in range between two numbers using Spring Data Couchbase (SpEL notation)

I'm trying to make a query in a Couchbase Database. The idea is to retrieve the elements which are in the range of two numbers. I'm using Spring Data Couchbase.
My query looks like this:
@Query("#{#n1ql.selectEntity} WHERE #{#n1ql.filter} AND $age BETWEEN minAge AND maxAge ")
Optional<Room> findByMinAgeAndMaxAge(@Param("age") int age);
But
Unable to execute query due to the following n1ql errors:
{"msg":"No index available on keyspace bucketEx that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online.","code":4000}
This is what I get in the console:
SELECT META(`bucketEx`).id AS _ID, META(`bucketEx`).cas AS _CAS, `bucketEx`.* FROM `bucketEx` WHERE `docType` = \"com.rccl.middleware.engine.repository.model.salon\" AND $age BETWEEN minAge AND maxAge ","$age":7,"scan_consistency":"statement_plus"}
My question is whether I have to create indexes for the two fields (minAge and maxAge) or whether there is another issue with my query. I'm just starting with Couchbase and am not quite sure what is happening.
My document looks like this:
{
"salons": [
{
"name": "salon_0",
"id": "salon-00",
"maxAge": 6,
"minAge": 3
}
],
"docType": "com.rccl.middleware.engine.repository.model.salon"
}
The age range you are looking for is inside the salons array. If you want the document when any one of the array objects matches, you should use an array index on one of the fields.
CREATE INDEX ix1 ON bucketEx(DISTINCT ARRAY v.maxAge FOR v IN salons END)
WHERE `docType` = "com.rccl.middleware.engine.repository.model.salon";
SELECT META( b ).id AS _ID, META( b ).cas AS _CAS, b.*
FROM `bucketEx` AS b
WHERE b.`docType` = "com.rccl.middleware.engine.repository.model.salon" AND
ANY v IN b.salons SATISFIES $age BETWEEN v.minAge AND v.maxAge END;
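As a cross-check of what the ANY ... SATISFIES clause selects, the same condition can be sketched in plain Python. This only mirrors the query logic on the sample document above and does not use any Couchbase API; the function name room_matches is made up for this example.

```python
def room_matches(doc: dict, age: int) -> bool:
    """True if any salon's [minAge, maxAge] range contains age (bounds inclusive, like BETWEEN)."""
    return any(s["minAge"] <= age <= s["maxAge"] for s in doc.get("salons", []))


doc = {
    "salons": [
        {"name": "salon_0", "id": "salon-00", "maxAge": 6, "minAge": 3},
    ],
    "docType": "com.rccl.middleware.engine.repository.model.salon",
}
```

With the sample document, an age of 7 (the value in the console output above) matches no salon, while ages 3 through 6 do.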

Single query to return documents sorted by distance based on one documents Id rather than its geopoint

I have an index in Elasticsearch which contains an Id field and a geopoint.
Right now, in order to get the nearest documents, I have to make two queries: one to get the original document by its id, and a second that uses its coordinates to do a geosort. I was wondering if there is any way to execute this as a single query.
public IEnumerable<RestaurantSearchItem> GetNearbyRestaurants(double latitude, double longitude)
{
var query = _elasticClient.Search<RestaurantSearchItem>(s =>
s.Index(RestaurantSearchItem.IndexName)
.Sort(
ss =>ss.GeoDistance(
g => g
.Field(p => p.Location)
.DistanceType(GeoDistanceType.Plane)
.Unit(DistanceUnit.Meters)
.Order(SortOrder.Ascending)
.Points(new GeoLocation(latitude,longitude)))));
var nearByRestaurants = query.Documents;
foreach (var restaurant in nearByRestaurants)
{
restaurant.Distance = Convert.ToDouble(query.Hits.Single(x => x.Id == restaurant.Id).Sorts.Single());
}
return nearByRestaurants;
}
I don't think it's possible to do this in one query; the latitude and longitude used for sorting can't be looked up from elsewhere in the data, so needs to be supplied in the request.
As far as I know, the only Elasticsearch query that accepts the id of a document as a parameter is the terms query, which fetches the list of terms for the query from the given document.
But you want to find relevant documents based on location, not exact terms.
This can be achieved with denormalization of your data. It might look like storing the list of nearby restaurants in a nested field.
In the case of denormalization you will have to pre-compute all nearby restaurants before inserting the document in the index.
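The pre-computation step could look roughly like the following sketch in Python. It assumes the restaurants are available as a mapping from id to (latitude, longitude) and uses the haversine formula for distance; the names haversine_m and nearby_ids, the parameter k and the sample coordinates are all made up for illustration, and writing the resulting lists into the index is left out.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearby_ids(restaurants: dict[str, tuple[float, float]], k: int) -> dict[str, list[str]]:
    """For every restaurant, list the ids of its k nearest neighbours, closest first."""
    result = {}
    for rid, (lat, lon) in restaurants.items():
        others = sorted(
            (haversine_m(lat, lon, olat, olon), oid)
            for oid, (olat, olon) in restaurants.items()
            if oid != rid
        )
        result[rid] = [oid for _, oid in others[:k]]
    return result


restaurants = {"a": (0.0, 0.0), "b": (0.0, 0.01), "c": (0.0, 1.0)}
nearby = nearby_ids(restaurants, k=2)
```

This compares every pair of restaurants, so it is only practical for modest data sets, and the stored lists have to be recomputed whenever a restaurant is added or moved.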

Elastic Search using NEST

How can I sort data on multiple fields in Elasticsearch using NEST queries?
I need to sort the result set on two fields, let's say price and kilometer for cars. I want the results sorted on these fields in ASC or DESC order.
How can I get the top five results based on conditions or filters?
Please provide some links if available.
This should answer your questions:
IEnumerable<string> searchableFields = new List<string>() { "price", "kilometer" };
SearchDescriptor<T> descriptor = new SearchDescriptor<T>();
descriptor = descriptor.Size(5).OnFields(searchableFields)
.Sort(s => s.OnField("price").Descending().OnField("kilometer").Ascending());
var result = client.Search<T>(body => descriptor);
