Elasticsearch - aggregating multi level hierarchy

I am facing a problem with providing aggregated search results for documents with a multi-level hierarchy. The simplified document structure looks like this:
Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)
Every level of the hierarchy is mapped to its parent by the attribute "parentDocumentId".
I have prepared a simple query, which works just fine for a hierarchy with just 2 levels:
POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
  "query": {
    "multi_match": {
      "query": "hunter",
      "fields": [ "title", "text", "labels" ]
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "parentDocumentId"
      }
    }
  }
}
This query is able to search through the text of pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents aggregated by parentDocumentId. However, these buckets represent just the "Magazine issues" which contain these pages.
Response:
{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 44,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_agg": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 5,
          "doc_count": 43
        },
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }
}
What I need is to be able to aggregate the search results at the highest possible level, which in this particular case means aggregating at the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should definitely be done in Elasticsearch (for performance and other reasons).
Does anybody have experience with a similar aggregation? Are Elasticsearch aggregations the right approach to use?
Every idea is welcome.
Thanks
Peter
Update:
Our mapping looks like this:
{
  "my_index": {
    "mappings": {
      "document": {
        "properties": {
          "dateIssued": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "documentId": {
            "type": "long"
          },
          "filter": {
            "properties": {
              "geo_bounding_box": {
                "properties": {
                  "issuedLocation": {
                    "properties": {
                      "bottom_right": {
                        "properties": {
                          "lat": {
                            "type": "double"
                          },
                          "lon": {
                            "type": "double"
                          }
                        }
                      },
                      "top_left": {
                        "properties": {
                          "lat": {
                            "type": "double"
                          },
                          "lon": {
                            "type": "double"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "issuedLocation": {
            "type": "geo_point"
          },
          "labels": {
            "type": "string"
          },
          "locationLinks": {
            "type": "geo_point"
          },
          "parentDocumentId": {
            "type": "long"
          },
          "query": {
            "properties": {
              "match_all": {
                "type": "object"
              }
            }
          },
          "storedLocation": {
            "type": "geo_point"
          },
          "text": {
            "type": "string"
          },
          "title": {
            "type": "string"
          },
          "type": {
            "type": "string"
          }
        }
      }
    }
  }
}
That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press, so sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the pages level.
To distinguish the type of document there is a "type" attribute.
When indexing the top levels (these contain especially book metadata) we leave the "text" attribute empty, always specifying the parent of a document using parentDocumentId. The top-level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only the text attribute and the parentDocumentId of the indexed document.
The link used is very similar to a classic one-to-many mapping (a magazine has many years, which have many issues, which have many pages).
You could also say that we have flattened the nested documents in Elasticsearch; the reason for this is that there are multiple document types that can have different levels of hierarchy. A sketch of the resulting flattened documents is shown below.
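For illustration, here is a minimal sketch of such flattened documents; the IDs and values are hypothetical, not taken from the real index:
PUT my_index/document/1
{ "documentId": 1, "parentDocumentId": 0, "type": "magazineTitle", "title": "Hunting" }

PUT my_index/document/2
{ "documentId": 2, "parentDocumentId": 1, "type": "magazineYear", "title": "1999" }

PUT my_index/document/5
{ "documentId": 5, "parentDocumentId": 2, "type": "magazineIssue", "title": "II." }

PUT my_index/document/100
{ "documentId": 100, "parentDocumentId": 5, "type": "page", "text": "... the hunter waited ..." }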

You need to rethink your data modelling. In essence, you need a join over your data, and moreover the join needs to be over an arbitrarily deep hierarchy. That is a problem even in relational databases, let alone in a fulltext search engine like Elasticsearch.
Elasticsearch does support a couple of join flavours. You could use nested documents: a single document with all the subdocs nested inside it (underneath, this uses Lucene's block join). That's clearly not ideal in your case.
You could use the parent-child relationship feature, which lets you index your (sub-)docs separately, always referring to their parent. However, to aggregate over a hierarchy, you would have to explicitly specify the join, listing all the intermediate steps. You want to always aggregate by the top-most available doc, but that could be a different level each time (once a magazine, another time a magazine collection or perhaps a publisher).
I would consider indexing each doc with a field pointing to the top-most document. Then you can easily aggregate by that field. It would mean precomputing a part of the complex aggregation you want to do, but it would result in fast aggregations, and updates wouldn't be very painful either. It all depends on the source of your data, how you expect it to change, and what updates and other queries you'll need to do.
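A minimal sketch of that idea, assuming a hypothetical topDocumentId field that your indexer precomputes by walking up the parentDocumentId chain to the root:
// Each page carries a pointer to its top-most ancestor
PUT my_index/document/100
{ "text": "... the hunter waited ...", "parentDocumentId": 5, "topDocumentId": 1 }

// The aggregation from the question then becomes a plain terms aggregation on that field
POST my_index/document/_search?search_type=count
{
  "query": {
    "multi_match": { "query": "hunter", "fields": [ "title", "text", "labels" ] }
  },
  "aggregations": {
    "my_agg": { "terms": { "field": "topDocumentId" } }
  }
}
The buckets then directly represent "Magazine title" documents, whatever the depth of the hierarchy underneath them.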
This blog post could help to guide you a bit too: https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Related

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in Elasticsearch. In those documents, you have a multi-field (keyword, text) tags:
{
  ...
  tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
  ...
},
{
  ...
  tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
  ...
},
{
  ...
  tags: ['Surfing', 'Race', 'Disgrace'],
  ...
},
You can use these values as filters (facets) against a query to pull only the documents that contain a given tag:
...
"filter": [
  {
    "terms": {
      "tags": [
        "Race"
      ]
    }
  },
  ...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the return should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. In order to accomplish this, I had to use aggregations:
{
  "aggs": {
    "topics": {
      "terms": {
        "field": "tags",
        "include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
        "size": 6
      }
    }
  },
  "size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint; it does not help me.
You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all values for all documents matching that query. Meaning, you could be displaying tags that are completely unrelated. If I queried for race before the aggregate and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc...
How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for values? (sad face) Seems like this would be a common issue, but I have found no answers through documentation and Googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": { ... }   <--- see below
}
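To sanity-check the analyzer before indexing anything (a quick optional verification step, using the index defined above), the _analyze API shows the n-grams produced for a given tag:
POST tagindex/_analyze
{
  "analyzer": "my_ngrams_analyzer",
  "text": "Race"
}
// => tokens: "rac", "race", "ace"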
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which can be either a list of exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want a regex! That's why we need to use a nested list, which'll treat each tag separately.
Now, nested lists are expected to contain objects so
{
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
  "tags": [
    { "tag": "Race" },
    { "tag": "Racing" },
    { "tag": "Mountain Bike" },
    { "tag": "Horizontal" }
  ]
}
After that we'll proceed with the multi-field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a terms aggregation wrapped in a filter aggregation inside a nested aggregation:
GET tagindex/_search
{
  "aggs": {
    "topics_parent": {
      "nested": {
        "path": "tags"
      },
      "aggs": {
        "topics": {
          "filter": {
            "term": {
              "tags.tag.tokenized": "race"
            }
          },
          "aggs": {
            "topics": {
              "terms": {
                "field": "tags.tag.keyword",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
yielding
{
  ...
  "topics_parent" : {
    ...
    "topics" : {
      ...
      "topics" : {
        ...
        "buckets" : [
          {
            "key" : "Race",
            "doc_count" : 2
          },
          {
            "key" : "Disgrace",
            "doc_count" : 1
          },
          {
            "key" : "Tracey Chapman",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}
Caveats
in order for this to work, you'll have to reindex
ngrams will increase the storage footprint -- depending on how many tags-per-doc you have, it may become a concern
nested fields are internally treated as "separate documents" so this affects the disk space too
P.S.: This is an interesting use case. Let me know how the implementation went!

How to index questions and answers in Elasticsearch

I am doing a project to index questions and answers of a website in Elasticsearch (version 6) for search purposes.
I first thought of creating two indices as shown below, one for questions and one for answers.
questions mapping:
{"mappings": {
"question": {
"properties": {
"title":{
"type":"text"
},
"question": {
"type": "text"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
answers mapping:
{"mappings": {
"answer": {
"properties": {
"answer":{
"type":"text"
},
"answerId": {
"type": "keyword"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
I have used a multi_match query along with term and top_hits aggregations to search the indexed Q&As (referred question). I used this method to remove duplicates from the search results, since both the answers and the question itself of the same question can appear in the result, and I only want one entry per question. The problem I am facing is paginating the results: there is no way to paginate aggregations in Elasticsearch; it can only paginate hits, not aggregations.
Then I thought of saving both the question and its answers in one document, with the answers in a JSON array. The problem with this approach is that there is no clean way to add, remove, or update a specific answer in a given question document. The only way I found was using a Groovy script (referred question), which is deprecated in Elasticsearch v6 AFAIK.
Is there a better and cleaner way to design this?
Thanks.
Parent-Child Relationship
Use the parent-child relationship. It is similar to the nested model and allows the association of one entity with another: you can associate one document type with another in a one-to-many relationship.
More information on here: https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html
Child documents can be added, changed, or deleted without affecting the parent or the other children. You can paginate the parent documents using the Scroll API.
Child documents can be retrieved using the has_parent query.
The trade-off: you do not have to take care of duplicates and pagination problems, but parent-child queries can be 5 to 10 times slower than the equivalent nested query.
Your mapping can be like the following:
PUT /my-index
{
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "text"
        },
        "question": {
          "type": "text"
        },
        "questionId": {
          "type": "keyword"
        }
      }
    },
    "answer": {
      "_parent": {
        "type": "question"
      },
      "properties": {
        "answer": {
          "type": "text"
        },
        "answerId": {
          "type": "keyword"
        },
        "questionId": {
          "type": "keyword"
        }
      }
    }
  }
}
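Note that the _parent mapping above is the pre-6.0 syntax. Since the question targets Elasticsearch 6, where multiple mapping types and _parent were removed for new indices, the same relationship is expressed with a join field. A minimal sketch of that variant (index, type and field names are illustrative):
PUT /my-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text" },
        "question": { "type": "text" },
        "answer": { "type": "text" },
        "qa_relation": {
          "type": "join",
          "relations": { "question": "answer" }
        }
      }
    }
  }
}

// A question document
PUT /my-index/_doc/q1
{ "title": "...", "question": "...", "qa_relation": "question" }

// An answer document, routed to the same shard as its parent question
PUT /my-index/_doc/a1?routing=q1
{ "answer": "...", "qa_relation": { "name": "answer", "parent": "q1" } }

// One hit per question, deduplicated by construction and paginated with from/size
GET /my-index/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query": { "match": { "answer": "some search text" } }
    }
  }
}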

How to apply synonyms at query time instead of index time in Elasticsearch

According to the Elasticsearch reference documentation, synonym expansion can be applied at either index time or query time:
Expansion can be applied either at index time or at query time. Each has advantages (⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance versus flexibility.
The advantages and disadvantages all make sense, and for my purposes I want to make use of synonyms at query time. My use case is that I want to allow admin users in my system to curate these synonyms without having to reindex everything on an update. Also, I'd like to do it without closing and reopening the index.
The main reason I believe this is possible is this advantage:
(⬆)︎ Synonym rules can be updated without reindexing documents.
However, I can't find any documentation describing how to apply synonyms at query time instead of index time.
To use a concrete example, if I do the following (example stolen and slightly modified from the reference), it seems like this would apply the synonyms at index time:
/* NOTE: This was all run against elasticsearch 1.5 (if that matters; documentation is identical in 2.x) */
// Create our synonyms filter and analyzer on the index
PUT my_synonyms_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
// Create a mapping that uses this analyzer
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string",
      "analyzer": "my_synonyms"
    }
  }
}
// Some data
PUT my_synonyms_test/rulers/1
{
  "name": "Elizabeth II",
  "title": "Queen"
}
// A query which utilises the synonyms
GET my_synonyms_test/rulers/_search
{
  "query": {
    "match": {
      "title": "monarch"
    }
  }
}
// And we get our expected result back:
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4142135,
    "hits": [
      {
        "_index": "my_synonyms_test",
        "_type": "rulers",
        "_id": "1",
        "_score": 1.4142135,
        "_source": {
          "name": "Elizabeth II",
          "title": "Queen"
        }
      }
    ]
  }
}
So my question is: how could I amend the above example so that I would be using the synonyms at query time?
Or am I barking up the wrong tree completely? Can you point me somewhere else, please? I've looked at the plugins mentioned in answers to similar questions, like https://stackoverflow.com/a/34210587/2240218 and https://stackoverflow.com/a/18481495/2240218, but they all seem to be a couple of years old and unmaintained, so I'd prefer to avoid them.
Simply use search_analyzer instead of analyzer in your mapping, and your synonym analyzer will only be used at search time:
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string",
      "search_analyzer": "my_synonyms" <--- change this
    }
  }
}
To use the custom synonym filter at QUERY TIME instead of INDEX TIME, you first need to remove the analyzer from your mapping:
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string"
    }
  }
}
You can then use the analyzer that makes use of the custom synonym filter as part of a query_string query:
GET my_synonyms_test/rulers/_search
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "monarch",
      "analyzer": "my_synonyms"
    }
  }
}
I believe the query_string query is the only one that allows for specifying an analyzer since it uses a query parser to parse its content.
As you said, when using the analyzer only at query time, you won't need to re-index on every change to your synonyms collection.
Apart from using the search_analyzer, you can refresh the synonyms list by closing and reopening the index after making changes in the synonyms file.
Below are the commands to close and reopen your index:
curl -XPOST 'localhost:9200/index_name/_close'
curl -XPOST 'localhost:9200/index_name/_open'
After this, your synonym list will be refreshed automatically, without the need to re-ingest the data.
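If your synonyms are defined inline in the index settings rather than in a file, the same close/update/open cycle lets you swap them via the settings API. A sketch, reusing the my_synonym_filter from the question:
curl -XPOST 'localhost:9200/my_synonyms_test/_close'
curl -XPUT 'localhost:9200/my_synonyms_test/_settings' -d '{
  "analysis": {
    "filter": {
      "my_synonym_filter": {
        "type": "synonym",
        "synonyms": [ "queen,monarch,sovereign" ]
      }
    }
  }
}'
curl -XPOST 'localhost:9200/my_synonyms_test/_open'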
I followed this reference Elasticsearch — Setting up a synonyms search to configure the synonyms in ES

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here is the index I use:
{
  "index": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "filter": "dotted",
          "tokenizer": "prefix-test-tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "dotted": {
          "patterns": [
            "([^.]+)"
          ],
          "type": "pattern_capture"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "delimiter": ".",
          "type": "path_hierarchy"
        }
      }
    }
  }
}
{
  "metrics": {
    "_routing": {
      "required": true
    },
    "properties": {
      "tenantId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "unit": {
        "type": "string",
        "index": "not_analyzed"
      },
      "metric_name": {
        "index_analyzer": "prefix-test-analyzer",
        "search_analyzer": "keyword",
        "type": "string"
      }
    }
  }
}
The above index creates the following terms for a metric name foo.bar.baz
foo
bar
baz
foo.bar
foo.bar.baz
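You can verify the produced terms with the _analyze API (shown here against the same metrics_alias used in the query below):
curl -XGET 'localhost:9200/metrics_alias/_analyze?analyzer=prefix-test-analyzer' -d 'foo.bar.baz'
# returns the terms: foo, bar, baz, foo.bar, foo.bar.baz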
If I have a bunch of metrics, like below
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any way other than to write a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with:
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
  "size": 0,
  "query": {
    "term": {
      "tenantId": "12345"
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": {
        "field": "metric_name",
        "include": "a[.]b[.][^.]*",
        "execution_hint": "map",
        "size": 0
      }
    }
  }
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
"key" : "a.b.c",
"doc_count" : 2
}, {
"key" : "a.b.m",
"doc_count" : 1
} ]
So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants which have large amounts of data (around 1 million), the performance is really slow. I am guessing it is the terms aggregation that takes all the time.
I am wondering if a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use regexp with the same regular expression as in the aggregation part. This way, the aggregation will have to evaluate fewer buckets, since fewer documents reach the aggregation phase.
You mentioned that regexp is working better for you; my initial guess was that the prefix query would perform better.
change "size": 0 from aggregations to "size": 100. After testing you mentioned this doesn't make any difference
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing you mentioned that the default execution_hint was performing far worse.
the only other thing I could think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (not letting ES do it) and index each element of the hierarchy in a separate field of the same document. For example, a.b in field2, a.b.c in field3 and so on. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES; a sketch follows below.
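A minimal sketch of that last suggestion; the level1..level4 field names are hypothetical, and they would have to be mapped as not_analyzed strings so that the term queries and the aggregation operate on exact values:
# At indexing time the application precomputes each hierarchy level
PUT metrics_alias/metrics/1?routing=12345
{
  "tenantId": "12345",
  "metric_name": "a.b.c.d",
  "level1": "a",
  "level2": "a.b",
  "level3": "a.b.c",
  "level4": "a.b.c.d"
}

# "What are the level-2 children of a.b?" then becomes a plain terms aggregation
GET metrics_alias/metrics/_search?routing=12345
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "tenantId": "12345" } },
        { "term": { "level2": "a.b" } }
      ]
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": { "field": "level3" }
    }
  }
}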
Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.

Elasticsearch indexing homogenous objects under dynamic keys

The kind of document we want to index and query contains variable keys, but they are grouped under a common root key, as follows:
{
  "articles": {
    "0000000000000000000000000000000000000001": {
      "crawled_at": "2016-05-18T19:26:47Z",
      "language": "en",
      "tags": [
        "a",
        "b",
        "d"
      ]
    },
    "0000000000000000000000000000000000000002": {
      "crawled_at": "2016-05-18T19:26:47Z",
      "language": "en",
      "tags": [
        "b",
        "c",
        "d"
      ]
    }
  },
  "articles_count": 2
}
We want to be able to ask: which documents contain articles with tags "b" and "d", with language "en"?
The reason why we don't use a list for articles is that Elasticsearch can efficiently and automatically merge documents with partial updates. The challenge, however, is to index the objects under the variable keys. One possible way we tried is to use dynamic_templates as follows:
{
  "sources": {
    "dynamic": "strict",
    "dynamic_templates": [
      {
        "article_template": {
          "mapping": {
            "fields": {
              "crawled_at": {
                "format": "dateOptionalTime",
                "type": "date"
              },
              "language": {
                "index": "not_analyzed",
                "type": "string"
              },
              "tags": {
                "index": "not_analyzed",
                "type": "string"
              }
            }
          },
          "path_match": "articles.*"
        }
      }
    ],
    "properties": {
      "articles": {
        "dynamic": false,
        "type": "object"
      },
      "articles_count": {
        "type": "integer"
      }
    }
  }
}
However, this dynamic template fails: when documents are inserted, the following can be found in the logs:
[2016-05-30 17:44:45,424][WARN ][index.codec] [node]
[main] no index mapper found for field:
[articles.0000000000000000000000000000000000000001.language] returning
default postings format
The same goes for the two other fields as well. When I try to query for the existence of a certain article, or even articles, it doesn't return any document (no error, but empty hits):
curl -LsS -XGET 'localhost:9200/main/sources/_search' -d '{"query":{"exists":{"field":"articles"}}}'
When I query for the existence of articles_count, it returns everything. Is there an error in what we are trying to achieve, for example in the schema: the definition of articles as a property, or the dynamic template? What about the types, and "dynamic": false? The path seems correct. Maybe it is not possible to define templates for objects under variable keys, but according to the documentation it should be.
Otherwise, what alternatives are possible, ideally without changing the documents?
Notes: we have other types in the same index main that also have fields like language; I don't know whether that could have an influence. The version of ES we are using is 1.7.5 (we cannot upgrade to 2.x for now).
