How to correctly query inside of terms aggregate values in Elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in Elasticsearch. In those documents, you have a multi_field (keyword, text) called tags:
{
...
tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
...
},
{
...
tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
...
},
{
...
tags: ['Surfing', 'Race', 'Disgrace'],
...
},
You can use these values as filters (facets) against a query to pull only the documents that contain a given tag:
...
"filter": [
{
"terms": {
"tags": [
"Race"
]
}
},
...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the response should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. To accomplish this, I had to use aggregations:
{
"aggs": {
"topics": {
"terms": {
"field": "tags",
"include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
"size": 6
}
}
},
"size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint; it does not help.
You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all tag values from all documents matching that query, meaning you can end up displaying tags that are completely unrelated. If I queried for race before the aggregation and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc.
How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for values? (sad face) Seems like this would be a common issue, but I have found no answers through documentation and googling.

You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
"settings": {
"index": {
"max_ngram_diff": 10
},
"analysis": {
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [ "lowercase" ]
}
},
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10,
"token_chars": [ "letter", "digit" ]
}
}
}
},
{ "mappings": ... } --> see below
}
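Once the index exists, it's worth sanity-checking the analyzer with the _analyze API (a quick verification sketch):
POST tagindex/_analyze
{
"analyzer": "my_ngrams_analyzer",
"text": "Race"
}
This should return exactly the tokens described above -- rac, race and ace -- lowercased by the filter.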
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which accepts either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want a regex! That's why we need a nested list, which will treat each tag separately.
Now, nested lists are expected to contain objects so
{
"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
"tags": [
{ "tag": "Race" },
{ "tag": "Racing" },
{ "tag": "Mountain Bike" },
{ "tag": "Horizontal" }
]
}
After that we'll proceed with the multi-field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a nested filter terms aggregation:
GET tagindex/_search
{
"aggs": {
"topics_parent": {
"nested": {
"path": "tags"
},
"aggs": {
"topics": {
"filter": {
"term": {
"tags.tag.tokenized": "race"
}
},
"aggs": {
"topics": {
"terms": {
"field": "tags.tag.keyword",
"size": 100
}
}
}
}
}
}
},
"size": 0
}
yielding
{
...
"topics_parent" : {
...
"topics" : {
...
"topics" : {
...
"buckets" : [
{
"key" : "Race",
"doc_count" : 2
},
{
"key" : "Disgrace",
"doc_count" : 1
},
{
"key" : "Tracey Chapman",
"doc_count" : 1
}
]
}
}
}
}
Caveats
in order for this to work, you'll have to reindex (one way to do that is sketched below)
ngrams will increase the storage footprint -- depending on how many tags-per-doc you have, it may become a concern
nested fields are internally treated as "separate documents" so this affects the disk space too
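If your current documents live in a different index, the reshaping from plain tag arrays into {"tag": ...} objects can be done server-side with the _reindex API plus a small Painless script. A minimal sketch, assuming the old index is called old_tags_index (a placeholder name):
POST _reindex
{
"source": { "index": "old_tags_index" },
"dest": { "index": "tagindex" },
"script": {
"lang": "painless",
"source": "def out = []; for (t in ctx._source.tags) { out.add(['tag': t]) } ctx._source.tags = out;"
}
}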
P.S.: This is an interesting use case. Let me know how the implementation went!

Related

Sort similar data by property

I have the following data:
[
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "5"
},
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "45"
}
]
I use the following query to get data:
{
"query": {
"bool": {
"must": {
"match": {
"DocumentId": "85"
}
},
"should": [
{
"match": {
"figureText": {
"boost": 5,
"query": "General Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"match": {
"descriptionShort": {
"boost": 4,
"query": "Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"term": {
"partNumber": {
"boost": 1,
"value": "1012626-001FG05"
}
}
}
]
}
}
}
Currently, it returns the item with itemNumeric = "45", and I would like to get itemNumeric = "5" (the lowest).
Is there a trick to do that? I tried with "sort":[{"itemNumeric":"desc"}]
Thanks
Looking at your comment, you can resolve the issue in two ways.
Solution 1: Update your mapping so that your sort works as expected:
PUT my_index/_mapping/_doc
{
"properties": {
"itemNumeric": {
"type": "text",
"fielddata": true
}
}
}
Solution 2: Check the mapping of your itemNumeric field. If your mapping has been created dynamically, your itemNumeric field would be a multi-field:
"itemNumeric": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
In this case, you can apply your sorting logic to the itemNumeric.keyword field:
"sort":[{"itemNumeric.keyword":"desc"}]
In Elasticsearch, whenever you have text data, it is recommended to create two fields for it: one of type text so that you can run full-text queries, and another of type keyword so that you can use it for sorting or aggregation operations.
Solution 1 is not recommended, as the official ES documentation gives the following reason:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
I'd suggest reading about multi-fields and fielddata to get more clarity on what's happening.

ElasticSearch 5.x context suggester with multiple contexts

I want to use the context suggester from elasticSearch, but my suggestion results need to match 2 context values.
Expanding the example from the docs, I want to do something like:
POST place/_search?pretty
{
"suggest": {
"place_suggestion" : {
"prefix" : "tim",
"completion" : {
"field" : "suggest",
"size": 10,
"contexts": {
"place_type": [ "cafe", "restaurants" ],
"rating": ["good"]
}
}
}
}
}
I would like to have results that have a context 'cafe' or 'restaurant' for place_type AND that have the context 'good' for rating.
When I try something like this, elastic performs an OR operation on the contexts, giving me all suggestions with the context 'cafe', 'restaurant' OR 'good'.
Can I somehow specify what BOOL operator elastic needs to use for combining multiple contexts?
It looks like this functionality isn't supported from Elasticsearch 5.x onwards:
https://github.com/elastic/elasticsearch/issues/21291#issuecomment-375690371
Your best bet is to create a composite context, which seems to be how Elasticsearch 2.x achieved multiple contexts in a query:
https://github.com/elastic/elasticsearch/pull/26407#issuecomment-326771608
To do this, I guess you'll need a new field in your mapping. Let's call it cat-rating:
PUT place
{
"mappings": {
"properties": {
"suggest": {
"type": "completion",
"contexts": [
{
"name": "place_type-rating",
"type": "category",
"path": "cat-rating"
}
]
}
}
}
}
When you index new documents you'll need to concatenate the fields place_type and rating together, separated by -, for the cat-rating field.
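For illustration (the document values here are made up), an indexed document might then look like:
POST place/_doc
{
"suggest": ["Tim's Cafe"],
"place_type": "cafe",
"rating": "good",
"cat-rating": "cafe-good"
}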
Once that's done your query will need to look something like this:
POST place/_search?pretty
{
"suggest": {
"place_suggestion": {
"prefix": "tim",
"completion": {
"field": "suggest",
"size": 10,
"contexts": {
"place_type-rating": [
{
"context": "cafe-good"
},
{
"context": "restaurant-good"
}
]
}
}
}
}
}
That'll return suggestions of good cafes OR good restaurants.

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here is the index I use.
{
"index": {
"analysis": {
"analyzer": {
"prefix-test-analyzer": {
"filter": "dotted",
"tokenizer": "prefix-test-tokenizer",
"type": "custom"
}
},
"filter": {
"dotted": {
"patterns": [
"([^.]+)"
],
"type": "pattern_capture"
}
},
"tokenizer": {
"prefix-test-tokenizer": {
"delimiter": ".",
"type": "path_hierarchy"
}
}
}
}
}
{
"metrics": {
"_routing": {
"required": true
},
"properties": {
"tenantId": {
"type": "string",
"index": "not_analyzed"
},
"unit": {
"type": "string",
"index": "not_analyzed"
},
"metric_name": {
"index_analyzer": "prefix-test-analyzer",
"search_analyzer": "keyword",
"type": "string"
}
}
}
}
The above index creates the following terms for a metric name foo.bar.baz
foo
bar
baz
foo.bar
foo.bar.baz
If I have a bunch of metrics, like below
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any other way than to write a terms aggregation. To figure out level 2 tokens of a.b, here is the query I came up with.
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
"size": 0,
"query": {
"term": {
"tenantId": "12345"
}
},
"aggs": {
"metric_name_tokens": {
"terms": {
"field" : "metric_name",
"include": "a[.]b[.][^.]*",
"execution_hint": "map",
"size": 0
}
}
}
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
"key" : "a.b.c",
"doc_count" : 2
}, {
"key" : "a.b.m",
"doc_count" : 1
} ]
So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants that have large amounts of data (around 1 million documents), the performance is really slow. I am guessing the terms aggregation takes time.
I am wondering if terms aggregation is the right choice for this kind of data and also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use regexp with the same regular expression as in the aggregation part. In this way, the aggregations will have to evaluate fewer buckets, as fewer documents will reach the aggregation part.
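A sketch of that regexp variant, reusing the aggregation's pattern in the query:
"bool": {
"must": [
{ "term": { "tenantId": "12345" } },
{ "regexp": { "metric_name": "a[.]b[.][^.]*" } }
]
}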
You mentioned that regexp is working better for you; my initial guess was that the prefix would perform better.
change "size": 0 from aggregations to "size": 100. After testing you mentioned this doesn't make any difference
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing you mentioned that the default execution_hint was performing far worse.
the only other thing I could think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (rather than having ES do it) and index each element in the hierarchy in a separate field. For example a.b in field2, a.b.c in field3 and so on, for the same document. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES.
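A rough sketch of that last idea (field names like level1/level2/level3 are illustrative, and they would be mapped as not_analyzed strings so the aggregation can use them directly). A document would be indexed as:
{
"tenantId": "12345",
"metric_name": "a.b.c.d",
"level1": "a",
"level2": "a.b",
"level3": "a.b.c"
}
and the level-2 tokens under a.b would then come from:
{
"size": 0,
"query": {
"bool": {
"must": [
{ "term": { "tenantId": "12345" } },
{ "term": { "level2": "a.b" } }
]
}
},
"aggs": {
"metric_name_tokens": {
"terms": { "field": "level3" }
}
}
}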
From all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.

How to sort items by array size in ElasticSearch?

I have 3 million items with this structure:
{
"id": "some_id",
"title": "some_title",
"photos": [
{...},
{...},
...
]
}
Some items may have an empty photos field:
{
"id": "some_id",
"title": "some_title",
"photos": []
}
I want to sort by the number of photos so that elements without photos end up at the end of the list.
I have one working solution, but it's very slow on 3 million items:
GET myitems/_search
{
"filter": {
...some filters...
},
"sort": [
{
"_script": {
"script": "_source.photos.size()",
"type": "number",
"order": "desc"
}
}
]
}
This query takes 55 seconds to execute. How can I optimize it?
As suggested in the comments, adding a new field with the number of photos would be the way to go. There's a way to achieve this without reindexing all your data by using the update by query plugin.
Basically, after installing the plugin, you can run the following query and all your documents will get that new field. However, make sure that your indexing process also populates that new field in the new documents:
curl -XPOST 'localhost:9200/myitems/_update_by_query' -d '{
"query" : {
"match_all" : {}
},
"script" : "ctx._source.nb_photos = ctx._source.photos.size();"
}'
After this has run, you'll be able to sort your results simply with:
"sort": {"nb_photos": "desc"}
Note: for this plugin to work, one needs to have scripting enabled; it is already the case for you since you were able to use a sort script, but I'm just mentioning this for completeness' sake.
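A complete sorted search would then be as simple as (a sketch; keep whatever filters you already have in place):
GET myitems/_search
{
"query": { "match_all": {} },
"sort": [
{ "nb_photos": "desc" }
]
}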
The problem was solved with the transform directive. Now I have this mapping:
PUT /myitems/_mapping/lol
{
"lol" : {
"transform": {
"lang": "groovy",
"script": "ctx._source['has_photos'] = ctx._source['photos'].size() > 0"
},
"properties" : {
... fields ...
"photos" : {"type": "object"},
"has_photos": {"type": "boolean"}
... fields ...
}
}
}
Now I can sort items by whether they have photos:
GET /test/_search
{
"sort": [
{
"has_photos": {
"order": "desc"
}
}
]
}
Unfortunately, this requires a full reindex.

How to define a bucket aggregation where buckets are defined by arbitrary filters on a field (GROUP BY CASE equivalent)

ElasticSearch enables us to filter a set of documents by regex on any given field, and also to group the resulting documents by the terms in a given (same or different) field, using "bucket aggregations". For example, on an index that contains a "Url" field and a "UserAgent" field (some kind of web server log), the following will return the top document counts for terms found in the UserAgent field.
{
query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
size: 0,
aggs: { myaggregation: { terms: { field: "UserAgent" } } }
}
What I'd like to do is use the power of the regexp filter (which operates on the whole field, not just terms within a field) to manually define my aggregation buckets, so that I can relatively reliably split my documents/counts/hits by "user agent type" data, rather than the arbitrary terms parsed by Elasticsearch in the field.
Basically, I am looking for the equivalent of a CASE statement in a GROUP BY, in SQL terms. The SQL query that would express my intent would be something like:
SELECT Bucket, Count(*)
FROM (
SELECT CASE
WHEN UserAgent LIKE '%android%' OR UserAgent LIKE '%ipad%' OR UserAgent LIKE '%iphone%' OR UserAgent LIKE '%mobile%' THEN 'Mobile'
WHEN UserAgent LIKE '%msie 7.0%' then 'IE7'
WHEN UserAgent LIKE '%msie 8.0%' then 'IE8'
WHEN UserAgent LIKE '%firefox%' then 'FireFox'
ELSE 'OTHER'
END Bucket
FROM pagedata
WHERE Url LIKE '%interestingpage%'
) Buckets
GROUP BY Bucket
Can this be done in an ElasticSearch query?
This is an interesting use-case.
Here's a more Elasticsearch-like solution.
The idea is to do all this regex matching at indexing time so that search time is fast (scripts at search time do not perform well when there are many documents, and will take time). Let me explain:
define a sub-field for your main field, in which the manipulation of terms is customized
this manipulation will be performed so that the only terms kept in the index will be the ones you defined: FireFox, IE8, IE7, Mobile. Each document can have more than one of these terms. Meaning a text like msie 7.0 sucks and ipad rules will generate only two terms: IE7 and Mobile.
All this is made possible by the keep token filter.
there should be another list of token filters that will actually perform the replacement. This will be possible by using the pattern_replace token filter.
because you have two words that should be replaced (msie 7.0 for example), you need a way to capture these two words (msie and 7.0) one beside the other. This will be possible using the shingle token filter.
Let me put all this together and provide the complete solution:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_pattern_replace_analyzer": {
"tokenizer": "whitespace",
"filter": [
"filter_shingle",
"my_pattern_replace1",
"my_pattern_replace2",
"my_pattern_replace3",
"my_pattern_replace4",
"words_to_be_kept"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 10,
"min_shingle_size": 2,
"output_unigrams": true
},
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": "android|ipad|iphone|mobile",
"replacement": "Mobile"
},
"my_pattern_replace2": {
"type": "pattern_replace",
"pattern": "msie 7.0",
"replacement": "IE7"
},
"my_pattern_replace3": {
"type": "pattern_replace",
"pattern": "msie 8.0",
"replacement": "IE8"
},
"my_pattern_replace4": {
"type": "pattern_replace",
"pattern": "firefox",
"replacement": "FireFox"
},
"words_to_be_kept": {
"type": "keep",
"keep_words": [
"FireFox", "IE8", "IE7", "Mobile"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"UserAgent": {
"type": "string",
"fields": {
"custom": {
"analyzer": "my_pattern_replace_analyzer",
"type": "string"
}
}
}
}
}
}
}
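Before indexing anything, you can verify what the analyzer chain actually emits (a quick check; the _analyze syntax varies a bit across ES versions):
GET /test/_analyze?analyzer=my_pattern_replace_analyzer&text=msie+7.0+sucks+and+ipad+rules
This should return only the tokens IE7 and Mobile; everything else is dropped by the keep filter.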
Test data:
POST /test/test/_bulk
{"index":{"_id":1}}
{"UserAgent": "android OS is the best firefox"}
{"index":{"_id":2}}
{"UserAgent": "firefox is my favourite browser"}
{"index":{"_id":3}}
{"UserAgent": "msie 7.0 sucks and ipad rules"}
Query:
GET /test/test/_search?search_type=count
{
"aggs": {
"myaggregation": {
"terms": {
"field": "UserAgent.custom",
"size": 10
}
}
}
}
Results:
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggregation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "FireFox",
"doc_count": 2
},
{
"key": "Mobile",
"doc_count": 2
},
{
"key": "IE7",
"doc_count": 1
}
]
}
}
You could use a terms aggregation with a script:
{
query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
size: 0,
aggs: {
myaggregation: {
terms: {
script: "doc['UserAgent'] =~ /.*android.*/ || doc['UserAgent'] =~ /.*ipad.*/ || doc['UserAgent'] =~ /.*iphone.*/ || doc['UserAgent'] =~ /.*mobile.*/ ? 'Mobile' : doc['UserAgent'] =~ /.*msie 7.0.*/ ? 'IE7' : '...you got the idea by now...'"
}
}
}
}
But beware of the performance hit!