How to sort ordinal values in elasticsearch? - elasticsearch

Say i've got a field 'spicey' with possible values 'hot', 'hotter', 'smoking'.
There's an intrinsic ordening in these values: they're ordinals.
I'd like to be able to sort or filter on them using their intrinsic order. For example: give me all documents where spicey > hot.
Sure i can translate the values to integers 0,1,2 but this requires extra housekeeping on both the index and the query side which I'd rather avoid.
Is this possible in some way? Already contemplated using multi field mapping but not sure if that would help me.

You can sort based on string values by scripting a sort operation, so that you set each spicey string a specific field value.
curl -XGET 'http://localhost:9200/yourindex/yourtype/_search' -d
{
"sort": {
"_script": {
"script": "factor.get(doc[\"spicey\"].value)",
"type": "number",
"params": {
"factor": {
"hot": 0,
"hotter": 1,
"smoking": 2
}
},
"order": "asc"
}
}
}

One solution could be to create a specific analyzer for spice levels. The idea is to map each level to a discrete value which increases the more spicy the spice is.
{
"settings": {
"analysis": {
"char_filter": {
"spices": {
"type": "mapping",
"mappings": [
"mild=>1",
"hot=>2",
"hotter=>3",
"smoking=>4"
]
}
},
"analyzer": {
"spice_synonyms": {
"type": "custom",
"char_filter": "spices",
"tokenizer": "standard",
"filter": [
"standard"
]
}
}
}
},
"mappings": {
"ordinal": {
"properties": {
"spicy": {
"type": "string",
"fields": {
"level": {
"type": "string",
"analyzer": "spice_synonyms"
}
}
}
}
}
}
}
In the above index settings and mappings, the spicy field would contain the plain english word (hot, mild, etc) while the spicy.level field would contain a discrete value that you can then use in queries and sorting.
For instance, retrieving documents whose spice level is strictly bigger than hot and ordered in decreasing order (smoking first) could be done like this:
{
"sort": {
"spicy.level": "desc"
},
"query": {
"query_string": {
"query": "spicy.level:>2"
}
}
}
or a range query would work, too
{
"sort": {
"spicy.level": "desc"
},
"query": {
"range": {
"spicy.level" {
"gt": 2
}
}
}
}

Related

Elasticsearch - extract values from a flattened field inside a nested type using a runtime field script

I have this mapping, a big document actually, a lot of fields excluded for brevity
{
"items": {
"mappings": {
"dynamic": "false",
"properties": {
"id": {
"type": "keyword"
},
"costs": {
"type": "nested",
"properties": {
"id": {
"type": "keyword"
},
"costs_samples": {
"type": "flattened"
}
}
}
}
}
}
}
costs_samples is a flattened field type huge collection of possible costs (sometimes more than 10k entries), based on some dynamic dimensions. Have to highlight that costs_sample cannot live outside costs, since at query time some conditions from costs should be composed with should or match clauses, like (costs.country=this_value AND some_other costs_samples_condition.
I would like to be able to extract and eventually inject a new field at costs level as a runtime field, and then use that field to sort, filter and aggregate.
Something like this
{
"runtime_mappings": {
"costs.selected_cost": {
"type": "long",
"script": {
"source":
"for (def cost : doc['costs.costs_samples']) { if(cost.values!= null) {emit(Long.parseLong(cost.values.some_dynamic_identity_known_at_query_time.last))} }"
}
}
},
"query":{
"nested": {
"path": "costs",
"query": {
"bool": {
"filter": [
{
"terms": {
"costs.id": ["id-1","id-2"]
}
},
{
"term": {
"costs.selected_cost": 10
}
}
]
}
}
}
},
"fields": ["costs.selected_cost"]
}
The trouble is selected_cost it is not created / returned by ES. No error message.
Where did I do wrong? Documentation was not helpful.
Maybe worth mention that I've also tried with 2 different documents, like items and costs and then execute kind of a join operation, but the performance tests were really poor.
Thanks!
Long story short, there is no way to use a runtime field inside a nested field type.

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES which results in more than 2 min response.
I have an index that has more than 25M files and composes of the next 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I have 2 kinds of aggregation I am using, composite and terms. Composite aggregations for getting only first X results to display and terms aggregation for prefix search.
Composite aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"composite": {
"sources": [
{
"Group Read": {
"terms": {
"field": "group_read.raw"
}
}
}
],
"size": 10
}
},
"Group_Write_Permissions": {
"composite": {
"sources": [
{
"Group Write": {
"terms": {
"field": "group_write.raw"
}
}
}
]
}
},
"User_Write_Permissions": {
"composite": {
"sources": [
{
"User Write": {
"terms": {
"field": "user_write.raw"
}
}
}
]
}
},
"User_Read_Permissions": {
"composite": {
"sources": [
{
"User Read": {
"terms": {
"field": "user_read.raw"
}
}
}
]
}
}
}
}
Terms aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"terms": {
"field": "group_read.raw",
"include": ".*[Ss].*"
}
},
"Group Write Permissions": {
"terms": {
"field": "group_write.raw",
"include": ".*[Ss].*"
}
},
"User Read Permissions": {
"terms": {
"field": "user_read.raw",
"include": ".*[Ss].*"
}
},
"User Write Permissions": {
"terms": {
"field": "user_write.raw",
"include": ".*[Ss].*"
}
}
}
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding new field user_group_permissions and adding to the above 4 fields "copy_to": "user_group_permissions"
Adding to the above 4 fields and to the field "user_group_permissions" the next property: "eager_global_ordinals": true
Increased the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [took something like 6 hours]
All of the above did help a little with the retrieval time but still: composite aggregation takes up to 20s and terms aggregation takes up to 3 min.
[The best results were on the fields user_group_permissions which has been created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s].
Please, if someone has any idea how to improve the retrieval times I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with default size 10, that'll do the job.
Second, what you're doing with the terms aggregation is not a prefix filtering, but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan" because each and every term must be visited.
A first optimization I would suggest is that in your second query you should do your regex in the query part (bool/should with one regex query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
A second optimization is to leverage the wildcard field type which is a specialized field type made specially for grep-like wildcard and regexp queries.
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
Depending on your comments, I'll add more optimizations as the discussion goes on.

Elasticsearch discard documents that contain superset of query

Let's say I have 3 documents:
{ "cities": "Paris Zurich Milan" }
{ "cities": "Paris Zurich" }
{ "cities": "Zurich"}
cities is just text, I'm not using any custom analyzer.
I want to query for documents that have in cities both Paris and Zurich, in this order, and do not have any other city. So I want to get only the second document.
This is what I'm trying so far:
{
"query": {
"match_phrase": {
"cities": "Paris Zurich"
}
}
}
But this returns also the first document.
What should I do instead?
If you do not care about case sensitivity just use term query:
{
"query": {
"term": {
"cities.keyword": "Paris Zurich"
}
}
}
It will only match the exact value of field.
On the other hand you can create custom analyzer that will still store the exact value of field (just like keyword) with one exception: the stored value will be converted to lowercase so you will be able to find Paris Zurich as well as paris Zurich. Here is the example:
{
"settings": {
"analysis": {
"analyzer": {
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [],
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"cities": {
"type": "text",
"fields": {
"lowercased": {
"type": "text",
"analyzer": "lowercase_analyzer"
}
}
}
}
}
}
}
{
"query": {
"term": {
"cities.lowercased": "paris zurich" // Query string should also be in lowercase
}
}
}

Autocomplete functionality using elastic search

I have an elastic search index with following documents and I want to have an autocomplete functionality over the specified fields:
mapping: https://gist.github.com/anonymous/0609b1d110d91dceb9a90faa76d1d5d4
Usecase:
My query is of the form prefix type eg "sta", "star", "star w" .."start war" etc with an additional filter as tags = "science fiction". Also there queries could match other fields like description, actors(in cast field, not this is nested). I also want to know which field it matched to.
I investigated 2 ways for doing that but non of the methods seem to address the usecase above:
1) Suggester autocomplete:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-suggesters-completion.html
With this it seems I have to add another field called "suggest" replicating the data which is not desirable.
2) using a prefix filter/query:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-prefix-filter.html
this gives the whole document back not the exact matching terms.
Is there a clean way of achieving this, please advise.
Don't create mapping separately, insert data directly into index. It will create default mapping for that. Use below query for autocomplete.
GET /netflix/movie/_search
{
"query": {
"query_string": {
"query": "sta*"
}
}
}
I think completion suggester would be the cleanest way but if that is undesirable you could use aggregations on name field.
This is a sample index(I am assuming you are using ES 1.7 from your question
PUT netflix
{
"settings": {
"analysis": {
"analyzer": {
"prefix_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"edge_filter"
]
},
"keyword_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim"
]
}
},
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
},
"mappings": {
"movie":{
"properties": {
"name":{
"type": "string",
"fields": {
"prefix":{
"type":"string",
"index_analyzer" : "prefix_analyzer",
"search_analyzer" : "keyword_analyzer"
},
"raw":{
"type": "string",
"analyzer": "keyword_analyzer"
}
}
},
"tags":{
"type": "string", "index": "not_analyzed"
}
}
}
}
}
Using multi-fields, name field is analyzed in different ways. name.prefix is using keyword tokenizer with edge ngram filter
so that string star wars can be broken into s, st, sta etc. but while searching, keyword_analyzer is used so that search query does not get broken into multiple small tokens. name.raw will be used for aggregation.
The following query will give top 10 suggestions.
GET netflix/movie/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"tags": "sci-fi"
}
},
"query": {
"match": {
"name.prefix": "sta"
}
}
}
},
"size": 0,
"aggs": {
"unique_movie_name": {
"terms": {
"field": "name.raw",
"size": 10
}
}
}
}
Results will be something like
"aggregations": {
"unique_movie_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "star trek",
"doc_count": 1
},
{
"key": "star wars",
"doc_count": 1
}
]
}
}
UPDATE :
You could use highlighting for this purpose I think. Highlight section will get you the whole word and which field it matched. You can also use inner hits and highlighting inside it to get nested docs also.
{
"query": {
"query_string": {
"query": "sta*"
}
},
"_source": false,
"highlight": {
"fields": {
"*": {}
}
}
}

Elasticsearch Query aggregated by unique substrings (email domain)

I have an elasticsearch query that queries over an index and then aggregates based on a specific field sender_not_analyzed. I then use a term aggregation on that same field sender_not_analyzed which returns buckets for the top "senders". My query is currently:
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "sender_not_analyzed"
}
}
}
}
which returns buckets that look like:
"aggregations": {
"sender-stats": {
"buckets": [
{
"key": "<Mike <mike#fizzbuzz.com>#MISSING_DOMAIN>",
"doc_count": 5017
},
{
"key": "jon.doe#foo.com",
"doc_count": 3963
},
{
"key": "jane.doe#foo.com",
"doc_count": 2857
},
{
"key": "jon.doe#bar.com",
"doc_count":1544
}
How can I write an aggregation such that I get single bucket for each unique email domain, eg foo.com would have a doc_count of (3963 + 2857) 6820? Can I accomplish this with a regex aggregation or do I need to write some kind of custom analyzer to split the string at the # to the end of string?
This is pretty late, but I think this can be done by using pattern_replace char filter, you capture the domain name with regex, This is my setup
POST email_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"domain"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"domain": {
"type": "pattern_replace",
"pattern": ".*#(.*)",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"domain": {
"type": "string",
"analyzer": "my_custom_analyzer"
},
"sender_not_analyzed": {
"type": "string",
"index": "not_analyzed",
"copy_to": "domain"
}
}
}
}
}
Here domain char filter will capture the domain name, we need to use keyword tokenizer to get the domain as it is, I am using lowercase filter but it is up to you if you want to use it or not. Using copy_to parameter to copy the value of the sender_not_analyzed to domain field, although _source field won't be modified to include this value but we can query it.
GET email_index/_search
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "domain"
}
}
}
}
This will give you desired result.

Resources