Aggregation on terms and not on whole field - elasticsearch

I have an index of products (ES 6.3) in which some product names look like Tomato, Tomatosoup, Tomatojuice, etc. What I'm trying to achieve is that when I query by a term such as Toma, I get an aggregation of the best-matching terms instead of the whole product names.
To achieve this, I have the following mapping (the custom-ngram analyzer is defined elsewhere in the mapping):
{
"name": {
"type": "text",
"analyzer": "custom-ngram",
"search_analyzer": "standard",
"fields": {
"suggestion": {
"type": "text",
"fielddata": true,
"analyzer": "standard"
}
}
}
}
and my query looks like this:
{
"query": {
"bool": {
"must":{
"multi_match": {
"query": "tom",
"fields": ["name^3", "description"]
}
}
}
},
"aggs": {
"suggestions": {
"terms": {
"field": "name.suggestion",
"include": "tom.*",
"size": 10
}
}
},
"size": 0
}
This does work and gives me back what I need, but I have two concerns:
The usage of fielddata, which the ES docs discourage
The usage of the include parameter to actually filter the aggregation buckets
Is this the right way to solve this issue, or is the approach completely wrong? Is there a best practice for this problem?

Related

How do I do a terms aggregation by concatenating two arrays?

I have an Elasticsearch mapping that looks like this:
"product": {
"properties": {
"attributes": {
"type": "keyword",
"normalizer": "lowercase"
},
"skus": {
"type": "nested",
"properties": {
"attributes": {
"type": "keyword",
"normalizer": "lowercase"
}
}
}
}
}
I'm trying to do a terms aggregation on both the field attributes and the field skus.attributes by concatenating them but I haven't figured out how. Both fields are simple string arrays. This is as far as I've gotten:
{
"query": {
"match_all": {}
},
"aggregations": {
"unique_attrs": {
"terms": {
"field": "attributes"
}
}
}
}
Of course, I could reindex my data so that there would be another field containing a concatenation of the values of both fields, but that doesn't seem right.
As mentioned on the Elasticsearch forums (https://discuss.elastic.co/t/combining-nested-and-non-nested-aggregations/82583), the recommended approach is to merge them using a copy_to mapping when indexing the data.
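A minimal sketch of what that copy_to suggestion could look like, written as Python dicts that serialize to the request bodies. The target field name all_attributes is hypothetical, the lowercase normalizer is assumed to be defined in the index settings as in the question, and this assumes your Elasticsearch version accepts copy_to from a nested field to a root-level field.

```python
import json

# Both source fields copy their values into one root-level keyword field.
# "all_attributes" is a made-up name used only for illustration.
attribute_field = {
    "type": "keyword",
    "normalizer": "lowercase",
    "copy_to": "all_attributes",
}

mapping = {
    "product": {
        "properties": {
            "attributes": dict(attribute_field),
            "skus": {
                "type": "nested",
                "properties": {"attributes": dict(attribute_field)},
            },
            # Target of copy_to; it is searchable and aggregatable,
            # but it does not appear in _source.
            "all_attributes": {"type": "keyword", "normalizer": "lowercase"},
        }
    }
}

# The terms aggregation then runs on the combined field only.
aggregation = {
    "query": {"match_all": {}},
    "aggregations": {
        "unique_attrs": {"terms": {"field": "all_attributes"}}
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(aggregation, indent=2))
```

Existing documents would still need to be reindexed for the new field to be populated, so this avoids changing the documents themselves rather than avoiding the reindex.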

Autocomplete functionality using elastic search

I have an elastic search index with following documents and I want to have an autocomplete functionality over the specified fields:
mapping: https://gist.github.com/anonymous/0609b1d110d91dceb9a90faa76d1d5d4
Usecase:
My queries are prefix-style, e.g. "sta", "star", "star w" ... "star war", etc., with an additional filter such as tags = "science fiction". The queries could also match other fields like description or actors (in the cast field; note that this is nested). I also want to know which field was matched.
I investigated two ways of doing this, but none of them seems to address the use case above:
1) Suggester autocomplete:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-suggesters-completion.html
With this it seems I have to add another field called "suggest" replicating the data which is not desirable.
2) using a prefix filter/query:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-prefix-filter.html
This gives the whole document back, not the exact matching terms.
Is there a clean way of achieving this? Please advise.
Don't create the mapping separately; insert the data directly into the index and Elasticsearch will create a default mapping for it. Then use the query below for autocomplete.
GET /netflix/movie/_search
{
"query": {
"query_string": {
"query": "sta*"
}
}
}
I think the completion suggester would be the cleanest way, but if that is undesirable you could use aggregations on the name field.
This is a sample index (I am assuming from your question that you are using ES 1.7):
PUT netflix
{
"settings": {
"analysis": {
"analyzer": {
"prefix_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"edge_filter"
]
},
"keyword_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim"
]
}
},
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
},
"mappings": {
"movie":{
"properties": {
"name":{
"type": "string",
"fields": {
"prefix":{
"type":"string",
"index_analyzer" : "prefix_analyzer",
"search_analyzer" : "keyword_analyzer"
},
"raw":{
"type": "string",
"analyzer": "keyword_analyzer"
}
}
},
"tags":{
"type": "string", "index": "not_analyzed"
}
}
}
}
}
Using multi-fields, the name field is analyzed in different ways. name.prefix uses the keyword tokenizer with an edge n-gram filter, so that the string star wars is broken into s, st, sta, etc. At search time, keyword_analyzer is used so that the search query is not broken into multiple small tokens. name.raw is used for the aggregation.
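To make the two analyzers above concrete, here is a rough Python approximation of what they emit. It is only a sketch of the behaviour (keyword tokenizer, lowercase, trim, then edge n-grams), not how Elasticsearch implements it, and the function names are made up for illustration.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate prefix_analyzer: one lowercased, trimmed token,
    expanded into its leading substrings (edge n-grams)."""
    token = text.lower().strip()
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def keyword_token(text):
    """Approximate keyword_analyzer: the whole input as a single
    lowercased, trimmed token, with no further splitting."""
    return text.lower().strip()


# Terms stored for name.prefix at index time.
index_terms = edge_ngrams("Star Wars")
print(index_terms)

# At search time the query stays one token, so "sta" matches only
# because that exact prefix was indexed.
print(keyword_token("Sta") in index_terms)
print(keyword_token("wars") in index_terms)
```

Because only leading substrings are indexed, a query such as "wars" does not match name.prefix for "star wars"; the match is strictly on the beginning of the whole name.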
The following query will give top 10 suggestions.
GET netflix/movie/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"tags": "sci-fi"
}
},
"query": {
"match": {
"name.prefix": "sta"
}
}
}
},
"size": 0,
"aggs": {
"unique_movie_name": {
"terms": {
"field": "name.raw",
"size": 10
}
}
}
}
Results will be something like
"aggregations": {
"unique_movie_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "star trek",
"doc_count": 1
},
{
"key": "star wars",
"doc_count": 1
}
]
}
}
UPDATE :
I think you could use highlighting for this purpose. The highlight section gives you the whole word and the field it matched. You can also use inner hits with highlighting inside them to get the nested docs.
{
"query": {
"query_string": {
"query": "sta*"
}
},
"_source": false,
"highlight": {
"fields": {
"*": {}
}
}
}

How to sort ordinal values in elasticsearch?

Say I've got a field 'spicey' with possible values 'hot', 'hotter', 'smoking'.
There's an intrinsic ordering in these values: they're ordinals.
I'd like to be able to sort or filter on them using their intrinsic order. For example: give me all documents where spicey > hot.
Sure, I can translate the values to integers 0, 1, 2, but this requires extra housekeeping on both the index and the query side, which I'd rather avoid.
Is this possible in some way? I already contemplated using a multi-field mapping but am not sure if that would help me.
You can sort on string values with a script-based sort that maps each spicey string to a numeric value.
curl -XGET 'http://localhost:9200/yourindex/yourtype/_search' -d '
{
"sort": {
"_script": {
"script": "factor.get(doc[\"spicey\"].value)",
"type": "number",
"params": {
"factor": {
"hot": 0,
"hotter": 1,
"smoking": 2
}
},
"order": "asc"
}
}
}'
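The effect of the script sort above can be illustrated client-side in a few lines of Python. This is only a sketch of the ranking logic (look up each string in a rank table and order by the result); the sample documents and function names are invented for the example.

```python
# Same rank table as the "factor" parameter in the script sort.
FACTOR = {"hot": 0, "hotter": 1, "smoking": 2}


def sort_by_spiciness(docs, order="asc"):
    """Order documents by the rank of their 'spicey' value."""
    return sorted(
        docs,
        key=lambda doc: FACTOR[doc["spicey"]],
        reverse=(order == "desc"),
    )


def spicier_than(docs, level):
    """Keep documents whose rank is strictly above the given level."""
    return [doc for doc in docs if FACTOR[doc["spicey"]] > FACTOR[level]]


# Hypothetical sample documents.
docs = [
    {"id": 1, "spicey": "smoking"},
    {"id": 2, "spicey": "hot"},
    {"id": 3, "spicey": "hotter"},
]

print([doc["spicey"] for doc in sort_by_spiciness(docs)])
print([doc["id"] for doc in spicier_than(docs, "hot")])
```

The rank table still has to live somewhere; with the script sort it travels with each query instead of being stored in the index.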
One solution could be to create a specific analyzer for spice levels. The idea is to map each level to a discrete value which increases the more spicy the spice is.
{
"settings": {
"analysis": {
"char_filter": {
"spices": {
"type": "mapping",
"mappings": [
"mild=>1",
"hot=>2",
"hotter=>3",
"smoking=>4"
]
}
},
"analyzer": {
"spice_synonyms": {
"type": "custom",
"char_filter": "spices",
"tokenizer": "standard",
"filter": [
"standard"
]
}
}
}
},
"mappings": {
"ordinal": {
"properties": {
"spicy": {
"type": "string",
"fields": {
"level": {
"type": "string",
"analyzer": "spice_synonyms"
}
}
}
}
}
}
}
In the above index settings and mappings, the spicy field would contain the plain English word (hot, mild, etc.), while the spicy.level field would contain a discrete value that you can then use in queries and sorting.
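As a rough illustration of what the spices char filter does to a value before it is indexed into spicy.level, the sketch below replaces mapped substrings and prefers the longest key at each position, which is how I understand the mapping char filter to resolve overlapping keys such as hot and hotter. The function name is made up and this is not Elasticsearch code.

```python
# Same table as the "mappings" list of the char filter.
SPICE_MAPPINGS = {"mild": "1", "hot": "2", "hotter": "3", "smoking": "4"}


def spice_level(text, mappings=SPICE_MAPPINGS):
    """Replace every mapped substring, trying the longest keys first
    so that 'hotter' becomes '3' rather than '2ter'."""
    keys = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character.
            out.append(text[i])
            i += 1
    return "".join(out)


for word in ["mild", "hot", "hotter", "smoking"]:
    print(word, "->", spice_level(word))
```

Note that spicy.level is still mapped as a string, so comparisons on it are presumably lexicographic. That is fine for single-digit levels such as these, but a value like "10" would sort before "2".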
For instance, retrieving documents whose spice level is strictly greater than hot, in decreasing order (smoking first), could be done like this:
{
"sort": {
"spicy.level": "desc"
},
"query": {
"query_string": {
"query": "spicy.level:>2"
}
}
}
A range query would work, too:
{
"sort": {
"spicy.level": "desc"
},
"query": {
"range": {
"spicy.level": {
"gt": 2
}
}
}
}

Nested Objects aggregations (with Kibana)

We have an Elasticsearch index containing documents with a set of arbitrary nested objects called devices. Each of those devices has a key called "aw".
What I'm trying to accomplish is to get the average of the aw key for each device type.
When I try to aggregate and visualize this average, I don't get the average aw of each device type, but the average over all devices within the documents that contain the specific device.
So instead of fetching all documents where device.id=7 and aggregating the aw per device.id, Elasticsearch / Kibana fetches all documents containing device.id=7 but then builds its average using all devices within those documents.
Our index mapping looks like this (only important parts):
"mappings" : {
"devdocs" : {
"_all": { "enabled": false },
"properties" : {
"cycle": {
"type": "object",
"properties": {
"t": {
"type": "date",
"format": "dateOptionalTime||epoch_second"
}
}
},
"devices": {
"type": "nested",
"include_in_parent": true,
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"aw": {
"type": "long"
},
"t": {
"type": "date",
"format": "dateOptionalTime||epoch_second"
}
}
}
}
}
Kibana generates the following query:
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*"
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"cycle.t": {
"gte": 1290760324744,
"lte": 1448526724744,
"format": "epoch_millis"
}
}
}
],
"must_not": []
}
}
}
},
"aggs": {
"2": {
"terms": {
"field": "devices.name",
"size": 35,
"order": {
"1": "desc"
}
},
"aggs": {
"1": {
"avg": {
"field": "devices.aw"
}
}
}
}
}
}
Is there a way to aggregate the average aw on device level, or what am I doing wrong?
Kibana doesn't support nested aggregations yet (see the Nested Aggregations issue).
I had the same issue and solved it by building Kibana from source from the fork by user ppadovani (branch: nestedAggregations).
See the instructions for building Kibana from source.
After building, when you run Kibana it will contain a Nested Path text box and a reverse nested checkbox in the advanced options for buckets and metrics.
I used this for a nested terms aggregation on lines.category_1, lines.category_2 and lines.category_3 (lines being of nested type), with three buckets.
I would suggest adding a filter aggregation to leave everything with aw: 7. As the documentation describes it:
"Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents."
Kibana does not support Nested json.
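Outside Kibana, the per-device average asked for in the question can be requested with a nested aggregation. The sketch below first simulates in plain Python why the root-level query averages the wrong values (with include_in_parent the device fields are flattened onto the parent document), then builds a request body that scopes the aggregation to the nested devices. The sample documents and the aggregation names by_name and avg_aw are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents.
docs = [
    {"devices": [{"name": "pump", "aw": 10}, {"name": "fan", "aw": 30}]},
    {"devices": [{"name": "pump", "aw": 20}]},
]


def mean_by_key(values):
    return {key: sum(v) / len(v) for key, v in values.items()}


def flattened_avg(docs):
    """Root-level terms + avg: each bucket sees every aw value of
    every document that contains the device name."""
    values = defaultdict(list)
    for doc in docs:
        all_aw = [device["aw"] for device in doc["devices"]]
        for name in {device["name"] for device in doc["devices"]}:
            values[name].extend(all_aw)
    return mean_by_key(values)


def nested_avg(docs):
    """Nested terms + avg: each bucket sees only its own devices."""
    values = defaultdict(list)
    for doc in docs:
        for device in doc["devices"]:
            values[device["name"]].append(device["aw"])
    return mean_by_key(values)


# Request body that runs terms and avg inside the nested context.
nested_request = {
    "size": 0,
    "aggs": {
        "devices": {
            "nested": {"path": "devices"},
            "aggs": {
                "by_name": {
                    "terms": {"field": "devices.name", "size": 35},
                    "aggs": {"avg_aw": {"avg": {"field": "devices.aw"}}},
                }
            },
        }
    },
}

print(flattened_avg(docs))
print(nested_avg(docs))
```

In the simulation the flattened version reports 20.0 for both pump and fan, while the nested version reports 15.0 for pump and 30.0 for fan, which is the per-device average the question is after.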

I don't get any documents back from my elasticsearch query. Can someone point out my mistake?

I thought I had figured out Elasticsearch but I suspect I have failed to grok something, and hence this problem:
I am indexing products, which have a huge number of fields, but the ones in question are:
{
"show_in_catalogue": {
"type": "boolean",
"index": "no"
},
"prices": {
"type": "object",
"dynamic": false,
"properties": {
"site_id": {
"type": "integer",
"index": "no"
},
"currency": {
"type": "string",
"index": "not_analyzed"
},
"value": {
"type": "float"
},
"gross_tax": {
"type": "integer",
"index": "no"
}
}
}
}
I am trying to return all documents where "show_in_catalogue" is true, and there is a price with site_id 1:
{
"filter": {
"term": {
"prices.site_id": "1",
"show_in_catalogue": true
}
},
"query": {
"match_all": {}
}
}
This returns zero results. I also tried an "and" filter with two separate terms - no luck.
A subset of one of the documents returned if I have no filters looks like:
{
"prices": [
{
"site_id": 1,
"currency": "GBP",
"value": 595,
"gross_tax": 1
},
{
"site_id": 2,
"currency": "USD",
"value": 745,
"gross_tax": 0
}
]
}
I hope I am OK to omit so much of the document here; I don't believe it to be contingent but I cannot be certain, of course.
Have I missed a vital piece of knowledge, or have I done something terminally thick? Either way, I would be grateful for an expert's knowledge at this point. Thanks!
Edit:
At the suggestion of J.T. I also tried reindexing the documents so that prices.site_id was indexed - no change. Also tried the bool/must filter below to no avail.
To clarify, the reason I'm using an empty query is that the web interface may supply a query string, but the same code is used to simply filter all products. Hence I left in the query, but empty, since that's what Elastica seems to produce with no query string.
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}
}
}
You have site_id set as {"index": "no"}. This tells ElasticSearch to exclude the field from the index which makes it impossible to query or filter on that field. The data will still be stored. Likewise, you can set a field to only be in the index and searchable, but not stored.
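A small sketch for checking which fields in a mapping are excluded from the index, applied to the mapping posted in the question. The helper name unindexed_fields is made up for illustration; it only walks the dictionary and does not talk to Elasticsearch.

```python
def unindexed_fields(properties, prefix=""):
    """Return dotted paths of all fields mapped with "index": "no"."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("index") == "no":
            found.append(path)
        # Recurse into object fields.
        found.extend(unindexed_fields(spec.get("properties", {}), path + "."))
    return found


# The mapping from the question.
mapping = {
    "show_in_catalogue": {"type": "boolean", "index": "no"},
    "prices": {
        "type": "object",
        "dynamic": False,
        "properties": {
            "site_id": {"type": "integer", "index": "no"},
            "currency": {"type": "string", "index": "not_analyzed"},
            "value": {"type": "float"},
            "gross_tax": {"type": "integer", "index": "no"},
        },
    },
}

print(unindexed_fields(mapping))
```

In the posted mapping both fields used in the filter, show_in_catalogue and prices.site_id, are set to "index": "no". If only prices.site_id was reindexed as searchable, the term filter on show_in_catalogue may still match nothing.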
I'm new to ElasticSearch as well and can't always grok the questions! I'm actually confused by your query. If you are going to "just filter" then you don't need a query. What I don't understand is your use of two fields inside the term filter. I've never done this. I guess it acts as an OR? Also, if nothing matches, it seems to return everything. If you wanted a query with the results of that query filtered, then you would want to use a filtered query:
-d '{
"query": {
"filtered": {
"query": {},
"filter": {}
}
}
}'
If you just want to apply filters, this is the filter that should work without any "query" being necessary:
-d '{
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}'
