Can I get a field if I disabled the _source and _all in Elasticsearch - elasticsearch

Elasticsearch suggested disabling the _source and _all fields in my case. This is my mapping:
{
"template": "mq-body-*",
"settings": {
"number_of_shards": 3,
"number_of_replicas": 0,
"max_result_window": 100,
"codec": "best_compression"
},
"mappings": {
"_default_": {
"_source": {
"enabled": false
},
"_all": {
"enabled": false
}
},
"body": {
"properties": {
"body": {
"type": "string",
"doc_values": true,
"index": "not_analyzed"
}
}
}
}
}
The body.body field is very large (20k-300k); we don't need to index it, we rarely get it, and it's acceptable to lose. But after
PUT /mq-body-local/body/1
{"body":"My body"}
I can't retrieve the body via GET /mq-body-local/body/1?fields=body or via POST /mq-body-local/body/_search -d'{"fields":["body"]}'. The result says one document was found, but no field comes back. I know that without _source I can't get the original document back, but how can I retrieve my field?

From Elasticsearch's website:
The _source field contains the original JSON document body that was
passed at index time. The _source field itself is not indexed (and
thus is not searchable), but it is stored so that it can be returned
when executing fetch requests, like get or search
Disabling the source will prevent Elasticsearch from displaying it in the result set. However, filtering, querying and aggregations will not be affected.
So these two queries will not generate any results in terms of the actual body:
GET mq-body-local/body/_search
GET mq-body-local/body/1
However, since doc_values is enabled on the body field, you could run an aggregation that surfaces some of the field values, for example:
POST mq-body-local/body/_search
{
"aggs": {
"test": {
"terms": {
"field": "body"
}
}
}
}
This will produce the following result set (I've created some test records):
"aggregations": {
"test": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "my body",
"doc_count": 1
},
{
"key": "my body2",
"doc_count": 1
}
]
}
}
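Since the mapping enables doc_values on the body field, you can also retrieve the value per document without _source. A sketch, assuming Elasticsearch 2.x (the fielddata_fields search option, which was renamed docvalue_fields in 5.0):

```json
POST mq-body-local/body/_search
{
  "query": { "ids": { "values": [ "1" ] } },
  "fielddata_fields": [ "body" ]
}
```

The field value then comes back under fields in each hit, read from doc values rather than from the (disabled) _source.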

Related

elastic search copy_to field not filled

I'm trying to copy a main title field in Elasticsearch 5.6 to another field with index: false, so I can use that field to match the exact value.
However, after the reindex, a search performed with _source: ["exact_hoofdtitel"] shows that the field "exact_hoofdtitel" is not filled with the value of "hoofdtitel".
PUT producten_prd_5_test
{
"aliases": {},
"mappings": {
"boek": {
"properties": {
"hoofdtitel": {
"type": "text",
"copy_to": [
"suggest-hoofdtitel", "exact_hoofdtitel"
]
},
"suggest-hoofdtitel": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": false,
"preserve_position_increments": true,
"max_input_length": 50
},
"exact_hoofdtitel":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"index":false
}
}
}
}
}
},
"settings": {
"number_of_shards": "1",
"number_of_replicas": "0"
}
}
GET producten_prd_5_test/_search
{
"_source":["hoofdtitel","exact_hoofdtitel"]
}
"hits": [
{
"_index": "producten_prd_5_test",
"_type": "boek",
"_id": "9781138340671",
"_score": 1,
"_source": {
"hoofdtitel": "The Nature of the Firm in the Oil Industry"
}
},
I believe that you can achieve what you want without copy_to. Let me show you how and why you don't need it here.
How can I make both full-text and exact match queries on the same field?
This can be done with fields mapping attribute. Basically, with the following piece of mapping:
PUT producten_prd_5_test_new
{
"aliases": {},
"mappings": {
"boek": {
"properties": {
"hoofdtitel": {
"type": "text", <== analyzed for full-text search
"fields": {
"keyword": {
"type": "keyword" <== not analyzed, for exact match
},
"suggest": {
"type": "completion", <== analyzed for suggestions
"analyzer": "simple",
"preserve_separators": false,
"preserve_position_increments": true,
"max_input_length": 50
}
}
}
}
}
}
}
you will be telling Elasticsearch to index the same field three times: once for full-text search, once for exact match, and once for suggestions.
The exact search will be possible to do via a term query like this:
GET producten_prd_5_test_new/_search
{
"query": {
"term": {
"hoofdtitel.keyword": "The Nature of the Firm in the Oil Industry"
}
}
}
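For comparison, the full-text variant would hit the analyzed top-level field with a match query; a short sketch against the same index:

```json
GET producten_prd_5_test_new/_search
{
  "query": {
    "match": {
      "hoofdtitel": "oil industry"
    }
  }
}
```

This matches on analyzed terms, so it also finds documents whose titles merely contain those words.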
Why the field exact_hoofdtitel does not appear in the returned document?
Because copy_to does not change the source:
The original _source field will not be modified to show the copied
values.
It works like the _all field, allowing you to concatenate the values of multiple fields into one imaginary field and analyze it in a special way.
Does it make sense to do a copy_to to an index: false field?
With index: false the field will not be analyzed and will not be searchable (like in your example, the field exact_hoofdtitel.keyword).
It may still make sense to do so if you want to do keyword aggregations on that field:
GET producten_prd_5_test/_search
{
"aggs": {
"by copy to": {
"terms": {
"field": "exact_hoofdtitel.keyword"
}
}
}
}
This will return something like:
{
"aggregations": {
"by copy to": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The Nature of the Firm in the Oil Industry",
"doc_count": 1
}
]
}
}
}

Count of unique nested documents in ElasticSearch

The problem domain involves kiosks on which many tokens are displayed. A token is issued by exactly one issuer, and it can be present on multiple kiosks. The kiosk logic accepts/refuses users based on which tokens are present on that kiosk.
Our Elastic mapping is this:
"mappings": {
"Kiosk": {
"dynamic": "strict",
"properties": {
"kioskId": {
"type": "keyword"
},
"token": {
"type": "nested",
"include_in_parent": true,
"properties": {
"tokenId": {
"type": "keyword"
},
"issuer": {
"type": "keyword"
}
}
}
}
}
}
Here are two typical documents:
Kiosk1:
{
"kioskId": "123",
"token": {
"tokenId": "fp1",
"issuer": "i1"
}
}
Kiosk2:
{
"kioskId": "321",
"token": [
{
"tokenId": "fp1",
"issuer": "i1"
},
{
"tokenId": "fp2",
"issuer": "i2"
}
]
}
Now, the task is to find the count of all unique tokens in the system, bucketed by issuer. We've had no luck finding them so far. We tried this query:
POST _search
{
"aggs": {
"state": {
"nested": {
"path": "token"
},
"aggs": {
"TOKENS_BY_ISSUER": {
"terms": {
"field": "token.issuer"
}
}
}
}
}
}
This obviously gives this result:
"aggregations": {
"state": {
"doc_count": 3,
"TOKENS_BY_ISSUER": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "i1",
"doc_count": 2
},
{
"key": "i2",
"doc_count": 1
}
]
}
}
}
Is there a way to know that there are only two tokens in the system each issued by i1 and i2? Something like this...
"buckets": [
{
"key": "i1",
"doc_count": 1
},
{
"key": "i2",
"doc_count": 1
}
]
If not, where has the mapping gone wrong? It doesn't feel like an unusual mapping, though. Note that I have truncated the mapping posted here for brevity; we have further nested levels under token, carrying fields specific to a token and its parent kiosk.
You can change your query to something like this:
{
"query": {
"match_all": {}
},
"aggs":{
"state": {
"nested": {
"path": "token"
},
"aggs": {
"TOKENS_BY_ISSUER": {
"terms": {
"field": "token.issuer"
},
"aggs":{
"distinct_tokens":{
"cardinality":{"field":"token.tokenId"}
}
}
}
}
}
}
}
Note:
The cardinality aggregation in Elasticsearch has an error rate associated with it, as it uses the HyperLogLog approximation technique to count unique field values in a bucket. Hence the error rate increases as the number of tokens in your system grows.
When indexing the Kiosk1 document, token should be an array, to make sure the field is indexed consistently.
To increase the accuracy of the cardinality aggregation, try raising the precision_threshold parameter in the query; this comes at the cost of higher memory utilisation.
Check out the Elasticsearch Cardinality Aggregation documentation for further details.
I would rather recommend designing this around the requirement, and only if you are ready to accept the error percentages at scale.
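As a sketch of that tuning (the threshold value below is an arbitrary example; the default is 3000 and the maximum is 40000):

```json
"aggs": {
  "distinct_tokens": {
    "cardinality": {
      "field": "token.tokenId",
      "precision_threshold": 10000
    }
  }
}
```

Counts below the threshold are close to exact; above it, the HyperLogLog approximation takes over.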

Fielddata is disabled on text fields by default in elasticsearch

I have a problem: I upgraded from Elasticsearch 2.x to 5.1, and some of my data no longer works in the newer Elasticsearch because "Fielddata is disabled on text fields by default" (https://www.elastic.co/guide/en/elasticsearch/reference/5.1/fielddata.html). In 2.x it was enabled, it seems.
Is there way to enable fielddata automatically to text fields?
I tried code like this
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"properties": {
"fielddata-*": {
"type": "text",
"fielddata": true
}
}
}
}
}'
but it looks like Elasticsearch does not understand a wildcard in the field name there. My temporary solution is a Python script that runs every 30 minutes, scanning all indices and adding fielddata=true to any new fields.
The problem is that I have string data like "this is cool" in Elasticsearch.
curl -XPUT 'http://localhost:9200/example/exampleworking/1' -d '
{
"myfield": "this is cool"
}'
when trying to aggregate that:
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
"Fielddata is disabled on text fields by default. Set fielddata=true on [myfield]"
The Elasticsearch documentation suggests using .keyword instead of enabling fielddata. However, that does not return the data I want.
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield.keyword"
}
}
}
}'
returns:
"buckets" : [
{
"key" : "this is cool",
"doc_count" : 1
}
]
which is not the result I want. Then I add fielddata: true and everything works:
curl -XPUT 'http://localhost:9200/example/_mapping/exampleworking' -d '
{
"properties": {
"myfield": {
"type": "text",
"fielddata": true
}
}
}'
and then aggregate
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
returns the correct result:
"buckets" : [
{
"key" : "cool",
"doc_count" : 1
},
{
"key" : "is",
"doc_count" : 1
},
{
"key" : "this",
"doc_count" : 1
}
]
How can I add fielddata=true automatically to all text fields in all indices? Is that even possible? In Elasticsearch 2.x this works out of the box.
I will answer my own question:
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"strings2": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fielddata": true
}
}
}
]
}
}
}'
This does what I want: all new indices now get fielddata: true on string fields by default.
Adding "fielddata": true allows the text field to be aggregated, but this has performance problems at scale. A better solution is to use a multi-field mapping.
Unfortunately, this is hidden a bit deep in Elasticsearch's documentation, in a warning under the fielddata mapping parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#before-enabling-fielddata
Here's a complete example of how this helps with a terms aggregation, tested on Elasticsearch 7.12 as of 2021-04-24:
Mapping (in ES7, under the mappings property of the body of a "put index template" request etc):
{
"properties": {
"bio": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Four documents indexed:
{
"bio": "Dogs are the best pet."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are the greatest."
}
Aggregation query:
{
"size": 0,
"aggs": {
"bios_with_cats": {
"filter": {
"match": {
"bio": "cats"
}
},
"aggs": {
"bios": {
"terms": {
"field": "bio.keyword"
}
}
}
}
}
}
Aggregation query results:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"bios_with_cats": {
"doc_count": 3,
"bios": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Cats are cute.",
"doc_count": 2
},
{
"key": "Cats are the greatest.",
"doc_count": 1
}
]
}
}
}
}
Basically, this aggregation says "Of the documents whose bios are like 'cats', how many of each distinct bio are there?" The one document without "cats" in its bio property is excluded, and then the remaining documents are grouped into buckets, one of which has one document and the other has two documents.

Removing stopwords from basic Terms aggregation in Elasticsearch?

I'm a little new to Elasticsearch. Basically, I have a single index called posts with multiple post documents that take the following form:
"post": {
"id": 123,
"message": "Some message"
}
I'm trying to get the most frequently occurring words in the message field across the entire index, with a simple Terms aggregation:
curl -XPOST 'localhost:9200/posts/_search?pretty' -d '
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
'
Unfortunately, this aggregation includes stopwords, so I end up with a list of words like "and", "the", "then", etc. instead of more meaningful words.
I've tried applying an analyzer to exclude those stopwords, but to no avail:
curl -XPUT 'localhost:9200/posts/?pretty' -d '
{
"settings": {
"analysis": {
"analyzer": {
"standard": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}'
Am I applying the analyzer correctly, or am I going about this the wrong way? Thanks!
I guess you forgot to set an analyzer on the message field of your type mapping. Elasticsearch aggregates over its indexed terms, which means it won't see your stopwords if the field is analyzed correctly. You can check this link. I used the Sense plugin of Kibana to execute the following requests. Check the mapping creation request:
PUT /posts
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"stopwords": ["test", "testable"]
}
}
}
}
}
### Don't forget these lines
POST /posts/post/_mapping
{
"properties": {
"message": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
POST posts/post/1
{
"id": 1,
"message": "Some messages"
}
POST posts/post/2
{
"id": 2,
"message": "Some testable message"
}
POST posts/post/3
{
"id": 3,
"message": "Some test message"
}
POST /posts/_search
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
This is my result set for this search request:
{
"hits": {
...
},
"aggregations": {
"frequent_words": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "some",
"doc_count": 3
},
{
"key": "message",
"doc_count": 2
},
{
"key": "messages",
"doc_count": 1
}
]
}
}
}
In the latest version (5.5), the string type has been changed to text/keyword. I enabled stopwords for the title field and it works for search: if I search for "the", nothing is returned. But if I use the following for the aggregation:
"field": "message_analyzed.keyword"
I get the stopwords in the aggregation buckets too.
Any suggestions are welcome.
Thanks
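One way around this (a sketch; my_index, my_type and my_analyzer are placeholder names) is to aggregate on an analyzed text sub-field with fielddata enabled, instead of the keyword sub-field, so the terms the aggregation sees have already had stopwords removed at index time:

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "standard", "stopwords": "_english_" }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "message_analyzed": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" },
            "terms": {
              "type": "text",
              "analyzer": "my_analyzer",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}
```

Aggregating on message_analyzed.terms then buckets individual non-stopword tokens, with the usual fielddata memory caveat.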

Broken aggregation in elasticsearch

I'm getting erroneous results when performing a terms aggregation on the names field in my index.
The following is the mapping I have used for the names field:
{
"dbnames": {
"properties": {
"names": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Here are the results I'm getting for a simple terms aggregation on the field:
"aggregations": {
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Martin",
"doc_count": 1
},
{
"key": "John martin",
"doc_count": 1
},
{
"key": " Victor Moses",
"doc_count": 1
}
]
}
}
As you can see, I have the same names with different casings being shown as different buckets in the aggregation. What I want here is irrespective of the case, the names should be clubbed together.
The easiest way would be to make sure you properly case the value of your names field at indexing time.
If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set it as the index_analyzer for the names field. Such a custom analyzer needs to use the keyword tokenizer (i.e. take the whole field value as a single token) and the lowercase token filter (i.e. lowercase the value):
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"casing": { <--- custom casing analyzer
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"names": {
"type": "string",
"index_analyzer": "casing" <--- use your custom analyzer
}
}
}
}
}'
Then we can index some data:
curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'
And finally, the terms aggregation on the names field would return the expected results:
curl -XPOST localhost:9200/your_index/your_type/_search -d '{
"size": 0,
"aggs": {
"dbnames": {
"terms": {
"field": "names"
}
}
}
}'
Results:
{
"dbnames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john martin",
"doc_count": 2
},
{
"key": "victor moses",
"doc_count": 1
}
]
}
}
There are two options here:
1. The not_analyzed option - this has the disadvantage that the same string with different casing won't be seen as one term.
2. A keyword tokenizer + lowercase filter - this does not have the above issue.
I have outlined these two approaches and how to use them here - https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer
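For what it's worth, since Elasticsearch 5.2 the second approach can also be expressed without a custom analyzer, using a keyword field with a lowercase normalizer. A sketch, with placeholder index and type names:

```json
PUT your_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_norm": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "names": {
          "type": "keyword",
          "normalizer": "lowercase_norm"
        }
      }
    }
  }
}
```

A terms aggregation on names then buckets "John Martin" and "John martin" together under the key "john martin".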
