go elastic sum text field - go

I want to sum up the price field, but I get the error "Fielddata is disabled on text fields by default."
This is my code:
sumAgg := elastic.NewSumAggregation().Field("price")
q := query.Must(elastic.NewRangeQuery("price").Gt(0))
res, err := p.config.ElasticClient.Search().Index(idx).Query(q).Aggregation("sum", sumAgg).Size(0).Do(ctx)
This is the mapping:
"mappings": {
"properties": {
"price": {
"type": "scaled_float",
"scaling_factor": 100000
}
}
}
Can anybody help?

By default, fielddata is disabled on text fields because building it is very costly. It is mainly required for aggregations, which is your use case.
From the official docs:
Instead, text fields use a query-time in-memory data structure called
fielddata. This data structure is built on demand the first time that
a field is used for aggregations, sorting, or in a script. It is built
by reading the entire inverted index for each segment from disk,
inverting the term ↔︎ document relationship, and storing the result in
memory, in the JVM heap.
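Make sure the field you use in your aggregation has this enabled. For numeric fields, doc_values are used for aggregations, and they are enabled by default on numeric fields. Since the error only occurs for text fields, it suggests that price is actually mapped as text in the index you are querying, so check the live mapping with GET idx/_mapping rather than relying on the mapping you intended to apply.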

Related

Does non-indexed field update triggers reindexing in elasticsearch8?

My index mapping is the following:
{
"mappings": {
"dynamic": false,
"properties": {
"query_str": {"type": "text", "index": false},
"search_results": {
"type": "object",
"enabled": false
},
"query_embedding": {
"type": "dense_vector",
"dims": 768
}
}
}
}
The search_results field is disabled. The actual search is performed only via query_embedding; the other fields are just non-searchable data.
If I update the search_results field in an existing document, will it trigger reindexing?
The docs say that "The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way". So it seems logical not to reindex a document if the change only affects a non-indexed part, but I'm not sure.
Elasticsearch documents (Lucene segments) are immutable, so every change you make to a document deletes the old document and creates a new one. This is Lucene's behavior:
Lucene's index is composed of segments, each of which contains a
subset of all the documents in the index, and is a complete searchable
index in itself, over that subset. As documents are written to the
index, new segments are created and flushed to directory storage.
Segments are immutable; updates and deletions may only create new
segments and do not modify existing ones. Over time, the writer merges
groups of smaller segments into single larger ones in order to
maintain an index that is efficient to search, and to reclaim dead
space left behind by deleted (and updated) documents.
When you set enabled: false, you only keep the field's content out of the searchable structures; the data still lives in Lucene as part of _source, so updating it still rewrites the whole document.
You can see a similar answer here:
Partial update on field that is not indexed
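The delete-and-recreate behavior can be pictured with a toy model. The sketch below is not how Lucene is implemented; toyIndex and its append-only slice are invented purely to illustrate that an update never edits the old copy in place, it only flags it as deleted and appends a complete new copy.

```go
package main

import "fmt"

// entry is one stored copy of a document.
type entry struct {
	id      string
	source  string
	deleted bool // set when a newer copy supersedes this one
}

// toyIndex is an append-only log standing in for Lucene segments.
type toyIndex struct {
	entries []entry
	live    map[string]int // document id -> position of the current copy
}

func newToyIndex() *toyIndex {
	return &toyIndex{live: map[string]int{}}
}

// put writes a document. If the id already exists, the old copy is only
// flagged as deleted and a complete new copy is appended, regardless of
// which field changed.
func (ix *toyIndex) put(id, source string) {
	if pos, ok := ix.live[id]; ok {
		ix.entries[pos].deleted = true
	}
	ix.entries = append(ix.entries, entry{id: id, source: source})
	ix.live[id] = len(ix.entries) - 1
}

// deadCount returns how many superseded copies are waiting to be
// reclaimed, which is what a segment merge does in the real system.
func (ix *toyIndex) deadCount() int {
	n := 0
	for _, e := range ix.entries {
		if e.deleted {
			n++
		}
	}
	return n
}

func main() {
	ix := newToyIndex()
	ix.put("1", `{"query_str":"a","search_results":{"hits":1}}`)
	// Only the disabled field changes, yet a full new copy is written.
	ix.put("1", `{"query_str":"a","search_results":{"hits":2}}`)
	fmt.Println("copies stored:", len(ix.entries), "superseded:", ix.deadCount())
}
```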

How to compare two text fields in elastic search

I have a need to compare two text fields in elastic search, but they are text fields.
For normal fields I can use a script to compare them using doc['field'].value, but is there a way to do the same for text fields?
See the excerpt below from the ES docs:
By far the fastest most efficient way to access a field value from a script is to use the doc['field_name'] syntax, which retrieves the field value from doc values. Doc values are a columnar field value store, enabled by default on all fields except for analyzed text fields.
I know of two ways to access a text value from a script.
The first is to map a keyword representation of the text field as well and access that sub-field.
{
"mappings": {
"properties": {
"name":{
"type": "text",
"fields": {
"keyword":{ // <======= See this
"type":"keyword"
}
}
}
}
}
}
The keyword representation can be accessed like doc['name.keyword'].value.
It is recommended to index a keyword representation for small text fields like 'name' or 'emailId', but not for larger fields like 'description', due to the memory overhead.
The second way is to enable fielddata on the text field and access that field.
Fielddata is disabled on text fields by default. Set fielddata=true on
[your_field_name] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
However, using fielddata on text fields is not recommended.
Note: Please add details on why you need the comparison and what kind of comparison is required on the 'text' fields.
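For the keyword sub-field approach, a script query can do the comparison. Here is a rough sketch that builds such a search body with the Go standard library. The field names name and alias are placeholders, and scriptSource and compareBody are hypothetical helpers. It assumes both fields have a .keyword sub-field like the one in the mapping above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scriptSource returns a Painless expression that is true when the keyword
// sub-fields of the two text fields hold the same value.
func scriptSource(fieldA, fieldB string) string {
	return fmt.Sprintf("doc['%s.keyword'].value == doc['%s.keyword'].value", fieldA, fieldB)
}

// compareBody wraps the expression in a script query inside a bool filter.
func compareBody(fieldA, fieldB string) map[string]interface{} {
	return map[string]interface{}{
		"query": map[string]interface{}{
			"bool": map[string]interface{}{
				"filter": map[string]interface{}{
					"script": map[string]interface{}{
						"script": map[string]interface{}{
							"lang":   "painless",
							"source": scriptSource(fieldA, fieldB),
						},
					},
				},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(compareBody("name", "alias"), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Send this as the body of a _search request.
	fmt.Println(string(out))
}
```

Documents that are missing one of the fields would make .value fail, so a real script probably needs a doc['...'].size() check first. Also keep in mind that the comparison is exact: keyword sub-fields are not analyzed.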

Sorting Results efficiently by non mapped/searchable Object properties

In my current index, called items, I used to sort query results by some values of a specific object property; let's call this property oprop. After I realized that each new key in oprop increased the total number of fields used by the items index, I had to change the mapping. So I set oprop's mapping to dynamic: false, so the keys of oprop are no longer searchable (not indexed).
Now, in some of my queries, I need to sort the results by the values of oprop keys. I don't know whether Elasticsearch can still let me sort on these key values.
Do I need to use scripts for sorting? Do I have access to non-indexed data when using scripts?
Somehow I don't think this is a good approach, and I think that in the long term I will run into performance issues.
You could use scripts for sorting, since the data is stored in the _source field, but that should probably be a last resort. If you know which keys need to be sortable, you could add just those to the mapping and keep oprop non-dynamic otherwise. Map them as keyword (or a numeric or date type) rather than text, because text fields cannot be sorted on unless fielddata is enabled:
"properties": {
"oprop": {
"dynamic": false,
"properties": {
"sortable_key_1": {
"type": "keyword"
},
"sortable_key_2": {
"type": "keyword"
}
}
}
}
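Once a key is explicitly mapped with a sortable type such as keyword, a regular sort clause works and no script is needed. A small sketch of building that request body in Go follows. sortByOprop is a hypothetical helper, and the key name is taken from the mapping above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sortByOprop builds a search body that matches all documents and sorts them
// by one explicitly mapped key of the oprop object.
func sortByOprop(key, order string) map[string]interface{} {
	return map[string]interface{}{
		"query": map[string]interface{}{
			"match_all": map[string]interface{}{},
		},
		"sort": []interface{}{
			map[string]interface{}{
				"oprop." + key: map[string]interface{}{"order": order},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(sortByOprop("sortable_key_1", "asc"), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Send this as the body of a search request against the items index.
	fmt.Println(string(out))
}
```

Keys of oprop that are not listed in the mapping stay in _source only, so they remain retrievable but cannot be used in this kind of sort.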

What is the difference between source filtering, stored fields, and doc values in Elasticsearch?

I've read the docs for source filtering, stored fields, and doc values.
In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field
The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
All fields which support doc values have them enabled by default.
Example 1
I have documents with title (short string), and content (>1MB). I want to search for matching titles, and return the title.
With source filtering:
GET /_search
{ "_source": "obj.title", ... }
With stored fields:
GET /_search
{ "_source": false, "stored_fields": ["title"], ... }
With doc values:
GET /_search
{ "_source": false, "stored_fields": "_none_", "docvalue_fields": ["title"], ... }
Okay, so
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
Will the source filtered request use doc values?
Do stored fields store the analyzed tokens or the original value?
Are stored fields or doc values more or less efficient than _source?
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
The document you send for indexing to Elasticsearch will be stored in a field called _source (by default). So this means that if your document contains a large amount of data (like in the content field in your case), the full content will be stored in the _source field. When using source filtering, first the whole source document must be retrieved from the _source field and then only the title field will be returned. You're wasting space because nothing really happens with the content field, since you're searching on title and returning only the title value.
In your case, you'd be better off not storing the _source document and only storing the title field (this has some disadvantages too, so read up on the consequences of disabling _source before you do), basically like this:
PUT index
{
"mappings": {
"_source": {
"enabled": false
},
"properties": {
"title": {
"type": "text",
"store": true
},
"content": {
"type": "text"
}
}
}
}
Will the source filtered request use doc values?
Doc values are enabled by default on all fields except analyzed text fields. Source filtering does not use doc values: as explained above, the _source field is retrieved and then filtered down to the fields you specified.
Do stored fields store the analyzed tokens or the original value?
Stored fields store the exact value as present in the _source document.
Are stored fields or doc values more or less efficient than _source?
doc_values are a different beast: they are more of an optimization that stores the values of non-analyzed fields in a way that makes it easy to sort, filter, and aggregate on those values.
Stored fields (store defaults to false) are also an optimization, for when you don't want to store the full source but only a few important fields (as explained above).
The _source field itself is a stored field that contains the whole document.
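With the mapping suggested above (_source disabled, title stored), a search that matches on title and returns only the stored title could be built roughly like this. This is a sketch using the Go standard library. storedTitleSearch is a hypothetical helper and the search text is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storedTitleSearch builds a search body that matches on title and asks
// Elasticsearch to return only the stored title field for each hit.
func storedTitleSearch(text string) map[string]interface{} {
	return map[string]interface{}{
		"stored_fields": []string{"title"},
		"query": map[string]interface{}{
			"match": map[string]interface{}{"title": text},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(storedTitleSearch("some title"), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Send this as the body of a _search request.
	fmt.Println(string(out))
}
```

In the response, stored fields come back under the fields key of each hit and are always returned as arrays, so the title has to be read as the first element of fields.title.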

Elasticsearch: Constant Data Field Type

Is there a way to add an Elasticsearch data field to an index mapping, such that it always returns a constant numeric value?
I know I can just add a numeric datatype, and then reindex everything with the constant, but I would like to avoid reindexing, and I'd also like to be able to change the constant dynamically without reindexing.
Motivation: Our cluster has a lot of different indexes. We routinely search multiple indexes at once for various reasons. However, when searching multiple indices, our search logic still needs to treat each index slightly differently. One way we could do this is by adding a constant numeric field to each index, and then use that field in our search query.
However, because this is a constant, it seems like we should not need to reindex everything (seems pointless to add a constant value to every record).
You could use the _meta field for that purpose:
PUT index1
{
"mappings": {
"_meta": {
"constant": 1
},
"properties": {
... your fields
}
}
}
PUT index2
{
"mappings": {
"_meta": {
"constant": 2
},
"properties": {
... your fields
}
}
}
You can change that constant value at any time without reindexing anything. The value is stored at the index level and can be retrieved at any time by simply retrieving the index mapping with GET index1,index2/_mapping
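On the application side, the constants can then be read from that mapping response before you build the actual search. A minimal sketch with the Go standard library is shown below. metaConstants is a hypothetical helper, and the sample response is trimmed down to the _meta part.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexEntry mirrors only the _meta part of one index in a
// GET index1,index2/_mapping response.
type indexEntry struct {
	Mappings struct {
		Meta map[string]interface{} `json:"_meta"`
	} `json:"mappings"`
}

// metaConstants returns the numeric _meta.constant value per index name.
// Indexes without a numeric constant are left out of the result.
func metaConstants(resp []byte) (map[string]float64, error) {
	var parsed map[string]indexEntry
	if err := json.Unmarshal(resp, &parsed); err != nil {
		return nil, err
	}
	out := make(map[string]float64, len(parsed))
	for name, entry := range parsed {
		if v, ok := entry.Mappings.Meta["constant"].(float64); ok {
			out[name] = v
		}
	}
	return out, nil
}

func main() {
	// Trimmed sample response for the two indexes created above.
	resp := []byte(`{
		"index1": {"mappings": {"_meta": {"constant": 1}}},
		"index2": {"mappings": {"_meta": {"constant": 2}}}
	}`)
	constants, err := metaConstants(resp)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(constants["index1"], constants["index2"])
}
```

Note that _meta is not a document field, so it cannot be referenced inside the query itself. The application has to look it up and adapt the query per index.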
