How to compare two text fields in elastic search - elasticsearch

I need to compare two fields in Elasticsearch, but they are text fields.
For normal fields I can compare them in a script using doc['field'].value, but is there a way to do the same for text fields?

See this excerpt from the ES docs:
By far the fastest most efficient way to access a field value from a script is to use the doc['field_name'] syntax, which retrieves the field value from doc values. Doc values are a columnar field value store, enabled by default on all fields except for analyzed text fields.
There are two ways I know of to access a text field's value from a script.
The first is to also map a keyword representation of the text field and access that sub-field:
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {     // <======= See this
            "type": "keyword"
          }
        }
      }
    }
  }
}
The keyword representation can then be accessed in a script as doc['name.keyword'].value.
Indexing a keyword representation is recommended for small text fields like 'name' or 'emailId', but not for larger fields like 'description', due to the memory overhead.
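For example, a script query comparing two such keyword sub-fields might look like this; this is a sketch, the index name and the second field other_name are hypothetical, and the size() checks guard against documents where a value is missing:
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['name.keyword'].size() != 0 && doc['other_name.keyword'].size() != 0 && doc['name.keyword'].value == doc['other_name.keyword'].value"
          }
        }
      }
    }
  }
}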
The second way is to enable fielddata on the text field and access it directly.
Fielddata is disabled on text fields by default. Set fielddata=true on
[your_field_name] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
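For illustration, enabling fielddata on an existing text field might look like this (a sketch using a 7.x-style request; the index and field names are hypothetical):
PUT my-index/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "fielddata": true
    }
  }
}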
However, enabling fielddata on text fields is generally not recommended.
Note: Please add details on why you need the comparison and what kind of comparison is required on the text fields.

Related

go elastic sum text field

I want to sum up the field price, but I got the error "Fielddata is disabled on text fields by default."
This is my code:
sumAgg := elastic.NewSumAggregation().Field("price")
q := query.Must(elastic.NewRangeQuery("price").Gt(0))
res, err := p.config.ElasticClient.Search().Index(idx).Query(q).Aggregation("sum", sumAgg).Size(0).Do(ctx)
This is the mapping:
"mappings": {
"properties": {
"price": {
"type": "scaled_float",
"scaling_factor": 100000
},
}
}
Can anybody help?
By default, fielddata is disabled on text fields because it is very costly, and it is mainly required for aggregations (your use case); the Elasticsearch documentation explains the reasons in detail.
From the official docs:
Instead, text fields use a query-time in-memory data structure called
fielddata. This data structure is built on demand the first time that
a field is used for aggregations, sorting, or in a script. It is built
by reading the entire inverted index for each segment from disk,
inverting the term ↔︎ document relationship, and storing the result in
memory, in the JVM heap.
Make sure the field you are using in your aggregation has this enabled. For numeric fields, doc_values are used for aggregations, and they are enabled by default.
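Given the error, the price field in the index actually being queried is likely mapped as text rather than the scaled_float shown above. One way to check the live mapping (reusing idx from the question's code as the index name) is:
GET idx/_mapping/field/price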

Sorting Results efficiently by non mapped/searchable Object properties

In my current index, called items, I have previously sorted query results by some values of a specific object property; let's call this property just oprop. After I realized that with each new key in oprop the total number of fields used on the items index increased, I had to change the mapping on the index. So I set oprop's mapping to dynamic: false, so the keys of oprop are no longer searchable (not indexed).
Now, in some of my queries, I need to sort the results on the items index by the values of oprop keys. I don't know how Elasticsearch can still give me the possibility to sort on these key values.
Do I need to use scripts for sorting? Do I have access to non-indexed data when using scripts?
Somehow I don't think this is a good approach, and I think that in the long term I will run into performance issues.
You could use scripts for sorting, since the data will still be stored in the _source field, but that should probably be a last resort. If you know which fields need to be sortable, you could just add those to the mapping, and keep oprop as a non-dynamic field otherwise?
"properties": {
"oprop": {
"dynamic": false,
"properties": {
"sortable_key_1": {
"type": "text"
},
"sortable_key_2": {
"type": "text"
}
}
}
}
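A sketch of the resulting sort query, assuming the sortable keys end up with a sortable mapping (for example a keyword type or a keyword sub-field; sorting on a plain text field would additionally require fielddata):
GET items/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "oprop.sortable_key_1": { "order": "asc" } }
  ]
}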

What is the difference between source filtering, stored fields, and doc values in Elasticsearch?

I've read the docs for source filtering, stored fields, and doc values.
In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field
The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
All fields which support doc values have them enabled by default.
Example 1
I have documents with title (short string), and content (>1MB). I want to search for matching titles, and return the title.
With source filtering
GET /_search
{ _source: "obj.title", ... }
With stored fields
GET /_search
{ "_source": false, "stored_fields": ["title"], ... }
With doc values
GET /_search
{ "_source": false, "stored_fields": "_none_", "docvalue_fields": ["title"], ... }
Okay, so
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
Will the source filtered request use doc values?
Do stored fields store the analyzed tokens or the original value?
Are stored fields or doc values more or less efficient than _source?
Will the source filtered request read the full _source, title and content, from disk then apply the filter and return only the title, or will elasticsearch only read the title from disk?
The document you send for indexing to Elasticsearch will be stored in a field called _source (by default). So this means that if your document contains a large amount of data (like in the content field in your case), the full content will be stored in the _source field. When using source filtering, first the whole source document must be retrieved from the _source field and then only the title field will be returned. You're wasting space because nothing really happens with the content field, since you're searching on title and returning only the title value.
In your case, you'd be better off not storing the _source document and only storing the title field (though disabling _source has some disadvantages, too, so read up on it before you do), basically like this:
PUT index
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}
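With that mapping, a search can return just the stored title without touching _source; for example (the match query is just a placeholder):
GET index/_search
{
  "stored_fields": ["title"],
  "query": {
    "match": { "title": "elasticsearch" }
  }
}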
Will the source filtered request use doc values?
doc values are enabled by default on all fields except analyzed text fields. If you use _source filtering, it does not use doc values; as explained above, the _source field is retrieved and then the fields you specified are filtered out of it.
Do stored fields store the analyzed tokens or the original value?
Stored fields store the exact value as present in the _source document, not the analyzed tokens.
Are stored fields or doc values more or less efficient than _source?
doc_values are a different beast; they are more of an optimization that stores the values of non-analyzed fields in a way that makes it easy to sort, filter, and aggregate on those values.
Stored fields (off by default) are also an optimization if you don't want to store the full source but only a few important fields (as explained above).
The _source field itself is a stored field that contains the whole document.

What is the difference between a field and a property in Elasticsearch?

I'm currently trying to understand the difference between fields (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) and properties (https://www.elastic.co/guide/en/elasticsearch/reference/current/properties.html).
They are both somehow defined as a "sub-field/sub-property" of a type/mapping property, both can have separate types and analyzers (as far as I understand it), and both are accessed with dot notation (mappingProperty.subField or mappingProperty.property).
I have the feeling the docs use the terms "field" and "property" interchangeably, for example:
Type mappings, object fields and nested fields contain sub-fields,
called properties.
What is the difference between properties and (sub-)fields? How do I decide if I have a property or a field?
In other words, how do I decide if I use
{
  "mappings": {
    "_doc": {
      "properties": {
        "myProperty": {
          "properties": {}
        }
      }
    }
  }
}
or
{
  "mappings": {
    "_doc": {
      "properties": {
        "myProperty": {
          "fields": {}
        }
      }
    }
  }
}
Sub-fields are indexed from the parent property's value in the source, while sub-properties need to have a "real" value of their own in the document's source.
If your source contains a real object, you need to create properties. Each property will correspond to a different value from your source.
If you only want to index the same value with different analyzers, then use sub-fields.
It is often useful to index the same field in different ways for
different purposes. This is the purpose of multi-fields. For instance,
a string field could be mapped as a text field for full-text search,
and as a keyword field for sorting or aggregations:
(Sorry, I find it hard to explain =| )
Note: This is an explanation from my current understanding. It may not be 100% accurate.
A property is what we used to call a field in an RDBMS (a standard relational DB like MySQL). It stores properties of an object and provides the high-level structure for an index (which we can compare to a table in a relational DB).
A field, which is linked (or included) into the property concept, is a way to index that property using a specific analyzer.
So let's say you have:
One analyzer (A) to uppercase
One analyzer (B) to lowercase
One analyzer (C) to translate to Spanish (this doesn't even exist, just to give you an idea)
What an analyzer does is transform the input (the text of a property) into a series of tokens that will be indexed. When you search, the same analyzer is used, so the text is transformed into those tokens; each one is given a score, and then those tokens are used to grab documents from the index.
(A) Dog = DOG
(B) Dog = dog
(C) Dog = perro
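You can see this tokenization with the _analyze API; for example, the built-in english analyzer stems "Dogs" down to the token "dog" (a sketch):
GET _analyze
{
  "analyzer": "english",
  "text": "Dogs"
}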
To search using a specific field configuration you call it using a dot:
The text field uses the standard analyzer.
The text.english field uses the English analyzer.
So the fields basically allow you to perform searches using different token generation models.
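A sketch of the multi-field mapping that quote describes (the index name is hypothetical):
PUT my-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}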

What does it mean when <property>.keyword appears in Elasticsearch?

I am a beginner in the world of Elasticsearch, and I don't know what it means when a property is called .keyword. It only appears when I'm in the "Management" > "Index Pattern" section.
Only those properties (property.keyword) have the 'aggregatable' option active.
What's the difference between 'locality' and 'locality.keyword'?
And I don't get the same result when I do
{'match': {'locality': "Sant Climent"}}
or
{'match': {'locality.keyword': "Sant Climent"}}
Could someone explain the difference to me and what each one is used for? I'm going crazy.
(I'm using the latest version of Elasticsearch BTW, 6.X).
Your field is indexed two times. First, locality is indexed with the text datatype and is used to perform full-text search: the text is subdivided into tokens, and you can apply filter transformations and retrieve every single word. E.g. once you have tokenized a text, you can apply a stopword list to remove unmeaningful words, or a stemmer, which gives you the possibility of retrieving a word from its stem. You query against this field with the match query.
locality.keyword is the same content indexed as the keyword datatype. That means your text is treated as a single token, and you can retrieve your content only with a literal search on that field (if you don't provide a normalizer, the keyword is also case sensitive!), with no full-text search. You query against this field with the term query.
For that reason you can make aggregations only on the keyword field content: the server can do a kind of group-by on the field, because its content is stored as-is and is not composed of different units (the tokens). If you have any other doubt, please ask for help :-)
You need to understand the text datatype and keyword datatype.
You can map fields as text and keyword combined like this:
{
  "properties": {
    "locality": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    ... other fields here
  }
}
When doing this, you can now query the field locality as text, or you can use locality.keyword to query it as a keyword.
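To make the difference concrete, here is a sketch of the two query styles against a hypothetical index: the match query analyzes the input and searches the full-text field, while the term query looks for the exact, unanalyzed value in the keyword sub-field.
GET my-index/_search
{
  "query": { "match": { "locality": "Sant Climent" } }
}

GET my-index/_search
{
  "query": { "term": { "locality.keyword": "Sant Climent" } }
}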
