Does Elasticsearch store or not store field values by default?

In Elasticsearch, every field in a mapping has a store property which determines whether the field's value will be stored on disk as a separate stored field (in addition to the storing of the whole _source).
It defaults to false.
However, each segment in every shard also has a doc_values structure per field in the mapping. The structure stores the value of the field for all documents in the segment.
By default, all documents and fields are included in this structure.
So on one hand, by default Elasticsearch doesn't store the values for fields. On the other hand, it does store the values in the doc_values structure.
So which is it? Does Elasticsearch store values by default or not?

ES stores the same field in multiple formats for different purposes.
For example, consider this mapping:
"prop_1": {
  "type": "string",
  "index": "not_analyzed",
  "store": true,
  "doc_values": true
}
prop_1 would be stored on its own as an indexed field, as a doc_values field, and as a stored field. On top of that, prop_1 is stored into the _source field together with your other fields.
As explained above, even with store: false, the field data is still persisted on disk in multiple formats.
Stored fields are designed for optimal storage, whereas doc values are designed to access field values quickly. During the execution of a query, the doc values of many candidate hits are accessed, so the access must be fast. This is the reason why you should use doc values for sorting, aggregations and scripts. On the other hand, stored fields should be used to return field values for the top matching documents.
Now, you can use doc values to return fields in the response as well:
GET /_search
{
  "query": {
    "match_all": {}
  },
  "docvalue_fields": ["test1", "test2"]
}
Doc value fields work even on fields that are not stored. So, IMO, stored fields do not have much significance any more.

Related

In elasticsearch, how does aggregation work on fields which are not stored

In the documents that I index in Elasticsearch I have 6 fields: a, b, c, d, e, f. I have set _source=false for the type; for fields a and b I have set store=true, and for fields c, d, e, f I have set store=false.
As far as my understanding of aggregation in Elasticsearch goes, aggregation works on the results of a query. But since I have set store=true only for fields a and b, my search returns only a and b. What if I want to aggregate on field c? How will this aggregation work if I set store=false? To make aggregation work on field c, will I have to set store=true for it?
You are correct that in order to do aggregations, the underlying values have to be saved somewhere on disk. These are not saved in the standard stored fields or the _source field, but are instead saved in doc_values.
This means that even if you set store=false and _source=false, the values that you index may still be saved if the doc values are kept. Doc values are automatically turned on for not_analyzed strings and numeric fields, but you can manually turn them off by setting doc_values: false in the mapping. If you turn them off, then aggregating on these fields will not work.
As a result of the above, you can retrieve the underlying values even if store=false by querying the doc values directly. Information on how to do that is located here.
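As an illustrative sketch (index, type and field names here are hypothetical), a mapping that disables doc values for a field looks like the following; with doc_values turned off, sorting and aggregating on that field will no longer work:

```json
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "c": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": false
        }
      }
    }
  }
}
```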

How to change field value in elasticsearch from string to integer?

I have some data indexed in elasticsearch, in _source I have a field to store file size:
{"file_size":"25.2MB"}
{"file_size":"2GB"}
{"file_size":"800KB"}
Currently the mapping of this field is string. I want to search with sorting by file_size. I guess I need to change the mapping to integer and reindex.
How can I calculate the size in bytes and re-index them as integer?
Elasticsearch does not support in-place field reindexing, as documents in Lucene's index are immutable. So, internally, every document needs to be fetched, changed, and indexed back into the index, and the old copy removed. This is true whether you need to change the mapping or the data.
So, about practical part. Straightforward way:
Create new index with proper mapping
Fetch all your documents from old index
Change your file_size field to integer according to any logic you need
Index documents to new index
Drop old index after full migration
So the application side will contain additional logic to transform the data from human-readable strings to Long, plus standard ES driver functionality. To speed this process up, consider using scroll-scan for reads and the bulk API for writes. For the future, I recommend using aliases so that you can migrate your data seamlessly.
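A minimal sketch of that transformation step (assuming binary units, i.e. 1KB = 1024 bytes, and the unit spellings shown in the question; adjust both to your actual data):

```python
import re

# Multipliers for the binary interpretation (1 KB = 1024 bytes);
# switch to powers of 1000 if your data uses decimal units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}

def size_to_bytes(size):
    """Convert a human-readable size like '25.2MB' to an integer byte count."""
    match = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)", size.strip())
    if not match:
        raise ValueError("unrecognized size: %r" % size)
    value, unit = match.groups()
    return int(float(value) * UNITS[unit.upper()])

print(size_to_bytes("25.2MB"))  # 26424115
print(size_to_bytes("2GB"))     # 2147483648
print(size_to_bytes("800KB"))   # 819200
```

Run this on each document's file_size before indexing it into the new index, then sort on the resulting integer field.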
In case you can't make server-side changes for some reason, you can potentially add a new field with the proper mapping and fire off ES-side updates with scripted partial updates. Or try your luck with an experimental plugin.
Why not sort by the keyword sub-field?
Just add this:
{
  "sort": {
    "file_size.keyword": {
      "order": "asc"
    }
  }
}
But it will only sort by string, so data like 2.5GB, 1KB, 5KB will come out as 1KB, 2.5GB, 5KB.
I think you have to convert it into bytes first, so you can sort it easily once everything is in the same format.

Stored field in elastic search

In the documentation, for some types, such as numbers and dates, it specifies that store defaults to no, but that the field can still be retrieved from the JSON.
That's confusing. Does this mean the _source?
Is there a way to not store a field at all, and just have it indexed and searchable?
None of the field types are stored by default; only the _source field is. That means you can always get back what you sent to the search engine. Even if you ask for specific fields, Elasticsearch is going to parse the _source field for you and give you back those fields.
You can disable the _source if you want, but then you can only retrieve the fields that you explicitly stored, according to your mapping.
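As a sketch (index, type and field names are hypothetical), here is a mapping with _source disabled and one explicitly stored field. The body field is still indexed and searchable, but its value can no longer be retrieved, which answers the question above:

```json
PUT /my_index
{
  "mappings": {
    "my_type": {
      "_source": { "enabled": false },
      "properties": {
        "title": { "type": "string", "store": true },
        "body":  { "type": "string" }
      }
    }
  }
}
```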

ElasticSearch: mappings for fields that are sorted often

Suppose I have a field "epoch_date" that will be sorted often when I do Elastic Search queries. How should I map this field? Right now, I just have stored: yes. Should I index it even though this field will not count towards the relevancy scoring? What should I add to this field if I intend to sort on this field often, so it will be more efficient?
{
  "tweet": {
    "properties": {
      "epoch_date": {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}
There's nothing you need to change to sort on the field given your mapping. You can only sort on a field if it's indexed, and the default is "index": "yes" for numerics and dates. You cannot set a numeric type to analyzed, since there's no text to analyze. Also, it's better to use the date type for a date instead of an integer.
Sorting can be memory-expensive if the field you are sorting on has a lot of unique terms. Just make sure you have enough memory for it. Also, keep in mind that when sorting on a specific field you throw away relevance ranking, which is a big part of what a search engine is all about.
Whether you store the field has nothing to do with sorting; it only affects the way you retrieve the value in order to return it together with your search results. If you use the _source field (default behaviour) there's no reason to store specific fields. If you ask for specific fields using the fields option when querying, then the stored fields are retrieved directly from Lucene rather than extracted by parsing the _source JSON.
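For illustration (index name hypothetical; this uses the 1.x-era fields parameter, which later versions renamed to stored_fields), a query that asks for the field directly instead of relying on _source extraction:

```json
GET /tweets/_search
{
  "query": { "match_all": {} },
  "fields": ["epoch_date"]
}
```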
An index is used for efficient sorting. So YES, you want the field to be indexed.
As to making it "more efficient", I'd kindly advise you to first check your results and see whether they're fast enough. I don't see a reason beforehand (with the limited info you provided) to think it wouldn't be efficient.
If you intend to filter on the field as well (date ranges?), be sure to use filters instead of queries whenever the filters will be reused often, because filters can be cached efficiently.
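A sketch of such a cached range filter combined with a sort (index name and epoch values are hypothetical; this uses the 1.x filtered query, which later versions replaced with a bool query's filter clause):

```json
GET /tweets/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": { "epoch_date": { "gte": 1388534400, "lt": 1391212800 } }
      }
    }
  },
  "sort": [ { "epoch_date": { "order": "desc" } } ]
}
```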

Field not searchable in ES?

I created an index myindex in elasticsearch, loaded a few documents into it. When I visit:
localhost:9200/myindex/mytype/1023
I noticed that my particular index has the following metadata for mappings:
mappings: {
  mappinggroupname: {
    properties: {
      Aproperty: {
        type: string
      },
      Bproperty: {
        type: string
      }
    }
  }
}
Is there some way to add "store": "yes" and "index": "analyzed" without having to reload/reindex all the documents?
Note that when i want to view a single document...
i.e. localhost:9200/myindex/mytype/1023
I can see that the _source field contains all the fields of that document, and when I go to the "Browser" section of the head plugin it appears that all the columns are correct and correspond to my field names. So why is "store" not showing up in the metadata? I can even perform a _search on them.
What is the difference between "store": true and the fact that I can see all my fields and values after indexing all my documents via the means I mention above?
Nope, no way! That's how your documents got indexed in the underlying Lucene. The only way to change it is to reindex them all!
You see all those fields because you see the content of the special _source field in Lucene, which is stored by default by Elasticsearch. You are not storing all the fields separately, but you do have the source document that you originally indexed in the _source, a single field that contains the whole document.
Generally the _source field alone is enough; you don't usually need to configure every field as stored.
Also, the default for all string fields, if not specified, is "index": "analyzed". That means those fields are indexed and analyzed using the standard analyzer unless specified otherwise in the mapping. Therefore, as far as I can see from your mapping, those two fields should be indexed, and thus searchable.
