In elasticsearch, how does aggregation work on fields which are not stored - elasticsearch

In the document that I index in elastic search i have 6 columns a,b,c,d,e,f. I have set _source=false for all, for columns a,b, I have set stored=true and for columns c,d,e,f, I have set stored=false.
As far as my understanding of aggregation in elasticsearch goes, aggregation works on the results of a query. But since I have set stored=true only for columns a,b, my search returns only columns a,b. What if I want to aggregate based on the column c. How will this aggregation work if I set stored=false. To make aggregation work on column c, will I have to set stored=true for it ?

You are correct that in order to do aggregations, the underlying values have to be saved somewhere on disk. These are not saved in the standard stored or _source fields of the mapping, but are instead saved in the doc_values.
This means that even if you set stored=False and _source=False, the values that you index may still nevertheless be saved if the doc values are saved. Doc values are automatically turned on for unanalyzed strings and numbers, but you can manually turn them off by setting doc_values to False in the mapping. If you turn them off, then aggregating on these fields will not work.
As a result of the above, you can retrieve the underlying values even if stored=False by querying the doc_values directly. Information on how to do that is located here.

Related

screen out document results that share the same property value accept the first one

I have a db of documents. Every document has a property(keyword) called index (noting to do with the elastic index) and a property(keyword) named superIndex. There can be multiple documents with the same index and multiple documents with the same superIndex in the DB, these fields are not unique.
I run a compound query searching free text on the text content of these documents, with sorting, and get the results I want. However, I get many documents having the same index and/or superIndex. Currently I programmatically filter the result list and take only the first result from each index and superIndex. My requirement is that at the end I'm left with the top results from the sort, the first from each index and superIndex.
Can this be done using elastic query. If so how?
Field collapsing allows you to collapse all search results having the same value in a field (e.g. index). (See Elasticsearch Reference: Field Collapsing)

Filtering Elasticsearch fields from index/store

I was wondering what is the recommended approach to filter out some of the fields that are sent to Elasticsearch from Store and Index?
I want to filter our some fields from getting indexed in Elasticsearch. You may ask why you are sending them to Elasticsearch from the first place. Unfortunately, it is sent via another application that doesn't accept any filtering mechanism. Hence, filtering should be addressed at the time of indexing. Here is what we have done, but I am not sure what would be the consequences of these steps:
1- Disable dynamic mapping ("dynamic": "false" ) in ES templates.
2- Including only the required fields in _source and excluding the rest.
According to ES website, some of the ES functionalities will be disabled by disabling _source fields. Given I don't need the filtered fields at all, I was wondering whether the mentioned solution will break anything regarding the remaining fields or not?
There are a few mapping parameters that allow you to do what you want:
index: true/false: if true the field value is indexed in order to be searched later on (default: true)
store: true/false: if true the field values are stored in addition to being indexed. Usually, the field values are stored in the source already, but you can choose to not store the source but store the field value itself (default: false)
enabled: true/false: only for the mapping type as a whole or for object types. you can decide whether to only store the value but not index it
So you can use any combination of the above parameters if you don't want to modify the source documents and simple let ES do it for you.

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but can not index with keyword-analyzer.
You did not mention why you need doc_values. But you did mention that you currently not search in this field.
So I guess that the name of the search-field do not have to be the same: you can copy the field value in an other field which is only for search ( "store": false ). For this new field you can use the pattern-analyzer or pattern-tokenizer for your use case.
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netadata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate it.

Does Elasticsearch store or not store field values by default?

In Elasticsearch all fields of a mapping have a stored property which determines whether the data of the field will be stored on disk (in addition to the storing of the whole _source).
It defaults to false.
However each segment in every shard also has a Docvalues structure per field in the mapping. The structure stores the value of the field for all documents in the segment.
All documents and fields are included in the structured by default.
So on one hand, by default Elasticsearch doesn't store the values for fields. On the other hand, it does store the values in the Docvalues structure.
So which is it? Does Elasticsearch store or not store values by default?
ES stores the same field in multiple formats for different purposes.
For eg. Consider this :
"prop_1":{ "type":"string", "index":"not_analyzed","store":true,"doc_values":true}
prop_1 would be stored on its own as an indexed, doc_values and
stored field. On top of that the prop_1 is stored to into the
_source field together with your other fields.
As explained above, even if stored:false, the field data is still persisted on disk in multiple formats.
Stored fields are designed for optimal storage whereas doc values are
designed the access field values quickly. During the execution of
query many of doc values fields are accessed for candidate hits, so
the access must be fast. This the reason why you should use doc values
in sorting, aggregations and scrips.On the other hand stored fields should be used for return field values for the top matching documents.
Now, you can use doc_values to return fields in response as well :-
GET /_search
{
"query" : {
"match_all": {}
},
"docvalue_fields" : ["test1", "test2"]
}
Doc value fields can work on fields that are not stored. So IMO, stored fields do not have any significance now.

How to return fields in correct order for an ElasticSearch query

I'm performing a multimatch search against an ElasticSearch index, and I want to get back the source object with fields in the same order as they were stored in.
However, when I get the response back from the ElasticSearch query, the fields are in alphabetical order (which is not particularly useful for what I'm doing). I'm fairly confident that it used to behave the desired way in a previous version of ES, but since I upgraded recently it is only returning the fields in alphabetical order.
Edit: Note that if I perform a standard match_all search, then I do get the fields back in the original order. I wonder if it has something to do with the multimatch query?
Edit 2: OK, I just ran it again and it returned the fields in a random order (not alphabetical). Maybe this is a bug in ElasticSearch?
You cannot guarantee any order in what is returned. The source document is a plain old JSON object and by definition:
An object is an unordered set of name/value pairs.

Resources