Can ElasticSearch be used purely for aggregations?

In my current usecase, I'm using ElasticSearch as a document store, over which I am building a faceted search feature.
The docs state the following:
Sorting, aggregations, and access to field values in scripts requires a different data access pattern.
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.
Does this imply that the aggregations are not dependent on the index? If so, is it advisable to prevent the fields from being indexed altogether by setting {"index": "no"} ?
This is a small deviation, but where does the setting enabled come in? How is it different from index?
On a broader note, should I be using ElasticSearch if aggregations are all I'm looking for? Should I opt for other solutions like MongoDB? If so, what are the performance considerations?
HELP!

It is definitely possible to use Elasticsearch for the sole purpose of aggregating data. I've seen such setups a few times. For instance, in one past project, we'd index data but we'd only run aggregations in order to build financial reports, and we rarely needed to get documents/hits. 99% of the use cases were simply aggregating data.
If you have such a use case, then you can tune your mapping to
The role of enabled is to decide whether your data is indexed or not. It is true by default, but if you set it to false, your data will simply be stored (in _source) but completely ignored by analyzers, i.e. it won't be analyzed, tokenized, or indexed, and thus it won't be searchable. You'll be able to retrieve the _source, but not search on it. If you need to use aggregations, then enabled needs to be true (the default value).
The store parameter decides whether the field value is stored separately. By default, the field value is indexed but not stored, as it is already stored within the _source itself and you can retrieve it using source filtering. For aggregations, this parameter doesn't play any role.
If your use case is only about aggregations, you might be tempted to set _source: false, i.e. not store the _source at all, since all you need is to index the field values in order to aggregate them, but this is rarely a good idea for various reasons.
So, to answer your main question, aggregations do depend on the index, but the (doc-)values used for aggregations are written in dedicated files, whose inner structure is much more performant and optimal than accessing the data from the index in order to build aggregations.
If you're using ES 1.x, make sure to set doc_values to true for all the fields you'll want to aggregate on (except analyzed strings and boolean fields).
If you're using ES 2.x, doc_values is true by default, so you don't need to do anything special.
Update:
It is worth noting that aggregations are dependent on doc_values (i.e. the Per Document Values .dvd and .dvm Lucene files), which contain essentially the same information as the inverted index but organized in a column-oriented fashion, which makes them much more efficient for aggregations.
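As a minimal sketch of an aggregation-only request, the snippet below builds a mapping fragment and a search body as plain Python dictionaries. The field names category and amount are hypothetical, and the keyword type assumes Elasticsearch 5.x or later (on 1.x/2.x a not_analyzed string would play that role). doc_values is spelled out only for illustration, since it is already the default on 2.x and later.

```python
import json

# Hypothetical fields; doc_values is written explicitly for illustration only.
mapping = {
    "properties": {
        "category": {"type": "keyword", "doc_values": True},
        "amount": {"type": "double", "doc_values": True},
    }
}

# "size": 0 asks for aggregation results only, with no hits returned.
search_body = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"total_amount": {"sum": {"field": "amount"}}},
        }
    },
}

print(json.dumps(search_body, indent=2))
```

The resulting JSON can be sent to the _search endpoint of an index with whatever client you already use; nothing here talks to a cluster.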

Related

Using stored_fields for retrieving a subset of the fields in Elastic Search

The documentation and recommendation for using stored_fields feature in ElasticSearch has been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance in using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?

Source filtering is the recommended way to fetch fields. The blog post is what is confusing you, but it applies to a specific use case that you seem to have missed. Please read the statement below carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits per search, which can be changed with the size parameter. If you want to fetch only a few field values from your search results, then source filtering makes perfect sense, as it is applied to the final result set (not to all the documents matching the search).
If, on the other hand, you use a script that reads field values from the source in order to filter the search results, this will cause the query to scan the whole index, which is the second part of the statement above (not when processing millions of documents).
Apart from the above, all field values are already stored as part of the _source field, which is enabled by default, so you do not need to allocate extra space by explicitly marking fields as stored (store is disabled by default to keep the index small) just to retrieve field values.
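To make the first case concrete, here is a small sketch of a search body that uses source filtering on the returned hits. The helper build_search and the field names are hypothetical; only the query, size, and _source keys are part of the search request itself.

```python
import json

def build_search(query, fields, size=10):
    """Return a search body whose hits carry only the listed _source fields."""
    return {"query": query, "size": size, "_source": list(fields)}

# Filtering is applied to the (at most) `size` hits that are returned,
# not to every document that matches the query.
body = build_search({"match_all": {}}, ["field1", "field2"])
print(json.dumps(body))
```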

Elasticsearch store field vs _source

Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I can disable _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
Does not return the fields.
So I assume I then have to use "store": true for each field. From what I read this will be faster for searches, but space-wise, will it be the same as _source, or do we still save space?

The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that the stored fields will be faster for searches. The _source field could be bigger on disk space, but if you have to store every field there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won’t be able to do partial updates.
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster; you’ll have to re-index from the data source (which is usually a lot slower).

By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. So if you ask for field1 (which is stored), elasticsearch will identify that it's stored, and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
This answer comes from a thread answered by shay.banon; the whole thread is worth reading to get a good understanding of the topic.

Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just loads the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
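The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope model only: it takes the 10 ms per seek figure from the quote above at face value and ignores caching, compression, and SSDs, all of which change the real numbers.

```python
SEEK_MS = 10  # per-seek cost quoted above; real hardware and caching vary

def retrieval_cost_ms(num_seeks, seek_ms=SEEK_MS):
    """Rough cost of fetching one document, counting disk seeks only."""
    return num_seeks * seek_ms

stored_fields_cost = retrieval_cost_ms(10)  # ten stored fields, one seek each
source_cost = retrieval_cost_ms(1)          # _source is a single stored field

print(stored_fields_cost, source_cost)  # prints: 100 10
```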

Enabling _source will store the entire JSON document in the index while store will only store individual fields that are marked so. So using store might be better than using _source if you want to save disk space.

As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
To quote from the _source field documentation:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, there are no obvious advantages to using stored_fields, contrary to what you might have thought.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.

What's the difference between source filtering and the fields option in the elasticsearch get API?

I'm confused between source filtering (i.e. using the _source_include parameter) and the fields option of the GET API in elasticsearch. How are they different in terms of performance? When are they supposed to be used?

Update: re: fields
Note that this is the 1.x documentation if you just arrived here from the future.
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
-- https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-fields.html#search-request-fields
AFAICT:
_source tells elasticsearch whether to include the source of matched documents in the response. The "source" is the data in the document as it was inserted.
fields tells elasticsearch to include source, but only include the defined fields.
Performance: unless you have low bandwidth to the Elasticsearch server, the difference might be negligible.

I had the same doubt; here is what I found that may be the answer.
fields restricts the fields whose contents are parsed and returned
_source_filtering restricts the fields which are returned
Another way of seeing it is to think that fields is used to optimize data transfer and CPU usage while _source_filtering only optimizes data transfer
Source filtering allows us to control which parts of the original JSON document are returned for each hit [...] It's worth keeping in mind that this only saves us on bandwidth costs between the nodes participating in the search as well as the client, not CPU or Disk, as was the case when using fields.
In addition:
One feature about fields that's not commonly known is the ability to select metadata-fields as well. Of particular note is its ability to select the _ttl-field, which actually returns the number of milliseconds until the document expires, not the original lifespan of the document. A very handy feature indeed.

The fields parameter applies only to stored fields. From the 2.3 documentation:
Besides indexing the values of a field, you can also choose to store
the original field value for later retrieval. Users with a Lucene
background use stored fields to choose which fields they would like to
be able to return in their search results. In fact the _source field
is a stored field. In Elasticsearch, setting individual document
fields to be stored is usually a false optimization. The whole
document is already stored as the _source field. It is almost always
better to just extract the fields that you need using the _source
parameter.
See source filtering for how to limit the fields returned from _source.

In Elasticsearch, what happens if I set 'store' to yes on a few fields, but _source to false?

We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?

Have a look at this other answer as well. As mentioned there, in most of the cases the _source field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
In both cases (either you disable the _source completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.
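As a rough sketch of the exclude approach described above, the index body below keeps the _source but leaves one subtree out of it. The field names sensitive.notes and resource_id are hypothetical, and the typeless mappings layout assumes Elasticsearch 7.x; older versions nest the mapping under a type name.

```python
import json

index_body = {
    "mappings": {
        # Everything under "sensitive" is indexed for search but left out of the stored _source.
        "_source": {"excludes": ["sensitive.*"]},
        "properties": {
            "sensitive": {"properties": {"notes": {"type": "text"}}},
            "resource_id": {"type": "keyword"},
        },
    }
}

print(json.dumps(index_body, indent=2))
```

Documents indexed this way can still be matched on sensitive.notes, but the returned _source will not contain that field, and partial updates are no longer safe for the reasons given above.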

Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.
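A minimal sketch of that combination, assuming Elasticsearch 7.x syntax (on 1.x the equivalents would be "store": "yes" in the mapping and the fields parameter in the request). The field names content, resource_id, and resource_type are hypothetical stand-ins for the indexed and ident fields in the question.

```python
import json

index_body = {
    "mappings": {
        "_source": {"enabled": False},  # the original JSON is not kept
        "properties": {
            "content": {"type": "text"},                         # indexed only
            "resource_id": {"type": "keyword", "store": True},   # indexed and stored
            "resource_type": {"type": "keyword", "store": True},
        },
    }
}

# Ask for the stored ident fields explicitly; there is no _source to fall back on.
search_body = {
    "query": {"match": {"content": "quarterly report"}},
    "stored_fields": ["resource_id", "resource_type"],
}

print(json.dumps(search_body))
```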

what is the difference between _source and _all in Elasticsearch

The difference between the two, which both hold all of the fields, eludes me.
If my document has:
{"mydoc":
{"properties":
{"name":{"type":"string","store":"true"},
"number":{"type":"long","store":"false"},
"title":{"type":"string","include_in_all":"false","store":"true"}}
}
}
I understand that _source is a field that has all the fields. But so does _all?
Does this mean that "name" is saved several times (twice? in _source and in _all), increasing the disk space the document takes?
Is "name" stored once for the field, once for _source, and once for _all?
What about "number", is it stored in _all, even though not in _source?
When should I use _source in my query, and when _all?
What is the use case where I can disable _all, and what functionality would then be denied?

It's pretty much the same as the difference between indexed fields and stored fields in lucene.
You use indexed fields when you want to search on them, while you store fields that you want to return as search results.
The _source field is meant to store the whole source document that was originally sent to elasticsearch. It is used as a search result, to be retrieved. You can't search on it. In fact it is a stored field in lucene and not indexed.
The _all field is meant to index all the content that comes from all the fields that your documents are composed of. You can search on it but never return it, since it's indexed but not stored in lucene.
There's no redundancy: the two fields are meant for different use cases and are stored in different places within the lucene index. The _all field becomes part of what we call the inverted index, used to index text and execute full-text search against it, while the _source field is just stored as part of the lucene documents.
You would never use the _source field in your queries, only when you get back results, since that's what elasticsearch returns by default. There are a few features that depend on the _source field, which you lose if you disable it. One of them is the update API. Also, if you disable it, you need to remember to configure all the fields that you want returned as search results with store:yes in your mapping. I would rather say don't disable it unless it bothers you, since it's really helpful in a lot of cases. One other common use case is when you need to reindex your data; you can just retrieve all your documents from elasticsearch itself and resend them to another index.
On the other hand, the _all field is just a default catch-all field that you can use when you want to search on all available fields without specifying them all in your queries. It's handy, but I wouldn't rely on it too much in production, where it's better to run more complex queries on different fields, each with a different weight. You might want to disable it if you don't use it; this will have a smaller impact than disabling the _source, in my opinion.
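The split between the two can be illustrated with a toy model in plain Python. This is not how Lucene is implemented; it is only a sketch in which stored_source plays the role of _source (retrievable, not searchable) and all_index plays the role of _all (searchable, not retrievable). The sample documents are made up, and title is skipped to mimic include_in_all: false.

```python
import json
from collections import defaultdict

docs = {
    1: {"name": "alpha widget", "number": 7, "title": "first"},
    2: {"name": "beta widget", "number": 9, "title": "second"},
}

stored_source = {}            # doc id -> original JSON string
all_index = defaultdict(set)  # token -> ids of documents containing it

for doc_id, doc in docs.items():
    stored_source[doc_id] = json.dumps(doc)
    for field, value in doc.items():
        if field == "title":  # mimic include_in_all: false
            continue
        for token in str(value).lower().split():
            all_index[token].add(doc_id)

def search_all(term):
    """Look a term up in the catch-all index and return matching doc ids."""
    return sorted(all_index.get(term.lower(), set()))

# Search uses the index; the result bodies come from the stored source.
hits = search_all("widget")
results = [json.loads(stored_source[i]) for i in hits]
print(hits, [r["name"] for r in results])
```

Searching for a title value such as "first" finds nothing in this model, even though the title is still present in every returned document, which mirrors the indexed versus stored distinction described above.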
