I'm planning to use Elasticsearch mostly for data analytics. I have large documents with many, mostly numeric (up to 4 bytes) attributes. Most fields are populated in only about 30% of my documents. If I understand correctly, I can take advantage of the Doc Values feature, which is similar to the columnar data layout found in some databases. I was wondering how Elasticsearch/Lucene will store this data. Is any compression used (e.g. run-length encoding), or is it a dense data layout where nulls take the same space on storage as values?
The default behavior of Elasticsearch is not to add the field at all in the case of NULL values. You can force-map the field using null_value, but the null_value must be of the same type as the field: a long field, for example, cannot be given a string null_value.
So, to address the question: Elasticsearch will not allocate default space for fields missing from the document. But you may run into a MissingFieldException if you query on a field that never had a value. To avoid this, map your fields explicitly before indexing, and if you do, make sure to set the null_value property of each field to a value outside the range of your input data.
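For illustration, a minimal sketch of such a mapping in current (7.x-style) syntax; the index and field names here are made up. Note that null_value only substitutes explicit JSON nulls at index time; fields that are absent from a document are simply not indexed and take no space:

PUT my_metrics
{
  "mappings": {
    "properties": {
      "sensor_reading": { "type": "integer", "null_value": -2147483648 }
    }
  }
}

Here -2147483648 (Integer.MIN_VALUE) plays the role of a sentinel outside the expected data range, as suggested above.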
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
I am curious how the implementation works on the Elasticsearch backend. How can they make a value not retrievable but searchable? (I would imagine it would need to be stored somewhere for you to search it, right?) Why is Elasticsearch designed this way? What efficiency does this design achieve?
The source document is actually "stored" in the _source field (but it is not indexed) and all fields of the source documents are indexed (but not stored). All field values can usually be retrieved from the _source field using source filtering. This is how ES is configured by default, but you're free to change that.
You can, for instance, decide not to store the _source document at all and store only certain fields of your document. This might be a good idea if your document has a field that contains a huge blob of text: it might not be wise to store the _source, because that would take a lot of space for nothing. That huge blob of text might only be useful for full-text search, and so would only need to be indexed, while all the other fields might need to be both indexed and stored, because they need to be retrieved in order to be displayed.
So the bottom line is:
if a field needs to be searchable, it doesn't need to be stored, it only needs to be indexed
if a field needs to be retrievable, it can either be configured as stored or be retrieved/filtered from the _source field (which is stored by default)
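As a sketch of the huge-blob scenario described above (index and field names are hypothetical, using the current mapping syntax): _source is disabled, the blob is indexed only, and the field that must be displayed is stored individually:

PUT articles
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title": { "type": "text", "store": true },
      "body": { "type": "text" }
    }
  }
}

With this mapping, body can be searched but never returned, while title can be fetched via the stored_fields option of the search API.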
In Lucene I want to store the full document as well, which would be just stored and not analyzed. What I want is something like _source in Elasticsearch.
But I'm confused as to what would be the best data type in Lucene to store such data.
What field type should I use in Lucene to store it? Should it be a StringField or something else?
I think Elasticsearch stores _source as hex data. Not sure though.
Which data type would take less space and still be fast enough to retrieve?
As per this part of the doc, it seems that Lucene treats every data type as:
opaque bytes
which suggests that it doesn't really matter which field type you use, as long as it fits your requirements, since Lucene will convert the values to bytes anyway.
So deciding which data type the field should be depends entirely on how you want your fields to be in the index and on how you are going to use them to visualize graphs in Kibana. Hope it helps!
I have about 10 million very flat (like RDBMS rows) documents stored in ES. There are, say, 10 fields in each document, and 5 of the fields are actually enumerations.
I have created a mapping that maps each enum's ordinal to a short, and I pass the ordinal in when I index the document.
Does Elasticsearch actually store these values as a short in its index? Or do they get .toString()'ed? What actually happens "under the hood" when I map a field to a data type?
Since ES is built on top of Lucene, that is the place to look to see how fields are actually stored and used "under the hood".
As far as I understand, Lucene does in fact store data in more than just String format. So to answer one of your questions, I believe the answer is no: not everything gets .toString()'ed. In fact, if you look at the documentation for Lucene's document package, you'll see it has many numeric types (e.g. IntField, LongField, etc).
The Elasticsearch documentation on Core Types also alludes to this fact:
"It uses specific constructs within Lucene in order to support numeric
values. The number types have the same ranges as corresponding Java
types."
Furthermore, Lucene offers queries (which ES takes advantage of) designed specifically for searching fields with known numeric terms, such as the NumericRangeQuery which is discussed in Lucene's search package. The same numeric types in Lucene allow for efficient sorting as well.
One other benefit is data integrity. Just like any database, if you only expect a field to contain numeric data and your application attempts to insert non-numeric data, in most cases you would want that insert to fail. This is the default behavior of ES when you try to index a document whose field values do not match the type mapping. (Though, you can disable this behavior on numeric fields using ignore_malformed, if you wish)
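As a sketch of both behaviors (index and field names invented): the first field rejects documents with non-numeric values by default, while the second silently drops malformed values instead of failing the whole document:

PUT orders
{
  "mappings": {
    "properties": {
      "status_ordinal": { "type": "short" },
      "legacy_code": { "type": "short", "ignore_malformed": true }
    }
  }
}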
Hope this helps...
Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their results.
In 90% of cases the client will never pick all the fields, so I figured I could disable the _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
  "fields": ["field1", "field2"],
  ...
}
Does not return the fields.
So I assume I then have to use "store": true for each field. From what I read, this will be faster for searches, but space-wise will it be the same as _source, or do we still save space?
The _source field stores the JSON you send to Elasticsearch, and you can choose to return only certain fields if needed, which is perfect for your use case. I have never heard that stored fields are faster for searches. The _source field could take more disk space, but if you have to store every field anyway, there is no need to use stored fields over the _source field. If you do disable the _source field, it will mean:
You won’t be able to do partial updates
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster; you’ll have to re-index from the data source (which is usually a lot slower).
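For the reporting use case above, a sketch of per-request source filtering (field names are placeholders); only the requested parts of the stored _source are returned, with no need to mark anything as "store": true:

GET my_index/_search
{
  "_source": ["field1", "field2"],
  "query": { "match_all": {} }
}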
By default in Elasticsearch, the _source (the document you indexed) is stored. This means that when you search, you can get the actual document source back. Moreover, Elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for them (as well as possibly use them in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. So if you ask for field1 (which is stored), Elasticsearch will identify that it's stored and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note that there is a cost to retrieving each stored field. For example, if you have a JSON document with 10 reasonably sized fields and you map all of them as stored and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
I got this answer, written by shay.banon, in a discussion thread; you can read the whole thread to get a good understanding of it.
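To make the stored-field path concrete, a sketch of retrieving an explicitly stored field, assuming field1 was mapped with "store": true (the body parameter is called fields in 1.x and stored_fields from 5.x onward):

GET my_index/_search
{
  "stored_fields": ["field1"],
  "query": { "match_all": {} }
}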
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just loads the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source stores the entire JSON document in the index, while store only stores the individual fields that are marked as stored. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
I might just quote from the _source documentation:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
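The compression tip corresponds to the index codec setting; a sketch (index name hypothetical), keeping in mind that index.codec can only be set at index creation time or while the index is closed:

PUT my_index
{
  "settings": {
    "index.codec": "best_compression"
  }
}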
Besides, stored_fields does not bring the obvious advantages you might have expected.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
I'm confused about the difference between source filtering (i.e. using the _source_include parameter) and the fields option of the GET API in Elasticsearch. How do they differ in terms of performance? When is each supposed to be used?
Update: re: fields
Note that this is the 1.x documentation if you just arrived here from the future.
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
-- https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-fields.html#search-request-fields
AFAICT:
_source tells elasticsearch whether to include the source of matched documents in the response. The "source" is the data in the document as it was inserted.
fields tells Elasticsearch to include the source, but only the defined fields.
Performance: unless you have low bandwidth to the Elasticsearch server, it might be negligible.
I had the same doubt; here is what I found, which may be the answer.
fields restricts the fields whose contents are parsed and returned
_source_filtering restricts the fields which are returned
Another way of seeing it is to think that fields is used to optimize data transfer and CPU usage while _source_filtering only optimizes data transfer
Source filtering allows us to control which parts of the original JSON document are returned for each hit [...] It's worth keeping in mind that this only saves us on bandwidth costs between the nodes participating in the search, as well as the client, not CPU or disk, as was the case when using fields.
In addition:
One feature of fields that's not commonly known is the ability to select metadata fields as well. Of particular note is its ability to select the _ttl field, which actually returns the number of milliseconds until the document expires, not the original lifespan of the document. A very handy feature indeed.
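A side-by-side sketch of the two request shapes being compared, using a hypothetical title field and 1.x-era syntax. The first is served from stored fields where available (falling back to parsing _source), the second is always extracted from the stored _source:

GET my_index/_search
{
  "fields": ["title"]
}

GET my_index/_search
{
  "_source": ["title"]
}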
The fields parameter applies only to stored fields. From the 2.3 documentation:
Besides indexing the values of a field, you can also choose to store the original field value for later retrieval. Users with a Lucene background use stored fields to choose which fields they would like to be able to return in their search results. In fact the _source field is a stored field. In Elasticsearch, setting individual document fields to be stored is usually a false optimization. The whole document is already stored as the _source field. It is almost always better to just extract the fields that you need using the _source parameter.
See source filtering for how to limit the fields returned from _source.