In Lucene I want to store the full document as well, where it would just be stored and not analyzed. What I want is something like _source in Elasticsearch.
But I'm confused as to which Lucene data type would be best for storing such data.
What field type should I use in Lucene to store it? Should it be a StringField or something else?
I think elasticsearch stores _source as hex data. Not sure though.
Which data type would take less space and still be fast enough to retrieve?
As per this part of the doc, it seems that Lucene treats every data type as:
opaque bytes
which suggests that it doesn't really matter which field type you use, as long as it fits your requirements; Lucene converts them all anyway.
So deciding which data type a field should be depends entirely on how you want your fields to appear in the index and how you are going to use them, for example to visualize graphs in Kibana. Hope it helps!
Related
From what I read, Elasticsearch is dropping support for types.
So, as the examples say, indexes are similar to databases and documents are similar to rows of a relational database.
So now, everything is a top-level document right?
Then what is the need for a mapping, if we can store all sorts of documents in an index with whatever schema we want it to have.
I want to understand if my concepts are incorrect anywhere.
Elasticsearch is not dropping support for mapping types, they are dropping support for multiple mapping types within a single index. That's a slight, yet very important, difference.
Having a proper index mapping in ES is as important as having a proper schema in any RDBMS, i.e. the main idea is to clearly define of which type each field is and how you want your data to be analyzed, sliced and diced, etc.
Without explicit mapping, it wouldn't be possible to do all the above (and much more), ES would guess the type of your fields and even though most of the time it gets it right, there are plenty of times where it is not exactly what you want/need.
For instance, some people store floating point values in string fields (see below), ES would detect that field as being text/keyword even though you want it to be double.
{
"myRatio": "0.3526472"
}
This is just one reason out of many others why it is important to define your own mapping rather than relying on ES to guess it for you.
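For instance, a minimal explicit mapping that forces myRatio to be a double could look like this (index name and 7.x-style syntax are just for illustration):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "myRatio": { "type": "double" }
    }
  }
}
```

With this mapping in place, the JSON string value "0.3526472" is coerced to a double at index time instead of being dynamically mapped as text/keyword.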
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
I am curious how the implementation works on the Elasticsearch backend. How can they make a value searchable but not retrievable? (I would imagine it would need to be stored somewhere in order for you to search it, right?) Why is Elasticsearch designed this way? What efficiency did it achieve by designing it this way?
The source document is actually "stored" in the _source field (but it is not indexed) and all fields of the source documents are indexed (but not stored). All field values can usually be retrieved from the _source field using source filtering. This is how ES is configured by default, but you're free to change that.
You can, for instance, decide to not store the _source document at all and store only certain fields of your document. This might be a good idea if for instance your document has a field which contains a huge blob of text. It might not be wise to store the _source because that would take a lot of space for nothing. That huge blob of text might only be useful for full-text search and so would only need to be indexed, while all other fields might need to be indexed and stored as well because they need to be retrieved in order to be displayed.
So the bottom line is:
if a field needs to be searchable, it doesn't need to be stored, it only needs to be indexed
if a field needs to be retrievable, it can either be configured as stored, or be retrieved/filtered from the _source field (which is stored by default)
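To illustrate the huge-blob scenario above, here is a sketch (index and field names are hypothetical, 7.x-style syntax) that disables _source, keeps the blob index-only, and explicitly stores the field that must be retrieved:

```json
PUT my-index
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title":    { "type": "keyword", "store": true },
      "big_text": { "type": "text" }
    }
  }
}
```

Here big_text can still be searched (it is indexed), but it can no longer be retrieved, since it is neither stored nor available from _source.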
I have about 10 million very flat (like an RDBMS row) documents stored in ES. There are say 10 fields to each document, and 5 of the fields are actually enumerations.
I have created a mapping that maps the Enum's ordinal to a Short, and pass the ordinal in when I index the document.
Does Elasticsearch actually store these values as a Short in its index? Or do they get .toString()'ed? What is actually happening "under the hood" when I map a field to a data type?
Since ES is built on top of Lucene, that is the place to look to see how fields are actually stored and used "under the hood".
As far as I understand, Lucene does in fact store data in more than just String format. So to answer one of your questions, I believe the answer is no - everything does not get .toString()'ed. In fact, if you look at the documentation for Lucene's document package, you'll see it has many numeric types (e.g. IntField, LongField, etc).
The Elasticsearch documentation on Core Types also alludes to this fact:
"It uses specific constructs within Lucene in order to support numeric values. The number types have the same ranges as corresponding Java types."
Furthermore, Lucene offers queries (which ES takes advantage of) designed specifically for searching fields with known numeric terms, such as the NumericRangeQuery which is discussed in Lucene's search package. The same numeric types in Lucene allow for efficient sorting as well.
One other benefit is data integrity. Just like any database, if you only expect a field to contain numeric data and your application attempts to insert non-numeric data, in most cases you would want that insert to fail. This is the default behavior of ES when you try to index a document whose field values do not match the type mapping. (Though, you can disable this behavior on numeric fields using ignore_malformed, if you wish)
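As a hypothetical sketch of both behaviors: the first field below causes documents with non-numeric values to be rejected, while the second silently drops malformed values instead of failing the whole document:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "status":        { "type": "short" },
      "legacy_status": { "type": "short", "ignore_malformed": true }
    }
  }
}
```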
Hope this helps...
Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I could disable the _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
Does not return the fields.
So I assume I then have to use "store": true for each field. From what I read this will be faster for searches, but I wonder about space: will it take the same space as _source, or do we still save some?
The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that the stored fields will be faster for searches. The _source field could be bigger on disk space, but if you have to store every field there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won’t be able to do partial updates
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster; you’ll have to re-index from the data source (which is usually a lot slower).
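Concretely, with _source enabled you can keep the mapping as-is and let the client pick fields via source filtering (same hypothetical index and field names as in the question):

```json
GET myIndex/myType/_search
{
  "_source": ["field1", "field2"],
  "query": { "match_all": {} }
}
```

This returns only field1 and field2, extracted from the stored _source of each hit.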
By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. So if you ask for field1 (which is stored), elasticsearch will identify that it's stored and load it from the index, instead of extracting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
I got this answer from a Google Groups thread answered by shay.banon; you can read that whole thread to get a good understanding of it.
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just loads the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source will store the entire JSON document in the index, while store will only store the individual fields that are marked as stored. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
I might just quote from the _source:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, there are no obvious advantages to using stored_fields, contrary to what you might have thought.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
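For example (index, document id and field names are hypothetical, using the 7.x API):

```json
GET my-index/_doc/1?_source_includes=user,date

GET my-index/_search
{
  "_source": ["user", "date"],
  "query": { "match_all": {} }
}
```

Both requests return only the user and date fields, extracted from the stored _source.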
Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an Analyzer, indexed as is, but a client is able to match against it. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.
My question:
If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?
I hope that's clear enough. Thanks!
You're mixing up the concept of indexed field and stored field in lucene, the library that elasticsearch is built on top of.
A field is indexed when it goes into the inverted index, the data structure that lucene uses to provide its great and fast full text search capabilities. If you want to search on a field, you do have to index it.

When you index a field you can decide whether to index it as it is, or to analyze it, which means choosing a tokenizer to apply to it that will generate a list of tokens (words), plus a list of token filters that can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll only be able to find that document by searching for that exact specific text, whitespaces included.
A field is stored when you want to be able to retrieve it. Lucene provides a kind of storage too, which doesn't have anything to do with the inverted index itself.
When you search using lucene you get back a list of document ids that match. Then you can retrieve some text from their stored fields, which is what you literally show as search results. If you don't store a field you'll never be able to get it back from lucene (this is not true for elasticsearch though, as I'm going to explain below).
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but do want to retrieve and show: stored and not indexed.
Therefore the two data structures are not related to each other. If you both index and store a field in lucene, its content will not be present twice in the same form. Stored fields are stored as they are, as you send them to lucene, while indexed fields might be analyzed and will be part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by lucene document id), while indexed fields are made to search, in such a structure that literally inverts the text having as a result each term as key, together with a list of document ids that contain it (the postings list).
When it comes to elasticsearch things change a little though. When you don't configure a field as stored in your mapping (default is store:no) you are able to retrieve it anyway by default. This happens because elasticsearch always stores in lucene the whole source document that you send to it (unless you disable this feature) within a special lucene field, called _source.
When you search using elasticsearch you get back the whole source field by default, but you can also ask for specific fields. What happens in that case is that elasticsearch checks whether those specific fields are stored or not in lucene. If they are, the content will be retrieved from lucene; otherwise the _source stored field will be retrieved from lucene, parsed as json (pull parsing) and those specific fields will be extracted.

In the first case it might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in lucene would probably make the loading process faster; on the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which leads to a single disk seek, parse it, etc. In most cases using the _source field works just fine.
To answer your question: inverted index and lucene storage are two completely different things. You end up having two copies of the same data in lucene only if you decide to store a field (store:yes in the mapping), since elasticsearch keeps that same content within the json _source, but this doesn't have anything to do with the fact that you're indexing or analyzing the field.
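As a sketch in the pre-5.x mapping syntax the question uses (type and field names are hypothetical):

```json
PUT my-index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "index": "not_analyzed",
          "store": true
        }
      }
    }
  }
}
```

With store set to true, the exact string ends up in Lucene's stored-field storage (in addition to the copy inside the stored _source and the single term in the inverted index); with the default store:no it exists only in _source and in the inverted index.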