From what I read, Elasticsearch is dropping support for types.
So, as the examples say, indexes are similar to databases and documents are similar to rows of a relational database.
So now, everything is a top-level document, right?
Then what is the need for a mapping, if we can store all sorts of documents in an index with whatever schema we want them to have?
I want to understand if my concepts are incorrect anywhere.
Elasticsearch is not dropping support for mapping types; it is dropping support for multiple mapping types within a single index. That's a slight, yet very important, difference.
Having a proper index mapping in ES is as important as having a proper schema in any RDBMS: the main idea is to clearly define the type of each field and how you want your data to be analyzed, sliced and diced, etc.
Without an explicit mapping, it wouldn't be possible to do all of the above (and much more). ES would guess the type of your fields, and even though it gets it right most of the time, there are plenty of cases where the guess is not exactly what you want/need.
For instance, some people store floating-point values in string fields (see below); ES would detect that field as being text/keyword even though you want it to be double.
```json
{
  "myRatio": "0.3526472"
}
```
This is just one reason out of many why it is important to define your own mapping and not rely on ES guessing it for you.
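As a sketch (the index name here is just for illustration), an explicit mapping that forces the field to be numeric could look like this in the typeless mapping syntax of ES 7+:

```json
PUT ratios
{
  "mappings": {
    "properties": {
      "myRatio": { "type": "double" }
    }
  }
}
```

With this mapping in place, ES will coerce the string "0.3526472" to a double at index time instead of treating it as text.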
In my current usecase, I'm using ElasticSearch as a document store, over which I am building a faceted search feature.
The docs state the following:
Sorting, aggregations, and access to field values in scripts requires a different data access pattern.
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.
Does this imply that the aggregations are not dependent on the index? If so, is it advisable to prevent the fields from being indexed altogether by setting {"index": "no"} ?
This is a small deviation, but where does the setting enabled come in? How is it different from index?
On a broader note, should I be using Elasticsearch if aggregations are all I'm looking for? Should I opt for another solution like MongoDB? If so, what are the performance considerations?
HELP!
It is definitely possible to use Elasticsearch for the sole purpose of aggregating data. I've seen such setups a few times. For instance, in one past project, we'd index data but we'd only run aggregations in order to build financial reports, and we rarely needed to get documents/hits. 99% of the use cases were simply aggregating data.
If you have such a use case, then you can tune your mapping accordingly:
The role of enabled is to decide whether your data is indexed or not. It is true by default, but if you set it to false, your data will simply be stored (in _source) and completely ignored by the analyzers, i.e. it won't be analyzed, tokenized or indexed, and thus won't be searchable; you'll be able to retrieve the _source, but not search on the field. If you need to use aggregations, then enabled needs to be true (the default value).
The store parameter decides whether you want to store the field separately or not. By default, a field value is indexed but not stored, since it is already stored within _source itself and you can retrieve it using source filtering. For aggregations, this parameter doesn't play any role.
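To illustrate (the index and field names here are made up, using the boolean syntax of ES 5+), a mapping combining these parameters might look like this. Note that enabled can only be set on object fields or on the top-level mapping, while index and store are per-field:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "session_data": { "type": "object", "enabled": false },
      "body":         { "type": "text", "index": false },
      "title":        { "type": "text", "store": true }
    }
  }
}
```

Here session_data is kept in _source but never indexed, body is searchable but only retrievable via _source, and title is both indexed and stored as a separate field.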
If your use case is only about aggregations, you might be tempted to set _source: false, i.e. not store the _source at all, since all you'll need is to index the field values in order to aggregate them, but this is rarely a good idea for various reasons.
So, to answer your main question, aggregations do depend on the index, but the (doc-)values used for aggregations are written in dedicated files, whose inner structure is much more performant and optimal than accessing the data from the index in order to build aggregations.
If you're using ES 1.x, make sure to set doc_values to true for all the fields you'll want to aggregate on (except analyzed strings and boolean fields).
If you're using ES 2.x, doc_values is true by default, so you don't need to do anything special.
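In ES 1.x, that would look something like this (the type and field names are illustrative); in 2.x and later this is already the default for non-analyzed fields, so no explicit setting is needed:

```json
PUT my-index
{
  "mappings": {
    "my_type": {
      "properties": {
        "price": {
          "type": "double",
          "doc_values": true
        }
      }
    }
  }
}
```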
Update:
It is worth noting that aggregations are dependent on doc_values (i.e. the Per Document Values .dvd and .dvm Lucene files), which basically contain the same info as the inverted index, but organized in a column-oriented fashion, which makes them much more efficient for aggregations.
I have about 10 million very flat (like an RDBMS row) documents stored in ES. There are say 10 fields to each document, and 5 of the fields are actually enumerations.
I have created a mapping that maps the Enum's ordinal to a Short, and pass the ordinal in when I index the document.
Does Elasticsearch actually store these values as a Short in its index? Or do they get .toString()'ed? What is actually happening "under the hood" when I map a field to a data type?
Since ES is built on top of Lucene, that is the place to look to see how fields are actually stored and used "under the hood".
As far as I understand, Lucene does in fact store data in more than just String format. So to answer one of your questions, I believe the answer is no - everything does not get .toString()'ed. In fact, if you look at the documentation for Lucene's document package, you'll see it has many numeric types (e.g. IntField, LongField, etc).
The Elasticsearch documentation on Core Types also alludes to this fact:
"It uses specific constructs within Lucene in order to support numeric
values. The number types have the same ranges as corresponding Java
types."
Furthermore, Lucene offers queries (which ES takes advantage of) designed specifically for searching fields with known numeric terms, such as the NumericRangeQuery which is discussed in Lucene's search package. The same numeric types in Lucene allow for efficient sorting as well.
One other benefit is data integrity. Just like any database, if you only expect a field to contain numeric data and your application attempts to insert non-numeric data, in most cases you would want that insert to fail. This is the default behavior of ES when you try to index a document whose field values do not match the type mapping. (Though, you can disable this behavior on numeric fields using ignore_malformed, if you wish)
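As an illustrative sketch (index and field names are made up), ignore_malformed is set per field in the mapping; with it enabled, a document whose rank value is not a valid number would still be indexed, with only that one field being skipped:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "rank": {
        "type": "short",
        "ignore_malformed": true
      }
    }
  }
}
```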
Hope this helps...
We were comparing those search solutions and started to wonder why one needs a schema and the other does not. What are the tradeoffs? Is it because one is like SQL and the other is like NoSQL in the sense of schema configuration?
ES does have a schema, defined as templates and mappings. You don't have to use it, but in practice you will. Schema is actually a good thing, and if you notice a database claiming to be purely schemaless, there will be performance implications.
Schema is a tradeoff between ease of development and adoption on one side and performance on the other. It is easy to read/write into a schemaless database, but it will be less performant, particularly for any non-trivial query.
Elasticsearch definitely has a schema. If you think it does not, try indexing a date into a field and then an int into the same field. Or even into different types with the same name (I think ES 2.0 disallows that now).
What Elasticsearch does is simplifies auto-creation of a schema. That has tradeoffs such as possible incorrect type detection, fields that are single-valued or multivalued in the result output based on number of elements they contain (they are always multivalued under the covers), and so on. Elasticsearch has some ways to work around that, mostly by defining some of the schema elements and explicit schema mapping as Oleksii wrote.
Solr also has a schemaless mode that closely matches Elasticsearch's mode, down to storing all JSON as a single field. And when you enable it, you get both the similar benefits and the similar disadvantages Elasticsearch has. Except that, in Solr, you can change things like the order of auto-type strategies and the mapping to field types. In Elasticsearch (1.x at least) it was hard-coded. You can see a slightly dated comparison in my presentation from 2014.
As Slomo said, they both use Lucene underneath for storing and most of the search. So, the core engine approach cannot change.
I'm looking into Elasticsearch right now and I am having a hard time grasping how index types fit into the data model. I've read examples and documentation, but none really go in depth, or the examples use a data model that is composed of several submodels.
I am currently using mongodb to store my data, let's take this example of an Article collection that I want to be indexed for search, my doc looks like this:
```
Article = {
  title: String,
  publisher: String,
  subject: String,
  description: String,
  year: Integer,
}
```
Now I want each of those fields to be searchable, so I would make an elasticsearch index of 'Article'. I will need to define each field and how it should be analysed and whether it is stored or not, that I understand.
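For example, one possible mapping for this document (the field types are my guesses at the intent, using the ES 5+ syntax: text for full-text search, keyword for exact facet-style matching) could be:

```json
PUT articles
{
  "mappings": {
    "properties": {
      "title":       { "type": "text" },
      "publisher":   { "type": "keyword" },
      "subject":     { "type": "keyword" },
      "description": { "type": "text" },
      "year":        { "type": "integer" }
    }
  }
}
```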
Now how does an index type come in here? As far as I am aware, Lucene does not have this concept, this is a layer added by Elasticsearch.
For example, maybe some of you may say that we can logically group the documents by subject or publisher and create index types on those but how is this different from searching by subject or publisher?
Is it more of a performance related aspect that we have index types?
Not a very easy question to answer, but I am going to give it a try. Be warned, though: this is just my opinion.
First of all, if you do not want to keep certain documents together in an index, just because it feels like they should be, create separate indices. There is not really a penalty for using more indices over more types. The only thing I can think of is that you could create analysers and mappings that you can reuse over the different types.
You can use types if you feel documents belong together: they have a similar structure, but not necessarily the same structure. Be warned, though: do not create separate mappings for fields with the same name in different types within the same index. Lucene does not like this.
Then there is the final scenario: parent-child relationships. Here you do need types. This way a parent and its children can be placed in the same shard, which is better for performance.
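As a sketch of the pre-6.x syntax (the type names are made up), a parent-child relationship is declared with the _parent setting on the child type, which is what forces children to be routed to their parent's shard:

```json
PUT company
{
  "mappings": {
    "department": {},
    "employee": {
      "_parent": { "type": "department" }
    }
  }
}
```

In 6.x and later, this is modeled with a join field within a single type instead.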
Hope that helps a bit.
If I'm not mistaken, the catch with using more than one type in one index is that it is almost identical to using different indices. Say, you can store (as I did) documents of types "simple_address", "delivery_address", "some_strange_but_official_address_info" in the same index "address" to make your code a bit more sane. But if you don't use parent-child links, it's equivalent to just having three indices.
Speaking of your example, you should wrap your head around what you would like to search. If, for instance, you add comments into the equation, it's better to use some kind of separation, either as parent-child or as different indices with manual mapping by keys. And, obviously, you should have different mappings for the "Article" and "Comment" types.
We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most cases the _source field helps a lot. Even though it might seem like a waste because Elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a Lucene implementation detail: the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine.

Surprisingly, the _source helps performance-wise too, as it requires a single disk seek instead of the multiple disk seeks that might be caused by retrieving several stored fields. At the end of the day, the _source field is just a big Lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
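A minimal sketch of that approach (the field names are hypothetical): the sensitive field stays fully searchable because it is indexed, but it is excluded from the stored _source and therefore never returned in results:

```json
PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [ "ssn" ]
    },
    "properties": {
      "ssn":  { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}
```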
In both cases (whether you disable the _source completely or exclude some parts of it), if you plan to update your documents, keep in mind that you'll need to send the whole updated document using the index api. In fact, you cannot rely on the partial updates provided by the update api, as the index no longer contains the complete document that you indexed in the first place, which you would need to apply changes to.
Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.
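A sketch of that setup (the index and field names are illustrative): disable _source for the index, mark only the ident fields as stored, and then request them explicitly at search time with stored_fields (called simply fields before ES 5.0):

```json
PUT unified-search
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "content":     { "type": "text" },
      "resource_id": { "type": "keyword", "store": true }
    }
  }
}

GET unified-search/_search
{
  "query": { "match": { "content": "some query" } },
  "stored_fields": [ "resource_id" ]
}
```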