What does the "Type" mean in Elasticsearch? - elasticsearch

I am totally confused by Elasticsearch's documentation.
In Basic Concepts: Type, a "type" is described as something like a collection in MongoDB:
In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
But in Types and Mappings: Type Takeaways, it says:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
Don't the "user" and "blog" types mentioned above have mutually exclusive sets of fields?
For example: there are "name", "age" fields for "user", and "createdAt", "content" for "blog".
I used to believe the mapping between Elasticsearch and MongoDB is:
index <=> database
type <=> collection
Is that right?
If not, what is the recommended mapping between them?

Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
At the most basic level, the type is just another field in Elasticsearch. When you run GET /my_index/my_type/_search, ES applies a pre-filter on the _type field for the value my_type; it works like an automatic filter.
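To make the "automatic filter" idea concrete, here is a minimal sketch that builds a search body roughly equivalent to querying a type through the URL. The helper name search_body_for_type is hypothetical, and the bool/filter syntax assumes Elasticsearch 2.x or later; it only builds the request body and does not talk to a cluster.

```python
def search_body_for_type(query, type_name):
    """Wrap `query` so it only matches documents whose _type is `type_name`.

    Hypothetical helper: it mimics what GET /index/type/_search does
    implicitly, by adding a term filter on the _type field.
    """
    return {
        "query": {
            "bool": {
                "must": query,
                "filter": {"term": {"_type": type_name}},
            }
        }
    }


# Roughly what GET /my_index/my_type/_search with match_all boils down to.
body = search_body_for_type({"match_all": {}}, "my_type")
```

Sending this body to GET /my_index/_search should return about the same hits as the typed URL, which is why a type behaves more like a filter than like a separate table.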
Don't think of indices and types as the databases and tables of the SQL world, because that is not what they are.
If the index has type1 with fields f1 and f2 and type2 with fields f1 and f3, the index as a whole contains documents with fields f1, f2 and f3. This matters when a document is scored by a query that searches field f1: the term statistics for f1 are global (they cover both type1 and type2), so when you search for a value in f1 of type1, the score you get back is also slightly influenced by the values of f1 in type2.
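The scoring effect described above can be sketched with a simplified inverse document frequency, 1 + log(N / (df + 1)), computed once over type1 only and once over the whole index. The documents and the idf helper are made up for illustration, and real Lucene scoring has more factors than this.

```python
import math

# Made-up documents: both types share the field f1.
docs = [
    {"_type": "type1", "f1": "apple"},
    {"_type": "type1", "f1": "banana"},
    {"_type": "type1", "f1": "cherry"},
    {"_type": "type2", "f1": "apple"},
    {"_type": "type2", "f1": "apple"},
    {"_type": "type2", "f1": "apple"},
]


def idf(term, field, documents):
    """Simplified IDF: 1 + log(N / (df + 1)) over the given documents."""
    n = len(documents)
    df = sum(1 for d in documents if term in d.get(field, "").split())
    return 1 + math.log(n / (df + 1))


type1_only = [d for d in docs if d["_type"] == "type1"]

# What the weight of "apple" would be if type1 lived alone...
idf_type1 = idf("apple", "f1", type1_only)
# ...versus the weight actually used, computed over every type in the index.
idf_index = idf("apple", "f1", docs)
```

Because "apple" is common in type2, its weight over the whole index is lower than it would be for type1 alone, even when the search is restricted to type1.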
Also, please don't translate a set of SQL tables to ES by simply following the primary key/foreign key approach to define parent/child relationships in ES.

You're right: for Elasticsearch, index == database and type == collection. In RDBMS terms, an index is a database and a type is like a table containing many rows (documents in Elasticsearch).
You could have a different index maintaining user information, with the "name", "age" and other such fields generally attributed to a person, and a different one for blogs with "createdAt", "content", etc. Yet, you might want to have a "user" field inside each blog document to be able to identify the person who posted it. Later, you can apply application-side joins, if need be.
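An application-side join of this kind could look like the sketch below. The sample data, the user ids and the helper name join_blogs_with_users are all hypothetical; in practice the two inputs would come from two separate searches, one against each index.

```python
# Hypothetical results of two separate searches.
users_by_id = {
    "u1": {"name": "Alice", "age": 30},
    "u2": {"name": "Bob", "age": 41},
}
blog_hits = [
    {"content": "Hello ES", "createdAt": "2016-01-01", "user": "u1"},
    {"content": "Types explained", "createdAt": "2016-02-01", "user": "u2"},
    {"content": "Orphan post", "createdAt": "2016-03-01", "user": "u9"},
]


def join_blogs_with_users(blogs, users):
    """Attach the author document to each blog via its "user" field."""
    joined = []
    for blog in blogs:
        # Unknown user ids yield author=None instead of raising.
        joined.append({**blog, "author": users.get(blog["user"])})
    return joined


joined = join_blogs_with_users(blog_hits, users_by_id)
```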

Related

Does Elasticsearch store or not store field values by default?

In Elasticsearch all fields of a mapping have a stored property which determines whether the data of the field will be stored on disk (in addition to the storing of the whole _source).
It defaults to false.
However, each segment in every shard also has a doc values structure per field in the mapping. The structure stores the value of the field for all documents in the segment.
All documents and fields are included in the structure by default.
So on one hand, by default Elasticsearch doesn't store the values for fields. On the other hand, it does store the values in the doc values structure.
So which is it? Does Elasticsearch store or not store values by default?
ES stores the same field in multiple formats for different purposes.
For example, consider this mapping:
"prop_1": { "type": "string", "index": "not_analyzed", "store": true, "doc_values": true }
prop_1 would be stored on its own as an indexed field, a doc_values field and a stored field. On top of that, prop_1 is also stored in the _source field together with your other fields.
As explained above, even with store: false, the field data is still persisted on disk in multiple formats.
Stored fields are designed for optimal storage, whereas doc values are designed for quick access to field values. During the execution of a query, many doc values fields are accessed for candidate hits, so the access must be fast. This is the reason why you should use doc values for sorting, aggregations and scripts. Stored fields, on the other hand, should be used to return field values for the top matching documents.
You can now use doc values to return fields in the response as well:
GET /_search
{
  "query": {
    "match_all": {}
  },
  "docvalue_fields": ["test1", "test2"]
}
Doc value fields can work on fields that are not stored. So IMO, stored fields do not have any significance now.

How to use the elasticsearch type?

Note: I would really appreciate it if you told me in a comment why you think this is a shit question. Please do not just downvote without saying why.
We know there is a concept called type under an index, but I do not know why we need it.
At first I thought we use it to organize data. For example, we could index like this:
curl -XPOST 'localhost:9200/customer/USA/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
But in the above situation, we can always eliminate the type and move it into the JSON body, like this:
curl -XPOST 'localhost:9200/customer/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe","country":"USA" }
{"index":{"_id":"2"}}
{"name": "Jane Doe","country":"USA" }
'
In this way we can always add a field to replace the type.
Then I thought it might be performance related: if you split the data into different types, there is less data under each type, so querying each type should be faster. But that is not the case either.
The performance of an Elasticsearch index is tied to its shards. Even if you split the data into different types, it is still stored in the same set of shards.
So why do we need types?
First of all, although Elasticsearch determines the types of fields at runtime, once it has assigned a particular type to a field it always expects the same type of value for that field. So you need multiple types if you need to store different kinds of data. Secondly, types allow you to store multiple document types with different mappings in a single index. They also make querying a particular type easier if you are sure about its schema.
From my understanding of ES, a type is something we can relate to the table concept in a relational database, where a database is a group of related tables. Likewise, in ES an index is a group of related types, and each type in the index holds documents that share some common properties or fields.
In your example, for an index called customer, we can have customers from different countries such as the USA, India, the UK, etc. Customer records from each country can be grouped under different types to keep them organized. When we run a search query for customers in a particular country, we only need to run it against the USA type instead of looking through everything in the index.
Another example: Let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data. So we logically organize the data into different types and query the required type whenever we do a search.
So in general, a type is a logical category/partition of your index whose semantics are completely up to you. It can be defined as a set of documents that have common fields.
You may refer to this post for a better understanding: https://www.elastic.co/blog/index-vs-type

Elasticsearch indexed database table column structure

I have a question regarding the setup of my Elasticsearch index. I have created a table which I index into Elasticsearch through a river. The table is built by a script that queries multiple tables to denormalize the data, making it easier to index by a unique id at a 1:1 ratio.
An example of a set of fields I have is street, city, state, zip, which I can query on. My question is: should I keep those fields individually indexed, or concatenate them into one big field, such as address, that contains all of the previous fields? Or should I put in the extra time to set up parent-child indexes?
The use case is this: I have a customer with billing info coming in from one direction, and I want to query Elasticsearch to see whether that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming, but I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregations on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to search for information across different fields.
In addition to having the information in separate fields, I would use multi-fields with an analyzed and a not_analyzed part (fieldname.raw). This again allows for aggregations, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if you analyze it, it will be indexed as separate tokens (['new', 'york'] with the standard analyzer, which also lowercases), and you will not be able to see all people from 'New York'. What you'd see instead are the people from 'new' and the people from 'york'.
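The difference between the analyzed and the not_analyzed part can be sketched by counting documents per term, which is roughly what a terms aggregation or facet does. The analyze function is only a rough stand-in for the standard analyzer (whitespace split plus lowercasing), and the city values are made up.

```python
from collections import Counter

# Made-up city values, one per customer document.
cities = ["New York", "New York", "York", "New Orleans"]


def analyze(text):
    """Rough stand-in for the standard analyzer: split and lowercase."""
    return text.lower().split()


# Analyzed field: each document is counted once per distinct token.
analyzed_counts = Counter(tok for city in cities for tok in set(analyze(city)))

# not_analyzed field (fieldname.raw): the whole value is a single term.
raw_counts = Counter(cities)
```

In this sketch the analyzed counts have buckets for "new", "york" and "orleans" but none for "New York", while the raw counts keep "New York" intact, which is what aggregations and sorting need.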
_all field
There is a special _all field in elasticsearch which does the concatenation in the background. You don't have to do it yourself. It is possible to enable/disable it.
Parent Child relationship
Concerning whether to use nested objects or a parent-child relationship: I think that a parent-child relationship is more appropriate for your case. Arrays of plain inner objects (that is, objects not mapped with the dedicated nested type, which keeps each object separate) are stored in a 'flattened' way, i.e. the information from the objects in an array is stored as part of one document. Consider the following example:
You have orders for two clients:
client: 'Samuel Thomson'
orderline: 'Strong Thinkpad'
orderline: 'Light Macbook'
client: 'Jay Rizzi'
orderline: 'Strong Macbook'
With flattened inner objects, if you search for clients who ordered 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored all together, i.e. ['Strong', 'Thinkpad', 'Light', 'Macbook'], so there is no distinction between the two orderlines.
With parent-child documents, the orderlines for the same client are not mixed and preserve their identity.
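The cross-matching in the example above can be reproduced with a small sketch. It models flattening as putting all words of all orderlines into one bag per client, and models separate documents as matching each orderline on its own; the function names are hypothetical and this is not how Elasticsearch is implemented internally.

```python
orders = [
    {"client": "Samuel Thomson", "orderlines": ["Strong Thinkpad", "Light Macbook"]},
    {"client": "Jay Rizzi", "orderlines": ["Strong Macbook"]},
]


def flattened_match(order, words):
    """All words occur somewhere in the order; orderline boundaries are lost."""
    bag = {w.lower() for line in order["orderlines"] for w in line.split()}
    return all(w.lower() in bag for w in words)


def per_line_match(order, words):
    """All words occur within one single orderline; boundaries are kept."""
    return any(
        all(w.lower() in {t.lower() for t in line.split()} for w in words)
        for line in order["orderlines"]
    )


query = ["Strong", "Macbook"]
flat_hits = [o["client"] for o in orders if flattened_match(o, query)]
line_hits = [o["client"] for o in orders if per_line_match(o, query)]
```

With the flattened model both clients match, because 'Strong' and 'Macbook' each appear somewhere in Samuel Thomson's order; matching per orderline returns only Jay Rizzi.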

ElasticSearch: Performance Implications of Multiple Types in the same Index

We are storing a handful of polymorphic document subtypes in a single index (e.g. let's say we store vehicles with subtypes of car, van, motorcycle, and Batmobile).
At the moment, there is >80% commonality in fields across these subtypes (e.g. manufacturer, number of wheels, ranking of awesomeness as a mode of transport).
The standard case is to search across all types, but sometimes users will want to filter the results to a subset of the subtypes (find only cars with...).
How much overhead (if any) is incurred at search/index time from modelling these subtypes as distinct ElasticSearch types vs. modelling them as a single type using some application-specific field to distinguish between subtypes?
I've looked through several related answers already, but can't find the answer to my exact question.
Thanks very much!
There shouldn't be any noticeable overhead.
If you keep everything under the same type, you can filter results by subtype by adding a "class" field to your objects and adding a condition on this field in your search.
A good reason to model your different classes as different ES types is a possible conflict between the types of fields with the same name.
For example, assume your "car" class has a "color" field that holds an integer, while your "van" class also has a "color" field but this one is a string. (Stupid example, I know; I didn't have a better idea.)
Elasticsearch holds the mapping (the data "schema") for a type. So if you index both "car" and "van" under the same type, you will have a field type conflict. A field in a type can have only one specific type. If you set the field to integer and then try to index a non-numeric string into it, indexing will fail.
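The field type conflict can be illustrated with a toy model of a mapping. ToyMapping is purely hypothetical and stricter than the real thing (Elasticsearch can coerce some values, such as numeric strings); it only shows the rule that a field name is bound to one type once it has been seen.

```python
class ToyMapping:
    """Toy stand-in for a type's mapping: one value type per field name."""

    def __init__(self):
        self.field_types = {}

    def index(self, doc):
        # Validate every field first so a rejected document changes nothing.
        for field, value in doc.items():
            expected = self.field_types.get(field, type(value))
            if expected is not type(value):
                raise TypeError(
                    f"field {field!r} is mapped as {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        # Record the type of any field seen for the first time.
        for field, value in doc.items():
            self.field_types.setdefault(field, type(value))


mapping = ToyMapping()
mapping.index({"class": "car", "color": 3})  # "color" is now an integer field

try:
    mapping.index({"class": "van", "color": "red"})  # string into integer field
    conflict = None
except TypeError as err:
    conflict = str(err)
```

Here the "van" document is rejected because "color" was first seen as an integer, which mirrors the conflict described above when both classes share one type.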
This is one of the main guidelines on how to use Elasticsearch types - treat the type as a specific data schema that can't have conflicts.

Sorting Solr multivalue fields based on field values

I have multiple Solr instances with separate schemas.
I need to get results ordered by a multivalued field, e.g. by type: train_station, airport, city_district, and so on:
q=köln&sort=query({!v="type:(airport OR train_station)"}) desc
I would like to see airport type documents before train_station type documents. For now, I am always getting the train_station type at the top.
How should I write the query?
You are getting train_stations at the top because of the IDF.
A quick hack to fix it would be to use a range query (which has the advantage of having constant scores) and query boosts: q=köln&sort=query({!v="type:([airport TO airport]^3 OR [train_station TO train_station]^2)"}) desc.
This way, documents that have airport in their type field will have a score of 3, documents that have train_station in their type field will have a score of 2, and documents that have both airport and train_station in their type field will have a score of 2+3=5 (up to a multiplicative constant).
A more elegant (and effective) way of doing this would be to write a custom query parser (or even a function query).
You can sort on a function only if it returns a single value per document. You definitely can't sort on a multiValued field or any field that is tokenized. Seems like you would need a function that returns "airport" if the field contains "airport" (even if it contains "train station") and "train station" if it contains "train station" but not "airport", and then sort on that.
Another option would be to handle this at index time. Add a field called "airport_train_station_sort" that returns 1 if the field contains "airport", 2 if the field contains "train station" but NOT airport, and 3 if it contains neither. Then simply sort on that field.
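The index-time option could be sketched as follows: compute a single-valued sort key from the multivalued type field before sending each document to Solr. The sample documents and the helper name sort_key_for are made up, and the final sorted call only simulates what sort=airport_train_station_sort asc would do on the server.

```python
def sort_key_for(types):
    """1 if airport is present, 2 if only train_station is, 3 otherwise."""
    if "airport" in types:
        return 1
    if "train_station" in types:
        return 2
    return 3


# Made-up documents with a multivalued "type" field.
docs = [
    {"id": "a", "type": ["train_station"]},
    {"id": "b", "type": ["city_district"]},
    {"id": "c", "type": ["train_station", "airport"]},
]

# Add the single-valued sort field at index time.
for doc in docs:
    doc["airport_train_station_sort"] = sort_key_for(doc["type"])

# Simulates sorting on the new field in ascending order.
ordered_ids = [d["id"] for d in sorted(docs, key=lambda d: d["airport_train_station_sort"])]
```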
You cannot solve this problem inside Solr. Check the documentation: Solr does not sort on multivalued fields. Older versions of Solr let you try, but the results were undefined and unpredictable.
You either change your schema and put this sort data into single value indexed fields, or you need to make several queries, first for airports, then city districts, then train stations.
To order items within the field itself, you have to either index them in the order you want or do post-processing. Solr's sort only sorts documents!
