ElasticSearch: Performance Implications of Multiple Types in the same Index - elasticsearch

We are storing a handful of polymorphic document subtypes in a single index (e.g., vehicles with subtypes of car, van, motorcycle, and Batmobile).
At the moment, there is >80% commonality in fields across these subtypes (e.g., manufacturer, number of wheels, ranking of awesomeness as a mode of transport).
The standard case is to search across all types, but sometimes users will want to filter the results to a subset of the subtypes (e.g., find only cars with ...).
How much overhead (if any) is incurred at search/index time from modelling these subtypes as distinct ElasticSearch types vs. modelling them as a single type using some application-specific field to distinguish between subtypes?
I've looked through several related answers already, but can't find the answer to my exact question.
Thanks very much!

There shouldn't be any noticeable overhead.
If you keep everything under the same type, you can filter results by a subtype by adding a "class" field on your objects and adding a condition on this field in your search.
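For example, a minimal sketch of such a filtered search (assuming a "vehicles" index with an application-defined "class" field; the names are illustrative):

curl -XPOST 'localhost:9200/vehicles/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": { "match": { "manufacturer": "wayne enterprises" } },
      "filter": { "terms": { "class": ["car", "van"] } }
    }
  }
}'

Filters like this are cacheable, which is part of why the overhead stays negligible either way.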
A good reason to model your different classes as different ES types is when fields with the same name would have conflicting data types.
That is, suppose your "car" class has a "color" field that holds an integer, while your "van" class also has a "color" field, but one that holds a string. (A contrived example, I know, but it illustrates the point.)
Elasticsearch holds the mapping (the data "schema") for each type, and a field in a mapping can have only one data type. So if you index both "car" and "van" under the same type, you will have a field type conflict: if the field is mapped as an integer and you then try to index a string into it, indexing will fail.
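A quick sketch of that failure (assuming a fresh index named "garage"; the names are illustrative):

curl -XPOST 'localhost:9200/garage/vehicle/1' -d '{"color": 7}'
curl -XPOST 'localhost:9200/garage/vehicle/2' -d '{"color": "red"}'

The first request makes ES map "color" as a numeric field, so the second request is rejected with a mapping exception because "red" cannot be parsed as a number.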
This is one of the main guidelines on how to use Elasticsearch types - treat the type as a specific data schema that can't have conflicts.

Related

How to handle saving logs whose structure is very diverse to Elasticsearch?

My log POCO has several fixed properties, like user id and timestamp, plus a flexible data bag property, which is a JSON representation of any kind of extra information I'd like to add to the log. This means the property names within this data bag could be anything, which raises two questions:
How can I configure the mapping so that the data bag property, which is of type string, is mapped to a JSON object during indexing, instead of being treated as a normal string?
With the data bag object having arbitrary property names, the overall document type could contain a huge number of properties. Would this hurt search performance?
For the data translation from string to JSON you can use an ingest pipeline with the JSON processor:
https://www.elastic.co/guide/en/elasticsearch/reference/master/json-processor.html
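A sketch of such a pipeline (assuming the data bag property is a string field called "databag"; the names are illustrative):

curl -XPUT 'localhost:9200/_ingest/pipeline/parse-databag' -H 'Content-Type: application/json' -d '
{
  "processors": [
    { "json": { "field": "databag", "target_field": "databag_parsed" } }
  ]
}'

Index your logs with ?pipeline=parse-databag appended to the request, and the string is parsed into a structured object before it hits the mapping.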
It depends on your queries. If you use free-text search across all fields, then yes, a huge number of fields will slow the query. If you use queries like "field":"value", the number of fields is not a problem for searches. You can find additional information about query optimization here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/tune-for-search-speed.html#search-as-few-fields-as-possible
And the question is: what do you mean by "huge number"? 1,000? 10,000? 100,000? As part of optimization I recommend using dynamic templates with a definition that ingests each string field into the index as "keyword" only, rather than text plus keyword. This setting cuts the number of fields roughly in half.
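A sketch of such a dynamic template (the index and template names are illustrative):

curl -XPUT 'localhost:9200/logs' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}'

Every new string field is then mapped as a single keyword field instead of the default text-plus-keyword pair.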

In ElasticSearch, what is the equivalent for creating custom field types?

In Solr, my schema is very clean, as I've defined 5 common field types used across 20 fields. Each field type contains a base class (text, etc.), an index analyzer, and a query analyzer. In Elastic, is there no way to represent this other than to define my fields as a base type and then split each field type into its two analyzers?
For example, for 20 common fields, I would still need to create two analyzers to represent my field type (customfield_index_analyzer and customfield_query_analyzer), and I would need to reference these two separate analyzers for every field in the Elastic mapping? I just want to make sure I'm doing this the right way; it seems wrong, and it makes the configuration file less manageable and more prone to mistakes.
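That is indeed the Elasticsearch pattern: define the analyzers once in the index settings and reference them per field via analyzer and search_analyzer. A sketch using the names above (the field name and analyzer internals are illustrative; recent typeless syntax):

curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "customfield_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        },
        "customfield_query_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "customfield_index_analyzer",
        "search_analyzer": "customfield_query_analyzer"
      }
    }
  }
}'

There is no named "field type" indirection as in Solr; each field references its analyzers directly.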

What does the "Type" mean in Elasticsearch?

I am totally confused by Elasticsearch's documentation.
In Basic Concepts: Type, a "type" is somewhat like a collection in MongoDB:
In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
But in Types and Mappings: Type Takeaways, it says:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
Doesn't "user" and "blog" above mentioned have mutually exclusive sets of fields?
For example: there are "name", "age" fields for "user", and "createdAt", "content" for "blog".
I used to believe the mapping relation between Elasticsearch and MongoDB was:
index <=> database
type <=> collection
Isn't that right?
If not, what is the recommended mapping style between them?
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
At the most basic level, the type is just another field in Elasticsearch. When you do GET /my_index/my_type/_search, ES runs a pre-filter on the _type field for the value my_type - it's essentially an automatic filter.
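In other words, searching a single type is roughly equivalent to this filtered query (a sketch):

curl -XPOST 'localhost:9200/my_index/_search' -d '
{
  "query": {
    "bool": {
      "filter": { "term": { "_type": "my_type" } }
    }
  }
}'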
Don't think of indices and types as databases and tables in the SQL world, because they are not.
If you have type1 with fields f1 and f2, and type2 with fields f1 and f3, then the index will contain documents with fields f1, f2, and f3. Why does this matter? When the score for a document is calculated by a query that searches for values in field f1, the term frequencies for f1 are global (across both type1 and type2). So if you search for some value in f1 within type1, the score you get back is slightly influenced by the values of f1 in type2 as well.
Also, please, don't translate a set of SQL tables to ES by simply following the primary key/foreign key approach to define parent/child relationships in ES.
You're right: index == database and type == collection for Elasticsearch. In RDBMS terms, an index is a database and a type can be a table that contains many rows (documents in Elasticsearch).
You could have a different index maintaining user information, with the "name", "age" and other such fields generally attributed to a person, and a different one for blogs with "createdAt", "content", etc. Yet, you might want to have a "user" field inside each blog document to be able to identify the person who posted it. Later, you can apply application-side joins, if need be.
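A sketch of such an application-side join (index and field names are illustrative): first fetch the matching blog documents, then look up their authors in a second query:

curl -XPOST 'localhost:9200/blogs/_search' -d '
{ "query": { "match": { "content": "elasticsearch" } } }'

Collect the "user" values from the hits, then:

curl -XPOST 'localhost:9200/users/_search' -d '
{ "query": { "terms": { "name": ["john", "jane"] } } }'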

How to use the elasticsearch type?

Note: I would really appreciate it if you commented on why you think this is a bad question. Please do not just downvote without saying why.
We know there is a concept called type under index, but I do not know why we need it.
At first I thought we use it to organize data, like having an index as below:
curl -XPOST 'localhost:9200/customer/USA/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
But in the above situation, we can always eliminate the type and move it into the JSON body, like:
curl -XPOST 'localhost:9200/customer/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe","country":"USA" }
{"index":{"_id":"2"}}
{"name": "Jane Doe","country":"USA" }
'
In this way we can always add a field to replace the type.
Then I thought it might be performance related: if you split the data into different types, there is less data under each type, so queries against each type should be faster. But it is not like that either.
The performance of an Elasticsearch index is related to its shards. Even if you split the data into different types, it is still stored in the same set of shards.
Then why do we need types?
First of all, although Elasticsearch determines the types of fields at runtime, once it has assigned a particular type to a field it will always expect the same type of value for that field. So you need multiple types if you need to store differently typed data under the same field names. Secondly, it allows for storing multiple kinds of documents with different mappings in a single index. Besides, it makes querying a particular type easier if you are sure about its schema.
From my understanding of ES, a type is something we can relate to the table concept in a relational database, where a database can be seen as a group of related tables. Likewise, in ES an index is a group of related types, and each type in the index holds documents that share some common properties or fields.
In your example, for an index, say customer, we can have customers from different countries like the USA, India, the UK, etc. Customer records from each country can be grouped under different types so that the data stays organized, and when we run a search query for customers in a particular country we only need to run that query against the type USA. We don't need to look through the whole index to get the data of customers from the USA.
Another example: let's assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data. So we are logically organizing the data into different types and looking in the required type whenever we do a search.
So, in general, a type is a logical category/partition of your index whose semantics are completely up to you. It can be defined as a set of documents that have common fields.
You may refer to this post for a better understanding: https://www.elastic.co/blog/index-vs-type

Elasticsearch indexed database table column structure

I have a question regarding the setup of my elasticsearch database index. I have created a table which I have rivered to an index in elasticsearch. The table is built from a script that queries multiple tables, denormalizing the data to make it easier to index by a unique id at a 1:1 ratio.
An example of a set of fields I have is street, city, state, zip, which I can query on. My question is: should I keep those fields individually indexed, or concatenate them into one big field like address which contains all of the previous fields? Or should I put in the extra time to set up parent-child indexes?
The use case example is: I have a customer with billing info coming from one direction, and I want to query elasticsearch to see if that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming; I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregates on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to query for information from different fields.
In addition to having the information in separate fields I would use multifields with an analyzed and not_analyzed part (fieldname.raw). This again allows for aggregates, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if you analyze it, it will be stored as ['new', 'york'] and you will not be able to see all people from 'New York'. What you'd see is all people from 'new' and from 'york'.
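A sketch of such a multifield mapping (1.x-style syntax; index and field names are illustrative):

curl -XPUT 'localhost:9200/customers' -d '
{
  "mappings": {
    "customer": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'

Search against city for full-text matching, and facet, aggregate or sort on city.raw so that 'New York' stays one term.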
_all field
There is a special _all field in elasticsearch which does the concatenation in the background. You don't have to do it yourself. It is possible to enable/disable it.
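It can be turned off in the mapping when the index is created, for example (a sketch, reusing the illustrative customer type from above):

curl -XPUT 'localhost:9200/customers' -d '
{
  "mappings": {
    "customer": {
      "_all": { "enabled": false }
    }
  }
}'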
Parent Child relationship
Concerning whether to use nested objects or a parent-child relationship: I think a parent-child relationship is more appropriate for your case. Objects inside arrays are, by default, stored in a 'flattened' way, i.e. the information from the objects in an array is stored as if it were part of one object. Consider the following example:
You have an order for a client:
client: 'Samuel Thomson'
    orderline: 'Strong Thinkpad'
    orderline: 'Light Macbook'
client: 'Jay Rizzi'
    orderline: 'Strong Macbook'
With flattened objects, if you search for clients who ordered a 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orderlines are stored all together, i.e. ['Strong', 'Thinkpad', 'Light', 'Macbook'], so there is no distinction between the two orderlines.
By using parent-child documents, the orderlines for the same client are not mixed together and preserve their identity.
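A sketch of the parent-child setup (assuming types client and orderline, with an illustrative item field):

curl -XPUT 'localhost:9200/shop' -d '
{
  "mappings": {
    "client": {
      "properties": { "name": { "type": "string" } }
    },
    "orderline": {
      "_parent": { "type": "client" },
      "properties": { "item": { "type": "string" } }
    }
  }
}'

curl -XPOST 'localhost:9200/shop/client/_search' -d '
{
  "query": {
    "has_child": {
      "type": "orderline",
      "query": { "match": { "item": { "query": "strong macbook", "operator": "and" } } }
    }
  }
}'

The has_child query matches only clients with an individual orderline child containing both terms, so only 'Jay Rizzi' comes back.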
