Note: I would appreciate it if you could comment on why you think this is a bad question. Please do not just downvote without telling me why.
We know there is a concept called "type" under an index, but I do not know why we need it.
At first I thought we use it to organize data, like the index below:
curl -XPOST 'localhost:9200/customer/USA/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
But in the above situation, we can always eliminate the type and move it into the JSON body instead:
curl -XPOST 'localhost:9200/customer/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe","country":"USA" }
{"index":{"_id":"2"}}
{"name": "Jane Doe","country":"USA" }
'
In this way we can always add a field to replace the type.
Then I thought it might be performance related: if you split the data into different types, there is less data under each type, so querying each type should be faster. But it is not like that either.
The query performance of an Elasticsearch index is related to its shards, so even if you split the data into different types, it is still stored under the same set of shards.
So why do we need types?
First of all, although Elasticsearch determines the types of fields at runtime, once it has assigned a particular type to a field it will always expect values of that type for the field. So you need multiple types if you need to store different kinds of values under the same field name. Secondly, types allow storing multiple kinds of documents with different mappings in a single index. Besides, it makes querying a particular type easier if you are sure about its schema.
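For example (a minimal sketch; the age field and the document ids are assumptions): once the first document below causes age to be mapped as a number, a later document whose age cannot be parsed as a number is rejected:
curl -XPOST 'localhost:9200/customer/USA/3?pretty' -d '{"name": "Jim Doe", "age": 30}'
# age is now dynamically mapped as a numeric field
curl -XPOST 'localhost:9200/customer/USA/4?pretty' -d '{"name": "Tim Doe", "age": "thirty"}'
# => rejected with a mapper parsing error: failed to parse [age]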
From my understanding of ES, a type is something we can relate to the table concept in a relational database, in which a database can be seen as a group of related tables. Likewise in ES, an index is a group of related types, and each type in the index holds documents that share some common properties or fields.
In your example, for an index such as customer we can have customers from different countries like the USA, India, the UK, etc. Customer records from each country can be grouped under different types so that the data stays organized. And when we run a search query for customers in a particular country, we need to run it only against the type USA; we don't need to look up the whole index to get the data of customers from the USA.
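For instance, with the customer index from the question, a search restricted to the USA type could look like this (a minimal sketch):
curl -XGET 'localhost:9200/customer/USA/_search?pretty' -d '
{ "query": { "match": { "name": "John" } } }
'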
Another example: let's assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data. So we logically organize the data into different types and look up only the required type whenever we search.
So in general, a type is a logical category/partition of your index whose semantics are completely up to you. It can be defined as a set of documents that share common fields.
You may refer to this post for a better understanding: https://www.elastic.co/blog/index-vs-type
I am exploring deepset Haystack and find it very interesting for multiple use cases like chatbots, search engines, document search, etc.
But I have not found any reference on creating multiple indexes for different documents and searching based on those indexes. I thought of using meta tags for conditional search (on a particular area) by tagging the documents first and then using the params parameter of the query API, but that doesn't seem to work and throws an error (I used the vanilla docker-compose based setup).
You can indeed use multiple indices in the same document store if you want to support multiple use cases. The write_documents method of the document store has an index parameter so that you can store documents for your different use cases in different indices. In the same way, you can pass an index parameter to the query method.
As you suspected, there is also an alternative solution that uses the meta field of documents. However, the format needs to be slightly different. Your query needs to have the following format:
{"query": "What's the capital town?", "params": {"filters": {"name": "75_Algeria75.txt"}}}
and your documents need to have the following format:
{'text': 'Algeria is...', 'meta':{'name': "75_Algeria75.txt"}}
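Put together, a query against the REST API could look like this (a sketch; it assumes the default docker-compose setup exposing the REST API on localhost:8000):
curl -X POST 'http://localhost:8000/query' -H 'Content-Type: application/json' -d '
{"query": "What is the capital town?", "params": {"filters": {"name": "75_Algeria75.txt"}}}
'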
I am totally confused by Elasticsearch's documentation.
In Basic Concepts: Type, "types" are somewhat like collections in MongoDB:
In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
But in Types and Mappings: Type Takeaways, it says:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
Doesn't "user" and "blog" above mentioned have mutually exclusive sets of fields?
For example: there are "name", "age" fields for "user", and "createdAt", "content" for "blog".
I used to believe the mapping relation between Elasticsearch and MongoDB is:
index <=> database
type <=> collection
Isn't that right?
If not, what is the recommended mapping between them?
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
At the very basic level, the type is just another field in Elasticsearch. When you do GET /my_index/my_type/_search, ES runs a pre-filter for the value my_type on the field _type - it's like an automatic filter.
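In other words, the two requests below should match the same documents (a sketch with hypothetical index and type names):
curl -XGET 'localhost:9200/my_index/my_type/_search?pretty'
curl -XGET 'localhost:9200/my_index/_search?pretty' -d '
{ "query": { "term": { "_type": "my_type" } } }
'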
Don't think about indices and types as databases and tables in the SQL world, because they are not that.
If you have type1 with fields f1 and f2, and type2 with fields f1 and f3, the index will contain documents with fields f1, f2, and f3. Why does this matter? When the score for a document is calculated by a query that searches for values in field f1, the term frequencies for f1 are global to the index (covering both type1 and type2). So if you search for some value in f1 within type1, the score you get back is also slightly influenced by the values of f1 in type2.
Also, please don't translate a set of SQL tables to ES by simply following the primary key/foreign key approach to define parent/child relationships in ES.
You're right: index == database and type == collection for Elasticsearch. In RDBMS terms, an index is a database and a type can be a table which contains many rows (documents in Elasticsearch).
You could have one index maintaining user information, with "name", "age", and other such fields generally attributed to a person, and a different one for blogs with "createdAt", "content", etc. Yet you might want a "user" field inside each blog document to be able to identify the person who posted it. Later, you can apply application-side joins, if need be.
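A minimal sketch of such a setup (index names and fields are assumptions):
# one index for users, one for blogs; each blog document carries the user id
curl -XPOST 'localhost:9200/users/user/1?pretty' -d '{"name": "Jane Doe", "age": 30}'
curl -XPOST 'localhost:9200/blogs/blog/1?pretty' -d '{"content": "Hello world", "createdAt": "2015-01-10", "user": "1"}'
# the application-side join: query blogs first, collect the user ids,
# then fetch the matching users from the users index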
I have a use case where I have a set of predefined fields, and I also need to support adding dynamic fields to Elasticsearch with some basic searching on them. I am able to achieve this using dynamic template mapping, as sketched below. However, the frequency of adding such dynamic fields is quite high.
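A minimal sketch of such a dynamic template (the index name, type name, and match rule are assumptions):
curl -XPUT 'localhost:9200/events?pretty' -d '
{
  "mappings": {
    "event": {
      "dynamic_templates": [
        {
          "extension_strings": {
            "path_match": "extensions.*",
            "match_mapping_type": "string",
            "mapping": { "type": "string" }
          }
        }
      ]
    }
  }
}
'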
Consider this ES document for the Event type:
{
  "name": "Youth Conference",
  "venue": "Ahmedabad",
  "date": "10/01/2015",
  "organizer": "Invincible",
  "extensions": {
    "about": {
      "vision": "Visualizes the image of an ideal Country.",
      "mission": "Encapsulates the gravity of the top reformative solutions for betterment of Country."
    }
    // Anything can go here..
  }
}
In the example above, each event document may have unknown/new fields. Hence, for every such new dynamic field introduced, ES will update the mapping of the type. My concern is: what is the cost of adding a new field mapping to an existing type?
I am planning to separate all dynamic mappings (inside extensions) from the Event type by introducing another type, say EventExtensions, and using a parent/child relationship to map it to the Event type. I believe this may limit the cost (if any) of adding dynamic fields frequently. However, to my knowledge, using a parent/child relationship needs more memory.
The first thing to remember here is that fields are per index, not per type.
So wherever you add new fields, they are added to the same index, be it in another type or in a parent or child.
So decoupling the new fields into another type in the same index is not going to change anything.
Second, field addition is not a very expensive operation. I know people who use thousands of fields and are fine with it. That being said, you should keep a tab on the number of fields so that it doesn't grow to crazy numbers.
Here we have multiple approaches to solve the problem:
1) Let's assume that the new field data need not be exactly searchable. In this case, you can serialize the entire JSON as a string and add it to a field. Also make sure this field is not indexed. This way you can search based on the other fields, but on retrieval of the document you still get back the information that was serialized.
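A sketch of such a mapping (the extensions_raw field name is an assumption); the serialized JSON is stored and returned with the document but creates no searchable fields:
curl -XPUT 'localhost:9200/events/_mapping/event?pretty' -d '
{
  "properties": {
    "extensions_raw": { "type": "string", "index": "no" }
  }
}
'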
2) Let's say the new field looks like this:
{
  "newInfo1": "log Of Info",
  "newInfo2": "A lot more info"
}
Instead of this, you can use:
{
  "newInfo": [
    {
      "fieldName": "newInfo1",
      "fieldValue": "log Of Info"
    },
    {
      "fieldName": "newInfo2",
      "fieldValue": "A lot more info"
    }
  ]
}
This way, the number of fields won't increase. But then, to make a field-level specific search - like "give me all documents with fieldName newInfo2 and the word more in the value" - you will need to make the newInfo field nested, as sketched below.
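A sketch of the nested mapping and the corresponding query (index and type names are assumptions):
curl -XPUT 'localhost:9200/events/_mapping/event?pretty' -d '
{
  "properties": {
    "newInfo": {
      "type": "nested",
      "properties": {
        "fieldName":  { "type": "string", "index": "not_analyzed" },
        "fieldValue": { "type": "string" }
      }
    }
  }
}
'
curl -XGET 'localhost:9200/events/event/_search?pretty' -d '
{
  "query": {
    "nested": {
      "path": "newInfo",
      "query": {
        "bool": {
          "must": [
            { "term":  { "newInfo.fieldName": "newInfo2" } },
            { "match": { "newInfo.fieldValue": "more" } }
          ]
        }
      }
    }
  }
}
'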
Hope this helps.
I have a question regarding the setup of my Elasticsearch index. I have created a table which I have rivered to an index in Elasticsearch. The table is built from a script that queries multiple tables to denormalize the data, making it easier to index by a unique id (a 1:1 ratio).
An example of a set of fields I have is street, city, state, zip, which I can query on. My question is: should I keep those fields individually indexed, or concatenate them into one big field like address which contains all of the previous fields? Or should I put in the extra time to set up parent-child indexes?
The use case: I have a customer with billing info coming in from one direction, and I want to query Elasticsearch to see if that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming; I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregates on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to search across the different fields.
In addition to keeping the information in separate fields, I would use multi-fields with an analyzed and a not_analyzed part (fieldname.raw). This again allows for aggregates, facets, and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if you analyze it, it will be stored as ['New', 'York'] and you will not be able to see all people from 'New York'; what you'd see are all people from 'New' and from 'York'.
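A sketch of such a multi-field mapping (index and field names are assumptions; the linked 0.90 docs use the older "type": "multi_field" syntax instead):
curl -XPUT 'localhost:9200/customers?pretty' -d '
{
  "mappings": {
    "customer": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
'
Full-text queries then go against city, while facets, aggregations, and sorting use city.raw.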
_all field
There is a special _all field in Elasticsearch which does the concatenation in the background; you don't have to do it yourself. It is possible to enable/disable it.
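Disabling it is done in the mapping when the type is created, for example (a sketch with hypothetical names):
curl -XPUT 'localhost:9200/my_index?pretty' -d '
{ "mappings": { "my_type": { "_all": { "enabled": false } } } }
'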
Parent Child relationship
Concerning whether to use nested objects or a parent-child relationship: I think a parent-child relationship is more appropriate for your case. Nested objects are stored in a 'flattened' way, i.e. the information from the objects in arrays is stored as being part of one object. Consider the following example:
You have orders for clients:
client: 'Samuel Thomson'
  orderline: 'Strong Thinkpad'
  orderline: 'Light Macbook'
client: 'Jay Rizzi'
  orderline: 'Strong Macbook'
Using nested objects, if you search for clients who ordered a 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored all together, i.e. ['Strong', 'Thinkpad', 'Light', 'Macbook']; there is no distinction between the two orderlines.
By using parent-child documents, the orderlines for the same client are not mixed and preserve their identity.
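A sketch of the parent-child setup for this example (index and field names are assumptions):
curl -XPUT 'localhost:9200/shop?pretty' -d '
{
  "mappings": {
    "client": {},
    "orderline": { "_parent": { "type": "client" } }
  }
}
'
curl -XPOST 'localhost:9200/shop/client/1?pretty' -d '{"name": "Samuel Thomson"}'
curl -XPOST 'localhost:9200/shop/orderline/11?parent=1&pretty' -d '{"item": "Strong Thinkpad"}'
curl -XPOST 'localhost:9200/shop/orderline/12?parent=1&pretty' -d '{"item": "Light Macbook"}'
# each orderline is searched on its own, so only clients with an
# orderline phrase-matching "Strong Macbook" come back
curl -XGET 'localhost:9200/shop/client/_search?pretty' -d '
{ "query": { "has_child": { "type": "orderline", "query": { "match_phrase": { "item": "Strong Macbook" } } } } }
'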
I'm looking to search for a particular JSON document in a bucket, and I don't know its document ID; all I know is the value of one of the sub-keys. I've looked through the API documentation but am still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With Couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey/endKey, but this feels wrong - couldn't you still end up with multiple documents being returned? It would be nice to be able to pass a parameter to the view so that an exact match could be performed. Also, how would we handle multi-dimensional searches, i.e. name and category?
I'd like to do as much of the filtering as possible on the Couchbase instance and ideally narrow it down to one record, rather than having to filter after the data comes back to the app tier - something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with Couchbase to filter, but if I've read the docs correctly this filtering is still done client-side. At least if we could narrow the returned dataset down to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
You are correct on one point: you need to create a view (an index, indeed) to be able to query on the content of the JSON document.
So in your case you have to create a view with this kind of code:
function (doc, meta) {
  if (doc.type == "yourtype") { // good practice: store a type field in each doc
    emit(doc.name, null);       // index documents by name
  }
}
This will create an index - distributed across all the nodes of your cluster - that you can now use in your application. You can point to a specific value using the "key" parameter.
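For example, over the view REST API (a sketch; the bucket, design document, and view names are assumptions, and 8092 is the default view port):
curl 'http://localhost:8092/mybucket/_design/mydesign/_view/by_name?key="some-arbritrary-name-here"'
For the multi-dimensional case (name and category), you can emit a composite array key, e.g. emit([doc.name, doc.category], null), and query it with key=["some-name","some-category"].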