Elasticsearch: Wrapping My Head Around Index Types

I'm looking into Elasticsearch right now and I am having a hard time grasping how index types fit into the data model. I've read examples and documentation, but none really goes in depth, or the examples use a data model composed of several submodels.
I am currently using MongoDB to store my data. Let's take the example of an Article collection that I want indexed for search; my doc looks like this:
Article = {
title: String,
publisher: String,
subject: String,
description: String,
year: Integer,
}
Now I want each of those fields to be searchable, so I would make an Elasticsearch index of 'Article'. I will need to define each field, how it should be analysed, and whether it is stored or not; that much I understand.
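For example, I imagine the mapping definition would look something like this (just my sketch; the field types are assumptions on my part, and in recent Elasticsearch versions the properties go directly under "mappings", while older versions nest them under a type name):
PUT /article
{
  "mappings": {
    "properties": {
      "title":       { "type": "text" },
      "publisher":   { "type": "keyword" },
      "subject":     { "type": "keyword" },
      "description": { "type": "text" },
      "year":        { "type": "integer" }
    }
  }
}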
Now how does an index type come in here? As far as I am aware, Lucene does not have this concept, this is a layer added by Elasticsearch.
For example, some of you may say that we can logically group the documents by subject or publisher and create index types on those, but how is that different from simply searching by subject or publisher?
Is having index types more of a performance-related concern?

Not a very easy question to answer, but I am going to give it a try. Be warned, though: this is just my opinion.
First of all, if you do not really need to keep certain documents together in an index, and it just feels like they should be, create separate indices. There is not really a penalty for using more indices instead of more types. The only thing I can think of is that within one index you could create analysers and mappings that you reuse across the different types.
You can use types if you feel documents belong together: they have a similar structure, but not necessarily the same structure. Be warned though: do not create different mappings for fields with the same name in different types within the same index. Lucene does not like this.
Then there is the final scenario: parent-child relationships. Here you need types. This way the parent and its children can be placed in the same shard, which is better for performance.
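A rough sketch of what that looked like in Elasticsearch versions that still allowed multiple types per index (pre-6.x), with made-up index and type names:
PUT /blog
{
  "mappings": {
    "post": {},
    "comment": {
      "_parent": { "type": "post" }
    }
  }
}

PUT /blog/comment/1?parent=42
{
  "body": "Nice article!"
}
Indexing a child with ?parent=... routes it to the same shard as its parent, which is where the performance benefit mentioned above comes from.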
Hope that helps a bit.

If I'm not mistaken, the catch is that using more than one type in one index is almost identical to using different indices. Say, you can store (as I did) documents of types "simple_address", "delivery_address", "some_strange_but_official_address_info" in the same index "address" to make your code a bit more sane. But if you don't use parent-child links, it's equivalent to just having three indices.
Speaking of your example, you should wrap your head around what you would like to search. If, for instance, you add comments into the equation, it's better to use some kind of separation - either parent-child or different indices with manual mapping by keys. And, obviously, you should have different mappings for the "Article" and "Comment" types.

How to calculate relevance in Elasticsearch based on associated documents

Main question:
I have one data type, let's call them People, and another associated data type, let's say Reports. A person can have many reports associated with them and vice versa in our relational database. These reports can be pretty long, sometimes over 1000 words mostly in English.
We want to be able to search people by a keyword, where the results are the people whose reports are most relevant to the keyword. For example if Person A's reports mention "art director" a lot more than any other person, we want Person A to show up high in the search results if someone searched "art director".
More details:
The key thing here is that we don't want to combine all the reports together and add them as a field on the Person model. I think that with 100,000s of our People records and 1,000,000s of long reports, this would make the index super big. (And I think there might be limits on how long the text of a field can be.)
The reports are also indexed on their own, so that people can do full text searches of all the reports without considering the People records. This works great already.
To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
Is this possible, and if so how?
P.S. I am using the Searchkick Ruby gem to generate the Elasticsearch queries through an API. But I can also use the Elasticsearch DSL directly if necessary.
Answering your questions.
1.(...) we want Person A to show up high in the search results if someone searched "art director".
That's exactly what Elasticsearch does, so I would recommend you start with a simple match query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
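For example, a minimal match query (the index and field names here are assumptions based on your description) would look like:
GET /reports/_search
{
  "query": {
    "match": {
      "text": "art director"
    }
  }
}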
From there you can start adding more complexity.
Elasticsearch uses TF-IDF, which means:
TF (Term Frequency): the more frequent a term is within a document, the more relevant it is.
IDF (Inverse Document Frequency): the more frequent a term is across the entire dataset, the less relevant it is.
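In its simplest textbook form (Lucene's classic similarity adds length normalization on top, and newer Elasticsearch versions default to BM25), the score of a term t in a document d is roughly:
score(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is how often t occurs in d, N is the total number of documents, and df(t) is the number of documents containing t.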
2.(...) To avoid these large and kind of redundant indexes, I want to use the Elasticsearch query language to search "through" the Person record to its associated reports.
You are right. The recommendation is not to index a whole book as a single field, but to index the different chapters/pages/etc. as separate documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/general-recommendations.html
There are some structures you can use. Which one to use will depend on how big the scale of your data is and how you want to show this data to your users.
The structures are:
Joined field type (parent=author child=report pages)
Nested field type (array of report pages within an author)
Collapsed results (each doc being a book page, collapse by author)
We can discuss a lot about the best one, but I invite you to try yourself.
Some guidelines:
If the reports greatly outnumber the authors, you can use the join field type; a sketch follows below.
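A minimal sketch of the join field approach (index, field, and relation names are placeholders I made up; syntax shown is for recent, typeless versions):
PUT /people
{
  "mappings": {
    "properties": {
      "name":        { "type": "text" },
      "report_text": { "type": "text" },
      "relation": {
        "type": "join",
        "relations": { "person": "report" }
      }
    }
  }
}

GET /people/_search
{
  "query": {
    "has_child": {
      "type": "report",
      "query": { "match": { "report_text": "art director" } },
      "score_mode": "sum"
    }
  }
}
The has_child query returns the parent (person) documents ranked by how well their child reports match; note that child documents have to be indexed with a routing value pointing at their parent so they end up on the same shard.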

Why do mappings exist in Elasticsearch?

From what I read, Elasticsearch is dropping support for types.
So, as the examples say, indexes are similar to databases and documents are similar to rows of a relational database.
So now, everything is a top-level document, right?
Then what is the need for a mapping, if we can store all sorts of documents in an index with whatever schema we want them to have?
I want to understand if my concepts are incorrect anywhere.
Elasticsearch is not dropping support for mapping types, they are dropping support for multiple mapping types within a single index. That's a slight, yet very important, difference.
Having a proper index mapping in ES is as important as having a proper schema in any RDBMS; the main idea is to clearly define of which type each field is, how you want your data to be analyzed, sliced and diced, etc.
Without an explicit mapping, it wouldn't be possible to do all of the above (and much more). ES would guess the type of your fields, and even though most of the time it gets it right, there are plenty of times where the guess is not exactly what you want/need.
For instance, some people store floating-point values in string fields (see below); ES would detect such a field as text/keyword even though you want it to be a double.
{
"myRatio": "0.3526472"
}
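An explicit mapping that forces this field to be numeric would look roughly like this (a sketch, reusing the field name from above; the index name is made up):
PUT /my_index
{
  "mappings": {
    "properties": {
      "myRatio": { "type": "double" }
    }
  }
}
With this in place, a document arriving with "myRatio" as a quoted string is still indexed as a double (numeric fields coerce numeric strings by default), and range queries or aggregations on it behave as expected.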
This is just one reason out of many why it is important to define your own mapping and not rely on the fact that ES will guess it for you.

ElasticSearch suggestions on blog name, authors and tags

I've been looking all over the place for a good answer to my question but I just can't find any...
I'm using ElasticSearch along with Laravel. I've used ElasticSearch on another project but never used suggestions. I'm following this tutorial as I think it provides a great starting point for using Laravel with ElasticSearch: https://blog.madewithlove.be/post/how-to-integrate-your-laravel-app-with-elasticsearch/
My question is about suggestions; I want my search to be a search-as-you-type just like the one you would find on Spotify. I want my users to type a few letters in the search box and have the results be organized into multiple categories: blogs, authors, tags.
If I index my data into one index, with authors and tags being nested objects of the blog, I can easily get suggestions using the completion suggester for blog names, but not for nested objects. I could also split each model and index the data separately into different indexes, but that would mean making 3 queries to get my results back.
Am I doing something wrong? Should I structure my data differently? Is making 3 queries the way to go or is there a way to have a single query output search results from different indexes?
Thanks!
Xavier
Something I did when I built a search-as-you-type feature was to use a separate index for suggestions. In your situation, you'd index the name (title, author, whatever) in one field and the type in another. Then you could search on one field and display the grouped results.
The advantage here is speed. This will likely be a heck of a lot faster than trying to do a suggester on your nested data. (Which you can probably do, but I'm not sure how.) And speed is pretty important for this type of feature.
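A rough sketch of such a suggestion index (names are made up; the completion field type and the suggest search API are standard Elasticsearch features):
PUT /suggestions
{
  "mappings": {
    "properties": {
      "name": { "type": "completion" },
      "type": { "type": "keyword" }
    }
  }
}

POST /suggestions/_search
{
  "suggest": {
    "autocomplete": {
      "prefix": "spo",
      "completion": { "field": "name" }
    }
  }
}
Each suggestion document carries its type ("blog", "author" or "tag"), so with a single query you can group the returned options into the three categories on the client side.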

When to use "_type" in Elasticsearch?

I started reading the documentation about Elasticsearch, and I read about the _type metadata element in the Elasticsearch documentation:
Elasticsearch exposes a feature called types which allows you to logically partition data inside of an index. Documents in different types may have different fields, but it is best if they are highly similar.
So my question is: in which situations is it best practice to split documents into types? Because in the documentation, they write that documents in different _types should have similar fields.
Let's say you create a new index "WWW" and its types would be "http" and "https". Both types have the same mapping and fields. It would then be easier to search all the "http" documents like this:
GET /WWW/http/_search?pretty
and the https like this:
GET /WWW/https/_search?pretty
It also gives you a logical separation between your data.
There's a good blog post about type vs index: https://www.elastic.co/blog/index-vs-type
Having the same mappings and fields is a good starting point (since sparsity is an issue). Just be aware that types will be removed in the future, so don't structure your logic around them too heavily. But you will be able to do the same with an enum-like field and a filter in your query, as in the example below.
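For example, with a keyword field (here called protocol, a name I made up) instead of a type, the per-protocol search becomes a simple filter:
GET /WWW/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "protocol": "http" } }
    }
  }
}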

MongoDB efficient dealing with embedded documents

I have serious trouble finding anything useful in the Mongo documentation about dealing with embedded documents. Let's say I have the following schema:
{
_id: ObjectId,
...
data: [
{
_childId: ObjectId // let's use custom name so we can distinguish them
...
}
]
}
1. What's the most efficient way to remove everything inside data for a particular _id?
2. What's the most efficient way to remove the embedded document with a particular _childId inside a given _id? What's the performance here - can _childId be indexed in order to achieve logarithmic (or similar) complexity instead of a linear lookup? If so, how?
3. What's the most efficient way to insert a lot of (let's say 1000) documents into data for a given _id? And, like above, can we get O(n log n) or similar complexity with proper indexing?
4. What's the most efficient way to get the count of documents inside data for a given _id?
The other two answers give sensible advice on your questions 1-4, but I want to address your question by interrogating the basis for asking it in the first place. The terminology of "embedded document" in the context of MongoDB storing "documents" confuses people. You should not think of an embedded document as another document in MongoDB that you search for, index, or update as its own document, because that's not what it is. It's a grouped collection of fields inside a document; it's a BSON field of type Object. To quote the embedded document docs,
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
Starting from knowledge about your use case, you should pick your documents and document structure to make your common operations easier. If you are so concerned about 1-4, you probably want to unwind your data array of childIds into separate documents. A concrete example of this common "antipattern" is a blog with many authors - you could have a user document with a large, changing array of posts embedded inside, or a post document with user information replicated in each. I can't say for sure what is or isn't wrong with your data model as you've given no specific details about it, but struggling to understand why 1-4 seem hard or undocumented or slow in MongoDB is a good sign that you should rethink the data model so the equivalent of 1-4 are fun and easy! Or at least easier and more fun.
I can't find anything on speed, so I will go with the approaches found in the documentation, in the hope that the documented way is also the most efficient one:
If you want to remove all subdocuments in data you can just update data to []
The official way to remove a document with a specific _childId from data would be $pull:
db.collection.update(
  { },
  { $pull: { data: { _childId: id } } }
)
You might need to add { multi: true } if _childId is not unique, i.e. if matching subdocuments exist in more than one parent document.
On indexing subdocuments I would refer you to this question. Short answer: yes, you can index fields in subdocuments for faster lookups just like you would index normal fields, by
db.collection.ensureIndex({"data._childId" : 1})
If you want to search for a subdocument in only one specific document, you can use aggregation, e.g.
db.collection.aggregate([
  { $match: { _id: _id } },
  { $unwind: '$data' },
  { $match: { 'data._childId': _childId } }
])
which will first match for _id and only then for _childId. It will return the parent document with data only containing the subdocument(s) with _childId.
There is $push for that, although for 1000 subdocuments you might not want to do it in one query anyway; see the sketch below.
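A sketch of such a batched insert with $push and $each (collection and variable names are placeholders):
// insert many subdocuments into the data array of one parent document
db.collection.update(
  { _id: yourId },
  { $push: { data: { $each: arrayOfSubdocuments } } }
)
// for very large batches you could split arrayOfSubdocuments into chunks
// and issue one update per chunk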
Trudbert is right: db.collection.update({_id:yourId},{$set:{data:[]}})
Two points for Trudbert. However, I would like to add that if you have the whole document available in your app, it might be reasonable to simply replace the contents of the whole document if suitable for your use case.
I have had good experiences with bulk updates performance-wise. You might want to try them.
I don't know how you came to the idea that an aggregation wouldn't use indexes, but since _id is unique, it would make much more sense to use db.collection.findOne({_id:yourId},{"data._childId":1,_id:0}).data.length or its equivalent as a raw command in the driver of your choice. Since the connection is already established, unless the array is very big it should be faster to simply return the data and count it client-side instead of having the calculation done on a possibly (over)loaded server.
As per your comments to Trudbert's answer: _id is unique, so exactly one doc will need to be modified for a known _id: db.collection.update({_id:theId},{$pull..... It does not get more efficient. For an unknown id, create an index on _childId and do the same pull operation with a match on _childId instead of _id, with the multi option set, to remove all references to a specific _childId.
I strongly second Trudbert's suggestion of using the aggregation framework to create documents when needed out of optimized data. Currently, I have an aggregation pipeline which analyses 5M records with more than 7 million relations to each other in some 6 seconds, on a non-sharded standalone instance with spinning disks, crappy IO, and no optimization at all. With careful planning of the aggregations (an early $match limiting the documents passed on to the ones not processed so far) and merging them with earlier results (adapting the _id in the $group phase can achieve that), you can optimize this down to mere fractions of a second, if absolutely necessary.
