Elasticsearch 6.0 Removal of mapping types - Alternatives - elasticsearch

Background
I am migrating my ES index to ES version 6. I am currently stuck because ES6 removed the use of the "_type" field.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is that with this structure I can easily remove all the documents of a specific user.
DELETE /myindex/user-123
(Deletes all the documents of a specific user with a single command)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the user ID as the index name. So my indexes will look like:
"user-123" -> "static-name" -> document
Deleting a user is done by deleting the index (instead of deleting the type, as in the previous implementation).
Questions:
My first worry is about the number of indexes and performance: is having something like 1M indexes acceptable in terms of performance? Don't forget I have to search them frequently.
Most of my users have a small number of documents stored in ES. Does it make sense to hold a shard, which should be expensive, for < 10 documents?
Does my data architecture sound reasonable to you?
Any other tip will be welcome!
Thanks.

I would not have one index per user, it's a waste of resources, especially if there are only 10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
"actions" : [
{
"add" : {
"index" : "users",
"alias" : "user-123",
"filter" : { "term" : { "userId" : "123" } }
}
}
]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
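A minimal Python sketch of the filtered-alias approach, building only the request bodies. The helper names (alias_action, delete_request) are hypothetical; the users index and userId field come from the answer above, and sending the requests is left to whichever HTTP client you use.

```python
import json


def alias_action(user_id, index="users"):
    """Body for POST /_aliases: adds a filtered alias 'user-<id>' on the shared index."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"user-{user_id}",
                    # Only documents whose userId matches are visible through the alias.
                    "filter": {"term": {"userId": user_id}},
                }
            }
        ]
    }


def delete_request(user_id):
    """Method, path and body that delete every document visible through the alias."""
    return "POST", f"/user-{user_id}/_delete_by_query", {"query": {"match_all": {}}}


if __name__ == "__main__":
    print(json.dumps(alias_action(123), indent=2))
    print(delete_request(123))
```

With 100K+ users this means one alias per user, so it is worth checking how that many aliases affects the size of the cluster state before committing to it.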

Having this many indexes is definitely not a good approach. If your only concern is deleting multiple documents with a single command, then you can use the Delete by Query API provided by Elasticsearch.
You can introduce a "subtype" attribute in all your documents, holding a value of the form "user-<id>" for each document. So in your case, a document would look like this:
{
"attribute1":"value",
"subtype":"user-123"
}
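As a rough sketch, the Delete by Query request for one user could be built like this in Python. The function name is hypothetical, and the sketch assumes "subtype" is mapped as a keyword field so that the term query matches the exact value "user-123" rather than analyzed tokens.

```python
def delete_by_subtype(index, user_id):
    """Return the path and body for a Delete by Query call that removes one user's documents."""
    path = f"/{index}/_delete_by_query"
    # Assumption: "subtype" is a keyword field, so the term query is an exact match.
    body = {"query": {"term": {"subtype": f"user-{user_id}"}}}
    return path, body


if __name__ == "__main__":
    print(delete_by_subtype("myindex", 123))
```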

Related

One large Elasticsearch lookup index, or several smaller ones?

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:
users_index
{
'game_ids': [],
'movie_ids' : [],
'book_ids': []
}
or one index per lookup value, like:
user_games_index
{
'game_ids': []
}
user_movies_index
{
'movie_ids': []
}
user_books_index
{
'book_ids': []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update api docs, updating a document means retrieving the whole thing first. I will be updating each document a lot, and those arrays can become rather large (think thousands of ids). Updating a book id will then retrieve all of the game ids, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32G ram).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?
How about a third option?
You have one index, and each document in the index looks something like this (document_type is one of "movie", "game" or "book"):
{
"user_id" : "some_user",
"document_type" : "movie",
"document_id" : "id of movie, game or book"
}
Why? Since you say a user's games, movies or books will be updated often, this approach lets you easily add / delete individual movies, games or books for users.
You also can easily filter the books/movies/games for specific users.
All values are of type "keyword" and filtering should be fast.
PS: A "good" mapping for an ES index tries to minimize the number of updates to individual documents and works instead at the level of inserting / deleting documents, since ES does this very well compared to finding and updating documents.
Edit: I have added query examples to illustrate how you can filter out results with bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
}
}
}
}
I want only movies a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
},
"filter":{
"term" : {
"document_type" : "movie"
}
}
}
}
}
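To illustrate the "insert / delete instead of update" idea, here is a minimal Python sketch that turns added and removed interactions into a Bulk API body. The interaction_id scheme and function names are hypothetical; the target index (and, on 6.x, the type) is assumed to be given in the request URL, so the action lines carry only an _id.

```python
import json


def interaction_id(user_id, document_type, document_id):
    """Hypothetical deterministic _id, so an interaction can be deleted without searching."""
    return f"{user_id}:{document_type}:{document_id}"


def bulk_lines(added, removed):
    """Build a newline-delimited Bulk API body.

    added / removed are iterables of (user_id, document_type, document_id) tuples.
    """
    lines = []
    for user_id, document_type, document_id in added:
        # An index action is followed by the document source on the next line.
        lines.append(json.dumps({"index": {"_id": interaction_id(user_id, document_type, document_id)}}))
        lines.append(json.dumps({
            "user_id": user_id,
            "document_type": document_type,
            "document_id": document_id,
        }))
    for user_id, document_type, document_id in removed:
        # A delete action has no source line.
        lines.append(json.dumps({"delete": {"_id": interaction_id(user_id, document_type, document_id)}}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_lines([("u1", "movie", "m9")], [("u1", "book", "b2")]))
```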

Search in multiple indexes in elastica

I am looking for a way to search in more than one index at the same time using Elastica.
I have an index products, and an index user.
products contains {product_id, product_name, price} and user contains {product_id, user_name, date}. The product_id is the same in both: in products each product_id is unique, but in user it is not, as a user can buy the same product multiple times.
Anyway, I want to automatically get the price of a product from the products index while searching through the user index.
I know that we can search over multiple indexes like so (correct me if I'm wrong) :
$search = new \Elastica\Search($client);
$search->addIndex('users')
->addType('user')
->addIndex('products')
->addType('product');
But the problem is, when I write an aggregation on the products_id for example and then create a new query with some filters :
$products_agg = new \Elastica\Aggregation\Terms('products_id');
$products_agg->setField('products_id')->setSize(0);
$query = new \Elastica\Query();
$query->addAggregation($products_agg);
$query->setQuery($bool);
$search->setQuery($query);
How does elastica know in which index to search? How can I link this products_id to the other index?
The Elastica library supports the Multi Search API, which allows several search requests to be executed within the same API call. The endpoint for it is _msearch.
The format of the request is similar to the bulk API: the first line of each pair is a header that names which index / indices to search, and the second line contains the typical search request body.
{"index" : "products", "type": "products"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10} // write your own query to get price
{"index" : "users", "type" : "user"}
{"query" : {"match_all" : {}}} // query for user
Check test case in Multi/SearchTest.php to see how to use.
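The header/body pairing can be sketched in Python as follows. This is only an illustration of the _msearch wire format, not Elastica's API; the function name is hypothetical and the index names are the ones from the question.

```python
import json


def msearch_body(searches):
    """Build a newline-delimited _msearch body from (index, search_body) pairs."""
    lines = []
    for index, search in searches:
        lines.append(json.dumps({"index": index}))  # header line: where to search
        lines.append(json.dumps(search))            # body line: the search itself
    # Like the bulk API, the body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(msearch_body([
        ("products", {"query": {"match_all": {}}, "from": 0, "size": 10}),
        ("users", {"query": {"match_all": {}}}),
    ]))
```

Note that the two searches are still independent: each returns its own response, and correlating them by product_id remains the application's job.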
Basically you want to join two indexes based on a common field as in sql.
What you can do is model your data in the same index using the join datatype
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html
Index all documents in the same index,
Make all product documents parents.
Make all user documents children.
And then use parent-child aggregations and queries
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_queries_and_aggregations
NOTE: be aware of the performance implications of parent-child mapping
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_and_performance
One more thing you can do is store all the information about the product with every user document that buys it.
This can waste space unnecessarily and is not good practice as far as data normalisation rules are concerned.
But since this is a search engine, Elasticsearch suggests it is best to denormalise and duplicate data rather than use parent-child.
You can try the following:
1- name the indexes with a specific suffix, like the following
myFirstIndex-myProjectName
mySecIndex-myProjectName
myThirdIndex-myProjectName
and so on.
2- that gives me the ability to use * in the index name when searching, because it accepts wildcards, so I can search across multiple indexes like this using Kibana Dev Tools
GET *-myProjectName/_search
{
"_source": {
"excludes": [ "*" ]
},
"query": { "match_all": {} }
}
This will search every index whose name ends with -myProjectName.
You can't query two indices with different mappings. Best way to solve your problem is to just do two queries (application-side joins). First query you do the aggregations on the user and the second you get the prices.
Another option would be to add the price to the user index. Sometimes you have to sacrifice a little space for better usability.
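A minimal Python sketch of the application-side join described above: one request aggregates purchases in the user index, a second fetches prices from products, and the two results are merged in code. The function names are hypothetical, and the field names product_id and price are taken from the question.

```python
def purchases_agg_body(size=100):
    """First request (user index): count purchases per product_id."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {"by_product": {"terms": {"field": "product_id", "size": size}}},
    }


def price_query_body(product_ids):
    """Second request (products index): fetch the price of each aggregated product."""
    return {
        "size": len(product_ids),
        "_source": ["product_id", "price"],
        "query": {"terms": {"product_id": product_ids}},
    }


def join_prices(buckets, product_hits):
    """Merge aggregation buckets with product hits on product_id."""
    prices = {hit["_source"]["product_id"]: hit["_source"]["price"] for hit in product_hits}
    return [
        {"product_id": b["key"], "purchases": b["doc_count"], "price": prices.get(b["key"])}
        for b in buckets
    ]


if __name__ == "__main__":
    buckets = [{"key": 7, "doc_count": 3}, {"key": 9, "doc_count": 1}]
    hits = [{"_source": {"product_id": 7, "price": 4.5}}]
    print(join_prices(buckets, hits))
```

A product that is missing from the second response simply gets a price of None here; how to treat that case is up to the application.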

How to update a document using index alias

I have created an index "index-000001" with primary shards = 5 and replica = 1. And I have created two aliases
alias-read -> index-000001
alias-write -> index-000001
for indexing and searching purposes. When I do a rollover on alias-write when it reaches its maximum capacity, it creates a new "index-000002" and updates aliases as
alias-read -> index-000001 and index-000002
alias-write -> index-000002
How do I update/delete a document that exists in index-000001 (what if all I know is the document ID, but not which index the document resides in)?
Thanks
Updating through an index alias that points to more than one index is not directly possible. The best solution is to use a search query, by document id or by a term, to find the required index. Using that index you can update your document directly.
GET alias-read/{type}/{doc_id} will get the required Document if doc_id is known.
If doc_id is not known, then find it using a unique id reference
GET alias-read/_search
{
"query" : {
"term" : { "field" : "value" }
}
}
In both cases, you will get a single document as a response.
Once the document is obtained, you can use the "_index" field to get the required index.
PUT {index_name}/{type}/{id}
{
"required_field" : "new_value"
}
to update the document.
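The two steps can be sketched in Python as below. The helper names are hypothetical. The lookup uses a search with an ids query because, as far as I know, a single-document GET is rejected when the alias resolves to more than one index, while a search works across all of them. The update path assumes the 6.x-style /{index}/{type}/{id}/_update endpoint; unlike the PUT above, which replaces the whole document, a partial "doc" update merges the given fields into the existing source.

```python
def find_by_id_body(doc_id):
    """Search body for GET alias-read/_search that locates a document by its _id."""
    return {"query": {"ids": {"values": [doc_id]}}}


def update_request(hit, changes):
    """Build a partial-update request against the concrete index found in a search hit.

    hit is one entry of hits.hits from the search response; it carries _index, _type and _id.
    """
    # Assumption: 6.x-style update path that still includes the type.
    path = f"/{hit['_index']}/{hit['_type']}/{hit['_id']}/_update"
    return "POST", path, {"doc": changes}


if __name__ == "__main__":
    hit = {"_index": "index-000001", "_type": "doc", "_id": "42"}
    print(find_by_id_body("42"))
    print(update_request(hit, {"required_field": "new_value"}))
```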

Why are Elasticsearch aliases not unique

The Elasticsearch documentation describes aliases as feature to reindex data with zero downtime:
Create a new index and index the whole data
Let your alias point to the new index
Delete the old index
This would be a great feature if aliases were unique, but it's possible for one alias to point to multiple indexes. If the deletion of the old index fails, my application might talk to two indexes which might not be in sync. Even worse: the application doesn't know about that.
Why is it possible to reuse an alias?
It allows you to easily have several indexes that are both used individually and together with other indexes. This is useful for example when having a logging index where sometimes you want to query the most recent (logs-recent alias) and sometimes want to query everything (logs alias). There are probably lots of other use cases but this one pops up as the first for me.
As per the documentation you can send both the remove and add in one request:
curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "test1", "alias" : "alias1" } },
{ "add" : { "index" : "test2", "alias" : "alias1" } }
]
}'
After that succeeds you can remove your old index, and if that fails you will just have an extra index taking up some space until it's cleaned out.
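For a reindex workflow, the same request body can be generated in Python. This is a small sketch with a hypothetical function name; it only builds the body for POST /_aliases. Because both actions travel in one request, the alias is switched from the old index to the new one in a single step rather than pointing at both in between.

```python
import json


def swap_alias(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from old_index to new_index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    print(json.dumps(swap_alias("alias1", "test1", "test2"), indent=2))
```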

one-to-many relationships in Elastic Search

Suppose I have 2 tables called "twitter_users" and "twitter_comments".
twitter_users has the fields: username and bio
twitter_comments has the fields: username and comment
Obviously, a user has 1 entry in twitter_users and potentially many in twitter_comments
I want to model both twitter_users and twitter_comments in Elastic Search, have ES search both models when I query, knowing that a comment counts towards the overall relevancy score for a twitter user.
I know I can mimic this with just 1 model, by creating a single extra field (in addition to username and bio) with all the comments concatenated. But is there another "cleaner" way?
It depends.
If you just want to be able to search a user's comments, full-text and over all fields, simply store all comments within the user object (no need to concatenate anything):
{
"user" : {
"username" : "TestUser",
"bio" : "whatever",
"comments" : [
{
"title" : "First comment",
"text" : "My 1st comment"
},
{
"title" : "Second comment",
"text" : "My 2nd comment"
}
]
}
}
If you need per-comment queries, you need to map the comments as nested (before indexing any data), so that every comment gets treated as a single item.
For your scoring, simply add another field "comment_count" and use this for your boost/scoring.
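A rough Python sketch of what the nested mapping and a per-comment query could look like. The function names are hypothetical, and the sketch assumes Elasticsearch 5 or later (where full-text fields use the "text" type) with comments stored at the top level of the user document, plus the comment_count field suggested above.

```python
def comments_mapping():
    """Mapping properties that declare comments as nested objects."""
    return {
        "properties": {
            "username": {"type": "keyword"},
            "bio": {"type": "text"},
            "comment_count": {"type": "integer"},
            "comments": {
                "type": "nested",  # each comment is indexed as its own hidden document
                "properties": {
                    "title": {"type": "text"},
                    "text": {"type": "text"},
                },
            },
        }
    }


def comment_query(text):
    """Nested query: find users with a comment matching text, summing the comment scores."""
    return {
        "query": {
            "nested": {
                "path": "comments",
                "score_mode": "sum",
                "query": {"match": {"comments.text": text}},
            }
        }
    }


if __name__ == "__main__":
    print(comments_mapping())
    print(comment_query("1st comment"))
```

Using score_mode "sum" is one way to let several matching comments raise a user's score; "avg" or "max" would weigh them differently.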
As Thorsten already suggested you can use nested query and it's a good approach.
Alternatively, you can index comments as children of users. Then you can search users as you do now, search comments using the top_children query to find all comments relevant to your search, and finally combine the scores from both using bool or dis_max queries.
The nested approach would be more efficient during search, but you will have to reindex the user and all comments every time an additional comment is added. With the child/parent approach you only need to index new comments, but search will be slower and will require more memory.

Resources