One large Elasticsearch lookup index, or several smaller ones? - elasticsearch

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:
users_index
{
'game_ids': [],
'movie_ids' : [],
'book_ids': []
}
or one index per lookup value, like:
user_games_index
{
'game_ids': []
}
user_movies_index
{
'movie_ids': []
}
user_books_index
{
'book_ids': []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update api docs, updating a document means retrieving the whole thing first. I will be updating each document a lot, and those arrays can become rather large (think thousands of ids). Updating a book id will then retrieve all of the game ids, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32G ram).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?

How about a third option?
You have one index and each of your document in the index looks something like this:
{
"user_id" : "some_user",
"document_type" : "movie" or "game" or "book"
"document_id" : "id of movie, game or book"
}
Why? Since you say a user's games, movies or books will be updated often, this approach lets you easily add / delete individual movies, games or books for users.
You also can easily filter the books/movies/games for specific users.
All values are of type "keyword" and filtering should be fast.
PS: A "good" mapping for an ES index will try to minimize the numbers of updates on individual documents and rather work at the level of inserting / deleting documents as ES does this task very well compared to finding & updating documents.
Edit: I have added query examples to illustrate how you can filter out results with bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
}
}
}
}
I want only movies a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
},
"filter":{
"term" : {
"document_type" : "movie"
}
}
}
}
}

Related

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
"id" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"createdAt" : 1.5519999143126902E12,
"description" : "A very not so concise description",
"geoHash" : "dnh00x6x5",
"name" : "a name",
...etc...
}
The availability index stores availability like so:
{
"eventId" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"maxGuests" : 8,
"availability" : {
"lte" : "2019-10-18T22:15:00.000Z",
"gte" : "2019-10-18T02:30:00.000Z"
}
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
"size": 5,
"from": 0,
"_source": [
"id"
],
"query": {
"bool": {
"must": [
{
"geo_distance": {
"distance": "25mi",
"geoHash": {
"lat": 34.0389,
"lon": -84.3826
}
}
}
],
"should": [],
"filter":[
{
"range" : {
"availability" : {
"gte" : "2019-10-31",
"lte" : "2020-11-01",
"relation" : "within"
}
}
}
]
}
}
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.
What you want to accomplish is an equivalent of a foreign key of SQL (join). There is no way to have exactly what you want, meaning to filter documents from index A by querying an index B. Your options are:
As you've mentioned solve it on application level (although this causes other problems for you, so it's not a solution).
Merge the data in one index and have duplicated event informatin. Although it seems expensive, the duplication of data in a NoSQL database is to be expected. If you need a relational model then maybe you should use a SQL solution.
Use parent/child (join datatype). The problem here is that you will need to have the data in the same index overall. Moreover, parent and child will be stored in the same shard as well.
One approach to this (a bit more complex though) that I believe would work for you is to use the nested datatype, which actually is a more compact approach for the solution number 2 (combine your data in one index, but save root information only once). Make events be at the root and availability appear as nested. When you want to add one availability you can use the update api, and when you query, you can search by the root fields and by the nested. If you need to retrieve specific availability entries for an event you can use inner hits
What you are trying to do (multi-index search) will not join your data automatically, it will not work. Elasticsearch doesn't work that way, and the relational model is not suited for this product.
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

Elasticsearch 6.0 Removal of mapping types - Alternatives

Background
I migrating my ES index into ES version 6. I currenly stuck because ES6 removed the using on "_type" field.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is with this structure I can easily remove all the document of specific user.
DELETE /myindex/user-123
(Delete all the document of specific user, with a single command)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the index name as USER-ID. So my index will looks like:
"user-123" -> "static-name" -> document
Delete user is done by delete index (instead of delete type in previous implementation).
Questions:
My first worry is about the amount of index and performance: Having like 1M indexes is something that acceptable in terms of performance? don't forget I have to search on them frequently.
Most of my users has small amount of documents stored in ES. Is that make sense to hold a shard, which should be expensive, for < 10 documents?
My data architecture sounds reasonable for you?
Any other tip will be welcome!
Thanks.
I would not have one index per user, it's a waste of resources, especially if there are only 10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
"actions" : [
{
"add" : {
"index" : "users",
"alias" : "user-123",
"filter" : { "term" : { "userId" : "123" } }
}
}
]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
Having these many indexes is definitely not a good approach. If your only concern to delete multiple documents with a single command. Then you can use Delete by Query API provided by ElasticSearch
You can introduce "subtype" attribute in all your document containing value for each document like "user-" value. So in your case, document would looks like.
{
"attribute1":"value",
"subtype":"user-123"
}

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to chunkify ids and use just terms query or smth else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
Filter query works much much faster as chunks with just terms query.
But making really big filter can slower getting the result a lot.
In my case, using filter query with chunks of 10 000 ids is 10 times faster, than using filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document Id is sequential or some other number form that you could easily order by, and have a field available you can do a "range query"
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}

elastic search get distinct random field values

We have elastic search document that has following fields:
{
"stockId": 1
"sellerId": 100
}
Multiple stockId can be mapped to single sellerId but one stock can only be mapped to a single dealer. There are around 10K stocks mapped to 1K sellers. But each sellerId might have different number of stocks i.e. few might have 100 while others have only 1.
Problem Statement: We want to select 'N' random documents out of all these documents indexed. The condition is that each of these 'N' document should belong to different seller i.e. distinct "sellerId". (We need to give award to these sellers).
What I have tried: I am trying to solve this by elastic query that fetches 'N' random distinct 'sellerId'. (and then elastic query to fetch 1 document of each of these 'N' sellers). One way could be to aggregate on 'sellerId' and then pick random 'N' keys but this is not desirable approach performance wise. Can someone help with better query?
I would rebuild my mapping to create a nested document type, with seller being the parent and stockid being the nested object:
{
"sellerid" : {"type" : "integer" },
"stock_obj" : {
"type" : "nested",
"properties" : {
"stockid" : { "type" : "integer" }
}
}
When you rebuild your index, you would create only one object per seller. Each seller would have all of their stock ids. It seems like there are about 10 stocks per seller, elasticsearch can handle this fine. (If there are thousands of stocks per seller, I would do this differently)
Then, I would do a search for N sellers, sorted randomly, and then as a second sort field, you would sort the stock ids randomly. Not the simplest mapping, but the query is easy and should be fast.
Also, separately, if you're just dealing with ~10k seller/stock data points that are integers, using elasticsearch is probably overkill. It can do what you want, but its main purpose is for searching large amounts of text.

Nested count queries

i'm looking to add a feature to an existing query. Basically, I run a query that returns say 1000 documents. Those documents all have the same structure, only the values of certain fields vary. What i'd like, is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many results have the same field X with the value Z etc...
Basically get all the results + 4 or 5 "counts" that would act like the SQL "group by", in a way.
The point of this is to allow full text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects etc...
Any way to do this without running additional / separate queries ?
EDIT WITH ANSWER :
Aggregations is the way to go. Here's how I did it, it's so straightforward that I expected much harder work !
{
"query": {
"term": {
"_type":"client"
}
},
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "listType.typeRef.keyword"
}
}
}
}
Note that it's even in a list of terms and not a single field, that's just how easy it was !
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.

Resources