elastic search get distinct random field values - elasticsearch

We have an Elasticsearch document with the following fields:
{
"stockId": 1,
"sellerId": 100
}
Multiple stockIds can be mapped to a single sellerId, but one stock can only be mapped to a single seller. There are around 10K stocks mapped to 1K sellers, but the number of stocks per seller varies: a few sellers have 100 while others have only 1.
Problem statement: We want to select N random documents out of all the indexed documents, with the condition that each of these N documents belongs to a different seller, i.e. has a distinct "sellerId". (We need to give an award to these sellers.)
What I have tried: I am trying to solve this with an Elasticsearch query that fetches N random distinct "sellerId" values (followed by a query to fetch 1 document for each of these N sellers). One way would be to aggregate on "sellerId" and then pick N random keys, but this is not a desirable approach performance-wise. Can someone help with a better query?

I would rebuild my mapping to create a nested document type, with seller being the parent and stockid being the nested object:
{
"sellerid" : {"type" : "integer" },
"stock_obj" : {
"type" : "nested",
"properties" : {
"stockid" : { "type" : "integer" }
}
}
}
When you rebuild your index, you would create only one document per seller, and each seller document would hold all of that seller's stock ids. There seem to be about 10 stocks per seller on average, which Elasticsearch can handle fine. (If there were thousands of stocks per seller, I would do this differently.)
Then, I would do a search for N sellers, sorted randomly, and then as a second sort field, you would sort the stock ids randomly. Not the simplest mapping, but the query is easy and should be fast.
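A minimal sketch of the random seller search, written as Python dictionaries that could be sent as the search body. It assumes the nested mapping above (one document per seller) and uses function_score with random_score for the random ordering. Picking one stock id per seller is done client-side here, which is a simplification of the "second sort field" idea rather than the exact approach described; the function names are hypothetical.

```python
import random


def random_sellers_query(n):
    """Search body returning n seller documents in random order."""
    return {
        "size": n,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                # random_score gives every document a random score;
                # hits are sorted by score by default.
                "random_score": {},
                "boost_mode": "replace",
            }
        },
    }


def pick_random_stock(seller_source, rng=random):
    """Pick one stock id from a seller's nested stock_obj list (None if empty)."""
    stocks = seller_source.get("stock_obj", [])
    return rng.choice(stocks)["stockid"] if stocks else None
```

Because each seller is a single document under this mapping, N hits are automatically N distinct sellers.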
Also, separately, if you're just dealing with ~10k seller/stock data points that are integers, using elasticsearch is probably overkill. It can do what you want, but its main purpose is for searching large amounts of text.
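For a data set this small, the selection could also be done in application memory. The sketch below assumes the documents have already been loaded as a list of dictionaries with the stockId and sellerId fields from the question; pick_award_stocks is a hypothetical helper name.

```python
import random
from collections import defaultdict


def pick_award_stocks(docs, n, rng=None):
    """Pick up to n distinct sellers at random, with one random stock each."""
    rng = rng or random.Random()
    by_seller = defaultdict(list)
    for doc in docs:
        by_seller[doc["sellerId"]].append(doc["stockId"])
    # Sample sellers, not documents, so a seller with 100 stocks is
    # no more likely to be chosen than a seller with 1 stock.
    sellers = rng.sample(sorted(by_seller), min(n, len(by_seller)))
    return [{"sellerId": s, "stockId": rng.choice(by_seller[s])} for s in sellers]
```

Sampling over seller ids rather than over documents keeps the draw fair across sellers regardless of how many stocks each one has.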

Related

Elastic Search - Sorting & Filtering on nested Documents

I am working on an E-Commerce application. Catalog Data is being served by Elastic Search.
I have Product documents which are already indexed in Elasticsearch.
Document Looks something like this (Excluded few fields for the purpose of better readability):
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
}
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
The store object in the above document is where the ES query looks for the price and decides whether the item is in stock or out of stock.
I would like to add more child objects to store (basically data from multiple inventories). This can go up to more than 150 child objects per product.
Eventually, a product document will look something like this, with data from multiple inventories mapped to one document.
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 125.0,
"_id" : "1234_112",
"product_code" : "1234",
"warehouse_code" : 112,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 105.0,
"_id" : "1234_113",
"product_code" : "1234",
"warehouse_code" : 113,
"available_unit" : 100
}
Upto N no of stores
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
Functional Requirement :
For any product, we should show the lowest price across all warehouses.
For example: if a particular product has 50 stores mapped to it, the Elasticsearch query should look into the nested objects and return the lowest value across all 50 stores where the item is available.
Performance should not be degraded.
Challenges :
If we start storing that many stores for each product, the data volume will grow considerably. Will that be a problem?
What would be an efficient way to extract the lowest price from the nested documents?
How would facets work within nested documents? For example, if I apply a price range filter, ES picks up data that was not shown earlier. (It might pick data from another store that matches the range.)
We are using a template to query ES, and the Elasticsearch version is 6.0.
Thanks in Advance!!
First, there are improvements to nested document search in version 7.x that are worth the upgrade.
As for version 6.x, there are too many factors for me to give you a concrete answer. It also seems you may be misunderstanding how nested documents work: they are not relational.
In particular, when you say that each product might have 50 stores mapped to it, that sounds like you are implying a relationship, which will not exist with a nested document. However, the values from those 50 stores would be stored within an index nested under the parent document. Having 50 stores under a product or category does not sound concerning.
Elasticsearch has not really talked in terms of facets since the introduction of the aggregation framework. It's not that they don't exist, it's just not how they are discussed.
So lets try this. ElasticSearch optimizes its search and query through a divide and conquer mechanism. The data is spread across several shards, a configurable number, and each shard is responsible for reviewing its own data. Further, those shards can be distributed across many machines so that there are many cpus and lots of memory for the search. So growing the data doesn't matter if you are willing to grow the cluster, as it is possible to maintain a situation where each machine is doing the same amount of work as it was doing before.
Unlike in a relational database, filters and search terms allow Elasticsearch to drastically reduce the data it looks at, so adding more filters tends to improve performance, whereas on a relational database performance declines.
Now back to nested documents. Each nested object is indexed as a separate hidden Lucene document, stored in the same index next to its parent, and a match on a nested document maps back to the parent document's ID. So your nested documents are not simply fields of the parent, though they are not truly separate either. That means the nested documents should have minimal impact on the performance of queries against the parent documents. But if your data size grows beyond the capacity of your current system, you will still need to increase the cluster size.
As to how you would query, you would use Elastic aggregations. These will allow you to calculate your "facet" counts and identify the best prices. The Elastic aggregations are very powerful and very fast. There are caveats that are well documented, but in general they will work as you expect.
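As a rough sketch of such an aggregation, the Python function below builds a search body that steps into the nested store objects, keeps only those with stock, and takes the minimum price. It assumes store is mapped as type nested and uses the default_price and available_unit fields from the question; the aggregation names and the function name are made up for illustration. Note that the minimum is computed over all products matched by the request, so it would need to be combined with a query for one product (or a per-product bucket) to get a per-product value.

```python
def lowest_price_agg(path="store", price_field="default_price",
                     stock_field="available_unit"):
    """Search body: minimum price over in-stock nested store objects."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "stores": {
                # step into the nested store documents
                "nested": {"path": path},
                "aggs": {
                    "in_stock": {
                        # keep only stores that have units available
                        "filter": {
                            "range": {f"{path}.{stock_field}": {"gt": 0}}
                        },
                        "aggs": {
                            "lowest_price": {
                                "min": {"field": f"{path}.{price_field}"}
                            }
                        },
                    }
                },
            }
        },
    }
```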
In version 6.x, query string queries cannot reach fields inside a nested document; a nested query must be used instead.
To recap
Functional Requirement :
For any product, we should show lowest price across all warehouse.
For EX: If a particular product has 50 store mapped to it,
ElasticSearch query should look into the nested object and get the
value which is lowest in all 50 stores if item is available.
Yes a nested aggregation will do this.
Performance should not be degraded.
Performance will continue to depend on the ratio of the size of the data to the overall cluster size.
Challenges :
If we start storing those many stores for each product, data will go considerably high. Will that be a problem ?
No this should not be a problem
What would be the efficient way to extract the lowest price from nested document?
Elastic Aggregations
How would facets work within nested document ? Like if i apply price range filter ES picks up the data which was not showed earlier. (It might pick the data from other store which matches the range)
Yes filtering can work with Aggregations very well. The aggregation will be based on the filtered data. In fact you could have an aggregation based on just minimum price, and in the same query then have an aggregation using your price ranges, which will give you the count of documents that have a store within that price range, and you could have a sub aggregation showing the stores under each price range.
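The combination described here could look roughly like the following search body, built in Python. It assumes store is mapped as nested and uses the default_price and warehouse_code fields from the question; the price boundaries, aggregation names and function name are hypothetical. The reverse_nested sub-aggregation counts parent products rather than store objects in each range.

```python
def price_facets(ranges, path="store"):
    """Search body with a minimum price and price-range buckets in one request."""
    price = f"{path}.default_price"
    return {
        "size": 0,
        "aggs": {
            "stores": {
                "nested": {"path": path},
                "aggs": {
                    "lowest_price": {"min": {"field": price}},
                    "price_ranges": {
                        "range": {"field": price, "ranges": ranges},
                        "aggs": {
                            # count products (parents), not store objects
                            "products": {"reverse_nested": {}},
                            # stores found under each price range
                            "warehouses": {
                                "terms": {"field": f"{path}.warehouse_code"}
                            },
                        },
                    },
                },
            }
        },
    }


# Example boundaries (illustrative only)
body = price_facets([{"to": 100}, {"from": 100, "to": 200}, {"from": 200}])
```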
We are using template to query ES and the Version of the Elastic Search is 6.0. Thanks in Advance!!
I know nothing about template. The ElasticSearch API is so dead simple I do not know why anyone uses additional tools on top of the API, they just add weight, and increase complexity and make key features not available because the wrapper author did not pass through the feature.

Sorting by product price considering special prices (client, group, country)

we have a shop with a few products (~ 5000).
There are, of course, category overview sites which show all products that are in the current category. A requirement is that all products can be sorted by price (ASC and DESC).
This already works, but only partially: our Elasticsearch index currently holds only the "original" price, so product discounts are not considered and the sorting is not correct.
My task now is to fix that.
But I am already struggling with how to persist the "special prices" in Elasticsearch.
The problem is every product can be discounted in general, on a customer level, on a customer group level and on a country level.
So I imagine a structure like this would be a start:
# current
{
"articleNumber": "12345",
...
"price": 9.99,
...
}
# new
{
"articleNumber": "12345",
...
"price": 9.99,
...
"special_prices": [
{
"customer": "123456",
"client_price": 5.99,
"client_group_price": null,
"country_de": null,
"country_es": null,
...
},
...
]
}
Following thoughts:
The special prices could be stored as a nested object inside the product index (but I am not sure how to do the sorting on it later)
Maybe I could create a second index with prices, then I would have two queries, but I guess that would be ok? Because I have to build a whole matrix with every customer we have (also ~5000), with every product with every possible price. But if I would have a second index then I would have to join and maybe the sorting is incorrect then
If possible, I would like to only persist any prices if a product has a special price and if not, I don't want to blow up the index
I tried something with painless to return the special price if one exists for the product and customer, but this gives me this:
...
"script": "if (doc['special_prices.customer'] != null && doc['special_prices.customer'].value == '123456') { return 12.45; } else { return doc['price']; }",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [special_prices.customer] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
...
Maybe something like SQL ORDER BY CASE WHEN would be an option?
Any ideas on how I should model and persist the special prices? And how can I achieve the sorting?
Is joining a second index a good idea?
Best regards
The error you see occurs because special_prices.customer is indexed as text (which allows full-text search) rather than as keyword. If you didn't specify a mapping explicitly, Elasticsearch most likely created a keyword sub-field for you. Try replacing special_prices.customer with special_prices.customer.keyword in your script.
The idea of using a script for sorting is good, given that you only have 5000 documents. Scripts do not have good performance, but in your case this might not matter.
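A hedged sketch of what the script sort might look like once the .keyword sub-field is used, expressed as a Python function that builds the sort clause. It mirrors the placeholder logic of the script in the question (a fixed price passed in as a parameter) and does not look up the real special price for each product. It assumes special_prices is a plain object field with a dynamically created customer.keyword sub-field, and that the cluster accepts the "source" key for inline scripts; the function name is hypothetical.

```python
def special_price_sort(customer, special_price, order="asc"):
    """Sort clause: use special_price when the product lists this customer."""
    source = (
        "def customers = doc['special_prices.customer.keyword']; "
        "if (customers.size() != 0 && customers.contains(params.customer)) "
        "{ return params.special_price; } "
        # doc['price'] is a doc-values object; .value reads the number
        "return doc['price'].value;"
    )
    return {
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": order,
                    "script": {
                        "lang": "painless",
                        "source": source,
                        "params": {
                            "customer": customer,
                            "special_price": special_price,
                        },
                    },
                }
            }
        ]
    }
```

Passing the customer as a script parameter instead of hard-coding it lets the same script text be reused across requests.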
In general this looks like a tough case, because you need some kind of joining between products and prices, and Elasticsearch is not good at joins. It has got some joining options: nested datatype, join datatype (a.k.a. parent-child), and denormalization. The last one you have already considered - when you put different prices in the original product document.
Unfortunately I can't recommend one over another, because there is no single recipe. I would try with scripts, and if performance is not good enough consider remodelling the data.
Hope that helps!

One large Elasticsearch lookup index, or several smaller ones?

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:
users_index
{
'game_ids': [],
'movie_ids' : [],
'book_ids': []
}
or one index per lookup value, like:
user_games_index
{
'game_ids': []
}
user_movies_index
{
'movie_ids': []
}
user_books_index
{
'book_ids': []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update api docs, updating a document means retrieving the whole thing first. I will be updating each document a lot, and those arrays can become rather large (think thousands of ids). Updating a book id will then retrieve all of the game ids, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32G ram).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?
How about a third option?
You have one index, and each document in the index looks something like this:
{
"user_id" : "some_user",
"document_type" : "movie",
"document_id" : "id of the movie, game or book"
}
Here document_type is one of "movie", "game" or "book".
Why? Since you say a user's games, movies or books will be updated often, this approach lets you easily add / delete individual movies, games or books for users.
You also can easily filter the books/movies/games for specific users.
All values are of type "keyword" and filtering should be fast.
PS: A "good" mapping for an ES index will try to minimize the numbers of updates on individual documents and rather work at the level of inserting / deleting documents as ES does this task very well compared to finding & updating documents.
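To make adding and removing an interaction a single-document operation, each document can be given a deterministic ID. The sketch below is one way to do that in Python; the colon-separated ID format and the interaction_doc name are assumptions for illustration, not something Elasticsearch requires.

```python
ALLOWED_TYPES = {"movie", "game", "book"}


def interaction_doc(user_id, document_type, document_id):
    """Return (doc_id, body) for one user/item interaction."""
    if document_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown document_type: {document_type!r}")
    # Deterministic ID: indexing the same interaction twice overwrites
    # the document instead of duplicating it, and a delete needs no search.
    doc_id = f"{user_id}:{document_type}:{document_id}"
    body = {
        "user_id": user_id,
        "document_type": document_type,
        "document_id": document_id,
    }
    return doc_id, body
```

With this, recording a newly read book means indexing one small document, and the large per-user arrays from the original design never have to be fetched or rewritten.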
Edit: I have added query examples to illustrate how you can filter out results with bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
}
}
}
}
I want only movies a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
},
"filter":{
"term" : {
"document_type" : "movie"
}
}
}
}
}

Search in multiple indexes in elastica

I am looking for a way to search in more than one index at the same time using Elastica.
I have an index products, and an index user.
products contains {product_id, product_name, price} and user contains {product_id, user_name, date}. The product_id is the same in both; in products each product_id is unique, but in user it is not, as a user can buy the same product multiple times.
Anyway, I want to automatically get the price of a product from the products index while searching through the user index.
I know that we can search over multiple indexes like so (correct me if I'm wrong) :
$search = new \Elastica\Search($client);
$search->addIndex('users')
->addType('user')
->addIndex('products')
->addType('product');
But the problem is, when I write an aggregation on the products_id for example and then create a new query with some filters :
$products_agg = new \Elastica\Aggregation\Terms('products_id');
$products_agg->setField('products_id')->setSize(0);
$query = new \Elastica\Query();
$query->addAggregation($products_agg);
$query->setQuery($bool);
$search->setQuery($query);
How does elastica know in which index to search? How can I link this products_id to the other index?
The Elastica library supports the Multi Search API, which allows several search requests to be executed within a single API call. The endpoint for it is _msearch.
The request format is similar to the bulk API: the first line is a header that says which index or indices to search, and the second line contains the usual search request body.
{"index" : "products", "type": "products"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10} // write your own query to get price
{"index" : "users", "type" : "user"}
{"query" : {"match_all" : {}}} // query for user
Check test case in Multi/SearchTest.php to see how to use.
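To illustrate the header/body line format independently of Elastica, here is a small Python sketch that serialises a list of searches into the newline-delimited body that _msearch expects. The build_msearch_body name is hypothetical, and the index names are the ones from the question.

```python
import json


def build_msearch_body(searches):
    """Serialise (header, body) pairs as newline-delimited JSON."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))  # which index to search
        lines.append(json.dumps(body))    # the search request itself
    # The multi search body must end with a newline.
    return "\n".join(lines) + "\n"


payload = build_msearch_body([
    ({"index": "products"}, {"query": {"match_all": {}}, "from": 0, "size": 10}),
    ({"index": "users"}, {"query": {"match_all": {}}}),
])
```

Each search still runs against its own index and returns its own result set, so this batches the two requests but does not join them.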
Basically, you want to join two indexes on a common field, as in SQL.
What you can do is model your data in the same index using the join datatype:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html
Index all documents in the same index.
Make all product documents parents.
Make all user documents children.
Then use parent-child aggregations and queries:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_queries_and_aggregations
NOTE: make sure of the performance implication of parent-child mapping
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_and_performance
One more thing you can do is store all the product information with every user document that buys it.
This wastes some space and goes against normalisation rules for relational data.
But since this is a search engine, Elasticsearch's guidance is to denormalise and duplicate data rather than use parent-child.
you can try the following:
1- naming indexes with specific name like the following
myFirstIndex-myProjectName
mySecIndex-myProjectName
myThirdIndex-myProjectName
and so on.
2- That gives me the ability to use * in the index name when searching, because it accepts wildcards, so I can search across multiple indices like this using Kibana Dev Tools
GET *-myProjectName/_search
{
"_source": {
"excludes": [ "*" ]
},
"query": { "match_all": {} }
}
This will search every index whose name ends with -myProjectName.
You can't join two indices in a single query. The best way to solve your problem is to do two queries (an application-side join): with the first query you run the aggregation on the user index, and with the second you get the prices.
Another option would be to add the price to the user index. Sometimes you have to sacrifice a little space for better usability.
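The application-side join could be sketched in Python as follows. It assumes the first query returned terms-aggregation buckets keyed by product_id, and that the second query returns hits from the products index whose _source contains product_id and price; both function names are hypothetical.

```python
def price_lookup_query(product_ids):
    """Second query: fetch prices for the product ids found in the first query."""
    ids = sorted(set(product_ids))
    return {
        "size": len(ids),
        "_source": ["product_id", "price"],
        "query": {"terms": {"product_id": ids}},
    }


def join_prices(buckets, product_hits):
    """Attach a price to every aggregation bucket (None if no product matched)."""
    prices = {
        hit["_source"]["product_id"]: hit["_source"]["price"]
        for hit in product_hits
    }
    return [
        {
            "product_id": bucket["key"],
            "purchases": bucket["doc_count"],
            "price": prices.get(bucket["key"]),
        }
        for bucket in buckets
    ]
```

The join itself is just a dictionary lookup, so its cost is small next to the two round trips to the cluster.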

Automatically indexing by a field name as desc

I have an index of type book story to which some books are added every week.
I always query this index sorted by one field (in this case "price") in descending order, so the sort adds some overhead on ES (because of the data volume).
In this service we always show books to the user from maximum to minimum price.
Is it possible, automatically or manually, to keep the documents of the book type always sorted by price descending in the index, so that queries return them in that order without having to specify:
"sort" : { "price" : { "order" : "desc" } }
No, you cannot keep your data ordered by a field. Elasticsearch stores the data internally as Lucene segments. Take a look here to better understand the internal structure of ES: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
