Restructure Elasticsearch index to allow filtering on sum of values

I have an index of products.
Each product has several variants (anywhere from a few to hundreds), each with a color and a size, e.g. Red / S.
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes e.g. AB, XY, CD etc.
If I had my choice, I'd index it as:
stock: {
  Red: {
    S: { AB: 100, XY: 200, CD: 20 },
    M: { AB: 0, XY: 500, CD: 20 },
    2XL: { AB: 5, XY: 0, CD: 9 }
  },
  Blue: {
    ...
  }
}
Here's a kind of customer query I might receive:
Show me all products that have the Red.S variant in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
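As a rough sketch (assuming that mapping were feasible, with the stock object above flattened into the mapping; the index name products is illustrative), such a script filter might look like:
GET products/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['stock.Red.S.AB'].value + doc['stock.Red.S.XY'].value > 500"
          }
        }
      }
    }
  }
}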
The problem is, given 100 warehouses, 100 sizes, and 25 colors, this easily needs 100 * 100 * 25 = 250k mappings. Elasticsearch simply can't handle that many keys.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to solutions outside Elasticsearch as well. We're on a Rails/Postgres stack.

You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "XY",
"quantity": 200
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "CD",
"quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field: sums, averages, or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
It's also much easier to maintain the stock movements by simply indexing the new quantity for any given combination than having to update the master product document every time an item goes out of stock.
No script, no nested documents required.
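For instance, the original sum query (Red / S at AB & XY with a total > 500) could be sketched roughly as follows against the stock documents above (the index name stock is an assumption, color/size/warehouse are assumed to be keyword fields, and the threshold check uses a bucket_selector, a close relative of bucket_script, which does involve a tiny script):
GET stock/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "Red" } },
        { "term": { "size": "S" } },
        { "terms": { "warehouse": ["AB", "XY"] } }
      ]
    }
  },
  "aggs": {
    "per_product": {
      "terms": { "field": "product" },
      "aggs": {
        "total_quantity": { "sum": { "field": "quantity" } },
        "enough_stock": {
          "bucket_selector": {
            "buckets_path": { "total": "total_quantity" },
            "script": "params.total > 500"
          }
        }
      }
    }
  }
}
The matching products then come back as the surviving per_product buckets rather than as hits.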

The best possible solution would be to create separate indexes for the warehouses, and each warehouse index will hold documents: one document per product/size/color/warehouse with the related values, like this:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
This will reduce your mappings to 100 sizes * 25 colors = 2,500 mappings per index.
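A rough sketch of what indexing into such a per-warehouse index could look like (the index name warehouse_ab and the document id scheme are illustrative assumptions):
PUT warehouse_ab/_doc/123-red-s
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "AB",
  "quantity": 100
}
A stock movement then becomes a single overwrite of that one small document with the new quantity.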
As for the other operations, I feel @Val has covered them in his answer, which is quite impressive.
Coming to external solutions, I would say your task is to store data, search it, and fetch it. Elasticsearch and Apache Solr are the best search engines for carrying out these kinds of tasks. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support, and fast searching. Searching can also be made faster using analyzers and tokenizers. It also has features like full-text search and term-level search to customize searching to the situation or problem statement.

Related

Maps vs Lists in Elasticsearch for optimized query performance

I have some data I will be putting into Elasticsearch, and want to decide on a format that will optimize query performance. The query will be in words: "Is ID X in category Y?". I have a fixed number of categories (small, say, 5), and possibly a large number of IDs to put into each category (currently in the dozens, but of indeterminate size in the future). Each ID will be in at most one category (possibly none).
Format 1:
{
  "field1": "value1",
  ...
  "categories": {
    "category1": ["id10", "id24", "id38", ...],
    ...
    "category5": ["id62", "id19", "id82", ...]
  }
}
or
Format 2:
{
  "field1": "value1",
  ...
  "categories": {
    "id1": "category4",
    "id2": "category2",
    "id3": "category1",
    ...
  }
}
Which data format would be preferred? The latter format has linear lookup time, but possibly many keys.
I think format 1 is better. The number of IDs will grow in the future, and with format 2 you may eventually have to stop indexing the categories object or raise the index's field limit, whereas with format 1 it is still convenient to determine the category of a single ID (an indexOf-style membership check). There are pros and cons either way; maybe there's a better approach.
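For what it's worth, with format 1 the "Is ID X in category Y?" check stays a single term query on the array field, and no new mapping fields appear as IDs are added. A minimal sketch, assuming categories.category1 is mapped as keyword and the index is named my_index:
GET my_index/_search
{
  "query": {
    "term": { "categories.category1": "id24" }
  }
}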

Elasticsearch filter on aggregation result (for search and aggregation)

Part of this question is related to : Elasticsearch filter on aggregation
Context
Let's say my Elasticsearch index contains some orders. Each order has one field price and one field amount. This results in an index that looks like this:
[
{
"docKey": "order01",
"user": "1",
"price": 8,
"amount": 20
},
{
"docKey": "order02",
"user": "1",
"price": 14,
"amount": 3
},
{
"docKey": "order03",
"user": "2",
"price": 5,
"amount": 1
},
{
"docKey": "order04",
"user": "2",
"price": 10,
"amount": 3
}
]
What I would like to do
What I want to do is filter on some values aggregated per user. I want to use this kind of filter for search and also in order to apply further aggregations on it. For example, here I would like to retrieve the documents of all users whose average order price is in the range 9-14.
User 1 has an average order price of 11, so we keep both of his orders.
User 2 has an average order price of 7.5, so neither of his orders is kept.
That was the easy part. After filtering to keep only user 1, I want to run some more aggregations on the result.
So for example: I want the distribution of the per-user average of the amount field among the buckets [0,10] and [10,20], for all users whose average order price is in the range 9-14.
The answer I expect for this question is 0 in the bucket [0,10] and 1 in the bucket [10,20] (only user 1 is kept because of his average price; his average amount is 11.5, so he falls in the bucket [10,20]).
What I have tried
I have managed to do my filter in order to retrieve the users whose average order price is in the range 9-14. I did this by first doing a terms aggregation on the user field, then a sub-aggregation that is an avg aggregation on the price, then a bucket_selector pipeline aggregation that checks whether the previously computed average price is between 9 and 14.
I have also managed to do the aggregation I wanted, but without the previous filter: I did exactly the same thing as for the filter, for each range, and then counted the number of results in each bucket.
I haven't found any way to apply another aggregation on a bucket_selector's result, so I could not first do the filter and then apply the range...
Also, these solutions are not elegant. I don't think they will scale, since a large part of the documents needs to be returned in the response and processed further (even if it stays off the network I'd prefer to avoid doing this, and I might be limited by the result size of an aggregation?).
I managed to find a solution, but it's not elegant and might scale poorly (a query sketch follows these steps):
Make a terms aggregation on the user field.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the price.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the amount.
Do a bucket_selector pipeline aggregation that filters to only keep avg_price in the range [9,14].
Do a bucket_selector pipeline aggregation that filters to only keep avg_amount in the range [0,10].
Do a "count" bucket_script pipeline aggregation (with a script returning 1).
Do a sum_bucket pipeline aggregation that sums the counts.
Repeat all the steps for every wanted range ([0,10], [10,20]).
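A sketch of that aggregation tree for the [0,10] amount range (the index name orders is an assumption, user is assumed to be mapped as keyword, and price and amount are the fields from the sample documents):
GET orders/_search
{
  "size": 0,
  "aggs": {
    "per_user": {
      "terms": { "field": "user" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } },
        "avg_amount": { "avg": { "field": "amount" } },
        "price_in_range": {
          "bucket_selector": {
            "buckets_path": { "p": "avg_price" },
            "script": "params.p >= 9 && params.p <= 14"
          }
        },
        "amount_in_range": {
          "bucket_selector": {
            "buckets_path": { "a": "avg_amount" },
            "script": "params.a >= 0 && params.a <= 10"
          }
        },
        "count": {
          "bucket_script": {
            "buckets_path": { "c": "_count" },
            "script": "1"
          }
        }
      }
    },
    "users_in_range": {
      "sum_bucket": { "buckets_path": "per_user>count" }
    }
  }
}
The same request is then repeated with the amount condition changed to [10,20].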

Elasticsearch filter vs term query for many ids

I have an index of documents connected to some product_id, and I would like to find all documents for specific ids (around 100,000 product_ids to look up, out of about 100 million documents in the index in total).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to split the ids into chunks and just use terms queries, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A filter query works much, much faster than chunks with a plain terms query.
But making a really big filter can slow down getting the result a lot.
In my case, using the filter query with chunks of 10,000 ids is 10 times faster than using the filter query with all 100,000 ids at once (by the way, that number of terms is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to take into account is that the filter query results are stored in the cache. (The cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
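For completeness, the terms lookup mechanism from the docs quote above stores the id list once in a lookup document and references it from the query, instead of sending 100,000 values with every request. A rough sketch (the wanted_products index, the document id list-1, and the product_ids field are illustrative assumptions):
GET my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "product_id": {
            "index": "wanted_products",
            "id": "list-1",
            "path": "product_ids"
          }
        }
      }
    }
  }
}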
You can use the "paging" or "scrolling" features of Elasticsearch queries for very large result sets.
Use a "from / size" query: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "from / size" is the more efficient way to go unless you want to return thousands of results each time (which could be many, many MB of data, so you probably don't want that).
Edit:
You can run queries like this in batches:
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
    }
  }
}
If your document id is sequential, or in some other numeric form that you can easily order by, and you have such a field available, you can do a "range query":
GET _search
{
  "query": {
    "range" : {
      "document_id_that_is_a_number" : {
        "gte" : 0, // bump this on each query by "lte" step factor
        "lte" : 10000 // find a good number here
      }
    }
  }
}

Add custom comparatorClass class in Solr

I am a newbie in Solr. I want to add a custom comparatorClass in Solr. I also need to use the fields term and count, which I have defined in my schema.xml, in my custom class.
Structure of indexing document :
"docs": [
{
"count": 98,
"term": "age",
},
{
"count": 6,
"term": "age assan",
},
{
"count": 5,
"term": "age but",
},
{
"count": 10,
"term": "age salman",
}]
I have stored n-grams with a term and its count, but Solr computes its own frequency, which I don't need. I want my own count, which I have defined for each term, to be used: I want to sort by that frequency (count) and then by edit distance. Do I need to implement this by creating my own class as a comparator class, or is there something else that can help? Please share.
How can I do this? Any help, please.
Thanks.
You should be able to do this without implementing a custom similarity class. The first requirement is (from your description) a straightforward sort on the count value, while the latter can be implemented by sorting on the value from the strdist() function. You can also multiply or weight these values against each other in a single sort statement by using several functions.
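A rough sketch of that sort-only approach (assuming the fields are named count and term as in the schema above, that term is single-valued, and that the user's input is the literal "age"):
q=age&sort=count desc, strdist("age", term, edit) desc
strdist() returns a similarity between 0 and 1 (1 means identical), so sorting it in descending order puts the closest terms first once the count ordering has been applied.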
If you really, really need to build your own scorer (which I don't think you need to do, from your description): these are usually written to explore ranking algorithms other than tf-idf, BM25, etc. for larger corpora, and a search on Google gives you many resources with pre-made, easy-to-adopt solutions. I particularly want to point out "This is the Nuclear Option" in Build Your Own Custom Lucene Query and Scorer:
Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths [...]

Elasticsearch: choose TOP N documents and apply query

I'm sorry, I'm not good at English; please bear with me.
Let's assume I have such data:
title category price
book1 study 10
book2 cook 20
book3 study 30
book4 study 40
book5 art 50
I can do "search books in 'study' category and sort them by price-descending order". Result would be:
book4 - book3 - book1
However, I couldn't find a way to do
"search books in 'study' category AMONG the books of TOP 40% in price".
(I hope 'TOP 40% in price' is the correct expression.)
In this case, the result should be "book4" only, because the "category search" would be performed only on book5 and book4.
At first, I thought I could do it by:
1. sort all documents by price
2. select TOP 40%
3. post another query for category search among them
But I still have no idea how I can run a query against "part of the documents" rather than all documents. After step 2, I'd have a list of documents in the TOP 40%, but how can I make a query that applies to just them?
I realized that I don't even know how to "search the TOP n%" in Elasticsearch. Is there a way that is better than "sort all and select the first n%"?
Any advice would be appreciated.
And this is my first question on Stack Overflow. If my question violates any rule here, please tell me so that I know, and I apologize.
If your data is normally distributed, or follows some other statistical distribution that you can make sense of, you can probably do this in two queries.
You can take a look at the data in histogram form by doing:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "histogram": {
        "field": "price",
        "interval": 100
      }
    }
  }
}
I usually take this data into a spreadsheet to chart it and do other statistical analysis on it. "interval" above will need to be some reasonable value, 100 might not be the right fit.
This is just to decide how to code the intermediate step. Provided the data is normally distributed, you can then get statistical information about the collection using this query:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "statistical": {
        "field": "price"
      }
    }
  }
}
The above gives you an output that looks like this:
count: 819517
total: 24249527030
min: 32
max: 53352
mean: 29590.023184387876
sum_of_squares: 875494716806082
variance: 192736269.99554798
std_deviation: 13882.94889407679
(The above is not based on your data sample, but on a sample of data I have available, just to demonstrate statistical facet usage.)
So now that you know all of that, you can start applying your knowledge of statistics to the problem at hand. That is, find the Z score at the 60th percentile and find the location of the representative data point based on that.
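For example, with the sample statistics above and assuming a normal distribution, the z-score for the 60th percentile is about 0.2533, so the cutoff sits roughly at mean + 0.2533 * std_deviation ≈ 29590 + 0.2533 * 13883 ≈ 33107; documents with a price at or above that value form the top 40%.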
Your final query then looks something like this:
{
  "query": {
    "range": {
      "talent_profile": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}
The lte is going to come from the "max" in the stats facet, and the gte from your intermediate analysis.
