Elastic Search - Sorting & Filtering on nested Documents - elasticsearch

I am working on an E-Commerce application. Catalog Data is being served by Elastic Search.
I have document's for Product which is already indexed in Elastic Search.
Document Looks something like this (Excluded few fields for the purpose of better readability):
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"#timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
}
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
store object in the above document is the one where ES Query will look for price and to decide if item is in stock or Out Of Stock.
I would like to add more child objects to store (Basically data from multiple inventory). This can go up to more than 150 child objects for each product.
Eventually, A product document will look something like this with multiple inventory's data mapped to a particular document.
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"#timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 125.0,
"_id" : "1234_112",
"product_code" : "1234",
"warehouse_code" : 112,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 105.0,
"_id" : "1234_113",
"product_code" : "1234",
"warehouse_code" : 113,
"available_unit" : 100
}
Upto N no of stores
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
Functional Requirement :
For any product, we should show lowest price across all warehouse.
For EX: If a particular product has 50 store mapped to it, Elastic Search query should look into the nested object and get the value which is lowest in all 50 stores if item is available.
Performance should not be degraded.
Challenges :
If we start storing those many stores for each product, data will go considerably high. Will that be a problem ?
What would be the efficient way to extract the lowest price from nested document?
How would facets work within nested document ? Like if i apply price range filter ES picks up the data which was not showed earlier. (It might pick the data from other store which matches the range)
We are using template to query ES and the Version of the Elastic Search is 6.0.
Thanks in Advance!!

First there are improvements to nested document search in version 7.x that are worth the upgrade.
As for version 6.x, there are a lot of factors there that I could not give you a concrete answer. It also seems you may not be understanding the way that nested documents work, they are not relational.
In particular when you say that each product might have 50 stores mapped to it that sounds like you are implying a relationship, which will not exist with a nested document. However, the values from those 50 stores would be stored within an index nested under the parent document. Having 50 stores under a product or category does not sound concerning.
ElasticSearch has not really talked in terms of facets since the introduction of the aggregation framework. Its not that they dont exist, just not how they are discussed.
So lets try this. ElasticSearch optimizes its search and query through a divide and conquer mechanism. The data is spread across several shards, a configurable number, and each shard is responsible for reviewing its own data. Further, those shards can be distributed across many machines so that there are many cpus and lots of memory for the search. So growing the data doesn't matter if you are willing to grow the cluster, as it is possible to maintain a situation where each machine is doing the same amount of work as it was doing before.
Unlike a relational database, filters search terms allow Elastic to drastically reduce the data that it is looking at and a larger number of filters will improve performance where on a relational database performance declines.
Now back to nested documents. They are stored as a separate index, but instead of mapping the results to the nested doc, the results map to the parent doc id. So you're nested docs arent exactly in the same index as the rest of the document, though they are not truly separate either. But that does mean that the nested documents should have minimal impact the performance of the queries against the parent documents. But if your data size grows beyond the capacity of your current system you will still need to increase its size.
As to how you would query, you would use Elastic aggregations. These will allow you to calculate your "facet" counts and identify the best prices. The Elastic aggregations are very powerful and very fast. There are caveats that are well documented, but in general they will work as you expect.
In version 6.x query string queries cannot access the search criteria in a nested document, and a complex query must be used.
To recap
Functional Requirement :
For any product, we should show lowest price across all warehouse.
For EX: If a particular product has 50 store mapped to it,
ElasticSearch query should look into the nested object and get the
value which is lowest in all 50 stores if item is available.
Yes a nested aggregation will do this.
Performance should not be degraded.
Performance will continue to depend on the ratio of the size of the data to the overall cluster size.
Challenges :
If we start storing those many stores for each product, data will go considerably high. Will that be a problem ?
No this should not be a problem
What would be the efficient way to extract the lowest price from nested document?
Elastic Aggregations
How would facets work within nested document ? Like if i apply price range filter ES picks up the data which was not showed earlier. (It might pick the data from other store which matches the range)
Yes filtering can work with Aggregations very well. The aggregation will be based on the filtered data. In fact you could have an aggregation based on just minimum price, and in the same query then have an aggregation using your price ranges, which will give you the count of documents that have a store within that price range, and you could have a sub aggregation showing the stores under each price range.
We are using template to query ES and the Version of the Elastic Search is 6.0. Thanks in Advance!!
I know nothing about template. The ElasticSearch API is so dead simple I do not know why anyone uses additional tools on top of the API, they just add weight, and increase complexity and make key features not available because the wrapper author did not pass through the feature.

Related

Finding all words and their frequencies in an elasticsearch index

Elasticsearch Newbie here. I have an elasticsearch cluster and an index http://localhost:9200/products and each product looks like this:
{
"name": "laptop",
"description" : "Intel Laptop with 16 GB RAM",
"title" : "...."
}
I wanted all keywords in a field and their frequencies across all documents for an index. For eg.
description : intel -> 2500, laptop -> 40000 etc. I looked at termvectors https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but that only let's me do it across a single document. I want it across all documents in a particular field.
I wrote a plug-in for this ..but its expensive call ( based on how many terms you want to get and cardinality of terms ) https://github.com/nirmalc/es-termstat
Currently, there is no way to use term vectors on all documents at a time in an index. You can either use single term vector API for single document's term frequency count or multi-term vectors API to multiple document's term frequency. But a possible workaround could be like this -
make a scan request in order to get all documents from a given type,
and for each page to build a multi-term vector mentioned above to
request to get term vectors.
POST /products/_mtermvectors
{
"ids" : ["1", "2"],
"parameters": {
"fields": [
"description"
],
"term_statistics": true
}
}

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
"id" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"createdAt" : 1.5519999143126902E12,
"description" : "A very not so concise description",
"geoHash" : "dnh00x6x5",
"name" : "a name",
...etc...
}
The availability index stores availability like so:
{
"eventId" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"maxGuests" : 8,
"availability" : {
"lte" : "2019-10-18T22:15:00.000Z",
"gte" : "2019-10-18T02:30:00.000Z"
}
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
"size": 5,
"from": 0,
"_source": [
"id"
],
"query": {
"bool": {
"must": [
{
"geo_distance": {
"distance": "25mi",
"geoHash": {
"lat": 34.0389,
"lon": -84.3826
}
}
}
],
"should": [],
"filter":[
{
"range" : {
"availability" : {
"gte" : "2019-10-31",
"lte" : "2020-11-01",
"relation" : "within"
}
}
}
]
}
}
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.
What you want to accomplish is an equivalent of a foreign key of SQL (join). There is no way to have exactly what you want, meaning to filter documents from index A by querying an index B. Your options are:
As you've mentioned solve it on application level (although this causes other problems for you, so it's not a solution).
Merge the data in one index and have duplicated event informatin. Although it seems expensive, the duplication of data in a NoSQL database is to be expected. If you need a relational model then maybe you should use a SQL solution.
Use parent/child (join datatype). The problem here is that you will need to have the data in the same index overall. Moreover, parent and child will be stored in the same shard as well.
One approach to this (a bit more complex though) that I believe would work for you is to use the nested datatype, which actually is a more compact approach for the solution number 2 (combine your data in one index, but save root information only once). Make events be at the root and availability appear as nested. When you want to add one availability you can use the update api, and when you query, you can search by the root fields and by the nested. If you need to retrieve specific availability entries for an event you can use inner hits
What you are trying to do (multi-index search) will not join your data automatically, it will not work. Elasticsearch doesn't work that way, and the relational model is not suited for this product.
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

Search in multiple indexes in elastica

I am looking for a way to search in more than one index at the same time using Elastica.
I have an index products, and an index user.
products contains {product_id, product_name, price} and user contains {product_id, user_name, date}. Knowing that the product_id in both of them is the same, in products each products_id is unique but in user they're not as a user can buy the same product multiple times.
Anyway, I want to automatically get the price of a product from the products index while searching through the user index.
I know that we can search over multiple indexes like so (correct me if I'm wrong) :
$search = new \Elastica\Search($client);
$search->addIndex('users')
->addType('user')
->addIndex('products')
->addType('product');
But the problem is, when I write an aggregation on the products_id for example and then create a new query with some filters :
$products_agg = new \Elastica\Aggregation\Terms('products_id');
$products_agg->setField('products_id')->setSize(0);
$query = new \Elastica\Query();
$query->addAggregation($products_agg);
$query->setQuery($bool);
$search->setQuery($query);
How does elastica know in which index to search? How can I link this products_id to the other index?
The Elastica library has support for Multi Search API, The multi search API allows to execute several search requests within the same API. The endpoint for it is _msearch.
The format of the requests is similar to the bulk API, The first line
is header part that includes which index / indices to search on, The second line includes the typical search body requests.
{"index" : "products", "type": "products"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10} // write your own query to get price
{"index" : "uesrs", "type" : "user"}
{"query" : {"match_all" : {}}} // query for user
Check test case in Multi/SearchTest.php to see how to use.
Basically you want to join two indexes based on a common field as in sql.
What you can do is model you data in the same index using join datatype
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html
Index all documents in the same index ,
Make all product documents - parent.
Make all user documents as child
And the use parent-child aggregations and queries
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_queries_and_aggregations
NOTE: make sure of the performance implication of parent-child mapping
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_and_performance
One more thing you can do is put all the information of the product with every user that buys it.
But this can unnecessarily waste you space and is not a good practice as per data rules are concerned.
But since this is a search engine and elasticsearch suggests that best is to normalise and duplicate data rather that using parent-child.
you can try the following:
1- naming indexes with specific name like the following
myFirstIndex-myProjectName
mySecIndex-myProjectName
myThirdIndex-myProjectName
and so on.
2- that's give me the ability using * in the field of indexes to search because it accepts wildcard so i can search across multiple fields like this using kibana Dev Tools
GET *-myProjectName/_search
{
"_source": {
"excludes": [ "*" ]
},
"query": { "match_all": {} },
}
this will search on each index includes -myProjectName.
You can't query two indices with different mappings. Best way to solve your problem is to just do two queries (application-side joins). First query you do the aggregations on the user and the second you get the prices.
Another option would be to add the price to the user index. Sometimes you have to sacrifice a little space for better usability.

elastic search get distinct random field values

We have elastic search document that has following fields:
{
"stockId": 1
"sellerId": 100
}
Multiple stockId can be mapped to single sellerId but one stock can only be mapped to a single dealer. There are around 10K stocks mapped to 1K sellers. But each sellerId might have different number of stocks i.e. few might have 100 while others have only 1.
Problem Statement: We want to select 'N' random documents out of all these documents indexed. The condition is that each of these 'N' document should belong to different seller i.e. distinct "sellerId". (We need to give award to these sellers).
What I have tried: I am trying to solve this by elastic query that fetches 'N' random distinct 'sellerId'. (and then elastic query to fetch 1 document of each of these 'N' sellers). One way could be to aggregate on 'sellerId' and then pick random 'N' keys but this is not desirable approach performance wise. Can someone help with better query?
I would rebuild my mapping to create a nested document type, with seller being the parent and stockid being the nested object:
{
"sellerid" : {"type" : "integer" },
"stock_obj" : {
"type" : "nested",
"properties" : {
"stockid" : { "type" : "integer" }
}
}
When you rebuild your index, you would create only one object per seller. Each seller would have all of their stock ids. It seems like there are about 10 stocks per seller, elasticsearch can handle this fine. (If there are thousands of stocks per seller, I would do this differently)
Then, I would do a search for N sellers, sorted randomly, and then as a second sort field, you would sort the stock ids randomly. Not the simplest mapping, but the query is easy and should be fast.
Also, separately, if you're just dealing with ~10k seller/stock data points that are integers, using elasticsearch is probably overkill. It can do what you want, but its main purpose is for searching large amounts of text.

Automatically indexing by a field name as desc

i have index type of book story that every week wants to put some books.
in this index i want to have always query by sorting a field name(in this case is "price" ) as desc so it's have some overhead on ES (cause of data volume)
in this service we always shows to user books by maximum to minimum price
is possible to have this feature automatically or manually for sorting document of book type in index always by price as desc and then when to want to query them it's always sorted by price as desc and dont need to give it by:
"sort" : { "price" { "order" : "desc" } }
No, you can not keep your data ordered based on a field. Elasticsearch keeps the data as Lucene segments inside. Take a look here to better understand internal structure of ES: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

Resources