Handle Frequently updated geo data in Elastic Search - elasticsearch

I have the location(geo points) data updated every 5 minutes for millions of users. We have to search users with specific attributes(age/interests/languages) & in particular geo range. Wanted to understand the right strategy to store such data in Elastic.
Option1
Create user document with following keys
user Metadata & attributes (age, interests, languages, salary etc around 8-10 searchable attributes)
Live location (changing every few minutes)
"liveLocation" : {
"type" : "Point",
"coordinates" : [-72.333077, 30.856567]
}
location data - multiple addresses - home address, work address etc along with geo points. (not updated frequently) -
"addresses" :
[
{
"type" : "home",
"address" : "first floor, xyz, near landmark",
"city" : "Newyork",
"country" : "Country",
"zipcode" : "US1029",
"location" : {
"type" : "Point",
"coordinates" : [-73.856077, 40.848447]
},
{
... more atype of addresses
}
]
We want to perform geo search queries over all the geo type fields. My worry - live location for users will be updated quite frequently. Is this a viable approach ?
Option2
Treat every location update as a time series data and insert a new document. This will avoid updating the documents. instead will insert new documents for each user every few minutes.
Problems -
While searching all the users(home/office/live location) in a particular geo range, I have to consider only the last updated documents for each user . How to do that in elastic ?
We have to search users with specific attributes(age/interests/language) & in particular geo range. If option2 is preferable should user attribute-metadata & location updates be treated as parent-child relationship or some other approach?
In Conclusion - What should be the right approach .

Related

Elastic Search - Sorting & Filtering on nested Documents

I am working on an E-Commerce application. Catalog Data is being served by Elastic Search.
I have document's for Product which is already indexed in Elastic Search.
Document Looks something like this (Excluded few fields for the purpose of better readability):
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"#timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
}
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
store object in the above document is the one where ES Query will look for price and to decide if item is in stock or Out Of Stock.
I would like to add more child objects to store (Basically data from multiple inventory). This can go up to more than 150 child objects for each product.
Eventually, A product document will look something like this with multiple inventory's data mapped to a particular document.
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"#timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 125.0,
"_id" : "1234_112",
"product_code" : "1234",
"warehouse_code" : 112,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 105.0,
"_id" : "1234_113",
"product_code" : "1234",
"warehouse_code" : 113,
"available_unit" : 100
}
Upto N no of stores
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
Functional Requirement :
For any product, we should show lowest price across all warehouse.
For EX: If a particular product has 50 store mapped to it, Elastic Search query should look into the nested object and get the value which is lowest in all 50 stores if item is available.
Performance should not be degraded.
Challenges :
If we start storing those many stores for each product, data will go considerably high. Will that be a problem ?
What would be the efficient way to extract the lowest price from nested document?
How would facets work within nested document ? Like if i apply price range filter ES picks up the data which was not showed earlier. (It might pick the data from other store which matches the range)
We are using template to query ES and the Version of the Elastic Search is 6.0.
Thanks in Advance!!
First there are improvements to nested document search in version 7.x that are worth the upgrade.
As for version 6.x, there are a lot of factors there that I could not give you a concrete answer. It also seems you may not be understanding the way that nested documents work, they are not relational.
In particular when you say that each product might have 50 stores mapped to it that sounds like you are implying a relationship, which will not exist with a nested document. However, the values from those 50 stores would be stored within an index nested under the parent document. Having 50 stores under a product or category does not sound concerning.
ElasticSearch has not really talked in terms of facets since the introduction of the aggregation framework. Its not that they dont exist, just not how they are discussed.
So lets try this. ElasticSearch optimizes its search and query through a divide and conquer mechanism. The data is spread across several shards, a configurable number, and each shard is responsible for reviewing its own data. Further, those shards can be distributed across many machines so that there are many cpus and lots of memory for the search. So growing the data doesn't matter if you are willing to grow the cluster, as it is possible to maintain a situation where each machine is doing the same amount of work as it was doing before.
Unlike a relational database, filters search terms allow Elastic to drastically reduce the data that it is looking at and a larger number of filters will improve performance where on a relational database performance declines.
Now back to nested documents. They are stored as a separate index, but instead of mapping the results to the nested doc, the results map to the parent doc id. So you're nested docs arent exactly in the same index as the rest of the document, though they are not truly separate either. But that does mean that the nested documents should have minimal impact the performance of the queries against the parent documents. But if your data size grows beyond the capacity of your current system you will still need to increase its size.
As to how you would query, you would use Elastic aggregations. These will allow you to calculate your "facet" counts and identify the best prices. The Elastic aggregations are very powerful and very fast. There are caveats that are well documented, but in general they will work as you expect.
In version 6.x query string queries cannot access the search criteria in a nested document, and a complex query must be used.
To recap
Functional Requirement :
For any product, we should show lowest price across all warehouse.
For EX: If a particular product has 50 store mapped to it,
ElasticSearch query should look into the nested object and get the
value which is lowest in all 50 stores if item is available.
Yes a nested aggregation will do this.
Performance should not be degraded.
Performance will continue to depend on the ratio of the size of the data to the overall cluster size.
Challenges :
If we start storing those many stores for each product, data will go considerably high. Will that be a problem ?
No this should not be a problem
What would be the efficient way to extract the lowest price from nested document?
Elastic Aggregations
How would facets work within nested document ? Like if i apply price range filter ES picks up the data which was not showed earlier. (It might pick the data from other store which matches the range)
Yes filtering can work with Aggregations very well. The aggregation will be based on the filtered data. In fact you could have an aggregation based on just minimum price, and in the same query then have an aggregation using your price ranges, which will give you the count of documents that have a store within that price range, and you could have a sub aggregation showing the stores under each price range.
We are using template to query ES and the Version of the Elastic Search is 6.0. Thanks in Advance!!
I know nothing about template. The ElasticSearch API is so dead simple I do not know why anyone uses additional tools on top of the API, they just add weight, and increase complexity and make key features not available because the wrapper author did not pass through the feature.

Index DynamoDB streams to elastic search

I have a requirement for implementing following entities in a DynamoDB table
I have stored these entities in DynamoDB as below.
Partition Key : PROJ#ProjectId:CountryId
Sort Key : Project Name
Company : company data as JSON document
Since this is a one to many relationship, N number of projects of the same company will create N number of project records and same company details will be stored in their Company attribute. The reason for doing this is, the most critical data access point is via ProjectId and CountryId (Assume that I can't change this DB design)
I have a requirement to implement a search functionality which supports filter table using company name, address, project name, country etc (using a single filter or any combination of these filters). I'm using DynamoDB streams to feed elastic search cluster and update any creation, deleting or update of the details there and use elastic search API to query data.
But I need to index these data in following format, so that when I receive the details from elastic search, data will not be duplicated
{
"id" : 1
"name" : "ABC",
"description" : "description",
"address" : "address",
"projects" : [
{
"id" : 10,
"name" : "project 1",
"countryId" : 10
},
{
"id" : 20,
"name" : "project 1",
"countryId" : 10
}
]
}
At the record creation time, since Project records are creating as single records, is there any recommended or standard way that I can grab all the Project records of Company and create the above json document and index it in elastic search?
This is how I would approach it :
In elastic the document id will be the companyID
What you can do is create a lambda that is triggered based on the change streams and use elastic's update by query to query for the document and PAINLESS scripting to update the project section of the document, this will work for less frequent changes.

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
part_number: "abc123",
start_date: "2018-01-20"
},
{
part_number: "1234",
start_date: "2018-04-16"
}
I want to create a commodity document type
i.e.
{
part_number: "abc123",
commodity: "1 meter machining"
},
{
part_number: "1234",
commodity: "small flat & form"
}
Production orders are datawarehoused every week and are immutable.
Commodities on the other hand could change over time. i.e abc123 could change from 1 meter machining to 5 meter machining, so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.
I have found that you can indeed now query between indexs in elasticsearch, however you have to ensure your data stored correctly. Here is an example from the 6.3 elasticsearch docs
Terms lookup twitter example At first we index the information for
user with id 2, specifically, its followers, then index a tweet from
user with id 1. Finally we search on all the tweets that match the
followers of user 2.
PUT /users/user/2
{
"followers" : ["1", "3"]
}
PUT /tweets/tweet/1
{
"user" : "1"
}
GET /tweets/_search
{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to setup my storage so that commodity is a field and it's values are an array of part numbers.
i.e.
{
"1 meter machining": ["abc1234", "1234"]
}
I can then look up the 1 meter machining part numbers against my production_order documents
I have tested and it works.
There is no joins supported in elasticsearch.
You can query twice first by getting all the partnumbers using "small flat & form" and then using all the partnumbers to query the other index.
Else try to find a way to merge these into a single index. That would be better. Updating the Commodities would not cause you any problem by combining the both.

elastic search get distinct random field values

We have elastic search document that has following fields:
{
"stockId": 1
"sellerId": 100
}
Multiple stockId can be mapped to single sellerId but one stock can only be mapped to a single dealer. There are around 10K stocks mapped to 1K sellers. But each sellerId might have different number of stocks i.e. few might have 100 while others have only 1.
Problem Statement: We want to select 'N' random documents out of all these documents indexed. The condition is that each of these 'N' document should belong to different seller i.e. distinct "sellerId". (We need to give award to these sellers).
What I have tried: I am trying to solve this by elastic query that fetches 'N' random distinct 'sellerId'. (and then elastic query to fetch 1 document of each of these 'N' sellers). One way could be to aggregate on 'sellerId' and then pick random 'N' keys but this is not desirable approach performance wise. Can someone help with better query?
I would rebuild my mapping to create a nested document type, with seller being the parent and stockid being the nested object:
{
"sellerid" : {"type" : "integer" },
"stock_obj" : {
"type" : "nested",
"properties" : {
"stockid" : { "type" : "integer" }
}
}
When you rebuild your index, you would create only one object per seller. Each seller would have all of their stock ids. It seems like there are about 10 stocks per seller, elasticsearch can handle this fine. (If there are thousands of stocks per seller, I would do this differently)
Then, I would do a search for N sellers, sorted randomly, and then as a second sort field, you would sort the stock ids randomly. Not the simplest mapping, but the query is easy and should be fast.
Also, separately, if you're just dealing with ~10k seller/stock data points that are integers, using elasticsearch is probably overkill. It can do what you want, but its main purpose is for searching large amounts of text.

Finding duplicate documents

I have some documents whose ids are randomly generated. The issue here is I need to find the duplicates amongst these documents. I have three fields which should not be identical for two documents. So how to check for duplicates based on multiple fields?
Sample documents
document 1 = {
"process" : "business",
"processId" : 5433321,
"country" : "US"
}
document 2 = {
"process" : "operations",
"processId" : 334233,
"country" : "UK"
}
document 3 = {
"process" : "business",
"processId" : 5433321,
"country" : "US"
}
Here as you can see, document 1 and document 3 are the same, but they are having different Ids in my database,so exist as separate documents. So on run I need to find the above as duplicates and if possible keep only one.
The best option here would be to model your document around doc ID. Now for each unique document , create a docID which is a hash of the content of the document. This makes sure that only one unique document exists across the index. Next use _create API to create documents. This will fail all requests on over write document with same document ID.
You can further read about other duplication issues and its solutions here.

Resources