Recommended model design for large objects - elasticsearch

I have a model structure that looks like this:
Country
- City
- User
- Items
In my data I have about 50 countries.
Each country has about 100,000 cities.
Each city has about 10-100 users, and each user has about 1-10 items.
I want to query cities where the country is X and return the whole city object with all its child objects (the users and each user's items).
I want to query cities where a user is X or a user's item is X.
I want to query users where the city is X or an item is X.
How would you index this data in Elasticsearch?
First I thought that I could use parent-child relationships (https://www.elastic.co/guide/en/elasticsearch/guide/2.x/parent-child.html), which let me perform all of these queries, but they won't return the child objects, right? This means that I would have to perform extra queries to return the children and the children's children, or I could use inner hits, but that would need some manual mapping to the object model...
If I use nested objects, I won't be able to query child objects without returning the parent, will I? So I could not run a query against users and get back a list of matching users? It would return the country, its cities, and all of the cities' users?
How would you model this data? Search performance is much more important than indexing performance.
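For reference, the nested-objects-plus-inner-hits option weighed above could look roughly like the sketch below. It only builds the search body as a Python dict. The index layout (a `cities` index where `users` is a nested field and `users.items` is nested inside it) and the field name `users.items.name` are assumptions made for illustration, not something given in the question.

```python
import json

def users_by_item_query(item_name):
    """Build a search body that finds users owning an item with the given name.

    Assumes a hypothetical `cities` index where `users` is mapped as nested
    and `users.items` is mapped as nested inside it.
    """
    return {
        # Skip the full city document; the matching users arrive as inner hits.
        "_source": False,
        "query": {
            "nested": {
                "path": "users",
                "inner_hits": {},  # return the user objects that matched
                "query": {
                    "nested": {
                        "path": "users.items",
                        "query": {"match": {"users.items.name": item_name}},
                    }
                },
            }
        },
    }

body = users_by_item_query("laptop")
print(json.dumps(body, indent=2))
```

The hits are still city documents, so the matching users have to be read out of each hit's inner hits section, which is the manual mapping step mentioned above.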

Related

How to optimize a search filter that returns paginated results when the intermediate table can return millions of records but the actual results are few

Example
I have 3 entities: User, Address and Country.
User has a one-to-one relation with Address, and Country has a one-to-many relation with Address. I want to display user details filtered by country code.
One possible flow is to fetch the Country object for the country code, then find all Address records that reference that Country, which gives the list of addresses for the country code. Then I find the users for those addresses with an SQL 'in' query. But the issue arises when the data set is very large.
Let's say the country code IN corresponds to 1 million addresses. I would first fetch all 1 million records, then map them to the users, and only then paginate. That will be very slow.
I am thinking of somehow paginating the Address entity in the response. Is that possible?
Please help.
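One way to avoid materializing the intermediate address list is to let the database join, filter and paginate in a single statement. The following is a minimal sketch, not a definitive solution. The table and column names (`users.address_id`, `address.country_id`, `country.code`) are hypothetical, and SQLite is used only so the example runs standalone.

```python
import sqlite3

def users_page_by_country(conn, country_code, page, page_size):
    """Return one page of users for a country code using a single joined query,
    so filtering and pagination happen inside the database."""
    sql = """
        SELECT u.id, u.name
        FROM users u
        JOIN address a ON a.id = u.address_id
        JOIN country c ON c.id = a.country_id
        WHERE c.code = ?
        ORDER BY u.id
        LIMIT ? OFFSET ?
    """
    return conn.execute(sql, (country_code, page_size, page * page_size)).fetchall()

# Tiny in-memory data set, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE address (id INTEGER PRIMARY KEY, country_id INTEGER);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER);
    INSERT INTO country VALUES (1, 'IN'), (2, 'US');
    INSERT INTO address VALUES (1, 1), (2, 1), (3, 2), (4, 1);
    INSERT INTO users VALUES (1, 'a', 1), (2, 'b', 2), (3, 'c', 3), (4, 'd', 4);
""")
first_page = users_page_by_country(conn, "IN", page=0, page_size=2)
second_page = users_page_by_country(conn, "IN", page=1, page_size=2)
```

With indexes on the join and filter columns, the database only has to produce the rows of the requested page instead of handing a million addresses to the application.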

Elasticsearch - Count of associations between indexes?

Coming from a relational database background, I want to know if there is a way to retrieve the number of unique associations between two indexes.
Basic Example (Using relational databases)
I have 3 tables: Person, Cars, Person-Cars
Person-Cars has two columns (person_id, car_id) and holds the number of associations (ownership) between people and cars.
On Elasticsearch, I have created an index for Person and for Cars.
Main Point
Every time I fetch a Car document, I want to know how many people own that car (in other words, how many associations it has to unique people).
--
To achieve that, would I need another index for Person-Cars, and then have to index all the association records? Is there a simpler way? What would be the best way to do this in ES?
I have looked into aggregations, but I think they can only be done on a single level (person or cars). I'm not sure.
Thanks!
On Elasticsearch, I have created an index for Person and for Cars.
Most of the time it makes sense to store data in a denormalized fashion in Elasticsearch, i.e. to model one-to-many relationships as nested documents, as a parent-child relationship, or simply as multi-value fields.
What would be the best way to do this in ES?
It depends on your use case (parent-child, nested, or multi-value). Having a separate index for each type will definitely add overhead. The schema can only be modelled better if you share your other use cases and the types of queries you will need.
Considering only the use case you shared, a car document like the one below will solve it. "owners" holds the IDs of the owners (both IDs and names can be stored if required); alternatively, you can simply maintain a counter in "owners_cnt":
{
  "id": 1,
  "brand": "Hyundai",
  "owners": [21, 31, 51],
  "owners_cnt": 3
}
Whenever a person buys or sells a car, the car document needs to be updated in this case. If cars are bought and sold frequently and you have to update both the car and the person document on every purchase, this type of modelling makes less sense.
In that case it makes sense to keep the car IDs within the person doc:
{
"id":1,
"name":"Raj",
"cars":[1,2,3]
}
In this case, we can use the query below to fetch the number of persons who bought the car with id=3:
GET person/_count
{
  "query": {
    "match": {
      "cars": 3
    }
  }
}
Again, better modelling can be achieved if more context is shared.
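To make the second option concrete, here is a small sketch that builds the same _count request as a Python dict and then reproduces what the count means over in-memory person documents. The helper names and the extra sample persons are made up for illustration, and no Elasticsearch client is involved.

```python
def count_owners_request(car_id):
    """Path and body of the _count request for persons owning car_id."""
    return "/person/_count", {"query": {"match": {"cars": car_id}}}

def count_owners_locally(persons, car_id):
    """What the count represents: persons whose `cars` array contains car_id."""
    return sum(1 for p in persons if car_id in p.get("cars", []))

# Hypothetical sample documents in the shape of the person doc above.
persons = [
    {"id": 1, "name": "Raj", "cars": [1, 2, 3]},
    {"id": 2, "name": "Asha", "cars": [3]},
    {"id": 3, "name": "Lee", "cars": [2]},
]

path, body = count_owners_request(3)
owners_of_car_3 = count_owners_locally(persons, 3)
```

Because each person document appears at most once in the result, the count is already the number of unique owners, so no separate Person-Cars index is needed for this use case.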

RethinkDB query OrderBy distances between central point and subtable of locations

I'm fairly new to RethinkDB and am trying to solve a thorny problem.
I have a database that currently consists of two kinds of account, customers and technicians. I want to write a query that will produce a table of technicians, ordered by their distance to a given customer. The technician and customer accounts each have coordinate location attributes, and the technicians have a service area attribute in the form of a roughly circular polygon of coordinates.
For example, a query that returns a table of technicians whose service area overlaps the location of a given customer looks like this:
r.db('database').table('Account')
.filter(r.row('location')('coverage').intersects(r.db('database')
.table('Account').get("6aab8bbc-a49f-4a9d-80cc-88c95d0bae8d")
.getField('location').getField('point')))
From here I want to order the resulting subtable of technicians by their distance to the customer they're overlapping.
It's hard to work on this without a sample dataset to play around with, so I'm using my imagination.
Your Account table stores both customers and technicians.
A technician document has the field location.coverage.
By using intersects, you can return the list of technicians whose coverage area includes the customer's location.
To order it, we can pass a function to the orderBy command. For each technician, we take their point field, compute its distance to the customer's point with the distance command, and use that number to order the results.
r.db('database').table('Account')
.filter(
r.row('location')('coverage')
.intersects(
r.db('database').table('Account').get("6aab8bbc-a49f-4a9d-80cc-88c95d0bae8d")('location')('point')
)
)
.orderBy(function(technician) {
return technician('location')('point')
.distance(r.db('database').table('Account').get("6aab8bbc-a49f-4a9d-80cc-88c95d0bae8d")('location')('point'))
})
I hope it helps. If not, please post some sample data here and we can try to figure it out together.
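To sanity-check the ordering outside the database, the sort step can be imitated locally. This is only a sketch of the ordering logic, not the RethinkDB API. It assumes points are (longitude, latitude) pairs and uses the haversine formula on a sphere, so the distances are an approximation and may differ slightly from what the distance command returns. The sample technicians are invented.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius, spherical approximation

def haversine_m(a, b):
    """Great-circle distance in metres between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def order_by_distance(technicians, customer_point):
    """Sort technician documents by the distance from their point to the customer."""
    return sorted(
        technicians,
        key=lambda t: haversine_m(t["location"]["point"], customer_point),
    )

# Hypothetical technicians that already passed the coverage filter.
technicians = [
    {"id": "A", "location": {"point": (0.0, 2.0)}},
    {"id": "B", "location": {"point": (0.0, 1.0)}},
    {"id": "C", "location": {"point": (3.0, 0.0)}},
]
ordered_ids = [t["id"] for t in order_by_distance(technicians, (0.0, 0.0))]
```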

Elastic search query scenario

I am building an application which requires a location-based search of hotels.
I have three main classes:
class Hotel {
String name
String latitude
String longitude
}
class HotelResource {
Hotel hotel
String name
}
class HotelResourceAvailability{
HotelResource resource
}
HotelResourceAvailability - holds the availability data of a hotel resource.
The query scenario:
As a user, I want to search for all the hotels in a particular location which have at least one hotel resource available,
and get the count of available resources for each of the hotels.
Note: hotels matching the location criteria but without any available resource should be filtered out.
I am new to Elasticsearch and am finding it difficult to decide on an approach. Any pointers would be really appreciated.
Three points to get you started:
I would keep your data model as flat as possible. Elasticsearch isn't relational, so you can't easily join from one object to another.
Latitude and longitude can be stored in the geo_point type; you can then use geo queries to find the nearest matching hotels.
Is hotel availability based on date? If so, I would use nesting or a parent-child relationship.
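The first two points can be sketched as a flat hotel document with a geo_point field plus a geo_distance filter. The field names `location` and `available_resources`, the 5km default radius, and the typeless mapping format (Elasticsearch 7 or later) are assumptions for illustration. A plain counter like this only works if availability is not date-based.

```python
def hotel_mapping():
    """Index mapping for a flat hotel document (hypothetical field names)."""
    return {
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "location": {"type": "geo_point"},
                # Denormalized count of currently available resources.
                "available_resources": {"type": "integer"},
            }
        }
    }

def nearby_available_hotels(lat, lon, distance="5km"):
    """Search body: hotels within `distance` of a point that have at least
    one available resource."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }},
                    {"range": {"available_resources": {"gt": 0}}},
                ]
            }
        }
    }

mapping = hotel_mapping()
query = nearby_available_hotels(12.97, 77.59)
```

Each returned hotel then carries its own available_resources value, which covers the count of available resources per hotel without a join.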

Optional Parent Relation in Elasticsearch

Example: Some items belong to specific users. The user is the parent, the item is the child. Indexing those items and users can be done by routing the items to the shards of their users.
Problem: The majority of items do not belong to a specific user, since they have been posted anonymously. I could have those items routed to a parent-id:"anonymous", but that would lead to the majority of items being stored in one single shard.
Question: How can I introduce optional parent-child relations so that items belonging to a registered user are routed to the user's shard, while anonymous items get distributed randomly?
Store them in two different indexes and search both.
Here are a video and an article that have more on sharding/index partitioning strategies:
Sizing Elasticsearch
ElasticSearch: Big Data, Search, and Analytics
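The two-index suggestion can be sketched as follows. The index names `items_registered` and `items_anonymous` and the `user_id` field are hypothetical. The sketch only decides where a document goes and builds the search path. It relies on two things that are standard in Elasticsearch: a search can target several comma-separated indexes, and documents indexed without a routing value are distributed by their own ID.

```python
def index_target_for(item):
    """Pick the target index and routing value for an item document."""
    user_id = item.get("user_id")
    if user_id is None:
        # Anonymous item: no routing value, so it is spread across shards by its own ID.
        return {"index": "items_anonymous", "routing": None}
    # Item of a registered user: route by the user so it lands on the user's shard.
    return {"index": "items_registered", "routing": str(user_id)}

def search_path(indices):
    """Path of a search request that spans several indexes."""
    return "/" + ",".join(indices) + "/_search"

owned = index_target_for({"id": 10, "user_id": 42, "title": "bike"})
anonymous = index_target_for({"id": 11, "title": "lamp"})
path = search_path(["items_registered", "items_anonymous"])
```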
