elasticsearch - query between document types

I have a production_order document type, e.g.

{
    "part_number": "abc123",
    "start_date": "2018-01-20"
},
{
    "part_number": "1234",
    "start_date": "2018-04-16"
}
I want to create a commodity document type, e.g.

{
    "part_number": "abc123",
    "commodity": "1 meter machining"
},
{
    "part_number": "1234",
    "commodity": "small flat & form"
}
Production orders are data-warehoused every week and are immutable.
Commodities, on the other hand, could change over time: for example, abc123 could change from "1 meter machining" to "5 meter machining", so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in Elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.

I have found that you can indeed query between indices in Elasticsearch; however, you have to ensure your data is stored correctly. Here is an example from the Elasticsearch 6.x docs:
Terms lookup twitter example: first we index the information for the user with id 2, specifically its followers; then we index a tweet from the user with id 1; finally we search on all the tweets that match the followers of user 2.
PUT /users/user/2
{
    "followers" : ["1", "3"]
}

PUT /tweets/tweet/1
{
    "user" : "1"
}

GET /tweets/_search
{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "user",
                "id" : "2",
                "path" : "followers"
            }
        }
    }
}
Here is the link to the original page:
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to set up my storage so that commodity is a field whose values are an array of part numbers, e.g.

{
    "1 meter machining": ["abc123", "1234"]
}
I can then look up the "1 meter machining" part numbers against my production_order documents; see the sketch below. I have tested this and it works.
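For concreteness, here is a minimal sketch of that setup as I understand it (hypothetical index names commodities and production_orders, 6.x-style single-type mappings; the document id doubles as the commodity name):

PUT /commodities/commodity/1-meter-machining
{
    "part_numbers" : ["abc123", "1234"]
}

GET /production_orders/_search
{
    "query" : {
        "terms" : {
            "part_number" : {
                "index" : "commodities",
                "type" : "commodity",
                "id" : "1-meter-machining",
                "path" : "part_numbers"
            }
        }
    }
}

The terms lookup fetches the part_numbers array from the single commodity document and filters the production orders by those values; for exact matches, part_number should be mapped as a keyword field.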

There are no joins supported in Elasticsearch.
You can query twice: first get all the part numbers matching "small flat & form", then use those part numbers to query the other index (see the sketch below).
Alternatively, try to find a way to merge these into a single index; that would be better. Updating the commodities would not cause you any problems if you combined the two.
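A minimal sketch of the two-query approach (hypothetical index names again; part_number assumed to be mapped as keyword). The client copies the part numbers from the first response into the second request:

GET /commodities/_search
{
    "query" : { "match" : { "commodity" : "small flat & form" } },
    "_source" : ["part_number"],
    "size" : 10000
}

Collect hit._source.part_number from each hit, then:

GET /production_orders/_search
{
    "query" : {
        "terms" : { "part_number" : ["1234", "..."] }
    }
}

With ~500k commoditized part numbers a hit list can get large, so the terms lookup variant above (one document id per commodity) tends to scale better than pasting thousands of terms into the second query.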

Related

Handle frequently updated geo data in Elasticsearch

I have location (geo point) data that is updated every 5 minutes for millions of users. We have to search for users with specific attributes (age/interests/languages) and within a particular geo range. I want to understand the right strategy for storing such data in Elasticsearch.
Option 1
Create a user document with the following keys:
user metadata & attributes (age, interests, languages, salary etc., around 8-10 searchable attributes)
live location (changing every few minutes):

"liveLocation" : {
    "type" : "Point",
    "coordinates" : [-72.333077, 30.856567]
}
location data: multiple addresses (home address, work address etc.) along with geo points (not updated frequently):

"addresses" : [
    {
        "type" : "home",
        "address" : "first floor, xyz, near landmark",
        "city" : "New York",
        "country" : "Country",
        "zipcode" : "US1029",
        "location" : {
            "type" : "Point",
            "coordinates" : [-73.856077, 40.848447]
        }
    },
    {
        ... more types of addresses
    }
]
We want to perform geo search queries over all the geo-type fields. My worry: the live location for users will be updated quite frequently. Is this a viable approach?
Option 2
Treat every location update as time-series data and insert a new document. This avoids updating documents; instead, we insert a new document for each user every few minutes.
Problems:
While searching for all users (home/office/live location) in a particular geo range, I have to consider only the last updated document for each user. How to do that in Elasticsearch? (See the sketch after this list.)
We have to search for users with specific attributes (age/interests/languages) and within a particular geo range. If option 2 is preferable, should user attribute metadata & location updates be treated as a parent-child relationship, or is there some other approach?
In conclusion: what would be the right approach?
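For the "latest document per user" part of option 2, one possibility (a sketch, not an authoritative answer; index, field and timestamp names here are hypothetical) is field collapsing on the user id, sorted by an update timestamp:

GET /user_locations/_search
{
    "query" : {
        "bool" : {
            "filter" : [
                { "term" : { "languages" : "en" } },
                {
                    "geo_distance" : {
                        "distance" : "5km",
                        "liveLocation" : [-72.333077, 30.856567]
                    }
                }
            ]
        }
    },
    "collapse" : { "field" : "user_id" },
    "sort" : [ { "updated_at" : "desc" } ]
}

One caveat: collapsing picks each user's most recent matching document, so a user whose newest location has moved out of range can still surface via an older document; handling that typically requires deleting or filtering out stale documents.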

One large Elasticsearch lookup index, or several smaller ones?

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:

users_index
{
    "game_ids": [],
    "movie_ids": [],
    "book_ids": []
}

or one index per lookup value, like:

user_games_index
{
    "game_ids": []
}

user_movies_index
{
    "movie_ids": []
}

user_books_index
{
    "book_ids": []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update API docs, updating a document means retrieving the whole document first. I will be updating each document a lot, and those arrays can become rather large (think thousands of ids). Updating a book id will then also retrieve all of the game ids, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32 GB RAM).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?
How about a third option?
You have one index, and each document in the index looks something like this:

{
    "user_id" : "some_user",
    "document_type" : "movie" (or "game" or "book"),
    "document_id" : "id of the movie, game or book"
}
Why? Since you say a user's games, movies and books will be updated often, this approach lets you easily add/delete individual movies, games or books for users.
You can also easily filter the books/movies/games for specific users.
All values are of type "keyword", so filtering should be fast.
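A sketch of such a mapping (assuming a 7.x-style typeless mapping and a hypothetical index name user_interactions):

PUT /user_interactions
{
    "mappings" : {
        "properties" : {
            "user_id" :       { "type" : "keyword" },
            "document_type" : { "type" : "keyword" },
            "document_id" :   { "type" : "keyword" }
        }
    }
}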
PS: A "good" mapping for an ES index will try to minimize the numbers of updates on individual documents and rather work at the level of inserting / deleting documents as ES does this task very well compared to finding & updating documents.
Edit: I have added query examples to illustrate how you can filter out results with a bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
    "query" : {
        "bool" : {
            "must_not" : {
                "term" : { "user_id" : "user X" }
            }
        }
    }
}
I want only movies a user X has NOT interacted with.
GET _search
{
    "query" : {
        "bool" : {
            "must_not" : {
                "term" : { "user_id" : "user X" }
            },
            "filter" : {
                "term" : { "document_type" : "movie" }
            }
        }
    }
}

can the terms lookup mechanism query by a field other than id?

Here is the official Elasticsearch documentation about the terms query:
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/query-dsl-terms-query.html
As we can see, if we want to do a terms lookup query, we use a command like this:
curl -XGET localhost:9200/tweets/_search -d '{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "user",
                "id" : "2",
                "path" : "followers"
            }
        }
    }
}'
But what if I want to query by another field of users?
Assume that users has some other fields, such as name. Can I use the terms lookup mechanism to find the tweets by giving the user's name instead of the id?
I have tried a command like this:
curl -XGET localhost:9200/tweets/_search -d '{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "user",
                "name" : "Jane",
                "path" : "followers"
            }
        }
    }
}'
but it returns an error.
Looking forward to your help. Thank you!
The terms lookup mechanism is basically a built-in optimization that avoids making two queries to JOIN two indices, i.e. one query on index A to get the ids to look up, and a second to fetch the documents with those ids from index B.
Contrary to SQL, such a JOIN can only work on the id field, since this is the only way to uniquely retrieve a document from Elasticsearch via a GET call, which is exactly what Elasticsearch does in the terms lookup.
So to answer your question: the terms lookup mechanism will not work on any field other than the id field, since the document to be retrieved must be unique. In your case, ES would not know how to fetch the document for the user with name Jane, since name is just a field present in the user document and in no way a unique identifier for user Jane.
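A sketch of the usual workaround (my own addition, assuming name is mapped as a keyword field): first resolve the name to an id with an ordinary query, then run the terms lookup with that id.

curl -XGET localhost:9200/users/_search -d '{
    "query" : { "term" : { "name" : "Jane" } }
}'

# take the _id from the hit (say "2"), then run the terms lookup as before:
curl -XGET localhost:9200/tweets/_search -d '{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "user",
                "id" : "2",
                "path" : "followers"
            }
        }
    }
}'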
I think you did not understand exactly how this works. The terms lookup query works by reading values from a field of a document with the given id. In this case, you are trying to match the value of the field user in the tweets index against the values of the field followers in the document with id "2" in the users index (type user).
If you want to read from any other field, simply mention that field in "path".
What you mainly need to understand is that the lookup values are all fetched from a field of a single document, not from multiple documents.

Finding duplicate documents

I have some documents whose ids are randomly generated. The issue is that I need to find duplicates among these documents. I have three fields which, taken together, should not be identical for any two documents. So how do I check for duplicates based on multiple fields?
Sample documents:

document 1 = {
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}

document 2 = {
    "process" : "operations",
    "processId" : 334233,
    "country" : "UK"
}

document 3 = {
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}
As you can see, document 1 and document 3 are the same, but they have different ids in my database and so exist as separate documents. At run time I need to detect such duplicates and, if possible, keep only one.
The best option here would be to model your documents around the doc ID: for each unique document, create a doc ID that is a hash of the content of the document. This makes sure that only one unique document exists across the index. Next, use the _create API to index documents; it will fail any request that would overwrite an existing document with the same document ID. A sketch follows.
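A minimal sketch of that approach (hypothetical index name processes; the id is a client-side hash, e.g. a SHA-1 over the three fields; endpoint shown in its 7.x form):

PUT /processes/_create/<sha1 of "business|5433321|US">
{
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}

Indexing document 3 yields the same hash, so a second _create with that id fails with 409 Conflict instead of creating a duplicate.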
You can further read about other duplication issues and their solutions here.

MongoDB compound indexes vs single-field indexes in terms of space consumption

According to this post, compound indexes are bigger in size (I could not find much info in the docs, so if you could point me there I would be grateful).
Suppose I have to search for the whole address (we can assume I will always have all the fields available, both in the collection and in the query) through a collection of addresses like:

{
    name: String,
    street: String,
    postcode: String,
    city: String,
    country: String
}
My question is: how much bigger would a compound index be?
If a compound index is bigger than a single-field one, wouldn't it be better to add a hash of the concatenation of all values to every object, put a single index on that hash field, and search by that (although it does not sound like good practice)?
These accomplish different things. A compound index has an order, and that order has an effect. For instance, the index { 'country' : 1, 'city' : 1, 'postcode' : 1 } would allow searching for all addresses in a specific city of a specific country. A hash can't do that: hashes only support exact matches on the whole value (see the sketch below).
I don't see how this is bad practice at all; it's just a very narrow use case. Remember that every slight difference in spelling, additional whitespace, etc. will result in a different hash value, and that you can't even answer simple questions like "how many addresses in country X do we store?". But if you don't need that, why not?
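To make the ordering point concrete, a small sketch in the mongo shell (hypothetical collection name addresses):

> db.addresses.createIndex({ "country" : 1, "city" : 1, "postcode" : 1 })

> // uses the index: { country, city } is a prefix of the key pattern
> db.addresses.find({ "country" : "US", "city" : "Chicago" })

> // cannot use the index efficiently: "city" alone is not a prefix
> db.addresses.find({ "city" : "Chicago" })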
By the way, MongoDB has built-in support for this. If the address is embedded, a hashed index on the entire subdocument will accomplish what you need:
MongoDB supports hashed indexes of any single field. The hashing function collapses embedded documents and computes the hash for the entire value.
e.g.:

> db.hash.insert({ "name" : "john", "address" : { "city" : "Chicago", "state" : "IL", "country" : "US" } })
WriteResult({ "nInserted" : 1 })
> db.hash.createIndex({ "address" : "hashed" })

> // this query uses the index and finds the document:
> db.hash.find({ "address" : { "city" : "Chicago", "state" : "IL", "country" : "US" } })

> // this query won't find the document because of the missing "state" field,
> // but it is still fast (IXSCAN):
> db.hash.find({ "address" : { "city" : "Chicago", "country" : "US" } })
