Elasticsearch multi-index query

I am building an app where I need to match users based on several parameters. I have two Elasticsearch indices: one with the user's likes and dislikes, and one with some metadata about the user.
/user_profile/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"]
}
/user_metadata/abc12345
{
  "userId": "abc12345",
  "seenBy": ["aaa123", "bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross-reference them, but how do I do that? For example, I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?

If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
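For illustration, if both documents lived in a single merged index (call it user_merged, a hypothetical name), the question's search would reduce to one plain bool query with no cross-referencing at all:
GET user_merged/_search  // user_merged is a hypothetical merged index
{
  "query": {
    "bool": {
      "must": [
        { "match": { "likes": "chocolate" } }
      ],
      "must_not": [
        { "match": { "seenBy": "abc123" } }
      ]
    }
  }
}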
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). You can then (1) issue a boolean query over both indices, matching documents from one index on the likes field and documents from the other on the seenBy field, and (2) use a terms bucket aggregation to collect the unique userIds that satisfy your conditions.
For example
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "match": { "likes": "chocolate" } },
        { "match": { "seenBy": "abc123" } }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      }
    }
  }
}
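Note that a bucket's doc_count alone is ambiguous here: a count of 1 can mean "likes chocolate and was never seen" or "was seen but doesn't like chocolate". A possible refinement (a sketch, not part of the original answer) is a sub-aggregation on the virtual _index field, so each userId bucket reveals which index its documents came from; you would then keep, client-side, only the buckets that contain user_profile but not user_metadata:
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "match": { "likes": "chocolate" } },
        { "match": { "seenBy": "abc123" } }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      },
      "aggs": {
        "from_index": {
          "terms": { "field": "_index" } // which index each matching doc lives in
        }
      }
    }
  }
}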

Related

elasticsearch - how to combine results from two indexes

I have CDR log entries in Elasticsearch in the below format. When the document is first created, I won't yet have info about the delivery_status field.
{
  "msgId": "384573847",
  "msgText": "Message text to be delivered",
  "submit_status": true,
  ...
  "delivery_status": // comes later
}
Later when delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations, I can reach 3000 or more transactions/sec, but if I mix in updates, the ingestion rate slows to a crawl at 100 or fewer txns/sec.
So, I am thinking that I could create another index like below, where I store the delivery status along with msgId:
{
  "msgId": 384573847,
  "delivery_status": 0
}
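(For reference, these status-only documents could be appended with the _bulk API, which is what keeps pure inserts fast; a sketch, with msg_status as a hypothetical index name:)
POST msg_status/_bulk
{ "index": { "_id": "384573847" } }
{ "msgId": 384573847, "delivery_status": 0 }
Using the msgId as the document _id would also make each status retrievable later with a direct GET.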
With this approach, I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query the records by joining these indices? I have heard of aliases, but could not fully understand the concept or whether it applies to my use case.
Thanks to anyone helping me out with suggestions.
As you mentioned, you can index the documents in separate indices and use the collapse functionality of Elasticsearch to retrieve both documents together.
Let's consider that you have indexed documents in index2 and index3, and both share a common msgId; then you can use the query below:
POST index2,index3/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
But again, you need to consider query performance with a large data set. You can do some benchmarking to evaluate query performance and decide whether an index-time or a query-time join works better for you.
Regarding aliases: in the query above we pass index2,index3 (comma-separated) as the index name. If you use an alias instead, you can address both indices under a single unified name.
You can add both indices to a single alias using the command below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index3",
        "alias": "order"
      }
    },
    {
      "add": {
        "index": "index2",
        "alias": "order"
      }
    }
  ]
}
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
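Each collapsed hit then carries its inner_hits, so the CDR document and the status document for a given msgId come back together. An abbreviated sketch of the response shape (illustrative, not a verbatim response):
{
  "hits": {
    "hits": [
      {
        "_index": "index2",
        "_source": { "msgId": 384573847, "msgText": "Message text to be delivered", ... },
        "inner_hits": {
          "most_recent": {
            "hits": {
              "hits": [
                { "_index": "index3", "_source": { "msgId": 384573847, "delivery_status": 0 } }
              ]
            }
          }
        }
      }
    ]
  }
}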

Intersection of two (or more) elastic indices

I have two elasticsearch indices, one is for customers who bought item A, let's call it index_A, and similarly index_B.
Every record in these indices are transaction data, which has client_id and time_of_sale.
Every customer has an id (not the default _id field of elasticsearch)
I would like to find all customer_ids that are in both indices.
Right now I'm iterating through both (which is a huge pain), creating a list of all unique customer_ids for each index, and then finding the overlap in python.
Is there a better way, one that doesn't iterate over all indices with match_all?
One way to achieve this would be to query both indexes at the same time and produce aggregation keys made of the index name and the client_id, then aggregate on those keys. Since that would involve some scripting, and could thus harm performance, there is another way using pipeline aggregations.
Using the bucket_selector pipeline aggregation, you can first aggregate on client_id and then on the index name, and only select those client buckets which contain (at least) two indexes:
POST index_*/_search
{
  "size": 0,
  "aggs": {
    "customers": {
      "terms": {
        "field": "client_id",
        "size": 10
      },
      "aggs": {
        "indexes": {
          "terms": {
            "field": "_index",
            "size": 10
          }
        },
        "customers_in_both_indexes": {
          "bucket_selector": {
            "buckets_path": {
              "nb_buckets": "indexes._bucket_count"
            },
            "script": "params.nb_buckets > 1"
          }
        }
      }
    }
  }
}
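For reference, the scripted variant mentioned at the top of this answer could look roughly like the sketch below (an assumption-laden sketch: it relies on the virtual _index field being accessible from Painless and on client_id being a keyword field, and you would still pair up the composite keys client-side):
POST index_*/_search
{
  "size": 0,
  "aggs": {
    "index_and_client": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['_index'].value + '|' + doc['client_id'].value" // hypothetical composite key
        },
        "size": 10000
      }
    }
  }
}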

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a filename field and a when date field to indicate the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads; it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting by the document count.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try that out. Apart from this, I don't see a way to get the expected output.
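For the pagination mechanics themselves (separate from the ordering limitation), each further page of a composite aggregation is requested by echoing the previous response's after_key back as the after parameter. A sketch based on the query from the question:
{
  "size": 0,
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ],
        "after": { "downloads": "<after_key from previous response>" }
      }
    }
  }
}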

Elasticsearch - retrieving documents only if multiple match by specific field

I have an index in Elasticsearch with users' posts. I want to retrieve user_ids from this index if, for a given date range, there are at least X posts; otherwise such users should be skipped.
Is there any way I can achieve this in ES, or do I have to fetch all entities and handle them later?
Trawa ;)
To answer your question I'll assume you have the fields user and datetime in your mapping.
You can get the requested data like so:
Get the list of users who have more than X (e.g. X=100) posts in the given date range, by aggregating on user name for that date range:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "datetime": {
              "gte": "2017-05-01",
              "lt": "2017-06-01"
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "users": {
      "terms": {
        "field": "user",
        "min_doc_count": 100
      }
    }
  }
}
Edit the query to match your date range (and its format), and set min_doc_count to the minimum X posts per user.
EDIT:
There is no way to avoid the terms aggregation if you need all distinct values.
50k values does seem to be too much data to retrieve at once, but it also depends on your cluster.
My suggestion is to add another filter, say an alphabetical one, so that instead of getting 50k results at once you split the work into several queries, one per prefix (shown below for the "a*" prefix; repeat with "b*", "c*", and so on):
"must": [
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
},
{
"wildcard": {
"user": "a*"
}
},
{
"wildcard": {
"user": "b*"
}
}
]
See Wildcard
Unfortunately, scrolling on aggregation results is not available. Manually dividing the data into pieces is the best thing I can see right now.
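(As an aside: on Elasticsearch versions that provide it, the composite aggregation can page through all distinct user values without the manual alphabet split; min_doc_count is not supported there, so the X-posts threshold would have to be applied client-side while walking the pages via after_key. A sketch:)
{
  "size": 0,
  "query": {
    "range": {
      "datetime": {
        "gte": "2017-05-01",
        "lt": "2017-06-01"
      }
    }
  },
  "aggs": {
    "users_paged": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user" } } }
        ]
      }
    }
  }
}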

instruct elasticsearch to return random results from different types

I have an index in ES with, say, 3 types A, B, C. Each type holds 1000 products. When the user makes a query with no scoring, ES returns first all results from A, then all from B, and then all from C.
What I need is to present mixed results from the 3 types.
I looked into random scoring, but it's not quite what I need.
Any ideas?
Do you really need randomness, or simply 3 results from each type? Three results from each type can be realized through the top_hits aggregation: first aggregate on the _type field, then apply the top_hits aggregation:
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "random_score": {
        "seed": 137677928418000
      }
    }
  },
  "aggs": {
    "all_type": {
      "terms": {
        "field": "_type"
      },
      "aggs": {
        "by_top_hit": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}
Edit: I added random scoring to get random results. I think getting a specific number of documents per _type is difficult; a solution is probably just to fetch enough documents across all _type values.
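If you only need the flat hit list interleaved across types (rather than a fixed number per type), the function_score part alone already does that; a sketch (note that recent Elasticsearch versions also require a "field", e.g. "_seq_no", alongside the seed):
{
  "size": 30,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": { "seed": 42 } // vary the seed per request for a different shuffle
    }
  }
}
Varying the seed per request (for example, using a timestamp) produces a different ordering each time.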
