Intersection of two (or more) elastic indices - elasticsearch

I have two Elasticsearch indices. One holds the customers who bought item A (let's call it index_A); index_B does the same for item B.
Every record in these indices is a transaction, with a client_id and a time_of_sale.
Every customer has an id, stored in client_id (not the default _id field of Elasticsearch).
I would like to find all client_ids that are in both indices.
Right now I'm iterating through both indices (which is a huge pain), building a list of unique client_ids for each one, and then finding the overlap in Python.
Is there a better way, one that doesn't iterate over every index with match_all?

One way to achieve this would be to query both indexes at the same time, produce aggregation keys made of the index name and the client_id, and then aggregate on those keys. Since that would involve some scripting, which can harm performance, there is another way using pipeline aggregations.
Using the bucket_selector pipeline aggregation, you can first aggregate on client_id and then on the index name, and select only those client buckets which contain (at least) two indexes:
POST index_*/_search
{
  "size": 0,
  "aggs": {
    "customers": {
      "terms": {
        "field": "client_id",
        "size": 10
      },
      "aggs": {
        "indexes": {
          "terms": {
            "field": "_index",
            "size": 10
          }
        },
        "customers_in_both_indexes": {
          "bucket_selector": {
            "buckets_path": {
              "nb_buckets": "indexes._bucket_count"
            },
            "script": "params.nb_buckets > 1"
          }
        }
      }
    }
  }
}
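Since the asker already post-processes results in Python, here is a minimal sketch of how the same request body could be built and its response read there. The function names and the max_customers parameter are illustrative, not part of any Elasticsearch client. Keep in mind that the terms aggregation only returns its top "size" buckets, so with "size": 10 only ten customers are ever examined; the sketch assumes a larger value fits within your cluster's bucket limits.

```python
def build_intersection_query(id_field="client_id", min_indices=2, max_customers=1000):
    """Build a search body with one bucket per customer.

    A customer bucket is kept only if its documents come from at least
    `min_indices` distinct indices. `max_customers` is an assumed upper
    bound on the number of customer buckets to examine.
    """
    return {
        "size": 0,
        "aggs": {
            "customers": {
                "terms": {"field": id_field, "size": max_customers},
                "aggs": {
                    "indexes": {"terms": {"field": "_index", "size": 10}},
                    "customers_in_both_indexes": {
                        "bucket_selector": {
                            "buckets_path": {"nb_buckets": "indexes._bucket_count"},
                            "script": f"params.nb_buckets >= {min_indices}",
                        }
                    },
                },
            }
        },
    }


def customer_ids(response):
    """Return the keys of the customer buckets that survived the selector."""
    return [bucket["key"] for bucket in response["aggregations"]["customers"]["buckets"]]
```

The dictionary returned by build_intersection_query can be sent as the body of a search on index_*, and customer_ids applied to the parsed JSON response.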

Related

Elastic search multi index query

I am building an app where I need to match users based on several parameters. I have two elastic search indexes, one with the user's likes and dislikes, one with some metadata about the user.
/user_profile/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"]
}
/user_metadata/abc12345
{
  "userId": "abc12345",
  "seenBy": ["aaa123", "bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross reference them, but how do I do that? For example I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?
If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). Then, you can (1) issue a boolean query over both indices, to match documents from one index based on the likes field, and documents from the other index based on the seenBy field, (2) use the terms bucket aggregation to get the list of unique userIds that satisfy your conditions.
For example
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "likes": "chocolate"
          }
        },
        {
          "match": {
            "seenBy": "abc123"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      }
    }
  }
}
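The terms buckets alone do not say which of the two conditions a user matched, so they cannot separate "likes chocolate and not seen" from "seen but does not like chocolate". One possible way to tell them apart, sketched below in Python, is to add a filter sub-aggregation per condition and do the final selection on the client. This is an assumption layered on top of the answer, not part of it; the function names, the sub-aggregation names likes_match and seen_match, and the max_users parameter are all hypothetical.

```python
def build_match_query(liked, viewer, max_users=100):
    """Search body: per-user buckets with one filter sub-aggregation per condition."""
    likes_clause = {"match": {"likes": liked}}
    seen_clause = {"match": {"seenBy": viewer}}
    return {
        "size": 0,
        "query": {"bool": {"should": [likes_clause, seen_clause]}},
        "aggs": {
            "by_userId": {
                "terms": {"field": "userId.keyword", "size": max_users},
                "aggs": {
                    # Counts the user's documents that match the "likes" condition.
                    "likes_match": {"filter": likes_clause},
                    # Counts the user's documents that match the "seenBy" condition.
                    "seen_match": {"filter": seen_clause},
                },
            }
        },
    }


def unseen_users(response):
    """Keep users who matched the likes condition and never the seenBy condition."""
    return [
        bucket["key"]
        for bucket in response["aggregations"]["by_userId"]["buckets"]
        if bucket["likes_match"]["doc_count"] > 0
        and bucket["seen_match"]["doc_count"] == 0
    ]
```

For the example in the question, build_match_query("chocolate", "abc123") would be sent to user_*/_search and unseen_users applied to the parsed response.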

How to limit search results from each index in a multi index search query?

I am using Elasticsearch version 6.3 and I want to query across multiple indices. Elasticsearch supports this: I can give multiple indices as comma-separated values in the URL with one query in the request body, and set the size parameter to limit the number of search results returned. However, this limits the size of the overall result set and might lead to no results from some indexes, so instead I want to fetch the first n results from each index.
I tried the multi search API (_msearch). Giving the same query and size for every index works, but then I cannot get a single aggregation over the entire result. Is there any way to address both issues?
Solution 1:
You're on the right path with the _msearch query. What I would do is to issue one query per index (no aggregations!) with the size you want for that index, as well as another query just for the aggregations, like this:
{ "index": "index1" }
{ "size": 5, "query": { ... }}
{ "index": "index2" }
{ "size": 5, "query": { ... }}
{ "index": "index3" }
{ "size": 5, "query": { ... }}
{ "index": "index1,index2,index3" }
{ "size": 0, "query": { ... }, "aggs": { ... } }
So the first three queries will return document hits from each of the three indexes and the last query will return the aggregation computed on all indexes, but no documents.
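The request body for Solution 1 is newline-delimited JSON, which is tedious to write by hand. A minimal Python sketch that assembles it for any list of indexes follows; build_msearch_body and its parameters are illustrative names, and the query and aggs arguments stand in for the parts elided with "..." above.

```python
import json


def build_msearch_body(indices, query, size_per_index, aggs):
    """Assemble an _msearch body: one sized search per index, plus one
    aggregation-only search across all of them."""
    lines = []
    for index in indices:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"size": size_per_index, "query": query}))
    # Final pair: no hits, only the aggregation computed over every index.
    lines.append(json.dumps({"index": ",".join(indices)}))
    lines.append(json.dumps({"size": 0, "query": query, "aggs": aggs}))
    # Each line, including the last one, is terminated by a newline.
    return "\n".join(lines) + "\n"
```

The responses come back in the same order as the searches, so the last entry holds the aggregation and the preceding ones hold the per-index hits.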
Solution 2:
Another way to tackle this if you have a small size, is to have a single query in the query part and then aggregate on the index name and retrieve hits from each index using top_hits, like this:
POST index1,index2,index3/_search
{
  "size": 0,
  "query": { ... },
  "aggs": {
    "indexes": {
      "terms": {
        "field": "_index",
        "size": 50
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
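With Solution 2 the documents no longer arrive in the top-level hits but nested inside each index bucket. A short Python sketch of how the parsed response could be regrouped follows, assuming the aggregation names indexes and hits used above; hits_per_index is an illustrative name.

```python
def hits_per_index(response):
    """Map each index name to the _source of its top hits."""
    grouped = {}
    for bucket in response["aggregations"]["indexes"]["buckets"]:
        # The first "hits" is the aggregation name chosen in the query; the
        # next two levels are the standard hits envelope of top_hits.
        top = bucket["hits"]["hits"]["hits"]
        grouped[bucket["key"]] = [hit["_source"] for hit in top]
    return grouped
```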

Compare IDs between two indices in elasticsearch

I have two indices in an Elasticsearch cluster, containing what ought to be the same data in two slightly different formats. However, the number of records is different. The IDs of each document should be the same. Is there a way to extract a list of the IDs that are present in one index but not the other?
If your two indices have the same type where these documents are stored, you can use something like this:
GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "group_by_uid": {
      "terms": {
        "field": "_uid"
      },
      "aggs": {
        "count_indices": {
          "cardinality": {
            "field": "_index"
          }
        },
        "values_bucket_filter_by_index_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "count_indices"
            },
            "script": "params.count < 2"
          }
        }
      }
    }
  }
}
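The query above works in 5.x. Note that the _uid field was deprecated in 6.0 and removed in 7.0, so it will not work as written on recent versions. If your ID is also stored as a regular field inside the document, aggregating on that field instead is simpler and easier to test.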
For anyone that comes across this, Scrutineer (https://github.com/Aconex/scrutineer/) provides this sort of ability if you follow convention of ID & Version concepts within Elasticsearch.
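The aggregation reports which IDs occur in only one index, but not which index they belong to, and a terms aggregation without an explicit size only returns its top 10 buckets. If the IDs of each index have already been exported (for example by scrolling through each index), the comparison itself is a plain set difference. A minimal Python sketch, with compare_ids as an illustrative name and the inputs assumed to be any iterables of ID strings:

```python
def compare_ids(ids_a, ids_b):
    """Return (IDs only in the first index, IDs only in the second index)."""
    set_a, set_b = set(ids_a), set(ids_b)
    only_in_a = sorted(set_a - set_b)
    only_in_b = sorted(set_b - set_a)
    return only_in_a, only_in_b
```

This keeps both ID sets in memory, so it is only a reasonable choice while the number of documents stays manageable.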

Elasticsearch aggregation (duplicate) search not returning all duplicates

I am searching for and counting duplicated phrases within a single, or group of, human readable documents. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one per ES document.
I have 707 documents in my index. I KNOW that I should have, at least, 21 duplicate documents. My search is returning 19 duplicate docs. I don't understand why I am missing some matches. Here is my query:
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "content",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
          }
        }
      }
    }
  }
}
My process:
Create index
Build bulk insert data objects
Bulk insert documents into index
Reindex documents
Run duplicates query (above)
Parse results - SUM buckets.doc_counts
Delete index
NOTE: Since Elasticsearch will match words, not phrases/sentences, I md5 hash each phrase/sentence before inserting it into my index.
More detail can be provided (I didn't want my post to be too massive).
Why is ES not returning all duplicates????
Thanks
UPDATE: When creating my index I set the shards property to 1 and this helped return a few more duplicates but still not all.
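The terms aggregation only returns its top buckets (10 by default), so some duplicate groups can be cut off. If you know the approximate number of distinct values, set it as the size of the aggregation, like below:
"aggs": {
  "productId": {
    "terms": {
      "field": "productId",
      "min_doc_count": 2,
      "size": 1000
    }
  }
}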
Please check if this will fix your problem.
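To check whether the aggregation is dropping buckets, the expected number of duplicates can be computed independently from the phrases before they are indexed. A minimal Python sketch using the same md5 hashing the asker describes follows; the function names are illustrative, and it assumes the phrases are available as a list of strings.

```python
import hashlib
from collections import Counter


def phrase_hash(phrase):
    """md5 hex digest of a phrase, mirroring the hashing done before indexing."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def duplicate_counts(phrases):
    """Map each hash that occurs at least twice to its number of occurrences."""
    counts = Counter(phrase_hash(p) for p in phrases)
    return {digest: n for digest, n in counts.items() if n >= 2}


def total_duplicate_docs(phrases):
    """Equivalent of summing doc_count over all buckets with min_doc_count 2."""
    return sum(duplicate_counts(phrases).values())
```

If len(duplicate_counts(phrases)) is larger than the number of buckets the query returns, the size of the terms aggregation is the likely culprit.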

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio between a nested aggregation and the total document count (or any other aggregation) of the parent bucket? Example query:
{
  "size": 0,
  "aggs": {
    "over_time": {
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "date",
            "interval": "1d",
            "min_doc_count": 0
          },
          "aggs": {
            "measure_count": {
              "cardinality": {
                "field": "date"
              }
            },
            "all_count": {
              "value_count": {
                "field": "name"
              }
            },
            "by_name": {
              "terms": {
                "field": "name",
                "size": 0
              },
              "aggs": {
                "count_by_name": {
                  "value_count": {
                    "field": "name"
                  }
                },
                "my ratio": count_by_name / all_count * 100 <-- How to do that?
              }
            }
          }
        }
      }
    }
  }
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events on different times. Every event is indexed as a document that has a timestamp and a value.
When I calculate a ratio of the value over a date histogram, and some sensors only generated values at a few points in time, I want Elasticsearch to treat the missing values (documents) for those sensors as 0 instead of null.
So when aggregating by day, if a sensor has only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formally: SUM(VALUE)/24.
Instead, Elasticsearch calculates the average like (3+5)/2, which is not correct in my case.
There was once a ticket on Github https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-Value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If this is the case, simply divide the sum by 24 on the application side, and change that divisor accordingly when the granularity changes. The number of hours per day is fixed, after all.
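As a worked example of that suggestion, the Python sketch below divides the sum returned for a bucket by the number of expected slots instead of by the number of documents. The name zero_filled_average and the slots parameter are illustrative; values stands for whatever the sum aggregation covered in that bucket.

```python
def zero_filled_average(values, slots=24):
    """Average over a fixed number of slots, counting missing slots as 0."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return sum(values) / slots
```

For the sensor in the question, zero_filled_average([3, 5]) gives (3+5)/24 rather than the (3+5)/2 that a plain average produces.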
You can use the Bucket script aggregation to do what you want.
{
  "bucket_script": {
    "buckets_path": {
      "count_by_name": "count_by_name",
      "all_count": "all_count"
    },
    "script": "count_by_name / all_count*100"
  }
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html
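If the pipeline aggregation cannot be made to reach the parent-level all_count in your setup, the same ratio can also be computed on the client from the response. The Python sketch below works on a single by_date bucket and assumes the aggregation names from the question (all_count, by_name, count_by_name); ratios_for_day is an illustrative name.

```python
def ratios_for_day(day_bucket):
    """Return {sensor name: count_by_name / all_count * 100} for one date bucket."""
    total = day_bucket["all_count"]["value"]
    ratios = {}
    for name_bucket in day_bucket["by_name"]["buckets"]:
        count = name_bucket["count_by_name"]["value"]
        # Empty days (min_doc_count 0) have a total of 0; report 0.0 for them.
        ratios[name_bucket["key"]] = (count / total * 100) if total else 0.0
    return ratios
```

Applying it to every bucket of the date histogram yields one percentage table per day.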
