How to combine results from two indexes in Elasticsearch

I have CDR log entries in Elasticsearch in the format below. When the document is first created, I don't yet have any info about the delivery_status field.
{
  "msgId": "384573847",
  "msgText": "Message text to be delivered",
  "submit_status": true,
  ...
  "delivery_status": // comes later
}
Later, when the delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations, I can reach up to 3000 or more transactions/sec, but if I mix in updates, ingestion slows to a crawl at 100 or fewer txns/sec.
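For reference, the insert-only bulk path can be sketched in a few lines of Python. This builds the newline-delimited body that the _bulk endpoint expects (the index name cdr_logs and the helper name are placeholders; this sketch does not contact a cluster):

```python
import json

def build_bulk_body(docs, index="cdr_logs"):
    """Assemble an NDJSON body for the Elasticsearch _bulk API.

    Each document becomes two lines: an action/metadata line and the
    document source itself. The trailing newline is required by _bulk.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["msgId"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    {"msgId": "384573847", "msgText": "Message text to be delivered", "submit_status": True},
])
# body is then sent as: POST _bulk  (Content-Type: application/x-ndjson)
```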
So, I am thinking that I could create another index like below, where I store the delivery status along with msgId:
{
  "msgId": "384573847",
  "delivery_status": 0
}
With this approach, I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query a record by joining these indices? I have heard of aliases, but could not fully understand the concept or whether it applies to my use case.
Thanks to anyone helping me out with suggestions.

As you mentioned, you can index the two documents in separate indices and use the collapse functionality of Elasticsearch to retrieve both documents together.
Suppose you have indexed documents in index2 and index3 and both share a common msgId field; then you can use the query below:
POST index2,index3/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
But again, you need to consider querying performance with a large data set. It is worth benchmarking to evaluate query performance and decide whether index-time or query-time joining works better for you.
Regarding aliases: in the query above we provide index2,index3 (comma-separated) as the index names. If you use an alias instead, you can query both indices under a single unified name.
You can add both indices to a single alias using the command below:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index3",
        "alias": "order"
      }
    },
    {
      "add": {
        "index": "index2",
        "alias": "order"
      }
    }
  ]
}
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "msgId",
    "inner_hits": {
      "name": "most_recent",
      "size": 5
    }
  }
}
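Note that collapse only groups the hits per msgId; stitching a CDR document and its delivery-status document into one record still happens on the client. A minimal Python sketch of that merge step (function and variable names are illustrative, not part of any Elasticsearch API):

```python
def merge_by_msg_id(cdr_docs, status_docs):
    """Join CDR documents with delivery-status documents on msgId,
    emulating a master-detail join on the client side."""
    status_by_id = {d["msgId"]: d for d in status_docs}
    merged = []
    for doc in cdr_docs:
        combined = dict(doc)
        status = status_by_id.get(doc["msgId"])
        if status is not None:
            combined["delivery_status"] = status["delivery_status"]
        merged.append(combined)
    return merged

# Sample documents shaped like the two indices above
cdr_docs = [{"msgId": "384573847", "msgText": "Message text to be delivered", "submit_status": True}]
status_docs = [{"msgId": "384573847", "delivery_status": 0}]
```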

Related

How to return multiple hit counts for multiple search strings in elasticsearch extraction query?

This is a many-to-many kind of query.
I want to combine multiple search strings in a single query, and I want hit counts for each of them.
For example:
"Apple", "Orange", "Pineapple"
For each of them, I want a count.
Currently, I have written an OR query, but that defeats the purpose:
the OR query combines everything, and the hit count is an accumulation over all of the search strings,
but I want a hit count for each search string individually.
Tldr;
Elasticsearch is not a relational database; trying to implement RDBMS behaviour in it is doomed to fail eventually.
That said, if you just want to count the number of hits for a series of terms, you should look into the aggregation feature of Elasticsearch.
In case you absolutely need to perform search queries, you could look into the msearch API, which allows you to send multiple queries to Elasticsearch in a single call.
Solution (aggregation)
GET <index name>/_search
{
  "size": 0,
  "query": {
    "terms": {
      "fruits": [
        "Apple",
        "Orange",
        ...
      ]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "fruits",
        "size": 10
      }
    }
  }
}
Solution (msearch)
GET <index name>/_msearch
{ }
{"query": {"terms": {"fruits": ["Apple"]}}}
{}
{"query": {"terms": {"fruits": ["Orange"]}}}
...
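With many terms, the msearch body can be generated programmatically rather than written by hand. A small Python sketch that builds the NDJSON body shown above (build_msearch_body is a made-up helper name; no cluster is contacted):

```python
import json

def build_msearch_body(field, terms, index=None):
    """Build an NDJSON body for the _msearch API: one header line and one
    query line per term, so each term gets its own hit count in the response."""
    lines = []
    for term in terms:
        header = {"index": index} if index else {}
        lines.append(json.dumps(header))
        lines.append(json.dumps({"query": {"terms": {field: [term]}}}))
    return "\n".join(lines) + "\n"

body = build_msearch_body("fruits", ["Apple", "Orange", "Pineapple"])
# body is sent as: GET <index name>/_msearch  (Content-Type: application/x-ndjson)
```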

Elastic search multi index query

I am building an app where I need to match users based on several parameters. I have two Elasticsearch indexes: one with the users' likes and dislikes, and one with some metadata about each user.
/user_profile/abc12345
{
  "userId": "abc12345",
  "likes": ["chocolate", "vanilla", "strawberry"]
}
/user_metadata/abc12345
{
  "userId": "abc12345",
  "seenBy": ["aaa123", "bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross reference them, but how do I do that? For example I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?
If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). Then, you can (1) issue a boolean query over both indices, to match documents from one index based on the likes field, and documents from the other index based on the seenBy field, (2) use the terms bucket aggregation to get the list of unique userIds that satisfy your conditions.
For example
GET user_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "likes": "chocolate"
          }
        },
        {
          "match": {
            "seenBy": "abc123"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_userId": {
      "terms": {
        "field": "userId.keyword",
        "size": 100
      }
    }
  }
}

Intersection of two (or more) elastic indices

I have two Elasticsearch indices: one for customers who bought item A (call it index_A), and similarly index_B.
Every record in these indices is transaction data, which has a client_id and a time_of_sale.
Every customer has an id (not the default _id field of Elasticsearch).
I would like to find all customer_ids that are in both indices.
Right now I'm iterating through both (which is a huge pain), creating a list of all unique customer_ids for each index, and then finding the overlap in Python.
Is there a better way, one that doesn't iterate over all indices with match_all?
One way to achieve this would be to query both indexes at the same time, produce aggregation keys made of the index name and the client_id, and then aggregate on those keys. Since that would involve some scripting, and could thus harm performance, there is another way using pipeline aggregations.
Using the bucket_selector pipeline aggregation, you can first aggregate on client_id and then on the index name, and only select those client buckets which contain (at least) two indexes:
POST index_*/_search
{
  "size": 0,
  "aggs": {
    "customers": {
      "terms": {
        "field": "client_id",
        "size": 10
      },
      "aggs": {
        "indexes": {
          "terms": {
            "field": "_index",
            "size": 10
          }
        },
        "customers_in_both_indexes": {
          "bucket_selector": {
            "buckets_path": {
              "nb_buckets": "indexes._bucket_count"
            },
            "script": "params.nb_buckets > 1"
          }
        }
      }
    }
  }
}
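The matching client_ids can then be read straight out of the surviving buckets on the client side. A short Python sketch of that post-processing step (the sample response below is hand-written to mimic the aggregation shape, not real cluster output):

```python
def clients_in_both(response):
    """Extract client_ids from the 'customers' terms-aggregation buckets.

    The bucket_selector has already dropped clients that appear in fewer
    than two indices, so every surviving bucket is an intersection match.
    """
    buckets = response["aggregations"]["customers"]["buckets"]
    return [b["key"] for b in buckets]

# Hand-written sample mimicking the response shape of the query above
sample_response = {
    "aggregations": {
        "customers": {
            "buckets": [
                {"key": "client_42", "doc_count": 7},
                {"key": "client_99", "doc_count": 3},
            ]
        }
    }
}
```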

Elasticsearch query speed up with repeated used terms query filter

I need to find the number of co-occurrences between each single tag and a fixed set of tags as a whole. I have 10,000 different single tags, and there are 10k tags inside the fixed set of tags. I loop through all the single tags under the fixed-set-of-tags context, with a fixed time range. I have a total of 1 billion documents inside the index, with 20 shards.
Here is the Elasticsearch query (Elasticsearch 6.6.0):
es.search(index=index, size=0, body={
    "query": {
        "bool": {
            "filter": [
                {"range": {
                    "created_time": {
                        "gte": fixed_start_time,
                        "lte": fixed_end_time,
                        "format": "yyyy-MM-dd-HH"
                    }
                }},
                {"term": {"tags": dynamic_single_tag}},
                {"terms": {"tags": {
                    "index": "fixed_set_tags_list",
                    "id": 2,
                    "type": "twitter",
                    "path": "tag_list"
                }}}
            ]
        }
    },
    "aggs": {
        "by_month": {
            "date_histogram": {
                "field": "created_time",
                "interval": "month",
                "min_doc_count": 0,
                "extended_bounds": {
                    "min": two_month_start_time,
                    "max": start_month_start_time
                }
            }
        }
    }
})
My question: Is there any solution that can cache the fixed 10k-tag terms query and the time-range filter inside Elasticsearch, to speed up the query? The query above takes 1.5s for one single tag.
What you are seeing is normal behavior for Elasticsearch aggregations (actually, pretty good performance given that you have 1 billion documents).
There are a couple of options you may consider: using a batch of filter aggregations, re-indexing with a subset of documents, and downloading the data out of Elasticsearch and computing the co-occurrences offline.
But probably it is worth trying to send those 10K queries and see if Elasticsearch built-in caching kicks in.
Let me explain in a bit more detail each of these options.
Using filter aggregation
First, let's outline what we are doing in the original ES query:
filter documents with created_time in a certain time window;
filter documents containing the desired tag dynamic_single_tag;
also filter documents that have at least one tag from the list fixed_set_tags_list;
count how many such documents there are per month in a certain time period.
The performance is a problem because we've got 10K tags to make such queries for.
What we can do here is move the filter on dynamic_single_tag from the query into the aggregations:
POST myindex/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { ... } }
      ]
    }
  },
  "aggs": {
    "by tag C": {
      "filter": {
        "term": {
          "tags": "C" <== here's the filter
        }
      },
      "aggs": {
        "by month": {
          "date_histogram": {
            "field": "created_time",
            "interval": "month",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2019-01-01",
              "max": "2019-02-01"
            }
          }
        }
      }
    }
  }
}
The result will look something like this:
"aggregations" : {
  "by tag C" : {
    "doc_count" : 2,
    "by month" : {
      "buckets" : [
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 0
        }
      ]
    }
  }
}
Now, if you are asking how this can help performance, here is the trick: add more such filter aggregations, one per tag: "by tag D", "by tag E", etc.
The improvement comes from doing "batch" requests, combining many initial requests into one. It might not be practical to put all 10K of them into one query, but even batches of 100 tags per query can be a game changer.
(Side note: roughly the same behavior can be achieved via a terms aggregation with the include filter parameter.)
This method of course requires getting your hands dirty and writing a somewhat more complex query, but it comes in handy if you need to run such queries at random times with zero preparation.
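To make the batching concrete, the per-tag filter aggregations can be generated in code rather than written by hand. A rough Python sketch (the function name and batch contents are made up; the field name and date bounds mirror the example above):

```python
def build_batched_aggs(tags, date_min="2019-01-01", date_max="2019-02-01"):
    """Build one 'filter' aggregation per tag, each wrapping the same
    monthly date_histogram, so one request covers a whole batch of tags."""
    aggs = {}
    for tag in tags:
        aggs["by tag " + tag] = {
            "filter": {"term": {"tags": tag}},
            "aggs": {
                "by month": {
                    "date_histogram": {
                        "field": "created_time",
                        "interval": "month",
                        "min_doc_count": 0,
                        "extended_bounds": {"min": date_min, "max": date_max},
                    }
                }
            },
        }
    # The shared filter clauses (time range, fixed tag set) stay in the query part
    return {"size": 0, "query": {"bool": {"filter": []}}, "aggs": aggs}

body = build_batched_aggs(["C", "D", "E"])  # one batch of tags per request
```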
Re-indexing the documents
The idea behind the second method is to reduce the set of documents beforehand, via the reindex API. The reindex query might look like this:
POST _reindex
{
  "source": {
    "index": "myindex",
    "type": "_doc",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "created_time": {
                "gte": "fixed_start_time",
                "lte": "fixed_end_time",
                "format": "yyyy-MM-dd-HH"
              }
            }
          },
          {
            "terms": {
              "tags": {
                "index": "fixed_set_tags_list",
                "id": 2,
                "type": "twitter",
                "path": "tag_list"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "myindex_reduced"
  }
}
This query will create a new index, myindex_reduced, containing only the elements that satisfy the first 2 filter clauses.
At this point, the original query can be run without those 2 clauses.
The speed-up in this case comes from limiting the number of documents: the smaller it is, the bigger the gain. So if fixed_set_tags_list leaves you with a small portion of the 1 billion, this is the option you should definitely try.
Downloading data and processing outside Elasticsearch
To be honest, this use case looks more like a job for pandas. If data analytics is your goal, I would suggest using the scroll API to extract the data to disk and then process it with an arbitrary script.
In Python it can be as simple as using the .scan() helper method of the elasticsearch library.
Why not try the brute force approach?
Elasticsearch will already try to help you with your query via the request cache. It is applied only to pure-aggregation queries (size: 0), so it should work in your case.
But it will not, because the content of the query is always different (the whole JSON of the query is used as the caching key, and we have a new tag in every query). A different level of caching starts to play instead.
Elasticsearch relies heavily on the filesystem cache, which means that under the hood the more frequently accessed blocks of the filesystem get cached (practically loaded into RAM). For the end user this means that "warming up" comes slowly and with a volume of similar requests.
In your case, aggregations and filtering occur on 2 fields: created_time and tags. This means that after doing maybe 10 or 100 requests with different tags, the response time will drop from 1.5s to something more bearable.
To demonstrate my point, here is a Vegeta plot from my study of Elasticsearch performance under the same query with heavy aggregations, sent at a fixed RPS:
As you can see, initially the request was taking ~10s, and after 100 requests it diminished to a brilliant 200ms.
I would definitely suggest trying this "brute force" approach, because if it works it is good, and if it does not, it costs nothing.
Hope that helps!

Sort documents by size of a field

I have documents like the ones below indexed:
1.
{
  "name": "Gilly",
  "hobbyName": "coin collection",
  "countries": ["US", "France", "Georgia"]
}
2.
{
  "name": "Billy",
  "hobbyName": "coin collection",
  "countries": ["UK", "Ghana", "China", "France"]
}
Now I need to sort these documents based on the array length of the field "countries", such that the result after sorting would be in the order document2, document1. How can I achieve this using Elasticsearch?
You can use script based sorting to achieve this.
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": "doc['countries'].values.size()",
      "order": "desc"
    }
  }
}
I would suggest using the token count type in Elasticsearch.
It can also be done using scripts (see here for how to do it with scripts), but then the results won't be perfect: scripts mostly use the field data cache, and duplicates are removed in it.
You can read more on how to use the token count type here.
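For reference, a token_count sub-field is declared in the mapping, so the count is computed at index time rather than per query. A sketch of what such a mapping could look like (the index name hobbies and the sub-field name length are made up; note that token_count indexes one count per array element, so for arrays the sort should be combined with a sum mode and verified against your data):

```
PUT hobbies
{
  "mappings": {
    "properties": {
      "countries": {
        "type": "text",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```

Sorting could then use something like "sort": [{"countries.length": {"order": "desc", "mode": "sum"}}], assuming each country name analyzes to a single token.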
