What is the best way to aggregate the time between events in ElasticSearch? - elasticsearch

I'm querying an ElasticSearch database in which several applications are logging every change they make to a shared entity - each application is responsible for managing different aspects of this shared entity. The entity is persisted in a document-database, but each change is persisted in this ElasticSearch database.
I'm attempting to query for changes to a specific property (status) in order to track the lifecycle of these Product entities over time. I need to be able to dynamically answer questions like:
Over the last N weeks, what's the average time it took for a Product to move from status-"Created" to status-"Details Submitted"?
During a specific time range, what's the average time it took for a Product to move from status-"Reviewed" to status-"Available Online"?
How long did it take for Products in Group-A to move from status-"Details Submitted" to status-"Reviewed"?
In SQL I might use the group-by clause and perhaps some sub-queries, like:
select avg(submitted), avg(reviewed)
from (
    select id,
           max(timestamp) as reviewed,
           min(timestamp) as submitted,
           count(*) as statusChanges
    from changes
    where (
        (key = 'status' and previous = 'Created' and updated = 'Details Submitted')
        or (key = 'status' and previous = 'Details Submitted' and updated = 'Reviewed')
    ) and timestamp > ? and timestamp < ? and group_id = ?
    group by id
)
where statusChanges = 2
What's the best way to accomplish something comparable in ElasticSearch?
I've tried using a composite aggregation, which works decently when I need to examine the specific dates on which each Product changed its status, since it allows pagination. However, it doesn't allow any further sorting of results or any overall aggregation: you can only sort by the field you grouped by, and you can't aggregate across all products.
I've just recently come across the concept of a Transform index. Is that the best approach for aggregating the results of an aggregation? I haven't gotten access to try this out yet, but I'm attempting to formulate a potential Transform index now and struggling a bit.
Here's the composite query I was able to write for finding out how long each Product remained in a specific status, although I couldn't figure out how to get min_doc_count to work in a composite query:
// GET: https://<my-cluster-hostname>:9092/product-index/_search
{
"size": 0,
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match_phrase": {
"change.key": "status"
}
},
{
"match_phrase": {
"change.previousValue": "Created"
}
},
{
"match_phrase": {
"change.updatedValue": "Details Submitted"
}
}
]
}
},
{
"bool": {
"must": [
{
"match_phrase": {
"change.key": "status"
}
},
{
"match_phrase": {
"change.previousValue": "Details Submitted"
}
},
{
"match_phrase": {
"change.updatedValue": "Reviewed"
}
}
]
}
}
]
}
},
"aggs": {
"how-long-before-submitted-details-reviewed": {
"composite": {
"size": 20,
"after": {
"product": "<last_uuid_from_previous_page>"
},
"sources": [
{
"product": {
"terms": {
"field": "metadata.uuid.keyword",
"order": "desc"
}
}
}
]
},
"aggs": {
"detailsSubmitted": {
"min": {
"field": "timestamp"
}
},
"detailsReviewed": {
"max": {
"field": "timestamp"
}
}
}
}
}
}
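One way to get the overall average that the composite aggregation cannot produce on its own is to page through the composite buckets and finish the calculation client-side. The sketch below is only an illustration: it assumes buckets shaped like the response to the query above (`doc_count`, `detailsSubmitted.value`, `detailsReviewed.value`, with timestamps in epoch milliseconds), and the function name `average_transition_ms` and the sample data are hypothetical. It keeps only products with exactly two matching changes, mirroring `where statusChanges = 2` in the SQL.

```python
from typing import Iterable, Mapping, Optional


def average_transition_ms(buckets: Iterable[Mapping]) -> Optional[float]:
    """Average (max - min) timestamp over buckets that saw both status changes."""
    durations = []
    for bucket in buckets:
        # Stands in for min_doc_count / "where statusChanges = 2": skip products
        # that have only one of the two transitions in the queried range.
        if bucket["doc_count"] != 2:
            continue
        submitted = bucket["detailsSubmitted"]["value"]
        reviewed = bucket["detailsReviewed"]["value"]
        durations.append(reviewed - submitted)
    return sum(durations) / len(durations) if durations else None


# Hypothetical buckets collected from one or more composite pages.
sample_buckets = [
    {"key": {"product": "a"}, "doc_count": 2,
     "detailsSubmitted": {"value": 1000.0}, "detailsReviewed": {"value": 4000.0}},
    {"key": {"product": "b"}, "doc_count": 1,
     "detailsSubmitted": {"value": 2000.0}, "detailsReviewed": {"value": 2000.0}},
    {"key": {"product": "c"}, "doc_count": 2,
     "detailsSubmitted": {"value": 5000.0}, "detailsReviewed": {"value": 6000.0}},
]
print(average_transition_ms(sample_buckets))
```

Because the filtering and averaging happen outside Elasticsearch, this trades extra round trips for the ability to aggregate across all products; a Transform, as discussed next, would move that work back to the cluster.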
Here's the Transform Index I'm thinking of submitting. But I wonder if there's a way of getting it to cover all status changes, or if instead I'll need to create an index for each status change like this and then filter/sort/aggregate over this Transform Index:
// PUT: https://<my-cluster-hostname>:9092/_transform/details-submitted-to-reviewed
{
"source": {
"index": "product-index",
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match_phrase": {
"change.key": "status"
}
},
{
"match_phrase": {
"change.previousValue": "Created"
}
},
{
"match_phrase": {
"change.updatedValue": "Details Submitted"
}
}
]
}
},
{
"bool": {
"must": [
{
"match_phrase": {
"change.key": "status"
}
},
{
"match_phrase": {
"change.previousValue": "Details Submitted"
}
},
{
"match_phrase": {
"change.updatedValue": "Reviewed"
}
}
]
}
}
]
}
}
},
"dest": {
"index": "details-submitted-to-reviewed"
},
"pivot": {
"group_by": {
"product-id": {
"terms": {
"field": "metadata.uuid.keyword"
}
}
},
"aggregations": {
"detailsSubmitted": {
"min": {
"field": "timestamp"
}
},
"detailsReviewed": {
"max": {
"field": "timestamp"
}
}
}
}
}

Related

How to return results from elasticsearch after a threshold match

I have two queries as follows:
The first query returns the count of all documents per domain.
The second query returns the count where a field is empty.
Later, I filter the results in my backend: if the count of documents missing the field value for a domain exceeds a specific threshold, I keep that domain; otherwise I ignore it. Could these two queries be combined so that the threshold comparison happens in the query and only the matching results are returned?
The first query is as follows:
GET database/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"source": {
"value": "Web"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain_id"
}
}
}
}
The second query just applies a should filter as follows:
GET mapachitl/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"source": {
"value": "Web"
}
}
}
],
"should": [
{
"term": {
"address.city.keyword": {
"value": ""
}
}
},
{
"term": {
"address.zip.keyword": {
"value": ""
}
}
}
],
"minimum_should_match": 1
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain_id"
}
}
}
}
Can I return only those domains where the ratio of documents missing city or zip code is more than 25%? I read about scripting, but I'm not sure how to use it here.
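A possible way to fold both queries into one is a `filter` sub-aggregation that counts the documents missing city or zip per domain, followed by a `bucket_selector` pipeline aggregation that drops domains below the ratio. The sketch below only builds the request body as a Python dict; it assumes an Elasticsearch version with pipeline aggregations, and the names `build_missing_ratio_query`, `missing_address`, and `over_threshold` are made up for illustration.

```python
import json


def build_missing_ratio_query(threshold: float = 0.25) -> dict:
    """Build one search body; `threshold` is the minimum share of missing docs."""
    missing_city_or_zip = {
        "bool": {
            "should": [
                {"term": {"address.city.keyword": {"value": ""}}},
                {"term": {"address.zip.keyword": {"value": ""}}},
            ],
            "minimum_should_match": 1,
        }
    }
    return {
        "size": 0,
        "query": {"bool": {"must": [{"term": {"source": {"value": "Web"}}}]}},
        "aggs": {
            "domains": {
                "terms": {"field": "domain_id"},
                "aggs": {
                    # Per-domain count of documents missing city or zip.
                    "missing_address": {"filter": missing_city_or_zip},
                    # Keep only domains whose missing share exceeds the threshold.
                    "over_threshold": {
                        "bucket_selector": {
                            "buckets_path": {
                                "missing": "missing_address>_count",
                                "total": "_count",
                            },
                            "script": f"params.missing > params.total * {threshold}",
                        }
                    },
                },
            }
        },
    }


print(json.dumps(build_missing_ratio_query(), indent=2))
```

The selector only sees the buckets that the `terms` aggregation returns, so the `size` of that aggregation may need to be raised if there are many domains.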

Elasticsearch Remove duplicate results if greater than some value

I have news articles from multiple sources saved, and each source has a different category. I need to write a query that sorts the articles in reverse chronological order, in chunks of 15 at a time, and I don't want more than 3 articles from any particular source. I am using the query below, but the results are wrong. Can anyone tell me what I am doing wrong?
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"category": "Digital"
}
},
{
"match_phrase": {
"type": "Local"
}
}
]
}
},
"collapse": {
"field": "source.keyword",
"max_concurrent_group_searches": 3
},
"sort": [
{
"pub_date": {
"order": "desc"
}
}
]
}
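Note that `max_concurrent_group_searches` controls how many inner-hit searches run concurrently, not how many documents are kept per group. One possible approach, sketched below under assumptions, is to request up to 3 `inner_hits` per collapsed source and merge them client-side. The helper names, the inner-hits name `latest_per_source`, and the assumption that `pub_date` is an ISO-8601 string (so that string order equals time order) are all illustrative, and the merge is only meant for the first page of 15.

```python
def build_collapsed_query(per_source: int = 3, page_size: int = 15) -> dict:
    """Collapse on source and keep the newest `per_source` articles per group."""
    return {
        "size": page_size,
        "query": {"bool": {"must": [
            {"match_phrase": {"category": "Digital"}},
            {"match_phrase": {"type": "Local"}},
        ]}},
        "collapse": {
            "field": "source.keyword",
            "inner_hits": {
                "name": "latest_per_source",
                "size": per_source,
                "sort": [{"pub_date": {"order": "desc"}}],
            },
        },
        "sort": [{"pub_date": {"order": "desc"}}],
    }


def flatten_page(response: dict, page_size: int = 15) -> list:
    """Merge the per-source inner hits into one newest-first list."""
    articles = []
    for hit in response["hits"]["hits"]:
        articles.extend(hit["inner_hits"]["latest_per_source"]["hits"]["hits"])
    # Assumes pub_date values compare correctly as strings (ISO-8601).
    articles.sort(key=lambda a: a["_source"]["pub_date"], reverse=True)
    return articles[:page_size]
```

Each collapsed hit carries at most `per_source` articles, so the merged list can never contain more than 3 articles from the same source.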

Elasticsearch sum first occurrences of a day by term

I am trying to convert our MySQL statistics queries to our new Elasticsearch (version 5.4) server.
In MySQL we have a statistik table and a second table with only the first request of a user within a day.
Currently we're using PHP/MySQL to fill that second table and ignore all the other requests that are in the statistik table.
The query we're running over the table looks like this:
SELECT
SUM(price_displayed * (query_amount / pricing_unit)) AS `requests`,
SUM(price_displayed * (order_amount / pricing_unit)) AS `orders`,
EXTRACT(YEAR_MONTH FROM `ts`) AS `date`
FROM statistics_values
The goal is to get rid of the second table.
Is it possible to get only the first document per day and user, and then use a script to calculate the result as in the MySQL query?
I tried using a date_histogram aggregation with a terms aggregation, but it doesn't work.
The Query looks like this:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "(client_branch:DE*)"
}
},
{
"range": {
"created": {
"gte": "2017-01-01",
"lte": "2017-05-31",
"format": "date"
}
}
},
{
"term": {
"action": "query"
}
},
{
"term": {
"type": "enquiry"
}
}
],
"must_not": [
{
"term": {
"exclude_statistics": "1"
}
}
]
}
},
"aggs": {
"by_day": {
"date_histogram": {
"field": "created",
"interval": "day"
},
"aggs": {
"by_term": {
"terms": {
"field": "user_id"
}
}
}
}
},
"size": 0
}
Any ideas?
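To make the intended result concrete, here is a small client-side sketch of the calculation: keep the first request per user per day, then sum the `requests` expression from the MySQL query per month. It is not an Elasticsearch query. It assumes the documents carry `user_id`, an ISO-8601 `created` string, and the `price_displayed`, `query_amount`, and `pricing_unit` fields from the SQL; the function name is hypothetical.

```python
from collections import defaultdict


def monthly_request_totals(docs):
    """Sum the first request per user per day, grouped by YYYY-MM."""
    first_per_user_day = {}
    for doc in sorted(docs, key=lambda d: d["created"]):
        key = (doc["user_id"], doc["created"][:10])  # (user, YYYY-MM-DD)
        # setdefault keeps the earliest document; later ones that day are ignored.
        first_per_user_day.setdefault(key, doc)

    totals = defaultdict(float)
    for doc in first_per_user_day.values():
        month = doc["created"][:7]  # YYYY-MM
        totals[month] += doc["price_displayed"] * (doc["query_amount"] / doc["pricing_unit"])
    return dict(totals)


# Hypothetical documents for illustration.
sample_docs = [
    {"user_id": 1, "created": "2017-01-01T08:00:00", "price_displayed": 10.0, "query_amount": 4, "pricing_unit": 2},
    {"user_id": 1, "created": "2017-01-01T09:00:00", "price_displayed": 99.0, "query_amount": 1, "pricing_unit": 1},
    {"user_id": 2, "created": "2017-01-01T10:00:00", "price_displayed": 5.0, "query_amount": 2, "pricing_unit": 1},
    {"user_id": 1, "created": "2017-02-03T08:00:00", "price_displayed": 3.0, "query_amount": 3, "pricing_unit": 3},
]
print(monthly_request_totals(sample_docs))
```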

elasticsearch to apply a sort to a query, then select top N for aggregate

The query below aggregates over the entire result of the query, and size only affects what is returned rather than what is aggregated.
How would I modify the search so that only the top N results after the sort are processed by the average aggregation?
It seems such a simple requirement that I'm expecting it to be possible but so far all my efforts have failed, and similar questions on SO have gone unanswered.
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"jobType": "LiveEventScoring"
}
},
{
"term": {
"host": "MTVMDANS"
}
},
{
"term": {
"dataSourceCode": "AU_VIRT"
}
},
{
"term": {
"measurement": "EventDataLoadFromCacheDuration"
}
}
]
}
}
}
},
"sort": {
"timestamp": {
"order": "desc"
}
},
"aggs": {
"avgDuration": {
"avg": {
"field": "elapsedMs"
}
}
}
}
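Since the aggregation runs over every matching document regardless of `size` and `sort`, one fallback is to fetch only the newest N hits and average them client-side. The sketch below assumes the same filter as the query above and a standard search response shape; the helper names are hypothetical.

```python
def build_top_n_query(n: int) -> dict:
    """Same filter as above, but return the newest n hits instead of aggregating."""
    terms = {
        "jobType": "LiveEventScoring",
        "host": "MTVMDANS",
        "dataSourceCode": "AU_VIRT",
        "measurement": "EventDataLoadFromCacheDuration",
    }
    must = [{"term": {field: value}} for field, value in terms.items()]
    return {
        "size": n,
        "_source": ["elapsedMs"],  # only the field being averaged
        "query": {"constant_score": {"filter": {"bool": {"must": must}}}},
        "sort": {"timestamp": {"order": "desc"}},
    }


def average_elapsed_ms(response: dict):
    """Average elapsedMs over the hits of a search response; None if empty."""
    values = [hit["_source"]["elapsedMs"] for hit in response["hits"]["hits"]]
    return sum(values) / len(values) if values else None
```

This keeps the sort and the top-N cut in Elasticsearch and leaves only the final division to the client, which is reasonable while N stays small.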

How to do nested AND and OR filters in ElasticSearch?

My filters are grouped together into categories.
I would like to retrieve documents where a document can match any filter in a category, but if two (or more) categories are set, then the document must match any of the filters in ALL categories.
If written in pseudo-SQL it would be:
SELECT * FROM Documents WHERE (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
I've tried Nested filters like so:
{
"sort": [{
"orderDate": "desc"
}],
"size": 25,
"query": {
"match_all": {}
},
"filter": {
"and": [{
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"progress": "incomplete"
}
}, {
"term": {
"progress": "completed"
}
}]
}
}
}, {
"nested": {
"path":"hits._source",
"filter": {
"or": [{
"term": {
"paid": "yes"
}
}, {
"term": {
"paid": "no"
}
}]
}
}
}]
}
}
But evidently I don't quite understand the ES syntax. Is this on the right track or do I need to use another filter?
This should be it (translated from given pseudo-SQL)
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query":
{
"filtered":
{
"filter":
{
"and":
[
{ "term": { "CategoryA":"A" } },
{
"or":
[
{ "term": { "CategoryB":"B" } },
{ "term": { "CategoryB":"C" } }
]
}
]
}
}
}
}
I realize you're not mentioning facets but just for the sake of completeness:
You could also use a filter as the basis (like you did) instead of a filtered query (like I did). The resulting json is almost identical with the difference being:
a filtered query will filter both the main results as well as facets
a filter will only filter the main results NOT the facets.
Lastly, nested filters (which you tried using) don't relate to 'nesting filters' as you seemed to believe; they relate to filtering on nested documents (fields mapped with the nested type).
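For reference, the same pseudo-SQL can also be written with a `bool` query, which replaced the `filtered` query and the `and`/`or` filters in later releases. The sketch below assumes Elasticsearch 2.0 or later; the function name `build_category_query` is hypothetical. Each category becomes one `terms` clause (OR within the category), and the clauses sit together under `bool.filter` (AND across categories).

```python
def build_category_query(categories: dict) -> dict:
    """AND across categories, OR within each category's values."""
    filters = [
        {"terms": {field: list(values)}}
        for field, values in categories.items()
    ]
    return {
        "sort": [{"orderDate": "desc"}],
        "size": 25,
        "query": {"bool": {"filter": filters}},
    }


# (CategoryA = 'A') AND (CategoryB = 'B' OR CategoryB = 'C')
body = build_category_query({"CategoryA": ["A"], "CategoryB": ["B", "C"]})
print(body["query"])
```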
Although I have not completely understood your structure, this might be what you need.
You have to think tree-wise. You create a bool whose must clause (= and) requires every embedded bool to match. Each embedded bool checks that the field either does not exist or (using should here instead of must) has one of the values in the terms list.
I'm not sure whether there is a better way, and I don't know how this performs.
{
"sort": [
{
"orderDate": "desc"
}
],
"size": 25,
"query": {
"query": { #
"match_all": {} # These three lines are not necessary
}, #
"filtered": {
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "progress"
}
}
},
{
"terms": {
"progress": [
"incomplete",
"completed"
]
}
}
]
}
},
{
"bool": {
"should": [
{
"not": {
"exists": {
"field": "paid"
}
}
},
{
"terms": {
"paid": [
"yes",
"no"
]
}
}
]
}
}
]
}
}
}
}
}
