Nested query in Elasticsearch? - elasticsearch

My team owns several dashboards and is considering moving to Elasticsearch in order to consolidate our software stacks. One common type of chart we expose is "What is the pending workflow count at the end of each day?". Here is some example data:
day workflow_id version status
20151101 1 1 In Progress
20151101 2 1 In Progress
20151102 1 2 In Progress
20151102 3 1 In Progress
20151102 4 1 In Progress
20151102 2 2 Completed
20151103 1 3 Completed
20151103 3 2 In Progress
20151104 3 3 Completed
20151105 4 2 Completed
Every time something changes in a workflow, a new record is inserted, which may or may not change the status. The record with max(version) is the most recent data for that workflow_id.
The goal is a chart showing the total number of 'In Progress' and 'Completed' workflows at the end of each day. For each day, only the record with the largest version number up to that day should count. This can be done in SQL with nested queries:
with
snapshot_dates as
(select distinct day from workflow),
snapshot as
(select d.day, w.workflow_id, max(w.version) as max_version
from snapshot_dates d, workflow w
where d.day >= w.day
group by d.day, w.workflow_id
order by d.day, w.workflow_id)
select s.day, w.status, count(1)
from workflow w join snapshot s on w.workflow_id=s.workflow_id and w.version = s.max_version
group by s.day, w.status
order by s.day, w.status;
Here is the expected output from the query:
day,status,count
20151101,In Progress,2
20151102,Completed,1
20151102,In Progress,3
20151103,Completed,2
20151103,In Progress,2
20151104,Completed,3
20151104,In Progress,1
20151105,Completed,4
I am still new to Elasticsearch and wonder whether Elasticsearch can express a similar query without application-side logic, by properly defining the mapping and query. More generally, what is the best practice for solving this kind of problem with Elasticsearch?

I tried to find a solution using the bucket selector aggregation, but got stuck at one point. I discussed the same problem in the Elasticsearch forum. The following is what Christian Dahlqvist suggested:
In addition to this you also index the record into a workflow-centric
index with a unique identifier, e.g. workflow id, as the document id.
If several updates come in for the same workflow, each will result in
an update and the latest state will be preserved. Running aggregations
across this index to find the current or latest state will be
considerably more efficient and scalable as you only have a single
record per workflow and do not need to filter out documents based on
relationships to other documents.
So, as per this suggestion, you should use the workflow id as the document id while indexing. Whenever there is an update for that workflow, you update the new version and date using the workflow id. Let's say the index name is workflow and the type is workflow_status. The mapping of this workflow_status type will be as follows:
{
"workflow_status": {
"properties": {
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"status": {
"type": "string",
"index": "not_analyzed"
},
"version": {
"type": "long"
},
"workFlowId": {
"type": "long"
}
}
}
}
Keep adding/updating documents in this type, using workFlowId as the document id.
Now, to show a chart per day, you need another type, say per_day_workflow, with the following mapping:
{
"per_day_workflow": {
"properties": {
"date": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"in_progress": {
"type": "long"
},
"completed": {
"type": "long"
}
}
}
}
This index will hold the data for each day. So you need a job which runs at the end of each day and fetches the total number of "In Progress" and "Completed" workflows from the workflow_status type using the following aggregation search:
POST http://localhost:9200/workflow/workflow_status/_search?search_type=count
{
"aggs": {
"per_status": {
"terms": {
"field": "status"
}
}
}
}
The response will look as follows (I ran it for 2015-11-02 on your sample data):
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"per_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "In Progress",
"doc_count": 3
},
{
"key": "Completed",
"doc_count": 1
}
]
}
}
}
From this response you extract the In Progress and Completed counts and index them into the per_day_workflow type with today's date.
Whenever you need per-day data for your graph, you can then fetch it directly from the per_day_workflow type.
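The extraction step of that end-of-day job can be sketched in Python (a minimal sketch; build_per_day_doc is a hypothetical helper name, and actually indexing the resulting document into per_day_workflow is left to your client of choice):

```python
def build_per_day_doc(agg_response, day):
    """Turn the per_status terms-aggregation response into a
    per_day_workflow document (field names as in the mapping above)."""
    counts = {b["key"]: b["doc_count"]
              for b in agg_response["aggregations"]["per_status"]["buckets"]}
    return {
        "date": day,
        "in_progress": counts.get("In Progress", 0),
        "completed": counts.get("Completed", 0),
    }

# Aggregation fragment from the sample response above (2015-11-02):
response = {
    "aggregations": {
        "per_status": {
            "buckets": [
                {"key": "In Progress", "doc_count": 3},
                {"key": "Completed", "doc_count": 1},
            ]
        }
    }
}

doc = build_per_day_doc(response, "2015-11-02")
# doc == {"date": "2015-11-02", "in_progress": 3, "completed": 1}
```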

Related

Sorting a set of results with pre-ordered items

I have a list of pre-ordered items (order by score ASC) like:
[{
"id": "id2",
"score": 1
}, {
"id": "id12",
"score": 1
}, {
"id": "id8",
"score": 1.4
}, {
"id": "id9",
"score": 1.4
}, {
"id": "id14",
"score": 1.75
}, {
...
}]
Let's say I have an Elasticsearch index with a massive number of items. Note that there is no "score" field in the indexed documents.
Now I want Elasticsearch to return only those items whose ids are in the said list. OK, this part is easy. I'm stuck at sorting the result: I need the result sorted exactly as in my pre-ordered list above.
Any suggestion for me to achieve that?
I'm not an English native speaker, so sorry for my grammar and words.
As of version 7.4, Elastic introduced the pinned query, which promotes selected documents so they rank higher than those matching a given query. In your case, this search query should return what you want:
GET /_search
{
"query": {
"pinned" : {
"ids" : ["id2", "id12", "id8"],
"organic" : {
other queries
}
}
}
}
For more information, you can check the official Elasticsearch documentation on the pinned query.
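On clusters older than 7.4, where the pinned query is unavailable, one workaround is to fetch the matching hits (e.g. with an ids query) and reorder them client-side by their position in the pre-ordered list. A minimal Python sketch (sort_hits_by_id_order is a hypothetical helper; hits not in the list are kept last in their original order, since Python's sort is stable):

```python
def sort_hits_by_id_order(hits, ordered_ids):
    """Reorder search hits to match a pre-ordered list of ids.
    Hits whose id is not in the list go last, keeping their ES order."""
    rank = {doc_id: i for i, doc_id in enumerate(ordered_ids)}
    return sorted(hits, key=lambda h: rank.get(h["_id"], len(rank)))

hits = [{"_id": "id8"}, {"_id": "id2"}, {"_id": "id12"}]
ordered = ["id2", "id12", "id8"]
result = sort_hits_by_id_order(hits, ordered)
# result ids are now ["id2", "id12", "id8"]
```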

Elastic Search Limiting the records that are aggregated

I am running an Elasticsearch query with an aggregation, which I intend to limit to, say, 100 records. The problem is that even when I apply the "size" parameter, it has no effect on the aggregation.
GET /index_name/index_type/_search
{
"size":0,
"query":{
"match_all": {}
},
"aggregations":{
"courier_code" : {
"terms" : {
"field" : "city"
}
}
}}
The result set is
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"hits": {
"total": 10867,
"max_score": 0,
"hits": []
},
"aggregations": {
"city": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Mumbai",
"doc_count": 2706
},
{
"key": "London",
"doc_count": 2700
},
{
"key": "Patna",
"doc_count": 1800
},
{
"key": "New York",
"doc_count": 1800
},
{
"key": "Melbourne",
"doc_count": 900
}
]
}
}
}
As you can see, limiting size has no effect on which records the aggregation runs over. Is there a way to aggregate over only, say, the top 100 records in Elasticsearch?
Search operations in Elasticsearch are performed in two phases: query and fetch. During the first phase Elasticsearch obtains results from all shards, sorts them, and determines which records should be returned. These records are then retrieved during the second phase. The size parameter only controls the number of records returned to you in the response. Aggregations are executed during the first phase, before Elasticsearch knows which records will be retrieved, and they always run on all records matched by the search. So it's not possible to limit them by the total number of results. If you want to limit the scope of aggregation execution, you need to restrict the search query itself instead of changing retrieval parameters. For example, if you add a filter to your search query that only matches records from the last year, aggregations will be executed only on those records.
It's also possible to limit the number of records analyzed on each shard using the terminate_after parameter; however, you will have no control over which records are included in and excluded from the results, so this option is most likely not what you want.
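For example, restricting the query restricts the aggregation too. This body (a sketch; the created_at field name is an assumption, sent to the same _search endpoint as above) aggregates only over records from the last year:

```json
{
  "size": 0,
  "query": {
    "range": { "created_at": { "gte": "now-1y" } }
  },
  "aggregations": {
    "city": {
      "terms": { "field": "city" }
    }
  }
}
```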

Elasticsearch - aggregation of unique counts

I have an Elasticsearch database of books:
{
"id": 1,
"name": "Animal Farm"
},
{
"id": 2,
"name": "Brave New World"
},
{
"id": 3,
"name": "Nineteen Eighty-Four"
},
{
"id": 4,
"name": "Animal Farm"
},
{
"id": 5,
"name": "We"
}
As you can see, the books with ids 1 and 4 have the conflicting book name "Animal Farm". However, they are different books: one is by George Orwell, and the other is literally about farm animals.
I want to know how often book names conflict. For the example above, the expected result is:
{
"conflicts": [
{
"num_of_books": 2,
"count": "1"
},
{
"num_of_books": 1,
"count": "3"
}
]
}
The entry with num_of_books of 2 is the conflict of "Animal Farm", which happened once (therefore the count is 1). The other three books all have different names, so they appear in the entry with num_of_books of 1 and a count of 3. I don't need the names of the books; only the counts matter.
I know SQL has "subquery" to do this:
SELECT num_of_books, COUNT(*) AS _count
FROM (
SELECT COUNT(*) AS num_of_books
FROM books
GROUP BY name
)
GROUP BY num_of_books;
I read the articles on Nested Aggregation and Sub-Aggregations, but couldn't see a way to achieve my goal.
Any comment will help, thanks!
Running aggregations on the results of aggregations is not yet possible in ES, as far as I know. There are a few outstanding issues about applying additional logic to the results of bucket aggregations, but they are still being discussed and debated.
In your case, you can replicate the inner SQL query with a terms aggregation, getting the names of all conflicting books by using min_doc_count: 2.
{
"size": 0,
"aggs": {
"books": {
"terms": {
"field": "name",
"min_doc_count": 2
}
}
}
}
Then you can parse the buckets on the client side and re-bucket them into new num_of_books buckets depending on their count (drop min_doc_count, or set it to 1, if you also need the buckets for non-conflicting names). For instance, using the head plugin you can add the following code in the Transform section:
var num_of_books = {};
root.aggregations.books.buckets.forEach(function(b) {
num_of_books[b.doc_count] = (num_of_books[b.doc_count] || 0) + 1;
});
return num_of_books;
num_of_books would then contain something like this (with min_doc_count: 1, so that non-conflicting names are included):
{
"2": 1,
"1": 3
}
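Outside the head plugin, the same re-bucketing is straightforward in any client. A minimal Python sketch over the terms buckets (assuming the aggregation was run with min_doc_count: 1 so singletons are included):

```python
def conflict_histogram(buckets):
    """Count how many distinct names occur N times: doc_count -> number
    of names with that doc_count (i.e. num_of_books -> count)."""
    hist = {}
    for b in buckets:
        hist[b["doc_count"]] = hist.get(b["doc_count"], 0) + 1
    return hist

# Terms buckets on "name" for the five sample books:
buckets = [
    {"key": "Animal Farm", "doc_count": 2},
    {"key": "Brave New World", "doc_count": 1},
    {"key": "Nineteen Eighty-Four", "doc_count": 1},
    {"key": "We", "doc_count": 1},
]
# conflict_histogram(buckets) == {2: 1, 1: 3}
```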

Elasticsearch - aggregating multi level hierarchy

I am facing a problem providing aggregated search results for documents with a multi-level hierarchy. The simplified document structure looks like this:
Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)
Every level of the hierarchy is mapped to its parent by the attribute "parentDocumentId".
I have prepared a simple query which works just fine for a hierarchy with only 2 levels:
POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
"query": {
"multi_match" : {
"query": "hunter",
"fields": [ "title", "text", "labels" ]
}
},
"aggregations": {
"my_agg": {
"terms": {
"field": "parentDocumentId"
}
}
}
}
This query searches through the text of the pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents aggregated by parentDocumentId. However, these buckets represent only the "Magazine issues" which contain these pages.
Response:
{
"took": 54,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 44,
"max_score": 0,
"hits": []
},
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 5,
"doc_count": 43
},
{
"key": 0,
"doc_count": 1
}
]
}
}
}
What I need is to be able to aggregate search results on the highest possible level; in this particular case, on the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should definitely be done in Elasticsearch (for performance and other reasons).
Does anybody have experience with a similar aggregation? Are Elasticsearch aggregations the right approach here?
Every idea is welcome.
Thanks
Peter
Update:
Our mapping looks like this:
{
"my_index": {
"mappings": {
"document": {
"properties": {
"dateIssued": {
"type": "date",
"format": "dateOptionalTime"
},
"documentId": {
"type": "long"
},
"filter": {
"properties": {
"geo_bounding_box": {
"properties": {
"issuedLocation": {
"properties": {
"bottom_right": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
},
"top_left": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
}
}
}
}
}
}
},
"issuedLocation": {
"type": "geo_point"
},
"labels": {
"type": "string"
},
"locationLinks": {
"type": "geo_point"
},
"parentDocumentId": {
"type": "long"
},
"query": {
"properties": {
"match_all": {
"type": "object"
}
}
},
"storedLocation": {
"type": "geo_point"
},
"text": {
"type": "string"
},
"title": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press. Sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the pages level.
To distinguish the type of a document there is an attribute "type".
When indexing the top levels (these contain mostly book metadata) we leave the "text" attribute empty, always specifying the parent of the document via parentDocumentId. The top-level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only the text attribute and the parentDocumentId.
The link used is very similar to a classic one-to-many mapping (a magazine has many years, which have many issues, which have many pages).
You could also say that we have flattened the nested documents in Elasticsearch; the reason for this is that there are multiple document types which can have different depths of hierarchy.
You need to rethink your data modelling. In essence, you need a join over your data, and moreover the join needs to span an arbitrarily deep hierarchy. That is a problem even in relational databases, let alone in a full-text search engine like Elasticsearch.
Elasticsearch does support a couple of kinds of joins. You could use nested documents: a single document with all the subdocuments nested inside. That's clearly not ideal in your case.
You could use the parent-child relationship feature, which lets you index your (sub-)documents separately, always referring to their parent. However, to aggregate over a hierarchy, you would have to specify the join explicitly, listing all the intermediate steps. You want to always aggregate by the top-most available document, but that could be a different level each time (once a magazine, another time a magazine collection, or perhaps a publisher).
I would consider indexing each document with a field pointing to the top-most document. Then you can easily aggregate by that field. It would mean precomputing part of the complex aggregation you want, but it would result in fast aggregations, and updates wouldn't be very painful either. It all depends on the source of your data, how you expect it to change, and what updates and other queries you'll need.
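That precomputation can be sketched as a tiny helper that resolves each document's top-most ancestor before indexing (a sketch; top_most_id and the topDocumentId field are hypothetical names, and parent links are assumed to terminate at parentDocumentId 0 as described in the question):

```python
def top_most_id(doc_id, parent_of):
    """Walk parentDocumentId links up to the root.
    parent_of maps documentId -> parentDocumentId; 0 marks a root."""
    while parent_of.get(doc_id, 0) != 0:
        doc_id = parent_of[doc_id]
    return doc_id

# Hypothetical hierarchy: title(1) -> year(2) -> issue(5) -> page(7)
parent_of = {1: 0, 2: 1, 5: 2, 7: 5}
# top_most_id(7, parent_of) == 1
# -> index the page with an extra field, e.g. topDocumentId: 1,
#    then run the terms aggregation on topDocumentId instead of
#    parentDocumentId.
```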
This blog post could help to guide you a bit too: https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Can I use ElasticSearch Facets as an equivalent to GROUP BY and how?

I'm wondering whether I can use the Elasticsearch facets feature to replace the GROUP BY feature of relational databases, or of a Sphinx client?
If so, besides the official documentation, can someone point out a good tutorial?
EDIT :
Let's consider an SQL table products in which I have the following fields :
id
title
description
price
etc.
I omitted the other fields in the table because I don't want to put them into my ES index.
I've indexed my database with ElasticSearch.
A product is not unique in the index. We can have the same product with different price offers, and I wish to group them by price range.
Facets give you the number of docs in which a particular term is present for a particular field...
Now let's suppose you have an index named tweets, with type tweet and a field "name"...
A facet query for the field "name" would be:
curl -XPOST "http://localhost:9200/tweets/tweet/_search?search_type=count" -d'
{
"facets": {
"name": {
"terms": {
"field": "name"
}
}
}
}'
The response you get is as below:
"hits": {
"total": 3475368,
"max_score": 0,
"hits": []
},
"facets": {
"name": {
"_type": "terms",
"total": 3539206,
"other": 3460406,
"terms": [
{
"term": "brickeyee",
"count": 9205
},
{
"term": "ken_adrian",
"count": 9160
},
{
"term": "rhizo_1",
"count": 9143
},
{
"term": "purpleinopp",
"count": 8747
}
....
....
This is called a terms facet, as it is a term-based count. There are other facets too, described in the facets documentation.
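Note that facets were deprecated in Elasticsearch 1.x and removed in 2.0; the terms aggregation is the modern replacement. A sketch of the equivalent request for the same index and field (body sent to the _search endpoint, with "size": 0 instead of search_type=count):

```json
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": { "field": "name" }
    }
  }
}
```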
