Elasticsearch filter multiple terms with only matching results and not any of them - elasticsearch

How can I get only the results that match all of the terms in a multi-term search? I have this sample table, where titleid is mapped as an integer field and personid as a keyword:
titleid:1,personid:a
titleid:3,personid:a
titleid:1,personid:b
titleid:2,personid:b
titleid:1,personid:c
titleid:5,personid:c
The expected result is:
titleid:1
With a sample query like this one:
{
"query": {
"bool": {
"filter": [
{ "terms": { "personid": ["a", "b", "c"] } }
]
}
}
}
I have the following results:
titleid: 1,2,3,5
Maybe this will help: I wrote the query in SQL and got the expected result. I count the rows per titleid and keep only the titleids whose count equals the number of searched parameters. This is only to make the question clearer; the idea is to do it in Elasticsearch.
select titleid
from (
select count(titleid) as title_count, titleid
from table1
where personid in ('a','b','c')
group by titleid
) as vw
where title_count = 3

If you only want records with titleid == 1 AND personid == 'a', you can filter on both fields. Only the bool query uses must, should, and must_not. A filter removes non-matching documents by definition, so every clause inside it already behaves like a must.
"query": {
"bool": {
"filter": [
{
"term": {
"titleid": { "value": 1 }
}
},
{
"term": {
"personid": { "value": "a" }
}
}
]
}
}
UPDATE:
Your question now looks like you want to filter your results, group them, and then filter again on the grouped counts. There are several metric and bucket aggregations that can help.
Using a bucket selector aggregation (this isn't tested, but it should be very close if not correct). The bucket selector has to sit inside a multi-bucket aggregation, here a terms aggregation on titleid, and with Painless the bucket value is read through params:
{
"size": 0,
"aggs": {
"title_id": {
"filter": { "terms": { "personid": ["a", "b", "c"] } },
"aggs": {
"by_title": {
"terms": { "field": "titleid" },
"aggs": {
"count_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count == 3"
}
}
}
}
}
}
}
}
However, be aware that pipeline aggregations work on the outputs produced by other aggregations, so the overall amount of work needed to calculate the initial doc_counts stays the same. Since the script needs to be executed for each input bucket, the operation might be slow for high-cardinality fields with many thousands of terms.
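As a rough sketch of the bucket selector idea above, the request body can be built in Python so that the expected count always equals the number of person IDs searched. The function name `build_match_all_persons_query` is hypothetical, the field names come from the question, and the `params.` prefix assumes the Painless scripting language. It also assumes each (titleid, personid) pair is indexed at most once; with duplicates, `_count` would overcount.

```python
import json


def build_match_all_persons_query(person_ids):
    """Build a search body that keeps only titleid buckets matched by every person.

    Hypothetical helper: field names (personid, titleid) follow the question.
    """
    wanted = sorted(set(person_ids))  # de-duplicate so the expected count is right
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {"bool": {"filter": [{"terms": {"personid": wanted}}]}},
        "aggs": {
            "title_id": {
                "terms": {"field": "titleid"},
                "aggs": {
                    "count_filter": {
                        "bucket_selector": {
                            "buckets_path": {"the_doc_count": "_count"},
                            # Assumes Painless, where bucket values are read via params
                            "script": f"params.the_doc_count == {len(wanted)}",
                        }
                    }
                },
            }
        },
    }


body = build_match_all_persons_query(["a", "b", "c"])
print(json.dumps(body, indent=2))
```

Keep in mind that a terms aggregation returns only its top buckets (10 by default), so the `size` of the terms aggregation may need raising when many titleids match.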

Related

How to aggs for group by result in elasticsearch

There is an index: person
"_source" : {
"id" : 304028598,
"name" : "aaa"
},
I want to get the following information:
1. average count per name
2. max count one name can have
In SQL I could get this information with the following query:
select max(count), avg(count), min(count) from (
select name, count(*) count from t group by name
);
But how can I implement this in Elasticsearch?
The answer to this question relies on Pipeline aggregations -- these aggregations operate on the output of another aggregation.
For example, we have many documents, each with a different hostVersion and use the following to find the max, min and average number of documents per host version:
"aggs": {
"per_hostver": {
"terms": {
"field": "hostVersion"
}
},
"avg_docs_per_version": {
"avg_bucket": {
"buckets_path": "per_hostver>_count"
}
},
"max_docs_per_version": {
"max_bucket": {
"buckets_path": "per_hostver>_count"
}
},
"min_docs_per_version": {
"min_bucket": {
"buckets_path": "per_hostver>_count"
}
}
}
The syntax per_hostver>_count refers to the _count field generated by each bucket of the aggregation per_hostver. _count is how you refer to the special document count field generated by all ES aggregations.
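To make the semantics concrete, here is a minimal in-memory sketch of what the terms aggregation plus avg_bucket, max_bucket, and min_bucket compute. It does not talk to Elasticsearch; the function name `bucket_count_stats` and the sample names are made up for illustration.

```python
from collections import Counter
from statistics import mean


def bucket_count_stats(values):
    """Mimic terms + avg_bucket/max_bucket/min_bucket on an in-memory list."""
    counts = Counter(values)  # one "bucket" per distinct value, holding its _count
    per_bucket = list(counts.values())
    return {
        "avg": mean(per_bucket),
        "max": max(per_bucket),
        "min": min(per_bucket),
    }


names = ["aaa", "aaa", "bbb", "ccc", "ccc", "ccc"]  # made-up sample data
stats = bucket_count_stats(names)
print(stats)
```

One difference from this sketch: in Elasticsearch the pipeline aggregations only see the buckets that the terms aggregation actually returns, so with more distinct values than the terms `size` the statistics cover only the top buckets.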

Return distinct values in Elasticsearch

I am trying to solve an issue where I have to get distinct results in a search.
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
When I perform a term query on favorite_cars for "ferrari", I get two results whose name is ABC. I simply want a single result to be returned in this case. So my requirement is to apply a distinct on the name field so that I receive only one result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
"size": 0,
"query": {
"term": {
"favorite_cars": "ferrari"
}
},
"aggs": {
"names": {
"terms": {
"field": "name"
},
"aggs": {
"single_result": {
"top_hits": {
"size": 1
}
}
}
}
}
}
That way, you'll get a single bucket for the term ABC, with a single matching document nested inside it.
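On the client side, the distinct documents then have to be pulled out of the buckets. The following is a small sketch of that step, assuming the response has been parsed into a Python dict and uses the aggregation names from the query above; `first_hit_per_bucket` is a hypothetical helper and the sample response is hand-written and trimmed.

```python
def first_hit_per_bucket(response, agg_name="names", hits_name="single_result"):
    """Return one _source document per terms bucket from a search response dict."""
    docs = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # defensive: skip a bucket that carries no hit
            docs.append(hits[0]["_source"])
    return docs


# Hand-written response fragment, trimmed to the fields used here
sample_response = {
    "aggregations": {
        "names": {
            "buckets": [
                {
                    "key": "ABC",
                    "doc_count": 2,
                    "single_result": {
                        "hits": {
                            "hits": [
                                {"_source": {"name": "ABC",
                                             "favorite_cars": ["ferrari", "toyota"]}}
                            ]
                        }
                    },
                }
            ]
        }
    }
}

distinct_docs = first_hit_per_bucket(sample_response)
print(distinct_docs)
```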

how to achieve an exists filter on ES5.0?

The exists filter has been replaced by an exists query in ES5.0.
So how can we achieve the equivalent within the same query? In other words, we don't want to run two queries, just one that carries the various aggregations, including the exists count.
So I want to count the number of times the field "the_field" exists (or is not null):
"aggs":{
"exists_count":{
"filter":{
"exists":{
"field":"the_field"
}
}
}
}
I think you can use a stats aggregation:
{ "aggs" :
{ "time_stats" :
{ "extended_stats" :
{ "field" : "time" }
}
}
}
See the Elasticsearch stats aggregation documentation.
With Elasticsearch 5.0, filters weren't so much replaced by queries as merged with them. Syntactically they look the same, but the context in which you use a clause determines whether it is interpreted as a query (it factors into scoring) or as a filter that simply weeds out documents. The code below should achieve exactly what you want:
{
"query": {
"match_all": {}
},
"aggs": {
"field_exists": {
"filter": {
"exists": {
"field": "name"
}
}
}
}
}
The aggregation returned will look something like this, with doc_count representing the number of documents where the "name" field exists. Hope this helps!
{
"aggregations": {
"field_exists": {
"doc_count": 11984
}
}
}
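Since the goal was to get several counts from a single request, the same filter aggregation can be repeated per field. A minimal sketch in Python, where `build_exists_counts` and the `<field>_exists` naming scheme are assumptions made for illustration:

```python
def build_exists_counts(fields):
    """Build one search body that counts, per field, the documents where it exists."""
    return {
        "size": 0,  # skip the hits, only the aggregation counts are wanted
        "query": {"match_all": {}},
        "aggs": {
            # One filter aggregation per field; its doc_count is the exists count
            f"{field}_exists": {"filter": {"exists": {"field": field}}}
            for field in fields
        },
    }


exists_body = build_exists_counts(["the_field", "name"])
print(exists_body)
```

Each entry under `aggregations` in the response then carries its own `doc_count`, in the same shape as the response shown above.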

Multiple OR filter in Elasticsearch

Hello, I'm having trouble deciding whether the following query is correct for multiple ORs in Elasticsearch. I want to select all the unique data (not a count, but all the rows).
My best attempt at this as an Elasticsearch query is:
GET mystash/_search
{
"aggs": {
"uniques":{
"filter":
{
"or":
[
{ "term": { "url.raw" : "/a.json" } },
{ "term": { "url.raw" : "/b.json" } },
{ "term": { "url.raw" : "/c.json"} },
{ "term": { "url.raw" : "/d.json"} }
]
},
"aggs": {
"unique" :{
"terms" :{
"field" : "id.raw",
"size" : 0
}
}
}
}
}
}
The equivalent SQL would be
SELECT DISTINCT id
FROM json_record
WHERE
json_record.url = 'a.json' OR
json_record.url = 'b.json' OR
json_record.url = 'c.json' OR
json_record.url = 'd.json'
I was wondering whether the query above is correct, since the data will be needed for report generations.
Some remarks:
You should use a query-level filter instead of a filter aggregation. As written, your query matches all documents.
You can replace your or+term filters with a single terms filter.
You could set size=0 at the root of the request to get only the aggregation result and no search hits.
Example code:
{"size":0,
"query" :{"filtered":{"filter":{"terms":{"url":["a", "b", "c"]}}}},
"aggs" :{"unique":{"terms":{"field":"id", "size" :0}}}
}
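The example above uses the `filtered` query and `"size": 0` inside the terms aggregation, both of which were removed in Elasticsearch 5.0. For newer versions, a sketch of the same three remarks could look like the following, where `build_distinct_ids_query` and its `max_ids` parameter are hypothetical and the field names are taken from the question:

```python
def build_distinct_ids_query(urls, max_ids=1000):
    """Build a body returning the distinct id.raw values for documents matching any URL."""
    return {
        "size": 0,  # no search hits, only the aggregation
        # A single terms filter replaces the chain of or+term filters
        "query": {"bool": {"filter": [{"terms": {"url.raw": list(urls)}}]}},
        # An explicit bucket limit stands in for the old "size": 0 (assumption: 1000 is enough)
        "aggs": {"unique": {"terms": {"field": "id.raw", "size": max_ids}}},
    }


distinct_body = build_distinct_ids_query(["/a.json", "/b.json", "/c.json", "/d.json"])
print(distinct_body)
```

If the number of distinct ids can exceed `max_ids`, the result is truncated, so the limit should be chosen with the report's data in mind.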

Elasticsearch grouping facet by owner, mine vs others

I am using Elasticsearch to index documents that have an owner which is stored in a userId property of the source object. I can easily do a facet on the userId and get facets for each owner that there is, but I'd like to have the facets for owner show up like so:
Documents owned by me (X)
Documents owned by others (Y)
I could handle this on the client side and take all of the facets returned by elasticsearch and go through them and figure out those owned by the current user and not and display it appropriately, but I was hoping there was a way to tell elasticsearch to handle this in the query itself.
You can use filtered facets to do this:
curl -XGET "http://localhost:9200/_search" -d'
{
"query": {
"match_all": {}
},
"facets": {
"my_docs": {
"filter": {
"term": { "user_id": "my_user_id" }
}
},
"others_docs": {
"filter": {
"not": {
"term": { "user_id": "my_user_id" }
}
}
}
}
}'
One of the nice things about this is that the two term filters are identical and so are only executed once. The not filter just inverts the results of the cached term filter.
You're right, Elasticsearch has a way to do that. Take a look at scripted terms facets, especially the second example ("using the boolean feature"). You should be able to do something like:
{
"query" : {
"match_all" : { }
},
"facets" : {
"userId" : {
"terms" : {
"field" : "userId",
"size" : 10,
"script" : "term == '<your user id>' ? true : false"
}
}
}
}
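Facets were removed in Elasticsearch 2.0 in favor of aggregations, so on a newer cluster the same "mine vs others" split would have to be expressed with filter aggregations. A minimal sketch under that assumption, where `build_owner_split_aggs` is a hypothetical helper and `userId` is the field name from the question:

```python
def build_owner_split_aggs(my_user_id, field="userId"):
    """Build a body counting documents owned by one user versus everyone else."""
    mine = {"term": {field: my_user_id}}
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "my_docs": {"filter": mine},
            # must_not inverts the same term clause; this bucket also
            # includes documents that have no owner field at all
            "others_docs": {"filter": {"bool": {"must_not": [mine]}}},
        },
    }


owner_body = build_owner_split_aggs("my_user_id")
print(owner_body)
```

The two `doc_count` values in the response correspond to the X and Y in "Documents owned by me (X)" and "Documents owned by others (Y)".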

Resources