Selecting all the results from a bucket using TopHits aggregation - elasticsearch

I am using a TopHits aggregation inside a Terms aggregation to fetch the records, as shown in the query below.
{
  "aggregations": {
    "group by": {
      "terms": {
        "field": "City"
      },
      "aggregations": {
        "top": {
          "top_hits": {
            "size": 200
          }
        }
      }
    }
  }
}
I want to fetch all the records in each bucket instead of only the top 200, but as the value of size increases, the query time also increases for the same indexed data (the same number of records).
So I cannot simply set size to an arbitrarily large number, because it hurts query time.
Is there any way to achieve this efficiently?
Thanks.

In Elasticsearch the size parameter defaults to 10, so only 10 documents are returned unless you raise it, and large values become expensive.
Deep pagination with from and size is the clearest example. A request such as ?size=10&from=10000 is very inefficient: every shard has to produce its top from + size sorted results (10,010 here), and all of them have to be re-sorted just to return 10 results. This process has to be repeated for every page requested.
In this case you should use the scroll API. From the documentation:
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
In your case you should use scan and scroll, as below:
curl -s -XGET 'localhost:9200/logs/syslogs/_search?scroll=10m&search_type=scan' -d '{
  "aggregations": {
    "group by": {
      "terms": {
        "field": "City"
      },
      "aggregations": {
        "top": {
          "top_hits": {
            "size": 200
          }
        }
      }
    }
  }
}'
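The query above returns a scroll ID. Pass that scroll ID to the scroll endpoint to fetch the next batch:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d '<scroll_id>'
Note that search_type=scan was removed in Elasticsearch 5.0, and that scrolling pages through the hits only: aggregation results come back with the first response, not with the following scroll requests.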

Related

Elastic search : query to get all elements

I can't get all the items; the maximum I can reach is size: 10000.
Thanks.
Error: [query_phase_execution_exception] Result window is too large,
from + size must be less than or equal to: [10000] but was [90000].
See the scroll API for a more efficient way to request large data
sets. This limit can be set by changing the [index.max_result_window]
index level parameter.
Any idea how I can solve it?
GetTweets: function (callback) {
  client.search({
    index: 'twitter',
    type: 'tweet',
    size: 10000,
    body: {
      query: {
        "query": {
          "match_all": {}
        }
      }
    }
  }, function (err, resp, status) {
    callback(err, resp);
  });
},
search_after can be used for pagination. It is more efficient than the Scroll API.
GET twitter/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "search_after": [1463538857, "654323"],
  "sort": [
    {"date": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
ES docs:
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher
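To walk through every page with search_after, the sort values of the last hit on one page become the cursor for the next request. Here is a minimal sketch of that loop in Python, assuming a search callable that takes a request body and returns a response shaped like Elasticsearch's. The names iter_search_after, fake_search and TWEETS are hypothetical; fake_search is only an in-memory stand-in so the snippet runs without a cluster.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical sample data standing in for the "twitter" index.
TWEETS = [
    {"date": 3, "tie_breaker_id": "654325"},
    {"date": 1, "tie_breaker_id": "654321"},
    {"date": 2, "tie_breaker_id": "654323"},
    {"date": 2, "tie_breaker_id": "654322"},
    {"date": 5, "tie_breaker_id": "654324"},
]


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for a search call (assumption, not a real client)."""
    docs = sorted(TWEETS, key=lambda d: (d["date"], d["tie_breaker_id"]))
    after = body.get("search_after")
    if after is not None:
        # Keep only documents that sort strictly after the cursor.
        docs = [d for d in docs if [d["date"], d["tie_breaker_id"]] > after]
    page = docs[: body["size"]]
    hits = [{"_source": d, "sort": [d["date"], d["tie_breaker_id"]]} for d in page]
    return {"hits": {"hits": hits}}


def iter_search_after(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    page_size: int = 10,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, page by page, using search_after as the cursor."""
    body: Dict[str, Any] = {"size": page_size, "query": query, "sort": sort}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The last hit's sort values tell the next request where to resume.
        body = {**body, "search_after": hits[-1]["sort"]}


sort_spec = [{"date": "asc"}, {"tie_breaker_id": "asc"}]
all_ids = [
    hit["_source"]["tie_breaker_id"]
    for hit in iter_search_after(fake_search, {"match_all": {}}, sort_spec, page_size=2)
]
```

The sort needs a unique tie-breaker field, as in the query above, so that no document is skipped or repeated between pages.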
By default, Elasticsearch does not return results beyond a window of 10,000 hits (from + size, governed by index.max_result_window). That restriction is why you are getting the error below:
Result window is too large, from + size must be less than or equal to: [10000]
Try the Scroll API instead, like this:
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}
'
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
curl -XGET 'localhost:9200/_search/scroll' -d '
{
  "scroll": "1m",
  "scroll_id": "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}
'
N.B. I've used both the Python and PHP versions of the Elasticsearch client API. The Scroll API is very flexible for retrieving large data sets.
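In Python, the two requests above turn into a loop that keeps calling scroll until a page comes back empty. The following is a minimal sketch, assuming a client object with search(index=..., body=..., scroll=...) and scroll(scroll_id=..., scroll=...) methods modelled on the official Python client (the exact keyword arguments vary between client versions). scroll_all and FakeClient are hypothetical names; FakeClient only imitates the paging so the snippet runs without a cluster.

```python
from typing import Any, Dict, Iterator, List


class FakeClient:
    """Hypothetical stand-in that pages through a list like a scroll would."""

    def __init__(self, docs: List[Dict[str, Any]], page_size: int) -> None:
        self._docs = docs
        self._page_size = page_size
        self._offset = 0

    def _next_page(self) -> Dict[str, Any]:
        chunk = self._docs[self._offset : self._offset + self._page_size]
        self._offset += len(chunk)
        return {
            "_scroll_id": "fake-scroll-id",
            "hits": {"hits": [{"_source": doc} for doc in chunk]},
        }

    def search(self, index: str, body: Dict[str, Any], scroll: str) -> Dict[str, Any]:
        self._offset = 0  # a new search starts a new scroll context
        return self._next_page()

    def scroll(self, scroll_id: str, scroll: str) -> Dict[str, Any]:
        return self._next_page()


def scroll_all(
    client: Any, index: str, body: Dict[str, Any], keep_alive: str = "1m"
) -> Iterator[Dict[str, Any]]:
    """Yield every hit of a query by following the scroll ID page by page."""
    page = client.search(index=index, body=body, scroll=keep_alive)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Always use the most recent scroll ID; it may change between pages.
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)


tweets = [{"title": "tweet %d" % i} for i in range(7)]
fetched = [
    hit["_source"]
    for hit in scroll_all(FakeClient(tweets, 3), "twitter", {"query": {"match_all": {}}})
]
```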

How can I get options for filtering by a field directly from elasticsearch?

I want to populate a filtering field based on the data I have indexed inside Elasticsearch. How can I retrieve this data? For example, my documents inside index "test" and type "doc" could be
{"id":1, "tag":"foo", "name":"foothing"}
{"id":2, "tag":"bar", "name":"barthing"}
{"id":3, "tag":"foo", "name":"something"}
{"id":4, "tag":"quux", "name":"quuxthing"}
I'm looking for something like GET /test/doc/_magic?q=tag that would return [foo,bar,quux] from my data. I don't know what this is called, or whether it is even possible. I don't want to load all index entries into memory and do this programmatically: I have millions of documents in the index, with around a hundred different tags.
Is this possible with ES?
Yes, that's possible. It is called a terms aggregation.
You can do it like this:
GET /test/doc/_search
{
  "size": 0,
  "aggs": {
    "tags": {
      "terms": {
        "field": "tag.keyword",
        "size": 100
      }
    }
  }
}
Note that depending on the cardinality of your tag field, you can increase/decrease the size setting (10 by default).
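On the client side, the tag list is just the key of every bucket in the response. A minimal Python sketch, assuming the usual terms aggregation response layout; bucket_keys and sample_response are hypothetical names, and the sample is hand-written to match the four documents in the question:

```python
from typing import Any, Dict, List


def bucket_keys(response: Dict[str, Any], agg_name: str) -> List[Any]:
    """Return the key of every bucket of a terms aggregation."""
    return [bucket["key"] for bucket in response["aggregations"][agg_name]["buckets"]]


# Hand-written response for the sample documents (assumption, not real output).
sample_response = {
    "hits": {"hits": []},  # empty because the request used "size": 0
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "foo", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "quux", "doc_count": 1},
            ]
        }
    },
}

tags = bucket_keys(sample_response, "tags")
```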

Elasticsearch From and Size on aggregation for pagination

First of all, what I am trying to achieve works very well on Solr 5.3.1, but not on Elasticsearch 6.2 running as a service on AWS.
My actual query is large and complex. It works fine in Kibana, but not once I go beyond from = 100 and size = 50; at that point the Kibana console shows an error.
What I know:
For a normal search, the maximum from is 10000, and
for an aggregated search, the maximum from is 100.
If I go beyond that limit, I either have to raise the maximum, which is not possible because I am using ES as a service on AWS, or I have to use the Scroll API with a scroll ID to get paginated data.
The Scroll API works fine, and I have used it in another part of my project, but when I try the same scroll with an aggregation it does not work as expected.
With the Scroll API, the first search returns the aggregated data, but the second call with the scroll ID does not return the aggregation results; it only returns the hits.
Query in Kibana:
GET /properties/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "published": true
          }
        },
        {
          "match": {
            "country": "South Africa"
          }
        }
      ]
    }
  },
  "aggs": {
    "aggs_by_feed": {
      "terms": {
        "field": "feed",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "tops": {
          "top_hits": {
            "from": 100,
            "size": 50,
            "_source": [
              "id",
              "feed_provider_id"
            ]
          }
        }
      }
    }
  },
  "sort": [
    {
      "instant_book": {
        "order": "desc"
      }
    }
  ]
}
With search in Python: the problem with this approach is that the first search returns the aggregated data along with the hits, but each following call with the scroll ID returns only the hits, not the aggregated data.
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while (scroll_size > 0):
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results that we returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data :")
        print(page['aggregations'])
With Elasticsearch-DSL in Python: with this approach I'm struggling to select the _source field names, such as id and feed_provider_id, in the second aggregation, i.e. tops -> top_hits.
es = init_es()
s = Search(using=es, index=index_name, doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top', 'top_hits', field='id')
response = s.execute()
print('Hit........')
for hit in response:
    print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
    print(tag)
So my questions are:
Is it not possible to use from and size in the aggregated query above once from goes beyond 100?
If it is possible, please give me a hint, either the plain Elasticsearch way or the elasticsearch-dsl Python way, as I am not very familiar with elasticsearch-dsl or with Elasticsearch buckets, metrics, etc.
Some answers on SO suggest using partitions, but I don't know how to apply that to my scenario (see "How to control the elasticsearch aggregation results with From / Size?").
Others say that this feature is not currently supported by ES (it is an open feature request). If it is not possible, what else can be done in place of grouping in Solr?
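The partition approach mentioned above splits the terms into N groups and requests one group at a time through the include option of the terms aggregation. Here is a minimal sketch in Python that only builds the request bodies as plain dicts; partitioned_bodies and its parameters are hypothetical names, and the field and aggregation names are taken from the query above. Partitioning pages through the buckets, not through the hits inside one bucket, so top_hits stays subject to the per-bucket limit described in the question.

```python
from typing import Any, Dict, Iterator, List


def partitioned_bodies(
    field: str,
    num_partitions: int,
    source_fields: List[str],
    hits_per_bucket: int = 50,
    buckets_per_partition: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield one search body per partition of the terms aggregation."""
    for partition in range(num_partitions):
        yield {
            "size": 0,  # only the aggregation is needed, not the top-level hits
            "aggs": {
                "aggs_by_feed": {
                    "terms": {
                        "field": field,
                        # Ask only for the terms that fall into this partition.
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": buckets_per_partition,
                    },
                    "aggs": {
                        "tops": {
                            "top_hits": {
                                "size": hits_per_bucket,
                                "_source": source_fields,
                            }
                        }
                    },
                }
            },
        }


bodies = list(partitioned_bodies("feed", 3, ["id", "feed_provider_id"]))
```

Each body would be sent as its own search request, and the buckets from all the responses combined on the client.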

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have the same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need the Elasticsearch equivalent of the following:
SELECT * FROM session s1, session s2
WHERE s1.device = s2.device
What you are trying to achieve is simply grouping documents on a field via a self-join.
A similar notion of grouping can be achieved with the terms aggregation in Elasticsearch. However, this aggregation returns only group-level metrics such as count, sum, etc. It does not return the individual records.
There is another aggregation that can be applied as a sub-aggregation of the terms aggregation: the top_hits aggregation.
The top_hits aggregator can effectively be used to group result sets
by certain fields via a bucket aggregator. One or more bucket
aggregators determines by which properties a result set get sliced
into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "top-mobiles": {
      "terms": {
        "field": "device"
      },
      "aggs": {
        "top_device_hits": {
          "top_hits": {}
        }
      }
    }
  }
}
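The response to such a query nests the matching documents inside each bucket. A minimal Python sketch of collecting them per device, assuming the usual terms plus top_hits response layout; hits_by_bucket and sample_response are hypothetical names, and the sample is hand-written for illustration, not real output:

```python
from typing import Any, Dict, List


def hits_by_bucket(
    response: Dict[str, Any], terms_agg: str, top_hits_agg: str
) -> Dict[Any, List[Dict[str, Any]]]:
    """Map each bucket key to the _source of the documents returned for it."""
    grouped: Dict[Any, List[Dict[str, Any]]] = {}
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        hits = bucket[top_hits_agg]["hits"]["hits"]
        grouped[bucket["key"]] = [hit["_source"] for hit in hits]
    return grouped


# Hand-written response with two device buckets (assumption).
sample_response = {
    "aggregations": {
        "top-mobiles": {
            "buckets": [
                {
                    "key": "computer",
                    "doc_count": 2,
                    "top_device_hits": {
                        "hits": {
                            "hits": [
                                {"_id": "1", "_source": {"mobile": "android", "device": "computer"}},
                                {"_id": "2", "_source": {"mobile": "ios", "device": "computer"}},
                            ]
                        }
                    },
                },
                {
                    "key": "tablet",
                    "doc_count": 1,
                    "top_device_hits": {
                        "hits": {
                            "hits": [
                                {"_id": "3", "_source": {"mobile": "android", "device": "tablet"}},
                            ]
                        }
                    },
                },
            ]
        }
    }
}

sessions = hits_by_bucket(sample_response, "top-mobiles", "top_device_hits")
```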

Ordering term aggregation buckets by sub-aggregation result values

I have two questions about the query seen on this capture:
1. How do I order by the value of the sum_category field in the results? I use respsize again in the query, but it's not correct, as you can see below.
2. Even if I only make an aggregation, why do all the documents come back with the result? If I run a GROUP BY query in SQL, it retrieves only the grouped data, but Elasticsearch retrieves all documents as if I had made a normal search query. How do I skip them?
Try this:
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category",
        "size": 999999,
        "order": {
          "sum_category": "desc"
        }
      },
      "aggs": {
        "sum_category": {
          "sum": {
            "field": "respsize"
          }
        }
      }
    }
  }
}
1) See the note in (2) for what your sort is doing. To order the categories by the value of sum_category, see the order portion of the terms aggregation. There is an old, closed issue related to that, https://github.com/elastic/elasticsearch/issues/4643, but it worked fine for me with v1.5.2 of Elasticsearch.
2) Although you did not specify a match_all query, I think that is effectively what you are getting results for, so the sort you specified is applied to those hits. To avoid getting them back, I added the size: 0 portion.
Do you want buckets for all the categories? I noticed you do not have a size specified for the main aggregation; that is what the size: 999999 portion is for.
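To check what the aggregation should return, the same grouping can be mirrored locally on a handful of documents. This is a minimal Python sketch of the semantics only (group by category, sum respsize, order by the sum descending); sum_by_category and the sample documents are hypothetical, and nothing here talks to Elasticsearch.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def sum_by_category(docs: List[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Group documents by category and order the groups by summed respsize."""
    totals: Dict[str, int] = defaultdict(int)
    for doc in docs:
        totals[doc["category"]] += doc["respsize"]
    # Same ordering as "order": {"sum_category": "desc"} in the query.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up documents for illustration.
sample_docs = [
    {"category": "image", "respsize": 200},
    {"category": "video", "respsize": 900},
    {"category": "image", "respsize": 300},
    {"category": "text", "respsize": 30},
]

ordered = sum_by_category(sample_docs)
```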

Resources