Elasticsearch From and Size on aggregation for pagination - elasticsearch

First of all, the behaviour I want works very well on Solr 5.3.1, but not on Elasticsearch 6.2 running as a service on AWS.
My actual query is large and complex. It works fine in Kibana until I go beyond from = 100 and size = 50, at which point the Kibana console shows an error.
What I know:
For a normal search, from + size can be at most 10000 by default, and
for top_hits inside an aggregation, from + size can be at most 100 by default.
If I cross that limit, I either have to raise the maximum, which is not possible because I am using ES as a service on AWS, or use the Scroll API with a scroll ID to get paginated data.
The Scroll API works fine in another part of my project, but when I try the same scroll with an aggregation it does not work as expected.
With the Scroll API, the first search returns the aggregated data, but the follow-up calls with the scroll ID do not return the aggregation results, only the hits.
Query on Kibana
GET /properties/_search
{
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"published": true
}
},
{
"match": {
"country": "South Africa"
}
}
]
}
},
"aggs": {
"aggs_by_feed": {
"terms": {
"field": "feed",
"order": {
"_key": "desc"
}
},
"aggs": {
"tops": {
"top_hits": {
"from": 100,
"size": 50,
"_source": [
"id",
"feed_provider_id"
]
}
}
}
}
},
"sort": [
{
"instant_book": {
"order": "desc"
}
}
]
}
With search in Python: the problem I'm facing is that the first search returns the aggregated data along with the hits, but the following calls with the scroll ID return only the hits, not the aggregated data.
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while scroll_size > 0:
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results returned by the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data:")
        print(page['aggregations'])
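A likely explanation for the behaviour above: according to the Elasticsearch scroll documentation, when a scrolled request specifies aggregations, only the initial search response contains the aggregation results. A minimal sketch of reading the aggregations once and then scrolling only the hits is shown below. The helper name `scroll_with_aggs` is hypothetical, `es` is assumed to be an `elasticsearch.Elasticsearch` client (or any object with compatible `search`, `scroll`, and `clear_scroll` methods), and `doc_type` is left out for brevity.

```python
def scroll_with_aggs(es, index_name, body, scroll="30s", size=10):
    """Return (aggregations, hits); aggregations come from the first page only."""
    page = es.search(index=index_name, body=body, scroll=scroll, size=size)
    # Aggregations are only present in the initial search response.
    aggregations = page.get("aggregations", {})
    hits = []
    sid = page["_scroll_id"]
    try:
        # Keep scrolling until a page comes back with no hits.
        while page["hits"]["hits"]:
            hits.extend(page["hits"]["hits"])
            page = es.scroll(scroll_id=sid, scroll=scroll)
            sid = page["_scroll_id"]
    finally:
        # Release the server-side scroll context.
        es.clear_scroll(scroll_id=sid)
    return aggregations, hits
```

This only pages through the top-level hits. It does not page through the `top_hits` inside each bucket.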
With Elasticsearch-DSL in Python: with this approach I'm struggling to select the _source field names, such as id and feed_provider_id, in the second aggregation, i.e. tops -> top_hits.
es = init_es()
s = Search(using=es, index=index_name, doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top', 'top_hits', field='id')
response = s.execute()
print('Hit........')
for hit in response:
    print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
    print(tag)
So my questions are:
Is it not possible to use from and size in the aggregated query above once from goes beyond 100?
If it is possible, please give me a hint using either plain Elasticsearch or elasticsearch-dsl for Python, as I am not very familiar with elasticsearch-dsl or with Elasticsearch buckets, metrics, etc.
Some answers on SO suggest using partitions, but I don't know how to apply them to my scenario: How to control the elasticsearch aggregation results with From / Size?
Others say that this feature is not currently supported by ES (there is an open feature request). If it is not possible, what else can be done in place of grouping in Solr?
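The partition approach mentioned above can be sketched as follows. The terms aggregation accepts an `include` object with `partition` and `num_partitions`, which splits the set of terms into groups so that each request returns the buckets of one group. Note that this pages over buckets, not over the `top_hits` inside a bucket. The helper name `partitioned_body` and the values chosen for `num_partitions`, `bucket_size`, and `hits_per_bucket` are illustrative assumptions, not taken from the question.

```python
def partitioned_body(partition, num_partitions=20, bucket_size=100, hits_per_bucket=50):
    """Build a search body that returns one partition of the `feed` buckets."""
    return {
        "size": 0,  # only the aggregation result is needed here
        "query": {
            "bool": {
                "must": [
                    {"match": {"published": True}},
                    {"match": {"country": "South Africa"}},
                ]
            }
        },
        "aggs": {
            "aggs_by_feed": {
                "terms": {
                    "field": "feed",
                    # Only terms that fall into this partition are returned.
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": bucket_size,
                    "order": {"_key": "desc"},
                },
                "aggs": {
                    "tops": {
                        "top_hits": {
                            "size": hits_per_bucket,
                            "_source": ["id", "feed_provider_id"],
                        }
                    }
                },
            }
        },
    }


# One request body per partition; each would be sent as a separate search.
bodies = [partitioned_body(p, num_partitions=4) for p in range(4)]
```

Each body could then be passed to a search call such as `es.search(index="properties", body=bodies[0])`, assuming `es` is an elasticsearch-py client.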

Related

Daterange + top_hits aggregation (as subaggregation) with Elasticsearch Java API Client 7.17.x

I've been at this for a day and I don't quite understand how I do it! This is the query I want to "recreate" with the new Java API Client (using Spring Boot)
{
"aggs": {
"range": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "to": "now-2d" }
]
}
}
,
"aggs": {
"top_hits": {
"_source": {
"includes": [ "Id", "timestamp" ]
}
}
}
}
}
I tried doing it with DateRangeAggregation.of but I can't seem to get the right results or type. Here's what I have
SearchResponse<MyDto> response = client.search(b -> b
.index("test-index")
.size(0)
.aggregations("range",a->a.dateRange(DateRangeAggregation.of(d->d
.field("timestamp").ranges(r->r.to(t->t.expr("now-2d")))))),
.aggregations("hits", a -> a
.topHits(h->h.source(SourceConfig.of(c->c.filter(f->f.includes(Arrays.asList("Id", "timestamp"))))))),
MyDto.class
);
I've also tried removing the subaggregation and query for now, but I don't seem to be on the right track to even get the number of doc_count from the bucket. I kind of don't get how to work with the dateRange() here.
Edit: I played around a bit and was able to at least get the number of doc_count, I'm not very sure if this is a good way to do it though?
Aggregation agg = Aggregation.of(a -> a
.dateRange(d->d.field("timestamp").ranges(r->r.to(FieldDateMath.of(v->v.expr("now-2d"))))));
SearchResponse<MyDto> response = client.search(b -> b
.index("test-index")
.size(0)
.aggregations("range", agg),
MyDto.class
);
return response.aggregations().get("range").dateRange().buckets().array().get(0).docCount();
I also fixed the query above, it had an unnecessary extra query that broke the result.
My thought process was wrong. I wanted the documents that fell within this time range, and I mistakenly thought top_hits would give them to me, but that's not how it works! I made a separate range query that actually returns the documents I needed instead.

Elastic search : query to get all elements

I can't get all the items, the maximum reached is size:10000.
thanks
Error: [query_phase_execution_exception] Result window is too large,
from + size must be less than or equal to: [10000] but was [90000].
See the scroll API for a more efficient way to request large data
sets. This limit can be set by changing the [index.max_result_window]
index level parameter.
Any idea how can I solve it?
GetTweets: function (callback) {
client.search({
index: 'twitter',
type: 'tweet',
size:10000,
body: {
query: {
"query": {
"match_all": {}
}
}
}
}, function (err, resp, status) {
callback(err,resp);
});
},
search_after can be used for pagination. It is more efficient than the Scroll API.
GET twitter/_search
{
"size": 10,
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"search_after": [1463538857, "654323"],
"sort": [
{"date": "asc"},
{"tie_breaker_id": "asc"}
]
}
ES docs:
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher
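As a rough sketch of how a client might loop with search_after: each request reuses the same query and sort, and passes the `sort` values of the last hit of the previous page. The function name `iterate_search_after` is hypothetical, and `search` is assumed to be any callable that takes a request body and returns a response dict. The body is assumed to contain a `sort` with a unique tie-breaker field, as in the request above.

```python
def iterate_search_after(search, body, page_size=10):
    """Yield every hit by feeding the last hit's sort values into search_after."""
    request = dict(body, size=page_size)
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit mark where the next page starts.
        request = dict(request, search_after=hits[-1]["sort"])
```

With elasticsearch-py, `search` could be something like `lambda req: es.search(index="twitter", body=req)`, assuming `es` is a configured client.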
By default, Elasticsearch does not return results beyond a window of 10000, i.e. when from + size is greater than 10000. See the Scroll API documentation; that restriction is why you are getting the error below.
Result window is too large, from + size must be less than or equal to: [10000]
Try the Scroll API, like this:
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
curl -XGET 'localhost:9200/_search/scroll' -d'
{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}
'
N.B. I've used both the Python and PHP versions of the Elasticsearch client API. The Scroll API is very flexible for retrieving large data sets.

Elasticsearch: get documents only when value changes

I have an ES index with such kind of documents:
from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8
I need a query that would return a document only if its combination of from and to values is different than the previous seen document with the same from value.
So with the provided sample above:
document with timestamp_1 should be in the result because there is no earlier document with from_1+to_1 combination
document with timestamp_2 must be skipped because its from+to combination is exactly the same as the last seen document with from = from_1
document with timestamp_3 should be in the result because its to field (to_2) is different from the value in the last seen document with the same from (to_1 in document with timestamp_1)
document with timestamp_4 should be in the result
document with timestamp_5 must not be in the result because it has the same combination of from+to as the last seen with from_1 (document with timestamp_3)
document with timestamp_6 must not be in the result because it has the same combination of from+to as the last seen with from_2 (document with timestamp_4)
document with timestamp_7 should be in the result because it has a different combination of from+to from the last seen with from_1 (document with timestamp_3)
document with timestamp_8 should be in the result because its combination is completely new so far
I need to fetch all such "semi-unique" documents from the index, so it would be nice if it were possible to use a scroll request, or after_key if an aggregation is used.
Any ideas how to approach it?
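To make the rule above concrete, here is a minimal client-side sketch of it in Python. It is not an Elasticsearch query: it assumes the documents have already been fetched in timestamp order (for example with a scroll), and the function name `changed_documents` and the dict layout of each document are assumptions made for illustration.

```python
_MISSING = object()  # sentinel for "no document with this `from` seen yet"


def changed_documents(docs):
    """Keep a document only if its `to` differs from the last seen document with the same `from`."""
    last_to = {}  # maps each `from` value to the most recently seen `to` value
    result = []
    for doc in docs:  # docs must already be ordered by timestamp
        if last_to.get(doc["from"], _MISSING) != doc["to"]:
            result.append(doc)
        last_to[doc["from"]] = doc["to"]
    return result


# The sample from the question, with timestamps reduced to their index.
sample = [
    ("from_1", "to_1", 1), ("from_1", "to_1", 2), ("from_1", "to_2", 3),
    ("from_2", "to_3", 4), ("from_1", "to_2", 5), ("from_2", "to_3", 6),
    ("from_1", "to_1", 7), ("from_2", "to_4", 8),
]
docs = [{"from": f, "to": t, "timestamp": ts} for f, t, ts in sample]
kept = [d["timestamp"] for d in changed_documents(docs)]
# kept == [1, 3, 4, 7, 8], matching the expected result listed above
```

Because the filter only remembers the last `to` value per `from`, it can be applied while streaming pages of results rather than after loading everything.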
The closest thing I could come up with is the following (let me know if it does not work with your data).
{
"size": 0,
"aggs": {
"from_and_to": {
"composite" : {
"size": 5,
"sources": [
{
"from_to_collected":{
"terms": {
"script": {
"lang": "painless",
"source": "doc['from'].value + '_' + doc['to'].value"
}
}
}
}]
},
"aggs": {
"top_from_and_to_hits": {
"top_hits": {
"size": 1,
"sort": [{"timestamp":{"order":"asc"}}],
"_source": {"includes": ["_id"]}
}
}
}
}
}
}
Keep in mind that the terms aggregation is probabilistic.
This will allow you to page through the next sets of buckets using the from_to_collected key.
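Paging through the composite buckets could look roughly like the sketch below. Each follow-up request sets `after` on the composite aggregation to the key where the previous page ended. The function name `iterate_composite_buckets` is hypothetical and `search` is assumed to be any callable that takes a request body and returns the response as a dict. The fallback to the last bucket's key is there for responses that carry no `after_key` field.

```python
import copy


def iterate_composite_buckets(search, body, agg_name="from_and_to"):
    """Yield all buckets of a composite aggregation, one page at a time."""
    request = copy.deepcopy(body)  # avoid mutating the caller's body
    while True:
        agg = search(request)["aggregations"][agg_name]
        buckets = agg["buckets"]
        if not buckets:
            return
        yield from buckets
        # Continue after the last key of this page.
        after = agg.get("after_key", buckets[-1]["key"])
        request["aggs"][agg_name]["composite"]["after"] = after
```

With elasticsearch-py, `search` could be something like `lambda req: es.search(index="my-index", body=req)`, where both `es` and the index name are placeholders.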

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need Elasticsearch equivalent of following -
SELECT * FROM session s1, session s2
WHERE s1.device = s2.device
What you are trying to achieve is a simple grouping of documents on a field via a self-join.
A similar notion of grouping can be achieved with the terms aggregation in Elasticsearch. However, that aggregation returns only group-level metrics such as count and sum; it does not return the individual records.
There is another aggregation, top_hits, which can be applied as a sub-aggregation of the terms aggregation to return them.
The top_hits aggregator can effectively be used to group result sets
by certain fields via a bucket aggregator. One or more bucket
aggregators determines by which properties a result set get sliced
into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
"query": {
"match_all": {}
},
"aggs": {
"top-mobiles": {
"terms": {
"field": "device"
},
"aggs": {
"top_device_hits": {
"top_hits": {}
}
}
}
}
}
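Reading the grouped documents back out of the response of the sample query could look like the sketch below. The function name `group_hits` is hypothetical, and the `response` dict is made-up illustrative data that only mimics the general shape of a terms plus top_hits response.

```python
def group_hits(response, agg_name="top-mobiles", hits_name="top_device_hits"):
    """Map each bucket key to the _source of the top hits in that bucket."""
    groups = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        groups[bucket["key"]] = [hit["_source"] for hit in hits]
    return groups


# Made-up response fragment for illustration only.
response = {
    "aggregations": {
        "top-mobiles": {
            "buckets": [
                {
                    "key": "computer",
                    "doc_count": 2,
                    "top_device_hits": {
                        "hits": {
                            "hits": [
                                {"_id": "1", "_source": {"mobile": "android", "device": "computer"}},
                                {"_id": "2", "_source": {"mobile": "ios", "device": "computer"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

groups = group_hits(response)
```

Keep in mind that top_hits returns only the first few hits per bucket (three by default, as noted in the options above), so `size` has to be raised to see more documents per group.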

Selecting all the results from a bucket using TopHits aggregation

I am using TopHits aggregation over the Terms aggregation to fetch the records as shown in below query.
{
  "aggregations": {
    "group by": {
      "terms": {
        "field": "City"
      },
      "aggregations": {
        "top": {
          "top_hits": {
            "size": 200
          }
        }
      }
    }
  }
}
I want to fetch all the records that are present in bucket instead of only top 200 records, but as the value of size increases the query time also increases for the same indexed data (for same number of records).
So I can not set the size value to a randomly large number as it is hampering the querying time.
Is there any way to achieve the same efficiently ?
Thanks.
In Elasticsearch, size defaults to 10 documents; to get more documents back, you have to increase the size value.
Consider this example:
Deep pagination with from and size (e.g. ?size=10&from=10000) is very inefficient because, in this example, 10,010 sorted results have to be retrieved from each shard and re-sorted in order to return just 10 results. This process has to be repeated for every page requested.
In this case you should use the Scroll API, because:
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
In your case you should use scan and scroll as below:
curl -s -XGET 'localhost:9200/logs/syslogs/_search?scroll=10m&search_type=scan' -d '{
"aggregations": {
"group by": {
"terms": {
"field": "City"
},
"aggregations": {
"top": {
"top_hits": {
"size": 200
}
}
}
}
}
}'
The query above returns a scroll ID; pass that scroll ID to the scroll endpoint as below:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'scroll id '
