Elastic search : query to get all elements - elasticsearch

I can't get all the items, the maximum reached is size:10000.
thanks
Error: [query_phase_execution_exception] Result window is too large,
from + size must be less than or equal to: [10000] but was [90000].
See the scroll API for a more efficient way to request large data
sets. This limit can be set by changing the [index.max_result_window]
index level parameter.
Any idea how can I solve it?
GetTweets: function (callback) {
client.search({
index: 'twitter',
type: 'tweet',
size:10000,
body: {
query: {
"query": {
"match_all": {}
}
}
}
}, function (err, resp, status) {
callback(err,resp);
});
},

search_after can be used to apply pagination.Efficient than Scroll Api
GET twitter/_search
{
"size": 10,
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"search_after": [1463538857, "654323"],
"sort": [
{"date": "asc"},
{"tie_breaker_id": "asc"}
]
}
ES docs:
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher

It is the default feature of Elasticsearch not to get data at once after 10000 window ie. size:10000 or upper. See here at scroll api, because of that restriction you're getting below error.
Result window is too large, from + size must be less than or equal to: [10000]
Try Scroll API like,
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
curl -XGET 'localhost:9200/_search/scroll' -d'
{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}
'
N.B I've used both the python and php version of elasticsearch client api. Scroll API is really awesome and very flexible to get data-sets using it.

Related

elasticsearch how to get previous data using scroll

I used this sample for making a pagination function
POST /twitter/_search?scroll=1m
{
"size": 100,
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
But there is one problem
This code can get next data but not previous data how should I solve this problem??
If you want to use pagination you should rather use From / Size than scroll.
You can refer also to this answer : Elasticsearch Scroll it explain the difference with scan and scroll

Elasticsearch From and Size on aggregation for pagination

First of all, I want to say that the requirement I want to achieve is working very well on SOLR 5.3.1 but not on ElasticSearch 6.2 as a service on AWS.
My actual query is very large and complex and it is working fine on kibana but not when I cross the from = 100 and size = 50 as it is showing error on kibana console,
What I know:
For normal search, the maximum from can be 10000 and
for aggregated search, the maximum from can be 100
If I cross that limit then I've to change the maximum limit which is not possible as I am using ES on AWS as a service OR I've use scroll API with scroll id feature to get paginated data.
The Scroll API works fine as I've used it to another part of my project but when I try the same Scroll with aggregation it is not working as expected.
Here with Scroll API, the first search gets the aggregated data but the second calling with scroll id not returns the Aggregated results only showing the Hits result
Query on Kibana
GET /properties/_search
{
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"published": true
}
},
{
"match": {
"country": "South Africa"
}
}
]
}
},
"aggs": {
"aggs_by_feed": {
"terms": {
"field": "feed",
"order": {
"_key": "desc"
}
},
"aggs": {
"tops": {
"top_hits": {
from: 100,
size: 50,
"_source": [
"id",
"feed_provider_id"
]
}
}
}
}
},
"sort": [
{
"instant_book": {
"order": "desc"
}
}
]
}
With Search on python: The problem I'm facing with this search, first time the search gets the Aggregated data along with Hits data but for next calling with scroll id it is not returning the Aggregated data only showing the Hits data.
if index_name is not None and doc_type is not None and body is not None:
es = init_es()
page = es.search(index_name,doc_type,scroll = '30s',size = 10, body = body)
sid = page['_scroll_id']
scroll_size = page['hits']['total']
# Start scrolling
while (scroll_size > 0):
print("Scrolling...")
page = es.scroll(scroll_id=sid, scroll='30s')
# Update the scroll ID
sid = page['_scroll_id']
print("scroll id: " + sid)
# Get the number of results that we returned in the last scroll
scroll_size = len(page['hits']['hits'])
print("scroll size: " + str(scroll_size))
print("scrolled data :" )
print(page['aggregations'])
With Elasticsearch-DSL on python: With this approach I'm struggling to select the _source fields names like id and feed_provider_id on the second aggs i.g tops->top_hits
es = init_es()
s = Search(using=es, index=index_name,doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top','top_hits',field = 'id')
response = s.execute()
print('Hit........')
for hit in response:
print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
print(tag)
So my question is
Is it not possible to get data using from and size field on for the aggregated query above from=100?
if it is possible then please give me a hint with normal elasticsearch way or elasticsearch-dsl python way as I am not well known with elasticsearch-dsl and elasticsearch bucket, matric etc.
Some answer on SO told to use partition but I don't know how to use it on my scenario How to control the elasticsearch aggregation results with From / Size?
Some others says that this feature is not currently supported by ES (currently on feature request). If that's not possible, what else can be done in place of grouping in Solr?

Elasticsearch query returns 10 when expecting > 10,000

I want to retrieve all the JSON objects in Elasticsearch that have a null value for awsKafkaTimestamp. This is the query I have set up:
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "tracer.awsKafkaTimestamp"
}
}
}
}
}
When I curl to my elasticsearch endpoint with the DSL I only get a few values back. I am expecting all (10000+) of them because I know for sure all the awsKafkaTimestamp values are null
This is the response I get when I use Postman. As you can see, there are only 10 JSON objects returned to me:
It's correct behaviour of the elasticsearch. By default, it only returns 10 records and provides information in hits.total field about the total number of documents matching search criteria. To retrieve more data than 10 you should specify size field in your query as shown below (you can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html):
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "user" : "kimchy" }
}
}
By default elasticsearch will give you 10 results, even if it matches to 10212. You can set the size parameter but that is limited to 10000, so your only option is to use the scroll API to get,
Example from elasticsearch site Scroll API
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'

Selecting all the results from a bucket using TopHits aggregation

I am using TopHits aggregation over the Terms aggregation to fetch the records as shown in below query.
{
"aggregations" : {
"group by" : {
"terms" : {
"field" : "City"
},
"aggregations" : {
"top" : {
"top_hits" : {
"size" : 200
}
}}}}
I want to fetch all the records that are present in bucket instead of only top 200 records, but as the value of size increases the query time also increases for the same indexed data (for same number of records).
So I can not set the size value to a randomly large number as it is hampering the querying time.
Is there any way to achieve the same efficiently ?
Thanks.
In elastic search size having limitations default it returns 10 documents but if you want to increase documents then size values increase.
Let's check this example in this case
if deep pagination with from and size — e.g. ?size=10&from=10000 — is very inefficient as (in this example) 100,000 sorted results have to be retrieved from each shard and resorted in order to return just 10 results. This process has to be repeated for every page requested.
So this case you should use scroll api because of
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
In your case you should use scan and scroll as below :
curl - s - XGET localhost: 9200 / logs / syslogs / _search ? scroll = 10 m & search_type = scan ' {
"aggregations": {
"group by": {
"terms": {
"field": "City"
},
"aggregations": {
"top": {
"top_hits": {
"size": 200
}
}
}
}
}
}'
Above query return scroll id then pass that scroll id as below
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'scroll id '

Filter facet returns count of all documents and not range

I'm using Elasticsearch and Nest to create a query for documents within a specific time range as well as doing some filter facets. The query looks like this:
{
"facets": {
"notfound": {
"query": {
"term": {
"statusCode": {
"value": 404
}
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
In the specific case, the total hits of the search is 21 documents, which fits the documents within that time range in Elasticsearch. But the "notfound" facet returns 38, which fits the total number of ErrorDocuments with a StatusCode value of 404.
As I understand the documentation, facets collects data from withing the search. In this case, the "notfound" facet should never be able to return a count higher that 21.
What am I doing wrong here?
There's a distinct difference between filter/query/filtered_query/facet filter which is good to know.
Top level filter
{
filter: {}
}
This acts as a post-filter, meaning it will filter the results after the query phase has ended. Since facets are part of the query phase filters do not influence the documents that are facetted over. Filters do not alter score and are therefor very cacheable.
Top level query
{
query: {}
}
Queries influence the score of a document and are therefor less cacheable than filters. Queries run in the query phase and thus also influence the documents that are facetted over.
Filtered query
{
query: {
filtered: {
filter: {}
query: {}
}
}
}
This allows you to run filters in the query phase taking advantage of their better cacheability and have them influence the documents that are facetted over.
Facet filter
"facets" : {
"<FACET NAME>" : {
"<FACET TYPE>" : {
...
},
"facet_filter" : {
"term" : { "user" : "kimchy"}
}
}
}
this allows you to apply a filter to the documents that the facet is run over. Remember that the it'll be a combination of the queryphase/facetfilter unless you also specify global:true on the facet as well.
Query Facet/Filter Facet
{
"facets" : {
"wow_facet" : {
"query" : {
"term" : { "tag" : "wow" }
}
}
}
}
Which is the one that #thomasardal is using in this case which is perfectly fine, it's a facet type which returns a single value: the query hit count.
The fact that your Query Facet returns 38 and not 21 is because you use a filter for your time range.
You can fix this by either doing the filter in a filtered_query in the query phase or apply a facet filter(not a filter_facet) to your query_facet although because filters are cached better you better use facet filter inside you filter facet.
Confusingly Filter Facets are specified using .FacetFilter() on the search object. I will change this in 1.0 to avoid future confusion.
Sadly: .FacetFilter() and .FacetQuery() in NEST do not allow you to specify a facet filter like you can with other facets:
var results = typedClient.Search<object>(s => s
.FacetTerm(ft=>ft
.OnField("myfield")
.FacetFilter(f=>f.Term("filter_facet_on_this_field", "value"))
)
);
You issue here is that you are performing a Filter Facet and not a normal facet on your query (which will follow the restrictions applied via the query filter). In the JSON, the issue is because of the "query" between the facet name "notfound" and the "terms" entry. This is telling Elasticsearch to run this as a separate query and facet on the results of this separate query and not your main query with the date range filter. So your JSON should look like the following:
{
"facets": {
"notfound": {
"term": {
"statusCode": {
"value": 404
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
Since I see you have this tagged with NEST as well, in your call using NEST, you are probably using FacetFilter on your search request, switch this to just Facet to get the desired result.

Resources