Elasticsearch query returns 10 when expecting > 10,000 - elasticsearch

I want to retrieve all the JSON objects in Elasticsearch that have a null value for awsKafkaTimestamp. This is the query I have set up:
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "tracer.awsKafkaTimestamp"
}
}
}
}
}
When I curl to my elasticsearch endpoint with the DSL I only get a few values back. I am expecting all (10000+) of them because I know for sure all the awsKafkaTimestamp values are null
This is the response I get when I use Postman. As you can see, there are only 10 JSON objects returned to me:

It's correct behaviour of the elasticsearch. By default, it only returns 10 records and provides information in hits.total field about the total number of documents matching search criteria. To retrieve more data than 10 you should specify size field in your query as shown below (you can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html):
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "user" : "kimchy" }
}
}

By default elasticsearch will give you 10 results, even if it matches to 10212. You can set the size parameter but that is limited to 10000, so your only option is to use the scroll API to get,
Example from elasticsearch site Scroll API
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'

Related

How to apply filter on filterered data in elastic search using API

How could I be able to add multiple filters on the index
I want to filter results by first_name and then by category using elastic search client
In kibana dashboard
I want to achieve the same functionality using the elastic search client and python
but I am able to filter the data only once
Sample code
#app.route('/get-data')
#login_required
def get_permission():
uri = f'https://localhost:9200/'
client = Elasticsearch(hosts=uri, basic_auth=(session['username'], session['password']), ca_certs=session['cert'], verify_certs=False)
body = {
"from" : 0,
"size" : 20,
"query" : {
"bool" : {
"must" : [],
"filter" : [],
"must_not":[],
"should" :[],
}
}
}
index_data = client.search(index=index, body=body)
return render_template('showdata.html', index_data=index_data)
I have looked into the msearch but it's not working
msearch method on devtool
Result are not correct
Is there any way to filter or reapply the search method on filtered data without messing up the old query
filter is an array in Elasticsearch DSL, and you should be able to provide multiple filters in that array, I can't help with python code, but in JSON filter array looks like
{
"query": {
"bool": {
"filter": [
{
"prefix": {
"question_body_markdown": "i"
}
},
{
"term": {
"customer.first_name": "foo"
}
}
]
}
}
}

How to compare two date fields in same document in elasticsearch

In my elastic search index, each document will have two date fields createdDate and modifiedDate. I'm trying to add a filter in kibana to fetch the documents where the modifiedDate is greater than createdDate. How to create this filter in kibana?
Tried Using below query instead of greater than it is considering as gte and fetching all records
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script" : {
"inline" : "doc['modifiedTime'].value.getMillis() > doc['createdTime'].value.getMillis()",
"lang" : "painless"
}
}
}
}
}
}
There are a few options.
Option A: The easiest and most performant one is to store the difference of the two fields inside a new field of your document, e.g.
{
"createDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:34:56Z",
"diffMillis": 0
}
{
"createDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:35:58",
"diffMillis": 62000
}
Then, in Kibana you can query on diffMillis > 0 and figure out all documents that have been modified after their creation.
Option B: You can use a script query
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": """
return doc['createdDate'].value.millis < doc['modifiedDate'].value.millis;
"""
}
}
}
}
}
Note: depending on the amount of data you have, this option can potentially have disastrous performance, because it needs to be evaluated on ALL of your documents.
Option C: If you're using ES 7.11+, you can use runtime fields directly from the Kibana Discover view.
You can use the following script in order to add a new runtime field (e.g. name it diffMillis) to your index pattern:
emit(doc['modifiedDate'].value.millis - doc['createdDate'].value.millis)
And then you can add the following query into your search bar
diffMillis > 0

Search for empty/present arrays in elasticsearch

I'm currently using the elasticsearch 6.5.4 and I'm trying to query for all docs in an index with an empty array on a specific field. I found the the elasticsearch has a exists dsl who is supposed to cover the empty array case.
The problem is: whem I query for a must exists no doc is returned and when I query for must not exists all documents are returned.
Since I can't share the actual mapping for legal reasons, this is the closest I can give you:
{
"foo_production" : {
"mappings" : {
"foo" : {
"properties" : {
"bar" : {
"type" : "text",
"index" : false
}
}
}
}
}
}
And the query I am performing is:
GET foo_production/_search
{
"query": {
"bool": {
"must": {
"exists": {
"field": "bar"
}
}
}
}
}
Can you guys tell me where the problem is?
Note: Upgrading the elasticsearch version is not a viable solution
Enable indexing for the field bar by setting "index" : true
The index option controls whether field values are indexed. It accepts true or false and defaults to true. Fields that are not indexed are not queryable.
Source : https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html

Elastic search : query to get all elements

I can't get all the items, the maximum reached is size:10000.
thanks
Error: [query_phase_execution_exception] Result window is too large,
from + size must be less than or equal to: [10000] but was [90000].
See the scroll API for a more efficient way to request large data
sets. This limit can be set by changing the [index.max_result_window]
index level parameter.
Any idea how can I solve it?
GetTweets: function (callback) {
client.search({
index: 'twitter',
type: 'tweet',
size:10000,
body: {
query: {
"query": {
"match_all": {}
}
}
}
}, function (err, resp, status) {
callback(err,resp);
});
},
search_after can be used to apply pagination.Efficient than Scroll Api
GET twitter/_search
{
"size": 10,
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"search_after": [1463538857, "654323"],
"sort": [
{"date": "asc"},
{"tie_breaker_id": "asc"}
]
}
ES docs:
It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher
It is the default feature of Elasticsearch not to get data at once after 10000 window ie. size:10000 or upper. See here at scroll api, because of that restriction you're getting below error.
Result window is too large, from + size must be less than or equal to: [10000]
Try Scroll API like,
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
curl -XGET 'localhost:9200/_search/scroll' -d'
{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}
'
N.B I've used both the python and php version of elasticsearch client api. Scroll API is really awesome and very flexible to get data-sets using it.

Filter facet returns count of all documents and not range

I'm using Elasticsearch and Nest to create a query for documents within a specific time range as well as doing some filter facets. The query looks like this:
{
"facets": {
"notfound": {
"query": {
"term": {
"statusCode": {
"value": 404
}
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
In the specific case, the total hits of the search is 21 documents, which fits the documents within that time range in Elasticsearch. But the "notfound" facet returns 38, which fits the total number of ErrorDocuments with a StatusCode value of 404.
As I understand the documentation, facets collects data from withing the search. In this case, the "notfound" facet should never be able to return a count higher that 21.
What am I doing wrong here?
There's a distinct difference between filter/query/filtered_query/facet filter which is good to know.
Top level filter
{
filter: {}
}
This acts as a post-filter, meaning it will filter the results after the query phase has ended. Since facets are part of the query phase filters do not influence the documents that are facetted over. Filters do not alter score and are therefor very cacheable.
Top level query
{
query: {}
}
Queries influence the score of a document and are therefor less cacheable than filters. Queries run in the query phase and thus also influence the documents that are facetted over.
Filtered query
{
query: {
filtered: {
filter: {}
query: {}
}
}
}
This allows you to run filters in the query phase taking advantage of their better cacheability and have them influence the documents that are facetted over.
Facet filter
"facets" : {
"<FACET NAME>" : {
"<FACET TYPE>" : {
...
},
"facet_filter" : {
"term" : { "user" : "kimchy"}
}
}
}
this allows you to apply a filter to the documents that the facet is run over. Remember that the it'll be a combination of the queryphase/facetfilter unless you also specify global:true on the facet as well.
Query Facet/Filter Facet
{
"facets" : {
"wow_facet" : {
"query" : {
"term" : { "tag" : "wow" }
}
}
}
}
Which is the one that #thomasardal is using in this case which is perfectly fine, it's a facet type which returns a single value: the query hit count.
The fact that your Query Facet returns 38 and not 21 is because you use a filter for your time range.
You can fix this by either doing the filter in a filtered_query in the query phase or apply a facet filter(not a filter_facet) to your query_facet although because filters are cached better you better use facet filter inside you filter facet.
Confusingly Filter Facets are specified using .FacetFilter() on the search object. I will change this in 1.0 to avoid future confusion.
Sadly: .FacetFilter() and .FacetQuery() in NEST do not allow you to specify a facet filter like you can with other facets:
var results = typedClient.Search<object>(s => s
.FacetTerm(ft=>ft
.OnField("myfield")
.FacetFilter(f=>f.Term("filter_facet_on_this_field", "value"))
)
);
You issue here is that you are performing a Filter Facet and not a normal facet on your query (which will follow the restrictions applied via the query filter). In the JSON, the issue is because of the "query" between the facet name "notfound" and the "terms" entry. This is telling Elasticsearch to run this as a separate query and facet on the results of this separate query and not your main query with the date range filter. So your JSON should look like the following:
{
"facets": {
"notfound": {
"term": {
"statusCode": {
"value": 404
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
Since I see you have this tagged with NEST as well, in your call using NEST, you are probably using FacetFilter on your search request, switch this to just Facet to get the desired result.

Resources