Elasticsearch: Return the query within the response body when hits = 0

Please note that the following example is a heavily simplified version of a real-life use case; it is kept short so that the question is easy to read.
I have the following document structure:
{
"date" : 1400500,
"idc" : 1001,
"name": "somebody"
}
I am performing an _msearch query (multiple searches at a time) based on different values (the "idc" and a "date" range).
When ES cannot find any documents for the given date range, it returns:
"hits":{
"total":0,
"max_score":null,
"hits":[
]
}
But since the response contains N results, I cannot tell which "idc" and which "date" range a given empty result belongs to.
I would like the response to include the searched date range and "idc" when there are no results for the given query. For example, if I search for documents with idc = 1001 and date between 1400100 and 1400200, and no results are found, the response should contain the query terms in its body, something like this:
"hits":{
"total":0,
"max_score":null,
"query": {
"date": {
"gt": 1400100,
"lte": 1400200
},
"idc": 1001
}
}
That way I can tell what date range and "idc" combination has no results.

This is from the docs:
The multi search API (_msearch) response returns a responses array, which includes the search
response and status code for each search request, matching its order in
the original multi search request.
Since you know the order in which you sent the requests, you can tell which request each empty response belongs to.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html
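A minimal Python sketch of that idea, assuming you keep your own list of search descriptors in the order they were sent. The helper names and the descriptor shape (idc, gt, lte) are made up for illustration and are not part of Elasticsearch; the response is treated as a plain dict, so no client library is needed.

```python
def total_hits(response):
    """Return the hit count of one entry of the _msearch "responses" array."""
    total = response.get("hits", {}).get("total", 0)
    # Newer versions return {"value": N, "relation": ...}; older ones return an int.
    return total["value"] if isinstance(total, dict) else total


def empty_searches(searches, msearch_response):
    """Return the search descriptors whose response contained no hits.

    `searches` is the caller's own list, in the same order as the
    searches were written into the _msearch request body.
    """
    responses = msearch_response["responses"]
    if len(responses) != len(searches):
        raise ValueError("expected one response per search")
    return [
        search
        for search, response in zip(searches, responses)
        # Entries that failed carry an "error" key; they are skipped here.
        if "error" not in response and total_hits(response) == 0
    ]
```

With this, the "idc" and date range of every empty result can be read directly from the returned descriptors, without Elasticsearch having to echo the query back.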

Related

elasticsearch get date range of most recent ingestion

I have an Elasticsearch index that gets new data in large dumps, so from looking at the graph it's very obvious when new data is added.
If I only want to get data from the most recent ingestion (in this case, data from 2020-08-06), what's the best way of doing this?
I can use this query to get the most recent document:
GET /indexname/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": queryString
}
}
]
}
},
"sort": {
"@timestamp" : "desc"
},
"size": 1
}
This returns the most recent document, in this case a document with a timestamp of 2020-08-06. I can set that as my endDate and set my startDate to that date minus one day, but I'm worried about cases where the data was ingested overnight and spanned two days.
I could keep making requests that go back in time 5 hours at a time to find the most recent large gap, but I'm worried that making requests in a for loop could be time-consuming. Is there a smarter way to get the date range of my most recent ingestion? Thanks.
When your data comes in batches, it's best to assign an identifier to each batch. That way, no date math is required.
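One way this could look, sketched in Python with plain dicts. The field name batch_id and all function names are assumptions for illustration, and the term query assumes batch_id is mapped as a keyword. The idea is to stamp every document of a dump with the same identifier, look up the newest document to learn the latest identifier, and then filter on it.

```python
import uuid


def tag_batch(docs, batch_id=None):
    """Return (batch_id, copies of docs) where every copy carries the same batch_id."""
    batch_id = batch_id or uuid.uuid4().hex
    return batch_id, [{**doc, "batch_id": batch_id} for doc in docs]


def newest_doc_body(timestamp_field):
    """Search body that fetches only the batch_id of the most recent document."""
    return {
        "size": 1,
        "sort": [{timestamp_field: "desc"}],
        "_source": ["batch_id"],
    }


def batch_body(batch_id):
    """Search body that matches every document of one batch."""
    return {"query": {"term": {"batch_id": batch_id}}}
```

The first search returns the latest batch_id, and the second retrieves the whole ingestion regardless of how many days it spanned.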

Validating my understanding of Dismax query in elasticsearch

I have tried understanding how dismax query works and I want to validate my understanding, please see if I understood it correctly.
According to documentation a dismax query is:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
Suppose the documents in our ES cluster are as follows:
{"FOO":"ABC"},{"FOO":"XYZ"},{"FOO":"ABC XYZ"},{"FOO":"ABC DEF"},{"FOO":"DEF"} and the dismax query is:
{
"dis_max": {
"queries": [
{
"match": {
"FOO": "ABC"
}
},
{
"match": {
"FOO": "XYZ"
}
}
]
}
}
So, as per the documentation, let us first find the union of the documents returned by the dismax sub-queries. The union would be {"FOO":"ABC"},{"FOO":"XYZ"},{"FOO":"ABC XYZ"},{"FOO":"ABC DEF"}. In the next step, we need to score each document with the maximum score for that document as produced by any subquery, which would be something like:
{"FOO":"ABC"} will be scored on {"match":{"FOO": "ABC"}} and {"match":{"FOO": "XYZ"}}, and the maximum score returned will be used.
Similarly, {"FOO":"XYZ"} will be scored on {"match":{"FOO": "ABC"}} and {"match":{"FOO": "XYZ"}}, and the maximum score returned will be used. This is done for every document in the union, and finally the documents are returned sorted by score.
Is this how the dismax query works? Or did I misunderstand or miss anything?
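The scoring rule in the quoted documentation can be written out as a small Python sketch. The function name and the score values used with it are hypothetical; real scores come from the similarity model and will differ. A subquery that does not match a document contributes nothing (represented here as None), and the "tie breaking increment" is tie_breaker times the sum of the other matching subquery scores, where tie_breaker defaults to 0.0 in Elasticsearch.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    """Combine per-subquery scores for one document the way dis_max describes.

    `subquery_scores` holds one entry per subquery: a float if the
    subquery matched the document, or None if it did not.
    Returns None when no subquery matched (the document is not in the union).
    """
    matching = [score for score in subquery_scores if score is not None]
    if not matching:
        return None
    best = max(matching)
    # Every matching subquery other than the best one adds tie_breaker * its score.
    return best + tie_breaker * (sum(matching) - best)
```

Under this reading, {"FOO":"ABC"} matches only the first subquery, so its score is simply that subquery's score. {"FOO":"ABC XYZ"} matches both, so it gets the larger of the two scores, plus a fraction of the smaller one only if tie_breaker is set above 0.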

ElasticSearch 7.7 how can I increase the count of results of whole index

I understand that there's a hardcoded limit in Elasticsearch of 10k results per query. What I want to know is whether there's any way to search within this 10k limit but still show the count of all results for this particular query.
So, suppose there are 1M results matching a certain query: the count should show 1M instead of the max limit of 10k.
Thank you.
Yes, you can.
You need to add the attribute below to your search query:
{
"track_total_hits": true
}
It will show you the total count along with the default results.
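As a rough Python sketch (helper names are made up; the response is treated as a plain dict), the flag can be added to an existing search body and the total read back together with its "relation", which tells you whether the value is exact ("eq") or only a lower bound ("gte"):

```python
def with_total_hits(search_body):
    """Return a copy of the search body that asks for an exact total."""
    return {**search_body, "track_total_hits": True}


def read_total(response):
    """Return (count, is_exact) from a search response."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x shape: {"value": N, "relation": "eq" | "gte"}
        return total["value"], total["relation"] == "eq"
    # Older versions return a bare integer.
    return total, True
```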
Elasticsearch also supports a /_count API that returns the count of all hits for a query:
GET /index/_count
{
// your search query here
"query": {
"match_all": {}
}
}
You can add "from" and "size" to visit specific hits of response
Example
GET index/_search
{
"from": 0,
"size": 100,
"query": {
"match_all": {}
}
}
The search response returned by Elasticsearch also contains the field response['hits']['total']['value'], which holds the hit count, but it has a limitation: by default it stops counting at 10,000 and reports "relation": "gte" unless track_total_hits is set.
NOTE: the /_count API doesn't support "from" and "size"; it only gives you the total count.
For more details, see the
Elasticsearch Count API.
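Since the count request takes only the query, an existing search body can be reduced to a count body by dropping everything else. A small Python sketch, with hypothetical helper names and plain dicts instead of a client:

```python
def to_count_body(search_body):
    """Keep only the "query" part of a search body for use with /_count."""
    return {"query": search_body.get("query", {"match_all": {}})}


def read_count(count_response):
    """Return the total from a /_count response, which has a top-level "count"."""
    return count_response["count"]
```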

Complex ElasticSearch Query

I have documents with (id, value, modified_date). I need to get all the documents for ids which have a specific value as of the last modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, it looks like I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. This would have been trivial in SQL, but with Elasticsearch I am at a loss. I would then need to write this in Python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" and reverse sorting by date, 3 is an analog of SQL's HAVING clause - Bucket Selector Aggregation (?), 4 _source, 5 terms-lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
This shows an example of how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
This will return the average price bucketed into one-day intervals:
GET /logstash-*/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"2": {
"date_histogram": {
"field": "@timestamp",
"interval": "1d",
"time_zone": "Europe/Berlin",
"min_doc_count": 1
},
"aggs": {
"1": {
"avg": {
"field": "price"
}
}
}
}
}
}
I wrote it so that it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.
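For the original question, one possible split between server and client is sketched below in Python with plain dicts rather than elasticsearch_dsl. The aggregation names by_id and latest, the function names, and the max_ids default are assumptions for illustration. A terms aggregation with a nested top_hits covers steps 1 and 2, the filtering of steps 3 and 4 happens on the client, and a terms query covers step 5.

```python
def latest_per_id_body(id_field="id", date_field="modified_date", max_ids=10000):
    """Steps 1-2: group by id and keep the newest document of each group."""
    return {
        "size": 0,
        "aggs": {
            "by_id": {
                # The terms aggregation returns only 10 buckets unless "size" is raised.
                "terms": {"field": id_field, "size": max_ids},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{date_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def ids_with_latest_value(response, wanted, value_field="value"):
    """Steps 3-4: keep the ids whose newest document has the wanted value."""
    ids = []
    for bucket in response["aggregations"]["by_id"]["buckets"]:
        hits = bucket["latest"]["hits"]["hits"]
        if hits and hits[0]["_source"].get(value_field) == wanted:
            ids.append(bucket["key"])
    return ids


def docs_for_ids_body(ids, id_field="id"):
    """Step 5: fetch every document belonging to the selected ids."""
    return {"query": {"terms": {id_field: ids}}}
```

This costs two round trips, and with a very large number of distinct ids the first request may need to be paged, so treat it as a starting point rather than a tuned solution.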

String range query in Elasticsearch

I'm trying to query data in an Elasticsearch cluster (2.3) using the following range query. To clarify, I'm searching on a field that contains an array of values that were derived by concatenating two ids together with a count. For example:
Schema:
{
id1: 111,
id2: 222,
count: 5
}
The query I'm using looks like the following:
Query:
{
"query": {
"bool": {
"must": {
"range": {
"myfield": {
"from": "111_222_1",
"to": "111_222_2147483647",
"include_lower": true,
"include_upper": true
}
}
}
}
}
}
The "to" field uses Integer.MAX_VALUE.
This works alright but doesn't exactly match the underlying data. Querying through other means produces more results than this method.
More strangely, trying 111_222_5 in the from field produces 0 results, while trying 111_222_10 does produce results.
How is ES (and/or Lucene) interpreting this range query and why is it producing such strange results? My initial guess is that it's not looking at the full value of the last portion of the String and possibly only looking at the first digit.
Is there a way to specify a format for the TermRange? I understand date ranging allows formatting.
A look here provides the answer.
The range comparison is lexicographic: 5 comes before 50, which comes before 6, and so on.
To get around this, I reindexed using a fixed-length string for the count:
0000000001
0000000100
0001000101
...
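The effect can be reproduced with ordinary string comparison in Python, which for ASCII values like these orders strings the same way a lexicographic term range does. The helper names and the width of 10 (enough digits for Integer.MAX_VALUE) are illustrative assumptions.

```python
def in_range(value, lower, upper):
    """Inclusive lexicographic range check on strings."""
    return lower <= value <= upper


def padded_key(id1, id2, count, width=10):
    """Build the concatenated key with a zero-padded, fixed-length count."""
    return f"{id1}_{id2}_{count:0{width}d}"


LOWER = "111_222_1"
UPPER = "111_222_2147483647"

# "5" sorts after "2", so 111_222_5 is already past the upper bound.
unpadded_5 = in_range("111_222_5", LOWER, UPPER)
# "10" starts with "1", so it falls inside the range.
unpadded_10 = in_range("111_222_10", LOWER, UPPER)

# With zero padding, string order matches numeric order.
padded_5 = in_range(
    padded_key(111, 222, 5),
    padded_key(111, 222, 1),
    padded_key(111, 222, 2147483647),
)
```

This is consistent with the behaviour described in the question: using 111_222_5 as the lower bound makes the lower bound greater than the upper bound, so nothing can match, while 111_222_10 still sorts below the upper bound.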
