Elasticsearch DSL query - Get all matching results

I am trying to search an index using a DSL query. I have many documents that match the criteria on log and the timestamp range.
I am passing dates and converting them to epoch milliseconds.
But I am specifying a size parameter in the DSL query.
What I see is that if I specify 5000, it extracts 5000 records within the time range, but there are more records in the specified time range.
How can I retrieve all data matching the time range so that I don't need to specify the size?
My DSL query is below.
GET localhost:9200/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "log": "SOME_VALUE"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "size": 5000
}
fromDate = 1519842600000
toDate = 1520533800000

I couldn't get the scan API or scroll pattern working either, as it also wasn't showing the expected results.
I finally figured out a way to capture the number of hits and then pass that as a parameter to extract the data.
curl -s -XGET 'localhost:9200/_count' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "log": "SOME_VALUE"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}' > count_size.txt
size_count=`cat count_size.txt | cut -d "," -f1 | cut -d ":" -f2`
echo "Total hits matching this criteria is ${size_count}"
From this I get the size_count value.
If this value is less than 10000, I extract the data; otherwise I reduce the time range for the extraction.
GET localhost:9200/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "log": "SOME_VALUE"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "size": '"${size_count}"'
}
If a large set of data is required for an extensive period, I need to run this with different sets of dates and combine the results to get the overall required reports.
This complete piece of code is written in a shell script, so it is much simpler for me to use.
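For reference, newer Elasticsearch versions also offer search_after for paging through all hits without raising size or using scroll. A rough sketch against the same query (sort field assumed to be @timestamp; ideally add a unique tiebreaker field to the sort so documents with identical timestamps are not skipped):
GET localhost:9200/_search
{
  "size": 5000,
  "sort": [
    { "@timestamp": "asc" }
  ],
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "log": "SOME_VALUE" } },
        {
          "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "search_after": [1519842600000]
}
Omit search_after on the first request; every following page repeats the query with the sort value of the last hit from the previous page (the example value 1519842600000 above just stands in for that sort value).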

Related

How to sum the size of documents within a time interval?

I'm attempting to estimate the sum of the sizes of n documents across an index using the query below:
GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "total_size": {
      "sum": {
        "field": "doc['_source'].bytes"
      }
    }
  }
}
This returns documents, but the value of the aggregation is 0:
"aggregations" : {
  "total_size" : {
    "value" : 0.0
  }
}
How can I sum the size of documents within a time interval?
The best way to achieve what you want is to add another field that contains the real source size at indexing time.
However, if you just want to run it once to see what it looks like, you can leverage runtime fields to compute this at search time; just be aware that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap, yielding this:
GET /events/_search
{
  "runtime_mappings": {
    "source.size": {
      "type": "double",
      "script": """
        def size = params._source.toString().length() * 8;
        emit(size);
      """
    }
  },
  "query": {
    "bool": {
      "must": [
        { "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "size": {
      "sum": {
        "field": "source.size"
      }
    }
  }
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.
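If you go the plugin route, a rough sketch of what that could look like, assuming the mapper-size plugin is installed and the data is reindexed into a new index created with _size enabled (events-with-size is just an illustrative name):
PUT /events-with-size
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}

GET /events-with-size/_search
{
  "size": 0,
  "query": {
    "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } }
  },
  "aggs": {
    "total_size": {
      "sum": { "field": "_size" }
    }
  }
}
The _size metadata field stores the size in bytes of the original _source for each document, so the sum reflects the real indexed payload rather than an approximation.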

Elasticsearch: filter results based on the date range

I'm using Elasticsearch 6.6 and trying to extract multiple results/records based on multiple values (email_address) passed to a bool query over a date range. For example: I want to extract information about a few employees based on their email_address (annie@test.com, charles@test.com, heman@test.com) and from the period, i.e. project_date (2019-01-01).
I used a should clause, but unfortunately it pulls all the records from Elasticsearch within the date range, i.e. it even pulls other employees' information from project_date 2019-01-01.
{
  "query": {
    "bool": {
      "should": [
        { "match": { "email_address": "annie@test.com" } },
        { "match": { "email_address": "chalavadi@test.com" } }
      ],
      "filter": [
        { "range": { "project_date": { "gte": "2019-08-01" } } }
      ]
    }
  }
}
I also tried a must clause, but got no results. Could you please help me find the employees by their email_address within the date range?
Thanks in advance.
Should (OR) clauses are optional
Quoting from this article.
"In a query, if must and filter queries are present, the should query occurrence then helps to influence the score. However, if bool query is in a filter context or has neither must nor filter queries, then at least one of the should queries must match a document."
So in your query, should only influences the score and does not actually filter documents. You must wrap the should inside a must, or move it into filter (if scoring is not required).
GET employeeindex/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "projectdate": {
            "gte": "2019-01-01"
          }
        }
      },
      "must": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "email.raw": "abc@text.com"
                }
              },
              {
                "term": {
                  "email.raw": "efg@text.com"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
You can also replace the should clause with a terms clause as in @AlwaysSunny's answer.
You can do it in a shorter way with terms and range clauses inside filter, along with your existing query. Your existing query doesn't work as expected because of the should clause; it makes your filter weaker. Read more here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "email_address.keyword": [
              "annie@test.com",
              "chalavedi@test.com"
            ]
          }
        },
        {
          "range": {
            "project_date": {
              "gte": "2019-08-01"
            }
          }
        }
      ]
    }
  }
}
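If you'd rather keep the should clauses next to the filter, another option is to force at least one of them to match with minimum_should_match; a minimal sketch of the same query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "email_address": "annie@test.com" } },
        { "match": { "email_address": "chalavadi@test.com" } }
      ],
      "minimum_should_match": 1,
      "filter": [
        { "range": { "project_date": { "gte": "2019-08-01" } } }
      ]
    }
  }
}
When a must or filter clause is present, should defaults to a minimum_should_match of 0, which is why the original query returned everyone in the date range; setting it to 1 restores the "at least one email must match" behaviour.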

Elasticsearch - retrieving documents only, if multiple match by specific field

I have an index in Elasticsearch with users' posts. I want to retrieve user_id from this index if, for a given date range, there are at least X posts; otherwise such users should be skipped.
Is there any way I can achieve this in ES, or do I have to get all entities and handle them later?
Trawa ;)
To answer your question I'll assume you have the fields user and datetime in your mapping.
You can get the requested data like so:
Get the list of users who have more than X (e.g. X=100) posts in the given date range by aggregating on the user name for that date range:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "datetime": {
              "gte": "2017-05-01",
              "lt": "2017-06-01"
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "users": {
      "terms": {
        "field": "user",
        "min_doc_count": 100
      }
    }
  }
}
Edit the query to match your date range (and its format), and set min_doc_count to the minimum of X posts per user.
EDIT:
There is no way to avoid a terms aggregation if you need all the distinct values.
50k values does seem like a lot of data to retrieve, but it also depends on your cluster.
My suggestion is to add another filter, say an alphabetical one, so that instead of getting 50k results at once you split the work across several queries, each restricted to a different first letter (run once with "a*", again with "b*", and so on):
"must": [
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
},
{
"wildcard": {
"user": "a*"
}
},
{
"wildcard": {
"user": "b*"
}
}
]
See Wildcard
Unfortunately, scrolling over aggregation results is not available. Manually dividing the data into pieces is the best thing I can see right now.
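As a rough alternative to the wildcard trick, recent Elasticsearch versions (5.2+) let the terms aggregation partition its keys for you, so each request returns only one slice of the users; a sketch assuming the same field names as above:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "datetime": { "gte": "2017-05-01", "lt": "2017-06-01" } } }
      ]
    }
  },
  "aggregations": {
    "users": {
      "terms": {
        "field": "user",
        "min_doc_count": 100,
        "size": 5000,
        "include": {
          "partition": 0,
          "num_partitions": 10
        }
      }
    }
  }
}
Run the same request for partition 0 through 9 to cover everyone; each user name hashes into exactly one partition, so the slices don't overlap.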

How to check whether field data is numeric when using an inline script in Elasticsearch

Per our requirements, we need to find the max ID of the documents before adding a new document. The problem is that the doc may also contain string data, so I had to use an inline script in the Elasticsearch query to find the max ID only for documents that have integer data, and return 0 otherwise. I am using the following inline script query to find the max Key, but it is not working. Can you help me with this?
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Name": {
              "value": "Test2"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "MaxId": {
      "max": {
        "field": "Key",
        "script": {
          "inline": "((doc['Key'].value).isNumber()) ? Integer.parseInt(doc['Key'].value) : 0"
        }
      }
    }
  }
}
The error occurs because the max aggregation only supports numeric fields, i.e. you cannot specify a string field (Key) in a max aggregation.
Simply remove the "field": "Key" part and keep only the script part:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Name": "Test2"
          }
        }
      ]
    }
  },
  "aggs": {
    "MaxId": {
      "max": {
        "script": {
          "source": "((doc['Key'].value).isNumber()) ? Integer.parseInt(doc['Key'].value) : 0"
        }
      }
    }
  }
}
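Note that String.isNumber() comes from the older Groovy scripting; if your cluster runs Painless (the default since 5.x), that method may not be available, and a try/catch variant along these lines, assuming Key has keyword doc values, should behave the same:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "Name": "Test2" } }
      ]
    }
  },
  "aggs": {
    "MaxId": {
      "max": {
        "script": {
          "source": "try { return Integer.parseInt(doc['Key'].value); } catch (Exception e) { return 0; }"
        }
      }
    }
  }
}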

Get events count by last minute and event level

I have parsed events with a field like "level" (DEBUG, INFO, ERROR, FATAL). How can I retrieve the count of events from the last minute with level type = ERROR?
(screenshot from Kibana)
I'm trying it like this:
curl -XGET 'mysite.com:9200/myindex/_count?pretty=true' -d '
{
  "query": {
    "term": {
      "level": "error"
    }
  },
  "filter": {
    "range": {
      "_timestamp": {
        "gt": "now-1m"
      }
    }
  }
}'
You must have a timestamp on your events. If yes, write a count/aggregation query on the events with query filters for the level type and a timestamp range (Elasticsearch does support ranges on time/date fields with the 'now' parameter).
The confusing part is that you didn't mention what kind of count you want: the total event count, or a count by type or some other parameter (in that case, use a terms aggregation on that parameter).
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/mapping-date-format.html#date-math
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "level": "trace"
              }
            },
            {
              "range": {
                "timestamp": {
                  "gt": "now-1m"
                }
              }
            }
          ]
        }
      }
    }
  }
}
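If you want the count broken down by level rather than a single total, a terms aggregation over the last minute could look roughly like this (modern bool syntax; the field names level and timestamp are taken from the examples above, and level is assumed to be mapped as a keyword, otherwise use level.keyword):
GET myindex/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gt": "now-1m" } } }
      ]
    }
  },
  "aggs": {
    "events_per_level": {
      "terms": {
        "field": "level"
      }
    }
  }
}
Each bucket's doc_count is the number of events of that level in the last minute; to count only ERROR events, add a term filter on level and read the total hits instead (or use the _count API).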
