Return every nth record in Elasticsearch

I have time series data and I want to query Elasticsearch by time range, returning a fixed set of 2000 records.
I have this query:
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
}
}
Is it possible to return only every 5th or 10th record?
I found some filter scripts but none of them seems to work.
Since there could be millions of records in one index, it's crucial to limit the number of returned values!
EDIT: reworked the query because filtered was replaced by bool:
{
"_source":[
"XXX1",
"XXX2",
"timestamp"
],
"sort":{
"#timestamp":{
"order":"asc"
}
},
"query":{
"bool":{
"must":{
"range":{
"#timestamp":{
"gte":"2017-02-10T10:55:31,259Z",
"lte":"2017-02-10T10:55:32,272Z"
}
}
},
"filter":{
"script":{
"script":"doc['#timestamp'].value % 5 == 0"
}
}
}
}
}

One way to do it is to add a field that behaves like an auto-increment column in a database.
Then you can add a filter on that field to the query that you want to run:
"filter": {
"script": {
"script": "doc['auto_increment'].value % n == 0",
"params" : {
"n" : 5
}
}
}
This should work for indexes that hold time series data and are searched by range. It will not work properly if the query also applies a text search, since the matching documents then no longer have consecutive counter values.
For the query that you are trying, it would transform into something like this:
GET http://IP:9200/MYINDEX/_search
{
"_source": ["XXX1", "XXX2","timestamp"],
"sort" :
{ "#timestamp" : {"order" : "asc"}},
"query" : {
"filtered": {
"query": {
"range" : {
"#timestamp" : {
"gte" : "2017-02-10T10:55:31,259Z",
"lte" : "2017-02-10T10:55:32,272Z"
}
}
},
"filter": {
"script": {
"script": "doc['auto_increment'].value % 5 == 0"
}
}
}
}
}
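Since the question's edit mentions that filtered was replaced by bool, here is a minimal sketch of how the same idea might be expressed as a bool query. It is a Python helper that only builds the request body and does not talk to a cluster. The auto_increment field comes from the answer above; the function name every_nth_query, the "source" key and the params.n syntax are assumptions about a newer Elasticsearch with Painless scripting (some 5.x releases used "inline" instead of "source"), so check them against the version you run.

```python
import json


def every_nth_query(start, end, n, counter_field="auto_increment"):
    """Build a search body that keeps every n-th document in a time range.

    Assumes every document carries an increasing integer in `counter_field`,
    as suggested in the answer. The helper name is hypothetical.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return {
        "_source": ["XXX1", "XXX2", "timestamp"],
        "sort": {"#timestamp": {"order": "asc"}},
        "query": {
            "bool": {
                # Several clauses go into one filter array.
                "filter": [
                    {"range": {"#timestamp": {"gte": start, "lte": end}}},
                    {
                        "script": {
                            "script": {
                                # Assumed Painless syntax: parameters are read via params.<name>.
                                "source": f"doc['{counter_field}'].value % params.n == 0",
                                "params": {"n": n},
                            }
                        }
                    },
                ]
            }
        },
    }


body = every_nth_query("2017-02-10T10:55:31,259Z", "2017-02-10T10:55:32,272Z", 5)
print(json.dumps(body, indent=2))
```

Passing n as a script parameter rather than hard-coding it keeps the script text identical between requests, which is the same design the params block in the answer hints at.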

Related

Elasticsearch - EXISTS syntax + Filter not working

I am trying to query for a date range where a particular field exists. This seems like it would be easy but I am sensing that the keyword "exists" has changed per the documentation. I am on 5.4. https://www.elastic.co/guide/en/elasticsearch/reference/5.4/query-dsl-exists-filter.html
I use #timestamp for dates, and the field "error_data" is in the mapping but only appears if an error condition is found.
Here is my query:
GET /filebeat-2017.07.25/_search
{
"query": {
"bool" : {
"filter" : {
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
},
"exists": {
"field": "error_data"
}
}
}
}
but it fails with "[bool] query does not support [exists]". The following does not work either; it gets a parsing error of "[exists] malformed query, expected [END_OBJECT] but found [FIELD_NAME]" on line 6 column 9. Thanks for your help.
GET /filebeat-2017.07.25/_search
{
"query": {
"exists": {
"field": "error_data"
},
"bool" : {
"filter" : {
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
}
}
}
}
You're almost there. Try like this:
GET /filebeat-2017.07.25/_search
{
"query": {
"bool" : {
"filter" : [
{
"range" : {
"#timestamp" : {
"gte" : "now-5m",
"lte" : "now-1m"
}
}
},
{
"exists": {
"field": "error_data"
}
}
]
}
}
}
That is, the bool/filter clause must be an array if you have several clauses to put in it.
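If the request body is assembled in code, the array form falls out naturally. A minimal Python sketch, assuming a hypothetical helper named bool_filter that is not part of any Elasticsearch client; it only builds the dictionary that would be sent as JSON:

```python
def bool_filter(*clauses):
    """Wrap any number of query clauses in a bool/filter array."""
    return {"query": {"bool": {"filter": list(clauses)}}}


body = bool_filter(
    {"range": {"#timestamp": {"gte": "now-5m", "lte": "now-1m"}}},
    {"exists": {"field": "error_data"}},
)
print(body)
```

Each clause stays its own object inside the list, so adding a third condition is just one more argument.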

Elastic Search must not queries are slow

I have a test index of 50K documents.
I'm firing 500 identical queries against it, each with a clause that a field (which is an array of values) "must not" be of "some value".
Out of these 500 queries several fail/time out. (Sometimes it's 5, sometimes it's 9, sometimes it's 18 queries...) Is there a way to make the "must not" queries faster? In production the index is going to be several million docs, and the majority of queries are going to have "must not" clauses.
Mapping is as follows:
{
"jobs_en":{
"mappings":{
"index":{
"_all":{
"enabled":false
},
"properties":{
"GUID":{
"type":"string",
"index":"not_analyzed"
},
"channel":{
"type":"string",
"index":"not_analyzed"
},
"country":{
"type":"string",
"analyzer":"standard"
}
}
}
}
}
}
The query is as follows:
{
"bool" : {
"must" : [ {
"bool" : {
"must" : {
"bool" : { }
},
"must_not" : {
"term" : {
"channel" : "Email"
}
}
}
}, {
"bool" : {
"must" : {
"match" : {
"country" : {
"query" : "US",
"type" : "boolean"
}
}
}
}
} ]
}
}
We have a large database in ES, though I don't think it is as large as yours. Several things help me:
1. Use must if you can.
2. Use must_not together with must.
3. If you are able to, limit the returned fields with _source.
"query" : {
"bool" : {
"must": [
{"term": {
"createUser": {
"value": "processor.imsignal"
}
}
},
{"terms" : {
"imcampaignid" : [70191,66983,70188,70235,70190]
}
}
],
"must_not": [
{"term": {
"source": {
"value": "EMAIL"
}
}
},
{"terms" : {
"category" : ["campaign_email","unsubscribe","from_email"]
}
}
]
}
},
"_source": ["category","source","accountPlatformID"]
Specifying a must clause first speeds up the query. Adding must_not can reduce the number of returned records, and returning many records can be a real performance hit. Finally, reducing the fields returned for those records can really help.
Since there was no other answer, I figured I'd help with what I knew. Believe it or not, this query with the must_not outperforms the identical query with only the musts for my purposes by tens of seconds. Telling Elasticsearch what a document should be is essential; then filter by what it should not be.
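Applied to the fields from the question, the three points might look roughly like the sketch below. It is Python that only builds the request body; the helper name must_with_must_not and the choice of _source fields are assumptions for illustration, and whether this shape fixes the timeouts described in the question is not something this thread establishes.

```python
def must_with_must_not(must, must_not, source_fields):
    """Combine positive and negative clauses and limit the returned fields."""
    return {
        "query": {"bool": {"must": must, "must_not": must_not}},
        "_source": source_fields,
    }


body = must_with_must_not(
    must=[{"match": {"country": "US"}}],         # what a document should be
    must_not=[{"term": {"channel": "Email"}}],   # what it should not be
    source_fields=["GUID", "channel"],           # return only these fields
)
print(body)
```

Compared with the query in the question, this drops the empty nested bool and keeps must and must_not in a single bool clause.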

Elasticsearch - Remove double results in search

I don't know how to remove duplicate results that have the same value in one field.
My search query:
{
"query": {
"range": {
"endtime": {
"lt": "2017-02-09T20:00:00",
"gt": "2017-02-09T01:00:00"
}
}
}
}
In my results there's one field called "link" which often has the same value (e.g. https://www.facebook.com).
A solution that works within my query would be great.
Thanks.
Greetings!
You can do a terms aggregation.
GET /cars/transactions/_search?search_type=count
{
"query": {
"range" : {
"endtime" : {
"gt" : "2017-02-09T01:00:00",
"lt" : "2017-02-09T20:00:00"
}
}
},
"aggs": {
"distinct_links": {
"terms": {
"field": "link",
"size": 100
}
}
}
}
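Something along these lines should work; each bucket in the response holds one distinct link value and its document count.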
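Reading the distinct values back out of the response could be sketched as follows, in Python. The aggregation name distinct_links matches the request above; the sample response is made up for illustration and only mimics the usual aggregations / buckets / key / doc_count layout of a terms aggregation.

```python
def distinct_values(response, agg_name="distinct_links"):
    """Return (value, doc_count) pairs from a terms aggregation response."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hypothetical response fragment, not real data.
sample_response = {
    "aggregations": {
        "distinct_links": {
            "buckets": [
                {"key": "https://www.facebook.com", "doc_count": 3},
                {"key": "https://example.org", "doc_count": 1},
            ]
        }
    }
}

for link, count in distinct_values(sample_response):
    print(link, count)
```

Note that this yields one entry per distinct link rather than one full document per link, so it answers "which links occur" but not "which document to show for each".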

Difference between using multiple filters and specifying multiple filters in a single "and" clause

I am new to Elasticsearch and don't know what the difference between the following two queries is. Is it just processing time, or are they fundamentally different queries?
1) filters : { and: [{
"bool" : {
"should" : {
"term" : {
"Code" : "1510"
}
}
}
}
,
{
"bool" : {
"should" : {
"term" : {
"Id" : "Id3"
}
}
}
}] }
2) filter: [{
"bool" : {
"must" : [{
"term" : {
"Code" : "1510"
}
}, {
"term" : {
"Id" : "Id3"
}
}]
}
}]
The queries in the OP are logically equivalent.
That being said, I find 2) more intuitive, readable and simpler.
Generally, bool filters are preferred over and filters for performance reasons, although for the queries in question I doubt the difference is perceptible.
Also, the and filter in 1) is better written as follows:
"filter": {
"and": [
{
"term": {
"Code": "1510"
}
},
{
"term": {
"Id": "Id3"
}
}
]
}
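To make the equivalence concrete, here is a small Python sketch that rewrites an and filter into the bool/must form. The function and_to_bool is hypothetical and works purely on dictionaries; it assumes the and filter uses the plain array form shown above.

```python
def and_to_bool(filter_clause):
    """Rewrite {"and": [...]} as {"bool": {"must": [...]}}, recursively."""
    if "and" not in filter_clause:
        # Leaf clauses such as term filters are returned unchanged.
        return filter_clause
    return {"bool": {"must": [and_to_bool(c) for c in filter_clause["and"]]}}


and_filter = {"and": [{"term": {"Code": "1510"}}, {"term": {"Id": "Id3"}}]}
print(and_to_bool(and_filter))
```

Every clause of the and array simply becomes an entry of must, which is why both forms select the same documents.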

filter by child frequency in ElasticSearch

I currently have parent documents indexed in Elasticsearch and child documents (comments) related to them.
My first objective was to search for a document with more than N comments, based on a child query. Here is how I did it:
documents/document/_search
{
"min_score": 0,
"query": {
"has_child" : {
"type" : "comment",
"score_type" : "sum",
"boost": 1,
"query" : {
"range": {
"date": {
"lte": 20130204,
"gte": 20130201,
"boost": 1
}
}
}
}
}
}
I used the score to count the comments a document has and then filtered the documents by this count, using "min_score".
Now, my objective is to search not just comments, but several other child documents related to the document, always based on frequency. Something like the query below:
documents/document/_search
{
"query": {
"match_all": {
}
},
"filter" : {
"and" : [{
"query": {
"has_child" : {
"type" : "comment",
"query" : {
"range": {
"date": {
"lte": 20130204,
"gte": 20130201
}
}
}
}
}
},
{
"or" : [
{"query": {
"has_child" : {
"type" : "comment",
"query" : {
"match": {
"text": "Finally"
}
}
}
}
},
{ "query": {
"has_child" : {
"type" : "comment",
"query" : {
"match": {
"text": "several"
}
}
}
}
}
]
}
]
}
}
The query above works fine, but it doesn't filter based on frequency as the first one does. As filters are computed before scores are calculated, I cannot use min_score to filter each child query.
Any solutions to this problem?
There is no score at all associated with filters. I'd suggest to move the whole logic to the query part and use a bool query to combine the different queries together.
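A rough sketch of what moving the logic into the query part could look like, written as Python that only builds the request body. The has_child parameters, including score_type, are copied from the question (newer versions may name that parameter differently); the helper names and the use of minimum_should_match for the "or" part are assumptions. Note that min_score then applies to the combined score of the whole bool query, not to each child query separately.

```python
def has_child(child_type, inner_query, score_type=None):
    """Build a has_child query clause; score_type is optional."""
    clause = {"type": child_type, "query": inner_query}
    if score_type is not None:
        clause["score_type"] = score_type
    return {"has_child": clause}


def combined_child_query(min_score):
    """Combine the child queries from the question in one bool query."""
    date_range = {"range": {"date": {"gte": 20130201, "lte": 20130204}}}
    return {
        "min_score": min_score,
        "query": {
            "bool": {
                # Former "and" part: comments in the date range, scored by sum.
                "must": [has_child("comment", date_range, score_type="sum")],
                # Former "or" part: at least one of these must match.
                "should": [
                    has_child("comment", {"match": {"text": "Finally"}}),
                    has_child("comment", {"match": {"text": "several"}}),
                ],
                "minimum_should_match": 1,
            }
        },
    }


print(combined_child_query(2))
```

Because the should clauses also contribute to the score, the min_score threshold would need to be tuned against real data before relying on it as a comment count.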
