Elastic relative data math - finding all things today - elasticsearch

I'm trying to do so fairly simple query with Elasticsearch, but I don't think I understand what I'm doing wrong, so I'm posting here for some pointers.
I have an elastic index where each document has a date like so:
{
// edited for brevity
"releasedate": "2020-10-03T15:55:03+00:00",
}
and I am using django DRF to make queries like so, where I pass this value along &releasedate__gt=now-3d/d
Which ends up with an elastic range query like this.
{
"from": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"releasedate": {
"gt": "now/d-3d"
}
}
}
]
}
},
"size": 10,
"sort": [
"_score"
]
}
If I want to see all "documents since yesterday", I think of it in terms of all documents with releasedate greater than midnight yesterday, I figured the key part of the query would need to be like so:
{
"query": {
"bool": {
"filter": [
{
"range": {
"releasedate": {
"gt": "now/d-1d"
}
}
}
]
}
}
}
So I expect this would round the time now, to 00:00 today, then go back one day.
So if I ran this on 2020-10-04. I'd assume this would catch a document with the release date of 2020-10-03T15:55:03+00:00.
Here's my reasoning
Rounding down with now/d would take us to 2020-10-04T00:00.
And then going back one day with -1d would take us to 2020-10-03T00:00.
This ought to include the document, but I'm not seeing it. I need to look back more than one day to find the documents, so I need to use now/d-2d to find matching documents.
Any idea why this might be? I'm unsure of how to see what now/d-1d evaluates in terms of a timezone aware object, to check - that's what I might reach for, but I don't know how with Elastic.
FWIW, this is using Elastic 5.6. We'll be updating soon.

I'd say that once you round down to the nearest day (either with now-2d/d or now/d-2d -- as you did), the gt query's intervals will indeed be day-based.
In other words, gt : 2020-10-03T00:00 is >= 2020-10-04T00:00. So what you need instead of gt is gte and that'll work as >=2020-10-03T00:00.

Related

Find same text within time range

I'm storing articles of blogs in ElasticSearch in this format:
{
blog_id: keyword,
blog_article_id: keyword,
timestamp: date,
article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids that have articles with the same word at least n times within a date range?
Is this the right way to model the problem or should I use a nested objects for an easier query?
Can this be made into a report in Kibana?
The simplest query that comes to mind is
{
"_source": "blog_id",
"query": {
"bool": {
"must": [
{
"match": {
"article_text": "xyz"
}
},
{
"range": {
"timestamp": {
"gte": "now-30d"
}
}
}
]
}
}
}
nested objects are most probably not going to simplify anything -- on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kib. query lang) or using the dropdowns & choose a metric that you want to track (total blog_id count, timeseries frequency etc.)
EDIT re # of occurrences:
I know of 2 ways:
there's the term_vector API which gives you the word frequency information but it's a standalone API and cannot be used at query time.
Then there's the scripted approach whereby you look at the whole article text, treat is as a case-sensitive keyword, and count the # of substrings, thereby eliminating the articles with non-sufficient word frequency. Note that you don't have to use function_score as I did -- a simple script query will do. it may take a non-trivial amount of time to resolve if you have non-trivial # of docs.
In your case it could look like this:
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
def word = 'xyz';
def docval = doc['article_text.keyword'].value;
String temp = docval.replace(word, "");
def no_of_occurences = ((docval.length() - temp.length()) / word.length());
return no_of_occurences >= 2;
"""
}
}
}
]
}
}
}

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good, however, by default the maximum number of hits you can return in a top_hits aggregation is 100, if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
So, if your parent documents have less than 100 children, why not, otherwise I'd seriously consider going another approach.

Elastic search wildcard query crashes cluster

I run the query below on a large elastic search cluster. The cluster bcomes unresponsive
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"regexp": {
"message": {
"value": ".*exception.*"
}
}
},
{
"bool": {
"should": [
{
"term": {
"beat.hostname": "ip-xxx-xx-xx-xx"
}
}
]
}
},
{
"range": {
"#timestamp": {
"lt": 1518459660000,
"format": "epoch_millis",
"gte": 1518459600000
}
}
}
]
}
}
}
When I remove the wildcarded .*exception.* and replace it with any non wildcarded string like xyz it returns fast. Though the query uses a wildcarded expression, it also looks for a small time range and a specific host. I would think this is a very simple query. Any reason why elasticsearch server can't handle this query? The cluster has 10 nodes and 20 TB of data.
See the documentation for Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular
expression chosen. Matching everything like .* is very slow
What would be ideal is to change the text analysis on the message field with a WordDelimiterTokenFilter and set split_on_case_change to true. Then something like NullPointerException will get indexed as three separate tokens [Null, Pointer, Exception]. This can help you search on exception without using a regex. Caveat is you need to reindex all your documents.
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.

get documents irrespective of years in elasticsearch

I want to get all documents on 8 Dec irrespective of years. I have tried two queries but both fails, Is there any way to calculate this?
First Query
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"myDate": {
"gte": "12-08",
"lte": "12-08",
"format": "MM-dd"
}
}
}
]
}
}
}
Second Query
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"mydate": "12-08"
}
}
]
}
}
}
Unfortunately, I don't think that will be easily possible. DateTime datatypes are actually just long numbers. The range query will also transform the defined input into a number. Example: now -> 1497541939892. See https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html for more information - specifically this:
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
With that in mind, you would have to subtract 1 (or x) years (in milliseconds) for every subquery. That doesn't sound practical.
I think your best bet would be, to additionally index the day and month - and maybe year as well - separately. Then you would be able to query just by month/day, which would be integer values. I don't know if that is easily done in your case, but I really have no other idea right now.

How to get only x results from elastic and then stop searching?

My whole index is about 700M docs, this query:
{
"query": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
},
"size": 10
}
applies to ca 5M docs. "Some_field" is indexed, not analysed.
Query takes ca 1s on average hetzner. Too slow :) I don't care about pagination or sorting or scoring. I just want 10 first "random" matching docs.
Is there the way to do it with disabled score, in the "mysql way"?
filter or constant_score do not help
If you go with filters, that will remove the score computation and should provide faster query speeds:
{
"query": {
"bool": {
"filter": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
}
}
}
"size": 10
}
If that's still too slow, you could consider using document routing, but it may not be a viable option for you as you might have just 1 shard or very few terms for SOME_FIELD.
I also suggest you go over the production deployment document by Elastic, it gives you an overview on how to configure your cluster optimally and can also produce some serious performance boost in case you currently have a misconfigured cluster, i.e. running on a strong machine but keeping the default ES_HEAP_SIZE value.
The option i was looking for is "terminate_after". Unfortunately it is not "very well" documemented, see:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-limit-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-count.html#_request_parameters
so, my query looks like this:
{
"query": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
},
"size": 10,
"terminate_after": 10
}
Don't use "10" instead of 10. Elastic does not cast it to integer and ignores the parameter

Resources