Elasticsearch date range intersection

I'm storing something like the following information in Elasticsearch:
{ "timeslot_start_at": "2013-02-01", "timeslot_end_at": "2013-02-03" }
Given another date range (from user input, for example), I want to search for an intersecting time range, similar to "Determine Whether Two Date Ranges Overlap", which outlines the logic I'm after:
(StartDate1 <= EndDate2) and (StartDate2 <= EndDate1)
But I'm unsure how to fit this into an Elasticsearch query. Would I use a range filter and only set the 'to' values, leaving the 'from' blank? Or is there a more efficient way of doing this?

Update: it is now possible to use the date_range data type, added in Elasticsearch 5.2. For earlier versions of Elasticsearch the following solution still applies.
To test for intersection, you should combine two range queries into a single query using a bool query:
{
  "bool": {
    "must": [
      {
        "range": {
          "timeslot_start_at": {
            "lte": "2013-02-28"
          }
        }
      },
      {
        "range": {
          "timeslot_end_at": {
            "gte": "2013-02-03"
          }
        }
      }
    ]
  }
}
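For reference, a sketch of the date_range approach mentioned in the update: a hypothetical timeslots index where a single timeslot field replaces the two dates (typeless mapping syntax of newer Elasticsearch versions; the format is an assumption):
PUT /timeslots
{
  "mappings": {
    "properties": {
      "timeslot": {
        "type": "date_range",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

PUT /timeslots/_doc/1
{
  "timeslot": { "gte": "2013-02-01", "lte": "2013-02-03" }
}

GET /timeslots/_search
{
  "query": {
    "range": {
      "timeslot": {
        "gte": "2013-02-03",
        "lte": "2013-02-28",
        "relation": "intersects"
      }
    }
  }
}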

Related

Find same text within time range

I'm storing blog articles in Elasticsearch in this format:
{
  blog_id: keyword,
  blog_article_id: keyword,
  timestamp: date,
  article_text: text
}
Suppose I want to find all blogs with articles that mention X at least twice within the last 30 days. Is there a simple query to find all blog_ids whose articles contain the same word at least n times within a date range?
Is this the right way to model the problem, or should I use nested objects for an easier query?
Can this be made into a report in Kibana?
The simplest query that comes to mind is:
{
  "_source": "blog_id",
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "article_text": "xyz"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-30d"
            }
          }
        }
      ]
    }
  }
}
Nested objects are most probably not going to simplify anything -- on the contrary.
Can it be made into a Kibana report?
Sure. Just apply the filters either in KQL (Kibana Query Language) or using the dropdowns, and choose a metric you want to track (total blog_id count, time-series frequency, etc.).
EDIT regarding the number of occurrences:
I know of two ways:
There's the term vectors (_termvectors) API, which gives you word-frequency information, but it's a standalone per-document API and cannot be used at query time (see the sketch after the query below).
Then there's the scripted approach, whereby you look at the whole article text, treat it as a case-sensitive keyword, and count the number of substring occurrences, thereby eliminating the articles with insufficient word frequency. Note that you don't have to use function_score as I did -- a simple script query will do. It may take a non-trivial amount of time to resolve if you have a non-trivial number of docs.
In your case it could look like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                // assumes article_text has a keyword sub-field with doc values
                def word = 'xyz';
                def docval = doc['article_text.keyword'].value;
                // count occurrences by measuring how much shorter the text
                // becomes once the word is removed
                String temp = docval.replace(word, "");
                def no_of_occurrences = ((docval.length() - temp.length()) / word.length());
                return no_of_occurrences >= 2;
              """
            }
          }
        }
      ]
    }
  }
}
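For reference, the term vectors API mentioned above is invoked per document, outside of search -- a minimal sketch (index name and document id are hypothetical):
GET blogs/_termvectors/1
{
  "fields": ["article_text"],
  "term_statistics": false,
  "positions": false,
  "offsets": false
}
The response lists each term in article_text together with its frequency in that document.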

How to calculate the overlap / elapsed time range in elasticsearch?

I have some records in ES; they are online meeting records where people join and leave at different times.
{ "name": "p1", "join": "2017-11-17T00:01:00.293Z", "leave": "2017-11-17T00:06:00.293Z" }
{ "name": "p2", "join": "2017-11-17T00:02:00.293Z", "leave": "2017-11-17T00:04:00.293Z" }
{ "name": "p3", "join": "2017-11-17T00:03:00.293Z", "leave": "2017-11-17T00:05:00.293Z" }
The time ranges could look something like this:
p1: [==================================================]
p2:           [====================]
p3:                     [====================]
The question is how to calculate the overlapping time range (common/shared meeting time), which here should be 3 minutes (00:02 to 00:05, when at least two people are present).
A further question: is it possible to know from when to when there are 1/2/3 people present? (2 minutes with 2 people; 1 minute with 3 people)
I don't think it's possible with ES alone, simply because the search would have to visit all matching documents and compute across them.
I would do it in the following steps:
1. Before indexing a new document, search for documents which overlap it (existing join <= new leave AND existing leave >= new join; here the new document's times are p1's):
GET /meetings/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "join": {
              "lte": "2017-11-17T00:06:00.293Z"
            }
          }
        },
        {
          "range": {
            "leave": {
              "gte": "2017-11-17T00:01:00.293Z"
            }
          }
        }
      ]
    }
  }
}
2. Calculate everything you need on the back end across the documents that overlap.
3. Save the overlap metadata you need back to the documents as a nested object.
You can do the first part easily using max(join) and min(leave):
GET your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "startTime": {
      "max": {
        "field": "join"
      }
    },
    "endTime": {
      "min": {
        "field": "leave"
      }
    }
  }
}
And then you can compute endTime - startTime, either when you process the Elasticsearch response or with a bucket script aggregation. If the result is negative, there is no common overlap.
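A minimal sketch of the bucket-script variant -- bucket_script needs a multi-bucket parent, so one trick is to wrap the two metrics in a single-bucket filters aggregation (aggregation names here are illustrative):
GET your_index/_search
{
  "size": 0,
  "aggs": {
    "all_meetings": {
      "filters": {
        "filters": { "all": { "match_all": {} } }
      },
      "aggs": {
        "startTime": { "max": { "field": "join" } },
        "endTime": { "min": { "field": "leave" } },
        "overlapMillis": {
          "bucket_script": {
            "buckets_path": { "start": "startTime", "end": "endTime" },
            "script": "params.end - params.start"
          }
        }
      }
    }
  }
}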
For the second one, it depends on what you want:
If you want the exact boundaries, which may be hard to read, you can do it using a Scripted Metric Aggregation.
If you want to have the number per slot (hour for instance) it may be easier to use a Date Histogram Aggregation.
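As a rough sketch of the histogram idea: if the interval is additionally indexed as a hypothetical presence field of type date_range, newer Elasticsearch versions can bucket range fields directly, counting each document in every slot its range overlaps:
GET your_index/_search
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": {
        "field": "presence",
        "fixed_interval": "1m"
      }
    }
  }
}
The per-bucket doc_count then tells you how many people were present in each slot.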

Get documents irrespective of year in Elasticsearch

I want to get all documents from 8 Dec, irrespective of year. I have tried two queries but both fail. Is there any way to do this?
First query:
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "myDate": {
              "gte": "12-08",
              "lte": "12-08",
              "format": "MM-dd"
            }
          }
        }
      ]
    }
  }
}
Second query:
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "myDate": "12-08"
          }
        }
      ]
    }
  }
}
Unfortunately, I don't think that will be easily possible. Date datatypes are actually just long numbers, and the range query will also transform the given input into a number (for example, now -> 1497541939892). See https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html for more information, specifically this:
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
With that in mind, you would have to subtract one (or x) years, in milliseconds, for every subquery. That doesn't sound practical.
I think your best bet would be to additionally index the day and month (and maybe the year as well) separately. Then you would be able to query just by month/day, which would be integer values. I don't know if that is easily done in your case, but I have no other idea right now.
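A minimal sketch of that idea, with hypothetical myMonth/myDay integer fields populated at index time:
PUT /my_index/my_type/1
{
  "myDate": "2016-12-08",
  "myMonth": 12,
  "myDay": 8
}

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "myMonth": 12 } },
        { "term": { "myDay": 8 } }
      ]
    }
  }
}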

Elasticsearch: storing a range of values in a field

This is the first time I am asking a question.
I am planning to use Elasticsearch for storing certain data that I have.
The problem I face is that I need to store a field's value as a tolerated range, like this:
field name: tolerated pH
example value: 5.1 - 7.0
I need to save it like this, and when a query is executed it has to check whether the entered value lies in the range.
I can't find this in the reference and guide; all I find is the range filter.
Can someone please help me out and guide me how it can be done?
I would create two fields:
{
  ...
  "minToleratedPh": 5.1,
  "maxToleratedPh": 7.0
}
And then use two range queries to check that the constraint minPh < input_value < maxPh holds true (just replace input_value with the pH value to check):
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "minToleratedPh": {
              "lt": input_value
            }
          }
        },
        {
          "range": {
            "maxToleratedPh": {
              "gt": input_value
            }
          }
        }
      ]
    }
  }
}
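As a side note, Elasticsearch 5.2+ also offers dedicated numeric range field types (float_range/double_range) for exactly this shape of data. A sketch with a hypothetical plants index; a term query on a range field matches documents whose stored range contains the value:
PUT /plants
{
  "mappings": {
    "properties": {
      "toleratedPh": { "type": "float_range" }
    }
  }
}

PUT /plants/_doc/1
{
  "toleratedPh": { "gte": 5.1, "lte": 7.0 }
}

GET /plants/_search
{
  "query": {
    "term": { "toleratedPh": 6.0 }
  }
}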

Random document in Elasticsearch

Is there a way to get a truly random sample from an Elasticsearch index? I.e., a query that retrieves any document from the index with probability 1/N (where N is the number of documents currently indexed)?
And as a follow-up question: if all documents have some numeric field s, is there a way to get a document through weighted random sampling, i.e. where the probability of getting document i with value s_i is equal to s_i / sum(s_j for j in index)?
I know it is an old question, but it is now possible to use random_score, with the following search query:
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "1477072619038"
          }
        }
      ]
    }
  }
}
For me it is very fast with about 2 million documents.
I use the current timestamp as the seed, but you can use anything you like. The nice part is that the same seed gives the same results, so you can use a user's session id as the seed and every user will see a different (but stable) order.
The only way I know of to get random documents from an index (at least in versions <= 1.3.1) is to use a script:
"sort": {
  "_script": {
    "script": "Math.random() * 200000",
    "type": "number",
    "params": {},
    "order": "asc"
  }
}
You can use that script to weight documents based on some field of the record (see the sketch below).
It's possible that in the future they might add something more sophisticated, but you'd likely have to request that from the ES team.
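For the weighted follow-up, one known trick (weighted reservoir sampling, A-Res) is to sort descending by a random number raised to the power 1/weight. A sketch using the question's s field, assuming Math.random() is available to the scripting language in use:
{
  "size": 1,
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": "Math.pow(Math.random(), 1.0 / doc['s'].value)"
    }
  }
}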
You can use random_score with a function_score query.
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 11
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}
The bad part is that this will apply a random score to every document, sort the documents, and then return the first one. I don't know of anything that is smart enough to just pick a random document.
NEST way:
var result = _elastic.Search<dynamic>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Functions(f => f.RandomScore())
            .Query(fq => fq.MatchAll()))));
Raw query way:
GET index-name/_search
{
  "size": 1,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": {}
    }
  }
}
You can use random_score to randomly order responses or retrieve a document with roughly 1/N probability.
Additional notes:
https://github.com/elastic/elasticsearch/issues/1170
https://github.com/elastic/elasticsearch/issues/7783
