Advice on ElasticSearch query design - elasticsearch

I've got ES documents that look like this:
{
  "auctionOn": "2018-01-01",
  "inspections": [
    {
      "startsOn": "2018-01-02 09:00",
      "endsOn": "2018-01-02 10:00"
    }
  ]
}
I need the following answers from a search (or multiple searches):
1. Number of documents with an auctionOn in the future (i.e. > now)
2. Number of documents with an inspection.startsOn in the future (i.e. > now)
3. Date histogram (day breakdown) of the next 7 days, with the number of documents with an auctionOn on that day
4. Date histogram (day breakdown) of the next 7 days, with the number of documents with an inspection.startsOn on that day
So, I'm trying to figure out how to get these answers efficiently. I know I can/should test out the different approaches, but I'm relatively new to ES, so that's easier said than done.
Can someone give me advice (or ideally, a query) on how to get these 4 values?
Ideas I had:
1. Query for all documents with an inspection/auction in the future. Create date histogram aggregations filtered to the next 7 days for both auctions and inspections. Use range aggregations to get the number of docs with an auction/inspection > today.
   Pros: one search for all answers. Cons: lots of documents to aggregate over?
2. Create separate searches (e.g. msearch) for:
   - query all documents with an inspection in the next 7 days, aggregate by day.
   - query all documents with an auction in the next 7 days, aggregate by day.
   - query all documents with an inspection in the future, use hits to get the total.
   - query all documents with an auction in the future, use hits to get the total.
   Pros: queries are simpler, maybe more cache hits? Cons: 4 separate searches.
Can someone please guide me down the right path, and give me hints on how to write the query/aggregations?
Thanks

1. Use a range query on the auctionOn field, with from set to the current date and to left as null (no upper bound).
2. Use a range query inside a nested query on the inspection.startsOn field, as above.
3. Use a date histogram aggregation with day as the interval.
4. Same as 3, but inside a nested aggregation.
You can combine all of these in one query.
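The steps above can be sketched as a single request body, written here as a Python dict. This is only a sketch under assumptions: the field path follows the sample document (inspections.startsOn), inspections is assumed to be mapped as a nested type, and both date fields are assumed to be mapped as dates. It also swaps the top-level range queries for filter aggregations, so that the four numbers do not restrict each other. The aggregation names and the build_stats_body helper are made up for illustration. Newer Elasticsearch versions expect calendar_interval instead of interval in date_histogram, which is why the key is a parameter.

```python
import json


def build_stats_body(days=7, interval_key="interval"):
    """Build one search body that answers all four questions (hypothetical helper)."""
    future = {"gt": "now"}
    # From the start of today up to (but excluding) `days` days later.
    window = {"gte": "now/d", "lt": f"now/d+{days}d"}

    def per_day(field):
        return {"date_histogram": {"field": field, interval_key: "day"}}

    # Inside a nested aggregation the buckets count inspection objects;
    # reverse_nested steps back up so the count refers to parent documents.
    parent_docs = {"docs": {"reverse_nested": {}}}

    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {
            "future_auctions": {"filter": {"range": {"auctionOn": future}}},
            "auctions_next_days": {
                "filter": {"range": {"auctionOn": window}},
                "aggs": {"per_day": per_day("auctionOn")},
            },
            "inspections": {
                "nested": {"path": "inspections"},
                "aggs": {
                    "future": {
                        "filter": {"range": {"inspections.startsOn": future}},
                        "aggs": parent_docs,
                    },
                    "next_days": {
                        "filter": {"range": {"inspections.startsOn": window}},
                        "aggs": {
                            "per_day": {
                                **per_day("inspections.startsOn"),
                                "aggs": parent_docs,
                            }
                        },
                    },
                },
            },
        },
    }


body = build_stats_body()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. Whether this is faster than four separate searches still has to be measured on real data.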

Related

ElasticSearch query specifying an indexname using todays date

I'm using Logstash to populate ES with a number of metrics from our live services across a number of machines. Logstash creates a new index each day, and I am finding that querying ES without specifying the index runs slowly (I currently keep 5 days of indices). If I specify the exact index, e.g. today's:
.es(index=logstash-2018.01.15, q=examplequery)
it runs very quickly.
Is there a way I can specify today's index using the date?
e.g.
.es(index=logstash-'get date', q=examplequery)
You can use date math in the index name to target today's index:
.es(index='<logstash-{now/d}>')
An interesting read covering all the options Elasticsearch offers for date math in index names:
https://www.elastic.co/guide/en/elasticsearch/reference/current/date-math-index-names.html
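If the client in use does not accept the date-math form, the index name can also be computed in application code. The following is a minimal Python sketch, assuming the daily index pattern is logstash-YYYY.MM.dd and that index dates are in UTC. The helper names are made up for illustration. It also shows the percent-encoded form that a date-math index name needs when it is placed in a request URL.

```python
from datetime import datetime, timezone
from urllib.parse import quote


def daily_index(day, prefix="logstash-"):
    """Return the concrete index name for a given day, e.g. logstash-2018.01.15."""
    return f"{prefix}{day:%Y.%m.%d}"


def encoded_date_math_index(prefix="logstash-"):
    """Return <prefix{now/d}> percent-encoded for use in a URL path."""
    return quote(f"<{prefix}{{now/d}}>", safe="")


# Today's index, computed client-side (UTC is an assumption).
print(daily_index(datetime.now(timezone.utc)))
# The same target expressed as date math, encoded for a URL.
print(encoded_date_math_index())
```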
Judging by the syntax, I guess you are using Timelion or something else that uses a query string. There is a good tutorial here that covers specifying index patterns:
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
In your case it would be
.es(index=logstash-*, q=examplequery)
or
.es(index=logstash-2018.01.*, q=examplequery)
if you need January of this year and the index pattern is 'logstash-YYYY.MM.dd'.

Aggregate results of top_hits with date_histogram or similar?

We have an index containing one document for every visit event to our site in a day. Each document contains the time of the visit and a user ID, and the same user can visit multiple times in the same day. I am trying to get, per minute, the number of users visiting for the first time that day. Is this possible to do in a single query?
I know that a top_hits aggregation within a terms aggregation, sorted by the time field, will get me the documents representing the first unique visit each day. I know that date_histogram will aggregate the visits by minute, but not apply a uniqueness check. A cardinality subaggregation of date_histogram only verifies uniqueness per bucket, not over the whole day. date_histogram doesn't accept pipeline specifications for what to aggregate over.
I'm currently afraid that the only answer is to do the top_hits aggregation and then aggregate it myself client side, or do a separate query for every minute I want to verify unique users over (something like query for unique user ids from midnight to 12:01 AM, then midnight to 12:02, etc, tracking the growth in the count in each query.)
You can run multiple aggregations in a single Elasticsearch query. It goes like this:
{
  "query": {
    // some query
  },
  "aggs": {
    "aggregation1": {
    },
    "aggregation2": {
    }
  }
}
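The client-side fallback mentioned in the question can be sketched as follows. This is not an Elasticsearch feature, just a minimal Python illustration that assumes the visits for one day have already been fetched as (user_id, timestamp) pairs. The function name and the sample data are made up.

```python
from collections import Counter
from datetime import datetime


def first_visits_per_minute(visits):
    """Count, per minute, the users whose first visit of the day falls in that minute.

    visits: iterable of (user_id, datetime) pairs, all from the same day.
    """
    first_seen = {}
    for user_id, ts in visits:
        # Keep only the earliest timestamp seen for each user.
        if user_id not in first_seen or ts < first_seen[user_id]:
            first_seen[user_id] = ts
    # Truncate each first visit to the minute and count users per minute.
    return Counter(ts.replace(second=0, microsecond=0) for ts in first_seen.values())


sample = [
    ("alice", datetime(2018, 1, 15, 0, 0, 30)),
    ("bob", datetime(2018, 1, 15, 0, 0, 45)),
    ("alice", datetime(2018, 1, 15, 0, 1, 10)),  # repeat visit, not counted
    ("carol", datetime(2018, 1, 15, 0, 1, 20)),
]
for minute, users in sorted(first_visits_per_minute(sample).items()):
    print(minute, users)
```

Fetching only the earliest visit per user from the index (for example with the terms plus top_hits combination described in the question) keeps the amount of data transferred to one entry per user.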

Elasticsearch aggregation on latest documents

I have a document which can be modified any number of times a day.
I've organized these documents as a time series, creating an index for each day.
Each day can have multiple versions of the same document with different modified dates.
Document sample:
{
  "id": 1234,
  "user": "kc",
  "subscriptions": [
    "paper1",
    "paper2"
  ],
  "modified_date": 1466697434020
}
What I'm looking for is to get the latest document for each user within a particular time range
and to apply an aggregation on top of that set.
That would give a result like: in the last week/month, how many people are subscribed to each of the papers.
Using top_hits I was able to get the latest document for each user in a time range, but I cannot apply further aggregations to this set of data.
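Since the top_hits result cannot be aggregated further, one fallback is to do the last step in application code. The following Python sketch assumes the candidate documents for the time range have already been retrieved and have the shape of the sample document above. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def subscription_counts(docs, start_ms, end_ms):
    """Count subscribers per paper using only each user's latest document in range."""
    latest = {}
    for doc in docs:
        if not (start_ms <= doc["modified_date"] <= end_ms):
            continue  # outside the requested time range
        current = latest.get(doc["user"])
        if current is None or doc["modified_date"] > current["modified_date"]:
            latest[doc["user"]] = doc
    # Each user contributes once per paper in their latest document.
    return Counter(paper for doc in latest.values() for paper in doc["subscriptions"])


docs = [
    {"id": 1234, "user": "kc", "subscriptions": ["paper1"], "modified_date": 100},
    {"id": 1234, "user": "kc", "subscriptions": ["paper1", "paper2"], "modified_date": 200},
    {"id": 5678, "user": "jd", "subscriptions": ["paper2"], "modified_date": 150},
]
print(subscription_counts(docs, 0, 300))
```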

elasticsearch scoring on multiple indexes

I have one index for each quarter of a year ("index-2015.1", "index-2015.2", ...).
I have around 30 million documents in each index.
Each document has a text field ('title').
My results are sorted by (1) _score, then (2) created date.
The problem is:
When searching for some text in the 'title' field across all indexes ("index-201*"), the first results always come from a single index.
Say I am searching for 'title=home' and I have 10k documents in "index-2015.1" with title=home and 10k documents in "index-2015.2" with title=home. Then the first results are all documents from "index-2015.1" (not from "index-2015.2", or mixed), even though "index-2015.2" contains documents with a higher "created date" than those in "index-2015.1".
Is there a reason for this?
The reason is probably that scores are specific to each index. So if you really have multiple indices, the score of a document will be calculated (slightly) differently in each index.
Simply put, among other things, the score of a matching document depends on the query terms and how often they occur in the index. The score is calculated relative to the index (actually, by default, relative to each separate shard). Elasticsearch applies some normalizations, but I don't know the details of those.
I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF, which should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type, to achieve a score suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "home"
    }
  }
}
The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.
Basically, it first collects the term frequencies from all affected shards (and indexes), so the score is computed from statistics shared across all of them.
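To see why per-index statistics change the score, here is a small worked sketch in Python. It assumes the classic Lucene TF/IDF inverse document frequency, idf = 1 + ln(numDocs / (docFreq + 1)). The document counts below are made-up numbers for illustration, not values from the question. Newer Elasticsearch versions default to BM25, where the formula differs but the score still depends on term statistics in the same way.

```python
import math


def classic_idf(num_docs, doc_freq):
    """Inverse document frequency as in classic Lucene TF/IDF (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for the term "home" in the title field:
# (total documents, documents containing the term).
stats = {
    "index-2015.1": (30_000_000, 10_000),
    "index-2015.2": (30_000_000, 40_000),
}

# Default behaviour: each index (really each shard) uses its own statistics.
local_idf = {name: classic_idf(n, df) for name, (n, df) in stats.items()}

# dfs_query_then_fetch: statistics are collected from all shards first,
# so every document is scored with the same idf.
total_docs = sum(n for n, _ in stats.values())
total_freq = sum(df for _, df in stats.values())
global_idf = classic_idf(total_docs, total_freq)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With these numbers the term is rarer in index-2015.1, so its documents get a higher idf there and tend to sort first when scores are computed locally. With the shared statistics, both indexes use one value.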
Following Andrei Stefan and Slomo, index boosting solved my problem:
body = {
  "indices_boost": { "index-2015.4": 1.4, "index-2015.3": 1.3, "index-2015.2": 1.2, "index-2015.1": 1.1 }
}
EDIT:
Using search_type=dfs_query_then_fetch (as Slomo described) solves the problem in a better way (depending on your business model...).

ElasticSearch Aggregating Against FIRST/Max Nested Document

I'm using Elasticsearch and trying to get an aggregation of "last login country" for a set of users, and I'm not sure whether ES supports this type of aggregation. Here's a rough picture of the mapping:
User
  UserId
  Sessions (array)
    Session1 - CreateDate, Country
    Session2 - CreateDate, Country
What I'm wanting to do is pass in a date range, and get an output of the logins by country, with ONLY a single session per user. In other words, if the user logged in 3 times during the date range, only 1 of those sessions would count towards the overall count.
The output would look something like the following:
Country Aggregations
USA, Count: 10
Japan, Count: 15
Spain, Count: 23
I've been looking over nested aggregations, but I'm not sure they can give me what I need. The main problem I'm having is that if a User has multiple Sessions during the date range, each of those sessions contribute to the overall country count. Is there a way to filter this inner list of nested documents down so that only 1 will contribute to the aggregation per User?
I posted this question on ElasticSearch's github forum, and apparently this functionality is not available in the current ES version (1.4.2):
https://github.com/elasticsearch/elasticsearch/issues/9536
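Given that answer, one workaround is to pick the session per user in application code. The following Python sketch assumes the user documents have already been retrieved and follow the rough mapping above, with CreateDate held as a date object. The function name and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import date


def last_login_countries(users, start, end):
    """Count users by the country of their latest session within [start, end]."""
    counts = Counter()
    for user in users:
        in_range = [s for s in user["Sessions"] if start <= s["CreateDate"] <= end]
        if in_range:
            # Only the most recent session in the range counts for this user.
            latest = max(in_range, key=lambda s: s["CreateDate"])
            counts[latest["Country"]] += 1
    return counts


users = [
    {"UserId": 1, "Sessions": [
        {"CreateDate": date(2015, 1, 2), "Country": "USA"},
        {"CreateDate": date(2015, 1, 9), "Country": "Japan"},
    ]},
    {"UserId": 2, "Sessions": [
        {"CreateDate": date(2015, 1, 5), "Country": "Japan"},
    ]},
    {"UserId": 3, "Sessions": [
        {"CreateDate": date(2014, 12, 1), "Country": "Spain"},
    ]},
]
print(last_login_countries(users, date(2015, 1, 1), date(2015, 1, 31)))
```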
