Elasticsearch aggregation on latest documents - elasticsearch

I have a document that can be modified any number of times a day.
I've organized these documents as a time series, creating an index for each day.
Each day's index can therefore hold multiple versions of the same document, each with a different modified date.
Document sample:
{
  "id": 1234,
  "user": "kc",
  "subscriptions": [
    "paper1",
    "paper2"
  ],
  "modified_date": 1466697434020
}
What I'm looking for is a way to get the latest document for each user in a particular time range
and then apply an aggregation on top of that set.
The result would show, for example, how many people were subscribed to each paper in the last week or month.
Using top_hits I was able to get the latest document for each user in a time range, but I cannot apply further aggregations to that set of data.
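As a point of comparison, here is a minimal client-side sketch of the desired result, assuming the documents in the time range have already been fetched (for example via top_hits or a scroll). The function name subscription_counts and its parameters are hypothetical. It is not an Elasticsearch feature, only an illustration of "latest document per user, then count subscriptions".

```python
from collections import Counter


def subscription_counts(docs, start_ms, end_ms):
    """Count subscribers per paper, using each user's latest document in the range."""
    latest = {}
    for doc in docs:
        ts = doc["modified_date"]
        if not (start_ms <= ts <= end_ms):
            continue  # outside the requested time range
        current = latest.get(doc["user"])
        if current is None or ts > current["modified_date"]:
            latest[doc["user"]] = doc  # keep only the newest version per user
    counts = Counter()
    for doc in latest.values():
        # set() guards against a paper being listed twice in one document
        counts.update(set(doc["subscriptions"]))
    return dict(counts)
```

This only scales to result sets that fit comfortably in client memory.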

Related

Painless script with Spring Data Elasticsearch

We are using Spring Data Elasticsearch to build a 'fan out on read' user content feed. Our first attempt is currently showing content based on keyword matching and latest content using NativeSearchQueryBuilder.
We want to further improve the relevancy order of what is shown to the user based on additional factors (e.g. user engagement, what the user is currently working on, etc.).
Can this custom ordering be done using NativeSearchQueryBuilder, or do we get more control using a painless script? If it's a painless script, can we call it from Spring Data Elasticsearch?
Any examples or recommendations would be most welcome.
Elasticsearch orders its results by relevance score (_score): each document in the result set carries a number that signifies how relevant the document is to the given query.
If the data you want to base your ordering on is part of your indexed data (document fields, for example), you can use the Query DSL to boost the _score. A few options I can think of:
boost: weight a query clause depending on its criteria. A user searches for a 3-room flat, but a 4-room flat at the same price would be a much better match, so we can add: { "range": { "rooms": { "gte": 4, "boost": 1 }}}
field_value_factor: favor results by a field value, such as more 'clicks' by users, more 'likes', etc.
random_score: use this if you want randomness in your results (a different result every time a user refreshes your page), or mix it with the existing scoring.
decay functions (Gauss!): boost or penalize results that are close to or far from a central point. Let's say we are searching for apartments and our budget is set to 1700. { "gauss": { "price": { "origin": "1700", "scale": "300" } } } scores each flat by how close it is to our budget of 1,700. Any flat with a much higher price (let's say 2,300) is penalized much more by the gauss function, as it is far from our origin. The decay and the shape of the gauss function separate the results according to their distance from the origin.
I don't think spring-data-es has any abstraction for this, so I would use FunctionScoreQueryBuilder with the NativeSearchQueryBuilder.
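To make the options above concrete, here is a rough sketch of the raw function_score request body that such a builder would ultimately produce, written as a Python dict. The field names description, clicks and price, and the helper name build_scored_query, are assumptions for illustration and not part of the original question.

```python
import json


def build_scored_query(text, budget=1700, scale=300):
    """Build a function_score body combining a text match with two scoring functions."""
    return {
        "query": {
            "function_score": {
                # Base relevance query (hypothetical field name)
                "query": {"match": {"description": text}},
                "functions": [
                    # Favor documents with more clicks; log1p dampens large values
                    {"field_value_factor": {"field": "clicks", "modifier": "log1p", "missing": 0}},
                    # Decay the score as the price moves away from the budget
                    {"gauss": {"price": {"origin": budget, "scale": scale}}},
                ],
                "score_mode": "sum",       # how the function results are combined
                "boost_mode": "multiply",  # how that result is combined with _score
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_scored_query("flat with balcony"), indent=2))
```

Whether sum and multiply are the right combination depends on the data, so treat both modes as starting points to tune.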

Advice on ElasticSearch query design

I've got ES documents that look like this:
{
  "auctionOn": "2018-01-01",
  "inspections": [
    {
      "startsOn": "2018-01-02 09:00",
      "endsOn": "2018-01-02 10:00"
    }
  ]
}
I need the following answers from a search (or multiple searches):
number of documents with an auctionOn in the future (i.e. > now)
number of documents with an inspections.startsOn in the future (i.e. > now)
date histogram (day breakdown) of the next 7 days, with the number of documents with an auctionOn on that day
date histogram (day breakdown) of the next 7 days, with the number of documents with an inspections.startsOn on that day
So, I'm trying to figure out how to get these answers efficiently. I know I can/should test out all the different approaches, but I'm relatively new to ES, so that is easier said than done.
Can someone give me advice (or ideally, a query) on how to get these 4 values?
Ideas I had:
Query for all documents with an inspection/auction in the future. Create date histogram aggregations filtered to the next 7 days for both auctions and inspections. Use range aggregations to get the number of docs with auction/inspection > today.
Pros: one search for all answers. Cons: lots of documents to aggregate over?
Create separate searches (e.g. msearch) for:
query all documents with an inspection in the next 7 days. Aggregate by day.
query all documents with an auction in the next 7 days. Aggregate by day.
query all documents with an inspection in the future. Use hits to get the total.
query all documents with an auction in the future. Use hits to get the total.
Pros: queries are simpler, more cache hits? Cons: 4 separate searches.
Can someone please guide me down the right path, and give me hints on how to do the query/aggregations?
Thanks
1. Use a range query on the field auctionOn, with the lower bound set to the current date (now) and no upper bound.
2. Use the same kind of range query on the field inspections.startsOn, wrapped in a nested query.
3. Use a date histogram aggregation on auctionOn with an interval of one day.
4. Same as 3, but on inspections.startsOn inside a nested aggregation.
You can combine all of these in one query.
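One way to fold the four steps into a single request is to express each of them as a filter aggregation, sketched below as a Python dict. This assumes that inspections is mapped as a nested field and that both date fields are mapped with formats matching the sample document. The aggregation names and the helper build_auction_stats_query are hypothetical. calendar_interval is the parameter name on Elasticsearch 7.2 and later, while older versions use interval.

```python
def build_auction_stats_query():
    """Build one search body that answers all four questions via aggregations."""
    future = {"gt": "now"}
    next_7_days = {"gte": "now/d", "lt": "now+7d/d"}
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "aggs": {
            # 1. documents with an auction in the future
            "future_auctions": {"filter": {"range": {"auctionOn": future}}},
            # 3. auctions per day over the next 7 days
            "auctions_next_7_days": {
                "filter": {"range": {"auctionOn": next_7_days}},
                "aggs": {
                    "per_day": {
                        "date_histogram": {"field": "auctionOn", "calendar_interval": "day"}
                    }
                },
            },
            "inspections": {
                "nested": {"path": "inspections"},
                "aggs": {
                    # 2. reverse_nested counts parent documents, not inspection entries
                    "future": {
                        "filter": {"range": {"inspections.startsOn": future}},
                        "aggs": {"docs": {"reverse_nested": {}}},
                    },
                    # 4. inspections per day over the next 7 days
                    "next_7_days": {
                        "filter": {"range": {"inspections.startsOn": next_7_days}},
                        "aggs": {
                            "per_day": {
                                "date_histogram": {
                                    "field": "inspections.startsOn",
                                    "calendar_interval": "day",
                                },
                                "aggs": {"docs": {"reverse_nested": {}}},
                            }
                        },
                    },
                },
            },
        },
    }
```

The counts for the first two questions are then read from the doc_count of future_auctions and of inspections.future.docs rather than from the hit total.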

How to search for documents that match some query where there exists a document that matches a separate query at a nearby time

I have a set of documents indexed in Elasticsearch that carry a bunch of data and come in two styles with different fields.
The first has a style like this:
{
  type: "measurement",
  startTime: "iso-time-goes-here",
  duration: 30, // seconds
  locationId: "abc"
}
The second has a style like this:
{
  type: "event",
  startTime: "iso-time-goes-here",
  locationId: "abc"
}
The rest of the fields are identical between the documents.
I want to run a search such as "Show me all event documents where there is a measurement document less than 1 minute away from it and the locationIds match"
Is such a query possible in elasticsearch? How can I pull this off?
No. Elasticsearch has no joins, which is what you would need here, since you are joining measurements with events on the locationId.
Maybe you can denormalize measurements into events at index time so that the information is already there. If you can manage this, it would be the fastest option. Anything else, like running multiple queries and then filtering your result set, is likely to get expensive, which is the reason why Elasticsearch does not support joins. But from what you describe, that likely won't work for you.
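For small result sets, the "multiple queries and then filtering" fallback mentioned above could look roughly like the client-side sketch below. It assumes both document types have already been fetched, that startTime is an ISO 8601 string that datetime.fromisoformat accepts, and that "less than 1 minute away" means the two start times differ by less than 60 seconds (the measurement's duration is ignored). The function name is hypothetical.

```python
from datetime import datetime, timedelta


def events_near_measurements(events, measurements, max_gap=timedelta(minutes=1)):
    """Return events that have a measurement at the same location within max_gap."""
    # Group measurement start times by location for quick lookup
    by_location = {}
    for m in measurements:
        start = datetime.fromisoformat(m["startTime"])
        by_location.setdefault(m["locationId"], []).append(start)

    matched = []
    for e in events:
        start = datetime.fromisoformat(e["startTime"])
        candidates = by_location.get(e["locationId"], [])
        if any(abs(start - t) < max_gap for t in candidates):
            matched.append(e)
    return matched
```

This is exactly the kind of work that gets expensive as the data grows, which is why denormalizing at index time is the preferred route when it is possible.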

Aggregate results of top_hits with date_histogram or similar?

We have an index containing one document for every visit event to our site in a day. Each document contains the time of the visit and a user ID, and the same user can visit multiple times in the same day. I am trying to get, per minute, the number of users visiting for the first time that day. Is this possible to do in a single query?
I know that a top_hits aggregation within a terms aggregation, sorted by the time field, will get me the documents representing the first unique visit each day. I know that date_histogram will aggregate the visits by minute, but not apply a uniqueness check. A cardinality subaggregation of date_histogram only verifies uniqueness per bucket, not over the whole day. date_histogram doesn't accept pipeline specifications for what to aggregate over.
I'm currently afraid that the only answer is to do the top_hits aggregation and then aggregate it myself client side, or do a separate query for every minute I want to verify unique users over (something like query for unique user ids from midnight to 12:01 AM, then midnight to 12:02, etc, tracking the growth in the count in each query.)
You can run multiple aggregations in a single Elasticsearch query. It goes like this:
{
  "query": {
    // some query
  },
  "aggs": {
    "aggregation1": {
    },
    "aggregation2": {
    }
  }
}
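For reference, the client-side fallback described in the question (take each user's first visit, then bucket by minute) can be sketched as follows. It assumes the visits are already available as (user_id, epoch_seconds) pairs for a single day. The function name is hypothetical, and this is not a single-query Elasticsearch solution.

```python
from collections import Counter


def first_visits_per_minute(visits):
    """Map each minute (epoch seconds, floored) to the number of users first seen in it."""
    first_seen = {}
    for user_id, ts in visits:
        # Keep the earliest timestamp seen for each user
        if user_id not in first_seen or ts < first_seen[user_id]:
            first_seen[user_id] = ts
    # Floor each first-visit timestamp to the start of its minute and count
    return dict(Counter(ts // 60 * 60 for ts in first_seen.values()))
```

Minutes in which no user appears for the first time are simply absent from the result.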

How to retrieve unique count of a field using Kibana + Elastic Search

Is it possible to query for a distinct/unique count of a field using Kibana? I am using Elasticsearch as my backend to Kibana.
If so, what is the syntax of the query? Here's a link to the Kibana interface where I would like to make my query: http://demo.kibana.org/#/dashboard
I am parsing nginx access logs with logstash and storing the data into elastic search. Then, I use Kibana to run queries and visualize my data in charts. Specifically, I want to know the count of unique IP addresses for a specific time frame using Kibana.
For Kibana 4 go to this answer
This is easy to do with a terms panel:
If you want the count of distinct IPs in your logs, specify clientip as the field, set the length to a big enough number (otherwise, different IPs will be joined under the same group), and choose table as the style. After adding the panel, you will have a table with each IP and the count for that IP:
Now Kibana 4 allows you to use aggregations. Apart from building a panel like the one explained in this answer for Kibana 3, we can now see the number of unique IPs in different periods, which was (IMO) what the OP wanted in the first place.
To build a dashboard like this you should go to Visualize -> Select your Index -> Select a Vertical Bar chart and then in the visualize panel:
In the Y axis we want the unique count of IPs (select the field where you stored the IP) and in the X axis we want a date histogram with our timefield.
After pressing the Apply button, we should have a graph that shows the unique count of IPs distributed over time. We can change the time interval on the X axis to see the unique IPs hourly, daily, and so on.
Just take into account that the unique counts are approximate. For more information check also this answer.
Be aware that Unique Count uses the 'cardinality' metric, which does not always guarantee an exact unique count. :-)
The cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality.
Depending on the amount of data, I can see Unique Count in Elastic miss 700+ entries in a 300k dataset, even though those entries really are unique.
Read more here: https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
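As an illustration, the request behind a "unique count over time" chart is roughly a cardinality aggregation nested in a date histogram. The sketch below builds such a body as a Python dict. The field names clientip and @timestamp are assumptions based on a typical Logstash setup, and the helper name is hypothetical. precision_threshold trades memory for accuracy, and calendar_interval is the parameter name on Elasticsearch 7.2 and later (older versions use interval).

```python
def unique_ip_query(field="clientip", time_field="@timestamp",
                    interval="hour", precision_threshold=3000):
    """Build a search body counting approximate unique values of `field` per time bucket."""
    return {
        "size": 0,  # aggregation results only
        "aggs": {
            "over_time": {
                "date_histogram": {"field": time_field, "calendar_interval": interval},
                "aggs": {
                    "unique_ips": {
                        # Approximate distinct count (HyperLogLog++)
                        "cardinality": {
                            "field": field,
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            }
        },
    }
```

Raising precision_threshold should reduce the error for low-cardinality buckets, but the result remains an estimate.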
Create a "topN" query on "clientip", then create a histogram with count on "clientip" and set the "topN" query as its source. You will then see the count of different IPs over time.
Unique counts of field values are achieved by using facets (in older Elasticsearch versions; aggregations have since replaced them). See the ES documentation for the full story, but the gist is that you create a query and then ask ES to prepare facets on the results, counting the values found in fields. It's up to you to customize the fields used and even describe how you want the values returned. The most basic facet type just groups by terms, such as the IP address above. You can get pretty complex with these, even requiring a query within your facet! In the example below, the facet name ip_counts is arbitrary.
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ip_counts": {
      "terms": {
        "field": "ip_address"
      }
    }
  }
}
Using aggs you can easily do that.
Here is the query:
GET index/_search
{
  "size": 0,
  "aggs": {
    "source": {
      "terms": {
        "field": "field",
        "size": 100000
      }
    }
  }
}
This would return the different values of the field with their doc counts.
For Kibana 7.x, Unique Count is available in most visualizations: for example in Lens, in aggregation-based visualizations, and even in TSVB (which supports normal fields as well as Runtime Fields; Scripted Fields are not supported).
