Aggregate results of top_hits with date_histogram or similar? - elasticsearch

We have an index containing one document for every visit event to our site in a day. Each document contains the time of the visit and a user ID, and the same user can visit multiple times in the same day. I am trying to get, per minute, the number of users visiting for the first time that day. Is this possible in a single query?
I know that a top_hits aggregation within a terms aggregation, sorted by the time field, will get me the documents representing each user's first visit of the day. I know that date_histogram will bucket the visits by minute, but it does not apply any uniqueness check. A cardinality sub-aggregation of a date_histogram only verifies uniqueness per bucket, not over the whole day, and date_histogram does not accept pipeline inputs specifying what to aggregate over.
I'm currently afraid that the only answer is to do the top_hits aggregation and then aggregate the results myself client side, or to run a separate query for every minute I want to count unique users over (something like querying for unique user IDs from midnight to 12:01 AM, then midnight to 12:02 AM, and so on, tracking the growth in the count with each query).
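(For reference, the terms + top_hits approach described above looks roughly like the sketch below; the index name visits and the field names user_id and visit_time are assumptions, not the actual mapping.)
GET visits/_search
{
  "size": 0,
  "query": {
    "range": { "visit_time": { "gte": "now/d", "lt": "now+1d/d" } }
  },
  "aggs": {
    "per_user": {
      "terms": { "field": "user_id", "size": 10000 },
      "aggs": {
        "first_visit_today": {
          "top_hits": {
            "sort": [ { "visit_time": { "order": "asc" } } ],
            "size": 1
          }
        }
      }
    }
  }
}
This returns one bucket per user holding that user's earliest visit of the day, which is exactly the per-user result that still has to be grouped by minute somewhere.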

You can run multiple aggregations in a single Elasticsearch query. The request body looks like this:
{
  "query": {
    // some query
  },
  "aggs": {
    "aggregation1": {
    },
    "aggregation2": {
    }
  }
}
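As a concrete (if hedged) sketch for the original question, here are two sibling aggregations in one request: a per-minute date_histogram of visits and a day-wide cardinality of users. The index name visits and the field names visit_time and user_id are assumptions, and older versions use interval instead of fixed_interval.
GET visits/_search
{
  "size": 0,
  "query": {
    "range": { "visit_time": { "gte": "now/d", "lt": "now+1d/d" } }
  },
  "aggs": {
    "visits_per_minute": {
      "date_histogram": { "field": "visit_time", "fixed_interval": "1m" }
    },
    "unique_visitors_today": {
      "cardinality": { "field": "user_id" }
    }
  }
}
Note that this only illustrates the multi-aggregation structure; as the question points out, it does not by itself give the number of first-time-that-day visitors per minute.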

Related

Advice on ElasticSearch query design

I've got ES documents that look like this:
{
  "auctionOn": "2018-01-01",
  "inspections": [
    {
      "startsOn": "2018-01-02 09:00",
      "endsOn": "2018-01-02 10:00"
    }
  ]
}
I need the following answers from a search (or multiple searches):
number of documents with an auctionOn in the future (i.e. > now)
number of documents with an inspection.startsOn in the future (i.e. > now)
date histogram (day breakdown) of the next 7 days, with the number of documents with an auctionOn on that day
date histogram (day breakdown) of the next 7 days, with the number of documents with an inspection.startsOn on that day
So I'm trying to figure out how to efficiently get these answers. I know I can/should test out all the different approaches, but I'm relatively new to ES, so that's easier said than done.
Can someone give me advice (or ideally, a query) on how to get these four values?
Ideas i had:
Query for all documents with an inspection/auction in the future. Create date histogram aggregations filtered to the next 7 days for both auctions and inspections. Use range aggregations to get the number of docs with an auction/inspection > today.
Pros: one search for all answers. Cons: lots of documents to aggregate over?
Create separate searches (e.g. msearch) for:
query all documents with an inspection in the next 7 days; aggregate by day
query all documents with an auction in the next 7 days; aggregate by day
query all documents with an inspection in the future; use hits to get the total
query all documents with an auction in the future; use hits to get the total
Pros: the queries are simpler... more cache hits? Cons: 4 separate searches.
Can someone please guide me down the right path, and give me hints on how to do the query/aggregations?
Thanks
1. Use a range query on the field auctionOn, setting from to the current date and leaving to unset (null).
2. Use the same range query inside a nested query on the field inspection.startsOn.
3. Use a date histogram aggregation with a day interval.
4. Same as 3, but inside a nested aggregation.
You can combine all of these in one query, as sketched below.
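A hedged sketch of such a combined query, assuming the documents live in an index called listings, the inspections field is mapped as nested, and the version accepts calendar_interval (older versions use interval):
GET listings/_search
{
  "size": 0,
  "aggs": {
    "future_auctions": {
      "filter": { "range": { "auctionOn": { "gt": "now" } } }
    },
    "auctions_next_7_days": {
      "filter": { "range": { "auctionOn": { "gte": "now/d", "lt": "now+7d/d" } } },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "auctionOn", "calendar_interval": "day" }
        }
      }
    },
    "inspections": {
      "nested": { "path": "inspections" },
      "aggs": {
        "future_inspections": {
          "filter": { "range": { "inspections.startsOn": { "gt": "now" } } }
        },
        "inspections_next_7_days": {
          "filter": { "range": { "inspections.startsOn": { "gte": "now/d", "lt": "now+7d/d" } } },
          "aggs": {
            "per_day": {
              "date_histogram": { "field": "inspections.startsOn", "calendar_interval": "day" }
            }
          }
        }
      }
    }
  }
}
One caveat: inside the nested aggregation the doc counts are counts of inspection entries, not of parent documents; add a reverse_nested sub-aggregation if you need the number of parent documents instead.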

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of web pages with Elasticsearch.
We collect data in the following simplified structure:
{"id":"1", "timestamp"="2017-01-25:15:23", "sessionid"="s1", "page"="index"}
{"id":"2", "timestamp"="2017-01-25:15:24", "sessionid"="s1", "page"="checkout"}
{"id":"3", "timestamp"="2017-01-25:15:25", "sessionid"="s1", "page"="confirm"}
{"id":"4", "timestamp"="2017-01-25:15:26", "sessionid"="s2", "page"="index"}
{"id":"5", "timestamp"="2017-01-25:15:27", "sessionid"="s2", "page"="checkout"}
{"id":"6", "timestamp"="2017-01-25:15:26", "sessionid"="s3", "page"="product_a"}
{"id":"7", "timestamp"="2017-01-25:15:28", "sessionid"="s3", "page"="checkout"}
For this sample, the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page.
More formally, I'm looking for a generic approach to implementing the following algorithm in an Elasticsearch query:
group documents by a field
sort each group (bucket) by a second field and reduce to the topmost document
group all these remaining documents by a third field
sort groups by number of documents
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation, and finally a terms_pipeline aggregation to group the pages.
(simplified aggregation structure)
aggs
  terms
    field: sessionid
    aggs
      top_hits
        sort: timestamp desc
        size: 1
  terms_pipeline
    bucket_path: terms>top_hits
    field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, since all pages are in a sequence, you could simply use a terms aggregation on the page field (to know which pages were visited) and a cardinality one on the sessionid field (to know how many distinct sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
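A minimal sketch of that terms + cardinality combination, assuming page and sessionid are keyword fields and the index is called pageviews:
GET pageviews/_search
{
  "size": 0,
  "aggs": {
    "pages": {
      "terms": { "field": "page" }
    },
    "unique_sessions": {
      "cardinality": { "field": "sessionid" }
    }
  }
}
With the sample data above, this would report checkout: 3, index: 2, confirm: 1, product_a: 1 and unique_sessions: 3, from which the 2/3 bounce at checkout and the 1/3 at confirm can be derived.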

Elasticsearch aggregation on latest documents

I have a document which can be modified any number of times a day.
I've ordered these documents in a time series, creating one index per day.
Each day therefore has multiple versions of the same document, each with a different modified date.
Document sample:
{
  "id": 1234,
  "user": "kc",
  "subscriptions": [
    "paper1",
    "paper2"
  ],
  "modified_date": 1466697434020
}
What I'm looking for is to get the latest version of each document in a particular time range for all users, and to apply an aggregation on top of that.
That would give a result like: in the last week/month, how many people are subscribed to each of the papers.
Using top_hits I was able to get the latest document for different users in a time range, but I cannot apply further aggregations on this set of data.
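For reference, that top_hits-per-user query might look like the sketch below; the index pattern, the time range, and the exact field mappings are assumptions, while the field names follow the sample document:
GET subscriptions-*/_search
{
  "size": 0,
  "query": {
    "range": { "modified_date": { "gte": "now-7d/d" } }
  },
  "aggs": {
    "per_user": {
      "terms": { "field": "user", "size": 10000 },
      "aggs": {
        "latest_version": {
          "top_hits": {
            "sort": [ { "modified_date": { "order": "desc" } } ],
            "_source": [ "subscriptions" ],
            "size": 1
          }
        }
      }
    }
  }
}
As described, the limitation is that further aggregations (e.g. counting subscriptions) cannot be applied to the documents that top_hits returns.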

ElasticSearch Aggregating Against FIRST/Max Nested Document

I'm using Elasticsearch to try to get an aggregation of "last login country" for a set of users, and I'm not sure whether ES supports this type of aggregation. Here's a rough picture of the mapping:
User
  UserId
  Sessions (array)
    Session1 - CreateDate, Country
    Session2 - CreateDate, Country
What I want to do is pass in a date range and get an output of the logins by country, with ONLY a single session counted per user. In other words, if a user logged in 3 times during the date range, only 1 of those sessions would count towards the overall count.
The output would look something like the following:
Country Aggregations
USA, Count: 10
Japan, Count: 15
Spain, Count: 23
I've been looking over nested aggregations, but I'm not sure they can give me what I need. The main problem I'm having is that if a user has multiple sessions during the date range, each of those sessions contributes to the overall country count. Is there a way to filter this inner list of nested documents down so that only one per user contributes to the aggregation?
I posted this question on Elasticsearch's GitHub issue tracker, and apparently this functionality is not available in the current ES version (1.4.2):
https://github.com/elasticsearch/elasticsearch/issues/9536

How to retrieve unique count of a field using Kibana + Elastic Search

Is it possible to query for a distinct/unique count of a field using Kibana? I am using elastic search as my backend to Kibana.
If so, what is the syntax of the query? Here's a link to the Kibana interface where I would like to make my query: http://demo.kibana.org/#/dashboard
I am parsing nginx access logs with logstash and storing the data into elastic search. Then, I use Kibana to run queries and visualize my data in charts. Specifically, I want to know the count of unique IP addresses for a specific time frame using Kibana.
For Kibana 4, see the next answer below.
In Kibana 3, this is easy to do with a terms panel:
If you want the count of distinct IPs that appear in your logs, set the field to clientip, put a big enough number in length (otherwise different IPs will be joined under the same group), and choose the table style. After adding the panel, you will have a table with each IP and its count.
Kibana 4 now allows you to use aggregations. Apart from building a panel like the one explained in the previous answer for Kibana 3, you can now see the number of unique IPs over different time periods, which is (IMO) what the OP wanted in the first place.
To build a dashboard like this, go to Visualize -> select your index -> select a Vertical Bar chart, and then in the visualize panel:
on the Y axis choose the unique count of IPs (select the field where you stored the IP), and on the X axis choose a date histogram on your time field.
After pressing the Apply button, you should have a graph showing the unique count of IPs distributed over time. You can change the time interval on the X axis to see the unique IPs hourly/daily...
Just take into account that the unique counts are approximate. For more information, see also the answer below about the cardinality metric.
Be aware that Unique Count uses the cardinality metric, which does not always guarantee an exact unique count. :-)
The cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality.
Depending on the amount of data, I have seen differences of 700+ entries missing from a 300k dataset when using Unique Count in Elasticsearch, even though the entries really are unique.
Read more here: https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
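If memory permits, the approximation can be tightened with the precision_threshold parameter of the cardinality aggregation (up to a maximum of 40000). A hedged sketch, with the index pattern as an assumption and clientip as the field used in the examples above:
GET logstash-*/_search
{
  "size": 0,
  "aggs": {
    "unique_ips": {
      "cardinality": {
        "field": "clientip",
        "precision_threshold": 40000
      }
    }
  }
}
Counts below the threshold are expected to be close to exact; above it the result is still an approximation.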
Create "topN" query on "clientip" and then histogram with count on "clientip" and set "topN" query as source. Then you will see count of different ips per time.
Unique counts of field values can be achieved with facets. See the ES documentation for the full story, but the gist is that you create a query and then ask ES to prepare facets on the results for counting values found in fields. It's up to you to customize the fields used and even to describe how you want the values returned. The most basic facet type simply groups by terms, such as the IP address above. You can get pretty complex with these, even requiring a query within your facet!
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ip_addresses": {
      "terms": {
        "field": "ip_address"
      }
    }
  }
}
(The facet name, ip_addresses here, is arbitrary.)
Using aggs, you can easily do that. Here is the query:
GET index/_search
{
  "size": 0,
  "aggs": {
    "source": {
      "terms": {
        "field": "field",
        "size": 100000
      }
    }
  }
}
This returns the different values of the field along with their doc counts.
For Kibana 7.x, Unique Count is available in most visualizations: for example in Lens, in aggregation-based visualizations, and even in TSVB (which supports normal fields as well as Runtime Fields; Scripted Fields are not supported).
