I set up an Elastic Stack and imported millions of log entries. Each log entry contains a timestamp and a sessionID. Each session produces multiple log entries, so I have the following information available:
SessionID | Timestamp
1234 | stamp1
1234 | stamp2
2223 | stamp3
1234 | stamp4
5566 | stamp5
5566 | stamp6
2223 | stamp7
Now I would like to calculate the average/minimum/maximum session duration.
Does anyone know how to achieve this?
Thanks in advance
Doing exactly what you want isn't going to be simple; I'm not even convinced it's possible with your data in its current form.
I'm also not sure what having the average, minimum and maximum session lengths actually gives you in terms of actionable information - why do you need the max/min/avg session times?
Something that could be easily visualised using your data is the session count against a date histogram. From Kibana, create a line graph visualisation. On the y-axis do a unique count of the session ID; on the x-axis select a date histogram and use your timestamp field.
I would have thought knowing the session count over a period of time would give you a better idea for capacity planning than knowing max/min session times - perhaps you have already done this? This assumes each session logs regularly. If you zoom in too far (i.e. to intervals shorter than the gap between log events) the graph will look choppy, but it should smooth out as you zoom out, and there are options available for smoothing.
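If you want to sanity-check the same numbers outside Kibana, an equivalent raw query looks roughly like this. It is only a sketch: the logs-* index pattern is a placeholder, sessionID is assumed to be a keyword (or numeric) field, and on older Elasticsearch versions the histogram parameter is interval rather than fixed_interval.

POST logs-*/_search
{
  "size": 0,
  "aggs": {
    "sessions_over_time": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "1h" },
      "aggs": {
        "unique_sessions": { "cardinality": { "field": "sessionID" } }
      }
    }
  }
}

The cardinality aggregation is what Kibana's "unique count" runs under the hood, so the two should agree (it is an approximate count on very high cardinalities).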
Related
Suppose I have user info coming into a DB in real time. How do I monitor for a (column) value change for each user every 20 minutes? If a customer does not have a particular value in a column, output it or store it somewhere. How do I do this kind of query in ClickHouse on real-time data?
The table schema looks like:
user_name | metric_1 | metric_2 | status | timestamp
james     | 12       | 34       | fail   | 12:00:00
roon      | 23       | 67       | fail   | 12:05:00
james     | 56       | 5        | succ   | 12:20:00
roon      | 45       | 6        | fail   | 12:21:02
In that case, the query should output roon, because roon's status is still fail; it did not change.
P.S. If a user's status changes to succ after some time, will it also be possible to remove that user's data from wherever we stored it, say an array or dictionary? I have a hunch these kinds of queries are possible in real time with ClickHouse, but I am not able to get my head around it.
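To make it concrete, this is the sort of query I am imagining, sketched in ClickHouse SQL; user_events is a made-up table name, and I am not sure this is the right approach:

-- argMax(status, timestamp) returns the status of the row with the latest timestamp per user,
-- so this lists users whose most recent status is still 'fail'
SELECT user_name
FROM user_events
GROUP BY user_name
HAVING argMax(status, timestamp) = 'fail';

Running something like this every 20 minutes would give the current list of users to flag, and users whose status later becomes succ would simply drop out of the result.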
I have a couple of joined Athena tables in QuickSight. The data looks something like this:
Ans_Count | ID | Alias
10 | 1 | A
10 | 1 | B
10 | 1 | C
20 | 2 | D
20 | 2 | E
20 | 2 | F
I want to create a calculated field that sums the Ans_Count column based on distinct IDs only, i.e. in the example above the result should be 30 (10 + 20).
How do I do that? Thanks!
Are you looking for the sum before or after applying a filter?
sumIf(Ans_Count, ID) may be what you're looking for.
If you need to always return the result of the sum, regardless of the filter on the visual, look at the sumOver() function.
You can use distinctCountOver at PRE_AGG level to count unique number of values for a given partition. You could use that count to drive the sumIf condition as well.
Example : distinctCountOver(operand, [partition fields], PRE_AGG)
More details about the visual's group-by specification, and an example with the duplicate IDs, would help in giving a specific solution.
It might even be as simple as minOver(Ans_Count, [ID], PRE_AGG) and using SUM aggregation on top of it in the visual.
If you want another column with the values repeated, use sumOver(Ans_Count, [ID], PRE_AGG). Or, if you want to aggregate via QuickSight, you would use sumOver(sum(Ans_Count), [ID]).
I agree with the above suggestions to use sumOver(sum(Ans_Count), [ID]).
I have yet to understand the use cases for PRE_AGG, so if anyone has concrete examples, please share them!
Another suggestion would be to do a sum over + partition by in your table (if possible) before uploading the dataset, then check whether the results match QuickSight's aggregations. I find QuickSight can be tricky with calculated fields, aggregations, and nested ifs, so I've been doing calculations in SQL where possible before bringing the data into QuickSight, to have a better grasp of what the outputs should look like. This is obviously an extra step, but it can help in understanding how QuickSight performs calculations and arrives at its figures (as the documentation doesn't always give much), and in spotting things that don't look right (I've had a few) before you share your analysis with a wider group.
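For example, a quick check like this in Athena SQL (joined_table stands in for your joined result, and it uses a plain GROUP BY rather than a window function) reproduces the expected 30, which you can then compare against whatever your QuickSight calculated field returns:

-- Ans_Count is repeated for every Alias of an ID, so take it once per ID and sum those values
SELECT SUM(ans_count) AS total_ans_count        -- 10 + 20 = 30 for the sample data
FROM (
    SELECT ID, MIN(Ans_Count) AS ans_count      -- MIN or MAX both work, the value is constant per ID
    FROM joined_table
    GROUP BY ID
) AS per_id;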
I am working on a use case and need help improving scan performance.
Visits to our website are generated as logs, which we process with Apache Pig and insert directly into an HBase table (test) using HBaseStorage. This is done every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family).
As of now, I generate a random number for each row and insert it as the row key for that table. For example, say I have the following data to be inserted into the table:
1725 | xxx | www.something.com | 127987834 | India | zzzz
1726 | yyy | www.some.com      | 128389478 | UK    | yyyy
In that case I will add 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same customer ID will be repeated on different days, which is why I chose a random number as the row key.
When querying data from the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"}, it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1-2 seconds, since I am using it for real-time analytics.
Thanks
As per the query you have mentioned, I am assuming you need records based on Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.
However, there could be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique-number. That unique number could be the timestamp; it depends on your requirements.
To scan the data in this case, you need to use a PrefixFilter on the row key. This will give you better performance.
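For example, if the row key becomes CustomerID|timestamp (so your first sample row would be keyed 1725|127987834), the scan from the HBase shell could look roughly like this; it is only a sketch, adjust the prefix to the customer you need:

# read only the rows for customer 1725 instead of filtering the whole table
scan 'test', {FILTER => "PrefixFilter('1725|')"}

# equivalently, bound the scan explicitly; '}' is the character right after '|' in ASCII,
# so this range covers exactly the keys starting with '1725|'
scan 'test', {STARTROW => '1725|', STOPROW => '1725}'}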
Hope this helps.
I've been doing a lot of reading lately on Cassandra, specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or having to create multiple CFs just to get basic data, I am curious whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma: the columns. A compound column name with a timeUUID lets me sort and do a time-based slice, but does the concept make sense?
Is this type of structure acceptable under current best practice, or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on, even when it's as simple as this?
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
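To make those tweaks concrete, here's a rough CQL sketch; the table and column names are my guesses, so adapt them to your actual model:

-- one partition per (section, day), events clustered newest-first within the partition
CREATE TABLE section_logs (
    section    text,
    day        text,          -- 'YYYYMMDD' instead of a timeUUID row key
    event_time timeuuid,
    payload    text,          -- the full JSON log object
    country    text,
    source     text,
    PRIMARY KEY ((section, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- grab several days for one section (within each day, events come back newest first)
SELECT payload, country, source
FROM section_logs
WHERE section = 'foo' AND day IN ('20240101', '20240102');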
Context
I'm working on a small web app to store photos. Photos are ordered according to their timestamp (the date they've been taken), and it's working great. Here's a simplified look at the database:
+--------------+-------------------+
| id | timestamp |
+--------------+-------------------+
| 1 | 1000000003 |
| 2 | 1000000000 |
+--------------+-------------------+
Now I'd like to add the possibility to re-order photos. And I can't find a way of doing that without any downsides.
What I did
I first added a column to the table to save a custom order.
+--------------+-------------------+-------------+
| id | timestamp | order |
+--------------+-------------------+-------------+
| 1 | 1000000003 | 1 |
| 2 | 1000000000 | 2 |
+--------------+-------------------+-------------+
First issue: I believe I can't order photos according to two different criteria, because it'd be hard to know which one has to be given precedence.
So I'm ordering them using the order column, and only this one. When I added the order column, I gave each photo a value so that the current order would remain. I now have photos ordered by order, in the same order as when they were ordered by timestamp.
I can now re-order some photos manually, and the other ones will stay where they belong. The first issue has been solved.
But now, I want to add a new photo.
Second issue: I know when the new photo I'm adding has been taken, but my photos aren't ordered by their timestamp anymore. This photo needs to be correctly ordered, thus it needs a correct order value.
This is the issue: a correct order value.
Here are two ways I could handle a new photo:
Give it an order value greater than others. In the previous table, a new photo would be given order = 3. This is obviously a bad idea, since it doesn't take its timestamp into account. A recent photo would still be the last one displayed.
"Insert" it where it belongs, according to its timestamp. Looking at the same table, if the timestamp of the new photo was 1000000002, the new photo would be given order = 2, and the order of every following photo would be increased by 1.
The second solution looks great, except in one case: if the order of photo #2 had been manually changed to, let's say, 50, the new photo would be given order = 50 even though it belongs among the first photos (according to its timestamp).
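For reference, here is roughly how I would implement the second option in SQL, using the table above; "order" is double-quoted because it is a reserved word (MySQL would need backticks instead), and the new photo is given id 3 and timestamp 1000000002:

-- make room at position 2 by shifting every following photo down one place
UPDATE photos SET "order" = "order" + 1 WHERE "order" >= 2;

-- insert the new photo at the freed-up position
INSERT INTO photos (id, timestamp, "order") VALUES (3, 1000000002, 2);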
What I need
What I need is a way of ordering photos according to their timestamp and to their manually-set order.
Maybe you have a solution to the second issue I highlighted, or maybe you're aware of a whole other way to deal with this. Either way, thank you for your help.
At no point in your question do you mention computers or programming languages. This is OK (actually, it's a good approach: get the problem and solution worked out on paper before coding), and here's an answer which also ignores computers and programming languages.
Put all your photos into a shoebox in the order in which you get them.
Now, take three pieces of paper:
On page 1 write the numbers (one to a line) from 1 to N (the number of photos the box can hold). Whenever you put a photo in the box, write its timestamp on the line corresponding to its order in the box.
On page 2 write the timestamp of photo 1 a few lines down. Write a 1 on the same line. For the next photo, write its timestamp in the appropriate place on the paper, leaving as much space above and below as seems necessary for future photo insertions. Write a 2 on the same line. Continue until you run out of space between lines, when you need to copy all the information onto a new version of the page with more space for insertions. The information on this page is the same as the information on page 1, but with the two numbers on each line swapping positions.
On page 3 write the numbers from 1 to N again. As you collect each photo, write its number from page 1 (i.e. its number in the sequence of all photos) in the correct position for your manually-set ordering. You'll probably have to do a lot of rubbing-out and re-writing on this page as you decide that latecomers ought to be inserted higher up the page.
Now you have:
a store for your photos, the shoebox; you should already have realised that you can't store the photos in more than one order at a time;
three indexes (indices if you prefer); the first is fixed and simply assigns a unique sequence number to each photo; it also tells you the timestamp of each photo in the box;
the second index enables you to find the unique sequence number of a photo given its timestamp, and then find the photo in the shoebox;
the third index allows you to order photos as you wish; the first number on each line is the sequence number in the sorted order, the second number is the photo's unique sequence number from the first index.
All of this is an extremely long-winded way of telling you that, since you can't (either in a shoebox or a computerised data store) keep photos in multiple orders simultaneously, you will have to maintain indices for the orderings you wish to use. Those indices point (that's what an index does) from a number to a location in the shoebox, either directly or indirectly.
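If you eventually want to translate the shoebox back into a database, a minimal sketch of the idea might look like this; the table and column names are only illustrative:

-- the shoebox plus "page 1": every photo with its timestamp
CREATE TABLE photos (
    id        INTEGER PRIMARY KEY,
    timestamp INTEGER NOT NULL
);

-- "page 3": a separate index holding the manually-set ordering
CREATE TABLE manual_order (
    photo_id  INTEGER PRIMARY KEY REFERENCES photos(id),
    position  INTEGER NOT NULL
);

-- chronological order ("page 2" is simply an ordering by timestamp)
SELECT id FROM photos ORDER BY timestamp;

-- manual order: look each photo up through its own index
SELECT p.id
FROM photos p
JOIN manual_order m ON m.photo_id = p.id
ORDER BY m.position;

Keeping the manual ordering in its own index means new photos can still be slotted in by timestamp, and only the photos the user actually reorders need entries (or updates) in manual_order.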