Count multiple aggregates in a sliding window in Spark Structured Streaming - spark-streaming

I have a streaming source that sends events in which every record consists of three fields: CreatedTime, FP and Detected.
Here, 'FP' stands for false positive. The 'FP' and 'Detected' fields can each be 1 or 0.
I want to calculate the following values over a sliding window:
FPR1 = Count(FP) / Count(Detected) and FPR2 = Count(FP) / Count(total records in the window)
I am able to aggregate Count(FP) using the following query. I also want the other two aggregates, DetectedCount and TotalCount, so that I can calculate FPR1 and FPR2 and write them to a file sink. How do I do this? Thanks in advance.
val aggDF = finaldata
.withWatermark("CreatedTime", "2 minute")
.groupBy(col("FP"),
window(col("CreatedTime"), "5 minute", "1 minute"))
.agg(sum("FP").alias("FPCount"))

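I finally figured it out. I was using groupBy incorrectly: grouping by FP as well as by the window split each window into separate rows per FP value. Here is the final query, which groups by the window only.
import org.apache.spark.sql.functions.{col, sum, window}
val aggDF = finaldata
  .withWatermark("CreatedTime", "2 minute")
  .groupBy(window(col("CreatedTime"), "5 minute", "1 minute"))
  .agg(
    sum("FP").alias("FPCount"),
    sum("Detected").alias("DetectedCount"),
    // "Count" is presumably a helper column set to 1 on every record; count(lit(1)) would also work.
    sum("Count").alias("TotalCount"))
  .withColumn("FPR1", col("FPCount") / col("DetectedCount"))
  .withColumn("FPR2", col("FPCount") / col("TotalCount"))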
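To make the windowed aggregation concrete, here is a minimal plain-JavaScript sketch (not Spark code) of what a 5-minute window sliding every minute computes. The event shape ({ t, fp, detected }, with t in minutes) and the function name slidingWindowRates are assumptions made for illustration, and returning null when a window has no detections is just one possible choice.

```javascript
// Plain-JavaScript illustration of the aggregation; this is not Spark code.
// Each event is assumed to look like { t: minutes, fp: 0|1, detected: 0|1 }.
function slidingWindowRates(events, windowLen = 5, slide = 1) {
  const windows = new Map(); // window start -> running counts
  for (const e of events) {
    // An event at time t falls into every window [start, start + windowLen)
    // whose start is a multiple of `slide`.
    const lastStart = Math.floor(e.t / slide) * slide;
    for (let start = lastStart; start > e.t - windowLen; start -= slide) {
      const w = windows.get(start) || { fp: 0, detected: 0, total: 0 };
      w.fp += e.fp;
      w.detected += e.detected;
      w.total += 1;
      windows.set(start, w);
    }
  }
  return [...windows.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([start, w]) => ({
      start,
      ...w,
      // FPR1 = Count(FP) / Count(Detected); null if nothing was detected.
      fpr1: w.detected === 0 ? null : w.fp / w.detected,
      // FPR2 = Count(FP) / Count(total records in the window).
      fpr2: w.fp / w.total,
    }));
}

const sample = [
  { t: 0, fp: 1, detected: 1 },
  { t: 1, fp: 0, detected: 1 },
  { t: 2, fp: 0, detected: 0 },
];
console.log(slidingWindowRates(sample));
```

Because each event lands in several overlapping windows, all three sums must be keyed by the window alone; adding FP to the grouping key would produce one row per FP value per window, and the ratios could not be formed from a single row.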

Related

Show only last occurrence of the field

I have an issue with a report I am trying to build in SAP. I want to show each SR Num only once, but each one appears many times in my report. Each number has multiple activities and comments, which is why it appears more than once. I only need the last appearance of each SR Num based on date, and no filter seems to help with this. I also tried ranking, but I do not have a metric, so I created a new variable that finds the max date for each SR Num. That did not work either because there are multiple values.
Please help!
For example, I have three columns: SR Num, date and comments. The first holds three different numbers, each repeated several times, and the dates and comments are all different. I need to keep each SR Num only once, with its most recent date and comment.
I created some sample data in a free-hand SQL query which yields this...
You will need to find the maximum date for each SR Num and then only show that row for each SR Num. I used two variables to achieve this.
Var Max Activity Date...
=Max([Activity Date]) In ([SR Num])
Var Is Max Activity Date...
=If([Activity Date] = [Var Max Activity Date]; 1; 0)
Finally add a table filter to only show the rows where the Activity Date is the Max Activity Date for each SR Num.
You do not need the variables in your table in the end. I just put them there in order to visualize what is going on. That's it.
Noel
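The logic behind the two variables and the table filter can be sketched outside the reporting tool. The following plain-JavaScript example is only an illustration: the field names srNum, activityDate and comment and the sample rows are made up, and dates are assumed to be ISO strings (YYYY-MM-DD) so that string comparison orders them correctly.

```javascript
// Keep only the most recent row for each SR Num.
// Field names and sample data are hypothetical.
function latestPerSrNum(rows) {
  const latest = new Map(); // srNum -> row with the highest activityDate so far
  for (const row of rows) {
    const seen = latest.get(row.srNum);
    // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if (!seen || row.activityDate > seen.activityDate) {
      latest.set(row.srNum, row);
    }
  }
  return [...latest.values()];
}

const rows = [
  { srNum: "SR1", activityDate: "2020-01-01", comment: "opened" },
  { srNum: "SR2", activityDate: "2020-02-01", comment: "opened" },
  { srNum: "SR1", activityDate: "2020-03-01", comment: "closed" },
];
console.log(latestPerSrNum(rows));
```

One difference to be aware of: the filter on the flag variable keeps every row that shares the maximum date for an SR Num, whereas this sketch keeps only the first such row.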

How to make a random sampling of 20% of records in Tableau?

In Tableau 9.2, is it possible to generate a random sample of records? If so, how could I do this? Say I have a field called Field1 and want to select only 20% of the records. So far I have found how to generate a random integer in Tableau, though it is bewildering that Tableau does not already have a function for this:
Random Seed
(DATEPART('second', NOW()) + 1) * (DATEPART('minute', NOW()) + 1) * (DATEPART('hour', NOW()) + 1) * (DATEPART('day', NOW()) + 1)
Random Number
((PREVIOUS_VALUE(MIN([Seed])) * 1140671485 + 12820163) % (2^24))
Random Int
INT([Random Number] / (2^24) * [Random Upper Limit]) + 1
So how could I create a calculated field to only show random records that make up 20% of Field1?
When you make an extract, there is a dialog panel where you can filter records and specify rolling up to visible dimensions.
For at least some data sources, you can also limit the number of records (say, grab the first 2000 records) or take a random percentage (say, 10% of the records).
You can then work with the small extract to design your viz quickly, and remove the extract or refresh it with all the data when you are ready. I don't think every data source supports the random selection, though.
There is a random number function in Tableau, but it is hidden and doesn't appear in the list of available functions.
It is "random()". It generates a uniformly distributed number between 0 and 1.
It isn't documented, but it works. See, for example, this previous answer: how to generate pseudo random numbers and row-count in Tableau
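The idea behind sampling with a uniform random number is to keep a row whenever its random draw falls below the desired fraction. Here is a minimal JavaScript sketch of that idea; it is not Tableau syntax, and the function name sampleFraction and its rng parameter are assumptions made for illustration.

```javascript
// Keep each row independently with probability `fraction`.
// `rng` must return uniform numbers in [0, 1); Math.random is the default.
function sampleFraction(rows, fraction, rng = Math.random) {
  return rows.filter(() => rng() < fraction);
}

const records = Array.from({ length: 1000 }, (_, i) => ({ id: i }));
const sample = sampleFraction(records, 0.2);
console.log(sample.length); // close to 200, but it varies from run to run
```

This approach yields approximately 20% of the records rather than exactly 20%, because each row is decided independently. An exact count would require shuffling the rows and taking the first fifth.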
I ended up solving my issue through the back-end in my MS Access database with the following MS Access SQL Query within a MS Access VBA macro I made:
value1 = "some_value"
fieldName = "[my_field_name]"
sqlQuery = "SELECT [my_table].* " & _
" INTO new_table_name" & _
" FROM [my_table] " & _
" WHERE [some_field] = '" & value1 & "'" & _
" ORDER BY Rnd(-(100000*" & fieldName & ")*Time())"
Debug.Print sqlQuery
CurrentDb.Execute sqlQuery
I ended up deciding that something like this would be best left to the back-end and to leave the visual analytics to Tableau.

Pig CASE statement for finding the number of events in a specific period of time

I have a dataset that resembles a movie database, holding each movie's name, rating, duration and year of release.
How do you find the number of movies released during a span of 10 years?
The dataset is comma-separated.
Movie = load '/home/movie/movies.txt' using PigStorage(',') as (movieid:int, moviename:chararray, yearofrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = filter Movie by yearofrelease > 2000 and yearofrelease < 2010;
result = foreach movies_released_between_2000_2010 generate moviename, yearofrelease;
dump result;
-- Flag each movie: 1 if released inside the range, 0 otherwise.
year_count = FOREACH Movie GENERATE (CASE WHEN yearofrelease > 2000 AND yearofrelease < 2010 THEN 1 ELSE 0 END) AS year_flag, moviename;
year_grp = GROUP year_count BY year_flag;
-- COUNT takes the grouped bag, which is named after the grouped relation.
movie_count_out = FOREACH year_grp GENERATE group, COUNT(year_count);
The example above should give you an understanding of the solution, though it may contain some syntax errors. If you need to group by decade, you can apply a substring function to the year to derive the decade and group on that.
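The decade grouping mentioned above can also be done arithmetically instead of with a substring. The following JavaScript sketch shows the idea on in-memory data; it is not Pig code, and the field name year and the sample movies are hypothetical.

```javascript
// Count movies per decade by truncating the release year to a multiple of 10.
// The `year` field and the sample data are made up for illustration.
function countByDecade(movies) {
  const counts = {};
  for (const m of movies) {
    const decade = Math.floor(m.year / 10) * 10; // 2005 -> 2000
    counts[decade] = (counts[decade] || 0) + 1;
  }
  return counts;
}

const movies = [
  { name: "A", year: 1999 },
  { name: "B", year: 2001 },
  { name: "C", year: 2005 },
  { name: "D", year: 2010 },
];
console.log(countByDecade(movies)); // { '1990': 1, '2000': 2, '2010': 1 }
```

The same truncation, integer division of the year by 10 followed by multiplication by 10, could serve as the grouping key in the Pig script in place of the 0/1 flag.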

What is the difference between TYPE_STEP_COUNT_DELTA and AGGREGATE_STEP_COUNT_DELTA data type in Google Fit Android Api?

The Google Fit API documentation describes these two data types of the Sensor API, but both seem to return the same data. Can anyone explain the difference?
TYPE_STEP_COUNT_DELTA:
In the com.google.step_count.delta data type, each data point represents the number of steps taken since the last reading.
AGGREGATE_STEP_COUNT_DELTA:
Aggregate number of steps during a time interval.
You can see more here:
https://developers.google.com/android/reference/com/google/android/gms/fitness/data/DataType
// Setting a start and end date using a range of 1 week before this moment.
Calendar cal = Calendar.getInstance();
Date now = new Date();
cal.setTime(now);
long endTime = cal.getTimeInMillis();
cal.add(Calendar.WEEK_OF_YEAR, -1);
long startTime = cal.getTimeInMillis();
java.text.DateFormat dateFormat = getDateInstance();
Log.i(TAG, "Range Start: " + dateFormat.format(startTime));
Log.i(TAG, "Range End: " + dateFormat.format(endTime));
DataReadRequest readRequest = new DataReadRequest.Builder()
// The data request can specify multiple data types to return, effectively
// combining multiple data queries into one call.
// In this example, it's very unlikely that the request is for several hundred
// datapoints each consisting of a few steps and a timestamp. The more likely
// scenario is wanting to see how many steps were walked per day, for 7 days.
.aggregate(DataType.TYPE_STEP_COUNT_DELTA, DataType.AGGREGATE_STEP_COUNT_DELTA)
// Analogous to a "Group By" in SQL, defines how data should be aggregated.
// bucketByTime allows for a time span, whereas bucketBySession would allow
// bucketing by "sessions", which would need to be defined in code.
.bucketByTime(1, TimeUnit.DAYS)
.setTimeRange(startTime, endTime, TimeUnit.MILLISECONDS)
.build();
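Going by the two descriptions quoted above, the relationship between the types can be pictured as raw delta points being summed per time bucket. The JavaScript sketch below only illustrates that idea; it does not use the Google Fit API, the point shape ({ startMs, steps }) is an assumption, and buckets are cut at UTC day boundaries for simplicity.

```javascript
// Conceptual illustration only; this does not call the Google Fit API.
// Each delta point is assumed to look like { startMs, steps }.
const DAY_MS = 24 * 60 * 60 * 1000;

function stepsPerDay(deltas) {
  const totals = new Map(); // start of UTC day (ms) -> total steps
  for (const d of deltas) {
    const dayStart = Math.floor(d.startMs / DAY_MS) * DAY_MS;
    totals.set(dayStart, (totals.get(dayStart) || 0) + d.steps);
  }
  return totals;
}

const deltas = [
  { startMs: 0, steps: 10 },
  { startMs: 1000, steps: 20 },
  { startMs: DAY_MS + 5, steps: 5 },
];
console.log(stepsPerDay(deltas)); // two buckets: 30 steps and 5 steps
```

In this picture, the individual points correspond to the delta readings and each per-day total corresponds to one aggregate value for a bucket produced by bucketByTime(1, TimeUnit.DAYS).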

BIRT report cross tabs: How to calculate and display durations of time?

I have a BIRT report that displays some statistics of calls to a certain line on certain days. Now I have to add a new measure called "call handling time". The data is collected from a MySQL DB:
TIME_FORMAT(SEC_TO_TIME(some calculations on the duration of calls in seconds),'%i:%s') AS "CHT"
I fail to display the duration in my crosstab in "mm:ss" format, even when I do not convert it to a string. I can display the raw seconds by not converting them to a time or string, but that is not very human-readable.
I am also supposed to add a "grand total" that calculates the average over all days. That is no problem when using seconds, but I have no idea how to do it in a time format.
Which data types, functions, expressions and settings do I have to use in the query, the Data Cube definition and the crosstab cell to make this work?
Time format is not a duration measure, it cannot be summarized or used for an average. A solution is to keep "seconds" as measure in the datacube to compute aggregations, and create a derived measure for display.
In your datacube, select this "seconds" measure and click "add" to create a derived measure. I would use BIRT math functions to build this expression:
Math.floor(measure["seconds"]/60)+":"+BirtMath.mod(measure["seconds"],60)
Here are some things to watch out for: seconds are displayed as single-digit values if they are below 10. The "seconds" value this is based on is not an integer, so I needed another round() for the seconds as well, which resulted in seconds sometimes being "60".
So I had to introduce some more JavaScript conditions to get the correct formatting, including displaying nothing at all for "0:00".
For the "totals" column I used the summary total of the seconds value and did the exact same thing as below.
This is the actual script I ended up using:
if (measure["seconds"] > 0)
{
    // Round once to whole seconds so that, for example, 59.6 s carries over to 1:00.
    var total = BirtMath.round(measure["seconds"]);
    // Use floor, not round, for the minutes: rounding would show 90 s as 2:30.
    var minutes = Math.floor(total / 60);
    var seconds = total % 60;
    if (seconds < 10)
    {
        minutes + ":0" + seconds;
    }
    else
    {
        minutes + ":" + seconds;
    }
}
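The formatting rules described above (round to whole seconds, carry 60 seconds into the next minute, zero-pad single digits, show nothing for zero) can be checked outside BIRT with a small standalone function. This is a sketch in plain JavaScript under those assumptions; the names formatDuration and averageSeconds are hypothetical and the code does not use the measure[...] binding or BirtMath.

```javascript
// Format a duration given in (possibly fractional) seconds as "m:ss".
// Returns an empty string when the rounded duration is zero or negative.
function formatDuration(rawSeconds) {
  const total = Math.round(rawSeconds); // round once, before splitting
  if (total <= 0) return "";
  const minutes = Math.floor(total / 60);
  const seconds = total % 60;
  return minutes + ":" + String(seconds).padStart(2, "0");
}

// Average in seconds first, then format: times in "mm:ss" form cannot be averaged.
function averageSeconds(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

console.log(formatDuration(90));   // "1:30"
console.log(formatDuration(59.6)); // "1:00"
console.log(formatDuration(averageSeconds([60, 120, 90]))); // "1:30"
```

Rounding the total before splitting it into minutes and seconds avoids the "60 seconds" case entirely, so no special handling is needed for it.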

Resources