Extracting weekly or monthly data from a daily Daru time series - ruby

I am trying to understand how to obtain weekly or monthly data from a daily Daru time series.
Given this sample time series:
require 'daru'
index = Daru::DateTimeIndex.date_range start: '2020-01-15',
                                       periods: 80, freq: 'D'
data = { price: Array.new(index.periods) { rand 100 } }
df = Daru::DataFrame.new data, index: index
I would like to:
Create a new monthly data frame with only the last row in each month.
Create a new weekly data frame with only the last row of each week.
In fact, I am not sure if I need to create a new data frame, or just query the original data frame.
The objective is to be able to plot weekly and monthly percentage change.
I am not even sure if this operation is something I am supposed to do on the index or on the data frame, so I am a bit lost in the documentation.
Any assistance is appreciated.
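Daru has no direct equivalent of pandas' resample, so one approach is to query the original frame: group the row positions by calendar month (or ISO week) using the index dates, keep the last position in each group, and build the reduced frames from those rows. A minimal sketch, assuming DateTimeIndex#to_a yields DateTime objects and DataFrame#row_at accepts multiple positions (worth verifying in your Daru version):
dates = df.index.to_a # one DateTime per row

# Last row position per calendar month and per ISO week
monthly_pos = (0...dates.size).group_by { |i| [dates[i].year, dates[i].month] }.values.map(&:last)
weekly_pos  = (0...dates.size).group_by { |i| [dates[i].cwyear, dates[i].cweek] }.values.map(&:last)

monthly = df.row_at(*monthly_pos) # new frame: month-end rows only
weekly  = df.row_at(*weekly_pos)  # new frame: week-end rows only

# Percentage change between consecutive month-end prices, ready to plot
monthly_pct = monthly[:price].to_a.each_cons(2).map { |prev, curr| 100.0 * (curr - prev) / prev }
The original frame is left untouched; monthly and weekly are new DataFrames, so either of the two phrasings in the question works.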

Related

SSAS Tabular - how to aggregate differently at month grain?

In my cube, I have several measures at the day grain that I'd like to sum at the day grain but average (or take the latest of) at the month or year grain.
Example:
We have a fact table with a date and the number of active subscribers on that day (aka PMC). This is snapshotted per day.

dt       SubscriberCnt
1/1/22   50
1/2/22   55
This works great at the day level. At the month level, we don't want to sum these two values (count = 105) because that isn't meaningful or accurate.
When someone is looking at the month grain, it should look like this - take the latest value for the month (we may change this to an average instead; management is still deciding):
Option 1 - take latest:

Month-Dt   Subscribers
Jan-2022   55
Feb-2022   -

Option 2 - take average:

Month-Dt   Subscribers
Jan-2022   52
Feb-2022   -
I've not been able to find the right search terms for this, but it seems like a common problem.
I added some sample data at the end of a month for testing:
dt         SubscriberCnt
12/30/21   46
12/31/21   48
This formula uses LASTNONBLANKVALUE, which sorts by the first column and provides the latest value that is not blank:
Monthly Subscriber Count = LASTNONBLANKVALUE( 'Table'[dt], SUM('Table'[SubscriberCnt]) )
If you do an AVERAGE, a simple AVERAGE formula will work. If you want an average just for the current month, then try this:
Current Subscriber Count =
VAR _EOM = CLOSINGBALANCEMONTH( SUM('Table'[SubscriberCnt]), DateDim[Date] )
RETURN IF(_EOM <> 0, _EOM, AVERAGE('Table'[SubscriberCnt]) )
But the total row will be misleading, so I would add this so the total row is the latest number:
Current Subscriber Count =
VAR _EOM = CLOSINGBALANCEMONTH( SUM('Table'[SubscriberCnt]), DateDim[Date] ) //Get the number on the last day of the month
VAR _TOT = NOT HASONEVALUE(DateDim[MonthNo]) // Check if this is a total row (more than one month value)
RETURN IF(_TOT, [Monthly Subscriber Count], // For total rows, use the latest nonblank value
IF(_EOM <> 0, _EOM, AVERAGE('Table'[SubscriberCnt]) ) // For month rows, use final day if available, else use the average
)

SSIS Data Flow: Aggregate Task 'hides' my other columns downstream

I'm hoping someone can help me with this.
I have an SSIS package that updates a table, and one of its columns is a 'current average'. In my Data Flow Task, I need to look at 'tickets sold' for each record, aggregate the average, and add it as part of my insert.
Problem is, the Aggregate Task hides all my other columns. I've got 'tickets' and 'avg tickets', 'venue' and 'time'. When I go to insert the record into my DB destination, none of the 3 source columns (venue, time, tickets) are visible, and the only one available is my aggregate. I need all four for my insert. How do I get those other columns to 'pass through' the Aggregate task so I can use them?
Source: Excel sheet
Venue, Tickets Sold, Show Time
Royal Oak Music Theatre, 300, 7:00 PM
Saint Andrew's Hall, 200, 9:00 PM
Fox Music Theatre, 700, 8:00 PM
Destination: SQL Table
Venue, Tickets Sold, Show Time, Average Tickets Sold Per Show
Royal Oak Music Theatre, 300, 7:00PM, 300
Saint Andrews Halls, 200, 9:00PM, 250
Fox Music Theatre, 700, 8:00PM, 400
Given your sample data, it appears you're creating a running average: each row adds a new weight to be factored into the average.
The challenge is that the Aggregate component in SSIS doesn't do that. It's going to give you an average by each grouping (or none, in your case).
You're going to need a Script Component to compute this.
Check the "Tickets Sold" column as an input for the script (which will likely be named TicketsSold or Tickets_Sold or some permutation thereof).
You'll need to define a new column in your Output Buffer, which I'll assume is named runningAverage with type DT_NUMERIC.
I'm freehanding this code, so syntax errors are mine, but the logic is sound ;)
public class ScriptMain : UserComponent
{
    int itemCount;
    int total;

    /// initialize members
    public override void PreExecute()
    {
        base.PreExecute();
        this.itemCount = 0;
        this.total = 0;
    }

    /// Process all the rows, one at a time
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // accumulate
        this.itemCount++;
        this.total += Row.TicketsSold;

        // populate the new column; cast to decimal lest we truncate with integer division
        Row.runningAverage = (decimal)this.total / this.itemCount;
    }
}

Amchart error with the baseInterval set as month

I am trying to use amCharts with this setting:
dateAxis.baseInterval = {
  "timeUnit": "month",
  "count": 1
}
But the data does not display the way I expect: when I have more than one day in the month with data, the graph shows more than one bullet for the same month.
For example, if I have the following data:
2019-10-11 => 20
2019-10-12 => 30
instead of displaying
(2019-10) => 50
the graph shows
(2019-10) => 20,
(2019-10) => 30
Thanks in advance.
AmCharts v4 doesn't aggregate your data for you. baseInterval merely tells the chart how to render your data with the minimum intervals between your points. Setting it to month with multiple data points in the same month will display multiple points; this is by design.
If you intend to display your data in monthly intervals and have some data points where more than one point is in the same month, you need to manually aggregate your data beforehand - in your case, convert that point to a single data item in October with a value of 50.
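Since the aggregation happens before the data reaches the chart, it can be done in whatever language prepares your data. A minimal sketch (in Ruby, matching this page's main question; the raw hash and its ISO date-string keys are assumptions based on the example above):
raw = { '2019-10-11' => 20, '2019-10-12' => 30 } # hypothetical input from the question

# Sum the values per calendar month, then emit one data item per month,
# keyed here to the first day of the month for the chart's date axis
monthly = raw.group_by { |date, _| date[0, 7] } # '2019-10'
             .map { |month, pairs| { date: "#{month}-01", value: pairs.sum { |_, v| v } } }
# => [{ date: "2019-10-01", value: 50 }]
The resulting array, one item per month, is what you then hand to the chart's data property.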

Power Query (M language) 50 day moving Average

I have a list of products and would like to get a 50-day simple moving average of volume using Power Query (M).
The table is sorted by product name and date. I added a custom column and applied the code below.
if [date] >= #date(2018,1,29)
then List.Average(List.Range(Source[Volume],[Volume]-1,-50))
else ""
Since it is already sorted by date and name, an if statement was applied with a date as criteria/filter. However, an error occurs that says:
'Volume' column not found in the table.
I expect to have an added column in Power Query with the 50-day moving average of volume per product, with the calculation done only if the date is greater than or equal to Jan 29, 2018.
We don't know what your columns are, but assuming you have [product], [date] and [volume] in Source, this would average the last 50 days of [volume] for the identical [product] based on each [date], and place it in a new column:
AvgAmountAdded = Table.AddColumn(
    Source,
    "AverageAmount",
    (i) => List.Average(
        Table.SelectRows(
            Source,
            each ([product] = i[product] and [date] <= i[date] and [date] >= Date.AddDays(i[date], -50))
        )[volume]
    ),
    type number
)
Finally! Found a solution.
First, apply an index by product (see this post for further details).
Then index again without criteria (index all rows).
Then apply the code below:
= Table.AddColumn(#"Previous Step", "Volume SMA(50)", each if [Index_byProduct] >= 50 then List.Average(List.Range(#"Previous Step"[Volume], [Index_All] - 50, 50)) else 0)
For large datasets, the Table.Buffer function is recommended after the index-expand step to improve PQ calculation speed, as sketched below.
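A minimal sketch of where that buffering might sit, reusing the hypothetical step name from the snippet above (Table.Buffer pins the table in memory so the repeated List.Range calls don't re-evaluate the upstream steps):
#"Buffered" = Table.Buffer(#"Previous Step")
Then reference #"Buffered" instead of #"Previous Step" in the Table.AddColumn call.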

h2o H2OAutoEncoderEstimator

I was trying to detect outliers using the H2OAutoEncoderEstimator.
Basically I load 4 KPIs from a CSV file.
For each KPI I have 1 month of data.
The data in the CSV file have been manually created and are identical for each KPI.
The following picture shows the trend of the KPIs:
The first black vertical line (x=4000) indicates the end of the training data.
All the other light black vertical lines indicate the data that I use to detect the outliers each time.
As you can see, the data are very regular (I've copied & pasted the first 1000 rows 17 times).
This is what my code does:
Loads the data from a CSV file (one row represents the value of all KPIs at a specific timestamp)
Trains the model using the first 4000 timestamps
Starting from the 4001st timestamp, every 250 timestamps it calls model.anomaly to detect the outliers in a specific window (250 timestamps)
My questions are:
Is it normal that the errors returned by model.anomaly increase with every call (from 0.1 to 1.8)?
If I call model.train again, will the training phase be performed from scratch, replacing the existing model, or will the model be updated with the new data provided?
This is my python code:
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
data = loadDataFromCsv()
nOfTimestampsForTraining = 4000
frTrain = h2o.H2OFrame(data[:nOfTimestampsForTraining])
colsName = frTrain.names
model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[5, 4, 3],
                                l1=1e-5,
                                ignore_const_cols=False,
                                autoencoder=True,
                                epochs=100)
# Train on the first 4000 timestamps
model.train(x=colsName, training_frame=frTrain)

# Init indexes: the first window starts right after the training data
nOfTimestampsForWindows = 250
fromIndex = nOfTimestampsForTraining
toIndex = fromIndex + nOfTimestampsForWindows

# Perform the outlier detection every nOfTimestampsForWindows timestamps
while toIndex <= len(data):
    frTest = h2o.H2OFrame(data[fromIndex:toIndex])
    error = model.anomaly(frTest)
    df = error.as_data_frame()
    print(df)
    print(df.describe())
    # Advance indexes to the next window
    fromIndex = toIndex
    toIndex = fromIndex + nOfTimestampsForWindows
