Specifying the frequencies for msts objects - time

I have hourly-level data for one month which has both daily and weekly seasonality. I am defining it in the following way:
msts_cons <- xts_series %>% msts(seasonal.periods = c(24, 7 * 24), ts.frequency = 365.25 * 24)
Is this specification correct?

Related

Improve performance of wide GroupBy + write

I need to tune a job that looks like the one below.
import pyspark.sql.functions as F
dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]
# Aggregation
aggregate = (
    spark.table("input_table")
    .groupBy(*dimensions)
    .agg(*expressions)
)
# Write out summary table
aggregate.write.format("delta").mode("overwrite").save("output_table")
The input table contains transactions, partitioned by date, 8 files per date.
It has 108 columns and roughly half a billion records. The aggregated result has 37 columns and ~20 million records.
Nothing I do seems to improve the runtime, so I would like to understand what affects the performance of this aggregation, i.e. what I can potentially change.
The only thing that seems to help is manually partitioning the work, i.e. starting multiple concurrent copies of the same code, each with a different date range.
To the best of my understanding, the groupBy clause currently doesn't include the "date" column, so you are aggregating across all dates in one query and not using the input table's date partitioning at all.
You can add the "date" column to the groupBy clause, and then you will sum up the measures per date.
Also, when the input_table is built, you could additionally partition it by d1, d2, d3 (or at least some of them), provided they don't have high cardinality.
Finally, the input_table will benefit from a columnar file format (Parquet), so you don't have to read all 108 columns from disk the way you would with something like CSV. You are probably already using Parquet or Delta, but it's worth checking.
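Putting these suggestions together, a rough sketch of the revised job could look like the code below. This assumes a per-date summary is acceptable downstream; the select list and the partitionBy("date") on the write are additions that are not in the original job.

import pyspark.sql.functions as F

dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]

# Include "date" in the grouping so the work lines up with the existing
# date partitions and the summary keeps one row per date.
aggregate = (
    spark.table("input_table")
    .select("date", *dimensions, *measures)  # touch only the columns needed
    .groupBy("date", *dimensions)
    .agg(*expressions)
)

# Partition the output by date as well so downstream reads can prune.
(aggregate.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("date")
    .save("output_table"))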

Dynamic Calculated Column on different report level SSAS DAX query

I'm trying to create a calculated column based on a derived measure in an SSAS cube. The measure counts the number of cases per order, so an order with 3 cases has the value 3.
Now I'm trying to create a bucket attribute with the values 1caseOrder, 2caseOrder, 3caseOrder and 3+caseOrder. I tried the following:
IF([nrofcase] = 1, "nrofcase[1]",
    IF([nrofcase] = 2, "nrofcase[2]",
        IF([nrofcase] = 3, "nrofcase[3]", "nrofcase[>3]")))
But it doesn't work as expected: when the report level changes from quarter to week, it is supposed to recalculate at the new level.
Please let me know whether this can work.
Calculated columns are static. When the column is added and when the table is processed, the value is calculated and stored. The only way for the value to change is to reprocess the model. If the formula refers to a DAX measure, it will use the measure without any of the context from the report (e.g. no row filters, slicers, etc.).
Think of it this way:
Calculated column is a fact about a row that doesn't change. It is known just by looking at a single row. An example of this is Cost = [Quantity] * [Unit Price]. Cost never changes and is known by looking at the Quantity and Unit Price columns. It doesn't matter what filters or context are in the report. Cost doesn't change.
A measure is a fact about a table. You have to look at multiple rows to calculate its value. An example is Total Cost = SUM(Sales[Cost]). You want this value to change depending on the context of time, region, product, etc., so its value is not stored but calculated dynamically in the report.
It sounds like for your data, there are multiple rows that tell you the number of cases per order, so this is a measure. Use a measure instead of a calculated column.
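As a rough analogy in Python/pandas rather than DAX (the column names below are made up for illustration), the calculated column is computed once per row and stored, while the measure is re-evaluated against whatever rows the current filter context leaves visible:

import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Quantity": [2, 1, 4],
    "Unit Price": [10.0, 25.0, 10.0],
})

# "Calculated column": a per-row fact, computed once and stored with the table.
sales["Cost"] = sales["Quantity"] * sales["Unit Price"]

# "Measure": evaluated on demand over the rows visible in the current context.
def total_cost(df):
    return df["Cost"].sum()

print(total_cost(sales))                             # no filters applied
print(total_cost(sales[sales["Region"] == "East"]))  # filtered context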

Is there any algorithm that can include rule-based data association for spatial clustering?

Consider a group of events A, B, C and D. These events are related to each other, and the relationships can be defined through a set of rules such as:
A is followed by B or D, and never by C.
If B occurs twice in 5 minutes, C is triggered.
... and so on.
The dataset I have has more than 10,000 rows, where each record consists of geographic coordinates, a timestamp and the event that occurred at that time. I want to cluster these data points based on the rules mentioned above, but I'm not sure how to do it. The threshold factors that prevent all the events from being grouped together could be decided based on time intervals or spatial distance.
How can these rules be represented and be used as a deciding factor during clustering?
So far I've tried clustering based on spatial and temporal factors using algorithms like ST-DBSCAN and CluStream, but I'd really like to find a way to group the data points based on the event sequences according to the rules.

Training the same Google AutoML Model multiple times

Question: Is it possible to train the same Model, from Google AutoML, multiple times?
Problem: I have several datasets with time series data. Example:
Dataset A: [[product1, date1, price], [product1, date2, price]]
Dataset B: [[product2, date1, price], [product2, date2, price]]
Dataset C: [[product3, date1, price], [product3, date2, price]]
When describing the columns in Google AutoML, you can mark the data as time series data and specify the date column as the time series column. It is very important to keep in mind that this is time series data. I'd think combining the datasets wouldn't be a good idea because there would be duplicate dates.
Is it possible to train the model on dataset A and, once that finishes, on dataset B, and so on, or would you advise combining the datasets?
Thanks.
You can combine the data; I don't see why that would matter for what you are describing. Marking a column as a Time column makes AutoML Tables split the data based on that column, putting the oldest 80% in the training set, the next most recent 10% in the validation set, and the most recent 10% in the test set.
If there is not enough data with distinct values in the time column to support the 80/10/10 split described above, you will want to not mark it as the Time column and instead split the data manually.
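A minimal sketch of such a manual chronological split, assuming the data is in a pandas DataFrame with a date column (the 80/10/10 proportions simply mirror what AutoML Tables does with a Time column):

import pandas as pd

def chronological_split(df: pd.DataFrame, date_col: str = "date"):
    # Sort by date so the oldest rows go to training and the newest to test.
    df = df.sort_values(date_col)
    n = len(df)
    train = df.iloc[:int(n * 0.8)]
    valid = df.iloc[int(n * 0.8):int(n * 0.9)]
    test = df.iloc[int(n * 0.9):]
    return train, valid, test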
If the datasets are not related and are distinct from each other, then you would want to train individual models for each.

Aggregation over a specific partition in Apache Kafka Streams

Let's say I have a Kafka topic named SensorData to which two sensors, S1 and S2, are sending data (timestamp and value), each to a different partition, e.g. S1 -> P1 and S2 -> P2. Now I need to aggregate the values for these two sensors separately, say by calculating the average sensor value over a time window of 1 hour and writing it into a new topic SensorData1Hour. Given this scenario:
How can I select a specific topic partition using the KStreamBuilder#stream method?
Is it possible to apply some aggregation function over two (multiple) different partitions from same topic?
You cannot (directly) access single partitions and you cannot (directly) apply an aggregation function over multiple partitions.
Aggregations are always done per key: http://docs.confluent.io/current/streams/developer-guide.html#stateful-transformations
Thus, you could use a different key for each partition and then aggregate by key. See http://docs.confluent.io/current/streams/developer-guide.html#windowing-a-stream
The simplest way is to let each of your producers apply a key to each message right away.
If you want to aggregate multiple partitions, you first need to set a new key (e.g., using selectKey()) and set the same key for all data you want to aggregate (if you want to aggregate all partitions, you would use a single key value -- however, keep in mind, this might quickly become a bottleneck!).
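For the "apply a key at the producer" suggestion, here is a minimal sketch using the Python kafka-python client (the topic name comes from the question; using the sensor id as the key and the JSON payload format are assumptions). The windowed aggregation itself would still be written with the Kafka Streams DSL:

import json
from kafka import KafkaProducer  # kafka-python client, assumed here

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Keying each record by sensor id lets Kafka Streams aggregate per sensor
# simply by grouping on the key, regardless of which partition it lands in.
producer.send("SensorData", key="S1", value={"timestamp": 1700000000, "value": 21.5})
producer.send("SensorData", key="S2", value={"timestamp": 1700000000, "value": 19.8})
producer.flush()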
