What is the order of AR and MA? - arima

I am unable to understand how to choose the order of AR and MA for an ARIMA model. I am using the Automatic ARIMA Forecasting tool of EViews, and it gives ARMA(2,4), but the output table shows p-values for these terms greater than 0.05.

You could take a look at the information criteria for different orders (for example, p and q in the range [0, 5]). The model that minimizes either the AIC or the BIC is preferred.
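For illustration, here is a minimal sketch of that search in Python with statsmodels (the question uses EViews, so this is only an illustration of the idea; y stands for your series and d for the chosen degree of differencing, both of which are assumptions of the sketch):

import itertools
import statsmodels.api as sm

d = 0  # assumed degree of differencing
best_order, best_aic = None, float("inf")
for p, q in itertools.product(range(6), repeat=2):  # p, q in [0, 5]
    try:
        res = sm.tsa.ARIMA(y, order=(p, d, q)).fit()
    except Exception:
        continue  # some orders may fail to converge
    if res.aic < best_aic:
        best_order, best_aic = (p, d, q), res.aic
print(best_order, best_aic)

Replacing res.aic with res.bic selects by BIC instead.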

Related

Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection

I am using the H2O autoencoder in R for anomaly detection. I don't have a training dataset, so I am using data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, is the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudo-code of the model.
# Deeplearning Model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE, activation = "Tanh", hidden = c(25,25,25), variable_importances = TRUE)
# Anomaly Detection Algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to select a subset of the 10 features before the data go into the deep learning model (with autoencoder=TRUE), in case some features are significantly associated with each other? Or do I not need to, since the data will go into an autoencoder, which already compresses the data and keeps only the most important information, so feature selection would be redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose action is anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to reduce the number of input features to the model. I can't say I know exactly what would happen during training, but collinear/associated features could be eliminated in the hidden layers, as you said. You could consider adjusting your hidden nodes and see how the model behaves: hidden = c(25,25,25) -> hidden = c(25,10,25), or hidden = c(15,15), or even hidden = c(7,5,7) for your small number of features.
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Would you average their errors?
That said, it sounds like Example A gives a stronger signal for anomalies in the ascending-order plot (where only a few rows have high error). I would sample different rows and check, as a domain expert, whether the errors make sense, i.e. whether the rows with higher errors do tend to look like anomalies.

Dynamic factor model: forecasting the factors

The statsmodels package offers a DynamicFactor object that, when fit, yields a statsmodels.tsa.statespace.dynamic_factor.DynamicFactorResultsWrapper object. That offers predict and simulate methods, but both forecast the original time series, not the underlying latent factor.
I've tried reconstructing the latent factor as an AR process, but have been unsuccessful. The coefficients in .ssm["transition"] and in the results' .summary() match, but when simulated as an AR process they don't give me back the factor in the results' .factors["filtered"]...
How can I generate future values of the latent factors?
One way to do this is:
import statsmodels.api as sm

m = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=1)
r = m.fit()
f = r.get_forecast(10)  # forecast 10 steps ahead
print(f.prediction_results.filtered_state)  # filtered estimates of the state vector (the latent factor) over the forecast period
Note that this is always a numpy array, so if your data has e.g. a Pandas date index, you would need to create the Pandas Series with that index yourself.
Another way to do this is to append np.nan values to the end of your dataset, and then use the typical .factors["filtered"] accessor. If you append n observations with np.nan, then the last n values of .factors["filtered"] will contain the forecasts of the factors.
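A minimal sketch of that second approach, assuming endog is a pandas DataFrame with a regular date index and reusing the fitted parameters r.params from the code above (the future-index construction and the array shape noted in the comments are assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm

n_forecast = 10
future_index = pd.date_range(endog.index[-1], periods=n_forecast + 1,
                             freq=endog.index.freq)[1:]
endog_ext = pd.concat([endog, pd.DataFrame(np.nan, index=future_index,
                                           columns=endog.columns)])

m_ext = sm.tsa.DynamicFactor(endog_ext, k_factors=1, factor_order=1)
r_ext = m_ext.smooth(r.params)  # reuse the parameters estimated on the original data

filtered = r_ext.factors["filtered"]         # array, assumed shape (k_factors, nobs_extended)
factor_forecast = filtered[0, -n_forecast:]  # last n_forecast values are the factor forecasts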

Clustering Scikit - Convert Business Data to machine learning input data

I am new to the world of data science and am trying to understand the concepts and the outcomes of ML. I have started off with the scikit-learn clustering examples. The scikit-learn library is well documented everywhere, but all the examples assume the data is already numerical.
Now, how does a data scientist convert business data into machine learning data? Just to give an example, here is some customer and sales data I have prepared.
The first picture shows the customer data with some parameters having an integer, string and boolean values
The second picture shows the historical sales data for those customers.
Now, how does such real business data get translated into input for a machine learning algorithm? How do I convert each field into a common format that the algorithm can understand?
Thanks
K
Technically, there are many ways, such as one-hot encoding, standardization, and log-transforming skewed attributes.
But the problem is not just of a technical nature.
Finding a way is not enough; you need to find one that works really well for your problem. This usually differs a lot from one problem to another. There is no "turnkey" solution.
Just to add to the comment by Anony-Mousse: you can convert the Won/Lost column to the values 1 and 0 (e.g. 1 for Won, 0 for Lost). For column Y, suppose you have 3 unique values in the column: you can convert A to [1, 0, 0], B to [0, 1, 0], and C to [0, 0, 1] (this is called one-hot encoding). Likewise for column Z, you can convert TRUE to 1 and FALSE to 0.
To merge two tables or Excel files together, you can use an additional library called pandas, which lets you merge two dataframes, e.g. df1.merge(df2, on='CustID', how='left'). Now you can feed your feature set to scikit-learn properly, as sketched below.
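A minimal sketch of those two steps with pandas (the column names CustID, Won/Lost, Y, and Z come from the question; the file names and everything else are illustrative assumptions):

import pandas as pd

customers = pd.read_excel("customers.xlsx")  # hypothetical file names
sales = pd.read_excel("sales.xlsx")

# Merge the two tables on the customer id
df = customers.merge(sales, on="CustID", how="left")

# Binary columns -> 0/1
df["Won"] = (df["Won/Lost"] == "Won").astype(int)
df["Z"] = df["Z"].map({True: 1, False: 0, "TRUE": 1, "FALSE": 0})

# Multi-valued categorical column -> one-hot encoding
df = pd.get_dummies(df, columns=["Y"])

# Numeric feature matrix that can go into scikit-learn
X = df.drop(columns=["CustID", "Won/Lost"]).values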

How to work with Grouped Responses (Event/Trial) in StatsModels Logistic Regression

I'm starting with StatsModels, coming from Minitab, and I can't find the option to do a binary logistic regression with the response in event/trial format.
Here's a very simple example of what I'm saying:
I have the data grouped by the explanatory variables, with the number of events (the number of ones) in one column and the number of trials (the number of zeros and ones) in another.
Do you know how I can tell this to StatsModels?
Thanks a lot!!
Logit and Probit are only defined for binary (Bernoulli) events, 0 or 1. (In the quasi-likelihood interpretation, the response can take any value in the interval [0, 1].)
However, GLM with family binomial can be used for either binary Bernoulli data or for Binomial counts.
See the description of endog (which is the statsmodels term for the response or dependent variable) in
http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html
"Binomial family models accept a 2d array with two columns. If supplied, each observation is expected to be [success, failure]."
An example is here:
http://www.statsmodels.org/dev/examples/notebooks/generated/glm.html
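A minimal sketch of that (success, failure) form, with made-up column names and toy counts just to show the shape of the call:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "x":      [1.0, 2.0, 3.0, 4.0],   # explanatory variable (toy values)
    "events": [3, 5, 8, 12],          # number of ones in each group
    "trials": [10, 10, 10, 15],       # number of zeros and ones in each group
})

# endog as a 2d array of [success, failure] counts
endog = np.column_stack([df["events"], df["trials"] - df["events"]])
exog = sm.add_constant(df[["x"]])

res = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(res.summary())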

What is the difference between a Confusion Matrix and Contingency Table?

I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the basic data from an m×n matrix A = {aij}, where aij is the number of data points that are members of class ci and elements of cluster kj.
But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
Wikipedia's definition:
In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
A confusion matrix should be clear: it basically tells how many actual results match the predicted results. For example, see this confusion matrix:
                 predicted c1   predicted c2
actual class c1        15              3
actual class c2         0              2
It tells that:
Column 1, row 1 means that the classifier has predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (correct predictions)
Column 2, row 1 tells that the classifier has predicted that 3 items belong to class c2, but they actually belong to class c1 (wrong predictions)
Column 1, row 2 means that none of the items that actually belong to class c2 have been predicted to belong to class c1 (this cell counts wrong predictions, and here there are none)
Column 2, row 2 tells that 2 items that belong to class c2 have been predicted to belong to class c2 (correct predictions)
Now see the formulas for accuracy and error rate in your book (Chapter 4, Section 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier on data with known results. The k-fold method, also mentioned in the book, is one of the ways to estimate that accuracy.
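If it helps, here is a minimal sketch of computing such a matrix with scikit-learn (not from the book; the labels are toy data):

from sklearn.metrics import confusion_matrix

y_true = ["c1", "c1", "c2", "c1", "c2"]  # actual classes (toy data)
y_pred = ["c1", "c2", "c2", "c1", "c2"]  # classifier predictions (toy data)

cm = confusion_matrix(y_true, y_pred, labels=["c1", "c2"])
print(cm)  # rows = actual class, columns = predicted class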
Now, for Contingency table:
Wikipedia's definition:
In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.
In data mining, contingency tables are used to show which items appeared together, for example in a transaction or in the shopping cart of a sales analysis. For example (this is the example from the book you mentioned):
        coffee   !coffee   total
tea        150        50     200
!tea       650       150     800
total      800       200    1000
It tells us that, out of 1000 responses (from a survey asking whether people like coffee, tea, both, or neither):
150 people like both tea and coffee
50 people like tea but do not like coffee
650 people do not like tea but like coffee
150 people like neither tea nor coffee
Contingency tables are used to find the Support and Confidence of association rules, basically to evaluate association rules (read Chapter 6, 6.7.1).
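For instance, for the rule {tea} -> {coffee}, the support and confidence can be read straight off the table above:

both = 150       # responses that like both tea and coffee
tea_total = 200  # all tea drinkers
n = 1000         # total responses

support = both / n             # 150/1000 = 0.15
confidence = both / tea_total  # 150/200  = 0.75
print(support, confidence)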
Now, the difference is that a confusion matrix is used to evaluate the performance of a classifier (it tells how accurate the classifier's predictions are), while a contingency table is used to evaluate association rules.
Now, after reading this answer, google a bit (always use Google while you are reading your book), read what is in the book, look at a few examples, and don't forget to solve a few of the exercises given in the book. You should then have a clear concept of both, and also of what to use in a given situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.
