R Studio Query define
I am using RStudio to build a decision tree with the C5.0 algorithm.
What does train.indices <- sample(1:nrow(iris), 100) do?
Thanks.
This is just choosing a random sample of 100 indices to be used for the training set, chosen from the number of indices available in your dataset. You could then get your training data using iris.train <- iris[train.indices, ].
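For readers more familiar with Python, the same idea can be sketched with the standard library. This is only an analogy and not part of the R workflow above: the row count of 150 matches the iris dataset, and the names n_rows, train_indices and test_indices are assumptions made for illustration. Note that Python indices start at 0, whereas R's start at 1.

```python
import random

n_rows = 150                 # iris has 150 rows
rng = random.Random(42)      # fixed seed so the split is repeatable

# Draw 100 distinct row indices without replacement,
# analogous to sample(1:nrow(iris), 100) in R.
train_indices = rng.sample(range(n_rows), 100)

# The rows that were not drawn can serve as the test set,
# analogous to iris[-train.indices, ] in R.
chosen = set(train_indices)
test_indices = [i for i in range(n_rows) if i not in chosen]
```

Sampling without replacement is what guarantees that no row appears twice in the training set and that the training and test sets do not overlap.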
I am using the H2O autoencoder in R for anomaly detection. I don't have a separate training dataset, so I train the model on data.hex and then use the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which the model calculates itself, would be the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudocode of the model.
# Deeplearning Model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE, activation = "Tanh", hidden = c(25,25,25), variable_importances = TRUE)
# Anomaly Detection Algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to select a subset of the 10 features before the data go into the deep learning model (with autoencoder=TRUE), in case some features are significantly associated with each other? Or is feature selection redundant, since the data will go into an autoencoder that compresses the data and already keeps only the most important information?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose action is anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to reduce the number of features fed into the model. I can't say exactly what would happen during training, but collinear/associated features could be eliminated in the hidden layers, as you said. You could try adjusting your hidden nodes to see how the model behaves, e.g. changing hidden = c(25,25,25) to hidden = c(25,10,25), hidden = c(15,15), or even hidden = c(7, 5, 7) for your few features.
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Will you average their error?
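One way to think about that aggregation question is sketched below in plain Python. The rows, the sender/receiver labels and the error values are made up for illustration, and the choice between mean and max is an assumption to experiment with, not something H2O prescribes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-row results for Example A:
# (sender, receiver, reconstruction error of that action row).
rows = [
    ("s1", "r1", 0.001),
    ("s1", "r1", 0.003),
    ("s2", "r1", 0.050),
    ("s2", "r1", 0.002),
]

# Collect all row-level errors belonging to the same sender-receiver pair.
errors_by_pair = defaultdict(list)
for sender, receiver, error in rows:
    errors_by_pair[(sender, receiver)].append(error)

# Two candidate aggregates: the mean flags pairs that are unusual overall,
# the max flags pairs with at least one very unusual action.
summary = {
    pair: {"mean": mean(errs), "max": max(errs)}
    for pair, errs in errors_by_pair.items()
}

# Pairs ordered from most to least anomalous by mean error.
ranked = sorted(summary, key=lambda pair: summary[pair]["mean"], reverse=True)
```

Averaging can hide a single extreme action inside an otherwise normal pair, which is why the max is kept alongside the mean here.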
But it sounds like Example A gives a stronger response for anomalies in the ascending-order plot (only a few rows have high error). I would sample different rows and check whether the errors make sense (as a domain expert), i.e. whether the rows with higher errors tend to look anomalous.
I still have the following questions after reading the H2O documentation. Can someone provide some explanation?
1. For stopping_tolerance = 0.001, let's use AUC as an example. If the current AUC is 0.8, does the AUC need to increase to 0.8 + 0.001, or to 0.8*(1+0.1%)?
2. For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". What exactly is the definition of an "iteration": each tree, each grid search, each K-fold cross-validation, or something else?
3. Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I use only one of them to make the grid search repeatable?
4. Is there any difference between putting 'stopping_metric', 'stopping_tolerance', 'stopping_rounds' in H2OGradientBoostingEstimator and putting them in search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
0.001 is the same as 0.1%, and the tolerance is relative. For AUC, since bigger is better, you will want to see a relative improvement of at least 0.1% (from 0.8, that is an increase to about 0.8008) after a specified number of scoring rounds.
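The relative check can be illustrated with a small Python sketch. This is a simplification and not H2O's implementation: the function name improved_enough and its arguments are hypothetical, and H2O's documentation describes the comparison as being based on a moving average over stopping_rounds scoring events, which is left out here.

```python
def improved_enough(reference, current, tolerance, bigger_is_better=True):
    """Return True if `current` beats `reference` by at least the relative tolerance."""
    if bigger_is_better:
        # e.g. AUC: the new value must reach reference * (1 + tolerance)
        return current >= reference * (1 + tolerance)
    # e.g. logloss: the new value must drop to reference * (1 - tolerance)
    return current <= reference * (1 - tolerance)


# AUC going from 0.8 to 0.801 is a relative gain of 0.125%, above the 0.1% tolerance.
auc_ok = improved_enough(0.8, 0.801, 0.001)
# AUC going from 0.8 to 0.8005 is a relative gain of about 0.06%, below the tolerance.
auc_too_small = improved_enough(0.8, 0.8005, 0.001)
```

With an absolute tolerance the 0.8005 case would be judged the same way, but the two readings diverge as the metric moves away from 1, which is why the distinction in the question matters.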
You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page, so let's answer this question with respect to individual models and not grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow or use my_model.plot() (for the Python API). For GBM and DRF an iteration is a tree (ntrees), but since different algorithms change different aspects between iterations, the more generic word "iteration" is used.
Did you test this out? What did you find when you did? Take a look at the scoring history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval (I would recommend trying to understand these parameters at the individual model level before you use grid search).
Yes. In one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
I need to take a list of lidar survey points (longitude, latitude, and elevation) that is larger (denser) than needed for terrain definition and decimate it based on a 2-dimensional grid. The idea would be to end up with points based on an NxN (e.g. 1 meter x 1 meter) grid using the longitude, latitude (x,y) values, thereby eliminating the surplus points. The goal is to determine what the elevation is at each point in the grid after the decimation, not to use elevation as part of the decimation rule itself.
An actual or precisely structured grid is not necessary or the goal here. I only use the grid terminology to best approximate what I envision as the remainder of the cloud of points after reducing it so that we always have a point within a certain radius (e.g. 1 meter). It is possible there is a better term to use than grid.
I would like to either code/script this myself in a scripting or programming language, if I can start with a decimation algorithm, or use a command line tool from an existing project that can do this, runs on Ubuntu, and can be called from our application as a system call. The approach should not require GUI-based software or tools. It needs to be part of an automated set of steps.
The data currently exists in a tab-separated values file, but I could load the data into a sqlite database file if a database/SQL-query-driven algorithm would be better/faster. The ideal scripting language would be Ruby or Python, but it can be any really, and if C/C++/C# libraries for this already exist then we could wrap those for our needs.
Ideas?
Update
Clarifying the use of the result of this decimated list: given a user's location (known by latitude and longitude), what is the closest point in the list and, in turn, its elevation? We can do this now of course, but we have more data than is necessary, so we just want to relax the density of the data so that we can still find the closest point within a tolerance distance (e.g. 1 meter) when using the decimated list instead of the full list. The latitude, longitude values in the list are in decimal GPS (e.g. 38.68616190027656, -121.11013105991036).
PART 1: decimated version
Load data
Load the data from the tabular file (change sep according to the separator you are using):
# installed as dependency
import pandas as pd
# https://github.com/daavoo/pyntcloud
from pyntcloud import PyntCloud
dense = PyntCloud(pd.read_csv("example.tsv",
                              sep='\t',
                              names=["x","y","z"]))
This is what the example I created looks like:
Build VoxelGrid
Assuming that the latitude and longitude in your file are in meters, you can generate a grid as follows:
grid_id = dense.add_structure("voxelgrid",
                              sizes=[1, 1, None],
                              bb_cuboid=False)
voxelgrid = dense.voxelgrids[grid_id]
This voxelgrid has a size of 1 along the x (longitude) and y (latitude) dimensions.
Build decimated version
decimated = dense.get_sample("voxelgrid_centroids", voxelgrid=grid_id)
decimated is a numpy (N,3) array. You can store it for later use in a SQL database, etc.
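If you would rather not depend on pyntcloud, the same kind of decimation can be sketched in plain Python. This is a minimal sketch and not pyntcloud's implementation: the function name decimate and the sample points are made up, and it assumes, as above, that x and y are already in meters (coordinates in decimal degrees would first need projecting to a metric system).

```python
import math
from collections import defaultdict


def decimate(points, cell_size=1.0):
    """Reduce (x, y, z) points to one centroid per occupied grid cell."""
    cells = defaultdict(list)
    for x, y, z in points:
        # Integer cell coordinates; floor keeps negative values in the right cell.
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y, z))

    decimated = []
    for members in cells.values():
        n = len(members)
        # Centroid of the points in the cell, so z becomes the mean elevation.
        decimated.append(tuple(sum(p[i] for p in members) / n for i in range(3)))
    return decimated


points = [(0.2, 0.3, 10.0), (0.8, 0.7, 12.0), (1.5, 0.5, 20.0), (-0.5, 0.5, 5.0)]
result = decimate(points)
```

Elevation plays no part in choosing the cell, only in the value stored for it, which matches the requirement that z is not used in the decimation rule.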
PART 2: Query
Option A: query voxelgrid
Get mean altitudes for each grid cell
You can now get a vector with the mean z (altitude) value for each cell in the grid:
z_mean = voxelgrid.get_feature_vector(mode="z_mean")
Query the grid with the users' locations:
import numpy as np
users_location = np.random.rand(100000, 2)
Add a column of zeros because query requires 3D (This doesn't affect the results):
users_location = np.c_[ users_location, np.zeros(users_location.shape[0]) ]
Find which cell each user is in:
users_cell = voxelgrid.query(users_location)
And finally, get the altitude corresponding to each user:
users_altitude = z_mean[users_cell]
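The idea behind Option A, looking up a precomputed mean elevation by grid cell, can also be sketched without any library. The function names build_mean_elevation and elevation_at and the sample points are hypothetical, and the sketch again assumes coordinates in meters.

```python
import math
from collections import defaultdict


def build_mean_elevation(points, cell_size=1.0):
    """Map each occupied (cell_x, cell_y) to the mean z of the points inside it."""
    sums = defaultdict(lambda: [0.0, 0])   # cell -> [sum of z, number of points]
    for x, y, z in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        sums[key][0] += z
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def elevation_at(mean_z, x, y, cell_size=1.0):
    """Return the mean elevation of the cell containing (x, y), or None if it is empty."""
    key = (math.floor(x / cell_size), math.floor(y / cell_size))
    return mean_z.get(key)


points = [(0.2, 0.3, 10.0), (0.8, 0.7, 12.0), (1.5, 0.5, 20.0)]
mean_z = build_mean_elevation(points)
```

Each lookup is a single dictionary access, so the cost per user does not grow with the number of survey points; the trade-off is that a user standing in an empty cell gets no answer, whereas a nearest-neighbor query always returns one.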
Option B: Use decimated version for query
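Build a KDTree of the x, y columns of decimated (z is left out so that elevation does not influence which point is nearest):
from scipy.spatial import cKDTree
kdt = cKDTree(decimated[:, :2])
Query the KDTree with user locations:
users_location = np.random.rand(100000, 2)
# workers=-1 uses all CPU cores (this parameter was called n_jobs in older SciPy versions)
distances, indices = kdt.query(users_location, k=1, workers=-1)
# Elevation of the nearest decimated point for each user
users_altitude = decimated[indices, 2]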
Extra: you can save and load the voxelgrid with pickle:
import pickle
pickle.dump(voxelgrid, open("voxelgrid.pkl", "wb"))
voxelgrid = pickle.load(open("voxelgrid.pkl", "rb"))
If you have a point cloud as a text file (.xyz), a simple and fast solution is to take a random sample from the file using shuf.
10 million points in an xyz file equals 10 million lines of text. You can run:
shuf -n 5000000 input.xyz -o out.xyz
This decimates the file to half its original size.
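If the sampling has to happen inside a script rather than through a system call, a comparable random sample can be sketched in Python with reservoir sampling, which reads the input once and keeps only k lines in memory. The function name reservoir_sample and the generated lines are made up for illustration, and this is not how shuf itself is implemented.

```python
import random


def reservoir_sample(lines, k, rng=None):
    """Return k lines chosen uniformly at random from an iterable of lines."""
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


# Stand-in for the lines of an .xyz file.
lines = [f"{i} {i} {i}\n" for i in range(1000)]
sample = reservoir_sample(lines, 500, random.Random(0))
```

For a real file, an open file handle can be passed as lines, since iterating over a file yields one line at a time.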
I'm trying to find out which statistical/data mining algorithms in R or R packages at CRAN/github/R-Forge exist that can handle large datasets either in parallel on 1 server or sequentially without running into out-of-memory issues or which work on several machines at once.
This is in order to evaluate whether I can easily port them to work with ff/ffbase, like ffbase::bigglm.ffdf.
I would like to split these up into 3 parts:
Algorithms that update or work on parameter estimates in parallel
Buckshot (https://github.com/lianos/buckshot)
lm.fit # Programming For Big Data (https://github.com/RBigData)
Algorithms that work sequentially (get data in R but only use 1 process and only 1 process updates the parameters)
bigglm (http://cran.r-project.org/web/packages/biglm/index.html)
Compound Poisson linear models (http://cran.r-project.org/web/packages/cplm/index.html)
Kmeans # biganalytics (http://cran.r-project.org/web/packages/biganalytics/index.html)
Work on part of the data
Distributed text processing (http://www.jstatsoft.org/v51/i05/paper)
And I would like to exclude simple parallelisation like optimising over a hyperparameter by e.g. crossvalidating.
Any other pointers to these kind of models/optimisers or algorithms? Maybe Bayesian? Maybe a package called RGraphlab (http://graphlab.org/)?
Have you read through the High Performance Computing Task View on CRAN?
It covers many of the points you mention and gives overviews of packages in those areas.
Random forests are trivial to run in parallel. It's one of the examples in the foreach vignette:
x <- matrix(runif(500), 100)
y <- gl(2, 50)
library(randomForest); library(foreach)
# Register a parallel backend first (e.g. with doParallel); otherwise %dopar% runs sequentially with a warning
rf <- foreach(ntree=rep(250, 4), .combine=combine,
              .packages='randomForest') %dopar% randomForest(x, y, ntree=ntree)
You can use this construct to split your forest over every core in your cluster.
I'd like to perform the expensive operation of cross product across two data sets in Hadoop using Java MapReduce.
For example, I have records from data set A and data set B, and I'd like each record in data set A to be matched up to each record in data set B in the output. I realize that the output size of this would be |A| * |B|, but want to do it anyways.
I see that Pig has CROSS but am unaware of how it is implemented at a high-level. Perhaps I will go take a look at the source code.
Not looking for any code, just want to know at a high-level how I should approach this problem.
I have done something similar when looking at document similarity (comparing each document to every other document) and ended up with a custom input format that splits up the two datasets and then ensures there is a 'split' for each pair of subsets of the data.
So your splits would look like this (each merging two sets of 10 records and outputting 100 records):
A(1-10) x B(1-10)
A(11-20) x B(1-10)
A(21-30) x B(1-10)
A(1-10) x B(11-20)
A(11-20) x B(11-20)
A(21-30) x B(11-20)
A(1-10) x B(21-30)
A(11-20) x B(21-30)
A(21-30) x B(21-30)
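The way those splits are enumerated can be sketched in a few lines of Python. This only illustrates the pairing of blocks and is not Hadoop InputFormat code; the function names make_blocks and cross_product_splits are hypothetical.

```python
from itertools import product


def make_blocks(n_records, block_size):
    """Return inclusive 1-based (start, end) ranges covering records 1..n_records."""
    return [
        (start, min(start + block_size - 1, n_records))
        for start in range(1, n_records + 1, block_size)
    ]


def cross_product_splits(n_a, n_b, block_size):
    """Pair every block of A with every block of B; each pair is one split."""
    a_blocks = make_blocks(n_a, block_size)
    b_blocks = make_blocks(n_b, block_size)
    # B is iterated in the outer position so the order matches the listing above.
    return [(a, b) for b, a in product(b_blocks, a_blocks)]


splits = cross_product_splits(30, 30, 10)
for (a_start, a_end), (b_start, b_end) in splits:
    print(f"A({a_start}-{a_end}) x B({b_start}-{b_end})")
```

Each split then only needs to hold one block of A and one block of B in memory to emit its share of the |A| * |B| output pairs.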
I don't remember how performant it was, but I had a document set on the order of thousands of documents to compare against one another (on an 8-node dev cluster), with millions of cross products calculated.
I could also make improvements to the algorithm, as some documents would never score well against others (if too much time separated them, for example), and generate better splits as a result.