How to run multiple grid searches with H2O

I have an H2O frame X from which I extract 3 different H2O frames X1, X2, X3 (based on some filtering condition).
I want to run a grid search on X1 to get Model1, a grid search on X2 to get Model2, etc.
What is the best way to do that?

The best way is to create 3 different R or Python scripts that each build the frame X_i and run GridSearch_i on it; you can then run all of the scripts in parallel. If you do everything in one script or interactively, the grid searches will run sequentially.
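For illustration, here is a minimal sketch of what one of the three scripts could look like in Python, assuming a GBM grid; the file path, filter column, feature names, hyper-parameter values, and sorting metric are placeholders to replace with your own:
import h2o
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
X = h2o.import_file("X.csv")             # hypothetical source of the full frame
X1 = X[X["segment"] == "a", :]           # hypothetical filtering condition

hyper_params = {"max_depth": [3, 5, 7], "ntrees": [50, 100]}
grid1 = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_params)
grid1.train(x=["f1", "f2"], y="target", training_frame=X1)
model1 = grid1.get_grid(sort_by="auc", decreasing=True).models[0]  # pick the metric that fits your problem
A second and third script would do the same with X2 and X3, and you can launch all three scripts at the same time.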

Related

I am doing simple A/B testing of two different user interfaces, and looking to count clicks to accomplish a task in both

I am trying to A/B test two different Web UIs for the same task: say one takes x1 amount of time and y1 clicks to perform the task, while the second takes x2 amount of time and y2 clicks. Unfortunately, there do not seem to be any simple Chrome extensions for counting clicks...
Does anyone have a suggestion?

Algorithm to suggest which chart to build using structure of the data

I am working on a project where we need to dynamically generate visuals based on data. The structure of the data changes depending on the query fired against it.
I have seen this feature in many BI tools, where the tool suggests a suitable visualization type based on the structure of the data.
I have tried creating my own algorithm based on the rules we generally use to pick a chart.
I want to know whether there are any established algorithms or rules that can help us build this.
I'm just trying to give you a head start here with what I could think of. Since you mention you have already tried writing your own rule-based algorithm, please show your work. As far as I know, the chart type can be determined from the nature of the (x, y) points you're trying to plot (a small sketch of these rules follows the list below):
For a single x, if there are many corresponding y values, go with a scatter plot.
For a single x, if there is only one corresponding y value:
If x takes integer values, go with a line chart.
If x takes string values instead, go with a bar chart.
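To make the rules concrete, here is a minimal sketch of them as a function, assuming the data arrives as a pandas DataFrame with one x column and one y column; the column names and the fallback value are illustrative:
import pandas as pd

def suggest_chart(df: pd.DataFrame, x: str = "x", y: str = "y") -> str:
    # Many distinct y values for at least one x -> scatter plot
    y_per_x = df.groupby(x)[y].nunique()
    if (y_per_x > 1).any():
        return "scatter"
    # Exactly one y per x: decide by the type of x
    if pd.api.types.is_integer_dtype(df[x]):
        return "line"
    if pd.api.types.is_string_dtype(df[x]):
        return "bar"
    return "table"  # fallback when no rule applies

print(suggest_chart(pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.5, 6.1]})))  # "line"
In a real system you would extend this with more rules (dates on x, counts versus continuous y, number of series, and so on).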

Algorithm or command line tool to decimate point cloud of terrain points?

I need to take a denser-than-needed list of lidar survey points (longitude, latitude, and elevation) for terrain definition and decimate it based on a 2-dimensional grid. The idea is to end up with points on an N x N (e.g. 1 meter x 1 meter) grid using the longitude, latitude (x, y) values, thereby eliminating the points that are more than needed. The goal is to determine the elevation at each point of the grid after decimation, not to use elevation as part of the decimation rule itself.
An actual or precisely structured grid is not necessary and is not the goal here; I only use the grid terminology to best approximate what I envision as the remaining cloud of points after reducing it so that we always have a point within a certain radius (e.g. 1 meter). There may be a better term than grid.
I would like to either code/script this myself in a scripting or programming language if I can start from a decimation algorithm, or use a command line tool from an existing project that can run on Ubuntu and be called from our application as a system call. The approach should not require a GUI-based tool, because it needs to be part of an automated set of steps.
The data currently lives in a tab-separated values file, but I could load it into a SQLite database file if a database/SQL-query-driven algorithm would be better/faster. The ideal scripting language would be Ruby or Python, but it can really be any language, and if C/C++/C# libraries for this already exist we could wrap them for our needs.
Ideas?
Update
Clarifying how the decimated list will be used: given a user's location (known by latitude and longitude), what is the closest point in the list and, in turn, its elevation? We can do this now of course, but we have more data than necessary, so we just want to relax the density of the data so that we can still find the closest point within a tolerance distance (e.g. 1 meter) using the decimated list instead of the full list. The latitude, longitude values in the list are in decimal GPS form (e.g. 38.68616190027656, -121.11013105991036).
PART 1: decimated version
Load data
Load the data from the tabular file (change sep according to the separator you are using):
# installed as dependency
import pandas as pd
# https://github.com/daavoo/pyntcloud
from pyntcloud import PyntCloud
dense = PyntCloud(pd.read_csv("example.tsv",
                              sep='\t',
                              names=["x", "y", "z"]))
Build VoxelGrid
Assuming that the latitude and longitude values in your file are in meters, you can generate a grid as follows:
grid_id = dense.add_structure("voxelgrid",
                              sizes=[1, 1, None],
                              bb_cuboid=False)
voxelgrid = dense.voxelgrids[grid_id]
This voxelgrid has a cell size of 1 along the x (longitude) and y (latitude) dimensions.
Build decimated version
decimated = dense.get_sample("voxelgrid_centroids", voxelgrid=grid_id)
decimated is a numpy (N,3) array. You can store it for later use in a SQL database, etc.
PART 2: Query
Option A: query voxelgrid
Get mean altitudes for each grid cell
You can now get a vector with the mean z (altitude) value for each cell in the grid:
z_mean = voxelgrid.get_feature_vector(mode="z_mean")
Query the grid with the users' locations:
import numpy as np

users_location = np.random.rand(100000, 2)  # random placeholder locations for demonstration
Add a column of zeros because query requires 3D points (this doesn't affect the results):
users_location = np.c_[users_location, np.zeros(users_location.shape[0])]
Find out which cell each user is in:
users_cell = voxelgrid.query(users_location)
And finally, get the altitude corresponding to each user:
users_altitude = z_mean[users_cell]
Option B: Use decimated version for query
Build a KDTree of decimated:
from scipy.spatial import cKDTree
kdt = cKDTree(decimated)
Query the KDTree with user locations:
users_location = np.random.rand(100000, 2)  # random placeholder locations for demonstration
users_location = np.c_[users_location, np.zeros(users_location.shape[0])]
distances, indices = kdt.query(users_location, k=1, n_jobs=-1)  # in newer SciPy this keyword is `workers`
As an extra, you can save and load the voxelgrid with pickle:
import pickle

pickle.dump(voxelgrid, open("voxelgrid.pkl", "wb"))
voxelgrid = pickle.load(open("voxelgrid.pkl", "rb"))
If you have the point cloud as a text file (.xyz), a simple and fast alternative is to take a random sample of its lines using shuf.
10 million points in an xyz-file equal 10 million lines of text. To keep 5 million of them you can run:
shuf -n 5000000 input.xyz -o out.xyz
This decimates the file to half its original size (by uniform random sampling rather than on a grid).

Using multiple GPUs for one dataset at once instead of splitting the dataset in TensorFlow

I know that when training a DNN, the usual way to use multiple GPUs is to split the dataset and assign each split to a separate GPU.
However, is there a way to use multiple GPUs for faster computation on the undivided, whole dataset? I mean, when a GPU is used to train a network, the matrix multiplications are parallelized inside that single GPU. Can I make such a matrix multiplication faster by using multiple GPUs at once?
For example, suppose I have only one picture in my dataset. Because I don't have multiple pictures to split and distribute across GPUs, I want all GPUs to contribute to the computation for this one picture.
Is this possible in TensorFlow? I've searched the Internet but found nothing, because it is a very rare case.
You're trying to do something like model parallelism, which is a little hacky to do in TensorFlow.
Here is one way to parallelize a matmul across two GPU cards. Let A x B = C, where A, B, and C are matrices with shapes (m, k), (k, n), and (m, n).
You can:
split A into A1 and A2, each with shape (m/2, k);
place A1 on GPU1 and A2 on GPU2;
replicate B on both GPUs;
compute A1 x B = C1 and A2 x B = C2 concurrently;
concatenate C1 and C2 to get C.
TensorFlow provides operators such as split and concat for this, and since B must be replicated on both GPUs, you can place B on the parameter server.
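As a concrete illustration, here is a minimal sketch of that idea in TensorFlow 2 eager mode, assuming two visible devices named '/GPU:0' and '/GPU:1'; the matrix sizes are placeholders:
import tensorflow as tf

m, k, n = 1024, 512, 256                   # illustrative sizes
A = tf.random.normal([m, k])
B = tf.random.normal([k, n])

# Split A along its rows and compute each half on a different GPU
A1, A2 = tf.split(A, num_or_size_splits=2, axis=0)
with tf.device('/GPU:0'):
    C1 = tf.matmul(A1, B)
with tf.device('/GPU:1'):
    C2 = tf.matmul(A2, B)

C = tf.concat([C1, C2], axis=0)            # shape (m, n)
Whether this is actually faster than a single-GPU matmul depends on the matrix sizes and the cost of moving data between devices.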

What parallel algorithms exist in R that work on large data

I'm trying to find out which statistical/data-mining algorithms in R (in packages on CRAN, GitHub, or R-Forge) can handle large datasets, either in parallel on one server, sequentially without running into out-of-memory issues, or across several machines at once.
This is in order to evaluate whether I can easily port them to work with ff/ffbase, like ffbase::bigglm.ffdf.
I would like to split these up into 3 parts:
Algorithms that update or work on parameter estimates in parallel
Buckshot (https://github.com/lianos/buckshot)
lm.fit # Programming For Big Data (https://github.com/RBigData)
Algorithms that work sequentially (get data in R but only use 1 process and only 1 process updates the parameters)
bigglm (http://cran.r-project.org/web/packages/biglm/index.html)
Compound Poisson linear models (http://cran.r-project.org/web/packages/cplm/index.html)
Kmeans # biganalytics (http://cran.r-project.org/web/packages/biganalytics/index.html)
Work on part of the data
Distributed text processing (http://www.jstatsoft.org/v51/i05/paper)
And I would like to exclude simple parallelisation, such as optimising over a hyperparameter by e.g. cross-validation.
Any other pointers to these kind of models/optimisers or algorithms? Maybe Bayesian? Maybe a package called RGraphlab (http://graphlab.org/)?
Have you read through the High Performance Computing Task View on CRAN?
It covers many of the points you mention and gives overviews of packages in those areas.
Random forests are trivial to run in parallel. It's one of the examples in the foreach vignette:
x <- matrix(runif(500), 100)
y <- gl(2, 50)
library(randomForest); library(foreach)
library(doParallel); registerDoParallel(cores = 4)  # %dopar% needs a registered parallel backend
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)
You can use this construct to split your forest over every core in your cluster.
