I am new to H2O; I installed H2O Driverless AI with an evaluation license. I can successfully perform visualisation and classification model prediction, but I'm wondering how to start with clustering, because I don't find any option for unsupervised learning or clustering techniques. Where should I perform clustering in Driverless AI? Is clustering available in Driverless AI or not?
Thanks in Advance.
As of version 1.2.0, unsupervised clustering is not supported in DAI; DAI is designed to solve supervised learning problems.
Here are the currently supported problem types (please review the documentation for changes in future releases at http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/release_notes.html):
Problem types supported:
Regression (continuous target variable; e.g. age, income, house price, loss prediction, time-series forecasting)
Binary classification (0/1 or “N”/”Y”; e.g. fraud prediction, churn prediction, failure prediction)
Multinomial classification (0/1/2/3 or “A”/”B”/”C”/”D” for categorical target variables; e.g. prediction of membership type, next action, product recommendation)
I am using a sentiment analysis API and want to know how the bias that gets in through the training data, and other biases, can be quantified. Any help would be appreciated.
There are several tools developed to deal with this:
Fair Learn https://fairlearn.github.io/
Interpretability Toolkit https://learn.microsoft.com/en-us/azure/machine-learning/how-to-machine-learning-interpretability
In Fairlearn you can see how biased an ML model is after it has been trained on the data set, and choose a model that is perhaps less accurate but performs better with respect to bias. Explainable ML models provide the correlations of inputs with outputs, and combined with Fairlearn this can give an idea of the health of the ML model.
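To make "seeing how biased a model is" concrete, here is a minimal sketch in plain Python of a disaggregated metric, accuracy computed per sensitive group; this is the kind of calculation Fairlearn's MetricFrame automates (the function name and data here are illustrative, not Fairlearn's API):

```python
# Sketch: accuracy disaggregated by a sensitive feature -- large gaps
# between groups are one simple, quantifiable signal of model bias.
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, sensitive):
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, sensitive):
        totals[g] += 1
        hits[g] += (t == p)
    return {g: hits[g] / totals[g] for g in totals}
```

Comparing the per-group numbers (or their max-min gap) gives a first quantitative handle on bias before reaching for a mitigation algorithm.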
Are all algorithms that are available in H2O applicable in AutoML? For example, will H2O AutoML run algorithms like time series, Cox Proportional Hazards (CoxPH), or Naive Bayes?
As mentioned in the docs, during H2O's AutoML
all appropriate H2O algorithms will be used if the search stopping criteria allows and if the include_algos option is not specified
If you would like to specify certain algos, you can specify a list or vector in the include_algos argument (see here).
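For example, in the H2O Python client this looks roughly as follows; the file path and target column are placeholders, and the H2O imports are kept inside the function so the sketch reads standalone:

```python
# Restricting H2O AutoML to specific algorithms via include_algos.
# Valid algorithm names per the H2O docs include:
# "DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble".
include_algos = ["GBM", "XGBoost", "StackedEnsemble"]

def run_automl(train_path, target):
    # H2O imports are local so the sketch can be read without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path)
    aml = H2OAutoML(max_models=10, seed=1, include_algos=include_algos)
    aml.train(y=target, training_frame=train)
    return aml.leaderboard
```

Note that algorithms not in H2O's AutoML roster (e.g. CoxPH) cannot be added this way; include_algos only narrows the set AutoML already supports.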
I am teaching myself machine learning through the book "Introduction to Machine Learning with Python: A Guide for Data Scientists", and I am currently at the k-Nearest Neighbors section. The authors mention that this algorithm is rarely used in real life due to "prediction being slow and its inability to handle many features". However, k-Nearest Neighbors is mentioned as one of the most popular algorithms for data scientists in many articles. So, could somebody explain this to me?
K-nearest neighbors has many applications in machine learning because the problem it solves is fundamental and appears inside many other solutions. For example, in data-representation methods such as t-SNE, running the algorithm requires computing the k nearest neighbors of each point based on the predefined perplexity.
Also, you can find more applications of kNN here, and its applications in industry on the last page of this article.
The KNN algorithm is one of the most popular algorithms for text categorization or text mining.
Another interesting application is the evaluation of forest inventories and the estimation of forest variables. In these applications, satellite imagery is used with the aim of mapping land cover and land use with a few discrete classes. Other applications of the k-NN method in agriculture include climate forecasting and estimating soil water parameters.
Some of the applications of KNN in finance are mentioned below:
Forecasting the stock market: predicting the price of a stock on the basis of company performance measures and economic data.
Currency exchange rates
Bank bankruptcies
Understanding and managing financial risk
Trading futures
Credit rating
Loan management
Bank customer profiling
Money laundering analyses
Medicine:
Predicting whether a patient hospitalized due to a heart attack will have a second heart attack, based on demographic, diet and clinical measurements for that patient.
Estimating the amount of glucose in the blood of a diabetic person from the infrared absorption spectrum of that person’s blood.
Identifying the risk factors for prostate cancer, based on clinical and demographic variables.
The KNN algorithm has also been applied to the analysis of micro-array gene expression data, where it has been coupled with genetic algorithms used as a search tool. Other applications include the prediction of solvent accessibility in protein molecules, the detection of intrusions in computer systems, and the management of databases of moving objects such as computers with wireless connections.
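To make the book's trade-off concrete, here is a minimal k-NN classifier in plain Python (the function and data names are illustrative). The full scan over the training set at prediction time is exactly why prediction is slow on large data:

```python
# A minimal k-nearest-neighbors classifier, written from scratch.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature vector."""
    # Compute the Euclidean distance from the query to EVERY training
    # point -- this exhaustive scan is the source of slow predictions.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Training is free (the model just stores the data), which is part of kNN's appeal for quick baselines, but every prediction costs a pass over all stored points, and in high dimensions the distances become less informative.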
I see that H2O DAI automatically picks an optimized algorithm for the dataset. I hear that the contents of MLI (machine learning interpretation) on other platforms (like SAS Viya) depend on the algorithm used. For example, LOCO is not available for GBM, etc. (Of course, this is a purely hypothetical example.)
Is it the same with H2O Driverless AI? Or does it always show the same MLI menus regardless of the algorithm it used?
Currently, MLI will always display the same dashboard for any DAI algorithm, with the following exceptions: Shapley plots are not currently supported for RuleFit and TensorFlow models; multinomial models currently only show Shapley and feature importance (global and local); and time-series experiments are not yet supported for MLI. What this means is you can expect to always see: K-LIME/LIME-SUP, Surrogate Decision Tree, Partial Dependence Plot and Individual Conditional Expectation. Note this may change in the future; for the most up-to-date details please see the documentation.
I have a very large dataset (500 million documents) and want to cluster all documents according to their content.
What would be the best way to approach this?
I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any cluster algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
Combination of k-means and agglomerative clustering (bottom-up)
Topic modeling
Co-clustering
But I can't tell how to apply these to your dataset. It's big - good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package (used in R) for text mining via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a support vector machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
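The starting point in such a pipeline is a document-term matrix. Here is a minimal count-based version in plain Python (whitespace tokenization only), as an illustration of the structure the tm package builds before any clustering happens:

```python
# Build a document-term count matrix: one row per document,
# one column per vocabulary word.
def doc_term_matrix(docs):
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.lower().split():
            row[index[w]] += 1
        rows.append(row)
    return vocab, rows
```

At 500 million documents this dense representation is of course infeasible; real systems keep the matrix sparse and usually reweight counts (e.g. TF-IDF) before clustering.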
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1.
and k-means variants that repeatedly draw a random sample.
Sculley, D. (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
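The sampling idea behind Sculley's web-scale k-means can be sketched in a few lines of plain Python. This is an illustrative toy (dense Euclidean points instead of sparse document vectors, and simplified bookkeeping), not the paper's exact algorithm:

```python
# Mini-batch k-means sketch: each step draws a small random sample and
# nudges the nearest centroid toward each sampled point, so the full
# dataset never needs to be held in memory at once.
import math
import random

def minibatch_kmeans(data, k, batch_size=10, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]
    counts = [0] * k  # per-centroid update counts -> decaying step size
    for _ in range(iters):
        batch = rng.sample(data, min(batch_size, len(data)))
        for p in batch:
            # Assign the point to its nearest centroid.
            j = min(range(k), key=lambda c: math.dist(centroids[c], p))
            counts[j] += 1
            eta = 1.0 / counts[j]  # smaller steps as a centroid matures
            centroids[j] = [(1 - eta) * c + eta * x
                            for c, x in zip(centroids[j], p)]
    return centroids
```

For documents, each batch would be fetched from the store (e.g. a scroll over an Elasticsearch index), vectorized, and discarded after the update.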
But in the end, it's still useless old k-means: a good quantization approach, but not very robust to noise, and not capable of handling clusters of different sizes, non-convex shapes, hierarchy (e.g. sports, with baseball inside it), etc. It's a signal-processing technique, not a data-organization technique.
So the practical impact of all these is zero. Yes, they can run k-means on insanely large data - but if you can't make sense of the result, why would you do so?