H2O Driverless AI deployment with Snowflake clarification?

I see Snowflake has a Partner Connect option through which I could activate H2O Driverless AI and access Snowflake from there.
I also see that H2O Driverless AI can be deployed independently on any cloud cluster, with us managing our own cluster instances.
How do these two deployments differ?
With H2O Driverless AI activated through Snowflake Partner Connect, we don't need to manage the Driverless AI instances ourselves, so are we charged accordingly for that?
With H2O Driverless AI deployed on our own cloud cluster instances, is it the licensed version of Driverless AI that we deploy and manage? Also, can we deploy H2O-3 (H2O Flow) on these instances to build models with the H2O Python packages, since I don't see any notebooks in Driverless AI for developing from the ground up?

Thank you for the question, Roshan,
Snowflake Partner Connect enables a 14-day trial of Driverless AI to be started directly from the Snowflake UI.
When Driverless AI is deployed independently or in the H2O.ai Managed Cloud, it can connect to Snowflake using the Snowflake connector or JDBC.
Partner Connect is only for trials and labs that demonstrate the capabilities of the products. Customers would then pick a deployment (on-prem, cloud, managed cloud, etc.) that aligns with their deployment requirements.
Both Driverless AI and H2O-3 can connect to Snowflake to access data for training, using the Snowflake Connector (https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/connectors/snowflake.html) or JDBC (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/getting-data-into-h2o.html#jdbc-databases).
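For example, once the Snowflake JDBC driver is on the H2O-3 classpath, a table can be pulled into an H2OFrame roughly as sketched below; the account, warehouse, table, and credential values are placeholders.

    # Sketch: pulling Snowflake data into H2O-3 over JDBC.
    # Assumes an H2O cluster started with the Snowflake JDBC driver on its
    # classpath, e.g.  java -cp h2o.jar:snowflake-jdbc.jar water.H2OApp
    import h2o

    h2o.init()  # or connect to an existing cluster

    connection_url = (
        "jdbc:snowflake://<account>.snowflakecomputing.com/"
        "?warehouse=MY_WH&db=MY_DB&schema=PUBLIC"
    )

    frame = h2o.import_sql_select(
        connection_url=connection_url,
        select_query="SELECT * FROM MY_TABLE",
        username="my_user",
        password="<password>",
    )
    frame.describe()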
For inferencing (scoring), models can be used with Snowflake as either external functions or user-defined functions (https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/snowflake-integration.html).
The functionality in Driverless AI can also be used via an API; here is a link that describes how to use the Python client and notebooks (https://docs.h2o.ai/driverless-ai/1-10-lts/docs/userguide/python_client.html).
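As a rough sketch of that workflow with the Python client (install with pip install driverlessai); the address, credentials, dataset, and column names below are placeholders for your own instance:

    # Sketch: driving Driverless AI from a notebook via its Python client.
    import driverlessai

    dai = driverlessai.Client(
        address="http://my-dai-host:12345",
        username="my_user",
        password="<password>",
    )

    # Register a training dataset, then launch an experiment and wait for it.
    train = dai.datasets.create(data="train.csv", data_source="upload")
    experiment = dai.experiments.create(
        train_dataset=train,
        target_column="label",
        task="classification",
    )
    print(experiment.metrics())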
Please reach out if you have any questions.

Related

Azure Synapse with Databricks framework for a modern data warehouse

I am working on Databricks. I have curated data in the form of facts and dims. This data is consumed for Power BI reporting via Synapse. I am not sure what the use of Synapse is if the data is already prepared in the Databricks layer. Why are we using Synapse in this framework?
"Why are we using Synapse in this framework?"
Azure Synapse is an analytics service for data warehousing and large-scale data. With Azure Synapse we can combine Azure services such as Power BI, Machine Learning, and others.
It offers a number of connectors that make it easier to transfer a sizable volume of data between Azure Databricks and Azure Synapse, and it gives Azure Databricks users a mechanism to connect to Azure Synapse.
Additionally, Azure Synapse offers SQL pools for compute and data warehousing.
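For instance, a curated DataFrame in Azure Databricks can be written to a Synapse dedicated SQL pool with the built-in Synapse connector, roughly as sketched below; the workspace, storage account, staging container, and table names are placeholders, and curated_df stands for an existing DataFrame.

    # Sketch: pushing curated Databricks data into a Synapse dedicated SQL pool.
    # Uses the Azure Synapse connector bundled with Databricks; credentials and
    # storage access are simplified here and should follow your own setup.
    (
        curated_df.write
        .format("com.databricks.spark.sqldw")
        .option("url",
                "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
                "database=mydw;user=loader;password=<password>;encrypt=true")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "dbo.fact_sales")
        .option("tempDir", "abfss://staging@mystorage.dfs.core.windows.net/synapse")
        .mode("overwrite")
        .save()
    )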

Use Kafka Connect with Azure Event Hubs and/or AWS Kinesis/MSK to send data to ElasticSearch

Has anyone used Kafka connect with one or more of the following cloud streaming services?
AWS Kinesis
AWS MSK
Azure Event Hubs
FWIW, we're looking to send data from Kafka to ElasticSearch without needing to use an additional component such as Logstash or Filebeat.
At first I thought we could only do this using the Confluent platform, but then read that Kafka Connect is just an open-source Apache project. The only need for Confluent would be if we want/need to use one of the proprietary connectors, but given that the ElasticSearch Sink connector is the only one we need (at least for now) and it is a community connector - see here (and here for licensing info) - we might be able to do this using one of the AWS/Azure streaming services, assuming this is supported. (Note: AWS or Azure represents a path of less resistance, as the company I work for already has vendor relationships with both AWS & Microsoft. I'm not saying we won't use Confluent or migrate to it at some stage, but for now Azure/AWS is going to be easier to get across the line.)
I found a Microsoft document that implies we can use Azure Event Hubs with Kafka Connect, even though AEH is a bit different to open source Kafka... not sure about AWS Kinesis or MSK - I assume MSK would be fine, but not sure... any guidance/blogs/articles would be much appreciated....
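For what it's worth, my understanding is that we would run our own Kafka Connect worker pointed at the Event Hubs (or MSK) Kafka endpoint and register the community Elasticsearch sink against its REST API, roughly as in the sketch below; hostnames, topic, and connection settings are placeholders, so happy to be corrected.

    # Sketch: registering the community Elasticsearch sink connector on a
    # self-managed Kafka Connect worker via the Connect REST API.
    import requests

    connector = {
        "name": "es-sink",
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": "events",
            "connection.url": "https://my-es-cluster:9200",
            "key.ignore": "true",
            "schema.ignore": "true",
        },
    }

    resp = requests.post("http://my-connect-worker:8083/connectors", json=connector)
    resp.raise_for_status()
    print(resp.json())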
Cheers,

Can we download DSE (i.e. DataStax Enterprise) graph DB and store it in a file?

I have a requirement where I want to download the DSE (DataStax Enterprise) graph DB from the server and store it on the client in the client's cache. This will be a small graph. Later, on the client, I want to be able to read this graph from the local file/cache and use it to serve requests (do faster lookups/traversals). Is this possible? How?
DSE Graph is designed as a distributed, highly available, server-based graph database. It is not meant to be an embedded, client-side graph database.

General starter Hadoop/Spark fiware-cosmos questions

I have some general questions about fiware-cosmos; apologies if they are basic, but I'm trying to understand the architecture and use of Cosmos.
I saw that you are planning to integrate Apache Spark into Cosmos? Do you have a roadmap or date for that to happen? What happens if I want to use Spark now?
What Hadoop service sources can be used? I think I read that Cosmos supports Cloudera CDH services and raw Hadoop server services? What about Hortonworks or MapR?
I know that non-standard file systems can be used with Hadoop, for instance MapR-FS; are options like this possible with Cosmos?
I also read that Cosmos "sits" on top of FIWARE, and so Hadoop as a service (HaaS) can be used and Hadoop clusters generated using OpenStack? However, I saw that people are referring to a shared FIWARE cloud? Does FIWARE run as a remote cloud? Can a local cloud be used on a customer site?
Is Cosmos the only Apache Hadoop/Spark solution on fiware.org?
Finally, if Cloudera CDH can be used with Cosmos, how does Cloudera Manager fit into the mix? Can it still be used?
Sorry for all of the questions :)
Cosmos is the name of the Global Instance of the Big Data GE in FIWARE Lab. It is a shared Hadoop instance already deployed in the cloud, ready to be used by FIWARE users.
In fact, there are two instances: the "old" one, which serves a fairly old version of the Hadoop stack and whose entry point is cosmos.lab.fiware.org, and the "new" one, which is a pair of Hadoop clusters, one for data storage and another for data analysis; the entry points are storing.cosmos.lab.fiware.org and computing.cosmos.lab.fiware.org.
Of course, you can deploy any other Hadoop (or even Spark) instance on your own in the FIWARE cloud (or any other cloud, such as Amazon's).
Regarding Spark, although it was initially in our plans to deploy it in FIWARE Lab (which is why it appears in the roadmap), it is not clear nowadays whether it will be deployed.
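As an illustration of using the storage cluster, assuming it exposes the standard WebHDFS/HttpFS REST API (the 14000 port and the authentication details are assumptions; check the Cosmos documentation for the exact endpoint and token handling), listing a user's HDFS home directory would look roughly like this:

    # Sketch: listing an HDFS home directory on the Cosmos storage cluster
    # through WebHDFS/HttpFS. Port and auth are assumptions, not documented here.
    import requests

    COSMOS_USER = "myuser"
    url = "http://storing.cosmos.lab.fiware.org:14000/webhdfs/v1/user/" + COSMOS_USER
    resp = requests.get(url, params={"op": "LISTSTATUS", "user.name": COSMOS_USER})
    resp.raise_for_status()
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["pathSuffix"], entry["type"])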

Hadoop remote data sources connectivity

What are the available Hadoop remote data source connectivity options?
I know about drivers for MongoDB, MySQL, and Vertica connectivity, but my question is: what other data sources have drivers for Hadoop connectivity?
These are the ones I am aware of:
Oracle
ArcGIS Geodatabase
Teradata
Microsoft SQL Server 2008 R2 Parallel Data Warehouse (PDW)
PostgreSQL
IBM InfoSphere warehouse
Couchbase
Netezza
Tresata
But I am still wondering about the intent of this question. Every data source fits a particular use case: for example, Couchbase for document storage, Tresata for financial data, and so on. Are you going to choose your store based on connector availability? I don't think so.
Your list will be too long to be useful.
Just one reference: Cascading gives you access to almost anything you want to access. What's more, you're not limited to Java; for example, the Scalding component provides a very good framework for Scala programmers.
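As a more general illustration of driver-based connectivity (Spark rather than Cascading), any source with a JDBC driver can be pulled into the cluster; a sketch assuming PySpark with the PostgreSQL driver on the classpath, and placeholder host, table, and credential values:

    # Sketch: reading a remote JDBC source into Spark on Hadoop and landing it
    # in HDFS. The driver jar must be available to Spark (e.g. via --jars).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales")
        .option("dbtable", "public.orders")
        .option("user", "reporting")
        .option("password", "<password>")
        .option("driver", "org.postgresql.Driver")
        .load()
    )
    df.write.parquet("hdfs:///warehouse/orders")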
