Using the dask labextension to connect to a remote cluster

I'm interested in running a Dask cluster on EMR and interacting with it from inside a Jupyter Lab notebook running on a separate EC2 instance (i.e. an EC2 instance that is not part of the cluster and not managed by EMR).
The Dask documentation points to dask-labextension as the tool of choice for this use case. dask-labextension relies on a YAML config file (and/or some environment vars) to understand how to talk to the cluster. However, as far as I can tell, this configuration can only be set to point to a local Dask cluster. In other words, you must be in a Jupyter Lab notebook running on an instance within the cluster (presumably on the master instance?) in order to use this extension.
Is my read correct? Is it not currently possible to use dask-labextension with an external Dask cluster?

Dask Labextension can talk to any Dask cluster that is visible from where your web client is running. If you can connect to a dashboard in a web browser then you can copy that same address to the Dask-Labextension search bar and it will connect.
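For example, assuming the scheduler runs on the EMR master and serves its dashboard on the default port 8787, and that the master's security group allows inbound traffic on that port from the notebook instance, a quick check from the Jupyter Lab EC2 instance might look like this (the host name is a placeholder):
# Hypothetical EMR master address; 8787 is the default Dask dashboard port.
curl -I http://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:8787/status
# If this returns HTTP 200, paste the same http://...:8787 address into the
# Dask-Labextension search bar in Jupyter Lab and it should connect.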

Related

Arangodump between two different AWS ec2 clusters

I have created a graph database in ArangoDB on a 5-machine AWS cluster. I do not have enough space in that cluster to store the dump, so I would like to dump the database to an AWS instance in a different cluster. I have the key files to connect to the machines. How can I do this using arangodump? Thanks.
Do I understand correctly that you're using DC/OS clusters on AWS?
The problem with arangoimp is that it doesn't know how to authenticate with the DC/OS proxy, and thus can't reach the routes it would need to import into ArangoDB.
The problem is similar to Running Arango Shell on DC/OS cluster: you want to use sshuttle, as lalitlogical describes, to forward the ArangoDB server port (usually 8529) to your target environment.
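A rough sketch of that approach, run from the instance where the dump should land, with made-up addresses, credentials, and names (the arangodump options shown are the standard ones, but check your version's help output):
# Route traffic for the database cluster's private subnet through a reachable
# jump host (subnet, user, and host are hypothetical).
sshuttle -r ec2-user@jump-host.example.com 10.0.0.0/16
# Dump the database straight over the tunnel to local storage on this instance.
arangodump \
  --server.endpoint tcp://10.0.0.5:8529 \
  --server.username root \
  --server.database mygraph \
  --output-directory /mnt/arangodump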

Fabric scripts to install Hadoop on a cluster of machines?

I am beginning to install Hadoop on a cluster. I have ssh access to these machines and I have already installed Fabric on them. I was wondering if someone has already written a fabfile to install and deploy Hadoop to a cluster easily.
I found this project [0], but it is written for deploying over AWS instances. I was looking for something where I can just fill in the IPs of my machines and then execute a set of fab commands to bring up the cluster.
[0] http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#ec2-deployment-with-fabric-script
I'm AlexJF, the author of the scripts you linked.
The scripts you reference can also be used outside EC2. You just need to configure, as you requested, the list of hosts and configurations on the top of the fabfile.py. Be sure to set EC2 = False (which just happens to be the default).
You'll then have several useful commands available to you.
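As a hedged sketch of that workflow (only EC2 = False comes from the answer above; the host variable and task name below are placeholders, so check the fabfile for the real ones):
# At the top of fabfile.py: point the scripts at your own machines.
#   EC2 = False
#   HOSTS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]   # illustrative variable name
# Then run the provided fab tasks against those hosts, e.g.:
fab -f fabfile.py <some-hadoop-setup-task>   # placeholder task name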

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how the master and slaves communicate, and the various configuration settings. I have already checked the AWS EMR documentation but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, you will be asked to specify the master and the nodes; the default settings provision one master and two nodes for you. You can also specify which applications you want in the cluster (e.g. Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all of those services. You can click on these services and access them via the web, or by SSHing into the master. For example, to access the Ambari interface, go to the service within EMR and click it; a new window will open with the Ambari monitoring interface.
Installing these applications is very easy: all you have to do is specify the services during cluster creation.
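For reference, the same kind of cluster can also be created from the AWS CLI; a minimal sketch (the name, release label, instance type, and key pair are placeholders to adjust):
# One master plus two core nodes, with the listed applications installed.
aws emr create-cluster \
  --name "my-emr-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Zeppelin Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles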
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, the Hadoop provided with EMR can be a few releases behind. You should check the EMR release notes for the exact application versions provided with each release. EMR takes care of application provisioning, setup, and configuration. Depending on the EC2 instance type, the default Hadoop (and other application) configuration will change. You can override the default settings by supplying an application configuration (see the sketch after this answer).
Other than that, the Hadoop you have on premises and the Hadoop on EMR should be the same.
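A minimal sketch of such an override, supplied as a configuration file at cluster creation (the classification and property are only examples; the EMR documentation lists the classifications each release supports):
# configurations.json
[
  {
    "Classification": "yarn-site",
    "Properties": { "yarn.nodemanager.vmem-check-enabled": "false" }
  }
]
Pass it to the cluster with --configurations file://configurations.json on the aws emr create-cluster command shown earlier.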

Is there an Amazon community AMI for Hadoop/HBase?

I would like to test out Hadoop & HBase in Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the Bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, a 1-to-1 NAT where the node sees its own private IP address but connects to the outside world on a public IP (a quick way to observe this is sketched at the end of this answer). When you are manually building a cluster, this causes problems - even with software like Apache Whirr or BigTop that targets EC2 specifically.
An AMI alone is not likely to help you get a Hadoop or HBase cluster up and running - if you want to run a Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings etc.
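A quick way to see that public/private split from any EC2 instance, using the standard instance metadata endpoints (IMDSv1 style; newer instances may require a session token first):
# The instance itself only sees its private address...
curl http://169.254.169.254/latest/meta-data/local-ipv4
# ...while the outside world reaches it on a separate public address (if one is attached).
curl http://169.254.169.254/latest/meta-data/public-ipv4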
To my knowledge there isn't, but you should be able to deploy on EC2 easily using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to do it in minutes!
The key is creating a recipe like this:
whirr.cluster-name=hbase
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties

Running MRToolkit hadoop jobs on AWS elastic map/reduce

Loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, and not with Amazon's elastic map/reduce system. Does anybody have insights into running jobs defined using the toolkit on elastic map/reduce servers? It isn't readily apparent from the web interface, and I'd love to avoid the headache of setting up a cluster by hand on EC2.
I've looked into uploading files under the 'streaming' option (as that's what MRToolkit uses), but Amazon expects separate files for the mapper and reducer - typical MRToolkit style defines them in a single file as subclasses of the predefined Base(Map|Reduce) classes.
Thanks much for any thoughts.
Isaac
It's doable, but not through the web GUI.
Download and install the Ruby Client
Create your cluster: elastic-mapreduce --create --alive [params to size cluster]
Confirm your Elastic Map Reduce Master security group has port 22 open
SSH into your master node
Use git / scp to copy over your application code
Run your app
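A rough sketch of those steps with made-up key, host, and path names (only --create --alive comes from the list above; "hadoop" is the usual login user on an EMR master, and the job entry point is hypothetical):
# Create a long-running cluster (sizing flags omitted; see the Ruby client's help).
elastic-mapreduce --create --alive [params to size cluster]
# Copy your MRToolkit application to the master node and log in.
scp -i ~/my-key.pem -r ./my_mrtoolkit_job hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/
ssh -i ~/my-key.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
# Run the job from the master however you normally launch it locally, e.g.:
cd my_mrtoolkit_job && ruby my_job.rb   # hypothetical entry point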

Resources