When setting up a basic cluster using the supplied Spark script for EC2, the cluster is created (1 master, 1 slave), but I continuously get SSH hostname resolution errors.
This is because the host names are empty.
As I understand it, the point of the script is that it creates the instances, so it should know their names and be able to resolve them as part of the setup.
So the question is: am I supposed to configure the EC2 script before trying to launch the cluster?
I'm interested in running a Dask cluster on EMR and interacting with it from inside a Jupyter Lab notebook running on a separate EC2 instance (i.e. an EC2 instance not within the cluster and not managed by EMR).
The Dask documentation points to dask-labextension as the tool of choice for this use case. dask-labextension relies on a YAML config file (and/or some environment vars) to understand how to talk to the cluster. However, as far as I can tell, this configuration can only be set to point to a local Dask cluster. In other words, you must be in a Jupyter Lab notebook running on an instance within the cluster (presumably on the master instance?) in order to use this extension.
Is my read correct? Is it not currently possible to use dask-labextension with an external Dask cluster?
Dask Labextension can talk to any Dask cluster that is visible from where your web client is running. If you can connect to a dashboard in a web browser then you can copy that same address to the Dask-Labextension search bar and it will connect.
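One quick way to test the "visible from where your web client is running" condition is to check that the dashboard port accepts TCP connections from that machine. The sketch below is only an illustration: port 8787 is assumed because it is Dask's default dashboard port, your cluster may expose a different one, and the helper names are made up for this example.

```python
import socket


def dashboard_url(host, port=8787):
    """Build the address to paste into the Dask-Labextension search bar."""
    return f"http://{host}:{port}"


def is_reachable(host, port=8787, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `is_reachable("emr-master.example.internal")` (a hypothetical host name) returning False usually points at a security group rule or a missing SSH tunnel rather than at the extension itself.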
I have created a graph database in ArangoDB on a 5-machine AWS cluster. I do not have enough space in the database's AWS cluster to store the dump, so I would like to take a dump of the database on an AWS instance in a different cluster. I have the key files to connect to the machines. How can I do this using arangodump? Thanks.
Do I understand correctly that you're using DC/OS clusters on AWS?
The problem with arangoimp is that it doesn't know how to authenticate with the DC/OS proxy, and thus can't reach the routes it would require to import to ArangoDB.
The problem is similar to "Running Arango Shell on DC/OS cluster": you want to use sshuttle, as lalitlogical describes, to forward the ArangoDB server port (usually 8529) to your target environment.
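As a rough sketch of that workflow (not the exact setup from the linked answer): open an sshuttle tunnel from the machine that has the disk space, then point arangodump at the ArangoDB endpoint through it. Every host, subnet, key path, user name, database name and output directory you pass in is a placeholder you must supply; the code only assembles the two command lines so you can review them before running anything.

```python
import shlex


def sshuttle_command(ssh_user, jump_host, subnet, key_file):
    """Command that routes traffic for `subnet` over SSH via jump_host."""
    return [
        "sshuttle",
        "--ssh-cmd", f"ssh -i {shlex.quote(key_file)}",
        "-r", f"{ssh_user}@{jump_host}",
        subnet,
    ]


def arangodump_command(server_host, database, output_dir,
                       username="root", port=8529):
    """Command that dumps `database` into output_dir on the local machine."""
    return [
        "arangodump",
        "--server.endpoint", f"tcp://{server_host}:{port}",
        "--server.username", username,
        "--server.database", database,
        "--output-directory", output_dir,
    ]


def render(command):
    """Quote an argument list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in command)
```

Run the sshuttle command first and keep it open in one terminal, then run the arangodump command in another; the dump lands on the machine where arangodump runs, not on the database cluster.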
I am experienced in Java and want to get my hands dirty with Hadoop. I have gone through the basics and am now preparing for the practical side.
I have started with the tutorial at https://developer.yahoo.com/hadoop/tutorial/ to set up and run Hadoop on a virtual machine.
So, to create a cluster I need multiple virtual machines running in parallel, right? And I need to add the IP addresses of all of them to hadoop-site.xml? Or can I do it with a single virtual machine?
No, you cannot create a cluster with a single VM. A cluster is, by definition, a group of machines.
If your host machine is powerful enough, you can run 'n' guest OSes on top of it. That is the only way to build a Hadoop cluster (1 NN, 1 SNN, 1 DN) on one physical machine.
If you want, you can install Hadoop in pseudo-distributed mode (all services run on one machine), which works as a test setup.
You can set up a multi-node cluster using any virtualization software such as Oracle VM VirtualBox. Create 5 nodes (1 NN, 1 SNN, 3 DN). Assign each node an IP address and set up the configuration files on all the nodes. There are 2 host files, masters and slaves. On the NN node, put the IP address of the SNN in the masters file and the IP addresses of all 3 DNs in the slaves file. Also set up SSH connectivity between all the nodes using public keys.
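To make the file layout concrete, here is a small sketch that generates the two host files for the five-node example. It assumes the Hadoop 1.x convention of conf/masters and conf/slaves with one host per line; the function names are just for illustration, and the addresses are whatever you assigned to your VMs.

```python
from pathlib import Path


def render_host_files(secondary_namenodes, datanodes):
    """Return the text of the masters and slaves files, one host per line."""
    return {
        "masters": "\n".join(secondary_namenodes) + "\n",
        "slaves": "\n".join(datanodes) + "\n",
    }


def write_host_files(conf_dir, files):
    """Write each rendered file into the Hadoop conf directory."""
    conf = Path(conf_dir)
    conf.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (conf / name).write_text(content)
```

Calling `render_host_files(["192.168.56.102"], ["192.168.56.103", "192.168.56.104", "192.168.56.105"])` with made-up addresses gives a masters file holding the SNN and a slaves file holding the three DNs, which you would then write on the NN node.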
I have been using the wekaDistributedHadoop1.0.4 and wekaDistributedBase1.0.2 packages on my local machine to run some of the basic jobs. There is a field "HDFS host" which must be filled in order to run the jobs. I have been using "localhost" since I have been testing on my local machine and this works fine. I blindly tried using "localhost" when running on AWS EMR but the job failed. What I would like to know is what host name should I be entering into the field so that weka will call on the correct master? Is it the public DNS name which is provided when starting the cluster or is there a method in the API which gets that address for me?
If you want to do it manually:
Create a cluster and keep it alive. You can find its instances in the Amazon EC2 management console, under the Elastic MapReduce master/slave security groups. Find the master there, log in to the master node, edit the configuration file, and fill in the right name.
If you need to do it automatically:
Write a shell script that is executed at bootstrap. You can refer to https://serverfault.com/questions/279297/what-is-the-easiest-way-to-get-a-ec2-public-dns-inside-a-running-instance
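A minimal sketch of what such a script could do, written in Python rather than shell: it asks the EC2 instance metadata service for the instance's public DNS name. It assumes the instance allows unauthenticated (IMDSv1) metadata requests; instances that enforce IMDSv2 need a session token first. The `opener` parameter is an addition of this sketch, there only so the function can be exercised off EC2.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(key, opener=urllib.request.urlopen, timeout=2.0):
    """Fetch one metadata value as stripped text (only works on an EC2 instance)."""
    with opener(METADATA_BASE + key, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def public_hostname(opener=urllib.request.urlopen):
    """Public DNS name of the instance this code runs on."""
    return fetch_metadata("public-hostname", opener=opener)
```

Run on the master node, `public_hostname()` yields the name to put in the host field; run anywhere else, the request simply times out.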
I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require changes to Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But by default, once the job flow finishes, it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
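For reference, whichever mechanism applies the change, the end result is a set of property entries in mapred-site.xml. The sketch below renders such a file in Python; the helper name is made up, any property you pass is your own choice rather than a recommendation, and getting the file onto the nodes (for instance from a bootstrap action) is deliberately left out.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a Hadoop *-site.xml document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")
```

For instance, `render_site_xml({"mapred.tasktracker.map.tasks.maximum": 2})` produces a one-property file that limits map tasks per node, which is the kind of change the memory-intensive case above calls for.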
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly are you trying to start it? Which AMIs exactly are you using?