I am running Spark processes using Python (PySpark). I created an Amazon EMR cluster to run my Spark scripts, but as soon as the cluster is created, a lot of processes are launched by themselves (?), as I can see when I check the cluster UI:
So, when I try to launch my own scripts, they enter an endless queue; they are sometimes ACCEPTED but never reach the RUNNING state.
I couldn't find any info about this issue, even in the Amazon forums, so I'd be glad of any advice.
Thanks in advance.
You need to check the inbound rules in the security group of the master node. You may have a rule that allows traffic from anywhere (0.0.0.0/0); try removing it and check whether things work. A rule like that is a vulnerability.
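As a concrete sketch of that check, the snippet below scans a security group's inbound rules for a 0.0.0.0/0 entry. The rule structure mirrors the `IpPermissions` field returned by `aws ec2 describe-security-groups` (or boto3's `describe_security_groups`), but the data here is hard-coded sample data and the port numbers are only examples:

```python
def open_to_world(ip_permissions):
    """Return (port, cidr) pairs that allow inbound traffic from anywhere."""
    hits = []
    for perm in ip_permissions:
        for ip_range in perm.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                hits.append((perm.get("FromPort"), ip_range["CidrIp"]))
    return hits

# Sample inbound rules: SSH restricted to a single address, but another
# port left open to the world -- the kind of rule to look for and remove.
rules = [
    {"FromPort": 22,   "ToPort": 22,   "IpRanges": [{"CidrIp": "203.0.113.5/32"}]},
    {"FromPort": 8088, "ToPort": 8088, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

print(open_to_world(rules))  # -> [(8088, '0.0.0.0/0')]
```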
I am trying to set up a 3-node cluster on a cloud server with Cloudera Manager, but at the Cluster Installation step it gets stuck at 64%. Please guide me on how to move forward and where to find the logs for this.
Following is the image of the installation screen
Some cloud providers have policies under which, if a lot of data requests come in, they remove the IP from public hosting for some time. This is done to prevent DDoS attacks.
A solution can be to ask them to raise the data transfer limit.
I made a Spark application that analyzes file data. Since the input data could be big, running my application standalone on one machine is not enough. With one more physical machine, how should I architect it?
I'm considering Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do it without HDFS (for sharing the file data)?
Spark supports a few cluster managers: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work with your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts, which load a Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, browse to http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
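The steps above can be sketched end-to-end as a small helper that, given the URL printed by the master, emits the command to run on each worker. The master URL and worker hostnames below are made-up examples, not real machines:

```python
# Sketch: fan out the start-slave.sh command to each worker machine,
# e.g. over ssh. All hostnames here are illustrative placeholders.

master_url = "spark://ip-10-0-0-1:7077"      # as printed by start-master.sh
workers = ["ip-10-0-0-2", "ip-10-0-0-3"]

commands = [f"ssh {host} ./sbin/start-slave.sh {master_url}" for host in workers]
for cmd in commands:
    print(cmd)
```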
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)
I've had a few days of unalloyed torture getting Hive jobs to run via Oozie on a 5-machine AWS cluster. The simplest job that involves the live metastore succeeds or fails unpredictably. The error messages are pretty unhelpful:
Hive failed, error message[Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]]
Thanks Oozie!
After a lot of fun changing just about every imaginable setting, I studied hivemetastore.log carefully (we have MySQL as the metastore) and realised that every successful request came from 172.31.40.3. Unsuccessful requests came from 172.31.40.2, 172.31.40.4, and 172.31.40.5. The Hive console app makes requests without problems from 172.31.40.1.
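The kind of log triage described above can be sketched like this. The log format below is invented for illustration (real metastore audit lines look different), so the regex would need adjusting for an actual hivemetastore.log:

```python
# Sketch: group metastore requests by source IP to spot which hosts
# are actually reaching the metastore. Sample log lines are made up.
import re
from collections import Counter

sample_log = """\
2015-06-01 10:00:01 source:/172.31.40.3 get_table db=default tbl=events
2015-06-01 10:00:02 source:/172.31.40.2 get_table db=default tbl=events
2015-06-01 10:00:03 source:/172.31.40.3 get_table db=default tbl=users
2015-06-01 10:00:04 source:/172.31.40.4 get_table db=default tbl=users
"""

requests_by_ip = Counter(
    m.group(1)
    for line in sample_log.splitlines()
    if (m := re.search(r"source:/(\d+\.\d+\.\d+\.\d+)", line))
)
print(requests_by_ip.most_common())
```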
This is getting somewhere, after nearly a week of having no idea whatsoever what was going on. The question now is: what do I need to change to allow requests from all of 172.31.40.1-5? Or, alternatively, funnel Oozie requests solely through 172.31.40.1 or 172.31.40.3?
Why would only 172.31.40.1 and 172.31.40.3 work?
All ideas and suggestions warmly received.
many thanks
Toby
This was so simple in the end: the Oozie client was only installed on 2 of the 5 machines in the cluster, corresponding, of course, to the 2 IP addresses that could make successful requests to the Hive metastore.
Once we installed the Oozie client on all the machines in the cluster, all the jobs were accepted and ran OK.
Obvious when you know the answer...
I am new to Hadoop and would like to go into the Hadoop administration line, so I studied the basics of Hadoop, successfully installed it in pseudo-distributed mode, and ran some basic examples. Now I need to improve further, so I would like to learn Hadoop installation and configuration in a real environment, and I have decided to go for an Amazon micro instance. Can anyone please tell me how to install and configure Hadoop in the Amazon cloud?
Thanks in Advance.
I have tried this personally, and you will not really be able to use Hadoop on a single micro instance due to memory restrictions. IMHO you should at least try a medium instance to run Hadoop, or better yet use their Elastic MapReduce API, which is a modified version of Hadoop. You can run a 3-node cluster for around $0.25 an hour. If you really want to learn big data, this is the way I went.
You should check out their documentation here
http://aws.amazon.com/documentation/elasticmapreduce/
I would like to test out Hadoop & HBase on Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, 1-to-1 NAT, where a node sees its own private IP address but connects to the outside world on a public IP. When you are manually building a cluster, this causes problems, even with software like Apache Whirr or BigTop that targets EC2 specifically.
An AMI alone is not likely to get a Hadoop or HBase cluster up and running for you; if you want to run a Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings, etc.
To my knowledge there isn't one, but you should be able to deploy on EC2 easily using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to do it in minutes!
The key is creating a recipe like this:
whirr.cluster-name=hbase
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties