Unable to create cluster in AWS EMR - hadoop

I was trying to create an EMR cluster from Eclipse using Java. I was able to output the job ID, but when I viewed the EMR web console there wasn't any cluster created. What's wrong?
My code:
AWSCredentials credentials = new BasicAWSCredentials("xxx", "xxx");
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
StepFactory stepFactory = new StepFactory();
StepConfig enableDebugging = new StepConfig()
        .withName("Enable Debugging")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newEnableDebuggingStep());
StepConfig installHive = new StepConfig()
        .withName("Install Hive")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newInstallHiveStep());
StepConfig hiveScript = new StepConfig()
        .withName("Hive Script")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newRunHiveScriptStep("s3://mywordcountbuckett/binary/WordCount2.jar"));
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("Hive Interactive")
        .withReleaseLabel("emr-4.1.0")
        .withSteps(enableDebugging, installHive)
        .withJobFlowRole("EMR_DefaultRole")
        .withServiceRole("EMR_EC2_DefaultRole")
        //.withSteps(enableDebugging, installHive)
        .withLogUri("s3://mywordcountbuckett/")
        .withInstances(new JobFlowInstancesConfig()
                .withEc2KeyName("mykeypair")
                .withHadoopVersion("2.4.0")
                .withInstanceCount(5)
                .withKeepJobFlowAliveWhenNoSteps(true)
                .withMasterInstanceType("m3.xlarge")
                .withSlaveInstanceType("m1.large")); //.withAmiVersion("3.10.0");
RunJobFlowResult result = emr.runJobFlow(request);
System.out.println(result.toString());
My output was:
{JobFlowId: j-3T7H65FOSKHDQ}
What could be the reason that I was unable to create the cluster?
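For reference, a job flow ID in the response only means the request was accepted, not that the cluster is visible where you are looking; a common culprit is that a client built without an explicit region talks to us-east-1 while the console shows a different region. A minimal diagnostic sketch, reusing the `emr` client and `result` from the code above:

// Ask the API for the state of the job flow that runJobFlow returned. If the cluster
// failed to start, the state-change reason usually says why; if the state is
// STARTING or RUNNING, check that the console is set to the same region as the client.
DescribeClusterResult status = emr.describeCluster(
        new DescribeClusterRequest().withClusterId(result.getJobFlowId()));
System.out.println(status.getCluster().getStatus().getState());
System.out.println(status.getCluster().getStatus().getStateChangeReason());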

Related

Using spring-session-hazelcast on Kubernetes with service-dns causing SplitBrainMergeValidationOp ERROR

We are deploying a spring-boot application using spring-session-hazelcast + hazelcast-kubernetes on an OpenShift/Kubernetes cluster.
Due to the nature of our platform, we can only use the service-dns configuration. We expose a service on port 5701 for multicasting and set the service-dns property to the multicast service name.
Below is a snippet for the creation of the Hazelcast instance.
@Bean
public HazelcastInstance hazelcastInstance() {
    var config = new Config();
    config.setClusterName("spring-session-cluster");

    var join = config.getNetworkConfig().getJoin();
    join.getTcpIpConfig().setEnabled(false);
    join.getMulticastConfig().setEnabled(false);
    join.getKubernetesConfig().setEnabled(true)
            .setProperty("service-dns", "<multicast-service-name>");

    var attribute = new AttributeConfig()
            .setName(Hazelcast4IndexedSessionRepository.PRINCIPAL_NAME_ATTRIBUTE)
            .setExtractorClassName(Hazelcast4PrincipalNameExtractor.class.getName());
    config.getMapConfig(Hazelcast4IndexedSessionRepository.DEFAULT_SESSION_MAP_NAME)
            .addAttributeConfig(attribute)
            .addIndexConfig(new IndexConfig(IndexType.HASH, Hazelcast4IndexedSessionRepository.PRINCIPAL_NAME_ATTRIBUTE));

    var serializer = new SerializerConfig();
    serializer.setImplementation(new HazelcastSessionSerializer())
            .setTypeClass(MapSession.class);
    config.getSerializationConfig().addSerializerConfig(serializer);

    return Hazelcast.newHazelcastInstance(config);
}
When we run 2 pods for this application, we see the below ERROR log:
com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp
Message: [<private-ip>]:5701 [spring-session-cluster] [4.2] Target is this node! -> [<private-ip>]:5701
Can someone please explain how to fix this error, still using "service-dns" configuration?
You need to enable headless mode for the service in OpenShift: a headless service (one with clusterIP set to None) makes the service DNS name resolve to the individual pod IPs, which is what Hazelcast's DNS lookup needs in order to find the other members.
https://github.com/hazelcast/hazelcast-kubernetes#dns-lookup
Just add a configuration for split-brain protection:
SplitBrainProtectionConfig splitBrainProtectionConfig = new SplitBrainProtectionConfig();
splitBrainProtectionConfig.setName("splitBrainProtectionRuleWithFourMembers")
        .setEnabled(true)
        .setMinimumClusterSize(4);

MapConfig mapConfig = new MapConfig();
mapConfig.setSplitBrainProtectionName("splitBrainProtectionRuleWithFourMembers");

Config config = new Config();
config.addSplitBrainProtectionConfig(splitBrainProtectionConfig);
config.addMapConfig(mapConfig);
You can read more about this in the Hazelcast documentation:
https://docs.hazelcast.com/imdg/4.2/network-partitioning/split-brain-protection.html
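If the goal is to protect the spring-session map from the question above, the same kind of rule can be wired into the existing bean's Config instead of a fresh one. A minimal sketch: the rule name and the minimum cluster size of 2 are illustrative choices for a two-pod deployment, not values from the original answer.

// Inside the hazelcastInstance() bean, after the existing config has been built:
SplitBrainProtectionConfig sessionProtection = new SplitBrainProtectionConfig();
sessionProtection.setName("sessionSplitBrainProtection")
        .setEnabled(true)
        .setMinimumClusterSize(2); // assumption: block writes unless both pods are in the cluster

config.addSplitBrainProtectionConfig(sessionProtection);

// Attach the rule to the map that spring-session-hazelcast stores sessions in.
config.getMapConfig(Hazelcast4IndexedSessionRepository.DEFAULT_SESSION_MAP_NAME)
        .setSplitBrainProtectionName("sessionSplitBrainProtection");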

Connect to HBase using Hadoop config in Spark

I am trying to create an HBase connection in a MapPartitionFunction in Spark, but it fails with:
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
I tried the following code:
SparkConf conf = new SparkConf()
        .setAppName("EnterPrise Risk Score")
        .setMaster("local");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrationRequired", "true");
conf.registerKryoClasses(new Class<?>[] {
        Class.forName("org.apache.hadoop.conf.Configuration"),
        Class.forName("org.apache.hadoop.hbase.client.Table"),
        Class.forName("com.databricks.spark.avro.DefaultSource$SerializableConfiguration")
});

SparkSession sparkSession = SparkSession.builder().config(conf)
        .getOrCreate();

Configuration hbaseConf = HBaseConfiguration.create(hadoopConf);
I am using sparkSession to create a dataset and pass hbaseConf to create connections to HBase.
Is there any way to connect to HBase?
You probably implicitly pass an HBase configuration to a Spark action like this:

Configuration hbaseConfiguration = HBaseConfiguration.create();
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
Why don't you just create the Configuration right inside of it, so it never has to be serialized with the closure:

sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Configuration hbaseConfiguration = HBaseConfiguration.create();
    hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
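To make the pattern more concrete, here is a self-contained sketch of the same idea as a helper method. The RDD contents, the ZooKeeper quorum string, the table name, and the per-row Get lookup are all placeholders for illustration; the point is that everything non-serializable is created inside the partition function.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;

public class HBaseLookup {

    // Looks up each row key in the given HBase table. The Configuration and Connection
    // are created inside mapPartitions, so nothing HBase-related is captured by the closure
    // and Spark only has to serialize the function and the two String arguments.
    public static JavaRDD<String> lookup(JavaRDD<String> rowKeys, String zkQuorum, String tableName) {
        return rowKeys.mapPartitions((Iterator<String> keys) -> {
            Configuration hbaseConf = HBaseConfiguration.create();
            hbaseConf.set("hbase.zookeeper.quorum", zkQuorum);
            List<String> values = new ArrayList<>();
            // try-with-resources closes the table and connection once the partition is processed
            try (Connection connection = ConnectionFactory.createConnection(hbaseConf);
                 Table table = connection.getTable(TableName.valueOf(tableName))) {
                while (keys.hasNext()) {
                    Result result = table.get(new Get(Bytes.toBytes(keys.next())));
                    values.add(Bytes.toString(result.value()));
                }
            }
            return values.iterator();
        });
    }
}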

How to retrieve an AWS EC2 Windows instance password using the AWS SDK in C#

I am using the AWS SDK to retrieve cloud data from AWS.
I get all EC2-related data, but I am not able to find out how to connect to an instance.
Here is the code that retrieves instances from the Amazon cloud:
IAmazonEC2 ec2Client = new AmazonEC2Client(accesskey, secretkey, new AmazonEC2Config
{
    Timeout = TimeSpan.FromSeconds(300),
    MaxErrorRetry = 3,
    RegionEndpoint = RegionEndpoint.GetBySystemName(regionName)
});

var instanceRequest = new DescribeInstancesRequest();
DescribeInstancesResponse ec2Response = ec2Client.DescribeInstances(instanceRequest);
First of all, you will need the key pair (.pem) file that you used while creating the instance, as it is needed to decrypt the Windows password.
Following are the steps to retrieve a Windows instance password using the AWS SDK:
#1. Pass the instanceId as well as the RSA key (the raw text contents of your .pem file) to the following code.
IAmazonEC2 ec2Client = new AmazonEC2Client(accesskey, secretkey, new AmazonEC2Config
{
    Timeout = TimeSpan.FromSeconds(300),
    MaxErrorRetry = 3,
    RegionEndpoint = region,
});

var passwordRequest = new GetPasswordDataRequest();
passwordRequest.InstanceId = instanceId;
var passwordResponse = ec2Client.GetPasswordData(passwordRequest);

var password = passwordResponse.GetDecryptedPassword(rsaKey);
return password;
Note: You have to wait at least 4 minutes after launching an instance before the Windows password is available.

Create HTML snapshot on AWS EC2 instance for Google rendering

How can I create an HTML snapshot, server side, to provide to Google for rendering (crawling)? I have to do this on an AWS EC2 instance, which is a headless server.
So how should I go about this?
Here is the answer for it:
// Run Firefox against the Xvfb virtual display (":99" by default).
String Xport = System.getProperty("lmportal.xvfb.id", ":99");
final File firefoxPath = new File(System.getProperty("lmportal.deploy.firefox.path", "/opt/firefox/firefox"));
FirefoxBinary firefoxBinary = new FirefoxBinary(firefoxPath);
firefoxBinary.setEnvironmentProperty("DISPLAY", Xport);
FirefoxDriver driver = new FirefoxDriver(firefoxBinary, null);

// Load the page, give JavaScript a moment to render, then write the DOM back to the response.
driver.get(url);
Thread.sleep(1500);
res.setContentType("text/html;charset=UTF-8");
PrintWriter out = res.getWriter();
String str = driver.getPageSource();
out.println(str);
out.close();
driver.quit();
And install Xvfb and Firefox on your EC2 instance; that's important.

How can I know when the Amazon MapReduce task is complete?

I am trying to run a MapReduce task on Amazon EC2.
I set all the configuration params and then call the runJobFlow method of the AmazonElasticMapReduce service.
I wonder: is there any way to know whether the job has completed and what its status was?
(I need it to know when I can pick up the MapReduce results from S3 for further processing.)
Currently the code just keeps executing because the call to runJobFlow is non-blocking.
public void startMapReduceTask(String accessKey, String secretKey,
        String eC2KeyPairName, String endPointURL, String jobName,
        int numInstances, String instanceType, String placement,
        String logDirName, String bucketName, String pigScriptName) {

    log.info("Start running MapReduce");

    // config.set
    ClientConfiguration config = new ClientConfiguration();
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config);
    service.setEndpoint(endPointURL);

    JobFlowInstancesConfig conf = new JobFlowInstancesConfig();
    conf.setEc2KeyName(eC2KeyPairName);
    conf.setInstanceCount(numInstances);
    conf.setKeepJobFlowAliveWhenNoSteps(true);
    conf.setMasterInstanceType(instanceType);
    conf.setPlacement(new PlacementType(placement));
    conf.setSlaveInstanceType(instanceType);

    StepFactory stepFactory = new StepFactory();
    StepConfig enableDebugging = new StepConfig()
            .withName("Enable Debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());
    StepConfig installPig = new StepConfig()
            .withName("Install Pig")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newInstallPigStep());
    StepConfig runPigScript = new StepConfig()
            .withName("Run Pig Script")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, ""));

    RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf)
            .withSteps(enableDebugging, installPig, runPigScript)
            .withLogUri("s3n://" + bucketName + "/" + logDirName);

    try {
        RunJobFlowResult res = service.runJobFlow(request);
        log.info("Mapreduce job with id[" + res.getJobFlowId() + "] completed successfully");
    } catch (Exception e) {
        log.error("Caught Exception: ", e);
    }
    log.info("End running MapReduce");
}
thanks,
aviad
From the AWS documentation:
Once the job flow completes, the cluster is stopped and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
It goes on to say:
If the JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.
A maximum of 256 steps are allowed in each job flow.
For long-running job flows, we recommend that you periodically store your results.
So it looks like there is no way of knowing when it is done. Instead you need to save your data as part of the job.
