I am trying to run a MapReduce task on Amazon EC2.
I set all the configuration params and then call the runJobFlow method of the AmazonElasticMapReduce service.
I wonder: is there any way to know whether the job has completed and what its status was?
(I need this in order to know when I can pick up the MapReduce results from S3 for further processing.)
Currently the code just keeps executing, because the call to runJobFlow is non-blocking.
public void startMapReduceTask(String accessKey, String secretKey,
        String eC2KeyPairName, String endPointURL, String jobName,
        int numInstances, String instanceType, String placement,
        String logDirName, String bucketName, String pigScriptName) {

    log.info("Start running MapReduce");

    // config.set
    ClientConfiguration config = new ClientConfiguration();
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config);
    service.setEndpoint(endPointURL);

    JobFlowInstancesConfig conf = new JobFlowInstancesConfig();
    conf.setEc2KeyName(eC2KeyPairName);
    conf.setInstanceCount(numInstances);
    conf.setKeepJobFlowAliveWhenNoSteps(true);
    conf.setMasterInstanceType(instanceType);
    conf.setPlacement(new PlacementType(placement));
    conf.setSlaveInstanceType(instanceType);

    StepFactory stepFactory = new StepFactory();

    StepConfig enableDebugging = new StepConfig()
            .withName("Enable Debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    StepConfig installPig = new StepConfig()
            .withName("Install Pig")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newInstallPigStep());

    StepConfig runPigScript = new StepConfig()
            .withName("Run Pig Script")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, ""));

    RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf)
            .withSteps(enableDebugging, installPig, runPigScript)
            .withLogUri("s3n://" + bucketName + "/" + logDirName);

    try {
        RunJobFlowResult res = service.runJobFlow(request);
        log.info("Mapreduce job with id[" + res.getJobFlowId() + "] completed successfully");
    } catch (Exception e) {
        log.error("Caught Exception: ", e);
    }

    log.info("End running MapReduce");
}
Thanks,
Aviad
From the AWS documentation:
Once the job flow completes, the cluster is stopped and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
It goes on to say:
If the JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.
A maximum of 256 steps are allowed in each job flow.
For long running job flows, we recommend that you periodically store your results.
So it looks like there is no way of knowing when it is done; instead, you need to save your data to S3 as part of the job.
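That said, since your code sets KeepJobFlowAliveWhenNoSteps to true, the job flow will end up in the WAITING state once the steps finish, and you could poll for that from the client. This is only a rough sketch of my own, assuming the classic DescribeJobFlows call from the same SDK generation as the code above (newer SDK versions expose describeCluster/listSteps instead):

// Sketch only: block until the job flow reaches WAITING or a terminal state.
// Uses com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest / DescribeJobFlowsResult.
public static String waitForJobFlow(AmazonElasticMapReduce service, String jobFlowId) throws InterruptedException {
    while (true) {
        DescribeJobFlowsResult result = service.describeJobFlows(
                new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId));
        String state = result.getJobFlows().get(0).getExecutionStatusDetail().getState();
        // WAITING is reached when KeepJobFlowAliveWhenNoSteps is true and all steps are done
        if ("WAITING".equals(state) || "COMPLETED".equals(state)
                || "FAILED".equals(state) || "TERMINATED".equals(state)) {
            return state;
        }
        Thread.sleep(30 * 1000L); // poll every 30 seconds
    }
}

Once this returns WAITING (or COMPLETED), the Pig output should already be in S3 and you can start your further processing.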
Using the NiFi REST API endpoint and code snippet below, I am fetching a list of Remote Process Groups (RPGs), then iterating over them and fetching each RPG's details. The problem is that I am getting inaccurate RPG data. If I hit the endpoint https://nifihost:8080/nifi-api/remote-process-groups/{id} directly, I receive accurate details. Please clarify:
Why is there a discrepancy between the results of these two endpoints?
(https://nifihost:8080/nifi-api/process-groups/{id}/remote-process-groups
vs.
https://nifihost:8080/nifi-api/remote-process-groups/{id})
My requirement is to iterate over each Process Group, get the list of Remote Process Groups (RPGs) within it, and fetch each RPG's details. What is the right way to achieve this?
Endpoint:
https://nifihost:8080/nifi-api/process-groups/{id}/remote-process-groups
Source Code
ArrayList<NifiRemoteProcessGroup> remoteProcessGroupArrayList = new ArrayList<>();
String returnedJSON = "";
String remoteProcessGroupURL = getNifiURL() + "/nifi-api/process-groups/" + processGroup + "/remote-process-groups";
HttpEntity httpEntity = RestCall.oAuthHeaders(token);
RestTemplate restTemplate = new RestTemplate();

try {
    ResponseEntity<String> response = restTemplate.exchange(remoteProcessGroupURL, HttpMethod.GET, httpEntity, String.class);
    returnedJSON = response.getBody();
}
catch (Exception e) {
    logger.error("There was an error retrieving the remote-process-groups : " + e.getMessage());
}

try {
    ObjectMapper objectMapper = new ObjectMapper();
    JsonNode rootNode = objectMapper.readTree(returnedJSON);
    JsonNode processorNode = rootNode.path("remoteProcessGroups");
    Iterator<JsonNode> elements = processorNode.elements();
    while (elements.hasNext()) {
        JsonNode remoteProcessGroup = elements.next();
        JsonNode statusElement = remoteProcessGroup.path("status");
        JsonNode bulletinElement = remoteProcessGroup.path("bulletins");
        JsonNode componentElement = remoteProcessGroup.path("component");
        JsonNode aggregateSnapshot = statusElement.path("aggregateSnapshot");

        NifiRemoteProcessGroup remoteProcessGroupInstance = new NifiRemoteProcessGroup();
        remoteProcessGroupInstance.setRemoteProcessGroupId(checkExists(statusElement, "id"));
        remoteProcessGroupInstance.setRemoteProcessGroupName(checkExists(componentElement, "name"));
        remoteProcessGroupInstance.setRemoteProcessGroupGroupId(checkExists(statusElement, "groupId"));
        remoteProcessGroupInstance.setRemoteProcessGroupTargetURL(checkExists(componentElement, "targetUri"));
        remoteProcessGroupInstance.setRemoteProcessGroupBulletins(bulletinElement.asText());
        remoteProcessGroupInstance.setRemoteProcessGroupTransmitting(Boolean.valueOf(checkExists(componentElement, "transmitting")));
        remoteProcessGroupInstance.setRemoteProcessGroupTransmissionStatus(checkExists(statusElement, "transmissionStatus"));
        remoteProcessGroupInstance.setRemoteProcessGroupActiveThreadCount(Double.valueOf(checkExists(aggregateSnapshot, "activeThreadCount")));
        remoteProcessGroupInstance.setRemoteProcessGroupFlowFilesReceived(Double.valueOf(checkExists(aggregateSnapshot, "flowFilesReceived")));
        remoteProcessGroupInstance.setRemoteProcessGroupBytesReceived(Double.valueOf(checkExists(aggregateSnapshot, "bytesReceived")));
        remoteProcessGroupInstance.setRemoteProcessGroupReceived(checkExists(aggregateSnapshot, "received"));

        remoteProcessGroupArrayList.add(remoteProcessGroupInstance);
    }
}
catch (Exception e) {
    logger.info("There was an error creating the list of remote process groups: " + e.getMessage());
}
'process-groups/{id}/remote-process-groups' is part of the ProcessGroups API subsection, and will return a RemoteProcessGroupsEntity, which contains a listing of the Remote Process Groups belonging to the Process Group whose ID you submit.
'remote-process-groups/{id}' is part of the RemoteProcessGroups API, and will fetch the exact RemoteProcessGroupEntity (note the lack of plural) requested.
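So, to meet your requirement, you can keep using the listing endpoint to discover the RPG IDs under each Process Group and then call the single-entity endpoint for each ID when you need the full, authoritative details. A rough sketch in the same RestTemplate/Jackson style as your code; it reuses your RestCall, token, and getNifiURL() helpers, so treat it as an untested outline rather than a drop-in:

// Sketch: list the RPGs under a process group, then fetch each full RemoteProcessGroupEntity.
private List<JsonNode> fetchRemoteProcessGroupDetails(String processGroupId) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    RestTemplate restTemplate = new RestTemplate();
    HttpEntity httpEntity = RestCall.oAuthHeaders(token);

    String listUrl = getNifiURL() + "/nifi-api/process-groups/" + processGroupId + "/remote-process-groups";
    JsonNode listing = mapper.readTree(
            restTemplate.exchange(listUrl, HttpMethod.GET, httpEntity, String.class).getBody());

    List<JsonNode> details = new ArrayList<>();
    for (JsonNode rpg : listing.path("remoteProcessGroups")) {
        String rpgId = rpg.path("id").asText();
        // The single-entity endpoint returns the complete RemoteProcessGroupEntity
        String detailUrl = getNifiURL() + "/nifi-api/remote-process-groups/" + rpgId;
        details.add(mapper.readTree(
                restTemplate.exchange(detailUrl, HttpMethod.GET, httpEntity, String.class).getBody()));
    }
    return details;
}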
I maintain the nominal Python client for NiFi; given the outcome you mention seeking, I suggest you could try:
import nipyapi
nipyapi.utils.set_endpoint('http://localhost:8080/nifi')
rpg_info = [nipyapi.canvas.get_remote_process_group(rpg.id) for rpg in nipyapi.canvas.list_all_remote_process_groups('root', True)]
The RPG info returned will give you the parent ProcessGroup ID under .component.parent_group_id, allowing you to reconstruct the tree, but you should find it much more performant than seeking each individually.
I'm trying to write a custom NiFi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that, it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }

    final ComponentLog logger = getLogger();
    AtomicBoolean error = new AtomicBoolean();
    AtomicReference<String> result = new AtomicReference<>(null);

    // This uses a lambda function in place of a callback for InputStreamCallback#process()
    session.read(flowFile, in -> {
        long start = System.nanoTime();

        // Read the flowfile content into a String
        // TODO: May need to buffer this if the content is large
        try {
            final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
            result.set(new MyMathOperationService().performSomeOperation(contents));
            long stop = System.nanoTime();
            if (getLogger().isDebugEnabled()) {
                final long durationNanos = stop - start;
                DecimalFormat df = new DecimalFormat("#.###");
                getLogger().debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
            }
        } catch (Exception e) {
            error.set(true);
            getLogger().error(e.getMessage() + " Routing to failure.", e);
        }
    });

    if (error.get()) {
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // Again, a lambda takes the place of the OutputStreamCallback#process()
        FlowFile updatedFlowFile = session.write(flowFile, out -> {
            final String resultString = result.get();
            final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
            // TODO: This can use a while loop for performance
            out.write(resultBytes, 0, resultBytes.length);
            out.flush();
        });
        session.transfer(updatedFlowFile, REL_SUCCESS);
    }
}
Daggett is right that the ExecuteScript processor is a good place to start because it will shorten the development lifecycle (no building NARs, deploying, and restarting NiFi to use it) and when you have the correct behavior, you can easily copy/paste into the generated skeleton and deploy it once.
I have created a REST API to create an EC2 instance using the AWS Java SDK.
Now I am trying to connect to the created EC2 instance and then install software on the instance, again through Java. I didn't find any appropriate article for this. Is there any possible way to do this? I don't want to use an SSH client like PuTTY. Thanks.
Sounds like you're looking for a Java SSH client.
You should set up key authentication and use an SSH client library from Java to execute the installation for you.
See this post for a list of solutions.
public static void connectToEC2() {
    try {
        JSch jsch = new JSch();

        String user = "User-name";
        String host = "host";
        int port = 22;

        File directory = new File(".");
        String privateKey = directory.getCanonicalPath() + File.separator + "pem file path";
        jsch.addIdentity(privateKey);
        System.out.println("identity added ");

        Session session = jsch.getSession(user, host, port);
        System.out.println("session created.");

        java.util.Properties config = new java.util.Properties();
        config.put("StrictHostKeyChecking", "no");
        session.setConfig(config);
        session.connect();

        Channel channel = session.openChannel("shell");
        channel.setInputStream(System.in);
        channel.setOutputStream(System.out);
        channel.connect(3 * 1000);
    }
    catch (Exception e) {
        System.out.println(e);
    }
}
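The shell channel above just wires the remote shell to your console. Since the actual goal is to install software, an exec channel that runs a single non-interactive command is usually more convenient. Here is a sketch built on the same JSch session as above; the install command is only a placeholder:

public static void runRemoteCommand(Session session, String command) throws Exception {
    // e.g. command = "sudo yum install -y <your-package>" (placeholder, adjust for your AMI)
    ChannelExec channel = (ChannelExec) session.openChannel("exec");
    channel.setCommand(command);
    channel.setErrStream(System.err);
    InputStream in = channel.getInputStream();
    channel.connect();

    byte[] buffer = new byte[1024];
    while (true) {
        // stream the remote command's stdout to the local console
        while (in.available() > 0) {
            int read = in.read(buffer);
            if (read < 0) {
                break;
            }
            System.out.print(new String(buffer, 0, read));
        }
        if (channel.isClosed()) {
            System.out.println("exit status: " + channel.getExitStatus());
            break;
        }
        Thread.sleep(500);
    }
    channel.disconnect();
}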
Consumer queues are allocated on the client side; the broker knows nothing about this.
So how can we monitor which queue is allocated to which consumer client?
Though there is no existing command for this, you can find out the client for each message queue per consumer group using the provided admin infrastructure. Here is a snippet achieving this:
private Map<MessageQueue, String> getClientConnection(DefaultMQAdminExt defaultMQAdminExt, String groupName) {
    Map<MessageQueue, String> results = new HashMap<MessageQueue, String>();
    try {
        ConsumerConnection consumerConnection = defaultMQAdminExt.examineConsumerConnectionInfo(groupName);
        for (Connection connection : consumerConnection.getConnectionSet()) {
            String clientId = connection.getClientId();
            ConsumerRunningInfo consumerRunningInfo = defaultMQAdminExt.getConsumerRunningInfo(groupName, clientId, false);
            for (MessageQueue messageQueue : consumerRunningInfo.getMqTable().keySet()) {
                results.put(messageQueue, clientId + " " + connection.getClientAddr());
            }
        }
    } catch (Exception e) {
        // ignored here for brevity; log or rethrow in real code
    }
    return results;
}
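If you are not already inside an admin tool, a minimal way to drive the snippet could look like this (the name server address and consumer group name are placeholders):

public void printQueueAllocation(String nameServerAddr, String groupName) throws Exception {
    // nameServerAddr e.g. "127.0.0.1:9876", groupName e.g. "my_consumer_group" (placeholders)
    DefaultMQAdminExt defaultMQAdminExt = new DefaultMQAdminExt();
    defaultMQAdminExt.setNamesrvAddr(nameServerAddr);
    defaultMQAdminExt.start();
    try {
        for (Map.Entry<MessageQueue, String> entry : getClientConnection(defaultMQAdminExt, groupName).entrySet()) {
            System.out.println(entry.getKey() + " is allocated to " + entry.getValue());
        }
    } finally {
        defaultMQAdminExt.shutdown();
    }
}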
In case you have not used the RocketMQ-Console project, please try and run it: https://github.com/rocketmq/rocketmq-console-ng
In the Consumer tab, click the "consumer detail" button; you will see the message queue allocation result visually, as below:
(Screenshot: message queue allocation result)
Hi everybody. I'm using MR to process a log file; the file is on HDFS. I want to retrieve some info from the file and store it in HBase.
So I launch the job:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar crm_hbase-1.0.jar /datastream/music/useraction/2014-11-30/music_useraction_20141130-230003072+0800.24576015364769354.00018022.lzo
If I just run the job as "hadoop jar xxxx", it shows "cannot find HBaseConfiguration".
My code is quite simple:
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();

    String kerbConfPrincipal = "ndir@HADOOP.HZ.NETEASE.COM";
    String kerbKeytab = "/srv/zwj/ndir.keytab";
    UserGroupInformation.loginUserFromKeytab(kerbConfPrincipal, kerbKeytab);
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    System.out.println(" auth: " + ugi.getAuthenticationMethod());
    System.out.println(" name: " + ugi.getUserName());
    System.out.println(" using keytab:" + ugi.isFromKeytab());

    HBaseAdmin.checkHBaseAvailable(config);

    //set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    //set map class
    job.setMapperClass(LogMapper.class);
    //set output format and output table name
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);
But when I try to run this MR job, I cannot execute "context.write(null, put)"; it seems the map halts at this line.
I think it is related to "kerbKeytab"; does it mean I need to "login" when I run the "map" process?
After adding TableMapReduceUtil.initTableReducerJob, it works:
Job job = new Job(config, "Import from file ");
job.setJarByClass(LogRun.class);
//set map class
job.setMapperClass(LogMapper.class);
TableMapReduceUtil.initTableReducerJob(table, null, job);
job.setNumReduceTasks(0);
TableMapReduceUtil.addDependencyJars(job);
FileInputFormat.setInputPaths(job,input);
//FileInputFormat.addInputPath(job, new Path(strings[0]));
int ret = job.waitForCompletion(true) ? 0 : 1;