Get the total mapping and reducing times in Hadoop programmatically

I am trying to calculate the total time spent in the map, shuffle, and reduce phases across all tasks in my MR code.
I need help retrieving that information for each MapReduce job.
Can someone post a code snippet that does that calculation?

You need to use the JobClient API, as shown below.
There are, however, some quirks to the API. Try it and let me know; I will help you out.
// Uses the old org.apache.hadoop.mapred API: JobClient, JobConf, JobStatus, JobID, TaskReport.
JobClient client = null;
Configuration configuration = new Configuration();
configuration.set("mapred.job.tracker", jobTrackerURL);
client = new JobClient(new JobConf(configuration));
while (true) {
    // getTrackerEntries(...) is a helper that looks up the JobStatus entries
    // for the given job name on the JobTracker.
    List<JobStatus> jobEntries = getTrackerEntries(jobName, client);
    for (JobStatus jobStatus : jobEntries) {
        JobID jobId = jobStatus.getJobID();
        String trackerJobName = client.getJob(jobId).getJobName();
        // One TaskReport per map/reduce task of this job.
        TaskReport[] mapReports = client.getMapTaskReports(jobId);
        TaskReport[] reduceReports = client.getReduceTaskReports(jobId);
        long jobStartTime = client.getJob(jobId).getJobStatus().getStartTime();
        // Accumulate the task counts across jobs.
        int jobMappers = mapReports.length;
        mapNumber += jobMappers;
        int jobReducers = reduceReports.length;
        reduceNumber += jobReducers;
    }
}
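To get the actual times rather than just task counts, you can sum the per-task durations from the same TaskReport arrays, since each report carries the task's start and finish timestamps. Here is a rough sketch along those lines (shuffle time is not broken out on TaskReport, so for that you would need the job counters or the job history):
long totalMapTime = 0;
for (TaskReport report : client.getMapTaskReports(jobId)) {
    if (report.getFinishTime() > 0) {   // skip tasks that have not finished yet
        totalMapTime += report.getFinishTime() - report.getStartTime();
    }
}
long totalReduceTime = 0;
for (TaskReport report : client.getReduceTaskReports(jobId)) {
    if (report.getFinishTime() > 0) {
        totalReduceTime += report.getFinishTime() - report.getStartTime();
    }
}
System.out.println("Total map time (ms): " + totalMapTime);
System.out.println("Total reduce time (ms): " + totalReduceTime);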

Related

How to perform a simple select from HBase with criteria (Where clause)

I have the following simple table, which I created from this source: https://hortonworks.com/hadoop-tutorial/introduction-apache-hbase-concepts-apache-phoenix-new-backup-restore-utility-hbase/#start-hbase
using these commands:
create 'driver_dangerous_event','events'
put 'driver_dangerous_event','4','events:driverId','78'
put 'driver_dangerous_event','4','events:driverName','Carl'
put 'driver_dangerous_event','4','events:eventTime','2016-09-23 03:25:03.567'
put 'driver_dangerous_event','4','events:eventType','Normal'
put 'driver_dangerous_event','4','events:latitudeColumn','37.484938'
put 'driver_dangerous_event','4','events:longitudeColumn','-119.966284'
put 'driver_dangerous_event','4','events:routeId','845'
put 'driver_dangerous_event','4','events:routeName','Santa Clara to San Diego'
put 'driver_dangerous_event','4','events:truckId','637'
I need to query this row using a where-style filter (for future use). I have a REST API and a Thrift API running on my server.
I tried the REST API but failed to do it; is it possible?
I also tried this NuGet package: https://hbasenet.codeplex.com/releases/view/133288 but I can't figure out how to filter the data with a where clause; I can only select a specific row:
Hbase.Client c = new Hbase.Client(serverHostName, port, 10000);
var res = c.Scan<Driver>("driver_dangerous_event", "events", "1");
Is there any option to do a simple filtered query with the REST API, Thrift API, or some other C# library?
I used Microsoft.HBase.Client for performing a simple query (https://github.com/hdinsight/hbase-sdk-for-net):
// Connection setup: point the client at the HBase REST endpoint.
RequestOptions scanOptions = RequestOptions.GetDefaultOptions();
scanOptions.Port = int.Parse(hbaseDataConnection.Port);
scanOptions.AlternativeEndpoint = "/";
var nodeIPs = new List<string>();
nodeIPs.Add(hbaseDataConnection.Address);
HBaseClient client = new HBaseClient(null, scanOptions, new LoadBalancerRoundRobin(nodeIPs));

// Create an initial scanner to find out which region server hosts the table.
Scanner scanner = new Scanner { batch = 10 };
ScannerInformation scannerInfo = await client.CreateScannerAsync(_tableName, scanner, scanOptions);

var options = RequestOptions.GetDefaultOptions();
options.Port = int.Parse(hbaseDataConnection.Port);
options.AlternativeEndpoint = "/";
options.AlternativeHost = scannerInfo.Location.Host;

// The "where clause": match rows whose driverName column contains the search value.
var f1 = new SingleColumnValueFilter(
    Encoding.UTF8.GetBytes(ColumnFamilyName),
    Encoding.UTF8.GetBytes("driverName"),
    CompareFilter.CompareOp.Equal,
    new SubstringComparator(fld.Values[0].ToString()));
var filter = new FilterList(FilterList.Operator.MustPassAll, f1);
scanner.filter = filter.ToEncodedString();

// Re-create the scanner, now with the filter attached, and read the results.
ScannerInformation scanInfo = client.CreateScannerAsync(_tableName, scanner, scanOptions).Result;
result = RetrieveResults(client, scanInfo, scanOptions).ToList();
Make sure the REST API is running on the HBase machine, e.g.
hbase rest start -p 20050 --infoport 20051
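For reference, the filter model the C# SDK mirrors is HBase's own SingleColumnValueFilter/SubstringComparator. The same filtered scan expressed with the standard HBase Java client looks roughly like this (a sketch, not tested against your cluster; the "Carl" value stands in for your search term):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("driver_dangerous_event"))) {
            // "WHERE driverName LIKE '%Carl%'" expressed as an HBase filter.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("events"),
                    Bytes.toBytes("driverName"),
                    CompareFilter.CompareOp.EQUAL,
                    new SubstringComparator("Carl"));
            filter.setFilterIfMissing(true); // drop rows that lack the column
            Scan scan = new Scan();
            scan.setFilter(filter);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()) + " -> "
                            + Bytes.toString(row.getValue(Bytes.toBytes("events"),
                                                          Bytes.toBytes("driverName"))));
                }
            }
        }
    }
}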

Accessing Tivoli Performance Module

I am trying to extract the PMI data using a Java application. I have already been able to access the performance module, but unfortunately I cannot access its submodules, as in the example below.
I extracted the ThreadPool module data using this code:
// Ask the Perf MBean for the ThreadPool module statistics (non-recursive).
StatDescriptor mysd = new StatDescriptor(new String[] { PmiConstants.THREADPOOL_MODULE });
MBeanStatDescriptor mymsd = new MBeanStatDescriptor(nodeAgent, mysd);
Object[] params = new Object[] { mymsd, new Boolean(false) };
String[] signature = new String[] { "com.ibm.websphere.pmi.stat.MBeanStatDescriptor", "java.lang.Boolean" };
com.ibm.ws.pmi.stat.StatsImpl myStats = (StatsImpl) adminClient.invoke(perfOn, "getStatsObject", params, signature);
//System.out.println("myStats Size = " + myStats.dataMembers().size() + "\n" + myStats.toString());
but I cannot access the ThreadPool submodules and their counters, such as AriesThreadPool.
Any recommended suggestions?
I solved the problem: I just enabled recursive retrieval by changing the parameter from false to true:
Object[] params = new Object[]{mymsd, new Boolean(true)};
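Once the stats are fetched recursively, the submodule data should be reachable by walking the nested stats objects. The following is only a rough, unverified sketch: it assumes the returned StatsImpl can be treated as a com.ibm.websphere.pmi.stat.WSStats with getName(), getStatistics(), and getSubStats() accessors, which you should verify against your WebSphere version.
// Rough, unverified sketch: walk the nested PMI stats returned by the recursive
// call and print each submodule's counters. The WSStats accessor names used here
// are assumptions to be checked against your WebSphere API docs.
private void dumpStats(com.ibm.websphere.pmi.stat.WSStats stats, String indent) {
    System.out.println(indent + stats.getName());
    for (com.ibm.websphere.pmi.stat.WSStatistic statistic : stats.getStatistics()) {
        System.out.println(indent + "  " + statistic.getName() + " = " + statistic);
    }
    for (com.ibm.websphere.pmi.stat.WSStats sub : stats.getSubStats()) {
        dumpStats(sub, indent + "  "); // e.g. AriesThreadPool nested under ThreadPool
    }
}
called as dumpStats(myStats, "") after the recursive getStatsObject call.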

TApplicationException: Required field 'client_protocol' is unset

I am developing a Thrift client.
I have built a Thrift Hive server (apache-hive-0.14.0) on my machine and also have access to Cloudera Dist Hive 4.6.0.
When I connect the Thrift client to the CDH server, the client gives the following error:
TApplicationException: Required field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null, username:
I am passing the right protocol to the server, but it seems something is overriding it.
Moreover, if I point to localhost (where I have my Hive server running), everything seems to work fine.
Please let me know what is wrong here.
Code:
// Open a buffered socket transport to HiveServer2 on port 10000.
var socket = new TSocket("XXX.XXX.XXX.XXX", 10000);
TStreamTransport sTransport = (TStreamTransport)socket; // not used below
var transport = new TBufferedTransport(socket);
underlyingTransport = transport;
var proto = new TBinaryProtocol(transport);
var client = new TCLIService.Client(proto);
transport.Open();
// Open a session, explicitly requesting protocol version V6.
TOpenSessionReq req = new TOpenSessionReq(TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V6);
req.Username = "hive";
req.Password = "hive";
TOpenSessionResp oSResponse = client.OpenSession(req);
TSessionHandle sessionHandle = oSResponse.SessionHandle;
// Run the query and fetch the first batch of results.
TExecuteStatementReq execReq = new TExecuteStatementReq(sessionHandle, "select * from emp");
TExecuteStatementResp exeRes = client.ExecuteStatement(execReq);
TOperationHandle operationHandle = exeRes.OperationHandle;
TFetchResultsReq fechReq = new TFetchResultsReq(operationHandle, TFetchOrientation.FETCH_FIRST, 1);
TFetchResultsResp fechRes = client.FetchResults(fechReq);
TRowSet results = fechRes.Results;
List<TRow> resultRows = results.Rows;
foreach (var row in resultRows)
{
    var val = row.ColVals[0];
    System.Console.WriteLine(val.StringVal);
}
// Clean up the operation and the session.
TCloseOperationReq closeOprReq = new TCloseOperationReq(operationHandle);
client.CloseOperation(closeOprReq);
TCloseSessionReq creq = new TCloseSessionReq(sessionHandle);
client.CloseSession(creq);
I believe it is a Hive client/server version mismatch: the CDH server speaks an older Thrift protocol than the V6 your client requests, so the client_protocol field it expects is never set. This solution may solve your problem: Required field 'client_protocol' is unset
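If a Java client is an option, the hive-jdbc driver that matches the server's Hive version handles the protocol negotiation for you. A minimal sketch, with the host and credentials as placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Use the hive-jdbc driver version that matches the server so the
        // Thrift protocol version is negotiated automatically.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://XXX.XXX.XXX.XXX:10000/default", "hive", "hive");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from emp")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}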

Spring Batch ~ Dynamic commit interval or a custom completion policy

What do I have?
A Spring Integration flow that recursively watches a folder for new CSV files and sends them to Spring Batch.
The job: read the CSV file; in the processor, modify some data in the items; then use a custom writer to save the data to the DB.
The problem?
A dynamic number of CSVs is being sent to the batch, and I want the job's commit interval to be based on the number of items (lines) in each CSV file. In other words, I don't want to commit my data after every fixed number of items, but at the end of each file. Example: CSV 1 has 200 lines; I want to process all the lines, write them, commit, close the transaction, then read the next CSV.
I have two ideas, but I don't know which one is best or how to implement it:
Get the number of lines in my CSV from the reader and pass it to my commit interval using a job parameter, like #{jobParameters['commit.interval.value']}
Implement a custom completion policy to replace my commit interval. How do I implement isComplete()? Do you have any examples? A GitHub project?
But before all that, how can I get the number of items?
Could anyone help me? A code sample, maybe?
Thanks in advance.
No answer here, but I found a solution:
I'm using a dynamic commit interval instead of a completion policy.
With Spring Batch integration, I can use a transformer to send my file to the batch. For that I have a custom class, FileMessageToJobRequest, to which I added this function that counts the lines:
public static int countLines(String filename) throws IOException {
    // Count '\n' characters in 1 KB chunks; treat a non-empty file with no
    // trailing newline as a single line.
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
and this one to send the parameters:
@Transformer
public JobLaunchRequest toRequest(Message<File> message) throws IOException {
    JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
    jobParametersBuilder.addString("commit.interval", Integer.toString(countLines(message.getPayload().getAbsolutePath())));
    return new JobLaunchRequest(job, jobParametersBuilder.toJobParameters());
}
and in my job context, I just added commit-interval="#{jobParameters['commit.interval']}"
Hope it helps someone in need ;)
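For anyone who prefers option 2 from the question (a custom completion policy), a rough, untested sketch is below. It extends Spring Batch's CompletionPolicySupport and completes the chunk once a caller-supplied item count (for example, the countLines result) has been reached; it assumes a single-threaded step, since the counter lives in an instance field.
import org.springframework.batch.repeat.RepeatContext;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.batch.repeat.policy.CompletionPolicySupport;

// Untested sketch: completes the chunk only after itemCount items, so a whole
// CSV can go into a single commit.
public class FileSizeCompletionPolicy extends CompletionPolicySupport {

    private final int itemCount;
    private int processed;

    public FileSizeCompletionPolicy(int itemCount) {
        this.itemCount = itemCount;
    }

    @Override
    public RepeatContext start(RepeatContext parent) {
        processed = 0;                 // reset at the start of each chunk
        return super.start(parent);
    }

    @Override
    public void update(RepeatContext context) {
        processed++;                   // called once per item
        super.update(context);
    }

    @Override
    public boolean isComplete(RepeatContext context, RepeatStatus result) {
        return isComplete(context);
    }

    @Override
    public boolean isComplete(RepeatContext context) {
        return processed >= itemCount;
    }
}
If I remember correctly, you would plug it in with the chunk-completion-policy attribute of the chunk element (dropping commit-interval), building the policy bean from the same line count as above.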

How can I know when the amazon mapreduce task is complete?

I am trying to run a MapReduce task on Amazon EC2.
I set all the configuration params and then call the runJobFlow method of the AmazonElasticMapReduce service.
I wonder if there is any way to know whether the job has completed and what its status was.
(I need it to know when I can pick up the MapReduce results from S3 for further processing.)
Currently the code just keeps executing because the call to runJobFlow is non-blocking.
public void startMapReduceTask(String accessKey, String secretKey
        , String eC2KeyPairName, String endPointURL, String jobName
        , int numInstances, String instanceType, String placement
        , String logDirName, String bucketName, String pigScriptName) {
    log.info("Start running MapReduce");
    // config.set
    ClientConfiguration config = new ClientConfiguration();
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config);
    service.setEndpoint(endPointURL);

    // Cluster (job flow) configuration.
    JobFlowInstancesConfig conf = new JobFlowInstancesConfig();
    conf.setEc2KeyName(eC2KeyPairName);
    conf.setInstanceCount(numInstances);
    conf.setKeepJobFlowAliveWhenNoSteps(true);
    conf.setMasterInstanceType(instanceType);
    conf.setPlacement(new PlacementType(placement));
    conf.setSlaveInstanceType(instanceType);

    // Steps: enable debugging, install Pig, then run the Pig script from S3.
    StepFactory stepFactory = new StepFactory();
    StepConfig enableDebugging = new StepConfig()
            .withName("Enable Debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());
    StepConfig installPig = new StepConfig()
            .withName("Install Pig")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newInstallPigStep());
    StepConfig runPigScript = new StepConfig()
            .withName("Run Pig Script")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, ""));

    RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf)
            .withSteps(enableDebugging, installPig, runPigScript)
            .withLogUri("s3n://" + bucketName + "/" + logDirName);
    try {
        // runJobFlow only submits the job flow; it returns before the steps finish.
        RunJobFlowResult res = service.runJobFlow(request);
        log.info("Mapreduce job with id[" + res.getJobFlowId() + "] completed successfully");
    } catch (Exception e) {
        log.error("Caught Exception: ", e);
    }
    log.info("End running MapReduce");
}
thanks,
aviad
From the AWS documentation:
Once the job flow completes, the cluster is stopped and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
It goes on to say:
If the JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.
A maximum of 256 steps are allowed in each job flow.
For long running job flows, we recommended that you periodically store your results.
So it looks like there is no way of knowing when it is done. Instead, you need to save your data to S3 as part of the job flow itself.
