cascading SinkMode.UPDATE not working - hadoop

I just started Cascading programming and have a Cascading job that needs to run a variable number of iterations. During each iteration, it reads from a file (Tap) generated by the previous iteration and writes the calculated data to two separate sink Taps.
One Tap ("Tap final") is used to collect data from each iteration.
The other Tap ("Tap intermediate") is used to collect data that needs to be processed in the next iteration.
I am using SinkMode.UPDATE for "Tap final" to make this happen. It works correctly in local mode, but fails in cluster mode, complaining that the file ("Tap final") already exists.
I am running CDH4.4 and Cascading 2.5.2. It seems no one else has experienced the same problem.
If anyone knows a possible way to fix it, please let me know. Thanks.
Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://dv-db.machines:8020/tmp/xxxx/cluster/97916 already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:126)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:419)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:332)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1269)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1266)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

It would be helpful to understand the issue better if you could add your Cascading flow code to the question.
It seems a job file with the same name is being used by different jobs in cluster mode. One simple solution, if you are fine with not running the steps concurrently, would be to set the maximum number of concurrent steps to 1.
Flow flow = flowConnector.connect("name", sources, sinks, outPipe1, outPipe2);
flow.setMaxConcurrentSteps(jobProperties, 1);

UPDATE only works with sinks (like databases) that support in-place updating.
If you're using Hfs (a file system sink) then you'll need to use SinkMode.REPLACE.
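If it helps, here is a minimal sketch (not your original job; paths and field names are made up) of declaring the file sinks with SinkMode.REPLACE on the Hadoop platform with Cascading 2.x:
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ReplaceSinksSketch {
    public static void main(String[] args) {
        // REPLACE deletes any existing resource before the step writes its
        // output, so the FileAlreadyExistsException at submission goes away.
        Hfs finalSink = new Hfs(
                new TextDelimited(new Fields("key", "value"), "\t"),
                "hdfs:///tmp/example/final",
                SinkMode.REPLACE);

        Hfs intermediateSink = new Hfs(
                new TextDelimited(new Fields("key", "value"), "\t"),
                "hdfs:///tmp/example/intermediate",
                SinkMode.REPLACE);
        // ... wire these Taps into the Flow exactly as before ...
    }
}
Keep in mind that REPLACE overwrites the previous contents, so if "Tap final" is meant to accumulate results across iterations you will need to merge outputs or write to per-iteration paths yourself.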

Related

Can't connect to Bigtable to scan HTable data due to hardcoded managed=true in hbase client jars

I'm working on a custom load function to load data from Bigtable using Pig on Dataproc. I compile my Java code against the following list of jar files, which I grabbed from Dataproc. When I run the Pig script below, it fails when it tries to establish a connection to Bigtable.
Error message is:
Bigtable does not support managed connections.
Questions:
Is there a work around for this problem?
Is this a known issue and is there a plan to fix or adjust?
Is there a different way of implementing multi scans as a load function for Pig that will work with Bigtable?
Details:
Jar files:
hadoop-common-2.7.3.jar
hbase-client-1.2.2.jar
hbase-common-1.2.2.jar
hbase-protocol-1.2.2.jar
hbase-server-1.2.2.jar
pig-0.16.0-core-h2.jar
Here's a simple Pig script using my custom load function:
%default gte '2017-03-23T18:00Z'
%default lt '2017-03-23T18:05Z'
%default SHARD_FIRST '00'
%default SHARD_LAST '25'
%default GTE_SHARD '$gte\_$SHARD_FIRST'
%default LT_SHARD '$lt\_$SHARD_LAST'
raw = LOAD 'hbase://events_sessions'
USING com.eduboom.pig.load.HBaseMultiScanLoader('$GTE_SHARD', '$LT_SHARD', 'event:*')
AS (es_key:chararray, event_array);
DUMP raw;
My custom load function HBaseMultiScanLoader creates a list of Scan objects to perform multiple scans over different ranges of data in the events_sessions table, determined by the time range between gte and lt and sharded from SHARD_FIRST through SHARD_LAST.
HBaseMultiScanLoader extends org.apache.pig.LoadFunc so that it can be used in the Pig script as a load function.
When Pig runs my script, it calls LoadFunc.getInputFormat().
My implementation of getInputFormat() returns an instance of my custom class MultiScanTableInputFormat which extends org.apache.hadoop.mapreduce.InputFormat.
MultiScanTableInputFormat initializes an org.apache.hadoop.hbase.client.HTable object to establish the connection to the table.
Digging into the hbase-client source code, I see that org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal() calls org.apache.hadoop.hbase.client.ConnectionManager.createConnection() with the attribute “managed” hardcoded to “true”.
You can see from the stack trace below that my code (MultiScanTableInputFormat) tries to initialize an HTable object, which invokes getConnectionInternal(), which does not provide an option to set managed to false.
Going down the stack trace, you will get to AbstractBigtableConnection that will not accept managed=true and therefore cause the connection to Bigtable to fail.
Here’s the stack trace showing the error:
2017-03-24 23:06:44,890 [JobControl] ERROR com.turner.hbase.mapreduce.MultiScanTableInputFormat - java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
at com.eduboom.hbase.mapreduce.MultiScanTableInputFormat.setConf(Unknown Source)
at com.eduboom.pig.load.HBaseMultiScanLoader.getInputFormat(Unknown Source)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:264)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
... 26 more
Caused by: java.lang.IllegalArgumentException: Bigtable does not support managed connections.
at org.apache.hadoop.hbase.client.AbstractBigtableConnection.<init>(AbstractBigtableConnection.java:123)
at com.google.cloud.bigtable.hbase1_2.BigtableConnection.<init>(BigtableConnection.java:55)
... 31 more
The original problem was caused by the use of outdated and deprecated hbase client jars and classes.
I updated my code to use the newest hbase client jars provided by Google and the original problem was fixed.
I am still stuck on a ZK issue that I have not figured out yet, but that's a conversation for a different question.
This one is answered!
I have confronted the same error message:
Bigtable does not support managed connections.
However, according to my research, the root cause is that the HTable class should not be constructed explicitly. After I changed the code to obtain the table via connection.getTable(), the problem was resolved.
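For reference, here is a minimal sketch of that pattern with the HBase 1.x client API (the table name is taken from the Pig script above; the Bigtable project/instance settings are assumed to come from hbase-site.xml or the Configuration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class GetTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The Connection is heavyweight; create it once and reuse it,
        // then ask it for lightweight Table handles instead of new HTable(...).
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events_sessions"))) {
            System.out.println("Connected to " + table.getName());
        }
    }
}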

Hbase Bulk load - Map Reduce job failing

I have a MapReduce job for HBase bulk load. The job converts the data into HFiles and loads them into HBase, but after a certain map % the job fails. Below is the exception I am getting.
Error: java.io.FileNotFoundException: /var/mapr/local/tm4/mapred/nodeManager/spill/job_1433110149357_0005/attempt_1433110149357_0005_m_000000_0/spill83.out.index
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:198)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:800)
at org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:156)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.mergeParts(MapRFsOutputBuffer.java:1382)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.flush(MapRFsOutputBuffer.java:1627)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:709)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
The only thing I noticed is that the job works fine for a small data set, but as the data grows the job starts failing.
Let me know if anyone has faced this issue.
Thanks
This was a bug in MapR. I got a reply on the MapR forum. If someone is facing a similar issue, refer to the link below.
http://answers.mapr.com/questions/163440/hbase-bulk-load-map-reduce-job-failing-on-mapr.html

Cascading 2.0.0 job failing on hadoop FileNotFoundException job.split

When I run my job on a larger dataset, lots of mappers/reducers fail, causing the whole job to crash. Here's the error I see on many mappers:
java.io.FileNotFoundException: File does not exist: /mnt/var/lib/hadoop/tmp/mapred/staging/hadoop/.staging/job_201405050818_0001/job.split
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1933)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1924)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:608)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:385)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Has anybody been able to solve this problem? I see another human experiencing the same pain as me (here); sadly, he could not be saved in time.
After hours of debugging, I found absolutely nothing useful in the Hadoop logs (as usual). Then I tried the following changes:
Increasing the cluster size to 10
Increasing the failure limits:
mapred.map.max.attempts=20
mapred.reduce.max.attempts=20
mapred.max.tracker.failures=20
mapred.max.map.failures.percent=20
mapred.max.reduce.failures.percent=20
After that I was able to run my Cascading job on large amounts of data. It seems like a problem caused by Cascading.
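For anyone wanting to try the same settings, here is a minimal sketch (assuming the Cascading 2.x Hadoop planner; the property values are the ones listed above) of passing them through the flow connector:
import java.util.Properties;

import cascading.flow.hadoop.HadoopFlowConnector;

public class RetryLimitsSketch {
    public static void main(String[] args) {
        Properties properties = new Properties();
        // Raise task/tracker failure limits before the flow is planned.
        properties.setProperty("mapred.map.max.attempts", "20");
        properties.setProperty("mapred.reduce.max.attempts", "20");
        properties.setProperty("mapred.max.tracker.failures", "20");
        properties.setProperty("mapred.max.map.failures.percent", "20");
        properties.setProperty("mapred.max.reduce.failures.percent", "20");

        // The connector forwards these properties into each step's job configuration.
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);
        // ... build Taps and Pipes, then flowConnector.connect(...) as usual ...
    }
}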

Spring Batch 'RunIdIncrementer' not generating next value

I have several Spring Batch (2.1.9.RELEASE) jobs running in production that use org.springframework.batch.core.launch.support.RunIdIncrementer.
Sporadically, I get the following error:
org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters={run.id=23, tenant.code=XXX}. If you want to run this job again, change the parameters.
at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:122) ~[spring-batch-core-2.1.9.RELEASE.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.6.0_39]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) ~[na:1.6.0_39]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ~[na:1.6.0_39]
at java.lang.reflect.Method.invoke(Method.java:597) ~[na:1.6.0_39]
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:318) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) ~[spring-tx-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.batch.core.repository.support.AbstractJobRepositoryFactoryBean$1.invoke(AbstractJobRepositoryFactoryBean.java:168) ~[spring-batch-core-2.1.9.RELEASE.jar:na]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202) ~[spring-aop-3.1.1.RELEASE.jar:3.1.1.RELEASE]
at sun.proxy.$Proxy64.createJobExecution(Unknown Source) ~[na:na]
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:111) ~[spring-batch-core-2.1.9.RELEASE.jar:na]
at org.springframework.batch.core.launch.support.CommandLineJobRunner.start(CommandLineJobRunner.java:349) [spring-batch-core-2.1.9.RELEASE.jar:na]
at org.springframework.batch.core.launch.support.CommandLineJobRunner.main(CommandLineJobRunner.java:574) [spring-batch-core-2.1.9.RELEASE.jar:na]
at (omitted for brevity)
A sampling from the various XML contexts:
<bean
id="jobParametersIncrementer"
class="org.springframework.batch.core.launch.support.RunIdIncrementer" />
<batch:job id="rootJob"
abstract="true"
restartable="true">
<batch:validator>
<bean class="org.springframework.batch.core.job.DefaultJobParametersValidator">
<property name="requiredKeys" value="tenant.code"/>
</bean>
</batch:validator>
</batch:job>
<batch:job id="rootJobWithIncrementer"
abstract="true"
parent="rootJob"
incrementer="jobParametersIncrementer" />
I use org.springframework.batch.core.launch.support.CommandLineJobRunner to execute the job:
java org.springframework.batch.core.launch.support.CommandLineJobRunner /com/XXX/job123/job123-context.xml job123 tenant.code=XXX -next
All of the jobs (that use the incrementer) have rootJobWithIncrementer as parent.
I did quite a bit of research and found that some who got this error had success changing the isolation level of the transaction manager. I fiddled with several levels, finally arriving at READ_COMMITTED.
<batch:job-repository
id="jobRepository"
data-source="oracle_hmp"
transaction-manager="dataSourceTransactionManager"
isolation-level-for-create="READ_COMMITTED"/>
Based on my understanding, this type of error should only happen if the same job is executed at the same time from multiple processes -- so that there might be contention for the incrementer. In this instance, that is not the case, yet we see the error.
Any ideas as to what might be causing this problem? Should I try a different isolation level? Something else?
Thanks!
There is a similar question here, but it is not as well documented (and also lacks an answer).
This might be a long shot, but it took me a long time to figure out because the only symptom was sporadically getting the JobInstanceAlreadyCompleteException you describe, so I figured I'd suggest it.
The database I was using was Oracle and the BATCH_JOB_SEQ and BATCH_JOB_EXECUTION_SEQ I had created both had a CACHE_SIZE of 10.
This had the effect of sometimes causing the JOB_INSTANCE_ID and JOB_EXECUTION_ID to not be ordered correctly. Spring Batch assumes that the most recent JOB_INSTANCE is the one with max(JOB_INSTANCE_ID) (see org.springframework.batch.core.repository.dao.JdbcJobInstanceDao.FIND_LAST_JOBS_BY_NAME). Since my sequences were sometimes out of order, this assumption did not always hold true.
I fixed it by setting the sequences to NO_CACHE.
An easy way to tell if this might be your problem is to check if your sequences are set to CACHE at all and/or to make sure that your JOB_INSTANCE_ID and JOB_EXECUTION_ID are always ascending with each new run.
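If it helps, here is a rough JDBC sketch of both checks against the default Spring Batch schema on Oracle (the connection URL and credentials are placeholders; it assumes the Oracle JDBC driver is on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SequenceCacheCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "batch_user", "secret");
             Statement stmt = conn.createStatement()) {

            // 1. Are the batch sequences cached at all?
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT sequence_name, cache_size FROM user_sequences "
                    + "WHERE sequence_name IN ('BATCH_JOB_SEQ', 'BATCH_JOB_EXECUTION_SEQ')")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " cache_size=" + rs.getInt(2));
                }
            }

            // 2. Do the execution ids ascend with creation time?
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT job_execution_id, create_time "
                    + "FROM batch_job_execution ORDER BY create_time")) {
                long previous = Long.MIN_VALUE;
                while (rs.next()) {
                    long id = rs.getLong(1);
                    if (id < previous) {
                        System.out.println("Out-of-order execution id: " + id);
                    }
                    previous = id;
                }
            }
        }
    }
}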

how to collect hadoop tasktracker status?

I am trying to collect various metrics for the active tasktrackers, but it throws an exception.
Not sure why.
for (String s : jc.getClusterStatus(true).getActiveTrackerNames()) {
    System.out.println("tt " + s);
    System.out.println("" + new org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker(s)
            .getAvailableSlots(TaskType.MAP));
}
output
prompt $ /installs/hadoop-0.20.2//bin/hadoop jar tools.jar tools.MetaInfo
tt tracker_10.0.0.6:localhost/127.0.0.1:53256
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.getAvailableSlots(TaskTracker.java:90)
at tools.MetaInfo.<init>(MetaInfo.java:44)
at tools.MetaInfo.main(MetaInfo.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Not sure why it doesn't display the available slots and instead throws an error.
I am trying to collect various metrics for active tasktrackers but it throws an exception
new TaskTracker() creates a brand-new TaskTracker object, which is not what you want to do.
Check the JobCounter and TaskCounter classes for the various built-in counters in the Hadoop framework. This tutorial will help with retrieving counters. Besides the built-in counters, custom counters can also be defined for any application-specific data related to the Hadoop framework.
Also, split chained calls onto multiple lines. With a().b().c().d().e(), it's very difficult to know where the NPE came from.
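As a rough illustration of reading the built-in counters through the old mapred API (this assumes a Hadoop 2.x client, where org.apache.hadoop.mapreduce.TaskCounter exists; the job id is a placeholder):
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterDump {
    public static void main(String[] args) throws Exception {
        JobClient jobClient = new JobClient(new JobConf());

        // Look up an already-submitted job by its id (placeholder value here).
        RunningJob job = jobClient.getJob(JobID.forName("job_201301010000_0001"));
        Counters counters = job.getCounters();

        // Built-in counters are addressed by their enum constants.
        System.out.println("Map input records: "
                + counters.getCounter(TaskCounter.MAP_INPUT_RECORDS));
    }
}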
You can also see the current status of what the tasktracker is doing at:
http://x.x.x.x:50060/tasktracker.jsp
