TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V - hadoop

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf

val jobConf = new JobConf(hbaseConf)
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, tablename)

// Build a small demo RDD and turn each CSV line into an HBase Put
val indataRDD = sc.makeRDD(Array("1,jack,15", "2,Lily,16", "3,mike,16"))
val rdd = indataRDD.map(_.split(',')).map { arr =>
  val put = new Put(Bytes.toBytes(arr(0).toInt))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))
  (new ImmutableBytesWritable, put)
}
rdd.saveAsHadoopDataset(jobConf)
When I run Hadoop or Spark jobs, I often hit the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapred.TaskID.<init>(Lorg/apache/hadoop/mapreduce/JobID;Lorg/apache/hadoop/mapreduce/TaskType;I)V
at org.apache.spark.SparkHadoopWriter.setIDs(SparkHadoopWriter.scala:158)
at org.apache.spark.SparkHadoopWriter.preSetup(SparkHadoopWriter.scala:60)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1188)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
at com.iteblog.App$.main(App.scala:62)
at com.iteblog.App.main(App.scala)
At first I thought it was a jar conflict, but I checked carefully and there are no duplicate jars. The Spark and Hadoop versions are:
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>2.6.0-mr1-cdh5.5.0</version>
I also found that TaskID and TaskType are both in the hadoop-core jar, but not in the same package. Why can mapred.TaskID refer to mapreduce.TaskType?
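A quick way to confirm which TaskID actually wins on the runtime classpath is to list its constructors via reflection. This is only a diagnostic sketch (the class name is made up for illustration); if the jar being loaded is an MR1-style one, the (JobID, TaskType, int) constructor from the stack trace will not appear in the output:
import java.lang.reflect.Constructor;

public class TaskIdCheck {
    public static void main(String[] args) {
        // Print every constructor the TaskID class on the classpath exposes.
        // If the (JobID, TaskType, int) variant is missing, the loaded jar
        // is not the one Spark 2.0.1 was compiled against.
        for (Constructor<?> c : org.apache.hadoop.mapred.TaskID.class.getConstructors()) {
            System.out.println(c);
        }
    }
}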

Oh, I have resolved this problem by adding the Maven dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.5.0</version>
</dependency>
The error disappeared!

I have also faced this issue. It is basically due to a jar problem.
Add the spark-core_2.10 jar from Maven:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.2</version>
</dependency>
After changing the jar file, the error went away.

Related

S3 access point as Spark Eventlog Directory

We have a standalone Spring Boot based Spark application where, at the moment, the property spark.eventLog.dir is set to an S3 location:
SparkConf sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("MyApp")
.set("spark.hadoop.fs.permissions.umask-mode", "000")
.set("hive.warehouse.subdir.inherit.perms", "false")
.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.set("spark.speculation", "false")
.set("spark.eventLog.enabled", "true")
.set("spark.extraListeners", "com.ClassName");
sparkConf.set("spark.eventLog.dir", "s3a://my-bucket-name/eventlog");
This has been working as expected; however, the bucket access has now changed to an access point, so the URL has to be arn:aws:s3:<bucket-region>:<accountNumber>:accesspoint:<access-point-name>, e.g.:
sparkConf.set("spark.eventLog.dir", "s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
After this change we get the following stack trace while booting up the app:
java.lang.NullPointerException: null uri host.
at java.base/java.util.Objects.requireNonNull(Objects.java:246)
at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:71)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:522)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
Looking at the S3xLoginHelper class, it seems to fail when creating a java.net.URI object because of the : characters in the URL string.
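That matches how java.net.URI parses an authority: with the extra : characters the authority cannot be read as host:port, so getHost() returns null, which is exactly the null that buildFSURI rejects. A minimal sketch to reproduce it (the ARN is the placeholder one from above; the class name is made up):
import java.net.URI;

public class AccessPointUriCheck {
    public static void main(String[] args) {
        // The ARN-style authority cannot be parsed as host[:port], so URI
        // falls back to a registry-based authority and getHost() is null --
        // the "null uri host" that S3xLoginHelper complains about.
        URI uri = URI.create(
                "s3a://arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point/eventlog");
        System.out.println("host      = " + uri.getHost());      // null
        System.out.println("authority = " + uri.getAuthority()); // the full ARN
    }
}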
I have the following relevant Maven dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.2.0</version>
</dependency>
Update:
I also tried adding the following to core-site.xml (and to hdfs-site.xml), as mentioned in the hadoop-aws documentation:
<property>
<name>fs.s3a.bucket.my-access-point.accesspoint.arn</name>
<value>arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point</value>
<description>Configure S3a traffic to use this AccessPoint</description>
</property>
And I updated the code to sparkConf.set("spark.eventLog.dir", "s3a://my-access-point/eventlog");
This gives a stack trace with java.io.FileNotFoundException: Bucket my-access-point does not exist, which indicates that it is not using those updated properties for spark.eventLog.dir and is treating my-access-point as a bucket name!
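One thing worth ruling out is whether the property reaches the Hadoop Configuration that Spark's event-log writer actually uses. Spark copies every spark.hadoop.* entry on the SparkConf into that Configuration (the same mechanism as the spark.hadoop.fs.permissions.umask-mode setting above), so the access-point mapping can also be set in code. This is only a sketch using the placeholder names from the question, and it can only help if the hadoop-aws build on the classpath understands access point ARNs at all:
SparkConf sparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("MyApp")
        // spark.hadoop.* entries are copied into the Hadoop Configuration,
        // so this is equivalent to the core-site.xml property above.
        .set("spark.hadoop.fs.s3a.bucket.my-access-point.accesspoint.arn",
             "arn:aws:s3:eu-west-2:1234567890:accesspoint:my-access-point")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "s3a://my-access-point/eventlog");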

Databricks local test fail with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws the following exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism(Ljava/lang/String;)V
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:84)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:315)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2747)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2747)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
And from the source code, it comes from this initialization:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("Helper Functions Unit Testing") \
    .getOrCreate()
I did search for the above error, and most of the results are about Maven configuration, adding a dependency on hadoop-auth. However, for PySpark I don't know how to deal with it. Does anyone have experience or insight into this error?
My workaround here was to use Python 3.7 and change the PySpark version to 3.0, and then it seems OK. So it is related to an inconsistent environment and dependencies.
This is limited to my case; from my search on the web, most answers are about adding a hadoop-auth.jar dependency via Maven for the Hadoop configuration.
I encountered this error in a Maven project written in Scala, not Python. What did it for me was adding not only the hadoop-auth dependency, as the OP mentioned, but also the hadoop-common dependency in my pom file, like so:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>3.1.2</version>
</dependency>
Replace 3.1.2 with whatever version you're using. However, I also found that I had to find other dependencies that conflicted with hadoop-common and hadoop-auth and add exclusions to them, like so:
<exclusions>
<exclusion>
<artifactId>hadoop-common</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
<exclusion>
<artifactId>hadoop-auth</artifactId>
<groupId>org.apache.hadoop</groupId>
</exclusion>
</exclusions>

Edit YARN's classpath in Oozie

I am trying to run a Hadoop job through Oozie. The job uploads data to DynamoDB in AWS, for which I use AmazonDynamoDBClient. I get the following exception in the reducers:
2016-06-14 10:30:52,997 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
at com.fasterxml.jackson.databind.ObjectMapper.<init>(ObjectMapper.java:458)
at com.fasterxml.jackson.databind.ObjectMapper.<init>(ObjectMapper.java:379)
at com.amazonaws.util.json.Jackson.<clinit>(Jackson.java:32)
at com.amazonaws.internal.config.InternalConfig.loadfrom(InternalConfig.java:233)
at com.amazonaws.internal.config.InternalConfig.load(InternalConfig.java:251)
at com.amazonaws.internal.config.InternalConfig$Factory.<clinit>(InternalConfig.java:308)
at com.amazonaws.util.VersionInfoUtils.userAgent(VersionInfoUtils.java:139)
at com.amazonaws.util.VersionInfoUtils.initializeUserAgent(VersionInfoUtils.java:134)
at com.amazonaws.util.VersionInfoUtils.getUserAgent(VersionInfoUtils.java:95)
at com.amazonaws.ClientConfiguration.<clinit>(ClientConfiguration.java:42)
at com.amazonaws.PredefinedClientConfigurations.dynamoDefault(PredefinedClientConfigurations.java:38)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.<init>(AmazonDynamoDBClient.java:292)
at com.mypackage.UploadDataToDynamoDBMR$DataUploaderReducer.setup(UploadDataToDynamoDBMR.java:396)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I used a fat jar that packages all dependencies and copied it to Oozie's lib directory.
I have also used dependency management in the pom to pin the FasterXML Jackson dependency to 2.4.1 (which is the version used by the AWS DynamoDB SDK). However, when execution happens in the reducers, somehow some other version of Jackson appears first on the classpath (or so I believe).
I also excluded the Jackson dependency from the DynamoDB and AWS SDKs:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-dynamodb</artifactId>
<version>1.10.11</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-core</artifactId>
<version>1.10.11</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
How can I make sure that my jar is the first one on the classpath in mappers and reducers? I tried the suggestion on this page and added the following property to the job's configuration xml:
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
But this did not help.
Any suggestions?
Have you copied your jar into the lib folder next to the workflow.xml, or into the sharelib?
Check which version of Jackson your Hadoop distribution is using and try to use that version everywhere. It is also worth checking that no other Jackson jars are on the classpath.
From the exception it looks like Hadoop tries to call the method:
com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering
This method was introduced in Jackson 2.3, so an even older version of Jackson is probably in there somewhere.
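A quick way to see which Jackson actually wins inside the task JVM is to print where JsonFactory was loaded from and what version it reports. This is only a diagnostic sketch (the class name is made up; the same two print statements could just as well be dropped into the reducer's setup()):
import com.fasterxml.jackson.core.JsonFactory;

public class JacksonOriginCheck {
    public static void main(String[] args) {
        // The jar that JsonFactory was loaded from -- this shows whether the
        // pinned 2.4.1 from the fat jar or an older cluster-provided Jackson
        // is first on the classpath.
        System.out.println(JsonFactory.class.getProtectionDomain()
                .getCodeSource().getLocation());
        // The Jackson version that jar reports about itself.
        System.out.println(new JsonFactory().version());
    }
}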

java.lang.ClassNotFoundException: Class org.apache.hadoop.hdfs.DistributedFileSystem not found

I am trying to use the Hadoop HDFS Java API to list all files in HDFS.
I am able to list the files on the remote HDFS by running the code in my local Eclipse.
But I get the exception
java.lang.ClassNotFoundException: Class org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2290)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2303)
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:163)
when I execute the code from a web server.
I have added the Maven dependencies below.
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.0.0-cdh4.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.0.0-cdh4.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.0.0-cdh4.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>2.0.0-mr1-cdh4.5.0</version>
</dependency>
I have also embedded the required jars into the exported jar, and Maven has added them to the build path.
If anyone has encountered this issue before, please share the solution.
I am facing a similar issue with the Apache Hadoop 2.2.0 release. My workaround was to run it as a separate process:
final Process p = Runtime.getRuntime().exec("java -jar {jarfile} {classfile}");
final Scanner output = new Scanner(p.getErrorStream());
while (output.hasNext()) {
    try {
        System.err.println(output.nextLine());
    } catch (final Exception e) {
    }
}
The jar file contains the implementation built against the Apache Hadoop 2.2.0 jars.
I am still searching for an exact solution, though.
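One check that helps narrow it down is whether the hdfs scheme can be resolved at all in the environment where it fails (the web server), since that is exactly the lookup in the stack trace above. A minimal sketch (the class name is just for illustration); if the Class.forName call throws, the hadoop-hdfs jar is simply missing from that runtime classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClasspathCheck {
    public static void main(String[] args) throws Exception {
        // Is the hadoop-hdfs jar on the runtime classpath at all?
        Class.forName("org.apache.hadoop.hdfs.DistributedFileSystem");
        System.out.println("hadoop-hdfs is on the classpath");
        // Which implementation does Hadoop resolve for the hdfs:// scheme?
        // This is the same lookup that fails in getFileSystemClass above.
        System.out.println(FileSystem.getFileSystemClass("hdfs", new Configuration()));
    }
}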
For me, hadoop-hdfs-2.6.0.jar was missing from the Zeppelin server's lib dir. I copied it into the Zeppelin lib folder and my problem was resolved. :)
Also add a dependency for hadoop-hdfs-2.6.0.jar in pom.xml.

Maven fails to download CoreNLP models

When building the sample application from the Stanford CoreNLP website, I ran into a curious exception:
Exception in thread "main" java.lang.RuntimeException: edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
at edu.stanford.nlp.pipeline.StanfordCoreNLP$4.create(StanfordCoreNLP.java:493)
…
Caused by: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
…
This only happened when the pos annotator and the ones after it were included in the annotators property.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Here is the dependency from my pom.xml:
<dependencies>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
<scope>compile</scope>
</dependency>
</dependencies>
I actually found the answer to this in the problem description of another question on Stack Overflow.
Quoting W.P. McNeill:
Maven does not download the model files automatically, but only if you add a models line to the .pom. Here is a .pom snippet that fetches both the code and the models.
Here's what my dependencies look like now:
<dependencies>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.2.0</version>
<classifier>models</classifier>
</dependency>
</dependencies>
The important part to note is the <classifier>models</classifier> entry at the bottom. In order for Eclipse to maintain both references, you'll need to configure a dependency for each of stanford-corenlp-3.2.0 and stanford-corenlp-3.2.0-models.
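If the tagger error persists even with both dependencies, it is worth checking whether the models jar is actually visible at runtime. A small sketch (the class name is made up; the resource path is copied verbatim from the exception above):
public class ModelsOnClasspathCheck {
    public static void main(String[] args) {
        // Resource path taken from the "Unable to resolve ..." message above.
        String model = "edu/stanford/nlp/models/pos-tagger/english-left3words/"
                + "english-left3words-distsim.tagger";
        boolean found = ModelsOnClasspathCheck.class.getClassLoader()
                .getResource(model) != null;
        System.out.println(found
                ? "models jar is on the classpath"
                : "models jar is missing -- check the <classifier>models</classifier> dependency");
    }
}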
In case you need to use the models for other languages (like Chinese, Spanish, or Arabic) you can add the following piece to your pom.xml file (replace models-chinese with models-spanish or models-arabic for these two languages, respectively):
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.8.0</version>
<classifier>models-chinese</classifier>
</dependency>
With Gradle apparently you can use:
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2'
implementation 'edu.stanford.nlp:stanford-corenlp:3.9.2:models'
or, if you use compile (deprecated):
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2'
compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.9.2', classifier: 'models'
