Submitting a remote Spark job from the Eclipse IDE, getting a paranamer error - hadoop

I'm submitting a remote Spark job to the cluster in yarn-client mode from the Eclipse IDE,
but I get the error message below in Eclipse.
java.lang.NoClassDefFoundError: com/thoughtworks/paranamer/BytecodeReadingParanamer
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.<init>(BeanIntrospector.scala:40)
com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.<clinit>(BeanIntrospector.scala)
com.fasterxml.jackson.module.scala.introspect.ScalaPropertiesCollector.<init>(ScalaPropertiesCollector.scala:22)
com.fasterxml.jackson.module.scala.introspect.ScalaClassIntrospector$.constructPropertyCollector(ScalaClassIntrospector.scala:24)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.collectProperties(BasicClassIntrospector.java:142)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:68)
com.fasterxml.jackson.databind.introspect.BasicClassIntrospector.forSerialization(BasicClassIntrospector.java:11)
com.fasterxml.jackson.databind.SerializationConfig.introspect(SerializationConfig.java:490)
com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:133)
com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:873)
com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:833)
com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:387)
com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:478)
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:97)
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2718)
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2210)
org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:51)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:144)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
org.apache.spark.api.java.JavaSparkContext.textFile(JavaSparkContext.scala:181)
com.nuevora.core.spark.commons.CommonFunctions.getJavaRDDFromFile(CommonFunctions.java:61)
com.nuevora.core.spark.UpdateDataset.modifyInputDataset(UpdateDataset.java:103)
com.nuevora.controllers.FormsValidatorServlet.service(FormsValidatorServlet.java:3070)
javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
The executor logs are not helpful, and the final status of the job is "succeeded".
I need help solving this error.

I solved this issue by adding the paranamer jar that ships with the Cloudera distribution.
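
As a rough illustration (not from the original answer), the sketch below shows how one might verify and work around the missing class from the driver side. The Class.forName call simply confirms that the paranamer class is visible to the IDE's JVM; the setJars path is a hypothetical example of where a distribution-provided paranamer jar might live and must be adjusted for your cluster.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParanamerCheck {
    public static void main(String[] args) throws Exception {
        // Fails fast with ClassNotFoundException if the paranamer jar is not
        // on the driver (IDE) classpath -- the same class Jackson's Scala
        // module tries to load in the stack trace above.
        Class.forName("com.thoughtworks.paranamer.BytecodeReadingParanamer");

        // Hypothetical jar location: also ship the jar to the executors in
        // case they need it at runtime. Requires the same yarn-client setup
        // as in the question.
        SparkConf conf = new SparkConf()
                .setAppName("paranamer-check")
                .setMaster("yarn-client")
                .setJars(new String[] {"/opt/cloudera/parcels/CDH/jars/paranamer-2.3.jar"});

        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}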

Related

How to run the Sparkling Water example with Spark in local mode

I am trying to run the Sparkling Water deep learning demo in IntelliJ IDEA.
The code link is:
https://github.com/h2oai/sparkling-water/blob/RELEASE-2.0.3/examples/src/main/scala/org/apache/spark/examples/h2o/DeepLearningDemo.scala
It fails to start; the exception is:
17/01/06 11:18:41 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1,1483672721446,JobFailed(org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down))
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:187)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:177)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
at java.lang.Thread.getStackTrace(Thread.java:1108)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:911)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.startH2O(InternalBackendUtils.scala:163)
at org.apache.spark.h2o.backends.internal.InternalBackendUtils$.startH2O(InternalBackendUtils.scala:262)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:99)
at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:102)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:279)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:301)
at com.xyz.HelloSparklingWater$.main(HelloSparklingWater.scala:35)
at com.xyz.HelloSparklingWater.main(HelloSparklingWater.scala)
It looks like the exception is thrown while constructing the H2OContext and InternalH2OBackend.
How can I run this example in Spark local mode, i.e. from within the IDE?

How to trigger a Spark job from Java code with SparkSubmit.scala

I saw that Oozie uses:
List<String> sparkArgs = new ArrayList<String>();
sparkArgs.add("--master");
sparkArgs.add("yarn-cluster");
sparkArgs.add("--class");
sparkArgs.add("com.sample.spark.HelloSpark");
...
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on the cluster, I always got:
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. But how do I make SparkSubmit aware of that setting from Java code?
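
One option worth noting here (a hedged sketch, not from the original thread): instead of calling SparkSubmit.main in-process, the SparkLauncher API (available since Spark 1.4) runs spark-submit as a child process with whatever environment you give it, which is an easy way to pass HADOOP_CONF_DIR. The paths and class names below are placeholders.

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitFromJava {
    public static void main(String[] args) throws Exception {
        // Environment for the spark-submit child process; adjust the path
        // to match your cluster installation.
        Map<String, String> env = new HashMap<>();
        env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf");

        Process spark = new SparkLauncher(env)
                .setAppResource("/path/to/your-app.jar")      // placeholder jar
                .setMainClass("com.sample.spark.HelloSpark")
                .setMaster("yarn-cluster")                    // mirrors the args above
                .launch();

        // Wait for spark-submit to finish and report its exit code.
        int exitCode = spark.waitFor();
        System.out.println("spark-submit exited with code " + exitCode);
    }
}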

Internal error while connecting to Hadoop DFS

I built an Eclipse plugin for hadoop-2.3.0. The Bundle-ClassPath is:
Bundle-classpath: classes/,
lib/hadoop-mapreduce-client-core-${hadoop.version}.jar,
lib/hadoop-mapreduce-client-common-${hadoop.version}.jar,
lib/hadoop-mapreduce-client-jobclient-${hadoop.version}.jar,
lib/hadoop-auth-${hadoop.version}.jar,
lib/hadoop-common-${hadoop.version}.jar,
lib/hadoop-hdfs-${hadoop.version}.jar,
lib/protobuf-java-${protobuf.version}.jar,
lib/log4j-${log4j.version}.jar,
lib/commons-cli-1.2.jar,
lib/commons-configuration-1.6.jar,
lib/commons-httpclient-3.1.jar,
lib/commons-lang-2.5.jar,
lib/commons-collections-${commons-collections.version}.jar,
lib/jackson-core-asl-1.8.8.jar,
lib/jackson-mapper-asl-1.8.8.jar,
lib/slf4j-log4j12-1.7.5.jar,
lib/slf4j-api-1.7.5.jar,
lib/guava-${guava.version}.jar,
lib/netty-${netty.version}.jar
I added the built jar file hadoop-eclipse-plugin.jar to eclipse\plugins. I am using the Eclipse Kepler SR2 package. On trying to create an HDFS location and connect, it generates an internal error:
An internal error occurred during: "Map/Reduce location status updater". org/apache/commons/lang/StringUtils
What might have caused this error, and how can it be resolved? Any help is appreciated.

Unresponsive Spark Master

I'm trying to run a simple Spark app in standalone mode on a Mac.
I managed to run ./sbin/start-master.sh to start the master server and a worker.
./bin/spark-shell --master spark://MacBook-Pro.local:7077 also works, and I can see it in the running application list in the Master web UI.
Now I'm trying to write a simple Spark app.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://MacBook-Pro.local:7077")
    val sc = new SparkContext(conf)
    val logFile = "README.md"
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Running this simple app gives me an error message that the master is unresponsive:
15/02/15 09:47:47 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:47:48 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:07 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:48:07 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:27 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:48:27 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:47 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/02/15 09:48:47 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
15/02/15 09:48:47 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
Any idea what the problem is?
Thanks
You can either set the master when calling spark-submit, or (as you've done here) by explicitly setting it via the SparkConf. Try following the example in the Spark Configuration docs, and setting the master as follows:
val conf = new SparkConf().setMaster("local[2]")
From the same page (explaining the number in brackets that follows local): "Note that we run with local[2], meaning two threads - which represents “minimal” parallelism, which can help detect bugs that only exist when we run in a distributed context."
I got the same issue and finally solved it. In my case, I had written my source code against Scala 2.11, but I built Spark with Maven using the default command:
build/mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
This build sets the Scala version to 2.10. Because of the Scala version mismatch between the Spark client and the master, serialization becomes incompatible when the client sends messages to the master via the remote actor, and the "All masters are unresponsive" error ends up in the console.
My solution:
1. Rebuild Spark for Scala 2.11 (and make sure your programming environment uses Scala 2.11). Run the following commands in SPARK_HOME:
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
After building, the package will be located in SPARK_HOME/assembly/target/scala-2.11. If you start your Spark server with start-all.sh at this point, it will report that the target package can't be found.
2. Go to the conf folder and edit spark-env.sh, appending the line below:
export SPARK_SCALA_VERSION="2.11"
3. Run start-all.sh, set the correct master URL in your program, and run it. Done!
Note: the error message in the console is not enough, so you need to turn logging on to see what is happening. Go to the conf folder and copy log4j.properties.template to log4j.properties. After the Spark master starts, the log files will be saved in the SPARK_HOME/logs folder.
I write my code in Java, but I had the same problem. My Scala version was 2.10 while my dependencies were built for 2.11, so I changed spark-core_2.11 and spark-sql_2.11 to spark-core_2.10 and spark-sql_2.10 in pom.xml. Maybe you can solve your issue in a similar way.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>${spark.version}</version>
</dependency>

My Pig UDF runs in local mode but fails with "Deserialization error: could not instantiate" on my cluster

I have a Pig UDF that runs perfectly in local mode but fails with could not instantiate 'com.bla.myFunc' with arguments 'null' when I run it on the cluster.
My mistake was not digging hard enough in the task logs.
When you dig there through the JobTracker UI, you can see that the root cause was:
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
So, besides the usual:
pigServer.registerFunction("myFunc", new FuncSpec("com.bla.myFunc"));
we should add:
registerJar(pigServer, Maps.class);
and so on for any jar used by the UDF.
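
The registerJar helper used above is not spelled out in the post; a minimal sketch of what it might look like, assuming it resolves the jar containing the given class from its code source and hands it to PigServer.registerJar, is:

import org.apache.pig.PigServer;

public final class PigJars {
    // Hypothetical helper in the spirit of the registerJar(...) call above:
    // find the jar that contains the given class and register it with Pig so
    // it gets shipped to the cluster alongside the UDF.
    public static void registerJar(PigServer pigServer, Class<?> clazz) throws Exception {
        String jarPath = clazz.getProtectionDomain()
                .getCodeSource()      // may be null for JDK classes; fine for library jars
                .getLocation()
                .toURI()
                .getPath();
        pigServer.registerJar(jarPath);
    }
}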
Another option is to use build-jar-with-dependencies, but then you have to put pig.jar before yours on the classpath, or else you'll run into this one: embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?
